Liquid Templated Data in Jekyll

TL;DR: You can include one data file in another by adding this file to your _plugins directory: read_data_file_with_liquid.rb

I recently designed the website for one of my labs. After a lot of back and forth, I decided to build the site in Jekyll. Jekyll is a framework for building static websites. It has many of the same advantages as dynamic site frameworks like Ruby on Rails, except that Jekyll doesn’t access a database or construct pages every time a new user views the site. Instead, Jekyll generates a bunch of HTML files once, and those can be served unchanged from your server until you decide to change the content. I like Jekyll because it allows a significant amount of customization while remaining relatively simple to create and deploy. As with most technology that’s easy to pick up, it becomes increasingly difficult to use as the task gets more complicated. This is a story about how feature creep painted me into a corner, and a description of the hack I used to fix it. Hopefully this can serve as a warning to others, and potentially give them an out if they need it.

One view, one data store

The initial design of the site included a few prose-only pages and a “people” page to list all the members of the lab. Jekyll has the ability to build pages backed by data stored in various formats like YAML, JSON, or CSV. This data feature was an obvious way to factor out a lot of the repetition that would normally be required in a page that has dozens of identically formatted entries. The way it works is that you create a YAML (or JSON or CSV) file with some information. Let’s call it people.yml:

people:
  - name: Dan
    spirit_animal: Otter
  - name: Lois
    spirit_animal: Penguin

And then you can build an HTML (or Markdown) page with templating to automatically populate the page with the data from your YAML file:

{% for person in site.data.people.people %}
  <div class="person" id="">
    <div class="name">{{ person.name }}</div>
    <div class="spirit_animal">{{ person.spirit_animal }}</div>
  </div>
{% endfor %}

This will make an HTML page with two person divs, each populated with the data for one of the two people in people.yml. So far so good. We’ve got a nice webpage with no duplication of logic and a good separation of concerns.
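
To make the data-to-markup step concrete, here is a minimal sketch in plain Ruby. It is not how Jekyll or Liquid are implemented; it only mimics the loop above with string interpolation, and `render_people` is a hypothetical helper name.

```ruby
require "yaml"

# Same content as people.yml above, inlined so the sketch is self-contained.
PEOPLE_YAML = <<~YAML
  people:
    - name: Dan
      spirit_animal: Otter
    - name: Lois
      spirit_animal: Penguin
YAML

# Hypothetical helper: emit one "person" div per entry, like the for loop above.
def render_people(yaml_text)
  people = YAML.safe_load(yaml_text).fetch("people")
  people.map do |person|
    <<~HTML
      <div class="person">
        <div class="name">#{person["name"]}</div>
        <div class="spirit_animal">#{person["spirit_animal"]}</div>
      </div>
    HTML
  end.join
end

puts render_people(PEOPLE_YAML)
```

The template never mentions Dan or Lois by name, so adding a person only means adding an entry to the data.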

Two views, two data stores

As time went on, the demands of the project expanded and we needed to add a projects page to the site. No problem, just add a second data file and second view, projects.yml and projects.html respectively. Here’s what projects.yml might look like:

projects:
  - name: Mercator
    features: Cylindricality
  - name: Armadillo
    features: Compromise

So far everything was still fine, but it was about to get wacky. The next requirement was that every project link to every person who works on it, and every person link back to every project they work on. The simplest way to do this is by entering the data by hand:

projects:
  - name: Mercator
    features: Cylindricality
    people:
      - name: Dan               #
        spirit_animal: Otter    #
  - name: Armadillo
    features: Compromise
    people:
      - name: Dan               #
        spirit_animal: Otter    #
      - name: Lois              #
        spirit_animal: Penguin  #

At this point problems were starting to emerge. Now each person’s information needs to be entered in multiple places: once in people.yml, and again in projects.yml for every project that person works on. This is a problem because every time a person’s information changes, we need to update that data in arbitrarily many places. To help mitigate this problem, YAML provides anchors and references (notated with & and * respectively). These features allow you to define data in one place and reuse it in other places without having to copy-paste all of the data.

people:
  - &dan                        #
    name: Dan
    spirit_animal: Otter
  - &lois                       #
    name: Lois
    spirit_animal: Penguin

projects:
  - name: Mercator
    features: Cylindricality
    people:
      - *dan                    #
  - name: Armadillo
    features: Compromise
    people:
      - *dan                    #
      - *lois                   #
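
As a quick check of what a reference resolves to, here is a small sketch using Ruby's standard YAML library (Psych), the parser mentioned later in this post. The `load_with_aliases` helper is a hypothetical name. It assumes a reasonably recent Ruby in which `safe_load` accepts the `aliases:` keyword.

```ruby
require "yaml"

# A trimmed-down version of the file above.
ANCHORED_YAML = <<~YAML
  people:
    - &dan
      name: Dan
      spirit_animal: Otter
  projects:
    - name: Mercator
      people:
        - *dan
YAML

# safe_load rejects aliases unless they are explicitly allowed.
def load_with_aliases(text)
  YAML.safe_load(text, aliases: true)
end

data = load_with_aliases(ANCHORED_YAML)
dan_in_people  = data["people"][0]
dan_in_project = data["projects"][0]["people"][0]

puts dan_in_project["spirit_animal"]
puts dan_in_people.equal?(dan_in_project)
```

The reference is not a copy: both paths lead to the same hash, so there is only one definition of Dan to keep up to date.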

One of the limitations of anchors and references, however, is that they only work within a single file. Up until this point the duplication was reduced to just having to define each type of information in each file, which for only two files is acceptable.
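
The single-file limitation is easy to reproduce. In this sketch the two constants stand in for two separate data files, and `alias_error` is a hypothetical helper that returns the parser's error, if any.

```ruby
require "yaml"

# Stand-in for people.yml: defines the anchor.
PEOPLE_FILE = <<~YAML
  people:
    - &dan
      name: Dan
YAML

# Stand-in for projects.yml: uses the anchor but never defines it.
PROJECTS_FILE = <<~YAML
  projects:
    - name: Mercator
      people:
        - *dan
YAML

# Returns nil when the text parses, or the error raised for an unknown alias.
def alias_error(text)
  YAML.safe_load(text, aliases: true)
  nil
rescue Psych::BadAlias => e
  e
end

puts alias_error(PROJECTS_FILE).class
puts alias_error(PEOPLE_FILE + PROJECTS_FILE).inspect
```

Parsed on its own, the projects file fails because `dan` is unknown. Concatenated after the people file, the same text parses cleanly.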

But of course, requirements continued to grow. We suddenly needed two new pages, one for publications and one for press. And there needed to be links between both of them and the project and people pages. Under this scheme, most of the data needed to be copy-pasted between 4 different files, for a total of 14 data structures which all need to be consistent with each other.

Uber-YAML

Because anchors are only visible in the file in which they are defined, the only way to still use anchors and references across all 4 files was to merge them into a single uber-yml file that contained all the data. There are several drawbacks to this method. Firstly, the single file becomes very large. In our case the uber-yml was well over 2000 lines, all of which need to be ordered and indented correctly. Secondly, anchors are only visible to references below them, which means that data can’t be separated into groups. Ideally the top of the file would have definitions for all the people, followed by a section that defined all the groups. Instead, if the first group describes each of the projects, then each person must be defined immediately inside the first project they’re a part of.

projects:
  - &mercator                         #
    name: Mercator
    features: Cylindricality
    people:
    - &dan                            #
      name: Dan
      spirit_animal: Otter
      projects:
        - *mercator
        - &armadillo                  #
          name: Armadillo
          features: Compromise
          people:
            - *dan
            - &lois                   #
              name: Lois
              spirit_animal: Penguin
              projects:
                - *armadillo
  - *armadillo                        #

people:
  - *dan
  - *lois
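
The ordering constraint can be checked with Ruby's standard YAML library. This is a sketch under the assumption that Psych behaves as described in its documentation; `try_load` is a hypothetical helper. The first document nests definitions the way the uber-yml does, and the second puts a reference above its anchor.

```ruby
require "yaml"

# Anchors first, references later: every reference points back at something
# the parser has already seen, including the project that is still being read.
NESTED = <<~YAML
  projects:
    - &mercator
      name: Mercator
      people:
        - &dan
          name: Dan
          projects:
            - *mercator
  people:
    - *dan
YAML

# The same people, but the reference comes before the anchor.
FORWARD = <<~YAML
  people:
    - *dan
  projects:
    - name: Mercator
      people:
        - &dan
          name: Dan
YAML

# Returns the parsed data, or :unknown_alias if a reference has no anchor yet.
def try_load(text)
  YAML.safe_load(text, aliases: true)
rescue Psych::BadAlias
  :unknown_alias
end

nested = try_load(NESTED)
puts nested["people"][0]["projects"][0]["name"]
puts try_load(FORWARD).inspect
```

The nested layout parses and links Dan back to Mercator, while the tidier grouped layout is rejected, which is why the uber-yml ends up so deeply indented.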

Solution

So, the ideal solution would have the following properties:

DRY - (don’t repeat yourself) each model should be defined only once.
Grouped - all similar data should be defined in the same format next to each other.
Separated - different data should be defined in different places.

Each of the Jekyll data modelling methods above violates at least one of these ideals.

The way most full blown programming languages solve this problem is with an inclusion system. You have one file that contains common definitions that can be included in every file that needs it. The more complicated inclusion systems (for example in Clojure) set up modules and namespaces so that complicated dependency graphs can be navigated efficiently. A simpler inclusion system (like the C preprocessor) simply injects the text of one file into another. But it turns out that Jekyll already has a system like this built in, inside Liquid. Among many other features, Liquid has the

{% include %}

directive, which is a textual injection like C's. The simplest way to solve my multiple-cross-linked-data-file problem, while respecting each of the requirements above, was to tell Jekyll to treat my data files like Liquid.
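
To illustrate why textual injection is enough, here is a minimal sketch in plain Ruby. It does not use Liquid at all: the `INCLUDE_TAG` regular expression, the `expand_includes` helper, and the in-memory `DATA_FILES` hash are all hypothetical stand-ins. The point is only that once the text of one file is pasted into another, the anchors become visible to the YAML parser.

```ruby
require "yaml"

# Hypothetical in-memory stand-in for the _data directory.
DATA_FILES = {
  "people.yml" => <<~YAML,
    people:
      - &dan
        name: Dan
        spirit_animal: Otter
  YAML
  "projects.yml" => <<~YAML
    {% include_relative people.yml %}
    projects:
      - name: Mercator
        people:
          - *dan
  YAML
}.freeze

# Matches a simplified include tag and captures the file name.
INCLUDE_TAG = /\{%\s*include_relative\s+(\S+)\s*%\}/

# Replace every include tag with the (recursively expanded) text of its target.
def expand_includes(name, files)
  files.fetch(name).gsub(INCLUDE_TAG) do
    expand_includes(Regexp.last_match(1), files)
  end
end

merged = expand_includes("projects.yml", DATA_FILES)
data = YAML.safe_load(merged, aliases: true)
puts data["projects"][0]["people"][0]["name"]
```

The files stay separate on disk, and the parser only ever sees the merged text.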

_plugins/read_data_file_with_liquid.rb

# Treat every _data file as Liquid.
# This allows us to include YAML files in other YAML files.
require 'tmpdir' # Dir.mktmpdir lives in the tmpdir standard library
module Jekyll
  # Monkey patch Jekyll::DataReader::read_data_file with our own implementation
  class DataReader
    def read_data_file_with_liquid(path)
      begin
        dir = File.dirname(path)
        filename = File.basename(path)

        # If there are multiple sites assume we're
        # the most recent since we're just starting up
        site = Jekyll.sites.last

        content = File.read(site.in_source_dir(dir, filename))
        template = Liquid::Template.parse(content)

        context = Liquid::Context.new({}, {}, { :site => site })
        rendered = template.render(context)

        # Write the post-liquid-rendered file to a temporary file.
        # read_data_file parses the name of the file to use as its
        # variable name in site.data so it's important to make the
        # temp file name match the original file name.
        Dir.mktmpdir do |tmp_dir|
          tmp_path = File.join(tmp_dir, filename)
          File.write(tmp_path, rendered)
          read_data_file_without_liquid(tmp_path)
        end
      rescue => e
        Jekyll.logger.warn(
          "[SSL-specific] Error parsing data files " +
          "for Liquid content at file #{path}: #{e.message}")
      end
    end

    # Make our function overwrite the existing read_data_file function
    # but keep the ability to still call back to the original
    alias_method :read_data_file_without_liquid, :read_data_file
    alias_method :read_data_file, :read_data_file_with_liquid
  end
end
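
The two alias_method lines at the bottom are the part that is easiest to get wrong, so here is the same wrapping pattern on a toy class. `ToyReader` and its placeholder substitution are hypothetical and have nothing to do with Jekyll's real DataReader. They only show the order of operations.

```ruby
require "yaml"

# Hypothetical stand-in for a reader class we do not control.
class ToyReader
  def read(text)
    YAML.safe_load(text)
  end
end

# Reopen the class and wrap `read` with a preprocessing step.
class ToyReader
  def read_with_preprocessing(text)
    # Toy preprocessing: substitute a placeholder before parsing.
    read_without_preprocessing(text.gsub("LAB_NAME", "Example Lab"))
  end

  # Order matters: save the original under a new name first,
  # then point the public name at the wrapper.
  alias_method :read_without_preprocessing, :read
  alias_method :read, :read_with_preprocessing
end

p ToyReader.new.read("name: LAB_NAME")
```

Callers keep calling `read` and get the wrapped behavior, while the original stays reachable under its new name. Swapping the two alias_method lines would make the wrapper call itself.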

This allows us to directly include one YAML file into another. For example:

_data/projects.yml

{% include_relative _data/people.yml %}

projects:
  - name: Mercator
    features: Cylindricality
    people: 
      - *dan                    #
  - name: Armadillo
    features: Compromise
    people: 
      - *dan                    #
      - *lois                   #

But when we go to reference projects from people, the whole system breaks down. Liquid simply retrieves the text of one file and inserts it into the other, and it treats both files as Liquid, so the two files include each other in an infinite loop.

_data/people.yml

{% include_relative _data/projects.yml %}

people:
  - &dan                    #
    name: Dan
    spirit_animal: Otter
    projects:
      - *mercator           #
      - *armadillo          #
  - &lois                   #
    name: Lois
    spirit_animal: Penguin
    projects: 
      - *armadillo          #

The canonical solution, used in systems like C’s preprocessor, is to define a variable upon first inclusion. On subsequent inclusions, if the variable is already defined, the file isn’t included again. In C this is accomplished with an include guard built on the #ifndef directive. In Liquid, we can pass variables to included pages.

_data/people.yml

{% unless include.included %}
  {% include_relative _data/projects.yml included=true %}
{% endunless %}
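
The effect of the guard can be sketched without Liquid. In this plain-Ruby approximation, `MUTUAL_FILES`, `GUARDED_INCLUDE`, and `expand_guarded` are hypothetical names, and the `included` flag plays the role of the variable passed to the included file: once we are inside an include, further include tags expand to nothing.

```ruby
# Hypothetical stand-ins for two data files that include each other.
MUTUAL_FILES = {
  "people.yml"   => "{% include_relative projects.yml %}\npeople: []\n",
  "projects.yml" => "{% include_relative people.yml %}\nprojects: []\n"
}.freeze

# Matches a simplified include tag (and its trailing newline).
GUARDED_INCLUDE = /\{%\s*include_relative\s+(\S+)\s*%\}\n?/

# Expand include tags in the top-level file only; tags found inside an
# already-included file are dropped, which breaks the cycle.
def expand_guarded(name, files, included: false)
  files.fetch(name).gsub(GUARDED_INCLUDE) do
    target = Regexp.last_match(1)
    included ? "" : expand_guarded(target, files, included: true)
  end
end

puts expand_guarded("people.yml", MUTUAL_FILES)
```

Without the flag the two files would expand each other forever. With it, each top-level file pulls in the other exactly once.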

This solution, while technically sound, is somewhat annoying to have to place atop each of your data files. Instead it can be encapsulated inside a plugin for easy consumption.

_plugins/read_data_file_with_liquid.rb

# Only include the given file one time (in this call tree)
# Useful for files that include files that include the original file
module Jekyll
  module Tags
    class IncludeRelativeOnceTag < IncludeRelativeTag
      # Create a flag that indicates we're already 1 level
      # deep in the inclusion, and don't go any farther down
      SENTINEL = 'included_relative_once'
      def render(context)
        context.stack do
          unless context[SENTINEL]
            context[SENTINEL] = true
            super(context)
          end
        end
      end
    end
  end
end

Liquid::Template.register_tag("include_relative_once", Jekyll::Tags::IncludeRelativeOnceTag)

Then each file can include others with a single command.

{% include_relative_once _data/projects.yml %}

With all these pieces in place, each file can reference each other file. It’s achieved simply, with only one extra line of code required. There’s no reorganization of the data required. All similar data is grouped together, and kept in separate files from other types of data.

There are a few drawbacks to this method:

It requires a custom plugin, which makes your Jekyll behavior non-standard.
All of your data files first need to be parsed by the Liquid parser before being parsed by YAML. This means there’s an extra level of parsing for errors to occur in.
When errors do occur they are hard to track down. Due to the file-generation magic that the plugin performs, the line numbers reported by Psych are off. (On the other hand, even without this plugin Psych almost always returns the wrong line numbers anyway.)
Data inclusion only goes one level deep. You can reference person.project, but not person.project.person, due to the include-guard logic.

Still, I think these compromises are well worth the benefit for a large project. If this is something you’d like to experiment with, you can grab the code directly from this gist.