TL;DR: You can include one data file in another by adding this file to your _plugins directory: read_data_file_with_liquid.rb
I recently designed the website for one of my labs. After a lot of back and forth, I decided to build the site in Jekyll. Jekyll is a framework for building static websites. It has many of the same advantages as dynamic site frameworks like Ruby on Rails, except that Jekyll doesn’t hit a database or construct a page every time a new user views the site. Instead, Jekyll generates a bunch of HTML files once, and those can be served unchanged from your server until you decide to change the content.
I like Jekyll because it allows a significant amount of customization while remaining relatively simple to create and deploy. As with most technology that’s easy to pick up, though, it becomes increasingly difficult to use as the task grows more complicated.
This is a story about how feature creep painted me into a corner, and a description of the hack I used to get out. Hopefully it can serve as a warning to others, and potentially give them an out if they need it.
One view, one data store
The initial design of the site included a few prose-only pages and a “people” page to list all the members of the lab. Jekyll has the ability to build pages backed by data stored in various formats like YAML, JSON, or CSV. This data feature was an obvious way to factor out a lot of the repetition that would normally be required in a page that has dozens of identically formatted entries. The way it works is that you create a YAML (or JSON or CSV) file with some information. Let’s call it people.yml:
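As a rough illustration, a file like this might hold two entries. The names and fields below are invented for this sketch, and the YAML is wrapped in a Ruby heredoc only so the snippet can be run to confirm that it parses.

```ruby
require "yaml"

# Hypothetical contents of _data/people.yml (names and fields are invented).
PEOPLE_YAML = <<~YAML
  - name: Ada Example
    role: Graduate Student
    email: ada@example.edu
  - name: Ben Example
    role: Postdoc
    email: ben@example.edu
YAML

# Jekyll exposes a file like this to templates as site.data.people.
people = YAML.safe_load(PEOPLE_YAML)
people.each { |person| puts "#{person['name']} (#{person['role']})" }
```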
And then you can build an HTML (or Markdown) page with templating to automatically populate the page with the data from your YAML file:
This will make an HTML page with two person divs, one for each of the two people in the people.yml file. So far so good. We’ve got a nice webpage with no duplication of logic and a good separation of concerns.
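To make the generated output concrete, here is a plain-Ruby stand-in for what a template loop over the data produces. It is not Liquid and not how Jekyll renders pages internally; the markup, field names, and the render_people helper are all assumptions made for this sketch.

```ruby
require "yaml"
require "erb"

# Hypothetical people data, standing in for _data/people.yml.
PEOPLE = YAML.safe_load(<<~YAML)
  - name: Ada Example
    role: Graduate Student
  - name: Ben Example
    role: Postdoc
YAML

# Roughly what a Liquid loop such as
#   {% for person in site.data.people %} ... {{ person.name }} ... {% endfor %}
# would emit: one div per entry in the data file.
def render_people(people)
  people.map do |person|
    <<~HTML
      <div class="person">
        <h2>#{ERB::Util.html_escape(person['name'])}</h2>
        <p>#{ERB::Util.html_escape(person['role'])}</p>
      </div>
    HTML
  end.join
end

puts render_people(PEOPLE)
```

The template itself never mentions Ada or Ben, so adding a third person is a change to the data file only.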
Two views, two data stores
As time went on, the demands of the project expanded and we needed to add a projects page to the site. No problem, just add a second data file and second view, projects.yml and projects.html respectively. Here’s what projects.yml might look like:
So far everything was still fine, but it was about to get wacky. The next requirement was that every project link to every person who works on it, and every person link back to every project they work on. The simplest way to do this is by entering the data by hand:
At this point problems were starting to emerge.
Now each person’s information needs to be entered in multiple places: once in people.yml, and again in projects.yml for every project that person works on. This is a problem because every time a person’s information changes, we need to update that data in arbitrarily many places.
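A small sketch of how the copies drift apart, using made-up data: one person is typed out in both files, then only one copy is updated. The stale_copies helper is hypothetical and exists only to show the mismatch.

```ruby
require "yaml"

# Hypothetical data: Ada's record is typed out once in people.yml
# and again inside every project she works on.
PEOPLE_COPY = YAML.safe_load(<<~YAML)
  - name: Ada Example
    email: ada@example.edu
YAML

PROJECTS_COPY = YAML.safe_load(<<~YAML)
  - title: Widget Study
    people:
      - name: Ada Example
        email: ada@example.edu
YAML

# Updating one copy does nothing for the others.
PEOPLE_COPY[0]["email"] = "ada@newlab.example"

# Project members whose record no longer matches any entry in people.
def stale_copies(people, projects)
  projects.flat_map { |project| project["people"] }.reject do |member|
    people.any? { |person| person == member }
  end
end

puts stale_copies(PEOPLE_COPY, PROJECTS_COPY).inspect
```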
To help mitigate this problem, YAML provides anchors and references (notated with & and * respectively).
These features allow you to define data in one place and reuse it in other places without having to copy-paste all of the data.
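Here is a minimal sketch of an anchor and a reference inside one file, again with invented names. It is loaded with Ruby's bundled YAML library, whose safe_load needs aliases: true before it will accept references.

```ruby
require "yaml"

# Hypothetical single file holding both people and projects.
LAB_YAML = <<~YAML
  people:
    - &ada
      name: Ada Example
      email: ada@example.edu
  projects:
    - title: Widget Study
      people:
        - *ada
YAML

# safe_load rejects references unless they are explicitly enabled.
lab = YAML.safe_load(LAB_YAML, aliases: true)
puts lab["projects"][0]["people"][0]["email"]
```

Ada's record is written once, and the project entry picks up whatever the anchored definition says.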
One of the limitations of anchors and references, however, is that they only work within a single file. Up until this point the duplication was reduced to just having to define each type of information in each file, which for only two files is acceptable.
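The single-file limitation can be seen by splitting an anchor and its reference across two hypothetical files and parsing each on its own. The parses_alone? helper is an assumption of this sketch; it relies on Ruby's YAML library raising Psych::BadAlias for a reference it cannot resolve.

```ruby
require "yaml"

# Hypothetical pair of files: the anchor lives in one, the reference in the other.
PEOPLE_FILE = <<~YAML
  - &ada
    name: Ada Example
YAML

PROJECTS_FILE = <<~YAML
  - title: Widget Study
    people:
      - *ada
YAML

# True when the text parses by itself, false when it uses an unknown reference.
def parses_alone?(text)
  YAML.safe_load(text, aliases: true)
  true
rescue Psych::BadAlias
  false
end

puts parses_alone?(PEOPLE_FILE)
puts parses_alone?(PROJECTS_FILE)
```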
But of course, requirements continued to grow. We suddenly needed two new pages, one for publications and one for press. And there needed to be links between both of them and the project and people pages. Under this scheme, most of the data needed to be copy-pasted between 4 different files, for a total of 14 data structures which all need to be consistent with each other.
Uber-YAML
Because anchors are only visible in the file in which they are defined, the only way to still use anchors and references across all 4 files was to merge them into a single uber-YAML file that contained all the data. There are several drawbacks to this method. Firstly, the single file becomes very large. In our case the uber-YAML was well over 2000 lines, all of which need to be ordered and indented correctly. Secondly, anchors are only visible to references below them, which means that data can’t be separated into groups. Ideally the top of the file would have definitions for all the people, followed by a section that defined all the groups. Instead, if the first group describes each of the projects, then each person must be defined immediately inside the first project they’re a part of.
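The ordering constraint can be reproduced with two hypothetical layouts of a merged file. In the first, people come first and point at projects defined further down; in the second, the person is defined inside her first project. The try_load helper and all names are assumptions of this sketch.

```ruby
require "yaml"

# Grouped layout: people first, each listing a project defined further down.
GROUPED = <<~YAML
  people:
    - &ada
      name: Ada Example
      projects:
        - *widget
  projects:
    - &widget
      title: Widget Study
YAML

# Layout that loads: Ada is defined inside the first project she belongs to.
NESTED = <<~YAML
  projects:
    - &widget
      title: Widget Study
      people:
        - &ada
          name: Ada Example
  people:
    - *ada
YAML

# Returns the parsed data, or nil when a reference precedes its anchor.
def try_load(text)
  YAML.safe_load(text, aliases: true)
rescue Psych::BadAlias
  nil
end

puts try_load(GROUPED).inspect
puts try_load(NESTED)["people"][0]["name"]
```

Whichever group comes first cannot point at anything in a later group, which is what forces definitions to be scattered through the file.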
Solution
So, the ideal solution would have the following properties:
- DRY - (don’t repeat yourself) each model should be defined only once.
- Grouped - all similar data should be defined in the same format next to each other.
- Separated - different data should be defined in different places.
Each of the Jekyll data modelling methods above violates at least one of these ideals.
The way most full-blown programming languages solve this problem is with an inclusion system. You have one file that contains common definitions that can be included in every file that needs it. The more complicated inclusion systems (for example in Clojure) set up modules and namespaces so that complicated dependency graphs can be navigated efficiently. A simpler inclusion system (like the C preprocessor) simply injects the text of one file into another. But it turns out that Jekyll already has a system like this built in, inside Liquid. Among many other features, Liquid has the include directive, which is a textual injection like C’s. The simplest way to solve my multiple-cross-linked-data-file problem while respecting each of the requirements above was to tell Jekyll to treat my data files like Liquid.
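The effect of textual injection on YAML can be sketched without Jekyll or Liquid at all. The snippet below is not the plugin: it is a plain-Ruby stand-in in which a hypothetical expand helper pastes one file's text in place of an include tag, so that anchors from the included file become visible to the file that includes it. The file contents and the in-memory FILES table are invented.

```ruby
require "yaml"

# Hypothetical data files, keyed by name instead of being read from _data/.
FILES = {
  "people.yml" => <<~YAML,
    people:
      - &ada
        name: Ada Example
  YAML
  "projects.yml" => <<~YAML
    {% include people.yml %}
    projects:
      - title: Widget Study
        people:
          - *ada
  YAML
}.freeze

INCLUDE_TAG = /\{%\s*include\s+(\S+)\s*%\}/

# Stand-in for a textual include: paste the named file's text over the tag.
def expand(name, files)
  files.fetch(name).gsub(INCLUDE_TAG) { expand(Regexp.last_match(1), files) }
end

projects = YAML.safe_load(expand("projects.yml", FILES), aliases: true)
puts projects["projects"][0]["people"][0]["name"]
```

Because the pasting happens before YAML parsing, the parser sees a single document and the cross-file reference resolves.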
_plugins/read_data_file_with_liquid.rb
This allows us to directly include one YAML file into another. For example:
_data/projects.yml
But when we go to reference projects from people, the whole system breaks down. Since all Liquid is doing is retrieving the text of one file and pasting it into the other, while treating both as Liquid, the two files end up including each other in an infinite loop.
_data/people.yml
The canonical solution for this, used in systems like C’s preprocessor, is to define a variable upon first inclusion. Upon subsequent calls to the include function, if the variable is defined the file isn’t included. In C this is accomplished with an include guard built from the #ifndef and #define directives. In Liquid, we can pass variables to included pages.
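The guard idea can be sketched in plain Ruby, again as a stand-in and not the Liquid or plugin code. Two hypothetical files include each other, and an expand_once helper (an assumption of this sketch) tracks which files have been seen and expands a repeat to nothing, which is what stops the loop.

```ruby
require "yaml"
require "set"

# Hypothetical data files that include each other.
CYCLIC_FILES = {
  "people.yml" => <<~YAML,
    {% include projects.yml %}
    ada:
      name: Ada Example
  YAML
  "projects.yml" => <<~YAML
    {% include people.yml %}
    widget:
      title: Widget Study
  YAML
}.freeze

INCLUDE_TAG = /\{%\s*include\s+(\S+)\s*%\}/

# Textual include with a guard: a file that was already seen expands to
# nothing, much like a C header wrapped in an include guard.
def expand_once(name, files, seen = Set.new)
  return "" unless seen.add?(name)

  files.fetch(name).gsub(INCLUDE_TAG) do
    expand_once(Regexp.last_match(1), files, seen)
  end
end

people = YAML.safe_load(expand_once("people.yml", CYCLIC_FILES))
puts people.keys.inspect
```

Note that the copy of projects.yml pulled into people.yml never receives the people data, since that nested include is the one the guard suppresses.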
_data/people.yml
This solution, while technically sound, is somewhat annoying to have to place atop each of your data files. Instead it can be encapsulated inside a plugin for easy consumption.
_plugins/read_data_file_with_liquid.rb
Then each file can include others with a single command.
With all these pieces in place, each file can reference each other file. It’s achieved simply, with only one extra line of code required. There’s no reorganization of the data required. All similar data is grouped together, and kept in separate files from other types of data.
There are a couple drawbacks to this method:
- It requires a custom plugin, which makes your Jekyll behavior non-standard
- All of your data files first need to be parsed by the Liquid parser before being parsed by YAML. This means there’s an extra level of parsing for errors to occur in.
- When errors do occur they are hard to track down. Due to the file-generation magic that the plugin performs, the line numbers reported by Psych are off. (On the other hand, even without this plugin Psych almost always returns the wrong line numbers anyway.)
- Data inclusion only goes one level deep. You can reference person.project, but not person.project.person, due to the ifdef logic.
Still, I think these compromises are well worth the benefit for a large project. If this is something you’d like to experiment with, you can grab the code directly from this gist.