Using A Static Site Generator At Scale: Lessons Learned

About The Author

Stefan Baumgartner is a software architect based in Austria. He has published online since the late 1990s, writing for Manning, Smashing Magazine, and A List … More about Stefan ↬

Email Newsletter

Weekly tips on front-end & UX.
Trusted by 176.000 folks.

Quick summary ↬ Static site generators are wonderful, even though they have to deal with work for which they weren’t initially created for. Learn how to take care and provide aid wherever necessary for them to get going and become more productive than ever.

Static site generators are pretty en vogue nowadays. It is as if developers around the world are suddenly realizing that, for most websites, a simple build process is easy enough to render the last 20 years of content management systems useless. All right, that’s a bit over the top. But for the average website without many moving parts, it’s pretty close!

However, does that hold true for websites bigger than your humble technology blog? How do static site generators behave when the number of pages exceeds the average portfolio website and runs up into the thousands? Or when development is a team effort? Or when people of different technical backgrounds are involved? This is the story of how we managed to bring roughly 2000 pages and 40 authors onto a technology stack made for hackers.

The Reason For Static Sites And The Task At Hand

We used static site generators at our company’s spin-off startup, where we had a reasonable amount of content to maintain: about 30 to 40 pages of product information, the occasional landing page and some company-related websites.

We have had good experiences with static site generators. For us front-end-heavy web developers, using a static site generator is as easy as templating, but with real data and actual content! And it enables us and our content providers to scale easily.

We started using the popular Jekyll static site generator. The usual page consisted of the following:

  • YAML front matter.
    This is a delight for authors and editors, because they can put any meta information in it — even meta data that is not yet interpreted but might be interpreted in the future.
  • Markdown.
    Markdown provides the basic structure for the content. It is easy to understand, it is easy to write, and a ton of editors out there give a good preview of the content at hand.
  • Liquid block elements
    Liquid is Shopify’s templating language, and it’s very powerful. It allows for advanced loops and conditionals and can be easily extended using plugins written in Ruby. Developers provide content editors with structural elements such as {% section %} and {% column %} to better organize the page.
  • And, of course, a lot of images.
    Each page had about 5 to 15 images.
More after jump! Continue reading below ↓

A typical page looked like this:

---
title: Getting started with our product
layout: blue
headerImage: getting-started.svg
permalink: /getting-started/
---
{% section %}
# How to get started with our product
…
{% endsection %}
…

In the end, writing content was as easy as scribbling down notes in an editor. Polish and beauty were added once the page ran through Jekyll.

In addition to the ease of using our preferred content editors, we loved the additional benefits that static site generators gave us as developers and “webmasters.” (You haven’t heard that term in quite some time, have you?)

  • Security-wise, static websites are a fortress. Not having any database or any dynamic interpreter running on your servers reduces the risk of hacks tremendously.
  • A static website is incredibly fast to serve. Put it on a CDN and let it be consumed worldwide in no time.
  • Web developers love the flexibility. Changing the layout or adding a microsite to the content does not require you to go deep into the internals of the content management system, nor does it require any hacks. You can maintain these resources next to your usual content and “just deploy” them with it.
  • Storing all of the technology parts as well as the content in a version-control system such as Git allows for a flexible publishing cycle. Preparing content in a branch, merging it on demand and putting it out on the servers entails just a few clicks on a Monday morning.
Git as content store
Using Git as content store allows you to treat content like source code. Including pull requests, code reviews and versioning. This brings content authors to the same place as developers and designers. (Image credits: Stefan Baumgartner and Simon Ludwig) (Large preview)

Our five-person team was pretty pleased with the results. We took the idea from our marketing website over to our other web entities. Suddenly, next to our 40-page main brand website was a 50-page style guide. Then, 150 pages of documentation. Then, almost every web entity from our sibling company, counting up to 2000 pages of documentation. You wouldn’t believe it, but our tech stack was at the edge of exhaustion:

  • The bigger your page, the longer your build. Static site generation is pro-active compilation of source code into HTML pages. It can take hours until your website is deployment ready.
  • Choosing a static site generator is like betting on one Content Management System. A tool which speeds you up at first, but slows you down when it doesn’t meet your needs.
  • Even if most of your content is static and does not require any user input, there is the occasional case where you need dynamic data.
  • Tech-savvy content editors love working with content that is actually source code. Not so tech-savvy editors …, well, they don’t.

Let’s see how we tackle each one of those topics.

More Content, More Build Time

One key factor of static site generation is the proactive approach to rendering. With a traditional content management system (CMS), each page you access gets rendered just for that one visit (obvious caching algorithms not included). With a static site generator, you create all of your pages at once. And even though the process is fast, it scales linearly with the amount of content. Especially when you have to auto-generate and auto-optimize responsive images of screenshots from 200 pixels up to full high definition in 200-pixel steps. Even with our initial setup of 40 pages and roughly 300 images, the build took about two hours from start to finish on our continuous delivery machines. That’s not time you want to wait for to see whether you fixed that typo in your headline or not. Anticipating our workload in the not-so-distant future, we had to make some important decisions.

Divide And Conquer Build

Even if your technology stack can generate thousands of pages, it doesn’t necessarily need to. Not every item of content needs to know about every other item. A German-language website can be treated separately from an English-language version. And the documentation is a different content area than our main brand website.

Not only are the content elements discrete, but they also diverge in update frequency. Blogs are updated many times a day, the main brand website many times a week, and the documentation every two weeks to coincide with our product’s feature releases. And our Java performance book? Well, that’s once or twice a year.

This led us to split all of our websites into distinct content packages and tech repositories. The content packages were Git repositories that contained everything an author or editor could and should touch: Markdown files, images, additional data lists. For each entity, we created a content repository with the same structure.

One tech repo for various content packages
The tech repository can be regarded as a machine that converts several content repositories into full static websites. (Image credits: Stefan Baumgartner and Simon Ludwig) (Large preview)

The tech repositories, on the other hand, contained everything meant for developers: front-end code, templates, content plugins, and the build process for our static site generation.

Our build servers were set up to listen for changes to each of those Git repositories. A change in a content repository triggered a build of that content package alone. A change in a tech repository triggered a build of only the corresponding content repository. Depending on how many content repositories you have, this can cut down the build time tremendously.

Incremental Builds

Even more important than splitting the content files into separate packages was the method for dealing with images. The objective was to generate each screenshot of our product in various responsive-friendly sizes, down to 200 pixels wide. Also, each newly generated image would have to be optimized in gulp-imagemin. Doing this for every build iteration took up a good chunk of the initial two-hour build time.

While the two hours of build time were necessary for the very first build, that was wasted time for each subsequent one. Not every image changed from iteration to iteration. Much of the work had already been done, so why do it over and over again? Incremental builds were the key to saving our build servers and ourselves from having to do a lot of work.

Our image processing was done in Gulp. The gulp-newer plugin is exactly what we needed for our incremental builds. It compares the timestamp of each file with the timestamp of the file with the same name in the destination directory. If the timestamp is newer, the file is kept in the stream. Otherwise, it is discarded. Generating all of the responsive images, then, was a matter of chaining the right plugins in the right order:

const merge = require('merge2');
const rename = require('gulp-rename');
const newer = require('gulp-newer');
const imagemin = require('gulp-imagemin');
…

/*
The options in this case are an array of image widths. In our
case, we want responsive images from 200 pixels
up to 1600 pixels wide.
*/
const options = [
  { width: 200 }, { width: 400 },
  { width: 600 }, { width: 800 },
  { width: 1000 }, { width: 1200 },
  { width: 1400 }, { width: 1600 }
];

gulp.task('images',() => {
  /*
  We can map each element of this array to a Gulp stream.
  Each of those streams selects each of the original images
  and creates one variant.
  */
  const streams = options.map(el =>  {
    return gulp.src(['./src/images/**/*'])
       /*
       We follow a naming convention of adding a suffix to
       the file's base name. This suffix is the image's target
       width.
       */
      .pipe(rename(file => file.basename += '-' + el.width))
       /*
       This is where the "incremental" builds kick in.
       Before we run the heavy processing and resizing tasks,
       we filter elements that don't have any updates.
       This Gulp tasks checks whether the results in "images" are
       older than the source items.
       */
      .pipe(newer('dist/images'))
      .pipe(resize(el))
      .pipe(imagemin())
      .pipe(gulp.dest('dist/images'));
  });
  return merge(streams);
});

In the absence of a good image-resizing plugin, we had to create our own. This was also the time to ensure that no unnecessary file was being processed. If an image couldn’t be resized because the target width was bigger than the original, then we discarded the image as well. The following snippet uses Node.js’ GraphicsMagick bindings to complete the task.

const gm = require('gm');
const through = require('through2');

module.exports = el => {
  return through.obj((originalFile, enc, cb) => {
    var file = originalFile.clone({contents: false});

    if (file.isNull()) {
      return cb(null, file);
    }

    const gmfile = gm(file.contents, file.path);
    gmfile.size((err, size) => {
      if(el.width < size.width) {
        gmfile
          .resize(el.width,
            (el.width / size.width) * size.height)
          .toBuffer((err, buffer) => {
             file.contents = buffer;
             /* add resized image to stream */
             cb(null, file);
           });
      } else {
        /* remove from stream */
        cb(null, null);
      }
    });
  });
};

With all of this incremental adding of files, we couldn’t forget to get rid of files in the destination directory that had been deleted in the source. We didn’t want any leftovers from previous builds lying around, adding extra weight to the bundle to be deployed. Thankfully, Gulp’s task system allows for promises, so we had a lot of Promise-based plugins we could use for this task.

const globby = require('globby');
const del = require('del');
const path = require('path');
const globArray = [
  'images/**/*'
];

const widths = [200, 400, 600, 800, 1000, 1200, 1400, 1600];

/* This helper function adds the width suffix
   to the file name. */
const addSuffix = (name, w) => {
  const p = path.parse(name);
  p.name = `${p.name}-w`;
  return path.format(p);
}

gulp.task('diff', () => {
  return Promise.all([
    /* First, we select all files in the destination. */
    globby(globArray, { cwd: 'dist', nodir: true }),
    /* In parallel, we select all files in the source
       folder. Because they don't have the width suffix,
       we add them for every image width after selecting. */
    globby(globArray, { cwd: 'src', nodir: true })
      .then(files => files
         .map(el => […widths.map(w => addSuffix(el, w)])
  ])
  /* This is the diffing process. All file names that
     are in the destination directory but not in the source
     directory are kept in this array. Everything else is
     filtered. */
  .then(paths => paths[0]
    .filter(i => paths[i].indexOf(i) < 0))
  /* The array now consists of files that are in "dest"
     but not in "src." They are leftovers and should be
     deleted. */
  .then(diffs => del(diffs));
);

With all of these changes to the build process, the initial two hours for the images had been reduced to two to five minutes per build, depending on the number of images added. The extra time of doing all of the file-status checks passed pretty quickly, even with tens of thousands of images lying around.

Avoiding Technology Lock-In

Jekyll is an amazing tool because it comes with a lot of features that go beyond merely creating HTML pages. The healthy plugin ecosystem makes Jekyll not just a static site generator, but a full-fledged build system. Out of the box, it’s possible to compile Sass and CoffeeScript with Jekyll. The Jekyll asset pipeline offers not only a ton of features for creating images, but also extra confidence because it checks every included asset for existence and integrity. This is gold if you’re dealing with a lot of assets.

However, these benefits come at a high cost, not only with performance and build time, but also with a certain level of technology lock-in. Instead of Jekyll being in your build system, it becomes your build system. Anything not included in Jekyll or one of the plugins has to be written and maintained by you in Ruby.

This bugged us in a several ways. Ruby was not our favorite language to begin with. While many on our team could work with Ruby, some of us couldn’t write a single line without referring to the language’s specification. Even worse is that we were trying to move away from a traditional CMS to gain more freedom and flexibility in the way we do things. By relying heavily on Jekyll’s ecosystem, we were trading one monolith for another. To avoid this form of technology lock-in, we took a few more steps.

Separation of Concerns

First, we stripped away everything from Jekyll’s duties that had nothing to do with the actual output of HTML. We still included the check for an asset’s existence. However, image generation and the compilation of JavaScript and style sheets would all be done by Gulp builds running beforehand.

This gave us a list of completely different benefits:

  • Should we have a change of heart and switch Sass for something trendier, it would affect only a single part of our build file, not the entire static site generation. The same goes for the compilation of our other assets.
  • We could decide whether even to build certain assets. The JavaScript might change, but the styles might not, so why compile the styles again? This cuts down even more on build time.
  • We could even remove Jekyll at some point, and key parts of our build would still be intact and functioning.

Secondly, we removed any post-processing steps from Jekyll. The Jekyll asset pipeline allows you to create hashed URLs for JavaScript, style sheets and images. Stripping that away from the Jekyll process meant Jekyll had less to do, thus clarifying its purpose. Interestingly enough, we saw an improvement in speed by moving the revisiting process from Ruby to Node.js. The wonderful plugin gulp-rev took care of this process.

const gulp       = require('gulp');
const rev        = require('gulp-rev');
const revReplace = require('gulp-rev-replace');
…
gulp.task('revision', () => {
  return gulp.src(['./**/*.js',
     './**/*.css', './images/**/*.*'])
    .pipe(rev())
    .pipe(gulp.dest(based))
    .pipe(rev.manifest())
    .pipe(gulp.dest('.'));
});

gulp.task('rev', ['revision'], () => {
  var manifest = gulp.src('rev-manifest.json');
  return gulp.src(['./**/*.html'])
    .pipe(revReplace({
      manifest: manifest
    }))
    .pipe(gulp.dest(based));
});

From here on in, we made sure to know what is a part of Jekyll’s purpose and what isn’t. You can do amazing things with Jekyll and its ecosystem, but you also don’t want to rely too much on a tool that might not be the right one for tasks to come.

Jekyll’s responsibilities were reduced to converting Markdown and Liquid to HTML pages. With everything else being done by Gulp, you can easily spot the odd bird in the stack:

  • The self-written plugins for custom sectioning elements were being done in Ruby (the only Ruby dependency still there).
  • We were still using Liquid, a rather “exotic” templating language.

We also realized that Jekyll is not meant to be included in a build process. Jekyll was created to be the build process. Jekyll opens and analyzes every file during a build. Once you strip away everything from Jekyll that isn’t HTML creation, you have to take care of Jekyll’s built-in features like incremental builds by yourself.

Liquid Voodoo

While Jekyll is very popular, the underlying templating engine, Liquid, seems to be the odd one standing. It bears similarities to the PHP templating engine Twig and the JavaScript equivalent Swig, but it has a lot of features that are nowhere else to be seen. Liquid is powerful and allows for a lot of logic to find its way into the templates. This is not always a good thing, but it also isn’t Liquid’s fault. Take, for example, the creation of breadcrumbs based on a document’s permalink, done entirely in the templating language:

{% assign coll = site.content %}
<ul class="breadcrumbs">
  <li><a href="{{site.baseurl}}/">Home</a></li>
  {% assign crumbs = page.url | split: '/' %}
  {% for crumb in crumbs offset: 1%}
  {% capture crumb_url %}{% assign crumb_limit = forloop.index | plus: 1 %}{% for crumb in crumbs limit: crumb_limit %}{{ crumb | append: '/' }}{% endfor %}{% endcapture %}
  {% capture site_name %}{% for p in coll %}{% if p.url == crumb_url %}{{ p.title }}{% endif %}{% endear %}{% end capture %}
  {% endunless %}
  {% unless site_name == '' %}
  <li>
  {% unless forloop.last %}
    <a href="{{ site.baseurl }}{{ crumb_url | strip_newlines }}">{{site.name}}</a>
  {% else %}
    <span>{{ site_name }}</span>
  {% endunless %}
  </li>
  {% endfor %}
</ul>

Let’s not go too deep into the abomination of code you see above. A mere glance should get the point across: The code above will output correctly, but it’s obviously not as readable as one would expect from a templating engine. On the contrary, the more features and logic you cram into this, the worse it’s going to be if you ever have to reconstruct what has happened. Moving away from this Liquid “voodoo” to Jekyll plugins would be a better idea:

  • Restrict Liquid to content output (loops and simple conditionals).
  • Create complex data beforehand. If it’s not available in Jekyll itself, then a plugin or a pregenerated YAML or JSON file is the way to go.

Looking at the breadcrumb generation again. A plugin that fills the relevant data set would be much more flexible and would not rely on string concatenation or splitting magic. Also, the Liquid templates that access the prefilled data would be much more readable and easier to understand:

{% if page.breadcrumbs %}
<ul class="breadcrumbs">
  <li><a href="{{site.baseurl}}/">Home</a></li>
  {% for item in page.breadcrumbs %}
  <li>
    {% unless forloop.last %}
    <a href="{{item.url}}">{{item.label}}</a>
    {% else %}
    <span>{{item.label}}</span>
  {% endunless %}
  </li>
  {% endfor %}
</ul>
{% endif %}

This will keep your templates clean and tidy. Also, if you want to move from Liquid to another templating engine (in case you ever drop Jekyll), the templates will be a lot easier to convert.

Serving More Than Static Websites

Deploying a static website sounds easy at first. You have a bundle of rendered HTML files and a lot of assets, and you just have to put them somewhere to be delivered to the World Wide Web. With free hosting services, static storage services and content delivery networks, the possibilities for getting your content out seem endless. You even can serve a page from a Dropbox folder!

If you are doing more than simply delivering content — perhaps you are in an ongoing migration process — then the requirements for the server might be a little more demanding.

The solution we have in place is based on nginx, which is great for serving static websites to begin with, but also makes for an easy setup when you’re not just serving a static website.

Ongoing Migration From Old to New

With 2000 pages of content divided into different content packages, we had two strategies to choose from to go live:

  • Convert all of the old content, wait for a big-bang release and fail miserably.
  • Or start to release smaller content packages, and migrate over time.

With option one, the converted content would grow stale or would have to be maintained twice for a certain amount of time. And we wouldn’t get the benefits of static websites until long after everything is done. Of course, we opted for the latter. To make sure we could freely deploy new content created with the new technology stack without killing access to unmerged pages from the old CMS, we configured nginx to serve as a “fall-through” proxy.

A proxy handling files from two different locations
The proxy hits files from the static content folder first. Should the file not be available, the proxy falls through to the old CMS server. (Image: Stefan Baumgartner and Simon Ludwig)