Markup for Syntax Highlighting

A Comparison of Different Approaches

Figure: Vim’s syntax highlighting (screenshot of a Vim screen)

Syntax-highlighted source code on websites is a common sight these days. Tools like GeSHi, Pygments and SyntaxHighlighter make it easy to embed this functionality on the server or on the client side.

However, the generated markup varies widely in semantic correctness and user experience. In this article I compare different approaches and present a setup that suits both the semantics and the daily use of syntax highlighting.

State of the Art ¶

The venerable pre element serves as the starting point. It has been used since the olden days for putting source code samples on the web (while the original xmp and listing elements never gained much respect). In the simplest case the content is stuffed inside a pre, perhaps with simple emphasis on some keywords.

When automatic syntax highlighters come into play, they offer many useful features for showing off source code:

Obviously, highlighting keywords and special operators with different text styles
Semantic markup, i.e., using appropriate elements for a task
Simple copy’n’pasting of the code sample
Highlighting single lines and zebra striping of lines
Line numbers
Configurable line wrapping or scrolling

I’ll compare different markup approaches by how well they cope with these requirements.

Simple `pre` Element ¶

The pre element with spans for colorizing does well in the first four categories, but it fails utterly in the last two. Line numbers, if displayed, will land in copied text, rendering it unusable without post-processing. They would also be part of the content of the pre, which is semantically incorrect. Wrapping is simply not something plain pre elements do: long lines overflow or scroll unless CSS forces a different behaviour.

The huge advantages are the otherwise correct semantics and the ease of use.

<pre>
<span class="kw">function</span> foo<span class="op">();</span>
</pre>

Tables ¶

Another approach is using table elements for highlighted code. The line numbers are put in the first cell of a row, while the code moves to the second cell, either plain or fenced by a pre or code element.

This approach is brilliant at handling line issues. Zebra striping comes naturally, and highlighting lines can be done by background color or by a dedicated column.

However, copying the code suffers from the same issue as the plain pre solution, and the semantics are debatable: does source code really have a tabular nature? Are the line numbers part of the information that the source carries?

The copying problem can be circumvented with the help of CSS: when the first cells are hidden with display: none, they won’t show up in the text copied to the clipboard. Toggling the line numbers can be achieved with a tiny bit of JavaScript.

If the lines should wrap, the algorithm has to take care that each line still occupies only a single cell in a row, as otherwise the line numbering will get out of step. This is especially a problem in setups where all line numbers are in one cell while all the code is in a single cell next to it.

<table>
  <tr>
    <td>1</td>
    <td><span class="kw">function</span> foo<span class="op">();</span></td>
  </tr>
</table>
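The “tiny bit of JavaScript” for toggling the numbers could look roughly like the sketch below. It assumes that the line numbers always sit in the first cell of each row, as in the markup above. The function name setLineNumbersVisible is made up for this example.

```javascript
// Hypothetical helper: show or hide the first cell of every table row.
// `table` is assumed to be a <table> element whose first column holds
// nothing but line numbers.
function setLineNumbersVisible(table, visible) {
  const cells = table.querySelectorAll('tr > td:first-child');
  for (const cell of cells) {
    // An empty string removes the inline style again, 'none' hides the cell.
    cell.style.display = visible ? '' : 'none';
  }
  return cells.length; // number of cells that were touched
}
```

In a page this would be called from a click handler on some “toggle line numbers” control, with the code table passed in as the first argument.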

A variant of the table approach uses floating divs to achieve the same effect. This is nothing more than a symptom of Divitis and doesn’t add any new insight.

Using an Ordered List ¶

Focusing on the line numbers again, another HTML element comes to mind that is the most natural choice for ordinal data: ol. In this version, every line of code is enclosed in its own li element.

This solution is elegant for several reasons. The most important is that the line numbers need not be handled in the markup itself. They are generated automatically by the browser.

The markup is semantic in the sense that line-oriented content is put in an element where order matters.

Zebra striping and highlighting are as trivial as in the table case, and line wrapping happens automatically, with the line numbers adapting correctly. A really nice feature is that clicking the line number automatically highlights the whole corresponding line.

What rules this solution out is, once again, the problem of copy’n’pasting code. Even if the lis receive a list-style: none via CSS, the browser still adds automatic line numbers to copied text.

Also, the line numbers themselves cannot be styled independently of the rest of the content. They always take on the text color of the ol and cannot have a background color of their own.

<ol>
  <li><span class="kw">function</span> foo<span class="op">();</span></li>
</ol>

General Problems when Skipping `pre` ¶

The div, table and ol solutions all exist in variants with and without an embedded pre element. If a highlighter skips it, even when it uses the code element in its place, several issues arise immediately.

White space will suffer from the usual collapsing if not replaced by nbsps or changed with CSS
Automatic HTML compression (stripping unnecessary whitespace) will likely destroy semantics in the code (think Python)
Bots, screen readers and older browsers might display the code incorrectly or only partially
Forgetting to deliver an appropriate print stylesheet will cause the same effect for all “normal” users
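To illustrate the first point, here is a minimal sketch of the nbsp replacement a highlighter would need when it skips pre. The function name preserveWhitespace and the tab width of 4 are assumptions for this example. The function expects plain source text, so it has to run before any highlighting spans are added, otherwise the spaces inside the tags would be replaced as well.

```javascript
// Hypothetical helper: escape one line of plain source text and replace its
// white space so that it survives HTML's white space collapsing.
// Tabs become a fixed number of spaces, not real tab stops.
function preserveWhitespace(line, tabWidth) {
  const width = tabWidth || 4; // arbitrary default for this sketch
  return line
    .replace(/&/g, '&amp;')            // escape first, so the &nbsp; added below stays intact
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/\t/g, ' '.repeat(width))
    .replace(/ /g, '&nbsp;');
}

console.log(preserveWhitespace('    return a < b'));
```

Besides the extra work, text treated this way cannot wrap at spaces any more, and the non-breaking spaces may end up in the clipboard when the code is copied.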

Overview of current solutions
	`pre`	`table`	`ol`
Semantic	✓	✗	✓
Line wrapping	✗	✓	✓
Line numbers	✗	✓	✓
Copy’n’paste	✓	✗	✗
Degrading	✓	✗	✗

Reviving the `pre` Element with CSS ¶

The promised new approach to marking up syntax highlighting is in fact a simple extension of the old pre technique. We only add a single new span element per line to tell the lines apart:

<pre>
<span class="line"><span class="kw">function</span> foo<span class="op">();</span></span>
<span class="line"><span class="kw">function</span> bar<span class="op">();</span></span>
</pre>
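How the line spans get into the markup depends on the highlighter. As a rough sketch, a post-processing step over already highlighted HTML could look like this. The function name wrapLines is made up, and the sketch assumes that no highlighting span crosses a line break. A highlighter that emits multi-line spans (for block comments, say) would have to close and reopen them at each line end first.

```javascript
// Hypothetical post-processing step: wrap every line of already highlighted
// HTML in <span class="line">.
// Assumption: no highlighting span is open across a line break.
function wrapLines(highlightedHtml) {
  return highlightedHtml
    .replace(/\r\n?/g, '\n') // normalise line endings
    .replace(/\n$/, '')      // ignore a single trailing newline
    .split('\n')
    .map(function (line) {
      return '<span class="line">' + line + '</span>';
    })
    .join('\n');
}

const wrapped = wrapLines(
  '<span class="kw">function</span> foo<span class="op">();</span>\n' +
  '<span class="kw">function</span> bar<span class="op">();</span>\n'
);
console.log('<pre>\n' + wrapped + '\n</pre>');
```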

The extra spans don’t gain us anything by themselves, but now we have all prerequisites in place for proper CSS formatting:

pre {
  counter-reset: code;
  padding-left: 30px;
}

.line {
  display: block;
  counter-increment: code;
}

.line:before {
  content: counter(code);
  float: left;
  margin-left: -30px;
  width: 25px;
  text-align: right;
}

Using CSS’s generated content and counter properties we can now simply build line numbers that match the ones from the ol solution in all but one respect. They have the added benefit that they don’t get copied to the clipboard. The display of line numbers can be controlled by classes on the pre element, e.g. by defining the numbering rule only for pre.with_numbers .line:before.

The .line:before pseudo-element can even be styled like any other element: We can freely choose color, background, width and so on. Line wrapping can be controlled with CSS, too:

pre {
  overflow-x: auto; /* show scrollbars if we’re not wrapping long lines */
}

.line {
  /* use only one of the two declarations; if both stay, the later one wins */
  white-space: pre;      /* the default: don’t wrap */
  white-space: pre-wrap; /* wrap long lines, but keep multiple spaces and tabs intact */
}

The pre-wrap solution works in all modern browsers and in IE 8 and newer. The non-wrapping solution works down to IE 6.

The approach is both semantic and usable, and it works in all recent browsers and in IE from version 8 up. For IE 7 and below, a simple JavaScript solution that adds the line numbers dynamically is conceivable. This will, however, affect the clipboard.
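Such a fallback might be sketched as follows. It is written in deliberately old-fashioned JavaScript, since it targets old browsers. The function name addLineNumbers and the class name line-number are assumptions for this example, and the new spans would still need their own CSS (float, width and so on) to line up like the generated numbers. Deciding when to run it, for instance only in IE 7 and below, is left to the caller.

```javascript
// Hypothetical fallback: insert a real element with the line number at the
// start of every .line span inside the given <pre>.
// `doc` is the document object, passed in to keep the function easy to test.
function addLineNumbers(pre, doc) {
  var spans = pre.getElementsByTagName('span');
  var lines = [];
  var i;

  // Collect the line spans first: getElementsByTagName returns a live list,
  // so inserting new spans while walking it would shift the indexes.
  for (i = 0; i < spans.length; i++) {
    if ((' ' + spans[i].className + ' ').indexOf(' line ') !== -1) {
      lines.push(spans[i]);
    }
  }

  for (i = 0; i < lines.length; i++) {
    var number = doc.createElement('span');
    number.className = 'line-number'; // made-up class name
    number.appendChild(doc.createTextNode(String(i + 1)));
    lines[i].insertBefore(number, lines[i].firstChild);
  }
  return lines.length; // number of lines that were numbered
}
```

Because the numbers are now real text inside the pre, they will be part of any copied selection, which is the clipboard impact mentioned above.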

With the rise of CSS 3 in current browsers zebra striping becomes as simple as

.line:nth-child(2n) {
  background: green;
}

.line:nth-child(2n+1) {
  background: red;
}
/* granted, the color combination is not the best ;-) */

Highlighting a line can be achieved with class names alone.

.highlighted.line {
  background: yellow;
}

A drawback compared to the ol solution is that clicking a line number doesn’t automatically select the line, but this can be rebuilt in JavaScript if the feature seems necessary.
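One conceivable way to rebuild it is sketched below. Clicks on the generated number arrive at the .line span itself, so the sketch decides by the horizontal click position whether the number gutter was hit. The 30 pixels mirror the padding-left from the stylesheet above. The function names are made up, and the position check assumes a pre without a left border that is not scrolled horizontally.

```javascript
// True if a click at clientX falls into the number gutter of the <pre>.
// preLeft is the left edge of the <pre>, gutterWidth its padding-left.
function isGutterClick(clientX, preLeft, gutterWidth) {
  const x = clientX - preLeft;
  return x >= 0 && x < gutterWidth;
}

// Select the complete contents of one line element.
// `doc` and `win` are the document and window objects.
function selectLine(line, doc, win) {
  const range = doc.createRange();
  range.selectNodeContents(line);
  const selection = win.getSelection();
  selection.removeAllRanges();
  selection.addRange(range);
}

// Wiring for a browser; skipped where no DOM is available.
if (typeof document !== 'undefined') {
  document.addEventListener('click', function (event) {
    const line = event.target.closest && event.target.closest('pre .line');
    if (!line) return;
    const left = line.closest('pre').getBoundingClientRect().left;
    if (isGutterClick(event.clientX, left, 30)) {
      selectLine(line, document, window);
    }
  });
}
```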

All in all, the simple pre element, together with spans for lines and a little bit of CSS fairy dust, serves well for marking up syntax-highlighted text in a meaningful way. It does so in every browser from IE 8 onwards while degrading gracefully in older ones.

Update: I have just read Adam Prescott’s month-old article on the same topic. He concentrates on explaining the possibilities and limits of using as little markup as possible. I recommend the article, since it completes and rounds off the “use only pre” solution.
