BenBE's humble thoughts Thoughts the world doesn't need yet …

03.08.2010

A brief overview on Syntax Highlighting

Filed under: GeSHi. Posted by BenBE @ 19:17:12

The following is part of a weekly column by Ido Gendel for the Israeli online magazine nana10.co.il, which he allowed me to publish on my blog in its English translation. Besides a short history of syntax highlighting, it includes a short interview that we conducted by e-mail. Happy reading!

Colorful Programming with Syntax highlighting

With the possible exception of programs written in Assembly, source code is intended for the people writing it much more than for the computers running it. Take, for instance, one of the most fundamental elements of high-level programming: variable names. These names are meaningless to the CPU, which is only interested in numerical addresses in the computer’s memory and in its own registers. A major part of a compiler’s job is to translate all the names and commands that we write and understand into numbers and codes the CPU can manage. In fact, some parts of the code are entirely irrelevant to the computer: the compiler disregards comments altogether, and in most programming languages (Python being a notable exception), whitespace and indentation are syntactically meaningless as well. The only reason to "prettify" the code and to comment it is to make it more readable for humans.

Color at the programmer’s service

A relatively recent addition to the code readability arsenal is syntax highlighting: changes in the way characters and words are displayed to the programmer, according to their role in the programming language’s syntax. Such changes may include bold or italic type, color, different fonts or any other graphic alteration. They carry no meaning in terms of code functionality or interpretation, and are meant solely to make the programmer’s life easier. Highlighted code is easier to read, and its structure is easier to make out; when the highlighting is done in real time, it also helps the programmer avoid typos. For example, if keywords are colored automatically, the programmer can recognize a mistyped keyword immediately, because it won’t get colored.

The history of syntax highlighting may be short, but information about it is hard to find. Considering the inflexibility of old terminal displays and the computing resources required for highlighting text, it seems reasonable that the first highlighted code actually appeared in print. The oldest programming book in my personal library, Niklaus Wirth’s famous Algorithms + Data Structures = Programs from 1976, contains Pascal code with keywords in bold and identifiers (variable and function names) in italics. Several old BASIC interpreters, such as 1985’s AmigaBASIC, converted keywords to capital letters on the fly. Color highlighting first appeared, as far as I could ascertain, in the early 1990s with IDEs such as Turbo Pascal 7.0 and Turbo C++ 3.0. Today, every major IDE and text editor offers syntax highlighting, and in most cases the highlighting scheme is customizable.

The ABC of syntax highlighting

In most programming languages, the syntax is based on the following fundamental element types:

  • Keywords (e.g. if, case, for)
  • Symbols (e.g. "=", "+", ".")
  • Identifiers (variable and function names)
  • Comments
  • String literals (and perhaps explicit numbers)

Any syntax highlighter must be able to differentiate these types and to present them accordingly. Here’s some Object Pascal code as it appears in the FPC/Lazarus IDE. The keywords are colored white, the symbols are green, identifiers are yellow by default, the comments are grayed out and the string literals are magenta. Notice that the string literal contains text that’s identical to the text on the first line, but because it serves here as part of a string and not as actual code, it is highlighted differently.
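As a rough illustration of the classification step described above, here is a minimal sketch in Python of a tokenizer that sorts Pascal-like text into these element types. The keyword list, the token names and the `tokenize` function are assumptions made up for this example; they are not taken from Lazarus or from any real highlighter, and the patterns cover only a small part of Pascal's syntax.

```python
import re

# Hypothetical, deliberately tiny keyword list; real Pascal has many more.
KEYWORDS = {"begin", "end", "if", "then", "else", "for", "do", "var", "program"}

# Order matters: comments and strings are tried before words and symbols,
# so code-like text inside them is not classified as code.
TOKEN_RE = re.compile(
    r"(?P<comment>\{[^}]*\}|//[^\n]*)"
    r"|(?P<string>'(?:[^'\n]|'')*')"
    r"|(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<word>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<symbol>:=|<=|>=|<>|[-+*/=<>.,;:()\[\]])"
    r"|(?P<space>\s+)"
    r"|(?P<other>.)"
)


def tokenize(source):
    """Return a list of (kind, text) pairs that covers the whole source."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            # Pascal is case-insensitive, so compare in lower case.
            kind = "keyword" if text.lower() in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    # The string literal contains the word "if", but it stays a string.
    for kind, text in tokenize("x := 'if x'; { if }"):
        if kind != "space":
            print(kind, repr(text))
```

A mistyped keyword such as `begn` simply falls through to the identifier branch, which is why it would not get the keyword color in an editor built on this kind of lookup.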

How far should syntax highlighting go? Take a look at the following code, also from FPC/Lazarus:

Why is the word "write" highlighted here as a keyword, while in the previous listing it was an identifier? Because in this particular context, it IS an Object Pascal keyword. In other words, when a programming language is complex, perfect syntax highlighting requires more than just isolating words and symbols: it actually has to parse the code, much the same way as the compiler itself does. When this is done within a specialized IDE, the built-in syntax highlighter can make use of the code parser or other relevant components (e.g. Code Completion). For developers of generic syntax highlighters, however, this is a serious challenge.
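To make the idea of context-dependent keywords concrete, here is a small Python sketch. One place in Object Pascal where `write` is not an ordinary identifier is a property declaration, where `read` and `write` act as directives; whether that is the exact case shown in the original screenshot is an assumption. The function name, the word lists and the single state flag are hypothetical, and the sketch ignores strings, comments and indexed properties.

```python
import re

# Hypothetical, tiny word lists for illustration only.
KEYWORDS = {"property", "begin", "end", "procedure"}
# Words treated as keywords only inside a property declaration.
PROPERTY_DIRECTIVES = {"read", "write", "default", "stored"}

# Either a whole word or any single non-space character.
WORD_OR_CHAR = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")


def classify_words(source):
    """Return (kind, text) pairs, tracking whether we are inside a property."""
    result = []
    in_property = False
    for text in WORD_OR_CHAR.findall(source):
        lower = text.lower()
        if not (text[0].isalpha() or text[0] == "_"):
            kind = "other"
            if text == ";":
                in_property = False  # the declaration ends at the semicolon
        elif lower in KEYWORDS:
            kind = "keyword"
            if lower == "property":
                in_property = True
        elif in_property and lower in PROPERTY_DIRECTIVES:
            kind = "keyword"
        else:
            kind = "identifier"
        result.append((kind, text))
    return result


if __name__ == "__main__":
    code = "property Name: string read FName write SetName; Write(Name);"
    for kind, text in classify_words(code):
        print(kind, text)
```

Even this toy version needs state that survives from one word to the next, which a plain keyword list cannot express; a real language has many such contexts, hence the need for something closer to a parser.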

GeSHi: Syntax highlighting on the web

One of the most famous and widespread generic syntax highlighters is GeSHi, an open source project written in PHP that accepts code in any of 177 different programming languages and adds syntax-highlighting CSS/HTML tags to it, for display in web browsers. In its early days, back in 2004, GeSHi was developed for use on internet forums (the phpBB2 platform), but it grew and spread all over, from platforms such as DokuWiki and WordPress to private websites such as my own (see, for instance, the syntax-highlighted Object Pascal code on the [Hebrew] "Tic-Tac-Toe program" page). The manager of the project since 2008, a German student called Benny Baumann, gives a very rough estimate of 2-5 million websites currently using GeSHi.
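The general idea of adding CSS/HTML tags to already classified code can be sketched in a few lines of Python. This is not GeSHi's implementation or its actual markup: the `to_html` function, the token kinds and the CSS class names below are all invented for the illustration.

```python
from html import escape

# Hypothetical mapping from token kind to CSS class name.
CSS_CLASS = {
    "keyword": "kw",
    "identifier": "id",
    "symbol": "sy",
    "string": "st",
    "comment": "co",
}


def to_html(tokens):
    """Wrap each (kind, text) token in a <span> and escape its text for HTML."""
    parts = []
    for kind, text in tokens:
        safe = escape(text)  # turns <, >, & and quotes into HTML entities
        css = CSS_CLASS.get(kind)
        # Whitespace and unknown kinds are emitted without a wrapper.
        parts.append(f'<span class="{css}">{safe}</span>' if css else safe)
    return '<pre class="code">' + "".join(parts) + "</pre>"


if __name__ == "__main__":
    tokens = [
        ("keyword", "if"), ("space", " "), ("identifier", "a"), ("space", " "),
        ("symbol", "<"), ("space", " "), ("identifier", "b"),
    ]
    print(to_html(tokens))
```

Escaping matters as much as coloring here: without it, a `<` in the source code would be interpreted by the browser as the start of a tag.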

How to use GeSHi

If you intend to put highlighted code in a WordPress blog, the simplest solution is probably to use one of the existing GeSHi-based plugins out there. Even if you maintain a "regular" website, such as mine, the procedure is straightforward. First, download your preferred GeSHi version to your computer. At the time of writing, this would be version 1.0.8.8 (which we will use here), but if you have a taste for adventure, you can download and play with more advanced development versions. All of them, of course, are free.

Extract the files, then create a folder on your web server and copy geshi.php and the geshi subfolder into it. This subfolder contains the definition files for the different programming languages; naturally, you only need to copy the files for the languages you’re actually going to display. On the specific page where you want to display highlighted code, add the following PHP code:

<?php include "geshi.php"; ?>

The string should include the path to the folder where this file resides. If you’re using a common template for several pages, use include_once instead of include, so that geshi.php is not loaded a second time. But that’s nitpicking.

Further down the page source, create a GeSHi object and define the code and the language for it. This can be done in several ways (do read the documentation). Here’s how I did it, for snippets of Object Pascal code that were saved in separate files:

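$geshi = new GeSHi('', '');            // Create an empty highlighter object
$geshi_src = "...";                    // Path + filename of the snippet
$geshi->load_from_file($geshi_src);    // Load the source code from that file
$geshi->set_language('pascal', true);  // Highlight as Pascal (true forces a reset of the language data)
echo $geshi->parse_code();             // Print the highlighted output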

If you have additional code to highlight further down, you may want to recycle the same object and simply load the new code snippet into it. The very short GeSHi user’s guide is here, and you can find the details of all the relevant methods and constants, including examples and plenty of customization options, on the project’s documentation page.

The man behind the project

Baumann got his diploma in Computer Science earlier this year, and has plans for an M.Sc. in mathematics and his own small computer firm. He admits to liking challenges and to being easily bored by recurring tasks. In his view, syntax highlighting is one tool in the code readability toolbox, with good code structuring and proper naming being just as important. As an example, he mentions the IOCCC (the International Obfuscated C Code Contest), where the goal is to create the most obscure, obfuscated (though still valid) code possible. No syntax highlighting, states Baumann, will make such programs readable. When doing his own programming, he prefers highlighting by color contrast, as opposed to changes of typeface and font.

The language files in the stable version of GeSHi contain, more or less, simple lists of keywords, symbols and so forth. The development versions, however, support much more complex syntactic rules, and one of Baumann’s ambitions is to create – with the help of other developers – new and updated files that will reflect the syntax of each language better, and will even utilize similarities between different languages. Other future plans include context-sensitive syntax highlighting (e.g. for programming under Linux or Windows), and support for additional output formats such as XML, man page (Unix’s classic documentation format) and so on.
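The "simple lists" approach, and the idea of reusing what similar languages have in common, can be sketched as plain data plus one generic lookup function. This is a loose Python illustration only: the dictionary layout, the field names and the `extend` and `classify` helpers are hypothetical and do not reproduce the format of GeSHi's language files.

```python
# A language definition as plain data: nothing but lists (here: sets).
PASCAL = {
    "case_sensitive": False,
    "keywords": {"begin", "end", "if", "then", "else", "var"},
    "symbols": {":=", "=", "+", ";", "."},
}


def extend(base, **extra):
    """Derive a new definition that adds entries to the base definition."""
    derived = dict(base)
    for key, value in extra.items():
        if isinstance(value, set):
            # Build a new set so the base definition stays untouched.
            derived[key] = base.get(key, set()) | value
        else:
            derived[key] = value
    return derived


# A related dialect reuses the base lists and only states what it adds.
OBJECT_PASCAL = extend(PASCAL, keywords={"class", "property", "try", "finally"})


def classify(word, language):
    """Classify a single word or symbol using only the language's lists."""
    key = word if language["case_sensitive"] else word.lower()
    if key in language["keywords"]:
        return "keyword"
    if key in language["symbols"]:
        return "symbol"
    return "identifier"


if __name__ == "__main__":
    for lang_name, lang in (("Pascal", PASCAL), ("Object Pascal", OBJECT_PASCAL)):
        print(lang_name, classify("class", lang), classify("BEGIN", lang))
```

The limitation is visible in the `classify` signature: it sees one word at a time and knows nothing about the surrounding code, so rules that depend on context cannot be expressed in such lists.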

Baumann wishes to take this opportunity to thank all the volunteers who have helped and are still helping to develop GeSHi, to write the language files and to test them. If you want to take part in this project as well, here are the details.

This article was written and translated from Hebrew by Ido Gendel. It appeared on the Computer and Technology channel of the major Israeli portal nana10.co.il, in a weekly column that introduces computer concepts and programming. The original article can be found here.

Creative Commons License
A brief overview on Syntax Highlighting by Ido Gendel and Benny Baumann is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 Germany License.
The content is a translation of an article at http://net.nana10.co.il/Article/?ArticleID=736240.
Additional permissions not yet covered by this license can be obtained at http://blog.benny-baumann.de/?page_id=154.
