BenBE's humble thoughts Thoughts the world doesn't need yet …


Again some parser tweaking

Filed under: GeSHi | BenBE @ 13:03:02

After the previous release contained a fairly major update to the parser (I reordered the processing of keywords, numbers and regexps), the next release candidate will contain yet another minor reordering, as I noticed a small problem with Delphi. As always, I’m asking you to test this new release candidate once it is out and to report any issues you find to me. If you don’t want to wait that long, you can always have a look at the trunk, where these changes have already been applied.

But now for the details on what went wrong. Basically the previous release worked just fine and had no major breakage, apart from one small inconvenience when highlighting things like character literals in Delphi. Their syntax in Delphi is # followed by a number, which gets highlighted as a symbol followed by a number, since # is not a prohibited character in front of a number. So far nothing to worry about. However, Delphi had a regexp defined to highlight # plus number as one unit, and that regexp was now broken because numbers got too much priority. This has been changed, and the new order of things is as follows:

  1. Keywords
  2. Regexps
  3. Numbers

For this to work properly I had to add some look-arounds for numbers (they now need to take care not only of number literals and keyword matches, but also of regexps), which was the one central fix. The downside is that regexps now have to be careful not to break number highlighting or accidentally include something that a number regexp might match. But that’s nothing new, as this problem already existed before.
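To make the idea a bit more concrete, here is a minimal sketch in Python (not GeSHi’s actual PHP code). The patterns and the `highlight` function are made up for illustration; they only mimic the Delphi case, where a regexp claims `#` plus digits and the number pattern uses a look-behind so it leaves those digits alone.

```python
import re

# Hypothetical patterns modeled on the Delphi case; these are not GeSHi's own.
CHAR_LITERAL = re.compile(r"#\d+")            # regexp stage: '#' followed by digits
NAIVE_NUMBER = re.compile(r"\d+")             # a number pattern without any guard
NUMBER = re.compile(r"(?<![#\w])\d+(?!\w)")   # look-arounds skip digits owned by '#'


def highlight(source):
    """Return (kind, text) pairs in source order for char literals and numbers."""
    spans = []
    for match in CHAR_LITERAL.finditer(source):
        spans.append((match.start(), "char", match.group()))
    for match in NUMBER.finditer(source):
        spans.append((match.start(), "number", match.group()))
    spans.sort()
    return [(kind, text) for _, kind, text in spans]
```

For the input `s := #13#10 + 42;` the guarded pattern only reports `42` as a number, while the naive pattern would also grab the `13` and `10` that belong to the character literals.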

Maybe some people wonder why I didn’t simply put the char highlighting into the comment regexp section. For one reason: it doesn’t belong there. The difference between Comment Regexps and Basic Regexps is the way the parser handles them.

While Comment Regexps usually define comments, they have a major influence on what the parser actually sees. If you match a string starter with a comment regexp, the parser won’t see it and thus no string is started. This is used, for example, with Scilab to properly highlight the language’s transpose operator, which happens to be the aforementioned string starter.
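The hiding effect can be sketched roughly like this in Python. This is a toy scanner, not GeSHi’s parser: the `TRANSPOSE` and `STRING` patterns and the `scan` function are hypothetical, and the sketch assumes the apostrophe serves both as transpose operator and as string delimiter.

```python
import re

# Hypothetical rule: an apostrophe right after an identifier, digit, ')' or ']'
# is treated as a transpose operator rather than a string starter.
TRANSPOSE = re.compile(r"(?<=[\w)\]])'")
STRING = re.compile(r"'[^'\n]*'")


def scan(source):
    """Walk the source once; the comment-style regexp is tried before strings."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TRANSPOSE.match(source, pos)
        kind = "transpose"
        if match is None:
            match = STRING.match(source, pos)
            kind = "string"
        if match is None:
            pos += 1  # plain code, nothing to mark at this position
            continue
        tokens.append((kind, match.group()))
        pos = match.end()  # the string rule never sees what was consumed here
    return tokens
```

With the input `b = a' + 'text'`, the first apostrophe is consumed as a transpose operator, so only `'text'` is picked up as a string. Without the first rule, the string pattern would start at that apostrophe and swallow `' + '` instead.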

On the other hand, the Basic Regexps allow for markup of certain structures like variables in PHP or the character literals in Delphi mentioned above. They don’t need hiding from the parser and thus shouldn’t be hidden.

This is especially important to know because it is also a performance decision: when you highlight code, comment_regexp entries are matched one by one, so highlighting slows down when there are many matches to process. Basic Regexps, in contrast, are matched all at once within their block, which performs better the larger the blocks are. That’s why languages where only few strings and comments are encountered are more likely to highlight at top speed than languages with only small “non-string parts”.

With this in mind it’s quite easy to write a language file that gives good performance.
