BenBE's humble thoughts Thoughts the world doesn't need yet …

12.12.2008

Bug Fixing – The End of the Story

Filed under: GeSHi — Schlagwörter: , — BenBE @ 22:48:05

As I mentioned in an earlier post, that wasn’t the full story yet. So I’ll now keep my promise and will add the rest of the story and the missing informationen.

Initial Problem Description

As mentioned in my earlier article 1.0.7.22 had a bug where you could get GesHi into an infinite loop. Getting a program into an infinite loop in PHP in itself is no problem as PHP has a time limit of usually 30 seconds after which a program is terminated by force. The problem in this case was more that this bug existed in a place where each iteration allocated another piece of memory. And even this is not that of a problem: PHP forcibly terminates a program once you try to allocate more memory than allowed (usually 64 MB).

So where the actual problem steps in, are both issues noted above: First you have a PHP program run at full speed using up all CPU time it can get while allocating as much memory as possible. With the defaults set by PHP this is not in itself bad for the server – in fact, it’s a hickup and the server runs as usual. The problem comes if you induce load or the allowed settings for PHP in respect to execution time and memory usage are much more liberal. I actually noted some systems (amongst others the one I first found this bug) that PHP had sometimes up to 1GB of memory; which made me declare this a severe remote DoS attack.

parse_code() explained

But what was the underlaying problem that caused the initial trouble. To explain this, I probably should go into some detail on how GeSHi basically works and what happens when you actually call parse_code().

Before you can do anything with GeSHi you have to load a language file. This basically is nothing more than some (well-protected) include of some more PHP code that initializes an array and fills some structures for GeSHi to use in later stages. Newer versions of GeSHi (since 1.0.8) not only load this data structure, but also does some preparative work by building regular expressions on-the-fly (funny tech, and best: it even works!) and rearranging some information (e.g. the style information for numbers are reformatted).

When you now set your source GeSHi should highlight, only that code is stored by GeSHi, but still remains there as a simple string. The actual processing starts with calling parse_code().

The first action taken, when parse_code() is called is checking the internal state of GeSHi if all necessary information are present, i.e. is there any source to highlight, has some language been loaded, has no other problem been detected yet. If anything goes wrong here, you only get a html-escaped string of the source you entered, without any further processing – and a message you can query from GeSHi that tells you what happened. After this initial „sanity check“ GeSHi checks its internal caches Milian and me built while optimizing the parser. These caches hold information on keywords (strictly speaking PCRE-compatible regular expressions that match all keywords at once ;-)), regular expressions for matching numbers and some simplified structures where a lot of information needed while parsing are „pre-computed“.

Now off to the tricky part: Actual source parsing. The parsing of the code is done in two phases whereby the first determines what to highlight (Strict Mode Processing) and the second one does the actual highlighting (even though this has been divided a bit more).

Strict Mode processing is done by skimming over the source looking for the possible Strict Mode Delimiters (e.g. < and > for HTML) and if found storing those pieces into an array noting their start offset, length and strict mode group used. This information later on is used to decide if the code should be highlighted. Another trick here is using the same array for highlighted and unhighlighted pieces which increases performance in the second pass a bit. The second pass then uses these information to pick out smaller strings and analyze them for finding comments, strings, keywords, numbers and all the other stuff GeSHi can highlight.

The Problem in Detail

And here comes the problem of 1.0.7.22: Older versions (including 1.0.7.22) padded the source with a leading and a trailing line break character to make the regular expressions easier. A side-effect on this was, that some parser problems in the second pass could be simplified too as end checks could be skipped and centralized at one point in the source. But what does that „Strict Block Detection“ look like? Well, it’s basically nothing more than

        if ($this->strict_mode) {
            // Break the source into bits. Each bit will be a portion of the code
            // within script delimiters - for example, HTML between < and >
            $parts = array(0 => array(0 => '', 1 => ''));
            $k = 0;
            for ($i = 0; $i < $length; ++$i) {
                foreach ($this->language_data['SCRIPT_DELIMITERS'] as $delimiters) {
                    foreach ($delimiters as $open => $close) {
                        // Get the next little bit for this opening string
                        $open_strlen = strlen($open);
                        $check = substr($code, $i, $open_strlen);
                        // If it matches...
                        if ($check == $open) {
                            // We start a new block with the highlightable
                            // code in it
                            ++$k;
                            $parts[$k][0] = $open;
                            $close_i = strpos($code, $close, $i + $open_strlen)  + strlen($close);
                            if ($close_i === false) {
                                $close_i = $length - 1;
                            }
                            $parts[$k][1] = substr($code, $i, $close_i - $i);
                            $i = $close_i - 1;
                            ++$k;
                            $parts[$k][0] = '';
                            $parts[$k][1] = '';

                            // No point going around again...
                            continue 3;
                        }
                    }
                }
                // only non-highlightable text reaches this point
                $parts[$k][1] .= $code[$i];
            }
        } else {
            // Not strict mode - simply dump the source into
            // the array at index 1 (the first highlightable block)
            $parts = array(
                1 => array(
                    0 => '',
                    1 => $code
                )
            );
        }

in the buggy version. And that was the problem 😉 There was some very important check missing: If no ender was found GeSHi basically looped back to where it found the current starter (actually a fixed position in the string), finding it again and again and again … Or well, it wasn’t missing, but it just didn’t work. So what was the culprit? This little part:

                            $close_i = strpos($code, $close, $i + $open_strlen)  + strlen($close);
                            if ($close_i === false) {
                                $close_i = $length - 1;
                            }
                            $parts[$k][1] = substr($code, $i, $close_i - $i);
                            $i = $close_i - 1;

or more exactly the first line of it. If a closing delimiter was found, there’s nothing interesting about this piece of source but once you miss a closing > e.g. in XML at the end of the source this collapses. But let’s do this step by step:
Given the input „<“ this is padded to „\n<\n“ at this stage so the starter is found at string position 1. The $open_strlen is also 1 (as the opener is „<„. But what value is $close_i assigned in that line: If strpos doesn’t find anything, it returns false. False in return is interpreted as 0 when used in integer additions. strlen($close) is 1 (the „>“ closing a XML tag) and thus we get $close_i = false + 1 = 1. Unluckily $i – the next char to search for – was already 1; by assigning it 0 ($close_i – 1) we get exactly to where we left off at the beginning of this iteration. BANG!

And now the easy part (which was so trivial, you will NEVER find it when you look for it:

                            $close_i = strpos($code, $close, $i + $open_strlen);
                            if ($close_i === false) {
                                $close_i = $length - 1;
                            } else {
                                $close_i += strlen($close);
                            }
                            $parts[$k][1] = substr($code, $i, $close_i - $i);
                            $i = $close_i - 1;

Note, how little actually changed: That revisions diff (cf. rev 1322) is actually only 3 lines changed … When about 600 revisions passed you will never remember this 😉

The Secunia Story

Well, some might want to interrupt me now and ask „Didn’t Secunia initially complain about something else?“ Yes. they did! Secunia initially didn’t even mention this problem, but instead reported another issue I closed in 1.0.8.1 – which by the way has been there for years. Well, there was that issue that allowed people to include arbitary PHP source into GeSHi by setting the language file path. And that’s why I declared any software developer insane who would allow an end user to change this path through user input. As Secunia correctly reports, you could ask GeSHi to load its language files from anywhere … even moon://crater/7/rock/3 or if you like http://mal.icio.us/evil.txt?

And now to the feasability of this attack: Most applications leave the defailt autodetect of GeSHi (which NEVER uses user input, some others set it to a constant (without user input) and some few allow well checked configuration strings to be used. So if you know any script that allows the user to set that configuration via browser request, send its autor to the next booby hatch for an intensive check-up. Okay. I summarize: Feasability: Fraction of mad programmers that write scripts with GeSHi times the scripts they wrote with it divided by number of scripts written with GeSHi. I really hope this converges to zero 😉

And regarding the patch: Well, revision 1750, where an additional check for a string without colon was (except on Windows where colon as second char is permitted) added, isn’t all that interesting. It changes behaviour, nothing more. Except for the paranoid mode of this patch for which people oddly glanced at me.

Thanks

And finally I’d like to thank some people. First of all thanks go out to Milian for his great work optimizing and implementing the ideas that lead to this optimized version of GeSHi. Another thanks is hereby sent to Romain Beauxis and Nico Golde who keep the Debian package up-to-date (as good as the Debian rules permit it).

Flattr this!

Keine Kommentare »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress