Possible rework on the GeSHi parser waiting for review

01.09.2010

Filed under: GeSHi | Tags: Development, GeSHi, Patch, PHP | BenBE @ 12:27:04

As announced earlier, here is some external work by Simon Gábor that might get into the GeSHi core, if I can gain enough confidence that the changes work properly and don't cause any major regressions. I currently lack that confidence and could not establish it by reviewing the patch.

To give you an overview, here is what this change is about, as Simon describes it:

If you remember, the problem was that if more than one regex matched a certain line, the first one inserted the marker tags into the text, so the text had changed and the second regex could no longer match it.
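To make the failure concrete, here is a minimal sketch of the effect. It is not taken from the patch: the sample line and its address are made up, and the replacement string only mimics the way the unpatched code wraps a match in marker tags.

```php
<?php
// Hypothetical sample line (HTML-escaped) and a search pattern shared by two rules.
$line = '2010-06-14 Author Name &lt;author@example.org&gt;';
$search = '/^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*?)(&lt;.*?&gt;)/m';

// Rule 0 wraps the date in marker tags right away, preg_replace() style.
$marked = preg_replace($search, '<|!REG3XP0!>\\1|>\\2\\3', $line);

// Rule 1 wants to mark the author name using the same search pattern.
// The line now begins with '<|!REG3XP0!>' rather than a digit, so the
// '^[0-9]{4}' anchor no longer matches.
$matchesBefore = preg_match($search, $line);
$matchesAfter = preg_match($search, $marked);

echo $marked, "\n";
echo 'matches on original text: ', $matchesBefore, "\n";
echo 'matches on marked text:   ', $matchesAfter, "\n";
```

The second rule silently stops matching, which is exactly the kind of breakage that is hard to spot in a language file.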

Since the presence of the marker tags spoiled the regex matching, the tags may be inserted only when all regex matching/replacements are done, but then we have to store the information about which regex matched which portion of the text.

As this cannot be stored in the text itself, I added an array ($keys[]) that is exactly as long as the text, and its elements contain the indices of the regexes that matched the characters of the text. However, it introduced another issue, namely that the text itself can (legally) change during the replacements, so this $keys array shall be changed along with it:

if a backreference (\1..\9) is used in the 'before' or 'after' blocks, not only the given portions of the text shall be copied, but the corresponding portions of $keys as well

all the text generated by the 'replace' block of a regex shall generate elements in the $keys array with the index of the regex
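The two rules above can be sketched roughly as follows. This is a simplified illustration, not the code from the patch: the helper name `expand_template()` and the `text`/`keys` layout of `$groups` are assumptions made for the example.

```php
<?php
// Hypothetical helper: expand a template such as '\\1' against captured groups,
// growing the text and the per-character key list in lockstep.
// Assumed layout: $groups[$n] = array('text' => ..., 'keys' => ...).
function expand_template($template, array $groups, &$text, array &$keys)
{
    $len = strlen($template);
    for ($i = 0; $i < $len; $i++) {
        $isBackref = $template[$i] === '\\'
            && $i + 1 < $len
            && ctype_digit($template[$i + 1]);
        if ($isBackref) {
            $group = $groups[(int) $template[++$i]];
            $text .= $group['text'];                    // copy the captured text
            $keys = array_merge($keys, $group['keys']); // and its keys along with it
        } else {
            $text .= $template[$i]; // literal character from the template
            $keys[] = -1;           // -1 means "not claimed by any regex"
        }
    }
}

// Group 1 was already claimed by regex 0, group 2 is still unclaimed.
$groups = array(
    1 => array('text' => '2010', 'keys' => array(0, 0, 0, 0)),
    2 => array('text' => ' ok',  'keys' => array(-1, -1, -1)),
);

$text = '';
$keys = array();
expand_template('\\1', $groups, $text, $keys); // 'before' block: keys are copied

$start = count($keys);
expand_template('\\2', $groups, $text, $keys); // 'replace' block of regex 1 ...
for ($i = $start; $i < count($keys); $i++) {
    $keys[$i] = 1;                             // ... is stamped with index 1
}

echo $text, "\n";
echo implode(',', $keys), "\n";
```

The invariant to keep in mind is that `$keys` always has exactly one entry per character of `$text`.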

The patch Simon sent me tries to address this issue, BUT I haven't got the hang of it yet and thus decided (for now) not to include it. In addition, this patch has some bad implications that I feared AND which Simon confirms:

Unfortunately, this means that neither 'preg_replace()' nor even 'preg_replace_callback()' can be used for managing the (text, keys) buffer pair, so I had to use 'preg_match()' and do the regex replacements manually, modifying the text and the keys buffers in parallel.
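A heavily simplified sketch of such a manual replacement loop might look like this. It is an illustration under assumptions, not the patch itself: the function name `replace_and_tag()` is made up, the replacement is a plain literal without backreferences, and it splices in place where the patch rebuilds both buffers from slices.

```php
<?php
// Hypothetical simplification: replace every match of $pattern by the literal
// $replacement and tag the new characters with $key, keeping $keys aligned.
function replace_and_tag($pattern, $replacement, $key, &$text, array &$keys)
{
    $offset = 0;
    while ($offset <= strlen($text)
        && preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        list($matched, $pos) = $m[0]; // matched string and its byte offset
        $matchLen = strlen($matched);
        $newLen = strlen($replacement);

        // Apply the same splice to both buffers.
        $text = substr_replace($text, $replacement, $pos, $matchLen);
        $fill = $newLen > 0 ? array_fill(0, $newLen, $key) : array();
        array_splice($keys, $pos, $matchLen, $fill);

        // Continue behind the replacement; after an empty match, also skip one
        // character so the loop is guaranteed to terminate.
        $offset = $pos + $newLen + ($matchLen === 0 ? 1 : 0);
    }
}

$text = 'foo 42 bar 7';
$keys = array_fill(0, strlen($text), -1);
replace_and_tag('/[0-9]+/', '#', 5, $text, $keys);

echo $text, "\n";
echo implode(',', $keys), "\n";
```

Every match costs a string splice plus an array splice, which hints at where the extra time and memory mentioned below come from.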

When all the regex processing is done, finally this $keys buffer must be iterated through, and the proper marker tags must be inserted into the text.
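This final pass could be sketched as below. The marker format `<|!REG3XPn!>` ... `|>` is the one GeSHi already uses; the function name `insert_markers()` and the choice to build a new string (instead of inserting into the existing one) are assumptions made for the example.

```php
<?php
// Hypothetical helper: turn a (text, keys) pair into text with GeSHi-style
// '<|!REG3XPn!>' ... '|>' markers around each run of equal keys other than -1.
function insert_markers($text, array $keys)
{
    $out = '';
    $current = -1; // key of the run we are currently inside, -1 = none
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        if ($keys[$i] !== $current) {
            if ($current !== -1) {
                $out .= '|>'; // close the previous run
            }
            $current = $keys[$i];
            if ($current !== -1) {
                $out .= '<|!REG3XP' . $current . '!>'; // open a new run
            }
        }
        $out .= $text[$i];
    }
    if ($current !== -1) {
        $out .= '|>'; // close a run that reaches the end of the text
    }
    return $out;
}

$text = '2010-06-14 Jo';
$keys = array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 1);
$result = insert_markers($text, $keys);
echo $result, "\n";
```

If no regex claimed any character, all keys are -1 and the text comes back unchanged.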

To sum the problems up:

Complex patch
Might need MUCH memory and thus conflict with the aim of a reasonably low memory footprint
Will slow down processing even if not needed, since the housekeeping has to be done (if opting for a way to enable it per language, duplicated source code WILL result, which is bad too)
Only few language files actually use it (the only one apart from Simon's that might benefit is LaTeX, AFAIK)

As I trust the masses to come up with a solution, I'm releasing this patch, including some more details on its history, so people get a chance to work on it. Thus if anyone has an idea that might help with getting this feature integrated in a swifter way, while not affecting languages that don't require this processing, I'd be really glad to hear about it.

As an example of what can be done with this patch (or rather, what he needed it for), he mentioned highlighting changelog files that look similar to this:

2010-06-14 Author Name <author@email>
One-line summary about the changes

        * file1.c: some changes

        * file2.c, file2.h: some other changes whose
        description wraps to a second line

        * file3.c: yet again some changes
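As a rough illustration of how a header line of such a file can be split into date, author and email, the search pattern from the patch's language file can be tried on its own. The `&lt;`/`&gt;` in the pattern suggest that it runs on HTML-escaped text; that, and the made-up address, are assumptions of this sketch.

```php
<?php
// Search pattern as used for the header rules in the patch's language file.
$pattern = '/^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*?)(\&lt;.*?\&gt;)/m';

// Assumption: the highlighter sees HTML-escaped text, so '<' arrives as '&lt;'.
// The address below is a made-up placeholder.
$line = htmlspecialchars('2010-06-14 Author Name <author@example.org>');

if (preg_match($pattern, $line, $m) === 1) {
    echo 'date:   ', $m[1], "\n";
    echo 'author: ', trim($m[2]), "\n";
    echo 'email:  ', $m[3], "\n";
}
```

The language file repeats this one pattern three times, each time moving a different group into the 'replace' slot, which is why it depends on several regexes being able to match the same line.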

But as this is neither the format that Debian uses nor the one that GeSHi uses, there is little chance of directly applying this language file as is. However, there is one standing issue that affects parts of LaTeX highlighting, so even though the changelog file itself is no convincing reason, having this fix for LaTeX might be a benefit. Additionally, this works around a long-standing problem where regexps break highlighting for some language files, which currently requires you to take care in what you do with regexps.

But enough said for now: Here’s the patch for the mentioned feature. It should apply cleanly to GeSHi releases 1.0.8.8 and 1.0.8.9 as well as GeSHi trunk.

diff -Naur SyntaxHighlight_GeSHi.orig/geshi/geshi/changelog.php SyntaxHighlight_GeSHi/geshi/geshi/changelog.php
--- SyntaxHighlight_GeSHi.orig/geshi/geshi/changelog.php    1970-01-01 01:00:00.000000000 +0100
+++ SyntaxHighlight_GeSHi/geshi/geshi/changelog.php    2010-07-05 10:48:17.000000000 +0200
@@ -0,0 +1,151 @@
+<?php
+/*************************************************************************************
+ * changelog.php
+ * --------
+ * Author: Gabor Simon (gabor.simon2@it-services.hu)
+ * Copyright: (c) 2010 Gabor Simon
+ * Release Version: 1.0.8.X
+ * Date Started: 06/10/2010
+ *
+ * Changelog language file for GeSHi.
+ *
+ * CHANGES
+ * -------
+ * 07/05/2010 (0.0.1)
+ * - Pattern marking now uses the before-replace-after scheme
+ * 06/10/2010 (0.0.0)
+ * - Syntax File Created
+ *
+ *
+ *************************************************************************************
+ *
+ *     This file is part of GeSHi.
+ *
+ *   GeSHi is free software; you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   GeSHi is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *   GNU General Public License for more details.
+ *
+ *   You should have received a copy of the GNU General Public License
+ *   along with GeSHi; if not, write to the Free Software
+ *   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ ************************************************************************************/
+
+$language_data = array (
+    'LANG_NAME' => 'changelog',
+    'CASE_KEYWORDS' => 0,
+    'CASE_SENSITIVE' => array(),
+    'OBJECT_SPLITTERS' => array(),
+    'SCRIPT_DELIMITERS' => array(),
+    'HIGHLIGHT_STRICT_BLOCK' => array(),
+    'COMMENT_SINGLE' => array(),
+    'COMMENT_MULTI' => array(),
+    'CASE_KEYWORD' => array(),
+    'QUOTEMARKS' => array(),
+    'ESCAPE_CHAR' => '',
+    'KEYWORDS' => array(),
+    'SYMBOLS' => array(),
+    'SCRIPT' => array(),
+    'URLS' => array(),
+    'OOLANG' => false,
+    'STRICT_MODE_APPLIES' => GESHI_NEVER,
+    'NUMBERS' => array(),
+    'REGEXPS' => array(
+        # header: "yyyy-mm-dd Author Name <author@email>"
+        0 => array(
+                GESHI_SEARCH => '^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*?)(\&lt;.*?\&gt;)',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '',
+                GESHI_REPLACE => '\\1',    # date
+                GESHI_AFTER => '\\2\\3'
+                ),
+        1 => array(
+                GESHI_SEARCH => '^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*?)(\&lt;.*?\&gt;)',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '\\1',
+                GESHI_REPLACE => '\\2',    # author name
+                GESHI_AFTER => '\\3'
+                ),
+        2 => array(
+                GESHI_SEARCH => '^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*?)(\&lt;.*?\&gt;)',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '\\1\\2',
+                GESHI_REPLACE => '\\3',    # author email
+                GESHI_AFTER => ''
+                ),
+        # summary: "One line that starts with non-space and non-date"
+        3 => array(
+                GESHI_SEARCH => '^((?![0-9]{4}-[0-9]{2}-[0-9]{2})[^\s].*)$',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '',
+                GESHI_REPLACE => '\\1',    # the summary line
+                GESHI_AFTER => ''
+                ),
+        # filehdr: "      * filename: comments"
+        4 => array(
+                GESHI_SEARCH => '^(\s+\*\s*)([^:]*)(:\s*)(.*)$',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '',
+                GESHI_REPLACE => '\\1',    # the '*' sign
+                GESHI_AFTER => '\\2\\3\\4'
+                ),
+        5 => array(
+                GESHI_SEARCH => '^(\s+\*\s*)([^:]*)(:\s*)(.*)$',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '\\1',
+                GESHI_REPLACE => '\\2',    # filename
+                GESHI_AFTER => '\\3\\4'
+                ),
+        6 => array(
+                GESHI_SEARCH => '^(\s+\*\s*)([^:]*)(:\s*)(.*)$',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '\\1\\2',
+                GESHI_REPLACE => '\\3',    # the ':' sign
+                GESHI_AFTER => '\\4'
+                ),
+        7 => array(
+                GESHI_SEARCH => '^(\s+\*\s*)([^:]*)(:\s*)(.*)$',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '\\1\\2\\3',
+                GESHI_REPLACE => '\\4',    # comments
+                GESHI_AFTER => ''
+                ),
+        # comment cont: "     [^*]comments"
+        8 => array(
+                GESHI_SEARCH => '^(\s{2,}[^*\s].*)$',
+                GESHI_MODIFIERS => 'm',
+                GESHI_BEFORE => '',
+                GESHI_REPLACE => '\\1',
+                GESHI_AFTER => ''
+                )
+    ),
+    'STYLES' => array(
+        'KEYWORDS' => array(),
+        'COMMENTS' => array(),
+        'ESCAPE_CHAR' => array(),
+        'BRACKETS' => array(),
+        'SYMBOLS' => array(),
+        'STRINGS' => array(),
+        'NUMBERS' => array(),
+        'METHODS' => array(),
+        'SCRIPT' => array(),
+        'REGEXPS' => array(
+                0 => 'color: #7f0000;',        # date
+                1 => 'color: #bfbf00;',        # author
+                2 => 'color: #bf00bf;',        # email
+                3 => 'color: #00bf7f;',        # summary
+                4 => 'color: #00bf00;',        # *
+                5 => 'color: #0000ff;',        # filenames
+                6 => 'color: #00bf00;',        # :
+                7 => 'color: #000000;',        # comment
+                8 => 'color: #000000;'         # comment cont.
+                )
+        )
+);
+
+?>
diff -Naur SyntaxHighlight_GeSHi.orig/geshi/geshi.php SyntaxHighlight_GeSHi/geshi/geshi.php
--- SyntaxHighlight_GeSHi.orig/geshi/geshi.php    2010-06-29 20:18:53.000000000 +0200
+++ SyntaxHighlight_GeSHi/geshi/geshi.php    2010-07-05 18:51:12.000000000 +0200
@@ -3229,6 +3229,48 @@
     }

     /**
+     * appends a backreference-resolved string to a (string,regex-index) buffer pair
+     *
+     * @param string the string buffer
+     * @param array the index buffer
+     * @param string the string to append
+     * @param array the matches for the backref resolution
+     *
+     * @note the matches array must be of the extended form:
+     * @note matches[][0] is the matched string
+     * @note matches[][1] is the match offset
+     * @note matches[][2] is the index array of the match
+     * @return none
+     * @access private
+     */
+    function rgx_resolve_append(&$newstuff, &$newkeys, $src, $matches) {
+        $srclen = strlen($src);
+        for ($i = 0; $i < $srclen; $i++) {
+        // check if it is an escaped char
+            if ($src[$i] == '\\') {
+            // escaped char, skip backslash anyway
+                $i++;
+                if (($i < $srclen) and ('0' <= $src[$i]) and ($src[$i] <= '9')) {
+            // backreference, parse which submatch is referenced
+                    $m = $src[$i] - '0';
+                    $newstuff .= $matches[$m][0]; # append the text
+                    $newkeys = array_merge($newkeys, $matches[$m][2]); # append the indices
+                } else {
+            // not a backreference, but a backslash-escaped char
+                    $newstuff .= '\\';  // append the backslash
+                    array_push($newkeys, -1); // new text, doesn't belong to any regex
+                    $newstuff .= $src[$i]; // append the char
+                    array_push($newkeys, -1);  // new text again
+                }
+            } else {
+            # normal char
+                $newstuff .= $src[$i]; // append the char
+                array_push($newkeys, -1); // new text again
+            }
+        }
+    }
+
+    /**
      * Takes a string that has no strings or comments in it, and highlights
      * stuff like keywords, numbers and methods.
      *
@@ -3308,41 +3350,127 @@
         }

         // Regular expressions
+        // NOTE: for some reason, there is a space at the start of the content
+        // and that spoils matching '^' on the 1st line, so I replace it by an EOL
+        // and restore it at the end
+        $stuff_firstchar = $stuff_to_parse[0];
+        if ($stuff_firstchar == ' ')
+            $stuff_to_parse[0] = "\n";
+        $keys = array_fill(0, strlen($stuff_to_parse), -1);
         foreach ($this->language_data['REGEXPS'] as $key => $regexp) {
             if ($this->lexic_permissions['REGEXPS'][$key]) {
+
+                $this->_hmr_key = $key;
                 if (is_array($regexp)) {
-                    if ($this->line_numbers != GESHI_NO_LINE_NUMBERS) {
-                        // produce valid HTML when we match multiple lines
-                        $this->_hmr_replace = $regexp[GESHI_REPLACE];
-                        $this->_hmr_before = $regexp[GESHI_BEFORE];
-                        $this->_hmr_key = $key;
-                        $this->_hmr_after = $regexp[GESHI_AFTER];
-                        $stuff_to_parse = preg_replace_callback(
-                            "/" . $regexp[GESHI_SEARCH] . "/{$regexp[GESHI_MODIFIERS]}",
-                            array($this, 'handle_multiline_regexps'),
-                            $stuff_to_parse);
-                        $this->_hmr_replace = false;
-                        $this->_hmr_before = '';
-                        $this->_hmr_after = '';
-                    } else {
-                        $stuff_to_parse = preg_replace(
-                            '/' . $regexp[GESHI_SEARCH] . '/' . $regexp[GESHI_MODIFIERS],
-                            $regexp[GESHI_BEFORE] . '<|!REG3XP'. $key .'!>' . $regexp[GESHI_REPLACE] . '|>' . $regexp[GESHI_AFTER],
-                            $stuff_to_parse);
-                    }
+                    $pattern = "/" . $regexp[GESHI_SEARCH] . "/{$regexp[GESHI_MODIFIERS]}";
+                    $this->_hmr_before = $regexp[GESHI_BEFORE];
+                    $this->_hmr_replace = $regexp[GESHI_REPLACE];
+                    $this->_hmr_after = $regexp[GESHI_AFTER];
                 } else {
-                    if ($this->line_numbers != GESHI_NO_LINE_NUMBERS) {
-                        // produce valid HTML when we match multiple lines
-                        $this->_hmr_key = $key;
-                        $stuff_to_parse = preg_replace_callback( "/(" . $regexp . ")/",
-                                              array($this, 'handle_multiline_regexps'), $stuff_to_parse);
-                        $this->_hmr_key = '';
+                    $pattern = "/(" . $regexp . ")/";
+                    $this->_hmr_before = '';
+                    $this->_hmr_replace = '\\1';
+                    $this->_hmr_after = '';
+                }
+
+                // NOTE: GESHI_NO_LINE_NUMBERS is handled when inserting the tags at the end
+
+                // NOTE: the matched (and hence marked) strings may be required for other
+                // regexes, so we may actually insert the marker tags only after all regex
+                // matching and replacement.
+                // Because of this, for each character we store the number of the regex
+                // that matched it (==key) in a separate array, and only do the tag insertion
+                // at the end.
+                // This means that whenever the string is modified (eg. at replacements),
+                // this key array must be modified as well, so we cannot use preg_replace
+                // but have to do it manually.
+                $offs = 0;      // offset from whence to try matching
+                while (preg_match($pattern, $stuff_to_parse, $matches, PREG_OFFSET_CAPTURE, $offs) == 1) {
+                    // NOTE: PREG_OFFSET_CAPTURE generates a detailed $matches:
+                    //   $matches[$i][0] is the matched string
+                    //   $matches[$i][1] is the position of the match
+                    // and we extend this by
+                    //   $matches[$i][2] is the array of keys for the matched string
+                    for ($i = 0; $i < count($matches); $i++)
+                    {
+                        $matches[$i][2] = array_slice($keys, $matches[$i][1], strlen($matches[$i][0]));
+                    }
+
+                    // NOTE: as we don't want to spoil the offsets in $matches[][1], we cannot
+                    // modify $stuff_to_parse and $keys in-place, but temporary working copies
+                    // ($newstuff, $newkeys) are needed
+
+                    // copy the text that precedes the match
+                    $newstuff = substr($stuff_to_parse, 0, $matches[0][1]);
+                    $newkeys = array_slice($keys, 0, $matches[0][1]);
+
+                    // append the 'before' part
+                    $this->rgx_resolve_append($newstuff, $newkeys, $this->_hmr_before, $matches);
+
+                    // append the 'replace' and mark its part in $newkeys as $key
+                    $kpos = count($newkeys);
+                    $this->rgx_resolve_append($newstuff, $newkeys, $this->_hmr_replace, $matches);
+                    $offs = count($newkeys);        // this is where we start looking for a match next time
+                    if ($offs > $kpos) {
+                        array_splice($newkeys, $kpos, $offs - $kpos, array_fill(0, $offs - $kpos, $key));
                     } else {
-                        $stuff_to_parse = preg_replace( "/(" . $regexp . ")/", "<|!REG3XP$key!>\\1|>", $stuff_to_parse);
+                        error_log("Infinite loop caught;");
+                        break;
                     }
+
+                    // append the 'after'
+                    $this->rgx_resolve_append($newstuff, $newkeys, $this->_hmr_after, $matches);
+
+                    // append the text after the match
+                    $i = $matches[0][1] + strlen($matches[0][0]);
+                    $newstuff .= substr($stuff_to_parse, $i);
+                    $newkeys = array_merge($newkeys, array_slice($keys, $i));
+
+                    // replace the original $stuff and $keys with the temporary ones
+                    $stuff_to_parse = $newstuff;
+                    $keys = $newkeys;
+                }
+                $this->_hmr_key = '';
+                $this->_hmr_before = '';
+                $this->_hmr_replace = false;
+                $this->_hmr_after = '';
+            }
+        }
+        // process $keys and insert the appropriate tags into $stuff_to_parse
+        // NOTE: if $keys is all -1s (when the formatting was done by preg_replace),
+        // this part leaves $stuff_to_parse intact
+        array_push($keys, -1); // trailing 'normal' marker to ensure closing of the last range
+        $k = -1; // currently used key
+        for ($kpos = $spos = 0; $kpos < count($keys); $kpos++, $spos++)
+        {
+                if (($this->line_numbers != GESHI_NO_LINE_NUMBERS) and
+                    isset($stuff_to_parse[$spos]) and ($stuff_to_parse[$spos] == "\n")) {
+                    $keys[$kpos] = -1;
+                    # treat '\n'-s as normal (key==-1), so a closing tag will be inserted
+                    # before and an opening after them if they are part of a coloured
+                    # range -> the line numbering won't affect the colouring
+                }
+                if ($keys[$kpos] != $k)
+                {
+                        if ($k != -1)
+                        {
+                                // insert closing tag
+                                $stuff_to_parse = substr_replace($stuff_to_parse, '|>', $spos, 0);
+                                $spos += 2;
+                        }
+                        $k = $keys[$kpos];
+                        if ($k != -1)
+                        {
+                                // insert opening tag
+                                $repl = '<|!REG3XP'. $k .'!>';
+                                $stuff_to_parse = substr_replace($stuff_to_parse, $repl, $spos, 0);
+                                $spos += strlen($repl);
+                        }
                 }
-            }
         }
+        // Restore the leading ' ' that was replaced by an EOL
+        if ($stuff_to_parse[0] == "\n")
+            $stuff_to_parse[0] = $stuff_firstchar;

         // Highlight numbers. As of 1.0.8 we support different types of numbers
         $numbers_found = false;
