BenBE's humble thoughts Thoughts the world doesn't need yet …

15.06.2009

GeSHi and non-latin charsets

Filed under: GeSHi, posted by BenBE @ 22:34:14

Several people recently asked me how to use GeSHi with non-Latin charsets such as cp1251 or the CJK encodings. GeSHi 1.0.X itself doesn’t care much about charsets, so there is no elegant way to work with them. Limited support for UTF-8 is built in, though, and can be used.

GeSHi 1.0.X basically works by string replacement on your source, using exact matches or more or less broad search patterns. Once charsets come into play, this can become quite annoying.
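To see why byte-wise pattern matching and multibyte charsets don't mix well, here is a small sketch. It is not GeSHi code; Shift_JIS is picked here only as an illustrative CJK encoding, and the sketch assumes the mbstring extension is available. In Shift_JIS the second byte of some characters equals an ASCII character such as the backslash, so a plain search finds a "backslash" that isn't one. In UTF-8 the bytes of a multibyte character are never ASCII, so the false match disappears.

```php
<?php
// Katakana "SO" (U+30BD), written as UTF-8 bytes so the file encoding doesn't matter.
$utf8 = "\xE3\x82\xBD";

// The same character in Shift_JIS is 0x83 0x5C; 0x5C is also the ASCII backslash.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// A byte-wise search for a backslash hits the middle of the Shift_JIS character ...
$posInSjis = strpos($sjis, "\\");

// ... but finds nothing in the UTF-8 version of the same text.
$posInUtf8 = strpos($utf8, "\\");

var_dump(bin2hex($sjis), $posInSjis, $posInUtf8);
```

This is one reason why feeding UTF-8 to a byte-oriented highlighter is the safer choice.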

But let’s first come to the question I received:

Sorry for my bad english and for what i am ask you in this blog, but i must ask you about next:
How i may use GeSHi with „mbstring.func_overload = 7“ parameter in php.ini. May be I use GeSHi bad, but GeSHi is not correctly work with this parameter. What can i do with this problem? I am try do next:

<?
$Highlighter = new GeSHi(mb_convert_encoding($this->_sCode, 'ASCII', 'UTF-8'), $this->_sCodeName);
$Highlighter->enable_line_numbers(GESHI_NORMAL_LINE_NUMBERS);
$Highlighter->set_encoding('ASCII');
return mb_convert_encoding($Highlighter->parse_code(), 'UTF-8', 'ASCII');
?>

GeSHi work correct. But when i replace ‚ASCII‘ with ‚cp1251‘ it can’t work with chars above 127, if I correctly understand GeSHi work. And I need cp1251 encoding.
Help me please. Yours faithfully, Konstantin.

The function mb_convert_encoding can be used to do the task.

Convert the input from your encoding (e.g. cp1251) to UTF-8, let GeSHi process it, and then convert the result back to your charset as desired:

<?php
// Convert the source from cp1251 to UTF-8 before handing it to GeSHi.
$Highlighter = new GeSHi(mb_convert_encoding($this->_sCode, 'UTF-8', 'cp1251'), $this->_sCodeName);
$Highlighter->enable_line_numbers(GESHI_NORMAL_LINE_NUMBERS);
//$Highlighter->set_encoding('UTF-8'); // not needed, UTF-8 is the default
// Convert the highlighted output back to cp1251.
return mb_convert_encoding($Highlighter->parse_code(), 'cp1251', 'UTF-8');
?>
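The same round trip can be wrapped in a small helper. This is only a sketch: `highlight_in_charset` and the stand-in highlighter are hypothetical names made up for illustration and are not part of GeSHi. The stand-in just escapes HTML so the example runs without GeSHi installed; mbstring is assumed to be available.

```php
<?php
// Hypothetical helper (not part of GeSHi): wraps any highlighter that
// expects UTF-8 so callers can work in another charset.
function highlight_in_charset(string $code, string $charset, callable $highlightUtf8): string
{
    // Step 1: convert from the caller's charset to UTF-8.
    $utf8 = mb_convert_encoding($code, 'UTF-8', $charset);
    // Step 2: run the UTF-8 based highlighter.
    $html = $highlightUtf8($utf8);
    // Step 3: convert the highlighted output back to the caller's charset.
    return mb_convert_encoding($html, $charset, 'UTF-8');
}

// Stand-in for the real highlighter, used only to keep the sketch runnable.
$fakeHighlighter = function (string $utf8): string {
    return '<pre>' . htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8') . '</pre>';
};

// The Russian word "Privet" as raw cp1251 bytes.
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";
$result = highlight_in_charset($cp1251, 'Windows-1251', $fakeHighlighter);

echo bin2hex($result), "\n";
```

With GeSHi, the callable would create the `GeSHi` object from the UTF-8 string and return `parse_code()`, as in the snippet above.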

This is necessary because GeSHi currently doesn’t convert the whole input internally when processing it: language files usually contain only ASCII characters, so a full conversion would cost most users time for nothing. The only place where GeSHi already cares about the charset is escape characters, where it needs to know their exact size to avoid splitting multibyte character sequences. This handling has a hardcoded fallback in case mbstring is not available to GeSHi.
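What "splitting a multibyte character sequence" means can be shown with a few lines. This is a generic illustration and not GeSHi's internal code; it assumes mbstring is loaded and `mbstring.func_overload` is not active.

```php
<?php
// The Russian word "Privet" as UTF-8 bytes: 6 characters, 2 bytes each.
$text = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

$byteLength = strlen($text);             // counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // counts characters

// Cutting after 3 bytes ends in the middle of the second character.
$brokenCut = substr($text, 0, 3);
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8');

// Cutting after 3 characters keeps every sequence intact.
$cleanCut = mb_substr($text, 0, 3, 'UTF-8');
$cleanIsValid = mb_check_encoding($cleanCut, 'UTF-8');

var_dump($byteLength, $charLength, $brokenIsValid, $cleanIsValid);
```

Any code that steps over a character therefore has to know how many bytes that character occupies in the current encoding.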

I hope this helps with the problem.
