# [texworks] Improving Syntax Highlighting

Charlie Sharpsteen chuck at sharpsteen.net
Tue May 8 17:34:40 CEST 2012

On Tue, May 8, 2012 at 7:48 AM, Chris Jefferson <chris at bubblescope.net> wrote:

> On 08/05/12 11:42, Stefan Löffler wrote:
>
>> Hi,
>>
>> On 2012-05-06 12:50, Chris Jefferson wrote:
>>
>>> This implies a number of limitations. The big one is no multi-line
>>> user regular expressions, sorry. Specific multi-line things can be
>>> custom written in C++ obviously.
>>>
>>> The things I would most like, in order of preference, are:
>>>
>>> 1) Matching of maths ( both  and  ).
>>> 2) Ability to highlight specific \begin{x} ... \end{x} sections.
>>> 3) Highlighting of parts of regular expressions (for example, in
>>> \textbf{XYZ}, make the XYZ bold).
>>>
>> What would be nice here would be some form of delimiter matching. E.g.,
>> correctly match something like \section{A {B} C}. This doesn't work with
>> reg-exps alone, but I recently found that Gtk-source-view
>> (http://projects.gnome.org/gtksourceview/documentation.html)
>> can do it.
>> As I understand it, it includes the possibility to give two regular
>> expressions: one for the beginning, and one for the end of the
>> to-be-matched string. Since I guess something like that will be needed
>> for \begin/\end section matching anyway, I thought I'd mention this.
>> To that end, I guess we should think about supporting some more
>> sophisticated configuration files in the long run (e.g., XML based).
>>
>
> this.
>
> Perhaps rather than regular expressions, some kind of latex-aware
> tokeniser might be a better approach.
>
> For example, given something like:
>
> I like \textbf{Lots of $x$ and $y$ and \textit{z} }
>
> This would be tokenised into (note: I would go and look what proper latex
> tokenisation looks like!)
>
> 'I' 'like' '\textbf' '{' 'Lots' 'of' '$' 'x' '$' 'and' '$' 'y' '$' 'and'
> '\textit' '{' 'z' '}' '}'
>
> Then make a stack of the current state, and as we scan along we 'push' and
> 'pop' things on and off this stack. That would handle nested expressions
> nicely, and would (I believe) make things like not highlighting inside a
> verbatim easier.
>
> In this mode, rather than giving a regular expression, you would state how
> you wanted (for example) inside a textbf, or inside math mode, or inside a
> tabular, to be formatted. You could also state how classes of tokens
> (numbers, {}, \commands) were coloured.
>
> The biggest problem with this is that it would be totally different to
> what came before, and would be very latex-dependent. I (for example) don't
> know what is up in the world of luatex, and other tex variants.
>
> I might have a play with this, and see what it looks like and how the code
> looks.
>

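To make the quoted push/pop idea concrete, here is a rough sketch in Python.
It is not TeXworks code: the names `tokenise` and `annotate` are made up for
illustration, and the regular expression is only a crude stand-in for real
TeX tokenisation.

```python
import re

# Hypothetical token pattern: control words, control symbols,
# braces, the math-shift character, and runs of ordinary text.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|[{}$]|[^\\{}$\s]+")


def tokenise(text):
    """Split a line into a flat list of token strings (whitespace is dropped)."""
    return TOKEN_RE.findall(text)


def annotate(tokens):
    """Return (token, context) pairs, where context is a tuple of open scopes."""
    stack = []      # open scopes: a command name, "{" for a bare group, or "math"
    pending = None  # most recent command, waiting for its opening brace
    out = []
    for tok in tokens:
        if tok == "{":
            # The group belongs to the command just before it, if any.
            stack.append(pending or "{")
            pending = None
            out.append((tok, tuple(stack)))
        elif tok == "}":
            out.append((tok, tuple(stack)))
            if stack and stack[-1] != "math":
                stack.pop()
        elif tok == "$":
            if stack and stack[-1] == "math":
                out.append((tok, tuple(stack)))
                stack.pop()      # closing math shift
            else:
                stack.append("math")  # opening math shift
                out.append((tok, tuple(stack)))
            pending = None
        else:
            pending = tok if tok.startswith("\\") else None
            out.append((tok, tuple(stack)))
    return out


if __name__ == "__main__":
    sample = r"I like \textbf{Lots of $x$ and $y$ and \textit{z} }"
    for token, context in annotate(tokenise(sample)):
        print(token, context)
```

A highlighter could then choose a format from the context tuple, for example
one colour for anything whose context contains "math" and bold for anything
inside `\textbf`.
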
Sounds like a good approach. However, I would suggest focusing on looking
for existing syntax highlighters to study, such as
AUCTeX <http://www.gnu.org/software/auctex> and
Pygments <https://bitbucket.org/birkenfeld/pygments-main/src/7925d53cb09d/pygments/lexers/text.py#cl-399>,
rather than spending a lot of time searching for a "proper latex
tokenisation" as you probably won't find it. This is because LaTeX is a
context-sensitive language so the proper tokenisation can change on the fly
in the middle of a document. For example, you can completely twist
everything around and still have a valid LaTeX document:

\catcode`\~=0
\catcode`\]=1
\catcode`\[=2
\catcode`\}=12
\catcode`\{=12
\catcode`\\=12
~textbf]This is bold-faced text.[  This text contains a literal backslash
(~texttt]\[) and literal curly braces (~texttt]{}[).

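To illustrate why the category codes matter, here is a small sketch of a
tokeniser driven by a catcode table. It assumes Python and uses hypothetical
names (`default_catcodes`, `tokenise_with_catcodes`); it models only a handful
of category codes and takes the table as an argument instead of interpreting
\catcode assignments itself.

```python
# Subset of TeX category codes used in this sketch.
ESCAPE, BEGIN_GROUP, END_GROUP, LETTER, OTHER = 0, 1, 2, 11, 12


def default_catcodes():
    """LaTeX-like defaults for the few characters this sketch cares about."""
    return {"\\": ESCAPE, "{": BEGIN_GROUP, "}": END_GROUP}


def catcode(table, ch):
    """Look up a character's category; unlisted characters are letter or other."""
    if ch in table:
        return table[ch]
    return LETTER if ch.isalpha() else OTHER


def tokenise_with_catcodes(text, table):
    """Split text into (kind, value) pairs according to the given catcode table."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        code = catcode(table, ch)
        if code == ESCAPE:
            j = i + 1
            # A control word is a run of letters after the escape character.
            while j < len(text) and catcode(table, text[j]) == LETTER:
                j += 1
            # Otherwise it is a control symbol made of one non-letter character.
            if j == i + 1 and j < len(text):
                j += 1
            tokens.append(("command", text[i + 1:j]))
            i = j
        elif code == BEGIN_GROUP:
            tokens.append(("begin", ch))
            i += 1
        elif code == END_GROUP:
            tokens.append(("end", ch))
            i += 1
        else:
            tokens.append(("char", ch))
            i += 1
    return tokens


# The table that the \catcode lines above would produce.
twisted = {"~": ESCAPE, "]": BEGIN_GROUP, "[": END_GROUP,
           "}": OTHER, "{": OTHER, "\\": OTHER}
```

With `default_catcodes()` the input `\textbf{a}` yields a command followed by a
group, while with `twisted` the same structure comes from `~textbf]a[`, and the
backslash and braces fall through as ordinary characters.
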
So, any tokeniser will always be an approximation---find the best approach