# [texworks] Improving Syntax Highlighting

Reinhard Kotucha reinhard.kotucha at web.de
Thu May 10 01:42:54 CEST 2012

On 2012-05-08 at 08:34:40 -0700, Charlie Sharpsteen wrote:

> On Tue, May 8, 2012 at 7:48 AM, Chris Jefferson <chris at bubblescope.net> wrote:
>
> > On 08/05/12 11:42, Stefan Löffler wrote:
> >
> >> Hi,
> >>
> >> On 2012-05-06 12:50, Chris Jefferson wrote:
> >>
> >>> This implies a number of limitations. The big one is no
> >>> multi-line user regular expressions, sorry. Specific multi-line
> >>> things can be custom written in C++ obviously.
> >>>
> >>> The things I would most like, in order of preference, are:
> >>>
> >>> 1) Matching of maths ( both  and  ).
> >>> 2) Ability to highlight specific \begin{x} ... \end{x} sections.
> >>> 3) Highlighting of parts of regular expressions (for example, in
> >>>    \textbf{XYZ}, make the XYZ bold).
> >>>
> >> What would be nice here would be some form of delimiter
> >> matching. E.g., correctly match something like \section{A {B}
> >> C}. This doesn't work with reg-exps alone, but I recently found
> >> that Gtk-source-view
> >> (http://projects.gnome.org/gtksourceview/documentation.html)
> >> can do it.  As I understand it, it includes the possibility to
> >> give two regular expressions: one for the beginning, and one for
> >> the end of the to-be-matched string. Since I guess something
> >> like that will be needed for \begin/\end section matching
> >> anyway, I thought I'd mention this.  To that end, I guess we
> >> should think about supporting some more sophisticated
> >> configuration files in the long run (e.g., XML based).
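A minimal sketch of the delimiter matching mentioned above, in Python rather than TeXworks' C++, just to illustrate why a depth counter succeeds where a single regular expression does not. The helper name `match_braces` is hypothetical, and the handling of escaped braces is an assumption about what a highlighter would want; it is not how GtkSourceView implements its start/end contexts.

```python
def match_braces(text, start):
    """Return the index of the '}' that closes the '{' at text[start], or -1."""
    if text[start] != "{":
        raise ValueError("start must point at an opening brace")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes (e.g. \{ or \}).
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1  # unbalanced: no closing brace found


source = r"\section{A {B} C} trailing text"
open_at = source.index("{")
close_at = match_braces(source, open_at)
argument = source[open_at + 1:close_at]
```

Here `argument` holds the whole section title including the inner group, which is the span a highlighter would want to format.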
> >>
> >
this.
> >
> > Perhaps rather than regular expressions, some kind of latex-aware
> > tokeniser might be a better approach.
> >
> > For example, given something like:
> >
> > I like \textbf{Lots of $x$ and $y$ and \textit{z} }
> >
> > This would be tokenised into (note: I would go and look what
> > proper latex tokenisation looks like!)
> >
> > 'I' 'like' '\textbf' '{' 'Lots' 'of' '$' 'x' '$' 'and' '$' 'y' '$' 'and'
> > '\textit' '{' 'z' '}' '}'
> >
> > Then make a stack of the current state, and as we scan along we
> > 'push' and 'pop' things on and off this stack. That would handle
> > nested expressions nicely, and would (I believe) make things like
> > not highlighting inside a verbatim easier.
> >
> > In this mode, rather than giving a regular expression, you would
> > state how you wanted (for example) inside a textbf, or inside
> > math mode, or inside a tabular, to be formatted. You could also
> > state how classes of tokens (numbers, {}, \commands) were
> > coloured.
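The tokenise-then-stack idea above could be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not a proposal for the actual TeXworks code: the token pattern, the function names `tokenise` and `annotate`, and the state name `math` are all made up for the example, and catcode changes, comments and `$$` are ignored.

```python
import re

# Control words, control symbols, single delimiters, then runs of other text.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|[{}$]|[^\\{}$\s]+")


def tokenise(text):
    """Split LaTeX-like text into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def annotate(tokens):
    """Return (token, context) pairs; context is a tuple of enclosing states."""
    stack = []
    pending = None  # most recent command, waiting for its opening brace
    result = []
    for tok in tokens:
        if tok == "{":
            # A brace right after a command opens that command's argument.
            stack.append(pending or "group")
            pending = None
            result.append((tok, tuple(stack)))
        elif tok == "}":
            result.append((tok, tuple(stack)))
            if stack:
                stack.pop()
        elif tok == "$":
            if stack and stack[-1] == "math":
                result.append((tok, tuple(stack)))
                stack.pop()
            else:
                stack.append("math")
                result.append((tok, tuple(stack)))
            pending = None
        elif tok.startswith("\\"):
            pending = tok[1:]
            result.append((tok, tuple(stack)))
        else:
            pending = None
            result.append((tok, tuple(stack)))
    return result


sample = r"I like \textbf{Lots of $x$ and $y$ and \textit{z} }"
tokens = tokenise(sample)
annotated = annotate(tokens)
contexts = {tok: ctx for tok, ctx in annotated}
```

A formatting rule would then be keyed on the context rather than on a regular expression: for instance, anything whose context contains `textbf` is drawn bold, and anything whose innermost state is `math` gets the maths colour.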
> >
> > The biggest problem with this is that it would be totally
> > different to what came before, and would be very
> > latex-dependent. I (for example) don't know what is up in the
> > world of luatex, and other tex variants.
> >
> > I might have a play with this, and see what it looks like and how
> > the code looks.
> >
>
> Sounds like a good approach. However, I would suggest focusing on
> looking for existing syntax highlighters to study, such as
> AUCTeX <http://www.gnu.org/software/auctex> and
> Pygments <https://bitbucket.org/birkenfeld/pygments-main/src/7925d53cb09d/pygments/lexers/text.py#cl-399>,
> rather than spending a lot of time searching for a "proper latex
> tokenisation" as you probably won't find it. This is because LaTeX
> is a context-sensitive language so the proper tokenisation can
> change on the fly in the middle of a document. For example, you can
> completely twist everything around and still have a valid LaTeX
> document:
>
> \catcode`\~=0
> \catcode`\]=1
> \catcode`\[=2
> \catcode`\}=12
> \catcode`\{=12
> \catcode`\\=12
> ~textbf]This is bold-faced text.[  This text contains a literal backslash
> (~texttt]\[) and literal curly braces (~texttt]{}[).
>
>
> So, any tokeniser will always be an approximation---find the best

Well, LaTeX has a well-defined syntax and such \catcode changes are
quite rare in user documents.  There are a few exceptions though,
especially if verbatim stuff is involved.

However, if some macro arguments and environments can be treated as
exceptions, I believe that parsing a LaTeX file isn't much more
difficult than parsing an HTML file.
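To illustrate what treating an environment as an exception could look like, here is a small Python sketch that cuts verbatim blocks out of the text before any further tokenising, much as an HTML parser sets aside the contents of a script element. The function name `split_verbatim` is hypothetical, only the plain `verbatim` environment is handled, and `\verb` or packages that define their own verbatim-like environments would need additional rules.

```python
import re

# Non-greedy match so that each verbatim block ends at its own \end.
VERBATIM_RE = re.compile(r"\\begin\{verbatim\}.*?\\end\{verbatim\}", re.DOTALL)


def split_verbatim(text):
    """Return (kind, chunk) pairs, where kind is 'verbatim' or 'normal'."""
    segments = []
    pos = 0
    for match in VERBATIM_RE.finditer(text):
        if match.start() > pos:
            segments.append(("normal", text[pos:match.start()]))
        segments.append(("verbatim", match.group(0)))
        pos = match.end()
    if pos < len(text):
        segments.append(("normal", text[pos:]))
    return segments


doc = (
    "Some \\emph{text}\n"
    "\\begin{verbatim}\n"
    "\\textbf{not a command}\n"
    "\\end{verbatim}\n"
    "More text\n"
)
segments = split_verbatim(doc)
kinds = [kind for kind, _ in segments]
```

Only the `normal` chunks would be handed on to the tokeniser; the `verbatim` chunk is coloured as a single unit, so the `\textbf` inside it is never mistaken for a command.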

Chris, if you are willing to investigate, please proceed.

Regards,
Reinhard

