# [texworks] Improving Syntax Highlighting

Reinhard Kotucha reinhard.kotucha at web.de
Thu May 10 01:42:54 CEST 2012

On 2012-05-08 at 08:34:40 -0700, Charlie Sharpsteen wrote:

> > On Tue, May 8, 2012 at 7:48 AM, Chris Jefferson <chris at bubblescope.net> wrote:
>
> > On 08/05/12 11:42, Stefan Löffler wrote:
> >
> >> Hi,
> >>
> >> On 2012-05-06 12:50, Chris Jefferson wrote:
> >>
> >>> This implies a number of limitations. The big one is no
> >>> multi-line user regular expressions, sorry. Specific multi-line
> >>> things can be custom written in C++, obviously.
> >>>
> >>> The things I would most like, in order of preference, are:
> >>>
> >>> 1) Matching of maths (both inline and display).
> >>> 2) Ability to highlight specific \begin{x} ... \end{x} sections.
> >>> 3) Highlighting of parts of regular expression matches (for
> >>>    example, in \textbf{XYZ}, make the XYZ bold).
> >>>
> >> What would be nice here would be some form of delimiter
> >> matching. E.g., correctly match something like \section{A {B}
> >> C}. This doesn't work with reg-exps alone, but I recently found
> >> that Gtk-source-view
> >> (http://projects.gnome.org/gtksourceview/documentation.html)
> >> can do it.  As I understand it, it includes the possibility to
> >> give two regular expressions: one for the beginning, and one for
> >> the end of the to-be-matched string. Since I guess something
> >> like that will be needed for \begin/\end section matching
> >> anyway, I thought I'd mention this.  To that end, I guess we
> >> should think about supporting some more sophisticated
> >> configuration files in the long run (e.g., XML based).
> >>
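[A sketch of the delimiter-matching idea above: a regular expression alone cannot pair the outer braces in \section{A {B} C}, but a regex for the opening delimiter plus a simple brace counter can. The pattern and function name below are hypothetical illustrations, not anything TeXworks or Gtk-source-view actually provides.]

```python
import re

# Hypothetical pattern: a control word followed by an opening brace.
CMD = re.compile(r'\\([a-zA-Z]+)\s*\{')

def command_argument(source, pos=0):
    """Find the next \\command{...} and return (name, argument),
    counting braces so nested groups like {B} stay inside."""
    m = CMD.search(source, pos)
    if not m:
        return None
    depth, i = 1, m.end()
    while i < len(source) and depth:
        if source[i] == '{':
            depth += 1
        elif source[i] == '}':
            depth -= 1
        i += 1
    # i is one past the brace that closed the argument.
    return m.group(1), source[m.end():i - 1]
```

[For \section{A {B} C} this returns ('section', 'A {B} C'), keeping the nested {B} inside the argument.]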
> >
> > this.
> >
> > Perhaps rather than regular expressions, some kind of latex-aware
> > tokeniser might be a better approach.
> >
> > For example, given something like:
> >
> > I like \textbf{Lots of $x$ and $y$ and \textit{z} }
> >
> > This would be tokenised into (note: I would go and look what
> > proper latex tokenisation looks like!)
> >
> > 'I' 'like' '\textbf' '{' 'Lots' 'of' '$' 'x' '$' 'and' '$' 'y' '$' 'and'
> > '\textit' '{' 'z' '}' '}'
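[The tokenisation sketched above can be approximated with a single regular expression. This is a rough illustration of my own (the pattern knows only control words, braces, math shifts, and words, and ignores catcodes entirely), but it reproduces the token list shown:]

```python
import re

# A rough token pattern: control sequences, braces, math shifts, words.
TOKEN = re.compile(r'\\[a-zA-Z]+|[{}$]|[^\s\\{}$]+')

def tokenise(source):
    """Split LaTeX-ish source into a flat token list (no catcode handling)."""
    return TOKEN.findall(source)
```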
> >
> > Then make a stack of the current state, and as we scan along we
> > 'push' and 'pop' things on and off this stack. That would handle
> > nested expressions nicely, and would (I believe) make things like
> > not highlighting inside a verbatim easier.
> >
> > In this mode, rather than giving a regular expression, you would
> > state how you wanted (for example) inside a textbf, or inside
> > math mode, or inside a tabular, to be formatted. You could also
> > state how classes of tokens (numbers, {}, \commands) were
> > coloured.
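[A minimal sketch of that push/pop idea, assuming a hypothetical style table (STYLE_OF maps command names to made-up style labels) and treating '$' as a toggle for a math state:]

```python
import re

TOKEN = re.compile(r'\\[a-zA-Z]+|[{}$]|[^\s\\{}$]+')

# Hypothetical style names; a real editor would map these to colours/fonts.
STYLE_OF = {'\\textbf': 'bold', '\\textit': 'italic'}

def highlight(source):
    """Return (token, style) pairs, tracking nesting with a stack:
    a styling command pushes its style at the next '{', '}' pops,
    and '$' toggles a math style on and off."""
    stack, pending, out = ['plain'], None, []
    for tok in TOKEN.findall(source):
        if tok in STYLE_OF:
            pending = STYLE_OF[tok]
            out.append((tok, stack[-1]))
        elif tok == '{':
            stack.append(pending or stack[-1])
            pending = None
            out.append((tok, stack[-1]))
        elif tok == '}':
            out.append((tok, stack[-1]))
            if len(stack) > 1:
                stack.pop()
        elif tok == '$':
            if stack[-1] == 'math':
                out.append((tok, 'math'))
                stack.pop()
            else:
                stack.append('math')
                out.append((tok, 'math'))
        else:
            out.append((tok, stack[-1]))
    return out
```

[The same stack could carry a 'verbatim' state that suppresses all other rules, which is what makes the verbatim case easier than with flat regexes.]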
> >
> > The biggest problem with this is that it would be totally
> > different to what came before, and would be very
> > latex-dependent. I (for example) don't know what is up in the
> > world of luatex, and other tex variants.
> >
> > I might have a play with this, and see what it looks like and how
> > the code looks.
> >
>
> Sounds like a good approach. However, I would suggest focusing on
> looking for existing syntax highlighters to study, such as
> AUCTeX<http://www.gnu.org/software/auctex> and
> Pygments<https://bitbucket.org/birkenfeld/pygments-main/src/7925d53cb09d/pygments/lexers/text.py#cl-399>,
> rather than spending a lot of time searching for a "proper latex
> tokenisation" as you probably won't find it. This is because LaTeX
> is a context-sensitive language so the proper tokenisation can
> change on the fly in the middle of a document. For example, you can
> completely twist everything around and still have a valid LaTeX
> document:
>
> \catcode`\~=0
> \catcode`\]=1
> \catcode`\[=2
> \catcode`\}=12
> \catcode`\{=12
> \catcode`\\=12
> ~textbf]This is bold-faced text.[  This text contains a literal backslash
> (~texttt]\[) and literal curly braces (~texttt]{}[).
>
>
> So, any tokeniser will always be an approximation---find the best
> approximation you can.
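[To illustrate the approximation point: any fixed tokeniser hard-codes '\' as the escape character, so running a naive pattern over the catcode-twisted line above finds no control sequences at all, even though TeX now reads ~textbf as \textbf:]

```python
import re

# The same kind of naive pattern any regex-based highlighter uses;
# it knows nothing about \catcode reassignments.
TOKEN = re.compile(r'\\[a-zA-Z]+|[{}$]|[^\s\\{}$]+')

twisted = '~textbf]This is bold-faced text.['
tokens = TOKEN.findall(twisted)
# '~textbf]This' is just a word to this tokeniser; no token is
# recognised as a control sequence.
```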

Well, LaTeX has a well-defined syntax and such \catcode changes are
quite rare in user documents.  There are a few exceptions though,
especially if verbatim stuff is involved.

However, if some macro arguments and environments can be treated as
exceptions, I believe that parsing a LaTeX file isn't much more
difficult than parsing an HTML file.
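A sketch of the HTML analogy, applied to environments: pair \begin{x} with \end{x} using a stack, much as one pairs HTML open and close tags. The names below are my own, not an existing API:

```python
import re

ENV = re.compile(r'\\(begin|end)\{([^}]*)\}')

def environment_spans(source):
    """Return (name, start, end) spans for matched \\begin/\\end pairs,
    using a stack so nested environments pair up correctly.
    Mismatched \\end commands are silently skipped in this sketch."""
    stack, spans = [], []
    for m in ENV.finditer(source):
        kind, name = m.group(1), m.group(2)
        if kind == 'begin':
            stack.append((name, m.start()))
        elif stack and stack[-1][0] == name:
            _, start = stack.pop()
            spans.append((name, start, m.end()))
    return spans
```

Inner environments are reported first, so a highlighter could style the verbatim span before (and instead of) the rules for the enclosing environment.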

Chris, if you are willing to investigate, please proceed.

Regards,
Reinhard

--
----------------------------------------------------------------------------
Reinhard Kotucha                                      Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover                              mailto:reinhard.kotucha at web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------