On Tue, May 8, 2012 at 7:48 AM, Chris Jefferson <span dir="ltr">&lt;<a href="mailto:chris@bubblescope.net" target="_blank">chris@bubblescope.net</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On 08/05/12 11:42, Stefan Löffler wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
Hi,<br>
<br>
On 2012-05-06 12:50, Chris Jefferson wrote:<br>
<br>
</div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
This implies a number of limitations. The big one is no multi-line<br>
user regular expressions, sorry. Specific multi-line things can be<br>
custom written in C++ obviously.<br>
<br>
The things I would most like, in order of preference, are:<br>
<br>
1) Matching of maths ( both $ $ and \( \) ).<br>
2) Ability to highlight specific \begin{x} ... \end{x} sections.<br>
3) Highlighting of parts of regular expressions (for example, in<br>
\textbf{XYZ}, make the XYZ bold).<br>
</blockquote>
What would be nice here would be some form of delimiter matching. E.g.,<br>
correctly match something like \section{A {B} C}. This doesn't work with<br>
reg-exps alone, but I recently found that Gtk-source-view<br>
(<a href="http://projects.gnome.org/gtksourceview/documentation.html" target="_blank">http://projects.gnome.org/gtksourceview/documentation.html</a>) can do it.<br>
As I understand it, it includes the possibility to give two regular<br>
expressions: one for the beginning, and one for the end of the<br>
to-be-matched string. Since I guess something like that will be needed<br>
for \begin/\end section matching anyway, I thought I'd mention this.<br>
To that end, I guess we should think about supporting some more<br>
sophisticated configuration files in the long run (e.g., XML based).<br>
</div></blockquote>
<br>
I am thinking about your other comments, but one specific thought about this.<br>
<br>
Perhaps rather than regular expressions, some kind of latex-aware tokeniser might be a better approach.<br>
<br>
For example, given something like:<br>
<br>
I like \textbf{Lots of $x$ and $y$ and \textit{z} }<br>
<br>
This would be tokenised into (note: I would go and look up what proper latex tokenisation looks like!)<br>
<br>
'I' 'like' '\textbf' '{' 'Lots' 'of' '$' 'x' '$' 'and' '$' 'y' '$' 'and' '\textit' '{' 'z' '}' '}'<br>
<br>
Then make a stack of the current state, and as we scan along we 'push' and 'pop' things on and off this stack. That would handle nested expressions nicely, and would (I believe) make things like not highlighting inside a verbatim easier.<br>
<br>
In this mode, rather than giving a regular expression, you would state how you wanted (for example) inside a textbf, or inside math mode, or inside a tabular, to be formatted. You could also state how classes of tokens (numbers, {}, \commands) were coloured.<br>
<br>
The biggest problem with this is that it would be totally different to what came before, and would be very latex-dependent. I (for example) don't know what is up in the world of luatex, and other tex variants.<br>
<br>
I might have a play with this, and see what it looks like and how the code looks.<br>
</blockquote></div><br><div>Sounds like a good approach. However, I would suggest studying existing syntax highlighters, such as <a href="http://www.gnu.org/software/auctex">AUCTeX</a> and <a href="https://bitbucket.org/birkenfeld/pygments-main/src/7925d53cb09d/pygments/lexers/text.py#cl-399">Pygments</a>, rather than spending a lot of time searching for a "proper latex tokenisation", as you probably won't find one. LaTeX is context-sensitive: category codes can be reassigned with \catcode at any point, so the proper tokenisation can change on the fly in the middle of a document. For example, you can completely twist everything around and still have a valid LaTeX document:</div>
<div><br></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px">\catcode`\~=0<br>\catcode`\]=1<br>\catcode`\[=2<br>\catcode`\}=12<br>\catcode`\{=12<br>\catcode`\\=12<br>~textbf]This is bold-faced text.[ This text contains a literal backslash<br>
(~texttt]\[) and literal curly braces (~texttt]{}[).<br></blockquote><br><div>So, any tokeniser will always be an approximation---find the best approach and adapt it.</div><div><br></div><div>-Charlie</div>
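<div><br></div><div>P.S. A rough sketch of the tokenise-then-stack idea, in Python for brevity rather than the C++ it would really be written in. Everything named here (tokenise, annotate, the two tables) is made up for illustration, and the tables are a drastically simplified stand-in for TeX's category codes: they only know about an escape character, group delimiters and math shift. The point is just that the tokeniser takes its table as a parameter, so an approximation can at least be swapped or adapted.</div>

```python
# Hypothetical, heavily simplified "category code" tables: character -> role.
# Real TeX has 16 category codes; only four roles are modelled here.
DEFAULT_CATCODES = {"\\": "escape", "{": "begin", "}": "end", "$": "math"}
# The "twisted" assignment from the example above: ~ ] [ take over the roles,
# while \ { } are absent from the table and so count as ordinary characters.
TWISTED_CATCODES = {"~": "escape", "]": "begin", "[": "end", "$": "math"}


def tokenise(text, catcodes=DEFAULT_CATCODES):
    """Split text into (kind, text) pairs: command, begin, end, math, word."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        role = catcodes.get(ch)
        if ch.isspace():
            i += 1
        elif role == "escape":
            j = i + 1
            while j < n and text[j].isalpha():
                j += 1
            if j == i + 1 and j < n:
                j += 1  # control symbol: escape plus one non-letter
            tokens.append(("command", text[i:j]))
            i = j
        elif role in ("begin", "end", "math"):
            tokens.append((role, ch))
            i += 1
        else:
            # Ordinary run: stop at whitespace or any character with a role.
            j = i
            while j < n and not text[j].isspace() and catcodes.get(text[j]) is None:
                j += 1
            tokens.append(("word", text[i:j]))
            i = j
    return tokens


def annotate(tokens):
    """Pair each token's text with the tuple of contexts that enclose it."""
    stack = []      # open contexts, innermost last
    pending = None  # command seen immediately before, waiting for its group
    out = []
    for kind, text in tokens:
        closes_math = kind == "math" and stack and stack[-1] == "math"
        if (kind == "end" and stack) or closes_math:
            stack.pop()                       # closing delimiter: pop first,
            out.append((text, tuple(stack)))  # so it belongs to the outer context
        else:
            out.append((text, tuple(stack)))
            if kind == "begin":
                stack.append(pending or "group")
            elif kind == "math":
                stack.append("math")
        pending = text if kind == "command" else None
    return out


if __name__ == "__main__":
    sample = r"I like \textbf{Lots of $x$ and $y$ and \textit{z} }"
    for text, context in annotate(tokenise(sample)):
        print(text, context)
```

<div>A highlighter would then pick a format from the context tuple (for instance, anything with "math" on the stack) instead of from a regular expression. Unbalanced closing delimiters are simply ignored here, which is one of many places where a real implementation would need to decide what to do.</div>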