[tug-summer-of-code] Project ideas

Sun Mar 8 11:01:26 CET 2009

Hi,

Scott Pakin wrote:
> Karl Berry wrote:
>>     I also have an idea to propose. In short, the idea is parallel
>>     version of TeX. My plan is to use OpenMP for parallelization. It
>>     should make TeX run faster without much more complication in the
>>     code base. Please give me some thoughts.
>>
>> The basic TeX code is monolithic and relies almost entirely on global
>> variables.  There have been some discussions of parallelizing certain
>> small parts of the code, such as the output routine, but even this is
>> fraught with extreme difficulty.  I am not sure what can be usefully
>> parallelized.
> 
> While it's true that TeX is monolithic and relies almost entirely on
> global variables, the real problem for parallelism as I see it is
> that TeX implements an extremely sequential state machine.  Each
> character must be read and processed before proceeding with the next
> character.  This is because earlier processing can affect later
> processing (cf.  \catcode).  One can't even process pages in
> parallel because it is not known in advance where the page breaks
> will wind up.

After a lot of talk and discussion with lots of people, I have
collected some ideas where parallelization seems possible,
at least in theory (I am talking about pdftex|luatex. None of
this applies to tex82, and I don't know enough about xetex).

A list of ideas, more or less in descending order of potential
gain and increasing order of implementation complexity:

* The backend can run in a separate thread. This is the big one,
I estimate that up to a 20% speedup is possible. All other
points below will score lower than this except for specially
crafted documents (and a lot lower, like maybe 2-3% gain in
'good' cases).

* Some types of images require conversion or extensive parsing
(like inclusion of pdf pages). These could be prefetched by a
separate thread.

* Font subsetting at the end of the run could happen in parallel
for multiple fonts.

* Processing can usually continue after a \font\cs assignment. And even
if the font turns out to be needed immediately, at least the guard
code can be simple: you only have to block the \cs and \fontdimen
access to it.

* The paragraph builder typically does two runs over a paragraph:
first without hyphenation, then with. That second run is not always
needed, but it could be done in a second thread in any case, thus
saving time when it was indeed necessary.

* Input reading can continue during the display math list
conversion, because the result of that conversion is not needed
until the next display or \par or one of a few primitives
(\unskip, \prevgraf, and maybe a few more) is encountered.

* The same may be true for input reading during paragraph breaking,
but there the list of blocking events is much longer.

The problem: I fear that for someone not intimate with tex-like
engines and their complex build systems, the simplest of these
(image prefetching or font subsetting) will take the full gsoc period,
and may not even get done in that time. Worse still, we could end up
with some patch that only works on linux, nowhere else. I definately
cannot mentor such a project because it would in all likelihood eat
up so much time that I could do nothing else in the period.

Best wishes,
Taco