[luatex] runtime performance of lua(la)tex incl. gprof

Hans Hagen j.hagen at xs4all.nl
Tue Jun 20 16:33:14 CEST 2023


On 6/20/2023 3:11 PM, Axel Kittenberger wrote:
> Hello,
> 
> First let me say, I really don't want this to be a complaint. I'm just
> wondering.
> 
> I considered switching my department's default compiler from pdflatex to
> lualatex.
> Some subtle differences were to be expected and with test documents so far
> easily catered for.
> The output is okay.
> 
> However what surprised me is a complicated test document which took ~150
> seconds with pdflatex now takes 210 seconds with lualatex.
> 
> Trying to figure out if this is due to some of the many packages it uses,
> I kept simplifying.
> 
> --- laliptest.tex ---
> \documentclass{article}
> \input{plipsum}
> \begin{document}
> \newcount\ii
> \ii=100
> \loop
>    \lipsum{1-100}
>    \advance\ii-1
> \ifnum \ii>0
> \repeat
> \end{document}
> ---------
> 
> This most simple document doesn't use any package, but plipsum which can be
> replaced with plain text too. Compile time results:
> 
> pdflatex: user 0m1.920s (3.1 MB result)
> lualatex: user 0m17.565s (3.8 MB result)
> 
> 8 times slower.
> 
> Versions tested with:
> pdfTeX 3.141592653-2.6-1.40.24 (TeX Live 2022/Debian)
> This is LuaHBTeX, Version 1.15.0 (TeX Live 2022/Debian)
> 
> Since LaTeX also includes a lot of stuff already, same tests with plain TeX.
> 
> --- liptest.tex ---
> \input{plipsum}
> \newcount\i
> \i=100
> \loop
>    \lipsum{1-100}
>    \advance\i-1
> \ifnum \i>0
> \repeat
> \end
> ---------
> pdftex: user 0m1.053s (2.9 MB result)
> luatex: user 0m1.943s (3.1 MB result)
> 
> This isn't as bad as the LaTeX variants, but still almost a factor two.
> Searching about this online turns up results about microtype or front
> loading etc.
> Neither can be an issue, since microtype is off and frontloading must be a
> fixed offset, but the compile time increases linearly with document length.
> 
> This now took me a while, but I managed to compile luatex with "-gp" to
> create a gprof profile and this is
> the result:
> ----------
> Flat profile:
> 
> Each sample counts as 0.01 seconds.
>    %   cumulative   self              self     total
>   time   seconds   seconds    calls   s/call   s/call  name
>   14.63      0.42     0.42  2409555     0.00     0.00  longest_match
>    8.71      0.67     0.25   295700     0.00     0.00  hnj_hyphen_hyphenate
>    8.19      0.91     0.24 52832741     0.00     0.00  get_sa_item
>    6.62      1.10     0.19      773     0.00     0.00  deflate_slow
>    3.48      1.20     0.10 30117352     0.00     0.00  char_info
>    2.79      1.28     0.08    10000     0.00     0.00  ext_do_line_break
>    2.79      1.36     0.08      773     0.00     0.00  compress_block
>    2.09      1.42     0.06  2978422     0.00     0.00  calc_pdfpos
>    2.09      1.48     0.06   515855     0.00     0.00  handle_lig_word
>    1.74      1.53     0.05 14032575     0.00     0.00  char_exists
>    1.74      1.58     0.05  4689611     0.00     0.00  flush_node
>    1.74      1.63     0.05  2896557     0.00     0.00  output_one_char
>    1.74      1.68     0.05   227877     0.00     0.00  hash_normalized
>    1.74      1.73     0.05    41510     0.00     0.00  hlist_out
>    1.74      1.78     0.05    23020     0.00     0.00  fix_node_list
>    1.74      1.83     0.05     2319     0.00     0.00  adler32_z
>    1.39      1.87     0.04   227877     0.00     0.00  hash_insert_normalized
>    1.39      1.91     0.04    39615     0.00     0.00  fm_scan_line
>    1.39      1.95     0.04    11510     0.00     0.00  hnj_hyphenation
>    1.05      1.98     0.03  3831639     0.00     0.00  get_x_token
>    1.05      2.01     0.03  2896557     0.00     0.00  get_charinfo_whd
>    1.05      2.04     0.03  2382502     0.00     0.00  add_kern_before
>    1.05      2.07     0.03   303962     0.00     0.00  luaS_hash
>    1.05      2.10     0.03    10000     0.00     0.00  ext_post_line_break
> -------
> So it's not like there is one function that takes the bulk of the slowdown
> as I expected (as often happens in reality, where an innocent-looking
> small thing takes so much).
> 
> longest_match() is something from zlib.
> 
> I'm just really surprised. I have been following this project for a while
> now, since I consider it highly interesting, and since I read that one of
> the major steps was rewriting the TeX core from somewhat idiosyncratic WEB
> to C, I expected it to be even a bit faster...
> 
> And this is the profile of pdftex in comparison.
> ----------
> Flat profile:
> 
> Each sample counts as 0.01 seconds.
>    %   cumulative   self              self     total
>   time   seconds   seconds    calls   s/call   s/call  name
>   29.48      0.51     0.51  2362906     0.00     0.00  longest_match
>   13.29      0.74     0.23  5876210     0.00     0.00  zdividescaled
>   11.56      0.94     0.20      775     0.00     0.00  deflate_slow
>    4.62      1.02     0.08    41510     0.00     0.00  pdfhlistout
>    4.62      1.10     0.08      774     0.00     0.00  compress_block
>    3.47      1.16     0.06        1     0.06     1.59  maincontrol
>    2.89      1.21     0.05      423     0.00     0.00  inflate_fast
>    2.31      1.25     0.04   227877     0.00     0.00  hash_insert_normalized
>    2.31      1.29     0.04    41510     0.00     0.00  zhpack
>    1.73      1.32     0.03 17821585     0.00     0.00  zeffectivechar
>    1.73      1.35     0.03   825830     0.00     0.00  zpdfprintint
>    1.73      1.38     0.03   260088     0.00     0.00  read_line
>    1.73      1.41     0.03   223361     0.00     0.00  pqdownheap
>    1.73      1.44     0.03    39615     0.00     0.00  fm_scan_line
>    1.45      1.47     0.03  1274937     0.00     0.00  zgetnode
>    1.16      1.49     0.02  2896157     0.00     0.00  zadvcharwidth
>    1.16      1.51     0.02   579800     0.00     0.00  ztrybreak
>    1.16      1.53     0.02   227877     0.00     0.00  hash_normalized
>    1.16      1.55     0.02    26742     0.00     0.00  zflushnodelist
>    0.87      1.56     0.02  1274936     0.00     0.00  zfreenode
>    0.58      1.57     0.01  4160738     0.00     0.00  getnext
>    0.58      1.58     0.01  3419912     0.00     0.00  zgetautokern
>    0.58      1.59     0.01  2896161     0.00     0.00  hasfmentry
>    0.58      1.60     0.01  2896159     0.00     0.00  isscalable
>    0.58      1.61     0.01  2896157     0.00     0.00  zpdfprintchar
> --------
> Neither was exactly the same version as tested previously; I compiled the
> newest TeX Live tagged as a release myself.
> (This is LuaTeX, Version 1.16.0 (TeX Live 2023))
> (pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023))
> 
> Runtimes when compiled with -O3 are almost the same as the native debian
> above, and I profiled the plain TeX variants only.
> 
> So zlib also takes a large share, relatively even larger, so it is not the culprit.
> Different implementation of hyphenation seems to be one factor I'd "blame"
> 
> Turning it off with \language=255 improves it:
> 
> pdftex: user 0m1.029s
> luatex: user 0m1.596s
> 
> but there is still more, namely get_sa_item()/char_info().
> 
> And reading the comments in managed-sa.c, it seems the main issue is it
> being sparse? So I guess the way to improve that would be not to be sparse?
> 
> Anyway, that was my report on this. Unfortunately I'm holding off pushing
> it as the new default compiler for us, since the slowdown is a bad sell for
> something which is only sometimes useful.
> 
> PS: personally I use Lua to calculate some more complex drawings for tikz,
> as using a "normal" programming language is much easier for me than doing
> more complicated pgf macros. But in the end it also just generates .tex
> code, which I simply feed into pdflatex; it only gets complicated to keep
> track of which files people ought to change and which are autogenerated
> .tex files.

There are many factors that play a role here. Among them are:

- the macro package that is used
- the kind of fonts (8 bit or 32 bit, features, processing mode)
- the amount of work that the backend needs to do
- the input encoding
- an 8 vs 32 bit code path
- storing everything in 256-entry data structures or over the full Unicode range (sparse)
- using synctex
- startup time (loading format, lua etc)
- the size of fonts
- backend compression
- node list processing (advanced lua based features)

but most important is the kind of document, and a pure text document in 
10pt with a plain TeX layout is not the best test.
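Backend compression is one of the factors listed above. To put a rough number on it, the zlib entries visible in the two gprof excerpts quoted earlier can simply be summed. The sketch below does only that arithmetic. It counts just the rows shown in the truncated excerpts, so treat the results as lower bounds. The dictionary and function names are made up for the example.

```python
# (% time, self seconds) of the zlib routines visible in the two gprof
# flat profiles quoted above (plain TeX test, self-compiled TeX Live 2023).
# Only rows present in the truncated excerpts are included.
LUATEX_ZLIB = {
    "longest_match": (14.63, 0.42),
    "deflate_slow": (6.62, 0.19),
    "compress_block": (2.79, 0.08),
    "adler32_z": (1.74, 0.05),
}
PDFTEX_ZLIB = {
    "longest_match": (29.48, 0.51),
    "deflate_slow": (11.56, 0.20),
    "compress_block": (4.62, 0.08),
    "inflate_fast": (2.89, 0.05),
    "pqdownheap": (1.73, 0.03),
}


def zlib_totals(rows):
    """Sum the % time and the self seconds over the given profile rows."""
    percent = sum(pct for pct, _ in rows.values())
    seconds = sum(sec for _, sec in rows.values())
    return percent, seconds


if __name__ == "__main__":
    for engine, rows in (("luatex", LUATEX_ZLIB), ("pdftex", PDFTEX_ZLIB)):
        percent, seconds = zlib_totals(rows)
        print(f"{engine}: {seconds:.2f} s in zlib ({percent:.2f}% of samples)")
```

By this count both engines spend a similar absolute time in zlib: about 0.74 s for luatex and about 0.87 s for pdftex. That is roughly a quarter of luatex's sampled time and half of pdftex's. This fits the observation in the report that compression does not explain the gap.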

When I run that test in context, with a 6.5in by 8.9in text area and 10pt 
Latin Modern, I get this (on a laptop):

  4.0 sec  pdftex     (910 pages)
11.8 sec  xetex      (953 pages, don't ask why)
  9.7 sec  luatex     (910 pages)
11.8 sec  luametatex (910 pages)

The main reason why pdftex is faster here is that it uses 8-bit fonts, 
which amounts to what we call base mode. With the Lua engines in base mode too:

  4.0 sec  pdftex     (910 pages)
  6.6 sec  luatex     (910 pages)
  9.0 sec  luametatex (910 pages)

The reason why luametatex is slower here is that more is done in Lua 
(the backend), but also that quite a bit more is, or can be, done in the frontend.

btw, not hyphenating saves 1 second, partly because less font processing is 
needed.

If we enable font expansion, the runtime in pdftex will go up, but I didn't test 
that; with the Lua-based engines we get:

  11.7 sec  luatex
  19.1 sec  luametatex (more granular)

but in luametatex that can drop to below 13 seconds with more selective 
processing.
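The context timings above can be turned into throughput and into a ratio against pdftex with a few lines of arithmetic. This is a small sketch using only the reported seconds and page counts. The table and function names are illustrative.

```python
# Timings reported above for the lipsum test run through context
# (6.5in x 8.9in text area, 10pt Latin Modern, on a laptop).
# engine -> (seconds, pages)
DEFAULT = {
    "pdftex": (4.0, 910),
    "xetex": (11.8, 953),
    "luatex": (9.7, 910),
    "luametatex": (11.8, 910),
}
BASE_MODE = {
    "pdftex": (4.0, 910),
    "luatex": (6.6, 910),
    "luametatex": (9.0, 910),
}


def pages_per_second(seconds, pages):
    return pages / seconds


def relative_to_pdftex(timings):
    """Runtime of each engine divided by the pdftex runtime of the same set."""
    baseline = timings["pdftex"][0]
    return {engine: secs / baseline for engine, (secs, _) in timings.items()}


if __name__ == "__main__":
    for label, timings in (("default", DEFAULT), ("base mode", BASE_MODE)):
        ratios = relative_to_pdftex(timings)
        print(label)
        for engine, (secs, pages) in timings.items():
            rate = pages_per_second(secs, pages)
            print(f"  {engine:<10} {rate:6.1f} pages/s  {ratios[engine]:.2f}x pdftex")
```

On this document pdftex manages about 228 pages per second and luatex about 94. With luatex in base mode that rises to about 138. Base mode thus brings luatex from roughly 2.4 times the pdftex runtime down to 1.65 times.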

But ... what does that mean in practice? Not that much, because I can 
process a 350-page luametatex manual with plenty of mixed fonts in 8+ 
seconds, and here the 33 pages per second of luametatex is way more than the 
roughly 24 pages per second that we get with luatex (there are reasons for 
this that I won't detail here). A metafun manual with 450 pages and 
thousands of runtime graphics can be done in 15-20 seconds.

So how does that compare with pdftex and xetex? Add structure, color, 
mixed fonts, a bit of layout and some metapost graphics, and luametatex wins 
over luatex, which in turn can beat pdftex.

ps. We need to use sparse arrays because we use Unicode and wide fonts, 
so we cannot use 8-bit indexed arrays. This is true for fonts and for all 
the data tables that define the \xxcode's of characters.
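The trade-off in the ps above can be illustrated in code. What follows is a minimal Python sketch of a multi-level sparse table indexed by Unicode code point, next to the flat 256-slot table an 8-bit engine can afford. Several details are assumptions made only for this example: the three 7-bit levels, the names SparseCodeTable and make_flat_table, and the \lccode-like sample data. It shows the general idea behind lookups such as get_sa_item(). It is not a transcription of luatex's managed-sa.c.

```python
LEVEL_BITS = 7                  # assumption: 3 levels x 7 bits = 21 bits >= U+10FFFF
LEVEL_SIZE = 1 << LEVEL_BITS    # 128 slots per block
LEVEL_MASK = LEVEL_SIZE - 1


class SparseCodeTable:
    """Hypothetical three-level sparse table mapping code points to values.

    Blocks are only allocated where a non-default value has been stored, so
    covering the whole Unicode range stays cheap in memory, but every lookup
    needs three index computations and up to three list accesses.
    """

    def __init__(self, default):
        self.default = default
        self.root = [None] * LEVEL_SIZE

    @staticmethod
    def _split(code):
        if not 0 <= code <= 0x10FFFF:
            raise ValueError(f"not a Unicode code point: {code:#x}")
        return (code >> (2 * LEVEL_BITS),
                (code >> LEVEL_BITS) & LEVEL_MASK,
                code & LEVEL_MASK)

    def get(self, code):
        high, mid, low = self._split(code)
        mid_block = self.root[high]
        if mid_block is None:          # whole 16384-code range unset
            return self.default
        low_block = mid_block[mid]
        if low_block is None:          # whole 128-code range unset
            return self.default
        return low_block[low]

    def set(self, code, value):
        high, mid, low = self._split(code)
        if self.root[high] is None:
            self.root[high] = [None] * LEVEL_SIZE
        mid_block = self.root[high]
        if mid_block[mid] is None:
            mid_block[mid] = [self.default] * LEVEL_SIZE
        mid_block[mid][low] = value

    def allocated_blocks(self):
        """Number of mid and low blocks allocated so far (root not counted)."""
        mids = [block for block in self.root if block is not None]
        return len(mids) + sum(1 for block in mids for low in block if low is not None)


def make_flat_table(default):
    """What an 8-bit engine can use instead: one 256-slot list, one index per lookup."""
    return [default] * 256


if __name__ == "__main__":
    # \lccode-like data: most code points keep the default 0.
    lccode = SparseCodeTable(default=0)
    lccode.set(ord("A"), ord("a"))
    lccode.set(0x0391, 0x03B1)         # Greek capital alpha -> small alpha
    print(lccode.get(ord("A")), lccode.get(0x1F600), lccode.allocated_blocks())
    # prints: 97 0 3
```

A flat list over all 0x110000 code points would answer each lookup with one index. It would have to be allocated for every such table, though, including per-font data and every \xxcode table. The sparse layout saves that memory at the price of extra indirections on every access. Those indirections show up as the very frequent get_sa_item() calls in the luatex profile.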

ps. If pdftex performance was ok 10 years ago then a modern machine will 
run luatex at comparable speed. I use a 5 year old laptop and sometimes 
wonder what a modern desktop would bring. I bet that the 350 page manual 
will be processed in 4 seconds, which is ok for me.

ps. context spends > 50% of its time in Lua, so it gets harder to gain 
speed in the frontend (but it does happen).

ps. Don't take context as a reference; I can imagine that latex and plain 
process faster. You mention latex:

    pdflatex: user 0m 1.920s (3.1 MB result)
    lualatex: user 0m17.565s (3.8 MB result)

    8 times slower.

which is indeed somewhat suspicious, but it might be due to 8-bit vs 32-bit 
input and fonts. More realistic is the plain comparison:

    pdftex: user 0m1.053s (2.9 MB result)
    luatex: user 0m1.943s (3.1 MB result)

but I guess both use 8-bit fonts here (luatex in base mode), so 0.9 sec 
extra for an engine with more features is okay, I think, especially given that 
machines got a bit faster since latex showed up.
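For reference, the slowdown factors and absolute differences for the user times quoted in this thread can be checked quickly. The figures come from different runs and builds, as reported, so compare them with care. The names below are illustrative.

```python
# user times (seconds) from the report quoted above:
# test -> (pdfTeX-based run, LuaTeX-based run)
REPORTED = {
    "LaTeX": (1.920, 17.565),
    "plain": (1.053, 1.943),
    "plain, \\language=255": (1.029, 1.596),
}


def compare(pdf_seconds, lua_seconds):
    """Return (slowdown factor, extra seconds) of the LuaTeX-based run."""
    return lua_seconds / pdf_seconds, lua_seconds - pdf_seconds


if __name__ == "__main__":
    for test, (pdf_s, lua_s) in REPORTED.items():
        factor, extra = compare(pdf_s, lua_s)
        print(f"{test:<22} {factor:5.2f}x slower  (+{extra:.3f} s)")
```

With these numbers the LaTeX run is about 9.1 times slower. The plain run is about 1.85 times slower, which is the 0.89 s ("0.9 sec") difference discussed above. With hyphenation switched off via \language=255, the plain factor drops to about 1.55.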

Hans


-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------


