[luatex] runtime performance of lua(la)tex incl. gprof
Taco Hoekwater
taco at bittext.nl
Wed Jun 21 12:03:34 CEST 2023
> On 21 Jun 2023, at 11:18, Axel Kittenberger <axkibe at gmail.com> wrote:
>
> Hi guys,
>
> Thanks a lot for your replies, it was an interesting read. I pondered a bit what could be done and how much is unavoidable fate due to 32bit/truetype etc.
The sparse arrays are 64-bit data values in a 21-bit address space split into three 7-bit segments.
A full non-sparse version of one of these arrays would occupy 4MB of memory. That is not a lot these days, but besides a dozen or so static ones, each loaded font instance has one of these for its glyph indices; each of the potentially 32k catcode tables has one; and each of the potentially 16k sets of hyphenation codes has one as well.
That adds up, even for a somewhat ’normal’ smallish run. Full Unicode support is quite expensive in terms of memory, we found.
If even a relatively simple TeX run needs some 100s of MBs of memory, that introduces its own slowdown due to L2/L3 cache misses, operating system disk cache shrinkage, and possibly even process memory swapping. You do not see that in the low-level profile counts, but it becomes visible in high-level wall-clock “time” tests.
That is why there is a sparse array implementation in the first place. It is not as quick as flat access, but faster than a btree while using a lot less memory.
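The structure described above can be sketched like this. This is a minimal illustrative model of a three-level sparse array with three 7-bit segments; all names (`sparse_tree`, `get_item`, `set_item`, the `HI`/`MID`/`LO` macros) are invented for this example and are not the actual managed-sa.c code, and the 64-bit item type is only an assumption:

```c
#include <stdlib.h>
#include <stdint.h>

/* A 21-bit address space split into three 7-bit segments
   of 128 slots each; levels are allocated lazily. */
#define SEG_BITS 7
#define SEG_SIZE (1 << SEG_BITS)   /* 128 slots per level */
#define SEG_MASK (SEG_SIZE - 1)

#define HI(n)  (((n) >> (2 * SEG_BITS)) & SEG_MASK)
#define MID(n) (((n) >> SEG_BITS) & SEG_MASK)
#define LO(n)  ((n) & SEG_MASK)

typedef struct {
    uint64_t ***tree;   /* NULL until something is stored */
    uint64_t dflt;      /* returned for unset indices */
} sparse_tree;

static uint64_t get_item(const sparse_tree *t, int n)
{
    if (t->tree != NULL
        && t->tree[HI(n)] != NULL
        && t->tree[HI(n)][MID(n)] != NULL)
        return t->tree[HI(n)][MID(n)][LO(n)];
    return t->dflt;
}

static void set_item(sparse_tree *t, int n, uint64_t v)
{
    /* Allocate each level on first use; unset slots stay NULL. */
    if (t->tree == NULL)
        t->tree = calloc(SEG_SIZE, sizeof *t->tree);
    if (t->tree[HI(n)] == NULL)
        t->tree[HI(n)] = calloc(SEG_SIZE, sizeof **t->tree);
    if (t->tree[HI(n)][MID(n)] == NULL)
        t->tree[HI(n)][MID(n)] = calloc(SEG_SIZE, sizeof ***t->tree);
    t->tree[HI(n)][MID(n)][LO(n)] = v;
}
```

Only the 128-entry segments along a populated path are allocated, which is where the memory savings over a flat 2^21-entry array come from.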
Using dummy data does not help, I believe, because you then need to check the memory pointer against the dummy pointer instead of against NULL, which gains nothing.
However, forcing head->tree != NULL is nice indeed.
Best wishes,
Taco
>
> For example one of the larger chunks in the profile is get_sa_item() (relating to 32bit and sparse arrays). Albeit the profile wasn't as lopsided as I hoped.
>
> The current implementation is:
> ---
> sa_tree_item get_sa_item(const sa_tree head, const int n)
> {
>     if (head->tree != NULL) {
>         register int h = HIGHPART_PART(n);
>         if (head->tree[h] != NULL) {
>             register int m = MIDPART_PART(n);
>             if (head->tree[h][m] != NULL) {
>                 return head->tree[h][m][LOWPART_PART(n)];
>             }
>         }
>     }
>     return head->dflt;
> }
> ---
> Since someone put in those "register" directives, I believe that person already did something similar to what I did and identified this function as a bottleneck. However, the register directive very likely does absolutely nothing for most compilers, as they would optimize register allocation anyway.
>
> At first I wondered about the dereferencing of the same pointers multiple times. The first experiment was to store the pointers in variables instead of repeating the same array operations. But I must say the effect was zero. The compiler certainly sees that these non-volatile pointers cannot change in between and optimizes that anyhow.
>
> However, one thing where we can indeed help the compiler is by giving it the option to inline the function by declaring it static inline. This already improves the runtime a bit (when compiled with -O3). On the other hand, it likely increases the binary size a bit; that is the typical tradeoff.
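For clarity, the change being described is moving the accessor into a header as a static inline function. The sketch below is self-contained, so the type and the HIGHPART/MIDPART/LOWPART macros are stand-ins consistent with the 7-bit split (the real definitions live in managed-sa.h and differ, e.g. sa_tree_item is not simply a uint64_t):

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins so the sketch compiles on its own;
   the real definitions are in managed-sa.h. */
typedef uint64_t sa_tree_item;
typedef struct sa_tree_head {
    sa_tree_item ***tree;
    sa_tree_item dflt;
} *sa_tree;

#define HIGHPART_PART(n) (((n) >> 14) & 127)
#define MIDPART_PART(n)  (((n) >> 7) & 127)
#define LOWPART_PART(n)  ((n) & 127)

/* Declaring the accessor static inline in the header gives the
   compiler the option to inline it at every call site under -O3,
   instead of paying a call for each of the ~52 million invocations
   seen in the profile. */
static inline sa_tree_item get_sa_item(const sa_tree head, const int n)
{
    if (head->tree != NULL) {
        int h = HIGHPART_PART(n);
        if (head->tree[h] != NULL) {
            int m = MIDPART_PART(n);
            if (head->tree[h][m] != NULL)
                return head->tree[h][m][LOWPART_PART(n)];
        }
    }
    return head->dflt;
}
```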
>
> Another starting point is that first if. Looking at new_sa_tree(), it is very unlikely for tree to be NULL. Actually I think the only way for it ever to be NULL is via undump_sa_tree(), when undump(x) was 0, which I believe also cannot happen, as long as all undumped trees have to be dumped first. But I admit I have no idea what this whole dump API does.
>
> For experimenting, I changed the code so that tree is always non-NULL. I may easily miss some special cases, but IMO a head without any tree makes little sense. And while at it, I removed one dereference by putting the HIGHPART array directly into head, since it is non-NULL and non-changing for a given head anyway.
>
> Also, my previous post with the one-shot time measurements was a bit imprecise; here is a comparison of the patch with n=20 runs.
>
> without patch
> 1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
> Mean Std.Dev. Min Median Max
> real 2.072 0.053 1.984 2.067 2.197
> user 2.021 0.050 1.932 2.010 2.128
> sys 0.050 0.012 0.028 0.048 0.068
>
> with patch:
>
> ===> multitime results
> 1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
> Mean Std.Dev. Min Median Max
> real 1.948 0.021 1.917 1.943 2.012
> user 1.899 0.023 1.867 1.895 1.956
> sys 0.045 0.011 0.024 0.048 0.064
>
> A 5% performance increase, IMO not bad considering the original profile attributed 8% of total runtime to this function (so 8% was the maximum possible improvement).
>
> I think get_sa_item() could even be made totally branchless -- something modern CPUs love -- by filling the HIGH and MID parts with links to arrays of default values instead of NULLs (albeit at the cost of ~512 bytes more memory for the default dummy trees). But it would require some more code adaptations, to test for dummy tree links everywhere the code currently tests for a NULL pointer.
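The branchless idea could be sketched like this. All names are illustrative and the per-head dummy layout is an assumption: empty slots point at dummy segments pre-filled with the default value instead of at NULL, so a lookup becomes three unconditional dereferences, and writers (and any code that currently tests for NULL) must compare against the dummy pointers instead:

```c
#include <stdint.h>

#define SEG 128
typedef uint64_t item_t;

typedef struct {
    item_t **tree[SEG];      /* empty slots -> dummy_mid */
    item_t *dummy_mid[SEG];  /* every slot  -> dummy_low */
    item_t dummy_low[SEG];   /* every slot  =  dflt      */
    item_t dflt;
} sa_head;

static void init_head(sa_head *h, item_t dflt)
{
    h->dflt = dflt;
    for (int i = 0; i < SEG; i++) {
        h->dummy_low[i] = dflt;
        h->dummy_mid[i] = h->dummy_low;
        h->tree[i] = h->dummy_mid;
    }
}

/* No branches: every slot always holds a valid pointer, so the
   lookup is a straight chain of three dereferences. */
static item_t get_item(const sa_head *h, int n)
{
    return h->tree[(n >> 14) & (SEG - 1)]
                  [(n >> 7) & (SEG - 1)]
                  [n & (SEG - 1)];
}
```

A writer installing a real segment would copy the dummy pattern first, then overwrite the one slot it needs, so all other indices keep returning the default.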
>
> I admit though, I'm fiddling with a code base with no understanding of the larger picture it fits in.
>
> Kind regards, Axel
>
> On Tue, Jun 20, 2023 at 3:11 PM Axel Kittenberger <axkibe at gmail.com> wrote:
> Hello,
>
> First let me say, I really don't want this to be a complaint. I'm just wondering.
>
> I considered switching my department's default compiler from pdflatex to lualatex.
> Some subtle differences were to be expected, and with the test documents so far they were easily catered for.
> The output is okay.
>
> However what surprised me is a complicated test document which took ~150 seconds with pdflatex now takes 210 seconds with lualatex.
>
> Trying to figure out whether this is caused by one of the many packages it uses, I kept simplifying.
>
> --- laliptest.tex ---
> \documentclass{article}
> \input{plipsum}
> \begin{document}
> \newcount\ii
> \ii=100
> \loop
> \lipsum{1-100}
> \advance\ii-1
> \ifnum \ii>0
> \repeat
> \end{document}
> ---------
>
> This most simple document doesn't use any package but plipsum, which can be replaced with plain text too. Compile time results:
>
> pdflatex: user 0m1.920s (3.1 MB result)
> lualatex: user 0m17.565s (3.8 MB result)
>
> Roughly nine times slower.
>
> Versions tested with:
> pdfTeX 3.141592653-2.6-1.40.24 (TeX Live 2022/Debian)
> This is LuaHBTeX, Version 1.15.0 (TeX Live 2022/Debian)
>
> Since LaTeX also includes a lot of stuff already, here are the same tests with plain TeX.
>
> --- liptest.tex ---
> \input{plipsum}
> \newcount\i
> \i=100
> \loop
> \lipsum{1-100}
> \advance\i-1
> \ifnum \i>0
> \repeat
> \end
> ---------
> pdftex: user 0m1.053s (2.9 MB result)
> luatex: user 0m1.943s (3.1 MB result)
>
> This isn't as bad as the LaTeX variant, but still almost a factor of two.
> Searching about this online turns up results about microtype, front-loading, etc.
> Neither can be the issue here: microtype is off, and front-loading would be a fixed offset, while the compile time increases linearly with document length.
>
> This now took me a while, but I managed to compile luatex with "-pg" to create a gprof profile, and this is
> the result:
> ----------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 14.63 0.42 0.42 2409555 0.00 0.00 longest_match
> 8.71 0.67 0.25 295700 0.00 0.00 hnj_hyphen_hyphenate
> 8.19 0.91 0.24 52832741 0.00 0.00 get_sa_item
> 6.62 1.10 0.19 773 0.00 0.00 deflate_slow
> 3.48 1.20 0.10 30117352 0.00 0.00 char_info
> 2.79 1.28 0.08 10000 0.00 0.00 ext_do_line_break
> 2.79 1.36 0.08 773 0.00 0.00 compress_block
> 2.09 1.42 0.06 2978422 0.00 0.00 calc_pdfpos
> 2.09 1.48 0.06 515855 0.00 0.00 handle_lig_word
> 1.74 1.53 0.05 14032575 0.00 0.00 char_exists
> 1.74 1.58 0.05 4689611 0.00 0.00 flush_node
> 1.74 1.63 0.05 2896557 0.00 0.00 output_one_char
> 1.74 1.68 0.05 227877 0.00 0.00 hash_normalized
> 1.74 1.73 0.05 41510 0.00 0.00 hlist_out
> 1.74 1.78 0.05 23020 0.00 0.00 fix_node_list
> 1.74 1.83 0.05 2319 0.00 0.00 adler32_z
> 1.39 1.87 0.04 227877 0.00 0.00 hash_insert_normalized
> 1.39 1.91 0.04 39615 0.00 0.00 fm_scan_line
> 1.39 1.95 0.04 11510 0.00 0.00 hnj_hyphenation
> 1.05 1.98 0.03 3831639 0.00 0.00 get_x_token
> 1.05 2.01 0.03 2896557 0.00 0.00 get_charinfo_whd
> 1.05 2.04 0.03 2382502 0.00 0.00 add_kern_before
> 1.05 2.07 0.03 303962 0.00 0.00 luaS_hash
> 1.05 2.10 0.03 10000 0.00 0.00 ext_post_line_break
> -------
> So it's not like one single function accounts for the bulk of the slowdown, as I expected (and as
> often happens in reality, where an innocent-looking small thing takes that much).
>
> longest_match() is something from zlib.
>
> I'm just really surprised. I have been following this project for a while now, since I consider it highly interesting, and since I read that one of the major steps was rewriting the TeX core from somewhat idiosyncratic WEB to C, I expected it to even be a bit faster...
>
> And this is the profile of pdftex in comparison.
> ----------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls s/call s/call nam
> 29.48 0.51 0.51 2362906 0.00 0.00 longest_match
> 13.29 0.74 0.23 5876210 0.00 0.00 zdividescaled
> 11.56 0.94 0.20 775 0.00 0.00 deflate_slow
> 4.62 1.02 0.08 41510 0.00 0.00 pdfhlistout
> 4.62 1.10 0.08 774 0.00 0.00 compress_block
> 3.47 1.16 0.06 1 0.06 1.59 maincontrol
> 2.89 1.21 0.05 423 0.00 0.00 inflate_fast
> 2.31 1.25 0.04 227877 0.00 0.00 hash_insert_normalized
> 2.31 1.29 0.04 41510 0.00 0.00 zhpack
> 1.73 1.32 0.03 17821585 0.00 0.00 zeffectivechar
> 1.73 1.35 0.03 825830 0.00 0.00 zpdfprintint
> 1.73 1.38 0.03 260088 0.00 0.00 read_line
> 1.73 1.41 0.03 223361 0.00 0.00 pqdownheap
> 1.73 1.44 0.03 39615 0.00 0.00 fm_scan_line
> 1.45 1.47 0.03 1274937 0.00 0.00 zgetnode
> 1.16 1.49 0.02 2896157 0.00 0.00 zadvcharwidth
> 1.16 1.51 0.02 579800 0.00 0.00 ztrybreak
> 1.16 1.53 0.02 227877 0.00 0.00 hash_normalized
> 1.16 1.55 0.02 26742 0.00 0.00 zflushnodelist
> 0.87 1.56 0.02 1274936 0.00 0.00 zfreenode
> 0.58 1.57 0.01 4160738 0.00 0.00 getnext
> 0.58 1.58 0.01 3419912 0.00 0.00 zgetautokern
> 0.58 1.59 0.01 2896161 0.00 0.00 hasfmentry
> 0.58 1.60 0.01 2896159 0.00 0.00 isscalable
> 0.58 1.61 0.01 2896157 0.00 0.00 zpdfprintchar
> --------
> Both weren't exactly the same versions as tested previously; I compiled the newest texlive tagged as a release myself.
> (This is LuaTeX, Version 1.16.0 (TeX Live 2023))
> (pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023))
>
> Runtimes when compiled with -O3 are almost the same as the native debian above, and I profiled the plain TeX variants only.
>
> So zlib also takes a large share, in relation even larger, so it is not the culprit.
> The different implementation of hyphenation seems to be one factor I'd "blame".
>
> Turning it off with \language=255 improves it:
>
> pdftex: user 0m1.029s
> luatex: user 0m1.596s
>
> but there is still more, which comes down to get_sa_item()/char_info().
>
> And reading the comments in managed-sa.c, it seems the main issue is being sparse? So I guess the way to improve that would be to not be sparse?
>
> Anyway, that was my report on this. Unfortunately I'm holding off on pushing it as the new default compiler for us, since the slowdown is a bad sell for something which is only sometimes useful.
>
> PS: personally I use Lua to calculate some more complex drawings for tikz, as using a "normal" programming
> language is much easier for me than writing more complicated pgf macros. But in the end it just generates .tex code, which I simply feed into pdflatex; it only gets complicated to track which files people ought to change and which are autogenerated .tex files.
>
> Kind regards, Axel
> <diff>
—
Taco Hoekwater E: taco at bittext.nl
genderfluid (all pronouns)