[luatex] runtime performance of lua(la)tex incl. gprof

Axel Kittenberger axkibe at gmail.com
Wed Jun 21 11:18:13 CEST 2023


Hi guys,

Thanks a lot for your replies, it was an interesting read. I pondered a bit
over what could be done and how much of it is unavoidable given 32-bit
character codes, TrueType handling, etc.

For example, one of the larger chunks in the profile is get_sa_item()
(related to 32-bit character codes and sparse arrays), albeit the profile
wasn't as lopsided as I had hoped.

The current implementation is:
---
sa_tree_item get_sa_item(const sa_tree head, const int n)
{
    if (head->tree != NULL) {
        register int h = HIGHPART_PART(n);
        if (head->tree[h] != NULL) {
            register int m = MIDPART_PART(n);
            if (head->tree[h][m] != NULL) {
                return head->tree[h][m][LOWPART_PART(n)];
            }
        }
    }
    return head->dflt;
}
---
Since someone put in those "register" directives, I believe that person
already did something similar to what I did and identified this function as
a bottleneck. However, that register directive very likely does absolutely
nothing on most compilers, as they apply that optimization anyway (and
largely ignore the hint).

At first I wondered about the repeated dereferencing of the same pointers.
The first experiment was to store the intermediate pointers in local
variables instead of repeating the same array operations. But I must say
the effect was zero: the compiler evidently sees that these non-volatile
pointers cannot change in between and optimizes that anyway.
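Roughly, the experiment looked like this (a self-contained sketch; the
types and the HIGH/MID/LOW bit splits here are simplified stand-ins for the
real definitions in managed-sa.h, not the actual values):

```c
#include <stddef.h>

typedef unsigned int sa_tree_item;

/* simplified stand-in for the real sa_tree structure */
typedef struct sa_tree_head {
    sa_tree_item ***tree;   /* HIGH -> MID -> LOW levels */
    sa_tree_item dflt;      /* default returned for unset slots */
} sa_tree_head, *sa_tree;

/* hypothetical bit splits, not the real managed-sa.h values */
#define HIGHPART_PART(n) (((n) >> 14) & 0x1ff)
#define MIDPART_PART(n)  (((n) >> 7) & 0x7f)
#define LOWPART_PART(n)  ((n) & 0x7f)

/* variant with the intermediate pointers hoisted into locals;
   measured effect: none, the compiler already does this hoisting */
sa_tree_item get_sa_item_hoisted(const sa_tree head, const int n)
{
    sa_tree_item **hp;
    sa_tree_item *lp;
    if (head->tree == NULL)
        return head->dflt;
    hp = head->tree[HIGHPART_PART(n)];
    if (hp == NULL)
        return head->dflt;
    lp = hp[MIDPART_PART(n)];
    if (lp == NULL)
        return head->dflt;
    return lp[LOWPART_PART(n)];
}
```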

However, one thing where we can indeed help the compiler is by giving it
the option to inline the function (static inline). This already improves
the runtime a bit (when compiled with -O3). On the other hand, it likely
increases the binary size a bit; that is the typical tradeoff.
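The change itself is small: the definition moves from managed-sa.c into the
header, marked static inline, so every translation unit can inline the
lookup. A sketch, again with stand-in types and hypothetical bit splits
rather than the real managed-sa.h ones:

```c
#include <stddef.h>

typedef unsigned int sa_tree_item;

/* simplified stand-in for the real sa_tree structure */
typedef struct sa_tree_head {
    sa_tree_item ***tree;
    sa_tree_item dflt;
} sa_tree_head, *sa_tree;

/* hypothetical bit splits */
#define HIGHPART_PART(n) (((n) >> 14) & 0x1ff)
#define MIDPART_PART(n)  (((n) >> 7) & 0x7f)
#define LOWPART_PART(n)  ((n) & 0x7f)

/* same body as before, but visible to (and inlinable by) every caller */
static inline sa_tree_item get_sa_item(const sa_tree head, const int n)
{
    if (head->tree != NULL) {
        int h = HIGHPART_PART(n);
        if (head->tree[h] != NULL) {
            int m = MIDPART_PART(n);
            if (head->tree[h][m] != NULL)
                return head->tree[h][m][LOWPART_PART(n)];
        }
    }
    return head->dflt;
}
```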

Another starting point is that first if. Looking at new_sa_tree(), tree
being NULL is very unlikely. Actually, I think the only way it can ever be
NULL is via undump_sa_tree(), when undump(x) was 0, which I believe also
cannot happen, as long as all undumped trees have been dumped first. But I
admit I have no idea what this whole dump API does.

For experimenting, I changed the code so that tree is always non-NULL. I
may easily be missing some special cases, but IMO a head without any tree
makes little sense. And while at it, one dereference can be saved by
putting the HIGHPART array directly into head, since it is non-NULL and
unchanging for a given head anyway.
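A sketch of that layout (stand-in types and hypothetical bit splits; the
real struct in managed-sa.h differs): the HIGH-level pointer array becomes
a member of the head itself, so the lookup needs one NULL check and one
dereference less:

```c
#include <stddef.h>

typedef unsigned int sa_tree_item;

/* hypothetical sizes and bit splits */
#define HIGH_SIZE 512
#define HIGHPART_PART(n) (((n) >> 14) & (HIGH_SIZE - 1))
#define MIDPART_PART(n)  (((n) >> 7) & 0x7f)
#define LOWPART_PART(n)  ((n) & 0x7f)

typedef struct sa_tree_head {
    sa_tree_item dflt;
    /* HIGH level embedded in the head: no head->tree pointer to load,
       no NULL check for it; zeroed when the head is created */
    sa_tree_item **high[HIGH_SIZE];
} sa_tree_head, *sa_tree;

sa_tree_item get_sa_item(const sa_tree head, const int n)
{
    sa_tree_item **mid = head->high[HIGHPART_PART(n)];
    if (mid != NULL) {
        sa_tree_item *low = mid[MIDPART_PART(n)];
        if (low != NULL)
            return low[LOWPART_PART(n)];
    }
    return head->dflt;
}
```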

Also, my previous post with the one-shot time measurements was a bit
imprecise; here is a comparison of the patch using multitime with n=20.

without patch
1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
            Mean        Std.Dev.    Min         Median      Max
real        2.072       0.053       1.984       2.067       2.197
user        2.021       0.050       1.932       2.010       2.128
sys         0.050       0.012       0.028       0.048       0.068

with patch:

===> multitime results
1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
            Mean        Std.Dev.    Min         Median      Max
real        1.948       0.021       1.917       1.943       2.012
user        1.899       0.023       1.867       1.895       1.956
sys         0.045       0.011       0.024       0.048       0.064

A 5% performance increase; IMO not bad, considering that in the original
profile this function was responsible for 8% of total runtime (thus 8% was
the maximum possible improvement).

I think get_sa_item() could even be made totally branchless -- something
modern CPUs love -- by filling the HIGH and MID parts with links to arrays
of default values instead of NULLs (albeit at ~512 bytes more memory for
the default dummy trees). But it would require some more code adaptations
to test for dummy-tree links everywhere the code currently tests for a
NULL pointer.
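One possible shape of that idea (my own sketch, not the patch; sizes,
splits, and the init helper are all hypothetical): every HIGH slot points
at a real MID array or at a per-head dummy MID array, whose entries all
point at a dummy LOW array filled with the default value. The lookup then
has no branches at all:

```c
#include <stddef.h>

typedef unsigned int sa_tree_item;

/* hypothetical sizes and bit splits */
#define HIGH_SIZE 512
#define MID_SIZE  128
#define LOW_SIZE  128
#define HIGHPART_PART(n) (((n) >> 14) & (HIGH_SIZE - 1))
#define MIDPART_PART(n)  (((n) >> 7) & (MID_SIZE - 1))
#define LOWPART_PART(n)  ((n) & (LOW_SIZE - 1))

typedef struct sa_tree_head {
    /* every HIGH slot points somewhere: a real MID array or dflt_mid */
    sa_tree_item **high[HIGH_SIZE];
    sa_tree_item *dflt_mid[MID_SIZE]; /* all entries point at dflt_low */
    sa_tree_item dflt_low[LOW_SIZE];  /* all entries hold the default */
} sa_tree_head, *sa_tree;

/* hypothetical init helper: wire all levels to the dummy default arrays */
void init_sa_tree(sa_tree head, sa_tree_item dflt)
{
    int i;
    for (i = 0; i < LOW_SIZE; i++)
        head->dflt_low[i] = dflt;
    for (i = 0; i < MID_SIZE; i++)
        head->dflt_mid[i] = head->dflt_low;
    for (i = 0; i < HIGH_SIZE; i++)
        head->high[i] = head->dflt_mid;
}

/* completely branchless lookup: unset slots resolve to the default */
sa_tree_item get_sa_item(const sa_tree head, const int n)
{
    return head->high[HIGHPART_PART(n)][MIDPART_PART(n)][LOWPART_PART(n)];
}
```

The price, besides the extra memory, is exactly what is noted above: every
place that currently tests a pointer against NULL to decide whether a
subtree exists would instead have to test against the dummy-tree links.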

I admit though, I'm fiddling with a code base without understanding the
larger picture it fits into.

Kind regards, Axel

On Tue, Jun 20, 2023 at 3:11 PM Axel Kittenberger <axkibe at gmail.com> wrote:

> Hello,
>
> First let me say, I really don't want this to be a complaint. I'm just
> wondering.
>
> I considered switching my department's default compiler from pdflatex to
> lualatex.
> Some subtle differences were to be expected, and so far they were easily
> catered for with test documents.
> The output is okay.
>
> However what surprised me is a complicated test document which took ~150
> seconds with pdflatex now takes 210 seconds with lualatex.
>
> Trying to figure out whether this is caused by some of the many packages
> it loads, I kept simplifying.
>
> --- laliptest.tex ---
> \documentclass{article}
> \input{plipsum}
> \begin{document}
> \newcount\ii
> \ii=100
> \loop
>   \lipsum{1-100}
>   \advance\ii-1
> \ifnum \ii>0
> \repeat
> \end{document}
> ---------
>
> This most simple document doesn't use any package except plipsum, which
> can be replaced with plain text too. Compile time results:
>
> pdflatex: user 0m1.920s (3.1 MB result)
> lualatex: user 0m17.565s (3.8 MB result)
>
> 8 times slower.
>
> Versions tested with:
> pdfTeX 3.141592653-2.6-1.40.24 (TeX Live 2022/Debian)
> This is LuaHBTeX, Version 1.15.0 (TeX Live 2022/Debian)
>
> Since LaTeX also includes a lot of stuff already, same tests with plain
> TeX.
>
> --- liptest.tex ---
> \input{plipsum}
> \newcount\i
> \i=100
> \loop
>   \lipsum{1-100}
>   \advance\i-1
> \ifnum \i>0
> \repeat
> \end
> ---------
> pdftex: user 0m1.053s (2.9 MB result)
> luatex: user 0m1.943s (3.1 MB result)
>
> This isn't as bad as the LaTeX variants, but still almost a factor two.
> Searching about this online turns up results about microtype or front
> loading etc.
> Both cannot be an issue, since microtype is off and frontloading must be a
> fixed offset, but the compile time increases linearly with document length.
>
> This now took me a while, but I managed to compile luatex with "-gp" to
> create a gprof profile and this is
> the result:
> ----------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  14.63      0.42     0.42  2409555     0.00     0.00  longest_match
>   8.71      0.67     0.25   295700     0.00     0.00  hnj_hyphen_hyphenate
>   8.19      0.91     0.24 52832741     0.00     0.00  get_sa_item
>   6.62      1.10     0.19      773     0.00     0.00  deflate_slow
>   3.48      1.20     0.10 30117352     0.00     0.00  char_info
>   2.79      1.28     0.08    10000     0.00     0.00  ext_do_line_break
>   2.79      1.36     0.08      773     0.00     0.00  compress_block
>   2.09      1.42     0.06  2978422     0.00     0.00  calc_pdfpos
>   2.09      1.48     0.06   515855     0.00     0.00  handle_lig_word
>   1.74      1.53     0.05 14032575     0.00     0.00  char_exists
>   1.74      1.58     0.05  4689611     0.00     0.00  flush_node
>   1.74      1.63     0.05  2896557     0.00     0.00  output_one_char
>   1.74      1.68     0.05   227877     0.00     0.00  hash_normalized
>   1.74      1.73     0.05    41510     0.00     0.00  hlist_out
>   1.74      1.78     0.05    23020     0.00     0.00  fix_node_list
>   1.74      1.83     0.05     2319     0.00     0.00  adler32_z
>   1.39      1.87     0.04   227877     0.00     0.00
>  hash_insert_normalized
>   1.39      1.91     0.04    39615     0.00     0.00  fm_scan_line
>   1.39      1.95     0.04    11510     0.00     0.00  hnj_hyphenation
>   1.05      1.98     0.03  3831639     0.00     0.00  get_x_token
>   1.05      2.01     0.03  2896557     0.00     0.00  get_charinfo_whd
>   1.05      2.04     0.03  2382502     0.00     0.00  add_kern_before
>   1.05      2.07     0.03   303962     0.00     0.00  luaS_hash
>   1.05      2.10     0.03    10000     0.00     0.00  ext_post_line_break
> -------
> So it's not like there is one function that takes the bulk of the
> slowdown, as I expected (and as often happens in reality, where an
> innocent-looking small thing takes up so much).
>
> longest_match() is something from zlib.
>
> I'm just really surprised. I have been following this project for a while
> now, since I consider it highly interesting, and since I read that one of
> the major steps was rewriting the TeX core from somewhat idiosyncratic WEB
> to C, I expected it to be even a bit faster...
>
> And this is the profile of pdftex in comparison.
> ----------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  29.48      0.51     0.51  2362906     0.00     0.00  longest_match
>  13.29      0.74     0.23  5876210     0.00     0.00  zdividescaled
>  11.56      0.94     0.20      775     0.00     0.00  deflate_slow
>   4.62      1.02     0.08    41510     0.00     0.00  pdfhlistout
>   4.62      1.10     0.08      774     0.00     0.00  compress_block
>   3.47      1.16     0.06        1     0.06     1.59  maincontrol
>   2.89      1.21     0.05      423     0.00     0.00  inflate_fast
>   2.31      1.25     0.04   227877     0.00     0.00
>  hash_insert_normalized
>   2.31      1.29     0.04    41510     0.00     0.00  zhpack
>   1.73      1.32     0.03 17821585     0.00     0.00  zeffectivechar
>   1.73      1.35     0.03   825830     0.00     0.00  zpdfprintint
>   1.73      1.38     0.03   260088     0.00     0.00  read_line
>   1.73      1.41     0.03   223361     0.00     0.00  pqdownheap
>   1.73      1.44     0.03    39615     0.00     0.00  fm_scan_line
>   1.45      1.47     0.03  1274937     0.00     0.00  zgetnode
>   1.16      1.49     0.02  2896157     0.00     0.00  zadvcharwidth
>   1.16      1.51     0.02   579800     0.00     0.00  ztrybreak
>   1.16      1.53     0.02   227877     0.00     0.00  hash_normalized
>   1.16      1.55     0.02    26742     0.00     0.00  zflushnodelist
>   0.87      1.56     0.02  1274936     0.00     0.00  zfreenode
>   0.58      1.57     0.01  4160738     0.00     0.00  getnext
>   0.58      1.58     0.01  3419912     0.00     0.00  zgetautokern
>   0.58      1.59     0.01  2896161     0.00     0.00  hasfmentry
>   0.58      1.60     0.01  2896159     0.00     0.00  isscalable
>   0.58      1.61     0.01  2896157     0.00     0.00  zpdfprintchar
> --------
> Both weren't exactly the same versions as tested previously; I compiled
> the newest texlive tagged as a release myself.
> (This is LuaTeX, Version 1.16.0 (TeX Live 2023))
> (pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023))
>
> Runtimes when compiled with -O3 are almost the same as with the native
> Debian builds above, and I profiled the plain TeX variants only.
>
> So zlib also takes a large share, relatively an even larger one, so it is
> not the culprit. The different implementation of hyphenation seems to be
> one factor I'd "blame".
>
> Turning it off with \language=255 improves it:
>
> pdftex: user 0m1.029s
> luatex: user 0m1.596s
>
> but there is still more,
> which is get_sa_item()/char_info().
>
> And reading the comments in managed-sa.c, it seems the main issue is
> being sparse? So I guess the way to improve that would be to not be
> sparse?
>
> Anyway, that was my report on this. Unfortunately I'm holding off on
> pushing it as the new default compiler for us, since the slowdown is a
> bad sell for something which is only sometimes useful.
>
> PS: personally I use Lua to calculate some more complex drawings for
> TikZ, as using a "normal" programming language is much easier for me than
> doing more complicated pgf macros. But in the end it just generates .tex
> code, which I simply feed into pdflatex; it only gets complicated as to
> which files people ought to change and which are autogenerated .tex files.
>
> Kind regards, Axel
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff
Type: application/octet-stream
Size: 4587 bytes
Desc: not available
URL: <https://tug.org/pipermail/luatex/attachments/20230621/5ac1f47e/attachment-0001.obj>

