[luatex] how many bytes for fontdimens?

Tue Aug 3 19:10:13 CEST 2021

On 8/3/2021 3:26 PM, jfbu wrote:

>> Le 3 août 2021 à 12:49, Hans Hagen <j.hagen at xs4all.nl> a écrit :
>>
>> On 8/3/2021 9:45 AM, jfbu wrote:
>>> Hi,
>>>> Le 3 août 2021 à 09:09, Hans Hagen <j.hagen at xs4all.nl <mailto:j.hagen at xs4all.nl>> a écrit :
>>>>
>>>> On 8/2/2021 9:37 PM, jfbu wrote:
>>>>> forgot to mention that I am aware a \fontdimen is limited to 2**30 strictly anyhow
>>>>> but my question is whether such « arrays » are stored 32bits or 64bits itemwise
>>>> it happens to be an array of 32 bit integers (that grows on demand) but such implementaiton details are unspecified (could as well have been a sparse array in which case each entry that is actually set has more
>>>>
>>>> also, the fact that it grow is a sort of side effect of the fact that tfm fonts can have 7 or more, but 7 are used, for text upto more for math fonts
>>>>
>>>> so, i wouldn't rely on these properties too much
>>> Thanks!
>>> Reason I asked is because I contributed an Eratosthenes Prime sieve to a github site comparing a whole bunch of langages and it is asked there to specify whether the « arrays » use 1bit, 8bits, 32bits, or 64bits (or « unknown ») per (potential) prime.
>>> https://github.com/PlummersSoftwareLLC/Primes/blob/drag-race/CONTRIBUTING.md#flag-storage <https://github.com/PlummersSoftwareLLC/Primes/blob/drag-race/CONTRIBUTING.md#flag-storage>
>>
>> ha, i've seen that one a few weeks ago (after some yt video) and to be honnest it's one of these useless speed comparisons (can be fun, but useless as one compares languages with different objectives and doing some prime stuff is hardly representative for usage)
> 
> 
> It seems the C++ solutions listed as fastest do quite a part of the job at compile time not at execution time...

that's why comparing script languages and compiled ones makes no sense 
(one can argue for jit but in e.g. luatex that doesn't pay of)

> The chosen topic (basic sieve and possibly some refinements like « wheel » algorithms) seems to test the capacity of the language to access successive memory addresses in the most efficient way and/or perhaps to move bit patterns around.

but that is not made clear -)

> Also, it stops at having the array done, but does not address how you use this after the fact.

indeed

> If the final aim is to print out primes to a file the bottleneck may be rather into the conversion of the bit pattern into explicit ascii bytes...

then lua might perform better because it's all about strings (although 
printing numbers is a bad example as they all differ here)

> For example, if I produce using lualatex a pdf file of all primes less than 999,999,999, most of the time by far is consumed by the typesetting phase (something like 1350 pages, 10 columns per page)

sure because quite some processing kicks in there (fonts, backend, 
columnization) but also the way that big array gets serialized

but 1350 pages for 999.999.999 is not that bad because here 10.000.000 
in 12 columns and an 8pt monospaced font takes 955 pages and indeed some 
runtime (using context that is, latex is supposed to be faster)

> This is what I do except for one auxiliary array in the « wheel » because I felt it would be cheating to allocate the 480 slots to start with. I could however have allocated say 1155 = 2310/2 (2310 = 2 * 3 * 5 * 7 *11) slots, wasting some.

even in lua you'd waste them and if oen goes hash there is more overhead 
(linked list) than in an array

> But this is executed only once anyhow, at loading time of the « sieve library » independently of the number of passes done during benchmark. Maybe I will do the change for some small gain.
> 
>>> I will thus modify the « bit count » tag of my « solution » from unknown to 32bits, thanks to your answer, knowing though that this remains officially unspecified. But the Dockerfile which I was asked to include, and which their benchmarking uses, pulls a texlive-minimal based image dating back to 2018.
>>> Perhaps someone here will be interested into contributing a genuine luatex (i.e. using Lua) solution (my code uses only Knuth TeX; there is also a LaTeX3 code also on the github site).
>>
>> a lua solution in luatex is just a lua solution -)
>>
>>> There is already at least one Lua contribution. I don’t know if a genuine luatex would have to be categorized under « PrimeTeX » or « PrimeLua » ...
>>> ... in particular a LuaTeX genuine solution may have a way to use an « array » not based on font dimension parameters.
>>
>> mixing lua and tex will also introduce lua call overhead so there is no gain there (maybe let lua do the sqrt but then you can well do all in lua)
>>
>> my guess is that the sqrt is the bottleneck
>>
>> fontdimens are actually bnto that slow not that slow because they are (1) global so no save stack overhead, and (2) directly accessible because they are part of the font structure (so no tex dimen access overhead)
>>
>> also, using etex \dimexpr is also slower than the simple operators
> 
> Not to mention that etex division rounds which sometimes is more inconvenient than truncating

(it's why in luametatex we have : for integer division, but that's 
another story)

> On the other hand having \numexpr at once disposal allows to more easily have some namespacing of auxiliary counters, using macros instead of \count's

hm. a bit expansive, \chardefs etc are cheaper

>>> One particular point I don’t know is whether LuaTeX would allow a « faithful » solution: this seems to mean roughly a class-encapsulated one (it is hard to understand what they precisely mean in their guidelines), which I could not really emulate in my code due to global nature of fontdimen assignments.
>>
>> hm, do you really need local?
> 
> honestly I don’t know exactly what they are aiming at, but I guess basically some piece of code you can re-use in arbitrary code as a library. Global is perhaps not that bad, but having no way to release the memory disqualifies it for faithfulness I believe. The memory can be marked for re-use, but one can not extend the array size after the fact.

font memory is never released anyway

> anyway, at least some namespace should be added to my solution, I did it (weirdly) for some aspects but not all

>> if you use csnames, then you can also consider using \chardef's for numbers (these obey grouping)
> 
> ok, the extended range of \char in luatex would indeed help if one needed to store numbers up to the sieving range of 1,000,000. I have some alternative algorithm needing this kind of storage, but a priori when I mentioned csnames it was only to code a « on/off » situation (testing if some number has been marked for composite or no via an \ifcsname, using e-TeX for convenience). Unfortunately this has the overhead of requiring to convert a number into explicit digit tokens.

there is of course a limit in the size of the hash table that you can hit

> or is there a way in luatex to tag objects by numbers not having been converted to digit tokens?

not in luatex

>>> (I also experimented with  a csname based approach but never could reach comparable speed to fontdimen arrays ; and this required extending other parts of the memory)
>>
>> in luatex csname is costly because of the serialization (pdftex is probably faster because there is no utf related overhead)
> 
> I did not compare the two. For the fontdimen based approach, at first pdftex looks faster but on closer look for ranges higher than 1,000,000, it seems luatex gets relatively faster, regarding the process of sieving itself.

hm, maybe hash misses kick in which means a chain lokup (maybe be faster 
in luatex than in pdftex)

> For example I get this typically on my (slow) machine for sieving up to 100,000,000:
> 
> pdftex
> 
> Instantiate object for sieving up to 100000000...
> ...done (0.2322s)
> Sieving...
> ...done (29.4603s)
> Outputting to file listofprimes-100000000.txt...
> 5761455 primes were written to file listofprimes-100000000.txt
> ...done (22.96228s)
>   ) )
> No pages of output.
> Transcript written on wheel_primestofile.log.
> 
> real	0m52.898s
> user	0m52.502s
> sys	0m0.281s
> 
> luatex
> Instantiate object for sieving up to 100000000...
> ...done (3.84619s)
> Sieving...
> ...done (26.5682s)
> Outputting to file listofprimes-100000000.txt...
> 5761455 primes were written to file listofprimes-100000000.txt
> ...done (28.48254s)
> ))
> warning  (pdf backend): no pages of output.
> Transcript written on wheel_primestofile.log.
> 
> real	0m59.444s
> user	0m58.895s
> sys	0m0.445s
> 
> 
> The sieving itself seems being done faster by luatex. For some reason the « write to file » part appears slower. As  the « write to a file » was not part of benchmark, I simply issue one \write per prime and it is quite possible gathering together the \write’s by batches could improve.  Not tested.

could be because there is some utf juggling but it shoul dnot make a 
huge dent

> Something I don’t understand is that the spread between the « base » algorithm (basically the elementary school one, but using only odd numbers from the start, and maybe starting at factor*factor and going by steps of 2*factor) and the « wheel » algorithm (with some primes already sieved out) seems to be greater with pdftex than with luatex: perhaps 2.8 or 2.9 speed gain for pdftex, but only like 2.3 or at most 2.5 with luatex.

hard to say ... maybe cpu cache (different programs, different code)

>>> Here is a link to how the various implementations sort out currently on one specific machine:
>>> https://plummerssoftwarellc.github.io/PrimeView/?sc=dt&sd=True&rc=30 <https://plummerssoftwarellc.github.io/PrimeView/?sc=dt&sd=True&rc=30>
>> the lua solution they post is not only somewhat slow but also makes some (imo wrong, but who am i to claim) assumptions about how lua stores data so it was not that hard to make a variant that was over 200 times faster
> 
> ah very interesting. I was also very surprised that the Lua solution appeared so slow relatively.

the latets version posted there is already better but i'll mail you an 
variant of that that is about three times faster

- basic performance        : 1.90
- make functions local     : 1.84
- inline the two functions : 1.52
- use integer division     : 1.27
- some more                : 0.71

> Notice by the way that the first Python solution is also very slow... slower than my tex one! (to be fair, slower than my « wheel » one, but Python uses the « base » algorithm)

object oriented pverhead i guess (the few times that i ran into some 
python code it was easy to make faster lua variants)

>> because i have a relative old laptop i can't compare with the numbers for e.g. c there (of course lua will be slower)
> 
> I think the machine they use for the « PrimeView » is also relatively old (2013). Mine is 2012 and is faster than theirs.
> 
>> but as i consider these shootouts useles anyway, i didn't want to spend more time on it (all that docker stuff and such)
> 
> I confirm I have spent between 5x and 10x more time with these contributing requirements than with the TeX coding itself. In the end the Dockerfile is very simple but it went through more complicated intermediate versions. I did not want to pull a 4G or 5G texlive image though and initially I was using pdftex but needed to understand how to let it acquire more memory, then I used luatex but I needed to add \pdfresettimer, which finally I did via the \directlua in the tex file itself.
> 
>> nor comment on the posted lua code (i never comment on code anyway, unless I know someone well and we can discuss specific issues out of mutual interest)
> 
> This is wise indeed.
> 
> I know some under-optimal aspects of my contribution and I have not tested some aspects like whether at some location I should use \ifdim rather than \ifnum, or even some other way of testing. I also hesitated regarding the wheel algorithm between using only one giant array, or 480 smaller ones indexed on class modulo 2310, the latter approach reducing the memory impact (by a factor of 1155/480).

it looked at the code but it will take me too much time to check where 
it can be made faster (saw a few spots) as i have to make sieve only 
macro then

> Also let me point out that during the various squashes before my PR was merged, I accidentally lost crediting the fact that the \Replicate I use, copied over from xint with removal of e-TeX, was originally basically cloned from the LaTeX3 and has a decades old history I trust.

dunno, loops can be very fast when done right (and when on eknows where 
tex's bottlenecks are -)

> I have also not tried to optimise how this \Replicate should be used, deciding to nest it with an inner one replicating 1000 times, but never testing if that was actually good.  As the benchmarking overthere is done only for 1,000,000 range, there is danger also to make code seemingly general but in fact tailored for that range only, which would not be entirely fair, imho.

maybe don't pass the number but put it in a register ; this romannumeral 
is also kind of strange (not sure why you need it ; actually you should 
time romannumeral in pdftex vs luatex .. could be more efficient in 
thelater

>> (messing with bits and storing efficiently in lua probably costs more than it saves, and the same might be true in tex)
> 
> Sadly, the nice coincidence that TeX dimensions allow 30 bits and that 30=2*3*5 does not help much as I know no native TeX way to do bitwise operations. We only need two: setting a bit and querying it. This can de done with existing TeX arithmetic and/or macros but will be costly. I did not even try it out.

this is off topic but in luametatex we can do

\scratchcounter 1

\scratchcounter \numexpression \scratchcounter bor 4\relax

\the\scratchcounter : 5

which performs quite ok (1 milion times takes .27 sec on my machine)

> I also know of no way to initialize a \fontdimen array of a specific size with a specific pattern such as  some non-zero dimension repeated all the way, apart naturally to go through assigning all of them one by one, where the fact that always the same value is used brings no gain whatsoever.
ok, but when you pack bits you do 32 per stel so you migth end up in the 
'seconds' range

i'll mail you my test file

Hans

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------