[luatex] fio library byte order

Sat Jun 27 09:59:43 CEST 2020

On 6/27/2020 2:56 AM, Reinhard Kotucha wrote:
> On 2020-06-20 at 10:25:33 +0200, Hans Hagen wrote:
> 
>   > On 6/19/2020 11:16 PM, Reinhard Kotucha wrote:
>   > > Hi,
>   > > it's nice that with the fio library LuaTeX can now process binary
>   > > files.  What I'm missing is the ability to specify the byte order
>   > > (little vs. big endian).
>   > >
>   > > I didn't find any hint, neither in the manual nor in the sources.
>   > >
>   > > Without being able to specify the byte order usage of the library is
>   > > quite limited.  The byte order of a particular file is not necessarily
>   > > the same as that of your system.
>   > >
>   > > Some file formats have a certain byte order (PNM, for instance) and
>   > > others precede binary data with a byte order mark (TIFF).  In any case
>   > > it's necessary to specify the byte order before reading binary stuff.
>   > >
>   > > Is there a chance to provide a switch?
>   >
>   > When I have time I'll backport a couple of the additional
>   > [integer|cardinal]*_le ones that we have in luametatex (I thought that
>   > I'd already done that).
> 
> Hi Hans,
> I must admit that I don't know anything about luametatex.  I just
> looked into liolibext.c .
> 
> IMO there are a few things to consider.
> 
> The current code extracts single bytes from a file.
> 
>   | static int readcardinal2(lua_State *L) {
>   |     FILE *f = tofile(L);
>   |     int a = getc(f);
>   |     int b = getc(f);
>   |
> 
> This, and even the extraction of short strings, is extremely slow.
> It's much more efficient to read data blocks of 8192 bytes, for
> instance, into memory and to process these data blocks.  I'm not
> convinced that reading a complete file into memory is a good idea,
> despite its simplicity.

that would add all kinds of overhead (buffer underrun, adapting to seek 
etc., and therefore reloading) and we can assume that the operating system 
also buffers
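
A minimal sketch of the point about buffering, in C. It assumes a plain stdio stream; the helper names read_cardinal2_be and use_block_buffer are hypothetical and not taken from liolibext.c. Besides the operating system, the C library normally buffers file streams too, so each getc() is usually served from memory, and setvbuf() can request a block size such as the 8192 bytes mentioned above.

```c
#include <stdio.h>

/* Hypothetical helper: read a 2-byte big-endian cardinal with getc().
   Returns the value, or -1 if the file ends before two bytes are read. */
static long read_cardinal2_be(FILE *f)
{
    int a = getc(f);
    int b = getc(f);
    if (a == EOF || b == EOF)
        return -1;
    return ((long) a << 8) | (long) b;
}

/* Hypothetical helper: ask stdio for a fully buffered stream with an
   8192-byte buffer. It has to be called before any other operation on
   the stream. Returns 0 on success, as setvbuf() does. */
static int use_block_buffer(FILE *f)
{
    return setvbuf(f, NULL, _IOFBF, 8192);
}
```

With this in place the byte-wise reader keeps its simple form, while the actual read system calls happen per block rather than per byte.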

> Processing the content of a file with the fio library is then similar
> to processing a string with the sio library, with the exception that
> endianness has to be considered when files are involved.

it depends on what one does, sometimes a full load and using sio is 
faster but that also has its overhead (pseudo seek)

as usual i did lots of (performance) tests and there is not that much to 
gain on either end (several variants were played with)

> The host byte order must always be determined automatically, either
> with Luigi's approach or probably more easily with ntohs(3) if this
> function is available on Windows too.  The file byte order has to be
> specified by the user because it depends on the file format.

the lib is meant for usage in known scenarios (known, documented file 
formats), not arbitrary ones that depend on architecture or implementation
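
To illustrate both positions, here is a small C sketch; all function names are hypothetical. The first function detects the host byte order at run time, as proposed in the quoted text. The other two assemble a value from individual bytes with shifts, which gives the same result on any host, so for a documented file format the host byte order does not need to be known at all.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise,
   by looking at the first byte of a known 16-bit value in memory. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical helpers: decode two bytes stored in a fixed, documented
   order. The shifts work on values, not on memory layout, so the result
   is independent of the host byte order. */
static uint16_t load_cardinal2_be(const unsigned char *p)
{
    return (uint16_t) ((p[0] << 8) | p[1]);
}

static uint16_t load_cardinal2_le(const unsigned char *p)
{
    return (uint16_t) (p[0] | (p[1] << 8));
}
```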

(btw, the format file used to normalize to big endian but that was 
dropped long ago already: formats are no longer portable, which in fact 
was already dropped before that)

> If a particular file format has a BOM in its header, the BOM can be
> evaluated by the user, for instance with fio.readline().  This means
> that a user should be able to specify the endianness at any time, not
> necessarily in advance.

sure but a few extra readers would solve that
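
What such extra readers could look like is sketched below in plain C. This is an assumption about their general shape, not the code in luametatex; the names and the return convention (1 on success, 0 at end of file) are hypothetical. Each reader hardwires the little-endian order, so nothing has to be passed or checked per call.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical reader: unsigned 16-bit value, least significant byte first. */
static int read_cardinal2_le(FILE *f, uint16_t *out)
{
    int a = getc(f);
    int b = getc(f);
    if (a == EOF || b == EOF)
        return 0;
    *out = (uint16_t) (a | (b << 8));
    return 1;
}

/* Hypothetical reader: unsigned 32-bit value, least significant byte first. */
static int read_cardinal4_le(FILE *f, uint32_t *out)
{
    int a = getc(f);
    int b = getc(f);
    int c = getc(f);
    int d = getc(f);
    if (a == EOF || b == EOF || c == EOF || d == EOF)
        return 0;
    *out = (uint32_t) a | ((uint32_t) b << 8)
         | ((uint32_t) c << 16) | ((uint32_t) d << 24);
    return 1;
}

/* Hypothetical reader: signed 16-bit value (two's complement),
   least significant byte first. */
static int read_integer2_le(FILE *f, int16_t *out)
{
    uint16_t u;
    if (!read_cardinal2_le(f, &u))
        return 0;
    *out = (int16_t) (u >= 0x8000 ? (int32_t) u - 0x10000 : (int32_t) u);
    return 1;
}
```

A caller that has evaluated a byte order mark can then simply pick the big-endian or the little-endian set of readers for the rest of the file.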

> As far as I understand it's sufficient that the relevant functions
> read{cardinal,integer}{2,4} obey a flag which tells them whether byte
> re-ordering is necessary.  The flag has to be set if host and file
> byte orders are different.  I don't know whether we have to consider
> 64 bit integers too.

that adds passing parameters and checking them for each call ... you can 
then as well use lua's 'read' function and convert with string.byte/char 
which is then about equally fast
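
For comparison, the flag-based variant from the quoted text could be sketched as follows. The function name and the swap parameter are hypothetical. The value is read in host order with fread() and the bytes are exchanged only when the caller says that host and file byte orders differ; the extra argument and the test on it are the per-call cost referred to above.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical reader: read a 16-bit value as it lies in the file and
   swap its two bytes if 'swap' is nonzero, that is, if the file byte
   order differs from the host byte order.
   Returns 1 on success, 0 if fewer than two bytes could be read. */
static int read_cardinal2_flagged(FILE *f, int swap, uint16_t *out)
{
    uint16_t v;
    if (fread(&v, sizeof v, 1, f) != 1)
        return 0;
    if (swap)
        v = (uint16_t) ((v >> 8) | (v << 8));
    *out = v;
    return 1;
}
```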

> If you intend to go this way the number of functions in liolibext.c
> can be halved because there is no significant difference between a
> buffer and a string.  Only very few functions have to be aware of
> endianness.

halved in calls to simple functions, enlarged by more checking ... more 
pain than gain

> There is one difference though.  A string is always complete while a
> buffer contains only a part of a file.  If there are not enough
> bytes at the end of a buffer in order to fulfill a request, the
> missing bytes can be loaded from the file and appended to the buffer.
> This has no significant impact on speed because it happens quite
> rarely.  It's similar to the example in PIL, chapter 'The complete I/O
> Model', section 'A small performance trick'.
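
The refill step described in this paragraph can be sketched in C roughly as follows. All names are hypothetical and this is not code from liolibext.c. It is a slight variation on the description: instead of appending to a growing buffer, the unread tail is moved to the front of a fixed 8192-byte buffer and the rest is topped up from the file.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8192

typedef struct {
    FILE *f;
    unsigned char data[CHUNK];
    size_t pos;   /* index of the next unread byte */
    size_t len;   /* number of valid bytes in data */
} byte_buffer;

static void buffer_init(byte_buffer *b, FILE *f)
{
    b->f = f;
    b->pos = 0;
    b->len = 0;
}

/* Make sure at least n unread bytes are in the buffer (n <= CHUNK).
   Returns 1 on success, 0 if the file does not have that many left. */
static int buffer_need(byte_buffer *b, size_t n)
{
    size_t left = b->len - b->pos;
    if (left >= n)
        return 1;
    if (n > CHUNK)
        return 0;
    /* keep the unread tail, then load more bytes behind it */
    memmove(b->data, b->data + b->pos, left);
    b->pos = 0;
    b->len = left + fread(b->data + left, 1, CHUNK - left, b->f);
    return b->len >= n;
}

/* Read an n-byte (1 to 4) big-endian cardinal from the buffer. */
static int buffer_cardinal_be(byte_buffer *b, size_t n, uint32_t *out)
{
    uint32_t v = 0;
    size_t i;
    if (n < 1 || n > 4 || !buffer_need(b, n))
        return 0;
    for (i = 0; i < n; i++)
        v = (v << 8) | b->data[b->pos++];
    *out = v;
    return 1;
}
```

The reload only happens when a request runs past the end of the current block, which is the rare case the paragraph refers to.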
> 
> If the user doesn't specify a byte order we can assume host byte
> order.  I can't imagine any reasonable use case right now, except
> if a temporary file is read by the same process that created it.

as we have lua 5.3 you can consider using the string.unpack function

Hans

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------