[luatex] fio library byte order

Sat Jun 27 02:56:33 CEST 2020

On 2020-06-20 at 10:25:33 +0200, Hans Hagen wrote:

 > On 6/19/2020 11:16 PM, Reinhard Kotucha wrote:
 > > Hi,
 > > it's nice that with the fio library LuaTeX can now process binary
 > > files.  What I'm missing is the ability to specify the byte order
 > > (little vs. big endian).
 > >
 > > I didn't find any hint, neither in the manual nor in the sources.
 > >
 > > Without being able to specify the byte order, usage of the library
 > > is quite limited.  The byte order of a particular file is not
 > > necessarily the same as that of your system.
 > >
 > > Some file formats have a certain byte order (PNM, for instance) and
 > > others precede binary data with a byte order mark (TIFF).  In any case
 > > it's necessary to specify the byte order before reading binary stuff.
 > >
 > > Is there a chance to provide a switch?
 >
 > When I have time I'll backport a couple of the additional
 > [integer|cardinal]*_le ones that we have in luametatex (I thought that
 > I'd already done that).

Hi Hans,
I must admit that I don't know anything about luametatex.  I just
looked into liolibext.c.

IMO there are a few things to consider.

The current code extracts single bytes from a file.

 | static int readcardinal2(lua_State *L) {
 |     FILE *f = tofile(L);
 |     int a = getc(f);
 |     int b = getc(f);
 |

This, and even the extraction of short strings, is extremely slow.
It's much more efficient to read data blocks of 8192 bytes, for
instance, into memory and to process these data blocks.  I'm not
convinced that reading a complete file into memory is a good idea,
despite its simplicity.
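
A minimal sketch of block-wise reading, assuming plain stdio.  The
helper name sum_bytes_blockwise and the BLOCK_SIZE macro are made up
for illustration and are not part of liolibext.c.  How much this gains
over getc() depends on the buffering the C library already does.

```c
#include <stdio.h>
#include <stddef.h>

#define BLOCK_SIZE 8192  /* block size suggested above; an assumption */

/* Hypothetical helper: add up all bytes of a stream, fetching them in
   blocks with fread() instead of calling getc() once per byte. */
static unsigned long sum_bytes_blockwise(FILE *f) {
    unsigned char block[BLOCK_SIZE];
    unsigned long sum = 0;
    size_t n;

    /* fread() returns the number of bytes actually read; 0 ends the loop. */
    while ((n = fread(block, 1, sizeof block, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            sum += block[i];
    }
    return sum;
}
```

Only one block is held in memory at a time, so the file size does not
matter.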

Processing the content of a file with the fio library is then similar
to processing a string with the sio library, with the exception that
endianness has to be considered when files are involved.

The host byte order must always be determined automatically, either
with Luigi's approach or probably more easily with ntohs(3) if this
function is available on Windows too.  The file byte order has to be
specified by the user because it depends on the file format.
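
One possible way to determine the host byte order at run time, as a
sketch.  It uses neither ntohs(3) nor any networking header, and I
don't claim that it matches Luigi's approach.  The function name
host_is_little_endian is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host and 0 on a
   big-endian host by looking at the first byte of the value 1. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);  /* lowest-addressed byte of probe */
    return first == 1;
}
```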

If a particular file format has a BOM in its header, the BOM can be
evaluated by the user, for instance with fio.readline().  This means
that a user should be able to specify the endianness at any time, not
necessarily in advance.
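
As an illustration of evaluating such a mark: a TIFF file starts with
the two bytes "II" for little endian or "MM" for big endian.  The
sketch below only inspects two header bytes that were already read; the
enum and function names are made up and do not exist in the fio
library.

```c
#include <stddef.h>

/* Hypothetical result codes for the byte order of a file. */
enum { ORDER_UNKNOWN = -1, ORDER_BIG = 0, ORDER_LITTLE = 1 };

/* Hypothetical helper: classify the first two bytes of a TIFF header.
   "II" marks little endian, "MM" marks big endian. */
static int tiff_byte_order(const unsigned char *hdr) {
    if (hdr == NULL)
        return ORDER_UNKNOWN;
    if (hdr[0] == 'I' && hdr[1] == 'I')
        return ORDER_LITTLE;
    if (hdr[0] == 'M' && hdr[1] == 'M')
        return ORDER_BIG;
    return ORDER_UNKNOWN;
}
```

The result would then be used to set the file byte order before the
first multi-byte value is read.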

As far as I understand, it's sufficient that the relevant functions
read{cardinal,integer}{2,4} obey a flag which tells them whether byte
re-ordering is necessary.  The flag has to be set if host and file
byte orders are different.  I don't know whether we have to consider
64-bit integers too.
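
A sketch of what such a flag could look like, assuming the bytes are
already in a memory buffer.  The names swap_bytes, swap16, swap32,
get_cardinal2 and get_cardinal4 are hypothetical; this is not the code
in liolibext.c.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flag: non-zero if host and file byte order differ. */
static int swap_bytes = 0;

static uint16_t swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Copy the raw bytes into a host integer, then re-order on demand. */
static uint16_t get_cardinal2(const unsigned char *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return swap_bytes ? swap16(v) : v;
}

static uint32_t get_cardinal4(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return swap_bytes ? swap32(v) : v;
}
```

The signed variants could reuse the same helpers and cast the result.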

If you intend to go this way the number of functions in liolibext.c
can be halved because there is no significant difference between a
buffer and a string.  Only very few functions have to be aware of
endianness.

There is one difference though.  A string is always complete while a
buffer contains only a part of a file.  If there are not enough
bytes at the end of a buffer in order to fulfill a request, the
missing bytes can be loaded from the file and appended to the buffer.
This has no significant impact on speed because it happens quite
rarely.  It's similar to the example in PIL, chapter 'The complete I/O
Model', section 'A small performance trick'.
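
A rough sketch of such a buffer, under the assumption that no single
request is larger than one block.  The type byte_buffer and the
functions buffer_need and buffer_take are invented for this example.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192  /* assumed block size */

/* Hypothetical buffer state: one block plus read position. */
typedef struct {
    FILE *f;
    unsigned char data[BLOCK_SIZE];
    size_t pos;  /* index of the next unread byte */
    size_t len;  /* number of valid bytes in data */
} byte_buffer;

/* Make sure n unread bytes are available.  Returns 1 on success and
   0 if the file ends first or n exceeds the block size. */
static int buffer_need(byte_buffer *b, size_t n) {
    size_t avail = b->len - b->pos;

    if (avail >= n)
        return 1;
    if (n > BLOCK_SIZE)
        return 0;
    /* Move the unread tail to the front and append fresh data. */
    memmove(b->data, b->data + b->pos, avail);
    b->pos = 0;
    b->len = avail;
    b->len += fread(b->data + b->len, 1, BLOCK_SIZE - b->len, b->f);
    return b->len >= n;
}

/* Return a pointer to the next n bytes and advance, or NULL. */
static const unsigned char *buffer_take(byte_buffer *b, size_t n) {
    const unsigned char *p;

    if (!buffer_need(b, n))
        return NULL;
    p = b->data + b->pos;
    b->pos += n;
    return p;
}
```

The refill only happens when a request runs past the end of the
buffer.  The returned pointer is valid until the next call.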

If the user doesn't specify a byte order, we can assume host byte
order.  I can't imagine any reasonable use case for this default right
now, except when a temporary file is read by the same process that
created it.

Regards,
  Reinhard

--
------------------------------------------------------------------
Reinhard Kotucha                            Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover                    mailto:reinhard.kotucha at web.de
------------------------------------------------------------------