[tlbuild] upmendex, U_INVALID_FORMAT_ERROR, U_ICUDATA_ENTRY_POINT, isBigEndian

Karl Berry karl at freefriends.org
Tue Apr 5 01:26:23 CEST 2016


Regarding the failure of upmendex.test on 32-bit freebsd et al.

The short answer: the workaround I found is to explicitly pass the
compiler flag -DU_IS_BIG_ENDIAN=0 (in both CFLAGS and CXXFLAGS).
(This should not be needed in any other environment.)

I did my test build (just upmendex) like this:
env TL_COMPILER_GFLAGS="-g -DU_IS_BIG_ENDIAN=0" \
    $HOME/src/Build -g \
      CC=/usr/bin/gcc CXX=/usr/bin/g++ \
      --disable-all-pkgs --enable-upmendex --enable-missing
but any method of getting the symbol #defined should do as well.
(TL_COMPILER_GFLAGS is my little kludge that exists only in TL ./Build
script for including extra flags everywhere.)

I was working on FreeBSD 9.1, using the system-provided gcc and g++ as shown.

You can check whether this is the problem by running gdb on the upmendex
binary: set a breakpoint at main, run, then execute the gdb command
  print (DataHeader) icudt57_dat
The result should show the header of the icu data structure, and you can
see the value of isBigEndian there.

For me, it was (incorrectly) 1, resulting in upmendex.test aborting with
U_INVALID_FORMAT_ERROR being reported in, e.g.,
Work/texk/upmendex/tests/upmendex.log.  Setting -DU_IS_BIG_ENDIAN=0
changed that.

Nikola and Apostolos, I fervently hope it will work for you too.


-- the rest of this message is laborious detail about my debugging day
that I wanted to write down while it is still in my head (it won't be
tomorrow ...) --

The proximate cause of the U_INVALID_FORMAT_ERROR is this code in
icu-src/source/common/ucmndata.c (function udata_checkCommonData a.k.a.
udata_checkCommonData_57), lines 322ff --

    } else if(!(udm->pHeader->dataHeader.magic1==0xda &&
        udm->pHeader->dataHeader.magic2==0x27 &&
        udm->pHeader->info.isBigEndian==U_IS_BIG_ENDIAN &&
        udm->pHeader->info.charsetFamily==U_CHARSET_FAMILY)
        ) {
        /* header not valid */
        *err=U_INVALID_FORMAT_ERROR;

The magic bytes are correct, and the header value
udm->pHeader->info.charsetFamily does equal U_CHARSET_FAMILY (both are
zero), but isBigEndian is 1, so the isBigEndian comparison fails and
U_INVALID_FORMAT_ERROR is returned.
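
To make the failing condition concrete, here is a minimal C sketch of
that four-way check.  It is not the ICU code: the SketchHeader struct,
the SKETCH_* macros and sketch_header_ok are names made up for
illustration, and the struct holds only the four fields tested above,
not the real header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the ICU data header: only the
   four fields tested in the quoted code, NOT the real layout. */
typedef struct {
    uint8_t magic1;
    uint8_t magic2;
    uint8_t isBigEndian;
    uint8_t charsetFamily;
} SketchHeader;

/* Assumed stand-ins for U_IS_BIG_ENDIAN and U_CHARSET_FAMILY as they
   should come out on x86. */
#define SKETCH_IS_BIG_ENDIAN 0
#define SKETCH_CHARSET_FAMILY 0

/* Returns 1 if the header passes all four comparisons, else 0. */
int sketch_header_ok(const SketchHeader *h)
{
    assert(h != NULL);
    return h->magic1 == 0xda &&
           h->magic2 == 0x27 &&
           h->isBigEndian == SKETCH_IS_BIG_ENDIAN &&
           h->charsetFamily == SKETCH_CHARSET_FAMILY;
}
```

With a header of {0xda, 0x27, 1, 0}, three of the comparisons hold and
the function still returns 0, which corresponds to the situation
described above.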

Now, since this is on x86, isBigEndian should not be 1.  (It is 0 on my
i386-linux system.  And I strongly suspect it is 0 on the 64-bit freebsd
where the test succeeds.)

The huge ICU data set contains the original header data.  It is linked
as a symbol in the library, generically named U_ICUDATA_ENTRY_POINT,
specifically named (before the horrible cpp hacking) icudt57_dat.
Evidently this symbol is defined in stubdata.c:41ff --

U_EXPORT const ICU_Data_Header U_ICUDATA_ENTRY_POINT = {
    ...
#if U_IS_BIG_ENDIAN
        1,
#else
        0,
#endif

However, when I looked at the preprocessor output from stubdata.c, or
indeed the .ao or .a files themselves, U_IS_BIG_ENDIAN is 0.

So how is it getting switched to 1 by the time the rest of the code sees
this structure?  The udm structure being looked at in the code above seems
to be getting copied directly from U_ICUDATA_ENTRY_POINT, in
udata.cpp:openCommonData --
  setCommonICUDataPointer(&U_ICUDATA_ENTRY_POINT, FALSE, pErrorCode);

U_IS_BIG_ENDIAN is ultimately defined by the cpp mess in
icu-src/source/common/unicode/platform.h.  Comparing to the compiler
predefinitions, I saw no obvious reason why it would turn out to be 1.
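
One way to cross-check a compile-time endianness macro against the
machine itself is a small runtime probe.  The following is only a
sketch under my own assumptions: probe_is_big_endian and
endian_macro_matches_host are hypothetical helper names, nothing from
ICU, and the probe handles only plain big- and little-endian hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Runtime probe: returns 1 on a big-endian host, 0 on a little-endian
   one, by looking at the first byte of a known 32-bit value. */
int probe_is_big_endian(void)
{
    const uint32_t word = 0x01020304u;
    unsigned char first;

    memcpy(&first, &word, 1);
    /* Mixed-endian layouts are deliberately not handled here. */
    assert(first == 0x01 || first == 0x04);
    return first == 0x01;
}

/* Returns 1 if a compile-time macro value (nonzero meaning big-endian)
   agrees with what the probe sees on this host, else 0. */
int endian_macro_matches_host(int macro_value)
{
    return (macro_value != 0) == probe_is_big_endian();
}
```

Calling endian_macro_matches_host(U_IS_BIG_ENDIAN) from a file that
includes the ICU headers and is compiled with the same flags as ICU
would show whether the macro agrees with the host for that compilation.
It says nothing about where a mismatch comes from.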

Unfortunately, I could not find where the data is being messed with
between the stubdata compilation on the one hand and invoking the binary
on the other.  Or even discern if the problem is at build time (of the
libicu*.a libraries) or runtime (somehow before main() starts).  But
since the explicit #define seems to suffice, letting it go ...

More points:

- the bibtexu.test is not a test.  It just invokes --version to be sure
there is a binary (pretty pointless).  I expect any actual use of
bibtexu would fail in the same way as upmendex, though I did not try it.

- xetex does not call the ICU collator functions, so its use of ICU
differs from upmendex in that regard.  That might not be important;
seems more likely the difference is in their respective builds, but
I couldn't find it.  isBigEndian was always 0 in my test xetex runs.

- icu's configure takes it upon itself to prefer clang and clang++ above
all others.  Thus clang will be used for icu if it is available (as it
is on freebsd 9), regardless of what is used for the rest of TL.  In my
case, that did not make a difference, except for additional agony and
lost time in debugging.  This is why I explicitly set CC= and CXX= in my
invocation above; then those compilers are used for icu too.

I guess that's it.  More than enough.  Hope I never have to come back here.

-karl

