[pdftex] OT: Unicode and typesetting
Michael Chapman
chapman at mchapman.com
Mon Apr 4 08:04:46 CEST 2005
I know Unicode is 'off topic' for this list, except for its use as an input
encoding (and this message is not specifically about that).
What I do want to ask is the views of expert typesetters on whether Unicode
measures up to its claims (or their expectations).
I have recently been typesetting a Japanese legal text and the following
'features' have confused me.
Much is made of the fact that "the standard defines how characters are
interpreted, and not how glyphs are rendered"[ref.1]
A common example is the lowercase a, which may be 'round' (as in Adobe's
'built in' AvantGarde) or 'open' (as in the built-in Helvetica) while
remaining one character. Yet many characters are still duplicated.
1. Character x3000 is a Japanese (really CJK) space. Our old
friend x20 is the normal ASCII space.
The reason for having an ideographic space seems to be the need for a fixed
width space (the same width as one ideograph). But surely that is a glyph
issue?
Courier needs fixed-width spaces too, and the digits in many (?all) fonts are
fixed width (so that accounting tables align).
There are lots more examples (fixed width brackets, to name but one other).
To me these seem to be locale issues, solvable (_if_ they are a problem at
all) by declaring the language used, e.g. <div xml:lang="ja">FIXED WIDTH
(JAPANESE) <span xml:lang="en">(the language of Japan, noted in Latin
characters)</span> CHARACTERS</div>, or whatever.
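
To make point 1 concrete, here is a minimal sketch (Python, standard library
only) of what the Unicode character database records for the two spaces. The
helper name fold_width is my own invention for illustration. It suggests that
the standard itself treats x3000 as a 'wide' compatibility variant of x20, so
a compatibility normalization (NFKC) folds one into the other:

```python
import unicodedata

SPACE = "\u0020"              # ordinary ASCII space
IDEOGRAPHIC_SPACE = "\u3000"  # CJK full-width space

# Print what the character database says about each space.
for ch in (SPACE, IDEOGRAPHIC_SPACE):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "east_asian_width=" + unicodedata.east_asian_width(ch),
        "decomposition=" + repr(unicodedata.decomposition(ch)),
    )


def fold_width(text: str) -> str:
    """Hypothetical helper: fold compatibility variants (including
    full-width forms) onto their ordinary counterparts via NFKC."""
    return unicodedata.normalize("NFKC", text)


# The ideographic space and a full-width bracket fold to their ASCII forms.
assert fold_width(IDEOGRAPHIC_SPACE) == SPACE
assert fold_width("\uff08") == "("
```

Note that NFKC is lossy: it is reasonable for building a search key, but
probably not something to apply to the text that actually gets typeset.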
2. This duplication of characters becomes even stranger with
'foreign type glyphs that ASCII users might like', e.g.:
The Angstrom sign (Å: &Aring; in HTML speak, I think) has its own code point
(x212B), different from 'A with ring above' (xC5), let alone the fact that
you can 'build your own' glyph: x41 x30A.
That there is an Angstrom sign code point (and a degrees Celsius (x2103) and
degrees Fahrenheit (x2109)) is a boon for text searching. One can find all
the measurements in Angstroms in a text, even if that text is in a language
that uses circles on top of vowels.
But being able to search for kilometres (let alone metres: 'm') would be
equally (if not more) useful.
It is not even as if x212B is some kind of symbolic link to xC5 for legacy
purposes. There are two distinct code points.
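
For what it is worth, the normalization data does seem to link the three
spellings: x212B has a canonical decomposition to xC5, so a canonical
normalization (NFC) maps the Angstrom sign, the precomposed letter and the
'build your own' sequence onto one code point. A minimal sketch in Python
(standard library only; the function name nfc is just a local shorthand):

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"            # ANGSTROM SIGN
A_WITH_RING = "\u00c5"              # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_COMBINING_RING = "A\u030a"   # x41 followed by COMBINING RING ABOVE


def nfc(text: str) -> str:
    """Local shorthand for canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


for form in (ANGSTROM_SIGN, A_WITH_RING, A_PLUS_COMBINING_RING):
    print(codepoints(form), "->", codepoints(nfc(form)))

# All three spellings normalize to the same single code point, xC5.
assert nfc(ANGSTROM_SIGN) == nfc(A_PLUS_COMBINING_RING) == A_WITH_RING

# The degree signs are only compatibility-equivalent: NFC leaves x2103
# alone, while NFKC splits it into DEGREE SIGN plus the letter C.
assert nfc("\u2103") == "\u2103"
assert unicodedata.normalize("NFKC", "\u2103") == "\u00b0C"
```

The catch for searching is that, after normalization, the distinction
between 'Angstrom the unit' and 'A with ring the letter' is gone.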
3. There is also a set of Roman numerals. Thus VII (x2166) and
vii (x2176) exist.
Again, indexing is not the main issue behind code points or glyphs (though it
does have its uses!), but this would be useful for searching for the seventh
article of an international convention; one could even build a synonym
database where 7 (x37) is mapped to VII and vii (or vice versa).
This, though, only highlights the fact that there are no code points for
'(g)', '(7)', '7.', etc. So this really is a cul-de-sac. Indeed, when one
gets into the detail, one again discovers that the Roman numerals are really
fixed-width ones to go with ideographs; for Latin text you are meant to use
Latin alphabet letters (I think).
This leads to the potentially bizarre result that an 'intelligent' search
algorithm would find numeral 7 (x37), CJK fixed width 7 (xFF17), real
Japanese 7 (x4E03), and CJK Roman VII's (x2166 and x2176) but not Latin
script VII or vii !!
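
The asymmetry can be seen with a small sketch in Python (standard library
only). The names candidates and numeric_value are hypothetical, chosen just
for this illustration; the lookup relies on the numeric property in the
Unicode character database as exposed by unicodedata.numeric:

```python
import unicodedata

# The different ways of writing 'seven' mentioned above.
candidates = {
    "digit seven": "\u0037",
    "fullwidth digit seven": "\uff17",
    "CJK ideograph seven": "\u4e03",
    "roman numeral seven": "\u2166",
    "small roman numeral seven": "\u2176",
    "Latin letters VII": "VII",
}


def numeric_value(text: str):
    """Hypothetical helper: numeric value of a single character,
    or None if the input is not one character with a numeric property."""
    if len(text) != 1:
        return None
    try:
        return unicodedata.numeric(text)
    except ValueError:
        return None


for label, text in candidates.items():
    print(f"{label:28} numeric={numeric_value(text)!r:6} "
          f"NFKC={unicodedata.normalize('NFKC', text)!r}")

# Every single-code-point seven carries the value 7; the Latin-letter
# spelling is just three letters and carries none.
assert numeric_value("\u2166") == 7.0
assert numeric_value("VII") is None
```

One possible way round it, assuming a lossy search key is acceptable, is to
compare NFKC-normalized text: that folds x2166 and x2176 to the plain letters
VII and vii, and xFF17 to the ASCII digit, though x4E03 stays as it is.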
So, Unicode is definitely a big advance on all those dozens of (often
corrupted by M$) code tables, but is it really a set of 'characters' which
leaves glyph selection/representation to the rendering engine, or is it a
peculiar mixture of characters and glyphs that may not only make many
problems for the future interpretation of electronic files, but also is not
that easy to intelligently use (e.g. searching) or even render as glyphs?
How do others feel?
Michael Chapman.
[ref.1] "The difference between identifying a code value and rendering it on
screen or paper is crucial to understanding the Unicode Standard's role in
text processing. ...
... the standard defines how characters are interpreted, and not how glyphs
are rendered ..." The Unicode Standard, Version 3.0, April 2000, page 5.
This issue is discussed further on page 298, which says use of many of the
_given_ code points is "strongly discouraged" ... (?).