So I guess I was foolish to hope that Google had figured out how to return results for non-identical but equivalent strings?<br><br>I hope it's not too off-topic for this list, but can you point me to any good resources on normalization? Is there a straightforward way to automate it for someone who doesn't do scripting? And am I supposed to use decomposed characters?<br>


<br>Thanks.<br><br>Josh<br><br><div class="gmail_quote">On Fri, Jul 8, 2011 at 3:11 PM, maxwell <span dir="ltr">&lt;<a href="mailto:maxwell@umiacs.umd.edu">maxwell@umiacs.umd.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


On Fri, 8 Jul 2011 15:00:42 -0500, Joshua and Amy &lt;<a href="mailto:josh.ruthamy@gmail.com">josh.ruthamy@gmail.com</a>&gt; wrote:<br>

<div class="im">> I'm creating some hyphenation rules for Jarai texts that I'm<br>
> interlinearizing. Here's the problem: In various texts, a complex character<br>
> such as LATIN SMALL LETTER A WITH BREVE might be encoded as a single code<br>
> point (U+0103) or as a combination of code points (LATIN SMALL LETTER A:<br>
> U+0061 plus COMBINING BREVE: U+0306).<br>

<br>

</div>Can't (shouldn't!) you pass your texts through a Unicode normalization<br>
process?  Otherwise searching them might not work either, depending on how<br>
smart your search tool is.<br>
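<br>
As a rough sketch of what such a normalization pass might look like, assuming Python 3 and its standard <code>unicodedata</code> module (the helper name <code>normalize_text</code> is hypothetical and not part of any tool mentioned in this thread):<br>

```python
import unicodedata

# The two encodings of "a with breve" described in the quoted message.
precomposed = "\u0103"   # LATIN SMALL LETTER A WITH BREVE (one code point)
decomposed = "a\u0306"   # LATIN SMALL LETTER A + COMBINING BREVE (two code points)


def normalize_text(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: return text in the given normalization form.

    "NFC" prefers precomposed characters; "NFD" prefers decomposed ones.
    """
    return unicodedata.normalize(form, text)


# Compared code point by code point, the two strings are different.
print(precomposed == decomposed)  # False

# After normalizing both to the same form, they compare equal.
print(normalize_text(precomposed) == normalize_text(decomposed))  # True
print(normalize_text(precomposed, "NFD") == normalize_text(decomposed, "NFD"))  # True
```

Either form should work for matching, as long as the texts and the hyphenation rules are all normalized to the same one.<br>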

<br>

   Mike Maxwell<br>

<br>

<br>

--------------------------------------------------<br>

Subscriptions, Archive, and List information, etc.:<br>

  <a href="http://tug.org/mailman/listinfo/xetex" target="_blank">http://tug.org/mailman/listinfo/xetex</a><br>

</blockquote></div><br>