texlive[61096] Build/source/texk: upmendex 1.00: experimental support

commits+takuji at tug.org
Sat Nov 20 10:29:34 CET 2021


Revision: 61096
          http://tug.org/svn/texlive?view=revision&revision=61096
Author:   takuji
Date:     2021-11-20 10:29:33 +0100 (Sat, 20 Nov 2021)
Log Message:
-----------
upmendex 1.00: experimental support Arabic & Hebrew

Modified Paths:
--------------
    trunk/Build/source/texk/README
    trunk/Build/source/texk/upmendex/ChangeLog
    trunk/Build/source/texk/upmendex/configure
    trunk/Build/source/texk/upmendex/configure.ac
    trunk/Build/source/texk/upmendex/convert.c
    trunk/Build/source/texk/upmendex/exvar.h
    trunk/Build/source/texk/upmendex/fwrite.c
    trunk/Build/source/texk/upmendex/mendex.h
    trunk/Build/source/texk/upmendex/sort.c
    trunk/Build/source/texk/upmendex/styfile.c
    trunk/Build/source/texk/upmendex/upmendex.ja.txt
    trunk/Build/source/texk/upmendex/var.h

Modified: trunk/Build/source/texk/README
===================================================================
--- trunk/Build/source/texk/README	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/README	2021-11-20 09:29:33 UTC (rev 61096)
@@ -108,7 +108,7 @@
 
 ttfdump - maintained here, by us, since Taiwan upstream apparently gone.
 
-upmendex 0.60 - by Takuji Tanaka
+upmendex 1.00 - by Takuji Tanaka
   http://www.ctan.org/pkg/upmendex
   https://github.com/t-tk/upmendex-package
 

Modified: trunk/Build/source/texk/upmendex/ChangeLog
===================================================================
--- trunk/Build/source/texk/upmendex/ChangeLog	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/ChangeLog	2021-11-20 09:29:33 UTC (rev 61096)
@@ -1,3 +1,17 @@
+2021-11-20  TANAKA Takuji  <ttk at t-lab.opal.ne.jp>
+
+	* version 1.00  Stable version.
+	* configure.ac: Bump version.
+	* {,ex}var.h, fwrite.c, styfile.c:
+	Add options "script_preamble" and "script_postamble"
+	in style file.
+	* sort.c: Add Latin Extended-F and -G.
+	Add U_FORMAT_CHAR to is_type_mark_or_punct().
+	* mendex.h, var.h, convert.c, fwrite.c, sort.c, styfile.c:
+	Supports Arabic, Hebrew scripts (experimental).
+	* upmendex.ja.txt:
+	Update document.
+
 2021-10-24  TANAKA Takuji  <ttk at t-lab.opal.ne.jp>
 
 	* version 0.60  Beta version.

Modified: trunk/Build/source/texk/upmendex/configure
===================================================================
--- trunk/Build/source/texk/upmendex/configure	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/configure	2021-11-20 09:29:33 UTC (rev 61096)
@@ -1,6 +1,6 @@
 #! /bin/sh
 # Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.71 for upmendex (TeX Live) 0.60.
+# Generated by GNU Autoconf 2.71 for upmendex (TeX Live) 1.00.
 #
 #
 # Copyright (C) 1992-1996, 1998-2017, 2020-2021 Free Software Foundation,
@@ -626,8 +626,8 @@
 # Identity of this package.
 PACKAGE_NAME='upmendex (TeX Live)'
 PACKAGE_TARNAME='upmendex--tex-live-'
-PACKAGE_VERSION='0.60'
-PACKAGE_STRING='upmendex (TeX Live) 0.60'
+PACKAGE_VERSION='1.00'
+PACKAGE_STRING='upmendex (TeX Live) 1.00'
 PACKAGE_BUGREPORT=''
 PACKAGE_URL=''
 
@@ -1390,7 +1390,7 @@
   # Omit some internal or obsolete options to make the list less imposing.
   # This message is too long to be a string in the A/UX 3.1 sh.
   cat <<_ACEOF
-\`configure' configures upmendex (TeX Live) 0.60 to adapt to many kinds of systems.
+\`configure' configures upmendex (TeX Live) 1.00 to adapt to many kinds of systems.
 
 Usage: $0 [OPTION]... [VAR=VALUE]...
 
@@ -1462,7 +1462,7 @@
 
 if test -n "$ac_init_help"; then
   case $ac_init_help in
-     short | recursive ) echo "Configuration of upmendex (TeX Live) 0.60:";;
+     short | recursive ) echo "Configuration of upmendex (TeX Live) 1.00:";;
    esac
   cat <<\_ACEOF
 
@@ -1587,7 +1587,7 @@
 test -n "$ac_init_help" && exit $ac_status
 if $ac_init_version; then
   cat <<\_ACEOF
-upmendex (TeX Live) configure 0.60
+upmendex (TeX Live) configure 1.00
 generated by GNU Autoconf 2.71
 
 Copyright (C) 2021 Free Software Foundation, Inc.
@@ -2268,7 +2268,7 @@
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.
 
-It was created by upmendex (TeX Live) $as_me 0.60, which was
+It was created by upmendex (TeX Live) $as_me 1.00, which was
 generated by GNU Autoconf 2.71.  Invocation command line was
 
   $ $0$ac_configure_args_raw
@@ -8806,7 +8806,7 @@
 
 # Define the identity of the package.
  PACKAGE='upmendex--tex-live-'
- VERSION='0.60'
+ VERSION='1.00'
 
 
 # Some tools Automake needs.
@@ -18942,7 +18942,7 @@
 Report bugs to <bug-libtool at gnu.org>."
 
 lt_cl_version="\
-upmendex (TeX Live) config.lt 0.60
+upmendex (TeX Live) config.lt 1.00
 configured by $0, generated by GNU Autoconf 2.71.
 
 Copyright (C) 2011 Free Software Foundation, Inc.
@@ -21114,7 +21114,7 @@
 # report actual input values of CONFIG_FILES etc. instead of their
 # values after options handling.
 ac_log="
-This file was extended by upmendex (TeX Live) $as_me 0.60, which was
+This file was extended by upmendex (TeX Live) $as_me 1.00, which was
 generated by GNU Autoconf 2.71.  Invocation command line was
 
   CONFIG_FILES    = $CONFIG_FILES
@@ -21182,7 +21182,7 @@
 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
 ac_cs_config='$ac_cs_config_escaped'
 ac_cs_version="\\
-upmendex (TeX Live) config.status 0.60
+upmendex (TeX Live) config.status 1.00
 configured by $0, generated by GNU Autoconf 2.71,
   with options \\"\$ac_cs_config\\"
 

Modified: trunk/Build/source/texk/upmendex/configure.ac
===================================================================
--- trunk/Build/source/texk/upmendex/configure.ac	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/configure.ac	2021-11-20 09:29:33 UTC (rev 61096)
@@ -8,7 +8,7 @@
 dnl   gives unlimited permission to copy and/or distribute it,
 dnl   with or without modifications, as long as this notice is preserved.
 dnl
-AC_INIT([upmendex (TeX Live)],[0.60])
+AC_INIT([upmendex (TeX Live)],[1.00])
 AC_PREREQ([2.71])
 AC_CONFIG_SRCDIR([main.c])
 AC_CONFIG_AUX_DIR([../../build-aux])

Modified: trunk/Build/source/texk/upmendex/convert.c
===================================================================
--- trunk/Build/source/texk/upmendex/convert.c	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/convert.c	2021-11-20 09:29:33 UTC (rev 61096)
@@ -224,7 +224,7 @@
 			else if (is_latin(buff3)||is_cyrillic(buff3)||is_greek(buff3)
 				 ||is_jpn_kana(buff3)||is_kor_hngl(buff3)||is_zhuyin(buff3)
 				 ||is_numeric(buff3)==1||is_type_symbol(buff3)==1
-				 ||is_devanagari(buff3)||is_thai(buff3)
+				 ||is_devanagari(buff3)||is_thai(buff3)||is_arabic(buff3)||is_hebrew(buff3)
 					||is_type_mark_or_punct(buff3)) {
 				buff2[j]=buff3[0];
 				if (wclen==2) buff2[j+1]=buff3[1];

Modified: trunk/Build/source/texk/upmendex/exvar.h
===================================================================
--- trunk/Build/source/texk/upmendex/exvar.h	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/exvar.h	2021-11-20 09:29:33 UTC (rev 61096)
@@ -35,6 +35,7 @@
 extern UChar devanagari_head[],thai_head[];
 extern char page_compositor[],page_precedence[];
 extern char character_order[];
+extern char script_preamble[][STYBUFSIZE],script_postamble[][STYBUFSIZE];
 extern char icu_locale[],icu_rules[];
 extern int icu_attributes[];
 

Modified: trunk/Build/source/texk/upmendex/fwrite.c
===================================================================
--- trunk/Build/source/texk/upmendex/fwrite.c	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/fwrite.c	2021-11-20 09:29:33 UTC (rev 61096)
@@ -166,7 +166,7 @@
 /*   write ind file   */
 void indwrite(char *filename, struct index *ind, int pagenum)
 {
-	int i,j,hpoint=0,tpoint=0,ipoint=0,jpoint=0;
+	int i,j,hpoint=0,tpoint=0,ipoint=0,jpoint=0,block_open=0;
 	char lbuff[BUFFERLEN],obuff[BUFFERLEN];
 	UChar datama[256],initial[INITIALLENGTH],initial_prev[INITIALLENGTH];
 	int chset,chset_prev;
@@ -212,7 +212,11 @@
 	for (i=line_length=0;i<lines;i++) {
 		index_normalize(ind[i].dic[0], initial, &chset);
 		if (i==0) {
-			if ((CH_LATIN<=chset&&chset<=CH_GREEK) || chset==CH_HANZI) {
+			if (is_any_script(chset) && strlen(script_preamble[chset])) {
+				fputs(script_preamble[chset],fp);
+				block_open=chset;
+			}
+			if ((CH_LATIN<=chset&&chset<=CH_GREEK) || chset==CH_HANZI || (CH_ARABIC<=chset&&chset<=CH_HEBREW)) {
 				if (lethead_flag!=0) {
 					fputs(lethead_prefix,fp);
 					fprint_uchar(fp,initial,lethead_flag,-1);
@@ -352,7 +356,19 @@
 		}
 		else {
 			index_normalize(ind[i-1].dic[0], initial_prev, &chset_prev);
-			if ((CH_LATIN<=chset&&chset<=CH_GREEK) || chset==CH_HANZI) {
+			if (chset!=chset_prev && is_any_script(chset_prev) && block_open) {
+				if (strlen(script_postamble[chset_prev])) {
+					fputs(script_postamble[chset_prev],fp);
+				}
+				block_open=0;
+			}
+			if (chset!=chset_prev && is_any_script(chset)) {
+				if (strlen(script_preamble[chset])) {
+					fputs(script_preamble[chset],fp);
+					block_open=chset;
+				}
+			}
+			if ((CH_LATIN<=chset&&chset<=CH_GREEK) || chset==CH_HANZI || (CH_ARABIC<=chset&&chset<=CH_HEBREW)) {
 				if (chset!=chset_prev || ss_comp(initial,initial_prev)) {
 					fputs(group_skip,fp);
 					if (lethead_flag!=0) {
@@ -505,6 +521,9 @@
 			printpage(ind,fp,i,lbuff);
 		}
 	}
+	if (is_any_script(chset) && strlen(script_postamble[chset]) && block_open) {
+		fputs(script_postamble[chset],fp);
+	}
 	fputs(postamble,fp);
 
 	if (fp!=stdout) fclose(fp);
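
The fwrite.c hunks above wrap each run of same-script entries in that script's preamble and postamble, tracking the open block in block_open. A minimal sketch of that bookkeeping follows, under stated assumptions: wrap_script_blocks, append, SCRIPT_MAX, and the "E" entry placeholder are hypothetical names for illustration, the output goes to a string buffer instead of the .ind file, and a -1 sentinel replaces the nonzero block_open test. It is not the upmendex code.

```c
#include <stddef.h>
#include <string.h>

#define SCRIPT_MAX 11   /* assumed: script ids 0..10, as in script_preamble[11] */

/* Append src to dst (capacity cap) if it fits; drop it otherwise. */
static void append(char *dst, size_t cap, const char *src)
{
	size_t used = strlen(dst), n = strlen(src);
	if (used + n < cap) memcpy(dst + used, src, n + 1);
}

/* Hypothetical sketch: write "E" for each entry and wrap every run of
   same-script entries in that script's preamble/postamble.
   script[i] must be in 0..SCRIPT_MAX-1 and cap must be at least 1. */
void wrap_script_blocks(const int *script, int n,
                        const char *pre[SCRIPT_MAX],
                        const char *post[SCRIPT_MAX],
                        char *out, size_t cap)
{
	int i, open = -1;           /* script id of the open block, -1 if none */

	out[0] = '\0';
	for (i = 0; i < n; i++) {
		if (i == 0 || script[i] != script[i - 1]) {
			if (open >= 0) {            /* close the previous block */
				append(out, cap, post[open]);
				open = -1;
			}
			if (pre[script[i]][0] != '\0') {  /* open only if a preamble is set */
				append(out, cap, pre[script[i]]);
				open = script[i];
			}
		}
		append(out, cap, "E");
	}
	if (open >= 0) append(out, cap, post[open]);  /* close the last block */
}
```

As in the hunks, a postamble is written only for a block that was actually opened, so a script with an empty preamble produces no wrapper at all.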
@@ -886,8 +905,13 @@
 		u_strcpy(ini,hz_index[lo-1].idx);
 		return;
 	}
-	else if (is_devanagari(&ch)||is_thai(&ch)) {
-		if (ch==0x929||0x931||0x934||(0x958<=ch&&ch<=0x95F)) {
+	else if (is_devanagari(&ch)||is_thai(&ch)||is_arabic(&ch)||is_hebrew(&ch)) {
+		if (ch==0x929||ch==0x931||ch==0x934||(0x958<=ch&&ch<=0x95F) /* Devanagari */
+			||(0x622<=ch&&ch<=0x626)||ch==0x6C0||ch==0x6C2||ch==0x6D3 /* Arabic */
+			||(0xFB50<=ch&&ch<=0xFDFF) /* Arabic Presentation Forms-A */
+			||(0xFE70<=ch&&ch<=0xFEFF) /* Arabic Presentation Forms-B */
+			||(0xFB1D<=ch&&ch<=0xFB4F) /* Hebrew presentation forms */
+		   ) {
 			src[0]=ch;  src[1]=0x00;
 			perr=U_ZERO_ERROR;
 			unorm2_normalize(unormalizer_NFD, src, 1, dest, 8, &perr);
@@ -894,6 +918,9 @@
 			if (U_SUCCESS(perr))
 				ch=dest[0];                         /* without modifier */
 		}
+		else if (ch==0x5DA||ch==0x5DD||ch==0x05DF||ch==0x5E3||ch==0x05E5) { /* Hebrew letter final */
+			ch++;
+		}
 		ini[0]=ch;
 		return;
 	}
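
The last hunk above maps each Hebrew final-form letter to the regular letter that follows it in the Unicode Hebrew block (ch++), so that both forms share one heading initial. A minimal sketch of that mapping, with fold_hebrew_final as a hypothetical helper name and uint16_t standing in for ICU's UChar:

```c
#include <stdint.h>

/* Hypothetical helper, not part of upmendex: fold a Hebrew final-form
   letter to its regular form. In the Hebrew block each final form sits
   immediately before its regular form, so adding 1 is enough. */
uint16_t fold_hebrew_final(uint16_t ch)
{
	switch (ch) {
	case 0x05DA:   /* FINAL KAF   -> KAF   (U+05DB) */
	case 0x05DD:   /* FINAL MEM   -> MEM   (U+05DE) */
	case 0x05DF:   /* FINAL NUN   -> NUN   (U+05E0) */
	case 0x05E3:   /* FINAL PE    -> PE    (U+05E4) */
	case 0x05E5:   /* FINAL TSADI -> TSADI (U+05E6) */
		return (uint16_t)(ch + 1);
	default:
		return ch;
	}
}
```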

Modified: trunk/Build/source/texk/upmendex/mendex.h
===================================================================
--- trunk/Build/source/texk/upmendex/mendex.h	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/mendex.h	2021-11-20 09:29:33 UTC (rev 61096)
@@ -59,6 +59,8 @@
 int is_greek(UChar *c);
 int is_devanagari(UChar *c);
 int is_thai(UChar *c);
+int is_arabic(UChar *c);
+int is_hebrew(UChar *c);
 int is_type_mark_or_punct(UChar *c);
 int is_type_symbol(UChar *c);
 int chkcontinue(struct page *p, int num);
@@ -73,9 +75,11 @@
 #define CH_HANZI        6
 #define CH_DEVANAGARI   7
 #define CH_THAI         8
+#define CH_ARABIC       9
+#define CH_HEBREW      10
 #define CH_SYMBOL   0x100
 #define CH_NUMERIC  0x101
-#define  is_any_script(a)  ((CH_LATIN<=(a) && (a)<=CH_THAI))
+#define  is_any_script(a)  ((CH_LATIN<=(a) && (a)<=CH_HEBREW))
 
 /* sort.c */
 int charset(UChar *c);

Modified: trunk/Build/source/texk/upmendex/sort.c
===================================================================
--- trunk/Build/source/texk/upmendex/sort.c	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/sort.c	2021-11-20 09:29:33 UTC (rev 61096)
@@ -20,7 +20,7 @@
 	zh at collation=zhuyin  28880
 */
 
-int sym,nmbr,ltn,kana,hngl,hnz,cyr,grk,dvng,thai;
+int sym,nmbr,ltn,kana,hngl,hnz,cyr,grk,dvng,thai,arab,hbrw;
 
 static int wcomp(const void *p, const void *q);
 static int pcomp(const void *p, const void *q);
@@ -83,6 +83,14 @@
 			thai=order++;
 			break;
 
+		case 'a':
+			arab=order++;
+			break;
+
+		case 'h':
+			hbrw=order++;
+			break;
+
 		default:
 			verb_printf(efp,"\nWarning: Illegal input for character_order (%c).",character_order[i]);
 			break;
@@ -101,6 +109,8 @@
 	if (grk==0) grk=order++;
 	if (dvng==0) dvng=order++;
 	if (thai==0) thai=order++;
+	if (arab==0) arab=order++;
+	if (hbrw==0) hbrw=order++;
 
 	status = U_ZERO_ERROR;
 	if (strlen(icu_rules)>0) {
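
The two sort.c hunks above give each class letter in character_order a rank in the order listed, then hand the next free ranks to any class that was left out. A minimal sketch of that assignment, under stated assumptions: assign_ranks, classes, and NCLASS are hypothetical names, ranks start at 1 so that 0 can mean "unset", and a repeated letter keeps its first rank (a choice made here, not taken from the commit).

```c
#include <string.h>

#define NCLASS 12

/* Default class letters, taken from the new character_order default. */
static const char classes[] = "SNLGCJKHDTah";

/* Hypothetical sketch: fill rank[i] (1-based) for the i-th letter of
   classes[]. Letters missing from order get the next free ranks in
   default order. Returns the number of unrecognized letters in order. */
int assign_ranks(const char *order, int rank[NCLASS])
{
	int i, next = 1, unknown = 0;
	const char *p;

	for (i = 0; i < NCLASS; i++) rank[i] = 0;
	for (; *order != '\0'; order++) {
		p = strchr(classes, *order);
		if (p == NULL) { unknown++; continue; }   /* would trigger the warning */
		if (rank[p - classes] == 0) rank[p - classes] = next++;
	}
	for (i = 0; i < NCLASS; i++)
		if (rank[i] == 0) rank[i] = next++;       /* fill in omitted classes */
	return unknown;
}
```

The letters are case-sensitive: 'H' is Hanzi while 'h' is Hebrew, which is why the two new classes use lowercase.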
@@ -307,6 +317,8 @@
 		else if (is_numeric(c))  return nmbr;
 		else if (is_devanagari(c)) return dvng;
 		else if (is_thai(c))     return thai;
+		else if (is_arabic(c))   return arab;
+		else if (is_hebrew(c))   return hbrw;
 		else                     return sym;
 	}
 }
@@ -330,6 +342,8 @@
 		else if (is_numeric(c))  return CH_NUMERIC;
 		else if (is_devanagari(c)) return CH_DEVANAGARI;
 		else if (is_thai(c))     return CH_THAI;
+		else if (is_arabic(c))   return CH_ARABIC;
+		else if (is_hebrew(c))   return CH_HEBREW;
 		else                     return CH_SYMBOL;
 	}
 }
@@ -401,6 +415,8 @@
 
 int is_latin(UChar *c)
 {
+	UChar32 c32;
+
 	if (((*c>=L'A')&&(*c<=L'Z'))||((*c>=L'a')&&(*c<=L'z'))) return 1;
 	else if ((*c==0x00AA)||(*c==0x00BA)) return 1; /* Latin-1 Supplement */
 	else if ((*c>=0x00C0)&&(*c<=0x00D6)) return 1;
@@ -416,9 +432,15 @@
 	else if ((*c>=0xFF21)&&(*c<=0xFF3A)) return 1; /* Fullwidth Latin Capital Letter */
 	else if ((*c>=0xFF41)&&(*c<=0xFF5A)) return 1; /* Fullwidth Latin Small Letter */
 		/* Property of followings is "Common, So (other symbol)", but seem to be treated as Latin by ICU collator */
-	else if ((*c>=0x24B6)&&(*c<=0x24CF)) return 1; /* CIRCLED LATIN CAPITAL LETTER */
-	else if ((*c>=0x24D0)&&(*c<=0x24E9)) return 1; /* CIRCLED LATIN SMALL LETTER */
-	else return 0;
+	else if ((*c>=0x24B6)                          /* CIRCLED LATIN CAPITAL LETTER */
+	                     &&(*c<=0x24E9)) return 1; /* CIRCLED LATIN SMALL LETTER */
+
+	if (is_surrogate_pair(c)) {
+		c32=U16_GET_SUPPLEMENTARY(*c,*(c+1));
+		if      ((c32>=0x10780) && (c32<=0x107BF)) return 2; /* Latin Extended-F */
+		else if ((c32>=0x1DF00) && (c32<=0x1DFFF)) return 2; /* Latin Extended-G */
+	}
+	return 0;
 }
 
 int is_numeric(UChar *c)
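
Latin Extended-F and -G lie outside the Basic Multilingual Plane, so the is_latin() hunk above first combines a UTF-16 surrogate pair into one code point before testing the ranges. A minimal sketch of that step without ICU follows; decode_surrogate_pair and is_latin_ext_fg are hypothetical names, and the commit itself uses is_surrogate_pair() and U16_GET_SUPPLEMENTARY instead.

```c
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into a code point.
   Returns 0 if (hi, lo) is not a valid high/low pair. */
uint32_t decode_surrogate_pair(uint16_t hi, uint16_t lo)
{
	if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
		return 0;
	return 0x10000u + ((uint32_t)(hi - 0xD800) << 10) + (uint32_t)(lo - 0xDC00);
}

/* Hypothetical sketch: nonzero if the pair encodes a code point in
   Latin Extended-F or Latin Extended-G. */
int is_latin_ext_fg(uint16_t hi, uint16_t lo)
{
	uint32_t c32 = decode_surrogate_pair(hi, lo);

	return (c32 >= 0x10780 && c32 <= 0x107BF)     /* Latin Extended-F */
	    || (c32 >= 0x1DF00 && c32 <= 0x1DFFF);    /* Latin Extended-G */
}
```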
@@ -487,11 +509,11 @@
 {
 	UChar32 c32;
 
-	if      ((*c>=0x2E80)&&(*c<=0x2EFF)) return 1; /* CJK Radicals Supplement */
-	else if ((*c>=0x2F00)&&(*c<=0x2FDF)) return 1; /* Kangxi Radicals */
+	if      ((*c>=0x2E80)                          /* CJK Radicals Supplement */
+	                     &&(*c<=0x2FDF)) return 1; /* Kangxi Radicals */
 	else if ((*c>=0x31C0)&&(*c<=0x31EF)) return 1; /* CJK Strokes */
-	else if ((*c>=0x3300)&&(*c<=0x33FF)) return 1; /* CJK Compatibility */
-	else if ((*c>=0x3400)&&(*c<=0x4DBF)) return 1; /* CJK Unified Ideographs Extension A */
+	else if ((*c>=0x3300)                          /* CJK Compatibility */
+	                     &&(*c<=0x4DBF)) return 1; /* CJK Unified Ideographs Extension A */
 	else if ((*c>=0x4E00)&&(*c<=0x9FFF)) return 1; /* CJK Unified Ideographs */
 	else if ((*c>=0xF900)&&(*c<=0xFAFF)) return 1; /* CJK Compatibility Ideographs */
 
@@ -513,7 +535,9 @@
 
 int is_cyrillic(UChar *c)
 {
-	if      ((*c>=0x0400)&&(*c<=0x052F)) return 1; /* Cyrillic, Cyrillic Supplement */
+	if      ((*c==0x0482))               return 0; /* Cyrillic Thousands Sign */
+	else if ((*c>=0x0400)                          /* Cyrillic */
+	                     &&(*c<=0x052F)) return 1; /* Cyrillic Supplement */
 	else if ((*c>=0x1C80)&&(*c<=0x1C8F)) return 1; /* Cyrillic Extended-C */
 	else if ((*c>=0x2DE0)&&(*c<=0x2DFF)) return 1; /* Cyrillic Extended-A */
 	else if ((*c>=0xA640)&&(*c<=0xA69F)) return 1; /* Cyrillic Extended-B */
@@ -522,7 +546,8 @@
 
 int is_greek(UChar *c)
 {
-	if      ((*c>=0x0370)&&(*c<=0x03FF)) return 1; /* Greek */
+	if      ((*c==0x03F6))               return 0; /* Greek Reversed Lunate Epsilon Symbol */
+	else if ((*c>=0x0370)&&(*c<=0x03FF)) return 1; /* Greek */
 	else if ((*c>=0x1F00)&&(*c<=0x1FFF)) return 1; /* Greek Extended */
 	else return 0;
 }
@@ -529,8 +554,8 @@
 
 int is_devanagari(UChar *c)
 {
-	if      ((*c>=0x0964)&&(*c<=0x0965)) return 0; /* Generic punctuation for scripts of India */
-	else if ((*c>=0x0966)&&(*c<=0x096F)) return 0; /* Devanagari Digit */
+	if      ((*c>=0x0964)                          /* Generic punctuation for scripts of India */
+	                     &&(*c<=0x096F)) return 0; /* Devanagari Digit */
 	else if ((*c>=0x0900)&&(*c<=0x097F)) return 1; /* Devanagari */
 	else if ((*c>=0xA8E0)&&(*c<=0xA8FF)) return 1; /* Devanagari Extended */
 	else return 0;
@@ -544,6 +569,48 @@
 	else return 0;
 }
 
+int is_arabic(UChar *c)
+{
+	if      ((*c>=0x0600)                          /* ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT */
+	                                               /* ARABIC NUMBER MARK ABOVE */
+	                     &&(*c<=0x0608)) return 0; /* ARABIC-INDIC CUBE ROOT..ARABIC RAY */
+	else if ((*c==0x060B))               return 0; /* AFGHANI SIGN */
+	else if ((*c==0x060C))               return 0; /* ARABIC COMMA */
+	else if ((*c>=0x060E)&&(*c<=0x060F)) return 0; /* ARABIC POETIC VERSE SIGN..ARABIC SIGN MISRA */
+	else if ((*c>=0x0660)&&(*c<=0x0669)) return 0; /* ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE */
+	else if ((*c==0x061B))               return 0; /* ARABIC SEMICOLON */
+	else if ((*c==0x061C))               return 0; /* ARABIC LETTER MARK */
+	else if ((*c==0x061F))               return 0; /* ARABIC QUESTION MARK */
+	else if ((*c==0x0640))               return 0; /* ARABIC TATWEEL */
+	else if ((*c==0x06DD))               return 0; /* ARABIC END OF AYAH */
+	else if ((*c==0x06DE))               return 0; /* ARABIC START OF RUB EL HIZB */
+	else if ((*c==0x06E9))               return 0; /* ARABIC PLACE OF SAJDAH */
+	else if ((*c>=0x06F0)&&(*c<=0x06F9)) return 0; /* EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE */
+	else if ((*c>=0x06FD)&&(*c<=0x06FE)) return 0; /* ARABIC SIGN SINDHI AMPERSAND..ARABIC SIGN SINDHI POSTPOSITION MEN */
+	else if ((*c==0x08E2))               return 0; /* ARABIC DISPUTED END OF AYAH */
+	else if ((*c>=0x0890)&&(*c<=0x0891)) return 0; /* ARABIC POUND MARK ABOVE..ARABIC PIASTRE MARK ABOVE */
+	else if ((*c>=0xFD40)&&(*c<=0xFD4F)) return 0; /* ARABIC LIGATURE RAHIMAHU ALLAAH..ARABIC LIGATURE RAHIMAHUM ALLAAH */
+	else if ((*c==0xFDCF))               return 0; /* ARABIC LIGATURE SALAAMUHU ALAYNAA */
+	else if ((*c==0xFDFC))               return 0; /* RIAL SIGN */
+	else if ((*c>=0xFDFD)&&(*c<=0xFDFF)) return 0; /* ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM..ARABIC LIGATURE AZZA WA JALL */
+
+	else if ((*c>=0x0600)&&(*c<=0x06FF)) return 1; /* Arabic */
+	else if ((*c>=0x0750)&&(*c<=0x077F)) return 1; /* Arabic Supplement */
+	else if ((*c>=0x0870)                          /* Arabic Extended-B */
+	                     &&(*c<=0x08FF)) return 1; /* Arabic Extended-A */
+	else if ((*c>=0xFB50)&&(*c<=0xFDFF)) return 1; /* Arabic Presentation Forms-A */
+	else if ((*c>=0xFE70)&&(*c<=0xFEFF)) return 1; /* Arabic Presentation Forms-B */
+	else return 0;
+}
+
+int is_hebrew(UChar *c)
+{
+	if      ((*c==0xFB29))               return 0; /* Hebrew Letter Alternative Plus Sign */
+	else if ((*c>=0x0590)&&(*c<=0x05FF)) return 1; /* Hebrew */
+	else if ((*c>=0xFB1D)&&(*c<=0xFB4F)) return 1; /* Hebrew presentation forms */
+	else return 0;
+}
+
 int is_type_mark_or_punct(UChar *c)
 {
 	UChar32 c32;
@@ -558,6 +625,7 @@
 	case U_CONNECTOR_PUNCTUATION: case U_OTHER_PUNCTUATION:
 	case U_INITIAL_PUNCTUATION: case U_FINAL_PUNCTUATION:
 	case U_NON_SPACING_MARK: case U_ENCLOSING_MARK: case U_COMBINING_SPACING_MARK:
+	case U_FORMAT_CHAR:
 		return 1;
 	default:
 		return 0;

Modified: trunk/Build/source/texk/upmendex/styfile.c
===================================================================
--- trunk/Build/source/texk/upmendex/styfile.c	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/styfile.c	2021-11-20 09:29:33 UTC (rev 61096)
@@ -161,6 +161,43 @@
 		}
 		if (getparam(buff,"icu_attributes", icu_attr_str   )) continue;
 
+		cc=scompare(buff,"script_preamble");
+		if (cc!= -1) {
+			strcpy(tmp,buff+strlen("script_preamble"));
+			if (getparam(tmp,"latin",     script_preamble[CH_LATIN]      )) continue;
+			if (getparam(tmp,"cyrillic",  script_preamble[CH_CYRILLIC]   )) continue;
+			if (getparam(tmp,"greek",     script_preamble[CH_GREEK]      )) continue;
+			if (getparam(tmp,"kana",      script_preamble[CH_KANA]       )) continue;
+			if (getparam(tmp,"hangul",    script_preamble[CH_HANGUL]     )) continue;
+			if (getparam(tmp,"hanzi",     script_preamble[CH_HANZI]      )) continue;
+			if (getparam(tmp,"devanagari",script_preamble[CH_DEVANAGARI] )) continue;
+			if (getparam(tmp,"thai",      script_preamble[CH_THAI]       )) continue;
+			if (getparam(tmp,"arabic",    script_preamble[CH_ARABIC]     )) continue;
+			if (getparam(tmp,"hebrew",    script_preamble[CH_HEBREW]     )) continue;
+			if (strlen(tmp)>0) {
+				verb_printf(efp,"\nWarning: Unknown script for specifier \"script_preamble\" (%s).", tmp);
+			}
+			continue;
+		}
+		cc=scompare(buff,"script_postamble");
+		if (cc!= -1) {
+			strcpy(tmp,buff+strlen("script_postamble"));
+			if (getparam(tmp,"latin",     script_postamble[CH_LATIN]      )) continue;
+			if (getparam(tmp,"cyrillic",  script_postamble[CH_CYRILLIC]   )) continue;
+			if (getparam(tmp,"greek",     script_postamble[CH_GREEK]      )) continue;
+			if (getparam(tmp,"kana",      script_postamble[CH_KANA]       )) continue;
+			if (getparam(tmp,"hangul",    script_postamble[CH_HANGUL]     )) continue;
+			if (getparam(tmp,"hanzi",     script_postamble[CH_HANZI]      )) continue;
+			if (getparam(tmp,"devanagari",script_postamble[CH_DEVANAGARI] )) continue;
+			if (getparam(tmp,"thai",      script_postamble[CH_THAI]       )) continue;
+			if (getparam(tmp,"arabic",    script_postamble[CH_ARABIC]     )) continue;
+			if (getparam(tmp,"hebrew",    script_postamble[CH_HEBREW]     )) continue;
+			if (strlen(tmp)>0) {
+				verb_printf(efp,"\nWarning: Unknown script for specifier \"script_postamble\" (%s).", tmp);
+			}
+			continue;
+		}
+
 		cc=strcspn(buff," \t\r\n");
 		if (cc>0) buff[cc]='\0';
 		if (buff[0]=='%' || buff[0]=='\n') continue;
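
The styfile.c hunk above resolves the script name that follows script_preamble or script_postamble to a slot in the per-script tables through a chain of getparam() calls. A minimal sketch of the same name-to-index lookup as a table; script_index is a hypothetical name, the ids 6 to 10 come from mendex.h as shown in this diff, and the ids 1 to 5 are assumed from the ordering of the CH_* range checks rather than confirmed by it.

```c
#include <string.h>

/* Hypothetical sketch: return the script id for a style-file script
   name, or -1 if the name is unknown (the case that triggers the
   "Unknown script for specifier" warning in the hunk). */
int script_index(const char *name)
{
	static const struct { const char *name; int id; } tbl[] = {
		{"latin", 1}, {"cyrillic", 2}, {"greek", 3},   /* ids 1-5 are assumed */
		{"kana", 4}, {"hangul", 5},
		{"hanzi", 6}, {"devanagari", 7}, {"thai", 8},  /* ids 6-10 from mendex.h */
		{"arabic", 9}, {"hebrew", 10}
	};
	size_t i;

	for (i = 0; i < sizeof tbl / sizeof tbl[0]; i++)
		if (strcmp(name, tbl[i].name) == 0) return tbl[i].id;
	return -1;
}
```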

Modified: trunk/Build/source/texk/upmendex/upmendex.ja.txt
===================================================================
--- trunk/Build/source/texk/upmendex/upmendex.ja.txt	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/upmendex.ja.txt	2021-11-20 09:29:33 UTC (rev 61096)
@@ -123,7 +123,7 @@
 
    preamble  <string>
       "\\begin{theindex}\n"
-      String for the output file.
+      String written at the beginning of the output file.
 
    postamble  <string>
       "\n\n\\end{theindex}\n"
@@ -167,6 +167,10 @@
       0
       Same as lethead_flag.
 
+   headings_flag  <number>
+      0
+      Same as lethead_flag.
+
    kana_head  <string>
       ""
       Kana heading characters. Specifies the heading characters as a string.
@@ -283,23 +287,36 @@
 
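    symhead_positive  <string>
       "Symbols"
-      When lethead_flag or heading_flag is positive, the heading for numbers and symbols is
+      When lethead_flag, heading_flag, or headings_flag is positive, the heading for symbols is
       output as this string.
 
    symhead_negative  <string>
       "symbols"
-      When lethead_flag or heading_flag is negative, the heading for numbers and symbols is
+      When lethead_flag, heading_flag, or headings_flag is negative, the heading for symbols is
       output as this string.
 
    symbol  <string>
       ""
-      String output as the heading for numbers and symbols when symbol_flag is not 0.
+      String output as the heading for symbols when symbol_flag is not 0.
       If the string is defined, it takes precedence over symhead_positive and
       symhead_negative. ((up)mendex only)
 
+   numhead_positive  <string>
+      "Numbers"
+      When lethead_flag or heading_flag is positive and symbol_flag is 2, the heading for numbers is
+      output as this string.
+
+   numhead_negative  <string>
+      "numbers"
+      When lethead_flag or heading_flag is negative and symbol_flag is 2, the heading for numbers is
+      output as this string.
+
    symbol_flag  <number>
       1
-      Output flag for symbol. When 0, nothing is output. ((up)mendex only)
+      Output flag for symbol. When 0, no heading is output.
+      When 1, numbers are treated as part of symbols and the symbol heading is output.
+      When 2, numbers and symbols are classified into separate groups and headings for both are output.
+      ((up)mendex only)
 
    letter_head  <number>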
       1
@@ -313,14 +330,31 @@
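       ((up)mendex only)
 
    character_order  <string>
-      "SNLGCJKHDT"
+      "SNLGCJKHDTah"
       Priority order of symbols, Latin letters, and Japanese. 'S' is symbols, 'N' is numbers, 'L' is Latin,
       'G' is Greek, 'C' is Cyrillic, 'J' is Japanese (kana), 'K' is Hangul,
-      'H' is Hanzi (kanji), 'D' is Devanagari, and 'T' is Thai.
-      Because upmendex classifies "numbers" under "symbols" for index entries,
+      'H' is Hanzi (kanji), 'D' is Devanagari, 'T' is Thai,
+      'a' is Arabic, and 'h' is Hebrew.
+      When symbol_flag=1, "numbers" are classified under "symbols" for index entries, so
       'S' and 'N' must always be adjacent (the order of numbers and non-numeric symbols may be swapped).
       (upmendex only)
 
+   script_preamble  <string1>  <string2>
+      ""
+      string2 specifies the string written at the beginning of each script's block.
+      string1 must be exactly one script name chosen from the following:
+      'latin', 'cyrillic', 'greek', 'kana', 'hangul', 'hanzi',
+      'devanagari', 'thai', 'arabic', 'hebrew'
+      (upmendex only)
+
+   script_postamble  <string1>  <string2>
+      ""
+      string2 specifies the string written at the end of each script's block.
+      string1 must be exactly one script name chosen from the following:
+      'latin', 'cyrillic', 'greek', 'kana', 'hangul', 'hanzi',
+      'devanagari', 'thai', 'arabic', 'hebrew'
+      (upmendex only)
+
    icu_locale  <string>
       ""
       Locale that the ICU collator follows.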
@@ -331,12 +365,13 @@
       ""
       String giving the collation rules, used when specifying the collation order
       for the ICU collator independently of the locale.
-      ( Ref. http://userguide.icu-project.org/collation/customization
-             https://unicode-org.github.io/icu/userguide/collation/customization/
+      ( Ref. https://unicode-org.github.io/icu/userguide/collation/customization/
              http://www.unicode.org/reports/tr35/tr35-collation.html#Rules )
       Unicode characters in UTF-8 and the following escape sequences may be used:
       \Uhhhhhhhh (8 hexadecimal digits [0-9A-Fa-f]), \uhhhh (4 hexadecimal digits),
       \xhh (2 hexadecimal digits), \x{h...} (1 to 8 hexadecimal digits), \ooo (3 octal digits [0-7]).
+      When icu_locale and icu_rules are both specified, the rules given by icu_rules are added
+      on top of the rules given by icu_locale.
       When it is an empty string (the default), the collation rules of the locale are followed.
       (upmendex only)
 
@@ -343,8 +378,7 @@
    icu_attributes  <string>
       ""
       Attribute settings for the ICU collator.
-      ( Ref. http://userguide.icu-project.org/collation/customization#TOC-Default-Options
-             https://unicode-org.github.io/icu/userguide/collation/customization/#default-options
+      ( Ref. https://unicode-org.github.io/icu/userguide/collation/customization/#default-options
              http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options )
       The following strings are interpreted:
       "alternate:shifted", "alternate:non-ignorable",
@@ -365,7 +399,7 @@
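 katakana, and to attach readings with contracted sounds (yoon), the moraic nasal (hatsuon), and voicing marks (dakuten) removed (some
 versions do this unification automatically).
   In upmendex, kana are sorted using International Components for Unicode (ICU).
-For kanji, setting up a dictionary file largely removes the work of attaching a reading
+For kanji and symbols, setting up a dictionary file largely removes the work of attaching a reading
 to each index entry.
 
   A dictionary file consists of a list of <term  reading> pairs. The term and the reading are separated by a tab or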
@@ -376,7 +410,7 @@
    漢字     かんじ
    読み     よみ
    環境     かんきょう
-   α       アルファ
+   $       ドル
 
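   For terms registered in the dictionary, include okurigana so that only one reading is possible.
   For words such as 「表」 and 「性質」, which can be read in two ways regardless of okurigana,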
@@ -445,5 +479,5 @@
 References
 
 International Components for Unicode (ICU)
-http://site.icu-project.org/
+http://icu.unicode.org/
 https://unicode-org.github.io/icu/

Modified: trunk/Build/source/texk/upmendex/var.h
===================================================================
--- trunk/Build/source/texk/upmendex/var.h	2021-11-20 00:48:35 UTC (rev 61095)
+++ trunk/Build/source/texk/upmendex/var.h	2021-11-20 09:29:33 UTC (rev 61096)
@@ -35,7 +35,8 @@
 UChar atama[STYBUFSIZE],hangul_head[STYBUFSIZE],hanzi_head[STYBUFSIZE]={L'\0'},kana_head[STYBUFSIZE]={L'\0'};
 UChar devanagari_head[STYBUFSIZE],thai_head[STYBUFSIZE];
 char page_compositor[STYBUFSIZE]={"-"},page_precedence[STYBUFSIZE]={"rnaRA"};
-char character_order[STYBUFSIZE]={"SNLGCJKHDT"};
+char character_order[STYBUFSIZE]={"SNLGCJKHDTah"};
+char script_preamble[11][STYBUFSIZE],script_postamble[11][STYBUFSIZE];
 char icu_locale[STYBUFSIZE]={"root"},icu_rules[STYBUFSIZE]={""};
 int icu_attributes[UCOL_ATTRIBUTE_COUNT];
 


