LexBuild - Fix Illegal Non-ASCII Characters
Some characters in the range U+0080 to U+009F have different values in Unicode and in HTML character entities. These non-compliant characters are listed as nonstandard entities in the conformance column of the HTML character entities table. They are not decoded or displayed correctly in UTF-8, although they are viewable in an HTML browser, so they need to be converted to the correct Unicode values. This feature replaces these characters with the correct Unicode characters, as described in the following tables.
In LEXICON 2008, this fix was done in post-processing on LEXICON. Please note that the annotation fields in LEXICON are not included in the official release and thus are not fixed by this feature.
After 2008, LexBuild was upgraded to use Unicode (UTF-8) in the web browser, and all data were upgraded to UTF-8 as well. Since then, all these mappings are done automatically when users input data; this mapping program is used only as a double check.
I. Automatically fixed characters
Illegal Characters | | | Replace Characters | | | |
---|---|---|---|---|---|---|
Name | Char | Value | Name | Char | Value | ASCII Mapping in 2008 |
| | [] | U+0082 | SINGLE LOW-9 QUOTATION MARK | [‚] | U+201A | [,]: U+002C, COMMA |
Florin | [] | U+0083 | LATIN SMALL LETTER F WITH HOOK | [ƒ] | U+0192 | |
Right double Quote | [] | U+0084 | DOUBLE LOW-9 QUOTATION MARK | [„] | U+201E | ["]: U+0022, QUOTATION MARK |
Ellipsis | [ ] | U+0085 | HORIZONTAL ELLIPSIS | […] | U+2026 | [.]: U+002E, PERIOD (replaced by three periods [...]) |
Dagger | [] | U+0086 | DAGGER | [†] | U+2020 | |
Double Dagger | [] | U+0087 | DOUBLE DAGGER | [‡] | U+2021 | |
Circumflex | [] | U+0088 | MODIFIER LETTER CIRCUMFLEX ACCENT | [ˆ] | U+02C6 | [^]: U+005E, CIRCUMFLEX ACCENT |
Permil | [] | U+0089 | PER MILLE SIGN | [‰] | U+2030 | |
| | [] | U+008A | LATIN CAPITAL LETTER S WITH CARON | [Š] | U+0160 | |
Less than sign | [] | U+008B | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | [‹] | U+2039 | [<]: U+003C, LESS-THAN SIGN |
Capital OE Ligature | [] | U+008C | LATIN CAPITAL LIGATURE OE | [Œ] | U+0152 | |
Left Single Quote | [] | U+0091 | LEFT SINGLE QUOTATION MARK | [‘] | U+2018 | [']: U+0027, APOSTROPHE |
Right Single Quote | [] | U+0092 | RIGHT SINGLE QUOTATION MARK | [’] | U+2019 | [']: U+0027, APOSTROPHE |
Left Double Quote | [] | U+0093 | LEFT DOUBLE QUOTATION MARK | [“] | U+201C | ["]: U+0022, QUOTATION MARK |
Right Double Quote | [] | U+0094 | RIGHT DOUBLE QUOTATION MARK | [”] | U+201D | ["]: U+0022, QUOTATION MARK |
Bullet | [] | U+0095 | BULLET | [•] | U+2022 | |
Hyphen | [‐] | U+2010 | HYPHEN, GENERAL_PUNCTUATION | [-] | U+002D | |
| | [] | U+200E | GENERAL_PUNCTUATION | | | |
En Dash | [] | U+0096 | EN DASH | [–] | U+2013 | |
Em Dash | [] | U+0097 | EM DASH | [—] | U+2014 | |
Tilde | [] | U+0098 | SMALL TILDE | [˜] | U+02DC | |
Trademark | [] | U+0099 | TRADE MARK SIGN | [™] | U+2122 | |
| | [] | U+009A | LATIN SMALL LETTER S WITH CARON | [š] | U+0161 | |
Greater than sign | [] | U+009B | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | [›] | U+203A | [>]: U+003E, GREATER-THAN SIGN |
Small oe ligature | [] | U+009C | LATIN SMALL LIGATURE OE | [œ] | U+0153 | |
Capital Y, umlaut | [] | U+009F | LATIN CAPITAL LETTER Y WITH DIAERESIS | [Ÿ] | U+0178 | |
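The automatic replacements in the table above can be sketched as a simple code point lookup. This is a minimal Python illustration, not the actual LexBuild code; the names `REPLACEMENTS` and `fix_illegal_non_ascii` are hypothetical. U+200E is left out because the table lists no replacement for it.

```python
# Hypothetical sketch: illegal code point -> replacement character,
# copied row by row from the table above (not taken from LexBuild itself).
REPLACEMENTS = {
    0x0082: "\u201A",  # SINGLE LOW-9 QUOTATION MARK
    0x0083: "\u0192",  # LATIN SMALL LETTER F WITH HOOK
    0x0084: "\u201E",  # DOUBLE LOW-9 QUOTATION MARK
    0x0085: "\u2026",  # HORIZONTAL ELLIPSIS
    0x0086: "\u2020",  # DAGGER
    0x0087: "\u2021",  # DOUBLE DAGGER
    0x0088: "\u02C6",  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x0089: "\u2030",  # PER MILLE SIGN
    0x008A: "\u0160",  # LATIN CAPITAL LETTER S WITH CARON
    0x008B: "\u2039",  # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x008C: "\u0152",  # LATIN CAPITAL LIGATURE OE
    0x0091: "\u2018",  # LEFT SINGLE QUOTATION MARK
    0x0092: "\u2019",  # RIGHT SINGLE QUOTATION MARK
    0x0093: "\u201C",  # LEFT DOUBLE QUOTATION MARK
    0x0094: "\u201D",  # RIGHT DOUBLE QUOTATION MARK
    0x0095: "\u2022",  # BULLET
    0x0096: "\u2013",  # EN DASH
    0x0097: "\u2014",  # EM DASH
    0x0098: "\u02DC",  # SMALL TILDE
    0x0099: "\u2122",  # TRADE MARK SIGN
    0x009A: "\u0161",  # LATIN SMALL LETTER S WITH CARON
    0x009B: "\u203A",  # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x009C: "\u0153",  # LATIN SMALL LIGATURE OE
    0x009F: "\u0178",  # LATIN CAPITAL LETTER Y WITH DIAERESIS
    0x2010: "-",       # HYPHEN -> U+002D
}


def fix_illegal_non_ascii(text: str) -> str:
    """Replace every illegal character listed above; leave all others untouched."""
    return text.translate(REPLACEMENTS)


if __name__ == "__main__":
    # The string holds U+0092 where an apostrophe was intended.
    print(fix_illegal_non_ascii("Alzheimer\u0092s disease"))
```

For the U+0080 to U+009F rows, the replacement characters coincide with the Windows-1252 assignments for the same byte values, which is a convenient way to cross-check the mapping.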
II. Manually fixed characters
Illegal Characters | | | Replace Characters | | | |
---|---|---|---|---|---|---|
Name | Char | Value | Name | Char | Value | Notes |
Nonbreaking Space | [ ] | U+00A0 | Space | [ ] | U+0020 | trim it if at the end of string |
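Although this fix is applied manually, the rule in the table is simple enough to sketch in code. The following Python snippet is only an illustration under that reading of the note (drop nonbreaking spaces at the end of the string, turn the rest into ordinary spaces); the function name `fix_nbsp` is hypothetical.

```python
NBSP = "\u00A0"  # NO-BREAK SPACE


def fix_nbsp(text: str) -> str:
    """Trim trailing nonbreaking spaces, then map the remaining ones to U+0020."""
    trimmed = text.rstrip(NBSP)        # "trim it if at the end of string"
    return trimmed.replace(NBSP, " ")  # all other occurrences become a plain space


if __name__ == "__main__":
    print(repr(fix_nbsp("in\u00A0vitro\u00A0")))
```

Trimming before replacing keeps any ordinary trailing spaces in the input as they are, since the table only mentions the nonbreaking space.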
Program:
shell> $LEXBUILD_DIR/Tools/PostProcessing/NonAscii
------------------------------------
Input file ?
------------------------------------
------------------------------------
Which Program ?
1) Check Non-ASCII
2) Fix Illegal Non-ASCII
------------------------------------
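The internals of the "Check Non-ASCII" option are not described here. As a rough idea of what such a check involves, the Python sketch below reports every character above U+007F together with its position; the function name `find_non_ascii` and the reported fields are assumptions for illustration, not the behavior of the NonAscii tool.

```python
from typing import Iterable, Iterator, Tuple


def find_non_ascii(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line number, column, code point) for each non-ASCII character.

    Line numbers and columns are 1-based; the code point is formatted as U+XXXX.
    """
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                yield line_no, col, f"U+{ord(ch):04X}"


if __name__ == "__main__":
    sample = ["abc", "na\u00EFve"]
    for hit in find_non_ascii(sample):
        print(hit)
```

To run it over a file, the lines could be read with `open(path, encoding="utf-8")`, where `path` would point at the input file given at the prompt.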
Inputs:
$LEXBUILD_DIR/data/WebApp/Outputs/Lexicon/LEXICON
Outputs: