Strip or Map Unicode
Most Unicode characters do not have semantic or graphic similarity to ASCII characters. In other words, these Unicode characters cannot be logically normalized to ASCII because there are no ASCII characters to map them to. These Unicode characters are:
In NLP, these Unicode characters are either stripped (when irrelevant) or mapped to an ASCII string (according to the user's preferences). Accordingly, two processes are followed:
This method can be used after Unicode Core Norm to ensure that the result of normalization is pure ASCII by stripping all irrelevant non-ASCII characters.
Unicode | Mapped ASCII | Char | Unicode Name |
---|---|---|---|
U+0251 | alpha | ɑ | LATIN SMALL LETTER ALPHA |
Please note:
- If the input character is ASCII (code point < 128, that is, below U+0080)
  - Return the input character
- else
  - if the character is in the Unicode non-strip mapping table
    - Map the character
  - else
    - Strip the character
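
The steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation: the names `NON_STRIP_MAP`, `strip_or_map_char`, and `strip_or_map` are hypothetical, and the mapping table holds only the single U+0251 entry shown in the table above.

```python
from typing import Mapping

# Hypothetical non-strip mapping table (assumption for illustration).
# Only the U+0251 entry is taken from the table in this section.
NON_STRIP_MAP = {
    "\u0251": "alpha",  # LATIN SMALL LETTER ALPHA
}


def strip_or_map_char(ch: str, mapping: Mapping[str, str] = NON_STRIP_MAP) -> str:
    """Apply the strip-or-map rule to a single character."""
    if ord(ch) < 0x80:
        # ASCII: return the input character unchanged.
        return ch
    # Non-ASCII: map it if it is in the table, otherwise strip it
    # (an empty string removes the character from the output).
    return mapping.get(ch, "")


def strip_or_map(text: str, mapping: Mapping[str, str] = NON_STRIP_MAP) -> str:
    """Apply the rule to every character of a string."""
    return "".join(strip_or_map_char(ch, mapping) for ch in text)
```

With this sketch, `strip_or_map("\u0251-helix")` yields `"alpha-helix"`, while a non-ASCII character that is absent from the table, such as `é` in `"café"`, is simply dropped. Because the mapped value is a string rather than a single character, one input character may expand to several ASCII characters.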