Strip or Map Unicode
Most Unicode characters have no semantic or graphic similarity to any ASCII character. In other words, these Unicode characters cannot be logically normalized to ASCII because there is no ASCII character to map them to. These Unicode characters are:
In NLP, these Unicode characters are either stripped (if irrelevant) or mapped to an ASCII string (according to user preferences). Accordingly, two processes are followed:
This method can be used after Unicode Core Norm to ensure that the result of normalization is pure ASCII, by stripping all irrelevant non-ASCII characters.
Unicode | Mapped ASCII | Char | Unicode Name |
---|---|---|---|
U+0251 | alpha | ɑ | LATIN SMALL LETTER ALPHA |
Please note:
- If the input character is an ASCII character (code point < 128, that is, below U+0080)
  - Return the input character
- else
  - if the character is in the Unicode non-strip mapping table
    - Map the character
  - else
    - Strip the character
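
The steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation: the names `NON_STRIP_MAP` and `strip_or_map` are hypothetical, and the mapping table contains only the single U+0251 entry shown in the table above.

```python
# Hypothetical non-strip mapping table (assumption for illustration).
# Only the U+0251 -> "alpha" entry is taken from the table above.
NON_STRIP_MAP = {"\u0251": "alpha"}


def strip_or_map(text, mapping=NON_STRIP_MAP):
    """Return an ASCII-only string by mapping or stripping non-ASCII characters."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # ASCII character: return it unchanged
            out.append(ch)
        elif ch in mapping:
            # Non-ASCII character found in the mapping table: map it
            out.append(mapping[ch])
        # Otherwise: strip the character (append nothing)
    return "".join(out)


print(strip_or_map("\u0251-helix"))
```

Under these assumptions, a character that is neither ASCII nor in the mapping table is silently dropped, so the output is always pure ASCII as long as every mapped value is itself ASCII.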