Map Symbols & Punctuation to ASCII
Unicode symbols and punctuation are confusing for three reasons: many of them look alike, the same mark is often defined more than once (in different Unicode blocks), and text editing software automatically changes them during editing and transmission. For example, quotation marks are defined in the Basic Latin, Latin-1 Supplement, and General Punctuation blocks: QUOTATION MARK ( " ), APOSTROPHE ( ' ), GRAVE ACCENT ( ` ), ACUTE ACCENT ( ´ ), LEFT SINGLE QUOTATION MARK ( ‘ ), RIGHT SINGLE QUOTATION MARK ( ’ ), LEFT DOUBLE QUOTATION MARK ( “ ), RIGHT DOUBLE QUOTATION MARK ( ” ), and others. ASCII was designed to support the very restricted typographic style available to typewriter users, so its quotation marks are limited to the straight QUOTATION MARK ( " ) and APOSTROPHE ( ' ). In recent years, text editing software has begun changing ASCII (dumb) quotes to smart quotes automatically. “Smart quotes” is the automatic replacement of the straight quotes you type (' and ") with the correct typographic quote characters (‘ or ’ and “ or ”). The term does not refer to the curved quotes themselves.
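To make the point about look-alike characters concrete, the following minimal sketch prints the code point and official Unicode name of each quote-like character listed above, using Python's standard `unicodedata` module. The names `QUOTE_LIKE` and `describe` are hypothetical helpers introduced only for this illustration.

```python
import unicodedata

# Quote-like characters discussed above, written as escapes so that an
# editor cannot silently "smarten" them. (Illustrative list only.)
QUOTE_LIKE = ['"', "'", "`", "\u00B4", "\u2018", "\u2019", "\u201C", "\u201D"]


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in QUOTE_LIKE:
    print(describe(ch))
```

Characters that are hard to tell apart on screen show up here with distinct code points and names, which is why a plain string comparison between them fails.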
In addition, X Window System fonts, Microsoft software, and some other software automatically replace the following punctuation:
Converting Unicode punctuation and symbols to their ASCII equivalents is imperative in NLP for preserving the original documents.
Unicode | Mapped ASCII | Char | Unicode Name |
---|---|---|---|
U+02BA | " | ʺ | MODIFIER LETTER DOUBLE PRIME |
Please note:
Perform the mapping if the character is in the punctuation & symbols mapping table.
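The rule above can be sketched as a simple table lookup. This is a minimal illustration, not the actual implementation: `PUNCT_TO_ASCII` and `map_to_ascii` are hypothetical names, only the U+02BA entry comes from the table above, and the curly-quote entries are assumed for the sake of the example.

```python
# Hypothetical excerpt of a punctuation & symbols mapping table.
PUNCT_TO_ASCII = {
    "\u02BA": '"',  # MODIFIER LETTER DOUBLE PRIME (from the table above)
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK (assumed entry)
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (assumed entry)
    "\u201C": '"',  # LEFT DOUBLE QUOTATION MARK (assumed entry)
    "\u201D": '"',  # RIGHT DOUBLE QUOTATION MARK (assumed entry)
}


def map_to_ascii(text: str) -> str:
    """Map a character only if it is in the table; keep all others as-is."""
    return "".join(PUNCT_TO_ASCII.get(ch, ch) for ch in text)


# \u201CIt\u2019s 5\u02BA\u201D uses curly quotes and a double prime.
print(map_to_ascii("\u201CIt\u2019s 5\u02BA\u201D"))
```

Characters that are not in the table, such as accented letters, pass through unchanged, so the mapping touches nothing beyond the listed punctuation and symbols.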