Problems with Tokenization & Punctuation
Tokenization and punctuation handling are simple functions in themselves and cause no problems. However, most applications use tokenization as the front end of their algorithms. Typically, text-handling applications perform three steps:
In such text-handling applications, punctuation may cause many problems, as described below:
A space " " is always a delimiter. Some punctuation marks are delimiters, such as ",", ".", "(", and ")", while others are not, such as "@" and "-".
< For example >
* This is a book.
=> space and "." are delimiters
* Sports (tennis, baseball, and basketball) are professional sports.
=> space, "(", ")", ",", and "." are delimiters
* My E-mail address is: lu@nlm.nih.gov
=> space and ":" are delimiters while "-" and "@" are not delimiters
If a punctuation mark is considered a delimiter during tokenization, it may be either stripped or kept during reverse tokenization.
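The delimiter choices in the examples above can be sketched in Python. This is only a minimal illustration: the function name `tokenize` and the idea of passing the delimiter set per call are assumptions made for this sketch, not part of any existing tool. It shows that the same splitting logic gives different tokens depending on which punctuation marks are treated as delimiters.

```python
def tokenize(text, delimiters):
    """Split text on any character in `delimiters`, dropping empty tokens.

    `delimiters` is a string of single characters; which characters belong
    in it is application-specific (an assumption of this sketch).
    """
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# Space and "." as delimiters
print(tokenize("This is a book.", " ."))

# Space and ":" as delimiters; "-" and "@" stay inside tokens
print(tokenize("My E-mail address is: lu@nlm.nih.gov", " :"))
```

Note that if "." were also passed as a delimiter in the second call, the e-mail address would be split into several tokens, which is exactly the kind of problem described here.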
< For example >
* Strip stopwords on (in, the, top, left)
=> Strip stopwords (, , top, left)
=> Strip stopwords (top, left)
------------------------------------------------
=> Space, "(", ",", and ")" are delimiters
=> Spaces are always kept (and trimmed) during reverse tokenization
=> "(", "," and ")" are also kept during reverse tokenization
=> Sometimes "," is stripped during reverse tokenization
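The stopword example above can be reproduced with a short Python sketch. The stopword list, the function names, and the `strip_dangling_commas` flag are all hypothetical and chosen only to match this example. The sketch keeps delimiters as separate pieces during tokenization so that reverse tokenization can decide whether to keep or strip them.

```python
import re

# Illustrative stopword list for this example only (assumption).
STOPWORDS = {"on", "in", "the"}


def split_keep_delimiters(text, delimiters=" (),"):
    """Split text on delimiter characters, keeping each delimiter as a piece."""
    pattern = "([" + re.escape(delimiters) + "])"
    return [piece for piece in re.split(pattern, text) if piece]


def strip_stopwords(text, strip_dangling_commas=False):
    """Tokenize, drop stopwords, then reverse tokenize back into a string."""
    pieces = split_keep_delimiters(text)
    kept = [p for p in pieces if p.lower() not in STOPWORDS]
    result = "".join(kept)
    # Spaces are kept but collapsed and trimmed.
    result = re.sub(r" +", " ", result).strip()
    if strip_dangling_commas:
        # Drop commas left directly after "(" once their words are removed.
        result = re.sub(r"\(\s*(?:,\s*)+", "(", result)
    return result


text = "Strip stopwords on (in, the, top, left)"
print(strip_stopwords(text))
print(strip_stopwords(text, strip_dangling_commas=True))
```

With the flag off, "(", ",", and ")" are all kept, which leaves the empty ", ," slots. With it on, the commas orphaned by the removed words are stripped as well.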