Lexical Tools

Tokenize & Reverse Token for Strip

II. Analysis

As discussed in the introduction, a smarter algorithm is needed to clean up the token list. This algorithm is based on the type of each delimiter (or token). The types are described below.

Token Type:
=> Any token in the token list that is not a delimiter (as described below)
=> Tokens of this type are always kept during cleanup
Space Delimiter Type:
=> " " (space) and "\t" (tab) are the most common members of this type
=> A space delimiter is always a delimiter
=> Runs of space delimiters are trimmed to a single space, which is kept during cleanup
Stripped Type:
=> Any token in the token list that is modified (stripped)
=> For example, "in" or "on" in the strip stop word function
=> The token string is changed to " " when it is stripped
Restore Delimiter Type:
=> A delimiter that is kept during cleanup in all circumstances.
=> Such as "({[".
Stripping Delimiter Type:
=> A delimiter that is kept during cleanup in all circumstances. However, tokens that are conjoined to it and in front of it are stripped if they are of the Stripped type or belong to the conflict token list.
=> Such as ")}]". For example:
(top) => (top)
(A.I.D.S.) => (A.I.D.S.)
(in, on, of) => ( )
" , , " is stripped from "( , , )" because "," belongs to the conflict token list.
The conflict token list includes: "-,:;"
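
The stripping-delimiter rule can be illustrated with a short Python sketch. It is not the Lexical Tools implementation. The function name strip_before_closers is hypothetical, and the sketch assumes that stripped tokens have already been changed to " ", so a stripped token and a space delimiter are handled alike. It also assumes that "conjoined" means "reachable by walking left from the closing delimiter over blank and conflict tokens only".

```python
import re

CONFLICT = set("-,:;")    # conflict token list quoted in the text
STRIPPING = set(")}]")    # stripping (closing) delimiters

def strip_before_closers(tokens):
    """Sketch only: tokens is a list of strings in which every stripped
    token has already been replaced by " "."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok not in STRIPPING:
            continue
        j = i - 1
        # Walk left from the closing delimiter; blank out every blank or
        # conflict token until a kept token or another delimiter is met.
        while j >= 0 and (out[j].isspace() or out[j] in CONFLICT):
            out[j] = " "
            j -= 1
    # Space delimiters are trimmed to a single space during cleanup.
    return re.sub(r"[ \t]+", " ", "".join(out))

# "(in, on, of)" after the stop words were changed to " "
print(strip_before_closers(["(", " ", ",", " ", " ", ",", " ", " ", ")"]))
print(strip_before_closers(["(", "top", ")"]))
print(strip_before_closers(["(", "A.I.D.S.", ")"]))
```

Under these assumptions the three calls reproduce the examples above: "( )", "(top)", and "(A.I.D.S.)". The last case is kept intact because "." is not in the conflict token list.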
Strippable Delimiter Type:
=> A delimiter that is kept only if the previous token is not of the Stripped type
=> If a token is stripped, the punctuation that follows it should be stripped as well. Grammatically, most punctuation marks, such as ".,:;", are placed directly after a word with no space in between, so they should be stripped when the token they are conjoined to is stripped. For example:
in, on, Top => Top
Both "," are stripped because "in" and "on" are stripped