Lexical Tools

Design: Tokenize & Reverse Token for Strip

IV. Example

Let's use the same example as before: stripping the stop words from the string "Check-in four words (in, the, top, of) are checked".

Algorithm:

  • Use the delimiters listed in the previous section.

  • Tokenize the string into a list of tokens and determine the type of each token:

    Index  Token String  Token Type
    1      [Check-in]    Token
    2      [ ]           Space Delimiter
    3      [four]        Token
    4      [ ]           Space Delimiter
    5      [words]       Token
    6      [ ]           Space Delimiter
    7      [(]           Restore Delimiter
    8      [in]          Token
    9      [,]           Strippable Delimiter
    10     [ ]           Space Delimiter
    11     [the]         Token
    12     [,]           Strippable Delimiter
    13     [ ]           Space Delimiter
    14     [top]         Token
    15     [,]           Strippable Delimiter
    16     [ ]           Space Delimiter
    17     [of]          Token
    18     [)]           Stripping Delimiter
    19     [ ]           Space Delimiter
    20     [are]         Token
    21     [ ]           Space Delimiter
    22     [checked]     Token

  • Modify the list by stripping the stop words (in, the, of):

    Index  Token String  Token Type
    1      [Check-in]    Token
    2      [ ]           Space Delimiter
    3      [four]        Token
    4      [ ]           Space Delimiter
    5      [words]       Token
    6      [ ]           Space Delimiter
    7      [(]           Restore Delimiter
    8      [ ]           Stripped
    9      [,]           Strippable Delimiter
    10     [ ]           Space Delimiter
    11     [ ]           Stripped
    12     [,]           Strippable Delimiter
    13     [ ]           Space Delimiter
    14     [top]         Token
    15     [,]           Strippable Delimiter
    16     [ ]           Space Delimiter
    17     [ ]           Stripped
    18     [)]           Stripping Delimiter
    19     [ ]           Space Delimiter
    20     [are]         Token
    21     [ ]           Space Delimiter
    22     [checked]     Token

    If we compose the string from this list, the output string will be
    "Check-in four words ( , , top, ,) are checked"

  • However, this is not the result we want: the delimiters that surrounded the stripped stop words are left behind. One more step is therefore taken to clean up the list, turning Table-1 into Table-2:

    Table-1
    Index  Token String  Token Type            Clean Up Action
    1      [Check-in]    Token                 keep: token
    2      [ ]           Space Delimiter       keep
    3      [four]        Token                 keep
    4      [ ]           Space Delimiter       keep
    5      [words]       Token                 keep
    6      [ ]           Space Delimiter       keep
    7      [(]           Restore Delimiter     keep
    8      [ ]           Stripped              strip: stripped type
    9      [,]           Strippable Delimiter  strip: Strippable Delimiter
    10     [ ]           Space Delimiter       strip: last type is stripped
    11     [ ]           Stripped              strip: stripped type
    12     [,]           Strippable Delimiter  strip: Strippable Delimiter
    13     [ ]           Space Delimiter       strip: last type is stripped
    14     [top]         Token                 keep
    15     [,]           Strippable Delimiter  strip: conflict char
    16     [ ]           Space Delimiter       strip: conflict char
    17     [ ]           Stripped              strip: stripped type
    18     [)]           Stripping Delimiter   keep
    19     [ ]           Space Delimiter       keep
    20     [are]         Token                 keep
    21     [ ]           Space Delimiter       keep
    22     [checked]     Token                 keep

    Table-2
    Index  Token String  Token Type
    1      [Check-in]    Token
    2      [ ]           Space Delimiter
    3      [four]        Token
    4      [ ]           Space Delimiter
    5      [words]       Token
    6      [ ]           Space Delimiter
    7      [(]           Restore Delimiter
    8      [top]         Token
    9      [)]           Stripping Delimiter
    10     [ ]           Space Delimiter
    11     [are]         Token
    12     [ ]           Space Delimiter
    13     [checked]     Token

  • Compose the string based on this cleaned list. The output is:
    "Check-in four words (top) are checked"
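
The four steps above (tokenize, strip, clean up, compose) can be sketched in Python as follows. This is only an illustration, not the Lexical Tools implementation. The names Tok, tokenize, strip_stop_words, clean_up and compose are hypothetical, the delimiter sets are assumed from the example (the real ones are listed in the previous section), a stripped entry is assumed to hold an empty string, and the clean-up rules are one reading of the "Clean Up Action" column of Table-1.

```python
import re
from dataclasses import dataclass

# Assumed delimiter classes; the real sets are listed in the previous section.
DELIMITER_KINDS = {" ": "space", ",": "strippable", "(": "restore", ")": "stripping"}
STOP_WORDS = {"in", "the", "of"}


@dataclass
class Tok:  # hypothetical list element: token string plus token type
    text: str
    kind: str


def tokenize(s):
    """Split s into single-character delimiters and the runs between them."""
    pieces = re.findall(r"[ ,()]|[^ ,()]+", s)
    return [Tok(p, DELIMITER_KINDS.get(p, "token")) for p in pieces]


def strip_stop_words(tokens):
    """Replace each stop-word token with an empty 'stripped' placeholder."""
    return [
        Tok("", "stripped") if t.kind == "token" and t.text.lower() in STOP_WORDS else t
        for t in tokens
    ]


def clean_up(tokens):
    """Drop stripped entries and the delimiters left dangling around them."""
    out = []
    after_stripped = False  # True while the last entry seen was stripped
    for tok in tokens:
        if tok.kind == "stripped":
            after_stripped = True
            continue
        if tok.kind in ("space", "strippable") and after_stripped:
            continue  # delimiter that followed a stripped word
        if tok.kind == "stripping" and after_stripped:
            # Delimiters left in front of the closing delimiter ("conflict char").
            while out and out[-1].kind in ("space", "strippable"):
                out.pop()
        after_stripped = False
        out.append(tok)
    return out


def compose(tokens):
    """Concatenate the token strings back into one string."""
    return "".join(t.text for t in tokens)


text = "Check-in four words (in, the, top, of) are checked"
cleaned = clean_up(strip_stop_words(tokenize(text)))
print(compose(cleaned))  # Check-in four words (top) are checked
```

Keeping the delimiters in the list as typed entries, rather than discarding them during tokenization, is what lets the clean-up step decide delimiter by delimiter what to keep. The sketch does not handle every case: a stop word at the very end of the string, for instance, would leave a trailing space.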