Lexical Tools

Tokenize & Reverse Token for Strip

I. Introduction

The previous section (III. Problems) described the problem. This section discusses the design details for resolving it, that is, how tokenization works together with the strip function. First, let us review the algorithm by walking through the following example.

< For example >

To strip stop words from "Check-in four words (in, the, top, of) are checked"

Algorithm:

  • Decide delimiters: " ", "(", ",", ")".

  • Tokenize the string into a list of tokens, keeping each delimiter as its own token:
    1. [Check-in]
    2. [ ]
    3. [four]
    4. [ ]
    5. [words]
    6. [ ]
    7. [(]
    8. [in]
    9. [,]
    10. [ ]
    11. [the]
    12. [,]
    13. [ ]
    14. [top]
    15. [,]
    16. [ ]
    17. [of]
    18. [)]
    19. [ ]
    20. [are]
    21. [ ]
    22. [checked]

  • Modify the list by stripping the stop words (in, the, of): each stop-word token is replaced with a blank token.
    1. [Check-in]
    2. [ ]
    3. [four]
    4. [ ]
    5. [words]
    6. [ ]
    7. [(]
    8. [in] --> [ ]
    9. [,]
    10. [ ]
    11. [the] --> [ ]
    12. [,]
    13. [ ]
    14. [top]
    15. [,]
    16. [ ]
    17. [of] --> [ ]
    18. [)]
    19. [ ]
    20. [are]
    21. [ ]
    22. [checked]

    If we compose the string from this list, the output string is:
    "Check-in four words ( ,  , top,  ) are checked"

  • However, this is not the result we want: the delimiters and blanks that surrounded the removed stop words are left behind. One more step is therefore taken to clean up the list so that it becomes:
    1. [Check-in]
    2. [ ]
    3. [four]
    4. [ ]
    5. [words]
    6. [ ]
    7. [(]
    8. [top]
    9. [)]
    10. [ ]
    11. [are]
    12. [ ]
    13. [checked]

  • Compose the string based on this cleaned list. The output is:
    "Check-in four words (top) are checked"
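
The four steps above (tokenize, strip, clean up, compose) can be sketched in a few lines of Python. This is only an illustrative sketch, not the Lexical Tools implementation: the function names (tokenize, strip_stop_words, clean_up, compose) are hypothetical, and the clean-up rules are an assumption chosen to reproduce the example, since the text above does not spell them out. The assumed rules are: drop any run of spaces and commas that directly follows "(" or directly precedes ")", and collapse every other run of spaces and commas to at most one comma and one space.

```python
import re

SEPARATORS = {" ", ","}          # delimiters that separate words
OPENERS = {"("}                  # opening delimiters
CLOSERS = {")"}                  # closing delimiters
STOP_WORDS = {"in", "the", "of"}

def tokenize(text):
    """Split on the delimiters " ", "(", ",", ")" and keep them as tokens."""
    # The capturing group makes re.split return the delimiters too;
    # empty strings between adjacent delimiters are discarded.
    return [tok for tok in re.split(r"([ (),])", text) if tok]

def strip_stop_words(tokens):
    """Replace every stop-word token with a blank token."""
    return [" " if tok.lower() in STOP_WORDS else tok for tok in tokens]

def clean_up(tokens):
    """Assumed clean-up rules (not taken from the source document)."""
    cleaned = []
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i] not in SEPARATORS:
            cleaned.append(tokens[i])
            i += 1
            continue
        # Collect the whole run of consecutive separators.
        j = i
        while j < n and tokens[j] in SEPARATORS:
            j += 1
        run = tokens[i:j]
        after_open = bool(cleaned) and cleaned[-1] in OPENERS
        before_close = j < n and tokens[j] in CLOSERS
        if not (after_open or before_close):
            # Keep at most one comma and one space from the run.
            if "," in run:
                cleaned.append(",")
            if " " in run:
                cleaned.append(" ")
        i = j
    return cleaned

def compose(tokens):
    """Reverse tokenization: concatenate the tokens back into a string."""
    return "".join(tokens)

if __name__ == "__main__":
    text = "Check-in four words (in, the, top, of) are checked"
    stripped = strip_stop_words(tokenize(text))
    print(compose(stripped))            # before clean-up
    print(compose(clean_up(stripped)))  # after clean-up
```

Because delimiters are kept as tokens, compose() is a plain concatenation, and composing an unmodified token list gives back the original string. Working on runs of separators, rather than on single tokens, is what lets the clean-up step remove the leftover ", " pairs without touching the single spaces between ordinary words.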