Lexical Tools

Tokenize & Reverse Token for Strip

I. Introduction

The previous section (III. Problems) described the problem. This section discusses the design details for resolving it, that is, for making tokenization work with the strip function. First, let us review the algorithm by working through the following example.

< For example >

To strip stop words from "Check-in four words (in, the, top, of) are checked":

Algorithm:

  • Decide on the delimiters: " ", "(", ",", ")".

  • Tokenize the string into a list of tokens, keeping each delimiter as its own token:
    1. [Check-in]
    2. [ ]
    3. [four]
    4. [ ]
    5. [words]
    6. [ ]
    7. [(]
    8. [in]
    9. [,]
    10. [ ]
    11. [the]
    12. [,]
    13. [ ]
    14. [top]
    15. [,]
    16. [ ]
    17. [of]
    18. [)]
    19. [ ]
    20. [are]
    21. [ ]
    22. [checked]

  • Modify the list by stripping the stop words (in, the, of): each stop word token is replaced with a space token.
    1. [Check-in]
    2. [ ]
    3. [four]
    4. [ ]
    5. [words]
    6. [ ]
    7. [(]
    8. [in] --> [ ]
    9. [,]
    10. [ ]
    11. [the] --> [ ]
    12. [,]
    13. [ ]
    14. [top]
    15. [,]
    16. [ ]
    17. [of] --> [ ]
    18. [)]
    19. [ ]
    20. [are]
    21. [ ]
    22. [checked]

    If we compose the string from this list, the output string will be:
    "Check-in four words ( ,  , top,  ) are checked"

  • However, this is not the result we want: the stripped words leave stray spaces and commas behind. Thus, one more step is taken to clean up the list so that it becomes:
    1. [Check-in]
    2. [ ]
    3. [four]
    4. [ ]
    5. [words]
    6. [ ]
    7. [(]
    8. [top]
    9. [)]
    10. [ ]
    11. [are]
    12. [ ]
    13. [checked]

  • Compose the string based on this cleaned list. The output is:
    "Check-in four words (top) are checked"
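
The steps above can be sketched in a few lines of Python. This is only an illustration, not the Lexical Tools implementation: the function names (tokenize, strip_stop_words, strip_and_clean) and the SEPARATORS set are hypothetical. The clean-up rule is also an assumption, since the example shows only the cleaned list and not how it is derived. The rule assumed here is that a stop word is dropped together with the separators (spaces and commas) that follow it, or, when none follow, together with the separators that precede it.

```python
import re

DELIMITERS = " (,)"             # the four delimiters from the example
SEPARATORS = {" ", ","}         # assumption: delimiters that only separate words
STOP_WORDS = {"in", "the", "of"}


def tokenize(text, delimiters=DELIMITERS):
    """Split text into word tokens and single-character delimiter tokens."""
    pattern = "([" + re.escape(delimiters) + "])"
    # The capturing group makes re.split keep the delimiters; drop empty strings.
    return [tok for tok in re.split(pattern, text) if tok]


def strip_stop_words(tokens, stop_words=STOP_WORDS):
    """Replace each stop word token with a space token (no clean-up yet)."""
    return [" " if tok in stop_words else tok for tok in tokens]


def strip_and_clean(tokens, stop_words=STOP_WORDS, separators=SEPARATORS):
    """Drop stop words plus adjacent separators (assumed clean-up rule)."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] not in stop_words:
            out.append(tokens[i])
            i += 1
            continue
        # Skip the stop word and any run of separators right after it.
        j = i + 1
        while j < len(tokens) and tokens[j] in separators:
            j += 1
        if j == i + 1:
            # Nothing to skip after it: drop the separators emitted before it.
            while out and out[-1] in separators:
                out.pop()
        i = j
    return out


text = "Check-in four words (in, the, top, of) are checked"
tokens = tokenize(text)                        # 22 tokens, as listed above
print("".join(strip_stop_words(tokens)))       # Check-in four words ( ,  , top,  ) are checked
print("".join(strip_and_clean(tokens)))        # Check-in four words (top) are checked
```

Because the delimiters are kept as tokens, composing the output is a plain concatenation of the list, which is what makes the tokenization reversible.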