Design: Tokenize & Reverse Token for Strip
Let's use the same example: strip the stop words from
"Check-in four words (in, the, top, of) are checked"
Algorithm:
First, tokenize the string into a list of tokens and delimiters:
Index | Token String | Token Type |
1 | [Check-in] | Token |
2 | [ ] | Space Delimiter |
3 | [four] | Token |
4 | [ ] | Space Delimiter |
5 | [words] | Token |
6 | [ ] | Space Delimiter |
7 | [(] | Restore Delimiter |
8 | [in] | Token |
9 | [,] | Strippable Delimiter |
10 | [ ] | Space Delimiter |
11 | [the] | Token |
12 | [,] | Strippable Delimiter |
13 | [ ] | Space Delimiter |
14 | [top] | Token |
15 | [,] | Strippable Delimiter |
16 | [ ] | Space Delimiter |
17 | [of] | Token |
18 | [)] | Striping Delimiter |
19 | [ ] | Space Delimiter |
20 | [are] | Token |
21 | [ ] | Space Delimiter |
22 | [checked] | Token |
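The token list above can be reproduced with a small tokenizer. The following is a minimal sketch, not the actual implementation: the function name `tokenize`, the `DELIMITER_TYPES` map, and the rule that any run of characters other than space, comma, and parentheses forms one token (which keeps "Check-in" whole) are assumptions inferred from the table.

```python
import re

# Assumed mapping from delimiter character to the type names used in the table.
DELIMITER_TYPES = {
    " ": "Space Delimiter",
    ",": "Strippable Delimiter",
    "(": "Restore Delimiter",
    ")": "Striping Delimiter",
}


def tokenize(text):
    """Split text into (string, type) pairs, one pair per token or delimiter."""
    # A token is a maximal run of non-delimiter characters; each delimiter
    # character becomes its own entry.
    pieces = re.findall(r"[^ ,()]+|[ ,()]", text)
    return [(piece, DELIMITER_TYPES.get(piece, "Token")) for piece in pieces]


entries = tokenize("Check-in four words (in, the, top, of) are checked")
for index, (string, kind) in enumerate(entries, start=1):
    print(f"{index} | [{string}] | {kind} |")
```

Because the hyphen is not treated as a delimiter in this sketch, "Check-in" stays a single token and its "in" is never compared against the stop-word list.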
Next, the stop words (in, the, of) are stripped; their entries change to type Stripped:
Index | Token String | Token Type |
1 | [Check-in] | Token |
2 | [ ] | Space Delimiter |
3 | [four] | Token |
4 | [ ] | Space Delimiter |
5 | [words] | Token |
6 | [ ] | Space Delimiter |
7 | [(] | Restore Delimiter |
8 | [ ] | Stripped |
9 | [,] | Strippable Delimiter |
10 | [ ] | Space Delimiter |
11 | [ ] | Stripped |
12 | [,] | Strippable Delimiter |
13 | [ ] | Space Delimiter |
14 | [top] | Token |
15 | [,] | Strippable Delimiter |
16 | [ ] | Space Delimiter |
17 | [ ] | Stripped |
18 | [)] | Striping Delimiter |
19 | [ ] | Space Delimiter |
20 | [are] | Token |
21 | [ ] | Space Delimiter |
22 | [checked] | Token |
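The stripping step that produces the list above could look like the sketch below. The stop-word set `STOP_WORDS`, the helper name `strip_stop_words`, and the choice to blank out a stripped entry's string are assumptions for illustration. Only whole tokens are compared, so the "in" inside "Check-in" would be left alone.

```python
# Hypothetical stop-word list; the example strips "in", "the" and "of".
STOP_WORDS = {"in", "the", "of"}


def strip_stop_words(entries):
    """Return a copy of entries with stop-word tokens marked as Stripped."""
    result = []
    for string, kind in entries:
        if kind == "Token" and string.lower() in STOP_WORDS:
            # Keep the slot so indexes stay aligned with the original list.
            result.append(("", "Stripped"))
        else:
            result.append((string, kind))
    return result


# Entries 7 to 18 of the token list: "(in, the, top, of)".
entries = [
    ("(", "Restore Delimiter"),
    ("in", "Token"), (",", "Strippable Delimiter"), (" ", "Space Delimiter"),
    ("the", "Token"), (",", "Strippable Delimiter"), (" ", "Space Delimiter"),
    ("top", "Token"), (",", "Strippable Delimiter"), (" ", "Space Delimiter"),
    ("of", "Token"),
    (")", "Striping Delimiter"),
]

stripped = strip_stop_words(entries)
# Report the original index (starting at 7) of every stripped entry.
stripped_indexes = [
    index
    for index, (_, kind) in enumerate(stripped, start=7)
    if kind == "Stripped"
]
print(stripped_indexes)  # [8, 11, 17]
```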
If we compose the string directly from the stripped list, without any clean-up, the output string will be
"Check-in four words ( , , top, ) are checked"
Table-1: clean-up action for each entry of the stripped list
Index | Token String | Token Type | Clean Up Action |
1 | [Check-in] | Token | keep: token |
2 | [ ] | Space Delimiter | keep |
3 | [four] | Token | keep |
4 | [ ] | Space Delimiter | keep |
5 | [words] | Token | keep |
6 | [ ] | Space Delimiter | keep |
7 | [(] | Restore Delimiter | keep |
8 | [ ] | Stripped | strip: stripped type |
9 | [,] | Strippable Delimiter | strip: Strippable Delimiter |
10 | [ ] | Space Delimiter | strip: last type is stripped |
11 | [ ] | Stripped | strip: stripped type |
12 | [,] | Strippable Delimiter | strip: Strippable Delimiter |
13 | [ ] | Space Delimiter | strip: last type is stripped |
14 | [top] | Token | keep |
15 | [,] | Strippable Delimiter | strip: conflict char |
16 | [ ] | Space Delimiter | strip: conflict char |
17 | [ ] | Stripped | strip: stripped type |
18 | [)] | Striping Delimiter | keep |
19 | [ ] | Space Delimiter | keep |
20 | [are] | Token | keep |
21 | [ ] | Space Delimiter | keep |
22 | [checked] | Token | keep |
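One way to read the actions in Table-1 is as a single forward pass with two rules. This is a sketch of that reading, not a confirmed specification. Rule 1: a Stripped entry is dropped, together with any Strippable or Space delimiters that directly follow it. Rule 2: when a Striping Delimiter such as ")" arrives right after stripped content, the delimiters already emitted just before it are dropped as well, which is how we interpret "conflict char" for entries 15 and 16. The names `clean_up` and `REMOVABLE` are hypothetical.

```python
REMOVABLE = {"Strippable Delimiter", "Space Delimiter"}


def clean_up(entries):
    """Drop Stripped entries and the delimiters they leave dangling."""
    kept = []
    after_stripped = False  # True while the most recent content was stripped
    for string, kind in entries:
        if kind == "Stripped":
            after_stripped = True
            continue
        if kind in REMOVABLE and after_stripped:
            continue  # rule 1: delimiter that followed a stripped entry
        if kind == "Striping Delimiter" and after_stripped:
            # rule 2: remove delimiters left dangling before the closer
            while kept and kept[-1][1] in REMOVABLE:
                kept.pop()
        kept.append((string, kind))
        after_stripped = False
    return kept


# The 22 entries of Table-1, with stripped strings shown as empty.
T = "Token"
SP = (" ", "Space Delimiter")
COMMA = (",", "Strippable Delimiter")
GONE = ("", "Stripped")
entries = [
    ("Check-in", T), SP, ("four", T), SP, ("words", T), SP,
    ("(", "Restore Delimiter"),
    GONE, COMMA, SP, GONE, COMMA, SP,
    ("top", T), COMMA, SP, GONE,
    (")", "Striping Delimiter"),
    SP, ("are", T), SP, ("checked", T),
]

cleaned = clean_up(entries)
print("".join(string for string, _ in cleaned))
# Check-in four words (top) are checked
```

Under these two rules the spaces at indexes 6 and 19 survive, because neither one directly follows a stripped entry, and 13 entries remain.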
Table-2: the list after clean-up, which composes to "Check-in four words (top) are checked"
Index | Token String | Token Type |
1 | [Check-in] | Token |
2 | [ ] | Space Delimiter |
3 | [four] | Token |
4 | [ ] | Space Delimiter |
5 | [words] | Token |
6 | [ ] | Space Delimiter |
7 | [(] | Restore Delimiter |
8 | [top] | Token |
9 | [)] | Striping Delimiter |
10 | [ ] | Space Delimiter |
11 | [are] | Token |
12 | [ ] | Space Delimiter |
13 | [checked] | Token |