Core-term
I. Introduction
Many n-grams have punctuation at the beginning and/or the end, such as:
Input Term | Core-term |
---|---|
- in details, | in details |
- in details | in details |
in details, | in details |
in (5) details, | in (5) details |
(in (5) details, | in (5) details |
(in (5) details), | in (5) details |
All of the above n-grams are normalized to "in details" or "in (5) details" by stripping the leading and/or ending punctuation. The normalized term is called the core-term, which is the core of the term. This process is called core-term normalization.
A core-term may retain internal punctuation, such as "in (5) details". Leading and/or ending punctuation may also remain in a core-term when it belongs to the term, such as the closing bracket in "clean room(s)".
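To illustrate why a plain punctuation strip is not sufficient here, the following sketch uses Python's built-in `str.strip` with `string.punctuation`. The helper name `naive_strip` is hypothetical and is not part of the core-term algorithm; it only shows that a naive strip damages a term such as "clean room(s)", whose ending bracket belongs to the term.

```python
import string


def naive_strip(term: str) -> str:
    """Strip every leading/ending ASCII punctuation or whitespace character."""
    return term.strip(string.punctuation + string.whitespace)


# Works for simple cases ...
print(naive_strip("- in details,"))      # in details
# ... but removes the closing bracket that belongs to "room(s)".
print(naive_strip("((clean room(s)))"))  # clean room(s
```

This is the reason the core-term process treats brackets separately from other punctuation.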
II. Algorithm
Recursively repeat the following process until the term does not change or its length is 0:
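Leading punctuation to strip:
ASCII | -)}]_!@#%&*\\:;\"',.?/~+=|>$`^ |
Unicode | ¦§»‐‑‒–—―’”•․‥…⁈ |
Ending punctuation to strip:
ASCII | -({[_!@#%&*\\:;\"',.?/~+=|>$`^ |
Unicode | ¦§«‐‑‒–—―‘“•․‥…⁈ |
Brackets to strip (left and right pairs), based on the net bracket no*:
ASCII | (), [], {}, <> |
Unicode | «»‘’“” |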
* net bracket no = total left bracket no - total right bracket no
For example,
Term | Net Bracket No |
---|---|
(in details:) | 0 |
(in (5) details:) | 0 |
(in (5) details | 1 |
in (5) details) | -1 |
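The net bracket number can be sketched in a few lines of Python. This is a minimal sketch, assuming that all bracket types listed in the algorithm (ASCII and Unicode) are pooled into one left count and one right count, which is how the definition above reads. The names `LEFT_BRACKETS`, `RIGHT_BRACKETS`, and `net_bracket_no` are hypothetical.

```python
# Bracket characters taken from the bracket pairs listed in the algorithm.
LEFT_BRACKETS = "([{<«‘“"
RIGHT_BRACKETS = ")]}>»’”"


def net_bracket_no(term: str) -> int:
    """Return (total left brackets) - (total right brackets) in the term."""
    left = sum(term.count(ch) for ch in LEFT_BRACKETS)
    right = sum(term.count(ch) for ch in RIGHT_BRACKETS)
    return left - right


for term in ["(in details:)", "(in (5) details:)",
             "(in (5) details", "in (5) details)"]:
    print(term, "|", net_bracket_no(term))
```

A positive value indicates an unmatched left bracket, and a negative value indicates an unmatched right bracket.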
III. Examples
Input nGram | Core-term |
---|---|
Strip punctuation | |
-in details | in details |
In details: | In details |
#$%IN DETAILS:%^( | IN DETAILS |
( | |
() | |
Strip brackets | |
{in (5) details} | in (5) details |
{{in (5) details} | in (5) details |
{in (5) details}} | in (5) details |
{in (5)} details}} | {in (5)} details |
Strip brackets and punctuation | |
(in details:) | in details |
(in details:)) | in details |
(-(in details)%^) | in details |
{in (5) days}, | in (5) days |
in (5 days), | in (5 days) |
in ((5) days), | in ((5) days) |
((clean room(s))) | clean room(s) |
((inch(es))) | inch(es) |
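The examples above can be reproduced with the following Python sketch. It is not the original implementation, and several points are assumptions inferred from the examples: the first pair of character lists in section II is applied at the leading position and the second pair at the ending position; a leading left bracket is stripped when the net bracket number is positive; an ending right bracket is stripped when it is negative; and a matching enclosing pair is stripped when it is 0. All names (`core_term`, `strip_brackets`, and so on) are hypothetical.

```python
# Character sets copied from section II.
# ASSUMPTION: first pair of lists = leading position, second pair = ending.
LEADING_PUNCT = "-)}]_!@#%&*\\:;\"',.?/~+=|>$`^" + "¦§»‐‑‒–—―’”•․‥…⁈"
ENDING_PUNCT = "-({[_!@#%&*\\:;\"',.?/~+=|>$`^" + "¦§«‐‑‒–—―‘“•․‥…⁈"
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">",
                 "«": "»", "‘": "’", "“": "”"}
LEFT = set(BRACKET_PAIRS)
RIGHT = set(BRACKET_PAIRS.values())


def net_bracket_no(term: str) -> int:
    """Total left brackets minus total right brackets."""
    return sum(c in LEFT for c in term) - sum(c in RIGHT for c in term)


def strip_brackets(term: str) -> str:
    """Strip one unmatched or enclosing bracket (ASSUMED rule, see lead-in)."""
    if not term:
        return term
    net = net_bracket_no(term)
    if net > 0 and term[0] in LEFT:
        return term[1:]                 # unmatched leading left bracket
    if net < 0 and term[-1] in RIGHT:
        return term[:-1]                # unmatched ending right bracket
    if net == 0 and len(term) >= 2 and BRACKET_PAIRS.get(term[0]) == term[-1]:
        return term[1:-1]               # enclosing pair
    return term


def core_term(term: str) -> str:
    """Repeat the stripping steps until the term is stable or empty."""
    current, previous = term.strip(), None
    while current and current != previous:
        previous = current
        current = current.lstrip(LEADING_PUNCT).rstrip(ENDING_PUNCT).strip()
        current = strip_brackets(current).strip()
    return current


print(core_term("(-(in details)%^)"))   # in details
print(core_term("{in (5)} details}}"))  # {in (5)} details
print(core_term("((clean room(s)))"))   # clean room(s)
```

The loop replaces the recursion described in section II; each pass either shortens the term or leaves it unchanged, so it always terminates.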