The SPECIALIST Lexicon

Core-term

I. Introduction

Many n-grams have punctuation at the beginning and/or the end, for example:

Input Term         | Core-term
- in details,      | in details
- in details       | in details
in details,        | in details
in (5) details,    | in (5) details
(in (5) details,   | in (5) details
(in (5) details),  | in (5) details

All of the n-grams above are normalized to "in details" or "in (5) details" by stripping the leading and/or ending punctuation. The normalized term is called the core-term, that is, the core of the term. This process is called core-term normalization.

A core-term may retain internal punctuation, such as "in (5) details". Leading and/or ending punctuation may also remain in a core-term, such as the closing parenthesis in "clean room(s)".

II. Algorithm

Repeat the following process until the term no longer changes or its length is 0:

  • Strip leading characters that are punctuation other than left (opening) brackets. The stripped characters include:
    ASCII -)}]_!@#%&*\\:;\"',.?/~+=|>$`^
    Unicode ¦§­»‐‑‒–—―’”•․‥…⁈​
  • Strip ending characters that are punctuation other than right (closing) brackets. The stripped characters include:
    ASCII -({[_!@#%&*\\:;\"',.?/~+=|>$`^
    Unicode ¦§­«‐‑‒–—―‘“•․‥…⁈​
  • Strip enclosing brackets at both ends (leading and ending positions), including:
    ASCII (), [], {}, <>
    Unicode «», ‘’, “”
    • Strip both the leading and the ending bracket if they match and the net bracket no* is 0
    • Strip the leading left bracket if the net bracket no* is > 0
    • Strip the ending right bracket if the net bracket no* is < 0
  • Trim leading and trailing whitespace

* net bracket no = total number of left brackets - total number of right brackets

For example,

Term               | Net Bracket No
(in details:)      | 0
(in (5) details:)  | 0
(in (5) details    | 1
in (5) details)    | -1
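
The net bracket number can be computed with a simple count. The following Python sketch is illustrative only: the name net_bracket_no and the two constants are hypothetical and are not taken from the SPECIALIST tools. It assumes that every bracket type listed above (ASCII and Unicode) contributes to the same total.

```python
# Bracket characters from Section II; left and right forms are counted separately.
LEFT_BRACKETS = set("([{<«‘“")
RIGHT_BRACKETS = set(")]}>»’”")


def net_bracket_no(term: str) -> int:
    """Return (number of left brackets) - (number of right brackets) in term."""
    left = sum(ch in LEFT_BRACKETS for ch in term)
    right = sum(ch in RIGHT_BRACKETS for ch in term)
    return left - right


# Print the net bracket number for the four terms used as examples above.
for term in ["(in details:)", "(in (5) details:)", "(in (5) details", "in (5) details)"]:
    print(f"{term!r}: {net_bracket_no(term)}")
```

A positive result means the term has unclosed left brackets, and a negative result means it has unopened right brackets; this sign is what decides which end is stripped.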

III. Examples

Input nGram          | Core-term

Strip punctuation
-in details          | in details
In details:          | In details
#$%IN DETAILS:%^(    | IN DETAILS
(                    | (empty string)
()                   | (empty string)

Strip brackets
{in (5) details}     | in (5) details
{{in (5) details}    | in (5) details
{in (5) details}}    | in (5) details
{in (5)} details}}   | {in (5)} details

Strip brackets and punctuation
(in details:)        | in details
(in details:))       | in details
(-(in details)%^)    | in details
{in (5) days},       | in (5) days
in (5 days),         | in (5 days)
in ((5) days),       | in ((5) days)
((clean room(s)))    | clean room(s)
((inch(es)))         | inch(es)
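
The algorithm in Section II can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the SPECIALIST tools: all names (core_term, strip_brackets, and the constants) are hypothetical, the punctuation lists contain only the visible characters from Section II, and two brackets are taken to "match" when they are the left and right forms of the same bracket type.

```python
# Punctuation lists copied from Section II (visible characters only).
LEAD_PUNCT = "-)}]_!@#%&*\\:;\"',.?/~+=|>$`^" + "¦§»‐‑‒–—―’”•․‥…⁈"
END_PUNCT = "-({[_!@#%&*\\:;\"',.?/~+=|>$`^" + "¦§«‐‑‒–—―‘“•․‥…⁈"

# Left bracket -> matching right bracket.
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">", "«": "»", "‘": "’", "“": "”"}
RIGHT_BRACKETS = set(BRACKET_PAIRS.values())


def net_bracket_no(term: str) -> int:
    """Total left brackets minus total right brackets."""
    left = sum(ch in BRACKET_PAIRS for ch in term)
    right = sum(ch in RIGHT_BRACKETS for ch in term)
    return left - right


def strip_brackets(term: str) -> str:
    """Strip at most one bracket (or one matching pair) from the ends of term."""
    if not term:
        return term
    net = net_bracket_no(term)
    first, last = term[0], term[-1]
    if net == 0 and BRACKET_PAIRS.get(first) == last:
        return term[1:-1]  # balanced term enclosed by a matching pair
    if net > 0 and first in BRACKET_PAIRS:
        return term[1:]    # surplus left bracket at the start
    if net < 0 and last in RIGHT_BRACKETS:
        return term[:-1]   # surplus right bracket at the end
    return term


def core_term(term: str) -> str:
    """Repeat the stripping steps until the term is stable or empty."""
    current = term.strip()
    while current:
        previous = current
        current = current.lstrip(LEAD_PUNCT)  # leading punctuation
        current = current.rstrip(END_PUNCT)   # ending punctuation
        current = strip_brackets(current)     # enclosing or surplus brackets
        current = current.strip()             # trim whitespace
        if current == previous:
            break
    return current


# Print the core-term for a few of the example inputs.
for ngram in ["-in details", "{{in (5) details}", "(-(in details)%^)", "((inch(es)))"]:
    print(f"{ngram!r} -> {core_term(ngram)!r}")
```

Because the sketch only compares the first and last characters, an input such as "(a) and (b)" would lose its outer parentheses even though they do not enclose the whole term; whether the original tool behaves the same way is not stated here.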