Lexical Tools

Normalize

Short Description: Normalize the input text in a non-canonical way.
Full Description: This process abstracts away from case, inflection, citation, and word order. It also removes stop words, possessives, and parenthetic plural forms; strips diacritics; splits ligatures; normalizes non-ASCII Unicode to ASCII; and replaces punctuation in the input term with spaces. Specifically, this normalization is roughly equivalent to the combined flow options, applied in this order: -f:q0:g:rs:o:t:l:B:Ct:q7:q8:w. That is, map non-ASCII Unicode symbols and punctuation to ASCII, remove genitives, remove parenthetic plural forms, replace punctuation with spaces, remove stop words, lowercase, uninflect each word, take the citation form of each uninflected word, apply Unicode core normalization, strip or map the remaining non-ASCII Unicode characters, and finally sort the words in alphabetical order.
This flow differs from the traditional norm flow (LuiNorm) in two respects. First, it retrieves all uninflected forms and then uses the citation form of each. Second, it does not map uninflected (citation) forms to canonical forms. This has several advantages: it does not require knowing the universe of words before indexing; it does not rely on an additional lookup in the canonical table; and it returns known ambiguity when appropriate.
This flow has no effect on the -m option; "none" is added at the end of the output.
Difference:
1. Differences arise from the differences in the individual flow components
2. Map non-ASCII symbols and punctuation to ASCII
3. Use Unicode core norm to convert non-ASCII Unicode to ASCII, which performs the following (2008):
  - Map Unicode symbols and punctuation to ASCII
  - Map Unicode to ASCII
  - Split ligatures
  - Strip diacritics
4. Add the feature of removing parenthetic plural forms before lowercasing (2007)
5. Add the feature of retrieving a citation form for each base form before sorting by word order
Features:
1. map Unicode symbols and punctuation to ASCII
2. remove genitives
3. remove parenthetic plural forms
4. replace punctuation with spaces
5. remove stop words
6. lowercase
7. uninflect each word
8. retrieve a citation form (first in alphabetical order) for each uninflected word
9. apply Unicode core norm to
  - map Unicode symbols and punctuation to ASCII
  - map Unicode to ASCII
  - split ligatures
  - strip diacritics
10. strip or map non-ASCII Unicode characters
11. sort words in alphabetical (ASCII) order
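
The lexicon-free steps in the feature list can be approximated in a few lines of Python. This is only a rough sketch, not the ToNormalize implementation: it skips the uninflection (B) and citation (Ct) steps because they need the lexicon, and the name rough_norm, the stop-word list, the punctuation map, and the regular expressions are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}

# Assumed tiny stand-in for q0: map a few Unicode quotes to ASCII.
ASCII_PUNCT = str.maketrans({"\u2018": "'", "\u2019": "'",
                             "\u201c": '"', "\u201d": '"'})


def strip_diacritics(word: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rough_norm(term: str) -> str:
    """Approximate the N flow without the lexicon-based steps B and Ct."""
    text = term.translate(ASCII_PUNCT)                       # q0 (partial)
    text = re.sub(r"'s\b", "", text)                         # g: genitives
    text = re.sub(r"\((?:s|es|ies)\)", "", text,
                  flags=re.IGNORECASE)                       # rs: (s), (es), (ies)
    text = re.sub(r"[^\w\s]", " ", text)                     # o: punctuation -> space
    words = [w for w in text.split()
             if w.lower() not in STOP_WORDS]                 # t: stop words
    words = [w.lower() for w in words]                       # l: lowercase
    # B and Ct are skipped here: they require lexicon lookups.
    words = [strip_diacritics(w) for w in words]             # q7 (partial)
    words = [w.encode("ascii", "ignore").decode("ascii")
             for w in words]                                 # q8: drop leftover non-ASCII
    return " ".join(sorted(w for w in words if w))           # w: sort words


print(rough_norm("Cancer, lung"))             # cancer lung
print(rough_norm("Paget's disease-scapula"))  # disease paget scapula
```

Because the sketch does not uninflect, an input such as "Hodgkin's diseases, NOS" comes out as "diseases hodgkin" rather than the "disease hodgkin" that the real flow returns.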
Symbol: N

Examples:

 
shell> lvg -f:N
left
left|left|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
left|leave|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Hodgkin's diseases, NOS
Hodgkin's diseases, NOS|disease hodgkin|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Down's Syndrome
Down's Syndrome|down syndrome|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Acetolyses
Acetolyses|acetolyse|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
Acetolyses|acetolysis|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Lung cancer
Lung cancer|cancer lung|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
Cancer, lung
Cancer, lung|cancer lung|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Paget's disease-scapula
Paget's disease-scapula|disease paget scapula|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
Scapula, Paget Disease
Scapula, Paget Disease|disease paget scapula|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Dysenterie amibienne (aiguë)
Dysenterie amibienne (aiguë)|aigue amibienne dysenterie|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

Abdomen CT Adrenal Mass(es) Bilateral
Abdomen CT Adrenal Mass(es) Bilateral|abdomen adrenal bilateral ct mas|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
Abdomen CT Adrenal Mass(es) Bilateral|abdomen adrenal bilateral ct mass|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

sequelae of; injury, nerve, roots and plexus(es), spinal
sequelae of; injury, nerve, roots and plexus(es), spinal|injury nerve plexus root sequela spinal|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
sequelae of; injury, nerve, roots and plexus(es), spinal|injury nerve plexus roots sequela spinal|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

proofread
proofread|proofread|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
proof-read
proof-read|proof read|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
proof read
proof read|proof read|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

ɑ-Tech™
ɑ-Tech™|alpha tech|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|

“Quote”
“Quote”|quote|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
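
The output lines above are pipe-delimited, so they are easy to post-process. The following Python sketch parses a few of the lines shown above, groups the normalized strings by input term, and builds a small index from normalized string to input terms. The field names in FIELDS and the helper names are assumptions chosen for readability; they are not taken from the tool.

```python
from collections import defaultdict

# Assumed field names for the six leading fields of an output line.
FIELDS = ("input", "output", "category", "inflection",
          "flow_history", "flow_number")

FLOW = "q0+g+rs+o+t+l+B+Ct+q7+q8+w"
SAMPLE = [
    "left",
    f"left|left|2047|1|{FLOW}|1|",
    f"left|leave|2047|1|{FLOW}|1|",
    "Lung cancer",
    f"Lung cancer|cancer lung|2047|1|{FLOW}|1|",
    "Cancer, lung",
    f"Cancer, lung|cancer lung|2047|1|{FLOW}|1|",
]


def parse_line(line):
    """Return a dict for an output line, or None for an echoed input line."""
    parts = line.strip().split("|")
    if len(parts) < len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    record["flow_history"] = record["flow_history"].split("+")
    return record


def group_outputs(lines):
    """Map each input term to the list of its normalized strings."""
    grouped = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            grouped[record["input"]].append(record["output"])
    return dict(grouped)


def build_index(lines):
    """Map each normalized string to the input terms that produced it."""
    index = defaultdict(list)
    for term, outputs in group_outputs(lines).items():
        for out in outputs:
            index[out].append(term)
    return dict(index)


print(group_outputs(SAMPLE)["left"])       # ['left', 'leave']
print(build_index(SAMPLE)["cancer lung"])  # ['Lung cancer', 'Cancer, lung']
```

The first lookup shows the ambiguity the flow preserves ("left" as itself or as a form of "leave"); the second shows two word-order variants collapsing to one key.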


Implementation Logic:
1. use flow component q0 to map non-ASCII Unicode symbols and punctuation to ASCII
2. use flow component g to remove genitives
3. use flow component rs to remove parenthetic plural forms
4. use flow component o to replace punctuation with spaces
5. use flow component t to strip stop words
6. use flow component l to lowercase all characters
7. use flow component B to uninflect words
8. use flow component Ct to retrieve a citation form (first in alphabetical order) for each uninflected word
9. use flow component q7 to normalize non-ASCII Unicode to ASCII
10. use flow component q8 to strip or map non-ASCII Unicode characters
11. use flow component w to sort words in ASCII order
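
As a cross-check on the step order, the component list can be written down as data and joined into the two string forms that appear on this page: the combined flow option from the full description and the flow history field of the example output. This is a small Python sketch; N_FLOW, flow_option, and flow_history are hypothetical names, and the page describes the combined option only as somewhat equivalent to N.

```python
# Flow components of N, in the order listed in the implementation logic.
N_FLOW = [
    ("q0", "map non-ASCII Unicode symbols and punctuation to ASCII"),
    ("g", "remove genitives"),
    ("rs", "remove parenthetic plural forms"),
    ("o", "replace punctuation with spaces"),
    ("t", "strip stop words"),
    ("l", "lowercase all characters"),
    ("B", "uninflect words"),
    ("Ct", "retrieve a citation form for each uninflected word"),
    ("q7", "normalize non-ASCII Unicode to ASCII"),
    ("q8", "strip or map non-ASCII Unicode characters"),
    ("w", "sort words in ASCII order"),
]


def flow_option(components):
    """Join component symbols into a combined -f: flow option."""
    return "-f:" + ":".join(symbol for symbol, _ in components)


def flow_history(components):
    """Join component symbols the way the flow history field shows them."""
    return "+".join(symbol for symbol, _ in components)


print(flow_option(N_FLOW))   # -f:q0:g:rs:o:t:l:B:Ct:q7:q8:w
print(flow_history(N_FLOW))  # q0+g+rs+o+t+l+B+Ct+q7+q8+w
```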
Source Code: ToNormalize.java
Hierarchy: Object -> Transformation -> ToNormalize