CSpell

Test Set from NER Collection

This page describes the process of generating a spelling correction test set from an NER (Named Entity Recognition) collection.

  • I. Source
    The original data from the NER collection are in the CHQA-NER-Corpus_1.0 directory. It includes 3098 files (2 + 1548 + 1548), as shown in the following table:

    Type           File Extension   No.
    Configuration  *.conf           2
    Text           *.txt            1548
    Annotation     *.ann            1548

  • II. Formats and Retrieved Data
    The test set is retrieved from the 1548 text files.
    • Data are retrieved only from the 1128 *.xml.txt files.
      The other 420 *.txt files are excluded (they are already annotated in the baseline gold standard).
    • The *.xml.txt files come from different sources, such as email and web inquiry, and data from different sources are stored in different formats. The table below describes the data retrieved from files of each format. In general, the format can be identified by the key pattern [XXX:] in the first line of the file; the same key pattern marks the fields on each line of the file. "http:" and "https:" are excluded from the key pattern.

      Key (first line): SUBJECT: (963 files)
      • Retrieved fields: SUBJECT:, MESSAGE:
      • Examples: 1-118268098.xml.txt, 1-118259395.xml.txt, 12626.xml.txt

      Key (first line): none (plain text) (144 files)
      • Retrieved fields: plain text (142), MESSAGE: (1), "Subject:" and "Message:" (1)
      • Examples: 11901.xml.txt, 13247.xml.txt, 1-118316905.xml.txt, 1-135889572.xml.txt, 1-123082816.xml.txt ("MESSAGE:"), 1-135050116.xml.txt ("Subject:" and "Message:")

      Key (first line): EMAIL: (14 files)
      • Retrieved fields: MESSAGE: (13), plain text (1)
      • Examples: 1-118275165.xml.txt, 1-120103542.xml.txt, 1-122955272.xml.txt, 1-123818745.xml.txt, 11433.xml.txt (plain text)

      Key (first line): Name: (6 files)
      • Retrieved field: "Message Body:"
      • Examples: 1-130899901.xml.txt, 1-131195919.xml.txt, 1-131297375.xml.txt, 1-131417291.xml.txt, 1-131503031.xml.txt, 1-132136861.xml.txt

      Key (first line): From: (1 file)
      • Retrieved field: "Subject:"
      • Example: 1-133488182.xml.txt
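The key-pattern detection described above can be sketched in Python as follows; the function name `detect_format` and the exact regular expression are illustrative assumptions, not the project's actual code.

```python
import re

# Matches a leading key such as "SUBJECT:" or "Name:", but not the
# URL schemes "http:" / "https:", which are excluded from key patterns.
KEY_PATTERN = re.compile(r"^(?!https?:)([A-Za-z ]+):")

def detect_format(first_line):
    """Return the key at the start of the first line, or None for plain text."""
    match = KEY_PATTERN.match(first_line.strip())
    return match.group(1) + ":" if match else None
```

A file whose first line starts with "SUBJECT:" would be routed to the SUBJECT/MESSAGE extractor, while a first line with no key (or one starting with "http:") would be treated as plain text.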

  • III. Retrieve Relevant Data
    Relevant data are retrieved and stored in ChrText.out in the following format:
    File Name    Text (retrieved data)
    • A period is added to the content of "SUBJECT:" or "Subject:" if no sentence-ending punctuation (.!?) is found.
    • Newlines are replaced with spaces in all content.
    • Content is trimmed (leading and trailing spaces are removed).
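The normalization steps above can be sketched as follows; `normalize_record` is a hypothetical helper name, not the actual implementation.

```python
import re

SENTENCE_END = (".", "!", "?")

def normalize_record(subject, message):
    """Build one ChrText.out text field from a subject and a message body."""
    subject = subject.strip()
    # Add a period if the subject lacks sentence-ending punctuation.
    if subject and not subject.endswith(SENTENCE_END):
        subject += "."
    combined = " ".join(part for part in (subject, message.strip()) if part)
    # Collapse newlines (and any whitespace runs) to single spaces, then trim.
    return re.sub(r"\s+", " ", combined).strip()
```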

  • IV. Frequency (word count)
    ChrText.out is used to calculate the word count (WC), which is saved in ChrText.wc.coreLc.out:
    • Each text is tokenized on whitespace ("\s+")
    • Each token is lowercased
    • The core term of each token is used (unnecessary leading or trailing punctuation is removed)
    • Each token is trimmed
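A minimal sketch of this word-count step, assuming a simple strip-punctuation approximation of the core-term logic (the real coreTerm rules may differ):

```python
import re
from collections import Counter

def core_term(token):
    """Approximate core term: strip leading/trailing punctuation."""
    return token.strip(".,;:!?()[]{}\"'")

def word_count(texts):
    """Lowercased core-term frequencies over whitespace-split tokens."""
    counts = Counter()
    for text in texts:
        for token in re.split(r"\s+", text.strip()):
            term = core_term(token.lower()).strip()
            if term:
                counts[term] += 1
    return counts
```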

  • V. Retrieve Candidates of Spelling Error Words
    Low-frequency and OOV (out-of-vocabulary) words are considered candidates for spelling errors. They are retrieved to errWordCandidates.out by the following algorithm:
    • CoreTerm is used (input is in the form of coreTerm)
    • Low frequency (WC <= 5)
    • OOV (not in the dictionary; Lexicon element words and numbers are used, giving slightly more coverage than the baseline dictionary)
      • handles possessives (e.g. wife's is converted to wife, then checked)
      • handles parenthetic plural forms (e.g. drug(s) is converted to drug, then checked)
      • handles multiple terms connected by a slash (e.g. CASE/TEST is split into CASE and TEST, and each is checked individually)
    • Not pure digits
    • Not pure punctuation
    • Not a combination of digits and punctuation (e.g. 123.50)
    • Not a measurement (e.g. 120mg/10Kg is excluded, but 120mg alone is not)
    • Not URL
    • Not Email
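The filtering rules above can be sketched as one predicate; the function name, regexes, and dictionary handling below are assumptions for illustration, not the project's actual code.

```python
import re

PURE_DIGIT  = re.compile(r"^\d+$")
PURE_PUNCT  = re.compile(r"^\W+$")
DIGIT_PUNCT = re.compile(r"^[\d\W]+$")            # digits + punctuation, e.g. 123.50
MEASUREMENT = re.compile(r"^\d+[A-Za-z]+(/\d+[A-Za-z]+)+$")  # e.g. 120mg/10Kg
URL   = re.compile(r"^(https?://|www\.)", re.IGNORECASE)
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_error_candidate(term, freq, dictionary, max_freq=5):
    """Rough filter: low-frequency OOV terms that are not numbers,
    punctuation, measurements, URLs, or email addresses."""
    if freq > max_freq:
        return False
    # Possessive and parenthetic-plural handling before the dictionary check.
    base = term[:-2] if term.endswith("'s") else term
    base = base[:-3] if base.endswith("(s)") else base
    # Slash-joined terms are checked individually (but not measurements).
    if "/" in base and not MEASUREMENT.match(base):
        parts = base.split("/")
    else:
        parts = [base]
    if all(p.lower() in dictionary for p in parts if p):
        return False                                # in vocabulary
    for pattern in (PURE_DIGIT, PURE_PUNCT, DIGIT_PUNCT, MEASUREMENT, URL, EMAIL):
        if pattern.match(term):
            return False
    return True
```

Under these assumptions, a rare misspelling like "diabetis" passes the filter, while "wife's", "drug(s)", "123.50", and "120mg/10Kg" are all rejected.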

  • VI. Generate NER Test Set (TestSetTextObj.java)
    • Inputs:
      • ChrText.out
      • errWordCandidates.out
      • lexNumDic.data
      • unit.data
      • maxErrNo (1000)

    • Algorithm:
      • Go through all files and count OOV_LWC (OOV with low word count) and OOV words
      • Sort by OOV_LWC, then OOV, then text
      • Print a file while the running total of OOV_LWC is less than maxErrNo
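The selection loop above can be sketched as follows. This is a Python sketch, not the actual TestSetTextObj.java code; the ascending sort order and the record layout (file name, OOV_LWC, OOV, text) are assumptions.

```python
def select_test_files(records, max_err_no=1000):
    """Select files for the test set.

    records: iterable of (file_name, oov_lwc, oov, text) tuples.
    Files are sorted by OOV_LWC, then OOV, then text, and emitted while
    the running total of OOV_LWC is still below max_err_no (so the final
    total may slightly exceed the limit, as in the reported stats).
    """
    selected, total = [], 0
    for name, oov_lwc, oov, text in sorted(
            records, key=lambda r: (r[1], r[2], r[3])):
        if total >= max_err_no:
            break
        total += oov_lwc
        selected.append((name, oov_lwc, oov, text))
    return selected
```

This check-before-add behavior would explain why the reported OOV_LWC total (1002) slightly exceeds maxErrNo (1000).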

    • Outputs:
      Generate the test set in three formats:
      • testSet.out (text format, used for tagging)
      • testSet.out.vtt (VTT format, provides visual tags to ease manual tagging)
        Read into VTT and then saved as a PDF file (testSetTag.pdf)
      • testSet.out.all (for all files)

      File Format:

      Source file name    OOV_LWC    OOV    Text

      Results Stats:

      • File No: 226
      • OOV_LWC No: 1002
      • OOV No: 1073

  • VII. Annotate NER Test Set (Brat)
    • Non-English customer queries are removed during Brat annotation:
      • 1-133262975.xml.txt: Spanish
      • 14030.txt: Spanish