Go through all lines in MRCONSO.RRF to generate each sClass (synonym class). An sClass is built in the following steps:
Descriptions | Output Logs
|
---|
Retrieve all English terms in the Lexicon with the same CUI
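
As a rough illustration of this step, the sketch below groups English MRCONSO strings by CUI and keeps only those found in a lexicon lookup. It relies on the standard MRCONSO.RRF column order (CUI first, LAT second, STR fifteenth). The `lexicon` dictionary, the lowercasing of terms, and the function name are assumptions made for this example, not the actual implementation.

```python
from collections import defaultdict


def group_terms_by_cui(mrconso_lines, lexicon):
    """Group English MRCONSO strings that occur in the lexicon by CUI.

    lexicon is a hypothetical dict: lowercased term -> [(category, EUI), ...].
    Returns {CUI: [(category, EUI, term), ...]} with duplicates removed.
    """
    classes = defaultdict(list)
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 15 or fields[1] != "ENG":
            continue  # skip malformed rows and non-English terms
        cui, term = fields[0], fields[14].lower()
        for category, eui in lexicon.get(term, []):
            record = (category, eui, term)
            if record not in classes[cui]:
                classes[cui].append(record)
    return dict(classes)
```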
| SynonymCan.data.1.all
|
Exclude terms with disallowed STI, such as Chemicals and Drugs
- CuiStiMap: use ./inData/MRSTY.RRF to map CUI to STI
- disallowedStiSet:
./inData/SemGroups.filter.txt specifies the disallowed STIs (tagged by linguists), such as those whose SemGroup is CHEM.
- disallowed: 2,867,695, allowed: 6,155,198
- Example-1: The following synonym class is removed because of a disallowed STI
#SYNONYM_CUI|C0000098|1-Methyl-4-phenylpyridinium
128|E0020400|cyperquat|
128|E0319735|mpp|
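
The STI filter above can be sketched as follows. This is a minimal sketch under stated assumptions: the STI is taken to be the second field (TUI) of each MRSTY.RRF row, a CUI is rejected when any of its STIs is disallowed, and the disallowed set is passed in directly because the layout of SemGroups.filter.txt is not described here. The function names are hypothetical.

```python
def load_cui_sti_map(mrsty_lines):
    """Build CuiStiMap: CUI -> set of STIs, from pipe-delimited MRSTY rows."""
    cui_sti = {}
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed rows
        cui_sti.setdefault(fields[0], set()).add(fields[1])
    return cui_sti


def is_allowed(cui, cui_sti, disallowed_sti_set):
    """Assumption: a CUI is disallowed if ANY of its STIs is disallowed."""
    return not (cui_sti.get(cui, set()) & disallowed_sti_set)
```

A caller would drop the whole synonym class for a CUI when `is_allowed` returns `False`.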
| SynonymCan.data.2.disallow
|
Exclude terms that are acronyms or abbreviations because they reduce precision too much.
- They have too many expansions; for example, "AA" has 39 expansions in the Lexicon.
- Preprocess:
shell> flds 1 LRABR | sort -u > LRABR.f1.uSort
- Use LRABR.f1.uSort to check whether a term is an abbreviation or acronym.
- AcrAbb: 26,596, NotAcrAbb: 426,665
- Example-2: lines with abbreviations are removed
128|E0006443|abdomen|
128|E0554771|abdominal|
128|E0689526|abd|
128|E0689531|abd|
1|E0006444|abdominal|
1|E0692924|abd|
- Example-3: After acad is removed, this synonym class has only one candidate, so the whole class is removed.
#SYNONYM_CUI|C0000876|Academies
128|E0006659|academy|
128|E0417973|acad|
128|E0722828|acad|
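
The abbreviation check can be sketched as below. It assumes that the first field of LRABR is the EUI of the abbreviation or acronym and that candidate records are `(category, EUI, term)` tuples parsed from lines like those in the examples. Both function names are hypothetical, and the EUIs treated as abbreviations in the usage lines are taken from Example-3 for illustration only.

```python
def load_abbreviation_euis(lrabr_lines):
    """Unique first field of each LRABR row (mirrors `flds 1 ... | sort -u`)."""
    return {line.split("|", 1)[0].strip() for line in lrabr_lines if line.strip()}


def remove_abbreviations(records, abb_euis):
    """Drop records whose EUI belongs to an abbreviation or acronym."""
    return [rec for rec in records if rec[1] not in abb_euis]


# Records from Example-3; the two "acad" EUIs are assumed to be in LRABR.
academies = [
    ("128", "E0006659", "academy"),
    ("128", "E0417973", "acad"),
    ("128", "E0722828", "acad"),
]
remaining = remove_abbreviations(academies, {"E0417973", "E0722828"})
```

Here `remaining` holds a single candidate, which is why the class in Example-3 is dropped.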
| SynonymCan.data.3.abb
|
Remove spVars (spelling variants) to reduce manual tagging effort.
- If a term has a synonym A, all spVars of that term are also synonyms of A.
- Do not add a term to the sClass if its EUI already exists in the sClass (spVars)
- Use EUI in inflVar.data
- Use any base form for terms that have spVars (same EUI).
- spVarNo: 274,469, number after removing spVars: 152,196
- spVars should be added back in the post-process
- Example-4: lines that are spVars are removed
#SYNONYM_CUI|C0000934|Acclimatization
128|E0006730|acclimation|
128|E0006731|acclimatisation|
128|E0006731|acclimatization|
128|E0007239|adaptation|
128|E0422110|adaption|
In the post-process, the deleted spVars will be added back in (if the tag of acclimatisation is [y]), so the record will become (assuming all tags are [y]):
#SYNONYM_CUI|C0000934|Acclimatization
128|E0006730|acclimation|
128|E0006731|acclimatisation|
128|E0006731|acclimatization|
128|E0007239|adaptation|
128|E0422110|adaption|
- Example-5: After the spVar is removed, this synonym class has only one candidate, so the whole class is removed.
#SYNONYM_CUI|C0000880|Acanthamoeba Keratitis
128|E0429790|acanthameba keratitis|
128|E0429790|acanthamoeba keratitis|
=> In the post-process, no synonyms will be generated for this sClass.
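
The spVar removal can be sketched as keeping one record per EUI. This is a minimal sketch, assuming records are `(category, EUI, term)` tuples and that the first record seen for an EUI is the one kept (the text only says any base form may be used). The function name is hypothetical.

```python
def remove_spvars(records):
    """Keep the first record for each EUI; later records with that EUI are spVars."""
    seen, kept = set(), []
    for category, eui, term in records:
        if eui in seen:
            continue  # same EUI already in the sClass: treat as a spVar
        seen.add(eui)
        kept.append((category, eui, term))
    return kept


# Records from Example-4.
acclimatization = [
    ("128", "E0006730", "acclimation"),
    ("128", "E0006731", "acclimatisation"),
    ("128", "E0006731", "acclimatization"),
    ("128", "E0007239", "adaptation"),
    ("128", "E0422110", "adaption"),
]
without_spvars = remove_spvars(acclimatization)
```

With this choice, `acclimatization` is dropped because it shares EUI E0006731 with `acclimatisation`.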
| SynonymCan.data.4.spVar
|
Remove nominalizations of a term.
- If a term has a synonym A, all nominalizations of that term are also synonyms of A.
- Sort sClass by CUI (key)
- Use nomMap: ./inData/LRNOM, key: EUI of noun, value is a set of EUIs of nominalizations (adj and verb).
- For implementation, keep the noun and remove its adj and verb nominalizations
- nomNo: 819, passNomNo: 151,377
- All nominalizations are synonyms (use LRNOM).
- Example-6: lines that are nominalizations of a noun are removed
#SYNONYM_CUI|C0001807|Aggressive behavior
128|E0007791|aggression|
128|E0007793|aggressiveness|
128|E0528674|aggressive|
1|E0007792|aggressive|
=> In the post-process, the nominalizations of all lines are added as follows:
#SYNONYM_CUI|C0001807|Aggressive behavior
128|E0007791|aggression|
128|E0007793|aggressiveness|
128|E0528674|aggressive|
1|E0007792|aggressive|
1024|E02212219|aggress|
1|E0007792|aggressive|
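
The nominalization removal can be sketched with the nomMap described above (key: EUI of a noun, value: set of EUIs of its adj and verb counterparts). This is a sketch under stated assumptions: the function name is hypothetical, and the `nom_map` entry in the usage lines is made up for illustration rather than read from LRNOM.

```python
def remove_nominalizations(records, nom_map):
    """Keep nouns; drop records whose EUI is the adj/verb counterpart of a
    noun that is already in the same sClass."""
    class_euis = {eui for _, eui, _ in records}
    to_remove = set()
    for eui in class_euis:
        to_remove |= nom_map.get(eui, set())
    return [rec for rec in records if rec[1] not in to_remove]


# Records from Example-6; the nom_map entry is a hypothetical illustration.
aggressive_behavior = [
    ("128", "E0007791", "aggression"),
    ("128", "E0007793", "aggressiveness"),
    ("128", "E0528674", "aggressive"),
    ("1", "E0007792", "aggressive"),
]
nom_map = {"E0007793": {"E0007792"}}
without_noms = remove_nominalizations(aggressive_behavior, nom_map)
```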
| SynonymCan.data.5.nom
|
Print each sClass with multiple candidates (an sClass must have more than one term)
- notMultiCanNo: 96,455, multiCanNo: 54,922
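
The final filter can be sketched as follows, assuming the classes are held in a dict from CUI to a list of `(category, EUI, term)` records and that "more than one term" means more than one remaining record. The function name is hypothetical.

```python
def keep_multi_candidate_classes(classes):
    """Return only the sClasses that still have more than one candidate."""
    return {cui: recs for cui, recs in classes.items() if len(recs) > 1}


# One class that survives and one that does not, based on the examples above.
classes = {
    "C0000934": [
        ("128", "E0006730", "acclimation"),
        ("128", "E0007239", "adaptation"),
    ],
    "C0000876": [("128", "E0006659", "academy")],
}
multi = keep_multi_candidate_classes(classes)
```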
| SynonymCan.data
| |