Validate and Fix LEXICON
shell> cp -p LEXICON ${LEXICON}/data/${YEAR}/data/LEXICON.mmddyy
shell> cd ${LEXICON}/data/${YEAR}/data
shell> ln -sf ./LEXICON.mmddyy LEXICON.freeze
shell> fgrep " " LEXICON.freeze | wc -l
=> The count should be 0; LexBuild removes extra spaces automatically.
If it is not 0, the data in LexBuild must be fixed as well.
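The check above can be wrapped in a small function so that a non-zero count is easy to spot in a script. This is a minimal sketch, not part of the Lexicon tools: the name `count_lines_with` is hypothetical, and the pattern is passed through unchanged as a fixed string.

```shell
# Hypothetical helper: print the number of lines in FILE that contain
# the fixed string PATTERN (same matching as fgrep, then wc -l).
count_lines_with() {
    pattern=$1
    file=$2
    # grep -F is the POSIX spelling of fgrep; -c prints the matching line count.
    # grep exits 1 when nothing matches, so "|| true" keeps "set -e" scripts alive.
    grep -F -c -- "$pattern" "$file" || true
}

# Example use (pattern copied from the step above):
#   n=$(count_lines_with " " LEXICON.freeze)
#   [ "$n" -eq 0 ] || echo "fix the data in LexBuild: $n lines"
```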
shell> ${LEXICON}/bin/1.FinalizeLexicon <year>
shell> mv ./LEXICON.release LEXICON.release.log.1.noAnno
shell> ln -sf ./LEXICON.release.log.1.noAnno LEXICON.release
shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
shell> fgrep "entry=" LEXICON.release > Euis
shell> cd ${LEX_CHECK_PROC}/data/GetFiles
shell> cp -p ${LEXICON}/data/${YEAR}/data/LEXICON.release.log.1.noAnno LEXICON.release.log.1.noAnno.${YEAR}
shell> cd ${LEX_CHECK_PROC}/bin
shell> GetFilesFromLexicon
2 (prepositions)
3 (particles)
12
13
=> Use ./LEXICON.release.2.fixContent for the next steps (if it is different from the input)
ln -sf ./LEXICON.release.log.2.3.contentFix LEXICON.release
Year | DupRec | N | C | Notes |
---|---|---|---|---|
2014 | 137 | 69 | 68 | Only multiword (137/1184) are tagged due to limited resource and due date. The rest (abbreviations or acronyms) are updated in the next release. |
2015 | 1183 | 1042 | 141 | Changes are updated in LB and fixed for next release |
2016 | 67 | 62 | 5 | Changes are updated in LB and fixed for next release |
2017 | 69 | 63 | 6 | Changes are updated in LB and fixed for next release |
2018 | 55 | 48 | 7 | Changes are updated in LB and fixed for next release |
2019 | 11 | 6 | 5 | Changes are updated in LB and fixed for next release |
2020 | 3 | 0 | 3 | Changes are updated in LB and this release |
2021 | 3 | 0 | 3 | Changes are updated in LB and this release |
2022 | 12 | 3 | 9 | Changes are updated in LB and this release |
2023 | 3 | 2 | 1 | Changes are updated in LB and this release |
2024 | 2 | 2 | 0 | Changes are updated in LB and this release |
2025 | 1 | 0 | 1 | Changes are updated in LB and this release |
2026 | 1 | 1 | 0 | Changes are updated in LB and this release |
shell> fgrep " no EUI (" log.2 > 2.4.03.noEui
Year | no EUI No. | notBaseForm No. |
---|---|---|
2017 | 22 | 4 |
2018 | 4 | 2 |
2019 | 63 | 0 |
2020 | 61 | 0 |
2021 | 34 | 0 |
2022 | 18 | 0 |
2023 | 0 | 0 |
2024 | 0 | 0 |
2025 | 0 | 0 |
2026 | 1 | 0 |
shell> fgrep " wrong citation (spVar) (" log.2 | fgrep -v " wrong citation (spVar), duplicates (" > 2.4.04.wrongCitSpVar
Year | wrong citation (spVar) No. |
---|---|
2017 | 71 |
2018 | 0 |
2019 | 59 |
2020 | 0 |
2021 | 1 |
2022 | 0 |
2023 | 0 |
2024 | 0 |
2025 | 0 |
2026 | 0 |
shell> fgrep " wrong citation (spVar), duplicates (" log.2 > 2.5.wrongCitSpVarDup
Year | wrong citation (spVar), duplicates No. |
---|---|
2017 | 12 |
2018 | 0 |
2019 | 2 |
2020 | 1 |
2021 | 6 |
2022 | 2 |
2023 | 9 |
2024 | 20 |
2025 | 11 |
2026 | 0 |
Steps 3, 4, and 5 are auto-fixed together when the validation program runs. Use LEXICON.release.3.fixCrossCheck as LEXICON.release (via a link) and rerun:
shell> cp -p ./LEXICON.release.3.fixCrossCheck Lexicon.release.3.fixCrossCheck.2.5.cit
shell> ln -sf ./LEXICON.release.log.${No}.fixCrossRed Lexicon.release
rerun 2.ValidateLexicon ${YEAR} > log.2
Check everything again after each rerun, because the auto-fix in one step can introduce new issues in another (for example, adding an EUI can create duplicates). Rerun this until no errors are found.
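One way to decide whether another rerun is needed is to test the log for the three message patterns used in the fgrep commands for steps 3 to 5. The sketch below assumes that a log free of these messages means those steps are clean; the function name `log_has_crosscheck_errors` is hypothetical.

```shell
# Hypothetical helper: exit status 0 if LOG still contains any of the
# step 3-5 messages, non-zero otherwise. Prints nothing (-q).
log_has_crosscheck_errors() {
    grep -F -q \
        -e " no EUI (" \
        -e " wrong citation (spVar) (" \
        -e " wrong citation (spVar), duplicates (" \
        -- "$1"
}

# Example use:
#   if log_has_crosscheck_errors log.2; then
#       echo "errors remain: relink LEXICON.release and rerun"
#   fi
```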
shell> fgrep "missing EUI (" log.2 > 2.6.missingEui
=> use LEXICON.release.3.fixCrossCheck and rerun
shell> cp -r LEXICON.release.3.fixCrossCheck Lexicon.release.log.${no}.missEuiFix
shell> ln -sf ./LEXICON.release.log.${no}.missEuiFix Lexicon.release
Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.misEuiFix (link to Lexicon.release) and rerun this step
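The copy-then-link pattern recurs in several fix steps, so it can be sketched as one helper. This is an illustration only: the function name `snapshot_and_link` and its argument order are hypothetical, and it assumes all three names are plain file names in the current directory.

```shell
# Hypothetical helper: keep a copy of SRC as SNAPSHOT (preserving
# timestamps and mode), then point the symbolic link LINK at the copy.
snapshot_and_link() {
    src=$1
    snapshot=$2
    link=$3
    cp -p "$src" "$snapshot" || return 1
    # -f replaces an existing link, matching the "ln -sf" used in the steps.
    ln -sf "./$snapshot" "$link"
}

# Example use (file names follow the step above; ${no} is the run number):
#   snapshot_and_link LEXICON.release.3.fixCrossCheck \
#       "Lexicon.release.log.${no}.missEuiFix" Lexicon.release
```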
shell> fgrep "wrong EUI" log.2 > 2.4.7.wrongEui.nom
shell> cp -p LEXICON.release.3.fixCrossCheck Lexicon.release.log.${No}.wrongEuiFix
shell> ln -sf ./LEXICON.release.log.${No}.wrongEuiFix Lexicon.release
nominalization and nominalization_of.
shell> fgrep " symmetric none @ [" log.2 > 2.12.symNone
shell> fgrep " new EUI (" log.2 > 2.4.13.fixCrossRef-newEui
shell> fgrep "nominalizations - new EUI (" log.2 > 2.13.newEui.nom
shell> fgrep "acronyms - new EUI (" log.2 > 2.13.newEui.acr
shell> fgrep "abbreviations - new EUI (" log.2 > 2.13.newEui.abb
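The per-message fgrep commands above each produce one file; for the yearly count tables it can be handy to print all counts in one pass. A minimal sketch, assuming the counts of interest are simply the number of matching lines in log.2 (the function name `tally_log` is hypothetical):

```shell
# Hypothetical helper: for each fixed-string pattern given after LOG,
# print "<count><TAB><pattern>" with the number of matching lines.
tally_log() {
    log=$1
    shift
    for pattern in "$@"; do
        # grep -c prints 0 and exits 1 on no match; "|| true" ignores that status.
        count=$(grep -F -c -- "$pattern" "$log" || true)
        printf '%s\t%s\n' "$count" "$pattern"
    done
}

# Example use (patterns copied from the commands above):
#   tally_log log.2 " no EUI (" "missing EUI (" "wrong EUI" \
#       " symmetric none @ [" " new EUI ("
```

Note that a line matching a longer pattern such as "nominalizations - new EUI (" also matches " new EUI (", so the counts overlap in the same way the fgrep output files do.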
Post-Procedures:
(This is the post-process that needs to be done for the current release, before the next release.)
Ideally, LEXICON.release should be identical to LEXICON.release.3.fixCrossCheck
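A byte-for-byte comparison is one way to confirm this. The sketch below uses `cmp -s`; the function name `same_file_content` is hypothetical.

```shell
# Hypothetical helper: report whether two files are byte-for-byte identical.
# Prints "identical" (exit 0) or "different" (exit 1).
same_file_content() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# Example use:
#   same_file_content LEXICON.release LEXICON.release.3.fixCrossCheck
```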
> non-ascii char|U+value|EUI1|tag
action: check to replace non-ASCII with ASCII char
tag
Name | Letter 1 | Letter 2 (Illegal non-ASCII) | Notes |
---|---|---|---|
apostrophe | [']-(APOSTROPHE, U+0027) | [‘]-(LEFT SINGLE QUOTATION MARK, U+2018) | Replace illegal non-ASCII |
apostrophe | [']-(APOSTROPHE, U+0027) | [’]-(RIGHT SINGLE QUOTATION MARK, U+2019) | => accepted after 2021+ release |
apostrophe | [']-(APOSTROPHE, U+0027) | [ʼ]-(MODIFIER LETTER APOSTROPHE, U+02BC) | |
hyphen | [-]-(HYPHEN-MINUS, U+002D) | [‑]-(NON-BREAKING HYPHEN, U+2011) | Replace illegal non-ASCII => accepted after 2021+ release |
hyphen | [-]-(HYPHEN-MINUS, U+002D) | [–]-(EN DASH, U+2013) | |
beta | [β]-(GREEK SMALL LETTER BETA, U+03B2) | [ß]-(LATIN SMALL LETTER SHARP S, U+00DF) | Replace illegal non-ASCII |
mu/micro | [μ]-(GREEK SMALL LETTER MU, U+03BC) | [µ]-(MICRO SIGN, U+00B5) | Both could be legal. Check the records to make sure the right chars are used. |
Y/UPSILON | [Y]-(LATIN CAPITAL LETTER Y, U+0059) | [Υ]-(GREEK CAPITAL LETTER UPSILON, U+03A5) | Both could be legal. Check the records to make sure the right chars are used. |
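To support the manual check, lines containing non-ASCII bytes can be listed with grep, and a single replacement can be applied with sed. This is a sketch under stated assumptions: both function names are hypothetical, the file is assumed to be UTF-8, and only the U+2018 to U+0027 replacement from the table is shown; every hit still needs review, since some non-ASCII characters in the table are legal.

```shell
# Hypothetical helper: list "lineno:line" for every line of FILE that
# contains a byte outside printable ASCII and ASCII whitespace.
list_non_ascii_lines() {
    # In the C locale each byte is one character, so any byte >= 0x80 matches.
    # -a forces text mode so grep does not report "Binary file matches".
    LC_ALL=C grep -n -a '[^[:print:][:space:]]' "$1" || true
}

# Hypothetical helper: replace LEFT SINGLE QUOTATION MARK (U+2018,
# UTF-8 bytes E2 80 98) with APOSTROPHE (U+0027); stdin to stdout.
replace_left_quote() {
    lsq=$(printf '\342\200\230')
    LC_ALL=C sed -e "s/$lsq/'/g"
}

# Example use:
#   list_non_ascii_lines LEXICON.release
#   replace_left_quote < LEXICON.release > LEXICON.release.ascii
```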
shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
Completed: Clean up files and logs: move all logs and files to ./${year}