Because of a lapse in government funding, the information on this website may not be up to date, transactions submitted via the website may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted. The NIH Clinical Center (the research hospital of NIH) is open. For more details about its operating status, please visit cc.nih.gov. Updates regarding government operating status and resumption of normal operations can be found at OPM.gov.

The SPECIALIST Lexicon

Spelling Variant Patterns - Test on Lexicon-LRSPL

Norm, MES, and ES are used in a sequential order to retrieve the most spelling variant groups. This model is tested on Lexicon (inflVars.data) for the recall, precisino, F1, and accuracy. The results are shown as follows:

  • Results:

    2015 (Used in AMIA paper submission)

    StepMethodsEdit DistanceSample No.ret-relret-irrelnotRet-relnotRet-irrelPrecisionRecallF1Accuracy
    0Lexicon.2015N/A867,728363,21700504,5111.00001.00001.00001.0000
    1NormN/A867,728306,38719,37456,830485,1370.94050.84350.88940.9122
    2MES2867,728355,423173,6477,794330,8640.67180.97850.79670.7909
    3ES1867,728360,599286,9322,618217,5790.55690.99280.71350.6663
    4MES3867,728360,956301,0972,261203,4140.54520.99380.70410.6504
    5ES2867,728362,082353,5121,135150,9990.50600.99690.67130.5913
    6MES4867,728362,159356,1561,058148,3550.50420.99710.66970.5883

  • Discussion:

    Step 6 is the final results we use for the matcher. Use it as example for calculation check:

    Check ItemCheck numbers
    Total sample no867,728 = 362,159 + 356,156 + 1,058 + 148,355
    Precision 0.5042 = 362,159 / (362,159 + 356,156)
    Recall0.9971 = 362,159 / (362,159 + 1,058)
    F10.6697 = (2 * 0.5042 * 0.9971) / (0.5042 + 0.9971)
    Accuracy0.5883 = (362,159 + 148,355) / 867,728

  • Discussion:
    • The recall reaches 99.71% while precision, F1, and accuracy are relatively low. Also, the performance is very low. Thus, we have to reduce the size of n-gram by increasing the WC from 30 to 150. Even so, the entire process took more than 14 days to run. Thus, a better model with improve performnace, precision, F1, and accuracy should be developed.

  • Enhancement:
    StepMethodsEdit DistanceSample No.ret-relret-irrelnotRet-relnotRet-irrelPrecisionRecallF1Accuracy
    Baseline
    (step-2 from above)
    MES2867,728355,423173,6477,794330,8640.67180.97850.79670.7909
    1Double Metaphone (10)2867,728356,375178,6986,842325,8130.66600.98120.79350.7862
    2
    • Metaphone (10)
    • Caverphone
    2867,728354,790151,0288,427353,4830.70140.97680.81650.8162
    3
    • Metaphone (60)
    • Caverphone
    2867,728352,911115,53110,306388,9800.75340.97160.84870.8550
    Enhanced SpVarNorm
    Baseline
    (step-1 from above)
    NormN/A867,728306,38719,37456,830485,1370.94050.84350.88940.9122
    New BaslineNormN/A867,728304,8313,97358,386500,5380.98710.83930.90720.9281
    4
    • Metaphone (60)
    • Caverphone
    2867,728352,826114,27110,391390,2400.75540.97140.84990.8563
    5
    • Metaphone (60)
    • Caverphone
    • GrecoLatin
    2867,728352,675105,62310,542398,8880.76950.97100.85860.8661
    New GoldStandard - with inflectional Spelling Variants
    6.0Norm2867,728305,3293,47574,447484,4770.98870.80400.88680.9102
    6.1??
    • Metaphone (60)
    • Caverphone
    • GrecoLatin
    2867,728369,20097,89710,576390,0550.79040.97220.87190.8750
    6.1
    • Metaphone (60)
    • Caverphone
    • Greco-Latin
    1867,728369,04989,24910,727398,7030.80530.97180.88070.8845
    6.2
    • Metaphone (60)
    • Caverphone
    • Greco-Latin
    2867,728369,04989,24910,727398,7030.80530.97180.88070.8845

  • Enhancement logs:

    Tried:

    • LVG Metaphone (1.0)
    • Double Metaphone (2.0)
    • Soundex
    • Refined Soundex
    • Caverphone 1.0
    • Caverphone 2.0
    • Cologne Phonetic
    • Step 1: Use double metaphone to increase recall on MES:
      ExampleTermMetaphone 1Metaphone 2Notes
      1meagrenessMKRNSMKRNS
      • increase TP to increase recall
      meagernessMJRNSMKRNS
      2abkhasianABKHXNAPKSN
      • increase TP to increase recall
      abkhazianABKHSNAPKSN
      3toxic edemaTKSSTMTKSKTM
      • increase TP to increase recall
      toxic oedemaTKSKTMTKSKTM

    • Step 2: Add Caverphone to increase precision on MES
      ExampleTermMetaphone 1Metaphone 2Caverphone 2.0Notes
      1zymographicalSMKRFKLSMKRFKLSMKRFKA111
      • reduce FP to increase precision
      zymographicallySMKRFKLSMKRFKLSMKRFKLA11
      2absorption testABSRPXNTSTAPSRPXNTSTAPSPSNTST1
      • reduce FP to increase precision
      absorption testsABSRPXNTSTAPSRPXNTSTAPSPSNTSTS
      3bacterial culture mediaBKTRLKLTRMPKTRLKLTRMPKTRKTRMTA
      • reduce FP to increase precision
      bacterial culture mediumBKTRLKLTRMPKTRLKLTRMPKTRKTRMTM

    • Step 3: increase maxLength in metaphone to 60 to increase precision for long words (the max length of LexItem Metaphone is 54 while the amx length of n-grams is 50) on MES

      ExampleTermMetaphone (10)Metaphone (60)Notes
      12-item patient health questionnairTMPTNTL0KSITMPTN0L0KSXNR
      • reduce FP to increase precision
      2-item patient health questionnairesTMPTNTL0KSTMPTNTL0KSXNRS
      2bacterial culture mediaPKTRLKLTRMPKTRLKLTRMT
      • reduce FP to increase precision
      bacterial culture mediumPKTRLKLTRMPKTRLKLTRMTM

    • Step 4: for noun, if it is metareg, it generates x -> x's as it's plural form. Which should not be normalized on SpVarNorm

      ExampleSingularPluralNotes
      1aanaan's
      • reduce FP to increase precision
      2dcmp deaminasedcmp deaminase's
      • reduce FP to increase precision

    • Step 5: for noun, if it matches GrecoLatin singular-plural pattern, then do not add to spVar (exclude them out).

      ExampleTermMetaphone (60)Caverphone 2.0Greco-LatinNotes
      1acrosclerosesAKRSKRSSAKRSKLRSS1singular
      • reduce FP to increase precision
      acrosclerosisAKRSKRSSAKRSKLRSS1plural
      2ammon's horn sclerosesAMNSRNSKRSSAMNSNSKLRSsingular
      • reduce FP to increase precision
      ammon's horn sclerosisAMNSRNSKRSSAMNSNSKLRSplural
      3fimbriaFMPRFMPRA11111singular
      • reduce FP to increase precision
      fimbriaeFMPRFMPRA11111plural
      4infraorbital foramenANFRRPTLFRMNANFRPTFRMNsingular
      • reduce FP to increase precision
      infraorbital foraminaANFRRPTLFRMNANFRPTFRMNplural

    • Step 6: if leading characters are number, must be the same
      • 2-item patient health questionnair -> TMPTNTL0KSXNR
      • 4-item patient health questionnair -> TMPTNTL0KSXNR
      • 15-item patient health questionnair -> TMPTNTL0KSXNR

  • Step 6: TBD

    ExampleTermMetaphone (60)Caverphone 2.0Greco-LatinNotes
    1zygomycetesSKMSTSSKMSTS1111?
    • TBD?
    zygomycetousSKMSTSSKMSTS1111?

  • Step 7: Lexicon error (can be used in LexBuild for approximate match)

    ExampleTermMetaphone (60)Caverphone 2.0Greco-LatinNotes
    1zygapophyseal jointTBDTBD?
    • TBD?
    zygapophysial jointTBDTBD?
    2zuclomifeneTBDTBD?
    • TBD?
    zuclomipheneTBDTBD?