The SPECIALIST Lexicon

Test on Lexicon: for AMIA Initial Submission

Norm, MES, and ES are applied sequentially to retrieve as many spelling variant groups as possible. This model is tested on the Lexicon (inflVars.data) and LRSPL for recall, precision, F1, and accuracy. The details are as follows:

  • Setup:
    • Test name: LexTest.Amia.1.Init
    • Input File: (Lexicon.2015)
      • inflVars.data
      • LRSPL
    • Software:
      • GoldStd: GetGoldStdFromLex.java.1.AmiaInit
      • Norm: SpVarNorm.java.1.AmiaInit
      • MES: GroupSpVarByMES.java
      • ES: GroupSpVarByES.java

  • Results:

    2015 (Used in AMIA paper initial submission)

    Step | Methods | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1     | Accuracy | Notes
    0    | GoldStd | N/A           | 867,728    | 363,217 | 0         | 0          | 504,511      | 1.0000    | 1.0000 | 1.0000 | 1.0000   | 1 min.
    1    | Norm    | N/A           | 867,728    | 306,387 | 19,374    | 56,830     | 485,137      | 0.9405    | 0.8435 | 0.8894 | 0.9122   | 2 min.
    2    | MES     | 2             | 867,728    | 355,423 | 173,647   | 7,794      | 330,864      | 0.6718    | 0.9785 | 0.7967 | 0.7909   | 6 hr.
    3    | ES      | 1             | 867,728    | 360,599 | 286,932   | 2,618      | 217,579      | 0.5569    | 0.9928 | 0.7135 | 0.6663   | 24 hr.
    4    | MES     | 3             | 867,728    | 360,956 | 301,097   | 2,261      | 203,414      | 0.5452    | 0.9938 | 0.7041 | 0.6504   | 8 min.
    5    | ES      | 2             | 867,728    | 362,082 | 353,512   | 1,135      | 150,999      | 0.5060    | 0.9969 | 0.6713 | 0.5913   | 27 hr.
    6    | MES     | 4             | 867,728    | 362,159 | 356,156   | 1,058      | 148,355      | 0.5042    | 0.9971 | 0.6697 | 0.5883   | 2 min.

  • Discussion:
    • Step 6 gives the final results we use for the matcher. We use it as an example to check the calculations:

      Check Item       | Check numbers
      Total sample no. | 867,728 = 362,159 + 356,156 + 1,058 + 148,355
      Precision        | 0.5042 = 362,159 / (362,159 + 356,156)
      Recall           | 0.9971 = 362,159 / (362,159 + 1,058)
      F1               | 0.6697 = (2 * 0.5042 * 0.9971) / (0.5042 + 0.9971)
      Accuracy         | 0.5883 = (362,159 + 148,355) / 867,728

    • The recall reaches 99.71%, while precision, F1, and accuracy are relatively low. Run-time performance is also very poor: we had to reduce the size of the n-gram by increasing the WC from 30 to 150, and even so, the entire process took more than 14 days to run. Thus, a better model with improved run-time performance, precision, F1, and accuracy should be developed.
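
The Step 6 calculation check above can also be reproduced programmatically. The following is a minimal sketch, not part of the project's software: the class name MetricsCheck and its method names are hypothetical, and the four counts are copied from the Step 6 row (ret-rel, ret-irrel, notRet-rel, notRet-irrel). It uses the standard definitions of precision, recall, F1, and accuracy that the check follows.

```java
import java.util.Locale;

// Hypothetical helper (not from the Lexicon tools) that recomputes the
// four metrics from the retrieved/relevant counts of one table row.
public class MetricsCheck {

    // Precision = ret-rel / (ret-rel + ret-irrel)
    public static double precision(long retRel, long retIrrel) {
        return (double) retRel / (retRel + retIrrel);
    }

    // Recall = ret-rel / (ret-rel + notRet-rel)
    public static double recall(long retRel, long notRetRel) {
        return (double) retRel / (retRel + notRetRel);
    }

    // F1 = harmonic mean of precision and recall
    public static double f1(double p, double r) {
        return 2.0 * p * r / (p + r);
    }

    // Accuracy = (ret-rel + notRet-irrel) / total sample number
    public static double accuracy(long retRel, long retIrrel,
                                  long notRetRel, long notRetIrrel) {
        long total = retRel + retIrrel + notRetRel + notRetIrrel;
        return (double) (retRel + notRetIrrel) / total;
    }

    public static void main(String[] args) {
        // Counts from the Step 6 (MES, edit distance 4) row.
        long retRel = 362_159L;
        long retIrrel = 356_156L;
        long notRetRel = 1_058L;
        long notRetIrrel = 148_355L;

        double p = precision(retRel, retIrrel);
        double r = recall(retRel, notRetRel);

        System.out.printf(Locale.ROOT, "Total:     %d%n",
                retRel + retIrrel + notRetRel + notRetIrrel);
        System.out.printf(Locale.ROOT, "Precision: %.4f%n", p);
        System.out.printf(Locale.ROOT, "Recall:    %.4f%n", r);
        System.out.printf(Locale.ROOT, "F1:        %.4f%n", f1(p, r));
        System.out.printf(Locale.ROOT, "Accuracy:  %.4f%n",
                accuracy(retRel, retIrrel, notRetRel, notRetIrrel));
    }
}
```

Note that this sketch computes F1 from the unrounded precision and recall, whereas the check above uses the values rounded to four decimals; for Step 6 both approaches agree to four decimal places. Substituting the counts from any other row checks that row the same way.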