The SPECIALIST Lexicon

Distilled Process Walk Through, 2016

This page describes the filters applied to the MEDLINE n-gram set to generate the distilled MEDLINE n-gram set and the candidate multiword list:

ID/Program | In No. | Filtered No. (%) | Out No. | Pass Rate | Acc. Pass Rate | Filter example and notes
Generate the MEDLINE n-gram set
Generate MEDLINE n-gram set
In No.: 4,388,787,832
  • N=1: 24,121,470
  • N=2: 229,691,126
  • N=3: 788,417,523
  • N=4: 1,460,588,176
  • N=5: 1,885,969,537
Filtered No.: 4,369,462,494 (99.56%)
  • N=1: 23,238,183 (96.34%)
  • N=2: 224,576,579 (97.77%)
  • N=3: 781,282,716 (99.10%)
  • N=4: 1,456,207,702 (99.70%)
  • N=5: 1,884,157,314 (99.90%)
Out No.: 19,325,338
  • N=1: 883,287
  • N=2: 5,114,547
  • N=3: 7,134,807
  • N=4: 4,380,474
  • N=5: 1,812,223
Pass Rate: 0.4403%; Acc. Pass Rate: N/A
Notes: From MEDLINE TI & AB to the MEDLINE n-gram set
  • filter out n-grams with length > 50
  • filter out n-grams with word count < 30

  • Calculated in Excel (In and Out No. entered manually)
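The two rate columns can be reproduced from the In and Out counts. The following is a minimal sketch, assuming the pass rate of a step is Out/In and the accumulated pass rate is Out divided by the 19,325,338 n-grams in the generated MEDLINE n-gram set (this assumption matches the ID-12 row). The function names are illustrative and not part of the original tooling.

```python
# Size of the generated MEDLINE n-gram set (Out No. of the first row).
BASELINE = 19_325_338

def pass_rate(in_no: int, out_no: int) -> float:
    """Percentage of n-grams that survive one filter step."""
    return 100.0 * out_no / in_no

def acc_pass_rate(out_no: int, baseline: int = BASELINE) -> float:
    """Percentage of the baseline n-gram set remaining after this step."""
    return 100.0 * out_no / baseline

# ID-12 row: In 19,324,906, Out 19,192,256
print(f"{pass_rate(19_324_906, 19_192_256):.4f}%")  # 99.3136%
print(f"{acc_pass_rate(19_192_256):.4f}%")          # 99.3114%
```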
Basic operation: Sort nGrams by DC|WC|Terms
ID-01
  • NGramFilter: SortNGramByDcWcTerm
  • Param: 1, 01
  • Run Time: 1 Min.
In No.: 19,325,338; Filtered No.: 0; Out No.: 19,325,338; Pass Rate: 100.0000%; Acc. Pass Rate: 100.0000%
  • Create link: ./05.ApplyFilters/nGram.${YEAR}
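The example lines in this table use the layout DC|WC|Term. The sketch below shows one way to parse and sort such lines, assuming DC is the document count, WC is the word count, and the sort is descending by DC, then descending by WC, then by term. The sort direction and the names `NGram`, `parse_line`, and `sort_ngrams` are assumptions, not taken from SortNGramByDcWcTerm.

```python
from typing import NamedTuple

class NGram(NamedTuple):
    dc: int    # document count (assumed meaning of DC)
    wc: int    # word count
    term: str

def parse_line(line: str) -> NGram:
    # Split on the first two pipes only: the term itself may contain "|".
    dc, wc, term = line.rstrip("\n").split("|", 2)
    return NGram(int(dc), int(wc), term)

def sort_ngrams(ngrams):
    # Assumed order: highest DC first, then highest WC, then term A-Z.
    return sorted(ngrams, key=lambda g: (-g.dc, -g.wc, g.term))

lines = ["40|44|Ag|AgCl", "216|357||", "58|75|(|r|"]
for g in sort_ngrams(parse_line(l) for l in lines):
    print(g)
```

Limiting the split to two separators matters because some terms, such as `Ag|AgCl`, contain the field separator themselves.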
Apply General Exclusive Filters
ID-10
In No.: 19,325,338; Filtered No.: 7; Out No.: 19,325,331; Pass Rate: 100.0000%; Acc. Pass Rate: 100.0000%
  • 216|357||
  • 58|75|(|r|
  • 50|85|||
  • 40|44|Ag|AgCl
  • 34|47||D|
  • 27|41||E|
  • 16|38|lambda(||)
ID-11
In No.: 19,325,331; Filtered No.: 425; Out No.: 19,324,906; Pass Rate: 99.9978%; Acc. Pass Rate: 99.9978%
  • 1468508|4379867|=
  • 830526|1804458|<
  • 679245|2645584|+/-
  • 275852|455400|>
  • 206249|327350|-

  • 11339|25445|-->
  • 8168|15079|(+)
  • 4721|5819|(%)
  • 97|136|"+"
  • 73|80|((-/-))

  • 30|70|==>
  • 39|48|[...]
  • 10|33|*}
  • 7|35|*//
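Every ID-11 example above consists of punctuation only. A minimal sketch of such a check follows, assuming ASCII punctuation as defined by Python's `string.punctuation`. The function name is hypothetical and the actual filter may use a different character set.

```python
import string

def is_punctuation_only(term: str) -> bool:
    """True if the term contains only ASCII punctuation (spaces ignored)."""
    stripped = term.replace(" ", "")
    # An empty string is not treated as a punctuation term.
    return bool(stripped) and all(ch in string.punctuation for ch in stripped)

for term in ["+/-", "[...]", "((-/-))", "95%"]:
    print(term, is_punctuation_only(term))
```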
ID-12
  • Filter: Digit
  • InTerm: core-term.lc
  • Param: 2, 12
  • Run time: 2 Min (norm - strip punc and space)
In No.: 19,324,906; Filtered No.: 132,650; Out No.: 19,192,256; Pass Rate: 99.3136%; Acc. Pass Rate: 99.3114%
  • 1573415|2318062|2
  • 1557465|2348937|1
  • 1249600|1739034|3
  • 980112|1258360|10
  • 952500|1257959|4

  • 298055|643298|95%
  • 202590|239870|2,
  • 96654|114627|2000
  • 13712|15014|3-5
  • 1657|1714|+/-0.5
  • 71|77|(+/-0.05)
  • 56|58|$1,500
  • 31|40|"3 + 1"
  • 3|32|55834

  • 192.168.1.1
  • [192, 168]
  • (+15%),
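The ID-12 note "norm - strip punc and space" suggests the term is normalized before the digit test. The sketch below illustrates that idea, assuming ASCII punctuation and whitespace are stripped and the remainder must consist of digits only. It is an approximation for illustration, not the project's Digit filter.

```python
import string

# Translation table that deletes ASCII punctuation and whitespace.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)

def is_digit_term(term: str) -> bool:
    """True if only digits remain after stripping punctuation and spaces."""
    core = term.translate(_STRIP)
    return core.isdigit()  # False for an empty core

for term in ["95%", "$1,500", '"3 + 1"', "192.168.1.1", "24 h"]:
    print(term, is_digit_term(term))
```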
ID-13
  • Filter: Number
  • InTerm: core-term.lc
  • Create link: ./inData/NRVAR
  • Param: 2, 13
  • Run time: 2 Min
In No.: 19,192,256; Filtered No.: 4,326; Out No.: 19,187,930; Pass Rate: 99.9775%; Acc. Pass Rate: 99.2890%
  • 17110403|101408848|and
  • 2755149|3759973|two
  • 2009841|2536674|one
  • 1509215|1868858|first
  • 1499457|1945263|three

  • 20573|23342|first and second
  • 20272|21359|one third
  • 3383|3507|twenty-eight
  • 81|81|NINE
  • 40|42|zeroth and
  • 36|36|Four hundred and forty-seven

  • 27|34|zero-one
  • 24|32|'half'
  • 21|32|One"
ID-14
In No.: 19,187,930; Filtered No.: 157,786; Out No.: 19,030,144; Pass Rate: 99.1777%; Acc. Pass Rate: 98.4725%
  • 11350676|26564037|of the
  • 9309757|18668800|in the
  • 4600972|6314812|to the
  • 4541887|5954054|and the
  • 3553488|4655141|on the

  • 1128696|1314980|In the
  • 477963|547704|and/or
  • 86317|90215|50% of
  • 12642|14399|1, 2, and
  • 11527|12363|2003 to
  • 506|548|2003 to 2007
  • 29|43|for >=50%
  • 17|41|the 8:2
  • 13|42|-196 to -174

  • 12|32|OR-462
  • 9|45|AND-34
  • 8|44|IN-1130
  • 8|31|And-1
Apply Exclusive Filters - pattern
ID-20
In No.: 19,030,144; Filtered No.: 197,022; Out No.: 18,833,122; Pass Rate: 98.9647%; Acc. Pass Rate: 97.4530%
  • 43723|43924|tomography (CT)
  • 41512|41818|imaging (MRI)
  • 41087|41387|resonance imaging (MRI)
  • 38888|39331|oxide (NO)
  • 36558|36838|reaction (PCR)

  • 36441|36720|chain reaction (PCR)
  • 33100|33333|polymerase chain reaction (PCR)
  • 34608|34807|magnetic resonance imaging (MRI)
  • 32581|32691|computed tomography (CT)
  • 18344|18650|enzyme-linked immunosorbent assay (ELISA)
  • 11876|11950|single nucleotide polymorphisms (SNPs)
  • 7501|7512|magnetic resonance (MR) imaging

  • 57|57|"Standards, Options and Recommendations" (SOR)
  • 58|58|(CREB)-binding protein (CBP)

  • 24|31|kinase (ASK)
  • 24|30|proline-rich polypeptide (PRP)
  • 24|30|semi-permeable membrane devices (SPMDs)
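The ID-20 examples all contain a parenthesized acronym such as (CT) or (SNPs). The sketch below is a rough approximation of that pattern, assuming a match on a single parenthesized token that contains at least one uppercase letter. The actual rule is evidently stricter, since terms such as "(HPV) in" survive until ID-34. The regular expression and function name are hypothetical.

```python
import re

# One parenthesized token made of letters, digits, or hyphens, e.g. "(SNPs)".
_PAREN_TOKEN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]*)\)")

def has_parenthetic_acronym(term: str) -> bool:
    """True if the term contains a parenthesized token with an uppercase letter."""
    return any(
        any(ch.isupper() for ch in m.group(1))
        for m in _PAREN_TOKEN.finditer(term)
    )

for term in ["tomography (CT)", "magnetic resonance (MR) imaging", "(P < 0.05)"]:
    print(term, has_parenthetic_acronym(term))
```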
ID-21
In No.: 18,833,122; Filtered No.: 344,403; Out No.: 18,488,719; Pass Rate: 98.1713%; Acc. Pass Rate: 95.6709%
  • 645982|727397|a significant
  • 463569|531423|a single
  • 380448|415096|a high
  • 357730|410233|a novel
  • 308795|335142|a case

  • 137642|142602|a very
  • 128611|141101|a group
  • 85739|93650|a dose-dependent
  • 45415|45699|A series
  • 27489|35407|A and B
  • 23369|28504|a meta-analysis
  • 19|31|a SIF
  • 17|37|A alpha C
  • 17|30|A nonseminomatous

  • 9|41|a delivery rate per
  • 8|30|A beta 2m
  • 6|33|a beta ab
ID-22
In No.: 18,488,719; Filtered No.: 113,936; Out No.: 18,374,783; Pass Rate: 99.3838%; Acc. Pass Rate: 95.0813%
  • 555284|2556344|RESULTS:
  • 2018160|2018508|METHODS:
  • 1485141|1485210|CONCLUSIONS:
  • 1142888|1142951|CONCLUSION:
  • 1037474|1037552|BACKGROUND:

  • 841859|841956|OBJECTIVE:
  • 470496|470522|OBJECTIVE: To
  • 183998|184006|MATERIALS AND METHODS:
  • 142817|142837|SETTING:
  • 137418|137419|PURPOSE: To
  • 137969|137977|INTRODUCTION:
  • 24100|24101|AIM: The
  • 16|51|L: -DOPA
  • 19|56|(95% PI:

  • 12|44|PHPT:
  • 14|36|months [95% CI:
  • 10|31|vs N:
  • 9|45|(mode MIC:
  • 7|33|% [95 % CI:
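In the ID-22 examples, each term contains an uppercase token that ends with a colon (RESULTS:, CI:, L:). The sketch below tests for that pattern. The uppercase requirement is inferred from the examples, since a lowercase case such as "{systematic name:" is still present at ID-40. The function name is hypothetical and the real rule may differ.

```python
def has_uppercase_colon_token(term: str) -> bool:
    """True if any whitespace-separated token ends with ':' and its letters are all uppercase."""
    for token in term.split():
        if not token.endswith(":"):
            continue
        letters = [ch for ch in token if ch.isalpha()]
        if letters and all(ch.isupper() for ch in letters):
            return True
    return False

for term in ["RESULTS:", "months [95% CI:", "L: -DOPA", "{systematic name:"]:
    print(term, has_uppercase_colon_token(term))
```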
ID-23
In No.: 18,374,783; Filtered No.: 135,508; Out No.: 18,239,275; Pass Rate: 99.2625%; Acc. Pass Rate: 94.3801%
  • 377909|832392|(n =
  • 305362|600169|(P <
  • 203048|431029|(P =
  • 200192|370403|(p <
  • 187425|368477|P <

  • 101928|158559|(P < 0.05)
  • 16323|41573|95% CI =
  • 5620|7573|(P<0.001),
  • 472|473|{CI},
  • 282|500|(US$
  • 140|510|VSL#3
  • 46|57|N^N
  • 25|37|group (n=6) received
  • 9|35|CYP3A7*1C

  • 4|31|studies; average
  • 1|37|n.; Trichoteleia
  • 1|37|sp. n.; Trichoteleia
ID-24
In No.: 18,239,275; Filtered No.: 336,112; Out No.: 17,903,163; Pass Rate: 98.1572%; Acc. Pass Rate: 92.6409%
  • 177940|210038|two groups
  • 142624|190341|6 months
  • 125652|165359|24 h
  • 106208|106208|(ABSTRACT TRUNCATED AT 250 WORDS)
  • 96569|111790|the two groups

  • 76377|96164|5 years
  • 44885|53892|at 37 degrees
  • 16786|18097|3 times
  • 16270|20791|100 mg
  • 15591|16755|January 1,
  • 13116|16443|10 mg/kg
  • 6637|7563|12-year-old
  • 3199|3806|at -20 degrees C
  • 1871|1896|September 2006
  • 228|237|65 years or older with
  • 194|220|20 cigarettes per day
  • 54|63|3 - 6 months

  • 7|33|6 hours plus
  • 7|33|minutes) per day, 5 days
  • 7|31|MMR + V
  • 6|30|3 mg/EE
  • 4|33|317615 x
ID-25
In No.: 17,903,163; Filtered No.: 166,356; Out No.: 17,736,807; Pass Rate: 99.0708%; Acc. Pass Rate: 91.7801%
  • 38680|53429|group (P
  • 29553|33483|significant (P
  • 28932|29667|years) with
  • 28420|36452|significantly (P
  • 27946|28749|years) and

  • 2573|257|interval [95%
  • 1776|1784|see text] The
  • 1262|1390|< 0.05) lower
  • 980|980|(CENTRAL) (The
  • 8|32|nM (SD
  • 6|33|pOGH (ANG

  • 3|30|cB72.3(gamma
  • 3|30|new species (type
Apply Exclusive Filters - Lead-End-Terms
ID-30
In No.: 17,736,807; Filtered No.: 4,712,162; Out No.: 13,024,645; Pass Rate: 73.4329%; Acc. Pass Rate: 67.3967%
  • 3094375|3846276|of a
  • 2579661|3150375|that the
  • 2197093|2738637|from the
  • 2059855|2318833|is a
  • 1932233|2122291|of this

  • 807264|849049|The results
  • 619048|694688|was observed
  • 523214|525272|this study was
  • 13190|13456|about 50%
  • 338|344|- but not
  • 172|190|"what is
  • 49|49|AND COURSE

  • 5|31|iT reg
  • 5|31|of FoxM1b
  • 2|34|or spinal or conduction
  • 2|32|or spinal or conduction block,
ID-31
In No.: 13,024,645; Filtered No.: 2,710,470; Out No.: 10,314,175; Pass Rate: 79.1897%; Acc. Pass Rate: 53.3713%
  • 2119774|4013252|patients with
  • 1983358|2792617|associated with
  • 1619127|2077643|at the
  • 1256833|1304678|suggest that
  • 1206088|1437403|between the

  • 847370|1211964|in patients with
  • 437150|440606|results suggest that
  • 186025|186034|MATERIALS AND
  • 3820|4157|cross-reacted with
  • 143|174|(ST 36) and
  • 80|93|Zusanli (ST 36) and
  • 61|61|determine whether this could
  • 38|38|primarily composed of the

  • 5|41|tilt-in-space and
  • 5|30|systems, assays and
  • 4|38|ppm Cu as
  • 3|35|epidural or spinal or
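The lead-end-term filters in this group remove n-grams that begin or end with certain function words. The sketch below illustrates the general idea. The word lists are hypothetical, read off the examples in this table. The page does not give the real lists or the split of work between ID-30 to ID-34.

```python
# Hypothetical word lists inferred from the examples; not the project's lists.
LEAD_WORDS = {"to", "in", "as", "with", "for", "on"}
END_WORDS = {"with", "the", "that", "and", "as", "or", "of"}

def has_lead_or_end_word(term, lead_words=LEAD_WORDS, end_words=END_WORDS):
    """True if the first word is a lead word or the last word is an end word."""
    words = term.lower().split()  # case-insensitive, whitespace tokens
    if not words:
        return False
    return words[0] in lead_words or words[-1] in end_words

for term in ["patients with", "MATERIALS AND", "to determine", "nitric oxide"]:
    print(term, has_lead_or_end_word(term))
```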
ID-32
In No.: 10,314,175; Filtered No.: 2,687; Out No.: 10,311,488; Pass Rate: 99.9739%; Acc. Pass Rate: 53.3573%
  • 2904059|3505586|in a
  • 2592576|3105056|to be
  • 2383010|2931652|with a
  • 2002275|2347123|as a
  • 1380216|1544681|may be

  • 248186|258942|In a
  • 8273|10744|in A.
  • 1903|191|For one
  • 1874|212|on NO

  • 17|36|anti-NOR
  • 17|36|plus AT
  • 12|39|I/a
  • 10|36|AS-ON
  • 10|31|anti-OF
ID-33
In No.: 10,311,488; Filtered No.: 1,450,394; Out No.: 8,861,094; Pass Rate: 85.9342%; Acc. Pass Rate: 45.8522%
  • 754541|812741|to determine
  • 616952|641216|In addition,
  • 498080|526046|to evaluate
  • 441062|475504|to assess
  • 417985|428942|to investigate

  • 387143|475128|in the presence
  • 106242|106242|AT 250
  • 42061|42399|As a result,
  • 552|552|ON THE TREATMENT
  • 255|256|as a possible treatment for
  • 165|166|in details,
  • 144|145|- for example,
  • 71|78|within working memory
  • 58|59|for various chronic
  • 49|50|in 0.1% trifluoroacetic
  • 28|34|in threatened preterm labor

  • 5|38|with the MIC90S
  • 5|32|On PTD
  • 4|40|plus LHRH-A
  • 4|35|with the MIC90S of
ID-34
In No.: 8,861,094; Filtered No.: 1,458,246; Out No.: 7,402,848; Pass Rate: 83.5433%; Acc. Pass Rate: 38.3064%
  • 1266941|1629133|effects of
  • 1143004|1468193|number of
  • 1119157|1384237|use of
  • 1110577|1383390|presence of
  • 1095221|1237508|used to

  • 146690|149087|Comparison of
  • 887|894|low cost of
  • 848|857|(HPV) in
  • 294|29|NUMBER OF
  • 112|113|zymography was used to
  • 45|45|loss of two or more

  • 6|33|1 goes to
  • 5|35|active with the MIC90s of
  • 5|32|syn. nov. of
  • 3|37|microg/mmol of
The final result of the steps above is used as the distilled MEDLINE n-gram set.
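The table as a whole describes a sequence of exclusive filters, each taking the previous step's output as its input. The sketch below shows how such a chain could be run while collecting the In, Filtered, and Out counts for each step. The `run_pipeline` function and the two toy filters are illustrative assumptions and far simpler than the filters listed on this page.

```python
from typing import Callable, Iterable

# A filter returns True when the term should be removed.
Filter = Callable[[str], bool]

def run_pipeline(terms: Iterable[str], filters: list[tuple[str, Filter]]):
    """Apply filters in order; return the kept terms and a per-step report."""
    current = list(terms)
    report = []
    for filter_id, should_remove in filters:
        kept = [t for t in current if not should_remove(t)]
        # (ID, In No., Filtered No., Out No.)
        report.append((filter_id, len(current), len(current) - len(kept), len(kept)))
        current = kept
    return current, report

# Toy stand-ins for two steps.
filters = [
    ("digit", lambda t: t.replace(" ", "").isdigit()),
    ("ends-with-of", lambda t: t.lower().split()[-1:] == ["of"]),
]
kept, report = run_pipeline(["2", "effects of", "nitric oxide", "10"], filters)
print(kept)
for row in report:
    print(row)
```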
Apply Exclusive Filters - Project domain
ID-40
In No.: 7,402,848; Filtered No.: 714,896; Out No.: 6,687,952; Pass Rate: 90.3430%; Acc. Pass Rate: 34.6072%
  • 19606267|140460643|of
  • 17471706|134983756|the
  • 17200048|79351393|in
  • 14143686|51239396|to
  • 13841799|43070808|a

  • 11720760|24982979|The
  • 3580301|487296|We
  • 8691|9239|"The
  • 6254|6529|linear,
  • 6001|6796|"normal"
  • 179|185|{systematic name:
  • 56|56|systematic name
  • 10|33|anterior intermeniscal ligament
  • 9|30|regional low-flow perfusion

  • 2|32|Neo.
  • 2|31|Cannon &
  • 2|30|Polycentropus
  • 1|62|Penneys &
  • 1|30|% (month