The SPECIALIST Lexicon

Distilled Process Log - Walk Through, 2023

This page describes the filter processes applied to the MEDLINE n-gram set to generate the distilled MEDLINE n-gram set and the candidate multiword list:

ID/Program | In No. | Filtered No. (%) | Out No. | Pass Rate | Acc. Pass Rate | Filter example and notes
Generate the MEDLINE n-gram set
Generate MEDLINE n-gram set
In No.: 7,009,033,721
  • N=1: 41,049,611
  • N=2: 365,653,326
  • N=3: 1,260,314,546
  • N=4: 2,233,423,690
  • N=5: 3,108,592,548
Filtered No.: 6,976,926,660 (99.54%)
  • N=1: 39,737,675 (96.80%)
  • N=2: 357,709,919 (97.83%)
  • N=3: 1,248,541,161 (99.07%)
  • N=4: 2,225,746,502 (99.66%)
  • N=5: 3,105,191,403 (99.89%)
Out No.: 32,107,061
  • N=1: 1,311,936
  • N=2: 7,943,407
  • N=3: 11,773,385
  • N=4: 7,677,188
  • N=5: 3,401,145
Pass Rate: 0.46% | Acc. Pass Rate: N/A
From MEDLINE TI & AB to the MEDLINE n-gram set
  • filter out n-grams with length > 50
  • filter out n-grams with word count < 30
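The two thresholds above can be sketched as a simple predicate. This is only an illustration, not the program used to build the n-gram set: the name `keep_ngram` is hypothetical, "length" is assumed to mean the character length of the n-gram, "word count" is assumed to be the WC frequency field, and the sample counts are made up.

```python
# Hypothetical sketch of the two thresholds listed above.
MAX_LENGTH = 50      # n-grams longer than 50 characters are filtered out
MIN_WORD_COUNT = 30  # n-grams with a word count (WC) below 30 are filtered out


def keep_ngram(term: str, word_count: int) -> bool:
    """Return True if the n-gram passes both thresholds."""
    return len(term) <= MAX_LENGTH and word_count >= MIN_WORD_COUNT


# Made-up sample records: (term, word count)
records = [("magnetic resonance imaging", 1200), ("rare phrase", 12)]
kept = [term for term, wc in records if keep_ngram(term, wc)]
```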

  • Calculated by Excel (manually input In and Out No.)
  • Used data from the n-gram set log file (not distilled), keyed into a spreadsheet for calculation.
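The Pass Rate and Acc. Pass Rate columns on this page are consistent with two ratios: Out No. divided by In No. for the step, and Out No. divided by the 32,107,061 n-grams that enter the filter chain. A minimal sketch of that spreadsheet arithmetic, using the ID-12 figures reported on this page (the function name is an assumption):

```python
BASE_IN = 32_107_061  # size of the MEDLINE n-gram set entering the filters


def pass_rate(in_no: int, out_no: int) -> float:
    """Fraction of n-grams that survive a step."""
    return out_no / in_no


# ID-12 (Digit filter): In No. and Out No. as reported on this page
in_no, out_no = 32_106_231, 31_907_987

rate = f"{pass_rate(in_no, out_no):.4%}"        # step pass rate
acc_rate = f"{pass_rate(BASE_IN, out_no):.4%}"  # accumulated pass rate
```

Both values agree with the ID-12 row (99.3825% and 99.3800%).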
Basic operation: Sort nGrams by DC|WC|Terms
ID-01
  • NGramFilter: SortNGramByDcWcTerm
  • Param: 1, 01
  • Run Time: 1 Min.
In: 32,107,061 | Filtered: 0 | Out: 32,107,061 | Pass rate: 100.0000% | Acc. pass rate: 100.0000%
  • Create link: ./05.ApplyFilters/nGram.${YEAR}
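The sort step can be pictured with the sketch below. It is not the `SortNGramByDcWcTerm` implementation: it assumes each record is a `DC|WC|term` line, reads DC as document count and WC as word count, and assumes descending order for the counts with the term as an ascending tie-breaker. The sample lines are made up.

```python
from typing import List, NamedTuple


class NGram(NamedTuple):
    dc: int    # document count (assumed meaning of DC)
    wc: int    # word count (assumed meaning of WC)
    term: str


def parse_line(line: str) -> NGram:
    # Split on the first two pipes only, so a term may itself contain "|".
    dc, wc, term = line.rstrip("\n").split("|", 2)
    return NGram(int(dc), int(wc), term)


def sort_ngrams(ngrams: List[NGram]) -> List[NGram]:
    # Assumed order: DC descending, then WC descending, then term ascending.
    return sorted(ngrams, key=lambda g: (-g.dc, -g.wc, g.term))


lines = ["5|9|of the", "5|12|in the", "7|7|and"]  # made-up sample data
ordered = [g.term for g in sort_ngrams([parse_line(l) for l in lines])]
```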
Apply General Exclusive Filters
ID-10 | In: 32,107,061 | Filtered: 48 | Out: 32,107,013 | Pass rate: 99.9999% | Acc. pass rate: 99.9999%
  • |
  • (|r|
  • ||
  • Ag|AgCl
  • |D|
  • |E|
  • lambda(||)
ID-11 | In: 32,107,013 | Filtered: 782 | Out: 32,106,231 | Pass rate: 99.9976% | Acc. pass rate: 99.9974%
  • =
  • <
  • +/-
  • >
  • -

  • -->
  • (+)
  • (%)
  • "+"
  • ((-/-))

  • ==>
  • [...]
  • *}
  • *//
ID-12
  • Filter: Digit
  • InTerm: core-term.lc
  • Param: 2, 12
  • Run time: 2 Min (norm - strip punc and space)
In: 32,106,231 | Filtered: 198,244 | Out: 31,907,987 | Pass rate: 99.3825% | Acc. pass rate: 99.3800%
  • 2
  • 1
  • 3
  • 10
  • 4

  • 95%
  • 2,
  • 2000
  • 3-5
  • +/-0.5
  • (+/-0.05)
  • $1,500
  • "3 + 1"
  • 55834

  • 192.168.1.1
  • [192, 168]
  • (+15%),
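The run-time note for ID-12 mentions normalization by stripping punctuation and spaces. One way to read the examples above is that an n-gram is filtered when nothing but digits remains after that normalization. The sketch below follows that reading; it is an assumption, not the actual Digit filter, and `is_digit_term` is a hypothetical name.

```python
import string

# Translation table that deletes ASCII punctuation and whitespace.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def is_digit_term(term: str) -> bool:
    """True if only ASCII digits remain after stripping punctuation and spaces."""
    core = term.translate(_STRIP)
    return core.isascii() and core.isdigit()


examples = ["95%", "3-5", "+/-0.5", "$1,500", '"3 + 1"', "192.168.1.1", "[192, 168]", "(+15%),"]
flags = [is_digit_term(t) for t in examples]
```

Under this reading every example listed for ID-12 is flagged, while a term such as "24 h" is not, because a letter survives the normalization.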
ID-13
  • Filter: Number
  • InTerm: core-term.lc
  • Create link: ./inData/NRVAR
  • Param: 2, 13
  • Run time: 2 Min
In: 31,907,987 | Filtered: 5,640 | Out: 31,902,347 | Pass rate: 99.9823% | Acc. pass rate: 99.3624%
  • and
  • two
  • one
  • first
  • three

  • first and second
  • one third
  • twenty-eight
  • NINE
  • zeroth and
  • Four hundred and forty-seven

  • zero-one
  • 'half'
  • One"
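The Number filter examples suggest that an n-gram is removed when every word in it is a number word (or "and"), ignoring case and punctuation. A minimal sketch of that idea follows. The word list is a small hypothetical stand-in chosen to cover the examples above, not the contents of the `NRVAR` data, and `is_number_term` is an assumed name.

```python
import re

# Hypothetical, deliberately tiny word list covering the examples above.
NUMBER_WORDS = {
    "zero", "zeroth", "one", "first", "two", "second", "three", "third",
    "four", "seven", "eight", "nine", "twenty", "forty", "hundred", "half",
    "and",
}


def is_number_term(term: str) -> bool:
    """True if every alphanumeric token is a known number word."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return bool(tokens) and all(t in NUMBER_WORDS for t in tokens)


examples = ["first and second", "twenty-eight", "NINE",
            "Four hundred and forty-seven", "zero-one", "'half'"]
flags = [is_number_term(t) for t in examples]
```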
ID-14 | In: 31,902,347 | Filtered: 230,940 | Out: 31,671,407 | Pass rate: 99.2761% | Acc. pass rate: 98.6431%
  • of the
  • in the
  • to the
  • and the
  • on the

  • In the
  • and/or
  • 50% of
  • 1, 2, and
  • 2003 to
  • 2003 to 2007
  • for >=50%
  • the 8:2
  • -196 to -174

  • OR-462
  • AND-34
  • IN-1130
  • And-1
Apply Exclusive Filters - pattern
ID-20 | In: 31,671,407 | Filtered: 407,813 | Out: 31,263,594 | Pass rate: 98.7124% | Acc. pass rate: 97.3730%
  • tomography (CT)
  • imaging (MRI)
  • resonance imaging (MRI)
  • oxide (NO)
  • reaction (PCR)

  • chain reaction (PCR)
  • polymerase chain reaction (PCR)
  • magnetic resonance imaging (MRI)
  • computed tomography (CT)
  • enzyme-linked immunosorbent assay (ELISA)
  • single nucleotide polymorphisms (SNPs)
  • magnetic resonance (MR) imaging

  • "Standards, Options and Recommendations" (SOR)
  • (CREB)-binding protein (CBP)

  • kinase (ASK)
  • proline-rich polypeptide (PRP)
  • semi-permeable membrane devices (SPMDs)
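All of the ID-20 examples contain a parenthesized acronym such as "(CT)" or "(SNPs)". The page does not name the pattern, so the sketch below is only one plausible reading: flag an n-gram that contains a parenthesized single token with at least two uppercase letters. The function name and the two-uppercase threshold are assumptions.

```python
import re

# A parenthesized token with no spaces or nested parentheses, e.g. "(MRI)".
_PAREN_TOKEN = re.compile(r"\(([^()\s]+)\)")


def has_parenthetic_acronym(term: str) -> bool:
    """True if the term contains '(TOKEN)' where TOKEN has >= 2 uppercase letters."""
    for match in _PAREN_TOKEN.finditer(term):
        token = match.group(1)
        if sum(ch.isupper() for ch in token) >= 2:
            return True
    return False


examples = ["tomography (CT)", "single nucleotide polymorphisms (SNPs)",
            "magnetic resonance (MR) imaging", "(CREB)-binding protein (CBP)"]
flags = [has_parenthetic_acronym(t) for t in examples]
```

Statistical fragments such as "(P < 0.05)" or "group (n=6) received" are not matched by this sketch, which is consistent with their appearing under a different filter on this page.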
ID-21 | In: 31,263,594 | Filtered: 561,786 | Out: 30,701,808 | Pass rate: 98.2031% | Acc. pass rate: 95.6232%
  • a significant
  • a single
  • a high
  • a novel
  • a case

  • a very
  • a group
  • a dose-dependent
  • A series
  • A and B
  • a meta-analysis
  • a SIF
  • A alpha C
  • A nonseminomatous

  • a delivery rate per
  • A beta 2m
  • a beta ab
ID-22 | In: 30,701,808 | Filtered: 217,953 | Out: 30,483,855 | Pass rate: 99.2901% | Acc. pass rate: 94.9444%
  • RESULTS:
  • METHODS:
  • CONCLUSIONS:
  • CONCLUSION:
  • BACKGROUND:

  • OBJECTIVE:
  • OBJECTIVE: To
  • MATERIALS AND METHODS:
  • SETTING:
  • PURPOSE: To
  • INTRODUCTION:
  • AIM: The
  • L: -DOPA
  • 95% PI:

  • PHPT:
  • months [95% CI:
  • vs N:
  • mode MIC:
  • [95 % CI:
ID-23 | In: 30,483,855 | Filtered: 381,114 | Out: 30,102,741 | Pass rate: 98.7498% | Acc. pass rate: 93.7574%
  • °C
  • ≥3
  • (mean ±
  • × 3
  • (IC 95%;
  • TGF-β
ID-24 | In: 30,102,741 | Filtered: 0 | Out: 30,102,741 | Pass rate: 100.0000% | Acc. pass rate: 93.7574%
  • (n =
  • (P <
  • (P =
  • (p <
  • P <

  • (P < 0.05)
  • 95% CI =
  • P<0.001),
  • CI},
  • US$
  • VSL#3
  • N^N
  • group (n=6) received
  • CYP3A7*1C

  • studies; average
  • n.; Trichoteleia
  • sp. n.; Trichoteleia
ID-25 | In: 30,102,741 | Filtered: 486,295 | Out: 29,616,446 | Pass rate: 98.3845% | Acc. pass rate: 92.2428%
  • two groups
  • 6 months
  • 24 h
  • (ABSTRACT TRUNCATED AT 250 WORDS)
  • the two groups

  • 5 years
  • at 37 degrees
  • 3 times
  • 100 mg
  • January 1,
  • 10 mg/kg
  • 12-year-old
  • at -20 degrees C
  • September 2006
  • 65 years or older with
  • 20 cigarettes per day
  • 3 - 6 months

  • 6 hours plus
  • minutes) per day, 5 days
  • MMR + V
  • 3 mg/EE
  • 317615 x
ID-26 | In: 29,616,446 | Filtered: 282,399 | Out: 29,334,047 | Pass rate: 99.0465% | Acc. pass rate: 91.3632%
  • group (P
  • significant (P
  • years) with
  • significantly (P
  • years) and

  • interval [95%
  • see text] The
  • lt; 0.05) lower
  • CENTRAL) (The
  • nM (SD
  • pOGH (ANG

  • cB72.3(gamma
  • new species (type
Apply Exclusive Filters - Lead-End-Terms
ID-30 | In: 29,334,047 | Filtered: 7,281,042 | Out: 22,053,005 | Pass rate: 75.1789% | Acc. pass rate: 68.6858%
  • of a
  • that the
  • from the
  • is a
  • of this

  • The results
  • was observed
  • this study was
  • about 50%
  • - but not
  • "what is
  • AND COURSE

  • iT reg
  • of FoxM1b
  • or spinal or conduction
  • or spinal or conduction block,
ID-31 | In: 22,053,005 | Filtered: 4,505,330 | Out: 17,547,675 | Pass rate: 79.5704% | Acc. pass rate: 54.6536%
  • patients with
  • associated with
  • at the
  • suggest that
  • between the

  • in patients with
  • results suggest that
  • MATERIALS AND
  • cross-reacted with
  • (ST 36) and
  • Zusanli (ST 36) and
  • determine whether this could
  • primarily composed of the

  • tilt-in-space and
  • systems, assays and
  • ppm Cu as
  • epidural or spinal or
ID-32 | In: 17,547,675 | Filtered: 4,072 | Out: 17,543,603 | Pass rate: 99.9768% | Acc. pass rate: 54.6409%
  • in a
  • to be
  • with a
  • as a
  • may be

  • In a
  • in A.
  • For one
  • on NO

  • anti-NOR
  • plus AT
  • I/a
  • AS-ON
  • anti-OF
ID-33 | In: 17,543,603 | Filtered: 2,360,808 | Out: 15,182,795 | Pass rate: 86.5432% | Acc. pass rate: 47.2880%
  • to determine
  • In addition,
  • to evaluate
  • to assess
  • to investigate

  • in the presence
  • AT 250
  • As a result,
  • ON THE TREATMENT
  • as a possible treatment for
  • in details,
  • - for example,
  • within working memory
  • for various chronic
  • in 0.1% trifluoroacetic
  • in threatened preterm labor

  • with the MIC90S
  • On PTD
  • plus LHRH-A
  • with the MIC90S of
ID-34 | In: 15,182,795 | Filtered: 2,402,895 | Out: 12,779,900 | Pass rate: 84.1736% | Acc. pass rate: 39.8040%
  • effects of
  • number of
  • use of
  • presence of
  • used to

  • Comparison of
  • low cost of
  • HPV) in
  • NUMBER OF
  • zymography was used to
  • loss of two or more

  • 1 goes to
  • active with the MIC90s of
  • syn. nov. of
  • microg/mmol of
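The lead-term and end-term filters in this section appear to drop n-grams that begin or end with certain function words (for example "to determine" or "effects of"). The sketch below shows the general shape of such a check. The term sets are small hypothetical samples and the function names are assumptions; the actual lists and matching rules are not given on this page.

```python
# Hypothetical sample term sets; the real lists are not shown on this page.
LEAD_TERMS = {"to", "in", "as", "with"}
END_TERMS = {"of", "to", "with", "the", "and"}


def starts_with_lead_term(term: str) -> bool:
    words = term.split()
    return bool(words) and words[0].lower() in LEAD_TERMS


def ends_with_end_term(term: str) -> bool:
    words = term.split()
    return bool(words) and words[-1].lower() in END_TERMS


def is_excluded(term: str) -> bool:
    """True if the n-gram begins with a lead term or ends with an end term."""
    return starts_with_lead_term(term) or ends_with_end_term(term)


candidates = ["to determine", "effects of", "patients with",
              "anterior intermeniscal ligament"]
kept = [t for t in candidates if not is_excluded(t)]
```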
The final result of the filters above is used as the distilled MEDLINE n-gram set.
Apply Exclusive Filters - Project domain
ID-40 | In: 12,779,900 | Filtered: 959,838 | Out: 11,820,062 | Pass rate: 92.4895% | Acc. pass rate: 36.8145%
  • of
  • the
  • in
  • to
  • a

  • The
  • We
  • "The
  • linear,
  • "normal"
  • {systematic name:
  • systematic name
  • anterior intermeniscal ligament
  • regional low-flow perfusion

  • Neo.
  • Cannon &
  • Polycentropus
  • Penneys &
  • % (month