Text Categorization

Blooper Detector

MEDLINE records contain MeSH indexing terms assigned by human indexers. Indexers may be aided by MTI (Medical Text Indexer), an automated system that recommends a list of MeSH indexing terms from which indexers may select at their workstations. Sometimes MTI recommendations are grossly erroneous. For example, the recommendations for the MEDLINE record (PMID 11748928) titled "Viral Interleukin 6 stimulates human peripheral blood B cells that are unresponsive to human interleukin 6." include the MeSH term "Coma", triggered by the word "unresponsive" in the title. This term is clearly inappropriate for indexing unresponsive cells. Occasionally, indexers themselves assign terms erroneously; for example, the MEDLINE record (PMID 9809206) titled "Modeling Escherichia coli. The concept of competitive coherence." was mis-indexed with the term "Competitive Behavior" (which MeSH reserves for human and animal behavior). Such erroneous terms are sometimes called "bloopers." The goal of our research is to develop a Blooper Detector that uses JDI (Journal Descriptor Indexing) to detect bloopers automatically by identifying them as outliers, in contrast to the more reasonable recommendations returned by MTI.

  • Method
    Compute the similarity between the JDI results for the text (title and abstract, optionally with the suggested MeSH terms) and the JDI results for each indexed or MTI-suggested MeSH term, under different combinations of the removeStopWord and useRestrictWords options, to find bloopers.
  • Processes
    • Run text through JDI with various options
      A text consists of the title and abstract (TIAB), optionally together with the indexed or MTI-suggested MeSH terms (TIABMH).
    • Run term through JDI with various options
      A term is an indexed or MTI-suggested MeSH term.

      Each line in MTI output is a recommendation. The format for this line consists of 8 fields, as shown in the following table:

      Field  Content   Notes
      -----  --------  ------------------------------------------
      1      PMID      PubMed assigned unique identifier.
                       If free text, this is "0".
      2      Term      MeSH Term.
                       Starts with '*', comes from Title section
      3      CUI       Concept Unique Identifier for the MeSH term
      4      Score     MTI score
      5      Type      MH: MeSH Heading
                       HM: Heading Mapped to
                       ET: Entry Term
                       NM: Supplemental Concept
                       SH: MeSH SubHeading
                       CT: MeSH CheckTag
      6      Misc      If ET, this explains the replacement
                       If not, blank
      7      Location  If from MMI:
                       TI: Title
                       AB: Abstract
                       TI;AB: Title and Abstract
      8      Path(s)   MM: MetaMap's MMI
                       RC: PubMed Related Citations
                       TG: John Wilbur's Trigram Method

      The blooper detector selects the MeSH recommendation to be evaluated, as follows:

      • If Path field has MM
        • If Type field = MH, select MeSH Term in Term field
          • If MeSH term begins with *, remove it
        • Else if Type = ET, select as MeSH Term the replacement term in Misc

      For example:
      From the line
      17313486|*Stupor|C0085628|23580|MH|RtM via: unresponsive behavior|TI|MM
      the recommendation to be evaluated is Stupor.

      From the line
      17313486|*B-Cells|C0004561|21420|ET|Entry Term Replacement for "B-Lymphocytes"|TI;AB|MM;RC
      the recommendation to be evaluated is B-Lymphocytes.
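
The selection rules above can be sketched in Python as follows. This is a minimal illustration, not the Blooper Detector's actual code. The function names `parse_mti_line` and `select_recommendation` are hypothetical, and the sketch assumes that fields are separated by '|', that multiple paths are separated by ';', and that the replacement term on an ET line is the quoted string in the Misc field, as in the two example lines.

```python
import re
from typing import Dict, Optional

# Field names follow the 8-field table above; the dictionary keys are our own labels.
FIELDS = ("pmid", "term", "cui", "score", "type", "misc", "location", "paths")


def parse_mti_line(line: str) -> Dict[str, str]:
    """Split one MTI output line into its 8 fields (assumes '|' as the delimiter)."""
    parts = line.strip().split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def select_recommendation(line: str) -> Optional[str]:
    """Return the MeSH term to evaluate, or None if the line is not selected."""
    rec = parse_mti_line(line)
    # Only recommendations whose Path field includes MM are considered.
    if "MM" not in rec["paths"].split(";"):
        return None
    if rec["type"] == "MH":
        term = rec["term"]
        # Remove the leading '*' if present.
        return term[1:] if term.startswith("*") else term
    if rec["type"] == "ET":
        # Assumption: the replacement term is the quoted string in Misc.
        match = re.search(r'"([^"]+)"', rec["misc"])
        return match.group(1) if match else None
    return None


if __name__ == "__main__":
    mh_line = "17313486|*Stupor|C0085628|23580|MH|RtM via: unresponsive behavior|TI|MM"
    et_line = '17313486|*B-Cells|C0004561|21420|ET|Entry Term Replacement for "B-Lymphocytes"|TI;AB|MM;RC'
    print(select_recommendation(mh_line))  # Stupor
    print(select_recommendation(et_line))  # B-Lymphocytes
```

Lines whose Path field lacks MM, or whose Type is neither MH nor ET, return `None` in this sketch; how the actual detector treats such lines is not stated here.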

    • Options
      Options         Text                              Recommended Term
                      Input   JDI Option                Input      JDI Option
      --------------  ------  ------------------------  ---------  ------------------------
      TIAB_SR_SR      TIAB    index text                MeSH Term  index text
                              removeStopWord = true                removeStopWord = true
                              useRestrictWords = true              useRestrictWords = true
      TIAB_nSR_nSR    TIAB    index text                MeSH Term  index text
                              removeStopWord = false               removeStopWord = false
                              useRestrictWords = false             useRestrictWords = false
      TIAB_SR_nSR     TIAB    index text                MeSH Term  index text
                              removeStopWord = true                removeStopWord = false
                              useRestrictWords = true              useRestrictWords = false
      TIAB_SR_MH      TIAB    index text                MeSH Term  index MeSH
                              removeStopWord = true
                              useRestrictWords = true
      TIAB_nSR_MH     TIAB    index text                MeSH Term  index MeSH
                              removeStopWord = false
                              useRestrictWords = false
      TIABMH_SR_SR    TIABMH  index text|MeSH           MeSH Term  index text
                              removeStopWord = true                removeStopWord = true
                              useRestrictWords = true              useRestrictWords = true
      TIABMH_nSR_nSR  TIABMH  index text|MeSH           MeSH Term  index text
                              removeStopWord = false               removeStopWord = false
                              useRestrictWords = false             useRestrictWords = false
      TIABMH_SR_nSR   TIABMH  index text|MeSH           MeSH Term  index text
                              removeStopWord = true                removeStopWord = false
                              useRestrictWords = true              useRestrictWords = false
      TIABMH_SR_MH    TIABMH  index text|MeSH           MeSH Term  index MeSH
                              removeStopWord = true
                              useRestrictWords = true
      TIABMH_nSR_MH   TIABMH  index text|MeSH           MeSH Term  index MeSH
                              removeStopWord = false
                              useRestrictWords = false
    • Calculate the similarity between the two results above (using the cosine coefficient)
    • Flag the recommendation as a blooper if the similarity value is below the cutoff
      The cutoff used is 0.42.
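
The last two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the detector's implementation: it assumes each JDI result can be represented as a mapping from journal descriptor name to score, and the names `cosine_similarity` and `is_blooper`, as well as the scores in the usage example, are invented for illustration. Only the cosine coefficient and the 0.42 cutoff come from the text above.

```python
import math
from typing import Mapping

CUTOFF = 0.42  # cutoff stated in the text above


def cosine_similarity(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine coefficient of two sparse score vectors keyed by descriptor name."""
    # Only descriptors present in both vectors contribute to the dot product.
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an empty or all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def is_blooper(text_scores: Mapping[str, float],
               term_scores: Mapping[str, float],
               cutoff: float = CUTOFF) -> bool:
    """Flag the term as a blooper when similarity falls below the cutoff."""
    return cosine_similarity(text_scores, term_scores) < cutoff


if __name__ == "__main__":
    # Invented scores, for illustration only; these are not real JDI output.
    text_scores = {"Allergy and Immunology": 0.8, "Virology": 0.5}
    term_scores = {"Neurology": 0.9, "Virology": 0.1}
    print(cosine_similarity(text_scores, term_scores))
    print(is_blooper(text_scores, term_scores))
```

Representing the results as sparse mappings keeps the comparison independent of descriptor ordering; a fixed-length vector over all journal descriptors would give the same cosine value.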