MetaMap 2008_v1 Release Notes
Introduction

This document outlines the substantial changes made to produce MetaMap 2008_v1, the most important of which are
  1. changes in XML generation,
  2. a new sentence-breaking algorithm,
  3. changes in MetaMap Machine Output (MMO),
  4. changes in the allowed form of PMIDs, and
  5. no variant generation for short words.
Other, less visible changes, mentioned here but not described further, are bug fixes that allow the correct display of bracketed output, MMI output, and fielded MMI output.

XML Generation

XML generation in the initial release of MetaMap08, first described in the original MetaMap08 Release Notes, did not work properly. It has been fixed in this release, which correctly generates both formatted and unformatted XML output.

Another substantial change to XML generation is a one-to-one mapping between input citations (not input files) and XML documents. The original release of MetaMap08 was intended to generate one XML document for each input file, even if the input file contained multiple citations. By contrast, MetaMap08_v1 generates one XML document for each citation in an input file. Consequently, if an input file contains multiple citations, the XML output generated from that file will contain several XML documents, each separated from the next by a blank line.

For example, consider the following input file:


Heart attack.

Lung cancer.






Because "Heart attack." and "Lung cancer." are separated by a blank line, they are considered separate citations, just as if multiple Medline citations had been downloaded to a single input file. The XML generated for that file will therefore be as follows:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MMO PUBLIC "-//NLM//DTD MetaMap Machine Output//EN"
                     "http://ii-public.nlm.nih.gov/DTD/MMOtoXML_v1.dtd">
<MMO>

 . . .  XML for "heart attack." . . .

</MMO>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MMO PUBLIC "-//NLM//DTD MetaMap Machine Output//EN"
                     "http://ii-public.nlm.nih.gov/DTD/MMOtoXML_v1.dtd">
<MMO>

 . . .  XML for "lung cancer." . . .

</MMO>
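
Because the output for a multi-citation file is a sequence of XML documents rather than a single document, it cannot be handed to an XML parser in one piece. The following is a minimal Python sketch of one way a postprocessing program might split such output. It assumes that every document begins with an XML declaration at the start of a line, as in the example above; the function name split_xml_documents is hypothetical and is not part of MetaMap.

```python
def split_xml_documents(output):
    """Split concatenated XML output into one string per XML document.

    Assumption: each document starts with a line beginning with "<?xml".
    """
    documents = []
    current = []

    def flush():
        # Store the lines collected so far as one document, if non-empty.
        text = "\n".join(current).strip()
        if text:
            documents.append(text)
        current.clear()

    for line in output.splitlines():
        if line.startswith("<?xml"):
            flush()  # a new declaration closes the previous document
        current.append(line)
    flush()
    return documents
```

Splitting on the XML declaration rather than on blank lines is a deliberate choice in this sketch, since blank lines may also occur inside a document. Each returned string can then be passed to an XML parser separately.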



Sentence-Breaking Algorithm

MetaMap08 (and all previous versions of the application) required that a blank space immediately follow a period for that period to signal the end of a sentence. We recently noticed, however, tens of thousands of instances in Medline citations in which a period not followed by a blank space ended a sentence. Consequently, we did not correctly handle instances such as the following (note that each line is intended to be read by itself--this is not continuous text):


serum markers for each of those subgroups.The mathematical simulation

neurogenesis in the adult nervous system.These findings may have

informed choices.Patients who are ready to make changes must be provided

affiliation with a trade union.Although still shut-out by the general

Note that in each of these actual lines from Medline citations, the period is not followed by a blank space, but nonetheless still marks the end of a sentence. We analyzed the phenomenon of end-of-sentence periods that are not followed by a blank space, and determined that if a period was followed by a string beginning with an uppercase letter, and that string was either

  1. one of a well-defined list of short (at most six characters) words such as The, In, We, This, These, It, Our, To, Study, When, etc. (The and These in the examples above), or
  2. any long (seven characters or more) word (Patients and Although in the examples above),
a sentence break was very likely to have been intended. This logic has been included in MetaMap08_v1.
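
The rule above can be illustrated with a short Python sketch. This is not MetaMap's implementation: the short-word list contains only the ten words named above (the full list is not reproduced in these notes), a "word" is taken to be a run of ASCII letters, and the function names are hypothetical. The sketch handles only periods that are not followed by a blank space.

```python
import re

# Only the short words named in the text; the actual list is longer.
SHORT_WORDS = {"The", "In", "We", "This", "These", "It", "Our", "To",
               "Study", "When"}

# A period immediately followed by a word that starts with an uppercase letter.
_CANDIDATE = re.compile(r"\.([A-Z][A-Za-z]*)")


def is_likely_break(word):
    """Long words (7+ characters) or listed short words signal a break."""
    return len(word) >= 7 or word in SHORT_WORDS


def split_unspaced_sentences(text):
    """Split text at periods that likely end a sentence despite no space."""
    pieces = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        if is_likely_break(match.group(1)):
            end = match.start() + 1  # keep the period with the first piece
            pieces.append(text[start:end])
            start = end
    pieces.append(text[start:])
    return pieces
```

Applied to the third example line above, the sketch splits after "informed choices." because Patients has eight characters. A string such as "E.Coli" is left intact, because Coli is short and not in the list.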

Changes in MMO

The form of phrase and utterance terms in MetaMap Machine Output (MMO) has changed slightly in order to allow MMO to more closely match the form of XML output. Just as MetaMap08 introduced an argument to MMO terms that represents positional information, MetaMap08_v1 introduces an additional argument representing the character positions in the string in which <CR> characters have been replaced by blank spaces. The reason for this change and examples of the previous and current MMO forms follow.

Consider this extract from the beginning of PMID 17047334:


PMID- 17047334
OWN - NLM
STAT- MEDLINE
DA  - 20061106
DCOM- 20070618
PUBM- Print-Electronic
IS  - 0001-5652 (Print)
VI  - 62
IP  - 2
DP  - 2006
TI  - Ethnic differences in key candidate genes for spontaneous preterm birth:
      TNF-alpha and its receptors.
PG  - 107-18
AB  - OBJECTIVES: Spontaneous preterm birth (PTB) has a significant ethnic
      disparity with people of African descent having an almost 2-fold higher
      incidence than those of European descent in the United States.

One of the phrases identified by MetaMap in the first utterance of the abstract, a significant ethnic disparity, is represented in XML output by


    <PText>a significant ethnic
      disparity</PText>

Note that the line break between ethnic and disparity and the six blank spaces before disparity in the original input text are faithfully reproduced in the XML output; had the beginning of the citation's abstract read instead


AB  - OBJECTIVES: Spontaneous preterm birth (PTB) has a significant ethnic disparity
      with people of African descent having an almost 2-fold higher incidence
      than those of European descent in the United States.

the XML code generated for the phrase a significant ethnic disparity would have been instead


    <PText>a significant ethnic disparity</PText>

In order to ensure that the MMO representation of phrases and utterances mirrors as faithfully as possible their XML representation, we have modified the MMO phrase and utterance terms to include all blank spaces in the original text. We deemed it unwise, however, to include <CR> characters in MMO terms, because users' postprocessing programs expect all MMO terms to be contained on a single line. A compromise balancing faithfulness to the original text and backward compatibility for our users involved modifying MMO phrase and utterance terms by

  1. changing each <CR> character to a blank space, and
  2. adding an extra argument at the end of phrase and utterance terms representing the character positions in the utterance in which a <CR> character was replaced by a blank space.
For example, the previous form of the MMO term generated for the phrase a significant ethnic disparity was


phrase('a significant ethnic disparity',
       [det([lexmatch([a]),inputmatch([a]),tag(det),tokens([a])]),
        mod([lexmatch([significant]),inputmatch([significant]),tag(adj),tokens([significant])]),
        mod([lexmatch([ethnic]),inputmatch([ethnic]),tag(adj),tokens([ethnic])]),
        head([lexmatch([disparity]),inputmatch([disparity]),tag(noun),tokens([disparity])])],
        325/36).

Note that this term has been pretty-printed for readability; in actual MMO output, the entire term would appear on one line. The argument 325/36 tells us that the string a significant ethnic disparity begins at the 325th character of the citation (counting from the very beginning, i.e., PMID- 17047334), and spans 36 characters of the original text, including the line break and the six blank spaces before disparity.

The deficiency in this representation is that it does not correctly capture <CR> characters and multiple blank spaces. By way of contrast, the new form in which <CR>s and multiple blanks are more faithfully represented (again, pretty-printed for readability) is


phrase('a significant ethnic       disparity',
       [det([lexmatch([a]),inputmatch([a]),tag(det),tokens([a])]),
        mod([lexmatch([significant]),inputmatch([significant]),tag(adj),tokens([significant])]),
        mod([lexmatch([ethnic]),inputmatch([ethnic]),tag(adj),tokens([ethnic])]),
        head([lexmatch([disparity]),inputmatch([disparity]),tag(noun),tokens([disparity])])],
        325/36,
        [345]).

The additional argument [345] shows that one <CR> character at character position 345 was replaced by a blank space in the MMO representation.
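
The two-step modification described above can be sketched in Python as follows. This is an illustration only, not MetaMap's code: it assumes that a <CR> character appears as "\n" in the input string and that positions are offsets from the beginning of the citation, as in the 325/36 example. The function name flatten_line_breaks is hypothetical.

```python
def flatten_line_breaks(text, start):
    """Replace each line break in text with a blank space.

    text  -- the phrase or utterance exactly as it appears in the citation
    start -- the position of the first character of text in the citation

    Returns the single-line string and the list of citation positions at
    which a line break was replaced.
    """
    positions = [start + i for i, ch in enumerate(text) if ch == "\n"]
    return text.replace("\n", " "), positions


original = "a significant ethnic\n      disparity"
flattened, replaced = flatten_line_breaks(original, 325)
```

Here flattened contains seven blank spaces between ethnic and disparity (the replaced line break plus the original six), its length is still 36, and replaced is [345].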

Similarly, the utterance term for the first utterance in the citation's abstract would be the following (the actual utterance text has been replaced by "____" in order to show the entire utterance term on one line):


utterance('17047334.ab.1', ____, 277/216, [345,423]).


The argument [345,423] shows that <CR> characters at positions 345 (between ethnic and disparity) and 423 (between higher and incidence) have been replaced by blank spaces.
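
Postprocessing programs that read MMO line by line may need to account for the new final argument. The following Python sketch shows one way to read the positional argument and the new list from the end of a phrase or utterance term. It assumes the term is on a single line and ends exactly as in the examples above; treating an empty list [] as the form used when no <CR> was replaced is also an assumption. The function name trailing_positions is hypothetical.

```python
import re

# Matches "Start/Length, [p1,p2,...])." at the very end of a term.
_TRAILING = re.compile(r"(\d+)/(\d+),\s*\[([\d,\s]*)\]\)\.\s*$")


def trailing_positions(term):
    """Return (start, length, cr_positions) from the end of an MMO term."""
    match = _TRAILING.search(term)
    if match is None:
        # For example, a term in the previous form, which ends in "325/36)."
        raise ValueError("term does not end with Start/Length and a list")
    start = int(match.group(1))
    length = int(match.group(2))
    cr_positions = [int(p) for p in match.group(3).split(",") if p.strip()]
    return start, length, cr_positions
```

For the utterance term shown above, this returns 277, 216, and the list [345, 423]. A term in the previous form raises ValueError, which gives a simple way to detect output from an earlier release.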


Allowed Form of PMIDs

A bug introduced in MetaMap08 prevented the analysis of citations whose PMIDs were not purely numeric. MetaMap08_v1 includes a fix that allows the analysis of citations whose PMIDs contain any printing ASCII characters. The following are examples of allowable PMID formats.


PMID- 19052218
PMID- MP:001
PMID- 19052218_Findings
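
As an illustration, the following Python sketch checks a PMID line against the relaxed format. It is not MetaMap's own check. It takes "printing ASCII characters" to mean the characters from "!" through "~" (codes 33 to 126), excluding the blank space; whether MetaMap accepts blank spaces inside a PMID is not stated here, so that exclusion is an assumption. The function name parse_pmid is hypothetical.

```python
import re

# "PMID- " followed by one or more printing ASCII characters (no spaces).
_PMID_LINE = re.compile(r"^PMID- ([\x21-\x7e]+)\s*$")


def parse_pmid(line):
    """Return the PMID from a "PMID- ..." line, or None if it does not fit."""
    match = _PMID_LINE.match(line)
    return match.group(1) if match else None
```

All three example lines above are accepted by this sketch, whereas a line with an empty PMID or with non-ASCII characters is rejected.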

No Variant Generation for Short Words

In order to simplify and streamline processing via the elimination of a great many false positives, MetaMap will no longer generate variants for words of one or two characters. This change will suppress, for example, the generation of


   966 TXS (TBXAS1 gene) [Gene or Genome]

from the input word t, and the generation of


   966 AAS (Addiction Admission Scale) [Intellectual Product]

from the input word aa.