MetaMap 2009 V2 XML Output Explained

MetaMap has long provided an output format known as Machine Output, whose pre-MetaMap08 version is described here. In MetaMap08, we introduced a number of changes to Machine Output, outlined here; the current form of the Machine Output format is described in full here.

Although MetaMap Machine Output is compact, it is made up of Prolog terms, and is therefore difficult to parse without using Prolog. We wanted to provide MetaMap users with a more convenient and generally understood output format such as XML, which has become the de facto lingua franca for internet-based information exchange. MetaMap08 therefore introduced XML output, described here.

Recent feedback from users made us realize that the XML tags in our initial version of MetaMap XML output were not sufficiently mnemonic, so with MetaMap2009 V2, we are changing most of the tags to make them more easily understandable. The structure of MetaMap XML output has not changed--only the tags themselves.

The following three tables will present

  • The previous and current XML tags listed alphabetically by previous tag,
  • An explanation of the current XML tags in which the tags are listed alphabetically, and
  • An explanation of the current XML tags in which the tags are listed hierarchically.

    Note that the second and third tables present exactly the same information, only arranged differently.


    Previous and current XML tags

     Previous tag        Current tag
     <AAs>               unchanged
     <AA>                unchanged
     <AAExpansion>       <AAExp>
     <AALen>             unchanged
     <AAName>            <AAText>
     <Args>              <CmdLine>
     <CUIs>              <AACUIs>
     <CUI>               <AACUI>
     <CUIConcepts>       <NegConcepts>
     <CUIConcept>        <NegConcept>
     <CWMatchPosE>       <ConcMatchEnd>
     <CWMatchPosS>       <ConcMatchStart>
     <Candidates>        unchanged
     <Candidate>         unchanged
     <Command>           unchanged
     <DefnLen>           <AAExpLen>
     <InputMatch>        unchanged
     <IsHead>            unchanged
     <IsOverMatch>       unchanged
     <LexMatch>          unchanged
     <Location>          <UttSection>
     <MMOlist>           <MMOs>
     <MMO>               unchanged
     <MapNegScore>       <MappingScore>
     <Mappings>          unchanged
     <Mapping>           unchanged
     <MatchedWords>      unchanged
     <MatchedWord>       unchanged
     <MatchMaps>         unchanged
     <MatchMap>          unchanged
     <NCSpans>           <NegConcPIs>
     <NCSpan>            <NegConcPI>
     <Negations>         unchanged
     <Negation>          unchanged
     <NegExCUI>          <NegConcCUI>
     <NegExConcept>      <NegConcMatched>
     <NegScore>          <CandidateScore>
     <NegTrigger>        unchanged
     <NegType>           unchanged
     <NTSpans>           <NegTriggerPIs>
     <NTSpan>            <NegTriggerPI>
     <NumAATokens>       <AATokenNum>
     <NumDefnTokens>     <AAExpTokenNum>
     <Options>           unchanged
     <Option>            unchanged
     <OptName>           unchanged
     <OptValue>          unchanged
     <Phrases>           unchanged
     <Phrase>            unchanged
     <PMID>              unchanged
     <POS>               <LexCat>
     <PSpanLen>          <PhraseLength>
     <PStartPos>         <PhraseStartPos>
     <PText>             <PhraseText>
     <SeqNo>             <UttNum>
     <Sources>           unchanged
     <Source>            unchanged
     <Spans>             <ConceptPIs>
     <Span>              <ConceptPI>
     <SpanLen>           <Length>
     <STs>               <SemTypes>
     <ST>                <SemType>
     <StartPos>          unchanged
     <Tags>              <SyntaxUnits>
     <Tag>               <SyntaxUnit>
     <Tokens>            unchanged
     <Token>             unchanged
     <TWMatchPosE>       <TextMatchEnd>
     <TWMatchPosS>       <TextMatchStart>
     <Type>              <SyntaxType>
     <UMLSConcept>       <CandidateMatched>
     <UMLSCUI>           <CandidateCUI>
     <UMLSPreferred>     <CandidatePreferred>
     <USpanLen>          <UttLength>
     <UStartPos>         <UttStartPos>
     <UText>             <UttText>
     <Utterances>        unchanged
     <Utterance>         unchanged
     <Variation>         <LexVariation>
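
    Because only the tag names changed and not the structure, output written with the previous tags can in principle be converted by renaming elements. The Python sketch below illustrates this using the renamed pairs from the table above. The function name rename_tags and the sample fragment are illustrative assumptions and are not part of MetaMap.

```python
import xml.etree.ElementTree as ET

# Renamed tags only, copied from the table above. Tags listed as
# "unchanged" are deliberately absent and pass through untouched.
OLD_TO_NEW = {
    "AAExpansion": "AAExp",
    "AAName": "AAText",
    "Args": "CmdLine",
    "CUIs": "AACUIs",
    "CUI": "AACUI",
    "CUIConcepts": "NegConcepts",
    "CUIConcept": "NegConcept",
    "CWMatchPosE": "ConcMatchEnd",
    "CWMatchPosS": "ConcMatchStart",
    "DefnLen": "AAExpLen",
    "Location": "UttSection",
    "MMOlist": "MMOs",
    "MapNegScore": "MappingScore",
    "NCSpans": "NegConcPIs",
    "NCSpan": "NegConcPI",
    "NegExCUI": "NegConcCUI",
    "NegExConcept": "NegConcMatched",
    "NegScore": "CandidateScore",
    "NTSpans": "NegTriggerPIs",
    "NTSpan": "NegTriggerPI",
    "NumAATokens": "AATokenNum",
    "NumDefnTokens": "AAExpTokenNum",
    "POS": "LexCat",
    "PSpanLen": "PhraseLength",
    "PStartPos": "PhraseStartPos",
    "PText": "PhraseText",
    "SeqNo": "UttNum",
    "Spans": "ConceptPIs",
    "Span": "ConceptPI",
    "SpanLen": "Length",
    "STs": "SemTypes",
    "ST": "SemType",
    "Tags": "SyntaxUnits",
    "Tag": "SyntaxUnit",
    "TWMatchPosE": "TextMatchEnd",
    "TWMatchPosS": "TextMatchStart",
    "Type": "SyntaxType",
    "UMLSConcept": "CandidateMatched",
    "UMLSCUI": "CandidateCUI",
    "UMLSPreferred": "CandidatePreferred",
    "USpanLen": "UttLength",
    "UStartPos": "UttStartPos",
    "UText": "UttText",
    "Variation": "LexVariation",
}


def rename_tags(root):
    """Rename previous-style tags in place; return how many elements changed."""
    changed = 0
    for elem in root.iter():
        new_tag = OLD_TO_NEW.get(elem.tag)
        if new_tag is not None:
            elem.tag = new_tag
            changed += 1
    return changed


# Hypothetical previous-style fragment, built only from tag names in the table.
old_xml = (
    "<Phrase><PText>sleep apnea</PText>"
    "<PStartPos>0</PStartPos><PSpanLen>11</PSpanLen></Phrase>"
)
root = ET.fromstring(old_xml)
count = rename_tags(root)
new_xml = ET.tostring(root, encoding="unicode")
print(count)    # 3
print(new_xml)
```

    A flat lookup table is enough here because each previous tag has exactly one current tag, regardless of where it appears in the structure.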



    In the alphabetical and hierarchical tables below, the XML tags are characterized by structure (simple or complex) and number (unique or repeating):
  • A simple (S) tag is atomic, and consists of only a character string or a number, e.g.,
       <Length>, <LexCat>, <SemType>, <Source>, and <StartPos>.
  • A complex (C) tag contains one or more sub-components, e.g.,
       <Candidate>, <Mapping>, <Negation>, <Phrase>, and <Utterance>.
  • A unique (U) tag occurs only once in the immediately higher-level structure, e.g.,
       <InputMatch>, <MappingScore>, <NegType>, <PhraseText>, and <PMID>.
  • A repeating (R) tag may occur multiple times in the immediately higher-level structure, e.g.,
       <AA>, <MatchMap>, <Option>, <SyntaxUnit>, and <Token>.
       These repeating tags also exist in plural form, e.g.,
       <AAs>, <MatchMaps>, <Options>, <SyntaxUnits>, and <Tokens>.

    In the "Description" fields below, a repeating tag will be followed by "+" if it is mandatory and "*" if not.
    Note that the <Candidate> tag appears as <Candidate>+ within a <Mapping> structure, but as <Candidate>* within a <Phrase> structure. This is correct, because if a mapping is created, it must necessarily be composed of candidates; a phrase, however, need not have any associated candidates if MetaMap is unable to identify any Metathesaurus candidates in it.
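
    The "+" and "*" conventions can be checked mechanically. The Python sketch below assumes that each repeating tag is nested inside its plural form (e.g., <Candidate> inside <Candidates>); that nesting, the function name cardinality_problems, and the heavily abbreviated sample fragment are assumptions made for illustration, so adjust the paths to match actual output.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical fragment: most mandatory elements are omitted.
SAMPLE = """
<Phrases>
  <Phrase>
    <PhraseText>of</PhraseText>
    <Candidates></Candidates>
    <Mappings></Mappings>
  </Phrase>
  <Phrase>
    <PhraseText>sleep apnea</PhraseText>
    <Candidates>
      <Candidate><CandidateMatched>Sleep Apnea</CandidateMatched></Candidate>
    </Candidates>
    <Mappings>
      <Mapping>
        <Candidates>
          <Candidate><CandidateMatched>Sleep Apnea</CandidateMatched></Candidate>
        </Candidates>
      </Mapping>
    </Mappings>
  </Phrase>
</Phrases>
"""


def cardinality_problems(root):
    """Report every <Mapping> that violates <Candidate>+ (at least one)."""
    problems = []
    for phrase in root.iter("Phrase"):
        text = phrase.findtext("PhraseText", default="")
        for mapping in phrase.findall("Mappings/Mapping"):
            if not mapping.findall("Candidates/Candidate"):
                problems.append(f"mapping without candidates in phrase {text!r}")
    return problems


root = ET.fromstring(SAMPLE.strip())

# <Candidate>* within a phrase: zero candidates is legal.
counts = [len(p.findall("Candidates/Candidate")) for p in root.iter("Phrase")]
problems = cardinality_problems(root)
print(counts)    # [0, 1]
print(problems)  # []
```

    The first phrase has no candidates and is still valid, whereas a mapping with no candidates would be reported.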



    Alphabetical listing of current XML tags

    Tag  Type  Description
    <AAs>
    <AA>
    CR
    All the data generated for an Acronym/Abbreviation (AA), consisting of
  • <AAText>: the text of the AA,
  • <AAExp>: its expansion,
  • <AATokenNum>: the number of tokens in the AA,
  • <AALen>: the character length of the AA,
  • <AAExpTokenNum>: the number of tokens in its expansion,
  • <AAExpLen>: the character length of its expansion, and
  • <AACUI>*: any CUIs associated with the expansion of the AA
    The following AA examples will use the text
    polymerase chain reaction (PCR).
    <AACUIs>
    <AACUI>
    SR
    Any CUIs associated with the expansion of the AA.
    <AAExp>
    SU
    The expansion of the AA (polymerase chain reaction)
    <AAExpLen>
    SU
    The character length of the expansion of the AA (25, because polymerase chain reaction contains 25 characters)
    <AAExpTokenNum>
    SU
    The number of tokens in the AA expansion (5, because polymerase chain reaction contains 5 tokens, including two blank tokens)
    <AALen>
    SU
    The character length of the AA (3, because PCR contains 3 characters)
    <AAText>
    SU
    The AA itself (PCR)
    <AATokenNum>
    SU
    The number of tokens in the AA (1, because PCR contains 1 token)
    <Candidates>
    <Candidate>
    CR
    All the data generated for a candidate concept, including
  • <CandidateScore>: the candidate's negative score,
  • <CandidateCUI>: its CUI,
  • <CandidateMatched>: the candidate matched,
  • <CandidatePreferred>: its preferred name,
  • <MatchedWord>+: the word(s) in the Candidate matching the text,
  • <MatchMap>+: the matchmap(s),
  • <SemType>+: the semantic type(s),
  • <IsHead>: whether the candidate includes the head of the phrase (yes/no),
  • <IsOverMatch>: whether the candidate is an overmatch (yes/no),
  • <Source>+: the source(s), and
  • <ConceptPI>+: the positional information
    <CandidateCUI>
    SU
    The CUI of the candidate concept
    <CandidateMatched>
    SU
    The candidate concept matched
    <CandidatePreferred>
    SU
    The preferred name of the candidate concept
    <CandidateScore>
    SU
    The negative score of the candidate concept; the computation of this value is explained on pp. 5-9 of MetaMap Evaluation.
    <CmdLine>
    CU
    All the data about the command used to start MetaMap, consisting of
  • <Command>: the actual operating-system call used to start MetaMap, and
  • <Option>*: any options passed to MetaMap
    <Command>
    SU
    The actual operating-system call used to start MetaMap
    <ConceptPIs>
    <ConceptPI>
    CR
    The positional information of the concept, consisting of
  • <StartPos>: the 0-based character offset of the concept, counting from the beginning of the input text, and
  • <Length>: the character length of the string
    <ConcMatchEnd>
    SU
    The position within the concept words of the last matching word
    <ConcMatchStart>
    SU
    The position within the concept words of the first matching word
    <InputMatch>
    SU
    The input word(s) making up the syntax unit
    <IsHead>
    SU
    Yes/no value denoting if the candidate concept includes the head of the phrase containing it
    <IsOverMatch>
    SU
    Yes/no value denoting if the candidate concept is an overmatch, i.e., if it contains words on one or both ends that do not match the input text.
    <Length>
    SU
    The character length of the string
    <LexCat>
    SU
    The lexical category of the syntax unit; the LexCats are adj, adv, aux, compl, conj, det, modal, noun, prep, pron, and verb.
    <LexMatch>
    SU
    The lexical item(s) matched by the syntax unit
    <LexVariation>
    SU
    The degree of lexical variation between the words in the candidate concept and the words in the phrase; the computation of this value is explained on pp. 2-3 of MetaMap Evaluation.
    <Mappings>
    <Mapping>
    CR
    A set of candidate concepts making up the mapping for the phrase, consisting of
  • <MappingScore>: the negative score of the mapping, and
  • <Candidate>+: the candidate concept(s)
    <MappingScore>
    SU
    The negative score of the mapping; the computation of this value is explained on pp. 9-10 of MetaMap Evaluation.
    <MatchedWords>
    <MatchedWord>
    SR
    The word(s) in the Candidate matching the text
    <MatchMaps>
    <MatchMap>
    CR
    A data structure representing
  • the correspondence of words in the phrase (<TextMatchStart> and <TextMatchEnd>) and words in the candidate concept (<ConcMatchStart> and <ConcMatchEnd>), and
  • the lexical variation (<LexVariation>) between the words in the candidate concept and the words in the phrase.
    For example, given the input text obstructive sleep apnea and the candidate concept sleep apnea, the matching words sleep and apnea are
  • the 2nd and 3rd words of the text, and
  • the 1st and 2nd words of the concept.
    There is no lexical variation, so the matchmap would be [[[2,3],[1,2],0]]. For the candidate concept sleep apneas, the MatchMap would be the same, other than having a lexical variation of 1 instead of 0.
    <MMOs>
    <MMO>
    CR
    All the XML output generated for an entire input record or citation, consisting of
  • <CmdLine>: the command used to start MetaMap,
  • <AA>*: any acronyms/abbreviation(s) found in the text,
  • <Negation>*: any negation(s) found in the text, and
  • <Utterance>+: the utterance(s) found in the text
    <Negations>
    <Negation>
    CR
    All the data generated for a negation, including
  • <NegType>: the negation type,
  • <NegTrigger>: the negation trigger,
  • <NegTriggerPI>+: the negation trigger's positional information,
  • <NegConcept>+: the negated concept(s), and
  • <NegConcPI>+: the negated concept's StartPos/Length positional information
    For more information about MetaMap's implementation of NegEx, see the MetaMap09 Release Notes.
    <NegConcCUI>
    SU
    The CUI associated with the negated concept
    <NegConcepts>
    <NegConcept>
    CR
    The negated concept, consisting of
  • <NegConcCUI>: the negated concept's CUI, and
  • <NegConcMatched>: the negated concept's name
    <NegConcMatched>
    SU
    The name of the negated concept
    <NegConcPIs>
    <NegConcPI>
    CR
    The StartPos/Length positional information of the negated concept
    <NegTrigger>
    SU
    The negation trigger
    <NegTriggerPIs>
    <NegTriggerPI>
    CR
    The StartPos/Length positional information of the negation trigger
    <NegType>
    SU
    The negation type
    <Options>
    <Option>
    CR
    The option(s) passed to MetaMap, consisting of
  • <OptName>: the option's name, and
  • <OptValue>: the option's value.
    <OptName>
    SU
    The name of the command-line option
    <OptValue>
    SU
    The value of the command-line option (can be null)
    <Phrases>
    <Phrase>
    CR
    The syntactic subcomponent of the utterance, consisting of
  • <PhraseText>: the text of the phrase,
  • <SyntaxUnit>+: the syntax unit(s),
  • <PhraseStartPos>: the 0-based character offset of the phrase, counting from the beginning of the input text,
  • <PhraseLength>: the character length of the phrase,
  • <Candidate>*: any candidate concepts identified in the phrase, and
  • <Mapping>*: any mappings created
    <PhraseLength>
    SU
    The character length of the phrase
    <PhraseStartPos>
    SU
    The 0-based character offset of the phrase, counting from the beginning of the input text
    <PhraseText>
    SU
    The text of the phrase
    <PMID>
    SU
    The PubMed ID of the citation containing the utterance
    <SemTypes>
    <SemType>
    SR
    The semantic type(s) of the candidate
    <Sources>
    <Source>
    SR
    The UMLS vocabulary/ies in which the concept was found
    <StartPos>
    SU
    The 0-based character offset of the string, counting from the beginning of the input text
    <SyntaxType>
    SU
    The syntactic type of the syntax unit; the SyntaxTypes are adv, aux, compl, conj, det, head, mod, modal, pastpart, prep, pron, punc, shapes, and verb.
    <SyntaxUnits>
    <SyntaxUnit>
    CR
    The syntactic subcomponent of the phrase, consisting of
  • <SyntaxType>: the syntactic type of the syntax unit (e.g., head, mod, or verb),
  • <LexMatch>: the lexical item(s),
  • <InputMatch>: the input word(s),
  • <LexCat>: the lexical category, and
  • <Token>+: the token(s) making up the lexical items
    <TextMatchEnd>
    SU
    The position within the phrase words of the last matching word
    <TextMatchStart>
    SU
    The position within the phrase words of the first matching word
    <Tokens>
    <Token>
    SR
    The tokens making up the lexical items
    <Utterances>
    <Utterance>
    CR
    All the data generated for an utterance, including
  • <PMID>: the utterance's PubMed ID,
  • <UttSection>: the section type (e.g., title or abstract),
  • <UttNum>: the 1-based utterance number within the section,
  • <UttText>: the text of the utterance,
  • <UttStartPos>: the 0-based character offset of the utterance, counting from the beginning of the input text,
  • <UttLength>: the character length of the utterance, and
  • <Phrase>+: the phrase(s) making up the utterance
    <UttLength>
    SU
    The character length of the utterance
    <UttNum>
    SU
    The 1-based numerical position of the utterance within the section
    <UttSection>
    SU
    The section type (e.g., title or abstract) of the utterance
    <UttStartPos>
    SU
    The 0-based character offset of the utterance, counting from the beginning of the input text
    <UttText>
    SU
    The text of the utterance
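
    The numbers quoted in the AA and MatchMap entries above can be reproduced with a few lines of Python. This is only a sketch: the tokenizer is a simplification that treats each run of non-blank characters and each single blank as one token (MetaMap's own tokenization is more elaborate), match_map handles only exact, contiguous word matches, and the lexical variation is passed in rather than computed. The function names aa_stats and match_map are hypothetical.

```python
import re


def tokenize(text):
    """Simplified tokenizer: each word and each single blank is one token."""
    return re.findall(r"\S+|\s", text)


def aa_stats(aa_text, aa_exp):
    """Return the simple AA values, keyed by the tag that carries them."""
    return {
        "AAText": aa_text,
        "AAExp": aa_exp,
        "AATokenNum": len(tokenize(aa_text)),
        "AALen": len(aa_text),
        "AAExpTokenNum": len(tokenize(aa_exp)),
        "AAExpLen": len(aa_exp),
    }


def match_map(text_words, concept_words, lex_variation=0):
    """Build one matchmap using 1-based word positions.

    Layout: [[TextMatchStart, TextMatchEnd],
             [ConcMatchStart, ConcMatchEnd],
             LexVariation]
    Returns None if the concept words do not occur contiguously in the text.
    """
    n = len(concept_words)
    for i in range(len(text_words) - n + 1):
        if text_words[i:i + n] == concept_words:
            return [[i + 1, i + n], [1, n], lex_variation]
    return None


stats = aa_stats("PCR", "polymerase chain reaction")
print(stats["AATokenNum"], stats["AALen"])        # 1 3
print(stats["AAExpTokenNum"], stats["AAExpLen"])  # 5 25

match_maps = [match_map("obstructive sleep apnea".split(), "sleep apnea".split())]
print(match_maps)  # [[[2, 3], [1, 2], 0]]
```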



    Hierarchical listing of current XML tags

    Tag  Type  Description
    <MMOs>
    <MMO>
    CR
    All the XML output generated for an entire input record or citation, consisting of
  • <CmdLine>: the command used to start MetaMap,
  • <AA>*: any acronyms/abbreviation(s) found in the text,
  • <Negation>*: any negation(s) found in the text, and
  • <Utterance>+: the utterance(s) found in the text
    <CmdLine>
    CU
    All the data about the command used to start MetaMap, consisting of
  • <Command>: the actual operating-system call used to start MetaMap, and
  • <Option>*: any options passed to MetaMap
    <Command>
    SU
    The actual operating-system call used to start MetaMap
    <Options>
    <Option>
    CR
    The option(s) passed to MetaMap, consisting of
  • <OptName>: the option's name, and
  • <OptValue>: the option's value.
    <OptName>
    SU
    The name of the command-line option
    <OptValue>
    SU
    The value of the command-line option (can be null)
    <AAs>
    <AA>
    CR
    All the data generated for an Acronym/Abbreviation (AA), consisting of
  • <AAText>: the text of the AA,
  • <AAExp>: its expansion,
  • <AATokenNum>: the number of tokens in the AA,
  • <AALen>: the character length of the AA,
  • <AAExpTokenNum>: the number of tokens in its expansion,
  • <AAExpLen>: the character length of its expansion, and
  • <AACUI>*: any CUIs associated with the expansion of the AA
    The following AA examples will use the text
    polymerase chain reaction (PCR).
    <AAText>
    SU
    The AA itself (PCR)
    <AAExp>
    SU
    The expansion of the AA (polymerase chain reaction)
    <AATokenNum>
    SU
    The number of tokens in the AA (1, because PCR contains 1 token)
    <AALen>
    SU
    The character length of the AA (3, because PCR contains 3 characters)
    <AAExpTokenNum>
    SU
    The number of tokens in the AA expansion (5, because polymerase chain reaction contains 5 tokens, including two blank tokens)
    <AAExpLen>
    SU
    The character length of the expansion of the AA (25, because polymerase chain reaction contains 25 characters)
    <AACUIs>
    <AACUI>
    SR
    Any CUIs associated with the expansion of the AA.
    <Negations>
    <Negation>
    CR
    All the data generated for a negation, including
  • <NegType>: the negation type,
  • <NegTrigger>: the negation trigger,
  • <NegTriggerPI>+: the negation trigger's positional information,
  • <NegConcept>+: the negated concept(s), and
  • <NegConcPI>+: the negated concept's StartPos/Length positional information
    For more information about MetaMap's implementation of NegEx, see the MetaMap09 Release Notes.
    <NegType>
    SU
    The negation type
    <NegTrigger>
    SU
    The negation trigger
    <NegTriggerPIs>
    <NegTriggerPI>
    CR
    The StartPos/Length positional information of the negation trigger
    <NegConcepts>
    <NegConcept>
    CR
    The negated concept, consisting of
  • <NegConcCUI>: the negated concept's CUI, and
  • <NegConcMatched>: the negated concept's name
    <NegConcCUI>
    SU
    The CUI associated with the negated concept
    <NegConcMatched>
    SU
    The name of the negated concept
    <NegConcPIs>
    <NegConcPI>
    CR
    The StartPos/Length positional information of the negated concept
    <Utterances>
    <Utterance>
    CR
    All the data generated for an utterance, including
  • <PMID>: the utterance's PubMed ID,
  • <UttSection>: the section type (e.g., title or abstract),
  • <UttNum>: the 1-based utterance number within the section,
  • <UttText>: the text of the utterance,
  • <UttStartPos>: the 0-based character offset of the utterance, counting from the beginning of the input text,
  • <UttLength>: the character length of the utterance, and
  • <Phrase>+: the phrase(s) making up the utterance
    <PMID>
    SU
    The PubMed ID of the citation containing the utterance
    <UttSection>
    SU
    The section type (e.g., title or abstract) of the utterance
    <UttNum>
    SU
    The 1-based numerical position of the utterance within the section
    <UttText>
    SU
    The text of the utterance
    <UttStartPos>
    SU
    The 0-based character offset of the utterance, counting from the beginning of the input text
    <UttLength>
    SU
    The character length of the utterance
    <Phrases>
    <Phrase>
    CR
    The syntactic subcomponent of the utterance, consisting of
  • <PhraseText>: the text of the phrase,
  • <SyntaxUnit>+: the syntax unit(s),
  • <PhraseStartPos>: the 0-based character offset of the phrase, counting from the beginning of the input text,
  • <PhraseLength>: the character length of the phrase,
  • <Candidate>*: any candidate concepts identified in the phrase, and
  • <Mapping>*: any mappings created
    <PhraseText>
    SU
    The text of the phrase
    <SyntaxUnits>
    <SyntaxUnit>
    CR
    The syntactic subcomponent of the phrase, consisting of
  • <SyntaxType>: the syntactic type of the syntax unit (e.g., head, mod, or verb),
  • <LexMatch>: the lexical item(s),
  • <InputMatch>: the input word(s),
  • <LexCat>: the lexical category, and
  • <Token>+: the token(s) making up the lexical items
    <SyntaxType>
    SU
    The syntactic type of the syntax unit; the SyntaxTypes are adv, aux, compl, conj, det, head, mod, modal, pastpart, prep, pron, punc, shapes, and verb.
    <LexMatch>
    SU
    The lexical item(s) matched by the syntax unit
    <InputMatch>
    SU
    The input word(s) making up the syntax unit
    <LexCat>
    SU
    The lexical category of the syntax unit; the LexCats are adj, adv, aux, compl, conj, det, modal, noun, prep, pron, and verb.
    <Tokens>
    <Token>
    SR
    The tokens making up the lexical items
    <PhraseStartPos>
    SU
    The 0-based character offset of the phrase, counting from the beginning of the input text
    <PhraseLength>
    SU
    The character length of the phrase
    <Candidates>
    <Candidate>
    CR
    All the data generated for a candidate concept, including
  • <CandidateScore>: the candidate's negative score,
  • <CandidateCUI>: its CUI,
  • <CandidateMatched>: the candidate matched,
  • <CandidatePreferred>: its preferred name,
  • <MatchedWord>+: the word(s) in the Candidate matching the text,
  • <MatchMap>+: the matchmap(s),
  • <SemType>+: the semantic type(s),
  • <IsHead>: whether the candidate includes the head of the phrase (yes/no),
  • <IsOverMatch>: whether the candidate is an overmatch (yes/no),
  • <Source>+: the source(s), and
  • <ConceptPI>+: the positional information
    <CandidateScore>
    SU
    The negative score of the candidate concept; the computation of this value is explained on pp. 5-9 of MetaMap Evaluation.
    <CandidateCUI>
    SU
    The CUI of the candidate concept
    <CandidateMatched>
    SU
    The candidate concept matched
    <CandidatePreferred>
    SU
    The preferred name of the candidate concept
    <MatchedWords>
    <MatchedWord>
    SR
    The word(s) in the Candidate matching the text
    <SemTypes>
    <SemType>
    SR
    The semantic type(s) of the candidate
    <MatchMaps>
    <MatchMap>
    CR
    A data structure representing
  • the correspondence of words in the phrase (<TextMatchStart> and <TextMatchEnd>) and words in the candidate concept (<ConcMatchStart> and <ConcMatchEnd>), and
  • the lexical variation (<LexVariation>) between the words in the candidate concept and the words in the phrase.
    For example, given the input text obstructive sleep apnea and the candidate concept sleep apnea, the matching words sleep and apnea are
  • the 2nd and 3rd words of the text, and
  • the 1st and 2nd words of the concept.
    There is no lexical variation, so the matchmap would be [[[2,3],[1,2],0]]. For the candidate concept sleep apneas, the MatchMap would be the same, other than having a lexical variation of 1 instead of 0.
    <TextMatchStart>
    SU
    The position within the phrase words of the first matching word
    <TextMatchEnd>
    SU
    The position within the phrase words of the last matching word
    <ConcMatchStart>
    SU
    The position within the concept words of the first matching word
    <ConcMatchEnd>
    SU
    The position within the concept words of the last matching word
    <LexVariation>
    SU
    The degree of lexical variation between the words in the candidate concept and the words in the phrase; the computation of this value is explained on pp. 2-3 of MetaMap Evaluation.
    <IsHead>
    SU
    Yes/no value denoting if the candidate concept includes the head of the phrase containing it
    <IsOverMatch>
    SU
    Yes/no value denoting if the candidate concept is an overmatch, i.e., if it contains words on one or both ends that do not match the input text.
    <Sources>
    <Source>
    SR
    The UMLS vocabulary/ies in which the concept was found
    <ConceptPIs>
    <ConceptPI>
    CR
    The positional information of the concept, consisting of
  • <StartPos>: the 0-based character offset of the concept, counting from the beginning of the input text, and
  • <Length>: the character length of the string
    <StartPos>
    SU
    The 0-based character offset of the string, counting from the beginning of the input text
    <Length>
    SU
    The character length of the string
    <Mappings>
    <Mapping>
    CR
    A set of candidate concepts making up the mapping for the phrase, consisting of
  • <MappingScore>: the negative score of the mapping, and
  • <Candidate>+: the candidate concept(s)
    <MappingScore>
    SU
    The negative score of the mapping; the computation of this value is explained on pp. 9-10 of MetaMap Evaluation.
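
    The hierarchy above can be walked top-down, and the positional information can be used to recover a candidate's text from the original input. The Python sketch below is a hypothetical illustration: it assumes that repeating tags are nested inside their plural forms, the sample document is heavily abbreviated (most mandatory elements are omitted), and its offsets were obtained simply by counting characters in the sample text. The function names span_text and candidate_spans are not part of MetaMap.

```python
import xml.etree.ElementTree as ET

INPUT_TEXT = "obstructive sleep apnea"

# Abbreviated, hypothetical document following the hierarchical listing.
SAMPLE = """
<MMO>
  <Utterances>
    <Utterance>
      <UttText>obstructive sleep apnea</UttText>
      <UttStartPos>0</UttStartPos>
      <UttLength>23</UttLength>
      <Phrases>
        <Phrase>
          <PhraseText>obstructive sleep apnea</PhraseText>
          <PhraseStartPos>0</PhraseStartPos>
          <PhraseLength>23</PhraseLength>
          <Candidates>
            <Candidate>
              <CandidateMatched>sleep apnea</CandidateMatched>
              <ConceptPIs>
                <ConceptPI><StartPos>12</StartPos><Length>11</Length></ConceptPI>
              </ConceptPIs>
            </Candidate>
          </Candidates>
        </Phrase>
      </Phrases>
    </Utterance>
  </Utterances>
</MMO>
"""


def span_text(text, pi_elem):
    """Slice the input text using a 0-based <StartPos> and a <Length>."""
    start = int(pi_elem.findtext("StartPos"))
    length = int(pi_elem.findtext("Length"))
    return text[start:start + length]


def candidate_spans(mmo, text):
    """Walk Utterance -> Phrase -> Candidate -> ConceptPI and collect spans."""
    spans = []
    for utterance in mmo.iter("Utterance"):
        for phrase in utterance.iter("Phrase"):
            # Only the phrase's own candidates, not those inside mappings.
            for cand in phrase.findall("Candidates/Candidate"):
                for pi in cand.iter("ConceptPI"):
                    spans.append(
                        (cand.findtext("CandidateMatched"), span_text(text, pi))
                    )
    return spans


mmo = ET.fromstring(SAMPLE.strip())
spans = candidate_spans(mmo, INPUT_TEXT)
print(spans)  # [('sleep apnea', 'sleep apnea')]
```

    Because every offset counts from the beginning of the input text rather than from the enclosing utterance or phrase, the same slicing helper works for utterances, phrases, concepts, and negation triggers alike.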