Query expansion for UMLS Metathesaurus disambiguation based on automatic corpus extraction.

Jimeno-Yepes A, Aronson AR

ICMLA 2010, Dec 2010.


Word sense disambiguation (WSD) is an intermediate task within information retrieval and information extraction that attempts to select the proper sense of an ambiguous term. In the biomedical domain, general WSD has received little attention compared with the disambiguation of specific categories of entities such as proteins and genes, or diseases. Statistical learning approaches have achieved better performance than other methods. However, manually annotated data is limited, and covering all the ambiguous cases of a large resource like the UMLS is infeasible. Knowledge-based approaches using the UMLS and MEDLINE citations have achieved good performance, but below that of statistical learning approaches. Our best knowledge-based result has been obtained by training a Naïve Bayes algorithm on an automatically extracted MEDLINE corpus. In this work, we extend previous methods to enhance the quality of an automatically extracted corpus using related terms obtained from MEDLINE, without manually annotated training data. We focus on extracting collocations that might be used in combination with one of the senses of an ambiguous term. We find that left-side collocations yield the largest improvement in accuracy, 4%. In addition, combining different types of collocations and post-filtering the retrieved citations achieves an improvement of almost 9% in accuracy.
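
To make the idea of left-side collocations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a collocation is simply the word immediately preceding the ambiguous term, counts such words over a handful of texts, and builds an expanded phrase query from the most frequent ones. The function names `left_collocations` and `expand_query`, the tokenization rule, and the `OR` query syntax are all hypothetical choices made for illustration.

```python
from collections import Counter
import re


def left_collocations(texts, term, top_n=3):
    """Return the top_n words found immediately to the left of `term`.

    Illustrative assumption: a "left-side collocation" is the single
    token directly preceding the ambiguous term.
    """
    term = term.lower()
    counts = Counter()
    for text in texts:
        # Simple lowercase alphanumeric tokenization (an assumption).
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for prev, cur in zip(tokens, tokens[1:]):
            if cur == term:
                counts[prev] += 1
    return [word for word, _ in counts.most_common(top_n)]


def expand_query(term, collocations):
    """Join each collocation + term phrase with OR (hypothetical syntax)."""
    return " OR ".join(f'"{c} {term}"' for c in collocations)


texts = [
    "Common cold symptoms in children",
    "Treatment of the common cold virus",
    "Exposure to extreme cold",
]
collocs = left_collocations(texts, "cold", top_n=2)
query = expand_query("cold", collocs)
```

In a real setting the collocations would be associated with a specific sense of the term before being used to retrieve citations for that sense; the sketch omits that step.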
