Cross Language Information Retrieval and Automatic Indexing of MEDLINE Documents

Date: November 21, 2003 Time: (All day)
Event Type: Lecture

Part 1: Cross-language information retrieval

In this presentation, we show how cross-language information retrieval (CLIR) can be performed in the medical domain by taking advantage of existing multilingual resources. We compare two types of CLIR strategies: the first approach applies a general-purpose automatic keyword assignment system that associates a ranked list of MeSH terms with user queries, using the UMLS terminological resources as an interlingua; the second approach uses a commercial translation system to provide a grammatical translation of the original query. Results show the effectiveness of a terminology-driven approach for CLIR in the medical domain compared to other CLIR approaches: a CLIR ratio of 80% vs. 60%. Finally, combining both strategies yields a CLIR engine with a CLIR ratio above 90%, competitive with the best-performing approaches based on large bilingual corpora.

Part 2: Automatic indexing of MEDLINE documents

PROBLEM: Automatic keyword assignment has been widely studied in medical informatics in the context of the MEDLINE library, both for querying MEDLINE and for providing an indicative gist of the content of articles via MeSH terms. Abstracts are also used for this purpose. However, at often more than 300 words, MEDLINE abstracts can still be regarded as long documents; we therefore explore the ability to automatically select a unique key sentence to be displayed. This key sentence must be indicative of the article's content: we assume that conclusions are good candidates. PURPOSE: Our purpose is to design and assess the performance of an automatic key-sentence selector. Sentences are ranked and classified into four argumentative moves: PURPOSE, METHODS, RESULTS and CONCLUSION. METHODS: We rely fully on a well-known data-driven approach using Bayesian classifiers trained on automatically acquired training data.
Feature representation, selection and weighting are reported, and classification effectiveness is evaluated on the four classes using confusion matrices. We also explore the use of simple heuristics to take the position of sentences into account. Recall, precision and F-scores are computed for the CONCLUSION class. Two benchmarks are used for the evaluation: 1) a structured one (set B), where explicit markers have been removed; 2) a manually annotated one (set C). RESULTS and CONCLUSION: The use of position is largely ineffective for set B, while it is of interest for set C, especially for lexically confusable classes such as PURPOSE and CONCLUSION. For the latter class, recall is close to 100% while precision is above 70%, which yields an F-score close to 84%. Automatic argumentative classification is feasible on MEDLINE abstracts and should help user navigation and retrieval in such repositories.
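The interlingua strategy of Part 1 can be illustrated with a minimal sketch: a source-language term is first mapped to a language-independent concept identifier, which is then mapped to a target-language MeSH term. The dictionaries and concept IDs below are invented stand-ins for real UMLS/MeSH lookups, used only to show the two-step mapping.

```python
# Toy sketch of terminology-driven CLIR via an interlingua.
# The mappings below are hypothetical examples, not actual UMLS/MeSH content.

# Hypothetical French term -> concept ID mapping (stand-in for a UMLS lookup)
FR_TO_CONCEPT = {
    "coeur": "C0001",
    "infarctus": "C0002",
}

# Hypothetical concept ID -> English MeSH term mapping
CONCEPT_TO_MESH = {
    "C0001": "Heart",
    "C0002": "Myocardial Infarction",
}

def translate_query(query: str) -> list[str]:
    """Map each recognized source-language term to a MeSH term via concept IDs."""
    mesh_terms = []
    for token in query.lower().split():
        concept = FR_TO_CONCEPT.get(token)
        if concept and concept in CONCEPT_TO_MESH:
            mesh_terms.append(CONCEPT_TO_MESH[concept])
    return mesh_terms

print(translate_query("infarctus du coeur"))
# -> ['Myocardial Infarction', 'Heart']
```

Tokens with no concept mapping (here, "du") are simply dropped, which is why a ranked list of MeSH terms rather than a grammatical translation comes out of this route.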
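The data-driven classification of Part 2 can be sketched as a small multinomial Naive Bayes classifier over the four argumentative moves. The training sentences below are invented toy examples, not the automatically acquired training data used in the talk, and the feature set is reduced to plain word counts with Laplace smoothing.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naive Bayes sketch for argumentative classification
# (PURPOSE / METHODS / RESULTS / CONCLUSION). Toy training data, for illustration.
TRAIN = [
    ("the aim of this study was to assess", "PURPOSE"),
    ("we sought to determine the effect", "PURPOSE"),
    ("patients were randomly assigned to groups", "METHODS"),
    ("samples were analysed using chromatography", "METHODS"),
    ("the treatment group showed significant improvement", "RESULTS"),
    ("mean scores increased in both cohorts", "RESULTS"),
    ("these findings suggest that the approach is effective", "CONCLUSION"),
    ("in conclusion the method should help retrieval", "CONCLUSION"),
]

def train(data):
    """Collect class priors, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in data:
        class_counts[label] += 1
        for w in sentence.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(sentence, class_counts, word_counts, vocab):
    """Return the argumentative move with the highest posterior log-probability."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in sentence.split():
            # Laplace smoothing over the vocabulary
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train(TRAIN)
print(classify("these findings suggest the approach should help", *model))
# -> CONCLUSION
```

A position heuristic of the kind mentioned above could be layered on top, e.g. by boosting the CONCLUSION posterior for sentences near the end of the abstract.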
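The reported figures for the CONCLUSION class are consistent with the standard F-score formula (harmonic mean of precision and recall); the values below are rounded examples matching the abstract, not the actual evaluation counts.

```python
# F-score as the harmonic mean of precision and recall (F1).
def f_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# With recall close to 100% and precision just above 70%:
print(round(f_score(0.72, 1.0), 2))  # -> 0.84
```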