
A shallow parser based on closed-class words to capture relations in biomedical text

Leroy, Gondy and Chen, Hsinchun and Martinez, Jesse D. (2003) A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical Informatics 36:pp. 145-158.


Abstract

Natural language processing for biomedical text currently focuses mostly on entity and relation extraction. These entities and relations are usually pre-specified entities, e.g., proteins, and pre-specified relations, e.g., inhibit relations. A shallow parser that captures the relations between noun phrases automatically from free text has been developed and evaluated. It uses heuristics and a noun phraser to capture entities of interest in the text. Cascaded finite state automata structure the relations between individual entities. The automata are based on closed-class English words and model generic relations not limited to specific words. The parser also recognizes coordinating conjunctions and captures negation in text, a feature usually ignored by others. Three cancer researchers evaluated 330 relations extracted from 26 abstracts of interest to them. There were 296 relations correctly extracted from the abstracts resulting in 90% precision of the relations and an average of 11 correct relations per abstract.
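The idea of structuring relations around closed-class words can be illustrated with a toy example. The sketch below is not the authors' parser: it is a single hand-written finite state pass rather than a cascade, and it assumes the input has already been split into noun phrases and closed-class words by some upstream noun phraser. The word lists `CONNECTORS` and `NEGATIONS`, the function `extract_relations`, and the sample phrase are all hypothetical and chosen only for illustration.

```python
# Toy sketch only: a single finite state pass keyed on closed-class words.
# The word lists below are hypothetical, not the lists used in the paper.
CONNECTORS = {"of", "by", "in", "with"}   # closed-class connectors (prepositions)
NEGATIONS = {"no", "not"}                 # closed-class negation markers


def extract_relations(chunks):
    """Return (left, connector, right, negated) tuples.

    `chunks` is a list of strings. Anything that is not a connector or a
    negation word is assumed to be a noun phrase from an upstream noun phraser.
    """
    relations = []
    state = "START"        # START -> NP -> CONNECTOR -> NP -> ...
    left = None
    connector = None
    negated = False        # crude: once set, it applies to the rest of the input

    for chunk in chunks:
        word = chunk.lower()
        if word in NEGATIONS:
            negated = True
        elif word in CONNECTORS:
            if state == "NP":
                connector = word
                state = "CONNECTOR"
            else:
                # A connector with no noun phrase on its left: reset.
                state = "START"
                negated = False
        else:
            # Treated as a noun phrase.
            if state == "CONNECTOR":
                relations.append((left, connector, chunk, negated))
            left = chunk
            state = "NP"
    return relations


print(extract_relations(["no", "inhibition", "of", "p53", "by", "Mdm2"]))
# [('inhibition', 'of', 'p53', True), ('p53', 'by', 'Mdm2', True)]
```

Because the transitions depend only on closed-class words, the same pass applies to any noun phrases, which is the sense in which such relations are generic rather than tied to specific verbs or entity types. Coordinating conjunctions and scoped negation, which the abstract says the parser handles, are left out of this sketch.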

EPrint Type: Journal Article (Paginated)
Keywords: National Science Digital Library; NSDL; Artificial Intelligence Lab; AI Lab; Natural language processing; Shallow parsing; Finite state automata; Biomedicine; Free text; Bottom-up parser; NLP
Subjects: Artificial Intelligence; Natural Language Processing
ID Code: 430
Deposited On: 20 August 2004
Alternative Locations: http://ai.bpa.arizona.edu/go/papers.html

