Cross-Domain Data Integration for Entity Disambiguation in Biomedical Text

In collaboration with Stanford University

Authors: Maya Varma, Laurel Orr, Sen Wu, Megan Leszczynski, Xiao Ling, and Christopher Ré

Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the presence of coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We utilize our integration scheme to augment structural resources and generate a large biomedical NED dataset for pretraining. Our pretrained model with injected structural knowledge achieves state-of-the-art performance on two benchmark medical NED datasets: MedMentions and BC5CDR. Furthermore, we improve disambiguation of rare entities by up to 57 accuracy points.
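To make the idea of transferring structural knowledge across domains concrete, here is a minimal sketch in Python. It is not the authors' pretrained model. It is a toy type-overlap scorer, and every identifier in it (the `MED:` and `GEN:` IDs, the `links` mapping, the `integrate` and `disambiguate` helpers, and the entity types) is a hypothetical stand-in invented for illustration. It assumes that some medical entities can be linked to entries in a general knowledge base, and that those entries carry finer-grained types than the medical knowledge base does.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str
    name: str
    types: set = field(default_factory=set)


# Toy medical KB with coarse-grained types (hypothetical IDs and types).
medical_kb = {
    "MED:1": Entity("MED:1", "cold (common cold)", {"disease"}),
    "MED:2": Entity("MED:2", "cold (temperature sensation)", {"finding"}),
}

# Toy general-domain KB with finer-grained types (hypothetical).
general_kb = {
    "GEN:10": Entity("GEN:10", "common cold", {"infectious disease", "viral infection"}),
    "GEN:20": Entity("GEN:20", "cold", {"thermal sensation"}),
}

# Assumed cross-KB links from medical entities to general-domain entities.
links = {"MED:1": "GEN:10", "MED:2": "GEN:20"}


def integrate(medical_kb, general_kb, links):
    """Return a copy of the medical KB whose entities also carry the
    types of their linked general-domain entity, when a link exists."""
    merged = {}
    for mid, ent in medical_kb.items():
        types = set(ent.types)
        gid = links.get(mid)
        if gid in general_kb:
            types |= general_kb[gid].types
        merged[mid] = Entity(mid, ent.name, types)
    return merged


def disambiguate(context_tokens, candidate_ids, kb):
    """Pick the candidate whose type words overlap most with the context.
    Ties resolve to the first candidate in the list."""
    context = set(context_tokens)

    def score(cid):
        type_words = {w for t in kb[cid].types for w in t.split()}
        return len(type_words & context)

    return max(candidate_ids, key=score)


augmented = integrate(medical_kb, general_kb, links)
context = "patient presents with a viral infection and a cold".split()
candidates = ["MED:2", "MED:1"]

baseline = disambiguate(context, candidates, medical_kb)
prediction = disambiguate(context, candidates, augmented)
```

In this toy setup the coarse medical types give both candidates a score of zero, so the baseline falls back to the first candidate, while the types inherited from the general knowledge base let the scorer separate the two senses of "cold". A real system would replace the overlap count with a learned model.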

Related readings and updates.

Entity Disambiguation via Fusion Entity Decoding

June 4, 2024 | Research areas: Knowledge Bases and Search, Speech and Natural Language Processing | Conference: NAACL

Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly,…

Bootleg: Self-Supervision for Named Entity Disambiguation

June 25, 2021 | Research areas: Knowledge Bases and Search, Speech and Natural Language Processing | Conference: CIDR

A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly…
