Benjamin Hachey
IE Relation Data

IE Relation Extraction Data Sets

The following table lists information extraction (IE) corpora and indicates which IE subtasks each one addresses. Relation annotation is highlighted with a blue background. The list focuses on data sets that contain relation information. The key for the column labels follows the table.

Name                                       Domain    Language  NER  CRF  RLD  RLC  TME  EVT
ACE                                        news      english   Y    Y    Y    Y    Y    Y
BioCreative I (JHakenberg Version w/ PPI)  biomed    english   Y    Y    Y    N    N    N
BioCreative II                             biomed    english   Y    Y    Y    N    N    N
UTurku BioInfer Corpus                     biomed    english   Y    N    Y    Y    N    N
EDGAR                                      biomed    english   Y    N    Y    N    N    N
FetchProt Corpus                           biomed    english   ?    N    Y    N    N    N
IowaState IEPA Corpus                      biomed    english   ?    ?    Y    N    N    N
LLL05 Data                                 biomed    english   ?    N    Y    N    N    N
MUC/HUB (MUC-6, MUC-7)                     news      english   Y    Y    Y    N    Y    Y
PICorpus (Orig PDG Version)                biomed    english   Y    ?    Y    ?    N    N
RCAHMS IE Data (Forthcoming)               cultural  english   Y    ?    Y    ?    N    N
UCBerkeley BioText Data                    biomed    english   Y    ?    Y    Y    N    N
UCLA DIP Data (JHakenberg Version)         biomed    english   ?    ?    Y    ?    N    N
UPenn BioIE                                biomed    english   Y    Y    ?    ?    N    N
UTexas BioInf ID-SERVE                     biomed    english   Y    ?    Y    ?    N    N
UTexas ML Res Group Data                   biomed    english   Y    ?    Y    ?    N    N
UWisconsin Mark Craven Data                biomed    english   ?    ?    Y    N    N    N

Table Key

The following abbreviations are used for attributes describing IE subtasks:

NER: Data contains named entity annotation.
CRF: Data contains abbreviation/alias/coreference/normalisation annotation.
RLD: Data contains relation detection annotation.
RLC: Data contains relation characterisation annotation.
TME: Data contains date/time annotation.
EVT: Data contains event annotation.
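
For anyone who wants to query the table programmatically, here is a minimal Python sketch of one way to do it. It is only an illustration: the four rows are copied from the table above, the names COLUMNS, ROWS and corpora_with are hypothetical, and reading "?" as "status unknown" is an assumption, since the key does not define it.

```python
# Illustrative sketch only; names and layout are assumptions, not part of any corpus release.
# Column order matches the table: one flag character per subtask.
COLUMNS = ("NER", "CRF", "RLD", "RLC", "TME", "EVT")

# A few rows copied from the table above (Y = yes, N = no, ? = assumed to mean unknown).
ROWS = {
    "ACE": "YYYYYY",
    "UTurku BioInfer Corpus": "YNYYNN",
    "EDGAR": "YNYNNN",
    "UPenn BioIE": "YY??NN",
}


def corpora_with(subtask, value="Y"):
    """Return the sorted names of corpora whose flag for `subtask` equals `value`."""
    idx = COLUMNS.index(subtask)  # raises ValueError for an unknown column label
    return sorted(name for name, flags in ROWS.items() if flags[idx] == value)


if __name__ == "__main__":
    # Corpora marked Y for relation characterisation.
    print(corpora_with("RLC"))
    # Corpora whose relation detection status is marked "?".
    print(corpora_with("RLD", "?"))
```

Keeping "?" as its own value, rather than folding it into N, preserves the distinction the table makes between "not annotated" and "not determined".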

Contribute To This List!

Please contact me if you have something to add. I am especially interested in other domains and other languages.

Other IE Data

Some relevant IE data that I could not find relation annotation for:

Other Lists of IE Data

Kevin Cohen's group at the Center for Computational Pharmacology at UColorado maintain a list of corpora for biomedical NLP. Also check out their survey of corpus design and usage.

Other lists of corpora for IE/BioNLP can be found at:

Benjamin Hachey
Last modified: Thu Apr 5 09:58:04 BST 2007