1 Introduction
2 Fundamentals of corpus annotation and event extraction
Figure 1. Example of a closed-domain EE using a predefined event schema.
2.1 Existing corpora for event extraction tasks
Table 1. Annotated corpora for the event extraction task.
ID | Corpus Short Name | Corpus Full Name | Domain Area | Language | Corpus Size (# docs) | Annotation Method | Public Access | Charges | Format | Benchmark Corpus |
---|---|---|---|---|---|---|---|---|---|---|
C01 | MUSIED | Multi-Source Informal Event Detection | General | Chinese | 11,381 | Manual | √ | Free of charge | JSON | × |
C02 | MAVEN | MAssive eVENt detection dataset | General | English | 4,480 | Manual | √ | Free of charge | JSON | √ |
C03 | ACE 2005 | ACE 2005 Multilingual Training Corpus1 | General | English, Chinese | 599 (En), 633 (Ch) | Manual | × | Licensed (Paid) | XML | √ |
C04 | CFEE | Chinese Financial Event Extraction | Finance | Chinese | 2,976 | Automatic | √ | Free of charge | JSON | √ |
C05 | ChFinAnn | ChFinAnn | Finance | Chinese | 32,040 | Manual | √ | Free of charge | JSON | √ |
C06 | FEED | Chinese Financial Event Extraction Dataset | Finance | Chinese | 31,748 | Automatic & manual | √ | Free of charge | JSON | × |
C07 | EPI | Epigenetics and Post-Translational Modifications 2011 | Biomedical | English | 1,200 | Manual | × | Free of charge | BioNLP Standoff | √ |
C08 | ID | Infectious Diseases 20112 | Biomedical | English | 30 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C09 | GE 11 | Genia Event Extraction 2011 | Biomedical | English | 1,210 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C10 | PC | Pathway Curation 2013 | Biomedical | English | 525 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C11 | CG | Cancer Genetics 2013 (CG) | Biomedical | English | 600 | Manual | √ | Free of charge | BioNLP Standoff | √ |
C12 | BB3 | Bacteria Biotope 2016 | Biomedical | English | 215 | Manual | × | Free of charge | BioNLP Standoff | √ |
C13 | MLEE | Multi-Level Event Extraction | Biomedical | English | 262 | Manual | √ | Free of charge | BRAT Standoff, CoNLL-U | √ |
C14 | LEVEN | Large-Scale Chinese Legal Event Detection Dataset | Legal | Chinese | 8,116 | Automatic & manual | √ | Free of charge | JSON | √ |
1 The ACE 2005 (Arabic) corpus is excluded since it does not support the event extraction task. 2 The ID corpus is comparatively small because its size reflects the number of full-text documents rather than abstracts only.
2.1.1 General domain corpora
2.1.2 Financial domain corpora
2.1.3 Biomedical domain corpora
2.1.4 Legal domain corpus
3 Developing a closed-domain annotated corpus for event extraction tasks
Figure 2. Flowchart of the manual corpus annotation procedure.
3.1 Common text annotation formats
Figure 3. Structure of annotated corpus in the BioNLP Standoff format.
Figure 4. Structure of annotated corpus in the BRAT Standoff format.
Figure 5. Structure of annotated corpus in the CoNLL-U format.
Figure 6. Structure of annotated corpus in OneIE's JSON format.
Table 2. Comparison summary of the common annotation formats.
Annotation Format | Summary | Output Files | Implementation Method | Annotation Structure |
---|---|---|---|---|
BioNLP Standoff | The annotation format is widely used in the BioNLP Shared Task and BioNLP Open Shared Task challenges. | .txt .a1 .a2 | Manual annotation using a text corpus annotation tool | Tab-delimited data |
BRAT Standoff | The annotation format is almost identical to the BioNLP format, but with the annotations combined into a single annotation file (.ann). | .txt .ann | Manual annotation using a text corpus annotation tool | Tab-delimited data |
CoNLL-U | The sentence-level annotations are presented in three types of lines: comment, word, and blank lines. | .txt .conll | Python's spacy_conll package | Tab-delimited data |
OneIE's JSON format | Provides comprehensive annotation storage for each sentence in a JSON object structure. | .json | OneIE's preprocessing script1 or manual data transformation | JSON structure |
1https://github.com/dwadden/dygiepp/tree/master/scripts/data (retrieved: 1 August 2024)
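To make the comparison in Table 2 concrete, the sketch below writes one and the same event annotation in two of the listed formats: a BRAT-style standoff pair (.txt plus .ann) and a simplified OneIE-style JSON record. The sentence, the entity and event types, and the JSON field names are illustrative assumptions rather than excerpts from any corpus above; real OneIE files carry additional fields.

```python
import json

# Illustrative sentence; the standoff offsets below are character positions in this string.
text = "ABC Corp acquired XYZ Ltd in 2021."
with open("example.txt", "w", encoding="utf-8") as f:
    f.write(text)

# BRAT/BioNLP-style standoff: tab-delimited lines that point into the .txt file.
# T* lines are text-bound annotations (entities and the event trigger);
# the E* line attaches the trigger to its arguments as role:ID pairs.
ann_lines = [
    "T1\tOrganization 0 8\tABC Corp",
    "T2\tOrganization 18 25\tXYZ Ltd",
    "T3\tAcquisition 9 17\tacquired",
    "E1\tAcquisition:T3 Buyer:T1 Target:T2",
]
with open("example.ann", "w", encoding="utf-8") as f:
    f.write("\n".join(ann_lines) + "\n")

# Simplified OneIE-style JSON: one object per sentence with token-level spans
# (end indices are exclusive). Text, tokens, and annotations travel together.
sentence_record = {
    "sent_id": "doc1-0",
    "tokens": ["ABC", "Corp", "acquired", "XYZ", "Ltd", "in", "2021", "."],
    "entity_mentions": [
        {"id": "T1", "entity_type": "Organization", "start": 0, "end": 2},
        {"id": "T2", "entity_type": "Organization", "start": 3, "end": 5},
    ],
    "event_mentions": [
        {
            "id": "E1",
            "event_type": "Acquisition",
            "trigger": {"start": 2, "end": 3, "text": "acquired"},
            "arguments": [
                {"entity_id": "T1", "role": "Buyer"},
                {"entity_id": "T2", "role": "Target"},
            ],
        }
    ],
}
with open("example.json", "w", encoding="utf-8") as f:
    json.dump(sentence_record, f, ensure_ascii=False, indent=2)
```

In the standoff variants, the annotations live beside an untouched .txt file and point into it by character offset (split across .a1/.a2 in BioNLP, merged into one .ann in BRAT), whereas the JSON format keeps the tokens and all annotations together in a single record per sentence.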
3.2 Tools for annotating a text corpus
Table 3. Corpus annotation tools.
ID | Tool Name | Platform Compatibility | Output Format | Charges & License Information | Latest Stable Release1 |
---|---|---|---|---|---|
T01 | AlvisAE | Web-based (RESTful web app) | JSON | Free (Open Source) No license provided | 2016 |
T02 | BRAT Rapid Annotation Tool | Web-based (Python package) | BRAT Standoff | Free MIT License | v1.3 Crunchy Frog (Nov 8, 2012) |
T03 | TextAE | Online/Web-based (Python package) | JSON | Free (Open Source) MIT License | v4.5.4 (Mar 1, 2017) |
1Latest stable release checked on 1 August 2024
Figure 7. AlvisAE text annotation editor.
Figure 8. BRAT annotation tool.
Figure 9. TextAE online text annotation editor.
3.3 Document selection
Figure 10. Steps for selecting documents to build an event extraction corpus.
3.4 Event annotation process
3.4.1 General rules for annotating events
3.4.2 Elements of event annotation
Figure 11. Example of event annotation using the BRAT annotation tool.
3.4.3 Information that should be annotated
3.4.4 Information that should not be annotated
4 Discussion
Figure 12. The distribution of corpora by language and domain.
Table 4. Corpus statistics.
ID | Corpus Name | Data Sources | Token Count | Sentence Count | Event Mentions | Negative Events | Event Types |
---|---|---|---|---|---|---|---|
C01 | MUSIED | 11,381 docs | 7.105 M | 315,743 | 35,313 | N/A | 21 |
C02 | MAVEN | 4,480 docs | 1.276 M | 49,873 | 118,732 | 497,261 | 168 |
C03 | ACE 20051 | 599 docs (En), 633 docs (Ch) | 303k (En), 321k (Ch) | 15,789 (En), 7,269 (Ch) | 5,349 (En), 3,333 (Ch) | N/A | 5 |
C04 | CFEE | 2,976 docs | N/A | N/A | 3,044 | 32,936 | 4 |
C05 | ChFinAnn | 32,040 docs | 29,220,480† | 640,800† | > 48,000 | N/A | 5 |
C06 | FEED | 31,748 docs | 28,954,176† | 603,212† | 46,960 | N/A | 5 |
C07 | EPI | 1,200 abstracts | 253,628 | N/A | 3,714 | 369 | 8 |
C08 | ID | 30 full-text articles | 153,153 | 5,118 | 5,150 | 214 | 10 |
C09 | GE 11 | 1,210 abstracts | 267,229 | N/A | 13,603 | N/A | 9 |
C10 | PC | 525 docs | 108,356 | N/A | 12,125 | 571 | 21 |
C11 | CG | 600 abstracts | 129,878 | N/A | 17,248 | 1,326 | 40 |
C12 | BB3 | 146 abstracts (ee), 161 abstracts (ee+ner) | 35,380 (ee), 39,118 (ee+ner) | N/A | 890 (ee), 864 (ee+ner) | N/A | 2 |
C13 | MLEE | 262 docs | 56,588 | 2,608 | 6,677 | N/A | 29 |
C14 | LEVEN | 8,116 docs | 2.241 M | 63,616 | 150,977 | N/A | 108 |
*docs = documents, M = millions, k = thousands, En = English, Ch = Chinese, N/A = not available, † = estimated value from calculations, > = exceeding. 1 The ACE 2005 (Arabic) corpus is excluded since it does not support the event extraction task.
Figure 13. Top five largest annotated corpora for event extraction tasks. |
Figure 14. Comparison of tokens, sentences, and event mentions in existing annotated corpora. |
Figure 15. The count of event mentions in each corpus. |
4.1 Limitations of the existing annotated corpora
4.2 Annotation challenges and potential solutions
Figure 16. Conceptual representation of the universal text annotation converter.
Table 5. Summary of challenges and recommendations.
Challenges | Recommendations |
---|---|
Lack of high-quality annotated data | ·To speed up the development of an annotated corpus, employ a hybrid approach as demonstrated by Li et al. (2022). ·Annotate part of the texts manually, then train ML algorithms on this subset to annotate the remaining data. ·This strategy is faster than annotating all data manually, but it is critical to measure the accuracy of the automatic annotations (a minimal sketch of this hybrid workflow follows this table). |
Incompatibility of annotated corpus formats | ·Develop a standardized annotation format that is universally accepted for annotating text corpora. ·Such a format should store all the information required for common EE tasks. ·Develop a universal text annotation converter for converting annotations between different formats (Figure 16; a converter sketch also follows this table). |
Subjectivity and text ambiguity | ·Develop a complete annotation guideline and adhere to it strictly throughout the annotation process. ·Utilize tools such as Git version control to manage versions of the annotation files. |
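As a rough illustration of the hybrid strategy recommended in Table 5 for the data-scarcity challenge, the sketch below trains a sentence-level trigger classifier on a small manually annotated subset and then proposes labels for the remaining sentences, which annotators review and correct rather than label from scratch. The example sentences, the label set, and the TF-IDF plus logistic regression model are placeholder assumptions, not the setup used by Li et al. (2022).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually annotated subset: sentence -> event type ("None" means no event mention).
labeled_sentences = [
    "ABC Corp acquired XYZ Ltd in 2021.",
    "The company reported quarterly earnings.",
    "DEF Inc merged with GHI Co last year.",
    "The weather was pleasant in March.",
]
labels = ["Acquisition", "None", "Acquisition", "None"]

# Unannotated remainder of the corpus to be pre-labeled automatically.
unlabeled_sentences = [
    "JKL Group acquired a controlling stake in MNO SA.",
    "Shares closed slightly higher on Friday.",
]

# Train on the manual subset, then propose labels for the rest.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(labeled_sentences, labels)

for sentence, proposal in zip(unlabeled_sentences, model.predict(unlabeled_sentences)):
    # Proposals still go to human reviewers; the share of proposals they have to
    # correct is a direct measure of the accuracy of the automatic pass.
    print(f"{proposal}\t{sentence}")
```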
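The universal converter recommended in Table 5 (and sketched conceptually in Figure 16) can be prototyped as a set of small parsers that all target one shared in-memory representation. The snippet below covers only the BRAT-Standoff-to-JSON direction and only entity (T) and event (E) lines, reusing the example.txt/example.ann pair written in the earlier format sketch; the JSON field names are the same simplified assumptions, and discontinuous spans, attributes, and relations are ignored for brevity.

```python
import json

def brat_to_json(txt_path: str, ann_path: str) -> dict:
    """Parse a BRAT .txt/.ann pair into a simple JSON-serializable dict."""
    with open(txt_path, encoding="utf-8") as f:
        text = f.read()

    entities, events = [], []
    with open(ann_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            ann_id, body = line.split("\t", 1)
            if ann_id.startswith("T"):      # text-bound annotation (entity or trigger)
                meta, surface = body.split("\t", 1)
                ann_type, start, end = meta.split(" ")[:3]
                entities.append({"id": ann_id, "type": ann_type,
                                 "start": int(start), "end": int(end),
                                 "text": surface})
            elif ann_id.startswith("E"):    # event: trigger plus role:ID argument pairs
                parts = body.split()
                event_type, trigger_id = parts[0].split(":")
                arguments = [{"role": role, "ref": ref}
                             for role, ref in (p.split(":") for p in parts[1:])]
                events.append({"id": ann_id, "event_type": event_type,
                               "trigger": trigger_id, "arguments": arguments})
    return {"text": text, "entity_mentions": entities, "event_mentions": events}

record = brat_to_json("example.txt", "example.ann")
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Adding the reverse direction and a CoNLL-U writer around the same intermediate dictionary would complete the converter idea shown in Figure 16.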
4.3 Recent advancement of LLMs for corpus annotation
Table 6. Summary of recent studies on LLMs for corpus annotation.
Study | Results | Advantages | Limitations |
---|---|---|---|
Csanády et al. (2024) | BERT shows test accuracies of 91.2% to 96.5% on the IMDb dataset when the model is trained on random baselines. | ·The proposed method can handle large-scale text annotation tasks. ·Provides a cost-effective alternative for annotating large amounts of text. | ·Annotation using LLMs slightly compromises the annotation accuracy. ·LLMs alone cannot provide high-quality corpus annotations. ·The annotated corpus is not suitable for EE tasks. |
Akkurt et al. (2024) | ·The proposed approach improved results by 2%. ·All models show improved performance with GPT-4 + UD Turkish BOUN v2.11, reaching 76.9% (best performance). | ·The model has been tested with data from the UD English and Turkish Treebanks. ·The authors use public data and verify that the methodology complies with ethical standards. | ·The annotation outcome varies (is inconsistent) depending on the user's prompt. ·The method is for entity annotation; thus the output is not suitable for EE tasks. |
Frei and Kramer (2023) | Results on various baseline models: ·gbert-large (P: 70.7%, R: 97.9%, F1: 82.1%) ·GottBERT-base (P: 80.0%, R: 89.9%, F1: 84.7%) ·German-MedBERT (P: 72.7%, R: 81.8%, F1: 77.0%) | ·Addresses the limited corpus availability for non-English medical texts. ·The proposed method shows reliable performance. | ·The proposed method is computationally expensive. ·The annotated corpus cannot be considered a gold standard and requires further validation. ·The method is for entity annotation; thus the output is not suitable for EE tasks. |
Li et al. (2023) | The results show up to a 21% performance improvement over random baselines. | ·The annotation process is carried out jointly by humans and LLMs. ·Provides a cost-effective alternative for annotating large amounts of text. | ·The study does not assess whether LLM-generated annotations outperform a human-annotated corpus. ·The method is for entity annotation; thus the output is not suitable for EE tasks. |
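The studies in Table 6 share one workflow: an LLM proposes annotations that humans subsequently verify. A provider-agnostic sketch of the proposal step is shown below; the prompt wording, the ask_llm callable, and the one-trigger-per-line reply format are assumptions made for illustration, not the interface of any cited study or of a particular LLM API.

```python
from typing import Callable, Dict, List

PROMPT_TEMPLATE = (
    "List every event trigger in the sentence below, one per line, "
    "in the form 'trigger<TAB>event_type'. Reply with 'NONE' if there is no event.\n"
    "Sentence: {sentence}"
)

def propose_event_triggers(sentence: str,
                           ask_llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Ask an LLM for candidate event triggers; the output still needs human review."""
    reply = ask_llm(PROMPT_TEMPLATE.format(sentence=sentence))
    proposals = []
    for line in reply.splitlines():
        line = line.strip()
        if not line or line.upper() == "NONE" or "\t" not in line:
            continue
        trigger, event_type = line.split("\t", 1)
        # Keep only triggers that actually occur in the sentence, which filters out
        # one common class of LLM hallucinations before the human review step.
        if trigger in sentence:
            proposals.append({"trigger": trigger, "event_type": event_type})
    return proposals

def fake_llm(prompt: str) -> str:
    # Canned reply standing in for a real chat-completion client, so the sketch
    # runs without network access.
    return "acquired\tAcquisition"

print(propose_event_triggers("ABC Corp acquired XYZ Ltd in 2021.", fake_llm))
```

Whichever client is plugged in, the limitations listed in Table 6 still apply: proposals vary with the prompt, and every accepted proposal should pass through human review before it enters the corpus.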