cgel

This repo contains CGELBank, a human-annotated treebank of English using the syntactic formalism of the Cambridge Grammar of the English Language (CGEL). The treebank is described in Reynolds et al. (2023), published at the Linguistic Annotation Workshop (LAW).

This work is licensed under a Creative Commons Attribution 4.0 International License.

Datasets

We annotated data from Twitter and the English Web Treebank (EWT).

To load the CGEL trees for scripting, use the cgel.py library.

Summary information is available in:

  • STATS.md (statistics extracted from the trees)
  • INDEX.md (list of sentences and notable properties)

Gold Data

  • datasets/twitter.cgel: CGEL gold trees from Twitter
  • datasets/ewt.cgel: CGEL trees from a sample of EWT train sentences (manually annotated by Brett Reynolds)
  • datasets/{ewt-test_iaa50.cgel, ewt-test_pilot5.cgel}: Adjudicated and up-to-date trees from the interannotator agreement (IAA) experiment; sentences drawn from the EWT test split
  • datasets/trial/{ewt-trial.cgel, twitter-etc-trial.cgel}: Miscellaneous trees annotated but not adjudicated by both annotators
  • datasets/oneoff/*.cgel: Various CGEL trees for ad hoc sentences

Corresponding .conllu files are also available alongside the datasets/*.cgel and datasets/trial/*.cgel files. EWT .conllu files are gold trees; other .conllu files are manual corrections of Stanza output.
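For scripting against the .conllu files, a minimal reader can be sketched as follows. It relies only on the general CoNLL-U conventions (ten tab-separated columns per token, `#` comment lines, a blank line between sentences); the function name `read_conllu` and the sample sentence are illustrative assumptions and are not taken from this repo or its data.

```python
from __future__ import annotations

from typing import Iterator

# The ten CoNLL-U columns, in file order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text: str) -> Iterator[list[dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: list[dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment/metadata line; ignored in this sketch.
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        # Handle a final sentence with no trailing blank line.
        yield sentence


# Made-up two-token sentence, for illustration only.
sample = (
    "# text = It works\n"
    "1\tIt\tit\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tworks\twork\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(sample))
print([tok["form"] for tok in sentences[0]])
```

To read a real file, pass its contents, e.g. `read_conllu(open(path, encoding="utf-8").read())`. For anything beyond quick inspection, a dedicated CoNLL-U library is likely the better choice.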

All data was revised with the aid of consistency-checking scripts.

Other subdirectories contain older/silver versions of the trees.

Interannotator Data

Under datasets/iaa/:

  • ewt-test_pilot5.{nschneid, brettrey, adjudicated}.cgel: Pilot interannotator study (5 sentences from EWT).
  • ewt-test_iaa50.{...}.cgel: Main interannotator study (50 sentences from EWT).
    • {nschneid, brettrey}.novalidator: Initial annotation.
    • {nschneid, brettrey}.validator: Corrected individual annotation after running the automatic validation script to catch common errors.
    • adjudicated: Final adjudicated version combining both annotations.

Structure

  • cgel.py: library that implements classes for CGEL trees and the nodes within them, incl. helpful functions for printing and processing trees in PENMAN notation
  • cgel2ptb.py: prints CGEL trees in PTB bracketed style
  • constituent.py: information about how constituents join in a tree, for use by other scripts
  • eval.py: script for comparing two sets of CGEL annotations with tree edit distance (and derived metrics)
  • iaa.sh: script that runs eval.py on all files involved in our interannotator study (comparing pre- and post-validation trees as well as final adjudicated version)
  • tree2tex.py: prints CGEL trees as nicely formatted LaTeX
  • ud2cgel.py: converts UD trees (from the English EWT treebank) to CGEL format using a rule-based system
  • validate_trees.py: script to check the well-formedness of trees
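As a rough illustration of what "PTB bracketed style" means, the sketch below renders a toy tree as a bracketed string. It does not use cgel.py or cgel2ptb.py: the nested-tuple tree representation, the function name `to_brackets`, and the example labels are all assumptions made for this example, and the actual script may format labels differently.

```python
def to_brackets(node) -> str:
    """Render a toy tree as a PTB-style bracketed string.

    Hypothetical representation: a leaf is (label, word) and an
    internal node is (label, [child, ...]).
    """
    label, content = node
    if isinstance(content, str):
        # Leaf: the label followed by the word itself.
        return f"({label} {content})"
    # Internal node: the label followed by each rendered child.
    return f"({label} " + " ".join(to_brackets(child) for child in content) + ")"


# Made-up example tree, not drawn from the treebank.
tree = ("Clause", [("NP", [("N", "dogs")]), ("VP", [("V", "bark")])])
print(to_brackets(tree))  # (Clause (NP (N dogs)) (VP (V bark)))
```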

Folders

  • analysis/: scripts for analysing the datasets, incl. edit distance
  • convertor/: includes conversion rules in DepEdit script + outputs from conversion, with a simple Flask web interface for local testing in the browser (English text > automatic UD w/ Stanza > CGEL)
  • datasets/: all the final datasets
  • figures/: figures for papers/posters and code for generating them
  • scripts/: one-off scripts that were used to clean/restructure data
  • test/: validation tests

Tests

To run tests locally:

$ python -m pytest

This will validate the trees and test distance metrics (Levenshtein and TED).
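For readers unfamiliar with the first of these metrics, here is the textbook dynamic-programming form of Levenshtein distance. It is a generic sketch, not the implementation used by the tests or by eval.py; tree edit distance (TED) is more involved and is not shown.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein(["the", "cat", "sat"], ["the", "dog", "sat", "down"]))  # 2
```

The function works on any pair of sequences, so it can compare token lists as well as character strings.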

History

  • CGELBank 1.0: 2023-07-04.
    • Initial release of 257 trees.

Resources

Overview of the project:

Brett Reynolds, Aryaman Arora, and Nathan Schneider (2023). Unified Syntactic Annotation of English in the CGEL Framework. Proc. of the 17th Linguistic Annotation Workshop (LAW-XVII), Toronto, Canada.

@inproceedings{cgelbank-law,
    address = {Toronto, Canada},
    title = {Unified Syntactic Annotation of {E}nglish in the {CGEL} Framework},
    author = {Reynolds, Brett and Arora, Aryaman and Schneider, Nathan},
    year = {2023},
    month = jul,
    url = {https://people.cs.georgetown.edu/nschneid/p/cgeltrees.pdf},
    booktitle = {Proc. of the 17th Linguistic Annotation Workshop (LAW-XVII)}
}

Annotation manual:

Brett Reynolds, Nathan Schneider, and Aryaman Arora (2023). CGELBank Annotation Manual v1.0. arXiv.

Further analysis:

Brett Reynolds, Aryaman Arora, and Nathan Schneider (2022). CGELBank: CGEL as a Framework for English Syntax Annotation. arXiv.

Aryaman Arora, Nathan Schneider, and Brett Reynolds (2022). A CGEL-formalism English treebank. MASC-SLL (poster), Philadelphia, USA.

Source data:

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning (2014). A Gold Standard Dependency Corpus for English. Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC '14).

Ann Bies, Justin Mott, Colin Warner, and Seth Kulick (2012). English Web Treebank. LDC.