Overview
The LinGO Redwoods treebank is a collection of hand-annotated corpora analysed with the LinGO ERG. For each utterance from a corpus, the treebank records (in principle) all analyses hypothesized by the grammar, together with an annotator decision as to which reading is preferred in context.
The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally available broad-coverage grammar of English, viz. the LinGO English Resource Grammar. Unlike with existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank (and, consequently, less dissemination effort is needed to establish this representation). Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG; tools are provided to extract many different types of linguistic information at varying granularity.
Other relevant aspects of the Redwoods treebank include the integration of alternate, though dis-preferred, analyses for each utterance and the dynamic nature of the annotations: as the underlying grammar evolves and improves its analyses, there is a provision for a (nearly) fully automated update of the treebank against a version of the original corpus analysed with the revised grammar. As a methodological result, parts of the Redwoods data are now regularly maintained as part of the grammar regression cycle with each new release of the ERG.
Current Development Status
As of January 2005, we have released the Fifth Growth, a substantially enlarged new revision of the Redwoods treebank. Besides increased coverage on the existing VerbMobil data sets (now including analyses of non-sentential utterances), the Fifth Growth includes four new data sets drawn from a corpus of ecommerce customer email (donated to CSLI by a former industrial affiliate).
The following table summarizes Redwoods Fifth Growth in terms of the total number of utterances, average string length, and average ambiguity rates for three sub-divisions, viz. rejected items (t-active = 0), fully disambiguated items (t-active = 1), and items for which annotators considered more than one analysis active (t-active > 1). The last class will typically correspond to multiple analyses that are either semantically equivalent (as in the two readings of, say, 'do you have time on monday' that correspond to PP attachment to either the main verb or the auxiliary) or 'pragmatically equivalent' (as in the third reading of the above example, attaching the PP within the NP complement of the stative predicate 'have'). Pragmatic equivalence is taken to indicate differences in logical form that are not expected to affect typical NLP applications. While in earlier Redwoods revisions the partly disambiguated class was typically considered a last-resort measure for annotators, we expect to make more (and more systematic) use of it in future revisions. This move is expected to reduce the number of ad-hoc disambiguation rules (e.g. a dis-preference for attachment to auxiliaries or a general low-attachment preference), since error analysis of previous Redwoods revisions suggests that annotators find it difficult to apply such rules consistently.
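Each group of three columns gives the number of items, the average string length, and the average ambiguity for one sub-division.

| Data set | t-active = 0: items | length | ambiguity | t-active = 1: items | length | ambiguity | t-active > 1: items | length | ambiguity |
|----------|---------------------|--------|-----------|---------------------|--------|-----------|---------------------|--------|-----------|
| VM6      | 107                 | 12.1   | 2448      | 3438                | 7.4    | 239       | 0                   | 0.0    | 0         |
| VM13     | 128                 | 14.1   | 8562      | 2875                | 7.9    | 407       | 0                   | 0.0    | 0         |
| VM31     | 146                 | 12.8   | 4465      | 3296                | 5.3    | 235       | 0                   | 0.0    | 0         |
| VM32     | 32                  | 13.3   | 1637      | 894                 | 7.0    | 313       | 0                   | 0.0    | 0         |
| ECPA     | 26                  | 11.7   | 28        | 1130                | 8.0    | 102       | 275                 | 8.8    | 147       |
| ECOS     | 31                  | 13.5   | 346       | 1088                | 8.0    | 18        | 24                  | 11.6   | 37        |
| ECPR     | 57                  | 12.9   | 676       | 1444                | 8.2    | 128       | 0                   | 0.0    | 0         |
| ECOC     | 22                  | 10.9   | 39        | 1150                | 7.5    | 61        | 0                   | 0.0    | 0         |
| TREC     | 22                  | 10.9   | 39        | 1150                | 7.5    | 61        | 0                   | 0.0    | 0         |
| HIKE     | 1                   | 22.0   | 888       | 317                 | 12.9   | 216       | 0                   | 0.0    | 0         |
| Total    | 572                 | 12.8   | 3697      | 16782               | 7.3    | 213       | 299                 | 9.0    | 138       |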
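The Total row can be related to the per-data-set rows with a small calculation. The following is a minimal sketch, assuming that the totals are item-weighted means of the per-data-set averages (the page does not state how they were computed); the names FULLY_DISAMBIGUATED and weighted_totals are made up for this illustration. It uses only the figures for the fully disambiguated class (t-active = 1) from the table above.

```python
# (items, average length, average ambiguity) for the t-active = 1 class,
# copied from the table above.
FULLY_DISAMBIGUATED = {
    "VM6": (3438, 7.4, 239),
    "VM13": (2875, 7.9, 407),
    "VM31": (3296, 5.3, 235),
    "VM32": (894, 7.0, 313),
    "ECPA": (1130, 8.0, 102),
    "ECOS": (1088, 8.0, 18),
    "ECPR": (1444, 8.2, 128),
    "ECOC": (1150, 7.5, 61),
    "TREC": (1150, 7.5, 61),
    "HIKE": (317, 12.9, 216),
}


def weighted_totals(rows):
    """Return (item count, mean length, mean ambiguity), weighted by items.

    Assumption: the Total row is an item-weighted mean of the row averages.
    """
    total = sum(items for items, _, _ in rows.values())
    length = sum(items * avg for items, avg, _ in rows.values()) / total
    ambiguity = sum(items * amb for items, _, amb in rows.values()) / total
    return total, round(length, 1), round(ambiguity)


if __name__ == "__main__":
    print(weighted_totals(FULLY_DISAMBIGUATED))  # (16782, 7.3, 213)
```

Because the per-data-set averages are themselves rounded, the same calculation for the other two classes may differ from the published totals in the last digit.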
Earlier relevant Redwoods revisions include the Second Growth and Third Growth.
Data Format
Unlike previous Redwoods revisions, the Fifth Growth is distributed exclusively in [incr tsdb()] profile form (see below for instructions on how to expand the data into a textual export format). We have also limited the number of dis-preferred analyses per item to at most the 1,000 best analyses according to a simple MaxEnt model (derivational features only, no grandparenting) trained on an interim version of the treebank. In principle, Redwoods users could use the LKB or PET parsers to obtain the complete set of analyses and then use the [incr tsdb()] update facility to automatically produce a version of the treebank against the unrestricted profile. However, we expect that the reduced distribution provides a sufficiently large portion of the dis-preferred analyses for high-quality stochastic modelling and that the substantial reduction in overall size will actually benefit experimentation.
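The idea of keeping only the best-scoring analyses per item can be illustrated as follows. This is a minimal sketch and not the [incr tsdb()] implementation: the feature names, the weights, and the functions score and n_best are hypothetical, and the real model and its features are not described on this page. Ranking by the unnormalised score gives the same order as ranking by MaxEnt probability, since the normalisation term is constant within one item.

```python
import heapq


def score(features, weights):
    """Unnormalised log-linear score: the weighted sum of feature counts."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def n_best(analyses, weights, n=1000):
    """Return the n highest-scoring (identifier, features) pairs, best first."""
    return heapq.nlargest(n, analyses, key=lambda item: score(item[1], weights))


if __name__ == "__main__":
    # Hypothetical weights and feature counts for three analyses of one item.
    weights = {"rule-a": 1.0, "rule-b": -0.5}
    analyses = [
        ("reading-1", {"rule-a": 1}),
        ("reading-2", {"rule-a": 2, "rule-b": 1}),
        ("reading-3", {"rule-b": 2}),
    ]
    for identifier, features in n_best(analyses, weights, n=2):
        print(identifier, score(features, weights))
```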
See the LkbInstallation instructions for details; the following should be sufficient to obtain a full installation of the LKB, ERG, [incr tsdb()], and Redwoods Fifth Growth data for Linux (x86) and Solaris (sparc) environments. The choice of DELPHINHOME, the root directory for the DELPH-IN source tree, can of course be varied; the example below assumes a sub-directory 'delphin' in the user home directory:
export DELPHINHOME=${HOME}/delphin
wget http://lingo.stanford.edu/ftp/etc/install
bash install --redwoods
Expanding and Exporting
Assuming a functional installation of the LKB, ERG, and [incr tsdb()], the process of exporting all or parts of the Redwoods Fifth Growth data into a collection of plain text files can be fully automated using a shell script provided in the [incr tsdb()] data directory. By default, the script will include the following representations in the export:
derivation tree: primary, labeled in terms of grammar-internal identifiers;
phrase structure tree: derived, labeled using a set of abbreviatory symbols;
attribute value matrix: derived, the full HPSG sign, including all daughters;
MRS: derived, in two flavours ('raw' and 'indexed'), meaning representation;
dependencies: derived, elementary dependency relations (reduced form of MRS).
Setting the parameter *redwoods-export-values* in the script (see below) to a sub-set of the above may result in significant savings in export time and disk space requirements. The default set of (close to all) export representations requires several CPU days and around 20 gigabytes of disk space (as a set of gzip(1)-compressed files) for the full Redwoods Fifth Growth. The following is an example session to export just the VM6 section:
cd $DELPHINHOME/lkb/src/tsdb/home
./export redwoods/jun-04/vm6/04-06-11
A full export can be fairly memory-intensive for highly ambiguous items, so it is advisable to run the above on a suitable machine (with, say, two gigabytes of RAM or more). Consult the export script for further configuration options.
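Since the export is written as gzip-compressed plain text, the results can be read back with standard tools. The following is a minimal sketch, assuming that the export directory contains files ending in '.gz' and that they are UTF-8 (or plain ASCII) text; neither the directory layout nor the file names are specified on this page, so the names used in demo are stand-ins, and read_export is a hypothetical helper.

```python
import gzip
import tempfile
from pathlib import Path


def read_export(root):
    """Yield (file name, decompressed text) for every .gz file below root."""
    for path in sorted(Path(root).rglob("*.gz")):
        # gzip.open in text mode decompresses and decodes in one step.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            yield path.name, handle.read()


def demo():
    """Write two stand-in files to a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as root:
        for name, text in [("1.gz", "first item"), ("2.gz", "second item")]:
            with gzip.open(Path(root) / name, "wt", encoding="utf-8") as handle:
                handle.write(text)
        return list(read_export(root))


if __name__ == "__main__":
    for name, text in demo():
        print(name, len(text))
```

Reading one file at a time keeps memory use proportional to the largest single file rather than to the whole export.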
Bibliography
Following is an incomplete selection of publications on the creation and use of the Redwoods treebank.
Oepen, Stephan, Kristina Toutanova, Stuart Shieber, Christopher Manning, Dan Flickinger, and Thorsten Brants (2002).
The LinGO Redwoods Treebank: Motivation and Preliminary Applications. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan (pages 1253-1257).
Oepen, Stephan, Dan Flickinger, Kristina Toutanova, and Christopher D. Manning (2002).
LinGO Redwoods. A Rich and Dynamic Treebank for HPSG. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
Toutanova, Kristina, Christopher D. Manning, and Stephan Oepen (2002).
Parse Ranking for a Rich HPSG Grammar. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
Toutanova, Kristina and Christopher D. Manning (2002).
Feature Selection for a Rich HPSG Grammar Using Decision Trees. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL 2002), Taipei, Taiwan.
Velldal, Erik, Stephan Oepen, and Dan Flickinger (2004).
Paraphrasing Treebanks for Stochastic Realization Ranking. In Proceedings of The Third Workshop on Treebanks and Linguistic Theories (TLT 2004), Tuebingen, Germany.
An overview of many of the methodological aspects of the Redwoods initiative is available in an invited presentation at the 2003 Treebanks and Linguistic Theories workshop.
Acknowledgements
The Redwoods treebank has been under active development at the CSLI LinGO Laboratory since early 2001. The annotation environment was built from the combination of the LKB tree comparison window (originally developed by Rob Malouf) and the [incr tsdb()] profiling tools; Stephan Oepen did the bulk of the Redwoods software development. Dan Flickinger, as the main developer of the ERG, has been an invaluable source of inspiration on the treebank design and has also been the main treebanker since Redwoods Second Growth. Chris Manning, Kristina Toutanova, and Stuart Shieber, as early adopters and consultants on the overall design of the resource and representations, have greatly influenced the evolution of the treebank and pioneered its use for stochastic parse selection. Ezra Callahan was the first annotator, constructing what has been released as the First Growth during a ten-week summer internship. John Beavers did the annotations of the new ecommerce sections. Francis Bond and his colleagues at the NTT Research Laboratory have been vigorous supporters, adapted the Redwoods approach for Japanese (dubbing their treebank Hinoki), and thus helped a lot in scaling up the technology. Marty Mayberry, Jason Baldridge, Alex Lascarides, and Miles Osborne, as active users of the ERG and Redwoods data, have provided crucial feedback on the representations and software and positively contributed to recent developments. Tim Baldwin, Emily M. Bender, Kathryn Campbell-Kibler, Ann Copestake, Andreas Eisele, Rob Malouf, Rebecca Neil, Ivan Sag, Erik Velldal, and Tom Wasow have all helped through advice and productive critique in various stages of the project.
The development of the Redwoods treebank was financed opportunistically from numerous sources, including multiple donations to CSLI from YY Technologies (Mountain View, CA), a CSLI Seeding Grant, the Stanford Symbolic Systems Program (through multiple sponsored summer internships), the Commission of the European Community (through the Deep-Thought project), Scottish Enterprise (through the ROSIE project), Nippon Telegraph and Telephone Corporation (NTT) (through a sponsored research contract to the LinGO Laboratory), and the Norwegian LOGON Initiative (through financial support to Dan Flickinger and Stephan Oepen).