(This web page is currently under construction.)

The Penn Treebank Project

The Penn Treebank Project annotates naturally occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. We also annotate text with part-of-speech tags and, for the Switchboard corpus of telephone conversations, with dysfluency annotation. We are located in the LINC Laboratory of the Computer and Information Science Department at the University of Pennsylvania.
All data produced by the Treebank is released through the Linguistic Data Consortium.
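
To make the idea of a "bank of linguistic trees" concrete, here is a minimal sketch of reading one bracketed parse and listing its part-of-speech tags. The example sentence is invented for illustration and is not taken from any released corpus; its labels only follow the general style of Treebank II bracketing. The function names `parse_tree` and `tagged_words` are hypothetical, and the reader assumes that every bracket opens with a label, which may not hold for every file in a release.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumption: every "(" is immediately followed by a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def tagged_words(tree):
    """Return (word, tag) pairs from the leaves, in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Invented example sentence, not corpus data.
EXAMPLE = (
    "(S (NP-SBJ (DT The) (NN dog)) "
    "(VP (VBD chased) (NP (DT the) (NN cat))) (. .))"
)

tree = parse_tree(EXAMPLE)
for word, tag in tagged_words(tree):
    print(word, tag)
```

Nested tuples keep the sketch dependency-free; a real tool would more likely use an existing tree library and handle the full annotation conventions described in the style manuals.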

Descriptions and samples of annotated corpora:

Wall Street Journal | The Brown Corpus | Switchboard | ATIS

Frequently Asked Questions (FAQs)

  • tokenization
  • NP heads and Base NPs in Treebank II bracketing

Annotation Style Manuals

  • Part-of-speech tagging
  • Treebank I bracketing was used until 12/92.
  • Treebank II bracketing is designed to allow the extraction of simple predicate-argument structure.
  • Dysfluency annotation for Switchboard corpus

Treebank Releases on CD

  • Preliminary Release, Version 0.5 CDROM, 1992
  • Release 2 CDROM, 1995

Publications

  • A nice overview of the project (before Treebank II style), Computational Linguistics, vol. 19, 1993.
  • Introduction to predicate-argument bracketing (a.k.a. Treebank II), ARPA '94.

Personnel

Principal Investigator: Mitchell Marcus
Project Administrator: Ann Taylor
Programmer/Data Manager: Robert MacIntyre
Annotators: Ann Bies, Constance Cooper, Mark Ferguson, Alyson Littman

This web page is maintained by treebank@unagi.cis.upenn.edu.
Last change: $Date: 1996/02/29 09:52:56 $ UTC.