ptb_text_only's Introduction

annotations_creators: expert-generated
language_creators: found
language: en
license: other
license_details: LDC User Agreement for Non-Members
multilinguality: monolingual
size_categories: 10K<n<100K
source_datasets: original
task_categories: text-generation, fill-mask
task_ids: language-modeling, masked-language-modeling
paperswithcode_id: (none listed)
pretty_name: Penn Treebank
dataset_info:
  config_name: penn_treebank
  features:
    name      dtype
    sentence  string
  splits:
    name        num_bytes  num_examples
    train       5143706    42068
    test        453710     3761
    validation  403156     3370
  download_size: 5951345
  dataset_size: 6000572

Dataset Card for Penn Treebank

Dataset Description

Dataset Summary

This is the Penn Treebank Project: Release 2 CDROM, featuring a million words of 1989 Wall Street Journal material. In this version, rare words have already been replaced with the <unk> token, and numbers have been replaced with the <N> token.
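To see how much of the text is affected by this preprocessing, one can measure the share of tokens that are placeholders. The following is a minimal sketch, not part of the original card. It assumes whitespace tokenization and assumes the placeholders are spelled `<unk>` and `<N>`; both the spellings and the helper name `placeholder_rates` are illustrative, and the toy sentences are invented rather than drawn from the corpus.

```python
from collections import Counter

# Assumed placeholder spellings; verify against the actual data files,
# and pass a different tuple if the corpus uses other spellings.
PLACEHOLDERS = ("<unk>", "<N>")


def placeholder_rates(sentences, placeholders=PLACEHOLDERS):
    """Return the fraction of whitespace-separated tokens equal to each placeholder."""
    counts = Counter()
    total = 0
    for sentence in sentences:
        tokens = sentence.split()
        total += len(tokens)
        counts.update(tok for tok in tokens if tok in placeholders)
    if total == 0:
        # No tokens at all: report zero rates instead of dividing by zero.
        return {p: 0.0 for p in placeholders}
    return {p: counts[p] / total for p in placeholders}


# Invented sentences for illustration only; they are not taken from the corpus.
toy_sentences = [
    "the <unk> rose <N> points",
    "analysts expect <unk> growth",
]
rates = placeholder_rates(toy_sentences)

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.3f}")
```

In practice, `toy_sentences` would be replaced by the values of the `sentence` field for one split, loaded by whatever means is available.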

Supported Tasks and Leaderboards

Language Modelling
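
Language models on corpora like this one are commonly compared by perplexity on held-out text. As a rough illustration only, the sketch below fits an add-one-smoothed unigram model and computes its perplexity. The function names, the smoothing choice, and the toy sentences are assumptions made for this example; it is not a baseline from the card, and real systems use far stronger models.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; return (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def perplexity(sentences, counts, total):
    """Perplexity of an add-one-smoothed unigram model on the given sentences.

    One extra vocabulary slot is reserved for unseen tokens. This is a
    simplification: every unseen token is scored as if it shared that slot.
    """
    vocab_size = len(counts) + 1
    log_prob = 0.0
    n_tokens = 0
    for sentence in sentences:
        for tok in sentence.split():
            p = (counts[tok] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("cannot compute perplexity of empty text")
    return math.exp(-log_prob / n_tokens)


# Invented toy data, not corpus text.
train_counts, train_total = train_unigram(["a b", "a a"])
toy_ppl = perplexity(["a"], train_counts, train_total)

if __name__ == "__main__":
    print(f"toy perplexity: {toy_ppl:.2f}")
```

Because rare words in this corpus are already mapped to a placeholder, the vocabulary is effectively closed, which is what makes a single reserved slot a tolerable shortcut here.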

Languages

The text in the dataset is in American English.

Dataset Structure

Data Instances

[Needs More Information]

Data Fields

sentence (string): the text of one example.

Data Splits

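The penn_treebank configuration has three splits:

name        num_examples  num_bytes
train       42068         5143706
test        3761          453710
validation  3370          403156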
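
The split statistics in the dataset_info metadata of this card can be sanity-checked with a few lines of arithmetic. The sketch below copies the numbers from that metadata, confirms that the per-split byte counts add up to the reported dataset_size, and computes each split's share of the examples. The helper name `split_fractions` is an illustrative choice.

```python
# Split statistics copied from the dataset_info metadata of this card.
SPLITS = {
    "train": {"num_bytes": 5143706, "num_examples": 42068},
    "test": {"num_bytes": 453710, "num_examples": 3761},
    "validation": {"num_bytes": 403156, "num_examples": 3370},
}
DATASET_SIZE = 6000572  # dataset_size field from the same metadata


def split_fractions(splits):
    """Return each split's share of the total number of examples."""
    total = sum(s["num_examples"] for s in splits.values())
    return {name: s["num_examples"] / total for name, s in splits.items()}


total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())
fractions = split_fractions(SPLITS)

if __name__ == "__main__":
    print(f"bytes match dataset_size: {total_bytes == DATASET_SIZE}")
    print(f"total examples: {total_examples}")
    for name, frac in fractions.items():
        print(f"{name}: {frac:.1%}")
```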

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

Dataset provided for research purposes only. Please check dataset license for additional information.

Citation Information

@article{marcus-etal-1993-building,
    title = "Building a Large Annotated Corpus of {E}nglish: The {P}enn {T}reebank",
    author = "Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann",
    journal = "Computational Linguistics",
    volume = "19",
    number = "2",
    year = "1993",
    url = "https://www.aclweb.org/anthology/J93-2004",
    pages = "313--330",
}

Contributions

Thanks to @harshalmittal4 for adding this dataset.
