annotations_creators: expert-generated
language_creators: found
language:
license: other
license_details: LDC User Agreement for Non-Members
multilinguality: monolingual
size_categories: 10K<n<100K
source_datasets: original
task_categories: text-generation, fill-mask
task_ids: language-modeling, masked-language-modeling
paperswithcode_id:
pretty_name: Penn Treebank
dataset_info:
  features:
    name	dtype
    sentence	string
  config_name: penn_treebank
  splits:
    name	num_bytes	num_examples
    train	5143706	42068
    test	453710	3761
    validation	403156	3370
  download_size: 5951345
  dataset_size: 6000572

Dataset Card for Penn Treebank

Dataset Description
Dataset Structure
Dataset Creation
Considerations for Using the Data
Additional Information

Dataset Description

Homepage: https://catalog.ldc.upenn.edu/LDC99T42
Repository: https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt, https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt, https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt
Paper: https://www.aclweb.org/anthology/J93-2004.pdf
Leaderboard: [Needs More Information]
Point of Contact: [Needs More Information]
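
The repository URLs listed above point to plain-text files. As a minimal sketch, assuming each file holds one sentence per line with space-separated tokens, the splits could be read with the Python standard library alone. The helper names `parse_ptb_text` and `fetch_split` are hypothetical and are not part of any official loader.

```python
from urllib.request import urlopen

# URLs copied from the Repository entry of this card.
PTB_URLS = {
    "train": "https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt",
    "validation": "https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt",
    "test": "https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt",
}


def parse_ptb_text(text):
    """Turn raw file contents into a list of {"sentence": str} records.

    Assumption: one sentence per line; blank lines are skipped and
    surrounding whitespace is stripped.
    """
    records = []
    for line in text.splitlines():
        sentence = line.strip()
        if sentence:
            records.append({"sentence": sentence})
    return records


def fetch_split(split):
    """Download one split and parse it (requires network access)."""
    with urlopen(PTB_URLS[split]) as response:
        return parse_ptb_text(response.read().decode("utf-8"))
```

If the one-sentence-per-line assumption holds, `len(fetch_split("train"))` should agree with the train `num_examples` value in the metadata at the top of this card.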

Dataset Summary

This is the Penn Treebank Project: Release 2 CDROM, featuring a million words of 1989 Wall Street Journal material. Rare words in this version have already been replaced with the <unk> token, and numbers have been replaced with a placeholder token (N in the linked text files).
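
The replacement of rare words and numbers described above can be illustrated with a short sketch. This is not the script that produced the dataset: the placeholder strings `<unk>` and `N`, the frequency-based vocabulary cutoff, and the number pattern are all assumptions made for illustration, and `build_vocab` and `normalize` are hypothetical helper names.

```python
import re
from collections import Counter

UNK_TOKEN = "<unk>"  # assumed placeholder for rare words
NUM_TOKEN = "N"      # assumed placeholder for numbers

# Assumed notion of "number": optional sign, digits with optional commas,
# optional decimal part (e.g. 42, 3.5, 1,000).
_NUMBER_RE = re.compile(r"[+-]?\d[\d,]*(?:\.\d+)?")


def build_vocab(sentences, max_size):
    """Return the max_size most frequent tokens, counting numbers as NUM_TOKEN."""
    counts = Counter(
        NUM_TOKEN if _NUMBER_RE.fullmatch(tok) else tok
        for sent in sentences
        for tok in sent.split()
    )
    return {tok for tok, _ in counts.most_common(max_size)}


def normalize(sentence, vocab):
    """Replace numbers with NUM_TOKEN and out-of-vocabulary words with UNK_TOKEN."""
    out = []
    for tok in sentence.split():
        if _NUMBER_RE.fullmatch(tok):
            out.append(NUM_TOKEN)
        elif tok in vocab:
            out.append(tok)
        else:
            out.append(UNK_TOKEN)
    return " ".join(out)
```

A model trained on text normalized this way only ever sees a closed vocabulary, which is why the placeholder tokens appear directly in the `sentence` strings.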

Supported Tasks and Leaderboards

Language Modelling

Languages

The text in the dataset is in American English.

Dataset Structure

Data Instances

[Needs More Information]

Data Fields

sentence: a string holding the text of one example.

Data Splits

name	num_examples
train	42068
validation	3370
test	3761
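
The split statistics in the metadata at the top of this card can be cross-checked with a little arithmetic. The sketch below copies those numbers verbatim; `summarize` is a hypothetical helper written for this example.

```python
# Split statistics copied from the dataset_info metadata of this card.
SPLITS = {
    "train": {"num_bytes": 5143706, "num_examples": 42068},
    "test": {"num_bytes": 453710, "num_examples": 3761},
    "validation": {"num_bytes": 403156, "num_examples": 3370},
}
DATASET_SIZE = 6000572  # dataset_size from the same metadata


def summarize(splits):
    """Return total examples, total bytes, and each split's share of examples."""
    total_examples = sum(s["num_examples"] for s in splits.values())
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    shares = {
        name: s["num_examples"] / total_examples for name, s in splits.items()
    }
    return total_examples, total_bytes, shares


total_examples, total_bytes, shares = summarize(SPLITS)
print(total_examples, total_bytes == DATASET_SIZE)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The per-split byte counts add up to the reported `dataset_size`, and the three splits together hold 49199 examples, about 85.5% of them in the training split.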

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

The dataset is provided for research purposes only. Please check the dataset license for additional information.

Citation Information

@article{marcus-etal-1993-building,
    title = "Building a Large Annotated Corpus of {E}nglish: The {P}enn {T}reebank",
    author = "Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann",
    journal = "Computational Linguistics",
    volume = "19",
    number = "2",
    year = "1993",
    url = "https://www.aclweb.org/anthology/J93-2004",
    pages = "313--330",
}

Contributions

Thanks to @harshalmittal4 for adding this dataset.
