
c4repset: Representative Subset from C4 data for Training Pre-trained LMs

TL;DR

This dataset is a subset of the C4 data in TFDS, which may enable effective training of language models even when the data size is small. Details of how we obtained the subset are described in our paper Extracting Representative Subset from Extensive Text Data for Training Pre-trained Language Models, Information Processing & Management, Volume 60, Issue 3, May 2023 (accepted Dec 17, 2022).

Reason for providing this dataset

Neural language models, which have developed rapidly in recent years, are an essential technology that plays a fundamental role in the success of the natural language processing (NLP) field. Many studies have shown that incorporating neural language models as pre-trained models (PreLMs) into target task-specific models can dramatically improve performance over models trained without them. In other words, PreLMs learned from large-scale text datasets can effectively serve as universal features for various NLP tasks.

For PreLMs, several recent studies have experimentally verified that the amount of training data and the model size are the two significant factors that stably improve performance, e.g., [1][2][3][4][5]. However, it is also well known that the performance improvement is often only approximately log-linear with respect to the amount of data and the model size [6][7]. In other words, to repeat the same performance gain obtained by increasing the data tenfold, we must increase the data tenfold again, i.e., to 100 times the original amount. This implies that a vast amount of training data is required to build a higher-performance PreLM, and the computational resources necessary for training may become infeasible. In fact, most PreLMs trained on large-scale datasets are released by large companies with ample computational resources, and it is very difficult for organizations with limited computational resources or research funds, such as many university laboratories, to build relatively high-performance PreLMs.
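To make the log-linear relationship concrete, here is a small worked illustration; the functional form below is an assumption chosen for illustration, not a formula taken from [6][7]:

```latex
% Assumed log-linear scaling of performance P with data size N:
P(N) = a + b \log_{10} N
% Each fixed gain of size b then costs a further tenfold increase in data:
P(10N) - P(N) = b, \qquad P(100N) - P(10N) = b
% i.e., repeating the gain obtained from N -> 10N requires 100N in total.
```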

However, this situation risks confining research on new developments of PreLMs to a small number of institutions. It is inappropriate for open research if researchers cannot widely participate in studying such an important fundamental topic as PreLMs. Therefore, we focus on the training data of PreLMs and explore whether a subset of the data used to train a large-scale PreLM can train a language model with equal or better performance. We refer to this representative subset of the original full training dataset as the "representative dataset," or "RepSet" for short. If such a subset can be extracted, research on PreLMs becomes feasible with more modest computational resources and research budgets; more researchers can then participate, and the field can develop more quickly.

How to use

We provide a list of URLs extracted from the C4 data. A naive and straightforward way to use this dataset is to download the URL list and extract the corresponding documents from the original Common Crawl data as defined by C4. Another option is to use the URL list via a slight modification of TFDS; a minimal sketch of this approach is shown below.
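The following sketch filters the TFDS C4 dataset down to the documents whose URLs appear in the RepSet list. It is an illustration of the idea, not the authors' exact pipeline; the file name repset_urls.txt is a placeholder for the downloaded URL list, and building c4/en locally requires preparing the Common Crawl source data as described in the TFDS documentation.

```python
# Minimal sketch: keep only C4 documents whose URL is in the RepSet list.
import tensorflow as tf
import tensorflow_datasets as tfds

# "repset_urls.txt" is a placeholder name for the downloaded URL list.
with open("repset_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# A static hash table lets us test URL membership inside the tf.data pipeline.
url_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(urls),
        values=tf.ones(len(urls), dtype=tf.int64),
    ),
    default_value=0,
)

# Note: c4/en must be built locally from the Common Crawl source data;
# see the TFDS c4 documentation for the required preparation steps.
ds = tfds.load("c4/en", split="train")
repset = ds.filter(lambda ex: url_table.lookup(ex["url"]) > 0)

for ex in repset.take(1):
    print(ex["url"].numpy().decode(), ex["text"].numpy()[:80])
```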

Citation

Please cite as:

@article{SUZUKI_IPM2023103249,
  author = {Jun Suzuki and Heiga Zen and Hideto Kazawa},
  title = {Extracting representative subset from extensive text data for training pre-trained language models},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {3},
  pages = {103249},
  year = {2023},
  issn = {0306-4573},
  doi = {10.1016/j.ipm.2022.103249},
  url = {https://www.sciencedirect.com/science/article/pii/S0306457322003508},
}
