speech-dataset's Introduction

Zpoken is a Ukrainian IT company. One of its major divisions focuses on speech recognition technologies for English and Slavic (Ukrainian, Russian) languages.

What is this repo about

We are happy to present our Russian speech dataset: Zpoken Dataset [RU].

Currently the dataset consists of 5 source parts: radio_source_1, radio_source_2, radio_source_3, radio_source_5, Ru-films.

All data is stored in .opus format and was converted to mono, 16 kHz sampling rate, 16-bit.

| Part name | Duration (h) | Samples num. | Average duration (s) | Characters per second | Characters per sample |
|---|---|---|---|---|---|
| radio_source_1 | 16 424.82 | 7 887 042 | 7.50 | 14.12 | 105.84 |
| radio_source_2 | 2 308.46 | 955 904 | 8.69 | 13.53 | 117.62 |
| radio_source_3 | 500.14 | 165 584 | 10.87 | 13.90 | 151.16 |
| radio_source_5 | 655.88 | 216 101 | 10.93 | 16.63 | 181.66 |
| Ru-films | 850.88 | 203 972 | 15.02 | 8.76 | 131.57 |
| Total / Average | 20 740.18 | 9 428 603 | 7.91 | 13.95 | 106.17 |
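As a quick sanity check on the table above, the "Average duration (s)" column can be recomputed from the other two columns as total seconds divided by the number of samples. The sketch below only uses the figures from the table; the names `PARTS` and `average_duration_s` are illustrative and not part of the repository.

```python
# Duration in hours and sample counts, copied from the table above.
PARTS = {
    "radio_source_1": (16424.82, 7_887_042),
    "radio_source_2": (2308.46, 955_904),
    "radio_source_3": (500.14, 165_584),
    "radio_source_5": (655.88, 216_101),
    "Ru-films": (850.88, 203_972),
}


def average_duration_s(hours: float, samples: int) -> float:
    """Mean clip length in seconds: total seconds divided by clip count."""
    return hours * 3600 / samples


# Totals across all parts, to compare with the last table row.
total_hours = sum(hours for hours, _ in PARTS.values())
total_samples = sum(samples for _, samples in PARTS.values())

for name, (hours, samples) in PARTS.items():
    print(f"{name}: {average_duration_s(hours, samples):.2f} s per sample")
print(f"Total: {total_hours:.2f} h in {total_samples} samples")
```

Small differences in the last digit compared with the table are possible, since the published durations are themselves rounded.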

All parts were scraped from open sources. The raw material consisted of long audio files with transcriptions that had no timestamps, so one of the challenges we solved was aligning the original transcription to each short audio sample. You will be able to read more about this problem in our upcoming paper.

Download & play

We provide free-to-use demo subsets totalling 150 hours across all parts. Each demo is a random sample from the corresponding part of the original dataset.

| Part name | Duration (h) | Samples num. | Size (MB) | Link to download |
|---|---|---|---|---|
| radio_source_1 | 50 | 34 356 | 837 | Radio1_50h.zip |
| radio_source_2 | 25 | 16 041 | 430 | Radio2_25h.zip |
| radio_source_3 | 25 | 8 933 | 418 | Radio3_25h.zip |
| radio_source_5 | 25 | 10 786 | 441 | Radio5_25h.zip |
| Ru-films | 25 | 7 358 | 380 | Ru_films_25h.zip |
| Total | 150 | 77 474 | 2 506 | |

The archives are hosted on Google Drive, so we provide ./download.sh to fetch them easily.

Requirements

You need gdown to run ./download.sh:

pip install gdown

Then run bash download.sh on your Linux machine.
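Once the archives are downloaded, they still need to be unpacked. A minimal sketch using only the Python standard library is shown below. It assumes the .zip files end up in a single directory; where download.sh actually saves them is not stated here, and the function name `extract_archives` and the target folder `zpoken_demo` are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(download_dir: str, target_dir: str) -> list[str]:
    """Extract every .zip in download_dir into target_dir; return archive names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # Assumption: the archives were saved to the current directory.
    for name in extract_archives(".", "zpoken_demo"):
        print(f"extracted {name}")
```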

Data structure

After unzipping each archive, you will find the following directory structure:

+---<DatasetPartName>
| +---data
| | +---subfolder1 (optional)
| | | +---speech_file1.opus
| | | +...
| | | \---speech_file[N].opus
| | +...
| | +---subfolder[N] (optional)
| | | +---speech_file1.opus
| | | +...
| | | \---speech_file[N].opus
| +---transcription.csv
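The layout above can be traversed with a few lines of Python. The following is a sketch, not an official loader: it collects the .opus files under data (with or without subfolders) and reads the first raw rows of transcription.csv. The column layout of transcription.csv is not documented here, so the code makes no assumption about it; the UTF-8 encoding and comma delimiter are assumptions, and both function names are hypothetical.

```python
import csv
from pathlib import Path


def list_audio_files(part_dir: str) -> list[Path]:
    """Return all .opus files under <part_dir>/data, including optional subfolders."""
    return sorted((Path(part_dir) / "data").rglob("*.opus"))


def read_transcription_rows(part_dir: str, limit: int = 5) -> list[list[str]]:
    """Return up to `limit` raw rows of transcription.csv without assuming a schema."""
    rows: list[list[str]] = []
    csv_path = Path(part_dir) / "transcription.csv"
    # Assumption: the file is UTF-8 encoded and comma-delimited.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(rows) >= limit:
                break
            rows.append(row)
    return rows
```

Printing the rows returned by `read_transcription_rows` for one unpacked part is a quick way to see which columns the file actually contains before writing a real data loader.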

Get the full dataset

If you are interested in the full version of the dataset, feel free to contact us via this form. We usually reply within one working day.

Future work

  • release more hours
  • optimize archive storage (Gdrive is too annoying)

License

CC-BY-4.0

Zpoken Dataset [RU] is licensed under a Creative Commons Attribution 4.0 International License.

speech-dataset's People

Contributors

makartroyan, zpoken
