speech-to-text-data-collection

A speech-to-text data collection pipeline built with Apache Kafka, Apache Spark, Airflow, and an S3 bucket.

Overview

This week, 10 Academy is your client. Recognizing the value of large data sets for speech-to-text models, seeing the opportunity in the many text corpora available for both Amharic and Swahili, and understanding that complex data engineering skills are valuable to your profile with employers, this week's task is simple: design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.

By the end of this project, you should produce a tool that can be deployed to post and receive text and audio files to and from a data lake, apply transformations in a distributed manner, and load the results into a warehouse in a format suitable for training a speech-to-text model.
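
The "post a sentence" half of the exchange described above could look roughly like the sketch below. It is only an illustration, not the repository's actual code: the topic names, the JSON message layout, and the `encode_sentence` and `post_sentence` helpers are all assumptions. The Kafka client is kept behind a plain `send` callable so the example runs without a broker.

```python
import json
import uuid
from typing import Callable, Tuple

# Hypothetical topic names; the repository may use different ones.
TEXT_TOPIC = "sentences"
AUDIO_TOPIC = "audio-recordings"


def encode_sentence(text: str, language: str) -> Tuple[bytes, bytes]:
    """Return a (key, value) pair ready to hand to a Kafka producer."""
    sentence_id = str(uuid.uuid4())
    payload = {"id": sentence_id, "language": language, "text": text}
    # ensure_ascii=False keeps Amharic characters readable in the payload.
    value = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return sentence_id.encode("utf-8"), value


def post_sentence(
    send: Callable[[str, bytes, bytes], None], text: str, language: str
) -> str:
    """Publish one sentence through send(topic, key, value) and return its id."""
    key, value = encode_sentence(text, language)
    send(TEXT_TOPIC, key, value)
    return key.decode("utf-8")
```

With a real client, `send` would be a thin wrapper around the producer. With kafka-python, for example, it could be `lambda topic, key, value: producer.send(topic, key=key, value=value)`. Using the sentence id as the message key is one way to let the returned audio recording be matched back to its sentence on the (assumed) audio topic.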

Project Structure

The repository contains Python scripts, Jupyter notebooks, PDFs, and text files. Here is their structure with a brief explanation.

Data

The purpose of this week's challenge is to build a data engineering pipeline that allows recording millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms.

There are a number of large text corpora we will use, but for testing the backend during development, you can use the recently released Amharic news text classification dataset with baseline performance:

IsraelAbebe/An-Amharic-News-Text-classification-Dataset

Alternative data: ready-made Amharic data collected from different sources.
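
Before news articles from a corpus like this can be shown to speakers, they need to be cut into individual sentences. The sketch below shows one possible way to do that for Amharic text. It is not taken from the repository: `split_sentences` is a hypothetical helper, and the choice of terminators (the Ethiopic full stop "።" plus "?" and "!") is an assumption that may need tuning for real data.

```python
import re
from typing import List

# A run of non-terminator characters followed by any terminators.
# "።" (U+1362) is the Ethiopic full stop.
_SENTENCE = re.compile(r"[^።?!]+[።?!]*")


def split_sentences(article: str) -> List[str]:
    """Split an article into trimmed, non-empty sentences."""
    # Typed text often writes the full stop as two "፡" (U+1361) characters;
    # normalize that to the single full-stop character first.
    normalized = article.replace("፡፡", "።")
    pieces = (match.strip() for match in _SENTENCE.findall(normalized))
    return [piece for piece in pieces if piece]
```

Each returned sentence keeps its closing punctuation, so it can be displayed to a speaker exactly as it appears in the source article.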

Usage

Docker-compose

Both the front end and the back end can be run in Docker containers.

1. Clone the repo

git clone https://github.com/GrpHu/speech-to-text-data-collection

2. cd into repo

cd speech-to-text-data-collection

3. Start the Docker containers:

docker-compose up -d

notebooks

  • [EDA.ipynb]: a Jupyter notebook for exploratory data analysis

scripts

tests:

  • the folder containing unit tests for components in the scripts

logs:

  • the folder containing log files (if it doesn't exist it will be created once logging starts)
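
The create-on-first-use behaviour of the logs folder can be sketched with the standard library alone. This is a minimal illustration, not the repository's own logging setup: the `get_logger` name, the `logs` default, and the log line format are assumptions.

```python
import logging
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger writing to <log_dir>/<name>.log, creating the folder if needed."""
    directory = Path(log_dir)
    # exist_ok=True makes this a no-op when the folder is already there.
    directory.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Avoid attaching a second file handler when called twice for the same name.
    if not logger.handlers:
        handler = logging.FileHandler(directory / f"{name}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A script would then call something like `get_logger("producer").info("sentence posted")`, and the first call creates the folder as a side effect.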

Contributors

Contributors list

License

MIT


Contributors

jedisam, bwibokhaabi, rafaesam, nahomfix, micky373, dagmawiii03, jeremy-tesh
