handwriting-gathering-with-toloka's Introduction

Handwriting image dataset collection

This is an example of a simple pipeline for gathering handwriting images, implemented using Yandex.Toloka and the Yandex.Disk API.

The goal of this project is to collect images of handwritten text for a dataset that can be used to train and evaluate handwritten text recognition (HTR) models.

You can later enrich this dataset with bounding boxes for individual lines or words, using this tutorial from toloka-kit as an example.

The main code is provided in handwriting-gathering.ipynb. With the provided data and images, you can re-run the whole pipeline for French. The code can be reused for any other language, provided that sample photos are collected for training and the instructions for project #1 are translated into that language (optionally, you can use the English instructions, which are also provided).
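As a rough illustration of what reusing the pipeline for another language involves on the data side, the sketch below turns a list of sentences into deduplicated task input records and groups them into fixed-size batches. It does not reproduce the notebook's code: the function names, the "text" and "language" field names, and the batch size are all assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator, List


def make_task_inputs(sentences: Iterable[str], language: str) -> List[Dict[str, str]]:
    """Build one input record per unique, non-empty sentence.

    The "text" and "language" keys are hypothetical field names,
    not the ones used by the actual project specification.
    """
    seen = set()
    tasks = []
    for sentence in sentences:
        text = " ".join(sentence.split())  # collapse runs of whitespace
        if not text or text in seen:
            continue  # skip blanks and exact duplicates
        seen.add(text)
        tasks.append({"text": text, "language": language})
    return tasks


def batch(items: List[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each record could then be wrapped in whatever task object the crowdsourcing client expects; switching languages would only require a different sentence list and language code.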

Structure:

data/ contains sample sentences that can be reused instead of scraping
instructions/ contains instructions for the projects and images for them
img/ contains illustrations for the notebook
corpus.py contains a sample class for extracting sentences from a Wikipedia dump
requirements.txt contains all the requirements for the pipeline
handwriting-gathering.ipynb contains the main code for the pipeline
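To give a sense of what sentence extraction from Wikipedia text can look like, here is a minimal sketch using only the standard library. It is not the class from corpus.py: the SentenceExtractor name, its methods, and the word-count limits are assumptions for illustration, and the markup cleaning handles only simple, non-nested wiki syntax.

```python
import re
from typing import List


class SentenceExtractor:
    """Hypothetical helper: clean simple wiki markup and keep mid-length sentences."""

    def __init__(self, min_words: int = 4, max_words: int = 20) -> None:
        self.min_words = min_words
        self.max_words = max_words

    def clean(self, text: str) -> str:
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # drop non-nested templates
        # Replace [[target|label]] or [[target]] with the visible label.
        text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
        text = re.sub(r"'{2,}", "", text)  # strip bold/italic quote markers
        text = re.sub(r"<[^>]+>", "", text)  # strip inline HTML-like tags
        return " ".join(text.split())  # normalize whitespace

    def extract(self, text: str) -> List[str]:
        cleaned = self.clean(text)
        # Naive split after sentence-final punctuation followed by whitespace.
        candidates = re.split(r"(?<=[.!?])\s+", cleaned)
        return [
            s for s in candidates
            if self.min_words <= len(s.split()) <= self.max_words
        ]
```

The length filter reflects one plausible design choice: very short fragments give little handwriting to recognize, while very long sentences are tedious to copy by hand. The punctuation-based split will mis-handle abbreviations, so a production version would likely use a proper sentence tokenizer.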

tardis-forever / handwriting-gathering-with-toloka