Git Product home page Git Product logo

ocr4wikisource's Introduction

Version

1.54

#Note This program is evolving heavily. Always use the latest version. Compare the version number with your's version.

Feel free to report any errors in the issues section.

OCR4wikisource

There are many PDF files and DJVU files in WikiSource in various languages. In many wikisource projects, those files are splited into individual page as an Image, using proofRead extension.

Contributors see those images and type them manually.

This project helps the wikisource team to OCR the entire PDF or DJVU file, using the google drive OCR. Then it will update the relevant page in the wikisource with the text.

The OCR will not give 100% correct text. But still, it is better than typing entire page manually.

To run this, you need a GNU/Linux system. Sorry Windows guys. It uses the commandline utilities available in GNU/Linux heavily.

Check the file INSTALL for installation instructions

config.ini

Edit this file and fill the details as file_url, columns, wiki_usernmae, wiki_password, wikisource_language_code

do_ocr.py

Run this as

python do_ocr.py

This will do the following things.

  • download the file
  • if it is DJVU, convert to PDF
  • split it into individual files based on column no
  • create a folder in your Google Drive
  • upload the PDF files to Google Drive
  • Download the OCRed text
  • split, merge the text files properly to fit as the PDF files

This will give text files as 'text_for_page_00001.txt' etc equvalent to the no of pages in the PDF file.

mediawiki_uploader.py

run as

python mediawiki_uploader.py

This uploads the text files provided by do_ocr.py to the wikisource details you provided in config.ini

For testing you can keep only few files provided by do_ocr.py (example from text_for_page_00001.txt to text_for_page_00005.txt) Move all other text files to another folder. Once you are satisfied, you can place all the files in the current folder.

For this, the PDF or DJVU file should be already splitted into individual pages in wiki souce, using Proofread extension for wikisource.

Example :  "https://" + wikisource_language_code + ".wikisource.org/wiki/Page:" + filename/pageno

Contact your wikisource team for splitting the files into individual pages.

ToDo:

  • This program uplads single page PDF to google drive. Google accepts 10 page PDF files. Create 10 page pdf files, upload, split the text files for faster operations.
  • Give a web interface using Django or Flask.
  • Port to other Operating Systems like Windows.

Contact : T Shrinivasan [email protected]

ocr4wikisource's People

Contributors

bodhisattwawiki avatar jayantanth avatar kartikm avatar tshrinivasan avatar vvasuki avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.