pedrobarcha / context-spelling-correction

Given a text, wrap it into phrases and send them to Yandex's search engine. If a query yields a "did you mean:", replace the original phrase with the suggestion. The software was originally developed for correcting OCR output.

License: GNU General Public License v3.0

Python 100.00%
spelling-correction ocr-post-processing context-aware context-awareness online-spelling-correction

context-spelling-correction's Introduction

THE PROJECT

Currently, the program wraps one or more text files into 8-word blocks and queries Yandex with each block. If Yandex suggests a spelling correction (like Google's "did you mean"), the original phrase is substituted by the suggestion. Otherwise, the phrase remains unchanged in the file.
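
The wrap-and-substitute loop described above can be sketched as follows. This is a minimal illustration, not the code in this repo: the names wrap_into_blocks, correct_blocks and suggest are hypothetical, and suggest stands in for the actual Yandex query (here it is replaced by an offline dictionary lookup). Note that this sketch does not preserve the original line breaks.

```python
def wrap_into_blocks(text, block_size=8):
    """Split text into consecutive blocks of at most block_size words."""
    words = text.split()
    return [" ".join(words[i:i + block_size])
            for i in range(0, len(words), block_size)]


def correct_blocks(blocks, suggest):
    """Replace each block by its suggestion, when there is one.

    `suggest` is a hypothetical callable standing in for the Yandex query:
    it returns a corrected phrase, or None when nothing is suggested.
    """
    corrected = []
    for block in blocks:
        suggestion = suggest(block)
        # Keep the original block when no suggestion was made.
        corrected.append(suggestion if suggestion else block)
    return corrected


if __name__ == "__main__":
    # Offline stand-in for the search engine, for demonstration only.
    fake_suggestions = {"teh quick brown fox": "the quick brown fox"}
    blocks = wrap_into_blocks("teh quick brown fox jumps over", block_size=4)
    print(" ".join(correct_blocks(blocks, fake_suggestions.get)))
```

Passing the lookup as a callable keeps the wrapping logic testable without network access or an API key.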

DEPENDENCIES

All you need is Python 2 and its standard library.

SET UP

  1. Get an API key at https://tech.yandex.com/xml/ ;
  2. To be granted 10,000 queries/day, confirm your mobile phone number at https://passport.yandex.com/profile ;
  3. Run git clone https://github.com/PedroBarcha/Context-Spelling-Correction.git to clone the repo;
  4. From the terminal, navigate to the repo directory and run python config.py to set your API key and username.

USAGE

From the terminal, inside the repo's directory, run: python correction.py PATH_TO_THE_FILE . To correct several files stored in the same directory, use instead: python correction.py PATH_TO_THE_DIR/* .

IMPORTANT

Before running the program, always make sure that your correct IP address is set at https://xml.yandex.com/settings/ .

OUTPUT

  • For every file specified in the input, a file with the same name and extension ".corrected" will be generated.
  • A single file named "yandex_suggestions.txt" is produced, containing all of Yandex's suggestions for your file(s), followed by runtime statistics at the end.
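
As a rough illustration of the naming scheme above, the sketch below pairs each input path with an output path. It assumes that ".corrected" is appended to the full file name (so "page1.txt" becomes "page1.txt.corrected"); the text above does not say whether the extension is appended or replaces the existing one, so check the generated files. The names corrected_path and plan_outputs are hypothetical.

```python
SUGGESTIONS_FILE = "yandex_suggestions.txt"  # single file shared by all inputs


def corrected_path(input_path):
    """Return the assumed output path for one input file."""
    # Assumption: the extension is appended, not substituted.
    return input_path + ".corrected"


def plan_outputs(input_paths):
    """Pair each input path with its assumed output path."""
    return [(path, corrected_path(path)) for path in input_paths]


if __name__ == "__main__":
    import sys
    # With PATH_TO_THE_DIR/*, the shell expands the wildcard, so every
    # matching file arrives as a separate command-line argument.
    for source, target in plan_outputs(sys.argv[1:]):
        print(source + " -> " + target)
```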

NOTES

  • According to the API's documentation, you need to convert special characters into escape sequences. However, the spell check field in Yandex's XML response is ASCII (bug?), not UTF-8. So we don't convert the queries before sending them, simply because it makes no difference for us. What this means is: any UTF-8 character in your text will be reduced to an ASCII equivalent if a suggestion is made by Yandex (e.g. “ and ” turn into ", and — becomes -). Unfortunately, nothing can be done about it, since it is an API issue.
  • Book pages often contain a hyphen (-) splitting a single word where it reaches the end of a line. In cases like this, Yandex usually suggests the whole word joined together (without the hyphen) as a spelling correction, e.g. "unfor- tunate" becomes "unfortunate". If you don't want that, you can easily reprogram it to check whether the query sent to Yandex, with the hyphen removed, is identical to Yandex's suggestion for it. If so, you can keep the original query instead of the suggestion.
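
The hyphen check suggested in the last note could look roughly like this. It is a sketch under assumptions, not part of the program: join_hyphenated and choose_phrase are hypothetical names, and a line-break hyphen is assumed to be a hyphen between two word characters followed by whitespace.

```python
import re

# A hyphen directly after a word character, followed by whitespace and
# another word character, as in "unfor- tunate".
_LINE_BREAK_HYPHEN = re.compile(r"(?<=\w)-\s+(?=\w)")


def join_hyphenated(phrase):
    """Remove line-break hyphens, joining the two halves of each word."""
    return _LINE_BREAK_HYPHEN.sub("", phrase)


def choose_phrase(query, suggestion):
    """Decide which phrase to write to the corrected file."""
    if suggestion is None:
        return query
    # The suggestion only removed line-break hyphens: keep the original.
    if join_hyphenated(query) == suggestion:
        return query
    return suggestion


if __name__ == "__main__":
    print(choose_phrase("an unfor- tunate event", "an unfortunate event"))
    print(choose_phrase("an unfor- tunate evnt", "an unfortunate event"))
```

With this rule, a suggestion that fixes a real misspelling in the same block is still accepted, and in that case the hyphen is removed along with it.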

TODO

  • Write a new algorithm for wrapper.py. Currently the program wraps the text into N-word phrases and queries Yandex with them. However, this is problematic, as a fixed-size block doesn't really capture the context of a phrase. The idea of the new algorithm is to wrap the text according to the periods (full stops) it contains.
  • After the above is done, write a setup.py and publish the program on PyPI.
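
One possible starting point for the period-based wrapping mentioned in the first item is sketched below. It is only an illustration of the idea, not the planned algorithm: wrap_into_sentences is a hypothetical name, and the split is deliberately naive, so abbreviations such as "e.g." are also treated as sentence ends.

```python
import re


def wrap_into_sentences(text):
    """Split text into phrases that each end at a period."""
    # Split on whitespace that directly follows a period; the period
    # itself stays attached to the phrase before it.
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    sample = "The frst sentence. A second one follows. Trailing words"
    for phrase in wrap_into_sentences(sample):
        print(phrase)
```

Very long sentences may still need to be cut further before being sent as a query; that limit is not covered here.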

CONTRIBUTORS

pedrobarcha, pphbc13
