comprasnet_captcha_breaker

This Python module automates the download of multiple auction reports from Comprasnet. It uses Convolutional Neural Networks (CNNs) to solve CAPTCHAs and selenium to interact with the web browser. The user must provide a list of the 16-digit codes identifying the auctions of interest.
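As a minimal sketch of preparing that input list, the snippet below checks that every entry is a 16-digit string before any scraping starts. The function name `clean_auction_codes` is hypothetical and not part of the package, and the codes in the usage note are made-up placeholders.

```python
import re
from typing import Iterable, List

# A code is assumed to be exactly 16 ASCII digits, as stated in this README.
_CODE_PATTERN = re.compile(r"[0-9]{16}")


def clean_auction_codes(raw_codes: Iterable[str]) -> List[str]:
    """Strip whitespace, drop duplicates (keeping order), reject malformed codes."""
    cleaned: List[str] = []
    seen = set()
    for raw in raw_codes:
        code = str(raw).strip()
        if not _CODE_PATTERN.fullmatch(code):
            raise ValueError(f"not a 16-digit auction code: {raw!r}")
        if code not in seen:
            seen.add(code)
            cleaned.append(code)
    return cleaned
```

For example, `clean_auction_codes([" 0123456789012345 ", "0123456789012345"])` keeps a single cleaned entry, while a shorter string such as `"12345"` raises `ValueError`.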

Comprasnet and public procurement data in Brazil

In Brazil, federal-level public procurement of ordinary goods and services must take place through online auctions. One of the most widely used platforms for such procurement auctions is Comprasnet.

There is a useful API for accessing data on procurement auctions, such as relevant dates, winning bids, and the goods and services purchased, including detailed descriptions and quantities.

Nevertheless, much other important information cannot be accessed via the API and is available only in the procurement records (atas). This includes the complete list of bids, with value, bidder and timestamp; detailed information about the auction phases and events; and all communication between bidders and the procurement official (pregoeiro).

The procurement records are available online as HTML pages at a different address, outside the API. If a user wants to access information on a procurement auction, she must type in the search parameters, navigate through several pages and, most importantly, solve a CAPTCHA. The whole process must be repeated for every procurement auction.

This is not aligned with the principles of government transparency and makes it harder to analyze data from multiple procurement auctions.

Types of captchas

After the user types in the search parameters and navigates through some pages, she must solve a captcha on the webpage preceding the procurement report. Five different types of captchas appear alternately on that page. These are:

bubble, bubble_cut, dotted, dotted_wave, wave

Convolutional Neural Networks (CNNs)

Character Recognition

The package includes four trained CNNs, each designed to predict characters from a specific captcha type. As of now, dotted_wave captchas are not supported.

Captcha classification

In addition to the models that perform character recognition, the repository also contains a CNN to classify captchas according to the 5 types listed above.

The idea is to use this model to predict a captcha's type, which allows the selection of the suitable character recognition CNN.
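The selection step can be sketched as a simple dispatch. This is only an illustration under stated assumptions: `solve_captcha`, `classify`, `segmenters` and `recognizers` are hypothetical names, and the repository's actual models and segmentation functions would have to be wrapped as plain callables to fit this shape.

```python
from typing import Callable, Dict, Sequence

# Captcha types with a character recognition model, per this README.
# dotted_wave is deliberately absent: it has no model yet.
SUPPORTED_TYPES = ("bubble", "bubble_cut", "dotted", "wave")


def solve_captcha(
    image: object,
    classify: Callable[[object], str],
    segmenters: Dict[str, Callable[[object], Sequence[object]]],
    recognizers: Dict[str, Callable[[object], str]],
) -> str:
    """Predict the captcha type, then read it with the matching model."""
    captcha_type = classify(image)  # e.g. "wave"
    if captcha_type not in SUPPORTED_TYPES:
        raise ValueError(f"no character recognition model for {captcha_type!r}")
    # Segment with the type-specific algorithm, then predict each character.
    characters = segmenters[captcha_type](image)
    return "".join(recognizers[captcha_type](ch) for ch in characters)
```

Keeping the classifier and the per-type models behind plain callables makes it easy to plug in a new type later, for instance once dotted_wave is supported.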

Character segmentation

Since the models are trained to predict individual characters, character segmentation must be performed beforehand. Thus, the repository also includes Python modules and functions with simple algorithms for isolating the characters in captcha images.

The algorithms are specific to each captcha type (with the exception of bubble and bubble_cut, which are handled by a single algorithm/function).
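To give an idea of what a simple segmentation algorithm can look like, here is a generic sketch that splits a binarized image at ink-free columns. It is not the repository's algorithm for any captcha type, and the function names and the nested-list image format are assumptions made to keep the example self-contained.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def find_character_spans(image: Image, min_width: int = 1) -> List[Tuple[int, int]]:
    """Return (start, stop) column ranges separated by ink-free columns."""
    width = len(image[0]) if image else 0
    has_ink = [any(row[col] for row in image) for col in range(width)]
    spans: List[Tuple[int, int]] = []
    start = None
    # The trailing False is a sentinel that closes a run touching the right edge.
    for col, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = col
        elif not ink and start is not None:
            if col - start >= min_width:  # ignore runs narrower than min_width
                spans.append((start, col))
            start = None
    return spans


def crop_characters(image: Image, min_width: int = 1) -> List[Image]:
    """Cut the image into one sub-image per detected column range."""
    return [[row[a:b] for row in image]
            for a, b in find_character_spans(image, min_width)]
```

A column-gap split like this only works when characters do not touch or overlap, which is one reason why each captcha type needs its own approach.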

Currently there is no algorithm for splitting dotted_wave captchas, which is why there is no model for this captcha class. Help in this regard would be much appreciated.

Web scraping

The package relies on selenium and ChromeDriver to interact with the pages, type in search parameters, break the captchas and save the HTML sources.

Reading captchas from the browser into the Python session

It is not possible to save the image directly from its URL, because a new captcha is generated each time the URL is accessed. Thus, if one opens the browser, gets to the captcha page and then uses the image URL to store the image on disk or load it into the Python session, the retrieved image will not match the one that must be solved.

To circumvent this, we use selenium to simulate a right click on the captcha image, which opens the context menu, and then simulate keystrokes to select the option that copies the image to the clipboard. Once in the clipboard, the image can be easily loaded into the Python session.

The main disadvantage of this approach is that the computer cannot be used for anything else while web scraping, since the simulated keystrokes are sent to whichever window is open. One solution is to run the program on a virtual machine.

Finally, on Linux systems the keystrokes are simulated with the xte command, from the xautomation package.
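A minimal sketch of driving xte from Python is shown below, assuming the context menu is already open after the simulated right click. The function names are hypothetical, and the key sequence in the usage note is only a guess: the number of Down presses needed to reach the copy-image entry depends on the browser and its version.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_xte_command(keys: Sequence[str], pause_us: int = 200_000) -> List[str]:
    """Build an xte argument list that presses each key, pausing in between.

    Keys are X keysym names such as "Down" or "Return".
    """
    argv = ["xte"]
    for key in keys:
        argv.append(f"key {key}")          # press and release the key
        argv.append(f"usleep {pause_us}")  # give the menu time to react
    return argv


def press_keys(keys: Sequence[str], pause_us: int = 200_000) -> None:
    """Send the keystrokes to the currently focused window via xte."""
    if shutil.which("xte") is None:
        raise RuntimeError("xte not found; install the xautomation package")
    subprocess.run(build_xte_command(keys, pause_us), check=True)
```

A call such as `press_keys(["Down", "Down", "Return"])` would move two entries down the menu and confirm. Because the keystrokes go to whichever window has focus, the browser must stay in the foreground while this runs.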
