unb-knedle / dodfminer

Data extractor of PDF documents from the Official Gazette of the Federal District, Brazil.

License: GNU General Public License v3.0

Python 99.80% Dockerfile 0.20%

hacktoberfest library

dodfminer's People

Contributors

skalwalker

dodfminer's Issues

[BUG] Wrong titles_subtitles hierarchy for DODF documents from 2020

The title/subtitle hierarchy is consistently broken when the TitleExtractor is run over those PDFs.
Specifically, this seems to happen when there are several titles on the same page.
Subtitles are affected in the same way.

Steps to reproduce the behavior:

1. Run title_extractor.ExtractorTitleSubtitle over the attached PDF.
2. Inspect the hierarchy property of the object and compare it (manually) with the expected hierarchy.

Expected behavior
The title/subtitle hierarchy is rebuilt correctly.

Additional context
Attached are:

the PDF file
the JSON with the output hierarchy, annotated

test issue

test-githubbot

[BUG] Imminent crash if FLEX_DATE fails to match

Missing dodf_data causes a crash in SemEfeitoAposentadoria

Expected behavior
If the FLEX_DATE regex does not match, the subsequent code will crash.

FILE:
https://github.com/UnB-KnEDLe/DODFMiner/blob/main/dodfminer/extract/polished/acts/sem_efeito_aposentadoria.py
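One way to avoid this kind of crash is to check the match object before using it. The following is a minimal sketch, not the project's code: the FLEX_DATE pattern below is a hypothetical stand-in, and find_dodf_date is an illustrative helper name.

```python
import re

# Hypothetical stand-in for FLEX_DATE; the real pattern lives in the project.
FLEX_DATE = r"(?P<date>\d{1,2}\s*/\s*\d{1,2}\s*/\s*\d{2,4})"


def find_dodf_date(text):
    """Return the first date found in text, or None when nothing matches."""
    match = re.search(FLEX_DATE, text)
    if match is None:
        # re.search returns None on failure; calling .group() on it
        # would raise AttributeError, which is the crash to avoid.
        return None
    # Drop the optional whitespace around the slashes.
    return re.sub(r"\s+", "", match.group("date"))


print(find_dodf_date("publicada no DODF de 12/ 03 /2019"))
print(find_dodf_date("no date in this text"))
```

Returning None lets the caller decide whether a missing date should be stored as an empty field or reported as an extraction error.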

Find a sample of the act in the DODF and consolidate similar acts (31-37)

Aviso 3ª SESSÃO
Publicação aviso julgamento final das propostas
Publicação Aviso Recurso Contra Propostas de Preço
Publicação Aviso Julgamento Recurso e Convocação 4ª Sessão
Publicação Aviso Julgamento Habilitação e Resultado Final
Publicação Aviso Recurso Contra Habilitação
Publicação Aviso Julgamento Recurso Habilitação

[FEATURE] Extract Title and Subtitle Information

Feature description
Given a list of titles and subtitles, create a JSON with information from the blocks of each DODF.

Additional context

PDF reading will be done with Tesseract.
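As a rough illustration of the idea, the sketch below splits already-extracted text lines into blocks keyed by known titles and serializes the result as JSON. The function name, the input shape (a list of text lines plus a list of titles), and the output layout are all assumptions made for the example.

```python
import json


def split_into_blocks(lines, titles):
    """Map each known title to the non-empty text lines that follow it."""
    known = set(titles)
    blocks = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped in known:
            # A known title starts (or resumes) a block.
            current = stripped
            blocks.setdefault(current, [])
        elif current is not None and stripped:
            blocks[current].append(stripped)
        # Lines that appear before the first title are ignored in this sketch.
    return blocks


lines = [
    "PODER EXECUTIVO",
    "first block of text",
    "",
    "SECRETARIA DE ESTADO DE SAÚDE",
    "second block of text",
]
titles = ["PODER EXECUTIVO", "SECRETARIA DE ESTADO DE SAÚDE"]

# ensure_ascii=False keeps accented characters readable in the output.
print(json.dumps(split_into_blocks(lines, titles), ensure_ascii=False, indent=2))
```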

Write the final documentation

Find a sample of the act in the DODF and consolidate similar acts (25-30)

AVISO PROSSEGUIMENTO LICITAÇÃO
AVISO REABERTURA LICITAÇÃO
Publicação Aviso 2ª Sessão –
Publicação Aviso Resultado Julgamento Propostas Técnicas
Publicação Aviso Recursos contra Resultado Julgamento Propostas Técnicas
Publicação Aviso Julgamento Recursos

Find a sample of the act in the DODF and consolidate similar acts (7-12)

Aviso Adiamento com Nova Abertura – CC n.º 01/2020 –
Aviso de Licitação –
Aviso de Alteração – Convocação 5.11.2019 –
Resultado de Recurso Fundac e Convocação
Aviso DODF
Aviso de Resultado – Habilitação e Convocação

[BUG] "List out of index" when reading some PDFs

Error when running the code ext = ExtratorTituloSubTitulo(file).
Reported error: list index out of range

Terminal output:
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 021 30-01-2019 SUPLEMENTO B.pdf list index out of range
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 002 08-01-2019 EDICAO EXTRA.pdf Não começa com títulos
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 005 15-01-2019 EDICAO EXTRA.pdf list index out of range
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 001 01-01-2019 EDICAO ESPECIAL.pdf list index out of range

Steps to reproduce the error:

Create an ExtratorTituloSubTitulo object, passing one of the PDFs listed in the error as the argument

Expected behavior:
Return an object with a dictionary containing the titles and subtitles of the PDF passed as the argument.

Vivobook laptop:

OS: Ubuntu 18.04.4 LTS
Python 3.6.9

[ENHANCEMENT] PDF2Image Configs via CLI

Allow PDF2Image settings to be passed via the CLI.

5.7. Challenge to the bid notice (Impugnação do edital)

[DOCUMENTATION] Update README

Update README.md file:

Fix broken links
Review file content and add new topics
Add Python version

Find a sample of the act in the DODF and consolidate similar acts (SEE-DF)

http://www.educacao.df.gov.br/licitacoes/

[ENHANCEMENT] CLI Flag for Low-Cost Extractions

It would be useful to have a CLI flag that extracts only PDFs below a certain size or page count. Large PDFs use a lot of memory, and some computers might not be able to handle them.
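A size-based filter could look roughly like the sketch below. The --max-size-mb flag and the helper names are hypothetical and are not existing DODFMiner options; a page-count limit would additionally need a PDF library and is left out here.

```python
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of a size-limited extraction")
    parser.add_argument("input_dir", type=Path, help="folder containing the PDFs")
    # Hypothetical flag name; not an existing DODFMiner option.
    parser.add_argument("--max-size-mb", type=float, default=None,
                        help="skip PDFs larger than this many megabytes")
    return parser


def select_pdfs(input_dir, max_size_mb=None):
    """Return the PDFs under input_dir, skipping files above the size limit."""
    pdfs = sorted(input_dir.rglob("*.pdf"))
    if max_size_mb is None:
        return pdfs
    limit = max_size_mb * 1024 * 1024
    return [pdf for pdf in pdfs if pdf.stat().st_size <= limit]


def main(argv=None):
    args = build_parser().parse_args(argv)
    for pdf in select_pdfs(args.input_dir, args.max_size_mb):
        print(pdf)
```

Calling main(["./data", "--max-size-mb", "5"]) would list only the PDFs of at most 5 MB under ./data, which is where the extraction step would then be applied.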

Identify the acts in the bidding process timeline

Find a sample of the act in the DODF and consolidate similar acts (1-6)

Aviso Resultado do Julgamento da Habilitação
Aviso Resultado de Julgamento das Propostas e Convocação
Aviso Interposição Recursos – Propostas Técnicas
Aviso Resultado Julgamento Propostas Técnicas
Aviso Convocação Segunda Sessão
Aviso Alteração Subcomissão Técnica

[BUG] with-titles option is not working

Describe the bug
Running dodfminer with the -t -with-titles option does not work.

To Reproduce
Steps to reproduce the behavior:

1. Make sure you have PDFs in the appropriate folders
2. Run dodfminer -t -with-titles

Expected behavior
A JSON file with the titles is created.

Desktop (please complete the following information):

OS: Linux
Python Version: 3.7

[FEATURE] Dynamically generated requirements.txt

It would be useful if the repo's requirements.txt could be generated locally instead of listing every project dependency, since some modules are independent of each other and should not rely on a single requirements.txt file.
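One possible approach, sketched here under the assumption that each module keeps its own small requirements file, is to merge only the files for the modules a user needs. The function names and file layout are illustrative.

```python
from pathlib import Path


def merge_requirements(paths):
    """Merge several requirements files into one sorted, de-duplicated list."""
    merged = set()
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blank lines and comments.
            if line and not line.startswith("#"):
                merged.add(line)
    return sorted(merged, key=str.lower)


def write_requirements(paths, output="requirements.txt"):
    """Write the merged list to output and return it."""
    lines = merge_requirements(paths)
    Path(output).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

De-duplication here is by exact line only, so two different version pins of the same package would both be kept and would need to be resolved by hand.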

extrator-titulo-subtitulo

A class should be written that extracts the titles/subtitles from DODF issues.

[ENHANCEMENT] JSON output in MongoDB

Describe the improvement you would like and the reasons for it.
Outputting a JSON file is a little unfriendly. Users who want to read it have to change the encoding manually, and the documents can get messy.

Solution you'd like
Save the output to a MongoDB database as BSON, with each PDF stored as one document. This would enable the development of a module for user interaction with the database.
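A minimal sketch of the "one document per PDF" idea follows. It assumes a pymongo-style collection object is passed in; the helper names and the document layout (the PDF name as _id, the extraction result under content) are illustrative assumptions.

```python
def to_document(pdf_name, extracted):
    """Wrap the extraction result of one PDF as a single database document."""
    return {"_id": pdf_name, "content": extracted}


def save_documents(collection, documents):
    """Insert documents into a pymongo-style collection; return the inserted ids."""
    if not documents:
        # insert_many expects a non-empty list, so skip the call entirely.
        return []
    result = collection.insert_many(documents)
    return result.inserted_ids


# With pymongo installed and a server running, the collection could come from
# (database and collection names are hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient("mongodb://localhost:27017")["dodfminer"]["dodfs"]
```

Using the PDF name as _id means inserting the same PDF twice fails with a duplicate key error, which may or may not be the desired behavior.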

[BUG] DODFMiner does not extract correctly "sem efeito aposentadoria"

Most properties are no longer present in the .csv produced by DODFMiner. Even the properties that do appear in the dataframes contain only NaN.

To Reproduce

dodfminer extract -s path_to_pdf sem_efeito_aposentadoria

Expected behavior
A dataframe that also contains the columns:

dodf_num
tonado_sem_efeito_publicacao
dodf_pagina
servidor
matricula
cargo
dodf_tipo_edicao

and without all the intermediate columns filled with NaN.

Use a word correction library in tesseract output

Redesign the DODFMiner dash

[BUG] Wrong title mounting

File: https://ufile.io/ycq50e9p
Page 9.
PROBLEM: Mixed titles:

SECRETARIA DE ESTADO DE OBRAS E INFRAESTRUTURA
SECRETARIA DE ESTADO DE DESENVOLVIMENTO URBANO E HABITAÇÃO

These titles must be separated.

Expected solution: take page column into account when mounting titles/subtitles.
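The proposed fix can be sketched as follows: assign each title line to a page column first, and only then join lines that are vertically close. This is an illustration under stated assumptions (a two-column page, coordinates that grow downward, and the Line type, mount_titles name, and max_gap threshold all invented for the example), not the extractor's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x0: float  # left edge of the text line
    y0: float  # top edge, growing downward (assumed convention)


def mount_titles(lines, page_width, max_gap=15.0):
    """Join consecutive title lines, but only within the same page column."""
    columns = {0: [], 1: []}
    for line in lines:
        columns[0 if line.x0 < page_width / 2 else 1].append(line)

    titles = []
    for index in (0, 1):
        previous = None  # reset so titles never merge across columns
        for line in sorted(columns[index], key=lambda item: item.y0):
            if previous is not None and line.y0 - previous.y0 <= max_gap:
                titles[-1] += " " + line.text  # continuation of a wrapped title
            else:
                titles.append(line.text)
            previous = line
    return titles


page = [
    Line("SECRETARIA DE ESTADO DE OBRAS", 50, 100),
    Line("SECRETARIA DE ESTADO DE DESENVOLVIMENTO", 320, 106),
    Line("E INFRAESTRUTURA", 50, 112),
    Line("URBANO E HABITAÇÃO", 320, 118),
]
for title in mount_titles(page, page_width=600):
    print(title)
```

Sorting the same four lines by vertical position alone would interleave the two titles, which is the mixing described in this issue.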

Find a sample of the act in the DODF and consolidate similar acts (13-18)

Aviso Errata 02 –
Aviso de Licitação
Aviso de Convocação para devolução de Invólucros
Aviso de Revogação
2.º Aviso Alteração Subcomissão Técnica
1.º Aviso Alteração Subcomissão Técnica

Title extraction

Create a file containing all the titles of the DODF PDFs from 2001 onward.

[ENHANCEMENT] Single pdf for test

Implement downloading a single PDF, for testing on a poor internet connection.

Extract raw text with tesseract

Improve the Abertura regex.

[ENHANCEMENT] Tesseract Configs via CLI

Add the ability to change Tesseract settings at run time.

Create comparison logic between SisEditais and regex extraction

[FEATURE] Correctly extract titles and subtitles

Extract the hierarchy information between titles and subtitles.

Create the CLI

Create the CLI for running the program

Find a sample of the act in the DODF and consolidate similar acts (19-24)

Aviso de reabertura
Aviso de suspensão
Aviso de nova abertura –
Aviso de suspensão –
AVISO SORTEIO SUBCOMISSÃO TÉCNICA
AVISO SUSPENSÃO LICITAÇÃO

[ENHANCEMENT] Move the title extraction files into the source

The title extraction files were misplaced and need to be refactored so that they are implemented as source code.

pylint

pylint was supposed to be used; however, which configuration file should be used?
I ran the tool on the extractor locally, but since the default indentation differs from mine, it gave a very negative score.
Passing --indent-string=' ' did not solve it, even though my TAB is set to 2 spaces in VSCode.

[BUG] title extractor crashing + hierarchy base

The problem was caused by a typo: extract_title_subtitle instead of extract_titles_subtitles.
This issue should be fixed ASAP: it is critical, because the current extractor version contains the bug and does not work.

[Find examples] 5.3. Publication of the bid notice (Divulgação do edital)

[BUG] Progress Bar crashes

The progress bar crashes when the download finishes.

[ENHANCEMENT] Remove global variables and functions

Encapsulate the core functions in a class to prevent user access

[ENHANCEMENT] Create a DODFMiner API

It has been determined that the Dash interface will be removed and a React page will be created to replace it. To use React as a front-end, an API is required.

The chosen solution is to create an API with Flask:

Design the routes
Develop and test the route functions
Analyze the quality of the route responses

Similarity Analysis for Titles and Subtitles

[ENHANCEMENT] Improve code readability

The code does not follow some documentation standards or PEP8.

Make the following improvements on the refactoring branch, in the title_extractor and title_filter files

Improvements:

Add the documentation missing from some functions, following the Google style (link in the README)
Do not declare types on parameters and return values
Clarify confusing anonymous functions
Reduce the large number of magic numbers
Rename variables that have unintuitive names
Break up one-line statements that have many clauses or details
Update requirements.txt with the added packages
Provide an example of how to run the existing code
Mention the use of other existing dependencies and describe how to install them

Divide raw text into JSON blocks

[TESTS] Fix failing tests on Extractor Polished Core

Currently there are two tests failing:

FAILED tests/test_extract_polished_helper.py::test_helper_xml_multiple - assert False
FAILED tests/test_run_extract.py::test_run_extract_input_folder_xml - AssertionError: assert False

Making these tests pass is important for improving coverage.