dodfminer's People

Contributors

abmhub, alvesisaque, andlq, avio11, ciriatico, davialvb, dependabot[bot], felipexbds, fepas, ianfpferreira, khalil09, knedle-unb, lacwerda, lary15, lpfgarcia, maffei2443, notopoloko, skalwalker, thiagodepaulo, victorlisboa, vitorararuna, vitorvvo

Forkers

skalwalker

dodfminer's Issues

[BUG] Wrong titles_subtitles hierarchy in DODF documents from 2020

The title/subtitle hierarchy is consistently broken when the TitleExtractor is run over those PDFs.
Specifically, this seems to happen when there are several titles on the same page.
Subtitles are affected in the same way.

Steps to reproduce the behavior:

  1. Run title_extractor.ExtractorTitleSubtitle over the attached PDF.
  2. Inspect the hierarchy property of the object and compare it (manually) with the expected hierarchy.

Expected behavior
Rebuild the title/subtitle hierarchy correctly.

Additional context
Attached are:

  • the PDF file
  • a JSON file with the output hierarchy, annotated

test-githubbot

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:
1.
2.
3.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. macOS]
  • Version [e.g. mojave]
  • Python Version [e.g. 3.8]

Additional context
Add any other context about the problem here.

Find a sample of the act in the DODF and consolidate similar acts (31-37)

Notice of 3rd session
Publication of notice of final judgment of the proposals
Publication of notice of appeal against price proposals
Publication of notice of appeal judgment and convening of the 4th session
Publication of notice of qualification judgment and final result
Publication of notice of appeal against qualification
Publication of notice of judgment of the qualification appeal

[BUG] "List index out of range" when reading some PDFs

Error when running the code ext = ExtratorTituloSubTitulo(file).
Reported error: list index out of range

Terminal output:
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 021 30-01-2019 SUPLEMENTO B.pdf list index out of range
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 002 08-01-2019 EDICAO EXTRA.pdf Não começa com títulos
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 005 15-01-2019 EDICAO EXTRA.pdf list index out of range
Erro na extração de titulos do pdf: ../data/dodfs/2019/01_Janeiro/DODF 001 01-01-2019 EDICAO ESPECIAL.pdf list index out of range

Steps to reproduce the error:

  1. Create an ExtratorTituloSubTitulo object, passing one of the PDFs listed in the error as its argument.

Expected behavior:
Return an object with a dictionary containing the titles and subtitles of the PDF passed as the argument.

erro_screen_shot.png

Vivobook laptop:

  • OS: Ubuntu 18.04.4 LTS
  • Python 3.6.9

[ENHANCEMENT] CLI flag for low-cost extractions

It would be useful to have a CLI flag that extracts only PDFs below a certain size or page count. Large PDFs use a lot of memory, and some computers might not be able to handle them.
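The size-based half of this request could be sketched as follows. This is a minimal illustration using only the standard library, not part of DODFMiner: the function name select_small_pdfs and the max_bytes parameter are hypothetical, and filtering by page count would additionally need a PDF library.

```python
import os


def select_small_pdfs(folder, max_bytes):
    """Return the sorted paths of PDFs under `folder` that are at most `max_bytes` in size.

    Hypothetical helper: a CLI flag could pass its value as `max_bytes`.
    """
    selected = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            # Ignore anything that is not a PDF.
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(root, name)
            # The file size is read from metadata, so the PDF is never loaded.
            if os.path.getsize(path) <= max_bytes:
                selected.append(path)
    return sorted(selected)
```

Because only file metadata is read, the filter itself costs almost no memory, which is the point of the request.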

Find a sample of the act in the DODF and consolidate similar acts (1-6)

Notice of result of the qualification judgment
Notice of result of the judgment of the proposals and convening
Notice of filing of appeals – technical proposals
Notice of result of the judgment of the technical proposals
Notice of convening of the second session
Notice of change to the technical subcommittee

[BUG] with-titles option is not working

Describe the bug
Running dodfminer with the -t -with-titles option does not work.

To Reproduce
Steps to reproduce the behavior:

  1. Be sure you have PDFs in the appropriate folders
  2. dodfminer -t -with-titles

Expected behavior
Create a JSON file with the titles.

Desktop (please complete the following information):

  • OS: Linux
  • Python Version: 3.7

[FEATURE] Dynamically generated requirements.txt

It would be useful if the repo's requirements.txt could be generated locally instead of listing every project dependency, since some modules are independent of each other and therefore should not rely on a single requirements.txt file.

[ENHANCEMENT] JSON output in MongoDB

Describe the improvement you would like and the reasons for it.
The JSON file output is somewhat unfriendly. Users who want to read it have to change the encoding manually, and documents can get messy.

Solution you'd like
Save the output to a MongoDB database as BSON, where each PDF is one instance. This would enable the development of a module for user interaction with the database.
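Independently of the MongoDB proposal, the encoding complaint can be illustrated with the standard json module. This sketch assumes the "unfriendly" output refers to non-ASCII characters being written as escape sequences, which the issue does not confirm; the helper name to_readable_json is hypothetical.

```python
import json


def to_readable_json(data):
    """Serialize `data` keeping accented characters readable.

    By default json.dumps escapes non-ASCII characters; ensure_ascii=False
    keeps them as-is, and indent=2 makes nested documents easier to scan.
    """
    return json.dumps(data, ensure_ascii=False, indent=2)
```

When writing the result to disk, the file would also need to be opened with encoding="utf-8" so the characters survive on every platform.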

[BUG] DODFMiner does not extract correctly "sem efeito aposentadoria"

Most properties are no longer present in the .csv produced by DODFMiner. Even the properties that do appear in the dataframes contain only NaN.

To Reproduce

dodfminer extract -s path_to_pdf sem_efeito_aposentadoria

Expected behavior
A dataframe that also contains the columns:

  • dodf_num
  • tonado_sem_efeito_publicacao
  • dodf_pagina
  • servidor
  • matricula
  • cargo
  • dodf_tipo_edicao

and that does not have all the intermediate columns filled with NaN.
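A check for both symptoms could be sketched as below. It is only an illustration built on the standard csv module: the column names are copied from the list above, while the function name check_extraction and the idea of passing the CSV as text are assumptions.

```python
import csv
import io

# Column names as listed in the issue.
EXPECTED_COLUMNS = [
    "dodf_num",
    "tonado_sem_efeito_publicacao",
    "dodf_pagina",
    "servidor",
    "matricula",
    "cargo",
    "dodf_tipo_edicao",
]


def check_extraction(csv_text):
    """Return (missing_columns, all_nan_columns) for the CSV content in `csv_text`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    rows = list(reader)
    # Expected columns that are absent from the header.
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    # Columns whose every value is empty or the literal text "nan".
    all_nan = [
        col
        for col in header
        if rows and all((row[col] or "").strip().lower() in ("", "nan") for row in rows)
    ]
    return missing, all_nan
```

An extraction that behaves as expected would return two empty lists.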

[BUG] Wrong title mounting

File: https://ufile.io/ycq50e9p
Page 9.
PROBLEM: Mixed titles:

bug-titulo

  • SECRETARIA DE ESTADO DE OBRAS E INFRAESTRUTURA
  • SECRETARIA DE ESTADO DE DESENVOLVIMENTO URBANO E HABITAÇÃO

These titles must be separated.

Expected solution: take page column into account when mounting titles/subtitles.
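The proposed fix can be sketched as follows. The representation is hypothetical: each title fragment is assumed to be a dict with "page", "column" and "text" keys, which is not necessarily how the extractor stores them, and a real implementation would probably also need the vertical distance between fragments to split distinct titles within one column.

```python
def mount_titles(fragments):
    """Join consecutive title fragments only when they share page and column.

    `fragments` is assumed to be in reading order; each item is a dict with
    the hypothetical keys "page", "column" and "text".
    """
    titles = []
    previous_key = None
    for fragment in fragments:
        key = (fragment["page"], fragment["column"])
        if titles and key == previous_key:
            # Same page and column: treat as a continuation of the current title.
            titles[-1] = titles[-1] + " " + fragment["text"]
        else:
            # Page or column changed: start a new title instead of mixing them.
            titles.append(fragment["text"])
        previous_key = key
    return titles
```

With the two titles from page 9 placed in different columns, this keeps them as two separate entries instead of one mixed title.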

Title extraction

Create a file containing all the titles of the DODF PDFs from 2001 onward.

pylint

It was agreed that pylint would be used; however, which configuration file should be used?
I ran the tool on the extractor locally, but since the default indentation differs from mine, it gave a very negative score.
Passing --indent-string=' ' did not solve it, even though my TAB is set to 2 spaces in VSCode.

[BUG] title extractor crashing + hierarchy base

The problem was caused by a typo: extract_title_subtitle instead of extract_titles_subtitles.
This issue should be fixed ASAP: it is a critical bug, and the current extractor version does not work because of it.

Create a DODFMiner API [ENHANCEMENT]

It has been determined that the Dash interface will be removed and a React page will be created to replace it. To use React as a front-end, an API is required.

The solution found is to create an API with Flask:

  • Architect the routes
  • Develop and test the route functions
  • Analyze the quality of the route responses

[ENHANCEMENT] Improve code readability

The code does not follow some documentation standards or PEP8.

Make the following improvements on the refactoring branch, in the title_extractor and title_filter files.

Improvements:

  • Missing documentation in some functions; follow the Google standard (link in the readme)
  • Not declaring types for parameters and return values
  • Clarify confusing anonymous functions
  • Reduce the large number of magic numbers
  • Improve variables with unintuitive names
  • Break up one-line statements with many clauses or details
  • Update requirements.txt with the added packages
  • Provide an example of how to run the existing code
  • Mention the use of other existing dependencies and describe their installation

[TESTS] Fix failing tests on Extractor Polished Core

Currently there are two tests failing:

  • FAILED tests/test_extract_polished_helper.py::test_helper_xml_multiple - assert False
  • FAILED tests/test_run_extract.py::test_run_extract_input_folder_xml - AssertionError: assert False

It is important to make the tests pass for coverage improvement.

Redesign and refactor the DODFMiner frontend [ENHANCEMENT]

Dash needs to be replaced. The current approach is inefficient, slow, and difficult to work on across multiple fronts.

To resolve this problem, it has been decided to replace Dash with a React page in the frontend.

Prerequisites:

  • List design requirements
  • Implement and test API requests
