
tcm-ba's People

Contributors

anapaulagomes, laerte, raulbsantos


tcm-ba's Issues

Remove trailing periods from folder names

The trailing period prevents the files from being opened and from being uploaded through the S3 interface.

Example of such a name: Comprovantes bancários de valores de receita transferidos à entidade pela Prefeitura ou outro órgão.
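This is not the project's actual code, but a minimal sketch of the kind of normalization that would fix it (the normalize_folder_name helper is hypothetical): strip surrounding whitespace and any trailing periods before using the text as a directory name.

def normalize_folder_name(name: str) -> str:
    """Hypothetical helper: remove trailing periods and surrounding
    whitespace so the text can be used as a folder name and uploaded
    to S3 without issues."""
    return name.strip().rstrip(".")


# The example name from this issue ends with a period:
name = (
    "Comprovantes bancários de valores de receita transferidos "
    "à entidade pela Prefeitura ou outro órgão."
)
print(normalize_folder_name(name))  # printed without the trailing "."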

Publish the package on PyPI

That way other projects will be able to use the spiders (a minimal packaging sketch follows the acceptance criteria).

Acceptance criteria:

  • When a release is created, the package must be published to PyPI
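As a hedged starting point (the project may already package things differently), a minimal setup.py sketch with setuptools; all metadata values below are illustrative, only the package name tcmba comes from the tracebacks in this tracker:

# setup.py -- minimal packaging sketch, assuming setuptools; metadata is illustrative.
from setuptools import find_packages, setup

setup(
    name="tcmba",
    version="0.1.0",  # hypothetical version
    description="Spiders for TCM-BA's public consultation portal",
    packages=find_packages(),
    install_requires=["scrapy"],
    python_requires=">=3.8",
)

A release workflow could then build and upload the distribution (for example with python -m build and twine upload dist/*) whenever a release is created.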

Add filters

Users should be able to filter by the same fields as the web interface (a sketch of the corresponding spider arguments follows the list):

  • Periodicity (annual or monthly; the year or month/year must be provided)
  • City
  • Agency (optional; the default would be "all")
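A sketch of how the spider could expose these filters as Scrapy arguments; the class and argument names mirror the consulta_publica spider and the -a options used elsewhere in this tracker, but the defaults and validation below are assumptions:

from scrapy import Spider


class ConsultaPublicaSpider(Spider):
    name = "consulta_publica"

    def __init__(self, periodicidade=None, competencia=None,
                 cidade=None, orgao="todos", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Periodicity is required: annual queries take a year,
        # monthly queries take month/year (e.g. 09/2020).
        if periodicidade not in ("anual", "mensal"):
            raise ValueError("periodicidade must be 'anual' or 'mensal'")
        self.periodicidade = periodicidade
        self.competencia = competencia
        self.cidade = cidade
        # "todos" keeps the interface's default of searching every agency.
        self.orgao = orgao

It would then be driven by something like scrapy crawl consulta_publica -a periodicidade=mensal -a competencia=09/2020 -a cidade=feira-de-santana.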

Allow filtering by unit in the document scraper

We already receive the argument, but when we try to select a unit we get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/usr/local/lib/python3.8/site-packages/tcmba/spiders/consulta_publica.py", line 184, in get_search_results
    unit_payload = unit_payloads.pop(0)
IndexError: pop from empty list
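The traceback ends at unit_payloads.pop(0) on an empty list, so the unit filter never produced a payload. A hedged sketch of a guard (the function and names below are based on the traceback, not on the actual source):

import logging

logger = logging.getLogger(__name__)


def pop_unit_payload(unit_payloads, requested_unit):
    """Hypothetical guard around the pop that currently raises IndexError:
    return the first prepared payload, or None with a clear message when
    the requested unit did not match anything on the portal."""
    if not unit_payloads:
        logger.error(
            "No search payload was built for unit %r; check the argument "
            "against the unit names shown on the portal.",
            requested_unit,
        )
        return None
    return unit_payloads.pop(0)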

Save normalized values in the item

Units, categories, and file names are normalized when the folders and files are created, but not in the items, so the values stored in the items differ slightly from what ends up on disk. To make it easier to write scripts that combine the two (and to reduce inconsistencies in the code), we could store the already-normalized values in the items and reuse them when creating the files.
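A sketch of one way to do that with an item pipeline; the pipeline and the normalize_text stand-in below are hypothetical, the point is only that the same normalized value ends up both in the item and in the file path:

import unicodedata


def normalize_text(value: str) -> str:
    """Hypothetical stand-in for the project's normalization:
    trim whitespace and strip accents."""
    decomposed = unicodedata.normalize("NFKD", value.strip())
    return decomposed.encode("ascii", "ignore").decode("ascii")


class NormalizeFieldsPipeline:
    """Hypothetical Scrapy item pipeline: normalize the fields on the
    item itself so file creation can reuse exactly the same values."""

    FIELDS = ("unit", "category", "original_filename")

    def process_item(self, item, spider):
        for field in self.FIELDS:
            if item.get(field):
                item[field] = normalize_text(item[field])
        return item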

Duplicate objects in the document scrape

Some documents are duplicated in the scrape. The tests were run on a sample of 3674 items (09/2020); 64 of them were duplicates.

Command: scrapy crawl consulta_publica -a periodicidade=mensal -a competencia=09/2020 -o consulta-publica-09-2020.json

Example: TERMO DE APOSTILAMENTO 023-2020-1123-3.pdf

There is only one such document on the portal (for the given search parameters):
[screenshot]

Duplicate items:

{
    "category": "Documentos Adicionais",
    "filename": "abfd75c5-f600-4176-985d-5da0fa0e7ee0-TERMO DE APOSTILAMENTO 023-2020-1123-3.pdf",
    "original_filename": "TERMO DE APOSTILAMENTO 023-2020-1123-3.pdf",
    "inserted_by": "FLORIEDNA DA SILVA GOMES",
    "inserted_at": "20/10/2020",
    "unit": "Funda\u00e7\u00e3o Hospitalar de Feira de Santana",
    "crawled_at": "2021-04-09 03:42:49",
    "month": "09",
    "year": "2020",
    "period": "Mensal",
    "filepath": "/files/feira-de-santana/2020/mensal/09/Funda\u00e7\u00e3o Hospitalar de Feira de Santana/Documentos Adicionais/"
}
{
    "category": "Documentos Adicionais",
    "filename": "ff815183-b82c-49d6-84de-aef10b8968a9-TERMO DE APOSTILAMENTO 023-2020-1123-3.pdf",
    "original_filename": "TERMO DE APOSTILAMENTO 023-2020-1123-3.pdf",
    "inserted_by": "FLORIEDNA DA SILVA GOMES", 
    "inserted_at": "20/10/2020", 
    "unit": "Funda\u00e7\u00e3o Hospitalar de Feira de Santana", 
    "crawled_at": "2021-04-09 03:43:18", 
    "month": "09", 
    "year": "2020", 
    "period": "Mensal", 
    "filepath": "/files/feira-de-santana/2020/mensal/09/Funda\u00e7\u00e3o Hospitalar de Feira de Santana/Documentos Adicionais/"
}
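A hedged sketch of how the duplicates could be confirmed and removed from the exported JSON; the choice of key fields is an assumption (filename and crawled_at differ between the two copies above, so they are left out):

import json

# Fields that identify a document; this key choice is an assumption
# based on the duplicate pair shown above.
KEY_FIELDS = ("unit", "category", "original_filename", "inserted_at", "inserted_by")


def deduplicate(items):
    seen = set()
    unique = []
    for item in items:
        key = tuple(item.get(field) for field in KEY_FIELDS)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


with open("consulta-publica-09-2020.json", encoding="utf-8") as source:
    items = json.load(source)

unique_items = deduplicate(items)
print(f"{len(items) - len(unique_items)} duplicates out of {len(items)} items")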

Error when querying data that is not yet available

When running an annual query that has no data yet, the following error is returned:

    return next(self.data)
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/workspace/documentos-tcmba/tcmba/spiders/consulta_publica.py", line 180, in get_search_results
    yield FormRequest(**unit_payload)
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 106, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got NoneType
2021-03-11 17:47:27 [scrapy.core.engine] INFO: Closing spider (finished)
2021-03-11 17:47:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3268,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 1016725,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 15.951255,
 'finish_reason': 'finished',

To reproduce, just run the command with annual periodicity and the year 2020.
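The TypeError shows a None value reaching the FormRequest form data, which is what happens when the portal has nothing for the requested year. A hedged sketch of a pre-check (names and the example payload are assumptions, not the project's code):

def has_complete_formdata(unit_payload: dict) -> bool:
    """Hypothetical check to run before `yield FormRequest(**unit_payload)`:
    False when any form value is None, i.e. when the page did not contain
    the data the spider expected to extract."""
    formdata = unit_payload.get("formdata") or {}
    return all(value is not None for value in formdata.values())


# Illustrative payload with a missing value, as in an annual query
# for a year that has no data yet:
payload = {"url": "https://example.com/consultaPublica", "formdata": {"ano": None}}
if not has_complete_formdata(payload):
    print("No data available for this query; skipping the request.")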

Error when running an annual query for a city whose name contains accents

Traceback (most recent call last):
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/ana/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/home/ana/workspace/documentos-tcmba/tcmba/spiders/consulta_publica.py", line 297, in get_detailed_results
    filename=f"{uuid4()}-{self.normalize_text(texts[1])}",
IndexError: list index out of range

Printing the values gives:

[<Selector xpath='./td' data='<td colspan="5">No records found.</td>'>]  # columns
['No records found.']  # texts

I suspect this is caused by the accents. Inspecting the element I see SÃO GONÇALO DOS CAMPOS, but our crawler sends the parameter SAO GONCALO DOS CAMPOS.
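If the accents really are the cause, a hedged sketch of an accent-insensitive comparison, so the stripped parameter still matches the accented option shown on the portal (the helper names are hypothetical):

import unicodedata


def strip_accents(value: str) -> str:
    """Remove accents so 'SÃO GONÇALO DOS CAMPOS' and
    'SAO GONCALO DOS CAMPOS' compare as equal."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


def same_city(portal_option: str, crawler_param: str) -> bool:
    """Hypothetical matcher for the city argument against the portal's option."""
    return strip_accents(portal_option).casefold() == strip_accents(crawler_param).casefold()


print(same_city("SÃO GONÇALO DOS CAMPOS", "SAO GONCALO DOS CAMPOS"))  # True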
