
csv2es's People

Contributors

rholder


csv2es's Issues

Out of memory

I am trying to index a huge CSV file (~35 GB). I have set --docs-per-chunk to 500 and --bytes-per-chunk to 100000, but RAM still fills up steadily as the CSV file is read. Any thoughts on this?

I am working on a machine with 16 GB of RAM and a 4-core processor.

Regards,
Vijay Raajaa G S
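One way to keep memory bounded, sketched below under the assumption that the file can be read lazily: iterate over the CSV one row at a time and only hold the current chunk in memory. The function names (stream_docs, index_chunk) are illustrative, not csv2es internals.

    import csv

    def stream_docs(path, delimiter=','):
        """Yield one dict per CSV row; only the current row stays in memory."""
        with open(path, newline='') as handle:
            for row in csv.DictReader(handle, delimiter=delimiter):
                yield row

    def index_chunk(chunk):
        """Placeholder for a bulk-index call (e.g. the Elasticsearch bulk API)."""
        print('would index %d docs' % len(chunk))

    def index_in_chunks(path, docs_per_chunk=500):
        chunk = []
        for doc in stream_docs(path):
            chunk.append(doc)
            if len(chunk) >= docs_per_chunk:
                index_chunk(chunk)
                chunk = []  # release the chunk so memory stays bounded
        if chunk:
            index_chunk(chunk)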

CSV with Date Attributes

Hi,

I'm using a @timestamp field in my CSV file, but this field is not recognized.
I converted the timestamp to the 'yyyy-MM-dd'T'HH:mm:ss.SSSZ' date format, but I still have the same issue.

The problem is that Elasticsearch/Kibana read the value as a string, while I want it processed as a date.

Thanks in advance for your answer.

Best,
Reda
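One hedged workaround, assuming Elasticsearch 1.x/2.x and the requests library: put an explicit date mapping on the document type before running csv2es, so @timestamp is indexed as a date rather than a string. The host, index name, and doc type below are placeholders.

    import json
    import requests

    ES = 'http://localhost:9200'
    INDEX, DOC_TYPE = 'myindex', 'csv_type'  # placeholders; use your own names

    # Create the index, then map @timestamp as a date before bulk loading.
    requests.put('%s/%s' % (ES, INDEX))
    mapping = {
        DOC_TYPE: {
            'properties': {
                '@timestamp': {
                    'type': 'date',
                    'format': "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
                }
            }
        }
    }
    resp = requests.put('%s/%s/_mapping/%s' % (ES, INDEX, DOC_TYPE),
                        data=json.dumps(mapping),
                        headers={'Content-Type': 'application/json'})
    print(resp.status_code, resp.text)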

Document bash completion in a usable way

Here is how we activate bash completion for csv2es after pip installing it:

eval "$(_CSV2ES_COMPLETE=source csv2es)"

This seems to work fine, but presumably we'll install this in a virtualenv and keep it around for a while, so what's the best way to get completion working at install time?

An alternative is to generate the completion script and drop it somewhere for sourcing (but where?):

_CSV2ES_COMPLETE=source csv2es > csv2es-complete.sh

There are more details here.

Support for quotes in CSV

Nice tool.

But I am surprised it does not recognize quoted CSV files; to me that is an absolutely standard option. Instead, it uploads all data as strings if you have a quoted CSV.
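For reference, a minimal sketch of how quoted fields can be parsed with the standard csv module instead of a plain split; the file name, delimiter, and quote character are assumptions.

    import csv

    # csv.reader honors quoting, so a field like "a, quoted, value" stays one
    # field instead of being split on the embedded delimiters.
    with open('data.csv', newline='') as handle:
        reader = csv.reader(handle, delimiter=',', quotechar='"')
        header = next(reader)
        for row in reader:
            doc = dict(zip(header, row))
            print(doc)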

Use Poetry for requirements package version management

Hi friend,

I would like to suggest refactoring this package to use Poetry. I had some issues where installing csv2es broke my other installed packages; I fixed it by removing the pinned versions in requirements.txt.

Take a look at:
My package using Poetry: https://gitlab.com/israel.oliveira.softplan/legal-pre-processing
Some good references and tutorials:
https://johnfraney.ca/posts/2019/05/28/create-publish-python-package-poetry/
https://towardsdatascience.com/how-to-publish-a-python-package-to-pypi-using-poetry-aa804533fc6f

Your code is great and very useful!
Thanks,
Israel.

Add support for selecting empty string as null

When the CSV contains an empty string and that field is mapped as a date type in Elasticsearch 1.7, we get an exception from ES:
([{u'create': {u'status': 400, u'_type': u'csv_type', u'_id': u'AVgBSdxaLywLgoALVw4Q', u'error': u'MapperParsingException[failed to parse [ticket_metrics_created_at_hour]]; nested: MapperParsingException[failed to parse date field [], tried both date format [YYYY-MM-DD HH:mm:ss], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: ""]; ', u'_index': u'test_0_delete'}}
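A hedged pre-processing sketch: dropping keys whose value is an empty string before the document reaches Elasticsearch avoids the date-parse failure. The field names are illustrative.

    def nullify_empty_strings(doc):
        """Drop keys with an empty-string value so ES never tries to parse them."""
        return {key: value for key, value in doc.items() if value != ''}

    row = {'ticket_id': '42', 'ticket_metrics_created_at_hour': ''}
    print(nullify_empty_strings(row))  # {'ticket_id': '42'}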

Add docs to same index

How do I read a new CSV file and store the records in an already existing index using this tool? If I try running the command, I get this:
raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (400, {u'index': u'test-index', u'root_cause': [{u'index': u'test-index', u'reason': u'already exists', u'type': u'index_already_exists_exception'}], u'type': u'index_already_exists_exception', u'reason': u'already exists'})

I want an --update-index option; is this possible with the current version?
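One hedged workaround using the pyelasticsearch client the tool already depends on: treat "index already exists" as non-fatal and keep indexing into the existing index. Whether csv2es should expose this as an --update-index flag is a separate question; the snippet only illustrates the idea.

    from pyelasticsearch import ElasticSearch
    from pyelasticsearch.exceptions import ElasticHttpError

    es = ElasticSearch('http://localhost:9200/')
    index_name = 'test-index'  # placeholder

    try:
        es.create_index(index_name)
    except ElasticHttpError as err:
        # Reuse the existing index instead of failing outright.
        print('index already exists, continuing: %s' % err)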

csv2es.py AttributeError

I'm using Python 3 from Anaconda and have created the potatoes example CSV file. I am able to create the potatoes index, but it won't load the data. I've tried different CSV files and indexes and still get the same result. I get the following error:

Traceback (most recent call last):
  File "/anaconda/bin/csv2es", line 11, in <module>
    sys.exit(cli())
  File "/anaconda/lib/python3.6/site-packages/click/core.py", line 664, in __call__
    return self.main(*args, **kwargs)
  File "/anaconda/lib/python3.6/site-packages/click/core.py", line 644, in main
    rv = self.invoke(ctx)
  File "/anaconda/lib/python3.6/site-packages/click/core.py", line 837, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/anaconda/lib/python3.6/site-packages/click/core.py", line 464, in invoke
    return callback(*args, **kwargs)
  File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 211, in cli
    perform_bulk_index(host, index_name, doc_type, documents, docs_per_chunk, bytes_per_chunk, parallel)
  File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 107, in perform_bulk_index
    bytes_per_chunk=bytes_per_chunk))
  File "/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 652, in __call__
    for function, args, kwargs in iterable:
  File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 104, in <genexpr>
    delayed(local_bulk)(host, index_name, doc_type, chunk)
  File "/anaconda/lib/python3.6/site-packages/pyelasticsearch/utils.py", line 31, in bulk_chunks
    for action in actions:
  File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 57, in all_docs
    fieldnames = doc_file.next().strip().split(delimiter)
AttributeError: '_io.BufferedReader' object has no attribute 'next'
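The traceback points at a Python 2 idiom: file objects in Python 3 have no .next() method. A hedged sketch of what the fix could look like, assuming the file is opened in text mode; the file name and delimiter are placeholders.

    # Python 2 only:
    # fieldnames = doc_file.next().strip().split(delimiter)

    # Portable: the built-in next() works on both Python 2 and 3, and opening in
    # text mode ('rt') yields str lines instead of bytes.
    with open('potatoes.csv', 'rt') as doc_file:
        fieldnames = next(doc_file).strip().split(',')
        print(fieldnames)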

Header type problem

Trying:

python csv2es.py --index-name test --doc-type test --import-file test.csv

gives:

Using host: http://127.0.0.1:9200/
Traceback (most recent call last):
  File "csv2es.py", line 215, in <module>
    cli()
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
    return callback(*args, **kwargs)
  File "csv2es.py", line 197, in cli
    es.create_index(index_name)
  File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 92, in decorate
    return func(*args, query_params=query_params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 1005, in create_index
    query_params=query_params)
  File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 257, in send_request
    self._raise_exception(status, error_message)
  File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 271, in _raise_exception
    raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (406, u'Content-Type header [] is not supported')

I'm not an Elasticsearch expert, so forgive me if this is something silly.
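For context, a hedged guess at the cause: Elasticsearch 6.x and later rejects requests that carry a body but no Content-Type header, and older pyelasticsearch releases do not send one. Creating the index directly with an explicit header shows the difference; the host and index name are placeholders.

    import requests

    # Newer Elasticsearch requires Content-Type on any request with a body;
    # requests' json= parameter sets Content-Type: application/json automatically.
    resp = requests.put('http://127.0.0.1:9200/test',
                        json={'settings': {'number_of_shards': 1}})
    print(resp.status_code, resp.text)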

Indexing CSV to AWS ES

Hi,

Is it possible to use csv2es to index data into an AWS ES instance? Can I pass in a dictionary of parameters? I am currently using an AWS4Auth object in my Python script to access ES.
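csv2es itself takes a plain host URL, so signed AWS requests would have to go through a signing-aware client. A hedged sketch, assuming the requests-aws4auth package, of sending a signed request yourself; the credentials, region, and endpoint are placeholders.

    import requests
    from requests_aws4auth import AWS4Auth

    awsauth = AWS4Auth('ACCESS_KEY', 'SECRET_KEY', 'us-east-1', 'es')  # placeholders
    endpoint = 'https://search-mydomain.us-east-1.es.amazonaws.com'    # placeholder

    # Any request made with auth=awsauth is SigV4-signed for the ES service.
    resp = requests.get('%s/_cat/indices?v' % endpoint, auth=awsauth)
    print(resp.text)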

Support for geo_point

It would be nice to be able to indicate two field names in the CSV file that together represent a geo_point.
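A hedged sketch of what that could look like, assuming the CSV carries separate lat and lon columns; all names below are placeholders.

    # Mapping sketch: a single geo_point field fed from two CSV columns.
    mapping = {
        'csv_type': {
            'properties': {
                'location': {'type': 'geo_point'},
            }
        }
    }

    def add_location(doc, lat_field='lat', lon_field='lon'):
        """Fold two CSV columns into one geo_point-style value."""
        doc['location'] = {'lat': float(doc.pop(lat_field)),
                           'lon': float(doc.pop(lon_field))}
        return doc

    print(add_location({'name': 'spot', 'lat': '40.7', 'lon': '-74.0'}))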

Add option to stream from stdin

Sometimes we have things that are compressed, like bananas.tsv.gz, or require some other transformation that can be computed on the fly from another tool. It would be nice to be able to accept a stream from stdin instead of just an input file.
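A hedged sketch of the usual convention: treat '-' as stdin so the tool composes with zcat and friends. The flag handling is illustrative, not current csv2es behaviour.

    import sys

    def open_input(import_file):
        """Return a readable text handle: stdin for '-', a regular file otherwise."""
        if import_file == '-':
            return sys.stdin
        return open(import_file, 'rt')

    # Usage idea (illustrative): zcat bananas.tsv.gz | csv2es --import-file - ...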

Support for array and list types?

Does csv2es support 'array' and 'list' types? I was trying something like the following, and the result didn't seem to be compatible. Any ideas?

id, item, description, keywords
1, 'a big item', 'keyword1', 'keyword2', keyword3'

Thanks, Jeff
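CSV itself has no list type, so the usual workaround is to pack the keywords into one column with a secondary delimiter and split that column into a list before indexing. A hedged sketch, with the semicolon delimiter as an assumption:

    def split_keywords(doc, field='keywords', sep=';'):
        """Turn 'keyword1;keyword2;keyword3' into a list for Elasticsearch."""
        if doc.get(field):
            doc[field] = [item.strip() for item in doc[field].split(sep)]
        return doc

    row = {'id': '1', 'item': 'a big item', 'keywords': 'keyword1;keyword2;keyword3'}
    print(split_keywords(row))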

error with dependencies

I installed it but:

Traceback (most recent call last):

  File "/usr/local/bin/csv2es", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 2671, in <module>
    working_set.require(__requires__)
  File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 654, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 552, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: urllib3==1.10.2

even though urllib3 is present (1.19.1).

How can I fix it? Maybe by upgrading the requirements? Also, pip install uninstalls the most recent versions of the dependencies, which I then had to manually reinstall to the latest versions.
