rholder / csv2es
Load a CSV (or TSV) file into an Elasticsearch instance
License: Apache License 2.0
What could be the reason for such an error in a perfectly delimited file?
Hi, how can I map a column to the ES _id, so as to make the index cheaper?
Looking at your example, I want potato_id to be the ES _id.
See: http://stackoverflow.com/a/32334834/305883
Is this possible, or could it be added as a new feature?
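As a sketch of what this feature could look like: the Elasticsearch bulk API accepts an explicit `_id` in each action line, so a column such as potato_id could be popped out of the row and used there instead of letting ES auto-generate identifiers. The function and index/type names below are hypothetical, not csv2es's actual internals.

```python
import csv
import io
import json

def bulk_actions(csv_text, index, doc_type, id_field, delimiter=','):
    """Yield Elasticsearch bulk action/document JSON lines, using the value
    of `id_field` from each row as the document _id instead of letting ES
    auto-generate one."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        doc_id = row.pop(id_field)
        yield json.dumps({"index": {"_index": index,
                                    "_type": doc_type,
                                    "_id": doc_id}})
        yield json.dumps(row)

csv_text = "potato_id,potato_type\n1,russet\n2,yukon"
lines = list(bulk_actions(csv_text, "potatoes", "potato", "potato_id"))
```

Each pair of emitted lines forms one bulk operation, with the _id taken from the CSV rather than generated server-side.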
It would be nice to be able to indicate two field names in the CSV file that together represent a geo_point.
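One way this could work, sketched below under assumed column names (`lat`, `lon`) and a hypothetical target field: merge the two flat CSV columns into a single dict of the `{"lat": ..., "lon": ...}` shape that Elasticsearch accepts for a geo_point-mapped field.

```python
def merge_geo_point(row, lat_field, lon_field, target="location"):
    """Replace two separate latitude/longitude columns in a parsed CSV row
    with a single dict that a geo_point mapping could ingest."""
    row = dict(row)  # leave the caller's row untouched
    lat = float(row.pop(lat_field))
    lon = float(row.pop(lon_field))
    row[target] = {"lat": lat, "lon": lon}
    return row

doc = merge_geo_point({"name": "spud farm", "lat": "45.5", "lon": "-122.6"},
                      "lat", "lon")
```

The index would still need an explicit mapping declaring `location` as `geo_point`; the merge alone only reshapes the document.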
Right now, a message is printed for every 100K lines of the ES upload. It would probably be helpful to make this an option.
Hi,
Is it possible to use csv2es to index data into an AWS ES instance? Can I pass in a dictionary of parameters? I am currently using an AWS4Auth object in my Python script to access ES.
How do I read a new CSV file and store the records in an already existing index using this tool? If I run the command, I get this:
raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (400, {u'index': u'test-index', u'root_cause': [{u'index': u'test-index', u'reason': u'already exists', u'type': u'index_already_exists_exception'}], u'type': u'index_already_exists_exception', u'reason': u'already exists'})
I want an --update-index option; is this possible with the current version?
Does csv2es support 'array' and 'list' types? I was trying something like the following, and the result didn't seem to be compatible. Any ideas?
id, item, description, keywords
1, 'a big item', 'keyword1', 'keyword2', 'keyword3'
Thanks, Jeff
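CSV itself has no list type, which is likely why the attempt above fails. One common workaround, sketched here with a hypothetical helper and an assumed inner separator: pack the list into a single quoted column and split it while building each document, so the field becomes a JSON array when indexed.

```python
import csv
import io

def parse_with_lists(csv_text, list_fields, inner_sep=';'):
    """Parse CSV text, splitting the designated columns on an inner
    separator so each becomes a Python list (a JSON array once indexed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for field in list_fields:
            row[field] = [v.strip() for v in row[field].split(inner_sep)]
        yield row

csv_text = 'id,item,keywords\n1,a big item,"keyword1; keyword2; keyword3"\n'
docs = list(parse_with_lists(csv_text, ["keywords"]))
```

The quoting keeps the packed column as one CSV field; only the post-parse split turns it into a list.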
I am trying to index a huge CSV file (~35GB). I have set --docs-per-chunk to 500 and --bytes-per-chunk to 100000, but the RAM still fills up as the CSV file is read. Any thoughts on this?
I am working on a 16GB RAM machine with a 4-core processor.
Regards,
Vijay Raajaa G S
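For memory to stay bounded on a file that size, the reader has to stay lazy end to end: if any stage materializes the whole document stream, chunk-size flags cannot help. A minimal sketch of generator-based chunking (the helper name is hypothetical, not csv2es's code) looks like this:

```python
from itertools import islice

def chunked(iterable, docs_per_chunk=500):
    """Yield lists of at most docs_per_chunk items without ever
    materializing the whole iterable, so only one chunk is resident
    in memory at a time."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, docs_per_chunk))
        if not chunk:
            return
        yield chunk

# 1201 items -> chunks of 500, 500, 201; the source is never fully listed.
chunks = list(chunked(range(1201), docs_per_chunk=500))
```

If RAM still grows with this pattern, the buffering is happening downstream (for example, a parallel dispatcher collecting all chunks before sending).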
This is a brilliant tool, thanks. However, due to the size of some of my fields, I was hitting the limits in the csv module, i.e. getting "_csv.Error: field larger than field limit (131072)". I solved this by inserting csv.field_size_limit(sys.maxsize) into csv2es.py, as per http://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072
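The workaround from that Stack Overflow thread, with one refinement: on some platforms sys.maxsize overflows the C long that csv.field_size_limit accepts, so the common variant halves the value until it is accepted.

```python
import csv
import sys

# Raise the csv module's per-field limit above the 131072 default.
# On platforms where sys.maxsize overflows the underlying C long,
# halve it until field_size_limit accepts the value.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 2
```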
Hi, I have tried to use the library, but it hangs. Do you know why?
Using Python version: 2.7.5
Using host: http://eshost:9203
Index sample already exists
Using document type: act
Using the following 3 fields:
potato_id
potato_type
description
^C
Aborted!
I installed it but:
Traceback (most recent call last):
File "/usr/local/bin/csv2es", line 5, in <module>
from pkg_resources import load_entry_point
File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 2671, in <module>
working_set.require(__requires__)
File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 654, in require
needed = self.resolve(parse_requirements(requirements))
File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 552, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: urllib3==1.10.2
despite urllib3 being present (1.19.1).
How can I fix it? Maybe by upgrading the requirements? Also, pip install uninstalls the most recent versions of the dependencies, which I manually reinstalled to their latest versions.
Hi,
I'm using a @timestamp field in my CSV file, but this field is not recognized.
I converted the timestamp to the 'yyyy-MM-dd'T'HH:mm:ss.SSSZ' date format but still have the same issue.
The problem is that Elasticsearch/Kibana read a string, while I want it to be processed as a date.
Thanks in advance for your answer.
Best,
Reda
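Elasticsearch keeps treating the column as a string unless the index mapping declares it as a date, so one approach is to create the index with an explicit mapping before loading the CSV. The sketch below only builds the mapping body; the doc type name is a placeholder, and the format string mirrors the one described above.

```python
import json

# Mapping body one could PUT when creating the index, declaring @timestamp
# as a date so Elasticsearch/Kibana stop ingesting it as a string.
# "my_doc_type" is a placeholder; the format matches the converted values.
mapping = {
    "mappings": {
        "my_doc_type": {
            "properties": {
                "@timestamp": {
                    "type": "date",
                    "format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
                }
            }
        }
    }
}
body = json.dumps(mapping)
```

With the mapping in place first, the CSV loader's string values are parsed as dates at index time; changing the mapping after documents exist generally requires reindexing.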
Hi friend,
I would like to suggest that you refactor this package using Poetry. I had some issues where installing csv2es broke my other installed packages; I fixed it by removing the version pins in requirements.txt.
Take a look at:
My package using Poetry: https://gitlab.com/israel.oliveira.softplan/legal-pre-processing
Some good references and tutorials:
https://johnfraney.ca/posts/2019/05/28/create-publish-python-package-poetry/
https://towardsdatascience.com/how-to-publish-a-python-package-to-pypi-using-poetry-aa804533fc6f
Your code is great and very useful!
Thanks,
Israel.
Allow the user to specify an id field for the document to uniquely identify it when sending it to Elasticsearch. By default, Elasticsearch will generate its own unique document identifiers.
Here is how we activate bash completion for csv2es after pip installing it:
eval "$(_CSV2ES_COMPLETE=source csv2es)"
This seems to work fine, but presumably we'll install this in a virtualenv and keep it around for a while, so what's the best way to get completion working when we install?
An alternative method is to create this file and drop it somewhere for sourcing (but where....?):
_CSV2ES_COMPLETE=source csv2es > csv2es-complete.sh
There are more details here.
Sometimes we have things that are compressed, like bananas.tsv.gz, or that require some other transformation that can be computed on the fly by another tool. It would be nice to be able to accept a stream from stdin instead of just an input file.
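The usual CLI convention for this is accepting "-" as the input path. A minimal sketch of what the feature could look like (the helper is hypothetical, not existing csv2es code):

```python
import sys

def open_input(path):
    """Return a text stream: stdin when path is '-', otherwise the named
    file. This lets upstream tools pipe transformed data in, e.g.
    `zcat bananas.tsv.gz | csv2es ... --import-file -`."""
    if path == "-":
        return sys.stdin
    return open(path, "r")
```

The shell pipeline in the docstring is illustrative usage, assuming --import-file were taught to accept "-".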
The code does not work if Unicode characters are present in the CSV. Could you add support for that?
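The failures typically come from decoding the file with the platform default encoding (or not at all, under Python 2's byte-oriented csv module). A sketch of the safe pattern, assuming the input is UTF-8: decode explicitly before handing the stream to the csv reader.

```python
import csv
import io

raw = "potato_id,description\n1,pommes frites à la française\n".encode("utf-8")

# Decode the byte stream explicitly rather than letting the platform
# default encoding mangle non-ASCII characters.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
```

In a real run the io.BytesIO stands in for an open("file.csv", "rb") handle.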
When we have an empty string in the CSV and the field is a date type, with Elasticsearch 1.7 we get an exception from ES:
([{u'create': {u'status': 400, u'_type': u'csv_type', u'_id': u'AVgBSdxaLywLgoALVw4Q', u'error': u'MapperParsingException[failed to parse [ticket_metrics_created_at_hour]]; nested: MapperParsingException[failed to parse date field [], tried both date format [YYYY-MM-DD HH:mm:ss], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: ""]; ', u'_index': u'test_0_delete'}}
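One possible workaround, sketched with a hypothetical helper: drop empty-string values from each document before indexing, so a date-mapped field is simply absent rather than an unparsable "".

```python
def drop_empty_fields(doc):
    """Return a copy of the document without empty-string values, so
    Elasticsearch never tries to parse '' as a date (or number)."""
    return {k: v for k, v in doc.items() if v != ""}

doc = drop_empty_fields({"id": "7", "ticket_metrics_created_at_hour": ""})
```

An alternative on the mapping side is ignore_malformed, but omitting the field keeps the index clean.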
Nice tool. But I am surprised it does not recognize quoted CSV files; to me that is an absolutely standard option. Instead, it uploads all data as strings if you have a quoted CSV.
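For reference, Python's stdlib csv module already strips standard double quoting; the remaining gap is coercing the resulting strings to numbers before indexing. A sketch with a hypothetical best-effort coercion helper:

```python
import csv
import io

def coerce(value):
    """Best-effort conversion of a CSV string to int or float,
    falling back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

csv_text = 'id,name,weight\n"1","russet","0.3"\n'
rows = [{k: coerce(v) for k, v in row.items()}
        for row in csv.DictReader(io.StringIO(csv_text))]
```

Without the coercion step every quoted value reaches Elasticsearch as a string, which matches the behavior reported above.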
Trying:
python csv2es.py --index-name test --doc-type test --import-file test.csv
gives:
Using host: http://127.0.0.1:9200/
Traceback (most recent call last):
File "csv2es.py", line 215, in <module>
cli()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "csv2es.py", line 197, in cli
es.create_index(index_name)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 92, in decorate
return func(*args, query_params=query_params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 1005, in create_index
query_params=query_params)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 257, in send_request
self._raise_exception(status, error_message)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 271, in _raise_exception
raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (406, u'Content-Type header [] is not supported')
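The 406 above complains about an empty Content-Type header, which newer Elasticsearch versions reject; requests need an explicit application/json header. As a sketch (the request is only constructed here, not sent, and the index name is the one from the command above):

```python
import json
import urllib.request

# Build (but do not send) an index-creation request with an explicit
# Content-Type header, which recent Elasticsearch versions require;
# the 406 suggests the client library sent an empty one.
req = urllib.request.Request(
    "http://127.0.0.1:9200/test",
    data=json.dumps({"settings": {}}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
```

In practice the fix is usually upgrading the ES client library to a version that sets this header itself.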
I'm not an Elasticsearch expert, so forgive me if this is something silly.
I'm using Python 3 from Anaconda and have created the potatoes example CSV file. I am able to create the potatoes index, but it won't load the data. I've tried different CSV files and indexes and still get the same result. I get the following error:
Traceback (most recent call last):
File "/anaconda/bin/csv2es", line 11, in <module>
sys.exit(cli())
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 211, in cli
perform_bulk_index(host, index_name, doc_type, documents, docs_per_chunk, bytes_per_chunk, parallel)
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 107, in perform_bulk_index
bytes_per_chunk=bytes_per_chunk))
File "/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 652, in __call__
for function, args, kwargs in iterable:
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 104, in
delayed(local_bulk)(host, index_name, doc_type, chunk)
File "/anaconda/lib/python3.6/site-packages/pyelasticsearch/utils.py", line 31, in bulk_chunks
for action in actions:
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 57, in all_docs
fieldnames = doc_file.next().strip().split(delimiter)
AttributeError: '_io.BufferedReader' object has no attribute 'next'
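The AttributeError points at a Python 2 idiom: file objects in Python 3 have no .next() method, so the header-reading line fails. The portable form uses the built-in next() on a text-mode stream; a sketch of the corrected line, wrapped in a helper for illustration:

```python
import io

def read_fieldnames(doc_file, delimiter=','):
    """Python 3-compatible version of the failing header read:
    use the built-in next(doc_file) instead of doc_file.next(),
    on a text-mode (str-yielding) stream."""
    return next(doc_file).strip().split(delimiter)

fields = read_fieldnames(io.StringIO("potato_id,potato_type\n1,russet\n"))
```

The '_io.BufferedReader' in the error also shows the file was opened in binary mode; opening in text mode keeps the split working on str rather than bytes.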