rholder / csv2es
Load a CSV (or TSV) file into an Elasticsearch instance
License: Apache License 2.0
What could be the reason for such an error in a perfectly delimited file?
Hi, how can I map a column to the ES _id, so as to make the index cheaper?
Looking at your example, I want potato_id to be the ES _id.
See: http://stackoverflow.com/a/32334834/305883
Is this possible, or could it be added as a new feature?
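As a sketch of what this feature could look like: the Elasticsearch bulk API accepts an explicit `_id` in each action line, so a column such as potato_id could be popped out of the row and used there instead of letting ES auto-generate identifiers. The function and index/type names below are hypothetical, not csv2es's actual internals.

```python
import csv
import io
import json

def bulk_actions(csv_text, index, doc_type, id_field, delimiter=','):
    """Yield Elasticsearch bulk action/document JSON lines, using the value
    of `id_field` from each row as the document _id instead of letting ES
    auto-generate one."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        doc_id = row.pop(id_field)
        yield json.dumps({"index": {"_index": index,
                                    "_type": doc_type,
                                    "_id": doc_id}})
        yield json.dumps(row)

csv_text = "potato_id,potato_type\n1,russet\n2,yukon"
lines = list(bulk_actions(csv_text, "potatoes", "potato", "potato_id"))
```

Each pair of emitted lines forms one bulk operation, with the _id taken from the CSV rather than generated server-side.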
It would be nice to be able to indicate two field names in the CSV file that together represent a geo_point.
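One way this could work, sketched below under assumed column names (`lat`, `lon`) and a hypothetical target field: merge the two flat CSV columns into a single dict of the `{"lat": ..., "lon": ...}` shape that Elasticsearch accepts for a geo_point-mapped field.

```python
def merge_geo_point(row, lat_field, lon_field, target="location"):
    """Replace two separate latitude/longitude columns in a parsed CSV row
    with a single dict that a geo_point mapping could ingest."""
    row = dict(row)  # leave the caller's row untouched
    lat = float(row.pop(lat_field))
    lon = float(row.pop(lon_field))
    row[target] = {"lat": lat, "lon": lon}
    return row

doc = merge_geo_point({"name": "spud farm", "lat": "45.5", "lon": "-122.6"},
                      "lat", "lon")
```

The index would still need an explicit mapping declaring `location` as `geo_point`; the merge alone only reshapes the document.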
Right now, a message is printed for every 100K lines of the ES upload. It would probably be helpful to make this an option.
Hi,
Is it possible to use csv2es to index data into an AWS ES instance? Can I pass in a dictionary of parameters? I am currently using an AWS4Auth object in my Python script to access ES.
How do I read a new CSV file and store the records in an already existing index using this tool? If I run the command, I get this:
raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (400, {u'index': u'test-index', u'root_cause': [{u'index': u'test-index', u'reason': u'already exists', u'type': u'index_already_exists_exception'}], u'type': u'index_already_exists_exception', u'reason': u'already exists'})
I want an --update-index option; is this possible with the current version?
Does csv2es support 'array' and 'list' types? I was trying something like the following, and the result didn't seem to be compatible. Any ideas?
id, item, description, keywords
1, 'a big item', 'keyword1', 'keyword2', 'keyword3'
Thanks, Jeff
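CSV itself has no list type, which is likely why the attempt above fails. One common workaround, sketched here with a hypothetical helper and an assumed inner separator: pack the list into a single quoted column and split it while building each document, so the field becomes a JSON array when indexed.

```python
import csv
import io

def parse_with_lists(csv_text, list_fields, inner_sep=';'):
    """Parse CSV text, splitting the designated columns on an inner
    separator so each becomes a Python list (a JSON array once indexed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for field in list_fields:
            row[field] = [v.strip() for v in row[field].split(inner_sep)]
        yield row

csv_text = 'id,item,keywords\n1,a big item,"keyword1; keyword2; keyword3"\n'
docs = list(parse_with_lists(csv_text, ["keywords"]))
```

The quoting keeps the packed column as one CSV field; only the post-parse split turns it into a list.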
I am trying to index a huge CSV file (~35GB). I have set --docs-per-chunk to 500 and --bytes-per-chunk to 100000, but the RAM still fills up as the CSV file is read. Any thoughts on this?
I am working on a 16GB RAM machine with a 4-core processor.
Regards,
Vijay Raajaa G S
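For memory to stay bounded on a file that size, the reader has to stay lazy end to end: if any stage materializes the whole document stream, chunk-size flags cannot help. A minimal sketch of generator-based chunking (the helper name is hypothetical, not csv2es's code) looks like this:

```python
from itertools import islice

def chunked(iterable, docs_per_chunk=500):
    """Yield lists of at most docs_per_chunk items without ever
    materializing the whole iterable, so only one chunk is resident
    in memory at a time."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, docs_per_chunk))
        if not chunk:
            return
        yield chunk

# 1201 items -> chunks of 500, 500, 201; the source is never fully listed.
chunks = list(chunked(range(1201), docs_per_chunk=500))
```

If RAM still grows with this pattern, the buffering is happening downstream (for example, a parallel dispatcher collecting all chunks before sending).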
This is a brilliant tool, thanks. However, due to the size of some of my fields, I was hitting the limits in the csv module, i.e. getting "_csv.Error: field larger than field limit (131072)". I solved this by inserting csv.field_size_limit(sys.maxsize) into csv2es.py, as per http://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072
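The workaround from that Stack Overflow thread, with one refinement: on some platforms sys.maxsize overflows the C long that csv.field_size_limit accepts, so the common variant halves the value until it is accepted.

```python
import csv
import sys

# Raise the csv module's per-field limit above the 131072 default.
# On platforms where sys.maxsize overflows the underlying C long,
# halve it until field_size_limit accepts the value.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 2
```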
Hi, I have tried to use the library, but it hangs. Do you know why?
Using Python version: 2.7.5
Using host: http://eshost:9203
Index sample already exists
Using document type: act
Using the following 3 fields:
potato_id
potato_type
description
^C
Aborted!
I installed it but:
Traceback (most recent call last):
File "/usr/local/bin/csv2es", line 5, in <module>
from pkg_resources import load_entry_point
File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 2671, in <module>
working_set.require(__requires__)
File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 654, in require
needed = self.resolve(parse_requirements(requirements))
File "/Library/Python/2.7/site-packages/distribute-0.6.14-py2.7.egg/pkg_resources.py", line 552, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: urllib3==1.10.2
despite urllib3 being present (1.19.1).
How can I fix it? Maybe by upgrading the requirements? Also, pip install uninstalls the most recent versions of the dependencies, which I manually reinstalled to their latest versions.
Hi,
I'm using a @timestamp field in my CSV file, but this field is not recognized.
I converted the timestamp to the 'yyyy-MM-dd'T'HH:mm:ss.SSSZ' date format but still have the same issue.
The problem is that Elasticsearch/Kibana read a string, while I want it to be processed as a date.
Thanks in advance for your answer.
Best,
Reda
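Elasticsearch keeps treating the column as a string unless the index mapping declares it as a date, so one approach is to create the index with an explicit mapping before loading the CSV. The sketch below only builds the mapping body; the doc type name is a placeholder, and the format string mirrors the one described above.

```python
import json

# Mapping body one could PUT when creating the index, declaring @timestamp
# as a date so Elasticsearch/Kibana stop ingesting it as a string.
# "my_doc_type" is a placeholder; the format matches the converted values.
mapping = {
    "mappings": {
        "my_doc_type": {
            "properties": {
                "@timestamp": {
                    "type": "date",
                    "format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
                }
            }
        }
    }
}
body = json.dumps(mapping)
```

With the mapping in place first, the CSV loader's string values are parsed as dates at index time; changing the mapping after documents exist generally requires reindexing.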
Hi friend,
I would like to suggest that you refactor this package using Poetry. I had some issues where installing csv2es broke my other installed packages; I fixed it by removing the version pins in requirements.txt.
Take a look at:
My package using Poetry: https://gitlab.com/israel.oliveira.softplan/legal-pre-processing
Some good references and tutorials:
https://johnfraney.ca/posts/2019/05/28/create-publish-python-package-poetry/
https://towardsdatascience.com/how-to-publish-a-python-package-to-pypi-using-poetry-aa804533fc6f
Your code is great and very useful!
Thanks,
Israel.
Allow the user to specify an id field for the document to uniquely identify it when sending it to Elasticsearch. By default, Elasticsearch will generate its own unique document identifiers.
Here is how we activate bash completion for csv2es after pip installing it:
eval "$(_CSV2ES_COMPLETE=source csv2es)"
This seems to work fine, but presumably we'll install this in a virtualenv and keep it around for a while, so what's the best way to get completion working when we install?
An alternative method is to create this file and drop it somewhere for sourcing (but where....?):
_CSV2ES_COMPLETE=source csv2es > csv2es-complete.sh
There are more details here.
Sometimes we have things that are compressed, like bananas.tsv.gz, or that require some other transformation that can be computed on the fly by another tool. It would be nice to be able to accept a stream from stdin instead of just an input file.
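The usual CLI convention for this is accepting "-" as the input path. A minimal sketch of what the feature could look like (the helper is hypothetical, not existing csv2es code):

```python
import sys

def open_input(path):
    """Return a text stream: stdin when path is '-', otherwise the named
    file. This lets upstream tools pipe transformed data in, e.g.
    `zcat bananas.tsv.gz | csv2es ... --import-file -`."""
    if path == "-":
        return sys.stdin
    return open(path, "r")
```

The shell pipeline in the docstring is illustrative usage, assuming --import-file were taught to accept "-".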
The code does not work if Unicode characters are present in the CSV. Could you add support for that?
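The failures typically come from decoding the file with the platform default encoding (or not at all, under Python 2's byte-oriented csv module). A sketch of the safe pattern, assuming the input is UTF-8: decode explicitly before handing the stream to the csv reader.

```python
import csv
import io

raw = "potato_id,description\n1,pommes frites à la française\n".encode("utf-8")

# Decode the byte stream explicitly rather than letting the platform
# default encoding mangle non-ASCII characters.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
```

In a real run the io.BytesIO stands in for an open("file.csv", "rb") handle.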
When we have an empty string in the CSV and the field is a date type, with Elasticsearch 1.7 we get an exception from ES:
([{u'create': {u'status': 400, u'_type': u'csv_type', u'_id': u'AVgBSdxaLywLgoALVw4Q', u'error': u'MapperParsingException[failed to parse [ticket_metrics_created_at_hour]]; nested: MapperParsingException[failed to parse date field [], tried both date format [YYYY-MM-DD HH:mm:ss], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: ""]; ', u'_index': u'test_0_delete'}}
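One possible workaround, sketched with a hypothetical helper: drop empty-string values from each document before indexing, so a date-mapped field is simply absent rather than an unparsable "".

```python
def drop_empty_fields(doc):
    """Return a copy of the document without empty-string values, so
    Elasticsearch never tries to parse '' as a date (or number)."""
    return {k: v for k, v in doc.items() if v != ""}

doc = drop_empty_fields({"id": "7", "ticket_metrics_created_at_hour": ""})
```

An alternative on the mapping side is ignore_malformed, but omitting the field keeps the index clean.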
Nice tool. But I am surprised it does not recognize quoted CSV files; to me that is an absolutely standard option. Instead, it uploads all data as strings if you have a quoted CSV.
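For reference, Python's stdlib csv module already strips standard double quoting; the remaining gap is coercing the resulting strings to numbers before indexing. A sketch with a hypothetical best-effort coercion helper:

```python
import csv
import io

def coerce(value):
    """Best-effort conversion of a CSV string to int or float,
    falling back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

csv_text = 'id,name,weight\n"1","russet","0.3"\n'
rows = [{k: coerce(v) for k, v in row.items()}
        for row in csv.DictReader(io.StringIO(csv_text))]
```

Without the coercion step every quoted value reaches Elasticsearch as a string, which matches the behavior reported above.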
Trying:
python csv2es.py --index-name test --doc-type test --import-file test.csv
gives:
Using host: http://127.0.0.1:9200/
Traceback (most recent call last):
File "csv2es.py", line 215, in <module>
cli()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "csv2es.py", line 197, in cli
es.create_index(index_name)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 92, in decorate
return func(*args, query_params=query_params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 1005, in create_index
query_params=query_params)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 257, in send_request
self._raise_exception(status, error_message)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 271, in _raise_exception
raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (406, u'Content-Type header [] is not supported')
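The 406 above complains about an empty Content-Type header, which newer Elasticsearch versions reject; requests need an explicit application/json header. As a sketch (the request is only constructed here, not sent, and the index name is the one from the command above):

```python
import json
import urllib.request

# Build (but do not send) an index-creation request with an explicit
# Content-Type header, which recent Elasticsearch versions require;
# the 406 suggests the client library sent an empty one.
req = urllib.request.Request(
    "http://127.0.0.1:9200/test",
    data=json.dumps({"settings": {}}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
```

In practice the fix is usually upgrading the ES client library to a version that sets this header itself.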
I'm not an Elasticsearch expert, so forgive me if this is something silly.
I'm using Python 3 from Anaconda and have created the potatoes example CSV file. I am able to create the potatoes index, but it won't load the data. I've tried different CSV files and indexes and still get the same result. I get the following error:
Traceback (most recent call last):
File "/anaconda/bin/csv2es", line 11, in <module>
sys.exit(cli())
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/anaconda/lib/python3.6/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 211, in cli
perform_bulk_index(host, index_name, doc_type, documents, docs_per_chunk, bytes_per_chunk, parallel)
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 107, in perform_bulk_index
bytes_per_chunk=bytes_per_chunk))
File "/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 652, in __call__
for function, args, kwargs in iterable:
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 104, in
delayed(local_bulk)(host, index_name, doc_type, chunk)
File "/anaconda/lib/python3.6/site-packages/pyelasticsearch/utils.py", line 31, in bulk_chunks
for action in actions:
File "/anaconda/lib/python3.6/site-packages/csv2es.py", line 57, in all_docs
fieldnames = doc_file.next().strip().split(delimiter)
AttributeError: '_io.BufferedReader' object has no attribute 'next'
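The AttributeError points at a Python 2 idiom: file objects in Python 3 have no .next() method, so the header-reading line fails. The portable form uses the built-in next() on a text-mode stream; a sketch of the corrected line, wrapped in a helper for illustration:

```python
import io

def read_fieldnames(doc_file, delimiter=','):
    """Python 3-compatible version of the failing header read:
    use the built-in next(doc_file) instead of doc_file.next(),
    on a text-mode (str-yielding) stream."""
    return next(doc_file).strip().split(delimiter)

fields = read_fieldnames(io.StringIO("potato_id,potato_type\n1,russet\n"))
```

The '_io.BufferedReader' in the error also shows the file was opened in binary mode; opening in text mode keeps the split working on str rather than bytes.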