dilshod / xlsx2csv Goto Github PK

View Code? Open in Web Editor NEW

1.6K 45.0 303.0 611 KB

Convert xslx to csv, it is fast, and works for huge xlsx files

License: MIT License

Python 100.00%

xlsx2csv's Introduction

xlsx2csv

xlsx to csv converter (http://github.com/dilshod/xlsx2csv)

Converts xlsx files to csv format. Handles large XLSX files. Fast and easy to use.

Tested(supported) python versions:

2.4
2.7
3.4 to 3.12

Installation:

sudo easy_install xlsx2csv

pip install xlsx2csv

Also, works standalone with only the xlsx2csv.py script

Usage:

 xlsx2csv.py [-h] [-v] [-a] [-c OUTPUTENCODING] [-s SHEETID]
                   [-n SHEETNAME] [-d DELIMITER] [-l LINETERMINATOR]
                   [-f DATEFORMAT] [--floatformat FLOATFORMAT]
                   [-i] [-e] [-p SHEETDELIMITER]
                   [--hyperlinks]
                   [-I INCLUDE_SHEET_PATTERN [INCLUDE_SHEET_PATTERN ...]]
                   [-E EXCLUDE_SHEET_PATTERN [EXCLUDE_SHEET_PATTERN ...]] [-m]
                   xlsxfile [outfile]

positional arguments:

  xlsxfile              xlsx file path, use '-' to read from STDIN
  outfile               output csv file path, or directory if -s 0 is specified

optional arguments:

  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -a, --all             export all sheets
  -c OUTPUTENCODING, --outputencoding OUTPUTENCODING
                        encoding of output csv ** Python 3 only ** (default: utf-8)
  -s SHEETID, --sheet SHEETID
                        sheet number to convert, 0 for all
  -n SHEETNAME, --sheetname SHEETNAME
                        sheet name to convert
  -d DELIMITER, --delimiter DELIMITER
                        delimiter - columns delimiter in csv, 'tab' or 'x09'
                        for a tab (default: comma ',')
  -l LINETERMINATOR, --lineterminator LINETERMINATOR
                        line terminator - lines terminator in csv, '\n' '\r\n'
                        or '\r' (default: os.linesep)
  -f DATEFORMAT, --dateformat DATEFORMAT
                        override date/time format (ex. %Y/%m/%d)
  --floatformat FLOATFORMAT
                        override float format (ex. %.15f)
  -i, --ignoreempty     skip empty lines
  -e, --escape          Escape \r\n\t characters
  -p SHEETDELIMITER, --sheetdelimiter SHEETDELIMITER
                        sheet delimiter used to separate sheets, pass '' if
                        you do not need delimiter, or 'x07' or '\\f' for form
                        feed (default: '--------')
  -q QUOTING, --quoting QUOTING
                        field quoting, 'none' 'minimal' 'nonnumeric' or 'all' (default: 'minimal')
  --hyperlinks, --hyperlinks
                        include hyperlinks
  -I INCLUDE_SHEET_PATTERN [INCLUDE_SHEET_PATTERN ...], --include_sheet_pattern INCLUDE_SHEET_PATTERN [INCLUDE_SHEET_PATTERN ...]
                        only include sheets named matching given pattern, only
                        effects when -a option is enabled.
  -E EXCLUDE_SHEET_PATTERN [EXCLUDE_SHEET_PATTERN ...], --exclude_sheet_pattern EXCLUDE_SHEET_PATTERN [EXCLUDE_SHEET_PATTERN ...]
                        exclude sheets named matching given pattern, only
                        effects when -a option is enabled.
  -m, --merge-cells     merge cells

Usage with folder containing multiple xlxs files:

    python xlsx2csv.py /path/to/input/dir /path/to/output/dir

will output each file in the input dir converted to .csv in the output dir. If omitting the output dir it will output the converted files in the input dir

Usage from within Python:

  from xlsx2csv import Xlsx2csv
  Xlsx2csv("myfile.xlsx", outputencoding="utf-8").convert("myfile.csv")

Expat SAX parser used for xml parsing.

See alternatives:

Bash: http://kirk.webfinish.com/?p=91

Python: http://github.com/staale/python-xlsx http://github.com/leegao/pyXLSX

Ruby: http://roo.rubyforge.org/

Java: http://poi.apache.org/

xlsx2csv's People

Contributors

Stargazers

Watchers

Forkers

djmitche jablan hempalex daemon13 dremora foleyx90 jaalto gony55 nelsonko aubinlrx felixr beck917 aforward-oss srmaximiano stephenzl kbhatta outoftheboxx mankyuhan guileen kbroman joshuashen gladiatr72 maverickg rockscience kenahoo respectpotentialenergy pctj101 dw250100785 ui-volodymyr-shaboltas zouf jesseky ngocthanhit marvinlav emirhg whwc rafatower shidh wadebiyi kmila csojinb neeraj-vmware hcw70 luftreich ttabor1977 yalechang hehaotian yongjunjian kempwrites ihuro marcomzp ashleyquinton quyiqy nickdietel umo2000 oddsix luluti caot seanliangyoung doloopwhile bawalker tomtrask nndiaye mukolx oztc rulnick amboxer21 xjzhou mnjstwins bwhiton xtmgah huskyeder alittlesliceoftom prasadgaddai rockykitamura tlevine lucienfostier sebratt liubo404 rod-dot-codes caseycs arbruijn caomw starlein faionsax voonik336 bobo987cro artem-bez theavey benwilson512 cargosense jeffreyvdb prabhatgouda mdekauwe freeart lewis-liu ninoninkovic hskksk eawag-rdm ekwurzele vukio

xlsx2csv's Issues

floating errors

Excel exported csv Line.
;G3;Dachbox;320;;;;Schwarz Metallic;Arjes;;;133.8;72;37.5;Dach

xlsx2csv exported csv Line
G3Dachbox~~320~~~~Schwarz Metallic~~Arjes~~~133.800000000000017237.5~Dach

Found some more samples in my Files.
Excel field format: General
Original Value: 133.8
CSV Value: 133.80000000000001

I Also found: 7.4999999999999997E-2 instead of 0.075

Converting all sheets raises 'cmd' not defined

Using with --all or -s 0 causes:

Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 550, in
xlsx2csv.convert(outfile, sheetid)
File "/usr/local/bin/xlsx2csv", line 168, in convert
if cmd:
NameError: global name 'cmd' is not defined

Sometimes a few cells are messed in the attached XLSX

Hi!

Unfortunately sometimes the attached XLSX cannot be converted.

I haved attached the expected output as well - please rename the jpg :-)

NOTE: I'm using the options --sheet=1 --delimiter=x124

Are you able to reproduce the problem and to help me?

THANKS A LOT!!!

ratert

Can't get it to work in Win 7 - Encoding Problem?

I executed in a Windows command shell the following command:

python d:\yy\ww\py\lib\site-packages\xlsx2csv.py "D:\yy\ww\XLS2CSV Converter\source\xxx.xlsx" "D:\yy\xxx\XLS2CSV Converter\distination\vv.csv"

and get the following error message:

Traceback (most recent call last):
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 847, in
xlsx2csv.convert(outfile, sheetid)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 178, in convert
self._convert(sheetid, outfile)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 247, in _convert
sheet.to_csv(writer)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 558, in to_csv
self.parser.ParseFile(self.filehandle)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 706, in handleEndElement
self.writer.writerow(d)
UnicodeEncodeError: 'cp932' codec can't encode character '\xd4' in position 215: illegal multibyte sequence

I'm using Python 3.4.0 and xlsx2csv-0.6.1 installed via pip.
What am I doing wrong?

Shbang line should use /usr/bin/env python instead of /usr/bin/python

If the correct Python is somewhere else in the path, you can't run xlsx2csv.py directly.

Проблема с страницами в документе

Привет. Имеется проблема с разбором многостраничных документов. Приблизительно проблема заключается в том, что id документа в .../worksheets/sheet*.xml не обязательно совпадает с номером страницы.

ps. Я не Питонист, да и в формате Excel 2007 не разбираюсь, сейчас поковыряю, может смогу разобраться.

Invalid multiline string values handling

For this file https://dl.dropboxusercontent.com/u/16118108/01-11-2013-BeCompact-USD%3D321.xlsx i get such output:

"line1: line 1
Line2: line2"

I expect a bit different:

"line1: line 1\nLine2: line2"

May be it will be better to escape newline characters in output?

Problem with float

I have problem with float.
In xlsx-file i have 0.103 (general or number). After convert in csv i got 0.10299999999999999
The same, 0.276 -> 0.27600000000000002

Formatting minutes

The tool exports minutes < 10 as single characters, for example 14:05 is exported as 14:5. For some routines which import this value, it is converted to 0145, because these only read the digits.

If you can change line 545 to this: self.data = str(t / 60) + ":" + ('0' + str(t % 60))[-2:]
then the minutes will always have two digits, even when 0.

Modify README file ; working link to bash script & link to perl package

Hi,

In the Alternatives section, kindly change this entry
Bash:
http://kirk.webfinish.com/2009/12/xlsx2csv/
TO
Bash:
http://kirk.webfinish.com/?p=91 <-- direct link to bash script which works now.

Also, please add the perl alternative too, if possible.
And which is achived like this:
wget http://repo.iotti.biz/CentOS/6/noarch/xls2csv-1.06-16.el6.lux.1.noarch.rpm
yum localinstall xls2csv-1.06-16.el6.lux.1.noarch.rpm -y

Thanks.

Please add tag "release/N.N.N" in the future

Hi,

It would nice if the repository were tagged on release in order to (1) diff between releases (2) use automated tools to package releases in regular basis (http://githubredir.debian.net/).

Use of "release/" prefix helps to find tags by name easily with git tag -l:

release/1.0
release/1.1
release/1.10
release/1.2

zipfile.BadZipfile: File is not a zip file

I tried to use the script but gave me the above error . find details below :

[xlsx2csv master] $ ./xlsx2csv.py /home/rtcoms/Desktop/bank_wise_csv_ifsc_codes/IFCB2009_02.xls /home/rtcoms/Desktop/code/personal_projects/ifcs_details/tmp/1.csv

Traceback (most recent call last):
File "./xlsx2csv.py", line 443, in
xlsx2csv(args[0], outfile, **kwargs)
File "./xlsx2csv.py", line 115, in xlsx2csv
ziphandle = zipfile.ZipFile(infilepath)
File "/usr/lib/python2.6/zipfile.py", line 696, in init
self._GetContents()
File "/usr/lib/python2.6/zipfile.py", line 716, in _GetContents
self._RealGetContents()
File "/usr/lib/python2.6/zipfile.py", line 728, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

only first sheet converted to csv.

I ran xlsx2csv with xlsx that had multiple sheets. But, xlsx2csv converted only one sheet [first sheet] to csv.

Invalid xlsx file error

I am getting the following error from an excel file I have (Microsoft Excel 97-2004 workbook):

Invalid xlsx file: ...

When I looked at the source code, it seems to be coming from the following line:

        self.ziphandle = zipfile.ZipFile(xlsxfile)

Any idea why I get this error and any work-around?

Trouble running on Windows 7

I get this error, when running on Windows 7:

C:\Users\jh>python.exe C:\1\xlsx2csv.py
File "C:\1\xlsx2csv.py", line 286
date = datetime.datetime(1904, 01, 01) + datetime.timedelta(float(data))
^
SyntaxError: invalid token

Am I doing it wrong?

fails on large xlsx files

I am trying to convert some large xlsx files in the range of 60,000 to 85,000 lines.
The program fails with the error:

$ xlsx2csv.py -s 1 60656.xlsx 60656.csv
Traceback (most recent call last):
File "/home/Alex/mybin/xlsx2csv.py", line 412, in
xlsx2csv(args[0], outfile, **kwargs)
File "/home/Alex/mybin/xlsx2csv.py", line 98, in xlsx2csv
workbook = parse(ziphandle, Workbook, "xl/workbook.xml")
File "/home/Alex/mybin/xlsx2csv.py", line 125, in parse
instance.parse(ziphandle.read(filename))
File "/home/Alex/mybin/xlsx2csv.py", line 135, in parse
self.appName = workbookDoc.firstChild.getElementsByTagName("fileVersion")[0]._attrs['appName'].value
IndexError: list index out of range

I am using python 2.7 under linux.
I have tested the script with other xlsx files (smaller) and had no issus.

Thanks
Alex.

Added outfile.close() to xlsx2csv()

Dilshod,

I added an outfile.close() to the finally statement in xlsx2csv() because garbage collection (I believe) wasn't closing the file quickly enough for my use case. I'm just letting you know so that you can add it to master if you think other's would benefit as well.

Thanks,
Nathan

Trailing zeros truncated

If a cell contains something like "21.10", xlsx2csv interprets it as a number and truncates the zero. This is bad if the cell is intended to represent a version number or something like that.
These cells are formatted as numbers in Excel, with a fixed number of digits after the decimal point. I have hundreds of files, so I can't change the formats by hand.

The best fix would be to simply retain leading and trailing zeros in the csv.

Another fix would be to interpret numbers with trailing or leading zeros as strings. (maybe with a flag)

Another fix would be a flag to treat all cells as strings, rather than numbers.

Date formatting issue (v0.14)

An xlsx file I'm processing contains a date cell that looks like "9/15/2011 15:22" in Excel. I haven't been able to figure out what format to pass to xlsx2csv in order to resolve this.
(1) [no format passed]: m/15/11 h:09
(2) %Y/%m/%d %hh:%mm : 2011/09/15 Seph:09m
(3) %Y/%m/%d %H:%M : 2011/09/15 00:00

The last seem closest, but I would like to see the hours:minutes resolved.

Misinterpretation of &

If there is an ampersand in a cell the tool will split up the cell contents.

Example:
Cell: Hello & World

will be parsed to
Hello
World

I made a temporary hack in my version by:

def handleStartElement(self, name, attrs):
        self.collectedString = ""

and

    def handleCharData(self, data):
        if self.in_cell_value: # defines that there is a value between a set of nodes
            self.collectedString += data
            self.data = self.collectedString # default value

ValueError: invalid literal for int() with base 10: ''

Got this error today with a xlsx file (tested with 0.6.1):

Traceback (most recent call last):
File "./xlsx2csv.py", line 843, in
xlsx2csv = Xlsx2csv(options.infile, **kwargs)
File "./xlsx2csv.py", line 166, in init
self.workbook = self._parse(Workbook, "xl/workbook.xml")
File "./xlsx2csv.py", line 268, in _parse
instance.parse(filehandle)
File "./xlsx2csv.py", line 314, in parse
if 'r:id' in attrs: id = int(attrs["r:id"].value[3:])
ValueError: invalid literal for int() with base 10: ''

Crash on simple file

Excel 2007 file with content
8001990029363
8001990029387
8001990028694
8001990028663
8001990028649
8001990028854
8001990048081
8001990029370
8001990029394
8001990048180
8001990027710

Code:

from openpyxl.reader.excel import load_workbook

book = load_workbook('error.xlsx', use_iterators=True)
sheet = book.get_active_sheet()

for row in sheet.iter_rows():
    print [cell.internal_value for cell in row]

Crash with error:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    for row in sheet.iter_rows():
  File "C:\Python\lib\site-packages\openpyxl\reader\iter_worksheet.py", line 236, in get_squared_range
    cell = cell._replace(internal_value=shared_date.from_julian(float(cell.internal_value)))
  File "C:\Python\lib\site-packages\openpyxl\shared\date_time.py", line 162, in from_julian
    return EPOCH + datetime.timedelta(days=utc_days)
OverflowError: Python int too large to convert to C long

Shared string data is not resolved correctly in some cases.

In some cases cell data is not parsed correctly.
Investigation and debug output shows that these cells contain shared string and use 2 exec of handleCharData function for each. But when script resolves shared data address to string it uses data of last exec not self.collected_string (or self.data). In result you have almost random data in cell.

Change in 296th line of script
int(data) > int(self.data)

Diff:

diff xlsx2csv.py xlsx2csv_fixed.py
296c296
<                 self.data = self.sharedStrings[int(data)]

---
>                 self.data = self.sharedStrings[int(self.data)]

Looks like this is fix after "Issue #15 (Misinterpretation of &". Data parsing is fixed.

How large an excel file can it handle ?

Date formatting ("dates since ...")

First, thank you for this amazing library.

The date formatting seems to be really bad though. It converts "4/5/2014" into "04-05-14". This format makes it very hard to convert back into a date. PHP strtotime() interprets it as May 14, 2004. At the very least, the year should be 4 digits, that would help a lot.

Additionally, Excel sometimes stores dates as simply an integer representing the number of days since January 1, 1900 (or January 1, 1904 on Mac for some reason). It would be excellent if those could also be converted to date strings as well.

Throws traceback on some

Hi! Thank you for the tool!
Unfortunately, it throws a traceback, while tries to convert one some files:

Traceback (most recent call last):
  File "./xlsx2csv.py", line 477, in <module>
    xlsx2csv(args[0], None, args[1], **kwargs)
  File "./xlsx2csv.py", line 121, in xlsx2csv
    styles = parse(ziphandle, Styles, "xl/styles.xml")
  File "./xlsx2csv.py", line 179, in parse
    instance.parse(f)
  File "./xlsx2csv.py", line 222, in parse
    numFmtId = int(numFmt._attrs['numFmtId'].value)
AttributeError: Text instance has no attribute '_attrs'

Here is file itself: http://www.sima-land.ru/files-local/price/igrushki.xlsx

IndexError: list index out of range

I have this error :
Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 5, in
pkg_resources.run_script('xlsx2csv==0.6', 'xlsx2csv')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 548, in
xlsx2csv = Xlsx2csv(options.infile, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 155, in init
self.workbook = self._parse(Workbook, "xl/workbook.xml")
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 219, in _parse
instance.parse(filehandle)
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 239, in parse
sheets = workbookDoc.firstChild.getElementsByTagName("sheets")[0]
IndexError: list index out of range

you can find excel file at :
http://members.tsetmc.com/tsev2/excel/MarketWatchPlus.aspx?d=0

thanks

does not include trailing comma when last column is empty

given the headers A,B,C and data stuff,more stuff, [empty] the csv output doesn't not include a comma after "more stuff".

Displayed Value vs Actual Value

Importing xlsx files gets the displayed value (seen on the sheet) but sometimes the "real" value is desired (the value seen in the fx bar when a cell is selected).

Is it possible for this to be an option? Displayed vs actual values depending on a provided option to the command?

KeyError: 'appName'

Using the version in pip, I found this error on some of my workbooks:

Traceback (most recent call last):
  File "/usr/local/bin/xlsx2csv", line 548, in <module>
    xlsx2csv = Xlsx2csv(options.infile, **kwargs)
  File "/usr/local/bin/xlsx2csv", line 155, in __init__
    self.workbook = self._parse(Workbook, "xl/workbook.xml")
  File "/usr/local/bin/xlsx2csv", line 219, in _parse
    instance.parse(filehandle)
  File "/usr/local/bin/xlsx2csv", line 233, in parse
    self.appName = workbookDoc.firstChild.getElementsByTagName("fileVersion")[0]._attrs['appName'].value
KeyError: 'appName'

Won't exclude sheetname "Sheet2" and "Sheet3"

In the excel file "Sheet2" and "Sheet3" are blank so I'd like to exclude them, I keep passing the following command:

xlsx2csv -a -E"Sheet2" -E"Sheet3" file.xlsx

but the produced output still contains delimiters for these two sheets. I've also tried putting a space between the -E option and the names, and I've tried excluding the quotation marks, but to no avail.

Is it possible to add xls support?

xlsx to csv works great but fails with the following for xls files. xls might not even be supported but if you could add xls support, that had be awesome! Thank you!

Traceback (most recent call last):
File "/home/thinkcode/libs/python/xlsx2csv.py", line 344, in ?
xlsx2csv(args[0], f, **kwargs)
File "/home/thinkcode/libs/python/xlsx2csv.py", line 91, in xlsx2csv
ziphandle = zipfile.ZipFile(infilepath)
File "/usr/lib64/python2.4/zipfile.py", line 210, in init
self._GetContents()
File "/usr/lib64/python2.4/zipfile.py", line 230, in _GetContents
self._RealGetContents()
File "/usr/lib64/python2.4/zipfile.py", line 242, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

Convert Entire Directory

Hi,

I've got a directory filled with xlsx's that I wanted to convert to csv. Is there any easy way to batch convert?

Converting Apache POI excel files to csv ?

I can't covert xlsx documents from Apache POI . No error messages, just making a csv file that contains row numbers.
My server is a solaris10 server. I need help on this Converting xlsx file from my office 2010 exel works fine

blank lines in conversion

The conversion from myfile.xslx to csv works fine on Linux but with the same options (defaults ones :o) ) the conversion on windows doubles the number of lines : it adds one blank line every lines !

Do you have an idea of what's wrong in the way i use xlsx2csv ?
Is there any option that may solve this pb ?
Thanks.

Ignore <rPh> Element

Excel has capability holds phonetic for cell value as as . element placed inner tag. So, the current implements shows mixed value and phonetic.
I've avoid the problem like below code. I hope apply the code.

def handleStartElement(self, name, attrs):
    if name == 'si':
        self.si = True
        self.value = ""
    elif name == 't' and self.rPh:
        self.t = False
    elif name == 't' and self.si:
        self.t = True
    elif name == 'rPh':
        self.rPh = True

def handleEndElement(self, name):
    if name == 'si':
        self.si = False
        self.strings.append(self.value)
    elif name == 't':
        self.t = False
    elif name == 'rPh':
        self.rPh = False

issue with empty start row

Hi,

The parser seems to turn into troubles with xlsx files with empty row 1.
It returns:

...
self.colNum = cellId[:len(cellId)-len(self.rowNum)]
TypeError: object of type 'NoneType' has no len()

Without success, I tried to fix it by "ignoring" the cell with NoneType and go on with empty value, but this makes the data in following rows to be missing in the CSV output.

It only happens if the input XLSX has an entirely empty row 1.

Do you have an idea how to deal with this issue?

Thanks for help and thanks for the great .py!

Hi, please help, what's this problem ?

Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 550, in
xlsx2csv.convert(outfile, sheetid)
File "/usr/local/bin/xlsx2csv", line 162, in convert
self._convert(sheetid, outfile)
File "/usr/local/bin/xlsx2csv", line 200, in _convert
sheet.to_csv(writer)
File "/usr/local/bin/xlsx2csv", line 362, in to_csv
self.parser.ParseFile(self.filehandle)
File "/usr/local/bin/xlsx2csv", line 377, in handleCharData
xfs_numfmt = self.styles.cellXfs[s]
IndexError: list index out of range

UnicodeEncodeError when trying to convert xlsx file to csv

I tried : python xlsx2csv.py myfile.xlsx myfile.csv

Complete error message :

Traceback (most recent call last):
File "xlsx2csv.py", line 164, in
xlsx2csv(sys.argv[1], f)
File "xlsx2csv.py", line 22, in xlsx2csv
Sheet(shared_strings, ziphandle.read("xl/worksheets/sheet1.xml"), writer)
File "xlsx2csv.py", line 94, in init
self.parser.Parse(data)
File "xlsx2csv.py", line 154, in handleEndElement
self.writer.writerow(d)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)

handleStartElement matching fails and leads to AttributeError

Getting "null-pointer exceptions" from xlsx2csv when parsing some files:

Traceback (most recent call last):
  File "/usr/local/bin/xlsx2csv", line 847, in <module>
    xlsx2csv.convert(outfile, sheetid)
  File "/usr/local/bin/xlsx2csv", line 178, in convert
    self._convert(sheetid, outfile)
  File "/usr/local/bin/xlsx2csv", line 247, in _convert
    sheet.to_csv(writer)
  File "/usr/local/bin/xlsx2csv", line 558, in to_csv
    self.parser.ParseFile(self.filehandle)
  File "/usr/local/bin/xlsx2csv", line 660, in handleStartElement
    startCol = start.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

An example where this problem occurs is here.

According to pip, my currently installed version is the latest, but xlsx2csv -v only outputs the script's name (xslx2csv - want me to open a separate ticket for that?). I see I do have the --merge-cells options available from the script.

Cell Links lost during conversion

Some of the cells in a xlsx sheet contain links to external websites. When converted all the links are lost and only the text inside the links remain.

test

becomes

test

Is there an option to maybe give output as:
test

PyPI, library

Hi, great tool! Consider adding it to PyPI so people can install it with pip, and also possibly exposing it additionally as a library. Thanks!

Syntax Error with Python 2.3

Yes, it's ancient Python, but still:

line 780: parser.add_argument("-s", "--sheet", dest="sheetid", default=1, type=int, help="sheet number to convert")

produces a

Traceback (most recent call last):
  File "xlsx2csv.py", line 780, in ?
    help="sheet number to convert")
  File "/usr/lib/python2.3/optparse.py", line 820, in add_option
    option = self.option_class(*args, **kwargs)
  File "/usr/lib/python2.3/optparse.py", line 430, in __init__
    checker(self)
  File "/usr/lib/python2.3/optparse.py", line 499, in _check_type
    raise OptionError("invalid option type: %r" % self.type, self)```

Change it to

line 780: parser.add_argument("-s", "--sheet", dest="sheetid", default=1, type="int", help="sheet number to convert")

and it works. (Note the "" around int.)

Updating links in xlsx sheet before conversion

I am not sure if this is strictly in the purview of this program.

I was trying to use the program to convert an excel sheet A.xlsx which had a link to B.xlsx.
I changed B.xlsx and then ran xlsx2csv.py A.xlsx A.csv

As expected, the change done in B.xlsx did not get reflected in A.csv

But note A.xlsx, it is already set to "Automatically update links"

The only way it works is this way - Open A.xlsx and just do a save. Then do the conversion.

Is there a way in python to force an xlsx to save itself (at command prompt) - before doing the conversion.

All this is being done in a windows machine - but I guess any trick available, I should be able to map from linux to windows.

Kind Regards

Extra Carridge Returns (CR) codes at the end of lines - please strip

Please strip CR (carriage return, \r, 0x0D) from output lines. Now:

$ xlsx2csv test.xlsx | cat -A | head -2
test, 1^M$
test, 2^M$

Something like this to detect if running under modern Windows (pseudo code):

STRIP_OUTPUT = true

# example: ProgramData=C:\ProgramData
if <environment variable "ProgramData" exists>
   AND
   <it matches regexp "[a-z]:\">
then
   STRIP_OUTPUT = false
endif

Plese list project at freecode.com

The Freecode is kinda "yellow pages" of Open Source software. It would be useful if the project were listed there; just open an account and you're set.

Once set up, there is nothing to maintain, unless you want to issue release announcements for the subsribers of project watchers.

problem with numbers

Hi,
I am facing a problem with xlsx files that have number such as 9.810070626E+019 (which shows in excel as 98100707530000000000) --> with xlsx2csv this becomes 98100707530000007168 when viewed with text editor. However, with OpenOffice xls-to-csv macro the value is shown as it should as 98100707530000000000.

Both of these are shown correctly when the csv is viewed in e.g. OpenOffice, but I'm taking the csv into different software, e.g. Pentaho Data Integration and there the values show as in a text editor..

Any help appreciated
--patrik

inlineStr support

I ran across a xlsx file using the inlineStr method instead of the more common sharedStrings approach. The conversion failed because the inlineStr are stored in a tag instead of the normal tag.

Adding this to the if then block in the handleStartElement method allows for proper conversion of inlineStr. Please test and add to next version:

  elif self.in_cell and name == 'is':
        self.in_cell_value = True

dd/mm/yyyy date type missing

So in the UK, where dates are typically of this type, they were converted to numbers (as in the number excel uses to store a date).

To fix this, I added:
'dd/mm/yyyy' : 'date',
on line 49

Blank cells in XLSX not reflected in CSV

When converting an XLSX file that has emtpy cells at the end of the row, the corresponding row in the CSV is missing the trailing commas needed for using the CSV as an import into a database orr other system that validates the number of expected columns are present.