jackiekazil / data-wrangling

Code repository for Data Wrangling with Python (O'Reilly)

data-wrangling's Issues

"Table 9" sheet does not exist in "SOWC 2014 Stat Tables_Table 9.xlsx"

However, a sheet named "Table 9 " does exist (note the trailing space).
Data used: https://github.com/jackiekazil/data-wrangling/blob/master/data/chp4/SOWC%202014%20Stat%20Tables_Table%209.xlsx

sample code

import xlrd
book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')
sheet = book.sheet_by_name('Table 9')
print sheet

result:

$ python test.py
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    sheet = book.sheet_by_name('Table 9')
  File "/env27/lib/python2.7/site-packages/xlrd/book.py", line 441, in sheet_by_name
    raise XLRDError('No sheet named <%r>' % sheet_name)
xlrd.biffh.XLRDError: No sheet named <'Table 9'>

Code using the sheet name that actually exists:

import xlrd
book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')
sheet = book.sheet_by_name('Table 9 ')  # <- MODIFIED
print sheet

result:

$ python test.py
<xlrd.sheet.Sheet object at 0x102a50f90>
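
A possible workaround, sketched here under the assumption that the only difference is stray whitespace around the name: compare stripped names instead of hard-coding the trailing space. The helper `find_sheet_name` is hypothetical; `book.sheet_names()` is the xlrd call that lists the names actually present in the workbook.

```python
def find_sheet_name(sheet_names, wanted):
    """Return the actual sheet name that matches `wanted` once surrounding
    whitespace is ignored, or None if nothing matches."""
    target = wanted.strip()
    for name in sheet_names:
        if name.strip() == target:
            return name
    return None


# With xlrd this would be called as:
#     actual = find_sheet_name(book.sheet_names(), 'Table 9')
#     sheet = book.sheet_by_name(actual)
# The list below is a made-up stand-in so the sketch runs without the workbook.
example_names = ['Data Notes', 'Table 9 ']
actual_name = find_sheet_name(example_names, 'Table 9')
print(repr(actual_name))
```

Printing `book.sheet_names()` with `repr` is also a quick way to spot invisible characters like this one.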

UnicodeDecodeError: 'gbk' codec can't decode byte when running parse_pdf_text.py

Hi, thank you for your wonderful book on data wrangling.
I ran into an issue when running parse_pdf_text.py from chapter 5 in Anaconda (Python 3.5).
The IDE shows the following error message:

Traceback (most recent call last):

  File "<ipython-input-10-957ab6bc6f5e>", line 39, in <module>
    for line in openfile:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 46: illegal multibyte sequence

It looks like the code opened the file in text mode with a "gbk" encoding. Should it be opened in binary mode instead? I'm not sure. How can I fix this problem? Thank you.
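
One way to approach this, as a hedged sketch: on Python 3, `open()` in text mode uses the platform's default encoding (apparently gbk on this machine) unless an encoding is passed explicitly. The example below assumes the text file is UTF-8, which is an assumption about the pdf2txt output rather than something confirmed here; `errors='replace'` keeps the loop running if a byte still does not decode. The helper name `read_text_lines` is hypothetical.

```python
import io


def read_text_lines(path, encoding='utf-8'):
    """Read a text file with an explicit encoding instead of the platform default."""
    # io.open behaves the same on Python 2 and 3; errors='replace' swaps
    # undecodable bytes for U+FFFD instead of raising UnicodeDecodeError.
    with io.open(path, 'r', encoding=encoding, errors='replace') as openfile:
        return [line.rstrip('\n') for line in openfile]
```

The loop from the script would then read `for line in read_text_lines('en-final-table9.txt'):`. If the output contains many replacement characters, the assumed encoding is probably wrong and another one (for example `'cp1252'`) is worth trying.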

python: can't open file 'pdf2txt.py': [Errno 2] No such file or directory

In chapter 5, this error occurs when I run the following command from my working folder to convert a PDF to text:

python pdf2txt.py -o en-final-table9.txt EN-FINAL\ Table\ 9.pdf

Is it about the path?
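
The message means that Python looked for `pdf2txt.py` in the current directory and did not find it, so it may well be about the path. As a sketch, assuming the pdfminer installation placed the script in a directory listed on PATH (not verified for this setup), its location can be looked up and passed to `python` as a full path. `locate_script` is a hypothetical helper, and `shutil.which` requires Python 3.3 or later.

```python
import os
import shutil


def locate_script(name):
    """Return a full path to `name`, checking the current directory first
    and then the directories on PATH; return None if it is not found."""
    local_candidate = os.path.join(os.getcwd(), name)
    if os.path.isfile(local_candidate):
        return local_candidate
    # shutil.which only reports files that are marked executable.
    return shutil.which(name)


script_path = locate_script('pdf2txt.py')
print(script_path or 'pdf2txt.py was not found in this folder or on PATH')
```

If a path is printed, that full path can be used in place of `pdf2txt.py` in the command above.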

raw download issue

GitHub displayed error: "(Sorry about that, but we can’t show files that are this big right now.)" but allowed option to View Raw.

Raw view doesn't include tabs, making code provided in book throw an error.

Here is alternative (Python 3) code, based on the Python docs, that worked for me:

from xml.etree import ElementTree as ET

# Open data-text.xml and read it as a string into the variable read_xml
with open('data-text.xml') as xml_file:
    read_xml = xml_file.read()

# Find the root element of the XML stored as a string
root = ET.fromstring(read_xml)
print(root)  # for Python 2.7, use: print root

test

OK, u's guys. What's going on?

While running the code in Chapter 3, importing the .json data file, the script worked fine, but the output was thus:

{u'Indicator': u'Healthy life expectancy (HALE) at birth (years)', u'Country': u'Zambia', u'Comments': u'', u'Display Value': 36, u'World Bank income group': u'Low-income', u'Numeric': 36.0, u'Sex': u'Both sexes', u'High': u'', u'Low': u'', u'Year': 2000, u'WHO region': u'Africa', u'PUBLISH STATES': u'Published'}
{u'Indicator': u'Healthy life expectancy (HALE) at birth (years)', u'Country': u'Zimbabwe', u'Comments': u'', u'Display Value': 51, u'World Bank income group': u'Low-income', u'Numeric': 51.0, u'Sex': u'Female', u'High': u'', u'Low': u'', u'Year': 2012, u'WHO region': u'Africa', u'PUBLISH STATES': u'Published'}

Any idea why all the "u's"?
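
For what it's worth, the `u` prefix is how Python 2 displays unicode strings when it prints their `repr`, which is what happens when a whole dict is printed. The data itself is unchanged. A small sketch (the record below is a shortened stand-in for the output above) showing two ways to print without the prefix:

```python
import json

# A shortened stand-in for one record from the output above.
raw_json = '{"Country": "Zambia", "Year": 2000, "Sex": "Both sexes"}'
record = json.loads(raw_json)

# Printing the dict shows each string's repr, which is where Python 2
# adds the u prefix. Printing a single value directly never shows it.
country_line = 'Country: {0}'.format(record['Country'])
print(country_line)

# json.dumps turns the dict back into plain JSON text, so no prefix appears.
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)
```

On Python 3 all strings are unicode and the prefix is not shown at all.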

python code error

(null): can't open file 'python.py': [Errno 2] No such file or directory
How can I solve this problem?

The data can't be downloaded

The data can't be downloaded from this website, which makes it very difficult for me to follow along. Could you please help resolve this issue?

There is no Chp 7,8 in data

I bought the book "Data Wrangling with Python" in China.
While reading chapter 7, I could not find the data at https://github.com/jackiekazil/data-wrangling/tree/master/data.
There are no Chp7 or Chp8 folders. Could you upload the files for all readers, or send them to me?
That would be really helpful.

CH4 Page 76 - Parse Excel Setup

I've created a folder on the desktop and placed SOWC 2014 Stat Tables_Table 9.xlsx in it, along with parse_excel.py.

It says to now run 'python parse_script.py' from the command line, which gives the following:
C:\>python parse_script.py
python: can't open file 'parse_script.py': [Errno 2] No such file or directory

Also, I cannot store the opened file in the book variable:
book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')

>>> book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')
  File "C:\Python\Python36\lib\site-packages\xlrd\__init__.py", line 111, in open_workbook
    with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'SOWC 2014 Stat Tables_Table 9.xlsx'
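
Both errors say that Python could not find the file relative to the directory it was started from: the prompt shows `C:\>`, not the desktop folder, and the script was saved as parse_excel.py while the command asks for parse_script.py. A minimal sketch for checking this from inside Python follows; `describe_path` is a hypothetical helper, and nothing here is specific to xlrd.

```python
import os


def describe_path(filename):
    """Report where Python is running and whether `filename` is visible from there."""
    return {
        'cwd': os.getcwd(),                      # directory relative paths resolve against
        'exists': os.path.isfile(filename),      # True only if the file is found from cwd
        'absolute': os.path.abspath(filename),   # the full path Python would try to open
    }


report = describe_path('SOWC 2014 Stat Tables_Table 9.xlsx')
for key in ('cwd', 'exists', 'absolute'):
    print('{0}: {1}'.format(key, report[key]))
```

If `exists` is False, either `cd` into the folder that holds the files before starting Python, or pass the full path to `xlrd.open_workbook`.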

Is a Python 3 version of this book in the works or available?

This isn't an issue per se. I love the style of your writing; it is clear that you wanted to focus on educating the reader instead of delving into all the gory details of each method.

However, Python 2.7 is now getting very old, especially considering your book was fairly recently published. I was curious if a Python 3 version is either in the works or if you already have one available somewhere (perhaps provided to folks who can prove they bought the book).

Thank you!
Regards,
Mike

PDFSyntaxError('No /Root object! - Is this really a PDF?')

Code like this:
import slate
with open('xxx.pdf') as f:
    doc = slate.PDF(f)
raises this error:
Traceback (most recent call last):
  File "", line 2, in
  File "C:\Python27\lib\site-packages\slate\slate.py", line 38, in __init__
    self.doc.set_parser(self.parser)
  File "C:\Python27\lib\site-packages\pdfminer\pdfparser.py", line 333, in set_parser
    raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?
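
A possible cause, offered as a guess rather than a diagnosis: the file is opened in text mode, and on Windows under Python 2 that can alter the bytes of a binary file before pdfminer sees them. Opening with `'rb'` avoids this. The sketch below (with a hypothetical helper `looks_like_pdf`) also checks that the file starts with the `%PDF` marker, which helps rule out a download that saved something else under a .pdf name.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the %PDF marker.

    The file is opened in binary mode ('rb') so that no newline or
    end-of-file translation touches its contents.
    """
    with open(path, 'rb') as pdf_file:
        header = pdf_file.read(4)
    return header == b'%PDF'
```

If this returns True for the file, the next thing to try is passing a binary handle to slate: `with open('xxx.pdf', 'rb') as f:` followed by `doc = slate.PDF(f)`.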

"How to import XML Data" on page 57 with Python 3.6.3.

I tried to run "How to import XML Data" on page 57 with Python 3.6.3, but I get the following errors:
TypeError: unhashable type: 'dict_keys'
TypeError: 'dict_keys' object does not support indexing
I tried, without success, to rewrite the code using 'list' in Python 3.6.3.
Could you please give me the code for Python 3.6.3?
Most appreciated ... Johannes
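
In Python 3, `dict.keys()` returns a view object rather than a list, so it can be neither indexed nor used as a dictionary key, which matches both errors. A minimal sketch of the usual fix follows: wrap the view in `list()` before indexing. The one-element XML string is a made-up stand-in, not the book's data file.

```python
from xml.etree import ElementTree as ET

# Made-up stand-in for one element of the XML file.
element = ET.fromstring('<Value Numeric="51.0" />')

keys_view = element.attrib.keys()

# Indexing the view directly is what raises the TypeError on Python 3.
error_message = None
try:
    keys_view[0]
except TypeError as err:
    error_message = str(err)

# Converting to a list first works on both Python 2 and Python 3.
lookup_key = list(keys_view)[0]
value = element.attrib[lookup_key]
print(lookup_key, value)
```

Applied to the book's script, this means replacing each `something.keys()[0]` with `list(something.keys())[0]`.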

Problem with XML exercise, Chapter 3

Hello,

I was working through the exercises and had a problem when I tried:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()


print list(root)

(pages 57-61). It wouldn't list the elements in the list. I ended up writing a for loop to get it to look like the example in the book (bottom of page 60, top of page 61):

import xml.etree.ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

# data = root.find('Data')

for element in root:
	print element

Not sure if I did that correctly or efficiently, but this worked!

Thank you for writing the book.

Edit:
Python version:

Python 2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Mac OS: Sierra 10.12.3

Thanks!

Chapter 4. Working with Excel Files

Hi Jackie:

I think this code has a small bug on the line 'if count < 10:', because I got an empty result.
count = 0
data = {}
for i in xrange(sheet.nrows):
    if count < 10:
        if i >= 14:
            row = sheet.row_values(i)
            country = row[1]
            data[country] = {}
        count += 1

print data

After doing some investigation, I found that it should be written as 'if count < 20' as follows:
count = 0
data = {}
for i in xrange(sheet.nrows):
    if count < 20:
        if i >= 14:
            row = sheet.row_values(i)
            country = row[1]
            data[country] = {}
        count += 1

print data
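
A likely reason 20 works where 10 does not, assuming `count += 1` also runs for the rows before row 14: `count` reaches 10 before `i` ever reaches 14, so both conditions are never true together. With 20, rows 14 through 19 get through. A sketch of an alternative that avoids the counter by looping over the wanted row numbers directly follows; the `rows` list is a made-up stand-in for the worksheet, and the start row is taken from the snippet above, not checked against the spreadsheet.

```python
FIRST_DATA_ROW = 14   # first row that holds a country, per the snippet above
ROWS_WANTED = 6       # rows 14-19, the same rows 'if count < 20' lets through


def collect_countries(rows, first_row=FIRST_DATA_ROW, wanted=ROWS_WANTED):
    """Build {country: {}} from a slice of rows; the country is in column 1."""
    data = {}
    last_row = min(first_row + wanted, len(rows))
    for i in range(first_row, last_row):
        country = rows[i][1]
        data[country] = {}
    return data


# Stand-in for the worksheet: 14 header rows, then made-up country rows.
rows = [['', 'header']] * 14 + [['', 'Country %d' % n] for n in range(10)]
print(sorted(collect_countries(rows)))
```

With xlrd, `rows` could be built as `[sheet.row_values(i) for i in range(sheet.nrows)]`.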

noob question regarding JSON

I noticed on Chap 3 that the JSON structure on the sample data starts with [ (open-bracket) on the first row and ends with a ] (close-bracket); however, the original WHO JSON data only starts with { and ends with }. Can Python handle JSON data that does not start with a bracket? If I run the book example using the original WHO data, I don't get a full listing of the contents of the data, but just the 5 top-level objects.

Thanks.
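
Python's `json` module handles both forms: a file that starts with `[` loads as a list, and one that starts with `{` loads as a dict. Looping over a dict yields only its keys, which would explain seeing just the top-level names. A sketch with two tiny made-up documents follows; the key name `fact` is an assumption about where the records live in the original file, so check the keys that are actually printed.

```python
import json

# Made-up miniature versions of the two layouts.
list_style = '[{"Country": "Zambia"}, {"Country": "Zimbabwe"}]'
dict_style = '{"copyright": "example", "fact": [{"Country": "Zambia"}, {"Country": "Zimbabwe"}]}'

records = json.loads(list_style)      # top level is a list of records
document = json.loads(dict_style)     # top level is a dict

# Iterating over a dict gives its keys, not the nested records.
top_level_keys = sorted(document)

# The records sit under one of those keys ('fact' here is hypothetical).
nested_records = document['fact']

print(type(records).__name__, type(document).__name__)
print(top_level_keys)
print(nested_records == records)
```

Once the right key is known, the book's loop can run over `document[that_key]` instead of over `document` itself.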

Chapter 8 our_first_script_with_functions.py issue

When the script runs the following function I get the following error.

def find_missing_data(zipped_data):
    missing_count = 0
    for question, answer in zipped_data:
        if not answer:
            missing_count += 1
    return missing_count

in find_missing_data
    for question, answer in zipped_data:
ValueError: too many values to unpack
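
This `ValueError` appears when an item in `zipped_data` is not a two-element pair. One way that can happen, assumed here rather than confirmed from the script, is passing the whole list of zipped rows to the function instead of a single row. A sketch with made-up survey data:

```python
def find_missing_data(zipped_data):
    """Count the (question, answer) pairs whose answer is empty."""
    missing_count = 0
    for question, answer in zipped_data:
        if not answer:
            missing_count += 1
    return missing_count


# Made-up data: each row is a list of (question, answer) pairs.
questions = ['Age', 'Region', 'Income']
answers = [['34', '', '1200'], ['', '', '900']]
all_rows = [list(zip(questions, row)) for row in answers]

# Works: one row at a time, so every item is a 2-tuple.
missing_per_row = [find_missing_data(row) for row in all_rows]
print(missing_per_row)

# Fails: each item is now a whole row of three pairs, which cannot be
# unpacked into just `question, answer`.
try:
    find_missing_data(all_rows)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Printing `zipped_data[0]` just before the call shows which of the two shapes the script is actually passing.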

Chapter 5, pg. 97 possible error

@jackiekazil I am trying to work through chapter 5 and having lots of problems with slate, pdfminer, etc. Specifically, I get no print results from this (p. 97)--any recommendations? I've exhausted google searches and stackoverflow for possible solutions.

`pdf_txt = 'en-final-table9.txt'
openfile = open(pdf_txt, 'r')

for line in openfile:
    print (line)`

Problem with zipped data, page 164

for x in enumerate(zipped_data[0][:20]):
    print x

TypeError: 'zip' object is not subscriptable
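
In Python 3, `zip()` returns a one-shot iterator instead of a list, so it cannot be indexed or sliced. A minimal sketch of the usual workaround, wrapping each `zip` in `list()`, using made-up headers and rows:

```python
# Made-up stand-ins for the header and data rows.
headers = ['Cluster number', 'Household number', 'Line number']
data_rows = [['1', '17', '1'], ['1', '17', '2']]

# list(...) materialises each zip so that indexing and slicing work.
zipped_data = [list(zip(headers, row)) for row in data_rows]

first_pairs = list(enumerate(zipped_data[0][:20]))
for x in first_pairs:
    print(x)
```

The same change applies wherever the script stores the result of `zip(...)` and later indexes into it.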

An error on chapter 7.2.1

The code in this chapter that reads CSV data and uses a list comprehension raises an error on my Windows machine:
"Error: iterator should return strings, not bytes (did you open the file in text mode?)"
The code is:
`from csv import DictReader
data_rdr = DictReader(open(r'F:\Learn\python data wrangling\data-wrangling-master\data\unicef\mn.csv','rb'))
header_rdr = DictReader(open(r'F:\Learn\python data wrangling\data-wrangling-master\data\unicef\mn_headers.csv','rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr]

print(data_rows[:5])
print(header_rows[:5])`
I searched Stack Overflow, which says:

You need to wrap the file in a io.TextIOWrapper() instance, and you need to figure out the encoding

Is this correct?
I am using Python 3.5.2
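
That advice applies when only a binary file object is available. When the code controls the `open()` call, a simpler route on Python 3 is to open the file in text mode, with `newline=''` as the csv documentation recommends and an explicit encoding. A sketch follows; the helper name `read_dict_rows` is hypothetical and UTF-8 is an assumed encoding for these files.

```python
import csv


def read_dict_rows(path, encoding='utf-8'):
    """Read a CSV file into a list of dicts using text mode (Python 3)."""
    # newline='' lets the csv module handle line endings itself;
    # 'rb' is the Python 2 idiom and is what triggers the error on Python 3.
    with open(path, 'r', newline='', encoding=encoding) as csv_file:
        return [row for row in csv.DictReader(csv_file)]
```

The two readers above would then become `data_rows = read_dict_rows(...)` and `header_rows = read_dict_rows(...)`, each called with the same path as before. If a `UnicodeDecodeError` follows, the assumed encoding is the next thing to revisit.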