packtpublishing / pandas-cookbook

Pandas Cookbook, published by Packt

License: MIT License

Jupyter Notebook 100.00%

pandas-cookbook's Introduction

Pandas Cookbook

This is the code repository for Pandas Cookbook, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

About the Book

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way.

The pandas library is massive, and it’s common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter.

Instructions and Navigation

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.

The code will look like the following:

>>> import pandas as pd
>>> employee = pd.read_csv('data/employee.csv')
>>> max_dept_salary = employee.groupby('DEPARTMENT')['BASE_SALARY'].max()

Pandas is a third-party package for the Python programming language and, as of the printing of this book, is on version 0.20. Currently, Python has two major supported releases, versions 2.7 and 3.6. Python 3 is the future, and it is now highly recommended that all scientific computing users of Python use it, as Python 2 will no longer be supported in 2020. All examples in this book have been run and tested with pandas 0.20 on Python 3.6.

In addition to pandas, you will need the matplotlib (version 2.0) and seaborn (version 0.8) visualization libraries installed. A major dependency of pandas is the NumPy library, which forms the basis of most of the popular Python scientific computing libraries.

There are a wide variety of ways in which you can install pandas and the rest of the libraries mentioned on your computer, but by far the simplest method is to install the Anaconda distribution. Created by Continuum Analytics, it packages together all the popular libraries for scientific computing in a single downloadable file available on Windows, Mac OSX, and Linux. Visit the download page to get the Anaconda distribution (https://www.anaconda.com/download).

In addition to all the scientific computing libraries, the Anaconda distribution comes with Jupyter Notebook, which is a browser-based program for developing in Python, among many other languages. All of the recipes for this book were developed inside of a Jupyter Notebook and all of the individual notebooks for each chapter will be available for you to use.

It is possible to install all the necessary libraries for this book without the Anaconda distribution. If you are interested, visit the pandas Installation page (http://pandas.pydata.org/pandas-docs/stable/install.html).

pandas-cookbook's Issues

Ch 5 Translating SQL WHERE clauses section.

Enjoying the book!

In the "There's More" part of the "Translating SQL WHERE clauses" section, the negation causes pandas to return salaries < $80,000 and > $120,000, so it doesn't emulate a SQL BETWEEN.
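
The behavior described here can be reproduced with a minimal sketch. It assumes the recipe builds its salary filter with Series.between; the five salaries below are made up for illustration and are not the book's employee data.

```python
import pandas as pd

# Hypothetical stand-in for the employee data; only the column name
# BASE_SALARY is taken from the repository's README example.
employee = pd.DataFrame({'BASE_SALARY': [50000, 80000, 100000, 120000, 150000]})

# between() is inclusive on both ends by default, like SQL BETWEEN.
in_range = employee['BASE_SALARY'].between(80000, 120000)

# Negating the mask with ~ selects everything outside the range instead.
outside_range = ~in_range

print(employee[in_range])
print(employee[outside_range])
```

Dropping the `~` keeps the rows from 80,000 to 120,000 inclusive, which is the SQL BETWEEN behavior.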

Chapter 10, important typo

crime_sort.resample('QS-MAR')['IS_CRIME', 'IS_TRAFFIC'].sum().head()

Page 412 describes 'QS-MAR' as quarters starting in March, while it should read "ending" in March.
It's important because the resultant table starts at the beginning of Dec, which is not possible under the latter scenario and caused some confusion for me.
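
One way to check which dates the 'QS-MAR' alias anchors to is to generate a few of its boundaries directly. This is a minimal sketch; the start date is arbitrary and does not come from the book's crime dataset.

```python
import pandas as pd

# Ask pandas for four consecutive 'QS-MAR' boundaries, searching forward
# from 1 January 2017 (an illustrative date, not from the book).
quarter_starts = pd.date_range(start='2017-01-01', periods=4, freq='QS-MAR')

months = [ts.month for ts in quarter_starts]
days = [ts.day for ts in quarter_starts]

print(quarter_starts)
```

Each boundary falls on the first day of March, June, September, or December, which is consistent with a resampled table whose first row is labeled with the beginning of December.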

File won't download

When I try to download the file by clicking the download button it just opens another webpage with unformatted csv text information. How do I actually download the file?

flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)

PyCharm produces a different Series than the one in the example; the column keys are lost this way.
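
One possible workaround that keeps the column keys is to sort the underlying array with NumPy and rebuild the DataFrame, rather than relying on apply(sorted, axis=1). This is a different technique from the one in the book, sketched on a small made-up frame; the three rows are illustrative and are not the flights dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flights data; only the column names
# come from the issue above.
flights = pd.DataFrame({'ORG_AIR': ['LAX', 'DEN', 'ATL'],
                        'DEST_AIR': ['ATL', 'SFO', 'LAX']})
cols = ['ORG_AIR', 'DEST_AIR']

# Sort each row's two airport codes alphabetically in the raw array.
sorted_values = np.sort(flights[cols].to_numpy(dtype=object), axis=1)

# Rebuild a DataFrame so the column keys and the index are preserved.
flights_sort = pd.DataFrame(sorted_values, columns=cols, index=flights.index)

print(flights_sort)
```

After sorting, the column names no longer mean "origin" and "destination"; each row simply holds the two airports in alphabetical order.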

Chapter 11, In [49]

The code, as written, does not produce the output shown in Out [49] when run with IPython in a Jupyter notebook.

In the scatter plot, color= expects one input, not a list of two inputs.
In the line plot, the x axis has no labels.

ValueError: Table tracks not found

I solved this problem by trying different expressions for paths to the chinook.db file for the create_engine(), which eventually produced a useful error message. The following line works if you provide an absolute address for the dir that holds the chinook.db file in the string that I called "path_to_file":

engine = create_engine('sqlite:////' + path_to_file + 'chinook.db')

For absolute addresses, create_engine expects //// not ///.
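
A minimal sketch of one way to avoid counting slashes by hand: resolve the file to an absolute path, confirm it exists, and only then build the URL. The helper name sqlite_url is hypothetical and not part of SQLAlchemy or the book.

```python
from pathlib import Path


def sqlite_url(db_file):
    """Return a 'sqlite:///<absolute path>' URL for an existing file.

    Hypothetical helper: it raises early if the file is missing, rather
    than letting a connection be opened against a path that is not there.
    """
    db_path = Path(db_file).expanduser().resolve()
    if not db_path.is_file():
        raise FileNotFoundError(f"No database file at {db_path}")
    # On POSIX the absolute path starts with '/', so the result has four
    # slashes in total: 'sqlite:////absolute/path/to/chinook.db'.
    return f"sqlite:///{db_path}"


# Usage with SQLAlchemy (left as a comment to keep the sketch stdlib-only):
# engine = create_engine(sqlite_url('chinook.db'))
```

The existence check matters because SQLite will typically open a new, empty database when the path does not point at an existing file, which is one way to end up with a "table not found" error like the one below.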

This was my original statement of the problem:

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///chinook.db')
tracks = pd.read_sql_table('tracks', engine)
tracks.head()

InvalidRequestError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
239 try:
--> 240 meta.reflect(only=[table_name], views=True)
241 except sqlalchemy.exc.InvalidRequestError:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sqlalchemy/sql/schema.py in reflect(self, bind, schema, views, only, extend_existing, autoload_replace, **dialect_kwargs)
4148 "Could not reflect: requested table(s) not available "
-> 4149 "in %r%s: (%s)" % (bind.engine, s, ", ".join(missing))
4150 )

InvalidRequestError: Could not reflect: requested table(s) not available in Engine(sqlite:///chinook.db): (tracks)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in ()
----> 1 tracks = pd.read_sql_table('tracks', engine)
2 tracks.head()

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
240 meta.reflect(only=[table_name], views=True)
241 except sqlalchemy.exc.InvalidRequestError:
--> 242 raise ValueError("Table %s not found" % table_name)
243
244 pandas_sql = SQLDatabase(con, meta=meta)

ValueError: Table tracks not found

Website seems to have changed.

Chapter 9, Comparing President Trump's and Obama's approval ratings

The code below no longer works:

base_url = 'http://www.presidency.ucsb.edu/data/popularity.php?pres={}'
trump_url = base_url.format(45)

df_list = pd.read_html(trump_url)
len(df_list)

It raises ValueError: No tables found.

Can you share the data in csv format or modify the source code to work with the present website?

Vehicles dataset?

In Ch. 5 - EDA, there are examples that use the vehicles.csv dataset, but I cannot locate the file in the repository. Is there a URL to the actual data? I couldn't find it on fueleconomy.gov.

Thanks.

Chapter 1 > notebook

.dtypes.value_counts() is the one to use;
the following is deprecated: .get_dtype_counts()
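
A minimal sketch of the suggested replacement, using a small made-up DataFrame rather than the dataset from the Chapter 1 notebook:

```python
import pandas as pd

# Hypothetical three-column frame with one integer, one float,
# and one string column.
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

# Count how many columns have each dtype; this stands in for the
# deprecated get_dtype_counts() method.
dtype_counts = df.dtypes.value_counts()

print(dtype_counts)
```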