guipsamora / pandas_exercises

Practice your pandas skills!

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 100.00%

pandas exercise practice tutorial data-analysis

pandas_exercises's Introduction

Pandas Exercises

Fed up with a ton of tutorials but no easy way to find exercises, I decided to create a repo just with exercises to practice pandas. Don't get me wrong, tutorials are great resources, but to learn is to do. So unless you practice, you won't learn.

There will be three different types of files:
      1. Exercise instructions
      2. Solutions without code
      3. Solutions with code and comments

My suggestion is that you learn a topic in a tutorial, video or documentation and then do the first exercises. Learn one more topic and do more exercises. If you are stuck, don't go directly to the files with solutions and code. Check the solutions without code first and try to reproduce the correct answer.

Suggestions and collaborations are more than welcome.🙂 Please open an issue or make a PR indicating the exercise and your problem/solution.

Lessons


Getting and knowing	Merge	Time Series
Filtering and Sorting	Stats	Deleting
Grouping	Visualization	Indexing
Apply	Creating Series and DataFrames	Exporting

Video Solutions

Video tutorials of data scientists working through the above exercises:

Data Talks - Pandas Learning By Doing


pandas_exercises's Issues

02_Filtering_&_Sorting / Fictional_Army

According to the link below, the ix method is deprecated.

http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#deprecate-ix

Solutions for step 17 could be, for example (just what I came up with):
army.loc[['Arizona'], army.columns[3]]

and for step 18:
army.iloc[2, army.columns.tolist().index('deaths')]
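As a rough sketch of the same idea, label-based and position-based selection can be written without ix as shown below. The miniature army frame and its values are invented for illustration and are not the exercise data; only the loc/iloc pattern matters.

```python
import pandas as pd

# Hypothetical miniature of the exercise frame, indexed by origin.
army = pd.DataFrame(
    {
        "regiment": ["Nighthawks", "Dragoons", "Scouts"],
        "deaths": [523, 52, 25],
        "veterans": [1, 5, 62],
    },
    index=["Arizona", "California", "Texas"],
)

# Label-based selection: row 'Arizona', column 'deaths'.
arizona_deaths = army.loc["Arizona", "deaths"]

# Position-based selection: third row, with the column position
# looked up from the column name instead of hard-coded.
third_row_deaths = army.iloc[2, army.columns.get_loc("deaths")]
```

Using columns.get_loc avoids converting the column index to a list just to find one position.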

Also, in step 3, the hint "Don't forget to include the columns names" is confusing, since pd.DataFrame builds the frame fine without them, just not in the column order expected by the solutions of the following exercises.

Thanks for this great collection of exercises, it's a real treasure.

Chipotle: Finding the highest ordered item

When I tried the following code, the quantity for the most-ordered item was different:
chipo['item_name'].value_counts().head(1)
Out[48]:
Chicken Bowl 726
Name: item_name, dtype: int64

But when I try your method, I get a different value:
chipo.groupby('item_name').sum().sort_values(['quantity'],ascending =False).head(1)

order_id	quantity

713926 | 761

Can you please help me with this? An explanation would be very much appreciated.
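A likely explanation, sketched on made-up rows: value_counts counts how many rows mention an item, while the groupby sum adds up the quantity column, so the two differ whenever a row has a quantity above 1. The frame below is illustrative only.

```python
import pandas as pd

# Invented sample rows in the shape of the Chipotle data.
chipo = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 3],
        "item_name": ["Chicken Bowl", "Chicken Bowl", "Chicken Bowl", "Steak Burrito"],
        "quantity": [1, 2, 1, 1],
    }
)

# Number of order lines per item (one per row).
rows_per_item = chipo["item_name"].value_counts()

# Number of units per item (quantities added up).
units_per_item = chipo.groupby("item_name")["quantity"].sum()
```

Here "Chicken Bowl" appears on 3 rows but accounts for 4 units. Selecting only the quantity column before summing also avoids the meaningless sum of order_id values seen in the output above.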

What is the license of the exercises?

02: Filtering and Sorting/Chipotle, Step-4 & 5

In 02-Filtering_and_Sorting/Chipotle, steps 4 and 5:

The solution doesn't consider items that never appear with 'quantity' == 1 in the data.
They can be extracted with:
chipo['item_price'] = chipo['item_price']/chipo['quantity']
chipo['quantity'] = 1 #Dividing item_price by quantity, therefore let quantity be 1
chipo.drop_duplicates(['item_name'], keep='first', inplace=True)
chipo.sort_values(by='item_price', ascending=False, inplace=True)
display(chipo[['item_name', 'item_price']])

I'm also a beginner at Pandas, please let me know about any stupid thing that I missed. Thanks.

Using nbgrader?

Great exercises, thank you so much. Not really an issue, but a suggestion: would it be possible to use nbgrader to check the user's answers against the solution automatically? Here's a link

Chipotle Item Price

In a few of the early Chipotle examples, item_price is treated as the cost of a single item. It looks to me like it's actually the cost of all items of the type in that order - the most expensive item is "2 steak burritos" at $22, but a single steak burrito only costs $11.
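If that reading is right, a per-unit price can be derived by dividing the line total by the quantity. A minimal sketch, using the rounded prices mentioned above rather than the real data:

```python
import pandas as pd

# Illustrative rows: item_price is assumed to be the total for the line.
chipo = pd.DataFrame(
    {
        "item_name": ["Steak Burrito", "Steak Burrito"],
        "quantity": [2, 1],
        "item_price": [22.0, 11.0],
    }
)

# Price of a single unit on each order line.
chipo["unit_price"] = chipo["item_price"] / chipo["quantity"]
```

Both rows end up with the same unit_price, which is what one would expect if item_price already includes the quantity.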

Euro12 Data is missing.

The data for Euro 12 is missing from the Filtering and Sorting exercise.

Question: Is Step 9 solution correct?

Hello, thx for that nice repo!

Step 9. Calculate the mean Yellow Cards given per Team in 02_Filtering_&_Sorting Exercises_with_Solutions.ipynb

round(discipline['Yellow Cards'].mean())

I guess it's the overall average rather than the average grouped by team (discipline.groupby("Team").agg({"Yellow Cards":"mean"})).
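The difference between the two readings can be sketched on a toy frame. The team names and card counts are invented, and the toy frame deliberately has more than one row per team so the two results differ:

```python
import pandas as pd

# Hypothetical data with repeated teams.
discipline = pd.DataFrame(
    {"Team": ["A", "A", "B"], "Yellow Cards": [4, 6, 9]}
)

# One number: the mean over all rows.
overall_mean = discipline["Yellow Cards"].mean()

# One number per team: the mean within each group.
per_team_mean = discipline.groupby("Team")["Yellow Cards"].mean()
```

If the real table has exactly one row per team, the grouped mean simply returns each team's own value, which may be why the exercise uses the overall mean.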

TypeError in Python 3.7

TypeError Traceback (most recent call last)
in ()
----> 1 prices= [float(value[1:-1]) for value in chipo.item_price]
2
3 #reassign the column values with the updated values
4 chipo.item_price = prices
5

in (.0)
----> 1 prices= [float(value[1:-1]) for value in chipo.item_price]
2
3 #reassign the column values with the updated values
4 chipo.item_price = prices
5

TypeError: 'float' object is not subscriptable
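The issue gives no cause, but one plausible explanation (an assumption) is that item_price already holds floats, for example because the conversion cell was run twice, so slicing with value[1:-1] fails. A sketch of a conversion that tolerates both cases, with the hypothetical helper name to_price:

```python
import pandas as pd

# Illustrative column in the raw string format, e.g. "$2.39 ".
chipo = pd.DataFrame({"item_price": ["$2.39 ", "$3.39 "]})


def to_price(value):
    """Convert a price to float, whether it is still a string or not."""
    if isinstance(value, str):
        # Drop surrounding whitespace and the leading dollar sign.
        return float(value.strip().lstrip("$"))
    return float(value)


chipo["item_price"] = chipo["item_price"].apply(to_price)

# Running the conversion again no longer raises a TypeError.
chipo["item_price"] = chipo["item_price"].apply(to_price)
```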

Add requirements.txt file

Add a requirements.txt file to make it easy to install dependencies with pip.

More efficient alternative in 04_Apply/Students_Alcohol_Consumption

In step 10, we want to multiply all numerical values by 10.

The provided solution is:
df.applymap(times10).head(10)

But this is very slow, because it runs a regular Python function on every element in the dataframe.

It is better to test each column's type and then use pandas' built-in multiplication on the whole column:

for colname, coltype in df.dtypes.to_dict().items():
    if coltype.name in ['int64']:
        df[colname] = df[colname] * 10

I used %%timeit to test the two solutions. On this small dataset, my solution is 5x as fast (1.1ms vs 5.8ms). The difference would get larger with a larger dataset.
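A variation on the same idea, as a sketch: select_dtypes can pick out every numeric column at once, so the loop over dtypes is not needed. The small frame below is a stand-in with invented values.

```python
import pandas as pd

# Stand-in frame with one text column and two integer columns.
df = pd.DataFrame(
    {"school": ["GP", "MS"], "age": [18, 17], "Medu": [4, 1]}
)

# Names of all numeric columns, whatever their exact integer or float type.
numeric_cols = df.select_dtypes(include="number").columns

# Vectorized multiplication on the numeric block only.
df[numeric_cols] = df[numeric_cols] * 10
```

Unlike the check for 'int64' alone, this also covers float and smaller integer columns.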

02: Filtering and Sorting/Euro12, Dataset Unavailable

Link expired, Please use this instead.

Ex1 - Getting and knowing your Data - WorldFoodFacts - Dataset Changed

The solutions to the steps need to be updated, since the dataset size has changed.
For reference:

Given Solution Dataset info:

food.info() #Columns: 159 entries
(65503, 159)
159
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65503 entries, 0 to 65502
Columns: 159 entries, code to nutrition_score_uk_100g
dtypes: float64(103), object(56)
memory usage: 79.5+ MB

Current Dataset info:

>>> food.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB

Links to datasets in 05_Merge and 06_Stats incorrect

The links in Step 2 of 05_Merge are incorrect.
They're:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars1.csv
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars2.csv

They should be:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv

The link in Step 2 of 06_Stats is incorrect. Currently it's:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Stats/US_Baby_Names/US_Baby_Names_right.csv

It should be:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv

Massive pull requests!!

I just fixed lots of typos and errors in the "Exercise_with_solutions" parts, and some in the "Exercise" and "Solution" parts.

For the "Exercise" and "Solution" parts, I couldn't fix all of the links or titles so far.
It might be a little confusing for readers, so please grep for URLs or titles and fix them.

For some parts, the commits are messy, because I had to execute some cells and refresh the results.
For those parts, reviewing the fixes in Jupyter Notebook rather than on GitHub would be better.

#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#49

Korean translation

Hello. I want to translate the repo into Korean for Korean learners.

I am a Korean student studying data science. I think it is a very useful resource for practicing pandas, so I want to share it with Korean learners.

I know it is open source under the BSD license, and I have already forked it, but I thought it better to notify you. Thanks for your work.

pandas_exercises/04_Apply/Students_Alcohol_Consumption/Exercises_with_solutions.ipynb Step 5. Capitalize

In step 5, the exercise asks for a function that capitalizes strings. The function applied is x.upper(). Isn't it supposed to be x.capitalize(), since capitalizing a string means making the first character uppercase and the rest lowercase?

Step 5. Create a lambda function that captalize strings.
captalizer = lambda x: x.upper()

Expected:

Step 5. Create a lambda function that captalize strings.
captalizer = lambda x: x.capitalize()
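The difference between the two string methods can be seen in a short sketch, here using the vectorized .str accessor on an invented Series rather than a lambda:

```python
import pandas as pd

# Example values in the style of the dataset's job columns (illustrative).
names = pd.Series(["teacher", "at_home"])

# Every character uppercased.
uppercased = names.str.upper()

# First character uppercased, the rest lowercased.
capitalized = names.str.capitalize()
```

So "teacher" becomes "TEACHER" with upper and "Teacher" with capitalize.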

New to GitHub

Hey,
I came across your pandas exercise. I am new to GitHub, how do I code in your pandas exercise and execute them? Right now they are greyed out and I am unable to edit.

06_Stats/Wind_Stats/Step 8

Hi, thanks for the exercises provided. Very helpful.

I think the result for 06_Stats/Wind_Stats/Step 8 is incorrect.
I believe we should skip NA when calculating the mean value, and mean() already excludes NA by default.
So the following should be fine:
data.mean().mean()

Ex2 - Getting and Knowing your Data Step 10

Ex2 - Getting and Knowing your Data
Step 10. How many items were ordered?
The answer is the same as in step 9, but I think the right answer is chipo['quantity'].sum(). Did I misunderstand the question?

There are some mistakes

In
pandas_exercises-master\09_Time_Series\Getting_Financial_Data\Exercises_with_solutions_and_code

"Step 4" does not work!

Wrong correction in 06_Stats/Wind_Stats/Step 8

Hi! Earlier you accepted the corrections from this issue by @maxim5.
But I think one of them is wrong:

The solution to step 8 is wrong: the mean value does not equal to the mean of means. The right solution is:
data.fillna(0).values.flatten().mean()

I think this is wrong because filling NA values with 0 distorts the data. There is no reason to pick 0, 5 or -100 to replace NA; the missing values should simply be skipped, just as the rest of the project does when using functions like .mean() and .sum(), which skip NA values by default.

So the solution must be something like this:
data.sum().sum() / data.notna().sum().sum()
or this:
data.values.flatten()[~np.isnan(data.values.flatten())].mean()
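To make the disagreement concrete, here is a sketch that computes the competing answers on a tiny invented frame with missing values. The column names mimic the wind data, but the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: column RPT has two missing readings.
data = pd.DataFrame(
    {"RPT": [10.0, np.nan, np.nan], "VAL": [20.0, 30.0, 60.0]}
)

# Mean of the per-column means: each column weighs the same,
# regardless of how many readings it has.
mean_of_means = data.mean().mean()

# Missing readings counted as zeros, which pulls the mean down.
zero_filled = data.fillna(0).to_numpy().mean()

# Mean over the observed readings only.
skip_nan = data.sum().sum() / data.notna().sum().sum()
skip_nan_numpy = np.nanmean(data.to_numpy())
```

The three approaches give three different numbers here, and only the last two expressions agree with each other.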

However, I like your project and I learned a lot from it. Thank you!

Occupation: Step18 solution

Step 18. What is the age with least occurrence?
Current version:
users.age.value_counts().tail(1)
Age 7 has the lowest number of occurrences, 1.

However, ages 11, 10, 73 and 66 also occur only once.
So the corrected version:
users.age.value_counts().tail()
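Since tail() returns a fixed number of rows, it only captures all ties by coincidence. One possible way to return every age tied for the lowest count, sketched on invented ages:

```python
import pandas as pd

# Illustrative ages: three of them occur exactly once.
users = pd.DataFrame({"age": [30, 30, 25, 25, 7, 73, 66]})

counts = users["age"].value_counts()

# Keep every age whose count equals the smallest count.
least_common = counts[counts == counts.min()]
```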

Occupation: Step12 solution

Step 12. How many different occupations there are in this dataset?
current answer is:

len(users.occupation.unique())

The 'len' call is unnecessary; try:

users['occupation'].nunique()
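One caveat worth knowing: the two expressions only agree when the column has no missing values, because unique() keeps NaN as a value while nunique() skips it by default. A sketch with an invented column containing one missing entry:

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing occupation.
users = pd.DataFrame(
    {"occupation": ["writer", "student", "writer", np.nan]}
)

# unique() includes NaN among the distinct values.
via_len = len(users["occupation"].unique())

# nunique() ignores NaN unless dropna=False is passed.
via_nunique = users["occupation"].nunique()
```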

Count not a general solution in Chipotle, 02_Filtering_&_Sorting

Step 8 proposes the following solution to count how many Veggie Salad Bowl there are.

chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]

len(chipo_salad)

However, this doesn't seem like a general solution to the problem. Wouldn't this be better?

chipo[chipo.item_name == "Veggie Salad Bowl"].quantity.sum()

So that quantities get taken into account?

Step 9 not clear in Apply_Alcohol

Step 9 is not clear. What is the threshold at which you label someone as a legal drinker?

https://github.com/guipsamora/pandas_exercises/blob/master/04_Apply/Students_Alcohol_Consumption/Exercises.ipynb

Pandas Application

Good exercises

Link in 06_Stats/Wind_Stats not working

Link in 06_Stats/Wind_Stats not working. It currently is:
https://github.com/guipsamora/pandas_exercises/blob/master/Stats/Wind_Stats/wind.data

It should be:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data

Wrong title in the sort->baby_names exercise

In the baby names exercise, in task no. 5, there is a small mistake:

#deletes Unnamed: 0
del baby_names['Unnamed: 0']
#deletes Unnamed: 0
del baby_names['Id']

instead of:

#deletes Unnamed: 0
del baby_names['Unnamed: 0']
#deletes **Id**
del baby_names['Id']

Completing Exercises

Hi - I am new to this site and am attempting to complete the exercises; however, the line where I'd enter code are not in edit form. Am I missing something? Can you please help me figure out why I cannot add answers to the lines?

Poor quality of "06_Stats/Wind_Stats" exercise

The link at step 2 is wrong, as it leads to GitHub, not the data.
Steps 4 and 5 are confusing, because they depend on the way step 3 is solved. I didn't make the mistake described in the solutions and spent some time figuring out what these steps are about.
The solution to step 7 is wrong, it should be:
data.shape[0] - data.isnull().sum() or better data.notnull().sum().
The solution to step 8 is wrong: the mean value does not equal to the mean of means. The right solution is:
data.fillna(0).values.flatten().mean()
The solution to step 9 seems a bit verbose. What's wrong with describe(percentiles=[])?
The solution to step 11 seems ok, but is too verbose as well. Pandas favors one-liners where possible:
data.loc[data.index.month == 1].mean()
The solutions to steps 12, 13 and 14 are absolutely wrong. They have nothing to do with "yearly/monthly frequency", but select individual rows from the data frame. The right solution looks like:
data.groupby(data.index.to_period('A')).mean()
Step 15 is confusing, because that's "monthly frequency" that step 13 was about. The distinction should be better clarified in both steps. And the solution is wrong again. Here's what it should be:
data.groupby(data.index.to_period('M')).mean().head()

Overall, 11 out of 16 steps are either wrong or misleading. On top of that, the bulk of this notebook belongs to "Time series analysis".
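The period-based grouping proposed above can be sketched on synthetic daily data. The single RPT column and its values are invented, and the alias "Y" is used in place of "A" because recent pandas releases prefer it; treat that substitution as an assumption about the installed version.

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily readings (values are just 0, 1, 2, ...).
index = pd.date_range("1961-01-01", "1962-12-31", freq="D")
data = pd.DataFrame({"RPT": np.arange(len(index), dtype=float)}, index=index)

# One row per calendar year: the mean of all days in that year.
yearly = data.groupby(data.index.to_period("Y")).mean()

# One row per calendar month.
monthly = data.groupby(data.index.to_period("M")).mean()
```

Each group aggregates every day in the period, in contrast to picking one row per year or month.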

broken url

Hi! I can't reach the train.csv file for https://github.com/guipsamora/pandas_exercises/blob/master/07_Visualization/Titanic_Desaster/Exercises_code_with_solutions.ipynb.
Thank you!

Getting Financial Data - Pandas Datareader - Step 8 Solution

Could you please share the code for step 8?

02_Filtering_&_Sorting/Chipotle step4 is missing

How to use the notebooks (exercise-only)

When I open the notebook in the browser (Chrome), it seems to render properly except that the cells are all grayed out. I was expecting that I would be able to type into them and run the code I write. Is that not what was intended? Thanks

Missing third exercise in 04_Apply

"04_Apply/US_Crime_Rates/Exercises_with_solutions.ipynb" has some questions on the 'Chipotle' data, which is not related to the US Crime rate data. See the questions from 'Step 11'.

Maybe these questions should be part of "04_Apply/US_Crime_Rates/Chipotle".

Chipotle Exercises Step 16 Issue

You ask: 'What is the average amount per order?'

Your solution:
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['item_price']
output: 18.81

But if we're talking about average amount per order, I assume that would mean the average revenue per order (quantity * price, what was computed in question 14):

order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['rev']
output: 21.39

Just a matter of semantics really. This practice set is awesome btw!

Please add a license file

It would be nice if these exercises would have a license, so one knows under which conditions one can make use of them.

I don't have any particular license in mind myself, and of course that's not my call to make, though in the name of reducing license proliferation I would suggest using the same license as pandas itself: https://github.com/pandas-dev/pandas/blob/master/LICENSE .

Link Error 07_Visualization, Online_Retail

Links redirect to this address:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Visualization/Online_Retail/Online_Retail.csv

The working link right now is:

https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/07_Visualization/Online_Retail/Online_Retail.csv

the link doesn't work anymore

Can I translate this code into Korean and post it on my blog?

Hello,

This code was very helpful to me in studying pandas.

I want to post it on a blog for people studying for data qualification exams in Korea.

CHIPOTLE ::Step 10: How many items were ordered

I think the solution for step 10 is wrong, because the question asks how many distinct items were ordered.
So the solution should be: len(chipo.groupby("item_name"))

Alternative for 06_Stats/Wind_Stats/Step 12-14

Steps 12 to 14 ask to downsample the records to a yearly/monthly/weekly frequency for each location.
The provided solution looks like this:
data.groupby(data.index.to_period('A')).mean()

I think it would be simpler to use the resample function, as below:
data.resample('AS').mean()
data.resample('M').mean()
data.resample('W').mean()
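As a sketch of why the two approaches are interchangeable, the snippet below checks that resample and the period-based groupby give the same monthly means on synthetic data. The frame is invented. It uses the 'MS' (month start) alias because the 'AS' and 'M' spellings are deprecated in recent pandas releases; adjust the aliases to the installed version.

```python
import numpy as np
import pandas as pd

# Three months of synthetic daily readings.
index = pd.date_range("1961-01-01", "1961-03-31", freq="D")
data = pd.DataFrame({"RPT": np.arange(len(index), dtype=float)}, index=index)

# Downsample with resample: bins labeled by the first day of each month.
monthly_resample = data.resample("MS").mean()

# Same aggregation via groupby on monthly periods.
monthly_groupby = data.groupby(data.index.to_period("M")).mean()

# Weekly downsampling works the same way.
weekly = data.resample("W").mean()

same_values = np.allclose(
    monthly_resample["RPT"].to_numpy(), monthly_groupby["RPT"].to_numpy()
)
```

The values match; only the index type differs (timestamps for resample, periods for the groupby).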

'Chips and Roasted Chili-Corn Salsa' and 'Chips and Roasted Chili Corn Salsa' is duplicated?

We have both 'Chips and Roasted Chili-Corn Salsa' and 'Chips and Roasted Chili Corn Salsa' in the Chipotle exercise.

04_Apply Students_Alcohol_Consumption, Step 10

The age column's type is np.int64, but after applymap it returns None.
As attached:

Ex1 - Filtering and Sorting Data - Price is ill defined

Steps 4 and 5 assume that each item has a unique price.

However,
chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)].item_price.unique()
returns
array([10.98, 11.25, 8.75, 8.49, 8.19, 10.58, 8.5 ])

This is handled by drop_duplicates; however, it is still misleading, as these rows aren't duplicates in price.

01_Getting_&_Knowing_Your_Data - Chipotle - Total Revenue Incorrect

In the Chipotle dataset, I believe the total revenue is incorrect.
It assumes that the quantity must be multiplied by the price to get the total.
(chipo['quantity']* chipo['item_price']).sum()

However, I believe that the quantity is already included in the price, as can be seen by examining
chipo[chipo['item_name'] == '6 Pack Soft Drink']

I think the following is sufficient.
chipo['item_price'].sum()
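The effect on the total can be sketched with two illustrative rows. The prices below are example values chosen so that the second line costs exactly twice the first, as it would if item_price already included the quantity.

```python
import pandas as pd

# Illustrative rows: the quantity-2 line costs twice the quantity-1 line.
chipo = pd.DataFrame(
    {
        "item_name": ["6 Pack Soft Drink", "6 Pack Soft Drink"],
        "quantity": [1, 2],
        "item_price": [6.49, 12.98],
    }
)

# Treating item_price as the line total.
revenue_line_totals = chipo["item_price"].sum()

# Multiplying again by quantity counts the second line twice.
revenue_multiplied = (chipo["quantity"] * chipo["item_price"]).sum()
```

Under this assumption the multiplied version overstates revenue for every line with a quantity above 1.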