
pandas_exercises's Introduction

Pandas Exercises

Fed up with a ton of tutorials but no easy way to find exercises, I decided to create a repo just with exercises to practice pandas. Don't get me wrong, tutorials are great resources, but to learn is to do. So unless you practice, you won't learn.

There will be three different types of files:
      1. Exercise instructions
      2. Solutions without code
      3. Solutions with code and comments

My suggestion is that you learn a topic in a tutorial, video or documentation and then do the first exercises. Learn one more topic and do more exercises. If you are stuck, don't go directly to the solution-with-code files. Check the solutions without code first and try to get the correct answer.

Suggestions and collaborations are more than welcome. 🙂 Please open an issue or make a PR indicating the exercise and your problem/solution.

Lessons

Topics:

Getting and knowing | Merge | Time Series
Filtering and Sorting | Stats | Deleting
Grouping | Visualization | Indexing
Apply | Creating Series and DataFrames | Exporting

Getting and knowing
  • Chipotle
  • Occupation
  • World Food Facts

Filtering and Sorting
  • Chipotle
  • Euro12
  • Fictional Army

Grouping
  • Alcohol Consumption
  • Occupation
  • Regiment

Apply
  • Students Alcohol Consumption
  • US_Crime_Rates

Merge
  • Auto_MPG
  • Fictitious Names
  • House Market

Stats
  • US_Baby_Names
  • Wind_Stats

Visualization
  • Chipotle
  • Titanic Disaster
  • Scores
  • Online Retail
  • Tips

Creating Series and DataFrames
  • Pokemon

Time Series
  • Apple_Stock
  • Getting_Financial_Data
  • Investor_Flow_of_Funds_US

Deleting
  • Iris
  • Wine

Video Solutions

Video tutorials of data scientists working through the above exercises:

Data Talks - Pandas Learning By Doing

pandas_exercises's People

Contributors

aquaraga, cconw, doganck, freddie71010, gaurangtandon, germavinsmoke, guipsamora, jeffcarey, manjunath24, max-alletsee, mcgradymvp, mukultaneja, njutn95, oleg104, pkro, romansnsk, skgurura, takaakifuruse, zaheer031


pandas_exercises's Issues

Occupation: Step18 solution

Step 18. What is the age with least occurrence?
Current version:
users.age.value_counts().tail(1)
This reports age 7 as the least frequent, with 1 occurrence.

However, ages 11, 10, 73 and 66 also occur only once.
So the corrected version is:
users.age.value_counts().tail()
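Note that tail() returns a fixed five rows, so it only matches the ties by coincidence. A minimal sketch of a tie-aware alternative follows; the `ages` series is a small made-up stand-in for `users.age`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for users.age from the Occupation exercise.
ages = pd.Series([25, 25, 30, 30, 30, 7, 11, 73])

counts = ages.value_counts()

# Keep every age whose count equals the minimum, however many ties exist.
least_common = counts[counts == counts.min()]
least_common_ages = sorted(least_common.index.tolist())
print(least_common_ages)
```

This keeps working whether one age or twenty ages share the lowest count.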

pandas_exercises/04_Apply/Students_Alcohol_Consumption/Exercises_with_solutions.ipynb Step 5. Capitalize

In step 5, the exercise asks for a function that capitalizes strings, but the function applied is x.upper(). Isn't it supposed to be x.capitalize(), since capitalizing a string means making the first character uppercase and the rest lowercase?

Step 5. Create a lambda function that captalize strings.
captalizer = lambda x: x.upper()

Expected:

Step 5. Create a lambda function that captalize strings.
captalizer = lambda x: x.capitalize()
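For reference, the difference between the two string methods can be seen on a few sample values. The values below are made up for illustration and are not taken from the exercise data.

```python
import pandas as pd

# Hypothetical sample strings; the real column comes from the exercise dataset.
values = pd.Series(["GP", "ms", "leeds"])

# str.upper() uppercases every character.
upper = values.apply(lambda x: x.upper())

# str.capitalize() uppercases the first character and lowercases the rest.
capitalized = values.apply(lambda x: x.capitalize())

print(upper.tolist())
print(capitalized.tolist())
```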

Occupation: Step12 solution

Step 12. How many different occupations there are in this dataset?
The current answer is:

len(users.occupation.unique())

The len call is unnecessary; try:

users['occupation'].nunique()
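One caveat worth keeping in mind: the two forms only agree when the column has no missing values, because unique() keeps NaN while nunique() drops it by default. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical occupation column with one missing value.
occupation = pd.Series(["student", "engineer", "student", np.nan])

count_via_len = len(occupation.unique())            # NaN is counted as a value
count_via_nunique = occupation.nunique()            # NaN is dropped by default
count_with_nan = occupation.nunique(dropna=False)   # opt in to counting NaN

print(count_via_len, count_via_nunique, count_with_nan)
```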

02: Filtering and Sorting/Chipotle, Step-4 & 5

In 02-Filtering_and_Sorting/Chipotle, steps 4 and 5:

  • The solution doesn't account for rows whose 'quantity' is not 1
  • Per-item prices can be extracted with
    chipo['item_price'] = chipo['item_price']/chipo['quantity']
    chipo['quantity'] = 1  # item_price is now per unit, so set quantity to 1
    chipo.drop_duplicates(['item_name'], keep='first', inplace=True)
    chipo.sort_values(by='item_price', ascending=False, inplace=True)
    display(chipo[['item_name', 'item_price']])

I'm also a beginner at pandas, so please let me know if I missed anything obvious. Thanks.

Poor quality of "06_Stats/Wind_Stats" exercise

  • The link in step 2 is wrong: it leads to GitHub, not to the data
  • Steps 4 and 5 are confusing because they depend on how step 3 is solved. I didn't make the mistake described in the solutions and spent some time figuring out what these steps are about.
  • The solution to step 7 is wrong; it should be:
    data.shape[0] - data.isnull().sum() or better data.notnull().sum().
  • The solution to step 8 is wrong: the overall mean does not equal the mean of the column means. The right solution is:
    data.fillna(0).values.flatten().mean()
  • The solution to step 9 seems a bit verbose. What's wrong with describe(percentiles=[])?
  • The solution to step 11 seems OK, but is too verbose as well. Pandas favors one-liners where possible:
    data.loc[data.index.month == 1].mean()
  • The solutions to steps 12, 13 and 14 are absolutely wrong. They have nothing to do with "yearly/monthly frequency"; they select individual rows from the data frame. The right solution looks like:
    data.groupby(data.index.to_period('A')).mean()
  • Step 15 is confusing because it asks for the same "monthly frequency" that step 13 was about. The distinction should be clarified in both steps. And the solution is wrong again. Here's what it should be:
    data.groupby(data.index.to_period('M')).mean().head()

Overall, 11 out of 16 steps are either wrong or misleading. On top of that, the bulk of this notebook belongs under "Time series analysis".
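To make the distinction between "all Januaries" (step 11) and "one row per calendar month" (step 15) concrete, here is a minimal sketch on a tiny made-up frame. The column name `RPT` and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wind readings indexed by date.
idx = pd.to_datetime(["1961-01-01", "1961-01-02", "1961-02-01", "1962-01-01"])
data = pd.DataFrame({"RPT": [10.0, 20.0, 30.0, 50.0]}, index=idx)

# Step 11 style: pool every January, regardless of year.
january_mean = data.loc[data.index.month == 1].mean()

# Step 15 style: one mean per calendar month (1961-01, 1961-02, 1962-01).
monthly = data.groupby(data.index.to_period("M")).mean()

print(january_mean)
print(monthly)
```

The first result is a single value per column; the second has one row for each month that appears in the index.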

Wrong title in the sort->baby_names exercise

In the baby names exercise, in task no. 5, there is a small mistake:

#deletes Unnamed: 0
del baby_names['Unnamed: 0']
#deletes Unnamed: 0
del baby_names['Id']

instead of:

#deletes Unnamed: 0
del baby_names['Unnamed: 0']
#deletes **Id**
del baby_names['Id']

Chipotle Item Price

In a few of the early Chipotle examples, item_price is treated as the cost of a single item. It looks to me like it's actually the cost of all items of the type in that order - the most expensive item is "2 steak burritos" at $22, but a single steak burrito only costs $11.
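If item_price is indeed the total for the order line, a per-unit price can be derived by dividing by quantity. The sketch below uses a made-up three-row frame; only the $22 and $11 steak burrito figures come from the observation above, and the `unit_price` column name is an assumption.

```python
import pandas as pd

# Hypothetical rows; item_price is assumed to be the line total.
chipo = pd.DataFrame({
    "item_name": ["Steak Burrito", "Steak Burrito", "Chips"],
    "quantity": [2, 1, 1],
    "item_price": [22.0, 11.0, 2.15],
})

# Derive the price of a single unit without overwriting the original column.
chipo["unit_price"] = chipo["item_price"] / chipo["quantity"]

print(chipo[["item_name", "quantity", "item_price", "unit_price"]])
```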

Ex1 - Filtering and Sorting Data - Price is ill defined

Steps 4 and 5 assume that each item has a unique price.

However,
chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)].item_price.unique()
returns
array([10.98, 11.25, 8.75, 8.49, 8.19, 10.58, 8.5 ])

The drop_duplicates call hides this, but it is still misleading because the dropped rows are not duplicates in price.

There are some mistakes

In
pandas_exercises-master\09_Time_Series\Getting_Financial_Data\Exercises_with_solutions_and_code

"Step 4" does not work.

Please add a license file

It would be nice if these exercises would have a license, so one knows under which conditions one can make use of them.

I don't have any particular license in mind myself, and of course that's not my call to make, though in the name of reducing license proliferation I would suggest using the same license as pandas itself: https://github.com/pandas-dev/pandas/blob/master/LICENSE .

Korean translation

Hello. I want to translate the repo into Korean for Korean learners.

I am a Korean student studying data science. I think this is a very useful resource for practicing pandas, so I want to share it with Korean learners.

I know it is open source under the BSD license, and I have already forked it, but I thought it better to notify you. Thanks for your work.

Wrong correction in 06_Stats/Wind_Stats/Step 8

Hi! Earlier you accepted the corrections from this issue by @maxim5.
But I think one of them is wrong:

  • The solution to step 8 is wrong: the mean value does not equal to the mean of means. The right solution is:
    data.fillna(0).values.flatten().mean()

This is wrong because filling NA values with 0 distorts the data. I think there is no reason to pick 0 or 5 or -100 to replace NA. The NA values should simply be skipped, just as they are in the rest of the project when functions like .mean() and .sum() are used; those skip NA values by default.

So the solution should be something like this:
data.sum().sum() / data.notna().sum().sum()
or this:
data.values.flatten()[~np.isnan(data.values.flatten())].mean()

However, I like your project and I learned a lot from it. Thank you!
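The competing step 8 answers can be compared on a small made-up frame in which the columns have different numbers of missing values. The data below is hypothetical; `np.nanmean` is used as one more NA-skipping option alongside the expressions from the discussion.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B is mostly missing.
data = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, np.nan, np.nan]})

# Mean of column means: each column weighs the same, whatever its NA count.
mean_of_means = data.mean().mean()

# Zero-filling treats missing readings as real zeros and pulls the mean down.
zero_filled = data.fillna(0).values.flatten().mean()

# Skipping NA: sum of the observed values divided by their count.
overall = data.sum().sum() / data.notna().sum().sum()
overall_np = np.nanmean(data.to_numpy())

print(mean_of_means, zero_filled, overall, overall_np)
```

The three approaches agree only when no values are missing; with missing values, only the last two give the mean of the values that were actually observed.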

Link Error 07_Visualization, Online_Retail

Links redirect to this address:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Visualization/Online_Retail/Online_Retail.csv

The working link right now is:

https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/07_Visualization/Online_Retail/Online_Retail.csv

Using nbgrader?

Great exercises, thank you so much. Not really an issue but a suggestion would be to use nbgrader to check the user's answers against the solution automatically? Here's a link

01_Getting_&_Knowing_Your_Data - Chipotle - Total Revenue Incorrect

In the Chipotle dataset, I believe the total revenue is incorrect.
The solution assumes that the quantity must be multiplied by the price to get the total:
(chipo['quantity'] * chipo['item_price']).sum()

However, I believe that the quantity is already included in the price, as can be seen by examining
chipo[chipo['item_name'] == '6 Pack Soft Drink']

I think the following is sufficient:
chipo['item_price'].sum()

02_Filtering_&_Sorting / Fictional_Army

According to the link below, the ix method is deprecated.

http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#deprecate-ix

Solutions for step 17 could be, for example (just what I came up with):
army.loc[['Arizona'], army.columns[3]]

and for step 18:
army.iloc[2, army.columns.tolist().index('deaths')]

Also, in step 3 the "Don't forget to include the columns names" is confusing since pd.DataFrame imports the structure fine without, just not in the order expected by the solutions of the following exercises.

Thanks for this great collection of exercises, it's a real treasure.
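A minimal sketch of replacing mixed label/position lookups (the main use of .ix) with .loc and .iloc follows. The frame is a made-up stand-in for the army data; its column order, values and index labels are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the army frame, indexed by origin.
army = pd.DataFrame(
    {
        "regiment": ["Nighthawks", "Dragoons", "Scouts"],
        "deaths": [523, 52, 25],
        "battles": [5, 42, 2],
    },
    index=["Arizona", "California", "Texas"],
)

# Row by label, column by position: turn the position into a label for .loc.
by_label = army.loc["Arizona", army.columns[1]]

# Row by position, column by label: turn the label into a position for .iloc.
by_position = army.iloc[2, army.columns.get_loc("deaths")]

print(by_label, by_position)
```

`Index.get_loc` does the same job as `columns.tolist().index(...)` without building an intermediate list.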

How to use the notebooks (exercise-only)

When I open the notebook in the browser (Chrome), it seems to render properly except that the cells are all grayed out. I was expecting that I would be able to type into them and run the code I write. Is that not what was intended? Thanks

Type_error in python 3.7


TypeError Traceback (most recent call last)
in ()
----> 1 prices= [float(value[1:-1]) for value in chipo.item_price]
2
3 #reassign the column values with the updated values
4 chipo.item_price = prices
5

in (.0)
----> 1 prices= [float(value[1:-1]) for value in chipo.item_price]
2
3 #reassign the column values with the updated values
4 chipo.item_price = prices
5

TypeError: 'float' object is not subscriptable
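The issue gives only the traceback, so the cause is a guess: one possibility is that item_price already held floats when the cell ran, for example because the conversion cell was executed a second time. Under that assumption, a conversion that is safe to re-run could look like the sketch below. `clean_prices` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical raw prices in the "$x.xx " text form the slicing code expects.
chipo = pd.DataFrame({"item_price": ["$2.39 ", "$16.98 "]})

def clean_prices(prices: pd.Series) -> pd.Series:
    """Convert '$x.xx ' strings to floats; leave numeric columns untouched."""
    if is_numeric_dtype(prices):
        return prices
    return prices.str.strip().str.lstrip("$").astype(float)

chipo["item_price"] = clean_prices(chipo["item_price"])
# Running the conversion again is now a no-op instead of raising TypeError.
chipo["item_price"] = clean_prices(chipo["item_price"])

print(chipo["item_price"].tolist())
```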

Missing third exercise in 04_Apply

"04_Apply/US_Crime_Rates/Exercises_with_solutions.ipynb" has some questions on the 'Chipotle' data, which is not related to the US Crime rate data. See the questions from 'Step 11'.

Maybe these questions should be part of "04_Apply/US_Crime_Rates/Chipotle".

Completing Exercises

Hi - I am new to this site and am attempting to complete the exercises; however, the cells where I'd enter code are not editable. Am I missing something? Can you please help me figure out why I cannot add answers to them?

Massive pull requests!!

I just fixed lots of typos and errors in the "Exercise_with_solutions" parts and some in the "Exercise" and "Solution" parts.

For the "Exercise" and "Solution" parts, I couldn't fix all of the links or titles so far.
This might be a little confusing for readers, so please grep for the URLs or titles and fix them.

For some parts, the commits are messy because I had to execute some cells and refresh the results.
For those parts, checking the fixes in Jupyter Notebook would be better than checking them on GitHub.

#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#49

Ex2 - Getting and Knowing your Data Step 10

Ex2 - Getting and Knowing your Data
Step 10. How many items were ordered?
The answer is the same as in step 9, but I think the right answer is chipo['quantity'].sum(). Am I misunderstanding the question?

Chipotle Exercises Step 16 Issue

You ask: 'What is the average amount per order?'

Your solution:
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['item_price']
output: 18.81

But if we're talking about the average amount per order, I assume that would mean the average revenue per order (quantity * price, which was computed in question 14):

order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['rev']
output: 21.39

Just a matter of semantics really. This practice set is awesome btw!

Ex1 - Getting and knowing your Data - WorldFoodFacts - Dataset Changed

The solutions to the steps need to be updated since the size of the dataset has changed.
For reference:

Given Solution Dataset info:

food.info() #Columns: 159 entries
(65503, 159)
159
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65503 entries, 0 to 65502
Columns: 159 entries, code to nutrition_score_uk_100g
dtypes: float64(103), object(56)
memory usage: 79.5+ MB

Current Dataset info:

>>> food.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB

Question: Is Step 9 solution correct?

Hello, thx for that nice repo!

Step 9. Calculate the mean Yellow Cards given per Team in 02_Filtering_&_Sorting Exercises_with_Solutions.ipynb

round(discipline['Yellow Cards'].mean())

I guess it's the overall average rather than the average grouped by team (discipline.groupby("Team").agg({"Yellow Cards":"mean"}))
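Whether the two readings differ depends on the shape of the data: if every team appears in exactly one row, the per-team mean is just each team's own value. The sketch below uses a made-up frame in which one team has two rows, purely to show the difference.

```python
import pandas as pd

# Hypothetical data with a repeated team; not the Euro12 dataset.
discipline = pd.DataFrame({
    "Team": ["Greece", "Greece", "Italy"],
    "Yellow Cards": [9, 7, 16],
})

# One number: the mean over all rows.
overall_mean = discipline["Yellow Cards"].mean()

# One number per team: the mean within each team's rows.
per_team_mean = discipline.groupby("Team")["Yellow Cards"].mean()

print(overall_mean)
print(per_team_mean)
```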

Alternative for 06_Stats/Wind_Stats/Step 12-14

Steps 12 to 14 ask to downsample the records to a yearly/monthly/weekly frequency for each location.
The provided solution looks like this:
data.groupby(data.index.to_period('A')).mean()

I think it would be simpler to use the resample function, as below:
data.resample('AS').mean()
data.resample('M').mean()
data.resample('W').mean()
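A minimal sketch comparing the two approaches at weekly frequency, on two made-up weeks of daily data (the column name `RPT` is an assumption). Some frequency aliases, such as 'M' and 'AS', have been renamed in newer pandas releases, so it is worth checking the offset-alias table for the installed version; 'W' is used here to sidestep that.

```python
import pandas as pd

# Hypothetical daily readings from Monday 2020-01-06 to Sunday 2020-01-19.
idx = pd.date_range("2020-01-06", periods=14, freq="D")
data = pd.DataFrame({"RPT": range(14)}, index=idx)

# resample: bins are labelled by the Sunday that ends each week.
weekly_resample = data.resample("W").mean()

# groupby on periods: same Monday-to-Sunday weeks, labelled as Period objects.
weekly_groupby = data.groupby(data.index.to_period("W")).mean()

print(weekly_resample)
print(weekly_groupby)
```

The values match; the difference is in the index (timestamps versus periods).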

02 Fictional Army Steps 9-12

I think the steps 9-12, dealing with slicing with loc, need to be reworked. Some solutions do not quite match the exercise text. For example, step 12 asks for columns 3-7 (five columns), but the solution retrieves columns 5-7 (three columns)

I've found these exercises great. Thanks for putting them together

Chipotle: Finding the most-ordered item

When I use the following code, the quantity for the most-ordered item is different:
chipo['item_name'].value_counts().head(1)
Out[48]:
Chicken Bowl 726
Name: item_name, dtype: int64

But when I try your method, I get a different value:
chipo.groupby('item_name').sum().sort_values(['quantity'],ascending =False).head(1)

order_id | quantity
713926 | 761

Could you please help me with this? An explanation would be very much appreciated.
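The two expressions measure different things: value_counts() counts rows (order lines) per item, while the groupby sums the quantity column, so a line with quantity 3 counts once in the first and three times in the second. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical order lines; the second line orders three bowls at once.
chipo = pd.DataFrame({
    "order_id": [1, 2, 3],
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Chips"],
    "quantity": [1, 3, 1],
})

# Number of order lines that mention each item.
lines_per_item = chipo["item_name"].value_counts()

# Number of units ordered of each item.
units_per_item = chipo.groupby("item_name")["quantity"].sum()

print(lines_per_item["Chicken Bowl"], units_per_item["Chicken Bowl"])
```

Summing every column, as in the original solution, also adds up order_id, which is why the output shows a large and meaningless order_id total next to the quantity.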

06_Stats/Wind_Stats/Step 8

Hi, thanks for the exercises provided. Very helpful.

I think the result for 06_Stats/Wind_Stats/Step 8 is incorrect.
I believe we should skip NA values when calculating the mean, and mean() already excludes NA by default.
So the following should be fine:
data.mean().mean()

Count not a general solution in Chipotle, 02_Filtering_&_Sorting

Step 8 proposes the following solution to count how many Veggie Salad Bowl there are.

chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]

len(chipo_salad)

However this doesn't seem like a general solution to the problem. Wouldn't it be better:

chipo[chipo.item_name == "Veggie Salad Bowl"].quantity.sum()

So that quantities get taken into account?

Facing issue while importing data

Since the address points to a file page in your repository and not to the raw data itself, the CSV import reads the HTML of the web page rather than the data.
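One way to sidestep this is to load from the raw.githubusercontent.com address instead of the repository page. The sketch below rewrites a GitHub file-page URL into its raw form; `to_raw_url` is a hypothetical helper, and the example page URL is an assumption reconstructed from the repository layout.

```python
def to_raw_url(page_url: str) -> str:
    """Rewrite a github.com '/blob/' file page URL to its raw-content URL.

    Hypothetical helper: URLs that do not match the pattern are returned as-is.
    """
    prefix = "https://github.com/"
    if not page_url.startswith(prefix) or "/blob/" not in page_url:
        return page_url
    user_repo, path = page_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{user_repo}/{path}"

page = (
    "https://github.com/guipsamora/pandas_exercises/blob/master/"
    "07_Visualization/Online_Retail/Online_Retail.csv"
)
raw = to_raw_url(page)
print(raw)
# The raw address can then be passed to pandas, e.g. pd.read_csv(raw).
```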

More efficient alternative in 04_Apply/Students_Alcohol_Consumption

In step 10, we want to multiply all numerical values by 10.

The provided solution is:
df.applymap(times10).head(10)

But this is very slow, because it runs a regular Python function on every element in the dataframe.

It is better to check each column's dtype and then use pandas' built-in multiplication on the whole column:

for colname, coltype in df.dtypes.to_dict().items():
    if coltype.name in ['int64']:
        df[colname] = df[colname] * 10

I used %%timeit to test the two solutions. On this small dataset, my solution is 5x as fast (1.1ms vs 5.8ms). The difference would get larger with a larger dataset.
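A variation on the same idea, offered as a sketch rather than as the exercise's solution: `select_dtypes` can pick out all numeric columns at once, which also covers float and other integer widths instead of only 'int64'. The frame below is made up.

```python
import pandas as pd

# Hypothetical slice of the student data: one text column, two numeric ones.
df = pd.DataFrame({
    "school": ["GP", "MS"],
    "age": [18, 17],
    "absences": [6, 4],
})

# Select every numeric column, then multiply them in one vectorized step.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols] * 10

print(df)
```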

New to GitHub

Hey,
I came across your pandas exercise. I am new to GitHub, how do I code in your pandas exercise and execute them? Right now they are greyed out and I am unable to edit.

Links to datasets in 05_Merge and 06_Stats incorrect
