datawranglingpy's Introduction

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.

For many students around the world, educational resources are hardly affordable. Therefore, I have decided that this book should remain an independent, non-profit, open-access project. You can read it at https://datawranglingpy.gagolewski.com/.

You can also order a paper copy.

Whilst, for some people, the presence of a "designer tag" from a major publisher might still be a proxy for quality, it is my hope that this publication will prove useful to those who seek knowledge for knowledge's sake.

Please spread the news about this project.

Consider citing this book as: Gagolewski M. (2024), Minimalist Data Wrangling with Python, Melbourne, DOI: 10.5281/zenodo.6451068, ISBN: 978-0-6455719-1-2, URL: https://datawranglingpy.gagolewski.com/.

Any remarks and bug fixes are appreciated. Please submit them via this repository's Issues tracker. Thank you.

About the Author

Dr habil. Marek Gagolewski is currently an Associate Professor at the Systems Research Institute of the Polish Academy of Sciences.

His research interests are related to data science, in particular: modelling complex phenomena, developing usable, general-purpose algorithms, studying their analytical properties, and finding out how people use, misuse, understand, and misunderstand methods of data analysis in research, commercial, and decision-making settings.

He's an author of 90+ publications, including journal papers in outlets such as Proceedings of the National Academy of Sciences (PNAS), Journal of Statistical Software, The R Journal, Information Fusion, International Journal of Forecasting, Statistical Modelling, Physica A: Statistical Mechanics and its Applications, Information Sciences, Knowledge-Based Systems, IEEE Transactions on Fuzzy Systems, and Journal of Informetrics.

In his "spare" time, he writes books for his students (check out Deep R Programming) and develops open-source software for data analysis, such as stringi (one of the most often downloaded R packages) and genieclust (a fast and robust hierarchical clustering algorithm in both Python and R).


Copyright (C) 2022–2024, Marek Gagolewski. Some rights reserved.

This material is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

datawranglingpy's People

Contributors

gagolews

datawranglingpy's Issues

8.1.3 - Small typo

There is a small typo in the final paragraph of the Important box.
It should be corrected to:

"Generally, for two matrices, their column/row numbers must match or be equal to 1. Also, if one operand is a one-dimensional array, it will be promoted to a row vector."

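The corrected rule can be checked with a small sketch. The arrays below are made up for illustration and are not taken from the book; the sketch only assumes NumPy's standard broadcasting behaviour.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
col = np.array([[10], [20]])       # shape (2, 1): row counts match, one column
row = np.array([100, 200, 300])    # shape (3,): one-dimensional

# The single column of `col` is repeated across the three columns of A.
sum_col = A + col
print(sum_col.tolist())            # [[11, 12, 13], [24, 25, 26]]

# The 1-D array is treated as a row vector of shape (1, 3) and repeated down the rows.
sum_row = A + row
print(sum_row.tolist())            # [[101, 202, 303], [104, 205, 306]]

# Column counts differ (3 vs 2) and neither equals 1, so NumPy refuses to broadcast.
try:
    A + np.array([[1, 2], [3, 4]])  # shapes (2, 3) and (2, 2)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)            # True
```
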
Exercise 8.3 and others: normalisation

I came across multiple instances where it was not clear what exactly was meant by normalisation.
For example, Exercise 8.3 asks for standardisation (calculating the z-score), normalisation(?), and min-max scaling (min-max normalisation). I assume that normalisation means mean normalisation. However, min-max scaling is a normalisation technique as well.
I'd recommend spelling out which kind of normalisation is required in the exercises, to prevent confusion.
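
To make the ambiguity concrete, here is a minimal sketch of the three transformations I have in mind. The sample data are made up, the use of the population standard deviation is my assumption, and I am not claiming that any of these is what the exercise intends.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample

mean = statistics.mean(x)      # 5.0
sd = statistics.pstdev(x)      # population standard deviation (assumed), 2.0
lo, hi = min(x), max(x)        # 2.0 and 9.0

# Standardisation (z-scores): zero mean, unit standard deviation.
standardised = [(v - mean) / sd for v in x]

# Mean normalisation: centred at zero, divided by the range.
mean_normalised = [(v - mean) / (hi - lo) for v in x]

# Min-max scaling: mapped linearly onto the interval [0, 1].
min_max_scaled = [(v - lo) / (hi - lo) for v in x]

print(standardised)            # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(min_max_scaled[0], min_max_scaled[-1])  # 0.0 1.0
```

All three centre or rescale the data, but they give different numbers, which is why naming the intended one would help.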

Missing figures

There are 4 missing figures in Chapter 9:
9.2, 9.4, 9.6, 9.8

misleading code comment

In 5.4.3 Slicing, the comment in this part of the very first code snippet

...
x[::-1]  # every second element
## array([50, 40, 30, 20, 10])
...

should say something like

...
x[::-1]  # every element in reverse order
## array([50, 40, 30, 20, 10])
...
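
For comparison, a short sketch of what the different step values do. It uses a plain Python list with the same values as an assumption; the book's snippet works on a NumPy array, whose basic slicing follows the same start:stop:step convention.

```python
x = [10, 20, 30, 40, 50]  # same values as in the snippet, as a plain list (assumption)

every_second = x[::2]            # step 2: every second element
in_reverse = x[::-1]             # step -1: every element, in reverse order
every_second_reversed = x[::-2]  # step -2: every second element, from the end

print(every_second)              # [10, 30, 50]
print(in_reverse)                # [50, 40, 30, 20, 10]
print(every_second_reversed)     # [50, 30, 10]
```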

Typo in 3.5.4. Modify in place or return a modified copy?

I think there is a typo in section 3.5.4. Modify in place or return a modified copy?:

The list.sorted method modifies the list it is applied on in place:

x = [5, 3, 2, 4, 1]
x.sort()  # modifies x in place and returns nothing

(i.e., it should be list.sort instead of list.sorted)
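
For reference, a quick sketch of the difference between the built-in sorted function and the list.sort method, which is probably where the mix-up comes from (lists have no sorted method):

```python
x = [5, 3, 2, 4, 1]

y = sorted(x)        # built-in function: returns a new sorted list, x is unchanged
print(x)             # [5, 3, 2, 4, 1]
print(y)             # [1, 2, 3, 4, 5]

result = x.sort()    # list method: sorts x in place and returns None
print(result)        # None
print(x)             # [1, 2, 3, 4, 5]

print(hasattr(x, "sorted"))  # False
```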

example 7.8 - command soon to be deprecated

A warning comes up that the command
cmap=cm.get_cmap("copper"), # colour map
will soon be deprecated.

Substituting it with
cmap=plt.colormaps.get_cmap("copper"), # colour map
as suggested by the warning text seems to give the same output.

backquoted operator names

In 5.4.3 Slicing,
it looks like the inline code *= has not been rendered correctly:

This did not modify the original vector, because we applied `*=` on a different object, which has not even been memorised after that operation took place.

Formatting error in section 1.2.3 point 5

Thank you for this free and easy to follow introduction to Data Wrangling in Python.

In section 1.2.3 under point 5:
Code is presented as normal text.
To make this code work, I needed to change the double quote characters and reformat the text:

import matplotlib.pyplot as plt  # basic plotting library
plt.bar(
    ['Python', 'JavaScript', 'HTML', 'CSS'],  # a list of strings
    [80, 30, 10, 15]  # a list of integers (the corresponding bar heights)
)
plt.title('What makes you happy?')
plt.show()

missing subject

In 6.3.2. Pareto Distribution I think the sentence

This time, however, will be interested in not what is typical, but ...

is IMHO missing the subject.
I would write:

This time, however, we will be interested in not what is typical, but ...

equation ch 4

Thanks for the great book. The equation between Figures 4.10 and 4.11 is displayed unrendered, as \[ \hat{F}_n(t) = \left\{ \begin{array}{ll} 0 & \text{for }t. I wanted to submit a PR, but couldn't find the book sources!
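
The raw LaTeX above is cut off, so the full formula cannot be recovered from this report. Assuming that \hat{F}_n denotes the usual empirical cumulative distribution function (my assumption, not confirmed here), the quantity it describes can be sketched as follows. The function name ecdf and the sample are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function t -> proportion of observations in `sample` that are <= t."""
    ordered = sorted(sample)
    n = len(ordered)

    def f_hat(t):
        # bisect_right counts how many of the sorted observations are <= t
        return bisect_right(ordered, t) / n

    return f_hat


f_hat = ecdf([3, 1, 2, 2])  # hypothetical sample
print(f_hat(0))             # 0.0  (below the smallest observation)
print(f_hat(2))             # 0.75 (three of four observations are <= 2)
print(f_hat(5))             # 1.0  (at or above the largest observation)
```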

minor typos in 7.1.4

In section 7.1.4, there are two minor typos:

Exercise 7.2
Are they* worth taking note of ...

Exercise 7.3
Using numpy.insert, add* a new row/column ...
