aeturrell / skimpy

371.0 371.0 18.0 3.99 MB

skimpy is a lightweight tool that provides summary statistics about variables in data frames within the console.

Home Page: https://aeturrell.github.io/skimpy/

License: Other

Python 98.62% Makefile 1.38%

data-science eda exploratory-data-analysis pandas statistics summary-statistics

skimpy's Issues

Truncated names should be longer (as a lot of empty space is present too)

It seems that skimpy truncates variable names to 20 characters. This seems unreasonable, as a lot of space is left unused (indicated with yellow squares). That space could be reclaimed to make room for longer names.

How can we control the maximum length of variable names?
Can this empty space be removed to allow longer names?
Can an ellipsis (the single character "…") be used to signal that a variable name was truncated?
There should be a way to identify ambiguous variables after the names are truncated (see the second figure, red squares).
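
The two requests above (an ellipsis marker and a way to spot ambiguous truncated names) can be sketched in a few lines. This is only an illustration, not skimpy's code: `truncate_name`, `ambiguous_names` and the `max_len` parameter are hypothetical names, and the default of 20 simply mirrors the limit reported in the issue.

```python
from collections import defaultdict

ELLIPSIS = "…"  # single character, as suggested in the issue


def truncate_name(name: str, max_len: int = 20) -> str:
    """Shorten name to at most max_len characters, marking the cut with an ellipsis.

    Hypothetical helper; the default of 20 mirrors the limit reported above.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(name) <= max_len:
        return name
    return name[: max_len - 1] + ELLIPSIS


def ambiguous_names(names, max_len=20):
    """Return {truncated: [originals]} for truncated names shared by several columns."""
    groups = defaultdict(list)
    for name in names:
        groups[truncate_name(name, max_len)].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


print(truncate_name("a_very_long_variable_name_indeed"))
print(ambiguous_names(["revenue_2020", "revenue_2021", "cost"], max_len=8))
```

Reserving one character for the ellipsis keeps the column width fixed, and the second helper gives a way to warn when two columns collapse to the same displayed name.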

Wrong number of NA rows in the output?

Hi,
first of all thank you for this great tool.

If I run skimpy on this 999-row CSV, it reports 1000 NA rows.

Thank you

IndexError: list index out of range

A Colab notebook, including the data needed to reproduce the error, is here:

https://github.com/Mjboothaus/Jupyter/blob/master/cleanup_beach_data.ipynb

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-d37235d13c7a> in <module>()
----> 1 skim(df)

/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

/usr/local/lib/python3.7/dist-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
    543         grid.add_row(sum_tab)
    544     # Weirdly, iteration over list of tabs misses last entry
--> 545     grid.add_row(list_of_tabs[-1])
    546     console.print(Panel(grid, title="skimpy summary", subtitle="End"))
    547 

IndexError: list index out of range

Error in _infer_datatypes due to 'cannot convert NA to integer'

Running skimpy.skim(df) raises an error:

    662     df = _delete_unsupported_columns(df)
    663     # Perform inference of datatypes
--> 664     df = _infer_datatypes(df)

/python3.9/site-packages/skimpy/__init__.py in _infer_datatypes(df)
    137             continue
    138         # There is no else statement here because logic should never get to this point.
--> 139         df[col[0]] = df[col[0]].astype(data_type)
    140     return df
    141

My data frame has many columns, so the message does not tell me which one to fix.

I also cleaned my data frame (b10_r) with the following, and I still get the error:

import pandas

# Collect columns whose inferred dtype is mixed or unknown
kols = []
for column in b10_r.columns:
    ty = pandas.api.types.infer_dtype(b10_r[column])
    print("{} - {}".format(column, ty))
    if ty in ["mixed-integer", "mixed", "mixed-integer-float", "unknown-array"]:
        kols.append(column)

# Drop those columns before calling skim
for k in kols:
    del b10_r[k]

`skim`: changing number of columns to be summarized

skim summarizes 20 columns by default. I couldn't find a way to change this default behaviour.

Inline histogram distorts the output of the layout

Uneven inline histogram bar widths distort the layout of the output:

This happens because the UTF-8 block symbols that form the histogram have different widths. I noticed that in R's skimr, the 4th and 8th symbols (the narrowest) are excluded in some cases:

https://github.com/ropensci/skimr/blob/d5126aa020e703f37740af7ee56a4acb5830fd08/R/stats.R#L133-L136

My question:

Can an option be added to remove the histogram? Instead, an option to include the median, which is missing, could be added.
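
To make the width problem concrete, here is a rough sketch of an inline histogram built from the Unicode block elements U+2581 to U+2588, with the 4th and 8th characters left out as the issue describes for skimr. It is not skimpy's or skimr's implementation; `BARS` and `inline_hist` are hypothetical names, and whether dropping those two characters evens out the widths depends on the terminal font.

```python
# Block elements U+2581..U+2588 are "▁▂▃▄▅▆▇█". The 4th (▄) and 8th (█)
# are omitted here, following the exclusion described in the issue.
BARS = "▁▂▃▅▆▇"


def inline_hist(counts):
    """Map each bin count to one block character (hypothetical helper)."""
    if not counts:
        return ""
    top = max(counts)
    if top == 0:
        # All bins empty: draw the lowest bar everywhere.
        return BARS[0] * len(counts)
    scale = len(BARS) - 1
    return "".join(BARS[round(c / top * scale)] for c in counts)


print(inline_hist([0, 4, 10, 4, 0]))
```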

add timedelta to the generated test data

A lot of Jupyter dependencies

This tool looks very useful. However, when I try installing it into my kernel environment, it pulls in a lot of dependencies, including Jupyter and all the associated server dependencies. Perhaps these should be dev dependencies? I can't see where they are used otherwise. I can't see where ipykernel is used either (initially I thought you might need to import from IPython.display).

bash$ poetry add git+https://github.com/aeturrell/skimpy.git

Updating dependencies
Resolving dependencies... (7.9s)

Package operations: 34 installs, 0 updates, 0 removals

• Installing types-python-dateutil (2.8.19.20240106)
• Installing arrow (1.3.0)
• Installing fqdn (1.5.1)
• Installing isoduration (20.11.0)
• Installing jsonpointer (2.4)
• Installing rfc3339-validator (0.1.4)
• Installing rfc3986-validator (0.1.1)
• Installing uri-template (1.3.0)
• Installing webcolors (1.13)
• Installing argon2-cffi-bindings (21.2.0)
• Installing python-json-logger (2.0.7)
• Installing terminado (0.18.0)
• Installing anyio (4.2.0)
• Installing argon2-cffi (23.1.0)
• Installing jupyter-events (0.9.0)
• Installing jupyter-server-terminals (0.5.1)
• Installing overrides (7.4.0)
• Installing send2trash (1.8.2)
• Installing websocket-client (1.7.0)
• Installing babel (2.14.0)
• Installing json5 (0.9.14)
• Installing jupyter-server (2.12.4)
• Installing async-lru (2.0.4)
• Installing jupyter-lsp (2.2.1)
• Installing jupyterlab-server (2.25.2)
• Installing notebook-shim (0.2.3)
• Installing jupyterlab (4.0.10)
• Installing qtpy (2.4.1)
• Installing jupyter-console (6.6.3)
• Installing notebook (7.0.6)
• Installing qtconsole (5.5.1)
• Installing jupyter (1.0.0)
• Installing typeguard (4.1.5)
• Installing skimpy (0.0.11 556aff6)

Writing lock file

Remove Jupyter book dependency

adding support for datetime.date object types

Hi,

The package is super useful. However, support for some key data types frequently used with pandas seems to be missing.
It would be great if you could add support for datetime.date, datetime.month, datetime.year and so on.

For example, it supports datetime64, but if one wants to keep only the date part:
dt['date'] = dt['datetime'].dt.date

it gives the error "data type 'date' not understood".

Thank you

Add export data to features / quick start

A kwarg in skim function

Explore switching docs to Quartodoc

https://github.com/machow/quartodoc

Main advantage is to remove hackyness of current solution.

Add doc tests and examples

Use format as in: https://github.com/Erotemic/xdoctest

Round numbers to sensible number of significant figures

Use something like

i = 32.1123
print(f'{float(f"{i:.2g}"):g}')
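
The one-liner above can be wrapped in a small function so the number of significant figures is a parameter. This is a sketch only; `round_sig` and its `sig` argument are hypothetical names, not part of skimpy.

```python
def round_sig(x: float, sig: int = 2) -> str:
    """Format x with `sig` significant figures, then drop exponent noise via :g.

    Hypothetical helper built on the expression suggested in the issue.
    """
    # Inner f-string rounds to `sig` significant figures (may use exponent form);
    # converting back to float and formatting with :g gives a compact string.
    return f"{float(f'{x:.{sig}g}'):g}"


print(round_sig(32.1123))
print(round_sig(0.00123456))
print(round_sig(2014.8, sig=3))
```

Note that `:g` switches back to exponent notation for large magnitudes, so very large values would still be printed in scientific form.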

Reports not properly generated with a single dataframe column

MRE:

import pandas as pd
from numpy.random import Generator, PCG64
from skimpy import skim

seed = 34729
rng = Generator(PCG64(seed))
len_df = 1000
df = pd.DataFrame()
df["length"] = rng.beta(0.5, 0.5, size=len_df)
skim(df)

Discrepancy between pandas describe results and those provided by skimpy

Firstly, I would like to thank you for the library. Its output, in addition to being extremely important in descriptive analysis, is also beautiful to see. Before reporting the bug, I would like to make a suggestion that could perhaps be integrated into skimpy:

Display unique values (of object/string types) from pandas (equivalent to .value_counts()).

The dataset I used is on Kaggle: https://www.kaggle.com/competitions/autoam-car-price-prediction

Now the discrepancy I found. In the attached photo, the first column is year, and the values for mean, p0, p25 and others (highlighted in yellow) differ from the output of pandas.describe(), which is also shown in the figure. The mean of the 'year' column is 2014.80, but skimpy reports 2000. The minimum value of the same variable is 1987, but skimpy again reports 2000, and the other statistics shown in the figure differ as well.

Once again, congratulations on the library.

Column name colour - how can we change / customise?

Hi there, great package! I'm wondering if there is an easy way to change the colour used for the column names in the output. The current default is pink; unfortunately, I have a grey terminal background, so the pink foreground font is pretty much impossible to see.

Thanks

Skimpy ignores other data types except for Float and Integer

Hi,

I tried skimpy for the first time today and it seems I found a bug. I ran skimpy on my sample data frame and it only returned the summary for the float and integer columns, while the others were ignored.

This is the sample code:

from skimpy import skim
import datetime
import pandas as pd

data = ([datetime.datetime(2021, 1, 1), None, 'as', 6],
        [datetime.datetime(2021, 1, 2), 5.2, 'asd', 7],
        [None, 6.3, 'adasda', 8])

df = pd.DataFrame(data, columns=['date', 'float', 'string', 'integer'])

skim(df)

This is the result I got:

Please make it available on Polars as well.

First of all, thank you for creating such a wonderful package.

I was able to quickly understand the characteristics of the data using skim in R, and thank you for making it possible in Python as well.

Polars, the DataFrame package, has been growing rapidly in popularity recently.
You can already use the skim function with Polars by calling to_pandas() first.
However, it would be better if Polars were supported directly in pyskim.
Also, pandas has been updated to version 2.x, but installing pyskim downgrades pandas. It would be nice if the pandas dependency were also updated to support 2.x.

Thank you

Add citation

And an option to submit notification of use.

Support for polars

Polars is an increasingly popular data frame package. Although Polars users can currently convert to pandas to run skimpy, would it be better if support were native?

Consider Pandas 2.0+ support?

Hi there. Pandas 2 came out a few months back and your installation dependency is at pandas ^1.3.2. Would you consider checking for Pandas 2 compatibility?
I'm going to mention skimpy in my newsletter (to 1,600 data scientists). I know that a bunch have upgraded to Pandas 2 already (given recent conference talks I've given on Pandas 2 and Polars), so hopefully that would open the door to a new base of users for you.
ydata-profiling (née pandas-profiling) just added Pandas 2 support too: https://github.com/ydataai/ydata-profiling/releases/tag/v4.3.0
Cheers Ian.

A neat option to export to a well-formatted table for onward inclusion in reports and figures

Practically, this will need to be something like a JSON given the structure of the results table.

Citation

I'm using skimpy in a project and would love to have details for a Bibtex reference, thank you!

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

From my python (v 3.7) dev vm at work I pull data from Vertica (mysql) into a pandas df, and I get what smells like a dependency issue. If this is based on a pandas dependency, is it possible to use skimpy with a different version of pandas through some older version of skimpy?

I run:
skim(df)

I get the issue:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_xyz/myscript.py in
----> 1 skim(shipments)
2 # shipments.describe()

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
527 xf = df.select_dtypes(col_type)
528 if not xf.empty:
--> 529 sum_df = summary_func(xf)
530 list_of_tabs.append(
531 dataframe_to_rich_table(

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in numeric_variable_summary_table(xf)
306 data_dict = {
307 "missing": count_nans_vec,
--> 308 "complete rate": 1 - count_nans_vec / xf.shape[0],
309 NUM_COL_MEAN: xf.mean(),
310 "sd": xf.std(),

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
63 break
64 if isinstance(other, cls):
---> 65 return NotImplemented
66
67 other = item_from_zerodim(other)

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/arraylike.py in truediv(self, other)
111 def rmul(self, other):
112 return self._arith_method(other, roperator.rmul)
--> 113
114 @unpack_zerodim_and_defer("truediv")
115 def truediv(self, other):

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/series.py in _arith_method(self, other, op)
4996 0 True
4997 1 True
-> 4998 2 True
4999 3 False
5000 4 True

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op)
187 Evaluate an arithmetic operation +, -, *, /, //, %, **, ...
188
--> 189 Note: the caller is responsible for ensuring that numpy warnings are
190 suppressed (with np.errstate(all="ignore")) if needed.
191

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in _na_arithmetic_op(left, right, op, is_cmp)
137
138 def _na_arithmetic_op(left, right, op, is_cmp: bool = False):
--> 139 """
140 Return the result of evaluating op on the passed in values.
141

~/.venv/asdflib/python3.7/site-packages/pandas/core/computation/expressions.py in
17 from pandas._typing import FuncType
18
---> 19 from pandas.core.computation.check import NUMEXPR_INSTALLED
20 from pandas.core.ops import roperator
21

~/.venv/data_analyses/lib/python3.7/site-packages/pandas/core/computation/check.py in
1 from pandas.compat._optional import import_optional_dependency
2
----> 3 ne = import_optional_dependency("numexpr", errors="warn")
4 NUMEXPR_INSTALLED = ne is not None
5 if NUMEXPR_INSTALLED:

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

Be able to handle time delta

Currently, this data type is converted to strings.

import pandas as pd
from skimpy import skim

df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)
skim(df_check)

This should produce a table with a time difference section.

skim raises exception with multiindexes

The culprit appears to be _infer_datatypes.

skimpy/src/skimpy/__init__.py

Line 95 in ad48d11

def _infer_datatypes(df: pd.DataFrame) -> pd.DataFrame:

The workaround appears to be replacing the above function with pandas' built-in infer_objects method.

Bug in string word counts?

skimpy/src/skimpy/__init__.py

Line 414 in 910eb80

xf[xf.columns[0]].str.count(" ").add(1).sum()

Noticed some weird behavior with the word counts in skimpy output - should this be using col to subset xf rather than xf.columns[0]?
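
If the reporter's reading is right, indexing with `xf.columns[0]` would give every text column the word count of the first one. The sketch below reproduces that effect in plain Python so it runs without pandas; the dictionary of lists is a stand-in for a frame of string columns, and `word_count` is a hypothetical helper using the same arithmetic as `.str.count(" ").add(1).sum()`.

```python
# Stand-in for a data frame of string columns: column name -> list of values.
columns = {
    "title": ["a b", "c"],
    "body": ["one two three", "four five"],
}


def word_count(values):
    """Spaces + 1 per row, summed: same arithmetic as the quoted pandas expression."""
    return sum(v.count(" ") + 1 for v in values)


first = next(iter(columns))

# Always reading the first column (the suspected bug) ...
buggy = {col: word_count(columns[first]) for col in columns}
# ... versus subsetting by the column being summarised.
fixed = {col: word_count(columns[col]) for col in columns}

print(buggy)
print(fixed)
```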

Have an export to pandas option

This would mostly be straightforward.

For examples of how to do the charts within a pandas dataframe, see https://twitter.com/jonathanrlarkin/status/1503591106939867137?s=11.

codecoverage stats are not appearing

Error for .clean_columns with polars

In the changelog I read:
v0.0.11
adding polars support

I am now encountering an error when using .clean_columns().
Am I doing something wrong, or is .clean_columns not supported for Polars?
Thank you!

import polars as pl
import skimpy  # v0.0.14

df = pl.DataFrame({
    "Column Name 1": [1, 2, 3],
    "ANOTHER Column": [4, 5, 6],
    "More DATA": [7, 8, 9]
})
skimpy.clean_columns(df)

---> skimpy.clean_columns(df)
TypeCheckError: argument "df" (polars.dataframe.frame.DataFrame) is not an instance of pandas.core.frame.DataFrame

Suppress the '{sum(cleaned)} column names have been cleaned' message

Could you please add a parameter to suppress the '{sum(cleaned)} column names have been cleaned' message?
I would really appreciate it!
Thank you
Regards

Make output friendly to Quarto documents when there is any R code being executed in the .qmd file too

It would be helpful to have a quarto-friendly output option, so that tables generated from skim render in markdown instead of rendering as code-like objects.

For instance, a file like this, with a python skim(df) statement and an R skim(df) statement

(you'll have to add the qmd extension, github won't let me upload a qmd file)
test.qmd

renders as

Thank you for making this package, btw - it has made it much easier to teach my students R and python simultaneously when there are so many packages that have parallel functions and syntax between them.

Remove decimals and trailing zeros on whole numbers

It would look better to remove decimals and trailing zeros on whole numbers. Something like s.rstrip('0').rstrip('.') if '.' in s else s could work.
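
As a small illustration of the expression suggested above, wrapped in a function: `strip_trailing_zeros` is a hypothetical name, and the sketch assumes the input is a fixed-point string (for example one produced with `:.2f`), since exponent notation would be mangled by the right-strip.

```python
def strip_trailing_zeros(s: str) -> str:
    """Drop trailing zeros and a dangling decimal point from a fixed-point string.

    Hypothetical helper; not meant for strings in exponent notation.
    """
    return s.rstrip("0").rstrip(".") if "." in s else s


for value in (2000.0, 3.10, 0.0):
    print(strip_trailing_zeros(f"{value:.2f}"))
print(strip_trailing_zeros("100"))
```

The `"." in s` guard matters: without it, an integer string such as "100" would lose its zeros.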

Skewness & kurtosis?

Hey Arthur!
Are you planning to add skewness & kurtosis to the summary stats?
Thanks!
Pedro

Words per row and word count are wrong when there are multiple text columns

This is the result I got for the Titanic dataset, and it looks wrong.

Coverage github action is broken

Error message below. Likely related to this section of tests.yml github action, as it is using upload artifact v3:

      - name: Upload coverage data
        if: always() && matrix.session == 'tests'
        uses: "actions/upload-artifact@v3"
        with:
          name: coverage-data
          path: ".coverage.*"

Run actions/download-artifact@v4
  with:
    name: coverage-data
    merge-multiple: false
    repository: aeturrell/skimpy
    run-id: 9093669457
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.9.19/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.19/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.19/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.19/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.19/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.19/x64/lib
Downloading single artifact
Error: Unable to download artifact(s): Artifact not found for name: coverage-data
        Please ensure that your artifact is not expired and the artifact was uploaded using a compatible version of toolkit/upload-artifact.
        For more information, visit the GitHub Artifacts FAQ: https://github.com/actions/toolkit/blob/main/packages/artifact/docs/faq.md

Broken Contributing Link

The link at the top of the home page points to contributing.html, but the page is called CONTRIBUTING.html, hence the link is broken.

Skim output is not able to be recorded or exported to html or svg

I have tried to export skim results. I tried to record the output by creating Console(record=True) before calling the function; however, I got a NoneType object. I expected an object that could be exported to HTML or SVG to share the results.
I also tested the Console.capture() method, with the same behaviour. Did I do something wrong?

from rich.console import Console
from skimpy import skim, generate_test_data

console = Console(record=True)
df = generate_test_data()
skim(df)
console.save_html("demo.html")

The resulting demo.html is empty. Thanks for your support.
Regards

Error importing 'Any' from 'typing_extensions' in Databricks environment

Example Description and Error
This script demonstrates how to use the skimpy library to perform a statistical summary of a dataset using Pandas in Python. The dataset used is the "Adult" dataset from the UCI Machine Learning repository, which contains demographic and employment information of adults.

The error occurs when attempting to import the Any class from the typing_extensions module, and is specifically encountered in the Databricks environment. This may be caused by an incompatibility between the Python version and the dependencies of the skimpy library.

Script:
import pandas as pd
from skimpy import skim

# Read data from CSV file
csv_path = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(csv_path, sep=",", header=None)

# Create headers list
headers = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
           "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "makes"]
df.columns = headers

df.head()

skim(df)

Error on Databricks:
ImportError: cannot import name 'Any' from 'typing_extensions' (/databricks/python/lib/python3.10/site-packages/typing_extensions.py)

How to resolve the ImportError when importing 'Any' from 'typing_extensions' in a Databricks environment?
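
One way to narrow this down is to check which typing_extensions the environment actually provides and whether it exposes `Any`. The sketch below uses only the standard library; `describe_typing_extensions` is a hypothetical helper, and the idea that the cluster ships an older typing_extensions than skimpy's dependencies expect is an assumption, not something confirmed in the report.

```python
import importlib.util
from importlib import metadata


def describe_typing_extensions():
    """Report whether typing_extensions is importable, its version, and if it has Any."""
    if importlib.util.find_spec("typing_extensions") is None:
        return {"installed": False, "version": None, "has_any": False}

    import typing_extensions

    try:
        version = metadata.version("typing_extensions")
    except metadata.PackageNotFoundError:
        # The module can be importable without distribution metadata.
        version = None
    return {
        "installed": True,
        "version": version,
        "has_any": hasattr(typing_extensions, "Any"),
    }


print(describe_typing_extensions())
```

If `has_any` comes back False, the installed copy predates the addition of `Any`, and upgrading typing_extensions in the cluster environment would be the first thing to try.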

Add tests for compilation of docs

Explore jupyter notebook for readme.rst generation

jupyter nbconvert --to rst README.ipynb
