
skimpy's People

Contributors

aeturrell, dependabot[bot], galenseilis, rumiallbert

skimpy's Issues

Truncated names should be longer (as a lot of empty space is present too)

skimpy appears to truncate variable names to 20 characters. This seems unnecessary, because the output contains a lot of unused empty space (marked with yellow squares in the first figure). Reclaiming that space would leave room for longer names.

image

image

  1. How can we control the maximum length of variable names?
  2. Can this empty space be used to show longer names?
  3. Can an ellipsis (the single character "…") be used to signal that a variable name was truncated?
  4. There should be a way to distinguish variables whose names become identical after truncation (see the second figure, red squares).
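
Points 3 and 4 could be handled along the following lines. This is only a sketch in plain Python, not skimpy's actual code: `truncate_name`, `truncate_unique` and the `~1`, `~2` suffix convention are hypothetical names and choices made up for illustration.

```python
from __future__ import annotations

from collections import Counter

ELLIPSIS = "…"  # a single character, so it costs one column of width


def truncate_name(name: str, max_len: int) -> str:
    """Shorten name to at most max_len characters, ending in an ellipsis if cut."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(name) <= max_len:
        return name
    return name[: max_len - 1] + ELLIPSIS


def truncate_unique(names: list[str], max_len: int) -> list[str]:
    """Truncate names and number any that collide after truncation."""
    shortened = [truncate_name(n, max_len) for n in names]
    counts = Counter(shortened)
    seen: Counter = Counter()
    result = []
    for short in shortened:
        if counts[short] == 1:
            result.append(short)
            continue
        seen[short] += 1
        suffix = f"~{seen[short]}"
        # Re-truncate to leave room for the suffix so the width limit still holds.
        stem = truncate_name(short.rstrip(ELLIPSIS), max_len - len(suffix))
        result.append(stem + suffix)
    return result


columns = ["customer_identifier_home", "customer_identifier_work", "age"]
print(truncate_unique(columns, 20))
```

The numbered suffix keeps every displayed name within the width limit while still letting the reader tell the truncated columns apart.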

IndexError: list index out of range

A Colab notebook, including the data needed to reproduce the error, is here:

https://github.com/Mjboothaus/Jupyter/blob/master/cleanup_beach_data.ipynb

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-d37235d13c7a> in <module>()
----> 1 skim(df)

/usr/local/lib/python3.7/dist-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

/usr/local/lib/python3.7/dist-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
    543         grid.add_row(sum_tab)
    544     # Weirdly, iteration over list of tabs misses last entry
--> 545     grid.add_row(list_of_tabs[-1])
    546     console.print(Panel(grid, title="skimpy summary", subtitle="End"))
    547 

IndexError: list index out of range
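
The traceback shows `list_of_tabs[-1]` raising, which can only happen when the list is empty. Why it is empty is not visible here; one plausible assumption is that none of the columns matched a supported type. A defensive guard might look like the sketch below, where `last_or_none` is a hypothetical helper and `list_of_tabs` is just a stand-in for the list built inside skim().

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def last_or_none(items: List[T]) -> Optional[T]:
    """Return the final element, or None when the list is empty."""
    return items[-1] if items else None


# Hypothetical stand-in for the list of per-type tables built inside skim().
list_of_tabs: List[str] = []

last_tab = last_or_none(list_of_tabs)
if last_tab is None:
    # Report the situation instead of failing with an IndexError.
    message = "No columns of a supported type were found; nothing to summarise."
else:
    message = f"Adding final table: {last_tab}"
print(message)
```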

Error on __infer_datatypes due to 'cannot convert NA to integer'

Running skimpy.skim(df) raises an error:

    662     df = _delete_unsupported_columns(df)
    663     # Perform inference of datatypes
--> 664     df = _infer_datatypes(df)

/python3.9/site-packages/skimpy/__init__.py in _infer_datatypes(df)
    137             continue
    138         # There is no else statement here because logic should never get to this point.
--> 139         df[col[0]] = df[col[0]].astype(data_type)
    140     return df
    141 

I have a lot of columns, and the message does not say which one is at fault or how to fix it.

I also cleaned my df (b10_r) with the following, and I still get the error:

import pandas

# Collect the columns whose inferred dtype is mixed or unknown
kols = []
for column in b10_r.columns:
    ty = pandas.api.types.infer_dtype(b10_r[column])
    print("{} - {}".format(column, ty))
    if ty in ["mixed-integer", "mixed", "mixed-integer-float", "unknown-array"]:
        kols.append(column)

# Drop those columns from the DataFrame
for k in kols:
    del b10_r[k]

but the error persists.

Inline histogram distorts the output of the layout

Uneven inline histogram bar widths distort the layout of the output:

image

This happens because the Unicode block characters that form the histogram have different widths. I noticed that R's skimr excludes the 4th and 8th (the narrowest) symbols in some cases:

https://github.com/ropensci/skimr/blob/d5126aa020e703f37740af7ee56a4acb5830fd08/R/stats.R#L133-L136

My question:

  • Can an option be added to remove the histogram? Instead, an option to include the median, which is missing, could be added.
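
To make the request concrete, here is a rough sketch of an inline histogram that skips the 4th and 8th block symbols, plus a median computed with the standard library. `inline_hist`, `FULL_SET` and `REDUCED_SET` are hypothetical names, and nothing here reflects how skimpy actually builds its histogram.

```python
import statistics
from typing import Sequence

# The eight block heights; the 4th and 8th are the ones the issue calls narrow.
FULL_SET = "▁▂▃▄▅▆▇█"
# Assumption: drop the 4th and 8th symbols, as skimr is reported to do.
REDUCED_SET = "▁▂▃▅▆▇"


def inline_hist(values: Sequence[float], bins: int = 6, symbols: str = REDUCED_SET) -> str:
    """Render a one-line histogram using the given block symbols."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single full-height bar.
        return symbols[-1]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to an index into the symbol set.
    return "".join(symbols[round(c / peak * (len(symbols) - 1))] for c in counts)


data = [1, 1, 1, 2, 3, 3, 9]
print(inline_hist(data, bins=4))
print("median:", statistics.median(data))
```

Passing `symbols=FULL_SET` restores all eight heights, so the choice could be exposed as an option.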

A lot of Jupyter dependencies

This tool looks very useful. However, when I try installing it into my kernel environment, it pulls in a lot of dependencies, including Jupyter and all the associated server dependencies. Perhaps these should be dev dependencies? I can't see where they are used otherwise. I can't see where ipykernel is used either (initially I thought you might need to import from IPython.display).

bash$ poetry add git+https://github.com/aeturrell/skimpy.git

Updating dependencies
Resolving dependencies... (7.9s)

Package operations: 34 installs, 0 updates, 0 removals

• Installing types-python-dateutil (2.8.19.20240106)
• Installing arrow (1.3.0)
• Installing fqdn (1.5.1)
• Installing isoduration (20.11.0)
• Installing jsonpointer (2.4)
• Installing rfc3339-validator (0.1.4)
• Installing rfc3986-validator (0.1.1)
• Installing uri-template (1.3.0)
• Installing webcolors (1.13)
• Installing argon2-cffi-bindings (21.2.0)
• Installing python-json-logger (2.0.7)
• Installing terminado (0.18.0)
• Installing anyio (4.2.0)
• Installing argon2-cffi (23.1.0)
• Installing jupyter-events (0.9.0)
• Installing jupyter-server-terminals (0.5.1)

• Installing overrides (7.4.0)
• Installing send2trash (1.8.2)
• Installing websocket-client (1.7.0)
• Installing babel (2.14.0)
• Installing json5 (0.9.14)
• Installing jupyter-server (2.12.4)
• Installing async-lru (2.0.4)
• Installing jupyter-lsp (2.2.1)
• Installing jupyterlab-server (2.25.2)

• Installing notebook-shim (0.2.3)
• Installing jupyterlab (4.0.10)
• Installing qtpy (2.4.1)
• Installing jupyter-console (6.6.3)
• Installing notebook (7.0.6)

• Installing qtconsole (5.5.1)
• Installing jupyter (1.0.0)
• Installing typeguard (4.1.5)
• Installing skimpy (0.0.11 556aff6)

Writing lock file

adding support for datetime.date object types

Hi,

The package is super useful. However, support for some key data types frequently used with pandas seems to be missing.
It would be great if you could add support for datetime.date, datetime.month, datetime.year and so on.

For example, it supports datetime64, but if one wants to keep only the date part:
dt['date'] = dt['datetime'].dt.date

it gives the error "data type 'date' not understood".

Thank you

Discrepancy between pandas describe results and those provided by skimpy

Firstly, I would like to thank you for the library. Its output, besides being extremely useful for descriptive analysis, is also beautiful to look at. Before reporting the bug, I would like to make a suggestion that could perhaps be integrated into skimpy.

  1. Display the unique values of object/string columns (equivalent to pandas .value_counts()).

divergence_skimpy

The dataset I used is on Kaggle: https://www.kaggle.com/competitions/autoam-car-price-prediction

Here is the discrepancy I found. In the attached image, the values highlighted in yellow for the first column, year (mean, p0, p25 and others), differ from the output of pandas.describe(), which is also shown in the figure. The mean of the 'year' column is 2014.80, but skimpy reports 2000. The minimum of the same variable is 1987, but skimpy again reports 2000. The other values highlighted in the figure differ as well.

Once again, congratulations on the library.

Column name colour - how can we change / customise?

Hi there, great package! Is there an easy way to change the colour used for the column names in the output? The current default is pink, and unfortunately I have a grey terminal background, so the pink foreground text is almost impossible to see.

Thanks

Skimpy ignores other data types except for Float and Integer

Hi,

I tried Skimpy for the first time today and I seem to have found a bug. I ran Skimpy on my sample dataframe and it only returned the summary for the Float and Integer columns; the others were ignored.

This is the sample code:

from skimpy import skim
import datetime
import pandas as pd

data = ([datetime.datetime(2021, 1, 1), None, 'as', 6],
        [datetime.datetime(2021, 1, 2), 5.2, 'asd', 7],
        [None, 6.3, 'adasda', 8])

df = pd.DataFrame(data, columns=['date', 'float', 'string', 'integer'])

skim(df)

This is the result I got:
image

Please make it available on Polars as well.

First of all, thank you for creating such a wonderful package.

I used to rely on skim in R to quickly understand the characteristics of my data, so thank you for making that possible in Python as well.

Polars, a DataFrame package, has been growing rapidly in popularity recently.
You can already use the skim function with Polars by calling to_pandas() first.
However, it would be better if Polars were supported directly in skimpy.
Also, pandas has been updated to version 2.x, but installing skimpy downgrades the pandas version. It would be nice if skimpy were updated to support pandas 2.x as well.

Thank you

Support for polars

Polars is an increasingly popular data frame package. Although Polars users can currently convert to pandas to run Skimpy, would native support be better?

Consider Pandas 2.0+ support?

Hi there. Pandas 2 came out a few months back and your installation dependency is at pandas ^1.3.2. Would you consider checking for Pandas 2 compatibility?
I'm going to mention skimpy in my newsletter (to 1,600 data scientists). I know that a bunch have upgraded to Pandas 2 already (given recent conference talks I've given on Pandas 2 and Polars), so hopefully that'd open the door to a new base of users for you.
ydata-profiling (née pandas-profiling) just added Pandas 2 support too: https://github.com/ydataai/ydata-profiling/releases/tag/v4.3.0
Cheers Ian.

Citation

I'm using skimpy in a project and would love to have details for a Bibtex reference, thank you!

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

On my Python (v3.7) dev VM at work, I pull data from Vertica (mysql) into a pandas df and then hit what smells like a dependency issue. If this is caused by a pandas dependency, is it possible to use an older version of skimpy with a different version of pandas?

I run:
skim(df)

I get the issue:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_xyz/myscript.py in
----> 1 skim(shipments)
2 # shipments.describe()

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in skim(df, header_style, **colour_kwargs)
527 xf = df.select_dtypes(col_type)
528 if not xf.empty:
--> 529 sum_df = summary_func(xf)
530 list_of_tabs.append(
531 dataframe_to_rich_table(

~/.venv/asdf/lib/python3.7/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
1031 memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
1032 check_argument_types(memo)
-> 1033 retval = func(*args, **kwargs)
1034 try:
1035 check_return_type(retval, memo)

~/.venv/asdf/lib/python3.7/site-packages/skimpy/__init__.py in numeric_variable_summary_table(xf)
306 data_dict = {
307 "missing": count_nans_vec,
--> 308 "complete rate": 1 - count_nans_vec / xf.shape[0],
309 NUM_COL_MEAN: xf.mean(),
310 "sd": xf.std(),

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
63 break
64 if isinstance(other, cls):
---> 65 return NotImplemented
66
67 other = item_from_zerodim(other)

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/arraylike.py in truediv(self, other)
111 def rmul(self, other):
112 return self._arith_method(other, roperator.rmul)
--> 113
114 @unpack_zerodim_and_defer("truediv")
115 def truediv(self, other):

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/series.py in _arith_method(self, other, op)
4996 0 True
4997 1 True
-> 4998 2 True
4999 3 False
5000 4 True

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op)
187 Evaluate an arithmetic operation +, -, *, /, //, %, **, ...
188
--> 189 Note: the caller is responsible for ensuring that numpy warnings are
190 suppressed (with np.errstate(all="ignore")) if needed.
191

~/.venv/asdf/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in _na_arithmetic_op(left, right, op, is_cmp)
137
138 def _na_arithmetic_op(left, right, op, is_cmp: bool = False):
--> 139 """
140 Return the result of evaluating op on the passed in values.
141

~/.venv/asdflib/python3.7/site-packages/pandas/core/computation/expressions.py in
17 from pandas._typing import FuncType
18
---> 19 from pandas.core.computation.check import NUMEXPR_INSTALLED
20 from pandas.core.ops import roperator
21

~/.venv/data_analyses/lib/python3.7/site-packages/pandas/core/computation/check.py in
1 from pandas.compat._optional import import_optional_dependency
2
----> 3 ne = import_optional_dependency("numexpr", errors="warn")
4 NUMEXPR_INSTALLED = ne is not None
5 if NUMEXPR_INSTALLED:

TypeError: import_optional_dependency() got an unexpected keyword argument 'errors'

Be able to handle time delta

Currently, this data type is converted to strings.

For example:

import pandas as pd
from skimpy import skim

# Two columns: one of timedeltas, one of strings
df_check = pd.DataFrame(
    {
        "header": [pd.Timedelta(365, "d"), pd.Timedelta(-19, "d")],
        "header_1": ["length_one", "length_two"],
    }
)
skim(df_check)

should produce a table with a time difference section.
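
As a rough idea of what a time difference section could report, the sketch below computes a few summary statistics with the standard library's `datetime.timedelta` standing in for pandas timedeltas. `summarise_timedeltas` is a hypothetical helper and the choice of statistics is an assumption, not something skimpy defines.

```python
from datetime import timedelta
from typing import Dict, List


def summarise_timedeltas(values: List[timedelta]) -> Dict[str, timedelta]:
    """Return basic summary statistics for a non-empty list of timedeltas."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    # Middle value for odd n, average of the two middle values for even n.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "min": ordered[0],
        "median": median,
        "mean": sum(ordered, timedelta()) / n,
        "max": ordered[-1],
    }


# Same two durations as in the example above: 365 days and -19 days.
stats = summarise_timedeltas([timedelta(days=365), timedelta(days=-19)])
for name, value in stats.items():
    print(f"{name:>6}: {value}")
```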

Error for .clean_columns with polars

In the changelog I read:
v0.0.11
adding polars support

I am now encountering an error when using .clean_columns().
Am I doing something wrong, or is .clean_columns not supported for polars?
Thank you!

import polars as pl
import skimpy  # v0.0.14

df = pl.DataFrame({
    "Column Name 1": [1, 2, 3],
    "ANOTHER Column": [4, 5, 6],
    "More DATA": [7, 8, 9]
})
skimpy.clean_columns(df)

---> skimpy.clean_columns(df)
TypeCheckError: argument "df" (polars.dataframe.frame.DataFrame) is not an instance of pandas.core.frame.DataFrame
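
The error message says that clean_columns only accepts a pandas DataFrame in this version. Until that changes, one possible workaround is to build a rename mapping yourself. The sketch below is plain Python and only approximates snake-case cleaning; `to_snake_case` and `rename_mapping` are hypothetical helpers, and their output is not guaranteed to match what skimpy's clean_columns would produce.

```python
import re
from typing import Dict, Iterable


def to_snake_case(name: str) -> str:
    """Lower-case a column name and join its words with underscores."""
    # Split camelCase boundaries such as "columnName" -> "column Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^0-9a-zA-Z]+", "_", spaced).strip("_").lower()


def rename_mapping(columns: Iterable[str]) -> Dict[str, str]:
    """Map each original column name to its cleaned form."""
    return {col: to_snake_case(col) for col in columns}


mapping = rename_mapping(["Column Name 1", "ANOTHER Column", "More DATA"])
print(mapping)
```

With polars, a mapping like this can be passed to `df.rename(mapping)`, which accepts a dict of old to new names.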

Make output friendly to Quarto documents when there is any R code being executed in the .qmd file too

It would be helpful to have a quarto-friendly output option, so that tables generated from skim render in markdown instead of rendering as code-like objects.

For instance, a file like this, with a python skim(df) statement and an R skim(df) statement

(you'll have to add the qmd extension, github won't let me upload a qmd file)
test.qmd

renders as

Screenshot of rendered html file showing skim-py output as text and skim-r output as HTML tables.

Thank you for making this package, btw - it has made it much easier to teach my students R and python simultaneously when there are so many packages that have parallel functions and syntax between them.

Skewness & kurtosis?

Hey Arthur!
Are you planning to add skewness & kurtosis to the summary stats?
Thanks!
Pedro
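
For reference, both statistics can be computed from central moments. The sketch below uses the population (biased) definitions, with skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3; `skew_kurtosis` is a hypothetical helper. Note that pandas' `.skew()` and `.kurt()` apply bias corrections, so their results will differ slightly on small samples.

```python
from typing import Sequence, Tuple


def skew_kurtosis(values: Sequence[float]) -> Tuple[float, float]:
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


skewness, kurtosis = skew_kurtosis([1, 2, 3, 4, 5])
print(f"skewness={skewness:.3f}, excess kurtosis={kurtosis:.3f}")
```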

Coverage github action is broken

Error message below. Likely related to this section of tests.yml github action, as it is using upload artifact v3:

      - name: Upload coverage data
        if: always() && matrix.session == 'tests'
        uses: "actions/upload-artifact@v3"
        with:
          name: coverage-data
          path: ".coverage.*"

Run actions/download-artifact@v4
  with:
    name: coverage-data
    merge-multiple: false
    repository: aeturrell/skimpy
    run-id: 9093669457
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.9.19/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.19/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.19/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.19/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.19/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.19/x64/lib
Downloading single artifact
Error: Unable to download artifact(s): Artifact not found for name: coverage-data
        Please ensure that your artifact is not expired and the artifact was uploaded using a compatible version of toolkit/upload-artifact.
        For more information, visit the GitHub Artifacts FAQ: https://github.com/actions/toolkit/blob/main/packages/artifact/docs/faq.md

Broken Contributing Link

The link at the top of the home page points to contributing.html, but the page is called CONTRIBUTING.html, hence the link is broken.

Skim output is not able to be recorded or exported to html or svg

I have tried to export skim results by recording them with Console(record=True) before calling the function; however, I got a NoneType object. I expected an object that could be exported to HTML or SVG so that I can share the results.
I also tried the Console.capture() method and saw the same behavior. Did I do something wrong?

from rich.console import Console
from skimpy import skim, generate_test_data
console = Console(record=True)
df = generate_test_data()
skim(df)
console.save_html("demo.html")

The demo.html is empty. Thanks for your support
Regards

Error importing 'Any' from 'typing_extensions' in Databricks environment

Example Description and Error
This script demonstrates how to use the skimpy library to perform a statistical summary of a dataset using Pandas in Python. The dataset used is the "Adult" dataset from the UCI Machine Learning repository, which contains demographic and employment information of adults.

The error occurs when attempting to import the Any class from the typing_extensions module, and is specifically encountered in the Databricks environment. This may be caused by an incompatibility between the Python version and the dependencies of the skimpy library.

Script:
import pandas as pd
from skimpy import skim

# Read data from CSV file

csv_path = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(csv_path, sep=",", header=None)

# Create headers list

headers = ["age","workclass","fnlwgt","education","education-num", "marital-status","occupation",
"relationship","race","sex", "capital-gain","capital-loss","hours-per-week","native-country","makes"]
df.columns = headers

df.head()

skim(df)

Error on Databricks:
ImportError: cannot import name 'Any' from 'typing_extensions' (/databricks/python/lib/python3.10/site-packages/typing_extensions.py)

How to resolve the ImportError when importing 'Any' from 'typing_extensions' in a Databricks environment?
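
A first diagnostic step could be to check which typing_extensions version the environment provides. The sketch below rests on the assumption that `Any` was added to typing_extensions in release 4.4.0, which is worth confirming against its changelog; `parse_version`, `installed_version` and `supports_any` are hypothetical helpers.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumption: typing_extensions gained `Any` in release 4.4.0.
MIN_VERSION = (4, 4, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.3.0' into (4, 3, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as '4.4' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def supports_any(version_text: str) -> bool:
    """Check a version string against the assumed minimum release."""
    return parse_version(version_text) >= MIN_VERSION


found = installed_version("typing_extensions")
if found is None:
    print("typing_extensions is not installed")
elif supports_any(found):
    print(f"typing_extensions {found} should provide Any")
else:
    print(f"typing_extensions {found} is older than 4.4.0; consider upgrading")
```

If the reported version is older, upgrading the package in the notebook environment (for example with `%pip install --upgrade typing_extensions`) may resolve the import, though this has not been verified on Databricks.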
