Comments (2)

erdogant commented on August 26, 2024

Great suggestion. I looked into it, but integrating such an approach is quite some work. At the moment, all functions (figures, predict, etc.) are designed for a single univariate column, not for multiple univariate columns. I will put this on my never-ending-always-getting-longer-todo-list.

jkmackie commented on August 26, 2024

Thank you for the reply!

In the meantime, here is starter code to run a distfit exploratory data analysis with multiple cores. Pandas DataFrames are used for readability. (Code can be easily tweaked to use numpy instead.)

The illustration below uses a numeric-only dataset called Company Bankruptcy Prediction. It has 6819 rows and 96 columns.

Note: Error handling is required to run distfit on this dataset. Certain columns raise errors, with or without parallel processing.

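import collections
import re

import numpy as np
import pandas as pd
from distfit import distfit
from IPython.display import display  # built in to Jupyter; imported so the script also runs outside a notebook
from joblib import Parallel, delayed

pd.options.display.max_columns = 100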

# Numeric-only data from here (sign-in required):  
# https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction/download?datasetVersionNumber=2

#----------------------------------------------------------------------------------
#Clean up column names and lower memory.
#----------------------------------------------------------------------------------
df = pd.read_csv("./data.csv")
# Strip leading/trailing spaces and replace inner whitespace runs with underscores.
df.columns = [re.sub(r"\s+", "_", c.strip()) for c in df.columns]

print('df shape:', df.shape)
display(df.tail(3))

for c in df.columns:
    df[c] = pd.to_numeric(df[c], downcast='float')


#----------------------------------------------------------------------------------
#Use joblib to run distfit on CPU cores in parallel.
#----------------------------------------------------------------------------------
# One single-column DataFrame per column, because distfit is univariate.
# A list comprehension avoids passing a DataFrame to np.array_split.
chunks = [df[[c]] for c in df.columns]
display(chunks[0].head())
display(chunks[1].head())

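def get_distfit(chunk):
    """Return the name of the best-fitting distribution for a one-column chunk."""
    try:
        result = dfit.fit_transform(chunk.values, verbose=30)
        return result['model']['name']
    except Exception:  # some columns fail to fit; flag them instead of stopping the run
        return 'ERROR'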
    
dfit = distfit()
# n_jobs=-2 uses all CPU cores but one; results come back in the same order as chunks.
with Parallel(n_jobs=-2, prefer="processes") as parallel:
    results = parallel(delayed(get_distfit)(chunk) for chunk in chunks)

display(list(zip(df.columns, results))[0:5])  #show best distribution by column
display(sorted(collections.Counter(results).items(), key=lambda x:x[1], reverse=True))


#----------------------------------------------------------------------------------
#Get best distribution one column at a time (slower than parallel run).
#----------------------------------------------------------------------------------
sequential_outputs = [get_distfit(chunk) for chunk in chunks]
display(list(zip(df.columns, sequential_outputs))[0:5])  #show best distribution by column
display(sorted(collections.Counter(sequential_outputs).items(), key=lambda x:x[1], reverse=True))
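
Because the starter code collapses every failure into the string 'ERROR', it does not show why a column failed. Below is a minimal sketch of one way to keep the exception type and message per column. It is not part of distfit: `safe_fit` and `toy_fit` are hypothetical names, and `toy_fit` is only a stand-in so the example runs without distfit installed. In the script above, the `fit_fn` argument would presumably wrap the `dfit.fit_transform(...)['model']['name']` call.

```python
from collections import Counter


def safe_fit(column_name, values, fit_fn):
    """Run fit_fn(values) and report either its result or the error it raised."""
    try:
        return {"column": column_name, "best": fit_fn(values), "error": None}
    except Exception as exc:  # keep going, but remember what went wrong
        return {"column": column_name, "best": None,
                "error": f"{type(exc).__name__}: {exc}"}


def toy_fit(values):
    """Hypothetical stand-in for a distfit call; rejects constant columns."""
    if len(set(values)) < 2:
        raise ValueError("column is constant")
    return "norm"


# Two toy columns: one that fits, one that fails.
data = {"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]}
report = [safe_fit(name, values, toy_fit) for name, values in data.items()]

# Collect the failures and count how often each error message occurs.
failed = [r for r in report if r["error"] is not None]
error_counts = Counter(r["error"] for r in failed)
print(failed)
print(error_counts)
```

Since `safe_fit` takes the column name and values as plain arguments, it can be passed to `delayed(...)` in the same way as `get_distfit`, and the error counts then show which failure modes dominate across the columns.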
