I recommend dfit.fit_transform(X) be extended to include multiple var

new feature request - automatically fit multiple variables (distfit, open issue)

erdogant commented on August 26, 2024

new feature request - automatically fit multiple variables

Comments (2)

erdogant commented on August 26, 2024

Great suggestion. I looked into it, but integrating such an approach is quite some work. At the moment all functions (figures, predict, etc.) are designed for a single univariate column, not for fitting multiple columns in one call. I will put this on my never-ending, ever-growing to-do list.

jkmackie commented on August 26, 2024

Thank you for the reply!

In the meantime, here is starter code to run a distfit exploratory data analysis on multiple CPU cores. Pandas DataFrames are used for readability. (The code can easily be adapted to use NumPy instead.)

The illustration below uses a numeric-only dataset called Company Bankruptcy Prediction. It has 6819 rows and 96 columns.

Note: Error handling is required to run distfit on this dataset. Certain columns raise errors, with or without parallel processing.
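Since some columns are expected to fail, it can help to record which column failed and why instead of only a generic marker. Below is a minimal sketch of that idea. The names `FitOutcome`, `safe_fit`, and `fake_fit` are hypothetical and not part of distfit; `fake_fit` is a stand-in for the real fitting call, and its "constant column" failure is only an illustration, not a statement about which columns distfit rejects.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class FitOutcome(NamedTuple):
    column: str
    best: Optional[str]    # name of the best-fitting distribution, if any
    error: Optional[str]   # exception type and message, if the fit failed


def safe_fit(column: str, values: Sequence[float],
             fit_one: Callable[[Sequence[float]], str]) -> FitOutcome:
    """Run fit_one on one column and capture any exception instead of raising."""
    try:
        return FitOutcome(column, fit_one(values), None)
    except Exception as exc:  # keep going; record what went wrong
        return FitOutcome(column, None, f"{type(exc).__name__}: {exc}")


def fake_fit(values):
    # Stand-in for a distfit call (assumption); fails on constant columns
    # purely for illustration.
    if len(set(values)) < 2:
        raise ValueError("column is constant")
    return "norm"


outcomes = [
    safe_fit("a", [1.0, 2.0, 3.0], fake_fit),
    safe_fit("b", [5.0, 5.0, 5.0], fake_fit),
]
failed = [o for o in outcomes if o.error is not None]
print(failed)
```

With a real run, `fit_one` could be a small function that calls distfit on the column values and returns the model name, so the failing columns can be inspected afterwards.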

import collections
import re

import numpy as np
import pandas as pd
from distfit import distfit
from IPython.display import display  # display() is only predefined inside Jupyter/IPython
from joblib import Parallel, delayed

pd.options.display.max_columns = 100

# Numeric-only data from here (sign-in required):  
# https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction/download?datasetVersionNumber=2

#----------------------------------------------------------------------------------
#Clean up column names and lower memory.
#----------------------------------------------------------------------------------
df = pd.read_csv("./data.csv")
# Strip leading/trailing spaces and replace runs of inner whitespace with underscores.
df.columns = [re.sub(r"\s+", "_", c.strip()) for c in df.columns]

print('df shape:', df.shape)
display(df.tail(3))

for c in df.columns:
    df[c] = pd.to_numeric(df[c], downcast='float')


#----------------------------------------------------------------------------------
#Use joblib to run distfit on CPU cores in parallel.
#----------------------------------------------------------------------------------
chunks = [df[[c]] for c in df.columns]  #one single-column DataFrame per column due to univariate constraint.
display(chunks[0].head())
display(chunks[1].head())

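dfit = distfit()

def get_distfit(series):
    """Return the name of the best-fitting distribution, or 'ERROR' if the fit fails."""
    try:
        result = dfit.fit_transform(series.values, verbose=30)
        return result['model']['name']
    except Exception:  #some columns fail to fit; mark them and continue
        return 'ERROR'
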
with Parallel(n_jobs=-2, prefer="processes") as parallel:
    results = parallel(delayed(get_distfit)(chunk) for chunk in chunks)

display(list(zip(df.columns, results))[0:5])  #show best distribution by column
display(collections.Counter(results).most_common())  #number of columns per best distribution


#----------------------------------------------------------------------------------
#Get best distribution one column at a time (slower than parallel run).
#----------------------------------------------------------------------------------
sequential_outputs = []
for chunk in chunks:
    sequential_outputs.append(get_distfit(chunk))
display(list(zip(df.columns, sequential_outputs))[0:5])  #show best distribution by column
display(collections.Counter(sequential_outputs).most_common())  #number of columns per best distribution
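
The parallel and sequential runs above differ only in how the per-column function is mapped over the columns. A rough sketch of a single wrapper that covers both cases is shown here. `fit_columns` and `spread_label` are hypothetical names, not part of distfit, and `spread_label` is a trivial stand-in for a univariate fit so that the example runs without the dataset.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Dict, Mapping, Optional, Sequence


def fit_columns(columns: Mapping[str, Sequence[float]],
                fit_one: Callable[[Sequence[float]], str],
                executor: Optional[Executor] = None) -> Dict[str, str]:
    """Apply fit_one to every column and return {column name: result}."""
    names = list(columns)
    # Use the executor's map when one is given, otherwise the built-in map.
    mapper = executor.map if executor is not None else map
    results = mapper(fit_one, (columns[name] for name in names))
    return dict(zip(names, results))


def spread_label(values):
    # Stand-in for a univariate fit (assumption); labels a column by its range only.
    return "wide" if max(values) - min(values) > 10 else "narrow"


data = {"x": [1, 2, 3], "y": [0, 50, 100]}

sequential = fit_columns(data, spread_label)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = fit_columns(data, spread_label, executor=pool)

print(sequential)
```

Threads are used here only to keep the sketch simple. Distribution fitting is CPU-bound, so a process-based executor (or joblib, as in the starter code) is likely the better fit in practice; in that case `fit_one` has to be picklable.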
