A common rule of thumb among data scientists is that data preparation takes approximately 80% of the total time on a project. Not only is this process time consuming, but it is also considered one of the less enjoyable parts of a project (Forbes, 2016). To help address this problem, we have built a package that improves some of the common techniques used in data preparation. It includes a function that streamlines the process of splitting a dataset into training and testing data (and provides model-ready output!), a function that incorporates more standardization methods than a data scientist could ever want, and a function that lets data scientists quickly see which columns in a dataset contain NA values and how many.
From the terminal, type:
pip install git+https://github.com/UBC-MDS/prepackPy.git
From the Python IDE, type:
from prepackPy import na_counter as na, splitter as sp, stdizer as sd
After the package has been installed, you will be able to complete the following examples in the Python IDE. For full function descriptions, please see the Function Description
section below.
sp.splitter(X, target_index, split_size, seed)
# import numpy package
import numpy as np
# seed the random number generator so the example output below is reproducible
np.random.seed(0)
# example dataset
X = np.random.randint(10, size=(3, 3))
# example function call
X_train, y_train, X_test, y_test = sp.splitter(X, target_index=2, split_size=0.3, seed=0)
Output:
X_train = array([[5, 0], [3, 7]])
y_train = array([3, 9])
X_test = array([[3, 5]])
y_test = array([2])
sd.stdizer(X, method="mean_sd", method_args=None)
# import numpy package
import numpy as np
# example dataset
X = np.array([[-1, 0], [2, 1], [1, -2], [1, 1]])
# example function call
sd.stdizer(X, method="mean_sd", method_args=None)
Output (values assume each column is centred by its mean and scaled by its population standard deviation):
array([[-1.60591014,  0.        ],
       [ 1.14707867,  0.81649658],
       [ 0.22941573, -1.63299316],
       [ 0.22941573,  0.81649658]])
na.na_counter(X, col_index)
# import numpy package
import numpy as np
# example dataset
X = np.array([[-1, np.nan], [np.nan, np.nan], [1, np.nan], [1, 1]])
# example function call
na.na_counter(X, col_index=[0,1])
Output:
{'column': [0, 1], 'nans': [1, 3]}
sp.splitter(X, target_index, split_size, seed)
Description: consolidate scikit-learn's current workflow for splitting a dataset into train and test sets, i.e. turn this:
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('data.csv')
X = data.iloc[:, 0:10]
y = data.iloc[:, 10:11]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
into this:
import pandas as pd
data = pd.read_csv('data.csv')
X_train, y_train, X_test, y_test = splitter(data, target_index=10, split_size=0.3, seed=42)
Input Parameters | Input Type | Output Parameters | Output Type |
---|---|---|---|
X | pandas.DataFrame, numpy.ndarray | y_train | numpy.ndarray |
target_index | integer | y_test | numpy.ndarray |
split_size | float | X_train | numpy.ndarray |
seed | integer | X_test | numpy.ndarray |
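To make the input/output contract above concrete, here is a minimal sketch of how such a splitter could work internally (an illustrative reimplementation under assumed behaviour, not the package's actual code), for a NumPy array input:

```python
import numpy as np

def splitter_sketch(X, target_index, split_size, seed):
    """Illustrative sketch: separate the target column, shuffle the rows,
    and split into train and test portions."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X)
    y = X[:, target_index]                      # target column
    features = np.delete(X, target_index, axis=1)
    idx = rng.permutation(X.shape[0])           # shuffled row indices
    n_test = int(np.ceil(X.shape[0] * split_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return features[train_idx], y[train_idx], features[test_idx], y[test_idx]

# usage: 10 rows, 3 columns, last column treated as the target
X = np.arange(30).reshape(10, 3)
X_train, y_train, X_test, y_test = splitter_sketch(X, target_index=2, split_size=0.3, seed=0)
```

The return order here follows the example call earlier in this README (X_train, y_train, X_test, y_test).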
sd.stdizer(X, method="mean_sd", method_args=None)
Description: standardize features. Accepts both pandas DataFrames and NumPy arrays as input. Returns a NumPy array as output.
Input Parameters | Input Type | Output Parameters | Output Type |
---|---|---|---|
X | pandas.DataFrame, numpy.ndarray | standardized X | numpy.ndarray |
method | string | "" | "" |
method_args | list of lists | "" | "" |
The input parameter method accepts the following values: mean_sd, mean, sd, min_max, and own. Each value applies a different type of standardization to dataset X.
method = own requires an additional input parameter called method_args. method_args contains a list of lists, e.g. [[1, 2], [3, 4]], where the inner lists correspond to the means and standard deviations of each column in X, respectively.
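The method_args convention for method = own can be illustrated with a short sketch (illustrative only; the interpretation of method_args here is an assumption based on the description above). Each column j of X is transformed as (X[:, j] - mean[j]) / sd[j]:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [3.0, 30.0]])

# assumed interpretation of method_args = [[1, 2], [3, 4]]:
# the first inner list holds the per-column means, the second the per-column sds
method_args = [[1, 2], [3, 4]]
means, sds = np.array(method_args[0]), np.array(method_args[1])

# subtract the user-specified mean and divide by the user-specified sd, per column
standardized = (X - means) / sds
```

Column 0 becomes (1 - 1) / 3 = 0.0 and (3 - 1) / 3 ≈ 0.667; column 1 becomes (10 - 2) / 4 = 2.0 and (30 - 2) / 4 = 7.0.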
na.na_counter(X, col_index)
Description: summarise the missing data (NA values) in a dataset. Accepts both pandas DataFrames and NumPy arrays as input. Returns a dictionary reporting the inspected column indices and their NA counts.
Input Parameters | Input Type | Output Parameters | Output Type |
---|---|---|---|
X | pandas.DataFrame, numpy.ndarray | dictionary (column indices and NA counts) | dictionary |
col_index | list of integers | "" | "" |
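The behaviour summarised above can be sketched as follows (an illustrative reimplementation, not the package's code; the col_index default is an assumption):

```python
import numpy as np

def na_counter_sketch(X, col_index=None):
    """Count NaN values in the requested columns and return the
    {'column': [...], 'nans': [...]} structure shown in the example above."""
    X = np.asarray(X, dtype=float)
    if col_index is None:
        col_index = list(range(X.shape[1]))
    counts = [int(np.isnan(X[:, j]).sum()) for j in col_index]
    return {"column": list(col_index), "nans": counts}

X = np.array([[-1, np.nan], [np.nan, np.nan], [1, np.nan], [1, 1]])
result = na_counter_sketch(X, col_index=[0, 1])
# result == {'column': [0, 1], 'nans': [1, 3]}, matching the example output
```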
splitter
The existing package/method is sklearn.model_selection.train_test_split, which only splits already-separated features/target into train features/target and test features/target. splitter improves on this by also separating the target variable from the dataset, given the column index of the target variable.
stdizer
The existing package/method is sklearn.preprocessing.StandardScaler, which covers three standardization methods: subtracting the mean and dividing by the standard deviation, subtracting the mean only, and dividing by the standard deviation only.
This function also considers two more standardization techniques: scaling each column using its minimum and maximum values (min_max), and subtracting a user-specified mean and dividing by a user-specified standard deviation for each column (own).
na_counter
The existing package/method is pandas.DataFrame.describe or pandas.DataFrame.info, which provides a summary of the dataset, including information about missing values. However, neither directly reports which columns contain missing values and how many.
This function addresses that gap: it returns both the indices of columns that contain missing values and the number of missing values in each.
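For comparison, the closest built-in pandas idiom is isna() followed by sum(), which reports per-column NA counts on a DataFrame:

```python
import numpy as np
import pandas as pd

# same example data as the na_counter example above
df = pd.DataFrame([[-1, np.nan], [np.nan, np.nan], [1, np.nan], [1, 1]],
                  columns=["a", "b"])

# per-column count of missing values
na_counts = df.isna().sum()
```

na_counter's contribution is packaging this lookup (and column selection) behind one call that also accepts plain NumPy arrays.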
You can test our functions by navigating to the tests folder and typing the following from the Terminal:
pytest
The following screenshot shows the test result of our functions.
You can also create a coverage report by typing the following from the Terminal:
pytest --cov-branch; coverage report -m
The following screenshot shows the branch test coverage for each file in the prepackPy package.
If you don't have pytest installed on your machine, you can install it by typing the following from the Terminal:
pip install pytest