prepackPy

Team

  1. Jingyun Chen: jchen9314
  2. Anthony Chiodo: apchiodo
  3. Sarah Watts: smwatts

Topic

A common rule of thumb for data scientists is that data preparation takes approximately 80% of the total time on a project. Not only is this process time consuming, it is also considered one of the less enjoyable components of a project (Forbes, 2016). To help address this problem, we have built a package that improves some of the common techniques used in data preparation. It includes a function that streamlines splitting a dataset into training and testing data (and provides a model-ready output!), a function that incorporates more standardization methods than a data scientist could ever want, and a function that lets data scientists quickly see which columns in a dataset contain NA values and how many.

Install

From the terminal, type:

pip install git+https://github.com/UBC-MDS/prepackPy.git

From the Python IDE, type:

from prepackPy import na_counter as na, splitter as sp, stdizer as sd

Example Usage

After the package has been installed you will be able to complete the following examples in the Python IDE. For full function descriptions please see the Function Description section below.

  • sp.splitter(X, target_index, split_size, seed)
# import numpy package
import numpy as np

# example dataset
X = np.random.randint(10, size=(3, 3))

# example function call
X_train, y_train, X_test, y_test = sp.splitter(X, target_index=2, split_size=0.3, seed=0)

Output:

X_train = array([[5, 0], [3, 7]])

y_train = array([3, 9])

X_test = array([[3, 5]])

y_test = array([2])
  • sd.stdizer(X, method="mean_sd", method_args=None)
# import numpy package
import numpy as np

# example dataset
X = np.array([[-1, 0], [2, 1], [1, -2], [1, 1]])

# example function call
sd.stdizer(X, method="mean_sd", method_args=None)

Output:

array([[ 1.41421356, -1.35873244, -0.53916387],
       [-0.70710678,  1.01904933,  1.40182605],
       [-0.70710678,  0.33968311, -0.86266219]])
  • na.na_counter(X)
# import numpy package
import numpy as np

# example dataset
X = np.array([[-1, np.nan], [np.nan, np.nan], [1, np.nan], [1, 1]])

# example function call
na.na_counter(X, col_index=[0,1])

Output:

{'column': [0, 1], 'nans': [1, 3]}

Function Descriptions

  • sp.splitter(X, target_index, split_size, seed)

Description: consolidates scikit-learn's current workflow for splitting a dataset into train and test sets, i.e. turns this:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('data.csv')

X = data.iloc[:, 0:10]
y = data.iloc[:, 10:11]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

into this:

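import pandas as pd
from prepackPy import splitter as sp

data = pd.read_csv('data.csv')

# target_index is the integer position of the target column (column 10 here)
X_train, X_test, y_train, y_test = sp.splitter(data, target_index=10, split_size=0.3, seed=42)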

| Input Parameters | Input Type | Output Parameters | Output Type |
| --- | --- | --- | --- |
| X | pandas.DataFrame, numpy.ndarray | y train | numpy.ndarray |
| target index | integer | y test | numpy.ndarray |
| split size | float | X train | numpy.ndarray |
| seed | integer | X test | numpy.ndarray |
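The idea behind splitter can be illustrated with a minimal NumPy-only sketch. This is not the package's implementation: `split_by_target_index` is a hypothetical helper, it assumes `split_size` is the fraction of rows held out for testing, and it returns a dictionary so that the example does not depend on the order of the returned arrays.

```python
import numpy as np


def split_by_target_index(X, target_index, split_size, seed):
    """Hypothetical helper: separate the target column, then split the rows."""
    X = np.asarray(X)
    n_rows = X.shape[0]

    # Shuffle the row indices reproducibly using the seed.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)

    # Assumption: split_size is the fraction of rows used for the test set.
    n_test = int(round(n_rows * split_size))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

    # Separate the target column from the feature columns.
    y = X[:, target_index]
    features = np.delete(X, target_index, axis=1)

    return {
        "X_train": features[train_rows],
        "y_train": y[train_rows],
        "X_test": features[test_rows],
        "y_test": y[test_rows],
    }


data = np.arange(30).reshape(10, 3)
parts = split_by_target_index(data, target_index=2, split_size=0.3, seed=0)
print(parts["X_train"].shape, parts["X_test"].shape)  # (7, 2) (3, 2)
```

Every row of the input ends up in exactly one of the two sets, and the target column is removed from the feature arrays.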

  • sd.stdizer(X, method="mean_sd", method_args=None)

Description: standardizes features. Accepts both pandas dataframes and numpy arrays as input. Returns a numpy array as output.

| Input Parameters | Input Type | Output Parameters | Output Type |
| --- | --- | --- | --- |
| X | pandas.DataFrame, numpy.ndarray | standardized X | numpy.ndarray |
| method | string | | |
| method_args | list of lists | | |

The input parameter method accepts the following values: mean_sd, mean, sd, min_max, own. Each value for the method parameter will allow the user to apply a different type of standardization to dataset X.

method = own requires an additional input parameter called method_args. method_args contains a list of lists, e.g. [[1,2], [3,4]], where the values correspond to the means and standard deviations of each column in X, respectively.
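As a rough illustration of what these methods compute, here is a minimal NumPy sketch. It is not prepackPy's code: `standardize` and its `means` and `sds` arguments are hypothetical, the population standard deviation (`ddof=0`) is an assumption, and `min_max` is left out because its exact formula is not spelled out here.

```python
import numpy as np


def standardize(X, method="mean_sd", means=None, sds=None):
    """Hypothetical helper illustrating four of the listed methods."""
    X = np.asarray(X, dtype=float)
    col_mean = X.mean(axis=0)
    # Assumption: population standard deviation (ddof=0).
    col_sd = X.std(axis=0)

    if method == "mean_sd":
        return (X - col_mean) / col_sd
    if method == "mean":
        return X - col_mean
    if method == "sd":
        return X / col_sd
    if method == "own":
        # Assumption: the caller supplies one mean and one sd per column.
        return (X - np.asarray(means, dtype=float)) / np.asarray(sds, dtype=float)
    raise ValueError(f"unknown method: {method}")


X = np.array([[-1, 0], [2, 1], [1, -2], [1, 1]])
Z = standardize(X, method="mean_sd")
print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column
```

After `mean_sd` standardization each column has mean 0 and standard deviation 1, which is a quick sanity check for any implementation.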


  • na.na_counter(X)

Description: summarises the missing data (NA values) in a dataset. Accepts both pandas dataframes and numpy arrays as input. Returns a dictionary where the key is the column index and the value is the NA count.

| Input Parameters | Input Type | Output Parameters | Output Type |
| --- | --- | --- | --- |
| X | pandas.DataFrame, numpy.ndarray | dictionary (key = column index, value = NA count) | dictionary |
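The counting itself can be sketched in a few lines of NumPy. `count_nans` is a hypothetical helper, not the package's function; it assumes a numeric 2-D input and mirrors the dictionary layout shown in the example usage output.

```python
import numpy as np


def count_nans(X):
    """Hypothetical helper: per-column NaN counts for a numeric 2-D array."""
    X = np.asarray(X, dtype=float)
    counts = np.isnan(X).sum(axis=0)
    # Keep only the columns that contain at least one NaN.
    cols = np.flatnonzero(counts)
    return {"column": cols.tolist(), "nans": counts[cols].tolist()}


X = np.array([[-1, np.nan], [np.nan, np.nan], [1, np.nan], [1, 1]])
print(count_nans(X))  # {'column': [0, 1], 'nans': [1, 3]}
```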

Relationship to the Python Ecosystem

  • splitter

The existing package/method is sklearn.model_selection.train_test_split, which only splits features/target into train features/target and test features/target.

splitter improves on this by also separating the target variable from the dataset, given the column index of the target variable.

  • stdizer

The existing package/method is sklearn.preprocessing.StandardScaler, which considers three standardization methods including subtracting mean and dividing by standard deviation, subtracting mean only, and dividing by standard deviation only.

stdizer adds two more standardization techniques: subtracting the maximum value of each column and dividing by the minimum value of each column, and subtracting a user-specified mean and dividing by a user-specified standard deviation.

  • na_counter

The existing methods are pandas.DataFrame.describe and pandas.DataFrame.info, which provide a summary of the dataset, including information about missing values. However, there is no method dedicated to finding and reporting where missing values exist in Python.

na_counter addresses this. It returns both the indices of the columns that contain missing values and the number of missing values in each of them.

Function Test

You can test our functions by navigating to the tests folder and typing the following from the Terminal:

pytest

The following screenshot shows the test result of our functions.

You can also create a coverage report by typing the following from the Terminal:

pytest --cov-branch; coverage report -m

The following screenshot shows the branch test coverage for each file in the prepackPy package.

If you don't have pytest installed on your machine, you can install this package by typing the following from the Terminal:

pip install pytest

prepackpy's Issues

Milestone 4 (Final project) tasks

  • Add appropriate exception handler (both in Python and in R package code) for all functions.
  • Add adequate integration tests in each package, if applicable.
  • Set up continuous integration testing using Travis and add a passing build stamp for each package README files.
  • In your README, include a clear description of how to install your package and call your functions. Also, include an example with a toy dataset and a screenshot of the output when you run tests.

Milestone 3 tasks

  • address feedback from TA
  • pass all test cases and reach 100% test coverage
  • update package README (a screenshot of test coverage report)
  • use GitHub's issue tracker and milestone feature to plan the milestone (document tasks in issues and assign each issue to Milestone 3).

Feedback for milestone 2

Excellent job! This package was well done and easy to use. The documentation and the examples for usage allowed me to use all the functions with no problem.

Your tests did the job that was needed when I inputted other things.

Two minor comments I had were:

  • You should add in your example usage that you need to import numpy to get your example arrays.

  • Does python treat nan and na the same way? If I use "NA" instead of np.nan na_counter gives me an error.
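On the second comment: Python itself has no built-in NA value. `np.nan` is a float, while "NA" is an ordinary string, so NaN checks do not recognise it. One possible workaround, sketched below with a hypothetical `to_float_or_nan` helper that is not part of prepackPy, is to convert non-numeric entries to `np.nan` before counting.

```python
import numpy as np


def to_float_or_nan(value):
    """Hypothetical helper: convert a value to float, or NaN if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan


# Raw data in which missing entries were recorded as the string "NA".
raw = [[1, "NA"], ["NA", "NA"], [3, 6]]

# Convert every entry so that the result is a purely numeric array.
cleaned = np.array([[to_float_or_nan(v) for v in row] for row in raw])

print(np.isnan(cleaned).sum(axis=0).tolist())  # [1, 2]
```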
