prelurn

Prelurn is a package and CLI for exploring tabular data.

Prelurn describes CSV datasets, augmenting the describe function of pandas.

For each column, it identifies:

  • data type
    • (numeric, boolean, timestamp, categorical)
  • descriptive statistics (depending on the data type)
    • mean
    • standard deviation
    • max
    • min
    • quantiles or deciles
    • categories
    • most frequent value and its number of occurrences
  • data integrity
    • proportion of observations which are missing

This lets users make a basic determination of which variables may be useful in a machine learning model, particularly in linear parametric models (e.g. LDA, logistic regression), which are sensitive to the distribution of features.
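The per-column summary listed above can be approximated directly with pandas. The following is a minimal sketch, not prelurn's actual implementation: the function name `summarize_column` and the exact field names are assumptions chosen to mirror the list, and timestamp detection is omitted.

```python
import pandas as pd

# Deciles 10% .. 90%, passed to pandas as fractions.
DECILES = [i / 10 for i in range(1, 10)]


def summarize_column(series):
    """Return a dict of summary fields for one column (illustrative only)."""
    summary = {"missing_proportion": float(series.isna().mean())}
    is_bool = pd.api.types.is_bool_dtype(series)
    if pd.api.types.is_numeric_dtype(series) and not is_bool:
        summary["type"] = "numeric"
        # describe() reports count, mean, std, min, the requested percentiles and max.
        summary.update(series.describe(percentiles=DECILES).to_dict())
    else:
        summary["type"] = "boolean" if is_bool else "categorical"
        counts = series.value_counts()  # missing values are excluded by default
        summary["categories"] = ", ".join(sorted(counts.index.astype(str)))
        summary["top"] = counts.index[0]       # most frequent value
        summary["freq"] = int(counts.iloc[0])  # its number of occurrences
    return summary


frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, None], "c": ["a", "a", "b", "a"]})
for name in frame.columns:
    print(name, summarize_column(frame[name]))
```

Treating every non-numeric column as categorical is a simplification made for this sketch.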

(venv) root@b9241715e1c2:/opt/prelurn# prelurn describe tests/data/data.csv

(venv) root@b9241715e1c2:/opt/prelurn# cat data_describe.csv

,type,categories,missing_proportion,count,unique,top,freq,mean,std,min,10%,20%,30%,40%,50%,60%,70%,80%,90%,max
UID,numeric,,0.0,64.0,,,,257.09375,79.36168335988272,106.0,145.9,184.6,211.9,234.60000000000002,257.5,281.4,309.19999999999993,346.6,359.7,373.0
VAR2,numeric,,0.0,64.0,,,,0.201531991390625,0.07087738465807435,0.108853503,0.13695698950000001,0.15321989400000002,0.16363511890000002,0.1732312022,0.1840266945,0.1953042124,0.21185764439999996,0.2354144256,0.2847120188,0.5441622070000001
VAR3,numeric,,0.0,64.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
VAR4,numeric,,0.0,64.0,,,,0.421875,0.4977628523130841,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
VAR5,numeric,,0.0,64.0,,,,0.234375,0.42695628191498325,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
VAR6,numeric,,0.0,64.0,,,,0.109375,0.3145764348029479,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.7000000000000028,1.0
VAR7,numeric,,0.0,64.0,,,,0.078125,0.27048970875004197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
VAR8,numeric,,0.953125,3.0,,,,1.0,1.0,0.0,0.2,0.4,0.6,0.8,1.0,1.2,1.4,1.6,1.8,2.0
VAR9,numeric,,0.0,64.0,,,,0.03125,0.17536809360305042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
VAR10,categorical,"'0','1'",0.0,64,2,'0',61,,,,,,,,,,,,,
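A summary like the one above can be screened for variables that are unlikely to help a model. In the sample output, VAR3 is constant (standard deviation 0.0) and VAR8 is about 95% missing. The sketch below shows one way to flag such columns. It assumes the summary is stored as CSV with the variable name in the first column; the function name `flag_columns` and the 0.5 missing-value threshold are illustrative choices, not part of prelurn.

```python
import csv
import io


def flag_columns(summary_csv, max_missing=0.5):
    """Return {column: reason} for columns that look unhelpful for modelling."""
    flagged = {}
    reader = csv.DictReader(summary_csv)
    index_field = reader.fieldnames[0]  # first column holds the variable name
    for row in reader:
        name = row[index_field]
        if float(row["missing_proportion"]) > max_missing:
            flagged[name] = "mostly missing"
        elif row.get("std") not in (None, "") and float(row["std"]) == 0.0:
            flagged[name] = "constant"
    return flagged


# A few rows taken from the sample output, reduced to the columns used here.
sample = io.StringIO(
    ",type,missing_proportion,std\n"
    "VAR3,numeric,0.0,0.0\n"
    "VAR8,numeric,0.953125,1.0\n"
    "VAR9,numeric,0.0,0.17536809360305042\n"
)
print(flag_columns(sample))  # {'VAR3': 'constant', 'VAR8': 'mostly missing'}
```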

Installation

The Docker and host-machine virtualenv installations are incompatible!

Virtualenv

git clone git@github.com:tesera/prelurn.git
cd prelurn
virtualenv venv
. venv/bin/activate
pip install . # regular users

Docker

Build the prelurn image as follows:

docker build -t prelurn .

This creates an isolated environment in which prelurn can be run.

Start a container using the prelurn image:

docker run -t -i prelurn /bin/bash
. venv/bin/activate

Run prelurn in the container

root@de6cccb1a217:/opt/prelurn# prelurn 10
False
False
False
False
False
False
False
False
False
False

Usage

(venv) root@8248c14c4ffc:/opt/prelurn# prelurn --help
  Usage: prelurn [OPTIONS] COMMAND [ARGS]...

  Machine learning preprocessing steps

  Options:
      -v, --verbose
      -h, --help     Show this message and exit.

  Commands:
      describe  Summarize data in a csv file T
(venv) root@8248c14c4ffc:/opt/prelurn# prelurn describe --help
  Usage: prelurn describe [OPTIONS] TABLE

  Summarize data in a csv file

  TABLE is the path to a csv file, as per http://pandas.pydata.org/pandas-
  docs/stable/io.html#basic

  Options:
      -j, --json                      write output to json file
      -q, --quantile-type [decile|quartile]
                                      percentiles to use
      -o, --outfile PATH              path to use for output data, excluding file
                                      extension. default is to write to working directory
      -h, --help                      Show this message and exit.
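When prelurn is called from a script rather than typed at a shell, the options shown in the help text can be assembled programmatically. This is a small sketch that only builds and runs the command line; the helper names `describe_command` and `run_describe` are hypothetical, and it assumes the `prelurn` executable is on the PATH.

```python
import subprocess


def describe_command(table, quantile_type=None, json_output=False, outfile=None):
    """Build the argument list for `prelurn describe` from the documented options."""
    command = ["prelurn", "describe"]
    if json_output:
        command.append("--json")
    if quantile_type is not None:
        # The help text lists exactly these two choices.
        if quantile_type not in ("decile", "quartile"):
            raise ValueError("quantile_type must be 'decile' or 'quartile'")
        command += ["--quantile-type", quantile_type]
    if outfile is not None:
        # Path for the output data, without a file extension.
        command += ["--outfile", outfile]
    command.append(table)
    return command


def run_describe(table, **options):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(describe_command(table, **options))


print(describe_command("tests/data/data.csv", quantile_type="quartile"))
```

Calling `run_describe("tests/data/data.csv", json_output=True)` would then invoke the CLI with `--json`.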

Development

dev.env

A file, dev.env, is required for some tasks.

It should define the following variables:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=
BOTO_CONFIG=
S3_TEST_DATA_URL=

BOTO_CONFIG in dev.env will override the path specified in Dockerfile.
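Before running tasks that depend on dev.env, it can help to check that the file defines every variable. The sketch below is an illustration only: the helper names are hypothetical, and treating an empty value as missing is an assumption, since the README does not say which variables may be left blank.

```python
REQUIRED_KEYS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "BOTO_CONFIG",
    "S3_TEST_DATA_URL",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty (assumption: empty counts as missing)."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


print(missing_keys("AWS_ACCESS_KEY_ID=example\nAWS_REGION=\n"))
```

In practice the text would come from reading dev.env, for example `missing_keys(open("dev.env").read())`.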

docker-compose

Several development tasks are defined in docker-compose.yml.

Note - the base directory is mounted to a volume in the container - /opt/prelurn. This way you do not need to rebuild the image every time you change a file or requirement.

Start the dev container and set up a venv. The venv is persisted to your local storage, but it will probably not run on your host system!

docker-compose run dev
virtualenv venv -p python2.7
. venv/bin/activate
pip install -e .[dev]
docker-compose run test # run tests
docker-compose run docs # build docs
docker-compose run dev # run bash

Tests

Any functional changes or additions should be tested. Add a test for your changes and update old tests, if required.

Run the tests as follows:

docker-compose run test

# or, if set up with virtualenv outside of container
py.test tests/

Documentation

Documentation is built automatically from docstrings.

See http://www.sphinx-doc.org/en/stable/rest.html for the syntax.
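As an illustration of a docstring that Sphinx can pick up, here is a small function documented with reStructuredText field lists. The function `missing_proportion` is a hypothetical example written for this section and is not taken from the prelurn source.

```python
def missing_proportion(values):
    """Return the proportion of entries in ``values`` that are missing.

    :param values: observations for one column; ``None`` marks a missing entry
    :type values: list
    :returns: a number between 0 and 1
    :rtype: float
    :raises ValueError: if ``values`` is empty
    """
    if not values:
        raise ValueError("values must not be empty")
    # float() keeps the division exact under Python 2 as well as Python 3.
    return sum(value is None for value in values) / float(len(values))


print(missing_proportion([1, None, 3, None]))  # 0.5
```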

If you change function arguments, or the docstrings are otherwise updated, you should rebuild the docs as follows:

docker-compose run docs

# or, if set up with virtualenv outside of container
cd docs
sphinx-apidoc -o ./source ../prelurn
make html
make coverage


