aster - a bot to write kaggle baseline kernels

Aster is a Python-based bot (or module) capable of writing baseline starter kernels for competitions or datasets hosted on Kaggle. It currently works with two types of datasets: numerical datasets (with continuous and/or categorical columns) and text datasets with a single text / document field.

Key features

  1. Can create kernels for both competitions and datasets
  2. Can create kernels for datasets with binary or multi-class classification targets
  3. Can create kernels for text datasets and numerical datasets
  4. Performs Quick Exploration, Preprocessing, Feature Engineering, and Modelling
  5. Changes the visuals according to data, for example - generates word clouds for text data and pairplots for numerical datasets
  6. Uses a config to create new kernels

How Aster Works

Aster first reads the inputs given by the user in the config and detects the types of columns present in the dataset. Based on this information, it dynamically chooses the most relevant code / text templates and appends them to the baseline kernel. For example, if the dataset belongs to the text classification category, Aster generates word clouds and skips correlation charts, pair plots, and categorical variable distributions. If the dataset is not a text classification dataset, Aster chooses the templates relevant to numerical data instead, such as distributions of categorical variables and missing value treatment.
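
The selection step described above can be sketched as follows. This is a minimal illustration, not Aster's actual implementation: the function `select_templates` and all template names are hypothetical. Only the `_TAG` key, its values `doc` / `num`, and the `num` default come from the config described in this README.

```python
# Hypothetical sketch of tag-based template selection; the names below are
# illustrative and are not taken from Aster's source code.

# Templates assumed to be shared by every kernel.
COMMON_TEMPLATES = ["environment_preparation", "load_dataset",
                    "dataset_snapshot", "target_distribution"]

# Templates assumed to apply to only one dataset type.
TEXT_TEMPLATES = ["tfidf_vectorizer", "wordcloud"]
NUMERIC_TEMPLATES = ["missing_values", "categorical_distributions",
                     "correlations", "pair_plots"]


def select_templates(config):
    """Return the ordered list of template names for a config dict."""
    tag = config.get("_TAG", "num")  # "num" is the documented default
    if tag == "doc":
        specific = TEXT_TEMPLATES
    elif tag == "num":
        specific = NUMERIC_TEMPLATES
    else:
        raise ValueError(f"unknown _TAG: {tag!r}")
    return COMMON_TEMPLATES + specific


if __name__ == "__main__":
    print(select_templates({"_TAG": "doc"}))
```

Keeping the shared and type-specific templates in separate lists means a new dataset type only needs one more list and one more branch.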

Detailed table of contents

Aster creates the following contents, depending on the type of data.

  1. Environment Preparation
  2. Quick Exploration
         2.1 Load Dataset
         2.2 Dataset Snapshot and Summary
         2.3 Target Variable Distribution
         2.4 Missing Values
         2.5 Variable Types
         2.6 Variable Correlations
  3. Preprocessing
         3.1 Label Encoding
         3.2 Missing Values Treatment
         3.3 Feature Engineering (text fields)
             3.3.1 TF-IDF Vectorizer
             3.3.2 Top Keywords - Wordcloud
         3.4 Train Test Split
  4. Modelling
         4.1 Logistic Regression
         4.2 Decision Tree
         4.3 Random Forest
         4.4 ExtraTrees Classifier
         4.5 Extreme Gradient Boosting
  5. Feature Importance
  6. Model Ensembling
         6.1 A simple Blender
  7. Creating Submission
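
The README does not say how the "simple blender" in step 6.1 combines models. One common approach, shown here as an assumption rather than Aster's actual code, is to average the class probabilities predicted by each model and pick the most probable class for each row. The function names and the probabilities in the example are made up for illustration.

```python
# Hypothetical sketch: blend models by averaging their predicted class
# probabilities. Not taken from Aster's source.

def blend_probabilities(model_probs):
    """Average per-class probabilities across models.

    model_probs holds one entry per model; each entry is a list of rows,
    and each row is a list of per-class probabilities. All models must
    cover the same rows and the same classes.
    """
    if not model_probs:
        raise ValueError("need at least one model's predictions")
    n_models = len(model_probs)
    blended = []
    for rows in zip(*model_probs):  # the same row from every model
        blended.append([sum(p) / n_models for p in zip(*rows)])
    return blended


def predict_labels(blended, classes):
    """Pick the class with the highest blended probability for each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)]
            for row in blended]


if __name__ == "__main__":
    logistic = [[0.9, 0.1], [0.4, 0.6]]  # made-up probabilities, two rows
    forest = [[0.7, 0.3], [0.2, 0.8]]
    blended = blend_probabilities([logistic, forest])
    print(predict_labels(blended, classes=[0, 1]))
```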

Usage : example 1

from aster.aster import aster

config = {"COMPETITION": "titanic",
          "_TARGET_COL": "Survived",
          "_ID_COL": "PassengerId"}

ast = aster(config) # aster object with config 
ast._prepare() # prepare the kernel
ast._push() # push the kernel on kaggle

Usage : example 2

from aster.aster import aster

config = {"COMPETITION": "spooky-author-identification",
          "_TARGET_COL": "author",
          "_ID_COL": "id",
          "_TAG": "doc",
          "_TEXT_COL": "text"}

ast = aster(config) # aster object with config 
ast._prepare() # prepare the kernel
ast._push() # push the kernel on kaggle

Config examples

Aster uses a config and its key-value pairs to write kernels for different datasets. Not all keys are mandatory; most are optional. The following table lists them.

| Key | Example Value | Default | Optional/Mandatory | Definition |
|---|---|---|---|---|
| DATASET | iris | "" | optional | Name of the dataset to be used |
| COMPETITION | titanic | "" | optional | Name of the competition |
| _TARGET_COL | Survived | "" | mandatory | Target column name |
| _ID_COL | PassengerId | "" | optional | ID column name |
| _TRAIN_FILE | train | train | optional | Name of the train file |
| _TEST_FILE | test | test | optional | Name of the test file |
| _TAG | doc | num | optional (only for text) | doc : text dataset, num : numerical dataset |
| _TEXT_COL | text | "" | optional (only for text) | Name of the column containing text data |
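
The defaults and the mandatory key from the table can be expressed in a few lines. The helper below is a hypothetical sketch and not part of Aster's API. It assumes that missing optional keys simply fall back to the defaults listed above and that an empty `_TARGET_COL` counts as missing.

```python
# Hypothetical helper (not part of Aster): fill in the documented defaults
# and check the one key the table marks as mandatory.

DEFAULTS = {
    "DATASET": "",
    "COMPETITION": "",
    "_ID_COL": "",
    "_TRAIN_FILE": "train",
    "_TEST_FILE": "test",
    "_TAG": "num",
    "_TEXT_COL": "",
}
MANDATORY = ("_TARGET_COL",)


def complete_config(user_config):
    """Return a copy of user_config with the documented defaults applied."""
    missing = [key for key in MANDATORY if not user_config.get(key)]
    if missing:
        raise KeyError(f"missing mandatory config keys: {missing}")
    # Values supplied by the user override the defaults.
    return {**DEFAULTS, **user_config}


if __name__ == "__main__":
    config = complete_config({"COMPETITION": "titanic",
                              "_TARGET_COL": "Survived",
                              "_ID_COL": "PassengerId"})
    print(config["_TAG"], config["_TRAIN_FILE"])
```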

Example Kernels generated by Aster

1. Binary Classification on Numerical Data - Competition Data
  • Titanic Baseline Kernel
2. Multi Classification on Text Data - Competition Data
  • Spooky Author Baseline Kernel
3. Classification - Non Competition Data
  • Iris Dataset
  • Diabetes Dataset
  • Mushrooms Dataset

Installation

Aster can be installed directly from GitHub using the following commands:

git clone https://github.com/shivam5992/aster.git
cd aster
python setup.py install

Future Work

  • Dynamic Code Selection Improvements
  • Add More Content
        - Automated Feature Engineering
        - Hyperparameter Tuning
  • Extend Datatypes
        - Regression Problems - Numerical Data
        - Image Classification
