palashio / libra

Ergonomic machine learning for everyone.

Home Page: http://libradocs.org/

License: MIT License

Python 99.48% Dockerfile 0.26% Makefile 0.26%

auto-ml machine-learning neural-networks

libra's Issues

modifying cropping/resizing to median dimensions

Process for cropping: find the median height and width across the dataset, then interpolate each image up or down depending on whether it is smaller or larger than that median.
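A rough sketch of the median step, using only the standard library. The function names and the (height, width) tuple layout are assumptions for illustration, not Libra's actual API; the interpolation itself would be done by whatever image library the query uses.

```python
from statistics import median


def median_dimensions(sizes):
    """Return the median (height, width) over a list of (height, width) pairs."""
    heights = [h for h, _ in sizes]
    widths = [w for _, w in sizes]
    return int(median(heights)), int(median(widths))


def scale_factors(size, target):
    """Per-axis scale factors: > 1 means interpolate up, < 1 means interpolate down."""
    return target[0] / size[0], target[1] / size[1]


# Hypothetical image sizes as (height, width).
sizes = [(100, 200), (300, 400), (500, 600)]
target = median_dimensions(sizes)
factors = [scale_factors(size, target) for size in sizes]
```

Images smaller than the median get factors above 1 and larger ones get factors below 1, which matches the "up/down" decision described above.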

CNN Data Parameters

Most image data sets consist of a CSV file with image path and labels as well as an overall image folder. Currently, the need to enter a path for every class seems redundant and might be difficult for most large image datasets. We might want to accept a CSV and Image directory path.

testing slack github notification

Remove Rows with NaN in the target column
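One possible shape for this, sketched in plain Python with rows as dicts (an assumption made to keep the example self-contained). On a pandas DataFrame the equivalent call would be df.dropna(subset=[target]).

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_rows_missing_target(rows, target):
    """Keep only rows whose target value is present; other columns are left alone."""
    return [row for row in rows if not is_missing(row.get(target))]


# Hypothetical data: "price" is the target column.
rows = [
    {"area": 50.0, "price": 100.0},
    {"area": 60.0, "price": float("nan")},
    {"area": float("nan"), "price": 120.0},
    {"area": 70.0, "price": None},
]
clean = drop_rows_missing_target(rows, "price")
```

Note that the third row survives even though a feature is missing; only the target column decides whether a row is dropped.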

make common_problems.txt file

Just a text file highlighting common developer problems and how to overcome them.

extensively testing preprocesser.

Need to test the single_reg_preprocesser preprocessor with many structured datasets, and report how it fails and what needs improvement.

re-do keras logging

Because Libra trains multiple models, we need to mute Keras's default logging, which outputs accuracies after every epoch, and instead log once per model. This will help clean up the output when you're running the models.

current logger spams console box

Need to figure out a way to make the log stay at the bottom of the console and update in place instead of re-printing every time.

try recurrent neural network for instruction identification

Try a recurrent neural network for identifying target instructions. PM me for this; I've tried using seq2seq and it didn't work.

getting rid of generate_set_fit_cnn

Querying Google Images is too unreliable; going to get rid of this.

Create predict() function in Client class to process fed data and create operators part of dictionary

Use the scaler fitted in the preprocess function to transform data before prediction.

create tuning for NLP

we need to be able to call .tune() for NLP tasks. Look into keras-tuner for this.

change query names

Some of the query names, like regression_querry_ann, are quite tacky; a system for naming queries needs to be made.

removing noisy columns

Need to create a way in structured_preprocesser() to identify, before training, columns that will reduce performance because of noise. This can be because columns are similar to each other and/or aren't correlated whatsoever.
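A minimal sketch of one way to flag such columns, assuming "similar to each other" means a near-perfect pairwise Pearson correlation and "not correlated" means near-zero correlation with the target. Both the interpretation and the thresholds are assumptions, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def noisy_columns(columns, target, low=0.05, high=0.95):
    """Flag columns uncorrelated with the target or near-duplicates of an earlier column."""
    flagged = set()
    names = list(columns)
    for name in names:
        if abs(pearson(columns[name], target)) < low:
            flagged.add(name)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) > high:
                flagged.add(second)  # keep the first column of a near-duplicate pair
    return flagged


# Hypothetical data: "size_cm" duplicates "size", "noise" is unrelated to the target.
columns = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "size_cm": [100.0, 200.0, 300.0, 400.0, 500.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]
flagged = noisy_columns(columns, target)
```

Pearson correlation only captures linear relationships, so a column with a strong non-linear link to the target could be flagged wrongly; the thresholds would need tuning against real datasets.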

support excel file datasets

Right now we only allow users to upload .csv files. Need support for Excel files. We also need to make sure that when we read these files into pandas DataFrames they keep the same format/schema as .csv files.

integrate regression and classification query into one

Both regression_query_ann() and classification_query_ann() need to be modified so that there is just one simple ann_query().

move parts of queries that are repeated to function

Parts of the queries that are repeated in each one, such as the initial file reading, should be moved to one separate reusable function to avoid code repetition.

automatically remove ID columns

Sometimes datasets have columns with IDs; we need to remove them.

Here's my idea for removing them: columns that hold IDs are just non-numerical columns (which can be found using data[column].dtype.name == 'object') where the number of unique elements, obtained by len(np.unique(data[column])), is equal to the number of rows in the dataset.
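The heuristic above can be sketched without pandas as follows. Columns are represented as a dict of lists and "non-numerical" is approximated as "all values are strings"; both are assumptions made to keep the example self-contained.

```python
def id_like_columns(columns):
    """Return names of non-numeric columns whose values are all distinct."""
    flagged = []
    for name, values in columns.items():
        non_numeric = all(isinstance(v, str) for v in values)
        all_unique = len(set(values)) == len(values)
        if non_numeric and all_unique:
            flagged.append(name)
    return flagged


# Hypothetical dataset: only "user_id" is both non-numeric and fully unique.
columns = {
    "user_id": ["a1", "b2", "c3", "d4"],
    "city": ["Paris", "Rome", "Paris", "Oslo"],
    "age": [31, 45, 27, 52],
}
to_drop = id_like_columns(columns)
```

This would also flag legitimate free-text columns whose values happen to be unique (names, addresses), and it misses integer IDs, so it is probably best treated as a first pass.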

using decreasing metric for shallow tuning doesn't work well

Currently this line of code inside regression_query_ann: while(all(x > y for x, y in zip(losses, losses[1:]))): checks whether each new loss (with one more layer) is lower than the last loss. As a result, the search always stops after very few layers. Need to find a better mechanism.
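To illustrate the problem and one possible alternative: the strictly-decreasing check stops at the first noisy uptick, while a patience-based rule tolerates a few non-improving steps. The patience rule below is a suggestion only, not what Libra does, and the loss values are made up.

```python
def strictly_decreasing(losses):
    """The condition from the issue: every loss is lower than the one before it."""
    return all(x > y for x, y in zip(losses, losses[1:]))


def best_index_with_patience(losses, patience=2, min_delta=0.0):
    """Index of the best loss, stopping after `patience` consecutive non-improving steps.

    Assumes `losses` is non-empty.
    """
    best_index, best_loss, waited = 0, losses[0], 0
    for index, loss in enumerate(losses[1:], start=1):
        if loss < best_loss - min_delta:
            best_index, best_loss, waited = index, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_index


# Hypothetical losses, one entry per added layer.
losses = [0.90, 0.70, 0.72, 0.50, 0.45, 0.46, 0.47]
stops_early = not strictly_decreasing(losses[:3])
best = best_index_with_patience(losses, patience=2)
```

With these numbers the original check gives up at the third entry (0.72 is not below 0.70), while the patience rule carries on and settles on index 4 (loss 0.45).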

improve documentation overall

Documentation for the library is very weak right now; we need to develop a method to document properly. Also create a file describing how the documentation works.

add example projects folder

Add a folder of example projects people can do with Libra.

add all hyperparameters for queries as keyword arguments

Many queries lack hyperparameters that users could specify. For example, in decision_tree_query() the user currently can't specify min_samples_leaf.

Add More Machine Learning Models

We should think about adding more machine learning models, such as the Linear Regression or Logistic Regression model.

better splitting of files

Currently, all the queries are just shoved into predictionQueries.py under the Client class. What's a better way of distributing these?

text classification query

sentiment analysis NLP query. This should definitely be implemented. Could even be a pre-existing algorithm.

conversion from excel file to csv not supported

Queries should first check what sort of file needs to be read. If it's an Excel file, a different reading function should be applied, and so on.
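A sketch of what that dispatch might look like. READERS, reader_name_for and load_dataset are hypothetical names; pandas.read_csv and pandas.read_excel are the real pandas readers (read_excel additionally needs an Excel engine such as openpyxl installed for .xlsx files).

```python
from pathlib import Path

# Hypothetical mapping from file extension to the pandas reader function to call.
READERS = {
    ".csv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
}


def reader_name_for(path):
    """Pick the pandas reader name from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return READERS[suffix]


def load_dataset(path):
    """Read a supported file into a pandas DataFrame."""
    import pandas as pd  # imported lazily so the dispatch logic works without pandas

    return getattr(pd, reader_name_for(path))(path)
```

Keeping the extension lookup in one place means every query can call the same loader, which also fits the earlier idea of moving repeated file-reading code into a shared function.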

using keras-tuner doesn't update library because of history issue

When calling .tune() on functions, I haven't figured out how to extract the history from keras-tuner, so right now tune() only updates the compiled model, not the accuracy, losses, etc. Need to figure out how to get ALL values.

sequence to sequence query

Part of the textual module: a sequence-to-sequence query that converts any piece of text into new text.

Replacing the Missing/Nan Data

Instead of filling NaN values with zero, replace missing numerical values with the column mean/median, and missing categorical values with the last value before the gap in that column, for better accuracy.
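A small sketch of both strategies in plain Python, assuming columns are lists and missing values are None or NaN. The function names are hypothetical; the median is used here, and swapping in statistics.mean gives the mean variant.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_numeric(values):
    """Replace missing numbers with the median of the present values."""
    present = [v for v in values if not is_missing(v)]
    fill = median(present)
    return [fill if is_missing(v) else v for v in values]


def impute_categorical(values):
    """Forward fill: replace a missing category with the last seen value."""
    filled, last = [], None
    for value in values:
        if is_missing(value):
            filled.append(last)
        else:
            filled.append(value)
            last = value
    return filled


numbers = impute_numeric([1.0, float("nan"), 3.0, 8.0])
colours = impute_categorical(["red", None, "blue", None])
```

A missing value at the very start of a categorical column has no predecessor and stays None in this sketch, so a fallback (for example the most frequent category) would still be needed.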

best parameters for callback function?

Need to figure out the best patience value for the callback; read papers online and figure out when it's safe to say the model has overfit.

perform image augmentation on generated images

Image Augmentation: The downloaded images may not be enough so augmentation may play a vital role.

dealing with date/time columns

dates and times can come in so many different shapes and sizes; we need a whole method that can be integrated into structured_preprocesser() to deal with all possibilities.
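One way such a method could start: try a list of candidate formats in order and expand whatever parses into numeric features. The format list and function names are assumptions, and the list order encodes a choice (here day-first for slash-separated dates), since strings like 05/03/2020 are genuinely ambiguous.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
]


def parse_datetime(text):
    """Return a datetime for the first matching format, or None if none match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def datetime_features(text):
    """Expand a date/time string into numeric features a model can use."""
    parsed = parse_datetime(text)
    if parsed is None:
        return None
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "weekday": parsed.weekday(),
        "hour": parsed.hour,
    }
```

Values that match no format come back as None, so the preprocessor can decide whether to drop the column or treat it as categorical.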

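use existing Levenshtein distance module

Levenshtein distance is available for Python through the third-party python-Levenshtein package (it is not part of the standard library, though difflib offers related similarity measures). I can use this to compare two strings and get rid of produceMask().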
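For reference, the distance itself is small enough to sketch in plain Python. This is the standard dynamic-programming formulation, not the code of produceMask() or of any particular package.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # substitution (or match)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

A small distance between a word in the user's instruction and a column name would then count as a match, which is the comparison the issue has in mind.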

Only fit pipeline with X_train, rather than train and test data

We are currently leaking data by fitting the scaler on the train and test data together.
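A minimal sketch of the fix, using a hand-written standard scaler so the example is self-contained. StandardScalerSketch is a stand-in written for this illustration; with scikit-learn the same pattern is to call fit on X_train only and then transform on both splits.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Tiny stand-in for a standard scaler over a single numeric column."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero on constant columns
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


# Hypothetical split: the test value is far outside the training range.
x_train = [1.0, 2.0, 3.0]
x_test = [10.0]

scaler = StandardScalerSketch().fit(x_train)  # statistics come from training data only
train_scaled = scaler.transform(x_train)
test_scaled = scaler.transform(x_test)        # test data is transformed, never fitted
```

Because the mean and standard deviation never see x_test, the test set stays a fair stand-in for unseen data.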

convert all queries to pipeline setup

The main framework for the queries should be similar to how the pipeline is set up under the dev-pipeliner module. This whole conversion needs to happen immediately.

Loading Datasets within Libra

TensorFlow currently allows users to load multiple datasets within TensorFlow itself (e.g. MNIST, COCO, etc.). We could add this to Libra, but include datasets relevant to the present moment, such as COVID-19-related datasets, election-related datasets, etc.

reinforcement learning queries (q-learning / policy gradient)

Look into implementing some sort of reinforcement learning query. How are most users setting up these queries? This is a bit more difficult because RL problems require actions, and agents are constantly changing. Maybe look into something more practical? Deep Q Networks?