gmihaila / ml_things

This is where I put things I find useful that speed up my work with Machine Learning. Ever looked in your old projects to reuse those cool functions you created before? Well, this repo is designed to be a Python library of functions I created in my previous projects that can be reused. I also share some notebook tutorials and Python code snippets.

Home Page: https://gmihaila.github.io

License: Apache License 2.0

Jupyter Notebook 99.64% Python 0.36% Makefile 0.01%
machine-learning google-colab notebooks snippets python-snippets nlp nlp-machine-learning transformer pytorch

ml_things's Introduction

Machine Learning Things


Machine Learning Things is a lightweight Python library that contains functions and code snippets that I use in my everyday research with Machine Learning, Deep Learning, and NLP.

I created this repo because I was tired of always looking up the same code from older projects and I wanted to gain some experience in building a Python library. Making it available to everyone gives me easy access to code I use frequently, and it can help others in their machine learning work. If you find any bugs or something doesn't make sense, please feel free to open an issue.

That is not all! This library also contains Python code snippets and notebooks that speed up my Machine Learning workflow.

Note:

  • Update: Feb 5, 2022 Thank you all again for your support and kindness! This package is available on PyPI now! pip install ml-things
  • Update: July 16, 2021 Thank you all for your support and kindness! As I promised I will move this repo to pip install modules.
  • If I reach 100 stars I will release the first official version and add it to the pip install modules!


Installation

This repo is tested with Python 3.6+.

It's always good practice to install ml_things in a virtual environment. If you need guidance on using Python's virtual environments, you can check out the user guide here.

You can install ml_things with pip from GitHub:

pip install git+https://github.com/gmihaila/ml_things

Or from PyPI:

pip install ml-things

Functions

All functions implemented in the ml_things module.

Array Functions

Array manipulation functions that can be useful when working with machine learning.

pad_array [source]

Pad a variable-length array to a fixed-size NumPy array. It can handle single arrays [1,2,3] or nested arrays [[1,2],[3]].

By default it pads with zeros up to the length of the longest row detected:

>>> from ml_things import pad_array
>>> pad_array(variable_length_array=[[1,2],[3],[4,5,6]])
array([[1., 2., 0.],
       [3., 0., 0.],
       [4., 5., 6.]])

It can also pad to a custom size and with custom values:

>>> pad_array(variable_length_array=[[1,2],[3],[4,5,6]], fixed_length=5, pad_value=99)
array([[ 1.,  2., 99., 99., 99.],
       [ 3., 99., 99., 99., 99.],
       [ 4.,  5.,  6., 99., 99.]])
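To make the padding behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: `pad_rows` is a hypothetical name, it returns nested lists rather than a NumPy array, and truncating rows longer than `fixed_length` is an assumption made for this example.

```python
from typing import List, Optional, Sequence


def pad_rows(rows: Sequence[Sequence[float]],
             fixed_length: Optional[int] = None,
             pad_value: float = 0.0) -> List[List[float]]:
    """Pad every row to the same length (hypothetical helper, not ml_things code)."""
    # Default target length: the longest row found in the input.
    if fixed_length is None:
        fixed_length = max((len(row) for row in rows), default=0)
    padded = []
    for row in rows:
        # Assumption: rows longer than the target length are truncated.
        kept = list(row[:fixed_length])
        # Fill the remaining positions with the padding value.
        kept.extend([pad_value] * (fixed_length - len(kept)))
        padded.append(kept)
    return padded


print(pad_rows([[1, 2], [3], [4, 5, 6]]))
print(pad_rows([[1, 2], [3], [4, 5, 6]], fixed_length=5, pad_value=99))
```

The two calls mirror the `pad_array` examples above, with plain lists in place of arrays.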

batch_array [source]

Split a list into batches/chunks. Note: This is also called chunking. I call it batching since I use it mostly in ML.

The last batch holds the remaining values, so it can be smaller than batch_size:

>>> from ml_things import batch_array
>>> batch_array(list_values=[1,2,3,4,5,6,7,8,8,9,8,6,5,4,6], batch_size=4)
[[1, 2, 3, 4], [5, 6, 7, 8], [8, 9, 8, 6], [5, 4, 6]]
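Chunking like this can be sketched in a few lines with list slicing. The helper name `chunk` is hypothetical and this is only an illustration of the technique, not the code behind `batch_array`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk(values: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split values into consecutive batches (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Step through the start index of every batch; the final slice is
    # simply shorter when the length is not a multiple of batch_size.
    return [list(values[start:start + batch_size])
            for start in range(0, len(values), batch_size)]


print(chunk([1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 8, 6, 5, 4, 6], batch_size=4))
```

Slicing past the end of a list is safe in Python, which is why the last batch needs no special case.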

Plot Functions

Plot functions that can be useful when working with machine learning.

plot_array [source]

Create a plot from a single array of values.

All arguments are optimized for quick plots. Change the magnify argument to vary the size of the plot:

>>> from ml_things import plot_array
>>> plot_array([1,3,5,3,7,5,8,10], path='plot_array.png', magnify=0.1, use_title='A Random Plot', start_step=0.3, step_size=0.1, points_values=True, use_ylabel='Thid', use_xlabel='This')

[Image: plot_array]

plot_dict [source]

Create a plot from a dictionary of arrays of values.

All arguments are optimized for quick plots. Change the magnify argument to vary the size of the plot:

>>> from ml_things import plot_dict
>>> plot_dict({'train_acc':[1,3,5,3,7,5,8,10],
                'valid_acc':[4,8,9]}, use_linestyles=['-', '--'], magnify=0.1, 
                start_step=0.3, step_size=0.1,path='plot_dict.png', points_values=[True, False], use_title='Title')

[Image: plot_dict]

plot_confusion_matrix [source]

This function prints and plots the confusion matrix. Normalization can be applied by setting normalize=True.

All arguments are optimized for quick plots. Change the magnify argument to vary the size of the plot:

>>> from ml_things import plot_confusion_matrix
>>> plot_confusion_matrix(y_true=[1,0,1,1,0,1], y_pred=[0,1,1,1,0,1], magnify=0.1, use_title='My Confusion Matrix', path='plot_confusion_matrix.png');
Confusion matrix, without normalization
array([[1, 1],
       [1, 3]])

[Image: plot_confusion_matrix]
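The counts printed above can be reproduced by hand. The sketch below is an illustration only: `confusion_counts` and `normalize_rows` are hypothetical names, rows are assumed to be true labels and columns predicted labels, and normalization is assumed to be per row (per true class), which may differ from what the library does with normalize=True.

```python
from typing import List, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> List[List[int]]:
    """Count (true label, predicted label) pairs; rows are true labels."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def normalize_rows(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each row by its total so every row sums to 1 (assumed convention)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


counts = confusion_counts([1, 0, 1, 1, 0, 1], [0, 1, 1, 1, 0, 1])
print(counts)
print(normalize_rows(counts))
```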

Text Functions

Text functions that can be useful when working with machine learning.

clean_text [source]

Clean text using various techniques:

>>> from ml_things import clean_text
>>> clean_text("ThIs is $$$%.  \t\t\n \\ so dirtyyy$$ text :'(.   omg!!!", full_clean=True)
'this is so dirtyyy text omg'
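For readers curious what such a cleanup might involve, here is a rough sketch with the standard `re` module. It is an assumption about the kind of steps involved (lowercasing, dropping symbols, collapsing whitespace), not the actual `clean_text` implementation, and `simple_clean` is a hypothetical name. Note that it also strips non-ASCII letters, which may not be what you want.

```python
import re


def simple_clean(text: str) -> str:
    """Lowercase, drop symbols and collapse whitespace (hypothetical helper)."""
    lowered = text.lower()
    # Replace everything except ASCII letters, digits and whitespace with a space.
    no_symbols = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of spaces, tabs and newlines into a single space.
    return re.sub(r"\s+", " ", no_symbols).strip()


print(simple_clean("ThIs is $$$%.  \t\t\n \\ so dirtyyy$$ text :'(.   omg!!!"))
```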

Web Related

Web functions that can be useful when working with machine learning.

download_from [source]

Download a file from a URL. It returns the path of the downloaded file:

>>> from ml_things import download_from
>>> download_from(url='https://raw.githubusercontent.com/gmihaila/ml_things/master/setup.py', path='.')
'./setup.py'
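A similar helper can be sketched with the standard library alone. The names `local_path_for` and `fetch` are hypothetical, and deriving the file name from the last segment of the URL path is an assumption that matches the example above; the library may handle this differently.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_path_for(url: str, path: str) -> str:
    """Build the destination path from the last segment of the URL path."""
    file_name = os.path.basename(urlparse(url).path)
    return os.path.join(path, file_name)


def fetch(url: str, path: str = ".") -> str:
    """Download url into the folder `path` and return the saved file path."""
    os.makedirs(path, exist_ok=True)
    destination = local_path_for(url, path)
    # urlretrieve streams the response body straight to the destination file.
    urlretrieve(url, destination)
    return destination


print(local_path_for(
    "https://raw.githubusercontent.com/gmihaila/ml_things/master/setup.py", "."))
```

Only `local_path_for` is called here, so running the block does not touch the network; call `fetch` with the same arguments to actually download the file.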

Snippets

This is a large variety of Python snippets without a particular theme. I list the most frequently used ones first while keeping a logical order. I like to keep them as simple and as efficient as possible.

Name | Description
Read File | One-liner to read any file.
Write File | One-liner to write a string to a file.
Debug | Start debugging after this line.
Pip Install GitHub | Install a library directly from GitHub using pip.
Parse Argument | Parse arguments given when running a .py file.
Doctest | How to run a simple unit test using function documentation. Useful when you need to run unit tests inside a notebook.
Fix Text | Since text data is always messy, I always use it. It is great at fixing bad Unicode.
Current Date | How to get the current date in Python. I use this when I need to name log files.
Current Time | Get the current time in Python.
Remove Punctuation | The fastest way to remove punctuation in Python 3.
PyTorch-Dataset | Code sample on how to create a PyTorch Dataset.
PyTorch-Device | How to set up the device in PyTorch to detect if a GPU is available.
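The snippets themselves are not reproduced on this page. As an illustration of two of the entries, here is how punctuation removal and a date-stamped log file name are commonly written with the standard library. The function names are hypothetical and the repository's own snippets may look different.

```python
import string
from datetime import datetime

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Strip ASCII punctuation using str.translate."""
    return text.translate(_PUNCTUATION_TABLE)


def log_file_name(prefix: str, when: datetime) -> str:
    """Build a log file name that embeds the date as YYYY-MM-DD."""
    return f"{prefix}_{when.strftime('%Y-%m-%d')}.log"


print(remove_punctuation("Hello, world! (test)"))
print(log_file_name("train", datetime.now()))
```

`str.translate` does the deletion in a single pass over the string, which is why it is usually preferred over a regex or a character loop for this job.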

Comments

These are a few snippets of how I like to comment my code. I saw a lot of different ways of how people comment their code. One thing is for sure: any comment is better than no comment.

I try to follow PEP 8, the Style Guide for Python Code, as much as I can.

When I comment a function or class:

# required imports for type annotations
from typing import Dict, List, Optional


def my_function(function_argument: str, another_argument: Optional[List[int]] = None,
                another_argument_: bool = True) -> Dict[str, int]:
    r"""Function/Class main comment.

    More details with enough spacing to make it easy to follow.

    Arguments:

        function_argument (:obj:`str`):
            A function argument description.

        another_argument (:obj:`List[int]`, `optional`):
            This argument is optional and it will have a None value attributed inside the function.

        another_argument_ (:obj:`bool`, `optional`, defaults to :obj:`True`):
            This argument is optional and it has a default value.
            The variable name has `_` to avoid conflict with similar name.

    Returns:

        :obj:`Dict[str, int]`: The function returns a dictionary with string keys and int values.
            A class will not have a return of course.

    """

    # make sure we keep our promise and return the variable type we described.
    return {'argument': len(function_argument)}

Notebooks Tutorials

This is where I keep notebooks from previous projects that I turned into small tutorials. I often use them as the basis for starting a new project.

All of the notebooks are in Google Colab. Never heard of Google Colab? 🙀 You have to check out the Overview of Colaboratory, Introduction to Colab and Python, and what I think is a great Medium article about it: Configure Google Colab Like a Pro.

If you check /ml_things/notebooks/, a lot of the notebooks there are not listed here because they are not in a 'polished' form yet. These are the notebooks that are good enough to share with everyone:

Name | Description
Better Batches with PyTorchText BucketIterator | How to use PyTorchText BucketIterator to sort text data for better batching.
Pretrain Transformers Models in PyTorch using Hugging Face Transformers | Pretrain 67 transformers models on your custom dataset.
Fine-tune Transformers in PyTorch using Hugging Face Transformers | Complete tutorial on how to fine-tune 73 transformer models for text classification, with no code changes necessary!
Bert Inner Workings in PyTorch using Hugging Face Transformers | Complete tutorial on how an input flows through Bert.
GPT2 For Text Classification using Hugging Face 🤗 Transformers | Complete tutorial on how to use GPT2 for text classification.

Final Note

Thank you for checking out my repo. I am a perfectionist, so I will make a lot of changes when it comes to small details.

If you see something wrong please let me know by opening an issue on my ml_things GitHub repository!

A lot of tutorials out there are mostly a one-time thing and are not being maintained. I plan on keeping my tutorials up to date as much as I can.


Contact 🎣

🦊 GitHub: gmihaila

🌐 Website: gmihaila.github.io

👔 LinkedIn: mihailageorge

📬 Email: [email protected]


ml_things's People

Contributors

gmihaila, viajiefan


ml_things's Issues

using a fine-tuned model for prediction on unlabeled data

Hi,

Thanks for creating these helpers and finetuning tutorials. I've found this repo to be very helpful and appreciate all the work you've put into it :).

One question I have is in regards to the following validation function from your gpt2-finetune-classification notebook. Here is the code snippet:

    # speeding up validation
    with torch.no_grad():
        # Forward pass, calculate logit predictions.
        # This will return the logits rather than the loss because we have
        # not provided labels.
        # token_type_ids is the same as the "segment ids", which 
        # differentiates sentence 1 and 2 in 2-sentence tasks.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(**batch)

        # The call to `model` always returns a tuple, so we need to pull the 
        # loss value out of the tuple along with the logits. We will use logits
        # later to calculate training accuracy.
        loss, logits = outputs[:2]
        
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()
        
        # get predictions to list
        predict_content = logits.argmax(axis=-1).flatten().tolist()

        # update list
        predictions_labels += predict_content

I'm trying to repurpose this to predict on unlabeled data. Although the comments in the code suggest that we have not provided the labels, I am getting errors when feeding in a dictionary with one extra other category (indicating that no class has been assigned):

-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2265     elif dim == 4:
   2266         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target 13 is out of bounds.

Is there a good way to change this to exclude the labels and label dictionary from being fed into the model, and also to exclude the loss from being calculated? Sorry if there is an obvious solution, it seems to be evading me right now.

Thank you for the help!
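One possible direction, offered as a sketch rather than a confirmed fix: the loss is only computed when a `labels` entry reaches the model, so dropping that key from the batch should leave just the logits. Hugging Face sequence-classification models generally return the logits as the first output when no labels are passed. Everything below is hypothetical illustration code: `drop_labels`, `predict_batch` and `stand_in_model` are made-up names, and plain lists stand in for tensors so the example runs without PyTorch.

```python
from typing import Any, Dict, List, Sequence


def drop_labels(batch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the batch without the 'labels' entry."""
    return {name: value for name, value in batch.items() if name != "labels"}


def argmax_rows(logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the largest logit in every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def predict_batch(model, batch: Dict[str, Any]) -> List[int]:
    """Run the model without labels and return predicted class ids."""
    outputs = model(**drop_labels(batch))
    # Assumption: with no labels passed, the logits are the first output.
    return argmax_rows(outputs[0])


def stand_in_model(input_ids, attention_mask=None):
    """Fake model for demonstration only: one row of two logits per input."""
    return ([[0.1, 0.9] if ids[-1] % 2 else [0.8, 0.2] for ids in input_ids],)


batch = {"input_ids": [[101, 7], [101, 8]],
         "attention_mask": [[1, 1], [1, 1]],
         "labels": [13, 0]}
print(predict_batch(stand_in_model, batch))
```

With a real model the call would stay inside `torch.no_grad()` as in the snippet above, and `logits.argmax(axis=-1)` would replace `argmax_rows`.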

Early Stopping

Can you please add examples for early stopping based on the loss metrics during training?
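For reference, a patience-based early stopping check can be sketched in plain Python. This is a generic illustration, not code from the notebooks: the class name, the `patience` and `min_delta` parameters, and the choice of validation loss as the monitored value are all assumptions.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss: float) -> bool:
        if loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = loss
            self.bad_epochs = 0
        else:
            # No improvement this epoch.
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, validation_loss in enumerate([0.90, 0.70, 0.72, 0.71, 0.69]):
    if stopper.should_stop(validation_loss):
        print(f"stopping after epoch {epoch}")
        break
```

In a real training loop, `should_stop` would be called once per epoch with the validation loss, typically alongside saving the best checkpoint whenever the loss improves.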

Tutorials: Pretraining Transformers

First of all, thanks for writing this notebook; it's been a huge help!

I have an unusual situation, in that I have a small, hand-defined vocabulary for a very specific purpose. For this reason, I've been using the BertWordPieceTokenizer for everything (whether MLM or CLM), and loading it with my fixed vocab file. Is there a way I can do this with your notebook?

Thanks in advance.

QA task

Can the fine-tune notebook also be used for a QA task?

Question on XLNet training

Hey, thanks for sharing a comprehensive pre training on huggingface.

I have a few questions on training procedure inside your notebook
https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb

When I tried to train on XLNet, it throws an error.

ValueError: This collator requires that sequence lengths be even to create a leakage-free perm_mask. Please see relevant comments in source code for details.

What do you do to prevent this?

I tried to handle it in the tokenizer, but I don't know if that is OK.
Something like this

# this input_ids 5 is <pad token>, since xlnet tokenizer ends with <sep> and <cls>, I put that before <sep> (-2)
# attention_mask and token_type_ids just preserve the previous state
def tokenize_function(examples):
    token_res = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    for i, item in enumerate(token_res["input_ids"]):
        if len(item) % 2 != 0:
            token_res["input_ids"][i].insert(-2,5)
            token_res["attention_mask"][i].insert(-2,0)
            token_res["token_type_ids"][i].insert(-2,1)

    return token_res

doubts regarding Bert Innerworking

hi,

I read your article on BERT's inner workings. It is very informative. I have a question about it: how can I show the final sentiment of the two statements?

'I love cats!',
"He hates pineapple pizza."
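As a general illustration of the last step: a classifier produces one logit per class for each sentence, and the predicted sentiment is the class with the highest softmax probability. The sketch below uses made-up logits and a hypothetical label order ("negative", "positive"); it only yields meaningful sentiments when the classification head has been fine-tuned on a sentiment task.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn logits into probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(logits: Sequence[float],
                  label_names: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label_names[best], probabilities[best]


# Made-up logits, one row per sentence, for illustration only.
example_logits = {"I love cats!": [-1.2, 2.3],
                  "He hates pineapple pizza.": [1.9, -0.4]}
for sentence, logits in example_logits.items():
    print(sentence, predict_label(logits, ["negative", "positive"]))
```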

pre-train

Hi @gmihaila thanks for this splendid library.
Just have a quick question regarding pre-training from scratch, is it possible using a single system. I don't have much data to pre-train.
Any suggestions.
thanks

RuntimeError: CUDA out of memory.

The fine-tuning worked perfectly on BERT, in other words, all default settings in the finetuning notebook, however, while running the longformer, I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 514.00 MiB (GPU 0; 15.75 GiB total capacity; 14.14 GiB already allocated; 106.81 MiB free; 14.32 GiB reserved in total by PyTorch)
I am using 'allenai/longformer-base-4096 '

I am using Google Colab pro with GPU and high-RAM memory.
[image]

Longformer issue

Hi @gmihaila
The fine-tuning notebook works very well, but when I use a sample size of larger than 100 records, the longformer notebook shows strange messages.
[image]

This has nothing to do with the ML-Things library's fix text(): even if I don't use it, I still get an error. When I reduce the sample size, it works fine.
Any suggestions, please?

pretrain_transformer - checkpoints save

machine_learning_things/tutorial_notebooks/pretrain_transformer.ipynb

The notebook saves too many checkpoints on large data. Use a larger save_steps or set save_steps=-1 to avoid saving any checkpoints.

# process training arguments
training_args = TrainingArguments(output_dir='bert_podcast', do_train=True, 
                                  do_eval=True,
                                  save_steps=-1)

a few questions

@gmihaila
Hi, this is such an amazing notebook (pretrain_transformers_pytorch.ipynb).
This is not an issue, just a few quick questions for my own knowledge:

(i) Can I use more evaluation metrics here?
(ii) When using a checkpoint, should it be the last one, or does the notebook save the best one?
(iii) If my dataset has multiple features, would I need to pass all of them as one piece of input during fine-tuning? If I intend to use side information, which part should I change?
(iv) If I need to include a timestamp, any thoughts? Or maybe during training we arrange all data temporally?
(v) How can I use the notebook for multi-class labels?

thanks in advance and for the wonderful work that you share with the community

AttributeError: 'BucketIterator' object has no attribute

I'm doing seq2seq machine translation on my own dataset. I have preprocessed my dataset using this code.

def tokenize_word(text):
  return nltk.word_tokenize(text)

id = Field(sequential=True, tokenize = tokenize_word, lower=True, init_token="<sos>", eos_token="<eos>")
ti = Field(sequential=True, tokenize = tokenize_word, lower=True, init_token="<sos>", eos_token="<eos>")

fields = {'id': ('i', id), 'ti': ('t', ti)}

train_data = TabularDataset.splits(
    path='/content/drive/MyDrive/Colab Notebooks/Tidore/',
    train = 'id_ti.tsv',
    format='tsv',
    fields=fields
)[0]

id.build_vocab(train_data)
ti.build_vocab(train_data)

print(f"Unique tokens in source (id) vocabulary: {len(id.vocab)}")
print(f"Unique tokens in target (ti) vocabulary: {len(ti.vocab)}")

train_iterator = BucketIterator.splits(
    train_data,
    batch_size = batch_size,
    sort_within_batch = True,
    sort_key = lambda x: len(x.id),
    device = device
)

The output of code above is below:

Unique tokens in source (id) vocabulary: 1425
Unique tokens in target (ti) vocabulary: 1297

The problem comes when I try to split train_data using BucketIterator.splits(). When I print the value of train_iterator, it says that it has no attribute 'i', even though I declared the fields. Here is my code to print it:

for data in train_iterator:
  print(data.i)

The output of code above is below:

AttributeError                            Traceback (most recent call last)

<ipython-input-9-322cc3aa78d6> in <module>()
      1 for data in train_iterator:
----> 2   print(data.i)

AttributeError: 'BucketIterator' object has no attribute 'i'

When I try to just print data, the result confuses me even more:
[image]

I am very confused, because I don't know what key I should use for the train iterator. Thank you for your help.

pad token as eos token for gpt2 classification

Hello,

First of all thanks a lot for the tutorials!
I have a question regarding the pad token. In the GPT-2 for classification example, you set padding to be the eos token. Why was that the case? Shouldn't every sequence have one eos token at the end of the sequence to be passed for classification just like [CLS] in BERT?
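As background for this question: GPT-2 ships without a dedicated padding token, so reusing the end-of-sequence token for padding is a common workaround. A classification head on a decoder-only model then has to pick one position per sequence to read from. The sketch below illustrates one way to locate it, namely the last position that the attention mask marks as a real token. The function name is hypothetical, and this is not claimed to be the exact logic of the notebook or of the library.

```python
from typing import Sequence


def last_real_position(attention_mask: Sequence[int]) -> int:
    """Index of the last position marked 1 (real token) in the mask."""
    # Scan from the end so the function works for left or right padding.
    for position in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[position] == 1:
            return position
    raise ValueError("attention mask contains no real tokens")


# Right-padded sequence: three real tokens followed by two pads.
print(last_real_position([1, 1, 1, 0, 0]))
# Left-padded sequence: the last real token is simply the final position.
print(last_real_position([0, 0, 1, 1, 1]))
```

Relying on the attention mask rather than on token ids avoids any ambiguity when the padding token and the end-of-sequence token share the same id.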

Question on InnerWorkings Of Bert

First of all thanks for the great Breakdown of the Huggingface-Bert-Model.
From reading the Annotated Transformer and the Bert-paper it seems they use Vanilla Transformer encoders.
But in the Huggingface implementation there is an extra layer for the selfattention(BertSelfOutput). Is there a specific reason for the additional linear layer?
many thanks in advance

predictions - metrics

Hi @gmihaila
any suggestions on saving the predictions of the model.
Also you are using sklearn.metrics, so I think we can use more metrics from the sklearn.metrics

thanks
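A minimal sketch of both ideas, using only the standard library so it runs anywhere: predictions are written to a CSV file next to their inputs, and precision, recall and F1 are computed by hand for one class. The function names and the CSV column names are hypothetical. In practice `sklearn.metrics` already offers equivalents such as `precision_recall_fscore_support` and `classification_report`.

```python
import csv
from typing import Dict, Hashable, Sequence


def save_predictions(path: str, texts: Sequence[str],
                     predictions: Sequence[Hashable]) -> None:
    """Write one row per example with its text and predicted label."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "prediction"])
        writer.writerows(zip(texts, predictions))


def precision_recall_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                        positive: Hashable) -> Dict[str, float]:
    """Precision, recall and F1 for a single class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for truth, guess in pairs if truth == positive and guess == positive)
    false_pos = sum(1 for truth, guess in pairs if truth != positive and guess == positive)
    false_neg = sum(1 for truth, guess in pairs if truth == positive and guess != positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(precision_recall_f1([1, 0, 1, 1, 0, 1], [0, 1, 1, 1, 0, 1], positive=1))
```

Calling `save_predictions("predictions.csv", texts, predictions_labels)` after the validation loop would persist the results; the file name here is just an example.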

Making use of saved model

Hello,

I am using pretrain_transformers_pytorch.ipynb and have successfully followed all the steps. My question is how to reload the saved pretrain_bert model to make use of it?

thank you

Installation error.

Hello, I tried to re-build your GPT2 classification page.

But I found an issue.

Maybe some of your requirement is outdated?

ERROR: Could not find a version that satisfies the requirement matplotlib>=3.4.0 (from ml-things)
ERROR: No matching distribution found for matplotlib>=3.4.0

I got this issue.

Thank you.

how to custom config file for Bert?

in your notebook :
https://colab.research.google.com/github/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb#scrollTo=UCLtm5BiXona
The get_model_config() method loads the config from model_path. How can I change it to use a custom config? For example:

config = BertConfig(
    vocab_size=10000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    pad_token_id=0,
    position_embedding_type="absolute",
    truncation=True,
)

I want to change num_hidden_layers to train a lighter BERT model.

pad_array padding value

pad_array can only use 0 for padding. Add functionality to select what value to use for padding
