alchemab / antiberta Goto Github PK

Public repository describing training and testing of AntiBERTa.

License: Apache License 2.0

Jupyter Notebook 100.00%

antiberta's Introduction

AntiBERTa

This is a repository with Jupyter notebooks describing how we pre-trained the model, and how we apply the model for fine-tuning.

FAQs

What dataset did you use? This is described in our paper, but briefly, we used a section of the Observed Antibody Space (OAS) database (Kovaltsuk et al., 2018) for pre-training, and a snapshot of SAbDab (Dunbar et al., 2014) as of 26 August 2021 for paratope prediction. We've included small snippets of the OAS pre-training data and the paratope prediction datasets under assets.
Why HuggingFace? We felt that the maturity of the library and its straightforward API were key advantages. It also fits really well with cloud compute platforms like AWS.

antiberta's Issues

OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like antiberta-master is not the path to a directory containing a {configuration_file} file.

I get an OSError when running paratope-prediction.ipynb:
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like antiberta-master is not the path to a directory containing a {configuration_file} file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Can you help me work out what I am doing wrong? I do have an internet connection. Perhaps MODEL_DIR must be adjusted with the value "antiberta-master"?

Many thanks in advance!
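For readers hitting the same error, a minimal sketch of a pre-flight check may help narrow it down. It rests on two assumptions that are not confirmed in this thread: that MODEL_DIR is meant to point at a saved Transformers model directory (which normally contains a `config.json`), and that `antiberta-master` is simply the value used in the report. The helper name `diagnose_model_dir` is hypothetical.

```python
from pathlib import Path

# Assumption: a saved Transformers model directory contains `config.json`.
CONFIG_NAME = "config.json"


def diagnose_model_dir(model_dir: str) -> str:
    """Return a short status string for a candidate local model directory."""
    path = Path(model_dir)
    if not path.is_dir():
        # A string that is not an existing directory is looked up on the
        # Hugging Face Hub instead, which can end in a connection or 401 error.
        return "not-a-directory"
    if not (path / CONFIG_NAME).is_file():
        # The directory exists but does not look like a saved model.
        return "missing-config"
    return "ok"


if __name__ == "__main__":
    MODEL_DIR = "antiberta-master"  # placeholder value taken from the report
    print(f"{Path(MODEL_DIR).resolve()}: {diagnose_model_dir(MODEL_DIR)}")
```

If the result is `not-a-directory`, the relative path is probably being resolved from a different working directory than expected. If it is `missing-config`, the folder may be the unpacked repository rather than a directory produced by `save_pretrained`.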

Tokenizer not found

Hi, when I try to load the tokenizer I get the following error:
HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/antibody-tokenizer/resolve/main/vocab.json

I was wondering if the tokenizer has been removed?

Pre-training data

Hey team, great work on this project. I noticed that the pre-training data snippets were removed from the repository back in April. Are they available as a HF dataset? Or available someplace else (besides the git history)?

How to use models in Hugging Face?

Hello, I have used the model weights and tokenizer that you uploaded to Hugging Face, but I ran into some difficulties when actually using them. The output of the tokenizer looks odd. Is it necessary to insert a space between every character in the input sequence? And is the use of H and L (at the start of the sequence) necessary? Thanks!

```python
from transformers import RoFormerForMaskedLM, RoFormerTokenizer

tokenizer = RoFormerTokenizer.from_pretrained("trans_model/antiberta2")
model = RoFormerForMaskedLM.from_pretrained("trans_model/antiberta2")

# Unspaced input: each sequence comes back as only four token ids.
seqs = ["ḢACCKLMNDDDKKLL", "ḶALYYNNMACD"]
outs = tokenizer(
    seqs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
)
outs
```

```
{'input_ids': tensor([[ 1, 26,  4,  2],
        [ 1, 27,  4,  2]]),
 'token_type_ids': tensor([[0, 0, 0, 0],
        [0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}
```

With a space between every character (same tokenizer call as above):

```python
seqs = ["Ḣ A C C K L M N D D D K K L L", "Ḷ A L Y Y N N M A C D"]
outs
```

```
{'input_ids': tensor([[ 1, 26,  5,  6,  6, 13, 14, 15, 16,  7,  7,  7, 13, 13, 14, 14,  2],
        [ 1, 27,  5, 14, 24, 24, 16, 16, 15,  5,  6,  7,  2,  0,  0,  0,  0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
```
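The two outputs in this issue suggest that only the space-separated form yields one token id per residue. Assuming that is the intended input format (the thread does not confirm it), a small helper along these lines could prepare sequences before tokenization. The name `space_residues` is hypothetical, and the Unicode normalization step is a precaution so that a chain marker such as Ḣ or Ḷ stays a single character even if it arrives as a letter plus a combining dot.

```python
import unicodedata


def space_residues(seq: str) -> str:
    """Insert a single space between every character of a sequence."""
    # NFC composes "H" + combining dot into the single code point "Ḣ",
    # so the marker is not split into two separate tokens.
    composed = unicodedata.normalize("NFC", seq)
    # Drop any existing whitespace so the helper is safe to apply twice.
    return " ".join(ch for ch in composed if not ch.isspace())


if __name__ == "__main__":
    print(space_residues("ḢACCKLMNDDDKKLL"))  # Ḣ A C C K L M N D D D K K L L
```

The result can then be passed to the tokenizer call shown in the issue, for example `tokenizer([space_residues(s) for s in seqs], ...)`.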

Tokenizer and model not accessible

Greetings,

I am trying to fine-tune your model on my own dataset, but when I instantiate the tokenizer and model through Hugging Face I get this error:
HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/antibody-tokenizer/resolve/main/vocab.json
I think this occurs because it is a private model that is not available on Hugging Face?

Can you give me access to the tokenizer and model?

Thank you,
Palistha

predicting paratopes

where is the pretrained model?

Where is the datasets?
