fhalab / embeddings_reproduction Goto Github PK

126.0 126.0 29.0 137.77 MB

License: Other

Jupyter Notebook 98.47% Python 1.53%

embeddings_reproduction's Issues

Did you compare your results with BioVec by Ehsaneddin Asgari?

Thank you so much for your great work.

I read a paper called "DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences" by Asgari, E., Poerner, N., McHardy, A., & Mofrad, M. (https://github.com/ehsanasgari/DeepPrime2Sec).
In the paper, the authors use five kinds of features to predict protein secondary structure from the primary sequence. These five features are:

1. One-hot vector representation (length: 21) --- onehot: vector representation indicating which amino acid exists at each specific position, where each index in the vector indicates the presence or absence of that amino acid.
2. ProtVec embedding (length: 50) --- protvec: representation trained using a Skip-gram neural network on protein amino acid sequences (ProtVec). The only difference is character-level training instead of n-gram based training.
3. Contextualized embedding (length: 300) --- elmo: we use the contextualized embedding of the amino acids trained in the course of language modeling, known as ELMo, as a new feature for the secondary structure task. Contextualized embedding is the concatenation of the hidden states of a deep bidirectional language model. The main difference between the ProtVec embedding and the ELMo embedding is that the ProtVec embedding for a given amino acid or amino acid k-mer is fixed, so the representation is the same in different sequences. The contextualized embedding, as its name suggests, is an embedding of a word that changes with its context. We train the ELMo embedding of amino acids on the UniRef50 dataset with a dimension size of 300.
4. Position Specific Scoring Matrix (PSSM) features (length: 21) --- pssm: PSSM is a set of amino acid substitution scores calculated on a protein multiple sequence alignment of homologous sequences for each given position in the protein sequence.
5. Biophysical features (length: 16) --- biophysical: for each amino acid we create a normalized vector of its biophysical properties, e.g., flexibility, instability, surface accessibility, kd-hydrophobicity, hydrophilicity, etc.

However, the paper does not show how to extract these features. I am not sure whether you compared your embedding to this work.
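
For the first feature in the list above, a minimal sketch of a length-21 one-hot encoding might look like the following. The alphabet order and the use of the 21st slot for any non-standard residue are assumptions made for illustration; DeepPrime2Sec may order or define the slots differently.

```python
# Hypothetical alphabet: the 20 standard amino acids plus one catch-all slot.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNKNOWN_INDEX = len(AMINO_ACIDS)  # index 20, so each vector has length 21
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return one length-21 vector per residue of the sequence."""
    rows = []
    for residue in sequence.upper():
        row = [0] * (len(AMINO_ACIDS) + 1)
        # Residues outside the 20-letter alphabet fall into the last slot.
        row[INDEX.get(residue, UNKNOWN_INDEX)] = 1
        rows.append(row)
    return rows


encoded = one_hot("MKX")  # three residues, the last one non-standard
```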

By the way, in my ML project I want to embed a protein as a vector and then use DL models for drug-protein interaction prediction. Do you have an example showing how to do this, similar to RDKit, e.g.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)?

Many thanks!
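
A rough sketch of what a "protein to fixed-length vector" helper could look like is given here. It assumes a model object exposing `infer_vector(list_of_tokens)`, which is the method gensim's Doc2Vec provides; the helper names `to_kmers` and `embed_protein`, the default `k=3`, and the choice to average over reading frames are all hypothetical and are not taken from this repository's code.

```python
def to_kmers(sequence, k=3, offset=0):
    """Split a sequence into non-overlapping k-mers, starting at `offset`."""
    return [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]


def embed_protein(sequence, model, k=3):
    """Average the vectors inferred for each of the k reading frames.

    `model` is anything with an `infer_vector(list_of_tokens)` method,
    such as a loaded gensim Doc2Vec model. Averaging over frames is an
    illustrative choice, not necessarily what this repository does.
    """
    vectors = [list(model.infer_vector(to_kmers(sequence, k, offset)))
               for offset in range(k)]
    # Element-wise mean across the k frame vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

With gensim installed, `model` could come from `Doc2Vec.load(...)` on one of the trained model files, and the returned list would then play the role that the Morgan fingerprint plays for a molecule.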

Examples in README

As a user, it would be nice to directly have some examples in the README to show how I could use this library. There are two simple scenarios that would personally benefit me:

Load the embeddings and generate a hierarchical clustering (show a plot, perhaps?)
Load the embeddings and train a simple model. One example would be a target prioritization approach - @ozlemmuslu would be able to help using the example from her master's thesis if we can see exactly how to make a dataframe out of the embeddings!

Separate code from data

I'm 7GB+ into trying to clone this repo and my computer is incredibly upset. I would suggest making a separate repository to house all of the data so the code can be downloaded and used independently.

What is the path to the (final) protein embeddings?

Hi,

Somewhat related to #3, I cannot find the actual pretrained protein embedding. In gensim I would like to use Word2Vec.load(path/to/embedding.model). Where can I find this file?

Thank you

Error in train_docvec_models.ipynb

I tried to recreate the original doc2vec models in train_docvec_models.ipynb but ran into the following error at "model.build_vocab(documents)" when using "merge=True" in the kmer_hypers:

TypeError: unhashable type: 'list'

Do you have any suggestions? Thanks!
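
One plausible cause, not verified against the notebook, is that with merge=True each document's word list ends up as a list of k-mer lists rather than a flat list of strings. A vocabulary builder has to hash every token, and a list cannot be hashed. The sketch below reproduces that failure with a plain `Counter` standing in for the vocabulary scan; `count_vocab`, `flatten`, and the sample document are hypothetical.

```python
from collections import Counter


def count_vocab(documents):
    """Count tokens the way a vocabulary builder would: by hashing each one."""
    counts = Counter()
    for words in documents:
        counts.update(words)
    return counts


def flatten(nested_words):
    """Turn [['ABC', 'DEF'], ['BCD']] into ['ABC', 'DEF', 'BCD']."""
    return [kmer for frame in nested_words for kmer in frame]


# Hypothetical merged document: one list of k-mers per reading frame.
nested_doc = [["ABC", "DEF"], ["BCD", "EFG"]]

try:
    count_vocab([nested_doc])
    error = None
except TypeError as exc:
    error = str(exc)  # the message mentions: unhashable type: 'list'

# Flattening the frames into one token list lets the count succeed.
counts = count_vocab([flatten(nested_doc)])
```

If this is what is happening, inspecting `type(documents[0].words[0])` before calling build_vocab would show a list where a string is expected.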

http://cheme.caltech.edu/~kkyang/ : 404

Hello,

the URL above is returning a 404. Can you provide an alternate URL?

Thanks

'Doc2Vec' object has no attribute 'running_training_loss'

I ran the following:
from embeddings_reproduction import embedding_tools
embeds = embedding_tools.get_embeddings_new(['ABCFFFFFFFFFFFF','EFGHQWERRTTUIIO'], seqs, k=5, overlap=False)
and got the following error:
'Doc2Vec' object has no attribute 'running_training_loss'

Low efficiency of large volume protein sequence prediction

Can this model be used to generate features for a large number (more than 10,000) of protein sequences? Model prediction is currently slow at that scale; is there a way to improve its efficiency?
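
As a general pattern for large inputs, sequences can be fed through the embedding step in fixed-size batches. The sketch below is library-agnostic: `embed_batch` is a hypothetical callable that takes a list of sequences and returns one vector per sequence. Batching does not make inference itself faster, but it bounds memory use and gives natural units of work to hand to separate processes.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def embed_in_batches(sequences, embed_batch, batch_size=1000):
    """Lazily embed sequences one batch at a time.

    `embed_batch` is a placeholder for whatever function maps a list of
    sequences to a list of vectors.
    """
    for chunk in batched(sequences, batch_size):
        yield from embed_batch(chunk)
```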

User assistance

Hi, I was really interested in your paper, but this repository isn't so user friendly. It would be wonderful to add a setup.py so it can be installed with pip and some documentation for users on how to access the embeddings.

I would be happy to send a PR for the setup.py; we could then discuss further on the PR.

UnpicklingError: invalid load key, 'v'.

Hello,
Many thanks for the repository.

When I run test_predictions, I get the following errors:

UnpicklingError Traceback (most recent call last)
in
1 with open('../inputs/X_aaindex_64_cosine.pkl', 'rb') as f:
----> 2 X_aa = pickle.load(f)

UnpicklingError: invalid load key, 'v'.
Also:
UnpicklingError Traceback (most recent call last)
in
13 # Sequence and structure
14 with open('../inputs/T50_seq_struct.pkl', 'rb') as f:
---> 15 X, _ = pickle.load(f)
16 evals, mu = evaluate(df_train, df_test, X, y_col, 'seq_struc', guesses=(1, 100))
17 res = pd.concat((res, evals), ignore_index=True)

UnpicklingError: invalid load key, 'v'.

My versions:
print(np.__version__)
1.18.5
print(pd.__version__)
1.1.0

Or are the pkl files corrupted?

thanks,
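
An "invalid load key, 'v'" error is a common symptom of trying to unpickle a Git LFS pointer file, which is a small text file beginning with "version https://git-lfs.github.com/spec/v1", instead of the real binary. Whether these .pkl files are stored through Git LFS in this repository is an assumption here; the hypothetical helper below simply checks a file for that header.

```python
# Header that every Git LFS pointer file starts with.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_PREFIX)) == LFS_PREFIX
```

If `is_lfs_pointer('../inputs/X_aaindex_64_cosine.pkl')` returns True, the file was never downloaded in full, and running `git lfs pull` inside the clone (with Git LFS installed) should fetch the actual data.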

Which model to use for computing new embeddings?

As some users have noted in other issues, it is unclear how to use the final model to generate embeddings for a new set of protein sequences. I have found the files located at http://cheme.caltech.edu/~kkyang/models/ and the script embedding_tools.py, in which I suppose get_embeddings_new() is the relevant function. But which doc2vec_file should I use to compute embeddings for my set of sequences? Which one is the "final" one?

As previously noted, if a minimal example of this was included in the main README file I am sure it would enable many more users to benefit from your work.

An error in visualize page

Hi,
When I run the whole script, I get this error message:
in plot_ChRs()
4 df = pd.read_csv('../inputs/localization.txt')
5 with open('../inputs/localization_seq.pkl', 'rb') as f:
----> 6 X_1, terms = pickle.load(f)
7 X_p = pd.read_csv('../inputs/localization_profet.tsv', delimiter='\t')
8 X_p.index = X_p['name']

UnpicklingError: invalid load key, 'v'
