
mat2vec's Introduction

Supplementary Materials for "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature 571, 95–98 (2019).

Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K. A., Ceder, G. and Jain, A.

doi: 10.1038/s41586-019-1335-8

A view-only (no download) link to the paper: https://rdcu.be/bItqk

For those interested in the ab initio thermoelectric data, see the Thermoelectric Datasets section below.

Set up

  1. Make sure you have Python 3.6 and the pip module installed. We recommend using conda environments.
  2. Navigate to the root folder of this repository (the same folder that contains this README file) and run pip install --ignore-installed -r requirements.txt. Note: If you are using a conda env and any packages fail to compile during this step, you may need to first install those packages separately with conda install package_name.
  3. Wait for all the requirements to be downloaded and installed.
  4. Run python setup.py install to install this module. This will also download the Word2vec model files. If the download fails, manually download the model, word embeddings and output embeddings and put them in mat2vec/training/models.
  5. Finalize your chemdataextractor installation by executing cde data download (You may need to restart your virtual environment for the cde command line interface to be found).
  6. You are ready to go!

Processing

Example python usage:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
text_processor.process("LiCoO2 is a battery cathode material.")

(['CoLiO2', 'is', 'a', 'battery', 'cathode', 'material', '.'], [('LiCoO2', 'CoLiO2')])

For the various methods and options see the docstrings in the code.

Pretrained Embeddings

Load and query for similar words and phrases:

from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
w2v_model.wv.most_similar("thermoelectric")

[('thermoelectrics', 0.8435688018798828), ('thermoelectric_properties', 0.8339033126831055), ('thermoelectric_power_generation', 0.7931368350982666), ('thermoelectric_figure_of_merit', 0.791649341583252), ('seebeck_coefficient', 0.7753845453262329), ('thermoelectric_generators', 0.7641351819038391), ('figure_of_merit_ZT', 0.7587921023368835), ('thermoelectricity', 0.7515754699707031), ('Bi2Te3', 0.7480161190032959), ('thermoelectric_modules', 0.7434879541397095)]

Phrases can be queried with underscores:

w2v_model.wv.most_similar("band_gap", topn=5)

[('bandgap', 0.934801459312439), ('band_-_gap', 0.933477520942688), ('band_gaps', 0.8606899380683899), ('direct_band_gap', 0.8511275053024292), ('bandgaps', 0.818678617477417)]

Analogies:

# helium is to He as ___ is to Fe? 
w2v_model.wv.most_similar(
    positive=["helium", "Fe"], 
    negative=["He"], topn=1)

[('iron', 0.7700884938240051)]

Material formulae need to be normalized before analogies:

# "GaAs" is not normalized
w2v_model.wv.most_similar(
    positive=["cubic", "CdSe"], 
    negative=["GaAs"], topn=1)

KeyError: "word 'GaAs' not in vocabulary"

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
w2v_model.wv.most_similar(
    positive=["cubic", text_processor.normalized_formula("CdSe")], 
    negative=[text_processor.normalized_formula("GaAs")], topn=1)

[('hexagonal', 0.6162797212600708)]

Keep in mind that words should also be processed before queries. Most of the time this is as simple as lowercasing; however, it is safest to use the process() method of mat2vec.processing.MaterialsTextProcessor, as in the sketch below.
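For example, a minimal sketch (reusing the w2v_model loaded above; the exact tokens returned are an assumption) of normalizing a free-text query before a lookup:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
tokens, _ = text_processor.process("Band gap")  # expected to lowercase and tokenize
w2v_model.wv.most_similar("_".join(tokens), topn=5)  # equivalent to querying "band_gap"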

Training

To run an example training, navigate to mat2vec/training/ and run

python phrase2vec.py --corpus=data/corpus_example --model_name=model_example

from the terminal. It should run an example training and save the files in the models and tmp folders. This should take only a few seconds, since the example corpus has just 5 abstracts.
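Once training finishes, the new model can be loaded the same way as the pretrained one (a minimal sketch, assuming the default output location and the gensim 3.x API used elsewhere in this README):

from gensim.models import Word2Vec

model_example = Word2Vec.load("mat2vec/training/models/model_example")
print(len(model_example.wv.vocab))  # vocabulary size of the toy model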

For more options, run

python phrase2vec.py --help

Thermoelectric Datasets

You can find the condensed thermoelectric CRTA (constant relaxation-time approximation) data in the thermoelectric_data directory.
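A minimal sketch for inspecting the data with pandas; the filename below is hypothetical, so substitute whatever files are actually in thermoelectric_data:

import pandas as pd

df = pd.read_csv("thermoelectric_data/crta_data.csv")  # hypothetical filename
print(df.head())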

Related Work

  • Weston, L., Tshitoyan, V., Dagdelen, J., Kononova, O., Persson, K. A., Ceder, G. and Jain, A. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. ChemRxiv preprint (2019).

Issues?

You can either report an issue on GitHub or contact one of us directly. Try [email protected], [email protected], [email protected] or [email protected].

mat2vec's People

Contributors

ardunn, computron, jdagdelen, johannesebke, vtshitoyan


mat2vec's Issues

Training my own word embeddings

Hi,
I am trying to train my own word embeddings on my own corpus using your model. Could you please walk me through, step by step, how to do that?
I ran the model according to your instructions in the README file and it works.

Trained my model using phrase2vec.py but now I want to test using that model. How?

I trained on my own corpus using this command from the README, replacing corpus_example with my own; it created a new model called model_example in the same folder as the pretrained embeddings.
python phrase2vec.py --corpus=data/corpus_example --model_name=model_example


I want to then use this new model to test. How do I do that? I thought I would run

from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/model_example")

but I got an error saying
No such file or directory: '/Users/monicapuerto/Desktop/Github/mat2vec/mat2vec/processing/models/phraser.pkl'

Yet that file does indeed exist. Why does the new model depend on this file? How do I use the model I just trained?

Drug repurposing for COVID

Hello @jdagdelen. You mentioned in an earlier thread that mat2vec could be used for drug repurposing. Experiments are currently underway to perform simulations on 8,000 FDA-approved drugs, and 77 of those compounds have been selected as likely candidates for COVID treatment. https://onezero.medium.com/the-worlds-most-powerful-supercomputer-has-entered-the-fight-against-coronavirus-3e98c4d67459

If we use Mat2vec for the same drug repurposing task, the intersection of our work and theirs might yield a shortened list of candidate compounds. What do you think about getting started on this? I can use the program but I don’t have a deep enough understanding of it to figure out how to discover new knowledge from it. Would you like to collaborate?

Prediction of (new) thermoelectric materials

First of all thank you so much for sharing all this! I found the paper and the associated results very exciting!

I tried to reproduce Fig. 2a using your pre-trained model. I first printed the tokens most similar to "thermoelectric" (highest cosine similarity). Then I used one of your processing functions (in the process script) to keep only "simple chemical formulae". And finally, as you mention in the paper, I removed the formulae appearing fewer than 3 times in the corpus. Roughly, the procedure looks like the sketch below.
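A minimal sketch of the procedure described above; is_simple_formula is an assumed name for the processing helper referred to, so check the docstrings in mat2vec/processing for the actual one:

from gensim.models import Word2Vec
from mat2vec.processing import MaterialsTextProcessor

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
text_processor = MaterialsTextProcessor()
candidates = w2v_model.wv.most_similar("thermoelectric", topn=1000)
# keep only tokens that look like simple chemical formulae
formulae = [(word, score) for word, score in candidates if text_processor.is_simple_formula(word)]  # assumed helper
print(formulae[:10])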

However, I ended up with a lot of noise in my list compared to yours. I got the same first 2 predictions, but then I also had formulae like Bi20Se3Te27 or SSe4Sn5 in the top 10. To give you an idea of the amount of noise: PbTe, which is 3rd in your list, is 92nd in mine.

So what am I missing?

Thank you in advance!
Anita

Some questions about the use of chemdataextractor tool

Hello, I found your article very interesting. I have tried some similar work myself, and I have some questions I would like to ask you about. When using the chemdataextractor tool, extracting chemical formulae from about 200,000 abstracts is very slow. May I ask how you carried out the chemical formula labeling in the abstracts? Or is there a way to speed up the process? Thank you very much!
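One generic way to speed this up (a sketch, not the authors' confirmed pipeline) is to process abstracts in parallel, since each abstract is independent:

from multiprocessing import Pool
from mat2vec.processing import MaterialsTextProcessor

text_processor = MaterialsTextProcessor()

def process_abstract(abstract):
    # tokenize and normalize a single abstract
    tokens, materials = text_processor.process(abstract)
    return tokens

if __name__ == "__main__":
    abstracts = ["LiCoO2 is a battery cathode material."] * 8  # substitute your corpus
    with Pool(processes=4) as pool:
        token_lists = pool.map(process_abstract, abstracts)
    print(token_lists[0])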

TypeError: __init__() got an unexpected keyword argument 'common_terms'

Hi,
I tried to train a model and I got this error:

Traceback (most recent call last):
  File "C:\Users\T\desktop\backup\mat2vec-master\mat2vec\training\phrase2vec.py", line 164, in <module>
    sentences, phraser = wordgrams(processed_sentences,
  File "C:\Users\T\desktop\backup\mat2vec-master\mat2vec\training\phrase2vec.py", line 43, in wordgrams
    phrases = Phrases(
TypeError: __init__() got an unexpected keyword argument 'common_terms'
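This error is characteristic of gensim 4.x, which renamed the common_terms argument of Phrases to connector_words. A minimal sketch of the gensim 4 call (a workaround suggestion, not a fix confirmed in this thread); alternatively, pinning gensim below 4.0 keeps common_terms working:

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

processed_sentences = [["the", "band", "gap", "of", "GaAs"], ["figure", "of", "merit"]]  # toy corpus
phrases = Phrases(processed_sentences, min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS)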

About the final word embeddings.

I just trained a model on my own corpus. It has space group numbers, and I replaced them with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid overlap with element names. But when I try to get the final embeddings from the model, it says some space group numbers (Xx105, Xx139, etc.) are not in the vocabulary, independent of their frequency! Why is this happening? I've tried to look through the code and couldn't figure it out.

Training data for original model

I was wondering whether the training data used for the original model is available anywhere, as the Matscholar API is currently not available.

Script to fetch cleaned abstracts

I noticed you've nicely provided the DOIs, but a simple pull fetches the article as raw HTML. Might you have a recommendation on grabbing the cleaned abstracts the way you did it? There are also the several dataset splits that you mentioned being quite influential on the final result.
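For reference, a minimal sketch (an assumption about tooling, not the authors' pipeline) that pulls an abstract for a DOI from the Crossref REST API; many records lack the abstract field, and the returned text still needs cleaning:

import requests

def fetch_abstract(doi):
    # Crossref returns metadata as JSON; "abstract" is present only for some records
    response = requests.get(f"https://api.crossref.org/works/{doi}")
    response.raise_for_status()
    return response.json()["message"].get("abstract")

print(fetch_abstract("10.1038/s41586-019-1335-8"))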

Other applications

How can this software be applied to other research areas? Space travel/propulsion, physics, history, etc? Thank you!

Problems training the model

Dear community,

I'm having a problem running the line code to train the model on the corpus example. The following error messages are printed.

:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
2022-06-10 00:06:51,569 : INFO : Basic min_count trim rule for formula.
2022-06-10 00:06:51,569 : INFO : Not including extra phrases, option not specified.
Traceback (most recent call last):
  File "/home/lucas_bandeira/Documents/mat2vec/mat2vec/training/phrase2vec.py", line 165, in <module>
    sentences, phraser = wordgrams(processed_sentences,
  File "/home/lucas_bandeira/Documents/mat2vec/mat2vec/training/phrase2vec.py", line 44, in wordgrams
    phrases = Phrases(
TypeError: __init__() got an unexpected keyword argument 'common_terms'

Could somebody help me to solve this problem?

Sincerely yours,
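This is the same common_terms error as in the TypeError issue above; the gensim 4 connector_words sketch there applies here as well.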

Setup requirements error

Had an issue (from within a fresh conda environment) running:

pip install -r requirements.txt

After successfully compiling packages, installation ultimately failed with:

...
Installing collected packages: monty, regex, urllib3, requests, unidecode, ruamel.yaml, pydispatcher, tabulate, spglib, palettable, pymatgen, pycryptodome, pdfminer.six, python-crfsuite, cssselect, appdirs, DAWG, chemdataextractor, jmespath, botocore, s3transfer, boto3, smart-open, gensim, tqdm
  Found existing installation: urllib3 1.12
ERROR: Cannot remove entries from nonexistent file /xxx/.conda/envs/tshitoyan/lib/python3.6/site-packages/easy-install.pth

The solution was to re-run pip with the --ignore-installed option:

pip install --ignore-installed -r requirements.txt

Beyond that, everything installed and ran as expected.
Hope this saves some headaches for others.

Great work!

Doug

Request for a step by step document on how to run the code

Would you be kind enough to first share a step-by-step document on how to run the code from https://github.com/materialsintelligence/mat2vec using a Jupyter Notebook in an Anaconda3 environment, on a laptop with CPU only (i.e. without GPU)? I am running Python 3.7.3 on Jupyter Notebook 6.0.2 in Anaconda3, with TensorFlow version 1.15.0 and Keras version 2.2.4.
(i) I have installed all packages mentioned in requirements.txt, including ChemDataExtractor, but am running into issues with installing "molsets". Any guidance on that?
(ii) I'd like to know which .py file(s) exactly to run, vs. a sequence of .py files to run, and any other tips. For example, there are these .py files: setup.py, process.py, test-process.py, phrase2vec.py, etc. Assuming I want to simply run the model and get the output using a Jupyter Notebook in an Anaconda3 environment, what exactly do I have to run, and in what order?
Once I am able to run it on my laptop, I will attempt to run it in Colab. Basically, assuming I use a Jupyter Notebook in an Anaconda3 environment, what exactly should the steps be? Thanks.

Formatting Abstracts

Is there any special text formatting that needs to be done to abstracts before training? I noticed the corpus example has % and <nUm> in various places. Just wondering if formatting matters at all, or if you can dump the plain text from abstracts into a corpus file.
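The processing module appears to produce these tokens itself; a minimal sketch (the exact output is an assumption, not verified here) using the process() method from the Processing section:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
tokens, _ = text_processor.process("The Seebeck coefficient was 120 at 300 K.")
print(tokens)  # numbers are expected to be replaced with the <nUm> placeholder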

Question about the outputs

I have successfully installed mat2vec in a conda environment (python=3.6) and tried to reproduce your work. However, when I followed the directions under Processing and Pretrained Embeddings, no output appeared (and no error message).

For example, I ran test.txt from the root folder of this repository; after several seconds of running, there were no outputs.

I am a newcomer to machine learning, hoping to get your guidance. Thank you!
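A likely explanation (an assumption, not confirmed in this thread): unlike the interactive interpreter, a Python script does not echo expression values, so the README calls need an explicit print() when run from a file:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
print(text_processor.process("LiCoO2 is a battery cathode material."))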

Model missing from the training folder

I was looking for the word2vec model specifically. According to the code on the first page, the model is supposed to be at "mat2vec/training/models/pretrained_embeddings", but I'm unable to find it.

No module named 'helpers' error when loading newly trained embeddings

Dear all,
I've trained my own embeddings and I'm now trying to open them with your mat2vec tool using

 w2v_model = Word2Vec.load(....)

however, I get a strange error about a module named helpers. The same problem does not happen with the pretrained models you are providing:

>>> w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
>>> w2v_model = Word2Vec.load("mat2vec/training/models/test_model")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/word2vec.py", line 975, in load
    return super(Word2Vec, cls).load(*args, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 629, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 278, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/utils.py", line 425, in load
    obj = unpickle(fname)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/utils.py", line 1332, in unpickle
    return _pickle.load(f, encoding='latin1')
  File "/Users/lfoppiano/development/github/mat2vec/mat2vec/training/__init__.py", line 1, in <module>
    from helpers import utils
ModuleNotFoundError: No module named 'helpers'
>>> 

Any suggestions?

Another question: after the training, I have only the accuracies, loss and phraser files in my model output directory. I copied the files containing the vectors, the trainables and the actual model from the tmp directory (I took the ones from epoch 29); was this the correct way?

.rw-r--r-- lfoppiano staff 286.4 MB Mon Aug  5 11:33:58 2019   test_model
.rw-r--r-- lfoppiano staff   3.2 GB Mon Aug  5 11:33:24 2019   test_model.trainables.syn1neg.npy
.rw-r--r-- lfoppiano staff   3.2 GB Mon Aug  5 11:39:55 2019   test_model.wv.vectors.npy
.rw-r--r-- lfoppiano staff   7.5 KB Mon Aug  5 11:26:21 2019   test_model_accuracies.pkl
.rw-r--r-- lfoppiano staff   389 B  Mon Aug  5 11:27:07 2019   test_model_loss.pkl
.rw-r--r-- lfoppiano staff    53 MB Mon Aug  5 11:40:00 2019   test_model_phraser.pkl

Question about target and context words

I have a question about your research approach communicated in Nature. There you use the phrases "target word" and "context word". Normally, in the skip-gram model, the embedding for the "target word" (input layer) is different from the embedding for the "context word" (output layer). In gensim, if you use model.wv.most_similar, you are effectively searching for similar words using embeddings from the input layer. You can also access the "context word" embeddings via model.syn1neg. Were you using both embeddings for analyzing e.g. the relation between a chemical compound and "thermoelectric"?
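For reference, a minimal sketch of comparing a compound's input vector against a word's output vector; the attribute location varies across gensim versions (model.syn1neg in older releases, model.trainables.syn1neg in 3.x), and the separately downloadable output embeddings mentioned under Set up appear to correspond to the same data:

import numpy as np
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
idx = w2v_model.wv.vocab["thermoelectric"].index  # gensim 3.x vocab layout
v_out = w2v_model.trainables.syn1neg[idx]  # output (context) vector
v_in = w2v_model.wv["Bi2Te3"]  # input (target) vector for a compound
print(np.dot(v_in, v_out) / (np.linalg.norm(v_in) * np.linalg.norm(v_out)))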
