
mat2vec's Introduction

Supplementary Materials for "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature 571, 95–98 (2019).

Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K. A., Ceder, G. and Jain, A.

doi: 10.1038/s41586-019-1335-8

A view-only (no download) link to the paper: https://rdcu.be/bItqk

For those interested in the ab initio thermoelectric data, see the Thermoelectric Datasets section below.

Set up

  1. Make sure you have Python 3.6 and the pip module installed. We recommend using conda environments.
  2. Navigate to the root folder of this repository (the same folder that contains this README file) and run pip install --ignore-installed -r requirements.txt. Note: If you are using a conda env and any packages fail to compile during this step, you may need to first install those packages separately with conda install package_name.
  3. Wait for all the requirements to be downloaded and installed.
  4. Run python setup.py install to install this module. This will also download the Word2vec model files. If the download fails, manually download the model, word embeddings and output embeddings and put them in mat2vec/training/models.
  5. Finalize your chemdataextractor installation by executing cde data download (You may need to restart your virtual environment for the cde command line interface to be found).
  6. You are ready to go!

Processing

Example python usage:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
text_processor.process("LiCoO2 is a battery cathode material.")

(['CoLiO2', 'is', 'a', 'battery', 'cathode', 'material', '.'], [('LiCoO2', 'CoLiO2')])

For the various methods and options see the docstrings in the code.

Pretrained Embeddings

Load and query for similar words and phrases:

from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
w2v_model.wv.most_similar("thermoelectric")

[('thermoelectrics', 0.8435688018798828), ('thermoelectric_properties', 0.8339033126831055), ('thermoelectric_power_generation', 0.7931368350982666), ('thermoelectric_figure_of_merit', 0.791649341583252), ('seebeck_coefficient', 0.7753845453262329), ('thermoelectric_generators', 0.7641351819038391), ('figure_of_merit_ZT', 0.7587921023368835), ('thermoelectricity', 0.7515754699707031), ('Bi2Te3', 0.7480161190032959), ('thermoelectric_modules', 0.7434879541397095)]

Phrases can be queried with underscores:

w2v_model.wv.most_similar("band_gap", topn=5)

[('bandgap', 0.934801459312439), ('band_-_gap', 0.933477520942688), ('band_gaps', 0.8606899380683899), ('direct_band_gap', 0.8511275053024292), ('bandgaps', 0.818678617477417)]

Analogies:

# helium is to He as ___ is to Fe? 
w2v_model.wv.most_similar(
    positive=["helium", "Fe"], 
    negative=["He"], topn=1)

[('iron', 0.7700884938240051)]

Material formulae need to be normalized before analogies:

# "GaAs" is not normalized
w2v_model.wv.most_similar(
    positive=["cubic", "CdSe"], 
    negative=["GaAs"], topn=1)

KeyError: "word 'GaAs' not in vocabulary"

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
w2v_model.wv.most_similar(
    positive=["cubic", text_processor.normalized_formula("CdSe")], 
    negative=[text_processor.normalized_formula("GaAs")], topn=1)

[('hexagonal', 0.6162797212600708)]

Keep in mind that words should also be processed before queries. Most of the time this is as simple as lowercasing; however, it is safest to use the process() method of mat2vec.processing.MaterialsTextProcessor, as in the sketch below.
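For example, a minimal sketch (reusing the w2v_model loaded above; the exact tokens returned are an assumption) of normalizing a free-text query before a lookup:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
tokens, _ = text_processor.process("Band gap")  # expected to lowercase and tokenize
w2v_model.wv.most_similar("_".join(tokens), topn=5)  # equivalent to querying "band_gap"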

Training

To run an example training, navigate to mat2vec/training/ and run

python phrase2vec.py --corpus=data/corpus_example --model_name=model_example

from the terminal. It should run an example training and save the files in the models and tmp folders. This should take only a few seconds, since the example corpus has just 5 abstracts.
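Once training finishes, the new model can be loaded the same way as the pretrained one (a minimal sketch, assuming the default output location and the gensim 3.x API used elsewhere in this README):

from gensim.models import Word2Vec

model_example = Word2Vec.load("mat2vec/training/models/model_example")
print(len(model_example.wv.vocab))  # vocabulary size of the toy model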

For more options, run

python phrase2vec.py --help

Thermoelectric Datasets

You can find the condensed thermoelectric CRTA (constant relaxation-time approximation) data in the thermoelectric_data directory.
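A minimal sketch for inspecting the data with pandas; the filename below is hypothetical, so substitute whatever files are actually in thermoelectric_data:

import pandas as pd

df = pd.read_csv("thermoelectric_data/crta_data.csv")  # hypothetical filename
print(df.head())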

Related Work

  • Weston, L., Tshitoyan, V., Dagdelen, J., Kononova, O., Persson, K. A., Ceder, G. and Jain, A. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. ChemRxiv preprint (2019).

Issues?

You can either report an issue on GitHub or contact one of us directly. Try [email protected], [email protected], [email protected] or [email protected].

mat2vec's People

Contributors

ardunn, computron, jdagdelen, johannesebke, vtshitoyan


mat2vec's Issues

Training my own word embeddings

Hi,
I am trying to train my own word embeddings on my own corpus using your model. Could you please walk me through, step by step, how to do that?
I ran the model according to your instructions in the README file and it works.

Trained my model using phrase2vec.py but now I want to test using that model. How?

I trained on my own corpus using this command from the README, replacing corpus_example with my own; it created a new model called model_example in the same folder as the pretrained embeddings.
python phrase2vec.py --corpus=data/corpus_example --model_name=model_example


I want to then use this new model to test. How do I do that? I thought I would run

from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/model_example")

but I got an error saying
No such file or directory: '/Users/monicapuerto/Desktop/Github/mat2vec/mat2vec/processing/models/phraser.pkl'

Yet that file does indeed exist. Why does the new model depend on this file? How do I use the model I just trained?

Drug repurposing for COVID

Hello @jdagdelen. You mentioned in an earlier thread that mat2vec could be used for drug repurposing. Experiments are currently underway to perform simulations on 8,000 FDA-approved drugs, and 77 of those compounds have been selected as likely candidates for COVID treatment. https://onezero.medium.com/the-worlds-most-powerful-supercomputer-has-entered-the-fight-against-coronavirus-3e98c4d67459

If we use Mat2vec for the same drug repurposing task, the intersection of our work and theirs might yield a shortened list of candidate compounds. What do you think about getting started on this? I can use the program but I don’t have a deep enough understanding of it to figure out how to discover new knowledge from it. Would you like to collaborate?

Prediction of (new) thermoelectric materials

First of all thank you so much for sharing all this! I found the paper and the associated results very exciting!

I tried to reproduce Fig. 2a using your pre-trained model. I first printed the tokens most similar to "thermoelectric" (highest cosine similarity). Then I used one of your processing functions (in the process script) to keep only "simple chemical formulae". And finally, as you mention in the paper, I removed the formulae appearing fewer than 3 times in the corpus. Roughly, the procedure looks like the sketch below.
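A minimal sketch of the procedure described above; is_simple_formula is an assumed name for the processing helper referred to, so check the docstrings in mat2vec/processing for the actual one:

from gensim.models import Word2Vec
from mat2vec.processing import MaterialsTextProcessor

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
text_processor = MaterialsTextProcessor()
candidates = w2v_model.wv.most_similar("thermoelectric", topn=1000)
# keep only tokens that look like simple chemical formulae
formulae = [(word, score) for word, score in candidates if text_processor.is_simple_formula(word)]  # assumed helper
print(formulae[:10])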

However, I ended up with a lot of noise in my list compared to yours. I got the same first 2 predictions, but then I also had formulae like Bi20Se3Te27 or SSe4Sn5 in the top 10. To give you an idea of the amount of noise: PbTe, which is 3rd in your list, is 92nd in mine.

So what am I missing?

Thank you in advance!
Anita

Some questions about the use of chemdataextractor tool

Hello, I found your article very interesting. I have tried some similar work myself, and I have some questions I would like to ask you about. When using the chemdataextractor tool, extracting chemical formulae from about 200,000 abstracts is very slow. May I ask how you carried out the chemical formula labeling in the abstracts? Or is there a way to speed up the process? Thank you very much!
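One generic way to speed this up (a sketch, not the authors' confirmed pipeline) is to process abstracts in parallel, since each abstract is independent:

from multiprocessing import Pool
from mat2vec.processing import MaterialsTextProcessor

text_processor = MaterialsTextProcessor()

def process_abstract(abstract):
    # tokenize and normalize a single abstract
    tokens, materials = text_processor.process(abstract)
    return tokens

if __name__ == "__main__":
    abstracts = ["LiCoO2 is a battery cathode material."] * 8  # substitute your corpus
    with Pool(processes=4) as pool:
        token_lists = pool.map(process_abstract, abstracts)
    print(token_lists[0])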

TypeError: __init__() got an unexpected keyword argument 'common_terms'

Hi,
I tried to train a model and I got this error:

Traceback (most recent call last):
  File "C:\Users\T\desktop\backup\mat2vec-master\mat2vec\training\phrase2vec.py", line 164, in <module>
    sentences, phraser = wordgrams(processed_sentences,
  File "C:\Users\T\desktop\backup\mat2vec-master\mat2vec\training\phrase2vec.py", line 43, in wordgrams
    phrases = Phrases(
TypeError: __init__() got an unexpected keyword argument 'common_terms'
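This error is characteristic of gensim 4.x, which renamed the common_terms argument of Phrases to connector_words. A minimal sketch of the gensim 4 call (a workaround suggestion, not a fix confirmed in this thread); alternatively, pinning gensim below 4.0 keeps common_terms working:

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

processed_sentences = [["the", "band", "gap", "of", "GaAs"], ["figure", "of", "merit"]]  # toy corpus
phrases = Phrases(processed_sentences, min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS)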

About the final word embeddings.

I just trained a model on my own corpus. It has space group numbers, and I replaced them with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid overlap with element names. But when I try to get the final embeddings from the model, it says some space group numbers (Xx105, Xx139, etc.) are not in the vocabulary, independent of their frequency! Why is this happening? I've tried to look through the code and couldn't figure it out.

Training data for original model

I was wondering whether the training data used for the original model is available anywhere, as the Matscholar API is currently not available.

Script to fetch cleaned abstracts

I noticed you've nicely provided the DOIs, but a simple pull fetches the article as raw HTML. Might you have a recommendation on grabbing the cleaned abstracts the way you did it? There are also the several dataset splits that you mentioned being quite influential on the final result.
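For reference, a minimal sketch (an assumption about tooling, not the authors' pipeline) that pulls an abstract for a DOI from the Crossref REST API; many records lack the abstract field, and the returned text still needs cleaning:

import requests

def fetch_abstract(doi):
    # Crossref returns metadata as JSON; "abstract" is present only for some records
    response = requests.get(f"https://api.crossref.org/works/{doi}")
    response.raise_for_status()
    return response.json()["message"].get("abstract")

print(fetch_abstract("10.1038/s41586-019-1335-8"))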

Other applications

How can this software be applied to other research areas? Space travel/propulsion, physics, history, etc? Thank you!

Problems training the model

Dear community,

I'm having a problem running the line code to train the model on the corpus example. The following error messages are printed.

:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
2022-06-10 00:06:51,569 : INFO : Basic min_count trim rule for formula.
2022-06-10 00:06:51,569 : INFO : Not including extra phrases, option not specified.
Traceback (most recent call last):
  File "/home/lucas_bandeira/Documents/mat2vec/mat2vec/training/phrase2vec.py", line 165, in <module>
    sentences, phraser = wordgrams(processed_sentences,
  File "/home/lucas_bandeira/Documents/mat2vec/mat2vec/training/phrase2vec.py", line 44, in wordgrams
    phrases = Phrases(
TypeError: __init__() got an unexpected keyword argument 'common_terms'

Could somebody help me to solve this problem?

Sincerely yours,
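This is the same common_terms error as in the TypeError issue above; the gensim 4 connector_words sketch there applies here as well.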

Setup requirements error

Had an issue (from within a fresh conda environment) running:

pip install -r requirements.txt

After successfully compiling packages, installation ultimately failed with:

...
Installing collected packages: monty, regex, urllib3, requests, unidecode, ruamel.yaml, pydispatcher, tabulate, spglib, palettable, pymatgen, pycryptodome, pdfminer.six, python-crfsuite, cssselect, appdirs, DAWG, chemdataextractor, jmespath, botocore, s3transfer, boto3, smart-open, gensim, tqdm
  Found existing installation: urllib3 1.12
ERROR: Cannot remove entries from nonexistent file /xxx/.conda/envs/tshitoyan/lib/python3.6/site-packages/easy-install.pth

The solution was to re-run pip with the --ignore-installed option:

pip install --ignore-installed -r requirements.txt

Beyond that, everything installed and ran as expected.
Hope this saves some headaches for others.

Great work!

Doug

Request for a step by step document on how to run the code

Would you be kind enough to first share a step-by-step document on how to run the code from https://github.com/materialsintelligence/mat2vec using a Jupyter Notebook in an Anaconda3 environment, on a laptop with CPU only (i.e. without GPU)? I am running Python 3.7.3 on Jupyter Notebook 6.0.2 in Anaconda3, with TensorFlow version 1.15.0 and Keras version 2.2.4.
(i) I have installed all packages mentioned in requirements.txt, including ChemDataExtractor, but am running into issues with installing "molsets". Any guidance on that?
(ii) I'd like to know which .py file(s) exactly to run, vs. a sequence of .py files to run, and any other tips. For example, there are these .py files: setup.py, process.py, test-process.py, phrase2vec.py, etc. Assuming I want to simply run the model and get the output using a Jupyter Notebook in an Anaconda3 environment, what exactly do I have to run, and in what order?
Once I am able to run it on my laptop, I will attempt to run it in Colab. Basically, assuming I use a Jupyter Notebook in an Anaconda3 environment, what exactly should the steps be? Thanks.

Formatting Abstracts

Is there any special text formatting that needs to be done to abstracts before training? I noticed the corpus example has % and <nUm> in various places. Just wondering if formatting matters at all, or if you can dump the plain text from abstracts into a corpus file.
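The processing module appears to produce these tokens itself; a minimal sketch (the exact output is an assumption, not verified here) using the process() method from the Processing section:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
tokens, _ = text_processor.process("The Seebeck coefficient was 120 at 300 K.")
print(tokens)  # numbers are expected to be replaced with the <nUm> placeholder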

Question about the outputs

I have successfully installed mat2vec in a conda environment (python=3.6) and tried to reproduce your work. However, when I followed the directions under Processing and Pretrained Embeddings, no output appeared (and no error message).

For example, I ran test.txt from the root folder of this repository; after several seconds of running, there were no outputs.

I am a newcomer to machine learning, hoping to get your guidance. Thank you!
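A likely explanation (an assumption, not confirmed in this thread): unlike the interactive interpreter, a Python script does not echo expression values, so the README calls need an explicit print() when run from a file:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
print(text_processor.process("LiCoO2 is a battery cathode material."))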

Model missing from the training folder

I was looking for the word2vec model specifically. According to the code on the first page, the model is supposed to be at "mat2vec/training/models/pretrained_embeddings", but I'm unable to find it.

No module named 'helpers' error when loading newly trained embeddings

Dear all,
I've trained my own embeddings and I'm now trying to open them with your mat2vec tool using

 w2v_model = Word2Vec.load(....)

however, I get a strange error about a module named helpers. The same problem does not happen with the pretrained models you are providing:

>>> w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
>>> w2v_model = Word2Vec.load("mat2vec/training/models/test_model")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/word2vec.py", line 975, in load
    return super(Word2Vec, cls).load(*args, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 629, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 278, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/utils.py", line 425, in load
    obj = unpickle(fname)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/utils.py", line 1332, in unpickle
    return _pickle.load(f, encoding='latin1')
  File "/Users/lfoppiano/development/github/mat2vec/mat2vec/training/__init__.py", line 1, in <module>
    from helpers import utils
ModuleNotFoundError: No module named 'helpers'
>>> 

Any suggestions?

Another question: after the training, I have only the accuracies, loss and phraser files in my model output directory. I copied the files containing the vectors, the trainables and the actual model from the tmp directory (I took the ones from epoch 29); was this the correct way?

.rw-r--r-- lfoppiano staff 286.4 MB Mon Aug  5 11:33:58 2019   test_model
.rw-r--r-- lfoppiano staff   3.2 GB Mon Aug  5 11:33:24 2019   test_model.trainables.syn1neg.npy
.rw-r--r-- lfoppiano staff   3.2 GB Mon Aug  5 11:39:55 2019   test_model.wv.vectors.npy
.rw-r--r-- lfoppiano staff   7.5 KB Mon Aug  5 11:26:21 2019   test_model_accuracies.pkl
.rw-r--r-- lfoppiano staff   389 B  Mon Aug  5 11:27:07 2019   test_model_loss.pkl
.rw-r--r-- lfoppiano staff    53 MB Mon Aug  5 11:40:00 2019   test_model_phraser.pkl

Question about target and context words

I have a question about your research approach communicated in Nature. There you use the phrases "target word" and "context word". Normally, in the skip-gram model, the embedding for the "target word" (input layer) is different from the embedding for the "context word" (output layer). In gensim, if you use model.wv.most_similar, you are effectively searching for similar words using embeddings from the input layer. You can also access the "context word" embeddings via model.syn1neg. Were you using both embeddings for analyzing e.g. the relation between a chemical compound and "thermoelectric"?
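For reference, a minimal sketch of comparing a compound's input vector against a word's output vector; the attribute location varies across gensim versions (model.syn1neg in older releases, model.trainables.syn1neg in 3.x), and the separately downloadable output embeddings mentioned under Set up appear to correspond to the same data:

import numpy as np
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
idx = w2v_model.wv.vocab["thermoelectric"].index  # gensim 3.x vocab layout
v_out = w2v_model.trainables.syn1neg[idx]  # output (context) vector
v_in = w2v_model.wv["Bi2Te3"]  # input (target) vector for a compound
print(np.dot(v_in, v_out) / (np.linalg.norm(v_in) * np.linalg.norm(v_out)))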
