williamleif / histwords
Collection of tools for building diachronic/historical word vectors
Home Page: http://nlp.stanford.edu/projects/histwords/
License: Apache License 2.0
Hi, I'm working on the Chinese corpus downloaded from Histwords.
I read the vectors of 病毒 & 电脑 and get the following results for cosine similarity:
('病毒', '电脑')
1950, cosine similarity=0.000
1960, cosine similarity=0.000
1970, cosine similarity=0.000
1980, cosine similarity=0.360
1990, cosine similarity=0.263
The Spearman correlation between [0, 0, 0, 0.36, 0.26] and [1950, 1960, 1970, 1980, 1990] is 0.78. However, the paper reports the correlation as 0.89 (at the end of Section 3.2).
Is there anything going wrong with my data processing? Thank you for your attention.
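For reference, the similarity-versus-time correlation above can be reproduced with NumPy and SciPy. The decade values are the ones quoted in the issue; the 0.89 figure in the paper is computed over the authors' own data, so a mismatch here need not mean a processing error:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity; all-zero vectors are given similarity 0.
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

# Decade-by-decade similarities quoted in the issue.
decades = [1950, 1960, 1970, 1980, 1990]
sims = [0.000, 0.000, 0.000, 0.360, 0.263]

rho, _ = spearmanr(sims, decades)
print(round(rho, 2))  # 0.78, matching the value computed in the issue
```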
Hi,
I am currently experiencing some difficulties generating new embeddings with your code for visualizing words over time.
For now, I have generated separate embeddings by year using sgns/hyperwords; that seems to be OK.
I am now trying to use your script vecanalysis/seq_procrustes.py, but I don't think I am using the correct format for the required count file: I suppose it is not the same as the one generated by hyperwords? Maybe I missed it, but is there an example of this file somewhere?
I downloaded the example "embeddings/eng-fiction-all_sgns" for visualisation (and it works), but could not find any count file.
Thank you for the answer.
Best regards.
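In the absence of a documented example, one plausible shape for a per-year count file is a pickled {word: frequency} dict. This is a guess at the format seq_procrustes.py expects, not something confirmed by the repo, and the file name is illustrative:

```python
import pickle
from collections import Counter

def write_count_file(tokens, path):
    # Hypothetical count file: a pickled {word: frequency} dict for one
    # time slice. This format is an assumption, not confirmed by histwords.
    counts = Counter(tokens)
    with open(path, "wb") as f:
        pickle.dump(dict(counts), f)

write_count_file("the cat sat on the mat".split(), "1990-counts.pkl")

with open("1990-counts.pkl", "rb") as f:
    counts = pickle.load(f)
print(counts["the"])  # 2
```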
I'm sorry to disturb you; I'm a newcomer. I have tried my best to study the code, but I can't work this out. There may be two errors in the scripts:
1. When I ran the script, five files were generated (sgns.contexts.txt, sgns.contexts.bin, sgns.words.txt, sgns.words.bin, sgns.words.words). What is the last file? Why is sgns.words.words blank? Can you provide the source code of word2vecf?
2. The vecanalysis directory has no representations module, so the following import fails. If I want to get the aligned embeddings, how should I proceed, and what parameters need to be set in seq_procrustes.py?
from vecanalysis.representations.representation_factory import create_representation
Thanks for your help. @williamleif
I went to the project site too, which has the same issue.
Regarding the pre-trained vectors for some of the corpora: (on the HistWords website)
For specific decades, there appear to be a handful of word vectors that are 0.0 across all 300 dimensions, even though the corresponding words are still present in the corpus for that decade. These words do not get any representation and have been assigned zero values throughout. For example, the vector for the word 'autism' in the 1800s decade of the Google n-grams eng-all vectors is [0.0 ... 0.0] in all 300 dimensions.
Would treating these words as simply 'missing' from the corpus at this particular decade be apt?
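Treating all-zero rows as missing is easy to do with a mask. A minimal sketch, assuming the embeddings are loaded as a NumPy matrix with a parallel vocabulary list (the toy data and names here are illustrative):

```python
import numpy as np

# Toy embedding matrix: the second row is all-zero, i.e. the word
# received no representation in this decade.
vocab = ["dog", "autism", "cat"]
mat = np.array([[0.1, 0.2],
                [0.0, 0.0],
                [0.3, 0.4]])

# A word counts as present only if its vector has at least one nonzero entry.
present = ~np.all(mat == 0.0, axis=1)
missing = [w for w, ok in zip(vocab, present) if not ok]
print(missing)  # ['autism']
```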
Hi,
I have a problem re-generating SGNS embeddings on the Google n-gram corpus.
I follow these steps:
My problem is that the vectors generated are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary size is about 50,000, while yours is 100,000.
So my question is: is there anything wrong in the steps I follow? Or can you give me any info on why this happens?
Thanks,
very cool work!
I have been looking for the code that uses t-SNE to create the visualizations on the front page of the git repo, but can't find it. I've grepped for usages of sklearn (and for the 'manifold' / 'tsne' keywords) but only found sklearn's normalization in use. Is the visualization code in the repo?
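The t-SNE code does not appear to ship with the repo. A minimal sketch of how such a figure could be produced with scikit-learn (this is not the authors' actual pipeline, and the toy vectors stand in for a word's per-decade vectors plus its neighbors):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for a set of embedding vectors to project.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(12, 50))

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
print(coords.shape)  # (12, 2)
```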
Also, the visualization on the front page shows "broadcast 1900s" twice (in the middle panel). Is that intentional?
The existing setup.py does not install successfully with "pip install", because the value "sklearn" in "install_requires" is no longer available on PyPI. The correct value is "scikit-learn". Installation succeeds after this change.
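The fix amounts to renaming the requirement. A minimal setup.py fragment (other fields elided; the extra requirement listed is illustrative):

```python
from setuptools import setup

setup(
    name="histwords",
    # "sklearn" was removed from PyPI; the package is published as "scikit-learn".
    install_requires=["scikit-learn", "numpy"],
)
```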
Is there any tutorial on how to use the code?
bash-4.3$ pip install --user git+https://github.com/williamleif/histwords.git
You are using pip version 6.0.8, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+https://github.com/williamleif/histwords.git
Cloning https://github.com/williamleif/histwords.git to /tmp/pip-GQOH9S-build
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-GQOH9S-build/setup.py", line 7, in <module>
ext_modules = cythonize(["googlengram/pullscripts/*.pyx", "cooccurrence/*.pyx"]),
File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 754, in cythonize
aliases=aliases)
File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 649, in create_extension_list
for file in nonempty(extended_iglob(filepattern), "'%s' doesn't match any files" % filepattern):
File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 103, in nonempty
raise ValueError(error_msg)
ValueError: 'cooccurrence/*.pyx' doesn't match any files
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-GQOH9S-build
bash-4.3$
In lines 108 and 110 of histwords/representations/embedding.py, it seems that you use element-wise array multiplication instead of matrix multiplication. Is this intended? Usually s and u are combined using matrix multiplication, I think.
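For what it's worth, when s is a 1-D array of singular values, NumPy's element-wise u * s broadcasts s across the columns of u and is equivalent to the matrix product u @ diag(s), so the two forms agree:

```python
import numpy as np

u = np.arange(6.0).reshape(3, 2)   # stand-in for left singular vectors
s = np.array([2.0, 3.0])           # stand-in for singular values

# Element-wise broadcast scaling of the columns of u...
a = u * s
# ...equals matrix multiplication by the diagonal matrix of s.
b = u @ np.diag(s)
print(np.allclose(a, b))  # True
```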
I noticed the pre-trained embeddings available for download are all normalised, with each vector's L2 norm equal to 1. Since normalisation causes information loss, could you provide the original (unnormalised) embeddings?
Thanks.
Thank you for your great work. I am wondering: can I load the pre-trained HistWords word vectors using gensim?
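The downloadable vectors ship as a NumPy matrix plus a vocabulary list rather than in word2vec format, but they can be rewritten into word2vec's plain-text format, which gensim's KeyedVectors.load_word2vec_format reads. A sketch with toy data and illustrative file names:

```python
import numpy as np

# Toy stand-ins for the downloaded matrix and vocabulary.
vocab = ["dog", "cat"]
mat = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])

# word2vec text format: header "vocab_size dim", then one "word v1 v2 ..." line per word.
with open("1990-w.txt", "w") as f:
    f.write(f"{len(vocab)} {mat.shape[1]}\n")
    for word, vec in zip(vocab, mat):
        f.write(word + " " + " ".join(map(str, vec)) + "\n")

# Afterwards, in gensim:
#   from gensim.models import KeyedVectors
#   kv = KeyedVectors.load_word2vec_format("1990-w.txt", binary=False)
```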
Hi!
To learn historical embeddings for new data, what are the implementation and pre-processing details?
Hello,
The current gist (https://gist.github.com/louridas/a3cdb1b109ac03a8e202f4b19c9335b3) for the gensim Procrustes implementation does not work (due to changes in gensim).
I have made a fix here:
https://gist.github.com/louridas/a3cdb1b109ac03a8e202f4b19c9335b3
Best Regards,
Panos.