williamleif / histwords
Collection of tools for building diachronic/historical word vectors
Home Page: http://nlp.stanford.edu/projects/histwords/
License: Apache License 2.0
Hi, I'm working on the Chinese corpus downloaded from Histwords.
I read the vectors of 病毒 & 电脑 and get the following results for cosine similarity:
('病毒', '电脑')
1950, cosine similarity=0.000
1960, cosine similarity=0.000
1970, cosine similarity=0.000
1980, cosine similarity=0.360
1990, cosine similarity=0.263
The Spearman correlation between [0, 0, 0, 0.36, 0.26] and [1950, 1960, 1970, 1980, 1990] is 0.78. However, the paper reports the correlation as 0.89 (at the end of Section 3.2).
Is there anything going wrong with my data processing? Thank you for your attention.
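For reference, the similarity-versus-time correlation above can be reproduced with NumPy and SciPy. The decade values are the ones quoted in the issue; the 0.89 figure in the paper is computed over the authors' own data, so a mismatch here need not mean a processing error:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity; all-zero vectors are given similarity 0.
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

# Decade-by-decade similarities quoted in the issue.
decades = [1950, 1960, 1970, 1980, 1990]
sims = [0.000, 0.000, 0.000, 0.360, 0.263]

rho, _ = spearmanr(sims, decades)
print(round(rho, 2))  # 0.78, matching the value computed in the issue
```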
Hi,
I am currently experiencing some difficulties generating new embeddings with your code for visualizing words over time.
For now, I have generated separate embeddings by year using sgns/hyperwords; that seems to be OK.
I am now trying to use your script vecanalysis/seq_procrustes.py, but I don't think I am using the correct format for the required count file: I suppose it is not the same as the one generated by hyperwords? Maybe I missed it, but is there an example of this file somewhere?
I downloaded the example "embeddings/eng-fiction-all_sgns" for visualisation (and it works), but could not find any count file.
Thank you for the answer.
Best regards.
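In the absence of a documented example, one plausible shape for a per-year count file is a pickled {word: frequency} dict. This is a guess at the format seq_procrustes.py expects, not something confirmed by the repo, and the file name is illustrative:

```python
import pickle
from collections import Counter

def write_count_file(tokens, path):
    # Hypothetical count file: a pickled {word: frequency} dict for one
    # time slice. This format is an assumption, not confirmed by histwords.
    counts = Counter(tokens)
    with open(path, "wb") as f:
        pickle.dump(dict(counts), f)

write_count_file("the cat sat on the mat".split(), "1990-counts.pkl")

with open("1990-counts.pkl", "rb") as f:
    counts = pickle.load(f)
print(counts["the"])  # 2
```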
I'm sorry to disturb you; I'm a newcomer. I have tried my best to study the code, but I can't work this out. There may be two errors in the scripts:
1. When I ran the script, five files were generated (sgns.contexts.txt, sgns.contexts.bin, sgns.words.txt, sgns.words.bin, sgns.words.words). What is the last file? Why is sgns.words.words blank? Can you provide the source code of word2vecf?
2. The vecanalysis directory has no representations module, so the following import fails. If I want to get the aligned embeddings, how should I proceed, and what parameters need to be set in seq_procrustes.py?
from vecanalysis.representations.representation_factory import create_representation
Thanks for your help. @williamleif
I went to the project site too, which has the same issue.
Regarding the pre-trained vectors for some of the corpora: (on the HistWords website)
For specific decades, there appear to be a handful of word vectors that are 0.0 across all 300 dimensions, even though the corresponding words are still present in the corpus for that decade. These words do not get any representation and have been assigned zero values throughout. For example, the vector for the word 'autism' in the 1800s decade of the Google n-grams eng-all vectors is [0.0 ... 0.0] in all 300 dimensions.
Would treating these words as simply 'missing' from the corpus at this particular decade be apt?
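Treating all-zero rows as missing is easy to do with a mask. A minimal sketch, assuming the embeddings are loaded as a NumPy matrix with a parallel vocabulary list (the toy data and names here are illustrative):

```python
import numpy as np

# Toy embedding matrix: the second row is all-zero, i.e. the word
# received no representation in this decade.
vocab = ["dog", "autism", "cat"]
mat = np.array([[0.1, 0.2],
                [0.0, 0.0],
                [0.3, 0.4]])

# A word counts as present only if its vector has at least one nonzero entry.
present = ~np.all(mat == 0.0, axis=1)
missing = [w for w, ok in zip(vocab, present) if not ok]
print(missing)  # ['autism']
```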
Hi,
I have a problem re-generating SGNS embeddings on the Google n-gram corpus.
I follow these steps:
My problem is that the vectors generated are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary size is about 50,000, while yours is 100,000.
So my question is: is there anything wrong in the steps I follow? Or can you give me any info on why this happens?
Thanks,
very cool work!
I have been looking for the code that uses t-SNE to create the visualizations on the front page of the git repo, but can't find it. I've grepped for usages of sklearn (and for the 'manifold' / 'tsne' keywords) but only found sklearn's normalization in use. Is the visualization code in the repo?
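The t-SNE code does not appear to ship with the repo. A minimal sketch of how such a figure could be produced with scikit-learn (this is not the authors' actual pipeline, and the toy vectors stand in for a word's per-decade vectors plus its neighbors):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for a set of embedding vectors to project.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(12, 50))

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
print(coords.shape)  # (12, 2)
```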
Also, the visualization on the front page shows "broadcast 1900s" twice (in the middle panel). Is that intentional?
The existing setup.py does not install successfully with "pip install", because the value "sklearn" in "install_requires" is no longer available on PyPI. The correct value is "scikit-learn". Installation succeeds after this change.
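The fix amounts to renaming the requirement. A minimal setup.py fragment (other fields elided; the extra requirement listed is illustrative):

```python
from setuptools import setup

setup(
    name="histwords",
    # "sklearn" was removed from PyPI; the package is published as "scikit-learn".
    install_requires=["scikit-learn", "numpy"],
)
```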
Is there any tutorial on how to use the code?
bash-4.3$ pip install --user git+https://github.com/williamleif/histwords.git
You are using pip version 6.0.8, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+https://github.com/williamleif/histwords.git
Cloning https://github.com/williamleif/histwords.git to /tmp/pip-GQOH9S-build
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-GQOH9S-build/setup.py", line 7, in <module>
ext_modules = cythonize(["googlengram/pullscripts/*.pyx", "cooccurrence/*.pyx"]),
File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 754, in cythonize
aliases=aliases)
File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 649, in create_extension_list
for file in nonempty(extended_iglob(filepattern), "'%s' doesn't match any files" % filepattern):
File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 103, in nonempty
raise ValueError(error_msg)
ValueError: 'cooccurrence/*.pyx' doesn't match any files
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-GQOH9S-build
bash-4.3$
In lines 108 and 110 of histwords/representations/embedding.py, it seems that you use element-wise array multiplication instead of matrix multiplication. Is this intended? Usually s and u are combined using matrix multiplication, I think.
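For what it's worth, when s is a 1-D array of singular values, NumPy's element-wise u * s broadcasts s across the columns of u and is equivalent to the matrix product u @ diag(s), so the two forms agree:

```python
import numpy as np

u = np.arange(6.0).reshape(3, 2)   # stand-in for left singular vectors
s = np.array([2.0, 3.0])           # stand-in for singular values

# Element-wise broadcast scaling of the columns of u...
a = u * s
# ...equals matrix multiplication by the diagonal matrix of s.
b = u @ np.diag(s)
print(np.allclose(a, b))  # True
```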
I noticed the pre-trained embeddings available for download are all normalised, with each vector's L2 norm equal to 1. Since normalisation causes information loss, could you provide the original (unnormalised) embeddings?
Thanks.
Thank you for your great work. I am wondering: can I load the pre-trained HistWords word vectors using gensim?
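The downloadable vectors ship as a NumPy matrix plus a vocabulary list rather than in word2vec format, but they can be rewritten into word2vec's plain-text format, which gensim's KeyedVectors.load_word2vec_format reads. A sketch with toy data and illustrative file names:

```python
import numpy as np

# Toy stand-ins for the downloaded matrix and vocabulary.
vocab = ["dog", "cat"]
mat = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])

# word2vec text format: header "vocab_size dim", then one "word v1 v2 ..." line per word.
with open("1990-w.txt", "w") as f:
    f.write(f"{len(vocab)} {mat.shape[1]}\n")
    for word, vec in zip(vocab, mat):
        f.write(word + " " + " ".join(map(str, vec)) + "\n")

# Afterwards, in gensim:
#   from gensim.models import KeyedVectors
#   kv = KeyedVectors.load_word2vec_format("1990-w.txt", binary=False)
```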
Hi!
To learn historical embeddings for new data, what are the implementation and pre-processing details?
Hello,
The current gist (https://gist.github.com/louridas/a3cdb1b109ac03a8e202f4b19c9335b3) for the gensim Procrustes implementation does not work (due to changes in gensim).
I have made a fix here:
https://gist.github.com/louridas/a3cdb1b109ac03a8e202f4b19c9335b3
Best Regards,
Panos.