math2vec's Introduction

Workflow

  1. Getting the arXMLiv dataset (see arXMLiv below)
  2. Processing it via PlaneText (see PlaneText below)
  3. Post-processing via custom procedures (see Post-Processing below)

arXMLiv

Download arXMLiv 08.2018 (requires git lfs) and extract the HTML files.

PlaneText

We use PlaneText for processing the HTML files from arXMLiv. We customized the code a little bit. The original sources can be found at KMCS-NII. Our customized version can be found under planetext.

Our version of PlaneText a) does not substitute MathML inner elements and b) does not create XHTML or HTML files.

For processing arXMLiv:

# navigate to the planetext directory
./bin/planetext arxmliv.yaml <path to html files> -O <output path>

The process may stop or crash for a subset of files. In that case, you can use missing.sh to copy the not-yet-processed files to another directory and repeat the conversion for this subset. Before you do so, change the paths in missing.sh:

DIRIN="no_problems_raw"
DIRPROC="no_problems_txt"
DIROUT="no_problems_tmp"

Another problem is empty files or files without meaningful content. PlaneText will then generate empty annotation files and maybe even empty text files. To identify those files in advance (before post-processing), we created broken.sh. As with missing.sh, one may have to modify the paths in broken.sh.
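broken.sh itself is a shell script; the following is only a minimal Python sketch of the same check. The .txt and .ann extensions are assumptions, since the exact PlaneText output names are not fixed here.

# Minimal sketch of the broken.sh idea: find empty output files before
# post-processing. The .txt/.ann extensions are assumptions.
import pathlib

def find_broken(processed_dir):
    broken = []
    for path in pathlib.Path(processed_dir).iterdir():
        # empty text or annotation files indicate a file "without meaning"
        if path.suffix in (".txt", ".ann") and path.stat().st_size == 0:
            broken.append(path)
    return broken

for path in find_broken("no_problems_txt"):
    print(path)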

Post-Processing

The folder post contains a Java project to post-process the files generated by PlaneText. We split the text into sentences (one line per sentence) and replace all math tokens by their LLaMaPuN (Language and Mathematics Processing and Understanding) representation. You have to specify the input and output directories.
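The actual implementation is the Java project under post; as a rough illustration of the sentence-splitting step only, here is a minimal Python sketch. The regex splitter and file layout are assumptions, and the LLaMaPuN token replacement is left out since its exact mapping lives in the Java code.

# Rough sketch of the sentence-splitting step (the real implementation is the
# Java project under post; splitter and file layout here are assumptions).
import pathlib
import re
import sys

def split_sentences(text):
    # naive splitter: break after ., !, or ? followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

def process(in_dir, out_dir):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in pathlib.Path(in_dir).glob("*.txt"):
        sentences = split_sentences(path.read_text(encoding="utf-8"))
        # one line per sentence, as the model training expects
        (out / path.name).write_text("\n".join(sentences), encoding="utf-8")

if __name__ == "__main__":
    process(sys.argv[1], sys.argv[2])  # input and output directories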

math2vec's Issues

Develop an automated evaluation with the gold standard

We have two gold standards: the old one that was used by MLP, and an extended and enhanced version at MathMLBen.

To be able to compare our results with MLP, we have to evaluate on the old gold standard first.

For each entry in the gold standard we have to:

  1. Extract identifiers from the entry
  2. Bring them into the same format we used in our model (custom LLaMaPuN). For example, \alpha is math-alpha in our model.
  3. Ask word2vec for closest relations (see #5)
  4. Improve the results by using Doc2Vec, which also considers the context (also extracted from the gold standard) (see #4)
  5. Generate CSV entries for the results above a given threshold. We have to choose at least one entry (the best in the list) and all entries above a cosine similarity of 0.7. (We can tweak this value later on.)

For the basics, step 4 is currently optional since we don't really know yet how to consider the context. (See more in #7.)
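As a hedged sketch of steps 2, 3, and 5 with Gensim: the model path, the gold entries, and the to_llamapun helper below are all placeholders/assumptions, not the project's actual code.

# Sketch of steps 2, 3, and 5; model path, gold entries, and to_llamapun
# are placeholders/assumptions.
import csv
from gensim.models import Word2Vec

model = Word2Vec.load("math2vec.model")  # hypothetical path
THRESHOLD = 0.7

def to_llamapun(identifier):
    # step 2: e.g. "\alpha" -> "math-alpha" (real mapping is in the post step)
    return "math-" + identifier.lstrip("\\")

with open("evaluation.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for identifier in ["\\alpha", "\\beta"]:  # placeholder gold entries
        token = to_llamapun(identifier)
        ranked = model.wv.most_similar(token, topn=20)  # step 3
        # step 5: keep the best entry plus everything above the threshold
        keep = [ranked[0]] + [(w, s) for w, s in ranked[1:] if s >= THRESHOLD]
        for word, score in keep:
            writer.writerow([identifier, word, score])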

Develop a set of typical relations

To get meaningful results, we have to ask the word2vec model for relations, something like:

  math-a is to variables like <identifier> to ???
  math-f is to function like <identifier> to ???
  ...
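These queries map directly onto word2vec's analogy interface; a minimal Gensim sketch (the model path and the example identifier math-x are assumptions):

# Analogy query "math-f is to function like math-x is to ???" in Gensim;
# the model path and math-x are assumptions.
from gensim.models import Word2Vec

model = Word2Vec.load("math2vec.model")  # hypothetical path
results = model.wv.most_similar(positive=["function", "math-x"],
                                negative=["math-f"], topn=10)
for word, score in results:
    print(word, score)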

Here we have many ways to improve the results. Should we calculate the average of the results, or a weighted average? Which results should have more impact?

But I think that's future work!

Merge Math-Tokens and/or skip too-long chains

We still have the problem of too many math tokens in one paragraph. So we have to find a way to reduce them.

I came up with the following ideas:

  1. Only consider the left-hand side of relations (equations). If an identifier appears only on the right-hand side of relations, it is most likely explained elsewhere and not in the direct environment of the equation. Furthermore, the left-hand side might be defined by the right-hand side and therefore could get a special name. Therefore, we only take the identifiers of the left-hand side of relations. (A similar idea was used by Aizawa in her paper Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers.)
  2. We should maybe replace too-long chains (let's say more than 5 consecutive math tokens) by a single token and store those replacements in a dictionary (sketched below). I don't have a better idea yet to handle many math tokens.
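A minimal sketch of idea 2, assuming math tokens carry a math- prefix; the math-chain-N placeholder naming is an assumption:

# Sketch of idea 2: collapse runs of more than 5 consecutive math tokens into
# one placeholder and remember the originals; token names are assumptions.
MAX_RUN = 5

def collapse_math_runs(tokens, dictionary):
    out, run = [], []
    for tok in tokens + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and tok.startswith("math-"):
            run.append(tok)
            continue
        if len(run) > MAX_RUN:
            key = "math-chain-%d" % len(dictionary)
            dictionary[key] = list(run)  # keep the original chain for lookup
            out.append(key)
        else:
            out.extend(run)
        run = []
        if tok is not None:
            out.append(tok)
    return out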

Stopwords in word2vec and doc2vec

Hello, @AndreG-P @physikerwelt

Just sharing some interesting stuff I found about how word2vec and doc2vec deal with stopwords and how the vectors are affected by them.

Apparently, there will not be much of a difference when training with or without stopwords, since the API we are using (Gensim) has a parameter called sample. This parameter basically gets rid of high-frequency words.

https://radimrehurek.com/gensim/models/word2vec.html

  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

Gordon Mohr, one of the developers of the API, mentions the effects of stopwords on the model. It seems that, because of sampling and little interest in the (non-)use of stopwords, there are not many actual papers on it.

https://groups.google.com/forum/#!topic/gensim/bnNz7UJd-iY
https://groups.google.com/forum/#!topic/gensim/i6TzbQPn8-c

In several discussions (mostly on Stack Overflow) they mention that stopwords will just create noise overall. As a rule of thumb, removing stopwords will only improve things, be it in the quantity of data to read or to process.

You can leave it to the sample parameter to take care of the most frequent words (usually stopwords). However, given the mathematical nature of the papers/MLP tags we are working with, things might backfire on us. For example, math- appears in every document, multiple times, so sample might cut this high-frequency token as well.
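For reference, sample is set at construction time in Gensim; the corpus below is a toy placeholder (vector_size is the Gensim 4 name; older versions call it size):

# Toy example of the sample parameter; the corpus here is a placeholder.
from gensim.models import Word2Vec

sentences = [
    ["the", "function", "math-f", "is", "continuous"],
    ["let", "math-alpha", "be", "a", "variable"],
]
# sample=1e-5 randomly downsamples very frequent words during training,
# which is how stopwords (and possibly frequent math tokens) get thinned out
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sample=1e-5)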

I found a paper from CMU (Carnegie Mellon University) where they played around with word embeddings using GloVe (a variation of word2vec). It seems that the quality of their results improves when the fraction of stopwords decreases (or the window size increases).

https://arxiv.org/pdf/1703.00993.pdf

T.

Avoid plurals in the trained models

We have to avoid plurals in the trained models. There is no reason for us to distinguish between variable and variables in our dataset.

We have two approaches to solve this, suggested by Terry (sketched below):

  1. Stemming: We reduce words to their word stems. For example, universe, universal, and university become univers*.
  2. Simple rules: We only cut off the s at the end, e.g., variables becomes variable.
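A minimal sketch of both approaches; NLTK's PorterStemmer is one possible stemmer, the issue does not name a specific one:

# Sketch of both approaches; PorterStemmer is an assumption, not a fixed choice.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem(word):
    # approach 1: universe, universal, university all reduce to "univers"
    return stemmer.stem(word)

def cut_plural_s(word):
    # approach 2: variables -> variable (deliberately naive)
    return word[:-1] if word.endswith("s") else word

print(stem("universal"), cut_plural_s("variables"))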

Extending Mathosphere

@physikerwelt
In mathosphere there are a couple of programs one can start (see Main.java).

With MLP there is the option to run MLP. It then computes the scores of the definiens-identifier pairs based only on the distances. I think the results of that were around 30% precision or so?
Then there is ML, which implements the WekaLearner (50% precision).

Then there are a couple of other programs whose purpose I don't know right now.

The question is, which of these should we best extend now? I was never able to get the WekaLearner to reproduce the same results as in the paper.

I have 2 ideas that would be more or less easy to try out.

  1. We ask Word2Vec for the best results and add them to the results from MLP (with or without learning...), or we re-weight the results from MLP with Word2Vec. This would have the advantage that we could also get entries that are not mentioned by name in the paper because they are globally assumed to be common knowledge.
  2. We extract the paragraph containing the identifier and ask Doc2Vec for similar paragraphs from the arXiv corpus. We can then apply MLP to these similar paragraphs, which (in theory) should produce much better values because all the paragraphs should be reasonably similar to each other. However, I don't know how this would work with the WekaLearner. We would actually have to load the trained model from the learner and then only analyze the results. But that never worked.

Another problem: we never applied MLP (the WekaLearner) to the new gold standard because we could not easily get at the sources. One is DLMF, another is arXiv, and the first 100 are from Wikipedia. We face the same problem again if we want to evaluate the new gold standard.

PS @truas sorry, I will explain that to you tomorrow.

Improve results of W2V by taking care of the context

We have to consider the context of an identifier. There is no solution yet for how to do so. The current idea is to:

  1. ask Doc2Vec for similar paragraphs for a given input,
  2. compute the cosine similarity between the results from the Word2Vec model and the closest paragraphs from step 1, and
  3. reorder the results from step 1 according to their similarities to the paragraphs from step 2.

Currently, the Doc2Vec model seems to be not well trained. Maybe it's also because the training data is quite small.
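A hedged sketch of the reranking idea with Gensim: the model paths and the combination rule are assumptions, it approximates step 1 by inferring a vector for the identifier's context, and it relies on PV-DM placing word and paragraph vectors in a shared space.

# Sketch of the rerank idea; paths, scoring rule, and the shared word/paragraph
# vector space (PV-DM) are assumptions. Gensim 4 API (d2v.dv; older: docvecs).
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec

w2v = Word2Vec.load("math2vec.model")  # hypothetical path
d2v = Doc2Vec.load("math2doc.model")   # hypothetical path

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(identifier, context_words, topn=10):
    candidates = w2v.wv.most_similar(identifier, topn=topn)  # Word2Vec results
    ctx_vec = d2v.infer_vector(context_words)                # ~step 1
    rescored = []
    for word, w2v_score in candidates:
        if word in d2v.wv:
            # step 2: cosine between the candidate and the inferred context
            rescored.append((word, w2v_score * cos(d2v.wv[word], ctx_vec)))
    return sorted(rescored, key=lambda ws: ws[1], reverse=True)  # step 3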

Kind of N-Grams (merging nouns and adj-nouns)

We should merge some words because we have to identify them as one entity. The classical example is to distinguish between integer, positive_integer, and negative_integer.

After discussions with Aizawa-sensei and others, we can probably just apply 2 simple rules:

  1. Merge consecutive nouns, e.g., Catalan number -> catalan_number.
  2. Merge consecutive adjective-nouns, e.g., arbitrary positive integer -> arbitrary_positive_integer.

@truas you mentioned we should avoid too-long chains. So I would say we only allow a maximum length of 3 words? @physikerwelt what do you think? (A sketch of both rules with that limit is below.)
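A minimal sketch of rules 1 and 2 with a maximum chain length of 3; the choice of NLTK's PoS tagger is an assumption, since the issue doesn't fix one:

# Sketch of rules 1 and 2 with a maximum chain of 3; NLTK's tagger is an
# assumption. Requires: nltk.download('averaged_perceptron_tagger')
import nltk

MAX_CHAIN = 3

def merge_ngrams(tokens):
    tagged = nltk.pos_tag(tokens)
    out, chain = [], []  # chain collects consecutive adjective/noun tokens
    def flush():
        words = [w.lower() for w, _ in chain]
        # merge only if the chain ends in a noun and is short enough
        if 1 < len(chain) <= MAX_CHAIN and chain[-1][1].startswith("NN"):
            out.append("_".join(words))  # e.g. catalan_number
        else:
            out.extend(words)
        chain.clear()
    for word, tag in tagged:
        if tag.startswith("NN") or tag.startswith("JJ"):
            chain.append((word, tag))
        else:
            flush()
            out.append(word)
    flush()
    return out

print(merge_ngrams("an arbitrary positive integer converges".split()))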

@truas A question about the PoS tagger. Our first example in the gold standard contains W as the Van der Waerden number. Do you know if it is possible to merge this by our rules 1 and 2? I wonder how the PoS tagger tags der between Van and Waerden.

Prepare warning dataset

Extend our corpus with the warning dataset. Currently, we only use the no_problems dataset, which is relatively small.

Train a Doc-ID also for Doc2Vec Model

@truas
MLP made the assumption that an identifier has only one meaning in each document. Since we train on paragraphs, would it be possible to also train an extra vector for the Doc-ID of each paragraph? So that we can ask the Doc2Vec model for the most likely Doc-ID for a custom sentence?

I've read in a blog about Doc2Vec that this might be possible.
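This is standard Gensim Doc2Vec usage; a minimal sketch (the toy corpus and the "docN#pM" tag scheme are assumptions; Gensim 4 API, older versions use model.docvecs):

# Toy sketch: one TaggedDocument per paragraph, then query the nearest Doc-ID
# for a new sentence. Corpus and the "docN#pM" tag scheme are assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

paragraphs = [
    ["let", "math-f", "be", "a", "continuous", "function"],
    ["math-alpha", "denotes", "the", "learning", "rate"],
]
corpus = [TaggedDocument(words, ["doc0#p%d" % i])
          for i, words in enumerate(paragraphs)]
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# infer a vector for a custom sentence and ask for the most likely Doc-ID
vec = model.infer_vector(["math-f", "is", "a", "function"])
print(model.dv.most_similar([vec], topn=1))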
