Using OCR to generate Typst code based on images of math formulas as a fully client-side webapp.
The model is hosted here.
We use oxen to version-control our data. To get the `oxen` executable, run `nix develop`. Then, from the root of this repo, clone the oxen repo:
oxen clone https://hub.oxen.ai/DiracDelta/data
The datasets we use for this project will now be available in `data/`.
Detypstify uses a custom dataset, generated by transpiling im2latex-230k with pandoc and cleaning the resulting data (see `scraper/`). The final dataset is available on Kaggle.
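
To give a feel for the transpiling step, here is a minimal sketch of converting a single LaTeX formula to Typst by piping it through pandoc. It assumes a pandoc build that includes the `typst` writer is on your `PATH`. The helper names (`pandoc_command`, `wrap_inline_math`, `latex_to_typst`) are hypothetical and only for illustration; the actual pipeline in `scraper/` may work differently and also performs cleaning that is not shown here.

```python
import shutil
import subprocess


def pandoc_command() -> list[str]:
    """Build the pandoc invocation: read LaTeX on stdin, write Typst to stdout."""
    return ["pandoc", "--from", "latex", "--to", "typst"]


def wrap_inline_math(formula: str) -> str:
    """Wrap a bare LaTeX formula in $...$ so pandoc parses it as math.

    Assumption: the input is a bare formula without its own math delimiters.
    """
    return f"${formula.strip()}$"


def latex_to_typst(formula: str) -> str:
    """Transpile one LaTeX formula to Typst markup via pandoc (illustrative only)."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=wrap_inline_math(formula),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc rejects the input
    )
    return result.stdout.strip()
```

Calling `latex_to_typst(r"\frac{a}{b}")` returns pandoc's Typst rendering of the fraction. Formulas that pandoc cannot parse raise `subprocess.CalledProcessError`, which is one reason a cleaning pass over the transpiled data is needed.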
- Download the dataset and unzip it
- Run `poetry run train_val_split` to perform a train validation split
- Generate `formulas.txt` by running `scripts/mk_formulas_txt.sh` on the `train` and `val` directories
- Install `pix2tex`
- Follow the instructions to generate `tokenizer`, `train.pkl`, `val.pkl`
- Create a `config.yaml` based on the template
- Train the model with `python -m pix2tex.train --config config.yaml`
- Follow the instructions to generate