Printed-Latex-Data-Generation

Python and JS tools for generating the Printed LaTeX Dataset (images of TeX formulas with labels) by parsing Cornell's KDDCUP data.

See also the KDDCUP paper.

Note: parsing for arXiv, Wikipedia and Stack Exchange sources is coming.



How to generate data

The easiest way to generate data is to run the Jupyter notebook Data generation.ipynb, located in the folder Jupyter Notebooks/.

Running it writes all output to the Data folder.

Final outputs

  • folder generated_png_images containing the PNG images
  • corresponding_png_images.txt: each line contains the filename of a PNG image in the folder generated_png_images
  • final_png_formulas.txt: each line contains the corresponding LaTeX formula
  • folder raw_data containing the raw downloaded data
  • folder temporary_data containing formulas from various stages of processing and the SVG images generated along the way
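
As an illustration of how these outputs fit together, here is a minimal sketch for loading image/formula pairs. It assumes that line i of corresponding_png_images.txt matches line i of final_png_formulas.txt and that both files sit directly in the Data folder next to generated_png_images. The function name load_pairs is hypothetical and not part of this repository.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(data_dir):
    """Pair each PNG path with its formula, matching the two files line by line.

    Assumption: the two label files are aligned, so line i of one
    belongs to line i of the other.
    """
    data_dir = Path(data_dir)
    names = read_lines(data_dir / "corresponding_png_images.txt")
    formulas = read_lines(data_dir / "final_png_formulas.txt")
    if len(names) != len(formulas):
        raise ValueError(
            f"{len(names)} image names but {len(formulas)} formulas; files are not aligned"
        )
    image_dir = data_dir / "generated_png_images"
    return [(image_dir / name.strip(), formula) for name, formula in zip(names, formulas)]
```

Calling load_pairs("Data") after the notebook has finished should give a list that can be fed into a training pipeline. The length check is there to catch a run that was interrupted while one of the two files was being written.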


You can download a prebuilt 180k dataset from here.

Note: This code is very ad hoc and requires tinkering with the source.

Dependencies

  1. Tested with Python 3.9.7 and Anaconda version 2021.11 (https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh)

  2. pip install opencv-python

  3. pip install smart_open

  4. For TeX to SVG:

    sudo apt install nodejs npm
    sudo npm install --global mathjax-node-cli

  5. For SVG to PNG:

Linux:

https://ubuntu.pkgs.org/20.04/ubuntu-universe-arm64/librsvg2-bin_2.48.2-1_arm64.deb.html

sudo apt install librsvg2-bin


macOS: Download Inkscape, also see here
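
Once mathjax-node-cli is installed, a quick way to check the TeX to SVG step is to call its tex2svg command from Python. The sketch below is not the code used in this repository. It assumes that tex2svg is on PATH and prints the SVG markup to standard output. The helper names build_tex2svg_command and render_svg are hypothetical.

```python
import shutil
import subprocess


def build_tex2svg_command(formula, executable="tex2svg"):
    """Build the argument list for one tex2svg call.

    Passing a list (not a shell string) avoids quoting problems with
    backslashes and braces in the formula.
    """
    return [executable, formula]


def render_svg(formula, svg_path, executable="tex2svg"):
    """Render one TeX formula to an SVG file by capturing the command's stdout.

    Assumption: the executable writes the SVG to standard output.
    """
    if shutil.which(executable) is None:
        raise RuntimeError(f"{executable} not found on PATH; install mathjax-node-cli first")
    result = subprocess.run(
        build_tex2svg_command(formula, executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the formula cannot be rendered
    )
    with open(svg_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return svg_path
```

For example, render_svg(r"\frac{a}{b}", "test.svg") should leave a test.svg file in the current directory if the installation works.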



Contents

  • Printed_Tex.py
    • Main module
  • download_data_utils.py
    • Contains tools for downloading TeX tars, unpacking them and parsing them.
  • configs.py
    • Contains paths and command-line script commands.
  • third_party/
    • Contains KaTeX for parsing LaTeX formulas
  • preprocess_formulas.py and preprocess_formulas.js
    • Collection of tools for handling and parsing LaTeX formulas
  • svg_to_png.py
    • Functions to convert LaTeX formulas to SVG images using MathJax
  • png_to_svg.py
    • Functions to convert SVG images of formulas to PNG images using Inkscape on macOS (Darwin) and rsvg-convert on all other systems.
  • Data/
    • Contains the generated_png_images/ folder, corresponding_png_images.txt and final_png_formulas.txt. Also contains the temporary folder temporary_data (formulas from various stages of processing and generated SVG images) and raw_data, where the raw data is downloaded.
  • Jupyter Notebooks
    • Contains examples of generating data using Jupyter notebooks
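
The platform switch for the SVG to PNG step can be sketched as follows. This is an illustration only, not the repository's implementation. The function name svg_to_png_command is hypothetical, and the Inkscape flags assume Inkscape 1.x (older 0.9x releases use different export options).

```python
import platform


def svg_to_png_command(svg_path, png_path, system=None):
    """Return the converter command for the current (or given) operating system.

    Inkscape is used on macOS (platform.system() == "Darwin"),
    rsvg-convert everywhere else.
    """
    system = system or platform.system()
    if system == "Darwin":
        # Assumption: Inkscape 1.x command-line syntax.
        return [
            "inkscape",
            "--export-type=png",
            f"--export-filename={png_path}",
            str(svg_path),
        ]
    # rsvg-convert ships with the librsvg2-bin package on Debian/Ubuntu.
    return ["rsvg-convert", "-o", str(png_path), str(svg_path)]
```

The returned list can be passed to subprocess.run(..., check=True). Keeping the command construction separate from the call makes it easy to print the command and adjust it, which helps given how much tinkering this code needs.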


Based on https://github.com/Miffyli/im2latex-dataset
