chaya's People

Contributors

shreevatsa

chaya's Issues

Mark for non-breaking hyphen

Maybe we can add a mark (the ProseMirror term for things like "bold" and "italic") for end-of-line hyphens, so that they are hidden in reading view or when chunks wrap.
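A minimal sketch of what such a mark could look like. The mark name `eolHyphenMark` and the CSS class `eol-hyphen` are made up for illustration; the object only follows the shape of a ProseMirror `MarkSpec` (`inclusive`, `parseDOM`, `toDOM`) and is written as a plain literal so it has no dependencies.

```typescript
// Hypothetical mark for end-of-line hyphens; the name and CSS class are
// assumptions, not something that exists in chaya today.
const eolHyphenMark = {
  // Typing right after the hyphen should not extend the mark.
  inclusive: false,
  // Recognize the mark when parsing HTML that contains the span.
  parseDOM: [{ tag: "span.eol-hyphen" }],
  // Render the marked text inside a span; 0 is the "content hole".
  toDOM(): [string, { class: string }, 0] {
    return ["span", { class: "eol-hyphen" }, 0];
  },
};

// Reading view could then hide the marked hyphens with a CSS rule such as:
//   .reading-mode .eol-hyphen { display: none; }
```

The spec would go under `marks` when constructing the schema; hiding is left to CSS so the hyphen itself stays in the document.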

Joining lines into regions (chunks)

Right now, all the text from each page of OCR is processed into a single blob of text. We should instead use the line-level information and make smaller regions.

For now / MVP, it would be a good start to just get paragraphs into the output, rather than a single paragraph with all the text:

chaya/main.ts

Lines 308 to 309 in d60d4d2

const { data: { text }, } = await worker.recognize(img.src);
const pageNode = schema.node('region', { pageNum: i, pageImageNode: img }, schema.text(text));

chaya/main.ts

Lines 344 to 346 in d60d4d2

const text = ocrResponse.fullTextAnnotation.text;
const pageNode = schema.node('region', { pageNum: i, pageImageNode: img }, schema.text(text));
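As a first step towards paragraphs, the page text could be split before creating nodes. This is only a sketch: it assumes the OCR text separates paragraphs with blank lines, which may not hold for every engine, and the helper name `splitIntoParagraphs` is made up.

```typescript
// Split OCR text into paragraphs.
// Assumption: paragraphs are separated by one or more blank lines.
function splitIntoParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.trim())
    // Drop empty pieces: ProseMirror does not allow empty text nodes.
    .filter((paragraph) => paragraph.length > 0);
}

// Each paragraph could then become its own 'region' node, e.g.
//   splitIntoParagraphs(text).map((p) =>
//     schema.node('region', { pageNum: i, pageImageNode: img }, schema.text(p)));
```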

Image view options

Add an entry to the ProseMirror menu with three options:

  • No images (reading mode)
  • Chunk by chunk (default)
  • Line by line (for re-editing)
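One way to think about the three modes is as a choice of which image crops to show for a chunk. A rough sketch, assuming each line carries a bounding box; the names `ImageMode`, `CropBox` and `imageCrops` are hypothetical.

```typescript
type ImageMode = "none" | "chunk" | "line";

interface CropBox { xmin: number; ymin: number; xmax: number; ymax: number }

// Smallest box containing all the given boxes (expects a non-empty array).
function unionBox(boxes: CropBox[]): CropBox {
  return boxes.reduce((acc, b) => ({
    xmin: Math.min(acc.xmin, b.xmin),
    ymin: Math.min(acc.ymin, b.ymin),
    xmax: Math.max(acc.xmax, b.xmax),
    ymax: Math.max(acc.ymax, b.ymax),
  }));
}

// Which image crops to display for one chunk, given its lines' boxes.
function imageCrops(mode: ImageMode, lineBoxes: CropBox[]): CropBox[] {
  if (mode === "none" || lineBoxes.length === 0) return []; // reading mode
  if (mode === "line") return lineBoxes;                    // one crop per line
  return [unionBox(lineBoxes)];                             // one crop per chunk
}
```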

OCR alternatives

Add an option to use Google OCR (bring your own key) instead of Tesseract.js.

PM save/load

Before getting further into the code, it would be good to implement features to:

  • save the generated PM doc into a .sc file, and
  • load a previously saved PM doc, instead of doing OCR to generate pages from scratch.

Part of this is UI work, and part is the actual serialization / deserialization.
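For the serialization part, ProseMirror nodes already provide `toJSON()`, and `Node.fromJSON(schema, json)` rebuilds a node. A sketch of wrapping that JSON in a small envelope follows; the envelope fields (`format`, `schemaVersion`) and the function names are assumptions, not an existing file format.

```typescript
// Hypothetical on-disk envelope around the JSON produced by doc.toJSON().
interface SavedDoc {
  format: "chaya";
  schemaVersion: number;
  doc: unknown;
}

// Turn the document JSON into the text that gets written to the file.
function serializeDoc(docJson: unknown, schemaVersion: number): string {
  const saved: SavedDoc = { format: "chaya", schemaVersion, doc: docJson };
  return JSON.stringify(saved);
}

// Parse file text back into the envelope, rejecting anything unexpected.
function deserializeDoc(text: string): SavedDoc {
  const parsed: unknown = JSON.parse(text);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Not a saved document");
  }
  const candidate = parsed as Partial<SavedDoc>;
  if (
    candidate.format !== "chaya" ||
    typeof candidate.schemaVersion !== "number" ||
    !("doc" in candidate)
  ) {
    throw new Error("Not a saved document");
  }
  return candidate as SavedDoc;
}
```

On load, the `doc` field would be handed to `Node.fromJSON` with the current schema instead of running OCR again.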

Ungroup button

Like the button to group lines into chunks, there should be a button to ungroup a chunk into single-line chunks.
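On the data side, ungrouping is a simple mapping. A sketch on a simplified chunk shape (the `LineChunk` interface is an assumption; the real document uses ProseMirror nodes):

```typescript
// Simplified stand-in for a chunk: a page number plus its lines of text.
interface LineChunk {
  pageNum: number;
  lines: string[];
}

// Split one chunk into single-line chunks, keeping the page number.
function ungroupChunk(chunk: LineChunk): LineChunk[] {
  return chunk.lines.map((line) => ({ pageNum: chunk.pageNum, lines: [line] }));
}
```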

Reconsider highlighting of OCR-recognized words

This was implemented for #15 and is working. But I'm having second thoughts about this feature:

Pro

  • Makes it easier to see which words got recognized by the OCR, and which got missed

Cons

  • The size of the .chaya files gets much larger (TODO: quantify), because of having to store bounding box for each word
  • In-browser memory usage increases (TODO: quantify), for the same reason. This may make editing sluggish (or even cause the tab to crash) for large PDFs or on devices with less memory.
  • Code complexity increases (not just the code to highlight these areas, but also the interaction between pdf.js output being at the page level vs. this information being at the line level: e.g. when loading from file, we cannot simply add one chunk at a time, but need to accumulate chunks until a page is done)
  • It only shows which words were recognized in the initial OCR response, which is something that may not remain meaningful after manual editing
  • It's not clear how useful it is: how valuable is "not recognized", when anything else could just as well be "incorrectly recognized"? (OCR systems probably differ in whether they will always output something, or will just leave out uncertain words)
  • A deficiency of the current implementation: The words unrecognized by OCR actually get dimmed (when in fact we should probably highlight them more)
  • Subjective: Looks ugly?

Alternatives

Some alternative ways to show which words were recognized:

  • Show lines with tight fit, i.e. only display the part of each line that corresponds to the recognized words (This is dangerous as it may leave out precisely the words we want to catch)
  • Indent each line of the editable text by an indent corresponding to the first recognized word in the image. This is more likely to place recognized text directly under the corresponding image (at least at roughly similar zoom sizes)
  • That thing that scribeocr.com does, of superimposing the text (rendered in some font with matching font metrics) onto the image.

Auto-detect chunks

Use line widths to detect chunks (example). Idea by Andrew Ollett.

Edit: Even a heuristic of "group together adjacent lines if they have (roughly) the same xmin and xmax" may work.
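A sketch of that heuristic, assuming each line has `xmin` and `xmax` in the same units; the function name and the `tolerance` parameter are made up, and a good tolerance value would need experimenting.

```typescript
interface LineExtent { xmin: number; xmax: number }

// Group adjacent lines whose left and right edges are within `tolerance`
// of the previous line's edges. Lines are expected in reading order.
function groupLinesByExtent<T extends LineExtent>(lines: T[], tolerance: number): T[][] {
  const groups: T[][] = [];
  let current: T[] = [];
  let prev: T | undefined;
  for (const line of lines) {
    const similar =
      prev !== undefined &&
      Math.abs(prev.xmin - line.xmin) <= tolerance &&
      Math.abs(prev.xmax - line.xmax) <= tolerance;
    if (!similar) {
      // Start a new chunk whenever the extent changes noticeably.
      current = [];
      groups.push(current);
    }
    current.push(line);
    prev = line;
  }
  return groups;
}
```

Note that an indented first line or a short last line of a paragraph would start a new group under this rule, so the result is probably best treated as a suggestion to be fixed up with the group/ungroup buttons.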

Release versions

Before making this public:

  • A page /versions/ that links to several index.html files (chaya-202401.html etc)
    • Manually created
  • Say "This is version [N]. Older versions are here" and link to that page.
    • At least, link to the page.
  • Code to check that schema versions match (otherwise redirect?)
  • Some process for building version files (maybe just before changing schema).
    • Manual instructions in Makefile
  • Some message about "use at your own risk, no guarantee of future compatibility, just save this file locally and use it".
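The schema-version check could be as small as the following sketch. The function name, the result type and the message wording are assumptions; only the `/versions/` path comes from the list above.

```typescript
// Page that links to the older builds (path taken from the plan above).
const VERSIONS_PAGE = "/versions/";

type SchemaCheck = { ok: true } | { ok: false; message: string };

// Compare the schema version stored in a file with the one this build expects.
function checkSchemaVersion(fileVersion: number, appVersion: number): SchemaCheck {
  if (fileVersion === appVersion) return { ok: true };
  return {
    ok: false,
    message:
      `This file uses schema version ${fileVersion}, but this page expects ` +
      `${appVersion}. Other versions are listed at ${VERSIONS_PAGE}.`,
  };
}
```

Whether a mismatch should only show the message or actually redirect is left open here, as in the list.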

Get started

  • Accompanying data file that can be saved
    • Export to print?
  • Single file that holds the PDF bounding boxes inside it.

Initial ProseMirror setup

Looks like I need to install the packages first:

corepack enable
yarn init -2
yarn add prosemirror-state
yarn add pdfjs-dist

and build it:

# Build JS.
js-dev:
	npx esbuild main.ts --outfile=main.js --bundle --watch --format=esm

# Build JS for production.
js-prod:
	npx esbuild main.ts --outfile=main.js --bundle --minify --format=esm

Highlight (un)recognized words

Suggested by Suhas: when words get skipped by the OCR, it would be useful to know.

Maybe we can draw rectangles around recognized words, or increase their relative brightness/opacity (or whatever), so that unrecognized words get noticed.
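For the rectangles, one option is to position an overlay element over each recognized word. A sketch that converts a word's bounding box into percentage-based CSS values relative to the displayed image region; the `Bbox` shape and the function name are assumptions for illustration.

```typescript
interface Bbox { x0: number; y0: number; x1: number; y1: number }

interface OverlayStyle { left: string; top: string; width: string; height: string }

// Position of a word's rectangle inside its container (e.g. a line image),
// as percentages, so it stays aligned when the image is scaled.
function wordOverlayStyle(word: Bbox, container: Bbox): OverlayStyle {
  const w = container.x1 - container.x0;
  const h = container.y1 - container.y0;
  return {
    left: `${(100 * (word.x0 - container.x0)) / w}%`,
    top: `${(100 * (word.y0 - container.y0)) / h}%`,
    width: `${(100 * (word.x1 - word.x0)) / w}%`,
    height: `${(100 * (word.y1 - word.y0)) / h}%`,
  };
}
```

The returned values could be assigned to an absolutely positioned element inside a relatively positioned wrapper around the image.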

More prominent warning when creating a new file

A user (my mother) did a few pages, saved them, then (on a later day) opened the PDF file and ended up re-doing OCR, not knowing that she should have loaded from the saved file. I think we can make the choice between "saved" and "new" more prominent, and then put the OCR options under New.

Chunk labels

Continuing discussion from #7 (comment) — we want at least these types:

  • heading (displayed as h1, h2 etc)
  • paragraph (lines will be displayed inline-block or whatever, with wrapping) (unless in line-by-line mode)
  • verse (lines will be centered)

and a "footnote label", which is blank by default. (The point is that a single footnote may have multiple chunks, like paragraphs and verses.)
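A sketch of how these labels might be represented and mapped to display. Everything here is illustrative: the `headingLevel` field, the default of "paragraph", the tag choices and the CSS class names are assumptions.

```typescript
type ChunkType = "heading" | "paragraph" | "verse";

interface ChunkLabel {
  type: ChunkType;
  headingLevel: number;  // only meaningful for headings (assumption)
  footnoteLabel: string; // "" means the chunk is not part of a footnote
}

// Hypothetical defaults: a plain paragraph outside any footnote.
const defaultChunkLabel: ChunkLabel = { type: "paragraph", headingLevel: 1, footnoteLabel: "" };

// Decide the tag and CSS classes used to display a chunk.
function chunkPresentation(
  label: ChunkLabel,
  lineByLine: boolean,
): { tag: string; className: string } {
  switch (label.type) {
    case "heading":
      return { tag: `h${label.headingLevel}`, className: "chunk-heading" };
    case "verse":
      return { tag: "div", className: "chunk-verse lines-centered" };
    case "paragraph":
      return {
        tag: "p",
        // Lines wrap inline unless the line-by-line mode is active.
        className: lineByLine ? "chunk-paragraph lines-block" : "chunk-paragraph lines-inline",
      };
  }
}
```

Chunks sharing the same non-empty `footnoteLabel` would then be understood as belonging to one footnote.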

Break apart lines

Sometimes what is stored as a single line actually spans multiple lines of text, and right now there's no way to insert a line break within a line.

We could either:

  • Change the schema to have a line be a <p>, so that it can contain line breaks internally.
  • Allow re-drawing lines and re-doing OCR (#8).

Manually incorporate prosemirror-example-setup

Right now I'm using keymap and menu directly.

chaya/main.ts

Line 13 in 829a688

import { buildKeymap, buildMenuItems } from "prosemirror-example-setup";

Doing it manually will hopefully give me some experience, possibly help with all of #7 (how the join command is working) and #9 (save button in the menu) and #10 (menu for switching between image modes).
