shreevatsa / chaya

HTML 22.21% Makefile 0.99% TypeScript 73.46% Python 1.14% JavaScript 2.20%

chaya's Introduction

Scan companion.

Serving at https://chaya.shreevatsa.net/

chaya's Issues

Prevent clicking on "create new" till PDF file has been loaded

Mark for non-breaking hyphen

Maybe we can add a mark (the ProseMirror term for things like "bold" and "italic") that can be applied to end-of-line hyphens. In reading view, or whenever chunks wrap, the marked hyphens should be hidden.
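
A rough sketch of what such a mark could look like, written as a plain object in the shape of a ProseMirror `MarkSpec`. The names `eolHyphenMark` and `eol-hyphen` are placeholders and not taken from the code; the actual hiding would be left to a stylesheet rule for reading view.

```typescript
// Hypothetical mark for end-of-line hyphens. The object follows the shape of a
// ProseMirror MarkSpec and would go into the `marks` section of the schema.
const eolHyphenMark = {
  // Text typed right after the hyphen should not inherit the mark.
  inclusive: false,
  // Recognise the mark again when parsing HTML.
  parseDOM: [{ tag: "span.eol-hyphen" }],
  // Render as a span; the trailing 0 is ProseMirror's "content hole".
  toDOM(): [string, { class: string }, 0] {
    return ["span", { class: "eol-hyphen" }, 0];
  },
};
```

With this in place, a rule such as `.reading-view .eol-hyphen { display: none }` (class names again assumed) could hide the hyphens without touching the document itself.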

Joining lines into regions (chunks)

Right now, all the text from each page of OCR is processed into a single blob of text. We should instead use the line-level information and make smaller regions.

For now / MVP, it would be a good start to just get paragraphs into the output, rather than a single paragraph with all the text:

chaya/main.ts

Lines 308 to 309 in d60d4d2

const { data: { text }, } = await worker.recognize(img.src);
const pageNode = schema.node('region', { pageNum: i, pageImageNode: img }, schema.text(text));

chaya/main.ts

Lines 344 to 346 in d60d4d2

 const text = ocrResponse.fullTextAnnotation.text; 

 const pageNode = schema.node('region', { pageNum: i, pageImageNode: img }, schema.text(text));
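
For the MVP, one possible approach is to split the page text into paragraphs before creating nodes. This is only a sketch: it assumes the OCR text separates paragraphs with blank lines, and `splitIntoParagraphs` is a made-up helper name.

```typescript
// Split a page's OCR text into paragraphs.
// Assumption: the OCR output separates paragraphs with one or more blank lines.
function splitIntoParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/) // a blank (or whitespace-only) line ends a paragraph
    .map((para) => para.trim())
    .filter((para) => para.length > 0); // drop empty pieces
}
```

Each returned string could then become its own `region` node in place of the single `schema.text(text)` call. Empty pieces are dropped because ProseMirror does not allow empty text nodes.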

Two-column mode / OCR on manually specified regions

Will likely need this

Image view options

Add to the ProseMirror menu, with three options:

No images (reading mode)
Chunk by chunk (default)
Line by line (for re-editing)

Warn on closing tab when there are unsaved changes

I think the Ambuda one had it (even after my changes).
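
A minimal sketch of the usual browser mechanism for this, the `beforeunload` event. `makeUnloadGuard` and `hasUnsavedChanges` are hypothetical names; how the editor tracks "dirty" state is not shown here.

```typescript
// Minimal event shape, so the handler can also be exercised outside a browser.
interface UnloadEventLike {
  preventDefault(): void;
  returnValue: unknown;
}

// Returns a handler for "beforeunload". `hasUnsavedChanges` is a hypothetical
// callback reporting whether the document was edited since the last save.
function makeUnloadGuard(hasUnsavedChanges: () => boolean) {
  return (event: UnloadEventLike): void => {
    if (!hasUnsavedChanges()) return; // nothing to lose: let the tab close
    event.preventDefault(); // asks the browser to show its own prompt
    event.returnValue = ""; // legacy way of requesting the same prompt
  };
}

// In the browser this would be registered once, for example:
//   window.addEventListener("beforeunload", makeUnloadGuard(() => dirty));
```

Browsers show their own generic confirmation text; the page cannot customise the message.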

Custom image size and text (font) size

It should be possible to change the sizes.

(And also of course where the text is displayed: side-by-side, or below the image.)

OCR alternatives

Option to use Google OCR (bring your own key) instead of Tesseract.js
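
A sketch of what a "bring your own key" call might look like, assuming the Cloud Vision REST endpoint `images:annotate` with `DOCUMENT_TEXT_DETECTION`. The function names are made up, and error handling is kept to a minimum.

```typescript
// The part of the Vision response used here.
interface VisionResponse {
  responses?: Array<{ fullTextAnnotation?: { text?: string } }>;
}

// Request body for the Cloud Vision `images:annotate` REST endpoint.
function buildVisionRequest(base64Image: string) {
  return {
    requests: [
      {
        image: { content: base64Image },
        // DOCUMENT_TEXT_DETECTION is the feature meant for dense text.
        features: [{ type: "DOCUMENT_TEXT_DETECTION" }],
      },
    ],
  };
}

// Sends one page image and returns the recognised text.
// `apiKey` is the user's own key, passed in by the caller.
async function recognizeWithGoogle(base64Image: string, apiKey: string): Promise<string> {
  const url = `https://vision.googleapis.com/v1/images:annotate?key=${encodeURIComponent(apiKey)}`;
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildVisionRequest(base64Image)),
  });
  if (!response.ok) throw new Error(`Vision API request failed: ${response.status}`);
  const json = (await response.json()) as VisionResponse;
  // A page with no detectable text has no fullTextAnnotation.
  return json.responses?.[0]?.fullTextAnnotation?.text ?? "";
}
```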

PM save/load

Before getting further into the code, it would be good to implement a feature to:

save the generated PM doc into a .sc file, and
load a previously saved PM doc, instead of doing OCR to generate pages from scratch.

Part of this is UI work, and part is the actual serialization / deserialization.
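
For the serialization half, a minimal sketch, assuming the document is saved via ProseMirror's `doc.toJSON()` and restored via `Node.fromJSON(schema, json)`. The `SavedFile` envelope, its `schemaVersion` field and the value `1` are assumptions made for illustration.

```typescript
// Hypothetical on-disk envelope around the ProseMirror document JSON.
interface SavedFile {
  schemaVersion: number; // would be bumped whenever the schema changes
  doc: unknown; // output of ProseMirror's `doc.toJSON()`
}

const SCHEMA_VERSION = 1; // placeholder value

// Wrap the document JSON and turn it into the file contents.
function serialize(docJson: unknown): string {
  const file: SavedFile = { schemaVersion: SCHEMA_VERSION, doc: docJson };
  return JSON.stringify(file);
}

// Parse file contents and return the document JSON, refusing files written
// for a different schema version.
function deserialize(contents: string): unknown {
  const file = JSON.parse(contents) as Partial<SavedFile>;
  if (file.schemaVersion !== SCHEMA_VERSION) {
    throw new Error(`Unsupported schema version: ${String(file.schemaVersion)}`);
  }
  return file.doc; // pass to `Node.fromJSON(schema, ...)` to rebuild the doc
}
```

Keeping a version number in the file from the start makes it possible to detect files saved under an older schema later on.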

Ungroup button

Like the button to group lines into chunks, there should be a button to ungroup a chunk into single-line chunks.
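
On a simplified data model the two operations are inverses of each other. The sketch below uses a hypothetical `Chunk` type (a plain list of line strings) and not the actual ProseMirror nodes.

```typescript
// Hypothetical, simplified model: a chunk is an ordered list of OCR lines.
interface Chunk {
  lines: string[];
}

// Ungroup: turn one chunk into one single-line chunk per line.
function ungroupChunk(chunk: Chunk): Chunk[] {
  return chunk.lines.map((line) => ({ lines: [line] }));
}

// Group (the inverse): merge adjacent chunks back into a single chunk.
function groupChunks(chunks: Chunk[]): Chunk {
  return {
    lines: chunks.reduce<string[]>((all, chunk) => all.concat(chunk.lines), []),
  };
}
```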

Reconsider highlighting of OCR-recognized words

This was implemented for #15 and is working. But I'm having second thoughts about this feature:

Pro

Makes it easier to see which words got recognized by the OCR, and which got missed

Cons

The size of the .chaya files gets much larger (TODO: quantify), because of having to store bounding box for each word
In-browser memory usage increases (TODO: quantify), for the same reason. This may make editing sluggish (or even cause tab to crash) for large PDFs or on devices with less memory.
Code complexity increases (not just the code to highlight these areas, but also the interaction between pdf.js output being at the page level vs. this information being at the line level: e.g. when loading from file, we cannot simply add one chunk at a time, but need to accumulate chunks until a page is done)
It only shows which words were recognized in the initial OCR response, which is something that may not remain meaningful after manual editing
It's not clear how useful it is: how valuable is "not recognized", when anything else could just as well be "incorrectly recognized"? (OCR systems probably differ in whether they will always output something, or will just leave out uncertain words)
A deficiency of the current implementation: The words unrecognized by OCR actually get dimmed (when in fact we should probably highlight them more)
Subjective: Looks ugly?

Alternatives

Some alternative ways to show which words were recognized:

~~Show lines with tight fit, i.e. only display the part of each line that corresponds to the recognized words~~ (This is dangerous as it may leave out precisely the words we want to catch)
Indent each line of the editable text by an indent corresponding to the first recognized word in the image. This is more likely to place recognized text directly under the corresponding image (at least at roughly similar zoom sizes)
That thing that scribeocr.com does, of superimposing the text (rendered in some font with matching font metrics) onto the image.
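
The indent alternative could be computed roughly as follows. This is a sketch with a hypothetical `Box` type; it expresses the indent as a percentage of the line width so that it scales with zoom.

```typescript
// Hypothetical bounding box, horizontal extent only, in page-image pixels.
interface Box {
  x0: number;
  x1: number;
}

// Left indent for a line's editable text, as a percentage of the width of the
// line image, based on where the first recognized word starts.
function indentPercent(lineBox: Box, firstWordBox: Box): number {
  const lineWidth = lineBox.x1 - lineBox.x0;
  if (lineWidth <= 0) return 0; // degenerate box: no indent
  const offset = firstWordBox.x0 - lineBox.x0;
  // Clamp, in case a word box sticks out past the line box.
  return Math.min(100, Math.max(0, (offset / lineWidth) * 100));
}
```

For a line spanning x = 100 to 500 whose first recognized word starts at x = 200, this gives 25, which could be applied as a left padding of 25% on the text below the image.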

Auto-detect chunks

Using line widths (example). Idea by Andrew Ollett.

Edit: Even a heuristic of "group together adjacent lines if they have (roughly) the same xmin and xmax" may work.
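
A sketch of that heuristic, assuming each line comes with `xmin` and `xmax` in pixels. The `LineBox` type and the `tolerance` parameter are made up for illustration.

```typescript
// Hypothetical horizontal extent of one OCR line, in pixels.
interface LineBox {
  xmin: number;
  xmax: number;
}

// Group adjacent lines whose xmin and xmax are both within `tolerance` pixels
// of the previous line's values.
function groupByExtent(lines: LineBox[], tolerance: number): LineBox[][] {
  const chunks: LineBox[][] = [];
  for (const line of lines) {
    const current = chunks[chunks.length - 1];
    const prev = current?.[current.length - 1];
    const similar =
      prev !== undefined &&
      Math.abs(line.xmin - prev.xmin) <= tolerance &&
      Math.abs(line.xmax - prev.xmax) <= tolerance;
    if (similar && current !== undefined) {
      current.push(line); // same extent as the previous line: same chunk
    } else {
      chunks.push([line]); // extent changed: start a new chunk
    }
  }
  return chunks;
}
```

Each line is compared with the one just before it, so an indented first line or a short last line of a paragraph would end up in its own chunk; that may or may not be acceptable.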

Release versions

Before making this public:

A page /versions/ that links to several index.html files (chaya-202401.html etc)
- Manually created
Say "This is version [N]. Older versions are here" and link to that page.
- At least, link to the page.
Code to check that schema versions match (otherwise redirect?)
Some process for building version files (maybe just before changing schema).
- Manual instructions in Makefile
Some message about "use at your own risk, no guarantee of future compatibility, just save this file locally and use it".

Get started

Accompanying data file that can be saved
- Export to print?
Single file that holds the PDF bounding boxes inside it.

Initial ProseMirror setup

Looks like I need to install the package etc:

corepack enable
yarn init -2
yarn add prosemirror-state
yarn add pdfjs-dist

and build it:

# Build JS.
js-dev:
	npx esbuild main.ts --outfile=main.js --bundle --watch --format=esm

# Build JS for production.
js-prod:
	npx esbuild main.ts --outfile=main.js --bundle --minify --format=esm

Highlight (un)recognized words

Suggested by Suhas: when words get skipped by the OCR, would be useful to know.

Maybe we can draw rectangles around recognized words, or increase their relative brightness/opacity (or whatever), so that unrecognized words get noticed.

Button to delete chunk

Had mentioned it earlier in #17 but tracking it separately as that issue got closed.

More prominent warning when creating a new file

A user (my mother) did a few pages, saved them, then (on a later day) opened the PDF file and ended up re-doing OCR, not knowing that she should have loaded from the saved file. I think we can be more prominent about "saved" vs. "new", and then put the OCR options under New.

Chunk labels

Continuing the discussion from #7 (comment), we want at least these types:

heading (displayed as h1, h2 etc)
paragraph (lines will be displayed inline-block or whatever, with wrapping) (unless in line-by-line mode)
verse (lines will be centered)

and a "footnote label", which is blank by default. (The point is that a single footnote may have multiple chunks, like paragraphs and verses.)
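
As a sketch, the labels could be modelled as attributes along these lines. The names `ChunkType`, `ChunkAttrs`, `headingLevel` and `footnoteLabel` are assumptions, not the actual schema.

```typescript
// The three chunk types named above.
type ChunkType = "heading" | "paragraph" | "verse";

// Hypothetical attribute set for a chunk node.
interface ChunkAttrs {
  type: ChunkType;
  headingLevel?: number; // only meaningful for headings (h1, h2, ...)
  footnoteLabel: string; // "" means the chunk is not part of a footnote
}

// New chunks start with a blank footnote label.
function newChunkAttrs(type: ChunkType): ChunkAttrs {
  return { type, footnoteLabel: "" };
}

// All chunks of one footnote: those sharing the same non-empty label.
function chunksOfFootnote(chunks: ChunkAttrs[], label: string): ChunkAttrs[] {
  return label === "" ? [] : chunks.filter((c) => c.footnoteLabel === label);
}
```

Keeping the footnote label separate from the type lets one footnote contain, say, a paragraph followed by a verse.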

Break apart lines

Sometimes what the OCR treats as a single line actually spans multiple lines on the page, and right now there's no way to insert a line break within a line.

We could either:

Change the schema to have a line be a <p>, so that it can contain line breaks internally.
Allow re-drawing lines and re-doing OCR (#8).

Manually incorporate prosemirror-example-setup

Right now I'm using keymap and menu directly.

chaya/main.ts

Line 13 in 829a688

import { buildKeymap, buildMenuItems } from "prosemirror-example-setup";

Doing it manually will hopefully give me some experience, possibly help with all of #7 (how the join command is working) and #9 (save button in the menu) and #10 (menu for switching between image modes).
