qurator-spk / neat
Named entity annotation tool
License: Apache License 2.0
Here is one based on the Unicode subset from the OCR-D Guidelines.
Use Papa Parse streaming with chunk (https://www.papaparse.com/faq#streaming), preferably combined with pagination at sentence boundaries (= newlines).
Hello,
Looking for a quick start, I stumbled over the links provided at the top of both example.tsv files (https://content.staatsbibliothek-berlin.de/dc/PPN757123368-00000008/left,top,width,height/full/0/default.jpg, https://content.staatsbibliothek-berlin.de/dc/PPN757123368-00000008/left,top,width,height/full/0/default.jpg), but unfortunately they currently lead nowhere.
OCR results improve when archival master images are used. However, the image dimensions of the archival master images differ from the scaled-down images available on the web, which breaks the facsimile snippet feature.
Accordingly, we also need to transform the coordinates in cli.py.
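A minimal sketch of such a coordinate transformation, assuming bounding boxes are stored as left, top, width, height in pixels and that both image sizes are known. The names Box and scale_box are hypothetical and are not taken from cli.py.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box in pixels (hypothetical structure, not from cli.py)."""
    left: int
    top: int
    width: int
    height: int


def scale_box(box: Box, src_size: Tuple[int, int], dst_size: Tuple[int, int]) -> Box:
    """Map a box from an image of src_size (width, height) to one of dst_size."""
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    return Box(
        left=round(box.left * sx),
        top=round(box.top * sy),
        width=round(box.width * sx),
        height=round(box.height * sy),
    )


# Example: OCR ran on a 6000x8000 master, the web image is 3000x4000.
master_box = Box(left=1200, top=800, width=400, height=60)
web_box = scale_box(master_box, src_size=(6000, 8000), dst_size=(3000, 4000))
```

Scaling each axis separately keeps the sketch usable even if the web image does not preserve the aspect ratio exactly.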
I know we have been discussing this before, but annotating on the new documents has shown that OCR often leaves out several sections or lines. This makes it necessary to have an overview of the full page. We can display the page on ZEFYS in most cases, but it costs extra time and is not possible with the pages that are not online yet. Therefore, would it be possible to link the full scan (e.g. on the upper display of the filename in neath)?
It would be practical if pressing ENTER confirmed the current value in the token line, like clicking on OK.
For testing purposes on annotating events.
...and throw a warning alert
Clicking on a string (token) should trigger different actions based on the different rows:
LOCATION: display facsimile snippet
POSITION: --undefined--
TOKEN: edit text (for OCR error correction)
NE-TAG: drop-down menu of supported NE-Tags
NE-EMB: drop-down menu of supported NE-Tags
OCR coordinates via IIIF API
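As an illustration, the IIIF Image API addresses a rectangular region as x,y,w,h in the URL path. The sketch below assumes that the URL at the top of the example TSV files is a template whose literal segment left,top,width,height is meant to be replaced with pixel coordinates. The function name snippet_url is hypothetical.

```python
def snippet_url(template: str, left: int, top: int, width: int, height: int) -> str:
    """Fill the region placeholder of a IIIF image URL template.

    Assumption: the template contains the literal text
    'left,top,width,height' where the IIIF region belongs.
    """
    region = f"{left},{top},{width},{height}"
    return template.replace("left,top,width,height", region)


TEMPLATE = (
    "https://content.staatsbibliothek-berlin.de/dc/"
    "PPN757123368-00000008/left,top,width,height/full/0/default.jpg"
)
url = snippet_url(TEMPLATE, left=600, top=400, width=200, height=30)
```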
We noticed that OCR corrections to tokens that combine strings with punctuation (e.g. "31.") are not being saved, even if they were corrected several times.
Sometimes it is necessary to either merge or split tokens, e.g. due to segmentation errors.
There are 2 basic operations that should be supported here:
a) merge: concatenates the text content of the current row with that of the above row, deletes the row and updates the offsets
b) split: adds a new row below with the text content of the current row copied there and updates the offsets
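The two operations above could be sketched as follows. This is only an illustration: it assumes each row carries a token text and a single running position number, and that "updating the offsets" means renumbering those positions. Row, merge_up and split are hypothetical names, not the tool's actual code.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Row:
    position: int  # assumption: one running number per row
    token: str


def renumber(rows: List[Row]) -> None:
    """Reassign positions 1..n after a structural change."""
    for i, row in enumerate(rows, start=1):
        row.position = i


def merge_up(rows: List[Row], index: int) -> None:
    """Append the token at index to the row above, then delete the row."""
    if not 0 < index < len(rows):
        raise IndexError("no row above to merge into")
    rows[index - 1].token += rows[index].token
    del rows[index]
    renumber(rows)


def split(rows: List[Row], index: int) -> None:
    """Insert a copy of the row at index directly below it."""
    rows.insert(index + 1, replace(rows[index]))
    renumber(rows)


rows = [Row(1, "Bres"), Row(2, "lau"), Row(3, "ist")]
merge_up(rows, 1)  # "Bres" + "lau" become one token
split(rows, 0)     # the merged token is duplicated for manual editing
```

After a split, the annotator would still edit both copies by hand so that each keeps only its part of the text.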
As of 57b6b7d, we don't require it as an anchor anymore, so we could just as well hide it to save screen real estate (it carries no relevant information either).
The TO DO tag is very helpful for marking ambiguous cases in the NER annotation. Maybe it would be possible to establish a similar option for marking ambiguous links for discussion in the annotation team. This would preferably be a free-text cell located next to the ID.
for artworks
About every 5th to 10th time, neither the snippet display (on hover or click) nor the save button works (Firefox environment). This leaves us needing to test the save button before starting to annotate, because otherwise there is no way to keep the changes made.
requested in 21 oct telco
In order to remain as compatible as possible with the HIPE Shared Task data format, it would be desirable that the start-sentence functionality inserts a newline to mark the sentence boundaries.
(Actually, the newline must contain 10x \t to retain the proper column structure.)
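The 10-tab requirement can be illustrated with a short sketch. It assumes the file has 11 columns (hence 10 tab separators) and that rows are held as plain strings. insert_sentence_break is a hypothetical helper, not part of the tool.

```python
from typing import List

NUM_COLUMNS = 11                      # assumption: 10 tabs separate 11 columns
EMPTY_ROW = "\t" * (NUM_COLUMNS - 1)  # a row with every cell empty


def insert_sentence_break(lines: List[str], index: int) -> List[str]:
    """Return a copy of lines with an empty TSV row inserted before lines[index]."""
    return lines[:index] + [EMPTY_ROW] + lines[index:]


lines = ["row-a", "row-b", "row-c"]   # stand-ins for real 11-column rows
with_break = insert_sentence_break(lines, 2)
```

Because the separator row still splits into the full number of cells, a TSV parser sees a consistent column count on every line.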
The delete function for the row in the position column is very handy. Since the annotation of links in the ID column requires the deletion of false links, we were wondering whether it would be possible to implement a similar function there, one that deletes not the row but the content of the corresponding cell.
...for uncertain/yet to decide entity types
First testing sessions show that, in the case of OCR correction, nearly every line needs to be altered. Therefore, navigation to the line above or below (up/down) without clicking would be crucial, including #24. Navigation within the string by left/right arrow key, as in the current version, is very helpful.
change the yellow (for highlighting) to something less saturated/more pleasing to the eyes? consider design for color vision deficiency (even though I think it is not that important, since there are labels anyway)? ( https://projects.susielu.com/viz-palette )
color palette generator: https://medialab.github.io/iwanthue
Document our processing pipeline:
two columns B | I
with dropdown of tags on each side
Decide whether/when to turn this into a webapp.
it would be desirable to display the snippets in the left column upon hovering over their location (saves a lot of clicking)
It would improve usability a lot if the image snippet were always displayed vertically aligned with the row containing the corresponding token text.
We noticed that the ending parts of the plain text (OCR) of the newspaper page in the example.tsv were missing after saving the annotation result of each session (see the different states of the file attached). We couldn't figure out the pattern yet; our best guess is that it's connected to the split/merge function.
example.tsv_states.zip
to allow generating the input format for the annotation tool from OCR output
It's already supported via keyboard command d l
In the current version, 30 lines are shown at a time, which prevents a full view unless the browser is zoomed out to 60%. A reduction to approx. 15 lines, combined with the aligned image snippet #26, would improve usability a great deal.
Links to the facsimile must open in a new tab using target=_blank, or they will trigger the reload event alert.
It would be preferable to retain the line OCR confidence scores in the tsv and do the color coding within neat rather than include the hex color code.
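To illustrate the idea, a confidence score could be turned into a color at display time along these lines. This is a Python sketch of the logic only: it assumes scores lie in the range 0 to 1 and uses a simple red-to-green gradient, neither of which is specified above. confidence_to_hex is a hypothetical name.

```python
def confidence_to_hex(conf: float) -> str:
    """Map a confidence in [0, 1] to a hex color from red (low) to green (high)."""
    conf = min(1.0, max(0.0, conf))   # clamp out-of-range scores
    red = round(255 * (1.0 - conf))
    green = round(255 * conf)
    return f"#{red:02x}{green:02x}00"


low = confidence_to_hex(0.0)
high = confidence_to_hex(1.0)
```

Keeping the raw score in the file means the palette can be changed later, for example for color vision deficiency, without regenerating the data.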
display image on mouseover over the whole row and not just by hovering over the "location" link
For continuous work on one .tsv document across different sessions, it would be handy to be able to mark the progress within that document.
This could, for example, be realized through an additional column which places a visual marker (e.g. an arrow icon) in a selected row by clicking on it.
requested in 21 oct telco
We will not distinguish between these two entity types in the annotation process.
broken as of d7590c7
A funny thing occurred in our annotation sessions, both to the student assistant and to me, on different browsers (Firefox and Edge), when we tried to correct the word "Breslau" in the file 27646518_1892-07-05_21_335_005 (it occurs twice in the third newspaper column of the document: line 4599 and line 4834):
After annotating for a while (different time spans) the page seems to "reload" but it looks more like a flicker than a reload.
The tags can no longer be accessed and the NERs can no longer be edited. After saving at this point and reopening the file, the facsimile snippet no longer appears, even though saving and editing are possible (but useless without the snippet). This can be fixed by re-opening NEATH over and over until the snippet is visible again.