qurator-spk / neat

Named entity annotation tool

License: Apache License 2.0

HTML 9.35% JavaScript 90.65%

named-entities annotation-tool annotation-guidelines qurator sonar-idh

neat's Introduction

neat: named entity annotation tool

1. Introduction

neat is a simple, browser-based tool for editing and annotating text with named entities to produce labeled data for training/testing/evaluation. It can be used to add or correct named entity labels and to correct the token text or tokenization (e.g. due to OCR/segmentation errors).

neat is developed at the Berlin State Library for data annotation in the SoNAR-IDH project and the QURATOR project.

2. User Guide

2.1 Installation

neat runs locally as a pure HTML+JavaScript webpage in your web browser. No additional software needs to be installed, but JavaScript has to be enabled in the browser.

Clone the repo using git clone https://github.com/qurator-spk/neat.git or download and extract the ZIP. Make sure you have neat.html and neat.js in the same directory and open neat.html in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.

2.2 Data format

The source data we use for annotation are OCR results in PAGE-XML format. We provide a Python tool for the transformation of OCR files in PAGE-XML into the TSV format used by neat.

The internal data format used by neat is based on the format used in the GermEval2014 Named Entity Recognition Shared Task. Text is encoded as one token per line, with name spans in the IOB2 format as tab-separated values:

the first column contains either
- # a comment to indicate the source the sentence is taken from, or
- >=1 the token position within the sentence, or
- 0 to mark sentence boundaries
the second column contains the token text
outer entity spans are encoded in the third column NE-TAG
embedded entity spans are encoded in the fourth column NE-EMB

Example (simple)

No.	TOKEN	NE-TAG	NE-EMB
# https://example.url
1	Donnerstag	O	O
2	,	O	O
3	1	O	O	
4	.	O	O	
5	Januar	O	O	
6	.	O	O		
0		O	O
1	Berliner	B-ORG	B-LOC	
2	Tageblatt	I-ORG	O	
3	.	O	O		
0		O	O
1	Nr	O	O	
2	.	O	O		
3	1	O	O	
4	.	O	O	
0		O	O
1	Seite	O	O
2	3	O	O
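
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of neat itself: the function names `parse_neat_tsv` and `iob2_spans` are hypothetical, and it assumes the four-column layout of the simple example, with a header row starting with `No.`, `#` comment rows, and rows with position `0` as sentence boundaries.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str, str]  # (token text, NE-TAG, NE-EMB)


def parse_neat_tsv(text: str) -> List[List[Token]]:
    """Group token rows into sentences; a row with position 0 ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        # Skip blank lines, "#" source comments and the header row.
        if not line.strip() or line.startswith("#") or line.startswith("No."):
            continue
        cols = line.rstrip().split("\t")
        if cols[0] == "0":
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((cols[1], cols[2], cols[3]))
    if current:
        sentences.append(current)
    return sentences


def iob2_spans(tags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for every B-/I- span in IOB2 tags."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


SAMPLE = (
    "No.\tTOKEN\tNE-TAG\tNE-EMB\n"
    "# https://example.url\n"
    "1\tBerliner\tB-ORG\tB-LOC\n"
    "2\tTageblatt\tI-ORG\tO\n"
    "3\t.\tO\tO\n"
    "0\t\tO\tO\n"
    "1\tSeite\tO\tO\n"
    "2\t3\tO\tO\n"
)

sentences = parse_neat_tsv(SAMPLE)
outer = iob2_spans([ne_tag for _, ne_tag, _ in sentences[0]])
embedded = iob2_spans([ne_emb for _, _, ne_emb in sentences[0]])
```

Here `outer` holds the spans from the NE-TAG column and `embedded` those from the NE-EMB column of the first sentence.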

For our purposes we extend this format by adding these (optional) values:

a fifth column, ID, for an identifier of the outer NE-TAG from an authority file (neat supports automatic linking for Wikidata identifiers)
a sixth column, url_id, used as a variable for IIIF Image API support (neat can embed image snippets into its interface to assist data annotation and correction if the PAGE-XML source contains word bounding boxes)
columns 7-10 store the left, right, top and bottom pixel coordinates of the image snippets

Example (full)

No.	TOKEN	NE-TAG	NE-EMB	ID	url_id	left,right,top,bottom
# https://example.url/iiif/left,right,top,bottom/full/0/default.jpg
1	Donnerstag	O	O	-	0	174,352,358,390
2	,	O	O	-	0	174,352,358,390	
3	1	O	O	-	0	367,392,361,381
4	.	O	O	-	0	370,397,352,379
5	Januar	O	O	-	0	406,518,358,386
6	.	O	O	-	0	406,518,358,386	
0
1	Berliner	B-ORG	B-LOC	Q455014	0	816,984,358,388
2	Tageblatt	I-ORG	O	Q455014	0	1005,1208,360,387
3	.	O	O	-	0	1005,1208,360,387
0
1	Nr	O	O	-	0	1237,1288,360,382
2	.	O	O	-	0	1237,1288,360,382
3	1	O	O	-	0	1304,1326,361,381
4	.	O	O	-	0	1304,1326,361,381
0
1	Seite	O	O	-	0	1837,1926,361,392
2	3	O	O	-	0	1939,1967,364,385
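
As a rough illustration of how the extended columns could be used, the sketch below parses one full-format row and derives a Wikidata link and an IIIF Image API request for the word's bounding box. It is not neat's own code: `FullRow`, `parse_full_row`, `wikidata_url` and `iiif_snippet_url` are hypothetical names, and the base URL is a placeholder. It assumes that `-` means "no identifier" and that the IIIF region is given as x,y,width,height, so width and height are computed from the left/right and top/bottom values.

```python
from typing import NamedTuple, Optional


class FullRow(NamedTuple):
    position: int
    token: str
    ne_tag: str
    ne_emb: str
    entity_id: str
    url_id: int
    left: int
    right: int
    top: int
    bottom: int


def parse_full_row(line: str) -> FullRow:
    """Parse one token row of the extended format."""
    cols = line.rstrip().split("\t")
    # Accept coordinates as one comma-separated field (as in the example)
    # or as four separate tab-separated fields.
    left, right, top, bottom = (int(v) for v in ",".join(cols[6:]).split(","))
    return FullRow(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                   int(cols[5]), left, right, top, bottom)


def wikidata_url(entity_id: str) -> Optional[str]:
    """Return the Wikidata page for a Q-identifier, or None for '-'."""
    if entity_id.startswith("Q") and entity_id[1:].isdigit():
        return "https://www.wikidata.org/wiki/" + entity_id
    return None


def iiif_snippet_url(base_url: str, row: FullRow) -> str:
    """Build an IIIF Image API URL: {base}/{region}/{size}/{rotation}/{quality}.{format}."""
    width = row.right - row.left
    height = row.bottom - row.top
    region = f"{row.left},{row.top},{width},{height}"
    return f"{base_url}/{region}/full/0/default.jpg"


row = parse_full_row("1\tBerliner\tB-ORG\tB-LOC\tQ455014\t0\t816,984,358,388")
link = wikidata_url(row.entity_id)
snippet = iiif_snippet_url("https://example.url/iiif", row)
```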

2.3 Navigation

neat can be used with either a keyboard or a mouse, but for ergonomic reasons we strongly recommend using the key combinations below.

Keyboard

Key Combination	Action
Left	Move one cell left
Right	Move one cell right
Up	Move one row up
Down	Move one row down
PageDown	Move page down
PageUp	Move page up
Ctrl+Up	Move entire table one row up
Ctrl+Down	Move entire table one row down
----------	--------------------------------------------
s t	Start new sentence in current row
m e	Merge current row with row above
s p	Create copy of current row
d l	Delete current row
----------	--------------------------------------------
backspace	Set NE-TAG / NE-EMB to `O`
b p	Set NE-TAG / NE-EMB to `B-PER`
b l	Set NE-TAG / NE-EMB to `B-LOC`
b o	Set NE-TAG / NE-EMB to `B-ORG`
b w	Set NE-TAG / NE-EMB to `B-WORK`
b c	Set NE-TAG / NE-EMB to `B-CONF`
b e	Set NE-TAG / NE-EMB to `B-EVT`
b t	Set NE-TAG / NE-EMB to `B-TODO`
i p	Set NE-TAG / NE-EMB to `I-PER`
i l	Set NE-TAG / NE-EMB to `I-LOC`
i o	Set NE-TAG / NE-EMB to `I-ORG`
i w	Set NE-TAG / NE-EMB to `I-WORK`
i c	Set NE-TAG / NE-EMB to `I-CONF`
i e	Set NE-TAG / NE-EMB to `I-EVT`
i t	Set NE-TAG / NE-EMB to `I-TODO`
----------	--------------------------------------------
enter	Edit TOKEN or ID
esc	Close TOKEN or ID edit field without applying changes
----------	--------------------------------------------
l a	Add one display row
l r	Remove one display row (minimum is 5)
----------	--------------------------------------------

Mouse

use mouse wheel to scroll up and down
left-click << and >> to move 15 rows up or down
left-click O in the NE-TAG or NE-EMB column to open a drop-down menu and subsequently select any of the supported NE-Tags to tag a token or change an existing tag
left-click the NE-TAG or NE-EMB column and select O to remove a tag
left-click the TOKEN column to edit the token text
left-click the POSITION and select split from the drop-down menu to create a copy of the current row below
left-click the POSITION and select merge from the drop-down menu to merge the current row with the row above
left-click the POSITION and select start-sentence from the drop-down menu to mark the start of a new sentence

2.4 Saving progress

neat runs entirely locally in the browser, so it cannot automatically save your changes to disk. Use the Save Changes button from time to time to save them manually.

3. Annotation Guidelines

Annotation Guidelines

neat's People

Contributors

Stargazers

Watchers

Forkers

snmnzl kba vri-ufpr r0man-ist

neat's Issues

support confirmation of OCR-changes by ENTER key

It would be practical if pressing ENTER confirmed the current value in the token line, just like clicking OK.

merge entity tags B-/I-PUB and B-/I-ART to B-WORK and I-WORK

We will not distinguish between these two entity types in the annotation process.

Webapp

Decide whether/when to turn this into a webapp.

open facsimile urls in new tab

Links to the facsimile must open in a new tab using target=_blank; otherwise they trigger the reload event alert.

delete function for ID column

The delete function for the row in the position column is very handy. Since the annotation of links in the ID column requires the deletion of false links, we were wondering whether it would be possible to implement a similar function here, which does not delete the row but the content of the corresponding cell.

[ocr] keyboard layouts

Here is one based on the Unicode subset from the OCR-D Guidelines.

ability to add identifiers for gnd/wikidata

Example Links

Hello,

Looking for a quick start, I stumbled over the links provided at the top of both example.tsv files (https://content.staatsbibliothek-berlin.de/dc/PPN757123368-00000008/left,top,width,height/full/0/default.jpg, https://content.staatsbibliothek-berlin.de/dc/PPN757123368-00000008/left,top,width,height/full/0/default.jpg), but unfortunately they currently lead nowhere.

Support streaming for large TSV files

Use papaparse streaming with chunk https://www.papaparse.com/faq#streaming and preferably pagination using sentence boundaries (= newlines)
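
The pagination idea can be sketched independently of Papa Parse. The following Python generators are only an illustration of reading rows lazily and cutting pages at sentence boundaries; they are not the proposed JavaScript implementation, and the names `iter_sentences` and `iter_pages` are hypothetical. A row whose first column is `0` or empty is assumed to mark a sentence boundary.

```python
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Lazily group raw TSV rows into sentences."""
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        first = line.split("\t", 1)[0].strip()
        if first in ("", "0"):  # sentence boundary
            if current:
                yield current
                current = []
            continue
        # Skip "#" source comments and the header row.
        if line.startswith("#") or line.startswith("No."):
            continue
        current.append(line)
    if current:
        yield current


def iter_pages(lines: Iterable[str], sentences_per_page: int = 50) -> Iterator[List[List[str]]]:
    """Yield pages of whole sentences so a page never splits a sentence."""
    page: List[List[str]] = []
    for sentence in iter_sentences(lines):
        page.append(sentence)
        if len(page) == sentences_per_page:
            yield page
            page = []
    if page:
        yield page


rows = ["1\ta\tO\tO", "0", "1\tb\tO\tO", "2\tc\tO\tO", "0", "1\td\tO\tO"]
pages = list(iter_pages(rows, sentences_per_page=2))
```

Because both functions are generators, passing an open file object instead of a list would process a large TSV file without loading it into memory at once.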

Coordinate transformation

OCR results improve when archival master images are used. However, the image dimensions of the archival master images differ from the scaled-down images available on the web, which breaks the facsimile snippet feature.

Accordingly, we also need to transform the coordinates in cli.py.
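
A minimal sketch of such a transformation, assuming the web image is a uniformly scaled copy of the archival master (no cropping or rotation). The function `scale_box` and its parameters are hypothetical and do not describe what cli.py actually does.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, right, top, bottom) in pixels


def scale_box(box: Box, master_size: Tuple[int, int], web_size: Tuple[int, int]) -> Box:
    """Map a box from master-image pixels to web-image pixels.

    master_size and web_size are (width, height) of the two images.
    """
    left, right, top, bottom = box
    scale_x = web_size[0] / master_size[0]
    scale_y = web_size[1] / master_size[1]
    return (round(left * scale_x), round(right * scale_x),
            round(top * scale_y), round(bottom * scale_y))


scaled = scale_box((1000, 2000, 400, 800), master_size=(4000, 6000), web_size=(2000, 3000))
```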

modification of sentence marks not possible

If a sentence start is marked manually by mistake, marking the actual sentence beginning afterwards does not restore the continuous numbering of the lines.

display image on mouseover over the whole row

display image on mouseover over the whole row and not just by hovering over the "location" link

support navigation by arrow keys in token-column

First testing sessions show that in case of OCR-correction, nearly every line needs to be altered. Therefore, navigation to the line above or below (up/down) without clicking would be crucial, including #24 . Navigation within the string by left/right arrow key as in the current version is very helpful.

document Python tools

https://github.com/cneud/ner.edith/tree/master/tools

different click actions for different rows

Clicking on a string (token) should trigger different actions depending on the column that is clicked:

LOCATION: display facsimile snippet
POSITION: --undefined--
TOKEN: edit text (for OCR error correction)
NE-TAG: drop-down menu of supported NE-Tags
NE-EMB: drop-down menu of supported NE-Tags

add entity tag B-ART

for artworks

add entity tag PUB

requested in 21 oct telco

split tagger accordion into two column layout

two columns B | I with dropdown of tags on each side

add entity tag CONF

requested in 21 oct telco

Python tools for converting OCR (PAGE.XML) to TSV

to allow generating the input format for the annotation tool from OCR output

change colors?

change the yellow (for highlighting) to something less saturated/more pleasing to the eyes? consider design for color vision deficiency (even though I think it is not that important, since there are labels anyway)? ( https://projects.susielu.com/viz-palette )

color palette generator: https://medialab.github.io/iwanthue

Screenshots

Annotation Guidelines

First draft https://github.com/qurator-spk/neath/blob/master/docs/Annotation_Guidelines.md

OCR correction is not being saved in certain cases

We noticed that OCR corrections in combinations of strings with punctuations are not being saved (e.g. "31.") even if they were corrected several times.

display facsimile snippets in the viewer/editor

detect page reload

...and throw a warning alert

include overall-scan

I know we have been discussing this before, but annotating on the new documents has shown that OCR often leaves out several sections or lines. This makes it necessary to have an overview of the full page. We can display the page on ZEFYS in most cases, but it costs extra time and is not possible with the pages that are not online yet. Therefore, would it be possible to link the full scan (e.g. on the upper display of the filename in neath)?

User Guide

First draft https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md

add "delete current row" to mouse navigation

It's already supported via keyboard command d l

limited functionality of save function and snippet display

About every 5th to 10th time, neither the snippet display (on hover or click) nor the save button works (Firefox environment). We therefore have to test the save button before starting to annotate, because otherwise there is no way to keep the changes made.

support the display of facsimile snippets

OCR coordinates via IIIF API

re-enable TOKEN editing

broken as of d7590c7

undo function

@snmnzl wrote:

Support the Ctrl-Z (Strg-Z) shortcut for undoing the latest drop-down choice.

"Breslau" problem

A funny thing occurred in our annotation sessions, both to the student assistant and to me, on different browsers (Firefox and Edge), when we tried to correct the word "Breslau" in the file 27646518_1892-07-05_21_335_005 (it occurs twice in the third newspaper column of the document: line 4599 and line 4834):

After annotating for a while (different time spans) the page seems to "reload", but it looks more like a flicker than a reload.
The tags can no longer be accessed and the NERs can no longer be edited. After saving at this point and reopening the file, the facsimile snippet no longer appears, even though saving and editing are possible (but useless without the snippet). This can be fixed by re-opening NEATH over and over until the snippet is visible again.

vertically align image snippet with active row

It would improve usability a lot if the image snippet would always be displayed vertically aligned to the row with the corresponding token text.

lines are being cut off (after saving?)

We noticed that the ending parts of the plain text (OCR) of the newspaper page in the example.tsv were missing after saving the annotation result of each session (see the different states of the file attached). We couldn't figure out the pattern yet; our best guess is that it's connected to the split/merge function.
example.tsv_states.zip

add entity tag EVENT

For testing purposes on annotating events.

show facsimile snippets on hover

it would be desirable to display the snippets in the left column upon hovering over their location (saves a lot of clicking)

support token merges/splits

Sometimes it is necessary to either merge or split tokens, e.g. due to segmentation errors.

There are 2 basic operations that should be supported here:

a) merge: concatenates the text content of the current row with that of the above row, deletes the row and updates the offsets

b) split: adds a new row below with the text content of the current row copied there and updates the offsets
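
The two operations can be sketched on a plain list of rows. This is an illustrative Python model only, not neat's JavaScript implementation: rows are assumed to be dicts with `position` and `token` keys, "offsets" is interpreted as the token position within the sentence, and the helper names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def renumber(rows: List[Row]) -> List[Row]:
    """Recompute token positions; rows with position 0 are sentence boundaries."""
    result, counter = [], 0
    for row in rows:
        if row["position"] == 0:
            counter = 0
            result.append(dict(row))
        else:
            counter += 1
            result.append(dict(row, position=counter))
    return result


def merge_with_above(rows: List[Row], index: int) -> List[Row]:
    """Append the current row's text to the row above and drop the current row."""
    if index <= 0:
        raise ValueError("the first row has no row above to merge with")
    above, current = rows[index - 1], rows[index]
    merged = dict(above, token=str(above["token"]) + str(current["token"]))
    return renumber(rows[:index - 1] + [merged] + rows[index + 1:])


def split_row(rows: List[Row], index: int) -> List[Row]:
    """Insert a copy of the current row directly below it."""
    return renumber(rows[:index + 1] + [dict(rows[index])] + rows[index + 1:])


rows = [
    {"position": 1, "token": "Tage"},
    {"position": 2, "token": "blatt"},
    {"position": 3, "token": "."},
]
merged = merge_with_above(rows, 1)
split = split_row(rows, 0)
```

Both functions return new lists and leave the input untouched, which keeps an undo step simple; after a split, the annotator would edit the two copies so that each holds its part of the token.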

insert newline with "start sentence"

In order to remain as compatible as possible with the HIPE Shared Task data format, it would be desirable for the start-sentence functionality to insert a newline to mark the sentence boundaries.

(actually, the newline must contain 10x \t to retain the proper column structure)

Provenance

Document our processing pipeline:

textline extraction @sbb_textline_detector
word segmentation + OCR @ocrd_tesserocr
Tokenization @SoMaJo
Pretagging @sbb_ner

[ocr] confidence color coding

It would be preferable to retain the line OCR confidence scores in the tsv and do the color coding within neat rather than include the hex color code.

hide LOCATION column

As of 57b6b7d, we no longer require it as an anchor, so we might as well hide it to save screen real estate (it carries no relevant information either).

add entity tag "TODO"

...for uncertain/yet to decide entity types

reduce no. of lines

In the current version, 30 lines are shown at a time, which prevents a full view unless the browser is zoomed out to 60%. A reduction to approx. 15 lines, combined with the aligned image snippet (#26), would improve usability a great deal.

support progress bar/marker

For continuous work on one .tsv document in different sessions, it would be handy to be able to mark the progress within that document.
This could, for example, be realized through an additional column that places a visual marker (e.g. an arrow icon) in a selected row when it is clicked.

extra column for NEL annotation notes

The TODO tag is very helpful for marking ambiguous cases in the NER annotation. Maybe it would be possible to establish a similar option for marking ambiguous links for discussion in the annotation team, preferably as a free-text cell located next to the ID.
