soduco / paper-ner-bench-das22

All the material (paper, code, dataset, results) of our DAS 2022 paper (OCR+NER benchmark)

Makefile 0.04% TeX 3.33% Shell 0.03% Python 1.79% R 0.07% Jupyter Notebook 92.28% C 2.21% C++ 0.24%

benchmark dataset document-analysis ner ocr

paper-ner-bench-das22's Issues

Create the experiment 2 dataset from the intersection of entries that are still valid after tag alignment

After the reference tags are aligned, some entries contain empty tags.
Extreme example: the OCR recognizes nothing, so the tags cannot be projected onto different positions.

Expected output: dataset/supervised/40-train-val-test/gold.csv with the following columns:

id
ner_xml_ref
ner_xml_pero
ner_xml_tess
annuaire
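
The filtering step described above could be sketched as follows. This is only an illustration, not the code used in the repository: the definition of an "empty tag" (an element with no text, such as `<PER></PER>`), the `PER` tag name, and the helper names `is_valid`, `keep_valid_entries` and `write_gold` are all assumptions. Only the column names come from the issue.

```python
import csv
import re

# Assumption: an "empty tag" is an element without text, e.g. <PER></PER> or <PER/>.
EMPTY_TAG = re.compile(r"<(\w+)[^>]*>\s*</\1>|<\w+[^>]*/>")
NER_COLUMNS = ("ner_xml_ref", "ner_xml_pero", "ner_xml_tess")
GOLD_COLUMNS = ("id",) + NER_COLUMNS + ("annuaire",)


def is_valid(row):
    """Keep an entry only if every NER column has text and no empty tag."""
    return all(row[c].strip() and not EMPTY_TAG.search(row[c]) for c in NER_COLUMNS)


def keep_valid_entries(rows):
    """Intersection of the entries that are valid for the reference and both OCR outputs."""
    return [row for row in rows if is_valid(row)]


def write_gold(rows, handle):
    """Write the retained entries with the expected gold.csv columns."""
    writer = csv.DictWriter(handle, fieldnames=GOLD_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

An entry is dropped as soon as one of the three variants is unusable, so that all systems are later compared on exactly the same entries.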

Clean up the repo before making it public?

README file
add a LICENCE
remove commits?
make a new release?

Harmonize the vocabulary in the paper

- Ground truth vs Gold / Gold reference.
- Models short names in figures: CmBERT, CmBERT+ptrn, SpaCy NER.
- pre-training / training / fine-tuning
- entries vs texts
- raw entries vs OCRed entries

Review abstract/title/authors and do the pre-submission

Deadline: January 15

Pre-submission done?

Rebuttal response summary

[Review 1]

1.1 ICDAR Contest in 2011 related to that topic
1.2 Qualitative analysis of influence of OCR quality on NER
1.3 Metrics used, potential biases

[Review 2]

2.1 comparing two runs of Pero OCR together, or two runs of Tesseract together
2.2 reason for not reporting Kraken performance on downstream NER
2.3 Figure 1 is never referred to in the text
2.4 When referring to figures, you sometimes write "fig. n", "figure n", and "Figure n"
2.5 You wrote Tesseract twice with a lowercase t
2.6 Experiment with artificial OCR noise in future work

[Review 3]

3.1 Section 2.3 should be a separate section about the pipeline
3.2 What are "thumbnails" on page 7
3.3 Description of NER QA is confusing
3.4 Figure 4 needs clarification

Add Kraken to experiment 2

Experiment 2 : train on clean data

- Create Spacy noisy test sets
- Create Huggingface noisy test sets
- Create Spacy clean train/dev sets
- Create Huggingface clean train/dev sets
- Create notebook : train on clean & save model
- Create notebook : eval on clean & noisy data
- Gather metrics and create figures

Annotation coordinates are valid for deskewed image (not PDF image)

Images in PDF compilation: extracted from "storage" <=> no transform except grayscale + JPEG conversion
Annotation coordinates: valid for deskewed image.
Problem: transformation is not stored with coordinates.
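
One possible way to address this would be to store the deskew parameters next to each annotation so that coordinates can be mapped back to the untransformed image. The sketch below assumes the deskew is a pure rotation around a known centre, with no crop or resize, which the issue does not confirm. The `Deskew` class and its fields are hypothetical.

```python
import math
from dataclasses import dataclass


def _rotate(x, y, cx, cy, angle_deg):
    """Rotate the point (x, y) around (cx, cy) by angle_deg degrees."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


@dataclass(frozen=True)
class Deskew:
    """Hypothetical record saved alongside the annotation coordinates."""
    angle_deg: float  # rotation applied to obtain the deskewed image
    cx: float         # rotation centre, in pixels
    cy: float

    def forward(self, x, y):
        """Original image coordinates -> deskewed image coordinates."""
        return _rotate(x, y, self.cx, self.cy, self.angle_deg)

    def inverse(self, x, y):
        """Deskewed image coordinates -> original image coordinates."""
        return _rotate(x, y, self.cx, self.cy, -self.angle_deg)
```

With such a record, an annotation box drawn on the deskewed image can be projected onto the image embedded in the PDF by applying `inverse` to each of its corners.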

Experiment 2: train on noisy data

- Fine-tune camembert & camembert-pretrained on noisy data (pero-ocr only? Both?)
- Fine-tune camembert & camembert-pretrained on noisy+clean?

Add code for OCR evaluation and NER reference sync

code for OCR evaluation
code for text normalization
code for text sanity checks
code for projecting tag positions in NER_XML_REF to NER_XML_PRED
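
As an illustration of the last item, the projection of a tag span from the reference text onto an OCR prediction could be sketched with a character-level alignment. This is not the repository's implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in aligner, and the function name `project_span` is hypothetical.

```python
from difflib import SequenceMatcher


def project_span(ref_text, pred_text, start, end):
    """Map the character span [start, end) of ref_text onto pred_text.

    Returns (new_start, new_end), or None when no character of the span
    is aligned with the prediction (for example when the OCR output is empty).
    """
    matcher = SequenceMatcher(None, ref_text, pred_text, autojunk=False)
    mapped = []
    for block in matcher.get_matching_blocks():
        # Part of the reference span that falls inside this matching block.
        lo = max(start, block.a)
        hi = min(end, block.a + block.size)
        if lo < hi:
            offset = block.b - block.a
            mapped.append((lo + offset, hi + offset))
    if not mapped:
        return None
    # Cover everything between the first and last aligned characters.
    return mapped[0][0], mapped[-1][1]
```

A span that cannot be projected at all returns `None`, which corresponds to the empty-tag case mentioned in the dataset issue above.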
