
paper-ner-bench-das22's People

Contributors

edwin-carlinet, hueynemud, jchazalon, nfabadie

paper-ner-bench-das22's Issues

Create the experiment 2 dataset from the intersection of entries that remain valid after tag alignment

After aligning the reference tags, some entries contain empty tags.
Extreme example: when the OCR recognizes nothing, the tags cannot be projected onto different positions.

Expected output: dataset/supervised/40-train-val-test/gold.csv with the following columns:

  • id
  • ner_xml_ref
  • ner_xml_pero
  • ner_xml_tess
  • annuaire
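
The filtering step described above can be sketched as follows. This is a minimal illustration, not the project's actual script. It assumes that an entry is "valid" when each of its three NER columns contains at least one tag and no empty tag such as `<PER></PER>`. The helper names (`is_valid`, `keep_valid_entries`, `write_gold`) and the tag syntax are hypothetical. Only the column names come from the issue.

```python
import csv
import re

# Column names taken from the issue.
COLUMNS = ["id", "ner_xml_ref", "ner_xml_pero", "ner_xml_tess", "annuaire"]
NER_COLUMNS = ["ner_xml_ref", "ner_xml_pero", "ner_xml_tess"]

# Assumption: tags look like <PER>...</PER>; the real markup may differ.
ANY_TAG_RE = re.compile(r"<([A-Za-z_][\w-]*)>")
EMPTY_TAG_RE = re.compile(r"<([A-Za-z_][\w-]*)>\s*</\1>")


def is_valid(ner_xml):
    """Return True if the text has at least one tag and no empty tag."""
    text = ner_xml or ""
    return bool(ANY_TAG_RE.search(text)) and not EMPTY_TAG_RE.search(text)


def keep_valid_entries(rows):
    """Keep only the entries that are valid in all three NER columns."""
    return [
        {column: row[column] for column in COLUMNS}
        for row in rows
        if all(is_valid(row[column]) for column in NER_COLUMNS)
    ]


def write_gold(rows, handle):
    """Write the kept entries as CSV with the expected header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

To produce the file, one could open the target path with `open(path, "w", newline="", encoding="utf-8")` and pass the handle to `write_gold`. Taking the intersection across the three columns means the reference, Pero OCR and Tesseract versions are all evaluated on the same entries.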

Harmonize the vocabulary in the paper

  • Ground truth vs Gold / Gold reference.
  • Models short names in figures: CmBERT, CmBERT+ptrn, SpaCy NER.
  • pre-training / training / fine-tuning
  • entries vs texts
  • raw entries vs OCRed entries

Rebuttal response summary

[Review 1]

  • 1.1 ICDAR Contest in 2011 related to that topic
  • 1.2 Qualitative analysis of influence of OCR quality on NER
  • 1.3 Metrics used, potential biases

[Review 2]

  • 2.1 comparing two runs of Pero OCR together, or two runs of Tesseract together
  • 2.2 reason for not reporting Kraken performance on downstream NER
  • 2.3 Figure 1 is never referred to in the text
  • 2.4 When referring to figures, you sometimes write "fig. n", "figure n", and "Figure n"
  • 2.5 You wrote Tesseract twice with a lowercase t
  • 2.6 Experiment with artificial OCR noise in future work

[Review 3]

  • 3.1 Section 2.3 should be a separate section about the pipeline
  • 3.2 What are "thumbnails" in page 7
  • 3.3 Description of NER QA is confusing
  • 3.4 Figure 4 needs clarification

Experiment 2: train on clean data

  • Create spaCy noisy test sets
  • Create Hugging Face noisy test sets
  • Create spaCy clean train/dev sets
  • Create Hugging Face clean train/dev sets
  • Create notebook: train on clean & save model
  • Create notebook: eval on clean & noisy data
  • Gather metrics and create figures

Experiment 2: train on noisy data

  • Fine-tune camembert & camembert-pretrained on noisy data (pero-ocr only? Both?)
  • Fine-tune camembert & camembert-pretrained on noisy+clean?
