First of all, thank you very much for your work! docExtractor extracts text l

The process of GT generation (docextractor issue, closed)

CrazyCrud commented on June 19, 2024

Comments (2)

monniert commented on June 19, 2024

Hi @CrazyCrud, thanks for your interest in the project! Here are some answers:

- We always filter out the border annotations in our extraction results (https://enherit.paris.inria.fr/ or src/extractor.py), which is why they don't show up.
- I would recommend annotating only the x-height representation (wiki), for example with the VIA annotator, and then augmenting the ground truth to generate borders, either directly when converting the VIA JSON to images (I will see what I can do for #13 in the upcoming days) or after conversion, with morphological operations.
- The labels used to train the default model are illustration, text and text_border. There are two options for finetuning it: (i) you care about extracting all of these elements, so you keep the same labels (colors) in your GT and finetuning is straightforward; or (ii) you want to finetune on a different list of labels (completely different, or a subset, in your case text and text_border), in which case the final conv1x1 layer is randomly initialized but you still benefit strongly from the rest of the pretrained network. Option (ii) is the one used to report the finetuned results on the baseline detection benchmarks (cBADs, Tables 1 and 2 in the paper).
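To make option (ii) concrete, here is a minimal, framework-agnostic sketch of partial weight transfer: every pretrained parameter whose name and shape match is reused, and anything else (here, the final 1x1 convolution) keeps its fresh random initialization. This is not docExtractor's actual loading code. The parameter names, the toy layer shapes and the assumption that background counts as one output channel are all hypothetical, and NumPy arrays stand in for real network weights.

```python
import numpy as np


def transfer_pretrained(pretrained, target):
    """Reuse pretrained parameters whose name and shape match the target.

    Parameters that do not match keep the target's own (random)
    initialization. Returns the merged parameters and the list of
    parameter names that were left freshly initialized.
    """
    merged, reinitialized = {}, []
    for name, value in target.items():
        source = pretrained.get(name)
        if source is not None and source.shape == value.shape:
            merged[name] = source.copy()
        else:
            merged[name] = value
            reinitialized.append(name)
    return merged, reinitialized


rng = np.random.default_rng(0)


def make_params(n_classes):
    # Hypothetical toy network: one 3x3 encoder conv and a final 1x1 conv.
    return {
        "encoder.conv.weight": rng.normal(size=(16, 3, 3, 3)),
        "final_conv.weight": rng.normal(size=(n_classes, 16, 1, 1)),
        "final_conv.bias": np.zeros(n_classes),
    }


# Assumption: default labels are background, illustration, text, text_border.
pretrained = make_params(n_classes=4)
# Assumption: new label subset is background, text, text_border.
target = make_params(n_classes=3)

params, reinitialized = transfer_pretrained(pretrained, target)
print(reinitialized)
```

Only the final layer depends on the number of labels, so it is the only part listed in `reinitialized`; the encoder weights are carried over unchanged.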

Hope this helps

CrazyCrud commented on June 19, 2024

@monniert thank you very much for your detailed answer!

> I would recommend annotating the x-height representation only (wiki) for example using VIA annotator [...] after conversion, with morphological operations

This sounds like a reasonable approach, since you explained in the other issue how to use erosion to generate colored borders.
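For reference, the post-conversion step could look roughly like the sketch below: erode the binary text mask, keep the eroded region as text, and label the ring that erosion removed as text_border. This is only an illustration under assumptions, not the code from the other issue. The integer label ids, the square structuring element and the choice of an inner border (rather than an outer one obtained by dilation) are hypothetical, and the mapping from label ids to GT colors is left out.

```python
import numpy as np

# Hypothetical label ids; the real GT encodes labels as colors.
BACKGROUND, TEXT, TEXT_BORDER = 0, 1, 2


def erode(mask, radius=1):
    """Binary erosion with a (2 * radius + 1) square structuring element.

    Pixels outside the image are treated as background, so text touching
    the image edge is eroded there as well.
    """
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.ones((h, w), dtype=bool)
    # A pixel survives only if every neighbor in the window is foreground.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def add_text_border(mask, radius=1):
    """Split a binary text mask into a text interior and a text_border ring."""
    mask = np.asarray(mask, dtype=bool)
    interior = erode(mask, radius)
    border = mask & ~interior
    labels = np.full(mask.shape, BACKGROUND, dtype=np.uint8)
    labels[interior] = TEXT
    labels[border] = TEXT_BORDER
    return labels


# Toy x-height annotation: a 5 x 7 rectangle inside a 7 x 9 image.
mask = np.zeros((7, 9), dtype=bool)
mask[1:6, 1:8] = True
labels = add_text_border(mask, radius=1)
print(labels)
```

The `radius` parameter controls the border thickness in pixels, so it would need to be tuned to the resolution of the annotated images.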