Comments (3)

dhkim0225 commented on May 23, 2024

Input data preprocessing

word_cnt = 0
class_seq = []
for word_idx, word in enumerate(form["words"]):
    word_text = word["text"]
    # Expand the rectangle [x1, y1, x2, y2] into four corner points
    # (top-left, top-right, bottom-right, bottom-left).
    bb = word["box"]
    bb = [[bb[0], bb[1]], [bb[2], bb[1]], [bb[2], bb[3]], [bb[0], bb[3]]]
    # Convert the word text into BERT vocabulary ids.
    tokens = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word_text))
    word_obj = {"text": word_text, "tokens": tokens, "boundingBox": bb}
    # Skip words with empty text.
    if len(word_text) != 0:
        out_json_obj["words"].append(word_obj)
        word_cnt += 1
        class_seq.append(len(out_json_obj["words"]) - 1)

  1. Each word must have 4-point quadrangle coordinates. If you only have rectangle coordinates, expand them into the four corner points (eight values, i.e. shape (8,)).
  2. Tokenize the transcription (ground truth or OCR output) with the BERT tokenizer:
    tokenizer = BertTokenizer.from_pretrained(VOCA, do_lower_case=True)
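As a minimal sketch of step 1, assuming a rectangle is stored as [x1, y1, x2, y2] (the layout the loop above reads from word["box"]), the expansion into a quadrangle and into a flat eight-value list could look like this. The helper names rect_to_quad and flatten_quad are hypothetical and are not part of the BROS code.

```python
from typing import List


def rect_to_quad(box: List[int]) -> List[List[int]]:
    """Expand [x1, y1, x2, y2] into four corner points.

    Corner order follows the snippet above:
    top-left, top-right, bottom-right, bottom-left.
    """
    x1, y1, x2, y2 = box
    return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]


def flatten_quad(quad: List[List[int]]) -> List[int]:
    """Flatten four [x, y] points into a single list of eight values."""
    return [coord for point in quad for coord in point]


quad = rect_to_quad([10, 20, 30, 40])
flat = flatten_quad(quad)
print(quad)
print(flat)
```

The nested form matches the "boundingBox" value built in the loop; the flat form is the eight-value layout referred to as shape (8,).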

KIE task

Please refer to the code block below.

for form_idx, form in enumerate(in_json_obj["form"]):
    form_id = form["id"]
    form_text = form["text"].strip()
    form_label = form["label"]
    form_linking = form["linking"]
    if len(form_linking) == 0:
        continue
    for link_idx, link in enumerate(form_linking):
        if link[0] == form_id:
            if (
                link[1] in form_id_to_word_idx
                and link[0] in form_id_to_word_idx
            ):
                relation_pair = [
                    form_id_to_word_idx[link[0]],
                    form_id_to_word_idx[link[1]],
                ]
                out_json_obj["parse"]["relations"].append(relation_pair)
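To make the linking logic easier to follow, here is a self-contained sketch of the same filtering on toy data. It assumes that form_id_to_word_idx maps a form id to the index of one representative word of that form; the thread does not show how that mapping is built, so the values below are made up, and build_relations is a hypothetical helper name.

```python
from typing import Dict, List


def build_relations(
    forms: List[dict], form_id_to_word_idx: Dict[int, int]
) -> List[List[int]]:
    """Collect [source_word_idx, target_word_idx] pairs from form links.

    A link is kept only when it starts at the current form (link[0] == id),
    so a link that is listed under both of its endpoints is emitted once.
    Links to forms without a known word index are dropped.
    """
    relations = []
    for form in forms:
        for src, dst in form["linking"]:
            if src != form["id"]:
                continue
            if src in form_id_to_word_idx and dst in form_id_to_word_idx:
                relations.append(
                    [form_id_to_word_idx[src], form_id_to_word_idx[dst]]
                )
    return relations


# Toy input: the link [0, 1] appears under both form 0 and form 1.
forms = [
    {"id": 0, "linking": [[0, 1]]},
    {"id": 1, "linking": [[0, 1]]},
    {"id": 2, "linking": []},
]
mapping = {0: 0, 1: 3}  # made-up form id -> word index mapping

relations = build_relations(forms, mapping)
print(relations)
```

With this input only the copy of the link stored under form 0 passes the link[0] == form_id check, so the pair is not duplicated.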


siamakzd commented on May 23, 2024

Thank you!

For now I am interested in the token classification task. To clarify, let's say that for each document I have:

  • a list of words
  • a list of bounding boxes corresponding to those words
  • a list of labels, one per box

Which type of preprocessing should I use? For FUNSD I see there are two types: funsd and funsd_spade.
I ran both and see that the parse field differs between the processed files. I would appreciate it if you could explain the conceptual reason for this difference.


tghong commented on May 23, 2024

Simply,

  • funsd: for BIO-tagging decoder
  • funsd_spade: for SPADE style decoder

Since the BIO-tagging approach is common, I recommend trying it first.
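For the setup described in the question (words, boxes, and one label per box), a rough sketch of turning per-word labels into BIO tags might look like the following. It assumes that consecutive words with the same label belong to one entity and that "other" marks words outside any entity; neither assumption comes from the BROS preprocessing scripts, and to_bio is a hypothetical helper.

```python
from typing import List


def to_bio(labels: List[str], outside: str = "other") -> List[str]:
    """Convert per-word labels into BIO tags.

    Assumption: a run of identical consecutive labels is a single entity.
    The first word of a run gets "B-", the following words get "I-",
    and words labeled `outside` get "O".
    """
    tags = []
    prev = None
    for label in labels:
        if label == outside:
            tags.append("O")
        elif label == prev:
            tags.append("I-" + label.upper())
        else:
            tags.append("B-" + label.upper())
        prev = label
    return tags


labels = ["question", "question", "other", "answer"]
tags = to_bio(labels)
print(tags)
```

If two distinct entities with the same label sit next to each other, this run-based grouping merges them, so explicit entity boundaries should be used when they are available.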
