Thank you very much for sharing this great work! I was wondering if there are any

Fine Tuning on Custom Dataset (bros issue, open)

clovaai commented on May 23, 2024

Fine Tuning on Custom Dataset

Comments (3)

dhkim0225 commented on May 23, 2024

Input data preprocessing

bros/preprocess/funsd_spade/preprocess.py

Lines 74 to 86 in 55c52d0

    word_cnt = 0
    class_seq = []
    for word_idx, word in enumerate(form["words"]):
        word_text = word["text"]
        bb = word["box"]
        bb = [[bb[0], bb[1]], [bb[2], bb[1]], [bb[2], bb[3]], [bb[0], bb[3]]]
        tokens = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word_text))

        word_obj = {"text": word_text, "tokens": tokens, "boundingBox": bb}
        if len(word_text) != 0:
            out_json_obj["words"].append(word_obj)
            word_cnt += 1
            class_seq.append(len(out_json_obj["words"]) - 1)

The data must have four-point quadrangle coordinates. If you only have rectangle coordinates, convert each box to a quadrangle with shape (8,).
Tokenize the transcription (ground truth or OCR output) using the BERT tokenizer.

bros/preprocess/funsd_spade/preprocess.py

Line 31 in 55c52d0

tokenizer = BertTokenizer.from_pretrained(VOCA, do_lower_case=True)
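
The rectangle-to-quadrangle conversion described above could be sketched roughly as follows. This is only an illustration, not code from the repository: the helper names `rect_to_quad` and `flatten_quad` are hypothetical, the input is assumed to be `[x_min, y_min, x_max, y_max]`, and the flattened (8,) ordering (x1, y1, ..., x4, y4) is an assumption. The corner order matches the `bb = [...]` line in the snippet above.

```python
from typing import List, Sequence


def rect_to_quad(box: Sequence[float]) -> List[List[float]]:
    """Convert [x_min, y_min, x_max, y_max] into four corner points.

    Corner order: top-left, top-right, bottom-right, bottom-left,
    the same order used by the preprocessing snippet above.
    """
    x_min, y_min, x_max, y_max = box
    return [[x_min, y_min], [x_max, y_min], [x_max, y_max], [x_min, y_max]]


def flatten_quad(quad: List[List[float]]) -> List[float]:
    """Flatten four (x, y) points into a flat list of 8 values (assumed ordering)."""
    return [coord for point in quad for coord in point]


quad = rect_to_quad([10, 20, 110, 40])
flat = flatten_quad(quad)
print(quad)
print(flat)
```

If your OCR engine already returns four corner points, this step can be skipped and the points used directly.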

KIE task

Please refer to the code block below.

bros/preprocess/funsd_spade/preprocess.py

Lines 96 to 116 in 55c52d0

    for form_idx, form in enumerate(in_json_obj["form"]):
        form_id = form["id"]
        form_text = form["text"].strip()
        form_label = form["label"]
        form_linking = form["linking"]

        if len(form_linking) == 0:
            continue

        for link_idx, link in enumerate(form_linking):
            if link[0] == form_id:
                if (
                    link[1] in form_id_to_word_idx
                    and link[0] in form_id_to_word_idx
                ):
                    relation_pair = [
                        form_id_to_word_idx[link[0]],
                        form_id_to_word_idx[link[1]],
                    ]
                    out_json_obj["parse"]["relations"].append(relation_pair)
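
To make the linking logic above easier to follow, here is a small self-contained sketch that runs the same checks on toy data. The function name `extract_relations` and the toy `forms` list are made up for illustration, and the contents of `form_id_to_word_idx` are an assumption (the quoted snippet does not show how that mapping is built).

```python
from typing import Dict, List


def extract_relations(forms: List[dict], form_id_to_word_idx: Dict[int, int]) -> List[List[int]]:
    """Collect [source_word_idx, target_word_idx] pairs from form links."""
    relations = []
    for form in forms:
        for link in form["linking"]:
            # Keep a link only on the form that is its source, so a link
            # listed on both of its endpoints is not added twice.
            if link[0] != form["id"]:
                continue
            # Skip links whose endpoints have no mapped word index.
            if link[0] in form_id_to_word_idx and link[1] in form_id_to_word_idx:
                relations.append(
                    [form_id_to_word_idx[link[0]], form_id_to_word_idx[link[1]]]
                )
    return relations


# Toy data: the link 0 -> 1 is listed on both forms; form 2 has no links.
forms = [
    {"id": 0, "linking": [[0, 1]]},
    {"id": 1, "linking": [[0, 1]]},
    {"id": 2, "linking": []},
]
# Assumed mapping from form id to a word index (hypothetical values).
form_id_to_word_idx = {0: 0, 1: 2, 2: 5}

relations = extract_relations(forms, form_id_to_word_idx)
print(relations)
```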

siamakzd commented on May 23, 2024

Thank you!

For now I am interested in the token classification task. To clarify, let's say that for each document I have:

a list of words
a list of bounding boxes corresponding to those words
a list of labels, one for each box

Which type of preprocessing should I use? For FUNSD I see there are two types, funsd and funsd_spade.
I ran both and see that parse differs between the processed files. I would appreciate it if you could explain, conceptually, the reason for this difference.

tghong commented on May 23, 2024

Simply put:

funsd: for the BIO-tagging decoder
funsd_spade: for the SPADE-style decoder

Since the BIO-tagging approach is common, I recommend trying it first.
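
For readers unfamiliar with BIO tagging, a minimal sketch of turning per-word labels into BIO tags might look like the following. This is not the repository's preprocessing code. The function name `words_to_bio` is hypothetical, and it assumes that consecutive words with the same label belong to one entity and that the label `"other"` means "no entity". Two adjacent entities of the same class would be merged under that assumption.

```python
from typing import List


def words_to_bio(labels: List[str], outside: str = "other") -> List[str]:
    """Map per-word labels to BIO tags.

    The first word of a run gets "B-<label>", following words with the same
    label get "I-<label>", and words labeled `outside` get "O".
    """
    tags = []
    prev = None
    for label in labels:
        if label == outside:
            tags.append("O")
        elif label == prev:
            tags.append("I-" + label)
        else:
            tags.append("B-" + label)
        prev = label
    return tags


word_labels = ["question", "question", "answer", "other", "answer"]
bio_tags = words_to_bio(word_labels)
print(bio_tags)
```

If your annotations include entity boundaries, it is safer to start a new "B-" tag at each boundary rather than rely on label changes alone.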
