Hi, I have used the brat tool to annotate my data, so I have two files: .ann files

How to run pre-trained model on a custom dataset

princeton-nlp commented on May 30, 2024

How to run pre-trained model on a custom dataset

Comments (5)

a3616001 commented on May 30, 2024

Hi! Could you please elaborate on your data format (I am not familiar with .ann files)? In general, you need to store the entity information in the ner field and the relation information in the relations field (as shown in the README).

Kehindeajayi01 commented on May 30, 2024

Hi! Could you please elaborate on your data format (I am not familiar with .ann files)? In general, you need to store the entity information in the ner field and the relation information in the relations field (as shown in the README).

An example of an .ann file is:
T1 Material 117 131 Ag_{5}Te_{2}Cl
T2 Property 335 356 electric conductivity
T3 Property 640 651 thermopower
T8 Property 1954 1976 thermal conductivities
T9 Value 1985 2005 0.19 W m^{-1} K^{-1}
R1 has_value Arg1:T8 Arg2:T9

The 2nd column is the entity type, the 3rd and 4th columns are the starting and ending character positions of the entity in the raw text, and the last column is the entity text itself.
R1 is the first relation; it says that the entities tagged T8 and T9 are linked by the relation "has_value".
The main challenge is that this file does not indicate which sentence each entity or relation belongs to, which makes it difficult to match the raw text to the annotation file.
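
For anyone reading along, here is a minimal sketch of how lines like the ones above could be read into Python dictionaries. The function name `parse_ann` and the dictionary layout are my own choices, not part of brat or PURE. It assumes every entity has a single contiguous span (no `;`-separated discontinuous spans) and that only `T` and `R` lines matter.

```python
def parse_ann(ann_text):
    """Parse brat-style annotation text into (entities, relations) dicts keyed by ID.

    Assumption: entity lines look like 'T1 Type start end text' with one
    contiguous span, and relation lines look like 'R1 type Arg1:Tx Arg2:Ty'.
    """
    entities, relations = {}, {}
    for line in ann_text.splitlines():
        line = line.strip()
        if line.startswith("T"):
            # Split on any whitespace (tabs or spaces); the entity text is
            # everything after the 4th field and may itself contain spaces.
            ent_id, ent_type, start, end, text = line.split(None, 4)
            entities[ent_id] = {
                "type": ent_type,
                "start": int(start),
                "end": int(end),
                "text": text,
            }
        elif line.startswith("R"):
            rel_id, rel_type, arg1, arg2 = line.split()
            relations[rel_id] = {
                "type": rel_type,
                "arg1": arg1.split(":", 1)[1],  # 'Arg1:T8' -> 'T8'
                "arg2": arg2.split(":", 1)[1],
            }
    return entities, relations


sample = """T8 Property 1954 1976 thermal conductivities
T9 Value 1985 2005 0.19 W m^{-1} K^{-1}
R1 has_value Arg1:T8 Arg2:T9"""

entities, relations = parse_ann(sample)
print(entities["T9"])
print(relations["R1"])
```

Other brat line types (attributes, events, notes) are simply skipped here; a real converter would probably want to handle or at least report them.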

a3616001 commented on May 30, 2024

Hi, I assume the starting and ending positions of the entities are character offsets into the raw text. In that case, you may want to split the raw text into sentences, so that you know which entities belong to which sentence. One way to do that is to use the nltk library (e.g., nltk.sent_tokenize).
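
A rough sketch of that idea, assuming the sentences are exact substrings of the raw text (which should hold for the output of a splitter such as nltk.sent_tokenize, but is worth checking on your data). The helper names `sentence_spans` and `sentence_of` are made up for illustration, and the sentence list is hard-coded so the snippet runs without nltk.

```python
def sentence_spans(raw_text, sentences):
    """Return the (start, end) character span of each sentence in raw_text."""
    spans, cursor = [], 0
    for sent in sentences:
        # Search from the end of the previous sentence so repeated
        # sentences are matched in order.
        start = raw_text.find(sent, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in raw text: {sent!r}")
        end = start + len(sent)
        spans.append((start, end))
        cursor = end
    return spans


def sentence_of(spans, ent_start, ent_end):
    """Return the index of the sentence that fully contains the entity span, or None."""
    for i, (start, end) in enumerate(spans):
        if start <= ent_start and ent_end <= end:
            return i
    return None


raw_text = "Ag5Te2Cl was synthesized. Its thermopower is high."
# In practice this list would come from a sentence splitter.
sentences = ["Ag5Te2Cl was synthesized.", "Its thermopower is high."]

spans = sentence_spans(raw_text, sentences)
ent_start = raw_text.index("thermopower")
ent_end = ent_start + len("thermopower")
print(sentence_of(spans, ent_start, ent_end))
```

An entity that straddles a sentence boundary returns None here, which usually points to a bad sentence split rather than a bad annotation.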

Kehindeajayi01 commented on May 30, 2024

Hi, I assume the starting and ending positions of the entities are character offsets into the raw text. In that case, you may want to split the raw text into sentences, so that you know which entities belong to which sentence. One way to do that is to use the nltk library (e.g., nltk.sent_tokenize).

I already have the text tokenized into sentences. However, in the scierc data, the ner and relations entries refer to [start_position_of_entity_token, end_position_of_entity_token, entity_type], i.e. token positions, not character positions as in the example above.

a3616001 commented on May 30, 2024

I think you can go over each token in the text and use len(token) to compute the length in characters of each token, then you should be able to compute the starting character position and the ending character position for each token.
After that, it should be easy to map the character positions in your .ann files to token positions (for each character position in the .ann files, you just need to find a token that contains this character position).
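
Something along these lines might work as a starting point. It is only a sketch: the helper names `token_char_spans` and `char_span_to_token_span` are hypothetical, it uses str.find plus len(token) so that irregular whitespace between tokens does not throw the offsets off, and it assumes the end token index is inclusive. Please also check whether your target format expects token indices counted per sentence or over the whole document.

```python
def token_char_spans(text, tokens):
    """Compute the (start, end) character span of each token within text."""
    spans, cursor = [], 0
    for tok in tokens:
        # Look for the token after the previous one; len(tok) gives its end.
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token not found in text: {tok!r}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end
    return spans


def char_span_to_token_span(spans, char_start, char_end):
    """Map a character span [char_start, char_end) to (first_token, last_token).

    Both token indices are inclusive (an assumption about the target format).
    Returns None if no token overlaps the character span.
    """
    hits = [i for i, (s, e) in enumerate(spans) if s < char_end and e > char_start]
    if not hits:
        return None
    return hits[0], hits[-1]


text = "The thermal conductivities reach 0.19 W m^{-1} K^{-1}."
tokens = ["The", "thermal", "conductivities", "reach",
          "0.19", "W", "m^{-1}", "K^{-1}", "."]
spans = token_char_spans(text, tokens)

for entity_text in ["thermal conductivities", "0.19 W m^{-1} K^{-1}"]:
    start = text.index(entity_text)
    end = start + len(entity_text)
    print(entity_text, char_span_to_token_span(spans, start, end))
```

If the .ann offsets are relative to the whole document and the tokens come from a single sentence, subtract that sentence's starting character offset from the entity offsets first.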
