Comments (5)

a3616001 commented on May 30, 2024

Hi! Could you please elaborate on your data format (I am not familiar with .ann files)? In general, you need to store the entity information in the ner field and the relation information in the relations field (as shown in the README).

from pure.

Kehindeajayi01 commented on May 30, 2024

An example of an .ann file is:
T1 Material 117 131 Ag_{5}Te_{2}Cl
T2 Property 335 356 electric conductivity
T3 Property 640 651 thermopower
T8 Property 1954 1976 thermal conductivities
T9 Value 1985 2005 0.19 W m^{-1} K^{-1}
R1 has_value Arg1:T8 Arg2:T9

The 2nd column is the entity type, the 3rd and 4th columns are the starting and ending character positions of the entity in the raw text, and the last column is the entity text itself.
R1 is the first relation; it states that the entities tagged T8 and T9 are linked by the relation "has_value".
The main challenge is that this file does not indicate which sentence of the raw text each entity or relation belongs to, which makes it difficult to match the raw text to the annotation file.
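
A minimal sketch of reading such a file into Python, assuming the whitespace-separated layout shown in the example above and only contiguous entity spans (discontinuous spans written with ";" are not handled). The names `Entity`, `Relation`, and `parse_ann` are hypothetical helpers, not part of PURE.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Entity(NamedTuple):
    tag: str    # annotation id, e.g. "T8"
    label: str  # entity type, e.g. "Property"
    start: int  # character offset where the span starts
    end: int    # character offset where the span ends
    text: str   # the annotated text


class Relation(NamedTuple):
    tag: str    # annotation id, e.g. "R1"
    label: str  # relation type, e.g. "has_value"
    arg1: str   # tag of the first entity, e.g. "T8"
    arg2: str   # tag of the second entity, e.g. "T9"


def parse_ann(lines: Iterable[str]) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse entity (T*) and relation (R*) lines; other lines are skipped."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in lines:
        line = line.strip()
        if line.startswith("T"):
            # Split into at most 5 fields so the entity text keeps its spaces.
            tag, label, start, end, text = line.split(None, 4)
            entities[tag] = Entity(tag, label, int(start), int(end), text)
        elif line.startswith("R"):
            tag, label, arg1, arg2 = line.split()
            # Arguments look like "Arg1:T8"; keep only the entity tag.
            relations.append(
                Relation(tag, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


sample = [
    "T8 Property 1954 1976 thermal conductivities",
    "T9 Value 1985 2005 0.19 W m^{-1} K^{-1}",
    "R1 has_value Arg1:T8 Arg2:T9",
]
entities, relations = parse_ann(sample)
```

With the entities keyed by tag, each relation can be resolved to its two character spans through `entities[relation.arg1]` and `entities[relation.arg2]`.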

a3616001 commented on May 30, 2024

Hi, I assume the starting and ending positions of the entities are character indexes into the raw text. If so, you may want to split the raw text into sentences, so that you know which entities belong to which sentence. One way to do that is to use the nltk library (e.g., nltk.sent_tokenize).
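
A rough sketch of that idea, assuming the sentence strings (from nltk.sent_tokenize or any other splitter) appear verbatim and in order in the raw text. The helpers `sentence_spans` and `sentence_index` are hypothetical names introduced for illustration.

```python
from typing import List, Optional, Tuple


def sentence_spans(raw_text: str, sentences: List[str]) -> List[Tuple[int, int]]:
    """Locate each sentence in the raw text and return its (start, end) offsets."""
    spans = []
    cursor = 0
    for sent in sentences:
        start = raw_text.find(sent, cursor)
        if start < 0:
            raise ValueError(f"sentence not found in raw text: {sent!r}")
        end = start + len(sent)
        spans.append((start, end))
        cursor = end  # keep searching after this sentence
    return spans


def sentence_index(
    spans: List[Tuple[int, int]], ent_start: int, ent_end: int
) -> Optional[int]:
    """Return the index of the sentence that fully contains the entity span."""
    for i, (start, end) in enumerate(spans):
        if start <= ent_start and ent_end <= end:
            return i
    return None  # the entity crosses a sentence boundary or lies outside all sentences


raw = "Ag2Te is a conductor. Its thermopower is high."
sents = ["Ag2Te is a conductor.", "Its thermopower is high."]
spans = sentence_spans(raw, sents)

ent_start = raw.index("thermopower")
ent_end = ent_start + len("thermopower")
which = sentence_index(spans, ent_start, ent_end)
```

Entities for which `sentence_index` returns `None` usually point to a sentence split that cuts through an annotation and are worth inspecting by hand.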

Kehindeajayi01 commented on May 30, 2024

I already have the text tokenized into sentences. However, in the scierc data, the ner and relations fields use token positions, i.e. [start_position_of_entity_token, end_position_of_entity_token, entity_type], and not character positions as in the above example.
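
For illustration, a token-indexed record for one sentence might look like the sketch below. It assumes inclusive end indices, token numbering that runs across the whole document, and relations written as [start1, end1, start2, end2, type]; these conventions should be checked against the README. The doc_key value and the tokenization are made up for the example.

```python
import json

# One tokenized sentence (hypothetical tokenization).
sentences = [
    ["The", "thermal", "conductivities", "reach", "0.19", "W", "m^{-1}", "K^{-1}", "."],
]

record = {
    "doc_key": "example_doc",  # hypothetical identifier
    "sentences": sentences,
    # One list per sentence; each entity is [start_token, end_token, type].
    "ner": [[[1, 2, "Property"], [4, 7, "Value"]]],
    # One list per sentence; each relation joins two entity spans (assumed layout).
    "relations": [[[1, 2, 4, 7, "has_value"]]],
}

# Recover the entity text from the token indices (end index is inclusive here).
start, end, label = record["ner"][0][0]
entity_tokens = sentences[0][start : end + 1]

json_line = json.dumps(record)  # one JSON object per line in the data file
```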

a3616001 commented on May 30, 2024

I think you can go over each token in the text and use len(token) to compute the length in characters of each token, then you should be able to compute the starting character position and the ending character position for each token.
After that, it should be easy to map the character positions in your .ann files to token positions (for each character position in the .ann files, you just need to find a token that contains this character position).
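
The two steps above can be sketched as follows. This variant locates each token with str.find and then uses len(token), so it does not depend on how much whitespace separates tokens; it assumes every token appears verbatim and in order in the raw text. `token_char_offsets` and `char_span_to_token_span` are hypothetical helper names.

```python
from typing import List, Tuple


def token_char_offsets(raw_text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of each token in the raw text."""
    offsets = []
    cursor = 0
    for tok in tokens:
        start = raw_text.find(tok, cursor)
        if start < 0:
            raise ValueError(f"token not found in raw text: {tok!r}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end  # the next token must start at or after this point
    return offsets


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], char_start: int, char_end: int
) -> Tuple[int, int]:
    """Map a character span [char_start, char_end) to (first_token, last_token), inclusive."""
    covered = [
        i for i, (start, end) in enumerate(offsets)
        if start < char_end and end > char_start  # token overlaps the span
    ]
    if not covered:
        raise ValueError("no token overlaps the given character span")
    return covered[0], covered[-1]


raw = "The thermal conductivities reach 0.19 W m^{-1} K^{-1}."
tokens = ["The", "thermal", "conductivities", "reach", "0.19", "W", "m^{-1}", "K^{-1}", "."]
offsets = token_char_offsets(raw, tokens)

value_text = "0.19 W m^{-1} K^{-1}"
value_start = raw.index(value_text)
value_span = char_span_to_token_span(offsets, value_start, value_start + len(value_text))
```

If the tokenizer rewrites characters (for example, converting quotes or brackets), the verbatim lookup fails and raises, which flags the mismatch instead of silently producing wrong indices.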
