Comments (5)
Hi! Could you please elaborate more on your data format? (I am not familiar with .ann files.) In general, you need to store the entity information in the ner field and the relation information in the relations field, as shown in the README.
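For reference, the README's format can be sketched as a JSON-lines file, one document per line. Below is a minimal hand-built example; the field names follow the repo README, and the token indices follow the SciERC convention of document-level positions (verify the exact convention against the README for your dataset — the doc_key value and the sentences here are made up for illustration):

```python
import json

# One document in PURE's JSON-lines input format (a sketch, not official data).
doc = {
    "doc_key": "example_doc",  # any unique document id
    "sentences": [
        ["Ag5Te2Cl", "is", "a", "good", "conductor", "."],
        ["Its", "thermal", "conductivity", "is", "0.19", "W/m/K", "."],
    ],
    # one list per sentence: [start_token, end_token, entity_type],
    # with token positions counted across the whole document
    "ner": [
        [[0, 0, "Material"]],
        [[7, 8, "Property"], [10, 11, "Value"]],
    ],
    # one list per sentence: [head_start, head_end, tail_start, tail_end, label]
    "relations": [
        [],
        [[7, 8, 10, 11, "has_value"]],
    ],
}

line = json.dumps(doc)  # each document becomes one line of the input file
```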
from pure.
An example of .ann file is:
T1 Material 117 131 Ag_{5}Te_{2}Cl
T2 Property 335 356 electric conductivity
T3 Property 640 651 thermopower
T8 Property 1954 1976 thermal conductivities
T9 Value 1985 2005 0.19 W m^{-1} K^{-1}
R1 has_value Arg1:T8 Arg2:T9
The 2nd column is the entity type, the 3rd and 4th columns are the starting and ending character positions of the entity in the raw text, and the last column is the entity text itself.
R1 is the first relation, indicating that the entities tagged T8 and T9 have the relation "has_value".
The main challenge is that this file does not indicate which entity or relation tags belong to which sentence in the raw text, which makes it difficult to match the raw text to the annotation file.
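The two record types above can be split apart with a short parser. A sketch, assuming whitespace-separated columns exactly as in the example (real brat files separate the first columns with tabs, so adjust the splitting if needed):

```python
def parse_ann(lines):
    """Split brat-style .ann lines into entities {tag: (type, start, end, text)}
    and relations [(label, head_tag, tail_tag)]."""
    entities, relations = {}, []
    for line in lines:
        fields = line.split()
        tag = fields[0]
        if tag.startswith("T"):  # entity line: tag, type, start, end, text...
            etype, start, end = fields[1], int(fields[2]), int(fields[3])
            text = " ".join(fields[4:])  # entity text may contain spaces
            entities[tag] = (etype, start, end, text)
        elif tag.startswith("R"):  # relation line: tag, label, Arg1:Tx, Arg2:Ty
            label = fields[1]
            head = fields[2].split(":")[1]
            tail = fields[3].split(":")[1]
            relations.append((label, head, tail))
    return entities, relations
```

For the example lines above, parse_ann would map T8 to ("Property", 1954, 1976, "thermal conductivities") and record ("has_value", "T8", "T9") as a relation.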
from pure.
Hi, I assume the starting and ending positions of entities are character indexes into the raw text. Then you may want to split the raw text into sentences, so that you know which entities belong to which sentence. One way to do that is to use the nltk library (e.g., nltk.sent_tokenize).
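Note that nltk.sent_tokenize returns only the sentence strings; to assign an entity by its character offsets you also need each sentence's starting position in the raw text. A stdlib sketch that recovers those spans, assuming each sentence occurs verbatim and in order in the text (the sentence list itself could come from nltk.sent_tokenize):

```python
def sentence_spans(text, sentences):
    """Locate each sentence string in the raw text and return a list of
    (start, end) character spans, so that an entity with character offsets
    (s, e) belongs to the sentence whose span contains s."""
    spans, cursor = [], 0
    for sent in sentences:
        start = text.index(sent, cursor)  # sentences occur in order
        spans.append((start, start + len(sent)))
        cursor = start + len(sent)
    return spans
```

Given the spans, the sentence index of an entity starting at character s is the i with spans[i][0] <= s < spans[i][1].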
from pure.
I already have the text tokenized into sentences. However, in the SciERC data, the ner and relations fields refer to [start_position_of_entity_token, end_position_of_entity_token, entity_type], i.e., token positions, not character positions as in the example above.
from pure.
I think you can go over each token in the text and use len(token) to compute the length in characters of each token; then you should be able to compute the starting and ending character positions for each token. After that, it should be easy to map the character positions in your .ann files to token positions: for each character position in the .ann files, you just need to find the token that contains it.
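The accumulation described above can be sketched as follows. Instead of summing len(token) directly, this version scans the raw text with str.index so that variable whitespace between tokens does not throw the offsets off (assuming the tokens appear in order in the raw text):

```python
def token_offsets(tokens, text):
    """Compute a (start, end) character span for each token by
    scanning the raw text left to right."""
    offsets, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)
        offsets.append((start, start + len(tok)))
        cursor = start + len(tok)
    return offsets

def char_to_token(pos, offsets):
    """Index of the token whose span contains character position pos,
    or None if pos falls between tokens (e.g., on whitespace)."""
    for i, (s, e) in enumerate(offsets):
        if s <= pos < e:
            return i
    return None
```

Since brat end offsets are exclusive, an entity with character span (start, end) maps to the token range char_to_token(start, offsets) through char_to_token(end - 1, offsets).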
from pure.
Related Issues (20)
- Multiple issues HOT 2
- different F1 with the same seed HOT 2
- TensorFlow version HOT 1
- About the relation in datasets HOT 1
- [Paper] What are "gold" entity and relationship types? HOT 2
- Provide full environment
- Input Data Format HOT 5
- How to load models into Python HOT 2
- some code problems regarding run_relation_approx(get_features_from_file) HOT 2
- where is the code of Efficient Batch Computations
- Approximation Model Training & Inference HOT 1
- entity is S or O ?
- Further question of f1 and e2e_f1
- Repository version issue HOT 1
- Repository version issue
- ACE dataset
- Training a model on a dataset that is not ace04, ace05, or scierc HOT 1
- training model for WLP -- stuck in suboptimal solution
- Input data format question for custom dataset !
- cuda out of memory