SSCIMA

Sequential Sentence Classification in Medical Abstracts

Aim

Implement a deep learning model architecture that utilizes NLP techniques to classify the abstracts of RCT(Randomized Controlled Trial) Medical Papers. Classification into classes of Background, Objective, Method, Result, or Conclusion.

1. Problem Statement

Over 1 million RCTs have been published so far and around half of them are in PubMed, which makes it challenging for medical investigators to pinpoint the information they are looking for. When researchers search for previous literature, e.g., to write systematic reviews, they often skim through abstracts in order to quickly check whether the papers match the criteria of interest. This process is easier when abstracts are structured, i.e., the text in an abstract is divided into semantic headings such as the objective, method, result, and conclusion. However, over half of published RCT abstracts are unstructured, which makes it more difficult to quickly access the information of interest.

The problem is of Sequential Sentence Classification problem, that is, the classification of short texts that appear in sequences.

2. Technological Area: NLP and Deep Learning

Classical linguistic methods for natural language processing required experts in language defining rules to cover specific cases. These worked in narrow cases but turned out to be fragile. Statistical methods improve upon classical linguistic methods by learning rules and models from data rather than requiring them to be specified in a top-down manner. They result in much better performance, but must still be complemented with hand-crafted augmentations by language experts in order to achieve useful results. Often, a pipeline of statistical methods is required to achieve a single modeling outcome, such as in the case of machine translation. Deep learning methods are starting to out-compete the statistical methods on some challenging natural language processing problems with singular and simpler models.

3. Existing Systems

Many of the existing text classification techniques like Bag of words, Vector Semantic, Word Embedding, and Probabilistic Language model consider the importance of words individually and not in a long sequence.
Techniques like sequence labeling, parsing, and semantics consider a sequence of sentences but do not dive deep into the complex nature and meaning of human languages and their quirks in them.

Methodology

Getting the data ready
Preprocessing the data
Feature Engineering
Building a model
Fitting the data into the model
Evaluate the model
Improving the model through experimentation
Predicting on the saved model

Implementation

Implementation involved building a series of models, starting by getting a baseline, and improving throughout using the NLP and Deep Learning techniques. The models developed in the process are:

TF-IDF Multinomial Naive Bayes - A baseline model
Conv1D with token embedding
Feature Extraction with pre-trained token-embedding
Conv1D with character embedding
Hybrid Embedding = token embedding + character embedding
Transfer learning with pre-trained token embedding + character embedding + positional embedding

Results

Based on the F1 score.

Model	F1 Score
baseline	0.69
custom_token_embed_conv1d	0.80
pretrained_token_embed	0.74
custom_char_embed_conv1d	0.69
hybrid_char_token_embed	0.75
tribrid_pos_char_token_embed	0.85

Conclusion

We found out that the model with Transfer learning with pre-trained token embedding + character embedding + positional embedding performed well covering a wide range of samples and providing an F1 score of 85% accuracy.

Future works

Train the model on a bigger and complete dataset
Develop interfaces and applications for the usage of the model
Classification mechanisms for papers other than Medical research

gsakshay / sscima Goto Github PK

sscima's Introduction

SSCIMA

Aim

1. Problem Statement

2. Technological Area: NLP and Deep Learning

3. Existing Systems

Methodology

Implementation

Results

Conclusion

Future works

sscima's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent