Git Product home page Git Product logo

rhetorical_roles's Introduction

SemEval2023 Task 6 (Rhetorical Role identification)

Introduction

This project deals with identification of Rhetorical Roles in Indian Case documents that have a inherent structure to it. The documents are classified to 13 different rhetorical roles for the SemEval task and provided for training. The details of the task is given in this link. In this project we try to extend the work done in Bhattacharya et. al. and also compare the performance of different BERT based legal models such as LegalBERT, InLegalBERT, GPT2 etc.

The code for the BERT models resides in rhetorical_roles.ipynb and rr.ipynb. The instruction to run Hierarchial BiLSTM model is given below.

Citation

If you use the Hierarchical BiLSTM code, please refer to the following paper:

  @inproceedings{bhattacharya-jurix19,
   author = {Bhattacharya, Paheli and Paul, Shounak and Ghosh, Kripabandhu and Ghosh, Saptarshi and Wyner, Adam},
   title = {{Identification of Rhetorical Roles of Sentences in Indian Legal Judgments}},
   booktitle = {{Proceedings of the 32nd International Conference on Legal Knowledge and Information Systems (JURIX)}},
   year = {2019}
  }

Requirements

  • python = 3.7.3
  • pytorch = 1.1.0
  • sklearn = 0.21.3
  • numpy = 1.17.2
  • sent2vec

Codes

  • model/submodels.py : Contains codes for submodels that is used for constructing the top-level architecture
  • model/Hier_BiLSTM_CRF.py : Contains code for the top-level architecture
  • prepare_data.py : Functions to prepare numericalized data from raw text files
  • train.py : Functions for training, validation and learning
  • run.py : For reproducing results in the paper
  • infer.py : For using a trained model to infer labels for unannotated documents

Training

For training a model on an annotated dataset

Input Data format

For training and validation, data is placed inside "data/text" folder. Each document is represented as an individual text file, with one sentence per line. The format is:

text <TAB> label

If you wish to use pretrained embeddings variant of the model, data is placed inside "data/pretrained_embeddings" folder. Each document is represented as an individual text file, with one sentence per line. The format is:

emb_f1 <SPACE> emb_f2 <SPACE> ... <SPACE> emb_f200 <TAB> label  (For 200 dimensional sentence embeddings)

"categories.txt" contains the category information of documents in the format:

category_name <TAB> doc <SPACE> doc <SPACE> ...

Usage

To run experiments with default setup, use:

python run.py                                                                 (no pretrained variant)
python run.py --pretrained True --data_path data/pretrained_embeddings/       (pretrained variant)

Constants, hyper parameters and path to data files can be provided as switches along with the previous command, to know more use:

python run.py -h

To see default values, check out "run.py"

By default, the model employs 5 fold cross-validation on a total of 50 documents, where folds are manually constructed to have balanced category distribution across each fold.

Output Data format

All output data will be found inside "saved" folder. This contains:

  • model_state_fn.tar : fn is the validation fold number. This contains the architecture and model state which achieved highest macro-f1 on validation.
  • data_state_fn.json : Contains predictions, true labels, loss and training index for each document in the validation fold.
  • word2idx.json and tag2idx.json : Needed for inference

Inference

For using a trained model to automatically annotate documents

Input Data format

Un-annotated data is to be placed inside "infer/data" folder. Each document should be represented as an individual text file, containing one sentence per line.

For inference, we need a trained Hier-BiLSTM-CRF model. For this, place model_state.tar, word2idx.json and tag2idx.json (which were obtained after the training process) inside "infer" folder.

For pretrained variant, we also need to place a trained sent2vec model inside "infer" folder. You can download a sent2vec model pretrained on Indian Supreme Court case documents here (binary file of size more than 2 GB).

Usage

To infer with default setup, use:

python infer.py                       (no pretrained variant)
python infer.py --pretrained True     (pretrained variant)

Constants, hyper parameters and path to data files can be provided as switches along with the previous command, to know more use:

python infer.py -h

To see default values, check out "infer.py"

Output Data format

Output will be saved in "infer/predictions.txt", which has the format:

document_filename <TAB> label_sent1 <COMMA> label_sent2 <COMMA> ... <COMMA> label_sentN     (N sentences in this document)

Notes

  1. Make sure to set the switch --device cpu (or change the default value) if cuda is not available.
  2. Remove the blank "init.py" files before running experiments.

rhetorical_roles's People

Contributors

shounakpaul95 avatar skiran13 avatar paheli avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.