
ir's Introduction

Information Extraction & Knowledge Elicitation for UN General Assembly (UNGA) resolutions

Currently, UN organizations and UN-related organizations produce, process, and maintain a high volume of documents, and these reports are initially designed for humans to read, process, and generate insights from for decision-making.

The massive number of documents and the current document formats (.doc or .pdf) create challenges for both the UN information management system and decision-makers. Extracting keywords from materials in their original form is both time-consuming and labor-intensive.

This project aims to transform the documents into a machine-readable format, identify critical information and knowledge, improve information-processing efficiency through automation, and conduct analysis for insight discovery. Specifically, the main objectives of the information system project are to generate machine-readable, semantically enhanced documentation automatically for 1) a document retrieval tool; 2) metadata and document-description queries; and 3) text mining for content analysis.

1. Metadata with Regular Expression

The metadata was crawled from the UN General Assembly (UNGA) resolutions page: https://www.un.org/en/sections/documents/general-assembly-resolutions/

To locate the critical information in each document, we present the high-level information in a structured format. The system uses Regular Expressions (RegEx), chosen based on the existing structure of each document, since RegEx can check a series of characters for matches efficiently and adaptably.

The task consists of two parts: field extraction and basic segmentation.

1.1 Metadata for documents

Sample extracted metadata fields are shown below.
Title and Closing Formula, which are not necessarily metadata, are also included.

  • Doc Name: N1846596
    Note*: Doc Name does not need extraction. It is set as the index.
  • ID: A/RES/73/277
  • Session: Seventy-third Session
  • Agenda Items: 148
  • Proponent Authority: The Fifth Committee
  • Approval Date: 2018-12-22
  • Title
    • Financing of the International Residual Mechanism for Criminal Tribunals
  • Closing Formula
    • 65th plenary meeting
    • 22 December 2018
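As an illustration of the RegEx-based field extraction, a minimal sketch is shown below. The patterns and the `extract_metadata` helper are hypothetical simplifications, not the project's actual code in basic.py.

```python
import re

# Illustrative patterns for a few metadata fields; the real
# patterns in the project may differ.
PATTERNS = {
    "ID": re.compile(r"\bA/RES/\d+/\d+\b"),
    "Session": re.compile(r"\b\w+(?:-\w+)?\s+session\b", re.IGNORECASE),
    "Agenda Items": re.compile(r"Agenda items?[\s\d,and]+"),
}

def extract_metadata(text):
    """Return the first match for each field, or None if absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(text)
        fields[name] = m.group(0).strip() if m else None
    return fields

header = "Seventy-third session Agenda item 148 ... A/RES/73/277"
meta = extract_metadata(header)
```

Because resolution headers follow a fixed layout, patterns like these can stay simple while remaining precise.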

The result is shown as follows:


1.2 Paragraph Segmentation

We extract operative, preamble, annex, and footnote information, which is crucial for further content analysis.
The figures below show examples from N1643743.doc with the tags 'op' = operative, 'pre' = preamble, 'ax' = annex, and 'fn' = footnote.
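A toy sketch of how such tagging could work is shown below. The rules and the `tag_paragraph` helper are illustrative assumptions (e.g., treating numbered paragraphs as operative and defaulting to preambular), not the project's actual segmentation logic.

```python
import re

# Heuristic tagger, a sketch only: the real segmentation rules may differ.
def tag_paragraph(par):
    """Assign 'pre', 'op', 'ax', or 'fn' to one paragraph."""
    if re.match(r"Annex\b", par):
        return "ax"                  # annex heading
    if re.match(r"\d+\s", par) and len(par) < 120:
        return "fn"                  # short line starting with a bare number -> footnote
    if re.match(r"\d+\.\s", par):
        return "op"                  # numbered operative paragraph
    return "pre"                     # default: preambular paragraph

paras = [
    "Recalling its resolution 70/1 of 25 September 2015,",
    "1. Decides to adopt the programme of work;",
    "Annex I",
]
tags = [tag_paragraph(p) for p in paras]
```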

2. Task-based information extraction

This part consists of document abbreviation, deadline extraction, reference extraction, and database filtering.
These tasks build on the first part and require higher precision.

2.1 Document abbreviation

Abbreviation is performed only on operative paragraphs.
Sample output is shown below; words in red mark incorrect abbreviations.
The test accuracy for this task is 0.88.


2.2 References and deadlines

Here our goal is to find past resolutions and future dates.
Their specific formats allow us to make more precise matches.
Sample outputs are shown as follows:

References:

  >>>b.refence(file)
  ['resolutions 1980/67 1989/84',
  'resolution 69/313',
  'resolutions 53/199 61/185',
  'decision XIII/5',
  'decision 14/30',
  'decision XII/19',
  'resolution 70/1',
  'decision 14/5']

Referred resolutions:

  >>>b.refered_doc(file,df)
  ['N1523222', 'N0650553', 'N1529189']

Future Date and Year:

  >>>b.future_date(file)
  (['8 June 2020', '11 June 2020'], ['2030', '2020', '2019'])
  ###  Note that there are two lists returned
  ###  Year list is used when only year or year range is mentioned
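A simplified sketch of how these matches could be made with RegEx is shown below. The `references` and `future_dates` helpers are illustrative stand-ins for the project's `b.refence()` and `b.future_date()`, with assumed patterns.

```python
import re

def references(text):
    """Find cited resolutions/decisions, e.g. 'resolution 69/313'."""
    # 'resolution(s)'/'decision(s)' followed by one or more symbols
    # such as 69/313 or XIII/5 (Roman or Arabic numerals).
    pat = re.compile(r"(?:resolutions?|decisions?)(?:\s+[\dIVXL]+/\d+)+")
    return pat.findall(text)

def future_dates(text, current_year=2018):
    """Return (full dates, years after current_year) as two lists."""
    months = ("January|February|March|April|May|June|July|"
              "August|September|October|November|December")
    full = re.findall(r"\d{1,2} (?:%s) \d{4}" % months, text)
    years = [y for y in re.findall(r"\b(20\d{2})\b", text)
             if int(y) > current_year]
    return full, years

text = "Recalling resolution 69/313, decides to meet on 8 June 2020 and by 2030."
refs = references(text)
full, years = future_dates(text)
```

As in the original output, the year list catches deadlines mentioned only as a year or year range, which the full-date pattern would miss.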

2.3 Word count and word-based filtered database

These two functions are exploratory only and were not evaluated.
Only nouns and adjectives are kept for the word count, since they carry more meaning.
Users can specify which columns to search for keywords; case-sensitive matching (case_sensitive) is also supported.
Sample outputs are shown as follows:

Word-based filtered database with the word 'African':

Word count with the number of terms = 10:
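The two functions can be sketched with the standard library as below. This is an illustrative approximation: the project keeps only nouns and adjectives via POS tagging, which is substituted here by a small stop-word filter, and the hypothetical `filter_rows` helper works on plain dicts rather than the project's database.

```python
from collections import Counter
import re

STOPWORDS = {"the", "of", "and", "to", "in", "its", "on", "for", "a"}

def word_count(texts, n_terms=10):
    """Top-n term counts. The project keeps only nouns and adjectives
    via POS tagging; here a stop-word filter stands in for that step."""
    tokens = []
    for t in texts:
        tokens += [w for w in re.findall(r"[A-Za-z]+", t.lower())
                   if w not in STOPWORDS]
    return Counter(tokens).most_common(n_terms)

def filter_rows(rows, column, keyword, case_sensitive=False):
    """Keep rows (dicts) whose chosen column contains the keyword."""
    def hit(value):
        return keyword in value if case_sensitive else keyword.lower() in value.lower()
    return [r for r in rows if hit(r[column])]

rows = [{"title": "African development"}, {"title": "Ocean affairs"}]
hits = filter_rows(rows, "title", "African")
```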

3. Document classification

In this part, the goal is to classify the documents based on UNBIS.
We build an algorithm based on a Bidirectional LSTM (Bi-LSTM), relying on the preamble, operative paragraphs, and title.

Model and methodology:

Instead of using a pre-trained embedding layer directly, we train this layer from scratch.
Three LSTMs are applied in parallel, handling the preamble, operative paragraphs, and title separately.
A dropout layer is added to combat overfitting.
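The described architecture can be sketched in Keras as below. This is an assumption-laden illustration: the framework, vocabulary size, sequence lengths, and layer widths are all hypothetical, and only the overall shape (three parallel branches, an embedding trained from scratch, and dropout before the classifier) follows the description above.

```python
# Sketch of the described architecture (assumed framework: Keras;
# vocabulary size, sequence lengths, and units are illustrative).
from tensorflow.keras import layers, Model

VOCAB, DIM, N_CLASSES = 20000, 128, 24  # illustrative sizes

def branch(seq_len, name):
    """One parallel branch: embedding trained from scratch + Bi-LSTM."""
    inp = layers.Input(shape=(seq_len,), name=name)
    x = layers.Embedding(VOCAB, DIM)(inp)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    return inp, x

inputs, feats = zip(*[branch(n, s) for n, s in
                      [(400, "preamble"), (600, "operatives"), (30, "title")]])
x = layers.concatenate(list(feats))
x = layers.Dropout(0.5)(x)              # dropout against overfitting
out = layers.Dense(N_CLASSES, activation="softmax")(x)
model = Model(list(inputs), out)
```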


Results and evaluation

We use 1,271 human-labeled documents.
Overall accuracy is around 94%.
Given the labelling method, this model may rely too heavily on the title.


Sample predictions

The figure below shows the predictions for some of the test data.


4. Content analysis

In this part, we applied NER (Named Entity Recognition) and LDA for topic modeling.

4.1 LDA topic modeling

Latent Dirichlet Allocation (LDA) allows sets of observations to be explained by unobserved groups that account for why some parts of the data are similar.
Users can input the whole database, or subsets of the database filtered by keywords or categories.
Sample output (original HTML):


4.2 Named Entity Recognition

NER speeds up the information extraction process by recognizing, locating, and classifying named entities in the documents into pre-defined categories such as names of persons or organizations.
The NER model trained for UN resolutions covers person, organization, date, law, and place labels.
We use the displaCy visualizer from spaCy to display the labeled texts from the documents.
After 250 training iterations, a demo result is shown below.
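The displaCy rendering step can be sketched as below. Here the entity spans are supplied manually (in practice they would come from the trained NER model); the example sentence and spans are illustrative.

```python
from spacy import displacy

# Manual rendering: entity spans precomputed (e.g. by the trained NER model).
ex = {
    "text": "Adopted on 22 December 2018 by the General Assembly.",
    "ents": [
        {"start": 11, "end": 27, "label": "DATE"},  # '22 December 2018'
        {"start": 35, "end": 51, "label": "ORG"},   # 'General Assembly'
    ],
    "title": None,
}
html = displacy.render(ex, style="ent", manual=True)
```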


5. Django website

To demonstrate the results with a user-friendly interface, a repository website has been established.
This website is still under construction.
Categories are the result of classification according to UNBIS.
Labels are the aggregation of the top five words for each document.

Current views:

Original Code

1. Click here for basic.py (bottom)
2. Click here for a quick demo

ir's People

Contributors

hayleyteng

