Currently, UN organizations and UN-related bodies produce, process, and maintain a high volume of documents, and these reports are initially designed for humans to read, process, and generate insights from for decision-making.
The massive number of documents and their current formats (.doc or .pdf) create challenges for both the UN information management system and decision-makers. Extracting keywords from materials in their original forms is both time-consuming and labor-intensive.
This project aims to transform the documents into a machine-readable format, identify critical information and knowledge, improve information-processing efficiency through automation, and conduct analysis for insight discovery. Specifically, the main objectives of the information system project are to automatically generate machine-readable, semantically enhanced documentation for 1) a document retrieval tool; 2) metadata and document description queries; 3) text mining for content analysis.
The metadata was crawled from the UN General Assembly (UNGA) resolutions page: https://www.un.org/en/sections/documents/general-assembly-resolutions/
To locate the critical information in each document, we display the high-level information in a structured format. The system chooses regular expressions (RegEx) tailored to the existing structure of each document, since RegEx can check a series of characters for matches efficiently and adaptably.
The task consists of two parts: field extraction and basic segmentation.
The sample extracted metadata fields are shown below.
Title and Closing Formula, which are not necessarily metadata, are also included.
- Doc Name: N1846596
  - Note: Doc Name does not need extraction; it is used as the index.
- ID: A/RES/73/277
- Session: Seventy-third Session
- Agenda Items: 148
- Proponent Authority: The Fifth Committee
- Approval Date: 2018-12-22
- Title
  - Financing of the International Residual Mechanism for Criminal Tribunals
- Closing Formula
  - 65th plenary meeting
  - 22 December 2018
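As a rough illustration of the RegEx approach (the patterns below are simplified assumptions keyed to typical UNGA resolution boilerplate, not the project's exact expressions):

```python
import re

# Illustrative patterns for UNGA resolution headers; the project's
# actual expressions may differ per document structure.
FIELD_PATTERNS = {
    "ID": re.compile(r"A/RES/\d+/\d+"),
    "Session": re.compile(r"[A-Z][a-z]+(?:-[a-z]+)* session", re.IGNORECASE),
    "Agenda Items": re.compile(r"Agenda items?\s+(\d+(?:\s*(?:,|and)\s*\d+)*)"),
    "Approval Date": re.compile(
        r"\d{1,2} (?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December) \d{4}"
    ),
}

def extract_fields(text):
    """Return the first match of each header field, or None if absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(0) if match else None
    return fields
```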
The result of the segmentation is shown as follows. We extract operative, preamble, annex, and footnote information, which is crucial for further content analysis.
The figures below show examples from N1643743.doc for each segment type: 'op' = operative, 'pre' = preamble, 'ax' = annex, and 'fn' = footnote.
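A minimal sketch of how such RegEx-based segmentation could look (the structural cues here are assumptions about typical resolution layout, not the project's exact rules):

```python
import re

def tag_paragraph(par):
    """Assign a segment label to one paragraph of a resolution.

    Illustrative cues: operative paragraphs are numbered ("1. Decides ..."),
    preambular clauses open with a participle ("Recalling", "Noting"),
    annex material follows an "Annex" heading, and footnotes begin with
    a bare number.
    """
    par = par.strip()
    if re.match(r"^\d+\.\s", par):
        return "op"    # numbered operative paragraph
    if re.match(r"^(?:Recalling|Noting|Reaffirming|Recognizing|"
                r"Expressing|Having|Welcoming)\b", par):
        return "pre"   # common preambular openers
    if re.match(r"^Annex\b", par):
        return "ax"    # annex heading and what follows
    if re.match(r"^\d+\s+\S", par):
        return "fn"    # footnote: bare number, then text
    return "other"
```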
This part consists of document abbreviation, deadline extraction, reference extraction, and database filtering. These tasks build on the first part and require higher precision.
The abbreviation is only done for operative paragraphs. The sample output is shown as follows; words in red are wrong abbreviations. The testing accuracy for this task is 0.88.
Here our goal is to find past resolutions and future dates. Their specific formats allow us to make more precise matches.
Sample outputs are shown as follows:
```python
>>> b.refence(file)
['resolutions 1980/67 1989/84',
 'resolution 69/313',
 'resolutions 53/199 61/185',
 'decision XIII/5',
 'decision 14/30',
 'decision XII/19',
 'resolution 70/1',
 'decision 14/5']
>>> b.refered_doc(file, df)
['N1523222', 'N0650553', 'N1529189']
>>> b.future_date(file)
(['8 June 2020', '11 June 2020'], ['2030', '2020', '2019'])
### Note that there are two lists returned
### Year list is used when only year or year range is mentioned
```
These two functions are only exploratory, so they are not evaluated.
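A minimal sketch of the matching logic, with patterns reverse-engineered from the sample outputs above (they are illustrative assumptions, not the project's exact expressions):

```python
import re

# Past resolutions/decisions follow fixed citation formats, including
# arabic ("69/313") and roman-numeral ("XIII/5") identifiers.
REFERENCE_RE = re.compile(
    r"(?:resolutions?|decisions?)\s+(?:[IVXLC]+/\d+|\d+/\d+)"
    r"(?:\s+(?:[IVXLC]+/\d+|\d+/\d+))*"
)

# Full dates ("8 June 2020") and bare years ("2030") are matched
# separately, which is why two lists are returned.
FULL_DATE_RE = re.compile(
    r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)\s+\d{4}\b"
)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def find_references(text):
    return REFERENCE_RE.findall(text)

def find_future_dates(text):
    # Restricting matches to dates after the document's approval date
    # is omitted here for brevity.
    return FULL_DATE_RE.findall(text), YEAR_RE.findall(text)
```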
Only nouns and adjectives are kept for the word count, since they carry more meaning. Users can specify the columns in which to search for keywords; case_sensitive matching is also supported.
Sample outputs are shown as follows:
- Word-based filtered database with the word 'African'
- Word count with number of terms = 10
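A sketch of the noun/adjective word count and the keyword filter (the use of spaCy tagging and the pandas-style helper below are assumptions about the implementation):

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def word_count(text, n_terms=10):
    """Count only nouns and adjectives, the POS classes kept above."""
    doc = nlp(text)
    words = [
        tok.lemma_.lower()
        for tok in doc
        if tok.pos_ in ("NOUN", "ADJ") and not tok.is_stop
    ]
    return Counter(words).most_common(n_terms)

def filter_by_word(df, word, columns, case_sensitive=False):
    """Keep rows of a pandas DataFrame whose chosen columns contain the keyword."""
    mask = df[columns].apply(
        lambda col: col.str.contains(word, case=case_sensitive, na=False)
    ).any(axis=1)
    return df[mask]
```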
In this part, the goal is to classify the documents according to UNBIS categories. We build the algorithm on a bidirectional LSTM (Bi-LSTM), relying on the preamble, operatives, and title. Instead of using a pre-trained embedding layer directly, we train this layer from scratch. Three LSTMs are applied in parallel, one each for the preamble, operatives, and title. A dropout layer is added to fight overfitting.
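A hedged Keras sketch of this three-branch architecture (vocabulary size, layer widths, number of labels, and the single-label softmax head are illustrative assumptions):

```python
from tensorflow.keras import layers, models

VOCAB = 20000   # assumed vocabulary size
MAXLEN = 200    # assumed padded sequence length
N_LABELS = 30   # assumed number of UNBIS categories

def branch(name):
    """One input branch: embedding trained from scratch + Bi-LSTM."""
    inp = layers.Input(shape=(MAXLEN,), name=name)
    x = layers.Embedding(VOCAB, 128)(inp)          # no pre-trained weights
    x = layers.Bidirectional(layers.LSTM(64))(x)
    return inp, x

# Three parallel branches for preamble, operatives, and title.
inputs, outputs = zip(*(branch(n) for n in ("preamble", "operatives", "title")))
merged = layers.concatenate(list(outputs))
merged = layers.Dropout(0.5)(merged)               # guard against overfitting
# Softmax assumes single-label classification; a multi-label setup
# would use a sigmoid head instead.
pred = layers.Dense(N_LABELS, activation="softmax")(merged)

model = models.Model(inputs=list(inputs), outputs=pred)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```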
The model is trained on 1271 human-labeled documents. Overall accuracy is around 94%.
Considering the labelling method, this model may rely too heavily on the title. The figure below shows the predictions for some of the testing data.
In this part, we applied NER (Named Entity Recognition) and LDA for topic modeling. Latent Dirichlet Allocation (LDA) allows sets of observations to be explained by unobserved groups that account for why some parts of the data are similar.
Users can input the whole database, or subsets of the database filtered by keywords or categories.
Sample output: (original HTML)
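A minimal gensim sketch of the LDA step (the library choice, topic count, and filtering thresholds are assumptions; the project's exact tooling is not specified):

```python
from gensim import corpora, models

def topic_model(token_lists, num_topics=10):
    """Fit LDA on pre-tokenized documents.

    token_lists: a list of token lists per document, e.g. the
    nouns/adjectives kept by the word-count step above.
    """
    dictionary = corpora.Dictionary(token_lists)
    dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare/common terms
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = models.LdaModel(corpus, num_topics=num_topics,
                          id2word=dictionary, passes=10)
    return lda, corpus, dictionary
```

The interactive HTML output above is the kind of view produced by visualizers such as pyLDAvis, assuming that is what generated it.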
NER speeds up the information extraction process by recognizing, locating and classifying named entities in the documents into pre-defined categories such as names of persons or organizations.
Trained NER entity types for UN resolutions: person, organization, date, law, and place labels.
We use the displaCy visualizer from spaCy to display the labeled texts from the documents.
After 250 training iterations, a demo result is shown as follows.
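A hedged sketch of the training loop and displaCy rendering using spaCy's v2-style API (the training pair, entity offsets, and hyperparameters below are illustrative, not the project's data):

```python
import spacy
from spacy import displacy

# Illustrative training pair; the project's real annotations are not shown.
TRAIN_DATA = [
    ("The General Assembly recalls its resolution 70/1 of 25 September 2015.",
     {"entities": [(33, 48, "LAW"), (52, 69, "DATE")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for _ in range(250):                       # the 250 iterations mentioned above
    losses = {}
    for text, ann in TRAIN_DATA:
        nlp.update([text], [ann], sgd=optimizer, drop=0.35, losses=losses)

# Highlight the recognized entities as HTML with displaCy.
doc = nlp(TRAIN_DATA[0][0])
html = displacy.render(doc, style="ent", page=True)
```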
In order to demonstrate the results with a user-friendly interface, a repository website has been established.
This website is still under construction...
Categories are the result of classification according to UNBIS.
Labels are the aggregation of the five top words for each document.
1. Click here for basic.py (bottom)
2. Click here for a quick demo