Corpus of Resolutions: UN Security Council

Overview

This code, written in the R programming language, downloads the full set of resolutions, drafts and meeting records produced by the United Nations Security Council (UNSC) and published in the UN Digital Library, and processes them into a richly structured, human- and machine-readable data set. It is the basis for the Corpus of Resolutions: UN Security Council (CR-UNSC).

All data sets created with this script are permanently hosted open access and freely available at Zenodo, the scientific repository of CERN. Each version is uniquely identified with a persistent Digital Object Identifier (DOI), the Version DOI. The newest version of the data set will always be available via the Concept DOI: https://doi.org/10.5281/zenodo.7319780

Features

  • 82 Variables
  • Resolution texts in all six official UN languages (English, French, Spanish, Arabic, Chinese, Russian)
  • Draft texts of resolutions in English
  • Meeting record texts in English
  • URLs to draft texts in all other languages (French, Spanish, Arabic, Chinese, Russian)
  • URLs to meeting record texts in all other languages (French, Spanish, Arabic, Chinese, Russian)
  • Citation data as GraphML (UNSC-to-UNSC resolutions and UNSC-to-UNGA resolutions)
  • Bibliographic database in BibTeX/OSCOLA format for use with e.g. Zotero, Endnote and JabRef
  • Extensive Codebook explaining the use of the data set
  • Compilation Report and Quality Assurance Report explaining the construction and validation of the data set
  • Publication quality diagrams for teaching, research and all other purposes (PDF for printing, PNG for web)
  • Open and platform independent file formats (CSV, PDF, TXT, GraphML)
  • Software version controlled with Docker
  • Publication of full data set (Open Data)
  • Publication of full source code (Open Source)
  • Free Software published under the GNU General Public License Version 3 (GNU GPL v3)
  • Data published under Public Domain waiver (CC Zero 1.0)
  • Secure cryptographic signatures for all files in version of record (SHA2-256 and SHA3-512)

Functionality

The pipeline will produce the following results and store them in the output/ folder:

  • Codebook as PDF
  • Compilation Report as PDF
  • Quality Assurance Report as PDF
  • ZIP archive containing the main data set as a CSV file
  • ZIP archive containing only the metadata of the main data set as a CSV file
  • ZIP archive containing citation data and metadata as a GraphML file
  • ZIP archive containing bibliographic data as a BibTeX file
  • ZIP archive containing all resolution texts as TXT files (OCR and extracted)
  • ZIP archive containing all resolution texts as PDF files (original and English OCR)
  • ZIP archive containing all draft texts as PDF files (original)
  • ZIP archive containing all meeting record texts as PDF files (original)
  • ZIP archive containing the full Source Code
  • ZIP archive containing all intermediate pipeline results ("targets")
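
To get started with a published release, the main data set can be read straight from its ZIP archive in R. This is a minimal usage sketch; the archive and CSV file names are assumptions and should be checked against the actual files on Zenodo.

# Read the main data set directly from the ZIP archive.
# File names below are hypothetical -- check the actual Zenodo release.
zip_file <- "CR-UNSC_csv_full.zip"
csv_file <- "CR-UNSC.csv"

df <- read.csv(unz(zip_file, csv_file))
str(df)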

The integrity and veracity of each ZIP archive is documented with cryptographically secure hashes (SHA2-256 and SHA3-512). The hashes are stored in a separate CSV file created during the data set compilation process.
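
A downloaded archive can be verified against the published hashes, for example with the digest package. This is a minimal sketch assuming a hash CSV with columns filename and sha2_256; the actual file and column names may differ.

# Verify a downloaded ZIP archive against its published SHA2-256 hash.
# File and column names are assumptions, not the actual ones.
library(digest)

hashes   <- read.csv("CR-UNSC_hashes.csv")
zip_file <- "CR-UNSC_csv_full.zip"

observed <- digest(zip_file, algo = "sha256", file = TRUE)
expected <- hashes$sha2_256[hashes$filename == zip_file]

stopifnot(length(expected) == 1, observed == expected)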

System Requirements

  • The reference data sets were compiled on a Debian host system. Running the Docker config on an SELinux system such as Fedora will require modifications to the Docker Compose config file.
  • 40 GB of hard drive space
  • Multi-core CPU recommended. We used 8 cores/16 threads to compile the reference data sets. The standard config will use all cores on a system; this can be fine-tuned in the config file.
  • On hardware of this class, the pipeline runs in approximately 40 hours.

Instructions

Step 1: Prepare Project Folder

Clone the GitHub repository into an empty (!) folder, for example with:

$ git clone https://github.com/seanfobbe/cr-unsc

Please always use an empty folder for creating the data set. The code will delete and re-create certain subfolders without requesting additional permission.

Step 2: Create Docker Image

The Dockerfile contains automated instructions for building a full operating system image with all necessary dependencies. To create the image from the Dockerfile, please execute:

$ bash docker-build-image.sh

Step 3: Compile Dataset

If you have previously compiled the data set, whether successfully or not, you can delete all output and temporary files by executing:

$ Rscript delete_all_data.R

You can compile the full data set by executing:

$ bash docker-run-project.sh

Results

Once the pipeline has concluded successfully, the data set and all results are stored in the folder output/.

Visualize Pipeline

After you have run run_project.R at least once, you can use the commands below to visually inspect the pipeline.

> targets::tar_glimpse()     # Only data objects
> targets::tar_visnetwork()  # All objects, including functions

Troubleshooting

The commands below are useful for troubleshooting the pipeline.

> tar_progress()  # Show progress and errors
> tar_meta()      # Show all metadata
> tar_meta(fields = "warnings", complete_only = TRUE)  # Warnings
> tar_meta(fields = "error", complete_only = TRUE)  # Errors
> tar_meta(fields = "seconds")  # Runtime for each target

Project Structure

This structural overview describes the project's most important version-controlled components. During compilation the pipeline will create further folders in which intermediate results are stored (files/, temp/, analysis/ and output/). Final results are stored in the folder output/.

.
├── buttons                    # Buttons (for tex title pages)
├── CHANGELOG.md               # Narrative summary of changes
├── config.toml                # Primary configuration file
├── data                       # Data sets that are imported by the pipeline
├── delete_all_data.R          # Clear all results for fresh run
├── docker-build-image.sh      # Build Docker image
├── docker-compose.yaml        # Docker container runtime configuration
├── Dockerfile                 # Instructions on how to create Docker image
├── docker-run-project.sh      # Build Docker image and run full project
├── etc                        # Additional configuration files
├── functions                  # Key pipeline components
├── gpg                        # Personal Public GPG-Key for Seán Fobbe
├── instructions               # Instructions on how to manually handle data
├── LICENSE                    # License for the software
├── pipeline.Rmd               # Master file for data pipeline
├── README.md                  # Usage instructions
├── reports                    # Report templates
├── run_project.R              # Run entire pipeline
└── tex                        # LaTeX templates


Open Access Publications (Fobbe)

Website --- https://www.seanfobbe.com

Open Data --- https://zenodo.org/communities/sean-fobbe-data

Code Repository --- https://zenodo.org/communities/sean-fobbe-code

Regular Publications --- https://zenodo.org/communities/sean-fobbe-publications

Contact

Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post these to the Issue Tracker on GitHub or write me an e-mail at [email protected]

Issues

Variable Master List

We should create a master list of all variables, noting where each should be sourced from and whether it has already been implemented.

Variable group: Voting Record

We could add:

  • voting record (verbose)
  • yes votes
  • no votes
  • abstentions
  • total number of votes cast
  • whether the resolution was adopted by acclamation
  • other machine-readable details from the verbose voting record

Also, where can we best acquire the voting record?
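
As a starting point, the counts could be parsed out of a verbose voting record with simple regular expressions. The sketch below assumes a record of the form "In favour: 13, Against: 0, Abstaining: 2"; the actual source format still needs to be confirmed.

# Hypothetical parser for a verbose voting record string.
# The input format is an assumption, not the confirmed source format.
parse_votes <- function(x) {
  yes  <- as.integer(sub(".*In favour: ([0-9]+).*",  "\\1", x))
  no   <- as.integer(sub(".*Against: ([0-9]+).*",    "\\1", x))
  abst <- as.integer(sub(".*Abstaining: ([0-9]+).*", "\\1", x))
  data.frame(yes = yes, no = no, abstentions = abst, total = yes + no + abst)
}

parse_votes("In favour: 13, Against: 0, Abstaining: 2")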

Review gold standard resolution texts

Lorenzo did a superb first run creating all English gold-standard resolution texts. These are stored in data/res_en_gold.

Please do a second and third pass to check for remaining OCR errors. The original instructions are stored in instructions/revision-texts.md.

  • Ensure that each paragraph is on a single line. This makes it easier to check the initial verb/gerund (e.g. "noting" or "notes").
  • Make sure to do the editing directly in GitHub. In the web interface, select the folder data/res_en_gold, then the individual file you want to edit. In the top right corner there is a pen symbol; click it and you can edit the text directly in GitHub.
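
A rough automated pre-check could flag lines that start with a lowercase letter, which often indicates a paragraph broken across several lines. This is only a heuristic sketch, not part of the pipeline.

# Heuristic check: lines starting with a lowercase letter may be
# fragments of a paragraph that was broken across lines.
files <- list.files("data/res_en_gold", full.names = TRUE)

for (f in files) {
  lines <- readLines(f, warn = FALSE)
  bad   <- grep("^[a-z]", lines)
  if (length(bad) > 0) {
    cat(f, ": check lines", paste(bad, collapse = ", "), "\n")
  }
}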

Multiple Draft Documents

A small number of record pages include multiple draft documents (main draft and amendment(s)). How should we deal with these?

Currently we join them with a pipe separator in the same column as the URL of the draft record page.

This affects:

res_no   url_record
63       https://digitallibrary.un.org/record/111960
67       https://digitallibrary.un.org/record/112011
113      https://digitallibrary.un.org/record/112086
126      https://digitallibrary.un.org/record/112085
133      https://digitallibrary.un.org/record/112114
138      https://digitallibrary.un.org/record/112107
184      https://digitallibrary.un.org/record/112183
185      https://digitallibrary.un.org/record/112184
217      https://digitallibrary.un.org/record/90484
404      https://digitallibrary.un.org/record/66645
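
For downstream analysis, the pipe-separated column can be split into one row per draft document. A minimal data.table sketch; the column names and record URLs are assumed for illustration:

# Split pipe-separated draft URLs into one row per draft document.
# Column names (res_no, url_draft) are assumptions for illustration.
library(data.table)

dt <- data.table(
  res_no    = 63,
  url_draft = "https://digitallibrary.un.org/record/A|https://digitallibrary.un.org/record/B"
)

dt_long <- dt[, .(url_draft = unlist(strsplit(url_draft, "|", fixed = TRUE))),
              by = res_no]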

OCR for older resolutions

Which resolution numbers require OCR treatment? I assume everything up until roughly 1998. Please confirm by manual inspection (i.e. does it look scanned?).
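
Manual inspection could be narrowed down with a rough heuristic: if pdftools extracts almost no text from a PDF, it is probably a scan. The character threshold below is an arbitrary assumption.

# Rough heuristic: a PDF with (almost) no extractable text layer is
# probably a scan and needs OCR. The threshold is an assumption.
library(pdftools)

needs_ocr <- function(pdf) {
  txt <- paste(pdf_text(pdf), collapse = "")
  nchar(gsub("\\s", "", txt)) < 100
}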

Review current data set output

Please review the current data set output stored in Drive and comment on the following questions:

  • Should we add additional variables? If yes, which?
  • Should we add other features? If yes, which?
  • Should we remove variables? If yes, which?
  • Should we remove features? If yes, which?
  • Import the GraphML file into Gephi. Is everything as you would expect? Run your own tests from previous research to check plausibility.
  • Import the BibTeX file into your favorite citation manager (e.g. Zotero). Does everything work?
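
As an alternative to Gephi, the citation network can also be checked quickly in R with igraph. The GraphML file name below is an assumption.

# Quick plausibility checks on the citation network (file name assumed).
library(igraph)

g <- read_graph("CR-UNSC_graphml.graphml", format = "graphml")

vcount(g)  # number of nodes (resolutions)
ecount(g)  # number of edges (citations)

# Most-cited resolutions by in-degree
head(sort(degree(g, mode = "in"), decreasing = TRUE), 10)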

Document Splits

  • Do resolutions incorporate superfluous text?
  • Do English resolutions include French text that needs to be removed?
  • Do they include text snippets of resolutions other than the one named in the metadata?

If yes, please provide a precise set of resolution numbers that require special processing instructions. If possible, please also provide details on the kind of processing required (e.g. two-column English/French layout, top part of resolution includes foreign-language text).
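
Candidate resolutions could be pre-screened with a crude heuristic that counts common French function words in the English text. The marker words and threshold below are assumptions for illustration only.

# Crude screen for French text leaking into English resolution texts.
# Marker words and threshold are illustrative assumptions.
french_markers <- c("\\bles\\b", "\\bdes\\b", "\\bdu\\b", "\\bune\\b", "\\baux\\b")

count_matches <- function(pattern, txt) {
  length(regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]])
}

looks_french <- function(txt) {
  sum(vapply(french_markers, count_matches, integer(1), txt = txt)) > 20
}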

Create Tables of UN Regional Grouping Members

With tables of UN regional grouping members (e.g. GRULAC) we can auto-generate regular expressions to search for them.

Also, how should we deal with membership changes over the years? Membership tables contemporary to each resolution would be best, but very labor-intensive.
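
Generating the regex from a membership table is straightforward, as in the sketch below; the member list shown is illustrative and deliberately incomplete.

# Auto-generate a search regex from a regional grouping membership table.
# The member list is an illustrative, incomplete example.
grulac <- c("Argentina", "Brazil", "Chile", "Colombia", "Mexico")

regex <- paste0("\\b(", paste(grulac, collapse = "|"), ")\\b")

grepl(regex, "Statement by the representative of Brazil", perl = TRUE)  # TRUE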

Add bibtex export with metadata

We can export the metadata as bibliography database files (BibTeX format) for use with Endnote, Citavi, Zotero and JabRef!

Do you use reference managers? Do you cite UNSC resolutions with them? If yes, please let me know what kind of format you use. I will definitely include one compliant with bl-oscola for LaTeX, because I use that myself.
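
For illustration, a minimal sketch of writing a single @misc entry from the metadata; the key scheme and field layout are assumptions, not the format actually used in the data set.

# Build one BibTeX entry from resolution metadata.
# Key scheme and fields are assumptions for illustration only.
entry <- sprintf(
  "@misc{UNSC_Res_%d,
  author = {{United Nations Security Council}},
  title  = {{Resolution %d (%d)}},
  year   = {%d}
}",
  242, 242, 1967, 1967
)

cat(entry)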

Create Complete End-to-End Data Pipeline for UNSC

The goal is to create a complete end-to-end pipeline that can be run with a single command and will build a complete corpus of UNSC resolutions and associated documents.

I will use this issue to keep you updated on major updates to the pipeline.

Re-Check Record Pages for Meetings and Drafts

Some record pages for resolutions --- instead of containing links to the specific record page for a draft or meeting --- contain a search query for the document symbol. The current scrape function expects a single well-defined record page link and returns garbage otherwise. This is currently hotfixed by assigning NA wherever the problem occurs.

Possible Solutions:

  • Resolve search query
  • Acquire record page in other manner
  • Ignore and replace with NA
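
Whatever the eventual fix, the scraper can guard against this case by validating the URL shape and otherwise returning NA, mirroring the current hotfix. The exact URL pattern below is an assumption.

# Guard: accept only clean record-page URLs, otherwise return NA.
# The URL pattern is an assumption for illustration.
clean_record_url <- function(url) {
  ifelse(grepl("^https://digitallibrary\\.un\\.org/record/[0-9]+$", url),
         url, NA_character_)
}

clean_record_url("https://digitallibrary.un.org/record/111960")        # kept
clean_record_url("https://digitallibrary.un.org/search?p=S/RES/242")   # NA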
