Git Product home page Git Product logo

all-foreign-gifts-around-us's Introduction

All Foreign Gifts Around Us

This project aims to extract structured data from unstructured text published in the Federal Register about gifts provided to U.S. officials by foreign government officials.

Overview

The Federal Register is the official journal of the United States government, which publishes various notices, rules, and regulatory information. Among these publications are reports detailing gifts received by U.S. government officials from foreign sources. Some presidential administrations are better than others about reporting these gifts. The current minimum value of reportable gifts is $480.

This project uses a Large Language Model (LLM), specifically Claude 3 Sonnet, to extract structured information from these unstructured text reports and convert it into JSON format. The JSON data can then be used for further analysis, visualization, or integration with other systems.

Data Source

The source data for this project is a text file containing excerpts from the Federal Register, specifically the sections related to gift reports. The text file is structured with sections separated by the string "Federal Register / Vol.".

Data Extraction Process

The data extraction process involves the following steps:

  1. Splitting the Text: The source text file is split into sections based on the "Federal Register / Vol." separator.
  2. AI-Assisted Extraction: Each section is passed to an LLM (Claude 3 Sonnet) for extracting structured information in JSON format. The LLM is provided with an example JSON structure and instructions to extract relevant gift details from the text.
  3. Parsing and Error Handling: The output from the LLM is parsed as JSON, with error handling and recovery mechanisms in place to handle invalid or problematic JSON output.
  4. Deduplication and Merging: The extracted JSON objects from different sections are combined, removing duplicates based on specific keys. The 'disposition' key is handled as an array, merging values from duplicate entries.
  5. Output Generation: The final merged and deduplicated JSON data is saved to an output file (combined.json).

Code Overview

The project consists of Python scripts that handle the data extraction, deduplication, and merging processes. The LLM (Claude 3 Sonnet) is integrated using the Anthropic API, and its role includes:

  1. Extracting structured JSON objects from unstructured text sections.
  2. Assisting in writing and refining the Python code for organizing and running the extraction process.

The main scripts in the project are:

  • extract_data.py: Handles the text splitting, AI-assisted extraction, parsing, and error handling.
  • combine_json.py: Combines the extracted JSON objects from different sections, handles deduplication, and merges the 'disposition' values.

Data Updates

The Federal Register publishes gift reports annually (and sometimes provides updates to previous records), and this project will be updated periodically to incorporate new data as it becomes available.

Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

Dependencies

For text extraction from PDFs on Linux systems, you can run the following commands:

sudo apt-get update
sudo apt-get install libpoppler-cpp-dev
sudo apt-get install poppler-utils

Python dependencies are listed in requirements.txt.

all-foreign-gifts-around-us's People

Contributors

dwillis avatar

Stargazers

Chris Zubak-Skees avatar Jimmy Cloutier avatar Lau Van Kiet avatar

Watchers

 avatar

all-foreign-gifts-around-us's Issues

Extract foreign government gifts from text

Probably need to chunk the text into parts since there is a lot of text. Add bioguide IDs for lawmakers, along with offices for all. Create separate columns for item, donor, value and reason.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.