court-data-pipeline's People

Contributors

bguayante, jdziurlaj, jungshadow

court-data-pipeline's Issues

Naming Convention for JSON-LD Files

Currently, the scraped JSON-LD files are saved locally under the URL of the page they came from. For example, JSON-LD hosted on https://www.courts.ca.gov/los-angeles-county.html would be saved by the scraper as https://www.courts.ca.gov/los-angeles-county.json. This file name is used during the validation process. Validated files are then renamed, using string manipulation to extract just the jurisdiction name (los-angeles-county.json). The renamed file is then used to import new data into the DB.

For the moment this is fine, but only because I am using dev data with a standardized naming scheme. URLs encountered in the wild are unlikely to be so easily parsed, given the lack of standardization among court sites. So at the moment, I can think of two solutions:

  1. Continue to use the URL as the filename of the JSON that is scraped but do not rename it after validation. It will be passed to the DB as {url}.json.
  2. Since the URLs to scrape are supplied in a CSV passed as an argument when the script is executed, require an additional column that provides an identifier for the courthouse or its jurisdiction, and use that identifier as the filename throughout the script.

I think (1) is probably the way to go. There was a benefit to using simpler names earlier in development that is lost now that everything is automated. While more information is always good, (2) puts additional burden on the administrators running the script and I think a goal is to make this process as easy as possible.

Regardless of approach, there is one other issue: URLs make bad filenames because they contain the / character. The file system parses each / as a directory separator, so the script throws errors when it tries to access the file. Is there a standard replacement character, or can we choose one? I'm currently replacing / with . but that character also has meaning to the file system and might have unintended consequences.
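
One possible answer to the replacement-character question, sketched here rather than taken from the pipeline: percent-encode the whole URL, so that / becomes %2F and the name can be decoded back to the original URL later. The helper names url_to_filename and filename_to_url are hypothetical, and appending .json to the full URL (instead of swapping out .html) is an assumption.

```python
from urllib.parse import quote, unquote


def url_to_filename(url: str, suffix: str = ".json") -> str:
    """Percent-encode a URL so it is usable as a single file name (hypothetical helper)."""
    # safe="" forces "/" and ":" to be encoded too (as %2F and %3A).
    return quote(url, safe="") + suffix


def filename_to_url(filename: str, suffix: str = ".json") -> str:
    """Invert url_to_filename and recover the original URL."""
    if not filename.endswith(suffix):
        raise ValueError(f"expected a {suffix} file name, got {filename!r}")
    return unquote(filename[: -len(suffix)])


name = url_to_filename("https://www.courts.ca.gov/los-angeles-county.html")
print(name)                   # https%3A%2F%2Fwww.courts.ca.gov%2Flos-angeles-county.html.json
print(filename_to_url(name))  # https://www.courts.ca.gov/los-angeles-county.html
```

The encoding is reversible, which matters if the DB import needs the source URL. One caveat: very long URLs could exceed the file name length limit that many file systems impose, so a hash of the URL might be a safer fallback in that case.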

@jungshadow and @JDziurlaj, I'd appreciate your input.

Make SHACL file accessible remotely

The definitions and the SHACL file have to be kept in parity, and the definitions will be hosted remotely. I therefore need to add logic to validator.py that downloads the SHACL file from the same remote source, saves it locally, and then uses the local copy with pyshacl.
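
A rough sketch of that flow, under assumptions: the function names fetch_shacl and validate_file are hypothetical, the remote location is whatever URL ends up hosting the SHACL file, and the pyshacl call assumes the data files are JSON-LD. The download step uses only the standard library; pyshacl is imported lazily so the fetch helper can be tried on its own.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_shacl(shacl_url: str, dest: Path) -> Path:
    """Download the SHACL file and replace the local copy (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a failed download never
    # leaves a half-written SHACL file where the validator expects one.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(shacl_url, timeout=30) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest


def validate_file(jsonld_path: Path, shacl_path: Path) -> bool:
    """Validate one JSON-LD file against the downloaded shapes."""
    # Imported here so fetch_shacl works without pyshacl installed.
    from pyshacl import validate

    conforms, _results_graph, results_text = validate(
        str(jsonld_path),
        shacl_graph=str(shacl_path),
        data_graph_format="json-ld",  # assumption: scraped data is JSON-LD
    )
    if not conforms:
        print(results_text)
    return conforms
```

Calling fetch_shacl at the start of every run keeps the local copy in step with the remote one, at the cost of requiring network access for validation.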

Wishlist for Pipeline

A few things I'd like to see:

  • Build application on argparse
  • Split the application code into logical sections/modules (e.g., db, scraper, exporter)
  • #3
  • #4

You can create individual Issues on any of these for discussion as you see fit.
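
To make the first two wishlist items concrete, here is a minimal sketch of an argparse entry point with one subcommand per module. The subcommand names (scrape, validate, export), their arguments, and the defaults are all placeholders, not the project's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline stage (names are placeholders)."""
    parser = argparse.ArgumentParser(prog="court-data-pipeline")
    subparsers = parser.add_subparsers(dest="command", required=True)

    scrape = subparsers.add_parser("scrape", help="fetch JSON-LD from the URLs in a CSV")
    scrape.add_argument("csv_path", help="CSV file listing the URLs to scrape")

    validate = subparsers.add_parser("validate", help="validate scraped files against SHACL")
    validate.add_argument("--input-dir", default="scraped")

    export = subparsers.add_parser("export", help="write query results to CSV")
    export.add_argument("--output", default="results.csv")

    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # Each command would dispatch to its own module here
    # (e.g. scraper, validator, exporter); this sketch only reports the choice.
    print(f"would run {args.command!r} with {vars(args)}")
    return 0


main(["scrape", "urls.csv"])
```

Keeping the parser in its own function makes it easy to test argument handling without running any pipeline stage.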

Task Tracking

  • Update ontology URI in validator and SHACL file when provided path by Margaret/Stanford
  • Update validator to fetch SHACL file each run to ensure parity with updates to ontology (need above URI)
  • Convert SPARQL query output to CSV
  • Add License file
  • Integrate pipeline modules into argparse
  • Merge dev and main, delete dev
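
For the SPARQL-to-CSV task, one option is sketched below. It assumes the query output is available in the standard SPARQL JSON results format (a head.vars list plus results.bindings); the function name sparql_json_to_csv and the sample data are made up for illustration, and the pipeline's real query layer may expose results differently.

```python
import csv
import io
import json


def sparql_json_to_csv(results_json: str) -> str:
    """Convert SPARQL JSON results into CSV text (hypothetical helper)."""
    results = json.loads(results_json)
    variables = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # A variable can be unbound in a given row; emit an empty cell for it.
        writer.writerow([binding.get(var, {}).get("value", "") for var in variables])
    return buffer.getvalue()


# Made-up sample data, not output from the real pipeline.
sample = json.dumps({
    "head": {"vars": ["court", "name"]},
    "results": {"bindings": [
        {"court": {"type": "uri", "value": "https://example.org/court/1"},
         "name": {"type": "literal", "value": "Example Court"}},
        {"court": {"type": "uri", "value": "https://example.org/court/2"}},
    ]},
})
print(sparql_json_to_csv(sample))
```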
