Create Data Science Environment about transportation-systems HOT 8 CLOSED

bhgrant8 commented on September 8, 2024

Create Data Science Environment

from transportation-systems.

Comments (8)

znmeb commented on September 8, 2024

Here's what I'm proposing:

Two services:

odot_crash_data - will contain the ODOT crash data.
passenger_census - will contain the ridership data; the name passenger_census comes from the CSV file we received.

Container port numbers and their host mappings and postgres user passwords will be set from a local .env file.
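A minimal sketch of how such a .env file could be read, assuming simple KEY=VALUE lines. The variable names below are hypothetical (the thread does not specify them); 5432 is the default PostgreSQL port inside the container.

```python
from typing import Dict

# Hypothetical .env contents; the real variable names are not given in the thread.
EXAMPLE_ENV = """\
# one block of settings per service
ODOT_CRASH_DATA_HOST_PORT=5439
ODOT_CRASH_DATA_POSTGRES_PASSWORD=change-me
PASSENGER_CENSUS_HOST_PORT=5440
PASSENGER_CENSUS_POSTGRES_PASSWORD=change-me-too
"""


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def port_mapping(settings: Dict[str, str], service: str, container_port: int = 5432) -> str:
    """Build a host:container port mapping string for one service."""
    host_port = settings[f"{service.upper()}_HOST_PORT"]
    return f"{host_port}:{container_port}"


settings = parse_env(EXAMPLE_ENV)
for service in ("odot_crash_data", "passenger_census"):
    print(service, port_mapping(settings, service))
```

Keeping the ports and passwords in one untracked file means each developer can remap ports locally without touching the service definitions.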

We need to define a mechanism for the Dockerfiles to acquire the input database dump files without the user having to download them. In other words, I want to be able to do a wget or curl in the Dockerfile that runs at image build time, rather than doing it with a Dockerfile COPY. This is something we have to get nailed down for DevOps / deployment anyway, so we might as well solve it this week. ;-) See hackoregon/civic-devops#3.
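As a rough illustration of the build-time fetch, here is a sketch that uses Python's standard library in place of wget or curl. The function name and the idea of calling it from a Dockerfile RUN step are assumptions, and it only covers URLs that can be read without authentication.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dump(url: str, dest: Path) -> Path:
    """Download a database dump from url and write it to dest.

    Hypothetical helper: it would be invoked from a RUN step at image
    build time, so the user never downloads the file by hand.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading the whole dump into memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (placeholder URL, not a real location):
# fetch_dump("https://example.org/dumps/odot_crash_data.backup",
#            Path("/tmp/odot_crash_data.backup"))
```

A private bucket would need credentials on top of this, which is where the S3 discussion in the following comments comes in.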

BrianHGrant commented on September 8, 2024

I'll get some data on my personal dev S3 account and set up a billing alert, and we can play around a little.

If we can get a proof of concept and cost idea, there would be pretty quick adoption I would imagine. This should be a priority in my mind because then we ensure we are working from the same data and saving manual hours updating.

znmeb commented on September 8, 2024

OK ... how does S3 authentication work? Is it like everything else (a PEM key, ssh-stuff?)

bhgrant8 commented on September 8, 2024

Access and secret key.

You will need to add the AWS CLI client to your Dockerfile:

RUN pip install --upgrade --user awscli

We did something similar to pull our secrets last year:

https://github.com/hackoregon/backend-service-pattern/blob/master/bin/getconfig.sh

Which was called in the entrypoint file:

https://github.com/hackoregon/backend-service-pattern/blob/master/bin/docker-entrypoint.sh
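A sketch of what an entrypoint-style fetch with the access and secret key could look like. This is not the content of the linked scripts; the helper names, bucket, and object key are hypothetical. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the environment variables the AWS CLI reads credentials from.

```python
import os
import subprocess
from typing import Dict, List


def s3_copy_command(bucket: str, key: str, dest: str) -> List[str]:
    """Build the `aws s3 cp` command that copies one object to a local path."""
    return ["aws", "s3", "cp", f"s3://{bucket}/{key}", dest]


def credential_env(access_key: str, secret_key: str) -> Dict[str, str]:
    """Return a copy of the current environment with AWS credentials added."""
    env = dict(os.environ)
    env["AWS_ACCESS_KEY_ID"] = access_key
    env["AWS_SECRET_ACCESS_KEY"] = secret_key
    return env


def fetch_from_s3(bucket: str, key: str, dest: str,
                  access_key: str, secret_key: str) -> None:
    """Run the copy; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(
        s3_copy_command(bucket, key, dest),
        env=credential_env(access_key, secret_key),
        check=True,
    )


# Hypothetical bucket and key, shown only to illustrate the command shape.
print(" ".join(s3_copy_command("example-bucket", "dumps/odot_crash_data.backup",
                               "/tmp/odot_crash_data.backup")))
```

Passing the keys through the environment keeps them out of the image layers, provided they are supplied at run time rather than baked in at build time.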

znmeb commented on September 8, 2024

Yeah - syncing with S3 is built into cookiecutter's data science template

bhgrant8 commented on September 8, 2024

OK, so I went ahead and set up the following access policy (actual bucket name is redacted):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "<ACTUAL ARN>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "<ACTUAL ARN>"
            ]
        }
    ]
}

I then attached this policy to an IAM group and created a user within it. Will provide creds through Slack.
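For anyone recreating a read-only policy like this for their own bucket, here is a small sketch that generates the same two statements. The bucket name is a placeholder, since the real ARNs are redacted above. It assumes the usual S3 split: s3:ListBucket is granted on the bucket ARN, while s3:GetObject and s3:GetObjectVersion are granted on the objects under it (the bucket ARN followed by /*).

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build a list-and-read policy document for a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reads are object-level actions, so they target the objects.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "example-bucket" is a placeholder, not the redacted bucket from this thread.
print(json.dumps(read_only_bucket_policy("example-bucket"), indent=4))
```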

The creds will work for either a Docker or a cookiecutter setup, as you wish. It looks like cookiecutter is using the sync command from the CLI:

https://github.com/drivendata/cookiecutter-data-science/blob/master/%7B%7B%20cookiecutter.repo_name%20%7D%7D/Makefile#L47

It looks like we may need to name the folder within the bucket "data"?

znmeb commented on September 8, 2024

I'm hacking away on this in https://github.com/hackoregon/data-science-pet-containers. It's just about where I want it, so I'm planning a "formal release" later this week.

I'm testing a utility called rclone (https://rclone.org/) for the cloud syncing. It's available in all the Linux distros, including Debian. It seems to be well maintained and will sync just about anywhere, not just S3. But IMHO it is not suitable for deployment, just for desktops. It's interactive and its secrets management scheme would probably rule out its use even in self-managed servers.

znmeb commented on September 8, 2024

I put this on the back burner for the Tech Challenge but I'm back on it. I just have one major documentation task and another example scenario to do.
