
josephmachado / beginner_de_project

388 stars · 9 watchers · 90 forks · 1.77 MB

Beginner data engineering project - batch edition

Home Page: https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/

License: MIT License

Languages: Shell 5.94%, Python 37.82%, Makefile 16.06%, Dockerfile 0.43%, HCL 39.74%
Topics: airflow etl python docker emr redshift spark redshift-cluster database engineering

beginner_de_project's People

Contributors

dependabot[bot], josephmachado


beginner_de_project's Issues

command `make infra-up` failed when creating resource "redshift_schema" "external_from_glue_data_catalog"

Hi @josephmachado
When I run make infra-up, I get this error message:

│ Error: error waiting for EMR Cluster (j-626JY5011R9W) to create: unexpected state 'TERMINATED_WITH_ERRORS', wanted target 'RUNNING, WAITING'. last error: VALIDATION_ERROR: EMR service role arn:aws:iam::972206383570:role/EMR_DefaultRole is invalid
│ 
│   with aws_emr_cluster.sde_emr_cluster,
│   on main.tf line 110, in resource "aws_emr_cluster" "sde_emr_cluster":
│  110: resource "aws_emr_cluster" "sde_emr_cluster" {
│ 
╵
╷
│ Error: could not start transaction: dial tcp 18.141.109.61:5439: connect: connection timed out
│ 
│   with redshift_schema.external_from_glue_data_catalog,
│   on main.tf line 172, in resource "redshift_schema" "external_from_glue_data_catalog":
│  172: resource "redshift_schema" "external_from_glue_data_catalog" {
│ 
╵
make: *** [Makefile:45: infra-up] Error 1

In the AWS console, I confirmed that the EMR cluster has a terminated status:

Terminated with errors: EMR service role arn:aws:iam::972206383570:role/EMR_DefaultRole is invalid

Error: could not start transaction: dial tcp 23.22.231.56:5439: connect: connection timed out

│ with redshift_schema.external_from_glue_data_catalog,
│ on main.tf line 170, in resource "redshift_schema" "external_from_glue_data_catalog":
│ 170: resource "redshift_schema" "external_from_glue_data_catalog" {

I have been working on this for many days now with no luck.
I tried to identify what this IP (23.22.231.56) actually is.

It seems like this endpoint is causing the problem:
"Endpoint": {
"Address": "sde-redshift-cluster.ccieif6ln9js.us-east-1.redshift.amazonaws.com",
"Port": 5439
},

I can confirm this is the culprit by pinging it: the ping resolves to the same IP mentioned in the error.

ping sde-redshift-cluster.ccieif6ln9js.us-east-1.redshift.amazonaws.com
PING ec2-23-22-231-56.compute-1.amazonaws.com (23.22.231.56) 56(84) bytes of data.

aws redshift describe-clusters --cluster-identifier sde-redshift-cluster
{
    "Clusters": [
        {
            "ClusterIdentifier": "sde-redshift-cluster",
            "NodeType": "dc2.large",
            "ClusterStatus": "available",
            "ClusterAvailabilityStatus": "Available",
            "MasterUsername": "sde_user",
            "DBName": "",
            "Endpoint": {
                "Address": "sde-redshift-cluster.ccieif6ln9js.us-east-1.redshift.amazonaws.com",
                "Port": 5439
            },
            "ClusterCreateTime": "2023-05-01T05:30:17.926Z",
            "AutomatedSnapshotRetentionPeriod": 1,
            "ManualSnapshotRetentionPeriod": -1,
            "ClusterSecurityGroups": [],
            "VpcSecurityGroups": [
                {
                    "VpcSecurityGroupId": "sg-0a32ec33fa085f52d",
                    "Status": "active"
                }
            ],
            "ClusterParameterGroups": [
                {
                    "ParameterGroupName": "default.redshift-1.0",
                    "ParameterApplyStatus": "in-sync"
                }
            ],
            "ClusterSubnetGroupName": "default",
            "VpcId": "vpc-0bdf0c67e3b0c8093",
            "AvailabilityZone": "us-east-1e",
            "PreferredMaintenanceWindow": "wed:03:30-wed:04:00",
            "PendingModifiedValues": {},
            "ClusterVersion": "1.0",
            "AllowVersionUpgrade": true,
            "NumberOfNodes": 1,
            "PubliclyAccessible": true,
            "Encrypted": false,
            "ClusterPublicKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCJiSS/9+VF3Nk0FCdZReWLjPkGjDuG3jzJE7I6+SH3CBNKVEV2TzoCdZ4tw3h+675CgNIGkN3ICk6qMzBepJxZcJ7oHfu6R0kF7x2gKHIyACl/hOmUZLBUSTGWF8ix768F25a++9XZwbaCDkNpOq+7TlipS35dMaRpuVWRVND7Dkyk7u4saWSx2U++iOYmyo7tTbn8XCMtfiy6qeukFpFCEcUfynfN11Tz+ycsmXJuudjBdBz3g17vHGlLJgEmJg5hRo3HuhsIB8/OrZgTrZAO7JJX0wBupliPt3KjyoLkjGoSq0GWr0icYv7SD+UC/c6mIEt7og5IdnFcZwH8dZAT Amazon-Redshift\n",
            "ClusterNodes": [
                {
                    "NodeRole": "SHARED",
                    "PrivateIPAddress": "172.31.51.149",
                    "PublicIPAddress": "3.213.178.124"
                }
            ],
            "ClusterRevisionNumber": "49780",
            "Tags": [],
            "EnhancedVpcRouting": false,
            "IamRoles": [
                {
                    "IamRoleArn": "arn:aws:iam::460318025676:role/sde_redshift_iam_role",
                    "ApplyStatus": "in-sync"
                }
            ],
            "MaintenanceTrackName": "current",
            "DeferredMaintenanceWindows": [],
            "NextMaintenanceWindowStartTime": "2023-05-03T03:30:00Z"
        }
    ]
}
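
A plausible cause of the "could not start transaction" timeout (not confirmed in this thread) is that the Redshift cluster's security group does not allow inbound connections on port 5439 from the machine running Terraform. A minimal check-and-fix sketch with the AWS CLI, using the VpcSecurityGroupId reported above; the IP-lookup endpoint is just one way to discover your public address:

    # Look up your current public IP
    MY_IP=$(curl -s https://checkip.amazonaws.com)

    # Allow inbound Redshift traffic (TCP 5439) from that address only;
    # sg-0a32ec33fa085f52d is the VpcSecurityGroupId reported above
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0a32ec33fa085f52d \
        --protocol tcp --port 5439 \
        --cidr "${MY_IP}/32"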


Postgres denied permission when trying to write to temp folder

unload_user_purchase = './scripts/sql/filter_unload_user_purchase.sql'
temp_filtered_user_purchase = '/temp/temp_filtered_user_purchase.csv'

When copying the original database, Postgres is unable to write to the temp folder in the Docker container due to insufficient permissions. I have tried changing permissions, using \COPY, and creating a new directory, which threw a 'no such directory' error in the Airflow log. I am using Ubuntu 20.
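
Since a server-side COPY writes files with the permissions of the Postgres server process, one workaround (a sketch; the container name is a placeholder, substitute your compose service's actual container) is to create the directory inside the database container and hand it to the postgres user:

    # Create /temp inside the database container and give it to the postgres user
    docker exec -u root <postgres-container> mkdir -p /temp
    docker exec -u root <postgres-container> chown postgres:postgres /temp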

AWS EMR IAM role error [sent via email]


I have been trying to get the infra up for more than 3 days and I am terribly stuck at this point. I am working on the beginner batch DE project. Please help me with the error below.

│ Error: error waiting for EMR Cluster (j-1QCC2RIRPI8ZQ) to create: unexpected state 'TERMINATING', wanted target 'RUNNING, WAITING'. last error: VALIDATION_ERROR: EMR service role arn:aws:iam::460318025676:role/EMR_DefaultRole is invalid

│ with aws_emr_cluster.sde_emr_cluster,
│ on main.tf line 108, in resource "aws_emr_cluster" "sde_emr_cluster":
│ 108: resource "aws_emr_cluster" "sde_emr_cluster" {



│ Error: could not start transaction: dial tcp 54.90.81.41:5439: connect: connection timed out

│ with redshift_schema.external_from_glue_data_catalog,
│ on main.tf line 170, in resource "redshift_schema" "external_from_glue_data_catalog":
│ 170: resource "redshift_schema" "external_from_glue_data_catalog" {


# Set up EMR
resource "aws_emr_cluster" "sde_emr_cluster" {
  name                   = "sde_emr_cluster"
  release_label          = "emr-6.10.0"
  applications           = ["Spark", "Hadoop"]
  scale_down_behavior    = "TERMINATE_AT_TASK_COMPLETION"
  service_role           = "EMR_DefaultRole"
  termination_protection = false

  auto_termination_policy {
    idle_timeout = var.auto_termination_timeoff
  }

  ec2_attributes {
    instance_profile = aws_iam_instance_profile.sde_ec2_iam_role_instance_profile.id
  }

  master_instance_group {
    instance_type  = var.instance_type
    instance_count = 1
    name           = "Master - 1"

    ebs_config {
      size                 = 32
      type                 = "gp2"
      volumes_per_instance = 2
    }
  }
}
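
A likely cause of the "EMR service role ... is invalid" validation error is that the account's default EMR roles were never created (or were deleted at some point). As a sketch of a fix, the AWS CLI can create and then verify them:

    # Create EMR_DefaultRole and EMR_EC2_DefaultRole if the account lacks them
    aws emr create-default-roles

    # Verify the role now exists and inspect its trust policy
    aws iam get-role --role-name EMR_DefaultRole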


DAGs keep failing

Hey @josephmachado, I need your help on this.

I tried running my own Spark script using your framework after changing the needed files, but my DAGs keep failing. I would really appreciate your assistance.

Below is the link to the issue I am currently facing:
Link!

user_behaviour.py vs using DAG?

Hi Joseph,

I'm having a hard time understanding how to execute all the loads (step 5 onward). I see you've updated user_behaviour.py to pull in the EMR ID and bucket name dynamically, so it seems we wouldn't need to update that.

So, basically, what is needed to get the tasks running? I have just clicked "run DAG" in Airflow; it shows a running status, but it doesn't seem that any of the tasks are launching.

Thanks,
Steve
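
If a run shows as running but no tasks launch, one common cause (an assumption here, not confirmed in the thread) is that the DAG is still paused or the scheduler container is unhealthy. A small sketch with the Airflow 2 CLI, run inside the Airflow container; the DAG id user_behaviour matches this repo's DAG:

    # Unpause the DAG so the scheduler actually queues its tasks
    airflow dags unpause user_behaviour

    # Trigger a fresh run, then list runs to confirm it is picked up
    airflow dags trigger user_behaviour
    airflow dags list-runs -d user_behaviour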

Download dataset

Hi,
I want to build this project on my own on GCP and wanted to know where I can find all the datasets used in this project.

Running the docker-compose command gives me connection refused errors

Hi, I am at stage 1 of the handout, and running the command docker-compose -f docker-compose-LocalExecutor.yml up -d gives me the error below.

  File "urllib3/connectionpool.py", line 670, in urlopen
  File "urllib3/connectionpool.py", line 392, in _make_request
  File "http/client.py", line 1255, in request
  File "http/client.py", line 1301, in _send_request
  File "http/client.py", line 1250, in endheaders
  File "http/client.py", line 1010, in _send_output
  File "http/client.py", line 950, in send
  File "docker/transport/unixconn.py", line 43, in connect
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "requests/adapters.py", line 439, in send
  File "urllib3/connectionpool.py", line 726, in urlopen
  File "urllib3/util/retry.py", line 410, in increment
  File "urllib3/packages/six.py", line 734, in reraise
  File "urllib3/connectionpool.py", line 670, in urlopen
  File "urllib3/connectionpool.py", line 392, in _make_request
  File "http/client.py", line 1255, in request
  File "http/client.py", line 1301, in _send_request
  File "http/client.py", line 1250, in endheaders
  File "http/client.py", line 1010, in _send_output
  File "http/client.py", line 950, in send
  File "docker/transport/unixconn.py", line 43, in connect
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionRefusedError(61, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "docker/api/client.py", line 214, in _retrieve_server_version
  File "docker/api/daemon.py", line 181, in version
  File "docker/utils/decorators.py", line 46, in inner
  File "docker/api/client.py", line 237, in _get
  File "requests/sessions.py", line 543, in get
  File "requests/sessions.py", line 530, in request
  File "requests/sessions.py", line 643, in send
  File "requests/adapters.py", line 498, in send
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionRefusedError(61, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 81, in main
  File "compose/cli/main.py", line 200, in perform_command
  File "compose/cli/command.py", line 60, in project_from_options
  File "compose/cli/command.py", line 152, in get_project
  File "compose/cli/docker_client.py", line 41, in get_client
  File "compose/cli/docker_client.py", line 170, in docker_client
  File "docker/api/client.py", line 197, in __init__
  File "docker/api/client.py", line 221, in _retrieve_server_version
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(61, 'Connection refused'))
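
This traceback shows docker-compose failing to reach the Docker daemon at all (errno 61 is "connection refused" on macOS), not a problem with the compose file itself. A quick sanity check, assuming Docker Desktop (macOS) or the docker systemd service (Linux) is installed:

    # If this fails, the Docker daemon is not running
    docker info

    # Start the daemon, then retry docker-compose
    open -a Docker                 # macOS (Docker Desktop)
    sudo systemctl start docker    # Linux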

Botocore deprecated?

Getting the below error for Task Instance: start_emr_movie_classification_script

*** Reading local file: /opt/airflow/logs/user_behaviour/start_emr_movie_classification_script/2021-05-23T00:00:00+00:00/3.log
[2021-07-22 22:13:49,092] {taskinstance.py:876} INFO - Dependencies all met for <TaskInstance: user_behaviour.start_emr_movie_classification_script 2021-05-23T00:00:00+00:00 [queued]>
[2021-07-22 22:13:49,135] {taskinstance.py:876} INFO - Dependencies all met for <TaskInstance: user_behaviour.start_emr_movie_classification_script 2021-05-23T00:00:00+00:00 [queued]>
[2021-07-22 22:13:49,137] {taskinstance.py:1067} INFO - 
--------------------------------------------------------------------------------
[2021-07-22 22:13:49,138] {taskinstance.py:1068} INFO - Starting attempt 3 of 3
[2021-07-22 22:13:49,141] {taskinstance.py:1069} INFO - 
--------------------------------------------------------------------------------
[2021-07-22 22:13:49,170] {taskinstance.py:1087} INFO - Executing <Task(EmrAddStepsOperator): start_emr_movie_classification_script> on 2021-05-23T00:00:00+00:00
[2021-07-22 22:13:49,184] {standard_task_runner.py:52} INFO - Started process 76268 to run task
[2021-07-22 22:13:49,198] {standard_task_runner.py:76} INFO - Running: ['***', 'tasks', 'run', 'user_behaviour', 'start_emr_movie_classification_script', '2021-05-23T00:00:00+00:00', '--job-id', '10', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/user_behaviour.py', '--cfg-path', '/tmp/tmp678ait_b', '--error-file', '/tmp/tmp6ovmxhju']
[2021-07-22 22:13:49,203] {standard_task_runner.py:77} INFO - Job 10: Subtask start_emr_movie_classification_script
[2021-07-22 22:13:49,269] {logging_mixin.py:104} INFO - Running <TaskInstance: user_behaviour.start_emr_movie_classification_script 2021-05-23T00:00:00+00:00 [running]> on host c158fb23ec09
[2021-07-22 22:13:49,367] {taskinstance.py:1282} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=***@***.com
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=user_behaviour
AIRFLOW_CTX_TASK_ID=start_emr_movie_classification_script
AIRFLOW_CTX_EXECUTION_DATE=2021-05-23T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-05-23T00:00:00+00:00
[2021-07-22 22:13:49,369] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-07-22 22:13:49,383] {base_aws.py:166} INFO - Credentials retrieved from login
[2021-07-22 22:13:49,384] {base_aws.py:82} INFO - Retrieving region_name from Connection.extra_config['region_name']
[2021-07-22 22:13:49,385] {base_aws.py:87} INFO - Creating session with aws_access_key_id=AKIAQRZVCO4C63R5GTLY region_name=us-west-2
[2021-07-22 22:13:49,401] {base_aws.py:157} INFO - role_arn is None
[2021-07-22 22:13:49,469] {emr_add_steps.py:92} INFO - Adding steps to j-2NOT8EACUS8WV
[2021-07-22 22:13:49,687] {taskinstance.py:1481} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1137, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/amazon/aws/operators/emr_add_steps.py", line 100, in execute
    response = emr.add_job_flow_steps(JobFlowId=job_flow_id, Steps=steps)
  File "/home/airflow/.local/lib/python3.6/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the AddJobFlowSteps operation: A job flow that is shutting down, terminated, or finished may not be modified.
[2021-07-22 22:13:49,694] {taskinstance.py:1531} INFO - Marking task as FAILED. dag_id=user_behaviour, task_id=start_emr_movie_classification_script, execution_date=20210523T000000, start_date=20210722T221349, end_date=20210722T221349
[2021-07-22 22:13:49,772] {local_task_job.py:151} INFO - Task exited with return code 1

It seems botocore is deprecated https://stackoverflow.com/questions/65595398/mrjob-in-emr-is-running-only-1-mrstep-out-of-3-mrsteps-and-cluster-is-shutting-d

I'm going to see if this workaround helps. Please let me know if you're aware of a fix.
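
For what it's worth, botocore itself is not deprecated; the ValidationException above says the EMR cluster had already terminated by the time the add-steps task ran. A quick way to confirm the cluster's state, using the cluster id from the log above:

    # A terminated cluster cannot accept new steps
    aws emr describe-cluster \
        --cluster-id j-2NOT8EACUS8WV \
        --query 'Cluster.Status.State' \
        --output text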

psql: error: could not connect to server

I get this error toward the end of the setup shell script, seemingly at the step where we upload SQL tables and data to Redshift. It seems I'm unable to connect to the Redshift server/port combination specified in the variables section of setup_infra.sh.

psql -f ./redshiftsetup.sql postgres://sdeuser:sdeP0ssword0987@sde-batch-de-project.cinry7pfnj94.us-west-2.redshift.amazonaws.com:5439/dev

psql: error: could not connect to server: Operation timed out Is the server running on host "sde-batch-de-project.cinry7pfnj94.us-west-2.redshift.amazonaws.com" (44.236.207.102) and accepting TCP/IP connections on port 5439?

I'm looking at the Redshift section of the AWS console, and the URL there corresponds exactly to the endpoint URL sde-batch-de-project.cinry7pfnj94.us-west-2.redshift.amazonaws.com:5439/dev, so perhaps it is an issue with the port. I will reach out to AWS soon if I can't figure it out. Please let me know if you can think of anything! Thanks.
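
One way to separate a port problem from a network-rules problem is to test raw TCP reachability before involving psql; if this also times out, the Redshift security group (or the cluster's public accessibility setting) is blocking the connection rather than anything psql-specific:

    # Test whether port 5439 on the Redshift endpoint is reachable at all
    nc -zv -w 10 sde-batch-de-project.cinry7pfnj94.us-west-2.redshift.amazonaws.com 5439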

S3 bucket creation and permissions issue

From the comments section on the blog:

  1. The bucket creation command gives the error An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to. You can fix this by editing setup_infra.sh at line 26 and adding the region arguments. I fixed it with a command like:
     aws s3api create-bucket --bucket my-bucket-name --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2

  2. On Linux I ran into a lot of permission issues, so grant permissions with:
     chmod -R 777 /opt
     chmod -R 777 ./logs
     chmod -R 777 ./data
     chmod -R 777 ./temp

Connection timed out issue

I am currently experiencing a connection timeout when connecting to the host.
Below is a snippet:

 ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

Download data
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

Recreate logs and temp dir
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

Creating an AWS EMR Cluster named sde-batch-de-project
Creating AWS IAM role for redshift spectrum S3 access
Attaching AmazonS3ReadOnlyAccess Policy to our IAM role
Attaching AWSGlueConsoleFullAccess Policy to our IAM role
Creating an AWS Redshift Cluster named sde-batch-de-project
Waiting for Redshift cluster sde-batch-de-project to start, sleeping for 60s before next check
Waiting for Redshift cluster sde-batch-de-project to start, sleeping for 60s before next check
Waiting for Redshift cluster sde-batch-de-project to start, sleeping for 60s before next check
Running setup script on redshift
psql: error: connection to server at "sde-batch-de-project.cq1las6j8rph.us-east-1.redshift.amazonaws.com" (172.31.65.80), port 5439 failed: Connection timed out
        Is the server running on that host and accepting TCP/IP connections?

Spinning up remote Airflow docker containers
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

Sleeping 5 Minutes to let Airflow containers reach a healthy state
adding redshift connections to Airflow connection param
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

adding postgres connections to Airflow connection param
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

adding S3 bucket name to Airflow variables
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

adding EMR ID to Airflow variables
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out

set Airflow AWS region to us-east-1
ssh: connect to host ec2-35-173-186-196.compute-1.amazonaws.com port 22: Connection timed out
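
Since every ssh step times out on port 22, a likely culprit (an assumption, as with the Redshift timeouts above) is a security-group rule that no longer matches your public IP. A short sketch; the security-group id is a placeholder to look up in the EC2 console:

    # Confirm your current public IP, then compare it against the
    # instance security group's inbound rules for port 22
    curl -s https://checkip.amazonaws.com
    aws ec2 describe-security-groups --group-ids <sg-id> \
        --query 'SecurityGroups[0].IpPermissions'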

Stuck at "attaching to airflow_init_1"

Thank you for the help! I figured out some issues on my own, but I can't find updated instructions for setting up the environments. Here is a screenshot of my issue. I am also getting an error that the Airflow username and password are wrong, and I don't understand where and when to enter them. I would appreciate the help!
[Screenshot: Screen Shot 2021-06-22 at 8 07 09 PM]
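
On the login question: with the stock Airflow docker-compose setup, the web UI user is created by the airflow-init service (commonly airflow / airflow by default, though this repo's setup may differ). If that user was never created, one can be added by hand; a sketch, with the container name as a placeholder:

    # <airflow-container> is a placeholder for your Airflow webserver container
    docker exec -it <airflow-container> airflow users create \
        --username admin --password admin \
        --firstname Admin --lastname User \
        --role Admin --email admin@example.com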
