This is a data pipeline project built with Apache Airflow to process input data and generate useful insights. The pipeline targets Python 3.9 (see the versions below).
Make sure you have Python 3.9+ installed on your system. You can download and install Python from the official Python website.
- Python (Version: 3.9.9)
- Apache Airflow (Version: 2.7.2)
- Embulk (Version: 0.10.27)
- Docker (Version: 20.10.11)
- Docker Compose (Version: 1.29.2)
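As a convenience (not part of the project itself), you can sanity-check the interpreter against the Python version pinned above before installing anything else:

```python
import sys

# The versions above pin Python 3.9.x; warn early if the interpreter is older.
ok = sys.version_info >= (3, 9)
print("Python", sys.version.split()[0], "-", "OK" if ok else "3.9+ required")
```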
To run the data pipeline locally using Docker and Apache Airflow, follow these steps:
- Clone the repository:

  ```shell
  git clone https://github.com/vlruiz108/LH_ED_VANESSA_RUIZ
  ```
- Navigate to the project directory:

  ```shell
  cd data-pipeline
  ```

  (To run the pipeline script directly, without Airflow: `python data_pipeline.py`.)
- Create and activate a virtual environment (optional but recommended):

  ```shell
  python -m venv venv
  source venv/bin/activate   # on Windows: venv\Scripts\activate.bat
  ```
- Install project dependencies:

  ```shell
  pip install -r requirements.txt
  ```
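The repository's `requirements.txt` is authoritative; purely as an illustration, a pinned file matching the versions listed above would contain something like the following (the full package set is an assumption — check the repository):

```
apache-airflow==2.7.2
```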
- Configure the necessary credentials and parameters in the `config.py` file. You can create a copy of the example file `config_example.py` and rename it to `config.py`.

- Ensure that Apache Airflow is configured correctly. You can refer to the official Apache Airflow documentation for detailed instructions on configuration.
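The repository's `config_example.py` is not reproduced in this README; purely as an illustration, a config module of this kind often looks like the following (every name and default below is a hypothetical placeholder, not taken from the repository):

```python
# config_example.py -- copy to config.py and fill in real values.
# All names and defaults here are illustrative placeholders.
import os

# Source database credentials (hypothetical names)
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = int(os.getenv("DB_PORT", "5432"))
DB_USER = os.getenv("DB_USER", "airflow")
DB_PASSWORD = os.getenv("DB_PASSWORD", "")

# Output location for pipeline results (hypothetical)
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "/tmp/data_pipeline")
```

Reading values through `os.getenv` keeps secrets out of the file itself when environment variables are set.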
- Build and start the Docker containers:

  ```shell
  docker-compose up --build
  ```

  To start the containers in detached (background) mode:

  ```shell
  docker-compose up -d
  ```
- Check the status of the containers:

  ```shell
  docker-compose ps
  ```
- Start Apache Airflow:

  ```shell
  airflow webserver --port 8080
  ```

  and, in another terminal:

  ```shell
  airflow scheduler
  ```
- Access the Airflow dashboard at http://localhost:8080 in your web browser.
- Activate the `data_pipeline` DAG (Directed Acyclic Graph) in the Airflow dashboard.

- The pipeline is now configured to run according to the schedule defined in the DAG.