Git Product home page Git Product logo

project-one's Introduction

Crypto Currency ETL Pipeline


Introduction

A cryptocurrency is an implementation of blockchain technlogy that allows transactions to be verified cryptographically by nodes on a decentralised network. It is the medium of exchange on a computer network that operates without a central authority. The goal of this project is design and build a simple pipeline that provides high quality and tested data for further analysis. This is especially useful for investors in the crypto markets.

This project is not a data analysis project however there are some basic questions that have piqued our interests. There are as follows:

  • What crypotocurrency has the highest change in trading price on a daily basis?
  • What cryptocurrency has the highest and lowest market capitalisation?

Datasources

The data sources considered during the design stage of pipeline development are provided in Table 1 below. Upon careful consideration of each option, CoinGecko was selected because of its ease of use and detailed documentation.

Data Source Useful Documentation
Alpaca https://alpaca.markets/learn/getting-started-with-alpaca-crypto-api
CoinAPI https://www.coinapi.io/
CoinGecko https://www.coingecko.com/api/documentations/v3

Folder Structure


├── project-one
	├── Dockerfile
	├── ERD.png 
	├── README.md
	├── src/
		├── crypto/
		├── database/
		├── utility/
		├── requirements.txt
		├── set_python_path.bat 
		├── set_python_path.sh 



Entity Population and ETL

The entites discussed in this section are created following a successful pipeline run.


ERD


Extract & Load

1. Coins table:

  • Description: This table contains list of available coins using CoinGecko with only three columns required ('id','name','symbol').
  • Type: Full , this means it will be fully extracted every time we run the code and will be loaded fully into the crypto database
  • How it's getting extracted : In order to get a list of available coins using CoinGecko, the following needs to be done:
cg.get_coins_list()

So to extract coins table required for this project here is the code has been use:

df = pd.DataFrame(cg.get_coins_list(), columns=['id','name','symbol'])

2. Trending table:

  • Description: This table contains list of top 7 trends for the day.
  • Type: Full , this means it will be fully extracted every time we run the code and will be loaded fully into the crypto database
  • How it's getting extracted : The following can be used to obtain the trending coins using CoinGecko:
cg.get_search_trending()

In this project, trending data has been extracted using below code, and then unnecessary columns are droped.

trend = cg.get_search_trending()

3. Coins_history table:

  • Description: This table contains historical daily data (price , marketcap) for this list of coins ('bitcoin','litecoin','ethereum','solana' ,'umee','terra-luna','evmos','dejitaru-tsuka','reserve-rights-token','insights-network')
  • Type: incremental, means it extracts data since the last day and upsert it into the correspondant table in crypto database
  • How it's getting extracted : In this project, coins_history data has been extracted using below code, and then unnecessary columns are droped.
data = cg.get_coin_history_by_id(id=coin,date=date_variable, localization='false')

4. Coin_price_history table:

  • Description: This table contains historical market data include price, market cap, and 24h volume. Data granularity is automatic and cannot be adjusted.
  • Type: incremental, it extracts 1 day from current time = 5 minute interval data
  • How it's getting extracted : In this project, coin_price_history data has been extracted using below code. It loops through thist list of coins('bitcoin','litecoin','ethereum','solana' ,'umee','terra-luna','evmos','dejitaru-tsuka','reserve-rights-token','insights-network') and concatinate the data of each coin to to bottoim of the dataframe.
price_date = cg.get_coin_market_chart_by_id(id=coin,vs_currency='usd',days=number_of_days)

Transformations and Manipulations

The scripts used for data manipulation/transformation can be found in crypto/models/transform/.

Transformations

  1. The price_date column of the coins_history table was renamed to price_extraction_date in the staging layer.
  2. The current_price column of the staging_coins_history table was rounded to two decimal places in the serving layer.
  3. The market_cap column of the staging_coins_history table was rounded to two decimal places in the serving layer.
  4. The Price column of the coin_price_history table was rounded to two decimal places in the staging layer.
  5. A new column called percent_increase was added to the staging_coin_price_history table in the serving layer.
  6. A serving table called serving_coins_greater_than_1_USD_latest_price was created. The purpose of this serving table is to show all latest prices of coins, ordered by their price, showing only coins with their latest price above 1 USD. It contained the following transforms:
  • Join between date from public.coin_price_history, aliased as latest_date, and Date from public.coin_price_history aliased as hist
  • Renaming/aliasing of column names
  • Sorting (in order of latest coin price descending)
  • Where clause (>1 USD)

Trigger the pipeline

Follow the steps below to successfully trigger the pipeline.


  1. The requirements.txt file in the src/ directory contains all the necessary python libraries to run the pipeline. Use the command below to install these dependencies.
pip install -r requirements.txt

  1. Set the PYTHONPATH for the project using the command below. Use set_python_path.bat for windows.
# navigate to the src directory
cd src 

# set the python path
. ./set_python_path.sh

  1. The pipeline extracts data from the CoinGecko API and persists the data in a PostgreSQL database thus, a PostgreSQL database called crypto has to be created. Modify the database connection variables found in src/crypto/config.template.sh. Use config.bat for windows.

  1. The pipeline can now be tested by navigating to the src/ directory and running the command below
# E2E test
pytest

  1. The pipeline can be manually triggered by using the following command
python crypto/pipeline/crypto_pipeline.py


project-one's People

Contributors

rashidcodes avatar aananth1 avatar helenamin avatar puneetverma avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.