Git Product home page Git Product logo

analytics_at_scale's Introduction

Analytics @ Scale

Code for demonstrating AWS's ability to scale analytics to enterprise volumes (30TB dataset) and use a Lake House architecture to eliminate data silos and empower analysts.

The goal of this exercise is to facilitate demonstrating AWS’s scaling functionality with industry standard benchmarks and datasets. The features we want to demonstrate include:

  • Redshift Data Volume Scale - Redshift is a petabyte-scale data warehouse

  • Redshift Concurrency Scaling - Effortlessly scale-out query concurrency to meet virtual any number of reporting queries against a data warehouse

  • Redshift Spectrum – A Redshift feature that helps eliminate data silos by providing seamless access to Data Lakes

  • Athena – Perform serverless adhoc/exploratory analytics directly on Amazon S3 using Standard ANSI SQL

Setup

Setup VPC in us-east-1

  • Provision a Redshift cluster of sufficient scale, e.g. 12ra3.4xlarge, name it tpcds
  • Provision a ~small EC2 Instance for the query driver, e.g. T2.XL

Prepare EC2 Instance and Load Redshift

Upload/SCP load_tpcds.py and run_queries.py to the EC2 instance

Install postgres and development libraries so psycopg2 compiles

$ sudo yum install -y postgresql postgresql-devel postgresql-libs python3 python3-devel
$ sudo yum groupinstall -y "Development Tools"

Make a virtualenv for Python3 (so we don’t want to stress the Amazon Linux/yum managed Python environment)

  • $ python3 -m venv venv

Get into the virtualenv

  • $ source venv/bin/activate

Install additional python libraries

  • $ pip3 install sqlparse psycopg2

Set environment variables:

$ export REDSHIFT_ENDPOINT="tpcds.abcdefghi123.us-east-1.redshift.amazonaws.com"
$ export REDSHIFT_USER="admin"
$ export REDSHIFT_PASS="asdfasdfasdf"
$ export REDSHIFT_IAMROLE="arn:aws:iam::123456789012:role/mySpectrumRole"

Load data

  • Run load_tpcds.py. It is recommended to run this in screen, tmux, or nohup. If the SSH session disconnects the load will not complete; screen, tmux, or nohup will keep the job going.
  • $ nohup python3 ./load_tpcds.py all

Setup Glue Data Catalog

  • Crawl s3://jwyant-tpcds/processed/30tb
  • Modify in table store_sales ss_sold_date_sk from string/char to int in Data Catalog

Setup Redshift Spectrum

Create IAM Role for Spectrum with appropriate access

Create Spectrum Schema on your newly crawled Glue Data Catalog Schema

CREATE EXTERNAL SCHEMA spectrum_tpcds
FROM DATA CATALOG
DATABASE 'tpcds'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Run this to set the rownumbers in the schema so the optimizer has more information

  • $ python ./load_tpcds.py setnumrows tpcds

Update the settings in Redshift WLM

Set max_concurrency_scaling_clusters > 0 (image below)

Set Max Concurrency

Set concurrency_scaling mode to "auto" (image below)

Set Concurrency Scaling to Auto

Ready to go! Use demo_script.sql for the sample queries. Use run_queries.py for demonstrating concurrency scaling. Please ensure you're in your virtualenv and have the right dependencies.

$ python ./run_queries.py [1-???] “custom query group”

Happy Analytics!

analytics_at_scale's People

Contributors

jwyant avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.