
cli4ds

Introduction

Say you have a .csv file with a few columns. You have to run basic descriptive statistics on these columns, and maybe a few group-by operations (okay, pivot tables). If the file is a few thousand rows (under 100MB in size), you will probably double-click on it straight away and run the analysis in Excel.

Give yourself a pat on the back, you chose the right tool for the right job.

Who's a good analyst? Yes, you are!

But what if the file was 750MB? Assuming you have enough RAM (8GB or more), of course you'll use dplyr in R, pandas in Python, or (gasp) write a data step in SAS.

Right? Excellent.

But what if the file was 6GB?

15GB?

If the word Hadoop is stuck in your throat, I implore you to swallow it. This repository focuses on open-source command-line utilities that can do the same job.

Yes, there are Python libraries that let you work with larger-than-RAM data on a single machine (Dask, PySpark, and perhaps a few more), but we'll keep those for later.

Why

  • Because I've met too many 'data scientists' who reach for big-data tools before trying anything simpler
  • Because there is an entire ecosystem of wonderful open-source software for data analysis
  • Because renting servers with more RAM or more cores is now easier and cheaper than ever.
  • Because too many businesses do not have massive data and are spending money and resources trying to solve their problems with the wrong (and expensive) tools
    • The closest analogy I can think of is someone trying to break a pebble with a sledgehammer. Of course, the pebble will break, but wouldn't you rather first try using the hammer hanging in your toolshed?
  • But mostly, because I like to teach! 😇

Some Quotes

"In forecasting applications, we never observe the whole population. The problem is to forecast from a finite sample. Hence statistics such as means and standard deviations must be estimated with error."

"At Facebook, 90% of the jobs have input sizes under 100GB."

"For workloads that process multi-GB rather than multi-TB, a big memory server will provide better performance-per-dollar than a cluster."

Tools

  • GNU Coreutils and friends, everyday tools like grep, sed, cut, shuf and cat for working with text files
  • GNU awk, a programming language designed for text processing and typically used as a data extraction and reporting tool
  • GNU Datamash, a command-line program which performs basic numeric, textual and statistical operations on input textual data files.
  • xsv, a fast CSV toolkit written in Rust
  • csvkit, a suite of command-line tools for converting to and working with CSV. Written in Python
  • Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. Written in C.
  • csvtk, a cross-platform, efficient, practical and pretty CSV/TSV toolkit. Written in Golang.
  • textql, which executes SQL against structured text like CSV or TSV. Written in Golang.
  • q, which allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files)
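
To give a flavour of these tools, here is a minimal sketch. The file sales.csv and its amount column are hypothetical stand-ins; each command assumes the corresponding tool is installed.

# Peek at the columns and per-column summary statistics
xsv headers sales.csv
xsv stats sales.csv | xsv table
csvstat sales.csv

# Mean of a numeric column, three different ways
datamash -t, --header-in mean 3 < sales.csv           # GNU Datamash, 3rd column
mlr --csv stats1 -a mean,sum -f amount sales.csv      # Miller
q -H -d, "SELECT avg(amount) FROM sales.csv"          # q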

Docker Image

  • I've created a Docker image with all of these tools, and tutorials on how to use them.
  • It also contains
    • Miniconda3
    • A conda environment ds-py3 configured with the PyData stack (pandas, scikit-learn ...)
  • Build (or pull) the Docker image
# clone this repo, and 
cd docker/
docker build -t cli-4-ds .
# or
docker pull dushyantkhosla/cli-4-ds:latest
  • Run a container with the image
docker run -it --privileged \
           -v $(pwd):/home \
           -p 8888:8888 \
           -p 5000:5000 \
           -p 3128:3128 \
           dushyantkhosla/cli-4-ds:latest
  • Learn how to use these tools with the notebooks in tutorials/

    • There is a dedicated notebook for each of the tools above
  • Run the start.sh script to see helpful messages

bash /root/start.sh

Get the Data

  • To generate data for these tutorials, cd into the data/ directory and
    • Run the get-csvs.sh script to download the flightDelays and KDDCup datasets
      • PS: This will download ~1.5GB of data
    • Run the make-data.py script to create a synthetic dataset with 10 million rows
cd data/
bash get-csvs.sh
python make-data.py

Part 2: SQL Analytics with Metabase

  • You might want to try out Metabase, which has a nice front-end for writing SQL
docker pull metabase/metabase:v0.19.0
  • I recommend this version over the latest because it works with SQLite

  • If you want to run other databases like PostgreSQL, you can get the latest image instead

  • Then, run a container

docker run -d -v $(pwd):/tmp -p 3000:3000 metabase/metabase:v0.19.0
  • The -d switch is for running the container in detached mode
  • Navigate to localhost:3000, connect to a .db file or run another DB and connect to it

Appendix

  • There are no rules of thumb, but we can try:
Data Size      Remedy
up to 1GB      Pandas
up to 10GB     Get more RAM. Try Spark/Dask.
up to 100GB    Postgres
500GB+         Hadoop

tl;dr

As long as your data fits on disk (i.e., a few hundred GB or less):

  • For filter or transform jobs (like WHERE and CASE WHEN), use either of the following (sketched after this list)

    • CLI tools or Python scripts (line-by-line or stream processing)
    • break files into chunks, use pandas (chunk processing)
  • For reductions or groupby jobs (like AVERAGE and PIVOT),

    • think deeply about your data, draw representative samples
      • you're a better statistician than a programmer after all, aren't you?
    • use bootstrap measures to quantify uncertainty
  • For machine-learning jobs,

    • Cluster your data, then pull samples from each group. (stratify)
    • Fit your first model on a 10% sample.
      • Build a Learning Curve.
      • Use cross-validated measures to quantify uncertainty
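
A rough sketch of the first two ideas above, assuming a hypothetical big.csv whose third column is numeric and whose fifth column holds a category:

# Filter (WHERE) and derive a flag (CASE WHEN) as a stream, one line at a time
awk -F, 'NR == 1 || $3 > 100' big.csv > filtered.csv
awk -F, 'BEGIN { OFS = "," }
         NR == 1 { print $0, "is_large"; next }
         { print $0, ($3 > 100 ? "yes" : "no") }' big.csv > flagged.csv

# Group-by on a representative sample instead of the full file
tail -n +2 big.csv | shuf -n 1000000 > sample.csv
datamash -t, -s -g 5 mean 3 count 3 < sample.csv

The same sample-then-aggregate pattern pairs naturally with the bootstrap idea above: draw several samples and look at the spread of your estimates.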

Further Reading

  • In a later post, I'd like to talk about extracting more juice out of your hardware with parallel processing
    • For now, here's something to munch on
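
As a small taste, here is a minimal sketch using only standard tools (big.csv and the column being summed are hypothetical): split the work into chunks, process them concurrently, then combine the partial results.

# Break the file into 1M-line pieces (the header line sums as 0, fine for a sketch)
split -l 1000000 big.csv chunk_
# Sum column 3 of each chunk, running up to 4 awk processes at a time
ls chunk_* | xargs -P 4 -n 1 awk -F, '{ s += $3 } END { print s }' > partial_sums.txt
# Combine the partial sums into a grand total
awk '{ total += $1 } END { print total }' partial_sums.txt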


cli4ds's Issues

Suggestion: Consider eBay/tsv-utils

Hi Khosla,

Just came across your repo. It's a nice article, one that matches my own experiences.

In case you're not familiar with them, I'd like to suggest that you take a look at eBay's TSV Utilities. They were designed to address the data mining scenarios described in your post. The disclaimer of course is that I'm the author.

Even if the tools themselves don't fit your goals, you may find the performance benchmarks of interest, as they include several of the tools on your list.

--Jon
