This project holds the example code for the book Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.
There is also an official source repo for this book; you can find that source code on the authors' GitHub.
SBT and Java 8+ are required to build and run the examples in my repo.
To try the examples, you may need to fetch the example data by running `sh scripts/chxx.data.fetch.sh` from the root level of the project. Alternatively, you can download the datasets manually and unzip them into the corresponding `data/chxx` directory.
- Chapter 2: https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
- Chapter 3: http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html
- Chapter 4: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
- Chapter 5: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (do not use http://www.sigkdd.org/kdd-cup-1999-computer-network-intrusion-detection as the copy has a corrupted line)
- Chapter 6: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Chapter 7: ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/ (`*.gz`)
- Chapter 8: http://www.andresmh.com/nyctaxitrips/
- Chapter 9: (see `ch09-risk/data/download-all-symbols.sh` script)
- Chapter 10: ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
- Chapter 11: https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack
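
For the manual route, the download-and-unzip step could be sketched roughly as below. This is only an illustration and not part of the repo's fetch scripts: the helper names `chapter_data_dir` and `download_and_unpack` are hypothetical, and the code assumes that `xx` in `data/chxx` stands for the zero-padded chapter number (for example `data/ch04`), which this README does not state explicitly.

```python
import shutil
import urllib.request
from pathlib import Path


def chapter_data_dir(chapter: int, root: Path = Path(".")) -> Path:
    """Return the data directory for a chapter.

    Assumption: "xx" in data/chxx is the zero-padded chapter number.
    """
    return root / "data" / f"ch{chapter:02d}"


def download_and_unpack(url: str, chapter: int, root: Path = Path(".")) -> Path:
    """Download one file into the chapter's data directory and unpack it.

    Hypothetical helper: `url` must point at a single file, not at a
    directory listing.
    """
    target = chapter_data_dir(chapter, root)
    target.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last path segment of the URL.
    archive = target / url.rstrip("/").rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)

    try:
        # Handles formats known to shutil, such as .zip and .tar.gz.
        shutil.unpack_archive(str(archive), str(target))
    except shutil.ReadError:
        # Unrecognized format (for example a plain .gz or .bam file):
        # leave the downloaded file in place as it is.
        pass
    return target
```

Several of the links above point to directory listings rather than single files, so you would first pick the concrete file URL from the listing and then pass it in, along with the chapter number, from the root level of the project.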