rjurney / agile_data_code_2 Goto Github PK

View Code? Open in Web Editor NEW

456.0 45.0 306.0 23.72 MB

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Home Page: http://bit.ly/agile_data_science

License: MIT License

Shell 0.53% Python 4.53% JavaScript 1.05% HTML 1.18% CSS 0.87% Jupyter Notebook 91.65% Dockerfile 0.19%

data-syndrome data data-science analytics apache-spark apache-kafka kafka spark predictive-analytics machine-learning

agile_data_code_2's Introduction

Agile Data Science 2.0 (O'Reilly, 2017)

This repository contains the updated sourcec code for Agile Data Science 2.0, O'Reilly 2017. Now available at the O'Reilly Store, on Amazon (in Paperback and Kindle) and on O'Reilly Safari. Also available anywhere technical books are sold!

NOTE: THE BOOK'S CODE IS OLD, BUT THE CODE IS MAINTAINED. USE DOCKER COMPOSE AND THE NOTEBOOKS IM THIS REPOSITORY.

You should refer to the Jupyter Notebooks in this repository rather than the book's source code, which is badly outdated and will no longer work for you.

Have problems? Please file an issue!

Deep Discovery

Like my work? Connect with me on LinkedIn!

Installation and Execution

There is now only ONE version of the install: Docker via the docker-compose.yml. It is MUCH EASIER than the old methods.

To build the agile Docker image, run this:

docker-compose build agile

To run the agile Docker image, defined by the docker-compose.yml and Dockerfile, run:

docker-compose up -d

Now visit: http://localhost:8888

Other Images

To manage the mongo image with Mongo Express, visit: http://localhost:8081

Downloading Data

Once the server comes up, download the data and you are ready to go. First open a shell in Jupyter Lab. The working directory corresponds to this folder.

Now download the data:

./download.sh

Running Examples

All scripts run from the base directory, except the web app which runs in ex. ch08/web/. Open Welcome.ipynb and get started.

Jupyter Notebooks

All notebooks assume you have run the jupyter notebook command from the project root directory Agile_Data_Code_2. If you are using a virtual machine image (Vagrant/Virtualbox or EC2), jupyter notebook is already running. See directions on port mapping to proceed.

The Data Value Pyramid

Originally by Pete Warden, the data value pyramid is how the book is organized and structured. We climb it as we go forward each chapter.

System Architecture

The following diagrams are pulled from the book, and express the basic concepts in the system architecture. The front and back end architectures work together to make a complete predictive system.

Front End Architecture

This diagram shows how the front end architecture works in our flight delay prediction application. The user fills out a form with some basic information in a form on a web page, which is submitted to the server. The server fills out some neccesary fields derived from those in the form like "day of year" and emits a Kafka message containing a prediction request. Spark Streaming is listening on a Kafka queue for these requests, and makes the prediction, storing the result in MongoDB. Meanwhile, the client has received a UUID in the form's response, and has been polling another endpoint every second. Once the data is available in Mongo, the client's next request picks it up. Finally, the client displays the result of the prediction to the user!

This setup is extremely fun to setup, operate and watch. Check out chapters 7 and 8 for more information!

Back End Architecture

The back end architecture diagram shows how we train a classifier model using historical data (all flights from 2015) on disk (HDFS or Amazon S3, etc.) to predict flight delays in batch in Spark. We save the model to disk when it is ready. Next, we launch Zookeeper and a Kafka queue. We use Spark Streaming to load the classifier model, and then listen for prediction requests in a Kafka queue. When a prediction request arrives, Spark Streaming makes the prediction, storing the result in MongoDB where the web application can pick it up.

This architecture is extremely powerful, and it is a huge benefit that we get to use the same code in batch and in realtime with PySpark Streaming.

Screenshots

Below are some examples of parts of the application we build in this book and in this repo. Check out the book for more!

Airline Entity Page

Each airline gets its own entity page, complete with a summary of its fleet and a description pulled from Wikipedia.

Airplane Fleet Page

We demonstrate summarizing an entity with an airplane fleet page which describes the entire fleet.

Flight Delay Prediction UI

We create an entire realtime predictive system with a web front-end to submit prediction requests.

agile_data_code_2's People

Contributors

Stargazers

Watchers

Forkers

trinq radovankavicky gapdata mickdelaney machinelearning-tutorials vijeth8 rkazak seajosh naoyak aabercrombie0492 themichaelbizness mplearning ntblok misterfxguy gth158a umit fajohn carlespoles 80kr dataplayr dts3 aakinlalu plear msellamitn miguelperalvo codepiper tahirh directional-star srishail doublelg matrix-revolution vnnw mileserickson sharp-pixel azmikamis rajvg thirapat haibinli11 fizzbb henrypan lin-meng josecyc sherlockjjj chenghaibo teonbel ansrivas jazzman37 xiangchenm kpmac5 an100 taodsqi ravirajadrangi slitayem duke-translational-bioinformatics decastro-alex elr1co rs6 chauamtran luojianp pandagongfu aymar73 blu3h0rn ptzagk sksundaram-learning varunpillai konu90 rahermur atallc selvakumar-sss peopzen dsnoor kamalsky hisakato ytluck gianimpronta sachinsaligram muhammadzak dvillaj adrian-wang smadbhavi moacybarros robert-hardy bayalla kevinpsx bluekidds dineshkumares mysky528 bochuxt ddamasceno peinwu splashgi feihelan youngkwonjo hunterluok gt6796c mrmaher dasmonalisa tjsoftworks farhangithub27 caser789

agile_data_code_2's Issues

p20 copy

last paragraph:

"The data’s opinion must always be included in product discussions,
which means that they must be grounded in visualization through exploratory data
analysis, in the internal application that becomes the focus of our efforts."

remove comma after analysis or rewrite, the sentence is clumsy
"opinion" is a strong word to use here, maybe "conclusion" or other word that doesn't imply the data is a living agent

It's nice to have a script to clean up EC2 resource

It's nice to have a script to clean up EC2 resources (i.e. instance, volume, security group, key pair) after user is done with the instance.

Currently in ec2.sh "Ebs":{"DeleteOnTermination":false} so even when user terminated instance the EBS volume is still available and will cost user $$$. I actually think Ebs should be automatically deleted when instance is terminated. What's the point of keeping the volume when all the data are totally reproducible? But if it is decided to be like this the clean up script is quite important to avoid unnecessary charge.

Airflow not working

Whenever I try to run the command $ airflow I get the following:

This happens with any airflow command

Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/airflow", line 17, in <module>
    from airflow import configuration
  File "/home/ubuntu/anaconda/lib/python3.5/site-packages/airflow/__init__.py", line 29, in <module>
    from airflow import configuration as conf
  File "/home/ubuntu/anaconda/lib/python3.5/site-packages/airflow/configuration.py", line 769, in <module>
    with open(TEST_CONFIG_FILE, 'w') as f:
PermissionError: [Errno 13] Permission denied: '/home/ubuntu/airflow/unittests.cfg'

can someone please tell me what mistake I might be making.

I am running the ec2 instance...

manual_install.sh: Check if java is already installed, halt with message if not

Miniconda doesn't include jupyter

bootstrap.sh:27

ch02/mongodb.py missing

ch02/mongodb.py is referenced in Ch02 "Publishing Data with MongoDB" p39 but is missing in the repo

Create Chapter 9 Code

Chapter 9 is about improving predictions. Therefore we need to improve our predictions in several ways: hyperparameter tuning, feature engineering, etc.

EC2 instance requires instructions how to make Flask web pages available externally

For EC2 instance, I cannot view Flask web pages even when I add port 5000 to the Security Group.
There should be instruction in EC2 set up on how to view Flask pages, maybe using ssh tunnels?

ch2: Vagrant setup

The reader might need to change this before running vagrant up depending on their system. I had problems provisioning vagrant until I changed the memory limit to 4GB / 4096;

  config.vm.provider "virtualbox" do |vb|
      # Customize the amount of memory on the VM:
      vb.memory = "9216"
  end

pyspark / python 3.6.0

Pyspark doesn't work with Python 3.6.0

https://issues.apache.org/jira/browse/SPARK-19019

aws/ec2_bootstrap.sh should log progress in a log file

The bootstrap script aws/ec2_bootstrap.sh takes around 15 minutes to complete. I think the script should log the progress in a log file with a confirmation when the setup is complete. User can then check the log and wait until it is complete so that they won't start playing when the set up is in progress.

ch06/enrich_airlines_wikipedia.py

In ch06/enrich_airlines_wikipedia.py throws error because it cannot find the file 'data/our_airlines.jsonl'. I checked the data directory I only found the file 'data/our_airlines.json'

our_airlines = utils.read_json_lines_file('data/our_airlines.jsonl')
Traceback (most recent call last):
File "", line 1, in
File "/home/ubuntu/Agile_Data_Code_2/lib/utils.py", line 29, in read_json_lines_file
f = codecs.open(path, "r", "utf-8")
File "/home/ubuntu/anaconda/lib/python3.5/codecs.py", line 895, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'data/our_airlines.jsonl'

ec2.sh no hostname created

No hostname created after running ec2.sh

manual_install.sh: Install airflow in $PROJECT_HOME like the other software

Debugging Prediction Problems.ipynb is not working

Hi Russell,

Recently I take a serious look on Chapter 9 and find the Debugging Prediction Problems.ipynb is not working... I can't find a number of files that s required.

departure_bucketizer.bin
string_indexer_pipeline_model
final_vector_assembler.bin

Can you help share the required Jupiter Notebook that create these bin files ?

Cheers

Unable to launch instance (SOLVED)

Hey Russel,

I am trying to launch the EC2 instance. So far the script was able to create the security group but not launch the instance. I am getting the following:


Welcome to Agile Data Science 2.0 :)

I will launch an r3.xlarge instance in the default VPC for you, using a key and security group we will create.

The utility 'jq' is required for this script to detect the hostname of your ec2 instance ...
Detecting 'jq' ...
'jq' was detected ...
Creating security group 'agile_data_science' ...
{
    "GroupId": "sg-14f41e6e"
}

Detecting external IP address ...
Authorizing port 22 to your external IP (187.227.148.67) in security group 'agile_data_science' ...

Generating keypair called 'agile_data_science' ...

An error occurred (InvalidKeyPair.Duplicate) when calling the CreateKeyPair operation: The keypair 'agile_data_science' already exists.
Changing permissions of 'agile_data_science.pem' to 0600 ...

Detecting the default region...
The default region is 'us-west-2'
Determining the image ID to use according to region...
The image for region 'us-west-2' is 'ami-a41eaec4' ...

Initializing EBS optimized r3.xlarge EC2 instance in region 'us-west-2' with security group 'agile_data_science', key name 'agile_data_science' and image id 'ami-a41eaec4' using the script 'aws/ec2_bootstrap.sh'
Got reservation ID 'r-0b32b659f0bc4c1aa' ...

Sleeping 10 seconds before inquiring to get the public hostname of the instance we just created ...
...
Awake!

Using the reservation ID to get the public hostname ...
The public hostname of the instance we just created is 'ec2-34-212-189-56.us-west-2.compute.amazonaws.com' ...
Writing hostname to '.ec2_hostname' ...

Now we will tag this ec2 instance and name it 'agile_data_science_ec2' ...

After a few minutes (for it to initialize), you may ssh to this machine via the command in red: 
ssh -i ./agile_data_science.pem [email protected]
Note: only your IP of '187.227.148.67' is authorized to connect to this machine.

NOTE: IT WILL TAKE SEVERAL MINUTES FOR THIS MACHINE TO INITIALIZE. PLEASE WAIT FIVE MINUTES BEFORE LOGGING IN.

Note: if you ssh to this machine after a few minutes and there is no software in $HOME, please wait a few minutes for the install to finish.

Once you ssh in, the exercise code is in the Agile_Data_Code_2 directory! Run all files from this directory, with the exception of the web applications, which you will run from ex. ch08/web

Note: after a few minutes, now you will need to run ./ec2_create_tunnel.sh to forward ports 5000 and 8888 on the ec2 instance to your local ports 5000 and 8888. This way you can run the example web applications on the ec2 instance and browse them at http://localhost:5000 and you can view Jupyter notebooks at http://localhost:8888
If you tire of the ssh tunnel port forwarding, you may end these connections by executing ./ec2_kill_tunnel.sh

---------------------------------------------------------------------------------------------------------------------

Thanks for trying Agile Data Science 2.0!

If you have ANY problems, please file an issue on Github at https://github.com/rjurney/Agile_Data_Code_2/issues and I will resolve them.

If you need help creating your own applications, or with on-site or video training...
Check out Data Syndrome at http://datasyndrome.com

Enjoy! Russell Jurney <@rjurney> <[email protected]> <http://linkedin.com/in/russelljurney>

I believe it must be something to do with 'aws/ec2_bootstrap.sh' but I can't pinpoint exactly

Thanks in advance

Jose

[Solved] MongoDB/PySpark not working

Edit - Solved the issue but will keep up for future use (if that's cool with you). Solution in first comment.

I receive the following (see below) error when running ch02/pyspark_mongodb.py.

The prior examples in chapter 2 ran smoothly. For this example, I ran the following on the ec2 instance in the Agile_Data_Code_2 directory:

>>>pyspark
>>>exec(open("ch02/pyspark_mongodb.py").read())

Which yields this error:

[Stage 2:>                                                          (0 + 2) / 2]17/05/31 08:15:59 ERROR MongoOutputCommitter: Could not write to MongoDB
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
17/05/31 08:15:59 ERROR MongoOutputCommitter: Could not write to MongoDB
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
17/05/31 08:15:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
17/05/31 08:15:59 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 3)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
17/05/31 08:15:59 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3, localhost, executor driver): com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

17/05/31 08:15:59 ERROR TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 14, in <module>
  File "/home/ubuntu/Agile_Data_Code_2/lib/pymongo_spark.py", line 40, in saveToMongoDB
    conf=conf)
  File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1421, in saveAsNewAPIHadoopFile
    keyConverter, valueConverter, jconf)
  File "/home/ubuntu/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/ubuntu/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/ubuntu/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3, localhost, executor driver): com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1158)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
	at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:834)
	at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
	at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
	at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
	at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
	at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
	at com.mongodb.Mongo.execute(Mongo.java:781)
	at com.mongodb.Mongo$2.execute(Mongo.java:764)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2195)
	at com.mongodb.DBCollection.executeBulkWriteOperation(DBCollection.java:2188)
	at com.mongodb.BulkWriteOperation.execute(BulkWriteOperation.java:121)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:167)
	at com.mongodb.hadoop.output.MongoOutputCommitter.commitTask(MongoOutputCommitter.java:70)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1132)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more

Inspecting my mongodb.log file, I noted the following error:

ERROR: Insufficient free space for journal files
Please make at least 3379MB available in /var/lib/mongodb/journal or use --smallfiles

To attempt to solve this error, I expanded the storage on the server via aws as recommended by this post

Unfortunately, this still yields the same error. Any help would be greatly appreciated!

elasticsearch not working

elasticsearch doesn't seem to be working for me. I'm running code from ch02.

On ec2 I am running bash ch02/elasticsearch.sh and obtain:

curl: (7) Failed to connect to localhost port 9200: Connection refused

Running

PYSPARK_DRIVER_PYTHON=ipython pyspark --jars lib/elasticsearch-spark-20_2.10-5.2.1.jar \
      --driver-class-path lib/elasticsearch-spark-20_2.10-5.2.1.jar

>>>exec(open("ch02/pyspark_elasticsearch.py").read())

yields this error:

17/05/31 10:25:10 WARN EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
17/05/31 10:25:10 WARN EsOutputFormat: Cannot determine task id - redirecting writes in a random fashion
17/05/31 10:25:10 WARN EsOutputFormat: Cannot determine task id - redirecting writes in a random fashion
17/05/31 10:25:10 ERROR NetworkClient: Node [127.0.0.1:9200] failed (Connection refused (Connection refused)); no other nodes left - aborting...
17/05/31 10:25:10 ERROR NetworkClient: Node [127.0.0.1:9200] failed (Connection refused (Connection refused)); no other nodes left - aborting...
17/05/31 10:25:10 ERROR Utils: Aborting task
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more
17/05/31 10:25:10 ERROR Utils: Aborting task
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more
17/05/31 10:25:10 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more
17/05/31 10:25:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more
17/05/31 10:25:10 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more

17/05/31 10:25:10 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 10, in <module>
  File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1421, in saveAsNewAPIHadoopFile
    keyConverter, valueConverter, jconf)
  File "/home/ubuntu/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/ubuntu/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/ubuntu/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1158)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
	at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:834)
	at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
	at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] 
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:150)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
	at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
	... 15 more

I've never used elasticsearch and am unsure how to proceed. Any help is appreciated!

p21 copy

"developing a normal application, what will we ship? If we don’t ship, we aren’t agile. To solve this
problem, in Agile Data Science, we “get meta” and make the focus documenting the
analytics process as opposed to the end state or product we are seeking."

use the active voice

"Agile Data Science hearts "meta" and focuses on documenting the analytics process."

Bootstrap Install Fails on Jupyter Notebook Install

When I try and install the dependencies using the Bootstrap.sh file, I keep getting hung up on the Jupyter Notebook install with the following error:
[C 14:26:04.850 NotebookApp] Bad config encountered during initialization:
[C 14:26:04.850 NotebookApp] Unrecognized flag: '--allow-root'

jupyter/notebook#1539

A quick google search brings me to a similar error with Docker files, though I don't know where to go next to troubleshoot/resolve this.

Do you have any advice?

Thanks,

Jay

On_Time_On_Time_Performance_2015.csv.bz2 not accessable

I tried downloading the data, but the file at http://s3.amazonaws.com/agile_data_science/On_Time_On_Time_Performance_2015.csv.bz2 is not available. I'm getting this xml response:

<Error>
  <Code>AllAccessDisabled</Code>
  <Message>All access to this object has been disabled</Message>
  <RequestId>F5FC6F5C1A108967</RequestId>
  >HostId>U39Xt33cQTGKJIid06DOYvqQyStrIYS4h0THvrJWU6X1Snevvd+SdCM1dGkwDpJBehWbAzwu5Fg=</HostId>
</Error>

Has the url changed, is the data available somewhere else?

Connection Refused

Hi
I get the following after running .ec2:

$ ./ec2.sh
Logging operations to '/tmp/ec2.sh.log' ...
tee: unknown option -- /
Try 'tee --help' for more information.
Welcome to Agile Data Science 2.0 :)

I will launch an r3.xlarge instance in the default VPC for you, using a key and security group we will create.

The utility 'jq' is required for this script to detect the hostname of your ec2 instance ...
Detecting 'jq' ...
'jq' was detected ...
Testing for security group 'agile_data_science' ...
Security group 'agile_data_science' not present ...
Creating security group 'agile_data_science' ...
{
"GroupId": "sg-539ac23a"
}

Detecting external IP address ...
./ec2.sh: line 49: dig: command not found
Authorizing port 22 to your external IP () in security group 'agile_data_science' ...

An error occurred (InvalidParameterValue) when calling the AuthorizeSecurityGroupIngress operation: CIDR block C:/Program Files/Git/32 is malformed

Testing for existence of keypair 'agile_data_science' and key 'agile_data_science.pem' ...
Key pair 'agile_data_science' not found ...
Generating keypair called 'agile_data_science' ...
Changing permissions of 'agile_data_science.pem' to 0600 ...

Detecting the default region...
The default region is 'eu-west-2'
Determining the image ID to use according to region...
The image for region 'eu-west-2' is '' ...

Initializing EBS optimized r3.xlarge EC2 instance in region 'eu-west-2' with security group 'agile_data_science', key name 'agile_data_science' and image id '' using the script 'aws/ec2_bootstrap.sh'
usage: aws [options] [ ...] [parameters]
To see help text, you can run:

aws help
aws help
aws help
aws: error: argument --image-id: expected one argument
Got reservation ID '' ...

Sleeping 10 seconds before inquiring to get the public hostname of the instance we just created ...
...
Awake!

Using the reservation ID to get the public hostname ...
The public hostname of the instance we just created is '' ...
Writing hostname to '.ec2_hostname' ...

Now we will tag this ec2 instance and name it 'agile_data_science_ec2' ...

An error occurred (MissingParameter) when calling the CreateTags operation: The request must contain the parameter resourceIdSet

After a few minutes (for it to initialize), you may ssh to this machine via the command in red:
ssh -i ./agile_data_science.pem ubuntu@
Note: only your IP of '' is authorized to connect to this machine.

NOTE: IT WILL TAKE SEVERAL MINUTES FOR THIS MACHINE TO INITIALIZE. PLEASE WAIT FIVE MINUTES BEFORE LOGGING IN.

Note: if you ssh to this machine after a few minutes and there is no software in $HOME, please wait a few minutes for the install to finish.

Once you ssh in, the exercise code is in the Agile_Data_Code_2 directory! Run all files from this directory, with the exception of the web applications, which you will run from ex. ch08/web

Note: after a few minutes, now you will need to run ./ec2_create_tunnel.sh to forward ports 5000 and 8888 on the ec2 instance to your local ports 5000 and 8888. This way you can run the example web applications on the ec2 instance and browse them at http://localhost:5000 and you can view Jupyter notebooks at http://localhost:8888
If you tire of the ssh tunnel port forwarding, you may end these connections by executing ./ec2_kill_tunnel.sh

kindly assist

The vagrant install script for Virtual PC stopped after Jupyter start

I run the 'vagrant up' command the vagrant script stop after Jupyter started as below. Is it an expected behaviour ?

==> default: writing new private key to '/home/ubuntu/certs/mycert.pem'
==> default: /home/ubuntu/certs/mycert.pem: No such file or directory
==> default: 140332068832928:error:02001002:system library:fopen:No such file or directory:bss_file.c:398:fopen('/home/ubuntu/certs/mycert.pem','w')
==> default: 140332068832928:error:20074002:BIO routines:FILE_CTRL:system lib:bss_file.c:400:
==> default: [I 15:24:19.003 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
==> default: [I 15:24:19.097 NotebookApp] Serving notebooks from local directory: /home/vagrant
==> default: [I 15:24:19.097 NotebookApp] 0 active kernels
==> default: [I 15:24:19.097 NotebookApp] The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=4015c82d2c7f00926f6066101ad77cfc8e14667b9959ca93
==> default: [I 15:24:19.098 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
==> default: [W 15:24:19.098 NotebookApp] No web browser found: could not locate runnable browser.
==> default: [C 15:24:19.098 NotebookApp]
==> default:
==> default: Copy/paste this URL into your browser when you connect for the first time,
==> default: to login with a token:
==> default: http://0.0.0.0:8888/?token=4015c82d2c7f00926f6066101ad77cfc8e14667b9959ca93

ch08/extract_features.py is not working

Seems missed the section to create SparkContext via findspark

p21 copy

"The first level of the data-value pyramid is about plumbing; making a dataset flow
from where it is gathered to where it appears in an application. The next level, charts
and tables, is where refinement and analysis begins. The next level, reports, enables
immersive exploration of data, and we can really reason about it and get to know it.
The final level, predictions, is where the most value is created, but creating good pre‐
dictions means feature engineering, which the lower levels encompass and facilitate."

bullet points!

Chapter 8 test_classification_api.sh is wrong

It should write as follow

curl -XPOST 'http://localhost:5000/flights/delays/predict/classify'
-F 'DepDelay=5.0'
-F 'Carrier=AA'
-F 'FlightDate=2016-12-23'
-F 'Dest=ATL'
-F 'FlightNum=1519'
-F 'Origin=SFO'
| json_pp

FlightDate instead of Date

manual_install.sh: Make copy of ~/.bash_profile and create uninstall script to move it back in place

Issues with Vagrant Instance

Hi:

i get this following error during vagrant up. anyone seen this before?

Fix direct calls to DataFrame.map with DataFrame.rdd.map

ch05/assess_airplanes.py And ch05/save_tail_numbers.py

Fix all {{ ds }} date stamps to {{ ts }} timestamps in Airflow examples

Add screenshots to README.md

Show ppl what the app looks like!

Broken code

I just purchased this book and have been following the examples closely. I have come across several items in which the code was incorrect and needed to be debugged and changed. My most recent example is the web code in chapter 4.

The flights template binds the display_nav macro with four parameters that are required by display_nav. The actual macro signature only includes three parameters - query being the odd man out.

Another issue I came across was the mongo queries - you code in the on_time_performance files has the mongo query keys surrounded by single quotes. The queries did not work for me unless I changes the single to double quotes. You also cast the flight number to an integer - this did not work unless I removed the int function in the query.

Am I mistaken? Are these actual issues? I find it hard to believe that anyone is able to make this code work following the book.

I AM using the vagrant image and the code that came with the image itself which does seem to match the GitHub repository for this book.

Missing of faa_tail_number_inquiry.jsonl file

Is it possible to provide the script to generate the faa_tail_number_inquiry.jsonl file, as the file is required for programs at other chapter.

It is mentioned in the book under "Evaluating Enriched Data"

Now that we’ve got our tail number data in data/faa_tail_number_inquiry.json', lets see what we’ve got. First we want to know how many records did we successfully achieve, both in raw form and as a percent?"

ch06/add_name_to_airlines.py throws error

The following error is thrown when running ch06/add_name_to_airlines.py

airlines = spark.read.format('com.databricks.spark.csv')
... .options(header='false', nullValue='\N')
... .load('data/airlines.csv')
File "", line 2
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape

Lines 38-39 of aws/ec2_bootstrap.sh are inconsistent

Line 38 is a comment indicating that Spark will use 8GB of RAM, but the following line sets spark.driver.memory to 25GB. One or the other needs to be updated for consistency.

38: # Give Spark 8GB of RAM, used Python3
39: echo "spark.driver.memory 25g" | sudo tee -a $SPARK_HOME/conf/spark-defaults.conf

Make manual install script check if components are already installed, in case it is being re-run

Hadoop not installed on vagrant machine

After setting up the vagrant machine, ~/hadoop directory exists but is empty. This is probably due to the fact that http://apache.osuosl.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz returns now a 404 error. The problem is fixed after executing the hadoop part of the manual_install.sh script.

Note that I had to execute this lother line from manual_install.sh in order for spark to works export SPARK_DIST_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

pyspark example bug

Ch02 "Collecting Data", p37

problem:
Assuming user in is in ch02 directory per text

csv_lines = sc.textFile("data/example.csv")
incorrect file path

solution:
csv_lines = sc.textFile("../data/example.csv")

ch06/extract_airlines.py throws error at airplanes_per_carrier.count()

airplanes_per_carrier.count()
[Stage 3:===============================================> (172 + 4) / 200]17/02/10 23:15:43 ERROR Executor: Exception in task 15.0 in stage 3.0 (TID 392)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/ubuntu/spark/python/pyspark/rdd.py", line 2407, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/ubuntu/spark/python/pyspark/rdd.py", line 2407, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/ubuntu/spark/python/pyspark/rdd.py", line 2407, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/ubuntu/spark/python/pyspark/rdd.py", line 346, in func
return f(iterator)
File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1041, in
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1041, in
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "", line 9, in
TypeError: unorderable types: NoneType() < str()

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

java?

Should the requirement for java be listed somewhere ( not requirements.txt, my bad! )?

password required for sshing into ec2 instance

password is required for agile_data_science_pem file, is this expected?

When trying to run the ch02/pyspark_elasticsearch.py on the Vagrant image, I'm getting an error, slightly different error in ec2

csv_lines = sc.textFile("data/example.csv")
data = csv_lines.map(lambda line: line.split(","))
schema_data = data.map(lambda x: ('key', {'name': x[0], 'company': x[1], 'title': x[2]}))
schema_data.saveAsNewAPIHadoopFile(
... path='-',
... outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
... keyClass="org.apache.hadoop.io.NullWritable",
... valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
... conf={ "es.resource" : "agile_data_science/executives" })
Traceback (most recent call last):
File "", line 6, in
File "/home/vagrant/spark/python/pyspark/rdd.py", line 1421, in saveAsNewAPIHadoopFile
keyConverter, valueConverter, jconf)
File "/home/vagrant/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/home/vagrant/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/vagrant/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.NoClassDefFoundError: org/apache/commons/httpclient/URIException
at org.elasticsearch.hadoop.cfg.Settings.getProperty(Settings.java:550)
at org.elasticsearch.hadoop.cfg.Settings.getResourceWrite(Settings.java:513)
at org.elasticsearch.hadoop.mr.EsOutputFormat.init(EsOutputFormat.java:257)
at org.elasticsearch.hadoop.mr.EsOutputFormat.checkOutputSpecs(EsOutputFormat.java:233)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1099)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:834)
at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.URIException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 31 more

Figure out problem with predictions always resulting in 0/on-time

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape

airlines = spark.read.format('com.databricks.spark.csv')\
  .options(header='false', nullValue='\N')\
  .load('data/airlines.csv')
airlines.show()

should change to

airlines = spark.read.format('com.databricks.spark.csv')\
  .options(header='false', nullValue='\\N')\
  .load('data/airlines.csv')
airlines.show()

Vagrant / jupyter port forwarding

Windows 10
Windows Defender: off

Problem:
After provisioning a Vagrant guest and running jupyter notebook, HTTP requests weren't forwarded to Jupyter on the Vagrant guest.

Solution:
jupyter was binding on localhost/127.0.0.1; therefore, vagrant couldn't forward the requests because it couldn't find a listener on the guest IP.

Here's a command for running Jupyter so it binds on vagrant's guest IP instead of localhost/127.0.0.1 :

jupyter notebook --ip=0.0.0.0

This allows vagrant's port forwarding to work correctly.

Half of the time Agile_Data_Code_2 is not present in EC2 instance during startup

About 50% of the time when I spinned up an EC2 instance, the directory Agile_Data_Code_2 does not exist and it turns out that it is because the apt-get process seems to get stuck. This is the first line in aws/ec2_bootstrap.sh:
sudo apt-get update && sudo apt-get upgrade -y

I think there may be some timing issue here when apt-get is called before the system is ready. I think it is safer to remove the bootstrap script aws/ec2_bootstrap.sh from ec2.sh and let the user scp the script to the instance and run it manually once they ssh into the instance.

File "/home/ubuntu/Agile_Data_Code_2/ch06/web/search_helpers.py", line 28 except AttributeError, e:

File "report_flask.py", line 9, in
import search_helpers
File "/home/ubuntu/Agile_Data_Code_2/ch06/web/search_helpers.py", line 28
except AttributeError, e:

In python3 the raise syntax no longer accepts comma-separated arguments.

Use as instead:

Error "No module named 'pyspark'"

I am running through the examples on the EC2 instance and have successfully run all the example code up to the section entitled Pushing data to MongoDB from PySpark. This includes successfully running pyspark from the command line.

When I run the ch02/pyspark_mongodb.py script, I get the following error:

ImportError: No module named 'pyspark'

I have also run the command "import pyspark" in python and python3 and get the same error. Trying to install pyspark using "pip install pyspark" results in a permissions error.

Anyone have an idea on how to get past this?

questions on manual_install.sh

I am opening this issue for manual_install.sh so that I can bounce ideas on it with you
as I come across them?

AttributeError: 'DataFrame' object has no attribute 'map'

The following program would not work under Spark 2.0

Use .rdd.map:

DataFrame.map has been removed in Spark 2.

Load the parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")

Dump the unneeded fields

tail_numbers = on_time_dataframe.map(lambda x: x.TailNum)
tail_numbers = tail_numbers.filter(lambda x: x != '')

distinct() gets us unique tail numbers

unique_tail_numbers = tail_numbers.distinct()

now we need a count() of unique tail numbers

airplane_count = unique_tail_numbers.count()
print("Total airplanes: {}".format(airplane_count))