googlecloudplatform / data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

License: Apache License 2.0

Shell 1.11% Python 8.11% Jupyter Notebook 90.73% Dockerfile 0.05%

data-analysis data-visualization cloud-computing machine-learning data-pipeline data-processing data-science data-engineering

data-science-on-gcp's Introduction

data-science-on-gcp

Source code accompanying book:

	Data Science on the Google Cloud Platform, 2nd Edition, Valliappa Lakshmanan, O'Reilly, Apr 2022. Branch: 2nd Edition [also main]
	Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly, Jan 2017. Branch: edition1_tf2 (obsolete, and will not be maintained)

Try out the code on Google Cloud Platform

The code on Qwiklabs (see below) is continually tested, and this repo is kept up-to-date.

If the code doesn't work for you, I recommend that you try the corresponding Qwiklab lab to see if there is some step that you missed. If you still have problems, please leave feedback in Qwiklabs, or file an issue in this repo.

Try out the code on Qwiklabs

Purchase book

Read on-line or download PDF of book

Buy on Amazon.com

Updates to book

I updated the book in Nov 2019 with TensorFlow 2.0, Cloud Functions, and BigQuery ML.

data-science-on-gcp's Issues

error with 04_streaming/simulate/df05.py

I'm running this on qwiklabs. This part here

$ python ./df05.py

Gives the following truncated stack trace:
File "apache_beam/runners/common.py", line 419, in
apache_beam.runners.common.SimpleInvoker.invoke_process
File "/home/google2864119_student/.local/lib/python2.7/site-packages/apache_beam/transforms/core.py", line 1161, in
wrapper = lambda x: [fn(x)]
File "./df05.py", line 98, in
| 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
File "./df05.py", line 24, in addtimezone
import timezonefinder
File "/home/google2864119_student/.local/lib/python2.7/site-packages/timezonefinder/__init__.py", line 2, in
from .timezonefinder import TimezoneFinder
File "/home/google2864119_student/.local/lib/python2.7/site-packages/timezonefinder/timezonefinder.py", line 300
def closest_timezone_at(self, *, lat, lng, delta_degree=1, exact_computation=False,
return_distances=False,
^
SyntaxError: invalid syntax [while running 'airports:tz']
google2864119_student@cloudshell:~/data-science-on-gcp/04_streaming/simulate (qwiklabs-gcp-8a4cca54b8a375af)$
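
The caret in the trace points at a signature that uses a bare `*`, which marks keyword-only arguments. That syntax exists only in Python 3, while the paths in the trace show Python 2.7, so this is the likely cause of the SyntaxError. The sketch below is only an illustration: `closest_timezone_at` here is a toy stand-in with a made-up body, not the timezonefinder implementation, and `supports_keyword_only_args` is a hypothetical helper.

```python
import sys


def closest_timezone_at(*, lat, lng, delta_degree=1):
    """Toy stand-in with the same style of signature as in the trace.

    The bare ``*`` forces lat/lng to be passed by keyword. Python 2.7
    cannot even parse this line, which matches the reported SyntaxError.
    """
    return (round(lat), round(lng), delta_degree)


def supports_keyword_only_args(version_info=sys.version_info):
    """Return True when the interpreter can parse keyword-only arguments."""
    return version_info[0] >= 3


if __name__ == "__main__":
    print("keyword-only args supported:", supports_keyword_only_args())
    print(closest_timezone_at(lat=47.6, lng=-122.3))
```

If this is indeed the cause, running the script under Python 3, or installing a timezonefinder release that still supports Python 2.7, should avoid the error.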

Chapter 9 ./create_small.sh bucket name

When I try to run ./create_small.sh with a valid bucket name in Cloud Shell, I get the following error.
I tried changing the directory for writing the CSV files to Cloud Storage, but no luck. I still get the same exception.

https://github.com/tmreic/data-science-on-gcp/blob/master/09_cloudml/create_small.sh

@CloudShell:~/data-science-on-gcp/09_cloudml $ ./create_small.sh bucket
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
head: cannot open 'full.csv' for reading: No such file or directory
rm: cannot remove 'full.csv': No such file or directory
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
head: cannot open 'full.csv' for reading: No such file or directory
rm: cannot remove 'full.csv': No such file or directory

Chapter 3 - Datastudio Pie chart flights

When setting up the "islate" definition, the current formula is

CASE WHEN (ARR_DELAY_copy < 15) THEN "ON TIME" ELSE "LATE" END

The parentheses are not accepted, so the formula should be:

CASE WHEN ARR_DELAY_copy < 15 THEN "ON TIME" ELSE "LATE" END

See: https://support.google.com/datastudio/table/6379764?hl=en

Ch 6 creating Chrome session from local Windows session for Google Cloud Datalab on Cloud Dataproc

I hit three issues in this chapter. They are gcloud configuration issues rather than code issues in GitHub.

I couldn't create a Datalab session using the instructions in Readme.md.

However, I referred to the Google Cloud documentation at https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces#configure_your_browser, and at Dataproc -> ch6cluster->Web Interfaces -> Create an SSH tunnel. I connected twice, but I'm not sure what sequence resulted in success (notes from first time did not directly work for the second connection), so not leaving instructions here.

To get BigQuery working from Python, I uploaded my JSON key to the cluster and changed the query (see below):

a) Create service account json in https://console.cloud.google.com/apis/credentials/serviceaccountkey?project=neural-virtue-236312&folder&organizationId=790480867434
b) upload json to datalab home directory in master
c) client = bigquery.Client.from_service_account_json("Data Science on GCP-93d5e770def5.json")
d) sql = """
SELECT ARR_DELAY, DEP_DELAY
FROM flights.tzcorr
WHERE DEP_DELAY >= 10 AND RAND() < 0.01
"""
e) df = client.query(sql).result().to_dataframe()

I was getting "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota" when running ./increase_cluster.sh. The fix for me was to go to Compute Engine->Disks and delete the disks left over from previous chapters. I looked through various docs, but it was not obvious to me how to increase the regional quota (the docs seemed to be more about increasing specific standard persistent disk quotas).

Chapter 3, page 73 - Add zone to SQL create command

Current command

gcloud sql instances create flights --tier=db-n1-standard-1 --activation-policy=ALWAYS

Recommendation to add zone:

gcloud sql instances create flights --tier=db-n1-standard-1 --activation-policy=ALWAYS --gce-zone=us-central1-a

to avoid

WARNING: Starting with release 218.0.0, you will need to specify either a region or a zone to create an instance.

Python client library version is too old in cloud shell

Maybe I am missing something but when I tried to run some code in cloud shell, I saw errors because the default installed python client library version is too old.

e.g. simulate.py in chapter 04

$  python ./simulate.py --startTime '2015-05-01 00:00:00 UTC' --endTime '2015-05-04 00:00:00 UTC' --speedFactor=30 --project $DEVSHELL_PROJECT_ID
Traceback (most recent call last):
  File "./simulate.py", line 83, in <module>
    dataset =  bqclient.get_dataset( bqclient.dataset('flights') )  # throws exception on failure
AttributeError: 'Client' object has no attribute 'get_dataset'

pip freeze shows following:

google-cloud-bigquery==0.25.0
google-cloud-pubsub==0.26.0

After running pip install --upgrade, the code ran successfully.
(e.g.)

$ pip install --upgrade --user google-cloud-bigquery
$ pip install --upgrade --user google-cloud-pubsub

Am I missing some steps?
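
A quick way to confirm this kind of mismatch before running the scripts is to compare the installed version against a minimum. The sketch below is an assumption-laden illustration: `ASSUMED_MIN_BIGQUERY` is a guess (the report only shows that 0.25.0 lacks `get_dataset`), the helper names are hypothetical, and `importlib.metadata` needs Python 3.8 or later.

```python
from importlib import metadata

# Assumed minimum version; the issue only establishes that 0.25.0 is too old.
ASSUMED_MIN_BIGQUERY = "0.28.0"


def parse_version(text):
    """Turn '0.25.0' into (0, 25, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_older(installed, minimum):
    """Return True when the installed version sorts before the minimum."""
    return parse_version(installed) < parse_version(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("google-cloud-bigquery")
    if found is None:
        print("google-cloud-bigquery is not installed")
    elif is_older(found, ASSUMED_MIN_BIGQUERY):
        print("installed", found, "looks too old; consider pip install --upgrade")
    else:
        print("installed", found, "looks recent enough")
```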

simulate.py not working in cloud shell

Error thrown:
Traceback (most recent call last):
File "simulate.py", line 22, in
from google.cloud import pubsub
File "/usr/local/lib/python2.7/dist-packages/google/cloud/pubsub/__init__.py", line 30, in
from google.cloud.pubsub.client import Client
File "/usr/local/lib/python2.7/dist-packages/google/cloud/pubsub/client.py", line 21, in
from google.cloud.pubsub._http import Connection
File "/usr/local/lib/python2.7/dist-packages/google/cloud/pubsub/_http.py", line 25, in
from google.cloud.iterator import HTTPIterator
ImportError: No module named iterator

When running install.sh:
google-cloud-pubsub 0.26.0 has requirement google-cloud-core<0.26dev,>=0.25.0, but you'll have google-cloud-core 0.28.1 which is incompatible.
gapic-google-cloud-pubsub-v1 0.15.4 has requirement oauth2client<4.0dev,>=2.0.0, but you'll have oauth2client 4.1.2 which is incompatible.
proto-google-cloud-pubsub-v1 0.15.4 has requirement oauth2client<4.0dev,>=2.0.0, but you'll have oauth2client 4.1.2 which is incompatible.
google-api-core 1.1.2 has requirement setuptools>=34.0.0, but you'll have setuptools 33.1.1 which is incompatible.
proto-google-cloud-datastore-v1 0.90.4 has requirement oauth2client<4.0dev,>=2.0.0, but you'll have oauth2client 4.1.2 which is incompatible.
googledatastore 7.0.1 has requirement oauth2client<4.0.0,>=2.0.1, but you'll have oauth2client 4.1.2 which is incompatible.

Uses version of BigQuery Python Client Library that is out of date

This repo uses an old version of the BigQuery Python Client Library and does not specify which version.

I work on the BigQuery Python Client Library and can help with updating the code, but the versions of the required libraries should be specified (requirements.txt or otherwise).

Chapter 2 - Error while trying to run the ingest app on App Engine: first service must be 'default' service

ERROR: (gcloud.app.deploy) INVALID_ARGUMENT: The first service (module) you upload to a new application must be the 'default' service (module). Please upload a version of the 'default' service (module) before uploading a version for the 'flights' service (module). See the documentation for more information. Python: (https://developers.google.com/appengine/docs/python/modules/#Python_Uploading%%20modules) Java: (https://developers.google.com/appengine/docs/java/modules/#Java_Uploading%%20modules)

Shall I change the service name to default?

Chapter 7: cloning GitHub in Jupyter

Hi, I'm on Chapter 7 and this sentence doesn't really make sense to me:

We can then start an SSH tunnel, a Chrome session via the network proxy, and browse to port 8080 on the master node of the cluster. In Jupyter, we then can clone the GitHub repository and start off a new notebook.

I previously went through Chapter 6 without problems, but it isn't clear to me what we're meant to do here. I know how to sign into the master node using SSH in Chrome, which I assume is the "Chrome session via the network proxy." But do you mean that we're supposed to browse to port 8080 within that SSH session somehow? That doesn't quite make sense to me because 8080 isn't an SSH port. What am I missing?

I tried just doing git clone within a Jupyter notebook that I accessed through Web Interfaces as explained in Chapter 6, but git clone isn't valid syntax within a Jupyter notebook. So I'm a bit lost. Thanks for any help you can provide!

SyntaxError while running df05.py @ Lab ( Processing Data with Google Cloud Dataflow )

In the lab "Processing Data with Google Cloud Dataflow", one of the steps is to run the script df05.py.

However, it fails with a SyntaxError raised from a function in another file, timezonefinder.py.

Please check and fix this. Thanks.

Issue in 02_ingest/monthlyupdate/ingest_flights.py

In the download function,

response = urlopen(url, PARAMS)

TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.

I am getting a TypeError because PARAMS needs to be encoded as bytes. However, when I try the following code, there is still an error.

PARAMS = urllib.parse.urlencode(PARAMS).encode("utf-8")
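
One possible reason the second attempt still fails: `urllib.parse.urlencode` expects a mapping or a sequence of pairs and raises a TypeError when handed a plain string. If PARAMS in the script is already a single form-encoded string (an assumption, since the post does not show its definition), it only needs `.encode("utf-8")`. The helper below is a hypothetical sketch that handles both cases.

```python
from urllib.parse import urlencode


def to_post_bytes(params):
    """Return a bytes body suitable for urlopen(url, data).

    - bytes are passed through unchanged
    - a str is assumed to be form-encoded already and is only UTF-8 encoded
    - a mapping or sequence of pairs is form-encoded first, then UTF-8 encoded
    """
    if isinstance(params, bytes):
        return params
    if isinstance(params, str):
        return params.encode("utf-8")
    return urlencode(params).encode("utf-8")


if __name__ == "__main__":
    print(to_post_bytes("UserTableName=On_Time&Year=2015"))
    print(to_post_bytes({"Year": "2015", "Month": "1"}))
```

With this helper, the call in the download function would become `urlopen(url, to_post_bytes(PARAMS))`.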

Objects moved on BTS -- ingest.sh outdated?

201501
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5405  100   191  100  5214     14    392  0:00:13  0:00:13 --:--:--     0
Received <head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="https://transtats.bts.gov/ftproot/TranStatsData/474493635_T_ONTIME.zip">here</a>.</body>
https://transtats.bts.gov/ftproot/TranStatsData/474493635_T_ONTIME.zip
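
The response above is an "Object moved" page whose body carries the new location in an anchor tag. As a minimal sketch of how a download script could pick that URL out of the body, the hypothetical helper below uses a case-insensitive regular expression. It is not taken from ingest.sh and assumes the response keeps this shape.

```python
import re

# Matches the first <a href="..."> in the body; the sample response uses HREF.
HREF_PATTERN = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)


def extract_moved_url(html):
    """Return the link target from an 'Object moved' page, or None if absent."""
    match = HREF_PATTERN.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    body = (
        "<head><title>Object moved</title></head>\n"
        "<body><h1>Object Moved</h1>This object may be found "
        '<a HREF="https://transtats.bts.gov/ftproot/TranStatsData/'
        '474493635_T_ONTIME.zip">here</a>.</body>'
    )
    print(extract_moved_url(body))
```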

PATH environment variable is clobbered in deploy_cf.sh

PATH environment variable is clobbered in setup_cron.sh

data-science-on-gcp/02_ingest/monthlyupdate/setup_cron.sh.
Choose a different variable name.

Chapter10 - Build Failure: AddRealtimePrediction.java

Running predict.sh gives:

Caused by: java.lang.NoSuchMethodError: com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services
/AbstractGoogleClient$Builder;
...
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:150)
at com.google.cloud.training.flights.AddRealtimePrediction.main (AddRealtimePrediction.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
...
at java.lang.Thread.run (Thread.java:748)

[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project chapter10: An exception occured while executing the Java class. null: InvocationTargetException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractGoogleClient$Builder; -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project chapter10: An exception occured while executing the Java class. null

Many Chapter 9 issues

There are a lot of issues that make following along with Chapter 9 difficult to impossible.

When submitting the exact code that you have in this repository with the following command:

gcloud ai-platform jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=$(pwd)/flights/trainer \
--job-dir=$OUTPUT_DIR \
--runtime-version=1.14 \
--staging-bucket=gs://$BUCKET \
--master-machine-type=n1-standard-4 \
--scale-tier=CUSTOM \
-- \
--bucket=$BUCKET --num_examples=100000 --func=linear

It doesn't work due to some error that I don't understand:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 105, in <module>
    model.train_and_evaluate(func_to_call)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 220, in train_and_evaluate
    callbacks=[cp_callback])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 780, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 274, in model_iteration
    batch_outs = f(actual_inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
FailedPreconditionError: Table not initialized.
	 [[{{node features/carrier_indicator/hash_table_Lookup/LookupTableFindV2}}]]

There are more issues where that came from. I had to comment out import hypertune because that isn't available from pip. Also, the command in the book says to use --runtime-version 2.0 but that isn't even a publicly available version (I fell back to 1.14 - not sure if that's the reason for this error).
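
Rather than commenting out the import, one option is to guard it so the trainer still runs when the module is missing. The sketch below is an assumption, not the repository's code: `report_metric` is a hypothetical helper, and it presumes the module (which I believe is distributed on pip as `cloudml-hypertune`) exposes `HyperTune().report_hyperparameter_tuning_metric(...)`.

```python
try:
    import hypertune  # optional; assumed to come from the cloudml-hypertune package
except ImportError:
    hypertune = None


def report_metric(tag, value, step, hpt_module=hypertune):
    """Report a tuning metric if hypertune is available.

    Returns True when the metric was reported and False when it was skipped,
    so that training can proceed without the optional dependency.
    """
    if hpt_module is None:
        print("hypertune not available; skipping metric", tag, value)
        return False
    tuner = hpt_module.HyperTune()
    tuner.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=tag,
        metric_value=value,
        global_step=step,
    )
    return True


if __name__ == "__main__":
    # Explicitly skip reporting here so the demo has no side effects.
    report_metric("rmse", 0.5, 1, hpt_module=None)
```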

To address this I tried to fall back to the commands that you use in the GitHub instead of the ones you list in the book, but your README lists scripts (e.g. retrain_cloud.sh) that don't even exist in the repository, and more importantly I can't figure out how these scripts line up with what I'm reading in the book.

On the whole while I've been able to follow along with the book up to this point, I can't really do it with Chapter 9.

data-science-on-gcp / 05_bqdatalab / exploration.ipynb

I get an error when running the execute() part of this lab: data-science-on-gcp: RequestException: HTTP request failed: Not found: Job {Project}:job...

The Qwiklabs lab for this section ran fine; however, it did not have an execute() command.

The Python code works with the following replacement:
ORIGINAL CODE: bq.Query(sql).execute().result().to_dataframe()
REPLACEMENT CODE: client.query(sql).result().to_dataframe()

See below for fleshed out example including how to get service account client

a) Create service account json in https://console.cloud.google.com/apis/credentials/serviceaccountkey?project=neural-virtue-236312&folder&organizationId=790480867434
b) upload json to datalab home directory
c) client = bigquery.Client.from_service_account_json("Data Science on GCP-93d5e770def5.json")
d) sql = """
SELECT ARR_DELAY, DEP_DELAY
FROM flights.tzcorr
WHERE DEP_DELAY >= 10 AND RAND() < 0.01
"""
e) df = client.query(sql).result().to_dataframe()
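
Steps (c) through (e) can be gathered into two small functions. This is a sketch under stated assumptions: it requires the google-cloud-bigquery package, `make_client` and `query_to_dataframe` are hypothetical names, and the key file path is whatever JSON key you uploaded in step (b).

```python
SQL = """
SELECT ARR_DELAY, DEP_DELAY
FROM flights.tzcorr
WHERE DEP_DELAY >= 10 AND RAND() < 0.01
"""


def make_client(key_path):
    """Build a BigQuery client from a service account JSON key file."""
    # Imported here so the rest of the module loads without the library.
    from google.cloud import bigquery

    return bigquery.Client.from_service_account_json(key_path)


def query_to_dataframe(client, sql=SQL):
    """Run the query, wait for it to finish, and return a pandas DataFrame."""
    return client.query(sql).result().to_dataframe()
```

In a notebook cell, usage would look like `df = query_to_dataframe(make_client("my-key.json"))`, where the file name is a placeholder.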

[CRITICAL] WORKER TIMEOUT errors in stderr (via Stackdriver)

From AppEngine logs -

A  GET 200 86 B 2 ms Chrome 64 / GET 200 86 B 2 ms Chrome 64 
A  INFO: Rejected non-Cron request
 
A  GET 200 236 B 2 ms Chrome 64 /ingest GET 200 236 B 2 ms Chrome 64 
A  INFO: Received cron request true
 
A  INFO: scheduling ingest of year=2016 month=02
 
A  INFO: Requesting data for 2016-02-*
 
A  [2018-02-21 05:04:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:15)
 
A  [2018-02-21 05:04:42 +0000] [15] [INFO] Worker exiting (pid: 15)

Chapter 4 df06.py failed with "Unable to get the Filesystem for path gs://....."

This is FYI:
In Chapter 4, df06.py failed with following error:

$ ./df06.py -p $DEVSHELL_PROJECT_ID -b <BUCKETNAME> 
...  
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 186, in match
    filesystem = FileSystems.get_filesystem(patterns[0])
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path gs://<bucket>/flights/airports/airports.csv.gz

It seems like Beam could not recognize the GCS bucket and the issue was resolved after executing:
pip install google-cloud-dataflow

Chapter 07 ValueError: DEP_TIME when submitting experiment.py

In Chapter 07, I could not complete the last step(experiment.py).
Following is the output of the submission.
How can I avoid this error?

$ ./submit_spark.sh BUCKET experiment.py
 :
Job [f5947f18c07042fe9e4b295537b18eb1] submitted.
Waiting for job output...
19/02/05 11:22:30 INFO org.spark_project.jetty.util.log: Logging initialized @2422ms
19/02/05 11:22:30 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/02/05 11:22:30 INFO org.spark_project.jetty.server.Server: Started @2509ms
19/02/05 11:22:30 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@6726e80c{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/02/05 11:22:30 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
19/02/05 11:22:40 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
19/02/05 11:24:05 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 4.0 (TID 55)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/tmp/f5947f18c07042fe9e4b295537b18eb1/logistic.py", line 92, in to_example
    features.extend(get_local_hour(fields['DEP_TIME'],
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1548, in __getitem__
    raise ValueError(item)
ValueError: DEP_TIME

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
19/02/05 11:24:05 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 55, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/tmp/f5947f18c07042fe9e4b295537b18eb1/logistic.py", line 92, in to_example
    features.extend(get_local_hour(fields['DEP_TIME'],
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1548, in __getitem__
    raise ValueError(item)
ValueError: DEP_TIME

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

19/02/05 11:24:05 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "/tmp/f5947f18c07042fe9e4b295537b18eb1/logistic.py", line 115, in <module>
    lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/classification.py", line 392, in train
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1376, in first
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1358, in take
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 1033, in runJob
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 55, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/tmp/f5947f18c07042fe9e4b295537b18eb1/logistic.py", line 92, in to_example
    features.extend(get_local_hour(fields['DEP_TIME'],
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1548, in __getitem__
    raise ValueError(item)
ValueError: DEP_TIME

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/tmp/f5947f18c07042fe9e4b295537b18eb1/logistic.py", line 92, in to_example
    features.extend(get_local_hour(fields['DEP_TIME'],
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1548, in __getitem__
    raise ValueError(item)
ValueError: DEP_TIME

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:152)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

19/02/05 11:24:06 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@6726e80c{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

error with 04_streaming/simulate/df06.py (similar to #56)

Hi, I am getting the same error as reported in #56 for df05.py, but I am running df06.py on my own GCP account. The error snippet is below.

File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/core.py", line 1161, in
    wrapper = lambda x: [fn(x)]
File "df06.py", line 134, in
    | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
File "df06.py", line 24, in addtimezone
    import timezonefinder
File "/usr/local/lib/python2.7/dist-packages/timezonefinder/__init__.py", line 2, in
    from .timezonefinder import TimezoneFinder
File "/usr/local/lib/python2.7/dist-packages/timezonefinder/timezonefinder.py", line 300
    def closest_timezone_at(self, *, lat, lng, delta_degree=1, exact_computation=False, return_distances=False,
    ^
SyntaxError: invalid syntax [while running 'airports:tz']
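The SyntaxError in the snippet above points at a signature containing a bare `*`, which marks keyword-only arguments. That syntax exists only in Python 3, so a Python 2.7 interpreter cannot parse a package that uses it. The sketch below checks whether the running interpreter accepts a signature of that shape; the shortened signature and the helper name `parses_on_this_interpreter` are illustrative assumptions, not code from timezonefinder.

```python
import sys

# Shortened stand-in for the flagged line; the body is a placeholder.
SOURCE = "def closest_timezone_at(self, *, lat, lng, delta_degree=1):\n    return None\n"

def parses_on_this_interpreter(source):
    """Return True if this interpreter can parse `source`, False on SyntaxError."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False

# Python 3 accepts the bare `*`; Python 2.7 rejects it with "invalid syntax".
print(sys.version_info[:2], parses_on_this_interpreter(SOURCE))
```

On Python 2.7 the practical options are to install an older release of the package that predates this syntax, or to run the pipeline under Python 3.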

08_dataflow [optional] Setup Dataflow Development environment

CreateTrainingDataset won't execute on my Windows machine. I believe this is related to credentials. I set GOOGLE_APPLICATION_CREDENTIALS to my JSON file but am still getting the same error.

The error is at the following line of code (line 240 in org.apache.beam.runners.dataflow.DataflowRunner):
try {
gcpTempLocation = dataflowOptions.getGcpTempLocation();
}

Called from line 96 in CreateTrainingDataset.java:
Pipeline p = Pipeline.create(options);

Pipeline options are below:
appName: CreateTrainingDataset
bucket: data_science_on_gcp
credentialFactoryClass: class org.apache.beam.sdk.extensions.gcp.auth.GcpCredentialFactory
gcpCredential: null
gcsEndpoint: null
optionsId: 0
pathValidator: org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator@5cbe877d
pathValidatorClass: class org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator
project: neural-virtue-236312
runner: class org.apache.beam.runners.dataflow.DataflowRunner
stableUniqueNames: WARNING
tempLocation: gs://data_science_on_gcp/flights/staging

Maven output below:

Current Settings:
appName: CreateTrainingDataset
bucket: data_science_on_gcp
optionsId: 0
project: neural-virtue-236312
runner: class org.apache.beam.runners.dataflow.DataflowRunner
stableUniqueNames: WARNING
tempLocation: gs://data_science_on_gcp/flights/staging
Exception in thread "main" java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233)
at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:150)
at com.google.cloud.training.flights.CreateTrainingDataset.main(CreateTrainingDataset.java:96)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222)
... 4 more
Caused by: java.lang.NoSuchMethodError: com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractGoogleClient$Builder;
at com.google.api.services.storage.Storage$Builder.setBatchPath(Storage.java:9307)
at com.google.api.services.storage.Storage$Builder.<init>(Storage.java:9286)
at org.apache.beam.sdk.util.Transport.newStorageClient(Transport.java:95)
at org.apache.beam.sdk.util.GcsUtil$GcsUtilFactory.create(GcsUtil.java:96)
at org.apache.beam.sdk.util.GcsUtil$GcsUtilFactory.create(GcsUtil.java:84)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:155)
at com.sun.proxy.$Proxy24.getGcsUtil(Unknown Source)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPathIsAccessible(GcsPathValidator.java:88)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported(GcsPathValidator.java:61)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:245)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:228)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:155)
at com.sun.proxy.$Proxy15.getGcpTempLocation(Unknown Source)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:240)
... 9 more

Chapter 4 - df06.py faces default quota limit

If the default project quota on GCP is still in place (a "trial" account with free credit), df06.py fails because of '--max_num_workers=10': the default quota is 8.
To make it run, change '--max_num_workers=10' to '--max_num_workers=8' in df06.py.
Alternatively, upgrade from the free trial account (the "Upgrade" option under IAM, Quotas).
Documentation: https://cloud.google.com/compute/quotas
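As a rough sketch of the workaround described above, the worker flag could be capped programmatically before the pipeline options are built. The helper name `cap_max_workers` and the sample argument list are hypothetical; only the `--max_num_workers` flag and the quota of 8 come from this report.

```python
def cap_max_workers(argv, quota):
    """Return a copy of argv with --max_num_workers lowered to `quota` if it exceeds it."""
    prefix = '--max_num_workers='
    capped = []
    for arg in argv:
        if arg.startswith(prefix):
            requested = int(arg[len(prefix):])
            arg = prefix + str(min(requested, quota))
        capped.append(arg)
    return capped

# Hypothetical argument list resembling the one in df06.py.
argv = ['--project=my-project', '--max_num_workers=10', '--runner=DataflowRunner']
print(cap_max_workers(argv, quota=8))
```

Editing the literal in df06.py is simpler; a helper like this only pays off if the same script has to run under several quotas.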

08_dataflow/create_datasets.sh failed with java.lang.reflect.InvocationTargetException

Running create_datasets.sh in cloud shell failed with the following error.
Is this something related to my environment or is there any way to resolve the issue?

$ ./create_datasets.sh <BUCKET> 3
CommandException: 1 files/objects could not be removed.
[INFO] Scanning for projects...
[INFO]
[INFO] -------------< com.google.cloud.training.flights:chapter8 >-------------
[INFO] Building chapter8 [1.0.0,2.0.0]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ chapter8 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/user/data-science-on-gcp/08_dataflow/chapter8/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ chapter8 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ chapter8 ---
[WARNING]
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
    at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
    at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:233)
    at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
    at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
    at org.apache.beam.sdk.Pipeline.create (Pipeline.java:150)
    at com.google.cloud.training.flights.CreateTrainingDataset.main (CreateTrainingDataset.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
    at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
    at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
    at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
    at org.apache.beam.sdk.Pipeline.create (Pipeline.java:150)
    at com.google.cloud.training.flights.CreateTrainingDataset.main (CreateTrainingDataset.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
    at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.NoSuchMethodError: com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractGoogleClient$Builder;
    at com.google.api.services.storage.Storage$Builder.setBatchPath (Storage.java:9307)
    at com.google.api.services.storage.Storage$Builder.<init> (Storage.java:9286)
    at org.apache.beam.sdk.util.Transport.newStorageClient (Transport.java:95)
    at org.apache.beam.sdk.util.GcsUtil$GcsUtilFactory.create (GcsUtil.java:96)
    at org.apache.beam.sdk.util.GcsUtil$GcsUtilFactory.create (GcsUtil.java:84)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:155)
    at com.sun.proxy.$Proxy47.getGcsUtil (Unknown Source)
    at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPathIsAccessible (GcsPathValidator.java:88)
    at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported (GcsPathValidator.java:61)
    at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:245)
    at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create (GcpOptions.java:228)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:155)
    at com.sun.proxy.$Proxy38.getGcpTempLocation (Unknown Source)
    at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:240)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:222)
    at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:162)
    at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
    at org.apache.beam.sdk.Pipeline.create (Pipeline.java:150)
    at com.google.cloud.training.flights.CreateTrainingDataset.main (CreateTrainingDataset.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:293)
    at java.lang.Thread.run (Thread.java:748)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.681 s
[INFO] Finished at: 2019-02-05T16:13:20+09:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project chapter8: An exception occured while executing the Java class. null: InvocationTargetException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractGoogleClient$Builder; -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Need to specify project id in bq.Client()

This is the error seen:
agrawalankit@test-aagrawal:~/data-science-on-gcp/04_streaming/simulate$ python ./simulate.py --startTime '2015-05-01 00:00:00 UTC' --endTime '2015-05-04 00:00:00 UTC' --speedFactor=60 --project test-aagrawal
Traceback (most recent call last):
File "./simulate.py", line 83, in
dataset = bqclient.get_dataset( bqclient.dataset('flights') ) # throws exception on failure
File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 288, in get_dataset
path=dataset_ref.path)
File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 271, in _call_api
return call()
File "/usr/local/lib/python2.7/dist-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python2.7/dist-packages/google/api_core/retry.py", line 177, in retry_target
return target()
File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 293, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://www.googleapis.com/bigquery/v2/projects/no-project-id/datasets/flights: Not found: Project no-project-id

The fix should be to update line 83 of simulate.py as follows:

bqclient = bq.Client(project=args.project)
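For context, a minimal sketch of how the `--project` argument could flow into the client, assuming the script parses its flags with argparse. The helper names `parse_args` and `make_bq_client` are hypothetical and this is not the actual simulate.py; `bq.Client(project=...)` is the call proposed above.

```python
import argparse

def parse_args(argv=None):
    """Parse the command line; --project is required so it can never be silently missing."""
    parser = argparse.ArgumentParser(description='Sketch: pass the project ID explicitly')
    parser.add_argument('--project', required=True, help='GCP project ID')
    return parser.parse_args(argv)

def make_bq_client(project):
    """Build a BigQuery client bound to an explicit project instead of the ambient default."""
    # Imported lazily so argument parsing can be exercised without the library installed.
    from google.cloud import bigquery as bq
    return bq.Client(project=project)

args = parse_args(['--project', 'my-project'])  # hypothetical project ID
print(args.project)
```

Without an explicit project, the client falls back to whatever default the environment provides, which is how a placeholder such as no-project-id can end up in the request URL.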

Chapter 06 creating cluster failed

Running create_cluster.sh failed with the following errors. How can I resolve them?

$ ./create_cluster.sh <BUCKET> us-central1-c
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/<myproject>/regions/global/operations/66fca575-94c6-356f-b11e-036fd3b335c5] failed: Multiple Errors:
 - Cannot start master: Insufficient number of DataNodes reporting
 - Worker ch6cluster-w-0 unable to register with master: ch6cluster-m. This could be because it is offline, or network is misconfigured.
 - Worker ch6cluster-w-1 unable to register with master: ch6cluster-m. This could be because it is offline, or network is misconfigured..

02_ingest/monthlyupdate/ingest_flights.py

Hi,
I'm currently reading your book and having some trouble with this executable. To be honest, I'm not very familiar with Python. This is the error I'm getting (also with ./ingest_flights.py --help):
Traceback (most recent call last):
  File "./ingest_flights.py", line 25, in <module>
    from google.cloud import storage
ImportError: No module named google.cloud
I have the cloud SDK and python installed so I'm a bit confused on what to do. Unfortunately the answers on SO and the GCP docs didn't help me much, so I'm wondering if you could advise.

Tensorflow 1.7

The current version of TensorFlow is 1.7. Models created using this version can't be deployed to the GCP ML Engine.

I fixed this by removing 1.7 and inserting

REQUIRED_PACKAGES = [
'tensorflow==1.6',
]
into setup.py

and then deploying with --runtime-version=1.6

You might like to point this out for new users, who will download 1.7 by default. I appreciate that versions and compatibility with Cloud ML will continue to evolve.

Chapter4: df06.py / simevents table is empty

I can't get any data into the simevents table in BigQuery.
Something seems to be wrong with flights:tzcorr, because of the following.

Output collections
flights:tzcorr/flights:tzcorr.out0

Can anyone help me?

Chapter 4: df06.py didn't create simevents table

Dataflow doesn't show anything. I go to BigQuery and see the flights dataset but the table simevents is absent. I run the query and I get this popup message

Not found: Table data-science-gcp-book-237318:flights.simevents was not found in location US

Error compiling FlightsMLService

Caused by: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected an int but was BEGIN_ARRAY at line 1 column 91 path $.predictions[0].classes
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:224)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:41)
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.read(CollectionTypeAdapterFactory.java:82)
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.read(CollectionTypeAdapterFactory.java:61)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:129)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:220)
at com.google.gson.Gson.fromJson(Gson.java:887)
at com.google.gson.Gson.fromJson(Gson.java:852)
at com.google.gson.Gson.fromJson(Gson.java:801)
at com.google.gson.Gson.fromJson(Gson.java:773)
at com.google.cloud.training.flights.FlightsMLService.sendRequest(FlightsMLService.java:106)
at com.google.cloud.training.flights.FlightsMLService.main(FlightsMLService.java:177)

Chapter 7 experimentation.ipynb - Variable 'BUCKET' is not used properly.

In chapter 7, experimentation.ipynb, 'BUCKET' is defined at the top of the code but it's not used properly later.

BUCKET='cloud-training-demos-ml'
os.environ['BUCKET'] = BUCKET

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

traindays = spark.read \
    .option("header", "true") \
    .csv('gs://cloud-training-demos-ml/flights/trainday.csv')
traindays.createOrReplaceTempView('traindays')
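One way to address this would be to build the path from `BUCKET` instead of repeating the bucket name. The helper `trainday_path` below is a hypothetical name used for illustration, not something in the notebook.

```python
BUCKET = 'cloud-training-demos-ml'

def trainday_path(bucket):
    """Build the GCS path to trainday.csv from the bucket name."""
    return 'gs://{}/flights/trainday.csv'.format(bucket)

# In the notebook, this string would replace the hard-coded path passed to .csv(...).
print(trainday_path(BUCKET))
```

With this change, pointing the notebook at a different bucket only requires editing the `BUCKET` assignment at the top.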

Chapter 2 Why Pub sub is used with cloud functions?

The last line of the chapter says: "Note also that the program creates a Cloud Pub/Sub topic and subscription. These are needed for Cloud Functions." Can somebody shed some light on this? I am assuming the scheduler automatically triggers Cloud Functions at the time specified. In that case, how does Pub/Sub come into the picture? How does it work here? Is it necessary? Any leads appreciated.
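One common arrangement, offered here as an assumption rather than a description of the book's program, is that the scheduler publishes a message to a Pub/Sub topic at the configured time, and a function subscribed to that topic runs when the message arrives. The sketch below shows the general shape of a Pub/Sub-triggered Python background function, where the message body arrives base64-encoded in `event['data']`. The function name and message text are hypothetical.

```python
import base64

def on_schedule_message(event, context=None):
    """Entry-point shape of a Pub/Sub-triggered background function."""
    # The published message body arrives base64-encoded under the 'data' key.
    message = base64.b64decode(event.get('data', b'')).decode('utf-8')
    return 'triggered with message: ' + message

# Simulate locally what the platform would deliver when the scheduler publishes.
event = {'data': base64.b64encode(b'run monthly ingest').decode('ascii')}
print(on_schedule_message(event))
```

In this arrangement Pub/Sub is the hand-off point between the timer and the function, so the function does not need a publicly reachable HTTP endpoint.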

Struggling with Visualize Real Time Geospatial Data with Google Data Studio

I'm trying to do the Qwiklab "Visualize Real Time Geospatial Data with Google Data Studio". The trouble starts when I run the Python script to see an event, where I get the following:

Then I tried to create a bucket, and was told that -ml is an invalid bucket name in the URL.

Finally, running Maven to deploy the Java streaming pipeline to Google Cloud Dataflow failed.

So, obviously, I couldn't see any streaming Dataflow job running.
I'm a complete rookie in this field. I'm not sure whether I have to follow the commands exactly as shown in the Qwiklab, or whether I have to substitute my own values, such as a real project ID and a different bucket name. I'm very frustrated and need help, please!

Description field in create_cbt

Flag --description has been removed. Use --display-name=DISPLAY_NAME instead.

The www.transtats.bts.gov is down no idea for how long and since when

Hi,

https://www.transtats.bts.gov is down. I have no idea since when, and I have no information on when, or whether, it will be fixed.

I know this is not our responsibility, but it would be nice to have an alternative way to download the data in case the problem persists.

Wide and Deep Model and runtime version.

Wide and Deep Model produces same probabilities regardless of request instance values. Linear and DNN models function correctly.

Model.py:
def get_model(output_dir, nbuckets, hidden_units, learning_rate):
    #return linear_model(output_dir)
    return dnn_model(output_dir)
    #return wide_and_deep_model(output_dir, nbuckets, hidden_units, learning_rate)

Deploy_model.sh:
Need to add --runtime-version=1.6:
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=1.6

flight data monthly ingestion error

The original deployment failed because I didn't have a default instance, so I deployed the first default instance by commenting out the service name in the app.yaml file, then redeployed with the service name uncommented.
The service deploys successfully but gives the following error when I click the Ingest link:
Sorry, this capability is accessible only by the Cron service, but I got a KeyError for 'HTTP_X_APPENGINE_CRON' -- try invoking it from the GCP console / AppEngine / taskqueues
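The message suggests that the handler looks up the `HTTP_X_APPENGINE_CRON` key in the request environment and treats a missing key as a caller other than the Cron service. App Engine adds the `X-Appengine-Cron: true` header to cron requests and strips it from external ones, so a browser click would not carry it. A minimal sketch of that kind of check follows; the helper name `is_cron_request` is hypothetical and this is not the repository's handler.

```python
def is_cron_request(environ):
    """True only when the App Engine cron header is present and set to 'true'."""
    # .get avoids a KeyError when the header is absent (for example, a browser click).
    return environ.get('HTTP_X_APPENGINE_CRON') == 'true'

print(is_cron_request({'HTTP_X_APPENGINE_CRON': 'true'}))  # cron-originated request
print(is_cron_request({}))                                 # ordinary browser request
```

Under this reading, the error on clicking the link is the guard working as designed, and the ingest has to be triggered through the Cron service as the message says.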

df02.py does not work

df02.py does not work because the latest timezonefinder packages have changed some method calls I guess. Changing the install_packages.sh script to install version 3.0.0 of timezonefinder seems to solve the problem.

simulate.py in chapter 10

Need to specify --project

Install Google Cloud Dataflow prior to df06.py

Had to run

sudo pip install google-cloud-dataflow

before

./df06.py --project=$DEVSHELL_PROJECT_ID --dataset=flights2016 --bucket=BUCKETNAME

CreateTrainingDataset9: average arrival delay time is computed without filtering by the "isTrain" flag

data-science-on-gcp/08_dataflow/chapter8/src/main/java/com/google/cloud/training/flights/CreateTrainingDataset9.java

Line 216 in 7061042

 private static PCollectionView<Map<String, Double>> computeAverageArrivalDelay(PCollection<Flight> hourlyFlights) { 

Hello. Since we are creating a training set, it would be reasonable to filter flights by the training flag when computing the average arrival delay too.
Otherwise, the data will be inconsistent during evaluation.
I mean that the average arrival delay will already include the delay of the flight we are making a prediction for. We shouldn't let any data from the evaluation set leak into the training set.
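The concern can be illustrated with a small plain-Python sketch (not Beam, and not the repository's Java code). The field names `dest`, `arr_delay` and `is_train` are hypothetical stand-ins for the Flight fields.

```python
from collections import defaultdict

def average_arrival_delay(flights, train_only=True):
    """Average arrival delay per destination, optionally using training flights only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for flight in flights:
        if train_only and not flight['is_train']:
            continue  # keep evaluation flights out of the statistic
        totals[flight['dest']] += flight['arr_delay']
        counts[flight['dest']] += 1
    return {dest: totals[dest] / counts[dest] for dest in totals}

flights = [
    {'dest': 'SEA', 'arr_delay': 10.0, 'is_train': True},
    {'dest': 'SEA', 'arr_delay': 20.0, 'is_train': True},
    {'dest': 'SEA', 'arr_delay': 90.0, 'is_train': False},
]
print(average_arrival_delay(flights))                    # {'SEA': 15.0}
print(average_arrival_delay(flights, train_only=False))  # {'SEA': 40.0}
```

The unfiltered average shifts from 15.0 to 40.0 because of a single evaluation flight, which is the leakage described above.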

Global Step Local Train

train_local runs 2000 steps, not 200. If fix_global_step_increment_bug=True is set, it runs 1000 steps.

batch size =512
train.csv has 10003 rows

20 batches x 10 epochs?
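The arithmetic behind the expected 200 steps can be checked as below. The epoch count of 10 is the guess made in this report, not a confirmed setting, and `expected_steps` is a hypothetical helper.

```python
import math

def expected_steps(num_rows, batch_size, num_epochs):
    """Number of training steps if every epoch reads the whole file once."""
    steps_per_epoch = int(math.ceil(num_rows / float(batch_size)))
    return steps_per_epoch * num_epochs

# 10003 rows / 512 per batch rounds up to 20 batches; 20 batches x 10 epochs = 200 steps.
print(expected_steps(10003, 512, 10))  # 200
```

The reported 2000 and 1000 steps are 10x and 5x this figure.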

WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3c87b04890>, '_model_dir': './trained_model/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_save_summary_steps': 100, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_log_step_count_steps': 100, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/feature_column.py:2341: calling sparse_feature_cross (from tensorflow.contrib.layers.python.ops.sparse_feature_cross_op) with hash_key=None is deprecated and will be removed after 2016-11-20.
Instructions for updating:
The default behavior of sparse_feature_cross is changing, the default
value for hash_key will change to SPARSE_FEATURE_CROSS_DEFAULT_HASH_KEY.
From that point on sparse_feature_cross will always use FingerprintCat64
to concatenate the feature fingerprints. And the underlying
_sparse_feature_cross_op.sparse_feature_cross operation will be marked
as deprecated.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./trained_model/model.ckpt.
INFO:tensorflow:loss = 9.615792, step = 1
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
INFO:tensorflow:Starting evaluation at 2018-04-01-14:20:46
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./trained_model/model.ckpt-1
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Finished evaluation at 2018-04-01-14:20:48
INFO:tensorflow:Saving dict for global step 1: accuracy = 0.16655004, accuracy/baseline_label_mean = 0.83344996, accuracy/threshold_0.700000_mean = 0.16655004, auc = 0.5, auc_precision_recall = 0.916725, global_step = 1, labels/actual_label_mean = 0.83344996, labels/prediction_mean = 2.4281659e-22, loss = 108.89304, precision/positive_threshold_0.700000_mean = 0.0, recall/positive_threshold_0.700000_mean = 0.0, rmse = 0.9129348, training/hptuning/metric = 0.9129348
INFO:tensorflow:Validation (step 100): accuracy/baseline_label_mean = 0.83344996, loss = 108.89304, auc = 0.5, accuracy/threshold_0.700000_mean = 0.16655004, global_step = 1, rmse = 0.9129348, recall/positive_threshold_0.700000_mean = 0.0, labels/prediction_mean = 2.4281659e-22, precision/positive_threshold_0.700000_mean = 0.0, training/hptuning/metric = 0.9129348, accuracy = 0.16655004, auc_precision_recall = 0.916725, labels/actual_label_mean = 0.83344996
INFO:tensorflow:global_step/sec: 16.7229
INFO:tensorflow:loss = 0.07211667, step = 101 (5.980 sec)
INFO:tensorflow:global_step/sec: 130.852
INFO:tensorflow:loss = 0.3313904, step = 201 (0.764 sec)
INFO:tensorflow:global_step/sec: 135.722
INFO:tensorflow:loss = 0.08310337, step = 301 (0.737 sec)
INFO:tensorflow:global_step/sec: 139.879
INFO:tensorflow:loss = 0.13494043, step = 401 (0.715 sec)
INFO:tensorflow:global_step/sec: 144.19
INFO:tensorflow:loss = 0.06767261, step = 501 (0.694 sec)
INFO:tensorflow:global_step/sec: 147.938
INFO:tensorflow:loss = 0.09191691, step = 601 (0.676 sec)
INFO:tensorflow:global_step/sec: 127.188
INFO:tensorflow:loss = 0.15389094, step = 701 (0.786 sec)
INFO:tensorflow:global_step/sec: 122.919
INFO:tensorflow:loss = 0.07212783, step = 801 (0.814 sec)
INFO:tensorflow:global_step/sec: 123.735
INFO:tensorflow:loss = 0.17153545, step = 901 (0.811 sec)
INFO:tensorflow:global_step/sec: 125.078
INFO:tensorflow:loss = 0.20417586, step = 1001 (0.797 sec)
INFO:tensorflow:Saving checkpoints for 1010 into ./trained_model/model.ckpt.
INFO:tensorflow:Loss for final step: 0.030963236.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
INFO:tensorflow:Starting evaluation at 2018-04-01-14:20:58
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./trained_model/model.ckpt-1010
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Finished evaluation at 2018-04-01-14:21:00
INFO:tensorflow:Saving dict for global step 1010: accuracy = 0.8775367, accuracy/baseline_label_mean = 0.83344996, accuracy/threshold_0.700000_mean = 0.8353494, auc = 0.9727974, auc_precision_recall = 0.9941381, global_step = 1010, labels/actual_label_mean = 0.83344996, labels/prediction_mean = 0.7044494, loss = 0.34910947, precision/positive_threshold_0.700000_mean = 0.9930719, recall/positive_threshold_0.700000_mean = 0.8080844, rmse = 0.30576178, training/hptuning/metric = 0.30576178
INFO:tensorflow:Restoring parameters from ./trained_model/model.ckpt-1010
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./trained_model/export/Servo/temp-1522592460/saved_model.pb

Model predictions

I get the same results for different instances when running the code on page 328:

response={u'predictions': [{u'probabilities': [0.09236259013414383, 0.907637357711792], u'logits': [2.285122871398926], u'classes': 1, u'logistic': [0.907637357711792]}, {u'probabilities': [0.09236259013414383, 0.907637357711792], u'logits': [2.285122871398926], u'classes': 1, u'logistic': [0.907637357711792]}, {u'probabilities': [0.09236259013414383, 0.907637357711792], u'logits': [2.285122871398926], u'classes': 1, u'logistic': [0.907637357711792]}, {u'probabilities': [0.09236259013414383, 0.907637357711792], u'logits': [2.285122871398926], u'classes': 1, u'logistic': [0.907637357711792]}]}
probs=[0.907637357711792, 0.907637357711792, 0.907637357711792, 0.907637357711792]

The ontime probability=0.907637357712; the key reason is that this flight appears rather typical
The ontime probability=0.907637357712; the key reason is that this flight appears rather typical
The ontime probability=0.907637357712; the key reason is that this flight appears rather typical
The ontime probability=0.907637357712; the key reason is that this flight appears rather typical

Can you help me figure out what I've done wrong?

Also, should factor iterate over range(1, len(probs)) or range(0, len(probs))?

for factor in xrange(0, len(probs)):
    impact = abs(probs[factor] - probs[0])
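
On the range question, a minimal sketch may help. It assumes that probs[0] is the prediction for the unmodified flight and that each later entry is the prediction for one varied input; the helper name impacts_vs_baseline is hypothetical and not from the book or the repo. Since each impact is measured against probs[0], starting the loop at 0 only adds a comparison of the baseline with itself, which is always 0 and cannot change the largest impact. Starting at 1 skips that no-op.

```python
def impacts_vs_baseline(probs):
    """Return (index, impact) pairs comparing each variant to probs[0].

    Hypothetical helper: assumes probs[0] is the baseline prediction and
    probs[1:] are predictions with one input varied each.
    """
    baseline = probs[0]
    # Start at 1: index 0 is the baseline itself, so its impact is always 0.
    return [(i, abs(probs[i] - baseline)) for i in range(1, len(probs))]


# The four probabilities reported in the response above are identical.
probs = [0.907637357711792] * 4
impacts = impacts_vs_baseline(probs)
max_impact = max(impact for _, impact in impacts)
print(impacts)
print(max_impact)
```

With four identical probabilities, every impact is zero, so no factor stands out. That points at the identical predictions themselves, rather than the loop bounds, as the thing to investigate.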

Missing code for section "Publishing an Event Stream to Cloud Pub/Sub"

The code is documented in the book, but doesn't seem to be located in the git repo.

"Open in Google Cloud Shell" gives a 404

ValueError: DEP_TIME

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1339, in takeUpToNumLeft
  File "/tmp/7fa836049f5f469b85f5eb082a72c77c/logistic.py", line 92, in to_example
    features.extend(get_local_hour(fields['DEP_TIME'],
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1491, in __getitem__
    raise ValueError(item)
ValueError: DEP_TIME