sjvasquez / web-traffic-forecasting Goto Github PK


Kaggle | Web Traffic Forecasting 📈

Python 100.00%

time-series forecasting convolutional-neural-networks tensorflow

web-traffic-forecasting's Introduction

Web Traffic Forecasting

My solution for the Web Traffic Forecasting competition hosted on Kaggle.

The Task

The training dataset consists of approximately 145k time series. Each series records the daily views of a different Wikipedia article from July 1st, 2015 to September 10th, 2017. The goal is to forecast the daily views between September 13th, 2017 and November 13th, 2017 for each article in the dataset. The name of the article and the type of traffic (all, mobile, desktop, spider) are given for each article.

The evaluation metric is symmetric mean absolute percentage error (SMAPE).
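As a rough illustration of the metric, here is a minimal plain-Python sketch of SMAPE. The 0-to-2 scale and the handling of points where both values are zero are assumptions made for this example; the competition's official scorer may scale or treat those cases differently.

```python
def smape(y_true, y_pred):
    """Mean of 2*|y - y_hat| / (|y| + |y_hat|) over all points (range 0 to 2)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        denom = abs(y) + abs(y_hat)
        # Assumed convention: a point where both values are zero contributes 0.
        total += 0.0 if denom == 0 else 2.0 * abs(y - y_hat) / denom
    return total / len(y_true)


# Hypothetical daily view counts and forecasts for one article.
print(smape([100, 200, 0], [110, 180, 0]))
```

Because the error is divided by the sum of the magnitudes, an article with 10 daily views and one with 10,000 contribute on the same scale.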

The Approach

A single neural network was used to model all 145k time series. The model architecture is similar to WaveNet, consisting of a stack of dilated causal convolutions, as demonstrated in the diagram below.

A few modifications were made to adapt the model to generate coherent predictions for the entire forecast horizon (64 days). WaveNet was trained using next step prediction, so errors can accumulate as the model generates long sequences in the absence of conditioning information. To remedy this, we trained the model to minimize the loss when unraveled for 64 steps. We adopt a sequence to sequence approach where the encoder and decoder do not share parameters. This allows the decoder to handle the accumulating noise when generating long sequences.

Below are some sample forecasts to demonstrate some of the patterns that the network can capture. The forecasted values are in yellow, and the ground truth values (not used in training or validation) are shown in grey. The y-axis is log transformed.

Requirements

12 GB GPU (recommended), Python 2.7

Python packages:

numpy==1.13.1
pandas==0.19.2
scikit-learn==0.18.1
tensorflow==1.3.0


web-traffic-forecasting's Issues

Should shift be incremented by 1?

    if causal:
        shift = int((convolution_width / 2) + (int(dilation_rate[0] - 1) / 2))
        pad = tf.zeros([tf.shape(inputs)[0], shift, inputs.shape.as_list()[2]])
        inputs = tf.concat([pad, inputs], axis=1)

I think shift may need to be incremented by 1.

Code not running: TensorFlow gather_nd bounds problem

I was trying to get this code running on my local system and ran into this error:

Traceback (most recent call last):
File "cnn.py", line 414, in <module>
nn.fit()
File "/Users/srikanthjammy/Documents/midterm/tf_base_model.py", line 142, in fit
feed_dict=val_feed_dict
File "/Users/srikanthjammy/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/Users/srikanthjammy/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/Users/srikanthjammy/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/Users/srikanthjammy/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: flat indices[15493, :] = [121, -1] does not index into param (shape: [128,486,32]).
[[Node: GatherNd_23 = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](add_24, stack_23)]]

Caused by op u'GatherNd_23', defined at:
File "cnn.py", line 412, in <module>
num_decode_steps=64,
File "cnn.py", line 121, in __init__
super(cnn, self).__init__(**kwargs)
File "/Users/srikanthjammy/Documents/midterm/tf_base_model.py", line 99, in __init__
self.graph = self.build_graph()
File "/Users/srikanthjammy/Documents/midterm/tf_base_model.py", line 344, in build_graph
self.loss = self.calculate_loss()
File "cnn.py", line 366, in calculate_loss
y_hat_decode = self.decode(y_hat_encode, conv_inputs, features=self.decode_features)
File "cnn.py", line 265, in decode
slices = tf.reshape(tf.gather_nd(conv_input, idx), (batch_size, dilation, shape(conv_input, 2)))
File "/Users/srikanthjammy/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1338, in gather_nd
name=name)
File "/Users/srikanthjammy/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)

After searching online, it looks like this is a TensorFlow bug: tensorflow/tensorflow#12608
Did you face this issue?
I'm running on CPU, by the way.
My versions of numpy, pandas, scikit-learn and tensorflow match the ones you listed.

How do separate parameters handle the accumulating noise?

WaveNet was trained using next step prediction, so errors can accumulate as the model generates long sequences in the absence of conditioning information. To remedy this, we trained the model to minimize the loss when unraveled for 64 steps. We adopt a sequence to sequence approach where the encoder and decoder do not share parameters. This allows the decoder to handle the accumulating noise when generating long sequences.

The passage above says that with separate parameters the accumulating noise is not a big issue. But as I understand it, the encoder still accumulates noise and then passes it on to the decoder. I may be missing something in the picture. Can you please tell us more about it?

Data folder is empty

The data folder does not contain the train and test datasets or a processed folder, and the training data from Kaggle comes as train_1 and train_2. How can we use these?

What is the hierarchy of the codes/files in this repo?

Hi,
Can anybody help me figure out the order in which to run the code in this repo? I cannot work out the hierarchy of the files well enough to run them step by step and reproduce the results.
Thanks

padding seems wrong

In the function temporal_convolution_layer:
shift = (kernel_size // 2) + (int(dilation_rate - 1) // 2)

In Keras and some other implementations, the equation is:
shift = dilation_rate * (kernel_size - 1)

If the shift here is wrong, the model may be using future information.
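To make the padding question concrete, here is a minimal plain-Python sketch of a causal dilated convolution that left-pads by dilation_rate * (kernel_size - 1), the formula quoted above. It is not the repository's temporal_convolution_layer; the function name and the toy kernel are assumptions chosen for illustration.

```python
def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t - d], x[t - 2d], ..."""
    k = len(kernel)
    pad = dilation * (k - 1)          # left padding only, so no future value is read
    padded = [0.0] * pad + list(x)
    out = []
    for t in range(len(x)):
        # The last tap lands on padded[t + pad], which is x[t]; earlier taps reach back.
        out.append(sum(kernel[j] * padded[t + j * dilation] for j in range(k)))
    return out


x = [float(v) for v in range(10)]
y = causal_dilated_conv(x, kernel=[1.0, 1.0], dilation=4)

# Causality check: changing x[7] must leave outputs before index 7 untouched.
x_changed = list(x)
x_changed[7] = 100.0
y_changed = causal_dilated_conv(x_changed, kernel=[1.0, 1.0], dilation=4)
assert y[:7] == y_changed[:7]
```

A check of this kind (perturb a future input, confirm earlier outputs are unchanged) is one way to test whether a given shift leaks future information.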

cnn.py line260 queue_begin_time = self.encode_len - dilation - 1

I think this line should be self.encode_len - dilation. For example, with the sequence [0,1,2,3,4,5,6,7,8,9] and dilation=4, idx = 10 - 4 = 6, so
slices = tf.reshape(tf.gather_nd(conv_input, idx), (batch_size, dilation, shape(conv_input, 2)))

should be [6,7,8,9] (the last dilation elements of the sequence). Otherwise you will lose the last day's value.
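The index arithmetic in this report can be checked with a short plain-Python sketch. The helper last_window and its extra_offset parameter are hypothetical names for this illustration. Whether the extra -1 in cnn.py is intentional (for example, if the final encoder value is fed to the decoder separately) cannot be determined from the excerpt quoted here.

```python
def last_window(seq, dilation, extra_offset=0):
    """Return `dilation` consecutive items starting at len(seq) - dilation - extra_offset."""
    start = len(seq) - dilation - extra_offset
    return seq[start:start + dilation]


seq = list(range(10))
print(last_window(seq, dilation=4))                  # [6, 7, 8, 9]
print(last_window(seq, dilation=4, extra_offset=1))  # [5, 6, 7, 8]
```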

During training, train loss and validation loss became nan, does this matter?

When I ran cnn.py, the train loss and validation loss became NaN after step 50. Is this normal? I wonder why the losses remain NaN.

Decode Features

In the decode features, why are we passing the one hot encoded values of the categorical variables?

        self.decode_features = tf.concat([
            tf.one_hot(decode_idx, self.num_decode_steps),
            tf.tile(tf.reshape(self.log_x_encode_mean, (-1, 1, 1)), (1, self.num_decode_steps, 1)),
            tf.tile(tf.expand_dims(tf.one_hot(self.project, 9), 1), (1, self.num_decode_steps, 1)),
            tf.tile(tf.expand_dims(tf.one_hot(self.access, 3), 1), (1, self.num_decode_steps, 1)),
            tf.tile(tf.expand_dims(tf.one_hot(self.agent, 2), 1), (1, self.num_decode_steps, 1)),
        ], axis=2)
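For readers trying to picture the tensor built above, here is a plain-Python sketch of the same layout for a single series. It assumes decode_idx enumerates the steps 0 to num_decode_steps - 1, which the excerpt does not show; the function names are made up for this example.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def decode_features(num_decode_steps, log_mean, project, access, agent):
    # Series-level values are identical at every step, mirroring tf.tile above.
    static = [log_mean] + one_hot(project, 9) + one_hot(access, 3) + one_hot(agent, 2)
    # Assumption: the step position is one-hot encoded over the forecast horizon.
    return [one_hot(step, num_decode_steps) + static for step in range(num_decode_steps)]


features = decode_features(64, log_mean=3.2, project=1, access=0, agent=1)
print(len(features), len(features[0]))  # 64 79
```

Each of the 64 rows has 64 + 1 + 9 + 3 + 2 = 79 entries: the step position, the log mean of the encoded series, and the three categorical attributes repeated unchanged at every step.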

sequence smape loss function

zero_loss = 2.0*tf.ones_like(smape)
nonzero_loss = smape
smape = tf.where(tf.logical_or(tf.equal(y, 0.0), tf.equal(y_hat, 0.0)), zero_loss, nonzero_loss)

This uses an 'or' condition. What if y != 0.0 and y_hat == 0.0? The sequence SMAPE will still return zero_loss (2.0).

I think it should be an 'and' condition.
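A quick plain-Python check may help here. It assumes the unmasked value is computed as 2*|y - y_hat| / (|y| + |y_hat|), which the excerpt above does not show. Under that assumption, a point with exactly one zero already evaluates to 2.0, so the 'or' and 'and' masks would agree everywhere the formula is defined.

```python
def pointwise_smape(y, y_hat, combine):
    """SMAPE for one point; `combine` is `any` (the 'or' mask) or `all` (the 'and' mask)."""
    if combine([y == 0.0, y_hat == 0.0]):
        return 2.0  # plays the role of zero_loss in the snippet above
    return 2.0 * abs(y - y_hat) / (abs(y) + abs(y_hat))


for y in [0.0, 1.0, 5.0]:
    for y_hat in [0.0, 2.0, 5.0]:
        assert pointwise_smape(y, y_hat, any) == pointwise_smape(y, y_hat, all)

print(pointwise_smape(5.0, 0.0, all))  # 2.0
```

The only input where the raw formula is undefined is y == y_hat == 0 (a 0/0 division), and both masks replace that case with 2.0.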

Two errors occurred while running cnn.py

With Anaconda Python 3.6:
1.
File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\tensor_shape.py", line 32, in __init__
self._value = int(value)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'

2.
File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got 1.0 of type 'float' instead.

line 342 in cnn.py

(next_finished, emit_output, state_queues) = loop_fn(time, initial_input, state_queues)
This code calls loop_fn with initial_input, so I think the input is never updated across loop iterations. Can you explain this for me?

Always uses initial_input for loop_fn

Hi,

Thanks so much for sharing your excellent work. I am confused by the decode part:

web-traffic-forecasting/cnn.py

Lines 342 to 349 in 6cb4a91

def body(time, elements_finished, emit_ta, *state_queues):
    (next_finished, emit_output, state_queues) = loop_fn(time, initial_input, state_queues)

    emit = tf.where(elements_finished, tf.zeros_like(emit_output), emit_output)
    emit_ta = emit_ta.write(time, emit)

    elements_finished = tf.logical_or(elements_finished, next_finished)
    return [time + 1, elements_finished, emit_ta] + list(state_queues)

In line 343, loop_fn always takes initial_input as its current_input parameter.

I wonder why we don't pass the previous prediction to loop_fn instead, something like:

def body(time, elements_finished, emit_ta, *state_queues):
    # tf.cond needs a boolean tensor and two callables.
    current_input = tf.cond(tf.equal(time, 0), lambda: initial_input, lambda: emit_ta.read(time - 1))
    (next_finished, emit_output, state_queues) = loop_fn(time, current_input, state_queues)
    ...
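The difference being asked about can be shown with a toy plain-Python rollout. This is a sketch with a stateless, made-up step function. In the repository, loop_fn also receives state_queues, which may carry information between steps, so this example does not settle how the actual decoder behaves.

```python
def rollout(step_fn, initial_input, num_steps, feed_previous=True):
    """Run step_fn num_steps times, optionally feeding each output back as the next input."""
    outputs = []
    current = initial_input
    for _ in range(num_steps):
        out = step_fn(current)
        outputs.append(out)
        if feed_previous:
            current = out  # autoregressive: the next step sees this prediction
    return outputs


def double(value):
    """Hypothetical one-step model used only for this illustration."""
    return 2 * value


print(rollout(double, 1, 4))                       # [2, 4, 8, 16]
print(rollout(double, 1, 4, feed_previous=False))  # [2, 2, 2, 2]
```

With feed_previous=False every step sees the same input, so a stateless step function produces the same output each time. Any variation would have to come from state carried outside the input argument.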
