Udacity Self-Driving Car Engineer Nanodegree: MiniFlow.
Recommended reading:
- Partial derivatives
- Gradients
- Yes you should understand backprop
- Vector, Matrix, and Tensor Derivatives
def forward(self):
    x_value = self.inbound_nodes[0].value
    y_value = self.inbound_nodes[1].value
    self.value = x_value + y_value
Linear algebra nicely reflects the idea of transforming values between layers in a graph.
def forward(self):
    inputs = self.inbound_nodes[0].value
    weights = self.inbound_nodes[1].value
    bias = self.inbound_nodes[2].value
    self.value = bias
    for x, w in zip(inputs, weights):
        self.value += x * w
def forward(self):
    X = self.inbound_nodes[0].value
    W = self.inbound_nodes[1].value
    b = self.inbound_nodes[2].value
    self.value = np.dot(X, W) + b
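A quick shape check of the vectorized forward pass (the sizes and values below are illustrative, not from the lesson):

import numpy as np

X = np.array([[1., 2.],
              [3., 4.]])              # 2 examples, 2 features
W = np.array([[0.5, -1.0,  2.0],
              [1.0,  0.0, -0.5]])     # 2 features -> 3 hidden units
b = np.array([0.1, 0.2, 0.3])         # one bias per hidden unit

Z = np.dot(X, W) + b                  # shape (2, 3): one row of activations per example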
def _sigmoid(self, x):
    return 1. / (1. + np.exp(-x))  # the `.` ensures that `1` is a float

def forward(self):
    input_value = self.inbound_nodes[0].value
    self.value = self._sigmoid(input_value)
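For later reference, the derivative of the sigmoid can be written in terms of its own output, which is why the backward pass shown further down reuses self.value:

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,(1 - \sigma(x))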
def forward(self):
    """
    Calculates the mean squared error.
    """
    # NOTE: We reshape these to avoid possible matrix/vector broadcast
    # errors.
    #
    # For example, if we subtract an array of shape (3,) from an array of
    # shape (3,1) we get an array of shape (3,3) as the result when we want
    # an array of shape (3,1) instead.
    #
    # Making both arrays (3,1) ensures the result is (3,1) and does
    # an elementwise subtraction as expected.
    y = self.inbound_nodes[0].value.reshape(-1, 1)
    a = self.inbound_nodes[1].value.reshape(-1, 1)
    m = self.inbound_nodes[0].value.shape[0]  # number of examples
    diff = y - a
    self.value = np.mean(diff**2)
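In equation form, the forward pass above computes

C(y, a) = \frac{1}{m} \sum_{i=1}^{m} (y_i - a_i)^2

where m is the number of examples.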
Empirically, learning rates in the range 0.1 to 0.0001 tend to work well. The range 0.001 to 0.0001 is popular, as 0.1 and 0.01 are sometimes too large.
def gradient_descent_update(x, gradx, learning_rate):
    # Step in the direction of the negative gradient, scaled by the learning rate.
    x = x - learning_rate * gradx
    return x
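A quick sanity check of the update (the target function and numbers are illustrative, not from the lesson): minimizing f(x) = x^2, whose gradient is 2x, should pull x toward 0.

x = 10.0
for _ in range(5):
    x = gradient_descent_update(x, gradx=2 * x, learning_rate=0.1)
    # x after each step: 8.0, 6.4, 5.12, 4.096, 3.2768 -- shrinking toward the minimum at 0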
The whole network is a composition of functions: MSE(Linear(Sigmoid(Linear(X, W1, b1)), W2, b2), y).
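Backpropagation applies the chain rule to this composition. For example, using the node names from the script below (l1, s1, l2, and the cost C), the gradient of the cost with respect to W1 factors into the local derivatives along the path:

\frac{\partial C}{\partial W_1} = \frac{\partial C}{\partial l_2} \cdot \frac{\partial l_2}{\partial s_1} \cdot \frac{\partial s_1}{\partial l_1} \cdot \frac{\partial l_1}{\partial W_1}

Each node only needs its local derivative and the partial handed to it by its outbound nodes, which is exactly what the backward methods compute.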
class Sigmoid(Node):
    def backward(self):
        # Initialize the gradients to 0.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        # Cycle through the outputs. The gradient will change depending
        # on each output, so the gradients are summed over all outputs.
        for n in self.outbound_nodes:
            # Get the partial of the cost with respect to this node.
            grad_cost = n.gradients[self]
            sigmoid = self.value
            # d(sigmoid)/dx = sigmoid * (1 - sigmoid), applied via the chain rule.
            self.gradients[self.inbound_nodes[0]] += sigmoid * (1 - sigmoid) * grad_cost
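The Linear node's backward pass is not shown in these notes; the following is a sketch (an assumption, not the lesson's code) that follows the same pattern, with the inbound order X, W, b used in forward and numpy imported as np:

class Linear(Node):
    def backward(self):
        # Sketch (assumption): initialize a partial for each inbound node.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            # Partial of the cost with respect to this node's output Z = XW + b.
            grad_cost = n.gradients[self]
            X = self.inbound_nodes[0].value
            W = self.inbound_nodes[1].value
            # dC/dX = dC/dZ . W^T
            self.gradients[self.inbound_nodes[0]] += np.dot(grad_cost, W.T)
            # dC/dW = X^T . dC/dZ
            self.gradients[self.inbound_nodes[1]] += np.dot(X.T, grad_cost)
            # dC/db: sum dC/dZ over the examples (rows)
            self.gradients[self.inbound_nodes[2]] += np.sum(grad_cost, axis=0)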
A naive implementation of SGD involves:
1. Randomly sample a batch of data from the total dataset.
2. Run the network forward and backward to calculate the gradient (with the data from step 1).
3. Apply the gradient descent update.
4. Repeat steps 1-3 until convergence, or until the loop is stopped by another mechanism (e.g. a fixed number of epochs).
epochs = 10
# Total number of examples
m = X_.shape[0]
batch_size = 11
# Number of batches drawn per epoch
steps_per_epoch = m // batch_size

graph = topological_sort(feed_dict)
trainables = [W1, b1, W2, b2]
The loop implementing steps 1-4 is shown in the complete training script below.
Inside sgd_update (step 3): first, the partial of the cost (C) with respect to the trainable t is accessed; second, the value of the trainable is updated.
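A minimal sketch of sgd_update consistent with that description (the signature and the default learning rate are assumptions, not taken from the lesson):

def sgd_update(trainables, learning_rate=1e-2):
    for t in trainables:
        # 1. Access the partial of the cost with respect to this trainable.
        partial = t.gradients[t]
        # 2. Update the trainable's value by stepping against the gradient.
        t.value -= learning_rate * partial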
Create a neural network.
X, y = Input(), Input()
W1, b1 = Input(), Input()
W2, b2 = Input(), Input()
l1 = Linear(X, W1, b1)
s1 = Sigmoid(l1)
l2 = Linear(s1, W2, b2)
cost = MSE(y, l2)
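The training script also assumes the Input nodes have been given initial values via a feed_dict (used by topological_sort above). A sketch of that setup; the hidden size and the random initialization are assumptions, and X_ / y_ are the feature and target arrays loaded elsewhere:

n_features = X_.shape[1]
n_hidden = 10  # assumption: size of the hidden layer

W1_ = np.random.randn(n_features, n_hidden)
b1_ = np.zeros(n_hidden)
W2_ = np.random.randn(n_hidden, 1)
b2_ = np.zeros(1)

feed_dict = {
    X: X_, y: y_,
    W1: W1_, b1: b1_,
    W2: W2_, b2: b2_,
}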
Train the network.
# Step 4
for i in range(epochs):
    loss = 0
    for j in range(steps_per_epoch):
        # Step 1
        # Randomly sample a batch of examples
        X_batch, y_batch = resample(X_, y_, n_samples=batch_size)

        # Reset value of X and y Inputs
        X.value = X_batch
        y.value = y_batch

        # Step 2
        forward_and_backward(graph)

        # Step 3
        sgd_update(trainables)

        loss += graph[-1].value

    print("Epoch: {}, Loss: {:.3f}".format(i+1, loss/steps_per_epoch))