What you need to remember:
• np.exp(x) works for any np.array x and applies the exponential function to every coordinate
• the sigmoid function and its gradient
• image2vector is commonly used in deep learning
• np.reshape is widely used. In the future, you’ll see that keeping your matrix/vector dimensions straight will go toward eliminating a lot of bugs.
• numpy has efficient built-in functions
• broadcasting is extremely useful
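For instance, a minimal numpy sketch of the sigmoid and its gradient (function names follow the assignment; np.exp broadcasts over every element):

```python
import numpy as np

def sigmoid(x):
    # np.exp applies element-wise, so this works for scalars and arrays alike
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # d sigmoid / dx = s * (1 - s), also computed element-wise
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([0.0, 2.0])
print(sigmoid(x))             # sigmoid(0) is exactly 0.5
print(sigmoid_derivative(x))  # the derivative at 0 is 0.25
```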
What to remember:
• Vectorization is very important in deep learning. It provides computational efficiency and clarity.
• You have reviewed the L1 and L2 loss.
• You are familiar with many numpy functions such as np.sum, np.dot, np.multiply, np.maximum, etc.
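As a sketch, the L1 and L2 losses can be vectorized with those functions (yhat and y are hypothetical prediction/label vectors):

```python
import numpy as np

def L1(yhat, y):
    # L1 loss: sum of absolute differences between labels and predictions
    return np.sum(np.abs(y - yhat))

def L2(yhat, y):
    # L2 loss: sum of squared differences; np.dot computes it without a loop
    return np.dot(y - yhat, y - yhat)

yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y = np.array([1, 0, 0, 1, 1])
print(L1(yhat, y))  # ≈ 1.1
print(L2(yhat, y))  # ≈ 0.43
```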
What you need to remember:
Common steps for pre-processing a new dataset are:
• Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, …)
• Reshape the datasets such that each example is now a vector of size (num_px * num_px * 3, 1)
• Standardize the data
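These steps might look as follows on a hypothetical image dataset (array names and sizes are made up for illustration):

```python
import numpy as np

# Hypothetical raw dataset: m_train examples of num_px x num_px RGB images
m_train, num_px = 10, 4
train_set_x_orig = np.random.randint(0, 256, (m_train, num_px, num_px, 3))

# Reshape: flatten each image into a column vector of size (num_px * num_px * 3, 1)
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T

# Standardize: pixel values lie in [0, 255], so divide by 255
train_set_x = train_set_x_flatten / 255.

print(train_set_x.shape)  # (48, 10), i.e. (num_px * num_px * 3, m_train)
```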
What to remember: You’ve implemented several functions that:
• Initialize (w,b)
• Optimize the loss iteratively to learn parameters (w,b):
– computing the cost and its gradient
– updating the parameters using gradient descent
• Use the learned (w,b) to predict the labels for a given set of examples
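A minimal sketch of those three pieces for logistic regression (simplified relative to the assignment; the toy dataset at the end is made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def initialize(dim):
    # Zeros are fine for logistic regression: a single unit has no symmetry to break
    return np.zeros((dim, 1)), 0.0

def propagate(w, b, X, Y):
    # X has shape (features, m), Y has shape (1, m)
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                       # activations, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = np.dot(X, (A - Y).T) / m                         # gradient of cost w.r.t. w
    db = np.sum(A - Y) / m                                # gradient of cost w.r.t. b
    return dw, db, cost

def optimize(w, b, X, Y, num_iterations, learning_rate):
    for _ in range(num_iterations):
        dw, db, _ = propagate(w, b, X, Y)
        w = w - learning_rate * dw                        # gradient descent update
        b = b - learning_rate * db
    return w, b

def predict(w, b, X):
    # Threshold the activations at 0.5 to produce 0/1 labels
    return (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)

# Toy 1-feature dataset: positive x -> label 1
X = np.array([[1., -1., 2., -2.]])
Y = np.array([[1, 0, 1, 0]])
w, b = initialize(X.shape[0])
w, b = optimize(w, b, X, Y, num_iterations=200, learning_rate=0.5)
print(predict(w, b, X))  # [[1 0 1 0]]
```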
What to remember from this assignment:
- Preprocessing the dataset is important.
- You implemented each function separately: initialize(), propagate(), optimize(). Then you built a model().
- Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm.
What you should remember:
• The weights W[l] should be initialized randomly to break symmetry.
• It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly.
• Initializing weights to very large random values does not work well.
• Initializing with small random values does better.
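A sketch of this scheme (random small W[l], zero b[l]; the 0.01 scale is one common choice, not the only one):

```python
import numpy as np

def initialize_parameters_random(layer_dims):
    # Weights: small random values (scaled by 0.01) to break symmetry.
    # Biases: zeros are fine, since random W already breaks symmetry.
    np.random.seed(1)  # seed only for reproducibility of this sketch
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters_random([3, 4, 1])
print(params["W1"].shape)  # (4, 3)
```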
What you should remember from this assignment:
• Different initializations lead to different results
• Random initialization is used to break symmetry and make sure different hidden units can learn different things
• Don’t initialize to values that are too large
• He initialization works well for networks with ReLU activations.
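He initialization can be sketched as follows (layer_dims is a hypothetical list of layer sizes; the key point is the sqrt(2 / fan_in) scale, which keeps activation variance stable under ReLU):

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    # He initialization: standard normal weights scaled by sqrt(2 / fan_in)
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * np.sqrt(2.0 / fan_in)
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params_he = initialize_parameters_he([2, 4, 1])
print(params_he["W1"].shape)  # (4, 2)
```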
Gradient Descent => Stochastic Gradient Descent (SGD) => Mini-batch Gradient Descent
SGD is equivalent to mini-batch gradient descent where each mini-batch has just 1 example.
What you should remember:
• The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
• You have to tune a learning rate hyperparameter α.
• With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
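The distinction shows up purely in the loop structure. A toy least-squares sketch (all data, sizes, and names here are made up for illustration):

```python
import numpy as np

# Toy objective: fit w so that w * x ≈ y, with y = 3 * x
np.random.seed(0)
m = 8
x = np.random.randn(m)
y = 3.0 * x

def grad(w, xs, ys):
    # Gradient of mean squared error over the given subset of examples
    return 2 * np.mean(xs * (w * xs - ys))

w, lr = 0.0, 0.1

# (Batch) gradient descent: one update uses all m examples
w -= lr * grad(w, x, y)

# Stochastic gradient descent: one update per single example
for i in range(m):
    w -= lr * grad(w, x[i:i + 1], y[i:i + 1])

# Mini-batch gradient descent: one update per batch of, e.g., 4 examples
for k in range(0, m, 4):
    w -= lr * grad(w, x[k:k + 4], y[k:k + 4])

print(w)  # close to the true slope 3.0
```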
What you should remember:
• Shuffling and Partitioning are the two steps required to build mini-batches
• Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
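The two steps can be sketched as follows (a simplified version of the assignment's random_mini_batches; shapes follow the usual (features, m) convention):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # X has shape (features, m), Y has shape (1, m)
    np.random.seed(seed)
    m = X.shape[1]

    # Step 1: shuffle X and Y with the same column permutation
    permutation = np.random.permutation(m)
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]

    # Step 2: partition into consecutive slices of mini_batch_size
    # (the last mini-batch is smaller when m % mini_batch_size != 0)
    mini_batches = []
    for k in range(0, m, mini_batch_size):
        mini_batches.append((shuffled_X[:, k:k + mini_batch_size],
                             shuffled_Y[:, k:k + mini_batch_size]))
    return mini_batches
```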