
Week 2 — Basics of Neural Network programming

May 1, 2021

Logistic Regression as a Neural Network

  • Logistic Regression is an algorithm for Binary Classification.

Suppose we have a 64×64 image that we want to classify as cat or not-cat.

  • The image is stored as one matrix of pixel intensity values per color channel, with each matrix's dimensions equal to the image's dimensions.

    To represent the image, we unroll all the pixel values into a single feature vector:

So n_x (input size) becomes 64 × 64 × 3 = 12288.

x = \begin{bmatrix} 255 \\ 231 \\ \vdots \\ 255 \\ 134 \\ \vdots \\ 255 \\ 134 \\ 93 \end{bmatrix}
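As a quick sketch, this unrolling can be done with NumPy's reshape (the random image below is a stand-in for a real photo):

import numpy as np

# A hypothetical 64x64 RGB image: one 64x64 matrix of intensities per color channel.
img = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll all pixel intensities into a single (n_x, 1) feature vector.
x = img.reshape(64 * 64 * 3, 1)
print(x.shape)  # (12288, 1)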


Logistic Regression

In formal terms, for the cat classification problem:

Given x, we want \hat{y} = P(y=1 \mid x).

where,

  • x \in \mathbb{R}^{n_x}

Parameters: w \in \mathbb{R}^{n_x}, b \in \mathbb{R}

Output or Activation function

Output: \hat{y} = \sigma(w^\intercal x + b)

where,

\sigma(z) = \frac{1}{1 + e^{-z}} \\[10pt] z = w^\intercal x + b
Figure: the exponential function y = e^x (source: Wikipedia).

Hence, the sigmoid satisfies the two limiting conditions:

If z is a large positive number, then e^{-z} \approx 0, so:

\sigma(z) \approx \frac{1}{1+0} = 1

If z is a large negative number, then e^{-z} becomes a very big number, so:

\sigma(z) \approx \frac{1}{1 + \text{big number}} \approx 0
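A minimal numerical check of these two limits (the sigmoid helper below is an assumed name, not from the course code):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

print(sigmoid(10))    # ~0.9999546 -> approaches 1 for large positive z
print(sigmoid(-10))   # ~0.0000454 -> approaches 0 for large negative z
print(sigmoid(0))     # 0.5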

Warning

In an alternate notation you might see (for example in the Hands-On Machine Learning book) \hat{y} = \sigma(x^\intercal\theta) instead of \sigma(w^\intercal x + b). In this notation \theta_0 represents the bias and the rest of the \theta vector contains the weights.

But in this course we will not be using this convention.

Logistic Regression Cost Function

  • The loss function measures the error on a single training example; the cost function is the loss averaged over all m training examples. It is what we optimize and want to be as small as possible.

J(w, b) = J(\theta) = -\frac{1}{m}\sum^m_{i=1}\bigg[y^{(i)}\log\Big(\hat{y}^{(i)}\Big) + \Big(1-y^{(i)}\Big)\log\Big(1-\hat{y}^{(i)}\Big)\bigg]

  • For a single training example, the loss is:
L(\hat{y}, y) = \begin{cases}-\log(\hat{y}) & \text{if}\enspace y=1\\ -\log(1-\hat{y}) & \text{if}\enspace y=0\end{cases}
  • That means:
  • If the label is 1, we want -\log(\hat{y}) to be as small as possible \rightarrow want \log(\hat{y}) to be as large as possible \rightarrow want \hat{y} to be large (i.e. \approx 1).
  • If the label is 0, we want -\log(1-\hat{y}) to be as small as possible \rightarrow want \log(1-\hat{y}) to be as large as possible \rightarrow want \hat{y} to be small (i.e. \approx 0).

Gradient Descent

May 2, 2021

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea is to tweak the parameters iteratively to minimize the cost function.

We have a cost function for logistic regression.

J(w, b) = J(\theta) = -\frac{1}{m}\sum^m_{i=1}\bigg[y^{(i)}\log\Big(\hat{y}^{(i)}\Big) + \Big(1-y^{(i)}\Big)\log\Big(1-\hat{y}^{(i)}\Big)\bigg]

where,

  • \alpha is a hyperparameter for the learning rate.
  • The derivative gives the slope of the function, i.e. how much the function changes when the variable changes by a small amount.
  • On the left side of the minimum the derivative (slope) is negative, so the update increases the weights.
  • On the right side of the minimum the derivative (slope) is positive, so the update decreases the weights.

  • So we will update the weights and bias like this:

w = w - \alpha\frac{\partial J(w,b)}{\partial w}

b = b - \alpha\frac{\partial J(w,b)}{\partial b}

Tip

Remember that d is used for an ordinary derivative, whereas \partial denotes a partial derivative, used when the function we want to differentiate has multiple other variables (which are treated as constants while taking the derivative).

Derivatives

Computation Graph

A computational graph is defined as a directed graph where the nodes correspond to mathematical operations. Computational graphs are a way of expressing and evaluating a mathematical expression.

  • The computation graph explains why the computation of a neural network is organized as a forward pass followed by a backward pass in the backpropagation algorithm.
  • Reverse-mode autodiff performs a forward pass through a computation graph, computing every node's value for the current training batch, and then it performs a reverse pass, computing all the gradients at once.
  • Let's say we have a function

J(a, b, c) = 3(a + bc)

  • Computing this takes 3 steps, each represented by a node:

u = bc

v = a + u

J = 3v

Figure: the blue arrows represent the forward pass and the red arrows the backward pass.

Computing with a computation Graph

  • Let's say we want to compute the derivative of J with respect to v: \frac{dJ}{dv} = ?
  • i.e. for a little change in v, how does the value of J change?
  • We know from calculus that the derivative is:

\frac{dJ}{dv} = \bold{3}

  • Then we want to compute the derivative of J with respect to a. We can use the chain rule, which says:

If a affects v, which affects J (a \rightarrow v \rightarrow J), then the amount that J changes when you nudge a is the product of how much v changes when you nudge a times how much J changes when you nudge v.

\frac{dv}{da} = \bold{1} \\[10pt] \frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = 3 \times 1 = \bold{3}

  • This illustrates how computing \frac{dJ}{dv} lets us compute \frac{dJ}{da}.
  • In code we will use dvar to represent the derivative of the final output variable with respect to any variable var.
  • Let's do it for the other variables.
  • What is the derivative of J with respect to u?

\frac{dJ}{du} = \frac{dJ}{dv}\frac{dv}{du} = 3 \times 1 = \bold{3}

  • And then the derivative of J with respect to b:

\frac{du}{db} = 2

\frac{dJ}{db} = \frac{dJ}{du}\frac{du}{db} = 3 \times 2 = \bold{6}

  • The derivative of J with respect to c:

\frac{dJ}{dc} = \frac{dJ}{du}\frac{du}{dc} = 3 \times 3 = \bold{9}
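The whole forward and backward pass can be checked in a few lines of Python. The input values a = 5, b = 3, c = 2 are assumptions, chosen to be consistent with the derivatives above (du/db = c = 2 and du/dc = b = 3):

# Forward pass: evaluate each node of J(a, b, c) = 3(a + b*c).
a, b, c = 5, 3, 2
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# Backward pass: apply the chain rule from J back to the inputs.
dJ_dv = 3
dJ_da = dJ_dv * 1        # dv/da = 1
dJ_du = dJ_dv * 1        # dv/du = 1
dJ_db = dJ_du * c        # du/db = c = 2
dJ_dc = dJ_du * b        # du/dc = b = 3
print(dJ_da, dJ_db, dJ_dc)  # 3 6 9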

Gradient Descent for Logistic Regression

  • Logistic regression recap

z = w^\intercal x + b

\hat{y} = \hat{p} = \sigma(z)

L(\hat{y}, y) = -\Big[y\log(\hat{y}) + (1-y)\log(1-\hat{y})\Big]

  • Let's say we have only two features; then we will have two weights and a bias.

Figure: the computation graph for logistic regression. The figure uses a for the prediction \hat{y}.

Hence, we use dz = (a - y) when calculating the derivative of the loss for a single instance with respect to w_1:

\frac{dL}{dw_1} = \frac{dL}{da}\frac{da}{dz}\frac{dz}{dw_1} = (a-y)(x_1) = \bold{x_1\frac{dL}{dz}}

and with respect to w_2:

\frac{dL}{dw_2} = \frac{dL}{da}\frac{da}{dz}\frac{dz}{dw_2} = (a-y)(x_2) = \bold{x_2\frac{dL}{dz}}

and for the bias it is simply:

\frac{dL}{db} = \bold{\frac{dL}{dz}}

This gives the following updates:

w_1 = w_1 - \alpha\, dw_1 \\ w_2 = w_2 - \alpha\, dw_2 \\ b = b - \alpha\, db

  • where dw_1, dw_2, db, as said earlier, represent the derivatives of the final output variable (the loss function) with respect to the respective variables.
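A sketch of one gradient step on a single two-feature example; all concrete values (features, label, initialization, learning rate) are assumed for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed toy values: one training example with two features.
x1, x2, y = 1.0, 2.0, 1
w1, w2, b, alpha = 0.0, 0.0, 0.0, 0.1

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)                 # a is the prediction y-hat

# Backward pass: dz = dL/dz = a - y, then the chain rule per parameter.
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Parameter update.
w1 -= alpha * dw1
w2 -= alpha * dw2
b  -= alpha * db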

Gradient Descent on m Examples

The cost function for m training examples is the loss averaged over the i^\text{th} instances:

J(w, b) = J(\theta) = -\frac{1}{m}\sum^m_{i=1}\bigg[y^{(i)}\log\Big(\hat{y}^{(i)}\Big) + \Big(1-y^{(i)}\Big)\log\Big(1-\hat{y}^{(i)}\Big)\bigg]

  • where \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^\intercal x^{(i)} + b) is the prediction on one training example.

So the derivative of the cost function with respect to the first weight becomes:

\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum^m_{i=1} \frac{\partial}{\partial w_1}L(\hat{y}^{(i)}, y^{(i)})

Which, as we have seen, will then be:

\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum^m_{i=1} dw_1^{(i)} \qquad\qquad \frac{\partial J}{\partial w_2} = \frac{1}{m}\sum^m_{i=1} dw_2^{(i)}

Wrapping it up in an algorithm (what we can do)

Assuming we have only two features, a single step of gradient descent will look like:

J=0,\; dw_1=0,\; dw_2=0,\; db=0\\ \text{For}\enspace i=1\enspace\text{to}\enspace m:

\qquad z^{(i)} = w^\intercal x^{(i)} + b\\ \qquad \hat{y}^{(i)} = \sigma(z^{(i)})\\ \qquad J \enspace+= -\Big[y^{(i)}\log\hat{y}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{y}^{(i)}\big)\Big]\\ \qquad dz^{(i)} = \hat{y}^{(i)} - y^{(i)}\\ \qquad dw_1\enspace += x_1^{(i)}dz^{(i)}\enspace\fcolorbox{red}{white}{here $dw_1$ is used as accumulator}\\ \qquad dw_2\enspace += x_2^{(i)}dz^{(i)}\enspace\fcolorbox{red}{white}{here $dw_2$ is used as accumulator}\\ \qquad db\enspace += dz^{(i)}

J \enspace/= m \qquad\fcolorbox{red}{white}{averaging cost function over $m$ training examples}\\ dw_1 \enspace/= m;\; dw_2 \enspace/= m;\; db \enspace/= m\\ w_1 = w_1 - \alpha\, dw_1\\ w_2 = w_2 - \alpha\, dw_2\\ b = b - \alpha\, db

  • We have a problem here in that we require two for-loops (a direct Python translation is sketched below):
  • one for iterating over the m training examples,
  • another for iterating over the feature set, calculating the derivative per feature.
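A direct Python translation of the loop-based algorithm above might look like this (a sketch; the function name is mine, and X, Y are assumed to hold m two-feature examples):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step_loops(X, Y, w1, w2, b, alpha):
    """One gradient-descent step; X has shape (2, m), Y has shape (m,)."""
    m = X.shape[1]
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):                       # loop 1: over training examples
        z = w1 * X[0, i] + w2 * X[1, i] + b
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        dw1 += X[0, i] * dz                  # accumulators; with n_x features
        dw2 += X[1, i] * dz                  # this would be loop 2, over features
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db, J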

Vectorization

Whenever possible, avoid explicit for-loops. Use vectorization.

z = np.dot(w, x) + b  # one vectorized call instead of an explicit loop over features

Vectorization relies on parallel SIMD (Single Instruction, Multiple Data) instructions, which can be found in both CPUs and GPUs, though GPUs are better at it.

And that's how we eliminate the for-loop that iterated over the features to calculate and update the derivatives.
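A small timing sketch of the difference (absolute numbers depend on your hardware, but the vectorized version is typically orders of magnitude faster):

import time
import numpy as np

a = np.random.randn(1000000)
b = np.random.randn(1000000)

tic = time.time()
c = np.dot(a, b)                     # vectorized: one SIMD-friendly call
print("vectorized:", time.time() - tic, "s")

tic = time.time()
c = 0.0
for i in range(1000000):             # explicit loop: dramatically slower
    c += a[i] * b[i]
print("for loop:  ", time.time() - tic, "s")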

IMPORTANT

Andrew refers to dz = a(1-a).

Note that Andrew is using "dz" as a shorthand for \frac{da}{dz} = a(1-a).

Earlier in this week's videos, Andrew used the name "dz" to refer to a different derivative: \frac{dL}{dz} = a - y.

Note: here a is \hat{y}.

Vectorizing Logistic Regression

We can compute the predictions, i.e. the forward-propagation step, like this:

Step 1

Z = [z^{(1)}, z^{(2)}, ..., z^{(m)}] = w^\intercal X + [b, b, ..., b]

where,

  • X represents the (n_x, m)-dimensional feature matrix.
  • [b, b, ..., b] is the bias matrix with dimensions (1, m).

In Python

Z = np.dot(w.T, X) + b # here b is a real number which is broadcasted

Step 2

\hat{Y} = [\hat{y}^{(1)}, \hat{y}^{(2)}, ..., \hat{y}^{(m)}] = \sigma(Z)

Vectorizing Logistic Regression's Gradient Output

vectorizing the calculation of "dz" (\frac{dL}{dz})

We want: dz^{(1)} = \hat{y}^{(1)} - y^{(1)},\; dz^{(2)} = \hat{y}^{(2)} - y^{(2)},\; ...

dZ = [dz^{(1)}, dz^{(2)}, ..., dz^{(m)}]

Then:

\hat{Y} = [\hat{y}^{(1)}, ..., \hat{y}^{(m)}] \qquad Y = [y^{(1)}, ..., y^{(m)}]

dZ = \hat{Y} - Y

vectorizing updates of weights and bias

dw = \frac{1}{m}X\,dZ^\intercal

This creates an (n_x, 1)-dimensional vector whose elements are the averaged derivatives dw_1 through dw_{n_x}, where n_x is the number of features.

dw = (1/m)*np.dot(X, dZ.T)

db = \frac{1}{m}\sum^m_{i=1} dz^{(i)}

which in Python is done using:

db = (1/m)*np.sum(dZ)

Implementing Logistic Regression

Z = w^\intercal X + b = \text{np.dot(w.T, X) + b}

\hat{Y} = \sigma(Z)

dZ = \hat{Y} - Y

dw = \frac{1}{m}X\,dZ^\intercal

db = \frac{1}{m}\,\text{np.sum(dZ)}

w = w - \alpha\, dw

b = b - \alpha\, db
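Putting these lines together as a runnable sketch; the function name, initialization, and hyperparameter defaults are assumptions, not course-provided code:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.01, num_iterations=1000):
    """X: (n_x, m) feature matrix, Y: (1, m) labels. Returns learned w, b."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        Z = np.dot(w.T, X) + b          # forward pass
        Y_hat = sigmoid(Z)
        dZ = Y_hat - Y                  # backward pass
        dw = (1 / m) * np.dot(X, dZ.T)
        db = (1 / m) * np.sum(dZ)
        w = w - alpha * dw              # parameter update
        b = b - alpha * db
    return w, b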

General Principle of Broadcasting

If we have an (m, n) matrix and we want to perform an operation with a (1, n) or (m, 1) matrix, the smaller matrix will be converted to an (m, n) matrix by copying.

If we have an (m, 1) or (1, m) matrix and we want to perform an operation with a real number R, the number gets converted to an (m, 1) or (1, m) matrix by copying R.
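A couple of NumPy examples of these two rules:

import numpy as np

A = np.arange(6).reshape(2, 3)      # (m, n) = (2, 3)

# A (1, n) row is copied down the m rows.
print(A + np.array([[10, 20, 30]]))

# An (m, 1) column is copied across the n columns.
print(A + np.array([[100], [200]]))

# A real number is broadcast against any shape.
print(np.array([[1, 2, 3]]) + 5)    # [[6 7 8]]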

A note on python/NumPy vectors

import numpy as np

a = np.random.randn(5)
print(a)
>>> [0.502, -0.296, 0.954, -0.821, -1.462]

print(a.shape)
>>>(5,)

Vectors like a are called rank-1 arrays in Python. A rank-1 array is neither a row vector nor a column vector, which leads to some slightly non-intuitive effects, such as:

print(a.T)
>>> [0.502, -0.296, 0.954, -0.821, -1.462]

print(np.dot(a, a.T))  # inner product, not the matrix product you might expect
>>> 4.06570109321

Instead do this:

a = np.random.randn(5, 1)
print(a)
>>> [[-0.0967]
     [-2.3861]
     [-0.9123]
     [ 0.1231]
     [ 0.9111]]
assert(a.shape == (5, 1))

If you get a rank 1 array, just reshape it.
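For example:

import numpy as np

a = np.random.randn(5)     # rank-1 array, shape (5,)
a = a.reshape((5, 1))      # now an explicit column vector
assert a.shape == (5, 1)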

Some important Points

  • Softmax function

\text{for a matrix } x \in \mathbb{R}^{m \times n} \text{, $x_{ij}$ maps to the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have:}

softmax(x) = softmax\begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \dots & x_{mn} \end{bmatrix} \\[10pt] = \begin{bmatrix} \frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\ \frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}} \end{bmatrix} \\[10pt] = \begin{pmatrix} softmax\text{(first row of $x$)} \\ softmax\text{(second row of $x$)} \\ \vdots \\ softmax\text{(last row of $x$)} \end{pmatrix}
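A row-wise NumPy sketch of this formula (the function name and sample matrix are mine):

import numpy as np

def softmax(x):
    """Row-wise softmax of an (m, n) matrix x."""
    x_exp = np.exp(x)                             # element-wise exponentials
    x_sum = np.sum(x_exp, axis=1, keepdims=True)  # (m, 1) row sums
    return x_exp / x_sum                          # broadcasting divides each row

x = np.array([[9, 2, 5, 0, 0],
              [7, 5, 0, 0, 0]])
print(softmax(x))  # each row sums to 1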
  • L1 loss is defined as:

\begin{align*} & L_1(\hat{y}, y) = \sum_{i=0}^m|y^{(i)} - \hat{y}^{(i)}| \end{align*}\tag{1}

  • L2 loss is defined as:

\begin{align*} & L_2(\hat{y}, y) = \sum_{i=0}^m(y^{(i)} - \hat{y}^{(i)})^2 \end{align*}\tag{2}
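Both losses vectorize to one line each; the sample vectors below are assumed for illustration:

import numpy as np

def l1_loss(yhat, y):
    return np.sum(np.abs(y - yhat))

def l2_loss(yhat, y):
    return np.sum((y - yhat) ** 2)   # equivalently np.dot(y - yhat, y - yhat)

yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y = np.array([1, 0, 0, 1, 1])
print(l1_loss(yhat, y))  # 1.1
print(l2_loss(yhat, y))  # 0.43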

A trick when you want to flatten a matrix X of shape (a, b, c, d) to a matrix X_flatten of shape (b*c*d, a) is to use:

X_flatten = X.reshape(X.shape[0], -1).T  # reshape to (a, b*c*d), then transpose to (b*c*d, a)