Logistic Regression is an algorithm for Binary Classification.
Consider we have a 64×64 image that we want to classify as cat or not-cat.
The image is stored as matrices of pixel intensity values, one matrix per color channel, with each matrix having the same dimensions as the image.
To represent the image we can unroll all the pixel values into a single feature vector:
So $n_x$ (the input size) becomes $64 \times 64 \times 3 = 12288$.
$$x = \begin{bmatrix} 255 \\ 231 \\ \vdots \\ 255 \\ 134 \\ \vdots \\ 255 \\ 134 \\ 93 \end{bmatrix}$$
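A minimal NumPy sketch of this unrolling (using a randomly generated array as a stand-in for a real image):

```python
import numpy as np

# A hypothetical 64x64 RGB image: three 64x64 channels of pixel intensities.
image = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll (flatten) all pixel values into a single feature (column) vector x.
x = image.reshape(-1, 1)
print(x.shape)   # (12288, 1) == (64*64*3, 1)
```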
Notations:
Logistic Regression
In formal terms, for the cat classification problem:
Given $x$, we want $\hat{y} = P(y = 1 \mid x)$,
where,
$x \in \mathbb{R}^{n_x}$
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
Output or Activation function
Output: $\hat{y} = \sigma(w^\top x + b)$
where,
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^\top x + b$$
Hence, this satisfies the two conditions:
If $z$ is a large positive number:
$$\sigma(z) \approx \frac{1}{1 + 0} = 1$$
If $z$ is a large negative number, then the exponential term becomes a very big number:
$$\sigma(z) \approx \frac{1}{1 + \text{big number}} \approx 0$$
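A quick NumPy sketch of the sigmoid that checks these two limiting cases (the inputs ±10 are arbitrary illustration values):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-z))

print(sigmoid(10))    # ~0.99995  -> close to 1 for large positive z
print(sigmoid(-10))   # ~0.000045 -> close to 0 for large negative z
print(sigmoid(0))     # 0.5
```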
Warning
In an alternate notation you might see (as in the Hands-On Machine Learning book) $\hat{y} = \sigma(x^\top \theta)$ instead of $\sigma(w^\top x + b)$. In this notation $\theta_0$ represents the bias term and the rest of the $\theta$ vector contains the weights.
But in this course we will not be using this convention.
Logistic Regression Cost Function
The loss function measures the error on a single training example; the cost function is the loss averaged over all training instances. We want both to be as small as possible.
For a single training example, we want the loss:
$$\mathcal{L}(\hat{y}, y) = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$
That means:
If the label is 1, we want $-\log(\hat{y})$ to be as small as possible → want $\log(\hat{y})$ to be as large as possible → want $\hat{y}$ to be large (i.e. ≈ 1).
If the label is 0, we want $-\log(1 - \hat{y})$ to be as small as possible → want $\log(1 - \hat{y})$ to be large → want $\hat{y}$ to be small (i.e. ≈ 0).
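A small numerical sketch of these two cases (the predictions 0.99 and 0.01 are made-up illustration values):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True label 1: a prediction close to 1 gives a small loss.
print(loss(0.99, 1), loss(0.01, 1))   # ~0.01 vs ~4.6
# True label 0: a prediction close to 0 gives a small loss.
print(loss(0.01, 0), loss(0.99, 0))   # ~0.01 vs ~4.6
```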
Gradient Descent
Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively to minimize a cost function.
The derivative gives the slope of the function, i.e. how much the function changes when the input changes by a small amount.
On the left side of the minimum the derivative (slope) is negative, so the update increases the weights.
On the right side the derivative (slope) is positive, so the update decreases the weights.
So we will update the weights and bias like this:
$$w = w - \alpha \frac{\partial J(w, b)}{\partial w}$$
$$b = b - \alpha \frac{\partial J(w, b)}{\partial b}$$
Tip
Remember $d$ is used for an ordinary derivative, whereas $\partial$ denotes a partial derivative, used when the function we want to differentiate has multiple variables (the other variables are treated as constants while taking the derivative).
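As a minimal sketch of these updates, assuming a toy cost $J(w, b) = (w - 3)^2 + (b + 1)^2$ whose partial derivatives we can write by hand (the cost and learning rate are illustration choices, not from the course):

```python
w, b = 0.0, 0.0
alpha = 0.1   # learning rate (hypothetical value)

for _ in range(100):
    dJ_dw = 2 * (w - 3)      # partial derivative of J with respect to w
    dJ_db = 2 * (b + 1)      # partial derivative of J with respect to b
    w = w - alpha * dJ_dw    # gradient descent update for w
    b = b - alpha * dJ_db    # gradient descent update for b

print(w, b)   # converges towards the minimum at w = 3, b = -1
```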
Derivatives
Computation Graph
A computational graph is defined as a directed graph where the nodes correspond to mathematical operations. Computational graphs are a way of expressing and evaluating a mathematical expression.
The computation graph explains why the computation of a neural network is organised as a forward pass followed by a backward pass in the Backpropagation algorithm.
Reverse-mode autodiff performs a forward pass through a computation graph, computing every node's value for the current training batch, and then it performs a reverse pass, computing all the gradients at once.
Let's say we have a function
J(a,b,c)=3(a+bc)
In this we have 3 steps (represented by nodes) to compute. These are:
u=bc
v=a+u
J=3v
Computing with a computation Graph
Let's say we want to compute the derivative of $J$ with respect to $v$: $\frac{dJ}{dv} = ?$
i.e. for a little change in v how the value of J changes?
We know from our calculus class, the derivative should be:
$$\frac{dJ}{dv} = 3$$
Then we want to compute the derivative of J with respect to a. We can use chain rule which says:
If $a$ affects $v$, and $v$ affects $J$ ($a \to v \to J$), then the amount that $J$ changes when you nudge $a$ is the product of how much $v$ changes when you nudge $a$ times how much $J$ changes when you nudge $v$.
$$\frac{dv}{da} = 1, \qquad \frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = 3 \times 1 = 3$$
This illustrates how computing $\frac{dJ}{dv}$ lets us compute $\frac{dJ}{da}$.
In the code we will use dvar to represent the derivative of the final output variable with respect to any variable var.
Let's do it for other variables.
What is the derivative of $J$ with respect to $u$?
$$\frac{dJ}{du} = \frac{dJ}{dv}\frac{dv}{du} = 3 \times 1 = 3$$
And then the derivative of $J$ with respect to $b$: since $u = bc$, $\frac{du}{db} = c = 2$ (using the example value $c = 2$), so
$$\frac{dJ}{db} = \frac{dJ}{du}\frac{du}{db} = 3 \times 2 = 6$$
The derivative of $J$ with respect to $c$: since $\frac{du}{dc} = b = 3$,
$$\frac{dJ}{dc} = \frac{dJ}{du}\frac{du}{dc} = 3 \times 3 = 9$$
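Putting the forward and backward passes together in a short sketch ($b = 3$ and $c = 2$ are the example values implied by the derivatives above; $a = 5$ is an assumed value and does not affect the derivatives):

```python
a, b, c = 5, 3, 2

# Forward pass: compute the value of every node in the graph.
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33

# Backward pass: apply the chain rule node by node, from J back to the inputs.
dJ_dv = 3
dJ_da = dJ_dv * 1     # dv/da = 1
dJ_du = dJ_dv * 1     # dv/du = 1
dJ_db = dJ_du * c     # du/db = c = 2  -> 6
dJ_dc = dJ_du * b     # du/dc = b = 3  -> 9

print(J, dJ_da, dJ_db, dJ_dc)   # 33 3 6 9
```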
Gradient Descent for Logistic Regression
Logistic regression recap
$$z = w^\top x + b$$
$$\hat{y} = \hat{p} = a = \sigma(z)$$
$$\mathcal{L}(a, y) = -[y \log(a) + (1 - y)\log(1 - a)]$$
Let's say we have only two features; then we will have two weights and a bias.
The computation graph
Since $\frac{d\mathcal{L}}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$ and $\frac{da}{dz} = a(1 - a)$, the chain rule gives $dz = \frac{d\mathcal{L}}{dz} = a - y$. Hence, we use $dz = (a - y)$ when calculating the derivative of the loss for a single instance with respect to $w_1$:
$$\frac{d\mathcal{L}}{dw_1} = \frac{d\mathcal{L}}{da}\frac{da}{dz}\frac{dz}{dw_1} = (a - y)\,x_1 = x_1\frac{d\mathcal{L}}{dz}$$
and with respect to $w_2$:
$$\frac{d\mathcal{L}}{dw_2} = \frac{d\mathcal{L}}{da}\frac{da}{dz}\frac{dz}{dw_2} = (a - y)\,x_2 = x_2\frac{d\mathcal{L}}{dz}$$
For the bias it is simply $dz$ itself:
$$\frac{d\mathcal{L}}{db} = \frac{d\mathcal{L}}{dz}$$
This makes the updates like this:
$$w_1 = w_1 - \alpha\, dw_1, \qquad w_2 = w_2 - \alpha\, dw_2, \qquad b = b - \alpha\, db$$
where $dw_1$, $dw_2$, $db$, as said earlier, represent the derivatives of the final output variable (the loss function) with respect to the respective variables.
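A minimal sketch of one gradient step for a single training example with two features (all numeric values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1          # features and label (illustrative values)
w1, w2, b = 0.1, -0.2, 0.0       # current parameters
alpha = 0.01                     # learning rate

# Forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# Backward pass (chain rule as derived above)
dz = a - y        # dL/dz
dw1 = x1 * dz     # dL/dw1
dw2 = x2 * dz     # dL/dw2
db = dz           # dL/db

# Parameter update
w1 -= alpha * dw1
w2 -= alpha * dw2
b  -= alpha * db
```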
Gradient Descent on m Examples
For $m$ training examples, the cost function is the loss averaged over the $i$-th instances:
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$
where $\hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^\top x^{(i)} + b)$ is the prediction for one training example.
So the derivative of the cost function with respect to the first weight becomes:
$$\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$
Which as we have seen will then be:
$$\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m} dw_1^{(i)}, \qquad \frac{\partial J}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m} dw_2^{(i)}$$
Wrap up in algorithm (what we can do)
Assuming we have only two features. The single step of gradient descent will look like:
J = 0; dw1 = 0; dw2 = 0; db = 0
For i = 1 to m:
    z(i) = wᵀx(i) + b
    ŷ(i) = σ(z(i))
    J += −[y(i) log ŷ(i) + (1 − y(i)) log(1 − ŷ(i))]
    dz(i) = ŷ(i) − y(i)
    dw1 += x1(i) dz(i)      (here dw1 is used as an accumulator)
    dw2 += x2(i) dz(i)      (here dw2 is used as an accumulator)
    db  += dz(i)
J /= m                      (averaging the cost over m training examples)
dw1 /= m; dw2 /= m; db /= m
w1 = w1 − α dw1
w2 = w2 − α dw2
b = b − α db
We have a problem here: we require two for-loops:
One for iterating over the m training examples.
Another for iterating over the features, calculating the derivative per feature.
Vectorization
Whenever possible, avoid explicit for-loops. Use vectorization.
z=np.dot(w,x)+b
Vectorization relies on parallel instructions called SIMD (Single Instruction, Multiple Data), which can be found in both CPUs and GPUs, but GPUs are better at them.
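A rough timing sketch of the difference (array sizes and the exact speed-up will vary by machine):

```python
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Explicit for-loop version of the dot product.
tic = time.time()
c = 0.0
for i in range(len(a)):
    c += a[i] * b[i]
print("for-loop:", time.time() - tic, "seconds")

# Vectorized version.
tic = time.time()
c = np.dot(a, b)
print("np.dot:  ", time.time() - tic, "seconds")   # typically far faster
```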
IMPORTANT
Andrew refers to $dz = a(1 - a)$.
Note that Andrew is using "dz" as a shorthand for $\frac{da}{dz} = a(1 - a)$.
Earlier in this week's videos, Andrew used the name "dz" to refer to a different derivative: $\frac{d\mathcal{L}}{dz} = a - y$.
Note: here $a$ is $\hat{y}$.
Vectorizing Logistic Regression
We can make the prediction or the forward propagation step like this:
Step 1
$$Z = [z^{(1)}, z^{(2)}, \dots, z^{(m)}] = w^\top X + [b, b, \dots, b]$$
where,
$X$ represents the $(n_x, m)$-dimensional feature matrix,
$[b, b, \dots, b]$ is the bias row vector with dimensions $(1, m)$.
In Python
Z = np.dot(w.T, X) + b   # here b is a real number which is broadcast
Step 2
$$\hat{Y} = [\hat{y}^{(1)}, \hat{y}^{(2)}, \dots, \hat{y}^{(m)}] = \sigma(Z)$$
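A small sketch of this vectorized forward pass (the shapes $n_x = 3$ and $m = 5$ are arbitrary illustration values):

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

n_x, m = 3, 5
X = np.random.randn(n_x, m)    # feature matrix, shape (n_x, m)
w = np.random.randn(n_x, 1)    # weights, shape (n_x, 1)
b = 0.0                        # bias, broadcast over all m columns

Z = np.dot(w.T, X) + b         # shape (1, m)
Y_hat = sigmoid(Z)             # shape (1, m), one prediction per example
print(Y_hat.shape)             # (1, 5)
```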
Vectorizing Logistic Regression's Gradient Output
Vectorizing the calculation of "dz" ($\frac{d\mathcal{L}}{dz}$)
We want: $dz^{(1)} = \hat{y}^{(1)} - y^{(1)}$, $dz^{(2)} = \hat{y}^{(2)} - y^{(2)}$, ...
$$dZ = [dz^{(1)}, dz^{(2)}, \dots, dz^{(m)}]$$
Then:
$$\hat{Y} = [\hat{y}^{(1)}, \dots, \hat{y}^{(m)}], \qquad Y = [y^{(1)}, \dots, y^{(m)}]$$
$$dZ = \hat{Y} - Y$$
Vectorizing the updates of the weights and bias
$$dw = \frac{1}{m} X\, dZ^\top$$
This creates an $(n_x, 1)$-dimensional vector with one element per feature, where the $j$-th element is $\frac{1}{m}\sum_{i=1}^{m} x_j^{(i)} dz^{(i)}$.
dw = (1/m) * np.dot(X, dZ.T)
$$db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}$$
which in Python is done using:
db = (1/m) * np.sum(dZ)
Implementing Logistic Regression
$Z = w^\top X + b$ = np.dot(w.T, X) + b
$\hat{Y} = \sigma(Z)$
$dZ = \hat{Y} - Y$
$dw = \frac{1}{m} X\, dZ^\top$
$db = \frac{1}{m}$ np.sum(dZ)
$w = w - \alpha\, dw$
$b = b - \alpha\, db$
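Putting all of these steps together, here is a sketch of a vectorized training loop (the function name, toy data, and hyperparameter values are illustrative, not the course's reference implementation):

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def train_logistic_regression(X, Y, alpha=0.01, num_iterations=1000):
    """X has shape (n_x, m); Y has shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        # Forward pass
        Z = np.dot(w.T, X) + b            # (1, m)
        Y_hat = sigmoid(Z)                # (1, m)
        # Backward pass
        dZ = Y_hat - Y                    # (1, m)
        dw = (1 / m) * np.dot(X, dZ.T)    # (n_x, 1)
        db = (1 / m) * np.sum(dZ)         # scalar
        # Parameter update
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Usage with made-up toy data: label is 1 when the two features sum to > 0.
X = np.random.randn(2, 100)
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)
w, b = train_logistic_regression(X, Y)
```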
General Principle of Broadcasting
If we have an $(m, n)$ matrix and perform an element-wise operation with a $(1, n)$ or $(m, 1)$ matrix, the smaller matrix is converted to an $(m, n)$ matrix by copying it along the missing dimension.
If we have an $(m, 1)$ or $(1, m)$ matrix and perform an operation with a real number $R$, then $R$ gets converted to an $(m, 1)$ or $(1, m)$ matrix by copying $R$.
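A short sketch demonstrating both broadcasting rules (the matrices below are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)

row = np.array([[10, 20, 30]])       # shape (1, 3): copied down the rows
col = np.array([[100], [200]])       # shape (2, 1): copied across the columns

print(A + row)   # [[11 22 33], [14 25 36]]
print(A + col)   # [[101 102 103], [204 205 206]]
print(A + 1)     # the real number 1 is broadcast to every element
```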
Vectors like a (a one-dimensional array, e.g. the output of np.random.randn(5)) are called rank-1 arrays in Python/NumPy. Such an array is neither a row vector nor a column vector, which leads to some slightly non-intuitive effects such as:
print(a.T)
>>> [ 0.502 -0.296  0.954 -0.821 -1.462]
print(np.dot(a, a.T))   # should probably be a matrix (outer) product, but returns a scalar
>>> 4.06570109321
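A common way to avoid these effects (a general NumPy practice, sketched here with assumed shapes) is to use explicit column or row vectors, reshaping and asserting shapes where needed:

```python
import numpy as np

a = np.random.randn(5)         # rank-1 array, shape (5,)  -- avoid this
print(a.shape)                 # (5,)

a = np.random.randn(5, 1)      # explicit column vector, shape (5, 1)
print(a.T.shape)               # (1, 5) -- transpose now behaves as expected
print(np.dot(a, a.T).shape)    # (5, 5) outer product, as intended

# Alternatively, reshape an existing rank-1 array and assert its shape:
b = np.random.randn(5).reshape(5, 1)
assert b.shape == (5, 1)
```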
For a matrix $x \in \mathbb{R}^{m \times n}$, $x_{ij}$ maps to the element in the $i$-th row and $j$-th column of $x$; thus we have:
$$\text{softmax}(x) = \text{softmax}\begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \dots & x_{mn} \end{bmatrix} = \begin{bmatrix} \frac{e^{x_{11}}}{\sum_j e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_j e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_j e^{x_{1j}}} & \dots & \frac{e^{x_{1n}}}{\sum_j e^{x_{1j}}} \\ \frac{e^{x_{21}}}{\sum_j e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_j e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_j e^{x_{2j}}} & \dots & \frac{e^{x_{2n}}}{\sum_j e^{x_{2j}}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{e^{x_{m1}}}{\sum_j e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_j e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_j e^{x_{mj}}} & \dots & \frac{e^{x_{mn}}}{\sum_j e^{x_{mj}}} \end{bmatrix} = \begin{bmatrix} \text{softmax(first row of } x\text{)} \\ \text{softmax(second row of } x\text{)} \\ \vdots \\ \text{softmax(last row of } x\text{)} \end{bmatrix}$$
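A NumPy sketch of this row-wise softmax (the sample matrix values are illustrative):

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for a matrix x of shape (m, n), as in the formula above."""
    x_exp = np.exp(x)                             # element-wise exponentials, (m, n)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)  # row sums, shape (m, 1)
    return x_exp / x_sum                          # broadcasting divides each row by its sum

x = np.array([[9, 2, 5, 0, 0],
              [7, 5, 0, 0, 0]])
print(softmax(x).sum(axis=1))   # each row sums to 1
```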
The L1 loss is defined as:
$$L_1(\hat{y}, y) = \sum_{i=0}^{m} |y^{(i)} - \hat{y}^{(i)}| \tag{1}$$
The L2 loss is defined as:
$$L_2(\hat{y}, y) = \sum_{i=0}^{m} (y^{(i)} - \hat{y}^{(i)})^2 \tag{2}$$
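A NumPy sketch of both losses (the sample vectors are made-up illustration values):

```python
import numpy as np

def L1(y_hat, y):
    """L1 loss: sum of absolute differences."""
    return np.sum(np.abs(y - y_hat))

def L2(y_hat, y):
    """L2 loss: sum of squared differences."""
    return np.sum((y - y_hat) ** 2)

y_hat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y     = np.array([1,   0,   0,   1,   1])
print(L1(y_hat, y))   # 1.1
print(L2(y_hat, y))   # 0.43
```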
A trick when you want to flatten a matrix X of shape (a, b, c, d) into a matrix X_flatten of shape (b*c*d, a) is to use:
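```python
import numpy as np

# Sketch: keep the first axis (a) and collapse the rest into b*c*d, then transpose.
X = np.random.randn(10, 64, 64, 3)       # e.g. 10 images of shape 64x64x3 (assumed example shape)
X_flatten = X.reshape(X.shape[0], -1).T
print(X_flatten.shape)                   # (12288, 10), i.e. (b*c*d, a)
```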