### Learning Goals

Be able to implement and use Logistic Regression. Understand why it works and what it is doing.

### Concept Check

Q: Are Friday’s and next week’s lectures going to cover material on the final?

A1:  Yes! Material is cumulative. But, as you can imagine, we would have different expectations for the material you have covered on psets.

Q: In order to find each theta_i parameter, do we use the MAP or MLE of training data?

A1:  Adding to the answer below: this optimization algorithm is doing Maximum Likelihood Estimation.

A2:  We actually use an optimization algorithm called gradient descent, which Chris will talk about next.

Q: how do you determine the theta 0?

A1:  Theta0 is determined the same way all of the other thetas are determined (and Chris will speak about that very, very soon). But to simplify the math, we invent a new input feature called x0 for all samples in the training set, and we constrain this invented x0 to always be 1.
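The x0 = 1 trick can be sketched in a few lines of Python. This is only an illustration: the helper names `add_intercept_feature` and `linear_score` and the numbers are made up here, not taken from the lecture.

```python
def add_intercept_feature(x):
    """Return a copy of the feature list x with the invented x0 = 1 prepended."""
    return [1.0] + list(x)


def linear_score(theta, x):
    """Compute z = theta_0*x_0 + theta_1*x_1 + ... as one dot product."""
    return sum(t * xi for t, xi in zip(theta, x))


theta = [0.5, 2.0, -1.0]               # theta_0 (intercept), theta_1, theta_2
x = add_intercept_feature([3.0, 4.0])  # becomes [1.0, 3.0, 4.0]
z = linear_score(theta, x)             # 0.5*1 + 2.0*3 + (-1.0)*4 = 2.5
```

Because x0 is always 1, theta_0 needs no special case: it is just one more term in the dot product.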

Q: Can we replace the sigmoid with other functions in the logistic family? It seems similar to changing the model you might use to predict a random variable in something like MLE

A1:  The drawback of the sigmoid is that its slope is very close to 0 in most places, and that sometimes interferes with the gradient descent algorithm that Chris is about to talk about.

A2:  Absolutely, and many times we do. I’ve seen hyperbolic tangents used as well.
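To make the slope remark in A1 concrete, here is a small sketch that evaluates the derivative of the sigmoid, σ(z)(1 − σ(z)), and of the hyperbolic tangent, 1 − tanh²(z). The function names are chosen for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_slope(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


slope_at_center = sigmoid_slope(0.0)   # the sigmoid's steepest point, 0.25
slope_far_out = sigmoid_slope(10.0)    # far from 0 the slope is tiny
tanh_at_center = tanh_slope(0.0)       # tanh is steeper at 0 (slope 1.0)
```

Both curves flatten out for large |z|, so swapping one for the other changes the steepness near zero but does not remove the flat tails.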

Q: wait what did chris just say about log likelihood and likelihood?

A1:  He just reminded us that whatever values of theta optimize the likelihood function also optimize the log-likelihood function.
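A quick way to convince yourself of this is to try it on a tiny problem. The sketch below uses made-up coin-flip data (7 heads, 3 tails) and a grid of candidate values for p; it is not the logistic regression likelihood, just the simplest case where the same argument applies, since log is an increasing function.

```python
import math

heads, tails = 7, 3                              # made-up data for illustration
candidates = [i / 100 for i in range(1, 100)]    # p = 0.01, 0.02, ..., 0.99


def likelihood(p):
    return p ** heads * (1 - p) ** tails


def log_likelihood(p):
    return heads * math.log(p) + tails * math.log(1 - p)


# The same candidate wins under both criteria.
best_by_likelihood = max(candidates, key=likelihood)
best_by_log_likelihood = max(candidates, key=log_likelihood)
```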

Q: Do we do parameters one at a time (rotating through) or all at once at each step?

A1:  Python libraries that do this for us manage all of the updates in parallel, but any code we write ourselves will just update them one at a time.
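Here is one way a hand-written version might look, as a sketch rather than the course's reference code. It loops over the thetas one at a time, but it computes every partial derivative from the same current theta before changing any of them. The toy data, the step size, and the name `gradient_step` are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(theta, xs, ys, step_size):
    """One gradient descent step on the negative log-likelihood."""
    # First compute every partial derivative using the current theta ...
    gradient = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j in range(len(theta)):
            gradient[j] += -(y - p) * x[j]   # d(-LL) / d(theta_j)
    # ... and only then move all of the thetas.
    return [t - step_size * g for t, g in zip(theta, gradient)]


# Toy data (made up); x[0] is the constant x0 = 1.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [0, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(1000):
    theta = gradient_step(theta, xs, ys, step_size=0.1)
```

After training, `sigmoid` of the dot product of `theta` with a sample gives the predicted probability that y = 1 for that sample.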

Q: On the last slide it seemed like we used two negatives that canceled out. What am I missing? (with - negative LL)

A1:  When Chris substituted dLoss with d NegativeLL, the “Negative” of “NegativeLL” was assumed to represent the negative sign that canceled the visible one in order to go to the next line.

Q: do we assume that thetas/parameters are independent?

A1:  Actually, you don’t have to assume independence at all… gradient descent works without assuming anything about independence.

Q: Why would we ever use Naive Bayes if we have logistic regression (naive bayes makes a mathematically wrong assumption, so logistic regression seems much better)? is it because naive bayes is faster?

A1:  It’s a combination of easier and faster, yes. It’s easier in the sense that it’s really just counting, and it’s faster because it typically only requires O(m) passes over the data samples, where m is the number of input features. Logistic regression, because it involves gradient descent, will generally take longer (and it’s more complicated to implement if you’re implementing it yourself).

Q: what does it mean for theta_0 to be an "intercept"

A1:  y = Theta1*x1 + Theta0 is the new y = mx + b, where b is the y-intercept that’s been named Theta0 instead. The intercept is the value of y when x is (or, more generally, all x’s are) zero.

Q: is dataset likelihood given by z or sigma(z) after squashing?

A1:  σ(z)

Q: so will one of the numbers in summation (step 2) always be 0 because y is always 0 or 1?

A1:  Chris is saying all of this again right now :)

A2:  yes… that’s exactly right
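A small sketch of why that happens, assuming the summation in question is the log-likelihood, whose per-sample terms are y·log(p) and (1 − y)·log(1 − p) with p = σ(z). The value p = 0.8 is made up for illustration.

```python
import math


def log_likelihood_terms(y, p):
    """Return the two summands y*log(p) and (1-y)*log(1-p) for one sample."""
    return y * math.log(p), (1 - y) * math.log(1 - p)


terms_y1 = log_likelihood_terms(1, 0.8)   # second summand is 0, first is log(0.8)
terms_y0 = log_likelihood_terms(0, 0.8)   # first summand is 0, second is log(0.2)
```

So for each sample, the label y acts as a switch that keeps exactly one of the two terms.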

Q: is the x_j at the end scalar multiplication or a dot product?

A1:  scalar multiplication, since x_j is a single number (i.e. a single 0 or 1 in the jth position of the vector x)
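The two kinds of multiplication can be seen side by side in a sketch of the partial derivative for one sample, assuming the form (y − σ(θᵀx))·x_j. The dot product happens inside the sigmoid; the trailing x_j is an ordinary number. The function name and sample values below are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_derivative(theta, x, y, j):
    """d LL / d theta_j for one sample: (y - sigmoid(theta . x)) * x[j]."""
    z = sum(t * xi for t, xi in zip(theta, x))   # dot product: two vectors in, one number out
    return (y - sigmoid(z)) * x[j]               # a number times a number


theta = [0.0, 0.0, 0.0]
x = [1.0, 0.0, 1.0]      # x[0] is the constant x0 = 1
y = 1
partials = [partial_derivative(theta, x, y, j) for j in range(len(theta))]
```

With all thetas at 0, σ(z) is 0.5, so each partial is 0.5·x_j: features that are 0 contribute nothing to their own theta's update.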