July 24, 2017

Lectures 6-7 (part) of Andrew Ng's mathier ML course

Ng lecture 6 and 7: multinomial models, transition to SVM.

Lecture 6

Generalizing Naive Bayes to multinomial models

What if it isn't just "spam/not-spam?" What if it's "spam/not-spam-and-from-a-friend/not-spam-andfrom-by-boss?" The model comes out the same way: the probability of x given y is just a multinomial rather than a bernoilli. You can also discretize continuous features into buckets and then use multinomial naive bayes.

Multinomial Event model for naive bayes

(warning, his notation changes here.)

another variation on naive bayes especially for other text documents or other sequences. Takes into account the number of times different words appears. If some word appears a lot of times ("buy"), it's more likely to be spam than if it appears once, but the version of naive bayes he did before just has binaries.

Represents each document as a feature vector $(x^{(i)}_1, x^{(i)}_2... x^{(i)}_n)$ where n is the number of words in that document, and maps to the dictionary. So if the dictionary has 1000 words, then the components of the feature vector for each document is an index into our dictionary for the word that appears in that picture.

So then we're calculating p(x|y) as the product of the probability of seeing the word we see in each position of the document. See page 13 of notes 2.

Skipping the gnarly sums, the maximum likelihood estimator for a given word p(xi|y=1) is another simple proportion, namely: the numerator is "count the total number of appearances of word k in documents with classlabel 1. The denominator is "sum the lengths of all the e-mails in classlabel 1". This is a translation of the maximum likelihood estimates on pg. 14 of the notes. In those estimates, remember that j...n indexes positions in a document and m indexes documents.

So essentially instead of fitting naive bayes over documents, we're fitting naive bayes over positions in documents.

And for laplace smoothing, we add 1 to the numerator and k to the denominator, where k is the number of unique words. (also in the notes as |V|).

(Interlude: Ng actually stopped in the middle to lay out maximum likelihood estimation in response to a student question (at minute 20 of the video). The short version id that the likelihood function is a function of the parameters, and is the product of the probability of each x and y, paramaterized by the aforementioned parameters; then maximize the log.)

Multinomial event model almost always does better than the first version of naive bayes for text classification.

Nonlinear classifiers

Let's think of taking a simple algorithm like logistic regression and go out to more complex classifiers.

How would we get a nonlinear decision boundary out of our logistic regression. Suppose we have a function with a bunch of parameters and weights on them, then applies a sigmoid function to generate a hypothesis.

To get a nonlinear decision boundary, one thing we could do would be to create intermediate results—create multiple hypotheses from the combinations of our features and weights, and then has a second (etc.) set of weights on those hypotheses, which get combined by another sigmoid function. Then you've turned your logistic regression into a neutral net.

As before, we do gradient descent to minimize the squared error, our labels - hypothesis. And backpropogation is the name of the algorithm that carries out that gradient descent.

Support vector machines

We're actually going to start off with a linear version before going to the nonlinear bit.

"Two intuitions about classification":

First intuition. Logistic regression is an algorithm that computes $\theta^Tx$ and predicts 1 if that product greater than zero, or - if it's less than zero. If the distance away from zero is large, that's a very confident prediction, you're on the tails of the sigmoid function.

"Wouldn't it be nice," he says, if for our training set, if y=1 we'd have high confidence in it, i.e., we'd have $\theta^Tx$ much higher than zero, and vice versa. So that's the first intuition, and it leads to the idea of "functional margins," discussed below.

Second intuition: assuming the training set is linearly separable (there's a straight line separating the training set... this is an assumption that will go away later). A better line is one that is further away from the training data (more centralized in the boundary). Formalizations here will use the term "geometric margins."

warning: another notation change here as we get into SVM

labels (y), instead of being 0 or 1, they'll be -1 or +1. And the hypothesis will output values that are either +1 or -1. Also, for our hypothesis, we're dropping our convention that $x_0 = 1$ (adding the intercept). Instead, we'll just separate out an intercept term b. Also our parameters become w rather than theta.

Definition: functional margin: the functional margin of a hyperplane with respect to a specific training example $(x^{(i)}, y^{(i)})$ is $\hat{\gamma}^{(i)}=y^{(i)}(w^Tx^{(i)}+b)$ What that means is that if $y^{(i)} = 1$ then you want the functional margin to be large which means you make the hypothesis (that term with w and b) to be big, and small if it's -1. Also, if $\hat{\gamma}^{(i)} > 0$ then we've classified correctly.

So if the functional margin is high, we've both classified correctly and confidently. Module one problem, namely that we could make the funcional margin large just by multiplying our parameters by constants. So that's bad. We need to add a normalization constraint later. (e.g. that the l2 norm ||w|| of w is 1.)

Definition: geometric margin: the geometric distance at a given training example is between a training example and the separating hyperplane.

Note that all points on the decision boundary satisfy $w^Tx+b=0$.

There's a bunch of linear algebra/geometry that I let slide here. But what it amounts to is that the geometric margin is just equal to the functional margin divided by the (l2?) norm of w. So it doesn't have the problem that the functional margin has of being subject to scaling by a constant. Intuitively, what we're doing is finding a vector orthogonal to the plane at the training example. (More intuitively still, a perpendicular line from the plane to the data.)

The geometric margin of a whole training set is just the minimum of the margins of all the examples.

Maximum margin classifier is a precursor to SVMs. See notes 3, pg. 5-6.

Maximum margin idea: maximize the minimum of the margins of all the examples.

Unfortunately, maximizing the geometric margin directly us pretty uggo: it's not convex.

Lecture 7

So there's a trick to solve that optimization problem. Since the geometric margin is invariant to scaling, you can find some constants to scale w such that the functional margin $min(y^{(i)})(w^Tx^{(i)}+b) = 1$. Plugged into the optimization problem we had before, we get a convex function that can be optimized.


Essentially, the idea of a kernel is: "sometimes you want to have a really ugly transformation on your feature vector. And sometimes, you can find some other function and other matrix, such that dot-producting the two matrices together with the function is equivalent, but it's much easier to calculate."

Tags: machinelearning math