CS 229: Machine Learning - Notes on Lecture #3

by Amit

For a brief recap of lecture #2, please see my notes.

In the third lecture, Prof. Andrew Ng continues his discussion of regression. He covers non-parametric learning algorithms, a probabilistic interpretation of linear regression (introduced in the 2nd lecture), a simple classification algorithm, and gives a very brief introduction to the perceptron.

Underfitting and Overfitting: To start with, the concepts of underfitting and overfitting in linear regression are explained informally and clearly. The choice of the degree of the fitting polynomial, and hence the parameters \vec{\theta}, can lead to either of the two. This motivates the study of non-parametric learning algorithms.

Non-parametric learning algorithms: Whereas the name may falsely imply that the parameters are done away with, the actual meaning is different. The learning algorithm we studied in the 2nd lecture had a fixed number of parameters, independent of the size of the data set m. In non-parametric learning algorithms, the amount of stuff the algorithm needs to keep around grows with m. We will see what this means shortly.

Locally Weighted Regression: Locally Weighted Regression (LWR) is a non-parametric learning algorithm. As is apparent, it has two important characteristics:  it works locally and there are weights involved. Let’s see how.

Sketch illustrating LWR (redrawn from the lecture)

Let’s say we are interested in predicting the value of y at a particular query point x. LWR acts locally, so it gives the most influence to the training points in a certain vicinity of x and their corresponding y^{(i)}‘s. It then fits \vec{\theta} so as to minimize the function \sum_{i=1}^m w^{(i)}(y^{(i)}-\theta^T x^{(i)})^2, where w^{(i)}=e^{\frac{-(x^{(i)}-x)^2}{2}}.

From the expression for w^{(i)}, it is clear that w^{(i)} \approx 1 for points x^{(i)} near the query point x, and w^{(i)} \approx 0 for points far away from it. Thus the weight is higher for training points close to the query point, so those points dominate the fit.
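The idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the lecture's code: it solves the weighted least-squares problem in closed form via the weighted normal equations, and it adds a bandwidth parameter tau (the formula in these notes corresponds to tau = 1).

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Predict y at x_query using locally weighted linear regression.

    X: (m,) array of 1-D inputs, y: (m,) array of targets.
    tau is the bandwidth; the formula in the notes fixes tau = 1.
    """
    m = X.shape[0]
    # Design matrix with an intercept term: rows are [1, x^{(i)}]
    A = np.column_stack([np.ones(m), X])
    # Gaussian weights: near the query point w ~ 1, far away w ~ 0
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (A^T W A)^{-1} A^T W y
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Fit a noisy sine curve locally at one query point
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)
print(lwr_predict(X, y, x_query=3.0, tau=0.5))
```

Note that a fresh \vec{\theta} is solved for at every query point, which is why the whole training set must be kept around — the "non-parametric" behavior mentioned above.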

Probabilistic Interpretation of Linear Regression: This is a very interesting part of the lecture. In the second lecture, we estimated \vec{\theta} by minimizing J(\vec{\theta})=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2. Why was the squared error minimized, and not the third or fourth power of the error? This part of the lecture justifies that choice. The starting point is to model y^{(i)} = \theta^{T}x^{(i)} + \epsilon^{(i)}, where \epsilon^{(i)} is random noise, assumed to be a Gaussian random variable with mean 0 and variance \sigma^{2}. Under this model, the likelihood function is L(\vec{\theta})=P(\vec{Y}|X;\theta). Finally, maximizing the log-likelihood \log L(\vec{\theta}) turns out to be equivalent to minimizing the same J(\vec{\theta}) defined earlier.
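The key step of that argument can be written out in a couple of lines. Assuming the \epsilon^{(i)} are independent Gaussians with mean 0 and variance \sigma^2, the log-likelihood is:

```latex
\ell(\theta) = \log L(\theta)
  = \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left( -\frac{\bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2}{2\sigma^2} \right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{\sigma^2} \cdot \frac{1}{2}\sum_{i=1}^m \bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2
```

The first term is constant in \theta, so maximizing \ell(\theta) is the same as minimizing \frac{1}{2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2, which is exactly J(\vec{\theta}). Had the noise been assumed non-Gaussian, a different loss would have come out, which is the answer to the "why squared error?" question.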

Besides the above, classification and perceptrons were discussed. I didn’t take many notes in those sections. Here is the video lecture:

It is highly recommended to go through the section notes on probability theory now. See you next time!