### CS 229: Machine Learning - Notes on Lecture #3

#### by Amit

For a brief recap of lecture #2, please see my notes.

In the third lecture, Prof. Andrew Ng continues his discussion of **regression**. He talks about **non-parametric learning algorithms**, a **probabilistic interpretation of linear regression** (introduced in the 2nd lecture), a simple **classification** algorithm, and gives a very brief idea of the **perceptron**.

**Underfitting and Overfitting:** To start with, the concepts of underfitting and overfitting in linear regression are explained informally and clearly. The choice of the degree of the fitting polynomial, and hence of the number of parameters, can lead to either of the two. This is a motivating factor for studying *non-parametric learning algorithms*.
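As a small illustration (my own, not from the lecture), here is a sketch on hypothetical noisy quadratic data: a degree-1 polynomial underfits it, while a needlessly high degree chases the noise. Note that the *training* error only ever shrinks as the degree grows, which is exactly why low training error alone doesn't tell you the fit is good.

```python
import numpy as np

# Hypothetical noisy quadratic data (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=x.shape)

for deg in (1, 2, 9):
    coeffs = np.polyfit(x, y, deg)  # least-squares polynomial fit
    rmse = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(f"degree {deg}: training RMSE = {rmse:.4f}")
# Training error shrinks monotonically with degree; held-out error would not.
```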

**Non-parametric learning algorithms**: Although the name may falsely suggest that the parameters are done away with, the actual meaning is different. The learning algorithm we studied in the 2nd lecture had a fixed number of parameters, independent of the size $m$ of the data set. In a non-parametric learning algorithm, the number of parameters grows (linearly, here) with $m$. We will see what this means shortly.

**Locally Weighted Regression:** Locally Weighted Regression (LWR) is a non-parametric learning algorithm. As the name suggests, it has two important characteristics: it works *locally* and there are *weights* involved. Let's see how.

Let's say we are interested in finding the value of $y$ for a particular query point $x$. LWR acts locally, so it gives most importance to the training points $x^{(i)}$ in a certain vicinity of $x$ and their corresponding $y^{(i)}$'s. It then tries to fit $\theta$ so as to minimize the function $\sum_{i} w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$, where $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$.

From the expression for $w^{(i)}$, it is clear that $w^{(i)} \approx 1$ for points $x^{(i)}$ near the query point $x$, and $w^{(i)} \approx 0$ for points far away from it. Thus the *weight* is higher when we are considering points close to the one being queried.
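The procedure can be sketched in a few lines of NumPy. This is a minimal 1-D version with an intercept term; the function name, the 1-D setting, and the default bandwidth are my own choices, not the lecture's.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression (1-D sketch with an intercept).

    X: (m,) training inputs, y: (m,) targets, tau: bandwidth of the
    Gaussian weighting.
    """
    # Gaussian weights: near the query point w ~ 1, far away w ~ 0
    w = np.exp(-((X - x_query) ** 2) / (2.0 * tau**2))
    # Design matrix with an intercept column
    A = np.column_stack([np.ones_like(X), X])
    # Weighted normal equations: (A^T W A) theta = A^T W y
    AtW = A.T * w  # equivalent to A.T @ diag(w)
    theta = np.linalg.solve(AtW @ A, AtW @ y)
    return theta[0] + theta[1] * x_query
```

Note that a fresh $\theta$ is solved for at every query point, which is why the effective number of parameters grows with the training set: the whole data set must be kept around at prediction time.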

**Probabilistic Interpretation of Linear Regression:** This is a very interesting part of the lecture. In the second lecture, we estimated $\theta$ by minimizing $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Why was the mean squared error minimized, and not the third or fourth power of the error? This part of the lecture gives a justification. The starting point is to assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is random noise, assumed to be a *Gaussian* random variable with mean $0$ and variance $\sigma^2$. Using this assumption, one defines the likelihood function $L(\theta)$. Finally, after taking the logarithm of $L(\theta)$, the function to be minimized turns out to be the same $J(\theta)$ defined earlier.
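In standard CS 229 notation, the derivation runs roughly as follows (my compressed summary). The Gaussian noise assumption gives the likelihood of the data under the parameters $\theta$:

$$L(\theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$

Taking the logarithm turns the product into a sum:

$$\ell(\theta) = \log L(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Since the first term does not depend on $\theta$, maximizing $\ell(\theta)$ is the same as minimizing $\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$, which is exactly the least-squares cost $J(\theta)$.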

Besides the above, **classification** and the **perceptron** were discussed. I didn't take many notes in those sections. Here is the video lecture:

It is highly recommended to go through the section notes on probability theory now. See you next time!