
## Category: Course Notes

### CS 229: Machine Learning - Notes on Lecture #3

For a brief recap of lecture #2, please see my notes.

In the third lecture, Prof. Andrew Ng continues his discussion of regression. He talks about non-parametric learning algorithms, a probabilistic interpretation of linear regression (introduced in the second lecture), a simple classification algorithm, and gives a very brief introduction to the perceptron.

Underfitting and Overfitting: To start with, the concepts of underfitting and overfitting in linear regression are explained very informally and clearly. The choice of the degree of the fitting polynomial, and hence the parameters $\vec{\theta}$, can lead to either of these. This is a motivating factor for studying non-parametric learning algorithms.

Non-parametric learning algorithms: Although the name may falsely imply that the parameters are done away with, the actual meaning is different. The learning algorithm that we studied in the second lecture had a fixed number of parameters, independent of the size of the data set $m$. In non-parametric learning algorithms, the number of parameters grows with $m$ (here, linearly). We will see what this means shortly.

Locally Weighted Regression: Locally Weighted Regression (LWR) is a non-parametric learning algorithm. As is apparent, it has two important characteristics:  it works locally and there are weights involved. Let’s see how.

Let’s say we are interested in predicting the value of $y$ at a particular query point $x$. LWR acts locally, so it gives the most importance to training points in a certain vicinity of $x$ and their corresponding $y^{(i)}$‘s. It then tries to fit $\vec{\theta}$ so as to minimize the function $\sum_{i=1}^m w^{(i)}(y^{(i)}-{\theta}^T x^{(i)})^2$, where $w^{(i)}=\exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$ and $\tau$ is a bandwidth parameter controlling how quickly the weights fall off with distance.

From the expression for $w^{(i)}$, it is clear that $w^{(i)} \approx 1$ for points near the query point $x$, and $w^{(i)} \approx 0$ for points far away from it. Thus, training points close to the query point carry higher weight in the fit.
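A minimal sketch of LWR in NumPy might look like this (the function name, the bandwidth value, and the toy data are my own illustrations, not code from the course):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.3):
    """Predict y at x_query by fitting a locally weighted linear model."""
    # Gaussian weights: close to 1 near x_query, close to 0 far away
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Minimize sum_i w_i (y_i - theta^T x_i)^2 via weighted normal equations:
    # theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy data: y = sin(x), with an intercept column prepended to X
x = np.linspace(0, 3, 50)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x)

# Predict at x = 1.5 (query point includes the intercept term)
pred = lwr_predict(X, y, np.array([1.0, 1.5]))
```

Note that $\vec{\theta}$ is re-fit for every query point, which is exactly why the effective number of parameters grows with the size of the training set.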

Probabilistic Interpretation of Linear Regression: This is a very interesting part of the lecture. In the second lecture, we estimated $\vec{\theta}$ by minimizing $J(\vec{\theta})=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$. Why was the squared error minimized, and not the third or fourth power of the error? This part of the lecture gives a justification for that. The starting point is to assume $y^{(i)} = \theta^{T}x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is random noise, assumed to be a Gaussian random variable with mean $0$ and variance $\sigma^{2}$. Using this assumption, a likelihood function $L(\vec{\theta})=P(\vec{Y}|X;\theta)$ is defined. Finally, after taking the logarithm of $L(\vec{\theta})$, maximizing the log-likelihood turns out to be equivalent to minimizing the same $J(\vec{\theta})$ defined earlier.
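The key step of the derivation can be sketched as follows (assuming, as above, Gaussian noise with mean $0$ and variance $\sigma^2$). The noise assumption gives the conditional density of each output:

$$
p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
$$

Taking the log of the likelihood over all $m$ independent training examples:

$$
\ell(\theta) = \log L(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2
$$

The first term does not depend on $\theta$, so maximizing $\ell(\theta)$ amounts to minimizing $\frac{1}{2}\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2$, which is exactly $J(\vec{\theta})$.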

Besides the above, classification and perceptrons were discussed. I didn’t take many notes on those sections. Here is the video lecture:

It is highly recommended to go through the section notes on probability theory now. See you next time!

### CS 229: Machine Learning - Notes on Lecture #2

In the second lecture, Prof. Andrew Ng starts talking about supervised learning methods. He begins with Linear Regression, in which the relationship between the input and the output is assumed to be linear, such as $h_\theta(x)=\theta_0 + \theta_1x$. The parameters $\theta_0$ and $\theta_1$ need to be found, for which a couple of approaches are discussed:

1. The first approach involves minimizing the function $J(\vec{\theta})=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, where $m$ is the number of training examples, $x^{(i)}$ is the $i$th sample input, and $y^{(i)}$ is the corresponding output. A couple of methods are discussed for this minimization task. The first is the Gradient Descent method, which roughly is:
   1. Start with some value of $\vec{\theta}$ (say $\vec{\theta} = \vec{0}$).
   2. Keep updating $\vec{\theta}$ to reduce $J(\vec{\theta})$ as follows: ${\theta}_i := {\theta}_i - \alpha \frac{\partial}{\partial {\theta}_i} J(\vec{\theta})$.
   3. Stop when a desired reduced value of $J(\vec{\theta})$ is reached.

   A faster method for large data sets, stochastic gradient descent, is then described.
2. The second approach to estimating $\vec{\theta}$ uses linear algebraic techniques to obtain a closed-form formula for the parameters $\vec{\theta}$.
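A toy sketch of the two approaches in NumPy might look like this (the function names, learning rate, and toy data are my own illustrations, not code from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iters=1000):
    """Minimize J(theta) = 1/2 * sum (h_theta(x_i) - y_i)^2 by gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)  # start with theta = 0
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)   # gradient of J at theta
        theta -= (alpha / m) * grad    # step scaled by 1/m to keep it stable
    return theta

def normal_equation(X, y):
    """Closed-form solution: theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated from y = 2 + 3x, with an intercept column in X
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x

theta_gd = gradient_descent(X, y)   # both should recover [2, 3]
theta_ne = normal_equation(X, y)
```

On noiseless data like this, both approaches recover the same parameters; gradient descent trades exactness per step for scalability to data sets where forming and solving $X^T X$ directly is impractical.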

This is the video lecture:

As I mentioned in my notes on the first lecture, this is a good time to review the section notes on Linear Algebra. Lecture Notes 1 cover the first two lectures.

Looks like we will do a lot of regression in the next lecture. See you then!

### CS 229: Machine Learning - Notes on Lecture #1

With this blog post, I am making a bold attempt to follow an online course to a decent degree of successful completion. I shall post some notes, links to relevant lecture notes and other misc. stuff as I start on my quest to follow the CS 229: Machine Learning course from Stanford University.

This is the video lecture:

My summarized notes:

The first 30 minutes or so of this lecture cover logistics for the course, an introduction to the course, and the field of machine learning in general. The actual material starts at around 31:36, so you may want to skip ahead to that point.
• Machine learning definitions: Arthur Samuel (1959), Tom Mitchell (1998)

• In the rest of the lecture, the four main parts of the course are described in some detail, along with illustrative examples.
• Supervised learning: providing the algorithm a dataset, supervising- learn the association between input & output, regression problems, classification problems, support vector machines- infinite number of features

• Learning theory

• Unsupervised learning: clustering, the cocktail party problem, independent component analysis

• Reinforcement learning: reward function (good dog, bad dog), feedback function

Miscellaneous

Before proceeding further, you may also want to take a look at the course materials. The lecture notes and the video lectures are not in a one-to-one mapping; it’s a many-to-one mapping: there are 20 video lectures but only 12 sets of lecture notes. In particular, there is no lecture note corresponding to the first video lecture.

The first of the Section Notes on Linear Algebra carries a nice review of the concepts that will be useful for the course. From what I understand and know, Linear Algebra seems to be very important to the study of Machine Learning, so it’s a good idea to at least skim over the notes given on the website.

Hope you enjoy following the first video lecture! See you.