Thursday, October 5, 2017

1. Notes on Machine Learning: Linear Regression

Machine Learning is a hot topic these days since various kinds of applications rely on Machine Learning algorithms to get things done. While learning this topic, I will be writing my own notes about it as article serious in this blog for my own future reference. The contents might be highly abstract as this is not a tutorial aimed at somebody to learn Machine Learning by reading these notes.


According to the definition of Tom Mitchell, Machine Learning is defined as  "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." Basically, we are enabling computers to do things without explicitly programming them to do it.

Categories of Machine Learning:

There are two broad categories of Machine Learning algorithms. The first is Supervised Learning and the second is Unsupervised Learning. There are various sub categorizations under these categories which can be illustrated as follows.

Machine Learning Models:

A model is a function (ie. hypothesis) \( h(x) \) which provides the output value \( y \) for a given input values \( x \), based on previously given leaning dataset \( X \) and output set \( Y \). The input values \( x \) are called features. A hypothesis function in Linear Regression with one feature would look like the following.
$$ h(x) = \theta_{0} x_{0} + \theta_{1} x_{1} $$
The first feature \( x_{0} \) is always set to 1 while the second feature \( x_{1} \) is actually the feature used in this model. The parameters \( \theta_{0} \) and \( \theta_{1} \) are the weight of the features to the final output and therefore, they are the values we are looking for in order to build the linear regression model for a specific dataset. The reason why we have an extra feature in the beginning which is always set to 1 is that it is easy to perform vecterized calculations (using matrices based tools) when we have it in that way.

Cost Function:

In order to measure the accuracy of a particular hypothesis function, we use another function called cost function. It is actually, a squared mean error function between the difference of predicted output value and the true output value of the hypothesis. By adjusting the values of the parameters \( \theta_{0} \) and \( \theta_{1} \) in the hypothesis, we can minimize the cost function and make the hypothesis more precise.
$$J(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum_{i=0}^{m} (h_\theta(x_i) - y_i)^{2}$$

Gradient Descent:

In this algorithm, what we are doing is keep adjusting the parameters \( \theta_{0} \) and \( \theta_{1} \) until the Cost Function evetually becomes the minimum it can get. That means, we found the most accurate Model for the training dataset distribution. So, in order to adjust the parameters \( \theta_{0} \) and \( \theta_{1} \), we perform the following step over and over again for \( \theta_{0} \) and \( \theta_{1} \). In this equation, \( j \) is 0 and 1 in two steps.
$$\theta_{j} = \theta_{j} - \alpha \frac{\partial }{\partial \theta_{j}} J(\theta_{0},\theta_{1})$$


No comments:

Post a Comment