
Beginning with Machine Learning: Linear Regression

July 1, 2019 · Machine Learning


Artificial Intelligence is the next big dream. In a world becoming more technologically centred by the second, computers that can make logical decisions by themselves become more of a necessity than a luxury. From recommendation systems that predict what you’re most likely to watch on Netflix or buy on Flipkart, to telling Alexa to turn up the volume for your favorite song, AI has already woven its way into our daily lives.

If machines are to reach logical conclusions and make decisions without being explicitly programmed to do so, they need to be able to learn. Like us. They need to take the data given to them and learn how to come to certain conclusions. And just like us, they need to be able to learn from their mistakes. Thus, enter Machine Learning. Machine Learning is not synonymous with Artificial Intelligence; it is a subset of it. Using Machine Learning, computers can make decisions based on patterns and inferences, instead of the conventional method of explicit instructions hidden in lines and lines of code.

The learning algorithms used in ML can be broadly classified under two categories: supervised learning algorithms and unsupervised learning algorithms. Supervised learning uses data that comes with known, conclusive results in order to form new conclusions, whereas unsupervised learning works from raw, unlabelled data and has to discover the structure on its own. To understand this better, let’s look at a small example. If you don’t really care and just want to read up on linear regression, jump to the next paragraph. Consider a basket of vegetables which contains carrots, radishes, tomatoes and onions. Since you already know what each of them looks like, you can easily classify them into their respective categories. Now, imagine asking a small child to make the same classification. It will be a lot harder, since the child probably doesn’t know what each vegetable looks like and has to group them by appearance alone. This is the principle of unsupervised learning. And just like the child, a computer too can learn to make the correct classification, given a large enough basket of vegetables.

Moving on to Linear Regression: Linear Regression is a supervised learning algorithm. Given a reasonably sized dataset and the required output(s), the computer can learn to predict what the output will be for new inputs. Linear regression can be performed with a single variable (univariate regression) or with many variables (multivariate regression). Now, let’s get a sense of what all of this means by considering a practical example. Suppose a few online fashion brands want to advertise on Instagram such that their return on investment is maximised. To be able to help with this, we need a computer that knows not only how much each company invested on Instagram, but also how much return on investment that company got back. This hypothetical data is showcased in the table below, with a short code sketch after it:

Social media: Instagram

Company      Investment     Sales
Myntra       50000          52500
Jabong       48000          49500
Limeroad     51200          51000
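To make this concrete before diving into the mechanics, here is a minimal sketch of fitting a univariate linear regression to the three rows above using scikit-learn. The library choice and variable names are my own illustration, not something this post prescribes.

import numpy as np
from sklearn.linear_model import LinearRegression

# Investment (input) and Sales (output) from the table above.
X = np.array([[50000.0], [48000.0], [51200.0]])  # Myntra, Jabong, Limeroad
y = np.array([52500.0, 49500.0, 51000.0])

model = LinearRegression()
model.fit(X, y)

# The learned slope (the "weight") and intercept of the best-fitting line.
print("slope m:", model.coef_[0])
print("intercept c:", model.intercept_)

# Predict the expected sales for a new investment of 49000.
print("predicted sales:", model.predict([[49000.0]])[0])

The rest of this post unpacks what fit() is actually doing under the hood.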

Given this data, using linear regression, we can train a computer to predict how much a company stands to gain by investing in Instagram. To understand how it does this, consider the following graph:

[Figure: Linear regression graph showing the best-fitting line through the investment/sales data points]

This graph showcases the “best-fitting line.” While the line admittedly doesn’t pass through every point, it’s the line that predicts the sales growth most accurately. Using linear regression, the computer tries to learn the “weight”, or in this case, the slope of the line. It does so by comparing the predicted output to the actual output. The difference between the two is measured by a mathematical expression termed the cost function, which measures how wrong the model is. For linear regression, the cost function used is the mean squared error (MSE). In simple terms, this means the regression line is said to be “best fit” when the sum of the squared vertical distances of each point in the dataset from the line is at its minimum. Since our regression line is a straight line, it can be represented by a basic equation:

y = mx + c

where m is the slope (the weight the model learns) and c is the y-intercept.

Since the cost function measures the MSE of the computed output, it is represented as follows:

J(m, c) = (1/n) * Σᵢ (yᵢ − (m·xᵢ + c))²

where n is the number of data points and (xᵢ, yᵢ) is the i-th investment/sales pair.
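In code, the model and its cost function are only a few lines. Here is a NumPy sketch; the function names predict and cost are hypothetical, chosen just for this illustration:

import numpy as np

def predict(x, m, c):
    # The regression line: y = m*x + c, evaluated for every input in x.
    return m * x + c

def cost(x, y, m, c):
    # Mean squared error: the average squared gap between the
    # predicted outputs and the actual outputs y.
    errors = y - predict(x, m, c)
    return np.mean(errors ** 2)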

Now, to minimise this cost function and thus improve the accuracy of our linear regression model, we use ‘Gradient Descent’. Elementary calculus tells us that a function reaches its minimum where its derivative is zero. Gradient descent searches for that point iteratively: it computes the partial derivatives of the cost function and repeatedly steps the parameters in the direction that makes the cost smaller. The partial derivatives of the cost function look like this: (Of course, you can arrive at this answer by yourself too if you know how to calculate partial derivatives)

∂J/∂m = −(2/n) * Σᵢ xᵢ · (yᵢ − (m·xᵢ + c))
∂J/∂c = −(2/n) * Σᵢ (yᵢ − (m·xᵢ + c))

Each step of gradient descent then nudges the parameters in the opposite direction of these derivatives:

m := m − α · (∂J/∂m)
c := c − α · (∂J/∂c)
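Those derivatives translate directly into code. Continuing the NumPy sketch above (the name gradients is again my own):

def gradients(x, y, m, c):
    # Partial derivatives of the MSE cost J with respect to m and c,
    # matching the two formulas above.
    n = len(x)
    errors = y - (m * x + c)
    dm = (-2.0 / n) * np.sum(x * errors)
    dc = (-2.0 / n) * np.sum(errors)
    return dm, dc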

Gradient descent introduces a new term, the learning rate, commonly denoted by alpha (the α in the update rules above). To understand what the learning rate actually is, it helps to first visualise gradient descent. To that end, imagine a hill. Suppose you are on top of that hill and need to get down. However, you have a mild fear of heights and are paranoid that you might fall. So, you start descending the hill with very small steps. Soon enough, you realise that you are most likely not going to fall off the hill and start taking slightly bigger steps. You continue taking such steps downward until you reach the bottom of the hill, which in mathematics is called a local minimum. That’s it! You just did what gradient descent does. The size of the steps you took to get off the hill represents the learning rate. And much like in the example of the hill, the learning rate should be small, but not too small. Usually, it is convenient to set alpha to 0.01 initially; after implementing your cost function, you can change it to a more suitable value if needed.
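Putting the pieces together, a full training loop might look like the sketch below. It continues the functions defined above, and the default hyperparameters are illustrative, not prescribed by this post:

def train(x, y, alpha=0.01, epochs=1000):
    # Start from an arbitrary line (m = 0, c = 0) and take 'epochs'
    # steps downhill, each scaled by the learning rate alpha.
    m, c = 0.0, 0.0
    for _ in range(epochs):
        dm, dc = gradients(x, y, m, c)
        m -= alpha * dm
        c -= alpha * dc
    return m, c

One caveat: with inputs as large as the investment figures in our table, alpha = 0.01 would make the updates overshoot and diverge, so in practice you would first scale the data (for example, divide both columns by 10000) or pick a much smaller alpha.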

[Figure: Pictorial representation of gradient descent stepping down a cost curve toward the minimum]

That’s it! That’s Univariate Linear Regression. Check out my next blog to read up on Multivariate Regression.

