Simple Logistic Regression Model for Predicting Rain

Introduction

Here, in this tutorial, we will go through to the basic steps of building a simple logistic regression model for predicting a simple classification problem.  We are going to build the model from scratch in Python. Since it is the simple logistic regression, it has only one independent variable and one dependent variable.

Classification

As opposed to regression problems, in classification problems, the dependent variable that we want to predict is a discrete value rather than a continuous value. As an example of classification problem is classifying whether it is going to rain or not. So, it is clear that the dependent variable has only two possible output. This is the example that we are going to use for learning purpose in this tutorial.

Data Set

Since it is a simple logistic regression tutorial, we are going to use a dummy data set that is randomly generated. The data set consist of two columns, which are relative humidity and rain prediction. Actually, the rain prediction depends on many factors, such as temperature, humidity, pressure, wind, etc. With only humidity, it is not a sufficient condition for predicting rain. So, the purpose of the model that we are going to develop is for learning purpose only.

The figure below presents the scatter plot of the data set:

Here, we have the relative humidity as the independent variable and the occurrence of rain as the dependent variable that we want to predict. The dependent variable is a set that consists of two elements either of 0 or 1. Where 0 represents not rain and 1 represents rain. Moreover, we already have the data set divided into two sets, which are train set and test set. The train set has 80 samples, and the test set has 20 samples. So, the ratio of train set to test set is 80:20.

Plot the Data Set

Let us plot our data set. First of all, we need to import the following libraries: numpy and matplotlib.

After that, let us define our data set, which are the train set (x_train , y_train) and test set (x_test, y_test).

Finally, let us plot the train set and test set.

Hypothesis Function

Here, we are going to define our hypothesis function for simple logistic regression. The hypothesis function \(h_{\theta}(x)\) is given by

\(h_{\theta}(x)=g(\theta_{0}+\theta_{1}x)\)

where \(\theta_{0}\) and \(\theta_{1}\) are the parameters that we will obtain from training, and \(x\) is our independent variable. Here, we introduce a new function \(g(z)\) given by

\(g(z)=\frac{1}{1+e^{-z}}\)

which is called sigmoid function or logistic function. As a result, we can take and put them together. So, we can rewrite the hypothesis function to become

\(h_{\theta}(x)=\frac{1}{1+e^{-(\theta_{0}+\theta_{1}x)}}\)

Let us define this hypothesis function in Python. In the code below, in line 2-3, we define the sigmoid function. Then, in line 6-7, we define the parameters \(\theta_{0}\) and \(\theta_{1}\). After that, in line 10-12, we define the hypothesis function.

Gradient Descent

Here, we are going to define the gradient descent for training our simple logistic regression model. The gradient descent is an iterative algorithm for finding the optimal parameters \(\theta_{0}\) and \(\theta_{1}\). It works by minimizing the cost function.

So, let us define the cost function. The cost function \(J(\theta_{0}, \theta_{1})\) is given by

\(J(\theta_{0}, \theta_{0})=-\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}log(h_{\theta}(x^{(i)}))+(1-y^{(i)})log(1-h_{\theta}(x^{(i)}))]\)

where \(m\) is number of training samples. This function measures the error between the predicted outputs and the expected outputs. The goal of gradient descent is to minimize this error. Gradient descent is a first-order optimization algorithm. Hence, we should find the first partial derivative of the cost function. The first partial derivative of the cost function is given by

\(\frac{\partial}{\partial\theta_{j}}J(\theta_{0}, \theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\)

for \(j=0\) and \(j=1\). Subsequently, the gradient descent algorithm is given by

\(
repeat \; until \; convergence \; \{ \\
\qquad temp0:=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) \\
\qquad temp1:=\theta_{1}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} \\
\qquad \theta_{0}:=temp0 \\
\qquad \theta_{1}:=temp1 \\
\}
\)

where \(\alpha\) is learning rate. It defines how quickly the model trained.

Recommended reading: Derivative of cost function for Logistic Regression

Now, let us define this gradient descent in Python. In the code below, we calculate the hypothesis by using hypothesis function in line 8. Then, we calculate the error between the predicted outputs and the expected outputs in line 11. After that, we calculate the partial derivative in line 14-15 and line 18-20. Finally, we update the parameters in line 23 and line 26.

Now, we have defined the gradient_descent function. So, let us run this function. In the code below, we run the gradient descent for 30000 iterations and \(\alpha=0.01\). You should get \(\theta_{0}\approx -9.33\) and \(\theta_{1}\approx 0.18\).

Plot the Hypothesis Function

Here, we have obtained the trained parameters \(\theta_{0}\) and \(\theta_{1}\). We can visualize our logistic regression model by plotting it with data input from 0 to 100. It corresponds to relative humidity from 0 to 100. In the code below, we plot our hypothesis function.

As you can see in the following figure, the hypothesis function best fit to the training set. We use a threshold of 0.5 to predict the occurrence of rain. So, the decision of a prediction is given by

\(y=1\)     if \(h_{\theta}(x)\geq 0.5\)

\(y=0\)     if \(h_{\theta}(x)< 0.5\)

Make Predictions

Now, let us try to make predictions. We are going to use the test set as the data input. In the code below, we make predictions by calling the hypothesis function.

The following figure shows the prediction results. If we apply the decision rules to the output of the hypothesis function, then we should get the predicted output.

Source Code

You can get the source code from this repository.

Summary

In this tutorial, we have learned how to build a simple logistic regression model from scratch in Python. We have trained the model using the gradient descent algorithm. Then, we can make predictions by using test set as input. We set the decision boundary to be 0.5.