**Introduction**

In this tutorial, we will go through the basic steps of building a simple logistic regression model for a simple classification problem, from scratch in Python. Since it is simple logistic regression, the model has only one independent variable and one dependent variable.

**Classification**

As opposed to regression problems, in classification problems the dependent variable that we want to predict is a discrete value rather than a continuous value. An example of a classification problem is predicting whether or not it is going to rain, so the dependent variable has only two possible outputs. This is the example that we are going to use for learning purposes in this tutorial.

**Data Set**

Since this is a simple logistic regression tutorial, we are going to use a randomly generated dummy data set. The data set consists of two columns: relative humidity and rain occurrence. In reality, rain depends on many factors, such as temperature, humidity, pressure, and wind; humidity alone is not sufficient for predicting rain. So, the model that we are going to develop is for learning purposes only.

The figure below presents the scatter plot of the data set:

Here, we have the relative humidity as the independent variable and the occurrence of rain as the dependent variable that we want to predict. The dependent variable takes one of two values, 0 or 1, where 0 represents no rain and 1 represents rain. Moreover, the data set is already divided into a train set and a test set. The train set has 80 samples, and the test set has 20 samples, so the ratio of train set to test set is 80:20.

**Plot the Data Set**

Let us plot our data set. First of all, we need to import the following libraries: `numpy` and `matplotlib`.

```python
import numpy as np
import matplotlib.pyplot as plt
```

After that, let us define our data set, which consists of the train set (`x_train`, `y_train`) and the test set (`x_test`, `y_test`).

```python
# dataset
x_train = np.array([
    36, 37, 27, 39, 46, 18,  0, 53, 10, 52, 24, 29, 25, 42, 25, 19, 54,  0, 37, 23,
    48,  8, 40,  3, 15, 28,  0, 26, 28, 25, 39,  9, 25, 30, 27,  8,  9, 36, 47, 11,
    74, 87, 57, 58, 92, 84, 58, 96, 91, 60, 75, 87, 75, 69, 91, 77, 62, 69, 91, 93,
    61, 57, 66, 66, 67, 61, 79, 96, 64, 61, 88, 74, 65, 76, 80, 58, 81, 65, 85, 99])
y_train = np.array([
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
x_test = np.array([
    27,  9, 40, 36, 29,  2, 25, 34, 48, 19,
    97, 65, 66, 65, 60, 85, 65, 86, 56, 53])
y_test = np.array([
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
```

Finally, let us plot the train set and test set.

```python
# plot data set
plt.figure(0)
plt.scatter(x_train, y_train, label=r'$Train\;set$')
plt.scatter(x_test, y_test, label=r'$Test\;set$')
plt.yticks([0, 1.0])
plt.title('Rain data set')
plt.xlabel(r'Relative humidity $[\%]$')
plt.ylabel('Rain')
plt.legend(fontsize=10)
plt.show()
```

**Hypothesis Function**

Here, we are going to define our hypothesis function for simple logistic regression. The hypothesis function \(h_{\theta}(x)\) is given by

\(h_{\theta}(x)=g(\theta_{0}+\theta_{1}x)\)

where \(\theta_{0}\) and \(\theta_{1}\) are the parameters that we will obtain from training, and \(x\) is our independent variable. Here, we introduce a new function \(g(z)\) given by

\(g(z)=\frac{1}{1+e^{-z}}\)

which is called the sigmoid function or logistic function. Substituting \(g(z)\) into the hypothesis, we can rewrite the hypothesis function as

\(h_{\theta}(x)=\frac{1}{1+e^{-(\theta_{0}+\theta_{1}x)}}\)
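Before implementing the hypothesis, it may help to see numerically that the sigmoid squashes any real input into the open interval (0, 1), which is what lets us read its output as a probability. This quick check uses only the Python standard library:

```python
import math

def sigmoid(z):
    # the sigmoid maps any real number into the open interval (0, 1)
    return 1/(1 + math.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
```

Large negative inputs are pushed toward 0, large positive inputs toward 1, and 0 maps exactly to 0.5.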

Let us define this hypothesis function in Python. In the code below, in **lines 2-3**, we define the `sigmoid` function. Then, in **lines 6-7**, we define the parameters \(\theta_{0}\) and \(\theta_{1}\). After that, in **lines 10-12**, we define the `hypothesis` function.

```python
# sigmoid function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# parameters
theta_0 = 0
theta_1 = 0

# hypothesis function
def hypothesis(x):
    h = sigmoid(theta_0 + theta_1*x)
    return h
```

**Gradient Descent**

Here, we are going to define the gradient descent for training our simple logistic regression model. The gradient descent is an iterative algorithm for finding the optimal parameters \(\theta_{0}\) and \(\theta_{1}\). It works by minimizing the cost function.

So, let us define the cost function. The cost function \(J(\theta_{0}, \theta_{1})\) is given by

\(J(\theta_{0}, \theta_{1})=-\frac{1}{m}\left[\sum_{i=1}^{m}y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]\)

where \(m\) is the number of training samples. This function measures the error between the predicted outputs and the expected outputs, and the goal of gradient descent is to minimize this error. Gradient descent is a first-order optimization algorithm, so we need the first partial derivatives of the cost function, which are given by

\(\frac{\partial}{\partial\theta_{j}}J(\theta_{0}, \theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\)

for \(j=0\) and \(j=1\). Subsequently, the gradient descent algorithm is given by

\(
\text{repeat until convergence} \; \{ \\
\qquad temp0:=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) \\
\qquad temp1:=\theta_{1}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} \\
\qquad \theta_{0}:=temp0 \\
\qquad \theta_{1}:=temp1 \\
\}
\)

where \(\alpha\) is the learning rate, which controls how quickly the model is trained.

Recommended reading: Derivative of cost function for Logistic Regression
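The training code in this tutorial never evaluates \(J(\theta_{0}, \theta_{1})\) explicitly, but it can be useful to compute it when checking that training is converging. Here is a minimal sketch; the `cost` helper is illustrative and not part of the tutorial's code:

```python
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def cost(theta_0, theta_1, x, y):
    # J = -(1/m) * sum( y*log(h) + (1 - y)*log(1 - h) )
    m = len(x)
    h = sigmoid(theta_0 + theta_1*x)
    return -np.sum(y*np.log(h) + (1 - y)*np.log(1 - h)) / m

# with theta_0 = theta_1 = 0, every hypothesis equals 0.5,
# so J = -log(0.5), regardless of the labels
x = np.array([10.0, 50.0, 90.0])
y = np.array([0.0, 1.0, 1.0])
print(cost(0.0, 0.0, x, y))
```

Calling `cost` every few hundred iterations of gradient descent and watching the value decrease is a simple convergence check.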

Now, let us define this gradient descent in Python. In the code below, we calculate the hypothesis using the `hypothesis` function in **line 8**. Then, we calculate the error between the predicted outputs and the expected outputs in **line 11**. After that, we calculate the partial derivatives in **lines 14-15** and **lines 18-20**. Finally, we update the parameters in **line 23** and **line 26**.

```python
# gradient descent
def gradient_descent(x, y, m, iter, alpha):
    global theta_0
    global theta_1
    for i in range(iter):

        # calculate hypothesis
        h = hypothesis(x)

        # calculate error
        e = h - y

        # calculate d_J/d_theta_0
        sum_e = np.sum(e)
        d_J_d_theta_0 = sum_e / m

        # calculate d_J/d_theta_1
        e_mul_x = e * x
        sum_e_mul_x = np.sum(e_mul_x)
        d_J_d_theta_1 = sum_e_mul_x / m

        # update theta 0
        theta_0 = theta_0 - alpha*d_J_d_theta_0

        # update theta 1
        theta_1 = theta_1 - alpha*d_J_d_theta_1

    return
```

Now that we have defined the `gradient_descent` function, let us run it. In the code below, we run the gradient descent for 30000 iterations with \(\alpha=0.01\). You should get \(\theta_{0}\approx -9.33\) and \(\theta_{1}\approx 0.18\).

```python
# run gradient descent
gradient_descent(x_train, y_train, 80, 30000, 0.01)
print(theta_0)
print(theta_1)
```

**Plot the Hypothesis Function**

Here, we have obtained the trained parameters \(\theta_{0}\) and \(\theta_{1}\). We can visualize our logistic regression model by plotting it for inputs from 0 to 100, which corresponds to relative humidity from 0 to 100%. In the code below, we plot our hypothesis function.

```python
# plot logistic model
x_model = np.arange(0, 100)
y_model = hypothesis(x_model)
plt.figure(1)
plt.scatter(x_train, y_train, label=r'$Train\;set$')
plt.plot(x_model, y_model, '--r', linewidth=2.5, label=r'$Logistic\;model$')
plt.axhline(0.5, linestyle='--', label=r'$Decision\;threshold$')
plt.yticks([0, 0.5, 1.0])
plt.title('Logistic regression model')
plt.xlabel(r'Relative humidity $[\%]$')
plt.ylabel('Rain')
plt.legend(fontsize=10)
plt.show()
```

As you can see in the following figure, the hypothesis function fits the training set well. We use a threshold of 0.5 to predict the occurrence of rain, so the decision rule for a prediction is given by

\(y=1\) **if** \(h_{\theta}(x)\geq 0.5\)

\(y=0\) **if** \(h_{\theta}(x)< 0.5\)
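As a self-contained sketch of this decision rule, the threshold can be applied directly to the sigmoid output. The `predict` helper below is illustrative, and the parameter values are the approximate ones obtained from training above:

```python
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# approximate trained parameters from the tutorial
theta_0 = -9.33
theta_1 = 0.18

def predict(x, threshold=0.5):
    # decision rule: y = 1 if h >= threshold, otherwise y = 0
    h = sigmoid(theta_0 + theta_1*x)
    return (h >= threshold).astype(int)

print(predict(np.array([20, 52, 60, 90])))  # -> [0 1 1 1]
```

With these parameters, \(h_{\theta}(x)=0.5\) at \(x=9.33/0.18\approx 51.8\), so humidities above roughly 52% are classified as rain.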

**Make Predictions**

Now, let us try to make predictions. We are going to use the test set as the data input. In the code below, we make predictions by calling the `hypothesis` function.

```python
# make predictions
y_pred = hypothesis(x_test)
plt.figure(2)
plt.scatter(x_test, y_test, label=r'$Test\;set$')
plt.scatter(x_test, y_pred, marker='x', label=r'$Prediction$')
plt.axhline(0.5, linestyle='--', label=r'$Decision\;threshold$')
plt.yticks([0, 0.5, 1.0])
plt.title('Rain prediction')
plt.xlabel(r'Relative humidity $[\%]$')
plt.ylabel('Rain')
plt.legend(fontsize=10)
plt.show()
```

The following figure shows the prediction results. If we apply the decision rule to the output of the `hypothesis` function, we get the predicted class for each test sample.

**Source Code**

You can get the source code from this repository.

**Summary**

In this tutorial, we have learned how to build a simple logistic regression model from scratch in Python. We trained the model using the gradient descent algorithm and then made predictions using the test set as input, with the decision threshold set at 0.5.