
Multiple Linear Regression | Machine Learning Beginners

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables by fitting a linear function to the data (a hyperplane rather than a single straight line).

In this post we implement multiple linear regression with gradient descent on the Boston housing dataset.
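
Written out, the model can be expressed as below. The symbols are notation assumed for this post: n is the number of independent variables and the θ values are the coefficients, matching the theta array used in the code later on.

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

Here θ_0 is the intercept, and each θ_j describes how the prediction changes when x_j increases by one unit while the other variables are held fixed.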

First, let's import the required libraries. NumPy and pandas are used to work with the dataset.

The sklearn library provides the built-in Boston dataset, and its metrics module is used to calculate different types of error.

The matplotlib library is used for data visualization. Here it plots the cost so that we can inspect it.

sklearn.preprocessing is used for data preprocessing.

# import libraries

import numpy as np
import pandas as pd
from sklearn import datasets,metrics
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

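Next we load the Boston dataset. datasets.load_boston(return_X_y=True) returns the input features in x and the target values in y. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older scikit-learn version.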

# load dataset

x, y = datasets.load_boston(return_X_y=True)

We partition the dataset into a training set (the first 400 rows) and a test set (the remaining 106 rows). The code below extracts the number of rows and the number of attributes, then adds a dummy column of ones to both the training set and the test set so that theta[0] acts as the intercept term.

# take partition for training set(feature)

x_train_tmp=x[0:400,:]

# get number of rows(input) and number of attribute in training dataset

no_of_input = x_train_tmp.shape[0]
no_of_attributes = x_train_tmp.shape[1]

# create training dataset with adding one dummy attribute column as 1

x_train = np.zeros((no_of_input,no_of_attributes+1))
x_train[:,0] = np.ones(no_of_input)
x_train[:,1:] = x_train_tmp


# take partition for training set(output)

y_train = y[0:400]

# take partition for testing set(feature)

x_test_tmp = x[400:506,:]

# get number of rows(input) and number of attribute in testing dataset

no_of_input_test = x_test_tmp.shape[0]
no_of_attributes_test = x_test_tmp.shape[1]

# create testing dataset with adding one dummy attribute column as 1

x_test = np.zeros((no_of_input_test,no_of_attributes_test+1))
x_test[:,0] = np.ones(no_of_input_test)
x_test[:,1:] = x_test_tmp

# take partition for testing set(output)

y_test = y[400:506]
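
The dummy column can also be added in a single step. The following is a minimal sketch on a tiny made-up array; the helper name add_bias_column and the demo values are assumptions for illustration and are not part of the code above.

```python
import numpy as np

def add_bias_column(features):
    """Return a copy of `features` with a leading column of ones."""
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

# Tiny made-up feature matrix: 3 rows, 2 attributes
demo = np.array([[2.0, 5.0],
                 [3.0, 7.0],
                 [4.0, 9.0]])
demo_with_bias = add_bias_column(demo)

print(demo_with_bias.shape)   # (3, 3)
print(demo_with_bias[:, 0])   # [1. 1. 1.]
```

Applied to this post's data, the same helper would be called once on x[0:400,:] and once on x[400:506,:].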

Next we preprocess the data so that each feature has a mean of 0 and a standard deviation of 1.

For that we can use StandardScaler(). The scaler is fitted on the training set only and then applied to both sets. MinMaxScaler() is an alternative that rescales each feature to the range [0, 1] instead; to use it, it must also be imported from sklearn.preprocessing.

# standardize the features so that mean is 0 and sd is 1

scaler = StandardScaler()

# scaler = MinMaxScaler()

scaler.fit(x_train[:,1:])
x_train[:,1:] = scaler.transform(x_train[:,1:])
x_test[:,1:] = scaler.transform(x_test[:,1:])
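
To make the transformation concrete, here is a rough sketch of the same standardization done by hand with NumPy on made-up data. The arrays train and test are random placeholders, not the Boston data. The key point is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data standing in for the training and test features
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Statistics are computed from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(np.allclose(train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(train_scaled.std(axis=0), 1.0))   # True
```

The scaled test set will generally not have a mean of exactly 0, because it is shifted by the training mean rather than by its own.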

We initialize theta with random values between 0 and 1 using NumPy's random.uniform() function, and set the number of iterations and the learning rate alpha.

We also define lists to store cost_history and theta_history.

# randomly choose value of theta

theta = np.random.uniform(0,1,size = x_train.shape[1])
no_of_iteration = 1000
alpha = 0.01
cost_history = []
theta_history = []

The cost function computes the cost J, which is half the mean squared error, in each iteration.

def cost(error, m):
    J = np.dot(error.T,error)/(2*m)
    return J
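
As a quick check of the formula, here is a small worked example with a made-up error vector. The function is repeated so that the snippet runs on its own; demo_error is an assumption for illustration.

```python
import numpy as np

def cost(error, m):
    # Half the mean of the squared errors
    return np.dot(error.T, error) / (2 * m)

# Made-up prediction errors for three samples
demo_error = np.array([1.0, -2.0, 3.0])
demo_cost = cost(demo_error, len(demo_error))

# (1 + 4 + 9) / (2 * 3) = 14 / 6
print(demo_cost)
```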

The code below trains the model with batch gradient descent. The process continues for the number of iterations defined earlier.

for i in range(no_of_iteration):
    # gradient accumulator, one entry per column of x_train (dummy column included)
    new_theta = np.zeros(x_train.shape[1])
    y_predicted = np.dot(x_train,theta)
    error = y_predicted - y_train

    # loop over every column so that the last attribute is updated too
    for j in range(x_train.shape[1]):
        new_theta[j] = np.sum(error*(x_train.T)[j])

    theta = theta - (1/no_of_input)*(alpha)*new_theta
    theta_history.append(theta[1])
    cost_history.append(cost(error,no_of_input))
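
The inner loop over columns can be replaced by a single matrix product, since the gradient of the cost is X.T @ error / m. The following is a sketch on synthetic data, not on the Boston data. The names X, true_theta, theta_demo and alpha_demo are assumptions for illustration, and the result is compared with NumPy's least-squares solver as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200
# Synthetic, already standardized features plus a column of ones
features = rng.normal(size=(m, 3))
X = np.hstack([np.ones((m, 1)), features])
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta_demo = np.zeros(X.shape[1])
alpha_demo = 0.1
costs = []
for _ in range(2000):
    error = X @ theta_demo - y
    costs.append(error @ error / (2 * m))
    gradient = X.T @ error / m          # all partial derivatives at once
    theta_demo = theta_demo - alpha_demo * gradient

# Closed-form least-squares solution for comparison
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_demo, theta_exact, atol=1e-6))
```

With enough iterations and a suitable learning rate, gradient descent should approach the least-squares solution, which is a useful way to check that an update rule is implemented correctly.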

We predict the values for the test dataset and then compute the error for both the training (seen) data and the test (unseen) data.

# Perform Prediction On Test Data

train_prediction = np.dot(x_train,theta)
test_prediction = np.dot(x_test,theta)

# Find Error

print('-----------------------------------------------------------')
print('Error In Seen Data')
print('Mean Absolute Error',metrics.mean_absolute_error(y_true = y_train,y_pred=train_prediction))
print('Mean Squared Error',metrics.mean_squared_error(y_true = y_train,y_pred=train_prediction))

print('\n-----------------------------------------------------------')
print('\nError In Unseen Data')
print('Mean Absolute Error',metrics.mean_absolute_error(y_true = y_test,y_pred=test_prediction))
print('Mean Squared Error',metrics.mean_squared_error(y_true = y_test,y_pred=test_prediction))
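
For reference, both metrics are simple averages over the residuals. Here is a small sketch that computes them by hand on made-up values; y_true_demo and y_pred_demo are assumptions for illustration. It also adds the root mean squared error, which is in the same units as the target.

```python
import numpy as np

# Made-up targets and predictions
y_true_demo = np.array([3.0, 5.0, 2.0])
y_pred_demo = np.array([2.5, 5.0, 4.0])

residual = y_true_demo - y_pred_demo
mae = np.mean(np.abs(residual))   # (0.5 + 0 + 2) / 3
mse = np.mean(residual ** 2)      # (0.25 + 0 + 4) / 3
rmse = np.sqrt(mse)

print(mae, mse, rmse)
```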

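Finally, we plot the cost for each iteration to check that gradient descent is converging.

# plot the cost against the iteration number

noitr = np.arange(start=1, stop=no_of_iteration+1, step=1)
plt.plot(noitr,cost_history,color ='green')
plt.title('Cost Function')
plt.xlabel("Number Of Iteration")
plt.ylabel("Cost")
plt.show()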