Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables by fitting a linear equation of the form y = θ0 + θ1x1 + θ2x2 + ... + θnxn (geometrically a hyperplane, not a straight line).
In this post we are going to implement multiple linear regression from scratch, using gradient descent, on the Boston housing dataset.
First, let's import the required libraries. NumPy and Pandas are used for handling the dataset.
The sklearn library is used for loading the built-in Boston dataset, and its metrics module is used for calculating the different error measures.
The matplotlib library is used for data visualization, i.e. plotting the data.
sklearn.preprocessing provides the scalers used for data preprocessing.
# import libraries
import numpy as np
import pandas as pd
from sklearn import datasets, metrics
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
Next we load the Boston dataset. datasets.load_boston(return_X_y=True) returns the input features in x and the target values in y. (Note that load_boston has been removed from recent scikit-learn releases, so this example needs an older version, roughly anything before 1.2.)
# load dataset
x, y = datasets.load_boston(return_X_y=True)
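To confirm what was loaded, we can print the array shapes; the Boston dataset contains 506 samples and 13 features. A quick check, assuming x and y were loaded as above:

# quick look at the dataset dimensions
print(x.shape)  # (506, 13)
print(y.shape)  # (506,)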
We partition the dataset into a training set (the first 400 rows) and a test set (the remaining 106 rows), extract the number of rows and the number of attributes, and add a dummy column of ones to both sets so that the first coefficient of theta acts as the intercept.
# take partition for training set (features)
x_train_tmp = x[0:400, :]
# get number of rows (inputs) and number of attributes in training dataset
no_of_input = x_train_tmp.shape[0]
no_of_attributes = x_train_tmp.shape[1]
# create training dataset with one dummy attribute column of ones
x_train = np.zeros((no_of_input, no_of_attributes + 1))
x_train[:, 0] = np.ones(no_of_input)
x_train[:, 1:] = x_train_tmp
# take partition for training set (output)
y_train = y[0:400]
# take partition for testing set (features)
x_test_tmp = x[400:506, :]
# get number of rows (inputs) and number of attributes in testing dataset
no_of_input_test = x_test_tmp.shape[0]
no_of_attributes_test = x_test_tmp.shape[1]
# create testing dataset with one dummy attribute column of ones
x_test = np.zeros((no_of_input_test, no_of_attributes_test + 1))
x_test[:, 0] = np.ones(no_of_input_test)
x_test[:, 1:] = x_test_tmp
# take partition for testing set (output)
y_test = y[400:506]
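The split and bias column can also be built more compactly with np.column_stack. A minimal alternative sketch, assuming x and y are loaded as above (the full listing above is still needed for no_of_input and no_of_attributes, which are used later):

# equivalent construction of the training and test matrices with a leading column of ones
x_train = np.column_stack((np.ones(400), x[:400, :]))
y_train = y[:400]
x_test = np.column_stack((np.ones(x.shape[0] - 400), x[400:, :]))
y_test = y[400:]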
Next we preprocess the data so that each feature has a mean of 0 and a standard deviation of 1.
For that we use the StandardScaler() class; MinMaxScaler() could be used instead, which scales each feature to a fixed range (typically 0 to 1). The scaler is fitted on the training features only and then applied to both sets, so no information leaks from the test data.
# preprocessing: transform the data so the mean is 0 and the standard deviation is 1
scaler = StandardScaler()
# scaler = MinMaxScaler()
scaler.fit(x_train[:, 1:])
x_train[:, 1:] = scaler.transform(x_train[:, 1:])
x_test[:, 1:] = scaler.transform(x_test[:, 1:])
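A quick sanity check of the scaling, assuming the code above has run; every training feature should now have a mean of roughly 0 and a standard deviation of roughly 1:

print(x_train[:, 1:].mean(axis=0).round(3))  # approximately 0 for each feature
print(x_train[:, 1:].std(axis=0).round(3))   # approximately 1 for each feature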
We initialize theta randomly using NumPy's np.random.uniform() function.
We also define the number of iterations, the learning rate alpha, and two lists to store the cost history and theta history.
# randomly choose the initial values of theta
theta = np.random.uniform(0, 1, size=x_train.shape[1])
no_of_iteration = 1000
alpha = 0.01
cost_history = []
theta_history = []
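Because the starting theta is random, every run will converge along a slightly different path. If reproducible runs are wanted, a seed can be set before the initialization above (an optional extra, not part of the original listing):

np.random.seed(0)  # optional: fix the seed so the random initial theta is the same every run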
The cost() function below computes the cost for the current residuals in each iteration, J = (1 / (2m)) * sum(error^2), i.e. half the mean squared error.
# compute the cost (half the mean squared error) for the current residuals
def cost(error, m):
    J = np.dot(error.T, error) / (2 * m)
    return J
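As a small worked example, if the residual vector were [1, 2, 3] with m = 3, the cost would be (1 + 4 + 9) / (2 * 3) ≈ 2.33. A quick check, assuming the function above is defined:

print(cost(np.array([1.0, 2.0, 3.0]), 3))  # prints 2.333...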
The code below trains the model with batch gradient descent and runs for the number of iterations defined earlier. In each iteration it computes the predictions and residuals, accumulates the gradient for every coefficient (including the dummy intercept column), and updates theta.
for i in range(no_of_iteration):
    new_theta = np.zeros(x_train.shape[1])
    # predictions and residuals for the current theta
    y_predicted = np.dot(x_train, theta)
    error = y_predicted - y_train
    # gradient for every coefficient, including the dummy (intercept) column
    for j in range(x_train.shape[1]):
        new_theta[j] = np.sum(error * (x_train.T)[j])
    # batch gradient descent update
    theta = theta - (1 / no_of_input) * alpha * new_theta
    # track one coefficient and the cost over the iterations
    theta_history.append(theta[1])
    cost_history.append(cost(error, no_of_input))
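The inner loop over the attributes can also be collapsed into a single matrix product. A minimal vectorized sketch of the same batch gradient descent step, assuming x_train, y_train, alpha, no_of_input, and no_of_iteration are defined as above (theta_v is just an illustrative name):

theta_v = np.random.uniform(0, 1, size=x_train.shape[1])
for i in range(no_of_iteration):
    error = np.dot(x_train, theta_v) - y_train             # residuals for the current theta
    gradient = np.dot(x_train.T, error)                    # summed gradient over all samples
    theta_v = theta_v - (alpha / no_of_input) * gradient   # batch gradient descent update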
Finally, we predict values for the test set and report the error on both the training (seen) data and the test (unseen) data, then plot how the cost decreased over the iterations.
# Perform Prediction On Train And Test Data
train_prediction = np.dot(x_train, theta)
test_prediction = np.dot(x_test, theta)

# Find Error
print('-----------------------------------------------------------')
print('Error In Seen Data')
print('Mean Absolute Error', metrics.mean_absolute_error(y_true=y_train, y_pred=train_prediction))
print('Mean Squared Error', metrics.mean_squared_error(y_true=y_train, y_pred=train_prediction))
print('\n-----------------------------------------------------------')
print('\nError In Unseen Data')
print('Mean Absolute Error', metrics.mean_absolute_error(y_true=y_test, y_pred=test_prediction))
print('Mean Squared Error', metrics.mean_squared_error(y_true=y_test, y_pred=test_prediction))

# plot the cost against the iteration number
noitr = np.arange(start=1, stop=no_of_iteration + 1, step=1)
plt.plot(noitr, cost_history, color='green')
plt.title('Cost Function')
plt.xlabel('Number Of Iterations')
plt.ylabel('Cost')
plt.show()
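As a sanity check, the result of gradient descent can be compared against scikit-learn's closed-form solution. A minimal sketch, assuming the scaled x_train and x_test from above; LinearRegression fits its own intercept, so the dummy column is dropped:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train[:, 1:], y_train)   # drop the dummy column; LinearRegression adds its own intercept
sk_pred = reg.predict(x_test[:, 1:])
print('sklearn Mean Absolute Error (unseen data):', metrics.mean_absolute_error(y_test, sk_pred))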