I will use Python 3.7 throughout this tutorial series. The easiest way to get started with Python for machine learning is to install Anaconda, which bundles almost all the tools and libraries you will need. Jupyter Notebook is installed along with Anaconda, and it is what I am using for this tutorial.
We will use a housing price dataset to build a linear regression prediction model, and then evaluate how well the trained model predicts.
Code explanation:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
The lines above import all the libraries we need.
data=pd.read_csv('kc_house_data.csv')
This reads the whole dataset from the CSV file into a pandas DataFrame.
data.head()
This shows the first five rows, which gives a quick overview of the dataset.
X=data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
'view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
'sqft_living15', 'sqft_lot15']]
Choosing the necessary columns for input data (X).
y=data['price']
Choosing the output column (y).
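Here is a small sketch of why X is selected with double brackets and y with single brackets: a list of column names returns a DataFrame (2-D), while a single column name returns a Series (1-D). The toy DataFrame and the names `toy`, `X_toy` and `y_toy` are made up for illustration and are not the housing data.

```python
import pandas as pd

# Made-up toy data standing in for the housing dataset
toy = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "sqft_living": [1180, 770, 1960],
    "price": [221900, 180000, 604000],
})

X_toy = toy[["bedrooms", "sqft_living"]]  # list of names -> DataFrame (2-D)
y_toy = toy["price"]                      # single name -> Series (1-D)

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```

scikit-learn expects the input features as a 2-D table and the target as a 1-D array, which is why the two selections are written differently.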
X_train, X_test, y_train, y_test=train_test_split(X,y,random_state=0)
Splitting the whole dataset into train and test data. The first part, as the name suggests, is used to train the regression model. The second part is held back and used to check how well the model predicts on data it has not seen.
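To see what the split does, here is a minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are only for illustration). When no size is given, train_test_split keeps 25% of the rows for testing, and passing the same random_state gives the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 feature columns
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Splitting again with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```

Fixing random_state is what makes the results in this tutorial repeatable from one run to the next.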
linreg=LinearRegression().fit(X_train,y_train)
This creates the model and trains it on the training data.
acc=linreg.score(X_test,y_test)
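We are calculating the score of the model on the test data. For LinearRegression, score returns the R² value (the coefficient of determination), which is what I loosely call accuracy here. In my case I get 0.6817, meaning the model explains around 68% of the variation in price. Not great, but good enough for now.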
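To make it clearer what score measures, here is a minimal sketch that computes R² by hand and compares it with score. The data is randomly generated for illustration, so the number it prints has nothing to do with the 0.6817 above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: y depends linearly on three features, plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# R^2 = 1 - (sum of squared errors) / (total sum of squares around the mean)
pred = model.predict(X_te)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_te, y_te))  # the two values match
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price.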
Now, let's check what it predicts for some input data, say the row at position 5 of the input (X) of the test dataset (the sixth row, since positions start at 0).
linreg.predict(X_test.iloc[[5]])
I get a prediction of 371119.19927172 against an actual value of 29700. As I said, not great, but good enough for now. In the subsequent tutorials we will look at different ways to gradually improve the predictions.
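Here is a rough sketch of how a single prediction can be compared with its actual value. The DataFrame `demo` and its numbers are randomly generated for illustration, so the output will not match the housing figures above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data loosely shaped like the housing columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, size=100),
    "bedrooms": rng.integers(1, 6, size=100),
})
demo["price"] = (150 * demo["sqft_living"] + 10000 * demo["bedrooms"]
                 + rng.normal(scale=20000, size=100))

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["sqft_living", "bedrooms"]], demo["price"], random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

row = X_te.iloc[[5]]               # double brackets keep one 2-D row
predicted = model.predict(row)[0]  # predict returns an array, take its only item
actual = y_te.iloc[5]              # same position in the test target

print(f"predicted={predicted:.0f} actual={actual:.0f} error={predicted - actual:.0f}")
```

Note that X_test.iloc[[5]] uses double brackets on purpose: predict expects a 2-D input, and iloc[5] with single brackets would return a 1-D Series instead.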