Regression with scikit-learn

Linear and Polynomial Regression

Today we are going to look at how to perform regression with scikit-learn. scikit-learn is an open-source Python library for predictive data analysis that can do far more than regression. If you have not already, we recommend installing the library into your virtual environment. After going through this article, you should find it simple to build regression models and use them to make predictions.

Linear Regression with scikit-learn

Linear Regression, sometimes referred to as Simple Linear Regression, involves identifying a linear relationship between two sets of data points. In other words, if we were to plot the variables x and y on a Cartesian plane, we are attempting to draw the straight line that comes closest to all of the data points, i.e. finding the best fit line. The best fit line can be represented by the following equation:

y = mx + b
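
The snippets in this post rely on a few standard imports, some sample data, and a small plot_data helper, none of which are shown in the post itself. Below is a minimal sketch of that setup; the values in x and y are an assumption, chosen only to roughly resemble the outputs printed later.

# pip install scikit-learn pandas matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Hypothetical sample data with a roughly linear relationship
x = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = pd.DataFrame([-2, 0, 1, 4, 5, 7, 9, 11, 12, 15])

def plot_data(x, y):
    """Scatter-plot two columns of data."""
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    plt.show()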

Specifically, imagine we have a plot like the following:

plot_data(x,y)
[Plot: linearly related data points]

Given these points, there appears to be a positive linear relationship between x and y, so we would like to find the best fit line. This can be achieved by creating a LinearRegression object and calling its fit method.

# Create a Linear Regression model and fit our data to our model
lm = linear_model.LinearRegression()
lm.fit(x,y)

print("The slope is {} and the y intercept is {}".format(lm.coef_, lm.intercept_))
The slope is [[1.85454545]] and the y intercept is [-3.94545455]

As illustrated above, in two to three lines of code we were able to calculate the slope and intercept. We can then use these two values to plot our best fit line.

# Create a set of x,y values based on our intercept and coefficient
x2 = pd.DataFrame(range(0,10,1))
y2 = x2*lm.coef_ + lm.intercept_

# Plot original data and best fit line
fig, ax = plt.subplots()
ax.scatter(x,y)  #plotting original data 
ax.plot(x2,y2, color='r')  #plotting our best fit line
plt.show()
[Plot: original data points with the best fit line]
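
As a side note, the same best fit line can be generated without touching coef_ and intercept_ at all, by letting the fitted model compute the y values; a one-line alternative to the manual calculation above:

# Equivalent alternative: let the fitted model compute the line's y values
y2_alt = lm.predict(x2)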

Now that we have our linear regression model, let's try to make a prediction. For this purpose, we will see what the value of y is when x = 20. Note that predict expects a 2-D array, which is why we pass [[20]] rather than a bare 20.

# Making predictions with linear regression model
print(lm.predict([[20]]))
[[33.14545455]]
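
As a quick sanity check, this matches what we get by plugging x = 20 into y = mx + b with the fitted slope and intercept:

# Sanity check: plug x = 20 into y = mx + b manually
print(20 * lm.coef_ + lm.intercept_)  # also [[33.14545455]]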

Multiple Linear Regression with scikit-learn

In essence, simple linear regression with scikit-learn can be achieved with only a few lines of code. But scikit-learn offers more than simple linear regression. Imagine that instead of a single input variable x you have several; the same LinearRegression class can be used to generate a multiple linear regression model. This is analogous to the following equation:

y = m1x1 + m2x2 + ... + mnxn + b

To demonstrate this, imagine we have a dataset like the one below.
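
The variables x3, y3 and y4 are never constructed in the post. The sketch below is a purely hypothetical stand-in; the values are assumptions and will not reproduce the exact coefficients printed further down.

# Hypothetical data: two predictor columns (y3, y4) and one target column (x3)
x3 = pd.DataFrame(np.linspace(0, 1, 50))
y3 = pd.DataFrame(np.random.rand(50) * 10)
y4 = pd.DataFrame(np.random.rand(50) * 3)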

# Plot our two predictor columns against x3
fig, ax = plt.subplots()
ax.scatter(x3, y3)
ax.scatter(x3, y4)
plt.show()
[Plot: multiple linear regression data points]

# Combine the two predictor columns into a single dataset
mlr_ds = pd.concat([y3, y4], axis=1)
mlr_ds.head()
[Table: sample contents of our dataset]

# Create a multiple linear regression model; mlr_ds holds the
# predictors and x3 is the target
mlr = linear_model.LinearRegression()
mlr.fit(mlr_ds, x3)

print("The slope is {} and the y intercept is {}".format(mlr.coef_, mlr.intercept_))
The slope is [[ 0.00662177 -0.1424146 ]] and the y intercept is [0.49662243]

In order to perform multiple linear regression we pass the entire feature dataset into our linear regression model. Notice that for simplicity mlr_ds contains only two columns, so after fitting the model we get two values in coef_, one per feature.
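
Making a prediction works exactly as before, except that we now supply one value per feature. A quick example, where the two numbers are hypothetical placeholders:

# Predict the target for one hypothetical (y3, y4) pair of feature values
print(mlr.predict([[5.0, 1.5]]))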

Polynomial Regression with scikit-learn

Previously we saw how easily we can perform linear regression and find a straight line that best describes our data points. But what happens when the data points don't exhibit a linear relationship? As an illustration, imagine your data resembles the plot below.

plot_data(x5,y5)
[Plot: data points with a parabolic relationship]

As shown above, there still seems to be a relationship in the data, but this time it appears to follow a curve.
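
The variables x5 and y5 are likewise not defined in the post. The sketch below is one assumed way to generate data with this parabolic shape, loosely tuned to the coefficients that np.polyfit prints later (roughly y ≈ x^2 - 16.7x + 99.8):

# Hypothetical parabolic data, loosely matching the coefficients shown later
x5 = pd.DataFrame(np.arange(1, 16))
y5 = pd.DataFrame(0.99 * x5[0]**2 - 16.7 * x5[0] + 99.8 + np.random.randn(15))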

Pipeline

At this point we will use a pipeline and PolynomialFeatures to build a more flexible model. In turn, this will allow us to fit a curve to our data.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

deg = 2
pm = make_pipeline(PolynomialFeatures(degree=deg, include_bias=False), linear_model.LinearRegression())

Analogous to linear regression, in order to perform polynomial regression a few more adjustments to our code are needed. In the code above we import two additional functions from the scikit-learn library, namely PolynomialFeatures and make_pipeline.

  • Firstly, PolynomialFeatures tells our model that we are working with an nth degree polynomial, expanding each input value into the corresponding polynomial feature terms.
  • Secondly, make_pipeline tells our program to process the data as a sequence of operations that we define.

Subsequently, we need to decide whether to fit a 2nd, 3rd, or 4th degree polynomial. Since our plot looks parabolic, we settle for a 2nd degree polynomial and define deg as 2.
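
If the right degree isn't obvious from the plot, one rough heuristic is to fit several degrees and compare their R² scores on the data. This is only a sketch using our assumed x5 and y5 (higher degrees always score at least as well on training data, so treat it as a guide, not proper model selection):

# Compare fit quality for a few candidate degrees (rough heuristic)
for d in (2, 3, 4):
    candidate = make_pipeline(PolynomialFeatures(degree=d, include_bias=False),
                              linear_model.LinearRegression())
    candidate.fit(x5, y5)
    print(d, candidate.score(x5, y5))  # R^2 on the training data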

Lastly, we create our model pm by defining our pipeline. The next section describes what the pipeline does and what we want to achieve.

What is a pipeline?

As we have noted, the pipeline lets us define a sequential series of functions to be applied. Our data passes through step 1, then step 2, all the way to step n, in whatever order we define. In our case, the data first passes into PolynomialFeatures, after which it is fed into a linear regression model. In other words, we are building a 2nd degree polynomial regression on top of our linear model.
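
To make that sequencing concrete, the pipeline above is roughly equivalent to applying the two steps by hand. This is just an illustrative sketch; in practice the pipeline handles it for you:

# What the pipeline does under the hood, step by step
poly = PolynomialFeatures(degree=2, include_bias=False)
x5_poly = poly.fit_transform(x5)  # step 1: expand each x into [x, x^2]
lm_poly = linear_model.LinearRegression()
lm_poly.fit(x5_poly, y5)          # step 2: fit a linear model on the expanded features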

Fitting our polynomial regression model

Basically, once we have our pipeline the subsequent steps are similar to our linear regression approach: we fit our data points to the model and generate a set of predicted points from it. Lastly, we plot these values over the original data to visualize the results.

# Fit our data points into our pipeline
pm.fit(x5,y5)

# Predict the y values from our polynomial regression model
y5_predict = pm.predict(x5)

# Plot our results against our original plot

fig, ax = plt.subplots()
ax.scatter(x5, y5)  # plot original data points
ax.plot(x5, y5_predict, color='r')  # plot our best fit curve
plt.show()
[Plot: parabolic data points with the fitted polynomial regression curve]

By and large, with only a few more lines of code we have achieved polynomial regression. The scikit-learn functions above also allow us to manage multi-dimensional polynomial regression. Meanwhile, if you are performing polynomial regression in a two dimensional space (i.e. only [x, y]), NumPy also offers a function that determines the polynomial expression in the form:

y = ax^2 + bx + c

# Using numpy to determine our polynomial regression
pm_numpy = np.polyfit(x5[0],y5[0],2)
pm_numpy_degree = np.poly1d(pm_numpy)
print(pm_numpy_degree)
        2
0.9927 x - 16.71 x + 99.77
pm_numpy_degree
poly1d([  0.99265208, -16.70625427,  99.76688312])
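
A convenient property of the poly1d object is that it is callable, so evaluating the fitted polynomial at a new x is a one-liner. For example, sticking with the hypothetical data above:

# Evaluate the fitted polynomial at a new x value, e.g. x = 20
print(pm_numpy_degree(20))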

Summary

In summary, performing regression in Python with the scikit-learn library is fairly simple, and within ten minutes you should be able to get up and running. By contrast, achieving the same without scikit-learn would require a deeper understanding of the underlying mathematics. Today we showed how to create a simple linear model, extended it to multiple linear regression, and finished with polynomial regression. We hope this discussion shows how far the data science community has come in making analysis simple and efficient. Leave us a comment below or follow us on social media and let us know what you think.
