Linear Regression#

Linear regression or least squares regression line, chooses the line with the least distance of the data points from the line. So that positive and negative distances do not cancel each other out the distance is calculated as a square. This makes outlyers more important than they should be. The distance calculated is the vertical distance from the point to the line (both at the same x-value), and not the perpendicular distance to the point. The total effect of several points to the line:-

\[\hat y = mx + c\]

is:-

\[\begin{split}m &= \frac {n \sum(xy) - \sum x \sum y}{n \sum(x^2) - (\sum x)^2} \\ c &= \frac {\sum y - m \sum x}{n}\end{split}\]

where n is the number of points.

This method is applied to several points, so when starting to create data first use linear extrapolation (2 points), then quadratic interpolation (3 points) thereafter linear regression. It is assumed that the various modes of the linear regression will enable a forecast of the next point once the equation of the regression has been found.

After the line has been fitted one should check whether it is a good fit or not. The difference between the actual y-value and the estimated y_value at the same x-value is the residual. If the linear regression has a good fit then residuals will be scattered above and below the linear regression line. As they stand if the residuals are added together then all the values will cancel each other out, so use R-squared which evaluates the scatter of the points around the line, so a high value, almost 100%, is generally what we want. If the fit has residuals consistently on one side or the other and not scattered at random, then the data is biased and the fit needs correcting.

\[R^2 = \frac {Variance\, explained\, by\, the\, model}{Total\, Variance}\]

When in Python we are a bit spoilt for choice numpy, scipy, statsmodels, sklearn and others. Use the one that suits the application best. Here's one to add to the mix, use it to see how the linear regression works rather than use for major applications. From Regression and Inference

# https://web.stanford.edu/class/stats110/notes/Chapter7/Inference.html

import numpy as np
#%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import t as t_dbn
from scipy.stats import norm as normal_dbn

def fit_least_squares(X, Y):
    '''
    linear regression gives line as y = a + bx
    a is beta_hat_0
    b is beta_hat_1
    sigma_hat is
    '''
    X = np.asarray(X)
    n = X.shape[0]
    Y = np.asarray(Y)
    X_bar = np.mean(X)
    Y_bar = np.mean(Y)

    beta_hat_1 = (np.sum((X - X_bar) * (Y - Y_bar)) /
                  np.sum((X - X_bar)**2))
    beta_hat_0 = Y_bar - beta_hat_1 * X_bar

    resid = Y - beta_hat_0 - beta_hat_1 * X
    sigma_hat = np.sqrt(np.sum(resid**2) / (n - 2))

    return np.array([beta_hat_0, beta_hat_1]), sigma_hat, resid

X = np.array([0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8,
              2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8])
Y = np.array([5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.17, 5.24, 5.46,
              5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80])
f = plt.figure(figsize=(8,8))
ax = f.gca()
ax.scatter(X, Y)
ax.set_xlabel('Weight')
ax.set_ylabel('Length')

results = fit_least_squares(X, Y)
#print(results[0], results[1])

X_min, X_max = X.min(), X.max()
beta_hat_0, beta_hat_1 = results[0]
Y_max = beta_hat_0 + beta_hat_1 * X_max
Y_min = beta_hat_0 + beta_hat_1 * X_min
#print('X_min, X_max, Y_min, Y_max', X_min, X_max, Y_min, Y_max)
ax.plot([X_min, X_max], [Y_min, Y_max], 'k--')

#plt.show()

As an example see how using \(x = np.linspace(0, 1, 101)\) and \(y = 1 + x + x * np.random.random(len(x))\), which should give m=1.5 and c=1.0 based on this data using numpy polyfit and Polynomial. For a linear fit use polyfit(x, y, 1).

# https://pythonnumericalmethods.berkeley.edu/notebooks/chapter16.04-Least-Squares-Regression-in-Python.html
# https://stackoverflow.com/questions/18767523/fitting-data-with-numpy

import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

# generate x and y
x = np.linspace(0, 1, 101)
y = 1 + x + x * np.random.random(len(x))
x_new = np.linspace(x[0], x[-1], num=len(x)*10)
'''
# assemble matrix A
A = np.vstack([x, np.ones(len(x))]).T

# turn y into a column vector
y = y[:, np.newaxis]

# Direct least square regression
#alpha = np.dot((np.dot(np.linalg.inv(np.dot(A.T,A)),A.T)),y)
alpha = np.linalg.lstsq(A, y, rcond=None)[0]
print('m: ',alpha[0],' c: ',alpha[1])
'''
coefs = poly.polyfit(x, y, 1)
#print(coefs)

# ffit = poly.polyval(x_new, coefs)
# plt.plot(x_new, ffit, 'r')

ffit = poly.Polynomial(coefs)

yhat = coefs[1] * x + coefs[0]
R = yhat - y

plt.figure(2)
plt.plot(x, R, 'b.', label='residuals')
plt.hlines(0,x[0],x[-1], colors='green', linestyle='dashed', label='$y = 0$')
plt.xlabel('x')
plt.ylabel('Residual $\hat{y} - y$')
plt.legend()
plt.title('Residuals \nshould be random either side')

plt.figure(1)
plt.plot(x_new, ffit(x_new), 'r',label=f'linear fit \n{coefs[1]:.4} x + {coefs[0]:.4}')
#print(ffit)

# plot the results
plt.plot(x, y, 'b.', label='data points')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression using Numpy polyfit \n$ y = 1.5\,x + 1$')

plt.show()

The original script used numpy linalg.lstsq to create the linear regression function.

At present there are two polyfit modules, the older version is np.polyfit that uses \(ax^3 + bx^2 + cx +d\), and a newer version np.polynomial.polynomial.polyfit that uses the function in the reversed order \(d + cx + bx^2 + ax^3\). The latter ought to be used, but most websites still use the older version.

import numpy as np
p1d = np.poly1d([1, 2, 3])
print('old format p1d:\n',p1d)

p = np.polynomial.Polynomial(p1d.coef[::-1]) # coefficients flipped
print('new format p:',p)

old format p1d:
    2
1 x + 2 x + 3
new format p: 3.0 + 2.0·x + 1.0·x²

The polynomial package shows the coefficients also have domain and window attributes. The new polyfit can only be used for standard linear regression, whereas fit can be used with other fitting techniques.

Numpy advises using fit in the new version, the coefficients are given in the scaled domain defined by the linear mapping between the window and domain, to get coefficients in the unscaled domain use convert(). The domain is the interval [domain[0], domain[1]] mapped to the interval [window[0], window[1]] by shifting and scaling. Both domain and window have default values [-1, 1].

import numpy as np
rng = np.random.default_rng()
x = np.arange(10)
y = np.arange(10) + rng.standard_normal(10)
old = np.polyfit(x, y, deg=1)
print('old format polyfit:', old)
p_fitted = np.polynomial.Polynomial.fit(x, y, deg=1)
print('new format p_fitted:',p_fitted)
print('p_fitted.convert():',p_fitted.convert())
pf = np.polynomial.polynomial.polyfit(x, y, deg=1)
print('new format polyfit pf:',pf)

old format polyfit: [ 1.098403   -1.02020271]
new format p_fitted: 3.92261079 + 4.94281349·(-1.0 + 0.22222222x)
p_fitted.convert(): -1.02020271 + 1.098403·x
new format polyfit pf: [-1.02020271  1.098403  ]

Visualisation#

It is usual to plot the outcome to see how the regression worked out, in which case one could use Seaborn directly, but they advise using the statsmodel to show the information. One can alter Seaborn's confidence interval by changing the attribute ci for both lineplot and regplot.

After checking the outcome on a plot, a second plot should be used, this plots residuals against the x-values. The residuals should lie evenly spread close to the mean, in a random manner showing no trend.

Other Models#

The least squares method can be adapted to functions other than just the straight line.

Exponential Functions#

In numpy say we have an exponential function \(ŷ(x) = \alpha e^{\beta x}\) take log on both sides \(log(ŷ(x)) = log(\alpha) + \beta x\)

# https://pythonnumericalmethods.berkeley.edu/notebooks/chapter16.05-Least-Square-Regression-for-Nonlinear-Functions.html
# https://www.delftstack.com/howto/python/logarithmic-and-exponential-curve-fitting-python/

import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

# let's generate x and y, and add some noise into y
x = np.linspace(0.1, 10, 101)
y = 0.1*np.exp(0.3*x) + 0.1*np.random.random(len(x))

log_x = np.log(x)
log_y = np.log(y)

'''
A = np.vstack([x, np.ones(len(x))]).T
beta, log_alpha = np.linalg.lstsq(A, np.log(y), rcond = None)[0]
alpha = np.exp(log_alpha)
print(f'alpha={alpha}, beta={beta}')
'''

coefs = poly.polyfit(x, log_y, 1)
#print(coefs)

c = np.exp(coefs[0]) * np.exp(coefs[1]*x)

# Let's have a look of the data
#plt.figure(figsize = (10,8))
plt.plot(x, y, 'b.', label='data points')
#plt.plot(x, alpha*np.exp(beta*x), 'r')
plt.plot(x, c, 'r', label='polynomial fit')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Fitting Linear Regression to \n$0.1 e^{0.3x} + 0.1$ plus some noise')
plt.legend()
plt.show()

Arrhenuis Equation#

The solution for the Arrhenuis equation can be treated as for an exponential function, so by taking logs on both sides:

\[\begin{split}k &= Ae^{-E_a/RT} \\ ln\, k &= ln\, A - \frac {E_a}{R} \frac {1}{T}\end{split}\]

\[ln\, k = -\frac {E_a}{R} \left( \frac {1}{T} \right) +ln\, A \tag 1\]

where:

k rate constant
T absolute temperature, K
A pre-exponential factor
\(E_a\) activation energy
R universal gas constant

The last equation (1) treats the reciprocal temperature as the x variable in a straight line equation \(y = m\,x + c\).

Simple Power#

A similar process used for exponential functions can be used on simple power expressions. So a function \(ŷ(x) = b\, x^m\) can be turned into a linear form by taking logs on both sides \(log(ŷ(x)) = m\, log(x) + log\, b\).

# https://www.delftstack.com/howto/python/logarithmic-and-exponential-curve-fitting-python/

import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

x = np.linspace(0.1, 10, 101)
y = 2*x**0.3 + 0.1*np.random.random(len(x))

log_x = np.log(x)
log_y = np.log(y)

pcoeffs = poly.polyfit(log_x, y, 1)
pc = pcoeffs[1] * log_x + pcoeffs[0]

plt.plot(log_x, y, "o", label='log data points')
plt.plot(log_x, pc, 'r', label='log linear fit')

plt.plot(x, y, "co", label='data points')
plt.plot(x, pc, 'm', label='linear fit')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Fitting Linear Regression to \n$2 x^{0.3x}$ plus some noise')
plt.legend()
plt.show()

Polynomial#

If the underlying function is a polynomial then the simplest numpy functions to use is y_est = poly.polyfit(x, y, j) where x and y are data numpy arrays, and j is the polynomial order. Assuming that a larger number of x-values will be required for plotting, x_new, plot use x_new, np.polyval(x_new, y_est) for the x and y values.

When deciding which polynomial fits best remember to set the attribute full to True. The term that follows immediately after the coefficient array is the residuals array, which is the sum of squared residuals of the least squares fit.

# https://stackoverflow.com/questions/18767523/fitting-data-with-numpy
import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

x = np.array([ 3.08,  3.1 ,  3.12,  3.14,  3.16,  3.18,  3.2 ,  3.22,  3.24,
    3.26,  3.28,  3.3 ,  3.32,  3.34,  3.36,  3.38,  3.4 ,  3.42,
    3.44,  3.46,  3.48,  3.5 ,  3.52,  3.54,  3.56,  3.58,  3.6 ,
    3.62,  3.64,  3.66,  3.68])

y = np.array([ 0.000857,  0.001182,  0.001619,  0.002113,  0.002702,  0.003351,
    0.004062,  0.004754,  0.00546 ,  0.006183,  0.006816,  0.007362,
    0.007844,  0.008207,  0.008474,  0.008541,  0.008539,  0.008445,
    0.008251,  0.007974,  0.007608,  0.007193,  0.006752,  0.006269,
    0.005799,  0.005302,  0.004822,  0.004339,  0.00391 ,  0.003481,
    0.003095])

x_new = np.linspace(x[0], x[-1], num=len(x)*10)

coefs = poly.polyfit(x, y, 4)

ffit = poly.polyval(x_new, coefs)
plt.plot(x_new, ffit, 'r', label='polynomial fit')

'''
# an alternative method to plot
ffit = poly.Polynomial(coefs)    # instead of np.poly1d
plt.plot(x_new, ffit(x_new))
'''

plt.scatter(x, y, color='k', label='data points')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.title('Fitting 4th Degree Polynomial to Custom Data')
plt.show()

On Seaborn add the attribute order with the polynomial order for non-linear plots. If an outlyer ruins the regression line use robust=True then the outlyer is ignored.