Introduction
In the chapter on correlation, we looked at the linear relationship between two variables. The correlation analysis produced a measure of association (r) that ranges from -1 to +1. If two variables have a positive or negative association, then knowing one variable will help reduce the error in predicting the other. In the extreme case, if the correlation between X and Y were +1 or -1, then we could perfectly predict X from Y, or Y from X, with no error. This chapter extends our understanding of two variables by showing us exactly how we can predict one variable from the other.
 
In order to figure out how we can predict Y from X when X and Y are correlated, we need to remember that correlation is a linear association. This means that any prediction we make will be described by a linear equation (i.e., a line).

All linear equations take the form Y=MX+B, where M is the slope and B is the intercept.
Therefore, our goal is to find the M (the slope) and the B (the intercept) that combine to make the linear equation that produces the best prediction.
 
Before we can find the slope and intercept of the line that will help us predict Y from X, we need to realize that there are infinitely many possible slopes and intercepts. Every line is defined by a unique combination of slope and intercept. In a regression analysis, our goal is to find the one combination of slope and intercept that produces the best prediction of Y from X. A picture will help...


Example of 3 different lines. Each line has a unique combination of slope and intercept

In the display above, we see that the pink, blue, and black lines are all unique, and therefore each line is defined by a unique combination of slope and intercept. We can also see that the black line is the steepest, which means it has the highest slope (i.e., the largest M value). The intercept value (B) is the point where the line crosses the Y axis. In other words, it is the Y value of the line when X=0. In the display above, the pink line has an intercept of 2 because the pink line has a Y value of 2 when the X value is 0.
 
The challenge of a regression equation is to find the line that best predicts the actual data values. But, of course, the actual data values rarely make a perfect line. In fact, the only time the actual data make a perfect line is when the correlation is +1 or -1, which is almost never.

Sample correlated data from an experiment

As we can see above, the X and Y values are highly correlated, but the points do not make a perfect line. Therefore, there are no lines that will perfectly predict Y from X -- there will always be some error in prediction. Our task is to find the best prediction line, but how do we do that?
 
The key to the regression equation is to identify a way of measuring prediction error. In order to do that, let's consider how we've measured error in the other analyses. Remember that we generally defined error using squared deviations (e.g., as in the standard deviation), so maybe we can use that same concept with the regression equation. If we measure error in squared deviations, then it makes sense to compare the actual data values with the values predicted from the line, and this produces an error for each particular XY pair. Then, if we add up all of the errors, we have an estimate of the total error between the predicted values and the actual values. In formula form, this would be ...

Total Error = Σ(Ypredicted - Yactual)²

Indeed, this is the actual formula we will use as our definition of error. Fortunately, each line will produce a unique amount of error (don't worry about why this is true). Because each line will produce a unique error value, there will be only one line that produces the minimum error. Here's the data we saw before with two different prediction lines...

Sample data with two regression lines -- each with an error value

Notice that both the pink and the black lines fit the data pretty well, but the black line fits the data better because it has less overall error as defined by the squared deviations. This difference may not be readily apparent here because the error is defined by the formula above, which involves many calculations. Still, a trained eye can tell that the pink line has more total error because the pink line is pretty far from the actual data at the X values of 7 and 9. Since these deviations get squared, these two deviations account for most of the total error.
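To make this comparison concrete, here is a minimal Python sketch of the error calculation. The data points and the two candidate lines below are made up for illustration; they are not the exact values from the figure.

```python
# Total squared error between a candidate line's predictions and the actual data.

def total_error(xs, ys, m, b):
    """Sum of squared deviations: sigma of (Ypredicted - Yactual)^2."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data resembling the scatterplot above (not the real values).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 7, 9, 10, 14, 13, 18, 17]

print(total_error(xs, ys, 2.0, 0.5))  # one candidate line
print(total_error(xs, ys, 1.6, 2.5))  # another candidate line
# Whichever line yields the smaller total is the better predictor.
```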
 
Of course, the line with the minimum error will be the 'best prediction' line -- we will call it the 'Least Squares Regression Line' because it produces the smallest amount of squared error. So now we know that there is an answer, but we still have to figure out how to find the slope and intercept that create the best prediction line. Before we do that, we will review slopes and intercepts for those students who may have forgotten exactly how they work. Students who understand slopes and intercepts well can skip past this review and move on to the 'Calculating the regression equation' section.
 
 
Slope and Intercept
The slope indicates how 'steep' the line relating X and Y is. In more precise mathematical terms, the slope indicates how much Y changes per 1-unit change in X. Consider the display below showing some lines:

Lines with different slopes

As we can see, we have three different lines with three different positive slopes...

 
Negative slopes are also possible: when Y decreases as X increases, the slope is negative. Consider the display below:

Lines with negative slopes

As we can see, we have three different lines with three different negative slopes...

 

The intercept indicates where the regression line 'crosses' the Y axis. More precisely, the intercept (sometimes called the 'Y intercept') is the Y value that is paired with the X value of 0. Mathematically, it is easy to see that if the line has the equation Y = MX + B, then the B value is the intercept, because Y = M(0) + B reduces to Y = B when X = 0.


Consider the display below showing three lines with the same slope but different intercepts.


Lines with the same slope but different intercepts

As the display shows, lines with the same slopes can have different intercepts, depending upon where the line crosses the Y axis.

 
As mentioned before, every line is defined by a unique combination of slope and intercept. Here is the picture we saw in the last section...

Three lines with different slopes and intercepts

Now that we've brushed up on our slopes and intercepts, we can move on to finding the best predicting line for a set of actual data.
 
 
Calculating the regression equation
In the previous sections, we learned that each candidate regression line produces a unique error value, but only one line produces the minimum error. This section describes how we find that line and how we measure the prediction error in a standardized way. Before we do that, it must be noted that the reason the calculation below produces the best predicting regression line is hard to visualize and understand. In other words, we will learn the formula here, but it is beyond the scope of this website to show why this formula minimizes the error. In essence, this is a computational formula (we use it without understanding the concept behind it), but in this case a computational formula is necessary because the concept is too difficult to explain here.
 
So how do we calculate the regression line that best predicts actual data?
Step 1: Calculate the mean of X. This is easy.

X̄ = (ΣXᵢ)/N

Step 2: Calculate the mean of Y. This is easy.

Ȳ = (ΣYᵢ)/N

Step 3: Calculate the sum of deviation cross products. We go through each XY pair and multiply the X deviation from the mean by the Y deviation from the mean -- this product is called the 'deviation cross product'. We then add up all of these deviation cross products, which looks like this as a formula...

Σ(Xᵢ - X̄)(Yᵢ - Ȳ)

Step 4: Calculate the sum of squares for X. Here we add up the squared X deviations from the mean.

Σ(Xᵢ - X̄)²

Step 5: Calculate the slope of the regression line. The slope of the regression line is the sum of the deviation cross products divided by the sum of squares for X. As a formula...

Slope (M) = (Sum of Deviation Cross Products)/(Sum of Squares for X)
Slope (M) = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²

Step 6: Calculate the intercept of the regression line. We find the intercept by substituting the slope just calculated, along with the X and Y means, into the equation. As a formula...

Intercept (B) = Ȳ - M(X̄)

Step 7: Calculate the equation of the regression line.

Y = MX + B

Steps for finding the regression line with the least error (Best Prediction Line)
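As an illustration only (not part of the chapter's required calculations), the seven steps above can be written directly as a short Python function:

```python
def regression_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line,
    following Steps 1-7 above."""
    n = len(xs)
    mean_x = sum(xs) / n                           # Step 1: mean of X
    mean_y = sum(ys) / n                           # Step 2: mean of Y
    scp = sum((x - mean_x) * (y - mean_y)          # Step 3: sum of deviation
              for x, y in zip(xs, ys))             #         cross products
    ss_x = sum((x - mean_x) ** 2 for x in xs)      # Step 4: sum of squares for X
    m = scp / ss_x                                 # Step 5: slope M
    b = mean_y - m * mean_x                        # Step 6: intercept B
    return m, b                                    # Step 7: Y = MX + B
```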

 
Of course, we need to solve a problem with actual data.

Consider the regression equation that predicts Depression from Drinks per month.

Here are pairs of scores.
Drinks per month: 21, 0, 12
Depression: 9, 3, 6

What is the equation of the least squares regression line?


Step 1: Calculate X̄. X̄ = (ΣXᵢ)/N
X̄ = (21 + 0 + 12)/3 = 33/3 = 11
Step 2: Calculate Ȳ. Ȳ = (ΣYᵢ)/N
Ȳ = (9 + 3 + 6)/3 = 18/3 = 6
Step 3: Calculate the sum of deviation cross products. Σ(Xᵢ - X̄)(Yᵢ - Ȳ)
Σ(Xᵢ - X̄)(Yᵢ - Ȳ) = (21 - 11)(9 - 6) + (0 - 11)(3 - 6) + (12 - 11)(6 - 6) = 63
Step 4: Calculate the sum of squares for X. Σ(Xᵢ - X̄)²
Σ(Xᵢ - X̄)² = (21 - 11)² + (0 - 11)² + (12 - 11)² = 222
Step 5: Calculate the slope of the regression line. Slope (M) = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²
Slope (M) = 63 / 222 = 0.2838
Step 6: Calculate the intercept of the regression line. Intercept (B) = Ȳ - M(X̄)
Intercept (B) = 6 - 0.2838(11) = 2.8782
Step 7: Calculate the equation of the regression line. Y = MX + B
Y = 0.2838X + 2.8782
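As a quick check, the regression_line sketch from the previous section reproduces this result. (It keeps full precision for the slope, so the intercept comes out as 2.8784 rather than the 2.8782 obtained above from the rounded slope.)

```python
m, b = regression_line([21, 0, 12], [9, 3, 6])
print(round(m, 4), round(b, 4))  # 0.2838 2.8784 (full-precision slope)
```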
 
If the problem has a large number of pairs, then the sum of the deviation cross products and the sum of squares for X are often given...

Consider the regression equation that predicts SAT Score from IQ.

Here are pairs of scores.
IQ: 103, 103, 98, 119, 123, 126, 86, 104, 106, 122
SAT_Score: 565, 534, 512, 701, 708, 754, 554, 495, 555, 666
X̄ = 109, Ȳ = 604.4, Σ(Xᵢ - X̄)(Yᵢ - Ȳ) = 9290, Σ(Xᵢ - X̄)² = 1510
What is the equation of the least squares regression line?
Answer: Y = 6.1523X - 66.2007
Step 1: Calculate the slope of the regression line. Slope (M) = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²
Slope (M) = 9290 / 1510 = 6.1523
Step 2: Calculate the intercept of the regression line. Intercept (B) = Ȳ - M(X̄)
Intercept (B) = 604.4 - 6.1523(109) = -66.2007
Step 3: Calculate the equation of the regression line. Y = MX + B
Y = 6.1523X - 66.2007
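When the sums are given like this, only the last three steps are needed. Here is a small sketch using the summary values from this example:

```python
scp = 9290                   # given: sum of deviation cross products
ss_x = 1510                  # given: sum of squares for X
mean_x, mean_y = 109, 604.4  # given means

m = scp / ss_x               # slope: 6.1523...
b = mean_y - m * mean_x      # intercept: -66.2026 at full precision
                             # (-66.2007 when the slope is rounded first)
print(round(m, 4), round(b, 4))
```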
 


And here's what a plot of the regression line looks like with the actual data...

Plot of actual data with least squares regression line

 
So now that we've learned how to calculate the least squares regression line, we need to learn one more concept before we are finished with regression. Once we know what the best prediction line is, it would be nice to define how much error is present in the prediction line. Of course, it makes sense to use the formula we learned at the beginning of the chapter to estimate error...

Total Error = Σ(Ypredicted - Yactual)²

This formula represents the error for any line predicting the actual X and Y values, but it has a special name when the error is associated with the least squares regression line that we calculated...

SSresidual = Σ(Ypredicted - Yactual)²

The SSresidual is helpful as a measure of error, but it is not standardized -- data sets with a large number of pairs will have large total errors, and data sets with a small number of pairs will have small total errors. So we need to standardize the error by taking the number of pairs into account. The easiest way to do this is to divide the total error by some number. You might be inclined to suggest N-1, where N is the number of pairs, and this is a good idea, but not quite right. As it turns out, we divide the total error by N-2 instead of N-1, because error only starts happening with more than 2 pairs of values. In other words, if we had only two X and two Y values, we would never have any error, because we could always draw a perfect line between the first XY pair and the second XY pair.
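A two-line sketch (reusing the regression_line and total_error functions from earlier) illustrates this claim: the least squares line through any two points with different X values fits them perfectly, so the error is zero. The points here are arbitrary.

```python
m, b = regression_line([1, 5], [3, 11])    # any two points with different X values
print(total_error([1, 5], [3, 11], m, b))  # 0.0 -- the line passes through both
```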

So our new standardized estimate is...

SSresidual / (N-2)

One final thing: because we are dealing with standardized error, we use the non-squared units (just like the standard deviation), so our final standardized error estimate is ...

√( SSresidual / (N-2) )


We have a name for this -- it is called the 'Standard Error of the Estimate'.
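Putting the pieces together, here is a sketch of the standard error of the estimate for a given line and data set:

```python
from math import sqrt

def standard_error_of_estimate(xs, ys, m, b):
    """sqrt(SSresidual / (N - 2)) for the line Y = MX + B."""
    ss_residual = sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))
    return sqrt(ss_residual / (len(xs) - 2))
```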
 
Let's look at an example of calculating this for a set of points and a least squares regression line (Y=2.13X+4):
Actual X score   Actual Y score   Predicted Y   Squared Error (Ypredicted - Yactual)²
0                5                4.00          1.00
1                7                6.13          0.75
2                8                8.27          0.07
3                11               10.40         0.36
4                8                12.53         20.55
5                15               14.67         0.11
6                17               16.80         0.04
7                22               18.93         9.40
8                18               21.07         9.40
9                25               23.20         3.24
SSresidual = Σ = 44.93
 
Actual and Predicted Values to Assist in Calculation of Standard Error of the Estimate
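Running the standard_error_of_estimate sketch on this data set reproduces the table's total (allowing for small rounding differences in the predicted values):

```python
xs = list(range(10))                       # actual X scores, 0 through 9
ys = [5, 7, 8, 11, 8, 15, 17, 22, 18, 25]  # actual Y scores from the table
print(standard_error_of_estimate(xs, ys, 2.13, 4.0))
# SSresidual is about 44.93, so this prints about sqrt(44.93 / 8) = 2.37
```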

 
Definitions
Least Squares Regression Line: A line that produces the smallest amount of error in matching predicted values to actual values. Also referred to here as the 'best prediction' line.
 
Linear Equation (defines a line): Formula: Y=MX+B

where M is the slope and B is the intercept
 
Prediction Error for a Regression Line: Total Error = Σ(Ypredicted - Yactual)²
 
Standard Error of the Estimate: Standardized prediction error for a least squares regression line

Formula:
√( SSresidual / (N-2) )
 
Easy Questions
1. The two components of the regression line equation are ...
2. What is the general form of the regression line equation?
3. The intercept in the regression equation is the Y value where the line crosses ....
4. The BEST regression line equation minimizes the error between the actual values and those predicted by the regression line. Error is defined as ...
5. The term for the standardized error in the regression line is ...

6. What is the denominator in the equation for the standard error of the estimate?
Medium Questions
7. If the sum of squared differences (actual - predicted) is 100 and the number of pairs is 8 then what is the standard error of the estimate?
8. How much error is there in the least squares regression line with 2 pairs of XY values?
9. If the sum of squared differences (actual - predicted) is 14 and the number of pairs is 16 then what is the standard error of the estimate?
10. Consider the regression equation that predicts wife_age from husband_age.

Here are pairs of scores.
husband_age: 46, 27, 19, 41, 72, 90, 41, 72, 90, 41
wife_age: 43, 29, 18, 65, 69, 80, 39, 66, 89, 40
X̄ = 53.9, Ȳ = 53.8, Σ(Xᵢ - X̄)(Yᵢ - Ȳ) = 4938.8, Σ(Xᵢ - X̄)² = 5764.9
What is the equation of the least squares regression line?
11. Consider the regression equation that predicts Depression from Stress.

Here are pairs of scores.
Stress: 5, 7, 9, 3
Depression: 4, 8, 6, 2

What is the equation of the least squares regression line?