# Linear Models

The data consist of the winning times for the men’s 400m event in the Summer Olympics, for 1948 through 2008. Looking at the 60-year period overall, the data exhibit a moderately strong downward linear trend.

The regression line predicts a winning time of 43.1 seconds for the 2012 Summer Olympics, which would be nearly 0.4 second less than the existing Olympic record of 43.49 seconds, quite a feat!
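The 43.1-second prediction can be reproduced from the 16 winning times listed in the data table of this handout. The following is a minimal sketch in plain Python, not the spreadsheet procedure used in the course; the helper name `fit_line` is hypothetical and simply applies the usual least-squares formulas for slope and intercept.

```python
# Winning times (seconds) for the men's 400 m, Summer Olympics 1948-2008,
# copied from the data table in this handout.
years = list(range(1948, 2012, 4))  # 1948, 1952, ..., 2008 (16 Games)
times = [46.20, 45.90, 46.70, 44.90, 45.10, 43.80, 44.66, 44.26,
         44.60, 44.27, 43.87, 43.50, 43.49, 43.84, 44.00, 43.75]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(years, times)
prediction_2012 = slope * 2012 + intercept

print(f"y = {slope:.4f}x + {intercept:.2f}")
print(f"Predicted 2012 winning time: {prediction_2012:.1f} seconds")
```

Rounded to one decimal place, the prediction agrees with the 43.1 seconds quoted above.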

Will the regression line’s prediction be accurate? In the last two decades, there appears to be more of a cyclical (up and down) trend. Could winning times continue to drop at the same average rate? Extensive searches for talented potential athletes and improved full-time training methods can lead to decreased winning times, but ultimately, there will be a physical limit for humans.

Note that there were two unusual data points, 46.70 seconds in 1956 and 43.80 seconds in 1968, which lie far above and far below the regression line, respectively.

If we restrict ourselves to the most recent winning times, from 1972 onward (10 winning times), we have the following scatterplot and regression line.

[Figure: scatterplot “Summer Olympics: Men’s 400 Meter Dash Winning Times”, with Year on the horizontal axis and Time (seconds) on the vertical axis, showing the regression line *y* = −0.025*x* + 93.834 with *R*² = 0.5351.]

Using the most recent ten winning times, our regression line is *y* = −0.025*x* + 93.834.

When *x* = 2012, the prediction is *y* = −0.025(2012) + 93.834 ≈ 43.5 seconds. This line predicts a winning time of 43.5 seconds for 2012, which would be an excellent time, close to the existing record of 43.49 seconds but not dramatically below it.
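As a check on the arithmetic, the sketch below refits the line to the ten winning times from 1972 through 2008 and evaluates it at *x* = 2012. It is an illustration in plain Python with hypothetical variable names, not the software used in the course. Note that the rounded equation and the unrounded coefficients give slightly different values, but both round to 43.5 seconds.

```python
# Ten most recent winning times (seconds), 1972 through 2008,
# copied from the data table in this handout.
recent_years = list(range(1972, 2012, 4))  # 1972, 1976, ..., 2008
recent_times = [44.66, 44.26, 44.60, 44.27, 43.87,
                43.50, 43.49, 43.84, 44.00, 43.75]

n = len(recent_years)
mean_x = sum(recent_years) / n
mean_y = sum(recent_times) / n

# Least-squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in recent_years)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(recent_years, recent_times))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Prediction with the unrounded coefficients.
prediction_exact = slope * 2012 + intercept
# Prediction with the rounded equation quoted in the text.
prediction_rounded = -0.025 * 2012 + 93.834

print(f"y = {slope:.3f}x + {intercept:.3f}")
print(f"2012 prediction (unrounded coefficients): {prediction_exact:.2f}")
print(f"2012 prediction (rounded equation):       {prediction_rounded:.2f}")
```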

Note too that for *r*² = 0.5351 and a negatively sloping line, the correlation coefficient is *r* = −√0.5351 ≈ −0.73, which is not as strong as when we considered the time period going back to 1948. The most recent set of 10 winning times does not visually exhibit as strong a linear trend as the set of 16 winning times dating back to 1948.
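The relationship between *r*² and *r* can also be checked by computing the correlation coefficient directly from the data. The sketch below uses a hypothetical helper `pearson_r` that implements the standard Pearson formula; it compares the ten most recent winning times with the full set of 16.

```python
import math

years = list(range(1948, 2012, 4))  # 1948, 1952, ..., 2008
times = [46.20, 45.90, 46.70, 44.90, 45.10, 43.80, 44.66, 44.26,
         44.60, 44.27, 43.87, 43.50, 43.49, 43.84, 44.00, 43.75]


def pearson_r(xs, ys):
    """Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# The last 10 entries are the Games from 1972 through 2008.
r_recent = pearson_r(years[-10:], times[-10:])
r_full = pearson_r(years, times)

print(f"1972-2008: r = {r_recent:.2f}, r^2 = {r_recent ** 2:.4f}")
print(f"1948-2008: r = {r_full:.2f}, r^2 = {r_full ** 2:.4f}")
```

Because the slope is negative, *r* is the negative square root of *r*². The full data set gives the larger |*r*|, consistent with the stronger linear trend noted above.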

**CONCLUSION:**

I have examined two linear models, using different subsets of the Olympic winning times for the men’s 400 meter dash, and both have moderately strong negative correlation coefficients. One model uses data extending back to 1948 and predicts a winning time of 43.1 seconds for the 2012 Olympics, and the other model uses data from the most recent 10 Olympic Games and predicts 43.5 seconds. My guess is that 43.5 will be closer to the actual winning time. We will see what happens later this summer!

#### UPDATE: When the race was run in August, 2012, the winning time was 43.94 seconds.

**Scatterplots, Linear Regression, and Correlation**

When we have a set of data, often we would like to develop a model that fits the data.

First we graph the data points (*x*, *y*) to get a scatterplot. Take the data, determine an appropriate scale on the horizontal axis and the vertical axis, and plot the points, carefully labeling the scale and axes.

Summer Olympics: Men’s 400 Meter Dash Winning Times

| Year (x) | Time (y) (seconds) |
|---|---|
| 1948 | 46.20 |
| 1952 | 45.90 |
| 1956 | 46.70 |
| 1960 | 44.90 |
| 1964 | 45.10 |
| 1968 | 43.80 |
| 1972 | 44.66 |
| 1976 | 44.26 |
| 1980 | 44.60 |
| 1984 | 44.27 |
| 1988 | 43.87 |
| 1992 | 43.50 |
| 1996 | 43.49 |
| 2000 | 43.84 |
| 2004 | 44.00 |
| 2008 | 43.75 |

| Burger | Fat (x) (grams) | Calories (y) |
|---|---|---|
| Wendy’s Single | 20 | 420 |
| BK Whopper Jr. | 24 | 420 |
| McDonald’s Big Mac | 28 | 530 |
| Wendy’s Big Bacon Classic | 30 | 580 |
| Hardee’s The Works | 30 | 530 |
| McDonald’s Arch Deluxe | 34 | 610 |
| BK King Double Cheeseburger | 39 | 640 |
| Jack in the Box Jumbo Jack | 40 | 650 |
| BK Big King | 43 | 660 |
| BK King Whopper | 46 | 730 |

Data from 1997

If the scatterplot shows a relatively linear trend, we try to fit a linear model, to find a line of best fit.

We could pick two arbitrary data points and find the line through them, but that would not necessarily provide a good linear model representative of all the data points.

A mathematical procedure that finds a line of “best fit” is called linear regression. This procedure is also called the method of least squares, as it minimizes the sum of the squares of the deviations of the points from the line. In MATH 107, we use software to find the regression line. (We can use Microsoft Excel, or Open Office, or a hand-held calculator or an online calculator — more on this in the Technology Tips topic.)
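To illustrate what “minimizes the sum of the squares of the deviations” means, the sketch below compares two candidate lines for the burger data: a line through two arbitrarily chosen points, and the least-squares line. It is a plain Python illustration rather than the spreadsheet workflow used in the course; the function name `sum_squared_deviations` and the choice of the first and last table rows for the two-point line are assumptions made for the example.

```python
# Fat (grams) and calories for the ten burgers in the table above.
fat = [20, 24, 28, 30, 30, 34, 39, 40, 43, 46]
calories = [420, 420, 530, 580, 530, 610, 640, 650, 660, 730]


def sum_squared_deviations(slope, intercept, xs, ys):
    """Sum of squared vertical deviations of the points from the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Candidate 1: the line through two arbitrary data points
# (first and last rows of the table; this choice is an assumption).
two_point_slope = (calories[-1] - calories[0]) / (fat[-1] - fat[0])
two_point_intercept = calories[0] - two_point_slope * fat[0]

# Candidate 2: the least-squares line from the closed-form formulas.
n = len(fat)
mean_x = sum(fat) / n
mean_y = sum(calories) / n
ls_slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(fat, calories))
            / sum((x - mean_x) ** 2 for x in fat))
ls_intercept = mean_y - ls_slope * mean_x

sse_two_point = sum_squared_deviations(
    two_point_slope, two_point_intercept, fat, calories)
sse_least_squares = sum_squared_deviations(
    ls_slope, ls_intercept, fat, calories)

print(f"Two-point line:     sum of squared deviations = {sse_two_point:.1f}")
print(f"Least-squares line: sum of squared deviations = {sse_least_squares:.1f}")
```

Whichever two points are chosen, the least-squares line never has a larger sum of squared deviations, which is the sense in which it is the line of “best fit”.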

Linear regression software also typically reports parameters denoted by *r* or *r*².

The real number *r* is called the correlation coefficient and provides a measure of the strength of the linear relationship.

*r* is a real number between −1 and 1.

*r* = 1 indicates perfect positive correlation: the regression line has positive slope and all of the data points are on the line.

*r* = −1 indicates perfect negative correlation: the regression line has negative slope and all of the data points are on the line.
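The two extreme cases can be checked numerically. The sketch below builds two small made-up data sets whose points lie exactly on a line, one with positive slope and one with negative slope, and computes *r* for each with a hypothetical helper `pearson_r`. The data values are invented purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
ys_rising = [2 * x + 1 for x in xs]     # every point on y = 2x + 1
ys_falling = [-3 * x + 10 for x in xs]  # every point on y = -3x + 10

r_rising = pearson_r(xs, ys_rising)
r_falling = pearson_r(xs, ys_falling)

print(f"Points on a rising line:  r = {r_rising:.2f}")
print(f"Points on a falling line: r = {r_falling:.2f}")
```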

The closer |*r*| is to 1, the stronger the linear correlation. If *r* = 0, there is no linear correlation at all. The following examples provide a sense of what an *r* value indicates.