Modeling Data with Straight Lines using Excel

Microsoft Excel, like most graphical calculators and spreadsheet programs, has the built in ability to do linear regression of measured data using certain types of functions — lines, polynomials, logarithms, and exponents for example. However, you can get it to do any type of function — sinusoidal, natural log, whatever — if you work through the spreadsheet and can use the iterative Solver tool.

This more general approach is quite useful in teaching pre-Calculus, because the primary purpose of all the functions they have to learn is to create mathematical models (functions) based on data that can be used for predictions.

The Data

I started this year’s pre-calculus class by having them collect some data. In a simplification of the snow-melt experiment I did with the middle school last year, I had them put a beaker of water (about 300ml) on a hot plate and measure the temperature every minute as warmed up.

To make the experiment a little more interesting, I had each student in each group of four take just three consecutive measurements and try to find the equation of the straight line that best fit their data, and could be used to try to predict the other measurements of their peers in their group.

Tvt-scatter — Figure 1. Scatter plot of measured temperatures during the warming of a beaker of water on a hot plate. Data given in Table 1.

It did not quite work out as I’d hoped. Since you only need two points to find the equation of a straight line, having three points produced a little confusion. I’d hoped to produce that confusion, but hadn’t realized that I’d need to do a review of how to find the equation of a straight line. A large fraction of the class was a little bit rusty after hot months of summer.

So, we pooled all the data and reviewed how to find the equation of a straight line.

Table 1: The Data

Time (minutes)	Measured Temperature (°C)
0	22
1	26
2	31
3	36
4	40
5	44
6	48
7	53
8	58
9	61
10	65
11	68
12	71

Finding the Equation for a Straight Line using Two Points

The general equation for a straight lines is:

(1) $y = mx + b$

and we need to determine the coefficients m and b. m is the slope, which can be calculated from two points using the equation:

(2) $m = \frac{y_2 - y_1}{x_2 - x_1}$

using the points at t=6 and t=11 — the points (x₁, y₁) = (6,48) and (x₂, y₂) = (11,68) respectively — for example, gives a slope of:

$m = \frac{68 - 48}{11 - 6}$

$m = \frac{20}{5}$

$m = 4$

so our general equation becomes:

$y = 4 x + b$

to find b we substitute either one of the points into the equation for x and y. If we use the first point, x = 6, and y = 48, we get:

$48 = 4 (6) + b$
$48 = 24 + b$

$24 + b = 48$
$b = 48 - 24$
$b = 24$

and the equation of our line becomes:
(3) $y = 4 x + 24$

Now, since we’re actually looking at a relationship between temperature and time, with temperature on the y-axis and time on the x-axis, we could relabel the terms in the equation with T = temperature and t = time to have:

(4) $T = 4 t + 24$

While this equation is more satisfying to me, because I think it better describes the relationship we have, the more vocal students preferred the equation in terms of x and y (Eqn 3). These are the terms they are more familiar with in the context of a math class, and I recall seeing some evidence that students seem to learn better with the more abstract representations sometimes (though I can’t quite remember the source; I should have blogged about it).

Plotting the Data and the Modeled Straight Line

The straight line equation we came up with (Eqn. 4) is our model of the data. It’s not quite perfect. All the data do not lie on the line, although, if we did everything right, only the points (6, 48) and (11, 68) are guaranteed to be on the line.

Tvt-mvm — Figure 2. The equation of our straight line model (red line) matches the data (blue diamonds) pretty well.

I showed the class how to plot the scatter graph using MS Excel, and how to draw the line to show the modeled data. The measured data are represented as points since the measurements were made at discrete points in time. The modeled equation, however, is a continuous function, hence the straight line. The Excel sheet below (Resource 1) illustrates:

Resource 1: Excel Spreadsheet of Measured versus Modeled Data

The Best Fit Curve

The Excel spreadsheet (Resource 1) was set up so that when I entered the slope (m) and intercept (b) values, the graph would quickly update. So I went through the class. Everyone called out their slope and intercept values, I plugged them in, and they could all see how the modeled line changed slightly based on the points used to calculate it. So I put the question to them, “How can we figure out which model equation is the best?”

That’s how I was able to introduce the topic of error. What if we compared the temperature predicted by the model for each data point, to the actual value. The smaller the difference in modeled versus measured temperatures, the better the fit of the model. Indeed, if we sum all the differences, or better yet take the average of the differences, we could get a single number, we’ll call the average error (ε), that could be used to compare the different models. I used this opportunity to introduce sigma notation, which the pre-calculus students had not seen much of before.

As a first pass (which, as we’ll see below, has a major problem), the error (ε) for each point (i) is:

$\epsilon_i = (T_{measured}-T_{modeled})$

The average error is the sum of all the errors divided by the number of points (n) (we have 12 points so n=12 in this example):

(5) $\bar{\epsilon} = \frac{\sum\limits_{i=1}^{n} \epsilon_i}{n}$

Now this works, but there is one problem. I was quite pleased and a little bit surprise that one of my students recognized what it was without any coaxing and also suggested a solution: by simply taking the difference to calculate the error, a point that is offset above the modeled line can be canceled out by a point offset by the same amount below the line. So what we really need is to use the absolute value of the error.

(6) $\epsilon_i = \left| T_{measured}-T_{modeled} \right|$

This works, and is what we went with, but I did also point that what’s usually done is to use the square of the error instead of the absolute value. Squaring makes any number positive, so it accomplishes the same goal as the absolute value, and is the approach we’ll use when I go into linear regression later on.

Setting up the Excel spreadsheet to calculate the average error is fairly straightforward as shown in Resource 2:

Resource 2. Calculating the average error using Excel.

So once again, we went through the class and everyone called out their slope and intercept values and cheered when I plugged the numbers in and they saw if they had the lowest value.

It is important to remember, though, that the competition gives a somewhat random result: students’ average error is a function of the points they happened to pick, not how well they did the math (assuming everyone did the math correctly).

The Data

Finding the Equation for a Straight Line using Two Points

Plotting the Data and the Modeled Straight Line

The Best Fit Curve

Leave a Reply