Regression with Gnumeric

A test regression (linear equation) using Gnumeric.

Finally, I’ve found a spreadsheet application (Gnumeric) with a reliable Solver for doing regressions. And it’s free. The only tricky part is that there’s no native port for Macs; you have to use a command line package manager to install it (I used Fink).

Gnumeric, however, seems to be an excellent tool for data analysis.

Warming of the West Antarctic Ice Sheet

… a breakup of the ice sheet, … could raise global sea levels by 10 feet, possibly more.

— Gillis (2012): Scientists Report Faster Warming in Antarctica in The New York Times.

In an excellent article, Justin Gillis highlights a new paper that shows the West Antarctic Ice Sheet to be one of the fastest-warming places on Earth.

The black star shows the Byrd Station. The colors show the number of melting days over Antarctica in January 2005. This number increases with warming temperatures (image from supplementary material in Bromwich et al., 2012).

Note to math students: The scientists use linear regression to get the rate of temperature increase.

The record reveals a linear increase in annual temperature between 1958 and 2010 by 2.4±1.2 °C, establishing central West Antarctica as one of the fastest-warming regions globally.

— Bromwich et al., (2012): Central West Antarctica among the most rapidly warming regions on Earth in Nature.

Analyzing the 20th Century Carbon Dioxide Rise: A pre-calculus assignment

Carbon dioxide concentration (ppm) measured at the Mauna Loa observatory in Hawaii shows exponential growth and a periodic annual variation.

The carbon dioxide concentration record from Mauna Loa in Hawaii is an excellent data set to work with in high-school mathematics classes for two key reasons.

The first has to do with the spark-the-imagination excitement that comes from being able to work with a live, real, scientific record (updated every month) that is so easy to grab (from Scripps), and that is extremely relevant given all the issues we’re dealing with regarding global climate change.

The second is that the data is very clearly the sum of two different types of functions. The exponential growth of CO2 concentration in the atmosphere over the last 60 years dominates, but does not swamp, the annual sinusoidal variability as local plants respond to the seasons.

Assignment

So here’s the assignment using the dataset (mona-loa-2012.xls or plain text mona-loa-2012.csv):

1. Identify the exponential growth function:

Add an exponential trendline in a spreadsheet program, or fit one yourself with a manual regression (a short sketch of the manual approach follows below). If you use the manual regression (which I’ve found gives the best match), your equation should have the form:

 y = a b^{cx} + d

while the built-in exponential trendline will usually give something simpler like:

 y = a e^{bx}
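Here’s that sketch of the manual regression, done in Python with scipy (my choice of tool; any least-squares routine will do). It uses the handful of annual averages tabulated in the least-squares walkthrough further down this page, and it folds b into the exponent, since a b^{cx} can always be rewritten as a e^{(c ln b)x}:

```python
# Minimal sketch: least-squares fit of y = a*e^(c*t) + d to a few annual-average
# CO2 values (the same ones listed in Table 1 of the least-squares post below).
import numpy as np
from scipy.optimize import curve_fit

years = np.array([1959, 1964, 1969, 1981, 1985, 1996, 2003])
co2 = np.array([315.97, 319.62, 324.62, 340.10, 346.04, 362.59, 375.77])

def exp_model(t, a, c, d):
    return a * np.exp(c * t) + d

t = years - 1959  # years since 1959, to keep the exponent well behaved
# Rough starting guesses (roughly 60 ppm above a ~256 ppm baseline in 1959,
# growing a percent or two per year); these are just estimates.
a, c, d = curve_fit(exp_model, t, co2, p0=(60.0, 0.015, 255.0))[0]

print(f"y = {a:.1f} e^({c:.4f} x) + {d:.1f}   (x = years after 1959)")
print("predicted CO2 in 2050:", round(exp_model(2050 - 1959, a, c, d), 1))
```

The same call works on the full monthly record linked above once you’ve loaded the year and concentration columns.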

2. Subtract the exponential function.

Put the exponential function (model) into your spreadsheet program and subtract it from the data set. The result should be just the annual sinusoidal function.

Dataset with the exponential curve subtracted.

If you look carefully you might also see what looks like a longer-wavelength periodicity superimposed on the annual cycle. You can attempt to extract it if you wish.

3. Decipher the annual sinusoidal function

Try to match the stripped dataset with a sinusoidal function of the form:

 y = a \sin (bx+c) + d

A good place to start finding the best-fit coefficients is to recognize that (see the sketch after this list):

  • a = amplitude;
  • b = angular frequency (2π divided by the period, or wavelength);
  • c = phase (to shift the curve left or right); and
  • d = vertical offset (this sets the baseline of the curve).
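Here’s the sketch mentioned above: a least-squares fit of y = a sin(bx + c) + d, again in Python with scipy (my choice of tool). In practice x and y would be the monthly times (in years) and the residual from step 2; a synthetic signal stands in here so the example runs on its own:

```python
# Minimal sketch: recovering the coefficients of y = a*sin(b*x + c) + d by
# least squares. Replace x and y with the monthly times and the residual from
# step 2; the synthetic signal below is only a stand-in.
import numpy as np
from scipy.optimize import curve_fit

def sin_model(x, a, b, c, d):
    return a * np.sin(b * x + c) + d

x = np.arange(0, 5, 1.0 / 12.0)             # five years of monthly samples
y = sin_model(x, 3.0, 2 * np.pi, 0.5, 0.0)  # stand-in "residual"

# Starting guesses: an amplitude of a few ppm and a one-year period,
# which means b is about 2*pi per year.
(a, b, c, d), _ = curve_fit(sin_model, x, y, p0=(2.0, 2 * np.pi, 0.0, 0.0))

print(f"y = {a:.2f} sin({b:.3f} x + {c:.2f}) + {d:.2f}")
print("period (years):", round(2 * np.pi / b, 3))
```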

Wrap up

Now you have a model for carbon dioxide concentration, so you should be able to predict, for example, what the concentration will be for each month in the years 2020, 2050 and 2100 if the trends continue as they have for the last 60 years. This is the first step in predicting annual temperatures based on increasing CO2 concentrations.

Exponential Growth of Cells

Today I grew, and then killed off, a bunch of bacteria using the VAMP exponential growth model to talk about exponential and logarithmic functions in pre-Calculus. I also took the opportunity to use an exponential decay model to talk about the development of drug resistance in bacteria.

Two cells are reproducing (yellow) during a run of the exponential growth model.

Students had already worked on, and presented to each other, a few bacterial growth problems but the sound and the animation helped give a better conceptual understanding of what was going on.

After watching and listening to the simulation I asked, “What happens to the doubling time?” and one student answered, “It gets shorter,” which seems reasonable but is incorrect. I was able to explain that the doubling time stays the same even though the rate of reproduction (the number of new cells per second) increases rapidly.

Graph showing how the number of cells increases over time.
Switching from growth to decay (half-life of 50 sec).

Then I changed the model from growth to decay by changing the doubling time to a half-life. Essentially this flips the coefficient in the exponent of the growth equation from positive to negative. The growth phase used a doubling time of 100 seconds, but I used a half-life of 50 seconds for the decay to speed things up a bit while still showing the persistence of the last of the bugs.
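A minimal sketch of the equations behind that switch, using the doubling time and half-life from the demonstration (the starting count of two cells is just for illustration):

```python
# Switching from a doubling time to a half-life flips the sign of the
# coefficient in the exponent, since N0*2^(t/Td) = N0*e^(+ln2*t/Td) and
# N0*(1/2)^(t/Th) = N0*e^(-ln2*t/Th).
import math

N0 = 2      # illustrative starting number of cells
Td = 100.0  # doubling time during growth (seconds)
Th = 50.0   # half-life during decay (seconds)

def growth(t):
    return N0 * 2 ** (t / Td)

def decay(n_start, t):
    return n_start * 0.5 ** (t / Th)

print("exponent coefficient, growth: +", round(math.log(2) / Td, 4))
print("exponent coefficient, decay:  -", round(math.log(2) / Th, 4))

for t in (0, 100, 200, 300):
    print(f"{t:3d} s of growth: {growth(t):.0f} cells")
print("after another 100 s of decay:", round(decay(growth(300), 100), 1), "cells")
```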

Exponential decrease in cell population/biomass.

The cells died really fast in the beginning, and while just one cell was left at the very end, it was pretty clear just how persistent that last cell was; cells were dying so slowly by the end.

This is similar to what happens when someone takes antibiotics. The typical course lasts for 10 days, but you’ve killed enough of the bacteria to lose the symptoms of sickness after two or three. Those final few that remain are the most resistant to the antibiotic, and if you don’t kill them, then once you stop taking the antibiotic they’ll start to grow and replicate, and you’ll end up sick again with a new, antibiotic-resistant population of bacteria.

I thought that using the VAMP model for the demonstration worked very well. The sound of the cells popping up faster and faster with exponential growth seemed to help amplify the visual effect, and make the whole thing more real. And during the decay phase, having that last cell hang on, seemingly forever, really helped convey the idea that bacteria can be extremely persistent.

Inverse Relationships

Inverse relationships pop up everywhere. They’re pretty common in physics (see Boyle’s Law for example: P ∝ 1/V), but there you sort of expect them. You don’t quite expect to see them in the number of views of my blog posts, as shown in the Popular Posts section of the column to the right.

Table 1: Views of the posts on the Montessori Muddle in the previous month as of October 16th, 2012.

Post | Rank | Views
Plate Tectonics and the Earthquake in Japan | 1 | 3634
Global Atmospheric Circulation and Biomes | 2 | 1247
Equations of a Parabola: Standard to Vertex Form and Back Again | 3 | 744
Cells, cells, cells | 4 | 721
Salt and Sugar Under the Microscope | 5 | 686
Google Maps: Zooming in to the 5 themes of geography | 6 | 500
Market vs. Socialist Economy: A simulation game | 7 | 247
Human Evolution: A Family Tree | 8 | 263
Osmosis under the microscope | 9 | 219
Geography of data | 10 | 171

You can plot these data to show the relationship.

Views of the top 10 blog posts on the Montessori Muddle in the last month (as of 10/16/2012).

And if you think about it, it sort of makes sense that this relationship should be inverse. After all, as you get to lower-ranked (less visited) posts, the number of views should asymptotically approach zero.

Questions

So, given this data, can my pre-Calculus students find the equation for the best-fit inverse function? That way I could estimate how many hits my 20th or 100th ranked post gets per month.

Can my Calculus students use the function they come up with to estimate the total number of hits on all of my posts over the last month? Or even the top 20 most popular posts?
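For students who want to try it with software rather than by hand, here’s a minimal sketch that fits a simple inverse function, views ≈ a/rank (the choice of that particular form is my assumption), to the Table 1 counts and then uses it on the questions above:

```python
# Minimal sketch: fitting views ~ a / rank to the counts in Table 1, then
# using the fitted function to answer the questions above.
import numpy as np
from scipy.optimize import curve_fit

rank = np.arange(1, 11)
views = np.array([3634, 1247, 744, 721, 686, 500, 247, 263, 219, 171])

def inverse_model(r, a):
    return a / r

(a,), _ = curve_fit(inverse_model, rank, views)

print(f"best-fit coefficient: a = {a:.0f}")
print("estimated monthly views, 20th post: ", round(inverse_model(20, a)))
print("estimated monthly views, 100th post:", round(inverse_model(100, a)))
# A (discrete) answer to the calculus question for the top 20 posts:
print("estimated total views, top 20 posts:",
      round(sum(inverse_model(r, a) for r in range(1, 21))))
```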

Least Squares Regression with Excel

Here I’ll go over how to do Least Squares Regression, as simply as possible, using Excel and its Solver*. Least Squares Regression can be used to match pretty much any type of function to any type of data. Most spreadsheet programs, like Excel, will do some curve matching for you when you add trendlines to graphs, but for more sophisticated work — in pre-Calculus and beyond, for example — you need a more general approach.

Figure 1. Selected annual average carbon dioxide concentrations since 1959 (see Table 1 for data). Data from NOAA.

We’ll start with a data set that we want to match. The atmospheric CO2 data from Mauna Loa is a convenient dataset. It’s also an important data set to model because matching a function to the data will allow us to predict the change in CO2 over the next 100 years, and those predictions are the ones the IPCC uses in their estimates of the impact of global climate change around the world. Billions, if not trillions of dollars depend on those predictions (and how governments decide to respond). For ease of demonstration, I’ve picked a few of the annual average CO2 concentration measurements at random to work with:

Table 1: CO2 Data

Year (x) CO2 concentration (ppm) (y)
1959 315.97
1964 319.62
1969 324.62
1981 340.10
1985 346.04
1996 362.59
2003 375.77

Now, looking at the data, we can see there is some sort of trend. Our first decision is about how to model it.

Straight Line Model

The simplest approach, and the one we’ll try first, is to fit a straight line through the data. (The final spreadsheet is here.)

The equation of a straight line is:

 y = mx + b

In this case, as we can see from the graph (Figure 1), the y axis is CO2 concentration, and the x axis is the year. m and b are our two unknown constants that set the slope of the line (m) and move it up and down (b). If you need to get a better feel for what this means, try changing the slope and intercept in the Straight Line Grapher (I used this in class to demonstrate).

To match our straight line model to the data we need to determine the values of m and b that give the best fit.

Figure 2. The formula ("=C$3*$A8+C$4") for our straight line model uses the year (column A) and the m and b coefficients (cells C3 and C4 respectively). You'll note the $ signs in the C3 and C4 references in the formula (i.e. C$3 and C$4); these tell Excel to always refer to these specific cells when the formula is copied and pasted down the entire column.

So we create a spreadsheet with the data and in the adjacent column set up the straight line function by setting two cells to the values of the constants (m and b) and using those values to calculate the modeled CO2 concentration.

Figure 3. The initial values for m and b (1 and -1600 respectively) don't match the data very well.

You’ll notice that I have values of m = 1 and b = -1600. These are just my initial estimates. The exact initial values are not crucial, as you’ll see; they’re just there so I can check that my formula is entered correctly.

Once I have the correct formulas in, I can play around with these values until my line matches the data. However, this is where Solver comes in to save a lot of time.

Finding a match

First we need a quantitative way of telling if we have a good match or not. We can start by taking the difference between each real data point and the modeled value. We’ll call this the error.

Figure 4. Calculating the error -- the difference between the actual and modeled data.

Now we could get a single value for the total error by adding up all the individual error values, or taking their average. However, as one of my students pointed out, we could end up with a case where the modeled line crosses through the data, and the positive differences of the data points above the line cancel out the negative differences of the data points below it. His solution was to take the absolute value of the differences (errors) instead, which would actually work just about as well as the approach we’re taking here.

Instead of using the absolute value, however, we’ll square the errors. This achieves the same effect because the squares of both negative and positive numbers are positive. This approach is where the “squares” in Least Squares Regression comes from. The “Least” part comes from the fact that we’re now going to adjust our coefficients until the average of the squares of the errors is as small as possible.

Figure 5. Calculating the square of the error.

Now we calculate the average of the squared errors (the sum of the squared errors would work just as well) using the spreadsheet’s “AVERAGE” function.

Figure 6. Calculating the average error using the "AVERAGE" function.

Now we can adjust m and b and not just see how they compare to the data points on a graph, but know that we have the best fit if we minimize the average error.

Solver

Instead of changing m and b by hand, we can use Excel’s Solver to minimize the average error for us. The matching of a straight line can be done using algebra (e.g. here and here) but that approach won’t help us when we get to more complex functions.

We select the average error cell (E16) and tell Solver to minimize its value by changing the values of m and b.

Figure 7. Setting up Excel's Solver.

When we hit Solve, Solver should converge on a solution. The way the problem is set up — using the square of the error, for example — makes this a non-linear problem for the solver. As of this writing, Excel is the only spreadsheet program I know of that has a built-in, non-linear solver.

Figure 8. The Solver solution.

You’ll notice that Solver’s solution gives:

  • m = 1.360
  • b = -2351.9

So now we have the equation for the best fit line (our model) being:

 y = 1.360 x - 2351.9

Prediction

Using this model we can predict the atmospheric CO2 concentration for the year 2050 by setting x = 2050 in the modeled equation, which gives 436.5 ppm.

Figure 9. The straight line model matches the data very well and can be projected to predict atmospheric carbon dioxide concentrations in the future.

The final spreadsheet I used to do these calculations can be found here.
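If you don’t have access to Excel’s Solver, the same minimization can be reproduced outside a spreadsheet. Here’s a minimal sketch using Python’s scipy (my choice of tool, not part of the spreadsheet method); its least-squares routine minimizes the same squared errors and should land close to the m and b values above:

```python
# Minimal sketch: least-squares fit of y = m*x + b to the Table 1 data.
# curve_fit adjusts m and b to minimize the sum of the squared errors,
# which is the same thing Solver was asked to do above.
import numpy as np
from scipy.optimize import curve_fit

year = np.array([1959, 1964, 1969, 1981, 1985, 1996, 2003])
co2 = np.array([315.97, 319.62, 324.62, 340.10, 346.04, 362.59, 375.77])

def line(x, m, b):
    return m * x + b

# Start from the same rough guesses used in the spreadsheet (m = 1, b = -1600).
(m, b), _ = curve_fit(line, year, co2, p0=(1.0, -1600.0))

errors = co2 - line(year, m, b)
print(f"m = {m:.3f}, b = {b:.1f}")
print("average squared error:", round(np.mean(errors ** 2), 2))
print("predicted CO2 in 2050:", round(line(2050, m, b), 1))
```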

Parabolic Model

As good as the straight line model seems to be, it does not account for the slight upward curve the data seems to have (see Figure 1). So instead of a straight line, perhaps we could try a parabolic function. Just as before, we’ll need to figure out the coefficients for the parabolic function. In this case the general equation we’ll use is:

 y = ax^2 + bx + c

and we’ll need to find the coefficients a, b and c. To see how changing these coefficients changes the curve, you can play around with the interactive parabola model.

We set up our spreadsheet in the same way as for the straight line, but with our new coefficients.

Figure 10. Setting the equation for a parabolic model.

Note that the only column that needs to change from the straight-line spreadsheet is the “Model” column (column C), to add the coefficients and to change the formula to that of a parabola.

Now we can use Solver to minimize the average of the squares of the errors, just as we did before, only having it change the three coefficients instead of two. This gives:

  • a = 0.00060091
  • b = -1.018
  • c = 0.9933

and a final equation:

 y = 0.00060091 x^2 - 1.018 x + 0.9933

Figure 11. Matching data using a parabolic function.

Prediction

The parabolic model predicts for the year 2050 (x = 2050) that the CO2 concentration will be 439.4 ppm.

The Excel spreadsheet for the parabolic model is here.

Which model is better?

The easiest way to compare the two models is to see which did better at minimizing the squared errors.

Table 2: Comparison of models.

Model Average Squared Error
Straight Line 4.94
Parabola 4.47

So the parabolic model does better, but not by much.
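As a quick check of Table 2, here’s a minimal sketch that recomputes the average squared error of each model against the Table 1 data, using the rounded coefficients printed above; because of that rounding, the numbers come out close to, but not exactly, the values in Table 2:

```python
# Minimal sketch: average squared error of the two fitted models against the
# Table 1 data, using the (rounded) coefficients reported above.
import numpy as np

year = np.array([1959, 1964, 1969, 1981, 1985, 1996, 2003])
co2 = np.array([315.97, 319.62, 324.62, 340.10, 346.04, 362.59, 375.77])

line = 1.360 * year - 2351.9
parabola = 0.00060091 * year ** 2 - 1.018 * year + 0.9933

print("straight line, average squared error:", round(np.mean((co2 - line) ** 2), 2))
print("parabola, average squared error:     ", round(np.mean((co2 - parabola) ** 2), 2))
```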

Conclusions

With this approach, you can use any type of function — exponential, sinusoidal, etc. — or combination of functions to match any data, and be able to compare among the models. A better measure of how well the models match the data is the correlation coefficient or the coefficient of determination, but I’ll save those for another post.

Notes

* UPDATE (Dec 31, 2012): Gnumeric’s Solver works.
* You’ll need to use Microsoft’s Excel, or better yet Gnumeric, for this because, as of this writing, none of the other common spreadsheet options — OpenOffice, Google Docs, and Apple’s Numbers — has a non-linear solver built in, and even some of the newer versions of Excel seem buggy.

Using Real Data, and Least Squares Regression, in pre-Calculus

The equation of our straight line model (red line) matches the data (blue diamonds) pretty well.

One of the first things that my pre-Calculus students need to learn is how to do a least squares regression to match any type of function to real datasets. So I’m teaching them the most general method possible using MS Excel’s iterative Solver, which is pretty easy to work with once you get the hang of it.

Log, reciprocal and square root functions can all be matched using least squares regression.

I’m teaching the pre-Calculus using a graphical approach, and I want to emphasize that the main reason we study the different classes of functions — straight lines, polynomials, exponential curves etc.— is because of how useful they are at modeling real data in all sorts of scientific and non-scientific applications.

So I’m starting each topic with some real data: either data they collect (e.g. bring water to a boil) or data they can download (e.g. atmospheric CO2 from Mauna Loa). However, while it’s easy enough to pick two points, or draw a straight line by eye, and then determine its linear equation, it’s much trickier if not impossible when dealing with polynomials or transcendental functions like exponents or square-roots. They need a technique they can use to match any type of function, and least squares regression is the most commonly used method of doing this. While calculators and spreadsheet programs, like Excel, use least squares regression to draw trendlines on their graphs, they can’t do all the different types of functions we need to deal with.

The one issue that has come up is that not everyone has Excel and Solver. Neither OpenOffice nor Apple’s spreadsheet software (Numbers) has a good equivalent. However, if you have a good initial guess, based on a few datapoints, you can fit curves reasonably well by changing their coefficients in the spreadsheet by hand to minimize the error.

I’m working on a post on how to do the linear regression with Excel and Solver. It should be up shortly.

Notes

If Solver is not available in the Tools menu, you may have to activate it because it’s an Add-In. Wikihow explains how to activate it.

Some versions of Excel for the Mac don’t have Solver built in, but you can download it from Frontline.

Modeling Data with Straight Lines using Excel

Microsoft Excel, like most graphing calculators and spreadsheet programs, has the built-in ability to fit trendlines to measured data using certain types of functions — lines, polynomials, logarithms, and exponents, for example. However, you can get it to fit any type of function — sinusoidal, natural log, whatever — if you set the model up in the spreadsheet yourself and use the iterative Solver tool.

This more general approach is quite useful in teaching pre-Calculus, because the primary purpose of all the functions they have to learn is to create mathematical models (functions) based on data that can be used for predictions.

The Data

I started this year’s pre-calculus class by having them collect some data. In a simplification of the snow-melt experiment I did with the middle school last year, I had them put a beaker of water (about 300 ml) on a hot plate and measure the temperature every minute as it warmed up.

To make the experiment a little more interesting, I had each student in each group of four take just three consecutive measurements and try to find the equation of the straight line that best fit their data, and could be used to try to predict the other measurements of their peers in their group.

Figure 1. Scatter plot of measured temperatures during the warming of a beaker of water on a hot plate. Data given in Table 1.

It did not quite work out as I’d hoped. Since you only need two points to find the equation of a straight line, having three points produced a little confusion. I’d hoped to produce that confusion, but hadn’t realized that I’d also need to review how to find the equation of a straight line. A large fraction of the class was a little bit rusty after the hot months of summer.

So, we pooled all the data and reviewed how to find the equation of a straight line.

Table 1: The Data

Time (minutes) Measured Temperature (°C)
0 22
1 26
2 31
3 36
4 40
5 44
6 48
7 53
8 58
9 61
10 65
11 68
12 71

Finding the Equation for a Straight Line using Two Points

The general equation for a straight line is:

(1)  y = mx + b

and we need to determine the coefficients m and b. m is the slope, which can be calculated from two points using the equation:

(2)  m = \frac{y_2 - y_1}{x_2 - x_1}

Using the points at t = 6 and t = 11, i.e. (x1, y1) = (6, 48) and (x2, y2) = (11, 68) respectively, gives a slope of:

 m = \frac{68 - 48}{11 - 6}

 m = \frac{20}{5}

 m = 4

so our general equation becomes:

 y = 4 x + b

To find b we substitute either one of the points into the equation for x and y. If we use the first point, x = 6 and y = 48, we get:

 48 = 4 (6) + b
 48 = 24 + b

 24 + b = 48
 b = 48 - 24
 b = 24

and the equation of our line becomes:
(3)  y = 4 x + 24

Now, since we’re actually looking at a relationship between temperature and time, with temperature on the y-axis and time on the x-axis, we could relabel the terms in the equation with T = temperature and t = time to have:

(4)  T = 4 t + 24

While this equation is more satisfying to me, because I think it better describes the relationship we have, the more vocal students preferred the equation in terms of x and y (Eqn 3). These are the terms they are more familiar with in the context of a math class, and I recall seeing some evidence that students seem to learn better with the more abstract representations sometimes (though I can’t quite remember the source; I should have blogged about it).
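For anyone who wants to check the arithmetic, here’s a minimal sketch of the same two-point calculation (a script is my choice here; a calculator works just as well):

```python
# The two-point calculation above: slope from equation (2), then the intercept
# by substituting one of the points back into y = m*x + b.
x1, y1 = 6, 48   # (time in minutes, temperature in deg C)
x2, y2 = 11, 68

m = (y2 - y1) / (x2 - x1)  # slope, equation (2)
b = y1 - m * x1            # intercept, from y1 = m*x1 + b

print(f"T = {m:g} t + {b:g}")                      # should match equation (4)
print("predicted T at t = 3 minutes:", m * 3 + b)  # compare with Table 1
```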

Plotting the Data and the Modeled Straight Line

The straight line equation we came up with (Eqn. 4) is our model of the data. It’s not quite perfect: not all the data lie on the line, although, if we did everything right, only the points (6, 48) and (11, 68) are guaranteed to be on it.

Figure 2. The equation of our straight line model (red line) matches the data (blue diamonds) pretty well.

I showed the class how to plot the scatter graph using MS Excel, and how to draw the line to show the modeled data. The measured data are represented as points since the measurements were made at discrete points in time. The modeled equation, however, is a continuous function, hence the straight line. The Excel sheet below (Resource 1) illustrates:

Resource 1: Excel Spreadsheet of Measured versus Modeled Data

The Best Fit Curve

The Excel spreadsheet (Resource 1) was set up so that when I entered the slope (m) and intercept (b) values, the graph would quickly update. So I went through the class. Everyone called out their slope and intercept values, I plugged them in, and they could all see how the modeled line changed slightly based on the points used to calculate it. So I put the question to them, “How can we figure out which model equation is the best?”

That’s how I was able to introduce the topic of error. What if we compared the temperature predicted by the model for each data point to the actual value? The smaller the difference between modeled and measured temperatures, the better the fit of the model. Indeed, if we sum all the differences, or better yet take their average, we get a single number, which we’ll call the average error (ε), that can be used to compare the different models. I used this opportunity to introduce sigma notation, which the pre-calculus students had not seen much of before.

As a first pass (which, as we’ll see below, has a major problem), the error (ε) for each point (i) is:

 \epsilon_i = (T_{measured}-T_{modeled})

The average error is the sum of all the errors divided by the number of points (n) (we have 13 points in Table 1, so n = 13 in this example):

(5)  \bar{\epsilon} = \frac{\sum\limits_{i=1}^{n} \epsilon_i}{n}

Now this works, but there is one problem. I was quite pleased, and a little bit surprised, that one of my students recognized what it was without any coaxing and also suggested a solution: by simply taking the difference to calculate the error, a point offset above the modeled line can be canceled out by a point offset by the same amount below the line. So what we really need is to use the absolute value of the error.

(6)  \epsilon_i = \left| T_{measured}-T_{modeled} \right|

This works, and is what we went with, but I did also point out that what’s usually done is to use the square of the error instead of the absolute value. Squaring makes any number positive, so it accomplishes the same goal as the absolute value, and it is the approach we’ll use when I go into linear regression later on.

Setting up the Excel spreadsheet to calculate the average error is fairly straightforward as shown in Resource 2:

Resource 2. Calculating the average error using Excel.
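For anyone without the spreadsheet handy, here’s a minimal sketch of the same average (absolute) error calculation, using the Table 1 measurements and the model T = 4t + 24 from equation (4):

```python
# Average absolute error (equations 5 and 6) of the model T = 4t + 24
# against the measured temperatures in Table 1.
time = list(range(13))  # 0 to 12 minutes
measured = [22, 26, 31, 36, 40, 44, 48, 53, 58, 61, 65, 68, 71]

modeled = [4 * t + 24 for t in time]
errors = [abs(T_meas - T_mod) for T_meas, T_mod in zip(measured, modeled)]

average_error = sum(errors) / len(errors)
print("average absolute error:", round(average_error, 2))
```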

So once again, we went through the class: everyone called out their slope and intercept values, and cheered as I plugged the numbers in and they saw whether they had the lowest value.

It is important to remember, though, that the competition gives a somewhat random result: students’ average error is a function of the points they happened to pick, not how well they did the math (assuming everyone did the math correctly).

Figure 3. The spreadsheet used to calculate the average error (Resource 2).