“Junk” DNA: Not so much

It has always strained credulity that the 98% of our DNA not used to code for proteins would be useless. But this non-coding DNA picked up the name “junk DNA” because no one quite knew what it did. In fact, one study (Nóbrega, 2004) found that deleting large chunks of non-coding DNA had no discernible effect on mice; the mice born without these pieces of DNA were viable.

However, a slew of papers from the Encode project indicates that the part of our genome formerly known as junk DNA regulates the 2% that does the protein coding:

The researchers … have identified more than 10,000 new “genes” that code for components that control how the more familiar protein-coding genes work. Up to 18% of our DNA sequence is involved in regulating the less than 2% of the DNA that codes for proteins. In total, Encode scientists say, about 80% of the DNA sequence can be assigned some sort of biochemical function.

— Jha (2012): Breakthrough study overturns theory of ‘junk DNA’ in genome in The Guardian.

This is more good news for useless bits of biology (see the appendix).

Sections of non-junk DNA are transcribed into messenger RNA, which codes for proteins. Image from Talking Glossary of Genetics via Wikipedia.

The Appendix: A Useless Bit of Biology? Perhaps Not

The appendix has long been supposed to be a vestigial, useless organ. But a 2007 study suggests that it might have had — and may still have in many developing countries — an important role in digestion. It may provide a refuge for helpful, commensal bacteria to repopulate our guts after we purge when we get sick (Bollinger et al., 2007):

The organs of the lower digestive system. The appendix is located in the lower left, near where the small and large intestines meet. Image from Wikipedia.

… the human appendix is well suited as a “safe house” for commensal bacteria, providing support for bacterial growth and potentially facilitating re-inoculation of the colon in the event that the contents of the intestinal tract are purged following exposure to a pathogen.

— Bollinger et al., 2007: Biofilms in the large bowel suggest an apparent function of the human vermiform appendix in the Journal of Theoretical Biology.

Why do they think that? What’s the evidence?

The shape of the appendix is perfectly suited as a sanctuary for bacteria: Its narrow opening prevents an influx of the intestinal contents, and it’s situated inaccessibly outside the main flow of the fecal stream.

— Glausiusz (2008): And Here’s Why You Have an Appendix in Discover Magazine.

And thinking about supposedly useless bits of biology, there are a bunch of interesting papers coming out about so-called “junk” DNA.

Using Real Data, and Least Squares Regression, in pre-Calculus

The equation of our straight line model (red line) matches the data (blue diamonds) pretty well.

One of the first things that my pre-Calculus students need to learn is how to do a least squares regression to match any type of function to real datasets. So I’m teaching them the most general method possible using MS Excel’s iterative Solver, which is pretty easy to work with once you get the hang of it.

Log, reciprocal and square root functions can all be matched using least squares regression.

I’m teaching pre-Calculus using a graphical approach, and I want to emphasize that the main reason we study the different classes of functions — straight lines, polynomials, exponential curves, etc. — is how useful they are at modeling real data in all sorts of scientific and non-scientific applications.

So I’m starting each topic with some real data: either data they collect (e.g., bringing water to a boil) or data they can download (e.g., atmospheric CO2 from Mauna Loa). However, while it’s easy enough to pick two points, or draw a straight line by eye, and then determine its linear equation, it’s much trickier, if not impossible, when dealing with polynomials or transcendental functions like exponentials or square roots. They need a technique they can use to match any type of function, and least squares regression is the most commonly used method of doing this. While calculators and spreadsheet programs like Excel use least squares regression to draw trendlines on their graphs, they can’t handle all the different types of functions we need to deal with.

The one issue that has come up is that not everyone has Excel and Solver. Neither OpenOffice nor Apple’s spreadsheet software (Numbers) has a good equivalent. However, if you have a good initial guess, based on a few datapoints, you can fit curves reasonably well by changing their coefficients in the spreadsheet by hand to minimize the error.
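To give a flavour of what that hand-tuning looks like outside a spreadsheet, here is a minimal Python sketch (my own illustration, not part of the Excel post to come; the data values are made up): it computes the sum of squared errors for a guessed slope and intercept, so you can tweak the coefficients and watch the error shrink, which is exactly what you would do by hand in the spreadsheet cells.

```python
# Minimal sketch of hand-tuning a straight-line model y = m*x + b.
# Guess coefficients, compute the error, adjust, repeat. (Made-up data.)
times = [0, 1, 2, 3, 4, 5]          # minutes
temps = [22, 26, 31, 36, 40, 44]    # degrees C

def sum_squared_error(m, b):
    """Sum of squared differences between measured and modeled values."""
    return sum((T - (m * t + b)) ** 2 for t, T in zip(times, temps))

# Try a few guesses by hand and keep the one with the smallest error.
for m, b in [(3.5, 24.0), (4.0, 22.0), (4.5, 20.0)]:
    print(m, b, sum_squared_error(m, b))
```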

I’m working on a post on how to do the linear regression with Excel and Solver. It should be up shortly.

Notes

If Solver is not available in the Tools menu you may have to activate it, because it’s an Add-In. Wikihow explains activation.

Some versions of Excel for the Mac don’t have Solver built in, but you can download it from Frontline.

Water Scarcity in Yemen

Groundwater tends to be a common property resource. In places like Yemen, where ownership rights are not clearly defined, it tends to be overexploited. So much so that they’re looking at running out within the next 10 years. Peter Salisbury has an article in Foreign Policy.

Most potable water in Yemen is produced from a series of deep underground aquifers using electric and diesel-powered pumps. Some of these pumps are run by the government, but many more are run by private companies, most of them unlicensed and unregulated. Because of this, it is nigh on impossible to control the volume of water produced. By some (conservative) estimates, about 250 million cubic meters of water are produced from the Sanaa basin every year, 80 percent of which is non-renewable. In recent years, the businessmen who produce the water have had to drill ever-deeper wells and use increasingly powerful pumps to get the region’s dwindling water reserves out of the ground.

— Salisbury (2012): Yemen’s water woes in Foreign Policy.

Ecological Footprints: If the World Lived Like …

What if the entire world population lived like the people in Bangladesh? The amount of land to produce the resources we’d need would take up most of Asia and some of Africa. On the other hand, if we lived like the people in the UAE we’d need 5.4 Earths to support us sustainably. That’s the result of Mathis Wackernagel’s work (Wackernagel, 2006) comparing resource availability to resource demand. Tim De Chant put this data into graphical form:

Ecological footprints needed to support the world population if everyone used resources at the rate of these different countries. Image by Tim De Chant, based on data from Wackernagel (2006).

I showed this image in Environmental Science class today when we talked about ecological footprints, as well as the one showing how much space the world population of seven billion would take up if everyone lived in one big city with the same density as a few different cities (Paris, New York, Houston, etc.).

Wackernagel’s original article also includes this useful table of data for different countries that I think I’ll try to get a student to put into a bar graph for a project or presentation.

Data from Wackernagel (2006).

Via Zoë Pollock at The Dish.

A Darwinian Debt

Evidence is mounting that fish populations won’t necessarily recover even if overfishing stops. Fishing may be such a powerful evolutionary force that we are running up a Darwinian debt for future generations.

— Loder (2006): Point of No Return in Conservation in Practice.

Darwinian Debt. That’s the elegant phrase Natasha Loder (2006) uses to describe the observation that human pressure on the environment — fishing in this particular example — has forced evolutionary changes that are not soon reversed.

Fishermen prefer to catch larger fish, depleting the population of older fish and allowing smaller fish to reproduce successfully. Over a period of years, this artificial selection — as opposed to natural selection — gives rise to new generations of fish that are permanently smaller than they used to be. And the fisheries find it hard to recover even after decades (Swain, 2007):

Populations where large fish were selectively harvested (as in most fisheries) displayed substantial declines in fecundity, egg volume, larval size at hatch, larval viability, larval growth rates, food consumption rate and conversion efficiency, vertebral number, and willingness to forage. These genetically based changes in numerous traits generally reduce the capacity for population recovery.

— Walsh et al., 2005, Maladaptive changes in multiple traits caused by fishing: impediments to population recovery in Ecology Letters.

Modeling Data with Straight Lines using Excel

Microsoft Excel, like most graphing calculators and spreadsheet programs, has the built-in ability to do linear regression of measured data using certain types of functions — lines, polynomials, logarithms, and exponentials, for example. However, you can get it to fit any type of function — sinusoidal, natural log, whatever — if you set up the spreadsheet yourself and use the iterative Solver tool.

This more general approach is quite useful in teaching pre-Calculus, because the primary purpose of all the functions they have to learn is to create mathematical models based on data that can be used for predictions.
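As a rough sketch of what the Solver-style iteration does (this is my own illustration in Python, assuming scipy is available; the exponential form and the numbers are invented), a general-purpose minimizer can fit any function you can write down by repeatedly adjusting its coefficients to reduce the sum of squared errors:

```python
import numpy as np
from scipy.optimize import minimize

# Invented data; any function form Solver could fit works the same way here.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 2.9, 4.2, 6.1, 8.8, 12.5])

def model(params, x):
    a, b = params
    return a * np.exp(b * x)      # swap in any function you need to fit

def sum_squared_error(params):
    return np.sum((y - model(params, x)) ** 2)

# Start from a rough guess and let the minimizer iterate, Solver-style.
result = minimize(sum_squared_error, x0=[1.0, 0.5])
print(result.x)                   # best-fit coefficients a and b
```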

The Data

I started this year’s pre-Calculus class by having them collect some data. In a simplification of the snow-melt experiment I did with the middle school last year, I had them put a beaker of water (about 300 ml) on a hot plate and measure the temperature every minute as it warmed up.

To make the experiment a little more interesting, I had each student in each group of four take just three consecutive measurements and try to find the equation of the straight line that best fit their data, one that could then be used to predict the other measurements made by their peers in the group.

Figure 1. Scatter plot of measured temperatures during the warming of a beaker of water on a hot plate. Data given in Table 1.

It did not quite work out as I’d hoped. Since you only need two points to find the equation of a straight line, having three points produced a little confusion. I’d hoped to produce that confusion, but hadn’t realized that I’d also need to review how to find the equation of a straight line: a large fraction of the class was a little bit rusty after the hot months of summer.

So, we pooled all the data and reviewed how to find the equation of a straight line.

Table 1: The Data

Time (minutes)    Measured Temperature (°C)
0                 22
1                 26
2                 31
3                 36
4                 40
5                 44
6                 48
7                 53
8                 58
9                 61
10                65
11                68
12                71

Finding the Equation for a Straight Line using Two Points

The general equation for a straight line is:

(1)  y = mx + b

and we need to determine the coefficients m and b. m is the slope, which can be calculated from two points using the equation:

(2)  m = \frac{y_2 - y_1}{x_2 - x_1}

Using, for example, the points at t = 6 and t = 11 — (x_1, y_1) = (6, 48) and (x_2, y_2) = (11, 68) respectively — gives a slope of:

 m = \frac{68 - 48}{11 - 6}

 m = \frac{20}{5}

 m = 4

so our general equation becomes:

 y = 4 x + b

To find b, we substitute either one of the points into the equation for x and y. If we use the first point, x = 6 and y = 48, we get:

 48 = 4(6) + b
 48 = 24 + b
 b = 48 - 24
 b = 24

and the equation of our line becomes:
(3)  y = 4 x + 24

Now, since we’re actually looking at a relationship between temperature and time, with temperature on the y-axis and time on the x-axis, we could relabel the terms in the equation with T = temperature and t = time to have:

(4)  T = 4 t + 24

While this equation is more satisfying to me, because I think it better describes the relationship we have, the more vocal students preferred the equation in terms of x and y (Eqn 3). These are the terms they are more familiar with in the context of a math class, and I recall seeing some evidence that students seem to learn better with the more abstract representations sometimes (though I can’t quite remember the source; I should have blogged about it).
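For anyone who wants to check the arithmetic, the two-point calculation is easy to script; this is just an illustrative Python sketch (the function name is mine), not something the class used:

```python
# Slope and intercept of the line through two points, then a quick prediction.
def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)     # Eqn (2)
    b = y1 - m * x1               # substitute one point back in
    return m, b

m, b = line_through((6, 48), (11, 68))
print(m, b)          # 4.0 24.0, i.e. T = 4t + 24 (Eqn 4)
print(m * 9 + b)     # 60.0; the measured value at t = 9 was 61
```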

Plotting the Data and the Modeled Straight Line

The straight-line equation we came up with (Eqn. 4) is our model of the data. It’s not quite perfect: not all the data lie on the line, although, if we did everything right, only the points (6, 48) and (11, 68) are guaranteed to be on it.

Figure 2. The equation of our straight line model (red line) matches the data (blue diamonds) pretty well.

I showed the class how to plot the scatter graph using MS Excel, and how to draw the line to show the modeled data. The measured data are represented as points since the measurements were made at discrete points in time. The modeled equation, however, is a continuous function, hence the straight line. The Excel sheet below (Resource 1) illustrates:

Resource 1: Excel Spreadsheet of Measured versus Modeled Data
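For anyone without Excel, roughly the same chart can be drawn in Python with matplotlib (a sketch on my part, not the spreadsheet in Resource 1): the measured values are plotted as discrete markers and the model as a continuous line.

```python
import matplotlib.pyplot as plt

# Measured data from Table 1.
time = list(range(13))
temp = [22, 26, 31, 36, 40, 44, 48, 53, 58, 61, 65, 68, 71]

# Discrete measurements as points; the continuous model T = 4t + 24 as a line.
plt.plot(time, temp, 'd', label='Measured')
plt.plot(time, [4 * t + 24 for t in time], 'r-', label='Model: T = 4t + 24')
plt.xlabel('Time (minutes)')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()
```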

The Best Fit Curve

The Excel spreadsheet (Resource 1) was set up so that when I entered the slope (m) and intercept (b) values, the graph would quickly update. So I went around the class: everyone called out their slope and intercept values, I plugged them in, and they could all see how the modeled line changed slightly based on the points used to calculate it. Then I put the question to them, “How can we figure out which model equation is the best?”

That’s how I was able to introduce the topic of error. What if we compared the temperature predicted by the model for each data point to the actual value? The smaller the difference in modeled versus measured temperatures, the better the fit of the model. Indeed, if we sum all the differences, or better yet take the average of the differences, we get a single number, which we’ll call the average error (ε), that can be used to compare the different models. I used this opportunity to introduce sigma notation, which the pre-Calculus students had not seen much of before.

As a first pass (which, as we’ll see below, has a major problem), the error (ε) for each point (i) is:

 \epsilon_i = (T_{measured}-T_{modeled})

The average error is the sum of all the errors divided by the number of points, n (Table 1 has 13 measurements, so n = 13 in this example):

(5)  \bar{\epsilon} = \frac{\sum\limits_{i=1}^{n} \epsilon_i}{n}

Now this works, but there is one problem. I was quite pleased, and a little bit surprised, that one of my students recognized what it was without any coaxing and also suggested a solution: by simply taking the difference to calculate the error, a point that is offset above the modeled line can be canceled out by a point offset by the same amount below the line. So what we really need is to use the absolute value of the error.

(6)  \epsilon_i = \left| T_{measured}-T_{modeled} \right|

This works, and is what we went with, but I did also point out that what’s usually done is to use the square of the error instead of the absolute value. Squaring makes any number positive, so it accomplishes the same goal as the absolute value, and it is the approach we’ll use when I go into linear regression later on.
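To make the difference concrete, here is a small Python sketch (my own illustration, not the class spreadsheet) that computes the signed, absolute, and squared versions of the average error for the T = 4t + 24 model against the Table 1 data; the signed version comes out close to zero because offsets above and below the line nearly cancel.

```python
# Signed, absolute, and squared average errors for the model T = 4t + 24
# against the measured data in Table 1.
time = list(range(13))
temp = [22, 26, 31, 36, 40, 44, 48, 53, 58, 61, 65, 68, 71]
modeled = [4 * t + 24 for t in time]

n = len(temp)
signed   = sum(T - M for T, M in zip(temp, modeled)) / n        # offsets cancel
absolute = sum(abs(T - M) for T, M in zip(temp, modeled)) / n   # Eqn (6), averaged
squared  = sum((T - M) ** 2 for T, M in zip(temp, modeled)) / n # least squares version

print(signed, absolute, squared)
```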

Setting up the Excel spreadsheet to calculate the average error is fairly straightforward as shown in Resource 2:

Resource 2. Calculating the average error using Excel.

So once again, we went through the class: everyone called out their slope and intercept values, I plugged the numbers in, and they cheered when they saw who had the lowest average error.

It is important to remember, though, that the competition gives a somewhat random result: students’ average error is a function of the points they happened to pick, not how well they did the math (assuming everyone did the math correctly).

Figure 3. The spreadsheet used to calculate the average error (Resource 2).