Getting Data into R

December 14, 2017

One of my students is taking an advanced statistics course, mostly online, and it introduced her to the statistical package R. I’ve been meaning to learn how to use R for a while, so I had her show me how to use it. This allowed me to give her a final exam that used some Pew survey data for analysis. (I used the data for the 2013 LGBT survey.) These are my notes on getting the Pew data, which is in SPSS format, into R.

Instructions on Getting Pew data into R

Go to the page for the 2013 LGBT survey and download the data (you will have to set up an account if you have not used their website before).

  • There should be two files.
    • The .sav file contains the data (in SPSS format)
    • The .docx file contains the metadata (what is metadata?).
  • Load the data into R.
    • To read this file you will need R’s “foreign” package, which imports data saved by other statistical programs such as SPSS, so execute the command:
    • > library(foreign)
    • To get the file’s name and path execute the command:
    • > file.choose()
    • The file.choose() command will give you a long string for the file’s path and name: it should look something like “C:\\Users\…” Copy the name and put it in the following command to read the file (Note 1: I’m naming the data “dataset” but you can call it anything you like; Note 2: The string will look different based on which operating system you use. The one you see below is for Windows):
    • > dataset = read.spss("C:\\Users\...")
    • To see what’s in the dataset you can use the summary command:
    • > summary(dataset)
    • To draw a histogram of the data in column “Q39” (which is the age at which the survey respondents realized they were LGBT) use:
    • > hist(dataset$Q39)
    • If you would like to export the column of data labeled “Q39” as a comma delimited file (named “helloQ39Data.csv”) to get it into Excel, use:
    • > write.csv(dataset$Q39, "helloQ39Data.csv")

This should be enough to get started with R. One problem we encountered was that the R version on Windows was able to produce the histogram of the dataset, while the Mac version was not. I have not had time to look into why, but my guess is that the Windows version is able to screen out the non-numeric values in the dataset while the Mac version is not. But that’s just a guess.
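One way to test that guess is to strip out the non-numeric values yourself before drawing the histogram, since hist() only accepts numeric data. The sketch below is a minimal illustration, not a diagnosis of the Mac problem: the "responses" vector is made up to stand in for a survey column that mixes ages with text codes, and it is not Pew data.

```r
# Hypothetical stand-in for a survey column: ages mixed with text codes.
# These values are invented for illustration; they are not Pew data.
responses <- factor(c("12", "15", "Don't know", "17", "Refused", "14", "15"))

# as.character() recovers the labels; as.numeric() turns non-numbers into NA.
ages <- suppressWarnings(as.numeric(as.character(responses)))

# Keep only the entries that really are numbers.
ages <- ages[!is.na(ages)]

# hist() needs a numeric vector; plot = FALSE returns the bin counts
# without opening a graphics window.
h <- hist(ages, plot = FALSE)
print(ages)
```

With the real file, the same two cleaning lines could be tried on dataset$Q39 in place of "responses" before calling hist().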

Histogram showing the age at which LGBT respondents first felt that they might be something other than heterosexual.

Citing this post: Urbano, L., 2017. Getting Data into R, Retrieved February 26th, 2018, from Montessori Muddle: .
Attribution (Curator's Code ): Via: Montessori Muddle; Hat tip: Montessori Muddle.

Spurious Correlations

November 18, 2016

Tyler Vigen has a great website Spurious Correlations that shows graphs of exactly that.

A spurious correlation.

Great for explaining what correlation means, and why correlation does not necessarily mean causation.

Citing this post: Urbano, L., 2016. Spurious Correlations, Retrieved February 26th, 2018, from Montessori Muddle: .

How to be Lucky

May 28, 2013

The lucky try more things, and fail more often, but when they fail they shrug it off and try something else. Occasionally, things work out.

— McRaney, 2013: Survivorship Bias on You Are Not So Smart.

David McRaney synthesizes work on luck in an article on survivorship bias.

… the people who considered themselves lucky, and who then did actually demonstrate luck was on their side over the course of a decade, tended to place themselves into situations where anything could happen more often and thus exposed themselves to more random chance than did unlucky people.

Unlucky people are narrowly focused, [Wiseman] observed. They crave security and tend to be more anxious, and instead of wading into the sea of random chance open to what may come, they remain fixated on controlling the situation, on seeking a specific goal. As a result, they miss out on the thousands of opportunities that may float by. Lucky people tend to constantly change routines and seek out new experiences.

McRaney also points out how this survivorship bias negatively affects scientific publications (scientists tend to get successful studies published but not ones that show how things don’t work), and in war (deciding where to armor airplanes).

Citing this post: Urbano, L., 2013. How to be Lucky, Retrieved February 26th, 2018, from Montessori Muddle: .

Curious Correlations

March 27, 2012

The Correlated website asks people different, apparently unrelated questions every day and mines the data for unexpected patterns.

In general, 72 percent of people are fans of the serial comma. But among those who prefer Tau as the circle constant over Pi, 90 percent are fans of the serial comma. March 23’s Correlation.

Two sets of data are said to be correlated when there is a relationship between them: the height of a fall is correlated to the number of bones broken; the temperature of the water is correlated to the amount of time the beaker sits on the hot plate (see here).

A positive correlation between the time (x-axis) and the temperature (y-axis).

In fact, if we can come up with a line that matches the trend, we can figure out how good the trend is.

The first thing to try is usually a straight line, using a linear regression, which is pretty easy to do with Excel. I put the data from the graph above into Excel (melting-snow-experiment.xls) and plotted a linear regression for only the highlighted data points that seem to follow a nice, linear trend.

Correlation between temperature (y) and time (x) for the highlighted (red) data points.

You’ll notice two things in the top right corner of the graph: the equation of the line and the R² value (the coefficient of determination, often loosely called the regression coefficient), which tells how good the correlation is.

The equation of the line is:

  • y = 4.4945 x – 23.65

which can be used to predict the temperature where the data-points are missing (y is the temperature and x is the time).
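As a small sketch of that kind of prediction, the trendline can be wrapped in an R function. The function name and the example times are my own choices for illustration, and the line should only be trusted within the range of the highlighted points it was fitted to.

```r
# Trendline from the post: temperature (deg C) as a function of time (minutes).
predict_temperature <- function(time_min) {
  4.4945 * time_min - 23.65
}

# Hypothetical times (in minutes), chosen only for illustration.
times <- c(8, 10, 12)
print(predict_temperature(times))
```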

You’ll observe that the slope of the line is about 4.5 °C/min. I had my students draw trendlines by hand, and they came up with slopes between 4.35 and 5, depending on the data points they used.

The R² value tells how well your data line up. The better they line up the better the correlation. A perfect match, with all points on the line, will have an R² value of 1.0. Our R² is 0.9939, which is pretty good.

If we introduce a little random error to all the data points, we’d reduce the R² value like this (where R² is now 0.831):

Adding in some random error causes the data to scatter more, making for a worse correlation. The black dots are the original data, while the red dots include some random error.
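The same effect can be sketched in R with lm(), which fits the straight line, and summary(), which reports R². The data here are synthetic: points generated along the trendline from this post with a little scatter, and then again with more scatter. They are not the melting-snow measurements, and the noise levels are arbitrary assumptions.

```r
set.seed(1)  # make the random noise repeatable

# Synthetic data (not the melting-snow measurements): points along the
# post's trendline with a small amount of scatter.
time_min <- 6:20
temp_clean <- 4.4945 * time_min - 23.65 + rnorm(length(time_min), sd = 1)

# The same points with extra random error added.
temp_noisy <- temp_clean + rnorm(length(time_min), sd = 8)

# Fit a straight line to each set and read off R-squared.
fit_clean <- lm(temp_clean ~ time_min)
fit_noisy <- lm(temp_noisy ~ time_min)
r2_clean <- summary(fit_clean)$r.squared
r2_noisy <- summary(fit_noisy)$r.squared

print(coef(fit_clean))        # intercept and slope of the fitted line
print(c(r2_clean, r2_noisy))  # the noisier data should give the lower value
```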

The correlation trend lines don’t just have to go up. Some things are negatively correlated — when one goes up the other goes down — such as the relationship between the number of hours spent watching TV and students’ grades.

The negative correlation between grades and TV watching. Image: Lanthier (2002).

Correlation versus Causation

However, just because two things are correlated does not mean that one causes the other.

A jar of water on a hot-plate will see its temperature rise with time because heat is transferred (via conduction) from the hot-plate to the water.

On the other hand, while it might seem reasonable that more TV might take time away from studying, resulting in poorer grades, it might be that students who score poorly are demoralized and so spend more time watching TV; what causes what is unclear — these two things might not be related at all.

Which brings us back to the website. They’re collecting a lot of seemingly random data and just trying to see what things match up.

Curiously, many scientists do this all the time — typically using a technique called multiple regression. Understandably, others are more than a little skeptical. The key problem is that people too easily leap from seeing a correlation to assuming that one thing causes the other.

Citing this post: Urbano, L., 2012. Curious Correlations, Retrieved February 26th, 2018, from Montessori Muddle: .

Sub-atomic Physics: The Significance of 0.8%

November 19, 2011

When it comes to particle physics … [m]easuring something once is meaningless because of the high degree of uncertainty involved in such exotic, small systems. Scientists rely on taking measurements over and over again — enough times to dismiss the chance of a fluke.

— Moskowitz (2011): Is the New Physics Here? Atom Smashers Get an Antimatter Surprise in LiveScience

New research, out of the Large Hadron Collider in Switzerland, shows a 0.8% difference in the way matter and antimatter particles behave. This small difference could go a long way in explaining why the universe is made up mostly of matter today, even though in the beginning there were about equal amounts of matter and antimatter. It would mean that the current, best theory describing particle physics, the Standard Model, needs some significant tweaking.

The Standard Model of elementary particles. The LHC experiment looked at the charm quarks (c), and their corresponding antiquarks, which have an opposite charge. Image by MissMJ via Wikipedia.

0.8% is small, but significant. How confident are the physicists that their measurements are accurate? Well, the more measurements you take the more confident you can be in your average result, though you can never be 100% certain. The LHC scientists took enough measurements that they could calculate, statistically, that there is only a 0.05% chance that their result is a statistical fluke.

Citing this post: Urbano, L., 2011. Sub-atomic Physics: The Significance of 0.8%, Retrieved February 26th, 2018, from Montessori Muddle: .

Figuring Out Experimental Error

November 2, 2011

Using stopwatches, we measured the time it took for the tennis ball to fall 5.3 meters. Some of the individual measurements were off by over 30%, but the average time measured was only off by 7%.

I did a little exercise at the start of my high-school physics class today that introduced different types of experimental error. We’re starting the second quarter now and it’s time for their lab reports to include more discussion about potential sources of error, how they might fix some of them, and what they might mean.

One of the stairwells just outside the physics classroom wraps around nicely, so students could stand on the steps and, using stopwatches, time it as I dropped a tennis ball 5.3 meters, from the top banister to the floor below.

Students' measured falling times (in seconds).

Random and Reading Errors

They had a variety of stopwatches, including a number of phones, at least one wristwatch, and a few of the classroom stopwatches that I had on hand. Some devices could do readings to one hundredth of a second, while others could only do tenths of a second. So you can see that there is some error just due to how precisely the measuring device can be read. We’ll call this the reading error. If the best value your stopwatch gives you is to the tenth of a second, then you have a reading error of plus or minus 0.1 seconds (±0.1 s). And you can’t do much about this other than get a better measuring device.

Another source of error is just due to random differences that will happen with every experimental trial. Maybe you were just a fraction of a second slower stopping your watch this time compared to the last. Maybe a slight gust of air slowed the ball’s fall when it dropped this time. This type of error is usually just called random error, and can only be reduced by taking more and more measurements.

Our combination of reading and random errors meant that we had quite a wide range of results, from a minimum time of 0.7 seconds to a maximum of 1.2 seconds.

So what was the right answer?

Well, you can calculate the falling time if you know the distance (d) the ball fell (d = 5.3 m), and its acceleration due to gravity (g = 9.8 m/s²) using the equation:

 t = \sqrt{\frac{2d}{g}}

which gives:

 t = 1.043 s
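For anyone who wants to check the arithmetic, here is a short sketch of the same calculation in R. With the rounded values quoted here (d = 5.3 m, g = 9.8 m/s²) it comes out at about 1.04 s, very close to the figure above; the small difference may simply come from rounding the height, though that is an assumption on my part. The percent_error helper is a name I made up for this example.

```r
d <- 5.3   # drop height in meters (from the post)
g <- 9.8   # gravitational acceleration in m/s^2 (from the post)

# Falling time from t = sqrt(2d / g).
fall_time <- sqrt(2 * d / g)
print(round(fall_time, 3))

# Hypothetical helper: how far off a measurement is, as a percentage.
percent_error <- function(measured, expected) {
  100 * abs(measured - expected) / expected
}

# The fastest (0.7 s) and slowest (1.2 s) readings reported in the post.
print(percent_error(c(0.7, 1.2), fall_time))
```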

So while some individual measurements were off by over 30%, the average value was off by only 8%, which is a nice illustration of the phenomenon that the more measurements you take, the better your result. In fact, you can plot the improvement in the data by drawing a graph of how the average of the measurements improves with the number of measurements (n) you take.

The first measurement (1.2 s) is much higher than the calculated value, but when you incorporate the next four values in the average it undershoots the actual (calculated) value. However, as you add more and more data points into the average the measured value gets slowly closer to the calculated value.

More measurements reduce the random error, but you tend to get to a point of diminishing returns when your average just does not improve enough to make it worth the effort of taking more measurements. The graph shows the average slowly ramping up after you use five measurements. While there are statistical techniques that can help you determine how many samples are enough, you ultimately have to base your decision on how accurate you want to be and how much time and energy you want to spend on the project. Given the large range of values we have in this example, I would not want to use fewer than six measurements.
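A running average like the one in the graph can be sketched in R with cumsum(). The stopwatch readings below are invented for illustration: only the first value (1.2 s) and the 0.7 to 1.2 s range come from this post, so the output will not match the class results.

```r
# Hypothetical stopwatch readings in seconds, invented for illustration.
# Only the first value (1.2 s) and the 0.7 to 1.2 s range come from the post.
times <- c(1.2, 0.9, 0.8, 0.7, 1.0, 1.1, 0.9, 1.0, 1.0, 0.9, 1.1, 1.0)

# Running average: the mean of the first n readings, for n = 1, 2, 3, ...
running_avg <- cumsum(times) / seq_along(times)

# Calculated falling time from t = sqrt(2d / g), for comparison.
calculated <- sqrt(2 * 5.3 / 9.8)

print(round(running_avg, 3))
# Percent difference between each running average and the calculated value.
print(round(100 * (running_avg - calculated) / calculated, 1))
```

Plotting running_avg against seq_along(times) gives the same kind of graph, and makes it easier to judge where extra measurements stop paying off.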

Systematic Error

But, as you can see from the graph, even with over a dozen measurements, the average measured value remains persistently lower than the calculated value. Why?

This is quite likely due to some systematic error in our experiment – an error you make every time you do the experiment. Systematic errors are the most interesting type of errors because they tell you that something in the way you’ve designed your experiment is faulty.

The most exciting type of systematic error would, in my opinion, be one caused by a fundamental error in your assumptions, because it challenges you to fundamentally reevaluate what you’re doing. The scientists who recently reported seeing particles moving faster than light made their discovery because there was a systematic error in their measurements – an error that may result in the rewriting of the laws of physics.

In our experiment, I calculated the time the tennis ball took to fall using the gravitational acceleration at the surface of the Earth (9.8 m/s²). One important force that I did not consider in the calculation was air resistance. Air resistance would slow down the ball every single time it was dropped. It would be a systematic error. In fact, we could use the error that shows up to actually calculate the force of the air resistance.

However, since air resistance would slow the ball down, it would take longer to hit the floor. Unfortunately, our measurements were shorter than the calculated falling time so air resistance is unlikely to explain our error. So we’re left with some error in how the experiment was done. And quite frankly, I’m not really sure what it is. I suspect it has to do with students’ reaction times – it probably took them longer to start their stopwatches when I dropped the ball than it did to stop them when the ball hit the floor – but I’m not sure. We’ll need further experiments to figure this one out.

In Conclusion

On reflection, I think I probably would have done better using a less dense ball, perhaps a styrofoam ball, that would be more affected by air resistance, so I can show how systematic errors can be useful.

Fortunately (sort of) in my demonstration I made an error in calculating the falling rate – I forgot to include the 2 under the square root sign – so I ended up with a much lower predicted falling time for the ball – which allowed me to go through a whole exercise showing the class how to use Excel’s Goal Seek function to figure out the deceleration due to air resistance.
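Goal Seek adjusts one input until a formula hits a target value, and R’s uniroot() can do much the same job. The sketch below is not what is in my spreadsheet. It assumes, as a deliberate simplification, that air resistance acts as a constant deceleration a, so the ball’s net acceleration is g − a (real drag changes with speed). The measured time of 1.10 s is a hypothetical value picked so that there is something to solve for.

```r
d <- 5.3            # drop height in meters (from the post)
g <- 9.8            # gravitational acceleration in m/s^2 (from the post)
t_measured <- 1.10  # hypothetical measured falling time in seconds

# Difference between the modeled and the measured falling time for a
# given constant deceleration a (m/s^2); the root is where they agree.
time_mismatch <- function(a) sqrt(2 * d / (g - a)) - t_measured

# Search for a between 0 and 9 m/s^2, much as Goal Seek would.
solution <- uniroot(time_mismatch, interval = c(0, 9), tol = 1e-9)
a_found <- solution$root

# Closed-form check: a = g - 2d / t^2.
a_exact <- g - 2 * d / t_measured^2
print(c(a_found, a_exact))
```

For a model this simple the closed form makes the search unnecessary, but the search carries over to models where no such formula exists.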

My Excel Spreadsheet with all the data and calculations is included here.

There are quite a number of other things that I did not get into since I was trying to keep this exercise short (less than half an hour), but one key one would be using significant figures.

There are a number of good, but technical websites dealing with error analysis including this, this and this.

Citing this post: Urbano, L., 2011. Figuring Out Experimental Error, Retrieved February 26th, 2018, from Montessori Muddle: .

Match Stick Rockets

July 24, 2011

A great, simple, and slightly dangerous way of making rockets. There are a number of variations. I like NASA’s because they have a very nice set of instructions.

How to make a match stick rocket. By Steve Cullivan via NASA.

With a stable launch platform that maintains consistent but changeable launch angles, these could be a great source of simple science experiments that look at the physics of ballistics and the math of parabolas (a nice video camera would be a great help here too) and statistics (matchsticks aren’t exactly precision instruments).

Citing this post: Urbano, L., 2011. Match Stick Rockets, Retrieved February 26th, 2018, from Montessori Muddle: .

Ngram: The history of words

December 18, 2010

Graphs of the words Montessori and muddle created with Google Ngram.

If you take all the books ever written and draw a graph showing which words were used when, you’d end up with something like Google’s Ngram. Of course I thought I’d chart “Montessori” and “muddle”.

The “Montessori” graph is interesting. It seems to show the early interest in her work, around 1912, and then an interesting increase in interest in the 1960’s and 1970’s. As with all statistics, one should be cautious about interpreting this type of data. However, I suspect this graph explains a lot about the sources of modern trends in Montessori education. I’d love to hear what someone with more experience thinks.

Alexis Madrigal has an interesting collection of graphs, while Discover has an article with much more detail about what can be done with Google’s database.

Citing this post: Urbano, L., 2010. Ngram: The history of words, Retrieved February 26th, 2018, from Montessori Muddle: .

Creative Commons License
Montessori Muddle by Montessori Muddle is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.