Basic R (using Covid data)

Once you start R you’ll need to figure out which directory you’re working in:

> getwd()

On a Windows machine your default working directory might be something like:

[1] "C:/Users/username"

On OSX or Linux you’ll get something like:

 [1] "/Users/username" 

To get to the directory you want to work in use setwd(). I’ve put my files into the directory: “/Users/lurba/Documents/TFS/Stats/COVID”

To get there from the working directory above I could enter the full path above, or just the relative path like:

> setwd("TFS/Stats/COVID")

Now my data is in the file named “04-20-2020.csv” (from the John Hopkins Covid data repository on github) which I’ll read in with:

> mydata <- read.csv("04-20-2020.csv")

This creates a variable named “mydata” that has the information in it. I can see the column names by using:

> colnames(mydata)

which gives:

 [1] "Province_State"       "Country_Region"       "Last_Update"         
 [4] "Lat"                  "Long_"                "Confirmed"           
 [7] "Deaths"               "Recovered"            "Active"              
 [10] "FIPS"                 "Incident_Rate"        "People_Tested"       
 [13] "People_Hospitalized"  "Mortality_Rate"       "UID"                 
 [16] "ISO3"                 "Testing_Rate"         "Hospitalization_Rate"

Let’s take a look at the summary statistics for the number of confirmed cases, which is in the column labeled “Confirmed”:

> summary(mydata$Confirmed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    317    1964    4499   15347   13302  253060 

This shows that the mean is 15, 347 and the maximum is 253,060 confirmed cases.

I’m curious about which state has that large number of cases, so I’m going to print out the columns with the state names (“Province_State”) and the number of confirmed cases (“Confirmed”). From our colnames command above we can see that “Province_State” is column 1, and “Confirmed” is column 6, so we’ll use the command:

> mydata[ c(1,6) ]

The “c(1,6)” says that we want the columns 1 and 6. This command outputs

             Province_State Confirmed
1                   Alabama      5079
2                    Alaska       321
3            American Samoa         0
4                   Arizona      5068
5                  Arkansas      1973
6                California     33686
7                  Colorado      9730
8               Connecticut     19815
9                  Delaware      2745
10         Diamond Princess        49
11     District of Columbia      2927
12                  Florida     27059
13                  Georgia     19407
14           Grand Princess       103
15                     Guam       136
16                   Hawaii       584
17                    Idaho      1672
18                 Illinois     31513
19                  Indiana     11688
20                     Iowa      3159
21                   Kansas      2048
22                 Kentucky      3050
23                Louisiana     24523
24                    Maine       875
25                 Maryland     13684
26            Massachusetts     38077
27                 Michigan     32000
28                Minnesota      2470
29              Mississippi      4512
30                 Missouri      5890
31                  Montana       433
32                 Nebraska      1648
33                   Nevada      3830
34            New Hampshire      1447
35               New Jersey     88722
36               New Mexico      1971
37                 New York    253060
38           North Carolina      6895
39             North Dakota       627
40 Northern Mariana Islands        14
41                     Ohio     12919
42                 Oklahoma      2680
43                   Oregon      1957
44             Pennsylvania     33914
45              Puerto Rico      1252
46             Rhode Island      5090
47           South Carolina      4446
48             South Dakota      1685
49                Tennessee      7238
50                    Texas     19751
51                     Utah      3213
52                  Vermont       816
53           Virgin Islands        53
54                 Virginia      8990
55               Washington     12114
56            West Virginia       902
57                Wisconsin      4499
58                  Wyoming       317
59                Recovered         0

Looking through, we can see that New York was the state with the largest number of cases.

Note that we could have searched for the row with the maximum number of Confirmed cases using the command:

> d2[which.max(d2$Confirmed),]

Merging Datasets

In class, we’ve been editing the original data file to add a column with the state populations (called “Population”). I have this in a separate file called “state_populations.txt” (which is also a comma separated variable file, .csv, even if not so labeled). So I’m going to import the population data:

> pop <- read.csv("state_population.txt")

Now I’ll merge the two datasets to add the population data to “mydata”.

> mydata <- merge(mydata, pop)

Graphing (Histograms and Boxplots)

With the datasets together we can try doing a histogram of the confirmed cases. Note that there is a column labeled “Confirmed” in the mydata dataset, which we’ll address as “mydata$Confirmed”:

> hist(mydata$Confirmed)
Histogram of confirmed Covid-19 cases as of 04-20-2020.

Note that on April 20th, most states had very few cases, but there were a couple with a lot of cases. It would be nice to see the data that’s clumped in the 0-50000 range broken into more bins, so we’ll add an optional argument to the hist command. The option is called breaks and we’ll request 20 breaks.

> hist(mydata$Confirmed, breaks=20)
A more discretized version of the confirmed cases histogram.

Calculations (cases per 1000 population)

Of course, simply looking at the number of cases in not very informative because you’d expect, with all things being even, that states with the highest populations would have the highest number of cases. So let’s calculate the number of cases per capita. We’ll multiply that number by 1000 to make it more human readable:

> mydata$ConfirmedPerCapita1000 <- mydata$Confirmed / mydata$Population * 1000

Now our histogram would look like:

> hist(mydata$ConfirmedPerCapita1000, breaks=20)
Confirmed cases per 1000 people.

The dataset still has a long tail, but we can see the beginnings of a normal distribution.

The next thing we can do is make a boxplot of our cases per 1000 people. I’m going to set the range option to zero so that the plot has the long tails:

> boxplot(mydata$ConfirmedPerCapita1000, range=0)
Boxplot of US states’ confirmed cases per 1000 people.

The boxplot shows, more or less, the same information in the histogram.

Finding Specific Data in the Dataset

We’d like to figure out how Missouri is doing compared to the rest of the states, so we’ll calculate the z-score, which tells how many standard deviations you are away from the mean. While there is a built in z-score function in R, we’ll first see how we can use the search and statistics methods to find the relevant information.

First, finding Missouri’s number of confirmed cases. To find all of the data in the row for Missouri we can use:

> mydata[mydata$Province_State == "Missouri",]

which gives something like this. It has all of the data but is not easy to read.

   Province_State Population Country_Region         Last_Update     Lat
26       Missouri    5988927             US 2020-04-20 23:36:47 38.4561
      Long_ Confirmed Deaths Recovered Active FIPS Incident_Rate People_Tested
26 -92.2884      5890    200        NA   5690   29      100.5213         56013
   People_Hospitalized Mortality_Rate      UID ISO3 Testing_Rate
26                 873       3.395586 84000029  USA      955.942
   Hospitalization_Rate ConfirmedPerCapita1000
26             14.82173              0.9834817

To extract just the “Confirmed” cases, we’ll add that to our command like so:

> mydata[mydata$Province_State == "Missouri",]$Confirmed
[1] 5890

Which just gives the number 5890. Or the “ConfirmedPerCapita1000”:

> mydata[mydata$Province_State == "Missouri",]$ConfirmedPerCapita1000
[1] 0.9834817

This method would also be useful later on if we want to automate things.

z-score

We have the mean from when we did the summary command, but there’s also a mean command.

> mean(mydata$ConfirmedPerCapita1000)
[1] 2.006805

Similarly you can get the standard deviation with the sd function.

> sd(mydata$ConfirmedPerCapita1000)
[1] 2.400277

We can now calculate the z-score for Missouri:

\text{z-score} = \frac{(X - \mu)}{ \sigma}

which gives a results of:

\text{z-score} =  -0.43

So it looks like Missouri was doing reasonable well back in April, at least compared to the rest of the country.

Math Flowcharts

Flowchart in progress. Showing topics being covered in basic statistical graphing.
Flowchart in progress. Showing topics being covered in basic statistical graphing. The topics they are working on at the moment are highlighted in yellow. The worksheets attached to each topic are linked on the left side in the highlighted area.

To help students track their progress in math we’ve started requiring them to map what they’re doing on flowcharts. Right now, they’re doing it on paper, but we’re working on getting it to be all electronic.

Tracking with the charts helps them see how what they’ve done fits into the bigger picture. It allows them to be able to go back up the chart to previous topics if they need to review, and look forward to what comes next (and to work ahead if they would like).

The image above shows a sample what a student’s flow chart would look like while they are working on a subject (statistics in this case). The topics they are working on at the moment are highlighted in yellow. The worksheets attached to each topic are linked on the left side in the highlighted area. Links to references (Khan Academy for example) are linked on the right–there are only a few on right now (see the Mean topic on the upper right of the flowchart).

The topics on the flowchart can be expanded (using the green button on the top right of each topic) to show more detail.

Expanded window for the topic "Mean".
Expanded window for the topic “Mean”. Detail on the topics is a little sparse at the moment as we’re focusing on setting up all of the flowcharts first.

At the moment, I’m uploading the flowcharts that we’re currently using up on the website myself. Students can use the website to get worksheets and links to references, but if they mark what they’re doing on the webpage it’s not saved. We’re currently working on making it possible to create and edit flowcharts on the website itself. After that, we’ll be setting it up so students can log in with their school accounts and track their progress electronically. One ultimate goal is to map the entire upper school curriculum with these flow charts so a student might be able to track their work all the way from 7th to 12th grade.

How to make a Boxplot in R

Guest post by Grace Appell.

What is a Boxplot?

A box plot is a graph that helps you to analyze a set of data. It used to show the spread of the data. In it you use five data points: the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum.
The minimum is the lowest point in your data set, and the maximum is the largest. The median is the number in the middle of the data set when you have the number lined up numerically.
For example if your data set was this:

 5, 6 ,11, 18, 12, 9, 4 

First you would put them in order lowest to highest.

4, 5, 6, 9, 11, 12, 18

Your median would be 9, because it is the middle number. The minimum would be 4, and the maximum would be 18.
The first quartile would be 5, the median of the numbers below 9, and the third quartile would be 12, the median of the numbers above 9.
So the data you would use in your boxplot would be

(4,5,9,12,18)

The boxplot would look like would look like this.

Example boxplot #1.
Example boxplot #1.

What is R?

R is a software program that is free to download that you can use for calculating statistics and creating graphics.
Here is their website: https://www.r-project.org/

Boxplots in R

In R you can create a boxplot by using this formula.

> Name of data set <- c(minimum, quartile 1, median, quartile 3, maximum)
> boxplot(Name of data set)

First you have to name your data set. In our project where we analyzed the number of times people repeated their Shakespeare lines that they performed, I used the name Macbeth. So the formula looked like this:

> Macbeth <- c(1,9,18,48,93)
> boxplot(Macbeth)

Using this data set, your box plot should look like this

Example boxplot #2.
Example boxplot #2.

Getting Data into R

One of my students is taking an advanced statistics course–mostly online–and it introduced her to the statistical package R. I’ve been meaning to learn how to use R for a while, so I had her show me how use it. This allowed me to give her a final exam that used some PEW survey data for analysis. (I used the data for the 2013 LGBT survey). These are my notes on getting the PEW data, which is in SPSS format, into R.

Instructions on Getting PEW data into R

Go to the link for the 2013 LGBT survey“>2013 LGBT survey and download the data (you will have to set up an account if you have not used their website before).

  • There should be two files.
    • The .sav file contains the data (in SPSS format)
    • The .docx file contains the metadata (what is metadata?).
  • Load the data into R.
    • To load this data type you will need to let R know that you are importing a foreign data type, so execute the command:
    • > library(foreign)
      
    • To get the file’s name and path execute the command:
    • > file.choose()
      
    • The file.choose() command will give you a long string for the file’s path and name: it should look something like “C:\\Users\…” Copy the name and put it in the following command to read the file (Note 1: I’m naming the data “dataset” but you can call it anything you like; Note 2: The string will look different based on which operating system you use. The one you see below is for Windows):
    • > dataset = read.spss(“C:\\Users\...”)
      
    • To see what’s in the dataset you can use the summary command:
    • > summary(dataset)
      
    • To draw a histogram of the data in column “Q39” (which is the age at which the survey respondents realized they were LGBT) use:
    • > hist(dataset$Q35)
      
    • If you would like to export the column of data labeled “Q39” as a comma delimited file (named “helloQ39Data.csv”) to get it into Excel, use:
    • > write.csv(dataset$Q39, ”helloQ39Data.csv”)
      

This should be enough to get started with R. One problem we encountered was that the R version on Windows was able to produce the histogram of the dataset, while the Mac version was not. I have not had time to look into why, but my guess is that the Windows version is able to screen out the non-numeric values in the dataset while the Mac version is not. But that’s just a guess.

Histogram showing the age at which LGBT respondents first felt that they might be something other than heterosexual.
Histogram showing the age at which LGBT respondents first felt that they might be something other than heterosexual.

Liquid Chessboard

Chessboard under regular (day) light.
Chessboard under regular (day) light.

I used the computer controlled (CNC) Shopbot machine at the Techshop to drill out 64 square pockets in the shape of a chessboard. One of my students (Kathryn) designed and printed the pieces as part of an extra credit project for her Geometry class.

The pockets were then filled with a clear eqoxy to give a liquid effect. However, I mixed in two colors of pigmented powder to make the checkerboard. The powder was uv reactive so it fluoresces under black (ultra-violet) light.

Under a black (ultra violet) light bulb.
Under a black (ultra violet) light bulb.

The powder also glows in the dark.

Glowing in the dark.
Glowing in the dark.

Maximum Range of a Potato Gun

One of the middle schoolers built a potato gun for his math class. He was looking a the mathematical relationship between the amount of fuel (hair spray) and the hang-time of the potato. To augment this work, I had my Numerical Methods class do the math and create analytical and numerical models of the projectile motion.

One of the things my students had to figure out was what angle would give the maximum range of the projectile? You can figure this out analytically by finding the function for how the horizontal distance (x) changes as the angle (theta) changes (i.e. x(theta)) and then finding the maximum of the function.

Initial velocity vector (v) and its component vectors in the x and y directions.
Initial velocity vector (v) and its component vectors in the x and y directions for a given angle.

Distance as a function of the angle

In a nutshell, to find the distance traveled by the potato we break its initial velocity into its x and y components (vx and vy), use the y component to find the flight time of the projectile (tf), and then use the vx component to find the distance traveled over the flight time.

Starting with the diagram above we can separate the initial velocity of the potato into its two components using basic trigonometry:

 \cos{\theta} = \frac{v_x}{v}
 \sin{\theta} = \frac{v_y}{v} ,

so,

 v_x = v \cos{\theta} ,
 v_y = v \sin{\theta}

Now we know that the height of a projectile (y) is given by the function:

! y(t) = \frac{a t^2}{2} + v_0 t + y_0

(you can figure this out by assuming that the acceleration due to gravity (a) is constant and acceleration is the second differential of position with respect to time.)

To find the flight time we assume we’re starting with an initial height of zero (y0 = 0), and that the flight ends when the potato hits the ground which is also at zero ((yt = 0), so:

! 0 = \frac{a t^2}{2} + v_0 t + 0

! 0 = \frac{a t^2}{2} + v_0 t

Factoring out t gives:

! 0 = t ( \frac{a t}{2} + v_0)

Looking at the two factors, we can now see that there are two solutions to this problem, which should not be too much of a surprise since the height equation is parabolic (a second order polynomial). The solutions are when:

! t = 0

!  \frac{a t}{2} + v_0 = 0

The first solution is obviously the initial launch time, while the second is going to be the flight time (tf).

!  \frac{a t_f}{2} + v_0 = 0

!  t_f = - \frac{2 v_0}{a}

You might think it’s odd to have a negative in the equation, but remember, the acceleration is negative so it’ll cancel out.

Now since we’re working with the y component of the velocity vector, the initial velocity in this equation (v0) is really just vy:

!   v_0 = v_y

so we can substitute in the trig function for vy to get:

!  t_f = - \frac{2 v  \sin{\theta}}{a}

Our horizontal distance is simply given by the velocity in the x direction (vx) times the flight time:

!  x = v_x t_f

which becomes:

!  x = v_x  \left(- \frac{2 v  \sin{\theta}}{a}\right)

and substituting in the trig function for vx (just to make things look more complicated):

!  x = \left(  v \cos{\theta} \right)  \left(- \frac{2 v  \sin{\theta}}{a}\right)

and factoring out some of the constants gives:

!  x = -\frac{v^2}{a} 2 \sin{\theta}\cos{theta}

Now we have distance as a function of the launch angle.

We can simplify this a little by using the double-angle formula:

!  \sin{2\theta} = 2 \sin{\theta}\cos{theta}

to get:

!  x = -\frac{v^2}{a} \sin{2\theta}

Finding the maximum distance

How do we find the maxima for this function. Sketching the curve should be easy enough, but because we know a little calculus we know that the maximum will occur when the first differential is equal to zero. So we differentiate with respect to the angle to get:

!  \frac{dx}{d\theta} = -\frac{v^2}{a} 2 \cos{2\theta}

and set the differential equal to zero:

!  0 = -\frac{v^2}{a} 2 \cos{2\theta}

and solve to get:

!  \cos{2\theta}  = 0

!  2\theta  = \cos^{-1}{(0)}

Since we remember that the arccosine of 0 is 90 degrees:

!  2\theta  = 90^{\circ}

!  \theta  = 45^{\circ}

And thus we’ve found the angle that gives the maximum launch distance for a potato gun.