Chapter 1 Introduction to R and RStudio

R is the underlying statistical computing environment. You can think of this like the engine of a car. That makes RStudio like the dashboard¹.

R is the engine RStudio is the dashboard

RStudio is an integrated development environment (IDE) that allows us to interact with R. RStudio sits on top of R and makes writing and executing R code a lot easier. We’ll be benefiting from many of the added features that come with RStudio and we will point them out as we go.

1.0.1 Panes in RStudio

When you open RStudio, you will have access to R (there is no need to open R directly).

Now go to the top menu to open a new R Script File –> New File –> R Script

Great! Now you will see four panes

I have mine set up as follows:
- Editor / script / source in the top left
- Console bottom left
- Environment/history on the top right
- Plots/help on the bottom right

Four pane layout in RStudio

On the top left is the script or editor window. This is where we are going to write all of our code.
On the lower left we have the console window. This is where R is running, and this is what you would see if you opened R instead of RStudio. In this pane we can see the code we send and then the answer.
The top right has the environment and history tabs. The Environment is a list of all objects that are saved in memory. The History tab shows all commands that have been run.
On the bottom right hand side there’s a window with several tabs.
- Files shows the file structure of the working directory.
- Plots is where your visualizations will appear.
- Packages shows all of the installed packages where checked ones are loaded and unchecked packages are not loaded.
- Help shows information about functions.
- Viewer for viewing other kinds of output, like web content.

1.0.2 RStudio Global Options

There is one set-up preference that I think everyone will prefer. Go to Tools –> Global Options

In the Code menu check the box for “Soft-wrap R source files”

This wraps long lines of code on to the next line automatically so that you do not have to scroll left and right to see a long line of code.

1.1 Set up an R Project

I mentioned previously that RStudio has a lot of pretty handy features. One of those is the project structure

Before we can start writing code we need to set up a project so that the data and our codes will be in the same place. Go to File –> New Project–> New Directory. I will name mine IntroR and it will be a folder on my desktop

New R Project

Now we have opened up a new instance of RStudio running inside the IntroR folder. Notice on the top of the console pane and the Files pane (bottom right) that the path to the IntroR folder is specified.

When we start reading in data it will be important that the code and the data are in the same place. Creating a project creates an Rproj file that runs R in that folder. If you are familiar with working directories, this process is setting the working directory for this project as this folder.

Once you have a project set up, when you want to read in dataset whatever.txt, you will be able to easily specify the path pointing to whatever.txt rather than having to specify a full path. This is critical for reproducibility, and we’ll talk about that more later.

1.1.1 Download learning materials

To get things arranged for later, please download the dataset we are going to use. Right click –> Save link as to download the file to your computer.

gapminder data set

Move the dataset to your IntroR directory in a subfolder called data.

File structure

Now that we have a project directory and the dataset inside that project directory, make a new R script by going to File –> New File –> R Script. Now you too have the 4 pane layout.

1.2 R as a calculator

R can be used as a calculator. Make sure you’re typing into into the editor. Do not code in the console because this work will not be saved.

Use the Run button on the top of the script pane to run the code.

2 + 2

## [1] 4

Notice the output in the console that tells us the code we wrote and the answer. Let’s try some others.

5 * 4

## [1] 20

2 ^3

## [1] 8

Instead of using the Run button to run code, let’s try the keyboard shortcut to run code. To send code from the editor to the console, place your cursor on a line of code and use CMD+Enter (Mac) or Ctrl+Enter (Windows). This is way faster than using your mouse to hit the Run button each time.

Go back to your code above and run them using the keyboard. We can also run multiple lines of code at once. Now highlight 2 lines of code and run them together.

R also knows order of operations and scientific notation.

(5 + 3)^2

## [1] 64

5 + 3^2

## [1] 14

5e4

## [1] 50000

1.2.1 Comments

Anything after a # sign is a comment, meaning it will not be executed as code. Use them liberally to comment about what you are doing and why.

Today, you can take notes about what you are learning as comments.

Comments are a big part of making your work reproducible for others and for your future self when you open this script a few months from now and need to remember what you were doing.

Commenting is also helpful when you’re testing things out during your analysis so that you can ‘turn off’ parts of your script.

# Here is a comment
# Note that anything after the # is considered a comment

Let’s save our script before we get any further.

Go up to File –> Save As and let’s go with “intro.R”

1.2.2 Creating R objects

Let’s learn to create R objects next. We assign values to objects using the assignment operator “<-”. This arrow is typed with a less than sign followed by a dash. We first name the object on the left and then provide the assignment operator <-, and then the value.

Let’s create an object called thing1 that takes the value 55.

thing1 <- 55

Look in the Environment pane (top right) to see your new R object!

<- is the assignment operator in R. It assigns values on the right to object names on the left. Think of it like an arrow that points from the value to the object. The <- is mostly similar to = but not always. Learn to use <- as it is good R programming practice. Using = in place of <- can lead to issues down the line. The keyboard shortcut for inserting the <- operator is option + dash(Mac) and Alt + dash (Windows).

Objects can be given any name such as x, current_temperature, or subject_id, but they may not have a space in the name. You want your object names to be explicit and not too long. They also cannot start with a number (2x is not valid but x2 is). R is case sensitive (e.g., thing1 is different from Thing1).

Thing1 <- 60

Look in the Environment pane to see that there are now 2 different thing1 objects since we used different casing in the spelling of the object names.

There are some words that should not be used as object names because they represent the names of functions in R. It is best to not use function names as object names since it will be confusing to tell the difference between the object and the function (e.g., c, T, mean, data, df, weights).

If in doubt, start typing the name and if RStudio suggests something it already knows, then that name is already in use.

For example, it is perfectly reasonable to think that data is a great name for your dataset but as you start to type it, the autocomplete function in RStudio tells you that data already exists.

data()

Try to use nouns for object names, and verbs for function names to help yourself remember what each item is.

When assigning a value to an object, R does not print anything. You can ask to print the value by typing the object name:

thing1

## [1] 55

We can overwrite the value of thing1 by re-assigning it

thing1 <- 70

#then call its name to see the object
thing1

## [1] 70

1.2.3 EXERCISE 1

Try these on your own

A. You have a patient with a height (inches) of 73 and a weight (lbs) of 203. Create r objects labeled ‘height’ and ‘weight’.

SHOW ANSWER A

height <- 73
weight <- 203

height

## [1] 73

weight

## [1] 203

B. Convert ‘weight’ to ‘weight_kg’ by dividing by 2.2. Convert ‘height’ to ‘height_m’ by dividing by 39.37

SHOW ANSWER B

weight_kg <- weight / 2.2
height_m <- height / 39.37

weight_kg

## [1] 92.27273

height_m

## [1] 1.854204

C. Calculate a new object ‘bmi’ where BMI = weight_kg / (height_m*height_m)

SHOW ANSWER C

bmi <- weight_kg / (height_m * height_m)
bmi

## [1] 26.83851

You can remove objects from the environment using the rm() function. You can do this one at a time or remove several objects at once by separating their names with ,. The broom button in the Environment pane will remove all objects from your environment.

rm(weight, Thing1)
# Now ask R for weight (uncomment the following line and run it)
# weight
# oops! you should get an error because weight no longer exists!

1.3 Functions

A function is a verb; it tells R to do something. To call an R function, we call the name of the function followed directly by (). The items passed to the function inside the () are called arguments. Arguments change the way a function behaves

Some functions don’t need any arguments

Sys.Date() #get today's date

Some functions just take a single argument. Let’s get the square root of 961. Now let’s get the square root of object1

sqrt(961)

## [1] 31

To learn more about the function, type ? and then the function’s name

?sqrt

Sometimes functions have arguments that have a default value. In those cases, you can override the default value by specifying your own.

For example, let’s look at the help page for the rnorm() function

?rnorm

rnorm() generates random values from the normal distribution. We must supply the n argument since there is no default value, but there is a default value set for the mean and sd arguments.

First we’ll allow the default mean and sd.

rnorm(n = 10)

##  [1] -1.22988646 -0.31079349 -0.54320168  0.21735077  0.43507309
##  [6]  0.21053498  0.14672412 -0.25805385 -0.05882915 -0.98466509

The above code drew 10 random draws from a normal distribution with a mean = 0 and an sd = 1

Now let’s set the n = 10, mean = 50, and the sd = 5 to see 10 random draws from a normal distribution with a mean = 50 and an sd = 5

rnorm(n = 10, mean = 50, sd = 5)

##  [1] 47.69543 43.74427 51.18394 50.81982 56.01156 45.94905 45.69107
##  [8] 66.00850 51.67892 50.52275

What happens if we do not specify n? Uncomment the code below (remove the #) to see what happens

# rnorm(mean = 50, sd = 5)

In the above examples, we have labeled our arguments according to their names in the help menu. If you do not label the arguments, they will be called into the function in the order given in the help menu.

# must be in order given by help menu to work as intended
rnorm(10, 50, 5)

##  [1] 53.18046 51.30431 44.13164 48.51052 52.19761 49.69490 46.08112
##  [8] 41.02595 55.70006 52.94951

#out of order, but works bc the arguments are labeled
rnorm(n = 10, sd = 5, mean = 50)

##  [1] 46.99579 51.23749 58.31002 50.23257 50.42940 42.95445 57.57516
##  [8] 51.74522 49.21021 48.54995

To improve readability (and accuracy) of your code, we would recommend labeling your arguments.

1.3.1 EXERCISE 2

A. Use the arrow operator to create an object called object2 that stores 100 draws from a normal distribution with mean = 500 and sd = 100.

SHOW ANSWER A

object2 <- rnorm(n = 100, mean = 500, sd = 100)
object2

##   [1] 297.1305 509.4696 439.9178 338.5022 495.4454 456.2640 627.5269
##   [8] 500.9576 325.8321 589.9559 566.6740 586.8104 401.9329 582.6794
##  [15] 691.1701 612.2797 572.5787 624.1436 523.8449 397.6236 531.9029
##  [22] 846.0834 532.8453 642.3347 548.5095 527.9075 398.0182 515.0165
##  [29] 489.4775 265.6431 508.7015 633.9289 554.0167 571.0277 609.6461
##  [36] 569.8040 582.4079 457.2796 447.4124 643.4981 523.7328 305.2602
##  [43] 330.6657 453.0843 474.0488 407.0230 474.6653 533.7707 501.9376
##  [50] 604.5314 449.7320 549.5420 387.0427 477.0604 594.6091 363.8655
##  [57] 533.5354 545.5140 633.6954 627.0064 477.1409 318.2809 447.4973
##  [64] 522.9422 439.1719 556.3196 362.3018 251.7203 393.4954 448.9454
##  [71] 514.6430 642.6446 439.2294 543.1201 557.1770 610.3772 479.0298
##  [78] 427.4944 626.9036 463.4344 581.9577 517.3701 589.7731 670.2309
##  [85] 348.7351 571.9980 375.8793 569.6828 485.6591 502.0711 569.2511
##  [92] 564.2181 328.5037 436.5932 405.1674 403.6753 378.1837 474.9271
##  [99] 486.5112 441.5676

B. Call hist(object2) to create a histogram of your normal distribution

SHOW ANSWER B

hist(object2)

Look at the environment. What does it tell you about object2?

The environment pane provides details about objects. We can see that object2 is a numeric object with items 1 through 100. Then we can see the first few numeric items in the object.

Let’s create some more R objects that are collections of several values. To accomplish this, we will use the function c(), which stands for concatenate or combine. Usually functions are named with a full word describing what they do but because combining items together is so common, this function gets a very short name.

object3 <- c(55, 60, 35, 70)

Check out the environment now. It worked! We created object3

Let’s create another object containing a different type of data

object4 <- c("Jack", "Leila", "Rohit")

Check out the environment now. Notice that it specifies that object4 is a character (chr) vector

1.4 More Functions

Let’s sum() everything in object3

sum(object3)

## [1] 220

Try the mean() function on object3

mean(object3)

## [1] 55

What happens if we try to sum() object4? Uncomment the code below to try it

#sum(object4)

What if we take the square root of object3? The sum() and mean() functions both take a vector and return one number. What about sqrt() where we want multiple answers given multiple inputs?

sqrt(object3)

## [1] 7.416198 7.745967 5.916080 8.366600

It worked! Most functions in R are vectorized meaning that they will work on a vector as well as a single value. This means that in R, we usually do not need to write loops like we would in other languages.

1.4.1 EXERCISE 3

Try the following functions on object3 and on object4. What do each of the below functions do? Optionally, call up the help menu for these functions to learn more.

A. class() B. length() C. summary() D. str()

SHOW ANSWER

The class() function provides information about the type of object

class(object3)

## [1] "numeric"

class(object4)

## [1] "character"

length() tells us how many items are in each vector

length(object3)

## [1] 4

length(object4)

## [1] 3

summary() provides a summary of an object. In the case of object3, we have a 6 number numeric summary describing the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. For object 4, summary() tells us that the object is a character

summary(object3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    35.0    50.0    57.5    55.0    62.5    70.0

summary(object4)

##    Length     Class      Mode 
##         3 character character

Finally, str() provides the structure of an object. For object3 and object4 str() returns the same information that we see in the environment, but for more complex objects, str() can be very helpful

str(object3)

##  num [1:4] 55 60 35 70

str(object4)

##  chr [1:3] "Jack" "Leila" "Rohit"

1.5 DataFrames

Let’s move on to learning about dataframes. There are lots of different basic data structures in R. Besides the dataframe, there are also arrays, lists, matrices, etc. We are going to skip those in favor of spending time learning the data structure you’ll probably use most – the dataframe.

We use dataframes to store heterogeneous tabular data in R: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented in columns; heterogeneous, meaning that columns/features/variables can be different classes (a variable like age, can be numeric, while another, like cause of death, can be a character string).

1.6 R Packages

We have the gapminder.csv file into our project directory, but we don’t know anything about it yet. Our goal will be to read it into R so we can start exploring it.

There are lots of ways to load data into R. There is a point-and-click RStudio menu and go to File > Import Data Set > Import From Text File but that is not the most reproducible way to read in data.

Instead, we would prefer that you read data into R for analysis as part of your script.

I gave you the gm dataset as a csv file. “csv” stands for comma separated values. You can save any Excel, SPSS, Qualtrics, etc. data file as a .csv and then import it into R. This is the workflow that we would recommend.

To read a csv file into R, we are going to use a function read_csv() that is accessed from a package.

You can think of an R package like an app on your phone. Fist, we will need to install it from the internet.

When we call the install.packages() function, R goes to the Comprehensive R Archive Network (CRAN) and downloads the specified package. There are over 10K packages listed on CRAN, over 1500 on Bioconductor (bioinformatics packages), and many more under development on people’s github pages etc. You can be sure a package is safe to download if it comes from the CRAN or from Bioconductor.

Uncomment the line below to run the install.packages() function

# install.packages("tidyverse")

Once we have installed it to this computer, you will not need to do that again (until you update R or your OS, etc). Therefore, comment out the install line.

The library() command loads the functions from that package into the R environment so that we can use them. This is like opening the app. We will need to do this every time we open the script.

library(tidyverse)

## ── Attaching packages ───────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ──────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We can see based on the output from the library(tidyverse) line that the tidyverse is actually a megapackage, containing 8 packages. All of these packages share a similar syntax in an attempt to simplify coding and readability for R users. Aside from the core tidyverse packages, there are around 10 other packages

Ok! Now let’s write the line of code to read the csv file into R. We will use the read_csv() function that comes from the readr package (one of the tidyverse packages).

At the beginning of the session, we asked you to save the gm file into the data directory of your project file, so when we write the path to the gm file, we’ll specify that it is in the data folder.

gm <- read_csv("data/gapminder.csv")

## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_double(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )

If the above line did not work for you, follow the steps to create an R project with the gm file in a subdirectory called data.

Assuming you were able to load the data, let’s move on!

Let’s look at this object by calling its name

gm

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

Because we read in the data using read_csv(), the dataframe was read in as a modified dataframe called a tibble. Printing tibbles to the console looks great, but that used to not be the case. If you’d like to learn more about the difference between dataframes and tibbles please see the tibbles section of R for Data Science

Let’s also use the View() function to look at the data. Note that this is a read only viewer - not like Excel where you can go in and change cell values etc. This feature helps with reproducibility.

View(gm)

Let’s go back to the script. Your script is still there. It is in a tab next to the viewer tab.

The third way to look at a dataframe or tibble is to click on the blue arrow next to the gm name in the Environment.

Click the blue arrow next to the gm name in the environment {width = 400px}

This view enables you to see the variable names and classes while you type code, so this is often what my environment looks like.

1.7 Inspecting Dataframes

There are several functions that are useful for investigating dataframes. We already saw some of them in the section on Functions above.

Instead of printing the whole dataframe to the console, we can print an abbreviated version using head() and tail(). By default, these functions give us the first and last 6 rows respectively

head(gm)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

tail(gm)

## # A tibble: 6 x 6
##   country  continent  year lifeExp      pop gdpPercap
##   <chr>    <chr>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.

# see the first 12 rows using the n = argument
head(gm, n = 12)

## # A tibble: 12 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.

Remember, class() tells us the type of object

class(gm)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

We can see that gm is a dataframe and a tibble (tbl)

We can look at the number of rows and columns with dim(), just the number of rows with nrow() and just the number of columns with ncol()

dim(gm)

## [1] 1704    6

nrow(gm)

## [1] 1704

ncol(gm)

## [1] 6

names() will show us the column names

names(gm)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

And probably the two you’ll use the most to inspect data frames, because they are the most descriptive, are summary() and str(). Let’s start with summary()

summary(gm)

##    country           continent              year         lifeExp     
##  Length:1704        Length:1704        Min.   :1952   Min.   :23.60  
##  Class :character   Class :character   1st Qu.:1966   1st Qu.:48.20  
##  Mode  :character   Mode  :character   Median :1980   Median :60.71  
##                                        Mean   :1980   Mean   :59.47  
##                                        3rd Qu.:1993   3rd Qu.:70.85  
##                                        Max.   :2007   Max.   :82.60  
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1

Notice that the output depends on the type of column. For country, a character vector, we get a frequency count of the number of occurences of the first few countries. Same for continent. The other columns are numeric, so their summary is a six number summary showing the minimum, 1st quartile, median, mean, 3rd quartile, and the maximum.

The read_csv() determined what type of column each one was while we were reading in the data. Of course, there are arguments to change the type of column within the read_csv() function.

Let’s look now at the structure of gm.

str(gm)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ year     : num  1952 1957 1962 1967 1972 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   continent = col_character(),
##   ..   year = col_double(),
##   ..   lifeExp = col_double(),
##   ..   pop = col_double(),
##   ..   gdpPercap = col_double()
##   .. )

The structure tells us that gm is a dataframe and tibble object and it specifies the dimensions. Below that, it also gives us each of the column names with the type of data it contains and the first 4 or 5 values for each column.

1.8 Accessing variables

Notice in the str() output that there is a $ in front of each of the variable names. That symbol is how we access invidual variables / columns / vectors from a dataframe object

To access a variable from a dataframe, the syntax we want is dataframe$columnname

Let’s use this to print out all of values in the pop variable. First we call the dataframe, then $ and the variable name

gm$pop

Whoa. That function calls the whole column, which is 1704 observations long. Usually printing out a long vector or column to the console is not useful. Maybe we meant to call head() on one column

head(gm$pop)

## [1]  8425333  9240934 10267083 11537966 13079460 14880372

What if we want to see the first 20 country values?

head(gm$country, n = 20)

##  [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
##  [6] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [11] "Afghanistan" "Afghanistan" "Albania"     "Albania"     "Albania"    
## [16] "Albania"     "Albania"     "Albania"     "Albania"     "Albania"

Nice! We can also use the $ to create a new variable and attach it onto our dataframe.

First let’s look at the first 20 values of the pop column

head(gm$pop, n = 20)

##  [1]  8425333  9240934 10267083 11537966 13079460 14880372 12881816
##  [8] 13867957 16317921 22227415 25268405 31889923  1282697  1476505
## [15]  1728137  1984060  2263554  2509048  2780097  3075321

Let’s say I would like a column where the population is in millions. We’ll take the original gm$pop and divide by 1e6 then save it as a new column on the dataframe.

# dataframe$newvar <- dataframe$oldvar / 1e6
gm$popmill <- gm$pop / 1e6

#head of the new column
head(gm$popmill)

## [1]  8.425333  9.240934 10.267083 11.537966 13.079460 14.880372

Great!

Using this $ syntax, let’s calculate some descriptive statistics for life expectancy in the gm dataset.

Notice that the lifeExp variable is mixed case, so be careful in spelling. However, RStudio’s autocomplete function can help. Once you type the gm$ RStudio autocompletes with the options for variable names so you can just select from the list.

mean(gm$lifeExp)

## [1] 59.47444

sd(gm$lifeExp)

## [1] 12.91711

range(gm$lifeExp)

## [1] 23.599 82.603

1.8.1 EXERCISE 4

A. What’s the standard deviation of the population variable (hint: get help on the sd function with ?sd) B. What’s the mean gdpPercap? C. What’s the range of years represented in the data? D. Run a summary on the lifeExp column

SHOW ANSWERS

sd(gm$pop)

## [1] 106157897

mean(gm$gdpPercap)

## [1] 7215.327

range(gm$year)

## [1] 1952 2007

summary(gm$lifeExp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60

While the gm dataset is fully complete (no missing values), in real life, dataframes often come with missing values. For basic statistical functions like mean, sd, etc., there is an argument na.rm that we can use to remove missing values prior to calculating the statistic.

In this case, the result will not change because we do not have any missings, but in case your dataset does, here is what the code would look like

# calculate the mean population
mean(gm$pop)

## [1] 29601212

#calculate the mean population after removing missings
mean(gm$pop, na.rm = TRUE)

## [1] 29601212

1.9 Subset a dataframe using filter()

Often we want to look at just a subset of the data that meet certain criteria. One really nice way to do this is the filter() function from the dplyr package. The dplyr package is one that we loaded when we loaded the tidyverse.

filter() subsets rows of a dataframe.

The first argument to filter() is the dataframe we are filtering from and the second argument is the logical condition(s) the row must meet to be returned

There are six basic logical operators in R -equal to == -not equal to != -greather than > -greater than or equal to >= -less than < -less than or equal to <=

You can chain multiple conditions together with the AND operator & or the OR | operator

Let’s see how it works by filtering for rows where the population is over 70Million. The first argument is the dataframe and the second is the logical criteria a row must meet to be returned. I’ll choose to use my new popmill variable

filter(gm, popmill > 70)

## # A tibble: 118 x 7
##    country    continent  year lifeExp       pop gdpPercap popmill
##    <chr>      <chr>     <dbl>   <dbl>     <dbl>     <dbl>   <dbl>
##  1 Bangladesh Asia       1972    45.3  70759295      630.    70.8
##  2 Bangladesh Asia       1977    46.9  80428306      660.    80.4
##  3 Bangladesh Asia       1982    50.0  93074406      677.    93.1
##  4 Bangladesh Asia       1987    52.8 103764241      752.   104. 
##  5 Bangladesh Asia       1992    56.0 113704579      838.   114. 
##  6 Bangladesh Asia       1997    59.4 123315288      973.   123. 
##  7 Bangladesh Asia       2002    62.0 135656790     1136.   136. 
##  8 Bangladesh Asia       2007    64.1 150448339     1391.   150. 
##  9 Brazil     Americas   1962    55.7  76039390     3337.    76.0
## 10 Brazil     Americas   1967    57.6  88049823     3430.    88.0
## # … with 108 more rows

We do not need to specify gm$popmill because the first argument told R we would be operating within the gm dataframe. Therefore, we need only specify the variable name in the second argument.

118 rows meet this criteria.

Now let’s see rows belonging to the United States. First let’s use View() to see how USA is spelled. Click on the country column header to sort by country to quickly scroll to the U section.

View(gm)

Ok, now that we know how it is spelled, we can write a line of code to filter for where country is ‘United States’. We need the quotes because country is a character (factor) variable.

filter(gm, country == 'United States')

## # A tibble: 12 x 7
##    country       continent  year lifeExp       pop gdpPercap popmill
##    <chr>         <chr>     <dbl>   <dbl>     <dbl>     <dbl>   <dbl>
##  1 United States Americas   1952    68.4 157553000    13990.    158.
##  2 United States Americas   1957    69.5 171984000    14847.    172.
##  3 United States Americas   1962    70.2 186538000    16173.    187.
##  4 United States Americas   1967    70.8 198712000    19530.    199.
##  5 United States Americas   1972    71.3 209896000    21806.    210.
##  6 United States Americas   1977    73.4 220239000    24073.    220.
##  7 United States Americas   1982    74.6 232187835    25010.    232.
##  8 United States Americas   1987    75.0 242803533    29884.    243.
##  9 United States Americas   1992    76.1 256894189    32004.    257.
## 10 United States Americas   1997    76.8 272911760    35767.    273.
## 11 United States Americas   2002    77.3 287675526    39097.    288.
## 12 United States Americas   2007    78.2 301139947    42952.    301.

Now let’s return the data that meets multiple criteria at once. We’ll use the & to combine the year == 1982 and country == “United States” criteria

filter(gm, year == 1982 & country == 'United States')

## # A tibble: 1 x 7
##   country       continent  year lifeExp       pop gdpPercap popmill
##   <chr>         <chr>     <dbl>   <dbl>     <dbl>     <dbl>   <dbl>
## 1 United States Americas   1982    74.6 232187835    25010.    232.

We’ll do one more together before you will practice on your own. Let’s filter the gm dataset for rows where the population is higher than 1 billion (1e9). This time, let’s start with the original pop variable

filter(gm, pop > 1e9)

## # A tibble: 8 x 7
##   country continent  year lifeExp        pop gdpPercap popmill
##   <chr>   <chr>     <dbl>   <dbl>      <dbl>     <dbl>   <dbl>
## 1 China   Asia       1982    65.5 1000281000      962.   1000.
## 2 China   Asia       1987    67.3 1084035000     1379.   1084.
## 3 China   Asia       1992    68.7 1164970000     1656.   1165.
## 4 China   Asia       1997    70.4 1230075000     2289.   1230.
## 5 China   Asia       2002    72.0 1280400000     3119.   1280.
## 6 China   Asia       2007    73.0 1318683096     4959.   1319.
## 7 India   Asia       2002    62.9 1034172547     1747.   1034.
## 8 India   Asia       2007    64.7 1110396331     2452.   1110.

1.9.1 EXERCISE 1.5

Use the filter() function to return rows matching the given criteria.

A. Which rows have life expectancies of more than 80 years (>80)?

SHOW ANSWER A

filter(gm, lifeExp > 80)

## # A tibble: 21 x 7
##    country          continent  year lifeExp      pop gdpPercap popmill
##    <chr>            <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
##  1 Australia        Oceania    2002    80.4 19546792    30688.  19.5  
##  2 Australia        Oceania    2007    81.2 20434176    34435.  20.4  
##  3 Canada           Americas   2007    80.7 33390141    36319.  33.4  
##  4 France           Europe     2007    80.7 61083916    30470.  61.1  
##  5 Hong Kong, China Asia       2002    81.5  6762476    30209.   6.76 
##  6 Hong Kong, China Asia       2007    82.2  6980412    39725.   6.98 
##  7 Iceland          Europe     2002    80.5   288030    31163.   0.288
##  8 Iceland          Europe     2007    81.8   301931    36181.   0.302
##  9 Israel           Asia       2007    80.7  6426679    25523.   6.43 
## 10 Italy            Europe     2002    80.2 57926999    27968.  57.9  
## # … with 11 more rows

B. Which countries had a low GDP per capita (< 500) in 2007?

SHOW ANSWER B

filter(gm, gdpPercap < 500 & year == 2007)

## # A tibble: 4 x 7
##   country          continent  year lifeExp      pop gdpPercap popmill
##   <chr>            <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
## 1 Burundi          Africa     2007    49.6  8390505      430.    8.39
## 2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.   64.6 
## 3 Liberia          Africa     2007    45.7  3193942      415.    3.19
## 4 Zimbabwe         Africa     2007    43.5 12311143      470.   12.3

filter(gm, year == 2007 & gdpPercap < 500)

## # A tibble: 4 x 7
##   country          continent  year lifeExp      pop gdpPercap popmill
##   <chr>            <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
## 1 Burundi          Africa     2007    49.6  8390505      430.    8.39
## 2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.   64.6 
## 3 Liberia          Africa     2007    45.7  3193942      415.    3.19
## 4 Zimbabwe         Africa     2007    43.5 12311143      470.   12.3

# order doesn't matter

C. Which rows have extremely low GDP per capita (< 300) OR extremely low life expectancy (< 30)?

SHOW ANSWER C

filter(gm, gdpPercap < 300 | lifeExp < 30)

## # A tibble: 6 x 7
##   country          continent  year lifeExp      pop gdpPercap popmill
##   <chr>            <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
## 1 Afghanistan      Asia       1952    28.8  8425333      779.   8.43 
## 2 Congo, Dem. Rep. Africa     2002    45.0 55379852      241.  55.4  
## 3 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.  64.6  
## 4 Guinea-Bissau    Africa     1952    32.5   580653      300.   0.581
## 5 Lesotho          Africa     1952    42.1   748747      299.   0.749
## 6 Rwanda           Africa     1992    23.6  7290203      737.   7.29

1.10 Plots in base R

Plots are a great way to help us explore our dataset to see relationships, investigate interactions, diagnose problems, etc.

Here we will introduce plotting using base R (without loading any extra packages). Chapter 2 is all about plotting using the premier plotting package in R, ggplot2. Jump to /@ref(ggplot)

Let’s start out with a histogram of the life expectancy variable from gm.

hist(gm$lifeExp)

R decided how many breaks to insert in the above histogram, but we can set that manually using the breaks = argument.

hist(gm$lifeExp, breaks=100)

We can also change the color of the bars using col =.

hist(gm$lifeExp, breaks=100, col='blue')

If we wanted to look at more than one numeric variable we could try a scatterplot. The syntax for plot(dataframe$varX, dataframe$varY)

plot(gm$gdpPercap, gm$lifeExp)

The default plotting character in base R is an open circle, which I dislike. Let’s change that using the pch = argument, which stands for plotting character. pch ranges from 0 - 25 and you can easily search for what each looks like on the internet. I’ll change mine to pch = 16, a filled-in circle.

plot(gm$gdpPercap, gm$lifeExp, pch = 16)

Next, I would like to change the color of the points to red using col = "red"

plot(gm$gdpPercap, gm$lifeExp, pch = 16, col = "red")

You can see the names of all 657 base R colors

colors()

To add a title, the argument is main =

plot(gm$gdpPercap, gm$lifeExp, pch = 16, col = "red", main = "Life Exp vs GDP")

Finally, we’ll add an xlabel and a ylabel both in quotes.

plot(gm$gdpPercap, gm$lifeExp, pch = 16, col = "red", main = "Life Exp vs GDP", 
     ylab = "Life Expectancy (years)", 
     xlab = "Per-capita GDP ($)")

There are hundreds of plotting parameters you can use to customize your plot’s appearance. I know these parameters because I have learned them. The internet is your friend in this case, so if you forget how to modify a parameter, don’t be afraid to Google it.

1.10.1 EXERCISE 5

Create a histogram to show the distribution of the gdpPercap variable with 100 breaks. Optional: Add color, axis labels, and a title

SHOW ANSWER

hist(gm$gdpPercap, breaks = 100)

Visualizations are a large part of R’s appeal and in our opinion, learning to plot using ggplot2 will serve you well. Therefore, we only cover the very basics of plotting using base R here, and devote more time to a more comprehensive dive into ggplot2 in Chapter 2 /@ref(ggplot2)

1.11 Write csv file

We’ve already seen how to read in data using read_csv(). Now we’ll do the opposite. There are going to be some cases when you need to save the data you’re working on to open up outside of R.

Just like R has functions to read data of many different kinds of formats, it also has functions to write data into many different kinds of formats. We’ll stick to csv format here.

First, let’s create a dataframe that is a subset of gm where the year is 1997. We’ll name the resulting dataframe

gm97 <- filter(gm, year == 1997)

To save this as a csv file, we will call write_csv() where the first argument is the R object to be written and the second argurment is the name of the proposed file.

write_csv(gm97, "gm97.csv")

Where did it go? Let’s have a look at the Files pane (bottom right) and there it is. It went into our working directory (project directory) automatically. We don’t need to worry about our working directory here because we’re using an R project.

1.12 Saving your work and quitting R

We’ll close this chapter with how to save your work.

Our suggestion is to make sure your R script (top left) is saved and then throw out the rest. After all, the script created the objects in the environment, the output in the console, and all the plots. Remember that to save your script, go to File –> Save or CMD + S (mac) and CTRL + S (pc).

I prefer that RStudio never ask me to save my workspace (Environment, Plots, etc) so I have set that preference in Tools –> Global Options –> General. Save workspace to .RData on exit = “never”. While you are at it, uncheck the options for 1. Restore most recently opened project 2.Restore previosuly open source documents on startup 3. Restore .RData into workspace at startup

Once your script is saved, quit RStudio.

To re-open the project and prove to yourself that all of your hard work has been preserved, double click the Rproj file to launch RStudio in your project directory. Now open your script and start running your code.

To run all the code in an R file (there have to be no errors), highlight the entire code CMD + A (mac) or CTRL + A (pc) and then run.

Woohooo! Happy Running! See you in Chapter 2.

Credit to Modern Dive for the R and RStudio analogies↩