1.0 Overview of R 🌟

R is both a programming language and software environment designed for statistical computation and data visualization that has been around for more than 30 years. While other languages like Python are generally more accessible and “more popular” in certain areas, R has been widely adopted in data science and bioinformatics. That’s because the core functionality of R has been massively extended through thousands of open-source software “packages”. Many of these packages don’t just provide neat statistical models to analyze genomics data, but significantly improve data management, manipulation, modeling, and visualization.

In my personal opinion, the real power and utility of R for bioinformatics comes from the wealth of packages available (and always being developed).

“Base” R provides a command line interface called the “console” that allows you to run a number of functions and make simple plots out of the box. RStudio (developed by Posit, the company formerly known as RStudio) is a more powerful “Integrated Development Environment” or IDE for R. Note that “The R Project” and “RStudio/Posit” are not the same organization/company.

1.1 The RStudio IDE

RStudio allows users to work from a command line interface, edit and run R scripts, edit and run R markdown files (just like this one), manage projects, visualize plots, and navigate files. The RStudio IDE has a default layout with 4 “panels”.

  • “Source”
  • “Console, Terminal,…”
  • “Environment, History, Connections”
  • “Files, Plots, Packages,…”

These can be fully customized if you want.

The “Source” panel shows the file you’re editing. It can have multiple files open at once, organized as tabs.

The “Console, Terminal,…” panel has the console, where you have an R command line interface, and a terminal with access to the system shell.

The “Environment, History, Connections” panel has information about all of the objects, functions, etc. that are currently saved in your environment, as well as a running record of commands that have been run. The environment will normally be cleared when you end your RStudio session, and it can be manually cleared by clicking the “broom” icon.

The “Files, Plots, Packages, Help,…” panel has a file navigator as well as a useful tab for managing packages and accessing package documentation.

1.2 R Notebooks

One of the most useful things about RStudio is the ability to run notebooks. If you’ve ever worked with Jupyter Notebooks in Python the idea is basically the same. An R “notebook” combines text (using markdown for formatting) and code “chunks” that can be interactively run in the notebook.

```{r eval=TRUE}
n <- 10
rnorm(n)
```

Notebooks are a great way to document projects and make tutorials, but I also love using notebooks for data exploration and development.

I often start off projects with a notebook so I can write notes, organize code into discrete chunks, and quickly see visualizations (which I’ll demonstrate later). Then when it comes time to formalize a project I’ll typically transition my notebook code to discrete R scripts.

We can open a new notebook by clicking ‘File -> New File -> R Notebook’. Go ahead and try opening a new notebook.

The first thing you’ll see is some default information and code. There’s a YAML header with basic info about the notebook, like its name and how it will be rendered. Then there’s information about how to make a new code chunk (Ctrl+Alt+I) and how to run chunks (Ctrl+Shift+Enter).

Try making a new chunk, and let’s run a simple command.

print('Hello World!')
## [1] "Hello World!"

You can also run chunks by clicking the green “play” icon in the top-right of the chunk.

You should see the output below the chunk, and also in the RStudio Console panel. Now try running the same command directly in the RStudio console.

There’s a lot more that you can do with RStudio that we don’t have time to cover today, for instance you can run other languages like Bash and Python in R notebook code chunks. But for now we’ll move on to some basic R programming concepts.

2.0 Getting Started 👩‍🎓

2.1 Assigning Objects

Just like other languages, R has its own syntax for assigning variables (aka objects). An object is basically any named thing in R: it could be a single number, a list of names, the output of a function, or a large complex dataset.

In R there are three operators for assigning objects, ‘=’ and ‘<-’ and ‘<<-’.

We’re going to ignore the last one today because it’s not commonly used. Confusingly, ‘=’ and ‘<-’ are largely identical and can often be used interchangeably.

For example, both of the following assignments are correct:

apple <- 10
banana = 10

print(apple)
## [1] 10
print(banana)
## [1] 10

Generally it is considered good stylistic practice to use ‘<-’ outside of functions, and ‘=’ when specifying named arguments inside functions.

There are (apparently) some weird cases where ‘<-’ and ‘=’ may function differently, but for now we’ll just stick to the conventional style.

For example:

test1 <- 5.4321 # this is best
test2 = 1.2345 # this is fine, but not "good style"
round(test1, digits = 0) # this is best
round(test2, digits <- 0) # considered "bad" style
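One of those “weird cases” is worth a quick look: using `<-` inside a function call both passes the value and creates an object in your workspace, whereas `=` only names the argument. A small sketch:

```r
# `=` here is just naming the argument; this line creates no object called 'x'
median(x = 1:10)

# `<-` here passes the same value BUT also assigns 1:10 to a new object 'x'
median(x <- 1:10)
x  # 'x' now exists in your environment
```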

NOTE: you’ll notice I added some text in that chunk with a “#” before it. This is the syntax in R for a “comment”. Anything following a “#” on that line will not be evaluated.

For the most part you can name an object whatever you want as long as you follow these rules/guidelines:

  1. an object name cannot start with a number!
  2. give objects simple but descriptive names
  3. try to use a common style like camelCase or snake_case
  4. be careful not to use an object name that is also a function - this is mostly a style rule to avoid confusing code.

For example:

# this will produce an error
1a <- 'test' 
print <- 'test'
print(print) # the function still works even though there's now a character object of the same name, it's just messy code
## [1] "test"

2.2 Data Structures in R

There are 5 core data structures in R.
  • Atomic Vectors
  • Lists
  • Matrices
  • Dataframes
  • Arrays

These all differ in their dimensionality (1d, 2d, Nd) and the data that’s stored in them (all the same type or a mix of types).

Briefly, datatypes in R include decimal numbers (double), integers, letters/words (character), factors, and TRUE/FALSE (logical).

  • Atomic Vectors are 1d vectors of the same data type.
  • Lists are also 1d vectors but can have a mix of data types.
  • Matrices are 2d (columns and rows) structures where all values are the same data type.
  • Dataframes are 2d structures that can have a mix of data types.
  • Arrays are higher order n-dimensional structures of the same data type.

Let’s go a bit deeper on working with some of these structures.

3.0 Vectors 🏹

R is really all about vectors and running functions on vectors. A vector is either “atomic”, meaning it has all the same data type, or a “list”, which can hold a mix of types.

3.1 Creating atomic vectors

To make atomic vectors we use the c() function, which stands for “combine”.

# this is a character atomic vector
c('this','is','a','character','vector')
## [1] "this"      "is"        "a"         "character" "vector"
# this is a numeric (double) atomic vector; use c(1L,2L,3L,4L) for true integers
c(1,2,3,4)
## [1] 1 2 3 4

What happens if we mix data types with c()?

First let’s try combining integers and characters…

# mixing a number and a logical with these words coerces everything to character type
c(1,TRUE,'this','is','a','character','vector')
## [1] "1"         "TRUE"      "this"      "is"        "a"         "character"
## [7] "vector"

What about combining logicals (TRUE, FALSE) with integers…

c(TRUE,FALSE,20,2,1)
## [1]  1  0 20  2  1

What happened? Our output is a vector of numbers. That’s because in R logical values have a numeric equivalency: TRUE = 1, FALSE = 0.

This demonstrates how c() “coerces” everything into the same type, in this case converting TRUE and FALSE to 1 and 0.
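Coercion follows a fixed hierarchy, logical < integer < double < character, where mixing types promotes everything to the “highest” type present. A quick sketch:

```r
# logical is promoted to integer (the L suffix makes a true integer)
typeof(c(TRUE, 1L))   # "integer"

# integer is promoted to double
typeof(c(1L, 2.5))    # "double"

# anything mixed with a character value becomes character
typeof(c(2.5, 'a'))   # "character"
```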

3.2 Changing types

It’s often useful to check or change the type of an atomic vector. Let’s demonstrate this by first making a new vector and assigning it an object name.

my_numbers <- c(1.1,3.9292,0.0002,4.12) 
my_numbers
## [1] 1.1000 3.9292 0.0002 4.1200

Here we have an atomic vector of doubles. Notice how the values are printed with the same number of decimal places; this is just display formatting, and the stored values are unchanged.

We can check the datatype with the typeof() function.

typeof(my_numbers)
## [1] "double"

And we can convert types in R using one of the functions as.character(),as.numeric(),as.integer(),as.double(),as.logical().

What happens if we try to convert my_numbers to character type versus integer type?

as.character(my_numbers)
## [1] "1.1"    "3.9292" "2e-04"  "4.12"
as.integer(my_numbers)
## [1] 1 3 0 4

What happened when we converted to integer? All the double values were truncated to integers (the decimal part is simply dropped), not rounded.

Now let’s try something that shouldn’t work, making a new vector of characters and converting it to numeric type…

letters <- c('a','b','c') # note: this shadows the built-in `letters` constant, the naming habit we warned against
as.numeric(letters)
## Warning: NAs introduced by coercion
## [1] NA NA NA

In this case we get a Warning that the coercion created NA values, meaning that the coercion didn’t really work. But notice that this is a Warning and not an explicit Error. The code did run and new values were created, they’re just all NA. NA values in R are essentially Null or missing values.

Also note that Warning messages are common in R and while many of them can be safely ignored, in some cases they could be pointing to a real problem.
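If you want to handle this case defensively, one option (a sketch, not the only approach) is to silence the expected warning and then explicitly check for the NAs it signals:

```r
# coerce quietly, then check whether any values failed to convert
converted <- suppressWarnings(as.numeric(c('a', 'b', '3')))
converted              # [1] NA NA  3
any(is.na(converted))  # TRUE, at least one value couldn't be converted
```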

3.2.1 Factors

Brief pit stop to explain the factor datatype. These are a special type in R that are used for categorical values and thus can be very useful.

Factors are most like character vectors, but also have levels, which are the unique “categories/values” present in the vector.

# here is an example with an integer vector
# it has 9 values, but only 5 *distinct* values
nums <- c(1,1,1,1,2,3,4,4,5)
# if we convert it to a factor, we now see that we have 5 levels and 9 values
factor(nums)
## [1] 1 1 1 1 2 3 4 4 5
## Levels: 1 2 3 4 5

The order of a factor’s levels can impact how the vector is used by functions, for instance when plotting. We can change the factor levels by explicitly indicating the order.

factor(nums, levels = c(5,4,3,2,1))
## [1] 1 1 1 1 2 3 4 4 5
## Levels: 5 4 3 2 1
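One classic factor gotcha worth knowing: converting a factor directly to numeric returns the underlying level codes, not the original values. Going through character first gets what you probably wanted:

```r
f <- factor(c('10', '20', '30'))

as.numeric(f)                # [1] 1 2 3 -- the level codes, not the values!
as.numeric(as.character(f))  # [1] 10 20 30 -- the actual values
```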

3.3 Lists

What if we want to maintain a mix of types in our vector? Well then we need a list which we can make with list().

my_list <- list(1,2,3.5,'this','is','a','character','vector',TRUE,FALSE)
my_list
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3.5
## 
## [[4]]
## [1] "this"
## 
## [[5]]
## [1] "is"
## 
## [[6]]
## [1] "a"
## 
## [[7]]
## [1] "character"
## 
## [[8]]
## [1] "vector"
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] FALSE

You should get an output where each input data type is retained in the list.

Lists are their own special “type”, i.e. typeof() on a list should return list. But we can still do type coercion on lists. For example:

as.character(my_list)
##  [1] "1"         "2"         "3.5"       "this"      "is"        "a"        
##  [7] "character" "vector"    "TRUE"      "FALSE"

Doing this both standardizes the data type and converts the list into an atomic vector. NOTE that this is inherently different from making a list where every element has the same data type; that is still a list until it’s been explicitly converted.

For example, we can convert a list directly to a vector with the unlist() function.

unlist(my_list)
##  [1] "1"         "2"         "3.5"       "this"      "is"        "a"        
##  [7] "character" "vector"    "TRUE"      "FALSE"

Because character is the most “accessible” type to standardize the mixed types in my_list, unlist() coerces everything to character. Compare that to unlist() with a list of integers, where the resulting vector contains integers.

unlist(list(1,2,3,4))
## [1] 1 2 3 4

3.4 Working with vectors

So what can we do with vectors?

In R vectors have an “index”, an integer value indicating position in the vector. Using square brackets [ ] we can use an index to pull out a single value…

my_numbers
## [1] 1.1000 3.9292 0.0002 4.1200
my_numbers[1]
## [1] 1.1

In R, vectors are indexed starting at 1, which is notably different from Python, where indexing starts at 0. In this manner we can select several values, a range of values, or take everything but a certain value.

We can access multiple values from the vector by indexing with another vector…

my_numbers[c(1,3)]
## [1] 1.1000 0.0002

Or take a range of values…

my_numbers[2:4]
## [1] 3.9292 0.0002 4.1200

Or grab everything but a certain index…

my_numbers[-2] 
## [1] 1.1000 0.0002 4.1200

We can also add new values to an existing vector with the append() function.

append(my_numbers, 10.1)
## [1]  1.1000  3.9292  0.0002  4.1200 10.1000

Importantly, append() does not alter the input object; it just creates an output with the new value added to the input vector. If we want to update the initial vector we have to explicitly reassign.

my_numbers <- append(my_numbers, 10.1)

Another important quality of vectors is that in R many functions are designed to take vectors as inputs.
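Even basic arithmetic is vectorized: an operator applied to a vector works element-wise, no loop required. A quick sketch using a throwaway vector:

```r
vals <- c(1, 4, 9)

vals * 2      # [1]  2  8 18  -- each element doubled
vals + 1      # [1]  2  5 10  -- 1 added to each element
sqrt(vals)    # [1] 1 2 3     -- sqrt applied element-wise
```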

You’ve already seen that in some of the examples above. But there are a number of base R functions for computing statistics on a vector of numbers.

my_numbers
mean(my_numbers)
median(my_numbers)
sd(my_numbers)
var(my_numbers)
quantile(my_numbers)

These functions only take atomic vectors as inputs and won’t work correctly if we try to pass in a list, even if the list has a consistent datatype. But notice again how the code below does run; it just produces a warning and outputs NA.

mean(list(1,2,3,4,5))
## Warning in mean.default(list(1, 2, 3, 4, 5)): argument is not numeric or
## logical: returning NA
## [1] NA
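If your numbers are stuck in a list, one simple fix is to flatten it with unlist() before computing:

```r
# unlist() collapses the list into an atomic vector, so mean() works as expected
mean(unlist(list(1, 2, 3, 4, 5)))  # [1] 3
```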

3.5 Vector Exercises

  1. Make a new vector of 6 numbers named ‘counts’ with the type character.
  2. Convert the ‘counts’ vector to double.
  3. Add another value to the ‘counts’ vector.
  4. Extract the first, second, and fourth values from the vector.
  5. Calculate the mean and standard deviation of the ‘counts’ vector.

Answers

counts <- c('1.2','3.06','2.222','0.5','0.1','0.01')
counts <- as.double(counts)
counts <- append(counts, 2.31)
counts[c(1,2,4)]
mean(counts)
sd(counts)

4.0 Dataframes 🏠

If vectors are the foundation of R, then dataframes are the house. They are essentially lists of equal-length atomic vectors. Each vector is a column, and each position across those vectors is a row.

Dataframes can have any number of rows and columns, and can be sorted, merged, and split into parts. You will likely spend most of your time in R making, manipulating, and plotting dataframes.

4.1 Making dataframes

There are a couple of common methods to create dataframes in R. The first is to use the data.frame() function and the other is to import existing data from a .csv or .tsv file (more on that later).

The data.frame() function takes “named” vectors as inputs…

df <- data.frame(a = c('Matt','Sally','Emily','Tracey'), 
                 b = c(36,35,2,5), 
                 c = c('tacos','fries','apples','candy'))
df
##        a  b      c
## 1   Matt 36  tacos
## 2  Sally 35  fries
## 3  Emily  2 apples
## 4 Tracey  5  candy

That gives us a simple dataframe with 3 columns (variables) and 4 rows. Note that in the notebook output the type of each variable is displayed indicating how these are essentially different atomic vectors.

Column names (and optionally row names) can be reassigned with functions colnames() and rownames(). Both functions take a dataframe as input and then new names can be specified with another character vector like so.

colnames(df) <- c('name','age','food')
df
##     name age   food
## 1   Matt  36  tacos
## 2  Sally  35  fries
## 3  Emily   2 apples
## 4 Tracey   5  candy

This is often useful particularly when importing data that may not have names or may have confusing names.
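rownames() works the same way. Here’s a quick sketch, rebuilding the example dataframe and assigning some made-up row labels for illustration:

```r
df <- data.frame(name = c('Matt', 'Sally', 'Emily', 'Tracey'),
                 age  = c(36, 35, 2, 5),
                 food = c('tacos', 'fries', 'apples', 'candy'))

# assign hypothetical row labels, just like we renamed columns
rownames(df) <- c('p1', 'p2', 'p3', 'p4')
rownames(df)  # [1] "p1" "p2" "p3" "p4"
```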

4.2 Subsetting and combining

One of the most common things you’ll need to do is extract parts of a dataframe or combine different dataframes. There are a number of ways to do these operations.

We can get specific columns, rows, values, and apply logical filtering. For example because dataframes are lists and columns are named we can directly access a column with the $ operator.

df$name
## [1] "Matt"   "Sally"  "Emily"  "Tracey"

Alternatively we can use square brackets like we did above with single vectors. Indexing dataframes follows the syntax [row, column], but you don’t always have to include a comma as you’ll see below.

#gives us the first element in the dataframe list, which is column 1
df[1] 
##     name
## 1   Matt
## 2  Sally
## 3  Emily
## 4 Tracey
#gives all three columns
df[1:3] 
##     name age   food
## 1   Matt  36  tacos
## 2  Sally  35  fries
## 3  Emily   2 apples
## 4 Tracey   5  candy
#we can also use names with brackets
df['name']
##     name
## 1   Matt
## 2  Sally
## 3  Emily
## 4 Tracey
# If we want a specific row, 
df[1,]
##   name age  food
## 1 Matt  36 tacos
# or one value in a column
df[2,'name']
## [1] "Sally"

4.3 Logical subsetting

We can also do more complex subsetting with logical comparisons. In the context of our example dataframe this would let us get the names of anyone younger than 30 (for example). We do this using the evaluation of a logical test as the indexing vector for the dataframe.

df[df$age < 30,]$name
## [1] "Emily"  "Tracey"

This might look confusing at first so let’s break it down. First let’s just run our logical test on its own.

# This gives a logical vector from our age vector, indicating if each age is less than 30 or not. 
age_check <- df$age < 30
age_check
## [1] FALSE FALSE  TRUE  TRUE
# Now we insert that logical vector as our indexes to the dataframe
df[age_check,] # we add the comma to indicate we need rows back, not columns
##     name age   food
## 3  Emily   2 apples
## 4 Tracey   5  candy
# And if we just want the names...
df[age_check,]$name
## [1] "Emily"  "Tracey"

Side note: I used the logical operator < in the example above. There are several comparison operators in R:

  • < - less than
  • > - greater than
  • <= - less than or equal to
  • >= - greater than or equal to
  • == - equal to
  • != - not equal to

These can be used to compare numbers as well as check if two character values are equivalent such as

'apple' == 'banana'
## [1] FALSE

There is also the %in% operator, which allows for checking if a number or character is in a vector.

1 %in% c(2,3,4,5)
## [1] FALSE
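%in% pairs naturally with logical subsetting: the logical vector it returns can index a dataframe directly. A quick sketch, rebuilding the example dataframe so it runs on its own:

```r
df <- data.frame(name = c('Matt', 'Sally', 'Emily', 'Tracey'),
                 age  = c(36, 35, 2, 5),
                 food = c('tacos', 'fries', 'apples', 'candy'))

# keep only the rows whose name appears in the lookup vector
df[df$name %in% c('Matt', 'Sally'), ]
```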

4.4 Combining dataframes

There are two base R functions available out of the box to combine dataframes, cbind() for merging column-wise and rbind() for row-wise.

# here's a new dataframe... 
df2 <- data.frame(colors = c('red','blue','green','pink'),
                  animals = c('owl','tiger','bird','unicorn'))

# and we want to combine it with our existing one, adding these new vectors as columns.
cbind(df,df2)
##     name age   food colors animals
## 1   Matt  36  tacos    red     owl
## 2  Sally  35  fries   blue   tiger
## 3  Emily   2 apples  green    bird
## 4 Tracey   5  candy   pink unicorn
# we can also just bind in a single column if we want.
cbind(df, dessert = c('ice_cream','cake','popscicle','cupcake'))
##     name age   food   dessert
## 1   Matt  36  tacos ice_cream
## 2  Sally  35  fries      cake
## 3  Emily   2 apples popscicle
## 4 Tracey   5  candy   cupcake

What happens if we try to bind a new column (or dataframe) with a different number of rows than what’s in the original dataframe? Try for yourself.

#cbind(df, dessert = c('cake','popscicle','cupcake'))

You should get an error message because cbind() needs all vectors to have the same length.

If instead we want to add some new values to our first dataframe we can use rbind().

df3 <- data.frame(name = c('Liz','Jerry'), age = c(65,68), food = c('rice','chicken'))
rbind(df,df3)
##     name age    food
## 1   Matt  36   tacos
## 2  Sally  35   fries
## 3  Emily   2  apples
## 4 Tracey   5   candy
## 5    Liz  65    rice
## 6  Jerry  68 chicken

What happens if we don’t have the same vector names or a typo in our new dataframe? Try for yourself.

df3 <- data.frame(n = c('Liz','Jerry'), age = c(65,68), food = c('rice','chicken'))
#rbind(df,df3)

For rbind() to work it has to find matching vector (column) names across the two dataframes.

Note that we are not limited to binding just two dataframes; as long as the vectors are the same length or have the same names we can do these operations on any number of dataframes/vectors.

There is also the base R merge() function to combine two dataframes by common columns or rows. This is very useful in cases where we have two dataframes with some shared information but not the same order of rows and we need to map in missing data.

ages <- data.frame(name = c('Matt','Sally','Emily','Tracey'), 
                   age = c(36,35,2,5))

foods <- data.frame(name = c('Emily','Tracey','Sally','Matt','Todd'),
                    food = c('apples','candy','fries','tacos','pineapple'))

# merge takes two dataframes as input, and a `by` argument to specify what variable to merge on
# in absence of `by`, merge will look for shared variables automatically
merge(ages, foods, by = 'name')
##     name age   food
## 1  Emily   2 apples
## 2   Matt  36  tacos
## 3  Sally  35  fries
## 4 Tracey   5  candy
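Notice that Todd was dropped: by default merge() keeps only rows found in both dataframes (an “inner” join). The `all` argument keeps unmatched rows too, filling the gaps with NA:

```r
ages  <- data.frame(name = c('Matt', 'Sally', 'Emily', 'Tracey'),
                    age  = c(36, 35, 2, 5))
foods <- data.frame(name = c('Emily', 'Tracey', 'Sally', 'Matt', 'Todd'),
                    food = c('apples', 'candy', 'fries', 'tacos', 'pineapple'))

# all = TRUE keeps every row from both dataframes (an "outer" join);
# Todd has no recorded age, so NA is filled in
merge(ages, foods, by = 'name', all = TRUE)
```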

4.5 Importing Data

Most of the time the data you work with in R will already exist as some kind of table or spreadsheet, and you’ll need to import that data into R. For example a table of transcript counts from an RNA-seq experiment. Importing and exporting data is straightforward with the read.csv() and read.table() functions.

Let’s load an example dataset called “Iris” from ‘iris.csv’. Note this dataset comes from a collection of example datasets that ship with R and can be directly accessed with the data() function; I’ve gone ahead and saved it as a csv just for this example.

iris <- read.csv('../example/iris.csv')
head(iris) # the head() function is really useful when working with larger dataframes, it returns just the first few rows (default n = 6). 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
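Exporting is the mirror image, with write.csv(). Here’s a quick round-trip sketch using R’s built-in iris dataset and a temporary file (the path is generated, not a real project file):

```r
# write the built-in iris dataset out to a temporary csv...
tmp <- tempfile(fileext = '.csv')
write.csv(iris, tmp, row.names = FALSE)

# ...and read it back in
reimported <- read.csv(tmp)
head(reimported)

# tab-separated files work similarly via read.table (hypothetical file):
# read.table('my_data.tsv', sep = '\t', header = TRUE)
```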

4.6 Dataframe Exercises

  1. Make a dataframe with three vectors of different types.
  2. Change the column names
  3. Add another column
  4. Get a single value using row/column indexing
  5. Using a logical test to subset the dataframe
  6. Save the subset to your home directory ~/R_workshop_test.csv

Answers

new <- data.frame(a = c(1,2,3,4),
                  b = c('elm','spruce','dogwood','willow'),
                  c = c(22.3,50.1,101.1,97.9))

colnames(new) <- c('rank','tree','age') 

new <- cbind(new, width = rnorm(4, mean = 100, sd = 50))

new[1,2]

new_subset <- new[new$width < 100,]

#write.csv(new_subset, file = '~/R_workshop_test.csv')

5.0 Functions 🤖

5.1 Writing functions

We’ve been using functions throughout this workshop, but what exactly are they? A function is just a discrete chunk of code that takes a number of inputs and arguments and performs some standard task with those inputs.

So instead of having to manually program the calculation of the mean we can just use the mean() function. When you first start working in R you’ll likely stick to these prebuilt functions either from base R or installed packages, like we have above. But we can also write our own custom functions to perform any number of repetitive tasks. Custom or “user-defined” functions offer a powerful method for automating and organizing your analysis.

Here’s an example. Let’s say we have a large RNA-seq dataset from cancer patients with many associated clinical variables. We want to do differential gene expression analysis, but we’re interested in comparing several of these variables, for instance ‘smoking’ vs ‘non-smoking’, ‘biomarker positive’ vs ‘negative’, ‘male’ vs ‘female’, etc. We could just write our code then copy and paste it for every different contrast we want to explore. That works, but it’s messy. Worse, if we ever need to update the code we’ll have to carefully make the same change in multiple repetitive sections, increasing the chance we mess something up. This is where functions can make your life much easier.

Instead of copying and pasting a similar set of operations over and over again, we can write a single generalized function that we can use repeatedly. This makes our analysis cleaner and much easier to modify.

# This is a rough example of how you might draft a function to automate running DESeq with any number of input contrasts
runDESeq <- function(dds, contrast, formula) {
  formula <- as.formula(formula)
  design(dds) <- formula  
  
  dds <- DESeq(dds)
  res <- results(dds, contrast = contrast)
  
  return(res)
}

runDESeq(dds, c('smoking_non-smoking', 'smoking','non-smoking'), '~smoking_status')
runDESeq(dds, c('male_female', 'female','male'), '~gender')
runDESeq(dds, c('bioPlus_bioNeg', 'bioPlus','bioNeg'), '~bioMarker')

5.2 Installing packages

While there are a lot of useful functions available in base R, the real power of R comes from the huge number of packages freely available to download.

A package in R is basically a collection of functions. There are two main public repositories for R packages that you’ll commonly use: CRAN (the Comprehensive R Archive Network) and Bioconductor, which focuses on bioinformatics packages.

We can install new packages with the function install.packages(). Downloaded packages are stored in a system “library”, by default R makes a library in your $HOME directory.

Something like: /nas/longleaf/home/ID/R/x86_64-pc-linux-gnu-library/4.4.

The library path is specific to the underlying system architecture (x86_64…) and the R version you’re currently using (4.4). You can check your R library path with the .libPaths() function.

.libPaths()

For the sake of time we won’t install anything today, but an example would look like…

install.packages('PACKAGE_NAME')

And then to remove packages…

remove.packages('PACKAGE_NAME')

Once a package has been installed we have to explicitly load it into our environment with the library() function like so…

library(package_name) # library() accepts the unquoted package name (quotes also work)

Sometimes different packages will have the same or similar function names. At worst this can cause conflicts where one package will supersede the function of another; at best it can get confusing which functions are coming from which packages.

To overcome this confusion I will often avoid loading too many packages at once and instead directly call functions from specific packages like so:

package_name::function_name()

It’s not always the best choice, and it takes a bit more typing, but generally I find this method helpful for keeping track of functions.

5.3 Loops and Control-Flow

Sometimes we need to run the same function on each value in a vector or dataframe. One way that we can achieve this is by using “loops”. If you’re familiar with loops in other programming languages the idea is the same in R.

A loop takes a vector and works with each item until the end of the vector. In R the syntax for writing loops is:

for (x in 1:10) {
  print(x/2)
}
## [1] 0.5
## [1] 1
## [1] 1.5
## [1] 2
## [1] 2.5
## [1] 3
## [1] 3.5
## [1] 4
## [1] 4.5
## [1] 5

Note that in R the for loop will assign the loop variable in the “global environment”; that means if we already have a variable called “x” it will be overwritten by the for loop and will retain the value of the last element in the vector parsed by the loop.

x <- 100
for (x in 1:10) {}
x
## [1] 10

So be careful when assigning variables in for loops.

We can also introduce “flow-control” into our loops (and functions) with conditional If/Else statements. The general syntax of If/Else statements in R looks like:

# if the condition is True
condition <- TRUE
if (condition) {
  print('do a thing')
} else {
  print('do something else')
}
## [1] "do a thing"

We can integrate an If/Else control into a for loop to conditionally perform functions.

for (x in 1:10){
  if (x <= 5) {
    print(x/2)
  } else if (x == 6){
    print(x*2)
  } else {
    print(x-10)
  }
}
## [1] 0.5
## [1] 1
## [1] 1.5
## [1] 2
## [1] 2.5
## [1] 12
## [1] -3
## [1] -2
## [1] -1
## [1] 0

Hopefully you can appreciate that we can construct sophisticated data flows with loops and If/Else controls.

5.4 Applys

Now that I’ve demonstrated what you can do with ‘for loops’, I’m going to tell you why you should (almost) never use them in R.

That’s because in R standard loops can be slower and offer less control. The alternative are the apply functions: apply(), lapply(), sapply(), vapply(), mapply(), and tapply(). These generally work the same as loops but are often faster and enforce a specific type of output (indicated by the first letter in the function names, e.g. list output with lapply()).
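For instance, lapply() always returns a list while sapply() tries to simplify the result down to a vector. A quick comparison of the two on the same computation:

```r
# same computation, different output containers
lapply(1:3, function(x) x^2)  # a list of three one-element vectors
sapply(1:3, function(x) x^2)  # simplified to the vector: [1] 1 4 9
```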

Here’s a nice visualization of apply functions from Hadley Wickham (Chief Scientist at RStudio/Posit):

Here’s a very simple example of a for loop and the equivalent apply function…

names <- c('bob','tom','sarah','mary')

# using a for loop and the `toupper()` function to make names all caps
names_2 <- c()
for (name in names){
  names_2 <- append(names_2, toupper(name))
}
names_2
## [1] "BOB"   "TOM"   "SARAH" "MARY"
# now with an apply function
# vapply() returns an atomic vector, and we can specify the desired output data type with FUN.VALUE
names_2 <- vapply(names, toupper, FUN.VALUE = character(1), USE.NAMES = F) # a single line of code
names_2
## [1] "BOB"   "TOM"   "SARAH" "MARY"

We can also use custom functions with apply:

names_2 <- vapply(names, function(x) {
  if (x != 'bob') {
    new_name = toupper(x)
  } else {
    new_name = 'BARB'
  }
  return(new_name)
}, FUN.VALUE = character(1), USE.NAMES = F)

names_2
## [1] "BARB"  "TOM"   "SARAH" "MARY"

Apply functions can be very useful and are almost always a better choice than traditional loops. Though admittedly they can be a bit finicky to use at times, it’s definitely worth adopting them over standard loops in the long run.

5.5 Exercises

  1. Write a simple function to convert Fahrenheit to Celsius. The conversion is °C = (°F - 32) × 5/9
  2. Make a vector for fahrenheit temperatures and use a loop to convert each temp to celsius with your new function.
  3. Now try to use lapply() instead of a loop to make a list of celsius values.

Answers

FtoC <- function(f) {
  c <- (f - 32) * 5/9
  return(c)
}

temps <- c(50.2, 88.1, 97.5, 30.0)

c_temps <- c() 
for (t in temps) {
   c_temps <- append(c_temps, FtoC(t))
}

c_temps <- lapply(temps, FtoC)

6.0 Data Visualization with ggplot 👩‍🎨

Data visualization is one of the top reasons R is such a popular language for data analysis. It is easy to make publication quality graphics with sophisticated labelling, grouping, and coloring. R comes with plotting functionality “out of the box” with the plot() function.

Let’s use a different example dataset called mtcars which is data from the 1974 Motor Trend road tests.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# adding the rownames (which are the model names) as a new column so they can be used for plotting
mtcars$model <- rownames(mtcars)

6.1 base R Plots

With the mtcars example dataset we can easily make a scatter plot of car weight versus mpg. We just have to tell plot() which vectors from the dataframe to plot.

# at the simplest, plot takes an x-axis and y-axis argument. 
plot(mtcars$wt, mtcars$mpg)

While the plot() function is capable of quite a bit, there is an external package called ggplot2 that I would argue is so much better at making high-quality plots that it’s not really worth learning to use plot().

6.2 ggplot2

ggplot2 is a package with a lot of powerful functions for making really informative and aesthetically pleasing plots in R. https://ggplot2.tidyverse.org/ It really deserves its own workshop to fully demonstrate, but we’ll quickly cover some of the basics.

One of the main differences between ggplot and plot in base R is that ggplot takes dataframes as inputs, whereas plot takes specific vectors as arguments.

The other major difference is the ggplot syntax, which looks like ggplot_object + plot_layer + plot_layer + ...

Let’s demonstrate. First we make a ggplot object.

library(ggplot2)

# pass the dataframe as the first argument, and then an aesthetics function `aes()` to specify which vectors in the dataframe we want to plot
plot <- ggplot(mtcars, aes(hp, mpg))
plot

This makes a ggplot object, which we’ve stored as “plot”, but you’ll see that even though we have the right axes and values nothing is actually plotted. That’s where the ‘plot layers’ come in. We have to specify how the data in the ggplot object will be used with additional functions, primarily “geometric objects” or geoms.

6.3 Geoms

We specify plots in ggplot with geoms: functions that take the data in the ggplot object and draw a layer onto the plot.

Some geoms I frequently use are:

  • geom_point() - for scatterplots
  • geom_boxplot() or geom_violin() - for distribution plots
  • geom_bar() or geom_col() - for bar graphs

But there are more than 40 geom functions to work with, for everything from simple plots to more complex plots that compute models and/or additional statistics.

A nice list of all geoms (and other ggplot2 functions): ggplot2 Functions

So, to make a scatterplot from our ‘plot’ ggplot object we simply add a layer with geom_point() like so…

# we use + to add on new layers to the plot
plot +
  geom_point()

6.4 Layering geoms

We can add multiple layers in sequence to our ggplot object, overplotting additional information onto the original scatterplot. For instance, we can compute and plot a regression model with the geom_smooth() function and add that as a layer.

plot +
  geom_point() +
  geom_smooth(method = 'loess') # we can specify the type of model (lm, glm, loess) with the `method` argument
## `geom_smooth()` using formula = 'y ~ x'

Or we could add another layer to label all of our points with the car model name.

plot +
  geom_point() +
  geom_text(aes(label = model), hjust = -0.1)

Note there’s a great package called ggrepel that actually makes plotting text labels much cleaner.
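A minimal sketch of what that looks like, assuming the ggrepel package is installed (geom_text_repel() is its drop-in replacement for geom_text()):

```r
library(ggplot2)
library(ggrepel)  # assumes ggrepel is installed

mtcars$model <- rownames(mtcars)

# geom_text_repel() automatically nudges labels apart so they don't overlap
p <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  geom_text_repel(aes(label = model))
p
```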

6.5 Mapping colors

Grouping and coloring plots is critical to making informative and visually exciting figures. There are a number of easy ways to add layers of information with coloring using ggplot.

For example, what if we want to also visualize how another variable maps to our scatter plot? For instance, how does the negative correlation correspond to the number of cylinders? To do this we can specify our ‘cyl’ vector to be used for mapping color.

ggplot(mtcars, aes(mpg, hp, color = cyl)) +
  geom_point()

That’s somewhat useful; we can see that engines with fewer cylinders have lower horsepower and higher miles per gallon.
But ‘cyl’ only has a few discrete values (4, 6, 8) and here we’re using a continuous color scale, which is hard to read.
That’s happening because ‘cyl’ is a double type. Let’s try converting ‘cyl’ to a categorical variable with factor().

ggplot(mtcars, aes(mpg, hp, color = factor(cyl))) +
  geom_point(size = 4) +
  scale_color_brewer(palette = 'Set1')

That’s a lot better. We’ve also adjusted the point size, and specified a color palette with the scale_color_brewer() function.
There are a number of palettes available with ggplot2 that you can see here: - scale-brewer - brewer scales

6.6 ggplot exercises

  1. Use geom_boxplot() to make a boxplot of ‘hp’ values for each cylinder category.
  2. Use geom_point() to add a layer of points on the boxplot
  3. Map the vector ‘mpg’ to color the points.
  4. Change the color scale with scale_color_brewer and the RdPu palette.

Answers

plot <- ggplot(mtcars, aes(factor(cyl), hp, fill = mpg)) +
  geom_boxplot(outliers = F) + 
  geom_point(position = position_dodge2(width = 0.4), shape = 21, size = 3, alpha = 0.7) +
  scale_fill_distiller(palette = 'RdPu') +
  ggtitle('Beautiful Plot') +
  xlab('Cylinders') +
  ylab('Horse Power') 

plot
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

6.7 Saving

When we want to save a ggplot we can use the ggsave() function. We can save to a number of formats (pdf, png, tiff, jpeg) and specify size and resolution.

ggsave('demo_plot.png', plot, width = 5, height = 6)
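For raster formats you can also control resolution with the dpi argument (width and height are in inches by default). A small sketch writing to a temporary file:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# dpi matters for raster formats like png/tiff; vector formats (pdf) ignore it
out_file <- file.path(tempdir(), 'demo_plot.png')
ggsave(out_file, p, width = 5, height = 6, dpi = 300)
```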

7.0 Tidy data and the Tidyverse 🧹

Another highly useful and arguably essential set of packages for data analysis in R is the tidyverse. Together these packages improve on many of the basic R functions and methods, and they are all based on a shared underlying data design philosophy such that they all work together pretty seamlessly.

7.1 The tidy data philosophy

It’s not uncommon that we have data that looks something like this, each row is a different country with values for population in millions for 1990, 2000, and 2010.

pops <- data.frame(country = c('France','England','Germany','Italy'),
           Pop_1990 = c(58,40,80,56),
           Pop_2000 = c(61,42,85,60),
           Pop_2010 = c(64,46,84,59))

pops
##   country Pop_1990 Pop_2000 Pop_2010
## 1  France       58       61       64
## 2 England       40       42       46
## 3 Germany       80       85       84
## 4   Italy       56       60       59

This looks like how we often structure datasets in spreadsheet software like Excel, and that kind of software can make plots from data organized this way. But in R this is a very cumbersome organization.

For example, we can’t easily plot how population is changing over time because time isn’t a variable, it’s part of the column name. And if we wanted to determine in which years France had a population below 62 million, this dataframe organization makes that tricky.

This is where the concept of ‘tidy data’ comes in.

Tidy data follows these rules:

  1. each variable has its own column
  2. each observation is its own row
  3. each value has its own cell

For our example above that would look like this.

data.frame(country = c('France','England','Germany','Italy',
                       'France','England','Germany','Italy',
                       'France','England','Germany','Italy'),
                   population = c(58,40,80,56,61,42,85,60,64,46,84,59),
                   year = c(1990,1990,1990,1990,2000,2000,2000,2000,2010,2010,2010,2010))
##    country population year
## 1   France         58 1990
## 2  England         40 1990
## 3  Germany         80 1990
## 4    Italy         56 1990
## 5   France         61 2000
## 6  England         42 2000
## 7  Germany         85 2000
## 8    Italy         60 2000
## 9   France         64 2010
## 10 England         46 2010
## 11 Germany         84 2010
## 12   Italy         59 2010

Now we have variables for country, year, and population and can easily subset and plot.
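For instance, the France question from before becomes a one-liner. A quick sketch with base R subsetting, storing the tidy dataframe as a hypothetical tidy_pops:

```r
tidy_pops <- data.frame(
  country = rep(c('France','England','Germany','Italy'), times = 3),
  population = c(58,40,80,56, 61,42,85,60, 64,46,84,59),
  year = rep(c(1990, 2000, 2010), each = 4)
)

# in which years was France's population below 62 million?
tidy_pops[tidy_pops$country == 'France' & tidy_pops$population < 62, 'year']
## [1] 1990 2000
```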

7.2 The Tidyverse

The tidyverse is a collection of many packages all designed to make, manipulate, and utilize tidy data. They are all very well made packages with very good documentation and instructions. I honestly do not think I would use R as much as I do if not for the tidyverse.

Several core Tidyverse packages are:

  • dplyr - the toolbox for tidy data
  • tidyr - functions to make data tidier
  • magrittr - pipes!
  • forcats - functions for factors
  • purrr - apply functions for tidy data
  • ggplot2 - oh hey! we know this one already

There are many more, but these are the ones I use regularly.

We do not have time to go into much detail but I would like to just briefly demonstrate how some of these packages work.

7.3 Tidyr

The Tidyr package is here to help clean up your data. It has some super useful functions to easily manipulate dataframes into tidy formats.

Going back to our poorly organized population data, we can easily convert it into a tidy format with the pivot_longer() function.

# we specify our input dataframe, what columns we want to pivot, 
# and then optionally if we want to specify names for the new columns
tidyr::pivot_longer(pops, 
                    cols = c(Pop_1990, Pop_2000, Pop_2010), 
                    names_to = 'year', 
                    values_to = 'population')
## # A tibble: 12 × 3
##    country year     population
##    <chr>   <chr>         <dbl>
##  1 France  Pop_1990         58
##  2 France  Pop_2000         61
##  3 France  Pop_2010         64
##  4 England Pop_1990         40
##  5 England Pop_2000         42
##  6 England Pop_2010         46
##  7 Germany Pop_1990         80
##  8 Germany Pop_2000         85
##  9 Germany Pop_2010         84
## 10 Italy   Pop_1990         56
## 11 Italy   Pop_2000         60
## 12 Italy   Pop_2010         59
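One wrinkle above: the new year column holds the old column names (‘Pop_1990’, etc.) as character strings. pivot_longer() can clean that up on the fly; here’s a sketch using the names_prefix and names_transform arguments:

```r
pops <- data.frame(country = c('France','England','Germany','Italy'),
                   Pop_1990 = c(58,40,80,56),
                   Pop_2000 = c(61,42,85,60),
                   Pop_2010 = c(64,46,84,59))

# names_prefix strips 'Pop_' from the old column names,
# names_transform then converts what's left ('1990', ...) to integers
tidy_pops <- tidyr::pivot_longer(pops,
                                 cols = -country,
                                 names_to = 'year',
                                 names_prefix = 'Pop_',
                                 names_transform = list(year = as.integer),
                                 values_to = 'population')
```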

There is also the companion function pivot_wider(), which does the opposite of pivot_longer(), spreading one column of values across multiple new columns. This can be useful, for instance, when making a matrix of data for plotting heatmaps, where each sample has its own column.
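A minimal sketch of pivot_wider() reversing that pivot, assuming the tidy data is stored as tidy_pops:

```r
tidy_pops <- data.frame(
  country = rep(c('France','England','Germany','Italy'), each = 3),
  year = rep(c(1990, 2000, 2010), times = 4),
  population = c(58,61,64, 40,42,46, 80,85,84, 56,60,59)
)

# names_from picks the column whose values become the new column names,
# values_from picks the column that fills in the cells
wide_pops <- tidyr::pivot_wider(tidy_pops,
                                names_from = year,
                                values_from = population)
```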

Tidyr has a number of other useful functions, but these are the two I use regularly to quickly reshape dataframes.

7.4 Dplyr

The Dplyr package is aptly named because it really does provide a data toolbox. It’s a set of easy to use functions to change, subset, group, and summarise the data in dataframes.

We can add or change columns in dataframes with the mutate() function.

# here we pass in our mtcars dataframe
# and round our mpg column to whole numbers
dplyr::mutate(mtcars, 
              mpg = round(mpg, digits = 0)) 
##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4            21   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag        21   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710           23   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive       21   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout    19   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant              18   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360           14   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D            24   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230             23   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280             19   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C            18   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE           16   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL           17   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC          15   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood   10   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental  10   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial    15   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128             32   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic          30   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla       34   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona        22   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger     16   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin          15   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28           13   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird     19   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9            27   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2        26   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa         30   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L       16   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino         20   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora        15   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E           21   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                                   model
## Mazda RX4                     Mazda RX4
## Mazda RX4 Wag             Mazda RX4 Wag
## Datsun 710                   Datsun 710
## Hornet 4 Drive           Hornet 4 Drive
## Hornet Sportabout     Hornet Sportabout
## Valiant                         Valiant
## Duster 360                   Duster 360
## Merc 240D                     Merc 240D
## Merc 230                       Merc 230
## Merc 280                       Merc 280
## Merc 280C                     Merc 280C
## Merc 450SE                   Merc 450SE
## Merc 450SL                   Merc 450SL
## Merc 450SLC                 Merc 450SLC
## Cadillac Fleetwood   Cadillac Fleetwood
## Lincoln Continental Lincoln Continental
## Chrysler Imperial     Chrysler Imperial
## Fiat 128                       Fiat 128
## Honda Civic                 Honda Civic
## Toyota Corolla           Toyota Corolla
## Toyota Corona             Toyota Corona
## Dodge Challenger       Dodge Challenger
## AMC Javelin                 AMC Javelin
## Camaro Z28                   Camaro Z28
## Pontiac Firebird       Pontiac Firebird
## Fiat X1-9                     Fiat X1-9
## Porsche 914-2             Porsche 914-2
## Lotus Europa               Lotus Europa
## Ford Pantera L           Ford Pantera L
## Ferrari Dino               Ferrari Dino
## Maserati Bora             Maserati Bora
## Volvo 142E                   Volvo 142E
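mutate() can also create entirely new columns from existing ones. A small sketch adding a hypothetical hp_per_klb column (horsepower per 1000 lbs, since wt is in 1000-lb units):

```r
# hp_per_klb is a made-up column name for illustration
mtcars2 <- dplyr::mutate(mtcars, hp_per_klb = hp / wt)
head(mtcars2$hp_per_klb)
```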

Or quickly subset our dataframe with complex logic using filter().

dplyr::filter(mtcars, 
              mpg > 20 & hp > 95 | cyl == 4)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                         model
## Mazda RX4           Mazda RX4
## Mazda RX4 Wag   Mazda RX4 Wag
## Datsun 710         Datsun 710
## Hornet 4 Drive Hornet 4 Drive
## Merc 240D           Merc 240D
## Merc 230             Merc 230
## Fiat 128             Fiat 128
## Honda Civic       Honda Civic
## Toyota Corolla Toyota Corolla
## Toyota Corona   Toyota Corona
## Fiat X1-9           Fiat X1-9
## Porsche 914-2   Porsche 914-2
## Lotus Europa     Lotus Europa
## Volvo 142E         Volvo 142E

Or summarise our data.

dplyr::summarise(mtcars, 
                 mean_mpg = mean(mpg),
                 sd_hp = sd(hp),
                 max_wt = max(wt))
##   mean_mpg    sd_hp max_wt
## 1 20.09062 68.56287  5.424
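summarise() gets much more powerful when combined with group_by(), which makes it compute one row of summaries per group rather than for the whole dataframe. A quick sketch summarising by cylinder count:

```r
# group_by() tags the dataframe so summarise() works per group
by_cyl <- dplyr::group_by(mtcars, cyl)

cyl_summary <- dplyr::summarise(by_cyl,
                                n = dplyr::n(),
                                mean_mpg = mean(mpg),
                                mean_hp = mean(hp))
cyl_summary
```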

7.5 Magrittr

The real elegance of Dplyr and the other packages of the Tidyverse is evident when you start using pipes.

If you’re familiar with pipes in Bash, the idea is the same in R. We can use a pipe operator to pass the output of one function to the input of the next.

There is actually a pipe operator built into base R, |>. But the Magrittr package provides another pipe operator, %>%, that improves on the base version.
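The practical difference: with |> (available in R >= 4.1) the left-hand side always becomes the first argument of the next call, while %>% also supports a . placeholder so the left-hand side can land anywhere in the call. A quick sketch:

```r
library(magrittr)

# base pipe: mtcars is inserted as the first argument of nrow()
n_cars <- mtcars |> nrow()

# magrittr pipe: the `.` placeholder marks where the left-hand side goes
n_four_cyl <- mtcars %>% subset(., cyl == 4) %>% nrow()
```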

We can use pipes to perform a number of operations in sequence.

library(magrittr)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                               model
## Mazda RX4                 Mazda RX4
## Mazda RX4 Wag         Mazda RX4 Wag
## Datsun 710               Datsun 710
## Hornet 4 Drive       Hornet 4 Drive
## Hornet Sportabout Hornet Sportabout
## Valiant                     Valiant
mtcars %>%
  dplyr::mutate(model = rownames(.),
                mpg = round(mpg, digits = 0),
                cyl = as.factor(cyl)) %>%
  dplyr::filter(wt > 3) %>%
  dplyr::group_by(cyl, model) %>%
  dplyr::summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(mean_mpg, model, fill = cyl)) +
  geom_bar(stat = 'identity', position = position_dodge(), alpha = 0.7) +
  scale_fill_brewer(palette = 'Set1')
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.

In my experience pipes make my R code and environment cleaner, because they avoid unnecessarily storing intermediate steps as objects. Piping also lends itself well to functionalizing code.
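For example, a pipeline like the one above can be wrapped into a reusable function; here’s a sketch with a hypothetical helper that takes a weight cutoff as a parameter:

```r
library(magrittr)

# hypothetical helper: mean mpg per cylinder group for cars above a weight cutoff
mean_mpg_by_cyl <- function(df, min_wt) {
  df %>%
    dplyr::filter(wt > min_wt) %>%
    dplyr::group_by(cyl) %>%
    dplyr::summarise(mean_mpg = mean(mpg), .groups = 'drop')
}

mean_mpg_by_cyl(mtcars, 3)
```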

8.0 Package management

When working on multiple projects or preparing projects for publication it’s really important to maintain a record of your development environment. By environment I mean the R version, packages, and package versions that were used in your analysis. Unfortunately, most people have to learn this the hard way.

The problem arises from using a common library of packages across multiple projects and over time. Installing packages with install.packages() will place everything in a common library (usually in your home directory). But this is not a “static environment”; it will change as you install new package updates and new R versions. There is nothing directly connecting your R code to a particular R environment.

Here’s an example: you’ve been working on a project and have code for your analysis that uses the “common” library. You take a break from this project for a couple of months to help out someone with their scRNA-seq analysis; in the meantime you install some new tools, update some old packages, maybe switch over to the newest version of R. When you come back to your first project the output of your analysis now looks different: maybe the number of significant genes is slightly different, or maybe your code doesn’t run at all because of some new package conflict. This is a sadly common story.

This can be a headache to correct. At worst it can mean that a collaborator (or even a reviewer) can’t replicate your analysis.

If only we could set up a package library for each project and track a project-specific environment.

This is exactly what renv does. It provides an easy to use set of functions to initialize project libraries, keep them in sync with your code, and record everything you’ve used. It makes it simple to report your methods and share the environment with collaborators or the public when it’s time to publish.
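The core workflow is only a few functions. A sketch (shown for illustration only; run these from your project’s root rather than copying blindly, since they modify your project):

```r
# initialize a project-specific library and lockfile (renv.lock)
renv::init()

# after installing or updating packages, record the current state in the lockfile
renv::snapshot()

# on another machine (or after cloning the project), reinstall the recorded versions
renv::restore()
```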

There is a great tutorial for renv here: Introduction to renv that walks through exactly what renv does and how to start using it.

But this is my personal plug to encourage everyone to start using this kind of package management / virtual environment in your own projects.