R is both a programming language and software environment designed for statistical computation and data visualization that has been around for more than 30 years. While other languages like Python are generally more accessible and “more popular” in certain areas, R has been widely adopted in data science and bioinformatics. That’s because the core functionality of R has been massively extended through thousands of open-source software “packages”. Many of these packages don’t just provide neat statistical models to analyze genomics data, but significantly improve data management, manipulation, modeling, and visualization.
In my personal opinion, the real power and utility of R for bioinformatics comes from the wealth of packages available (and always being developed).
“Base” R provides a command line interface called the “console” that allows you to run a number of functions and make simple plots out of the box. RStudio is a more powerful “Integrated Development Environment” or IDE for R, developed by Posit (the company formerly named RStudio). Note that “The R Project” and “RStudio/Posit” are not the same organization/company.
RStudio allows users to work from a command line interface, edit and run R scripts, edit and run R markdown files (just like this one), manage projects, visualize plots, and navigate files. The RStudio IDE has a default layout with 4 “panels”.
These can be fully customized if you want.
The “Source” panel shows the file you’re editing. It can have multiple files open at once, organized as tabs.
The “Console, Terminal,” panel has the console where you have an R command line interface, and a terminal with access to the system shell.
The “Environment, History, Connections” panel has information about all of the objects, functions etc. that are currently saved in your environment. As well as a running record of commands that have been run. The environment will normally be cleared when you end your RStudio session, and it can be manually cleared by clicking the “broom” icon.
The “Files, Plots, Packages, Help,…” panel has a file navigator as well as a useful tab for managing packages and accessing package documentation.
One of the most useful things about RStudio is the ability to run
notebooks. If you’ve ever worked with Jupyter Notebooks in Python the
idea is basically the same. An R “notebook” combines text (using
markdown for formatting) and code “chunks” that can be interactively run
in the notebook.
```{r eval=TRUE}
# draw 10 random values from a standard normal distribution
n <- 10
rnorm(n)
```
Notebooks are a great way to document projects and make tutorials, but I also love using notebooks for data exploration and development.
I often start off projects with a notebook so I can write notes, organize code into discrete chunks, and quickly see visualizations (which I’ll demonstrate later). Then when it comes time to formalize a project I’ll typically transition my notebook code to discrete R scripts.
We can open a new notebook by clicking ‘File -> New File -> R Notebook’. Go ahead and try opening a new notebook.
The first thing you’ll see is some default information and code. There’s a YAML header with basic info about the notebook, like its name and how the notebook will be rendered. Then there’s information about how to make a new code chunk (Ctrl+Alt+I) and how to run chunks (Ctrl+Shift+Enter).
Try making a new chunk. And let’s run a simple command.
print('Hello World!')
## [1] "Hello World!"
You can also run chunks by clicking the green “play” icon in the top-right of the chunk.
You should see the output below the chunk, and also in the RStudio Console panel. Now try running the same command directly in the RStudio console.
There’s a lot more that you can do with RStudio that we don’t have time to cover today, for instance you can run other languages like Bash and Python in R notebook code chunks. But for now we’ll move on to some basic R programming concepts.
Just like other languages, R has its own syntax for assigning variables (aka objects). An object is basically any named thing in R; it could be a single number, a list of names, the output of a function, or a large complex dataset.
In R there are three operators for assigning objects, ‘=’ and ‘<-’ and ‘<<-’.
We’re going to ignore the last one today because it’s not commonly used. Confusingly, ‘=’ and ‘<-’ are largely identical and can often be used interchangeably.
For example, both of the following assignments are correct:
apple <- 10
banana = 10
print(apple)
## [1] 10
print(banana)
## [1] 10
Generally it is considered good stylistic practice to use ‘<-’ outside of functions, and ‘=’ when specifying named arguments inside functions.
There are (apparently) some weird cases where ‘<-’ and ‘=’ may function differently, but for now we’ll just stick to the conventional style.
For example:
test1 <- 5.4321 # this is best
test2 = 1.2345 # this is fine, but not "good style"
round(test1, digits = 0) # this is best
round(test2, digits <- 0) # considered "bad" style
NOTE: you’ll notice I added some text in that chunk with a “#” before it. This is R’s syntax for a “comment”. Anything following a “#” on that line will not be evaluated.
For the most part you can name an object whatever you want as long as you follow these rules/guidelines:

- names can’t start with a number
- avoid reusing the names of existing functions (like print)
- use a consistent style, like camelCase or snake_case

For example:
# this will produce an error because object names can't start with a number
#1a <- 'test'
print <- 'test'
print(print) # the function still works even though there's now a character object of the same name, it's just messy code
## [1] "test"
There are 5 core data structures in R:

- Atomic vectors
- Lists
- Matrices
- Dataframes
- Arrays

These all differ in their dimensionality (1d, 2d, Nd) and the data that’s stored in them (all the same type or a mix of types).
Briefly, datatypes in R include decimal numbers (double), integers,
letters/words (character), factors, and True/False (logical).
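As a quick sketch, the typeof() function reports how R classifies a value:

```{r}
typeof(1.5)     # "double"
typeof(2L)      # "integer" (the L suffix makes an integer)
typeof('hello') # "character"
typeof(TRUE)    # "logical"
```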
Let’s go a bit deeper on working with some of these structures.
R is really all about vectors and running functions on vectors. A vector is either “Atomic” meaning it has all the same data type or a “list” with a mix of types.
To make atomic vectors we use the c() function, which
stands for “combine”.
# this is a character atomic vector
c('this','is','a','character','vector')
## [1] "this" "is" "a" "character" "vector"
# this is a numeric atomic vector (R stores these as doubles by default; use 1L, 2L, ... for true integers)
c(1,2,3,4)
## [1] 1 2 3 4
What happens if we mix data types with c()?
First let’s try combining integers and characters…
# if we combine a number and a logical with these words, everything becomes a character type.
c(1,TRUE,'this','is','a','character','vector')
## [1] "1" "TRUE" "this" "is" "a" "character"
## [7] "vector"
What about combining logicals (TRUE, FALSE) with integers…
c(TRUE,FALSE,20,2,1)
## [1] 1 0 20 2 1
What happened? Our output is a numeric vector. That’s because in R logical values have a numeric equivalency: TRUE = 1, FALSE = 0.
This demonstrates how c() “coerces” everything into the
same type, in this case converting TRUE and FALSE to 1 and 0.
It’s often useful to check or change the type of an atomic vector. Let’s demonstrate this by first making a new vector and assigning it an object name.
my_numbers <- c(1.1,3.9292,0.0002,4.12)
my_numbers
## [1] 1.1000 3.9292 0.0002 4.1200
Here we have an atomic vector of doubles. Notice how the printed values are aligned to the same number of decimal places; that’s just display formatting, the stored values are unchanged.
We can check the datatype with the typeof()
function.
typeof(my_numbers)
## [1] "double"
And we can convert types in R using one of the functions as.character(), as.numeric(), as.integer(), as.double(), as.logical().
What happens if we try to convert my_numbers to character type versus to integer type?
as.character(my_numbers)
## [1] "1.1" "3.9292" "2e-04" "4.12"
as.integer(my_numbers)
## [1] 1 3 0 4
What happened when we converted to integer? All the double values were truncated toward zero (the decimal part is simply dropped), not rounded.
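If you actually want rounding rather than truncation, round() is the safer choice. A quick sketch of the difference:

```{r}
as.integer(3.9) # 3, the decimal part is simply dropped
round(3.9)      # 4, rounded to the nearest whole number
```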
Now let’s try something that shouldn’t work, making a new vector of
characters and converting it to numeric type…
# note: this masks R's built-in `letters` constant, which is fine for a quick demo
letters <- c('a','b','c')
as.numeric(letters)
## Warning: NAs introduced by coercion
## [1] NA NA NA
In this case we get a Warning that the coercion created NA values, meaning that the coercion didn’t really work. But notice that this is a Warning and not an explicit Error. The code did run and new values were created, they’re just all NA. NA values in R are essentially Null or missing values.
Also note that Warning messages are common in R and while many of them can be safely ignored, in some cases they could be pointing to a real problem.
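A handy way to detect missing values is the is.na() function, which returns a logical vector. A quick sketch:

```{r}
converted <- as.numeric(c('a','b','3')) # produces NA NA 3 (with a warning)
is.na(converted)      # TRUE TRUE FALSE
sum(is.na(converted)) # 2, counts the NAs since TRUE = 1
```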
Brief pit stop to explain the factor datatype. Factors are a special type in R used for categorical values, and thus can be very useful. They are most like character vectors, but also have levels, which are the set of unique “categories/values” in the vector.
# here is an example with a numeric vector
# it has 9 values, but only 5 *distinct* values
nums <- c(1,1,1,1,2,3,4,4,5)
# if we convert it to a factor, we now see that we have 5 levels and 9 values
factor(nums)
## [1] 1 1 1 1 2 3 4 4 5
## Levels: 1 2 3 4 5
The order of a factor’s levels can impact how the vector is used by functions, for instance when plotting. We can change the factor levels by explicitly indicating the order.
factor(nums, levels = c(5,4,3,2,1))
## [1] 1 1 1 1 2 3 4 4 5
## Levels: 5 4 3 2 1
What if we want to maintain a mix of types in our vector? Well then
we need a list which we can make with
list().
my_list <- list(1,2,3.5,'this','is','a','character','vector',TRUE,FALSE)
my_list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3.5
##
## [[4]]
## [1] "this"
##
## [[5]]
## [1] "is"
##
## [[6]]
## [1] "a"
##
## [[7]]
## [1] "character"
##
## [[8]]
## [1] "vector"
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] FALSE
You should get an output where each input data type is retained in the list.
Lists are their own special “type”, i.e. typeof() on a list returns “list”. But we can still do type coercion on lists. For example:
as.character(my_list)
## [1] "1" "2" "3.5" "this" "is" "a"
## [7] "character" "vector" "TRUE" "FALSE"
Doing this both standardizes the data type and converts the list into an atomic vector. NOTE that this is inherently different from making a list with all the same data type; that is still a list until it’s been explicitly changed. For example, we can convert a list directly to a vector with the unlist() function.
unlist(my_list)
## [1] "1" "2" "3.5" "this" "is" "a"
## [7] "character" "vector" "TRUE" "FALSE"
Because character is the most “accessible” type to standardize the
mixed types in my_list, unlist() coerces
everything to character. Compare that to unlist() with a
list of integers, where the resulting vector contains integers.
unlist(list(1,2,3,4))
## [1] 1 2 3 4
So what can we do with vectors?
In R vectors have an “index”, an integer value indicating position in the vector. Using square brackets [ ] we can use an index to pull out a single value…
my_numbers
## [1] 1.1000 3.9292 0.0002 4.1200
my_numbers[1]
## [1] 1.1
In R, vectors are indexed starting at 1, which is notably different from Python, where indexing starts at 0. In this manner we can select several values, a range of values, or take everything but a certain value.
We can access multiple values from the vector by indexing with another vector…
my_numbers[c(1,3)]
## [1] 1.1000 0.0002
Or take a range of values…
my_numbers[2:4]
## [1] 3.9292 0.0002 4.1200
Or grab everything but a certain index…
my_numbers[-2]
## [1] 1.1000 0.0002 4.1200
We can also add new values to an existing vector with the
append() function.
append(my_numbers, 10.1)
## [1] 1.1000 3.9292 0.0002 4.1200 10.1000
Importantly, append() does not alter the input object,
it just creates an output with the new value added to the input vector.
If we want to update the initial vector we have to explicitly
reassign.
my_numbers <- append(my_numbers, 10.1)
Another important quality of vectors is that in R many functions are designed to take vectors as inputs.
You’ve already seen that in some of the examples above. But there are a number of base R functions for computing statistics on a vector of numbers.
my_numbers
mean(my_numbers)     # arithmetic mean
median(my_numbers)   # median value
sd(my_numbers)       # standard deviation
var(my_numbers)      # variance
quantile(my_numbers) # quantiles (0%, 25%, 50%, 75%, 100%)
These functions only take atomic vectors as inputs and won’t work correctly if we try to pass in a list, even if it has a consistent datatype. But notice again how the code below does run; it just produces a warning and outputs NA.
mean(list(1,2,3,4,5))
## Warning in mean.default(list(1, 2, 3, 4, 5)): argument is not numeric or
## logical: returning NA
## [1] NA
# first convert the character values to doubles
character.double.counts <- c('1.2','3.06','2.222','0.5','0.1','0.01')
counts <- as.double(character.double.counts)
counts <- append(counts, 2.31)
counts[c(1,2,4)]
mean(counts)
sd(counts)
If vectors are the foundation of R, then dataframes are the house. They are essentially lists of equal-length atomic vectors: each vector is a column, and each position across those vectors is a row.
Dataframes can be any number of rows and
columns, can be sorted, merged, and split into parts. You will likely
spend most of your time in R making, manipulating, and plotting
dataframes.
There are a couple of common methods to create dataframes in R. The
first is to use the data.frame() function and the other is
to import existing data from a .csv or .tsv file (more on that
later).
The data.frame() function takes “named” vectors as
inputs…
df <- data.frame(a = c('Matt','Sally','Emily','Tracey'),
b = c(36,35,2,5),
c = c('tacos','fries','apples','candy'))
df
## a b c
## 1 Matt 36 tacos
## 2 Sally 35 fries
## 3 Emily 2 apples
## 4 Tracey 5 candy
That gives us a simple dataframe with 3 columns (variables) and 4 rows. Note that in the notebook output the type of each variable is displayed indicating how these are essentially different atomic vectors.
Column names (and optionally row names) can be reassigned with
functions colnames() and rownames(). Both
functions take a dataframe as input and then new names can be specified
with another character vector like so.
colnames(df) <- c('name','age','food')
df
## name age food
## 1 Matt 36 tacos
## 2 Sally 35 fries
## 3 Emily 2 apples
## 4 Tracey 5 candy
This is often useful particularly when importing data that may not have names or may have confusing names.
One of the most common things you’ll need to do is extract parts of a dataframe or combine different dataframes. There are a number of ways to do these operations.
We can get specific columns, rows, values, and apply logical
filtering. For example because dataframes are lists and columns are
named we can directly access a column with the $
operator.
df$name
## [1] "Matt" "Sally" "Emily" "Tracey"
Alternatively we can use square brackets like we did above with single vectors. Indexing dataframes follows the syntax [row, column], but you don’t always have to include a comma as you’ll see below.
#gives us the first element in the dataframe list, which is column 1
df[1]
## name
## 1 Matt
## 2 Sally
## 3 Emily
## 4 Tracey
#gives all three columns
df[1:3]
## name age food
## 1 Matt 36 tacos
## 2 Sally 35 fries
## 3 Emily 2 apples
## 4 Tracey 5 candy
#we can also use names with brackets
df['name']
## name
## 1 Matt
## 2 Sally
## 3 Emily
## 4 Tracey
# If we want a specific row,
df[1,]
## name age food
## 1 Matt 36 tacos
# or one value in a column
df[2,'name']
## [1] "Sally"
We can also do more complex subsetting with logical comparisons. In the context of our example dataframe this would let us get the names of anyone younger than 30 (for example). We do this using the evaluation of a logical test as the indexing vector for the dataframe.
df[df$age < 30,]$name
## [1] "Emily" "Tracey"
This might look confusing at first so let’s break it down. First let’s just run our logical test on its own.
# This gives a logical vector from our age vector, indicating if each age is less than 30 or not.
age_check <- df$age < 30
age_check
## [1] FALSE FALSE TRUE TRUE
# Now we insert that logical vector as our indexes to the dataframe
df[age_check,] # we add the comma to indicate we need rows back, not columns
## name age food
## 3 Emily 2 apples
## 4 Tracey 5 candy
# And if we just want the names...
df[age_check,]$name
## [1] "Emily" "Tracey"
Side note, I used the logical operator < in the example above. There are several logical operators in R:

- < - less than
- > - greater than
- <= - less than or equal to
- >= - greater than or equal to
- == - equal to
- != - not equal to
These can be used to compare numbers as well as check if two character values are equivalent such as
'apple' == 'banana'
## [1] FALSE
There is also the %in% operator, which allows for
checking if a number or character is in a vector.
1 %in% c(2,3,4,5)
## [1] FALSE
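%in% works just as well with character values, and it’s handy for subsetting dataframes. A quick sketch using our example dataframe from above:

```{r}
'apple' %in% c('apple','banana','pear') # TRUE
# keep only the rows whose name is in a vector of names
df[df$name %in% c('Matt','Emily'),]
```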
There are two base R functions available out of the box to combine
dataframes, cbind() for merging column-wise and
rbind() for row-wise.
# here's a new dataframe...
df2 <- data.frame(colors = c('red','blue','green','pink'),
animals = c('owl','tiger','bird','unicorn'))
# and we want to combine it with our existing one, adding these new vectors as columns.
cbind(df,df2)
## name age food colors animals
## 1 Matt 36 tacos red owl
## 2 Sally 35 fries blue tiger
## 3 Emily 2 apples green bird
## 4 Tracey 5 candy pink unicorn
# we can also just bind in a single column if we want.
cbind(df, dessert = c('ice_cream','cake','popscicle','cupcake'))
## name age food dessert
## 1 Matt 36 tacos ice_cream
## 2 Sally 35 fries cake
## 3 Emily 2 apples popscicle
## 4 Tracey 5 candy cupcake
What happens if we try to bind a new column (or dataframe) with a different number of rows than what’s in the original dataframe? Try for yourself.
#cbind(df, dessert = c('cake','popscicle','cupcake'))
You should get an error message because cbind() needs
all vectors to have the same length.
If instead we want to add some new values to our first dataframe we
can use rbind().
df3 <- data.frame(name = c('Liz','Jerry'), age = c(65,68), food = c('rice','chicken'))
rbind(df,df3)
## name age food
## 1 Matt 36 tacos
## 2 Sally 35 fries
## 3 Emily 2 apples
## 4 Tracey 5 candy
## 5 Liz 65 rice
## 6 Jerry 68 chicken
What happens if we don’t have the same vector names or have a typo in our new dataframe? Try for yourself.
df3 <- data.frame(n = c('Liz','Jerry'), age = c(65,68), food = c('rice','chicken'))
#rbind(df,df3)
For rbind() to work it has to find matching vector names
for all vectors of the two dataframes.
Note that we are not limited to binding just two dataframes; as long as the vectors are the same length or have the same names we can do these operations on any number of dataframes/vectors.
There is also the base R merge() function to combine two
dataframes by common columns or rows. This is very useful in cases where
we have two dataframes with some shared information but not the same
order of rows and we need to map in missing data.
ages <- data.frame(name = c('Matt','Sally','Emily','Tracey'),
age = c(36,35,2,5))
foods <- data.frame(name = c('Emily','Tracey','Sally','Matt','Todd'),
food = c('apples','candy','fries','tacos','pineapple'))
# merge takes two dataframes as input, and a `by` argument to specify what variable to merge on
# in absence of `by` merge will look for shared variables automatically
merge(ages, foods, by = 'name')
## name age food
## 1 Emily 2 apples
## 2 Matt 36 tacos
## 3 Sally 35 fries
## 4 Tracey 5 candy
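Notice that ‘Todd’ was dropped from the output; by default merge() keeps only rows found in both dataframes. If we want to keep everything we can set the all argument, which fills in missing values with NA. A quick sketch:

```{r}
# all = TRUE keeps rows from both dataframes, so Todd is retained with NA for age
merge(ages, foods, by = 'name', all = TRUE)
```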
Most of the time the data you work with in R will already exist as
some kind of table or spreadsheet, and you’ll need to import that data
into R. For example a table of transcript counts from an RNA-seq
experiment. Importing and exporting data is straightforward with the
read.csv() and read.table() functions.
Let’s load an example dataset called “Iris” from ‘iris.csv’. Note this dataset is from a collection of example datasets that are available with R and can be directly accessed with the data() function; I’ve gone ahead and saved it as a csv just for this example.
iris <- read.csv('../example/iris.csv')
head(iris) # the head() function is really useful when working with larger dataframes, it returns just the first few rows (default n = 6).
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
new <- data.frame(a = c(1,2,3,4),
                  b = c('elm','spruce','dogwood','willow'),
                  c = c(22.3,50.1,101.1,97.9))
colnames(new) <- c('rank','tree','age')
new <- cbind(new, width = rnorm(4, mean = 100, sd = 50))
new[1,2]
new_subset <- new[new$width < 100,]
#write.csv(new_subset, file = '~/R_workshop_test.csv')
We’ve been using functions throughout this workshop, but what exactly are they? A function is just a discrete chunk of code that takes a number of inputs and arguments and performs some standard task with those inputs.
So instead of having to manually program the calculation of the mean
we can just use the mean() function. When you first start
working in R you’ll likely stick to these prebuilt functions either from
base R or installed packages, like we have above. But we can also write
our own custom functions to perform any number of repetitive tasks.
Custom or “user-defined” functions offer a powerful method for
automating and organizing your analysis.
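Before the bigger example below, here’s a minimal sketch of the basic anatomy of a user-defined function: the function() keyword, a list of arguments, a body, and a return value.

```{r}
# a trivial function that takes two vectors and returns the difference of their means
mean_diff <- function(a, b) {
  result <- mean(a) - mean(b)
  return(result)
}
mean_diff(c(1,2,3), c(4,5,6)) # returns -3
```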
Here’s an example. Let’s say we have a large RNA-seq dataset from cancer patients with many associated clinical variables. We want to do differential gene expression analysis, but we’re interested in comparing several of these variables, for instance ‘smoking’ vs ‘non-smoking’, ‘biomarker positive’ vs ‘negative’, ‘male’ vs ‘female’, etc. We could just write our code then copy and paste it for every different contrast we want to explore. That works, but it’s messy. Not only that, if we ever need to update the code we’ll have to carefully make our changes in multiple repetitive sections, increasing the chance we mess something up. This is where functions can make your life much easier.
Instead of copying and pasting a similar set of operations over and over again, we can write a single generalized function that we can use repeatedly. This makes our analysis cleaner and much easier to modify.
# This is a rough example of how you might draft a function to automate running DESeq with any number of input contrasts
runDESeq <- function(dds, contrast, formula) {
formula <- as.formula(formula)
design(dds) <- formula
dds <- DESeq(dds)
res <- results(dds, contrast = contrast)
return(res)
}
# the contrast argument is c(design variable, numerator level, denominator level)
runDESeq(dds, c('smoking_status','smoking','non-smoking'), '~smoking_status')
runDESeq(dds, c('gender','female','male'), '~gender')
runDESeq(dds, c('bioMarker','bioPlus','bioNeg'), '~bioMarker')
While there are a lot of useful functions available in base R, the real power of R comes from the huge number of packages freely available to download.
A package in R is basically a collection of functions. There are two main public repositories for R packages that you’ll commonly use: CRAN (the Comprehensive R Archive Network) and Bioconductor (which focuses on bioinformatics packages).
We can install new packages with the function
install.packages(). Downloaded packages are stored in a
system “library”, by default R makes a library in your $HOME
directory.
Something like:
/nas/longleaf/home/ID/R/x86_64-pc-linux-gnu-library/4.4
The library path is specific to the underlying system architecture (x86_64…) and the R version you’re currently using (4.4). You can check your R library path with the .libPaths() function.
.libPaths()
For the sake of time we won’t install anything today, but an example would look like…
install.packages('PACKAGENAME')
And then to remove packages…
remove.packages('PACKAGE_NAME')
Once a package has been installed we have to explicitly load it into
our environment with the library() function like so…
library(package_name) # once installed, the package name can be given without quotes
Sometimes different packages will have the same or similar function names. At worst this can cause conflicts where a package will supersede the function of another package; at best it can get confusing which functions are coming from which packages.
To avoid this confusion I will often avoid loading too many packages at once and instead directly call functions from specific packages like so:
package_name::function_name()
It’s not always the best choice, and it takes a bit more typing, but generally I find this method helpful for keeping track of functions.
Sometimes we need to run the same function on each value in a vector or dataframe. One way that we can achieve this is by using “loops”. If you’re familiar with loops in other programming languages the idea is the same in R.
A loop iterates over a vector, working with each item in turn until the end of the vector. In R the syntax for writing loops is:
for (x in 1:10) {
print(x/2)
}
## [1] 0.5
## [1] 1
## [1] 1.5
## [1] 2
## [1] 2.5
## [1] 3
## [1] 3.5
## [1] 4
## [1] 4.5
## [1] 5
Note that in R the for loop will assign the loop variable to the “global environment”; that means if we already have a variable called “x” it will be overwritten by the for loop and will end up holding the value of the last element in the vector iterated by the loop.
x <- 100
for (x in 1:10) {}
x
## [1] 10
So be careful when assigning variables in for loops.
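One safer idiom is to loop over indices with seq_along(), which keeps the loop variable distinct from your data and handles empty vectors gracefully. A quick sketch:

```{r}
values <- c(10, 20, 30)
for (i in seq_along(values)) { # i takes the values 1, 2, 3
  print(values[i] / 2)
}
```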
We can also introduce “flow-control” into our loops (and functions)
with conditional If/Else statements. The general syntax of
If/Else statements in R looks like:
# if the condition is True
condition <- TRUE
if (condition) {
print('do a thing')
} else {
print('do something else')
}
## [1] "do a thing"
We can integrate an If/Else control into a for loop to conditionally perform functions.
for (x in 1:10){
if (x <= 5) {
print(x/2)
} else if (x == 6){
print(x*2)
} else {
print(x-10)
}
}
## [1] 0.5
## [1] 1
## [1] 1.5
## [1] 2
## [1] 2.5
## [1] 12
## [1] -3
## [1] -2
## [1] -1
## [1] 0
Hopefully you can appreciate that we can construct sophisticated data flows with loops and If/Else controls.
Now that I’ve demonstrated what you can do with ‘for loops’, I’m going to tell you why you should (almost) never use them in R.
That’s because in R standard loops can be slower and offer less control. The alternative is the apply family of functions: apply(), lapply(), sapply(), vapply(), mapply(), and tapply(). These generally work the same as loops but are often faster and enforce a specific type of output (indicated by the first letter in the function names, e.g. list output with lapply()).
Hadley Wickham (Chief Scientist at Posit, formerly RStudio) has a nice visualization of the apply functions that’s worth looking up.
Here’s a very simple example of a for loop and the equivalent apply function…
names <- c('bob','tom','sarah','mary')
# using a for loop and the `toupper()` function to make names all caps
names_2 <- c()
for (name in names){
names_2 <- append(names_2, toupper(name))
}
names_2
## [1] "BOB" "TOM" "SARAH" "MARY"
# now with an apply function
# vapply() returns an atomic vector, and we can specify the desired output data type with FUN.VALUE
names_2 <- vapply(names, toupper, FUN.VALUE = character(1), USE.NAMES = F) # a single line of code
names_2
## [1] "BOB" "TOM" "SARAH" "MARY"
We can also use custom functions with apply:
names_2 <- vapply(names, function(x) {
if (x != 'bob') {
new_name = toupper(x)
} else {
new_name = 'BARB'
}
return(new_name)
}, FUN.VALUE = character(1), USE.NAMES = F)
names_2
## [1] "BARB" "TOM" "SARAH" "MARY"
Apply functions can be very useful and are almost always the better choice than using traditional loops. Though admittedly they can be a bit finicky to use at times, it’s definitely worth it in the long run to adopt them over standard loops.
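For comparison, lapply() always returns a list while sapply() tries to simplify the result down to a vector. A quick sketch with the same names vector:

```{r}
lapply(names, toupper)                    # returns a list of 4 character elements
sapply(names, toupper, USE.NAMES = FALSE) # simplifies to a character vector
```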
Exercise: write a function that converts Fahrenheit to Celsius (°C = (°F - 32) × 5/9), then use lapply() instead of a loop to make a list of Celsius values.

FtoC <- function(f) {
c <- (f - 32) * 5/9
return(c)
}
temps <- c(50.2, 88.1, 97.5, 30.0)
c_temps <- c()
for (t in temps) {
c_temps <- append(c_temps, FtoC(t))
}
c_temps <- lapply(temps, FtoC)
Data visualization is one of the top reasons R is such a popular
language for data analysis. It is easy to make publication quality
graphics with sophisticated labelling, grouping, and coloring. R comes
with plotting functionality “out of the box” with the
plot() function.
Let’s use a different example dataset called mtcars
which is data from the 1974 Motor Trend road tests.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# adding the rownames (which are the model names) as a column so they can be used for plotting
mtcars$model <- rownames(mtcars)
With the mtcars example dataset we can easily make a scatter plot of
the car weight to mpg trend. We just have to tell plot()
what vectors from the dataframe to plot.
# at the simplest, plot takes an x-axis and y-axis argument.
plot(mtcars$wt, mtcars$mpg)
While the plot() function is capable of doing quite a
bit there is an external package called ggplot2 that I
would argue is so much better at making high-quality plots that it’s not
even really worth learning to use plot().
ggplot2 is a package with a lot of powerful functions for making really informative and aesthetically pleasing plots in R: https://ggplot2.tidyverse.org/. It really deserves its own workshop to fully demonstrate, but we’ll quickly cover some of the basics.
One of the main differences between ggplot and
plot in base R is that ggplot takes dataframes
as inputs, wherase plot takes specific vectors as
arguments.
The other major difference is the ggplot syntax, which
looks like
ggplot_object + plot_layer + plot_layer + ...
Let’s demonstrate. First we make a ggplot object.
library(ggplot2)
# pass the dataframe as the first argument, and then an aesthetics function `aes()` to specify which vectors in the dataframe we want to plot
plot <- ggplot(mtcars, aes(hp, mpg))
plot
This makes a ggplot object, which we’ve stored as “plot”, but you’ll see that even though we have the right axes and values nothing is actually plotted. That’s where the ‘plot layers’ come in. We have to specify how the data in the ggplot object is going to be used with additional functions, primarily “geometric objects” or geoms.
We specify plots in ggplot with geoms which
are functions that generate plots from the ggplot object to layer onto
our plot.
Some geoms I frequently use are:

- geom_point() - for scatterplots
- geom_boxplot() or geom_violin() - for distribution plots
- geom_bar() or geom_col() - for bar graphs
But there are more than 40 different geom functions to work with, for making everything from simple plots to more complex plots that compute models and/or additional statistics.
A nice list of all geoms (and other ggplot2 functions) can be found in the ggplot2 reference documentation (“ggplot2 Functions”).
So, to make a scatterplot from our ‘plot’ ggplot object
we simply add a layer with geom_point() like so…
# we use + to add on new layers to the plot
plot +
geom_point()
We can add multiple layers in sequence to our ggplot
object, overplotting additional information onto the original
scatterplot. For instance, we can compute and plot a regression model
with the geom_smooth() function and add that as a
layer.
plot +
geom_point() +
geom_smooth(method = 'loess') # we can specify the type of model (lm, glm, loess) with the `method` argument
## `geom_smooth()` using formula = 'y ~ x'
Or we could add another layer to label all of our points with the car model name.
plot +
geom_point() +
geom_text(aes(label = model), hjust = -0.1)
Note there’s a great package called ggrepel that actually makes plotting text labels much cleaner.
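Here’s a minimal sketch, assuming ggrepel is installed; its geom_text_repel() works as a drop-in replacement for geom_text() and automatically nudges labels apart so they don’t overlap.

```{r}
# install.packages('ggrepel') # if needed
plot +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = model))
```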
Grouping and coloring plots is critical to making informative and visually exciting figures. There are a number of easy ways to add layers of information with coloring using ggplot.
For example, what if we want to also visualize how another variable maps to our scatter plot? For instance, how does the negative correlation correspond to the number of cylinders? To do this we can specify our ‘cyl’ vector to be used for mapping color.
ggplot(mtcars, aes(mpg, hp, color = cyl)) +
geom_point()
That’s somewhat useful; we can see that engines with fewer cylinders have low horsepower and higher miles per gallon. But ‘cyl’ only has a few discrete values (4, 6, 8) and here we’re using a continuous color scale, which is hard to read. That’s happening because ‘cyl’ is a double type. Let’s try converting ‘cyl’ to a categorical variable with factor.
ggplot(mtcars, aes(mpg, hp, color = factor(cyl))) +
geom_point(size = 4) +
scale_color_brewer(palette = 'Set1')
That’s a lot better. We’ve also adjusted the point size, and
specified a color palette with the scale_color_brewer()
function.
There are a number of palettes available with ggplot2; see the scale_brewer documentation and the brewer scales reference.
ggplot exercises:

- Use geom_boxplot() to make a boxplot of ‘hp’ values for each cylinder category.
- Use geom_point() to add a layer of points on the boxplot.
- Color using scale_color_brewer and the RdPu palette.

plot <- ggplot(mtcars, aes(factor(cyl), hp, fill = mpg)) +
geom_boxplot(outliers = F) +
geom_point(position = position_dodge2(width = 0.4), shape = 21, size = 3, alpha = 0.7) +
scale_fill_distiller(palette = 'RdPu') +
ggtitle('Beautiful Plot') +
xlab('Cylinders') +
ylab('Horse Power')
plot
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
When we want to save a ggplot we can use the
ggsave() function. We can save to a number of formats (pdf,
png, tiff, jpeg) and specify size and resolution.
ggsave('demo_plot.png', plot, width = 5, height = 6)
Another highly useful and arguably essential set of packages for data analysis in R is the tidyverse, which is really a set of many packages. Together they improve on many of the basic R functions and methods, and are all based on a shared underlying data design philosophy such that all the packages work together pretty seamlessly.
It’s not uncommon that we have data that looks something like this, each row is a different country with values for population in millions for 1990, 2000, and 2010.
pops <- data.frame(country = c('France','England','Germany','Italy'),
Pop_1990 = c(58,40,80,56),
Pop_2000 = c(61,42,85,60),
Pop_2010 = c(64,46,84,59))
pops
## country Pop_1990 Pop_2000 Pop_2010
## 1 France 58 61 64
## 2 England 40 42 46
## 3 Germany 80 85 84
## 4 Italy 56 60 59
This looks like how we often structure datasets when using spreadsheet software like Excel, and we can make plots with data like this with that kind of software. But in R this is a very cumbersome organization.
For example, we can’t easily plot how population is changing over time because time isn’t a variable, it’s part of the column name. Or if we wanted to determine in which years France had a population below 62 million, this dataframe organization makes that tricky.
This is where the concept of ‘tidy data’ comes in.
Tidy data follows these rules:

1. Each variable has its own column
2. Each observation is its own row
3. Each value has its own cell
For our example above that would look like this.
data.frame(country = c('France','England','Germany','Italy',
'France','England','Germany','Italy',
'France','England','Germany','Italy'),
population = c(58,40,80,56,61,42,85,60,64,46,84,59),
year = c(1990,1990,1990,1990,2000,2000,2000,2000,2010,2010,2010,2010))
## country population year
## 1 France 58 1990
## 2 England 40 1990
## 3 Germany 80 1990
## 4 Italy 56 1990
## 5 France 61 2000
## 6 England 42 2000
## 7 Germany 85 2000
## 8 Italy 60 2000
## 9 France 64 2010
## 10 England 46 2010
## 11 Germany 84 2010
## 12 Italy 59 2010
Now we have variables for country, year, and population and can easily subset and plot.
The tidyverse is a collection of many packages all designed to make, manipulate, and utilize tidy data. They are all very well made packages with very good documentation and instructions. I honestly do not think I would use R as much as I do if not for the tidyverse.
Several core Tidyverse packages are:

- dplyr - the toolbox for tidy data
- tidyr - functions to make data tidier
- magrittr - pipes!
- forcats - functions for factors
- purrr - apply functions for tidy data
- ggplot2 - oh hey! we know this one already

There are many more, but these are the ones I use regularly.
We do not have time to go into much detail but I would like to just briefly demonstrate how some of these packages work.
The Tidyr package is here to help clean up your data. It has some super useful functions to easily manipulate dataframes into tidy formats.
Going back to our poorly organized population data, we can easily
convert it into a tidy format with the pivot_longer()
function.
# we specify our input dataframe, what columns we want to pivot,
# and then optionally if we want to specify names for the new columns
tidyr::pivot_longer(pops,
cols = c(Pop_1990, Pop_2000, Pop_2010),
names_to = 'year',
values_to = 'population')
## # A tibble: 12 × 3
## country year population
## <chr> <chr> <dbl>
## 1 France Pop_1990 58
## 2 France Pop_2000 61
## 3 France Pop_2010 64
## 4 England Pop_1990 40
## 5 England Pop_2000 42
## 6 England Pop_2010 46
## 7 Germany Pop_1990 80
## 8 Germany Pop_2000 85
## 9 Germany Pop_2010 84
## 10 Italy Pop_1990 56
## 11 Italy Pop_2000 60
## 12 Italy Pop_2010 59
There is also the companion function pivot_wider which
does the opposite of pivot_longer, and expands a single
column into multiple columns. This can be useful for instance when
making a matrix of data for plotting heatmaps, where each sample has its
own column.
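For example, we could take the tidy output from above and spread it back out to one column per year. A quick sketch:

```{r}
tidy_pops <- tidyr::pivot_longer(pops,
                                 cols = c(Pop_1990, Pop_2000, Pop_2010),
                                 names_to = 'year',
                                 values_to = 'population')
# pivot_wider reverses the pivot_longer operation
tidyr::pivot_wider(tidy_pops,
                   names_from = year,
                   values_from = population)
```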
Tidyr has a number of other useful functions, but these are the two I use regularly to quickly reshape dataframes.
The Dplyr package is aptly named because it really does provide a data toolbox. It’s a set of easy to use functions to change, subset, group, and summarise the data in dataframes.
We can add or change columns in dataframes with the
mutate() function.
# here we pass in our mtcars dataframe
# and round up our mpg column
dplyr::mutate(mtcars,
mpg = round(mpg, digits = 0))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 23 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 19 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 23 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 18 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 15 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 34 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 22 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 16 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 16 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 20 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## model
## Mazda RX4 Mazda RX4
## Mazda RX4 Wag Mazda RX4 Wag
## Datsun 710 Datsun 710
## Hornet 4 Drive Hornet 4 Drive
## Hornet Sportabout Hornet Sportabout
## Valiant Valiant
## Duster 360 Duster 360
## Merc 240D Merc 240D
## Merc 230 Merc 230
## Merc 280 Merc 280
## Merc 280C Merc 280C
## Merc 450SE Merc 450SE
## Merc 450SL Merc 450SL
## Merc 450SLC Merc 450SLC
## Cadillac Fleetwood Cadillac Fleetwood
## Lincoln Continental Lincoln Continental
## Chrysler Imperial Chrysler Imperial
## Fiat 128 Fiat 128
## Honda Civic Honda Civic
## Toyota Corolla Toyota Corolla
## Toyota Corona Toyota Corona
## Dodge Challenger Dodge Challenger
## AMC Javelin AMC Javelin
## Camaro Z28 Camaro Z28
## Pontiac Firebird Pontiac Firebird
## Fiat X1-9 Fiat X1-9
## Porsche 914-2 Porsche 914-2
## Lotus Europa Lotus Europa
## Ford Pantera L Ford Pantera L
## Ferrari Dino Ferrari Dino
## Maserati Bora Maserati Bora
## Volvo 142E Volvo 142E
Or quickly subset our dataframe with complex logic using
filter().
dplyr::filter(mtcars,
mpg > 20 & hp > 95 | cyl == 4)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## model
## Mazda RX4 Mazda RX4
## Mazda RX4 Wag Mazda RX4 Wag
## Datsun 710 Datsun 710
## Hornet 4 Drive Hornet 4 Drive
## Merc 240D Merc 240D
## Merc 230 Merc 230
## Fiat 128 Fiat 128
## Honda Civic Honda Civic
## Toyota Corolla Toyota Corolla
## Toyota Corona Toyota Corona
## Fiat X1-9 Fiat X1-9
## Porsche 914-2 Porsche 914-2
## Lotus Europa Lotus Europa
## Volvo 142E Volvo 142E
Or summarise our data.
dplyr::summarise(mtcars,
mean_mpg = mean(mpg),
sd_hp = sd(hp),
max_wt = max(wt))
## mean_mpg sd_hp max_wt
## 1 20.09062 68.56287 5.424
The real elegance of Dplyr and the other packages of the Tidyverse is evident when you start using pipes.
If you’re familiar with pipes in Bash, the idea is the same in R. We can use a pipe operator to pass the output of one function to the input of the next.
There is actually a pipe operator built in to base R,
|>. But the Magrittr package provides another pipe
operator %>% that improves on the base version.
We can use pipes to perform a number of operations in sequence.
library(magrittr)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## model
## Mazda RX4 Mazda RX4
## Mazda RX4 Wag Mazda RX4 Wag
## Datsun 710 Datsun 710
## Hornet 4 Drive Hornet 4 Drive
## Hornet Sportabout Hornet Sportabout
## Valiant Valiant
mtcars %>%
dplyr::mutate(model = rownames(.),
mpg = round(mpg, digits = 0),
cyl = as.factor(cyl)) %>%
dplyr::filter(wt > 3) %>%
dplyr::group_by(cyl, model) %>%
dplyr::summarise(mean_mpg = mean(mpg)) %>%
ggplot(aes(mean_mpg, model, fill = cyl)) +
geom_bar(stat = 'identity', position = position_dodge(), alpha = 0.7) +
scale_fill_brewer(palette = 'Set1')
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
In my experience pipes make my R code and environment cleaner, because they avoid unnecessarily storing intermediate steps as objects. It also lends itself well to functionalizing code.
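For instance, a pipeline like the one above can be wrapped into a small reusable function. A rough sketch with a hypothetical helper:

```{r}
# a hypothetical helper that summarises mean mpg by cylinder count
# for cars above a weight cutoff
summarise_heavy_cars <- function(data, min_wt) {
  data %>%
    dplyr::filter(wt > min_wt) %>%
    dplyr::group_by(cyl) %>%
    dplyr::summarise(mean_mpg = mean(mpg))
}
summarise_heavy_cars(mtcars, 3)
```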
When working on multiple projects or preparing projects for publication it’s really important to maintain a record of your development environment. By environment I mean, the R version, packages, and the package versions that were used in your analysis. Unfortunately, most people have to learn this the hard way.
The problem arises from using a common library of packages across multiple projects and over time. Installing packages with install.packages() will place everything in a common library (usually in your home directory). But this is not a “static environment”; it will change as you install new package updates and new R versions. There is nothing directly connecting your R code to a particular R environment.
Here’s an example: you’ve been working on a project and have code for your analysis that uses the “common” library. You take a break from this project for a couple of months to help out someone with their scRNA-seq analysis, in the meantime you install some new tools, update some old packages, maybe switch over to the newest version of R. When you come back to your first project the output of your analysis now looks different, maybe the number of significant genes is slightly different or maybe your code doesn’t run at all because there’s some kind of new package conflict. This is a sadly common story.
This can be a headache to correct. At worst it can mean that a collaborator (or even a reviewer) can’t replicate your analysis.
If only we could set up a package library for each project and track a project-specific environment.
This is exactly what renv does. It provides an easy to
use set of functions to initialize project libraries, keep them in sync
with your code, and record everything you’ve used. It makes it simple to
report your methods and share the environment with collaborators or the
public when it’s time to publish.
There is a great tutorial, “Introduction to renv”, that walks through exactly what renv does and how to start using it.
But this is my personal plug to encourage everyone to start using this kind of package management / virtual environment in your own projects.
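As a minimal sketch of the typical renv workflow (commands shown here but not run):

```{r}
# install.packages('renv') # once, in your common library
# renv::init()     # set up a project-specific library and lockfile
# renv::snapshot() # record the current package versions in renv.lock
# renv::restore()  # recreate the recorded environment from renv.lock
```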