On Github nlk124 / Rminicourse
Type directly into the console (best when you don't want to save the code) or type into the script - then run (CTRL + ENTER) (Windows) or (CMD + ENTER) (Mac)
A script is a plain text file with R commands in it. This will be where you save the code that you are writing - the file will end in the extension .R
R has many arithmetic operators
R obeys the standard order of operations
Examples
7 + 4
## [1] 11
3^2
## [1] 9
10 %% 7
## [1] 3
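A minimal sketch of what "standard order of operations" means in practice. The expressions and object names below are illustrative and are not from the original slides.

```r
# Multiplication is evaluated before addition
no_parens <- 2 + 3 * 4
no_parens      # 14

# Parentheses override the default order
with_parens <- (2 + 3) * 4
with_parens    # 20

# Exponentiation is evaluated before the unary minus
neg_square <- -2^2
neg_square     # -4

# Wrap the negative number in parentheses to square it
square_of_neg <- (-2)^2
square_of_neg  # 4
```

When in doubt, add parentheses: they cost nothing and make the intent obvious to the reader.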
Examples
7 == 4
## [1] FALSE
3 > 2
## [1] TRUE
7!=4
## [1] TRUE
What is 17 multiplied by 365?
What is 13 cubed?
Is 9 to the fourth equal to the sum of 2000 and 187 multiplied by 3?
An object is the fundamental unit in R. All expressions can be saved as an object.
To create an object from an expression we use the assignment operator (<-). The assignment operator assigns values on the right to objects on the left.
a <- (12 + 180) * 3
a
## [1] 576
Do not use a = 12 + 180 for assignment in R. It's best practice to use <- instead of =.
This may seem arbitrary, but it is helpful when you are reading someone else's code.
Use # signs to comment on your script. Anything to the right of a # is ignored. Good scripts (and homework) have comments before every major block of code. It's surprisingly hard to remember what you did when reviewing older code without comments, and it's particularly important when other people are reading your code.
5 + 5 # This adds five and five
## [1] 10
# 10 + 10 this does not add ten and ten
a <- 8 * 10
b <- 2 * 10
d <- a * b
d
## [1] 1600
# This is equivalent to:
d <- 8 * 10 * 2 * 10
d
## [1] 1600
a <- c(3, 4, 5)
a
## [1] 3 4 5
b <- c(3.24, 4.57, 5.03)
b
## [1] 3.24 4.57 5.03
pets <- c("dog", "cat", "bird")
pets
## [1] "dog" "cat" "bird"
# Make objects
a <- sqrt(4 * 7)
b <- 6 * 5
g <- 9 * 2
# Combine
d <- c(a, b, g)
d
## [1]  5.291503 30.000000 18.000000
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(from = 1, to = 20, by = 2)
x
##  [1]  1  3  5  7  9 11 13 15 17 19
# Create a vector
height <- c(76, 72, 74, 74, 78)
height
## [1] 76 72 74 74 78
height[1] # extract the 1st element in the vector
## [1] 76
height[5] # extract the 5th element
## [1] 78
height <- c(76, 72, 74, 74, 78)
height[-1] # drop the 1st element
## [1] 72 74 74 78
# Create a vector with named elements
temp <- c(monday = 28.1, tuesday = 28.5, wednesday = 29.0, thursday = 30.1, friday = 30.2)
temp
##    monday   tuesday wednesday  thursday    friday
##      28.1      28.5      29.0      30.1      30.2
temp["wednesday"]
## wednesday
##        29
temp[3]
## wednesday
##        29
y <- 5:50
y
##  [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## [24] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
y[y <= 10] # extract all elements less than or equal to 10
## [1] 5 6 7 8 9 10
y[y < 10 & y != 5] # extract all elements less than 10 that are not equal to 5
## [1] 6 7 8 9
What are the 9th and 12th positions of the vector seq(1, 27, 0.5)?
Bonus! Can you find those positions simultaneously?
Create the vector c(3:33) and name it a. Extract all elements of a that are greater than or equal to 17.
A function is a stored object that performs a task given some inputs (called arguments). R has many functions already available, but you can also write your own functions.
Try using the tab key while entering arguments in any function to discover a useful feature of RStudio.
Functions are used in the format: name_of_function(inputs)
The output of a function can be saved to an object: output <- name_of_function(inputs)
seq(1, 10, 1)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(from = 1, to = 10, by = 1)
## [1] 1 2 3 4 5 6 7 8 9 10
sum(c(3, 4, 5))
## [1] 12
mean(seq(5, 100, 5))
## [1] 52.5
x <- seq(5, 100, 5)
# use the vector x as the input to the function
mean(x)
## [1] 52.5
help(mean)
?mean # Same as help(mean)
??robust
What is the median of 34, 16, 105, and 27? Remember: functions are often named intuitively.
What does the function range() do, and what is the example in its help file?
Is mean(4, 5) different than mean(c(4, 5))?
We will be exploring functions in much greater detail throughout this course. (Including writing your own functions!)
Functions are kept inside packages, some of which come pre-installed with R. Others must be downloaded.
There are many, many R packages: 7742 when these slides were written, and the number keeps growing!
Check the List of R Packages and search with your favorite keyword, e.g. ecology, paleo, dispersal, population, time series, phylogeny, community, Bayes
# Install a new package
install.packages("picante")
Installing a package downloads it to your computer. You need to download packages before using them.
Additionally, you have to load packages before you use them every time you restart R. This lets R know what packages to load in, and not waste memory by loading all potential functions.
# Two ways to load packages:
library(ggplot2)
require(ggplot2)
Good scripts (and homework) have a series of require() or library() statements at the top of the script.
Some say that you should preferentially use library(). The reasoning is that require() only tries to load a package: if the package is not installed, it returns FALSE with a warning and the script keeps running. So if you use require() at the top of the script and you don't have the package installed, you'll get a failure message later on, when you first use a function from the package. If you use library() at the top of the script instead, you'll get an error right away.
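The difference can be seen with a package that is not installed. This is only a sketch: "notARealPackage123" is a made-up name assumed to be missing from your library, and tryCatch() is used so the example keeps running after the error.

```r
# Hypothetical package name, assumed NOT to be installed
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package is missing,
# so the script carries on
loaded <- suppressWarnings(require(pkg, character.only = TRUE))
loaded

# library() throws an error instead; tryCatch() captures it here
result <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
result
```

Because library() fails loudly at the top of the script, a missing package is noticed immediately rather than halfway through an analysis.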
Search and find an interesting package. What is it? What is one function included in the package?
Install the package to your computer.
Worksheets created by Mike McCann and hosted on his Rpubs site
What is the (sum of 1 through 10) multiplied by two?
What are the 12th and 45th positions of the vector seq(1, 43, 0.25)?
Find all of the elements of the vector seq(1, 10, 0.1) that are less than 2. What is the median value of these elements?
norm <- rnorm(n = 1000, mean = 0, sd = 1)
head(norm)
## [1] -2.0902195660 -0.5281914723 -0.1530587059 -0.4630696300  0.3055771295
## [6]  0.0002270506
hist(norm, col = "palevioletred", main = "", xlab = "", ylab = "Frequency")
Look up the rnorm() function in help screen. Locate the three arguments we used.
Draw 100 random numbers from a normal distribution with a mean of 3 and an sd of 2, assign it to an object a.
Find the mean of your sample. How close was it to the true mean?
What is the 13th element in your vector a?
BEE552 Students: Heather will pick up from here.
Draw 100 random variates from a Poisson distribution with a lambda = 3, assign it to an object a.
Draw 1000 random variates from a Poisson distribution with a lambda = 3, assign it to an object b.
Find the means of both draws. What is the difference in means?
For each distribution, there are four functions which will generate fundamental quantities of a distribution. Let's consider the normal distribution as an example.
pnorm() and qnorm() will be covered during Biometry. They are less commonly used.
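For the normal distribution, the four functions are rnorm(), dnorm(), pnorm(), and qnorm(). The sketch below calls each one once; the seed and the input values are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the random draws are repeatable

# r: random draws from the distribution
draws <- rnorm(n = 5, mean = 0, sd = 1)

# d: density (height of the curve) at a value
density_at_0 <- dnorm(0, mean = 0, sd = 1)   # about 0.399

# p: cumulative probability of being at or below a value
prob_below <- pnorm(1.96, mean = 0, sd = 1)  # about 0.975

# q: quantile, the inverse of p
quantile_975 <- qnorm(0.975, mean = 0, sd = 1)  # about 1.96

draws
density_at_0
prob_below
quantile_975
```

Other distributions follow the same naming pattern, for example rpois(), dpois(), ppois(), and qpois() for the Poisson distribution.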
# Use the R dataset of the lengths of major rivers in North America (a univariate data set)
head(rivers)
## [1] 735 320 325 392 524 450
barplot(height = rivers[1:10],
        col = "paleturquoise",
        main = "Lengths of major rivers in North America",
        ylab = "Length (in miles)")
Histograms are also a common univariate plot. Histograms place data into "bins", and count the number (frequency) of data falling into each bin.
Bins are usually plotted as bars, with the x range on the x axis, and frequency on the y axis.
Histograms are an effective way of visualizing distributions
# Generate and visualize 1000 random points from a standard normal distribution
sample <- rnorm(1000)
hist(sample, col = "palevioletred", main = "", xlab = "", ylab = "Frequency")
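To see the binning itself, hist() can return the bin edges and counts without drawing anything. This is a small sketch with a made-up vector (bin_demo) so the counts can be checked by hand.

```r
# Made-up data, small enough to count by eye
bin_demo <- c(1.2, 1.8, 2.5, 3.1, 3.3, 3.9, 4.4)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(bin_demo, breaks = 1:5, plot = FALSE)

h$breaks  # bin edges: 1 2 3 4 5
h$counts  # values per bin: 2 1 3 1
```

Each count is the number of values falling between two neighbouring edges; these are the bar heights that hist() would draw.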
Draw 10 random variates from a normal distribution and plot a histogram of the sample. Repeat for 100, then 1000 random variates. What do you notice about the histograms?
Explore at least one other distribution; look up ?distributions.
Take a random sample of the new distribution, plot the sample, and share with your neighbor.
Draw 1000 random variates from a normal distribution with a mean of 0 and a sd of 1. Look at ?hist. How do you specify the size of the bin range? Try making bins from -4 to 4, with intervals of 0.01, 0.1, and 1. Hint: Consider using seq() in the "breaks" argument within hist().
In R, it's very easy to take a random sample of elements in any vector with the sample() function.
Take a random sample of 20 elements from a vector of integers between 1 and 100.
x <- 1:100
sample(x, size = 20) # Sample without replacement
## [1] 96 56 10 30 4 99 93 60 49 36 85 65 28 88 2 32 48 64 94 76
sample(x, size = 20, replace = TRUE) # Sample with replacement
## [1] 24 59 59 97 7 78 81 10 68 31 42 50 46 32 87 42 84 57 46 5
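Random samples change every time the code is run. If you need the same sample again (for example, in homework), one common approach is to set the random seed first. A minimal sketch; the seed value 42 is arbitrary.

```r
x <- 1:100

set.seed(42)                 # arbitrary seed
first_draw <- sample(x, size = 20)

set.seed(42)                 # same seed again
second_draw <- sample(x, size = 20)

# The two draws are the same because the seed was reset
identical(first_draw, second_draw)

# Without replacement, no value can appear twice
anyDuplicated(first_draw) == 0
```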
Sample 14 elements from a vector of integers between 100 and 200 without replacement.
Sample 5 letters (using the pre-installed vector letters) from the alphabet with replacement.
Sample 0 or 1 twelve times, with and without replacement. What happens?
sample1 <- rnorm(n = 40, mean = 4, sd = 2)
sample2 <- rnorm(n = 40, mean = 2, sd = 1)
plot(formula = sample2 ~ sample1, pch = 20)
plot(x = sample1, y = sample2, pch = 20)
plot(sample1, sample2)
abline(a = -0.1, b = 1, col = "royalblue4")
Lines can also be model fits
lm() fits a linear relationship between x and y.
plot(sample1, sample2)
abline(lm(sample2 ~ sample1), col = "royalblue4")
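abline() draws the fitted line using the intercept and slope stored in the model. The sketch below pulls those two numbers out with coef(), using made-up data (x_demo, y_demo) that lie exactly on the line y = 1 + 2x so the answer is known in advance.

```r
# Made-up data lying exactly on y = 1 + 2x
x_demo <- 1:10
y_demo <- 2 * x_demo + 1

# Fit the linear model and extract intercept and slope
fit <- lm(y_demo ~ x_demo)
fit_coefs <- coef(fit)
fit_coefs  # intercept close to 1, slope close to 2
```

With real, noisy data the coefficients will not be round numbers, but they are read the same way: the first is the intercept (a in abline) and the second is the slope (b).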
R Base Graphics: An Idiot's Guide
Producing Simple Graphs with R
More on plotting later (i.e. making plots that are visually appealing!)
Download the file for this section
head(trees) # Pre-loaded dataset in R - measurements of black cherry trees
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
Data frames are usually read in from a file, but R comes with many practice datasets. We will use the iris dataset, famously used by R.A. Fisher in 1936
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
How many rows does the iris data frame have?
How many columns? What are the column names?
Using the str() function, how many species are in the data?
What classes are each of the columns?
iris[1, 1] # first row, first column
## [1] 5.1
iris[3, 3] # third row, third column
## [1] 1.3
iris[2, ] # subset the entire second row
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2          4.9           3          1.4         0.2  setosa
iris[, 2] # subset the entire second column
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 ## [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 ## [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 ## [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 ## [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 ## [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 ## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 ## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 ## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
Data frames can be indexed for both rows and columns
Get the 5th, 7th, and 9th rows for the first two columns.
iris[c(5, 7, 9), 1:2]
##   Sepal.Length Sepal.Width
## 5          5.0         3.6
## 7          4.6         3.4
## 9          4.4         2.9
iris[, "Sepal.Length"]
iris$Sepal.Length
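The two commands above return the same vector. A short sketch that checks this, and also shows two related forms; the object names are made up for illustration.

```r
by_name   <- iris[, "Sepal.Length"]   # column by name inside [ , ]
by_dollar <- iris$Sepal.Length        # column with $
by_double <- iris[["Sepal.Length"]]   # column with [[ ]]

# All three are the same numeric vector
identical(by_name, by_dollar)
identical(by_dollar, by_double)

# Single brackets WITHOUT the comma keep the data frame structure
one_col_df <- iris["Sepal.Length"]
class(one_col_df)  # "data.frame"
```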
What is the 9th entry of the Sepal.Width column? Call it x.
Subset the 17th row of the data frame iris.
Return an object with the 1st, 4th and 7th rows of the iris dataframe.
Use the seq() function to subset all odd rows in the iris dataset.
What happens when you use negative numbers to index the iris dataframe? Hint: Use dim() on the original and final objects.
petal <- iris$Petal.Width # Subset of the column petal width
# Make a histogram of petal width frequency
hist(petal, col = "darkseagreen1", main = "Petal width of Iris", xlab = "", ylab = "Frequency")
logi <- petal > 1 # Which petal widths are greater than 1?
head(logi)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
# Subset of iris only including individuals where petal width is greater than 1
iris.subset <- iris[logi, ]
head(iris.subset)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
This is the same as:
iris[iris$Petal.Width > 1, ]
Why is iris[iris > 3, ] a nonsensical command?
What about iris[iris$Sepal.Length > 3]?
Create a histogram of petal lengths for the entire data.
Subset the data for petal lengths greater than two.
Create a histogram of your new data.
iris.subset <- iris[iris$Petal.Length == 4, ]
head(iris.subset)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 54          5.5         2.3            4         1.3 versicolor
## 63          6.0         2.2            4         1.0 versicolor
## 72          6.1         2.8            4         1.3 versicolor
## 90          5.5         2.5            4         1.3 versicolor
## 93          5.8         2.6            4         1.2 versicolor
versicolor_only <- iris[iris$Species == "versicolor", ]
head(versicolor_only)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
# Subset of observations of only I. versicolor where petal length is greater than 4
versicolor.4 <- iris[iris$Petal.Length > 4 & iris$Species == "versicolor", ]
head(versicolor.4)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
## 57          6.3         3.3          4.7         1.6 versicolor
Explain in words each of the following logical statements
iris[1:4, ]
iris[c(1:15), c(1, 3)]
iris[iris$Species == "setosa", "Petal.Width"]
What happens when you add a ! before a logical statement? Hint: Compare iris[iris$Species == "setosa", ] and iris[!iris$Species == "setosa", ].
df <- data.frame(x = 1:5, y = 6:2)
df
##   x y
## 1 1 6
## 2 2 5
## 3 3 4
## 4 4 3
## 5 5 2
df <- data.frame(df, z = 41:45)
df
##   x y  z
## 1 1 6 41
## 2 2 5 42
## 3 3 4 43
## 4 4 3 44
## 5 5 2 45
df$z <- 41:45
df
##   x y  z
## 1 1 6 41
## 2 2 5 42
## 3 3 4 43
## 4 4 3 44
## 5 5 2 45
df.subset <- df[, -3]
df.subset
##   x y
## 1 1 6
## 2 2 5
## 3 3 4
## 4 4 3
## 5 5 2
df$z <- NULL
df
##   x y
## 1 1 6
## 2 2 5
## 3 3 4
## 4 4 3
## 5 5 2
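Matrices are similar to data frames: both are two-dimensional. Unlike a data frame, every element of a matrix must be the same type, and in practice matrices usually consist of numbers.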
Why do you need them? Some functions require a matrix as an input.
a <- matrix(1:9, ncol = 3, nrow = 3)
a
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
a[2, 2]
## [1] 5
a[2, ]
## [1] 2 5 8
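When a function needs a matrix and you have a data frame, as.matrix() does the conversion. Because a matrix holds a single type, mixed columns are all converted to text. A small sketch with a made-up data frame (mixed):

```r
# Made-up data frame with one number column and one text column
mixed <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Converting everything: the numbers become text to match the labels
mixed_matrix <- as.matrix(mixed)
typeof(mixed_matrix)   # "character"

# Converting only the number column keeps it numeric
id_matrix <- as.matrix(mixed["id"])
is.numeric(id_matrix)  # TRUE
dim(id_matrix)         # 3 rows, 1 column
```

This is why it is usually worth selecting only the numeric columns before converting a data frame to a matrix.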
R is not a spreadsheet program, so it's not great for direct data entry. It's best to start with spreadsheets for data entry and storage, and to import spreadsheets into R for data visualization and analysis.
.csv (comma separated values) files are often the preferred format to import into R.
Before we do that, we will need to look at the concept of your working directory.
getwd()
## [1] "/Users/nicolekinlock/Documents/Biometry TA/R Mini Course/slides/Rminicourse"
setwd("/Users/nicolekinlock/Documents/Biometry TA/R Mini Course/slides/")
seedlings <- read.csv("/Users/nicolekinlock/Documents/Biometry TA/R Mini Course/data/seedlings.csv")
head(seedlings)
Average Height # BAD - this won't work
Average.Height # OK - this will work but isn't best practice (See style guide)
average.height # BETTER - this will work, but will be slow to type repeatedly
avg.height # GOOD!
# Write the file
write.csv(iris, file = "iris.csv", row.names = FALSE)
# Check in your working directory for the new file
list.files()
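Writing a file and reading it back is a quick way to check that an export worked. The sketch below does a round trip through a temporary folder so it does not clutter your working directory; the file name iris_demo.csv is made up for this example.

```r
# Build a path inside R's temporary folder (hypothetical file name)
csv_path <- file.path(tempdir(), "iris_demo.csv")

# Write the data frame, then read it back in
write.csv(iris, file = csv_path, row.names = FALSE)
iris_back <- read.csv(csv_path)

# Same shape and same column names as the original
dim(iris_back)    # 150 rows, 5 columns
names(iris_back)
```

file.path() builds the path with the right separator for your operating system, which avoids typing long paths by hand.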
Loops are an important programming tool. The first loop we will learn is a for loop.
For loops run for a certain number of steps (iterations) that you define, during which any statements in the loop are executed.
The basic syntax is:
for (i in 1:number_of_iterations) {
  execute these statements
}
We have a repeated process with identical formatting, but different values.
To avoid laborious typing into R
for (i in 1:5) {
}
for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Be sure you distinguish between:
Brackets [ ] are used to access elements of vectors, matrices, and dataframes.
x <- 1:10
x[6]
## [1] 6
x[6:9]
## [1] 6 7 8 9
Parentheses ( ) are used to specify arguments to functions.
x <- 1:10
sum(x)
## [1] 55
mean(x)
## [1] 5.5
Finally, use curly braces { } to enclose all of the statements to be executed in a loop.
for (i in 1:3) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
You can perform operations on i.
for (i in 1:4) {
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
Assignments can occur in a loop.
x <- 2
for (i in 1:4) {
  x <- x^2
  print(x)
}
## [1] 4
## [1] 16
## [1] 256
## [1] 65536
Note: i is not directly called in the equation.
The operation x <- x^2 will be done four times. x changes with each iteration because of the re-assignment in the loop.
Create a for loop that prints numbers 1 to 100.
Create a for loop that prints numbers 100 to 1.
Create a for loop that adds 1 to numbers 1:5.
Create a for loop that divides all even numbers from 0 to 20 by 10 (consider using seq()).
Bonus! What would be the final value here (without trying it)? Why?
dogs <- 10
for (i in 1:5) {
  dogs <- dogs + 1
}
In the above examples, we used i directly in mathematical operations. It is more common to loop over elements of a vector to accomplish some particular task.
nameVector <- c("Harry", "Hermione", "Ron")
for (i in 1:length(nameVector)) {
  print(paste("Hi,", nameVector[i], sep = " "))
}
## [1] "Hi, Harry"
## [1] "Hi, Hermione"
## [1] "Hi, Ron"
length(nameVector) # The # of positions in nameVector
## [1] 3
nameVector[1] # The 1st position in nameVector
## [1] "Harry"
# Combine text and index of a vector
paste("Hi,", nameVector[1], sep = " ")
## [1] "Hi, Harry"
Results computed inside a loop are not printed automatically, so use print() to see the output of each iteration on your console.
nameVector <- c("Harry", "Hermione", "Ron")
for (i in 1:length(nameVector)) {
  print(paste("Hi,", nameVector[i], sep = " "))
}
## [1] "Hi, Harry"
## [1] "Hi, Hermione"
## [1] "Hi, Ron"
Without print() or an assignment <- results are not returned.
nameVector <- c("Harry", "Hermione", "Ron")
for (i in 1:length(nameVector)) {
  paste("Hi,", nameVector[i], sep = " ")
}
Create a vector of names of people in your row, write them a nice message using a loop.
Explain why the following code is wrong:
for (x in 1:10) { print(sum(i)) }
Lists are another of the 5 basic data structures in R. Lists, like vectors, are 1 dimensional.
The elements of a vector must all be the same type. In this example, 1 and 2 are converted to characters because "blue" is included.
x <- c(1, 2, "blue")
x
## [1] "1" "2" "blue"
typeof(x)
## [1] "character"
a <- list(numeric = seq(0, 0.25, 0.01),
          integer = 10:20,
          logical = c(TRUE, FALSE, FALSE),
          character = c("open", "closed", "closed", "open"))
a
## $numeric
##  [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
## [15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25
##
## $integer
##  [1] 10 11 12 13 14 15 16 17 18 19 20
##
## $logical
## [1]  TRUE FALSE FALSE
##
## $character
## [1] "open"   "closed" "closed" "open"
x <- c(1, 2, 3, 4, 5)
y <- c(3, 6, 8, 10)
v <- c(x, y) # Create a vector
str(v)
##  num [1:9] 1 2 3 4 5 3 6 8 10
l <- list(x, y) # Create a list
str(l)
## List of 2
##  $ : num [1:5] 1 2 3 4 5
##  $ : num [1:4] 3 6 8 10
v[2] # Index a vector
## [1] 2
l[[2]] # Index a list
## [1] 3 6 8 10
Using the single bracket indexing operator [] will return a list
Using a double bracket indexing operator [[]] will return contents within a list (integers, numbers, characters)
b <- list(1:4, c(3.25, 6.75, 8.175, 4.5), c("apples", "bananas", "oranges", "pears"))
b[2] # returns a list
## [[1]]
## [1] 3.250 6.750 8.175 4.500
str(b[2])
## List of 1
##  $ : num [1:4] 3.25 6.75 8.18 4.5
b[[2]] # returns numbers from list
## [1] 3.250 6.750 8.175 4.500
str(b[[2]])
## num [1:4] 3.25 6.75 8.18 4.5
b[1] * b[2] # can't do arithmetic on lists
## Error in b[1] * b[2]: non-numeric argument to binary operator
b[[1]] * b[[2]] # can do arithmetic on numbers
## [1] 3.250 13.500 24.525 18.000
b[[2]] # 2nd element of list
## [1] 3.250 6.750 8.175 4.500
b[[2]][2] # 2nd element of 2nd element of list
## [1] 6.75
outputs <- c() # Create a blank vector
x <- rnorm(n = 10, mean = 1, sd = 0.5) # Vector that we will use in the loop
for (i in 1:length(x)) {
  outputs[i] <- x[i] * 10
}
outputs
##  [1]  6.029596 13.761718 10.922449 14.737241  6.798512  6.913397  1.505247
##  [8]  8.059382  7.972717  7.948766
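A variation on the loop above, offered as a sketch rather than a replacement: it creates the output vector at its final length before the loop, and uses seq_along(), which also behaves sensibly when the input vector is empty (1:length(x) would count 1, 0 in that case). The object names and the seed are arbitrary.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
loop_input <- rnorm(n = 10, mean = 1, sd = 0.5)

# Create the output vector at its final length, filled with zeros
loop_output <- numeric(length(loop_input))

# seq_along(loop_input) gives 1, 2, ..., length(loop_input)
for (i in seq_along(loop_input)) {
  loop_output[i] <- loop_input[i] * 10
}

# For simple arithmetic, R can do the whole vector at once
direct_output <- loop_input * 10

all.equal(loop_output, direct_output)  # TRUE
```

The loop is still worth learning: many tasks (reading files, fitting models) cannot be written as a single arithmetic expression.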
First, create a vector, y, with 10 draws from a standard normal distribution (mean = 0, sd = 1).
Compute y * 2 for 10 iterations (the length of vector y) using a for loop. Place the output in a vector.
Compute y * 2 for 10 iterations using a for loop. Place the output in a list.
Find the 47th element in the output vector from question 1 and the output list from question 2.
What do each of these loops do?
output <- list()
for (x in 1:10) {
  output[1] <- sum(x + x^2)
}
output2 <- list()
for (x in 1:11) {
  output2[x + 1] <- sum(x + x^2)
}
if (3 > 2) { print("Yes") }
## [1] "Yes"
Often, we want for loops to be able to do different things under different conditions. We want the for loop to account for variables, options, and logical statements.
Let's use an if statement:
x <- 1:5
for (i in 1:length(x)) {
  if (x[i] > 3) {
    print(paste(x[i], "is greater than 3"))
  }
  if (x[i] <= 3) {
    print(paste(x[i], "is less than or equal to 3"))
  }
}
## [1] "1 is less than or equal to 3"
## [1] "2 is less than or equal to 3"
## [1] "3 is less than or equal to 3"
## [1] "4 is greater than 3"
## [1] "5 is greater than 3"
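When the two conditions are exact opposites, as they are here, the second if can be written as else. This is a sketch of that alternative; it also stores each message in a vector (messages, a name made up for this example) so the results are available after the loop.

```r
x <- 1:5
messages <- character(length(x))  # empty text vector to fill in

for (i in seq_along(x)) {
  if (x[i] > 3) {
    messages[i] <- paste(x[i], "is greater than 3")
  } else {
    # runs whenever the condition above is FALSE
    messages[i] <- paste(x[i], "is less than or equal to 3")
  }
  print(messages[i])
}
```

Using else means the condition is only written once, so the two branches cannot accidentally drift out of sync.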
for (i in 1:length(x)) {
  if (x[i] > 3) {
    break
  }
  if (x[i] <= 3) {
    print(paste(x[i], "is less than or equal to 3"))
  }
}
## [1] "1 is less than or equal to 3"
## [1] "2 is less than or equal to 3"
## [1] "3 is less than or equal to 3"
Sometimes we don't want to break out of the loop, just skip a troublesome element that we know will cause an error.
We can continue within a loop based on an if and next statement. Here we want to skip 4.
for (i in 1:length(x)) {
  if (x[i] == 4) {
    next
  }
  if (x[i] > 3) {
    print(paste(x[i], "is greater than 3"))
  }
  if (x[i] <= 3) {
    print(paste(x[i], "is less than or equal to 3"))
  }
}
## [1] "1 is less than or equal to 3"
## [1] "2 is less than or equal to 3"
## [1] "3 is less than or equal to 3"
## [1] "5 is greater than 3"
First, create a vector, x, with all integers between 1 and 100.
Create a for loop that computes x[i] * 2 for 100 iterations (the length of vector x). Place the output in a vector. However, calculate x[i] * 3 when x[i] = 32.
Create a for loop that computes x[i] * 2 for 100 iterations. Place the output in a list. However, break the loop after 51 iterations.
Create a for loop that computes x[i] * 2 for 100 iterations. Place the output in a vector. However, skip x = 71, 74.
The apply family of functions allows you to apply a function to whole rows, columns, or list elements in one call. This style is often loosely called vectorization. It can replace for loops (and is usually shorter to write and often faster).
M <- matrix(rnorm(9), ncol = 3, nrow = 3)
M
##            [,1]       [,2]        [,3]
## [1,]  0.1494313  0.8779963  0.25422087
## [2,] -0.8206744  1.4368966 -1.01193249
## [3,] -1.8166418 -0.4964814  0.06765827
apply(M, 1, median) # median of rows
## [1] 0.2542209 -0.8206744 -0.4964814
apply(M, 2, median) # median of columns
## [1] -0.82067442 0.87799633 0.06765827
# The same column medians, computed with a for loop
med.M <- c()
for (i in 1:3) {
  med.M[i] <- median(M[, i])
}
med.M
## [1] -0.82067442 0.87799633 0.06765827
mylist <- list(1:3, c(1.4, 4.6, 4.2, 2.1), rnorm(4)) # make a list
mylist
sum(mylist) # returns an error
lapply(mylist, sum) # applies function to each element in the list
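lapply() always returns a list. When each result is a single number, sapply() is a convenient relative that simplifies the result to a vector. A small sketch with a made-up list (number_list) whose sums are easy to check by hand:

```r
number_list <- list(1:3, c(1.4, 4.6, 4.2, 2.1))

# lapply() returns a list with one element per input element
sums_as_list <- lapply(number_list, sum)
class(sums_as_list)    # "list"

# sapply() simplifies the same result to a numeric vector
sums_as_vector <- sapply(number_list, sum)
sums_as_vector         # 6.0 12.3
```

A vector is usually easier to work with afterwards, for example as input to mean() or plot().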
Functions contain sets of instructions that we want to carry out repeatedly
We have already seen many of the basic functions that come pre-installed with R.
sum(seq(1, 100, 1))
abs(-100 + 50)
dim(iris)
str(iris)
colnames(iris)
# This will not run. tree is not loaded
tree(formula = Species ~ . -Species, data = iris)
install.packages("tree") # Install package
library(tree) # Load package
# Now it should work
tree(formula = Species ~ . -Species, data = iris)
It is also possible to define your own functions. This is especially important if you are going to write the same lines of code over and over again.
function_name <- function(arguments) {body}
# define a function, f
f <- function(x, y) {
  x + y
}
# call function f
f(x = 1, y = 3)
## [1] 4
Variables defined inside functions exist in a different environment than the global environment, i.e. they don't exist outside of the function.
However, if a variable is not defined inside a function, the function will look one level above.
If you run this function, you'll see that y does not pop up in your environment panel.
x <- 2 # variable defined outside the function
g <- function() {
  y <- 1 # variable defined inside the function
  c(x, y)
}
g()
## [1] 2 1
f1 <- function(a, b) {
  x <- a + b
  y <- (a + b)^2
  z <- a/b
}
f1(1, 2) # prints nothing: the last line is an assignment, so its value is returned invisibly
f2 <- function(a, b) {
  x <- a + b
  y <- (a + b)^2
  z <- a/b
  c(x, y, z) # same result as writing return(c(x, y, z))
}
f2(1, 2) # returns x, y, and z
## [1] 3.0 9.0 0.5
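A closer look at the first function, as a sketch: a function whose last line is an assignment still hands back a value, it just does not print it. Saving the call to an object (captured, a name made up here) reveals it.

```r
f1 <- function(a, b) {
  x <- a + b
  y <- (a + b)^2
  z <- a / b   # last line is an assignment
}

f1(1, 2)              # nothing appears on the console

captured <- f1(1, 2)  # but the value can still be saved
captured              # 0.5, the value of z
```

Relying on this is easy to get wrong, so ending a function with the object you want back (or with return()) makes the intent much clearer.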
Create a function that takes in two arguments, x and y, and computes x * 2 * y.
Create a function that takes in three arguments, and makes a vector from the result.
Create a function that counts the number of matching elements in two separate vectors. Hint: use %in% to create a logical statement.
params <- c(5, 25) params
## [1] 5 25
f3 <- function(p) {
  alpha <- p[1]
  beta <- p[2]
  alpha * beta
}
f3(params)
## [1] 125
subtract <- function(a = 5, b = 2) {
  return(a - b)
}
subtract()
## [1] 3
subtract(5, 6)
## [1] -1
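Default values can also be overridden one at a time, and named arguments can be given in any order. A short sketch using the same subtract() function (redefined here so the block runs on its own); the result names are made up for illustration.

```r
subtract <- function(a = 5, b = 2) {
  return(a - b)
}

# Override only b; a keeps its default of 5
only_b <- subtract(b = 1)
only_b      # 4

# Named arguments can be given in any order
swapped <- subtract(b = 6, a = 5)
swapped     # -1

# An unnamed argument is matched by position, so 10 becomes a
only_a <- subtract(10)
only_a      # 8
```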
Write a function that takes a vector as an argument and multiplies the sum of the vector by 10. Return a logical statement based on whether the sum is under 1000.
Write a function that calculates the mean of every column in a dataframe. Code the function so that it does not evaluate the column mean if the column elements are not numbers, using class(x) != "numeric". Try your function on the iris dataset.
f4 <- function(x) {
  for (i in 1:ncol(x)) {
    if (class(x[, i]) != "numeric") {
      next
    }
    if (class(x[, i]) == "numeric") {
      print(mean(x[, i]))
    }
  }
}
f4(iris)
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333
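The same task can be sketched without a loop by combining sapply() with is.numeric(). This is an alternative to the exercise solution above, not a replacement for it. One difference to be aware of: is.numeric() is also TRUE for integer columns, whereas class(x) == "numeric" is not.

```r
# TRUE for each column that holds numbers, FALSE otherwise
numeric_cols <- sapply(iris, is.numeric)
numeric_cols

# Keep only those columns, then take the mean of each
col_means <- sapply(iris[numeric_cols], mean)
col_means
```

The result is a named vector, so each mean is labelled with its column name.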
ggplot2 is not a base package, so we need to install it.
How would you install the package ggplot2?
What is the next step before you can use a function from ggplot2?
install.packages("ggplot2")
library(ggplot2)
plot(x = iris$Species, y = iris$Sepal.Length)
library(ggplot2)
ggplot(data = iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  theme_bw()
gg stands for "grammar of graphics"
Uses a set of terms that defines the basic components of (every) plot
Produce figures using coherent, consistent syntax (very similar code for very different figures)
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
Useful for exploring data in different ways
I'll be using variables from the state.x77 and state.region datasets (US state data from 1977)
state <- data.frame(state.x77, state.region, state.abb)
stateplot <- ggplot(data = state, aes(x = Population, y = Area))
stateplot + geom_point()
Increase the size of points
ggplot(data = state, aes(x = Population, y = Area)) + geom_point(size = 3)
ggplot(data = state, aes(x = Illiteracy, y = Murder, color = state.region)) + geom_point(size = 2)
ggplot(data = state, aes(x = Life.Exp, y = HS.Grad, color = state.region, shape = state.region)) + geom_point(size = 2)
d2 <- diamonds[sample(1:nrow(diamonds), 1000), ]
Type geom_ and hit tab to see them all. Then, use ?geom_nameofgeom to see the help screen.
ggplot(iris, aes(x = Species,y = Sepal.Length)) + geom_boxplot()
ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 8, color = "black", fill = "paleturquoise")
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_bar(stat = "identity")
ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) + geom_line()
ggplot(faithful, aes(waiting)) + geom_density(fill = "thistle")
Look up geom_histogram. What does it do?
Make a histogram of Sepal.Length from the iris data set. What did it do with the different species?
Plots can also have facets, which divide a plot into subplots based on some discrete variable (here, species).
ggplot(iris, aes(Sepal.Length)) + geom_histogram() + facet_grid(Species ~ .)
Change to facet_grid(. ~ Species) and get one row, three columns.
ggplot(iris, aes(Sepal.Length)) + geom_histogram() + facet_grid(. ~ Species)
Type stat_ and hit tab to see them all. Then, use ?stat_nameofstat to see the help screen.
Use stat_smooth to add a linear fit
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point() + stat_smooth(method = "lm")
scales are used to modify axes and colors
For example:
ggplot(data = state, aes(x = Population, y = Area, color = state.region)) + scale_y_log10() + scale_x_log10() + geom_point()
ggplot(data = state, aes(x = Population, y = Area, color = state.region)) + scale_y_log10() + scale_x_log10() + geom_point() + scale_colour_manual(values = c("palegreen3","palevioletred3","peachpuff3", "paleturquoise3"))
ggplot(data = state, aes(x = Life.Exp, y = HS.Grad, color = state.region, shape = state.region)) + geom_point(size = 2) + labs(title = "US States, 1977", x = "Life expectancy (in years)", y = "High school graduation rate (percent)")
ggplot(data = state, aes(x = Life.Exp, y = HS.Grad, color = state.region, shape = state.region)) + geom_point(size = 2) + labs(title = "US States, 1977", x = "Life expectancy (in years)", y = "High school graduation rate (percent)") + geom_text(aes(label = state.abb, size = 2, hjust = 0, vjust = 0))
Control over the figure as a whole can be done by modifying themes. See ?theme for all of the options
I prefer + theme_bw() (white background with gridlines) or + theme_classic() (white background without gridlines)
You can also change the overall font family and font size
ggplot(iris, aes(Species, Sepal.Length)) + geom_bar(stat = "identity") + theme_bw()