Susan Holmes
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
## [1] "a" "b" "c" "d" "e" "f"
## [1] 1 2 3 4 5 6
## [1] 2 3 7 8
We have already seen examples of vectors that we created using the c() function, which combines values:
## [1] "numeric"
## [1] TRUE
How many elements do you think there are in c(fib,fib)?
We can also combine elements in the middle:
## [1] 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
## [18] 5 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
## [35] 987
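The vector fib is used here without its definition being shown. A minimal sketch, assuming fib holds the first 17 Fibonacci numbers (which is what the printed output suggests), builds it with c() and then places a value between two copies:

```r
# Assumption: fib holds the first 17 Fibonacci numbers, as the output suggests
fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987)

# c() flattens its arguments into one vector, so vectors and single
# values can be mixed freely
middle <- c(fib, 5, fib)

length(fib)     # 17
length(middle)  # 35: 17 + 1 + 17
middle[18]      # 5, the value inserted between the two copies
```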
Let’s try some operations on fib:
## [1] 10 11 11 12 13 15 18 23 31 44 65 99 154 243 387 620 997
## [1] 0 10 10 20 30 50 80 130 210 340 550 890 1440 2330
## [15] 3770 6100 9870
Try these commands out:
## [1] 1 11 101 1002 10003 6 18 113 1021 10034 56
## [12] 99 244 1233 10377 611 997 105 1000 10001 2 12
## [23] 103 1005 10008 14 31 134 1055 10089 145 243 477
## [34] 1610 10987
## Warning in fib + c(1, 100): longer object length is not a multiple of
## shorter object length
## [1] 1 101 2 102 4 105 9 113 22 134 56 189 145 333 378 710 988
Question: Can you explain how R is trying to add vectors of different lengths?
Answer: To perform an operation on vectors of unequal length, R recycles the values of the shorter vector until it matches the length of the longer one. This can cause confusion, because R goes ahead and computes a result even when the unequal lengths are really the result of a typing error.
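The recycling rule can be written out by hand. The sketch below assumes fib holds the first 17 Fibonacci numbers; the names short, by_hand and recycled are only illustrative. It repeats the short vector with rep() and checks that this matches what R does on its own:

```r
# Assumption: fib holds the first 17 Fibonacci numbers
fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987)

# A single number is a vector of length 1, recycled across all of fib
fib + 10

# Writing out the recycling by hand: repeat the short vector until it
# has the same length as the long one
short <- c(1, 100)
by_hand <- fib + rep(short, length.out = length(fib))

# R does the same thing itself, but warns because 17 is not a multiple of 2
recycled <- suppressWarnings(fib + short)
identical(by_hand, recycled)  # TRUE
```

When the longer length is an exact multiple of the shorter one, R recycles without any warning, which is why such mistakes can go unnoticed.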
Question: Try out different operations on vectors: “/”, “+”, “^2”, “log”, “exp”, “cos”, …
We saw that the number [1] appears as the first index of the vector on the left. We use the indexing to reach certain elements of the vector. R indices start at 1.
## [1] 0
## [1] 2
## [1] 1 2 3
We can access only certain elements given by indices in their own vector, for instance c(1,3,5)
## [1] 0 1 3
A negative index means take out that value from the vector:
## [1] 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
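The commands behind these printed results are not shown. The following sketch, again assuming fib holds the first 17 Fibonacci numbers, uses index expressions that are consistent with them:

```r
# Assumption: fib holds the first 17 Fibonacci numbers
fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987)

fib[1]            # first element: 0
fib[3:5]          # a run of consecutive elements: 1 2 3
fib[c(1, 3, 5)]   # elements picked by a vector of indices: 0 1 3
fib[-2]           # everything except the second element
length(fib[-2])   # 16: one element fewer than fib
fib[-c(1, 2)]     # several elements can be dropped at once
```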
## [1] 29
## [1] 9 13 10 23 14 30 22 25 20 7
## [1] 15 12 26 21 28 27 24 5 11 4
sample takes a random subset of the input, here the vector vec4.
Question: Why do the two calls to the same function with the same input and arguments give two different answers?
R was created so we can easily manipulate, summarize and visualize data. The first structure that allows us to group together several measurements on the same people/animals/samples is the matrix.
(Here we have interjected a few comments in the code using the # character.) We create a matrix with the matrix function.
A <- matrix(
c(4,2,0,3,1,7,2,8,4,5), # the data elements
nrow=2, # number of rows
ncol=5) # number of columns
A
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 0 1 2 4
## [2,] 2 3 7 8 5
You can check what type of object A is by asking R:
## [1] "matrix"
## [1] "numeric"
## [1] TRUE
## [1] FALSE
## [1] TRUE
All entries of a matrix must have the same mode.
Q: A vector also has to have homogeneous entries, although this is not always obvious. Try typing c(letters[3],4,6,letters[7]). What do you notice?
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "a" "f" "k" "p" "u" "z" "e" "j" "o" "t" "y" "d" "i"
## [2,] "b" "g" "l" "q" "v" "a" "f" "k" "p" "u" "z" "e" "j"
## [3,] "c" "h" "m" "r" "w" "b" "g" "l" "q" "v" "a" "f" "k"
## [4,] "d" "i" "n" "s" "x" "c" "h" "m" "r" "w" "b" "g" "l"
## [5,] "e" "j" "o" "t" "y" "d" "i" "n" "s" "x" "c" "h" "m"
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] "n" "s" "x" "c" "h" "m" "r" "w" "b" "g" "l"
## [2,] "o" "t" "y" "d" "i" "n" "s" "x" "c" "h" "m"
## [3,] "p" "u" "z" "e" "j" "o" "t" "y" "d" "i" "n"
## [4,] "q" "v" "a" "f" "k" "p" "u" "z" "e" "j" "o"
## [5,] "r" "w" "b" "g" "l" "q" "v" "a" "f" "k" "p"
## [,25] [,26]
## [1,] "q" "v"
## [2,] "r" "w"
## [3,] "s" "x"
## [4,] "t" "y"
## [5,] "u" "z"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [2,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [3,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [4,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [5,] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [2,] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [3,] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [4,] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [5,] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [,25] [,26]
## [1,] "y" "z"
## [2,] "y" "z"
## [3,] "y" "z"
## [4,] "y" "z"
## [5,] "y" "z"
## [1] 5 26
## [1] 5
## [1] 26
You see that by default the function matrix takes a vector and fills in the data column by column. To change that, you have to supply the special argument byrow=TRUE.
Now is a good time to revisit the help to understand how to read the default arguments.
Q: Where do you see the default value of the argument byrow?
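A smaller example than the letters matrix makes the two filling orders easier to compare. This is only a sketch; the names vals, by_col and by_row are illustrative:

```r
vals <- 1:6

# Default: values fill the matrix one column at a time
by_col <- matrix(vals, nrow = 2, ncol = 3)
by_col[1, ]   # 1 3 5

# byrow = TRUE: values fill the matrix one row at a time
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)
by_row[1, ]   # 1 2 3

# Filling by row is the transpose of filling a 3 x 2 matrix by column
identical(by_row, t(matrix(vals, nrow = 3, ncol = 2)))  # TRUE
```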
Matrices are sometimes called two dimensional arrays.
The rows of a matrix are the first index, the columns are the second.
Now suppose we want to replace the fourth column of A by two 1’s.
First take a look at the current values:
## [1] 2 8
Strangely, the column has become a row? This is because extracting a single column returns a plain vector, and R prints all vectors as rows to save space.
Now we replace the values by 1:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 0 1 1 4
## [2,] 2 3 7 1 5
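The two steps above can be sketched as follows, starting again from the original matrix A. The drop = FALSE argument is an addition not used in the text; it is shown only to illustrate how to keep the extracted column as a matrix:

```r
A <- matrix(c(4, 2, 0, 3, 1, 7, 2, 8, 4, 5), nrow = 2, ncol = 5)

A[, 4]                      # the fourth column, returned as a plain vector: 2 8
A[, 4, drop = FALSE]        # drop = FALSE keeps it as a 2 x 1 matrix
dim(A[, 4, drop = FALSE])   # 2 1

A[, 4] <- c(1, 1)           # replace the whole column
A[, 4]                      # 1 1
```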
Transposition
## [,1] [,2]
## [1,] 4 2
## [2,] 0 3
## [3,] 1 7
## [4,] 1 1
## [5,] 4 5
The transpose of the matrix A has the number of rows equal to the number of columns of A, so what do you think dim(t(A)) will be?
In real situations matrices represent data where the rows are the observations and the columns are the variables.
observNames <- c("H1","H3","H4","H5","H7","H8","H9")
vecHapl <- c(14,12,4,12,3,10,8,10,1,4,15,13,0,1,1,15,13,4,13,3,9,8,10,1,4,13,12,0,1,1,
15,11,5,11,3,10,8,10,1,4,11,14,0,1,1,17,13,4,11,3,10,7,10,1,4,14,12,0,1,1,
13,12,5,12,3,11,8,11,1,4,14,14,0,1,1,16,11,5,12,3,10,8,10,1,4,11,15,0,1,1,
16,11,5,11,3,10,8,10,1,4,11,14,0,1,1)
matHap <- matrix(vecHapl,nrow=7,byrow=TRUE)
rownames(matHap) <- observNames
Suppose I also want to name the columns and type:
POSnames <- c(DYS19,DXYS156Y,DYS389m,DYS389n,DYS389p,DYS389q,DYS390m,
DYS390n,DYS390p,DYS390q,DYS392,DYS393,YAPbcbc,SRY1532bb,92R7bb)
Question: Why does this create an error?
POSnames <- c("DYS19","DXYS156Y","DYS389m","DYS389n","DYS389p",
"DYS389q","DYS390m","DYS390n","DYS390p","DYS390q",
"DYS392","DYS393","YAPbcbc","SRY1532bb","92R7bb")
colnames(matHap) <- POSnames
matHap
## DYS19 DXYS156Y DYS389m DYS389n DYS389p DYS389q DYS390m DYS390n DYS390p
## H1 14 12 4 12 3 10 8 10 1
## H3 15 13 4 13 3 9 8 10 1
## H4 15 11 5 11 3 10 8 10 1
## H5 17 13 4 11 3 10 7 10 1
## H7 13 12 5 12 3 11 8 11 1
## H8 16 11 5 12 3 10 8 10 1
## H9 16 11 5 11 3 10 8 10 1
## DYS390q DYS392 DYS393 YAPbcbc SRY1532bb 92R7bb
## H1 4 15 13 0 1 1
## H3 4 13 12 0 1 1
## H4 4 11 14 0 1 1
## H5 4 14 12 0 1 1
## H7 4 14 14 0 1 1
## H8 4 11 15 0 1 1
## H9 4 11 14 0 1 1
Each row of matHap corresponds to a person, whose ID starts with ‘H’, and each column represents a special position on the Y chromosome where repeats can occur. The entries are the numbers of repeats, so they are integers.
By default plot makes a scatter plot of the first two columns of matHap.
We can save our data for later use in various ways:
As an R object:
As a plain text file:
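The commands for these two options are not shown. One standard choice, given here as a sketch rather than as the original code, is save() for the R object and write.table() for the text file. The matrix m and the file names are hypothetical stand-ins, and the files are written to a temporary directory:

```r
# A small stand-in matrix; the same calls apply to matHap
m <- matrix(1:6, nrow = 2, dimnames = list(c("H1", "H3"), c("a", "b", "c")))

# Hypothetical file names, written to a temporary directory
rdata_file <- file.path(tempdir(), "m.RData")
text_file  <- file.path(tempdir(), "m.txt")

save(m, file = rdata_file)        # as an R object (binary)
write.table(m, file = text_file)  # as a plain text file

# Reading the text file back gives a data.frame, converted here to a matrix
m_back <- as.matrix(read.table(text_file))
all(m_back == m)  # TRUE
```

The text file can be opened in any editor, whereas the .RData file can only be read back into R.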
You can’t read an .RData file directly because it is not a text file but a compressed binary version of the information, so it is not human-readable. Later, however, we will be able to load the data just by typing
load("matHap.RData")
Question: Why do we need quotes within the brackets here?
When you run a pipeline or analysis, it is convenient to save objects such as matrices or data frames. By doing so, you won’t have to rerun your analysis every time you want to view the object in question. To save a single object, use the function saveRDS; the object can then be read back with the readRDS function. This pair of functions is an alternative to save and load. Their advantage is that the user can give a new name to the saved object when loading it.
mat <- matrix(sample(0:1, 12, replace=TRUE),3,4) # A 3 by 4 matrix containing 0s and 1s
saveRDS(mat, "Neo") # Save mat with Neo as a file name
the.matrix <- readRDS("Neo") # Load the file Neo but it will no longer be named mat but the.matrix in your environment
identical(mat, the.matrix) # Checks if mat and the.matrix are identical
A completely heterogeneous set of objects of different types and different sizes can be combined into a list.
We cannot use the c() function for this, but there is an equivalent: the list() function, which creates an ordered collection of components.
A <- matrix( c(4,2,0,3,1,7,2,8,4,5), nrow=2,ncol=5)
Atypical <- list(name="Susan", byDate=1125, Amatrix=A, size=5.5, urban=FALSE)
Atypical
## $name
## [1] "Susan"
##
## $byDate
## [1] 1125
##
## $Amatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 0 1 2 4
## [2,] 2 3 7 8 5
##
## $size
## [1] 5.5
##
## $urban
## [1] FALSE
We can access the elements of the list by their position in the order of the elements:
## [1] "Susan"
## [1] "numeric"
## [1] TRUE
## [1] FALSE
We can also access the list elements by their label using the $ sign:
## [1] 1125
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 0 1 2 4
## [2,] 2 3 7 8 5
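The two ways of reaching a component can be put side by side. This sketch rebuilds the list Atypical from above; the comparison of single and double brackets is an addition to what the text covers:

```r
A <- matrix(c(4, 2, 0, 3, 1, 7, 2, 8, 4, 5), nrow = 2, ncol = 5)
Atypical <- list(name = "Susan", byDate = 1125, Amatrix = A, size = 5.5, urban = FALSE)

Atypical[[1]]          # the first component itself: "Susan"
Atypical$name          # the same component, reached by its label
Atypical[["name"]]     # double brackets also accept a label

# Single brackets return a shorter list, not the component
class(Atypical[1])     # "list"
class(Atypical[[1]])   # "character"

is.matrix(Atypical$Amatrix)  # TRUE
```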
We can ask to see information about a list using either summary or str:
## Length Class Mode
## name 1 -none- character
## byDate 1 -none- numeric
## Amatrix 10 -none- numeric
## size 1 -none- numeric
## urban 1 -none- logical
## List of 5
## $ name : chr "Susan"
## $ byDate : num 1125
## $ Amatrix: num [1:2, 1:5] 4 2 0 3 1 7 2 8 4 5
## $ size : num 5.5
## $ urban : logi FALSE
We see that summary describes Atypical$Amatrix with a length equal to 10, the number of entries in the 2 by 5 matrix.
Lists are often used as the output of a complicated function that returns parameters and results of different dimensions. Most lists have names for their components; otherwise we can access the components in order with double square brackets:
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
## (Intercept) height
## -87.51667 3.45000
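The call that produced this output is not shown. The component names and the height coefficient are consistent with a linear model of weight on height fitted to the women data from the datasets package, so the sketch below assumes that model; the name fit is illustrative:

```r
# Assumption: the output comes from regressing weight on height in `women`
fit <- lm(weight ~ height, data = women)

is.list(fit)           # TRUE: a fitted model is stored as a list
names(fit)             # the labels of its components
fit$coefficients       # access by label
fit[[1]]               # access by position; coefficients come first here
```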
Some variables are encoded as numbers when in fact these numbers are themselves meaningless codes.
For instance, we might have a class of 13 male and 11 female students. We could code this using the rep function, which repeats a value a certain number of times:
## studentg
## 1 2
## 13 11
A better solution is to encode the variable gender as a factor.
## [1] M M M M M M M M M M M M M F F F F F F F F F F F
## Levels: F M
## [1] "factor"
## gender
## F M
## 11 13
## [1] "F" "M"
## Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
## [1] 24
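The commands behind these results are not shown. A sketch consistent with the printed output, using the object names studentg and gender that appear in it, could look like this:

```r
# Numeric codes: R treats them as numbers, so meaningless summaries are possible
studentg <- c(rep(1, 13), rep(2, 11))
mean(studentg)   # runs, but the result has no interpretation

# The same information as a factor
gender <- factor(c(rep("M", 13), rep("F", 11)))

levels(gender)       # "F" "M": the levels are sorted
table(gender)        # counts per level
as.integer(gender)   # the underlying integer codes (M is 2, F is 1)
```

The factor keeps the labels attached to the data, so tables and plots show F and M rather than arbitrary numbers.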
Let’s start with an example using the datasets already available in R.
In the UScereal data from the MASS package, the maker is represented by its first initial: G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina.
## mfr calories protein fat sodium
## 100% Bran N 212.1212 12.121212 3.030303 393.9394
## All-Bran K 212.1212 12.121212 3.030303 787.8788
## All-Bran with Extra Fiber K 100.0000 8.000000 0.000000 280.0000
## Apple Cinnamon Cheerios G 146.6667 2.666667 2.666667 240.0000
##
## G K N P Q R
## 22 21 3 9 5 5
## G K N P Q R
## 22 21 3 9 5 5
So it is rectangular like a matrix, but some variables are numeric and others are factors.
The first variable is an example of a factor variable. Because variables of different classes often exist together as information on the same observations, R needs a richer data structure than vectors, arrays and matrices.
A data.frame is a list that contains many variables, which can be thought of as the columns of a table. We won’t call this a matrix because the columns can have different types. Let’s revisit the UScereal data.
We can access the 11th variable as we would in a matrix. This is a factor variable, so a good summary is to tabulate it with the table function:
##
## 100% enriched none
## 5 57 3
But we could also have used the variable name:
##
## 100% enriched none
## 5 57 3
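The two tables above can be reproduced along these lines. The sketch assumes the MASS package is installed (it ships with most R distributions); the names by_position, by_name and by_bracket are illustrative:

```r
# UScereal ships with the MASS package (assumed to be installed)
data(UScereal, package = "MASS")

by_position <- table(UScereal[, 11])          # matrix-style indexing
by_name     <- table(UScereal$vitamins)       # list-style access by name
by_bracket  <- table(UScereal[["vitamins"]])  # double brackets with a label

all(by_position == by_name)   # TRUE
is.factor(UScereal$vitamins)  # TRUE
```

Access by name is usually the safer habit, since it keeps working if the order of the columns changes.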
## [1] "data.frame"
## 'data.frame': 65 obs. of 11 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
## mfr calories protein fat sodium
## G:22 Min. : 50.0 Min. : 0.7519 Min. :0.000 Min. : 0.0
## K:21 1st Qu.:110.0 1st Qu.: 2.0000 1st Qu.:0.000 1st Qu.:180.0
## N: 3 Median :134.3 Median : 3.0000 Median :1.000 Median :232.0
## P: 9 Mean :149.4 Mean : 3.6837 Mean :1.423 Mean :237.8
## Q: 5 3rd Qu.:179.1 3rd Qu.: 4.4776 3rd Qu.:2.000 3rd Qu.:290.0
## R: 5 Max. :440.0 Max. :12.1212 Max. :9.091 Max. :787.9
## fibre carbo sugars shelf
## Min. : 0.000 Min. :10.53 Min. : 0.00 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:15.00 1st Qu.: 4.00 1st Qu.:1.000
## Median : 2.000 Median :18.67 Median :12.00 Median :2.000
## Mean : 3.871 Mean :19.97 Mean :10.05 Mean :2.169
## 3rd Qu.: 4.478 3rd Qu.:22.39 3rd Qu.:14.00 3rd Qu.:3.000
## Max. :30.303 Max. :68.00 Max. :20.90 Max. :3.000
## potassium vitamins
## Min. : 15.00 100% : 5
## 1st Qu.: 45.00 enriched:57
## Median : 96.59 none : 3
## Mean :159.12
## 3rd Qu.:220.00
## Max. :969.70
Question: What has the function summary shown about the data UScereal? The data.frame structure or class is a list, so we can access the variables using their names as well as the order in which they appear in the data.
Question: Which of the variables in the UScereal data.frame are factors?
## [1] 65 11
Question: We can use the function dim on a data.frame to find out how many variables were measured on how many observations. What other data type can we use dim on?
If we only want to look at the top few observations, we can use the function head:
## mfr calories protein fat sodium
## 100% Bran N 212.1212 12.121212 3.030303 393.9394
## All-Bran K 212.1212 12.121212 3.030303 787.8788
## All-Bran with Extra Fiber K 100.0000 8.000000 0.000000 280.0000
## Apple Cinnamon Cheerios G 146.6667 2.666667 2.666667 240.0000
## Apple Jacks K 110.0000 2.000000 0.000000 125.0000
## Basic 4 G 173.3333 4.000000 2.666667 280.0000
## fibre carbo sugars shelf potassium
## 100% Bran 30.303030 15.15152 18.18182 3 848.48485
## All-Bran 27.272727 21.21212 15.15151 3 969.69697
## All-Bran with Extra Fiber 28.000000 16.00000 0.00000 3 660.00000
## Apple Cinnamon Cheerios 2.000000 14.00000 13.33333 1 93.33333
## Apple Jacks 1.000000 11.00000 14.00000 2 30.00000
## Basic 4 2.666667 24.00000 10.66667 3 133.33333
## vitamins
## 100% Bran enriched
## All-Bran enriched
## All-Bran with Extra Fiber enriched
## Apple Cinnamon Cheerios enriched
## Apple Jacks enriched
## Basic 4 enriched
Here is an example of data from the datasets package:
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
## [1] "data.frame"
## [1] 15 2
We can change the data class.
Sometimes we may need to go back to matrices. This can be quite easy, as is the reverse:
## [1] "matrix"
## [1] "data.frame"
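The conversion calls are not shown. A sketch using as.matrix() and as.data.frame() on the women data might look like the following; the names wm and wd are illustrative. It uses is.matrix() rather than class(), because from R 4.0.0 onwards class() of a matrix returns both "matrix" and "array":

```r
# `women` is a small data.frame from the datasets package
wm <- as.matrix(women)       # data.frame -> matrix
is.matrix(wm)                # TRUE
mode(wm)                     # "numeric": both columns are numbers

wd <- as.data.frame(wm)      # matrix -> data.frame
is.data.frame(wd)            # TRUE
dim(wd)                      # 15 2, unchanged by the round trip
```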
But sometimes it can give surprising results:
Question: Try typing
as.matrix(UScereal)
and explain what you see.
Summary of what we covered: creating vectors with c() and with :, and how some operators work; how the plot function understands what to do with a matrix; accessing list components with $label addresses; and the factor class.
Question: The dataset mtcars is very popular as an example in R tutorials available online. Look at the data frame and find out whether it has any variables that are factors.