Data types

Susan Holmes

R basic data types

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
letters[1:6]
## [1] "a" "b" "c" "d" "e" "f"
1:6
## [1] 1 2 3 4 5 6
c(2,3,7,8)
## [1] 2 3 7 8

Basic data structures: vectors

We have already seen examples of vectors created with the c() function, which combines values:

fib <- c(0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987)
class(fib)
## [1] "numeric"
is.vector(fib)
## [1] TRUE

Question: How many elements do you think there are in c(fib,fib)?
We can also combine elements in the middle:

fib5fib <- c(fib,5,fib)
fib5fib
##  [1]   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377 610 987
## [18]   5   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377 610
## [35] 987

Let’s try some operations on fib:

fib+10
##  [1]  10  11  11  12  13  15  18  23  31  44  65  99 154 243 387 620 997
fib*10
##  [1]    0   10   10   20   30   50   80  130  210  340  550  890 1440 2330
## [15] 3770 6100 9870

Try these commands out:

fib5fib+c(1,10,100,1000,10000)
##  [1]     1    11   101  1002 10003     6    18   113  1021 10034    56
## [12]    99   244  1233 10377   611   997   105  1000 10001     2    12
## [23]   103  1005 10008    14    31   134  1055 10089   145   243   477
## [34]  1610 10987
fib+c(1,100)
## Warning in fib + c(1, 100): longer object length is not a multiple of
## shorter object length
##  [1]   1 101   2 102   4 105   9 113  22 134  56 189 145 333 378 710 988

Question: Can you explain how R is trying to add vectors of different lengths?

Answer: To carry out an operation on vectors of unequal lengths, R recycles the values of the shorter vector until it is as long as the longer one. This can cause confusion, because R goes ahead and computes a result even when the length mismatch is really a typing error.
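The recycling rule can be made explicit with rep_len(), which repeats a vector until it reaches a requested length. The following is a small sketch for illustration only; the names short, recycled, manual and automatic are not part of the original notes.

```r
fib <- c(0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987)
short <- c(1, 100)

# Repeat the shorter vector until it is as long as fib (17 elements)
recycled <- rep_len(short, length(fib))

# Adding the explicitly recycled vector ...
manual <- fib + recycled
# ... gives the same values as letting R recycle (the warning is silenced here)
automatic <- suppressWarnings(fib + short)

identical(manual, automatic)
```

The warning appears only because 17 is not a multiple of 2; when the longer length is a multiple of the shorter one, R recycles silently.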

Question: Try out different operations on vectors: “/”, “+”, “^2”, “log”, “exp”, “cos”, …

Indexing vectors

We saw that [1] appears at the left of the printed output: it is the index of the first element shown on that line. We use indexing to reach particular elements of the vector. R indices start at 1.

fib[1]
## [1] 0
fib[4]
## [1] 2
fib[3:5]
## [1] 1 2 3

We can access several elements at once by putting their indices in a vector of their own, for instance c(1,3,5):

fib[c(1,3,5)]
## [1] 0 1 3

A negative index removes the element at that position from the result:

fib[-2]
##  [1]   0   1   2   3   5   8  13  21  34  55  89 144 233 377 610 987
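A short sketch, under the assumption that fib is defined as above, showing that negative indices refer to positions and not to values. The last line uses a comparison to remove elements by value instead; it is added here only for contrast.

```r
fib <- c(0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987)

# Drop the element in position 2 (only one of the two 1s disappears)
fib[-2]

# Drop several positions at once
fib[-c(1, 2)]

# To remove by value rather than by position, use a comparison instead
fib[fib != 1]
```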

Taking a random subset of a vector

vec4 <- 4:32
length(vec4)
## [1] 29
sample(vec4,size=10)
##  [1]  9 13 10 23 14 30 22 25 20  7
sample(vec4,size=10)
##  [1] 15 12 26 21 28 27 24  5 11  4

The function sample takes a random subset of its input, here the vector vec4.

Question: Why do the two calls to the same function with the same input and arguments give two different answers?
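The function sample relies on R's random number generator, whose internal state changes after every draw. As a minimal sketch, fixing that state with set.seed() makes a draw repeatable; the seed value 2024 and the names first and second are arbitrary choices made for this illustration.

```r
vec4 <- 4:32

set.seed(2024)                      # arbitrary seed, chosen for illustration
first <- sample(vec4, size = 10)

set.seed(2024)                      # reset the generator to the same state
second <- sample(vec4, size = 10)

identical(first, second)            # the two draws now agree
```

Setting a seed at the top of a script is a common way to make an analysis that involves random sampling reproducible.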

Many variables measured on individuals: matrices

R was created so that we can easily manipulate, summarize and visualize data. The first structure that allows us to group together several measurements on the same people/animals/samples is the matrix.

We create a matrix with the matrix function. (Here we have added a few comments to the code using the # character.)

A <- matrix(
  c(4,2,0,3,1,7,2,8,4,5),  # the data elements
  nrow=2,                  # number of rows
  ncol=5)                  # number of columns
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    0    1    2    4
## [2,]    2    3    7    8    5

You can check what type of object A is by asking R:

class(A)
## [1] "matrix"
mode(A)
## [1] "numeric"
is.matrix(A)
## [1] TRUE
is.vector(A)
## [1] FALSE
is.numeric(A)
## [1] TRUE

All the entries of a matrix must have the same mode.

Q: A vector also has to have homogeneous entries, although this is not always obvious. Try typing c(letters[3],4,6,letters[7]). What do you notice?

matlet <- matrix(letters,ncol=26,nrow=5)
matlet
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "a"  "f"  "k"  "p"  "u"  "z"  "e"  "j"  "o"  "t"   "y"   "d"   "i"  
## [2,] "b"  "g"  "l"  "q"  "v"  "a"  "f"  "k"  "p"  "u"   "z"   "e"   "j"  
## [3,] "c"  "h"  "m"  "r"  "w"  "b"  "g"  "l"  "q"  "v"   "a"   "f"   "k"  
## [4,] "d"  "i"  "n"  "s"  "x"  "c"  "h"  "m"  "r"  "w"   "b"   "g"   "l"  
## [5,] "e"  "j"  "o"  "t"  "y"  "d"  "i"  "n"  "s"  "x"   "c"   "h"   "m"  
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] "n"   "s"   "x"   "c"   "h"   "m"   "r"   "w"   "b"   "g"   "l"  
## [2,] "o"   "t"   "y"   "d"   "i"   "n"   "s"   "x"   "c"   "h"   "m"  
## [3,] "p"   "u"   "z"   "e"   "j"   "o"   "t"   "y"   "d"   "i"   "n"  
## [4,] "q"   "v"   "a"   "f"   "k"   "p"   "u"   "z"   "e"   "j"   "o"  
## [5,] "r"   "w"   "b"   "g"   "l"   "q"   "v"   "a"   "f"   "k"   "p"  
##      [,25] [,26]
## [1,] "q"   "v"  
## [2,] "r"   "w"  
## [3,] "s"   "x"  
## [4,] "t"   "y"  
## [5,] "u"   "z"
matlet <- matrix(letters,ncol=26,nrow=5,byrow=TRUE)
matlet
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"  
## [2,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"  
## [3,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"  
## [4,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"  
## [5,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"  
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"  
## [2,] "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"  
## [3,] "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"  
## [4,] "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"  
## [5,] "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"  
##      [,25] [,26]
## [1,] "y"   "z"  
## [2,] "y"   "z"  
## [3,] "y"   "z"  
## [4,] "y"   "z"  
## [5,] "y"   "z"
dim(matlet)
## [1]  5 26
nrow(matlet)
## [1] 5
ncol(matlet)
## [1] 26

You see that by default the function matrix takes a vector and fills in the data column by column. To change that, you have to supply the special argument byrow=TRUE. Note also that the 26 letters were recycled to fill all 130 cells.
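The difference between the two filling orders is easier to see on a small example. This is a sketch using the numbers 1 to 6; the names bycol and byrw are introduced here only for illustration.

```r
# Default: fill column by column
bycol <- matrix(1:6, nrow = 2)
bycol
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE: fill row by row
byrw <- matrix(1:6, nrow = 2, byrow = TRUE)
byrw
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```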

Now is a good time to revisit the help to understand how to read the default arguments.

?matrix

Q: Where do you see the default value of the argument byrow?

Accessing Matrix elements

Matrices are sometimes called two dimensional arrays.
The rows of a matrix are the first index, the columns are the second.
Now suppose we want to replace the fourth column of A by two 1’s.
First take a look at the current values:

A[,4]
## [1] 2 8

Strangely, the column seems to have become a row. This is because extracting a single column returns a plain vector, and R prints all vectors horizontally to save space.
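If the extracted column should stay a matrix, the drop=FALSE argument of the subsetting operator can be used. The sketch below rebuilds A so that it runs on its own; the names col4 and col4_mat are introduced only for this example.

```r
A <- matrix(c(4,2,0,3,1,7,2,8,4,5), nrow = 2, ncol = 5)

col4 <- A[, 4]                      # dimensions are dropped: a plain vector
is.matrix(col4)                     # FALSE
dim(col4)                           # NULL

col4_mat <- A[, 4, drop = FALSE]    # keep the result as a 2 by 1 matrix
dim(col4_mat)                       # 2 1
```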

Now we replace the values by 1:

A[,4] <- 1
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    0    1    1    4
## [2,]    2    3    7    1    5

Transposition

t(A)
##      [,1] [,2]
## [1,]    4    2
## [2,]    0    3
## [3,]    1    7
## [4,]    1    1
## [5,]    4    5

The transpose of the matrix A has the number of rows equal to the number of columns of A, so what do you think dim(t(A)) will be?

Data Matrices

In real situations matrices represent data where the rows are the observations and the columns are the variables.

observNames <- c("H1","H3","H4","H5","H7","H8","H9")
vecHapl <- c(14,12,4,12,3,10,8,10,1,4,15,13,0,1,1,15,13,4,13,3,9,8,10,1,4,13,12,0,1,1,
15,11,5,11,3,10,8,10,1,4,11,14,0,1,1,17,13,4,11,3,10,7,10,1,4,14,12,0,1,1,
13,12,5,12,3,11,8,11,1,4,14,14,0,1,1,16,11,5,12,3,10,8,10,1,4,11,15,0,1,1,
16,11,5,11,3,10,8,10,1,4,11,14,0,1,1)
matHap <- matrix(vecHapl,nrow=7,byrow=TRUE)
rownames(matHap) <- observNames

Suppose I also want to name the columns and type:

POSnames <- c(DYS19,DXYS156Y,DYS389m,DYS389n,DYS389p,DYS389q,DYS390m,
DYS390n,DYS390p,DYS390q,DYS392,DYS393,YAPbcbc,SRY1532bb,92R7bb)

Question: Why does this create an error?

POSnames <- c("DYS19","DXYS156Y","DYS389m","DYS389n","DYS389p",
              "DYS389q","DYS390m","DYS390n","DYS390p","DYS390q",
              "DYS392","DYS393","YAPbcbc","SRY1532bb","92R7bb")
colnames(matHap) <- POSnames
matHap
##    DYS19 DXYS156Y DYS389m DYS389n DYS389p DYS389q DYS390m DYS390n DYS390p
## H1    14       12       4      12       3      10       8      10       1
## H3    15       13       4      13       3       9       8      10       1
## H4    15       11       5      11       3      10       8      10       1
## H5    17       13       4      11       3      10       7      10       1
## H7    13       12       5      12       3      11       8      11       1
## H8    16       11       5      12       3      10       8      10       1
## H9    16       11       5      11       3      10       8      10       1
##    DYS390q DYS392 DYS393 YAPbcbc SRY1532bb 92R7bb
## H1       4     15     13       0         1      1
## H3       4     13     12       0         1      1
## H4       4     11     14       0         1      1
## H5       4     14     12       0         1      1
## H7       4     14     14       0         1      1
## H8       4     11     15       0         1      1
## H9       4     11     14       0         1      1

Each row of matHap corresponds to a person, whose ID starts with ‘H’, and each column represents a special position on the Y chromosome where repeats can occur. The entries are the numbers of repeats, so they are integers.
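Once rows and columns have names, entries can be selected by name as well as by number. The sketch below rebuilds only the top-left 2 by 2 corner of matHap so that it stands alone; the object name small is an assumption made for this example.

```r
# A 2 by 2 corner of matHap (rows H1 and H3, first two columns)
small <- matrix(c(14, 12,
                  15, 13), nrow = 2, byrow = TRUE,
                dimnames = list(c("H1", "H3"), c("DYS19", "DXYS156Y")))

small["H3", "DYS19"]     # one entry, by row and column name
small["H1", ]            # a whole row, as a named vector
small[, "DXYS156Y"]      # a whole column, as a named vector
```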

plot(matHap)

By default plot makes a scatter plot of the first two columns of matHap.

Saving matrices

We can save our data to a file for later use in various ways:

As an R object:

save(matHap,file="matHap.RData")

As a plain text file:

write.table(matHap, file="matHap.txt")
# Take a look at the file
file.show("matHap.txt")

You can’t look at an .RData file directly: it is not a text file but a compressed binary version of the information, so it is not human-readable. Later, however, we will be able to load the data just by typing

load("matHap.RData")

Question: Why do we need quotes within the brackets here?
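A text file written with write.table can be read back with read.table. The following is a hedged sketch of such a round trip on a small stand-in matrix; the names small, txtfile and back are assumptions, and a temporary file is used so that nothing in the working directory is overwritten.

```r
small <- matrix(c(14, 12, 15, 13), nrow = 2, byrow = TRUE,
                dimnames = list(c("H1", "H3"), c("DYS19", "DXYS156Y")))

txtfile <- tempfile(fileext = ".txt")   # temporary file path
write.table(small, file = txtfile)

# read.table returns a data.frame; as.matrix converts it back to a matrix
back <- as.matrix(read.table(txtfile, header = TRUE))

all(back == small)                       # same values
rownames(back)                           # row names were kept
```

By default read.table adjusts column names that are not valid R names, so a name such as 92R7bb would come back altered unless check.names=FALSE is supplied.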

Hadley Wickham’s extensive documentation

Saving and loading your objects

When you run a pipeline or analysis, it is convenient to save objects such as matrices or data frames. By doing so, you won’t have to rerun your analysis every time you want to view the object in question. To save an object, you can use the function saveRDS. The object can then be read back with the readRDS function. This pair of functions is an alternative to save and load. Their advantage is that the user can give a new name to the saved object when loading it.

mat <- matrix(sample(0:1, 12, replace=TRUE),3,4) # A 3 by 4 matrix containing 0s and 1s
saveRDS(mat, "Neo") # Save mat in a file named Neo
the.matrix <- readRDS("Neo") # Load the file Neo; the object is now called the.matrix rather than mat
identical(mat, the.matrix) # Check that mat and the.matrix are identical

Heterogeneous Data: list and data.frame

Lists

A completely heterogeneous set of objects of different types and different sizes can be combined into a list.

We cannot use the c() function for this, because it would coerce the values to a common type. The equivalent is simply the list() function, which creates an ordered collection of components.

A <- matrix( c(4,2,0,3,1,7,2,8,4,5), nrow=2,ncol=5) 
Atypical <- list(name="Susan", byDate=1125, Amatrix=A, size=5.5, urban=FALSE)
Atypical
## $name
## [1] "Susan"
## 
## $byDate
## [1] 1125
## 
## $Amatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    0    1    2    4
## [2,]    2    3    7    8    5
## 
## $size
## [1] 5.5
## 
## $urban
## [1] FALSE

Addressing elements of a list

We can access the elements of the list by their position in the list:

Atypical[[1]]
## [1] "Susan"
class(Atypical[[3]])
## [1] "matrix"
is.logical(Atypical[[5]])
## [1] TRUE
Atypical[[5]]
## [1] FALSE
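Double and single square brackets behave differently on lists. As a small sketch, using a shortened list without the matrix component (the name small_list is introduced only for this example):

```r
small_list <- list(name = "Susan", byDate = 1125, size = 5.5, urban = FALSE)

class(small_list[[1]])    # "character": the element itself
class(small_list[1])      # "list": a list of length one containing the element
small_list[c(1, 3)]       # single brackets can select several components
```

Double brackets extract one component; single brackets always return a list, which is why they can hold several components of different types.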

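We can also access the list elements by their label using the $ sign. The label can be abbreviated as long as the abbreviation is unambiguous, so Atypical$A finds the component Amatrix: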

Atypical$byDate
## [1] 1125
Atypical$A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    0    1    2    4
## [2,]    2    3    7    8    5

We can ask to see information about a list using either summary or str:

summary(Atypical)
##         Length Class  Mode     
## name     1     -none- character
## byDate   1     -none- numeric  
## Amatrix 10     -none- numeric  
## size     1     -none- numeric  
## urban    1     -none- logical
str(Atypical)
## List of 5
##  $ name   : chr "Susan"
##  $ byDate : num 1125
##  $ Amatrix: num [1:2, 1:5] 4 2 0 3 1 7 2 8 4 5
##  $ size   : num 5.5
##  $ urban  : logi FALSE

We see that summary reports a length of 10 for Atypical$Amatrix, which is the number of entries in the 2 by 5 matrix.

Lists as output

Lists are often used as the output of a complicated function that returns parameters and results of different dimensions. Most lists have names for their components; otherwise we can access them in order with double square brackets:

 result <- lm(weight~height, data=women)
 names(result)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
 result[[1]]
## (Intercept)      height 
##   -87.51667     3.45000
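Since the components of the lm output are named, they can also be reached with the $ sign, and for the coefficients there is an extractor function coef(). A brief sketch using the same fit on the women data:

```r
result <- lm(weight ~ height, data = women)

result$coefficients            # same component as result[[1]]
coef(result)                   # extractor function, returns the same values
identical(result[[1]], result$coefficients)
head(result$residuals, 3)      # another named component
```

Accessing components by name is usually safer than by position, because the code stays readable and does not depend on the order of the components.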

Factor variables

Some variables are encoded as numbers when in fact these numbers are themselves meaningless codes.

For instance we might have a class of 13 male and 11 female students. We could code this using the rep function, which repeats a value a certain number of times:

studentg=c(rep(1,13),rep(2,11)) 
table(studentg)
## studentg
##  1  2 
## 13 11

A better solution is to encode the variable gender as a factor.

gender=factor(c(rep("M",13),rep("F",11)))
gender
##  [1] M M M M M M M M M M M M M F F F F F F F F F F F
## Levels: F M
class(gender)
## [1] "factor"
table(gender)
## gender
##  F  M 
## 11 13
levels(gender)
## [1] "F" "M"
str(gender)
##  Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
length(gender)
## [1] 24
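Internally a factor stores integer codes together with a vector of level labels, which is what the str output above hints at. The sketch below looks at those codes and shows how to choose the level order; gender2 is a name introduced only for this illustration.

```r
gender <- factor(c(rep("M", 13), rep("F", 11)))

as.integer(gender)[1:3]     # internal codes: M is level 2, since levels sort alphabetically
nlevels(gender)             # 2

# The level order can be chosen explicitly with the levels argument
gender2 <- factor(c(rep("M", 13), rep("F", 11)), levels = c("M", "F"))
levels(gender2)             # "M" "F"
table(gender2)              # counts now appear in the order M, F
```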

R Data set example: UScereal

Let’s start with an example using the datasets already available in R.

In the UScereal data from the MASS package, the maker is represented by its first initial: G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina.

library(MASS)
?UScereal
UScereal[1:4,1:5]
##                           mfr calories   protein      fat   sodium
## 100% Bran                   N 212.1212 12.121212 3.030303 393.9394
## All-Bran                    K 212.1212 12.121212 3.030303 787.8788
## All-Bran with Extra Fiber   K 100.0000  8.000000 0.000000 280.0000
## Apple Cinnamon Cheerios     G 146.6667  2.666667 2.666667 240.0000
table(UScereal[,1])
## 
##  G  K  N  P  Q  R 
## 22 21  3  9  5  5
summary(UScereal[,1])
##  G  K  N  P  Q  R 
## 22 21  3  9  5  5

So it is rectangular like a matrix, but some variables are numeric and others are factors.

The first variable is an example of a factor variable. Because variables of different classes often occur together as information on the same observations, R needs a richer data structure than vectors, arrays and matrices.

data.frame: A way of combining different types of variables

A data.frame is a list that contains many variables. The variables can be thought of as the columns of a table; we do not call this a matrix because the columns can have different types. Let’s revisit the UScereal data.

We can access the 11th variable as we would in a matrix. It is a factor variable, so a good summary is to tabulate it with the table function.

table(UScereal[,11])
## 
##     100% enriched     none 
##        5       57        3

But we could also have used the variable name:

table(UScereal$vitamins)
## 
##     100% enriched     none 
##        5       57        3
class(UScereal)
## [1] "data.frame"
str(UScereal)
## 'data.frame':    65 obs. of  11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num  212 212 100 147 110 ...
##  $ protein  : num  12.12 12.12 8 2.67 2 ...
##  $ fat      : num  3.03 3.03 0 2.67 0 ...
##  $ sodium   : num  394 788 280 240 125 ...
##  $ fibre    : num  30.3 27.3 28 2 1 ...
##  $ carbo    : num  15.2 21.2 16 14 11 ...
##  $ sugars   : num  18.2 15.2 0 13.3 14 ...
##  $ shelf    : int  3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num  848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(UScereal)
##  mfr       calories        protein             fat            sodium     
##  G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0  
##  K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0  
##  N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0  
##  P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8  
##  Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0  
##  R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9  
##      fibre            carbo           sugars          shelf      
##  Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000  
##  Median : 2.000   Median :18.67   Median :12.00   Median :2.000  
##  Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169  
##  3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000  
##  Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000  
##    potassium          vitamins 
##  Min.   : 15.00   100%    : 5  
##  1st Qu.: 45.00   enriched:57  
##  Median : 96.59   none    : 3  
##  Mean   :159.12                
##  3rd Qu.:220.00                
##  Max.   :969.70

Question: What has the function summary shown about the UScereal data? Because a data.frame is a list, we can access the variables by name as well as by the order in which they appear in the data.
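The list nature of a data.frame can be checked directly. This sketch uses the small women data set from the datasets package, and the names by_name, by_position and by_column are introduced only for illustration.

```r
is.list(women)                       # TRUE

by_name     <- women$height          # by name, as for a list
by_position <- women[[1]]            # by position, as for a list
by_column   <- women[, 1]            # by column index, as for a matrix

identical(by_name, by_position)
identical(by_name, by_column)
```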

Question: Which of the variables in the UScereal data.frame are factors?

dim(UScereal)
## [1] 65 11

Question: We can use the function dim on a data.frame to find out how many variables were measured on how many observations. What other data type can we use dim on?

If we only want to look at the top few observations, we can use the function head:

head(UScereal)
##                           mfr calories   protein      fat   sodium
## 100% Bran                   N 212.1212 12.121212 3.030303 393.9394
## All-Bran                    K 212.1212 12.121212 3.030303 787.8788
## All-Bran with Extra Fiber   K 100.0000  8.000000 0.000000 280.0000
## Apple Cinnamon Cheerios     G 146.6667  2.666667 2.666667 240.0000
## Apple Jacks                 K 110.0000  2.000000 0.000000 125.0000
## Basic 4                     G 173.3333  4.000000 2.666667 280.0000
##                               fibre    carbo   sugars shelf potassium
## 100% Bran                 30.303030 15.15152 18.18182     3 848.48485
## All-Bran                  27.272727 21.21212 15.15151     3 969.69697
## All-Bran with Extra Fiber 28.000000 16.00000  0.00000     3 660.00000
## Apple Cinnamon Cheerios    2.000000 14.00000 13.33333     1  93.33333
## Apple Jacks                1.000000 11.00000 14.00000     2  30.00000
## Basic 4                    2.666667 24.00000 10.66667     3 133.33333
##                           vitamins
## 100% Bran                 enriched
## All-Bran                  enriched
## All-Bran with Extra Fiber enriched
## Apple Cinnamon Cheerios   enriched
## Apple Jacks               enriched
## Basic 4                   enriched

Some functions know how to behave, whatever the data.

Here is an example of data from the datasets package:

library(datasets)
women
##    height weight
## 1      58    115
## 2      59    117
## 3      60    120
## 4      61    123
## 5      62    126
## 6      63    129
## 7      64    132
## 8      65    135
## 9      66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164
class(women)
## [1] "data.frame"
dim(women)
## [1] 15  2
plot(women)
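Functions such as summary and plot are generic: they look at the class of their argument and adapt what they do. A short sketch with summary, where the small factor is made up for this illustration:

```r
summary(women$height)                 # numeric vector: quartiles and mean
summary(factor(c("a", "b", "a")))     # factor: counts per level
summary(women)                        # data.frame: one summary per column

# The class of the argument decides which version is used
class(women$height)
class(women)
```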

We can change the data class.

Sometimes we may need to go back to matrices. This is quite easy, as is the reverse:

matw=as.matrix(women)
class(matw)
## [1] "matrix"
women2=as.data.frame(matw)
class(women2)
## [1] "data.frame"

But sometimes the conversion gives surprising results.
Question: Try typing

as.matrix(UScereal)

and explain what you see.

Summary of our session

Question: The dataset mtcars is very popular as an example in R tutorials available online. Look at the data frame and find out whether it has any variables that are factors.