Chapter 5 Relationships among samples

5.1 Introduction

Principal component analysis and discriminant analysis have analysed the data through the variables (columns) of the data set, extracting information from the covariance or correlation matrix of the variables. However, for other research purposes we switch focus away from relationships among the variables to relationships among the samples, analysing the data set \(\mathbf X\) through the samples (rows). For this, instead of covariance matrices for variables, we start with similarity (or distance) matrices for samples.

5.2 Similarity and distance matrices

The similarity between two samples is often defined as a number in the range \([0,1]\) where 1 implies identical and 0 implies nothing in common. The distance between two samples is the complement of similarity though distances are not always restricted to lie in the range \([0,1]\). When they are restricted to this range there is often a simple relationship between similarity and distance such as \(d = 1-s\) or \(d=1-s^2\). Distances are also referred to as dissimilarities.

If \(s_{ij}\) denotes the similarity between the i\(th\) and j\(th\) sample then the collection of all pairwise similarities between samples can be written as a matrix, referred to as the similarity matrix, \(S\). The full \(n\) by \(n\) matrix, where \(n\) is the number of samples, will have 1s on all the diagonal positions because the similarity of a sample with itself must be 1. Also \(s_{ij} = s_{ji}\).

In the same way, we can define a distance matrix \(D\) with individual elements \(d_{ij}\) and 0s on the diagonal.

From this point onwards we will be exploring methods of analysing multivariate data for which distance or similarity matrices are sometimes the natural starting point. Note that distance or similarity matrices are symmetrical and typically only the lower triangle is shown and stored. There is a wide range of similarity and distance measures, some generic (e.g. simple matching) and some more specialised (e.g. Nei-Li for molecular markers or the Bradley-Curtis which is popular in ecology). It is then essential to know how these measures are calculated, actually the details of a measure under the same name often change from one software package to another. When the data set includes several types of variables, say quantitative, binary and categorical, a strategy consists of defining an appropriate measure for each type of variables separately and the compute an overall similarity/distance measure (e.g. using Gower’s formula).

Here we explore different methods of calculating such matrices for data of different types. We use the moths data set as reference to illustrate. It counts the number of moths in each of 12 species (variables) at each of 14 sites (samples).

5.3 Binary data

We begin by converting to the moths counts to presence/absence indicators. This involves a little programming.

moths <- read.table("moths.txt", h = T, stringsAsFactors = TRUE)
#copy in preparation for conversion to presence/absence (pa) format
mothspa <- moths
#convert the counts to 1=present, 0=absent
mothspa[, 5:16] <- ifelse(mothspa[, 5:16] > 0, 1, 0)
mothspa
   Site   E   N Type M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
1     A 513 213    W  1  1  0  1  1  1  1  1  1   1   0   0
2     B 465 164    W  1  1  1  1  1  1  1  1  1   1   0   0
3     C 480 143    W  1  1  1  1  1  1  1  1  1   1   0   1
4     D 418 107    W  0  1  1  1  1  1  1  1  1   0   0   0
5     E 306 138    P  1  1  1  1  1  1  1  1  1   0   1   1
6     F 282  45    F  1  1  1  1  1  1  1  1  1   0   0   0
7     G 263 284    F  1  1  1  1  1  1  1  1  1   1   0   1
8     H 520 280    P  1  1  0  1  1  1  1  1  1   1   0   0
9     I 575 266    F  0  1  0  1  1  1  1  1  1   0   0   0
10    J 363 594    W  1  1  1  0  1  0  0  0  0   1   1   0
11    K 260 756    P  1  1  1  0  1  0  0  0  0   1   1   0
12    L 237 809    W  1  1  1  1  1  0  0  1  0   0   1   0
13    M 316 864    W  1  1  0  1  1  0  1  0  0   0   1   0
14    N 263 874    W  1  1  0  0  1  0  0  0  0   1   1   1

Considering sites K and L only. We create a data.frame from the K and L site data and manipulate it to compute a table showing the co-occurrence of moths across species in the two sites.

KL <- data.frame(t(mothspa[mothspa$Site %in% c("K", "L"), 5:16]))
names(KL) <- c("K", "L")
KL[, 1] <- factor(KL[, 1], levels = c(1, 0), labels = c("Present", "Absent"))
KL[, 2] <- factor(KL[, 2], levels = c(1, 0), labels = c("Present", "Absent"))
ftable(KL[, c("L", "K")])
        K Present Absent
L                       
Present         5      2
Absent          1      4

The simple matching coefficient of similarity is \(S_{KL} = (5+4)/(5+2+1+4) = 0.75\). The Jaccard coefficient of similarity (ignores “double zeros”) is \(S_{KL} = 5/(5+2+1) = 0.625\).

The package vegan has a wide range of options for distance coefficients, though its nomenclature is somewhat challenging. The function vegdist in vegan computes distances \(d\), hence similarities can be obtained as \(s=1-d\). The following code snippet computes the simple matching similarity between all sites using vegan.

library(vegan)
# simple matching similarity matrix
sm <- 1 - vegdist(mothspa[5:16], method = "gower")
options(digits = 2)
sm
      1    2    3    4    5    6    7    8    9   10   11   12   13
2  0.92                                                            
3  0.83 0.92                                                       
4  0.75 0.83 0.75                                                  
5  0.67 0.75 0.83 0.75                                             
6  0.83 0.92 0.83 0.92 0.83                                        
7  0.83 0.92 1.00 0.75 0.83 0.83                                   
8  1.00 0.92 0.83 0.75 0.67 0.83 0.83                              
9  0.83 0.75 0.67 0.92 0.67 0.83 0.67 0.83                         
10 0.42 0.50 0.42 0.33 0.42 0.42 0.42 0.42 0.25                    
11 0.42 0.50 0.42 0.33 0.42 0.42 0.42 0.42 0.25 1.00               
12 0.50 0.58 0.50 0.58 0.67 0.67 0.50 0.50 0.50 0.75 0.75          
13 0.58 0.50 0.42 0.50 0.58 0.58 0.42 0.58 0.58 0.67 0.67 0.75     
14 0.42 0.33 0.42 0.17 0.42 0.25 0.42 0.42 0.25 0.83 0.83 0.58 0.67

The following computes the matrix of Jaccard distances (the round function takes care of the number of decimal places).

# Jaccard distance matrix
jacc <- vegdist(mothspa[5:16], method = "jaccard")
round(jacc, 2)
      1    2    3    4    5    6    7    8    9   10   11   12   13
2  0.10                                                            
3  0.18 0.09                                                       
4  0.30 0.20 0.27                                                  
5  0.33 0.25 0.17 0.27                                             
6  0.20 0.10 0.18 0.11 0.18                                        
7  0.18 0.09 0.00 0.27 0.17 0.18                                   
8  0.00 0.10 0.18 0.30 0.33 0.20 0.18                              
9  0.22 0.30 0.36 0.12 0.36 0.22 0.36 0.22                         
10 0.64 0.55 0.58 0.73 0.58 0.64 0.58 0.64 0.82                    
11 0.64 0.55 0.58 0.73 0.58 0.64 0.58 0.64 0.82 0.00               
12 0.55 0.45 0.50 0.50 0.36 0.40 0.50 0.55 0.60 0.38 0.38          
13 0.50 0.55 0.58 0.60 0.45 0.50 0.58 0.50 0.56 0.50 0.50 0.38     
14 0.64 0.67 0.58 0.83 0.58 0.75 0.58 0.64 0.82 0.29 0.29 0.56 0.50

Note that the function dist in the R base distribution also includes several distance measures (check out ?dist). For example, the Jaccard distance matrix we just obtained can be equally computed using dist as

# Jaccard distance matrix using dist in R base
jacc <- dist(mothspa[5:16], method = "binary")
round(jacc, 2)
      1    2    3    4    5    6    7    8    9   10   11   12   13
2  0.10                                                            
3  0.18 0.09                                                       
4  0.30 0.20 0.27                                                  
5  0.33 0.25 0.17 0.27                                             
6  0.20 0.10 0.18 0.11 0.18                                        
7  0.18 0.09 0.00 0.27 0.17 0.18                                   
8  0.00 0.10 0.18 0.30 0.33 0.20 0.18                              
9  0.22 0.30 0.36 0.12 0.36 0.22 0.36 0.22                         
10 0.64 0.55 0.58 0.73 0.58 0.64 0.58 0.64 0.82                    
11 0.64 0.55 0.58 0.73 0.58 0.64 0.58 0.64 0.82 0.00               
12 0.55 0.45 0.50 0.50 0.36 0.40 0.50 0.55 0.60 0.38 0.38          
13 0.50 0.55 0.58 0.60 0.45 0.50 0.58 0.50 0.56 0.50 0.50 0.38     
14 0.64 0.67 0.58 0.83 0.58 0.75 0.58 0.64 0.82 0.29 0.29 0.56 0.50

There are many other similarity coefficients for binary data and it is simply a matter of selecting the most appropriate for your circumstances. A literature review in your area is useful to get insight into what other people is using and what might be a good choice for the characteristics of your data.

5.4 Numeric data

The raw counts for the moths data are strongly skewed. A common strategy to reduce this effect is to take logs (after adding 1 to the counts to avoid problems with zeros), we use log base 10 as an example here.

mothslog <- moths # Create a separate data set
mothslog[, 5:16] <- log10(mothslog[, 5:16] + 1) # Transform the counts
options(digits = 2)
mothslog
   Site   E   N Type   M1  M2   M3   M4   M5  M6  M7   M8   M9  M10  M11  M12
1     A 513 213    W 2.19 1.7 0.00 1.58 1.32 1.7 0.3 0.78 0.30 2.86 0.00 0.00
2     B 465 164    W 1.54 1.8 0.78 1.83 0.95 1.3 2.2 1.53 1.56 1.93 0.00 0.00
3     C 480 143    W 0.78 1.4 0.78 1.49 0.48 1.3 1.1 1.83 2.33 1.87 0.00 0.48
4     D 418 107    W 0.00 1.6 1.75 0.78 0.60 1.8 0.9 1.38 0.85 0.00 0.00 0.00
5     E 306 138    P 1.51 2.0 0.85 1.11 1.38 1.8 1.0 2.61 1.77 0.00 0.48 1.15
6     F 282  45    F 0.30 1.6 0.30 1.90 1.83 1.6 1.9 1.56 0.85 0.00 0.00 0.00
7     G 263 284    F 0.78 1.2 0.48 1.45 2.16 1.7 1.4 0.90 0.90 0.48 0.00 1.18
8     H 520 280    P 0.60 1.3 0.00 2.50 1.90 1.3 1.3 0.70 1.69 0.90 0.00 0.00
9     I 575 266    F 0.00 0.9 0.00 2.10 1.58 1.0 2.0 1.92 0.60 0.00 0.00 0.00
10    J 363 594    W 1.81 2.6 0.30 0.00 1.68 0.0 0.0 0.00 0.00 1.90 1.82 0.00
11    K 260 756    P 1.43 2.6 0.48 0.00 2.36 0.0 0.0 0.00 0.00 2.11 2.60 0.00
12    L 237 809    W 0.95 2.5 1.98 0.48 2.04 0.0 0.0 0.90 0.00 0.00 1.75 0.00
13    M 316 864    W 0.85 2.2 0.00 0.85 1.59 0.0 1.0 0.00 0.00 0.00 1.04 0.00
14    N 263 874    W 1.00 2.7 0.00 0.00 2.20 0.0 0.0 0.00 0.00 2.63 2.15 0.85

Focusing on sites K and L only again.

round(mothslog[mothspa$Site %in% c("K", "L"), 5:16], 2)
     M1   M2   M3   M4   M5 M6 M7  M8 M9  M10  M11 M12
11 1.43 2.57 0.48 0.00 2.36  0  0 0.0  0 2.11 2.60   0
12 0.95 2.49 1.98 0.48 2.04  0  0 0.9  0 0.00 1.75   0

With numeric data distances rather than similarities are commonly used. For example, the squared (and normalised) Euclidean distance between K and L is

\[ d^{2}_{K,L} = [(1.43-0.95)^{2}/(2.19)^{2} + (2.57-2.49)^{2}/(1.83)^{2} + \cdots ]/12\] where 2.19 and 1.83 are the ranges (\(\max-\min\)) across all 14 sites of the log counts for species 1 and 2.

This particular version of the Euclidean distance forces the distance to lie in the range \([0,1]\), but the exact formula used under the heading Euclidean will be software package specific. It is common for Euclidean coefficients not to divide each term by its range, for example choosing method = "euclidean" in the dist function computes it using the original definition. This stresses again the need to check the details of the function used.

Deriving normalised squared Euclidean distances, as defined above, within vegan is awkward. Before calling vegdist the variables have to be normalised; that is, divided by their range. The derived distances then have to be squared and divided by the number of variables.

stmothslog <- decostand(mothslog[5:16], method = "range")
round(stmothslog, 2)
     M1   M2   M3   M4   M5   M6   M7   M8   M9  M10  M11  M12
1  1.00 0.45 0.00 0.63 0.45 0.97 0.14 0.30 0.13 1.00 0.00 0.00
2  0.70 0.47 0.39 0.73 0.25 0.74 1.00 0.59 0.67 0.67 0.00 0.00
3  0.36 0.29 0.39 0.60 0.00 0.73 0.52 0.70 1.00 0.65 0.00 0.41
4  0.00 0.37 0.88 0.31 0.07 1.00 0.42 0.53 0.36 0.00 0.00 0.00
5  0.69 0.62 0.43 0.45 0.48 0.97 0.46 1.00 0.76 0.00 0.18 0.97
6  0.14 0.39 0.15 0.76 0.72 0.86 0.88 0.60 0.36 0.00 0.00 0.00
7  0.36 0.15 0.24 0.58 0.89 0.96 0.65 0.35 0.39 0.17 0.00 1.00
8  0.27 0.19 0.00 1.00 0.76 0.72 0.62 0.27 0.72 0.32 0.00 0.00
9  0.00 0.00 0.00 0.84 0.59 0.58 0.93 0.74 0.26 0.00 0.00 0.00
10 0.83 0.92 0.15 0.00 0.64 0.00 0.00 0.00 0.00 0.67 0.70 0.00
11 0.65 0.91 0.24 0.00 1.00 0.00 0.00 0.00 0.00 0.74 1.00 0.00
12 0.44 0.87 1.00 0.19 0.83 0.00 0.00 0.35 0.00 0.00 0.67 0.00
13 0.39 0.68 0.00 0.34 0.59 0.00 0.46 0.00 0.00 0.00 0.40 0.00
14 0.46 1.00 0.00 0.00 0.91 0.00 0.00 0.00 0.00 0.92 0.83 0.72
eucldist2 <- vegdist(stmothslog[, 1:12], method = "euclidean") ^ 2 / 12
round(eucldist2, 2)
      1    2    3    4    5    6    7    8    9   10   11   12   13
2  0.13                                                            
3  0.18 0.06                                                       
4  0.27 0.16 0.13                                                  
5  0.28 0.18 0.12 0.19                                             
6  0.21 0.10 0.15 0.12 0.17                                        
7  0.23 0.19 0.17 0.21 0.10 0.11                                   
8  0.16 0.09 0.12 0.19 0.22 0.05 0.12                              
9  0.27 0.14 0.18 0.16 0.24 0.03 0.15 0.07                         
10 0.20 0.32 0.37 0.37 0.40 0.36 0.38 0.33 0.42                    
11 0.27 0.39 0.44 0.44 0.46 0.40 0.41 0.37 0.46 0.02               
12 0.36 0.35 0.38 0.24 0.33 0.29 0.35 0.34 0.36 0.13 0.12          
13 0.24 0.23 0.29 0.24 0.32 0.16 0.24 0.18 0.18 0.09 0.13 0.13     
14 0.29 0.42 0.42 0.49 0.40 0.43 0.34 0.40 0.49 0.07 0.06 0.21 0.17

A drawback of the Euclidean distance is that it is sensible to even a few large data values (outliers). A popular robust alternative is the (normalised) Manhattan or city-block distance.

\[ d_{K,L} = [|1.43-0.95|/(2.19) + |2.57-2.49|/(1.83) + \cdots ]/12\]

Another known alternative is the ecological distance, which is the same as city-block but ignoring double zeros

\[ d_{K,L} = [|1.43-0.95|/(2.19) + |2.57-2.49|/(1.83) + \cdots ]/8\]

The final divisor is now 8 rather than 12 because there are 4 double zeros: species 6, 7, 9 and 12. Therefore only 8 variables contribute to the distance calculation. The normalised Manhattan and ecological distances can be obtained with vegan as follows.

# Normalised Manhattan distance matrix in vegan package
manhdist <- vegdist(mothslog[, 5:16], method = "gower")
round(manhdist, 2)
      1    2    3    4    5    6    7    8    9   10   11   12   13
2  0.27                                                            
3  0.36 0.19                                                       
4  0.37 0.31 0.29                                                  
5  0.41 0.32 0.30 0.34                                             
6  0.32 0.22 0.31 0.22 0.34                                        
7  0.36 0.35 0.30 0.33 0.26 0.22                                   
8  0.31 0.25 0.28 0.34 0.40 0.18 0.23                              
9  0.38 0.29 0.36 0.30 0.42 0.12 0.31 0.21                         
10 0.35 0.47 0.55 0.56 0.56 0.51 0.56 0.49 0.56                    
11 0.42 0.52 0.59 0.59 0.60 0.55 0.56 0.53 0.61 0.08               
12 0.48 0.53 0.57 0.38 0.50 0.44 0.49 0.49 0.50 0.23 0.23          
13 0.37 0.44 0.47 0.37 0.41 0.32 0.40 0.33 0.34 0.22 0.27 0.23     
14 0.45 0.61 0.59 0.66 0.58 0.60 0.50 0.55 0.63 0.16 0.14 0.30 0.30
# Ecological distance matrix using the vegan package
# altGower omits double zeros and decostand() standardises
ecoldist <- vegdist(decostand(mothslog[5:16], "range"), method = "altGower")
round(ecoldist, 2)
      1    2    3    4    5    6    7    8    9   10   11   12   13
2  0.32                                                            
3  0.39 0.21                                                       
4  0.44 0.38 0.32                                                  
5  0.41 0.32 0.30 0.37                                             
6  0.38 0.26 0.34 0.29 0.37                                        
7  0.39 0.38 0.33 0.36 0.26 0.24                                   
8  0.41 0.30 0.30 0.41 0.40 0.22 0.25                              
9  0.50 0.35 0.39 0.44 0.46 0.16 0.33 0.28                         
10 0.38 0.51 0.55 0.61 0.56 0.56 0.56 0.54 0.61                    
11 0.46 0.56 0.59 0.65 0.60 0.60 0.56 0.58 0.66 0.17               
12 0.52 0.58 0.57 0.46 0.54 0.53 0.49 0.53 0.60 0.34 0.35          
13 0.44 0.48 0.47 0.45 0.45 0.39 0.40 0.40 0.45 0.33 0.41 0.34     
14 0.49 0.61 0.59 0.66 0.58 0.60 0.50 0.60 0.69 0.28 0.24 0.40 0.45

There are, again, many ways of calculating distances for numeric data and there is a lack of consistency across software platforms with respect to the details of these calculations and to their naming conventions. Other specialised R packages that implement a range of distance measures include cluster, proxy and simba.

5.5 Exercises

  1. RAPD data. Derive a similarity matrix for these 25 nematodes species based on the RAPD marker data. These markers are binary variables. Does it make much difference whether you use simple matching or Jaccard? Does a visual inspection of the matrix in any way support the known species classifications of these samples?

  2. SOUPS data. How similar are these 8 soups with respect to (1) Odour (2) Flavour? Does either of these similarity patterns match the texture assessments?