For many kinds of objects, it is useful to classify them into groups (clusters) based on their multidimensional characteristics. For example, if the varieties and lines included in genetic resources can be grouped based on DNA polymorphism data, the variation of traits in the genetic resources can be organized using the group information. As mentioned in the last lecture, it is difficult to grasp the variation in many features of many samples just by looking at the data. In principal component analysis, we summarized the variation in data by representing a large number of features with low-dimensional variables. Cluster analysis summarizes the variation in data by grouping a large number of samples into a small number of groups. In this lecture, we first outline hierarchical cluster analysis, which classifies a large number of samples into hierarchically organized groups.
In this lecture, explanations are again based on the rice data (Zhao et al. 2011). Three datasets are used: variety/line data (RiceDiversityLine.csv), phenotype data (RiceDiversityPheno.csv), and marker genotype data (RiceDiversityGeno.csv). All of them were downloaded from the Rice Diversity web page http://www.ricediversity.org/data/index.cfm. As described earlier, missing values in the marker genotype data were imputed with the software fastPHASE (Scheet and Stephens 2006).
First, let’s read three datasets and combine them as we did last time.
# this data set was analyzed in Zhao 2011 (Nature Communications 2:467)
line <- read.csv("RiceDiversityLine.csv", stringsAsFactors = T)
pheno <- read.csv("RiceDiversityPheno.csv", stringsAsFactors = T)
geno <- read.csv("RiceDiversityGeno.csv", stringsAsFactors = T)
line.pheno <- merge(line, pheno, by.x = "NSFTV.ID", by.y = "NSFTVID")
alldata <- merge(line.pheno, geno, by.x = "NSFTV.ID", by.y = "NSFTVID")
rownames(alldata) <- alldata$NSFTV.ID
Let’s classify the 374 varieties/lines into clusters based on variation in the DNA markers (1,311 SNPs). First, prepare the data for this.
#### analysis of marker data
data.mk <- alldata[, 50:ncol(alldata)]
subpop <- alldata$Sub.population
dim(data.mk)
## [1] 374 1311
There are various methods for cluster analysis; here we start by performing it with one of them.
First, based on the DNA marker data, distances among varieties and lines are calculated.
# calculate Euclidean distance
d <- dist(data.mk)
head(d)
## [1] 54.47141 53.08033 44.70547 52.82571 45.40700 44.36904
as.matrix(d)[1:6,1:6]
## 1 3 4 5 6 7
## 1 0.00000 54.47141 53.08033 44.70547 52.82571 45.40700
## 3 54.47141 0.00000 37.53194 46.79940 37.68502 49.82169
## 4 53.08033 37.53194 0.00000 44.38481 17.58133 46.49073
## 5 44.70547 46.79940 44.38481 0.00000 43.85254 42.87989
## 6 52.82571 37.68502 17.58133 43.85254 0.00000 46.69070
## 7 45.40700 49.82169 46.49073 42.87989 46.69070 0.00000
Note that the value returned by the function dist is not a matrix but a dist object, a compact representation of the distance matrix. Therefore, if you want to display the distances among the first six varieties as a 6x6 matrix, you need to convert this distance-specific format to an ordinary matrix with the function as.matrix, as shown above.
Let’s perform cluster analysis.
tre <- hclust(d)
tre
##
## Call:
## hclust(d = d)
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 374
After “Call”, the executed command is displayed, just as in regression analysis. “Cluster method” shows the method of cluster analysis (the definition of the distance between clusters), “Distance” shows the method used to calculate distances, and “Number of objects” is the number of objects classified (here, varieties and lines).
We will display the result of cluster analysis as a dendrogram.
# draw dendrogram
plot(tre)
Fig.1 Dendrogram of 374 varieties / lines obtained based on marker genotype data
Figure 1 shows the result obtained with the function hclust in the form of a dendrogram. Using the package ape, you can draw a dendrogram in various expression styles. To do so, you first need to convert the result obtained with the function hclust into a class called phylo, which is defined in the package ape.
# convert to a phylo object defined in the ape package
phy <- ape::as.phylo(tre)
Let’s plot the result converted to the phylo class.
# plot as a phylo object
plot(phy)
Fig.2 Dendrogram converted to the phylo class of the package ape
Figure 2 is very hard to read because of the large number of varieties and lines. Let’s make it a little easier to interpret by coloring the branches so that the relationship between the genetic background of each variety/line (the subpopulation it belongs to) and its position in the tree can be checked.
head(phy$edge)
## [,1] [,2]
## [1,] 375 376
## [2,] 376 380
## [3,] 380 236
## [4,] 380 392
## [5,] 392 209
## [6,] 392 334
head(subpop[phy$edge[,2]], 10)
## [1] <NA> <NA> ADMIX <NA> ADMIX ADMIX <NA> <NA> <NA> <NA>
## Levels: ADMIX AROMATIC AUS IND TEJ TRJ
col <- as.numeric(subpop[phy$edge[,2]])
edge.col <- ifelse(is.na(col), "gray", col)
plot(phy, edge.color = edge.col, show.tip.label = F)
Fig.3 Dendrogram colored for each population group of varieties and lines
As seen in Figure 3, varieties and lines belonging to the same subpopulation tend to fall into the same cluster, which shows that differences in the genetic background of the varieties and lines are well reflected in the result of the cluster analysis.
With the phylo class of the package ape, dendrograms can be drawn in various styles. Let’s draw several different types of dendrograms.
# different types of dendrogram
#pdf("fig4.pdf", width = 10, height = 10)
op <- par(mfrow = c(2, 2), mar = rep(0, 4))
plot(phy, edge.color = edge.col, type = "phylogram", show.tip.label = F)
plot(phy, edge.color = edge.col, type = "cladogram", show.tip.label = F)
plot(phy, edge.color = edge.col, type = "fan", show.tip.label = F)
plot(phy, edge.color = edge.col, type = "unrooted", show.tip.label = F)
Fig.4 Various styles of dendrograms drawn using package ape
par(op)
#dev.off()
Figure 4 depicts the result of the same cluster analysis in different styles. The impression and the ease of understanding differ with the style. If you want to grasp the genetic relationships among varieties and lines globally, the fourth, “unrooted” type of tree is probably the most suitable.
Drawing a cluster analysis result with the package ape is somewhat cumbersome because it requires an intermediate conversion to the phylo class. So let’s wrap the series of steps in a function of our own to simplify plotting the results of cluster analysis.
# create an own function
myplot <- function(tre, subpop, type = "unrooted", ...) {
    phy <- ape::as.phylo(tre)
    col <- as.numeric(subpop[phy$edge[, 2]])
    edge.col <- ifelse(is.na(col), "gray", col)
    plot(phy, edge.color = edge.col, type = type, show.tip.label = F, ...)
}
Let’s draw a dendrogram using the self-made function myplot.
# use the function
d <- dist(data.mk)
tre <- hclust(d)
myplot(tre, subpop)
Cluster analysis calculates distances between samples and between clusters, and performs clustering based on the calculated distances. Therefore, different definitions of distance give different results. Here, we explain the definitions of the distance between samples and of the distance between clusters.
First, consider the distance between samples, for which there are various definitions. Let’s begin by drawing dendrograms based on differently defined distances.
# try different methods for calculating distance
#pdf("fig5.pdf", width = 10, height = 10)
op <- par(mfrow = c(2, 2), mar = rep(0, 4))
d <- dist(data.mk, method = "euclidean") # default method
myplot(hclust(d), subpop)
d <- dist(data.mk, method = "manhattan")
myplot(hclust(d), subpop)
d <- dist(data.mk, method = "minkowski", p = 1.5)
myplot(hclust(d), subpop)
d <- as.dist(1 - cor(t(data.mk)))
myplot(hclust(d), subpop)
Fig.5 Dendrograms calculated based on different definitions of distances between samples
par(op)
#dev.off()
In this data, the topology of the dendrogram does not change significantly even if the definition of distance is different, but depending on the data, the definition of distance may have a large effect.
Here are the definitions of the distances between samples used above. Suppose each sample is described by \(q\) features, and let the data vector of the \(i\)-th sample be \(\mathbf{x}_i=(x_{i1},...,x_{iq})^T\) and that of the \(j\)-th sample be \(\mathbf{x}_j=(x_{j1},...,x_{jq})^T\). The distance between samples \(i\) and \(j\), \(d(\mathbf{x}_i,\mathbf{x}_j)\), is then defined as follows.
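Euclidean distance (method = "euclidean"): \[ d(\mathbf{x}_i,\mathbf{x}_j)=\sqrt{\sum_{k=1}^{q}(x_{ik}-x_{jk})^2} \]
Manhattan distance (method = "manhattan"): \[ d(\mathbf{x}_i,\mathbf{x}_j)=\sum_{k=1}^{q}|x_{ik}-x_{jk}| \]
Minkowski distance (method = "minkowski"): \[ d(\mathbf{x}_i,\mathbf{x}_j)=\left(\sum_{k=1}^{q}|x_{ik}-x_{jk}|^{p}\right)^{1/p} \]
Correlation-based distance: \[ d(\mathbf{x}_i,\mathbf{x}_j)=1-r_{ij} \] where \(r_{ij}\) is the correlation coefficient between \(\mathbf{x}_i\) and \(\mathbf{x}_j\).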
The Manhattan distance takes its name from travel around a city laid out in a grid of blocks, such as Manhattan in New York City. In such a city, when moving from the point \((0,0)\) to the point \((2,3)\), you cannot cut diagonally through the buildings (Euclidean distance \(\sqrt{13}\)); you have to move along the streets (Manhattan distance 5). The Minkowski distance is a generalization of the Euclidean and Manhattan distances: it corresponds to the Manhattan distance when \(p=1\) and to the Euclidean distance when \(p=2\).
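As a quick check of this example, the three distances between the two points can be computed with dist (a minimal illustration; the points are made up for demonstration):
x <- rbind(c(0, 0), c(2, 3))
dist(x, method = "euclidean")          # sqrt(13) = 3.6056
dist(x, method = "manhattan")          # |2| + |3| = 5
dist(x, method = "minkowski", p = 1.5) # approx. 4.01, between the two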
For correlation-based distances, calculate the correlation coefficient \(r\) “between samples instead of between variables” and use one minus that value as the distance. When the correlation is \(r=1\), the distance is \(1-r=0\); when \(r=0\), the distance is \(1-r=1\); and when \(r=-1\), the distance is \(1-r=2\). That is, the maximum value of a distance based on the correlation coefficient is 2. When performing cluster analysis based on the similarity of expression patterns between genes, the absolute value of the correlation coefficient, \(|r|\), is sometimes subtracted from 1 instead of \(r\). In this case, the distance is \(1-|r|=0\) when \(r=-1\) or \(r=1\) (i.e., \(|r|=1\)), and \(1-|r|=1\) when \(r=0\).
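If you want to try the absolute-correlation variant on this data, it can be computed in the same way as the correlation-based distance above (a minimal sketch; this variant is not used elsewhere in this lecture):
# distance based on the absolute value of the correlation coefficient
d <- as.dist(1 - abs(cor(t(data.mk))))
myplot(hclust(d), subpop)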
The function dist can also calculate the following distances. They were not used above because they are not well suited to this data, but depending on the nature of the data to be analyzed, the distances introduced below may be appropriate.
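Chebyshev distance: \[ d(\mathbf{x}_i,\mathbf{x}_j)=\max_{k}|x_{ik}-x_{jk}| \]
Hamming distance: \[ d(\mathbf{x}_i,\mathbf{x}_j)=\sum_{k=1}^{q}I(x_{ik}\neq x_{jk}) \] where \(I(\cdot)\) is 1 when its argument is true and 0 otherwise.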
The Chebyshev distance is based only on the one of the \(q\) features that differs the most. It is the limit of the Minkowski distance as \(p\rightarrow\infty\). The Hamming distance is commonly used in information science: comparing the values at each position of two sequences of equal length, it counts the number of positions that do not match. In most data to which the Hamming distance is applied, \(x_{ik}\) is not a continuous value but a discrete one (e.g., 0/1).
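In dist, the Chebyshev distance is available as method = "maximum"; the Hamming distance has no dedicated option, but for 0/1-coded data it coincides with the Manhattan distance (a minimal illustration with made-up vectors):
# Chebyshev distance via method = "maximum"
x <- rbind(c(0, 0), c(2, 3))
dist(x, method = "maximum")    # max(|2|, |3|) = 3
# Hamming distance for 0/1 data via the Manhattan distance
b <- rbind(c(0, 1, 1, 0),
           c(0, 1, 0, 1))
dist(b, method = "manhattan")  # 2 (two positions differ)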
So far we have described definitions of the distance between samples. In hierarchical cluster analysis, samples close to each other are merged into a cluster, and then samples and clusters, or pairs of clusters, are merged into higher-level clusters. Therefore, you need to define not only the distance between samples but also the distance between a sample and a cluster and between two clusters.
First, let’s draw a dendrogram based on various definitions of inter-cluster distance. In the hclust function, the calculation method (definition) of the distance between clusters can be specified by the option method.
#pdf("fig6.pdf", width = 10, height = 10)
d <- dist(data.mk)
op <- par(mfrow = c(2, 3), mar = rep(0, 4))
tre <- hclust(d, method = "complete") # default method
myplot(tre, subpop)
tre <- hclust(d, method = "single")
myplot(tre, subpop)
tre <- hclust(d, method = "average")
myplot(tre, subpop)
tre <- hclust(d, method = "median")
myplot(tre, subpop)
tre <- hclust(d, method = "centroid")
myplot(tre, subpop)
tre <- hclust(d, method = "ward.D2")
myplot(tre, subpop)
Fig.6 Dendrogram based on various inter-cluster distance definitions
par(op)
#dev.off()
As you can see in Figure 6, unlike the definition of the inter-sample distance, the definition of the inter-cluster distance changes the topology of the dendrogram substantially. In some cases, branch lengths become negative, producing strange dendrograms (lower left, lower center). Differences between clusters may also be strongly emphasized (lower right). It is difficult to say which of these definitions should be chosen, but in many cases the one with no major contradiction with known (a priori) information is preferred. For example, here it is better to choose one that is least inconsistent with the subpopulations to which the varieties/lines belong.
Here are the definitions of the distance between clusters that can be specified in the function hclust. Based on the distance between samples, \(d(\mathbf{x}_i,\mathbf{x}_j)\), the distance between clusters A and B, \(d_{AB}\), is calculated as follows:
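Complete linkage (method = "complete", the default): \[ d_{AB}=\max_{i\in A,\,j\in B}d(\mathbf{x}_i,\mathbf{x}_j) \]
Single linkage (method = "single"): \[ d_{AB}=\min_{i\in A,\,j\in B}d(\mathbf{x}_i,\mathbf{x}_j) \]
Average linkage (method = "average"): \[ d_{AB}=\frac{1}{n_A n_B}\sum_{i\in A}\sum_{j\in B}d(\mathbf{x}_i,\mathbf{x}_j) \] where \(n_A\) and \(n_B\) are the numbers of samples in clusters A and B.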
In the following three definitions, when clusters A and B merge to form a new cluster C, the distance \(d_{CO}\) between the new cluster C and any other cluster O (other than A and B) is defined as follows. The distance between clusters A and B is denoted by \(d_{AB}\), that between A and O by \(d_{AO}\), that between B and O by \(d_{BO}\), and the numbers of samples contained in clusters A, B, and O by \(n_A\), \(n_B\), and \(n_O\).
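These correspond to the standard Lance–Williams update formulas, stated here in terms of squared distances:
Median method (method = "median"): \[ d_{CO}^2=\frac{1}{2}d_{AO}^2+\frac{1}{2}d_{BO}^2-\frac{1}{4}d_{AB}^2 \]
Centroid method (method = "centroid"): \[ d_{CO}^2=\frac{n_A}{n_A+n_B}d_{AO}^2+\frac{n_B}{n_A+n_B}d_{BO}^2-\frac{n_A n_B}{(n_A+n_B)^2}d_{AB}^2 \]
Ward’s method (method = "ward.D2"): \[ d_{CO}^2=\frac{(n_A+n_O)\,d_{AO}^2+(n_B+n_O)\,d_{BO}^2-n_O\,d_{AB}^2}{n_A+n_B+n_O} \]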
Let’s compare in more detail the two methods in Figure 6 for which the correspondence with the subpopulations seems clearest.
# focus on two clustering methods
op <- par(mfrow = c(1, 2), mar = rep(0, 4))
d <- dist(data.mk)
tre <- hclust(d, method = "complete")
myplot(tre, subpop, type = "phylogram")
tre <- hclust(d, method = "ward.D2")
myplot(tre, subpop, type = "phylogram")
Fig.7 Dendrograms of the complete linkage (furthest-neighbour) method (left) and Ward’s method (right)
par(op)
So far, we have classified varieties and lines into clusters based on DNA marker data. Cluster analysis of varieties and lines can be performed based on trait data as well. For the same data, it is also possible to treat the traits, rather than the varieties/lines, as the objects to be classified, grouping traits that show similar patterns of variation across varieties/lines into the same cluster. Here we explain such an approach.
First, let’s prepare the trait data by extracting it from the full data (alldata), excluding traits unsuitable for this kind of analysis.
# preparation of data
required.traits <- c("Flowering.time.at.Arkansas",
"Flowering.time.at.Faridpur", "Flowering.time.at.Aberdeen",
"Culm.habit", "Flag.leaf.length", "Flag.leaf.width",
"Panicle.number.per.plant", "Plant.height", "Panicle.length",
"Primary.panicle.branch.number", "Seed.number.per.panicle",
"Florets.per.panicle", "Panicle.fertility", "Seed.length",
"Seed.width","Brown.rice.seed.length", "Brown.rice.seed.width",
"Straighthead.suseptability","Blast.resistance",
"Amylose.content", "Alkali.spreading.value", "Protein.content")
data.tr <- alldata[, required.traits]
missing <- apply(is.na(data.tr), 1, sum) > 0
data.tr <- data.tr[!missing, ]
subpop.tr <- alldata$Sub.population[!missing]
The scale (variance) of the trait data differs from trait to trait. If the data are used as they are, traits with large variance have a large effect on the distance calculation, while traits with small variance contribute little. Therefore, all traits are standardized to variance 1.
# scaling
data.tr <- scale(data.tr)
Now let’s perform both a cluster analysis that classifies the varieties/lines and a cluster analysis that classifies the traits, based on the trait data.
# perform clustering for both varieties and traits
d <- dist(data.tr)
tre.var <- hclust(d, method = "ward.D2")
d <- dist(t(data.tr))
tre.tra <- hclust(d, "ward.D2")
op <- par(mfrow = c(1, 2))
myplot(tre.var, subpop.tr, type = "phylogram")
plot(tre.tra, cex = 0.5)
Fig.8 Cluster analysis based on trait data. Dendrograms showing the relationship between varieties and lines (left) and the relationship between traits (right)
par(op)
From the dendrogram on the right side of Figure 8, it can be seen that the traits related to plant size (Plant.height, Panicle.length, Flag.leaf.length) are closely related to each other. You can also see that the flowering times measured in the three environments (Flowering.time.at.*) fall into a nearby cluster. In addition, it is also clear that the flag leaf width (Flag.leaf.width) behaves differently from the other size-related traits and is strongly associated with the traits that characterize the panicle. In this way, multidimensional data can be cluster-analyzed from either side. With this in mind, you will be able to view the same data from a slightly different perspective.
Note that the results of the above analysis can be displayed more visually using the heatmap function.
# perform clustering from both sides
#pdf("fig9.pdf")
heatmap(data.tr, margins = c(12,2))
Fig.9 Display of cluster analysis results and heatmap of trait data
#dev.off()
The gplots package includes another heatmap function, heatmap.2 (and there are many other such functions in other packages). Let’s draw a heat map using it. Although the options are given slightly differently, it produces a nice-looking figure.
#pdf("fig9-2.pdf", height = 12)
gplots::heatmap.2(data.tr, margins = c(12,2), col = gplots::redgreen(256), trace = "none", lhei = c(2,10), cexRow = 0.3)
Fig.9-2 Display heat map of trait data using heatmap.2 function
#dev.off()
The heat map display can also reflect the results of a cluster analysis performed separately. Here, we reflect the results of the cluster analyses performed above for the same data in the heat map display.
# perform clustering with appropriate methods
#pdf("fig10.pdf")
heatmap(data.tr, Rowv = as.dendrogram(tre.var),
Colv = as.dendrogram(tre.tra),
RowSideColors = as.character(as.numeric(subpop.tr)),
labRow = subpop.tr,
margins = c(12, 2))
Fig.10 The result of changing the cluster analysis method in Figure 9 to Ward’s method
#dev.off()
This can also be drawn using the heatmap.2 function as before.
# this part is again optional
#pdf("fig10-2.pdf", height = 12)
gplots::heatmap.2(data.tr, Rowv = as.dendrogram(tre.var),
Colv = as.dendrogram(tre.tra),
RowSideColors = as.character(as.numeric(subpop.tr)),
labRow = subpop.tr,
margins = c(12,2), col = gplots::redgreen(256), trace = "none", lhei = c(2,10), cexRow = 0.3)
Fig.10-2 The result of changing the cluster analysis method of Figure 9-2 to Ward’s method
#dev.off()
With the heatmap function, the cluster analyses in the vertical and horizontal directions do not have to be based on the same data. For example, you can use the results of a cluster analysis based on DNA marker data to order the rows.
# perform clustering based on marker genotypes for determining row order
data.mk2 <- data.mk[!missing, ]
d <- dist(data.mk2)
tre.mrk <- hclust(d, method = "ward.D2")
#pdf("fig11.pdf")
heatmap(data.tr, Rowv = as.dendrogram(tre.mrk),
Colv = as.dendrogram(tre.tra),
RowSideColors = as.character(as.numeric(subpop.tr)),
labRow = subpop.tr,
margins = c(12, 2))
Fig.11 Relationship between the results of cluster analysis based on genetic marker data and traits
#dev.off()
This can also be drawn using the heatmap.2 function as before.
# this part is optional
#pdf("fig11-2.pdf", height = 12)
gplots::heatmap.2(data.tr, Rowv = as.dendrogram(tre.mrk),
Colv = as.dendrogram(tre.tra),
RowSideColors = as.character(as.numeric(subpop.tr)),
labRow = subpop.tr,
margins = c(12,2), col = gplots::redgreen(256), trace = "none", lhei = c(2,10), cexRow = 0.3)
Fig.11-2 Relationship between the results of cluster analysis and traits based on genetic marker data expressed using heatmap.2 function
#dev.off()
Sometimes, rather than depicting the degrees of similarity between samples, you want to categorize the samples discretely into groups. This section describes how to classify samples into a fixed number of clusters based on the results of hierarchical cluster analysis.
Based on the result of the hierarchical cluster analysis of the DNA marker data, let’s classify the varieties/lines into five groups. Five corresponds to the number of subpopulations (other than the admixed group) to which the varieties/lines belong. From the result of the hierarchical cluster analysis, we use the function cutree to obtain discrete groups.
# classify samples with the cutree function
d <- dist(data.mk)
tre <- hclust(d, method = "ward.D2")
cluster.id <- cutree(tre, k = 5)
head(cluster.id, 10)
## 1 3 4 5 6 7 8 9 10 11
## 1 2 3 4 3 5 5 1 1 2
Let’s visualize the result of the classification into five groups based on the cluster analysis.
# visualize the result
op <- par(mfrow = c(1,2), mar = rep(0, 4))
myplot(tre, cluster.id, type = "phylogram")
myplot(tre, subpop, type = "phylogram", direction = "leftwards")
Fig.12 Relationship between classification by cluster analysis (left) and divided groups (right)
par(op)
Let’s check the relationship between the cluster-analysis classification and the subpopulations by creating a cross-tabulation table.
# obtain the cross table of classification
table(cluster.id, subpop)
## subpop
## cluster.id ADMIX AROMATIC AUS IND TEJ TRJ
## 1 5 0 0 0 84 0
## 2 14 0 0 80 0 0
## 3 1 0 52 0 0 0
## 4 2 14 0 0 0 0
## 5 34 0 0 0 3 85
It can be seen that the two match very well: apart from the admixed lines, only three temperate japonica (TEJ) varieties/lines fall outside their subpopulation’s main cluster. This is probably because the subpopulation structure itself was estimated based on DNA marker data. In addition, varieties estimated to be a mixture of multiple subpopulations (ADMIX) are classified into various groups.
Let’s confirm the result of classification by hierarchical cluster analysis on the principal component axis.
# visualize the result in the PCA space
pca <- prcomp(data.mk)
op <- par(mfrow = c(1,2))
plot(pca$x[,1:2], pch = cluster.id, col = as.numeric(subpop))
plot(pca$x[,3:4], pch = cluster.id, col = as.numeric(subpop))
Fig.13 Relationship between the classification by cluster analysis and the subpopulations
par(op)
When classifying samples into a fixed number of groups, the classification does not have to be hierarchical. Here we introduce the k-means method, one of the non-hierarchical cluster analysis methods.
For the same data as before, let’s classify it into five groups using the function kmeans.
# kmeans clustering
kms <- kmeans(data.mk, centers = 5)
#kms
head(kms$cluster, 20)
## 1 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 19 20 22 24
## 1 4 5 2 5 2 2 1 1 3 2 5 1 2 3 5 5 2 2 2
The k-means method classifies samples into the predetermined number of groups using the following algorithm.
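1. Randomly select \(k\) samples and use them as the initial centers of the \(k\) groups.
2. Assign each sample to the group whose center is closest.
3. Recalculate the center of each group as the mean (centroid) of the samples currently assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.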
With the k-means method, the results may vary depending on the randomly chosen initial centers. Let’s actually repeat the analysis with the same data and check how the results vary.
# repeat kmeans clustering
for(i in 1:5) {
kms <- kmeans(data.mk, centers = 5)
print(table(kms$cluster, subpop))
}
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 16 0 0 0 1 41
## 2 3 14 0 0 0 0
## 3 17 0 0 0 86 0
## 4 7 0 0 0 0 44
## 5 13 0 52 80 0 0
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 4 0 0 0 84 0
## 2 7 0 0 79 0 0
## 3 5 0 52 1 0 0
## 4 33 0 0 0 3 2
## 5 7 14 0 0 0 83
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 10 0 0 79 0 0
## 2 17 0 0 0 86 0
## 3 3 14 0 0 0 0
## 4 3 0 52 1 0 0
## 5 23 0 0 0 1 85
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 26 14 0 0 1 85
## 2 1 0 34 0 0 0
## 3 5 0 18 1 0 0
## 4 17 0 0 0 86 0
## 5 7 0 0 79 0 0
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 13 0 52 80 0 0
## 2 22 0 0 0 5 1
## 3 5 0 0 0 82 0
## 4 15 14 0 0 0 32
## 5 1 0 0 0 0 52
Each time we run it, we get a different result. This is due to the algorithm’s strong dependence on its initial values, described above. So let’s repeat the analysis, this time starting each run from many different initial values. Here, each run is started from 50 different sets of initial values (nstart = 50).
# start from multiple sets of initial points
for(i in 1:5) {
kms <- kmeans(data.mk, centers = 5, nstart = 50)
print(table(kms$cluster, subpop))
}
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 12 0 0 80 0 0
## 2 1 0 52 0 0 0
## 3 23 0 0 0 1 85
## 4 17 0 0 0 86 0
## 5 3 14 0 0 0 0
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 3 14 0 0 0 0
## 2 23 0 0 0 1 85
## 3 12 0 0 80 0 0
## 4 1 0 52 0 0 0
## 5 17 0 0 0 86 0
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 3 14 0 0 0 0
## 2 23 0 0 0 1 85
## 3 17 0 0 0 86 0
## 4 12 0 0 80 0 0
## 5 1 0 52 0 0 0
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 23 0 0 0 1 85
## 2 12 0 0 80 0 0
## 3 17 0 0 0 86 0
## 4 3 14 0 0 0 0
## 5 1 0 52 0 0 0
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 1 0 52 0 0 0
## 2 3 14 0 0 0 0
## 3 12 0 0 80 0 0
## 4 23 0 0 0 1 85
## 5 17 0 0 0 86 0
Unlike before, the results are now stable. Note that although the “number” of the group into which each sample is classified varies between runs, this is not a problem, because it is just an arbitrary label assigned to the five groups.
Let’s create cross-tabulation tables and check the relationships among the groups found by k-means, the groups found by hierarchical cluster analysis, and the subpopulations to which the varieties/lines belong.
# compare results
print("k-means vs. subpopulations")
## [1] "k-means vs. subpopulations"
table(kms$cluster, subpop)
## subpop
## ADMIX AROMATIC AUS IND TEJ TRJ
## 1 1 0 52 0 0 0
## 2 3 14 0 0 0 0
## 3 12 0 0 80 0 0
## 4 23 0 0 0 1 85
## 5 17 0 0 0 86 0
print("hierarchical clustering vs. subpopulations")
## [1] "hierarchical clustering vs. subpopulations"
table(cluster.id, subpop)
## subpop
## cluster.id ADMIX AROMATIC AUS IND TEJ TRJ
## 1 5 0 0 0 84 0
## 2 14 0 0 80 0 0
## 3 1 0 52 0 0 0
## 4 2 14 0 0 0 0
## 5 34 0 0 0 3 85
print("k-means vs. hierarchical clustering")
## [1] "k-means vs. hierarchical clustering"
table(kms$cluster, cluster.id)
## cluster.id
## 1 2 3 4 5
## 1 0 0 53 0 0
## 2 1 0 0 16 0
## 3 0 92 0 0 0
## 4 0 1 0 0 108
## 5 88 1 0 0 14
From the cross-tabulation tables, it can be seen that varieties and lines belonging to subpopulations other than ADMIX are classified in almost the same way by k-means and hierarchical cluster analysis. The third cross-tabulation table shows that the classification results of the two methods are almost identical, although some differences can be seen. These are mainly because the classification of varieties/lines whose subpopulation is ADMIX (mixed) differs between the two methods.
Let’s confirm the results of the classifications by the k-means method and by hierarchical cluster analysis by plotting them on the principal component axes.
# match id between the results of kmeans and hclust
convert.table <- apply(table(kms$cluster, cluster.id), 1, which.max)
convert.table
## 1 2 3 4 5
## 3 4 2 5 1
cluster.id.kms <- convert.table[kms$cluster]
# draw graph based on the converted id
#pdf("fig14.pdf", width = 8, height = 8)
op <- par(mfrow = c(2,2))
plot(pca$x[,1:2], pch = cluster.id, col = as.numeric(subpop), main = "hclust")
plot(pca$x[,3:4], pch = cluster.id, col = as.numeric(subpop), main = "hclust")
plot(pca$x[,1:2], pch = cluster.id.kms, col = as.numeric(subpop), main = "kmeans")
plot(pca$x[,3:4], pch = cluster.id.kms, col = as.numeric(subpop), main = "kmeans")
Fig.14 Relationship between the principal component scores and the classifications by hierarchical cluster analysis (top) and by k-means (bottom)
par(op)
#dev.off()
So far, the number of groups has been set to 5, matching the number of subpopulations. What should we do to check whether five groups is really appropriate?
One way to determine an appropriate number of groups is to classify the samples into various numbers of groups and examine how the within-group sum of squares decreases. As explained in the analysis of variance, the total sum of squares can be divided into the between-group sum of squares and the within-group sum of squares. When the number of groups is 1, the total sum of squares equals the within-group sum of squares. As the number of groups increases to 2, 3, 4, ..., the between-group sum of squares increases and the within-group sum of squares decreases. When the number of groups finally equals the number of samples, the within-group sum of squares becomes 0. Simply minimizing the within-group sum of squares is therefore meaningless, because it always selects as many groups as there are samples. Instead, as with the rule for determining the number of components in principal component analysis, find the point where the decrease in the within-group sum of squares changes from steep to gradual, and adopt that as the number of groups.
Now let’s vary the number of groups from 1 to 10, calculate the within-group sum of squares, and illustrate graphically how it decreases.
# visualize within-group and between-groups sum of squares
n <- nrow(data.mk)
wss <- rep(NA, 10)
wss[1] <- (n - 1) * sum(apply(data.mk, 2, var))
for(i in 2:10) {
print(i)
res <- kmeans(data.mk, centers = i, nstart = 50)
wss[i] <- sum(res$withinss)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
plot(1:10, wss, type = "b", xlab = "Number of groups",
ylab = "Within groups sum of squares")
Fig.15 Change in the within group variation when the number of clusters is 1 to 10 by k-means method
In Fig. 15, the decrease in the within-group sum of squares becomes roughly linear once the number of clusters exceeds five. This figure also suggests that 5 is a suitable number of groups.
So far, each sample has always been classified into exactly one group. Among the samples classified into a given group, however, some are classified there clearly and others only “barely”. To assess how certain a classification is, it is useful to be able to detect samples with an ambiguous classification, like the latter, by some criterion. Here we introduce a method for evaluating classification ambiguity based on a statistic called the shadow value (Everitt and Hothorn 2011, An Introduction to Applied Multivariate Analysis with R, Springer).
The result returned by kmeans (kms) contains the centroid of each group determined by the k-means method. Here, from each sample we calculate the distances to these group centroids, take the distance to the nearest centroid and the distance to the second nearest centroid, and evaluate the ambiguity based on the difference between them.
First we calculate the distance between all samples and group centroids using the function rdist which is included in the package fields.
d2ctr <- fields::rdist(kms$centers, data.mk)
d2ctr[, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 49.59083 32.90422 20.82710 39.54322 21.82860 44.50590 44.34419 45.67021
## [2,] 40.35414 41.98393 39.47666 13.04416 39.19190 38.61061 37.13867 35.43122
## [3,] 48.19311 20.62444 28.70590 39.76012 29.21810 43.21613 42.58438 44.71007
## [4,] 36.10209 43.10088 41.49647 34.88985 41.45680 24.50574 22.59442 34.04487
## [5,] 19.66441 48.58701 47.34412 38.28548 47.10191 37.74005 36.65759 26.00505
## [,9] [,10]
## [1,] 51.53614 32.28608
## [2,] 42.25515 42.21440
## [3,] 49.99222 22.91782
## [4,] 37.71899 43.98252
## [5,] 20.32749 49.01558
apply(d2ctr, 2, which.min)
## [1] 5 3 1 2 1 4 4 5 5 3 2 1 5 2 3 1 1 4 4 4 4 4 4 3 5 3 3 4 1 4 2 3 1 2 4 1 1
## [38] 5 5 2 4 5 5 3 1 4 5 3 5 5 5 4 3 5 4 4 3 3 4 3 4 3 3 1 5 3 1 5 4 1 4 1 4 5
## [75] 4 2 5 4 4 5 4 5 1 3 4 4 3 3 2 5 4 5 4 3 5 3 4 5 4 3 2 3 3 5 3 3 1 3 5 5 4
## [112] 3 3 4 4 3 3 5 5 3 3 4 3 4 4 5 1 1 5 5 3 5 5 3 2 3 3 3 4 4 3 4 3 5 4 3 3 4
## [149] 4 5 1 5 5 5 4 4 5 5 4 4 3 4 2 5 3 4 3 4 1 4 4 3 5 3 3 3 3 4 4 4 4 4 4 4 5
## [186] 5 2 3 5 5 3 1 3 5 5 3 3 4 5 4 4 3 4 5 4 5 5 5 3 5 4 3 3 3 5 5 4 3 2 1 5 5
## [223] 5 5 5 5 3 5 5 3 4 4 5 1 5 5 5 4 5 5 5 3 4 4 5 5 5 5 5 3 3 5 5 3 3 5 5 5 5
## [260] 3 5 5 4 4 4 5 1 3 1 3 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 5 5 4 1 3 5 3 2 1 4
## [297] 5 2 1 1 4 3 3 4 4 1 5 3 1 1 1 5 4 5 5 4 5 1 1 1 2 4 5 4 1 4 5 4 4 3 4 5 5
## [334] 4 4 4 4 4 4 4 3 4 3 3 4 4 3 4 4 3 4 4 3 4 4 4 4 4 3 3 4 3 1 4 5 2 5 3 3 3
## [371] 5 4 1 5
As mentioned earlier, k-means classifies each sample into the group whose centroid is closest. Therefore, the assignments displayed above with apply(d2ctr, 2, which.min) match kms$cluster.
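This can be confirmed directly (an extra check, not part of the original analysis):
# confirm that the nearest-centroid assignments equal kms$cluster
all(apply(d2ctr, 2, which.min) == kms$cluster)  # TRUE is expected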
From the distances calculated for each sample, the min function gives the distance to the nearest centroid. But how do we get the distance to the second nearest centroid? Here, we do this with a function of our own, nth.min, which sorts the array given as the argument x in ascending order and returns the n-th value.
# prepare own function find the second best
nth.min <- function(x, n) {
    sort(x)[n]
}
nth.min(-10:10, 3)
## [1] -8
d.1st <- apply(d2ctr, 2, min)
d.2nd <- apply(d2ctr, 2, nth.min, n = 2)
When the commands above are executed, d.1st holds the distance from each sample to the nearest centroid, and d.2nd the distance to the second nearest centroid.
Next, let’s calculate the shadow value. The shadow value for the \(i\)-th sample is defined as \[ s(\mathbf{x}_i)=\frac{2d(\mathbf{x}_i,c(\mathbf{x}_i))}{d(\mathbf{x}_i,c(\mathbf{x}_i))+d(\mathbf{x}_i,\tilde c(\mathbf{x}_i))} \] where \(d(\mathbf{x}_i,c(\mathbf{x}_i))\) is the distance from the observed value of the \(i\)-th sample, \(\mathbf{x}_i\), to the centroid of the nearest group (the centroid of the group into which the sample was classified), while \(d(\mathbf{x}_i,\tilde c(\mathbf{x}_i))\) represents the distance to the centroid of the second nearest group. This value ranges from 0 to 1. A value close to 0 indicates that the sample lies near the centroid of the group into which it was classified; conversely, a value close to 1 means that the distances to the centroids of its own group and of the second nearest group are almost the same. Therefore, to detect samples whose classification is ambiguous, we should look for samples whose shadow value is close to 1.
Let’s calculate the shadow values using the distances d.1st and d.2nd computed above, and find the samples whose value is 0.9 or more.
# evaluate unclearness of classifications (shadow values)
shadow <- 2 * d.1st / (d.1st + d.2nd)
unclear <- shadow > 0.9
The result of the detection is assigned to “unclear”. If the value is TRUE, the classification is considered ambiguous; if FALSE, it is considered relatively clear.
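It may also be informative to cross-tabulate this flag against the subpopulations (an extra check, not part of the original analysis):
# count ambiguous classifications per subpopulation
table(subpop, unclear)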
Now let’s draw scatter plots on the principal component axes, marking the samples whose classification was judged ambiguous with ● (pch = 20).
# visualize the result
cluster.id.kms[unclear] <- 20
op <- par(mfrow = c(1,2))
plot(pca$x[,1:2], pch = cluster.id.kms, col = as.numeric(subpop), main = "kmeans")
plot(pca$x[,3:4], pch = cluster.id.kms, col = as.numeric(subpop), main = "kmeans")
Fig.16 Scatter plot of the samples with the marker for unclear classification (●)
par(op)
As Figure 16 shows, most of the varieties and lines whose subpopulation is ADMIX (mixed; plotted in black) are marked with ●. By evaluating the ambiguity (in other words, the certainty) of the classification of each sample in this way, the classification results can be grasped in more detail. In this example, we could identify varieties/lines whose genetic background appears to be a mixture of multiple subpopulations.
Cluster analysis can also be used to select a small number of representative samples from a large number of samples. For example, a large collection of genetic resources can be classified by cluster analysis based on existing data, and representative varieties/lines can be selected based on the classification. Representative varieties/lines selected in this way are often used for field trials and molecular-biology experiments that require much time and cost.
Here we introduce the k-medoids method as a cluster analysis method suited to selecting such representative samples. The k-medoids method is similar to the k-means method, but instead of grouping samples based on the distance to the centroid of each group, it groups them based on the distance to a representative sample of each group (its medoid). In other words, the center of a cluster is not its centroid but the coordinates of the group’s representative sample.
Using the k-medoids method, let’s select 48 representative samples from the 229 varieties and lines contained in the phenotypic data (data.tr). The function pam, which performs the k-medoids method, is included in the package cluster. Among the results it returns, the medoid IDs (id.med) are the IDs of the samples selected as representatives.
# select 48 varieties/lines (by classifying all samples into 48 groups)
n.sel <- 48
kmed <- cluster::pam(data.tr, k = n.sel)
head(kmed)
## $medoids
## Flowering.time.at.Arkansas Flowering.time.at.Faridpur
## 1 -0.84468910 -0.9295405
## 61 0.40615449 0.3485777
## 16 0.49444934 -0.8133479
## 316 -0.10889875 0.5809628
## 214 1.23023969 0.9295405
## 154 -0.68281523 -0.9295405
## 10 0.38408078 -1.9752735
## 108 1.54662954 0.2323851
## 24 0.55331256 0.4647702
## 377 0.39143869 0.6971553
## 73 0.59745999 0.6971553
## 40 0.97271306 -3.0210065
## 123 0.64896531 0.9295405
## 314 -0.30756215 0.5809628
## 45 -0.25605682 -0.8133479
## 157 0.29578594 0.9295405
## 220 -1.92630092 -0.8133479
## 60 -0.20455150 -0.4647702
## 170 0.11919626 0.5809628
## 177 -0.76375217 -0.8133479
## 81 -0.12361456 0.5809628
## 84 0.39879659 4.4153172
## 342 0.49444934 0.8133479
## 188 1.48776631 0.0000000
## 107 -0.28548844 -3.0210065
## 112 1.09043952 0.6971553
## 250 -2.37513304 -2.2076586
## 236 -0.22662521 0.4647702
## 180 -0.46207812 -0.5809628
## 122 1.98319848 0.4647702
## 356 0.20749110 -0.4647702
## 271 -0.52829925 -1.0457330
## 163 1.59813486 2.7886214
## 247 -0.85940491 -0.8133479
## 162 2.32656731 0.6971553
## 168 1.45097679 0.6971553
## 169 0.09712255 0.5809628
## 348 -1.02863669 -1.7428884
## 217 -0.45472022 -0.2323851
## 343 -0.79318378 -0.2323851
## 218 0.10448045 0.6971553
## 268 -2.14115170 -0.8133479
## 291 -1.55104784 -1.2781181
## 237 -0.80054168 -0.6971553
## 239 -0.54301506 0.4647702
## 272 2.06904069 -1.1619256
## 277 -0.20455150 -0.5809628
## 369 -0.78582588 0.4647702
## Flowering.time.at.Aberdeen Culm.habit Flag.leaf.length Flag.leaf.width
## 1 -0.65883657 -0.12403316 -0.430259255 0.15313754
## 61 -0.16712476 2.05986166 0.390547174 -1.13475849
## 16 0.61141895 0.96791425 -0.147909619 -1.07036369
## 316 -0.86371649 1.78687481 1.048067066 0.99026996
## 214 1.14410674 0.14895370 1.323613096 0.86148035
## 154 -0.20810074 -0.67000686 -1.032378359 -1.00596889
## 10 -0.94566846 -0.67000686 -0.513360439 -0.94157409
## 108 0.89825084 -1.76195427 1.778482733 1.31224396
## 24 0.69337091 -0.67000686 0.887696360 1.11905956
## 377 0.48849099 -0.67000686 0.167486102 1.18345436
## 73 0.36556304 -0.39702001 0.791473937 0.76488815
## 40 1.47191462 -0.94299371 0.394920921 1.37663876
## 123 -0.37200468 0.42194055 -0.602779256 -0.16883647
## 314 -0.00322082 1.24090111 1.603532873 1.08686216
## 45 -0.12614877 -0.67000686 -0.877839314 -1.36014030
## 157 1.88167446 -0.30602439 0.596599231 -0.46934554
## 220 -1.56030823 -0.39702001 0.005657481 -0.84498189
## 60 -0.37200468 0.14895370 0.636934894 0.79708555
## 170 -0.57688460 -1.76195427 -0.376316382 0.08874274
## 177 -0.86371649 0.05795808 -1.145123824 -0.29762607
## 81 -0.20810074 1.14990549 0.629645316 0.26046221
## 84 0.16068312 -0.12403316 0.475106273 1.63421797
## 342 0.40653902 -0.94299371 0.944555065 0.63609855
## 188 2.08655438 -1.76195427 1.708502789 2.23523612
## 107 -0.94566846 -0.12403316 -0.383605958 0.08874274
## 112 3.31583391 -0.88839634 0.496100256 -0.74838968
## 250 -1.56030823 0.90725051 -0.385063875 -1.07036369
## 236 -0.20810074 -0.39702001 -0.627077848 -0.32982348
## 180 1.06215477 0.14895370 0.179149425 -0.94157409
## 122 2.12753037 -0.12403316 1.510226281 0.95807256
## 356 -0.69981255 1.78687481 0.112085312 0.15313754
## 271 0.11970713 0.42194055 -0.420053847 -0.55520528
## 163 2.08655438 2.05986166 -0.334522804 -1.07036369
## 247 -0.69981255 0.42194055 -1.561601687 -2.03628571
## 162 0.77532288 0.05795808 3.255837110 -2.27239998
## 168 0.65239493 0.05795808 0.722951909 -1.13475849
## 169 0.20165910 0.96791425 0.776269961 -1.10716072
## 348 -1.19152437 1.78687481 -0.902623879 -0.52300788
## 217 0.11970713 -0.67000686 0.096048241 -1.26354809
## 343 -0.69981255 0.14895370 -0.067238295 -0.16883647
## 218 1.79972249 0.69492740 -0.946361344 -1.45673250
## 268 -1.56030823 0.42194055 1.186569039 -0.52300788
## 291 -1.19152437 -1.48896742 -2.195794931 -1.07036369
## 237 -0.86371649 -0.57901124 -1.405604727 -1.10256109
## 239 -0.49493263 -0.67000686 -0.793280216 0.81855049
## 272 0.40653902 -0.67000686 0.877490952 -0.78058708
## 277 1.67679454 -0.67000686 -0.966286189 1.03319982
## 369 -0.74078854 1.78687481 -0.934698019 -1.23135069
## Panicle.number.per.plant Plant.height Panicle.length
## 1 -0.36635080 -0.300031055 -1.06663830
## 61 1.84250994 1.167426588 0.49972675
## 16 0.47971752 2.071970582 1.71585257
## 316 0.27647886 0.845207051 0.68894365
## 214 -0.80385475 0.270646917 0.65065472
## 154 -0.48677421 -0.187448329 -1.33041417
## 10 0.99065981 -1.600555688 -0.60215885
## 108 -1.74085297 -0.213329414 1.33219756
## 24 -0.75874020 0.720977831 -0.37242531
## 377 -0.65427377 0.107596068 -0.17485446
## 73 -0.56409951 -0.672718711 -0.25985587
## 40 -0.59377139 -0.121451555 0.93329660
## 123 1.31487928 0.687332419 -0.20701716
## 314 1.10382219 0.620041594 0.00433770
## 45 0.31048254 1.109194142 1.24030415
## 157 0.09612401 -0.891413899 -0.18732571
## 220 0.74076723 -1.029877716 -1.23928653
## 60 -0.56409951 0.538516170 0.76538226
## 170 -0.69549982 -1.227868033 -0.20701716
## 177 0.86108186 -1.274453990 -1.09838329
## 81 0.61981687 1.023786551 0.50362526
## 84 -2.11078659 0.709331342 1.82382734
## 342 -0.68512470 0.033834966 0.21262944
## 188 -0.80187497 0.127006880 -0.33107327
## 107 -0.49630064 -0.144744534 -0.31728926
## 112 0.49536037 1.587254793 0.90183006
## 250 0.22802399 -1.057052857 -1.34649552
## 236 0.03107200 -0.004986663 -0.39386710
## 180 -0.26910207 -0.214623469 0.34434334
## 122 0.30372141 1.994327324 1.41030697
## 356 1.62645029 1.594464524 1.25158197
## 271 -1.39346711 -0.144744534 -0.08985305
## 163 2.03700306 1.237305524 0.75786371
## 247 0.91378118 -0.936705802 -0.37931731
## 162 1.25264475 0.547574548 0.95083988
## 168 0.81804671 1.960681907 2.16383298
## 169 0.00812393 1.234532551 1.37617512
## 348 1.37549413 0.305586385 -0.13120509
## 217 -0.84582365 0.158064186 1.06361817
## 343 -0.58383871 0.057127945 -0.53323878
## 218 0.04624607 -0.168037512 0.15136716
## 268 0.38031422 -0.529078678 -0.54013079
## 291 0.37834782 -2.310991531 -2.16128381
## 237 -0.93628981 -1.041524205 -0.28053189
## 239 -1.09123051 -0.487668938 0.09278511
## 272 0.43535026 1.168720639 2.16383298
## 277 -0.34836453 -0.757349867 -1.79677326
## 369 1.97836248 0.099831740 -1.33479999
## Primary.panicle.branch.number Seed.number.per.panicle Florets.per.panicle
## 1 -0.431283709 -0.258760638 -0.475933049
## 61 -1.102193707 -0.876079700 -0.895386728
## 16 -0.672535592 -0.411988727 -0.555640631
## 316 0.009862591 1.779281732 1.346165250
## 214 1.694796377 1.218266993 1.169580226
## 154 0.313150672 -0.880995499 -0.881536337
## 10 0.212054643 -2.421533974 -1.070689887
## 108 1.526302998 0.783516123 0.721383139
## 24 -0.697809599 -0.031257998 0.011025957
## 377 0.818630806 -0.007760557 -0.108825920
## 73 1.020822864 -0.233942620 -0.161231929
## 40 -0.707000147 0.248248555 0.436585850
## 123 -0.065959429 0.225343099 0.041056341
## 314 -0.866302977 0.455621796 0.087742049
## 45 -1.263947350 -1.019223437 -1.256637457
## 157 0.529785018 0.272822174 0.319137897
## 220 -1.270687085 -2.844547836 -2.898639507
## 60 0.212054643 0.780393490 1.235134586
## 170 0.313150672 0.162738678 0.396877904
## 177 0.313150672 0.088725396 -0.125655596
## 81 -1.430755795 0.189970641 -0.152136445
## 84 -0.040685422 1.167279818 1.791276653
## 342 0.683836104 1.054928315 1.526813885
## 188 1.121918887 2.007577296 2.009790344
## 107 0.175292452 -1.105116691 -0.349351235
## 112 0.414246702 0.086543832 0.007252743
## 250 -1.809865896 -2.267730366 -2.386656155
## 236 0.212054643 0.111542573 0.418318840
## 180 -0.394521517 -0.148856628 -0.153953524
## 122 1.829591079 -0.566814261 -0.410958930
## 356 0.395865606 0.410082280 0.212137580
## 271 0.414246702 0.012775580 0.167801038
## 163 0.481644052 0.637547278 0.566063087
## 247 -0.707000147 -1.226357451 -1.471648074
## 162 0.212054643 -0.783130250 -0.094353020
## 168 0.009862591 -0.089133501 0.393803434
## 169 -0.770021047 -3.247080853 -1.182466197
## 348 -1.152741721 -0.779791898 -1.230789125
## 217 0.892155188 1.067628007 1.072818192
## 343 0.892155188 0.264804800 0.197920016
## 218 -0.091233436 -0.779791898 -0.464469237
## 268 -1.253837747 -1.593159482 -1.714015333
## 291 -1.068495031 -1.116745113 -1.320347713
## 237 -0.596713571 -1.047599408 -1.225744631
## 239 0.085684611 -0.423852868 -0.563045045
## 272 -1.754722609 -2.613366309 -1.339268251
## 277 0.481644052 0.178328133 0.424419155
## 369 -2.085582334 -2.386711605 -2.782174502
## Panicle.fertility Seed.length Seed.width Brown.rice.seed.length
## 1 0.567830451 -0.47807783 1.32968778 -0.5426835
## 61 -0.064962376 -0.63213686 -0.31430244 -0.7177220
## 16 0.337723969 1.56524223 -1.82328313 1.7235440
## 316 1.419224437 -1.60011631 -0.33193066 -1.4624181
## 214 0.165144107 0.95478759 -1.41825456 1.2542065
## 154 -0.122488997 0.06374449 0.82027822 0.1481969
## 10 -3.366990401 -0.69856911 0.18885942 -1.4930995
## 108 0.176649431 0.56812254 0.65413531 0.5171695
## 24 -0.180015617 1.25236750 0.27895693 1.2994996
## 377 0.234176051 0.40412017 -0.01216130 0.5191880
## 73 -0.283563534 1.49261108 -0.01455942 1.5177690
## 40 -0.571196638 -0.30799301 1.51030352 -0.4040984
## 123 0.510303831 -1.07255506 -0.15493314 -1.1148193
## 314 1.097075361 -0.12506247 -1.01631830 -0.2206748
## 45 0.579335775 1.00875573 -0.63009598 1.1237658
## 157 -0.180015617 -1.19315610 0.51616671 -1.2461234
## 220 -0.099478348 -0.53850824 0.86630897 -0.5623752
## 60 -1.238505437 -0.71948595 0.38003526 -0.7133576
## 170 -0.697755203 1.02927842 -1.56083707 1.2556419
## 177 0.590841099 -0.58972091 0.20118656 -0.5042084
## 81 0.993527444 1.06064920 -1.27609277 1.1031433
## 84 -1.641191782 -0.70062944 0.52682498 -1.0287359
## 342 -1.273021410 -0.42880416 -0.36508161 -0.4540708
## 188 0.073101514 -1.18203031 -0.06725478 -0.8587405
## 107 -2.066888775 1.54168280 -0.19751015 1.6766584
## 112 0.176649431 0.26308602 -2.30496716 0.5877053
## 250 0.130628134 -1.05940477 1.26228820 -1.1978586
## 236 -0.904851037 -1.27413608 0.04724457 -1.5709916
## 180 -0.053457052 -1.41945214 1.09883791 -1.3020137
## 122 -0.536680665 -0.92315417 0.08942190 -0.5160727
## 356 0.556325127 -0.60489570 -0.47240957 -0.6878481
## 271 -0.502164693 1.78580518 1.41193884 1.8496734
## 163 0.188154755 -0.80383412 -1.38411302 -0.6885433
## 247 0.590841099 -0.59548088 1.12248246 -0.7662560
## 162 -1.905814237 -0.44797012 -1.62665895 -0.5749460
## 168 -1.365064003 0.29466283 -0.16585086 0.2592456
## 169 -4.609565407 2.21318954 0.83130112 1.9993567
## 348 1.281160547 -0.51645374 0.03504365 -0.5850385
## 217 -0.007435755 0.92450609 -1.50355584 1.1270066
## 343 0.153638782 0.59238070 0.47476070 0.5686752
## 218 -0.973882982 1.03291535 -1.36982952 1.1819326
## 268 0.188154755 0.58643261 0.60943365 0.3197815
## 291 0.464282534 -1.59959675 0.15842022 -1.7986124
## 237 0.395250589 2.92968316 -0.85724350 2.0024798
## 239 0.326218644 0.52911061 1.00867730 0.4042002
## 272 -3.228926512 4.34281550 -0.54896749 3.4797497
## 277 -0.732271176 0.03777536 1.74077474 -0.1811793
## 369 0.970516796 0.58274193 -1.22539584 0.5756054
## Brown.rice.seed.width Straighthead.suseptability Blast.resistance
## 1 1.404481793 -1.49925662 0.99698930
## 61 -0.298354717 0.43103397 -0.41451332
## 16 -1.811381664 0.79560418 -0.06163767
## 316 -0.081168774 -0.29327769 -0.06163767
## 214 -1.317666011 1.15655282 -0.41451332
## 154 0.863149235 -0.77373777 0.99698930
## 10 0.860887147 -2.58813849 -1.12026464
## 108 0.657787831 1.27847862 -1.47314029
## 24 0.344103361 0.49260046 -1.47314029
## 377 -0.021274910 0.79318980 -0.06163767
## 73 -0.111613458 0.43103397 -1.47314029
## 40 1.419865381 -0.53230054 -0.41451332
## 123 -0.001610187 1.27847862 -0.76738898
## 314 -0.947033533 0.43103397 -0.41451332
## 45 -0.589829617 0.55295977 -0.06163767
## 157 0.810093041 -1.13830797 -1.12026464
## 220 1.039848983 -0.29086332 0.29123799
## 60 0.336700171 -1.01638217 0.99698930
## 170 -1.660104674 -0.53471492 -0.76738898
## 177 0.130901771 -1.50046380 0.99698930
## 81 -1.332873397 0.31031536 -1.47314029
## 84 0.564684771 -0.17014471 -0.41451332
## 342 -0.570267715 0.55416695 -0.06163767
## 188 0.045788456 1.03704140 -1.47314029
## 107 -0.286427355 0.07008532 -1.82601595
## 112 -2.463864896 1.03704140 -1.47314029
## 250 1.585834243 -0.65543353 1.34986495
## 236 -0.022534480 -0.05063329 0.99698930
## 180 1.325540141 -0.29327769 0.99698930
## 122 0.128151283 0.43344834 -0.76738898
## 356 -0.484128515 1.27606425 -0.41451332
## 271 1.346394329 -1.37974519 1.34986495
## 163 -1.342178797 0.49380765 -1.47314029
## 247 1.110462050 -0.77373777 1.34986495
## 162 -1.680000747 1.39919723 0.99698930
## 168 -0.051434497 1.39919723 1.34986495
## 169 0.242425174 -0.41278912 0.29123799
## 348 -0.055206197 -0.41158193 -0.76738898
## 217 -1.620055474 -0.89566356 1.34986495
## 343 0.250123976 -0.77615214 0.99698930
## 218 -1.566433759 -1.73948665 1.34986495
## 268 0.555942557 -0.29327769 0.99698930
## 291 0.308372688 -0.65301915 0.99698930
## 237 -1.257890394 -0.65422634 0.64411364
## 239 0.859653282 -1.49925662 -1.47314029
## 272 -0.631126702 0.43344834 0.29123799
## 277 1.710685957 -0.89687075 0.99698930
## 369 -1.172805119 0.73283050 0.29123799
## Amylose.content Alkali.spreading.value Protein.content
## 1 -0.68226975 0.24251496 -0.14673626
## 61 1.13994856 -0.06147408 -1.19681764
## 16 0.30412867 -0.08680650 1.25337224
## 316 0.75274665 -0.21346860 -0.43842554
## 214 0.69195409 0.21718254 0.02827730
## 154 -0.19561743 -0.74544942 1.72007508
## 10 -0.77891435 1.07848483 1.07835868
## 108 0.63957895 0.16651770 0.90334512
## 24 0.40202029 -0.28946586 -0.20507412
## 377 0.37707975 0.09052044 0.26162872
## 73 -0.94289841 -3.40535353 0.14495301
## 40 -3.56352580 1.07848483 0.14495301
## 123 1.03426302 0.39450948 -1.19681764
## 314 1.05982707 -0.03614166 -0.96346623
## 45 -0.01043391 -0.64411974 0.55331799
## 157 -0.06623837 1.07848483 -0.67177695
## 220 -0.08120270 1.07848483 2.47846719
## 60 -0.18439419 -0.55545627 0.37830443
## 170 0.50178246 -0.74544942 0.37830443
## 177 -1.09815328 0.73649716 -1.13847979
## 81 1.15397762 -0.85944531 -0.20507412
## 84 0.21028988 -0.51745764 0.84500726
## 342 -0.53823812 1.07848483 -0.55510124
## 188 0.74682328 0.62250127 -0.49676339
## 107 -0.07122648 -0.40346175 0.49498014
## 112 0.65173746 1.07848483 -1.08014193
## 250 -0.55320245 0.96448894 -0.96346623
## 236 -3.53671472 0.77449579 -0.26341197
## 180 -0.05937973 0.96448894 -0.67177695
## 122 0.53607570 0.62250127 1.72007508
## 356 0.79358679 -0.36546312 -0.61343910
## 271 -0.86121813 0.47050675 1.83675079
## 163 1.39465384 0.92649031 -1.37183121
## 247 -0.71032786 0.69849853 1.42838581
## 162 1.05172139 1.07848483 -1.13847979
## 168 1.62660088 0.57183643 -1.02180408
## 169 -3.60280715 0.81249442 1.31171010
## 348 0.88274923 -0.74544942 -0.78845266
## 217 -1.04390760 0.50850538 0.02827730
## 343 -0.91234624 -0.41612796 -0.03006055
## 218 -1.38902234 -2.37939052 0.49498014
## 268 -0.30940865 -0.27426641 0.08661516
## 291 -0.81850746 0.88849168 -1.72185833
## 237 -1.46914383 -3.48135079 1.19503439
## 239 -1.07009517 -2.37939052 0.37830443
## 272 1.17642410 1.07848483 1.83675079
## 277 -0.42101758 1.00248757 0.55331799
## 369 1.04610977 -0.74544942 0.49498014
##
## $id.med
## [1] 1 28 8 192 121 80 7 53 10 218 33 15 63 191 18 83 126 27 93
## [20] 98 36 38 202 106 52 54 148 136 101 62 211 161 86 145 85 91 92 207
## [39] 123 203 124 159 178 137 138 162 166 214
##
## $clustering
## 1 3 5 6 8 9 10 16 22 24 25 26 27 39 40 43 44 45 46 51
## 1 2 3 4 5 6 7 3 8 9 10 9 11 2 12 13 14 15 9 16
## 53 54 55 57 58 59 60 61 65 69 70 72 73 74 79 81 83 84 87 88
## 3 5 17 13 14 18 18 2 9 19 10 10 11 5 20 21 20 22 23 4
## 89 92 93 94 96 99 100 101 103 105 106 107 108 112 113 114 116 117 118 120
## 24 10 3 16 16 9 23 19 16 14 11 25 8 26 27 23 10 28 1 10
## 121 122 123 125 129 130 131 132 134 135 137 139 140 142 147 149 150 152 153 154
## 29 30 13 31 13 13 14 5 32 10 2 5 18 33 24 5 23 3 21 6
## 155 156 157 160 162 163 164 165 166 167 168 169 170 171 172 174 176 177 178 179
## 34 13 16 15 35 33 10 18 36 19 36 37 19 38 6 5 8 20 4 34
## 180 183 184 186 187 188 189 191 195 198 201 202 203 205 206 207 208 209 211 213
## 29 10 29 16 39 24 13 15 11 19 8 9 2 32 36 5 5 13 34 40
## 214 215 217 218 219 220 221 222 225 228 231 232 233 234 235 236 237 239 240 241
## 5 10 39 41 27 17 42 31 1 38 38 43 43 38 13 28 44 45 45 13
## 242 243 244 245 247 248 249 250 251 252 254 256 257 263 264 265 266 267 268 270
## 40 42 5 40 34 1 1 27 19 21 31 27 6 1 1 1 1 34 42 40
## 271 272 273 275 276 277 279 280 281 282 283 284 285 286 287 289 290 291 294 296
## 32 46 45 1 4 47 17 23 40 47 6 21 19 11 27 17 20 43 13 1
## 297 299 300 301 303 306 307 308 311 313 314 316 321 333 334 335 337 338 339 340
## 47 38 1 6 47 17 1 5 29 21 14 4 4 27 1 6 2 29 2 36
## 341 342 343 344 346 347 348 349 350 355 356 359 365 369 371 372 376 377 381 384
## 14 23 40 2 21 10 38 13 45 1 31 14 20 48 14 14 32 10 18 10
## 385 386 387 391 392 394 395 396 397
## 31 23 20 19 11 11 10 19 19
##
## $objective
## build swap
## 2.350559 2.334805
##
## $isolation
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## no no no no no no no no no no no no no no no no no no no no no no no no no no
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## no no no no no no L* no no no no no L* no no no L* no no no no no
## Levels: no L L*
##
## $clusinfo
## size max_diss av_diss diameter separation
## [1,] 15 3.396047 2.491193 5.013251 2.102297
## [2,] 8 3.940002 2.588179 5.590730 1.908541
## [3,] 5 2.935819 2.176676 3.671067 2.647235
## [4,] 6 3.287674 2.314052 3.984330 2.356047
## [5,] 12 4.270387 3.210883 6.538077 3.023159
## [6,] 7 4.211719 3.210739 5.811008 3.064565
## [7,] 1 0.000000 0.000000 0.000000 4.764459
## [8,] 4 4.231524 2.445166 5.136474 2.603085
## [9,] 6 2.583538 1.783762 3.684491 2.066555
## [10,] 14 4.132061 2.621656 5.608358 2.066555
## [11,] 7 3.542087 2.580328 4.814989 2.959904
## [12,] 1 0.000000 0.000000 0.000000 4.291645
## [13,] 12 4.001610 2.466821 5.330147 1.908541
## [14,] 9 3.857231 2.479860 4.587549 2.118303
## [15,] 3 3.148820 1.647018 3.364221 2.647235
## [16,] 6 4.434976 2.582121 5.332131 2.360260
## [17,] 5 3.681844 2.446892 4.489978 2.841495
## [18,] 5 4.001869 2.757616 5.355336 3.192992
## [19,] 10 3.792886 2.891107 5.595767 2.735103
## [20,] 6 3.213592 2.228942 4.182100 2.721912
## [21,] 6 3.475275 2.292260 5.173628 2.582119
## [22,] 1 0.000000 0.000000 0.000000 4.101246
## [23,] 7 4.195454 2.463641 4.961384 2.914611
## [24,] 3 4.203317 2.404226 4.770831 3.265332
## [25,] 1 0.000000 0.000000 0.000000 4.071087
## [26,] 1 0.000000 0.000000 0.000000 4.556660
## [27,] 6 3.382978 2.332328 4.390924 2.841495
## [28,] 2 4.170205 2.085103 4.170205 3.687605
## [29,] 5 3.221027 2.128065 4.903242 2.360260
## [30,] 1 0.000000 0.000000 0.000000 4.729190
## [31,] 5 3.013936 2.261990 4.262801 2.497997
## [32,] 4 3.666009 2.484052 5.561418 2.759170
## [33,] 2 3.884203 1.942101 3.884203 4.551940
## [34,] 5 4.277568 2.516751 5.655798 2.773746
## [35,] 1 0.000000 0.000000 0.000000 4.435230
## [36,] 4 4.059718 2.101660 4.740166 3.340238
## [37,] 1 0.000000 0.000000 0.000000 6.868458
## [38,] 6 4.339646 2.728750 5.913697 2.854855
## [39,] 2 2.942470 1.471235 2.942470 2.953791
## [40,] 6 3.803951 2.792516 5.780604 2.527868
## [41,] 1 0.000000 0.000000 0.000000 4.425165
## [42,] 3 3.422055 2.219950 4.138119 3.109955
## [43,] 3 2.431220 1.589834 2.456639 2.747514
## [44,] 1 0.000000 0.000000 0.000000 4.585757
## [45,] 4 3.834916 2.177878 4.956017 3.026757
## [46,] 1 0.000000 0.000000 0.000000 6.318230
## [47,] 4 2.911932 1.391317 3.396124 2.102297
## [48,] 1 0.000000 0.000000 0.000000 4.191402
kmed$id.med
## [1] 1 28 8 192 121 80 7 53 10 218 33 15 63 191 18 83 126 27 93
## [20] 98 36 38 202 106 52 54 148 136 101 62 211 161 86 145 85 91 92 207
## [39] 123 203 124 159 178 137 138 162 166 214
Note that the phenotypic data (data.tr) used here has already been normalized to have variance 1. If you perform the same analysis on your own data, carefully consider whether it is necessary to scale the data, and if necessary, use the scale function to scale.
Now, let’s visualize the variation in the samples selected as representatives with scatter plots on the principal component axis.
# look at the distribution of 48 varieties/lines selected as medoids
pca.tr <- prcomp(data.tr, scale = T)
kmed$id.med
## [1] 1 28 8 192 121 80 7 53 10 218 33 15 63 191 18 83 126 27 93
## [20] 98 36 38 202 106 52 54 148 136 101 62 211 161 86 145 85 91 92 207
## [39] 123 203 124 159 178 137 138 162 166 214
pch <- rep(1, nrow(data.tr))
pch[kmed$id.med] <- 19
op <- par(mfrow = c(1,2))
plot(pca.tr$x[,1:2], col = as.numeric(subpop.tr), pch = pch)
plot(pca.tr$x[,3:4], col = as.numeric(subpop.tr), pch = pch)
Fig.17 Distribution of representative 48 varieties / lines selected by the k-medoids method
par(op)
Finally, let’s compare the distribution of principal component scores of 48 varieties / lines selected using the k-medoids method with the distribution of principal component scores of all varieties / lines by drawing a histogram.
# compare histograms between two datasets (all varieties/lines vs. the 48 selected by k-medoids)
op <- par(mfcol = c(2,4))
for(i in 1:4) {
res <- hist(pca.tr$x[,i], main = paste("PC", i, "all"))
hist(pca.tr$x[kmed$id.med, i], breaks = res$breaks, main = paste("PC", i, "k-medoids"))
}
Fig.18 Distribution of principal component scores of all varieties / lines (top) and varieties / lines selected as representatives (bottom)
par(op)
Answer the questions in homework4lec4e.Rmd
Submission method:

- Create a report as an html file and submit it to ITC-LMS.
- If you cannot submit your report to ITC-LMS due to some issue, send the report to report@iu.a.u-tokyo.ac.jp
- Make sure to write your affiliation, student number, and name at the beginning of the report.
- The deadline for submission is May 27th.