CCA with ggplot2

CCA 简介

典型关联分析(CCA)原理总结
文章是专业的描述,我按照我的理解简单写一下:
CCA 可以将高维的数据(比如丰度数据和表型数据)降维,然后进行相关性分析,降维的标准是降维后两组数据的相关系数最大,大概知道这么一个结论就可以了。

折腾了好久时间,本来是用 vegan 包的结果来画,结果发现如果这样的话, 个性化调整图片很难(比如将中间文件转用 ggplot2 画),查着查着就发现只用 ggplot2 也可以画,用的是 ggvegan 这个包。最后花了一段时间用来调整图片的细节问题,比如给个体代表的点按照组别上色,是否画物种以及如何加物种标签(结果发现物种多的时候,会显得很乱… 于是就暂时放弃了),接下来是代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
library("ggvegan")
library("ggrepel")

taxdata <- read.table('All.taxonomy_percent.xls', header=T, check.names=F, sep='\t', row.names = 1, comment.char = "") #实测将低丰度的物种合并为 others 对结果的影响比较大,而且合并以后的效果更好;
env <- read.table('impute_phenotype.xls', header=T, check.names=F, sep='\t', row.names = 1)
taxdata <- t(taxdata)
vare.cca <- cca(taxdata, env)
#autoplot(vare.cca, geom = c("point","text"), layers = c("species", "sites", "biplot", "centroids")) #自动输出


ford = fortify(vare.cca) #提取 CCA 的元数据
ford <- fortify(ford, axes = 1:2) # fortify the ordination
take <- c('CCA1', 'CCA2') # which columns contain the scores we want
arrows <- subset(ford, Score == 'biplot') # take only biplot arrow scores
## multiplier for arrows to scale them to the plot range
mul <- ggvegan:::arrowMul(arrows[, take],
subset(ford, select = take, Score == 'sites'))
arrows[, take] <- arrows[, take] * mul # scale biplot arrows

## 加入组别信息以及颜色
group = read.table('Mapping.txt', comment.char = "", row.names = 1, header = T) #ID\tGroup
sample = subset(ford, Score == 'sites')
rownames(sample) = sample$Label
merge_dt = merge(sample, group, by = "row.names")

group_list = "A:B:C" #设定组别以及下边的颜色
color_list = "24af2d:ea5e74:aa75b3"
legend_list = c(unlist(strsplit(group_list, ":")))
color_var = unlist(strsplit(color_list, ":"))
color_var = c(paste("#",color_var,sep=""))
merge_dt$Group <-factor(merge_dt$Description, legend_list)

pdf("CCA.pdf", width=5, height=4)

ggplot() +
geom_point(data = merge_dt, size = 2,
mapping = aes(x = CCA1, y = CCA2, color = Group,)) + scale_color_manual(values = color_var) + #个体
# geom_point(data = subset(ford, Score == 'species'), size = 1, shape = 1,
# mapping = aes(x = CCA1, y = CCA2)) + #这是物种
# geom_label_repel(data = subset(ford, Score == 'species'), mapping = aes(x = CCA1, y = CCA2, label = Label),
# # box.padding = 0,
# point.padding = 0,
# segment.color = 'grey50') + #物种标签,注释掉是因为会显得很乱
geom_segment(data = arrows,
mapping = aes(x = 0, y = 0, xend = CCA1, yend = CCA2),
arrow = arrow(length = unit(0.01, "npc"))) + #临床表型的箭头
geom_text(data = arrows,
mapping = aes(label = Label, x = CCA1, y = CCA2, hjust = 0.5*(1 - sign(CCA1)), vjust = 0.5*(1-sign(CCA2)))) + #临床表型的标签
coord_fixed() +
scale_x_continuous(expand = c(.1, .1)) +
scale_y_continuous(expand = c(.1, .1)) +
theme_bw() +
theme(
axis.text = element_text(colour = 'black', size = 12),
axis.title = element_text(size = 12),
panel.background = element_rect(colour = "black", size = 1),
panel.grid =element_blank(),
legend.key = element_blank(),
legend.text = element_text(size = 10),
legend.title = element_blank(),
#legend.position=c(0.1,0.1),
legend.position='none',
plot.margin = unit(c(0.2, 0.1, 0.1, 0.1 ), 'in'))
dev.off()

物种可以取丰度前 10 的进行分析(得看项目的具体情况)
https://mp.weixin.qq.com/s/urMdWn5Jf2Ia8CdK9jEsgA

上图的CCA分析结果图,图中箭头代表不同的环境因子,红色的代表不同的微生物,绿色的代表不同的样本(当然这个图可以只展示样本和和环境因子2种)。
环境因子的箭头的长度代表相应的环境因子与研究对象(样品,微生物)相关程度的大小,越长代表其对所研究对象(样品,微生物)的分布影响越大。箭头连线之间的夹角的代表其相关性,为锐角是说明2个环境因子之间是正相关,钝角是负相关。

表型因子也可以进行提前筛选
https://mp.weixin.qq.com/s/Mn12azkNPNEHSK9tjiFYng

参考链接

https://stackoverflow.com/questions/28682405/r-visualize-cca-plot-in-ggplot
https://rdrr.io/github/gavinsimpson/ggvegan/man/autoplot.cca.html
https://github.com/gavinsimpson/ggvegan/issues/9
r - Label points in geom_point - Stack Overflow

(✪ω✪)