离散型表型变量的差异检验

对离散型变量进行差异检验,理论频数小于 5 用 fisher , 否则用卡方检验; 输出两种文件,一种是表型每种分类的的个数以及百分比,另外一种是差异检验的结果,暂且不用多重校正。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
library(dplyr)
library(data.table)
dt = read.table('../Samle_Information_categorical.xls', sep = '\t', header = T)
group = 'NS:FS:CS' #the colname of group is 'Group'
group_list = unlist(strsplit(group, ":"))
indice_index = 3 #表型开始的列的位置,从 1 开始

fileConn<-file("diff_results.xls",'w')
head = paste('Characteristic', 'P value', 'Method', sep ='\t')
writeLines(head, fileConn)

# 计算连续变量的 sd 和 mean ... 但是我目前不需要这个...
#temp = dt[c('Group', 'Age')]
#ag = aggregate(. ~ Group, temp, function(x) c(mean = round(mean(x),2), sd = round(sd(x),2)))
#ag_flat <- do.call("data.frame", ag) # flatten

for (index in indice_index:ncol(dt)){
indice = colnames(dt)[index]
sub_dt = dt[c('Group',indice)] # 提取

freq = sub_dt %>%
group_by(Group, eval(parse(text = indice))) %>% # 将变量转化为内置变量(大概这么叫吧)
summarise(n = n()) %>% # 天呐这个功能太好用了...
mutate(freq = round(n / sum(n)*100,2)) %>% #另一个常见的操作是添加新的列。这就是mutate()函数的工作了。
mutate(item = paste(n,'(', freq, ')', sep = ''))
colnames(freq)[2] = indice

#freq
need = freq[c('Group',indice,'item')]
#need
dt_cast = dcast(need, eval(parse(text = indice))~Group, value.var='item', fill=0) #太牛批了
colnames(dt_cast)[1] = indice
dt_cast = dt_cast[c(indice,group_list)]
#dt_cast
write.table(dt_cast, file = paste(indice, "information.xls", sep = "_"),sep="\t",quote = F, row.names = F)

#计算 P 值
ctable = table(sub_dt)

if (any(ctable < 5)){
p = fisher.test(ctable, simulate.p.value = TRUE, B = 1e6)$p.value
line = paste(indice, p, 'fisher', sep ='\t')
writeLines(line, fileConn)
} else {
p = chisq.test(ctable)$p.value
line = paste(indice, p,'chisq', sep ='\t')
print(indice)
writeLines(line, fileConn)
}
}
close(fileConn)

随后用 python3 对其结果进行整理即可

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import os
group = ['NS','FS','CS'] # 组名

with open('diff_results.xls', 'r') as IN, open('Final_result.xls','w') as out:
head = IN.readline().strip('\n').split('\t')
print('\t'.join([head[0], '\t'.join(group), head[1]]), file = out)
for line in IN:
line = line.strip('\n').split('\t')
outline = '\t'.join([line[0] + ',n(%)', '\t'*len(group)])
print(outline + '%.2e' % float(line[1]), file = out) #格式化输出,科学计数法,保留两位小数点...
with open(line[0] + '_information.xls', 'r') as indice:
indice.readline()
for line in indice:
line = line.strip('\n')
print(line, file = out)
# os.remove(line[0] + '_information.xls')

参考链接:
Chi-squared Test of Independence | R Tutorial
Chi-Square Test of Independence in R - Easy Guides - Wiki - STHDA
Fisher精确检验的通俗理解 - z54572的专栏 - CSDN博客
列联表数据的分析 - 战立侃

(✪ω✪)