倾向性评分 | Propensity score

数据分析的时候,难免会碰到混杂因素显著差异造成干扰的情况,在样品数目足够的情况下,就得考虑如何挑选样品使其配对,消除混杂因素的差异性,这时候就该倾向性评分发挥作用了。

简单来说就是,不同组别的混杂因素用一个综合指标替代,选择混杂因素相近(没有差异)的样品,组成新的样品数目,用于分析。

经过查找发现,R 的 MatchIt 可以用来做这个分析,不过只能做两组的,对于三组的数据,我的做法是挑其中一组出来,分别和其他两组做分析,组成没有混杂因素差异的新样品组合。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
library("MatchIt")
library("dplyr")

# 获取 FS 匹配的 CS ,之所以选 FS 为「对照」组是因为它的样品数目最少,可以可其他两组最大化配对;
dt = read.table('Samle_Information.xls', sep = '\t', row.names = 1, header = T)
dt = dt[c('Group','Age', 'BMI', 'Ethnicity','Gender','Married','Breakfast','Diet','Vegetable','Fruit','Meat','Dairy.products','Coarse.grain','Eggs','Drinking','Antibiotic','Mouthwash','Oral.ulcer','Bleeding.gums','Dental.caries','Pulpitis','Periodontitis','Constipation','Insomnia','Cold','Sore.throat')] #对所有这些因素进行综合评分挑选样品
dt = dt[dt['Group'] != 'NS',]
temp = rownames(dt) # 这里保留行名是因为下一行的操作会把行名取消掉...
dt = dt %>% mutate(Group = ifelse(Group == 'FS', 1, 0))
rownames(dt)= temp
m = matchit(Group ~ Age + BMI + Ethnicity + Gender + Married + Breakfast + Diet + Vegetable + Fruit + Meat + Dairy.products + Coarse.grain + Eggs + Drinking + Antibiotic + Mouthwash + Oral.ulcer + Bleeding.gums + Dental.caries + Pulpitis + Periodontitis + Constipation + Insomnia + Cold + Sore.throat ,data = dt,method ="nearest", ratio =1)
matched <- match.data(m)
cs_id = rownames(matched[matched['Group'] == 0,])

# 获取 FS 匹配的 NS
dt = read.table('Samle_Information.xls', sep = '\t', row.names = 1, header = T)
dt = dt[c('Group','Age', 'BMI', 'Ethnicity','Gender','Married','Breakfast','Diet','Vegetable','Fruit','Meat','Dairy.products','Coarse.grain','Eggs','Drinking','Antibiotic','Mouthwash','Oral.ulcer','Bleeding.gums','Dental.caries','Pulpitis','Periodontitis','Constipation','Insomnia','Cold','Sore.throat')]
dt = dt[dt['Group'] != 'CS',]
temp = rownames(dt)
dt = dt %>% mutate(Group = ifelse(Group == 'FS', 1, 0))
rownames(dt)= temp
m = matchit(Group ~ Age + BMI + Ethnicity + Gender + Married + Breakfast + Diet + Vegetable + Fruit + Meat + Dairy.products + Coarse.grain + Eggs + Drinking + Antibiotic + Mouthwash + Oral.ulcer + Bleeding.gums + Dental.caries + Pulpitis + Periodontitis + Constipation + Insomnia + Cold + Sore.throat, data = dt,method ="nearest", ratio =1)
matched <- match.data(m)
ns_id = rownames(matched[matched['Group'] == 0,])
fs_id = rownames(dt[dt['Group'] == '1',])

dt = read.table('Samle_Information.xls', sep = '\t', row.names = 1, header = T)
write.table(dt[c(fs_id,cs_id,ns_id),], file = 'matched_Samle_Information.xls', sep = "\t",quote = F) #输出配对后的表型信息表

遇到的坑:本来我只打算对那些差异表型纳入,然后评分,挑选样品,结果发现每次重新组合样品后,会有新的表型出现差异,然后我一个一个的加,这个过程重复了大概 3-4 次,实在忍无可忍了,就直接纳入了全部的。

未解决问题:查的时候查到了也有其他包可以同时处理多组的,叫 twang ,但是我没看太懂怎么用,而且发现遇到问题的话,Google 也不好找解决办法,就退而求其次,用了上边的办法。

参考链接:
倾向性评分介绍:背景、原理和应用
用R实现倾向性评分匹配 - 知乎
手把手教你做倾向评分匹配 - 云+社区 - 腾讯云
https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
https://datawookie.netlify.com/blog/2013/05/package-matchit-balancing-experimental-data/
https://www.researchgate.net/post/R_Package_for_doing_propensity_score_matching_on_more_than_two_groups

(✪ω✪)