How to deal with missing values in phenotype data

The idea comes from the MaAsLin2 documentation:
biobakery / biobakery / wiki / maaslin2 — Bitbucket

MaAsLin2 will generally accommodate missing values (which typically occur only in metadata, not in microbial community features), but you may get better results if you impute them, even using something very simple like a median value.
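As a minimal sketch of the "something very simple like a median value" approach mentioned above, the pandas snippet below fills each missing entry with the median of its column. The table, its column names (`age`, `bmi`) and the sample IDs are made up for illustration and are not from MaAsLin2 or from the data used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata table: two numeric phenotype columns with gaps
metadata = pd.DataFrame(
    {"age": [30.0, np.nan, 50.0, 40.0], "bmi": [22.0, 24.0, np.nan, 26.0]},
    index=["S1", "S2", "S3", "S4"],
)

# Replace each missing value with the median of its own column;
# median() ignores NaN when computing the column medians
imputed = metadata.fillna(metadata.median(numeric_only=True))
print(imputed)
```

This ignores group structure entirely, which is why the script further down uses per-group means instead.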

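import pandas as pd

# Handle NaN values in the phenotype table before CCA analysis.
# A variable that is missing in 20% or more of the samples is dropped; otherwise each
# missing value is replaced with the mean of the sample's group (the median also works).
dt = pd.read_table('phenotype.xls', index_col=0)  # rows: sample IDs, columns: phenotype variables
group = pd.read_table('Mapping.txt', index_col=0)  # rows: sample IDs; the group column is named "Description" by default

dt_merge = pd.concat([dt, group], axis=1)
# Per-group means used for imputation; mean() skips NaN values and does not count them
dt_mean = dt_merge.groupby('Description').mean(numeric_only=True)

# Drop variables with too many missing values (20% or more of the samples).
# .copy() avoids the "A value is trying to be set on a copy of a slice from a DataFrame" warning.
dt_left = dt.loc[:, dt.isnull().sum() < dt.shape[0] * 0.2].copy()

for var in dt_left.columns:
    missing_ids = dt_left.index[dt_left[var].isnull()]
    for ID in missing_ids:
        # Fill with the mean of the group this sample belongs to
        dt_left.loc[ID, var] = dt_mean.loc[group.loc[ID, 'Description'], var]

dt_left.to_csv('impute_phenotype.xls', sep='\t')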
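The same idea can probably be written without the explicit loops. The sketch below is one possible vectorized variant using `groupby(...).transform("mean")`; the function name `impute_by_group`, its `max_missing` parameter and the toy data are my own assumptions for illustration, not part of the original script.

```python
import numpy as np
import pandas as pd

def impute_by_group(pheno, groups, max_missing=0.2):
    """Drop sparse columns, then fill NaN with the mean of each sample's group."""
    # Keep columns whose fraction of missing values is below the threshold
    kept = pheno.loc[:, pheno.isnull().mean() < max_missing]
    # Align the group labels to the phenotype rows
    labels = groups.reindex(kept.index)
    # transform("mean") returns the group mean for every row, NaN values skipped
    group_means = kept.groupby(labels).transform("mean")
    return kept.fillna(group_means)

# Hypothetical toy data: 6 samples in two groups
pheno = pd.DataFrame(
    {
        "height": [1.0, np.nan, 3.0, 5.0, 7.0, 9.0],      # 1 of 6 missing: kept
        "weight": [np.nan, np.nan, 2.0, 4.0, 6.0, 8.0],   # 2 of 6 missing: dropped
    },
    index=["S1", "S2", "S3", "S4", "S5", "S6"],
)
groups = pd.Series(["A", "A", "A", "B", "B", "B"], index=pheno.index, name="Description")

result = impute_by_group(pheno, groups)
print(result)
```

With real files, `pheno` would be the `dt` table and `groups` would be `group['Description']`. If every sample in a group is missing a variable, the group mean is itself NaN and the gap stays unfilled, so it is worth checking `result.isnull().sum()` afterwards.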
(✪ω✪)