How to add p-values to a corrgram

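This post is a follow-up to "How to analyze and display gene correlations in RNA-seq gene expression data". In that post we drew a corrgram of the expression correlations among the selected genes.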

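In that figure, only one pair of genes (ETV3-ELK4) showed a relatively strong correlation. Textbooks commonly grade the strength of a correlation as follows: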

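By the absolute value of the correlation coefficient r:
· |r| > 0.95: significant correlation
· |r| ≥ 0.8: high correlation
· 0.5 ≤ |r| < 0.8: moderate correlation
· 0.3 ≤ |r| < 0.5: low correlation
· |r| < 0.3: very weak, treated as uncorrelated
Here r is the Pearson correlation coefficient.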
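For reference, the sample Pearson coefficient that these thresholds apply to is conventionally defined as below. This is the standard textbook definition; the symbols are chosen here for illustration, with x_i and y_i the expression values of the two genes in sample i, x̄ and ȳ their means, and n the number of samples.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```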

By this scale, the two genes are moderately correlated.

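However, the Pearson correlation coefficient only measures how strongly two sets of data are linearly related. What if that correlation is not statistically significant? Then the coefficient tells us very little. So when judging a correlation we need to look at both the p-value and the size of r.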

One blogger puts it this way:

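Whether two variables count as correlated depends on two things: the significance level and the correlation coefficient.
(1) The significance level, i.e. the p-value, comes first: if the result is not significant, even a high correlation coefficient is of no use, because it may simply be due to chance. How small is significant? Generally p < 0.05 is significant and p < 0.01 more so; p = 0.001, for example, is a very high level of significance. Once the result is significant, you can reject the null hypothesis of no correlation and conclude that the two sets of data are significantly correlated, i.e. that there really is a clear relationship between them. Usually p needs to be below 0.1, preferably below 0.05 or even 0.01, before you can conclude that the two sets of data are clearly related. If p = 0.5, far above 0.1, you can only say that the correlation is not evident, or that the data are uncorrelated, or at least not linearly correlated.
(2) The correlation coefficient, i.e. the Pearson correlation, often called the R value. Look at it only after the significance above has been confirmed; in general, the higher the coefficient, the closer the relationship between the two.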

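While looking up these concepts, I found a Baidu Wenku document, "Some basic concepts of the correlation coefficient and the p-value", which gives a detailed description with worked examples. Have a look if you are interested.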

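From the above, it makes sense to check significance before weighing the correlation coefficient r (or R²) in a correlation analysis. That said, if you can already see a very clear linear relationship between the two variables across a reasonable number of samples, skipping the p-value does little harm, because in that situation it will almost always be below 0.05.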
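A small sketch of why both numbers matter, using only base R's cor.test(). The data are made up for illustration and are not from the RNA-seq project: the same r = 0.8 is far from significant with 4 samples, but clearly significant when the same pattern is seen in 20 samples.

```r
# Four hypothetical samples: the correlation looks strong, but there is very little data
x <- c(1, 2, 3, 4)
y <- c(1, 3, 2, 4)
small <- cor.test(x, y)              # Pearson correlation test by default
r_small <- unname(small$estimate)    # 0.8
p_small <- small$p.value             # 0.2, not significant at the 0.05 level

# The same pattern repeated five times (20 samples): r is unchanged, p shrinks a lot
x_big <- rep(x, times = 5)
y_big <- rep(y, times = 5)
big <- cor.test(x_big, y_big)
r_big <- unname(big$estimate)        # still 0.8
p_big <- big$p.value                 # well below 0.001

cat(sprintf("n = 4:  r = %.2f, p = %.3g\n", r_small, p_small))
cat(sprintf("n = 20: r = %.2f, p = %.3g\n", r_big, p_big))
```

On the scale quoted earlier, r = 0.8 would be called "high" in both cases; only the p-value shows that the first result could easily be due to chance.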

Following the documentation of the R package corrplot, I improved the function from last time. The code is as follows:

gene_exp.corr <- function(gene.list, project_code, project.clinical, project.exp, outdir, ID_transform=TRUE, conf.level=0.95){
    # Arguments:
    # gene.list: the genes whose expression correlation you want to analyze
    # project_code: name of the data project, or any name you want to give this analysis
    # project.clinical: clinical information about the samples, data.frame format
    # project.exp: normalized gene expression (RNA-seq) of the samples, data.frame format
    # outdir: output directory for the PDF files; it is pasted directly in front of the
    #         file name, so it should end with a path separator such as "/"
    # ID_transform: clinical information sometimes uses "-" as the separator in sample IDs;
    #               they must match the sample IDs used in project.exp
    # example: a sample may be labelled "TCGA-3N-A9WB-06" in the clinical information
    #          but "TCGA.3N.A9WB.06" in the RNA-seq data set. If the IDs already match, set ID_transform=FALSE.
    # conf.level: confidence level passed to cor.test() for the confidence intervals
    
    # note: you need to install the "corrplot" package before using this function
    
    gene_exp.list <- subset(project.exp, sample%in%gene.list)
    rownames(gene_exp.list) <- gene_exp.list[,1]
    gene_exp.list <- gene_exp.list[,-1]
    gene_exp.list <- t(gene_exp.list)
    # gene_exp.list <- gene_exp.list[,c(5,1,2,3,4,6,7,8,9,10)]
    library(corrplot)
    # combine with significance test
    cor.mtest <- function(mat, conf.level = 0.95){
        mat <- as.matrix(mat)
        n <- ncol(mat)
        p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
        diag(p.mat) <- 0
        diag(lowCI.mat) <- diag(uppCI.mat) <- 1
        for(i in 1:(n-1)){
            for(j in (i+1):n){
                tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
                p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
                lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
                uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
            }
        }
        return(list(p.mat, lowCI.mat, uppCI.mat))
    }
    
    if(ID_transform){
        project.clinical$sampleID = gsub("-",".",project.clinical$sampleID, fixed = TRUE)
    }
    n.gene <- ncol(gene_exp.list)
    # all samples
    M1 <- cor(gene_exp.list)
    res1 <- cor.mtest(gene_exp.list, conf.level)
    
    pdf(paste(outdir,project_code,"_all_sample_genelist_expression_corrgram.pdf", sep=""))
    corrplot(M1, order = "AOE", tl.pos = "d", p.mat = res1[[1]], insig = "p-value")
    title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
    dev.off()
    
    # choose tumor sample
    # table(project.clinical$sample_type)
    primary_tumor <- "Primary Tumor"
    Metast_tumor  <- "Metastatic"
    primary_tumor.id <- project.clinical[project.clinical$sample_type==primary_tumor,]$sampleID
    Metast_tumor.id  <- project.clinical[project.clinical$sample_type==Metast_tumor,]$sampleID
    
    if(length(primary_tumor.id)<2 && length(Metast_tumor.id)<2){
        stop("Maybe your data have something wrong. Please check it!")
    }else{
        if(length(primary_tumor.id)<2){
            stop("I don't think it's reasonable that there are less than 2 primary tumor samples.")}
        
        gene_exp.list.primary <- subset(gene_exp.list, rownames(gene_exp.list)%in%primary_tumor.id)
        M2 <- cor(gene_exp.list.primary)
        res2 <- cor.mtest(gene_exp.list.primary, conf.level)
        
        pdf(paste(outdir,project_code,"_primary_tumor_sample_genelist_expression_corrgram.pdf", sep=""))
        corrplot(M2, order = "AOE", tl.pos = "d", p.mat = res2[[1]], insig = "p-value")
        title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
        dev.off()
        
        if(length(Metast_tumor.id)<2){
            cat("There seem to be fewer than 2 Metastatic samples in this analysis. \n")
            return(0)
        }
        
        gene_exp.list.Metast  <- subset(gene_exp.list, rownames(gene_exp.list)%in%Metast_tumor.id)
        M3 <- cor(gene_exp.list.Metast)
        res3 <- cor.mtest(gene_exp.list.Metast, conf.level)
        pdf(paste(outdir,project_code,"_Metastatic_sample_genelist_expression_corrgram.pdf", sep=""))
        corrplot(M3, order = "AOE", tl.pos = "d", p.mat = res3[[1]], insig = "p-value")
        title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
        dev.off()}
}
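To try the p-value overlay without the project data frames, here is a minimal sketch on R's built-in mtcars data set. The helper name cor_pmat and the choice of columns are assumptions made for illustration; it mirrors only the p-value part of cor.mtest above. The corrplot call uses the same arguments as the function and is skipped if the package is not installed.

```r
# Hypothetical helper: symmetric matrix of pairwise cor.test() p-values (base R only)
cor_pmat <- function(mat) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p <- matrix(0, n, n, dimnames = list(colnames(mat), colnames(mat)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      p[i, j] <- p[j, i] <- cor.test(mat[, i], mat[, j])$p.value
    }
  }
  p
}

# Stand-in for the gene expression matrix: rows are samples, columns are variables
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
M <- cor(dat)
P <- cor_pmat(dat)

# Draw the corrgram only when corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  pdf(tempfile(fileext = ".pdf"))
  corrplot::corrplot(M, order = "AOE", tl.pos = "d",
                     p.mat = P, sig.level = 0.05, insig = "p-value")
  dev.off()
}
```

Passing sig.level explicitly makes the threshold visible in the call; changing it changes which cells have their p-value printed.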

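On the one hand, a significance-test step was added; on the other, the plotting function was changed. If you want to use this function to draw more customized figures, consult the package documentation and modify it accordingly.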

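The result is similar to last time's figure. Bluish colors indicate positive correlation and reddish colors negative correlation. Circle size and color intensity, together with the color legend, make it easy to pick out the strongly correlated pairs. The lower-left and upper-right triangles of the corrgram are symmetric. The cells with numbers show p-values: the threshold here is 0.05 (corrplot's default sig.level, which the function leaves unchanged), so every p-value above 0.05 is printed. For those pairs the correlation between the two genes is not statistically significant at that level, which means there is no evidence of a linear relationship, not proof that none exists.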
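The same reading of the figure can be done as a table. The sketch below uses three made-up genes (geneA, geneB, geneC are hypothetical names and values) and lists each pair once, taken from the upper triangle because the matrix is symmetric, with its r, its p-value, and whether it passes the 0.05 threshold.

```r
# Hypothetical expression values: rows are 8 samples, columns are 3 genes
expr <- cbind(
  geneA = c(1, 2, 3, 4, 5, 6, 7, 8),
  geneB = c(2, 1, 4, 3, 6, 5, 8, 7),
  geneC = c(4, 8, 1, 5, 6, 2, 7, 3)
)

r <- cor(expr)

# Row/column indices of the upper triangle: each gene pair appears exactly once
idx <- which(upper.tri(r), arr.ind = TRUE)

pairs <- data.frame(
  gene1 = colnames(expr)[idx[, "row"]],
  gene2 = colnames(expr)[idx[, "col"]],
  r     = r[idx],
  stringsAsFactors = FALSE
)

# p-value of the Pearson correlation test for each pair
pairs$p <- apply(idx, 1, function(k) cor.test(expr[, k[1]], expr[, k[2]])$p.value)
pairs$significant <- pairs$p < 0.05

# Pairs that would be drawn without a printed p-value in the corrgram
sig_pairs <- pairs[pairs$significant, ]
print(pairs)
```

In this toy data only geneA and geneB form a significant pair; the other two pairs are the kind whose p-value would be printed in the figure.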