Shixiang Wang

>上士闻道
勤而行之

使用有限混合模型

王诗翔 · 2019-09-06

分类: r  
标签: r   flexmix  

模拟数据

library(ggplot2)
library(flexmix)
#> Loading required package: lattice
m1 <- 0
m2 <- 50
sd1 <- sd2 <- 5
N1 <- 100
N2 <- 10

a <- rnorm(n=N1, mean=m1, sd=sd1)
b <- rnorm(n=N2, mean=m2, sd=sd2)

绘制数据图形

x <- c(a,b)
class <- c(rep('a', N1), rep('b', N2))
data <- data.frame(cbind(x=as.numeric(x), class=as.factor(class)))

library("ggplot2")
p <- ggplot(data, aes(x = x)) + 
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
  geom_vline(xintercept = m1, col = "red", size = 2) + 
  geom_vline(xintercept = m2, col = "blue", size = 2)
p

拟合模型

这里我们可以看到应该是由2个分布混合而成,试着去恢复相应分布的参数:

set.seed(0)

mo1 <- FLXMRglm(family = "gaussian")
mo2 <- FLXMRglm(family = "gaussian")
flexfit <- flexmix(x ~ 1, data = data, k = 2, model = list(mo1, mo2))
print(table(clusters(flexfit), data$class))
#>    
#>       1   2
#>   1   0  10
#>   2 100   0

区分效果很好(其实可以用这种方法去分类)。

查看结果

parameters(flexfit)
#> [[1]]
#>                  Comp.1 Comp.2
#> coef.(Intercept)  48.76   0.37
#> sigma              5.84   4.78
#> 
#> [[2]]
#>                  Comp.1 Comp.2
#> coef.(Intercept)  48.76   0.37
#> sigma              5.84   4.78

输出参数:

c1 <- parameters(flexfit, component=2)[[1]]
c2 <- parameters(flexfit, component=1)[[1]]

cat('pred:', c1[1], '\n')
#> pred: 0.37
cat('true:', m1, '\n\n')
#> true: 0
cat('pred:', c1[2], '\n')
#> pred: 4.78
cat('true:', sd1, '\n\n')
#> true: 5

cat('pred:', c2[1], '\n')
#> pred: 48.8
cat('true:', m2, '\n\n')
#> true: 50
cat('pred:', c2[2], '\n')
#> pred: 5.84
cat('true:', sd2, '\n\n')
#> true: 5

基本能拟合出原始分布。

可视化拟合

plot_mix_comps <- function(x, mu, sigma, lam) {
  lam * dnorm(x, mu, sigma)
}

lam <- table(clusters(flexfit))
  
ggplot(data) +
geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
stat_function(geom = "line", fun = plot_mix_comps,
                args = list(c1[1], c1[2], lam[2]/sum(lam)),
                colour = "red", lwd = 1.5) +
stat_function(geom = "line", fun = plot_mix_comps,
                args = list(c2[1], c2[2], lam[1]/sum(lam)),
                colour = "blue", lwd = 1.5) +
ylab("Density")

新的问题

能否自动推断出有2个分布以及它们的参数??

flexfit = stepFlexmix(x ~ 1, data = data, k = 1:5, model = FLXMRglm(family = "gaussian"))
#> 1 : * * *
#> 2 : * * *
#> 3 : * * *
#> 4 : * * *
#> 5 : * * *
flexfit
#> 
#> Call:
#> stepFlexmix(x ~ 1, data = data, model = FLXMRglm(family = "gaussian"), 
#>     k = 1:5)
#> 
#>   iter converged k k0 logLik AIC BIC ICL
#> 1    2      TRUE 1  1   -452 908 913 913
#> 2   13      TRUE 2  2   -363 736 750 750
#> 3   36      TRUE 3  3   -360 735 757 795
#> 4   69      TRUE 4  4   -360 741 771 860
#> 5   66      TRUE 4  5   -360 741 771 874

根据 BIC 选择一个最佳的模型:

fitBest = getModel(flexfit, which = "BIC")
str(fitBest, max.level = 2)
#> Formal class 'flexmix' [package "flexmix"] with 18 slots
#>   ..@ posterior  :List of 2
#>   ..@ weights    : NULL
#>   ..@ iter       : int 13
#>   ..@ cluster    : int [1:110] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..@ logLik     : num -363
#>   ..@ df         : num 5
#>   ..@ control    :Formal class 'FLXcontrol' [package "flexmix"] with 6 slots
#>   ..@ group      : Factor w/ 0 levels: 
#>   ..@ size       : Named int [1:2] 100 10
#>   .. ..- attr(*, "names")= chr [1:2] "1" "2"
#>   ..@ converged  : logi TRUE
#>   ..@ k0         : int 2
#>   ..@ model      :List of 1
#>   ..@ prior      : num [1:2] 0.9091 0.0909
#>   ..@ components :List of 2
#>   ..@ concomitant:Formal class 'FLXP' [package "flexmix"] with 7 slots
#>   ..@ formula    :Class 'formula'  language x ~ 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   ..@ call       : language stepFlexmix(x ~ 1, data = data, model = FLXMRglm(family = "gaussian"),      k = 2)
#>   ..@ k          : int 2

查看参数:

parameters(fitBest)
#>                  Comp.1 Comp.2
#> coef.(Intercept)   0.37  48.76
#> sigma              4.78   5.84

这个我们的输入非常接近,但也存在一个不小的误差:

print(paste(m1, sd1))  
#> [1] "0 5"
print(paste(m2, sd2))
#> [1] "50 5"

使用不同的接口

Flexmix 这个包的文档看起来让人很蒙蔽,完全搞不懂核心的一些建模函数。我下面测试下不同的接口使用的效果。

set.seed(0)


fit1 <- flexmix(x ~ 1, data = data, k = 2, model = FLXMRglm(family = "gaussian"))
parameters(fit1)
#>                  Comp.1 Comp.2
#> coef.(Intercept)  48.76   0.37
#> sigma              5.84   4.78
fit2 <- flexmix(x ~ 1, data = data, k = 2, model = FLXMCnorm1())
parameters(fit2)
#>      Comp.1 Comp.2
#> mean  48.76   0.37
#> sd     6.12   4.78

使用泊松分布来拟合试试,先生成泊松分布数据:

N1 <- 100
N2 <- 10

a <- rpois(N1, 0)
b <- rpois(N2, 50)

x <- c(a,b)
class <- c(rep('a', N1), rep('b', N2))
data <- data.frame(cbind(x=as.numeric(x), class=as.factor(class)))
fit3 <- flexmix(x ~ 1, data = data, k = 2, model = FLXMCmvpois())
parameters(fit3)
#> Comp.1.lambda Comp.2.lambda 
#>          48.3           0.0
fit4 <- flexmix(x ~ 1, data = data, k = 2, model = FLXMRglm(family = "poisson"))
parameters(fit4)
#> Comp.1.coef.(Intercept) Comp.2.coef.(Intercept) 
#>                    3.88                  -28.67

FLXMCmvpois() 显示的是 demo driver,但却比 FLXMRglm(family = "poisson") 结果准确的多!!

不能理解这个包的哲学,尽管它看起来是那么的优秀~

更新:2019-09-17

发现 flexmix 提供的功能大体分为两类,以 FLXMC 开头的是做聚类的,而以 FLXMR 开头的是做回归的。

能否重复分析?

set.seed(1234)

fit <- flexmix(x ~ 1, data = data, k = 2, model = FLXMCmvpois())
parameters(fit)
#> Comp.1.lambda Comp.2.lambda 
#>          48.3           0.0
set.seed(1234)

fit <- flexmix(x ~ 1, data = data, k = 2, model = FLXMCmvpois())
parameters(fit)
#> Comp.1.lambda Comp.2.lambda 
#>          48.3           0.0

对于 step 方法?

set.seed(1234)
stepfit = stepFlexmix(x ~ 1, data = data, k = 1:5, model = FLXMCmvpois())
#> 1 : * * *
#> 2 : * * *
#> 3 : * * *
#> 4 : * * *
#> 5 : * * *
fit = getModel(flexfit, which = "BIC")
parameters(fit)
#>                  Comp.1 Comp.2
#> coef.(Intercept)   0.37  48.76
#> sigma              4.78   5.84
set.seed(1234)
stepfit = stepFlexmix(x ~ 1, data = data, k = 1:5, model = FLXMCmvpois())
#> 1 : * * *
#> 2 : * * *
#> 3 : * * *
#> 4 : * * *
#> 5 : * * *
fit = getModel(flexfit, which = "BIC")
parameters(fit)
#>                  Comp.1 Comp.2
#> coef.(Intercept)   0.37  48.76
#> sigma              4.78   5.84

结果显示一致

本文前半部分示例来自《A Practical Introduction To Finite Mixture Models