Shixiang Wang

>When the best student hears of the Tao, he practices it diligently.

Why does a model object take up so much storage space? Can it be shrunk?

Shixiang Wang · 2020-04-24

Categories: r  
Tags: r, model  

While working on a logistic regression modeling problem in R, I noticed that the saved model object was astonishingly large. Normally, all we need after modeling is the coefficients, which we use to predict on new data. Of course, a few summary statistics or key parameters may also be kept to make model information easy to access.

Still, the model object was far larger than I expected, so it was worth a thorough investigation.

To find out what exactly is stored in a logistic regression model object, I first built a simple model.

For simplicity, I used the first example from the "Logistic regression" section of Cookbook for R.

data(mtcars)
dat <- subset(mtcars, select = c(mpg, am, vs))
dat
#>                      mpg am vs
#> Mazda RX4           21.0  1  0
#> Mazda RX4 Wag       21.0  1  0
#> Datsun 710          22.8  1  1
#> Hornet 4 Drive      21.4  0  1
#> Hornet Sportabout   18.7  0  0
#> Valiant             18.1  0  1
#> Duster 360          14.3  0  0
#> Merc 240D           24.4  0  1
#> Merc 230            22.8  0  1
#> Merc 280            19.2  0  1
#> Merc 280C           17.8  0  1
#> Merc 450SE          16.4  0  0
#> Merc 450SL          17.3  0  0
#> Merc 450SLC         15.2  0  0
#> Cadillac Fleetwood  10.4  0  0
#> Lincoln Continental 10.4  0  0
#> Chrysler Imperial   14.7  0  0
#> Fiat 128            32.4  1  1
#> Honda Civic         30.4  1  1
#> Toyota Corolla      33.9  1  1
#> Toyota Corona       21.5  0  1
#> Dodge Challenger    15.5  0  0
#> AMC Javelin         15.2  0  0
#> Camaro Z28          13.3  0  0
#> Pontiac Firebird    19.2  0  0
#> Fiat X1-9           27.3  1  1
#> Porsche 914-2       26.0  1  0
#> Lotus Europa        30.4  1  1
#> Ford Pantera L      15.8  1  0
#> Ferrari Dino        19.7  1  0
#> Maserati Bora       15.0  1  0
#> Volvo 142E          21.4  1  1

Fit the model:

# Fit the logistic regression; the two calls below are equivalent
# (logit is the default link for the binomial family)
logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)
# logr_vm <- glm(vs ~ mpg, data = dat, family = binomial(link = "logit"))
logr_vm
#> 
#> Call:  glm(formula = vs ~ mpg, family = binomial, data = dat)
#> 
#> Coefficients:
#> (Intercept)          mpg  
#>       -8.83         0.43  
#> 
#> Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
#> Null Deviance:       43.9 
#> Residual Deviance: 25.5  AIC: 29.5

The printed result is fairly concise: the call, the coefficients, and a few other model statistics.

Predicting on the training data:

pred <- predict(logr_vm, type = "response", newdata = dat)
pred
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>              0.5512              0.5512              0.7272              0.5933 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>              0.3134              0.2607              0.0643              0.8414 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>              0.7272              0.3614              0.2365              0.1450 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>              0.1999              0.0919              0.0127              0.0127 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>              0.0754              0.9940              0.9860              0.9969 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>              0.6037              0.1032              0.0919              0.0428 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>              0.3614              0.9487              0.9135              0.9860 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>              0.1158              0.4124              0.0850              0.5933

In most cases, all we want to do with a fitted model is use its coefficients to predict on new data, i.e., obtain results like the above.
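
To make that concrete: prediction for this model needs nothing beyond the coefficient vector. A minimal sketch (refitting the same model as above) that reproduces `predict()` by hand:

```r
# Refit the same model as above
dat <- subset(mtcars, select = c(mpg, am, vs))
logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)

# Manual prediction: apply the inverse logit (plogis) to the linear predictor
cf <- coef(logr_vm)
manual_pred <- plogis(cf["(Intercept)"] + cf["mpg"] * dat$mpg)

# Matches predict(logr_vm, type = "response", newdata = dat)
all.equal(unname(manual_pred),
          unname(predict(logr_vm, type = "response", newdata = dat)))
```

So in principle two numbers are enough to reproduce all 32 predictions above.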

So what is taking up all that space? Let's look at what the object actually stores:

str(logr_vm)
#> List of 30
#>  $ coefficients     : Named num [1:2] -8.83 0.43
#>   ..- attr(*, "names")= chr [1:2] "(Intercept)" "mpg"
#>  $ residuals        : Named num [1:32] -2.23 -2.23 1.38 1.69 -1.46 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ fitted.values    : Named num [1:32] 0.551 0.551 0.727 0.593 0.313 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ effects          : Named num [1:32] 0.703 -2.717 0.972 1.135 -0.549 ...
#>   ..- attr(*, "names")= chr [1:32] "(Intercept)" "mpg" "" "" ...
#>  $ R                : num [1:2, 1:2] -2.03 0 -40.04 -6.31
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "(Intercept)" "mpg"
#>   .. ..$ : chr [1:2] "(Intercept)" "mpg"
#>  $ rank             : int 2
#>  $ qr               :List of 5
#>   ..$ qr   : num [1:32, 1:2] -2.031 0.245 0.219 0.242 0.228 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>   .. .. ..$ : chr [1:2] "(Intercept)" "mpg"
#>   ..$ rank : int 2
#>   ..$ qraux: num [1:2] 1.24 1.08
#>   ..$ pivot: int [1:2] 1 2
#>   ..$ tol  : num 1e-11
#>   ..- attr(*, "class")= chr "qr"
#>  $ family           :List of 12
#>   ..$ family    : chr "binomial"
#>   ..$ link      : chr "logit"
#>   ..$ linkfun   :function (mu)  
#>   ..$ linkinv   :function (eta)  
#>   ..$ variance  :function (mu)  
#>   ..$ dev.resids:function (y, mu, wt)  
#>   ..$ aic       :function (y, n, mu, wt, dev)  
#>   ..$ mu.eta    :function (eta)  
#>   ..$ initialize: language {     if (NCOL(y) == 1) { ...
#>   ..$ validmu   :function (mu)  
#>   ..$ valideta  :function (eta)  
#>   ..$ simulate  :function (object, nsim)  
#>   ..- attr(*, "class")= chr "family"
#>  $ linear.predictors: Named num [1:32] 0.206 0.206 0.98 0.378 -0.784 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ deviance         : num 25.5
#>  $ aic              : num 29.5
#>  $ null.deviance    : num 43.9
#>  $ iter             : int 6
#>  $ weights          : Named num [1:32] 0.247 0.247 0.198 0.241 0.215 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ prior.weights    : Named num [1:32] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ df.residual      : int 30
#>  $ df.null          : int 31
#>  $ y                : Named num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ converged        : logi TRUE
#>  $ boundary         : logi FALSE
#>  $ model            :'data.frame':   32 obs. of  2 variables:
#>   ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
#>   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>   ..- attr(*, "terms")=Classes 'terms', 'formula'  language vs ~ mpg
#>   .. .. ..- attr(*, "variables")= language list(vs, mpg)
#>   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. ..$ : chr [1:2] "vs" "mpg"
#>   .. .. .. .. ..$ : chr "mpg"
#>   .. .. ..- attr(*, "term.labels")= chr "mpg"
#>   .. .. ..- attr(*, "order")= int 1
#>   .. .. ..- attr(*, "intercept")= int 1
#>   .. .. ..- attr(*, "response")= int 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. .. ..- attr(*, "predvars")= language list(vs, mpg)
#>   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. .. ..- attr(*, "names")= chr [1:2] "vs" "mpg"
#>  $ call             : language glm(formula = vs ~ mpg, family = binomial, data = dat)
#>  $ formula          :Class 'formula'  language vs ~ mpg
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>  $ terms            :Classes 'terms', 'formula'  language vs ~ mpg
#>   .. ..- attr(*, "variables")= language list(vs, mpg)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "vs" "mpg"
#>   .. .. .. ..$ : chr "mpg"
#>   .. ..- attr(*, "term.labels")= chr "mpg"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(vs, mpg)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. ..- attr(*, "names")= chr [1:2] "vs" "mpg"
#>  $ data             :'data.frame':   32 obs. of  3 variables:
#>   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>   ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
#>   ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
#>  $ offset           : NULL
#>  $ control          :List of 3
#>   ..$ epsilon: num 1e-08
#>   ..$ maxit  : num 25
#>   ..$ trace  : logi FALSE
#>  $ method           : chr "glm.fit"
#>  $ contrasts        : NULL
#>  $ xlevels          : Named list()
#>  - attr(*, "class")= chr [1:2] "glm" "lm"

Good grief, that is dizzying to read. Scanning through it, many components grow with the size of the fitting (training) data: the residuals, fitted values, effects, and the model frame. Remarkably, the training data itself is stored too, which is an obvious burden, and the model component stores the data yet again.
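
This intuition can be quantified. A rough sketch that sizes each component with `object.size()` (note that `object.size()` ignores environments and shared memory, so treat the numbers as approximate; the data-sized components should still stand out):

```r
# Approximate per-component footprint of the glm object
dat <- subset(mtcars, select = c(mpg, am, vs))
logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)

comp_sizes <- sapply(logr_vm, function(x) unclass(object.size(x)))
head(sort(comp_sizes, decreasing = TRUE))
```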

Most of this information is never used at prediction time, so let's try deleting some components and see whether the model can still predict.

Removing the stored data should not affect prediction:

logr_vm$data <- NULL
predict(logr_vm, type = "response", newdata = dat)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>              0.5512              0.5512              0.7272              0.5933 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>              0.3134              0.2607              0.0643              0.8414 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>              0.7272              0.3614              0.2365              0.1450 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>              0.1999              0.0919              0.0127              0.0127 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>              0.0754              0.9940              0.9860              0.9969 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>              0.6037              0.1032              0.0919              0.0428 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>              0.3614              0.9487              0.9135              0.9860 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>              0.1158              0.4124              0.0850              0.5933

Remove the residuals and fitted values:

logr_vm$residuals <- NULL
logr_vm$fitted.values <- NULL
predict(logr_vm, type = "response", newdata = dat)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>              0.5512              0.5512              0.7272              0.5933 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>              0.3134              0.2607              0.0643              0.8414 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>              0.7272              0.3614              0.2365              0.1450 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>              0.1999              0.0919              0.0127              0.0127 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>              0.0754              0.9940              0.9860              0.9969 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>              0.6037              0.1032              0.0919              0.0428 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>              0.3614              0.9487              0.9135              0.9860 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>              0.1158              0.4124              0.0850              0.5933

Remove the weights:

logr_vm$weights <- NULL
logr_vm$prior.weights <- NULL
predict(logr_vm, type = "response", newdata = dat)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>              0.5512              0.5512              0.7272              0.5933 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>              0.3134              0.2607              0.0643              0.8414 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>              0.7272              0.3614              0.2365              0.1450 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>              0.1999              0.0919              0.0127              0.0127 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>              0.0754              0.9940              0.9860              0.9969 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>              0.6037              0.1032              0.0919              0.0428 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>              0.3614              0.9487              0.9135              0.9860 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>              0.1158              0.4124              0.0850              0.5933

From the operations above, essentially none of the components that scale with the training data are used at prediction time, so when a model object is large it is worth considering deleting them.
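
These manual deletions can be collected into a small helper. The sketch below is my own (the name `strip_glm` is made up, not an existing function); it drops the components that scale with the training data while keeping what `predict()` needs:

```r
# Hypothetical helper: remove data-sized components a glm does not
# need for predict(); keeps coefficients, qr, family, terms, etc.
strip_glm <- function(fit) {
  fit$data <- NULL
  fit$y <- NULL
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$linear.predictors <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit
}

dat <- subset(mtcars, select = c(mpg, am, vs))
slim <- strip_glm(glm(vs ~ mpg, data = dat, family = binomial))
predict(slim, type = "response", newdata = head(dat))
```

The trade-off: after stripping, functions such as `summary()`, `residuals()`, or `fitted()` no longer work; only prediction on new data is preserved.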

The "butcher" tool

A quick search shows that the tidymodels organization has developed a dedicated package, butcher (the butcher!), for exactly this problem.

install.packages("butcher")

It provides a set of axe functions that chop off different kinds of data, covering the commonly used models (see the list).

Let's see whether the model can still predict after chopping all of these off:

library(butcher)

First, rebuild the full model:

logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)

Check the memory footprint:

butcher::weigh(logr_vm)
#> # A tibble: 47 x 2
#>    object               size
#>    <chr>               <dbl>
#>  1 family.variance   0.0437 
#>  2 family.dev.resids 0.0437 
#>  3 family.aic        0.0437 
#>  4 family.validmu    0.0437 
#>  5 family.simulate   0.0437 
#>  6 family.initialize 0.0174 
#>  7 qr.qr             0.00418
#>  8 terms             0.00318
#>  9 residuals         0.00286
#> 10 fitted.values     0.00286
#> # … with 37 more rows

Time to chop:

cleaned_lm <- butcher::axe_env(logr_vm, verbose = TRUE)
#> x No memory released. Do not butcher.
cleaned_lm <- butcher::axe_call(cleaned_lm, verbose = TRUE)
#> ✓ Memory released: '264 B'
#> x Disabled: `print()`, `summary()`
cleaned_lm <- butcher::axe_data(cleaned_lm, verbose = TRUE)
#> x No memory released. Do not butcher.
cleaned_lm <- butcher::axe_fitted(cleaned_lm, verbose = TRUE)
#> ✓ Memory released: '232 B'
#> x Disabled: `print()`, `summary()`, `fitted()`
cleaned_lm <- butcher::axe_ctrl(cleaned_lm, verbose = TRUE)
#> x No memory released. Do not butcher.
predict(cleaned_lm, type = "response", newdata = dat)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>              0.5512              0.5512              0.7272              0.5933 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>              0.3134              0.2607              0.0643              0.8414 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>              0.7272              0.3614              0.2365              0.1450 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>              0.1999              0.0919              0.0127              0.0127 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>              0.0754              0.9940              0.9860              0.9969 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>              0.6037              0.1032              0.0919              0.0428 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>              0.3614              0.9487              0.9135              0.9860 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>              0.1158              0.4124              0.0850              0.5933

Strangely, data was not chopped off:

cleaned_lm$data
#>                      mpg am vs
#> Mazda RX4           21.0  1  0
#> Mazda RX4 Wag       21.0  1  0
#> Datsun 710          22.8  1  1
#> Hornet 4 Drive      21.4  0  1
#> Hornet Sportabout   18.7  0  0
#> Valiant             18.1  0  1
#> Duster 360          14.3  0  0
#> Merc 240D           24.4  0  1
#> Merc 230            22.8  0  1
#> Merc 280            19.2  0  1
#> Merc 280C           17.8  0  1
#> Merc 450SE          16.4  0  0
#> Merc 450SL          17.3  0  0
#> Merc 450SLC         15.2  0  0
#> Cadillac Fleetwood  10.4  0  0
#> Lincoln Continental 10.4  0  0
#> Chrysler Imperial   14.7  0  0
#> Fiat 128            32.4  1  1
#> Honda Civic         30.4  1  1
#> Toyota Corolla      33.9  1  1
#> Toyota Corona       21.5  0  1
#> Dodge Challenger    15.5  0  0
#> AMC Javelin         15.2  0  0
#> Camaro Z28          13.3  0  0
#> Pontiac Firebird    19.2  0  0
#> Fiat X1-9           27.3  1  1
#> Porsche 914-2       26.0  1  0
#> Lotus Europa        30.4  1  1
#> Ford Pantera L      15.8  1  0
#> Ferrari Dino        19.7  1  0
#> Maserati Bora       15.0  1  0
#> Volvo 142E          21.4  1  1

Is that because the model is too small to bother with? Let's try the package's official example model.

our_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars)
}
big_lm <- our_model()
butcher::weigh(big_lm)
#> # A tibble: 25 x 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.01    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows
clean_lm <- butcher::axe_data(big_lm, verbose = TRUE)
#> x No memory released. Do not butcher.
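
The outsized terms entry above is a clue: R formulas carry the environment in which they were created, and here that environment is the evaluation frame of our_model(), which still holds the junk vector. A quick check (assuming big_lm from above):

```r
# The terms/formula environment is our_model()'s evaluation frame,
# which keeps the junk vector alive
e <- environment(big_lm$terms)
ls(e)
#> [1] "some_junk_in_the_environment"
format(object.size(e$some_junk_in_the_environment), units = "Mb")
```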

Strange, the data still isn't chopped off. Digging into the underlying code, I found that the package has an internal assessment mechanism: it checks whether axing a component would actually release memory, and if not, it leaves the object untouched.

butcher:::assess_object(big_lm, clean_lm)
#> x No memory released. Do not butcher.

Finally, let's look at the function named after the package itself, which should automatically remove everything unnecessary:

butcher::butcher(big_lm, verbose = TRUE)
#> ✓ Memory released: '7,999,696 B'
#> x Disabled: `print()`, `summary()`, `fitted()`
#> 
#> Call:
#> dummy_call()
#> 
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat           wt  
#>     12.3034      -0.1114       0.0133      -0.0215       0.7871      -3.7153  
#>        qsec           vs           am         gear         carb  
#>      0.8210       0.3178       2.5202       0.6554      -0.1994
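
Back to the original question of storage on disk: the effect shows up directly in saveRDS() file sizes. A sketch comparing the full and butchered models (exact sizes vary by platform and serialization compression, but the full model should be far larger because it drags the captured environment to disk):

```r
small_lm <- butcher::butcher(big_lm)

f_full <- tempfile(fileext = ".rds")
f_small <- tempfile(fileext = ".rds")
saveRDS(big_lm, f_full)
saveRDS(small_lm, f_small)

# The full model serializes the ~8 MB junk vector as well
file.size(f_full)
file.size(f_small)

unlink(c(f_full, f_small))
```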