Why are model objects so large, and can they be shrunk?
王诗翔 · 2020-04-24
While working on a logistic regression problem in R, I noticed that the saved model object was surprisingly large. Normally, all we need after fitting a model are the coefficients, which we use to predict new data. Of course, a few summary statistics or key parameters may also be kept so that model information is easy to access.
But the model object was far larger than I expected, so it was worth a thorough investigation.
To see what exactly is stored in a logistic regression result, I first build a simple model.
For simplicity, I use the first example from the "Logistic regression" section of Cookbook for R.
data(mtcars)
dat <- subset(mtcars, select = c(mpg, am, vs))
dat
#> mpg am vs
#> Mazda RX4 21.0 1 0
#> Mazda RX4 Wag 21.0 1 0
#> Datsun 710 22.8 1 1
#> Hornet 4 Drive 21.4 0 1
#> Hornet Sportabout 18.7 0 0
#> Valiant 18.1 0 1
#> Duster 360 14.3 0 0
#> Merc 240D 24.4 0 1
#> Merc 230 22.8 0 1
#> Merc 280 19.2 0 1
#> Merc 280C 17.8 0 1
#> Merc 450SE 16.4 0 0
#> Merc 450SL 17.3 0 0
#> Merc 450SLC 15.2 0 0
#> Cadillac Fleetwood 10.4 0 0
#> Lincoln Continental 10.4 0 0
#> Chrysler Imperial 14.7 0 0
#> Fiat 128 32.4 1 1
#> Honda Civic 30.4 1 1
#> Toyota Corolla 33.9 1 1
#> Toyota Corona 21.5 0 1
#> Dodge Challenger 15.5 0 0
#> AMC Javelin 15.2 0 0
#> Camaro Z28 13.3 0 0
#> Pontiac Firebird 19.2 0 0
#> Fiat X1-9 27.3 1 1
#> Porsche 914-2 26.0 1 0
#> Lotus Europa 30.4 1 1
#> Ford Pantera L 15.8 1 0
#> Ferrari Dino 19.7 1 0
#> Maserati Bora 15.0 1 0
#> Volvo 142E 21.4 1 1
Fit the model:
# Fit the logistic regression (the two calls below are equivalent)
# logit is the default link for the binomial family
logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)
# logr_vm <- glm(vs ~ mpg, data = dat, family = binomial(link = "logit"))
logr_vm
#>
#> Call: glm(formula = vs ~ mpg, family = binomial, data = dat)
#>
#> Coefficients:
#> (Intercept) mpg
#> -8.83 0.43
#>
#> Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
#> Null Deviance: 43.9
#> Residual Deviance: 25.5 AIC: 29.5
The printed result is fairly concise: the call, the coefficients, and a few other model statistics.
Predicting on the training data:
pred <- predict(logr_vm, type = "response", newdata = dat)
pred
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.5512 0.5512 0.7272 0.5933
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 0.3134 0.2607 0.0643 0.8414
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.7272 0.3614 0.2365 0.1450
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 0.1999 0.0919 0.0127 0.0127
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 0.0754 0.9940 0.9860 0.9969
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 0.6037 0.1032 0.0919 0.0428
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 0.3614 0.9487 0.9135 0.9860
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 0.1158 0.4124 0.0850 0.5933
In most cases, all we want to do with a fitted model is use its coefficients to predict new data, i.e., to obtain output like the above.
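As a sanity check, the predictions above can be reproduced from the two coefficients alone, without the rest of the object. A minimal sketch, assuming `logr_vm` and `pred` from above are still in the workspace:

```r
# Reproduce the predictions using only the coefficients:
# p = 1 / (1 + exp(-(b0 + b1 * mpg))), i.e. the inverse logit,
# which base R provides as plogis()
b <- coef(logr_vm)
manual_pred <- plogis(b[1] + b[2] * dat$mpg)

# Compare against predict(logr_vm, type = "response")
all.equal(unname(manual_pred), unname(pred))  # should be TRUE
```

This is exactly why a multi-megabyte object feels excessive: two numbers carry all the predictive content here.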
So what is taking up all that storage? Let's look at what the object actually contains:
str(logr_vm)
#> List of 30
#> $ coefficients : Named num [1:2] -8.83 0.43
#> ..- attr(*, "names")= chr [1:2] "(Intercept)" "mpg"
#> $ residuals : Named num [1:32] -2.23 -2.23 1.38 1.69 -1.46 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ fitted.values : Named num [1:32] 0.551 0.551 0.727 0.593 0.313 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ effects : Named num [1:32] 0.703 -2.717 0.972 1.135 -0.549 ...
#> ..- attr(*, "names")= chr [1:32] "(Intercept)" "mpg" "" "" ...
#> $ R : num [1:2, 1:2] -2.03 0 -40.04 -6.31
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "(Intercept)" "mpg"
#> .. ..$ : chr [1:2] "(Intercept)" "mpg"
#> $ rank : int 2
#> $ qr :List of 5
#> ..$ qr : num [1:32, 1:2] -2.031 0.245 0.219 0.242 0.228 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> .. .. ..$ : chr [1:2] "(Intercept)" "mpg"
#> ..$ rank : int 2
#> ..$ qraux: num [1:2] 1.24 1.08
#> ..$ pivot: int [1:2] 1 2
#> ..$ tol : num 1e-11
#> ..- attr(*, "class")= chr "qr"
#> $ family :List of 12
#> ..$ family : chr "binomial"
#> ..$ link : chr "logit"
#> ..$ linkfun :function (mu)
#> ..$ linkinv :function (eta)
#> ..$ variance :function (mu)
#> ..$ dev.resids:function (y, mu, wt)
#> ..$ aic :function (y, n, mu, wt, dev)
#> ..$ mu.eta :function (eta)
#> ..$ initialize: language { if (NCOL(y) == 1) { ...
#> ..$ validmu :function (mu)
#> ..$ valideta :function (eta)
#> ..$ simulate :function (object, nsim)
#> ..- attr(*, "class")= chr "family"
#> $ linear.predictors: Named num [1:32] 0.206 0.206 0.98 0.378 -0.784 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ deviance : num 25.5
#> $ aic : num 29.5
#> $ null.deviance : num 43.9
#> $ iter : int 6
#> $ weights : Named num [1:32] 0.247 0.247 0.198 0.241 0.215 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ prior.weights : Named num [1:32] 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ df.residual : int 30
#> $ df.null : int 31
#> $ y : Named num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ converged : logi TRUE
#> $ boundary : logi FALSE
#> $ model :'data.frame': 32 obs. of 2 variables:
#> ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..- attr(*, "terms")=Classes 'terms', 'formula' language vs ~ mpg
#> .. .. ..- attr(*, "variables")= language list(vs, mpg)
#> .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : chr [1:2] "vs" "mpg"
#> .. .. .. .. ..$ : chr "mpg"
#> .. .. ..- attr(*, "term.labels")= chr "mpg"
#> .. .. ..- attr(*, "order")= int 1
#> .. .. ..- attr(*, "intercept")= int 1
#> .. .. ..- attr(*, "response")= int 1
#> .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. .. ..- attr(*, "predvars")= language list(vs, mpg)
#> .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#> .. .. .. ..- attr(*, "names")= chr [1:2] "vs" "mpg"
#> $ call : language glm(formula = vs ~ mpg, family = binomial, data = dat)
#> $ formula :Class 'formula' language vs ~ mpg
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ terms :Classes 'terms', 'formula' language vs ~ mpg
#> .. ..- attr(*, "variables")= language list(vs, mpg)
#> .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:2] "vs" "mpg"
#> .. .. .. ..$ : chr "mpg"
#> .. ..- attr(*, "term.labels")= chr "mpg"
#> .. ..- attr(*, "order")= int 1
#> .. ..- attr(*, "intercept")= int 1
#> .. ..- attr(*, "response")= int 1
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. ..- attr(*, "predvars")= language list(vs, mpg)
#> .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#> .. .. ..- attr(*, "names")= chr [1:2] "vs" "mpg"
#> $ data :'data.frame': 32 obs. of 3 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
#> ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
#> $ offset : NULL
#> $ control :List of 3
#> ..$ epsilon: num 1e-08
#> ..$ maxit : num 25
#> ..$ trace : logi FALSE
#> $ method : chr "glm.fit"
#> $ contrasts : NULL
#> $ xlevels : Named list()
#> - attr(*, "class")= chr [1:2] "glm" "lm"
Good grief, that's a lot to take in. Scanning through it, many components grow with the size of the fitting (training) data: the residuals, fitted values, effects, and model frame. Strikingly, the fitting data itself is stored as well, an obvious burden, and the model component stores the data yet again.
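Before deleting anything, we can quantify the weight with base R's `utils::object.size()`. A quick sketch (exact sizes will vary by R version and platform):

```r
# Size of the whole fitted object versus just the coefficients
print(object.size(logr_vm), units = "Kb")
print(object.size(coef(logr_vm)), units = "b")

# Individual components can be weighed the same way, e.g. the
# duplicated copies of the data:
print(object.size(logr_vm$data), units = "b")
print(object.size(logr_vm$model), units = "b")
```

Note that `object.size()` does not count environments captured by `terms` and `formula`, which is one reason on-disk `.rds` files can be much larger than this measurement suggests.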
Most of this information is never used at prediction time, so let's try deleting some components and see whether the model can still predict.
The stored data should have no effect on prediction:
logr_vm$data <- NULL
predict(logr_vm, type = "response", newdata = dat)
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.5512 0.5512 0.7272 0.5933
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 0.3134 0.2607 0.0643 0.8414
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.7272 0.3614 0.2365 0.1450
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 0.1999 0.0919 0.0127 0.0127
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 0.0754 0.9940 0.9860 0.9969
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 0.6037 0.1032 0.0919 0.0428
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 0.3614 0.9487 0.9135 0.9860
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 0.1158 0.4124 0.0850 0.5933
Delete the residuals and fitted values:
logr_vm$residuals <- NULL
logr_vm$fitted.values <- NULL
predict(logr_vm, type = "response", newdata = dat)
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.5512 0.5512 0.7272 0.5933
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 0.3134 0.2607 0.0643 0.8414
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.7272 0.3614 0.2365 0.1450
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 0.1999 0.0919 0.0127 0.0127
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 0.0754 0.9940 0.9860 0.9969
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 0.6037 0.1032 0.0919 0.0428
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 0.3614 0.9487 0.9135 0.9860
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 0.1158 0.4124 0.0850 0.5933
Delete the weights:
logr_vm$weights <- NULL
logr_vm$prior.weights <- NULL
predict(logr_vm, type = "response", newdata = dat)
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.5512 0.5512 0.7272 0.5933
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 0.3134 0.2607 0.0643 0.8414
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.7272 0.3614 0.2365 0.1450
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 0.1999 0.0919 0.0127 0.0127
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 0.0754 0.9940 0.9860 0.9969
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 0.6037 0.1032 0.0919 0.0428
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 0.3614 0.9487 0.9135 0.9860
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 0.1158 0.4124 0.0850 0.5933
Judging from the operations above, essentially none of the components that scale with the fitting data are used at prediction time, so when a model object is large they are all candidates for deletion.
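The manual deletions above can be wrapped into a small helper. A sketch, using the component names from the `str()` output; I also drop `y` and `model`, which a quick test suggests are unused when `newdata` is supplied, but this should be re-verified for other model classes:

```r
# Strip glm components that scale with the training data but are
# not needed by predict(..., newdata = ...); returns the slim object.
slim_glm <- function(fit) {
  fit$data          <- NULL
  fit$residuals     <- NULL
  fit$fitted.values <- NULL
  fit$weights       <- NULL
  fit$prior.weights <- NULL
  fit$y             <- NULL
  fit$model         <- NULL
  fit
}

small <- slim_glm(glm(vs ~ mpg, data = dat, family = binomial))
# Prediction should still work as long as newdata is supplied
predict(small, type = "response", newdata = dat)
```

Beware that `summary()`, `residuals()`, and friends will no longer work on the slimmed object; this trade-off is exactly what the package below formalizes.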
The "butcher" tool
A quick search shows that the tidymodels organization has developed a dedicated package, butcher, for exactly this problem.
install.packages("butcher")
It provides a set of axe functions that chop away different parts of a model object, and it supports the commonly used model classes (see the list):
- axe_call(): to remove the call object.
- axe_ctrl(): to remove controls associated with training.
- axe_data(): to remove the original training data.
- axe_env(): to remove environments.
- axe_fitted(): to remove fitted values.
Let's try axing all of these and see whether the model can still predict:
library(butcher)
First, refit the full model:
logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)
Check the memory footprint:
butcher::weigh(logr_vm)
#> # A tibble: 47 x 2
#> object size
#> <chr> <dbl>
#> 1 family.variance 0.0437
#> 2 family.dev.resids 0.0437
#> 3 family.aic 0.0437
#> 4 family.validmu 0.0437
#> 5 family.simulate 0.0437
#> 6 family.initialize 0.0174
#> 7 qr.qr 0.00418
#> 8 terms 0.00318
#> 9 residuals 0.00286
#> 10 fitted.values 0.00286
#> # … with 37 more rows
Time to start chopping:
cleaned_lm <- butcher::axe_env(logr_vm, verbose = TRUE)
#> x No memory released. Do not butcher.
cleaned_lm <- butcher::axe_call(cleaned_lm, verbose = TRUE)
#> ✓ Memory released: '264 B'
#> x Disabled: `print()`, `summary()`
cleaned_lm <- butcher::axe_data(cleaned_lm, verbose = TRUE)
#> x No memory released. Do not butcher.
cleaned_lm <- butcher::axe_fitted(cleaned_lm, verbose = TRUE)
#> ✓ Memory released: '232 B'
#> x Disabled: `print()`, `summary()`, `fitted()`
cleaned_lm <- butcher::axe_ctrl(cleaned_lm, verbose = TRUE)
#> x No memory released. Do not butcher.
predict(cleaned_lm, type = "response", newdata = dat)
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.5512 0.5512 0.7272 0.5933
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 0.3134 0.2607 0.0643 0.8414
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.7272 0.3614 0.2365 0.1450
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 0.1999 0.0919 0.0127 0.0127
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 0.0754 0.9940 0.9860 0.9969
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 0.6037 0.1032 0.0919 0.0428
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 0.3614 0.9487 0.9135 0.9860
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 0.1158 0.4124 0.0850 0.5933
Strangely, data was not axed:
cleaned_lm$data
#> mpg am vs
#> Mazda RX4 21.0 1 0
#> Mazda RX4 Wag 21.0 1 0
#> Datsun 710 22.8 1 1
#> Hornet 4 Drive 21.4 0 1
#> Hornet Sportabout 18.7 0 0
#> Valiant 18.1 0 1
#> Duster 360 14.3 0 0
#> Merc 240D 24.4 0 1
#> Merc 230 22.8 0 1
#> Merc 280 19.2 0 1
#> Merc 280C 17.8 0 1
#> Merc 450SE 16.4 0 0
#> Merc 450SL 17.3 0 0
#> Merc 450SLC 15.2 0 0
#> Cadillac Fleetwood 10.4 0 0
#> Lincoln Continental 10.4 0 0
#> Chrysler Imperial 14.7 0 0
#> Fiat 128 32.4 1 1
#> Honda Civic 30.4 1 1
#> Toyota Corolla 33.9 1 1
#> Toyota Corona 21.5 0 1
#> Dodge Challenger 15.5 0 0
#> AMC Javelin 15.2 0 0
#> Camaro Z28 13.3 0 0
#> Pontiac Firebird 19.2 0 0
#> Fiat X1-9 27.3 1 1
#> Porsche 914-2 26.0 1 0
#> Lotus Europa 30.4 1 1
#> Ford Pantera L 15.8 1 0
#> Ferrari Dino 19.7 1 0
#> Maserati Bora 15.0 1 0
#> Volvo 142E 21.4 1 1
Is it because the model is too small for axing to be worthwhile? Let's try the official example model.
our_model <- function() {
some_junk_in_the_environment <- runif(1e6) # we didn't know about
lm(mpg ~ ., data = mtcars)
}
big_lm <- our_model()
butcher::weigh(big_lm)
#> # A tibble: 25 x 2
#> object size
#> <chr> <dbl>
#> 1 terms 8.01
#> 2 qr.qr 0.00666
#> 3 residuals 0.00286
#> 4 fitted.values 0.00286
#> 5 effects 0.0014
#> 6 coefficients 0.00109
#> 7 call 0.000728
#> 8 model.mpg 0.000304
#> 9 model.cyl 0.000304
#> 10 model.disp 0.000304
#> # … with 15 more rows
clean_lm <- butcher::axe_data(big_lm, verbose = TRUE)
#> x No memory released. Do not butcher.
Strange, the data still isn't axed. Digging into the underlying code, I found that butcher has an internal assessment mechanism: it checks whether axing would actually release memory, and if not, it leaves the object alone.
butcher:::assess_object(big_lm, clean_lm)
#> x No memory released. Do not butcher.
Finally, let's look at the function that shares the package's name, which should automatically remove everything unnecessary:
butcher::butcher(big_lm, verbose = TRUE)
#> ✓ Memory released: '7,999,696 B'
#> x Disabled: `print()`, `summary()`, `fitted()`
#>
#> Call:
#> dummy_call()
#>
#> Coefficients:
#> (Intercept) cyl disp hp drat wt
#> 12.3034 -0.1114 0.0133 -0.0215 0.7871 -3.7153
#> qsec vs am gear carb
#> 0.8210 0.3178 2.5202 0.6554 -0.1994
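The practical payoff shows up on disk: serializing the butchered object with `saveRDS()` should produce a much smaller file, because the junk-laden environment captured by `terms` is no longer serialized. A sketch (exact sizes vary with the model and compression):

```r
# Compare on-disk sizes of the full and butchered models
small_lm <- butcher::butcher(big_lm)

f_full  <- tempfile(fileext = ".rds")
f_small <- tempfile(fileext = ".rds")
saveRDS(big_lm, f_full)
saveRDS(small_lm, f_small)

file.size(f_full)   # full model, carries the captured environment
file.size(f_small)  # butchered model, expected to be far smaller

unlink(c(f_full, f_small))
```

This is the scenario the package targets: models trained inside functions or pipelines quietly drag their enclosing environments into every saved file.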