We used multilevel multiple imputation to address missing data and a mixed-effects logistic regression model to address the data’s hierarchical structure. We used parallel processing to decrease computational time on large processes amenable to parallelization.
This document will lay out not only what we did to obtain our final analysis, but also how we set up for large tasks and alternate methods we tried that were unsuccessful. We will have Rmarkdown run code whenever possible, but in some cases the run times are prohibitively long. In these cases we will screen shot console outputs for discussion. We hope this technical appendix will provide transparency to our analyses, as well as guidance for clinicians and researchers looking to perform similar analyses in the future. And as always, despite our best efforts, we may have made mistakes or overlooked more efficient ways of doing things. We would love to hear from you and share in any discussions, so please contact us at our Github page!
The GWTG-HF dataset is available upon request to the AHA. An application for use is required. Proposals can be submitted at www.heart.org/qualityresearch. The R code used to generate our clean data from the raw GWTG-HF can be found in the same GitHub repository as this Rmarkdown file.
This analysis was performed on the AHA’s Precision Medicine Platform (PMP) using an r5.12xlarge AWS EC2 instance. If you are also using the PMP for these kinds of analysis, I recommend discussing your task with IT support, as they may have to remove some of the automatic disconnection timers inherent to the platform due to the long run times required for some of the tasks. Note that the r5.12xlarge instance is a fairly large machine - it has 48 vCPU and 384GiB RAM. While having such a large machine is not necessary, it will greatly decrease the time it takes to complete the task.
We assume proficiency with the statistical software R, as well as familiarity with generalized linear models and decision tree analysis.
Before starting, I highly recommend Stef van Buuren’s free online book Flexible Imputation of Missing Data. He provides both a theoretical understanding of multiple imputation by chained equations and a comprehensive guide of how to properly perform the procedure. He is also the primary contributor to the R library MICE, which is the core library we will be using to perform multiple imputation.
Start by loading all the packages we’ll need into our Rmarkdown environment, followed by our dataset.
library(dplyr)
library(mice)
library(miceadds)
library(broom.mixed)
library(lme4)
library(multilevel)
library(parallel)
load('/mnt/workspace/rstudio_projects/data/gdmt_11.11_clean.RData')
dplyr is used for data manipulation. MICE is the core library used to implement multiple imputation. It has been documented extensively by Gerko Vink and Stef van Buuren in both case-based and book formats. The source code is publicly available on GitHub. Miceadds adds quite a few univariate imputation methods, but we use it because it provides a clean way to save imputation objects (mids), which can take quite some time to produce. Boom.mixed is necessary for MICE. lme4 is used to determine whether or not a hierarchical data structure is present and is later used as the core library to implement our mixed model. Multilevel is used to determine hierarchical data structure. Parallel will be used later on in the analysis in the model fitting step.
Now lets explore our data a little bit:
str(df)
## 'data.frame': 136144 obs. of 40 variables:
## $ SITE_ID : Factor w/ 560 levels "10001","10140",..: 89 396 261 422 102 23 309 66 123 65 ...
## $ GDMT_HFBB : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
## $ GDMT_ACEIARB : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ GDMT_ARNI : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ GDMT_MRA : Factor w/ 2 levels "0","1": 2 1 2 NA 1 1 1 1 2 2 ...
## $ GDMT_BBcontra : Factor w/ 2 levels "0","1": 1 1 NA NA 1 NA 1 1 2 1 ...
## $ GDMT_RAASIcontra : Factor w/ 2 levels "0","1": 1 1 2 NA 2 1 2 1 1 2 ...
## $ GDMT_MRAcontra : Factor w/ 2 levels "0","1": 1 1 NA NA 1 1 1 1 1 1 ...
## $ GDMT_ARNIcontra : Factor w/ 2 levels "0","1": 1 1 NA NA 2 1 1 1 1 2 ...
## $ GENDERi : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 1 1 ...
## $ race2i : Factor w/ 5 levels "Asian","Black",..: 5 5 5 2 5 5 5 5 3 3 ...
## $ insurance : Factor w/ 4 levels "Medicaid","Medicare",..: 2 3 1 NA 2 2 2 3 3 3 ...
## $ AGEi : num 63 47 57 54 94 78 85 73 38 75 ...
## $ PMHx_none : Factor w/ 2 levels "0","1": 1 1 1 NA 1 1 1 1 1 1 ...
## $ PMHx_CHF : Factor w/ 2 levels "0","1": 2 1 2 NA 2 1 2 2 2 2 ...
## $ PMHx_ESRD : Factor w/ 2 levels "0","1": 1 1 1 NA 1 1 1 1 2 1 ...
## $ PMHx_Cr2 : Factor w/ 2 levels "0","1": 1 1 1 NA 1 2 1 1 2 1 ...
## $ PMHx_PPMICDCRT : Factor w/ 2 levels "0","1": 2 1 1 NA 2 1 2 2 1 1 ...
## $ PMHx_fibfl : Factor w/ 2 levels "0","1": 1 1 1 NA 2 1 1 1 1 1 ...
## $ GH_HEART_TRANSPLANTS : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 2 1 ...
## $ RESIDENTS : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 2 2 2 ...
## $ OH_EF : num 30 27 33 14 20 35 37 21 22 20 ...
## $ OH_TRANSPLANT : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ covar_ino : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ inhospRx_BB : Factor w/ 2 levels "0","1": NA NA 2 NA 2 2 NA NA 1 NA ...
## $ inhospRx_MRA : Factor w/ 2 levels "0","1": NA NA 2 NA 1 1 NA NA 2 NA ...
## $ inhospRx_none : Factor w/ 2 levels "0","1": NA NA 1 NA 1 1 NA NA 1 NA ...
## $ inhospRx_ARNI : Factor w/ 2 levels "0","1": NA NA 1 NA 1 1 NA NA 1 NA ...
## $ inhospRx_ACEIARB : Factor w/ 2 levels "0","1": NA NA 2 NA 2 1 NA NA 2 NA ...
## $ covar_Kdisc : num NA NA NA NA 4.5 4.1 NA NA 5.1 NA ...
## $ covar_Crdisc : num NA NA NA NA 1.2 1.6 NA NA 6.7 NA ...
## $ OH_DISCBPSYST : num NA NA 111 NA 142 139 NA NA 119 NA ...
## $ OH_DISCHR : num NA NA NA NA 59 73 NA NA 79 NA ...
## $ zip_designation : Factor w/ 4 levels "Rural","SmallTown",..: NA NA NA 3 3 2 NA NA 4 NA ...
## $ population_total : num NA NA NA 35641 7223 ...
## $ distress_score : num NA NA NA 49.2 47.9 ...
## $ covar_OutsideZIPregion: Factor w/ 2 levels "0","1": NA NA NA 1 2 2 NA NA 2 NA ...
## $ followup : Factor w/ 2 levels "0","1": 2 2 2 2 2 NA 2 2 2 2 ...
## $ OH_ARNI : Factor w/ 2 levels "0","1": NA NA NA NA 1 1 NA NA 1 NA ...
## $ DCtoContCare : Factor w/ 2 levels "0","1": NA NA NA NA 1 2 NA NA 1 2 ...
#how many missing in each variable?
lapply(df, function(x)
sum(is.na(x))
)
## $SITE_ID
## [1] 0
##
## $GDMT_HFBB
## [1] 3024
##
## $GDMT_ACEIARB
## [1] 513
##
## $GDMT_ARNI
## [1] 0
##
## $GDMT_MRA
## [1] 7266
##
## $GDMT_BBcontra
## [1] 25857
##
## $GDMT_RAASIcontra
## [1] 2128
##
## $GDMT_MRAcontra
## [1] 14237
##
## $GDMT_ARNIcontra
## [1] 15122
##
## $GENDERi
## [1] 322
##
## $race2i
## [1] 0
##
## $insurance
## [1] 9060
##
## $AGEi
## [1] 0
##
## $PMHx_none
## [1] 8504
##
## $PMHx_CHF
## [1] 8504
##
## $PMHx_ESRD
## [1] 8504
##
## $PMHx_Cr2
## [1] 8504
##
## $PMHx_PPMICDCRT
## [1] 8504
##
## $PMHx_fibfl
## [1] 8504
##
## $GH_HEART_TRANSPLANTS
## [1] 0
##
## $RESIDENTS
## [1] 0
##
## $OH_EF
## [1] 0
##
## $OH_TRANSPLANT
## [1] 0
##
## $covar_ino
## [1] 0
##
## $inhospRx_BB
## [1] 55647
##
## $inhospRx_MRA
## [1] 55647
##
## $inhospRx_none
## [1] 55647
##
## $inhospRx_ARNI
## [1] 55647
##
## $inhospRx_ACEIARB
## [1] 55647
##
## $covar_Kdisc
## [1] 96076
##
## $covar_Crdisc
## [1] 94145
##
## $OH_DISCBPSYST
## [1] 57867
##
## $OH_DISCHR
## [1] 59064
##
## $zip_designation
## [1] 59969
##
## $population_total
## [1] 59969
##
## $distress_score
## [1] 59969
##
## $covar_OutsideZIPregion
## [1] 58122
##
## $followup
## [1] 7650
##
## $OH_ARNI
## [1] 91588
##
## $DCtoContCare
## [1] 73568
#how many complete cases?
sum(complete.cases(df))
## [1] 23687
Note that missingness ranges from 0 to 70%. To understand how this missingness will impact our statistical analysis, we need to better understand the mechanism of missingness. Data that are “missing completely at random” (MCAR) are frequently amenable to “simple” fixes such as complete case analysis (also known as listwise deletion). Data that are “missing at random” (MAR) are amenable to multiple imputation in most cases, or univariate imputation if the data is monotone. Data that are “missing not at random” (MNAR) present the most complex scenario and frequently require additional research into the cause of data missingness. Blindly imputing MNAR data can result in inaccurate estimates in the final model.
MICE has several built-in methods for diagnosing the mechanism of missingness, but due to the size of the GWTG-HF dataset, the visual assessments are not practical.
#what is the pattern of data missingness?
md.pattern(df, plot=FALSE)
We have intentionally hidden the results of the above code, as the output makes the Rmarkdown somewhat unreadable. The first column indicates the number of observations with that pattern of missingness, the subsequent columns indicate if the associated variable is missing or not, and the final column shows the number of variables that are missing for that particular pattern. Because there are 40 variables and over 1500 patterns of missingness, this matrix is rather large, but is nevertheless worth reviewing in detail. After review, the data appear to be MAR, and the missingess pattern is not monotone. In addition, we can surmise that missingness happens at a systemic level, with some sites simply not reporting entire groups of variables, such as patients’ medical comorbidities, their discharge labs/vitals, etc. Such a pattern is amenable to multiple imputation by chained equations.
When interpreting data in the context of missingness, we also need to consider if a data has a multilevel structure (also referred to as hierarchical or clustered data). We know that patients are nested within hospitals, and it’s reasonable to suspect that culture, resources, and other unobserved variables at each site are associated with the characteristics of the patients they treat. We can assess this quantitatively by looking at the intraclass correlation coefficient (ICC) for incomplete variables.
#Find the variables with missing data and split the variable names
#into lists based on class (factor vs numeric)
varlist <- colnames(df)
hasmiss <- list()
miss_f <- list()
miss_n <- list()
for(i in 1:ncol(df)) {
if(with(df,sum(is.na(eval(as.name(varlist[i]))))) > 0) {
hasmiss <- append(hasmiss, varlist[i])
}
}
for(i in 1:length(hasmiss)) {
if (with(df,class(eval(as.name(hasmiss[[i]])))) == "factor") {
miss_f <- append(miss_f, hasmiss[i])
} else {
miss_n <- append(miss_n, hasmiss[i])
}
}
rm(varlist, hasmiss)
#calculate the ICC for each variable. Note the models used to calculate ICC are dependent
#on the dependent variable's class (ANOVA for numeric variables, unconditional mean model
#for factor variables)
fun_f <- function(i) {
m0 <- glmer(eval(as.name(i)) ~ (1|SITE_ID),
data=df,
family='binomial',
control=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5)))
ICC <- m0@theta[1]^2/(m0@theta[1]^2 + (3.14159^2/3))
print(i)
print(ICC)
}
fun_n <- function(i) {
ICC <- multilevel::ICC1(aov(eval(as.name(i)) ~ SITE_ID, data=df))
print(i)
print(ICC)
}
suppress <- lapply(miss_f, fun_f)
suppress <- lapply(miss_n, fun_n)
rm(miss_n, miss_f, fun_n, fun_f, suppress)
There is no cutoff for a “significant” ICC, but we see that SITE_ID alone explains more than 30% of the variance seen in many of the variables to be imputed; hence, clustering cannot be ignored. Any imputation technique we apply our data to needs to account for these three factors. Because the data is MAR and not monotone, multiple imputation arises as an attractive option for imputation. Because the data is hierarchical we will need to apply a multilevel multiple imputation model.
In this section we describe how we implemented FCS using generalized linear models. The imputation model ultimately did not converge, so this approach was ultimately abandoned. Nevertheless, we learned much about multiple imputation through this failure, and so present it anyways.
The first task of the imputation step is to specify the kind of statistical model we are going to use to predict each of our independent variables with missingness. In MICE, this is referred to as choosing the imputation method. This is largely going to depend on the class of that variable (numeric vs factor/binary vs factor/nominal vs factor/ordinal), but it is also determined by the presence of a hierarchical data structure. The presence of clustering in our data greatly complicates multiple imputation, but luckily the MICE library comes with several imputation methods specifically for hierarchical data with 2 levels. These methods typically start with the prefix “2l.”
Note that there are no 2l methods for ordinal or nominal outcomes. In our case this caused a problem for the variables insurance and zip_designation. Our work around was to create binary variables based on nominal variables and impute the binary variable instead. We retained as much information as possible by continuing to use nominal variables as dependent variables in other imputation models. For some nominal variables, like zip_designation, creating a binary derivative was not possible, so it had to be dropped entirely from analysis.
#convert nominal variables to binary
df <- df %>%
mutate(insured = ifelse(insurance %in% 'Medicaid' |
insurance %in% 'Medicare' |
insurance %in% 'Other', 1,
ifelse(insurance %in% 'none', 0,
ifelse(is.na(insurance), NA, 999))))
df$insured <- as.factor(df$insured)
We have now done sufficient set up to continue setting up imputation methods. First we set the cluster variable, which in our case is SITE_ID. Interestingly, MICE requires that cluster variables be changed to numeric class.
#set the cluster variable (note the cluster var must be numeric)
df$SITE_ID <- as.numeric(df$SITE_ID)
cluster_var=df$SITE_ID
Next we create the imputation vector.
impmethod <- character(ncol(df))
names(impmethod) <- colnames(df)
We then assign imputation methods to each variable based on its class. We arbitrarily chose an ICC < 0.05 as representing a negligible clustering effect, and this allows us to use a non-2l imputation method for variables that meet this criteria. This is desirable because all 2l methods call glmer() and assign the cluster variable a random intercept, which becomes computationally expensive for the complex models we use for imputation.
impmethod['SITE_ID'] <- ""
impmethod['GDMT_HFBB'] <- "2l.binary"
impmethod['GDMT_ACEIARB'] <- "2l.binary"
impmethod['GDMT_ARNI'] <- ""
impmethod['GDMT_MRA'] <- "2l.binary"
impmethod['GDMT_BBcontra'] <- "2l.binary"
impmethod['GDMT_RAASIcontra'] <- "2l.binary"
impmethod['GDMT_MRAcontra'] <- "2l.binary"
impmethod['GDMT_ARNIcontra'] <- "2l.binary"
impmethod['GENDERi'] <- "logreg"
impmethod['race2i'] <- ""
impmethod['insurance'] <- ""
impmethod['AGEi'] <- ""
impmethod['PMHx_none'] <- "2l.binary"
impmethod['PMHx_CHF'] <- "2l.binary"
impmethod['PMHx_ESRD'] <- "2l.binary"
impmethod['PMHx_Cr2'] <- "2l.glm.bin"
impmethod['PMHx_PPMICDCRT'] <- "2l.binary"
impmethod['PMHx_fibfl'] <- "logreg"
impmethod['GH_HEART_TRANSPLANTS'] <- ""
impmethod['RESIDENTS'] <- ""
impmethod['OH_EF'] <- ""
impmethod['OH_TRANSPLANT'] <- ""
impmethod['covar_ino'] <- ""
impmethod['inhospRx_BB'] <- "2l.binary"
impmethod['inhospRx_MRA'] <- "2l.binary"
impmethod['inhospRx_none'] <- "2l.binary"
impmethod['inhospRx_ARNI'] <- "2l.binary"
impmethod['inhospRx_ACEIARB'] <- "2l.binary"
impmethod['covar_Kdisc'] <- "pmm"
impmethod['covar_Crdisc'] <- "pmm"
impmethod['OH_DISCBPSYST'] <- "pmm"
impmethod['OH_DISCHR'] <- "pmm"
impmethod['zip_designation'] <- ""
impmethod['population_total'] <- "2l.pmm"
impmethod['distress_score'] <- "2l.pmm"
impmethod['covar_OutsideZIPregion'] <- "2l.binary"
impmethod['followup'] <- "2l.binary"
impmethod['DCtoContCare'] <- "2l.binary"
impmethod['insured'] <- "2l.binary"
print(impmethod)
## SITE_ID GDMT_HFBB GDMT_ACEIARB
## "" "2l.binary" "2l.binary"
## GDMT_ARNI GDMT_MRA GDMT_BBcontra
## "" "2l.binary" "2l.binary"
## GDMT_RAASIcontra GDMT_MRAcontra GDMT_ARNIcontra
## "2l.binary" "2l.binary" "2l.binary"
## GENDERi race2i insurance
## "logreg" "" ""
## AGEi PMHx_none PMHx_CHF
## "" "2l.binary" "2l.binary"
## PMHx_ESRD PMHx_Cr2 PMHx_PPMICDCRT
## "2l.binary" "2l.glm.bin" "2l.binary"
## PMHx_fibfl GH_HEART_TRANSPLANTS RESIDENTS
## "logreg" "" ""
## OH_EF OH_TRANSPLANT covar_ino
## "" "" ""
## inhospRx_BB inhospRx_MRA inhospRx_none
## "2l.binary" "2l.binary" "2l.binary"
## inhospRx_ARNI inhospRx_ACEIARB covar_Kdisc
## "2l.binary" "2l.binary" "pmm"
## covar_Crdisc OH_DISCBPSYST OH_DISCHR
## "pmm" "pmm" "pmm"
## zip_designation population_total distress_score
## "" "2l.pmm" "2l.pmm"
## covar_OutsideZIPregion followup OH_ARNI
## "2l.binary" "2l.binary" ""
## DCtoContCare insured
## "2l.binary" "2l.binary"
The default imputation method for each variable is "“, which indicates to MICE that an imputation model should not be fit to that variable. We explicitly assigned”" where appropriate to make sure that we considered every variable in the dataset, but an explicit assignment was not necessary. Notice that variables might be assigned "" because they have no missing values or because they are nominal. There is value in keeping nominal variables in the dataset because they can still be used to impute other variables, even though they themselves cannot be imputed.
The next step required in FCS is to determine which variables belong in each imputation model. This is done through the prediction matrix, which is created via the following:
pm <- make.predictorMatrix(df)
print(pm)
## SITE_ID GDMT_HFBB GDMT_ACEIARB GDMT_ARNI GDMT_MRA
## SITE_ID 0 1 1 1 1
## GDMT_HFBB 1 0 1 1 1
## GDMT_ACEIARB 1 1 0 1 1
## GDMT_ARNI 1 1 1 0 1
## GDMT_MRA 1 1 1 1 0
## GDMT_BBcontra 1 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1 1
## GDMT_MRAcontra 1 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1 1
## GENDERi 1 1 1 1 1
## race2i 1 1 1 1 1
## insurance 1 1 1 1 1
## AGEi 1 1 1 1 1
## PMHx_none 1 1 1 1 1
## PMHx_CHF 1 1 1 1 1
## PMHx_ESRD 1 1 1 1 1
## PMHx_Cr2 1 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1 1
## PMHx_fibfl 1 1 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1 1 1
## RESIDENTS 1 1 1 1 1
## OH_EF 1 1 1 1 1
## OH_TRANSPLANT 1 1 1 1 1
## covar_ino 1 1 1 1 1
## inhospRx_BB 1 1 1 1 1
## inhospRx_MRA 1 1 1 1 1
## inhospRx_none 1 1 1 1 1
## inhospRx_ARNI 1 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1 1
## covar_Kdisc 1 1 1 1 1
## covar_Crdisc 1 1 1 1 1
## OH_DISCBPSYST 1 1 1 1 1
## OH_DISCHR 1 1 1 1 1
## zip_designation 1 1 1 1 1
## population_total 1 1 1 1 1
## distress_score 1 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1 1
## followup 1 1 1 1 1
## OH_ARNI 1 1 1 1 1
## DCtoContCare 1 1 1 1 1
## insured 1 1 1 1 1
## GDMT_BBcontra GDMT_RAASIcontra GDMT_MRAcontra
## SITE_ID 1 1 1
## GDMT_HFBB 1 1 1
## GDMT_ACEIARB 1 1 1
## GDMT_ARNI 1 1 1
## GDMT_MRA 1 1 1
## GDMT_BBcontra 0 1 1
## GDMT_RAASIcontra 1 0 1
## GDMT_MRAcontra 1 1 0
## GDMT_ARNIcontra 1 1 1
## GENDERi 1 1 1
## race2i 1 1 1
## insurance 1 1 1
## AGEi 1 1 1
## PMHx_none 1 1 1
## PMHx_CHF 1 1 1
## PMHx_ESRD 1 1 1
## PMHx_Cr2 1 1 1
## PMHx_PPMICDCRT 1 1 1
## PMHx_fibfl 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1
## RESIDENTS 1 1 1
## OH_EF 1 1 1
## OH_TRANSPLANT 1 1 1
## covar_ino 1 1 1
## inhospRx_BB 1 1 1
## inhospRx_MRA 1 1 1
## inhospRx_none 1 1 1
## inhospRx_ARNI 1 1 1
## inhospRx_ACEIARB 1 1 1
## covar_Kdisc 1 1 1
## covar_Crdisc 1 1 1
## OH_DISCBPSYST 1 1 1
## OH_DISCHR 1 1 1
## zip_designation 1 1 1
## population_total 1 1 1
## distress_score 1 1 1
## covar_OutsideZIPregion 1 1 1
## followup 1 1 1
## OH_ARNI 1 1 1
## DCtoContCare 1 1 1
## insured 1 1 1
## GDMT_ARNIcontra GENDERi race2i insurance AGEi PMHx_none
## SITE_ID 1 1 1 1 1 1
## GDMT_HFBB 1 1 1 1 1 1
## GDMT_ACEIARB 1 1 1 1 1 1
## GDMT_ARNI 1 1 1 1 1 1
## GDMT_MRA 1 1 1 1 1 1
## GDMT_BBcontra 1 1 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1 1 1
## GDMT_MRAcontra 1 1 1 1 1 1
## GDMT_ARNIcontra 0 1 1 1 1 1
## GENDERi 1 0 1 1 1 1
## race2i 1 1 0 1 1 1
## insurance 1 1 1 0 1 1
## AGEi 1 1 1 1 0 1
## PMHx_none 1 1 1 1 1 0
## PMHx_CHF 1 1 1 1 1 1
## PMHx_ESRD 1 1 1 1 1 1
## PMHx_Cr2 1 1 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1 1 1
## PMHx_fibfl 1 1 1 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1 1 1 1
## RESIDENTS 1 1 1 1 1 1
## OH_EF 1 1 1 1 1 1
## OH_TRANSPLANT 1 1 1 1 1 1
## covar_ino 1 1 1 1 1 1
## inhospRx_BB 1 1 1 1 1 1
## inhospRx_MRA 1 1 1 1 1 1
## inhospRx_none 1 1 1 1 1 1
## inhospRx_ARNI 1 1 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1 1 1
## covar_Kdisc 1 1 1 1 1 1
## covar_Crdisc 1 1 1 1 1 1
## OH_DISCBPSYST 1 1 1 1 1 1
## OH_DISCHR 1 1 1 1 1 1
## zip_designation 1 1 1 1 1 1
## population_total 1 1 1 1 1 1
## distress_score 1 1 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1 1 1
## followup 1 1 1 1 1 1
## OH_ARNI 1 1 1 1 1 1
## DCtoContCare 1 1 1 1 1 1
## insured 1 1 1 1 1 1
## PMHx_CHF PMHx_ESRD PMHx_Cr2 PMHx_PPMICDCRT PMHx_fibfl
## SITE_ID 1 1 1 1 1
## GDMT_HFBB 1 1 1 1 1
## GDMT_ACEIARB 1 1 1 1 1
## GDMT_ARNI 1 1 1 1 1
## GDMT_MRA 1 1 1 1 1
## GDMT_BBcontra 1 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1 1
## GDMT_MRAcontra 1 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1 1
## GENDERi 1 1 1 1 1
## race2i 1 1 1 1 1
## insurance 1 1 1 1 1
## AGEi 1 1 1 1 1
## PMHx_none 1 1 1 1 1
## PMHx_CHF 0 1 1 1 1
## PMHx_ESRD 1 0 1 1 1
## PMHx_Cr2 1 1 0 1 1
## PMHx_PPMICDCRT 1 1 1 0 1
## PMHx_fibfl 1 1 1 1 0
## GH_HEART_TRANSPLANTS 1 1 1 1 1
## RESIDENTS 1 1 1 1 1
## OH_EF 1 1 1 1 1
## OH_TRANSPLANT 1 1 1 1 1
## covar_ino 1 1 1 1 1
## inhospRx_BB 1 1 1 1 1
## inhospRx_MRA 1 1 1 1 1
## inhospRx_none 1 1 1 1 1
## inhospRx_ARNI 1 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1 1
## covar_Kdisc 1 1 1 1 1
## covar_Crdisc 1 1 1 1 1
## OH_DISCBPSYST 1 1 1 1 1
## OH_DISCHR 1 1 1 1 1
## zip_designation 1 1 1 1 1
## population_total 1 1 1 1 1
## distress_score 1 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1 1
## followup 1 1 1 1 1
## OH_ARNI 1 1 1 1 1
## DCtoContCare 1 1 1 1 1
## insured 1 1 1 1 1
## GH_HEART_TRANSPLANTS RESIDENTS OH_EF OH_TRANSPLANT
## SITE_ID 1 1 1 1
## GDMT_HFBB 1 1 1 1
## GDMT_ACEIARB 1 1 1 1
## GDMT_ARNI 1 1 1 1
## GDMT_MRA 1 1 1 1
## GDMT_BBcontra 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1
## GDMT_MRAcontra 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1
## GENDERi 1 1 1 1
## race2i 1 1 1 1
## insurance 1 1 1 1
## AGEi 1 1 1 1
## PMHx_none 1 1 1 1
## PMHx_CHF 1 1 1 1
## PMHx_ESRD 1 1 1 1
## PMHx_Cr2 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1
## PMHx_fibfl 1 1 1 1
## GH_HEART_TRANSPLANTS 0 1 1 1
## RESIDENTS 1 0 1 1
## OH_EF 1 1 0 1
## OH_TRANSPLANT 1 1 1 0
## covar_ino 1 1 1 1
## inhospRx_BB 1 1 1 1
## inhospRx_MRA 1 1 1 1
## inhospRx_none 1 1 1 1
## inhospRx_ARNI 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1
## covar_Kdisc 1 1 1 1
## covar_Crdisc 1 1 1 1
## OH_DISCBPSYST 1 1 1 1
## OH_DISCHR 1 1 1 1
## zip_designation 1 1 1 1
## population_total 1 1 1 1
## distress_score 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1
## followup 1 1 1 1
## OH_ARNI 1 1 1 1
## DCtoContCare 1 1 1 1
## insured 1 1 1 1
## covar_ino inhospRx_BB inhospRx_MRA inhospRx_none
## SITE_ID 1 1 1 1
## GDMT_HFBB 1 1 1 1
## GDMT_ACEIARB 1 1 1 1
## GDMT_ARNI 1 1 1 1
## GDMT_MRA 1 1 1 1
## GDMT_BBcontra 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1
## GDMT_MRAcontra 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1
## GENDERi 1 1 1 1
## race2i 1 1 1 1
## insurance 1 1 1 1
## AGEi 1 1 1 1
## PMHx_none 1 1 1 1
## PMHx_CHF 1 1 1 1
## PMHx_ESRD 1 1 1 1
## PMHx_Cr2 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1
## PMHx_fibfl 1 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1 1
## RESIDENTS 1 1 1 1
## OH_EF 1 1 1 1
## OH_TRANSPLANT 1 1 1 1
## covar_ino 0 1 1 1
## inhospRx_BB 1 0 1 1
## inhospRx_MRA 1 1 0 1
## inhospRx_none 1 1 1 0
## inhospRx_ARNI 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1
## covar_Kdisc 1 1 1 1
## covar_Crdisc 1 1 1 1
## OH_DISCBPSYST 1 1 1 1
## OH_DISCHR 1 1 1 1
## zip_designation 1 1 1 1
## population_total 1 1 1 1
## distress_score 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1
## followup 1 1 1 1
## OH_ARNI 1 1 1 1
## DCtoContCare 1 1 1 1
## insured 1 1 1 1
## inhospRx_ARNI inhospRx_ACEIARB covar_Kdisc covar_Crdisc
## SITE_ID 1 1 1 1
## GDMT_HFBB 1 1 1 1
## GDMT_ACEIARB 1 1 1 1
## GDMT_ARNI 1 1 1 1
## GDMT_MRA 1 1 1 1
## GDMT_BBcontra 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1
## GDMT_MRAcontra 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1
## GENDERi 1 1 1 1
## race2i 1 1 1 1
## insurance 1 1 1 1
## AGEi 1 1 1 1
## PMHx_none 1 1 1 1
## PMHx_CHF 1 1 1 1
## PMHx_ESRD 1 1 1 1
## PMHx_Cr2 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1
## PMHx_fibfl 1 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1 1
## RESIDENTS 1 1 1 1
## OH_EF 1 1 1 1
## OH_TRANSPLANT 1 1 1 1
## covar_ino 1 1 1 1
## inhospRx_BB 1 1 1 1
## inhospRx_MRA 1 1 1 1
## inhospRx_none 1 1 1 1
## inhospRx_ARNI 0 1 1 1
## inhospRx_ACEIARB 1 0 1 1
## covar_Kdisc 1 1 0 1
## covar_Crdisc 1 1 1 0
## OH_DISCBPSYST 1 1 1 1
## OH_DISCHR 1 1 1 1
## zip_designation 1 1 1 1
## population_total 1 1 1 1
## distress_score 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1
## followup 1 1 1 1
## OH_ARNI 1 1 1 1
## DCtoContCare 1 1 1 1
## insured 1 1 1 1
## OH_DISCBPSYST OH_DISCHR zip_designation population_total
## SITE_ID 1 1 1 1
## GDMT_HFBB 1 1 1 1
## GDMT_ACEIARB 1 1 1 1
## GDMT_ARNI 1 1 1 1
## GDMT_MRA 1 1 1 1
## GDMT_BBcontra 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1
## GDMT_MRAcontra 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1
## GENDERi 1 1 1 1
## race2i 1 1 1 1
## insurance 1 1 1 1
## AGEi 1 1 1 1
## PMHx_none 1 1 1 1
## PMHx_CHF 1 1 1 1
## PMHx_ESRD 1 1 1 1
## PMHx_Cr2 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1
## PMHx_fibfl 1 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1 1
## RESIDENTS 1 1 1 1
## OH_EF 1 1 1 1
## OH_TRANSPLANT 1 1 1 1
## covar_ino 1 1 1 1
## inhospRx_BB 1 1 1 1
## inhospRx_MRA 1 1 1 1
## inhospRx_none 1 1 1 1
## inhospRx_ARNI 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1
## covar_Kdisc 1 1 1 1
## covar_Crdisc 1 1 1 1
## OH_DISCBPSYST 0 1 1 1
## OH_DISCHR 1 0 1 1
## zip_designation 1 1 0 1
## population_total 1 1 1 0
## distress_score 1 1 1 1
## covar_OutsideZIPregion 1 1 1 1
## followup 1 1 1 1
## OH_ARNI 1 1 1 1
## DCtoContCare 1 1 1 1
## insured 1 1 1 1
## distress_score covar_OutsideZIPregion followup OH_ARNI
## SITE_ID 1 1 1 1
## GDMT_HFBB 1 1 1 1
## GDMT_ACEIARB 1 1 1 1
## GDMT_ARNI 1 1 1 1
## GDMT_MRA 1 1 1 1
## GDMT_BBcontra 1 1 1 1
## GDMT_RAASIcontra 1 1 1 1
## GDMT_MRAcontra 1 1 1 1
## GDMT_ARNIcontra 1 1 1 1
## GENDERi 1 1 1 1
## race2i 1 1 1 1
## insurance 1 1 1 1
## AGEi 1 1 1 1
## PMHx_none 1 1 1 1
## PMHx_CHF 1 1 1 1
## PMHx_ESRD 1 1 1 1
## PMHx_Cr2 1 1 1 1
## PMHx_PPMICDCRT 1 1 1 1
## PMHx_fibfl 1 1 1 1
## GH_HEART_TRANSPLANTS 1 1 1 1
## RESIDENTS 1 1 1 1
## OH_EF 1 1 1 1
## OH_TRANSPLANT 1 1 1 1
## covar_ino 1 1 1 1
## inhospRx_BB 1 1 1 1
## inhospRx_MRA 1 1 1 1
## inhospRx_none 1 1 1 1
## inhospRx_ARNI 1 1 1 1
## inhospRx_ACEIARB 1 1 1 1
## covar_Kdisc 1 1 1 1
## covar_Crdisc 1 1 1 1
## OH_DISCBPSYST 1 1 1 1
## OH_DISCHR 1 1 1 1
## zip_designation 1 1 1 1
## population_total 1 1 1 1
## distress_score 0 1 1 1
## covar_OutsideZIPregion 1 0 1 1
## followup 1 1 0 1
## OH_ARNI 1 1 1 0
## DCtoContCare 1 1 1 1
## insured 1 1 1 1
## DCtoContCare insured
## SITE_ID 1 1
## GDMT_HFBB 1 1
## GDMT_ACEIARB 1 1
## GDMT_ARNI 1 1
## GDMT_MRA 1 1
## GDMT_BBcontra 1 1
## GDMT_RAASIcontra 1 1
## GDMT_MRAcontra 1 1
## GDMT_ARNIcontra 1 1
## GENDERi 1 1
## race2i 1 1
## insurance 1 1
## AGEi 1 1
## PMHx_none 1 1
## PMHx_CHF 1 1
## PMHx_ESRD 1 1
## PMHx_Cr2 1 1
## PMHx_PPMICDCRT 1 1
## PMHx_fibfl 1 1
## GH_HEART_TRANSPLANTS 1 1
## RESIDENTS 1 1
## OH_EF 1 1
## OH_TRANSPLANT 1 1
## covar_ino 1 1
## inhospRx_BB 1 1
## inhospRx_MRA 1 1
## inhospRx_none 1 1
## inhospRx_ARNI 1 1
## inhospRx_ACEIARB 1 1
## covar_Kdisc 1 1
## covar_Crdisc 1 1
## OH_DISCBPSYST 1 1
## OH_DISCHR 1 1
## zip_designation 1 1
## population_total 1 1
## distress_score 1 1
## covar_OutsideZIPregion 1 1
## followup 1 1
## OH_ARNI 1 1
## DCtoContCare 0 1
## insured 1 0
Each row is an imputation model with the first column of each row indicating the variable to be imputed. The columns are potential indpendent variables to be included in that imputation model. 0’s indicate the variable in that column will be excluded from the imputation model, while 1 indicates the variable be assigned a fixed effect. Note that 0’s are automatically assigned to the matrix diagonal as variables should not be used to impute themselves. Similarly, note that every other variable has been assigned a fixed effect (“1”) by default.
As a general rule, every variable that is going to be used in the final model needs to be included in the imputation model. The only time to definitely exclude a variable is if it is collinear with another variable. This will definitely occur for “insured”, the binary variable we generated from nominal variable “insurance” a few steps ago. Hence, we need to remove “insurance” from the imputation model for “insured”. Similarly “insurance” provides more information than “insured” so we use “insurance” as an independent variable in all other imputation models instead of “insured”. We can arrange this via the following:
pm['insured','insurance'] <- 0
pm[,'insured'] <- 0
Outside of variables that induce collinearity, we usually want to include as much data as possible when creating imputation models, even if that variable is not going to be used in your final model. You may need to exclude variables from imputation models when the imputation models have become exceedingly complex (oversaturated), but the GTWG-HF dataset is sufficiently large that this is not necessarily. If your dataset is not large, remember that you should never exclude any variable from your imputation models that you plan to include in your final model.
The final set up required before initiating imputation is to set the cluster variable within the predictor matrix. This is done by assigning the cluster variable a value of -2 in all imputation models.
pm[,'SITE_ID'] <- -2
pm['SITE_ID','SITE_ID'] <- 0
We have now done all the set up required for multiple imputation, but before implementing the full imputation process it is advisable to carry out some a test case first, just to make sure the models have been specified correctly. As a test run, I set the number of imputations (m) to 1 and the iteration number (maxit) to 2. Verbose will allow us to see where we are in the computation process. control1 and control2 are options for glmer() (for hierchical models) and lmer() (for non-hierarchical models), respectively. Setting these parameters can help reduce run time. Most importantly, make sure to specify nAGQ=0; this may reduce the accuracy of estimates, but will greatly reduce run time. It is inadvisable to set nAGQ to 0 for the actual model, but doing so for test run shouldn’t cause a problem. Finally, we have the computer keep track of how long it takes the process to run using Sys.time().
start_time <- Sys.time()
test <- mice(df, m=1,maxit=2,
predictorMatrix=pm, method=impmethod, cluster_var=cluster_var,
nAGQ=0, seed=88, verbose=2,
control1=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE),
control2=lmerControl(optimizer='nloptwrap', optCtrl=list(maxfun=2e5)))
end_time <- Sys.time()
end_time - start_time
This will produce a mids object, which is a series of datasets containing imputed values. We can check to see if our model has been specified properly by looking at the trace plots.
plot(test)
We have provided some screen shots of a few of the plots. At minimum make sure the plots are not stationary for at least one of each kind of imputation method you used.
TO DO : insert screen shots of test plots
We see that there are no stationary plots, so at least our model is not exhibiting flat line behavior. We won’t know if the imputations are reasonable until we create the full model. However, note the above task took 2 hours for a single imputation with only 2 iterations. Our full model will require at least 45 imputations with >10 iterations- this would take weeks. To shorten our computational time, we can use parallel processing. R is single-threaded, meaning it will only use a single core for processing tasks, no matter how complex they maybe. Luckily, the imputations performed by MICE are independent of each other, so each imputation can be assigned to a different core in our machine and the datasets can be combined into a single object at that the end. This has been written into MICE using the wrapper function parlMICE(). Hence, we can specify our full imputation task as follows:
start_time <- Sys.time()
print(start_time)
impobj <- parlmice(df, n.core=46, n.imp.core = 1, cl.type="FORK",
maxit=11, predictorMatrix=pm, method=impmethod,
cluster_var=cluster_var,
nAGQ=0, cluster.seed=88,
control1=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE),
control2=lmerControl(optimizer='nloptwrap', optCtrl=list(maxfun=2e5)))
end_time <- Sys.time()
end_time - start_time
Our machine had 48 cores, and in general you should use no more than n-1 of the cores on your machine or you’ll risk being unable to do anything else while the task is running. Additionally cluster type (cl.type) is going to depend on the operating system of your machine. If you are using Windows, you will specify PSOCK, whereas on Linux machines, FORK is preferable. The number of imputations (equivalent to “m” in the non-parallel version of mice()) R will perform is equal to n.core * n.imp.core. We specified that each core only run 1 imputation as we wanted to reduce computation time as much as possible, but if your machine has fewer cores than you the number of imputations you want to run, you can increase the n.imp.core parameter at the cost of geometrically increasing your run time.
We have made instructed Rmarkdown not to run the code above, as the run time ended up being 9.3 days even with our large machine.