1 Introduction

2 Getting Started

2.1 Data Dictionary

3 Understanding Data from the Missingness Perspective

3.1 Exploring missingness patterns
3.2 Exploring multilevel data structures

4 Multilevel Multiple Imputation via Fully Conditional Specification

4.1 Mulitiple Imputation using GLM-based Imputation Methods (unsuccessful)

4.1.1 Selecting imputation methods
4.1.2 Specifying the predictor matrix
4.1.3 Do a test run of your imputation model
4.1.4 Multiple Imputation in Parallel

4.2 Mulitiple Imputation using Random Forest

5 Model Buidling after Multiple Imputation: the Fitting Step

5.1 Data Preparation
5.2 Run the Fit Step in Parallel

6 Pool the Results and Display the Output

7 Sensitivity Analysis

8 Final Thoughts

1 Introduction

This document is the technical appendix to our paper “Clinical and Socioeconomic Determinants of Angiotensin Receptor Blocker/Neprilysin Inhibitor Prescription at Hospital Discharge in Patients with Heart Failure with Reduced Ejection Fraction”, in which we investigated the impact of clinical and socioeconomic factors on ARNI prescription at discharge. We performed our analyses using the Get with the Guidelines-Heart Failure (GWTG-HF) data set, which provides an incredible wealth of clinical data, but also presents three major challenges:

A high degree of missing data
A hierarchical data structure
A large volume of data

We used multilevel multiple imputation to address missing data and a mixed-effects logistic regression model to address the data’s hierarchical structure. We used parallel processing to decrease computational time on large processes amenable to parallelization.
This document presents the code we used to obtain our final analysis in an extensively annotated format. We discuss not only how we obtained the final model, but also how we set up for large tasks and alternate methods we tried that were unsuccessful. As many of the task run times are prohibitively long, we have provided screenshots of the output rather than allowing the code to run in Rmarkdown. We hope this technical appendix will provide transparency to our analyses, as well as guidance for clinicians and researchers looking to perform similar analyses in the future. And as always, despite our best efforts, we may have made mistakes or overlooked more efficient ways of doing things. We would love to hear from you and share in any discussions, so please contact us at jeffreytran@email.arizona.edu!

1.1 About the Data

The GWTG-HF dataset is available upon request to the AHA. An application is required. Proposals can be submitted at www.heart.org/qualityresearch. The R code used to generate our clean data from the raw GWTG-HF can be found in the same GitHub repository as this Rmarkdown file.

1.2 IT Requirements

This analysis was performed on the AHA’s Precision Medicine Platform (PMP) using an r5.12xlarge AWS EC2 instance. If you are also using the PMP for these kinds of analysis, I recommend discussing your task with IT support, as they may have to remove some of the hidden disconnection timers inherent to the platform as some of these tasks require long run times. Note that the r5.12xlarge instance is a fairly large machine - it has 48 vCPU and 384GiB RAM. While having such a large machine is not necessary, it will greatly decrease the time it takes to complete the task. ## Prerequisites

We assume proficiency with the statistical software R, as well as familiarity with generalized linear models and decision tree analysis.

1.3 Helpful Resources

There are a variety of very helpful free online resources available for each of the three topics described above. Here are a few of them that we found the most helpful:
Multiple Imputation:

Flexible Impuation of Missing Data by Stef van Buuren (https://stefvanbuuren.name/fimd/)
Applied Missing Data Analysis with SPSS and (R)Studio by Heymans & Eekhout (https://bookdown.org/mwheymans/bookmi/)
Multiple Imputation in STATA by UCLA IDRE (https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/)
miceVignettes by Gerko Vink (https://www.gerkovink.com/miceVignettes/)
ASRM Unit 9: Sampling, sample selection, missing data by Mikko Ronkko (https://www.youtube.com/watch?v=jZ897LOEfKM&list=PL6tc6IBlZmOV0LRP_Hihmc0D-JwF6viA4)

Mixed-effects Regression Models:

Mixed Effects Logistic Regression | R Data Analysis Examples by UCLA IDRE (https://stats.idre.ucla.edu/r/dae/mixed-effects-logistic-regression/)
Keep Calm and Learn Multilevel Logistic Modeling: A simplified 3-Step Procedure Using Stata, R, Mplus, and SPSS by Sommet & Morselli (https://www.rips-irsp.com/articles/10.5334/irsp.90/)
ASRM Unit 7: Multilevel models by Mikko Ronkko (https://www.youtube.com/playlist?list=PL6tc6IBlZmOVEofGo-Yz4MpdOp5K8yk_T)
Random Effects Estimator - an introduction by Ben Lambert (https://www.youtube.com/watch?v=bQampZBzU9Q)

Parallel Processing:

Wrapper function parlMICE by Schouten & Vink (https://www.gerkovink.com/parlMICE/Vignette_parlMICE.html)
Parallel Processing in R by Josh Errickson (https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html)
How to go parallel in R - basics + tips by Max Gordon (https://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/)

2 Getting Started

Start by loading all the packages we’ll need into the Rstudio environment, followed by our dataset.

library(dplyr)
library(mice)
library(miceadds)
library(broom.mixed)
library(lme4)
library(multilevel)
library(parallel)

load('/mnt/workspace/rstudio_projects/data/gdmt_11.11_clean.RData')

dplyr: used for creating new variables and other data manipulation tasks.
MICE: the core library used to implement multiple imputation. It has been documented extensively by Gerko Vink and Stef van Buuren in both case-based and book formats. The source code is publicly available on GitHub.
Miceadds: adds useful imputation methods for multilevel multiple imputation models. It also provides an intuitive way to save imputation objects (mids), which can take significant time to produce.
Broom.mixed: necessary for MICE.
lme4: initially used to determine whether or not a hierarchical data structure is present and is later used as the core library used to build our mixed model.
Multilevel: used to determine hierarchical data structure.
Parallel: used to split the analysis task into parallel for the model fitting step.

Let’s explore our data a little bit:

str(df)

## 'data.frame':    136144 obs. of  40 variables:
##  $ SITE_ID               : Factor w/ 560 levels "10001","10140",..: 89 396 261 422 102 23 309 66 123 65 ...
##  $ GDMT_HFBB             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
##  $ GDMT_ACEIARB          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
##  $ GDMT_ARNI             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ GDMT_MRA              : Factor w/ 2 levels "0","1": 2 1 2 NA 1 1 1 1 2 2 ...
##  $ GDMT_BBcontra         : Factor w/ 2 levels "0","1": 1 1 NA NA 1 NA 1 1 2 1 ...
##  $ GDMT_RAASIcontra      : Factor w/ 2 levels "0","1": 1 1 2 NA 2 1 2 1 1 2 ...
##  $ GDMT_MRAcontra        : Factor w/ 2 levels "0","1": 1 1 NA NA 1 1 1 1 1 1 ...
##  $ GDMT_ARNIcontra       : Factor w/ 2 levels "0","1": 1 1 NA NA 2 1 1 1 1 2 ...
##  $ GENDERi               : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 1 1 ...
##  $ race2i                : Factor w/ 5 levels "Asian","Black",..: 5 5 5 2 5 5 5 5 3 3 ...
##  $ insurance             : Factor w/ 4 levels "Medicaid","Medicare",..: 2 3 1 NA 2 2 2 3 3 3 ...
##  $ AGEi                  : num  63 47 57 54 94 78 85 73 38 75 ...
##  $ PMHx_none             : Factor w/ 2 levels "0","1": 1 1 1 NA 1 1 1 1 1 1 ...
##  $ PMHx_CHF              : Factor w/ 2 levels "0","1": 2 1 2 NA 2 1 2 2 2 2 ...
##  $ PMHx_ESRD             : Factor w/ 2 levels "0","1": 1 1 1 NA 1 1 1 1 2 1 ...
##  $ PMHx_Cr2              : Factor w/ 2 levels "0","1": 1 1 1 NA 1 2 1 1 2 1 ...
##  $ PMHx_PPMICDCRT        : Factor w/ 2 levels "0","1": 2 1 1 NA 2 1 2 2 1 1 ...
##  $ PMHx_fibfl            : Factor w/ 2 levels "0","1": 1 1 1 NA 2 1 1 1 1 1 ...
##  $ GH_HEART_TRANSPLANTS  : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 2 1 ...
##  $ RESIDENTS             : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 2 2 2 ...
##  $ OH_EF                 : num  30 27 33 14 20 35 37 21 22 20 ...
##  $ OH_TRANSPLANT         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ covar_ino             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inhospRx_BB           : Factor w/ 2 levels "0","1": NA NA 2 NA 2 2 NA NA 1 NA ...
##  $ inhospRx_MRA          : Factor w/ 2 levels "0","1": NA NA 2 NA 1 1 NA NA 2 NA ...
##  $ inhospRx_none         : Factor w/ 2 levels "0","1": NA NA 1 NA 1 1 NA NA 1 NA ...
##  $ inhospRx_ARNI         : Factor w/ 2 levels "0","1": NA NA 1 NA 1 1 NA NA 1 NA ...
##  $ inhospRx_ACEIARB      : Factor w/ 2 levels "0","1": NA NA 2 NA 2 1 NA NA 2 NA ...
##  $ covar_Kdisc           : num  NA NA NA NA 4.5 4.1 NA NA 5.1 NA ...
##  $ covar_Crdisc          : num  NA NA NA NA 1.2 1.6 NA NA 6.7 NA ...
##  $ OH_DISCBPSYST         : num  NA NA 111 NA 142 139 NA NA 119 NA ...
##  $ OH_DISCHR             : num  NA NA NA NA 59 73 NA NA 79 NA ...
##  $ zip_designation       : Factor w/ 4 levels "Rural","SmallTown",..: NA NA NA 3 3 2 NA NA 4 NA ...
##  $ population_total      : num  NA NA NA 35641 7223 ...
##  $ distress_score        : num  NA NA NA 49.2 47.9 ...
##  $ covar_OutsideZIPregion: Factor w/ 2 levels "0","1": NA NA NA 1 2 2 NA NA 2 NA ...
##  $ followup              : Factor w/ 2 levels "0","1": 2 2 2 2 2 NA 2 2 2 2 ...
##  $ OH_ARNI               : Factor w/ 2 levels "0","1": NA NA NA NA 1 1 NA NA 1 NA ...
##  $ DCtoContCare          : Factor w/ 2 levels "0","1": NA NA NA NA 1 2 NA NA 1 2 ...

2.1 Data Dictionary

SITE_ID: identifier for the hospital submitting the record of hospitalization.
GDMT_HFBB: Was the patient prescribed a guideline-directed medical therapy (GDMT)-recommended beta blocker at discharge from the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_ACEIARB: Was the patient prescribed an angiotensin converting enzyme inhibitor (ACEI) or angiotensin receptor blocker (ARB) at discharge from the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_ARNI: Was the patient prescribed an angiotensin receptor blocker - neprilysin inhibitor (ARNI) at discharge from the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_MRA: Was the patient prescribed an mineralocorticoid receptor antagonist (MRA) at discharge from the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_BBcontra: Does the patient have a contraindication to beta blockers? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_RAASIcontra: Does the patient have a contraindication to an ACEI or ARB? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_MRAcontra: Does the patient have a contraindication to an MRA? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GDMT_ARNIcontra: Does the patient have a contraindication to an ARNI? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GENDERi: Biological sex of the patient. ‘0’ denotes ‘male’, ‘1’ denotes ‘female’
race2i: Site-reported race of the patient. ‘1’ denotes ‘Asian’, ‘2’ denotes ‘Black’, ‘3’ denotes Hispanic’, ‘4’ denotes ‘Other or unable to determine’, ‘5’ denotes ‘White’
insurance: Patient’s insurance type: ‘1’ denotes ‘Medicaid’, ‘2’ denotes ‘Medicare’, ‘3’ denotes ‘No Insurance’, ‘4’ denotes ‘Other’, ‘NA’ denotes missing
AGEi: Age of patient at time of discharge.
PMHx_none: Did the patient have no prior medical history at time of admission to the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
PMHx_CHF: Did the patient have a prior medical history of congestive heart failure? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
PMHx_ESRD: Did the patient have a prior medical history of end stage renal disease (ESRD)? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
PMHx_Cr2: Did the patient have a prior medical history of have chronic kidney disease (CKD)? the GWTG-HF dataset defines CKD based on a serum creatinine level of >2. ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
PMHx_PPMICDCRTD: Did the patient have a permanent pacemaker (PPM), implantable cardioverter defibrillator (ICD), or cardiac resynchronization therapy defibrillator (CRTD) prior to admission to the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
PMHx_fibfl: Did the patient have a prior medical history of atrial fibrillation or flutter? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
GH_HEART_TRANSPLANTS: Site-level variable indicating if the site performs heart transplants. ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
RESIDENTS: Site-level variable indicating if postgraduate resident physicians work at the site. ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
OH_EF: the patient’s ejection fraction.
OH_TRANSPLANT: Is the patient listed for heart transplant? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
covar_ino: Did the patient receive an inotrope (dopamine, dobutamine, milrinone) during hospitalization? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
inhospRx_BB: Did the patient receive a beta blocker while hospitalized? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
inhospRx_MRA: Did the patient receive an MRA while hospitalized? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
inhospRx_none: Did the patient receive no GDMT components while hospitalized? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
inhospRx_ARNI: Did the patient receive an ARNI while hospitalized? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
inhospRx_ACEIARB: Did the patient receive an ACEi or ARB while hospitalized? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
covar_Kdisc: the patient’s serum potassium at discharge.
covar_Crdisc: the patient’s serum creatinine at discharge.
OH_DISCBPSYST: the patient’s systolic blood pressure at discharge.
OH_DISCHR: the patient’s heart rate at discharge.
zip_designation: merged from the distressed community index (DCI) dataset; denotes the community type. ‘1’ denotes ‘rural’, ‘2’ denotes ‘small town’, ‘3’ denotes ‘suburban’, ‘4’ denotes ‘urban’, ‘NA’ denotes missing data
population_total: merged from the DCI dataset; denotes the population count of the associated zip code.
covar_OutsideZIPregion: Does the patient’s zip code differ from the site’s zip code? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
followup: Was the patient given follow up on discharge? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
OH_ARNI: Had the patient been prescribed an ARNI prior to admission to the hospital? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’
DCtoContCare: Was the patient discharged to a skilled nursing facility, long term acute care facility, inpatient rehabilitation, intermediate care facility, home health care, or other acute care facility? ‘0’ denotes ‘no’, ‘1’ denotes ‘yes’

3 Understanding Data from the Missingness Perspective

Before starting, I highly recommend reading Stef van Buuren’s free online book Flexible Imputation of Missing Data. He provides both a theoretical understanding of multiple imputation and a comprehensive guide of how the procedure is properly performed. He is also the primary contributor to the R library MICE, which is the core library we will be using to perform multiple imputation.

3.1 Exploring missingness patterns

#how many missing in each variable? 
lapply(df, function(x) 
       sum(is.na(x))
       )

## $SITE_ID
## [1] 0
## 
## $GDMT_HFBB
## [1] 3024
## 
## $GDMT_ACEIARB
## [1] 513
## 
## $GDMT_ARNI
## [1] 0
## 
## $GDMT_MRA
## [1] 7266
## 
## $GDMT_BBcontra
## [1] 25857
## 
## $GDMT_RAASIcontra
## [1] 2128
## 
## $GDMT_MRAcontra
## [1] 14237
## 
## $GDMT_ARNIcontra
## [1] 15122
## 
## $GENDERi
## [1] 322
## 
## $race2i
## [1] 0
## 
## $insurance
## [1] 9060
## 
## $AGEi
## [1] 0
## 
## $PMHx_none
## [1] 8504
## 
## $PMHx_CHF
## [1] 8504
## 
## $PMHx_ESRD
## [1] 8504
## 
## $PMHx_Cr2
## [1] 8504
## 
## $PMHx_PPMICDCRT
## [1] 8504
## 
## $PMHx_fibfl
## [1] 8504
## 
## $GH_HEART_TRANSPLANTS
## [1] 0
## 
## $RESIDENTS
## [1] 0
## 
## $OH_EF
## [1] 0
## 
## $OH_TRANSPLANT
## [1] 0
## 
## $covar_ino
## [1] 0
## 
## $inhospRx_BB
## [1] 55647
## 
## $inhospRx_MRA
## [1] 55647
## 
## $inhospRx_none
## [1] 55647
## 
## $inhospRx_ARNI
## [1] 55647
## 
## $inhospRx_ACEIARB
## [1] 55647
## 
## $covar_Kdisc
## [1] 96076
## 
## $covar_Crdisc
## [1] 94145
## 
## $OH_DISCBPSYST
## [1] 57867
## 
## $OH_DISCHR
## [1] 59064
## 
## $zip_designation
## [1] 59969
## 
## $population_total
## [1] 59969
## 
## $distress_score
## [1] 59969
## 
## $covar_OutsideZIPregion
## [1] 58122
## 
## $followup
## [1] 7650
## 
## $OH_ARNI
## [1] 91588
## 
## $DCtoContCare
## [1] 73568

#how many complete cases? 
sum(complete.cases(df))

## [1] 23687

Note that missingness ranges from 0 to 70%. To understand how this missingness will impact our statistical analysis, we need to understand the mechanism of missingness. Data that are “missing completely at random” (MCAR) are frequently amenable to “simple” fixes such as complete case analysis (also known as listwise deletion). Data that are “missing at random” (MAR) are amenable to multiple imputation in most cases, or univariate imputation if the data is monotone. Data that are “missing not at random” (MNAR) present the most complex scenario and frequently require additional research into the cause of data missingness. Blindly imputing MNAR data can result in inaccurate estimates in the final model.
MICE has several built-in methods for diagnosing the mechanism of missingness, but due to the size of the GWTG-HF dataset, the visual assessments are not practical.

#what is the pattern of data missingness? 
md.pattern(df, plot=FALSE)

The output from the above code is a large matrix that is cumbersome to read. The first column indicates the number of observations with that pattern of missingness, the subsequent columns indicate if the associated variable is missing or not, and the final column shows the number of variables that are missing for that particular pattern. Because there are 40 variables and over 1500 patterns of missingness, this matrix is rather large, but is nevertheless worth reviewing in detail. After review, the data appear to be MAR, and the missingess pattern is not monotone. In addition, we can surmise that missingness happens at a systemic level, with some sites simply not reporting entire groups of variables, such as patients’ medical comorbidities, their discharge labs/vitals, etc. Such a pattern is amenable to multiple imputation.

3.2 Exploring multilevel data structures

When interpreting data in the context of missingness, we also need to consider if the data has a multilevel structure (also referred to as hierarchical or clustered data). We know that patients are nested within hospitals, and it’s reasonable to suspect that culture, resources, and other unobserved variables at each site are associated with the characteristics of the patients they treat. We can assess this quantitatively by looking at the intraclass correlation coefficient (ICC) for incomplete variables.

#Find the variables with missing data and split the variable names 
#into lists based on class (factor vs numeric)

varlist <- colnames(df)
hasmiss <- list() 
miss_f <- list() 
miss_n <- list() 

for(i in 1:ncol(df)) { 
        if(with(df,sum(is.na(eval(as.name(varlist[i]))))) > 0) {
                hasmiss <- append(hasmiss, varlist[i])
        }
}

for(i in 1:length(hasmiss)) { 
        if (with(df,class(eval(as.name(hasmiss[[i]])))) == "factor") {
                miss_f <- append(miss_f, hasmiss[i])
        } else {
                miss_n <- append(miss_n, hasmiss[i])
        }
}

rm(varlist, hasmiss)

#calculate the ICC for each variable. Note the models used to calculate ICC are dependent 
#on the dependent variable's class (ANOVA for numeric variables, unconditional mean model 
#for factor variables)

fun_f <- function(i) {
        
        m0 <- glmer(eval(as.name(i)) ~ (1|SITE_ID),
                    data=df,
                    family='binomial',
                    control=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5)))
        
        ICC <- m0@theta[1]^2/(m0@theta[1]^2 + (3.14159^2/3))
        
        print(i)
        print(ICC)
}

fun_n <- function(i) {
        
        ICC <- multilevel::ICC1(aov(eval(as.name(i)) ~ SITE_ID, data=df))
        
        print(i)
        print(ICC)
}

suppress <- lapply(miss_f, fun_f)

## [1] "GDMT_HFBB"
## [1] 0.06958008
## [1] "GDMT_ACEIARB"
## [1] 0.09156555
## [1] "GDMT_MRA"
## [1] 0.08377357
## [1] "GDMT_BBcontra"
## [1] 0.3597455
## [1] "GDMT_RAASIcontra"
## [1] 0.3633391
## [1] "GDMT_MRAcontra"
## [1] 0.3646874
## [1] "GDMT_ARNIcontra"
## [1] 0.3928531
## [1] "GENDERi"
## [1] 0.01079836
## [1] "insurance"
## [1] 0.2117845
## [1] "PMHx_none"
## [1] 0.1166666
## [1] "PMHx_CHF"
## [1] 0.1012625
## [1] "PMHx_ESRD"
## [1] 0.117713
## [1] "PMHx_Cr2"
## [1] 0.1588017
## [1] "PMHx_PPMICDCRT"
## [1] 0.07147059
## [1] "PMHx_fibfl"
## [1] 0.03680871
## [1] "inhospRx_BB"
## [1] 0.09117881
## [1] "inhospRx_MRA"
## [1] 0.111789
## [1] "inhospRx_none"
## [1] 0.1152063
## [1] "inhospRx_ARNI"
## [1] 0.3411458
## [1] "inhospRx_ACEIARB"
## [1] 0.06562613
## [1] "zip_designation"
## [1] 0.4777781
## [1] "covar_OutsideZIPregion"
## [1] 0.2940743
## [1] "followup"
## [1] 0.596069
## [1] "OH_ARNI"
## [1] 0.2088988
## [1] "DCtoContCare"
## [1] 0.7089976

suppress <- lapply(miss_n, fun_n)

## [1] "covar_Kdisc"
## [1] 0.04157473
## [1] "covar_Crdisc"
## [1] 0.01061931
## [1] "OH_DISCBPSYST"
## [1] 0.02968353
## [1] "OH_DISCHR"
## [1] 0.02274791
## [1] "population_total"
## [1] 0.4089865
## [1] "distress_score"
## [1] 0.2501159

rm(miss_n, miss_f, fun_n, fun_f, suppress)

There is no cutoff for a “significant” ICC, but we see that SITE_ID alone explains more than 30% of the variance seen in many of the variables to be imputed; hence, clustering cannot be ignored.
To summarize what we have learned about our data up to this point:

The data seems to missing at random (MAR)
Missingness is not monotone
The data has a hierarchical structure with patients nested within sites

Any imputation technique we apply our data to needs to account for these three factors. Because the data is MAR and not monotone, multiple imputation arises as an attractive option for imputation. Because the data is hierarchical we will need to apply a multilevel multiple imputation model.

4 Multilevel Multiple Imputation via Fully Conditional Specification

We will implement our multilevel multiple imputation model via fully conditional specification (FCS). Stef van Buuren’s book comprehensively explains the theory and basic steps for this procedure, while the vignette on multilevel imputation by Gerko Vink provides a nice walk through via examples. We adapt many techniques and terminology from these two sources.
Very broadly speaking, multiple imputation can be broken down into 3 steps, and these steps provide a nice outline for how we will approach the procedure. The steps are as follows:

The imputation step:
We use either fully conditional specification (FCS) or joint modeling to estimate a distribution of missing values (we are NOT imputing point values). Random draws from that distribution are imputed into the missing value slots and this process is repeated multiple times to create multiple datasets with imputed values. The variance between the imputed values of different datasets reflects our uncertainty regarding the true value of the missing data point. Each dataset with imputed values is referred to as an imputation.
In FCS the distribution of missing values is created via a series of models, with each independent variable with missingness requiring its own model. The independent variable with missingess becomes the dependent variable for that imputation model. For example, we saw previously that the variable GDMT_BBcontra has 25,857 missing values. We will create a model with GDMT_BBcontra as the dependent variable, and use this model to estimate the distribution of missing values for GDMT_BBcontra. Note that the ICC for this model was 0.360 so the imputation model will need to take this clustering into account.
The fitting step:
We create a model to test our hypothesis, which in our case is testing if a patient is prescribed an ARNI at hospital discharge (GDMT_ARNI). Keep in mind that we may have to take a hierarchical data structure into account. This model is applied to every one of the imputed datasets generated in the imputation step.
The pooling step:
We pool the results of each of the previously fitted models using Rubin’s rules to generate a final model.

4.1 Mulitiple Imputation using GLM-based Imputation Methods (unsuccessful)

In this section we describe how we implemented FCS using generalized linear models (GLMs). The imputation model did not converge, so this approach was ultimately abandoned. Nevertheless, we present this approach because multiple imputation is commonly applied via GLMs.

4.1.1 Selecting imputation methods

The first task of the imputation step is to specify the kind of statistical model we are going to use to predict each of our independent variables with missingness. In MICE, this is referred to as choosing the imputation method. This is largely going to depend on the class of that variable (numeric vs factor/binary vs factor/nominal vs factor/ordinal), but it is also determined by the presence of a hierarchical data structure. The presence of clustering in our data greatly complicates multiple imputation, but luckily the MICE library comes with several imputation methods specifically for hierarchical data with 2 levels. These methods typically start with the prefix “2l.”
Note that there are no 2l methods for ordinal or nominal outcomes. In our case this caused a problem for the variables ‘insurance’ and ‘zip_designation.’ Our work around was to create binary variables based on these nominal variables and impute the binary version instead. We retained as much information as possible by continuing to use nominal form of these variables as dependent variables in other imputation models. For the nominal variable ‘zip_designation’ creating a binary derivative was not possible, so it had to be dropped entirely from analysis.
Here we convert the nominal variable ‘insurance’ to a binary variable.

#convert nominal variables to binary 
df <- df %>% 
  mutate(insured = ifelse(insurance %in% 'Medicaid' | 
                          insurance %in% 'Medicare' | 
                          insurance %in% 'Other', 1,  
                   ifelse(insurance %in% 'none', 0, 
                   ifelse(is.na(insurance), NA, 999))))
df$insured <- as.factor(df$insured)

Next we set the cluster variable, which in our case is SITE_ID. Interestingly, MICE requires that cluster variables be changed to numeric class.

#set the cluster variable (note the cluster var must be numeric)
df$SITE_ID <- as.numeric(df$SITE_ID)
cluster_var=df$SITE_ID

Next we create a vector containing all of the dataset’s variable names. We will assign an imputation method to each of these variable names in the next step.

impmethod <- character(ncol(df))
names(impmethod) <- colnames(df)

We are finally ready to assign our imputation methods and we do so primarily based on that variable’s class. We arbitrarily chose an ICC < 0.05 as representing a negligible clustering effect, and this allows us to use a non-2l imputation method for variables that meet this criteria. This is desirable because all 2l methods call glmer() and assign the cluster variable a random intercept, which becomes computationally expensive for the complex models we use for imputation.

impmethod['SITE_ID'] <- "" 

impmethod['GDMT_HFBB'] <- "2l.binary" 
impmethod['GDMT_ACEIARB'] <- "2l.binary" 
impmethod['GDMT_ARNI'] <- "" 
impmethod['GDMT_MRA'] <- "2l.binary"

impmethod['GDMT_BBcontra'] <- "2l.binary"
impmethod['GDMT_RAASIcontra'] <- "2l.binary"
impmethod['GDMT_MRAcontra'] <- "2l.binary" 
impmethod['GDMT_ARNIcontra'] <- "2l.binary" 

impmethod['GENDERi'] <- "logreg" 
impmethod['race2i'] <- ""
impmethod['insurance'] <- ""
impmethod['AGEi'] <- ""

impmethod['PMHx_none'] <- "2l.binary" 
impmethod['PMHx_CHF'] <- "2l.binary" 
impmethod['PMHx_ESRD'] <- "2l.binary"
impmethod['PMHx_Cr2'] <- "2l.glm.bin"
impmethod['PMHx_PPMICDCRT'] <- "2l.binary"
impmethod['PMHx_fibfl'] <- "logreg"

impmethod['GH_HEART_TRANSPLANTS'] <- ""
impmethod['RESIDENTS'] <- ""

impmethod['OH_EF'] <- ""
impmethod['OH_TRANSPLANT'] <- ""
impmethod['covar_ino'] <- ""

impmethod['inhospRx_BB'] <- "2l.binary"
impmethod['inhospRx_MRA'] <- "2l.binary"
impmethod['inhospRx_none'] <- "2l.binary"
impmethod['inhospRx_ARNI'] <- "2l.binary"
impmethod['inhospRx_ACEIARB'] <- "2l.binary"

impmethod['covar_Kdisc'] <- "pmm"
impmethod['covar_Crdisc'] <- "pmm"
impmethod['OH_DISCBPSYST'] <- "pmm"
impmethod['OH_DISCHR'] <- "pmm"

impmethod['zip_designation'] <- ""
impmethod['population_total'] <- "2l.pmm"
impmethod['distress_score'] <- "2l.pmm"

impmethod['covar_OutsideZIPregion'] <- "2l.binary"
impmethod['followup'] <- "2l.binary"
impmethod['DCtoContCare'] <- "2l.binary"
impmethod['insured'] <- "2l.binary"

print(impmethod)

##                SITE_ID              GDMT_HFBB           GDMT_ACEIARB 
##                     ""            "2l.binary"            "2l.binary" 
##              GDMT_ARNI               GDMT_MRA          GDMT_BBcontra 
##                     ""            "2l.binary"            "2l.binary" 
##       GDMT_RAASIcontra         GDMT_MRAcontra        GDMT_ARNIcontra 
##            "2l.binary"            "2l.binary"            "2l.binary" 
##                GENDERi                 race2i              insurance 
##               "logreg"                     ""                     "" 
##                   AGEi              PMHx_none               PMHx_CHF 
##                     ""            "2l.binary"            "2l.binary" 
##              PMHx_ESRD               PMHx_Cr2         PMHx_PPMICDCRT 
##            "2l.binary"           "2l.glm.bin"            "2l.binary" 
##             PMHx_fibfl   GH_HEART_TRANSPLANTS              RESIDENTS 
##               "logreg"                     ""                     "" 
##                  OH_EF          OH_TRANSPLANT              covar_ino 
##                     ""                     ""                     "" 
##            inhospRx_BB           inhospRx_MRA          inhospRx_none 
##            "2l.binary"            "2l.binary"            "2l.binary" 
##          inhospRx_ARNI       inhospRx_ACEIARB            covar_Kdisc 
##            "2l.binary"            "2l.binary"                  "pmm" 
##           covar_Crdisc          OH_DISCBPSYST              OH_DISCHR 
##                  "pmm"                  "pmm"                  "pmm" 
##        zip_designation       population_total         distress_score 
##                     ""               "2l.pmm"               "2l.pmm" 
## covar_OutsideZIPregion               followup                OH_ARNI 
##            "2l.binary"            "2l.binary"                     "" 
##           DCtoContCare                insured 
##            "2l.binary"            "2l.binary"

The default imputation method for each variable is "“, which indicates to MICE that an imputation model should not be fit to that variable. We explicitly assigned”" where appropriate to make sure that we considered every variable in the dataset, but an explicit assignment was not necessary. Notice that variables might be assigned "" because they have no missing values or because they are nominal. There is value in keeping nominal variables in the dataset because they can still be used to impute other variables, even though they themselves cannot be imputed.

4.1.2 Specifying the predictor matrix

The next step is to determine which variables belong in each imputation model. This is done through the prediction matrix, which is created via the following:

pm <- make.predictorMatrix(df)
print(pm)

##                        SITE_ID GDMT_HFBB GDMT_ACEIARB GDMT_ARNI GDMT_MRA
## SITE_ID                      0         1            1         1        1
## GDMT_HFBB                    1         0            1         1        1
## GDMT_ACEIARB                 1         1            0         1        1
## GDMT_ARNI                    1         1            1         0        1
## GDMT_MRA                     1         1            1         1        0
## GDMT_BBcontra                1         1            1         1        1
## GDMT_RAASIcontra             1         1            1         1        1
## GDMT_MRAcontra               1         1            1         1        1
## GDMT_ARNIcontra              1         1            1         1        1
## GENDERi                      1         1            1         1        1
## race2i                       1         1            1         1        1
## insurance                    1         1            1         1        1
## AGEi                         1         1            1         1        1
## PMHx_none                    1         1            1         1        1
## PMHx_CHF                     1         1            1         1        1
## PMHx_ESRD                    1         1            1         1        1
## PMHx_Cr2                     1         1            1         1        1
## PMHx_PPMICDCRT               1         1            1         1        1
## PMHx_fibfl                   1         1            1         1        1
## GH_HEART_TRANSPLANTS         1         1            1         1        1
## RESIDENTS                    1         1            1         1        1
## OH_EF                        1         1            1         1        1
## OH_TRANSPLANT                1         1            1         1        1
## covar_ino                    1         1            1         1        1
## inhospRx_BB                  1         1            1         1        1
## inhospRx_MRA                 1         1            1         1        1
## inhospRx_none                1         1            1         1        1
## inhospRx_ARNI                1         1            1         1        1
## inhospRx_ACEIARB             1         1            1         1        1
## covar_Kdisc                  1         1            1         1        1
## covar_Crdisc                 1         1            1         1        1
## OH_DISCBPSYST                1         1            1         1        1
## OH_DISCHR                    1         1            1         1        1
## zip_designation              1         1            1         1        1
## population_total             1         1            1         1        1
## distress_score               1         1            1         1        1
## covar_OutsideZIPregion       1         1            1         1        1
## followup                     1         1            1         1        1
## OH_ARNI                      1         1            1         1        1
## DCtoContCare                 1         1            1         1        1
## insured                      1         1            1         1        1
##                        GDMT_BBcontra GDMT_RAASIcontra GDMT_MRAcontra
## SITE_ID                            1                1              1
## GDMT_HFBB                          1                1              1
## GDMT_ACEIARB                       1                1              1
## GDMT_ARNI                          1                1              1
## GDMT_MRA                           1                1              1
## GDMT_BBcontra                      0                1              1
## GDMT_RAASIcontra                   1                0              1
## GDMT_MRAcontra                     1                1              0
## GDMT_ARNIcontra                    1                1              1
## GENDERi                            1                1              1
## race2i                             1                1              1
## insurance                          1                1              1
## AGEi                               1                1              1
## PMHx_none                          1                1              1
## PMHx_CHF                           1                1              1
## PMHx_ESRD                          1                1              1
## PMHx_Cr2                           1                1              1
## PMHx_PPMICDCRT                     1                1              1
## PMHx_fibfl                         1                1              1
## GH_HEART_TRANSPLANTS               1                1              1
## RESIDENTS                          1                1              1
## OH_EF                              1                1              1
## OH_TRANSPLANT                      1                1              1
## covar_ino                          1                1              1
## inhospRx_BB                        1                1              1
## inhospRx_MRA                       1                1              1
## inhospRx_none                      1                1              1
## inhospRx_ARNI                      1                1              1
## inhospRx_ACEIARB                   1                1              1
## covar_Kdisc                        1                1              1
## covar_Crdisc                       1                1              1
## OH_DISCBPSYST                      1                1              1
## OH_DISCHR                          1                1              1
## zip_designation                    1                1              1
## population_total                   1                1              1
## distress_score                     1                1              1
## covar_OutsideZIPregion             1                1              1
## followup                           1                1              1
## OH_ARNI                            1                1              1
## DCtoContCare                       1                1              1
## insured                            1                1              1
##                        GDMT_ARNIcontra GENDERi race2i insurance AGEi PMHx_none
## SITE_ID                              1       1      1         1    1         1
## GDMT_HFBB                            1       1      1         1    1         1
## GDMT_ACEIARB                         1       1      1         1    1         1
## GDMT_ARNI                            1       1      1         1    1         1
## GDMT_MRA                             1       1      1         1    1         1
## GDMT_BBcontra                        1       1      1         1    1         1
## GDMT_RAASIcontra                     1       1      1         1    1         1
## GDMT_MRAcontra                       1       1      1         1    1         1
## GDMT_ARNIcontra                      0       1      1         1    1         1
## GENDERi                              1       0      1         1    1         1
## race2i                               1       1      0         1    1         1
## insurance                            1       1      1         0    1         1
## AGEi                                 1       1      1         1    0         1
## PMHx_none                            1       1      1         1    1         0
## PMHx_CHF                             1       1      1         1    1         1
## PMHx_ESRD                            1       1      1         1    1         1
## PMHx_Cr2                             1       1      1         1    1         1
## PMHx_PPMICDCRT                       1       1      1         1    1         1
## PMHx_fibfl                           1       1      1         1    1         1
## GH_HEART_TRANSPLANTS                 1       1      1         1    1         1
## RESIDENTS                            1       1      1         1    1         1
## OH_EF                                1       1      1         1    1         1
## OH_TRANSPLANT                        1       1      1         1    1         1
## covar_ino                            1       1      1         1    1         1
## inhospRx_BB                          1       1      1         1    1         1
## inhospRx_MRA                         1       1      1         1    1         1
## inhospRx_none                        1       1      1         1    1         1
## inhospRx_ARNI                        1       1      1         1    1         1
## inhospRx_ACEIARB                     1       1      1         1    1         1
## covar_Kdisc                          1       1      1         1    1         1
## covar_Crdisc                         1       1      1         1    1         1
## OH_DISCBPSYST                        1       1      1         1    1         1
## OH_DISCHR                            1       1      1         1    1         1
## zip_designation                      1       1      1         1    1         1
## population_total                     1       1      1         1    1         1
## distress_score                       1       1      1         1    1         1
## covar_OutsideZIPregion               1       1      1         1    1         1
## followup                             1       1      1         1    1         1
## OH_ARNI                              1       1      1         1    1         1
## DCtoContCare                         1       1      1         1    1         1
## insured                              1       1      1         1    1         1
##                        PMHx_CHF PMHx_ESRD PMHx_Cr2 PMHx_PPMICDCRT PMHx_fibfl
## SITE_ID                       1         1        1              1          1
## GDMT_HFBB                     1         1        1              1          1
## GDMT_ACEIARB                  1         1        1              1          1
## GDMT_ARNI                     1         1        1              1          1
## GDMT_MRA                      1         1        1              1          1
## GDMT_BBcontra                 1         1        1              1          1
## GDMT_RAASIcontra              1         1        1              1          1
## GDMT_MRAcontra                1         1        1              1          1
## GDMT_ARNIcontra               1         1        1              1          1
## GENDERi                       1         1        1              1          1
## race2i                        1         1        1              1          1
## insurance                     1         1        1              1          1
## AGEi                          1         1        1              1          1
## PMHx_none                     1         1        1              1          1
## PMHx_CHF                      0         1        1              1          1
## PMHx_ESRD                     1         0        1              1          1
## PMHx_Cr2                      1         1        0              1          1
## PMHx_PPMICDCRT                1         1        1              0          1
## PMHx_fibfl                    1         1        1              1          0
## GH_HEART_TRANSPLANTS          1         1        1              1          1
## RESIDENTS                     1         1        1              1          1
## OH_EF                         1         1        1              1          1
## OH_TRANSPLANT                 1         1        1              1          1
## covar_ino                     1         1        1              1          1
## inhospRx_BB                   1         1        1              1          1
## inhospRx_MRA                  1         1        1              1          1
## inhospRx_none                 1         1        1              1          1
## inhospRx_ARNI                 1         1        1              1          1
## inhospRx_ACEIARB              1         1        1              1          1
## covar_Kdisc                   1         1        1              1          1
## covar_Crdisc                  1         1        1              1          1
## OH_DISCBPSYST                 1         1        1              1          1
## OH_DISCHR                     1         1        1              1          1
## zip_designation               1         1        1              1          1
## population_total              1         1        1              1          1
## distress_score                1         1        1              1          1
## covar_OutsideZIPregion        1         1        1              1          1
## followup                      1         1        1              1          1
## OH_ARNI                       1         1        1              1          1
## DCtoContCare                  1         1        1              1          1
## insured                       1         1        1              1          1
##                        GH_HEART_TRANSPLANTS RESIDENTS OH_EF OH_TRANSPLANT
## SITE_ID                                   1         1     1             1
## GDMT_HFBB                                 1         1     1             1
## GDMT_ACEIARB                              1         1     1             1
## GDMT_ARNI                                 1         1     1             1
## GDMT_MRA                                  1         1     1             1
## GDMT_BBcontra                             1         1     1             1
## GDMT_RAASIcontra                          1         1     1             1
## GDMT_MRAcontra                            1         1     1             1
## GDMT_ARNIcontra                           1         1     1             1
## GENDERi                                   1         1     1             1
## race2i                                    1         1     1             1
## insurance                                 1         1     1             1
## AGEi                                      1         1     1             1
## PMHx_none                                 1         1     1             1
## PMHx_CHF                                  1         1     1             1
## PMHx_ESRD                                 1         1     1             1
## PMHx_Cr2                                  1         1     1             1
## PMHx_PPMICDCRT                            1         1     1             1
## PMHx_fibfl                                1         1     1             1
## GH_HEART_TRANSPLANTS                      0         1     1             1
## RESIDENTS                                 1         0     1             1
## OH_EF                                     1         1     0             1
## OH_TRANSPLANT                             1         1     1             0
## covar_ino                                 1         1     1             1
## inhospRx_BB                               1         1     1             1
## inhospRx_MRA                              1         1     1             1
## inhospRx_none                             1         1     1             1
## inhospRx_ARNI                             1         1     1             1
## inhospRx_ACEIARB                          1         1     1             1
## covar_Kdisc                               1         1     1             1
## covar_Crdisc                              1         1     1             1
## OH_DISCBPSYST                             1         1     1             1
## OH_DISCHR                                 1         1     1             1
## zip_designation                           1         1     1             1
## population_total                          1         1     1             1
## distress_score                            1         1     1             1
## covar_OutsideZIPregion                    1         1     1             1
## followup                                  1         1     1             1
## OH_ARNI                                   1         1     1             1
## DCtoContCare                              1         1     1             1
## insured                                   1         1     1             1
##                        covar_ino inhospRx_BB inhospRx_MRA inhospRx_none
## SITE_ID                        1           1            1             1
## GDMT_HFBB                      1           1            1             1
## GDMT_ACEIARB                   1           1            1             1
## GDMT_ARNI                      1           1            1             1
## GDMT_MRA                       1           1            1             1
## GDMT_BBcontra                  1           1            1             1
## GDMT_RAASIcontra               1           1            1             1
## GDMT_MRAcontra                 1           1            1             1
## GDMT_ARNIcontra                1           1            1             1
## GENDERi                        1           1            1             1
## race2i                         1           1            1             1
## insurance                      1           1            1             1
## AGEi                           1           1            1             1
## PMHx_none                      1           1            1             1
## PMHx_CHF                       1           1            1             1
## PMHx_ESRD                      1           1            1             1
## PMHx_Cr2                       1           1            1             1
## PMHx_PPMICDCRT                 1           1            1             1
## PMHx_fibfl                     1           1            1             1
## GH_HEART_TRANSPLANTS           1           1            1             1
## RESIDENTS                      1           1            1             1
## OH_EF                          1           1            1             1
## OH_TRANSPLANT                  1           1            1             1
## covar_ino                      0           1            1             1
## inhospRx_BB                    1           0            1             1
## inhospRx_MRA                   1           1            0             1
## inhospRx_none                  1           1            1             0
## inhospRx_ARNI                  1           1            1             1
## inhospRx_ACEIARB               1           1            1             1
## covar_Kdisc                    1           1            1             1
## covar_Crdisc                   1           1            1             1
## OH_DISCBPSYST                  1           1            1             1
## OH_DISCHR                      1           1            1             1
## zip_designation                1           1            1             1
## population_total               1           1            1             1
## distress_score                 1           1            1             1
## covar_OutsideZIPregion         1           1            1             1
## followup                       1           1            1             1
## OH_ARNI                        1           1            1             1
## DCtoContCare                   1           1            1             1
## insured                        1           1            1             1
##                        inhospRx_ARNI inhospRx_ACEIARB covar_Kdisc covar_Crdisc
## SITE_ID                            1                1           1            1
## GDMT_HFBB                          1                1           1            1
## GDMT_ACEIARB                       1                1           1            1
## GDMT_ARNI                          1                1           1            1
## GDMT_MRA                           1                1           1            1
## GDMT_BBcontra                      1                1           1            1
## GDMT_RAASIcontra                   1                1           1            1
## GDMT_MRAcontra                     1                1           1            1
## GDMT_ARNIcontra                    1                1           1            1
## GENDERi                            1                1           1            1
## race2i                             1                1           1            1
## insurance                          1                1           1            1
## AGEi                               1                1           1            1
## PMHx_none                          1                1           1            1
## PMHx_CHF                           1                1           1            1
## PMHx_ESRD                          1                1           1            1
## PMHx_Cr2                           1                1           1            1
## PMHx_PPMICDCRT                     1                1           1            1
## PMHx_fibfl                         1                1           1            1
## GH_HEART_TRANSPLANTS               1                1           1            1
## RESIDENTS                          1                1           1            1
## OH_EF                              1                1           1            1
## OH_TRANSPLANT                      1                1           1            1
## covar_ino                          1                1           1            1
## inhospRx_BB                        1                1           1            1
## inhospRx_MRA                       1                1           1            1
## inhospRx_none                      1                1           1            1
## inhospRx_ARNI                      0                1           1            1
## inhospRx_ACEIARB                   1                0           1            1
## covar_Kdisc                        1                1           0            1
## covar_Crdisc                       1                1           1            0
## OH_DISCBPSYST                      1                1           1            1
## OH_DISCHR                          1                1           1            1
## zip_designation                    1                1           1            1
## population_total                   1                1           1            1
## distress_score                     1                1           1            1
## covar_OutsideZIPregion             1                1           1            1
## followup                           1                1           1            1
## OH_ARNI                            1                1           1            1
## DCtoContCare                       1                1           1            1
## insured                            1                1           1            1
##                        OH_DISCBPSYST OH_DISCHR zip_designation population_total
## SITE_ID                            1         1               1                1
## GDMT_HFBB                          1         1               1                1
## GDMT_ACEIARB                       1         1               1                1
## GDMT_ARNI                          1         1               1                1
## GDMT_MRA                           1         1               1                1
## GDMT_BBcontra                      1         1               1                1
## GDMT_RAASIcontra                   1         1               1                1
## GDMT_MRAcontra                     1         1               1                1
## GDMT_ARNIcontra                    1         1               1                1
## GENDERi                            1         1               1                1
## race2i                             1         1               1                1
## insurance                          1         1               1                1
## AGEi                               1         1               1                1
## PMHx_none                          1         1               1                1
## PMHx_CHF                           1         1               1                1
## PMHx_ESRD                          1         1               1                1
## PMHx_Cr2                           1         1               1                1
## PMHx_PPMICDCRT                     1         1               1                1
## PMHx_fibfl                         1         1               1                1
## GH_HEART_TRANSPLANTS               1         1               1                1
## RESIDENTS                          1         1               1                1
## OH_EF                              1         1               1                1
## OH_TRANSPLANT                      1         1               1                1
## covar_ino                          1         1               1                1
## inhospRx_BB                        1         1               1                1
## inhospRx_MRA                       1         1               1                1
## inhospRx_none                      1         1               1                1
## inhospRx_ARNI                      1         1               1                1
## inhospRx_ACEIARB                   1         1               1                1
## covar_Kdisc                        1         1               1                1
## covar_Crdisc                       1         1               1                1
## OH_DISCBPSYST                      0         1               1                1
## OH_DISCHR                          1         0               1                1
## zip_designation                    1         1               0                1
## population_total                   1         1               1                0
## distress_score                     1         1               1                1
## covar_OutsideZIPregion             1         1               1                1
## followup                           1         1               1                1
## OH_ARNI                            1         1               1                1
## DCtoContCare                       1         1               1                1
## insured                            1         1               1                1
##                        distress_score covar_OutsideZIPregion followup OH_ARNI
## SITE_ID                             1                      1        1       1
## GDMT_HFBB                           1                      1        1       1
## GDMT_ACEIARB                        1                      1        1       1
## GDMT_ARNI                           1                      1        1       1
## GDMT_MRA                            1                      1        1       1
## GDMT_BBcontra                       1                      1        1       1
## GDMT_RAASIcontra                    1                      1        1       1
## GDMT_MRAcontra                      1                      1        1       1
## GDMT_ARNIcontra                     1                      1        1       1
## GENDERi                             1                      1        1       1
## race2i                              1                      1        1       1
## insurance                           1                      1        1       1
## AGEi                                1                      1        1       1
## PMHx_none                           1                      1        1       1
## PMHx_CHF                            1                      1        1       1
## PMHx_ESRD                           1                      1        1       1
## PMHx_Cr2                            1                      1        1       1
## PMHx_PPMICDCRT                      1                      1        1       1
## PMHx_fibfl                          1                      1        1       1
## GH_HEART_TRANSPLANTS                1                      1        1       1
## RESIDENTS                           1                      1        1       1
## OH_EF                               1                      1        1       1
## OH_TRANSPLANT                       1                      1        1       1
## covar_ino                           1                      1        1       1
## inhospRx_BB                         1                      1        1       1
## inhospRx_MRA                        1                      1        1       1
## inhospRx_none                       1                      1        1       1
## inhospRx_ARNI                       1                      1        1       1
## inhospRx_ACEIARB                    1                      1        1       1
## covar_Kdisc                         1                      1        1       1
## covar_Crdisc                        1                      1        1       1
## OH_DISCBPSYST                       1                      1        1       1
## OH_DISCHR                           1                      1        1       1
## zip_designation                     1                      1        1       1
## population_total                    1                      1        1       1
## distress_score                      0                      1        1       1
## covar_OutsideZIPregion              1                      0        1       1
## followup                            1                      1        0       1
## OH_ARNI                             1                      1        1       0
## DCtoContCare                        1                      1        1       1
## insured                             1                      1        1       1
##                        DCtoContCare insured
## SITE_ID                           1       1
## GDMT_HFBB                         1       1
## GDMT_ACEIARB                      1       1
## GDMT_ARNI                         1       1
## GDMT_MRA                          1       1
## GDMT_BBcontra                     1       1
## GDMT_RAASIcontra                  1       1
## GDMT_MRAcontra                    1       1
## GDMT_ARNIcontra                   1       1
## GENDERi                           1       1
## race2i                            1       1
## insurance                         1       1
## AGEi                              1       1
## PMHx_none                         1       1
## PMHx_CHF                          1       1
## PMHx_ESRD                         1       1
## PMHx_Cr2                          1       1
## PMHx_PPMICDCRT                    1       1
## PMHx_fibfl                        1       1
## GH_HEART_TRANSPLANTS              1       1
## RESIDENTS                         1       1
## OH_EF                             1       1
## OH_TRANSPLANT                     1       1
## covar_ino                         1       1
## inhospRx_BB                       1       1
## inhospRx_MRA                      1       1
## inhospRx_none                     1       1
## inhospRx_ARNI                     1       1
## inhospRx_ACEIARB                  1       1
## covar_Kdisc                       1       1
## covar_Crdisc                      1       1
## OH_DISCBPSYST                     1       1
## OH_DISCHR                         1       1
## zip_designation                   1       1
## population_total                  1       1
## distress_score                    1       1
## covar_OutsideZIPregion            1       1
## followup                          1       1
## OH_ARNI                           1       1
## DCtoContCare                      0       1
## insured                           1       0

Each row is an imputation model with the first column of each row indicating the variable to be imputed. The columns are potential independent variables to be included in that imputation model. 0’s indicate the variable in that column will be excluded from the imputation model, while 1 indicates the variable be assigned a fixed effect. Note that 0’s are automatically assigned to the matrix diagonal as variables should not be used to impute themselves. Similarly, note that every other variable has been assigned a fixed effect (“1”) by default.
As a general rule, every variable that is going to be used in the final model needs to be included in the imputation model. The only time to definitely exclude a variable is if it is collinear with another variable. This will definitely occur for “insured”, the binary variable we generated from nominal variable “insurance” a few steps ago. Hence, we need to remove “insurance” from the imputation model for “insured”. Similarly “insurance” provides more information than “insured” so we use “insurance” as an independent variable in all other imputation models instead of “insured”. We can arrange this via the following:

pm['insured','insurance'] <- 0
pm[,'insured'] <- 0

Outside of variables that induce collinearity, we usually want to include as much data as possible when creating imputation models, even if that variable is not going to be used in your final model. You may need to exclude variables from imputation models when the imputation models have become exceedingly complex, but the GTWG-HF dataset is sufficiently large that this is not necessarily. If your dataset is not large, remember to include all variables in the imputation models that you plan to include in your final model.
The final set up required before initiating imputation is to set the cluster variable within the predictor matrix. This is done by assigning the cluster variable a value of -2 in all imputation models. The second line here is redundant because we are not imputing the cluster variable (SITE_ID had no missing), but if your data is missing at the cluster level, make sure to set the diagonal back to 0.

pm[,'SITE_ID'] <- -2
pm['SITE_ID','SITE_ID'] <- 0

4.1.3 Do a test run of your imputation model

We’ve now done all the set up required for multiple imputation, but these tasks can be quite computationally expensive, so it’s helpful to perform a test case first. MICE captures many errors as warnings so it can run to completion, so we don’t want the task to run for a week only for us to discover the imputation model has failed at a critical step and makes the model unusable.
As a test run, I set the number of imputations (m) to 1 and the number of iterations (maxit) to 2. Verbose will allow us to see where we are in the computation process. Control1 and control2 are options for glmer() (for hierchical models) and lmer() (for non-hierarchical models), respectively. Setting glmer() and lmer() parameters can help reduce run time. Most importantly, make sure to specify nAGQ=0; this will reduce the accuracy of estimates, but will greatly reduce run time. It is inadvisable to set nAGQ to 0 for the actual model, but isn’t a problem for our test run. Finally, we have the computer keep track of how long it takes the process to run using Sys.time().

start_time <- Sys.time() 
test <- mice(df, m=1,maxit=2, 
             predictorMatrix=pm, method=impmethod, cluster_var=cluster_var, 
             nAGQ=0, seed=88, verbose=2,
             control1=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE),
             control2=lmerControl(optimizer='nloptwrap', optCtrl=list(maxfun=2e5)))
end_time <- Sys.time() 
end_time - start_time

This will produce a mids object, which will contain the set of datasets containing your imputed values. Since this is a test run and we set m=1, the mids object ‘test’ will only contain 1 dataset. We can check to see if our model produces estimates by looking at the trace plots.

plot(test)

We have provided some screen shots of a few of the plots. At minimum make sure the plots are not stationary for at least each kind of imputation method you’ve used. Plot() gives you the traces for all of the imputation models we’ve run, but we only show a few plots below:

There are no traces with a slope of 0 (referred to as flat line behavior; see below under the next troubleshooting section for examples of flatline behavior). Flat line behavior indicates the model has failed to produce anything and imputes stationary values.

We won’t know if the imputations are reasonable until we create the full model. However, the above task took >2 hours for a single imputation with only 2 iterations with nAGQ set to 0. Our full model will require at least 45 imputations with >10 iterations- this could take weeks. To shorten our computational time, we can use parallel processing.

4.1.3.1 Troubleshooting: Flatline trace plots

When we were first setting up the task, our trace plots revealed flat line behavior. Flatline behavior looks like this:

Unfortunately the warning gave limited information: “lmer does not run. Simplify imputation model”. I found three at least 3 sources of this flat line behavior, though there are likely more:

Factor variables have levels with special characters:
Plot appearance: only the trace plots of factor variables show flat line behavior
Problem: MICE internally creates formulas for the GLMs underlying its imputation methods. It explicitly declares dummy variables in this formula, and dummy variables names are based on level names. If the level names have special characters (spaces, periods, etc), these special characters will be incorporated into the formula and cause the model building step to fail.
Fix: Go through all categorical variables carefully and remove special characters from level names (in R, it is good data management practice to not use special characters in the first place, but this may be new to users who routinely use other statistical programming software that leverage value labels
There is only one observation in at least one of your clusters:
Plot appearance: all of your trace plots show flat line behavior
Problem: This is only an issue with older versions of MICE. 2lmer imputation methods fail if at least one cluster has n<2 observations.
Fix: use MICE version > 3.13.6. The authors of MICE are aware of this issue and implemented a fix. By the time you’re reading this, this fix has likely been incorporated into the latest stable version of MICE, but if it hasn’t, install newer versions using devtools.
The model is too complex to run:
Plot appearance: Depends. If you have specified a different model for each imputation method, only the more complex models may have flat line behavior. If you have used the same complex imputation model for each method, all the trace plots will be flat.
Problem: Your imputation model is too complex and blows up. Unfortunately (or fortunately) mice automates a lot of complex modeling, so many of the diagnostics used to validate models cannot be performed, and warnings are captured so mice can run to completion.
Fix: Run the GLM outside of MICE. Go to the source code for MICE (https://github.com/amices/mice/tree/master/R) and find the imputation methods you are using for your imputation model (eg mice.impute.2l.bin.R, mice.impute.2l.lmer.R, etc). In the source code files, mice stores this model in the local variable “fit”, which you can find in page using your browser’s Find tool. This will give you the parameters mice uses to generate your models. Set these models up outside of MICE using the same parameters specified in the source code, and you can then run diagnostics on each model.

4.1.4 Multiple Imputation in Parallel

R is single-threaded, meaning it will only use a single core for processing tasks, no matter how time consuming this might be. Luckily, the imputations performed by MICE are independent of each other, so each imputation can be assigned to a different core in our machine and the datasets can be combined into a single mids object at that the end. This has been written into MICE using the wrapper function parlMICE().
The number of imputations performed (equivalent to “m” in the non-parallel version of mice()) is equal to n.core * n.imp.core. We aimed to produce imputations at least 1-1.5x the degree of missingness in the dataset, which was about 40 imputations in our case. In general, you should use no more than n-1 of the cores on your machine or you’ll risk being unable to do anything else while the task is running. Our machine had 48 cores, so we set n.core=46.
We specified that each core only run 1 imputation as we wanted to reduce run time as much as possible, but if your machine has fewer cores than the number of imputations you want to run, you can increase the n.imp.core parameter at the cost of increasing your run time.
Maxit indicates the number of iterations each imputation model is allowed to run, and frequently 20-30 iterations are necessary for the model to converge (which we will discuss in the next section). Based on our test cases we estimated the task would already take a week with just 10 iterations so our maxit parameter is set much lower.
Cluster type (cl.type) is going to depend on the operating system of your machine. If you are using Windows, you will specify PSOCK, whereas on Linux machines, FORK is preferable.

start_time <- Sys.time() 
print(start_time)
impobj <- parlmice(df, n.core=46, n.imp.core = 1, cl.type="FORK",
                   maxit=11, predictorMatrix=pm, method=impmethod,
                   cluster_var=cluster_var, 
                   nAGQ=0, cluster.seed=88, 
                   control1=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE),
                   control2=lmerControl(optimizer='nloptwrap', optCtrl=list(maxfun=2e5)))
end_time <- Sys.time() 
end_time - start_time

The run time for the task above was 9.3 days, even with our large machine. We used plot() to generate trace plots to assess whether the algorithm had converged after 10 iterations. Here are some example screen shots of the output.

Trace plots for GLM-based imputation methods

The mixing is quite poor and the traces show no sign of converging for any imputation models using the 2l.binary method. The trace plot for PMHx_Cr2 exhibits flat line behavior. The imputation models for 2l.pmm (population_total, distress_score), pmm (OH_DISCBPSYST, OH_DISCHR, etc), and logreg (GENDERi, etc) methods seem to converge without issue, which suggest the problem is isolated to the 2l.binary method.
Based on this pattern, we were concerned that the underlying GLMs were not estimated appropriately. The next step in trouble shooting would have been to take the GLMs out of MICE and run diagonstics on them to assess the problem, but at this point we decided to move away from imputation models based on GLMs. We did so primarily because the run time for the imputation model was prohibitively long for our machine and the model did a poor job of handling categorical variables with more than 2 levels.

4.2 Mulitiple Imputation using Random Forest

Random forest is a type of decision tree analysis. It has been found to be a useful in multiple imputation and can be applied to hierarchical data structures without extra effort. In our case, we simply include the cluster variable in the imputation model without further manipulation. One thing to note is that for random forest, we can keep the cluster variable as a categorical variable instead of converting it into a numeric variable, as we had to do for the GLM based methods.
We used the following code to set up our imputation methods and predictor matrix:

impmethod <- character(ncol(df))
names(impmethod) <- colnames(df)
print("imputation vector created")

impmethod['SITE_ID'] <- "" 
impmethod['GDMT_HFBB'] <- "rf" 
impmethod['GDMT_ACEIARB'] <- "rf" 
impmethod['GDMT_ARNI'] <- "" 
impmethod['GDMT_MRA'] <- "rf"

impmethod['GDMT_BBcontra'] <- "rf"
impmethod['GDMT_RAASIcontra'] <- "rf"
impmethod['GDMT_MRAcontra'] <- "rf" 
impmethod['GDMT_ARNIcontra'] <- "rf" 
impmethod['OH_ARNI'] <- "rf"

impmethod['GENDERi'] <- "rf" 
impmethod['race2i'] <- ""
impmethod['insurance'] <- "rf"
impmethod['AGEi'] <- ""

impmethod['PMHx_none'] <- "rf" 
impmethod['PMHx_CHF'] <- "rf" 
impmethod['PMHx_ESRD'] <- "rf"
impmethod['PMHx_Cr2'] <- "rf"
impmethod['PMHx_PPMICDCRT'] <- "rf"
impmethod['PMHx_fibfl'] <- "rf"

impmethod['GH_HEART_TRANSPLANTS'] <- ""
impmethod['RESIDENTS'] <- ""
impmethod['OH_EF'] <- ""
impmethod['OH_TRANSPLANT'] <- ""
impmethod['covar_ino'] <- ""

impmethod['inhospRx_BB'] <- "rf"
impmethod['inhospRx_MRA'] <- "rf"
impmethod['inhospRx_none'] <- "rf"
impmethod['inhospRx_ARNI'] <- "rf"
impmethod['inhospRx_ACEIARB'] <- "rf"

impmethod['covar_Kdisc'] <- "pmm"
impmethod['covar_Crdisc'] <- "pmm"
impmethod['OH_DISCBPSYST'] <- "pmm"
impmethod['OH_DISCHR'] <- "pmm"

impmethod['zip_designation'] <- "rf"
impmethod['population_total'] <- "pmm"
impmethod['distress_score'] <- "pmm"
impmethod['covar_OutsideZIPregion'] <- "rf"
impmethod['followup'] <- "rf"
impmethod['DCtoContCare'] <- "rf"

We used the imputation method pmm instead of 2l.pmm for continuous variables when simultaneously using the rf method for categorical methods. As mentioned previously, 2l.pmm requires that we change the class of the cluster variable (SITE_ID in our case) to numeric. But for random forest to handle cluster variables, the cluster variable’s class must be factor.
We handled this issue by assigning the cluster variable a fixed effect rather than a random effect for the imputation models using the pmm method. This was possible because we had a proportionately large sample (136,000 observations to ~550 sites); however in many cases assigning the cluster variable a fixed effect isn’t practical because this will consume all the degrees of freedom. If assigning the cluster variable a fixed effect is not possible, 2l.pmm must be used as an imputation method, but this will cause a problem with the cluster variable class. A potential work around would be to duplicate the cluster variable, with one version being kept as a factor variable and the duplicated version converted to numeric class. For the imputation models using the random forest method, use the factor version of the cluster variable and set the numeric cluster variable to 0 in the predictor matrix. For imputation models assigning the cluster variable a random effect (eg 2l.pmm), use the numeric version of the cluster variable and set the factor version to 0 in the predictor matrix.
Again, we opted to assign the cluster variable a fixed effect, so our predictor matrix was as follows:

pm <- mice::make.predictorMatrix(df)

#set the SITE_ID as the cluster var 
pm[,'SITE_ID'] <- 1
pm['SITE_ID','SITE_ID'] <- 0

pm['SITE_ID',] <- 0
pm['GDMT_ARNI',] <- 0
pm['race2i',] <- 0
pm['AGEi',] <- 0
pm['GH_HEART_TRANSPLANTS',] <- 0
pm['RESIDENTS',] <- 0
pm['OH_EF',] <- 0
pm['OH_TRANSPLANT',] <- 0
pm['covar_ino',] <- 0

This code is only necessary if we do not choose to re-create the predictor matrix. We share this code here to demonstrate our thought process. We assigned SITE_ID (our cluster variable) a value of 1, thereby having both the pmm and rf imputation models assign SITE_ID a fixed effect. All the rows we manually set to 0 had no imputation method assigned, so no imputation model would be generated anyways.
Now we set up a test run of the task to make sure there is no flat line behavior.

start_time <- Sys.time()
print(start_time)
test_rf <- mice(df, m=2, maxit=2,
                predictorMatrix=pm, method=impmethod,
                cluster.seed=88,
                verbose=2,
                ntree= 15,
                control=lmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE))
end_time <- Sys.time()
end_time - start_time

Notice that there is no glmercontrol() argument as there are no mixed models are used in this imputation model. We then use plot() to make sure there is no flat line behavior

plot(test_rf)

We display a sample of the test trace plots below, making sure to include plots using the rf and pmm methods:

It seems there’s no flat line behavior, so we set up and run the task in parallel.

start_time <- Sys.time() 
print(start_time)
imp_rf2 <- parlmice(df, n.core=30, n.imp.core = 1, cl.type="FORK",
                maxit=15, cluster.seed=88,
                predictorMatrix=pm, method=impmethod,
                ntree = 15, 
                verbose=2,
                control=lmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE))
end_time <- Sys.time() 
end_time - start_time 

write.mice.imputation(mi.res=imp_rf2, name="imp_rf2", mids2spss = FALSE)

The last line saves the mids output object of parlice(). This is very helpful for mice() objects that take a long time to generate. We then check convergence as follows:

plot(imp_rf2)

Trace plot examples for rf-based imputation methods
The trace plots have good mixing and appear to converge. The next step is to assess whether the imputations make sense. This is best done via density plots. We don’t expect the density plots to be identical to the distribution of the known data, but in general they should be similar.

densityplot(imp_rf2, ~ GDMT_HFBB)

densityplot(imp_rf2, ~ GDMT_ACEIARB)

densityplot(imp_rf2, ~ GDMT_MRA)

densityplot(imp_rf2, ~ GDMT_BBcontra)

densityplot(imp_rf2, ~ GDMT_RAASIcontra)

densityplot(imp_rf2, ~ GDMT_MRAcontra)

densityplot(imp_rf2, ~ GDMT_ARNIcontra)

densityplot(imp_rf2, ~ GENDERi)

densityplot(imp_rf2, ~ insurance)

densityplot(imp_rf2, ~ PMHx_none)

densityplot(imp_rf2, ~ PMHx_CHF)

densityplot(imp_rf2, ~ PMHx_ESRD)

densityplot(imp_rf2, ~ PMHx_Cr2)

densityplot(imp_rf2, ~ PMHx_PPMICDCRT)

densityplot(imp_rf2, ~ PMHx_fibfl)

densityplot(imp_rf2, ~ inhospRx_BB)

densityplot(imp_rf2, ~ inhospRx_ACEIARB)

densityplot(imp_rf2, ~ inhospRx_MRA)

densityplot(imp_rf2, ~ OH_ARNI)

densityplot(imp_rf2, ~ covar_Kdisc)

densityplot(imp_rf2, ~ covar_Crdisc)

densityplot(imp_rf2, ~ OH_DISCBPSYST)

densityplot(imp_rf2, ~ OH_DISCHR)

densityplot(imp_rf2, ~ population_total)

densityplot(imp_rf2, ~ distress_score)

densityplot(imp_rf2, ~ followup)

densityplot(imp_rf2, ~ DCtoContCare)

densityplot(imp_rf2, ~ zip_designation)

The density plots seem reasonable, and this marks the completion of the imputation step of multiple imputation.

5 Model Buidling after Multiple Imputation: the Fitting Step

5.1 Data Preparation

Data transformation is often necessary prior to fitting models. Some transformations should be performed prior to multiple imputation, but linear transformations generally can be performed after the fact. The timing of and various approaches to data transformation with respect to multiple imputation are discussed at length in van Buuren’s book in Chapter 6.4. For example, if we wanted to apply a logarithmic transformation to a variable (such as serum creatinine), this should be performed prior to multiple imputation. However, tasks such as grand mean centering (to ease interpretation of the intercept) can be performed afterwards. In the following code, we grand-mean center some continuous variables, convert distress scores to quintiles, and convert some continuous variables to more clinically-iterpretable binary variables. Note that the conversion of continuous variables to categorical/factor variables follows a method called “impute then transform”, which may introduce some bias into our final estimates but is better than performing such a conversion prior to imputation.

long <- mice::complete(imp_rf2, "long", include=TRUE)

When preparing to transform data, imputed datasets need to be converted to a single dataset. During multiple imputation we have actually created “m” datasets. Prior to transformation, these datasets are appended together to create a long data set of N*m observations (so in our case 30x136,144 observations). After we have converted to a long format, we can perform our transformations

#grand mean center ejection fraction, age, discharge serum creatinine, and total population within a zip code
gm_EF = mean(long$OH_EF, na.rm=TRUE)
long$EF_GMrescale = long$OH_EF - gm_EF

gm_age = mean(long$AGEi, na.rm=TRUE)
long$age_GMrescale = long$AGEi - gm_age

gm_Cr = mean(long$covar_Crdisc, na.rm=TRUE)
long$Crdisc_GMrescale = long$covar_Crdisc - gm_Cr

gm_poptotal = mean(long$population_total, na.rm=TRUE)
long$poptotal_GMrescale = long$population_total - gm_poptotal

#clean up memory
rm(gm_EF, gm_age, gm_Cr, gm_poptotal)

#covert distress scores to quintiles 
long <- long %>% 
        mutate(DCI_quint=ifelse(distress_score < 20, 1, 
                         ifelse(distress_score >= 20 & distress_score < 40, 2, 
                         ifelse(distress_score >= 40 & distress_score < 60, 3, 
                         ifelse(distress_score >= 60 & distress_score < 80, 4, 
                         ifelse(distress_score >= 80, 5, 999))))))
long$DCI_quint <- factor(long$DCI_quint, 
                         levels=c(1,2,3,4,5), 
                         labels=c('prosperous', 'comfortable', 'midtier', 
                                  'atrisk', 'distressed'))

#convert blood pressure, heart rate, and potassium to more meaningful clinical variables (systolic BP < 90, HR < 60, K >= 5)
long <- long %>% 
        mutate(sBP90=ifelse(OH_DISCBPSYST < 90, 1, 0))
long$sBP90 <- as.factor(long$sBP90)

long <- long %>% 
        mutate(HR60=ifelse(OH_DISCHR < 60, 1, 0))
long$HR60 <- as.factor(long$HR60)

long <- long %>% 
        mutate(K5=ifelse(covar_Kdisc >= 5, 1, 0))
long$K5 <- as.factor(long$K5)


#convert the long data sets back into individual impuated datasets 
imprf2_GMC <- as.mids(long)

The last line of code as.mids() breaks up the long dataset back into individual imputed datasets and stores these datasets in a mids object.

5.2 Run the Fit Step in Parallel

All necessary data manipulation has been completed, so the next step is the model fitting phase. In this phase, we fit the model of interest (going forward referred to as the hypothesis model) to each of the imputed datasets. In our case, this translates to 30 models. Our hypothesis model is a mixed effects logistic regression model, with the cluster variable assigned a random intercept. The first step in building mixed effects models is to ensure that the random effect component is even necessary. This can be done by calculating the intraclass correlation coefficient (ICC) for the unconditional mean (aka empty) model.
When constructing this model, the first issue that comes to mind is how to apply an empty model to a MIDS object, which actually contains 30 datasets. Our particular case is straight forward- we can use any of the data sets or even the original one because neither the outcome (GDMT_ARNI) nor the cluster variable (SITE_ID) have any missing data. Hence all 31 datasets (original + imputed datsets) are identical for the purpose of building an empty model. If there is missingness in the cluster variable, you would have to fit an unconditional mean model to each dataset, pool each of these models (this pooling step is straight forward in mice() and will be demonstrated in a later section), and use the pooled model to calculate an ICC.

m0 <- glmer(GDMT_ARNI ~ (1|SITE_ID), 
            data=mice::complete(imprf2_GMC,2), 
            family='binomial', 
            control=glmerControl(optimizer='bobyqa', 
                                 optCtrl=list(maxfun=2e5), 
                                 calc.derivs=TRUE))
ICC <- m0@theta[1]^2/(m0@theta[1]^2 + (3.14159^2/3))
print(ICC)
rm(m0, ICC)

The code above sets up an unconditional mean model and subsequently calculates the intraclass correlation coefficient. In the glmer() function, we use the argument data=mice::complete(mids-name, int) to demonstrate how a specific dataset can be extracted from the mids object imprf2_GMC. Our code uses imputation dataset #2, but in our case, int could have taken on any value between [1,30] because both GDMT_ARNI and SITE_ID have no missing values. We will apply this technique in the next block of code. In any case, the ICC in our case was 0.254, which indicates a random effects model is necessary.
Now we are ready to fit our model, but significant time would be required to run 30 of these complex explanatory models on each of the imputed datasets if we run in each model in serial (in our case, estimated run time was 120+ hours). Luckily, each model is estimated independently of the others, so the task can be paralleled to save time.
Putting a task into parallel can be daunting at first, and the resources we found helpful are in the “Helpful Resources” subsection earlier in this document. Being familiar with the R function ‘lapply’ is a must, as the function we use, ‘parlapply’, is essentially ‘lapply’ split over multiple cores.
Just as in ‘lapply’, we start by defining the function we need to iterate over, which in the case of model fitting, is your hypothesis model.

mids = imprf2_GMC

f <- function(i) {
  
  data.i <- complete(mids,i)
  
  fits <- glmer(GDMT_ARNI ~ GDMT_HFBB + GDMT_ACEIARB + GDMT_MRA +
          GDMT_RAASIcontra + inhospRx_ARNI + OH_ARNI +
          relevel(race2i, ref="White") + relevel(insurance, ref="Other") +
          age_GMrescale + GENDERi +
          PMHx_none + PMHx_CHF + PMHx_Cr2 + PMHx_ESRD +
          GH_HEART_TRANSPLANTS + RESIDENTS +
          EF_GMrescale + OH_TRANSPLANT +
          covar_ino +
          K5 + Crdisc_GMrescale +
          sBP90 + HR60 +
          relevel(zip_designation, ref="Urban") +
          relevel(DCI_quint, ref="prosperous") +
          followup + DCtoContCare +
          (1 | SITE_ID),
        family='binomial', data = data.i,
        control=glmerControl(optimizer='bobyqa', optCtrl=list(maxfun=2e5), calc.derivs=TRUE))
  
  return(fits)
}

Next we have to prepare the group of cores over which we are going to parallel our task. This group of cores is referred to as a cluster, and each core in the cluster is a node. Each node essentially starts up its own fresh version of R that doesn’t have any libraries or data loaded into it; hence, we need to load the libraries and data that our function depends on into each node. This is done through clusterEvalQ(), which is used to export libraries and clusterExport(), which is used to export objects. The next block of code sets up the cluster ‘cl’ with 30 nodes and then exports the essential libraries and objects to each node.

#initiate a cluster using 30 cores. Cluster type is "FORK"
n.core = 30
iter = imp_rf2$m

cl <- makeCluster(n.core, type="FORK")

parallel::clusterEvalQ(cl, library(lme4))
parallel::clusterExport(cl, varlist="mids", envir=environment())

We now run the fitting step, with each node running 1 model. parLapply returns the object ‘fit’, which is a list of length 30 with each element of the list containing a model. The stopCluster() function is very important- it tells the machine to terminate all those R instances we set up previously; forgetting to do this will cause cores to run unnecessarily in the background.

fit <- parLapply(cl, 1:iter, f)

parallel::stopCluster(cl)

rm(n.core, iter, cl)

We now have all our models in the list ‘fit’, but to use the rest of mice()‘s built-in commands we have to convert this into an object that mice will recognize, which is a ’mira’ object. We create a ‘mira’ object using the code below and call it ‘object’.

#assemble a mira object 
object <- list(call=match.call(), call1=imprf2_GMC$call, nmis=imprf2_GMC$nmis, analyses=fit)
oldClass(object) <- c("mira", "matrix")
object

6 Pool the Results and Display the Output

We now have a mira object containing 30 models fit to our 30 imputed datasets. The next step is to pool the results into a single model. This can be done easily using the mice::pool() function, which will create a ‘mipo’ object. The output of a mipo object can be displayed in multiple ways.

pool <- pool(object)
pool$pooled

The pool$pooled output is useful because it displays relative increase in variance (riv), lambda, and fraction of missing information (FMI), which are assessments of how missingness impacts the standard errors of the estimates. Chapter 10 of Applied Missing Data Analysis with SPSS and (R)Studio by Heymans & Eekhout does an excellent job of explaining how these values are interpreted.

result <- summary(pool)
print(result)

Using mice::summary() on a mipo object will print a standard regression table, though one should consider including at least one measure of missing data information (riv, lambda, fmi) when presenting results derived from imputed data. Also for explanatory modeling, consider using e-values rather than or in addition to p-values. E-values better help address robustness against missing variable bias. See VanderWeele, TJ & Ding P. “Sensitivity Analysis in Observational Research: Introducing the E-Value”. Ann Intern Med. 2017. PMID 28693043. for further information.

Finally, clinically-oriented audiences may prefer to see effect estimates and confidence intervals given as odds ratios. This can be done via the following:

table <- cbind(Var=result$term, 
               Est=exp(result$estimate), 
               SE=exp(result$std.error), 
               LL=exp(result$estimate - 1.96*result$std.error),
               UL=exp(result$estimate + 1.96*result$std.error),
               p=result$p.value)

table2 <- format(round(table,3), nsmall=3)
result$term

7 Sensitivity Analysis

Reviewers may request to see a complete case (aka listwise deletion) analysis to compare with the model fitted to data that has undergone multiple imputation. Below is the code we used to generate this model. Complete case analysis is the default method used by glmer to handle missing data, so no data preparation is required outside of basic data transformations.

#data transformation for compelte case analysis 
df <- df %>% 
        mutate(DCI_quint=ifelse(distress_score < 20, 1, 
                         ifelse(distress_score >= 20 & distress_score < 40, 2, 
                         ifelse(distress_score >= 40 & distress_score < 60, 3, 
                         ifelse(distress_score >= 60 & distress_score < 80, 4, 
                         ifelse(distress_score >= 80, 5, 999))))))
df$DCI_quint <- factor(df$DCI_quint, 
                         levels=c(1,2,3,4,5), 
                         labels=c('prosperous', 'comfortable', 'midtier', 
                                  'atrisk', 'distressed'))

df <- df %>% 
        mutate(sBP90=ifelse(OH_DISCBPSYST < 90, 1, 0))
df$sBP90 <- as.factor(df$sBP90)

df <- df %>% 
        mutate(HR60=ifelse(OH_DISCHR < 60, 1, 0))
df$HR60 <- as.factor(df$HR60)

df <- df %>% 
        mutate(K5=ifelse(covar_Kdisc >= 5, 1, 0))
df$K5 <- as.factor(df$K5)

gm_age = mean(df$AGEi, na.rm=TRUE)
df$age_GMrescale = df$AGEi - gm_age

gm_EF = mean(df$OH_EF, na.rm=TRUE)
df$EF_GMrescale = df$OH_EF - gm_EF

gm_Cr = mean(df$covar_Crdisc, na.rm=TRUE)
df$Crdisc_GMrescale = df$covar_Crdisc - gm_Cr

rm(gm_age, gm_EF, gm_Cr)

#sensitivity analysis: run the model using complete case analysis 
cc_analysis <-  glmer(GDMT_ARNI ~ GDMT_HFBB + GDMT_ACEIARB + GDMT_MRA +
                      GDMT_RAASIcontra + inhospRx_ARNI + OH_ARNI +
                      relevel(race2i, ref="White") + relevel(insurance, ref="Other") +
                      age_GMrescale + GENDERi +
                      PMHx_none + PMHx_CHF + PMHx_Cr2 + PMHx_ESRD +
                      GH_HEART_TRANSPLANTS + RESIDENTS +
                      EF_GMrescale + OH_TRANSPLANT +
                      covar_ino +
                      K5 + Crdisc_GMrescale +
                      sBP90 + HR60 +
                      relevel(zip_designation, ref="Urban") +
                      relevel(DCI_quint, ref="prosperous") +
                      followup + DCtoContCare +
                      (1 | SITE_ID),
                family='binomial', data = df, verbose=2,
                control=glmerControl(optimizer='bobyqa', 
                                     optCtrl=list(maxfun=2e5), 
                                     calc.derivs=TRUE))

#create results tables for complete case analysis 
summary(cc_analysis)
se <- sqrt(diag(vcov(cc_analysis)))
cc_analysis_table <- cbind(Est = fixef(cc_analysis), LL = fixef(cc_analysis) - 1.96*se, UL = fixef(cc_analysis) + 1.96* se)
exp(cc_analysis_table)
cc_analysis_table_rounded <- format(round(exp(cc_analysis_table),3), nsmall=3)
cc_analysis_table_rounded

8 Final Thoughts

Multiple imputation provides a powerful and unbiased method of handling missing data so that data can be used to the fullest extent possible. However, there is a trade off- the uncertainty surrounding the true value of missing data manifests as increased standard errors. As a result, for effect estimates with p-values > 0.05, it may be that the result is truly due to random chance (“insignificant”) or that confidence intervals have been widened by uncertainty about the true value of missing data. That being said, we believe that multiple imputation is an excellent solution to handling missing data and will hopefully become increasingly wide spread as large retrospective datasets mature and the medical community gains a better understanding of how these data can be applied to clinical practice. And again, if you believe that we have made an egregious error or if there is simply a better way of doing things, please reach out to us! We are always looking for ways to improve and get closer to the truth!

Thanks all for making it to the end! Jeff

Jeffrey S Tran

21 December 2021