How to handle missing data

Missing data is a recurring challenge in clinical research, as any study can face the absence of experimental observations, potentially undermining the reliability of analyses. Understanding the frequency and nature of missing data is essential for implementing the most appropriate data management strategies.

Ennio Russo

Medical Writing & Scientific Communication Executive, Ph.D.

In any clinical study, researchers often encounter datasets with missing observations, commonly referred to as “missing data.” Most standard statistical methods require a value for every study variable in every observation. Managing missing data is therefore crucial, as ignoring them can lead to distorted and unreliable results.

Understanding Missing Data

The first step in managing missing data is to understand how frequently they occur. Intuitively, handling a dataset where missing data represent a small percentage is quite different from dealing with a dataset with a significant amount of missing data.
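As a minimal sketch of this first step, the share of missing values can be tallied per variable and per participant. The dataset, the variable names, and the use of None as the missing-value marker are all assumptions made for illustration; a real analysis would typically rely on a statistical package rather than hand-written functions.

```python
# Hypothetical toy dataset: one dict per participant, None marks a missing value.
records = [
    {"age": 54, "systolic_bp": 132, "hba1c": 6.1},
    {"age": 61, "systolic_bp": None, "hba1c": 7.4},
    {"age": None, "systolic_bp": None, "hba1c": 6.8},
    {"age": 47, "systolic_bp": 118, "hba1c": None},
    {"age": 70, "systolic_bp": 141, "hba1c": 7.0},
]


def missing_fraction_by_variable(rows):
    """Return the share of missing values for each variable."""
    variables = rows[0].keys()
    return {
        var: sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


def incomplete_fraction(rows):
    """Return the share of participants with at least one missing value."""
    return sum(any(v is None for v in row.values()) for row in rows) / len(rows)


by_variable = missing_fraction_by_variable(records)
incomplete = incomplete_fraction(records)

for var, frac in by_variable.items():
    print(f"{var}: {frac:.0%} missing")
print(f"Participants with incomplete records: {incomplete:.0%}")
```

Looking at both figures is useful because a low rate per variable can still leave a large share of participants with incomplete records.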

Next, it’s essential to understand the reasons behind the missing data. This aspect is key in interpreting results, as it allows researchers to distinguish whether the data are missing purely by chance or whether their absence is associated with specific experimental factors. Based on this criterion, missing data can be classified into three main categories:

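Missing Completely At Random (MCAR): In this case, missing data are randomly distributed across the sample, and the probability that a value is missing is not related to any study variable, observed or unobserved.
Missing At Random (MAR): Here, the probability that a value is missing is related to other, observed variables but, once these are taken into account, not to the missing value itself. For example, older participants may be more likely to miss a visit regardless of their test results.
Missing Not At Random (MNAR): In this category, the probability that a value is missing depends on the missing value itself, even after the observed study variables are taken into account. For example, participants with the worst symptoms may be the ones who fail to report them.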
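The practical difference between the three categories can be illustrated with a small simulation. This is a sketch under stated assumptions: the cohort, the relationship between age and blood pressure, and the missingness probabilities are all invented for the example.

```python
import math
import random
import statistics

random.seed(42)

# Simulated cohort (hypothetical): blood pressure rises with age.
n = 20000
ages = [random.uniform(30, 80) for _ in range(n)]
bp = [100 + 0.6 * age + random.gauss(0, 10) for age in ages]


def logistic(x):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))


def observed_mean(values, p_missing):
    """Mean of the values left after each one is dropped with its own probability."""
    kept = [v for v, p in zip(values, p_missing) if random.random() >= p]
    return statistics.fmean(kept)


# MCAR: every value has the same 30% chance of being missing.
p_mcar = [0.3] * n
# MAR: missingness depends on age, which is observed, not on blood pressure itself.
p_mar = [logistic((age - 55) / 8) for age in ages]
# MNAR: missingness depends on the blood pressure value that goes unrecorded.
p_mnar = [logistic((value - 133) / 8) for value in bp]

true_mean = statistics.fmean(bp)
mean_mcar = observed_mean(bp, p_mcar)
mean_mar = observed_mean(bp, p_mar)
mean_mnar = observed_mean(bp, p_mnar)

print(f"Full data: {true_mean:.1f}")
print(f"MCAR:      {mean_mcar:.1f}")
print(f"MAR:       {mean_mar:.1f}")
print(f"MNAR:      {mean_mnar:.1f}")
```

In this setup the mean of the observed values stays close to the full-data mean under MCAR, while it is pulled downward under MAR and MNAR. Under MAR the shift can in principle be corrected using age, since age is observed. Under MNAR the information needed for the correction is exactly what is missing.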

Managing Missing Data

Ideally, the best way to manage missing data is to prevent them from occurring in the first place. This requires careful study design and accurate data collection. For example, limiting follow-up visits to those that are strictly necessary, collecting only essential information at each visit, and designing easy-to-complete forms can all help minimize missing data. Before starting clinical research, it is advisable to prepare detailed protocol documentation covering participant screening, training for researchers and participants, communication among the parties involved, and monitoring of collected data. It is also possible to define in advance an acceptable level of missing data.

There are various techniques to handle missing data, most of which follow one of two approaches: deleting incomplete observations or imputing the missing values. Here are some techniques available to researchers:

Listwise Deletion: This method removes every case with missing data and analyzes only the remaining complete cases. If the data are MCAR, this method can produce unbiased estimates, although the smaller sample reduces statistical power.
Pairwise Deletion: This method uses all the data available for each specific analysis, preserving more information than listwise deletion. However, each estimate may then be based on a different subset of participants, which can make results inconsistent with one another and lead to analytical issues.
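Mean Substitution: Missing values are replaced with the mean of the observed values of the variable. However, this artificially reduces the variability of the variable, so standard errors are underestimated, and it can bias estimates such as correlations with other variables.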
Regression Imputation: This method predicts missing values from other variables through regression analysis. It retains more data than deletion methods, but because the imputed values fall exactly on the regression line, it tends to understate variability.
Last Observation Carried Forward (LOCF): Each missing value is replaced with the last known observation for that subject. While simple, this method assumes that the outcome does not change after the last recorded visit, and it can produce biased estimates of treatment effects.
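Maximum Likelihood: Rather than filling in individual values, this method estimates the parameters of the statistical model directly from all the observed data, including incomplete cases. It can be computationally demanding and may yield biased estimates if its assumptions are not met.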

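Multiple Imputation: This technique replaces each missing value with several plausible values, generating multiple complete datasets. The same analysis is run on each dataset and the results are then combined into a final estimate whose uncertainty reflects the fact that the imputed values are not known. It is generally considered a robust method and can perform well even with a small sample or a high number of missing values, provided the imputation model is appropriate and the data are at least MAR.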
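The impute, analyze, and combine steps of multiple imputation can be sketched as follows. Several choices here are assumptions made for illustration and do not come from the text above: the simulated data, the function names, imputing from a regression line refitted on a bootstrap sample, and the number of imputations. The combination step uses the formulas commonly known as Rubin's rules. A real study would normally use dedicated, validated software.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: x is fully observed, y is missing (None) for some participants.
n = 200
x = [random.gauss(50, 10) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 5) for xi in x]
y_obs = [yi if random.random() > 0.3 else None for yi in y]  # MCAR, for simplicity


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def impute_once(xs, ys_obs):
    """Build one completed dataset: refit on a bootstrap sample, then draw with noise."""
    complete = [(xi, yi) for xi, yi in zip(xs, ys_obs) if yi is not None]
    boot = random.choices(complete, k=len(complete))
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    return [yi if yi is not None else a + b * xi + random.gauss(0, sd)
            for xi, yi in zip(xs, ys_obs)]


def pool(estimates, variances):
    """Combine per-dataset estimates and their variances using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variability across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


m = 20  # number of imputed datasets (an arbitrary choice for this sketch)
estimates, variances = [], []
for _ in range(m):
    completed = impute_once(x, y_obs)
    estimates.append(statistics.fmean(completed))
    variances.append(statistics.variance(completed) / n)  # variance of the mean

pooled_mean, total_variance = pool(estimates, variances)
print(f"Pooled mean of y: {pooled_mean:.2f} (standard error {total_variance ** 0.5:.2f})")
```

The between-imputation term is what separates this approach from a single imputation: it widens the standard error to reflect how much the estimate changes from one plausible completed dataset to another.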

The choice of method should be evaluated by the researcher in relation to the experimental needs and characteristics of the missing data.
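To see why the choice matters, the sketch below applies listwise deletion, mean substitution, and LOCF to the same small set of repeated measurements. The participant identifiers, scores, and function names are hypothetical and serve only to illustrate the mechanics.

```python
import statistics

# Hypothetical symptom scores at visits 1 to 4; None marks a missed visit.
visits = {
    "P01": [8.0, 7.0, 6.0, 5.0],
    "P02": [9.0, 8.0, None, None],
    "P03": [7.0, None, 5.0, 4.0],
    "P04": [8.0, 6.0, 5.0, None],
}


def listwise_deletion(data):
    """Keep only participants with no missing visits."""
    return {pid: v for pid, v in data.items() if None not in v}


def mean_substitution(data):
    """Replace each missing value with the mean of the observed values at that visit."""
    n_visits = len(next(iter(data.values())))
    visit_means = [
        statistics.fmean(v[i] for v in data.values() if v[i] is not None)
        for i in range(n_visits)
    ]
    return {pid: [visit_means[i] if x is None else x for i, x in enumerate(v)]
            for pid, v in data.items()}


def locf(data):
    """Replace each missing value with the participant's last observed value.

    A value missing at the very first visit stays None, since there is
    nothing to carry forward.
    """
    filled = {}
    for pid, v in data.items():
        out, last = [], None
        for x in v:
            last = x if x is not None else last
            out.append(last)
        filled[pid] = out
    return filled


def final_visit_mean(data):
    """Mean score at the last visit across participants."""
    return statistics.fmean(v[-1] for v in data.values())


for name, method in [("Listwise deletion", listwise_deletion),
                     ("Mean substitution", mean_substitution),
                     ("LOCF", locf)]:
    print(f"{name}: final-visit mean = {final_visit_mean(method(visits)):.2f}")
```

On this toy dataset the three methods give final-visit means of 5.0, 4.5, and 5.5 respectively, so the same raw data support three different conclusions depending on how the gaps are handled.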

Conclusion

Missing data present a significant challenge in clinical research, as they can compromise the reliability and validity of analyses. Understanding the nature and frequency of missing data is essential to adopt the best management strategies. Preventing missing data through careful study design and attentive data collection is a crucial first step. If missing data are present, researchers have several techniques at their disposal to manage them, adapting their approach based on the nature of the missing data.