How to compare baselines: p-values or standardized mean difference (SMD)?

A new clinical treatment (i.e. intervention) is often studied by comparing it to a placebo or an existing intervention. Two or more patient groups receiving different interventions are created and their outcomes are compared. When you compare these outcomes, it is important to avoid confounding. This increases the likelihood of capturing the true effect the intervention has on the outcome (i.e. the treatment effect).

As mentioned before, a few methods to minimize confounding are restriction, randomization, and matching. These methods all aim to reduce baseline differences between intervention groups. When patient characteristics are balanced between groups, they are less likely to confound the treatment effect.

Can I use p-values to assess baseline balance between groups?

A p-value tells you how likely you are to observe a difference at least as large as the one found if chance (i.e. random variation) were its only cause. When a p-value is less than 0.05 (the conventional cut-off), the risk of mistakenly concluding that a difference is real when it is in fact due to chance is less than 5%. P-values are an indispensable part of hypothesis testing; however, they are overused in clinical research. For more details, check out this Wikipedia page on p-values.

One example of inappropriate use of p-values is assessing balance in baseline characteristics between intervention groups after an attempt to avoid confounding. Here are two reasons why: (1) After randomization, any observed baseline difference is by definition caused by random variation, so testing the null hypothesis of "no systematic difference" answers a question we already know the answer to. (2) After matching, the p-value is uninformative because it depends on sample size as well as on the magnitude of the group difference, so there is no direct relationship between imbalance and the p-value. Therefore, the p-value is inadequate to assess and optimize balance at baseline.
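Point (1) can be illustrated with a small simulation. This is a minimal sketch (the population parameters, sample sizes, and the large-sample z-test below are my own illustrative choices, not from the text): when two groups are drawn from the same population, as happens at baseline after randomization, "significant" p-values appear at roughly the 5% false-positive rate by construction.

```python
import math
import random

random.seed(0)

def z_test_p(a, b):
    """Two-sided p-value from a two-sample z-test (large-sample approximation)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    # Normal CDF via math.erf; p = 2 * P(Z > |z|)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Simulate 2000 randomized trials in which both groups are drawn from the
# SAME population, i.e. any baseline difference is pure chance.
significant = 0
for _ in range(2000):
    a = [random.gauss(60, 10) for _ in range(100)]  # e.g. baseline age, group A
    b = [random.gauss(60, 10) for _ in range(100)]  # same distribution, group B
    if z_test_p(a, b) < 0.05:
        significant += 1

print(significant / 2000)  # roughly 0.05: the type-I error rate, by construction
```

The "significant" baseline differences here carry no information about confounding; they are exactly the false positives the 0.05 cut-off permits.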


The current best practice for baseline assessment is to use the standardized mean difference (SMD). This ratio is calculated by dividing the difference in means between groups by the pooled standard deviation of the variable. An SMD of 0 indicates perfect balance, and larger values indicate greater imbalance (an SMD of 1 means the group means differ by a full standard deviation). A typical rule of thumb for adequate balance is an SMD <0.1 (or 10%). For more details check out this Cochrane page on the SMD.
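The calculation above can be sketched in a few lines of Python (the `smd` helper and the example data are hypothetical, for illustration only; this version uses the pooled standard deviation of the two groups):

```python
import math

def smd(group_a, group_b):
    """Standardized mean difference between two groups,
    using the pooled standard deviation (illustrative helper)."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):  # sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    pooled_sd = math.sqrt((var(group_a) + var(group_b)) / 2)
    return abs(mean(group_a) - mean(group_b)) / pooled_sd

# Made-up example: baseline age in two intervention groups
control = [62, 58, 65, 70, 61]
treated = [63, 60, 66, 69, 62]
print(round(smd(control, treated), 3))  # → 0.196, above the 0.1 rule of thumb
```

Note that, unlike a p-value, this number does not change if you simply enroll more patients with the same distributions; it measures imbalance directly.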

Using SMDs instead of p-values for baseline comparison will improve the quality of your study. However, keep in mind that some (late-adopting) journals still require the use of p-values for baseline tables.


What is bias?

Bias is a systematic error in the design or execution of a study that may lead to invalid conclusions. Bias can have a big impact on the measurement of the association between exposure and outcome.

The two main types of bias are:

  • Selection bias (e.g. loss to follow-up is different between the intervention and control group)
  • Information bias (e.g. collected information is more accurate in the intervention group compared to the control group)

Note that, in contrast to the terminology often used in published literature, selection bias is not the right term for confounding by indication. Confounding by indication occurs when the intervention group has a different level of disease severity than the control group.

To avoid selection bias, make sure you select the intervention and control groups carefully and try to minimize loss to follow-up. To avoid information bias, use high-quality and validated outcome measures (minimizing misclassification), mask the study hypothesis when interviewing patients, use blinding when possible, and standardize your follow-up procedures.

By thinking ahead about bias in your study protocol you can avoid getting into trouble later on.

What is confounding?

When you perform a scientific study, the issue of confounding is very important. In general, the aim of your study is to assess the true effect that an exposure (e.g. a risk factor or an intervention) has on the outcome of interest (e.g. a disease or a clinical outcome). Confounding prevents you from measuring this true effect. If your study has confounding, the conclusion of your research may be invalid. A factor that causes confounding in your study is called a confounder. Below you see a schematic overview of how a confounder may affect your study.


A confounder can be almost anything you can measure. However, there are three requirements for a given factor to be a confounder. These are:

  1. A confounder has to be associated with the level of exposure (e.g. different proportion in intervention compared to control group).
  2. A confounder has to be associated with the outcome of interest (e.g. different proportion in present compared to absent outcome).
  3. A confounder cannot be on the causal pathway between the exposure and the outcome of interest (e.g. renal failure lies on the causal pathway between high blood pressure and heart failure, so it would be a mediator rather than a confounder).


If you want your study results to be valid, you need to limit the effects of confounding as much as possible. This can be done by study design (restriction, randomization, or matching), or by adjusting for known confounders in the statistical analysis (stratification or coefficient adjustment). It is important to think about confounding both when writing the study protocol and when analyzing your results.
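A small simulation can make stratification concrete. This sketch is entirely hypothetical (the variable names, probabilities, and sample size are my own): an age group acts as a confounder because it is associated with both treatment assignment and recovery, while the treatment itself has no true effect. The crude comparison is distorted, but comparing within strata of the confounder recovers the (null) truth.

```python
import random

random.seed(1)

# Simulate 10,000 patients. "old" is the confounder: it influences both
# who gets treated (requirement 1) and who recovers (requirement 2).
n = 10000
records = []
for _ in range(n):
    old = random.random() < 0.5                         # confounder
    treated = random.random() < (0.7 if old else 0.3)   # older patients treated more often
    recovered = random.random() < (0.4 if old else 0.8) # older patients recover less often
    records.append((old, treated, recovered))           # note: treatment has NO true effect

def risk(rows):
    """Proportion of patients who recovered."""
    return sum(r for _, _, r in rows) / len(rows)

treated_rows = [r for r in records if r[1]]
control_rows = [r for r in records if not r[1]]
crude = risk(treated_rows) - risk(control_rows)
print("crude risk difference:", round(crude, 3))  # clearly negative: confounded

# Stratify on the confounder: within each age stratum the difference is ~0
for stratum in (True, False):
    t = [r for r in records if r[0] == stratum and r[1]]
    c = [r for r in records if r[0] == stratum and not r[1]]
    print("stratum old =", stratum, "->", round(risk(t) - risk(c), 3))
```

The crude analysis wrongly suggests the treatment is harmful; stratification removes that distortion. Coefficient adjustment (e.g. including the confounder in a regression model) achieves the same goal in one model.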

The Importance of Study Protocols…

I can’t stress enough the importance of having a solid and to-the-point research protocol before you start a new research project. Whether your study is going to be a multi-center randomized controlled trial or even a small prospective or retrospective cohort study, a research protocol makes your life easier and your study more valid. Here is why:

Internal validity
All research is incremental, and all the improvements we aim to achieve by doing clinical research are aimed at getting a somewhat better understanding of the underlying truth. Although it sounds appealing, it is quite unlikely that we will actually experience breakthrough science at any point in our careers. This means that the research we do should be as accurate and valid as possible, so that we are able to detect small and large effects alike.

The way modern-day statistics are set up makes the tests we do reliable, but it also limits us to a strict framework. This framework includes defining the exposure(s) of interest, the primary endpoint and secondary outcomes, setting a null hypothesis, and stating which statistical parameters and tests you are going to use for the analysis, all before you actually start the study.

If you record all these parameters in your study protocol, it will be more likely that you adhere to them throughout the rest of the study. This will eventually improve your internal validity, which reflects the accuracy and validity of your conclusions.

External validity
Clinical research is typically performed in a subset of the general population. We control the environment to improve the feasibility and internal validity of our research. Although internal validity is most important, eventually you would like to apply the results of your study to a larger population (i.e. generalize them). The degree to which this is possible depends heavily on the external validity of your study.

If you clearly define the inclusion and exclusion criteria in your study protocol, this will enable you to think about whether your study is going to be generalizable before you start. It will also impact to what extent readers are going to be able to apply your conclusions to their individual setting.

Bottom line
Most of the problems that arise from not thinking about these issues ahead of time are very difficult (if not impossible) to fix after your study ends. You will save yourself a lot of trouble if you create a clear study protocol and discuss it with your clinical and scientific co-workers.

A good study protocol will result in higher internal and external validity and thereby improve the odds of publication in a high-impact journal!