How does propensity score matching that uses only a small proportion of eligible patients affect generalizability?

Asked on Cross Validated by Diana Petitti on December 1, 2021

I am reviewing a paper that seeks to assess the effect of treatment on mortality using observational data on 2,985 hospitalized patients. A propensity-matched analysis ends up with 380 patients (190 treated/190 not treated). But these 380 patients are a highly selected group compared with all 2,985 patients. For example, only 6.3% of the 380 patients in the propensity-matched analysis were admitted to the ICU, compared with 24.2% of all patients; only 5.3% of the 380 patients in the propensity-matched analysis were mechanically ventilated, compared with 17.6% of all patients.

The literature on propensity-matched analyses identifies inefficiency (loss of power) as a problem with propensity score matching. But isn’t generalizability (the ability to draw conclusions about a causal effect of treatment on mortality in all of the hospitalized patients) also a concern?

One Answer

Generalizability is absolutely one of the problems with using propensity score matching for exactly the reason you mentioned. This is why it is so important to be clear about the causal estimand and to ensure that the statistical method you're using doesn't affect it. If one seeks to generalize to the population from which the sample was drawn, one is estimating the average treatment effect in the population (ATE) and must use methods appropriate for estimating the ATE. Propensity score matching (or specifically, propensity score subset selection) is not one such method. As soon as you perform matching, your estimand no longer corresponds to the ATE and the estimated effect cannot be said to generalize to the population from which the sample was drawn.
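To make that concrete, here is a minimal sketch in Python of 1:1 nearest-neighbor caliper matching on the propensity score. All data are simulated, and the covariates, coefficients, and caliper rule are illustrative assumptions, not the paper's actual analysis; the point is only to show how matching discards units and thereby narrows the population the estimate describes:

```python
# Minimal sketch: 1:1 nearest-neighbor propensity score matching with a
# caliper, on simulated data. Matching keeps only well-overlapping pairs,
# so the matched sample is a selected subset of the original cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2985                                          # mirrors the cohort size in the question
X = rng.normal(size=(n, 3))                       # hypothetical covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, 0.5, -0.5])))))

# Estimate propensity scores e(x) = P(treat = 1 | x).
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# Greedy 1:1 matching without replacement, with a caliper of 0.2 SD of
# the logit propensity score (a common rule of thumb).
logit_ps = np.log(ps / (1 - ps))
caliper = 0.2 * logit_ps.std()
controls = list(np.flatnonzero(treat == 0))
pairs = []
for t in np.flatnonzero(treat == 1):
    if not controls:
        break
    dists = np.abs(logit_ps[controls] - logit_ps[t])
    j = dists.argmin()
    if dists[j] <= caliper:                       # treated units with no close control are dropped
        pairs.append((t, controls.pop(j)))

print(f"matched sample: {2 * len(pairs)} of {n} patients")
# Whatever effect is estimated in `pairs` describes this subset, not the ATE.
```

In the paper under review, the analogous step is what reduced 2,985 patients to 380, which is why the matched sample looks so different from the full cohort on ICU admission and mechanical ventilation.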

There has been a fair bit written about propensity score-related methods that forgo generalizing to a clear target population and focus instead on simply removing confounding in a way that doesn't decrease the variance too much. Important papers in this domain include Crump et al. (2009) and Mao, Li, and Greene (2018), who describe specific statistical methods for estimating treatment effects when generalizing to a specific population is not necessarily desirable. Desai and Franklin (2019) do a nice job of describing which methods should be used for estimating treatment effects for different target populations.
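As a concrete example of one such method, Crump et al.'s (2009) rule of thumb restricts the analysis to units whose estimated propensity scores lie in [0.1, 0.9], deliberately trading the original target population for one with better overlap. A minimal sketch, again with simulated stand-in data:

```python
# Minimal sketch of the Crump et al. (2009) trimming rule: keep only units
# with estimated propensity scores inside [0.1, 0.9]. Simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2985, 3))
treat = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, 0.5, -0.5])))))
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

keep = (ps >= 0.1) & (ps <= 0.9)                  # the trimming rule
print(f"retained {keep.sum()} of {len(ps)} patients")

# Inverse probability weights for the ATE *within the trimmed sample*;
# the estimand is no longer the ATE in the original population.
w = np.where(treat[keep] == 1, 1 / ps[keep], 1 / (1 - ps[keep]))
```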

One reason, I believe, that this is not so frequently discussed in the applied literature is that the target population is already often ambiguous or arbitrary. The ATE properly estimated in a certain hospital only generalizes to that hospital, but that's not an interesting or clinically meaningful population. Given this, it makes sense to forgo generalizing to the specific population from which the sample was drawn and instead focus on removing confounding. This is exactly the implicit perspective taken when using caliper matching or forms of propensity score weighting that change the estimand (e.g., overlap weights).
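For instance, overlap weights give each treated unit a weight of 1 - e(x) and each control a weight of e(x), which shifts the estimand to the average treatment effect in the overlap population (ATO) rather than the ATE. A minimal sketch with simulated data (the outcome here is pure noise, so the estimate should be near zero):

```python
# Minimal sketch of overlap weighting on simulated data: treated units are
# weighted by 1 - e(x) and controls by e(x), targeting the ATO, not the ATE.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(2985, 3))
treat = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, 0.5, -0.5])))))
y = rng.binomial(1, 0.3, size=len(treat))         # hypothetical binary mortality outcome
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

w = np.where(treat == 1, 1 - ps, ps)              # overlap weights
eff = (np.average(y[treat == 1], weights=w[treat == 1])
       - np.average(y[treat == 0], weights=w[treat == 0]))
print(f"weighted risk difference in the overlap population: {eff:.3f}")
```

Unlike matching, no units are discarded here; units are smoothly down-weighted where overlap is poor, but the estimand is still not the ATE in the full cohort.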

Mao et al. (2018) provide a very nice description of this type of reasoning, with five reasons why retaining the original target population may not be a good choice, in which case it doesn't matter that the treatment effect doesn't generalize to a specific population. They couch their reasoning in terms of "treatment effect discovery", i.e., "Is there any evidence of treatment efficacy in the data?"

Although the authors of the paper you are reviewing may not state that their goal is treatment effect discovery rather than generalizing a treatment effect to a specific population, I interpret their use of caliper matching to imply that it is. It would be wise to point them to Mao et al. (2018) and ask them to be explicit about that goal rather than leaving readers (such as yourself) wondering why they have completely forgone generalizing to a specific population by discarding units from the sample. If the authors do not include lack of generalizability as a limitation, encourage them to do so and to write about its implications; otherwise readers may believe the estimated treatment effect applies to all individuals. The authors should be clear that their goal is treatment effect discovery and that future research should estimate treatment effects for specific populations of interest, which may not be possible while also eliminating confounding in their sample.


Crump, R. K., Hotz, V. J., Imbens, G. W., & Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1), 187–199. https://doi.org/10.1093/biomet/asn055

Desai, R. J., & Franklin, J. M. (2019). Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: A primer for practitioners. BMJ, 367, l5657. https://doi.org/10.1136/bmj.l5657

Mao, H., Li, L., & Greene, T. (2018). Propensity score weighting analysis and treatment effect discovery. Statistical Methods in Medical Research. Advance online publication. https://doi.org/10.1177/0962280218781171

Answered by Noah on December 1, 2021
