This appendix describes how to calculate statistical power to determine whether to conduct an impact study and, if so, which outcomes you can examine. To do so, first consider effect size: What effect size would convince stakeholders to adopt the intervention? The answer to this question determines the effect size you should aim to detect with confidence. Discussions about the effect sizes that stakeholders consider meaningful should occur when you describe the logic model/theory of change. Stakeholders focused on return on investment may need to see a reduction in costs large enough to more than offset any extra payments made to practices to adopt the intervention. Others may need to see improvements in patient experience ratings. Next, assess the feasibility of generating that effect in the time frame available to you. Finally, calculate statistical power to determine how many practices and patients you need to be confident that the evaluation will detect an effect of this size. If you already know the maximum number of practices you will be transforming, you can skip to the final step and estimate how likely it is that the evaluation will find an effect of a given size to be statistically significant. If the number of practices is too small, the evaluation will be unable to determine reliably whether the intervention generated an effect.
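As a rough starting point, the sketch below (in Python, using the statsmodels package, which provides standard power calculations) illustrates both directions of this final step: solving for the sample size needed to detect a given effect, and solving for the power a fixed sample provides. The effect size, significance level, and power targets shown are illustrative choices, not recommendations, and the calculation ignores clustering, which the next paragraph addresses.

```python
# A minimal sketch of a simple (unclustered) two-arm power calculation,
# assuming a continuous outcome compared across two equal-sized groups.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many patients per arm are needed to detect a 0.2 standard-deviation
# effect with 80% power at a two-sided 5% significance level?
n_per_arm = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.80,
                                 alternative='two-sided')
print(n_per_arm)  # about 393.4, i.e., roughly 394 patients per arm

# Conversely, if the sample is fixed at 150 patients per arm, solve for power:
power = analysis.solve_power(effect_size=0.2, nobs1=150, alpha=0.05)
print(power)  # about 0.41, an underpowered design for this effect size
```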
Calculating the minimum effect that the evaluation is likely to detect under different design approaches is particularly important for evaluations of primary care interventions such as the PCMH, because testing them often requires large samples. In practice-level interventions, patient outcomes are clustered within practices. Clustering arises because patients within a practice often receive care similar to that received by other patients served by the same practice, but different from the care received by patients in other practices, given differences in how clinicians practice medicine and other practice-level factors. Patients from a given practice therefore cannot be treated as statistically independent of one another, which lowers the effective sample size of patients. As a result, the number of practices (not the number of patients) in the intervention largely determines the size of the minimum effect the evaluation is likely to detect with high confidence. The amount of clustering varies across samples of patients and practices and across outcomes. When calculating the minimum detectable effect (and when analyzing outcomes), you must account for the clustered nature of the data.7
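To make the clustering adjustment concrete, here is a minimal sketch of the standard normal-approximation formula for the minimum detectable effect (MDE) of a two-arm design with equal numbers of practices per arm. The function name mde_clustered and all input values are our own illustrative assumptions; in practice, the intraclass correlation (icc) should be estimated from data comparable to yours.

```python
from scipy.stats import norm

def mde_clustered(n_practices_per_arm, patients_per_practice, sd, icc,
                  alpha=0.05, power=0.80):
    """Approximate MDE for a two-arm design in which whole practices are
    assigned to conditions, using a normal approximation."""
    deff = 1 + (patients_per_practice - 1) * icc              # design effect
    n_eff = n_practices_per_arm * patients_per_practice / deff  # effective n per arm
    se = sd * (2 / n_eff) ** 0.5          # SE of the difference in means
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

# Illustrative values (sd in outcome units; icc estimated from your data):
base = dict(sd=1.0, icc=0.05)
print(round(mde_clustered(20, 100, **base), 3))  # 0.216 SDs
print(round(mde_clustered(20, 200, **base), 3))  # 0.207: doubling patients barely helps
print(round(mde_clustered(40, 100, **base), 3))  # 0.153: doubling practices helps a lot
```

Note how doubling the number of patients per practice barely changes the MDE, whereas doubling the number of practices shrinks it substantially; this is the sense in which the number of practices, not patients, drives the effect sizes the evaluation can detect.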
One strategy, measuring effects among members of a high-risk subgroup, might improve power to detect effects, depending on the extent of clustering in the data for a given set of intervention practices and outcomes. Although models like the PCMH target all patients in a practice, studying sicker patients can increase the power to detect effects on continuous claims-based outcomes, such as cost and service use, for several reasons. First, among healthy patients, we expect relatively few hospitalizations and limited service use regardless of the intervention's effectiveness, leaving little opportunity to reduce health care use and cost. Among sicker patients, there are more opportunities for reductions in cost and service use, so plausible effects are larger and evaluations can detect them with smaller samples of those patients. In addition, because service use and cost vary more widely, relative to average levels, among all patients than among sicker patients, it is often harder to distinguish an effect of the intervention from ordinary variation in these outcomes among all patients. We therefore suggest calculating minimum detectable effects for different outcomes and samples, to help you decide which outcome measures to track for which patients.
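The sketch below, which reuses the mde_clustered helper defined above, illustrates such a comparison. All dollar figures, ICCs, and sample sizes are hypothetical: although the absolute MDE is larger for the high-risk subgroup, it is smaller as a share of that group's average spending, so a plausible proportional reduction is easier to detect.

```python
# Reuses mde_clustered() from the earlier sketch. All values below are
# hypothetical, for illustration only, not estimates from real data.
scenarios = [
    # label, mean monthly cost ($), SD ($), ICC, attributed patients per practice
    ("all patients",       500,  1500, 0.02, 400),
    ("high-risk subgroup", 1500, 2500, 0.02, 80),
]
for label, mean_cost, sd, icc, m in scenarios:
    mde = mde_clustered(n_practices_per_arm=25, patients_per_practice=m,
                        sd=sd, icc=icc)
    print(f"{label}: MDE = ${mde:,.0f}/month ({mde / mean_cost:.0%} of the mean)")
# all patients:       MDE = $178/month (36% of the mean)
# high-risk subgroup: MDE = $356/month (24% of the mean)
```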