After the compendium was produced, the second phase of the measure scan included development and application of consensus evaluation criteria for the measures identified. This task entailed three distinct steps: construction of a list of important measurement constructs within each DRA domain, based on stakeholder input; a crosswalk of compendium entries against these constructs to identify psychometrically tested measures for each construct; and assessment of the resulting measures against a set of threshold evaluation criteria. The result was a focused set of candidate measures for AHRQ's consideration that could also inform an analysis of the gaps in measure development and testing for specific topics and populations.
Background
Much has been written about the desirable attributes of a sound and meaningful performance measure. Several organizations, including the Institute of Medicine (IOM), the Foundation for Accountability (FACCT), and the National Committee for Quality Assurance (NCQA), have developed frameworks specifying criteria for evaluating health care measures. In general, the criteria they articulate fall into three broad categories: importance, scientific soundness, and feasibility.v Each of these concepts, as it relates to the HCBS Measure Scan project, is discussed in greater detail below.
The importance of a measure involves several dimensions and can be assessed from the vantage point of key program stakeholders, including those providing and those receiving services. Briefly, this concept can include:
- Relevance: Does the measure address the aspects of the program that are relevant to stakeholders? Measures of service volume, for example, may or may not be relevant to the assessment of service quality.
- Meaningfulness: Does the value of the measure carry practical meaning for program participants and other stakeholders? Many metrics can be calculated whose values do not.
- Strategic importance: How well does the measure align with overall quality improvement program goals, including externally imposed expectations or requirements?
- Susceptibility to influence: To what extent can the value of the measure be affected by variables under control of the delivery system? Measures that cannot be expected to improve as a result of programmatic action are often considered less important.
The scientific soundness of a performance or quality measure is generally evaluated in terms of its psychometric properties. The field of psychometrics evolved initially from intelligence testing and statistical theory and has since expanded to cover other types of reported data on experience, education, and personality traits. The goal of psychometric testing is to determine whether results are accurate and meaningful.
Two of the key concepts of psychometric testing are:
- Reliability: Has the measure been shown to be consistent, reproducible, and stable over time?
- Validity: How well does the measure capture what it is intended to measure? This concept includes face validity (the measure makes sense logically), content validity (it measures meaningful aspects of the program), and construct validity (it correlates with other measures assessing the same construct). A brief illustrative sketch of how reliability and validity can be quantified follows this list.
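As a purely illustrative sketch, not drawn from the measure scan itself, the short Python example below shows one common way these two properties are quantified: Cronbach's alpha as an index of internal-consistency reliability, and the correlation of a scale's total score with an established measure of the same construct as one indicator of construct validity. The response data, item counts, and comparison scores are hypothetical.

```python
# Illustrative only: hypothetical responses to a 4-item scale from 6 respondents
# (rows = respondents, columns = items).
import numpy as np

responses = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
])

# Reliability (internal consistency): Cronbach's alpha.
k = responses.shape[1]                         # number of items
item_variances = responses.var(axis=0, ddof=1)
total_variance = responses.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Construct validity: correlation of the scale's total score with a
# hypothetical established measure of the same construct.
established_measure = np.array([18, 9, 20, 12, 16, 7])
validity_r = np.corrcoef(responses.sum(axis=1), established_measure)[0, 1]

print(f"Cronbach's alpha: {alpha:.2f}")        # values of 0.70 or higher are often considered acceptable
print(f"Correlation with established measure: {validity_r:.2f}")
```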
Another important element of scientific soundness is the degree to which scientific evidence is available to support the measure; that is, whether there is a demonstrable link between program interventions and measured outcomes.
Measures that are both important and scientifically sound are of little value if collecting and analyzing the data to support them is expensive or burdensome. Feasibility refers to the ease of calculating the metrics, including collection of the underlying data needed to support them. Feasibility may be determined by:
- Availability of data across the system: Are the same data available from disparate entities (in this case, State Medicaid programs) across the system?
- Alignment with current system expectations and initiatives: Do the measures align with data already being collected for other purposes?
- Prevalence of current use: How widely is the measure already being used?
- Cost and cost-effectiveness: What are the costs associated with data collection and analysis? Is the resulting information "worth" this investment?
- Ability to support subgroup analyses: Will the measure allow comparison between different programs or categories of participants?
In addition to evaluating the characteristics of individual measures, research by the IOM1 suggests that it is important to consider the overall characteristics of the measure set. Specifically, the IOM recommends that the following three criteria also be assessed:
- Balance.
- Comprehensiveness.
- Robustness.
Criteria Development and Application
Our development and application of evaluation criteria for the analysis of HCBS measure scan findings was guided by the three concepts described above: importance, scientific soundness, and feasibility. Stakeholders, including TEP members and others, provided valuable narrative and quantitative feedback on the importance of selected measure constructs. After compiling and analyzing this feedback, we developed threshold evaluation criteria to rate candidate measures on aspects of scientific soundness and feasibility. Each process is discussed in greater detail below.
The importance of selected measures and measure constructs within the three DRA domains was assessed primarily via stakeholder input. The DRA (Public Law 109-171, Section 6086(b)) specifically instructed AHRQ to "consult with consumers, health and social service providers and other professionals knowledgeable about long-term care services and supports" when undertaking measure development. To this end, the Measure Scan project received regular input throughout from three distinct groups of stakeholders.
As noted earlier, a TEP was convened to advise this process and met three times during the scan, twice via conference call and once in person at AHRQ's Rockville, Maryland, headquarters. This group also provided written feedback upon request. Two members of the TEP, Dr. Charlie Lakin and Dr. Judith Kasper, served in the capacity of "substantive reviewers" and participated in additional meetings with the Measure Scan team.
In addition, AHRQ assembled a group of Federal Partners, comprising representatives from other agencies within the Department of Health and Human Services with interest and knowledge in HCBS, including the Centers for Medicare & Medicaid Services (CMS), the Office of the Assistant Secretary for Planning and Evaluation (ASPE), the Administration on Aging (AoA), and the Substance Abuse and Mental Health Services Administration (SAMHSA), among others. This group also met regularly with the project team and provided additional feedback on products and priorities.
Finally, the project team compiled a mailing list of "interested parties," members of the general public who expressed an interest in the measure scan. Each TEP meeting was open to observation by the public, and interested parties were invited to provide feedback during the public comment period at the conclusion of each meeting.
The formal process of assessing measure importance began during the TEP meeting in Rockville on June 22, 2007. The purpose of this meeting was to review a draft of the compendium and develop evaluation criteria. The discussion at this meeting produced a list of approximately 50 measure constructs that participants believed were relevant, meaningful dimensions of HCBS quality that reasonably fell under the purview of State Medicaid programs. These constructs were compiled into tables after the meeting. TEP members and Federal Partners were asked to indicate which they believed were most important for each of the various disability populations where they felt they had expertise. Through iterative rounds of feedback, 21 constructs ultimately emerged as representing consensus for cross-disability quality measurement. Table 1 lists these constructs, by DRA domain.
Starting with an assessment of measure importance helped ensure that the measure set used for the gap analysis was not limited to measures already being collected or to those with strong psychometric properties; it also captured key constructs that fit neither category. Many of these constructs align with the Federal requirement that States' Medicaid programs ensure the health and welfare of their HCBS recipients. This finding suggests that the TEP prioritized this dimension of quality over other operational ones, such as program oversight or the financial integrity of payments.vi
The second evaluation phase entailed cross-walking this list with the entries in the compendium to identify those measures that aligned with the constructs in Table 1. Because of AHRQ's interest in documenting the available measurement science, we limited our inquiry to those instruments and measure sets that had some degree of scientific soundness, generally represented by stakeholder endorsement (content validity) or psychometric testing. In addition, proprietary measures AHRQ had not been given formal permission to use in future development work were excluded.vii
The resulting measures were organized into matrices, by DRA domain and measure construct, with selected information about their developer, scientific testing, prevalence of use, and applicable populations. These matrices are included in Appendix V (Tables A.V.1a, A.V.2a, and A.V.3a). Many of the items in State-developed tools represented in the compendium were not included due to a lack of psychometric testing. However, several State tools and measure sets aligned closely with Table 1, in essence providing construct validity for the results of the importance exercise.
The third component of measure evaluation was development of numeric scoring criteria in four areas: psychometric testing, target populations, feasibility of data collection, and prevalence of current use (a component of feasibility). These numeric criteria are shown in Figure 1. Every measure in the matrices drawn from the compendium (i.e., those shown in Tables A.V.1a, A.V.2a, and A.V.3a) was assessed using these criteria.
The highlighted scores in Figure 1 were those AHRQ considered to be below the threshold for inclusion in the final set of candidate measures. Therefore, any measure that did not meet the minimum threshold in one or more categories was dropped from the original matrices to provide a more robust set of potential measures for AHRQ's consideration in the next phase of the DRA project. These evaluated sets of candidate measures can be found in Appendix V as well (Tables A.V.1b, A.V.2b, and A.V.3b). While the measure scan makes no recommendations regarding the relative merit of any of the measures or instruments included in these tables, they were used to inform the gap analysis for each DRA domain, which is described in the next three sections.
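To make the threshold logic concrete, the sketch below, written in Python, illustrates the general approach of retaining only those measures that meet a minimum score in every scoring area and dropping the rest. The area names, scores, and cut points shown here are hypothetical placeholders; the actual scales and thresholds are those displayed in Figure 1.

```python
# Illustrative only: hypothetical scores and thresholds, not the actual values in Figure 1.
from dataclasses import dataclass

# Hypothetical minimum acceptable score in each of the four scoring areas.
THRESHOLDS = {
    "psychometric_testing": 2,
    "target_populations": 2,
    "feasibility": 2,
    "prevalence_of_use": 1,
}

@dataclass
class Measure:
    name: str
    scores: dict  # scoring area -> numeric score assigned during evaluation

def meets_thresholds(measure: Measure) -> bool:
    """Retain a measure only if it meets the minimum score in every area."""
    return all(measure.scores[area] >= cutoff for area, cutoff in THRESHOLDS.items())

candidates = [
    Measure("Measure A", {"psychometric_testing": 3, "target_populations": 2,
                          "feasibility": 2, "prevalence_of_use": 2}),
    Measure("Measure B", {"psychometric_testing": 1, "target_populations": 3,
                          "feasibility": 3, "prevalence_of_use": 2}),
]

retained = [m.name for m in candidates if meets_thresholds(m)]
print(retained)  # Measure B falls below the psychometric-testing threshold and is dropped
```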
A few important contextual observations should be noted prior to the gap analysis. First, our review of the many State tools and measure sets we collected showed an increasing emphasis on aligning State data collection with CMS requirements, specifically the six assurances of program management articulated in the Code of Federal Regulations.viii It is important to recognize this as a driving force in HCBS quality measurement, particularly when considering feasibility.
Second, many measures and constructs developed specifically for the mental health population align with the constructs in Table 1. Individuals with severe and persistent mental illness are generally not served through Medicaid HCBS programs at present. However, States will be able to use the new Medicaid 1915(i) State plan option to target this population and are encouraged to do so. Thus, these measures were judged to be important to the scan, despite their lack of current use in Medicaid HCBS programs.
Finally, our initial threshold evaluation criteria dropped measures that were not used by at least one State program providing or financing HCBS. Several participants at the final TEP meeting suggested that this exclusion criterion was too strict and argued for a more liberal interpretation of prevalence of current use. The scoring was consequently changed so that measures in current clinical use, although not necessarily by State Medicaid HCBS programs, were included in the final matrices.
There is one important caveat to the categorization of the measure constructs discussed below. In essence, all quality measures can be considered measures of program performance. For organizational convenience, and to follow the language of the DRA, we have divided the constructs in Table 1 into two categories, functioning (broadly defined) and client experience as reported by clients, and we discuss the remaining constructs in the section on program performance. However, measures of client function and experience overlap with assessments of program performance; indeed, this final domain in some ways subsumes the previous two, making the division among the three domains somewhat artificial.