To further assess the potential and current limitations of APCDs to support health care measures, the team acquired APCD data from three States, along with accompanying and online documentation. These data were used to conduct preliminary analyses.i The assessment focused on tables and fields particularly relevant to the calculation of health care measures, such as those identified in the measure inventory.
Specifically, each APCD was reviewed to:
- Determine whether the fields received were correct (e.g., fields provided matched the documentation, values fell within expected ranges).
- Document any anomalies.
- Note any differences over quarters, across States, or both.
- Provide a preliminary assessment of whether the data could be used to construct reliable and valid health care measures.
- Provide insight into future analyses that will further validate these data for use in measure construction.
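The first two review steps can be illustrated with a minimal sketch. It assumes a hypothetical data dictionary keyed by field name with optional numeric `min`/`max` bounds; the field names and bounds below are invented for illustration and are not taken from any State's documentation or from the team's actual SAS programs.

```python
# Hypothetical data dictionary: field name -> optional numeric bounds.
# These names and ranges are illustrative assumptions only.
DATA_DICTIONARY = {
    "member_id": {},
    "service_month": {"min": 1, "max": 12},
    "paid_amount": {"min": 0.0},
}


def review_records(records, dictionary):
    """Compare received records with the documentation.

    Returns undocumented fields, documented fields never received,
    and (row, field, value) triples that fall outside documented bounds.
    """
    documented = set(dictionary)
    received = set()
    for rec in records:
        received.update(rec)

    anomalies = {
        "undocumented_fields": sorted(received - documented),
        "absent_fields": sorted(documented - received),
        "out_of_range": [],
    }
    for row_number, rec in enumerate(records):
        for field, rule in dictionary.items():
            value = rec.get(field)
            if value is None:
                continue  # missingness is a separate check
            low, high = rule.get("min"), rule.get("max")
            if (low is not None and value < low) or (high is not None and value > high):
                anomalies["out_of_range"].append((row_number, field, value))
    return anomalies
```

A check of this kind only documents anomalies; deciding whether an out-of-range value is an error or an undocumented convention still requires follow-up with the data steward.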
Methods
The team analyzed data for each State separately and did not link the data across States. Limited or restricted versions of the APCDs were evaluated because they provide a fuller range of variables that could be used for measure construction, as well as information on variables potentially useful for validating APCD data elements in future projects. When feasible, the datasets include the majority of non-patient-identifying fields, along with the unique, encrypted patient identifier, from medical claims, enrollment records, and provider records.
This initial evaluation focused only on medical and pharmacy claims. While some States include dental claims, these were not evaluated. Also, due to limitations in data availability and time available to obtain data, APCDs varied with regard to the inclusion of Medicare and Medicaid records, which limited the ability to compare analyses across States.
The analyses were conducted and are reported in accordance with each State's Data Use Agreement. Detailed data analyses were available to AHRQ and researchers to construct recommendations for future work.
The analyses were intended as a high-level overview of the existing, unmanipulated APCDs. The overall data assessment sought to assess:
- Accessibility of data.
- Basic usability of data and documentation received.
- Availability of data elements needed to construct health care measures in the utilization, quality, and cost domains, such as unique identifiers, provider or facility identifiers, service dates, sociodemographic fields, payment fields, revenue codes, clinical diagnosis and procedure codes, payer, and claim status.
- Consistency of the distribution of key data elements with the documented data standards and face validity of the values.
For all key fields, the percentage missing/populated, distribution of values for continuous fields, and frequencies for categorical variables (up to 50 most frequent values) were calculated. The team also assessed the number of unique patients, unique claims, and number of transactions by month. These numbers were compared with State populations obtained from the U.S. Census Web site for qualitative assessment only.
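The descriptive statistics described above can be sketched as follows. This is a minimal illustration in Python, not the team's SAS code; it assumes that missing values arrive as `None` or empty strings, which may differ from how a given APCD encodes missingness.

```python
from collections import Counter
import statistics

# Assumption: these two values represent "missing" in the extracted data.
MISSING = (None, "")


def percent_missing(values):
    """Percentage of entries that are missing (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    missing = sum(1 for v in values if v in MISSING)
    return 100.0 * missing / len(values)


def top_frequencies(values, limit=50):
    """Most frequent nonmissing values of a categorical field, up to `limit`."""
    counts = Counter(v for v in values if v not in MISSING)
    return counts.most_common(limit)


def describe_continuous(values):
    """Basic distribution summary of the nonmissing values of a numeric field."""
    present = sorted(v for v in values if v not in MISSING)
    if not present:
        return {"n": 0}
    return {
        "n": len(present),
        "min": present[0],
        "median": statistics.median(present),
        "mean": statistics.fmean(present),
        "max": present[-1],
    }
```

The default `limit=50` mirrors the cap on reported categorical values; everything else about the summary (which statistics to report, how to treat placeholder codes) is a design choice left to the analyst.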
Results of Data Evaluation
Availability of APCD data. This report provides a brief summary of the availability of datasets. Because the project required a large number of potential variables to conduct the global assessment, the experience of obtaining data for this project may not reflect the experience of obtaining data for other measurement efforts. Although up to 14 States had APCDs in 2014, only 7 had made those data available for extramural analyses. Four State APCDs were pursued with the intention of obtaining three for analysis.
The applications for data varied somewhat but were all reasonable in length and detail and were consistent with other data request applications. One application for a Limited Data Set version was denied due to the lack of consistency of this project's goals with the data release guidelines for that APCD. In some cases, the process for obtaining data required iterative submissions and communication with the APCD organizations.
Overall, the application process required more time than anticipated, but all remaining applications were approved, except for Medicaid data for one State. The data were provided at a cost consistent with other public datasets, and educational or government rates were frequently available. Some States provide data without charge to government or educational institutions.
Data usability assessment. The first step was to confirm that the expected data files were received with documentation and to assess the ability to convert the flat data files into SAS files. The analysts completing these tasks were experienced in taking in raw data of this type, such that the usability assessment here may have limited application to users with less data experience.
All data dictionaries, control totals, and user guides that were available were received with the data. Few issues were experienced in loading the data. Upon data intake, the team learned that some files specified in the documentation were missing from one State dataset and that some documentation did not match the data layout provided. One file could not be imported into SAS for unclear reasons; the APCD team worked with the project team to provide a file that could be read. When control tables were available, the record counts they reported matched those observed in all resulting SAS files.
In general, States provided reasonable assistance with all issues identified. However, because APCDs are often run with limited staff, the project had to allow additional time for issues to be resolved.
Examination of the reference tables (e.g., lookup tables providing labels for market, diagnosis codes, etc.) for all three States did not show any problems with the values in these tables with regard to duplicates or values/contents in the files. One can use the lookup tables with confidence.
Evaluation of data elements and completeness. The team checked the databases for unique members, data elements used to construct health care measures, and other criteria such as data elements used to identify unique encounters.
Unique members in database. The team assessed the total number of unique members by month to observe any large fluctuations that may suggest concerns about the unique patient identifier or completeness of the claims provided. In some cases, modest variability over time was found, which appeared from Web site documentation to be due to incomplete claim submission.ii
It was observed in one case that the total number of unique individuals could exceed the population within a State, suggesting incomplete matching of claims to individuals. This situation can occur due to carve-outs, multiple coverage for one individual, duplicate claims, or failures in the patient identifier encryption methods. Despite these two issues, which were addressed in subsequent data releases or clarifications, these analyses showed reasonable numbers of unique patients relative to State populations and consistency over time.
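A monthly count of unique members, compared against a State population figure, might be computed along these lines. The sketch assumes enrollment rows of the form (month, encrypted member identifier); the 10 percent month-to-month threshold is an arbitrary illustrative value, not one used in the assessment.

```python
from collections import defaultdict


def unique_members_by_month(enrollment_rows):
    """Count distinct member identifiers per month.

    enrollment_rows: iterable of (month as 'YYYY-MM', encrypted member id).
    """
    members = defaultdict(set)
    for month, member_id in enrollment_rows:
        members[month].add(member_id)
    return {month: len(ids) for month, ids in sorted(members.items())}


def flag_months(counts, state_population, max_change=0.10):
    """Flag months whose member count exceeds the State population or
    changes by more than `max_change` (a fraction) from the prior month.

    The default threshold is an assumption chosen for illustration.
    """
    flags = []
    previous = None
    for month, n in counts.items():
        if n > state_population:
            flags.append((month, "exceeds population"))
        if previous is not None and previous > 0:
            if abs(n - previous) / previous > max_change:
                flags.append((month, "large month-to-month change"))
        previous = n
    return flags
```

A flag is only a prompt for qualitative review: carve-outs, multiple coverage, and late claim submission can all produce the same pattern.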
Data elements used to construct health care measures. APCDs are typically organized into eligibility, claim, and provider files. However, constructing health care measures of utilization, quality, and cost requires data elements from across these files. This section discusses several groups of data elements that serve distinct functions in health care measures. For any given measure, a subset of these elements would be used.
This section aims to describe the completeness of the data in qualitative terms, consistent with the project-specific Data Use Agreements and to provide context for the findings of the limited analytic review. In many cases, further analyses would better evaluate the validity and reliability of these data elements. This section also notes similarities and differences between the definition and coding of the data elements, as such consistency across States provides additional measurement opportunities.
Data elements used to identify unique encounters and link encounters. Most measures derived using APCD data will at minimum require the distinction between the type of health care encounters represented by the claim, such as isolating outpatient visits or prescriptions. Here, the team made the assessment on the following:
- Unique patient identifiers.
- Service dates.
- Variables that provide clues to movement through the health care system (e.g., discharge status or admission source).
- Variables used to identify the type of encounter (e.g., hospital, pharmacy, outpatient).
- Claim status, used to determine whether a claim is the final paid claim.
The analyses show the percentage missing on all claims for variables used to isolate or link encounters. Some variables are not expected to be 100 percent populated, because they are only used for specific claim types, such as inpatient claims. Other variables, while being fully populated, have important caveats.
Identifying unique encounters requires data elements that facilitate removal of any duplicate claims. Unlike in administrative data such as hospital discharge data, a single encounter in claims data often results in multiple claims from providers, facilities, and pharmacies and may also have multiple claims representing different claim processing statuses. Variables such as service dates and bill/service types are essential to build a dataset with only unique encounters, but additional variables such as clinical variables or provider variables are also important. These variables are described later, because they also serve other functions in health care measurement.
Some measures count encounters of a certain type without concern for multiple unique encounters for a single patient, such as the total number of prescriptions of a class of medication per population. In this case, whether 50 or 500 people obtained the observed prescriptions is not relevant to the measure. These measures would require information about (1) the service or bill type, to isolate the encounter type of interest; (2) the service provider and date, to ensure unique claims; and (3) the status of the claim, to ascertain whether the claim is the final adjudicated claim. In this case, encrypted member or patient identifiers will also assist in identifying unique encounters, even if the measure does not require linking across encounters.
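The three requirements listed above can be sketched as a simple filter. The record layout, the `final_paid` status label, and the choice of deduplication key are all assumptions made for illustration; actual APCDs use their own field names and claim status categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimLine:
    claim_id: str
    member_id: str      # encrypted member identifier
    provider_id: str
    service_date: str   # ISO 'YYYY-MM-DD'
    service_type: str   # e.g., 'pharmacy' or 'outpatient' (hypothetical labels)
    service_code: str   # drug or procedure code
    status: str         # claim processing status (hypothetical labels)


# Assumption: a single label marks the final adjudicated claim.
FINAL_STATUSES = {"final_paid"}


def unique_encounters(lines, service_type):
    """Keep one final adjudicated claim per member, provider, date, and code."""
    seen = set()
    kept = []
    for line in lines:
        # (1) isolate the encounter type; (3) keep only final claims
        if line.service_type != service_type or line.status not in FINAL_STATUSES:
            continue
        # (2) provider and date, plus member and code, define uniqueness here
        key = (line.member_id, line.provider_id, line.service_date, line.service_code)
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept
```

Note that a key of this kind also collapses genuine repeat services delivered by the same provider on the same day, so the count it produces is a lower bound on true encounters.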
Measures designed to track care across the health care system, to track encounters for the same individual over time, or to construct "episodes of care" all must have robust patient identifying information. For this purpose, a unique encrypted patient identifier is necessary. Additional information about the patient, such as age, gender, date of birth, or clinical information about the encounter (e.g., diagnosis or procedure codes), can be used to ensure proper linkages across claims and to clean data.
Although the analyses did not specifically examine the quality of the patient identifying variable, some modest fluctuation over time was observed (i.e., more unique individuals were identified within a database than anticipated). This suggests that single individuals may have more than one unique identifier, complicating linkage of claims.
Because some variables are specific to inpatient encounters, the analyses of these variables were inconclusive, and further investigation is needed to understand them. The completeness of inpatient-specific variables varied even within APCDs by 5 to 15 percent on the low end; in some cases, one or two variables appeared to be populated at a much higher rate. This variation suggests that the variables may contain either invalid or placeholder entries or may be used for claims of different service or facility types. Overall, the data completeness was robust.
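One possible follow-up analysis is to compute completeness within a claim type rather than across all claims, and to count how often an inpatient-specific field is populated on other claim types. The sketch below assumes each claim is a dictionary with a hypothetical `claim_type` field; it was not part of the assessment reported here.

```python
def populated_rate(claims, field, claim_type=None):
    """Percentage of claims with `field` populated.

    If `claim_type` is given, only claims of that type are considered.
    Returns None when no claims match.
    """
    subset = [
        c for c in claims
        if claim_type is None or c.get("claim_type") == claim_type
    ]
    if not subset:
        return None
    populated = sum(1 for c in subset if c.get(field) not in (None, ""))
    return 100.0 * populated / len(subset)


def populated_outside_type(claims, field, claim_type):
    """Count claims NOT of `claim_type` that still have `field` populated."""
    return sum(
        1 for c in claims
        if c.get("claim_type") != claim_type and c.get(field) not in (None, "")
    )
```

A high count from `populated_outside_type` would point toward placeholder entries or reuse of the field for other claim types, which are the two explanations suggested above.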
Although users should always analyze for and potentially remove duplicate claims for the same encounter, many APCDs intake data from payers and prepare the data to assess quality and remove certain duplicate claims. For instance, in Maine, the State data steward applies an algorithm to remove duplicate pharmacy claims so that the same prescription will not be double counted.6
Some data cleaning undertaken by the APCD or the user will remove real multiple encounters that occur on the same day, such as multiple emergency department visits or the same drug administered twice on the same day. As of yet, however, few variables are available to assist with such targeted data cleaning. Further, the issue of duplicate claims likely far outweighs the claims lost.
Table 3. Select variables useful for identifying and linking encounters
Variable Type | Range of Nonmissing Across APCDs and Quarters | Notes on Expectations and Validity* |
---|---|---|
Unique patient identifier | 100% consistent across quarters and years | Unique patient identifier is typically created and fully populated. Usefulness of the identifier may be affected by methods used to create identifier. Observed higher numbers than anticipated given known covered lives, suggesting multiple identifiers per patient. |
Service date | 100% consistent across quarters and years | Typically includes day, month, and year. |
Admission date | Highly variable, 10%-100% | Required only for inpatient claims, so lower rates would be expected. Unclear whether quarters with 100% of claims with nonmissing admission dates represent valid dates or whether dates match the service date. In general, inpatient claim variables are not reported at the same rate in data, suggesting use of fields for non-inpatient claims or missing values. |
Admission type | Highly variable, 16%-75% | Required only for inpatient claims, so lower rates would be expected. Unclear whether quarters with 75% of claims with nonmissing admission type represent valid data. In general, inpatient claim variables are not reported at the same rate in data, suggesting use of fields for non-inpatient claims or missing values. |
Discharge date | Highly variable, 21%-93% | Required only for inpatient claims, so lower rates would be expected. Unclear whether quarters with 93% of claims with non-missing dates represent valid dates or whether dates match the service date. In general, inpatient admission date and discharge date are reported at different rates, suggesting use of fields for non-inpatient claims or missing values. |
Discharge status | 14%-37% | Required only for inpatient claims, so lower rates would be expected. In general, APCDs were less variable in reporting but reported frequently on more claims than anticipated (typically, 30% of claims are not inpatient). |
Admission source | 13%-41% | Required only for inpatient claims, so lower rates would be expected. In general, APCDs were less variable in reporting but reported frequently on more claims than anticipated (typically, 30% of claims are not inpatient). |
Bill type, service type, billing provider type | 95%-100% | APCDs have different ways of populating this key variable. National billing standards break out facility type from provider services. While all APCDs evaluated had the standard fields populated at similar rates, each had unique fields that were fully populated with the type of service or facility. |
Claim status | 100% consistent across quarters and years | Variable is reported consistently, but the available categories vary by APCD. |
*Additional information on validity can be found in Discussion section of this report.
Data elements that describe the clinical aspects of an encounter. Because claims data focus on payment, the clinical information provided is limited. However, to provide context for and to justify a claim, certain useful clinical elements are included:
- International Classification of Diseases, Ninth Revision (ICD-9) and, now, Tenth Revision (ICD-10) codes for diagnoses and associated present on admission (POA) flags.
- Current Procedural Terminology (CPT) or ICD-9/10 codes for procedures.
- Drug codes for pharmacy claims and details about the prescription, such as whether it is a new prescription or refill, the quantity dispensed, and the anticipated day supply.
These fields were found to be nearly always populated as appropriate.
First-listed or principal diagnosis codes were usually present, while secondary diagnosis codes were present in just over half of claims, and the rate declined further with each additional diagnosis position. Although secondary diagnoses can be used across different types of claims, outpatient claims are known to contain fewer diagnosis codes, and secondary diagnosis codes often do not affect reimbursement. For inpatient claims, where these codes do affect reimbursement, the number of available diagnosis code fields can affect measures.
Even in the inpatient setting, the APCDs evaluated did not adequately support POA coding, which should be available for all secondary diagnoses (and can be assigned to principal diagnoses). The lack of secondary diagnosis codes and POA coding affects the usefulness of these data for adequate risk adjustment without linking across multiple encounter and pharmacy records to establish comorbidities for a patient.
Procedure or service codes could be assigned to office visits as well as inpatient visits, but high variability in the inclusion of these codes was observed. This finding is consistent with other similar data. Diagnosis-related group (DRG) data are of limited usefulness in all-payer data unless derived after data are received, as providers are often not required to report DRGs to payers that do not use DRG-based reimbursement.
Table 4. Variables that describe the clinical aspects of the encounter
Variable Type | Claims Populated Range in Quarter and State* | Notes on Expectations |
---|---|---|
Principal or first listed diagnosis | 99%-100% | ICD-9 or ICD-10-CM codes |
Secondary diagnoses | The secondary diagnosis codes range from 53%-66% for the second position, then appropriately drop in frequency after about 5 diagnosis codes. | Secondary diagnosis codes can be included in records of different types but typically are more frequent in inpatient records where secondary codes can affect reimbursement. Outpatient records often include a limited number of diagnosis codes. |
POA | 0%-33% | Typically accompany inpatient diagnoses only but variable not populated or reported for majority of States/quarters. |
Admitting diagnosis | 4%-31% | Most State-quarter combinations have high missing rates (more than 90% missing). Admitting diagnosis differs from principal diagnosis, which is assigned at discharge. Low rates of admitting diagnosis are consistent with other data. |
ICD-9 Principal Procedure | 3%-22% | ICD-9/10 procedure codes are used most frequently in inpatient or high-intensity outpatient settings (some emergency departments, ambulatory surgery). Expected percentage is low, because not all receive procedures. Some APCDs combine procedure classifications into one variable (CPT, ICD-9). |
CPT #1 | 20%-94% | Some APCDs combine procedure classifications into one variable (CPT, ICD-9). Outpatient encounters can almost always receive a CPT service code for an office visit. |
Secondary procedure codes | <10% | APCDs vary on number of procedure fields available. |
DRG | 3%-17% | Typically only for inpatient claims. Not all payers use a DRG system, so this field will be populated at a rate lower than the inpatient rate, particularly for younger patients. The DRG version is reported for a quarter of cases but may be auto-populated. |
National Drug Code | 100% | Included in the pharmacy claims, but reliability was not assessed.7 |
Prescription details such as refills, day supply, quantity | 100% | Included in the pharmacy claims, but reliability was not assessed. |
* Unless specific to a pharmacy claim as noted, all missing values are based on the medical claim records only.
Data elements used in attribution to a provider or payer. APCD data can be used to calculate measures at different levels, including geographic (e.g., State), payer, facility, or provider. Because of the nested nature of the data, it is possible to build measures, but attribution remains a complex problem in most data sources.
Providers may use multiple Medicaid numbers or National Provider Identifier (NPI) numbers or provide care across different locations based on administrative and logistic features of their practice. Therefore, it can be difficult to attribute all encounters to the provider. The team observed a large number of providers in some quarter-State combinations, suggesting single providers using multiple provider IDs. The team did not evaluate whether the location of service was populated for most provider claims or whether the rate at which it was populated was consistent with the data submission guidelines. In addition, prescribers were not consistently identified in pharmacy records.
It is important to evaluate the provider data across the data tables. For instance, the master provider table may have limited information on specialty, while the claims table has this field populated as expected. Provider address and name variables can be used to further validate providers but can also vary by table.
The data also do not provide definitive markers for either assigned or utilized primary care providers, medical homes, or other constructs. This information may be estimated based on plurality of visits, payment arrangements suggesting medical homes, or other methods. A current NQF committee has developed a report on the attribution of care to providers and facilities.
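Attribution by plurality of visits, one of the estimation methods mentioned above, can be sketched as follows. The input format and the tie rule (leave the member unattributed) are assumptions for illustration; published attribution methods differ in how they break ties and which visit types they count.

```python
from collections import Counter, defaultdict


def attribute_by_plurality(visits):
    """Assign each member to the provider with the most visits.

    visits: iterable of (member_id, provider_id) pairs, one per visit.
    Returns a dict member_id -> provider_id, or None when the top two
    providers are tied (an assumed tie rule).
    """
    counts = defaultdict(Counter)
    for member_id, provider_id in visits:
        counts[member_id][provider_id] += 1

    attribution = {}
    for member_id, provider_counts in counts.items():
        ranked = provider_counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            attribution[member_id] = None  # tie: left unattributed
        else:
            attribution[member_id] = ranked[0][0]
    return attribution
```

Because a single provider may appear under several identifiers, results from a method like this depend on first consolidating provider IDs.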
Table 5. Variables used for measure attribution
Variable Type | Missing Range in Quarter and State | Notes on Expectations |
---|---|---|
Submitter | 100% | Some records have submitters that differ from the payer. |
Payer | 100% | In some record types, payer may be reported less frequently. |
Pharmacy location | 74%-99% | ZIP Code |
Billing provider (any unique identifier) | 100% | The best variable for identifying billing providers is not clear. APCDs may populate Medicaid or Medicare numbers, National Provider Identifier, or unique provider codes. Some have multiple fields populated. Observed high numbers of providers in some cases, suggesting multiple identifiers for one provider. |
Service provider city/ZIP Code | 0%-100% | State is always provided, but more granular data varies depending on data request. Must justify more granular data when available. |
Location of service | 58%-100% | Location of service varies for State-quarter combinations. Applies to provider records. |
Primary insurance indicator | 96%-100% | Flag or identification of primary insurance. |
Variables used to calculate cost. The variables associated with payment for services, which allow tracking and measuring of expenditure trends, were nearly always populated. These included payer/submitter, diagnosis, procedure, and all payment fields (copay, deductible, coinsurance, prepaid, and insurance payment). This finding is not surprising given that these data files are built from claims submitted to payers for reimbursement.
In relation to out-of-pocket health care costs, although all records have information on co-insurance and deductible, few records contain any information on the premium patients paid. Another source of missed costs is the exclusion of denied claims, which may, if not paid by a secondary insurer, result in out-of-pocket payment.
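As a minimal sketch of how the patient cost-sharing fields could be combined, the function below sums copay, deductible, and coinsurance per member. The field names are hypothetical, amounts are assumed to be decimal strings, and, consistent with the caveats above, the total excludes premiums and any payments arising from denied claims.

```python
from collections import defaultdict
from decimal import Decimal


def out_of_pocket_by_member(claims):
    """Sum patient cost sharing (copay + deductible + coinsurance) per member.

    claims: iterable of dicts with hypothetical keys 'member_id', 'copay',
    'deductible', and 'coinsurance'; amounts are decimal strings.
    Premiums and denied claims are not represented in this total.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each member at 0
    for claim in claims:
        totals[claim["member_id"]] += (
            Decimal(claim["copay"])
            + Decimal(claim["deductible"])
            + Decimal(claim["coinsurance"])
        )
    return dict(totals)
```

`Decimal` is used instead of floating point so that currency amounts sum exactly.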
Other cost information related to secondary or other payers varied, such as co-insurance days or Medicare payments. However, because the total number of records where these fields would be relevant is not known, an assessment cannot be made regarding the observed variation in the completeness of these variables. Not all APCDs contain information on secondary payers.
Table 6. Variables used to calculate cost
Variable Type | Nonmissing Range in Quarter and State | Notes on Expectations |
---|---|---|
Copay amount | 100% | In claims file |
Deductible amount | 100% | In claims file |
Amount prepaid | 100% | In claims file |
Amount paid | 100% | In claims file |
Coinsurance | 100% | In claims file |
Allowed amount | 100% | Key variable for cost measures |
Insurance type | 100% | In eligibility file |
Year and month of enrollment | 100% | In eligibility file |
Coverage flags (medical, pharmacy) | 100% | In eligibility file |
Demographic variables. While demographic variables beyond age and gender are rarely used to construct measures, they do have roles in assessing disparities and in some cases, where the application indicates, risk adjustment. The demographic variables in APCDs are limited. Age and gender are well populated while race and ethnicity are more variable. The validity of the race variable was not assessed, but this should be assessed further since it is unlikely that race is accurately reported on all records and not all individuals report a second race.
Other variables, such as insurance type or the member's residence, can be used as proxies for other demographic variables, although they are imperfect. The member address and ZIP Code on file may not always match the actual residence of the member.
Table 7. Demographic variables
Variable Type | Missing Range in Quarter and State | Notes on Expectations |
---|---|---|
Race, primary | 19%-100% | Potentially over- and underreported. Most datasets do not contain 100% quality race data. |
Race, secondary | 17%-100% | Potentially over- and underreported. Most datasets do not contain 100% quality race data. |
Ethnicity or Hispanic flag | 74%-100% | Potentially over- and underreported. Most datasets do not contain 100% quality race or ethnicity data. |
Insurance type | 100% | Used as a proxy, but primary insurance type may not capture dual-covered or dual-eligible individuals. |
Member city/county | 98%-100% | Counties and cities can be heterogeneous, but data are available for socioeconomic status of county communities. |
Member ZIP Code | 99%-100% | |
Sex | 100% (nonmissing) | |
Age | 100% (nonmissing) | |
i In agreement with the States providing the data, the States are not named in this report. Resource limitations allowed limited analyses, focused primarily on descriptive statistics and missingness of data across the full, unmanipulated datasets. The results are presented in qualitative form in accordance with the project-specific Data Use Agreement for each APCD.
ii Missing claims are frequently added in subsequent releases of the data.