Subtask 2.3. Demonstrate Performance of Electronic Detection Schemes Compared to an Accepted SSI Surveillance Reference Standard.
2.3.1. Acquire all surveillance data pertaining to cases in the test set
Each of the facilities had a different preexisting strategy for SSI surveillance. Denver Health and Vail hospitals followed National Healthcare Safety Network (NHSN) guidelines and performed traditional, manual surveillance. Intermountain Healthcare had previously pioneered electronically supported, human-adjudicated surveillance systems and used that modality routinely across all hospitals in its system. The VA used NSQIP for surveillance, with rules similar to, but not entirely the same as, NHSN. None of the participating hospitals followed the NHSN guideline that excludes procedures in which a drain was placed through the incision intraoperatively. Each facility pulled the results of routine surveillance, based on its own methodology, into databases residing on its own systems. Each of these datasets served as a slightly different reference standard representing the status quo. As such, accuracy and reliability statistics are not directly comparable between centers, but rather represent the performance of the human-adjudicated surveillance system compared with the various systems already in place.
With regard to amending the VASQIP data to align more closely with NHSN guidelines, it became clear that Denver Health, because of its integrated system (where outpatient and inpatient records are part of the same EHR), also principally focused on postdischarge surveillance of up to 30 days. Because of variances from, and differences in interpretation of, the NHSN guidelines at each center, it was deemed too difficult to build or amend the routine surveillance data to achieve harmonization to a common standard at each facility. In the end, no consistent reference standard could be applied across all hospitals.
2.3.2. Analysis and reporting of electronic surveillance & manual surveillance performance to the reference standard
At each hospital, we applied our electronic algorithm to all surgical procedures that met our prespecified criteria. Results were pooled to report overall accuracy. The sensitivity was 37.8 percent, the specificity was 94.3 percent, the PPV was 2.0 percent, and the NPV was 99.8 percent. Results from each system calculated individually are outlined in Exhibit 13. The sensitivity ranged from 0 percent at VVMC to 50 percent at VA SLC HCS. As the numbers of positives are quite small, the confidence intervals for sensitivity and PPV are quite large.
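For reference, the pooled figures follow from the standard 2 × 2 definitions. The sketch below is illustrative only: the counts are hypothetical values chosen to roughly reproduce the reported pooled percentages, not the study's actual tallies.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 accuracy metrics (sensitivity, specificity, PPV, NPV)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical counts for illustration; they approximate the pooled
# percentages reported above but are not the study's actual counts.
print(diagnostic_metrics(tp=17, fp=833, fn=28, tn=13789))
# -> sensitivity ~0.378, specificity ~0.943, ppv ~0.020, npv ~0.998
```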
We compared the human-adjudicated system to routine prospective surveillance as a test of feasibility and diagnostic performance. Of all the charts that the algorithm deemed positive and queued for review, we randomly selected, at each center, up to 50 that had also been identified by routine surveillance as positive (i.e., true positives) and up to 50 identified as negative (i.e., false positives) for manual review. The reviewer was blinded to the result of routine surveillance as well as to the ratio of positives and negatives. The reviewer classified each chart as to whether an SSI was present and the depth of the SSI. The procedure type was already specified by ICD-9 or CPT code. Charts not queued for review by the algorithm were considered negative by the human-adjudicated system. Exhibit 14 illustrates the sampling and reviewing processes. Sensitivity, specificity, and their confidence intervals were calculated using the methods included in Appendix H. We found a sensitivity of 41.5 percent (excluding records with corrupted identifiers), which is lower than hoped for but was limited by the insensitivity of the electronic algorithm itself, since charts not queued for review were counted as negative. Our specificity of 99.8 percent was comparable to specificities frequently reported for manual surveillance systems. The overall measured interrater reliability between the two historical surveillance assessments and the assessments of our reviewers on sampled charts was 0.85.
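The report does not specify which reliability statistic underlies the 0.85 figure; assuming a chance-corrected agreement statistic such as Cohen's kappa, a minimal sketch of the computation would be:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters making binary SSI calls (lists of 0/1)."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Toy example with made-up calls, not study data:
print(cohens_kappa([1, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0]))
```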
During implementation, poor algorithm performance was noted at VVMC (Exhibit 15). At most other facilities, the absence of an antibiotic prescription after surgery meant either that no antibiotic was given or that the prescription data were missing from the record. At VVMC, no electronic antibiotic prescription data were available at all, so every value was missing. In their case, we coded the postoperative antibiotic field as ‘-1' and altered the algorithm so that a ‘-1' caused it to err on the side of flagging the case as positive. Even with this change, no cases were picked up.
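A minimal sketch of this missing-data handling, with a hypothetical field name and a deliberately simplified flagging rule (the actual rules are the SQL-based algorithm described under Subtask 2.5), might look like:

```python
MISSING = -1  # sentinel used when a facility has no electronic prescription data

def flag_for_review(postop_antibiotic_days):
    """Err on the side of review: treat missing antibiotic data like a positive signal."""
    if postop_antibiotic_days == MISSING:
        return True
    return postop_antibiotic_days > 0  # simplified stand-in for the real criteria

print(flag_for_review(MISSING))  # True: missing data does not rule a case out
```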
All reports of sensitivity, especially when grouped by procedure as in Exhibit 16, must be interpreted with extreme caution because the confidence intervals are wide. For example, at VVMC the algorithm found 0 of 3 SSIs; an underlying sensitivity as high as 60 percent would still produce this result more than 5 percent of the time.
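That probability is a direct binomial calculation, sketched here for illustration:

```python
from scipy.stats import binom

# Chance of detecting 0 of 3 SSIs if the true sensitivity were 60 percent:
# (1 - 0.6) ** 3 = 0.064, i.e., greater than 5 percent.
print(binom.pmf(0, 3, 0.6))
```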
All positives identified by the algorithm, as well as positives identified by routine surveillance, were reviewed (Exhibit 17). At DH, the study reviewer agreed with all of the cases identified as positive by routine surveillance. Four surgeries were noted to have incorrect ICD-9 codes, indicating that they should not have been included. The study reviewer also identified one superficial SSI and one deep SSI caught by the algorithm but not found in routine surveillance records. At VA SLC HCS, four additional deep and organ-space SSIs were identified beyond those identified by routine surveillance. At VVMC, all algorithm-identified cases were false positives. At IH, the study reviewer agreed with all positive cases identified by routine surveillance and with none identified by the algorithm alone, except for two cases in which there appeared to have been errors with identifiers.
False negatives were reviewed at each center to determine the reasons for failure and to identify areas for future algorithm improvement. At Denver Health, two of the false negatives represented problems with the data pull: one SSI was assigned to the wrong hip replacement in the historical dataset (the hip replacement with the infection was not in the dataset), and another procedure identified as having an SSI was actually a hysterectomy. Three surgeries were missed because the SSI occurred more than 30 days postoperatively. One SSI was missed because laboratory results were available only from the outpatient setting. One SSI could only have been picked up from emergency department notes. Only two SSIs could have been picked up from electronic data but were missed because of the algorithm's threshold criteria.
At VA SLC HCS, only two SSI were missed. Both occurred in total hip arthroplasties with onset of infection greater than 30 days postoperatively. This is interesting because although VASQIP protocol is to extend surveillance through 30 days, it appears that SSIs occurring outside of this time period are being recorded. Fortunately, algorithm sensitivity can be easily increased in this setting by increasing the observation period.
At VVMC, the algorithm was unable to detect three SSIs: one was treated solely in the outpatient setting, another was treated at an outside facility, and the third developed 11 months after surgery and therefore fell outside the surveillance period.
At IH, 11/16 false negatives were attributed to the algorithm missing important information. Most of this information was in the clinical notes (10/12), with the remainder in microbiology results; all of it appeared after discharge from the initial surgery. In 2 cases the reviewer felt the findings were ambiguous, and in another 2 the reviewer actually disagreed that an SSI was present. In 1 case, the reviewer felt that the infection was actually a superficial SSI (sSSI) rather than a deep (dSSI) or organ-space SSI (oSSI).
The single most remediable element for the algorithm is the duration of surveillance after surgery. Otherwise, it appears that further improvements may be difficult without information extraction from free-text notes. Difficulties that arise from fragmented care are not easily addressed, but do contribute to missed SSIs.
Subtask 2.4. Assess Ability of Electronic Detection Methods to Determine Procedure-specific, Organism-specific SSI Rates, Estimates of Annual National Burden, and Identification of SSI in Health Care Facilities
2.4.1. Identifying SSI in health care facilities
We tabulated identified organisms by procedure at each hospital. Organisms associated with deep or organ-space SSI were counted from positive microbiology cultures obtained from postoperative day 4 through day 30. If more than one organism was present, all were counted; no attempt was made to establish any organism as the etiology of the SSI. Organisms are grouped in Exhibit 18. Trend analysis was not feasible because of the short time period represented in the data and the rarity of events, especially when divided by procedure.
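A minimal sketch of this tabulation, with hypothetical file and column names (the report does not specify the source schema), might look like:

```python
import pandas as pd

# Hypothetical input: one row per organism recovered from a postoperative culture,
# already linked to the index procedure. Column names are illustrative only.
cultures = pd.read_csv("postop_cultures.csv", parse_dates=["surgery_date", "culture_date"])

days_postop = (cultures["culture_date"] - cultures["surgery_date"]).dt.days
in_window = cultures[
    (days_postop >= 4)
    & (days_postop <= 30)
    & (cultures["ssi_depth"].isin(["deep", "organ-space"]))
]

# Count every organism present; no adjudication of which organism caused the SSI.
organism_counts = in_window.groupby(["procedure", "organism_group"]).size()
print(organism_counts)
```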
2.4.2. Estimate the national procedure-specific, organism-specific rates of SSI
We used methods similar to those outlined by the National Nosocomial Infections Surveillance (NNIS) System as a model for estimating the annual national burden of SSI for the four specified procedures.53 The SSI incidence proportions are listed in Exhibit 20. We did not differentiate between ICU and non-ICU infections, as was done in the NNIS study, because of the difficulty of assigning hospitalization days that were at risk for SSI versus days that were a result of SSI. Inpatient sample data from AHRQ's Healthcare Cost and Utilization Project (HCUP) were used to estimate the number of procedures performed nationwide in 2008 (Exhibit 19). Multiplying the number of procedures by the pooled incidence proportion gave an estimate of 16,530 SSIs nationwide per year for these four procedures: 4,478 for CABG, 1,233 for herniorrhaphies, 4,929 for total knee arthroplasties, and 5,890 for total hip arthroplasties. These numbers are not directly comparable to NNIS results because we examined only a fraction of the surgeries. Additionally, the four hospitals were selected for convenience and do not represent the range of variability in hospitals across the country (by size, case mix, etc.). Exhibit 21 contrasts the number of total knee arthroplasties performed at each of the principal hospitals with the average number of procedures performed at various hospital types using HCUP data; deficiencies in representativeness are evident. Combining these into a single estimate of SSI rate is misleading at best and could be quite biased. Accurate estimates could be made by defining segments of the population thought to have different SSI rates (a challenge in and of itself) and sampling hospitals from each. At the very least, applying this method to determine the national SSI burden would require a larger number of hospitals chosen to represent the variability in size and case mix. Reporting confidence intervals would be inappropriate given these limitations.
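The burden arithmetic itself is a simple product of volume and incidence; the sketch below uses placeholder values (not the study's HCUP volumes or incidence proportions) purely to show the calculation.

```python
# Placeholder inputs for illustration; the real values come from HCUP (2008
# national procedure volumes) and the pooled incidence proportions in Exhibit 20.
national_volume = {"CABG": 200_000, "herniorrhaphy": 150_000}   # hypothetical
pooled_incidence = {"CABG": 0.020, "herniorrhaphy": 0.008}      # hypothetical

estimated_burden = {
    procedure: round(national_volume[procedure] * pooled_incidence[procedure])
    for procedure in national_volume
}
print(estimated_burden)  # national SSI count = procedures x incidence proportion
```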
There are severe limitations to this approach, and interpretation requires caution. Unfortunately, the nationwide SSI rates are unknown; the best estimate is one from 2002 by NNIS.53 Our smaller group of hospitals is informative only in a very limited way. Vail was noted to have the lowest-risk patients and a very low SSI rate. If all of the hospitals had the same SSI rate, a low estimate of the infections associated with 254 herniorrhaphies, 2,620 THAs, and 1,337 TKAs could be made, but only with a large amount of uncertainty because of the very small numbers. Estimates using each hospital's rates and the pooled rates are illustrated in Exhibit 19 and Exhibit 20 below. Again, more accurate estimates would require sophisticated patient case mix adjustment and a much larger sample of hospitals.
The results above are methodologically problematic, but they are the best estimates possible given the data and resource limitations of the task. Sound estimates would require a dataset large enough to estimate the rate at each hospital with confidence. Ideally (for statistics, though not for patients), there would be enough events that the rate could be estimated with confidence at a granularity of at least one year. Facilities too small to accumulate enough events to estimate their rates within an aggregated 2- to 3-year period are difficult to use when secular trends are important. The sample of hospitals needs to be large enough to provide power for estimating rates and to be representative of the target population. Some adjustments can be made if patient case mix is considered; however, the number of SSIs at each hospital is small, and we had only four hospitals from the Intermountain West to analyze. Acquiring the data for, and developing, valid models that adjust for patient case mix is a large endeavor outside the scope of this task. It is perhaps more appropriate to compare the individual SSI rates that we found with those reported in the literature.
Total hip arthroplasty. In much of the developed world, the incidence of SSI after total hip arthroplasty appears low. Reports from the United States range from 0.9 to 2.52 percent.54, 55 In Europe, rates have been reported as low as 1.3 percent in Germany24 and as high as 3.4 percent in the Netherlands.56 In Brazil, one small study reported a rate of 17 percent for this procedure.57 Unfortunately, with this and other studies, there may be some important differences in reporting systems that complicate the direct comparisons of rates from one institution to the next. Also, the point estimates from small single-hospital studies may be spurious and not reflect the true underlying rate because of chance. However, there does appear to be real variability between institutions and countries. The rates we reported from each of our hospitals range from below the low end reported in the United States to that reported in the Netherlands.
Total knee arthroplasty. Trends for TKA were similar to those for THA. Reported rates of SSI after TKA range from 0.5 percent in France58 to 0.99–2.3 percent in the Netherlands.56 Using Medicare data, the rate of prosthetic joint infection was estimated at 1.55 percent within 2 years of surgery.59 As with THA, the rate for TKA was higher in the Brazilian study cited above.57 As with THA, we observed rates ranging from below the lowest to above the highest rates reported in these countries. This is not surprising, given the small numbers in this study compared with the studies cited.
Coronary artery bypass grafting. Reported rates of deep SSI after cardiac surgery range anywhere from 0.5 percent60 to 2.7 percent. A study in New York found a rate of 1.8 percent after isolated CABG and 2.4 percent after CABG combined with other cardiac procedures, such as valve replacement.61 Only two centers in our study performed CABG. Our values of 1.9 and 4.2 percent are comparable to, although perhaps a bit higher than, those reported in the literature. One aspect to point out is that the Intermountain Healthcare rates are calculated from data already derived from an algorithm-driven, human-adjudicated process, which may decrease review fatigue and improve sensitivity. As with the other surgeries, direct comparisons can be difficult because followup time can vary.
Herniorrhaphy. A 2007 Cochrane review of antimicrobial prophylaxis for herniorrhaphy or hernioplasty identified four studies that used CDC criteria; the other studies used various criteria that were not completely in line with CDC guidelines. Among the four studies, the incidence proportion of SSI among patients receiving prophylaxis was 0–1.72 percent for herniorrhaphy and 1.9–8.16 percent for hernioplasty. The reported proportions from our four hospitals are closest to those reported in the literature for patients receiving preoperative antibiotics.
Subtask 2.5. Estimate Burdens of Electronic SSI Detection Relative to an Accepted Reference Standard
Our ability to conduct a detailed cost estimate was limited, in that costs associated with information technology implementation are largely system specific and depend on available resources and the expertise of in-house personnel. As a result, we present a general categorization of the expected costs and resources required for adaptation and implementation of an electronic SSI detection tool, and we provide an estimate of the cost savings from successful implementation for an example case.
As previously mentioned, we observed variations in SSI surveillance practices across the four partner health delivery systems. The practice of conducting manual review for surveillance with no automated support was used as the standard base case. Costs and resource use for implementation of the electronic surveillance algorithm were categorized into two primary cost buckets: (1) costs to set up and (2) costs to run. We also note the expectation of future costs and resource burden associated with regular maintenance and updating of the algorithm to maintain the value and accuracy of the tool.
The SSI-identifying algorithm is a program written in structured query language (SQL) that uses Boolean logic (in “case” statements) to identify potential surgical site infections. Setting up the query is a one-time, up-front investment that results in a generalized program that can be reused repeatedly. The time required to set up the algorithm varies, based on the electronic availability of the required data fields within a given delivery system and the expertise of the personnel available to implement the algorithm. If all of the variables are easily accessible and experienced personnel are available, the algorithm can be programmed in 1–2 hours. The most difficult aspect of developing the query is the complexity inherent in joining multiple data sources, such as laboratory, surgery, and patient demographic data.
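To illustrate the kind of Boolean flagging the SQL case statements implement, here is a minimal Python sketch; the field names and thresholds are hypothetical placeholders, not the study's actual rules.

```python
def flag_case(record):
    """Flag a procedure for chart review if any crude SSI signal is present.
    Field names and thresholds are illustrative placeholders only."""
    positive_culture = record.get("postop_culture_positive", False)
    antibiotic_days = record.get("postop_antibiotic_days", 0)
    max_wbc = record.get("max_postop_wbc")

    return (
        positive_culture
        or antibiotic_days >= 2                      # hypothetical threshold
        or (max_wbc is not None and max_wbc > 12.0)  # hypothetical threshold
    )

print(flag_case({"postop_culture_positive": False, "postop_antibiotic_days": 5}))  # True
```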
DH serves as a case example for adapting and implementing an electronic surveillance algorithm in a health care delivery system in which manual review is used as standard surveillance practice.
DH identified 2,179 procedures for inclusion in its system-specific cohort. Sixty procedures were associated with SSI after manual chart review using NHSN methodology. The modified algorithm flagged 804 procedures (or 37 percent of total charts) for review. The percent of total procedures flagged for review varied by procedure type, and ranged from 15 percent (herniorrhaphy) to 62 percent (craniotomy). The modified algorithm achieved 100 percent sensitivity and 72 percent specificity in detecting SSIs, validated on 4 years of our manual SSI surveillance data using NHSN methodology.
Based on these results, 1,375 unnecessary chart reviews would have been avoided over the 4-year period without sacrificing detection of a single SSI. Assuming 20 minutes per chart for manual review, 57 work days (8 hours/day) of chart review would have been eliminated using the algorithm for surveillance of SSI in hip and knee arthroplasty, abdominal and vaginal hysterectomy, spinal fusion, craniotomy, and herniorrhaphy.
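The workload figure follows directly from the stated assumptions, as the arithmetic below shows.

```python
avoided_reviews = 1375          # charts not flagged over the 4-year period
minutes_per_chart = 20          # assumed manual review time
hours_per_work_day = 8

hours_saved = avoided_reviews * minutes_per_chart / 60
work_days_saved = hours_saved / hours_per_work_day
print(round(work_days_saved))   # ~57 eight-hour work days of chart review avoided
```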
Measurements of algorithm diagnostic accuracy must be carefully considered. Routine, manual, prospective surveillance is estimated to have a sensitivity ranging from 30 percent to in excess of 90 percent, with most estimates in the 70–80 percent range.21, 35, 50, 62, 63 Additionally, the reliability of manual surveillance for health care-associated infections and surgical site infections has been reported to be less than ideal.21, 64-67 For many of these studies, either routine surveillance or some augmentation of routine surveillance is used as the reference standard, and any comparisons to such standards must take this into account. Other electronic algorithms are frequently reported to have sensitivities in excess of 80 percent.38, 68 Only some of them have been applied to multiple hospitals,16, 21, 35 and none report individual hospital validation results among hospitals as heterogeneous as the principal hospitals of our study. Although our recursive partitioning algorithm had high sensitivity on the VASQIP training set, its sensitivity was 73.1 percent on the VASQIP test set, and the pooled sensitivity at the four principal hospitals was 41.5 percent. These results contrast with the high performance seen in other published literature. Specificities and predictive values were relatively stable between the training and testing sets.
The differences in sensitivities that we see in the recursive partitioning algorithm suggest two levels of overfitting: overfitting to the training dataset and overfitting to the VA system. The two “common sense” algorithms demonstrated high sensitivity in both the VASQIP training and testing sets; because they were not derived from the training set, they are not overfit to it. We expected the sensitivity of these algorithms to be high because of the success of previously devised algorithms and because we surmised that it was unlikely that patients with deep or organ-space SSI would neither be treated with antimicrobials nor tested for etiologic microorganisms. However, when these algorithms were tested against other hospitals, the sensitivity and positive predictive values varied. At VA SLC HCS, no improvement in sensitivity was observed over the recursive partitioning algorithm, perhaps because of small numbers. At IH, a much larger number of false positives was found, resulting in a much lower positive predictive value; this appears largely due to a different pattern of antimicrobial use during the postoperative period in that system. At DH, the simple algorithm fared poorly, while the inclusive algorithm fared better. This illustrates that even “common sense” algorithms that include elements successful at other institutions16, 21, 35 did not generalize well because of institutional differences.
Other factors may also have contributed to the less-than-desirable performance of the algorithm derived from recursive partitioning, including poor reliability in the reference standard, too little information, and limitations of the recursive partitioning method itself. The relatively small number of SSIs available for algorithm training was also a limiting factor.
The amount and quality of electronic information, as well as algorithm development issues, are also important. The available data elements may not have been informative enough. In 1992, Harvard Pilgrim Health used the same recursive partitioning method on a set of 4,086 procedures containing 96 postdischarge SSIs to develop algorithms to predict SSI.16 The approach allows for some interactions of variables in rule generation. That group allowed variables such as diagnosis codes, age, sex, length of surgery, microbiology, antibiotics, readmissions, and emergency department visits. The principal difference between our variables and those of the Harvard study was that we omitted diagnosis codes, age, and sex, and included laboratory values; we made these omissions for reasons stated previously. They developed multiple models: one with a sensitivity of 74 percent and a specificity of 98 percent, one with a sensitivity of 92 percent and a specificity of 92 percent, and another with a sensitivity of 77 percent and a specificity of 94 percent. Each of these models was created by adding different sets of information. The authors noted that no model surpassed a sensitivity of 80 percent without both outpatient diagnosis codes and microbiology results. It is still unclear how well those algorithms would perform on an external set. The discriminability indices (discussed earlier) of the models above were 2.7, 2.8, and 2.3, respectively. In comparison, our algorithm had a sensitivity of around 73 percent and a specificity of around 92 percent, and the “simple” algorithm, using only data on postoperative cultures or postoperative antibiotics, had a sensitivity of 95 percent and a specificity of 81 percent on the same set; their discriminability indices were 2.0 and 2.2, respectively. When discriminability indices are similar, very different sensitivities imply that the decision criterion differs drastically. The effect of the criterion is illustrated by examining the inverse of the positive predictive value, that is, the number of charts that must be reviewed to find one SSI. IP would likely choose the more sensitive algorithm, although they would need to review an average of 53 charts for every SSI instead of 23. However, either option would likely be better than the approximately 250 charts that would need to be reviewed to find a deep or organ-space SSI. Problems with identifying SSI may have arisen in part because: (1) the available information was insufficient; (2) recursive partitioning may not be robust to missing data and/or may not fully consider interactions between variables; and (3) the penalty for false negatives was not high enough.
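The discriminability index is defined elsewhere in the report; assuming it is the standard equal-variance signal-detection index, d' = z(sensitivity) + z(specificity), the figures quoted above can be approximately reproduced, and the charts-per-SSI figure is simply the inverse of the PPV. A sketch under those assumptions:

```python
from scipy.stats import norm

def d_prime(sensitivity, specificity):
    """Equal-variance signal-detection discriminability (assumed definition)."""
    return norm.ppf(sensitivity) + norm.ppf(specificity)

print(round(d_prime(0.74, 0.98), 1))  # ~2.7, matching the first Harvard model
print(round(d_prime(0.73, 0.92), 1))  # ~2.0, matching our recursive partitioning algorithm

# Charts reviewed per SSI found is 1 / PPV; e.g., a PPV near 1.9 percent
# implies roughly 53 chart reviews per SSI detected.
print(round(1 / 0.019))
```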
The algorithm may have problems with generalizability; however, the measures of sensitivity at the principal hospitals are difficult to interpret because the confidence intervals are very wide. Improving the algorithm's sensitivity while keeping the number of charts needed for review low can only be accomplished by improving the algorithm's discriminability. This could be done by using procedures more robust to sparse data for algorithm development, incorporating dynamic thresholds for laboratory values and vital signs, and enriching the input data by using natural language processing to extract information from free-text notes. Recently, natural language processing was used to find postoperative complications (but not SSI) using VA data.69
Another potential approach to improve acceptability is to allow IP to change the criterion by choosing from a menu of algorithms. Instead of using algorithms to rule out cases by forcing a dichotomy, the algorithms would predict the likelihood of SSI to facilitate subsequent triage and chart review. Algorithms could also facilitate review by annotating important aspects of the case.
In such a system, IP would need to accept that greater sensitivity comes at the price of reviewing more charts. When discriminability is held constant and the criterion is lowered, sensitivity improves and specificity worsens; when the criterion is raised, the opposite happens. Just as we demonstrated with the “inclusive” algorithm, we could investigate other algorithms with known characteristics. If IP are unsatisfied with the sensitivity of an algorithm, the next most sensitive algorithm can be selected; if IP feel that they are reviewing too many cases, a more specific algorithm can be selected. The drawback to this approach is that it compromises the reliability that automated systems could otherwise contribute.
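The tradeoff can be made concrete under the same equal-variance signal-detection assumption used above: holding d' fixed and sliding the criterion moves the operating point along a fixed tradeoff curve. A minimal sketch (values are illustrative):

```python
from scipy.stats import norm

def operating_point(d_prime, criterion):
    """Sensitivity and specificity at a given criterion for fixed discriminability
    (equal-variance Gaussian assumption; illustrative only)."""
    specificity = norm.cdf(criterion)
    sensitivity = norm.cdf(d_prime - criterion)
    return sensitivity, specificity

for c in (0.5, 1.0, 1.5):
    sens, spec = operating_point(2.0, c)
    print(f"criterion={c}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# Lowering the criterion raises sensitivity and lowers specificity, and vice versa.
```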
A modification of this last approach could use strategies similar to IBM's DeepQA technology, which incorporates information from multiple, simultaneously applied algorithms for prediction; this technology was used in the recently televised Watson experiment on Jeopardy.70 Model-combining71 or multiple-model-estimation72 approaches could potentially provide better performance, but they would increase the complexity of the rules and make the automated system a black box. This may be less acceptable to IP, but it could still be used in a decision-support-style system. For example, instead of a two-tiered system, a multitiered system could be employed. As before, IP would constitute the final tier, but the automated system could have multiple tiers: the initial tier would use a high-sensitivity algorithm to completely eliminate a modest number of charts, perhaps up to half, and the next tier would flag possible SSI but not remove charts (one or multiple algorithms could be used in this tier). Finally, IP would still need to review all charts that passed the first tier, but could choose to spend more or less time on each chart, depending on the flags.
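A minimal sketch of that multitiered triage, with hypothetical rule names and a deliberately simplified structure, might look like:

```python
def triage(chart, tier1_rule_out, tier2_flag_rules):
    """Tier 1 removes charts outright; tier 2 only annotates; IP review is the final tier.
    Rules are passed in as callables; the ones used here are hypothetical."""
    if not tier1_rule_out(chart):                    # high-sensitivity rule-out tier
        return {"needs_review": False, "flags": []}
    flags = [name for name, rule in tier2_flag_rules.items() if rule(chart)]
    return {"needs_review": True, "flags": flags}    # reviewed by IP, guided by flags

# Example with toy rules:
result = triage(
    {"postop_antibiotic_days": 6, "positive_culture": False},
    tier1_rule_out=lambda c: c["postop_antibiotic_days"] > 0 or c["positive_culture"],
    tier2_flag_rules={"prolonged antibiotics": lambda c: c["postop_antibiotic_days"] >= 5},
)
print(result)  # {'needs_review': True, 'flags': ['prolonged antibiotics']}
```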
The most appropriate use of automated systems, whether alone or in combination with manual surveillance, will require careful consideration of the purpose and requirements of surveillance for the events in question. The performance of automated systems may vary, particularly when attempting to detect events that occur in the outpatient setting, where differences in data availability may be pronounced. More work is necessary to improve the discriminability index of electronic algorithms, but allowing IP to select rules that suit their own needs may be a reasonable measure in the interim.