Bad Data Is the Biggest Challenge in Healthcare
Poor data quality is one of the biggest obstacles to insightful analytics, whether the goal is to forecast infusion volumes and mix or to optimize the scheduling of infusion treatments so that chair utilization is level across the day. Usually, the check-in data is fairly accurate because it accompanies patient registration; the problem is the missing or erroneous check-out data. There are several underlying reasons for the poor quality of the data, including:
- Nurses are so busy caring for patients that they simply do not have time to enter the check-out data accurately. As a result, the check-out time is either missing or grossly incorrect (e.g., it was entered in a batch at the end of the day from memory)
- Lack of automation to accurately and consistently capture timestamp data into the EHR
We have found a few successful methods for overcoming this issue. They are as follows:
1. Create “synthetic” check-out data: Here, we first set aside all of the records that have missing or clearly incorrect check-out data (e.g., a check-out on a different day from the check-in). We then take the records whose check-in and check-out times look reasonable, cluster the appointments by treatment type, and estimate the duration for each type of infusion treatment. For each record that was set aside, we insert a “synthetic” check-out time by generating a random duration within +/- 1 standard deviation of the mean treatment time for appointments of the same type. Finally, we do a spot validation by counting the number of open chairs at multiple points in time throughout the day for a few days. The combination of actual and synthetic data ought to closely match the observed chair availability. The data set is now more robust and can be used to build models, estimate utilization profiles, and so on. In the meantime, it serves as a catalyst for communicating the importance of accurate check-out data to the entire nursing staff, which reduces the need for synthetic data over time. In one case, we were able to lift the accuracy of check-out data from ~30% to ~95% within about six weeks, and because the synthetic data was already available, we did not have to wait those six weeks before building sophisticated analytic models.
2. Use “fuzzy logic” based on treatment descriptions: Here, we use the text strings that describe the treatment. Because the treatment description is not always written consistently, simple string lookup methods are inadequate. Instead, we use fuzzy matching to create “equivalent sets” of appointments whose treatment descriptions are almost the same (or have sufficient overlap in the characters contained in the string). Once the equivalent sets have been created, the treatment times within each set can be computed, and a chosen percentile (the 80th, for example) can be used as a proxy for the expected duration of the treatment. This provides sufficient accuracy to proceed with sophisticated analytics even while the underlying data quality issue is being discussed and resolved.
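As a rough illustration of the first method, the steps could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in practice: the record layout (dicts with `treatment`, `check_in`, `check_out`), the function names, the uniform draw within one standard deviation, and the rule that a type needs at least two clean records are all hypothetical choices made for the example.

```python
import random
import statistics
from datetime import datetime, timedelta


def is_plausible(record):
    """A check-out is treated as reasonable if it exists, falls on the same
    day as the check-in, and comes after it (assumed rule for this sketch)."""
    out = record["check_out"]
    return (
        out is not None
        and out.date() == record["check_in"].date()
        and out > record["check_in"]
    )


def fill_synthetic_checkouts(records, seed=0):
    """Return copies of the records with synthetic check-out times inserted
    where the original check-out is missing or implausible."""
    rng = random.Random(seed)

    # Step 1: durations in minutes from the clean records, by treatment type.
    durations = {}
    for r in records:
        if is_plausible(r):
            minutes = (r["check_out"] - r["check_in"]).total_seconds() / 60
            durations.setdefault(r["treatment"], []).append(minutes)

    # Step 2: draw a duration within +/- 1 standard deviation of the mean.
    filled = []
    for r in records:
        row = dict(r, synthetic=False)
        sample = durations.get(r["treatment"], [])
        # Types with fewer than two clean records are left untouched.
        if not is_plausible(r) and len(sample) >= 2:
            mean = statistics.mean(sample)
            sd = statistics.stdev(sample)
            low = max(mean - sd, 0.0)  # never draw a negative duration
            minutes = rng.uniform(low, mean + sd)
            row["check_out"] = r["check_in"] + timedelta(minutes=minutes)
            row["synthetic"] = True
        filled.append(row)
    return filled


def chairs_in_use(records, at):
    """Spot check: number of appointments occupying a chair at time `at`."""
    return sum(
        1
        for r in records
        if r["check_out"] is not None and r["check_in"] <= at < r["check_out"]
    )
```

Comparing `chairs_in_use` on the filled data against chair counts observed on the floor at a few points in the day gives the spot validation described above. The `synthetic` flag keeps generated values distinguishable from recorded ones, so they can be phased out as the real check-out data improves.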
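The second method can be sketched in a similar way. The example below stands in for the fuzzy matching with the character-overlap similarity ratio from Python's standard `difflib` module; the source does not say which matching technique is used in practice. The 0.8 similarity threshold, the greedy "join the first similar set" rule, the function names, and the sample treatment descriptions are all assumptions made for illustration.

```python
import statistics
from difflib import SequenceMatcher


def normalize(description):
    """Lowercase the text and collapse punctuation and extra whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in description.lower())
    return " ".join(cleaned.split())


def expected_durations(appointments, threshold=0.8, percentile=80):
    """Group (description, minutes) pairs into equivalent sets and return the
    chosen percentile of the durations for each set.

    Greedy simplification: each description joins the first existing set
    whose representative string is at least `threshold` similar to it.
    """
    groups = []  # each entry: [representative string, list of durations]
    for description, minutes in appointments:
        key = normalize(description)
        for group in groups:
            if SequenceMatcher(None, key, group[0]).ratio() >= threshold:
                group[1].append(minutes)
                break
        else:
            groups.append([key, [minutes]])

    result = {}
    for representative, durations in groups:
        if len(durations) >= 2:
            # quantiles(n=100) returns 99 cut points; index p-1 is the p-th.
            cuts = statistics.quantiles(durations, n=100, method="inclusive")
            result[representative] = cuts[percentile - 1]
        else:
            result[representative] = durations[0]
    return result


if __name__ == "__main__":
    sample = [
        ("Rituximab infusion", 240),
        ("rituximab  Infusion", 260),
        ("Rituxmab infusion", 250),   # typo still lands in the same set
        ("Rituximab-infusion", 245),
        ("Iron sucrose IV", 60),
        ("iron sucrose iv.", 70),
    ]
    for name, minutes in expected_durations(sample).items():
        print(f"{name}: {minutes:.0f} min")
```

A higher threshold produces more, tighter sets; a lower one merges more aggressively and risks lumping genuinely different treatments together, so the resulting sets are worth reviewing by someone who knows the treatment catalog.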