Enhancing Healthcare Data Integrity: A Comprehensive Review of Quality Dimensions, Assessment Techniques, and Evaluation Tools
Achieving reliable and high-quality data in healthcare is fundamental for informed clinical decisions, impactful research, and effective health system management. As digital health records and large-scale data repositories become more prevalent, understanding how to systematically evaluate and improve data quality has gained paramount importance. This review explores the multifaceted nature of healthcare data quality, examining the key dimensions used for assessment, the methodologies employed, and the tools developed to support robust data evaluation. It aims to provide a structured understanding that guides researchers and practitioners in selecting appropriate frameworks, optimizing assessment processes, and implementing effective tools to ensure data integrity across healthcare settings.
Background
Data quality in healthcare is a complex, nuanced concept that varies depending on context and purpose. Definitions of quality differ among experts, but common threads emphasize accuracy, completeness, consistency, and relevance. For example, Juran describes quality as “fitness for use,” highlighting the importance of data being suitable for its intended purpose. Crosby focuses on “conformance to requirements,” underscoring adherence to standards, while ISO 9001 characterizes quality as the degree to which a set of characteristics fulfills specified requirements [6].
The concept of data quality originated in manufacturing during the 1950s and later expanded into healthcare and other service sectors. Despite its widespread recognition, a universally accepted definition remains elusive, partly because data quality encompasses multiple dimensions and subjective interpretations [8]. Wand and Wang (1996) emphasize the importance of an information system’s ability to accurately represent real-world states, whereas Wang and Strong (1996) propose a framework that categorizes data quality into four dimensions: intrinsic, contextual, representational, and accessibility [9,10]. These dimensions collectively ensure that healthcare data is accurate, relevant, clearly represented, and readily accessible—factors crucial for clinical decision-making and research.
The World Health Organization (WHO) broadens the perspective by linking data quality to a system’s capacity to meet objectives through lawful means, reflecting alignment with established standards and compliance requirements [7]. Similarly, the National Academy of Medicine (NAM) defines high-quality data as “robust enough to support conclusions and interpretations comparable to those derived from error-free data” [12].
The importance of data quality in healthcare continues to grow, driven by the proliferation of electronic health records (EHRs), registries, and big data initiatives. High-quality data directly influences clinical outcomes, patient safety, and the validity of research findings [2,13]. Inaccurate or incomplete data can lead to misdiagnosis, inappropriate treatments, and flawed research conclusions, emphasizing the need for rigorous assessment and continuous improvement of data quality. The secondary use of healthcare data, such as in observational studies or quality improvement projects, relies heavily on robust data assessment frameworks to ensure reliability [14].
Assessment approaches generally fall into two categories: global measures that evaluate a dataset’s overall quality independent of any single application, and fitness-for-use measures that target specific data applications or research objectives [12]. Systematic evaluation of data quality involves identifying relevant dimensions, selecting appropriate assessment methods, and utilizing tools designed for comprehensive data evaluation.
The evolution of electronic health records, for example, has introduced new challenges and opportunities in data quality assessment. Empirical studies highlight that improvements in data completeness are often accompanied by ongoing challenges related to consistency, plausibility, and timeliness [71,72,73]. Ongoing monitoring and targeted interventions are essential to maintain high standards of data quality, especially given the heterogeneity of data sources and documentation practices.
The lack of standardized definitions for data quality dimensions leads to variability across studies, with overlapping concepts such as accuracy and correctness, or currency and timeliness. This inconsistency hampers the comparability of assessments and the development of unified frameworks. While advanced methodologies, including machine learning and natural language processing, are emerging to automate and enhance data quality evaluation, they remain underused despite their promise [30,32,34].
Tools for data quality assessment vary widely, from web-based applications to programming libraries embedded within statistical software environments like R and Python. Frameworks such as those proposed by Kahn et al. have heavily influenced tool development, providing structured approaches to evaluate multiple dimensions systematically [39]. However, the diversity of tools and frameworks underscores the need for standardization and harmonization to enable broader applicability and comparability.
This review aims to answer critical questions:
– What are the key data quality dimensions utilized in healthcare?
– Which assessment methodologies are most effective?
– What tools and software support systematic data evaluation?
Addressing these questions can guide the development of comprehensive, standardized frameworks to ensure healthcare data integrity, facilitating better clinical outcomes and research advancements.
Materials and Methods
This systematic review adhered to the PRISMA guidelines to ensure transparency and reproducibility [25]. The primary goal was to identify studies that introduced specific methods or tools for assessing healthcare data quality, focusing on well-defined data quality dimensions.
Search Strategy
The literature search targeted three major databases: PubMed, Web of Science, and Scopus. Search terms encompassed two main categories: healthcare data concepts and data quality assessment techniques. The keywords within each category were combined with OR operators, and the two categories were linked with AND operators. The search strategies were tailored for each database’s syntax, with detailed strategies provided in Supplementary File 1.
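To make this structure concrete, the following minimal Python sketch assembles such a boolean query; the keyword lists are hypothetical placeholders, not the actual search terms (those appear in Supplementary File 1).

```python
# Hypothetical sketch of assembling the boolean search string: terms within a
# category are joined with OR, and the two categories are linked with AND.
# The keyword lists below are placeholders, not the review's actual terms.

healthcare_terms = ["electronic health record*", "EHR", "health registry"]
quality_terms = ["data quality", "data completeness", "data accuracy"]

def or_block(terms):
    # Quote each term and join the group with OR, wrapped in parentheses.
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = " AND ".join(or_block(group) for group in (healthcare_terms, quality_terms))
print(query)
# ("electronic health record*" OR "EHR" OR ...) AND ("data quality" OR ...)
```

In practice, this generic template would be adapted to each database’s syntax (e.g., field tags in PubMed), as noted above.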
Inclusion and Exclusion Criteria
Studies were included if they:
– Evaluated one or more data quality dimensions.
– Introduced a method or tool for data quality assessment.
– Used structured data from medical or healthcare sources.
– Were published in English.
Studies were excluded if they:
– Focused on non-tabular or unstructured datasets.
– Discussed data collection or management frameworks without empirical assessment.
– Evaluated data quality only across multiple databases without specific methodologies.
– Were literature reviews or lacked original data assessment.
– Were not published in English or lacked sufficient methodological detail.
Data Extraction and Screening
Retrieved articles were deduplicated in EndNote 21 and screened independently by two reviewers (E.H. and M.A.) through title and abstract review, followed by full-text evaluation. Disagreements were resolved with a third reviewer (H.T.). Forty-four studies were selected for detailed review. Relevant data, including identified dimensions, assessment methods, and tools, were systematically extracted into Excel.
Evidence Mapping and Quality Assessment
To visualize relationships among dimensions, methods, and tools, two heatmaps were generated using R. The quality of the included studies was assessed via a scoring system based on predefined quality assessment criteria (QAC), with scores ranging from 0 to 1. The overall quality scores informed the reliability of the findings [26].
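For readers wishing to reproduce this kind of visualization, a minimal Python analogue is sketched below (the review itself used R); the matrix values are illustrative placeholders, not the extracted counts.

```python
# Minimal sketch of a dimension-by-tool heatmap in Python (the review used R).
# Counts below are illustrative placeholders, not values extracted from studies.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

counts = pd.DataFrame(
    [[12, 7, 4],
     [6, 4, 2],
     [3, 2, 1]],
    index=["completeness", "plausibility", "conformance"],  # quality dimensions
    columns=["R package", "Python script", "web tool"],     # tool categories
)

ax = sns.heatmap(counts, annot=True, fmt="d", cmap="Blues")
ax.set(xlabel="Tool category", ylabel="Data quality dimension")
plt.tight_layout()
plt.show()
```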
Results
Search and Selection Process
Initial retrieval totaled 614 studies, which, after screening and eligibility assessment, resulted in 44 included articles (Fig. 1). Publication dates were concentrated in recent years, highlighting increasing interest in healthcare data quality assessment (Fig. 2).
Study Characteristics
The studies addressed three main objectives:
1. Developing tools, guidelines, or frameworks for data quality evaluation.
2. Assessing data quality for secondary use in clinical research.
3. Validating data during transfer processes such as extract-transform-load (ETL).
The United States led contributions, with Germany being notably active among European countries. Data sources varied, including EHRs, registries, and administrative databases (Fig. 3).
Data Quality Dimensions
Across the studies, between one and six dimensions were considered, with “completeness” being most prevalent (93%), followed by “plausibility” (49%) and “conformance” (26%) (Fig. 4). Definitions often overlapped or varied, reflecting the lack of standardized terminology. Other frequently examined dimensions included accuracy, correctness, and consistency, with currency, timeliness, and validity also addressed.
Assessment Methods
Methods ranged from simple ratio calculations to advanced AI-based techniques. The most common approaches included rule-based systems, statistical analyses, enhanced definitions, and comparisons against external standards (Table 1). For example (a minimal sketch follows this list):
– Completeness often measured via field completion ratios.
– Plausibility assessed through logical coherence checks.
– Accuracy evaluated by comparison with gold standards or repeated measurements.
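The following Python sketch illustrates these three measurement patterns on a toy dataset; the column names, plausibility rules, and gold-standard column are illustrative assumptions, not methods taken from any reviewed study.

```python
# Toy illustration of the three metric patterns above; columns and rules are
# hypothetical, not drawn from the reviewed studies.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "birth_date": ["1980-03-01", None, "1975-11-20", "2090-01-01"],
    "systolic_bp": [120, 300, 118, 110],  # mmHg
})

# Completeness: fraction of non-missing values per field.
completeness = records.notna().mean()

# Plausibility: rule-based logical coherence checks.
in_range = records["systolic_bp"].between(50, 250)  # physiologic range
not_future = pd.to_datetime(records["birth_date"], errors="coerce") <= pd.Timestamp.today()
plausibility = (in_range & not_future).mean()

# Accuracy: agreement with a (hypothetical) gold-standard reference.
gold_standard = pd.Series([120, 130, 118, 110])
accuracy = (records["systolic_bp"] == gold_standard).mean()

print(completeness, plausibility, accuracy, sep="\n\n")
```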
While some studies explored machine learning and natural language processing, these were relatively rare (~5%), indicating room for broader adoption [30,32,34].
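As one hedged example of what an ML-based check might look like (not a method reported in the reviewed studies), an unsupervised anomaly detector such as scikit-learn’s IsolationForest can flag records with potentially implausible value combinations for manual review; the simulated data and contamination rate below are assumptions for demonstration.

```python
# Illustrative ML-based plausibility screen using scikit-learn's IsolationForest;
# the simulated data and contamination rate are assumptions for demonstration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)
# Simulated systolic/diastolic blood pressure pairs, plus two implausible rows.
bp = rng.normal(loc=[120, 80], scale=[12, 8], size=(200, 2))
bp = np.vstack([bp, [[300, 20], [40, 200]]])

model = IsolationForest(contamination=0.01, random_state=0).fit(bp)
flags = model.predict(bp)          # -1 marks suspected anomalies
print(np.where(flags == -1)[0])    # row indices to route for manual review
```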
Tools for Data Quality
Over half the studies (55%) introduced specific tools, many based on frameworks like Kahn’s [39]. Implementations utilized R, Python, web-based platforms, and SQL environments (Fig. 6). The tools varied from simple packages to comprehensive dashboards, supporting diverse assessment dimensions. The relationship between frameworks and tools was visualized in a heatmap, illustrating widespread adoption of certain models, particularly in R (Fig. 6).
Table 2 summarizes the relationships among dimensions, methods, and tools, revealing common combinations such as completeness assessed via R-based tools or Python scripts, and plausibility evaluated through rule-based systems.
Study Quality
The overall methodological quality was high, with an average score of 0.67 on the 0-to-1 QAC scale. Most studies (97%) comprehensively addressed data quality dimensions and assessment methods, while fewer provided detailed tools or frameworks (Table 3). The quality assessment underscores the robustness of current research, yet highlights variability in methodological transparency.
Discussion
This review underscores the multifaceted nature of data quality in healthcare, emphasizing that no single dimension or assessment method suffices. The predominance of the completeness dimension aligns with its straightforward quantification and critical role in data usability. However, comprehensive evaluation requires addressing multiple dimensions like plausibility, accuracy, and conformance, which often overlap and lack standardized definitions.
The variability in definitions and terminology reflects the need for harmonized frameworks. For example, the overlap between accuracy and correctness suggests a need for clearer conceptual boundaries. Similarly, the underutilization of AI-based assessment methods points to potential growth areas, especially with the advent of machine learning, NLP, and automated validation techniques [30,32,34].
Tools developed for data quality assessment are diverse, with many based on established frameworks like that of Kahn et al. [39]. The widespread use of R and Python facilitates integration into existing workflows, yet standardization across tools remains a challenge. Developing unified standards and frameworks could improve comparability, reproducibility, and scalability of data quality assessment efforts.
Limitations include the heterogeneity of included studies and the lack of a universally accepted set of data quality dimensions. Future research should focus on establishing consensus definitions, exploring automated and AI-driven approaches, and developing practical, standardized frameworks that can be adopted across diverse healthcare environments. Additionally, understanding the optimal sequencing of assessing different dimensions may streamline processes and reduce resource expenditure.
Conclusions
Ensuring high-quality healthcare data is essential for effective clinical care, research, and health system management. This review highlights the complexity and variability inherent in data quality assessment, emphasizing the importance of standardized definitions, methodologies, and tools. While significant progress has been made, ongoing efforts should aim to develop comprehensive frameworks that harmonize terminology, assessment approaches, and evaluation tools. Implementing such standardized models will enhance the reliability, reproducibility, and utility of healthcare data, ultimately supporting better health outcomes and system efficiencies. Future research should prioritize the integration of AI methods, the establishment of best practices for dimension sequencing, and the development of adaptable, user-friendly tools to facilitate widespread adoption.
Electronic supplementary material
The online supplementary documents provide detailed search strategies, classification schemas, and additional data supporting this review.