Data Quality Assessment and Anomaly Detection Via Map / Reduce and Linked Data: A Case Study in the Medical Domain
Bonner, S.; McGough, S.; Kureshi, I.; Brennan, J.; Theodoropoulos, G.; Moss, L.; Corsar, D.; Antoniou, G.
Authors
S. Bonner
S. McGough
I. Kureshi
J. Brennan
G. Theodoropoulos
L. Moss
D. Corsar
G. Antoniou
Abstract
Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds the potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because further complex domain knowledge is required. To combat this, a semantic-web-based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed that they are either unable to scale to the vast volumes of medical data or inefficient when doing so. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map / Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised (SPARQL) joining strategies, intelligent data caching and a super-query to enable the completion of eight distinct SPARQL lookups, comprising over eighty distinct joins, in only two Map / Reduce iterations. Results are presented comparing the new methodology against both Jena and a previous Hadoop implementation, demonstrating its superior performance. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.
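As a rough illustration of how a SPARQL join can be phrased as a Map / Reduce job, the following is a minimal in-memory sketch of a reduce-side join on a shared subject. It is not the authors' implementation and does not use Hadoop: the triples, the predicate names (`hasPatient`, `hasHeartRate`) and all function names are hypothetical, chosen only to show the map, shuffle and reduce steps for a two-pattern query.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not data from the paper.
TRIPLES = [
    ("obs1", "hasPatient", "patient1"),
    ("obs1", "hasHeartRate", "72"),
    ("obs2", "hasPatient", "patient2"),
    ("obs2", "hasHeartRate", "310"),
    ("obs3", "hasPatient", "patient1"),  # no heart rate, so it drops out of the join
]


def map_phase(triples, predicates):
    """Emit (subject, (predicate, object)) for every triple with a wanted predicate."""
    for s, p, o in triples:
        if p in predicates:
            yield s, (p, o)


def shuffle(pairs):
    """Group mapper output by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, left, right):
    """For each subject, pair every object of `left` with every object of `right`."""
    for subject, values in sorted(groups.items()):
        lefts = [o for p, o in values if p == left]
        rights = [o for p, o in values if p == right]
        for l_obj in lefts:
            for r_obj in rights:
                yield subject, l_obj, r_obj


def join_on_subject(triples, left, right):
    """Roughly: SELECT ?s ?a ?b WHERE { ?s :left ?a . ?s :right ?b }"""
    pairs = map_phase(triples, {left, right})
    return list(reduce_phase(shuffle(pairs), left, right))


rows = join_on_subject(TRIPLES, "hasPatient", "hasHeartRate")
for row in rows:
    print(row)
```

Each additional triple pattern that shares the same join variable can be handled in the same pass by widening the set of predicates the mapper emits, which is one plausible reading of how several lookups could be folded into fewer iterations; the paper itself should be consulted for the actual strategy.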
Citation
Bonner, S., McGough, S., Kureshi, I., Brennan, J., Theodoropoulos, G., Moss, L., Corsar, D., & Antoniou, G. (2015, October). Data Quality Assessment and Anomaly Detection Via Map / Reduce and Linked Data: A Case Study in the Medical Domain. Presented at IEEE International Conference on Big Data, Santa Clara
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | IEEE International Conference on Big Data |
Start Date | Oct 29, 2015 |
End Date | Nov 1, 2015 |
Acceptance Date | Sep 5, 2015 |
Publication Date | Nov 1, 2015 |
Deposit Date | Sep 24, 2015 |
Publicly Available Date | Nov 26, 2015 |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 737-746 |
Book Title | Proceedings, 2015 IEEE International Conference on Big Data : Oct 29-Nov 01, 2015, Santa Clara, CA, USA. |
DOI | https://doi.org/10.1109/bigdata.2015.7363818 |
Keywords | RDF, Medical Data, Map / Reduce, Joins. |
Public URL | https://durham-repository.worktribe.com/output/1152954 |
Files
Accepted Conference Proceeding
(379 Kb)
PDF
Copyright Statement
© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.