George Gkotsis
Entropy-based automated wrapper generation for weblog data extraction
Gkotsis, George; Stepanyan, Karen; Cristea, A.I.; Joy, Mike
Abstract
This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and the processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
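The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rule representation (a tag name plus a `class` attribute), the helper names `ElementTextCollector`, `candidate_rules` and `score_rules`, and the sample posts are all assumptions made for illustration. The entropy-based scoring named in the title is not reproduced here; the sketch simply ranks each candidate rule by the fraction of posts in which it selects the feed value.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped to keep the stack simple.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ElementTextCollector(HTMLParser):
    """Collect (tag, class, text) for every closed element in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open elements as [tag, class, text_parts]
        self.elements = []  # closed elements as (tag, class, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append([tag, dict(attrs).get("class") or "", []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for entry in self.stack:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self.stack]:
            return  # stray closing tag, ignore it
        while True:
            open_tag, cls, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalise whitespace
            self.elements.append((open_tag, cls, text))
            if open_tag == tag:
                break


def candidate_rules(html, feed_value):
    """Return the (tag, class) pairs whose text equals the feed value."""
    collector = ElementTextCollector()
    collector.feed(html)
    collector.close()
    target = " ".join(feed_value.split())
    return {(tag, cls) for tag, cls, text in collector.elements if text == target}


def score_rules(posts):
    """posts: list of (html, feed_value). Returns {rule: share of posts matched}."""
    counts = Counter()
    for html, feed_value in posts:
        counts.update(candidate_rules(html, feed_value))
    return {rule: n / len(posts) for rule, n in counts.items()}


# Hypothetical posts paired with the title taken from the web feed.
posts = [
    ('<html><body><h1 class="post-title">First post</h1>'
     '<div class="content"><p>Body one</p></div></body></html>', "First post"),
    ('<html><body><h1 class="post-title">Second post</h1>'
     '<div class="content"><p>First post was fun</p></div></body></html>', "Second post"),
    ('<html><body><h2 class="sidebar">Third post</h2>'
     '<h1 class="post-title">Third post</h1></body></html>', "Third post"),
]

scores = score_rules(posts)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Aggregating over several posts is what separates a stable rule from a coincidental match: the hypothetical `sidebar` heading matches the title in one post only, so it scores lower than the `post-title` heading that matches in all three.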
Citation
Gkotsis, G., Stepanyan, K., Cristea, A., & Joy, M. (2013). Entropy-based automated wrapper generation for weblog data extraction. World Wide Web, 17(4), 827-846. https://doi.org/10.1007/s11280-013-0269-6
Field | Value |
---|---|
Journal Article Type | Article |
Acceptance Date | Nov 4, 2013 |
Online Publication Date | Nov 21, 2013 |
Publication Date | Nov 21, 2013 |
Deposit Date | Jul 11, 2018 |
Publicly Available Date | Jul 31, 2018 |
Journal | World Wide Web |
Print ISSN | 1386-145X |
Electronic ISSN | 1573-1413 |
Publisher | Springer |
Peer Reviewed | Peer Reviewed |
Volume | 17 |
Issue | 4 |
Pages | 827-846 |
DOI | https://doi.org/10.1007/s11280-013-0269-6 |
Public URL | https://durham-repository.worktribe.com/output/1354582 |
Related Public URLs | http://wrap.warwick.ac.uk/61827/ |
Files
Accepted Journal Article (PDF, 1.7 MB)
Copyright Statement
The final publication is available at Springer via https://doi.org/10.1007/s11280-013-0269-6