George Gkotsis
Entropy-based automated wrapper generation for weblog data extraction
Authors
Gkotsis, George; Stepanyan, Karen; Cristea, A.I.; Joy, Mike
Abstract
This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
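The feed-to-HTML matching idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the class `ElementTextCollector`, the function `derive_rule`, the sample `posts` data, and the simple support-fraction score are all hypothetical and chosen only for illustration (the paper itself uses an entropy-based probabilistic model that is not reproduced here). The sketch assumes a rule is a `(tag, class)` pair and that a feed value matches an element when their text is identical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never contain text and have no closing tag.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ElementTextCollector(HTMLParser):
    """Collects (tag, class, text) triples for elements with direct text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, class) pairs
        self.found = []  # collected (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, cls = self.stack[-1]
            self.found.append((tag, cls, text))


def derive_rule(posts):
    """Pick the (tag, class) rule that matches the feed value in most posts.

    posts: list of (feed_value, html) pairs, one per weblog post.
    Returns (rule, support), where support is the fraction of posts matched.
    """
    votes = Counter()
    for feed_value, html in posts:
        collector = ElementTextCollector()
        collector.feed(html)
        # Each post votes at most once per candidate rule.
        matched = {(tag, cls) for tag, cls, text in collector.found
                   if text == feed_value.strip()}
        votes.update(matched)
    if not votes:
        return None, 0.0
    rule, count = votes.most_common(1)[0]
    return rule, count / len(posts)


# Hypothetical data: post titles taken from a feed, paired with post HTML.
posts = [
    ("First post",
     '<body><h1 class="site">My Blog</h1>'
     '<h2 class="entry-title">First post</h2><p>Hello</p></body>'),
    ("Second post",
     '<div><h2 class="entry-title">Second post</h2>'
     '<span class="crumb">Second post</span></div>'),
    ("Third post",
     '<div><h2 class="entry-title">Third post</h2></div>'),
]

rule, support = derive_rule(posts)
print(rule, support)
```

Because each candidate rule is scored across all posts at once, an element that matches by coincidence in a single post (the `span.crumb` breadcrumb above) is outvoted by the element that matches consistently, which is the intuition behind comparing feed values against multiple posts instead of comparing posts pairwise.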
Citation
Gkotsis, G., Stepanyan, K., Cristea, A., & Joy, M. (2013). Entropy-based automated wrapper generation for weblog data extraction. World Wide Web, 17(4), 827-846. https://doi.org/10.1007/s11280-013-0269-6
Journal Article Type | Article |
---|---|
Acceptance Date | Nov 4, 2013 |
Online Publication Date | Nov 21, 2013 |
Publication Date | Nov 21, 2013 |
Deposit Date | Jul 11, 2018 |
Publicly Available Date | Jul 31, 2018 |
Journal | World Wide Web |
Print ISSN | 1386-145X |
Electronic ISSN | 1573-1413 |
Publisher | Springer |
Peer Reviewed | Peer Reviewed |
Volume | 17 |
Issue | 4 |
Pages | 827-846 |
DOI | https://doi.org/10.1007/s11280-013-0269-6 |
Public URL | https://durham-repository.worktribe.com/output/1354582 |
Related Public URLs | http://wrap.warwick.ac.uk/61827/ |
Files
Accepted Journal Article (PDF, 1.7 MB)
Copyright Statement
The final publication is available at Springer via https://doi.org/10.1007/s11280-013-0269-6