You are here
Main Content Detection in HTML Journal Articles.
Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a novel and effective layout analysis algorithm for main content detection in HTML journal articles. The algorithm first segments a web page based on rendered line breaks, then based on its column structure, and finally identifies the column that contains the most paragraph text. On a test set of 359 manually labeled HTML journal articles, the proposed layout analysis algorithm was found to significantly outperform an alternative semantic markup algorithm based on HTML 5 semantic tags. The precision, recall, and F-score of the layout analysis algorithm were measured to be 0.96, 0.99, and 0.98 respectively.