Intelligent Web Crawler by Supervised Learning | Original Article
In this paper we present Intelligent Web Crawler (IWC), a supervised, intelligent, web-scale forum crawler. The goal of IWC is to crawl relevant forum content from the web with minimum overhead. Forum crawlers collect the information content carried by URLs and forum threads. We reduce the web forum crawling problem to a URL type recognition problem and show how to learn accurate and effective regular expression patterns of constant navigation paths from automatically created training sets, which are built by aggregating the results of weak page type classifiers. Although forums differ in layout and style and run on different forum software packages, they always have homogeneous, constant navigation paths, connected by specific URL types, that direct users from entry pages to thread pages. Robust page type classifiers can be obtained from as few as five annotated forums and applied to a large set of unseen forums. To obtain an accurate specification, we apply a supervised machine learning process to a large set of forums. Compared with other forum crawlers, IWC gives the best performance: the results show that it performs better in terms of precision and crawling time. In future, we would like to extend this crawler to other sites, such as question-and-answer (Q&A) sites, blog sites, and other social media sites, to develop IWC into a better forum crawler.
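The idea of treating crawling as URL type recognition can be illustrated with a minimal sketch. It is not the IWC algorithm itself: the URLs, the labels ("index", "thread"), the helper names, and the `min_support` threshold are all assumptions made for illustration. The sketch generalizes each noisily labeled URL into a regular expression by replacing digit runs with `\d+`, keeps only the patterns that recur, and uses them to type unseen URLs.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def url_tail(url):
    """Return the path plus query string, i.e. the part that encodes the URL type."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def generalize(url):
    """Build a regex from one URL: digit runs become \\d+, the rest is literal."""
    pieces = re.split(r"(\d+)", url_tail(url))
    body = "".join(r"\d+" if p.isdigit() else re.escape(p) for p in pieces)
    return "^" + body + "$"


def learn_patterns(labeled_urls, min_support=2):
    """Keep, per URL type, the patterns seen at least min_support times.

    The support threshold is a hypothetical stand-in for aggregating
    noisy labels produced by weak page type classifiers.
    """
    counts = Counter((label, generalize(u)) for u, label in labeled_urls)
    learned = {}
    for (label, pattern), n in counts.items():
        if n >= min_support:
            learned.setdefault(label, []).append(re.compile(pattern))
    return learned


def url_type(url, learned):
    """Return the first learned type whose pattern matches, else 'other'."""
    tail = url_tail(url)
    for label, patterns in learned.items():
        if any(p.match(tail) for p in patterns):
            return label
    return "other"


# Hypothetical training set; the last label is deliberately noisy.
training = [
    ("http://forum.example.com/forumdisplay.php?f=2", "index"),
    ("http://forum.example.com/forumdisplay.php?f=7", "index"),
    ("http://forum.example.com/showthread.php?t=101", "thread"),
    ("http://forum.example.com/showthread.php?t=245", "thread"),
    ("http://forum.example.com/member.php?u=9", "thread"),
]
learned = learn_patterns(training)
```

With these patterns, a crawler would follow only URLs typed as "index" or "thread" and skip the rest, which is where the savings in crawling overhead would come from. The single mislabeled `member.php` URL falls below the support threshold, so it does not produce a thread pattern.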