Interactive generation of uniformly random samples of World Wide Web pages

Andrew Mitchell Walker

Research output: Thesis › Master's thesis

Abstract

The size and complexity of the World Wide Web mean that, for all practical purposes, it is impossible to have information about the content of every web page in existence. Hence, to learn about the structure of the web and the characteristics of the documents accessible through it, it is necessary to devise a means of collecting a random sample of documents to study. This thesis presents the design of the random web crawler, named Alienbot, that I used to collect such a sample, together with the results I obtained from studying the small sample of pages it gathered. Alienbot employs a unique crawling strategy, which utilises the blind interaction of a user to select at random the pages to be crawled. The user interacts with the crawler through an interface based on the classic game Space Invaders, in which the action of shooting aliens determines the pages to be crawled. The sample of pages collected on the random web crawl performed by Alienbot is then used to estimate average properties of pages on the World Wide Web, examining elements such as links and images. The study of link usage is also extended to a previously unstudied area of the web: a type of page known as a weblog [30]. The link usage on weblogs is compared and contrasted with that found on a typical page. I also use the data set to examine the extent of the use of valid HTML/XHTML mark-up, to see how quickly web page authors are adopting new web standards. The data set collected by Alienbot shows some interesting results: on most pages the majority of links point to other pages within the same website, and when authors do link to a different website they are more likely to link to its homepage than to a page deeper within the site. In general, weblogs appear to exhibit significantly different link characteristics from other pages in the sample; in particular, weblog homepages appear to be much more richly linked than the homepages of other sites. It is also found that the use of valid mark-up is not common within the sample, with most pages that could be validated exhibiting many errors.
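The link statistics described in the abstract rest on classifying each link on a sampled page as internal (pointing within the same website) or external, and on distinguishing external links to a site's homepage from links to deeper pages. The thesis's own code is not reproduced here, so the following Python sketch is only an illustration of that kind of classification under assumed conventions; the class and function names are hypothetical.

# Illustrative sketch (not the thesis's actual code): classify the links on a
# fetched page as internal vs. external and, for external links, as homepage
# vs. deeper links -- the per-page statistics the abstract describes.
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def classify_links(page_url, html):
    """Return counts of internal links, external links, and external
    links that point at a site's homepage rather than a deeper page."""
    parser = LinkCollector()
    parser.feed(html)
    page_host = urlparse(page_url).netloc
    internal = external = external_homepage = 0
    for href in parser.hrefs:
        target = urlparse(urljoin(page_url, href))
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, fragments, etc.
        if target.netloc == page_host:
            internal += 1
        else:
            external += 1
            # Treat a bare root path with no query string as a homepage link.
            if target.path in ("", "/") and not target.query:
                external_homepage += 1
    return internal, external, external_homepage

Aggregated over a sample of crawled pages, counts of this kind would support the internal-versus-external and homepage-versus-deep-link comparisons that the abstract reports.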
Original language: English
Qualification: Master of Science by Research (MSc(R))
Awarding Institution
  • Kingston University
Supervisors/Advisors
  • Evans, Michael, Supervisor, External person
Publication status: Accepted/In press - Sept 2003
Externally published: Yes

Bibliographical note

Department: School of Mathematics

Physical Location: This item is held in stock at Kingston University Library.

Keywords

  • Computer science and informatics
