The Invisible Web

Much of the Web is completely invisible to general-purpose search engines such as AltaVista and Google. Furthermore, this “invisible Web” (also called the “deep Web”) may be growing faster than the visible Web. The invisible Web consists of Web content that is not, or cannot be, indexed by search engines. Because it is a significant and content-rich part of the Web, it is important to include guides to the invisible Web in any search task or research project.

To understand what the “invisible Web” is and why search engines cannot index it, it is important to first define the “visible” Web and describe how search engines compile their indexes.

Web pages are written in Hypertext Markup Language (HTML). HTML documents consist of a “head” portion, with a title and perhaps some additional information describing the document, and a “body” portion—the actual document itself. The simplicity of this format makes it easy for search engines to retrieve HTML documents, index every word on every page, and store them in huge databases that can be searched on demand. Search engines use automated programs called spiders or robots to “crawl” the Web and retrieve pages. Spiders rely on links to take them from page to page. Because crawling is a resource-intensive operation, search engines often limit the number of pages they retrieve and index from any given website.
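The crawl-and-index cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical three-page website held in memory (no network access is used), and the names `PageParser`, `crawl`, and `max_pages` are assumptions made for the example. The `max_pages` parameter stands in for the per-site limit mentioned above.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "Web": URL -> HTML source. No network access is used.
PAGES = {
    "/index.html": "<html><head><title>Home</title></head>"
                   "<body>Welcome. <a href='/about.html'>About</a> "
                   "<a href='/news.html'>News</a></body></html>",
    "/about.html": "<html><head><title>About</title></head>"
                   "<body>We study spiders. <a href='/index.html'>Home</a></body></html>",
    "/news.html": "<html><head><title>News</title></head>"
                  "<body>Spiders crawl links.</body></html>",
}


class PageParser(HTMLParser):
    """Collects the link targets and the text of one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Remember the target of every <a href="..."> link.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Text from both the head (title) and the body ends up here.
        self.text.append(data)


def crawl(pages, start, max_pages=100):
    """Breadth-first crawl from `start`; returns a word -> set-of-URLs index."""
    index = {}
    seen = {start}
    queue = deque([start])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # broken link: nothing to retrieve
            continue
        fetched += 1
        parser = PageParser()
        parser.feed(html)
        # Index every word on the page.
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index.setdefault(word, set()).add(url)
        # Follow links to pages not yet queued.
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl(PAGES, "/index.html")
print(sorted(index["spiders"]))  # → ['/about.html', '/news.html']
```

Calling `crawl(PAGES, "/index.html", max_pages=2)` stops after two pages, so `/news.html` is never retrieved and its words never reach the index, even though the page is perfectly ordinary HTML.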

Some Web pages cannot be indexed because search engines cannot reach or read them. One reason is that no links point to the page, so a spider has no path to follow to it. Another is that the page is made up of data types that search engines do not index, such as graphics or PDF files. The biggest part of the invisible Web, however, is made up of information stored in databases. When a spider comes across a database, it cannot index the information because it is not stored as a static Web page in HTML format. The spider can record the link to the page containing the portal to the database, but it cannot tell you anything about the information contained in the database.
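A small reachability sketch makes the two cases concrete. Everything here is hypothetical: `LINKS` is an invented link graph, `DATABASE` stands in for records held behind a search form, and `reachable` is a name chosen for the example. The point it illustrates is only that a link-following spider sees the pages it can walk to, and nothing else.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/index.html": ["/catalog.html", "/search.html"],
    "/catalog.html": ["/index.html"],
    "/search.html": [],               # holds a search form; no outgoing links
    "/orphan.html": ["/index.html"],  # no page links *to* this one
}

# Hypothetical records behind the form on /search.html.
# They are not pages, so they never appear in the link graph.
DATABASE = {
    "rec-1": "Annual rainfall, 1990-2000",
    "rec-2": "Census tables by county",
}


def reachable(links, start):
    """Return every page a spider can reach by following links from `start`."""
    seen = set()
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        stack.extend(links.get(page, []))
    return seen


visible = reachable(LINKS, "/index.html")
invisible = set(LINKS) - visible

print(sorted(visible))    # → ['/catalog.html', '/index.html', '/search.html']
print(sorted(invisible))  # → ['/orphan.html']
```

The spider records `/search.html`, the portal page, but none of the entries in `DATABASE` show up in `visible`: they exist only as answers to queries the spider never submits.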

There are probably thousands of databases accessible via the Web that contain high-quality information. To search one of these databases, you must visit the website that provides a portal to it. The advantage of this approach is that you can use search tools designed specifically for that database. The disadvantage is that you must locate the database in the first place, and search engines may or may not be able to help you find it.

Another type of Web document that search engines cannot index is one whose content is generated dynamically from a database. Web page authors and developers often use database technology to provide customized content that is assembled on the fly. Search engine results pages are an example of this type of dynamic Web page. The trend toward dynamically generated, customized content will probably continue and spread, making it more difficult for search engines to create comprehensive Web indexes.
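To show what "assembled on the fly" means, here is a minimal sketch using Python's built-in `sqlite3` module. The `books` table, its rows, and the `render_results` function are all invented for illustration. The page it returns does not exist anywhere until a query arrives, which is why a spider has no static file to retrieve.

```python
import sqlite3
from html import escape

# Hypothetical in-memory database standing in for a site's back end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("Spiders and Robots", "crawling"),
        ("Indexing Basics", "indexing"),
        ("Deep Crawling", "crawling"),
    ],
)


def render_results(conn, topic):
    """Build an HTML results page for `topic` at request time."""
    rows = conn.execute(
        "SELECT title FROM books WHERE topic = ? ORDER BY title", (topic,)
    ).fetchall()
    items = "".join(f"<li>{escape(title)}</li>" for (title,) in rows)
    return (
        f"<html><head><title>Results for {escape(topic)}</title></head>"
        f"<body><ul>{items}</ul></body></html>"
    )


# A different page is produced for every query a visitor submits.
print(render_results(conn, "crawling"))
```

Each distinct value of `topic` yields a distinct page, so the set of possible pages depends on the queries visitors type rather than on links a spider could follow.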

For more information about the invisible Web, see the book: The Invisible Web: Uncovering Information Sources Search Engines Can't See, by Chris Sherman and Gary Price.




Copyright © 1995-2008, Pearson Education, Inc., publishing as Pearson Prentice Hall