Deep Web - Crawling The Deep Web

Crawling The Deep Web

Researchers have been exploring how the deep Web can be crawled in an automatic fashion. In 2001, Sriram Raghavan and Hector Garcia-Molina presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from the query interfaces to query a Web form and crawl the deep Web resources. Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho of UCLA created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms. Several form query languages (e.g., DEQUEL) have been proposed that, besides issuing a query, also allow to extract structured data from result pages. Another effort is DeepPeep, a project of the University of Utah sponsored by the National Science Foundation, which gathered hidden-Web sources (Web forms) in different domains based on novel focused crawler techniques.

Commercial search engines have begun exploring alternative methods to crawl the deep Web. The Sitemap Protocol (first developed by Google) and mod oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web. Google's deep Web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages into the Google search engine index. The surfaced results account for a thousand queries per second to deep Web content. In this system, the pre-computation of submissions is done using three algorithms: (1) selecting input values for text search inputs that accept keywords, (2) identifying inputs which accept only values of a specific type (e.g., date), and (3) selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.

Read more about this topic:  Deep Web

Famous quotes containing the words crawling, deep and/or web:

    As I stand over the insect crawling amid the pine needles on the forest floor, and endeavoring to conceal itself from my sight, and ask myself why it will cherish those humble thoughts, and hide its head from me who might, perhaps, be its benefactor, and impart to its race some cheering information, I am reminded of the greater Benefactor and Intelligence that stands over me the human insect.
    Henry David Thoreau (1817–1862)

    Personal size and mental sorrow have certainly no necessary proportions. A large bulky figure has a good a right to be in deep affliction, as the most graceful set of limbs in the world. But, fair or not fair, there are unbecoming conjunctions, which reason will pa tronize in vain,—which taste cannot tolerate,—which ridicule will seize.
    Jane Austen (1775–1817)

    Being so wrong about her makes me wonder now how often I am utterly wrong about myself. And how wrong she might have been about her mother, how wrong he might have been about his father, how much of family life is a vast web of misunderstandings, a tinted and touched-up family portrait, an accurate representation of fact that leaves out only the essential truth.
    Anna Quindlen (b. 1952)