URL Normalization - Normalization Based On URL Lists

Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL

http://foo.org/story?id=xyz

appears in a crawl log several times along with

http://foo.org/story_xyz

we may assume that the two URLs are equivalent and normalize both to a single one of the two forms.
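As a rough illustration, a site-specific rule of this kind could be written as a single rewrite. The sketch below assumes the path form `/story_<id>` is the chosen canonical form and that the rule applies only to `foo.org`. The names `STORY_QUERY` and `normalize` are hypothetical and do not come from any particular crawler.

```python
import re

# Hypothetical site-specific rule for foo.org (an assumption for illustration):
# rewrite the query form /story?id=<id> to the path form /story_<id>.
STORY_QUERY = re.compile(r"^(http://foo\.org)/story\?id=([^&#]+)$")


def normalize(url: str) -> str:
    """Return the canonical form of url under the assumed foo.org rule."""
    match = STORY_QUERY.match(url)
    if match:
        return f"{match.group(1)}/story_{match.group(2)}"
    # URLs that the rule does not cover are returned unchanged.
    return url


urls = [
    "http://foo.org/story?id=xyz",
    "http://foo.org/story_xyz",
    "http://foo.org/about",
]

# After normalization, the two story URLs collapse into one entry.
canonical = sorted({normalize(u) for u in urls})
print(canonical)
```

Which of the two forms is treated as canonical is arbitrary here. What matters is that every equivalent URL maps to the same string.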

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They report that once the correct DUST rules had been found and applied with a canonicalization algorithm, up to 68% of the redundant URLs in a URL list could be identified.
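The sketch below is not the DustBuster heuristic. It only illustrates the later step of applying rules that have already been found to a URL list and counting how many URLs collapse. Each rule is modeled as a plain substring substitution, which is a simplifying assumption. The rules, the URLs, and the function names are hypothetical.

```python
def apply_rules(url, rules):
    """Apply each (old, new) substring substitution to url, in order."""
    for old, new in rules:
        url = url.replace(old, new)
    return url


def redundant_count(urls, rules):
    """Count how many distinct URLs disappear once the rules are applied."""
    distinct_before = set(urls)
    distinct_after = {apply_rules(u, rules) for u in distinct_before}
    return len(distinct_before) - len(distinct_after)


# Hypothetical rules, assumed to have been validated for foo.org already.
rules = [
    ("/story?id=", "/story_"),
    ("/index.html", "/"),
]

urls = [
    "http://foo.org/story?id=1",
    "http://foo.org/story_1",
    "http://foo.org/story?id=2",
    "http://foo.org/index.html",
    "http://foo.org/",
]

removed = redundant_count(urls, rules)
print(f"{removed} of {len(set(urls))} distinct URLs were redundant")
```

A substitution that merges two URLs is only safe when both really serve similar content, so in practice a candidate rule would need to be checked against fetched pages before it is trusted.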
