Normalization Based On URL Lists
Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL
http://foo.org/story?id=xyz
appears in a crawl log several times along with
http://foo.org/story_xyz
we may assume that the two URLs are equivalent and normalize both to a single form.
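A site-specific rule of this kind can be sketched as a single rewrite. The snippet below is a minimal illustration, assuming the path form (story_xyz) is chosen as the canonical one; the pattern `STORY_QUERY` and the function `normalize_story_url` are hypothetical names, and the extra URL http://foo.org/about is added only to show that unrelated URLs pass through unchanged.

```python
import re

# Hypothetical rule for foo.org: rewrite the query-string form of a story URL
# to the path form. Only URLs whose sole query parameter is "id" are matched.
STORY_QUERY = re.compile(r"^(http://foo\.org/story)\?id=([^&#]+)$")


def normalize_story_url(url: str) -> str:
    """Map http://foo.org/story?id=X to http://foo.org/story_X.

    URLs that do not match the pattern are returned unchanged.
    """
    return STORY_QUERY.sub(r"\1_\2", url)


urls = [
    "http://foo.org/story?id=xyz",
    "http://foo.org/story_xyz",
    "http://foo.org/about",  # illustrative URL, not from the crawl log example
]

# After normalization the two story URLs collapse into one entry.
normalized = sorted({normalize_story_url(u) for u in urls})
```

Which of the two forms becomes canonical is a design choice; the reverse mapping would work equally well as long as it is applied consistently.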
Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that, once the correct DUST rules were found and applied with a canonicalization algorithm, up to 68% of the redundant URLs in a URL list could be identified.
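The sketch below illustrates only the application step, not rule discovery, and it is not the DustBuster algorithm itself. It assumes that rules are plain substring substitutions, and the rule list, the URL list, and the names `canonicalize` and `fraction_collapsed` are all hypothetical. The ratio it computes is a simple illustration of how much a URL list shrinks, not the measurement reported by the authors.

```python
def canonicalize(url, rules):
    """Apply each (old, new) substring substitution rule to the URL, in order."""
    for old, new in rules:
        url = url.replace(old, new)
    return url


def fraction_collapsed(urls, rules):
    """Return the fraction of distinct URLs eliminated by canonicalization."""
    unique_before = set(urls)
    unique_after = {canonicalize(u, rules) for u in unique_before}
    return 1 - len(unique_after) / len(unique_before)


# Hypothetical rules, assumed to have been found and validated beforehand.
rules = [
    ("/story?id=", "/story_"),
    ("/index.html", "/"),
]

# Hypothetical URL list, e.g. taken from a crawl log.
urls = [
    "http://foo.org/story?id=xyz",
    "http://foo.org/story_xyz",
    "http://foo.org/index.html",
    "http://foo.org/",
    "http://foo.org/about",
]

# Five distinct URLs collapse to three canonical forms.
canonical = sorted({canonicalize(u, rules) for u in urls})
share = fraction_collapsed(urls, rules)
```

Because the substitutions are purely textual, a rule is only safe to apply after it has been checked that the URLs it merges really do serve similar content.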