Full Text Search - False-positive Problem

False-positive Problem

Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "football", clustering can be used to categorize the document/data universe into "American football", "corporate football", etc. Depending on the occurrences of words relevant to the categories, search terms a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.

Read more about this topic:  Full Text Search

Famous quotes containing the word problem:

    The family environment in which your children are growing up is different from that in which you grew up. The decisions our parents made and the strategies they used were developed in a different context from what we face today, even if the “content” of the problem is the same. It is a mistake to think that our own experience as children and adolescents will give us all we need to help our children. The rules of the game have changed.
    Lawrence Kutner (20th century)