Natural Language Generation - Evaluation

As in other scientific fields, NLG researchers need to test how well their systems, modules, and algorithms work. This is called evaluation. There are three basic techniques for evaluating NLG systems:

  • Task-based (extrinsic) evaluation: give the generated text to a person, and assess how well it helps him perform a task (or otherwise achieves its communicative goal). For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors, and assessing whether the summaries helps doctors make better decisions.
  • Human ratings: give the generated text to a person, and ask him or her to rate the quality and usefulness of the text.
  • Metrics: compare generated texts to texts written by people from the same input data, using an automatic metric such as BLEU.

An ultimate goal is how useful NLG systems are at helping people, which is the first of the above techniques. However, task-based evaluations are time-consuming and expensive, and can be difficult to carry out (especially if they require subjects with specialised expertise, such as doctors). Hence (as in other areas of NLP) task-based evaluations are the exception, not the norm.

Recently researchers are assessing how well human-ratings and metrics correlate with (predict) task-based evaluations. Work is being conducted in the context of Generation Challenges shared-task events. Initial results suggest that human ratings are much better than metrics in this regard. In other words, human ratings usually do predict task-effectiveness at least to some degree (although there are exceptions ), while ratings produced by metrics often do not predict task-effectiveness well. These results are preliminary. In any case, human ratings are the most popular evaluation technique in NLG; this is contrast to machine translation, where metrics are widely used.

Read more about this topic:  Natural Language Generation

Famous quotes containing the word evaluation:

    Good critical writing is measured by the perception and evaluation of the subject; bad critical writing by the necessity of maintaining the professional standing of the critic.
    Raymond Chandler (1888–1959)

    Evaluation is creation: hear it, you creators! Evaluating is itself the most valuable treasure of all that we value. It is only through evaluation that value exists: and without evaluation the nut of existence would be hollow. Hear it, you creators!
    Friedrich Nietzsche (1844–1900)