Genome annotation is the process of attaching biological information to sequences. It consists of three main steps:
- identifying portions of the genome that do not code for proteins
- identifying elements on the genome, a process called gene prediction, and
- attaching biological information to these elements.
Automatic annotation tools attempt to perform all of these steps by computational analysis, as opposed to manual annotation (a.k.a. curation), which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.
The most basic level of annotation uses BLAST to find sequence similarities and then annotates genomes based on those matches. Increasingly, however, additional information is added to the annotation platform. This additional information allows manual annotators to resolve discrepancies between genes that have been given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g. Ensembl) rely on both curated data sources and a range of different software tools in their automated genome annotation pipeline.
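The similarity-based approach can be illustrated with a minimal sketch. It assumes BLAST results in the default 12-column tabular format (`-outfmt 6`), keeps the best-scoring hit per query below an E-value cutoff, and copies that hit's description onto the query. The gene names, subject identifiers, descriptions, and the `1e-5` cutoff are all hypothetical and chosen for illustration only; production pipelines apply far more evidence than a single best hit.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical curated descriptions keyed by database subject ID.
KNOWN_FUNCTIONS: Dict[str, str] = {
    "subj1": "DNA polymerase III subunit alpha",
    "subj2": "ATP-dependent helicase",
}

# Made-up BLAST tabular lines (qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore), tab-separated.
EXAMPLE_HITS = [
    "geneA\tsubj1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550",
    "geneA\tsubj2\t40.0\t280\t160\t4\t5\t285\t10\t290\t1e-20\t95.0",
    "geneB\tsubj3\t35.0\t120\t78\t0\t1\t120\t1\t120\t2e-8\t60",
    "geneC\tsubj1\t25.0\t50\t37\t1\t1\t50\t1\t50\t0.5\t22",
]


def best_hits(blast_lines: Iterable[str],
              max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Return the highest bit-score hit per query that passes the E-value cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # too weak to support annotation transfer
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def transfer_annotations(blast_lines: Iterable[str],
                         known: Dict[str, str],
                         max_evalue: float = 1e-5) -> Dict[str, str]:
    """Assign each query the description of its best hit, if one is known."""
    hits = best_hits(blast_lines, max_evalue)
    return {query: known.get(subject, "hypothetical protein")
            for query, (subject, _score) in hits.items()}


if __name__ == "__main__":
    for gene, description in transfer_annotations(EXAMPLE_HITS, KNOWN_FUNCTIONS).items():
        print(gene, description, sep="\t")
```

Queries with no hit below the cutoff (here `geneC`) receive no annotation at all, which is one reason the additional evidence sources described above are needed.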
Structural annotation consists of the identification of genomic elements:
- ORFs and their localisation
- gene structure
- coding regions
- location of regulatory motifs
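The first item in this list, locating ORFs, can be sketched in a few lines. The example below is a simplified illustration rather than a real gene predictor: it assumes the standard start codon `ATG` and the stop codons `TAA`, `TAG`, and `TGA`, scans all six reading frames, and ignores introns, alternative start codons, and ambiguous bases. The function names and the `min_codons` threshold are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[int, int, str]]:
    """Find ORFs (ATG through the next in-frame stop) in all six frames.

    Returns (start, end, strand) tuples with 0-based, half-open coordinates
    on the forward strand. The codon count includes the start and stop codons.
    """
    seq = seq.upper()
    n = len(seq)
    orfs: List[Tuple[int, int, str]] = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((start, end, "+"))
                        else:
                            # convert reverse-strand coordinates to forward ones
                            orfs.append((n - end, n - start, "-"))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=5))
```

An ORF found this way is only a candidate coding region; determining gene structure and regulatory motifs requires further evidence.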
Functional annotation consists of attaching biological information to genomic elements:
- biochemical function
- biological function
- regulation and interactions involved
- expression
These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches use information from expressed proteins, often derived from mass spectrometry, to improve genome annotations.
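The core idea behind proteogenomics can be sketched as follows: translate the genomic sequence in all six reading frames and check which frames contain an observed peptide. This is a deliberately simplified illustration that assumes the standard genetic code and exact string matching; real workflows rely on dedicated mass spectrometry search engines and statistical scoring. All function names and example sequences here are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

Frame = Tuple[str, int]  # (strand, offset)


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> Dict[Frame, str]:
    """Translate a DNA sequence in three frames on each strand."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    return {(strand, offset): translate(s[offset:])
            for strand, s in (("+", seq), ("-", rev))
            for offset in range(3)}


def map_peptides(seq: str, peptides: Iterable[str]) -> Dict[str, List[Frame]]:
    """Report, for each peptide, the frames whose translation contains it."""
    frames = six_frame_translations(seq)
    return {pep: [frame for frame, protein in frames.items() if pep in protein]
            for pep in peptides}


if __name__ == "__main__":
    print(map_peptides("ATGGCTAAAGGTTAA", ["AKG", "LTF", "WWW"]))
```

A peptide that maps to a frame lacking any annotated gene hints at a missing or mis-annotated coding region, which is the kind of evidence these approaches contribute.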
A variety of software tools have been developed to permit scientists to view and share genome annotations.
Genome annotation is the next major challenge for the Human Genome Project, now that the genome sequences of human and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list" for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts "fit together".
Genome annotation is an active area of investigation. It involves a number of different organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. The following is an alphabetical listing of ongoing projects relevant to genome annotation:
- ENCyclopedia Of DNA Elements (ENCODE)
- Ensembl
- Entrez Gene
- GENCODE
- Gene Ontology Consortium
- GeneRIF
- RefSeq
- UniProt
- Vertebrate Genome Annotation Project (Vega)
On Wikipedia, genome annotation has begun to be automated under the auspices of the Gene Wiki portal, which operates a bot that harvests gene data from research databases and creates gene stubs from that data.