Statistical Analysis
Data analysis of microarrays has become an area of intense research. Simply stating that a group of genes were regulated by at least twofold, once a common practice, lacks a solid statistical footing. With five or fewer replicates in each group, typical for microarrays, a single outlier observation can create an apparent difference greater than two-fold. In addition, arbitrarily setting the bar at two-fold is not biologically sound, as it eliminates from consideration many genes with obvious biological significance.
Rather than identify differentially expressed genes using a fold change cutoff, one can use a variety of statistical tests or omnibus tests such as ANOVA, all of which consider both fold change and variability to create a p-value, an estimate of how often we would observe the data by chance alone. Applying p-values to microarrays is complicated by the large number of multiple comparisons (genes) involved. For example, a p-value of 0.05 is typically thought to indicate significance, since it estimates a 5% probability of observing the data by chance. But with 10,000 genes on a microarray, 500 genes would be identified as significant at p < 0.05 even if there were no difference between the experimental groups. One obvious solution is to consider significant only those genes meeting a much more stringent p value criterion, e.g., one could perform a Bonferroni correction on the p-values, or use a false discovery rate calculation to adjust p-values in proportion to the number of parallel tests involved. Unfortunately, these approaches may reduce the number of significant genes to zero, even when genes are in fact differentially expressed. Current statistics such as Rank products aim to strike a balance between false discovery of genes due to chance variation and non-discovery of differentially expressed genes. Commonly cited methods include the Significance Analysis of Microarrays (SAM) and a wide variety of methods are available from Bioconductor and a variety of analysis packages from bioinformatics companies.
Selecting a different test usually identifies a different list of significant genes since each test operates under a specific set of assumptions, and places a different emphasis on certain features in the data. Many tests begin with the assumption of a normal distribution in the data, because that seems like a sensible starting point and often produces results that appear more significant. Some tests consider the joint distribution of all gene observations to estimate general variability in measurements, while others look at each gene in isolation. Many modern microarray analysis techniques involve bootstrapping (statistics), machine learning or Monte Carlo methods.
As the number of replicate measurements in a microarray experiment increases, various statistical approaches yield increasingly similar results, but lack of concordance between different statistical methods makes array results appear less trustworthy. The MAQC Project makes recommendations to guide researchers in selecting more standard methods (e.g. using p-value and fold-change together for selecting the differentially expressed genes) so that experiments performed in different laboratories will agree better.
Different from the analysis on differentially expressed individual genes, another type of analysis focuses on differential expression or perturbation of pre-defined gene sets and is called gene set analysis. Gene set analysis demonstrated several major advantages over individual gene differential expression analysis. Gene sets are groups of genes that are functionally related according to current knowledge. Therefore, gene set analysis is considered a knowledge based analysis approach. Commonly used gene sets include those derived from KEGG pathways, Gene Ontology terms, gene groups that share some other functional annotations, such as common transcriptional regulators etc. Representative gene set analysis methods include GSEA, which estimates significance of gene sets based on permutation of sample labels, and GAGE, which tests the significance of gene sets based on permutation of gene labels or a parametric distribution.
Read more about this topic: Gene Expression Profiling
Famous quotes containing the word analysis:
“Cubism had been an analysis of the object and an attempt to put it before us in its totality; both as analysis and as synthesis, it was a criticism of appearance. Surrealism transmuted the object, and suddenly a canvas became an apparition: a new figuration, a real transfiguration.”
—Octavio Paz (b. 1914)