This section discusses the statistical methods used in GEAR and the interpretation of GEAR results (including an FAQ)

 

  Although functional enrichment analysis is widely used for the analysis of high-throughput expression microarray data, users should be aware of the statistical concerns and related issues in order to use the GEAR algorithm effectively. The brief descriptions below are intended to give users practical information on these issues. In addition, the GEAR algorithm is still in development and has much room for improvement. Comments and suggestions on program bugs, as well as ideas to improve the algorithm, are very welcome. Click to e-mail developer.

 

 

(1) Functional enrichment analysis and ORA (overrepresentation analysis) based on hypergeometric distribution

 

  GEAR is specifically designed to apply 'functional enrichment analysis' to genome-wide profiles of the chromosomal alterations commonly observed in cancer, or to the recently emerging copy number variation (CNV). Functional enrichment analysis is versatile and has been widely used for the analysis of microarray datasets. Thus, the basic algorithms, such as the hypergeometric distribution or Fisher's exact test, and their statistical considerations are already well documented on the web (reviewed in Curtis RK et al, Trends Biotechnol, 23:429, 2005; it is also helpful to look up related applications for expression array datasets such as GOMINER, GOSTAT, ArrayXPath, ONTO-TOOLS, GeneTrail, Babelomics, ErmineJ, etc).

 

  The basic statistic used in the GEAR algorithm is the hypergeometric distribution, one of the simplest methods to measure the significance of enrichment between gene sets. It measures the extent of overlap (or enrichment) between two gene subsets selected from a total gene list. One good example is described by Rhodes DR and Chinnaiyan AM (Nat Genet 37:S31, 2005 - Page S33): if 1,000 genes are called 'over-expressed' among a total of 10,000 genes in a microarray, a randomly selected gene set is expected to contain 10% 'over-expressed' genes. Consider a functional gene set annotated as 'protein biosynthesis', with 100 genes whose known molecular functions are involved in protein biosynthesis. If 50 genes (observed) in the gene set are 'over-expressed', this exceeds the 10 genes expected (10% of 100 genes), and the 'protein biosynthesis' gene set can be called 'enriched' (or overrepresented; this kind of study is also called 'overrepresentation analysis' or 'ORA') in over-expressed genes. The significance level of enrichment can be calculated by the following equation of hypergeometric distribution:

 

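$$P = \sum_{i=k}^{\min(n,N)} \frac{\binom{N}{i}\binom{M-N}{n-i}}{\binom{M}{n}}$$

where, in this case, k = 50 (over-expressed genes observed in the gene set), n = 100 (genes in the gene set), N = 1000 (over-expressed genes in total) and M = 10000 (all genes on the array). The calculated P-value represents the probability that a randomly selected gene set of 100 genes contains 50 or more over-expressed genes, given 10,000 genes of which 1,000 are over-expressed (P = 2.3929E-024).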
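The worked example above can be checked with a few lines of Python. This is only a minimal sketch using exact integer arithmetic from the standard library, not GEAR's own implementation; the function name `hypergeom_tail` is hypothetical, and the symbols k, n, N and M follow the text.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(k, n, N, M):
    """Probability of k or more 'hits' in a gene set of size n.

    M = all genes, N = genes called altered/over-expressed,
    n = genes in the gene set, k = observed hits in the gene set.
    """
    upper = min(n, N)
    # Sum the hypergeometric probabilities for i = k .. min(n, N).
    numerator = sum(comb(N, i) * comb(M - N, n - i) for i in range(k, upper + 1))
    return float(Fraction(numerator, comb(M, n)))


# Example from the text: 50 of 100 'protein biosynthesis' genes are
# over-expressed, with 1,000 of 10,000 genes over-expressed overall.
p_value = hypergeom_tail(k=50, n=100, N=1000, M=10000)
print(f"P = {p_value:.4E}")  # the text reports P = 2.3929E-024 for this case
```

Because the sum starts at k, the result is the probability of observing 50 or more hits, which is the quantity that reproduces the P-value quoted above.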

 

  The same basic scheme is used in the GEAR algorithm, which deals with chromosomal alterations (genomic dosage changes) rather than transcriptional dosage changes of genes (expression microarray). Genes located within chromosomal alterations (gains or losses) can be selected using the alteration profiles of array-based comparative genomic hybridization (array-CGH) and subjected to functional enrichment analysis using the strategy above. With a large collection of functional gene sets prepared a priori from public gene databases (GO, GenMAPP, KEGG, Biocarta, etc.), the molecular categories significantly enriched in specific chromosomal alterations can be identified and used for functional interpretation and a rich description of the corresponding genomic profile. For example, if 'recurrent chromosomal gains' were identified as the specific alterations, the concurrent genomic signatures, that is, the functional annotations of gene sets significantly enriched in recurrent chromosomal gains, would represent the primary (initial) molecular functions up-regulated by genomic dosage changes. Of course, the results of functional enrichment analysis must be interpreted with care, just like those of other in silico studies.

 

 

 

(2) ORA results of expression and genome (array-CGH) microarray data

 

  Expression microarray data can be regarded as a 'snapshot' of the transcriptional profile of a cell, reflecting the dynamic changes in cellular mRNA states. On the other hand, the genomic alterations identified by array-CGH are more likely to represent stable, 'fossil-like' signatures of the cancer cells (Albertson DG et al, Nature Genet 34:369, 2003). This raises the question of whether the ORA results of expression array and array-CGH data can be interpreted in the same manner. According to the well-known clonal evolution theory of tumorigenesis, the majority of genomic alterations are accumulated during the evolution of cancer. Some of the acquired alterations might provide the host cell with benefits for survival or migration, while others merely represent randomly acquired changes (this is the critical difference from expression microarray data). Thus, the ORA results of an individual cancer genome might be hard to interpret (although they might be worthy of further investigation), because the majority of the chromosomal alterations in an individual genome might represent such random changes. This is why we adopted additional algorithms to detect 'recurrent' and 'class-specific' alterations as biologically relevant changes before the application of ORA in GEAR. At the same time, there have been extensive efforts to identify more biologically relevant alterations among the extensive and complex genome-wide profiles of array-CGH datasets, so the basic algorithms incorporated in GEAR might not satisfy all potential needs. For this reason, user-defined custom alterations determined by various methods and assumptions can be used as the query in the GEAR algorithm.

 

 

 

 

(3) Beyond simple ORA - advanced GSEA (gene set enrichment analysis)?

 

  Recently, a more advanced form of enrichment analysis has been proposed that considers the correlation or expression values of individual genes (FCS; functional class scoring). One well-known method is GSEA (gene set enrichment analysis). GSEA is distinguished from conventional functional enrichment analysis (ORA) in that it can use the entire expression profile as an ordered gene list. This method was first introduced for the comparison of gene expression profiles between DM (diabetes mellitus) and normal muscle cells (Mootha VK et al, Nature Genet, 34:267, 2003). In that study, GSEA successfully identified a DM-specific repressed gene set (oxidative phosphorylation), whereas, according to the authors, the single-gene approach did not find any differentially expressed gene (DEG) with acceptable significance. They also claimed that this method is sensitive in detecting subtle but coordinated expression changes of functionally related genes. The method has since been improved by the original authors (Subramanian A et al, PNAS, 102:15545, 2005; GSEA-P in Bioinformatics) to deal with several statistical concerns (in Comment; Nature Genet. 36:663, 2004). Several modifications of the GSEA algorithm have also been published (Jiang Z and Gentleman R, Bioinformatics, 23:306, 2006; Kim SY and Volsky DJ, BMC Bioinformatics, 6:144, 2005; Klebanov L et al, J Bioinform Comput Biol 5:1139, 2007).
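To make the contrast with ORA concrete, the following is a minimal sketch of an unweighted Kolmogorov-Smirnov-style running sum over an ordered gene list, in the spirit of the original GSEA. It is an illustration only, not the published GSEA software and not part of GEAR; the function name `running_sum_es` and the toy gene names are hypothetical.

```python
from math import sqrt


def running_sum_es(ranked_genes, gene_set):
    """Maximum of a running sum walked down an ordered gene list.

    Each gene-set member adds sqrt((N - G) / G); every other gene
    subtracts sqrt(G / (N - G)), so the walk ends at zero.
    N = genes in the list, G = gene-set members present in the list.
    """
    n_total = len(ranked_genes)
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    if n_hit == 0 or n_hit == n_total:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    step_hit = sqrt((n_total - n_hit) / n_hit)
    step_miss = sqrt(n_hit / (n_total - n_hit))
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += step_hit if gene in members else -step_miss
        best = max(best, running)
    return best


# Toy ordered list of ten hypothetical genes, most up-regulated first.
ranked = [f"g{i}" for i in range(1, 11)]
es_top = running_sum_es(ranked, {"g1", "g2"})      # members at the top
es_bottom = running_sum_es(ranked, {"g9", "g10"})  # members at the bottom
print(es_top, es_bottom)
```

Unlike ORA, no cutoff is needed to call genes 'altered' first; the score depends only on where the gene-set members fall in the ordered list. In practice the significance of such a score is assessed by permutation, which is omitted here.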

 

  In spite of its promising utility, implementing the GSEA algorithm (or other kinds of FCS) in GEAR does not seem appropriate because of the intrinsic nature of genomic alterations. Unlike expression values (the measured amount of mRNA transcripts), chromosomal alterations are basically n-based (1n for unilateral loss such as LOH, 2n as diploid, 3n as single copy number gain, >4n as high copy number gain; minor changes in log2 values might reflect the intrinsic array noise, the cellular homogeneity of the sample, etc.). The original version of GSEA applies the non-parametric Kolmogorov-Smirnov statistic to an ordered gene list; however, one can ask whether it is reasonable to use the log2 values of probes as an ordered list in either a non-parametric or a parametric manner. One possible solution is to assign the probes or genes different weights according to their copy number changes (i.e. single copy gains and high-level amplifications are discriminated); however, this strategy is not feasible in the current GEAR algorithm (it is possible to use a very stringent cutoff for the identification of alterations, or to use high-level amplifications as custom alterations).
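The 'n-based' nature of copy number data can be illustrated with a small calculation. The sketch below assumes an idealized array with a diploid (2n) reference and no noise; the function name `expected_log2_ratio` and the `tumor_fraction` parameter (the fraction of cells carrying the alteration) are hypothetical and are not GEAR options.

```python
from math import log2


def expected_log2_ratio(copy_number, tumor_fraction=1.0):
    """Idealized log2 ratio against a diploid reference.

    Cells without the alteration are assumed to be diploid (2 copies).
    """
    mean_copies = tumor_fraction * copy_number + (1.0 - tumor_fraction) * 2
    return log2(mean_copies / 2)


for n in (1, 2, 3, 4):
    print(f"{n}n -> log2 ratio {expected_log2_ratio(n):+.3f}")

# The same single-copy gain in a sample where only half of the cells carry it.
print(f"3n at 50% -> {expected_log2_ratio(3, tumor_fraction=0.5):+.3f}")
```

Under these assumptions the expected values fall on a few discrete levels, and a mixed cell population pulls them toward zero. This is one way to see why small differences in log2 values are hard to interpret as a continuous ranking.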

 

 

 

(4) Jackpot effects

 

  In GEAR results, gene sets that include a localized gene cluster (a number of genes with similar functional annotation and similar genomic position) are frequently observed to be enriched and identified as candidate genomic signatures. This phenomenon was previously reported in the search for CNV-related functional categories and termed the 'Jackpot effect' (Cooper GM et al, Nature Genet. 39:S22, 2007) - "functional categories such as 'cell adhesion' and 'structural proteins' are also enriched, although these observations are driven largely by a few gene clusters, namely the LCE and keratin (structural proteins) and protocadherin (cell adhesion) loci (that is, these enrichment values result from a 'jackpot' effect in which one or a few CNVs overlap with dozens of distinct but functionally related genes because these genes reside in a genomic cluster)".  For example, a gene set with the functional annotation 'ethanol oxidation' has seven genes with the similar annotation 'alcohol dehydrogenase' (ADH1A, ADH1B, ADH1C, ADH4, ADH5, ADH6, ADH). Since all seven genes are located in a small, localized area in 4q23 (occupying less than 0.5 Mb), this gene set can be identified as 'significantly enriched' by alterations involving this locus. It is not easy to determine whether such gene sets indeed represent biologically relevant functions associated with chromosomal alterations. It is reasonable to assume, however, that their potential significance is not as strong as that of gene sets whose genes have multiple 'hits' at different genomic loci.  One possible solution (in the case of large insert clone-based array-CGH) is to use clone-based annotation of functional gene sets (only a suggestion; not currently feasible in the GEAR algorithm). If multiple genes map to a single clone, they can be counted as 'a single hit' rather than counting every gene located in the clone, to reduce the overestimation of enrichment and the possible 'Jackpot' effect.
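The clone-based counting suggested above could look roughly like the following sketch. As noted, this is not available in GEAR; the function `count_hits`, the gene-to-clone mapping and the clone ID are all hypothetical and serve only to show the difference between the two ways of counting.

```python
def count_hits(gene_set, altered_genes, gene_to_clone=None):
    """Count gene-set members that fall within alterations.

    Without a mapping, every altered gene counts as one hit.
    With a gene-to-clone mapping, genes sharing a clone count once.
    """
    hits = [gene for gene in gene_set if gene in altered_genes]
    if gene_to_clone is None:
        return len(hits)
    # Genes missing from the mapping are kept as their own unit.
    return len({gene_to_clone.get(gene, gene) for gene in hits})


# Six of the clustered alcohol dehydrogenase genes named in the text.
ethanol_oxidation = ["ADH1A", "ADH1B", "ADH1C", "ADH4", "ADH5", "ADH6"]
altered = set(ethanol_oxidation)  # assume one gain covers the whole cluster

# Hypothetical assumption: all six genes are represented by a single clone.
clone_map = {gene: "clone_4q23" for gene in ethanol_oxidation}

per_gene = count_hits(ethanol_oxidation, altered)
per_clone = count_hits(ethanol_oxidation, altered, clone_map)
print(per_gene, per_clone)
```

With per-gene counting one alteration produces six hits, while clone-based counting reduces it to a single hit, which is the intended damping of the 'Jackpot' effect. The gene-set size used in the hypergeometric test would have to be collapsed in the same way for the statistic to stay consistent.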

 

 

 

(5) Gene sets enriched both in chromosomal gains and losses

 

  Another intriguing situation is that a single functional gene set can be identified in opposite situations. For example, how should one interpret a functional gene set enriched in both recurrent chromosomal gains and losses? A similar phenomenon was previously observed in expression microarray data and described in Curtis RK et al, Trends Biotechnol, 23:429, 2005: "...It is possible for a pathway to be both upregulated and downregulated, perhaps because of a block in the pathway where genes above and below the block respond differently." As mentioned, this phenomenon might represent quite interesting behavior of a functional gene set; however, the interpretation is not easy and depends on the researcher's basic assumptions or choices. A recent paper has proposed an interesting but simple method to identify such gene sets with increased statistical power (Saxena V et al, Nucleic Acids Res, 22:e151, 2006). The authors proposed that a gene set enriched at both the top and the bottom of an ordered gene list can be efficiently identified by simply taking the absolute value of the gene correlation. To apply this absolute enrichment in GEAR, save the recurrent gains and losses detected by SW-array, collect them as a single custom signature, and analyze it using GEAR.

 

 

 

(6) The number of gene sets to be analyzed

 

  In GEAR analysis, the minimum and maximum number of genes per gene set (gene size) should be set to obtain relevant biological information (the GEAR manual addresses why gene sets that are too small or too large are inappropriate for functional enrichment analysis such as GEAR). In practice, among the ~5,000 functional gene sets provided in Pathway.txt of the test datasets, only ~2,000 gene sets are selected for subsequent analysis with the size limits of min. 5 - max. 200 genes (the actual number can be much smaller than 2,000 if you use moderate-resolution, large insert clone-based array-CGH). This brings in another consideration: the number of gene sets (not the number of genes, or gene size, of a gene set).
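The size filter described above can be sketched as follows. This is an illustration under assumptions, not GEAR's code: the function `filter_gene_sets`, the in-memory dictionary of gene sets and the optional `measured_genes` argument (genes actually represented on the array) are hypothetical, and the format of Pathway.txt is not assumed here.

```python
def filter_gene_sets(gene_sets, min_size=5, max_size=200, measured_genes=None):
    """Keep gene sets whose size lies within [min_size, max_size].

    If measured_genes is given, only genes present on the array count
    toward the size of each gene set.
    """
    measured = None if measured_genes is None else set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        members = set(genes)
        if measured is not None:
            members &= measured
        if min_size <= len(members) <= max_size:
            kept[name] = members
    return kept


# Toy gene sets with hypothetical names and gene identifiers.
toy_sets = {
    "too_small": [f"a{i}" for i in range(3)],
    "suitable": [f"b{i}" for i in range(6)],
    "too_large": [f"c{i}" for i in range(250)],
}
print(sorted(filter_gene_sets(toy_sets)))

# On a lower-resolution array only some genes are represented.
on_array = ["b0", "b1", "b2", "b3"]
print(sorted(filter_gene_sets(toy_sets, measured_genes=on_array)))
```

The second call shows why fewer gene sets may survive on a moderate-resolution array: a gene set can drop below the minimum size once only the genes covered by the array are counted.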

 

  The GEAR algorithm applies multiple testing adjustment based on Bonferroni correction or FDR (false discovery rate), and both methods depend heavily on the number of gene sets (that is, the number of hypotheses tested). If you use a smaller number of gene sets, the corrected P-values become more significant (the uncorrected P-values are not affected). Thus, it is often helpful to restrict the functional gene sets to those of biological relevance (e.g. selecting only signal transduction-related or metabolism-related pathways) to obtain a reasonable significance level. The related issues are described in Lewin A and Grieve IC, BMC Bioinformatics, 7:426, 2006, in which the authors recommend reducing the number of gene sets to gain statistical power (more precisely, grouping related functions or categories according to the hierarchy of GO annotations; they used a custom-made tool, POSOC).
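The dependence of the corrected P-values on the number of gene sets can be seen in a short sketch. It assumes the standard Bonferroni correction and the Benjamini-Hochberg procedure for FDR; whether GEAR uses exactly this FDR procedure is an assumption, and the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.01, 0.02, 0.5]
print(bonferroni(raw))
print(benjamini_hochberg(raw))

# The same raw P-value tested among 2,000 versus 200 gene sets.
p = 1e-5
print(min(1.0, p * 2000), min(1.0, p * 200))
```

The last line shows the point made above: the raw P-value is unchanged, but its Bonferroni-corrected value is ten times smaller when only 200 gene sets are tested instead of 2,000.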

 

 

 

(7) GEAR application to other kinds of biological information - CNV (copy number variation)

 

  One of the remarkable recent discoveries in human genomics is that large-scale structural variations such as duplications, deletions and inversions are prevalent in the phenotypically normal population. Such variants are collectively termed copy number variation (CNV) and represent genomic alterations in the normal population. Because CNVs are basically similar to chromosomal alterations in the cancer genome (despite practical problems in determining CNVs, such as the lack of a universal reference), preliminary studies on CNVs have used functional enrichment analysis and observed that genes with functional annotations of sensory perception, immunity and cell adhesion are significantly enriched in CNV-affected genomic regions (Redon R et al, Nature 444:444, 2006). You can download already published CNVs or collect your own CNVs and make 'custom alterations' to use them for GEAR analysis. This is an example of an extended use of the GEAR algorithm.

 

 

 

 

 


 

 

FAQ list will be updated with your valuable questions/comments

 

 

 

Q: Can GEAR be used to analyze copy number variations using SNP data?

 

A: The basic steps of 'functional enrichment analysis' of a genome-wide alteration profile are:

(1) the basic preprocessing of array data - NOT PROVIDED in GEAR

- unfortunately, this procedure depends heavily on the 'platform' used, and it is generally recommended to use the manufacturer-recommended method or your lab's own method (such as normalization)

 

(2) the determination of alterations of 'interest' - PROVIDED in GEAR, but I doubt it is suitable in your case

- in your case, I suppose you are asking how to interpret 'CNVR' or 'individual CNVs' in functional terms, since recent CNV reports use functional enrichment analysis as a general analytic method.

- unfortunately again, GEAR was originally designed to deal with moderate-resolution (~3,000 large insert clones or BACs) array-CGH.

- I have recently tried ~30,000 oligonucleotide-based array-CGH. GEAR works well with the SW-ARRAY option (its own smoothing-like method); however, I think the intrinsic problem of oligo-based array-CGH (low signal-to-noise ratio) makes it difficult to define biologically relevant alterations with acceptable 'robustness'.

- I think this is also the problem with oligo-based CNV identification. In my opinion, SNP-oligonucleotide based arrays (such as the Affymetrix 500K) were not originally designed to deal with CNV and have several pitfalls (although Redon's Nature study used this platform, the low signal-to-noise ratio is a major disadvantage compared with tiling-path array-CGH). Moreover, I doubt whether GEAR can accept this large amount of data in array form. Thus, it does not seem feasible to use GEAR's basic methods to define the alterations for SNP-oligo based arrays.

 

(3) USE 'CUSTOM ALTERATIONS' INSTEAD

- Alternatively, you can use 'custom alterations' as a possible solution (I have mentioned this in the web-available note on the GEAR homepage). After basic processing of the array data, you can define the alterations of interest to you (such as CNVRs). They will look like this (this is the example 'custom alteration' file in the GEAR package), and GEAR can accept these 'alterations' without loading the full array data.

 

ID       Chromosome   Start position   End position
RAR-G1   chr2         60611240         61194391
RAR-G2   chr3         173181079        178318222
RAR-G3   chr7         15830008         19957311
RAR-G4   chr8         126599181        130578066
RAR-G5   chr11        21566334         22575764
RAR-G6   chr11        34759834         35329998
RAR-G7   chrX         118336727        123861355

 

- This can be made with GeneChip + CNAT (in the case of Affymetrix 500K, HMM-based alterations) or whatever method you prefer. I think that you already have your own list like this if you are planning your CNV report for publication. Please make sure that the genomic coordinates are those of UCSC 2004 or 2006, and use the same version of the gene mapping list.
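As a rough illustration of how such a 'custom alteration' list relates to genes, the sketch below parses rows in the four-column layout shown above and reports which genes overlap each region. It is not GEAR's own code: the function names, the whitespace-separated parsing and the gene coordinates (GENE_A, GENE_B, GENE_C) are hypothetical, and real gene positions must come from the gene mapping list of the same UCSC version.

```python
def parse_custom_alterations(text):
    """Parse rows of ID, chromosome, start and end; skip other lines."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        # Header and blank lines do not have four fields with numeric positions.
        if len(fields) != 4 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        region_id, chrom, start, end = fields
        regions.append((region_id, chrom, int(start), int(end)))
    return regions


def genes_in_regions(genes, regions):
    """Map each region ID to the genes that overlap it."""
    hits = {}
    for symbol, g_chrom, g_start, g_end in genes:
        for region_id, r_chrom, r_start, r_end in regions:
            if g_chrom == r_chrom and g_start <= r_end and g_end >= r_start:
                hits.setdefault(region_id, []).append(symbol)
    return hits


EXAMPLE = """\
ID Chromosome Start position End position
RAR-G1 chr2 60611240 61194391
RAR-G4 chr8 126599181 130578066
"""

# Hypothetical genes with made-up coordinates, for illustration only.
HYPOTHETICAL_GENES = [
    ("GENE_A", "chr8", 128000000, 128005000),
    ("GENE_B", "chr8", 140000000, 140010000),
    ("GENE_C", "chr2", 61190000, 61200000),
]

regions = parse_custom_alterations(EXAMPLE)
overlaps = genes_in_regions(HYPOTHETICAL_GENES, regions)
print(overlaps)
```

A gene counts as a hit here if any part of it overlaps the region; stricter rules (for example, requiring the whole gene to lie inside the alteration) are equally possible and would change the gene lists passed on to the enrichment test.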
