Introduction

Genome-wide association studies (GWASs) have been used to analyse the genetic architecture of common diseases and quantitative traits (Visscher et al., 2012). These studies assess common variants, that is, variants with a minor allele frequency (MAF) >5% in the human genome. They have been completed for most common diseases and numerous associated traits, and have uncovered more than two thousand disease-related common variants (NHGRI, 2015). However, these common variants have very small effect sizes and only modest value in predicting disease risk or quantitative traits.
For example, a substantial meta-analysis of GWASs of type 2 diabetes (T2D) in more than 10,128 people identified more than 18 SNPs associated with the disease, but these sites explain only 6% of the heritability of T2D and do not explain the causal biology (Zeggini et al., 2008). Likewise, in Crohn's disease, a GWAS meta-analysis in more than 210,000 people identified 70 loci associated with the disease, but these explain only 23% of the increased disease risk between relatives (Franke et al., 2010). Generally, the majority of common variants identified through GWASs have shed no light on the causal biology of the disease or trait.
This problem is referred to as missing heritability. Low-frequency and rare variants might account for a portion of the missing heritability. Thus, it is reasonable to expect that analyses of low-frequency variants (0.5% ≤ MAF < 5%) and rare variants (MAF < 0.5%) could help to explain disease risk or quantitative trait variation (Lee et al., 2014). Advances in sequencing technologies allow in-depth examination of the genetic contribution of rare variants to complex traits.
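The frequency classes used in this essay can be written down as a small sketch. The function names are hypothetical, and assigning a MAF of exactly 5% to the common class is an assumption, since the text leaves that boundary undefined.

```python
from typing import Sequence


def minor_allele_frequency(alt_allele_counts: Sequence[int]) -> float:
    """MAF from per-individual alternate-allele counts (0, 1 or 2) in diploids."""
    if not alt_allele_counts:
        raise ValueError("at least one genotype is required")
    alt_freq = sum(alt_allele_counts) / (2 * len(alt_allele_counts))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def classify_variant(maf: float) -> str:
    """Assign a frequency class using the thresholds quoted in the text."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf >= 0.05:      # assumption: exactly 5% is treated as common
        return "common"
    if maf >= 0.005:     # 0.5% <= MAF < 5%
        return "low-frequency"
    return "rare"        # MAF < 0.5%


if __name__ == "__main__":
    genotypes = [0, 1, 0, 0, 2, 0, 0, 0, 0, 0]  # toy data, ten individuals
    maf = minor_allele_frequency(genotypes)
    print(maf, classify_variant(maf))
```

In a real study the MAF would be estimated from thousands of individuals; with only a handful of samples, a rare variant is unlikely to be observed at all.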
This essay looks into the challenges of studying rare variants and the sequencing approaches and statistical methods that can be used for rare variant detection and association testing. It also mentions some recent studies that discovered rare variants.

Rare variants

Theory states that purifying selection keeps rare variants of strong effect at low frequency in the population. Highly penetrant rare variants play an essential role in many Mendelian disorders and rare forms of complex diseases. Genotyping arrays have ignored this part of the allele frequency spectrum because there were no systematic catalogues of rare variants to support array design. Because looking for rare variants would have required multiple assays that existing arrays did not support, it was reasonable to focus first on common variants.
However, accelerating advances in sequencing technologies help to locate and identify low-frequency and rare variants and then investigate their effects on complex traits. Next-generation sequencing (NGS) technologies are capable of generating a substantial amount of sequence data in a relatively short time at a reasonable cost, and have revolutionised genome research in recent years. NGS produces billions of short reads; these reads are aligned to a reference genome so that researchers can identify and genotype sites where the sequenced people differ. The cost of sequencing has fallen, allowing exome and whole-genome sequencing studies of common diseases. Examples of exome sequencing studies include the NHLBI Exome Sequencing Project (ESP), the UK10K project and the T2D-GENES project. These exome sequencing projects and others have provided dbSNP with over 60 million genetic variants, most of which are rare (Lee et al., 2014).
However, detecting low-frequency and rare variants in common diseases presents substantial challenges, despite the unique opportunity that sequencing provides to investigate their functions. Association studies based on deep whole-genome sequencing (WGS) require a large number of individuals, which is currently expensive. Because of this limitation, more cost-efficient alternatives have been proposed, including low-depth WGS, exome sequencing, targeted sequencing and custom arrays (Lee et al., 2014).
For example, researchers have used genotyping arrays, such as the Affymetrix and Illumina exome chips, to examine previously identified protein-coding variants across the allele frequency spectrum.
Moreover, classical single-variant tests are underpowered for low-frequency and rare variants unless sample sizes are very large. To solve this problem, researchers have developed statistical approaches designed specifically for rare variant analysis. Instead of examining the effects of single variants, these approaches assess association for several variants jointly in a target region, for instance a gene.

Arrays and sequencing platforms for rare variant analysis

Sequencing studies require multiple data processing and analysis steps. These include rigorous planning of platform and sample selection, quality control (QC), the choice of statistical tests, which variants to test for association, and prioritisation for replication. Deep WGS of a large number of individuals provides a great deal of information for association studies of complex traits and diseases.
For instance, sequencing one individual at 30x read depth, which means sequencing each base redundantly with an average of 30 reads in order to differentiate sequencing errors from true polymorphisms, results in more than 99% genotyping accuracy (Bentley et al., 2008). However, deep WGS has not been widely used in practice because of its high cost. Therefore, several sequencing strategies that take cost into consideration have been suggested and used.
Low-depth WGS has been used to sequence a large number of individuals at low cost. It is possible to sequence 7 or 8 individuals at 4x read depth (covering each position with an average of 4 reads) for the same cost as sequencing one individual at 30x read depth (Lee et al., 2014). Low-depth WGS is useful for discovering and genotyping shared variants, as the 1000 Genomes Project has shown (McVean et al., 2012).
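The trade-off between depth and sample size can be illustrated with a back-of-the-envelope sketch. It rests on two simplifying assumptions that are not taken from the cited studies: cost is proportional to the total number of bases sequenced, and the reads covering a site follow a Poisson distribution with each read sampling one of the two alleles at random. The function names are hypothetical.

```python
import math


def individuals_for_same_budget(deep_depth: float, low_depth: float) -> float:
    """Individuals sequenced at low depth for the cost of one at deep depth.

    Assumes cost scales linearly with depth (library preparation is ignored).
    """
    return deep_depth / low_depth


def prob_both_alleles_seen(depth: float) -> float:
    """Chance that both alleles of a heterozygous site are read at least once.

    With Poisson(depth) reads split evenly between two alleles, each allele
    receives an independent Poisson(depth / 2) number of reads.
    """
    prob_allele_seen = 1.0 - math.exp(-depth / 2.0)
    return prob_allele_seen ** 2


if __name__ == "__main__":
    print(individuals_for_same_budget(30, 4))  # 7.5, i.e. "7 or 8" individuals
    for depth in (4, 30):
        print(depth, round(prob_both_alleles_seen(depth), 4))
```

Under this toy model, roughly a quarter of heterozygous sites at 4x show reads from only one allele, which is one way to see why low-depth genotypes need help from other individuals' data.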
Low-depth WGS relies on linkage disequilibrium-based methods that combine information across individuals to improve variant detection and genotype estimation. However, low-depth sequencing has higher genotyping error rates than deep sequencing. Early studies indicated that low-depth WGS of a larger number of samples can be more powerful than deep WGS of a smaller number of samples, for both variant detection and follow-up disease association studies.
Another widely used strategy is exome sequencing. It captures and sequences the protein-coding regions, which make up about 1%-2% of the genome (Bamshad et al., 2011). Exome sequencing has been used to identify many rare variants associated with Mendelian disorders.
It is also effective at detecting previously unidentified variants in complex diseases or in familial conditions with many affected individuals. The first successful application was reported by Ng et al. (2010). They studied four patients of European ancestry, from three different families, who had Miller syndrome of unknown cause. Miller syndrome is an extremely rare Mendelian disorder characterised by many features, including cleft lip, hypoplasia and micrognathia. They captured and sequenced the protein-coding regions at 40x read depth.
They then used the HapMap and dbSNP databases as filters to eliminate common variants. They detected DHODH variants in each of the four patients: missense mutations predicted to be deleterious. They used Sanger sequencing to validate their findings. The candidate gene DHODH encodes an enzyme in the de novo pyrimidine biosynthesis pathway. Causal variants for many other Mendelian disorders, such as Kabuki syndrome (MIM 147920), have also been identified.
Currently, several empirical studies use exome sequencing in an attempt to detect genes and variants related to complex diseases. The NHLBI ESP sequenced the exomes of approximately 6,500 people to study phenotypes related to heart attack, blood pressure, stroke, blood lipid levels, chronic obstructive pulmonary disease and obesity (Fu et al., 2012; Tennessen et al., 2012). The T2D-GENES Consortium aims to identify genetic variants related to T2D and metabolic phenotypes, and has sequenced the exomes of roughly 10,000 people across five ancestry groups.
Exome sequencing performed at high coverage, with an average depth of 60x-80x in the targeted regions, gives more than 20x coverage for a large proportion (about 90%) of the coding regions (Do, Kathiresan and Abecasis, 2012). Exome sequencing also produces some reads from off-target regions; however, these reads are useful for assessing sequence quality and inferring population structure. The main limitation of exome sequencing is that it covers only the genetic variation in the exome, whereas non-coding regions can have a significant role in common diseases and traits.
Some findings from the ENCODE Project suggest that non-coding regions may play an essential biological role. Overall, its low cost and its focus on protein-coding regions make exome sequencing a crucial approach for studying rare variants (Lee et al., 2014).
A recently published study by Sims et al. (2017) set out to reveal the genes and variants associated with Alzheimer's disease through a three-stage case-control study of over 85,000 individuals. In the first stage, they genotyped over 16,000 late-onset Alzheimer's disease patients and over 18,000 controls using the Illumina HumanExome microarray. After quality control of the variants, they analysed common variants using a classic regression model in each sample group and combined the results using METAL. They analysed low-frequency and rare variants using a score test and combined the results using SeqMeta. In this stage they detected 43 candidate variants after removing known risk loci.
In the second stage, they tested these candidate variants for association in a separate group of over 14,000 patients and over 21,000 controls, using de novo genotyping and imputation. The variants from stage two were then carried forward to stage three for testing in a group of 6,652 cases and 8,345 controls, imputed using the Haplotype Reference Consortium resource. From these analyses, they uncovered four rare coding variants associated with late-onset Alzheimer's disease: a missense variant in PLCG2 (associated with reduced risk of the disease), a missense change in ABI3 (which showed evidence of increasing disease risk), and two independent variants in TREM2, one of which was previously recognised. These genes are highly expressed in microglia, and protein-protein interaction analysis indicated that they interact with other genes associated with Alzheimer's disease.
Genotype imputation is a strategy that uses known haplotypes in a population (a reference panel) to impute genotypes at SNPs that are missing from the data generated by the genotyping arrays used for GWAS. These predicted SNPs can then be tested for association with traits.
The missing SNPs can be predicted from reference panels such as HapMap, the 1000 Genomes Project or UK10K. Zheng et al. (2015) showed that, by combining the 1000 Genomes Project and UK10K reference panels, it is possible to identify rare variants related to bone mineral density. Genotype imputation thus boosts the coverage of variation, making it possible to examine more SNPs than are obtained from the original microarray. Imputation is also beneficial for meta-analysis studies, as it increases the overlap of variants among arrays.
Custom genotyping arrays are not ideal for capturing sufficient low-frequency and rare variants, but they are a cost-effective alternative to sequencing regions of interest. Metabochip is an example of a custom genotyping array, used for cardiovascular and metabolic diseases. Immunochip is used for inflammatory and autoimmune diseases.
These chips were designed around high-priority variants from sequencing and GWAS studies. They contain a selection of common variants for replicating novel GWAS signals and a selection of low-frequency and common variants that allows comprehensive testing of many regions linked to a particular phenotype. Other inexpensive arrays include the Illumina and Affymetrix exome chips (Do, Kathiresan and Abecasis, 2012).

Association tests for low-frequency and rare variants

To analyse the low-frequency and rare variants identified in sequencing studies, new statistical methods for examining single and multiple variants are needed. The single-variant regression model typically used in GWASs to test the association of genetic variants with a phenotype is poorly suited to rare variants. For example, the Wald test is widely used for testing common variants because it is fast to compute and broadly applicable. However, the Wald test has reduced power for detecting rare variants. To increase power, many alternative multi-variant tests have been designed.
These tests aggregate rare variants across a region, for example a gene, so that when several causal variants are present there is more power to detect association. However, not all variants in a region affect the phenotype. The proposed rare variant tests fall into four categories (Table 1).
Burden tests, such as the ARIEL test, CAST and the CMC method, collapse rare variants into a single predictor and then compare its distribution between cases and controls. These tests are powerful when a large fraction of the variants are causal. Each type of burden test summarises the variants differently. For instance, the simplest burden test counts the number of minor alleles across all variants in the set to create a score for each individual. The CAST test sets the score to 1 if at least one rare variant is present in the region assessed and to 0 otherwise. Madsen and Browning proposed the weighted sum statistic (WSS), which takes the frequencies of all variants into account and does not require a fixed threshold to separate rare from common variants, as CAST does.
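The three per-individual scores just described can be sketched in a few lines. This is a simplified illustration, not a full implementation of any published test: the function names are hypothetical, the genotype matrix is toy data, and the WSS-style weights follow the commonly cited Madsen and Browning form (frequencies estimated in controls with a pseudo-count) while omitting the rank-sum statistic and the permutation step of the original method. The CAST indicator is coded as 1 when at least one rare allele is present and 0 otherwise.

```python
import math
from typing import List, Sequence

# Rows are individuals, columns are variants, entries are minor-allele counts.
Genotypes = Sequence[Sequence[int]]


def burden_count(genotypes: Genotypes) -> List[int]:
    """Simple burden score: total number of minor alleles per individual."""
    return [sum(row) for row in genotypes]


def cast_indicator(genotypes: Genotypes) -> List[int]:
    """CAST-style score: 1 if the individual carries any rare allele, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def wss_style_weights(genotypes: Genotypes, is_control: Sequence[bool]) -> List[float]:
    """Weights that up-weight rarer variants, based on control frequencies."""
    n_total = len(genotypes)
    controls = [row for row, flag in zip(genotypes, is_control) if flag]
    weights = []
    for j in range(len(genotypes[0])):
        minor_in_controls = sum(row[j] for row in controls)
        # Pseudo-counts keep the estimated frequency strictly between 0 and 1.
        q = (minor_in_controls + 1) / (2 * len(controls) + 2)
        weights.append(1.0 / math.sqrt(n_total * q * (1.0 - q)))
    return weights


def weighted_burden(genotypes: Genotypes, weights: Sequence[float]) -> List[float]:
    """Weighted sum of minor-allele counts per individual."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


if __name__ == "__main__":
    toy = [[0, 0, 1], [1, 0, 0], [0, 0, 0], [0, 2, 1]]
    control_flags = [False, False, True, True]
    print(burden_count(toy))
    print(cast_indicator(toy))
    print(weighted_burden(toy, wss_style_weights(toy, control_flags)))
```

Whichever score is chosen, it is then compared between cases and controls, for example as a single predictor in a regression model.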
The limitation of burden tests is that they make a strong assumption: all variants tested in the functional region are causal and affect the trait in the same direction and with a similar magnitude. When this assumption is violated, power is low.
Adaptive burden tests have been developed to address this limitation of the basic burden tests.
Adaptive burden tests are robust to the presence of null variants and allow for multiple effect directions. For example, the data-adaptive sum test (aSum) developed by Han et al. (2010) estimates the effect direction of each variant in a marginal model and then performs the burden test with the estimated directions. This approach needs permutation to estimate P-values.
Although these tests are more robust than basic burden tests, their limitation is that the marginal models can be unstable, and permutation is computationally intensive.
Variance-component tests, such as C-alpha, SKAT and SSU, have been designed for the scenario where both protective and risk variants may be present within a gene or functional unit. They test the distribution of genetic effects within a set of variants. This method is flexible and allows for a mixture of effects in the rare variant set. SKAT is the most popular test; it can accommodate variant weights, covariates and family structure, and was developed mainly for quantitative traits.
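A minimal sketch can show why a variance-component statistic keeps its signal when risk and protective variants sit in the same gene, whereas a burden statistic can cancel out. It assumes a quantitative trait with an intercept-only null model (no covariates or family structure), and it computes the statistics only; the P-value of a SKAT-type statistic comes from a mixture of chi-square distributions, which is not shown. The function names are hypothetical, and the Beta(1, 25) density is used for the weights as an assumption, being a common default in SKAT software.

```python
import math
from typing import List, Sequence

Genotypes = Sequence[Sequence[int]]  # rows = individuals, columns = variants


def score_per_variant(genotypes: Genotypes, phenotype: Sequence[float]) -> List[float]:
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y))."""
    mean_y = sum(phenotype) / len(phenotype)
    residuals = [y - mean_y for y in phenotype]
    n_variants = len(genotypes[0])
    return [
        sum(row[j] * r for row, r in zip(genotypes, residuals))
        for j in range(n_variants)
    ]


def skat_style_q(genotypes: Genotypes, phenotype: Sequence[float],
                 sqrt_weights: Sequence[float]) -> float:
    """Variance-component statistic: sum of squared weighted scores."""
    scores = score_per_variant(genotypes, phenotype)
    return sum((w * s) ** 2 for w, s in zip(sqrt_weights, scores))


def burden_style_stat(genotypes: Genotypes, phenotype: Sequence[float],
                      sqrt_weights: Sequence[float]) -> float:
    """Burden statistic: square of the summed weighted scores."""
    scores = score_per_variant(genotypes, phenotype)
    return sum(w * s for w, s in zip(sqrt_weights, scores)) ** 2


def beta_density_weight(maf: float, a: float = 1.0, b: float = 25.0) -> float:
    """Beta(a, b) density at the MAF, which up-weights rarer variants."""
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0) / math.exp(log_beta)


if __name__ == "__main__":
    # Toy data: variant 1 raises the trait, variant 2 lowers it.
    toy_genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
    toy_trait = [2.0, 2.0, -2.0, -2.0, 0.0, 0.0]
    equal_weights = [1.0, 1.0]
    print(skat_style_q(toy_genotypes, toy_trait, equal_weights))
    print(burden_style_stat(toy_genotypes, toy_trait, equal_weights))
```

In the toy data the two per-variant scores are equal in size and opposite in sign, so they cancel in the burden statistic but add up in the variance-component statistic.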
Overall, these tests examine the combined effect of a group of rare variants rather than their individual effects. Therefore, if an association is identified, further analyses will be needed to establish which variants in the group cause it. These tests also cannot estimate the heritability due to rare variants; additional analyses using an appropriate method may be required.

Table 1. Summary of statistical tests that have been proposed to test rare variant association. Table adapted from Lee et al. (2014).

Discussion

Many GWASs have been performed to look at genetic variation associated with diseases or traits, with a particular focus on common variants.
GWASs have explained only a small portion of heritability. Heritability refers to the proportion of phenotypic variation that can be explained by genetic components. This is the so-called missing heritability problem. Because common variants cannot fully account for heritability, rare variants may help to explain trait variability and disease risk. Advances in sequencing technologies and their falling cost make it possible to collect a large number of variants, among which some are likely to contribute to polygenic traits. However, it is challenging to identify rare variants associated with diseases and traits precisely because of their rarity.
Genome-wide case-control comparisons are underpowered. The pattern of genetic risk seen for common variants in complex traits such as autism, diabetes or schizophrenia is exactly the same for rare variants: there is a great deal of polygenicity. Many genes are potentially involved in disease risk, and extremely large sample sizes will be required to identify those genes unequivocally against this background of polygenic inheritance.
Many things have to be considered in the analysis, for example which variants to select for association testing, since not all variants affect the phenotype. Rare variants should therefore be separated into classes, such as non-functional (for example synonymous) and functional. If a study includes neutral variation, it will dilute the signal in the data and make real associations harder to detect.
Bioinformatics tools have been established to predict the functional roles of variants.
The choice of statistical test also has to be considered. If prior information is available, for example about the likely proportion of causal variants or the direction of their effects, the association test can be chosen with this information in mind. In addition, population stratification within a study sample means that subgroups differ both in allele frequencies and in disease frequencies, and this leads to confounding.
To control for this, studies can use well-matched designs, adjust for population stratification in the statistical test (by calculating principal components and adding them to the regression model as covariates), or use family-based designs that test association within families.
Regarding the missing heritability problem, many explanations have been suggested, including epistasis, epigenetics, small effect sizes, gene interactions and the limitations of GWASs, among others. There is still no definitive explanation for missing heritability.
Sandoval-Motta et al. (2017) have suggested that, to understand missing heritability, we have to take into account the composition and function of the human microbiome. They give several reasons for their hypothesis. Many human traits, such as obesity and cancer, are associated with the composition of the human microbiome. And because our microbiome has a larger genome than our own, it could be a substantial source of variation and phenotypic plasticity.
Moreover, our genotype interacts with the composition and structure of our microbiome. In addition, the genetic structure of the microbiome can be influenced by the host environment or by transmission from other hosts. Thus, familial studies might overestimate genetic similarity.
Further improvements may be needed to test the findings of GWASs and translate them into clinical practice. The declining cost of sequencing and advances in sequencing approaches promise to provide the power needed to discover informative low-frequency and rare variants through whole-exome and whole-genome sequencing. This might in turn make it possible to develop therapeutic targets.