GWAS of 126,559 Individuals Identifies Genetic Variants Associated with Educational Attainment

Science  21 Jun 2013:
Vol. 340, Issue 6139, pp. 1467-1471
DOI: 10.1126/science.1235488

Genetic College

Many genomic elements in humans are associated with behavior, including educational attainment. In a genome-wide association study including more than 100,000 samples, Rietveld et al. (p. 1467, published online 30 May; see the Perspective by Flint and Munafò) looked for genes related to educational attainment in Caucasians. Small genetic effects at three loci appeared to impact educational attainment.


A genome-wide association study (GWAS) of educational attainment was conducted in a discovery sample of 101,069 individuals and a replication sample of 25,490. Three independent single-nucleotide polymorphisms (SNPs) are genome-wide significant (rs9320913, rs11584700, rs4851266), and all three replicate. Estimated effect sizes are small (coefficient of determination R² ≈ 0.02%), approximately 1 month of schooling per allele. A linear polygenic score from all measured SNPs accounts for ≈2% of the variance in both educational attainment and cognitive function. Genes in the region of the loci have previously been associated with health, cognitive, and central nervous system phenotypes, and bioinformatics analyses suggest the involvement of the anterior caudate nucleus. These findings provide promising candidate SNPs for follow-up work, and our effect size estimates can anchor power analyses in social-science genetics.

Twin and family studies suggest that a broad range of psychological traits (1), economic preferences (2–4), and social and economic outcomes (5) are moderately heritable. Discovery of genetic variants associated with such traits may lead to insights regarding the biological pathways underlying human behavior. If the predictive power of a set of genetic variants considered jointly is sufficiently large, then a “risk score” that aggregates their effects could be useful to control for genetic factors that are otherwise unobserved, or to identify populations with certain genetic propensities, for example in the context of medical intervention (6).

To date, however, few if any robust associations between specific genetic variants and social-scientific outcomes have been identified, probably because existing work [for a review, see (7)] has relied on samples that are too small [for discussion, see (4, 6, 8, 9)]. In this paper, we apply to a complex behavioral trait, educational attainment, an approach to gene discovery that has been successfully applied to medical and physical phenotypes (10), namely meta-analyzing data from multiple samples.

The phenotype of educational attainment is available in many samples with genotyped participants (5). Educational attainment is influenced by many known environmental factors, including public policies. Educational attainment is strongly associated with social outcomes, and there is a well-documented health-education gradient (5, 11). Estimates suggest that around 40% of the variance in educational attainment is explained by genetic factors (5). Furthermore, educational attainment is moderately correlated with other heritable characteristics (1), including cognitive function (12) and personality traits related to persistence and self-discipline (13).

To create a harmonized measure of educational attainment, we coded study-specific measures using the International Standard Classification of Education (1997) scale (14). We analyzed a quantitative variable defined as an individual’s years of schooling (“EduYears”) and a binary variable for College completion (“College”). College may be more comparable across countries, whereas EduYears contains more information about individual differences within countries.

A genome-wide association study (GWAS) meta-analysis was performed across 42 cohorts in the discovery phase. The overall discovery sample comprises 101,069 individuals for EduYears and 95,427 for College. Analyses were performed at the cohort level according to a prespecified analysis plan, which restricted the sample to Caucasians (to help reduce stratification concerns). Educational attainment was measured at an age at which participants were very likely to have completed their education [more than 95% of the sample was at least 30 (5)]. On average, participants have 13.3 years of schooling, and 23.1% have a College degree. To enable pooling of GWAS results, all studies conducted analyses with data imputed to the HapMap 2 CEU (r22.b36) reference set. To guard against population stratification, the first four principal components of the genotypic data were included as controls in all the cohort-level analyses. All study-specific GWAS results were quality controlled, cross-checked, and meta-analyzed using single genomic control and a sample-size weighting scheme at three independent analysis centers.
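The pooling step described above can be sketched in a few lines. This is a minimal illustration, assuming the standard sample-size-weighted z-score scheme (each cohort weighted by the square root of its N) and a genomic-control correction that divides a cohort's z-scores by the square root of its inflation factor. The function names and example numbers are hypothetical and are not taken from the consortium's pipeline.

```python
from math import sqrt
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(z_scores):
    """Inflation factor: median observed chi-square over its null median."""
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN


def apply_genomic_control(z_scores):
    """Divide z-scores by sqrt(lambda); leave them unchanged if lambda <= 1."""
    lam = max(genomic_control_lambda(z_scores), 1.0)
    return [z / sqrt(lam) for z in z_scores]


def sample_size_weighted_z(z_by_cohort, n_by_cohort):
    """Combine one SNP's cohort-level z-scores using weights sqrt(N_i)."""
    weights = [sqrt(n) for n in n_by_cohort]
    numerator = sum(w * z for w, z in zip(weights, z_by_cohort))
    return numerator / sqrt(sum(w * w for w in weights))


# Hypothetical example: one SNP observed in three cohorts of different size.
combined_z = sample_size_weighted_z([2.0, 1.5, 2.5], [10_000, 5_000, 20_000])
print(f"combined z = {combined_z:.3f}")
```

Weighting by the square root of N gives larger cohorts more influence without requiring effect sizes to be on a common scale, which is one reason this scheme suits phenotypes harmonized across many studies.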

At the cohort level, there is little evidence of general inflation of P values. As in previous GWA studies of complex traits (15), the Q-Q plot of the meta-analysis exhibits strong inflation. This inflation is not driven by specific cohorts and is expected for a highly polygenic phenotype even in the absence of population stratification (16).

From the discovery stage, we identified one genome-wide–significant locus (rs9320913, P = 4.2 × 10⁻⁹) and three suggestive loci (defined as P < 10⁻⁶) for EduYears. For College, we identified two genome-wide–significant loci (rs11584700, P = 2.1 × 10⁻⁹, and rs4851266, P = 2.2 × 10⁻⁹) and an additional four suggestive loci (Table 1). We conducted replication analyses in 12 additional, independent cohorts that became available after the completion of the discovery meta-analysis, using the same prespecified analysis plan. For both EduYears and College, the replication sample comprises 25,490 individuals.

Table 1 The results of the GWAS meta-analysis for the independent signals reaching P < 10⁻⁶ in the discovery stage.

The rows in bold are the independent signals reaching P < 5 × 10⁻⁸ in the discovery stage. “Frequency” refers to allele frequency in the combined-stage meta-analysis. “Beta/OR” refers to the effect size in the EduYears analysis and to the odds ratio in the College analysis. All P values are from the sample-size–weighted meta-analysis (fixed effects). The P value in the replication-stage meta-analysis was calculated from a one-sided test. I² represents the percent heterogeneity of effect size between the discovery-stage studies. Phet is the heterogeneity P value. bp, base pair.

For each of the 10 loci that reached at least suggestive significance, we brought forward for replication the single-nucleotide polymorphism (SNP) with the lowest P value. The three genome-wide–significant SNPs replicate at the Bonferroni-adjusted 5% level, with point estimates of the same sign and similar magnitude (Fig. 1 and Table 1). The seven loci that did not reach genome-wide significance did not replicate (the effect went in the anticipated direction in five out of seven cases). The meta-analytic findings are not driven by extreme results in a small number of cohorts (see Phet in Table 1), by cohorts from a specific geographic region (figs. S7 to S15), or by a single sex (figs. S3 to S6). Given the high correlation between EduYears and College (5), it is unsurprising that the set of SNPs with low P values exhibit considerable overlap in the two analyses (tables S8 and S9).

Fig. 1 Regional association plots of replicated loci associated with educational attainment.

(A) rs9320913, (B) rs11584700, (C) rs4851266. The plots are centered on the SNPs with the lowest P values in the discovery stage (purple diamonds). The R² values are from the CEU HapMap 2 samples. The CEU HapMap 2 recombination rates are indicated in blue on the right y axes. The figures were created with LocusZoom. Mb, megabases.

The observed effect sizes of the three replicated individual SNPs are small [see (5) for discussion]. For EduYears, the strongest effect identified (rs9320913) explains 0.022% of phenotypic variance in the replication sample. This coefficient of determination R² corresponds to a difference of ≈1 month of schooling per allele. For College completion, the SNP with the strongest estimated effect (rs11584700) has an odds ratio of 0.912 in the replication sample, equivalent to a 1.8 percentage-point difference per allele in the frequency of completing College.
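The link between a SNP's share of variance explained and its per-allele effect can be made explicit with a short calculation. The sketch below assumes an additive model under Hardy-Weinberg equilibrium, in which a SNP with allele frequency f and per-allele effect beta explains 2f(1 − f)·beta²/sd² of the phenotypic variance. The allele frequency (0.5) and the standard deviation of EduYears (3 years) are placeholder values chosen for illustration; they are not figures reported in this text.

```python
from math import sqrt


def r2_from_beta(beta, allele_freq, phenotype_sd):
    """Variance share explained by one additive SNP under Hardy-Weinberg."""
    genotype_variance = 2.0 * allele_freq * (1.0 - allele_freq)
    return genotype_variance * beta ** 2 / phenotype_sd ** 2


def beta_from_r2(r2, allele_freq, phenotype_sd):
    """Per-allele effect (in phenotype units) implied by a given R^2."""
    genotype_variance = 2.0 * allele_freq * (1.0 - allele_freq)
    return phenotype_sd * sqrt(r2 / genotype_variance)


# R^2 = 0.022% is from the text; frequency and SD are ASSUMED placeholders.
beta_years = beta_from_r2(0.00022, allele_freq=0.5, phenotype_sd=3.0)
print(f"implied effect: {beta_years * 12:.2f} months of schooling per allele")
```

With these placeholder inputs the implied effect is somewhat under one month per allele, the same order of magnitude as the figure quoted above; the exact value depends on the allele frequency and phenotypic variance in the sample.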

We subsequently conducted a “combined stage” meta-analysis, including both the discovery and replication samples. This analysis revealed additional genome-wide–significant SNPs: four for EduYears and three for College. Three of these SNPs (rs1487441, rs11584700, rs4851264) are in linkage disequilibrium with the replicated SNPs. The remaining four are located in different loci and warrant replication attempts in future research: rs7309, a 3′ untranslated region (3′UTR) variant in TANK; rs11687170, close to GBX2; rs1056667, a 3′UTR variant in BTN1A1; and rs13401104 in ASB18.

Using the results of the combined meta-analyses of discovery and replication cohorts, we conducted a series of complementary and exploratory supplemental analyses to aid in interpreting and contextualizing the results: gene-based association tests, expression quantitative trait locus (eQTL) analyses of brain and blood tissue data, pathway analysis, functional annotation searches, enrichment analysis for cell-type–specific overlap with H3K4me3 chromatin marks, and predictions of likely gene function with the use of gene-expression data. Table S20 summarizes promising candidate loci identified through follow-up analyses (5). Two regions, in particular, showed convergent evidence from functional annotation, blood cis-eQTL analyses, and gene-based tests: chromosome 1q32 (including LRRN2, MDM4, and PIK3C2B) and chromosome 6 near the major histocompatibility complex. We also find evidence that in anterior caudate cells, there is enrichment of H3K4me3 chromatin marks (believed to be more common in active regulatory regions) in the genomic regions implicated by our analyses (fig. S20). Many of the implicated genes have previously been associated with health, central nervous system, or cognitive-process phenotypes in either human GWAS or model-animal studies (table S22). Gene coexpression analysis revealed that several implicated genes (including BSN, GBX2, LRRN2, and PIK3C2B) are probably involved in pathways related to cognitive processes (such as learning and long-term memory) and neuronal development or function (table S21).

Although the effects of individual SNPs on educational attainment are small, many of their potential uses in social science depend on their combined explanatory power. To evaluate the combined explanatory power, we constructed a linear polygenic score (5) for each of our two education measures using the meta-analysis results (combining discovery and replication), excluding one cohort. We tested these scores for association with educational attainment in the excluded cohort. We constructed the scores using SNPs whose nominal P values fall below a certain threshold, ranging from 5 × 10⁻⁸ (only the genome-wide–significant SNPs were included) to 1 (all SNPs were included).
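The score construction described above amounts to a weighted sum of allele counts. The following is a minimal sketch, assuming that each individual's genotype is coded as a count (0, 1, or 2) of the effect allele and that the weights are the meta-analysis effect estimates obtained with the prediction cohort left out. The SNP identifiers, weights, and P values are invented for illustration, and any further processing described in (5) is not reproduced here.

```python
def polygenic_score(allele_counts, weights, p_values, p_threshold=1.0):
    """Sum of allele count times estimated effect over SNPs passing the threshold.

    allele_counts: SNP id -> effect-allele count for one individual
    weights:       SNP id -> per-allele effect estimate from the meta-analysis
    p_values:      SNP id -> nominal P value from the same meta-analysis
    """
    return sum(
        allele_counts[snp] * weight
        for snp, weight in weights.items()
        if snp in allele_counts and p_values[snp] <= p_threshold
    )


# Hypothetical inputs; none of these values come from the study.
weights = {"snpA": 0.10, "snpB": -0.05, "snpC": 0.02}
p_values = {"snpA": 1e-9, "snpB": 1e-4, "snpC": 0.5}
person = {"snpA": 2, "snpB": 1, "snpC": 0}

for threshold in (5e-8, 1e-3, 1.0):
    score = polygenic_score(person, weights, p_values, threshold)
    print(f"P <= {threshold:g}: score = {score:.3f}")
```

Estimating the weights without the prediction cohort matters: if the same individuals contributed to both the weights and the test, the explanatory power of the score would be overstated.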

We replicated this procedure with two of the largest cohorts in the study, both of which are family-based samples [Queensland Institute of Medical Research (QIMR) and Swedish Twin Registry (STR)]. The results suggest that educational attainment is a highly polygenic trait (Fig. 2 and table S23): the amount of variance accounted for increases as the P value threshold becomes less conservative (i.e., includes more SNPs). The linear polygenic score from all measured SNPs accounts for ≈2% (P = 1.0 × 10⁻²⁹) of the variance in EduYears in the STR sample and ≈3% (P = 7.1 × 10⁻²⁴) in the QIMR sample.
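The "variance accounted for" by a score can be illustrated as the R² from regressing the phenotype on the score. The sketch below assumes a simple regression with no covariates, in which R² equals the squared Pearson correlation; the specification actually used in (5) may include controls, and the data here are invented.

```python
def r_squared(scores, phenotype):
    """R^2 from a simple linear regression of phenotype on score."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(phenotype) / n
    s_xy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, phenotype))
    s_xx = sum((s - mean_s) ** 2 for s in scores)
    s_yy = sum((y - mean_y) ** 2 for y in phenotype)
    return s_xy * s_xy / (s_xx * s_yy)


# Hypothetical prediction sample: polygenic scores and years of schooling.
scores = [0.10, -0.05, 0.20, 0.00, 0.15, -0.10]
edu_years = [13, 12, 16, 14, 12, 13]
print(f"R^2 = {r_squared(scores, edu_years):.3f}")
```

Repeating this calculation for scores built at each P value threshold yields a curve of the kind summarized in Fig. 2.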

Fig. 2 Explanatory power of the linear polygenic scores estimated for EduYears or College.

Solid lines show results from regressions of EduYears on linear polygenic scores in a set of unrelated individuals from the QIMR (n = 3526) and STR (n = 6770) cohorts. Dashed lines show results from regressions of cognitive function on linear polygenic scores in a sample from STR (n = 1419). The scores are constructed from the meta-analysis for either EduYears or College, excluding the cohort (either QIMR or STR) subsequently used as the prediction sample.

To explore one of the many potential mediating endophenotypes, we examined how much the same polygenic scores (constructed to explain EduYears or College) could explain individual differences in cognitive function. Though it would have been preferable to explore a richer set of mediators, this variable was available in STR, a data set where we had access to the individual-level genotypic data. The Swedish Enlistment Battery (used for conscription) had previously been administered to measure cognitive function in a subset of males (5, 17). The estimated R² ≈ 2.5% (P < 1.0 × 10⁻⁸) for cognitive function is actually slightly larger than the fraction of variance in educational attainment captured by the score in the STR sample. One possible interpretation is that some of the SNPs used to construct the score matter for education through their stronger, more direct effects on cognitive function (5). A mediation analysis (table S24) provides tentative evidence consistent with this interpretation.

The polygenic score remains associated with educational attainment and cognitive function in within-family analyses (table S25). Thus, these results appear robust to possible population stratification.

If the size of the training sample used to estimate the linear polygenic score increased, the explanatory power of the score in the prediction sample would be larger, because the coefficients used for constructing the score would be estimated with less error. In (5), we report projections of this increase. We also assess, at various levels of explanatory power, the benefits from using the score as a control variable in a randomized educational intervention (5). An asymptotic upper bound for the explanatory power of a linear polygenic score is the additive genetic variance across individuals captured by current SNP microarrays. Using combined data from STR and QIMR, we estimate that this upper bound is 22.4% (SE = 4.2%) in these samples (5) (table S12).
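The dependence of explanatory power on training-sample size can be illustrated with one commonly used approximation (due to Daetwyler and colleagues), in which the expected out-of-sample R² is h²·N·h²/(N·h² + M), where h² is the variance captured by the genotyped SNPs, N is the training-sample size, and M is the effective number of independent SNPs. This is not necessarily the formula behind the projections in (5). The value of M below is a placeholder assumption, and the approximation is idealized, so its output should not be read as a reproduction of the reported estimates.

```python
def projected_r2(n_training, snp_h2, n_effective_snps):
    """Approximate out-of-sample R^2 of a linear polygenic score.

    Approaches snp_h2 (the asymptotic upper bound) as n_training grows.
    """
    signal = n_training * snp_h2
    return snp_h2 * signal / (signal + n_effective_snps)


SNP_H2 = 0.224          # upper bound estimated in the text
M_EFFECTIVE = 60_000    # ASSUMED effective number of independent SNPs

for n in (100_000, 500_000, 1_000_000, 10_000_000):
    r2 = projected_r2(n, SNP_H2, M_EFFECTIVE)
    print(f"N = {n:>10,}: projected R^2 = {r2:.3f}")
```

The qualitative behavior is the point of the sketch: explanatory power rises monotonically with the size of the training sample and levels off at the variance captured by the SNPs.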

Placed in the context of the GWAS literature (10), our largest estimated SNP effect size of 0.02% is more than an order of magnitude smaller than those observed for height and body mass index (BMI): 0.4% (15) and 0.3% (18), respectively. For comparison with the R² value of 2% from our linear polygenic score for education, estimated from a sample of 120,000, a score for height reached 10%, estimated from a sample of 180,000 (15), and a score for BMI, using only the top 32 SNPs, reached 1.4% (18). Taken together, our findings suggest that the genetic architecture of complex behavioral traits is far more diffuse than that of complex physical traits.

Existing claims of “candidate gene” associations with complex social-science traits have reported widely varying effect sizes, many with R² values more than 100 times larger than those we have found (4, 6). For complex social-science phenotypes that are likely to have a genetic architecture similar to educational attainment, our estimate of 0.02% can serve as a benchmark for conducting power analyses and evaluating the plausibility of existing findings in the literature.
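To show how the 0.02% benchmark can anchor a power analysis, the sketch below computes the approximate power to detect a single SNP with a given R² in a sample of size N. It assumes a two-sided test and a normal approximation with non-centrality N·R²/(1 − R²); the sample sizes and significance levels in the example are illustrative choices, not values from a specific study.

```python
from math import sqrt
from statistics import NormalDist


def snp_detection_power(n, r2, alpha=5e-8):
    """Approximate power of a two-sided test for one SNP explaining r2."""
    norm = NormalDist()
    z_crit = -norm.inv_cdf(alpha / 2.0)    # critical value for two-sided alpha
    shift = sqrt(n * r2 / (1.0 - r2))      # expected z-score under the effect
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


BENCHMARK_R2 = 0.0002  # 0.02%, the benchmark proposed in the text

# Genome-wide significance in a consortium-sized sample.
print(f"N = 100,000, alpha = 5e-8: {snp_detection_power(100_000, BENCHMARK_R2):.2f}")
# Nominal significance in a sample of 1,000 (an illustrative small study).
print(f"N = 1,000,   alpha = 0.05: {snp_detection_power(1_000, BENCHMARK_R2, 0.05):.2f}")
```

Under these assumptions, power in a sample of 1,000 is only slightly above the nominal 5% false-positive rate, which illustrates why small-sample reports of much larger effects deserve scrutiny.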

The few GWAS studies conducted to date in social-science genetics have not found genome-wide–significant SNPs that replicate consistently (19, 20). One commonly proposed solution is to gather better measures of the phenotypes in more environmentally homogenous samples. Our findings demonstrate the feasibility of a complementary approach: identify a phenotype that, although more distal from genetic influences, is available in a much larger sample [see (5) for a simple theoretical framework and power analysis]. The genetic variants uncovered by this “proxy-phenotype” methodology can then serve as a set of empirically-based candidate genes in follow-up work, such as tests for associations with well-measured endophenotypes (e.g., personality, cognitive function), research on gene-environment interactions, or explorations of biological pathways.

In social-science genetics, researchers must be especially vigilant to avoid misinterpretations. One of the many concerns is that a genetic association will be mischaracterized as “the gene for X,” encouraging misperceptions that genetically influenced phenotypes are immune to environmental intervention [for rebuttals, see (21, 22)] and misperceptions that individual SNPs have large effects (which our evidence contradicts). If properly interpreted, identifying SNPs and constructing polygenic scores are steps toward usefully incorporating genetic data into social-science research.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S22

Tables S1 to S27

References (23–175)

References and Notes

  1. See the supplementary materials on Science Online.
  2. Acknowledgments: This research was carried out under the auspices of the Social Science Genetic Association Consortium (SSGAC), a cooperative enterprise among medical researchers and social scientists that coordinates genetic-association studies for social-science variables. Data for our analyses come from many studies and organizations, some of which are subject to a materials transfer agreement (5). Results from the meta-analysis are available at the Web site of the consortium. The formation of the SSGAC was made possible by an EAGER grant from the NSF and a supplemental grant from the NIH Office of Behavioral and Social Sciences Research (SES-1064089). This research was also funded in part by the Ragnar Söderberg Foundation (E9/11); the National Institute on Aging (NIA)/NIH through grants P01-AG005842, P01-AG005842-20S2, P30-AG012810, and T32-AG000186-23; and the Intramural Research Program of the NIA/NIH. For a full list of acknowledgments, see (5).