Teaching accreditation exams reveal grading biases favor women in male-dominated disciplines in France


Science  29 Jul 2016:
Vol. 353, Issue 6298, pp. 474-478
DOI: 10.1126/science.aaf4372

Signaling ability and grit to academia

In many professions, getting ahead requires evidence of both effort and ability. This is especially true if one is not a member of the dominant group and thus surmounting social norms. Breda and Hillion show that oral examiners of candidates for teaching positions in the French education system reward such applicants. Specifically, women applying for high-level teaching positions in male-dominated fields, such as physics and philosophy, are favored, as are men who apply in female-dominated fields, such as literature and foreign languages.

Science, this issue p. 474


Discrimination against women is seen as one of the possible causes behind their underrepresentation in certain STEM (science, technology, engineering, and mathematics) subjects. We show that this is not the case for the competitive exams used to recruit almost all French secondary and postsecondary teachers and professors. Comparisons of oral non–gender-blind tests with written gender-blind tests for about 100,000 individuals observed in 11 different fields over the period 2006–2013 reveal a bias in favor of women that is strongly increasing with the extent of a field’s male-domination. This bias turns from 3 to 5 percentile ranks for men in literature and foreign languages to about 10 percentile ranks for women in math, physics, or philosophy. These findings have implications for the debate over what interventions are appropriate to increase the representation of women in fields in which they are currently underrepresented.

Why are women underrepresented in most areas of science, technology, engineering, and mathematics (STEM)? One of the most common explanations is that a hiring bias against women exists in those fields (1–4). This explanation is supported by a few older experiments (5–7), a recent one with fictitious resumes (8), and a recent lab experiment (9), which suggest that the phenomenon still prevails.

However, some scholars have challenged this view (10, 11), and another recent experiment with fictitious resumes finds a bias in favor of women in academic recruitment (12). Studies based on actual hiring also find that when women apply to tenure-track STEM positions, they are more likely to be hired (13–18). However, those studies do not control for applicants’ quality and a frequent claim is that their results simply reflect that only the best female Ph.D.’s apply to these positions, whereas a larger fraction of males do so (11, 13). A study by one of us did partly control for applicants’ quality and reported a bias in favor of women in male-dominated fields (19). However, it has limited external validity because it relies on only 3000 candidates who took the French Ecole Normale Supérieure entrance exam.

The present analysis is based on a natural experiment involving >100,000 individuals who participated in competitive exams used to hire French primary, secondary, and college or university teachers over the period 2006–2013. It has two distinct advantages over all previous studies. First, it provides large-scale real-world evidence of gender biases in evaluation-based hiring in several fields. Second, it shows that those biases against or in favor of women are strongly shaped by the actual degree of female underrepresentation in the field in which the evaluation takes place, which partly reconciles existing studies.

Carefully taking into account the extent of underrepresentation of women in 11 academic fields allowed us to extend the analysis beyond the STEM distinction. As pointed out recently (11, 12, 19, 20), the focus on STEM versus non-STEM fields can be misleading for understanding female underrepresentation in academia, as some STEM fields are not dominated by men [e.g., 54% of U.S. Ph.D.’s in molecular biology are women (21)], whereas some non-STEM fields, including humanities, are male-dominated [e.g., only 31% of U.S. Ph.D.’s in philosophy are women (21)]. A better predictor of this underrepresentation, some have argued, is the belief that innate raw talent is the main requirement to succeed in the field (20).

To study how female underrepresentation can shape skills assessment, we exploit the two-stage design of the three national exams used in France to recruit virtually all primary-school teachers, CRPE; middle- and high-school teachers, CAPES and Agrégation; as well as a large share of graduate school and university teachers, who also take the Agrégation (22). A college degree is necessary to take part in those competitive exams [table S1 in (22)]. Except for the lower level (CRPE), each exam is subject-specific and typically includes two or three written tests. The best candidates after those written tests (tables S2 and S3) are eligible for typically two or three oral tests taken no later than 3 months after the written tests (22). Note that oral tests are not general recruiting interviews: Depending on the subject, they include exercises, questions, or text discussions designed to assess candidates’ fundamental skills, exactly as written tests. Teachers or professors who have specialized in the subject grade all the tests. At the highest-level exam (Agrégation), 80% of evaluators are either full-time researchers or university professors in French academia. The corresponding statistic is 30% for the medium-level exam (CAPES).

Our strategy exploits the “blinding” of the written tests (candidates’ name and gender are not known by the professors who grade these tests), whereas the oral tests are not blinded. If one assumes that female handwriting cannot be easily detected, an assumption we discuss later, written tests provide a counterfactual measure of students’ cognitive ability in each subject.

The French evaluation data offer unique advantages over previously published experiments; they provide real-world test scores for a large group of individuals. Thus, they avoid the usual problem of experiments’ limited external validity. At the same time, these data present a compelling “experiment of nature” in which naturally occurring variations can be leveraged to provide controls. A final advantage is being able to draw on very rich administrative data that allow numerous statistical controls to be applied and comparisons to be made across levels of evaluation, from lower-level (primary and secondary teaching) to college or university hiring.

To assess gender bias in evaluation, we focused on candidates who took all oral and written tests, and we ranked them according to their total score on either written or oral tests. We then compared the variation of women’s mean percentile rank between written and oral tests to the same variation for men. This standardized measure is bounded between –1 and 1, and it is independent of the share of females among the total pool of applicants. It is equal to 1 if all women are below the men on written tests and above them on oral tests [see (22) for additional explanations]. For each subject-specific exam, we computed this measure and its statistical significance using a linear regression model, named DD1 in (22), of the type $\Delta\mathrm{Rank}_i = a + bF_i + \varepsilon_i$. Here $\Delta\mathrm{Rank}_i$ is the variation in rank between oral and written tests of candidate $i$, $F_i$ is an indicator variable equal to 1 for female candidates and 0 for males, $\varepsilon_i$ is an error term, and $b$ is the measure of interest.
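The measure described above can be illustrated with a minimal sketch. It is not the authors' code: the function names and the toy scores are hypothetical, and it assumes percentile ranks are computed as rank divided by the number of candidates, with tied scores sharing the average rank. It relies on the standard fact that, with a single binary regressor, the least-squares slope b equals the difference between the two group means of the outcome.

```python
from statistics import mean


def percentile_ranks(scores):
    """Rank of each score divided by n; tied scores share the average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the block while the next sorted score is tied with this one.
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tied block
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank / n
        pos = end + 1
    return ranks


def oral_bonus(written, oral, female):
    """Slope b of the regression of (oral rank - written rank) on a female dummy.

    With a binary regressor, the OLS slope is the women's mean rank change
    minus the men's mean rank change.
    """
    w = percentile_ranks(written)
    o = percentile_ranks(oral)
    delta = [oi - wi for oi, wi in zip(o, w)]
    women = [d for d, f in zip(delta, female) if f]
    men = [d for d, f in zip(delta, female) if not f]
    return mean(women) - mean(men)


# Hypothetical extreme case: every woman is below every man on the written
# tests and above every man on the oral tests.
written = [1, 2, 3, 4, 5, 6]
oral = [6, 5, 4, 1, 2, 3]
female = [1, 1, 1, 0, 0, 0]
print(oral_bonus(written, oral, female))
```

In this toy example the measure reaches its upper bound of 1 (up to floating-point rounding), matching the extreme case described in the text. Standard errors and the control variables of the richer models in (22) are not covered by this sketch.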

In fields in which women are underrepresented (mathematics, physics, chemistry, and philosophy), oral tests favor women over men both on the higher-level exams (professorial and high-school teaching) and medium-level exams (secondary school teaching only) (Fig. 1) (Ps < 0.01 in all cases, see sample sizes and detailed results in table S4). In contrast, oral tests in fields in which women are well-represented (literature and foreign languages) favor men over women, but the differences are smaller and not always significantly different from 0 at the 5% statistical level (Fig. 1 and table S4). In history, geography, and social sciences, there are only small gender differences between oral and written tests. Those differences are not significantly different from 0 at the 5% statistical level. In biology, a bias against women is found on the high-level exam only. With the exception of social sciences at the medium-level exam (22), all results are robust to the inclusion of control variables and to the use of a more general econometric model that allows for different returns to candidates’ fundamental skills between oral and written tests [see models DD2 and DD3+IV in (22)].

Fig. 1 Female evaluation advantage or disadvantage and fields’ extent of male-domination.

(A and B) The gap between females’ average percentile rank on nonblind oral tests and blind written tests, minus the same gap for men, is shown on the y axis. It is computed for each field-specific exam at the high and medium level. The size of each point indicates the extent to which it is different from 0 (P value from Student’s t test). Fields’ extent of (non–)male-domination (x axis) is measured by the share of women academics in each field [see (22) for alternative measures].

A simple explanation for these results would be that examiners on oral tests try to lower the gender difference in ability observed on written tests. This is not always the case (Fig. 2): The oral tests sometimes fully invert a significant ranking gap between women and men on written tests (physics at the highest level, math at the medium level).

Fig. 2 Average rank difference between women and men on oral and written tests in each subject-specific exam at the high and medium level.

(A and B) Error bars indicate 95% confidence intervals from Student’s t test.

A clear pattern emerges from Fig. 1: The more male-dominated a field is, the higher the bonus for women on the nonblinded oral tests. To formally capture this pattern, we study how the bonus b on oral tests varies with the share of women s among assistant professors and senior professors in the French academy [see (22) for statistical details and other measures of fields’ feminization, e.g., table S5]. We find a significant negative relation at both the higher- and medium-level exams (see table S6) (b = 0.25 – 0.53 s at the high-level exam; b = 0.13 – 0.28 s at the medium-level exam, with P < 0.02 for both slopes and intercepts of the fitted lines).
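As a rough illustration of how the fitted lines can be read, the sketch below evaluates them at a given share of women s and computes the share at which the fitted bonus is zero. The coefficients are the rounded values quoted above; the function and constant names are hypothetical, and because of rounding the outputs are only indicative.

```python
# Rounded (intercept, slope) pairs quoted in the text for b = intercept - slope * s.
HIGH_LEVEL = (0.25, 0.53)
MEDIUM_LEVEL = (0.13, 0.28)


def fitted_bonus(share_women, intercept, slope):
    """Fitted oral-test bonus for women at a given share of women in the field."""
    return intercept - slope * share_women


def break_even_share(intercept, slope):
    """Share of women at which the fitted bonus is zero."""
    return intercept / slope


# Shares of women used in the text for math (0.21) and foreign languages (0.62).
for label, coeffs in (("high", HIGH_LEVEL), ("medium", MEDIUM_LEVEL)):
    math_bonus = fitted_bonus(0.21, *coeffs)
    languages_bonus = fitted_bonus(0.62, *coeffs)
    zero_at = break_even_share(*coeffs)
    print(f"{label}: math {math_bonus:+.3f}, languages {languages_bonus:+.3f}, "
          f"zero at s = {zero_at:.2f}")
```

With these rounded coefficients, both fitted lines cross zero when women make up a little under half of academics in the field, so the fitted bonus favors women below that share and men above it.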

The relation between the extent of a field’s male-dominance and female bonuses on oral tests at the highest-level exams (for high-school teachers and professorial) is about 150% of that at the medium-level exams. At the highest level, switching from a subject as feminine as foreign languages (s = 0.62) to a subject as masculine as math (s = 0.21) leads female candidates to gain, on average, 17 percentile ranks on oral tests with respect to written tests. To avoid sample-selection bias, this comparison between the medium- and the high-level exam is made on a subsample of about 3500 individuals who have taken both exams in the same subject the same year [(22), fig. S2, and table S6].

Finally, the statistical analysis suggests an absence of large significant gender biases on oral tests for the lower-level teaching exam (22). Note that this exam is not subject-specific. However, since 2011, all applicants have been required to take both an oral and a written test in math and literature, which makes it possible to study the bonus on oral tests for women in those two subjects. We find a small premium of around 3 percentile ranks for women on oral tests, both in math and literature, with no clear difference between those two subjects (see table S7). This finding should, however, be considered with prudence because it can only be established with the more general econometric specification [see model DD3+IV in (22)].

The differences between written and oral tests on the specialized medium- and high-level exams have implications for the gender composition of newly recruited secondary and postsecondary teachers. Oral tests give the gender in the minority better chances of being hired (fig. S1) and, therefore, induce a rebalancing of gender ratios between teachers hired in male- and female-dominated fields (table S8). We also find that the gender gaps between oral and written tests are very stable across the written test score distribution in all fields for the medium- and high-level exams (table S9).

Should the differences between written and oral test scores be interpreted as evaluation biases? In natural experiments, the researcher does not have full control over the research design; thus, the results usually need to be interpreted with caution. The setting we exploit has three potential issues: (i) gender may be inferred on written tests from handwriting; (ii) there might be gender differences in the types of abilities that are required on oral and written tests; and (iii) the way candidates self-select in a given field may depend on their gender.

Tests that we previously conducted have shown that the rate of success in guessing gender from handwritten anonymous exam sheets is, on average, 68.6% (19). This suggests that examiners are rarely certain about the candidates’ gender from written tests [see additional details in (22)]. Their limited ability to detect the gender of candidates from the written tests would be truly problematic for the interpretation of our results if and only if those examiners were biased in opposite directions on the written and oral tests. This assumption cannot be tested empirically but seems unlikely, given that the same examiners usually evaluate both the written and oral tests (22). Moreover, examiners’ bias is likely to be smaller when they face presumably female or male handwriting than when they are exposed to an actual female or male candidate during an oral test. Therefore, partial gender detection on written tests should, if anything, only attenuate the magnitude of the estimated biases, which would keep their direction identified.

A more fundamental issue is that the gap between a candidate’s oral and written test score in a given subject can capture the effect of gender-related attributes visible only from oral or written tests, such as the quality of handwriting, elocution, or emotional intelligence [see (23–26) for surveys on possible sex differences in cognitive abilities, including verbal fluency].

The first defense against those interpretations is that our key result is not the absolute gender gap in the oral versus written test scores in a given subject, but the variation—and even reversal—of this gap across subjects according to a regular pattern. If there are gender-specific differences in abilities required specifically for oral or written tests, these differences need to vary between male-dominated and other subjects to explain our results. For example, handwriting quality or elocution would both need to differ across gender and to be more rewarded for some subjects than for others. This could be true if the oral tests in the most male-dominated subjects are framed in a way that makes more visible the qualities that are more prevalent among women.

To overcome these issues and a possible handwriting detection problem, we exploit a remarkable feature of the teaching exams: since 2011, all of them have included an oral test entitled “Behave as an ethical and responsible civil servant” (BERCS). At the medium- and high-level exams, BERCS is the only test that is not subject-specific (27). This oral interview is a subpart of an oral test that otherwise attempts to evaluate the competence in the exam core subject. It is consequently graded by teachers or professors specialized in the exam core subject.

We have data on detailed scores for the BERCS test for the lower- and medium-level exams (22). Comparisons of gender differences in performance on this oral test across subjects for the medium-level exam reveal that women systematically get better grades, and that this bonus bʹ decreases with the share of women s in the exam’s overall subject area (Fig. 3, bʹ = 0.12 – 0.25 s, with P < 0.0001 for both the slope and the intercept, clustering by subjects). This pattern is similar to what is observed in Fig. 1 when comparing blind and nonblind subject-specific tests. However, the comparison across fields now relies on a single oral test that is identical in all exams. Consequently, the pattern in Fig. 3 cannot be influenced by (i) handwriting detection or by (ii) the fact that the oral and written tests evaluate different skills. Figure 3 also suggests that examiners favor women who chose to specialize in male-dominated subjects no matter what they are tested on.

Fig. 3 Female advantage or disadvantage on an oral test which is identical in all fields.

The difference between women’s and men’s average rank on the oral test “Behave as an ethical and responsible civil servant” in the different subject-specific medium-level exams (y axis). The size of each point indicates the extent to which it is different from 0 (P value from Student’s t test). Fields’ extent of (non–)male-domination is measured by the share of women academics in each field (x axis).

A last reason why our results could reflect skill differences is that (iii) the populations tested in the different subjects are not the same and are self-selecting. The women who decided to study math and take the math exams might be especially confident in math and perform better on oral tests for this reason, whereas the same happens for men in literature. Selection may also explain the results of the BERCS test: Women enrolled in the more male-dominated exams may have better aptitude for that particular oral test.

We can first reject that sample selection drives our results in a specific case: at the medium-level exam in physics-chemistry, the same candidates have to take the oral and written tests in both physics and chemistry. Among those candidates, the bonus for women on oral tests is 9 percentile points greater in physics than in chemistry, a subject that is less male-dominated according to all indicators. The idea that sample selection does not drive the general pattern in Fig. 1 is also confirmed by a previous analysis that is entirely based on identical samples of candidates being tested in different subjects (19).

To control for sample selection in the BERCS test, we exploited a pattern of test-taking over the period 2011–2013: A few candidates took both the lower-level exam and the medium-level exam in a specific subject. We used the grade obtained from the BERCS test for the lower-level exam (where this test is also mandatory and graded as a subpart of the literature test) as a counterfactual measure of ability. As the lower-level exam is not subject-specific, it offers a counterfactual measure in a gender-neutral context. Among the small group of candidates who took both exams and took the medium-level exam in a less male-dominated subject (social sciences, history, geography, biology, literature, or foreign languages), men get an advantage over women on the oral test BERCS that is significantly higher at the 5% level for the medium-level exam than for the lower-level exam (Fig. 4, P = 0.04, N = 120 candidates). The reverse is true (however, not statistically significant) among the group that took the medium-level exam in a male-dominated subject (math, physics-chemistry, or philosophy, N = 60). As both the test subject and the sample of candidates are held constant in this last experiment, observed differences almost surely reflect examiners’ bias according to the extent of male-domination in the candidates’ field of specialization.
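The comparison above can be sketched as follows, purely as an illustration and not as the authors' procedure. The function names and toy scores are hypothetical. The sketch assumes ranks are computed within the sample of candidates who took both exams (tied scores share the average rank), and it reports the women-minus-men rank gap on the medium-level BERCS test minus the same gap on the lower-level BERCS test.

```python
from statistics import mean


def within_sample_ranks(scores):
    """Rank of each score within the sample, divided by n (ties share a rank)."""
    n = len(scores)
    return [
        (sum(s < x for s in scores) + (sum(s == x for s in scores) + 1) / 2) / n
        for x in scores
    ]


def gender_gap(scores, female):
    """Women's mean rank minus men's mean rank on one test."""
    ranks = within_sample_ranks(scores)
    women = [r for r, f in zip(ranks, female) if f]
    men = [r for r, f in zip(ranks, female) if not f]
    return mean(women) - mean(men)


def gap_change(lower_scores, medium_scores, female):
    """Change in the gender gap from the lower-level to the medium-level test.

    Positive values mean women do relatively better at the medium level.
    """
    return gender_gap(medium_scores, female) - gender_gap(lower_scores, female)


# Hypothetical scores for four candidates who took both exams.
female = [1, 1, 0, 0]
lower_bercs = [10, 12, 11, 13]
medium_bercs = [14, 15, 9, 10]
print(gap_change(lower_bercs, medium_bercs, female))
```

Because the same candidates and the same test content appear on both sides of the difference, a nonzero change in this toy setup can only come from how the two exams rank those candidates, which is the logic the text relies on. Inference (the P value reported above) is outside the scope of this sketch.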

Fig. 4 Female advantage or disadvantage on the BERCS test for candidates taking both the lower-level exam and a medium-level exam in either a male-dominated or a gender-neutral field.

Rank difference between women and men on the oral test “Behave as an ethical and responsible civil servant” at the lower-level exam and at the medium-level exam among two samples of candidates: those who took both the lower- and medium-level exams in a strongly male-dominated subject (math, physics, or philosophy, left side, N = 60), and those who took both the lower- and medium-level exams in a more gender neutral subject (social sciences, history, geography, biology, literature, or foreign languages, right side, N = 120). To control for selection, ranks at the tests have been computed within each sample, ignoring other candidates that are not in the sample. Confidence intervals at the 90% level are given in square brackets.

In total, the various empirical checks provided here imply with high confidence that our results for the medium- and higher-level exams reflect evaluation biases rather than differences in candidates’ abilities. These biases rebalance gender asymmetries in academic fields by favoring the minority gender. For women, this runs counter to the claim of negative discrimination in recruitment of professors into math-based fields. If anything, women appear to be advantaged in those fields. In contrast, men appear to be advantaged in recruitment into the most feminized fields. Those behaviors are stronger on the highest-level exams, where candidates are more skilled, and where initial gender imbalances between the different fields are largest (see table S2).

Our results are compatible with two main mechanisms. First, evaluators may have different beliefs about female and male applicants in the different fields and may statistically discriminate accordingly. For example, females who have mastered the curriculum, and who apply for highly skilled jobs in male-dominated fields may signal that they do not elicit the general stereotypes associating quantitative ability with men. This may induce a rational belief reversal regarding the motivation or ability of those female applicants (28), or a so-called “boomerang effect” (29) that modifies the attitudes toward them. Experimental evidence provides support for this theory by showing that gender biases are lower or even inverted when information clearly indicates high competence of those being evaluated (29, 30). Second, evaluators may simply have a preference for gender diversity, either conscious (e.g., political reasons) or unconscious. Evidence shows that evaluation biases in favor of the minority gender in a given field are larger in years where this gender performs more poorly at written tests (table S10). This result, which should not be overinterpreted (22), tends to reject the first explanation and is consistent with the second one.

Finally, for the math medium-level exam [the only one for which we have data on jury composition (see table S11)], we find no evidence that male (with respect to female) examiners systematically favor female (with respect to male) candidates (table S12). This result is in line with previous research (12, 19, 31) and suggests that context effects (surrounding gender stereotypes) are more important than examiners’ gender in explaining gender biases in evaluation. It excludes that between-fields variation in panel composition drives our results. We also checked (on the subsample for which we have detailed information) that examiners’ teaching levels do not affect their preferences and conclude that the higher proportion of assistant professors and professors who judge the higher-level exam cannot explain the stronger bonus obtained by the minority gender at that level.

Even without being fully conclusive on the underlying mechanisms, the presented analyses shed light on the possible causes of the underrepresentation of women in many academic fields. They confirm evidence from a recent experiment with fictitious resumes (12) that women can be favored in male-dominated fields at high recruiting levels (from secondary school teaching to professorial hiring), once they have already specialized and heavily invested in those fields (candidates on teaching exams hold at least a college or a master’s degree) (32). In contrast, the study of the recruiting process for primary schoolteachers suggests that prowomen biases in male-dominated fields may disappear in less prestigious and less selective hiring exams, where candidates are not necessarily specialized. Perhaps the bias in favor of women in male-dominated fields would even reverse at lower recruiting levels, as in experiments done with medium-skilled applicants (8, 9). Discrimination may then still impair women’s chances to pursue a career in quantitative science (or philosophy), but only at the early stages of the curriculum, before or just as they enter the pipeline that leads to a Ph.D. or a professorial position.

Nevertheless, there is no compelling evidence of hiring discrimination against individuals who have already decided against social norms to pursue an academic or a teaching career in a field where their own gender is in the minority. This result has three consequences for policy. First, active policies aimed at counteracting stereotypes and discrimination should probably focus on students at early ages, before educational choices are made. Second, nonblind evaluation and hiring should be favored over blind-evaluation in order to reduce gender imbalances across academic fields. In particular, policies that impose anonymous curricula vitae in the first stage of academic hiring are likely to have effects opposite to those expected. Third, many women may shy away from male-dominated fields at early ages because they believe that they would suffer from discrimination. Advertising that they have at least as good—or even better—opportunities as their male counterparts at the levels of secondary school teaching and professorial recruiting could encourage talented young women to study in those fields.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S3

Tables S1 to S14

References (33–38)

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. We checked that using candidates’ score on the test “Behave as an ethical and responsible civil servant” in the computation of candidates’ rank on oral tests does not affect the main results. To do this, we restricted the analysis to the period before it was implemented in 2011. We also replicated the analysis by keeping only one oral and one written test in each of the medium- and high-level exams. We kept the pairs of tests that match the most closely in terms of the subtopic or test program on which they were based. Results are virtually unchanged (fig. S3 and table S12).
  3. The higher-level teaching exam has been passed by a substantial fraction of researchers and may in some cases accelerate a career in French academia. In that sense, results obtained on this exam can be seen as more closely related to the specific debate on the underrepresentation of women scientists in academia.
  4. Acknowledgments: We thank S. Ceci for his amazing feedback and advice, as well as P. Askenazy, X. D’Hautefeuille, S. Georges-Kot, A. Marguerie, F. Kramarz, H. Omer, G. Piaton, T. Piketty, J. Rothstein, and D. Skandalis. We also thank colleagues and seminar participants at Paris School of Economics, CREST, and IZA, as well as three anonymous reviewers for their comments. The data necessary to reproduce most of this study are available at The initial data are property of the French Ministry of Education. Preliminary agreement is necessary to access the data for research purposes (we thank people in Office A2 at DEPP and Xavier Sorbe for giving us access). Summary statistics on sample sizes and female average rank at each test are given in the supplementary materials.