Genetics of Mouse Behavior: Interactions with Laboratory Environment

Science  04 Jun 1999:
Vol. 284, Issue 5420, pp. 1670-1672
DOI: 10.1126/science.284.5420.1670


Strains of mice that show characteristic patterns of behavior are critical for research in neurobehavioral genetics. Possible confounding influences of the laboratory environment were studied in several inbred strains and one null mutant by simultaneous testing in three laboratories on a battery of six behaviors. Apparatus, test protocols, and many environmental variables were rigorously equated. Strains differed markedly in all behaviors, and despite standardization, there were systematic differences in behavior across labs. For some tests, the magnitude of genetic differences depended upon the specific testing lab. Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory.

Targeted and chemically induced mutations in mice are valuable tools in biomedical research, especially in the neurosciences and psychopharmacology. Phenotypic effects of a knockout often depend on the genetic background of the mouse strain carrying the mutation (1), but the effects of environmental background are not generally known.

Different laboratories commonly employ their own idiosyncratic versions of behavioral test apparatus and protocols, and any laboratory environment also has many unique features. These variations have sometimes led to discrepancies in the outcomes reported by different labs testing the same genotypes for ostensibly the same behaviors (2). Previous studies could not distinguish between interactions arising from variations in the test situation itself and those arising from subtle environmental differences among labs. Usually, such differences are eventually resolved by repetition of tests in multiple labs. However, null mutants and transgenic mice are often scarce and tend to be behaviorally characterized in a single laboratory with a limited array of available tests.

We addressed this problem by testing six mouse behaviors simultaneously in three laboratories (Albany, New York; Edmonton, Alberta, Canada; and Portland, Oregon) using exactly the same inbred strains and one null mutant strain (3). We went to extraordinary lengths to equate test apparatus, testing protocols, and all possible features of animal husbandry (4). One potentially important feature was varied systematically. Because many believe that mice tested after shipping from a supplier behave differently from those reared in-house, we compared mice that had been shipped with mice bred locally, testing both at the same age (77 days) and starting at the same time (0830 to 0900 hours local time on 20 April 1998) in all three labs (5). Each mouse was given the tests in the same order [Day 1: locomotor activity in an open field; Day 2: an anxiety test, exploration of two enclosed and two open arms of an elevated plus maze; Day 3: walking and balancing on a rotating rod; Day 4: learning to swim to a visible platform; Day 5: locomotor activation after cocaine injection; Days 6 to 11: preference for drinking ethanol versus tap water (6)].

Despite our efforts to equate laboratory environments, significant and, in some cases, large effects of site were found for nearly all variables (Table 1). Furthermore, the pattern of strain differences varied substantially among the sites for several tests. Sex differences were only occasionally detected, and, much to our surprise, there were almost no effects of shipping animals before testing. Large genetic effects on all behaviors were confirmed, which is not surprising because we chose strains known to differ markedly on these tasks.

Table 1

Statistical significance and effect sizes for selected variables in the multisite trial. Color of cell depicts Type I error probability or significance of main effects and two-way interactions from 8 × 2 × 3 × 2 analyses of variance: blue, P < 0.00001; purple, P < 0.001; gold, P < 0.01; dashes with no shading, P > 0.01. Cell entries are effect sizes, expressed as partial omega squared, the proportion of variance accounted for by the factor or interaction if only that factor were in the experimental design (range = 0 to 1.0). Multiple R² (unbiased estimate) gives the proportion of the variance accounted for by all factors. For the water escape task, results are based on only seven strains because most A/J mice never escaped because of wall-hugging. We recognize that the issue of appropriate alpha level correction for multiple comparisons is contentious. Details of the statistical analyses are available on the Web site (4), including a discussion of our rationale for presenting uncorrected values in this table.
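As a rough illustration of the effect-size measure named in this caption, the sketch below computes partial omega squared from an F ratio using a common textbook estimator for a fully between-subjects design. The function name, the clamping of negative estimates to zero, and the example numbers are assumptions made for illustration; they are not values or procedures taken from this study, whose own statistical details are on the Web site (4).

```python
def partial_omega_squared(f_ratio, df_effect, n_total):
    """Estimate partial omega squared from an F ratio (between-subjects design).

    Uses the textbook estimator df * (F - 1) / (df * (F - 1) + N).
    Negative estimates (F < 1) are clamped to 0, an assumed convention.
    """
    numerator = df_effect * (f_ratio - 1.0)
    if numerator <= 0:
        return 0.0
    return numerator / (numerator + n_total)


# Hypothetical example: a 3-level factor (df = 2), F = 11, 100 animals in total.
example = partial_omega_squared(f_ratio=11.0, df_effect=2, n_total=100)
print(round(example, 3))  # 20 / 120 -> 0.167
```

Because the estimator depends only on F, the effect's degrees of freedom, and the total sample size, it can be applied to each main effect or interaction separately, which matches the "if only that factor were in the experimental design" reading given in the caption.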

Results for locomotor activity and the effect of a subsequent cocaine injection on locomotion are shown in Fig. 1. Expected strain differences in undrugged activity were found: A/J mice were relatively inactive at all three sites, whereas C57BL/6J mice were very active. An effect of laboratory was also found: mice tested in Edmonton were, on average, more active than those tested in Albany or Portland. In addition, the pattern of genetic differences depended on site. For example, 129/SvEvTac mice tested in Albany were very inactive compared to their counterparts in other labs. Similar results were seen for sensitivity to cocaine stimulation. For example, B6D2F2 mice were very responsive (and A/J mice quite insensitive) to cocaine in Portland, but not at other sites.

Figure 1

Group means (±SEM for n = 16 mice) for activity in a 40 cm by 40 cm open field for eight strains tested at the same time of day in identical apparatus in three laboratories. (A) Horizontal distance (centimeters) traveled in 15 min on the first test on Day 1. (B) Cocaine-induced activation, expressed as the difference between horizontal activity (centimeters in 15 min) after cocaine (20 mg/kg) on Day 5 minus the score on Day 1.
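The difference score in panel (B) and the group mean ± SEM used throughout the figures can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the four pairs of distances are invented placeholders, not data from the study.

```python
import math
import statistics


def activation_scores(day1_cm, day5_cm):
    """Per-mouse difference score: distance after cocaine (Day 5) minus Day 1."""
    if len(day1_cm) != len(day5_cm):
        raise ValueError("each mouse needs both a Day 1 and a Day 5 score")
    return [d5 - d1 for d1, d5 in zip(day1_cm, day5_cm)]


def mean_and_sem(values):
    """Group mean and standard error of the mean (sample SD / sqrt(n))."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Hypothetical distances (cm in 15 min) for four mice; not data from the study.
day1 = [1000.0, 1200.0, 900.0, 1100.0]
day5 = [3000.0, 3600.0, 2500.0, 3300.0]
scores = activation_scores(day1, day5)
mean, sem = mean_and_sem(scores)
print(scores, mean, round(sem, 2))
```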

In the elevated plus maze, a very similar pattern was seen: strong effects of genotype, site, and their interaction. This was true both for activity measures and for time spent in open arms, the putative index of anxiety (Fig. 2). For total arm entries, the testing laboratory was particularly important for the 5-HT1B knockout mice versus their wild-type 129/Sv-ter background controls. Knockout mice had greater activity than wild types in Portland and tended to have less activity in Albany, while not differing in Edmonton. Edmonton mice of all strains spent more time in open arms (lower anxiety). Portland mice also spent less time in open arms, but this was especially true for strains A/J, BALB/cByJ, and the B6D2F2 mice.

Figure 2

Group means (±SEM for n = 16 mice) for behavior videotaped for 5 min on elevated plus mazes having two open and two enclosed arms. (A) Total number of entries into any arm (defined as all four limbs in the arm). (B) Time (seconds) spent in the two open arms during the 300-s test. Smaller amounts of time indicate higher levels of anxiety.

Although the testing laboratory was an important variable, there was a good deal of consistency to the genetic results as well. For example, comparison of the genotype means (averaged over sites) for the initial 5 min of the activity test on Day 1 with the total arm entry scores from the plus maze yielded a high correlation between strains (r = 0.91, P < 0.002). This indicates that a strain's characteristic activity in novel apparatus is robust and occurs in different apparatus as well as different labs (7).
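To show how a correlation between strain means relates to its significance level, the sketch below converts a Pearson r into the usual t statistic with n − 2 degrees of freedom. It assumes the correlation was computed over eight strain means (one per strain shown in the figures); that count and the function name are assumptions for illustration, not details confirmed in the text.

```python
import math


def t_from_r(r, n_pairs):
    """t statistic (df = n_pairs - 2) for testing a Pearson r against zero."""
    df = n_pairs - 2
    return r * math.sqrt(df / (1.0 - r * r)), df


# r = 0.91 is the value reported above; n_pairs = 8 assumes one mean per strain.
t_value, df = t_from_r(0.91, 8)
print(f"t({df}) = {t_value:.2f}")  # t(6) = 5.38
```

Under that assumption, the resulting t falls between the standard two-tailed critical values for P = 0.002 and P = 0.001 at 6 degrees of freedom, which agrees with the reported P < 0.002.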

For some behaviors, laboratory environment was not critical. For example, ethanol drinking scores were closely comparable across all three labs, and genotypes alone accounted for 48% of the variance (Table 1 and Fig. 3). The genetic differences showed the well-known pattern of C57BL/6J mice strongly preferring and DBA/2J mice avoiding ethanol (8). Females drank more, as is also well known (8), but there were no significant effects of site, shipping, or any other interactions. Unlike the other five tests, ethanol preference testing extended over 6 days in the home cage and involved a bare minimum of handling mice by the experimenter.

Figure 3

Mean (±SEM) ethanol consumed per day, expressed as grams per kilogram body weight, over 4 days of an ethanol preference test where each mouse had free access to two drinking bottles, one with local tap water and the other with 6% ethanol in tap water.

For some measures, the difference between 5-HT1B null mutant and wild-type mice depended on the specific laboratory environment. In Edmonton, for example, no difference was observed between +/+ and –/– mice in distance traveled in the activity monitor, whereas there was greater activity in the knockouts at the other two sites, especially Portland (P = 0.002). In the elevated plus maze, knockouts were considerably more active than wild types only in Portland (Fig. 2A; P = 0.02).

The numbers of mice we tested made formal statistical assessment of reliability infeasible, but it would be important to know whether each laboratory would obtain essentially the same strain-specific results if this experiment were repeated. Because our experiment included an internal replication, we estimated the lower bounds of reliability for each site separately by correlating the mean scores for each strain (collapsed over sex and shipping group) obtained during the two replicates of the experiment. These correlations differed depending upon the behavior, and were consonant with the relative importance of genotype in the overall analysis. For example, for locomotor activity, the correlations were 0.97, 0.74, and 0.87 for the three sites. For open-arm time on the plus maze, possibly the most intrinsically unstable task we employed, the correlations were lower (0.32, 0.52, and 0.26). These can be compared to correlations for body weight, which can serve as a type of control variable not influenced by idiosyncratic dynamics of the test situation (0.83, 0.74, and 0.90). No site had generally higher or lower reliability than the others, and formal analyses of replication in analyses of variance indicated no strong interactions of strain by replication. We conclude that reasonable estimates of strain-specific scores are highly dependent on behavioral endpoint, and that some behaviors are highly stable.
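The replicate-based reliability estimate described above (strain means collapsed over sex and shipping group, then correlated across the two replicates at one site) can be sketched as follows. The record layout, the field names, and the strain labels are hypothetical, and the scores are invented placeholders rather than data from the study.

```python
from collections import defaultdict
import math


def strain_means(records, replicate):
    """Mean score per strain for one replicate, collapsing over sex and shipping."""
    totals = defaultdict(lambda: [0.0, 0])
    for rec in records:
        if rec["replicate"] == replicate:
            totals[rec["strain"]][0] += rec["score"]
            totals[rec["strain"]][1] += 1
    return {strain: total / count for strain, (total, count) in totals.items()}


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def replicate_reliability(records):
    """Correlate strain means from replicate 1 with those from replicate 2."""
    first = strain_means(records, 1)
    second = strain_means(records, 2)
    strains = sorted(set(first) & set(second))
    return pearson_r([first[s] for s in strains], [second[s] for s in strains])


# Hypothetical records for one site; labels and scores are placeholders.
records = [
    {"strain": "strain_a", "replicate": 1, "score": 10.0},
    {"strain": "strain_a", "replicate": 1, "score": 12.0},
    {"strain": "strain_a", "replicate": 2, "score": 20.0},
    {"strain": "strain_a", "replicate": 2, "score": 24.0},
    {"strain": "strain_b", "replicate": 1, "score": 20.0},
    {"strain": "strain_b", "replicate": 2, "score": 40.0},
    {"strain": "strain_c", "replicate": 1, "score": 30.0},
    {"strain": "strain_c", "replicate": 2, "score": 60.0},
]
reliability = replicate_reliability(records)
```

Running the function once per site would give one correlation per laboratory, in the same form as the three values quoted for each measure.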

Several sources of these laboratory-specific behavioral differences could be ruled out by the rigor of the experimental design. For example, Edmonton mice might have been more sensitive to cocaine-induced locomotion because the source of cocaine differed from the other two sites (4), but this could not explain the relatively marked response of the three 129-derived strains in Edmonton only. However, specific experimenters performing the testing were unique to each laboratory and could have influenced behavior of the mice. The experimenter in Edmonton, for example, was highly allergic to mice and performed all tests while wearing a respirator—a laboratory-specific (and uncontrolled) variable.

Whether animals were bred in each laboratory or shipped as adults 5 weeks before testing had no consistent influence on results in this experiment. Shipped animals took routes of varying duration and difficulty. For example, some Taconic mice were trucked to Albany from nearby Germantown, New York, whereas others spent 2 days in transit during a flight in midwinter to Edmonton. At least in this experiment, allowing animals a lengthy period of acclimation to new quarters was sufficient to overcome any strong effects of putative shipping stress on subsequent behavior.

These results support both optimistic and pessimistic interpretations. Seen optimistically, genotype was highly significant for all behaviors studied, accounting for 30 to 80% of the total variability, and several historically documented strain differences were also seen here. In general, we conclude that very large strain differences are robust and are unlikely to be influenced in a major way by site-specific interactions. However, a more cautious reading suggests that for behaviors with smaller genetic effects (such as those likely to characterize most effects of a gene knockout), there can be important influences of environmental conditions specific to individual laboratories, and specific behavioral effects should not be uncritically attributed to genetic manipulations such as targeted gene deletions.

When studying mutant mice, relatively small genetic effects should first be replicated locally before drawing conclusions (9). We further recommend that, if possible, genotypes should be tested in multiple labs and evaluated with multiple tests of a single behavioral domain (such as several tests of anxiety-related behavior) before concluding that a specific gene influences a specific behavioral domain. We also suggest that laboratory-specific effects on genetic differences may affect phenotypes other than behavior to an extent similar to what we report here.

It is not clear whether standardization of behavioral assays would markedly improve future replicability of results across laboratories. Standardization will be difficult to achieve because most behaviorists seem to have differing opinions about the “best” way to assay a behavioral domain. For example, two of us typically test behavior during the light phase of the animals' cycle, whereas the third typically tests during the dark phase (but switched to the light phase for this study). Which apparatus specifications or test protocol to employ is also a subject of differing opinion. There is a risk of prematurely limiting the “recommended” tests in a domain to those deemed “industry standard,” because this may constrain the intrinsic richness of a domain and obscure interesting interactions. On the other hand, increased communication and collaboration between the molecular biologists creating mutations and behavioral scientists interested in the psychological aspects of behavioral testing will benefit both groups.

* To whom correspondence should be addressed. E-mail: crabbe{at}

