Weighing the results of differing ‘low dose’ studies of the mouse prostate by Nagel, Cagen, and Ashby: Quantification of experimental power and statistical results

https://doi.org/10.1016/j.yrtph.2005.07.001

Abstract

Differing experimental findings with respect to “low dose” responses in the mouse prostate after in utero exposure have generated considerable controversy. An analysis of such controversies requires a broad strength- and weight-of-the-evidence approach. For example, a National Toxicology Program review panel acquired the raw data from nearly 50 studies and then statistically reanalyzed these data in a common and comparable approach. However, the statistical power of the various studies was not calculated and the quantitative p values were not reported in this reanalysis. Such calculations and values address vital strength- and weight-of-the-evidence questions: (1) how sensitive were the various studies, particularly the negative replicating studies, to changes in prostate weight, and (2) what were the p values; were negative studies robust or only marginal in their inability to find an effect? We first examined the statistical power of the studies to detect a positive effect on prostate weight. Preliminary calculations indicated that the two subsequent replicating studies were indeed more sensitive to changes in prostate weight than the original study, having reasonable power to detect an effect at only 50% of the response reported in the original study. Additional calculations were performed using the raw data available from one negative replicating study and the methods recommended by the statistics subpanel of the original review. This analysis used Dunnett’s multiple comparison procedure for groups with p < 0.05 to infer statistical significance, employed an analysis-of-covariance model with body weight as a covariate, and addressed litter as a nested random effect. The quantitated p values for this replicating study, comparing the two Bisphenol A treatment groups (2 and 20 μg/kg/day) to the control, were 0.821 and 0.972, respectively. This indicates that the study was indeed robust in finding no treatment-related effect.
Thus, the weight and strength of the evidence, based on sensitivity and quantitative p value, was that it is highly unlikely for this negative replicating study to have missed a true effect. In the future, we recommend a similar use of statistical power analysis for those designing experimental studies and for those conducting weight-of-the-evidence reviews, and we also recommend the clear quantitation and reporting of p values to support the review’s interpretation and conclusions.

Introduction

Apparent effects at doses well below those traditionally accepted by toxicologists as no-adverse-effect-levels (NOAELs) have been the recent subject of controversy (see for example, Anon., 1997a, Anon., 1997b). This “low dose” controversy potentially challenges the toxicological tenet that the dose makes the poison. In turn, the “low dose” conclusion challenges the regulatory approaches to chemical safety, which use uncertainty factors to extrapolate statistical NOAELs from toxicological studies in the hazard characterization step of risk assessment to protect large populations.

Central to this “low dose” controversy are conflicting studies administering Bisphenol A (BPA) to mice in utero and measuring changes in the weight of the adult prostate. The original positive study reported that increases in prostate weights were found at doses well below those of traditional toxicological thresholds for BPA, i.e., 2 and 20 μg BPA/kg body weight/day (Nagel et al., 1997). In contrast, the later replicating studies were negative, failing to observe any changes in prostate weights (Ashby et al., 1999, Cagen et al., 1999a). Further, multiple generation reproductive and developmental studies with BPA were performed and did not observe any adverse effects at the very low BPA doses in rats (Ema et al., 2001, Tyl et al., 2002).

The “low dose” controversy resulted in a major public scientific review of these and other studies under the auspices of the NTP (Melnick et al., 2002, NTP, 2001). The review panel and its subpanels considered a wide range of factors that might account for the different outcomes, including experimental designs, animal strain, laboratory diet, and statistical analyses. A novel aspect of this review was the request to 15 principal investigators to provide their raw experimental data in order to conduct an independent statistical reanalysis prior to the review meeting. Data from most of the 58 requested studies were received, the data were carefully audited, and the data were reanalyzed using a single consistent approach by the statistics subpanel (Haseman et al., 2001).

The comment in the panel report that caught our attention was made by the BPA subpanel: “collectively, these studies [referring to Ashby et al., 1999, Cagen et al., 1999a; and others] found no evidence for a low dose effect of BPA, despite the considerable strength and statistical power they represent, which the subpanel considered especially noteworthy” (NTP, 2001). However, we could find no quantification or comparison of the actual power of these studies, nor a description of any power calculations made to support this statement. Further, the panel did not report quantified p values for all of the studies. Such quantified p values can be of high value. For example, if p < 0.05 was considered significant, values of 0.002 and 0.812 would be strong evidence for an individual study detecting an effect and for the lack of effect, respectively. In contrast, p values of 0.046 or 0.062 would be marginal evidence for any effect or lack of effect, respectively. Therefore, we have used the available means and standard deviations to quantitate the experimental power of the studies of Nagel et al. (1997), Cagen et al. (1999a), and Ashby et al. (1999), and we have calculated p values when the raw data were available.

Such examination and calculations are not trivial tasks. For example, actual group means and standard deviations were published by both Cagen et al. (1999a) and Ashby et al. (1999). The corresponding data of Nagel et al. (1997) were presented only in graphic form in the original publication (see Fig. 2 of that paper), but the numerical values were available from the NTP review (2001) (see p. A-9). These three data sets make it possible to calculate coefficients of variation and to perform initial power simulations for the three studies. However, detailed simulations and calculations require the data from individual animals. These individual data are available only from the original publication of Ashby et al. (1999) and not for the other two studies. In addition, cofactors must be statistically tested and taken into account. For example, prostate weight is potentially correlated with individual animal body weight and may also be subject to a litter effect.
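As a minimal sketch of the kind of initial power calculation that group means and standard deviations allow, the following Python fragment computes a coefficient of variation and an approximate power to detect a given relative change in the mean. It is an illustration under stated assumptions, not the method used in this paper: it relies on a two-sided normal approximation for a single treated-versus-control comparison with equal group sizes and equal standard deviations, and it ignores Dunnett's adjustment, the body weight covariate, and litter effects. The function names and the input numbers are hypothetical and are not taken from any of the three studies.

```python
from statistics import NormalDist


def coefficient_of_variation(mean, sd):
    """Return the coefficient of variation, SD divided by the mean."""
    return sd / mean


def approx_power(rel_change, cv, n_per_group, alpha=0.05):
    """Approximate two-sided power to detect a relative change in the mean.

    rel_change is the true change as a fraction of the control mean
    (0.25 means a 25% change). Assumes one treated group compared with
    one control group, equal group sizes, equal SDs, and a normal
    approximation to the test statistic.
    """
    z = NormalDist()
    # Standard error of the difference in means, in units of the control mean.
    se = cv * (2.0 / n_per_group) ** 0.5
    shift = rel_change / se
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical inputs for illustration only (not data from the studies).
cv = coefficient_of_variation(mean=40.0, sd=8.0)
for rel_change in (0.10, 0.20, 0.25, 0.30):
    power = approx_power(rel_change, cv, n_per_group=20)
    print(f"{rel_change:.0%} change: approximate power {power:.2f}")
```

Because the approximation depends only on the coefficient of variation and the group size, it shows why a study with lower variability or more animals per group can detect a smaller percentage change. A full treatment would need the individual animal data, as noted above.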

A statistical reanalysis of the raw data from these studies was performed during the NTP review. The findings were that:

  • (1) in the Nagel study, prostate weights were indeed significantly different, but at p < 0.05, not p < 0.01 as reported by the authors (the quantitative p value was not reported, only the < indication was used); prostate weight was correlated with body weight at p < 0.05 (the quantitative p value was again not reported), although the original study did not find a correlation with body weight (Nagel et al., 1997); and, although the original authors reported a significant litter effect (Nagel et al., 1997), no reference to a significant litter effect, or to its p value, was found in the subpanel report;

  • (2) in the Cagen study, there was no effect of treatment on prostate weight (the quantitative p value was again not reported); a significant body weight correlation at p < 0.05 (the quantitative p value was again not reported) was found that had not been originally reported; and a significant litter effect on prostate weight at p < 0.01 was found, consistent with the original report;

  • (3) in the Ashby study, there was no effect of treatment on prostate weight, there was a significant correlation of prostate weight with body weight (quantitative p values were not reported for the body weight correlation or the treatment effect on prostate weight), and there was a marginal litter effect at p = 0.046 (NTP, 2001). All findings were consistent with the original report.

This suggests that an evaluation is needed that considers the statistical power of the conflicting experiments and that calculates and transparently reports the quantitative p values of these studies, in order to judge the strength and the weight of the evidence on the “low dose” controversy.
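The cofactors discussed above, body weight as a covariate and litter as a source of correlation among animals, can be written into a single model for prostate weight. The form below is a sketch of a generic analysis-of-covariance model with litter as a nested random effect. The notation is an illustrative assumption and is not taken from the NTP report or from the individual studies.

```latex
% y_{ijk}: prostate weight of animal k in litter j of treatment group i
% x_{ijk}: body weight of the same animal; \bar{x}: overall mean body weight
\[
  y_{ijk} = \mu + \tau_i + \beta\,(x_{ijk} - \bar{x}) + \ell_{j(i)} + \varepsilon_{ijk},
  \qquad
  \ell_{j(i)} \sim N(0, \sigma_L^2),
  \quad
  \varepsilon_{ijk} \sim N(0, \sigma^2)
\]
```

Here \(\mu\) is the overall mean, \(\tau_i\) is the fixed effect of treatment group \(i\), \(\beta\) is the slope on body weight, \(\ell_{j(i)}\) is the random effect of litter \(j\) nested within group \(i\), and \(\varepsilon_{ijk}\) is the residual for the individual animal. Under such a model, each treated group is compared with the control through the \(\tau_i\) terms, which is where a multiple comparison procedure such as Dunnett's would apply.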


Materials and methods

The experimental results being compared in the initial power simulations are from the studies of Nagel et al. (1997), Cagen et al. (1999a), and Ashby et al. (1999). These are summarized in Table 1. The actual data for individual litters and animals from the study of Ashby et al. (1999) were published and were taken from Table 6 of that paper.

We have followed the general approach of the statistics subpanel (Haseman et al., 2001), using Dunnett’s multiple comparison test, p < 0.05 for significance, a

Preliminary power calculations

The results of the first set of power calculations are presented in Fig. 1. This initial analysis showed that the Cagen et al. (1999a) and Ashby et al. (1999) studies should indeed have had substantial power to detect changes in the mouse prostate even at magnitudes well below those observed in the Nagel study. In fact, based on the preliminary calculations, these studies were almost certain to detect a 25% change in prostate weight, were highly likely to detect a 20% change, and still had good

Discussion

The NTP panel attempted to review nearly 60 experimental studies, requesting their raw data to undertake a consistent statistical reanalysis. An effort on this scale is unprecedented, and it reflects the importance of the “low dose” issue to toxicological and regulatory tenets. Nearly 50 of the requested data sets were submitted. The enormous statistical effort was intriguing in itself as:

  • audits discovered some data errors;

  • indicated inappropriate statistical approaches and

Acknowledgments

We acknowledge and thank Drs. Joe Haseman and Greg Carr for thoughtful discussions and comments that improved the design and performance of these statistical analyses, and thank Drs. George Daston and Scott Belanger for suggesting improvements to the manuscript.
