José Silva's Scrapbook
MY TAKE ON WHY GELMAN AND STERN DON’T LIKE STATISTICAL SIGNIFICANCE (AND NEITHER DO I NOR SHOULD YOU)
Paper at www.stat.columbia.edu/~gelman/research/published/signif4.pdf (PDF)
Let’s say that a country, Lagutrop, wants to improve its PISA results and, before deciding on policy, the decision-makers want to know what works by running experiments or studies of intervention effectiveness. These studies compare two variables with similar intervention scales, say expenditure on computer-assisted learning (U) and expenditure on teacher selection and incentive systems (V).
In Study 1, data is collected measuring the effect of U and V (possibly with a lot of covariates). The parameter estimates for the effect of U and V are given by the two bell-curve-like curves on the left above.
Conclusion (as is traditionally presented): Lagutrop should invest in teacher selection and incentive systems, since computer-assisted learning has no significant effect.
Gelman and Stern problem 1: but the difference between the effects is itself non-significant! If significance is the criterion for disposing of U, then it should also be explained to the decision-makers that significance cannot be used to separate U from V. Specifically, policy-makers in Lagutrop should be told that the rejection of computer-assisted learning is based on a criterion that, applied consistently, cannot tell the effect of computer-assisted learning apart from the effect of teacher selection and incentives.
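To put numbers on problem 1, here is a minimal sketch in Python. The estimates and standard errors are made up for illustration (they are not read off the figure), and the standard error of the difference assumes the two estimates are independent; estimates coming out of the same regression would also need their covariance.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_test(estimate, std_error):
    """Return (z, two-sided p-value) for the null hypothesis of zero effect."""
    z = estimate / std_error
    return z, two_sided_p(z)

# Hypothetical Study 1 results (illustrative numbers, not from the post).
effect_u, se_u = 10.0, 10.0   # computer-assisted learning
effect_v, se_v = 25.0, 10.0   # teacher selection and incentives

z_u, p_u = z_test(effect_u, se_u)
z_v, p_v = z_test(effect_v, se_v)

# Test the difference V - U directly.
# Assumption: the two estimates are independent, so the variances add.
diff = effect_v - effect_u
se_diff = math.sqrt(se_u**2 + se_v**2)
z_diff, p_diff = z_test(diff, se_diff)

print(f"U:     z = {z_u:.2f}, p = {p_u:.3f}")
print(f"V:     z = {z_v:.2f}, p = {p_v:.3f}")
print(f"V - U: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

With these numbers V clears the usual 5% threshold and U does not, yet the test on V - U does not clear it either, which is exactly the situation described above.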
Meanwhile, another group of researchers runs a single-variable study (Study 2) considering only the effects of spending money on teachers (in Lagutrop this study would probably have been done by teachers :-).
The results of Study 2 are then presented as supporting the conclusions of Study 1, phrased as “Expenditure on teachers shows a significant effect on PISA scores in both studies.”
Gelman and Stern problem 2: Studies 1 and 2 estimate very different effect sizes for variable V; why the discrepancy? How can two parameter estimates that are significantly different from each other be considered corroboration?
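Problem 2 can be sketched the same way. The numbers below are again hypothetical: two estimates of the effect of V, one from each study, treated as independent. Both are "significant" on their own, and so is the difference between them.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical estimates of the effect of V (illustrative numbers only).
v_study1, se_study1 = 25.0, 10.0   # Study 1: large effect, noisy
v_study2, se_study2 = 2.5, 1.0     # Study 2: small effect, precise

# Each study tested against zero effect.
z1 = v_study1 / se_study1
z2 = v_study2 / se_study2
p1, p2 = two_sided_p(z1), two_sided_p(z2)

# The two studies tested against each other.
# Assumption: independent studies, so the variances of the estimates add.
gap = v_study1 - v_study2
se_gap = math.sqrt(se_study1**2 + se_study2**2)
z_gap = gap / se_gap
p_gap = two_sided_p(z_gap)

print(f"Study 1 vs zero:    z = {z1:.2f}, p = {p1:.3f}")
print(f"Study 2 vs zero:    z = {z2:.2f}, p = {p2:.3f}")
print(f"Study 1 vs Study 2: z = {z_gap:.2f}, p = {p_gap:.3f}")
```

The headline "significant in both studies" is true of these numbers, but so is "the two studies significantly disagree about the size of the effect".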
My own take on this problem 2 is the following: suppose the policy-makers in Lagutrop have to decide how much to allocate to this PISA-improvement project, out of a budget that includes other considerations (national defense, jobs for the families and friends of the politicians, police, fire-fighters, etc.). Budgeting will require forecasting. Which of the parameter estimates for effect size will they use to build a forecasting model? Since the two estimates are significantly different, any attempt at aggregation would violate the basic meaning of that significance.
That’s what we engineers call a serious execution problem.
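The budgeting dilemma can be made concrete with a small sketch. Everything here is an assumption for illustration: the two effect estimates, the linear "gain = effect x spending" forecast, and the choice of a fixed-effect (inverse-variance) pooled estimate as the obvious way one might try to aggregate. Cochran's Q is the standard heterogeneity check for that kind of pooling.

```python
import math

# Hypothetical estimates of PISA points gained per unit of spending on V.
estimates = {"Study 1": (25.0, 10.0), "Study 2": (2.5, 1.0)}  # (effect, std. error)

# Fixed-effect pooling: weight each estimate by 1 / variance.
weights = {name: 1.0 / se**2 for name, (_, se) in estimates.items()}
total_weight = sum(weights.values())
pooled = sum(weights[name] * eff for name, (eff, _) in estimates.items()) / total_weight
pooled_se = math.sqrt(1.0 / total_weight)

# Cochran's Q: weighted squared deviations from the pooled estimate.
# With two studies Q has 1 degree of freedom, so its p-value is erfc(sqrt(Q / 2)).
q_stat = sum(weights[name] * (eff - pooled) ** 2 for name, (eff, _) in estimates.items())
q_p_value = math.erfc(math.sqrt(q_stat / 2.0))

# Naive linear forecast of the gain for a hypothetical budget of 4 units.
budget = 4.0
forecasts = {name: eff * budget for name, (eff, _) in estimates.items()}
forecasts["Pooled"] = pooled * budget

print(f"Pooled effect = {pooled:.2f} (se {pooled_se:.2f})")
print(f"Cochran's Q = {q_stat:.2f}, p = {q_p_value:.3f}")
for name, gain in forecasts.items():
    print(f"{name}: forecast gain = {gain:.1f} points")
```

The pooled number can always be computed, but the heterogeneity test rejects the premise that both studies estimate one common effect, and the forecasts from the two studies differ by a factor of ten. Whichever number goes into the budget model, it contradicts something the studies "significantly" found.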
(Reblogged from my Flickr post.) 
