9/25/2020

Statistical Modeling: The Two Cultures

Leo Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, 2001, Vol. 16, No. 3, 199–231.

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

page 1:

There are two goals in analyzing the data:

Prediction. To be able to predict what the responses are going to be to future input variables;

Information. To extract some information about how nature is associating the response variables to the input variables.

Objective: 

In this paper I will argue that the focus in the statistical community on data models has:

  • Led to irrelevant theory and questionable scientific conclusions;
  • Kept statisticians from using more suitable algorithmic models;
  • Prevented statisticians from working on exciting new problems.

I will also review some of the interesting new developments in algorithmic modeling in machine learning and look at applications to three data sets.

Interpretability: 

While trees rate an A+ on interpretability, they are good, but not great, predictors....

So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability, they rate an F.... 

My biostatistician friends tell me, “Doctors can interpret logistic regression.” There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they’ll go for interpretability. 
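To make the "plurality vote" concrete, here is a minimal sketch in Python. It is a toy illustration only, not Breiman's random forest algorithm: there is no bootstrap sampling or random feature selection, and the three "trees", the patient record, and all names (`tree_a`, `forest_predict`, and so on) are hypothetical. Each tree alone is easy to read, but explaining the final answer already means tracing every tree, which hints at why a vote over 100 trees is hard to interpret.

```python
from collections import Counter


def plurality_vote(votes):
    """Return the label with the most votes (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical hand-written "trees" for a toy patient record.
# Each one is a short chain of if/else rules, so each is easy to read alone.
def tree_a(x):
    return "high risk" if x["age"] > 60 else "low risk"


def tree_b(x):
    if x["blood_pressure"] > 140:
        return "high risk"
    return "high risk" if x["age"] > 75 else "low risk"


def tree_c(x):
    return "high risk" if x["smoker"] and x["age"] > 50 else "low risk"


FOREST = [tree_a, tree_b, tree_c]


def forest_predict(x):
    """Collect one vote per tree and return (winning label, all votes)."""
    votes = [tree(x) for tree in FOREST]
    return plurality_vote(votes), votes


# Made-up example record, for illustration only.
patient = {"age": 65, "blood_pressure": 130, "smoker": False}
label, votes = forest_predict(patient)
print(label, votes)
```

With only three trees the vote can still be checked by hand; the interpretability problem in the quote comes from scaling the same mechanism to many deep trees.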

The current state-of-the-art approach for trees.

Still highly relevant today. In other words, we want robustness and interpretability together with good performance:

There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most important to one:

  • Rashomon: the multiplicity of good models;
  • Occam: the conflict between simplicity and accuracy;
  • Bellman: dimensionality—curse or blessing.

Comment by Brad Efron:

The whole point of science is to open up black boxes, understand their insides, and build better boxes for the purposes of mankind. 
