10/22/2020

50 Years of Data Science

David Donoho, 50 Years of Data Science, Journal of Computational and Graphical Statistics, Volume 26, Issue 4, 2017, Pages 745-766.

More than 50 years ago, John Tukey called for a reformation of academic statistics. In “The Future of Data Analysis,” he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or “data analysis.” Ten to 20 years ago, John Chambers, Jeff Wu, Bill Cleveland, and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland and Wu even suggested the catchy name “data science” for this envisioned field. A recent and growing phenomenon has been the emergence of “data science” programs at major universities, including UC Berkeley, NYU, MIT, and most prominently, the University of Michigan, which in September 2015 announced a $100M “Data Science Initiative” that aims to hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; yet many academic statisticians perceive the new programs as “cultural appropriation.” This article reviews some ingredients of the current “data science moment,” including recent commentary about data science in the popular media, and about how/whether data science is really different from statistics. The now-contemplated field of data science amounts to a superset of the fields of statistics and machine learning, which adds some technology for “scaling up” to “big data.” This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next 50 years. Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers, and Breiman, I present a vision of data science based on the activities of people who are “learning from data,” and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s data science initiatives, while being able to accommodate the same short-term goals.

page 747: 

In conversations I have witnessed, computer scientists seem to have settled on the following talking points:

(a) data science is concerned with really big data, which traditional computing resources could not accommodate

(b) data science trainees have the skills needed to cope with such big datasets. (1)

page 749 what makes a science (2): 

There are diverse views as to what makes a science, but three constituents will be judged essential by most, viz:

(a1) intellectual content,

(a2) organization in an understandable form,

(a3) reliance upon the test of experience as the ultimate standard of validity.

By these tests, mathematics is not a science, since its ultimate standard of validity is an agreed-upon sort of logical consistency and provability. As I see it, data analysis passes all three tests, and I would regard it as a science, one defined by a ubiquitous problem rather than by a concrete subject. Data analysis, and the parts of statistics which adhere to it, must then take on the characteristics of a science rather than those of mathematics, … These points are meant to be taken seriously.

Tukey identified four driving forces in the new science:

Four major influences act on data analysis today:

1. The formal theories of statistics

2. Accelerating developments in computers and display devices

3. The challenge, in many fields, of more and ever larger bodies of data

4. The emphasis on quantification in an ever wider variety of disciplines 

page 752 on CTF

To my mind, the crucial but unappreciated methodology driving predictive modeling’s success is what computational linguist Mark Liberman (Liberman 2010) has called the Common Task Framework (CTF). An instance of the CTF has these ingredients:

(a) A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.

(b) A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.

(c) A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule....

It is no exaggeration to say that the combination of a predictive modeling culture together with CTF is the “secret sauce” of machine learning. 
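The CTF is essentially a protocol, so a small sketch may make it concrete. The Python fragment below is my own toy illustration, not code from Donoho's paper or from any real benchmark platform; the names (Referee, submit, baseline_rule) and the tiny dataset are invented. It shows the referee's role: hold the sequestered test set and return only a score for each submitted prediction rule.

```python
# Minimal sketch of a Common Task Framework "scoring referee".
# All names and data here are hypothetical illustrations.

class Referee:
    """Holds a sequestered test set and scores submitted prediction rules."""

    def __init__(self, test_features, test_labels):
        # The test set stays inside the referee; competitors never see it.
        self._test_features = test_features
        self._test_labels = test_labels
        self.leaderboard = []  # (competitor_name, accuracy) pairs

    def submit(self, competitor_name, prediction_rule):
        """Run a competitor's rule on the hidden test set; report accuracy."""
        predictions = [prediction_rule(x) for x in self._test_features]
        correct = sum(p == y for p, y in zip(predictions, self._test_labels))
        accuracy = correct / len(self._test_labels)
        self.leaderboard.append((competitor_name, accuracy))
        return accuracy


if __name__ == "__main__":
    # Toy "task": classify a number as 1 if it exceeds some threshold.
    referee = Referee(test_features=[0.1, 0.4, 0.6, 0.9],
                      test_labels=[0, 0, 1, 1])

    # A competitor infers a rule from the public training data, then submits it.
    def baseline_rule(x):
        return 1 if x > 0.5 else 0

    print(referee.submit("baseline", baseline_rule))  # -> 1.0
```

Real instances of the CTF automate exactly this loop at scale, which is what lets many competitors iterate rapidly against a fixed, objective target.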

page 753 on teaching 

Let us consider the attractive and informative web site for the UC Berkeley Data Science Masters’ degree at datascience.berkeley.edu.

page 755 on greater data science (GDS)

The activities of GDS are classified into six divisions:

1. Data Gathering, Preparation, and Exploration

2. Data Representation and Transformation

3. Computing with Data

4. Data Modeling

5. Data Visualization and Presentation

6. Science about Data Science
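
To see how divisions 1 through 5 surface in even the smallest analysis (division 6, science about data science, studies such workflows rather than appearing inside them), here is a toy end-to-end sketch. It is my own illustration on invented data, not an example from the paper:

```python
# Toy pipeline touching GDS divisions 1-5 on invented data.
# The CSV text, column names, and "model" are all made up for illustration.
import csv
import io
import statistics

# 1. Data gathering, preparation, and exploration: "ingest" a tiny CSV.
raw = io.StringIO("player,hits,at_bats\nA,27,90\nB,35,120\nC,19,60\n")
rows = list(csv.DictReader(raw))

# 2. Data representation and transformation: derive a batting-average column.
for row in rows:
    row["avg"] = int(row["hits"]) / int(row["at_bats"])

# 3. Computing with data / 4. Data modeling: summarize with a simple statistic
#    (the "model" here is just the league-wide mean average).
league_mean = statistics.mean(row["avg"] for row in rows)

# 5. Data visualization and presentation: a crude text chart.
for row in rows:
    bar = "#" * round(row["avg"] * 100)
    print(f'{row["player"]}: {row["avg"]:.3f} {bar}')
print(f"league mean: {league_mean:.3f}")
```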

page 757 on teaching of GDS

I suggest the reader study carefully two books (together).

  • The Book (Tango, Lichtman, and Dolphin 2007) analyzes a set of databases covering all aspects of the American game of major league baseball, including every game played in recent decades and every player who ever appeared in such games. This amazingly comprehensive work considers a near-exhaustive list of questions one might have about the quantitative performance of different baseball strategies, carefully describes how such questions can be answered using such a database, typically by a statistical two-sample test (or A/B test in internet marketing terminology).
  • Analyzing Baseball Data with R (Marchi and Albert 2013) showed how to access the impressive wealth of available Baseball data using the internet and how to use R to insightfully analyze that data. 
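
The first book's workhorse, per the excerpt above, is the two-sample (A/B) test, and the second book shows how to run such analyses in R. As a language-neutral illustration, here is a minimal Python sketch of that kind of comparison; the strategy names and per-plate-appearance run values are invented, and a real analysis would draw them from the baseball databases the books describe:

```python
# Minimal sketch of the two-sample (A/B) comparison mentioned above,
# on made-up numbers; the books themselves work from real MLB databases.
from scipy import stats

# Hypothetical per-plate-appearance run values under two strategies,
# e.g. "sacrifice bunt" vs. "swing away" (numbers are invented).
strategy_a = [0.02, 0.05, 0.01, 0.04, 0.03, 0.02, 0.06, 0.03]
strategy_b = [0.05, 0.07, 0.04, 0.08, 0.06, 0.05, 0.09, 0.07]

# Welch's two-sample t-test: does mean run value differ between strategies?
t_stat, p_value = stats.ttest_ind(strategy_a, strategy_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```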

(1) A wonderful and standard reference is Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

(2)  J. W. Tukey, The Future of Data Analysis, The Annals of Mathematical Statistics, 1962, 33(1), pp. 1–67.
