
Despite practical challenges, we are hopeful that informed discussions among policy-makers and the public about data and the capabilities of machine learning will lead to insightful designs of programs and policies that can balance the goals of protecting privacy and ensuring fairness with those of reaping the benefits to scientific research and to individual and public health. Our commitments to privacy and fairness are evergreen, but our policy choices must adapt to advance them and support new techniques for deepening our knowledge.


10.1126/science.aac4520

REVIEW

Machine learning: Trends, perspectives, and prospects

M. I. Jordan1* and T. M. Mitchell2*

Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology, and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.

Machine learning is a discipline focused on two interrelated questions: How can one construct computer systems that automatically improve through experience? And what are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations? The study of machine learning is important both for addressing these fundamental scientific and engineering questions and for the highly practical computer software it has produced and fielded across many applications.

Machine learning has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use. Within artificial intelligence (AI), machine learning has emerged as the method of choice for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications. Many developers of AI systems now recognize that, for many applications, it can be far easier to train a system by showing it examples of desired input-output behavior than to program it manually by anticipating the desired response for all possible inputs. The effect of machine learning has also been felt broadly across computer science and across a range of industries concerned with data-intensive issues, such as consumer services, the diagnosis of faults in complex systems, and the control of logistics chains. There has been a similarly broad range of effects across empirical sciences, from biology to cosmology to social science, as machine-learning methods have been developed to analyze high-throughput experimental data in novel ways. See Fig. 1 for a depiction of some recent areas of application of machine learning.

A learning problem can be defined as the problem of improving some measure of performance when executing some task, through some type of training experience. For example, in learning to detect credit-card fraud, the task is to assign a label of “fraud” or “not fraud” to any given credit-card transaction. The performance metric to be improved might be the accuracy of this fraud classifier, and the training experience might consist of a collection of historical credit-card transactions, each labeled in retrospect as fraudulent or not. Alternatively, one might define a different performance metric that assigns a higher penalty when “fraud” is labeled “not fraud” than when “not fraud” is incorrectly labeled “fraud.” One might also define a different type of training experience, for example by including unlabeled credit-card transactions along with labeled examples.

A diverse array of machine-learning algorithms has been developed to cover the wide variety of data and problem types exhibited across different machine-learning problems (1, 2). Conceptually, machine-learning algorithms can be viewed as searching through a large space of candidate programs, guided by training experience, to find a program that optimizes the performance metric. Machine-learning algorithms vary greatly, in part by the way in which they represent candidate programs (e.g., decision trees, mathematical functions, and general programming languages) and in part by the way in which they search through this space of programs (e.g., optimization algorithms with well-understood convergence guarantees and evolutionary search methods that evaluate successive generations of randomly mutated programs). Here, we focus on approaches that have been particularly successful to date.

Many algorithms focus on function approximation problems, where the task is embodied in a function (e.g., given an input transaction, output a “fraud” or “not fraud” label), and the learning problem is to improve the accuracy of that function, with experience consisting of a sample of known input-output pairs of the function. In some cases, the function is represented explicitly as a parameterized functional form; in other cases, the function is implicit and obtained via a search process, a factorization, an optimization

SCIENCE sciencemag.org 17 JULY 2015 • VOL 349 ISSUE 6245 255

1Department of Electrical Engineering and Computer Sciences, Department of Statistics, University of California, Berkeley, CA, USA. 2Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA.
*Corresponding author. E-mail: [email protected] (M.I.J.); [email protected] (T.M.M.)



procedure, or a simulation-based procedure. Even when implicit, the function generally depends on parameters or other tunable degrees of freedom, and training corresponds to finding values for these parameters that optimize the performance metric.

Whatever the learning algorithm, a key scientific and practical goal is to theoretically characterize the capabilities of specific learning algorithms and the inherent difficulty of any given learning problem: How accurately can the algorithm learn from a particular type and volume of training data? How robust is the algorithm to errors in its modeling assumptions or to errors in the training data? Given a learning problem with a given volume of training data, is it possible to design a successful algorithm, or is this learning problem fundamentally intractable? Such theoretical characterizations of machine-learning algorithms and problems typically make use of the familiar frameworks of statistical decision theory and computational complexity theory. In fact, attempts to characterize machine-learning algorithms theoretically have led to blends of statistical and computational theory in which the goal is to simultaneously characterize the sample complexity (how much data are required to learn accurately) and the computational complexity (how much computation is required) and to specify how these depend on features of the learning algorithm, such as the representation it uses for what it learns (3–6). A specific form of computational analysis that has proved particularly useful in recent years has been that of optimization theory, with upper and lower bounds on rates of convergence of optimization procedures merging well with the formulation of machine-learning problems as the optimization of a performance metric (7, 8).

As a field of study, machine learning sits at the crossroads of computer science, statistics, and a variety of other disciplines concerned with automatic improvement over time, and inference and decision-making under uncertainty. Related disciplines include the psychological study of human learning, the study of evolution, adaptive control theory, the study of educational practices, neuroscience, organizational behavior, and economics. Although the past decade has seen increased crosstalk with these other fields, we are just beginning to tap the potential synergies and the diversity of formalisms and experimental methods used across these multiple fields for studying systems that improve with experience.
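The sample-complexity question ("how much data are required to learn accurately?") can be made concrete with a toy empirical learning curve. The sketch below is purely illustrative and not from the article: it fits a nearest-centroid classifier to synthetic two-dimensional Gaussian data and reports held-out accuracy as the training set grows; all names and settings are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_per_class):
    """Two Gaussian classes in the plane, centered at (-1,-1) and (+1,+1)."""
    x0 = rng.normal(loc=-1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=+1.0, size=(n_per_class, 2))
    return np.vstack([x0, x1]), np.array([0] * n_per_class + [1] * n_per_class)

def nearest_centroid_accuracy(n_train_per_class, n_test_per_class=2000):
    """Fit a nearest-centroid classifier and return its held-out accuracy."""
    x_tr, y_tr = make_data(n_train_per_class)
    x_te, y_te = make_data(n_test_per_class)
    c0 = x_tr[y_tr == 0].mean(axis=0)  # estimated class centroids
    c1 = x_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(x_te - c1, axis=1)
            < np.linalg.norm(x_te - c0, axis=1)).astype(int)
    return float((pred == y_te).mean())

# Empirical learning curve: accuracy as a function of training-set size.
for n in (2, 10, 100, 1000):
    print(n, round(nearest_centroid_accuracy(n), 3))
```

For this synthetic configuration the Bayes-optimal accuracy is about 0.92, and the learned classifier typically approaches that rate as the training set grows, illustrating the trade-off the theory characterizes.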

Drivers of machine-learning progress

The past decade has seen rapid growth in the ability of networked and mobile computing systems to gather and transport vast amounts of data, a phenomenon often referred to as “Big Data.” The scientists and engineers who collect such data have often turned to machine learning for solutions to the problem of obtaining useful insights, predictions, and decisions from such data sets. Indeed, the sheer size of the data makes it essential to develop scalable procedures that blend computational and statistical


Fig. 1. Applications of machine learning. Machine learning is having a substantial effect on many areas of technology and science; examples of recent applied success stories include robotics and autonomous vehicle control (top left), speech processing and natural language processing (top right), neuroscience research (middle), and applications in computer vision (bottom). [The middle panel is adapted from (29). The images in the bottom panel are from the ImageNet database; object recognition annotation is by R. Girshick.]
CREDIT: ISTOCK/AKINBOSTANCI



considerations, but the issue is more than the mere size of modern data sets; it is the granular, personalized nature of much of these data. Mobile devices and embedded computing permit large amounts of data to be gathered about individual humans, and machine-learning algorithms can learn from these data to customize their services to the needs and circumstances of each individual. Moreover, these personalized services can be connected, so that an overall service emerges that takes advantage of the wealth and diversity of data from many individuals while still customizing to the needs and circumstances of each. Instances of this trend toward capturing and mining large quantities of data to improve services and productivity can be found across many fields of commerce, science, and government. Historical medical records are used to discover which patients will respond best to which treatments; historical traffic data are used to improve traffic control and reduce congestion; historical crime data are used to help allocate local police to specific locations at specific times; and large experimental data sets are captured and curated to accelerate progress in biology, astronomy, neuroscience, and other data-intensive empirical sciences. We appear to be at the beginning of a decades-long trend toward increasingly data-intensive, evidence-based decision-making across many aspects of science, commerce, and government.

With the increasing prominence of large-scale data in all areas of human endeavor has come a wave of new demands on the underlying machine-learning algorithms. For example, huge data sets require computationally tractable algorithms, highly personal data raise the need for algorithms that minimize privacy effects, and the availability of huge quantities of unlabeled data raises the challenge of designing learning algorithms to take advantage of it. The next sections survey some of the effects of these demands on recent work in machine-learning algorithms, theory, and practice.
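One concrete instance of a computationally tractable procedure for large data sets is minibatch stochastic gradient descent, whose per-update cost is independent of the total number of examples. The following sketch is illustrative only (the data, batch size, and learning rate are hypothetical choices, not from the article): it fits a logistic-regression classifier to synthetic data one small batch at a time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic classification data: labels come from a hidden weight vector.
n, d = 5000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, batch_size=32, lr=0.5, epochs=5):
    """Minibatch SGD for logistic regression: each update touches only
    `batch_size` examples, so the cost per step does not grow with n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the logistic loss on this minibatch only.
            grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
            w -= lr * grad
    return w

w_hat = sgd_logistic(X, y)
accuracy = float(((sigmoid(X @ w_hat) > 0.5) == (y == 1.0)).mean())
print(f"training accuracy: {accuracy:.3f}")
```

The same loop scales to data sets far larger than memory, since each step needs only one minibatch; this is the sense in which such procedures "blend computational and statistical considerations."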

Core methods and recent progress

The most widely used machine-learning methods are supervised learning methods (1). Supervised learning systems, including spam classifiers of e-mail, face recognizers over images, and medical diagnosis systems for patients, all exemplify the function approximation problem discussed earlier, where the training data take the form of a collection of (x, y) pairs and the goal is to produce a prediction y* in response to a query x*. The inputs x may be classical vectors or they may be more complex objects such as documents, images, DNA sequences, or graphs. Similarly, many different kinds of output y have been studied. Much progress has been made by focusing on the simple binary classification problem in which y takes on one of two values (for example, “spam” or “not spam”), but there has also been abundant research on problems such as multiclass classification (where y takes on one of K labels), multilabel classification (where y is labeled simultaneously by several of the K labels), ranking problems (where y provides a partial order on some set), and general structured prediction problems (where y is a combinatorial object such as a graph, whose components may be required to satisfy some set of constraints). An example of the latter problem is part-of-speech tagging, where the goal is to simultaneously label every word in an input sentence x as being a noun, verb, or some other part of speech. Supervised learning also includes cases in which y has real-valued components or a mixture of discrete and real-valued components.

Supervised learning systems generally form their predictions via a learned mapping f(x), which produces an output y for each input x (or a probability distribution over y given x). Many different forms of mapping f exist, including decision trees, decision forests, logistic regression, support vector machines, neural networks, kernel machines, and Bayesian classifiers (1). A variety of learning algorithms has been proposed to estimate these different types of mappings, and there are also generic procedures such as boosting and multiple kernel learning that combine the outputs of multiple learning algorithms. Procedures for learning f from data often make use of ideas from optimization theory or numerical analysis, with the specific form of machine-learning problems (e.g., that the objective function or function to be integrated is often the sum over a large number of terms) driving innovations. This diversity of learning architectures and algorithms reflects the diverse needs of applications, with different architectures capturing different kinds of mathematical structures, offering different levels of amenability to post-hoc visualization and explanation, and providing varying trade-offs between computational complexity, the amount of data, and performance.

One high-impact area of progress in supervised learning in recent years involves deep networks, which are multilayer networks of threshold units, each of which computes some simple parameterized function of its inputs (9, 10). Deep learning systems make use of gradient-based optimization algorithms to adjust parameters throughout such a multilayered network based on errors at its output. Exploiting modern parallel computing architectures, such as graphics processing units originally developed for video gaming, it has been possible to build deep learning systems that contain billions of parameters and that can be trained on the very large collections of images, videos, and speech samples available on the Internet. Such large-scale deep learning systems have had a major effect in recent years in computer vision (11) and speech recognition (12), where they have yielded major improvements in performance over previous approaches


Fig. 2. Automatic generation of text captions for images with deep networks. A convolutional neural network is trained to interpret images, and its output is then used by a recurrent neural network trained to generate a text caption (top). The sequence at the bottom shows the word-by-word focus of the network on different parts of the input image while it generates the caption word by word. [Adapted with permission from (30)]
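The deep-network training loop sketched above (gradient-based adjustment of parameters throughout a multilayer network, driven by errors at its output) can be illustrated in miniature. This is a minimal hand-rolled example, not any system discussed in the article: a single hidden layer of tanh units, trained by full-batch gradient descent to fit the XOR function, which no single-layer network can represent.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic function that no single-layer network can represent.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# One hidden layer of 8 tanh units feeding a single sigmoid output unit.
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                    # hidden activations
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # output probability
    return h, out

def mse(out):
    return float(np.mean((out - y) ** 2))

loss_init = mse(forward(X)[1])

lr = 0.5
for _ in range(5000):
    h, out = forward(X)
    # Backpropagate the output error through both layers (chain rule).
    d_out = 2.0 * (out - y) / len(y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

loss_final = mse(forward(X)[1])
print(f"loss: {loss_init:.3f} -> {loss_final:.3f}")
print(np.round(forward(X)[1].ravel(), 2))  # should approach [0, 1, 1, 0]
```

The systems described in the article apply the same principle at vastly larger scale, with many layers, billions of parameters, and stochastic rather than full-batch gradients.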


(see Fig. 2). Deep network methods are being actively pursued in a variety of additional applications, from natural language translation to collaborative filtering.

The internal layers of deep networks can be viewed as providing learned representations of the input data. While much of the practical success in deep learning has come from supervised learning methods for discovering such representations, efforts have also been made to develop deep learning algorithms that discover useful representations of the input without the need for labeled training data (13). The general problem is referred to as unsupervised learning, a second paradigm in machine-learning research (2).

Broadly, unsupervised learning involves the analysis of unlabeled data under assumptions about structural properties of the data (e.g., algebraic, combinatorial, or probabilistic). For example, one can assume that data lie on a low-dimensional manifold and aim to identify that manifold explicitly from data. Dimension reduction methods—including principal components analysis, manifold learning, factor analysis, random projections, and autoencoders (1, 2)—make different specific assumptions regarding the underlying manifold (e.g., that it is a linear subspace, a smooth nonlinear manifold, or a collection of submanifolds). Another example of dimension reduction is the topic modeling framework depicted in Fig. 3. A criterion function is defined that embodies these assumptions—often making use of general statistical principles such as maximum likelihood, the method of moments, or Bayesian integration—and optimization or sampling algorithms are developed to optimize the criterion. As another example, clustering is the problem of finding a partition of the observed data (and a rule for predicting future data) in the absence of explicit labels indicating a desired partition. A wide range of clustering procedures has been developed, all based on specific assumptions regarding the nature of a "cluster." In both clustering and dimension reduction, the concern with computational complexity is paramount, given that the goal is to exploit the particularly large data sets that are available if one dispenses with supervised labels.

A third major machine-learning paradigm is reinforcement learning (14, 15). Here, the information available in the training data is intermediate between supervised and unsupervised learning. Instead of training examples that indicate the correct output for a given input, the training data in reinforcement learning are assumed to provide only an indication as to whether an action is correct or not; if an action is incorrect, there remains the problem of finding the correct action. More generally, in the setting of sequences of inputs, it is assumed that reward signals refer to the entire sequence; the assignment of credit or blame to individual actions in the sequence is not directly provided. Indeed, although simplified versions of reinforcement learning known as bandit problems are studied, where it is assumed that rewards are provided after each action, reinforcement-learning problems typically involve a general control-theoretic setting in which the learning task is to learn a control strategy (a "policy") for an agent acting in an unknown dynamical environment, where that learned strategy is trained to choose actions for any given state, with the objective of maximizing its expected reward over time. The ties to research in control theory and operations research have increased over the years, with formulations such as Markov decision processes and partially observed Markov decision processes providing points of contact (15, 16). Reinforcement-learning algorithms generally make use of ideas that are familiar from the control-theory literature, such as policy iteration, value iteration, rollouts, and variance reduction, with innovations arising to address the specific needs of machine learning (e.g., large-scale problems, few assumptions about the unknown dynamical environment, and the use of supervised learning architectures to represent policies). It is also worth noting the strong ties between reinforcement learning and many decades of work on learning in psychology and neuroscience, one notable example being the use of reinforcement-learning algorithms to predict the response of dopaminergic neurons in monkeys learning to associate a stimulus light with subsequent sugar reward (17).

Although these three learning paradigms help to organize ideas, much current research involves blends across these categories. For example, semisupervised learning makes use of unlabeled data to augment labeled data in a supervised learning context, and discriminative training blends architectures developed for unsupervised learning with optimization formulations that make use of labels. Model selection is the broad activity of using training data not only to fit a model but also to select from a family of models, and the fact that training data do not directly indicate

258 17 JULY 2015 • VOL 349 ISSUE 6245 sciencemag.org SCIENCE

[Fig. 3 diagram: example topics, each a distribution over words with illustrative probabilities—{gene 0.04, dna 0.02, genetic 0.01}, {life 0.02, evolve 0.01, organism 0.01}, {data 0.02, number 0.02, computer 0.01}, {brain 0.04, neuron 0.02, nerve 0.01}—with topic proportions and per-word topic assignments shown for an example document.]
Fig. 3. Topic models. Topic modeling is a methodology for analyzing documents, where a document is viewed as a collection of words, and the words in the document are viewed as being generated by an underlying set of topics (denoted by the colors in the figure). Topics are probability distributions across words (leftmost column), and each document is characterized by a probability distribution across topics (histogram). These distributions are inferred based on the analysis of a collection of documents and can be used to classify, index, and summarize the content of documents. [From (31). Copyright 2012, Association for Computing Machinery, Inc. Reprinted with permission]
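The inference step described in the caption is available in standard libraries. A minimal sketch, assuming scikit-learn is installed, fits a latent Dirichlet allocation model to a tiny hypothetical corpus and recovers per-document topic proportions (the histogram in the figure):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus; real topic models are fit on thousands of documents.
docs = [
    "gene dna genetic genome sequencing",
    "dna genome gene evolution organism",
    "computer data number algorithm computation",
    "data computer number analysis algorithm",
]

# Bag-of-words counts: each document becomes a vector of word counts.
counts = CountVectorizer().fit_transform(docs)

# Fit a two-topic model; each topic is a distribution over words, and each
# document is assigned a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.round(2))  # rows sum to 1: per-document topic proportions
```

The fitted `components_` attribute holds the unnormalized topic-word weights, corresponding to the leftmost column of the figure.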


ARTIFICIAL INTELLIGENCE


which model to use leads to the use of algorithms developed for bandit problems and to Bayesian optimization procedures. Active learning arises when the learner is allowed to choose data points and query the trainer to request targeted information, such as the label of an otherwise unlabeled example. Causal modeling is the effort to go beyond simply discovering predictive relations among variables, to distinguish which variables causally influence others (e.g., a high white-blood-cell count can predict the existence of an infection, but it is the infection that causes the high white-cell count). Many issues influence the design of learning algorithms across all of these paradigms, including whether data are available in batches or arrive sequentially over time, how data have been sampled, requirements that learned models be interpretable by users, and robustness issues that arise when data do not fit prior modeling assumptions.
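The bandit setting mentioned above is easy to make concrete: a reward arrives after every action, and the learner must balance exploring unfamiliar arms against exploiting the arm that currently looks best. A minimal epsilon-greedy sketch, with hypothetical arm payoffs:

```python
import random

def epsilon_greedy_bandit(arm_means, steps=5000, eps=0.1, seed=0):
    """Epsilon-greedy strategy for a multi-armed bandit: with probability
    eps, explore a random arm; otherwise exploit the arm with the highest
    estimated mean reward."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    values = [0.0] * len(arm_means)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:                        # explore
            arm = rng.randrange(len(arm_means))
        else:                                         # exploit best estimate
            arm = max(range(len(arm_means)), key=lambda a: values[a])
        reward = rng.gauss(arm_means[arm], 1.0)       # noisy payoff
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts

values, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print(max(range(3), key=lambda a: counts[a]))  # arm pulled most often
```

The contrast with full reinforcement learning is visible in the reward line: here credit for each reward goes unambiguously to a single action, whereas in the sequential setting credit must be apportioned across a whole trajectory.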

Emerging trends

The field of machine learning is sufficiently young that it is still rapidly expanding, often by inventing new formalizations of machine-learning problems driven by practical applications. (An example is the development of recommendation systems, as described in Fig. 4.) One major trend driving this expansion is a growing concern with the environment in which a machine-learning algorithm operates. The word "environment" here refers in part to the computing architecture; whereas a classical machine-learning system involved a single program running on a single machine, it is now common for machine-learning systems to be deployed in architectures that include many thousands or tens of thousands of processors, such that communication constraints and issues of parallelism and distributed processing take center stage. Indeed, as depicted in Fig. 5, machine-learning systems are increasingly taking the form of complex collections of software that run on large-scale parallel and distributed computing platforms and provide a range of algorithms and services to data analysts.

The word "environment" also refers to the source of the data, which ranges from a set of people who may have privacy or ownership concerns, to the analyst or decision-maker who may have certain requirements on a machine-learning system (for example, that its output be visualizable), and to the social, legal, or political framework surrounding the deployment of a system. The environment also may include other machine-learning systems or other agents, and the overall collection of systems may be cooperative or adversarial. Broadly speaking, environments provide various resources to a learning algorithm and place constraints on those resources. Increasingly, machine-learning researchers are formalizing these relationships, aiming to design algorithms that are provably effective in various environments and explicitly allow users to express and control trade-offs among resources.

As an example of resource constraints, let us suppose that the data are provided by a set of individuals who wish to retain a degree of privacy. Privacy can be formalized via the notion of "differential privacy," which defines a probabilistic channel between the data and the outside world such that an observer of the output of the channel cannot reliably infer whether particular individuals have supplied data or not (18). Classical applications of differential privacy have involved ensuring that queries (e.g., "what is the maximum balance across a set of accounts?") to a privatized database return an answer that is close to that returned on the nonprivate data. Recent research has brought differential privacy into contact with machine learning, where queries involve predictions or other inferential assertions (e.g., "given the data I've seen so far, what is the probability that a new transaction is fraudulent?") (19, 20). Placing the overall design of a privacy-enhancing machine-learning system within a decision-theoretic framework provides users with a tuning knob whereby they can choose a desired level of privacy that takes into account the kinds of questions that will be asked of the data and their own personal utility for the answers. For example, a person may be willing to reveal most of their genome in the context of research on a disease that runs in their family but may ask for more stringent protection if information about their genome is being used to set insurance rates.

Communication is another resource that needs to be managed within the overall context of a distributed learning system. For example, data may be distributed across distinct physical locations because their size does not allow them to be aggregated at a single site or because of administrative boundaries. In such a setting, we may wish to impose a bit-rate communication constraint on the machine-learning algorithm. Solving the design problem under such a constraint will generally show how the performance of the learning system degrades as communication bandwidth decreases, but it can also reveal how the performance improves as the number of distributed sites (e.g., machines or processors) increases, trading off these quantities against the amount of data (21, 22). Much as in classical information theory, this line of research aims at fundamental lower bounds on achievable performance and specific algorithms that achieve those lower bounds.

A major goal of this general line of research is to bring the kinds of statistical resources studied in machine learning (e.g., number of data points, dimension of a parameter, and complexity of a hypothesis class) into contact with the classical computational resources of time and space. Such a bridge is present in the "probably approximately correct" (PAC) learning framework, which studies the effect of adding a polynomial-time computation constraint on the relationship among error rates, training data size, and other parameters of the learning algorithm (3). Recent advances in this line of research include various lower bounds that establish fundamental gaps in the performance achievable in certain machine-learning problems (e.g., sparse regression and sparse principal components analysis) via polynomial-time and exponential-time algorithms (23). The core of the problem, however, involves time-data trade-offs that are far from the polynomial/exponential boundary. The large data sets that are increasingly the norm require algorithms whose time and space requirements are linear or sublinear in the problem size (number of data points or number of dimensions). Recent research focuses on methods such as subsampling, random projections, and algorithm weakening to achieve scalability while retaining statistical control (24, 25). The ultimate goal is to be able to supply time and space budgets to machine-learning systems in addition to accuracy requirements, with the system finding an operating point that allows such requirements to be realized.
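Random projection, one of the scalability methods just mentioned, is simple enough to sketch directly: multiplying high-dimensional data by a random Gaussian matrix approximately preserves pairwise distances (the Johnson-Lindenstrauss property) at a fraction of the cost. The dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 points in 10,000 dimensions: distance computations in the ambient
# space are expensive.
n, d, k = 500, 10_000, 400
X = rng.normal(size=(n, d))

# Random Gaussian projection to k dimensions, scaled so that squared
# distances are preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# Compare one pairwise distance before and after projection.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(orig, proj, proj / orig)  # ratio close to 1
```

The distortion shrinks as k grows, making the projection dimension itself a tunable resource: a smaller k buys speed and space at the cost of statistical accuracy, exactly the kind of trade-off the paragraph above describes.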

Opportunities and challenges

Despite its practical and commercial successes, machine learning remains a young field with many underexplored research opportunities. Some of these opportunities can be seen by contrasting current machine-learning approaches to the types of learning we observe in naturally


Fig. 4. Recommendation systems. A recommendation system is a machine-learning system that is based on data that indicate links between a set of users (e.g., people) and a set of items (e.g., products). A link between a user and a product means that the user has indicated an interest in the product in some fashion (perhaps by purchasing that item in the past). The machine-learning problem is to suggest other items to a given user that he or she may also be interested in, based on the data across all users.
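The link structure in the caption can be exploited directly. One common approach (a minimal item-based collaborative-filtering sketch; the interest matrix here is hypothetical) scores unseen items by their similarity to the items a user has already linked to:

```python
import numpy as np

# Rows are users, columns are items; 1 means "showed interest."
R = np.array([
    [1, 1, 0, 0],   # user 0 likes items 0, 1
    [1, 1, 1, 0],   # user 1 likes items 0, 1, 2
    [0, 0, 1, 1],   # user 2 likes items 2, 3
], dtype=float)

# Item-item cosine similarity: items liked by the same users score high.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

def recommend(user):
    scores = sim @ R[user]          # total similarity to the user's items
    scores[R[user] > 0] = -np.inf   # do not re-recommend known items
    return int(np.argmax(scores))

print(recommend(0))  # item 2: user 1 shares items 0 and 1 and also likes it
```

Production systems replace the cosine step with learned low-rank factorizations of the user-item matrix, but the underlying idea of propagating preferences through shared links is the same.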



occurring systems such as humans and other animals, organizations, economies, and biological evolution. For example, whereas most machine-learning algorithms are targeted to learn one specific function or data model from one single data source, humans clearly learn many different skills and types of knowledge, from years of diverse training experience, supervised and unsupervised, in a simple-to-more-difficult sequence (e.g., learning to crawl, then walk, then run). This has led some researchers to begin exploring the question of how to construct computer lifelong or never-ending learners that operate nonstop for years, learning thousands of interrelated skills or functions within an overall architecture that allows the system to improve its ability to learn one skill based on having learned another (26-28). Another aspect of the analogy to natural learning systems suggests the idea of team-based, mixed-initiative learning. For example, whereas current machine-learning systems typically operate in isolation to analyze the given data, people often work in teams to collect and analyze data (e.g., biologists have worked as teams to collect and analyze genomic data, bringing together diverse experiments and perspectives to make progress on this difficult problem). New machine-learning methods capable of working collaboratively with humans to jointly analyze complex data sets might bring together the abilities of machines to tease out subtle statistical regularities from massive data sets with the abilities of humans to draw on diverse background knowledge to generate plausible explanations and suggest new hypotheses. Many theoretical results in machine learning apply to all learning systems, whether they are computer algorithms, animals, organizations, or natural evolution. As the field progresses, we may see machine-learning theory and algorithms increasingly providing models for understanding learning in neural systems, organizations, and biological evolution and see machine learning benefit from ongoing studies of these other types of learning systems.

As with any powerful technology, machine learning raises questions about which of its potential uses society should encourage and discourage. The push in recent years to collect new kinds of personal data, motivated by its economic value, leads to obvious privacy issues, as mentioned above. The increasing value of data also raises a second ethical issue: Who will have access to, and ownership of, online data, and who will reap its benefits? Currently, much data are collected by corporations for specific uses leading to improved profits, with little or no motive for data sharing. However, the potential benefits that society could realize, even from existing online data, would be considerable if those data were to be made available for the public good.

To illustrate, consider one simple example of how society could benefit from data that are already online today by using these data to decrease the risk of global pandemic spread from infectious diseases. By combining location data from online sources (e.g., location data from cell phones, from credit-card transactions at retail outlets, and from security cameras in public places and private buildings) with online medical data (e.g., emergency room admissions), it would be feasible today to implement a simple system to telephone individuals immediately if a person they were in close contact with yesterday was just admitted to the emergency room with an infectious disease, alerting them to the symptoms they should watch for and the precautions they should take. Here, there is clearly a tension and trade-off between personal privacy and public health, and society at large needs to decide how to make this trade-off. The larger point of this example, however, is that, although the data are already online, we do not currently have the laws, customs, culture, or mechanisms to enable society to benefit from them, if it wishes to do so. In fact, much of these data are privately held and owned, even though they are data about each of us. Considerations such as these suggest that machine learning is likely to be one of the most transformative technologies of the 21st century. Although it is impossible to predict the future, it appears essential that society begin now to consider how to maximize its benefits.

REFERENCES

1. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York, 2011).
2. K. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, Cambridge, MA, 2012).
3. L. Valiant, Commun. ACM 27, 1134-1142 (1984).
4. V. Chandrasekaran, M. I. Jordan, Proc. Natl. Acad. Sci. U.S.A. 110, E1181-E1190 (2013).
5. S. Decatur, O. Goldreich, D. Ron, SIAM J. Comput. 29, 854-879 (2000).
6. S. Shalev-Shwartz, O. Shamir, E. Tromer, Using more data to speed up training time, in Proceedings of the Fifteenth Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 21 to 23 April 2012.
7. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, in Foundations and Trends in Machine Learning 3 (Now Publishers, Boston, 2011), pp. 1-122.
8. S. Sra, S. Nowozin, S. Wright, Optimization for Machine Learning (MIT Press, Cambridge, MA, 2011).
9. J. Schmidhuber, Neural Netw. 61, 85-117 (2015).
10. Y. Bengio, in Foundations and Trends in Machine Learning 2 (Now Publishers, Boston, 2009), pp. 1-127.
11. A. Krizhevsky, I. Sutskever, G. Hinton, Adv. Neural Inf. Process. Syst. 25, 1097-1105 (2012).
12. G. Hinton et al., IEEE Signal Process. Mag. 29, 82-97 (2012).
13. G. E. Hinton, R. R. Salakhutdinov, Science 313, 504-507 (2006).
14. V. Mnih et al., Nature 518, 529-533 (2015).
15. R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).
16. E. Yaylali, J. S. Ivy, Partially observable MDPs (POMDPs): Introduction and examples, in Encyclopedia of Operations Research and Management Science (John Wiley, New York, 2011).
17. W. Schultz, P. Dayan, P. R. Montague, Science 275, 1593-1599 (1997).
18. C. Dwork, F. McSherry, K. Nissim, A. Smith, in Proceedings of the Third Theory of Cryptography Conference, New York, 4 to 7 March 2006, pp. 265-284.
19. A. Blum, K. Ligett, A. Roth, J. ACM 60, 12 (2013).
20. J. Duchi, M. I. Jordan, M. Wainwright, J. ACM 61, 1-57 (2014).
21. M.-F. Balcan, A. Blum, S. Fine, Y. Mansour, Distributed learning, communication complexity and privacy, in Proceedings of the Conference on Computational Learning Theory, Edinburgh, UK, 26 June to 1 July 2012.
22. Y. Zhang, J. Duchi, M. Jordan, M. Wainwright, in Advances in Neural Information Processing Systems 26, L. Bottou, C. Burges, Z. Ghahramani, M. Welling, Eds. (Curran Associates, Red Hook, NY, 2014), pp. 1-23.
23. Q. Berthet, P. Rigollet, Ann. Stat. 41, 1780-1815 (2013).
24. A. Kleiner, A. Talwalkar, P. Sarkar, M. I. Jordan, J. R. Stat. Soc. B 76, 795-816 (2014).
25. M. Mahoney, Found. Trends Machine Learn. 3, 123-224 (2011).
26. T. Mitchell et al., in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), Austin, TX, 25 to 30 January 2015.
27. M. Taylor, P. Stone, J. Mach. Learn. Res. 10, 1633-1685 (2009).
28. S. Thrun, L. Pratt, Learning to Learn (Kluwer Academic Press, Boston, 1998).
29. L. Wehbe et al., PLOS ONE 9, e112575 (2014).
30. K. Xu et al., in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, Lille, France, 6 to 11 July 2015, pp. 2048-2057.
31. D. Blei, Commun. ACM 55, 77-84 (2012).

10.1126/science.aaa8415


[Fig. 5 diagram: applications (cancer genomics, energy debugging, smart buildings, in-house apps); access and interfaces (Spark SQL, Spark Streaming, Velox, SparkR, GraphX, Splash, G-OLA, MLBase, MLPipelines, MLlib, BlinkDB, SampleClean, Succinct); processing engine (Spark Core); storage (Tachyon; HDFS, S3, Ceph); resource virtualization (Mesos, Hadoop YARN). Components colored by origin: AMPLab-developed, Spark community, or third party.]

Fig. 5. Data analytics stack. Scalable machine-learning systems are layered architectures that are built on parallel and distributed computing platforms. The architecture depicted here—an open-source data analysis stack developed in the Algorithms, Machines and People (AMP) Laboratory at the University of California, Berkeley—includes layers that interface to underlying operating systems; layers that provide distributed storage, data management, and processing; and layers that provide core machine-learning competencies such as streaming, subsampling, pipelines, graph processing, and model serving.
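The processing-engine layer of such a stack expresses analyses as a map step run independently over data partitions (on different workers, in a cluster) followed by a reduce step that merges partial results. A single-machine sketch of that pattern, with hypothetical data and no Spark dependency:

```python
from collections import Counter
from functools import reduce

# Data split into partitions; in a distributed engine, each partition would
# live on a different worker.
partitions = [
    ["gene dna gene", "brain neuron"],
    ["dna genome gene", "neuron nerve brain"],
]

def map_partition(lines):
    # Local word counts for one partition; no communication needed.
    return Counter(word for line in lines for word in line.split())

def merge(a, b):
    # Associative merge, so partial results can be combined in any order.
    return a + b

totals = reduce(merge, (map_partition(p) for p in partitions))
print(totals["gene"])  # 3
```

The associativity of the merge step is what lets a distributed engine combine partial results in whatever order the network delivers them, which is why the same pattern scales from one machine to thousands.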



M. I. Jordan, T. M. Mitchell, Machine learning: Trends, perspectives, and prospects. Science 349, 255-260 (2015). DOI: 10.1126/science.aaa8415
