
Likely to stop? Predicting Stopout in Massive Open Online Courses

Colin Taylor, [email protected], Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Kalyan Veeramachaneni, [email protected], Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Una-May O’Reilly, [email protected], Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Abstract

Understanding why students stopout will help in understanding how students learn in MOOCs. In this report, part of a three-part compendium, we describe how we build accurate predictive models of MOOC student stopout. We document a scalable, stopout prediction methodology, end to end, from raw source data to model analysis. We attempted to predict stopout for the Fall 2012 offering of 6.002x. This involved the meticulous and crowd-sourced engineering of over 25 predictive features extracted for thousands of students, the creation of temporal and non-temporal data representations for use in predictive modeling, the derivation of over 10 thousand models with a variety of state-of-the-art machine learning techniques and the analysis of feature importance by examining over 70,000 models. We found that stopout prediction is a tractable problem. Our models achieved an AUC (receiver operating characteristic area-under-the-curve) as high as 0.95 (and generally 0.88) when predicting one week in advance. Even with more difficult prediction problems, such as predicting stopout at the end of the course with only one week's data, the models attained AUCs of 0.7.

1. Introduction

Massive Open Online Courses (MOOCs) leverage digital technologies to teach advanced topics at scale. MOOC providers such as edX and Coursera boast hundreds of classes developed by top-tier universities. Renowned professors record their lectures and, when needed, use interactive whiteboards to explain concepts. Recordings are delivered all over the world via web servers at no cost to the learner. Far from compromising the quality of course content, the internet provides a flexible medium for educators to employ new instructional tools. For example, videos enable students to pause, rewind, review difficult concepts and even adjust the speed. In addition, MOOCs allow the learner the flexibility to learn in his or her own time frame. Only in the online medium are short lectures logistically feasible through videos. MOOCs are changing the face of education by providing an alternative to the “one size fits all” learning concept employed by hundreds of universities.

The specific layout of each MOOC varies, but most follow a similar format. Content is sectioned into modules, usually using weeks as intervals. Most MOOCs include online lectures (video segments), lecture questions, homework questions, labs, a forum, a Wiki, and exams. Students advance through the material sequentially, access online resources, submit assignments and participate in peer-to-peer interactions (like the forum).

Not surprisingly, MOOCs have attracted the attention of online learners all over the world. The platforms boast impressive numbers of registrants and individuals who complete online course work. For example, MITx offered its first course, 6.002x: Circuits and Electronics, in the Fall of 2012. 6.002x had 154,763 registrants.



Of those, 69,221 students looked at the first problem set, and 26,349 earned at least one point. 9,318 students passed the midterm and 5,800 students got a passing score on the final exam. Finally, after completing 15 weeks of study, 7,157 registrants earned the first certificate awarded by MITx, showing they had successfully completed 6.002x. For perspective, approximately 100 students take the same course each year at MIT. It would have taken over 70 years of on-campus education to grant the same number of 6.002x certificates that were earned in a single year online.

While the completion rates are impressive when compared to in-class capacity, they are still low relative to the number of people who registered, completed certain parts of the course or spent a considerable amount of time on the course. To illustrate, in the above scenario approximately 17% attempted and got at least one point on the first problem set. The percentage of students who passed the midterm drops to just 6%, and certificate earners dwindle to just under 5%. 94% of registrants did not make it past the midterm.

How do we explain the 96% stopout¹ rate from course start to course finish? Analyzing completion rates goes hand in hand with understanding student behavior. One MOOC research camp advocates analyzing student usage patterns (resources used, homework responses, forum and Wiki participation) to improve the online learning experience and thereby increase completion rates. Other researchers question the feasibility of analyzing completion rates altogether because the online student body is unpredictable. For example, some students register online because it is free and available, with little or no intention of finishing. Some students who leave may lack motivation, or could leave due to personal reasons completely unrelated to MOOCs. As a result, interpreting completion rates is not a straightforward exercise. However, we believe that if we are to fully understand how students learn in MOOCs, we need to better understand why students stopout. Building accurate predictive models is the first step in this undertaking.

Why predict stopout? There are a number of reasons to predict stopout.

Interventions: Stopout prediction in advance allows us to design interventions that would increase engagement, provide motivation and eventually prevent stopout.

Identifying intentions: Certain special cases of stopout prediction allow us to delineate student intentions in taking the MOOC.

¹ We use stopout as synonymous with dropout, and we refer to its opposite as persistence.

For example, the cohort for whom we are able to predict stopout accurately based on just their first week's behavior could imply that the method and manner in which the course was designed or handled had no effect on the learner's decision to stopout.

Model analysis: Analysis of statistical models that have maximum stopout prediction accuracy can yield insights as to what caused the students to stopout. From this perspective one can even examine the students for whom the predictions were wrong, that is, a very accurate and trustworthy model predicted the student would not stopout but the student did stopout. Here the model error could be due to reasons that are unrelated to the course itself.

In a compendium of papers, of which this is the first of three, we tackle the challenge of predicting student persistence in MOOCs. Throughout the compendium we focus on the aforementioned course, the Fall 2012 offering of 6.002x: Circuits and Electronics. We believe a three-pronged approach which comprehensively analyzes student interaction data, extracts from the data sophisticated predictive indicators and leverages state-of-the-art models will lead to successful predictions. The compendium presents a comprehensive treatment of predicting stopout which produces and considers complex, multi-layered yet interpretive features and fine-tuned modeling.

We ask whether it is possible for machine learning algorithms, with only a few weeks of data, to accurately predict persistence. Is it possible to predict, given only the first week of course data, who will complete the last week of the course? How much history (or how many weeks of data) is necessary for accurate prediction one or more weeks ahead?

1.1. Outline of the compendium

The compendium is organized into the following papers:

• In this paper, we describe the stopout prediction problem, and present a number of discriminatory models we built, starting with Logistic regression and moving to Support Vector Machines, Deep Belief networks and decision trees. We also present a summary of which features/variables played a role in gaining accurate predictions.

• In “Towards Feature Engineering at Scale for Data from Massive Open Online Courses” we present how we approached the problem of constructing interpretive features from a time series of click-stream events.


We present the list of features we have extracted to create the predictive models (Veeramachaneni et al., 2014b).

• In “Exploring Hidden Markov Models for modeling online learning behavior” we outline a temporal modeling technique called Hidden Markov Models and present the results when these models are used to make predictions. We also present a stacked model using both techniques (HMMs and Logistic regression) and present an overview of our findings (Taylor et al., 2014).

Compendium contributions: The most fundamental contribution of this compendium is the design, development and demonstration of a stopout prediction methodology, end to end, from raw source data to model analysis. The methodology is painstakingly meticulous about every detail of data preparation, feature engineering, model evaluation and outcome analysis. Our goal with such thoroughness is to advance the state of research into stopout from its current position and document a methodology that is reproducible and scalable. We will next generalize this methodology to a number of additional edX and Coursera courses and report the successes and limitations. In addition, the methodology and software will shortly be released to interested educational researchers.

1.2. Our contributions through this paper

This paper makes the following contributions:

• We successfully predict stopout for the Fall 2012 offering of 6.002x. The major findings of the predictive models are presented in Section 8.

• We extract 27 sophisticated, interpretive features which combine student usage patterns from different data sources. This included leveraging the collective brain-power of the crowd. These are presented in Section 3.5.

• We utilize these features to create a series of temporal and non-temporal feature-sets for use in predictive modeling.

• We create over 10,000 comprehensive, predictive models using a variety of state-of-the-art techniques.

• We demonstrate that with only a few weeks of data, machine learning techniques can predict persistence remarkably well. For example, we were able to achieve an area under the curve of the receiver operating characteristic of 0.71, given only one week of data, while predicting student persistence in the last week of the course. Given more data, some of the models reached an AUC of 0.95. We present these and other results in Section 5.2.

• We build and demonstrate a scalable, distributed, modular and reusable framework to accomplish these steps iteratively.

The rest of the paper is organized as follows. Section 2 presents the details of the data we use, its organization, and the features/variables we extracted/operationalized for modeling. Section 3 presents the definition of the prediction problem and the different assumptions we make in defining the problem. Section 4 presents the predictive modeling technique, logistic regression, and Section 5 presents the results we achieved for all 91 prediction problems. Section 6 presents the details of how we employed multiple predictive modeling techniques. Section 7 presents the related work, both prior to MOOCs and for MOOCs. Section 8 presents the research findings relevant to this paper. Section 9 presents our reflections for the entire compendium.

2. Data organization into MOOCdb

As previously mentioned, we focused on the Fall 2012 offering of 6.002x: Circuits and Electronics. edX provided the following raw data from the 6.002x course:

• A dump of click-stream data from learner-browser and edX-server tracking logs in JSON format. For instance, every page visited by every learner was stored as a server-side JSON (JavaScript Object Notation) event.

• Forum posts, edits, comments and replies stored in a MongoDB collection. Note that passive forum data, such as how many views a thread received, was not stored here and had to be inferred from the click-stream data.

• Wiki revisions stored in a MongoDB collection. Again, passive views of the Wiki must be inferred from the click-stream data.

• A dump of the MySQL production database containing learner state information. For example, the database contained a learner's final answer to a problem, along with its correctness. Note that the history of a learner's submissions must be inferred from the click-stream data.

• An XML file of the course calendar which included information such as content release dates and assignment deadlines.


Figure 1. Multiple data sources received from edX with their corresponding formats

Figure 1 summarizes the raw data received. This data included:

• 154,763 registered learners

• 17.8 million submission events

• 132.3 million navigational events²

• ∼90,000 forum posts

To analyze this data at scale, as well as write reusable analysis scripts, we first organized the data into a schema designed to capture pertinent information. The resulting database schema, MOOCdb, is designed to capture MOOC data across platforms, thereby promoting collaboration among MOOC researchers. MOOCdb utilizes a large series of scripts to pipe the 6.002x raw data into a standardized schema. More about MOOCdb can be found in the MOOCdb tech report, but the details are outside the scope of this compendium (Veeramachaneni et al., 2014a).

Through the labor-intensive process of piping the raw data into a schematized database, we were able to significantly reduce the data size in terms of disk space. The original ∼70GB of raw data was reduced to a ∼7GB MOOCdb through schema normalization. The transformation was crucial in order to load the entire database into RAM, enabling prompt queries and feature extractions. Figure 2 shows a snapshot of the original JSON transactional data transformed into a normalized schema.
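As a rough illustration of this piping step, the following minimal Python sketch converts one click-stream JSON event into a flat, normalized row. The field names and the sample event are hypothetical placeholders; they are not the actual edX log format or the MOOCdb schema.

import json
from datetime import datetime

# Hypothetical sample event; real edX tracking-log events carry many more fields.
raw_event = ('{"username": "learner_42", "event_type": "play_video", '
             '"page": "/courses/6.002x/lecture/week3", "time": "2012-10-05T14:21:09"}')

def to_observed_event_row(raw_json):
    """Convert one click-stream JSON event into a flat, normalized row (illustrative only)."""
    e = json.loads(raw_json)
    return {
        "user_id": e["username"],        # would map to a numeric learner id in practice
        "url": e["page"],                # the resource that was accessed
        "event_type": e["event_type"],
        "timestamp": datetime.strptime(e["time"], "%Y-%m-%dT%H:%M:%S"),
    }

print(to_observed_event_row(raw_event))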

² We received more navigational events, but only 132.3 million were well formed enough to be reliably considered for this compendium.

3. Prediction problem definition and assumptions

We made several assumptions to more precisely define the stopout prediction problem and interpret the data. These assumptions include time-slice delineation and defining persistence (stopout) as the event we attempt to predict.

3.1. Time-slice delineation

Temporal prediction of a future event requires us to assemble explanatory variables along a time axis. This axis is subdivided to express the time-varying behavior of variables so they can be used for explanatory purposes. In 6.002x, course content was assigned and due on a weekly basis, where each week corresponded to a module. Owing to the regular modular structure, we decided to define time slices as weekly units. Time slices started the first week in which course content was offered, and ended in the fifteenth week, after the final exam had closed.
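As a minimal sketch of this delineation, the snippet below maps an event timestamp onto a 1-based weekly time slice. The course start date is an assumed placeholder; in practice it would come from the course calendar XML.

from datetime import datetime

COURSE_START = datetime(2012, 9, 5)   # assumed start date, for illustration only
N_WEEKS = 15                          # weeks 1 through 15, the last closing after the final exam

def week_of(timestamp):
    """Map an event timestamp to its weekly time slice, capped to the course span."""
    week = (timestamp - COURSE_START).days // 7 + 1
    return min(max(week, 1), N_WEEKS)

print(week_of(datetime(2012, 9, 20)))  # -> 3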

3.2. Stopout definition

The next question we had to address was our definition of stopout. We considered defining it by the learner's last interaction in the course, regardless of the nature of the interaction. This is the approach taken by Balakrishnan in his stopout analysis (Balakrishnan & Coetzee, 2013). However, Balakrishnan's definition yields noisy results because it gives equal weight to a passive interaction (viewing a lecture, accessing an assignment, viewing a Wiki, etc.) as it does to a pro-active interaction (submitting a problem, midterm, assignment, etc.).


Figure 2. Piping data into MOOCdb

Figure 3. Stopout week distribution

A learner could stop submitting assignments in the course after week 2, but continue to access the course pages and not be considered stopped out. Instead, we define the stopout point as the time slice (week) in which a learner fails to submit any further assignments or exercise problems. To illustrate, if a learner submits his/her last assignment in the third module, he/she is considered to have stopped out at week four. A submission (or attempt) is a submission of any problem type (homework, lab, exam, etc.), as defined in MOOCdb. This definition narrows the research to learners who consistently participate in the course by submitting assignments. Using this definition for stopout, we extracted the week number when each learner in the cohort stopped out.
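The following minimal sketch shows the stopout-week extraction under this definition, assuming submissions have already been reduced to (learner, week) pairs pulled from MOOCdb; the toy data and names are placeholders.

from collections import defaultdict

# Assumed, toy input: one (learner_id, week_number) pair per submission.
submissions = [("a", 1), ("a", 3), ("b", 1), ("b", 1), ("c", 5)]

def stopout_week(submissions):
    """A learner's stopout week is one past the week of their last submission,
    e.g. a last submission in week 3 means stopout at week 4."""
    last = defaultdict(int)
    for learner, week in submissions:
        last[learner] = max(last[learner], week)
    return {learner: week + 1 for learner, week in last.items()}

print(stopout_week(submissions))  # {'a': 4, 'b': 2, 'c': 6}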

Figure 3 shows the distribution of stopout week for all 105,622 learners who ever accessed the course. Of these, 52,683 learners stopped out on week one. These learners never submitted an assignment, and are never considered in the rest of our analysis. Another large learner drop-off point is in week 15, the last week of the course. Many of these learners actually finished the course, but did so by submitting their final exam in week 14.


This nuance presented itself because learners had a range of time to start the final exam, and this range actually overlapped between weeks 14 and 15. Due to the nature of the final exam time range, we never attempt to predict week 15, and consider week 14 as the final week.

3.3. Lead and Lag

Lead represents how many weeks in advance we predict stopout. We assign the stopout label of the lead week (feature x1: 0 for stopout, 1 for persisted) as the prediction problem's label. Lag represents how many weeks of historical variables will be used to classify. For example, if we use a lead of 5 and a lag of 3, we would take the first 3 weeks of data to predict 5 weeks ahead. Thus, each training data point consists of a learner's feature values for weeks 1, 2 and 3. The binary stopout value for week 8 becomes the label. Figure 4 shows a diagram of this scenario.

We are careful not to use features from a learner's stopped-out weeks as input to our models. In other words, if a learner has stopped out in week 1, 2 or 3, we do not use this learner as a data point. Including stopped-out learner data makes the classification problem too easy, as the model will learn that a stopped-out learner never returns (by our stopout definition).

To illustrate the predictive model's potential application, we will use a realistic scenario. The model user, likely an instructor or platform provider, could use the data from week 1 to week i (the current week) to make predictions. The model will predict existing learner stopout during weeks i + 1 to 14. For example, Figure 4 shows one such prediction problem. In this case the user, currently at the end of week 3, is attempting to predict stopout for the 8th week.

Multiple prediction problems  Under this definition, 91 individual prediction problems exist. For any given week i there are 14 − i prediction problems. Each prediction problem becomes an independent modeling problem which requires a discriminative model. To build discriminative models we utilize a common approach of flattening out the data, that is, forming the covariates for the discriminative model by assembling the features from different learner-weeks as separate variables. This process is shown in Figure 7. The example uses data from weeks 1 and 2 (lag of 2) and attempts to predict the stopout for week 13 (lead of 11).
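A minimal sketch of this flattening step is shown below. It assumes per-learner weekly feature vectors and a stopout-week mapping are already available; the toy data and variable names are placeholders.

import numpy as np

def build_dataset(weekly_features, stopout, lag, lead):
    """Flatten weeks 1..lag into one covariate vector per learner and label it with
    persistence in week lag+lead (1 = persisted, 0 = stopped out). Learners who
    already stopped out during the lag weeks are dropped."""
    target_week = lag + lead
    X, y = [], []
    for learner, weeks in weekly_features.items():
        if stopout[learner] <= lag:                      # stopped out within the lag window
            continue
        X.append(np.concatenate([weeks[w] for w in range(1, lag + 1)]))
        y.append(1 if stopout[learner] > target_week else 0)
    return np.array(X), np.array(y)

# Toy example: 2 features per week, lag of 3, lead of 5 (predict week 8).
weekly_features = {
    "a": {w: np.array([w, 2.0 * w]) for w in range(1, 15)},
    "b": {w: np.array([1.0, 0.5]) for w in range(1, 15)},
    "c": {w: np.array([0.1, 0.2]) for w in range(1, 15)},
}
stopout = {"a": 15, "b": 9, "c": 2}
X, y = build_dataset(weekly_features, stopout, lag=3, lead=5)
print(X.shape, y)  # (2, 6) [1 1]  (learner "c" is dropped)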

For all of the ensuing modeling and analysis, we treated and reported on each of the cohort datasets independently.

3.4. Partitioning learners into cohorts

Rather than treat all learners uniformly, we decided to build predictive models for different types of learners. With this in mind we divided the learners into cohorts as a rough surrogate variable for their commitment to the course. We chose four cohorts based on the learner's collaborative activity throughout the course. More specifically, we divided learners based on whether or not they participated in the class forum or helped edit the class Wiki pages. The four types of learners are:

• passive collaborator - these learners never actively participated in either the forum or the Wiki. They are named passive because they passively viewed, but did not contribute to, resources.

• wiki contributor - these learners actively participated in the Wiki by generating Wiki content through their edits, but never actively posted in the forum.

• forum contributor - these learners actively posted in the forum, but never actively participated in the class Wiki.

• fully collaborative - these learners actively participated by generating Wiki content and by posting in the forum.

From the combined dataset of 52,939 participating learners, we assigned each learner to one of the four types. Figure 5 summarizes the sizes of the cohort datasets.
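A minimal sketch of this cohort assignment, assuming per-learner counts of forum posts and Wiki edits (features x3 and x4) are already available:

def assign_cohort(n_forum_posts, n_wiki_edits):
    """Map a learner's collaborative activity onto one of the four cohorts above."""
    if n_forum_posts > 0 and n_wiki_edits > 0:
        return "fully collaborative"
    if n_wiki_edits > 0:
        return "wiki contributor"
    if n_forum_posts > 0:
        return "forum contributor"
    return "passive collaborator"

print(assign_cohort(n_forum_posts=3, n_wiki_edits=0))  # forum contributor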

3.5. Features per learner

We extracted 27 interpretive features on a per-learner basis. These are the features we use to build a model. We describe the process of feature engineering at length in (Veeramachaneni et al., 2014b). In this paper, for the sake of brevity, we only list the features and their brief descriptions in the two tables below. For more details about how we came up with these features and how specifically they were operationalized, we refer the readers to (Veeramachaneni et al., 2014b).

4. Logistic Regression

Logistic regression is a commonly used binary predictive model. It calculates a weighted sum of a set of variables, submitted as covariates, as the input to the logit function. Thus, the input to the logit function, z, takes the following form:

$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$   (1)


Table 1. List of self-proposed, self-extracted covariates

x1   stopout: Whether the student has stopped out or not
*x2  total duration: Total time spent on all resources
x3   number forum posts: Number of forum posts
x4   number wiki edits: Number of wiki edits
*x5  average length forum post: Average length of forum posts
*x6  number distinct problems submitted: Number of distinct problems attempted
*x7  number submissions: Number of submissions¹
x8   number distinct problems correct: Number of distinct correct problems
x9   average number submissions: Average number of submissions per problem (x7 / x6)
x10  observed event duration per correct problem: Ratio of total time spent to number of distinct correct problems (x2 / x8). This is the inverse of the percent of problems correct
x11  submissions per correct problem: Ratio of number of problems attempted to number of distinct correct problems (x6 / x8)
x12  average time to solve problem: Average time between first and last problem submissions for each problem (average of max(submission.timestamp) - min(submission.timestamp) over each problem in a week)
*x13 observed event variance: Variance of a student's observed event timestamps
x14  number collaborations: Total number of collaborations (x3 + x4)
x15  max observed event duration: Duration of longest observed event
*x16 total lecture duration: Total time spent on lecture resources
*x17 total book duration: Total time spent on book resources
*x18 total wiki duration: Total time spent on wiki resources

¹ In our terminology, a submission corresponds to a problem attempt. In 6.002x, students could submit multiple times to a single problem. We therefore differentiate between problems and submissions.


Table 2. List of crowd-proposed, self-extracted covariates

x201  number forum responses: Number of forum responses
*x202 average number of submissions percentile: A student's average number of submissions (feature 9) as compared with other students that week, as a percentile
*x203 average number of submissions percent: A student's average number of submissions (feature 9) as a percent of the maximum average number of submissions that week
*x204 pset grade: Number of the week's homework problems answered correctly / number of that week's homework problems
x205  pset grade over time: Difference between the current pset grade and the average of the student's past pset grades
*x206 lab grade: Number of the week's lab problems answered correctly / number of that week's lab problems
x207  lab grade over time: Difference between the current lab grade and the average of the student's past lab grades
x208  number submissions correct: Number of correct submissions
x209  correct submissions percent: Percentage of the total submissions that were correct (x208 / x7)
*x210 average predeadline submission time: Average time between a problem submission and the problem due date, over each submission that week


Figure 4. Diagram of the learners’ weeks data used in a lead 5, lag 3 prediction problem

Figure 5. Chart of the relative sizes of our cohorts

Here, β1 to βm are the coefficients for the feature values x1 to xm, and β0 is a constant. The logit function, given by

$y = \frac{1}{1 + e^{-z}}$   (2)

takes the shape shown in Figure 8. Note that the function's range is between 0 and 1, which is optimal for probability. Also note that it tends to 'smooth out' at extreme input values, as the range is capped.

For a binary classification problem, such as ours, the output of the logit function becomes the estimated probability of a positive training example. These feature weights, or coefficients, are similar to the coefficients in linear regression. The difference is that the output ranges between 0 and 1 due to the logit function, rather than over an arbitrary range as in linear regression.
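The two equations compose as in the short sketch below; the coefficients and feature vector are made-up values for illustration only.

import numpy as np

def predict_probability(beta0, beta, x):
    """Equations (1) and (2): a weighted sum of covariates passed through the logit."""
    z = beta0 + np.dot(beta, x)       # Equation (1)
    return 1.0 / (1.0 + np.exp(-z))   # Equation (2)

beta0, beta = -1.5, np.array([0.8, -0.3, 0.05])   # illustrative coefficients
x = np.array([2.0, 1.0, 10.0])                    # one learner's flattened features
print(predict_probability(beta0, beta, x))        # ~0.57, estimated probability of the positive class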

4.1. Learning

The objective of training a logistic regression model is to find a set of coefficients well suited to fit the data.


Figure 6. The feature matrix, which captures each feature value for each week. Each student has such a matrix.

Figure 7. Diagram of the flattening process. In this case two weeks of data are used to predict week 13. This prediction problem corresponds to a lead of 11, and a lag of 2.

Figure 8. The logit (aka logistic or sigmoid) function. The logit equation is $y = \frac{1}{1 + e^{-x}}$. The range of the function is between 0 and 1.

For the binary classification problem, as noted before, training involves passing a set of covariates and a corresponding binary label associated with the covariates. After training a model, the predicted probability, or the output of the logit function, should be higher for the positive '+1' class examples in the training data and lower for the negative '0' class examples.

There is no closed-form solution for finding the optimal coefficients to best fit the training data. As a result, training is usually done iteratively through a technique called maximum likelihood estimation (Menard, 2002). First, a random set of coefficients is chosen. At each iteration, an algorithm such as Newton's method is used to find the gradient between what the coefficients predict and what they should predict, and the weights are updated accordingly. The process repeats until the change in the coefficients is sufficiently small. This is called convergence. After running this iterative process over all of the training examples, the coefficients represent the final trained model.
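For concreteness, the sketch below fits coefficients with plain gradient ascent on the log-likelihood rather than Newton's method; the loop structure (update until the change in coefficients is small) is the same, but this is an illustrative simplification, not the solver actually used in our experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, tol=1e-6, max_iter=5000):
    """Iteratively fit logistic regression coefficients by gradient ascent."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend a column of 1s for beta_0
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (y - sigmoid(X @ beta))          # gap between labels and predictions
        new_beta = beta + lr * grad / len(y)
        if np.max(np.abs(new_beta - beta)) < tol:     # convergence check
            return new_beta
        beta = new_beta
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)       # toy, linearly separable labels
print(fit_logistic(X, y))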


4.2. Inference and evaluation

With training in place, the next step is evaluating the classifier's performance. A testing set, composed of covariates and labels not seen during training, evaluates the performance of the model following the steps below:

Step 1: The logistic function learned, as given in Equation 2, is applied to each data point, producing the estimated probability of a positive label, $y_i$, for each data point in the test set.

Step 2: A decision rule is applied to determine the class label for each probability estimate $y_i$. The decision rule is given by:

$\hat{L}_i = \begin{cases} 1, & \text{if } y_i \geq \lambda \\ 0, & \text{if } y_i < \lambda \end{cases}$   (3)

Given the estimated labels for each data point, $\hat{L}_i$, and the true labels, $L_i$, we can calculate the confusion matrix, true positives and false positives, and thus obtain an operating point on the ROC curve.

Step 3: By varying the threshold λ in the decision rule above (Equation 3), we can evaluate multiple points on the ROC curve. We then compute the area under the curve and report that as the performance of the classifier on the test data.
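These three steps can be reproduced with a few lines of scikit-learn (the library noted in Section 5.1); the probabilities and labels below are toy values.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # held-out labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.55, 0.8])   # Step 1: logit outputs

# Step 2: the decision rule of Equation (3) for one particular threshold lambda.
lam = 0.5
y_hat = (y_prob >= lam).astype(int)
print("labels at lambda = 0.5:", y_hat)

# Step 3: sweeping lambda over all values traces the ROC curve; its area is the AUC.
print("ROC AUC:", roc_auc_score(y_true, y_prob))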

Predictive accuracy heat map  To present the results for multiple prediction problems for different weeks simultaneously, as discussed in Section 3.3, we assemble a heat map of a lower-right triangular matrix, as shown in Figure 9. The number on the x-axis is the week for which predictions are made in that experiment. The y-axis represents the lag, or the number of weeks of data used to predict. The color represents the area under the ROC curve that the model achieved. Note that as the predicted week increases for a given lag, it is harder to predict. Likewise, as we increase the lag for a given prediction week, the stopout value becomes easier to predict. This implies that using more historical information enables a better prediction.
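A minimal sketch of assembling such a heat map is shown below; the stored_auc lookup is a placeholder standing in for the saved per-experiment results, not our actual numbers.

import numpy as np
import matplotlib.pyplot as plt

N_WEEKS = 14
auc = np.full((N_WEEKS, N_WEEKS), np.nan)      # rows: lag, columns: predicted week

def stored_auc(lag, week):
    """Placeholder lookup of the AUC for one (lag, predicted week) experiment."""
    return 0.6 + 0.3 * lag / week               # fabricated shape, for illustration only

for lag in range(1, N_WEEKS):
    for week in range(lag + 1, N_WEEKS + 1):    # only weeks after the lag window
        auc[lag - 1, week - 1] = stored_auc(lag, week)

plt.imshow(auc, vmin=0.5, vmax=1.0, origin="lower", aspect="auto")
plt.colorbar(label="ROC AUC")
plt.xlabel("predicted week")
plt.ylabel("lag (weeks of data used)")
plt.title("Stopout prediction AUC by lead and lag")
plt.show()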

4.3. Attractive properties of logistic regression

• It is relatively simple to understand.

• After a model is trained, it provides feature weights, which are useful in assessing the predictive power of features (this will be discussed further in our treatment of the randomized logistic regression model).

Figure 9. Example heatmap for a logistic regression problem. The heatmap shows how the ROC AUC varied as the lag and the target prediction week changed.

• It is fast to run. On a single i7 core, for example, running all 91 prediction problems on all 4 cohorts took 25 hours.

5. Predicting stopout with logistic regression

We applied logistic regression to student persistence prediction. We used the 27 interpretive features we described earlier in this paper to form the feature vectors, and maintained the stopout value as the label.

5.1. Experimental setup

To perform logistic regression analysis, we executed the ensuing steps for every lead, lag and cohort combination³ (a minimal sketch of this workflow follows the footnote below):

1. Performed 10-fold cross validation on the training set. As outlined in the evaluation chapter, this involved training the model on 9 folds of the train dataset and testing on the last fold.

2. Trained a logistic regression model on the entire train dataset.

3. Applied the model to the test dataset by putting each data point through the model, then applying the decision rule in Equation 3 and following the steps in Section 4.2 to determine the AUC under the ROC.

4. Evaluated the model using mean cross validation ROC AUC and test set ROC AUC.

³ We used the logistic regression implementation of an open-source machine learning library called scikit-learn. We chose this library because it is well known and tested, fast (the core maximum likelihood estimation algorithm is written in C), and has an easy-to-use Python interface. In addition, the scikit-learn library includes an easy interface for cross validation and feature normalization.
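A minimal sketch of the four steps above for one (lead, lag, cohort) combination, using scikit-learn as noted in the footnote; the toy arrays stand in for the flattened train and test datasets.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(300, 10)), rng.integers(0, 2, 300)   # placeholders
X_test, y_test = rng.normal(size=(100, 10)), rng.integers(0, 2, 100)

model = LogisticRegression(max_iter=1000)

# Step 1: 10-fold cross validation on the training set, scored by ROC AUC.
cv_auc = cross_val_score(model, X_train, y_train, cv=10, scoring="roc_auc")

# Step 2: train on the entire training set.
model.fit(X_train, y_train)

# Steps 3 and 4: score the held-out test set and report both AUCs.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("mean CV AUC: %.3f, test AUC: %.3f" % (cv_auc.mean(), test_auc))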


Figure 10. Logistic regression results for the passive collaborator cohort.

5.2. Experimental Results

Figures 10 through 13 summarize the AUC of the receiver operating characteristic for all four cohorts over each lead and lag combination. Overall, logistic regression predicted dropout with very high accuracy. Some experiments, such as a lag of 7 predicting week 8 in the fully collaborative cohort, achieved accuracies as high as 0.95, a noteworthy result (Figure 12). Moreover, the entire diagonal of the passive collaborator cohort's heatmap (Figure 10) resulted in an AUC greater than 0.88. This diagonal represents experiments with a lead of one. Thus, we can surmise that the extracted features are highly capable of predicting stopout, especially when the prediction week is fairly near the lag week.

Across all experiments, the predictive models of the passive collaborator cohort achieved the highest predictive accuracies. This is because passive collaborator is by far the largest cohort, which resulted in high-performing, stable accuracy for all 91 experiments. Conversely, the wiki contributor cohort performed poorly for many experiments. In fact, for some lag and predicted week combinations, the model could not even compute an AUC because there were not enough examples to test on.

What follows is a deeper explanation of two interesting prediction problems and their results.

Are there early signs of stopout?  One interesting prediction problem is trying to predict student persistence into the last week of the course using a single week of data. Practically speaking, this would enable platform providers and instructors to predict which students would finish the course by the end of the first week. Potentially, this would allow instructors to interpret the reason for student stopout as motivational (such as just browsing) rather than course-specific (such as the content becoming too difficult), because the students have not yet been exposed to much content. Furthermore, early-sign stopout prediction could allow courses to target certain types of students for some type of intervention or special content. If our models are successful, the results would imply that our extracted features are capturing a student's persistence far in advance. Remarkably, across cohorts, the generated models achieved an AUC of at least 0.64, and reached as high as 0.78 in the case of the wiki contributor cohort.

The wiki contributor AUC of 0.78, or even the passive collaborator AUC of 0.7, suggests it is possible to roughly estimate which students will finish the course.


Figure 11. Logistic regression results for the forum contributor cohort.

Implications include the ability to reach out to students likely to stop the course before they become disengaged, or giving a professor a rough indication of how many students to expect each week. If these predictions hold true for other courses, a prediction model could be used to measure the success of course experiments, such as changing course content.

In the case of the wiki contributor cohort, the model performed well for most later predictive weeks given a lag of one. This indicates two things. Firstly, wiki contributor students show remarkably strong early signs of persistence. Secondly, given more students, predictive models of the wiki contributor cohort would likely perform well. Owing largely to the small pool size of the wiki contributor cohort, model performance suffered, especially as lag increased, because there were not enough students to appropriately train on. However, with a lag of one, the models used more students' data because we included all students who started in the course.

The prediction spike after the midterm  Leading up to the midterm (in week 8), making predictions using a lag of i, where i is the current week, yields a fairly consistent AUC. In other words, students who will stopout after the midterm resemble their persistent counterparts up until week 8. However, using a lag of 8 instead of 7, thereby including midterm data, produces an upward prediction spike in all four cohorts.

Perhaps the most striking spike example is in the most consistent cohort, the passive collaborator students. If the model attempts to predict using only a lag of 7, it realizes an AUC of 0.75. If the model expands to include midterm week data from week 8 and attempts to predict who will be in the course the next week, it achieves an AUC of 0.91. This is a significant spike. Similarly, the fully collaborative cohort's AUC increases significantly from 0.68 in week 7 to 0.81 in week 8.

With the addition of the midterm week data, the model is equipped to make reasonably consistent predictions through the end of the course. In fact, for the two cohorts of significant size, the region including and beyond week 8 achieves the highest AUCs of the entire course. This suggests that the midterm exam is a significant milestone for stopout prediction. It follows that most students who complete the midterm finish the course. For the two smaller cohorts, wiki contributor and fully collaborative, the region beyond week 8 realizes poor predictive power because too few students remain in the course to accurately train on.


Figure 12. Logistic regression results for the fully collaborative cohort.

Consider what a wiki is (students summarize and reframe their knowledge) and the high level of engagement it reflects (more than the forum). Therefore, other technologies that elicit a similarly high level of engagement may have the same influence on persistence.

Feature importance  We utilized a randomized logistic regression methodology to identify the relative weighting of each of the features. More details about this approach are presented in (Veeramachaneni et al., 2014b). Here we briefly present the results of that experiment in Figure 14. In these four plots, a higher bar represents higher importance of that feature in predicting stopout across all 91 experiments for that cohort. We summarize the features we found important in the findings section.
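The sketch below approximates the idea behind randomized logistic regression: refit an L1-penalized model on many random subsamples and rank features by how often they keep a nonzero weight. It is an illustrative simplification with toy data, not the exact procedure of (Veeramachaneni et al., 2014b).

import numpy as np
from sklearn.linear_model import LogisticRegression

def selection_frequencies(X, y, n_rounds=50, sample_frac=0.75, seed=0):
    """Count how often each feature keeps a nonzero weight across subsampled L1 fits."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        idx = rng.choice(len(y), size=int(sample_frac * len(y)), replace=False)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
        model.fit(X[idx], y[idx])
        counts += (np.abs(model.coef_[0]) > 1e-6)
    return counts / n_rounds

# Toy data: only the first two of six features actually drive the label.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.8 * X[:, 1] + 0.1 * rng.normal(size=400) > 0).astype(int)
print(selection_frequencies(X, y))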

6. Multiple classifiers

After successfully modeling the data using logistic regression and randomized logistic regression, we proceeded to model the data using a number of classifiers via a cloud-based machine-learning-as-a-service framework called Delphi. Delphi is a first-ever shared machine learning service developed by members of the ALFA group at CSAIL, MIT (Drevo, 2014)⁴. It is a multi-algorithm, multi-parameter, self-optimizing machine learning system that attempts to automatically find and generate the optimal discriminative model/classifier with optimal parameters. A hybrid Bayesian and multi-armed bandit optimization system searches through the high-dimensional search space of models and parameters. It works in a load-balanced fashion to quickly deliver results in the form of ready-to-predict models, confusion matrices, cross validation accuracy, training timings, and average prediction times.

6.1. Experimental setup

In order to run our datasets through Delphi, we performed the following:

1. Chose a few lead, lag combinations to run on Delphi. Since Delphi creates many models, we only chose 3 datasets per cohort. We chose lead and lag combinations which were difficult for logistic regression to predict so we could see if Delphi would perform better. We chose the following combinations: lead of 13, lag of 1; lead of 3, lag of 6; lead of 6, lag of 4.

⁴ http://delphi.csail.mit.edu/


Figure 13. Logistic regression results for the wiki contributor cohort.


2. Flattened each cohort's train and test dataset to generate files which could be passed to Delphi. We flattened in the same manner as described in the logistic regression section.

3. Ran the 12 datasets through Delphi. This gave us 12 models which performed best on a mean cross validation prediction accuracy metric.

4. Evaluated these models on the basis of test dataset ROC AUC and cross validation ROC AUC performance.

6.2. Delphi results

The models created by Delphi attained AUCs very similar to those of our logistic regression and HMM models (described in (Taylor et al., 2014)). The best algorithm chosen by Delphi varied depending on which lead, lag and cohort combination was chosen. The algorithms included stochastic gradient descent, k-nearest neighbors, logistic regression, support vector machines and random forests.

For the two larger cohorts, passive collaborator and forum contributor, Delphi's models used logistic regression, stochastic gradient descent, support vector machines and random forests. For each of the lead and lag combinations, the models' results were within 0.02 of our logistic regression results. This indicated that the predictive power of these cohorts was not due to the type of model used. Rather, the strong predictive accuracies achieved were due to the interpretive features in the models. As noted in (Veeramachaneni et al., 2014b), varying the features used, such as when using only the self-proposed, self-extracted features rather than both the self-proposed, self-extracted features and the crowd-proposed, self-extracted features, significantly changed the results. These findings lead us to conclude that focusing on better features provides more leverage in MOOC data science than does fine-tuning models.

For the smaller wiki contributor and fully collaborative cohorts, Delphi's models provided significantly better accuracy. For example, for the wiki contributor cohort, all three lead and lag combinations' models produce AUCs greater than 0.85. The best classifiers used to model these cohorts included k-nearest neighbors and stochastic gradient descent. This indicates that for these cohorts, the type of model matters a great deal. We conclude that this is due to the small size of the cohorts.


Figure 14. Relative importance of different features across all variants (different lead and lag) of the stopout prediction problem: (a) passive collaborator cohort, (b) forum contributor only, (c) fully collaborative, (d) wiki contributor only. Summary: For the passive collaborator cohort, the top 5 features with the most predictive power across multiple stopout prediction problems are average pre-deadline submission time, submissions per correct problem, average number of submissions in percent, correct submissions percent, and pset grade over time. For the forum contributor cohort, the top 5 features are lab grade over time, average pre-deadline submission time, average length of forum post, lab grade, and average number of submissions in percent. For the fully collaborative cohort, the top 5 features are lab grade over time, lab grade, pset grade, pset grade over time, and average pre-deadline submission time. For the wiki contributor cohort, the top 5 features are lab grade over time, lab grade, average pre-deadline submission time, pset grade, and average number of submissions in percent. For more details about how these relative importances were calculated we refer the reader to (Veeramachaneni et al., 2014b).


Some classifiers are able to more gracefully handle less data. This provides early suggestive evidence that, when a student cohort is relatively small (in relation to the number of features), it is important to investigate multiple models to identify the most accurate one.

7. Related work and literature

Dropout prediction and analysis of reasons for dropout have been of great interest to the educational research community in a variety of contexts: e-learning, distance education and online courses offered by community colleges. To understand and gain perspective on what has been done so far, we surveyed a large cohort of relevant literature. Table 3 presents a list of 25 research studies we surveyed that predate MOOCs. Table 4 presents the list of 8 concurrent studies that correspond to MOOCs.

We use a set of four axes along which to compare research models. For any model, the axes are 1) intended purpose: predictive vs. correlative, 2) whether behavioral and/or non-behavioral attributes were employed, 3) use of longitudinal and/or time-invariant variables and 4) use of trace data and/or survey data. A subset of these axes are similar to those identified by (Lykourentzou et al., 2009).

Axis 1: Intended purpose - predictive vs. correlative  By categorizing a model as predictive, we identify it as being used prospectively to predict whether or not a student will drop out. Predictive modeling is often the basis of interventions. It can be used while a course is running. In correlative modeling, analysis is performed to correlate one or more variables with completion (or progress to some timepoint). Retrospectively, the reasons for dropout are identified. (Diaz, 2002; Tyler-Smith, 2006; Street, 2010) provide an excellent summary of a number of correlative studies performed in this domain. Many studies build predictive models not to operationalize the model for actual prediction during a course but to gain insights into which variables, and what values of those variables, are predictive of dropout. We categorize such approaches as correlative as well.

Within the predictive models category, there is an abundance of modeling problems in the literature set up to use a set of variables recorded over a single historical interval, e.g., the first 3 weeks of the course, and to predict an event at a single timepoint, for example using data or surveys collected from the first 4 modules of a course to forecast stopout after the midterm. In some cases, however, when predictive models are built for a number of time points in the course, as in (Lykourentzou et al., 2009), the model is not built to predict ahead.

In contrast, we identify 91 different predictive modeling problems within a single MOOC. We take pains not to include any variable that would arise at or after the time point of our predictions, i.e., beyond the lag interval. We do this so we can understand the impact of different timespans of historical information on predicting at different time intervals forward. In other words, our study is the first, to the best of our knowledge, to systematically define multiple prediction problems so predictions could be made during every week of the course, each week to the end of the course. (Lykourentzou et al., 2009) provide an excellent summary of studies that fall in the predictive category.

Finally, we are concerned with the accuracy of the predictive model so that it can be used during the course for intervention. Predictions can make two types of errors: mispredicting stopout or mispredicting persistence. Our use of the area under the receiver operating characteristic curve (AUC) as a metric for measuring the efficacy of our models, rather than an R² metric, is a testament to this: the metric emphasizes the importance of both errors, and we aim at optimizing it. To the best of our knowledge this metric has not previously been used to evaluate such models. Additionally, we provide a probability of stopout, allowing the user to choose a threshold at which to make a prediction. This allows the intervention designer to choose a trade-off point on the receiver operating characteristic curve.

Axis 2: Behavioral and/or non-behavioral attributes  This categorization identifies whether or not variables that capture the learning behavior of students were used in modeling. Examples of non-behavioral attributes are a student's age, sex, location, occupation, or financial status (Parker, 1999; Willging & Johnson, 2009). A second kind of non-behavioral variable is perceptual variables, such as those derived from questionnaires, that need to be self-reported. Our models do not depend on perceptual variables, nor do they depend on non-behavioral variables such as age, gender and others. While such variables can play a significant role in increasing the accuracy of models (especially when predicting far ahead), in a MOOC they may not be available. This is a powerful and significant difference, as it allows us to transfer the model without needing personally identifiable information.

Within the use of behavioral data, the most common behavioral variables used are performance-related, either prior to or during the course. For example, (Lykourentzou et al., 2009) use prior academic performance (education level).


Others even use high school GPA, college GPA, or freshman-year GPA (Morris et al., 2005; Mendez et al., 2008). Some studies compose variables based on project grades and test grades during the course (Lykourentzou et al., 2009). In almost all cases, prior academic performance has been found to be the strongest predictor of student persistence (Mendez et al., 2008; Xenos et al., 2002; Allen & Robbins, 2008).

The second type of behavioral variables are based on students' interaction with the educational resources (online or otherwise), rather than performance on a test or midterm or prior academic performance; for example, how much time a student spent on lecture videos or whether or not a student attended the orientation session. We tackle the challenge of identifying variables that capture students' interactions with the online platform using actual trace analysis (log analysis). We argue that such analysis can enable identification of attributes of the course that could be associated with stopout, for example a difficult concept or a rather hard/confusing video. To the best of our knowledge, very detailed trace-oriented variables like the ones we derive have not yet been fully exploited.

Axis 3: Time-varying vs. time-invariant variables  A time-varying variable captures a quantity at different points in time. Many time-varying variables are summaries, such as average downloads per day or total minutes watching videos per module. Sometimes they are first processed with scoring, e.g., the engagement scoring of (Poellhuber et al., 2008), or ranking, such as the decile of a participation level each week. In contrast, a time-invariant variable is constant over time, e.g., ethnicity. The important choice between these two types is whether dynamics are factored into modeling. For example, we choose time-varying variables as a means of capturing behavioral trends.

Most studies that we surveyed capture variables that are summaries over time. Some of these variables by definition are not time-dependent, such as attendance at class orientation (Wojciechowski & Palmer, 2005), and some are usually aggregated for a period in the course (or the entire course), such as the number of emails to the instructor (Nistor & Neubauer, 2010). In our work we operationalize variables at multiple time points in the course. In this aspect, perhaps the closest approach to ours is (Lykourentzou et al., 2009), where the authors form time-varying variables at different points of the course, namely different sections of the course.

Axis 4: Trace or survey data use. Surveys play an important role in analyzing the factors related to persistence. Surveys allow perceptual data to be self reported and collected via questionnaires. They permit very specific theory, such as that underlying motivation or engagement, to be articulated and used as a reference point for describing a student. They also permit the theory to be tested. Very common among these are studies that focus on collecting information about students' "locus of control" (Levy, 2007; Parker, 1999; Morris et al., 2005), satisfaction (Park & Choi, 2009; Levy, 2007), or perception of family support (Park & Choi, 2009), among others.

Many studies struggle to collect this type of data. To avoid mistakes in manual data entry, most surveys are now administered electronically. However, in many cases not all students submit responses to surveys and questionnaires, and a respondent may unintentionally (or worse, intentionally) fail to answer a question accurately.

Trace data typically consists of logs and counters, and may include participation records. In MOOCs, trace data is available at a very fine-grained level. Largely, it can be considered a set of silent, passive observations. However, one needs to build interpretations on top of trace data, because it does not directly capture student states such as attention, motivation, or satisfaction.

Student persistence studies in MOOCs. In the context of MOOCs, the study of factors relating to persistence has been of great interest due to high non-completion rates. There have been at least 5 correlative studies, which we present in Table 4; these include (Poellhuber et al., 2014; DeBoer et al., 2014; Breslow et al., 2013). We categorize these studies as correlative because their primary goal is to identify variable influences on achievement or persistence.

Research studies performed on the same data as ours show a steady progression in how variables are assembled. (Breslow et al., 2013) identify the sources of data in MOOCs and discuss the influences of different factors on persistence and achievement. (DeBoer et al., 2013) identify the demographic and background information about students that is related to performance. (DeBoer et al., 2014) assemble 20 different variables that capture aggregate student behavior for the entire course. (DeBoer & Breslow, 2014) posit variables on a per week basis and correlate them with achievement, thus forming a basis for longitudinal study. Our work takes a leap forward and forms complex longitudinal variables on a per student, per week basis. Later, we attribute the success of our predictive models to the formation of these variables.

In (Poellhuber et al., 2014) a logistic regression model with 90% accuracy was (retrospectively) developed for a French language economics course delivered through the Edulib initiative of HEC Montreal (http://edulib.hec.ca) during the spring 2012 semester. We designate this as a correlative model because completion of the final exam was used as an explanatory variable. Univariate models were first constructed to provide information on variable significance. The final logistic regression model integrated significant variables and identified behavioral engagement measures as strongly related to persistence.


Table 3. Related Literature

Paper | Sample size | Category
(Parker, 1999) | 100 | correlative, behavioral, time-invariant
(Xenos et al., 2002) | 1,230 | correlative, behavioral, time-invariant
(Kotsiantis et al., 2003) | 354 | predictive, behavioral, time-invariant
(Xenos, 2004) | 800 | correlative, behavioral, time-invariant
(Zhang et al., 2004) | 57,549 | correlative, non-behavioral, time-invariant
(Dupin-Bryant, 2004) | 464 | correlative, behavioral, time-invariant
(Wojciechowski & Palmer, 2005) | 179 | correlative, behavioral, time-invariant
(Morris et al., 2005) | 211 | predictive, behavioral, time-invariant
(Herzog, 2006) | 23,475 | predictive, behavioral, time-invariant
(Levy, 2007) | 133 | correlative, non-behavioral, time-invariant
(Holder, 2007) | 259 | correlative, non-behavioral, time-invariant
(Cocea & Weibelzahl, 2007) | 11 | predictive, behavioral, time-invariant
(Mendez et al., 2008) | 2,232 | predictive, behavioral, time-invariant
(Hung & Zhang, 2008) | 98 | predictive, behavioral, time varying
(Moseley & Mead, 2008) | 528 | predictive, behavioral, time varying
(Juan et al., 2008) | 50 | correlative, behavioral, time varying
(Boon, 2008) | 1,050 | correlative, behavioral, time-invariant
(Aragon & Johnson, 2008) | 305 | correlative, non-behavioral, time-invariant
(Allen & Robbins, 2008) | 50,000 | correlative, behavioral, time-invariant
(Lykourentzou et al., 2009) [1] | 193 | predictive, behavioral, time varying
(Willging & Johnson, 2009) | 83 | predictive, non-behavioral, time-invariant
(Park & Choi, 2009) | 147 | correlative, non-behavioral, time-invariant
(Nistor & Neubauer, 2010) | 209 | predictive, behavioral, time varying

[1] This article contains a comprehensive overview, similar to ours, of a variety of studies on dropout prediction conducted over a number of years in the e-learning/online learning context. We follow some of their findings about related work and summarize them in this table, in addition to a few more studies we found.


Table 4. Studies about student persistence in MOOCs

Paper | Category
(DeBoer et al., 2013) | correlative, non-behavioral, time-invariant
(Yang et al., 2013) | correlative, behavioral, time varying
(Breslow et al., 2013) | correlative, behavioral, time-invariant
(DeBoer et al., 2014) | correlative, behavioral, time-invariant
(DeBoer & Breslow, 2014) | correlative, behavioral, time varying
(Halawa et al., 2014) | predictive, behavioral, time varying
(Ramesh et al., 2013) | predictive, behavioral, time varying
(Balakrishnan & Coetzee, 2013) | predictive, behavioral, time varying


Three predictive studies closer to our own are (Halawa et al., 2014; Balakrishnan & Coetzee, 2013; Ramesh et al., 2013; 2014). All attempt to predict one week ahead (lead = 1), except for (Ramesh et al., 2014), which attempts to predict at three different time points in the course. Among the papers we surveyed, (Balakrishnan & Coetzee, 2013; Ramesh et al., 2013) use area under the curve (AUC) as a metric for evaluating the predictive model.
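For reference, the sketch below shows how such an AUC figure is computed with scikit-learn; the label and score arrays are made-up stand-ins for true stopout outcomes and a model's predicted stopout probabilities, not values from any of the cited studies.

    from sklearn.metrics import roc_auc_score

    # Made-up stand-ins: 1 = student stops out in the target week, 0 = persists.
    y_true  = [0, 0, 1, 1, 0, 1, 0, 0]
    # Predicted probability of stopout from a fitted classifier.
    y_score = [0.10, 0.35, 0.80, 0.65, 0.70, 0.90, 0.40, 0.05]

    # AUC is threshold-free: it is the probability that a randomly chosen
    # stopout student is ranked above a randomly chosen persisting student.
    print(roc_auc_score(y_true, y_score))  # about 0.93 for these toy values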

There are three noteworthy accomplishments of our study when compared to the studies above. First, throughout our study we emphasize variable/feature engineering from the click stream data and thus generate complex features that explain student behavior longitudinally (Veeramachaneni et al., 2014b). We attribute the success of our models to these variables (more than to the models themselves), as we achieve AUCs in the range of 0.88-0.90 for one week ahead for the passive collaborator cohort.

Second, we focus on forming features/variables from highly granular, frequently collected click stream data, which allows us to make predictions for a significantly larger portion of students, including those who do not participate in forums, and which in addition captures learners' interactions with resources and assignments. In the course data we worked with, only 8,301 out of 52,939 students participated in the forums (approximately 15.6%, see Figure 5). We argue that variables derived from learner interactions on forums, as presented in (Ramesh et al., 2013; 2014; Yang et al., 2013), will only be available for a subset of learners.

Third, we split the learner population into four different cohorts, and our methodology generates 91 different prediction problems per cohort based on different leads and lags, building models for each of them. This results in 364 different prediction problems requiring modeling.
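The figure of 91 problems per cohort is consistent with enumerating every feasible combination of lag (weeks of observed data) and lead (weeks ahead being predicted) over a 14-week course; the sketch below reproduces that count, with the 14-week horizon treated here as an assumption rather than a value restated from this section.

    # Enumerate prediction problems: use `lag` weeks of observed features to
    # predict stopout `lead` weeks later, for every feasible pair in the course.
    N_WEEKS = 14  # assumed course length; with 14 weeks this yields 91 problems

    problems = [(lag, lead)
                for lag in range(1, N_WEEKS)
                for lead in range(1, N_WEEKS - lag + 1)]

    print(len(problems))      # 91 (lag, lead) pairs per cohort
    print(len(problems) * 4)  # 364 across the four cohorts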

8. Summary of Research Findings

Our modeling and feature engineering efforts reveal the following (while we refer the reader to (Veeramachaneni et al., 2014b) in the compendium for detailed descriptions of the features we employed for prediction, numbered x1 . . . x18 and x201 . . . x208, here we present only a summary of our findings):

• Stopout prediction is a tractable problem. Our models achieved an AUC (receiver operating characteristic area-under-the-curve) as high as 0.95 (and generally ∼0.88) when predicting one week in advance. Even with more difficult prediction problems, such as predicting student stopout at the end of the course with only one week's data, our models attained AUCs of ∼0.7. This suggests that early predictors of stopout exist.

• For almost every prediction week, our models find only the most recent four weeks of data predictive.

• Taking the extra effort to extract complex predictive features that require relative comparison or temporal trends, rather than employing more direct covariates of behavior, or even trying multiple modeling techniques, is the most important contributor to successful stopout prediction. While we constructed many models with a variety of techniques, we found that accuracy was consistent across techniques and depended on the features we used. Using more informative features yielded superior accuracy that was consistent across modeling techniques. Very seldom did the modeling technique itself make a difference. A significant exception to this is when the model has only a small number of students (for example, fewer than approximately 400) to learn from; some models perform notably better than others when data is scarce.

• A crowd familiar with MOOCs is capable of proposing sophisticated features which are highly predictive. The features brainstormed through our crowd-sourcing efforts were actually more useful than those we thought of independently. Additionally, the crowd is very willing to participate in MOOC research. These observations suggest the education-informed crowd is a realistic source of modeling assistance, and more efforts should be made to engage it. See (Veeramachaneni et al., 2014b) for more details.

• Overall, features which incorporate student problem submission engagement are the most predictive of stopout. Because our prediction problem defined stopout using problem submissions, this result is not particularly surprising; however, submission engagement is arguably a good definition.

• In general, complex, sophisticated features, such as the percentile of a student when compared to other students (x202, Table 2), which relates students to peers, and lab grade over time (x207, Table 2), which has a temporal trend, are more predictive than simple features, such as a count of submissions (x7, Table 1); a sketch of a percentile-style feature appears after this list.

• Features involving inter-student collaboration, such as the class forum and Wiki, can be useful in stopout prediction. It is likely that the quality and content of a student's questions or knowledge are more important than strict collaboration frequency. We found that, in particular, the length of forum posts (x5, Table 1) is predictive, but the number of posts (x3, Table 1) and the number of forum responses (x201, Table 2) are not. The role of the collaborative mechanism (i.e. Wiki or forum) also appears to be distinctive since, in contrast to forum post length, Wiki edits have almost no predictive power.
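As referenced in the list above, the sketch below gives one hedged illustration of a relative-comparison feature in the spirit of the student percentile; the frame and column names (user_id, week, submissions) are illustrative only and do not reproduce the exact definition of x202.

    import pandas as pd

    # Illustrative per-student, per-week submission counts.
    df = pd.DataFrame({
        "user_id":     [1, 2, 3, 1, 2, 3],
        "week":        [1, 1, 1, 2, 2, 2],
        "submissions": [4, 0, 9, 2, 1, 7],
    })

    # Relative-comparison feature: each student's percentile rank among all
    # students in the same week (values near 1 indicate the most active students).
    df["submission_percentile"] = df.groupby("week")["submissions"].rank(pct=True)

    print(df)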

9. General reflections for the entire compendium

This extensive project has revealed a combinatorial explosion of MOOC modeling choices. There are a variety of algorithms one could use, a variety of ways to define a modeling problem, and a number of ways to organize the data fed into modeling. There are also numerous challenges in assembling features, while the features themselves turn out to be of very high importance. One has to work systematically and be thorough in feature definition and model exploration; otherwise one will never know whether one has derived the best prediction capability from the data.

To successfully apply the power of data science and machine learning to MOOC analytics, multiple aspects of the process are critical:

Feature engineering. One has to be meticulous from the data up: any vague assumptions or quick-and-dirty data conditioning and preparation will create weak foundations for one's modeling and analyses. Many times painstaking manual labor is required, such as manually matching up pset deadlines. We need to be ready to think creatively as we brainstorm and extract features, and to be flexible in the ways we assemble them. For example, utilizing the crowd yields much richer features than relying on our own expertise alone.

Machine learning/modeling at scale. There are many ways to represent the extracted feature data: with or without PCA, temporal and non-temporal, discretized and non-discretized. Additionally, there are a number of modeling choices: discriminative, generative or mixed models, which include many types of classifiers. One has to consider a number of them to enable insights at scale; the alternative results in a much smaller scope with more limited results.
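A minimal sketch of what sweeping representation and model choices can look like with scikit-learn; the feature matrix and labels are synthetic, and the two classifiers merely stand in for the wider set of discriminative and generative techniques considered in the study.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))            # synthetic per-student feature matrix
    y = (X[:, 0] + rng.normal(size=500)) > 0  # synthetic stopout labels

    representations = {"raw": None, "pca": PCA(n_components=5)}
    models = {"logreg": LogisticRegression(max_iter=1000),
              "forest": RandomForestClassifier(n_estimators=100)}

    # Cross-validated AUC for every (representation, model) combination.
    for r_name, transform in representations.items():
        for m_name, model in models.items():
            steps = [transform, model] if transform is not None else [model]
            auc = cross_val_score(make_pipeline(*steps), X, y,
                                  cv=5, scoring="roc_auc").mean()
            print(f"{r_name:4s} + {m_name:6s}: AUC = {auc:.2f}")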

Our ability to build 10,000 models relied on first building cloud-scale platforms. This is especially true because the machine learning process includes iterations over data definitions, features and cohort definitions. Only through a large-scale computational framework are these multiple iterations possible. Throughout our analysis we ran on hundreds of nodes simultaneously, using the DCAP and Delphi frameworks.


Transfer learning prospects. In order to have a lasting impact on MOOC data science, we have to think big! Investing resources in investigating stopout for only one course limits the impact of the results. With this in mind, we set out to create a reusable, scalable methodology.

From the beginning of the research, we have envisioned creating open source software. This would allow other researchers to apply our methodology to their own MOOC courses. That our software can be used by any other MOOC researcher is due to standardization via the shared MOOCdb data schema. Our attention to the scalability of our methods for large data sets also supports wide applicability. The prospect of multiple studies and multi-course studies would be very exciting and most welcome.


Bibliography

Allen, Jeff and Robbins, Steven B. Prediction of college major persistence based on vocational interests, academic preparation, and first-year academic performance. Research in Higher Education, 49(1):62–79, 2008.

Aragon, Steven R and Johnson, Elaine S. Factors influencing completion and noncompletion of community college online courses. The American Journal of Distance Education, 22(3):146–158, 2008.

Balakrishnan, Girish and Coetzee, Derrick. Predicting student retention in massive open online courses using hidden Markov models. Technical Report No. UCB/EECS-2013-109, EECS, University of California, Berkeley, 2013.

Boon, Helen J. Risk or resilience? What makes a difference? The Australian Educational Researcher, 35(1):81–102, 2008.

Breslow, Lori, Pritchard, David E, DeBoer, Jennifer, Stump, Glenda S, Ho, Andrew D, and Seaton, DT. Studying learning in the worldwide classroom: Research into edX's first MOOC. Research & Practice in Assessment, 8:13–25, 2013.

Cocea, Mihaela and Weibelzahl, Stephan. Cross-system validation of engagement prediction from log files. In Creating New Learning Experiences on a Global Scale, pp. 14–25. Springer, 2007.

DeBoer, Jennifer and Breslow, Lori. Tracking progress: predictors of students' weekly achievement during a circuits and electronics MOOC. In Proceedings of the First ACM Conference on Learning @ Scale, pp. 169–170. ACM, 2014.

DeBoer, Jennifer, Ho, Andrew, Stump, Glenda S, Pritchard, David E, Seaton, Daniel, and Breslow, Lori. Bringing student backgrounds online: MOOC user demographics, site usage, and online learning. engineer, 2:0–81, 2013.

DeBoer, Jennifer, Ho, Andrew D, Stump, Glenda S, and Breslow, Lori. Changing "course": Reconceptualizing educational variables for massive open online courses. Educational Researcher, pp. 0013189X14523038, 2014.

Diaz, David P. Online drop rates revisited. The Technology Source, 3, 2002.

Drevo, Will. Delphi: A Distributed Multi-algorithm, Multi-user, Self Optimizing Machine Learning System. Master's thesis, Massachusetts Institute of Technology, 2014.

Dupin-Bryant, Pamela A. Pre-entry variables related to retention in online distance education. The American Journal of Distance Education, 18(4):199–206, 2004.

Halawa, Sherif, Greene, Daniel, and Mitchell, John. Dropout prediction in MOOCs using learner activity features. In Proceedings of the European MOOC Summit. EMOOCs, 2014.

Herzog, Serge. Estimating student retention and degree-completion time: Decision trees and neural networks vis-à-vis regression. New Directions for Institutional Research, 2006(131):17–33, 2006.

Holder, Bruce. An investigation of hope, academics, environment, and motivation as predictors of persistence in higher education online programs. The Internet and Higher Education, 10(4):245–260, 2007.

Hung, Jui-Long and Zhang, Ke. Revealing online learning behaviors and activity patterns and making predictions with data mining techniques in online teaching. MERLOT Journal of Online Learning and Teaching, 2008.

Juan, Angel A, Daradoumis, Thanasis, Faulin, Javier, and Xhafa, Fatos. Developing an information system for monitoring student's activity in online collaborative learning. In Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. International Conference on, pp. 270–275. IEEE, 2008.

Kotsiantis, Sotiris B, Pierrakeas, CJ, and Pintelas, Panayiotis E. Preventing student dropout in distance learning using machine learning techniques. In Knowledge-Based Intelligent Information and Engineering Systems, pp. 267–274. Springer, 2003.


Levy, Yair. Comparing dropouts and persistence in e-learning courses. Computers & Education, 48(2):185–204, 2007.

Lykourentzou, Ioanna, Giannoukos, Ioannis, Nikolopoulos, Vassilis, Mpardis, George, and Loumos, Vassili. Dropout prediction in e-learning courses through the combination of machine learning techniques. Computers & Education, 53(3):950–965, 2009.

Menard, Scott. Applied Logistic Regression Analysis, volume 106. Sage, 2002.

Mendez, Guillermo, Buskirk, Trent D, Lohr, Sharon, and Haag, Susan. Factors associated with persistence in science and engineering majors: An exploratory study using classification trees and random forests. Journal of Engineering Education, 97(1):57–70, 2008.

Morris, Libby V, Wu, Sz-Shyan, and Finnegan, Catherine L. Predicting retention in online general education courses. The American Journal of Distance Education, 19(1):23–36, 2005.

Moseley, Laurence G and Mead, Donna M. Predicting who will drop out of nursing courses: a machine learning exercise. Nurse Education Today, 28(4):469–475, 2008.

Nistor, Nicolae and Neubauer, Katrin. From participation to dropout: Quantitative participation patterns in online university courses. Computers & Education, 55(2):663–672, 2010.

Park, Ji-Hye and Choi, Hee Jun. Factors influencing adult learners' decision to drop out or persist in online learning. Educational Technology & Society, 12(4):207–217, 2009.

Parker, Angie. A study of variables that predict dropout from distance education. International Journal of Educational Technology, 1(2):1–10, 1999.

Poellhuber, Bruno, Chomienne, Martine, and Karsenti, Thierry. The effect of peer collaboration and collaborative learning on self-efficacy and persistence in a learner-paced continuous intake model. International Journal of E-Learning & Distance Education, 22(3):41–62, 2008.

Poellhuber, Bruno, Roy, Normand, Bouchoucha, Ibtihel, and Anderson, Terry. The relationship between the motivational profiles, engagement profiles and persistence of MOOC participants. MOOC Research Initiative Final Report, 2014.

Ramesh, Arti, Goldwasser, Dan, Huang, Bert, Daumé III, Hal, and Getoor, Lise. Modeling learner engagement in MOOCs using probabilistic soft logic. In NIPS Workshop on Data Driven Education, 2013.

Ramesh, Arti, Goldwasser, Dan, Huang, Bert, Daumé III, Hal, and Getoor, Lise. Learning latent engagement patterns of students in online courses. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 2014.

Street, Hannah D. Factors influencing a learner's decision to drop-out or persist in higher education distance learning. Online Journal of Distance Learning Administration, 13(4), 2010.

Taylor, Colin, Veeramachaneni, Kalyan, and O'Reilly, Una-May. Hidden Markov models for stopout prediction in MOOCs. arXiv preprint, 2014.

Tyler-Smith, Keith. Early attrition among first time eLearners: A review of factors that contribute to drop-out, withdrawal and non-completion rates of adult learners undertaking eLearning programmes. Journal of Online Learning and Teaching, 2(2):73–85, 2006.

Veeramachaneni, Kalyan, Halawa, Sherif, Dernoncourt, Franck, O'Reilly, Una-May, Taylor, Colin, and Do, Chuong. MOOCdb: Developing standards and systems to support MOOC data science. arXiv preprint arXiv:1406.2015, 2014a.

Veeramachaneni, Kalyan, O'Reilly, Una-May, and Taylor, Colin. Towards feature engineering at scale for data from massive open online courses. arXiv preprint arXiv:1407.5238, 2014b.

Willging, Pedro A and Johnson, Scott D. Factors that influence students' decision to dropout of online courses. Journal of Asynchronous Learning Networks, 13(3):115–127, 2009.

Wojciechowski, Amy and Palmer, Louann Bierlein. Individual student characteristics: Can any be predictors of success in online classes? Online Journal of Distance Learning Administration, 8(2), 2005.

Xenos, Michalis. Prediction and assessment of student behaviour in open and distance education in computers using Bayesian networks. Computers & Education, 43(4):345–359, 2004.

Xenos, Michalis, Pierrakeas, Christos, and Pintelas, Panagiotis. A survey on student dropout rates and dropout causes concerning the students in the course of informatics of the Hellenic Open University. Computers & Education, 39(4):361–377, 2002.


Yang, Diyi, Sinha, Tanmay, Adamson, David, and Rosé, Carolyn Penstein. Turn on, tune in, drop out: Anticipating student dropouts in massive open online courses. In Proceedings of the 2013 NIPS Data-Driven Education Workshop, 2013.

Zhang, Guili, Anderson, Timothy J, Ohland, Matthew W, and Thorndyke, Brian R. Identifying factors influencing engineering student graduation: A longitudinal and cross-institutional study. Journal of Engineering Education, 93(4):313–320, 2004.