2015 EDM Leopard for Adaptive Tutoring Evaluation
Transcript of 2015 EDM Leopard for Adaptive Tutoring Evaluation
![Page 1: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/1.jpg)
Your model is predictive, but is it useful? Theoretical and Empirical Considerations of a New Paradigm for Adaptive Tutoring Evaluation
José P. González-Brenes, Pearson
Yun Huang, University of Pittsburgh
1
![Page 2: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/2.jpg)
Main point of the paper:
• We are not evaluating student models correctly
• A new paradigm, Leopard, may help
2
![Page 3: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/3.jpg)
3
![Page 4: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/4.jpg)
Why are we here?
4
![Page 5: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/5.jpg)
5
![Page 6: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/6.jpg)
Educational Data Mining = Data Mining?
6
![Page 7: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/7.jpg)
Should we just publish at KDD*?
*or other data mining venue
7
![Page 8: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/8.jpg)
Claim: Educational Data Mining helps learners
8
![Page 9: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/9.jpg)
9
Is our research helping learners?
![Page 10: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/10.jpg)
Adaptive Intelligent Tutoring Systems: Systems that teach and adapt content to humans
teach
collect evidence
10
![Page 11: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/11.jpg)
Paper writing: Researchers quantify the improvements of the systems they compare
Not a purely academic pursuit:
• Superintendents choose between alternative technologies
• Teachers choose between systems
11
![Page 12: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/12.jpg)
Randomized Controlled Trials may measure the time students spent on tutoring, and their performance on post-tests
12
![Page 13: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/13.jpg)
Difficulties of Randomized Controlled Trials:
• IRB approval
• Experimental design by an expert
• Recruiting (and often payment!) of enough participants to achieve statistical power
• Data analysis
13
![Page 14: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/14.jpg)
How do other A.I. disciplines do it?
14
![Page 15: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/15.jpg)
• Bleu [Papineni et al. ’01]: machine translation systems
• Rouge [Lin et al. ’02]: automatic summarization systems
• Paradise [Walker et al. ’99]: spoken dialogue systems
15
![Page 16: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/16.jpg)
Using automatic metrics can be very positive:
• Cheaper experimentation
• Faster comparisons
• Competitions that accelerate progress
16
![Page 17: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/17.jpg)
Automatic metrics do not replace RCTs
17
![Page 18: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/18.jpg)
What does the Educational Data Mining community do?
Evaluate the student model using classification accuracy metrics such as RMSE, AUC of ROC, accuracy… (Literature reviews by Pardos, Pelánek, …)
18
![Page 19: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/19.jpg)
Other fields verify that automatic metrics correlate with the target behavior [e.g., Callison-Burch et al. ’06]
19
![Page 20: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/20.jpg)
Ironically, we have a growing body of evidence that classification evaluation metrics are a BAD way to evaluate adaptive tutors
Read Baker and Beck papers on limitations / problems of these evaluation metrics
20
![Page 21: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/21.jpg)
Surprisingly, in spite of all of the evidence against using classification evaluation metrics, their use is still very widespread in the adaptive tutoring literature*
Can we do better?
* ExpOppNeed [Lee & Brunskill] is an exception
21
![Page 22: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/22.jpg)
The rest of this talk:
• Leopard Paradigm
  – Teal
  – White
• Meta-evaluation
• Discussion
22
![Page 23: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/23.jpg)
Leopard
23
![Page 24: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/24.jpg)
• Leopard: Learner Effort-Outcomes Paradigm
• Leopard quantifies the effort and outcomes of students in adaptive tutoring
24
![Page 25: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/25.jpg)
• Effort: Quantifies how much practice the adaptive tutor gives to students, e.g., number of items assigned to students, amount of time…
• Outcome: Quantifies the performance of students after adaptive tutoring
25
![Page 26: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/26.jpg)
• Measuring effort and outcomes is not novel by itself (e.g., RCTs)
• Leopard’s contribution is measuring both without a randomized controlled trial
• White and Teal are metrics that operationalize Leopard
26
![Page 27: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/27.jpg)
White: Whole Intelligent Tutoring system Evaluation
White performs a counterfactual simulation (“What Would the Tutor Do?”) to estimate how much practice students receive
27
![Page 28: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/28.jpg)
28
Design desiderata:
• The evaluation metric should be easy to use
• Same or similar input as conventional metrics
![Page 29: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/29.jpg)
29
[Figure: example table of practice sequences for two students, Alice and Bob, on skill s1. Columns: student $u$, skill $q$, time step $t$, predicted performance $\hat{c}_{u,q,t+1}$, and actual performance $y_{u,q,t}$.]
![Page 30: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/30.jpg)
30
[Figure: example table of practice sequences for Alice and Bob on skill s1 (columns: student $u$, skill $q$, time step $t$, predicted performance $\hat{c}_{u,q,t+1}$, actual performance $y_{u,q,t}$), annotated with a score and an effort value for each student. The scores shown are 2/3 and 4/5.]
Alternatively, we can model error
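The slides do not spell out the computation behind White, so the following is only a minimal sketch of the counterfactual replay under stated assumptions. It assumes the hypothetical tutor stops assigning practice at the first step where the student model's predicted probability of a correct answer reaches the target, that effort is the number of items assigned before that point, and that score is the fraction of correct responses from that point on. The function name `white` and the convention of returning `None` when the target is never reached are illustrative choices, not taken from the talk.

```python
from typing import List, Optional, Tuple


def white(predicted: List[float], actual: List[int],
          target: float) -> Tuple[int, Optional[float]]:
    """Replay one student's logged practice on one skill.

    predicted[t]: student model's predicted probability of a correct
                  answer at step t (assumed alignment, see lead-in).
    actual[t]:    observed response at step t (1 = correct, 0 = incorrect).
    target:       target probability of correct.
    Returns (effort, score).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # First step at which the tutor would consider the skill mastered;
    # if the target is never reached, the whole sequence counts as practice.
    stop = next((t for t, p in enumerate(predicted) if p >= target),
                len(predicted))

    effort = stop                      # items the tutor would have assigned
    remaining = actual[stop:]          # responses after the tutor would stop
    score = sum(remaining) / len(remaining) if remaining else None
    return effort, score


if __name__ == "__main__":
    # Hypothetical sequence, not the data shown on the slide.
    effort, score = white([0.5, 0.6, 0.7, 0.8, 0.9], [0, 1, 0, 1, 1], 0.7)
    print(effort, score)
```

Averaging effort and score over all student and skill pairs would give one pair of numbers per student model, which is the form in which results are reported later in the talk.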
![Page 31: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/31.jpg)
Future direction: Present aggregate results
31
![Page 32: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/32.jpg)
Q: What if a student does not achieve the target performance?
A: “Visible” imputation
32
![Page 33: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/33.jpg)
Meta-Evaluation
33
![Page 34: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/34.jpg)
Compare:
• Conventional classification metrics
• Leopard metrics (White)
34
![Page 35: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/35.jpg)
Datasets:
• Data from a commercial middle-school math tutor
  – 1.2 million observations
  – 25,000 students
  – Item-to-skill mapping:
    • Coarse: 27 skills
    • Fine: 90 skills
    • (Other item-to-skill model not reported)
• Synthetic data
35
![Page 36: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/36.jpg)
Assessing an evaluation metric with real student data is difficult because we often do not know the ground truth
Insight: Use data whose behavior in an adaptive tutor we know a priori
36
![Page 37: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/37.jpg)
For adaptive tutoring to be able to optimize when to stop instruction, student performance should increase with repeated practice (the learning curve should be increasing)
Decreasing/flat learning curve = bad data
37
![Page 38: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/38.jpg)
38
![Page 39: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/39.jpg)
39
![Page 40: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/40.jpg)
Procedure:
1. Select skills with a decreasing/flat learning curve (aka bad data)
2. Train a student model on those skills
3. Compare classification metrics with Leopard
40
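The talk does not say how decreasing or flat learning curves were identified, so the selection step of the procedure above could be sketched roughly as follows. The least-squares slope criterion, the `min_slope` threshold, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Sequence


def learning_curve(sequences: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of correct responses at each practice opportunity,
    pooled over the students who reached that opportunity."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for seq in sequences:
        for t, y in enumerate(seq):
            correct[t] += y
            seen[t] += 1
    return [correct[t] / seen[t] for t in sorted(seen)]


def slope(curve: Sequence[float]) -> float:
    """Least-squares slope of the curve against the opportunity index."""
    n = len(curve)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2
    mean_y = sum(curve) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(curve))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def is_bad_skill(sequences: Sequence[Sequence[int]],
                 min_slope: float = 0.0) -> bool:
    """Flag a skill whose learning curve is flat or decreasing
    (slope at or below min_slope; the threshold is an assumption)."""
    return slope(learning_curve(sequences)) <= min_slope


if __name__ == "__main__":
    # Hypothetical responses of two students on one skill.
    print(is_bad_skill([[1, 0, 0], [1, 1, 0]]))   # performance drops
    print(is_bad_skill([[0, 1, 1], [0, 0, 1]]))   # performance rises
```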
![Page 41: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/41.jpg)
| Model | F1 | AUC | Score | Effort |
|---|---|---|---|---|
| Bad student model | .79 | .85 | | |
| Majority class | 0 | .50 | | |
41
![Page 42: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/42.jpg)
| Model | F1 | AUC | Score | Effort |
|---|---|---|---|---|
| Bad student model | .79 | .85 | .18 | 10.1 |
| Majority class | 0 | .50 | .18 | 11.2 |
42
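As background for the majority-class row, the sketch below shows why a predictor that outputs the same value for every observation gets an AUC of .50, and why its F1 is 0 if the class it always predicts is "incorrect". That the majority class in this subset is "incorrect" is an assumption inferred from the F1 of 0, and the helper names `auc` and `f1` are illustrative.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the probability that a random positive
    is scored above a random negative (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """F1 of the positive ("correct") class; 0.0 when there are no
    true positives."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    labels = [1, 0, 0, 1, 0]              # hypothetical responses
    constant_scores = [0.2] * len(labels)  # same score for every item
    always_incorrect = [0] * len(labels)   # majority-class prediction
    print(auc(constant_scores, labels), f1(always_incorrect, labels))
```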
![Page 43: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/43.jpg)
What does this mean?
• High-accuracy models may not be useful for adaptive tutoring
• We need to change how we report results in adaptive tutoring
43
![Page 44: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/44.jpg)
Solutions
• Report classification accuracy averaged over skills (for models with 1 skill per item)
  ✖ Not useful for comparing or discovering different skill models
• Report a “difficulty” baseline
  ✖ Experiments suggest that models with baseline performance can be useful
• Use Leopard
44
![Page 45: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/45.jpg)
Let’s use all data, and pick an item-to-skill mapping:

| Item-to-skill mapping | AUC | Score | Effort |
|---|---|---|---|
| Coarse (27 skills) | .69 | | |
| Fine (90 skills) | .74 | | |
45
![Page 46: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/46.jpg)
Let’s use all data, and pick an item-to-skill mapping:

| Item-to-skill mapping | AUC | Score | Effort |
|---|---|---|---|
| Coarse (27 skills) | .69 | .41 | 55.7 |
| Fine (90 skills) | .74 | .36 | 88.1 |
46
![Page 47: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/47.jpg)
With synthetic data we can use Teal as the ground truth
We generate 500 synthetic datasets with known Knowledge Tracing parameters
Which metrics correlate best with the truth?
47
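The slides do not describe the generator behind the synthetic datasets. As a rough illustration of what "known Knowledge Tracing parameters" can mean, the sketch below samples responses from a standard four-parameter Knowledge Tracing model (initial knowledge, learn, guess, slip) with no forgetting. The function name `simulate_bkt`, its arguments, and the parameter values in the usage example are all hypothetical.

```python
import random
from typing import List


def simulate_bkt(n_students: int, seq_len: int, p_init: float,
                 p_learn: float, p_guess: float, p_slip: float,
                 seed: int = 0) -> List[List[int]]:
    """Sample response sequences (1 = correct, 0 = incorrect) for one skill."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_students):
        known = rng.random() < p_init          # hidden knowledge state
        responses = []
        for _ in range(seq_len):
            # A student who knows the skill answers correctly unless they
            # slip; a student who does not know it can still guess.
            p_correct = 1.0 - p_slip if known else p_guess
            responses.append(1 if rng.random() < p_correct else 0)
            # Each practice opportunity is a chance to learn the skill.
            if not known and rng.random() < p_learn:
                known = True
        data.append(responses)
    return data


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for seq in simulate_bkt(3, 10, p_init=0.3, p_learn=0.2,
                            p_guess=0.2, p_slip=0.1, seed=42):
        print(seq)
```

Because the parameters that produced each dataset are known, the behavior of an ideal tutor on that data can be computed directly, which is what makes a ground-truth comparison possible.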
![Page 48: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/48.jpg)
48
![Page 49: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/49.jpg)
49
![Page 50: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/50.jpg)
Discussion
50
![Page 51: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/51.jpg)
51
In EDM 2014 we proposed the “FAST” toolkit for Knowledge Tracing with Features
![Page 52: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/52.jpg)
“FAST model improves 25% AUC of ROC”
![Page 53: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/53.jpg)
![Page 54: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/54.jpg)
54
![Page 55: 2015 EDM Leopard for Adaptive Tutoring Evaluation](https://reader036.fdocuments.net/reader036/viewer/2022062503/58ed2b6f1a28abab6c8b4659/html5/thumbnails/55.jpg)
Input

| Teal | White |
|---|---|
| Knowledge Tracing family parameters | Student’s correct or incorrect response |
| Sequence length | Student model’s prediction of correct/incorrect |
| Target probability of correct | Target probability of correct |
55