Uncertainty Estimation of Deep Neural Networks
Transcript of Uncertainty Estimation of Deep Neural Networks
University of South Carolina University of South Carolina
Scholar Commons Scholar Commons
Theses and Dissertations
2018
Uncertainty Estimation of Deep Neural Networks Uncertainty Estimation of Deep Neural Networks
Chao Chen University of South Carolina - Columbia
Follow this and additional works at: https://scholarcommons.sc.edu/etd
Part of the Computer Sciences Commons
Recommended Citation Recommended Citation Chen, C.(2018). Uncertainty Estimation of Deep Neural Networks. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/5035
This Open Access Dissertation is brought to you by Scholar Commons. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Scholar Commons. For more information, please contact [email protected].
Uncertainty Estimation of Deep Neural Networks
by
Chao Chen
Bachelor of ScienceSun Yat-sen University 2010
Master of ScienceCentral Michigan University 2012
Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy in
Computer Science
College of Engineering and Computing
University of South Carolina
2018
Accepted by:
Gabriel Terejanu, Major Professor
Jianjun Hu, Committee Member
Jijun Tang, Committee Member
John Rose, Committee Member
Juan Caicedo, Committee Member
Cheryl L. Addy, Vice Provost and Dean of the Graduate School
c© Copyright by Chao Chen, 2018All Rights Reserved.
ii
Dedication
To my family.
iii
Abstract
Normal neural networks trained with gradient descent and back-propagation have
received great success in various applications. On one hand, point estimation of the
network weights is prone to over-fitting problems and lacks important uncertainty
information associated with the estimation. On the other hand, exact Bayesian neural
network methods are intractable and non-applicable for real-world applications.
To date, approximate methods have been actively under development for Bayesian
neural networks, including but not limited to: stochastic variational methods, Monte
Carlo dropouts, and expectation propagation. Though these methods are applicable
for current large networks, there are limits to these approaches with either under-
estimation or over-estimation of uncertainty. Extended Kalman filters (EKFs) and
unscented Kalman filters (UKFs), which are widely used in data assimilation com-
munity, adopt a different perspective of inferring the parameters. Nevertheless, EKFs
are incapable of dealing with highly non-linearity, while UKFs are inapplicable for
large network architectures.
Ensemble Kalman filters (EnKFs) serve as great methodology in atmosphere and
oceanology disciplines targeting extremely high-dimensional, non-Gaussian, and non-
linear state-space models. So far, there is little work that applies EnKFs to estimate
the parameters of deep neural networks. By considering neural network as a non-
linear function, we augment the network prediction with parameters as new states
and adapt the state-space model to update the parameters. In the first work, we
describe the ensemble Kalman filter, two proposed training schemes for training both
fully-connected and Long Short-term Memory (LSTM) networks, and experiment
iv
with 10 UCI datasets and a natural language dataset for different regression tasks.
To further evaluate the effectiveness of the proposed training scheme, we trained
a deep LSTM network with the proposed algorithm, and applied it on five real-
world sub-event detection tasks. With a formalization of the sub-event detection
task, we develop an outlier detection framework and take advantage of the Bayesian
Long Short-term Memory (LSTM) network to capture the important and interesting
moments within an event.
In the last work, we propose a framework for student knowledge estimation using
Bayesian network. By constructing student models with Bayesian network, we can
infer the new state of knowledge on each concept given a student. With a novel
parameter estimate algorithm, the model can also indicate misconception on each
question. Furthermore, we develop a predictive validation metric with expected data
likelihood of the student model to evaluate the design of questions.
v
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation of Uncertainty Estimation . . . . . . . . . . . . . . . . . . 1
1.2 Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Bayesian Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Expectation Propagation . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 (Non)linear Kalman Filters . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7 Structure of the dissertation . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2 Uncertainty Estimation of Bayesian Neural Networks 13
2.1 Approximation in Bayesian Neural Networks . . . . . . . . . . . . . . 13
2.2 Ensemble Kalman Filter for Bayesian Neural Networks . . . . . . . . 14
2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
vi
Chapter 3 Approximate Long Short Term Memory Networkfor Outlier Detection . . . . . . . . . . . . . . . . . . . 28
3.1 Sub-event Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 LSTM Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Window Embeddings for Outlier Detection . . . . . . . . . . . . . . . 33
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 4 Student Knowledge Estimation with Bayesian Network 45
4.1 Student Knowledge Estimation . . . . . . . . . . . . . . . . . . . . . 45
4.2 Bayesian Network Background . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Educational Component . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
vii
List of Tables
Table 2.1 Average test RMSE of the proposed algorithm on different layersand ensemble size. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Table 2.2 Average test LL of the proposed algorithm on different layersand ensemble size. . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Table 2.3 Significance tests of the average test RMSE among VI, PBP andthe proposed algorithm with different ensembles and layers. . . . . 23
Table 2.4 Significance tests of the average test RMSE among MC-dropout,Deep Ensembles, and the proposed algorithm with different en-sembles and layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Table 2.5 Significance tests of the average test LL among VI, PBP and theproposed algorithm with different ensembles and layers. . . . . . . 23
Table 2.6 Significance tests of the average test LL among MC-dropout,Deep Ensembles, and the proposed algorithm with different en-sembles and layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Table 2.7 Number of parameters of selected networks. . . . . . . . . . . . . . 27
Table 3.1 Basic information of the five events. . . . . . . . . . . . . . . . . . 35
Table 3.2 Evaluation metrics on different algorithms. . . . . . . . . . . . . . 40
Table 4.1 Concepts are used for the concept inventory . . . . . . . . . . . . . 52
Table 4.2 Summary of students answers for each question in each class(Numbers of students who selected the correct choice are indi-cated by circle) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 4.3 Conditional probabilities for a question related to two concepts . . 61
Table 4.4 Conditional probabilities for question 19 . . . . . . . . . . . . . . . 61
viii
Table 4.5 Conditional probabilities for question 22 . . . . . . . . . . . . . . . 62
Table 4.6 P-value for each question . . . . . . . . . . . . . . . . . . . . . . . 62
Table 4.7 Conditional probabilities for question 11 . . . . . . . . . . . . . . . 63
Table 4.8 Conditional probabilities for question 20 . . . . . . . . . . . . . . . 63
ix
List of Figures
Figure 1.1 Comparasion between a common MLP network and a BayesianMLP network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Figure 1.2 Uncertainty information associated with a non-linear function:y = x+ 0.3sin(2π(x+ ε)) + 0.3sin(4π(x+ ε)) + ε . . . . . . . . . 2
Figure 2.1 Training RMSE for different models. . . . . . . . . . . . . . . . . 25
Figure 2.2 Testing RMSE for different models. . . . . . . . . . . . . . . . . . 26
Figure 2.3 Testing LL for different models. . . . . . . . . . . . . . . . . . . . 26
Figure 3.1 A RNN with an input layer (blue), a hidden layer (red), and anoutput layer (green). Units within the dotted regions are optional. 32
Figure 3.2 Gated mechanism of a LSTM cell as described by [39]. . . . . . . 34
Figure 3.3 Architecture of the network used in this study. . . . . . . . . . . . 38
Figure 3.4 Predicted sub-events with the proposed algorithm for the 2013Boston marathon event. The distance lines above the bluethreshold line indicate identified outliers, and the red color in-dicates the Boston bombing moment (identified=40, true=16,σ2ε=2.13). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 3.5 Predicted sub-events with the proposed algorithm for the 2013Superbowl event. The distance lines above the blue thresholdline indicate identified outliers, and the red color indicates thepower outbreak (identified=33, true=18, σ2
ε=2.19). . . . . . . . . 40
Figure 3.6 Predicted sub-events with the proposed algorithm for the 2013OSCAR event. The distance lines above the blue thresholdline indicate identified outliers, and the red color indicates theOSCAR starting moment (identified=39, true=25, σ2
ε=1.98). . . . 41
x
Figure 3.7 Predicted sub-events with the proposed algorithm for the 2013AllStar event. The distance lines above the blue threshold lineindicate identified outliers, and the red color indicates the All-Star starting moment (identified=33, true=22, σ2
ε=1.56). . . . . . 42
Figure 3.8 Predicted sub-events with the proposed algorithm for the 2013Zimmerman trial news event. The distance lines above the bluethreshold line indicate identified outliers, and the red color in-dicates the verdict moment (identified=23, true=17, σ2
ε=1.21). . . 43
Figure 3.9 Performance of the algorithm on the marathon event for differ-ent ensemble size. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure 3.10 Performance of the algorithm on the Boston marathon eventfor different sigma value. . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 4.1 Relationships of concepts and questions . . . . . . . . . . . . . . . 53
Figure 4.2 Bayesian Network Model of Student Knowledge for Statics Con-cept Inventory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 4.3 Question 19 related to C3 . . . . . . . . . . . . . . . . . . . . . . 61
Figure 4.4 Question 22 related to C3 . . . . . . . . . . . . . . . . . . . . . . 62
Figure 4.5 Question 11 related to concept C3 . . . . . . . . . . . . . . . . . 63
Figure 4.6 Question 20 related to concept C3 . . . . . . . . . . . . . . . . . 64
Figure 4.7 Question 9 related to concept C2 . . . . . . . . . . . . . . . . . . 65
Figure 4.8 Histogram of blind guessing in addition to the actual perfor-mance of students score. . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 4.9 Question 7 related to concept C2 . . . . . . . . . . . . . . . . . . 67
Figure 4.10 Histogram of blind guessing in addition to the actual perfor-mance of students score. . . . . . . . . . . . . . . . . . . . . . . . 67
xi
Chapter 1
Introduction
1.1 Motivation of Uncertainty Estimation
The recent resurgence of neural network trained with backpropagation has established
state-of-the-art results in a wide range of domains. However, backpropagation-based
neural networks (NN) offers no estimation of uncertainty and thus lacks important
confidence bounds for practical applications. In contrast, probabilistic view of neural
network can provide uncertainty estimates that can be used for decision making. The
difference between a conventional multiple perception network (MLP) and a Bayesian
MLP network is depicted in Figure 1.1a and 1.1b.
(a) A common MLP network. (b) A Bayesian MLP network.
Figure 1.1: Comparasion between a common MLP network and a Bayesian MLPnetwork.
The importance of estimating uncertainty information can be illustrated in Fig-
ure 1.2. While the value predicted at x = 1.2 is high and might trigger a certain
decision, one also needs to consider the uncertainty associated with that value, which
in this case is also high. Uncertainty information is important in safety-critical ap-
1
plications where predictions from neural networks act on acceleration, braking and
steering systems. In such situations instead of driving the actuators, the automated
system may decide to pass the control to the human driver because of inherent large
uncertainties in the model. A good example to illustrate the importance of un-
certainty quantification is related to autonomous driving. With traditional neural
networks, we can make an inference whether to brake or not for a self-driving car.
Nonetheless, a probabilistic perspective can not only provide the inference but also
the confidence level to lead to the decision. Similar to this example, the uncertainty
estimate is also valuable for other real-world applications such as image classification,
speech recognition, and medical diagnostics.
Figure 1.2: Uncertainty information associated with a non-linear function: y = x +0.3sin(2π(x+ ε)) + 0.3sin(4π(x+ ε)) + ε
Uncertainty can be generally categorized as data uncertainty and model uncer-
tainty, while the latter can be further introduced as model parameters uncertainty
and model structure uncertainty. Based upon these two types of uncertainty, we can
deduce the predictive uncertainty. Bayesian framework, described in the following
section, serves as a guidance to infer the predictive uncertainty.
2
1.2 Bayesian Framework
In an attempt to model a system with experimental data, we might be uncertain what
the proper parameters or model structure are. A mature framework of dealing with
such uncertainty is developed and termed as Bayesian modeling. Bayesian modeling
offers us the theory and guidance to develop tools to model the real-world systems.
In the followings, a formal introduction of Bayesian modeling is provided.
Assuming there are training data (x1, y1), (x2, y2), . . . , (xn, yn), our goal is to make
an inference of y∗ with a new observation x∗. A probabilistic way of modeling it is
to define a predictive distribution p(y|x) and use this function to make an inference
given a new data point. A frequentist way is to optimize a suitably chosen function,
such as squared loss function, to get the optimized parameter vectors. In contrast,
the Bayesian perspective is to define a prior on the parameters before observing the
data, and make an inference on the new data point by integrating over the parameters
space.
Let us assume the parameters w follow a distribution p(w). We further define a
likelihood function p(y|x,w), which represents the likelihood of generating output y
with specific parameters setting w. With the prior and likelihood functions, we can
obtain the posterior distribution:
p(w|D) = p(D|w)p(w)p(D) . (1.1)
where D is the training data, and p(D) is the model evidence
p(D) =∫p(D|w)p(w)dw. (1.2)
To make a prediction to a new data point x∗, we have
p(y∗|x∗, D) =∫p(y∗|x∗, w)p(w|D)dw. (1.3)
The model evidence is computed by marginalizing over the parameters, and it is thus
also termed as marginal likelihood. Analytical solutions of the model evidence exist
3
if the prior is conjugate for the likelihood function. However, for complex models
such as Bayesian Neural Networks (BNNs), we need to use approximate approaches
to get the model evidence. Approximate methods include variational inference (VI),
expectation propagation (EP), Kalman filter (KF), and Markov chain Monte Carlo
(MCMC).
1.3 Bayesian Neural Networks
Point estimation of neural networks are associated with many disadvantages, includ-
ing but not limited to, the lack of uncertainty estimation, tendency of overfitting
small data, and tuning of many hyper-parameters.
In backpropagation NNs, the lack of uncertainty information is due to the weights
that are treated as point estimates tuned with gradient-descent methods. By contrast,
Bayesian neural networks (BNNs) [96, 83] can cope with some of these problems by
assigning a prior distribution on the parameters [33, 45, 40, 11]. Nonetheless, the
Bayesian inference in BNNs is intractable, and researchers have developed various
approximate methods to estimate the uncertainty of the weights.
Adding prior distributions on the weights acts as regularization, BNNs can avoid
over-fitting and offer uncertainty estimates associated with predictions. They were
firstly introduced in the 1990s [83, 96] and recently re-draw many research interests
from the community [72]. Given a training example (xn, yn) , xn is a D dimensional
vector xn ∈ <D, and yn is a scalar target variable yn ∈ <. For a M -layer feedforward
neural network, Si denotes the number of hidden neurons in layer i, andW = {W i}Mi=1
represents Si × (Si−1 + 1) weights between two consecutive layers.
yn = f(xn,W ) + εn, where εn ∼ N(0, γ−1) (1.4)
where f represents the network output and W is a set of weights between two consec-
utive layers.
4
y denotes target vectors y1, y2, ..., yn ∈ <N , and X denotes feature vectors
x1, x2, ..., xn ∈ <N ×D.
We first define the following prior distribution on the network weights with pre-
cision λ
p(w|λ) =M∏m=1
Sm∏i=1
Sm−1+1∏j=1
N(wi,j,m|µi,j,m, λ−1i,j,m). (1.5)
Also we have the likelihood function with precision γ as follows
p(y|w,X, γ) =N∏n=1
N(yn|f(xn, w), γ−1). (1.6)
Based upon (1.5) and (1.6), we get the following posterior distribution
p(w|D,λ, γ) ∝ p(w|λ)p(y|w,X, γ). (1.7)
However, the posterior distribution is non-linear and intractable in most cases,
we need to solve it with approximate techniques. In the past, approximate methods
including Laplace approximation [28] and Hybrid Monte Carlo [96] have been used
to tackle these problems, but at a cost on the accuracy or computation efficiency.
Hinton and Van Camp [50], proposed a pioneering work that constrained the amount
of information in the weights with respect to the expected squared of error in the
network. For a linear output network with one hidden layer, they derived the analyt-
ical objective with the minimum description length principle which is actually a VI
interpretation. Nonetheless, the derivation of (1.8) is challenging for most NN archi-
tecture since the expected log likelihood is intractable. Thus the analytic solution is
motived for small dataset and impractical for modern big data scenarios.
q(W |β) =M∏m=1
Sm∏i=1
Sm−1+1∏j=1
q(wi,j,m|µi,j,m, β−1i,j,m) =
∏i,j,m
N(wi,j,m;µi,j,m, β−1i,j,m). (1.8)
1.4 Variational Inference
Built upon Hinton and Van Campś [50] work, Graves [40] adopted a stochastic varia-
tional method instead of computing the analytical solution to estimate the expected
5
log likelihood. As explained in the paper, this approach searched the variational
posterior distribution of the weights q(w|β) instead of the intractable distribution
p(w|D,α), and then drew samples from it to calculate the estimate. By drawing
samples from the variational distribution, the method is amenable to solving big
data and complex network architectures.
The variational posterior can be obtained by minimizing the Kullback-Leibler
(KL) divergence between a simpler form that approximates the distribution (e.g.
Gaussian distribution) and the true posterior on the weights.
(α, β)∗ = arg min(α,β)
KL[q(w|β)||p(w|D,α)] (1.9)
= arg min(α,β)
∫q(w|β) log q(w|β)
p(w|D,α)dw (1.10)
∝ arg min(α,β)
∫q(w|β) log q(w|β)
p(w|α)p(D|w)dw (1.11)
= arg min(α,β)
∫q(w|β) log q(w|β)
p(w|α)dw −∫q(w|β) log p(D|w)dw (1.12)
= arg min(α,β)
KL[q(w|β)||p(w|α)]− Eq(w|β)[log p(D|w)] (1.13)
The resulting cost function (1.13) is termed as variational free energy [97, 154]
or the expected lower bound [56]. It is composed of a model complexity cost and
likelihood cost [11] that aims at finding a simpler form of variational posterior that
can maximize the likelihood.
In terms of the approximate posterior distribution, we can make an inference for
a new observation x∗ as follows
p(y∗|x∗, D) =∫p(y∗|x∗, w)q(w|β)dw.
However, Graves [40] simply used a diagonal covariance matrices in his work and
ignored the correlations over the weights, thus it yielded poor results in practice [45].
To improve the prediction capability of VI on BNNs, Blundell et al. [11] applied
re-parameterization tricks to approximate the cost function with unbiased Monte
Carlo gradients. Furthermore, they assigned a mixture of Gaussian prior on weights
6
that resembled a spike-and-slab prior and optimized the prior variance with cross
validation. As shown in the paper, the authors claimed that their VI approach was
matchable to the standard dropout method [49] in spite of doubling the size of the
parameters.
In parallel, Gal and Ghahramani [34] developed theoretical proof and demon-
strated that dropout could be considered as a BNN. In their work, they pointed out
that Monte Carlo (MC) dropout, regularized the deep models in the same way as the
VI techniques did.
Dropout was originally introduced in [49], and further developed by [131]. In
their work, the authors [131] randomly dropped some network units with dropout
rate p during the training phase, and used a single thinned network with expected
weights (1− p)W for predictions in the testing phase.
The approximate distribution q(wm|β) in MC dropout for layer m is defined as
Wm = Qm · Vm (1.14)
Vm = diag([vm,i]Smi=1) (1.15)
Where
vm,i ∼ Bernoulli(pm) for m = 1, ...,M, i = 1, ..., Sm−1 (1.16)
Here pm is the dropout rate for each layer, and Qm is the variational parameters
to be optimized. vm,i is a random binary variable that represents whether a unit i in
layer m− 1 dropped out as an input to the mth layer.
The second term in (1.13) is intractable, and an unbiased estimator is obtained
7
with Monte Carolo integration over w
Eq(w|β)[log p(D|w)] =∫
log p(D|w)q(w|β)dw (1.17)
=∫ N∑
i=1log p(yi|xi, w)q(w|β)dw (1.18)
=N∑i=1
∫log p(yi|xi, w)q(w|β)dw (1.19)
= 1J
J∑j=1
N∑i=1
logP (yi|xi, wj) (1.20)
Furthermore, the approximate predictive distribution is computed as
q(y∗|x∗, D) =∫p(y∗|x∗, w)q(w|α)dw (1.21)
And the first two moments are estimated as
Eq(y∗|x∗)(y∗) ≈1T
T∑t=1
f(x∗, wt), where wt ∼ q(w|β) (1.22)
Eq(y∗|x∗)((y∗)T (y∗)) ≈ τ−1ID + 1T
T∑t=1
f(x∗, wt)T f(x∗, wt) (1.23)
V arq(y∗|x∗)(y∗) ≈ τ−1ID + 1T
T∑t=1
f(x∗, wt)T f(x∗, wt) (1.24)
− Eq(y∗|x∗)(y∗)T Eq(y∗|x∗)(y∗) (1.25)
Variational inference was also proved to be theoretically equivalent to the dropout
method that was widely used as a regularization technique for deep NNs [34]. Fur-
thermore, the authors developed tools to extract model uncertainty from dropout
NNs. However, variational estimation typically underestimate the uncertainty of the
posterior because they ignore the posterior correlation between the weights.
1.5 Expectation Propagation
In addition to variational inference, expectation propagation were applied to estimate
the uncertainty of NNs. Hernández-Lobato and Adams [45] developed a scalable al-
gorithm which propagated probabilities forward through the network to obtain the
8
marginal likelihood and then obtained the gradients of the marginal probability in
the backward step. Similarly, Soudry et al. [130], described an expectation propaga-
tion algorithm aiming at approximating the posterior of the weights with factorized
distribution in an online setting.
In addition to VI, several studies explore expectation propagation (EP) to ap-
proximate p(w|D). In a recent study, Hernandez-Lobato [45] proposed a two-step
backpropagation alike algorithm termed as probabilistic backpropagation (PBP) for
NN update. In the first step, the algorithm computes the logarithm of the marginal
probability log(p(y|x)) as the objective function. In the second step, PBP propagates
back the gradients of the objective function with regard to the means and variances
of the approximate posterior through each layer using chain rule. The main idea of
PBP is to approximate p(w|D) with (1.27), while (1.27) has the same form as the
prior (1.26). The parameters in the approximate posterior can be updated with EP
which minimizes the Kullback-Leibler (KL) divergence between p(w|D) and q(w).
p(w) = N(w|µ, σ2) (1.26)
q(w) = N(w|µ′ , σ2′) (1.27)
Specific, the approximate posterior can be factorized into
q(W, γ, λ) = [M∏m=1
Sm∏i=1
Sm−1+1∏j=1
q(wi,j,m)]× q(γ)× q(λ) (1.28)
where wi,j,m is the weight from the jth node on the m− 1th layer to the ith node on
the mth layer, q(w) is a Gaussian distribution with µ and σ2, γ is the noise precision,
λ is the prior precision, and both q(γ) and q(λ) are gamma distributions.
Nevertheless, this approach is still in active development [13]. Currently, the PBP
algorithm simply factorized the posterior with simple Gaussian distributions, aiming
at making interpolations between dispersed modes instead of seeking modes.
9
1.6 (Non)linear Kalman Filters
Kalman filter is a common approach for parameter estimation in linear dynamic
systems. There are a number of work conducted to estimate the parameters of NNs
with EKFs [118, 128], which perform parameter estimation for non-linear dynamical
systems. However, EKFs were criticized for larger errors of the posterior mean and
covariance introduced by the first-order linearization of the nonlinear system [148].
Instead, UKFs were explored to estimate the parameters of non-linear models and
were claimed to have better performance to capture the higher order accuracy for the
posterior mean and covariance [62].
UKF can approximate the state distribution with carefully selected determinis-
tic samples from a Gaussian random variable, and yield superior estimates of the
posterior mean and covariance to the 3rd order of Taylor expansion [148]. More im-
pressively, UKF consumes the same time complexity as EKF. The key component
of UKF is termed as unscented transformation (UT), which is used to compute the
statistics of a random variable that propagates through a non-linear models. For a
m-dimension random variable x with mean x and covariance Px, the mean and co-
variance of the target y = u(x) can be estimated with the following procedure [148,
141].
The main step in UT is to sample 2m+ 1 sigma vectors Xk = (Xjk, w
jk)|j=0,1,...,2m
10
with a specific scheme
Xj =
x+ (
√α2(m+ k)Px)j for j = 1, ...,m
x− (√α2(m+ k)Px)j for j = m+ 1, ..., 2m
(1.29)
wµj =
λ
α2(m+k) for j = 0
12α2(m+k) for j = 1, ..., 2m
(1.30)
wcj =
λ
α2(m+k) + (1− α2 + β) for j = 0
12α2(m+k) for j = 1, ..., 2m
(1.31)
where λ and κ are scaling parameters, α determines the divergence of the selected
points around x, (√α2(m+ k)Px)j is the jth row of the matrix square root.
By propagating xj through the non-linear function, the next step is to estimate
the target yj. After obtaining the samples, it is straightforward to get the weighted
sample mean and covariance of the posterior sigma points.
y ≈2m∑j=0
wµj yj (1.32)
Py ≈2m∑j=0
wcj(yj − y)(yj − y)T (1.33)
In terms of the UT, UKF defines the state random variable as an augmented state
xak = [xTk vTk nTk ]T , and chooses the sigma points of the variable to compute the
sigma matrix Xak .
1.7 Structure of the dissertation
This dissertation is organized as follows:
(i) Chapter 2 describes the proposed algorithms for training neural networks and
introduces two training schemes for updating the noise covariance. In this
chapter, we explored training a fully-connected network with scheme 1 on a
11
synthetic dataset. Then we applied both scheme 1 and scheme 2 on training
fully-connected networks, and compared them with state-of-the-art models of
approximating Bayesian neural networks on ten UCI datasets. Furthermore, we
trained both fully-connected and LSTM networks and applied them on modeling
a natural language dataset.
(ii) In Chapter 3, the work is further extended for training deep LSTM networks
and demonstrates its effectiveness in an outlier detection task. Specifically, we
formalized sub-event detection as an outlier detection problem, then we applied
the proposed algorithm to model the predictive distributions of five real datasets
on Twitter network.
(iii) Chapter 4 introduces a framework to estimate student knowledge with Bayesian
network models. Additionally, the proposed framework is also capable of iden-
tifying misconceptions with a modified EM algorithm for parameter learning,
and evaluating question design with a proposed index.
12
Chapter 2
Uncertainty Estimation of Bayesian Neural
Networks
2.1 Approximation in Bayesian Neural Networks
EnKFs are developed by geoscientists for data assimilation tasks dealing with mil-
lions of features [31]. By adopting Monte Carlo samplings, the filters can generate
samples representing the means and covariances of the dynamic systems. As scalable
algorithms, EnKFs can save computation and storage of inverting large matrices, and
demonstrate effectiveness of handing non-Gaussian and nonlinear systems [65].
As introduced in Chapter 1, both EKFs and UKFs are applied to estimate the
state or parameters of nonlinear systems. Compared to these Kalman filter variants,
researchers find much better performance of EnKFs in parameter estimation [52].
In contemporary deep learning research, the dimensionality of the parameter space
rapidly grows, which can be elegantly solved by using EnKFs. To date, there has
been very little attention paid to applying EnKFs for parameter estimation in neural
networks. In an attempt to introduce the EnKFs to the deep learning community,
we evaluate the performance of using an EnKF model for parameter estimation of
neural networks and apply it for general regression tasks.
13
2.2 Ensemble Kalman Filter for Bayesian Neural Networks
2.2.1 Ensemble Kalman Filter with Stochastic Update
A linear Kalman Filter model can be described as
xk = Fxk−1 + εxk−1 (2.1)
dk = Dxk + εdk−1. (2.2)
The above functions can also be interpreted as state space models, where xk is
the hidden state at time k, and dk is the observation at time k. For linear Gaussian
models, exact solutions can be derived to update the posterior mean and posterior
covariance. However, when the model is non-Gaussian or nonlinear, or even the
state dimensions or measurement dimensions are high, we need to seek approximate
solutions.
By sampling the filtering distribution, we can obtain a set of samples and propa-
gate them through the forecast step as well as the update step to yield approximate
posterior estimates. In this way, we can reduce the space and time required for
computing the error covariance and innovation covariance.
Suppose we sample from the filtering distribution p(xk−1|d1:k−1) and measurement
forecast distribution p(dk|xk−1) and get an ensemble of data pairs
(x(1)k−1|k−1, d
(1)k ), ..., (x(n)
k−1|k−1, d(n)k ).
In the EnKF, we apply the forecast step (2.1) on each sample x(i)k−1|k−1 and get
x(i)k|k−1 = Fx
(i)k−1|k−1 + ε
x(i)k−1, i = 1, ..., N (2.3)
εx(i)k−1 ∼ N(0,Σx), (2.4)
Then we update each predicted state x(i)k|k−1 to get a posterior state estimate x(i)
k|k with
14
a perturbed measurement d(i)k|k−1
x(i)k|k = x
(i)k|k−1 + Lk(dk − d(i)
k|k−1), i = 1, ..., N (2.5)
d(i)k|k−1 = Dx
(i)k|k−1 + ε
d(i)k−1 (2.6)
εd(i)k−1 ∼ N(0,Σd). (2.7)
And we can estimate the Kalman gain L with sample covariance to get an approximate
L
L = Pk|k−1DT (DPk|k−1D
T + Σd)−1 (2.8)
Pk|k−1 ≈1
N − 1
M∑i=1
(d(i)k − dk)(d
(i)k − dk)T . (2.9)
By using the Monte Carlo schema and adding the perturbed noise, the EnKF is
claimed to capture the non-Gaussianity and nonlinearity [65].
2.2.2 Bayesian Neural Networks Update with EnKF
A neural network can be represented as
yk = f(xk, w) + ε, k = 1, 2, ...,M (2.10)
where (xk, yk) is the training data, w is the parameter vector (weights), f is a nonlinear
neural network mapping, and ε is the noise which compensates for the difference
between outputs of neural network and real target values.
Let D indicate the training data, D = {(xk, yk)}k=1,2,...,M . Given a new input x∗,
we are interested in the predictive distribution
p(y∗|D, x∗) =∫p(y∗|x∗, w)p(w|D)dw . (2.11)
Here p(w|D) is the conditional distribution of w given the training data D, which
can be obtained via Bayes’ rule:
p(w|D) = p(D|w)p(w)p(D) . (2.12)
15
p(w) is the prior distribution of the weights w, p(D) is the evidence, and p(D|w) is
the likelihood which can be obtained though Eq. (2.10).
To evaluate the integral in Eq. (2.11), we need to find a solution for Eq. (2.12).
Since the neural network is a nonlinear mapping function, a common way to solve
Eq. (2.12) is using Monte Carlo methods. Suppose N samples {wj}j=1,...,N of p(w|D)
are available, and δ(·) represents the dirac function. Then p(y∗|D, x∗) can be obtained
as follows:
p(y∗|D, x∗) = 1N
N∑j=1
p(y∗|x∗, wj) (2.13)
= 1N
N∑j=1
δ(y∗ − f(x∗, wj)) (2.14)
Let {y∗j}j=1,...,N denotes samples of p(y∗|D, x∗), then we have y∗j = f(x∗, wj)
In this paper, we use EnKF to estimate the uncertainty of the weights w. The
corresponding dynamic system is shown in Eq. (2.15) and Eq. (2.16).
wk = wk−1 (2.15)
yk = f(xk, wk) + ε (2.16)
w has a prior distribution p(w) = N (w; 0, σ2wIp) and ε is a white noise with distribu-
tion p(ε) = N (ε; 0, σ2ε Iq). Here p and q represent the dimensionality of features and
targets, respectively.
Suppose the batch size is s and the number of weights is l. Since weights w is the
quantity that needs to be estimated, we augment the output of neural network with
w to form an augmented state variable Uk = [Fk, w]. Here Fk includes all the outputs
of the kth batch {f(xk,i, w)}i=1,...,s. The matching measurement model is given by
16
Eq. (2.17).
Yk = HUk + ε (2.17)
ε ∼ N (0, σ2ε Isq)
H = [Isq, 0sq×l]
Yk = [yTk,1, yTk,2, ..., yTk,s]T
Uk = [f(xk,1, w)T , f(xk,2, w)T , ..., f(xk,s, w)T , w1, w2, ...wl]T
Maximizing Hyper-parameters with Maximum Likelihood Estimation
Before inference, two hyperparameters σ2w and σ2
ε need to be determined. A common
way is to maximize evidence p(D|σ2w, σ
2ε ) with respect to σ2
w and σ2ε .
p(D|σ2w, σ
2ε ) =
∫p(D|w, σ2
ε )p(w|σ2w)dw (2.18)
This has been successfully applied to Bayesian linear regression. However, for nonlin-
ear models, it is difficult to evaluate the integral above. Here, we fix σ2w and estimate
σ2ε by maximizing p(D|σ2
ε ). Assume N samples are drawn from p(w), M data points
are used for training network and the dimensionality of D is d. Under the assumption
that the data points are generated independently, we have
p(D|σ2ε ) =
M∏j=1
p(yj|xj, σ2ε ). (2.19)
For simplicity, we omit xj in p(yj|xj, σ2ε ). Each p(yj|σ2
ε ) is calculated as follows
p(yj|σ2ε ) =
∫p(yj|w, σ2
ε )p(w)dw
= 1N
N∑i=1
p(yj|wi, σ2ε )
= 1N
N∑i=1N (yj; fi, σ2
ε Iq)
= E[N (yj; fi, σ2
ε Iq)]
(2.20)
17
Here fi = f(x,wi). Thus the log-evidence is given by
lnp(D|σ2ε ) = ln
M∏j=1
p(yj|σ2ε )
=M∑j=1
lnp(yj|σ2ε )
=M∑j=1
lnE[N (yj; fi, σ2
ε Iq)]
Since log is a concave function, according to Jensen’s inequality, we have
lnE[N (yj; fi, σ2
ε Iq)]≥ E
[lnN (yj; fi, σ2
ε Iq)]. (2.21)
Thus we obtain the lower bound of lnp(D|σ2ε ):
lnp(D|σ2ε ) ≥
M∑j=1
E[lnN (yj; fi, σ2
ε Iq)]
(2.22)
= 1N
M∑j=1
N∑i=1
lnN (yj; fi, σ2ε Iq)
= 1N
M∑j=1
N∑i=1
(−q2 ln2π − q
2 lnσ2ε
− 12σ2
ε
(yj − fi)T (yj − fi))
=− qM
2 ln2π − qM
2 lnσ2ε
− 12Nσ2
ε
M∑j=1
N∑i=1
(yj − fi)T (yj − fi)
Maximizing the lower bound of log-evidence with respect to σ2ε we obtain
σ2ε = 1
qMN
M∑j=1
N∑i=1
(yj − fi)T (yj − fi). (2.23)
Once the variance of the noise is determined, the EnKF algorithm presented in
the previous step is applied to obtain samples from the posterior distribution of the
weights.
18
Maximizing Hyper-parameters with Full Bayes
Another way to estimate the model error is to use full Bayes approach. Specifically,
we assume an inverse gamma distribution for σ2 and the following likelihood function
p(σ2ε ) = IG(α, β) (2.24)
p(y|w,X, σ2ε ) =
N∏n=1
N(yn|f(xn, w), σ2ε ). (2.25)
Furthermore, we can get the predictive distribution
p(y∗|x∗, D) =∫p(y∗|x∗, w, σ2
ε )p(w, σ2ε |D)dwdσ2
ε . (2.26)
And the posterior distribution for w and σ2ε can be estimated with Bayes’ rule:
p(w, σ2ε |D) = p(y|w,X, σ2
ε )p(w)p(σ2ε )
p(y|X) . (2.27)
However, we need to approximate p(w, σ2ε |D) and p(y|X) since they are intractable
to compute. In this paper, we propose a novel method to estimate these quantities
with EnKF.
In addition to the weight samples, we also sample N points σ2ε
(1), ..., σ2
ε(N) from
(2.24). For each w(j) and σ2ε
(j), we can forward propagate a data point x to get an
estimate y(j)
y(j) = f(x,w(j)) + ε(j), ε(j) ∼ N(0, σ2ε
(j)). (2.28)
Then the algorithm 1 works in two phases, including a training phase and a testing
phase. In the training phase, we augment y, w, and σ2ε into a new state S, and update
S with the mini-batch data and the EnKF algorithm. In this way, we can train the
neural network and estimate the optimal w and σ2ε . Then in the testing phase, we
use the optimized parameters to make an inference on each mini-batch testing data.
19
Similar to (2.17), we can get the following equations for each mini-batch data:
U = HS + εp, εp ∼ N(0, rζ) (2.29)
H = [Inq, 0nq×(l+1)]
S = [y1, ..., yn, w1, ..., wl, σ2ε ]
U = [U1, ..., Un]T .
where n is the batch size and l is the number of weights, rζ is a small perturbed value,
and it can also be described as
U1
U2
...
Un
=
1 0 · · · 0 0 · · · 0
0 1 · · · 0 0 · · · 0... . . . ...
0 0 · · · 0 0 · · · 1
×
y1
...
yn
w1
...
wl
σ2ε
+ εp. (2.30)
Algorithm 1 enkf_bnnInput: features x, targets yOutput: updated weights wDraw N samples for w from prior p(0, σ2
wIl)Draw N samples for σ2
ε from an inverse Gamma distribution IG(α, β).for Each mini-batch input xk do
for Each sample w(j) and σ2(j)ε do
y(j)k = f(w(j), xk) + ε(j), ε(j) ∼ N(0, σ2(j)
ε )end forCreate the new state S = [y, w, logσ2
ε ]Update S ′ = enkf(S,H, rζ , yobs)Get updated w′ and logσ2′
ε from S′
Update σ2′ε = exp(logσ2′
ε ) return w′ and σ2′ε .
end for
20
2.3 Experiments
2.3.1 Regression Analysis on UCI datasets
To evaluate our algorithm, we run tests on 10 UCI datasets recommended by
Hernández-Lobato and Adams [45] to evaluate PBP. The datasets were later used
by Gal and Gharamani [34] and Lakshminarayanan et al. [72] for their approximate
algorithms. Initially, we applied the same protocol to get the evaluation metrics for
the comparison. That is, we built a feed-forward network with 1 hidden layer, 100
hidden nodes for the larger protein and year prediction MSD datasets, and 50 hidden
nodes for the smaller datasets. We randomly permuted the data, and use 90% for
training and 10% for testing. We ran 1 epoch for the Year Prediction MSD dataset,
5 epoches for the protein dataset, and 20 epoches for the rest datasets. Additionally,
we set σ2w, α, β, rζ , N , n to 0.02, 50, 1, 0.01, 200, and 32, respectively. Then we get
samples of w and σ2ε from a normal distribution parameterized by N(0, 0.02Il) and
an inverse Gamma distribution IG(50, 1).
Table 2.1: Average test RMSE of the proposed algorithm on different layers andensemble size.
Dataset N Q Avg. Test RMSE and Std. ErrorsVI PBP MC Dropout Deep Ensembles EnKF-200-1 EnKF-1000-1 EnKF-1000-5
Boston Housing 506 13 4.32±0.29 3.01±0.18 2.97±0.85 3.28±1.00 3.92±0.92 3.34±0.17 2.81±0.51Concrete Strength 1,030 8 7.19±0.12 5.67±0.09 5.23±0.53 6.03±0.58 5.13±0.69 5.21±0.43 5.22±0.49Energy Efficiency 768 8 2.65±0.08 1.80±0.05 1.66±0.19 2.09±0.29 2.1±0.31 1.88±0.19 1.75±0.28
Kin8nm 8,192 8 0.10±0.00 0.10±0.00 0.10±0.00 0.09±0.00 0.11±0.01 0.10±0.00 0.09±0.00Naval Propulsion 11,934 16 0.01±0.00 0.01±0.00 0.01±0.00 0.00±0.00 0.00±0.00 0.00±0.00 0.00±0.00
Power Plant 9,568 4 4.33±0.04 4.12±0.03 4.02±0.18 4.11±0.17 4.16±0.20 4.11±0.14 4.01±0.23Protein Structure 45,730 9 4.84±0.03 4.73±0.01 4.36±0.04 4.71±0.06 4.75±0.03 4.51±0.11 4.53±0.10Wine Quality Red 1,599 11 0.65±0.01 0.64±0.01 0.62±0.04 0.64±0.04 0.69±0.06 0.63±0.05 0.61±0.08
Yacht Hydrodynamics 308 6 6.89±0.67 1.02±0.05 1.11±0.38 1.58±0.48 4.61±1.28 2.95±0.17 3.01±0.49Year Prediction MSD 515,345 90 9.03±NA 8.88±NA 8.85±NA 8.89±NA 10.01±NA 8.91±NA 8.88±NA
Furthermore, we also explored a deep architecture with 5 hidden layers and in-
creased the ensemble size from 200 to 1000. As shown in Table 2.1 we can observe our
algorithm out-performs the state-of-the-art models in 5 of 10 datasets in terms of av-
erage test RMSE. With respect to RMSE, we find the following statistical differences
as shown in Table 2.3 and 2.4: (1) EnKF-1000 is statistically better than VI for 5
datasets (Boston Housing, Concrete Strength, Energy, Efficiency, Protein Structure,
21
Table 2.2: Average test LL of the proposed algorithm on different layers and ensemblesize.
Dataset N Q Avg. Test LL and Std. ErrorsVI PBP MC Dropout Deep Ensembles EnKF-200-1 EnKF-1000-1 EnKF-1000-5
Boston Housing 506 13 -2.90±0.07 -2.57±0.09 -2.46±0.25 -2.41±0.25 -2.88±0.50 -2.53±0.05 -2.41±0.31Concrete Strength 1,030 8 -3.39±0.02 -3.16±0.02 -3.04±0.09 -3.06±0.18 -3.20±0.42 -3.13±0.19 -3.01±0.15Energy Efficiency 768 8 -2.39±0.03 -2.04±0.02 -1.99±0.09 -1.38±0.22 -2.42±0.14 -2.26±0.29 -2.29±0.12
Kin8nm 8,192 8 0.90±0.01 0.90±0.01 0.95±0.03 1.20±0.02 1.09±0.52 1.24±0.48 1.12±0.37Naval Propulsion 11,934 16 3.73±0.12 3.73±0.01 3.80±0.05 5.63±0.05 5.20±0.26 5.48±0.13 5.81±0.13
Power Plant 9,568 4 -2.89±0.01 -2.84±0.01 -2.80±0.05 -2.79±0.04 -2.97±0.15 -2.81±0.05 -2.77±0.06Protein Structure 45,730 9 -2.99±0.01 -2.97±0.00 -2.89±0.01 -2.83±0.02 -3.13±0.38 -3.01±0.31 -3.02±0.25Wine Quality Red 1,599 11 -0.98±0.01 -0.97±0.01 -0.93±0.06 -0.94±0.12 -0.97±0.09 -0.91±0.11 -0.92±0.12
Yacht Hydrodynamics 308 6 -3.43±0.16 -1.63±0.02 -1.55±0.12 -0.18±0.21 -2.94±0.26 -1.52±0.34 -1.48±0.28Year Prediction MSD 515,345 90 -3.62±NA -3.60±NA -3.59±NA -3.35±NA -4.82±NA -4.21±NA -4.23±NA
and Yacht Hydrodynamics), and (2) MC-dropout, Deep Ensembles, and PBP are sta-
tistically better than EnKF only for the Yacht Hydrodynamics dataset. According to
Table 2.2, our algorithm yields the best average test LL in 6 of 10 datasets. With re-
spect to LL, we find the following statistical differences as shown in Table 2.5 and 2.6:
(1) EnKF-1000 is statistically better than VI for 3 datasets (Boston Housing, Naval
Propulsion, Yacht Hydrodynamics) and it is statistically better than MC-dropout
and PBP for the Naval Propulsion dataset, and (2) Deep Ensembles is statistically
better than EnKF-1000 for Energy Efficiency and Yacht Thermodynamics.
Besides these, we have not found any other statistical differences in performance
which suggests that overall EnKF can performs better than VI however has similar
performance with the state-of-the-art in capturing prediction uncertainty. This sug-
gests that EnKF can be used as an alternative to training deep neural networks and
capturing prediction uncertainty using the Bayesian formalism without the use of
gradient information in obtaining the posterior distribution over the weights as com-
pared with MC-dropout, Deep Ensembles, and PBP. All that is required is simulating
the neural network N times, where N is the number of ensembles, and readjusting the
weights based on the Kalman update step. This has the potential to bring us closer
to real-time training of deep neural networks.
22
Table 2.3: Significance tests of the average test RMSE among VI, PBP and theproposed algorithm with different ensembles and layers.
RMSE VI PBPDataset EnKF-200-1 EnKF-1000-1 EnKF-1000-5 EnKF-200-1 EnKF-1000-1 EnKF-1000-5
Boston Housing -0.41 -2.92 -2.57 0.97 1.33 -0.37Concrete Strength -2.94 -4.44 -3.91 -0.78 -1.05 -0.90Energy Efficiency -1.72 -3.74 -3.09 0.96 0.41 -0.18
Kin8nm 1.00 - - 1.00 - -Naval Propulsion - - - - - -
Power Plant -0.83 -1.51 -1.37 0.20 -0.07 -0.47Protein Structure -2.12 -2.89 -2.97 0.63 -1.99 -1.99Wine Quality Red 0.66 -0.39 -0.50 0.82 -0.20 -0.37
Yacht Hydrodynamics -1.58 -5.70 -4.67 2.80 10.89 4.04Year Prediction MSD - - - - - -
Table 2.4: Significance tests of the average test RMSE among MC-dropout, DeepEnsembles, and the proposed algorithm with different ensembles and layers.
RMSE MC Dropout Deep EnsemblesDataset EnKF-200-1 EnKF-1000-1 EnKF-1000-5 EnKF-200-1 EnKF-1000-1 EnKF-1000-5
Boston Housing -0.76 -0.43 0.16 -0.47 -0.06 0.42Concrete Strength 0.11 0.03 0.01 1.00 1.14 1.07Energy Efficiency -1.21 -0.82 -0.27 -0.02 0.61 0.84
Kin8nm -1.00 - - -2.00 - -Naval Propulsion - - - - - -
Power Plant -0.52 -0.39 0.03 -0.19 0.00 0.35Protein Structure -7.80 -1.28 -1.58 -0.60 1.60 1.54Wine Quality Red -0.97 -0.16 0.11 -0.69 0.16 0.34
Yacht Hydrodynamics -2.62 -4.42 -3.06 -2.22 -2.69 -2.08Year Prediction MSD - - - - - -
Table 2.5: Significance tests of the average test LL among VI, PBP and the proposedalgorithm with different ensembles and layers.
LL VI PBPDataset EnKF-200-1 EnKF-1000-1 EnKF-1000-5 EnKF-200-1 EnKF-1000-1 EnKF-1000-5
Boston Housing 0.04 4.30 1.54 -0.61 0.39 0.50Concrete Strength 0.45 1.36 2.51 -0.10 0.16 0.99Energy Efficiency -0.21 0.45 0.81 -2.69 -0.76 -2.05
Kin8nm 0.37 0.71 0.59 0.37 0.71 0.59Naval Propulsion 5.13 9.89 11.76 5.65 13.42 15.95
Power Plant -0.53 1.57 1.97 -0.86 0.59 1.15Protein Structure -0.37 -0.06 -0.12 -0.42 -0.13 -0.20Wine Quality Red 0.11 0.63 0.50 0.00 0.54 0.42
Yacht Hydrodynamics 1.61 5.08 6.05 -5.02 0.32 0.53Year Prediction MSD - - - - - -
23
Table 2.6: Significance tests of the average test LL among MC-dropout, Deep En-sembles, and the proposed algorithm with different ensembles and layers.
LL MC Dropout Deep EnsemblesDataset EnKF-200-1 EnKF-1000-1 EnKF-1000-5 EnKF-200-1 EnKF-1000-1 EnKF-1000-5
Boston Housing 0.75 0.27 -0.13 0.84 0.47 0.00Concrete Strength 0.37 0.43 -0.17 0.31 0.27 -0.21Energy Efficiency 2.58 0.89 2.00 3.99 2.42 3.63
Kin8nm -0.27 -0.60 -0.46 0.21 -0.08 0.22Naval Propulsion -5.29 -12.06 -14.43 1.62 1.08 -1.29
Power Plant 1.08 0.14 -0.38 1.16 0.31 -0.28Protein Structure 0.63 0.39 0.52 0.79 0.58 0.76Wine Quality Red 0.37 -0.16 -0.07 0.20 -0.18 -0.12
Yacht Hydrodynamics 4.85 -0.08 -0.23 8.26 3.35 3.71Year Prediction MSD - - - - - -
2.3.2 Regression Analysis on Natural Language Data
To further evaluate the proposed algorithm, we ran it on a regression analysis on the
raw Cornell film review dataset [107] which contains 5000 movie reviews. Similar to
Gal and Gharamani [33], we experimented three different sets of network topologies.
The first set of networks include an embedding layer, a dense layer with 128 hidden
nodes, and a second dense layer with 1 output unit. The second set of networks
include an embedding layer, a LSTM layer with 128 hidden nodes, and a dense layer
with 1 output unit. The third set of networks replaces the LSTM layer with a GRU
layer and keeps the rest of the layers in the second set of networks. The batch size
is defined as 128, the weight decay is set to 1e-4, and the optimizer is set to Adam.
The padding sequence of the LSTM or GRU layer is set to 100, indicating we take
every 100 words as an input. To get a good estimate of the noise precision τ and the
dropout rate, we adopt a grid search approach and find the best precision value 1.0
and best dropout 0.1 from [0.01, 0.05, 0.1, 0.5, 1.0] and [0.05, 0.1, 0.25, 0.5, 0.75].
For the proposed algorithm, we also experimented with both a dense layer and a
LSTM layer. We kept the same setting for σ2w, α, β, rζ , N , n, and define the loop
size the same as the padding sequence for the LSTM layer.
The referenced models used for this task include a naive LSTM network, a vari-
ational network with embedding dropout, a variational LSTM (VLSTM) network
24
without embedding dropout, and a variational GRU (VGRU) network with embed-
ding dropout. Results are provided in Figure 2.1-2.3. According to Figure 2.1, our
proposed algorithms tend to avoid quickly overfitting to the training data which is
commonly shown for the dropout-based models. This is further illustrated by the
RMSE of the testing data, which indicates that EnKF LSTM performs best on the
new data. Furthermore, we observe that the proposed training scheme consistently
performs better on LSTM networks than the fully connected network, indicating other
advanced network architectures may be studied with this proposed scheme for bet-
ter inference. Furthermore, Figure 2.3 shows the proposed EnKF LSTM algorithm
achieves better log likelihood which provides some evidence that the model can be
used for better uncertainty estimate on approximate Bayesian neural networks.
Figure 2.1: Training RMSE for different models.
The parameters of several selected networks of the UCI datasets and the Cornel
film review dataset are described in Table 2.7. According to the table and above
results, we observe good regularization of the proposed algorithm on training large
25
Figure 2.2: Testing RMSE for different models.
Figure 2.3: Testing LL for different models.
collection of parameters.
26
Table 2.7: Number of parameters of selected networks.
Network # ParametersNaval Propulsion 1 Layer 850Naval Propulsion 5 Layers 10850
Year Prediction MSD 1 Layer 19100Year Prediction MSD 5 Layers 45500
Cornell Film Dense 25729Cornell Film LSTM 205441
2.4 Discussion
In this chapter, we proposed two schemes to optimize covariance of the measure-
ment noise. That is, maximizing model evidence (MME) of an approximate evidence
function in each update step during the training phase, and a full Bayes approach
that draws covariance samples from an inverse Gamma distribution. To validate the
proposed algorithms, we experimented with 10 UCI datasets and a natural language
dataset for regression analysis. According to average testing model errors and data
likelihood on UCI datasets, our proposed algorithms yielded better performance on
at least half of the datasets. With deeper layers and larger ensemble size, we also
observed slightly increasing performance. More importantly, the natural language
dataset results revealed that our proposed algorithm could compete with dropout-
based algorithms. In the next chapter, we applied the proposed MME scheme on
deep LSTM architecture, and evaluated it on five real-world datasets for sub-event
detection.
27
Chapter 3
Approximate Long Short Term Memory
Network for Outlier Detection
3.1 Sub-event Detection
Launched in 2006, Twitter serves as a microblogging platform in which people can
publish at most 140 character-long tweets or 10,000 character-long direct messages [91].
Due to its popularity, portability, and ease of use, Twitter quickly has grown into a
platform for people sharing daily life updates, chatting, and recording or spreading
news. As of September 2015, Twitter announced that there were more than 320 mil-
lion monthly active users worldwide1. In comparison to conventional news sources,
Twitter favors real-time content and breaking news, and it thus plays an important
role as a dynamic information source for individuals, companies, and organizations [6].
Since its establishment, Twitter has generously opened a portion of its data to
the public and has attracted extensive research in many areas [120, 71, 53]. In many
studies, the primary task is to identify the event-related tweets and then exploit
these tweets to build domain knowledge-related models for analysis. As defined by
Atefeh [6], events are generally considered as ”real-world occurrences that unfold
over space and time”. Compared to many data sources, tweets serve as a massive
and timely collection of facts and controversial opinions related to specific events [6].
Furthermore, events discussed on Twitter vary in both scale and category, while some
may reach to global audiences such as presidential elections [149], and others, such
1https://about.twitter.com/company
28
as wildfire [111, 106], appeal to local users. In general, studies of Twitter events can
be categorized into natural events [120], political events [143], social events [76], and
others [90].
Originated from the Topic Detection and Tracking (TDT) program, detection of
retrospective or new events has been addressed over two decades from a collection of
news stories [2]. Historically, there exist a number of systems developed to automat-
ically detect events from online news [8, 90, 78, 157].
An event usually consists of many sub-events, which can describe various facets
of it [111]. Furthermore, users tend to post new statuses of an event to keep track
of the dynamics of it. Within an event, some unexpected situations or results may
occur and surprise users, such as the bombing during the Boston Marathon and the
verdict moment of the Zimmerman trial. By building an intelligent system, we can
identify these sub-events to quickly respond to them, thus avoiding crisis situations
or maximizing marketing impact.
3.2 Literature Review
Traditionally, unsupervised models and supervised models have been widely applied
to detect events from news sources. Clustering methods have been a classic approach
for both Retrospective Event Detection (RED) and New Event Detection (NED) since
1990s. According to Allan et al. [2], they designed a single pass clustering method
with a threshold model to detect and track events from a collection of digitalized news
sources. Chen and Roy [22] also applied clustering approaches such as DBSCAN to
identify events for other user-generated content such as photos.
Additionally, supervised algorithms such as Naïve Bayes, SVM, and gradient
boosted decision trees, have been proposed for event detection. Becker et al. [9]
employed the NaÃŕve Bayes classifier to label the clustered tweets into event-tweets
or non-events tweets with derived temporal features, social features, topical features,
29
and Twitter-centric features, while the tweets are grouped using an incremental clus-
tering algorithm. Sakaki et al. [120] applied the support vector machines to classify
tweets into tweets related to target events or unrelated tweets with three key features.
Subsequently, they designed a spatial-temporal model to estimate the center of an
earthquake and forecast the trajectory of a hurricane using Kalman filtering and par-
ticle filtering. Popescu and Pennacchiotti [114] proposed a gradient boosted decision
tree based model integrated with a number of custom features to detect controversial
events from Twitter streams.
Furthermore, ensemble approaches are also employed to address the event detec-
tion problem. Sankaranarayanan et al. [122] first employed a classification scheme
to classify tweets into different groups, and then applied a clustering algorithm to
identify events.
As argued by Meladianos et al. [91], sub-event detection has been receiving more
and more attention from the event research community. At the time being, there are
a number of studies dealing with sub-event detection in an offline mode [161]. Zhao et
al. [159] adopted a simple statistical approach to detect sub-events during NFL games
when tweeting rate suddenly rose higher than a prior threshold. Chakrabarti and
Punera [20] developed a two-phase model with a modified Hidden Markov Model to
identify sub-events and then derived a summary of the tweets stream. However, their
approach has a severe deficiency because it fails to work properly under situations
when unseen event types are involved. Zubiaga et al. [161] compared two different
approaches for sub-event detection. The first approach measured recent tweeting
activities and identified sub-events if there was a sudden increase of the tweeting
rate by at least 1.7 compared to the previous period. The second approach relied on
all previous tweeting activities and detected sub-events if the tweeting rate within 60
seconds exceeded 90% of all previously tweeting rates. As claimed by the authors, the
latter outlier-based approach outperformed the first increase-based approach since it
30
neglected situations when there existed low tweeting rates preceded by even lower
rates [161].
Nichols et al. [101] provided both an online approach and an offline approach to
detect sub-events as well as summarizing important events moments by comparing
slopes of statuses updates with a specific slope threshold, which was defined as the
sum of the median and three times the standard deviation (median + 3*standard
deviation) in their experiment. Shen et al. [126] incorporated ”burstiness” and ”co-
hesiveness” properties of tweets into a participant-based sub-event detection frame-
work, and developed a mixture model tuned by EM which yielded the identification
of important moments of an event. Chierichetti et al. [23] proposed a simple lin-
ear classifier, particularly to say, a logistic regression classifier, to capture the new
sub-events with the exploration of the tweet and retweet rates as the features.
In this work, we propose a novel algorithm termed EnKF_LSTM, which trains
a LSTM network with ensemble Kalman filter, and apply it to an outlier detection
task. The goal is to model the evolution of the probability distribution of the ob-
served features at time t using Recurrent Neural Networks (RNNs).The probability
distribution is then used to determine whether an observation is an outlier.
RNNs are sequence-based networks designed for sequence data. These models
have been successfully applied in areas including speech recognition, image caption
generation, and machine translation [119, 64, 139]. Compared to feed-forward net-
works, RNNs can capture the information from all previous time steps and share the
same parameters across all steps. The term “recurrent” means that we can unfold
the network, and at each step the hidden layer performs the same task for different
inputs. A typical RNN architecture is illustrated in Figure 3.1.
Standard RNN is limited by the gradient-vanishing problem. To cope with this is-
sue, LSTM [51] networks have been developed to maintain the long term dependency
by specifying gating mechanism for the passing of information through the hidden
31
Figure 3.1: A RNN with an input layer (blue), a hidden layer (red), and an outputlayer (green). Units within the dotted regions are optional.
layers. Namely, memory blocks replace the traditional hidden units, and store infor-
mation in the cell variable. There are four components for each memory block, which
include a memory cell, an input gate, an output gate, and a forget gate.
We propose a Bayesian LSTM where the uncertainty in the weights is estimated
using EnKF. To mitigate the underestimation of error covariance due to various
sources such as model errors, nonlinearity, and limited ensemble size, in this study
we optimize the covariance inflation using maximum likelihood estimation. To assess
the proposed algorithm, we apply it to outlier detection in five real-world events
retrieved from the Twitter platform.
In the following sections we introduce the LSTM networks, window embedding,
and a framework for outlier detection. This will be followed by the subevent detection
application in Twitter streams, where the problem specifics and numerical results are
presented in the experiment section.
3.3 LSTM Networks
Given an observed sequence of features, y∗1 . . . y∗t , the goal is to contruct the predictive
probability density function (pdf) p(yt+1|y1:t = {y∗1 . . . y∗t }) using a Bayesian LSTM.
This pdf is then used to determine whether the next observation y∗t+1 is an outlier (∗
denotes the actual observation).
32
In LSTM each hidden unit in Figure 3.1 is replaced by a memory cell. Each
memory cell is composed of an input gate, a forget gate, an output gate, and an
internal state. The input data can be propagated through the network as described
by the following formula:
it = σ(Wixxt +Wimmt−1 + bi) (3.1)
ft = σ(Wfxxt +Wmfmt−1 + bf )
ct = ft � ct−1 + it � g(Wcxxt +Wcmmt−1 + bc)
ot = σ(Woxxt +Wommt−1 + bo)
mt = ot � h(ct)
yt = Wymmt + by
Here, σ is the logistic sigmoid function, i, f , o, and c represent the three gates
and the internal state, W is the weight matrix, b represents the bias term, m is the
cell output activation vector, � is element-wise product, g and h are tanh activation
functions, and x and y represent the input and the output vector, respectively.
To train a LSTM network with EnKF, the function f in (2.16) is the above gated
mechanism. Given an input xk, we can estimate yk by propagating the data through
the LSTM network and update the prediction with a model noise as the output. Then
we can use (2.17) to augment the state, and infer the posterior of weights using the
EnKF as shown in Algorithm 2.
3.4 Window Embeddings for Outlier Detection
3.4.1 Outlier Detection
The inferred distribution of the weights induces a predictive distribution for the next
observable p(yt+1|y1:t = {y∗1 . . . y∗t }). We can use this probability distribution to label
33
Figure 3.2: Gated mechanism of a LSTM cell as described by [39].
Algorithm 2 enkf_lstmInput: features x, targets yOutput: updated weights wDraw N samples for w from prior p(0, σwIl).Compute σ2
ε according to Eq. (2.23).for Each batch input xk do
for Each sample wi dofk,i = lstm(wi, xk)
end forAugment Fk with w to form Uk.Update Uk using EnKF, U ′k = enkf(Uk, H, σ2
ε , yk)Get posterior samples w from U
′k
end forreturn w
the actual observation y∗t+1 as outlier. Since each observation is a multi-dimensional
feature with dimension q, we can use the Chi-squared test of the squared Maha-
lanobis distance [151]. The main idea is to identify when a data point falls outside of
the multidimensional uncertainty even when the marginal uncertainties capture the
observational data.
The Mahalanobis distance between the actual observation y∗t+1 and the predicted
uncertainty approximated using a Gaussian distribution
p(yt+1|y1:t) ≈ N (yt+1;µt+1,Σt+1) is given by
md =√
(y∗t+1 − µt+1)TΣ−1t+1(yt+1 − µt+1) , (3.2)
34
Table 3.1: Basic information of the five events.Event Collection Starting Time Event Time Collection Ending Time Key Words/Hashtags
2013 Boston Marathon 04/12/2013 00:00:00 04/15/2013 14:49:00 04/18/2013 23:59:59 Marathon, #marathon2013 Superbowl 01/31/2013 00:00:00 02/03/2013 20:38:00 02/06/2014 23:59:59 Superbowl, giants, ravens, harbaugh
2013 OSCAR 02/21/2013 00:00:00 02/24/2013 20:30:00 02/27/2013 23:59:59 Oscar, #sethmacfarlane, #academyawards2013 NBA AllStar 02/14/2013 00:00:0 02/17/2013 20:30:00 02/20/2013 23:59:59 allstar, all-starZimmerman Trial 07/12/2013 11:30:00 07/13/2013 22:00:00 07/15/2013 11:30:00 trayvon, zimmerman
where the sample mean and covariance matrix are obtained using the propagated
ensemble memebers to the observable.
When the square of the Mahalanobis distance passes the following test, the ob-
servations is considered to not be an outlier and a plausible outcome of the model.
Here the degree of freedom used to obtain χ20.05 is q.
m2d ≤ χ2
0.05 (3.3)
3.4.2 Subevent Detection
An event is confined by space and time. Specifically, it consists of a set of subevents,
depicting different facets of an event [111]. As an event evolves, users usually post
new statuses to capture new states as subevents of the main issue [91]. Within an
event, some unexpected situations or results may occur and surprise users, such as
the bombing during the Boston Marathon and the power outage during the 2013
Superbowl. Subevent detection provides a deeper understanding of the threats to
better manage the situation within a crisis [112].
By formalizing it as an outlier detection task, we built dynamic models to detect
subevents based upon the retrieved Twitter data and the proposed window embedding
representation described in the following sections.
3.4.3 Data
We collected the data from Jan. 2, 2013 to Oct. 7, 2014 with the Twitter streaming
API and selected five national events for the outlier detection task. The five events
include the 2013 Boston Marathon event, the 2013 Superbowl event, the 2013 OSCAR
35
event, the 2013 NBA AllStar event, and the Zimmerman trial event. Each of these
events consists of a variety of subevents, such as the bombing for the marathon
event, the power outrage for the Superbowl event, the nomination moment of the
best picture award, the ceremony for the NBA AllStar MVP, and the verdict of the
jury for the Zimmerman trial event.
For these case studies, we filtered out relevant tweets with event-related keywords
and hashtags, preprocessed the data to remove urls and mentioned users. The basic
information of each event is provided in Table 3.1.
3.4.4 Window Embedding
In computational linguistics, distributed representations of words have shown some
advantages over raw co-occurrence count since they can capture the contextual infor-
mation of words. As categorized by Boroni et al. [84], distributed semantic models
can be termed as count models or prediction models. On one hand, count mod-
els, including LSA, HAL, and Hellinger PCA, can efficiently use the statistics of the
co-occurrence information but limited to capture complex patterns beyond word sim-
ilarities. On the other hand, prediction models, such as NNLM and word2vec, can
capture complex patterns of the words but limited to use the statistics information
of the words. To cope with limits of each approach, Pennington et al. [108] proposed
a weighted least squares objective J as shown follows:
J =V∑
i,j=1f(Xij)(wTi wj + bi + bj − logXij)2 (3.4)
where Xij is the number of times word j in the context of word i, wi and bi are
the word vector and bias of word i, wj and bj are the context word vector and bias
of word j, and f is a pre-defined weighting scheme as follows.
36
f(x) =
(x/xmax)α if x < xmax
1 otherwise
Vector representations can be used as features and they have been successfully
applied in many natural language processing applications [108]. Through some ex-
periments, we decided to use the 100 dimension GloVe vector representation that
were trained with 27 billion tweets. We further used the Probabilistic PCA to reduce
the vector dimensionality into d latent components that could capture at least 99%
variability of the original information.
Here, we define sentence embedding as the average of its word vectors. Given a
sentence, it consists of n words represented by vectors ed1, ed2, ..., edn, and the sentence
embedding sdi is defined as ∑ni=1 e
di /n. Furthermore, we define a window embedding
wdt as the average of its sentence vectors. For a given time window, it is composed of
m sentence vectors sd1, sd2, ..., sdm, and a window embedding wdt is defined as∑mi=1 s
di /m.
As we use a moving window approach, we grouped every l-size window wd1, wd2, ..., w
dl
into a training input X, and label wdl+1 as the training input Y . Based upon the
grouped data, we can train our proposed multivariate EnKF-LSTM model. With
some experiments, we chose 5 as the number of latent components d, 5 minutes as
the time window t, and 32 as the grouping size l.
3.4.5 Implementation
The implemented 2 layer network is shown in Figure 3.3. The input layer consists of
5 nodes, each hidden layer consists of 32 LSTM cells, and the output layer consists of
5 output nodes. In this implementation, we include the forget gate proposed by [37].
The implementation is based upon Tensorflow,and it could be easily extended for
deeper architectures or variants of LSTMs.
37
Figure 3.3: Architecture of the network used in this study.
3.5 Results
The outlier detection results are provided in Figure 3.4-3.8. In terms of the results,
we observe 40, 33, 39, 33, and 23 identified sub-events, respectively. Of those sub-
events, 16, 18, 25, 22, and 17 are verified as true sub-events. We set the initial
sigma value of the noise covariance matrix in the EnKF update step to 1.0, and then
further optimized them to 2.13, 2.19, 1.98, 1.56, and 1.21 with Maximum Likelihood
Estimation.
To further evaluate our model, we compared it with Gaussian Process (GP) and
MC dropout [34]. The comparison result is provided in Table 3.2. The GP model
yielded the best recall value in three of the five events, indicating that it captured most
true sub-events. On the other hand, it also misidentified many normal time windows
as sub-events, thus yielding many false positives and low precision. Compared to
the GP model, our proposed enkf_lstm algorithm reliably captured many true sub-
events and yield the best precision in these five events. Though, on the other hand, it
missed capturing some true sub-events and had worse recall performance in four of the
five events. In terms of the F1 score, our proposed algorithm has the best performance
38
Figure 3.4: Predicted sub-events with the proposed algorithm for the 2013 Bostonmarathon event. The distance lines above the blue threshold line indicate identifiedoutliers, and the red color indicates the Boston bombing moment (identified=40,true=16, σ2
ε=2.13).
in two of the five cases. The MC dropout model, however, has the worst performance
for this specific outlier detection task. Since MC dropout is mathematically equivalent
to variational inference, which under-estimates the uncertainty, the model mislabels
many normal time windows as outliers.
For the proposed algorithm, ensemble size N and the initial sigma value of the
noise covariance matrix σ2ε are two important hyper-parameters. To further evaluate
their effects on the performance, we provided a sensitivity analysis of the hyper-
parameters for the 2013 Boston marathon event. Based upon Figure 3.9, the algo-
rithm yielded the best result with an ensemble size at 200. In general, the evaluation
metrics increase until 200 and then slightly decrease, implying the proposal algo-
rithm can capture the dynamics of the posterior weights with a medium sample size.
39
Figure 3.5: Predicted sub-events with the proposed algorithm for the 2013 Superbowlevent. The distance lines above the blue threshold line indicate identified outliers,and the red color indicates the power outbreak (identified=33, true=18, σ2
ε=2.19).
Table 3.2: Evaluation metrics on different algorithms.
Event Model Precision Recall F1 ScoreGP 37.5 64.3 47.4
ENKF LSTM 40.0 57.1 47.12013 Boston MarathonMC Dropout 19.0 28.6 22.9
GP 43.8 55.3 48.8ENKF LSTM 54.5 47.4 50.72013 SuperbowlMC Dropout 18.2 21.1 19.5
GP 56.3 61.4 58.7ENKF LSTM 64.1 56.8 60.22013 OSCARMC Dropout 25.5 27.3 26.4
GP 55.2 68.3 61.0ENKF LSTM 66.7 53.7 59.52013 NBA AllStarMC Dropout 40.5 41.5 41.0
GP 57.7 62.5 60.0ENKF LSTM 60.9 70.8 65.5Zimmerman TrialMC Dropout 26.3 41.7 32.3
40
Figure 3.6: Predicted sub-events with the proposed algorithm for the 2013 OSCARevent. The distance lines above the blue threshold line indicate identified outliers,and the red color indicates the OSCAR starting moment (identified=39, true=25,σ2ε=1.98).
According to Figure 3.10, the evaluation metrics peaked at 0.05 and then slightly
decreased with larger value.
3.5.1 Discussion
In this work, we proposed a novel algorithm to estimate the posterior weights for
LSTMs, and we further developed a framework for outlier detection. Based upon
the proposed algorithm and framework, we applied them for five real-world outlier
detection tasks using Twitter streams. As shown in the above section, the proposed
algorithm can capture the uncertainty of the non-linear multivariate distribution and
outperform Gaussian process and MC dropout in terms of precision. We also eval-
uated the sensitivity of the algorithm on different ensemble size and variance value
41
Figure 3.7: Predicted sub-events with the proposed algorithm for the 2013 AllStarevent. The distance lines above the blue threshold line indicate identified outliers,and the red color indicates the AllStar starting moment (identified=33, true=22,σ2ε=1.56).
of the prior distribution of the weights with the Boston marathon data. However,
the performance of the model is further affected by several other hyper-parameters,
including the batch size, the number of layers, the number of nodes in each layer,
and the choice of window size, and the sensitivity of these hyper-parameters will be
evaluated in future research.
42
Figure 3.8: Predicted sub-events with the proposed algorithm for the 2013 Zim-merman trial news event. The distance lines above the blue threshold line indicateidentified outliers, and the red color indicates the verdict moment (identified=23,true=17, σ2
ε=1.21).
Figure 3.9: Performance of the algorithm on the marathon event for different ensemblesize.
43
Figure 3.10: Performance of the algorithm on the Boston marathon event for differentsigma value.
44
Chapter 4
Student Knowledge Estimation with Bayesian
Network
4.1 Student Knowledge Estimation
Intelligent Tutoring Systems (ITS) have been studied since 1980s [113] and research
from this area is becoming more important because of the advancement of computa-
tion and the increase of class size. As Butz et al. [15] explained, ITSs are computer-
based systems that can provide functionalities such as estimation of students’ un-
derstanding and individualized instruction in a similar way to traditional one-to-one
tutoring. It is of particular importance when the enrollment of post-secondary stu-
dents keeps growing 1 while the instructors have limited time for providing feedback.
Knowledge of the students are hidden from direct measurements; however, ITS
can help us to estimate the latent knowledge of students from quizzes. So far, there
are a number of ITSs that are used for different domains, such as BITS [15], Andes
[145], ViSMod [156], KERMIT[48].
As Butz et al. [15] claimed, there exists four common components of traditional
ITSs, including knowledge domain, student model, teaching strategies, and user in-
terface. Specifically, student models are pre-defined user models that are used to
track the states and needs of each student. The teaching strategies are the instruc-
tion styles of the system, such as the way of providing recommendations, while the
user interface of the system provides the capacity to interact with users.
1http://nces.ed.gov/fastfacts/display.asp?id=98
45
Of all components, student models are considered the key component of any adap-
tive tutoring system [93] due to their capability of storing information (e.g. example
and learning styles) about the students. Based upon student models, we can further
estimate the knowledge of each student and provide an individualized and optimum
learning path for the students. As explained by Chrysafiadi and Virvou [24], student
models can be used to estimate the knowledge level and cognitive states of students,
identifying learning styles and preferences, selecting proper learning methods (e.g.
providing tutorials), recognizing weaknesses and strengths to further recommend in-
dividualized feedback.
Due to the difficulty of model construction and initialization, many researchers
realize the complexity of the issue and put forward many possible approaches [93].
The typical approaches of constructing student models and initializing parameters
are using expert knowledge, data-driven estimations, or a synthesis of those two.
In this study, we address three issues involved in student knowledge modeling.
That is, estimation of knowledge level of students, distractors or misconceptions
identification, and evaluation of questions design. We address the first issue by con-
structing Bayesian Student models to estimate the posterior of the knowledge of a
student for a specific concept given assessment answers. We address the second issue
by proposing a novel optimization procedure. And we address the third issue by
designing a novel index to evaluate whether a question is proper or not.
Concept inventories (CIs) are commonly used to make inferences about students’
knowledge. Concept inventories are used for different branches of science including,
but not limited to, Electromagnetics [102], Discrete Mathematics [3], Statistics [138],
Electric Circuits [104], Signals and Systems [147], Thermodynamics [92], Strength of
Materials [116], Fluid Mechanics [88, 57], Dynamics [41], Chemistry [69, 29], Biology
[63, 35, 4, 12, 27],Biomechanics [67], Geoscience [79], Calculus [30], Astronomy [7].
The Statics concept inventory used in this work is designed by Paul S. Steif [134,
46
135].
Misconceptions of (scientific) knowledge have been studied with both qualitative
approaches and quantitative approaches. As mentioned in the literature [140, 21], the
development of two-tier multiple-choice diagnostic instruments serves as an effective
way of identifying misconceptions in the process of learning. Two-tier multiple-choice
diagnostic instruments, as discussed by Chandrasegaran [21], contain one tier of sev-
eral content questions and a second tier of possible conceptions and/or misconceptions
for the answers to the first-tier questions. To gain points for the question, a student
needs to answer correctly for both tiers of questions. Other qualitative methods, in-
cluding designs of Force Concept Inventory [123], or questionaire [73], or protocolss
[146], have been explored to study misconceptions. Corter et al.[26] proposed a work
to perceive misconceptions, errors, or biases by directly building a Bayesian Causal
Network and by modeling the misconceptions as latent variables. In their work, they
also specify the initial structure and parameters of the network and fit the data to
the network to get optimized parameters using the Expectation Maximization (EM)
algorithm. Similarly, Goguadazi et al. [38] also explicitly defined the misconception
as nodes in the Bayesian Network.
According to Liu et al [82], they incorporated the misconception as a model com-
ponent in the knowledge component model, and claimed a dramatic increase in the
overall fit of the improved model to the data. Compared to previous work, we de-
sign a data-driven approach to identify the misconceptions. In other words, we build
a Bayesian Network without incorporating the misconception nodes, and propose a
novel optimization procedure to identify the misconceptions.
4.2 Bayesian Network Background
In the Bayesian framework, Bayes’ theorem is widely used to update the posterior
based upon the likelihood function and a pre-defined prior probability. Specifically,
47
the posterior can be computed as:
p(θ|D) = p(D|θ)p(θ)p(D) (4.1)
While p(D|θ) is the likelihood function, and p(D) is the prior probability.
Compared to the freqentist framework, the Bayesian framework introduces the
concept of prior beliefs. The Bayesian framework, furthermore, can iteratively up-
date the prior probability and posterior probability in a procedure termed Bayesian
updating. During the iterative procedure, a posterior probability is computed after
some evidence is observed and this posterior probability is considered as an updated
prior to repeat the new updating cycle.
To better understand the Bayes’ rule and the Bayesian updating procedure, we
provide a simple example here. Assume a two-choice question related to a particular
concept of interest. Defining "C" as the concept being known, and assuming that
the belief prior to any testing is a 25 % probability of the student knowing the
concept, we can say that P (C) = 0.25. The conditional probability of the student
selecting choice "A" or "B" depending on knowing or not knowing the concept should
also be defined. This will depend on the design of the question. For example, a
distractor representing a common misconception might have a higher probability of
being selected by students that do not know the concept [98]. For this example let’s
assume that "A" is the right choice, and there is a 90% probability of being selected
by students that know the concept i.e. P (A|C) = 0.9, accordingly there is a chance
of 10% for selecting the wrong choice i.e. P (B|C) = 0.1 . If the students do not know
the concept, we can assume that each choice has the same chance of being selected
(P (A|C) = 0.5, and P (B|C) = 0.5). Based on the answer of the student, and the
prior information highlighted in the previous paraghraph, it is possible to infer if the
48
student knows concept "C". If the student selects "A" (right answer):
P (C|A) = P (A|C)P (C)P (A|C)P (C) + P (A|C)P (C)
= 0.9× 0.250.9× 0.25 + 0.5× 0.75 = 0.375
(4.2)
Here our belief about the probability of the student knowing the concept changed
from 25% (P (C) = 0.25) to 37.5 % (P (C|A) = 0.375). However, if the student
answers "B":P (C|B) = P (B|C)P (C)
P (B|C)P (C) + P (B|C)P (C)= 0.1× 0.25
0.1× 0.25 + 0.5× 0.75 = 0.0625(4.3)
In this case the probability of the student knowing the concept dropped to 6.25%.
Obtained posterior can be used as the prior for the next question. For example
consider another question with two choices and A as the correct choice. If the student
who answered the first question correctly selects the correct choice again, the belief
about whether he or she knows the concept, P (C|AA), is increased from 0.375 to
0.519.
P (C|AA) = 0.9× 0.3750.9× 0.375 + 0.5× 0.625 = 0.519 (4.4)
According to Jensen [61], a Bayesian Network (BN) provides a graphic and math-
ematical depiction of the joint probability over a group of random variables. Before
we introduce the power of a BN, we need to illustrate the concept of the Joint Proba-
bility Distribution (JPD). As discussed in [15], a JPD is defined as a function p over
a set of random and discrete variables V = v1, v2, ..., vn if they meet the following
properties:
• 0 <= p(v) <= 1.0, for each v ∈ dom(V )
• ∑v∈dom(V ) p(v) = 1
Where dom(V ) is the Cartesian product of each variable domain in U .
49
One of the main advantages of using BNs is to get a compact form of the joint
distribution on V . Obviously, the operation of obtaining the joint distribution directly
on V takes O(2n) for n variables. However, the utilization of BNs can speed up the
acquisition of JPD based upon the notion of conditional independence [15]. Specific,
random variables v1 and v3 are conditionally independent given v2 if
p(v1|v2v3) = p(v1|v2) (4.5)
The compact form of the joint distribution is then acquired with the chain rule.
According to Jensen [61], the JPD, p(U), is specified as follows based upon the chain
rule:
p(U) =n∏i=1
p(vi|pa(vi)) (4.6)
Where pa(vi) are the parents of vi in BN.
As Jensen [61] claimed, a BN consists of four basic elements. That is, it includes
a set of variables as well as edges between variables, a number of mutually exclusive
states for each variable, a Directed Acyclic Graph (DAG) constructed by the variables
and edges, and a conditional probability table attached to each variable. If no parents
exist for node A, then the conditional probability attached to the node becomes
the prior. In this study, all the basic elements of the network are pre-specified.
Particularly, we use domain expertise to construct the network structure, and estimate
the conditional tables as well as priors with an integration of expert knowledge and
a stepwise procedure introduced later in this section.
The Expectation Maximization (EM) algorithm is used to estimate the parameters
of BN with missing observed data [61]. In general, the EM algorithm is an iterative
approach to estimate the maximum likelihood of the parameters with an expectation
step and a maximization step. In the expectation step, we compute the expectation
of the data by using current parameters θ0ld, and later we obtain an updated set of
50
parameters θnew by maximizing the expectation with regard to each old parameter.
Then the algorithm runs in a repetitive way with these two steps until it converges
or reaches the pre-specified iteration steps.
A formal description of the algorithm for EM can be found in Jensen [61] and it is
also provided in Algorithm 3. In the algorithm, p(vi|pa(vi)) represents the conditional
probability for variable vi in its kth state, given the jth configuration of the parents
of vi, and sp(vi) represents the state space of variable vi.
Algorithm 3 EM algorithm for Bayesian Network [61]Choose initial parameters θold.Define stopping criteria ε > 0.Set t := 0.while | log2 p(D|θt)− p(D|θt−1)| > ε doE step: Compute the expectation of the likelihood function:
Eθt [N(vi, pa(vi))|D] =∑d∈D
p(vi, pa(vi)|d, θt)
M step: Estimate θijk with the expected counts using maximum likelihood:
θijk = Eθt [N(vi = k, pa(vi) = j)|D]∑|sp(vi)|h=1 Eθt [N(vi = h, pa(vi) = j)|D]
Set θt+1 := θ, t := t+ 1.end while
4.3 Methodology
4.3.1 Student Model Knowledge
Inspired from the force concept inventory [47, 46, 54], Statics concept inventory is
developed to assess conceptual understanding, and identify the students’ misconcep-
tions regarding basic concepts in statics. Four clusters of concepts are introduced by
Steif [133, 135] and used in this paper. These clusters of concepts are summarized
in Table 4.1. The CI contains 27 questions, and to answer them correctly, a student
51
Table 4.1: Concepts are used for the concept inventory
Concept Definition
C1Forces are always in equal and opposite
pairs acting between bodies, which are usually in contact
C2
Distinctions must be drawn between a force, a momentdue to a force about a point, and a couple. Twocombinations of forces and couples are statically
equivalent to one another if they have the same netforce and moment.
C3
The possibilities of forces between bodies that areconnected to, or contact, one another can be reducedby virtue of the bodies themselves, the geometry of the
connection and/or assumptions on friction.
C4
Equilibrium conditions always pertain to the externalforces acting directly on a chosen body, and a body isin equilibrium if the summation of forces on it is zero
and the summation of moments on it is zero.
should know one or some of the mentioned concepts. The model used in this paper
is illustrated in Figure 4.1. The arrows show the logical connection between concepts
and questions. For example we believe that to learn concept C2, students first need
to learn concept C3. Also, to answer some of questions, students need to learn more
than one concept.
Construction of the student model is based upon the integration of BN and under-
standing of the student curricular structure. In the knowledge tracking domain, the
knowledge of a student, which is also the understanding of particular concepts (e.g.
1st cluster of concepts, and 2nd cluster of concepts) is treated as hidden variables
with a known state and an unknown state. The hidden variables are investigated by
using observed variables which are the answers to the concept inventory questions
(e.g. q_q1) in Figure 4.2.
An instance of our student model is provided in Figure 4.2. According to the
figure, each node represents either a concept or a question, and each edge of the graph
represents a connection between concepts and questions or a connection between
concepts. With the post-test data, we insert the answers to questions as evidence
52
C1C2
C3 C4 Q 1
Q 4
Q 7Q10
Q13
Q16
Q19
Q22 Q25
Q 2
Q 5
Q 8
Q11
Q14
Q17
Q20
Q23
Q26
Q 3
Q 6
Q 9
Q12
Q15
Q18
Q21
Q24
Q27
Figure 4.1: Relationships of concepts and questions
into the model and estimate the posterior of each concept for each student. The blue
color of a concept node indicates the probability of knowing the concept given the
answers to the 27 questions.
Knowledge tracking based upon BN are limited by three factors, including the
selection of nodes, structure building of the network, and initialization of the priors
and conditionals [87]. In this work, we integrate concept inventory to select the
concepts for the model. The structure of the network is specified in terms of the
expert knowledge and concept inventory. Due to its factorization of complex joint
distribution into marginal distributions, we initialize the priors and conditionals over
all nodes and edges using our expertise. Somehow the way of initializing the model
neglects the data characteristics, which fails to differentiate the students’ performance
53
known 16%unknown84%
c_c1known 40%unknown60%
c_c3
known 15%unknown85%
c_c2known 16%unknown84%
c_c4
a 0%b 0%c 0%d 100%e 0%
q_q1
a 0%b 0%c 0%d 100%e 0%
q_q2
a 0%b 100%c 0%d 0%e 0%
q_q3
a 100%b 0%c 0%d 0%e 0%
q_q4
a 100%b 0%c 0%d 0%e 0%
q_q5
a 0%b 0%c 0%d 0%e 100%
q_q6
a 0%b 0%c 100%d 0%e 0%
q_q7a100%b 0%c 0%d 0%e 0%
q_q8a 0%b 100%c 0%d 0%e 0%
q_q9a 0%b 100%c 0%d 0%e 0%
q_q10a 0%b 0%c 0%d 100%e 0%
q_q11a 0%b 100%c 0%d 0%e 0%
q_q12
a 0%b 0%c 0%d 100%e 0%
q_q13
a 0%b 0%c100%d 0%e 0%
q_q14
a 0%b 0%c 0%d 0%e 100%
q_q15
a100%b 0%c 0%d 0%e 0%
q_q16
a 0%b 0%c100%d 0%e 0%
q_q17
a 0%b 100%c 0%d 0%e 0%
q_q18
a100%b 0%c 0%d 0%e 0%
q_q19
a 0%b 0%c 100%d 0%e 0%
q_q20a 0%b 0%c 0%d 0%e 100%
q_q21a 0%b 0%c 0%d 100%e 0%
q_q22a 0%b 0%c 0%d 0%e 100%
q_q23a 0%b 0%c 0%d 0%e 100%
q_q24
a 0%b 100%c 0%d 0%e 0%
q_q25
a100%b 0%c 0%d 0%e 0%
q_q26a 0%b 0%c 100%d 0%e 0%
q_q27
Figure 4.2: Bayesian Network Model of Student Knowledge for Statics Concept In-ventory
54
in each semester. Thus we adopt a novel optimization procedure which is a data-
driven approach to obtain the optimized parameters for the prior of the first cluster
of concepts and all the conditionals between the concepts and questions.
Specifically, the initial prior probability for knowing the first cluster of concepts
(p(c_c1) is defined as 0.7, since it is a pre-requisite concept that is learned in the
Linear Algebra class before the Statics class. In this work, we also explore the rest of
the concepts, such as the second cluster of concepts and the third cluster of concepts,
which are developed based upon the first cluster of concepts. The transition prob-
ability between pairs of concepts is determined by the capability of the instructors
of communicating the knowledge to the students. In this study, we assume the in-
structor is capable of communicating those concepts effectively to the students, thus
we assign a high probability (e.g. 0.95) to p(c_c2 = known|c_c1 = known). On the
other hand, if a student has no knowledge of c_c1 (p(c_c1 = unknown)), then the
probability of not knowing c_c2 (p(c_c2 = unknown|c_c1 = known)) is set as a high
value (e.g. 0.95) as well. In this case, we assume that the lack of understanding of a
basic concept will impede the understanding of a more advanced concept. Similarly,
we define all other conditional probabilities between the concepts and questions.
According to our model, it should be kept in mind that the probability of knowing
a more advanced concept (e.g. 3rd cluster of concepts) is lower in the absence of any
testing. Specifically, when the prior of knowing the 1st cluster of concepts is set as 0.7,
the prior of knowing the rest of the clusters of concepts are 0.6485, 0.68, and 0.62015,
respectively. The objectives of this research can be discussed in aspects. First, we
can estimate the level of understanding of each individual student for the instructed
concepts with a Bayesian approach by analyzing the evidences from the concept in-
ventory tests. Secondly, we can reveal the misconceptions of the questions in the tests
when the concept is unknown to the students using our proposed parameter learning
algorithm. The initial conditional probability between the concepts and questions is
55
equally distributed over the answers (e.g. p(q_q1 = X|unknown) = 0.2 where X is
either A, B, C, D, and E) when the concept is unknown to the students. However,
this setting ignores the misconceptions of the students at the moment of assessment
because some answers are more distracting than others. Based upon the proposed
parameter learning procedure, we can learn the optimized conditional probabilities
using the data and recognize the misconceptions of the students to develop remedial
interventions. Thirdly, we can investigate the design of the questions from the per-
spective of instructors to explore whether a question is a well designed question or a
badly designed (or too difficult) question. To address this goal, we use the likelihood
plots and identify the badly-designed questions if we have a p-value larger than a
pre-specified threshold such as 0.05. More detailed discussion about the second goal
and the third goal are provided in the following sections.
4.3.2 Misconception Identification
Conditional probabilities can be used to identify the distractors. When a student
does not know the concept, there is a higher chance to select a choice other than the
correct one.
To get more informative priors and conditionals, we learn the parameters from
experimental data with the EM algorithm provided in the JSMILE APIs. However,
current JSMILE APIs only infers entire conditionals which include the probabilities
of answering the quizzes when the concept is known. This may result in logical
inconsistencies for extreme datasets. A typical case is that all students answer the
questions incorrectly. To cope with the issue and obtain proper parameters, we
establish a novel optimization procedure shown in Algorithm 4 to estimate the prior
of the first cluster of concepts and conditionals between concepts and questions.
56
Algorithm 4 Optimization procedure for parameter learningSet an initial model evidence log2p(D|θ0).Define stopping criteria ε > 0.Set t := 0.while | log2 p(D|θtC)− log2 p(D|θt−1
C )| > ε doEstimate θQ with EM.Update p(Q = X|C = known) for question Q and concept C.Estimate θC with EM.Set θt+1
C = θC , t = t+ 1.end whileUpdate p(Q = X|C = known) for question Q and concept C.
4.3.3 Ill-designed Question Identification
To assess the validity of the optimized Bayesian network model, M , we propose
a predictive validation metric evaluated on a hold-out dataset. Using the training
dataset D6=i,j consisting of student’s answers to 26 out of the 27 questions in the
concept inventory, we can infer the posterior probabilities associated with knowing the
four concepts for each student, Sj. Student’s answer, xij, to the validation question,
Qi, is used to evaluate the following likelihood function obtained from the posterior
predictive distribution corresponding to the ith question.
p(Qi = xij|Sj, D6=i,j,M) (4.7)
This likelihood function can be used to define a measure of how well the model
fits all N students’ answers for the ith question. This goodness-of-fit measure can be
expressed using the following expected log-likelihood under the assumption that all
students are equally likely.
E[log p(Qi = xi|S,D 6=i,M)] =N∑j=1
p(Sj) log p(Qi = xij|Sj, D6=i,j,M)
= 1N
N∑j=1
log p(Qi = xij|Sj, D6=i,j,M) (4.8)
By itself, this goodness-of-fit measure is non-informative. A reference measure is
required to assess whether the model can explain the validation data. Note that by
57
generating a random answer rij from a discrete uniform distribution corresponding
to Qi, and for each student Sj, we can generate a similar measure as in Eq. 4.8 for
how well the model fits random answers.
E[log p(Qi = ri|S,D 6=i,M)] = 1N
N∑j=1
log p(Qi = rij|Sj, D6=i,j,M) (4.9)
It is expected that a model that has predictive capability will better fit the ac-
tual answers than the synthetically generated random answers. Thus, the measure in
Eq. 4.8 is expected to be larger than the one in Eq. 4.9. However, the expected log-
likelihood in Eq. 4.9 is a random variable with a distribution induced by the discrete
uniform distribution over the answers of Qi. As a result, the proposed predictive vali-
dation metric takes the form of a standard hypothesis test - where the null hypothesis
is that the model has no predictive capability and the alternative is that the model
has predictive capability. The null hypothesis is rejected at the significance level α.
P
(E[log p(Qi = ri|S,D6=i,M)] ≥ E[log p(Qi = xi|S,D 6=i,M)]
)< α (4.10)
With the proposed index, we can evaluate whether a question is designed properly
or not. Overall, there are two types of problems of designing the questions. Under
the first type of design problem, the conditional probability of the choice when the
student does not know the concept is the same as the correct choice. It means even if
a student does not know the concept, he or she is still very likely to select the correct
choice. In these cases, in the case of correct answer, it is not very obvious whether the
student knows the concept or not. Under the second type of design problem, very few
students can answer correctly either because the problem is too hard or it is designed
improperly. Both types of design problems can be detected by a high p-value larger
than 0.05.
58
4.4 Numerical Results
4.4.1 Data
Our datasets include three years of post-test data of student performance at the
University of South Carolina for the Statics class from 2006 to 2008. The tests are
either paper and pencil or web-based. Each test contains the same 27 questions
related to the CI.
There are 27, 26, and 34 students for 2006, 2007, and 2008, respectively. How-
ever, there are five students enrolled in 2006 with missing information and thus we
removed them for our analysis. Finally, we have 82 students in total, and we used 48
students from 2006 and 2007 for training the model, and the remaining 34 students
for estimation of student knowledge, misconception identification, and ill-designed
questions detection.
The results are summarized in Table 4.2. Correct answers of each question are
indicated by circle.
Table 4.3 illustrates the conditional probability obtained for a question that is
related to two concepts, C1 and C4. This table considered the data from classes of
2006 and 2007.
4.4.2 Misconceptions
Conditional probabilities can be used to identify the distractors. A student who does
not know the concept will select an incorrect answer, and this can show the miscon-
ceptions of the students. Consider the following examples. Question 19 shown in
figure 4.3 has conditional as shown in Table 4.4. The correct answer is choice a and
the distractor is choice d. In another example, question 22 shown in figure 4.4, the
correct answer is choice b, but the distractor is choice e. The reason is that some
students who do not know the concept C3 simply multiply the weight of 20 N by
59
Table 4.2: Summary of students answers for each question in each class (Numbers ofstudents who selected the correct choice are indicated by circle)
Class of 2006 & 2007 Class of 2008Question a b c d e a b c d e
Q1 11 6 22 8 1 9 3 17 5 0Q2 12 12 5 10 9 12 8 3 4 7Q3 10 13 14 5 6 10 16 5 1 2Q4 7 15 9 7 10 6 9 9 1 9Q5 18 6 8 12 4 11 3 3 11 6Q6 12 12 4 8 12 9 7 4 5 9Q7 8 13 3 17 7 4 15 3 7 5Q8 18 10 12 4 4 14 5 5 5 5
Q9 9 11 4 13 11 6 12 4 4 8
Q10 7 2 27 10 1 6 4 11 8 5
Q11 3 3 38 3 1 5 1 22 5 1
Q12 2 7 25 12 2 7 5 12 7 3
Q13 2 18 4 19 5 2 10 2 15 5
Q14 4 22 5 9 8 3 10 9 4 8
Q15 1 12 1 8 26 0 10 0 10 14Q16 11 12 20 4 1 3 11 14 6 0Q17 10 8 22 4 3 15 2 13 4 0Q18 16 10 11 6 5 12 8 11 2 1
Q19 21 3 3 15 6 14 1 6 10 3
Q20 31 3 3 1 9 19 1 2 7 5
Q21 23 7 4 11 2 17 3 4 6 4
Q22 4 10 6 8 19 3 7 9 6 9
Q23 3 6 10 10 17 3 4 4 4 19Q24 5 7 9 23 2 5 4 5 19 1Q25 5 8 4 19 10 4 3 2 15 10Q26 17 6 11 10 2 13 1 11 8 1Q27 13 2 19 7 4 7 1 11 10 5
friction coefficient which is 0.5. Both questions have good distractors, because the
instructor can understand what the misconception is. In the first one, the miscon-
ception is behavior of connection. A pinned connection cannot resist in front of the
60
Table 4.3: Conditional probabilities for a question related to two concepts
C1 known unknownC4 known unknown known unknowna 0.05 0.6638 0.9982 0.0478b 0.05 1.9e-6 0.0004 0.3965c 0.05 1.9e-6 0.0004 0.1585d 0.05 2.1e-6 0.0004 0.3176e 0.8 0.3361 0.0004 0.0794
Question 19: The force F is known and the other loads on the plate areunknowns to be determined. Consider drawing a free body diagram ofthe plate, including the unknown reaction of the pin.
Figure 4.3: Question 19 related to C3
Table 4.4: Conditional probabilities for question 19
C3 known unknowna 0.8 0.2578b 0.05 0.0967c 0.05 0.0968d 0.05 0.4515e 0.05 0.0972
rotation. In the second question, the distractor represents one of the most common
misconceptions in friction questions [137]. The tangential force is obtained equal to
the normal force times the friction coefficient, while its magnitude is greater than the
force needed to maintain the equilibrium.
4.4.3 Ill-designed questions
The p-values for each question are calculated using 50, 000 sets of random samples
corresponding to students’ answers to the ith question. Table 4.6 summarizes the
61
Question 22: Three blocks are stacked on top of one another on atable. Then, the horizontal forces shown are applied. The frictioncoefficient is 0.5 between all contacting surfaces. (This is both thestatic and kinetic coefficient of friction.)Which of the following represents the horizontal component of theforce acting on the lower face of the top (20 N) block?
Figure 4.4: Question 22 related to C3
Table 4.5: Conditional probabilities for question 22
C3 known unknowna 0.05 0.0967b 0.8 0.2257c 0.05 0.129d 0.05 0.0648e 0.05 0.4837
Table 4.6: P-value for each question
Question p-value Question p-value Question p-valueQ1 0.0247 Q10 0.0270 Q19 0.0001Q2 0.5362 Q11 0.0000 Q20 0.0001Q3 0.0324 Q12 0.0097 Q21 0.0002Q4 0.2665 Q13 0.0001 Q22 0.3012Q5 0.0336 Q14 0.0716 Q23 0.0466Q6 0.6519 Q15 0.0000 Q24 0.0004Q7 0.0111 Q16 0.0005 Q25 0.0067Q8 0.2156 Q17 0.0106 Q26 0.0158Q9 0.8412 Q18 0.0188 Q27 0.0053
p-values for all the questions in the concept inventory. Note that for 7 questions the
model does not exhibit predictive capability at 0.05 significance level.
Here are some examples of the first type of design problems. For question 11
which is shown in figure 4.5, there is 68% chance of selecting the correct choice when
62
Question 11: The platform is kept in the position shown by a roller,link and hydraulic cylinder. The coefficient of friction between theroller and the dump is 0.6. What is the direction of the roller on theplatform at the point of interest?
Figure 4.5: Question 11 related to concept C3
Table 4.7: Conditional probabilities for question 11
C3 known unknowna 0.05 0.0968b 0.05 0.0968c 0.8 0.6775d 0.05 0.0968e 0.05 0.0323
Table 4.8: Conditional probabilities for question 20
C3 known unknowna 0.8 0.6125b 0.05 0.0968c 0.05 0.0967d 0.05 0.0323e 0.05 0.1618
a student does not know the related concept. Another example could be question
20, shown in figure 4.6. As it shown in Table 4.8, the chance of selecting the correct
answer, when a student does not know the concept is 61%. Another good example
of these kinds of questions is question 10.
Here is an example of the second type of design problems. Question 9 is a good
example of those hard questions, only 4 students among 34 could answers this question
correctly and its p-value is around 0.84. It shows that the model is not able to make
63
Question 20: Part 1 and part 2 are welded to each other. Forces Fand G are known, and the other loads are unknowns to be determined.Consider drawing a free body diagram of part 1, including the unknownreaction of part 2 on part 1. Unknowns A, B, and C could be positive,negative or zero.Which of the following is the correct free body diagram for the forcesand/or moments exerted by part 2 on part 1 at welded section?
Figure 4.6: Question 20 related to concept C3
prediction for students in this question. They are several possible reasons for that.
First, looking at the optimized conditional probabilities, it is one of those questions
where the chance of selecting the correct answer when a student does not know
the concept is more than other choices, p=0.31. Furthermore there are only three
questions related to concept C2, which may not be enough to predict the performance
of students in this question.
A good design problem can be illustrated by Question 7. Question 7, related
to concept C2, is one of the questions with a low p-value. Figure 4.10 shows the
related histogram and p-value for the class of 2008, considering classes of 2007 and
2006 for optimization. The value of p-value shows indicates strong evidence against
the null hypothesis that the model cannot predict the performance of students. In
other words, assuming the model inference about student knowledge is acceptable,
the question and its related concept is designed in a way that can be used to infer
about student knowledge in concept C2.
64
Question 9: The two forces with magnitudes 7 N and 10 N act in thedirections shown through points A and B, which are denote with dots.These forces keep the member in equilibrium while it is subjected toother forces acting in the plane (shown at the right).Assuming the other forces stay the same, what load(s) could replacethe 7 N and 10 N forces and maintain equilibrium ?
Figure 4.7: Question 9 related to concept C2
4.5 Educational Component
Student knowledge estimation plays a core role in the process of student learning.
Proper and prompt estimation can help provide important information to both in-
structors and students for remedial intervention. In this study, we explore using
concept inventory for student knowledge estimation. Meanwhile, our system can help
each student identify the individualized misconception of each question. By proposing
a question design metric, we can further measure the effectiveness of each question.
We implemented the backend system with Java and JSMILE APIs. With the sys-
tem, we can construct student models with prior knowledge of the structure of a given
concept inventory. By feeding student exam data into the system, we can obtain per-
sonalized student models and provide individualized suggestion for intervention. The
implemented system is publically available at https://bitbucket.org/uqlab/scilaf.
65
Figure 4.8: Histogram of blind guessing in addition to the actual performance ofstudents score.
4.6 Conclusions
In this work, we firstly developed a data-driven approach to assess the latent student
knowledge by constructing Bayesian student models. Then we put forward a novel
algorithm to identify the misconceptions for each quiz question, and thus we could
provide individualized and remedial interventions for each student. In the end, we
proposed a novel index to evaluate the student models as well as measuring the design
of each question. According to the results, we identified common distractors found
for the concept inventory data. As the model is capable of discovering individualized
misconception, it can provide timely intervention after each test. Furthermore, the
measurement index showed that 20 of the 27 questions exhibit predictive capability
with the student model, while several inproper-designed questions were discussed.
66
Question 7: A 200 N-mm couple acting counter-clockwise keeps themember in equilibrium while it is subjected to other forces acting inthe plane (shown schematically at the left). The four dots denoteequally spaced points along the member. Assuming the other forcesstay the same, what load(s) could replace the 200 N-mm couple andmaintain equilibrium?
Figure 4.9: Question 7 related to concept C2
Figure 4.10: Histogram of blind guessing in addition to the actual performance ofstudents score.
67
Bibliography
[1] Hamed Abdelhaq, Christian Sengstock, and Michael Gertz. “EvenTweet: On-line Localized Event Detection from Twitter”. In: Proc. VLDB Endow. 6.12(Aug. 2013), pp. 1326–1329. issn: 2150-8097. doi: 10.14778/2536274.2536307.url: http://dx.doi.org/10.14778/2536274.2536307.
[2] James Allan. “Introduction to topic detection and tracking”. In: Topic detec-tion and tracking. Kluwer Academic Publishers, 2002, pp. 1–16. isbn: 0-7923-7664-1.
[3] Vicki L Almstrum et al. “Concept inventories in computer science for thetopic discrete mathematics”. In: ACM SIGCSE Bulletin. Vol. 38(4). ACM.2006, pp. 132–145.
[4] Dianne L Anderson, Kathleen M Fisher, and Gregory J Norman. “Develop-ment and evaluation of the conceptual inventory of natural selection”. In:Journal of research in science teaching 39.10 (2002), pp. 952–978.
[5] Rebecca A Atadero et al. “Project-Based Learning in Statics: Curriculum,Student Outcomes, and On-going Questions”. In: age 24 (2014), p. 1.
[6] Farzindar Atefeh and Wael Khreich. “A Survey of Techniques for Event Detec-tion in Twitter”. In: Comput. Intell. 31.1 (2015), pp. 132–164. issn: 0824-7935.doi: 10.1111/coin.12017.
[7] Janelle Margaret Bailey. Development of a Concept Inventory to Assess Stu-dents’ Understanding and Reasoning Difficulties About the Properties and For-mation of Stars. 2006. url: http://hdl.handle.net/10150/193643.
[8] Nilesh Bansal and Nick Koudas. “BlogScope: Spatio-temporal Analysis of theBlogosphere”. In: Proceedings of the 16th International Conference on WorldWide Web. WWW ’07. New York, NY, USA: ACM, 2007, pp. 1269–1270. isbn:978-1-59593-654-7. doi: 10.1145/1242572.1242802. url: http://doi.acm.org/10.1145/1242572.1242802.
68
[9] H. Becker, M. Naaman, and L. Gravano. “Beyond trending topics: Real-worldevent identification on Twitter”. In: Fifth International AAAI Conference onWeblogs and Social Media. 2011.
[10] David M. Blei. “Probabilistic Topic Models”. In: Commun. ACM 55.4 (Apr.2012), pp. 77–84. issn: 0001-0782. doi: 10.1145/2133806.2133826. url:http://doi.acm.org/10.1145/2133806.2133826.
[11] Charles Blundell et al. “Weight Uncertainty in Neural Networks”. In: Pro-ceedings of the 32Nd International Conference on International Conferenceon Machine Learning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015,pp. 1613–1622. url: http://dl.acm.org/citation.cfm?id=3045118.3045290.
[12] Stacey Lowery Bretz and Kimberly J Linenberger. “Development of the enzyme–substrate interactions concept inventory”. In: Biochemistry and Molecular Bi-ology Education 40.4 (2012), pp. 229–233.
[13] Thang D. Bui et al. “Deep Gaussian Processes for Regression using Approx-imate Expectation Propagation”. In: ICML. Vol. 48. JMLR Workshop andConference Proceedings. JMLR.org, 2016, pp. 1472–1481.
[14] Gregoire Burel et al. “On Semantics and Deep Learning for Event Detectionin Crisis Situations”. In: ESWC 2017. Portoroz, Slovenia, 2017.
[15] C. J. Butz, S. Hua, and R. B. Maguire. “A Web-based Bayesian IntelligentTutoring System for Computer Programming”. In: Web Intelli. and AgentSys. 4.1 (Jan. 2006), pp. 77–97. issn: 1570-1263. url: http://dl.acm.org/citation.cfm?id=1239784.1239789.
[16] SM Case and DB Swanson. Item writing manual: Constructing written testquestions for the basic and clinical sciences. 2002.
[17] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. “Information Cred-ibility on Twitter”. In: Proceedings of the 20th International Conference onWorld Wide Web. WWW ’11. Hyderabad, India: ACM, 2011, pp. 675–684.isbn: 978-1-4503-0632-4. doi: 10.1145/1963405.1963500. url: http://doi.acm.org/10.1145/1963405.1963500.
[18] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. “Emerging Topic De-tection on Twitter Based on Temporal and Social Terms Evaluation”. In: Pro-ceedings of the Tenth International Workshop on Multimedia Data Mining.MDMKDD ’10. Washington, D.C.: ACM, 2010, 4:1–4:10. isbn: 978-1-4503-0220-3. doi: 10.1145/1814245.1814249. url: http://doi.acm.org/10.1145/1814245.1814249.
69
[19] Junghoon Chae et al. “Spatiotemporal social media analytics for abnormalevent detection and examination using seasonal-trend decomposition”. In:2012 IEEE Conference on Visual Analytics Science and Technology, VAST2012, Seattle, WA, USA, October 14-19, 2012. 2012, pp. 143–152. doi: 10.1109/VAST.2012.6400557. url: https://doi.org/10.1109/VAST.2012.6400557.
[20] Deepayan Chakrab. and Kunal Punera. “Event Summarization Using Tweets”.In: (2011).
[21] A. L. Chandrasegaran, David F. Treagust, and Mauro Mocerino. “The de-velopment of a two-tier multiple-choice diagnostic instrument for evaluatingsecondary school students’ ability to describe and explain chemical reactionsusing multiple levels of representation”. In: Chem. Educ. Res. Pract. 8 (32007), pp. 293–307.
[22] Ling Chen and Abhishek Roy. “Event Detection from Flickr Data ThroughWavelet-based Spatial Analysis”. In: Proceedings of the 18th ACM Conferenceon Information and Knowledge Management. CIKM ’09. Hong Kong, China:ACM, 2009, pp. 523–532. isbn: 978-1-60558-512-3. doi: 10.1145/1645953.1646021. url: http://doi.acm.org/10.1145/1645953.1646021.
[23] Flavio Chierichetti et al. “Event Detection via Communication Pattern Anal-ysis”. In: Proceedings of the Eighth International Conference on Weblogs andSocial Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014.2014. url: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8088.
[24] Konstantina Chrysafiadi and Maria Virvou. “Review: Student Modeling Ap-proaches: A Literature Review for the Last Decade”. In: Expert Syst. Appl.40.11 (Sept. 2013), pp. 4715–4729. issn: 0957-4174. doi: 10.1016/j.eswa.2013.02.007. url: http://dx.doi.org/10.1016/j.eswa.2013.02.007.
[25] Clyde H Coombs, John Edgar Milholland, and Frank Burton Womer. “Theassessment of partial knowledge”. In: Educational and Psychological Measure-ment 16.1 (1956), pp. 13–37.
[26] J. E. Corter et al. “Bugs and biases: Diagnosing misconceptions in the under-standing of diagrams”. In: Proceedings of the 31st Annual Conference of theCognitive Science Society. Ed. by N. A. Taatgen and H. van Rijn. Austin, TX:Cognitive Science Society, 2009, pp. 756–761.
[27] Thomas Deane et al. “Development of the biological experimental designconcept inventory (BEDCI)”. In: CBE-Life Sciences Education 13.3 (2014),pp. 540–551.
70
[28] John S. Denker and Yann LeCun. “Transforming Neural-Net Output Levels toProbability Distributions”. In: NIPS. Morgan Kaufmann, 1990, pp. 853–859.
[29] Marilu Dick-Perez et al. “A quantum chemistry concept inventory for physicalchemistry classes”. In: Journal of Chemical Education 93.4 (2016), pp. 605–612.
[30] Jerome Epstein. “Development and validation of the Calculus Concept Inven-tory”. In: Proceedings of the ninth international conference on mathematicseducation in a global community. Vol. 9. Charlotte, NC. 2007, pp. 165–170.
[31] Geir Evensen. “The Ensemble Kalman Filter: theoretical formulation and prac-tical implementation”. In: 53 (2003), pp. 343–367. doi: 10.1007/s10236-003-0036-9.
[32] Tristan Fletcher. “The Kalman Filter Explained”. 2010.
[33] Yarin Gal and Zoubin Ghahramani. “A Theoretically Grounded Application ofDropout in Recurrent Neural Networks”. In: Advances in Neural InformationProcessing Systems 29: Annual Conference on Neural Information ProcessingSystems 2016, December 5-10, 2016, Barcelona, Spain. 2016, pp. 1019–1027.
[34] Yarin Gal and Zoubin Ghahramani. “Dropout As a Bayesian Approximation:Representing Model Uncertainty in Deep Learning”. In: Proceedings of the 33rdInternational Conference on International Conference on Machine Learning -Volume 48. ICML’16. New York, NY, USA: JMLR.org, 2016, pp. 1050–1059.url: http://dl.acm.org/citation.cfm?id=3045390.3045502.
[35] Kathy Garvin-Doxas and Michael W Klymkowsky. “Understanding random-ness and its impact on student learning: lessons learned from building the Bi-ology Concept Inventory (BCI)”. In: CBE-Life Sciences Education 7.2 (2008),pp. 227–233.
[36] Arthur Gelb. “Applied Optimal Estimation”. In: The MIT Press, 1974. isbn:0262570483, 9780262570480.
[37] Felix A. Gers, JÃijrgen Schmidhuber, and Fred Cummins. “Learning to For-get: Continual Prediction with LSTM”. In: Neural Computation 12 (1999),pp. 2451–2471.
[38] George Goguadze et al. “Evaluating a Bayesian Student Model of DecimalMisconceptions”. In: EDM. 2011.
[39] Alex Graves. “Generating Sequences With Recurrent Neural Networks.” In:CoRR (2014). url: https://arxiv.org/pdf/1308.0850.pdf.
71
[40] Alex Graves. “Practical Variational Inference for Neural Networks”. In: Pro-ceedings of the 24th International Conference on Neural Information Process-ing Systems. NIPS’11. Granada, Spain: Curran Associates Inc., 2011, pp. 2348–2356. isbn: 978-1-61839-599-3.
[41] Gary L Gray et al. “The dynamics concept inventory assessment test: Aprogress report and some results”. In: American Society for Engineering Edu-cation Annual Conference & Exposition. 2005.
[42] James H Hanson and Julia MWilliams. “Using writing assignments to improveself-assessment and communication skills in an engineering statics course”. In:Journal of engineering education 97.4 (2008), p. 515.
[43] Habibah Norehan Haron et al. “Self-regulated learning strategies between theperforming and non-performing students in statics”. In: Interactive Collabora-tive Learning (ICL), 2014 International Conference on. IEEE. 2014, pp. 802–805.
[44] Eric L. Haseltine and James B. Rawlings. “Critical Evaluation of ExtendedKalman Filtering and Moving-Horizon Estimation”. In: Industrial & Engi-neering Chemistry Research 44.8 (June 2004), pp. 2451–2460. doi: 10.1021/ie034308l. url: http://dx.doi.org/10.1021/ie034308l.
[45] José Miguel Hernández-Lobato and Ryan P. Adams. “Probabilistic Backprop-agation for Scalable Learning of Bayesian Neural Networks”. In: Proceedings ofthe 32Nd International Conference on International Conference on MachineLearning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015, pp. 1861–1869. url: http://dl.acm.org/citation.cfm?id=3045118.3045316.
[46] David Hestenes and Ibrahim Halloun. “Interpreting the force concept inven-tory”. In: The Physics Teacher 33.8 (1995), pp. 502–506.
[47] David Hestenes, Malcolm Wells, Gregg Swackhamer, et al. “Force conceptinventory”. In: The physics teacher 30.3 (1992), pp. 141–158.
[48] Randall W. Hill, Jr., andW. Lewis Johnson. “Designing an Intelligent TutoringSystem for Database Modelling”. In: Proceedings of the world conference ofartificial intelligence in education. 1993, pp. 273–281.
[49] Geoffrey Hinton et al. “Improving neural networks by preventing co-adaptationof feature detectors”. In: CoRR abs/1207.0580 (2012). url: http://arxiv.org/abs/1207.0580.
[50] Geoffrey E. Hinton and Drew van Camp. “Keeping the Neural Networks Simpleby Minimizing the Description Length of the Weights”. In: Proceedings of the
72
Sixth Annual Conference on Computational Learning Theory. COLT ’93. SantaCruz, California, USA: ACM, 1993, pp. 5–13. isbn: 0-89791-611-5. doi: 10.1145/168304.168306. url: http://doi.acm.org/10.1145/168304.168306.
[51] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In:Neural Comput. 9.9 (Nov. 1997), pp. 1735–1780. issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735. url: http://dx.doi.org/10.1162/neco.1997.9.8.1735.
[52] Anneke Hommels, Akira Murakami, and Nishimura Shin-Ichi. “Comparison ofthe Ensemble Kalman filter with the Unscented Kalman filter: application tothe construction of a road embankment”. In: Proceedings of the 19th EuropeanYoung Geotechnical Engineer Conference. Gyor, Hungary, 2009.
[53] Yuan Huang et al. “Understanding US regional linguistic variation with Twit-ter data analysis”. In: Computers, Environment and Urban Systems (2015).issn: 0198-9715. url: http://www.sciencedirect.com/science/article/pii/S0198971515300399.
[54] Douglas Huffman and Patricia Heller. “What Does the Force Concept Inven-tory Actually Measure?.” In: Physics Teacher 33.3 (1995), pp. 138–43.
[55] Jonathan Hurlock and Max L. Wilson. “Searching Twitter: Separating theTweet from the Chaff.” In: ICWSM. Ed. by Lada A. Adamic, Ricardo A.Baeza-Yates, and Scott Counts. The AAAI Press, 2011.
[56] Tommi S. Jaakkola and Michael I. Jordan. “Bayesian parameter estimation viavariational methods”. In: statistics and computing 10 (Jan. 2000), pp. 25–37.
[57] Anthony Jacobi et al. “A concept inventory for heat transfer”. In: Frontiersin Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003, T3D–12.
[58] Bernard J. Jansen et al. “Twitter Power: Tweets As Electronic Word of Mouth”.In: J. Am. Soc. Inf. Sci. Technol. 60.11 (Nov. 2009), pp. 2169–2188. issn: 1532-2882. doi: 10.1002/asi.v60:11. url: http://dx.doi.org/10.1002/asi.v60:11.
[59] Akshay Java et al. “Why We Twitter: Understanding Microblogging Usage andCommunities”. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007Workshop on Web Mining and Social Network Analysis. WebKDD/SNA-KDD’07. San Jose, California: ACM, 2007, pp. 56–65. isbn: 978-1-59593-848-0. doi:10.1145/1348549.1348556. url: http://doi.acm.org/10.1145/1348549.1348556.
73
[60] Andrew H. Jazwinski. “Stochastic processes and filtering theory”. In: Mathe-matics in science and engineering 64. New York, NY [u.a.]: Acad. Press, 1970.isbn: 0123815509.
[61] Finn V. Jensen and Thomas D. Nielsen. Bayesian Networks and DecisionGraphs. 2nd. Springer Publishing Company, Incorporated, 2007.
[62] Simon J. Julier and Jeffrey K. Uhlmann. “Unscented Filtering and NonlinearEstimation”. In: PROCEEDINGS OF THE IEEE. 2004, pp. 401–422.
[63] Pamela Kalas et al. “Development of a meiosis concept inventory”. In: CBE-Life Sciences Education 12.4 (2013), pp. 655–664.
[64] Andrej Karpathy and Li Fei-Fei. “Deep Visual-Semantic Alignments for Gen-erating Image Descriptions”. In: IEEE Trans. Pattern Anal. Mach. Intell.39.4 (Apr. 2017), pp. 664–676. issn: 0162-8828. doi: 10.1109/TPAMI.2016.2598339. url: https://doi.org/10.1109/TPAMI.2016.2598339.
[65] Matthias Katzfuss, Jonathan R. Stroud, and Christopher K. Wikle. “Under-standing the Ensemble Kalman Filter”. In: The American Statistician 70.4(2016), pp. 350–357. doi: 10.1080/00031305.2016.1141709.
[66] Yoon Kim et al. “Character-aware Neural Language Models”. In: Proceed-ings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI’16.Phoenix, Arizona: AAAI Press, 2016, pp. 2741–2749. url: http://dl.acm.org/citation.cfm?id=3016100.3016285.
[67] Duane Knudson et al. “Development and evaluation of a biomechanics conceptinventory”. In: Sports Biomechanics 2.2 (2003), pp. 267–277.
[68] Fantian Kong et al. “Mobile Robot Localization Based on Extended KalmanFilter”. In: 2006 6th World Congress on Intelligent Control and Automation.Vol. 2. 2006, pp. 9242–9246. doi: 10.1109/WCICA.2006.1713789.
[69] Stephen Krause et al. “Development, testing, and application of a chemistryconcept inventory”. In: Frontiers in Education, 2004. FIE 2004. 34th Annual.IEEE. 2004, T1G–1.
[70] John Krumm and Eric Horvitz. “Eyewitness: Identifying Local Events viaSpace-time Signals in Twitter Feeds”. In: Proceedings of the 23rd SIGSPA-TIAL International Conference on Advances in Geographic Information Sys-tems. GIS ’15. Bellevue, Washington: ACM, 2015, 20:1–20:10. isbn: 978-1-4503-3967-4. doi: 10.1145/2820783.2820801. url: http://doi.acm.org/10.1145/2820783.2820801.
74
[71] Haewoon Kwak et al. “What is Twitter, a Social Network or a News Media?”In: Proceedings of the 19th International Conference on World Wide Web.WWW ’10. Raleigh, North Carolina, USA: ACM, 2010, pp. 591–600. isbn:978-1-60558-799-8. doi: 10.1145/1772690.1772751. url: http://doi.acm.org/10.1145/1772690.1772751.
[72] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. “Simpleand Scalable Predictive Uncertainty Estimation using Deep Ensembles”. In:Advances in Neural Information Processing Systems 30: Annual Conference onNeural Information Processing Systems 2017, 4-9 December 2017, Long Beach,CA, USA. 2017, pp. 6405–6416. url: http://papers.nips.cc/paper/7219-simple - and - scalable - predictive - uncertainty - estimation - using -deep-ensembles.
[73] Norm G. Lederman et al. “Views of nature of science questionnaire: Towardvalid and meaningful assessment of learners’ conceptions of nature of science”.In: Journal of Research in Science Teaching 39.6 (2002), pp. 497–521. issn:1098-2736. doi: 10.1002/tea.10034. url: http://dx.doi.org/10.1002/tea.10034.
[74] Kathy Lee et al. “Adverse Drug Event Detection in Tweets with Semi-SupervisedConvolutional Neural Networks”. In: Proceedings of the 26th InternationalConference on World Wide Web. WWW ’17. Perth, Australia: InternationalWorld Wide Web Conferences Steering Committee, 2017, pp. 705–714. isbn:978-1-4503-4913-0.
[75] Kyumin Lee, Brian David Eoff, and James Caverlee. “Seven Months with theDevils: A Long-Term Study of Content Polluters on Twitter.” In: ICWSM. Ed.by Lada A. Adamic, Ricardo A. Baeza-Yates, and Scott Counts. The AAAIPress, 2011. url: http://dblp.uni-trier.de/db/conf/icwsm/icwsm2011.html#LeeEC11.
[76] Ryong Lee, Shoko Wakamiya, and Kazutoshi Sumiya. “Discovery of UnusualRegional Social Activities Using Geo-tagged Microblogs”. In: World Wide Web14.4 (July 2011), pp. 321–349. issn: 1386-145X.
[77] Richard B Lewis. “Creative Teaching and Learning in a Statics Class.” In:Engineering Education 81.1 (1991), pp. 15–18.
[78] Rui Li et al. “TEDAS: A Twitter-based Event Detection and Analysis System”.In: Proceedings of the 2012 IEEE 28th International Conference on Data En-gineering. ICDE ’12. Washington, DC, USA: IEEE Computer Society, 2012,pp. 1273–1276. isbn: 978-0-7695-4747-3. doi: 10.1109/ICDE.2012.125. url:http://dx.doi.org/10.1109/ICDE.2012.125.
75
[79] Julie C Libarkin and Steven W Anderson. “Development of the geoscienceconcept inventory”. In: Proceedings of the National STEM Assessment Con-ference, Washington DC. 2006, pp. 148–158.
[80] Xiao Lin and Gabriel Terejanu. “Fast Approximate Data Assimilation forHigh-Dimensional Problems”. In: 2017. url: https://arxiv.org/abs/1708.02340.
[81] Thomas A Litzinger et al. “A cognitive study of problem solving in statics”.In: Journal of Engineering Education 99.4 (2010), pp. 337–353.
[82] Ran Liu, Rony Patel, and Kenneth R. Koedinger. “Modeling Common Miscon-ceptions in Learning Process Data”. In: Proceedings of the Sixth InternationalConference on Learning Analytics & Knowledge. LAK ’16. Edinburgh, UnitedKingdom: ACM, 2016, pp. 369–377. isbn: 978-1-4503-4190-5. doi: 10.1145/2883851.2883967. url: http://doi.acm.org/10.1145/2883851.2883967.
[83] David J. C. MacKay. “A Practical Bayesian Framework for BackpropagationNetworks”. In: Neural Comput. 4.3 (May 1992), pp. 448–472. issn: 0899-7667.doi: 10.1162/neco.1992.4.3.448. url: http://dx.doi.org/10.1162/neco.1992.4.3.448.
[84] GermÃąn Kruszewski Marco Baroni, Georgiana Dinu. “Don’t count, predict!A systematic comparison of context-counting vs. context-predicting seman-tic vectors”. In: 52nd Annual Meeting of the Association for ComputationalLinguistics, ACL 2014 - Proceedings of the Conference 1 (2014), pp. 238–247.
[85] A. Marcus et al. “TwitInfo: Aggregating and visualizing microblogs for eventexploration”. In: Proceedings of the 2011 annual conference on Human factorsin computing systems. ACM. 2011, pp. 227–236.
[86] Adam Marcus et al. “Processing and Visualizing the Data in Tweets”. In:SIGMOD Record 40.4 (Dec. 2011), pp. 21–27.
[87] Dimitris Margaritis. “Learning Bayesian Network Model Structure From Data”.PhD thesis. School of Computer Science, Carnegie-Mellon University, 2003.
[88] Jay Martin, John Mitchell, and Ty Newell. “Development of a concept inven-tory for fluid mechanics”. In: Frontiers in Education, 2003. FIE 2003 33rdAnnual. Vol. 1. IEEE. 2003, T3D–23.
[89] Jay Mathews. “Just whose idea was all this testing”. In: The Washington Post14 (2006).
76
[90] Michael Mathioudakis and Nick Koudas. “TwitterMonitor: Trend Detectionover the Twitter Stream”. In: Proceedings of the 2010 ACM SIGMOD In-ternational Conference on Management of Data. SIGMOD ’10. Indianapo-lis, Indiana, USA: ACM, 2010, pp. 1155–1158. isbn: 978-1-4503-0032-2. doi:10.1145/1807167.1807306. url: http://doi.acm.org/10.1145/1807167.1807306.
[91] Polykarpos Meladianos et al. “Degeneracy-Based Real-Time Sub-Event Detec-tion in Twitter Stream.” In: ICWSM. Ed. by Meeyoung Cha, Cecilia Mascolo,and Christian Sandvig. AAAI Press, 2015, pp. 248–257. isbn: 978-1-57735-733-9.
[92] K Clark Midkiff, Thomas A Litzinger, and DL Evans. “Development of en-gineering thermodynamics concept inventory instruments”. In: Frontiers inEducation Conference, 2001. 31st Annual. Vol. 2. IEEE. 2001, F2A–F23.
[93] Eva MillÃąn and JosÃľ-Luis PÃľrez de-la Cruz. “A Bayesian Diagnostic Algo-rithm for Student Modeling and its Evaluation.” In: User Model. User-Adapt.Interact. 12.2-3 (2002), pp. 281–330.
[94] Multiple-Choice Test Preparation Manual.
[95] Mor Naaman, Hila Becker, and Luis Gravano. “Hip and trendy: Characterizingemerging trends on Twitter”. In: JASIST 62.5 (2011), pp. 902–918.
[96] Radford M. Neal. Bayesian Learning for Neural Networks. Secaucus, NJ, USA:Springer-Verlag New York, Inc., 1996. isbn: 0387947248.
[97] Radford M. Neal and Geoffrey E. Hinton. “Learning in Graphical Models”. In:ed. by Michael I. Jordan. Cambridge, MA, USA: MIT Press, 1999. Chap. AView of the EM Algorithm That Justifies Incremental, Sparse, and Other Vari-ants, pp. 355–368. isbn: 0-262-60032-3. url: http://dl.acm.org/citation.cfm?id=308574.308679.
[98] Jeffrey L Newcomer. “Inconsistencies in students’ approaches to solving prob-lems in Engineering Statics”. In: 2010 IEEE Frontiers in Education Conference(FIE). IEEE. 2010, F3G–1.
[99] Jeffrey L Newcomer and Paul S Steif. “Student explanations of answers toconcept questions as a window into prior misconceptions”. In: Proceedings.Frontiers in Education. 36th Annual Conference. IEEE. 2006, pp. 6–11.
[100] Jeffrey L Newcomer and Paul S Steif. “Student thinking about static equilib-rium: Insights from written explanations to a concept question”. In: Journalof Engineering Education 97.4 (2008), pp. 481–490.
77
[101] Jeffrey Nichols, Jalal Mahmud, and Clemens Drews. “Summarizing SportingEvents Using Twitter”. In: Proceedings of the 2012 ACM International Con-ference on Intelligent User Interfaces. IUI ’12. Lisbon, Portugal: ACM, 2012,pp. 189–198. isbn: 978-1-4503-1048-2. doi: 10.1145/2166966.2166999. url:http://doi.acm.org/10.1145/2166966.2166999.
[102] Branislav M Notaros. “Concept inventory assessment instruments for elec-tromagnetics education”. In: Antennas and Propagation Society InternationalSymposium, 2002. IEEE. Vol. 1. IEEE. 2002, pp. 684–687.
[103] Brendan O’Connor, Michel Krieger, and David Ahn. “TweetMotif: ExploratorySearch and Topic Summarization for Twitter.” In: ICWSM. Ed. by William W.Cohen and Samuel Gosling. The AAAI Press, 2010. url: http://dblp.uni-trier.de/db/conf/icwsm/icwsm2010.html#OConnorKA10.
[104] Tokunbo Ogunfunmi and Mahmudur Rahman. “A concept inventory for anelectric circuits course: Rationale and fundamental topics”. In: Proceedings of2010 IEEE International Symposium on Circuits and Systems. IEEE. 2010,pp. 2804–2807.
[105] Levent Ozbek and Umit Ozlale. “Employing the extended Kalman filter inmeasuring the output gap”. In: Journal of Economic Dynamics and Control29.9 (Sept. 2005), pp. 1611–1622.
[106] Leysia Palen et al. “Twitter-based Information Distribution during the 2009Red River Valley Flood Threat”. In: Bulletin of the American Society forInformation Science and Technology (2010).
[107] Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for sen-timent categorization with respect to rating scales”. In: Proceedings of ACL.2005, pp. 115–124.
[108] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “Glove:Global Vectors for Word Representation.” In: EMNLP. Vol. 14. 2014, pp. 1532–1543.
[109] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. “Streaming First StoryDetection with application to Twitter.” In: HLT-NAACL. The Associationfor Computational Linguistics, 2010, pp. 181–189. url: http://dblp.uni-trier.de/db/conf/naacl/naacl2010.html#PetrovicOL10.
[110] Timothy A Philpot et al. “Using games to teach statics calculation procedures:Application and assessment”. In: Computer Applications in Engineering Edu-cation 13.3 (2005), pp. 222–232.
78
[111] Daniela Pohl, Abdelhamid Bouchachia, and Hermann Hellwagner. “AutomaticSub-event Detection in Emergency Management Using Social Media”. In: Pro-ceedings of the 21st International Conference on World Wide Web. WWW ’12Companion. Lyon, France: ACM, 2012, pp. 683–686. isbn: 978-1-4503-1230-1.
[112] Daniela Pohl, Abdelhamid Bouchachia, and Hermann Hellwagner. “Social Me-dia for Crisis Management: Clustering Approaches for Sub-Event Detection”.In: Multimedia Tools and Applications (2013).
[113] M. C. Polson and J. J. Richardson, eds. Foundations of Intelligent TutoringSystems. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1988. isbn: 0-805-80053-0.
[114] Ana-Maria Popescu and Marco Pennacchiotti. “Detecting Controversial Eventsfrom Twitter”. In: Proceedings of the 19th ACM International Conference onInformation and Knowledge Management. CIKM ’10. Toronto, ON, Canada:ACM, 2010, pp. 1873–1876. isbn: 978-1-4503-0099-5. doi: 10.1145/1871437.1871751. url: http://doi.acm.org/10.1145/1871437.1871751.
[115] Kevin Rawson and Tom Stahovich. “Predicting course performance from home-work habits”. In: Proceedings of the 2013 American society for engineeringeducation annual conference and exposition. 2013.
[116] Jim Richardson et al. “Development of a concept inventory for strength ofmaterials”. In: Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1.IEEE. 2003, T3D–29.
[117] Robert Rippey. “Probabilistic testing”. In: Journal of Educational Measure-ment 5.3 (1968), pp. 211–215.
[118] Isabelle Rivals and Léon Personnaz. “A recursive algorithm based on the ex-tended Kalman filter for the training of feedforward neural models”. In: Neu-rocomputing 20.1-3 (1998), pp. 279–294.
[119] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. “Long short-termmemory recurrent neural network architectures for large scale acoustic model-ing”. In: INTERSPEECH 2014, 15th Annual Conference of the InternationalSpeech Communication Association, Singapore, September 14-18, 2014. 2014,pp. 338–342.
[120] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. “Earthquake ShakesTwitter Users: Real-time Event Detection by Social Sensors”. In: Proceedingsof the 19th International Conference on World Wide Web. WWW ’10. Raleigh,North Carolina, USA: ACM, 2010, pp. 851–860. isbn: 978-1-60558-799-8. doi:
79
10.1145/1772690.1772777. url: http://doi.acm.org/10.1145/1772690.1772777.
[121] Hanan Samet et al. “Reading News with Maps by Exploiting Spatial Syn-onyms”. In: Commun. ACM 57.10 (Sept. 2014), pp. 64–77. issn: 0001-0782.doi: 10.1145/2629572. url: http://doi.acm.org/10.1145/2629572.
[122] Jagan Sankaranarayanan et al. “TwitterStand: News in Tweets”. In: Proceed-ings of the 17th ACM SIGSPATIAL International Conference on Advances inGeographic Information Systems. GIS ’09. Seattle, Washington: ACM, 2009,pp. 42–51. isbn: 978-1-60558-649-6. doi: 10.1145/1653771.1653781. url:http://doi.acm.org/10.1145/1653771.1653781.
[123] Antti Savinainen and Philip Scott. “Using the Force Concept Inventory tomonitor student learning and to plan teaching”. In: Physics Education 37.1(2002), p. 53. url: http://stacks.iop.org/0031-9120/37/i=1/a=307.
[124] Erich Schubert, Michael Weiler, and Hans-Peter Kriegel. “SigniTrend: Scal-able Detection of Emerging Topics in Textual Streams by Hashed SignificanceThresholds”. In: Proceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining. KDD ’14. New York, NewYork, USA: ACM, 2014, pp. 871–880. isbn: 978-1-4503-2956-9. doi: 10.1145/2623330.2623740. url: http://doi.acm.org/10.1145/2623330.2623740.
[125] David A. Shamma, Lyndon Kennedy, and Elizabeth F. Churchill. “Tweet theDebates: Understanding Community Annotation of Uncollected Sources”. In:Proceedings of the First SIGMM Workshop on Social Media. WSM ’09. Bei-jing, China: ACM, 2009, pp. 3–10. isbn: 978-1-60558-759-2. doi: 10.1145/1631144.1631148. url: http://doi.acm.org/10.1145/1631144.1631148.
[126] Chao Shen et al. “A Participant-based Approach for Event SummarizationUsing Twitter Streams”. In: Human Language Technologies: Conference ofthe North American Chapter of the Association of Computational Linguistics,Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia,USA. 2013, pp. 1152–1162. url: http://aclweb.org/anthology/N/N13/N13-1135.pdf.
[127] Emir H Shuford Jr, Arthur Albert, and H Edward Massengill. “Admissibleprobability measurement procedures”. In: Psychometrika 31.2 (1966), pp. 125–145.
[128] Sharad Singhal and Lance Wu. “Advances in Neural Information ProcessingSystems 1”. In: ed. by David S. Touretzky. San Francisco, CA, USA: MorganKaufmann Publishers Inc., 1989. Chap. Training Multilayer Perceptrons withthe Extended Kalman Algorithm, pp. 133–140. isbn: 1-558-60015-9.
80
[129] Edward Snelson and Zoubin Ghahramani. “Variable Noise and DimensionalityReduction for Sparse Gaussian processes”. In: UAI ’06, Proceedings of the 22ndConference in Uncertainty in Artificial Intelligence, Cambridge, MA, USA,July 13-16, 2006. 2006.
[130] Daniel Soudry, Itay Hubara, and Ron Meir. “Expectation Backpropagation:Parameter-Free Training of Multilayer Neural Networks with Continuous orDiscrete Weights.” In: NIPS. Ed. by Zoubin Ghahramani et al. 2014, pp. 963–971. url: http://dblp.uni-trier.de/db/conf/nips/nips2014.html#SoudryHM14.
[131] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networksfrom Overfitting”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 1929–1958.
[132] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. “TrainingVery Deep Networks”. In: Proceedings of the 28th International Conferenceon Neural Information Processing Systems - Volume 2. NIPS’15. Montreal,Canada: MIT Press, 2015, pp. 2377–2385. url: http : / / dl . acm . org /citation.cfm?id=2969442.2969505.
[133] Paul S Steif. “An articulation of the concepts and skills which underlie en-gineering statics”. In: Frontiers in Education, 2004. FIE 2004. 34th Annual.IEEE. 2004, F1F–5.
[134] Paul S Steif. “Comparison between performance on a concept inventory andsolving of multifaceted problems”. In: Frontiers in Education, 2003. FIE 200333rd Annual. Vol. 1. IEEE. 2003, T3D–17.
[135] Paul S Steif. “Initial data from a statics concept inventory”. In: Proceedingsof the 2004, American Society of Engineering Education Conference and Ex-position, St. Lake City, UT. 2004.
[136] Paul S Steif and John A Dantzler. “A statics concept inventory: Developmentand psychometric analysis”. In: Journal of Engineering Education 94.4 (2005),p. 363.
[137] Paul S Steif and Mary A Hansen. “New practices for administering and analyz-ing the results of concept inventories”. In: Journal of Engineering Education96.3 (2007), p. 205.
[138] Andrea Stone et al. “The statistics concept inventory: A pilot study”. In:Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003,T3D–1.
81
[139] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence Learningwith Neural Networks”. In: Proceedings of the 27th International Conferenceon Neural Information Processing Systems. NIPS’14. Montreal, Canada: MITPress, 2014, pp. 3104–3112. url: http://dl.acm.org/citation.cfm?id=2969033.2969173.
[140] Pinchas Tamir. “Some issues related to the use of justifications to multiple-choice answers”. In: Journal of Biological Education 23.4 (1989), pp. 285–292.doi: 10.1080/00219266.1989.9655083.
[141] Gabriel A. Terejanu. “Unscented kalman filter tutorial”. In: Workshop onLarge-Scale Quantification of Uncertainty. Sandia National Laboratories. 2009,pp. 1–6.
[142] Michael E. Tipping and Chris M. Bishop. “Probabilistic Principal ComponentAnalysis”. In: Journal of the Royal Statistical Society, Series B 61 (1999),pp. 611–622.
[143] Andranik Tumasjan et al. “Election Forecasts With Twitter”. In: Social Sci-ence Computer Review 29.4 (Nov. 2011), pp. 402–418. issn: 1552-8286. doi:10.1177/0894439310386557.
[144] George Valkanas and Dimitrios Gunopulos. “How the Live Web Feels AboutEvents”. In: Proceedings of the 22Nd ACM International Conference on Infor-mation & Knowledge Management. CIKM ’13. San Francisco, California, USA:ACM, 2013, pp. 639–648. isbn: 978-1-4503-2263-8. doi: 10.1145/2505515.2505572.
[145] Kurt Vanlehn et al. “The Andes Physics Tutoring System: Lessons Learned”.In: Int. J. Artif. Intell. Ed. 15.3 (Aug. 2005), pp. 147–204. issn: 1560-4292.url: http://dl.acm.org/citation.cfm?id=1434930.1434932.
[146] Stella Vosniadou and William F. Brewer. “Mental models of the earth: Astudy of the conceptual change in childhood”. In: Cognitive Psychology (1992),pp. 535–585. doi: doi:10.1016/0010-0285(92)90018-W.
[147] Kathleen E Wage et al. “The signals and systems concept inventory”. In: IEEETransactions on Education 48.3 (2005), pp. 448–461.
[148] Eric A. Wan and Rudolph Van Der Merwe. “The Unscented Kalman Filter forNonlinear Estimation”. In: 2000, pp. 153–158.
[149] Hao Wang et al. “A System for Real-time Twitter Sentiment Analysis of 2012U.S. Presidential Election Cycle”. In: Proceedings of the ACL 2012 SystemDemonstrations. ACL ’12. Jeju Island, Korea: Association for Computational
82
Linguistics, 2012, pp. 115–120. url: http://dl.acm.org/citation.cfm?id=2390470.2390490.
[150] Xiaofeng Wang, Donald E. Brown, and Matthew S. Gerber. “Spatio-temporalmodeling of criminal incidents using geographic, demographic, and twitter-derived information.” In: ISI. Ed. by Daniel Zeng et al. IEEE, 2012, pp. 36–41. isbn: 978-1-4673-2105-1.
[151] Rik Warren, Robert E. Smith, and Anne K. Cybenko. Use of MahalanobisDistance for Detecting Outliers and Outlier Clusters in Markedly Non-normalData: A Vehicular Traffic Example. Tech. rep. Air Force Materiel Command,2011.
[152] Jianshu Weng and Bu-Sung Lee. “Event Detection in Twitter”. In: Proceedingsof the Fifth International Conference on Weblogs and Social Media, Barcelona,Catalonia, Spain. 2011. url: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2767.
[153] Yiming Yang, Tom Pierce, and Jaime Carbonell. “A Study of Retrospectiveand On-line Event Detection”. In: Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development in InformationRetrieval. SIGIR ’98. Melbourne, Australia: ACM, 1998, pp. 28–36. isbn: 1-58113-015-5. doi: 10.1145/290941.290953. url: http://doi.acm.org/10.1145/290941.290953.
[154] J. S. Yedidia, W. T. Freeman, and Y. Weiss. “Constructing Free-energy Ap-proximations and Generalized Belief Propagation Algorithms”. In: IEEE Trans.Inf. Theor. 51.7 (July 2005), pp. 2282–2312. issn: 0018-9448. url: http ://dx.doi.org/10.1109/TIT.2005.850085.
[155] Zhijun Yin et al. “Geographical topic discovery and comparison”. In: Pro-ceedings of the 20th international conference on World wide web. ACM. 2011,pp. 247–256.
[156] Juan Diego Zapata Rivera. “Learning Environments Based on Inspectable Stu-dent Models”. AAINQ83573. PhD thesis. Saskatoon, Canada: University ofSaskatchewan, 2003. isbn: 0-612-83573-1.
[157] Chao Zhang et al. “GeoBurst: Real-Time Local Event Detection in Geo-TaggedTweet Streams.” In: SIGIR. Ed. by Raffaele Perego et al. ACM, 2016, pp. 513–522. isbn: 978-1-4503-4069-4. url: http://dblp.uni-trier.de/db/conf/sigir/sigir2016.html#ZhangZYZZKWH16.
[158] Chao Zhang et al. “TrioVecEvent: Embedding-Based Online Local Event De-tection in Geo-Tagged Tweet Streams”. In: Proceedings of the 2017 ACM
83
SIGKDD International Conference on Knowledge Discovery and Data Min-ing. KDD 2017. Halifax, Nova Scotia, Canada: ACM, 2017.
[159] Siqi Zhao et al. “Human as Real-Time Sensors of Social and Physical Events:A Case Study of Twitter and Sports Games”. In: CoRR abs/1106.4300 (2011).
[160] Xiangmin Zhou and Lei Chen. “Event Detection over Twitter Social MediaStreams”. In: The VLDB Journal 23.3 (June 2014), pp. 381–400. issn: 1066-8888. doi: 10.1007/s00778-013-0320-3. url: http://dx.doi.org/10.1007/s00778-013-0320-3.
[161] Arkaitz Zubiaga et al. “Towards Real-time Summarization of Scheduled Eventsfrom Twitter Streams”. In: Proceedings of the 23rd ACM Conference on Hy-pertext and Social Media. HT ’12. Milwaukee, Wisconsin, USA: ACM, 2012,pp. 319–320. isbn: 978-1-4503-1335-3. doi: 10.1145/2309996.2310053. url:http://doi.acm.org/10.1145/2309996.2310053.
84