Featurizing log data before XGBoost
Xavier Conort, Thursday, August 20, 2015
The competition host
● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260 international courses
● high dropout rate
Problem to solve
● challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows)
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out

Log data (5890 objects)
Team
● Chief Product Officer
● Chief Data Scientist
● Data Scientist
● Data Scientist (O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of our time was spent here.
● delegated the modeling part to DataRobot to:
○ find the best algorithm (with XGBoost as the winner!)
○ model text features
○ tune hyperparameters
○ experiment with different feature sets and blend 8 XGBoost models using different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text, on which we ran
○ SVD on 3-grams
○ DataRobot text mining solution
● first 20 components of SVD on the user x object matrix
NB: removed duplicated log info and used training + test sets to build most features
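The deck shows no code for the user x object SVD, so here is a hedged sketch of the idea, in Python rather than the R the team used; the data and names are invented for illustration.

```python
# Hedged sketch of "first 20 components of SVD on user x object"
# (illustrative only; this toy data stands in for log_train + log_test).
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

logs = pd.DataFrame({
    "username": ["u1", "u1", "u2", "u2", "u3", "u3", "u3", "u4"],
    "object":   ["o1", "o2", "o1", "o3", "o2", "o3", "o4", "o1"],
})

# Build a sparse user x object count matrix.
users = pd.Categorical(logs["username"])
objects = pd.Categorical(logs["object"])
mat = csr_matrix(
    (np.ones(len(logs)), (users.codes, objects.codes)),
    shape=(len(users.categories), len(objects.categories)),
)

# The slide keeps the first 20 components; the toy matrix caps us lower.
n_comp = min(20, mat.shape[0] - 1, mat.shape[1] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=0)
user_factors = svd.fit_transform(mat)  # one dense feature row per user
print(user_factors.shape)
```

Each user then contributes a small dense block of latent-interaction features that can be joined onto the enrollment table.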
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio of unique log counts over log counts
● unique log counts by event (navigate, access, problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5, 10 and 30 days before)
● sequence number of the enrollment in that course
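A minimal sketch of how such counts can be derived from the log table (illustrative pandas code, not the team's R; the toy rows are invented, and column names follow the data description above):

```python
# Hedged sketch of the enrollment count features listed on the slide.
import pandas as pd

logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 2, 2, 3],
    "event":  ["navigate", "access", "access", "video",
               "access", "problem", "navigate"],
    "object": ["o1", "o2", "o2", "o3", "o1", "o4", "o1"],
})

g = logs.groupby("enrollment_id")
feats = pd.DataFrame({
    "log_count": g.size(),                         # raw log counts
    "unique_object_count": g["object"].nunique(),  # unique log counts
})
# ratio of unique log counts over log counts
feats["unique_ratio"] = feats["unique_object_count"] / feats["log_count"]
# unique log counts by event: dedupe, then one column per event type
by_event = (logs.drop_duplicates()
                .pivot_table(index="enrollment_id", columns="event",
                             values="object", aggfunc="count", fill_value=0))
feats = feats.join(by_event)
print(feats)
```

The before-end-of-course variants would just filter the log rows on time before aggregating in the same way.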
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of the enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and the midpoint between first and last log
● log interval stats (mean, 90th, 99th and 100th quantiles)
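A few of these time stats, sketched in pandas under the same caveats (illustrative code with made-up timestamps, not the team's R):

```python
# Hedged sketch of enrollment time stats: min/mean/max log times,
# first-to-last span, mean-vs-midpoint gap, and log interval stats.
import pandas as pd

logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2014-06-01 09:00", "2014-06-02 10:00", "2014-06-08 21:00",
        "2014-06-03 08:00", "2014-06-04 08:00",
    ]),
})

g = logs.groupby("enrollment_id")["time"]
stats = pd.DataFrame({
    "first_log": g.min(),
    "last_log": g.max(),
    "mean_log": g.mean(),
})
# gap between first and last log of the enrollment
stats["span"] = stats["last_log"] - stats["first_log"]
# difference between mean log time and the midpoint of first/last log
stats["mean_vs_midpoint"] = stats["mean_log"] - (
    stats["first_log"] + stats["span"] / 2)
# log interval stats: gaps between consecutive logs, then their mean
intervals = g.apply(lambda s: s.sort_values().diff().dropna())
interval_mean = intervals.groupby("enrollment_id").mean()
print(stats)
```

Quantiles of the same `intervals` series would give the 90th/99th/100th-percentile features.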
Enrollment entropy features: enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before the last log
● object (when event == problem)
● chapter ids

Example of entropy feature: for each weekday, compute
- log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count
then sum over weekdays => weekday_entropy[enrollment_id == 1] = 1.589988
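The weekday entropy example can be sketched as runnable code (Python for illustration, not the team's R; the toy logs are invented, so the value differs from the slide's 1.589988):

```python
# Hedged sketch of the weekday entropy feature:
# sum over weekdays of -log(p) * p, with p = weekday_log_count / enrollment_log_count.
import numpy as np
import pandas as pd

logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 1, 1, 2, 2],
    "weekday": ["Mon", "Mon", "Tue", "Wed", "Wed", "Wed", "Sat", "Sat"],
})

def entropy(counts):
    p = counts / counts.sum()             # weekday_log_count / enrollment_log_count
    return float(-(p * np.log(p)).sum())  # sum of -log(p) * p

weekday_counts = logs.groupby(["enrollment_id", "weekday"]).size()
weekday_entropy = weekday_counts.groupby("enrollment_id").apply(entropy)
print(weekday_entropy)
```

An enrollment whose logs all fall on one weekday gets entropy 0; logs spread evenly over k weekdays give log(k), so higher values mean more evenly spread activity.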
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated the sequences as 4 text variables. For each, ran
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regression & Nystroem SVM on (tuned) n-grams
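The "SVD on 3-grams" step can be sketched as follows (a hedged illustration, not the team's code; the event sequences below are invented, and tiny toy data forces far fewer than the 10 components kept in practice):

```python
# Hedged sketch: each enrollment's event sequence becomes one text
# document, we count word 3-grams, then keep the leading SVD components.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# One "document" per enrollment: its events in time order.
event_sequences = [
    "navigate access video access problem video page_close",
    "navigate access access video video problem discussion",
    "access problem problem wiki navigate access video",
    "navigate video page_close navigate access problem video",
]

vec = CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+")
X = vec.fit_transform(event_sequences)  # enrollments x 3-gram counts

# The slide keeps the first 10 components; 4 toy documents cap us at 3.
n_comp = min(10, X.shape[0] - 1, X.shape[1] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=0)
components = svd.fit_transform(X)
print(components.shape)
```

The weekday and object sequences would go through the same vectorize-then-SVD pipeline, one text variable each.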
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time stats
● enrollment count
● binary indicator of whether the user signed up for each of the 38 courses
● unique log count
● mean log time interval
● sequence number of the enrollment for that user
User entropy features: user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regression + Nystroem SVM on (tuned) n-grams
How we got to the TOP 3
● entropy features mentioned before: added ~0.001 to AUC (vs less powerful features)
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign-up counts and day entropy in the 10 days after end of course
○ time to sign up for a new course
○ time until the next log for the same user
=> added ~0.002 to AUC
XGBoost
Thank you!