Featurizing log data before XGBoost
Xavier Conort, Thursday, August 20, 2015
The competition host
● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260 international courses
● high dropout rate
Problem to solve
● challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows)
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out

Log data (5890 objects)
Team
● Chief Product Officer
● Chief Data Scientist
● Data Scientist
● Data Scientist (O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of our time was spent here.
● delegated the modeling part to DataRobot to:
○ find the best algorithm (with XGBoost as the winner!)
○ model text features
○ tune hyperparameters
○ experiment with different feature sets and blend 8 XGBoost models using different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text, on which we ran
○ SVD on 3-grams
○ DataRobot text mining solution
● first 20 components of SVD on the user x object matrix
NB: removed duplicated log info and used training + test sets to build most features
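The deck shows no code for the user x object SVD, so here is a hedged sketch of the idea, in Python rather than the R the team used; the data and names are invented for illustration.

```python
# Hedged sketch of "first 20 components of SVD on user x object"
# (illustrative only; this toy data stands in for log_train + log_test).
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

logs = pd.DataFrame({
    "username": ["u1", "u1", "u2", "u2", "u3", "u3", "u3", "u4"],
    "object":   ["o1", "o2", "o1", "o3", "o2", "o3", "o4", "o1"],
})

# Build a sparse user x object count matrix.
users = pd.Categorical(logs["username"])
objects = pd.Categorical(logs["object"])
mat = csr_matrix(
    (np.ones(len(logs)), (users.codes, objects.codes)),
    shape=(len(users.categories), len(objects.categories)),
)

# The slide keeps the first 20 components; the toy matrix caps us lower.
n_comp = min(20, mat.shape[0] - 1, mat.shape[1] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=0)
user_factors = svd.fit_transform(mat)  # one dense feature row per user
print(user_factors.shape)
```

Each user then contributes a small dense block of latent-interaction features that can be joined onto the enrollment table.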
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio of unique log counts over log counts
● unique log counts by event (navigate, access, problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5, 10 and 30 days before)
● sequence number of the enrollment in that course
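A minimal sketch of how such counts can be derived from the log table (illustrative pandas code, not the team's R; the toy rows are invented, and column names follow the data description above):

```python
# Hedged sketch of the enrollment count features listed on the slide.
import pandas as pd

logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 2, 2, 3],
    "event":  ["navigate", "access", "access", "video",
               "access", "problem", "navigate"],
    "object": ["o1", "o2", "o2", "o3", "o1", "o4", "o1"],
})

g = logs.groupby("enrollment_id")
feats = pd.DataFrame({
    "log_count": g.size(),                         # raw log counts
    "unique_object_count": g["object"].nunique(),  # unique log counts
})
# ratio of unique log counts over log counts
feats["unique_ratio"] = feats["unique_object_count"] / feats["log_count"]
# unique log counts by event: dedupe, then one column per event type
by_event = (logs.drop_duplicates()
                .pivot_table(index="enrollment_id", columns="event",
                             values="object", aggfunc="count", fill_value=0))
feats = feats.join(by_event)
print(feats)
```

The before-end-of-course variants would just filter the log rows on time before aggregating in the same way.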
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of the enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and the midpoint between first and last log
● log interval stats (mean, 90th, 99th and 100th quantiles)
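A few of these time stats, sketched in pandas under the same caveats (illustrative code with made-up timestamps, not the team's R):

```python
# Hedged sketch of enrollment time stats: min/mean/max log times,
# first-to-last span, mean-vs-midpoint gap, and log interval stats.
import pandas as pd

logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2014-06-01 09:00", "2014-06-02 10:00", "2014-06-08 21:00",
        "2014-06-03 08:00", "2014-06-04 08:00",
    ]),
})

g = logs.groupby("enrollment_id")["time"]
stats = pd.DataFrame({
    "first_log": g.min(),
    "last_log": g.max(),
    "mean_log": g.mean(),
})
# gap between first and last log of the enrollment
stats["span"] = stats["last_log"] - stats["first_log"]
# difference between mean log time and the midpoint of first/last log
stats["mean_vs_midpoint"] = stats["mean_log"] - (
    stats["first_log"] + stats["span"] / 2)
# log interval stats: gaps between consecutive logs, then their mean
intervals = g.apply(lambda s: s.sort_values().diff().dropna())
interval_mean = intervals.groupby("enrollment_id").mean()
print(stats)
```

Quantiles of the same `intervals` series would give the 90th/99th/100th-percentile features.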
Enrollment entropy features: enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before the last log
● object (when event == problem)
● chapter ids

Example of entropy feature: for each weekday, compute
- log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count
then sum over weekdays => weekday_entropy[enrollment_id == 1] = 1.589988
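The weekday entropy example can be sketched as runnable code (Python for illustration, not the team's R; the toy logs are invented, so the value differs from the slide's 1.589988):

```python
# Hedged sketch of the weekday entropy feature:
# sum over weekdays of -log(p) * p, with p = weekday_log_count / enrollment_log_count.
import numpy as np
import pandas as pd

logs = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 1, 1, 2, 2],
    "weekday": ["Mon", "Mon", "Tue", "Wed", "Wed", "Wed", "Sat", "Sat"],
})

def entropy(counts):
    p = counts / counts.sum()             # weekday_log_count / enrollment_log_count
    return float(-(p * np.log(p)).sum())  # sum of -log(p) * p

weekday_counts = logs.groupby(["enrollment_id", "weekday"]).size()
weekday_entropy = weekday_counts.groupby("enrollment_id").apply(entropy)
print(weekday_entropy)
```

An enrollment whose logs all fall on one weekday gets entropy 0; logs spread evenly over k weekdays give log(k), so higher values mean more evenly spread activity.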
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated the sequences as 4 text variables. For each, ran
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regression & Nystroem SVM on (tuned) n-grams
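The "SVD on 3-grams" step can be sketched as follows (a hedged illustration, not the team's code; the event sequences below are invented, and tiny toy data forces far fewer than the 10 components kept in practice):

```python
# Hedged sketch: each enrollment's event sequence becomes one text
# document, we count word 3-grams, then keep the leading SVD components.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# One "document" per enrollment: its events in time order.
event_sequences = [
    "navigate access video access problem video page_close",
    "navigate access access video video problem discussion",
    "access problem problem wiki navigate access video",
    "navigate video page_close navigate access problem video",
]

vec = CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+")
X = vec.fit_transform(event_sequences)  # enrollments x 3-gram counts

# The slide keeps the first 10 components; 4 toy documents cap us at 3.
n_comp = min(10, X.shape[0] - 1, X.shape[1] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=0)
components = svd.fit_transform(X)
print(components.shape)
```

The weekday and object sequences would go through the same vectorize-then-SVD pipeline, one text variable each.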
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time stats
● enrollment count
● binary indicator of whether the user signed up for each of the 38 courses
● unique log count
● mean log time interval
● sequence number of the enrollment for that user
User entropy features: user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regression + Nystroem SVM on (tuned) n-grams
How we got to the TOP 3
● entropy features mentioned before: added ~0.001 to AUC (vs less powerful features)
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign-up counts and day entropy in the 10 days after end of course
○ time to sign up for a new course
○ time until the next log for the same user
=> added ~0.002 to AUC
XGBoost
Thank you!