Feature Engineering Studio February 23, 2015. Let’s start by discussing the HW.


Feature Engineering Studio

February 23, 2015

Let’s start by discussing the HW

Assignment 3

• Data Cleaning

• Look for outliers in your data set
• Find 3 variables that have one or more outliers (if you can)
• Identify those variables
• Give the mean, median, SD, and some outlier values in them
• For each variable, write a 1-sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
• Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
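The summary statistics the assignment asks for can also be computed outside Excel. The following is a minimal Python sketch, not the course's method: the function name `outlier_summary`, the z-score rule, the threshold of 2.0, and the sample values are all assumptions made for illustration.

```python
import statistics

def outlier_summary(values, z_threshold=3.0):
    """Return mean, median, sample SD, and values far from the mean.

    Flagging by z-score is an assumed rule of thumb; the assignment
    leaves the definition of "outlier" to your judgment.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation
    outliers = [v for v in values
                if sd > 0 and abs(v - mean) / sd > z_threshold]
    return {"mean": mean, "median": median, "sd": sd, "outliers": outliers}

# Hypothetical response times in seconds; 300.0 is the suspicious value.
times = [4.0, 5.0, 6.0, 5.5, 4.5, 5.0, 6.5, 5.0, 4.0, 5.5, 300.0]
summary = outlier_summary(times, z_threshold=2.0)
print(summary)
```

Note that a single extreme value inflates both the mean and the SD, which is one reason to report the median alongside them.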

Everyone will present an outlier

• Alphabetical order based on first name
– Tie-breaker: last name
• I’ll call out letters
– Using the class roster failed last time

Tell us about your best outlier

• Mean, median, SD, and some outlier values
• Give your “just so story” (or multiple just so stories) about what might have caused the outlier(s)

• What do you plan to do about it (if anything)?

Questions? Comments?

Things you can do in Excel part 2 of 3

Identifying specific cases of interest

Did event of interest ever occur for student?

Ratios between events of interest

How many students had 3 (or 4, 5, 2,…) of an event

Unitized actions (such as unitized time)

Last 3 or 5 unitized

Comparing earlier behaviors to later behaviors through caching

Counts-if

Percentages of action type

Percentages of time spent per action/location/KC/etc.

List merging

Pearson Correlation

T-tests
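Several of the spreadsheet operations listed above (whether an event ever occurred for a student, counts-if, percentage of action type, percentage of time spent) amount to grouping log rows by student and aggregating. A rough Python sketch of that idea follows; the log layout, the column meanings, and the name `student_features` are hypothetical and not taken from the class materials.

```python
from collections import defaultdict

# Hypothetical log rows: (student, action_type, seconds_spent)
log = [
    ("s1", "attempt", 10.0),
    ("s1", "help", 30.0),
    ("s1", "attempt", 20.0),
    ("s2", "attempt", 5.0),
    ("s2", "attempt", 15.0),
]

def student_features(rows, event="help"):
    """Compute per-student features about one event of interest."""
    by_student = defaultdict(list)
    for student, action, seconds in rows:
        by_student[student].append((action, seconds))

    features = {}
    for student, actions in by_student.items():
        n_event = sum(1 for a, _ in actions if a == event)
        total_time = sum(s for _, s in actions)
        event_time = sum(s for a, s in actions if a == event)
        features[student] = {
            "ever": n_event > 0,                   # did the event ever occur?
            "count": n_event,                      # counts-if
            "pct_actions": n_event / len(actions), # share of actions
            "pct_time": event_time / total_time if total_time else 0.0,
        }
    return features

feats = student_features(log)
print(feats)
```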

More complex stats in Excel

• I have worksheets that can do Chi-squared, Cohen’s Kappa, Extra-Sum-of-Squares F-test, and various meta-analytic methods in Excel

• But if you don’t really know what you’re doing, it’s better to use a stats package for these

What else might you want to do in Excel?

Questions? Comments?

HW4

• Feature Engineering 1

“Bring Me a Rock”

• Get your data set
• Open it in Excel
• Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground truth variable
– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
• For each feature, write a 1-3 sentence “just so story” for why it might work
• Test how good each feature is
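As an illustration of the “time for last 3 actions” example, here is a small Python sketch of a rolling-window feature. It assumes the window includes the current action and that early rows simply use however many actions exist so far; both are choices made for this sketch, and the function name `time_for_last_n` is hypothetical.

```python
def time_for_last_n(durations, n=3):
    """For each action, sum the time of that action and the n-1 before it.

    Assumption: rows are in chronological order for a single student.
    Early rows use a shorter window rather than being left blank.
    """
    out = []
    for i in range(len(durations)):
        window = durations[max(0, i - n + 1): i + 1]
        out.append(sum(window))
    return out

# Hypothetical per-action times in seconds for one student.
rolling = time_for_last_n([2, 4, 6, 8], n=3)
print(rolling)  # [2, 6, 12, 18]
```

Changing `n` from 3 to 4 gives the kind of “variation on a theme” that does not count as a separate feature.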

Testing Feature Goodness

• For this assignment, there are a bunch of ways to test feature goodness

• Single-feature prediction models in data mining or stats package, giving Pearson correlation, Spearman’s rho, or Cohen’s kappa (special session this Wednesday)

• Compute Pearson correlation in Excel
• Compute t-test in Excel
• Compute other metrics in Excel (but see earlier disclaimer)
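For readers who want to check their spreadsheet results, the two Excel-friendly metrics above can be sketched in plain Python. This is an illustrative sketch only: `pearson_r` and `welch_t` are hypothetical helper names, the data are made up, and the t statistic shown is the unequal-variance (Welch) form without a p-value, which is one of several t-test variants.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between a feature and a numeric label."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def welch_t(group_a, group_b):
    """t statistic comparing a feature's mean across two label groups."""
    va = statistics.variance(group_a)  # sample variances
    vb = statistics.variance(group_b)
    se = math.sqrt(va / len(group_a) + vb / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Hypothetical feature values and labels.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
t = welch_t([5, 6, 7], [1, 2, 3])
print(round(r, 3), round(t, 3))
```

The sign of `r` (or of `t`) is what tells you whether a feature moves in the direction your just so story predicted.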

Were you right?

• Which of your “just so stories” seem to be correct?

• Did any of your features correlate in the opposite direction from what you expected?

Assignment 4

• Write a brief report for me
• Email me an Excel sheet with your features
• You don’t need to prepare a presentation
• But be ready to discuss your features in class

Next Classes

• 2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
• 3/2 Advanced Feature Distillation in Excel
– HW4 due