Data Mining At Tech Journal
Agenda
• Background
• Questions of Interest
• Data Overview
• Selected Approach
• Potential Issues
• Current Status
• First Results
The Company
• A US company (“TechJournal”) publishes an on-line journal (“TechPub”) with content specifically aimed at IT professionals
• TechJournal is 15 years old; TechPub is 5 years old
• Content for TechPub comes from three sources:
– Aggregated content from public sources
– TechJournal created content
– Peer contributed content
• TechJournal’s core business is to produce a high-end list product for the marketing departments of IT manufacturers
The Journal
• The content on the publication website is available to both anonymous and registered users
• Registered users get access to some premium services as well
• Most content is free; some whitepapers are for sale
• Three unique features of the site:
– Peer contributed content
– Auction system: readers get paid to contribute content
– New: personalized content for each reader
The Readers
• Target: IT professionals involved in their organization’s technology purchasing decisions
• The company continuously tries to stimulate new readership through e-mail campaigns
• Different levels of “readership”:
– E-mail recipients
– Anonymous visits
– E-mail recipients who visited the site
– E-mail recipients who are repeat visitors
– Registered light readers
– Registered heavy readers
[Figure: readership levels plotted against number of individuals]
The Business Model
[Figure: feedback-loop diagram of the business model]
Elements:
• TechPub Reader Activity
• Knowledge of Readers' Interests
• Quality of List Product
• List Value to Technology Manufacturers
• Gathering New Content
• New Readers: Reader Word of Mouth
• New Readers: Company Prospecting
• Company Resources for Reinvestment
• Total Readers
• Tuning of Content
Loops:
• “Active Readers Produce Better Lists” Loop
• “Known Readers Make For Better Journal” Loop
• “Success Breeds Success” Loop
• “Buzz Marketing” Loop
Focal Areas For Data Mining
[Figure: the business model diagram, showing the “Known Readers Make For Better Journal”, “Active Readers Produce Better Lists”, and “Success Breeds Success” loops]
• Is TechJournal’s current content taxonomy effective, or would some other content taxonomy be more useful?
• Given email recipient attributes, what is the likelihood of a visit to the website?
• Which content headlines would maximize that visit likelihood?
• Given registered readers’ attributes, which stories will they be interested in?
• Given past stories read, what is a registered reader most likely to also read?
• Given registered readers’ attributes, which will be most active?
The Data
My “chunk of data” to mine:
• Issues Table: 713,110 records
• Issues - Content Linker Table: 2,185,664 records
• Content Items Table: 590 records
• Page Visit Table: 43,580 records
• Recipients Table: 195,455 records
• Taxonomy Click Table: 9,385 records
Attributes to Work With
Reader Attributes
• Primary key: Recipient ID, IP Address
• Data mining attributes: Title, City, State, Country, Zip, Phone; IT Budget, Employees, Sales, SIC Code, Industry; Time Sent, Time Opened, Time of Visit, Time Content Click
Content Attributes
• Primary key: Content ID, Issue ID
• Data mining attributes: Abstract, Headline Main, Content Type, Media Type, Author, Content Taxonomy, Click Rate
Format Attributes
• Data mining attributes: Template Type, Media Type (HTML or Video)
Legend: highlighted attributes = features that can be used directly, or used to derive others, for classification
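As an illustration of deriving a feature from the attributes listed above, the sketch below turns Time Sent and Time Opened into a single "hours to open" value. The function name and the timestamp format are assumptions made for the example; the slides do not say how these fields are stored.

```python
from datetime import datetime

# Assumed timestamp layout; the real export format is not given in the slides.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def hours_to_open(time_sent: str, time_opened: str) -> float:
    """Return the hours elapsed between an e-mail being sent and opened."""
    sent = datetime.strptime(time_sent, TIMESTAMP_FORMAT)
    opened = datetime.strptime(time_opened, TIMESTAMP_FORMAT)
    return (opened - sent).total_seconds() / 3600.0
```

A derived value like this can then be binned (for example same day versus later) before being used as a classification feature.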
Creating Content Classes
TechJournal’s current taxonomy for classifying content:
• Manually derived
• Aggregation of other credible taxonomy fragments
• From a content provider point of view
• Goes out to 21 levels in some cases; others are as shallow as three

Level   Classes
1       1
2       5
3       46
4       798
5       1909
...     ...
21      5000+

9,750 visits spread over 31 classes:

#Visits  Content Class
2925     |Software|Business
2736     |Hardware|Storage
1187     |Software|Operating Systems
670      |Hardware|Networking
314      |Software|Software Development
282      |Hardware|Computers
278      |Industries|News
131      |Hardware|Telecom
118      |Industries|IT Management
97       |Hardware|Mobile Devices
75       |Online|Search
53       |Online|Portal
42       |Hardware|Printers
40       |Software|Consumer
38       |Industries|PCs
36       |Industries|Legal
32       |Hardware|Power
28       |Software|Networking
21       |Hardware|News
13       |Industries|Standards
8        |Hardware
8        |Industries|Hacking
7        |Online|News
4        |Online|Software as a Service
4        |Hardware|Chips
4        |Services|Disaster Recovery
3        |Online|Email
3        |Online|IM
2        |Services|Security
1        |Hardware|Software
1        |Services|Software Development
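The visit counts above are heavily concentrated in a few classes. A minimal sketch of that calculation, using only the three largest counts from the slide and its stated total of 9,750 visits (the function name is illustrative):

```python
TOTAL_VISITS = 9750  # total stated on the slide

# The three most-visited classes, copied from the slide.
TOP_CLASSES = {
    "|Software|Business": 2925,
    "|Hardware|Storage": 2736,
    "|Software|Operating Systems": 1187,
}


def cumulative_share(counts: dict, total: int) -> float:
    """Fraction of all visits accounted for by the given classes."""
    return sum(counts.values()) / total
```

Here `cumulative_share(TOP_CLASSES, TOTAL_VISITS)` comes to about 0.70, so the remaining 28 classes share roughly 30 percent of the visits.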
A Variety of Approaches
ASSOCIATION ANALYSIS
• Given past stories read, what is a registered reader most likely to also read?
PREDICTIVE MODELING
• Given email recipient attributes, what is the likelihood of a visit to the website?
• Which content headline would maximize that visit likelihood?
• Given registered readers’ attributes, which readers will be most active?
• Given registered readers’ attributes, which types of content will they read?
CLUSTER ANALYSIS
• Is TechJournal’s current content taxonomy effective, or would some other taxonomy be more useful?
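For the "also read" question, association analysis usually comes down to counting co-occurrences. The sketch below is one minimal way to do that, not the method used in this project: it computes the confidence of "read A, therefore read B" from per-reader story lists. The function names and the toy story IDs in the usage note are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_confidence(histories):
    """Map (a, b) to confidence(a -> b) = readers of both / readers of a."""
    item_count = Counter()
    pair_count = Counter()
    for stories in histories:
        unique = set(stories)  # count each story once per reader
        item_count.update(unique)
        pair_count.update(combinations(sorted(unique), 2))
    rules = {}
    for (a, b), n_both in pair_count.items():
        rules[(a, b)] = n_both / item_count[a]
        rules[(b, a)] = n_both / item_count[b]
    return rules


def most_likely_also_read(rules, story):
    """Return the story with the highest confidence given `story`, or None."""
    candidates = [(conf, other) for (src, other), conf in rules.items() if src == story]
    return max(candidates)[1] if candidates else None
```

With histories `[["a", "b"], ["a", "b", "c"], ["a"], ["c"]]`, every reader of "b" also read "a", so "a" is the suggestion for readers of "b". In practice a minimum support threshold would be needed so that rarely read stories do not produce spurious rules.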
Potential Issues
• Database evolution produces noisy, dirty, unevenly populated data
• Data comes from multiple sources; producing consistent data has been a challenge
• Still not clear whether we will end up with enough data to see anything meaningful
• Content taxonomy is relatively new and most likely has real problems with how it is structured
• Taxonomy measures article subject matter, but behavior-stimulating content may be in headlines
• Features are somewhat related: Industry, Location, Size, Title, Sales, Employees
• Features have a high number of discrete values and need to be put into meaningful groupings
• Under-representation of several feature and class values
Feature Grouping - Location
[Figure: locations assigned to groups 1 to 10, plus group 11 for “Other”]
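A grouping like this can be applied as a simple lookup with a catch-all. In the sketch below only the use of group 11 for "Other" comes from the slide; the state-to-group assignments and the function name are hypothetical placeholders, since the actual assignments are not recoverable from this transcript.

```python
OTHER_GROUP = 11  # from the slide: group 11 is "Other"

# Hypothetical assignments for illustration only.
STATE_TO_GROUP = {"CA": 1, "WA": 1, "NY": 2, "MA": 2}


def loc_group_id(state):
    """Map a raw state value to a location group, defaulting to Other."""
    if state is None:
        return OTHER_GROUP
    return STATE_TO_GROUP.get(state.strip().upper(), OTHER_GROUP)
```

Routing missing and unrecognized values to the catch-all group keeps sparse or dirty location values from creating many tiny categories.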
Feature Grouping - Title
• Start with ~1,000 distinct self-reported titles in the database
• Most interested in title as it correlates with impact and influence on IT buying decisions
• Reclassify them based on three concepts: seniority, function, and employees in company
[Figure: title hierarchy. Labels: Owner, Chairman/CEO; Assistant; Functional Area 1 to Functional Area N; Manager of Managers; Assistant; Functional Area 1 to Functional Area 10; Manager of Doer; Doer. Code labels: 1; 2, 20 - 29; 3, 30 - 39; 4]
Result: 24 categories
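One reading of the code labels on this slide (1; 2 and 20 - 29; 3 and 30 - 39; 4) is that levels 2 and 3 each have a general code plus ten functional-area codes. That interpretation is an assumption, but enumerating it does reproduce the 24 categories:

```python
def title_codes():
    """Enumerate title codes under the assumed scheme 1; 2, 20-29; 3, 30-39; 4."""
    codes = [1]
    for level in (2, 3):
        codes.append(level)  # level with no functional area
        codes.extend(range(level * 10, level * 10 + 10))  # ten functional areas
    codes.append(4)
    return codes


def seniority_level(code):
    """Recover the level (1 to 4) from a code under the same assumption."""
    return code // 10 if code >= 10 else code
```

Under this scheme `len(title_codes())` is 24, matching the slide.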
Where I Am In The Process
Problem Definition
Data Gathering
Data Prep
Data Mining
Results Analysis / Visualization
Sum Up Insights
0.7037 (n = 27)
0.1429 (n = 7)
First Results
Q: Given registered readers’ attributes, which readers will be most active?
Method: Decision Tree Induction (training set: 599 records; test set: 187 records)
MSE on Test Set = 0.1451
MSE on Training Set = 0.1313

n= 786
node), split, n, deviance, yval
      * denotes terminal node

 1) root 786 223508.000 29.44402
   2) LocGrpID< 1.5 96 23784.990 24.01042
     4) RIC>=70.5 53 10433.890 19.66038 *
     5) RIC< 70.5 43 11112.050 29.37209
      10) RIC< 66 33 8432.545 25.27273 *
      11) RIC>=66 10 294.900 42.90000 *
   3) LocGrpID>=1.5 690 196494.400 30.20000
     6) RIC< 71.5 438 127844.900 28.34475
      12) RIC>=14.5 411 120569.000 27.69586 *
      13) RIC< 14.5 27 4468.667 38.22222 *
     7) RIC>=71.5 252 64521.570 33.42460
      14) Title_Code>=38 20 4712.950 20.45000 *
      15) Title_Code< 38 232 56151.570 34.54310 *
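To make the tree output easier to read, the sketch below transcribes its splits and leaf values into a plain function and shows how much the root split reduces the deviance. The function names are illustrative; RIC, LocGrpID and Title_Code are used exactly as they appear in the output, and the slides do not define RIC or the units of the predicted value.

```python
def predict_activity(loc_grp_id, ric, title_code):
    """Return the leaf value (yval) of the printed tree for one reader."""
    if loc_grp_id < 1.5:              # node 2
        if ric >= 70.5:
            return 19.66038           # node 4
        if ric < 66:
            return 25.27273           # node 10
        return 42.90000               # node 11
    if ric < 71.5:                    # node 6
        if ric >= 14.5:
            return 27.69586           # node 12
        return 38.22222               # node 13
    if title_code >= 38:              # node 7
        return 20.45000               # node 14
    return 34.54310                   # node 15


def deviance_reduction(parent, left, right):
    """Drop in deviance achieved by splitting a node into two children."""
    return parent - (left + right)
```

For the root split, `deviance_reduction(223508.0, 23784.99, 196494.4)` is about 3,229, a small fraction of the root deviance, which suggests the location split on its own explains little of the variation.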
First Results
Q: Given the attributes of a registered reader, which content types will they read?
Method: Decision Tree Induction
[Figure: decision tree; leaf callouts 20.45 (n = 20) and 34.54 (n = 232)]
First Results
Q: Given registered readers’ attributes, which types of content will they read?
Method: Kernel SVM with Gaussian Kernel
Overall Training Error = 0.569975

Class codes:
15 |Industries|Hacking          24 |Online|Email                  37 |Services|Security
16 |Industries|IT Management    25 |Online|IM                     42 |Software|Business
17 |Industries|Legal            26 |Online|News                   43 |Software|Consumer
18 |Industries|News             27 |Online|Portal                 44 |Software|Networking
20 |Industries|PCs              30 |Online|Search                 45 |Software|Operating Systems
21 |Industries|Standards        33 |Online|Software as a Service  46 |Software|Software Development

Rows are predicted classes; columns are true classes.
Acc = % of predictions for that class that were accurate.
In-class % = % of each true class that was predicted correctly.

  Acc  Pred     0    1    2    5    6    7    9   10   12   13   16   17   18   20   24   25   27   30   33   42   43   44   45   46
  67%     2     0    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1
  60%     6     0    0    0    0   15    0    0    1    2    1    0    3    0    0    0    0    0    0    0    1    0    0    1    1
  40%    12     0    0    3    0    9    0    0    1   33    1    0    1    5    0    0    0    1    2    0   12    1    3    7    4
  83%    16     0    0    0    0    1    0    0    0    0    0    5    0    0    0    0    0    0    0    0    0    0    0    0    0
  45%    42     3    0   21    5   29    0    2    1   34    3    5    1   17    1    0    0    5    4    1  151    0    1   44    9
  39%    45     0    2   19    6   20    3    3    4   18   10   10    0   16    2    2    1    2    5    0   42    1    3  126   28
  67%    46     0    0    0    0    0    0    0    0    2    0    0    0    0    0    0    0    0    0    0    1    0    0    0    6
In-class %     0%   0%   4%   0%  20%   0%   0%   0%  37%   0%  25%   0%   0%   0%   0%   0%   0%   0%   0%  73%   0%   0%  71%  12%
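The percentages in the confusion matrix above can be recomputed from the counts. The sketch below copies the counts from the slide (rows are predicted classes, columns are true classes) and derives the per-class figures and the overall error; the function names are illustrative.

```python
TRUE_CLASSES = [0, 1, 2, 5, 6, 7, 9, 10, 12, 13, 16, 17,
                18, 20, 24, 25, 27, 30, 33, 42, 43, 44, 45, 46]

# Counts copied from the slide, keyed by predicted class.
ROWS = {
    2:  [0, 0, 2] + [0] * 20 + [1],
    6:  [0, 0, 0, 0, 15, 0, 0, 1, 2, 1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    12: [0, 0, 3, 0, 9, 0, 0, 1, 33, 1, 0, 1, 5, 0, 0, 0, 1, 2, 0, 12, 1, 3, 7, 4],
    16: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5] + [0] * 13,
    42: [3, 0, 21, 5, 29, 0, 2, 1, 34, 3, 5, 1, 17, 1, 0, 0, 5, 4, 1, 151, 0, 1, 44, 9],
    45: [0, 2, 19, 6, 20, 3, 3, 4, 18, 10, 10, 0, 16, 2, 2, 1, 2, 5, 0, 42, 1, 3, 126, 28],
    46: [0] * 8 + [2] + [0] * 10 + [1, 0, 0, 0, 6],
}


def correct(cls):
    """Number of records of class `cls` that were predicted as `cls`."""
    row = ROWS.get(cls)
    return row[TRUE_CLASSES.index(cls)] if row else 0


def prediction_accuracy(cls):
    """Share of predictions of `cls` that were right (the Acc column)."""
    return correct(cls) / sum(ROWS[cls])


def in_class_rate(cls):
    """Share of true `cls` records predicted correctly (the In-class % row)."""
    col = TRUE_CLASSES.index(cls)
    return correct(cls) / sum(row[col] for row in ROWS.values())


def overall_error():
    """Fraction of all records that were misclassified."""
    total = sum(sum(row) for row in ROWS.values())
    return 1 - sum(correct(c) for c in ROWS) / total
```

The counts sum to 786 records and `overall_error()` reproduces the reported 0.569975. The model only ever predicts seven of the classes, and most of its predictions go to classes 42 and 45, which is consistent with the under-representation issue noted earlier.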
Defining Project Success
Success for this project could come in different forms:
• Insights gained on any of the six questions within the project’s scope;
and/or
• Insight into how TechJournal should modify its data capture policies to facilitate data mining for the answers to these questions in the future
Questions/Comments