Improving the Web Design: Mining Web Data at Cityjob.com
Hing-Po Lo, Linda Lu, Miriam Chan
Department of Management Sciences
City University of Hong Kong, Hong Kong
I. Introduction
Data Mining, Customer Relationship Management, The Web
A. The Web
[Figure: Worldwide Internet Commerce Revenues: Business and Consumer Segments, 1996-2002. Vertical axis: US$B, 0 to 600; horizontal axis: 1996 to 2002; series: Consumer, Business-Business]
• More than 200 million surfers per day
• Huge volume of data captured from the Web
• Only 2% of web data analyzed
B. Customer Relationship Management
• Dot-com companies require the use of CRM to establish a personalized relationship with their customers
• They work in an “information-intensive” and “ultra-competitive” mode
C. Data Mining Tools
• Many software and web vendors offer tools that help to explore and mine web log files.
• Most study the “clickstream” at the “session level”. To conduct CRM, one has to analyze the web log file at the “customer level”.
• Tailor-made software using SAS macros and Enterprise Miner has been developed.
II. The Data
Cityjob.com
• It offers information on almost all posts available from major companies in HK.
• It receives several thousand visitors per day on average.
Study Period: 11 December 2000 to 4 February 2001
Three types of data files:
• Web log files;
• Subscribers’ profiles;
• Jobs’ profiles.
1. Web log files
#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 2000-12-11 00:00:00
#Fields: date time c-ip cs-username s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs(Cookie)
2000-12-11 00:00:00 208.223.166.3 - W3SVC4 PROD5_WEB 202.130.170.225 GET /default.asp - 200 0 15838 645 1297 RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000;+OPN=000;+CTY=091;+RDB=c80200000000000000020028311b1b0000000000000000;+ASPSESSIO
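A minimal Python sketch of how such a log could be read is given here for illustration; the original work used SAS macros, so this is not the authors' code. It assumes that each record is a single space-separated line whose columns follow the #Fields directive, and that the subscriber's login name appears in the cookie as LOGIN=..., as the sample record suggests. The function names and the shortened cookie in the sample are hypothetical.

```python
import re


def parse_w3c_log(lines):
    """Yield one dict per record, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines; only #Fields is needed to name the columns.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip malformed or truncated records
        yield dict(zip(fields, values))


def login_from_cookie(cookie):
    """Return the LOGIN value found in the cookie, or None (assumed layout)."""
    match = re.search(r"LOGIN=([^;&+]+)", cookie)
    return match.group(1) if match else None


# Sample taken from the log excerpt above; the cookie is shortened here.
sample = [
    "#Software: Microsoft Internet Information Server 4.0",
    "#Fields: date time c-ip cs-username s-sitename s-computername s-ip "
    "cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes "
    "cs-bytes time-taken cs(Cookie)",
    "2000-12-11 00:00:00 208.223.166.3 - W3SVC4 PROD5_WEB 202.130.170.225 "
    "GET /default.asp - 200 0 15838 645 1297 "
    "RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000",
]

records = list(parse_w3c_log(sample))
user = login_from_cookie(records[0]["cs(Cookie)"])
```

Keying every hit by the login name rather than by session is what allows the log to be analyzed at the “customer level”.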
2. Subscribers’ profiles
User ID  Age  Sex  Ed. level  P. income  H. income  Country  Marital Status  Em. Status  Occ.
cityjob94290 27 F SEC HK S FT CUS
cityjob94293 26 M DIP 2 HK S FT FIN
cityjob94338 28 F SEC HK S FT ACC
cityjob94345 34 M UC 8 9 HK M FT MGT
Cont’d
Ind  Reg. Date  Interest
HOT 20001030 MKT
BNK 20001030 BANK, FIN, INVEST, MKT
OMF 20001030 ENTER, GAME, HKNEWS, PROPOMF
DPT 20001030 CNEWS, COMPU, ECON, ENTER, HKNEWS,
3. Jobs’ profiles
Job ID        Title                 Type  Work Exp.  Quali.  Industry  Level
cityjobB7200  ORG. MANAGER          IT    4          UC      BANK      MID
cityjobAVU10  EXECUTIVE OFFICER II  LEG   3          DIP     GOV       JUN
cityjobB7040  ASST. ACCOUNTANT      ACC   5          SEC     RET       PRO
cityjobB7530  SALES EXECUTIVE       SAL   4          UC      TDG       JUN
Web log files, Subscribers’ files, Jobs’ files
SAS macros were written to perform the following tasks:
A: Reading the web log files
B: Cleaning the data files
C: Creating new variables
D: Merging the data files
E: Preparing different SAS data files
Useful Summary Information
A. Subscribers’ profiles
B. Jobs’ profiles
C. Web log files
D. Web log files + User ID
E. Web log files + Job ID
[Figure: Relative Percentage of Count in Each Hour. Horizontal axis: Time; vertical axis: Relative Percentage, 0% to 8%]
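As an illustration of how the hourly profile could be derived from the time field of the web log, here is a minimal Python sketch. It is not the authors' SAS code; the function name and the sample times are made up for the example.

```python
from collections import Counter


def hourly_percentages(times):
    """Map each hour (0-23) to its percentage share of all hits."""
    counts = Counter(int(t.split(":")[0]) for t in times)
    total = sum(counts.values())
    if total == 0:
        return {hour: 0.0 for hour in range(24)}
    return {hour: 100.0 * counts.get(hour, 0) / total for hour in range(24)}


# Hypothetical values of the "time" field (hh:mm:ss) from four log records.
times = ["00:00:00", "00:15:42", "09:30:10", "14:05:00"]
shares = hourly_percentages(times)
```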
The most popular jobs
Job ID        Title                                    Industry  Visit No.  Popularity Index
cityjobCM070  OFFICER - CORPORATE BANKING              BNK       7748       100.0
cityjobC8570  ADMINISTRATIVE ASSISTANT                 GOV       6552       84.6
cityjobCDU20  EXECUTIVE TRAINEE - INVESTMENT PRODUCTS  BNK       5148       64.9
cityjobCL580  CONTRACT HOUSING OFFICER                 GOV       4944       63.8
cityjobCK570  EXECUTIVES FOR CORPORATE FINANCE         BNK       4664       60.2
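The slides do not define the popularity index. One reading, assumed here, is the number of visits scaled so that the most visited job scores 100. This assumed formula reproduces the listed values for cityjobC8570, cityjobCL580 and cityjobCK570, but it gives 66.4 rather than the listed 64.9 for cityjobCDU20 (5148 visits), so the authors' definition may differ. A short Python sketch of the assumed formula:

```python
def popularity_index(visits):
    """Scale visit counts so the most visited job scores 100 (assumed formula)."""
    top = max(visits.values())
    return {job: round(100.0 * n / top, 1) for job, n in visits.items()}


# Visit counts taken from the table above.
visits = {
    "cityjobCM070": 7748,
    "cityjobC8570": 6552,
    "cityjobCL580": 4944,
    "cityjobCK570": 4664,
}
index = popularity_index(visits)
```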
Ⅲ. Collaborative Filtering
1. By Association Rules
• Whenever a visitor enquires about a particular job, we can “cross sell” similar jobs by recommending other jobs that have the highest association with the original one.
• The association is based on the click history of all the visitors to the Web.
For example, if
• Job A: cityjobCF520:
Title: Assistant Accountant; Qualification: Diploma; Working experience: one year
then
• Job B: cityjobCF180:
Title: Assistant Accountant; Qualification: Diploma; Working experience: three years
• Job C: cityjobCF100:
Title: Assistant Accountant; Qualification: University/College; Working experience: not specified
• Job D: cityjobCEUJ0:
Title: Assistant Accountant; Qualification: Not specified; Working experience: two years
This group of 4 jobs has a
• Confidence Value of 50.3%: given that a visitor enquires about job A, the probability that he would also enquire about jobs B, C, and D is 0.503;
• Lift Value of 298.46: if a visitor has enquired about job A, he is almost 300 times more likely to enquire about jobs B, C, and D than a visitor chosen at random.
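Under the standard definitions, confidence is P(B, C, D | A) and lift is that confidence divided by the unconditional P(B, C, D). The reported values would then imply that only about 0.503 / 298.46 ≈ 0.17% of all visitors enquire about jobs B, C and D together. The Python sketch below computes both measures from per-visitor click histories. It illustrates the standard definitions only and is not the Enterprise Miner implementation used in the study; the toy click histories are invented.

```python
def confidence_and_lift(baskets, antecedent, consequent):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent <= b)
    n_c = sum(1 for b in baskets if consequent <= b)
    n_ac = sum(1 for b in baskets if (antecedent | consequent) <= b)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = n_ac / n_a            # P(consequent | antecedent)
    lift = confidence / (n_c / n)      # relative to P(consequent)
    return confidence, lift


# Invented click histories: the set of jobs each visitor enquired about.
baskets = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A"},
    {"A", "B"},
    {"E"},
    {"E", "F"},
    {"F"},
    {"B", "C", "D"},
]
conf, lift = confidence_and_lift(baskets, {"A"}, {"B", "C", "D"})
```

In this toy data the rule has a confidence of 0.5 and a lift of 4/3, because jobs B, C and D are already visited together by 3 of the 8 visitors.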
2. By Popularity Index
For example, if
• Job A: cityjobCDU20
Title: EXECUTIVE TRAINEE - INVESTMENT PRODUCTS, Type: FIN, Working Experience: 0, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 64.9.
then (with the same type, industry and qualification)
• Job B: cityjobCM470
Title: ASSOCIATE (TREASURY), Type: FIN, Working Experience: 3, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 59.2.
• Job C: cityjobCM470
Title: ASSOCIATES (CRM), Type: FIN, Working Experience: 2, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 44.6.
• Job D: cityjobCFLC0
Title: DEALER & INVESTOR ADVISOR, Type: FIN, Working Experience: 3, Qualification: UC, Industry: BNK, Level: PRO, Index of popularity: 36.6.
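The popularity-based recommendation can be sketched in Python as follows: find the jobs that share the enquired job's type, industry and qualification, then rank them by popularity index. The function name, the dictionary layout and the job "hypotheticalGOV01" are assumptions made for the example; the other three jobs are taken from the slide above.

```python
def recommend_by_popularity(jobs, job_id, top_n=3):
    """Return up to top_n job IDs matching type, industry and qualification."""
    target = jobs[job_id]
    key = (target["type"], target["industry"], target["quali"])
    matches = [
        (jid, job) for jid, job in jobs.items()
        if jid != job_id and (job["type"], job["industry"], job["quali"]) == key
    ]
    # Most popular first.
    matches.sort(key=lambda item: item[1]["index"], reverse=True)
    return [jid for jid, _ in matches[:top_n]]


jobs = {
    "cityjobCDU20": {"type": "FIN", "industry": "BNK", "quali": "UC", "index": 64.9},
    "cityjobCM470": {"type": "FIN", "industry": "BNK", "quali": "UC", "index": 59.2},
    "cityjobCFLC0": {"type": "FIN", "industry": "BNK", "quali": "UC", "index": 36.6},
    # Hypothetical job that should not be recommended (different type and industry).
    "hypotheticalGOV01": {"type": "LEG", "industry": "GOV", "quali": "DIP", "index": 90.0},
}
recommended = recommend_by_popularity(jobs, "cityjobCDU20")
```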
Ⅳ. Predictive Models
1. Churn (Attrition) model
To identify subscribers who are likely to stop visiting the Web site, so that Cityjob.com can take action to retain them. It is often less expensive to retain subscribers than to win them back.
2. Popular job model
What are the characteristics of jobs that would attract more visitors? Are they related to their job type and job industry?
1. The Churn (Attrition) Model
• Sample: All subscribers of Cityjob.com.
• Dependent Variable: Visit = 1 if the subscriber has visited Cityjob.com during the study period; Visit = 0 otherwise.
• Factors used: Gender; Age; Educational Level; dummy variables for interest and country; no. of days since registration.
• Sampling procedure: Stratified sampling based on the variable “Visit” is used to obtain an equal number of observations from the two groups of subscribers (Visit = 1 and Visit = 0).
• Data partition: Training data 70%, Validation data 30%
• Lift Chart
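The sampling and partition steps can be sketched in Python as below. This is an illustration only: the study used SAS and Enterprise Miner, whose partitioning may differ (for instance, it may stratify the split as well). The function names, the seed and the toy subscriber records are assumptions.

```python
import random


def balanced_sample(records, label_key="visit", seed=0):
    """Draw equally many records with label 1 and with label 0."""
    rng = random.Random(seed)
    ones = [r for r in records if r[label_key] == 1]
    zeros = [r for r in records if r[label_key] == 0]
    n = min(len(ones), len(zeros))
    sample = rng.sample(ones, n) + rng.sample(zeros, n)
    rng.shuffle(sample)
    return sample


def partition(records, train_fraction=0.7, seed=0):
    """Randomly split records into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data: 100 subscribers, 30 of whom visited during the study period.
subscribers = [{"id": i, "visit": 1 if i < 30 else 0} for i in range(100)]
sample = balanced_sample(subscribers)
train, valid = partition(sample)
```

Balancing the two groups keeps the rarer outcome from being swamped when the logistic regression is fitted; predicted probabilities from such a sample refer to the balanced data, not to the full subscriber base.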
Churn model (logistic regression)
Important factors:
1. No. of days since registration
2. Educational level
3. Gender
4. Whether the subscriber is interested in computer games
2. The Popular Job Model
• Sample: All jobs advertised on Cityjob.com.
• Dependent Variable: Popular = 1 if the job has been visited at least 20 times; Popular = 0 otherwise.
• Factors used: Dummy variables for different job types, job industries, job level, qualification required, and working experience.
• Data partition: Training data 70%, Validation data 30%
• Missing values: missing values for working experience and qualification required were replaced by 0 and 3 (Secondary school completed) respectively.
• Lift Chart
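The labelling and missing-value rules above can be expressed in a few lines of Python. The field names work_exp and quali, the use of None for a missing value, and the sample job ID are assumptions for the example; the threshold of 20 visits and the replacement values 0 and 3 come from the slide.

```python
def prepare_job(job, visits, threshold=20):
    """Fill missing values and attach the Popular label to one job record."""
    row = dict(job)
    if row.get("work_exp") is None:
        row["work_exp"] = 0   # missing working experience -> 0
    if row.get("quali") is None:
        row["quali"] = 3      # missing qualification -> 3 (Secondary school completed)
    row["popular"] = 1 if visits >= threshold else 0
    return row


# Hypothetical job record with both fields missing and 25 recorded visits.
prepared = prepare_job({"job_id": "exampleJob01", "work_exp": None, "quali": None}, visits=25)
```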
Popular job model (logistic regression)
Important factors:
1. Higher qualification (more likely)
2. Higher level (more likely)
3. Job industries: accounting, banking, building, construction (more likely)
4. Job types: art/design/creative, engineering, sales (less likely)
Ⅴ. Recommendation
1. Web Design
a. To develop a collaborative filtering system
b. To include a popularity index
2. Marketing Strategies
a. To develop appropriate marketing strategies for customer retention
b. To develop Cityjob.com’s own web monitoring system
Ⅵ. Unexpected Discovery
One user came every day during the study period at exactly the same time (4:00 a.m. HK time) and stayed for one to three hours, browsing more than 500 pages each time (on average 5 seconds per page).
The End