
Improving the Web Design Mining Web Data at Cityjob.com

Hing-Po Lo, Linda Lu, Miriam Chan

Department of Management Sciences

[email protected]

City University of Hong Kong, Hong Kong

I. Introduction

• The Web

• Customer Relationship Management

• Data Mining

A. The Web

[Figure: Worldwide Internet Commerce Revenues: Business and Consumer Segments, 1996-2002. Y-axis: US$B (0 to 600); x-axis: year (1996 to 2002); series: Consumer, Business-Business.]

• More than 200 million surfers per day

• Huge volume of data captured from the Web

• Only 2% of web data analyzed

B. Customer Relationship Management

• DOT COM companies

• require the use of CRM to establish a personalized relationship with their customers

• work in an “information-intensive” and “ultra-competitive” mode

C. Data Mining Tools

• Many software and web vendors offer tools that help explore and mine web log files.

• Most study the “clickstream” at the “session level”. To conduct CRM, one has to analyze the web log file at the “customer level”.

• Tailor-made software using SAS macros and Enterprise Miner has been developed.

Cityjob.COM

• It offers information on almost all posts available from major companies in HK.

• It receives several thousand visitors per day on average.

II. The Data

Study Period: 11 December 2000 to 4 February 2001

Three types of data files:

• Web log files;

• Subscribers’ profiles;

• Jobs’ profiles.

1. Web log files

#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 2000-12-11 00:00:00
#Fields: date time c-ip cs-username s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs(Cookie)
2000-12-11 00:00:00 208.223.166.3 - W3SVC4 PROD5_WEB 202.130.170.225 GET /default.asp - 200 0 15838 645 1297 RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000;+OPN=000;+CTY=091;+RDB=c80200000000000000020028311b1b0000000000000000;+ASPSESSIO
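A minimal Python sketch of how a log in this layout could be read (the study itself used SAS macros, so this is only an illustration). It assumes fields are space-separated and named by the #Fields line, and that the subscriber's login appears in the cookie as LOGIN=...; the function names are hypothetical, and the cookie in the sample record is shortened.

```python
import re


def parse_w3c_log(lines):
    """Yield one dict per request, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            # Data line: pair each space-separated value with its field name.
            yield dict(zip(fields, line.split()))


def login_from_cookie(cookie):
    """Return the LOGIN value stored in the cookie string, or None if absent."""
    match = re.search(r"LOGIN=([^&;]*)", cookie)
    return match.group(1) if match else None


# Header and record taken from the excerpt above (cookie shortened).
sample = [
    "#Software: Microsoft Internet Information Server 4.0",
    "#Version: 1.0",
    "#Date: 2000-12-11 00:00:00",
    "#Fields: date time c-ip cs-username s-sitename s-computername s-ip "
    "cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes "
    "cs-bytes time-taken cs(Cookie)",
    "2000-12-11 00:00:00 208.223.166.3 - W3SVC4 PROD5_WEB 202.130.170.225 "
    "GET /default.asp - 200 0 15838 645 1297 "
    "RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000",
]

records = list(parse_w3c_log(sample))
print(records[0]["cs-uri-stem"], login_from_cookie(records[0]["cs(Cookie)"]))
```

Keying each request by login rather than by session is one way to move from the “session level” to the “customer level”.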

2. Subscribers’ profiles

Columns: User ID, Age, Sex, Ed. level, P. income, H. income, Country, Marital Status, Em. Status, Occ.

cityjob94290 27 F SEC HK S FT CUS
cityjob94293 26 M DIP 2 HK S FT FIN
cityjob94338 28 F SEC HK S FT ACC
cityjob94345 34 M UC 8 9 HK M FT MGT

2. Subscribers’ profiles (Cont’d)

Columns: Ind, Reg. Date, Interest

HOT 20001030 MKT
BNK 20001030 BANK, FIN, INVEST, MKT
OMF 20001030 ENTER, GAME, HKNEWS, PROPOMF
DPT 20001030 CNEWS, COMPU, ECON, ENTER, HKNEWS,

3. Jobs’ profiles

Job ID        Title                 Type  Work Exp.  Quali.  Industry  Level
cityjobB7200  ORG. MANAGER          IT    4          UC      BANK      MID
cityjobAVU10  EXECUTIVE OFFICER II  LEG   3          DIP     GOV       JUN
cityjobB7040  ASST. ACCOUNTANT      ACC   5          SEC     RET       PRO
cityjobB7530  SALES EXECUTIVE       SAL   4          UC      TDG       JUN

SAS macros were written to perform the following tasks on the web log files, subscribers’ files, and jobs’ files:

A: Reading the web log files

B: Cleaning the data files

C: Creating new variables

D: Merging the data files

E: Preparing different SAS data files

Useful Summary Information

A. Subscribers’ profiles

B. Jobs’ profiles

C. Web log files

D. Web log files + User ID

E. Web log files + Job ID

[Figure: Relative Percentage of Count in Each Hour. X-axis: Time; y-axis: Relative Percentage (0% to 8%).]
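As an illustration of how such an hourly profile could be computed from the time field of the web log, here is a short Python sketch. The function name and the sample timestamps are hypothetical; the study's own figures came from SAS.

```python
from collections import Counter


def hourly_relative_percentage(times):
    """Map each hour (0-23) to its share of all hits, in percent."""
    counts = Counter(int(t.split(":")[0]) for t in times)
    total = sum(counts.values())
    if total == 0:
        return {hour: 0.0 for hour in range(24)}
    return {hour: 100.0 * counts[hour] / total for hour in range(24)}


# Hypothetical "time" values in the hh:mm:ss form used by the log.
times = ["00:00:00", "00:15:10", "13:05:00", "23:59:59"]
profile = hourly_relative_percentage(times)
print(profile[0], profile[13], profile[23])
```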

The most popular jobs

Job ID        Title                                    Industry  Visit No.  Popularity Index
cityjobCM070  OFFICER - CORPORATE BANKING              BNK       7748       100.0
cityjobC8570  ADMINISTRATIVE ASSISTANT                 GOV       6552       84.6
cityjobCDU20  EXECUTIVE TRAINEE - INVESTMENT PRODUCTS  BNK       5148       64.9
cityjobCL580  CONTRACT HOUSING OFFICER                 GOV       4944       63.8
cityjobCK570  EXECUTIVES FOR CORPORATE FINANCE         BNK       4664       60.2
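The slides do not define the popularity index. The Python sketch below assumes it is a job's visit count relative to the most-visited job, scaled to 100. That assumption reproduces the 84.6, 63.8, and 60.2 entries; the 64.9 entry does not follow from 5148 visits under this formula, so the actual definition may differ. The function name is hypothetical.

```python
def popularity_index(visits):
    """Scale visit counts so that the most-visited job scores 100 (assumed definition)."""
    top = max(visits.values())
    return {job: round(100.0 * n / top, 1) for job, n in visits.items()}


# Visit counts taken from the table above.
visits = {
    "cityjobCM070": 7748,
    "cityjobC8570": 6552,
    "cityjobCL580": 4944,
    "cityjobCK570": 4664,
}
index = popularity_index(visits)
print(index)
```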

Ⅲ. Collaborative Filtering

1. By Association Rules

• Whenever a visitor enquires about a particular job, we can “cross sell” similar jobs by recommending other jobs that have the highest association with the original one.

• The association is based on the click history of all the visitors to the Web.

For example, if a visitor enquires about

• Job A: cityjobCF520:

Title: Assistant Accountant; Qualification: Diploma; Working experience: one year

then recommend

• Job B: cityjobCF180:

Title: Assistant Accountant; Qualification: Diploma; Working experience: three years

• Job C: cityjobCF100:

Title: Assistant Accountant; Qualification: University/College; Working experience: not specified

• Job D: cityjobCEUJ0:

Title: Assistant Accountant; Qualification: Not specified; Working experience: two years

This group of 4 jobs has a

• Confidence Value of 50.3% :

given that a visitor enquires about job A, the probability that he would also enquire about jobs B, C, and D is 0.503;

• Lift Value of 298.46 :

if a visitor has enquired about job A, he is almost 300 times more likely to enquire about jobs B, C, and D than a visitor chosen at random.
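A minimal Python sketch of how confidence and lift can be computed for a rule such as "job A, then jobs B, C, and D", assuming each visitor's click history is reduced to the set of job IDs enquired about. The click histories below are made up for illustration and do not reproduce the 50.3% and 298.46 figures; the function name is hypothetical.

```python
def confidence_and_lift(histories, antecedent, consequent):
    """histories: list of sets of job IDs, one set per visitor."""
    n = len(histories)
    n_a = sum(1 for h in histories if antecedent <= h)
    n_c = sum(1 for h in histories if consequent <= h)
    n_both = sum(1 for h in histories if (antecedent | consequent) <= h)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = n_both / n_a        # P(consequent | antecedent)
    lift = confidence / (n_c / n)    # relative to a visitor chosen at random
    return confidence, lift


A, B, C, D = "cityjobCF520", "cityjobCF180", "cityjobCF100", "cityjobCEUJ0"
# Hypothetical click histories for ten visitors.
histories = [
    {A, B, C, D}, {A, B, C, D}, {A}, {A, B}, {B, C, D},
    {"other"}, {"other"}, {"other"}, {"other"}, {"other"},
]
confidence, lift = confidence_and_lift(histories, {A}, {B, C, D})
print(confidence, lift)
```

Here 2 of the 4 visitors who enquired about job A also enquired about B, C, and D (confidence 0.5), while 3 of all 10 visitors enquired about B, C, and D, so the lift is 0.5 / 0.3.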

2. By Popularity Index

For example, if a visitor enquires about

• Job A: cityjobCDU20

Title: EXECUTIVE TRAINEE - INVESTMENT PRODUCTS, Type: FIN, Working Experience: 0, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 64.9.

then recommend (with same type, industry and qualification)

• Job B: cityjobCM470

Title: ASSOCIATE (TREASURY), Type: FIN, Working Experience: 3, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 59.2.

• Job C: cityjobCM470

Title: ASSOCIATES (CRM), Type: FIN, Working Experience: 2, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 44.6.

• Job D: cityjobCFLC0

Title: DEALER & INVESTOR ADVISOR, Type: FIN, Working Experience: 3, Qualification: UC, Industry: BNK, Level: PRO, Index of popularity: 36.6.
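The popularity-based recommendation can be sketched in Python as a filter on type, industry, and qualification followed by a sort on the index. This is an illustration under assumed field names, not the site's implementation; the last row's type and qualification are placeholders because the slides do not give them.

```python
def recommend(jobs, viewed_id, top_n=3):
    """Return jobs matching the viewed job's type, industry and qualification,
    most popular first."""
    viewed = next(j for j in jobs if j["id"] == viewed_id)
    keys = ("type", "industry", "qualification")
    similar = [
        j for j in jobs
        if j is not viewed and all(j[k] == viewed[k] for k in keys)
    ]
    return sorted(similar, key=lambda j: j["popularity"], reverse=True)[:top_n]


jobs = [
    {"id": "cityjobCDU20", "title": "EXECUTIVE TRAINEE - INVESTMENT PRODUCTS",
     "type": "FIN", "industry": "BNK", "qualification": "UC", "popularity": 64.9},
    {"id": "cityjobCM470", "title": "ASSOCIATE (TREASURY)",
     "type": "FIN", "industry": "BNK", "qualification": "UC", "popularity": 59.2},
    # Job ID repeated as printed in the slides.
    {"id": "cityjobCM470", "title": "ASSOCIATES (CRM)",
     "type": "FIN", "industry": "BNK", "qualification": "UC", "popularity": 44.6},
    {"id": "cityjobCFLC0", "title": "DEALER & INVESTOR ADVISOR",
     "type": "FIN", "industry": "BNK", "qualification": "UC", "popularity": 36.6},
    # Type and qualification below are placeholders, not from the slides.
    {"id": "cityjobC8570", "title": "ADMINISTRATIVE ASSISTANT",
     "type": "ADM", "industry": "GOV", "qualification": "SEC", "popularity": 84.6},
]

for job in recommend(jobs, "cityjobCDU20"):
    print(job["title"], job["popularity"])
```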

Ⅳ. Predictive Models

1. Churn (Attrition) model

To identify subscribers who are likely to stop visiting the Web site, so that Cityjob.com can take action to retain them. Retaining subscribers is often less expensive than winning them back.

2. Popular job model

What are the characteristics of jobs that would attract more visitors? Are they related to their job type and job industry?

1. The Churn (Attrition) Model

• Sample: All subscribers of Cityjob.com.

• Dependent Variable: Visit = 1 if the subscriber has visited Cityjob.com during the study period; Visit = 0 otherwise.

• Factors used: Gender; Age; Educational Level; dummy variables for interest and country; no. of days since registration.

• Sampling procedure: Stratified sampling based on the variable “Visit” is used to obtain an equal number of observations from the two groups of subscribers (Visit = 1 and Visit = 0).

• Data partition: Training data 70%, Validation data 30%
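The sampling and partition steps can be sketched in Python as follows. This is an illustration only (the study used SAS Enterprise Miner): it draws the same number of subscribers from each "Visit" group and then splits the result 70/30 at random. The function names, the seed, and the 80/20 toy population are assumptions.

```python
import random


def balanced_sample(records, label_key, rng):
    """Draw equally many records from each class of a binary label."""
    groups = {0: [], 1: []}
    for rec in records:
        groups[rec[label_key]].append(rec)
    n = min(len(groups[0]), len(groups[1]))
    return rng.sample(groups[0], n) + rng.sample(groups[1], n)


def partition(records, train_fraction, rng):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(2001)  # fixed seed so the sketch is repeatable
# Hypothetical population: 80 subscribers who visited, 20 who did not.
subscribers = [{"id": i, "visit": 1 if i < 80 else 0} for i in range(100)]

sample = balanced_sample(subscribers, "visit", rng)
train, validation = partition(sample, 0.7, rng)
print(len(sample), len(train), len(validation))
```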

• Lift Chart

Churn model (logistic regression)

Important factors:

1. No. of days since registration

2. Educational level

3. Gender

4. Whether the subscriber has an interest in computer games
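For readers unfamiliar with lift charts, the Python sketch below shows one common way to compute a point on such a chart from a scored validation set: rank cases by predicted probability, take the top fraction, and compare its response rate with the overall rate. The scores and labels are made up, and the function name is hypothetical.

```python
def cumulative_lift(scored, fraction):
    """scored: list of (predicted_probability, actual_label) pairs, labels 0 or 1."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(label for _, label in ranked) / len(ranked)
    return top_rate / base_rate


# Hypothetical scored validation cases: (model score, actual outcome).
scored = [
    (0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.5, 1),
    (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.0, 0),
]
for fraction in (0.2, 0.5, 1.0):
    print(fraction, cumulative_lift(scored, fraction))
```

A lift above 1 in the top fractions means the model concentrates the positive cases among its highest scores.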

2. The Popular Job Model

• Sample: All jobs advertised on Cityjob.com.

• Dependent Variable: Popular = 1 if the job has been visited at least 20 times; Popular = 0 otherwise.

• Factors used: Dummy variables for different job types, job industries, job level, qualification required, working experience.

• Data partition: Training data 70%, Validation data 30%

• Missing values: missing values for working experience and qualification required were replaced by 0 and 3 (Secondary school completed) respectively.
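The target definition and the missing-value rule can be sketched in Python as below. The field names, the visit counts, and the qualification code 4 are hypothetical; only the threshold of 20 visits and the replacement values 0 and 3 come from the slide.

```python
def prepare_job(job, visit_count):
    """Fill missing values and attach the Popular target for one job record."""
    row = dict(job)
    if row.get("work_exp") is None:
        row["work_exp"] = 0        # missing working experience -> 0
    if row.get("qualification") is None:
        row["qualification"] = 3   # missing qualification -> 3 (secondary school completed)
    row["popular"] = 1 if visit_count >= 20 else 0
    return row


# Qualification is not specified for this job in the slides; the visit count is made up.
job_d = prepare_job({"id": "cityjobCEUJ0", "work_exp": 2, "qualification": None}, 25)
# Entirely hypothetical record with missing working experience.
job_x = prepare_job({"id": "jobX", "work_exp": None, "qualification": 4}, 19)
print(job_d, job_x)
```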

• Lift Chart

Popular job model (logistic regression)

Important factors:

1. Higher qualification (more likely)

2. Higher level (more likely)

3. Job industries: accounting, banking, building, construction (more likely)

4. Job types: art/design/creative, engineering, sales (less likely)

Ⅴ. Recommendation

1. Web Design

a. To develop a collaborative filtering system

b. To include a popularity index

2. Marketing Strategies

a. To develop appropriate marketing strategies for customer retention

b. To develop Cityjob.com’s own web monitor system

Ⅵ. Unexpected Discovery

There was a user who came every day during the study period at exactly the same time (4:00 a.m. HK time) and stayed for one to three hours, browsing more than 500 pages each time (average 5 sec. per page).
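A pattern like this could be flagged automatically. The Python sketch below is one possible rule, not the method used in the study: it marks a user whose daily sessions always start in the same hour, always exceed a page threshold, and average only a few seconds per page. The thresholds, the function name, and the sample sessions are assumptions.

```python
def looks_automated(sessions, min_pages=500, max_avg_seconds=6.0):
    """sessions: list of (start_hour, duration_seconds, pages), one per day for one user."""
    if not sessions:
        return False
    same_hour = len({hour for hour, _, _ in sessions}) == 1
    heavy = all(pages > min_pages for _, _, pages in sessions)
    total_pages = sum(pages for _, _, pages in sessions)
    total_seconds = sum(seconds for _, seconds, _ in sessions)
    fast = total_pages > 0 and total_seconds / total_pages <= max_avg_seconds
    return same_hour and heavy and fast


# Hypothetical sessions: a 4 a.m. visitor at 5 seconds per page, and an ordinary user.
robot_like = [(4, 3600, 720), (4, 7200, 1440), (4, 10800, 2160)]
ordinary = [(9, 600, 12), (21, 900, 20)]
print(looks_automated(robot_like), looks_automated(ordinary))
```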

The End