Improving Gender Classification of Blog Authors



Page 1: Improving  Gender Classification of Blog Authors

Improving Gender Classification of Blog Authors

Arjun Mukherjee, Bing Liu

UIC

Page 2: Improving  Gender Classification of Blog Authors

Introduction

• Dataset

– 3100 blogs (1588 by men, 1512 by women)

• Related work

– Current systems use POS n-grams, word classes, and personality types to capture the stylistic behavior of authors' writing for gender classification.

– However, each of these works uses only one class of features, or a subset of the classes. None of them uses all features for classification learning.

Page 3: Improving  Gender Classification of Blog Authors

Feature Engineering and mining

• F-measure (not the accuracy measure!)

– The F-measure is used to measure the contextuality and formality of a text.

– F = 0.5 * [(freq.noun + freq.adj + freq.prep + freq.art) - (freq.pron + freq.verb + freq.adv + freq.int) + 100]

• Stylistic features

– Words such as "lol" and "hmm", as well as smileys, characterize a person's writing style.
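The formality formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes each frequency is the percentage of tokens in the text that fall into the category (so F ranges from 0 to 100), and it assumes the tokens have already been mapped to coarse category labels; the label names and the function `formality_f_measure` are hypothetical.

```python
from collections import Counter

# Hypothetical coarse category labels; a real system would map tagger output to these.
FORMAL = ("noun", "adj", "prep", "art")
CONTEXTUAL = ("pron", "verb", "adv", "int")

def formality_f_measure(categories):
    """Compute F from a list of per-token category labels.

    Assumption: each freq.X is the percentage of all tokens labelled X.
    Tokens with any other label count toward the total only.
    """
    if not categories:
        raise ValueError("need at least one token")
    counts = Counter(categories)
    total = len(categories)

    def freq(label):
        return 100.0 * counts[label] / total

    formal = sum(freq(label) for label in FORMAL)
    contextual = sum(freq(label) for label in CONTEXTUAL)
    return 0.5 * (formal - contextual + 100.0)

# Toy input: 4 nouns, 2 adjectives, 2 verbs, 1 pronoun, 1 uncategorized token.
sample = ["noun"] * 4 + ["adj"] * 2 + ["verb"] * 2 + ["pron"] + ["other"]
score = formality_f_measure(sample)
```

Under this percentage assumption, a text made only of nouns scores 100 and a text made only of pronouns scores 0, so higher values indicate more formal writing.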

Page 4: Improving  Gender Classification of Blog Authors

Features (contd.)

• Gender-preferential features

– Females tend to post emotionally and frequently use intensive adverbs (terribly, awfully, …), whereas men tend to be more provocative.

Page 5: Improving  Gender Classification of Blog Authors

Features (contd.)

• Factor analysis and word classes

– Factor analysis (or word factor analysis) refers to the process of finding groups of similar words that tend to occur in similar documents.

• POS sequence pattern

– A POS sequence pattern is a sequence of consecutive POS tags that satisfies some constraints.

– POS n-grams are good at capturing heavy stylistic and syntactic information. Instead of using all such n-grams, we want to discover only those patterns that represent true regularities, and we also want the patterns to have flexible lengths.

Page 6: Improving  Gender Classification of Blog Authors

Features (contd.)

• POS sequence patterns

– The mining algorithm finds all patterns that satisfy the user-specified minimum support (minsup) and minimum adherence (minadherence) thresholds.
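The thresholding idea can be sketched as follows. This is a simplified brute-force illustration, not the algorithm from the slides: it enumerates every run of consecutive tags up to a hypothetical `max_len` instead of generating candidates level by level. The slides do not define adherence, so the sketch assumes a fairSCP-style measure (the squared pattern probability divided by the average product of the probabilities of its two parts over all split points), with probabilities estimated as the fraction of documents that contain the pattern. All function names are illustrative.

```python
from collections import Counter

def ngrams_in(doc, n):
    """Distinct runs of n consecutive POS tags in one tagged document."""
    return {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}

def mine_pos_patterns(docs, minsup, minadherence, max_len=4):
    """Return {pattern: (support, adherence)} for patterns passing both thresholds.

    Assumptions: support is the fraction of documents containing the pattern,
    and adherence is a fairSCP-style score (not defined in the slides).
    """
    n_docs = len(docs)
    doc_counts = Counter()
    for doc in docs:
        for n in range(1, max_len + 1):
            doc_counts.update(ngrams_in(doc, n))
    prob = {p: c / n_docs for p, c in doc_counts.items()}

    def adherence(p):
        # Average, over every split point, of P(prefix) * P(suffix).
        n = len(p)
        avg = sum(prob[p[:i]] * prob[p[i:]] for i in range(1, n)) / (n - 1)
        return prob[p] * prob[p] / avg

    patterns = {}
    for p, support in prob.items():
        if len(p) >= 2 and support >= minsup:
            score = adherence(p)
            if score >= minadherence:
                patterns[p] = (support, score)
    return patterns

# Toy tagged documents (Penn Treebank style tags, made up for illustration).
docs = [
    ["PRP", "VBD", "DT", "NN"],
    ["PRP", "VBD", "JJ"],
    ["DT", "NN", "VBD"],
]
found = mine_pos_patterns(docs, minsup=0.6, minadherence=0.9)
```

In this toy input, the pair PRP VBD is frequent enough but is rejected by the adherence threshold, because VBD occurs in every document and so the pair is not much more cohesive than chance; DT NN passes both thresholds.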

Page 7: Improving  Gender Classification of Blog Authors

POS sequence pattern mining algorithm

Page 8: Improving  Gender Classification of Blog Authors

Feature Selection

• The system uses the EFS (Ensemble Feature Selection) algorithm to select the features.

– EFS is a hybrid of filter and wrapper techniques.

– The criteria used for feature selection include information gain, mutual information, and the chi-square test.

• Feature value assignment

– Feature values are either Boolean or term frequencies.
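As a rough sketch of the filter side of such a hybrid, the snippet below scores Boolean features against a binary class with two of the criteria named above (chi-square and information gain) and takes the union of the top-k features under each. This is not the EFS algorithm itself: the wrapper step is omitted, and the function names, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import math

def contingency(feature, labels):
    """2x2 counts: present/positive, present/negative, absent/positive, absent/negative."""
    a = sum(1 for f, y in zip(feature, labels) if f and y)
    b = sum(1 for f, y in zip(feature, labels) if f and not y)
    c = sum(1 for f, y in zip(feature, labels) if not f and y)
    d = sum(1 for f, y in zip(feature, labels) if not f and not y)
    return a, b, c, d

def chi_square(feature, labels):
    """Chi-square statistic of a Boolean feature against a binary class."""
    a, b, c, d = contingency(feature, labels)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def entropy(counts):
    total = sum(counts)
    return -sum(k / total * math.log2(k / total) for k in counts if k)

def information_gain(feature, labels):
    """Reduction in class entropy after splitting on the feature."""
    a, b, c, d = contingency(feature, labels)
    n = a + b + c + d
    gain = entropy([a + c, b + d])
    for part in ([a, b], [c, d]):
        if sum(part):
            gain -= sum(part) / n * entropy(part)
    return gain

def top_k_union(features, labels, k):
    """Union of the k best features under each criterion (filter step only)."""
    selected = set()
    for criterion in (chi_square, information_gain):
        ranked = sorted(features, key=lambda name: criterion(features[name], labels),
                        reverse=True)
        selected.update(ranked[:k])
    return selected

# Toy Boolean feature values for four documents; labels: 1 = one class, 0 = the other.
labels = [1, 1, 0, 0]
features = {"lol": [1, 1, 0, 0], "the": [1, 1, 1, 1], "hmm": [1, 0, 1, 0]}
chosen = top_k_union(features, labels, k=1)
```

The example uses Boolean feature values; with term-frequency values the same criteria would first need a discretization step.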

Page 9: Improving  Gender Classification of Blog Authors

Experiments and results

• Classifiers

– Naïve Bayes

– SVM

– SVM regression
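To make the first of these classifiers concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing over term-frequency features. It is a generic textbook version in plain Python, not the authors' experimental setup; the function names and the four toy documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # class prior
        for token in tokens:
            if token in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy training data, using example words from the slides.
train_docs = [
    ["lol", "so", "awfully", "happy"],
    ["terribly", "sad", "lol"],
    ["the", "system", "fails"],
    ["the", "report", "argues"],
]
train_labels = ["female", "female", "male", "male"]
model = train_nb(train_docs, train_labels)
```

Calling `predict_nb(model, ["lol", "awfully"])` picks the class whose training documents contain those words more often; a real experiment would use the full feature set and held-out evaluation.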

Page 10: Improving  Gender Classification of Blog Authors

Results

Page 11: Improving  Gender Classification of Blog Authors

Comparison results