Improving Gender Classification of Blog Authors

Post on 01-Jan-2016

38 views 2 download

description

Improving Gender Classification of Blog Authors. Arjun Mukherjee Bing LIu UIC. Introduction. Dataset 3100 blogs ( 1588 – men,1512 – women ) Related work - PowerPoint PPT Presentation

Transcript of Improving Gender Classification of Blog Authors

Improving Gender Classification of Blog Authors

Arjun MukherjeeBing LIu

UIC

Introduction

• Dataset– 3100 blogs ( 1588 – men,1512 – women )

• Related work– Current systems use POS n-grams, word classes,

personality types to capture stylistic behavior of authors’ writings for classifying gender.

– However, these works use only one or a subset of the classes of features. None of them uses all features for classification learning.

Feature Engineering and mining

• F-Measure ( Not the accuracy!!! )– F-Measure is used to measure contextuality and

formality– F = 0.5 * [(freq.noun + freq.adj + freq.prep +

freq.art) – (freq.pron + freq.verb + freq.adv + freq.int) + 100]

• Stylistic features– Words such as lol, hmm and smileys determine

style of person

Feature….(contd)

• Gender Preferential features– Females tend to post emotionally. Frequent use of

intensive adverbs ( terribly, awfully….) whereas men tend to more provocative

Features…( cont. )

• Factor analysis and word classes– Factor or word factor analysis refers to the process of

finding groups of similar words that tend to occur in similar documents.

• POS sequence pattern– A POS sequence pattern is a sequence of consecutive POS

tags that satisfy some constraints– POS ngrams are good at capturing the heavy stylistic and

syntactic information. Instead of using all such n-grams, we want to discover all those patterns that represent true regularities, and we also want to have flexible lengths

Features…( contd.. )

• POS Sequence patterns– mining algorithm mines all such patterns that

satisfy the user-specified minimum support (minsup) and minimum adherence (minadherence) thresholds or constraints.

POS sequence algo.

Feature Selection

• System uses EFS( Ensemble feature selection ) algorithm for selecting the fearutes.– EFS is a hybrid of filter and wrapper techniques– Some of the criteria used for feature selection are

information gain, Mutual information, Chi square test

• Feature value assignment– The values are either boolean or term frequency

Experiments and results

• Classifiers– Naïve Bayes– SVM– SVM Regression

Results

Comparison results