Gender Detection on Blogs
![Page 1: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/1.jpg)
GENDER DETECTION IN BLOGS
![Page 2: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/2.jpg)
Presented By (Team No. 32)
Nitish Jain (201301227)
Ganesh Borle (201505587)
Vamshikrishna Reddy (201202177)
Mentored By
Lokesh Walase
IRE [CSE474]
![Page 3: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/3.jpg)
The Big Picture
![Page 4: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/4.jpg)
ABSTRACT
● Through the sands of time, textual content has remained a prominent feature of internet media, especially blogs.
● But a lot of content brings a lot of responsibility.
● The internet cannot take responsibility for all of that content; the author should.
● Thus, author profiling and attribution becomes an important task, and we try to capture one aspect of it: gender.
![Page 5: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/5.jpg)
The Question
Given the text of a blog, can we identify whether the writer is male or female?
![Page 6: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/6.jpg)
WHO IS THE AUTHOR?
![Page 7: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/7.jpg)
OUR APPROACH
![Page 8: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/8.jpg)
THE APPROACH
● We take advantage of the linguistic features of the blog and create a feature file.
● Various classifiers are then trained on this feature file, and a model is prepared for each classifier.
● An ensemble is applied to these models, and the input document is classified as written by a male or a female.
![Page 9: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/9.jpg)
WORKFLOW
![Page 10: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/10.jpg)
The Dataset
● Koppel's blog dataset
● Contains about 19 thousand documents
● Each document contains the text of about 35 blog entries in XML format.
[Dataset link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm]
![Page 11: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/11.jpg)
PARSING
● Language used: Python
● Each blog entry is stored in XML format:
<Blog>
<date> ....... </date>
<post> …. </post>
...
</Blog>
● Each blog's filename contains the name and gender of the author.
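The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: it assumes the corpus files may not be strictly well-formed XML (so it uses a regular expression instead of an XML parser), and it assumes the gender appears as a dot-separated `male` / `female` token in the filename. The sample text and the filename `12345.female.xml` are made up for the example.

```python
import re
from pathlib import Path

# Assumption: files may contain characters that break a strict XML parser,
# so posts are pulled out with a regex instead.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL | re.IGNORECASE)


def extract_posts(xml_text):
    """Return the stripped text of every <post> element in one blog file."""
    return [match.strip() for match in POST_RE.findall(xml_text)]


def gender_from_filename(filename):
    """Look for a 'male' or 'female' token among the dot-separated parts
    of the file name (hypothetical naming layout)."""
    for part in Path(filename).name.lower().split("."):
        if part in ("male", "female"):
            return part
    return None


# Made-up sample that follows the <Blog>/<date>/<post> layout shown above.
sample = """<Blog>
<date>01,January,2004</date>
<post>
  First post text.
</post>
<date>02,January,2004</date>
<post>
  Second post text.
</post>
</Blog>"""

posts = extract_posts(sample)
label = gender_from_filename("12345.female.xml")
```

Each `(posts, label)` pair would then feed the feature extraction stage.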
![Page 12: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/12.jpg)
The Feature Extraction
![Page 13: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/13.jpg)
FEATURES
For our task of gender identification, we use the following linguistic features:
● Character-Based Features
● Word-Based Features
● Syntactic Features
● Structural Features
● Function Words
● POS Start Probability
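As a rough illustration of what a few of these feature groups might look like in code, the sketch below computes some character-based, word-based, structural, and function-word features. The specific features, their names, and the tiny function-word list are assumptions chosen for the example; the slides do not list the exact features used, and syntactic and POS-based features are left out because they need a tagger.

```python
import re
import string

# Tiny illustrative list; a real function-word list would be much longer.
FUNCTION_WORDS = ("the", "and", "of", "to", "i", "you")


def extract_features(text):
    """Map one blog text to a flat dict of numeric features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_chars = max(len(text), 1)   # guard against empty input
    n_words = max(len(words), 1)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    features = {
        # character-based
        "char_count": len(text),
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # word-based
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        # structural
        "sentence_count": len(sentences),
    }
    # function words: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features


feats = extract_features("I love the sea. The sea loves me!")
```

One such dict per document, written out row by row, would correspond to the "feature file" mentioned in the approach.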
![Page 14: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/14.jpg)
The Classification
![Page 15: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/15.jpg)
THE CLASSIFICATION TASK
For the task of classification, we tried several classification algorithms and arrived at a model that uses an ensemble of the following:
● Random Forest Classifier
● Neural Networks Classifier
● AdaBoost Tree Classifier
● Gradient Boosting Classifier
● Bagging Classifier
![Page 16: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/16.jpg)
THE CLASSIFICATION TASK
For each classifier:
● We fed it subsets of the features to see how accuracy varies with the features used.
● We applied 10-fold cross-validation to measure accuracy.
To measure the accuracy of the ensemble, we took the majority class among the results of the individual classifiers.
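The 10-fold procedure can be sketched in plain Python as below. This is only an outline of the general technique, assuming a shuffled, non-stratified split; `train_and_predict` is a hypothetical callback standing in for any of the classifiers, and the toy labels are invented to exercise the code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(train_and_predict, X, y, k=10):
    """Average accuracy over k folds.

    train_and_predict(X_train, y_train, X_test) -> list of predicted labels
    (hypothetical interface for this sketch).
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        preds = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def majority_baseline(X_train, y_train, X_test):
    """Toy 'classifier': always predict the most common training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


X = list(range(100))                      # placeholder feature rows
y = ["male"] * 60 + ["female"] * 40       # invented labels
acc = cross_val_accuracy(majority_baseline, X, y, k=10)
```

Every sample lands in exactly one test fold, so each document is scored once by a model that never saw it during training.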
![Page 17: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/17.jpg)
RANDOM FOREST CLASSIFIER
● A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting.
● Using the Random Forest classifier, we achieved an accuracy of 69.79%.
![Page 18: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/18.jpg)
NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes, with each layer fully connected to the next; each node is a neuron with a non-linear activation function.
● Uses a supervised learning technique called backpropagation to train the network.
● Using the Neural Networks classifier, we achieved an accuracy of 69.51%.
![Page 19: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/19.jpg)
ADABOOST TREE CLASSIFIER
● A meta-estimator that begins by fitting a classifier on the original dataset and then fits further rounds of classifiers on the same dataset, with the weights of incorrectly classified instances adjusted so that later classifiers focus on the harder cases.
● Using the AdaBoost Tree classifier, we achieved an accuracy of 69.57%.
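To make the reweighting idea concrete, here is a sketch of one round of the textbook binary AdaBoost update. It is meant as an illustration of the principle only; the library variant used in the project may differ in detail, and the four-sample example is invented. Labels are assumed to be +1 / -1 and the weights are assumed to sum to 1.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classic AdaBoost weight update.

    Returns the normalised new weights and the weak learner's vote weight.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)

    # t * p is +1 for a correct prediction and -1 for a mistake, so correct
    # samples are scaled down and misclassified samples are scaled up.
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha


weights = [0.25] * 4
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]      # the last sample is misclassified
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
```

After the update the misclassified sample carries half of the total weight, which is what forces the next weak learner to pay attention to it.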
![Page 20: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/20.jpg)
GRADIENT BOOSTING CLASSIFIER
● Builds the model in a forward stage-wise fashion.
● In each subsequent stage, a new weak classifier is introduced to compensate for the shortcomings of the existing weak learners; these shortcomings are identified by the gradients.
● Using the Gradient Boosting classifier, we achieved an accuracy of 70.81%.
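In the usual textbook formulation (the notation here is an assumption, not taken from the slides), the stage-wise update can be written as follows, where $F_m$ is the model after stage $m$, $h_m$ is the new weak learner, $\nu$ is a learning rate, $L$ is the loss function, and $r_{i,m}$ are the negative gradients that the new learner is fitted to:

$$
F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad
h_m \approx \arg\min_h \sum_{i=1}^{n} \bigl( r_{i,m} - h(x_i) \bigr)^2, \qquad
r_{i,m} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
$$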
![Page 21: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/21.jpg)
BAGGING CLASSIFIER
● A meta-estimator that fits base classifiers, each on a random subset of the dataset, and then aggregates their individual predictions.
● Using the Bagging classifier, we achieved an accuracy of 70.03%.
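The bagging idea can be sketched as below, assuming bootstrap sampling (drawing with replacement) and a deliberately trivial base learner. The `fit` interface, the nearest-class-mean learner, and the one-dimensional toy data are all hypothetical and chosen only to keep the example short.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def bagging_predict(fit, X, y, X_new, n_estimators=11, seed=0):
    """Train n_estimators base models on bootstrap samples and majority-vote.

    fit(X, y) must return a predict(X_new) function (hypothetical interface).
    """
    rng = random.Random(seed)
    models = [fit(*bootstrap_sample(X, y, rng)) for _ in range(n_estimators)]
    votes_per_sample = zip(*(model(X_new) for model in models))
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


def fit_nearest_mean(X, y):
    """Toy base learner: predict the class whose mean feature value is closest."""
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)

    def predict(X_new):
        return [min(means, key=lambda lab: abs(x - means[lab])) for x in X_new]

    return predict


# Invented one-feature toy data with two well-separated classes.
X = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1]
y = ["a"] * 4 + ["b"] * 4
preds = bagging_predict(fit_nearest_mean, X, y, [0.9, 3.05])
```

Because each base model sees a slightly different sample, their individual errors tend to differ, and the vote smooths them out.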
![Page 22: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/22.jpg)
THE ENSEMBLE
● An ensemble takes the outputs of the other classifiers and applies majority voting to them to determine the final output.
● Using the ensemble model on the classifiers discussed above, we achieved an accuracy of 71.10%.
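A minimal sketch of majority voting over the five classifiers is shown below. The prediction lists are invented toy values, not project results; they are arranged so that each model makes one mistake on a different document, which shows how a vote can outperform every individual member.

```python
from collections import Counter


def majority_vote(predictions_by_model):
    """Combine per-model label lists (all the same length) by majority vote."""
    columns = zip(*predictions_by_model.values())
    return [Counter(column).most_common(1)[0][0] for column in columns]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: five documents, five models, one error per model.
y_true = ["male", "female", "female", "male", "female"]
predictions = {
    "random_forest":     ["male",   "female", "male",   "male",   "female"],
    "neural_network":    ["male",   "male",   "female", "male",   "female"],
    "adaboost":          ["female", "female", "female", "male",   "female"],
    "gradient_boosting": ["male",   "female", "female", "female", "female"],
    "bagging":           ["male",   "female", "female", "male",   "male"],
}

ensemble = majority_vote(predictions)
individual = {name: accuracy(y_true, p) for name, p in predictions.items()}
ensemble_accuracy = accuracy(y_true, ensemble)
```

With an odd number of voters and two classes, ties cannot occur, which is one practical reason to combine five models rather than four.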
![Page 23: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/23.jpg)
FINAL RESULTS
![Page 24: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/24.jpg)
THE FINAL RESULTS
● By using the ensemble, we were able to increase our accuracy by nearly 1% in each case, irrespective of the performance of the individual classifiers.
● The maximum accuracy observed during the experiments was 73.19%, achieved by the ensemble model.
![Page 25: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/25.jpg)
73.188406%
The maximum accuracy achieved
![Page 26: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/26.jpg)
USEFUL LINKS
● GitHub - https://github.com/nitishjain2007/Gender_Identification
● YouTube - https://www.youtube.com/watch?v=T04BJ6cIeTs
● SlideShare - http://bit.ly/1Q8UiCe
● Website - http://nitishjain2007.github.io/Gender_Identification/
● Dropbox - http://bit.ly/1Xx0ppL
![Page 27: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/27.jpg)
REFERENCES
● http://u.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf
● http://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/208/537
● http://www.cs.columbia.edu/nlp/papers/2011/acl2011age.pdf
● http://www.ccse.kfupm.edu.sa/~ahmadsm/coe589-121/cheng2011-gender-identification.pdf
![Page 28: Gender Detection on Blogs](https://reader031.fdocuments.net/reader031/viewer/2022021919/587f4fdf1a28ab0d378b50bf/html5/thumbnails/28.jpg)
Thanks!
Any questions?