Author paper midterm
Author-Paper Identification Problem
Team:
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
Guided by Prof. Duc Tran
Problem Statement
• Determine the correct author of a particular paper from the author dataset.
• Ambiguity in author names can cause a paper to be assigned to the wrong author, which leads to noisy author profiles.
• This KDD Cup task challenges participants to determine which papers in an author profile were truly written by the given author.
Type of data
Data provided by the KDD challenge is in CSV format.
• Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author - (Id, Name, Affiliation)
• Paper-Author - (PaperId, AuthorId, Name, Affiliation)
• Conference - (Id, ShortName, FullName, HomePage)
• Journal - (Id, ShortName, FullName, HomePage)
• Train - (AuthorId, ConfirmedPaperIds, DeletedPaperIds)
• Test - (AuthorId, PaperIds)
• Validation - (AuthorId, PaperIds, Usage)
Data Points
• The data points include all papers written by an author and the author's affiliation (university, technical society, groups):
  Paper-Author - (PaperId, AuthorId, Name, Affiliation)
• The metadata includes the journals an author has published in and the conferences the author has attended:
  Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
  Author - (Id, Name, Affiliation)
  Conference - (Id, ShortName, FullName, HomePage)
  Journal - (Id, ShortName, FullName, HomePage)
Issues with data
• The CSV files needed cleaning.
• A few records had attributes spilled over 3 rows.
• Some rows had more attributes than the required number.
• Special characters caused issues.
• We wrote a Perl script to clean and format the data.
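The team's Perl script is not shown, so the following is only a minimal Python sketch of the kind of cleaning described above, under stated assumptions: fields are split on plain commas (quoted fields are ignored), a record is considered complete once it has the expected number of fields, and any surplus fields are folded into the last column. The function name `clean_rows` and its parameters are hypothetical.

```python
import re

def clean_rows(lines, n_fields):
    """Rejoin records spilled over several rows and trim surplus fields.

    Assumption: naive comma splitting; quoted commas are not handled.
    """
    cleaned, buffer = [], ""
    for line in lines:
        # Replace control characters (including the newline) with spaces.
        line = re.sub(r"[\x00-\x1f\x7f]", " ", line).strip()
        # Append this physical row to the record being accumulated.
        buffer = line if not buffer else buffer + " " + line
        fields = buffer.split(",")
        if len(fields) < n_fields:
            continue  # record still incomplete, keep accumulating rows
        # Fold any extra attributes into the last expected column.
        fields = fields[:n_fields - 1] + [" ".join(fields[n_fields - 1:])]
        cleaned.append([f.strip() for f in fields])
        buffer = ""
    # An incomplete trailing record, if any, is silently dropped here.
    return cleaned
```

For the Paper file, `n_fields` would be 6 (Id, Title, Year, ConferenceId, JournalId, Keywords).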
Issues with data-I
Issues with data-II
Predictions & Intuitions
Prediction: Given a paper and an author, one should be able to identify whether the paper was written by that author.
Intuition: We initially identified this as a clustering problem. We chose clustering because the set of papers written by one author can be grouped together; then, for a given paper and author, we can check whether the paper falls in the author's cluster.
The features PaperId, AuthorId, PaperTitle, and AuthorName play a significant role in the prediction.
Feature selection
We used the following features from the Train dataset while building the model:
• ConfirmedPaperIds
• DeletedPaperIds
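One way to use these two columns is to expand each Train row into labeled (author, paper) pairs: confirmed papers as positives, deleted papers as negatives. The sketch below assumes the paper IDs inside each column are separated by spaces and that the file has a header row with the column names from the Train schema; the function name `labeled_pairs` is hypothetical.

```python
import csv
import io

def labeled_pairs(train_csv_text):
    """Expand Train rows into (author_id, paper_id, label) tuples.

    Assumption: IDs within ConfirmedPaperIds / DeletedPaperIds are
    space-separated. Label 1 = confirmed, 0 = deleted.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(train_csv_text)):
        author = int(row["AuthorId"])
        for pid in row["ConfirmedPaperIds"].split():
            pairs.append((author, int(pid), 1))
        for pid in row["DeletedPaperIds"].split():
            pairs.append((author, int(pid), 0))
    return pairs

# Tiny made-up sample, not real KDD Cup data.
sample = "AuthorId,ConfirmedPaperIds,DeletedPaperIds\n7,10 11,12\n8,,13\n"
pairs = labeled_pairs(sample)
```

Each resulting tuple is one training example for a binary classifier.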
Tools Used & Model Trained
Tools used: Weka, R, Apache Mahout
Models trained: Simple K-Means, J-48, ZeroR
K-means clustering using Weka: training the data
Visualization of k-means clustering result
Simple K-means clustering using R
Error in R for Clustering
> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')
> y[1:10,]
> km3 <- kmeans(x,3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(x, 3) : NAs introduced by coercion
Conclusion
Why does clustering not work for this problem?
• Handling a mixed set of attributes is an issue in R.
• Simple K-means clustering works by calculating distances from centroids, and thus needs numeric attributes. Hence clustering is not the best approach for our problem.
• To work around this, we are trying to convert the data into numeric integer values so that numeric distance measures can be applied.
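As a rough illustration of that conversion (written in Python rather than R, and not the team's actual code), a text column can be mapped to integer codes, after which a Euclidean distance is at least computable. The helper names `encode_column` and `euclidean` are hypothetical. Note that such codes impose an arbitrary ordering on the categories, so the resulting distances carry little meaning for names or affiliations.

```python
import math

def encode_column(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up affiliation values, for illustration only.
encoded, codes = encode_column(["Univ A", "Univ B", "Univ A"])
```

Here `encoded` becomes `[0, 1, 0]`, which is numeric and so no longer triggers the coercion to NA that a distance-based method hits on raw text.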
• However, this problem looks more like a classification problem: classifying whether a paper was written by a given author.
Moving on to classification algorithms...
ZeroR
Tree J-48
Naïve Bayes
Results using Tree-J48 algorithm
Results using ZeroR algorithm
Visualization of ZeroR results for Precision
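For context on the ZeroR results: ZeroR ignores all features and always predicts the most frequent class in the training data, so it serves as a baseline that other classifiers should beat. The sketch below is a plain-Python illustration of that idea, not Weka's implementation; the names `zero_r` and `accuracy` and the labels are hypothetical.

```python
from collections import Counter

def zero_r(train_labels):
    """Return a predictor that always outputs the majority training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features=None: majority

def accuracy(predict, examples):
    """Fraction of (features, label) examples that the predictor gets right."""
    hits = sum(1 for features, label in examples if predict(features) == label)
    return hits / len(examples)

# Made-up labels: 1 = paper confirmed for the author, 0 = deleted.
predict = zero_r([1, 1, 0, 1, 0])
score = accuracy(predict, [(None, 1), (None, 0), (None, 1), (None, 1)])
```

With these made-up labels the majority class is 1, so the baseline scores 3 out of 4 on the small evaluation list.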
Next Steps
We are working on feature engineering (feature transformation): transforming the author name attribute into a common format for all author names.
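The exact common format has not been decided in these slides, so the following is only a sketch of one plausible normalization: strip accents, reorder "Last, First" into "First Last", drop punctuation, collapse whitespace, and lowercase. The function name `normalize_name` and the sample names are hypothetical.

```python
import re
import unicodedata

def normalize_name(name):
    """Bring an author name into one lowercase 'first last' form (assumed format)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Reorder "Last, First" into "First Last".
    if "," in plain:
        last, first = plain.split(",", 1)
        plain = first + " " + last
    # Replace punctuation with spaces, then collapse whitespace and lowercase.
    plain = re.sub(r"[^\w\s]", " ", plain)
    return " ".join(plain.lower().split())
```

With this sketch, `normalize_name("García, José")` and `normalize_name("  Jose  GARCIA ")` both yield `"jose garcia"`, so the two spellings can be matched across the Author and Paper-Author files.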
Once the feature engineering is done, we will work principally on Naïve Bayes and other classification algorithms that we think will suit our problem.
And fine-tune the model…
Thank you!!