Data Mining for Web Personalization
Transcript of Data Mining for Web Personalization
Patrick Dudas
Outline
- Personalization
- Data mining
- Examples
- Web mining
- MapReduce
- Data Preprocessing
- Knowledge Discovery
- Evaluation
- Information High
Personalization
- Goal of the data mining approach: automatic personalization
- Automatic personalization:
  - Content-based
  - Collaborative
  - Rule-based
Rule-based (Brief overview)
- Create decision rules
  - Implicitly/Explicitly
- Highly domain dependent
- Rules nontransferable
- Profiles are based on user input
  - Biased
  - Static
  - Degrade over time
Content-based (Brief overview)
- Profile based on the user's past experiences and interests (ratings)
- Think Amazon, Pandora, eBay
- Vector similarities based on cosine similarity
- Bayesian classification
- Remember: Ratings = Profile = Recommendation
Collaborative (Brief overview)
- Create groups of users based on ratings
- Nearest neighbor approach
- Once users are grouped, recommendations based on the other neighbors are presented
- More users or items = more dimensions of data
- Dynamic or real-time not applicable
Data Mining
- Data-rich descriptions
- Large volumes of data
- Reliable models
- Automated data collection
- Evaluate results/make decisions
- Integration with existing data sources
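The cosine similarity and nearest neighbor ideas from the content-based and collaborative overviews can be sketched together. This is only an illustration under stated assumptions: the users, the ratings, and the function names are hypothetical, and a rating of 0 is assumed to mean "not rated".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is similar to nobody
    return dot / (norm_a * norm_b)

def nearest_neighbors(target, ratings, k=1):
    """Return the k users whose rating vectors are most similar to the target's."""
    scores = [
        (cosine_similarity(ratings[target], vec), user)
        for user, vec in ratings.items()
        if user != target
    ]
    scores.sort(reverse=True)
    return [user for _, user in scores[:k]]

# Hypothetical ratings: one vector per user, one column per item, 0 = not rated.
ratings = {
    "alice": [5, 4, 0, 1],
    "bob": [4, 5, 0, 0],
    "carol": [0, 1, 5, 4],
}

neighbors = nearest_neighbors("alice", ratings)
```

Items that the returned neighbors rated highly, and that the target user has not rated yet, would then be candidates for recommendation.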
Examples of Large Datasets
- http://aws.amazon.com/datasets
- Featured data sets:
  - Illumina - Jay Flatley (CEO of Illumina) Human Genome Data Set, Science 315(5814): 972. (350 GB)
  - YRI Trio Dataset (700 GB)
  - Sloan Digital Sky Survey DR6 Subset (160 GB)
- Genome, survey data, Google Books n-gram corpora, traffic statistics, OpenStreetMap dataset, Wikipedia traffic
Data Mining Web Personalization
- Recommendations based on Web objects:
  - Items
  - Pages
  - Documents
- Navigation by links
Web mining
- Pros: personalization (duh), real-time, more enriched datasets
- Cons: privacy issues, building complex systems that misrepresent the individual
- Extend the Data Mining Paradigm
Data Preparation and Transformation
- Web logs
  - Date/time usage
  - Site information
  - Resource requested (image, video, etc.)
- Site files/meta-data
- The power of the cookie
Server-side cookies!
Data Preparation and Transformation (cont.)
- Pageview:
  - User actions (where they clicked and the path)
  - User events (what they are trying to accomplish)
- Session:
  - Sequence of page views
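Turning raw pageviews into sessions can be sketched as follows. This is a minimal sketch with assumptions: each log record is a hypothetical (user_id, timestamp, url) tuple, the user ID is taken to come from a cookie, and a 30-minute inactivity gap is assumed to end a session (a common heuristic, not something the slides specify).

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

def sessionize(pageviews, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, url) pageviews into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, timestamp, url in pageviews:
        by_user[user_id].append((timestamp, url))

    sessions = defaultdict(list)
    for user_id, views in by_user.items():
        views.sort()  # order each user's pageviews by time
        last_seen = None
        for timestamp, url in views:
            # Start a new session on the first view or after a long gap.
            if last_seen is None or timestamp - last_seen > timeout:
                sessions[user_id].append([])
            sessions[user_id][-1].append(url)
            last_seen = timestamp
    return dict(sessions)

# Hypothetical log records: (user_id, seconds since start, requested URL).
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/about"),
]

sessions = sessionize(log)
```

Each session is a list of URLs in visit order, which matches the slide's definition of a session as a sequence of page views.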
MapReduce
- Designed at Google
- Implemented by Hadoop
- Libraries in C++, C#, Erlang, Java, OCaml, Perl, Python, Ruby, F#, R, ...
Example
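A minimal single-process sketch of the MapReduce pattern, assuming hypothetical (user_id, url) log records and a page-hit counting job. In a real framework such as Hadoop the map and reduce steps would run in parallel across machines; here the three phases are ordinary functions so the data flow is visible.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit one (url, 1) pair for a single log record."""
    user_id, url = record
    yield (url, 1)

def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one URL."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical log records: (user_id, requested URL).
records = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]

counts = run_job(records)
```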
Usage Data Pre-Processing
Pattern Discovery
- We have data! Now what?
  - Clustering
  - Classification
  - Association Rule Discovery
  - Sequential Pattern Discovery
  - Markov Models
  - Latent Variable Models
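As one illustration of the techniques listed, a first-order Markov model can estimate which page a user is likely to request next from observed sessions. This is a sketch only: the sessions and function names are hypothetical, and transition probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Estimate P(next page | current page) from sequences of page views."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for page, followers in counts.items()
    }

def predict_next(transitions, page):
    """Return the most probable next page, or None if the page was never left."""
    options = transitions.get(page)
    if not options:
        return None
    return max(options, key=options.get)

# Hypothetical sessions: each is a sequence of page views.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/products", "/cart"],
]

model = fit_transitions(sessions)
suggestion = predict_next(model, "/products")
```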
Clustering
- Partitioning
  - Split your data into groups
  - K-means
- Hierarchical
  - Divisive (top-down): start with everything in one cluster, then split it into groups
  - Agglomerative (bottom-up): start with each item as its own cluster, then merge the closest clusters
- Model-based
  - Build a model for the data (best fit)
K-means
User-Based Clustering
- Start with the user profiles
- Partition into k groups of profiles
- Based on similarity
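Partitioning user profiles into k groups can be sketched with a small k-means loop. The assumptions here are loud ones: the profiles are hypothetical rating vectors, similarity is measured with squared Euclidean distance, and the initial centroids are simply the first k profiles, which keeps the example deterministic but is a naive choice.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(profiles, k, iterations=10):
    """Partition profile vectors into k groups; return (assignments, centroids)."""
    centroids = [list(p) for p in profiles[:k]]  # naive init (assumption)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each profile joins its closest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignments, centroids

# Hypothetical user profiles: one rating vector per user.
profiles = [
    [5, 5, 0, 0],
    [0, 0, 5, 4],
    [4, 5, 1, 0],
    [0, 1, 4, 5],
]

assignments, centroids = k_means(profiles, k=2)
```

A production system would more likely use a library implementation with better initialization; the loop above only shows the assign-then-update idea.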
Association Discovery
- Support
  - min(support)
- Confidence
  - min(confidence)
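Support and confidence, and the minimum thresholds applied to them, can be sketched as below. The transactions are hypothetical sets of pages visited in one session, and the function names are illustrative. Support of an itemset is the fraction of transactions containing it; confidence of a rule A -> B is support(A and B) divided by support(A).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, relative to the antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base

def rule_passes(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (
        support(both, transactions) >= min_support
        and confidence(antecedent, consequent, transactions) >= min_confidence
    )

# Hypothetical transactions: pages visited together in one session.
transactions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

rule_support = support({"/products", "/cart"}, transactions)
rule_confidence = confidence({"/products"}, {"/cart"}, transactions)
```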
Evaluation (Personalization Model)
- Challenges:
  - Recommendation algorithms may require a unique set of evaluation metrics
  - Personalization actions may differ by:
    - Domain
    - Intended application
    - Data gathered
- Check for overfitting the data
  - Training set
ROC Curve
- TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
[Figure: table labeled "No"/"Yes" and S/N/N with values .95, .05, .20, .80]
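The TPR and FPR formulas can be turned into ROC curve points by sweeping a decision threshold over the model's scores. This is a sketch under assumptions: the labels and scores are made up, 1 marks a relevant item, and a score at or above the threshold counts as a positive prediction.

```python
def rates(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0  # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # FPR = FP / (FP + TN)
    return tpr, fpr

def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct score used as a threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = rates(labels, scores, threshold)
        points.append((fpr, tpr))
    return points

# Hypothetical evaluation data: 1 = relevant item, score = model output.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
```

Plotting the points with FPR on the x-axis and TPR on the y-axis gives the ROC curve; a curve that hugs the top-left corner indicates better separation.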
Information High
- Information is addictive
- Information can be misleading
- Information ethics
- Information is power, sometimes too powerful
Personal Suggestions
- Develop a hypothesis
- Figure out what data is needed
- Make informed decisions
  - Don't trust just your judgment
  - Experts are experts for a reason!
- Develop a way to validate based on experience
- Then get more data if needed
Sources
- Kohavi, R. and F. Provost (2001). "Applications of data mining to electronic commerce." Data Mining and Knowledge Discovery 5(1): 5-10.
- Dean, J. and S. Ghemawat (2008). "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51(1): 107-113.
- Mobasher, B. (2007). "Data mining for web personalization." The adaptive web: 90-135.
- http://en.wikipedia.org/wiki/File:FrequentItems.png
- Zaharia, M., A. Konwinski, et al. (2008). Improving MapReduce performance in heterogeneous environments, USENIX Association.
- Witten, I. H. and E. Frank (2002). "Data mining: practical machine learning tools and techniques with Java implementations." ACM SIGMOD Record 31(1): 76-77.
- Senthil kumar, A. (2011). Knowledge discovery practices and emerging applications of data mining: trends and new domains. Hershey, PA, Information Science Reference.
Thank you!
Questions?