Web-based User Profiles: Challenges and Opportunities

Christin Seifert MDPS Workshop, Lyon

2013-12-13 (Friday..)

in a Nutshell

Client Component

Privacy Proxy

Federated Recommender

Source 1: Search System

Source 2: CBR System

...

Visits

Web sites: major web hubs, long-tail content sites

cultural, educational, scientific content

users' regular whereabouts

Web-based User Profiles

Ch1: Visualization -> transparency -> trust -> adaptability

Ch2: Learning -> avoid manual entry -> fine-grained data -> large-scale usage in applications

O3: Applications ? privacy ? trust ? accuracy

Ch 3: Definition - rich user models - user context - time framing

Goal: “Saying the ‘right’ thing at the ‘right’ time in the ‘right’ way.” [Fischer, 2001]

User model: “.. is the knowledge about the user, either explicitly or implicitly encoded, that is used by the system to improve the interaction.” [Kass & Finin, 1988]

User profile: “.. is a machine-processable representation of a user model for the purpose of user identification and personalization.” [Carberry et al., 2013]

User models and profiles

“In fuzzy terms, context is [...] the ‘everything else’ of the environment. More precisely, context is the set of features in the environment that are not explicitly intended as input into the system being discussed.” [Rhodes, 2000]

“Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves.” [Dey & Abowd, 1999]

(after reviewing 150 notions of contexts) "A definition of context depends on the field of knowledge that it belongs to" [Bazire and Brézillon, 2005]

Context

Our View: user profile is aggregated information

long-term profile information (long-term interests, knowledge, regular tasks, birthday, name, ..) does not change over a “longer” period of time

short-term profile information (interests, task) is only valid for a “short” period of time [Li et al., 2007, Bennett et al., 2012]

context (device information, physical surroundings, social setting) is not part of the user, cannot be captured, or is not aggregated information
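The split into long-term and short-term profile information can be sketched as two aggregations over the same observations. This is a minimal illustration, not the project's implementation: the `Observation` type, the function name `aggregate_profile`, and the 30-minute `short_window` are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One observed interaction: a topic seen at a point in time (seconds)."""
    timestamp: float
    topic: str


def aggregate_profile(observations, now, short_window=1800.0):
    """Aggregate raw observations into long-term and short-term topic weights.

    short_window (30 minutes) is an assumed cut-off, not a value from the talk.
    """
    long_term = Counter(o.topic for o in observations)
    short_term = Counter(
        o.topic for o in observations if now - o.timestamp <= short_window
    )

    def normalise(counts):
        # Turn raw counts into relative weights that sum to 1.
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()} if total else {}

    return {"long_term": normalise(long_term), "short_term": normalise(short_term)}


# Toy observations (hypothetical): two old ones and three recent ones.
observations = [
    Observation(0.0, "Privacy"),
    Observation(100.0, "Privacy"),
    Observation(90_000.0, "Recommender_system"),
    Observation(90_500.0, "Recommender_system"),
    Observation(90_900.0, "Privacy"),
]
profile = aggregate_profile(observations, now=91_000.0)
```

In this toy data the long-term view is dominated by "Privacy", while the short-term view is dominated by "Recommender_system", which is the kind of difference the two profile parts are meant to capture.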

D5.1 Usage Pattern and Context Detection Specification and Analysis

Table 2: Dimensions and attributes of the user profile

attribute | im-/explicit | expected impact | learning complexity | vocabulary

Interest
  topic | I | H | M | skos
  weight | I | H | M | wi,wo
Knowledge
  topic | I | M | H | skos
  weight | I | M | H | wi,wo
Demographics
  profession | E | M | H | -
  education level | E | M | H | -
  institution | E | M | H | -
  first name | E | L | H | foaf
  last name | E | L | H | foaf
  birthday | E | L | H | gumo
  birthplace | E | L | H | gumo
  address (city, country, house-nr, postal code, state, street) | E | L | L | gumo
Social Connections
  connections in social networks (strong/weak ties, type of connection (groups)) | I | M | M | sioc/foaf
Resource Relations
  resource (timestamp, annotation, been recommended) | I | M | M | oa/foaf
Behavioural Patterns / Tasks / Goals
  tasks | I | H | H | tmo

expected impact - Low (L), Medium (M), High (H): the expected influence of a particular attribute on recommendation quality

learning complexity - Low (L), Medium (M), High (H): the expected complexity to learn a particular feature of the user profile automatically

im-/explicit - Implicit (I), Explicit (E): whether the feature has to be given by the user explicitly or can be mined implicitly (the user may also change implicitly mined features manually)

© EEXCESS consortium: all rights reserved










not considered: Individual traits (introvert/extrovert) usually captured by extensive psychological tests [Brusilovsky & Millan, 2004]

User Model

Long-Term Profile
- demographics
- interest
- knowledge
- behavioral patterns, tasks, goals
- social connections
- resource relations

Short-Term Profile
- task
- interest
- session

Context
- physical skills
- physical activities
- social
- location
- focus
- device
- physical surroundings










Semantic Description

Example

Goal: Finding literature about privacy in recommender systems

Task: Literature search

Topics:

http://dbpedia.org/page/Recommender_system

http://dbpedia.org/page/Privacy
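The example above could be held in a small, machine-processable record. The following is only a sketch: the class name `ShortTermProfile`, its field names, and the helper `topic_labels` are hypothetical and not taken from the talk; only the goal, task, and topic URIs come from the slide.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ShortTermProfile:
    """Hypothetical container for the goal/task/topics example."""
    goal: str
    task: str
    topics: list = field(default_factory=list)  # topic URIs


profile = ShortTermProfile(
    goal="Finding literature about privacy in recommender systems",
    task="Literature search",
    topics=[
        "http://dbpedia.org/page/Recommender_system",
        "http://dbpedia.org/page/Privacy",
    ],
)


def topic_labels(p):
    """Derive human-readable labels from the last path segment of each URI."""
    return [uri.rsplit("/", 1)[-1].replace("_", " ") for uri in p.topics]


record = asdict(profile)        # plain dict, e.g. for JSON serialisation
labels = topic_labels(profile)  # labels for display to the user
```

Keeping topics as URIs rather than free-text strings is what makes the description semantic: the same identifier can be matched across sources, while the label is derived only for display.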


extensive, semantically described user model

open questions: which parts can be inferred automatically, what can be exploited (for specific applications), and how to convey the content to the users

User Model










Information Visualization

“Information visualization (InfoVis) is the communication of abstract data through the use of interactive visual interfaces.”

[Keim et al., 2006]










Hierarchical/network data

Nominal data

Geo-spatial data

Time

Complex multivariate data with different types of variables

Allow interactive adaptations


TreeMap [Johnson and Shneiderman, 1991]

Marks and Channels

The 8 Visual Channels - Position

Projection examples: Left are geographic projections and right are projections of multidimensional data (i.e. text documents) on a 2D surface while retaining the topical similarity of documents.

Source Wikipedia

Source Granitzer

VA:II-18 Foundations © Granitzer/Seifert 2013

Information Landscapes [Sabol et al., 2009]

DOITrees [Heer & Card, 2004]


FDP [Fruchterman & Reingold, 1991]

Danny Holten & Jarke J. van Wijk / Force-Directed Edge Bundling for Graph Visualization

Figure 7: US airlines graph (235 nodes, 2101 edges) (a) not bundled and bundled using (b) FDEB with inverse-linear model, (c) GBEB, and (d) FDEB with inverse-quadratic model.

Figure 8: US migration graph (1715 nodes, 9780 edges) (a) not bundled and bundled using (b) FDEB with inverse-linear model, (c) GBEB, and (d) FDEB with inverse-quadratic model. The same migration flow is highlighted in each graph.

Figure 9: A low amount of straightening provides an indication of the number of edges comprising a bundle by widening the bundle. (a) s = 0, (b) s = 10, and (c) s = 40. If s is 0, color more clearly indicates the number of edges comprising a bundle.

we generated use the rendering technique described in Section 4.1. To facilitate the comparison of migration flow in Figure 8, we use a similar rendering technique as the one that Cui et al. [CZQ*08] used to generate Figure 8c.

The airlines graph is comprised of 235 nodes and 2101 edges. It took 19 seconds to calculate the bundled airlines graphs (Figures 7b and 7d) using the calculation scheme presented in Section 3.3. The migration graph is comprised of 1715 nodes and 9780 edges. It took 80 seconds to calculate the bundled migration graphs (Figures 8b and 8d) using the same calculation scheme. All measurements were performed on an Intel Core 2 Duo 2.66GHz PC running Windows XP with 2GB of RAM and a GeForce 8800GT graphics card. Our prototype was implemented in Borland Delphi 7.

© 2009 The Author(s). Journal compilation © 2009 The Eurographics Association and Blackwell Publishing Ltd.

Hierarchical Edge-Bundling [Holten & van Wijk, 2009]

Dot (Graphviz) [Gansner et al., 1993]

dictionaries are used to take into account the vocabulary and semantics of terms (concepts):

• Positive filters (domain concepts): these filters contain exclusive terms to retain in the document. This may be the projection of text documents on a specific domain ontology to select only specific concepts in this area. For our experimentation, we are not interested in a specific domain, thus we used as positive filters all the concepts which appear more than a predefined threshold in the document (e.g., more than 2 or 3 times).

• Negative filters (empty concepts): these filters contain concepts having no meaning in the context of the study. Typical examples of empty concepts are articles of any language. Depending on the languages studied, some negative filters already exist and can be reused or enriched.

• Synonyms dictionaries: several terms may refer to a single concept of the studied area. Synonyms dictionaries link several terms referring to the same concept. Depending on the studied languages, a synonyms dictionary can be automatically built and enriched from the document.

Once interests (domain concepts) are discovered from text, they are projected toward users and their dates of use (time) in order to construct a 3-D co-occurrence matrix [17]. Depending on the desired temporal level of granularity in the analysis, it is possible to split the time in different periods (e.g.,

Figure 3. Long-term and short-term interests visualization for a user (“Up”). Time has been divided into 4 semesters based on homogeneity frequency of activities (semester 2 of 2008, semester 1 of 2009, semester 2 of 2009 and semester 1 of 2010). The graph is undirected, node-weighted, and edge-weighted. Each node represents a user (histogram bars with different colors) or an interest (histogram bars with blue color). For one node, each bar represents the frequency of the node for a time period. The succession of the bars for each node is made in a clockwise representation of time periods, beginning here from semester 1 of 2009. For instance, for the node “up” the red bar represents his activity frequency in semester 1 of 2009, the orange bar represents his activity frequency in semester 1 of 2010, the yellow bar represents his activity frequency in semester 2 of 2008, and the green bar represents his activity frequency in semester 2 of 2009.

Facebook Profile [Tchuente et al., 2010]

Infovis and User Profiles

Open Learner Models [Bull & Kay, 2012]

Fig. 1. Skill meters in SQL Tutor [4]

Fig. 2. Skill meters and structured view in OLMlets [20]

Skill meters are the most commonly used simple overviews of the learner model contents, with a meter assigned to each topic or concept, which may include separate skill meters for sub-topics. (The latter allows simple structuring in the model presentation - e.g. [9].) Most skill meters show the level of the user's knowledge, understanding or skill as a subset of expert knowledge [9], [21]. The two examples in Figures 1 and 2 additionally show: (i) level of understanding as a proportion of areas covered [4]; and (ii) the proportion of areas of difficulty that can be attributed to

Fig. 5. The Flexi-OLM concept map [29]

Fig. 6. Tree structures in UM [30] and Flexi-OLM [27]

User Profile

TreeMaps, landscapes and graph layouts are hard to understand and not easily adaptable interactively

The DOITree approach seems most promising, provided we can filter (focus) and create a tree from the graph

Because we automatically infer the user profile (and make mistakes), we should provide explanations.
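Filtering around a focus and creating a tree from the profile graph could, for instance, be done with a depth-limited breadth-first traversal. This is one possible reading, not the approach used in the project; the function name `focus_tree`, the `max_depth` parameter, and the toy graph are assumptions for illustration.

```python
from collections import deque


def focus_tree(graph, focus, max_depth=2):
    """Return a nested-dict tree: BFS spanning tree of `graph` rooted at `focus`.

    graph: dict mapping node -> iterable of neighbours.
    Nodes more than max_depth hops away from the focus are filtered out.
    """
    tree = {focus: {}}
    subtree_of = {focus: tree[focus]}
    depth = {focus: 0}
    queue = deque([focus])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour in depth:
                continue  # already placed: keeps the result a tree
            depth[neighbour] = depth[node] + 1
            subtree_of[node][neighbour] = {}
            subtree_of[neighbour] = subtree_of[node][neighbour]
            queue.append(neighbour)
    return tree


# Hypothetical profile graph: a user node linked to interest topics.
profile_graph = {
    "user": ["Privacy", "Recommender system"],
    "Privacy": ["user", "Recommender system", "Anonymity"],
    "Recommender system": ["user", "Privacy", "Collaborative filtering"],
    "Anonymity": ["Privacy", "k-anonymity"],
    "Collaborative filtering": ["Recommender system"],
    "k-anonymity": ["Anonymity"],
}
tree = focus_tree(profile_graph, "user", max_depth=2)
```

Each node appears exactly once, attached to the parent through which it was first reached, so cross links such as Privacy to Recommender system are dropped from the tree view.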




















Learning Tasks

Task detection

Topic detection

Mining Tasks

Resource Relations

Social Connections

Task Detection

On the desktop [Rath et al., 2008]

bag-of-words, VSM, TF-IDF; SVM, NB, KNN

<10, predefined or user-defined tasks

80% accuracy, most informative feature is window title
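A bag-of-words pipeline of this kind can be sketched with TF-IDF vectors over window titles and a nearest-neighbour classifier. This is a toy illustration in plain Python, not the setup of the cited study: the titles, task labels, and function names are hypothetical, and 1-nearest-neighbour with cosine similarity stands in for the SVM, NB, and KNN classifiers named above.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(documents):
    """Smoothed inverse document frequency per term (all weights positive)."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen during fitting are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def predict_task(title, training, idf):
    """1-nearest-neighbour: label of the most similar training title."""
    query = tfidf(title, idf)
    return max(training, key=lambda ex: cosine(query, tfidf(ex[0], idf)))[1]


# Hypothetical window titles with task labels (toy data, not from the study).
training = [
    ("inbox - mail client", "email"),
    ("re: project meeting - mail client", "email"),
    ("report.docx - word processor", "writing"),
    ("draft paper.docx - word processor", "writing"),
]
idf = fit_idf([title for title, _ in training])
predicted = predict_task("thesis.docx - word processor", training, idf)
```

The unseen file name contributes nothing to the query vector; the prediction rests on the shared application terms in the title.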

In the Web setting: unsupervised; tasks are defined by query similarity or time frame

<1% multitasking sessions with mostly 2 tasks [Buzikashvili, 2006]
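Defining tasks by time frame can be illustrated by cutting a query log wherever the pause between consecutive queries is too long. This is a minimal sketch; the function name `split_by_time_gap`, the 30-minute `max_gap`, and the sample log are assumptions, not values from the cited work.

```python
def split_by_time_gap(events, max_gap=1800.0):
    """Group (timestamp, query) pairs into segments.

    A new segment starts whenever the pause since the previous event
    exceeds max_gap seconds (an assumed threshold).
    """
    segments = []
    previous = None
    for timestamp, query in sorted(events):
        if previous is None or timestamp - previous > max_gap:
            segments.append([])
        segments[-1].append(query)
        previous = timestamp
    return segments


# Hypothetical query log: timestamps in seconds.
log = [
    (0.0, "privacy recommender systems"),
    (120.0, "privacy preserving recommender"),
    (5000.0, "train tickets lyon"),
    (5060.0, "lyon hotel"),
]
segments = split_by_time_gap(log)
```

A similarity-based variant would instead compare each query with the previous one and cut where the similarity drops below a threshold.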

Topic Detection

Interests are modelled using DBpedia categories

4 million “things”, 3.2 million classified in the ontology

Hierarchical, multi-class, multi-label

persons 832k

places 640k

creative works 370k

organizations 210k

species 226k

Topic Detection

Hierarchical training [Dumais & Chen, 2009]

17k classes, 50k training samples, LibSVM

F1 on top-level 0.57, 2nd level 0.47
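One common way to use a class hierarchy is top-down classification: a document is only passed to the classifiers of a category's children if the category itself accepted it. The sketch below illustrates that idea under assumptions; the taxonomy, the threshold, and the keyword-based `keyword_scorer` (a stand-in for trained per-category classifiers such as SVMs) are hypothetical.

```python
def classify_top_down(tokens, node, scorers, threshold=0.5, path=()):
    """Return all category paths whose local score passes the threshold.

    node: dict mapping child category -> subtree dict (empty dict for a leaf).
    scorers: dict mapping category -> function(tokens) -> score in [0, 1].
    """
    results = []
    for child, subtree in node.items():
        if scorers[child](tokens) < threshold:
            continue  # prune: descendants of a rejected category are skipped
        child_path = path + (child,)
        results.append(child_path)
        results.extend(
            classify_top_down(tokens, subtree, scorers, threshold, child_path)
        )
    return results


def keyword_scorer(keywords):
    """Toy stand-in for a trained per-category classifier."""
    keywords = set(keywords)
    return lambda tokens: len(keywords & set(tokens)) / len(keywords)


# Hypothetical two-level category tree and scorers.
taxonomy = {"Science": {"Computer science": {}, "Biology": {}}, "Arts": {"Music": {}}}
scorers = {
    "Science": keyword_scorer(["research", "system"]),
    "Computer science": keyword_scorer(["system", "algorithm"]),
    "Biology": keyword_scorer(["species", "cell"]),
    "Arts": keyword_scorer(["painting", "music"]),
    "Music": keyword_scorer(["music", "song"]),
}
labels = classify_top_down(
    "research on recommender system algorithm".split(), taxonomy, scorers
)
```

Because several children may pass the threshold at each level, the result is naturally multi-label; errors made at the top level propagate downwards, which is consistent with second-level scores being lower than top-level ones.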

Cluster Analysis Basics

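Let $x_1, \ldots, x_n$ denote the $p$-dimensional feature vectors of $n$ objects:

\[
\begin{array}{c|cccc}
 & \text{Feature 1} & \text{Feature 2} & \cdots & \text{Feature } p \\ \hline
x_1 & x_{1_1} & x_{1_2} & \cdots & x_{1_p} \\
x_2 & x_{2_1} & x_{2_2} & \cdots & x_{2_p} \\
\vdots & & & & \\
x_n & x_{n_1} & x_{n_2} & \cdots & x_{n_p}
\end{array}
\]

no target concept: $c_1, c_2, \ldots, c_n$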

ML:XI-16 Cluster Analysis © STEIN 2002-2013


MADANI, CONNOR AND GREINER

[Figure: features f1-f4, instances x1-x5 and classes C1-C3 with edge weights such as w12 and w13; a features-instances-classes graph from which a features-classes index is computed]

Figure 2: A depiction of the problem: the input can be viewed as a tripartite graph, possibly weighted, and perhaps only seen one instance at a time in an online manner. Our goal is to learn an accurate efficient index, that is, a sparse weighted bipartite graph that connects each feature to zero or more classes, such that an adequate level of accuracy is achieved when the index is used for classification. The instances are ephemeral: they serve only as intermediaries in affecting the connections from features to classes. The index to learn is also equivalent to a sparse weight matrix (in which the entries are nonnegative in our current work) (see Sections 3.2 and 3.2.1).

3.1 The Level of Human Involvement in Teaching and Many-Class Learning

Learning under myriad-classes is not confined to a few text-classification problems. There are a number of tasks that could be viewed as problems with many classes and, if effective many-class methods are developed, such an interpretation can be quite useful. In terms of the sources of the classes, we may roughly distinguish supervised learning problems along the following dimensions (the roles of the teacher):

1. The source that defines the classes of interest, that is, the space of the target classes to predict.

2. The source of supervisory feedback, that is, the source or the process that assigns to each instance one or more class labels, using the defined set of classes. This is necessary for the procurement of training data, for supervised learning.

In many familiar cases, the classes are both human-defined and human-assigned. These include typical text classification problems (e.g., see Lewis et al. 2004 and Open Directory Project or Yahoo! directories/topics). In many others, class assignment is achieved by some “natural” or indirect activity, that is, the “labeling” process is not as explicit or controlled. The labeling is a by-product of an activity carried out for other purposes. One example of this case is data sets obtained from news groups postings (e.g., Lang, 1995). In this case, users post or reply to messages, without necessarily verifying whether their message is topically relevant to the group. Another example problem is predicting words using the context that the word appears (the words are the classes). In these problems, the set of the classes of interest may be viewed as human-defined, but the labeling is implicit (collections of written or spoken texts in the word prediction task). The extreme case

Topic Detection

IR based approaches [Madani et al., 2009]

instead of documents, classes are indexed

weights of features are learned

70k instances, 17k classes, 0-1 loss 0.35
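Indexing classes instead of documents means that each feature points to the classes it is evidence for, and classification becomes a lookup plus score aggregation. The sketch below uses simple co-occurrence ratios as weights; this is a deliberate simplification and not the weight-learning procedure of Madani et al., and all names and data are hypothetical.

```python
from collections import defaultdict


def build_class_index(instances):
    """Build a feature -> {class: weight} index from (features, classes) pairs.

    Weight = fraction of the instances containing the feature that carry
    the class (a counting heuristic assumed for this sketch).
    """
    feature_count = defaultdict(int)
    co_count = defaultdict(lambda: defaultdict(int))
    for features, classes in instances:
        for f in set(features):
            feature_count[f] += 1
            for c in classes:
                co_count[f][c] += 1
    return {
        f: {c: n / feature_count[f] for c, n in per_class.items()}
        for f, per_class in co_count.items()
    }


def rank_classes(features, index):
    """Score classes by summing the weights retrieved for each query feature."""
    scores = defaultdict(float)
    for f in set(features):
        for c, w in index.get(f, {}).items():
            scores[c] += w
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy multi-label training data (hypothetical).
instances = [
    (["privacy", "law"], ["Privacy"]),
    (["privacy", "recommender"], ["Privacy", "Recommender_system"]),
    (["recommender", "ranking"], ["Recommender_system"]),
]
index = build_class_index(instances)
ranking = rank_classes(["recommender", "ranking"], index)
```

The cost of classifying a document depends on how many classes its features point to, not on the total number of classes, which is what makes an index attractive when there are thousands of classes.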

Training Data

We need training data for supervised machine learning

tasks

topics

Web-based training data collection



Figure 6: screenshot of overlay with additional information for a resource

4.4.4 Adaptions for Test Data Acquisition

This section describes adaptions to the prototype that were performed to meet the test setting requirements as presented in section 4.2.2. Basically, they comprise an extension of the injected widget’s user interface to record the execution of tasks, functionality to prevent user interaction before having started a task, changes to the query process and logging of additional information. An update of the object stores’ structure was necessary alongside with these modifications.

UI controls for task detection: For recording tasks, an additional tab was added to the injected widget’s menu. Figure 7 shows the contents of this additional tab. The task to perform needs to be selected at the input field shown at 02. The user can select one of the predefined tasks ("annotate a webpage" or "write a blog entry") or choose "other". Choosing the latter will prompt to specify a custom label for the task after its execution. When choosing the task "other", an additional checkbox 04 is shown to indicate, whether recommendations are desirable for this task or not. The user can adjust his level of expertise on the task at hand with the slider at 05. Possible values range from 0 (lowest) to 10 (highest). The topics related to the specified task are defined at 06. This input field features auto-completion for dbpedia-categories. This means, the user gets suggested dbpedia-

Figure 7: screenshot of task definition user interface



Table 4: features collected with user tests along with respective methods

feature | method | im-/explicit

predefined task name | UI-control [select field] | E
custom task name | UI-control [input field] | E
task start-time | UI-control [button] | E
task end-time | UI-control [button] | E
level of expertise | UI-control [slider (range 0-10)] | E
topics relevant to task | UI-control [input field] | E
input language of topics | UI-control [select field] | E
indicator, if recommendations are desirable | UI-control [checkbox] | E
search queries | UI-control [input field] | E
rating of recommendations | UI-control [button (good/bad)] | E
assessment if recommendations & interface were helpful | question | E
assessment of sensitivity level of particular personal information | question | E
disclosure level of personal information (subject to the recommender’s quality) | question | E
clicked recommendations | implicit | I
ignored recommendations | implicit | I
dwell time at recommendation preview | implicit | I
mouse clicks (+target) | implicit | I
textual input | implicit | I
browsing history | implicit | I
browser profile (plugins, ...) | implicit | I

The content creation task is to write a blog entry about a given topic. The topics to write about are semi-defined: They comprise an important historical event, a cultural sight of the user’s hometown (or another town of her choice) and a person, who played a significant role in history. The semi-defined topics provide the ability for the user to choose a topic, she already has some knowledge about. The users are instructed to query for additional resources, with which they can enrich their blog post while writing it.

The predetermined tasks alternate with tasks of the user’s own choice, i.e. a possible sequence is (all tasks executed within the browser):

1. annotate a web page

2. read newspaper article


4. write a blog entry

5. watch funny video clips


7. engage in a forum discussion











User Profile










A Note on Privacy

explicit user profile

anonymized user profile

privacy-preserving proxy

user profile visualization

federated recommender

life is easy

.. and hard

embedded learning

exchange of ML models










Recommender

Adaptive Visualization

Personalized Search
