1 Truth Validation and Veracity Analysis with Information Networks Jiawei Han Data Mining Group,...

1

Truth Validation and Veracity Analysiswith Information

Networks

Jiawei HanData Mining Group, Computer Science

University of Illinois at Urbana-Champaignwww.cs.uiuc.edu/~hanj

April 18, 2023

http://www.cs.uiuc.edu/~hanj

2

Outline

TruthFinder: Tuth Validation by Information

Network Analysis

Beyond TruthFinder: Multiple Versions of

Truth and Evolution of Truth

Enhancing Truth Validation by InfoNet

Analysis: The RankClus & NetClus

Methodology

Summary

3

Motivation

Why truth validation and veracity analysis? Information sharing

Sharing trustable, quality information Identifying false information among many

conflicting ones Information security

Protecting trustable information and its sources Identifying which information providers are

suspicious ones: frequently providing false information

Tracing back suspicious information providers via information networks

4

Truth Validation and Veracity Analysis by Information Network

Analysis The trustworthiness problem of the web (according to a survey):

54% of Internet users trust news web sites most of time 26% for web sites that sell products 12% for blogs

TruthFinder: Truth discovery on the Web by link analysis Among multiple conflict results, can we automatically identify

which one is likely the true fact? Veracity (conformity to truth):

Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how to discover the true fact about each object?

Xiaoxin Yin, Jiawei Han, Philip S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08

5

Conflicting Information on the Web

Different websites often provide conflicting info. on a subject, e.g., Authors of “Rapid Contextual Design”

Online Store Authors

Powell’s books Holtzblatt, Karen

Barnes & Noble Karen Holtzblatt, Jessamyn Wendell, Shelley Wood

A1 Books Karen Holtzblatt, Jessamyn Burns Wendell, Shelley Wood

Cornwall books Holtzblatt-Karen, Wendell-Jessamyn Burns, Wood

Mellon’s books Wendell, Jessamyn

Lakeside books WENDELL, JESSAMYNHOLTZBLATT, KARENWOOD, SHELLEY

Blackwell online Wendell, Jessamyn, Holtzblatt, Karen, Wood, Shelley

6

Mapping It to Information Networks

Each object may have a set of conflicting facts E.g., different author names for a book

And each web site provides some facts How to find the true fact for each object?

w1f1

f2

f3

w2

w3

w4

f4

f5

Web sites Facts

o1

o2

Objects

7

Basic Heuristics for Problem Solving

1. There is usually only one true fact for a property of an object

2. This true fact appears to be the same or similar on different web sites

E.g., “Jennifer Widom” vs. “J. Widom”

3. The false facts on different web sites are less likely to be the same or similar

False facts are often introduced by random factors

4. A web site that provides mostly true facts for many objects will likely provide true facts for other objects

8

Mutual Consolidation between Confidence of Facts and

Trustworthiness of Providers

Confidence of facts ↔ Trustworthiness of web sites A fact has high confidence if it is provided by

(many) trustworthy web sites A web site is trustworthy if it provides many facts

with high confidence The TruthFinder mechanism, an overview:

Initially, each web site is equally trustworthy Based on the above four heuristics, infer fact

confidence from web site trustworthiness, and then backwards

Repeat until achieving stable state

9

Analogy to Authority-Hub Analysis

Facts ↔ Authorities, Web sites ↔ Hubs

Difference from authority-hub analysis Linear summation cannot be used

A web site is trustable if it provides accurate facts, instead of many facts

Confidence is the probability of being true Different facts of the same object influence each

other

w1 f1

Web sites Facts

Hubs Authorities

High trustworthiness

High confidence

10

Inference on Trustworthness

Inference of web site trustworthiness & fact confidence

w1 f1

f2w2

w3

w4 f4

Web sites Facts

o1

o2

Objects

f3

True facts and trustable web sites will become apparent after some iterations

11

Computation Model: t(w) and s(f)

The trustworthiness of a web site w: t(w) Average confidence of facts it

provides

The confidence of a fact f: s(f) One minus the probability that all

web sites providing f are wrong

w1

f1

w2

t(w1)

t(w2)

s(f1)

wF

fswt wFf

Sum of fact confidence

Set of facts provided by w

fWw

wtfs 11

Probability that w is wrong

Set of websites providing f

12

Experiments: Finding Truth of Facts

Determining authors of books Dataset contains 1265 books listed on abebooks.com We analyze 100 random books (using book images)

Case Voting TruthFinder Barnes & Noble

Correct 71 85 64

Miss author(s) 12 2 4

Incomplete names 18 5 6

Wrong first/middle names 1 1 3

Has redundant names 0 2 23

Add incorrect names 1 5 5

No information 0 0 2

13

Experiments: Trustable Info Providers

Finding trustworthy information sources Most trustworthy bookstores found by TruthFinder vs.

Top ranked bookstores by Google (query “bookstore”)

Bookstore trustworthiness #book Accuracy

TheSaintBookstore 0.971 28 0.959

MildredsBooks 0.969 10 1.0

Alphacraze.com 0.968 13 0.947

Bookstore Google rank #book Accuracy

Barnes & Noble 1 97 0.865

Powell’s books 3 42 0.654

TruthFinder

Google

14

Outline


Network Analysis





Methodology

Summary

15

Beyond TruthFinder: Extensions

Limitations of TruthFinder: Only one version of truth

But people may have different, contrasting opinions

Not consider the time factor But truth may change with time, e.g., Obama’s

status in 2008 and 2009 Needed Extensions

Multiple versions of truth or opinions Evolution of truth

Philosophy Truth is a relative, evolving, and dynamically

changing judgment

16

Multiple Versions of Truth

Watch out of copy-cats! Copy-cat: Some information providers or even new

agencies simply copy each other Falsity could be amplified by copy-cats How to judge copy-cats: Always copying in certain

dimensional space Treat copy-cats as one instead of multiples

w1f1f2f3

w2

w3

w4

f4

f5

Web sites Facts

o1

o2

Objects

Statements can be clustered into multiple centers False statements: still diverse, spread, and lack

of converge Statements could be clustered based on

different dimensional space (context), e.g., Java

17

Transition/Evolution of Truth

Truth is not static: It changes dynamically with time Associating different versions of truth with

different time periods Clustering statements based on time durations Statements

Identifying clusters (density-based clustering) Distinguishing time-based clusters from outliers

Information providers Leaders, followers, and old-timers

Information-network based ranking and clustering Powerful analysis by information network analysis

18

Outline


Network Analysis





Methodology

Summary

19

Why RankClus?

More meaningful cluster Within each cluster, ranking score for every

object is available as well More meaningful ranking

Ranking within a cluster is more meaningful than in the whole network

Address the problem of clustering in heterogeneous networks No need to compute pair-wise similarity of objects Mapping each object into a low measure space

What type of objects to be clustered: Target objects (specified by user) Clustering of target objects can induce a sub-

network of the original network

20

Algorithm Framework - Illustration

SIGMOD

SDM

ICDM

KDD

EDBT

VLDB

ICML

AAAI

Tom

Jim

Lucy

Mike

Jack

Tracy

Cindy

Bob

Mary

Alice

SIGMOD

VLDB

EDBT

KDDICDM

SDM

AAAI

ICML

Objects

Ra

nkin

g

Sub-NetworkRanking

Clustering

21

Algorithm Framework - Summary

Step 0. Initialization Randomly partition target objects into K clusters

Step 1. Ranking Ranking for each sub-network induced from each

cluster, which serves as feature for each cluster Step 2. Generating new measure space

Estimate mixture model coefficients for each target object

Step 3. Adjusting cluster Step 4. Repeat Step 1-3 until stable

22

Focus on A Bi-type Network Case

Conference-author network, links can exist between Conference (X) and author (Y) Author (Y) and author (Y)

Use W to denote the links and there weights

W =

23

Step 1: Feature Extraction — Ranking

Simple Ranking Proportional to degree counting for objects E.g., number of publications of authors Considers only immediate neighborhood in the network

Authority Ranking Extension to HITS in weighted bi-type network Rules:

Rule 1: Highly ranked authors publish many papers in highly ranked conferences

Rule 2: Highly ranked conferences attract many papers from many highly ranked authors

Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

24

Rules in Authority Ranking Rule 1: Highly ranked authors publish many papers

in highly ranked conferences

Rule 2: Highly ranked conferences attract many papers from many highly ranked authors

Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

25

Example: Authority Ranking in the 2-Area Conference-Author

Network Given the correct cluster, the ranking of authors are quite

distinct from each other

26

Example: 2-D Coefficients in the 2-Area Conference-Author Network The conferences are well separated in the new

measure space

Scatter plots of two conferences and component coefficients

27

A Running Case Illustration for 2-Area Conf-Author Network

Initially, ranking distributions are mixed together

Two clusters of objects mixed together, but preserve similarity somehowImproved a little

Two clusters are almost well separated

Improved significantly

Stable

Well separated

28

Time Complexity Analysis

At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters Ranking for sparse network

~O(|E|) Mixture model estimation

~O(K|E|+mK) Cluster adjustment

~O(mK^2) In all, linear to |E|

~O(K|E|)

29

Case Study: Dataset: DBLP

All the 2676 conferences and 20,000 authors with most publications, from the time period of year 1998 to year 2007.

Both conference-author relationships and co-author relationships are used.

K=15

30

Beyond RankClus: A NetClus Model

RankClus combines ranking and clustering successfully to analyze information networks

A study on how ranking and clustering can mutually reinforce each other in information network analysis

RankClus works well on bi-typed information networks Extension of bi-type network model to star-network

model DBLP: Author - paper - conference - title (subject)

Author Conference

Subject

Paper

31

NetClus: Database System Cluster

database 0.0995511databases 0.0708818

system 0.0678563data 0.0214893query 0.0133316

systems 0.0110413queries 0.0090603

management 0.00850744object 0.00837766

relational 0.0081175processing 0.00745875

based 0.00736599distributed 0.0068367

xml 0.00664958oriented 0.00589557design 0.00527672

web 0.00509167information 0.0050518

model 0.00499396efficient 0.00465707

Surajit Chaudhuri 0.00678065Michael Stonebraker 0.00616469

Michael J. Carey 0.00545769C. Mohan 0.00528346

David J. DeWitt 0.00491615Hector Garcia-Molina 0.00453497

H. V. Jagadish 0.00434289David B. Lomet 0.00397865

Raghu Ramakrishnan 0.0039278Philip A. Bernstein 0.00376314

Joseph M. Hellerstein 0.00372064Jeffrey F. Naughton 0.00363698Yannis E. Ioannidis 0.00359853

Jennifer Widom 0.00351929Per-?ke Larson 0.00334911Rakesh Agrawal 0.00328274

Dan Suciu 0.00309047Michael J. Franklin 0.00304099Umeshwar Dayal 0.00290143

Abraham Silberschatz 0.00278185

VLDB 0.318495SIGMOD Conf. 0.313903

ICDE 0.188746PODS 0.107943EDBT 0.0436849

Ranking authors in XML

32

Outline


Network Analysis





Methodology

Summary

33

Summary Progress Highlights

3 PhD graduated in 2009 Currently over 20 Ph.D.s working on closely related projects Attract more funded projects: 3 NSFs, NASA, DHS, … Industry collaborations: Microsoft Research, IBM Research,

Boeing, HP Labs, Yahoo!, Google, … Research papers published in 2008 & 2009: 8 journal papers

and 53 conference papers, including KDD, NIPS, SIGMOD, VLDB, ICDM, SDM, ICDE, ECML/PKDD, SenSys, ICDCS, IJCAI, AAAI, Discovery Science, PAKDD, SSDBM, ACM Multimedia, EDBT, CIKM, …

Truth validation by information network analysis: A promising direction: TruthFinder, iNextCube, and beyond

Knowledge is power, but knowledge is hidden in massive links Integration of data mining with the project: Much more to be

explored!

34

Recent Publications Related to the Talk

X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08

Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, T. Wu, “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09

Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09

Y. Sun, J. Han, J. Gao, and Y. Yu, “iTopicModel: Information Network-Integrated Topic Modeling", ICDM'09

J. Han, “Mining Heterogeneous Information Networks by Exploring the Power of Links", Discovery Science'09 (Invited Keynote Speech)

M.-S. Kim and J. Han, “A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks", VLDB'09

Y. Yu, C. Lin, Y. Sun, C. Chen, J. Han, B. Liao, T.Wu, C. Zhai, D. Zhang, and B. Zhao, “iNextCube: Information Network-Enhanced Text Cube", VLDB'09 (system demo).

1 Truth Validation and Veracity Analysis with Information Networks Jiawei Han Data Mining Group,...

Documents

Transcript of 1 Truth Validation and Veracity Analysis with Information Networks Jiawei Han Data Mining Group,...