Domain Specific Document Retrieval Framework for Near Real-time Social Health Data - Swapnil Soni's...


Page 1: Domain Specific Document Retrieval Framework for Near Real-time Social Health Data - Swapnil Soni's MS thesis presentation

Domain-Specific Document Retrieval Framework for Near Real-time Social Health Data

Master’s Thesis

Swapnil Soni

Committee:
Prof. Amit P. Sheth (Advisor)
Prof. Krishnaprasad Thirunarayan
Dr. Tanvi Banerjee

Collaborator:
Ashutosh Jadhav

Contact:
LinkedIn: https://www.linkedin.com/in/swapnilsoniknoesis
Home page: http://knoesis.org/researchers/swapnil/
http://knoesis.org/

Page 2: Outline

Background, Objective
Problem Statement
Data Collection
Pattern Extraction Analysis
Results and Evaluation
Demonstration
Conclusion


Page 4: Background

Online health resources are easily accessible and provide information about most health topics.

These resources can help non-experts make more informed decisions and play a vital role in improving health literacy.

Sources: Pew Research
http://www.pewinternet.org/2010/03/24/health-information/
http://www.pewinternet.org/2011/02/01/health-topics-3/

Page 5: Background

According to Pew Research*:

• 45% of U.S. adults are dealing with at least one chronic condition
• Of those who are living with two or more conditions, 45% have diabetes

*http://www.pewinternet.org/files/old-media/Files/Reports/2013/PIP_TrackingforHealth%20with%20appendix.pdf

Page 6: Health Information Seeking

Online health information-seeker

Web search engine:
• Learn about basic facts
• Get deeper understanding about a topic of interest

Social media:
• Real-time content
• Popular trends

Choudhury et al., Seeking and Sharing Health Information Online: Comparing Search Engines and Social Media, ACM, 2014
Teevan et al., #TwitterSearch: A Comparison of Microblog Search and Web Search, 2011

Page 7: Health Information Seeking

Online health information-seeker

Social media search engine:
• Real-time content
• Relevant
• Reliable information

Page 8: Health Information Seeking Challenges

• Keyword-based techniques depend on the interpretation of keywords
• Search results may not be real-time

Page 9: Example: How to control diabetes

Keyword-based
Not real-time
Twitter

Page 10: Objective

To provide a platform for asking health-related questions and retrieving near real-time, reliable, and relevant health information shared on social media.

Page 11: Twitter as a Data Source

In the US, 18% of internet users use Twitter.

There are 500 million tweets per day and around 75K verified accounts of healthcare professionals from all over the world.

152K: number of health tweets posted every day by healthcare professionals.

Twitter has become a new source of information overload in healthcare.

Page 12: Problem

How to extract near real-time, reliable, and relevant documents from the health information shared on Twitter for a given user query?


Page 14: Data Collection

Page 15: Categories of Questions

Predefined questions:
Selected from the most frequently asked questions on Mayo Clinic, WebMD, etc.

Dynamic questions:
The user can ask any question

Page 16: System Architecture

[Architecture diagram. Components: Twitter; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); Database; URL content extractor; URL share & like counts extractor; Pattern extractor (Hadoop-based); Similarity score calculator; Patterns rank calculator (patterns, URL social media rank); User interface. Steps are numbered 1 to 5.]

Page 17: Apache Storm

Apache Storm is a distributed, real-time computation system.

Spouts and bolts are the basic components in Storm for real-time processing of data.

Networks of spouts and bolts are packaged into a “topology”, which is submitted to a Storm cluster.

Page 18: Topology Architecture

Spout:
• Crawler: crawls tweets in real time based on keywords

Bolts:
• Language identifier: allows only English tweets
• Object modeling: converts the tweet object to a Java object
• Hashtag extractor: retrieves hashtags from the tweets
• URL extractor: extracts URLs from the tweets
• URL resolver: expands each shortened URL to its original form
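The bolt chain can be illustrated outside Storm as a plain sequence of functions. The sketch below is only an approximation of the stages named on the slide: it assumes each tweet is a dictionary with hypothetical `lang` and `text` fields, and it replaces the network-based URL resolver with a hypothetical lookup table (`expansions`) so that it stays self-contained. It is not the Storm topology used in the thesis.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def is_english(tweet):
    """Language identifier stage: keep only tweets tagged as English."""
    return tweet.get("lang") == "en"


def extract_hashtags(text):
    """Hashtag extractor stage: return hashtags without the leading '#'."""
    return HASHTAG_RE.findall(text)


def extract_urls(text):
    """URL extractor stage: return URLs, trimming trailing punctuation."""
    return [url.rstrip(".,;)") for url in URL_RE.findall(text)]


def resolve_url(url, expansions):
    """URL resolver stage: map a short URL to its long form if known.

    `expansions` is a hypothetical lookup table standing in for an
    HTTP redirect lookup.
    """
    return expansions.get(url, url)


def process(tweets, expansions):
    """Run every tweet through the stages in order."""
    results = []
    for tweet in tweets:
        if not is_english(tweet):
            continue
        text = tweet["text"]
        results.append({
            "text": text,
            "hashtags": extract_hashtags(text),
            "urls": [resolve_url(u, expansions) for u in extract_urls(text)],
        })
    return results
```

In a real topology each function would be the body of one bolt, and the spout would feed tweets into the first bolt as they arrive.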


Page 20: Component: Pattern Extractor

[System architecture diagram. Components: Twitter; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); Database; URL content extractor; URL share & like counts extractor; Pattern extractor (Hadoop-based); Similarity score calculator; Patterns rank calculator (patterns, URL social media rank); User interface. Steps are numbered 1 to 5.]

Page 21: Extractors

Content Extractor:
Extracts content from the URLs present in the tweets.

URL(s) Share & Like Counts Extractor:
• Popularity of a source: to measure content popularity, we use the social media share and like counts of the URLs (Facebook share count, Facebook like count, Twitter share count).
• Reliability of a source: Google domain PageRank of the URLs.

Page 22: Pattern Extractor

Pattern-based Mining

Question
Construct an AQL query
Triple: subject, predicate, and object (a noun or noun phrase, or a verb or verb phrase)

Page 23: Annotation Query Language (AQL)

AQL is a language for building queries that pull structured information from unstructured or semi-structured text.

The syntax of AQL is similar to that of Structured Query Language (SQL).

[Diagram. Elements: AQL file (contains all the patterns), AOG, SystemT, Data folder, Result (contains the extracted patterns).]

Page 24: Pattern Extractor: Example

Question: How to control diabetes?

Synsets: UMLS, WordNet

Patterns:
X control diabetes
X control blood sugar
X handle blood sugar
X handle diabetes

Example tweet:
5 easy natural remedies to control diabetes : If you are a diabetic or know someone who is a diabeti... http://bit.ly/13oypg4
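The pattern list for this example can be reproduced by combining the synonyms of the predicate with the synonyms of the object. The sketch below assumes a small hand-written synonym table (`SYNONYMS`) in place of real WordNet and UMLS lookups; the table and the function name are illustrative only.

```python
from itertools import product

# Hypothetical synonym table standing in for WordNet / UMLS synset lookups.
SYNONYMS = {
    "control": ["control", "handle"],
    "diabetes": ["diabetes", "blood sugar"],
}


def expand_patterns(predicate, obj, synonyms=SYNONYMS):
    """Build 'X <predicate> <object>' patterns for every synonym pair."""
    verbs = synonyms.get(predicate, [predicate])
    objects = synonyms.get(obj, [obj])
    return [f"X {verb} {noun}" for verb, noun in product(verbs, objects)]


patterns = expand_patterns("control", "diabetes")
```

Here `X` is the open slot that the extractor fills with whatever subject appears in the tweet or article text.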

Page 25: Predefined Questions: Pattern Mining

This module extracts triples (patterns) from unstructured text (tweets and URLs’ content) based on predefined questions (AQL queries).

The text analytics engine executes the AQL queries at an interval of six hours.

Page 26: Dynamic Query Processing Architecture

Question: How to control diabetes by exercise?

Part-of-speech tagger: control (verb), diabetes (noun), exercise (noun)

Query builder (WordNet synsets):
diabetes control exercise
exercise control diabetes

Query executor
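The query-builder step for a dynamic question can be sketched as follows. The example assumes the part-of-speech tags are already available as (word, tag) pairs, so no tagger library is needed; the function name and tag labels are hypothetical, and synonym expansion is left out.

```python
from itertools import permutations


def build_patterns(tagged_terms):
    """Place each verb between every ordered pair of distinct nouns."""
    verbs = [word for word, tag in tagged_terms if tag == "verb"]
    nouns = [word for word, tag in tagged_terms if tag == "noun"]
    return [
        f"{subject} {verb} {obj}"
        for verb in verbs
        for subject, obj in permutations(nouns, 2)
    ]


# Hand-tagged terms for "How to control diabetes by exercise?"
tagged = [("control", "verb"), ("diabetes", "noun"), ("exercise", "noun")]
candidate_patterns = build_patterns(tagged)
```

For the tagged terms above this yields the two orderings shown on the slide.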

Page 27: Results

Question: How to control diabetes?
Extracted Pattern: control blood sugar
Paragraph: Exercise is a healthy way to lower and control blood sugar levels within your body. Doing exercise and lifting weights will improve your condition significantly. http://t.co/88ulxDPFTo

Question: How to control diabetes?
Extracted Pattern: insulin into the blood stream to handle
Paragraph: When a meal is eaten, the pancreas will send larger amounts of insulin into the blood stream to handle the food http://t.co/WsCWiNqhb9

Question: How to control diabetes?
Extracted Pattern: remove sugar
Paragraph: Since people with Type 2 diabetes tend to accumulate sugar in their blood due to their inability to efficiently remove sugar from the blood http://t.co/aHqKJjrTPY

Question: What are the symptoms of diabetes?
Extracted Pattern: Diabetics tend to get
Paragraph: Diabetics should be very cautious when having a pedicure. Diabetics tend to get bad infections in the feet, so you must be very aware of any puncture or cut you notice on your feet. http://t.co/HqJBjBtrXC

Question: What are the causes of diabetes?
Extracted Pattern: can cause diabetes
Paragraph: Smoking isn’t healthy for anyone but can be very dangerous if you’re a diabetic. This habit produces many poor health issues. Smoking makes a person’s insulin resistant, in can cause diabetes to develop http://t.co/Ca5SaXRL6w

Page 28: Pattern Rank Calculator Architecture

[Diagram. Components: Patterns, URL share and like counts, Similarity Score Calculator, Pattern Rank Calculator, Database. Steps are labeled 1a, 1b, 2, 3.]

Page 29: Features Set

Popularity:
• Facebook share counts
• Facebook like counts
• Twitter share count

Relevancy:
• Vector-based similarity score

Reliability:
• Google domain rank

Page 30: Query Expansion

Question: How to control diabetes?

Expanded queries:
How to control diabetes?
How to control blood sugar?
How to handle blood sugar?
How to handle diabetes?

Example texts:
Exercise controls diabetes
Natural way to handle blood sugar

TF-IDF scores shown on the slide: 0.81, 0.0, 0.81, 0.77
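A vector-based similarity between a candidate text and the expanded queries can be sketched with TF-IDF weights and cosine similarity. This is one common formulation and an assumption on our part: the slides do not state the exact weighting or tokenization, so the code below is not expected to reproduce the scores shown. All function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (raw.strip("?.,!").lower() for raw in text.split())
    return [token for token in tokens if token]


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict) per input text."""
    docs = [Counter(tokenize(text)) for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF, so terms shared by every text keep a nonzero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in doc_freq.items()
    }
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_score(candidate, expanded_queries):
    """Highest similarity between the candidate and any expanded query."""
    vectors = tfidf_vectors(list(expanded_queries) + [candidate])
    candidate_vec = vectors[-1]
    return max(cosine(query_vec, candidate_vec) for query_vec in vectors[:-1])


queries = [
    "How to control diabetes?",
    "How to control blood sugar?",
    "How to handle blood sugar?",
    "How to handle diabetes?",
]
score = best_score("Natural way to handle blood sugar", queries)
```

Taking the maximum over the expanded queries means a text that matches any synonym variant of the question can still score well.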

Page 31: Experiments: ML Classifiers

Social media share and like counts + Jaccard index on query expansion

Classifier       Precision   Recall
Naive Bayes      0.638       0.639
Support Vector   0.63        0.667
Random Forest    0.678       0.694
AdaBoostM1       0.376       0.556

Social media share and like counts + TF-IDF on query expansion

Classifier       Precision   Recall
Naive Bayes      0.753       0.722
Support Vector   0.687       0.75
Random Forest    0.793       0.806
AdaBoostM1       0.501       0.583
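The Jaccard index used as the alternative relevancy feature is the size of the intersection of two token sets divided by the size of their union. The sketch below assumes simple whitespace tokenization with basic punctuation stripping, which the slides do not specify.

```python
def token_set(text):
    """Lowercased set of tokens with basic punctuation stripped."""
    tokens = (raw.strip("?.,!").lower() for raw in text.split())
    return {token for token in tokens if token}


def jaccard(text_a, text_b):
    """Jaccard index: |A intersect B| / |A union B| over token sets."""
    set_a, set_b = token_set(text_a), token_set(text_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


similarity = jaccard("How to handle blood sugar?", "Natural way to handle blood sugar")
```

In this example the two texts share four tokens (to, handle, blood, sugar) out of seven distinct tokens overall, so the index is 4/7.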


Page 33: Evaluation

Reliability, Relevancy, and Real-time
Pattern Generator
Query Expansion based on Relevance Feedback

Page 34: Evaluation: Reliability, Relevancy & Real-time

Reliability:
• Based on the Google domain PageRank of the URL (extracted news article)
• Filtering criterion: the URL’s Google domain PageRank should be greater than 4

Relevancy:
• Based on a qualitative approach
• For a given question, user survey participants judge the relevancy of the result sets from 1) Twitter search, 2) Social Health Signal, and 3) Google time-bound search, and assign a relevancy score from 1 (low) to 3 (high)

Real-time:
• Timeliness (trends) of a retrieved document. We consider only 6 hours of data when finding information for a user’s query
• Example: breaking news on diabetes
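The reliability and real-time criteria amount to two filters on each retrieved document. The sketch below assumes each document is a dictionary with hypothetical `domain_rank` and `seen_at` fields; the field names and the function are illustrative, and only the thresholds (rank greater than 4, 6-hour window) come from the slide.

```python
from datetime import datetime, timedelta

MIN_DOMAIN_RANK = 4          # keep documents whose rank is strictly greater
WINDOW = timedelta(hours=6)  # only the most recent six hours of data


def filter_documents(docs, now):
    """Keep documents that are both reliable and recent."""
    kept = []
    for doc in docs:
        age = now - doc["seen_at"]
        reliable = doc["domain_rank"] > MIN_DOMAIN_RANK
        recent = timedelta(0) <= age <= WINDOW
        if reliable and recent:
            kept.append(doc)
    return kept


now = datetime(2015, 1, 1, 12, 0)
docs = [
    {"url": "http://example.org/a", "domain_rank": 6, "seen_at": datetime(2015, 1, 1, 10, 0)},
    {"url": "http://example.org/b", "domain_rank": 4, "seen_at": datetime(2015, 1, 1, 11, 0)},
    {"url": "http://example.org/c", "domain_rank": 7, "seen_at": datetime(2015, 1, 1, 5, 0)},
]
recent_reliable = filter_documents(docs, now)
```

Only the first document passes: the second has a rank that is not greater than 4, and the third is seven hours old.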

Page 35: Evaluation: Relevancy

Collected the top 10 results from each of these sources: Twitter search, Social Health Signal, and Google time-bound search

Queries (frequently asked):
1) How to control diabetes?
2) What are the causes of diabetes?
3) What are the symptoms of diabetes?

Page 36: Evaluation: Relevancy

Presented the top 10 results from all three sources for each query to participants

Participants judged each document of a query on a scale of 1 to 3 (i.e., 1: not good, 2: good, 3: very good)

To calculate the average rank, we used the following formula*:

\[
\mathrm{Rank} = \frac{(\text{total count of 3's}) \times 3 + (\text{total count of 2's}) \times 2 + (\text{total count of 1's}) \times 1}{\text{No. of participants}}
\]

*http://help.surveymonkey.com/articles/en_US/kb/What-is-the-Rating-Average-and-how-is-it-calculated
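As a worked example of the rating-average formula, the sketch below computes the average rank of one document from a list of participant ratings. It assumes each participant gives exactly one rating between 1 and 3, so the number of participants equals the number of ratings; the function name is illustrative.

```python
from collections import Counter


def rating_average(ratings):
    """Weighted average of 1-3 ratings, one rating per participant."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(rating not in (1, 2, 3) for rating in ratings):
        raise ValueError("ratings must be 1, 2, or 3")
    counts = Counter(ratings)
    weighted_sum = counts[3] * 3 + counts[2] * 2 + counts[1] * 1
    return weighted_sum / len(ratings)


# Four participants: two rate the document 3, one rates it 2, one rates it 1.
average = rating_average([3, 3, 2, 1])  # (2*3 + 1*2 + 1*1) / 4 = 2.25
```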

Page 37: Evaluation: Relevancy

[Survey form for the question "How to control diabetes?": Result 1, Result 2, and Result 3 are each rated Score 1, Score 2, or Score 3.]

Page 38: Evaluation: Relevancy

[Three panels, one per rating level. Rating labels on the slide: Bad, Good, Very Good (panel order not preserved).]

Panel 1
          Twitter search   Social Health Signal   Google time-bound
Query 1   40%              50%                    10%
Query 2   10%              60%                    50%
Query 3   40%              50%                    10%

Panel 2
          Twitter search   Social Health Signal   Google time-bound
Query 1   10%              10%                    40%
Query 2   30%              10%                    10%
Query 3   30%              20%                    70%

Panel 3
          Twitter search   Social Health Signal   Google time-bound
Query 1   50%              40%                    50%
Query 2   60%              30%                    40%
Query 3   30%              30%                    20%

Page 39: Evaluation Metrics: Relevancy

nDCG@K (Normalized Discounted Cumulative Gain)

• nDCG@K can handle multiple levels of relevance
• It gives more weight to a document at a higher rank position than to a document at a lower rank position
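A minimal sketch of nDCG@K is given below. It uses one common formulation, in which the relevance at rank position i is divided by log2(i + 1) and the ideal ordering is the same ratings sorted from highest to lowest. The slides do not state which DCG variant was used, so this is an assumption and the code is not claimed to reproduce the reported values.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance scores."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


# Ratings on the 1-3 scale, listed in the order the results were ranked.
ranked_ratings = [3, 2, 3, 1, 2]
value = ndcg_at_k(ranked_ratings, k=5)
```

A ranking that already lists the highest-rated documents first scores exactly 1, and pushing a highly rated document further down lowers the score.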

Page 40: Evaluation Metrics: Relevancy

[Three panels, one per query. Query labels on the slide: Query 1, Query 2, Query 3 (panel order not preserved).]

Panel 1
        Twitter search   Social Health Signal   Google time-bound
DCG     9.68             12.72                  9.12
IDCG    13.33            13.76                  9.81
nDCG    0.726            0.924                  0.929

Panel 2
        Twitter search   Social Health Signal   Google time-bound
DCG     9.67             13.15                  10.03
IDCG    10.55            14.15                  12.76
nDCG    0.91             0.92                   0.78

Panel 3
        Twitter search   Social Health Signal   Google time-bound
DCG     10.75            11.47                  10.76
IDCG    12.69            13.45                  10.89
nDCG    0.84             0.85                   0.98

Page 41: Evaluation: Popularity

Query: How to control diabetes?

Google time-bound

Facebook (Share + Like) Counts   Twitter Share Counts
4                                0
0                                0
0                                0
0                                4
0                                0
0                                0
1                                2
52                               1211
229                              0

Social Health Signal

Facebook (Share + Like) Counts   Twitter Share Counts
3910                             1843
213                              8
81                               90
0                                128
149                              826
0                                24
0                                20
0                                24
0                                2

Page 42: Google Time-Bound Search

Obesity cause diabetes
Overweight cause diabetes

Page 43: Social Health Signal

Obesity less Mortality Risk

URL Title: Replacing sugary drinks with water may reduce diabetes risk
Extracted Pattern: obese is a major risk

URL Title: More Evidence Links Diabetes to Alzheimer's Disease
Extracted Pattern: Overweight May Decrease Mortality Risk

URL Title: The facts about sugar
Extracted Pattern: overweight can increase your risk

URL Title: Having Diabetes Can Increase Your Alzheimer's Risk Via Blood Glucose And Brain Plaque Link
Extracted Pattern: obesity can also increase our risk

URL Title: Diabetes Study Suggests a Little Extra Weight Tied to Longer Survival
Extracted Pattern: risk for dying than overweight

Page 44: Demo

http://knoesis-twit.cs.wright.edu/SocialHealthSignal/

Page 45: Future Work

Evaluation
Relevancy on more Queries
Pattern Generator
Query Expansion based on Relevance Feedback
Semantic Categorization
Performance improvement for dynamic queries

Page 46: Conclusion

◦ Twitter has become a popular tool for seeking health information.
◦ It is a very difficult task to extract relevant and reliable health documents from Twitter in near real-time.
◦ We address this problem by using state-of-the-art approaches such as:
  ◦ Semantics-based pattern mining
  ◦ TF-IDF relevancy score on query expansion
  ◦ Content popularity: social media share and like counts
  ◦ Reliability: Google domain PageRank

Page 47: Acknowledgements

Dr. Amit Sheth
Dr. T.K. Prasad
Dr. Tanvi Banerjee
Ashutosh Jadhav

Page 48: Thanks!

Questions?

Page 49: Social Health Signal Screenshots

Page 50: Home Screen

Page 51: Search & Explore Screen

Page 52: Top 10 URLs Screen

Page 53: Tweet Locations Screen