Post on 28-Jul-2018
12/14/2017
1
DAMI, December 11-13, 2017
Outline
Introduction to DSA@SAU research group
Data Deluge and New Dimensions of Research
Text Mining and Exploratory Data Analysis
Blog Data Analytics for Open Source Intelligence
Group Members:
Muhammad Abulaish (faculty member)
Mohd. Fazil (RS): Social bot characterization and detection
Vineet Sejwal (RS): Context- and trust-aware recommender system
Jayati Gulati (RS): Big graph mining for community detection
Ashraf Kamal (RS): Figurative language detection in Twitter
Harshita Dalal (RS): Identity deception detection using deep learning
Mohd. Tufail (RS): Cyberbullying characterization and detection
Seven M.Sc. (CS) students
SMAC
Mobile Computing
Analytics
Cloud Computing
Social Computing
Trending 21st
Century Technologies
SMAC
Mobile Computing
Analytics
Cloud Computing
Social Computing
DSA group’s focus
OCMiner: A Density-Based Overlapping Community
Detection Method for Social Networks, Intelligent Data
Analysis, Vol. 19, Issue 4, IOS Press, Netherland, 2015.
12/14/2017
2
HOCTracker: Tracking the Evolution of Hierarchical
and Overlapping Communities in Dynamic Social
Networks, IEEE Transactions on Knowledge and Data
Engineering, Vol. 27, Issue 4, pp. 1019-1032, 2014.
Namesake Alias Mining on the Web and its Role
Towards Suspect Tracking, Information Sciences,
Elsevier, Vol. 276, pp. 123-145, 2014.
Ranking Radically Influential Web Forum Users, IEEE
Transactions on Information Forensics and Security,
IEEE, 2015.
• Various modalities: Tables,
Images, Video, Audio, Text, hyper-
text, “semantic” text, Networks,
Spreadsheets, Multi-lingual
• Enriched Data: Weighted, Multi-
labeled, Temporal/spatial
attributes
• Dynamic, Distributed, Uncertain
• Massive: Tera/Peta-scale
Water, water, every where, nor any drop to drink:
l l C l id i f h i
Bioinformatics
• Datasets:– Genomes
– Protein structure
– DNA/Protein arrays
– Interaction Networks
– Pathways
– Metagenomics
• Integrative Science
– Systems Biology
– Network Biology
PubMed Dataset containing more than 26 millions citations for
biomedical lietratures
12/14/2017
3
Geo-Informatics
Cheminformatics
N
N
Cl
O
AAACCTCATAGGAAGCATACCAGGAATTACATCA…
Structural Descriptors Physiochemical Descriptors
Topological Descriptors Geometrical Descriptors
Business Intelligence
Web & Social Networks
WWW map
Data are not Independent, Rather Linked! Traditional notion of independent
data items is increasingly false,
Everything is Linked!
Many datasets can be modeled as
complex attributed graphs with
various attributes on nodes and
edges: labels, text, time series,
confidences, etc. Social (human) networks, Biological
networks, etc.
Even non-graph data, e.g., relational
tables, Tweets, Blogs data can be
modeled as graphs via some similarity
measure
Graphs, Graphs Everywhere
Data Science & Big Data
Hypothesis
Experiment
Data
Result
Design
Data analysis
Process/Experiment
DataNo Prior HypothesisNew Science of Data
Hypothesis DrivenTraditional Approach
12/14/2017
4
New Science of Data• New data models: dynamic, streaming, etc.
• New mining, learning, and statistical algorithms that offer timely and reliable inference and information extraction: online, approximate
• Data languages
• Data and model compression
• Data provenance, security and privacy
• Data sensation: visual, aural, tactile
• Knowledge validation: domain experts
• Leverage synergy across subdisciplies: Data Mining and Machine Learning, Databases, Statististics, Optimization, Algebra and Geometry, HPC, Information Theory, Visualization, Sonification, Social/ethical/legal Dimensions
Future So Bright I GottaWear Shades! (Timbuk3)
Why Text Mining?
Growth of E-contents: exponential
E-contents are not well-organized
unstructured or semi-structured
Inherent ambiguities in natural languages texts
Lexical, Syntactic, Semantic, Pragmatic
Problem: Extraction of core information becomes very
expensive.
Text mining: A solution to transform unstructured texts into
structured form and infer knowledge from them
Many Definitions in the literature
Definition-1: (Hearst, 1999)
A process of exploratory data analysis that leads to unknown
information, or to answers for questions for which the answer is
not currently known.
Definition-2:
The extraction of implicit, previously unknown, and
potentially useful information from (large amount of) textual
data
What is Text Mining?
Strict definition
Information that not even the writer knows
For example, discovering a new method for a hair growth that
is described as a side effect for a different procedure
Lenient definition
Rediscover the information that the author encoded in the
text
For example, automatically extracting a product’s name from
a text document.
What is “previously unknown” information?
Exploiting information contained in textual
documents (Grobelnik et al., 2001)
To discover patterns and trends in data
To discover associations among entities
To discover predictive rules
Objective of Text Mining
Large textual databases
High dimensionality
Dependency
Noisy data
Ambiguity
Not well-structured data
Text Mining Challenges
12/14/2017
5
Text Mining Challenges (cont…)
Large textual database
Efficiency consideration
• Google: Indexing capacity is increasing day-by –day
• Almost all research publications are in textual format
– PubMed/Medline : Biomedical research papers
High dimensionality
Consider each word/phrase as a dimension
The Size of the WWW
Large textual database
Efficiency consideration
• Google: Indexing capacity is increasing day-by –day
• Almost all research publications are in textual format
– PubMed/Medline : Biomedical research papers
High dimensionality
Consider each word/phrase as a dimension
Dependency
Relevant information is a complex conjunction of
words/phrases
• Document categorization
• Pronoun disambiguation
Noisy Data
Spelling mistakes, Grammar mistakes, etc.
Text Mining Challenges (cont…)
Text Mining Challenges (cont…)
Ambiguity
Word ambiguity
• Buy vs. Purchase• Cricket (game) vs Cricket (insect)
Semantic ambiguity
• The king saw the rabbit with his glasses (many meanings)
Not well-structured text
R u available?
Hey whazzzz up
Text Mining
Data Mining
Database
Computational Linguistics
Information Retrieval &
IE
Exploratory Data
Analysis
Text Mining: A Confluence of Multiple Disciplines
How does it relate to Data Mining?
How does it relate to Computational Linguistics?
How does it relate to Exploratory Data Analysis?
How does it relate to Database?
How does it relate to Information Retrieval?
Text Mining (Hearst-1999)
Finding PatternsFinding “Nuggets”
Novel Non-Novel
Non-Textual data Data Mining Exploratory
Data
Analysis
Database Queries
Textual dataComputational
LinguisticsInformation
Retrieval
12/14/2017
6
An approach for data analysis that employs a variety of
techniques (mostly graphical – scatter plots, box plot,
histograms, probability plot, etc.) to
Uncover underlying structure of data
Extract important variables
Detect outliers and anomalies in data
Develop models
Steps:
Problem Data Analysis Model Conclusions
Exploratory Data Analysis (EDA)
Example: Biomedical Data Exploration (Swanson, and Smalheiser, 1997)
Extract pieces of evidence from article titles in the
biomedical literature
“stress is associated with migraines”
“stress can lead to loss of magnesium”
“magnesium is a natural calcium channel blocker”
“calcium channel blockers prevent some migraines”
Induce a new hypothesis not in the literature by
combining culled text fragments with human
medical expertise
Magnesium deficiency may play a role in some kinds of
migraine headache
Semantic Net Representation
Stress Migraine
Magnesium
Calcium Channel Broker
associated with
lead to loss
is a
prevents
Magnesium deficiency may play a role in some kinds of migraine headache
Refers to the use of papers and other academic
publications (the "literature") to find new relationships
between existing knowledge (the "discovery").
Pioneered by Don R. Swanson in the 1980s.
LBD does not generate new knowledge through laboratory
experiments, as is customary for empirical sciences.
Instead it seeks to connect existing knowledge from
empirical results by bringing to light relationships that are
implicated and "neglected”
Literature Based Discovery (LBD)
Swanson linking is a term proposed in 2003 that refers to
connecting two pieces of knowledge previously thought to
be unrelated.
For example, it may be known that
Illness A is caused by chemical B, and
Drug C is known to reduce the amount of chemical B in the body.
If the respective articles not analyzed together then
relationship between illness A and drug C may be
unknown.
Swanson linking aims to find these relationships and
report them.
LBD (cont…): Swanson Linking
LBD (cont…): Swanson Linking
(i) Illness A is caused by chemical B, and
(ii) Drug C is known to reduce the amount of chemical B in the body
12/14/2017
7
Forum Crawling and Parsing
User List
Body Text
Quotations Timestamp
User Id
Message ID
Thread ID
Ranked Users
DataPre-processing
Metadata Extraction
Cleaning
Chunking
User Radicalness Identification
Threat list
Radicalness
User Posts
As we know that the natural language, be it the spoken or wr itten, is highly ambiguous in
nature, identifying the radicalness in natural language text is a major challenge. Many pr ior works have attempted to identify the radical elements based on their textual
contents exchanged in discussions. However, the foundation of their automatic radical identification
process is laid on a set of manually
User Collocations
User Collocation Identification
7.09.03.05.04.08.03.04.01.00.0
0.01.03.08.07.05.07.05.01.03.0
0.05.03.05.09.08.01.00.00.03.0
5.08.01.03.08.07.06.01.08.05.0
6.08.01.06.05.08.03.09.06.00.0
4.07.03.08.05.06.02.00.04.08.0
0.03.04.07.02.09.03.05.08.09.0
0.01.00.02.06.03.01.00.05.04.0
8.05.06.09.04.08.05.06.04.01.0
3.04.07.00.05.08.03.07.06.01.0
User Collocations
As we know that the natural language, be it the spoken or wri tten, is highly ambiguous in
nature, identifying the radicalness in natural language text is a major challenge. Many pr ior works have attempted to identify the radical elem ents based on their textual
contents exchanged in discussions. However, the foundation of their automatic radical identification
process is la id on a set of manually
Radically Influential User Ranking
Ranking
Ass
oci
ativ
ity
Com
put
atio
n
Who is Radical?
A person galvanizing people by fanatic thoughts beyond the
norm to an extreme antagonistic political, religious, racial,
nationalist or any other ideology.
Such thoughts arouse in minds when they feel of some unjust or
discrimination happened with them either directly or indirectly,
though it actually may be false.
Such thoughts are sometimes triggered by their personal
involvement (e.g., death of a close relative or friend), political
involvement (e.g., being a follower of a political or religious
belief), and social involvement (e.g., racism, nationalism).
Hostility may be against a race, or a political party, or a religion,
or a nation, or any organization with a mass of followers.
Who is Influential? Very hard to define concretely or measure tangibly, despite the
large number of existing theories of sociology.
Can be approximated as something attained by the activeness of
a person. However, Just being active in communication does
not make someone influential in a social network.
Influential users generally get a very good response from others
in their comments, and it differentiates them from the
spammers, who in spite of being active do not receive much
attention.
Major factors that make a user influential - recognition, activity
generation, novelty, and eloquence.
Radically Influential User
Characterized by two key properties – radicalness and
influence.
Two different approaches to tackle the problem of radically
influential user identification – two-stage sequential
ranking and one-stage parallel ranking
Two-Stage Sequential Process
One-Stage Parallel Process
Measuring Radicalness
Ωj = A threat list containing 25 radically jihadi ideology
terms (Compiled by Kramer [ACM ISI-SIGKDD, 2010] from
Ansar Web Forum )
12/14/2017
8
Identifying Collocation
Collocation can be defined as the association of users co-
interacting in same threads.
Collocation theory is applied to study the structural
association of different users, and estimate their influence
while propagating an ideology through their interactions.
Contingency Table
Association Metrics
1. Co-Occurrence Frequency (μ1)
2. CF-ITF (μ2)
Association Metrics (cont...)
3. PMI (μ3): Used in the fields of information theory andstatistics to determine the association or dependence of twoprobabilistic events.
4. Cosine (μ4): Used to measure the strength of associationbetween a pair of objects having feature vectors
Association Metrics (cont...)
5. Overlap (μ5)
6. Dice (μ6)
Association Metrics (cont...)
7. Jaccard (μ7)
8. Chi-Square (μ8): Used to determine the difference between the
distribution of an actually observed sample and another hypothetical orpreviously established distribution
12/14/2017
9
Association Metrics (cont...)
9. LLR (μ9)
Association Metrics (cont...)
10. Phi Coefficient (μ10)
11. Contingency Coefficient (μ11)
Ranking Radically Influential Users
Page Rank:
d = Damping Factor (probability of a surfer to click on a link)
An ideal value for d is 0.85 (empirical determination)
Page Rank ExampleAA
CCBB
PR(A) = 0.15 + 0.85 x PR(C)
PR(B) = 0.15 + 0.85 x [PR(A)/2]
PR(C) = 0.15 + 0.85 x [PR(A)/2 + PR(B)]
Ranking Radically Influential Users (cont…)
Modified Page Rank
Experimental Data Set A set of threads provided for a challenge at the ACM ISI-
KDD’12 workshop to find radical and infectious threads,
members, postings, ideas and ideologies.
Consists of a total of 1,29,425 message posts commented as
response to a total of 27,968 threads by 2803 users.
Includes all discussions carried on in the forum from April 28,
2004 to May 20, 2010.
12/14/2017
10
Lifespan of Threads in Experimental Data Set
10 Top-Ranked Members Using Different Association Measures
Evaluation Metric: Mean Reciprocal Rank (MRR)
A statistical measure for evaluating any process that
produces a list of possible responses to a sample of
queries, ordered by probability of correctness.
Reciprocal Rank of a query response is the multiplicative
inverse of the rank of the first correct answer
MRR is the average of the reciprocal ranks of results for
a sample of queries Q:
Mean Reciprocal Rank (cont…)
MRR = (1/3 + 1/2 + 1)/3 = 11/18 ≈ 0.61.
Evaluation Results using MRR
For Further Details:
Tarique Anwar and Muhammad Abulaish,
Ranking Radically Influential Web Forum
Users, IEEE Transactions on Information
Forensics and Security, 2015.
Available at www.abulaish.com
12/14/2017
11
Development of a real-time online social media
monitoring system (incremental learning, stream data
mining).
Exploiting content and structural data together for
message authentication/ rumor detection
Exploring the application of community (cliques)
structures and their evolutionary behaviors for groups
monitoring (organized crimes, social botnet, etc.)
Development of Context- and Trust-aware recommender
systems (target marketing, sales and promotions, etc.)
Ranked Retrieval
Early IR focused on set-based retrieval
Boolean queries, set of conditions to be satisfied
document either matches the query or not
• like classifying the collection into relevant / non-relevant sets
still used by professional searchers
“advanced search” in many systems
Modern IR: ranked retrieval
free-form query expresses user’s information need
rank documents by decreasing likelihood of relevance
many studies prove it is superior