Post on 13-Jul-2020
Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet
Dr. Anthony Scriffignano, Senior Vice President and Chief Data Scientist for Dun & Bradstreet, is an
internationally recognized thought leader in the data science space. He leads a team of data
scientists focused on advancing Dun & Bradstreet's core capabilities and IP globally. With extensive
background in advanced algorithms and linguistics, he holds multiple patents and presents globally
on data and technology trends, multilingual challenges in business identity, and artificial intelligence.
Speaker Overview
Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet
Warwick is Senior Director of Identity Data Engineering at Dun & Bradstreet. Based in Melbourne,
Australia, his work focuses largely on creating complex cross-border multilingual data flows.
AI, MT and Language Processing Symposium
Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet
Keynote: Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context
Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet
The presentation has three major themes:
• More Better Faster – Technology and Decision Making
• Risk and Response – Disruptive Evolution, Malfeasance and How We Will Respond
• The Future is Here – Quantum Computing, Machine Intelligence, New Mindsets, Recommendations for the Future
Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context
Anthony J. Scriffignano, Ph.D.
SVP / Chief Data Scientist
AI, MT AND LANGUAGE PROCESSING SYMPOSIUM
28 MARCH 2018
Warwick Matthews
Senior Director, Identity Data Engineering
TODAY
OUR CURIOUS WORLD: MORE TECHNOLOGY, FASTER DECISIONS
COMPUTATIONAL LINGUISTIC CHALLENGES: ROMANIZATION OF BUSINESS IDENTITY DATA
THE RISKS AND OUR RESPONSE: DISRUPTIVE EVOLUTION AND HOW WE WILL RESPOND
THE FUTURE IS HERE: TRENDS, DEVELOPMENTS, CASE STUDIES
OUR CURIOUS WORLD: MORE TECHNOLOGY, FASTER DECISIONS
We live in an age of promise. Advanced linguistic methods are making things possible that were
science fiction only a few short years ago.
Changes in our environment have continuously influenced business decision-making. What is changing is the speed and degree of globalization.
Challenges:
1 Hypergeometric Data Growth: Hypergeometric digital data growth can make it more difficult to determine what is valuable vs. noise
2 Unstructured Data: It is estimated 80-90 percent of all business information exists as unstructured data
3 Globalization: The globalization of business relationships can overwhelm many businesses with multilingual data
4 Virtual Businesses: A business’ geographic location, structure, and physical customer interaction are becoming irrelevant
Note: Unstructured data includes data which lacks an exposed ontology and which appears to defy attempts to understand any implied categorization.
The term is often used where the content is not actually unstructured, but only poorly understood at the time of ingestion or inspection.
“The importance of language should not be underestimated. Language contains nuance, changes
constantly, and informs our thinking on levels that we often rely upon in subtle and powerful ways.”
A. Scriffignano
In business today, we will hone our skills for using dynamic, unstructured
data, or we will begin to drown in it. There is no guarantee that things
get better.
Unstructured Data
Situational awareness…
• More than 85% of data creation is unstructured
• Language and use of language are constantly evolving
• Commonly available tools and solutions only address a small part of this space
Myths / Inconvenient Truths:
• More data is better / Data vs. noise
• Lots of data in 1 place is sufficient to learn / Data at rest vs. data in motion
• AI can find answers / AI methods have preconditions
• Machine learning will find hidden truth / Regression vs. unprecedented change
• Natural language processing removes all language barriers / Language is constantly changing
• Machine Translation is good enough / Many unmet challenges remain in linguistics
COMPUTATIONAL LINGUISTIC CHALLENGES: ROMANIZATION OF BUSINESS IDENTITY DATA
Today’s problem is not a lack of data…
• Dun & Bradstreet is continually acquiring large amounts of non-Latin data, particularly in Asia, which needs to be Romanized in order to enter our Global Data Supply Chain.
• Translation is traditionally largely manual, time-consuming and very expensive when you have millions of records to process.
• When it comes to Romanizing pure Identity Data we have a special problem.
• Name and Address data has no context, and this is particularly challenging when we are talking about new business entities that have no existing “footprint” to work from.
• And we need to solve this problem millions of times per day, automatically.
A D&B Challenge: Romanization of Business Identity Data
Business Names are a unique mix of Translated and Transliterated (phonetic) data. And this data has no context (there is nothing around a name to guide the system).
Addresses are “easy” to translate for big geos like Cities. But address detail is often vague, idiosyncratic and defies systematic classification (especially in China!).
Why don’t we just “A.I.” our way out of the problem?
AI has limitations…
• AI is a “grey box” at best – difficult to get actionable qualitative feedback to aid in automated decision-making
• It does not actually understand what it is doing, so it cannot tell when its output is nonsense.
Examples: inspirobot.com, Microsoft Tay, Todai Robot
“..none of the modern AIs,
including Watson, Siri and Todai Robot,
is able to read..
...it doesn't understand any meaning.”
Dr Noriko Arai, Tokyo University
https://www.ted.com/talks/noriko_arai_can_a_robot_pass_a_university_entrance_exam
So we must leverage multiple simultaneous approaches…
• Lexicon/Stats-based routines
• Machine Learning System
• AI – SHEN/X
• Decisioning system
• UI – Human Adjudication
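One way to picture how several approaches might feed a decisioning system is sketched below. This is a minimal illustration and not D&B's actual system: the `Candidate` type, the `decide` function, the `min_agreement` threshold and the sample business name are all hypothetical. The idea is simply that a Romanization is accepted automatically only when enough independent routines agree, and everything else is routed to human adjudication.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str  # which routine proposed it, e.g. "lexicon", "ml", "ai" (hypothetical labels)
    text: str    # the proposed Romanized form


def decide(candidates, min_agreement=2):
    """Accept a Romanization only when enough routines agree on it.

    Returns (decision, text). Anything without clear agreement is routed
    to human adjudication rather than guessed at.
    """
    if not candidates:
        return ("human_adjudication", None)
    # Compare case-insensitively and ignore surrounding whitespace.
    votes = Counter(c.text.strip().casefold() for c in candidates)
    ranked = votes.most_common(2)
    best, count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == count
    if count >= min_agreement and not tied:
        # Return the first original spelling that matches the winning form.
        winner = next(c.text.strip() for c in candidates
                      if c.text.strip().casefold() == best)
        return ("auto_accept", winner)
    return ("human_adjudication", None)


# Made-up example: one routine translates part of the name ("Oriental"),
# the other two transliterate it ("Dongfang").
candidates = [
    Candidate("lexicon", "Beijing Dongfang Technology Co., Ltd."),
    Candidate("ml", "beijing dongfang technology co., ltd."),
    Candidate("ai", "Beijing Oriental Technology Co., Ltd."),
]
print(decide(candidates))
```

Simple voting is only one possible rule; a real decisioning system would presumably also weigh how reliable each routine is for a given kind of name.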
THE RISKS AND OUR RESPONSE: DISRUPTIVE EVOLUTION AND HOW WE WILL RESPOND
There are three use cases which represent “True North” for our innovation
Organic growth/decay: Discovering new businesses or changes in business status
• New business that would otherwise go undetected
• Additional input for Intelligence Engine to transform a Single Source Record (SSR)
• Full-file maintenance (e.g. Out of Business)
Fraud / Malfeasance: Ingesting information that can be used to detect bad behavior
• Common social “footprint” shared by more than one persona (e.g. identity theft)
• Patterns of observation that suggest clusters of bad behavior (e.g. fraud rings)
• Discovery of new types of malfeasance (e.g. Phishing)
Personal Identity: Discovering data elements that can help resolve people in business context
• New Social Media handles or sources of social data (e.g. opinion blogs)
• Data which can be aggregated to pre-existing clusters of identity (e.g. photos) for additional resolution (e.g. hyperclusters)
• Data about groups of individuals associated with a business context (e.g. user groups, discussion boards)
Areas of focus
Entity Extraction / Understanding Person and Perspective / Sentiment Attribution/Clustering
• Multiple patents including identity resolution, people in the context of business, geospatial inference, flexible alternative indicia
• Existing capabilities include extraction of entities, tokenization, part of speech tagging, usage vectors, language detection
• The current state of the art is highly dependent on training and has challenges with precision, recall, and reproducibility of results
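For readers less familiar with the precision and recall mentioned on this slide, the short sketch below computes both for an entity-extraction result. The function name and the sample entity sets are invented for illustration. Precision is the share of extracted entities that are correct; recall is the share of expected entities that were found.

```python
def precision_recall(predicted, expected):
    """Compute precision and recall for a set of extracted entities."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: what a reviewer marked as entities vs. what an extractor returned.
expected = {"Apple", "Dun & Bradstreet", "FBI"}
predicted = {"Apple", "FBI", "Explosives", "Twitter"}

p, r = precision_recall(predicted, expected)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```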
Apple destroys competition…
• Transitive verb, requires actor.
• Multiple interpretations.
• Becomes Proper Noun due to inference about verb.
Яблоко пережило несколько стадий развития… (Russian: “Yabloko [apple] has gone through several stages of development…”)
• The political party?
• The fruit?
• The company?
Changing regulatory environment globally
FLOCCINAUCINIHILIPILIFICATION
Confounding characteristics
• Sarcasm: ABC corporation is a wonderful company, if you don’t do business with them.
• Neologism: Be sure to like us on FaceBook and use #shallow when you Tweet.
• Grammar variations: FBI is Hunting Terrorists With Explosives.
• Punctuation: “Hi mom!” vs. “Hi, mom?”
• Intentional mis-spelling: RU There?
[Diagram labels: Confounding Characteristics; Deriving empirical measures that inform use cases; Use cases: Context / Behavior, Sentiment Attribution, Entity Extraction; Passive metric]
Sarcasm
Words or predicates juxtaposed in such a way as to convey hidden meaning that is
opposed to that which comes from cursory interpretation
• Example: BP is an excellent company to do business with, if you like destroying nature.
Neologism
Words or phrases which are newly constructed and taken collectively to have
some shared meaning.
• Example: Hashtags on Twitter
Grammar variations
Word usage which is intentionally or unintentionally incorrect, leading to
ambiguous or non-dispositive interpretation
• Example: FBI is Hunting Terrorists With Explosives
Punctuation
Usage of punctuation in a non-standard or inconsistent way or lack of punctuation,
leading to ambiguous or contradictory interpretation
• Example: “Eats shoots and leaves” vs. “Eats, shoots, and leaves”
Intentional Mis-Spelling
Invented, incorrect, or adopted spelling that results in inconsistent, incorrect, or
non-dispositive interpretation
• Example: RU There?
Mixing of Languages or Scripts
Including foreign words/phrases or characters, especially in non-standard ways.
• Examples: “He has a certain je ne sais quoi” or “Please have some ∏”
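A few of these characteristics lend themselves to simple surface checks, while others (sarcasm in particular) do not. The sketch below is a rough heuristic under stated assumptions, not the presenters' method: the `TEXTSPEAK` list, the function names and the flag names are all hypothetical. It flags hashtags as a neologism signal, a handful of known textspeak tokens as intentional mis-spelling, and mixed scripts by reading the first word of each letter's Unicode character name.

```python
import re
import unicodedata

# Tiny illustrative list; a real lexicon would be far larger.
TEXTSPEAK = {"ru", "u", "gr8", "thx", "l8r"}


def scripts_used(text):
    """Return the scripts seen in text, taken from the first word of each
    letter's Unicode character name (e.g. LATIN, CYRILLIC, GREEK, CJK)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def confounding_flags(text):
    """Flag a few surface-level confounding characteristics."""
    words = re.findall(r"[^\W\d_]+", text.casefold())
    return {
        "neologism_hashtag": bool(re.search(r"#\w+", text)),
        "textspeak": any(w in TEXTSPEAK for w in words),
        "mixed_scripts": len(scripts_used(text)) > 1,
    }


for sample in ("RU There?",
               "Be sure to like us on FaceBook and use #shallow when you Tweet.",
               "Apple vs. Яблоко"):
    print(sample, confounding_flags(sample))
```

Note that the script check only looks at letters, so a mathematical symbol used in place of a letter would not be counted as a second script.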
Positioning Confounding Characteristics in the Curation Process
[Diagram labels: Recursive Discovery; Entity Extraction; Vetting/Adjudication; Synthesis]
• Who is speaking?
• About whom?
• How do they feel?
• In what context?
THE FUTURE IS HERE: TRENDS, DEVELOPMENTS, CASE STUDIES
[Diagram axes: Breadth, Depth]
• Entity Extraction
• Sentiment Analysis
• Understanding Person & Perspective
• Semantic vector space models
• Semantic disambiguation
• Using source metadata for insights
• Detecting & measuring scores of Confounding Language Characteristics
• Translating scores into degree of text ‘confoundedness’
• Detecting additional confounding factors (e.g. foreign language)
• Improving robustness of existing detection algorithms
• Analyzing impact on specific use cases
• Analyze relative usefulness of new sources
• Understand dependencies across sources
• Create dedicated & scalable infrastructure for unstructured data
• Assessing capabilities in language synthesis for identified use cases
Reality check…
Watch this space…
• Inter- and intra-language correlation: deciding when things mean the same thing
• Inter- and intra-language transformation: transforming inference among languages
• Changing behavior to attract/obviate grapheme analysis: reacting to changing language
• Emerging “metalanguage” (e.g. “textspeak”): reacting to language about language
• A language of “things”: reacting to new languages used by automation
• Using language to hide language: reacting to attempts to obscure via language
• Unicode is not universal…: understanding the limitations of automation
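One concrete reading of the Unicode point, offered here as an illustration and not as the presenters' own example: strings that look the same to a person can be different code points to a machine, and standard normalization closes only part of that gap. In the sketch below, the sample strings and the helper name `same_after_nfkc` are assumptions; `unicodedata.normalize` is the Python standard library call.

```python
import unicodedata

latin = "Apple"
# Fullwidth Latin letters, as often found in East Asian text.
fullwidth = "\uff21\uff50\uff50\uff4c\uff45"
# First letter is CYRILLIC CAPITAL LETTER A, which looks like Latin "A".
lookalike = "\u0410pple"


def same_after_nfkc(a, b):
    """Compare two strings after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(latin == fullwidth)                 # False: different code points
print(same_after_nfkc(fullwidth, latin))  # True: NFKC folds fullwidth forms
print(same_after_nfkc(lookalike, latin))  # False: cross-script lookalikes survive
```

Normalization handles compatibility variants such as fullwidth forms, but a lookalike letter from another script still needs a separate check, which is one reason naive matching of business names can fail silently.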
The journey of discovery involves asking new questions
• What is the community at large saying about this business?
• How are opinions changing over time? Are they authentic?
• How can I detect inconsistent behavior? What does it mean?
• How do I understand and measure customer sentiment?
• How can I see “birth” and “death” of a business more quickly?
• Can I trust the social data on my partners?
• What about modes: Is there a measurable difference between leaders’ and their organizations’ opinions?
Thank You!
謝謝
Dank je wel
merci
ありがとう
धन्यवाद
Warwick Matthews
MatthewsWa@dnb.com
Anthony Scriffignano
ScriffignanoA@dnb.com