Speaker Overview - Omniscien Technologies · Dun & Bradstreet Keynote: Confounding Characteristics...

Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet

Dr. Anthony Scriffignano, Senior Vice President and Chief Data Scientist for Dun & Bradstreet, is an internationally recognized thought leader in the data science space. He leads a team of data scientists focused on advancing Dun & Bradstreet's core capabilities and IP globally. With an extensive background in advanced algorithms and linguistics, he holds multiple patents and presents globally on data and technology trends, multilingual challenges in business identity, and artificial intelligence.

Speaker Overview

Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet

Warwick is Senior Director of Identity Data Engineering at Dun & Bradstreet. Based in Melbourne, Australia, his work focuses largely on creating complex cross-border multilingual data flows.

AI, MT and Language Processing Symposium

Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet

Keynote: Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context

Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet

The presentation has three major themes:
• More Better Faster – Technology and Decision making
• Risk and Response – Disruptive Evolution, Malfeasance and how we will Respond
• The Future is Here – Quantum Computing, Machine Intelligence, New Mindsets, Recommendations for the future


Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context

Anthony J. Scriffignano, Ph.D., SVP / Chief Data Scientist

Warwick Matthews, Senior Director, Identity Data Engineering

AI, MT AND LANGUAGE PROCESSING SYMPOSIUM, 28 MARCH 2018

TODAY

OUR CURIOUS WORLD: MORE TECHNOLOGY, FASTER DECISIONS

COMPUTATIONAL LINGUISTIC CHALLENGES: ROMANIZATION OF BUSINESS IDENTITY DATA

THE RISKS AND OUR RESPONSE: DISRUPTIVE EVOLUTION AND HOW WE WILL RESPOND

THE FUTURE IS HERE: TRENDS, DEVELOPMENTS, CASE STUDIES

OUR CURIOUS WORLD: MORE TECHNOLOGY, FASTER DECISIONS

We live in an age of promise. Advanced linguistic methods are making things possible that were science fiction only a few short years ago.

Challenges

Changes in our environment have continuously influenced business decision-making. What is changing is the speed and degree of globalization.

1 Hypergeometric Data Growth: Hypergeometric digital data growth can make it more difficult to determine what is valuable vs. noise
2 Unstructured Data: It is estimated 80-90 percent of all business information exists as unstructured data
3 Globalization: The globalization of business relationships can overwhelm many businesses with multi-lingual data
4 Virtual Businesses: A business’ geographic location, structure, and physical customer interaction are becoming irrelevant

Note: Unstructured data includes data which lacks an exposed ontology and which appears to belie attempts to understand any implied categorization. The term is often used where the content is not actually unstructured, but rather only poorly understood at the time of ingestion or inspection.

“The importance of language should not be underestimated. Language contains nuance, changes constantly, and informs our thinking on levels that we often rely upon in subtle and powerful ways.” A. Scriffignano

In business today, we will hone our skills for using dynamic, unstructured data, or we will begin to drown in it. There is no guarantee that things get better.

Situational awareness…
• More than 85% of data creation is unstructured
• Language and use of language are constantly evolving
• Commonly available tools and solutions only address a small part of this space

Unstructured

Myths / Inconvenient Truths:
• Myth: More data is better. Inconvenient truth: Data vs. noise.
• Myth: Lots of data in 1 place is sufficient to learn. Inconvenient truth: Data at rest vs. data in motion.
• Myth: AI can find answers. Inconvenient truth: AI methods have preconditions.
• Myth: Machine learning will find hidden truth. Inconvenient truth: Regression vs. unprecedented change.
• Myth: Natural language processing removes all language barriers. Inconvenient truth: Language is constantly changing.
• Myth: Machine Translation is good enough. Inconvenient truth: Many unmet challenges remain in linguistics.

COMPUTATIONAL LINGUISTIC CHALLENGES: ROMANIZATION OF BUSINESS IDENTITY DATA

Today’s problem is not a lack of data…


• Dun & Bradstreet is continually acquiring large amounts of non-Latin data, particularly in Asia, which needs to be Romanized in order to enter our Global Data Supply Chain.
• Translation is traditionally largely manual, time consuming and very expensive when you have millions of records to process.
• When it comes to Romanizing pure Identity Data we have a special problem.
• Name and Address data has no context, and this is particularly challenging when we are talking about new business entities that have no existing “footprint” to work from.
• And we need to solve this problem millions of times per day, automatically.

A D&B Challenge: Romanization of Business Identity Data

Business Names are a unique mix of Translated and Transliterated (phonetic) data. And this data has no context (there is nothing around a name to guide the system).

Addresses are “easy” to translate for big geos like Cities. But address detail is often vague, idiosyncratic and defies systematic classification (especially in China!).
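To make the "mix of translated and transliterated data" concrete, here is a minimal Python sketch. It is an illustration only, not D&B's method: the two small tables, the made-up company name, and the function name `romanize_name` are all assumptions. Tokens found in the lexicon are translated, remaining characters are transliterated, and anything neither table covers is reported back rather than guessed.

```python
# Hypothetical lexicon: tokens with an established English rendering (translated).
TRANSLATE = {"有限公司": "Co., Ltd.", "贸易": "Trading", "北京": "Beijing"}
# Hypothetical per-character table: toneless Hanyu Pinyin (transliterated).
TRANSLITERATE = {"华": "hua", "兴": "xing"}


def romanize_name(name: str) -> tuple[str, list[str]]:
    """Translate known tokens, transliterate the rest.

    Returns the romanized name plus any characters neither table covers,
    so that a caller can route incomplete results elsewhere.
    """
    parts: list[str] = []       # finished words, in order
    run: list[str] = []         # syllables of the current transliterated word
    unresolved: list[str] = []  # characters with no lexicon or table entry

    def flush() -> None:
        # Join pending syllables into one capitalized word, e.g. "Huaxing".
        if run:
            parts.append("".join(run).capitalize())
            run.clear()

    keys = sorted(TRANSLATE, key=len, reverse=True)  # longest match first
    i = 0
    while i < len(name):
        for key in keys:
            if name.startswith(key, i):
                flush()
                parts.append(TRANSLATE[key])
                i += len(key)
                break
        else:
            # No lexicon token starts here: fall back to transliteration.
            ch = name[i]
            if ch in TRANSLITERATE:
                run.append(TRANSLITERATE[ch])
            else:
                flush()
                unresolved.append(ch)
            i += 1
    flush()
    return " ".join(parts), unresolved


# Made-up example name, used only to exercise both tables.
print(romanize_name("北京华兴贸易有限公司"))
```

The sketch also shows why the lack of context matters: nothing in the name itself says whether a given character should be translated or transliterated, so the outcome depends entirely on what the tables happen to contain.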

Why don’t we just “A.I.” our way out of the problem?

AI has limitations…
• AI is a “grey box” at best – difficult to get actionable qualitative feedback to aid in automated decision-making
• It does not actually understand what it is doing, so it cannot tell when its output is nonsense.

Examples: inspirobot.com, Microsoft Tay, Todai Robot

“…none of the modern AIs, including Watson, Siri and Todai Robot, is able to read… it doesn't understand any meaning.”
Dr Noriko Arai, Tokyo University
https://www.ted.com/talks/noriko_arai_can_a_robot_pass_a_university_entrance_exam

So we must leverage multiple simultaneous approaches…
• Lexicon/Stats-based routines
• UI – Human Adjudication
• Decisioning system
• Machine Learning System
• AI – SHEN/X
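One way to picture several approaches feeding a decisioning system, with human adjudication as the fallback, is the following Python sketch. The agreement rule, the `decide` function, and the stub engines are assumptions made for illustration; the slide does not describe how the actual components are combined.

```python
from collections import Counter
from typing import Callable, Optional

# An "engine" is any callable that proposes a Romanized form, or None if it abstains.
Engine = Callable[[str], Optional[str]]


def decide(record: str, engines: list[Engine], min_agree: int = 2) -> tuple[str, Optional[str]]:
    """Accept a proposal automatically only when enough engines agree on it."""
    votes: Counter = Counter()
    for engine in engines:
        proposal = engine(record)
        if proposal is not None:
            votes[proposal] += 1
    if votes:
        value, count = votes.most_common(1)[0]
        if count >= min_agree:
            return ("auto", value)
    # No sufficient agreement: hand the record to a human adjudicator.
    return ("human_adjudication", None)


# Stub engines standing in for lexicon, machine-learning, and AI components.
lexicon = lambda r: "Huaxing Trading"
ml_model = lambda r: "Huaxing Trading"
ai_model = lambda r: "Hua Xing Trade"

print(decide("华兴贸易", [lexicon, ml_model, ai_model]))
print(decide("华兴贸易", [lexicon, ai_model]))
```

Requiring agreement is only one possible rule; a confidence-weighted vote would fit the same shape. The point of the sketch is that no single engine's output is trusted on its own.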

THE RISKS AND OUR RESPONSE: DISRUPTIVE EVOLUTION AND HOW WE WILL RESPOND

There are three use cases which represent “True North” for our innovation

Organic growth/decay: Discovering new businesses or changes in business status
• New business that would otherwise go undetected
• Additional input for Intelligence Engine to transform a Single Source Record (SSR)
• Full-file maintenance (e.g. Out of Business)

Fraud / Malfeasance: Ingesting information that can be used to detect bad behavior
• Common social “footprint” shared by more than one persona (e.g. identity theft)
• Patterns of observation that suggest clusters of bad behavior (e.g. fraud rings)
• Discovery of new types of malfeasance (e.g. Phishing)

Personal Identity: Discovering data elements that can help resolve people in business context
• New Social Media handles or sources of social data (e.g. opinion blogs)
• Data which can be aggregated to pre-existing clusters of identity (e.g. photos) for additional resolution (e.g. hyperclusters)
• Data about groups of individuals associated with a business context (e.g. user groups, discussion boards)

Areas of focus

Entity Extraction · Understanding Person and Perspective · Sentiment Attribution/Clustering

• Multiple patents including identity resolution, people in the context of business, geospatial inference, flexible alternative indicia
• Existing capabilities include extraction of entities, tokenization, part-of-speech tagging, usage vectors, language detection
• The current state of the art is highly dependent on training and has challenges with precision and recall, and with reproducible results
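For readers less familiar with the two measures named on this slide, precision and recall for entity extraction can be computed as in the short Python sketch below. The entity sets are invented for illustration and the function name is an assumption.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: share of extracted entities that are correct.
    Recall: share of true entities that were extracted."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: the extractor found every true entity (recall 1.0)
# but half of what it returned was wrong (precision 0.5).
predicted = {"Apple", "BP", "FBI", "Explosives"}
gold = {"Apple", "BP"}
print(precision_recall(predicted, gold))  # (0.5, 1.0)
```

The two measures pull against each other: extracting more candidates tends to raise recall and lower precision, which is one reason a single accuracy figure says little about an extractor.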

Apple destroys competition…
• Transitive verb, requires actor.
• Multiple interpretations.
• Becomes Proper Noun due to inference about

Яблоко пережило несколько стадий развития… (“Yabloko [apple] went through several stages of development…”)
• political party? fruit? company?

Changing regulatory environment globally

FLOCCINAUCINIHILIPILIFICATION

Confounding characteristics
• Sarcasm: ABC corporation is a wonderful company, if you don’t do business with them.
• Neologism: Be sure to like us on FaceBook and use #shallow when you Tweet.
• Grammar variations: FBI is Hunting Terrorists With Explosives.
• Punctuation: “Hi mom!” vs. “Hi, mom?”
• Intentional mis-spelling: RU There?

USE CASES
• Context / Behavior
• Sentiment Attribution
• Entity Extraction

CONFOUNDING CHARACTERISTICS

DERIVING EMPIRICAL MEASURES THAT INFORM USE CASES

Passive metric

Sarcasm
Words or predicates juxtaposed in such a way as to convey hidden meaning that is opposed to that which comes from cursory interpretation
• Example: BP is an excellent company to do business with, if you like destroying nature.

Neologism
Words or phrases which are newly constructed and taken collectively to have some shared meaning.
• Example: Hashtags on Twitter

Grammar variations
Word usage which is intentionally or unintentionally incorrect, leading to ambiguous or non-dispositive interpretation
• Example: FBI is Hunting Terrorists With Explosives

Punctuation
Usage of punctuation in a non-standard or inconsistent way or lack of punctuation, leading to ambiguous or contradictory interpretation
• Example: “Eats shoots and leaves” vs. “Eats, shoots, and leaves”

Intentional Mis-Spelling
Invented, incorrect, or adopted spelling that results in inconsistent, incorrect, or non-dispositive interpretation
• Example: RU There?

Mixing of Languages or Scripts
Including foreign words/phrases or characters, especially in non-standard ways.
• Examples: He has a certain je ne sais quoi or “Please have some ∏”
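Of the characteristics defined above, mixing of scripts is the easiest to detect mechanically. The Python sketch below is a rough illustration under stated assumptions: the standard library exposes no Unicode script property, so the first word of each character's Unicode name is used as a stand-in, and the function name is hypothetical. It catches the “∏” example and mixed Cyrillic/Latin text, but not “je ne sais quoi”, which mixes languages within a single script.

```python
import unicodedata


def scripts_and_symbols(text: str) -> tuple[set[str], list[str]]:
    """Return the scripts used by letters, plus any non-ASCII symbols found."""
    scripts: set[str] = set()
    symbols: list[str] = []
    for ch in text:
        if ch.isalpha():
            # First word of the Unicode name, e.g. "LATIN", "CYRILLIC", "CJK".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
        elif ord(ch) > 127 and unicodedata.category(ch).startswith("S"):
            # Symbol categories (Sm, Sc, Sk, So), e.g. U+220F N-ARY PRODUCT.
            symbols.append(ch)
    return scripts, symbols


print(scripts_and_symbols("Please have some \u220f"))
print(scripts_and_symbols("Яблоко or Apple?"))
```

A text that yields more than one script, or any unexpected symbol, could then be flagged for closer inspection rather than processed as ordinary single-language input.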

Positioning Confounding Characteristics in the Curation Process

Recursive Discovery · Entity Extraction · Vetting/Adjudication · Synthesis

• Who is speaking?
• How do they feel?
• In what context?

THE FUTURE IS HERE: TRENDS, DEVELOPMENTS, CASE STUDIES

• Semantic vector space models
• Using source metadata for insights
• Breadth
• Entity Extraction
• Sentiment Analysis
• Detecting & Measuring scores of Confounding Language Characteristics
• Translating scores into degree of ‘confoundedness’
• Analyze relative usefulness of new sources
• Understand dependencies across sources
• Create dedicated & scalable infrastructure for unstructured data
• Detecting additional confounding factors (e.g. foreign language)
• Analyzing impact on specific use
• Improving robustness of existing detection algorithms
• Semantic disambiguation
• Understanding Person & Perspective
• Assessing capabilities in language synthesis for identified use cases
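The idea of translating per-characteristic scores into a degree of ‘confoundedness’ can be sketched as a weighted average, as in the Python example below. The slides do not say how the scores are combined, so the weighting scheme, the thresholds, and the function name are all assumptions made purely for illustration.

```python
def confoundedness(scores: dict[str, float], weights: dict[str, float]) -> tuple[float, str]:
    """Combine per-characteristic scores (each in 0..1) into one weighted degree.

    The weights and the low/medium/high cut-offs are hypothetical.
    """
    if not scores:
        raise ValueError("at least one characteristic score is required")
    total_weight = sum(weights[name] for name in scores)
    degree = sum(scores[name] * weights[name] for name in scores) / total_weight
    if degree < 1 / 3:
        label = "low"
    elif degree < 2 / 3:
        label = "medium"
    else:
        label = "high"
    return degree, label


# Invented scores for one piece of text.
scores = {"sarcasm": 1.0, "neologism": 0.0}
weights = {"sarcasm": 1.0, "neologism": 1.0}
print(confoundedness(scores, weights))  # (0.5, 'medium')
```

A single figure like this could be stored next to each extracted entity or sentiment value so that downstream use cases can discount results drawn from highly confounded text.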

Reality check…

Watch this space…
• Inter- and intra-language correlation: deciding when things mean the same thing
• Inter- and intra-language transformation: transforming inference among languages
• Changing behavior to attract/obviate grapheme analysis: reacting to changing language
• Emerging “metalanguage” (e.g. “textspeak”): reacting to language about language
• A language of “things”: reacting to new languages used by automation
• Using language to hide language: reacting to attempts to obscure via language
• Unicode is not universal…: understanding the limitations of automation
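One concrete limitation behind the Unicode point is that text which looks identical on screen can be stored as different code points, so a naive comparison of two business names fails. The Python sketch below illustrates this with the standard `unicodedata` module. The helper name `same_name` and the sample strings are assumptions, and this is only one reading of what the slide has in mind.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two names after compatibility normalization and case folding."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFKC", s).casefold()
    return norm(a) == norm(b)


composed = "Caf\u00e9"     # "é" stored as a single code point
decomposed = "Cafe\u0301"  # "e" followed by a combining acute accent

print(composed == decomposed)           # False: different code points
print(same_name(composed, decomposed))  # True after normalization

# Full-width Latin letters (U+FF21..U+FF23), common in East Asian data entry.
print(same_name("\uff21\uff22\uff23株式会社", "abc株式会社"))  # True
```

Normalization handles only encoding-level variation. It does nothing for names that differ in script or in choice of Romanization, which is where the harder problems described in this talk begin.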

The journey of discovery involves asking new questions
• What is the community at large saying about this business?
• How are opinions changing over time? Are they authentic?
• How can I detect inconsistent behavior? What does it mean?
• How do I understand and measure customer sentiment?
• How can I see “birth” and “death” of a business more quickly?
• Can I trust the social data on my partners?
• What about modes: Is there a measurable difference between leaders and their organizations’ opinions?

Thank You!

謝謝 · Dankjewel · ありがとう · धन्यवाद

Warwick Matthews: MatthewsWa@dnb.com
Anthony Scriffignano: ScriffignanoA@dnb.com
