Mass declassification sept 23 2010v2.1

© 2010 IBM Corporation. Mass Declassification: What If? Jeff Jonas, IBM Distinguished Engineer, Chief Scientist, IBM Entity Analytics. [email protected] September 23, 2010

description

My public presentation as delivered to the Public Interest Declassification Board (PIDB), which is trying to determine the best way to declassify and release over 400M classified documents.

Transcript of Mass declassification sept 23 2010v2.1

Page 1: Mass declassification sept 23 2010v2.1

© 2010 IBM Corporation

Mass Declassification

What If?

Jeff Jonas, IBM Distinguished Engineer

Chief Scientist, IBM Entity Analytics

[email protected]

September 23, 2010

Page 2: Mass declassification sept 23 2010v2.1

The Ask

What emerging technology or innovative approaches come to mind … which may have applicability to this task?

Use your imagination. What if?

Not talking about any specific products

Not focusing on the widely available COTS/GOTS technologies (OCR, document management, case management, workflow, etc.)

Page 3: Mass declassification sept 23 2010v2.1

The Problem at Hand

Volumes may be beyond human, brute force review (@5min/ea = 18,382 FTEs)

Necessitates some form of machine triage

– Red: A disclosure risk

– Yellow: A possible disclosure risk

– Green: No disclosure risk

Reliable machine triage requires substantially better prediction systems

Even then, advanced means for humans to deal with the remaining large volumes of “possibles” is still required
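
The 18,382 FTE figure can be reproduced with back-of-the-envelope arithmetic. The slide does not state its inputs, so this sketch assumes the 450M document count quoted later in the deck and a hypothetical 2,040 working hours per reviewer per year. Both are assumptions chosen for illustration, not figures confirmed by the source.

```python
def reviewers_needed(documents: int, minutes_per_document: float,
                     hours_per_fte_year: float) -> float:
    """Full-time reviewers needed to read every document within one year."""
    total_hours = documents * minutes_per_document / 60
    return total_hours / hours_per_fte_year


# Assumed inputs: 450M documents, 5 minutes each, 2,040 working hours per FTE-year.
ftes = reviewers_needed(450_000_000, 5, 2_040)
print(f"{ftes:,.0f} FTEs")  # prints "18,382 FTEs"
```

Under these assumptions, the review effort is 37.5 million hours, which is why brute-force human review looks impractical.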

Page 4: Mass declassification sept 23 2010v2.1

Background

Early 80’s: Founded Systems Research & Development (SRD), a custom software consultancy

1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA)

2001/2003: Funded by In-Q-Tel

2005: IBM acquires SRD

Cumulatively: I have had a hand in a number of systems with multi-billions of rows describing 100’s of millions of entities

Affiliations:

– Member, Markle Foundation Task Force on National Security in the Information Age

– Senior Associate, Center for Strategic and International Studies (CSIS)

– Distinguished Research Faculty (adjunct), Singapore Management University, School of Information Systems

– Member, EPIC advisory board

– Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body

Page 5: Mass declassification sept 23 2010v2.1

In Today’s Session

Intro to context accumulating systems

Predictions and data points needed for mass declassification

Strawman architecture

Challenges

Q&A

Page 6: Mass declassification sept 23 2010v2.1

Context Accumulating Systems

Page 7: Mass declassification sept 23 2010v2.1

From Pixels to Pictures to Insight

Observations

Contextualization

Context

Relevance

Consumer (An analyst, a system, the sensor itself, etc.)

Page 8: Mass declassification sept 23 2010v2.1

Context, definition of:

Better understanding something by taking into account the things around it.

Page 9: Mass declassification sept 23 2010v2.1

Without Context

[email protected]

Page 10: Mass declassification sept 23 2010v2.1

Consequences

Algorithms flat-lining (e.g., alert queues)

Enterprise amnesia on the rise

Overwhelmed by false positives and false negatives? You have seen nothing yet

Not enough humans to fix this with brute force

Risk assessment becomes the risk

Page 11: Mass declassification sept 23 2010v2.1

Context Accumulation

Trusted Supplier

Job Applicant

Stolen Identity

Known Terrorist

[email protected]

Page 12: Mass declassification sept 23 2010v2.1

Puzzle Metaphor Primer

Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors

What it represents is unknown – there is no picture on hand

Is it one puzzle, 15 puzzles, or 1,500 puzzles?

Some pieces are duplicates and some are missing

Some pieces are incomplete, low quality, or have been misinterpreted

Some pieces may even be professionally fabricated lies

Until you take the pieces to the table, you don’t know what you are dealing with

Page 13: Mass declassification sept 23 2010v2.1

How Context Accumulates

With each new observation … one of three assertions is made: 1) un-associated; 2) near like neighbors; or 3) connected

Asserted connections must favor the false negative

New observations sometimes reverse earlier assertions

Some observations produce novel discovery

As the working space expands, computational effort increases

The emerging picture helps focus collection interests

Given sufficient observations, there can come a tipping point

Thereafter, confidence improves while computational effort decreases!!!!
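
One way to picture the three assertions is as a comparison between a new observation and the entities already on the table. In the minimal sketch below, the feature names, the split into strong and weak identifiers, and the `assert_observation` helper are all hypothetical; the deck does not specify how the decision is made.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Assertion(Enum):
    UNASSOCIATED = "un-associated"
    NEAR_NEIGHBOR = "near like neighbors"
    CONNECTION = "connection"


# Assumption: these identifiers are strong enough to assert a connection alone.
STRONG_KEYS = {"ssn", "drivers_license"}


def assert_observation(observation: Dict[str, str],
                       entities: List[Dict[str, str]]) -> Tuple[Assertion, Optional[int]]:
    """Compare one observation to known entities; return (assertion, entity index)."""
    neighbor = None
    for index, entity in enumerate(entities):
        shared = {k for k, v in observation.items() if k in entity and entity[k] == v}
        if shared & STRONG_KEYS:
            return Assertion.CONNECTION, index
        if shared and neighbor is None:
            neighbor = index  # weak overlap only: remember it, do not merge
    if neighbor is not None:
        return Assertion.NEAR_NEIGHBOR, neighbor
    return Assertion.UNASSOCIATED, None


entities = [{"name": "Mark Smith", "ssn": "443-43-0000"}]
print(assert_observation({"name": "Mark Smith", "phone": "(707) 433-0000"}, entities))
print(assert_observation({"name": "M Smith", "ssn": "443-43-0000"}, entities))
print(assert_observation({"name": "Jane Doe"}, entities))
```

Refusing to connect on a shared name alone is one way to "favor the false negative": the observation is parked near its neighbor until stronger evidence arrives.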

Page 14: Mass declassification sept 23 2010v2.1

False Negatives Overstate The Universe

[Chart. Labels: Observations, Unique Identities, True Population]

Page 15: Mass declassification sept 23 2010v2.1

Counting Is Difficult

File 1: Mark Smith, 6/12/1978, 443-43-0000

File 2: Mark R Smith, (707) 433-0000, DL: 00001234

Page 16: Mass declassification sept 23 2010v2.1

The Rise and Fall of a Population

[Chart. Labels: Observations, Unique Identities, True Population]

Page 17: Mass declassification sept 23 2010v2.1

Data Triangulation

New Record: Mark Randy Smith, 443-43-0000, DL: 00001234

File 1: Mark Smith, 6/12/1978, 443-43-0000

File 2: Mark R Smith, (707) 433-0000, DL: 00001234
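
The triangulation on this slide can be sketched as merging records that share an identifier. This is a minimal illustration, assuming that an exact match on either identifier is enough to connect two records. The field names (including treating 443-43-0000 as an `ssn`) and the `count_identities` helper are hypothetical, and real entity resolution needs far more care.

```python
from typing import Dict, List, Tuple


def count_identities(records: List[Dict[str, str]],
                     keys: Tuple[str, ...] = ("ssn", "dl")) -> int:
    """Count distinct identities, merging records that share any identifier in keys."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    first_seen: Dict[Tuple[str, str], int] = {}
    for index, record in enumerate(records):
        for key in keys:
            if key in record:
                owner = first_seen.setdefault((key, record[key]), index)
                parent[find(index)] = find(owner)  # connect to the earlier record
    return len({find(i) for i in range(len(records))})


file_1 = {"name": "Mark Smith", "dob": "6/12/1978", "ssn": "443-43-0000"}
file_2 = {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"}
new_record = {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"}

print(count_identities([file_1, file_2]))              # 2
print(count_identities([file_1, file_2, new_record]))  # 1
```

With only File 1 and File 2 the count is two. The new record shares an identifier with each of them, so a later observation reverses the earlier assertion and the count falls to one.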

Page 18: Mass declassification sept 23 2010v2.1

Increasing Accuracy and Performance

[Chart. Labels: Observations, Unique Identities, True Population]

Page 19: Mass declassification sept 23 2010v2.1

“Expert Counting” is Fundamental to Prediction

Is it 5 people each with 1 account … or is it 1 person with 5 accounts?

If one cannot count … one cannot estimate vector or velocity (direction and speed).

Without vector and velocity … prediction is nearly impossible.

Therefore, if you can’t count, you can’t predict.

Page 20: Mass declassification sept 23 2010v2.1

Mass Declassification Predictions

Page 21: Mass declassification sept 23 2010v2.1

Mass Declassification Predictions

Whose equity is it?

Machine triage – disposition

Queue prioritization

Page 22: Mass declassification sept 23 2010v2.1

Using What Data Points?

FOR EXAMPLE:

450M target documents

Dirty words

Previous declassifications

Previous declassification denials

FOIAs

Intellipedia

Wikipedia

WikiLeaks

Deceased persons

Publicly available accounts/facts

Page 23: Mass declassification sept 23 2010v2.1

Page 24: Mass declassification sept 23 2010v2.1

Open Source Discovery/Scoring

“Height of Pakistan’s Mufasa missile.”

– What is 15.5 meters?

– New York Times, Sept 21, 2010, C3: “Pakistan unveils Mufasa 7 Warhead”

– Wikipedia: Mufasa_7_Warhead
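
A simple way to score whether a sensitive phrase is already discoverable in open sources is to measure how many of its key terms co-occur in a single public document. The sketch below is an assumption-laden illustration: the naive tokenizer, the `open_source_score` helper, and the sample texts (built around this slide's illustrative Mufasa example, with the second text made up for the demo) are hypothetical. Real scoring would need entity extraction and fuzzier matching.

```python
import re
from typing import Iterable, Set


def terms(text: str) -> Set[str]:
    """Lowercase words and numbers (decimals kept whole); deliberately naive."""
    return set(re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower()))


def open_source_score(claim: str, public_docs: Iterable[str]) -> float:
    """Best fraction of the claim's terms found together in any one public document."""
    claim_terms = terms(claim)
    if not claim_terms:
        return 0.0
    return max(
        (len(claim_terms & terms(doc)) / len(claim_terms) for doc in public_docs),
        default=0.0,
    )


claim = "Mufasa missile height 15.5 meters"
public_docs = [
    "Pakistan unveils Mufasa 7 Warhead",            # headline from the slide
    "The Mufasa missile stands 15.5 meters tall",   # made-up text for the demo
]
score = open_source_score(claim, public_docs)
print(f"open-source overlap: {score:.0%}")
```

A high overlap would only be evidence that related information exists in the public domain; a human would still decide what that means for release.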

Page 25: Mass declassification sept 23 2010v2.1

Context Accumulation

FOIA March 2010

Open Source Reference

Dirty Word

Classified – Asserted

Mufasa 7 Warhead

Page 26: Mass declassification sept 23 2010v2.1

Context Accumulation + Statistics

Document Element | Total | Declass | Class-Default | Class-Asserted
Author: “Billy K” | 4503 | 1600 | 403 | 0
Codeword: “Tomatoe” | 4818 | 4600 | 218 | 0
Classification: “SI/TK/001” | 23 | 22 | 1 | 0
Actors: “Salam Ahmed” | 782 | 700 | 82 | 0

Declassification dispositions … becoming a force multiplier.

The more human dispositions, the more automated dispositions.

Human Triage | Auto Triage
5,000 | 20
10,000 | 4,000
100,000 | 65,000
1,000,000 | 17,000,000
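
The per-element statistics on this slide can be turned into a simple historical release-rate signal. The sketch below is one possible reading, not the author's stated method: it assumes the rate is Declass divided by Total, and both the 0.9 threshold and the rule that any Class-Asserted count blocks automation are hypothetical.

```python
from typing import NamedTuple


class ElementStats(NamedTuple):
    element: str
    total: int
    declass: int
    class_default: int
    class_asserted: int

    @property
    def declass_rate(self) -> float:
        """Share of past documents carrying this element that were declassified."""
        return self.declass / self.total if self.total else 0.0


# Counts copied from the slide's table.
STATS = [
    ElementStats('Author: "Billy K"', 4503, 1600, 403, 0),
    ElementStats('Codeword: "Tomatoe"', 4818, 4600, 218, 0),
    ElementStats('Classification: "SI/TK/001"', 23, 22, 1, 0),
    ElementStats('Actors: "Salam Ahmed"', 782, 700, 82, 0),
]


def supports_auto_release(stats: ElementStats, threshold: float = 0.9) -> bool:
    """Hypothetical rule: high historical release rate and no asserted classification."""
    return stats.class_asserted == 0 and stats.declass_rate >= threshold


for row in STATS:
    print(f"{row.element}: {row.declass_rate:.1%} -> {supports_auto_release(row)}")
```

Each new human disposition updates these counts, which is how a growing pool of human decisions could unlock more automated ones. A rule like this would also want a minimum sample size, since 22 of 23 is thin evidence.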

Page 27: Mass declassification sept 23 2010v2.1

Policy Questions

What related information is already available in the public domain?

– Evidence: Exists in open source

What damage might conceivably result from disclosure and what benefits might ensue?

– Evidence: Same text already released (by same equity holder)

Page 28: Mass declassification sept 23 2010v2.1

Strawman Architecture

Page 29: Mass declassification sept 23 2010v2.1

Strawman Architecture

450M Docs

Historical Dispositions

Dirty Words

Etc.

Feature Extraction & Classification

Context Accumulation

Predictions (*)

Workflow System

Dispositions

(*) Recommendations: Equity of, Disposition, Priority
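
The boxes in this diagram can be read as a pipeline from documents to a prioritized work queue. The sketch below is a loose illustration under stated assumptions: the ordering of the stages, the `Document` and `Prediction` types, the risk formula, the Red/Yellow/Green thresholds, and the highest-risk-first queue order are all hypothetical, and only meant to show how the pieces could connect.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class Prediction:
    doc_id: str
    risk: float        # 0.0 (no disclosure risk) to 1.0 (certain risk)
    disposition: str   # "red", "yellow", or "green"


def extract_features(doc: Document, dirty_words: Set[str]) -> Set[str]:
    """Feature extraction: dirty words found in the text (naive substring match)."""
    lowered = doc.text.lower()
    return {word for word in dirty_words if word in lowered}


def predict(doc: Document, dirty_words: Set[str],
            release_rates: Dict[str, float]) -> Prediction:
    """Combine features with accumulated context (historical release rates)."""
    hits = extract_features(doc, dirty_words)
    # Assumption: a term's risk is 1 minus how often it was released before;
    # a term with no release history counts as fully risky.
    risk = max((1.0 - release_rates.get(word, 0.0) for word in hits), default=0.0)
    disposition = "red" if risk >= 0.7 else "yellow" if risk >= 0.3 else "green"
    return Prediction(doc.doc_id, risk, disposition)


def build_queue(docs: Iterable[Document], dirty_words: Set[str],
                release_rates: Dict[str, float]) -> List[Prediction]:
    """Workflow input: predictions ordered highest risk first (an assumed priority)."""
    predictions = [predict(doc, dirty_words, release_rates) for doc in docs]
    return sorted(predictions, key=lambda p: p.risk, reverse=True)


docs = [
    Document("a", "Routine travel voucher"),
    Document("b", "Mufasa warhead test data"),
    Document("c", "Tomatoe logistics memo"),
]
queue = build_queue(docs, {"mufasa", "tomatoe"}, {"tomatoe": 0.5})
for item in queue:
    print(item.doc_id, item.disposition, round(item.risk, 2))
```

Human dispositions coming out of the workflow would feed back into `release_rates`, closing the loop the diagram implies.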

Page 30: Mass declassification sept 23 2010v2.1

Another Idea: Crowd Sourcing

Can you predict specific people with privileges and knowledge … to whom selected documents can be routed for evaluation?

Can you publish machine-triage recommendations to a wiki or other form of internal broadcast for community crowd sourcing?

Page 31: Mass declassification sept 23 2010v2.1

Another Idea: Better Classification

Using the overall declassification platform to assist in proper classification (real-time)

And, better pre-tagging to assist in future auto-declassification

Page 32: Mass declassification sept 23 2010v2.1

Challenges

Page 33: Mass declassification sept 23 2010v2.1

Challenges

Entity extraction is imperfect

Predictions may still not be good enough, often enough

Documents not in English

The user work surface and its distribution

Consequences of an inappropriate release

With super access and super tools, this may call for stronger audit and insider-threat protections

Your contracting cycle and the creation of the system might take until mid-2011 or 2012 or 2013

Page 34: Mass declassification sept 23 2010v2.1

Closing Thoughts

Page 35: Mass declassification sept 23 2010v2.1

Closing Thoughts

Contextualization is essential to better prediction

There are not enough humans to ask every question every day

“Human attention directing” systems are critical to the mission

The data must find the data, the relevance must find the user

Page 36: Mass declassification sept 23 2010v2.1

Worst Case Scenario

Rich context enables better hints for users, results in faster dispositions

Rich context enables improved sequencing of the work

Page 37: Mass declassification sept 23 2010v2.1

Related Blog Posts

Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems

Data Finds Data

Puzzling: How Observations Are Accumulated Into Context

The Fast Last Puzzle Piece

Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel

How to Use a Glue Gun to Catch a Liar

It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You

Smart Systems Flip-Flop

Page 38: Mass declassification sept 23 2010v2.1

Blogging At:

www.JeffJonas.TypePad.com

Information Management

Privacy

National Security

and Triathlons

Questions?

Page 39: Mass declassification sept 23 2010v2.1

Mass Declassification

What If?

Jeff Jonas, IBM Distinguished Engineer

Chief Scientist, IBM Entity Analytics

[email protected]

September 23, 2010