1 Open Mind Initiative David G. Stork Ricoh Silicon Valley [email protected].


1

Open Mind Initiative

David G. Stork, Ricoh Silicon Valley

[email protected]

2

Outline

• One-sentence description
• Background
• Open Mind Initiative
• Sample projects
• Relation to Open Source and to Data mining
• Related efforts elsewhere
• What do we do next Monday?

3

Open Mind Initiative

A collaborative framework (based on Open Source methodology) for developing “intelligent” software, where...
» domain experts provide algorithms,
» tool developers provide software infrastructure and tools, and
» non-expert ‘e-citizens’ provide raw data.

4

Background: Market need

• Speech recognition
• OCR
• Web searching
• ...
• Some software (e.g., common sense) too costly for a single company to build

5

Background: E-community & Open Source Waves

• GNU
• SendMail
• Linux
» 10M lines; 10M seats; doubling time ≈ 6 mo.; 10^5 contributors
• Apache
» Half of all web servers
• Beowulf
» Supercomputer power from networked PCs
• Newhoo! (dmoz.org)
» Open web directory (527,991 sites, 10,943 editors, 82,003 categories)
• Infomedia
» Open source encyclopedia

6

Growth of new software methods

• 1990: 10^5 programmers; 1995: Linux
• 1995: 10^6 web authors; 1999: Newhoo!
• 1999: 10^9 e-citizens; 2003: Open Mind

New communication allows communities and collaboration, and thus new software methods

Opportunities expand to less-skilled users

7

Background: Pattern recognition/intelligent systems

• Recognizer = Theory + Model + Data
• Theory excellent
• Models depend on problem
• Never enough data
» “the group with the most data wins”
» e.g., OCR
» ...

8

Background: Tools

• Tools for customization/experimentation
» CSLU (Open Source)
» Nuance
» HTK
» S+
» ...
• Non-experts can use these!

9

Background: Infrastructure

• Collaborative software
• Animals (Shapiro 75, Lo & Stork 99)
• Answer Garden (Ackerman 90)
• BBN UNIPEN data collection software (Schwartz 97)

10

Infrastructure: Relevance rating

• DirectHit, Inc.
» improved web indexing by monitoring users’ selections
• FireFly
» target advertisements based on user profile
• Amazon.com
» book recommendations

11

Open Mind Initiative

Three main functions provided by
» Domain experts
– fundamental algorithms, process control, education/proselytizing, ...
» Tool developers
– software infrastructure, tools, ...
» e-citizens
– raw data, low-level bug reports, ...

12

Domain Experts

• Provide algorithms (e.g., OCR, ...)
• Provide general algorithms (e.g., Bayes nets, ...)
• Process control, algorithm development and truthing
» detect outliers for review/rejection
» data “voting”
» catch trials
» signal detection theory (d′)
» method of limits
» two-alternative forced-choice hidden staircase
» bias avoidance
• Trend to publish data and algorithms on the web
• More university work will be done with Linux
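
As a rough illustration of two of the truthing checks listed above, the sketch below computes the signal detection sensitivity d′ for one contributor from catch-trial counts and applies a simple majority vote to a set of submitted labels. The function names, the 0.5 count correction, and the vote threshold are assumptions made for illustration; the slides do not specify how Open Mind would implement these checks.

```python
from collections import Counter
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' = z(hit rate) - z(false-alarm rate).

    Adds 0.5 to each count (a log-linear correction, assumed here)
    so that rates of exactly 0 or 1 do not give infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


def vote(labels, min_share=0.5):
    """Return the majority label, or None if no label exceeds min_share."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None
```

A contributor who answers catch trials at chance gets d′ near 0 and could be flagged for review, while `vote(["4", "4", "9"])` accepts "4" and a tied vote returns `None`.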

13

Tool/infrastructure developers

• Get maximum information for minimum e-citizen effort (e.g., informative patterns)
• Make it easy (fast) for contributors
• Web infrastructure
• Collaborative software (version control)
• Reward contributors

14

e-citizens

• Incentives
» benefits in used system
» fun (games: Marathon, MUDD, ...)
» recognition (post names by amount of info. accepted)
» general interest (note progress: data and performance)
» altruism/philanthropy (cf. OED, SETI, ...)
» education (linguistics in schools, ...)
» lottery
» money
» frequent flyer miles
• 1.5M inmates, 1M in nursing homes, ...

15

Sample Projects (1) Handwritten isolated character OCR

• Recognizer: simple neural net, decision tree, nearest-neighbor, ...
• Patterns presented on contributors’ browsers, cached, ...
• Synthetic data (rotate, skew, line thicken/thin)
• Learning with queries (ask informative patterns); each pattern more valuable than a sampled one
• Cooperative improvement (submit characters over internet, download improved OCR the next day)

Improved OCR
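
One way to read “learning with queries” for a nearest-neighbor recognizer is sketched below: from a pool of unlabeled patterns, pick the one whose two closest classes are most nearly tied and show that pattern to a contributor. The two-dimensional feature vectors, the margin criterion, and all function names are illustrative assumptions, not the method used in the project, and the sketch needs at least two labeled classes.

```python
import math


def nearest_distance(x, examples):
    """Euclidean distance from x to the closest example."""
    return min(math.dist(x, e) for e in examples)


def classify(x, labeled):
    """1-nearest-neighbor label; labeled maps label -> list of feature vectors."""
    return min(labeled, key=lambda lab: nearest_distance(x, labeled[lab]))


def most_informative(pool, labeled):
    """Pick the unlabeled pattern whose two closest classes are nearly tied."""
    def margin(x):
        d = sorted(nearest_distance(x, ex) for ex in labeled.values())
        return d[1] - d[0]
    return min(pool, key=margin)


# Toy data: one labeled "4", one labeled "9", three unlabeled patterns.
labeled = {"4": [(0.0, 0.0)], "9": [(4.0, 0.0)]}
pool = [(0.5, 0.0), (2.1, 0.0), (3.8, 0.0)]
query = most_informative(pool, labeled)  # the pattern to show a contributor
```

Here the pattern lying almost midway between the two classes is selected, since a label for it resolves more ambiguity than a label for a pattern that already sits next to a known example.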

16

OCR example

[Figure: handwritten 4s and 9s exchanged between e-citizens and the Open Mind host]

17

Sample Projects (2) Handwritten word recognition

• Recognizer: “off the shelf”
• Words scanned from handwritten docs
• Three alternatives shown, best selected by naive contributor (as in commercial speech recognizers)

Improved handwritten OCR

18

Sample Projects (3) Open Mind chatbot game

• MUDD-like game
• Goal: find the route through the castle to the “human”
» choose the “most natural” paragraph
• Linguistic information learned in background

More natural interfaces

19

Sample Projects (4) Common sense about computers

• Facts
» programs compiled, interpreted, run, ...
» a mouse is a peripheral
» early versions of code are generally buggy
» COBOL is a programming language

More natural text interfaces

20

Sample Projects (5) Open Mind chess/go

• Chess/go = fast search + board scoring
• Allow contributors to score positions
» weighted by FIDE chess rating/go dan
» weighted by score on on-line test
» weighted by “confidence”
• Port to multiple PCs (Beowulf) for speed

Improved beam search via improved scoring (more humanlike style?)
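
A minimal sketch of the weighting idea above: each contributor's score for a board position is averaged with a weight built from playing strength and stated confidence. The product `rating * confidence` is an assumed weighting chosen only for illustration; the slides list the weighting factors but not how they are combined.

```python
def combined_score(votes):
    """Weighted mean of contributors' position scores.

    votes: list of (score, rating, confidence) tuples, where score is the
    contributor's evaluation of the position, rating is a playing-strength
    number (e.g., a chess rating), and confidence lies in [0, 1].
    The weight rating * confidence is an illustrative assumption.
    """
    total_weight = sum(rating * conf for _, rating, conf in votes)
    if total_weight == 0:
        raise ValueError("no usable votes")
    return sum(s * rating * conf for s, rating, conf in votes) / total_weight


# Three hypothetical contributors scoring the same position.
votes = [(+1.0, 2400, 0.9), (-0.5, 1200, 0.5), (+0.2, 1800, 1.0)]
position_value = combined_score(votes)
```

The resulting value could then serve as the board-scoring term inside the search, with stronger and more confident contributors pulling the estimate toward their own evaluation.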

21

Sample Projects (6) Open Mind Animals (Lam & Stork 99)

[Figure: decision tree of yes/no questions (2 legs? can fly? can swim? feathers? mane?) with animals at the leaves (human, dog, elephant, parrot, bat, horse)]

Challenges:
• truthing: ensure valid animals; name/synonym check; ensure data quality (“voting,” “accept if used”)
• bug reporting: forwarding errors to domain experts
• crediting contributors: ordered by amount contributed; avoid ID clashes; allow anonymity
• query simplification: reduce average number of queries per new animal
• tree simplification: better taxonomy
• arbitrary branching factor: tree reflects the structure of domain
• generalizable to other domains: other forms of queries
• human-machine interface: natural, show current query set (selectable)
• display progress: number of animals, contributors, show tree
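
The Animals game is usually built on a binary tree of yes/no questions that grows whenever a guess is wrong. The sketch below shows that core loop under simplifying assumptions: a branching factor of two, no truthing or synonym checks, and class and function names invented for illustration rather than taken from the Open Mind Animals code.

```python
class Node:
    """A yes/no question with two children, or a leaf naming an animal."""

    def __init__(self, text, yes=None, no=None):
        self.text, self.yes, self.no = text, yes, no

    def is_leaf(self):
        return self.yes is None and self.no is None


def guess(root, answers):
    """Walk the tree; answers maps question text -> True/False."""
    node = root
    while not node.is_leaf():
        node = node.yes if answers[node.text] else node.no
    return node


def learn(leaf, new_animal, question, answer_for_new):
    """Replace a wrong leaf with a question separating old and new animals."""
    old = Node(leaf.text)
    new = Node(new_animal)
    leaf.text = question
    leaf.yes, leaf.no = (new, old) if answer_for_new else (old, new)


root = Node("can fly?", yes=Node("parrot"), no=Node("dog"))
wrong = guess(root, {"can fly?": True})  # reaches the "parrot" leaf
learn(wrong, "bat", "feathers?", answer_for_new=False)
```

After the contributor supplies "bat" and the distinguishing question, the tree answers correctly for both flying animals. The challenges listed above (truthing, query simplification, larger branching factors) all concern what this naive version leaves out.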

22

Sample Projects (7) Open Mind Investment Assistant (Lo & Stork 99)

[Figure: diagram of stock ticker symbols (IBM, AAPL, DELL, DOL, MAT, BTFD, GM, TOY, F, ALTR, ATT, AMD, and others)]

23

Problems in Machine learning

• Relative value of learning with queries vs. iid samples
• Data truthing/outlier detection
• Optimal learning strategies given...
» Bayes error
» probability of hostile data
» probability of data error
• Learn reliability of e-citizens, individually and as a group
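
As a simple starting point for the last problem above, a contributor's reliability can be approximated by how often their labels agree with the majority label for the same item. This is only a proxy, since the majority can be wrong or hostile, and the data layout and names below are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_reliability(labels):
    """Fraction of each contributor's labels that match the majority label.

    labels: list of (contributor, item, label) tuples.
    Agreement with the majority is only a proxy for correctness.
    """
    by_item = defaultdict(list)
    for _, item, label in labels:
        by_item[item].append(label)
    majority = {item: Counter(ls).most_common(1)[0][0]
                for item, ls in by_item.items()}

    agree, total = Counter(), Counter()
    for who, item, label in labels:
        total[who] += 1
        agree[who] += label == majority[item]  # True counts as 1
    return {who: agree[who] / total[who] for who in total}


# Hypothetical labels from three contributors on two images.
labels = [
    ("ann", "img1", "4"), ("bob", "img1", "4"), ("eve", "img1", "9"),
    ("ann", "img2", "9"), ("bob", "img2", "9"), ("eve", "img2", "4"),
]
reliability = estimate_reliability(labels)
```

Per-contributor estimates like these could feed back into weighted voting, while the spread of estimates across contributors gives a crude picture of the reliability of the group as a whole.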

24

Relation to Open Source

Open Source

• no e-citizens

• expert knowledge (C++filt, gdbm)

• machine learning irrelevant

• web infrastructure useful

• most work is directly on the final software

• hacker culture (≈10^5)

Open Mind

• e-citizens crucial

• informal knowledge (read, hear)

• machine learning essential

• web infrastructure essential

• most work is on the infrastructure

• e-citizen and business culture (≈10^9)

25

Relation to Data Mining

Data Mining

• type of data may not be available for the project desired (e.g., OCR)

• no interactive queries: slower learning, ambiguities not resolved

• relatively fixed amount of data

• little or no e-citizen support

Open Mind

• data tailored to the project desired (e.g., OCR)

• interactive queries: faster learning, ambiguities resolved

• new data encouraged

• e-citizen support

26

Open Mind project Taxonomy

Columns: Benefit (World), Benefit (Open Mind), Use of e-citizens, Ease/simplicity

OCR

speech

chess/go

commonsense

comp c-s

dialog

H H M

ML

H H

H H

M M

LL

H

HH

H

M

M

L

MH H

M

grammar M M M H

Animals M H M H

27

Related efforts elsewhere

• Speech
» Macrophone
» Human phoneme project
» Linguistic Data Consortium
» VoiceControl (Open Source speech for Linux)
» CSLU (Center for Spoken Language Understanding) Open Source speech tools
• OCR
» NIST, CEDAR, ARPA, UNIPEN
• GNU dictionary
• Newhoo!

28

It is inevitable

Need is here

Web is here

Theory/Machine learning is here

This collaboration is going to happen!
» Less radical than Richard Stallman or Linus Torvalds...

[Figure: e-citizens’ knowledge, Open Mind, intelligent systems]

29

Possible value to corporations

• Most companies could never develop most of this software, nor preserve a competitive advantage through proprietary software
• Expand functionality/niches for all
• Low-cost, possibly high-payoff research
• Leverage university work

30

Technical Specifications

• Language: Java
» Portable
• Operating System: Linux
» Open Source, portable, multiprocessor version (Beowulf)
• Data representation: Resource Description Framework (RDF)
» Source: www.w3.org/RDF/
» Code: lxr.mozilla.org/mozilla/source/rdf/base/
» Docs: www.mozilla.org/rdf/doc/

31

Licenses

• No license choice will satisfy everyone
» GNU: any linked code must include source and follow FSF copyright -- “copyleft”
» FreeBSD: do whatever you like (can charge)
• But... you cannot link GNU & FreeBSD!
• Practical (not moral) decision
• Open Source will benefit from competitive commercialization
» BSD license best for Open Mind

32

What do we do next Monday?

• Put up OpenMind.org
• Demonstration projects: Open Mind Animals
• Limited seeding (proselytizing)
• Solicit projects; introduce domain experts to tool developers
• Get corporate donations (e.g., books, CDs, ...)

33

Summary

• Open Mind
» Collaborative framework for developing “intelligent systems”
» Experts, tool developers, e-citizens
• Projects
• Vision of the future

34

Questions/Comments...

Contact: [email protected]