1
Open Mind Initiative
David G. Stork
Ricoh Silicon Valley
[email protected]
2
Outline
One-sentence description
Background
Open Mind Initiative
Sample projects
Relation to Open Source and to Data mining
Related efforts elsewhere
What do we do next Monday?
3
Open Mind Initiative
A collaborative framework (based on Open Source methodology) for developing “intelligent” software, where...
» domain experts provide algorithms,
» tool developers provide software infrastructure and tools, and
» non-expert ‘e-citizens’ provide raw data.
4
Background: Market need
Speech recognition
OCR
Web searching
...
Some software (e.g., common sense) is too costly for a single company to build
5
Background: E-community & Open Source Waves
GNU
SendMail
Linux
» 10M lines; 10M seats; doubling time ≈ 6 mo.; 10^5 contributors
Apache
» Half of all web servers
Beowulf
» Supercomputer power from networked PCs
Newhoo! (dmoz.org)
» Open web directory (527,991 sites, 10,943 editors, 82,003 categories)
Infomedia
» Open source encyclopedia
6
Growth of new software methods
1990: 10^5 programmers; 1995: Linux
1995: 10^6 web authors; 1999: Newhoo!
1999: 10^9 e-citizens; 2003: Open Mind
New communication allows communities and collaboration, and thus new software methods
Opportunities expand to less-skilled users
7
Background: Pattern recognition/intelligent systems
Recognizer = Theory + Model + Data
Theory excellent
Models depend on problem
Never enough data
» “the group with the most data wins”
» e.g., OCR
» ...
8
Background: Tools
Tools for customization/experimentation
» CSLU (Open Source)
» Nuance
» HTK
» S+
» ...
Non-experts can use these!
9
Background: Infrastructure
Collaborative software
Animals (Shapiro 75, Lo & Stork 99)
Answer Garden (Ackerman 90)
BBN UNIPEN data collection software (Schwartz 97)
10
Infrastructure: Relevance rating
DirectHit, Inc.
» improved web indexing by monitoring users’ selections
FireFly
» target advertisements based on user profile
Amazon.com
» book recommendations
11
Open Mind Initiative
Three main functions provided by
» Domain Experts
– fundamental algorithms, process control, education/proselytizing, ...
» Tool developers
– software infrastructure, tools, ...
» e-citizens
– raw data, low-level bug reports, ...
12
Domain Experts
Provide algorithms (e.g., OCR, ...)
Provide general algorithms (e.g., Bayes nets, ...)
Process control, algorithm development and truthing
» detect outliers for review/rejection
» data “voting”
» catch trials
» signal detection theory (d’)
» method of limits
» two-alternative forced-choice hidden staircase
» bias avoidance
Trend to publish data and algorithms on the web
More university work will be done with Linux
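One of the process-control items above, the signal detection index d’, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: catch trials with known answers are scored as hits, misses, false alarms, and correct rejections, and the function name `d_prime`, its count-based interface, and the 0.5 smoothing term are hypothetical choices, not part of the Open Mind design.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    All four arguments are counts from catch trials (hypothetical interface).
    """
    # Log-linear correction: adding 0.5 to each count keeps both rates
    # strictly between 0 and 1, where the inverse normal CDF is defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# A contributor answering at chance has d' near 0;
# a careful contributor has a clearly positive d'.
chance = d_prime(50, 50, 50, 50)
careful = d_prime(90, 10, 10, 90)
```

A contributor whose d’ stays near zero on catch trials is contributing little usable information, which is one possible trigger for review or rejection of their data.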
13
Tool/infrastructure developers
Get maximum information for minimum e-citizen effort (e.g., informative patterns)
Make it easy (fast) for contributors
Web infrastructure
Collaborative software (version control)
Reward contributors
14
e-citizens
Incentives
» benefits in used system
» fun (games: Marathon, MUDD, ...)
» recognition (post names by amount of info. accepted)
» general interest (note progress: data and performance)
» altruism/philanthropy (cf. OED, SETI, ...)
» education (linguistics in schools, ...)
» lottery
» money
» frequent flyer miles
1.5M inmates, 1M in nursing homes, ...
15
Sample Projects (1) Handwritten isolated character OCR
Recognizer: simple neural net, decision tree, nearest-neighbor, ...
Patterns presented on contributors’ browsers, cached, ...
Synthetic data (rotate, skew, line thicken/thin)
Learning with queries (ask informative patterns); each pattern more valuable than a sampled one
Cooperative improvement (submit characters over internet, download improved OCR the next day)
Improved OCR
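Learning with queries can be illustrated with a small Python sketch: instead of showing contributors a random pattern, show the one the current recognizer is least sure about. Entropy of the predicted class probabilities is used here as the measure of "informative"; that choice, the function names, and the `predict_proba` callback are assumptions for illustration, since the slides do not specify a selection rule.

```python
import math


def entropy(probs):
    """Shannon entropy (bits) of a class-probability list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def most_informative(candidates, predict_proba):
    """Return the candidate whose predicted label is most uncertain.

    predict_proba(pattern) -> list of class probabilities from the
    current recognizer (a hypothetical callback).
    """
    return max(candidates, key=lambda x: entropy(predict_proba(x)))


# Toy pool: pattern id -> recognizer's probabilities for two classes.
pool = {
    "scan_a": [0.98, 0.02],  # recognizer already confident
    "scan_b": [0.50, 0.50],  # recognizer cannot decide
    "scan_c": [0.70, 0.30],
}
query = most_informative(pool, pool.get)
```

Here `scan_b` is selected, since a contributor's label for it resolves the most uncertainty.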
17
Sample Projects (2) Handwritten word recognition
Recognizer: “off the shelf”
Words scanned from handwritten docs
Three alternatives shown, best selected by naive contributor (as in commercial speech recognizers)
Improved handwritten OCR
18
Sample Projects (3) Open Mind chatbot game
MUDD-like game
Goal: find the route through the castle to the “human”
» choose the “most natural” paragraph
Linguistic information learned in background
More natural interfaces
19
Sample Projects (4) Common sense about computers
Facts
» programs compiled, interpreted, run, ...
» a mouse is a peripheral
» early versions of code are generally buggy
» COBOL is a programming language
More natural text interfaces
20
Sample Projects (5) Open Mind chess/go
Chess/go = fast search + board scoring
Allow contributors to score positions
» weighted by FIDE chess rating/go dan
» weighted by score on on-line test
» weighted by “confidence”
Port to multiple PCs (Beowulf) for speed
Improved beam search via improved scoring (more humanlike style?)
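The weighted scoring of positions could be combined as a weighted average, sketched below in Python. The slide lists three possible weights (rating, on-line test score, confidence) without saying how they combine; multiplying a skill weight by a confidence value is an assumption made here, and `combined_score` and its tuple layout are hypothetical.

```python
def combined_score(votes):
    """Weighted average of contributors' scores for one board position.

    votes: list of (score, skill_weight, confidence) tuples, where
    skill_weight might be derived from a rating or an on-line test and
    confidence is the contributor's self-reported certainty (0 to 1).
    """
    total_weight = sum(w * c for _, w, c in votes)
    if total_weight == 0:
        raise ValueError("no votes with positive weight")
    return sum(s * w * c for s, w, c in votes) / total_weight


# Two contributors score the same position; the stronger player counts double.
position_score = combined_score([(1.0, 2.0, 1.0), (0.0, 1.0, 1.0)])
```

The resulting value would then serve as the board score consumed by the search.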
21
Sample Projects (6) Open Mind Animals (Lam & Stork 99)
[Figure: decision tree of yes/no questions (2 legs? can fly? can swim? feathers? mane?) with animal leaves (human, dog, elephant, parrot, bat, horse)]
Challenges:
» truthing: ensure valid animals; name/synonym check; ensure data quality (“voting,” “accept if used”)
» bug reporting: forwarding errors to domain experts
» crediting contributors: ordered by amount contributed; avoid ID clashes; allow anonymity
» query simplification: reduce average number of queries/new animal
» tree simplification: better taxonomy
» arbitrary branching factor: tree reflects the structure of domain
» generalizable to other domains: other forms of queries
» human-machine interface: natural, show current query set (selectable)
» display progress: number of animals, contributors, show tree
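The question tree behind an Animals-style game can be sketched in Python as a binary tree that grows whenever a guess is wrong. This is a generic version of the classic game under the assumption of yes/no questions only; the `Node`, `guess`, and `learn` names are hypothetical, and none of the truthing or crediting challenges listed above are handled.

```python
class Node:
    """Internal nodes hold a yes/no question; leaves hold an animal name."""

    def __init__(self, text, yes=None, no=None):
        self.text = text
        self.yes = yes
        self.no = no

    def is_leaf(self):
        return self.yes is None and self.no is None


def guess(node, answer):
    """Walk the tree; answer(question) -> bool. Returns the leaf reached."""
    while not node.is_leaf():
        node = node.yes if answer(node.text) else node.no
    return node


def learn(leaf, new_animal, question, new_answer_is_yes):
    """Turn a wrongly guessed leaf into a question separating old from new."""
    old, new = Node(leaf.text), Node(new_animal)
    leaf.text = question
    leaf.yes, leaf.no = (new, old) if new_answer_is_yes else (old, new)


# Start with one question, then teach the tree about parrots.
tree = Node("2 legs?", yes=Node("human"), no=Node("dog"))
parrot_facts = {"2 legs?": True, "feathers?": True}
wrong = guess(tree, lambda q: parrot_facts.get(q, False))  # reaches "human"
learn(wrong, "parrot", "feathers?", new_answer_is_yes=True)
```

After `learn`, the same answers lead to `parrot`, while a contributor answering "no" to "feathers?" still reaches `human`.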
22
Sample Projects (7) Open Mind Investment Assistant (Lo & Stork 99)
[Figure: stock ticker symbols, including IBM, AAPL, DELL, GM, F, AMD]
23
Problems in Machine learning
Relative value of learning with queries vs. iid samples
Data truthing/outlier detection
Optimal learning strategies given...
» Bayes error
» probability of hostile data
» probability of data error
Learn reliability of e-citizens, individually and as a group
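One simple way to approach learning contributor reliability is sketched below in Python: estimate each e-citizen's accuracy from items with known answers, then weight their votes by log-odds of that accuracy. The slides pose this as an open problem, so this is an assumed baseline rather than a proposed solution; the function names, the Laplace smoothing, and the independence assumption behind log-odds weights are all illustrative.

```python
import math
from collections import defaultdict


def reliability(correct, total):
    """Laplace-smoothed accuracy on items with known answers."""
    return (correct + 1) / (total + 2)


def weighted_vote(labels, accuracy):
    """Pick the label with the largest sum of log-odds weights.

    labels:   {contributor: label submitted for one pattern}
    accuracy: {contributor: estimated accuracy, strictly between 0 and 1}
    Assumes contributors err independently of one another.
    """
    tally = defaultdict(float)
    for who, label in labels.items():
        p = accuracy[who]
        tally[label] += math.log(p / (1 - p))
    return max(tally, key=tally.get)


# One reliable contributor outweighs two near-chance contributors.
acc = {
    "ann": reliability(19, 20),
    "bob": reliability(11, 20),
    "cy": reliability(11, 20),
}
winner = weighted_vote({"ann": "B", "bob": "8", "cy": "8"}, acc)
```

A contributor with estimated accuracy below 0.5 receives a negative weight, so hostile or systematically wrong data counts against the label it supports.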
24
Relation to Open Source
Open Source
• no e-citizens
• expert knowledge (C++filt,gdbm)
• machine learning irrelevant
• web infrastructure useful
• most work is directly on the final software
• hacker culture (≈10^5)
Open Mind
• e-citizens crucial
• informal knowledge (read, hear)
• machine learning essential
• web infrastructure essential
• most work is on the infrastructure
• e-citizen and business culture (≈10^9)
25
Relation to Data Mining
Data Mining
• type of data may not be available for the project desired (e.g., OCR)
• no interactive queries: slower learning, ambiguities not resolved
• relatively fixed amount of data
• little or no e-citizen support
Open Mind
• data tailored to the project desired (e.g., OCR)
• interactive queries: faster learning, ambiguities resolved
• new data encouraged
• e-citizen support
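26
Open Mind project Taxonomy
Each project rated H/M/L on four criteria: benefit to world, benefit to Open Mind, use of e-citizens, ease/simplicity
Projects: OCR, speech, chess/go, commonsense, comp c-s, dialog, grammar, Animals
» grammar: M, M, M, H
» Animals: M, H, M, H
[Table: ratings for OCR, speech, chess/go, commonsense, comp c-s, and dialog; cell layout lost]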
27
Related efforts elsewhere
Speech
» Macrophone
» Human phoneme project
» Linguistic Data Consortium
» VoiceControl (Open Source speech for Linux)
» CSLU (Center for Spoken Language Understanding) Open Source speech tools
OCR
» NIST, CEDAR, ARPA, UNIPEN
GNU dictionary
Newhoo!
28
It is inevitable
Need is here
Web is here
Theory/Machine learning is here
This collaboration is going to happen!
» Less radical than Richard Stallman or Linus Torvalds...
[Figure labels: Intelligent systems, e-citizens’ knowledge, Open Mind]
29
Possible value to corporations
Most companies could never develop most of this software, nor preserve a competitive advantage through proprietary software
Expand functionality/niches for all
Low-cost, possibly high-payoff research
Leverage university work
30
Technical Specifications
Language: Java
» Portable
Operating System: Linux
» Open Source, portable, multiprocessor version (Beowulf)
Data representation: Resource Description Framework (RDF)
» Source: www.w3.org/RDF/
» Code: lxr.mozilla.org/mozilla/source/rdf/base/
» Docs: www.mozilla.org/rdf/doc/
31
Licenses
No license choice will satisfy everyone
» GNU: any linked code must include source and follow FSF copyright -- “copyleft”
» FreeBSD: do whatever you like (can charge)
But... you cannot link GNU & FreeBSD!
Practical (not moral) decision
Open Source will benefit from competitive commercialization
» BSD license best for Open Mind
32
What do we do next Monday?
Put up OpenMind.org
Demonstration projects: Open Mind Animals
Limited seeding (proselytizing)
Solicit projects; introduce domain experts to tool developers
Get corporate donations (e.g., books, CDs, ...)
33
Summary
Open Mind
» Collaborative framework for developing “intelligent systems”
» Experts, tool developers, e-citizens
Projects
Vision of the future