
Page 1:

©2007 H5

STIR: Simultaneous Achievement of High Precision and High Recall through Socio-Technical Information Retrieval

Robert S. Bauer, Teresa Jade (www.H5technologies.com) & Mitchell P. Marcus (www.cis.upenn.edu/~mitch/)

June 7, 2007

Page 2 (slide 1 of 9):

The e-Discovery IDEAL: High P with High R

• Find every relevant document & only those docs that are relevant
• Desired: P = 0.8 (or better) @ R = 0.8 (or better)
• Acceptable: P = 2/3 (or better) @ R = 2/3 (or better)
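The two target levels above can be written as a small check. This is a minimal sketch, not part of the original slides: the function names and the example counts are hypothetical, and only the thresholds (0.8 and 2/3) come from the slide.

```python
from fractions import Fraction

DESIRED = Fraction(4, 5)     # P = R = 0.8, from the slide
ACCEPTABLE = Fraction(2, 3)  # P = R = 2/3, from the slide


def precision_recall(true_pos, false_pos, false_neg):
    """Return (P, R) as exact fractions from retrieval counts."""
    precision = Fraction(true_pos, true_pos + false_pos)
    recall = Fraction(true_pos, true_pos + false_neg)
    return precision, recall


def rate(precision, recall):
    """Classify a (P, R) pair against the slide's two target levels."""
    if precision >= DESIRED and recall >= DESIRED:
        return "desired"
    if precision >= ACCEPTABLE and recall >= ACCEPTABLE:
        return "acceptable"
    return "below acceptable"


# Hypothetical counts: 70 relevant docs retrieved, 30 non-relevant docs
# retrieved, 20 relevant docs missed.
p, r = precision_recall(true_pos=70, false_pos=30, false_neg=20)
print(p, r, rate(p, r))
```

Both conditions must hold at once: a run with P = 0.9 but R = 0.5 is rated "below acceptable" even though its precision alone exceeds the desired level.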

Page 3 (slide 1 of 9):

The e-Discovery REALITY

High P & Low R = RISK (important docs not retrieved)

Low P & High R = COST (many more documents must be reviewed)

Text REtrieval Conference (TREC)
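The risk and cost trade-off above follows directly from the definitions of precision and recall. The sketch below is illustrative only: `review_burden` is a hypothetical helper, and the corpus size and the (P, R) values are made-up numbers, not results from the slides.

```python
def review_burden(num_relevant, precision, recall):
    """Translate (P, R) into relevant docs missed and non-relevant docs reviewed."""
    if not 0 < precision <= 1 or not 0 <= recall <= 1:
        raise ValueError("precision must be in (0, 1] and recall in [0, 1]")
    found = recall * num_relevant   # relevant docs that were retrieved
    missed = num_relevant - found   # RISK: relevant docs never retrieved
    retrieved = found / precision   # total docs that must be reviewed
    wasted = retrieved - found      # COST: non-relevant docs reviewed
    return {
        "missed": round(missed),
        "retrieved": round(retrieved),
        "wasted": round(wasted),
    }


# Hypothetical corpus containing 1,000 relevant documents.
high_p_low_r = review_burden(1000, precision=0.9, recall=0.2)
low_p_high_r = review_burden(1000, precision=0.1, recall=0.9)
print(high_p_low_r)
print(low_p_high_r)
```

In this made-up example the high-P, low-R run misses 800 of the 1,000 relevant documents, while the low-P, high-R run sends 9,000 documents to review, 8,100 of them non-relevant.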

Page 4 (slide 2 of 9):

Agenda

• Results
  – TREC ad hoc (= typical)
  – Queries typifying Communities of Practice (CoPs)
• e-Discovery Approaches
  – 5 Dimensions
  – Linguistics of CoPs
• Research Issues
  – TREC
  – AI
  – Linguists
  – Lawyers

Page 5 (slide 3 of 9):

Typical Results – ad hoc queries

(from Chapter 3, “Retrieval System Evaluation” by Chris Buckley and Ellen M. Voorhees, in TREC: Experiment and Evaluation in Information Retrieval, Voorhees & Harman, eds., MIT Press, 2005, p. 62, Fig. 3.1)

• 22 Topics
• Average
• Desired is Rare
• Acceptable < 10%

Page 6 (slide 4 of 9):

Accuracy Metrics

Most accurate TREC results for 20 of 22 topics in one test case, compared with STIR topical avg in 4 cases (I-IV) encompassing 42 topics

[Figure: bar chart of F1 (axis 0% to 90%) for TREC topics 1 to 20 and for the four STIR cases (bars LR, KR, MV, Z; cases I-IV), with reference levels labeled Ideal, TREC avg, and Acceptable]

$F_1 = 2 \cdot (P \cdot R) / (P + R)$
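The F1 formula on this slide is the harmonic mean of precision and recall. A minimal sketch follows; the function name `f1` and the example inputs are illustrative, not taken from the slides.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: 2*P*R / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention when nothing relevant was retrieved
    return 2 * precision * recall / (precision + recall)


print(f1(0.8, 0.8))      # P = R at the desired level
print(f1(2 / 3, 2 / 3))  # P = R at the acceptable level
print(f1(0.9, 0.2))      # high P, low R
```

When P = R the harmonic mean equals that common value, so the desired and acceptable levels from the earlier slide correspond to F1 values of 0.8 and 2/3. An imbalance pulls F1 toward the smaller of the two: P = 0.9 with R = 0.2 gives an F1 of roughly 0.33.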

Page 7 (slide 5 of 9):

STIR compared with TREC IR

Topical P & R results for one TREC and 4 STIR cases

[Figure: scatter plot of Precision (0% to 100%) against Recall (0% to 100%); legend: TREC, Lone Ranger, Knight Rider, Miami Vice; point groups labeled STIR and TREC]

• Average P & R for each case

Page 8 (slide 5 of 9):

Recall Improvement

Sampled Corpus Tests for 12 Topics in case I during STIR Training

[Figure: plot of Precision (0% to 100%) against Recall (0% to 100%) for Category 1 through Category 12]

● STIR training provides substantial Recall improvement with acceptable Precision reduction

Retrieval Acceptable to lowest limit of statistical uncertainty
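One way to read "acceptable to the lowest limit of statistical uncertainty" is that the lower confidence bound of a sampled estimate, not just the point estimate, stays at or above the acceptable level of 2/3. The slides do not say which interval was used; the sketch below assumes a Wilson score interval at roughly 95% confidence (z = 1.96), and the function names and sample counts are hypothetical.

```python
import math

ACCEPTABLE = 2 / 3  # acceptable level for P and R, from the earlier slide


def wilson_lower_bound(successes, n, z=1.96):
    """Lower limit of the Wilson score interval for a sampled proportion.

    ASSUMPTION: the choice of the Wilson interval and z = 1.96 is ours,
    not something stated in the slides.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (center - margin) / denom


def acceptable_at_lower_limit(successes, n):
    """True if even the lower confidence limit meets the acceptable level."""
    return wilson_lower_bound(successes, n) >= ACCEPTABLE


# Hypothetical samples of 100 reviewed documents each.
print(acceptable_at_lower_limit(90, 100))
print(acceptable_at_lower_limit(70, 100))
```

Under these assumptions a sample with 70 hits out of 100 has a point estimate above 2/3 but a lower limit below it, so it would not count as acceptable by this stricter reading.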

Page 9 (slide 6 of 9):

Agenda

• Results
  – TREC ad hoc (= typical)
  – Queries typifying Communities of Practice (CoPs)
• e-Discovery Approaches
  – 5 Dimensions
  – Linguistics of CoPs
• Research Issues
  – TREC
  – AI
  – Linguists
  – Lawyers

Page 10 (slide 7 of 9):

Dimensions of e-Discovery

[Diagram labels: Subject Matter, Legal Case, Linguistics, Documents, Community]

Page 11 (slide 7 of 9):

Dimensions of e-Discovery: Document Review

[Diagram labels: Legal Case, Documents]

Example Systems:
• Manual (human) review conducted by attorneys
• Basic keyword searches targeted to legal issues
• Supervised learning with relevance feedback

Page 12 (slide 7 of 9):

Dimensions of e-Discovery: Expert Search

[Diagram labels: Subject Matter, Legal Case, Documents]

Example Systems:
• Subject matter experts review results under legal team direction
● Domain-specific lexicons used

Page 13 (slide 7 of 9):

Dimensions of e-Discovery: Model Meaning

[Diagram labels: Subject Matter, Legal Case, Linguistics, Documents]

Example Systems:
• Supervised learning with
  – relevance feedback
  – semantic analysis
● Semantic search

Page 14 (slide 7 of 9):

Dimensions of e-Discovery: Model Communities

[Diagram labels: Subject Matter, Legal Case, Linguistics, Documents, Community]

Example System:
● Socio-Technical-IR

Page 15 (slide 7 of 9):

Dimensions of e-Discovery: Socio-Technical-IR

[Diagram labels: Linguistics, Community]

• Non-computational Linguistic Disciplines
  – Pragmatics
  – Socio-Linguistics
  – Ethno-Methodology
  – Discourse Analysis
• A community of practice is
  – a diverse group of people
  – engaged in real work
  – over a significant period of time
  – developing their own tools, language, and processes
  – during which they build things, solve problems, learn and invent
  – evolving a practice that is highly skilled and highly creative

Page 16 (slide 8 of 9):

Agenda

• Results
  – TREC ad hoc (= typical)
  – Queries typifying Communities of Practice (CoPs)
• e-Discovery Approaches
  – 5 Dimensions
  – Linguistics of CoPs
• Research Issues
  – TREC
  – AI
  – Linguists
  – Lawyers

Page 17 (slide 9 of 9):

Research Issues

• TREC
  – Nature of the relatively rare high P with high R queries
  – Measuring both recall and precision effectively
• AI
  – Knowledge-Based (Expert) Systems that codify linguistic expertise
  – Characterize practice communities of subject matter experts
  – Investigate combination systems applied to different types of topics
• Linguists
  – Identify and characterize different types of topics and map to system types
  – Language patterns in communities as well as subject matter fields
  – Defining categories in concrete terms
• Lawyers
  – Defining categories in concrete terms
  – Integration of technology and processes

Page 18:

Back-Up

Page 19:

STIR Analysis: CoPs’ Enunciatory language

[Diagram labels: Relevant Document, Text, State of Affairs, Object, Process, Action, Fact, Event]