
An Exploration of Total Recall with Multiple Manual Seedings

Final Version

Jeremy Pickens, Tom Gricks, Bayu Hardi, Mark Noel, John Tredennick
Catalyst Repository Systems
1860 Blake Street, 7th Floor
Denver, CO 80202
jpickens,tgricks,bhardi,mnoel,[email protected]

ABSTRACT

The Catalyst participation in the manual at-home Total Recall Track had one fundamental question at the core of its run: what effect do various kinds of limited human effort have on a total recall process? Our two primary modes were one-shot (single query) and interactive (multiple queries).

1. OFFICIAL RUN

Our official run approach consisted of two main parts:

1. Manual selection of a minimal number of initial documents (seeding)

2. Continuous learning, with relevance feedback as the only form of active learning

1.1 Manual seeding

Four different reviewers participated in every topic. For each topic, a reviewer was assigned to manually seed that topic either by issuing a single query (one-shot) and flagging (reviewing) the first (i.e. not necessarily the best) 25 documents returned by that query, or by running as many queries as desired (interactive) within a short time period, stopping after 25 documents had been reviewed. In the one-shot approach, the reviewer is not allowed to examine any documents before issuing the query, i.e. the single query is issued "blind" after only reading the topic title and description. The first 25 documents returned by that query are flagged as having been reviewed, but because there is no further interaction with the system, it does not matter whether or not the reviewer spends any time looking at those documents. In the interactive case, reviewers were free to read documents in as much or as little depth as they wished, issue as many or as few queries as they wished, and use whatever affordances were available to them in the system to find documents (e.g. synonym expansion, timeline views, communication tracking views, etc.).

Every document that the reviewer laid eyeballs on during this interactive period had to be flagged as having been seen and submitted to the Total Recall server, whether or not the reviewer believed the document to be relevant. This was of course done in order to correctly assess total effort, and therefore correctly measure gain (recall as a function of effort). We also note that the software did not strictly enforce the 25-document guideline. As a result, the interactive reviewers sometimes went a few documents over their 25-document limit and sometimes a few documents under, as per natural human variance and mistake, but we do not consider this to be significant. Regardless, all documents reviewed, even with duplication, were noted and sent to the Total Recall server.

The reviewers working on each topic were randomly assigned to run each topic either in one-shot or in interactive mode. Each topic had two one-shot and two interactive 25-document starting points. For our one allowed official manual run, these starting points were combined (unioned) into a separate starting point. Because we did not control for overlap or duplication of effort, the union of these reviewed documents is often smaller than the sum. Reviewers working asynchronously and without knowledge of each other often found (flagged as seen) the exact same documents.

In this paper, we augment the official run with a number of unofficial runs, four for each topic: two one-shot starting points and two interactive starting points. This will be discussed further in Section 3.

1.2 Continuous Learning, Active Learning

In more consumer-facing variations of our technology, multiple forms of active learning, such as explicit diversification, are used together in a continuous setting. As the purpose of those additional types of active learning is to balance out and augment human effort, we turned them off for this experiment in order to more clearly assess the effects of just the manual seeding. Therefore, in this experiment, we do continuous learning with relevance feedback as the only active document selection mechanism [1].
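For concreteness, the resulting process can be sketched as a simple loop (a minimal illustration only, with hypothetical train_fn and review_fn callables standing in for the actual classifier and the human reviewer; this is not our production implementation):

    def continuous_learning(seed_labels, collection, train_fn, review_fn,
                            batch_size=100, stop_fn=lambda labels: False):
        # seed_labels: dict of doc_id -> True/False from the manual seeding step.
        # train_fn(labels) returns a scoring function doc_id -> score.
        # review_fn(doc_id) returns the human relevance judgment for that document.
        labels = dict(seed_labels)
        remaining = set(collection) - set(labels)
        while remaining and not stop_fn(labels):
            score = train_fn(labels)                        # retrain on all judgments so far
            batch = sorted(remaining, key=score, reverse=True)[:batch_size]
            for doc_id in batch:
                labels[doc_id] = review_fn(doc_id)          # relevance feedback only
                remaining.discard(doc_id)
        return labels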

2. OFFICIAL RESULTS

2.1 Gain Curves

Figure 1 is a plot of the official results in the form of gain curves over the relevant documents, produced by averaging performance across all 34 athome4 topics. The plots on the left have the entire collection along the x-axis, and the plots on the right are narrowed down to the more interesting top few thousand documents. There are two BMI baselines: topic title only (red) and title+description (blue), with our method given in black. Figure 2 is a plot of the official results in the form of gain curves over the highly relevant documents, similarly characterized.
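For reference, a gain curve of this kind can be computed directly from the ordered review log and the relevance judgments; a minimal sketch, assuming the review log is available as a list of booleans in review order:

    def gain_curve(reviewed_is_relevant, total_relevant):
        # Recall as a function of effort: after k reviewed documents, what
        # fraction of all relevant documents in the collection has been found?
        found = 0
        curve = []
        for is_rel in reviewed_is_relevant:
            found += int(is_rel)
            curve.append(found / total_relevant)
        return curve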

Figure 1: Relevant documents. Full collection gain curve (left), top 20k documents (right). x-axis = effort (# docs reviewed); y-axis = recall. BMI title-only = red, BMI title+description = blue, this paper = black.

Figure 2: Highly relevant documents. Full collection gain curve (left), top 20k documents (right). x-axis = effort (# docs reviewed); y-axis = recall. BMI title-only = red, BMI title+description = blue, this paper = black.

There is not much to say about these results other than that, on average, we come close to but do not beat the baseline. The difference at 25% and 50% recall is small: only an extra 38 documents are needed on average with our technique to get to 25% recall, and an extra 165 documents on average to get to 50% recall. The difference grows at 85% recall, with an extra 2692 documents needed. This is not a huge difference relative to all 290,000 documents in the collection, but it is a non-zero difference nonetheless. We have some hypotheses related to this gap, but the purpose of this work is not to focus on comparisons to other methods, but on the effects of a variety of manual seeding approaches on our own methods.

However, we do note one interesting aspect of the results in Figures 1 and 2: our technique seems to do better relative to the baseline on the highly relevant documents than it does relative to the baseline on the standard relevant documents. We have attempted to quantify that narrowing gap in the following manner. First, at all levels of recall in 5% increments, we calculate the difference in precision (d_rel) between our technique and each baseline for the relevant documents. Second, we calculate the same difference in precision (d_high) between the techniques for the highly relevant documents. Then we calculate how much smaller d_high is than d_rel at each of these recall points, expressed as a percentage difference, using the following formula:

|d_rel − d_high| / (0.5 · (d_rel + d_high))
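In code, the computation at a single recall level amounts to the following (a small sketch; the four precision values are assumed to be read off the relevant and highly relevant gain curves at that recall level, and the sign convention for the gaps is illustrative):

    def gap_narrowing(p_ours_rel, p_base_rel, p_ours_high, p_base_high):
        # Precision gaps between the baseline and our technique, for regular
        # relevant and for highly relevant documents, at one recall level.
        d_rel = p_base_rel - p_ours_rel
        d_high = p_base_high - p_ours_high
        # Percentage difference between the two gaps, as in the formula above.
        return abs(d_rel - d_high) / (0.5 * (d_rel + d_high))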

The result is shown in Figure 3. A negative value indicates that our highly relevant result is closer to the highly relevant baseline than our regular relevant result is to the relevant baseline. Across most recall points, the highly relevant result is between 100% and 300% closer to the baseline than is the regular relevant result.

However, while this is an interesting pattern in the data, it is difficult to know how to interpret it. Does the gap narrow on the highly relevant results because the highly relevant documents are also the more findable ones, so that there is not a big difference between any reasonable technique? That is, is the overall dynamic range on the highly relevant documents smaller? Or does the gap narrow because there is something specific about our technique that does a better job on highly relevant documents than on regular relevant documents?

Stepping back for a moment, the larger question that we are trying to answer is how one would compare two methods against each other in terms of their ability to find highly relevant documents, when what they are retrieving is relevant documents. The confounding factor is that one method may retrieve more highly relevant documents simply because it retrieves more total relevant documents for the same level of effort. So is that technique doing better than the other because it is better at highly relevant documents, or because it is better at all relevant documents?

Figure 3: % difference in the size of the gap between the highly relevant and regular relevant runs, for our approach versus the BMI title-only baseline (red), and our approach versus the BMI title+description baseline (blue). x-axis = recall; y-axis = gap narrowing % difference.

At one level, this question does not matter, as a technique that retrieves fewer relevant documents requires the reviewer to expend more effort, and avoiding that additional effort is of primary concern. However, it would also be interesting to see, from a purely exploratory standpoint, how two highly relevant gain curves would appear if all non-relevant documents were removed and only relevant and highly relevant documents remained. In that manner, the highly relevant document ordering could be compared more directly. Again, this would be more of an exploratory comparison, but could yield interesting insights.

Another approach to the evaluation of highly relevant document finding would be to run systems in which training is done either purely on the highly relevant documents, or in which the highly relevant documents are given a larger weight than the regular relevant documents when they are encountered during review. This becomes especially pertinent to the main focus of this paper, which is the effect of manual seeding on the process. At some point we would like to be able to see not only whether reviewers are able to more quickly or easily find highly relevant documents, but whether that makes a difference to the underlying machine learning algorithms supporting the review.
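As a rough illustration of the second idea, assuming a scikit-learn style classifier (this is only a sketch of how highly relevant judgments could be up-weighted during training, not something we ran in our experiments):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_with_upweighted_highly_relevant(X, y, highly_relevant_mask, weight=5.0):
        # X: feature matrix; y: 0/1 relevance labels;
        # highly_relevant_mask: boolean array marking highly relevant training docs.
        # The weight value is arbitrary and purely for illustration.
        sample_weight = np.where(highly_relevant_mask, weight, 1.0)
        model = LogisticRegression(max_iter=1000)
        model.fit(X, y, sample_weight=sample_weight)
        return model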

2.2 Call Your Shot

Our main focus in this paper was not on calling our shot, but on exploring the effects of manual seeding. Nevertheless, "call your shot" was part of the track, so we borrowed with attribution the basic intuition of [2] and implemented a quick, naive version thereof. [2] had noted that a reasonable stopping point seems to be when batch richness (total relevant / total reviewed) drops to about 1/10th of the high-water mark. We implemented this by calling the following function periodically throughout the continuous learning review.

    Data: R is a list of documents, in reviewed order
    chunksize ← 250; maxrich ← 0; stop ← False
    if len(R) > 500 then
        for i ← 1 to len(R) − chunksize do
            window ← R[i..(i + chunksize)]
            rich ← numrel(window) / chunksize
            if rich > maxrich then
                maxrich ← rich
            end
        end
        for i ← 1 to len(R) − chunksize do
            window ← R[i..(i + chunksize)]
            rich ← numrel(window) / chunksize
            if (rich / maxrich) < 0.1 then
                stop ← True
            end
        end
    end
    Result: stop

Not allowing the possibility of a positive value for the stopping condition until at least 500 documents have been reviewed is a hack, and is based on the fact that we have noticed, through prior experience on TREC 2015 data, that when only manual seeds are used in the initial iteration, early richness can fluctuate unpredictably on some topics. Whether this is a function of the reviewers choosing seeds in a biased manner, or whether it is a broader reflection of the nature of certain topics, we are not sure. But because of our prior experience that the fluctuation could exist, we made a blanket decision to never stop before the 500th reviewed document. This of course affected our shot-calling performance on some of the topics for which there were only a few hundred, sometimes a few dozen, relevant documents. We also note that the size of the window (chunk size) over which we calculated richness was arbitrary and unoptimized; we did not investigate other settings, either prior to the run or since.

Finally, we note that the code to calculate the stopping condition was a rushed last-minute endeavor, put together in about half an hour. As such, it contains (at least) one glaring logical hole: the lowest-to-highest richness ratio test ignores window order. What should have been done is to emphasize only the more recent windows, i.e. the ratio to be tested should have been the most recent richness window against the highest richness window.
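As a rough illustration, the corrected rule just described could look something like the following minimal Python sketch (this is not the code used in the official run; the representation of the review log as a list of booleans and the parameter names are assumptions for illustration only):

    def should_stop(reviewed, chunksize=250, min_reviewed=500, drop_ratio=0.1):
        # reviewed: list of booleans in review order, True if that document was
        # judged relevant (a hypothetical representation, for illustration).
        if len(reviewed) < min_reviewed:
            return False  # blanket rule: never call the shot before min_reviewed docs
        # Richness of every sliding window of size chunksize, in review order.
        richness = [sum(reviewed[i:i + chunksize]) / chunksize
                    for i in range(len(reviewed) - chunksize + 1)]
        maxrich = max(richness)   # high-water mark over the review so far
        recent = richness[-1]     # the most recent window only, as suggested above
        return maxrich > 0 and (recent / maxrich) < drop_ratio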

Normally, that should not be a problem, because overall richness is generally monotonically decreasing. However, in at least one case the opposite was true: Topic 403. The initial review started off moderately rich, then flattened for a longer period of time before rising sharply again. As a result, about halfway up that sharp ascent, the lowest richness window (centered around document 400) became 1/10th as rich as the highest richness window (centered around document 900). Since the algorithm ignored window order, it then called the stopping point in the middle of that sharp rise, while documents were still being found at a very high rate (see Figure 8, Topic 403).

Figure 4: "Call Your Shot": per-topic histogram of F1 score % change of our approach relative to the BMI baselines. Relevant documents (left), highly relevant documents (right). x-axis = topics (sorted by % difference); y-axis = % difference over baseline.

Figure 5: "Call Your Shot": per-topic histogram of recall score % change of our approach relative to the BMI baselines. Relevant documents (left), highly relevant documents (right). x-axis = topics (sorted by % difference); y-axis = % difference over baseline.

That boundary condition aside, Figure 4 shows that for most of the topics, when evaluating using F1, our naive approach beat the baselines for both relevant and highly relevant documents. However, Figure 5 tells a different story: when it comes to pure recall (with no consideration of precision), the point at which our algorithm calls the shot is almost invariably at a lower recall point than the baseline. This of course raises the issue of what "reasonable" means. The task is a total recall task, so one would imagine that the higher recall point is the better evaluation metric. However, anyone can get higher recall simply by continuing to review, i.e. by calling a later stopping point. In the extreme, everyone, even those randomly reviewing documents, can always get 100% recall by reviewing the entire collection. This is not the intent of the task, because there is also a requirement to avoid wasted effort, i.e. to keep precision high.

F1 is a traditionally common way to balance both precision and recall, and it was the one chosen by the Total Recall Track as an official metric, which is why we show it above. But it places equal weight on both recall and precision. So might it not be more reasonable, in a total recall task, to use F2 or even F3 instead? Or why not F2.647? Is that not a more reasonable metric than F2.883? Why or why not? The question still remains: what is reasonable, and how do we measure it?
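For reference, the general F-beta measure alluded to here weights recall beta times as heavily as precision; a quick sketch:

    def f_beta(precision, recall, beta=1.0):
        # Weighted harmonic mean of precision and recall; beta=1 gives F1,
        # beta=2 gives F2, and so on.
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)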

Part of the difficulty is that the measurement of the quality of a stopping point algorithm is inextricably linked with the quality of the review ordering itself. Different review orderings are going to limit (or expand) the highest achievable stopping metric score of even the best stopping point algorithm. Perhaps future work from the community could separate the review order quality issue from the stopping point issue by creating stopping point test collections that standardize on fixed orderings.

3. MANUAL SEED EXPERIMENTS

The main subject of investigation for our TREC 2016 Total Recall run was what effect various manual initial selections (seedings) of documents have on overall task outcome. Section 1 explored the union of all four sets of seeds (two one-shot, two iterative), and in this section we explore each of the starting points individually.

3.1 Manual Effort Statistics

Figure 6 contains statistics for the 34 athome4 topics. The data is presented in four parts: Query Count, Time Spent, Docs Reviewed, and Review Overlap.


Topic      | Query Count (R1 R2 R3 R4) | Time Spent, minutes (R1 R2 R3 R4) | Docs Reviewed (R1 R2 R3 R4) | Review Overlap (Unique Sum Ratio) | Topic Size
athome401d | - - 6 7 | - - 11 31 | 25 25 31 25 | 67 106 0.63 | 229
athome402d | 3 - 5 - | 16 - 7 - | 23 24 33 25 | 85 105 0.81 | 638
athome403d | 3 4 - - | 16 11 - - | 25 13 25 25 | 56 88 0.64 | 1090
athome404d | - 2 3 - | - 5 4 - | 25 20 25 25 | 54 95 0.57 | 545
athome405d | - - 5 5 | - - 7 26 | 25 25 29 25 | 62 104 0.60 | 122
athome406d | 3 5 - - | 16 6 - - | 25 25 25 25 | 69 100 0.69 | 127
athome407d | 4 3 - - | 16 7 - - | 25 16 25 25 | 42 91 0.46 | 1586
athome408d | - 5 - 7 | - 11 - 32 | 19 25 25 25 | 60 94 0.64 | 116
athome409d | - - 8 3 | - - 12 22 | 24 25 22 25 | 67 96 0.70 | 202
athome410d | 4 - 3 - | 15 - 6 - | 25 25 35 25 | 91 110 0.83 | 1346
athome411d | 4 2 - - | 16 4 - - | 24 22 22 25 | 41 93 0.44 | 89
athome412d | - 6 - 5 | - 12 - 34 | 22 25 25 25 | 81 97 0.84 | 1410
athome413d | - - 3 6 | - - 5 36 | 25 25 25 25 | 77 100 0.77 | 546
athome414d | 1 - 3 - | 16 - 5 - | 25 25 27 25 | 59 102 0.58 | 839
athome415d | 3 3 - - | 15 12 - - | 22 16 25 25 | 64 88 0.73 | 12106
athome416d | - 4 - 6 | - 9 - 36 | 25 35 25 25 | 76 110 0.69 | 1446
athome417d | - - 3 7 | - - 7 30 | 25 25 34 25 | 108 109 0.99 | 5931
athome418d | 5 - 4 - | 16 - 8 - | 25 24 25 25 | 63 99 0.64 | 187
athome419d | 2 3 - - | 16 9 - - | 26 15 25 25 | 50 91 0.55 | 1989
athome420d | - 3 - 4 | - 9 - 25 | 25 33 25 25 | 66 108 0.61 | 737
athome421d | - - 3 9 | - - 5 36 | 25 25 26 25 | 50 101 0.50 | 21
athome422d | - - 3 - | - - 5 - | 25 25 32 25 | 75 107 0.70 | 31
athome423d | 2 3 - - | 16 8 - - | 26 25 25 25 | 32 101 0.32 | 286
athome424d | 2 2 - 4 | 16 7 - 28 | 25 20 25 25 | 60 95 0.63 | 497
athome425d | - - 3 7 | - - 5 36 | 25 25 29 25 | 65 104 0.63 | 714
athome426d | 3 - 5 - | 15 - 6 - | 25 25 32 25 | 54 107 0.50 | 120
athome427d | 2 4 - - | 16 9 - - | 25 19 25 25 | 53 94 0.56 | 241
athome428d | - 1 - 6 | - 7 - 32 | 25 24 25 25 | 72 99 0.73 | 464
athome429d | - - 6 1 | - - 5 18 | 25 25 30 25 | 82 105 0.78 | 827
athome430d | 3 - 4 - | 16 - 5 - | 25 24 28 25 | 88 102 0.86 | 991
athome431d | 3 5 - - | 16 10 - - | 25 26 25 25 | 57 101 0.56 | 144
athome432d | - 2 - 4 | - 5 - 27 | 25 23 25 25 | 51 98 0.52 | 140
athome433d | - - 3 4 | - - 4 23 | 25 25 27 25 | 64 102 0.63 | 112
athome434d | 4 - 7 - | 16 - 11 - | 25 25 29 25 | 45 104 0.43 | 38
average    | 3.0 3.4 4.3 5.3 | 15.8 8.3 6.6 29.5 | 23.6 24.6 26.9 25 | 64.3 100.2 0.64 | 1056

Figure 6: Manual Effort Statistics

In the Query Count section, the number of queries that each reviewer (R1 through R4) issued for each topic is shown. For ease of reading, if the reviewer did a one-shot (single) query, that is indicated with a dash "-". If the reviewer did an interactive run, the actual number of queries is shown. The averages shown are averages of just the interactive runs; one-shot averages are 1.0, of course.

In the Time Spent section, the pattern is similar: a dash indicates a one-shot query, which likely took a minute or two, but we did not record the exact amount of time each reviewer spent pondering his or her one query before issuing it. Actual numbers indicate time spent in an interactive run. The averages shown are averages of just the interactive runs.

In the third section (Docs Reviewed), the exact number of documents that the reviewer laid eyeballs on, both relevant and non-relevant, is indicated. The review software did not have explicit controls to stop a reviewer from going beyond 25 documents; this was left up to the individual reviewer. So, as per normal human variance, the count is sometimes a few documents over and sometimes a few documents under, but generally around the 25-document mark. The averages shown for Docs Reviewed are averages of all runs, both one-shot and iterative.

The final section is Review Overlap. Since the reviewers worked with no knowledge of each other, it was often the case that (even when issuing different queries) they reviewed some of the same documents. Therefore, we show statistics on not only the total number of documents reviewed, but also the total number of unique documents reviewed. The ratio of unique to total is also shown, as is the size of the topic (the total number of available relevant documents for that topic). The averages shown are averages of all runs, both one-shot and iterative.

When examining these values, we noticed an interesting, though possibly spurious, relationship between the uniqueness ratio and topic size (the number of relevant documents available for that topic). Figure 7 is a plot of the relationship between the ratio of unique to total documents reviewed (x-axis) and the (log of the) size of the topic (y-axis). A line is fitted to this data, and it seems to suggest that the more total available relevant documents there are to be found in the collection, the less overlap there was between what the four reviewers found.


Figure 7: Log Topic Size (y-axis) against Ratio of Multi-Reviewer Unique to Total Seed Count (x-axis)
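The fitted line in Figure 7 is a simple least-squares fit; a minimal sketch of how such a line could be computed (assuming numpy, and using only the first few rows of Figure 6 for illustration; the actual fit uses all 34 topics):

    import numpy as np

    # Unique/total seed ratio (x) and topic size (y) for Topics 401-405, from Figure 6.
    ratio = np.array([0.63, 0.81, 0.64, 0.57, 0.60])
    topic_size = np.array([229, 638, 1090, 545, 122])

    # Least-squares line relating log topic size to reviewer overlap ratio.
    slope, intercept = np.polyfit(ratio, np.log(topic_size), 1)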

3.2 Individual Topic Runs

3.2.1 Individual Gain Curves

Gain curves for the 34 topics are shown in Figures 8 through 14 (left). The reviewer ID is indicated with a color: red, blue, green, and brown for Reviewers 1 through 4, respectively. If the reviewer ran that topic as a one-shot seeding process, the gain curve is a solid line. If the reviewer ran it iteratively, the line is dotted.

We acknowledge that there is an overabundance of graphs in this paper. The reason we chose to show them all in their entirety, rather than just show some summary statistic such as precision at 90% recall or average area under the gain curve, is that this gives us a chance to observe the fine differences between topics and reviewers. These differences can be nuanced, which is both the strength and the weakness of presenting research results in this manner. It is a weakness because it makes the results more difficult to summarize. It is a strength because it lets one see where and how distinctions can arise. Furthermore, this is a TREC paper rather than a peer-reviewed conference or journal paper, and we the authors feel it is more valuable, and in keeping with the spirit of openness around TREC, to show as much detail about our runs as possible. Displaying every graph allows us to do that.

That said, one general observation is that, no matter who the reviewer doing the initial seeding was, or what method they employed, high recall is achieved by almost every reviewer on almost every topic without having to review the vast majority of the collection. There remains a belief among many industry practitioners engaged in high recall tasks that only experts have the capability to initiate a recall-oriented search by selecting the initial training documents. In our experiments, only one of our reviewers qualified as a practicing search expert: Reviewer 1 (red). Nevertheless, 132 of the 136 starting points in Figures 8 through 14 hit high recall within a relatively similar amount of effort. For example, on Topic 405, Reviewer 2 (blue) gets to 90% recall a little faster, and Reviewer 1 (red) gets there a little later, with the other two reviewers somewhere in the middle. But the difference between the "best" and "worst" is literally 68 documents. Out of a collection of 290,000 documents, that is an insignificant difference. Other topics, such as Topic 411, show a bit of back and forth between the various starting points as the review progresses. But all achieve high recall at about the same point.

The four starting points where there is a significant difference between best and worst were Reviewer 4 (brown) on Topic 415, Reviewers 2 and 3 (blue and green) on Topic 418, and Reviewer 3 on Topic 419. Of those, three of the four were one-shot queries. That is, the reviewer did not review any of the collection before issuing his or her single query, did not receive any feedback on that single query, and did not issue more than one query. The final underperforming starting point was done by an iterative reviewer (4 queries in 8 minutes, as per Figure 6), but it is still the exception rather than the rule.

There is one more data point worth noting in the gain curves on the left of each figure. The black curve is based on pooling all the unique seed documents from each of the four reviewers before running a continuous learning review. In the majority of the cases, the pooled seed starting point is at least as good as, if not better than, the best individual starting point. However, in a number of instances the pooled seeds yield a result that is equal to the worst individual case, and in rare instances even worse than the worst individual case. Given the casualness of this TREC paper, we do not have a formal metric or a concrete quantification of "better", "equal", "worse", etc. Rather, we did a casual eyeballing of the curves and looked at where the combined seeding came out generally. Often, as with the individual seedings themselves, by the time the process hit high (85%-95%) recall, all the methods had converged anyway. So in this particular exploratory analysis, we are often looking at differences at lower levels of recall. Nevertheless, given that we are displaying all the curves, the reader can see and judge for themselves whether or not significant differences exist.

The following table is a rough count of the number of times that the combined seeding approach was better than the best individual, approximately equal to the best individual, somewhere in the middle of all individuals, equal to the worst individual, or worse than the worst individual seeding run. For the most part, the pooled seeds tended to be equal to or better than the best individual. However, it would be worth exploring those cases in which the pooled seeds do worse, and trying to determine why. That analysis is beyond the scope of this paper. One more thing to keep in mind: even when the pooled approach is better than the best, or worse than the worst, the absolute magnitude of the differences, especially relative to the size of the collection, is small.

Combined Seeding                  | Count
Better than the Best Individual   |   5
Equal to the Best Individual      |  16
In the Middle of the Individuals  |   8
Equal to the Worst Individual     |   3
Lower than the Worst Individual   |   2

3.2.2 Seeding Method-Averaged Gain Curves

Lastly, for each topic in Figures 8 through 14, we show the average of the two one-shot seeding approaches, as well as the average of the two iterative seedings, in the charts on the right side. Perhaps the better approach would have been to pool the iterative and the one-shot seeds, respectively, before running CAL on the combined seed pools. For now, however, we show the average of the gain curves of the individually seeded runs. There is variability among individual reviewers, so by averaging multiple runs and randomizing reviewers across topics, we hope to get a better sense of the general approach, separate from the individual reviewer vagaries. Ideally we would have more than two reviewers doing each method, but resources are always limited.

Basically, we can see that for the most part there is not much difference between the two approaches on most topics. The iterative approach may have a slight edge on Topics 415, 418, 419, 421, and 422. However, the one-shot approach has a slight edge on Topics 402, 411, 416, 428, and 433. On the remainder of the topics, where there are differences, those differences are slight.

4. GROUND TRUTH ANALYSIS

There is one final analysis of the data that we would like to present. It is perhaps a bit non-standard, and we do not offer any hard conclusions. But it was an analysis that we found interesting, so we would like to show it to provoke future thought.

4.1 Explanation of the Analysis

Our understanding of how the ground truth was created for this track is that the NIST assessors used a combination of methods: their own searches plus an algorithm that used the same underlying feature extraction mechanism as the baseline model implementation (BMI). The collection was not fully judged for each topic, but was judged to a depth proportionally deeper than the number of relevant documents that were found. So, for example, for Topic 422, NIST assessors judged 31 relevant documents and 317 non-relevant documents; no other documents were assessed. For Topic 423, 286 relevant documents and 1113 non-relevant documents were judged; no other documents were assessed.

As per common TREC practice, documents that are not assessed (non-judged) are presumed to be non-relevant, and are treated as such for both training and evaluation purposes. In this last section, we wish to separate out, for the purpose of deeper analysis, judged non-relevant documents from non-judged documents. To this end we present Figures 15 through 21. For each topic, the figure on the left side shows recall on the x-axis and the raw number of judged non-relevant documents on the y-axis. The blue line represents the number of judged non-relevant documents to that depth in the review unique to our pooled seeds method (see previous section), the red line shows the same information unique to the baseline (BMI) title+description method, and the dotted grey line is the number of judged non-relevant documents that both methods have in common. We only plot to 90% recall for all topics, as high non-judged counts can skew the visualization after that point.
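A minimal sketch of the bookkeeping behind these figures, assuming a simplified set-of-document-ids representation (not our actual analysis code):

    def split_nonrelevant(ours_nonrel, base_nonrel, judged_nonrel):
        # ours_nonrel / base_nonrel: sets of non-relevant documents each method has
        # reviewed by a given recall level; judged_nonrel: documents NIST assessed
        # as non-relevant. Non-relevant documents outside judged_nonrel are "non-judged".
        def counts(docs):
            judged = len(docs & judged_nonrel)
            return {"judged": judged, "non_judged": len(docs) - judged}
        return {
            "unique_to_ours": counts(ours_nonrel - base_nonrel),
            "unique_to_baseline": counts(base_nonrel - ours_nonrel),
            "common": counts(ours_nonrel & base_nonrel),
        }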

So, for example, see Topic 421 in Figure 19. Let us start with the figure on the left, which shows judged non-relevant documents on the y-axis. By the time the baseline method (red) has hit 10% recall, it has seen 3 judged non-relevant documents, while our method (blue) has hit 0 judged non-relevant documents, and none of those documents are the same documents. At 75% recall, the baseline method has seen 28 judged non-relevant documents, while our method has seen 16 judged non-relevant documents. However, 13 of those documents are in common between the two methods. So at 75% recall, the baseline method has seen 15 judged non-relevant documents that our method has not (unique to BMI), and our method has seen about 3 judged non-relevant documents that the baseline method has not (unique to our method). In comparison, see the figure on the right, which shows non-judged documents on the y-axis. At 75% recall, there are only 5 documents that the baseline method has seen that have not been judged, but 27 documents that our method has seen that have not been judged. None of these documents are in common, as the dotted grey line only starts to rise after about 82% recall. What this means is that, at 75% recall, the baseline method has hit only 15 judged + 5 non-judged = 20 non-relevant documents, but our method has hit 3 judged + 27 non-judged = 30 non-relevant documents.

From an overall evaluation standpoint, this means that (at 75% recall) the baseline method is better than our method, because it has hit fewer non-relevant documents. But when the majority of the documents in that comparison were judged for one method and not judged for the other, it raises questions about how things might be different if some of the non-judged documents had been judged. Would there be more relevant documents among those non-judged documents? Would there be more relevant documents in the non-judged documents unique to the baseline method, or unique to our method? And how would that affect overall recall and stopping points, not to mention training, especially when a half dozen newly found relevant documents could have a significant effect in such low prevalence topics?

We did some casual, non-comprehensive spot checks on some of the topics by looking at the top 20 highest ranked documents that were unique to each method (i.e. 40 documents per topic in total). And we did find a fairly significant number of (what we thought were) relevant documents for some topics, and almost none for other topics, at least within those first 20 documents.

However, we are not going to go into detail about how many additional relevant documents we believe there were, for four reasons: (1) we did not do a full assessment of every topic, so any information we present would be misleading; (2) we are not the same assessors as the NIST assessors, and unless we were to go back and also review all the same documents that the NIST assessors reviewed, any assessment would be skewed or biased by the disjointedness of the assessment; (3) even if we did do a full reassessment of all judged and top-ranked non-judged documents, it would be unfair to the baseline method, because that method would not have had a chance to train on any newly judged relevant documents; and finally (4) it is not within the spirit of TREC to publish research whose only goal is to "beat" other systems. Rather, the purpose of TREC is to dive deep into interesting questions, to challenge assumptions, to learn by trying crazy, unproven methods, to basically poke and prod a problem and see what happens. This research hopefully accomplishes that by comparing multiple reviewers doing multiple approaches to seeding (one-shot query, iterative querying). Understanding what ground truth data is being used and how that might affect things is a side goal, but not the primary one.

4.2 Discussion

Nevertheless, to understand the full context of this work, we felt it necessary to break out the analysis of the results into these two components: judged non-relevant and non-judged. Without even knowing whether the non-judged documents were truly relevant or non-relevant, some interesting patterns emerge. The first pattern is exemplified by the topic that we already discussed above, Topic 421. In this pattern, our method has higher precision (a lower number of non-relevant documents) on the judged set (left graph), but lower precision (a higher number of non-relevant documents) on the non-judged set. This pattern is seen in 10 other (11 total) instances: Topics 405, 410, 415, 422, 423, 425, 431, 432, 433, and 434. The next pattern is topics for which both methods find judged non-relevant documents at about the same rate, but our method hits a lot more non-judged documents. This pattern is found in 7 instances: Topics 401, 402, 406, 412, 418, 426, and 429. The remaining 16 topics are ones for which our method hits more non-relevant documents, both judged and non-judged, than does the baseline method.

How does one interpret this? Why is it that the two methods are finding, at times, vastly different non-relevant (judged or non-judged) documents, while finding the same number of relevant documents? Or more specifically: for a large number of topics, why does our method find more non-judged documents, even as it is finding fewer judged non-relevant documents, at the same level of recall? By itself, finding more non-judged documents is not difficult: one can simply select documents at random. But this is not a random selection of documents, because our method is finding relevant documents at a reasonably fast clip, while sometimes simultaneously finding fewer judged non-relevant documents at the same level of effort.

It is also interesting to note that for a fairly large number of topics, the baseline method finds almost no non-judged documents at all. Almost all non-relevant documents that it finds through the course of the review are ones that have already been judged. For example, see Topics 401, 402, 403, 404, 406, 407, 408, 409, 416, 417, 418, 419, 427, 428, 433, and 434, which are almost half the topics in the track. We are not certain why the unique documents found by one method are almost thoroughly judged while the unique documents found by the other are not. There may be something interesting in the way in which our method works that is more naturally diverse (even without explicit diversification activated, as explained in Section 1.2) relative to the baseline. It may mean that there was an aspect or facet of relevance that was found by our method but was not found during the ground truth assessment. Different does not necessarily mean better, however, as the non-judged documents are not necessarily going to be relevant were they to be judged. We cannot answer this question now; we simply wish to show that there do seem to be consistent patterns of judged versus non-judged documents in the non-relevant set. This is a good opening into future work on Total Recall.

5. CONCLUSION

In conclusion, this paper examined the effect of multiple manual approaches to seeding, using four different reviewers each applying one of two different manual seeding strategies (one-shot vs. iterative). For over 97% (132 of the 136) of the seedings, by the time high recall was hit, there was relatively little difference between the starting points, no matter the method. When averaged across strategy (one-shot vs. iterative), five topics slightly favored the iterative approach, five slightly favored the one-shot approach, and the remainder of the topics came out about the same.

Perhaps one of the challenges is that the iteration was brief; reviewers were only allowed to work until they had marked up to 25 documents. With more time or more queries, perhaps a larger difference would have been observable. On the other hand, many of these topics were relatively straightforward, and perhaps no improvement via additional manual effort was possible. Nevertheless, we note that all topics achieved high recall without having to review the vast majority of the collection, no matter whether an expert or a non-expert was used to manually seed each topic.

We also noted a possible relationship between reviewer overlap and topic size. Where different reviewers manually find many of the same documents, the topic may have a smaller number of relevant documents, and vice versa when different reviewers manually find many different documents. Whether such an observation could be formalized enough to be broadly predictive remains an open question.

Additionally, we did some analysis of the ground truth itself, and examined the relative difference between our method and the baseline method in terms of how many judged non-relevant versus non-judged documents each found over the course of each topic's review. The visualization of these differences is interesting, but anything conclusive at this point would be pure speculation.

6. REFERENCES

[1] G. V. Cormack and M. R. Grossman. Evaluation of machine learning protocols for technology-assisted review in electronic discovery. In Proceedings of the ACM SIGIR Conference, Gold Coast, Australia, 6-11 July 2014.

[2] G. V. Cormack and M. R. Grossman. Waterloo (Cormack) participation in the TREC 2015 Total Recall Track. In Proceedings of the Text REtrieval Conference (TREC), Gaithersburg, MD, 2015.

Figure 8: Gain Curves (Topics 401-405). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 9: Gain Curves (Topics 406-410). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 10: Gain Curves (Topics 411-415). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 11: Gain Curves (Topics 416-420). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 12: Gain Curves (Topics 421-425). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 13: Gain Curves (Topics 426-430). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 14: Gain Curves (Topics 431-434). x-axis = reviewed documents (in order), as a percentage of the entire collection; y-axis = recall. [Left] Red = Reviewer 1, blue = Reviewer 2, green = Reviewer 3, brown = Reviewer 4, black = pooled seeds from all reviewers. For individual reviewers, a solid line indicates a one-shot query; a dashed line indicates iterative searching. [Right] The solid line is the average of the one-shot reviewers; the dashed line is the average of the iterative reviewers.

Figure 15: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 401-405). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.

Figure 16: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 406-410). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.

Figure 17: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 411-415). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.

Figure 18: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 416-420). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.

Figure 19: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 421-425). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.

Figure 20: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 426-430). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.

Figure 21: Analysis of Judged and Non-Judged Non-Relevant Documents (Topics 431-434). On both graphs, the x-axis is recall level; the y-axis is judged non-relevant documents (left) and non-judged documents (right). Red = documents unique to the baseline method, blue = documents unique to our method, grey = documents common to both methods.