
Semantic Label Sharing for Learning with 75,000 Categories

Anonymous CVPR submission

Paper ID 1092

Abstract

In an object recognition scenario with tens of thousands of categories, even a small number of labels per category leads to a very large total number of labels required. We propose a simple method of label sharing between semantically similar categories. We leverage the WordNet hierarchy to define a semantic distance between any two categories and use this distance to share labels. Our approach can be used with any classifier. Experimental results on a dataset of 80 million images and 75,000 categories show that, despite the simplicity of the approach, it leads to significant improvements in performance.

1. Introduction

Large image collections on the Internet and elsewhere contain a multitude of scenes and objects. Recent work in computer vision has explored the problems of visual search and recognition in this challenging environment. However, all approaches require some amount of hand-labeled training data in order to build effective models. Working with large numbers of images creates two challenges: first, labeling a representative set of images and, second, developing efficient algorithms that scale to very large databases.

Labeling Internet imagery is challenging in two respects. First, the sheer number of images means that the labels will only ever cover a small fraction of images. Recent collaborative labeling efforts such as Peekaboom, LabelMe, and ImageNet [19, 25, 7] have gathered millions of labels at the image and object level. However, this is but a tiny fraction of the estimated 10 billion images on Facebook, let alone the hundreds of petabytes of video on YouTube. Second, the diversity of the data means that many thousands of classes will be needed to give an accurate description of the visual content. Current recognition datasets use tens to hundreds of classes, which give a hopelessly coarse quantization of images into discrete categories. The richness of our visual world is reflected by the enormous number of nouns present in our language: English has around 70,000 that correspond to actual objects [8, 11]. This figure loosely agrees with the 30,000 visual concepts estimated by psychologists [4].

[Figure 1 panels: “Turboprop” search engine result and re-ranked images; “Pony” search engine result and re-ranked images.]

Figure 1. Two examples of images from the Tiny Images database [23] being re-ranked by our approach, according to the probability of belonging to the categories “turboprop” and “pony” respectively. No training labels were available for either class. However, 64,185 images from the total of 80 million were labeled, spread over 386 classes, some of which are semantically close to the two categories. Using these labels in our semantic label sharing scheme, we can dramatically improve search quality.

Furthermore, having a huge number of classes dilutes the available labels, meaning that, on average, there will be relatively few annotated examples per class (and many classes might not have any annotated data).

To illustrate the challenge of obtaining high quality labels in the scenario of many categories, consider the CIFAR-10 dataset constructed by Alex Krizhevsky and Geoff Hinton [14]. This dataset provides human labels for a subset of the Tiny Images [23] dataset, which was obtained by querying Internet search engines with over 70,000 search terms. To construct the labels, Krizhevsky and Hinton chose


10 classes: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship”, and “truck”, and for each class they used the WordNet hierarchy to construct a set of hyponyms. The labelers were asked to examine all the images found with a search term that is a hyponym of the class; as an example, some of the hyponyms of ship are “cargo ship”, “ocean liner”, and “frigate”. The labelers were instructed to reject images which did not belong to their assigned class. Using this procedure, labels for a total of 386 categories (hyponyms of the 10 classes listed above) were collected, at a cost of thousands of dollars.

Despite the high cost of obtaining these labels, the 386 categories are of course a tiny subset of the possible labels in the English language. Consider, for example, the words “pony” and “turboprop” (Fig. 1). Neither of these is considered a hyponym of the 10 classes mentioned above. Yet there is obvious information in the labeled data for “horse” and “airplane” that we would like to use to improve the search engine results for “pony” and “turboprop”.

In this paper, we provide a very simple method for sharing labels between categories. Our approach is based on a basic assumption: we expect the classifier output for a single category to degrade gracefully with semantic distance. In other words, although horses are not exactly ponies, we expect a classifier for “pony” to give higher values for images of horses than for images of airplanes. Our scheme, which we call “Semantic Label Sharing”, gives the performance shown in Fig. 1. Even though we have no labels for “pony” and “turboprop” specifically, we can significantly improve the performance of search engines by using label sharing.

1.1. Related Work

Various recognition approaches have been applied to Internet data, with the aim of re-ranking or refining the output of image search engines. These include Li et al. [16], Fergus et al. [12], Berg et al. [3], and Vijayanarasimhan et al. [26], amongst others. Our approach differs in two respects: (i) these approaches treat each class independently; (ii) they are not designed to scale to the billions of images on the web.

Sharing information across classes is a widely explored concept in vision and learning, and it takes many different forms. Some of the first approaches applied to object recognition are based on neural networks, in which sharing is achieved via the hidden layers that are common across all tasks [5, 15]. Error-correcting output codes [9] also offer a way of combining multi-class classifiers to obtain better performance. Another set of approaches tries to transfer information from one class to another by regularizing the parameters of the classifiers across classes. Torralba et al. [24] and Opelt et al. [18] demonstrated the power of sharing useful features between classes within a boosting framework. Other approaches transfer information across object categories by sharing a common set of parts [10, 21], by sharing transformations across different instances [22, 2, 17], or by sharing a set of prototypes [1]. Common to all these approaches is that the experiments are always performed with relatively few classes. Furthermore, it is not clear how these techniques would scale to very large databases with thousands of classes.

Our sharing takes a different form to these approaches, in that we impose sharing on the class labels themselves, rather than on the features or parameters of the model. As such, our approach has the advantage that it is independent of the choice of classifier.

2. Semantic Label Sharing

In order to compute the semantic distance between two classes, we use a tree defined by WordNet (see Fig. 3(c)); WordNet is graph-structured, and we convert it into a tree by taking the most common sense of each word. The semantic similarity S_ij between classes i and j (which are nodes in the tree) is defined as the number of nodes shared by their two parent branches, divided by the length of the longer of the two branches. For instance, the semantic similarity between “felis domesticus” and “tabby cat” is 0.93, while the similarity between “felis domesticus” and “tractor trailer” is 0.21. We construct a sparse semantic affinity matrix A = exp(−κ(1 − S)), with κ = 10 for all the experiments in this paper. For the class “airbus”, the nearest semantic classes are: “airliner” (0.49), “monoplane” (0.24), “dive bomber” (0.24), “twinjet” (0.24), “jumbo jet” (0.24), and “boat” (0.03). A visualization of A and a closeup are shown in Fig. 3(a) and (b).
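To make the definition concrete, the computation of S and A can be sketched in a few lines, assuming each class is supplied as its root-to-class path in the WordNet tree; the function names and interface here are illustrative, not from the paper:

import numpy as np

def similarity(path_i, path_j):
    # Number of tree nodes shared by the two root-to-class paths,
    # divided by the length of the longer path. A path looks like
    # ['entity', 'physical entity', ..., 'tabby cat'].
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    return shared / max(len(path_i), len(path_j))

def affinity_matrix(paths, kappa=10.0):
    # A = exp(-kappa * (1 - S)), with kappa = 10 as in the paper.
    S = np.array([[similarity(p, q) for q in paths] for p in paths])
    return np.exp(-kappa * (1.0 - S))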

Let us assume we have a total of C classes, hence A will be a C × C symmetric matrix. We are given L labeled examples in total, distributed over these C classes. The labels for class c are represented by a binary vector y_c of length N which has value 1 for positive hand-labeled examples and 0 otherwise; hence positive examples for class c are regarded as negative labels for all other classes. Y = {y_1, . . . , y_C} is an N × C matrix holding the label vectors from all classes.

We share labels between classes by replacing Y with YA. This simple operation has a number of effects:

• Positive examples are copied between classes, weighted according to their semantic affinity. For example, the label vector for “felis domesticus” previously had zero values for the images of “tabby cat”, but now these elements are replaced by the value 0.93.

• However, labels from unrelated classes will only deviate slightly from their original state of 0 (dependent on the value of κ).



• Negative labeled examples from classes outside the set of C are unaffected by A (since they are 0 across all rows of Y).

• Even if each class has only a few labeled examples, the multiplication by A will effectively pool examples across semantically similar classes, dramatically increasing the number that can be used for training, provided semantically similar classes are present amongst the set of C.

The effect of this operation is illustrated in two examples on toy data, shown in Fig. 2. These examples show that good classifiers can be trained by sharing labels between classes, given knowledge of the inter-class affinities, even when no labels are given for the target class. In Fig. 2 there are 9 classes, but label data is only given for 7 of them. In addition to the labels, the system also has access to the affinities among the 9 classes. This information is enough to build classification functions for the classes with no labels (Fig. 2(d) and (f)).

From another perspective, our sharing mechanism turns the original classification problem into a regression problem: the formerly binary labels in Y become real values in YA. As such, we can adapt many types of classifiers to minimize regression error rather than classification error.
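As an illustration, the sharing operation itself is a single matrix product; below is a minimal numpy sketch with toy dimensions and values (ours, not the paper's):

import numpy as np

# Toy similarity between 3 classes (e.g. two cat breeds and a truck).
S = np.array([[1.00, 0.93, 0.21],
              [0.93, 1.00, 0.21],
              [0.21, 0.21, 1.00]])
A = np.exp(-10.0 * (1.0 - S))        # semantic affinity, kappa = 10

Y = np.zeros((6, 3))                 # N x C binary label matrix
Y[0, 0] = Y[1, 0] = 1                # positives for class 0
Y[2, 1] = Y[3, 1] = 1                # positives for class 1
Y[4, 2] = Y[5, 2] = 1                # positives for class 2

Y_shared = Y @ A                     # the sharing step: Y -> YA
# Images 2 and 3 (positives of class 1) now carry weight
# exp(-10 * 0.07) ~ 0.50 for class 0, while the truck images
# contribute essentially nothing (exp(-10 * 0.79) ~ 4e-4).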

3. Sharing in Semi-Supervised Learning

Semi-supervised learning is an attractive option in settings where very few training examples exist, since the density of the data can be used to regularize the solution. This can help prevent over-fitting to the few training examples and yield superior solutions. A popular class of semi-supervised algorithms is based on the graph Laplacian, and we use an approach of this type.

We briefly describe semi-supervised learning in a graph setting. In addition to the L labeled examples (X_l, Y_l) = {(x_1, y_1), . . . , (x_L, y_L)} introduced above, we have an additional U unlabeled images X_u = {x_{L+1}, . . . , x_N}, for a total of N images. We form a graph where the vertices are the images X and the edges are represented by an N × N matrix W. The edge weighting is given by W_ij = exp(−‖x_i − x_j‖^2 / 2ε^2), the visual affinity between images i and j. Defining D = diag(sum_j W_ij), we define the normalized graph Laplacian to be L = I − D^{−1/2} W D^{−1/2}. We use L to measure the smoothness of solutions over the data points, desiring solutions that agree with the labels but are also smooth with respect to the graph. In the single-class case we want to minimize:

J(f) = f^T L f + sum_{i=1}^{L} λ(f_i − y_i)^2 = f^T L f + (f − y)^T Λ(f − y)    (1)

[Figure 2 panels: (a) data, (b) labeled examples, (c)/(e) weighted training points, (d)/(f) f functions; color scale 0–1.]

Figure 2. Toy data illustrating our sharing mechanism between 9 different classes (a) in discrete clusters. For 7 of the 9 classes, a few examples are labeled (b). No labels exist for classes 3 and 5. (c): Labels re-weighted by affinity to class 3 (red = high affinity, blue = low affinity). (d): The semi-supervised learning solution f_{class=3} using the weighted labels from (c); the value of f_{class=3} on each sample from (a) is color coded, with dark red corresponding to the samples most likely to belong to class 3. (e): Labels re-weighted by affinity to class 5. (f): The semi-supervised learning solution f_{class=5} using the weighted labels from (e).

where Λ is a diagonal matrix whose diagonal elements are Λ_ii = λ if i is a labeled point and Λ_ii = 0 for unlabeled points. The solution is given by solving the N × N linear system (L + Λ)f = Λy. This system is impractical to solve for large N, so it is common [6, 20, 27] to reduce the dimension of the problem by using the smallest k eigenvectors of L (which will be the smoothest) as a basis U with coefficients α: f = Uα. Substituting into Eqn. 1, we find the optimal coefficients α to be the solution of the following k × k system:

(Σ + U^T Λ U)α = U^T Λ y    (2)

where Σ is a diagonal matrix of the smallest k eigenvalues of L. While this system is easy to solve, the difficulty is computing the eigenvectors in the first place, an O(N^2) operation. Fergus et al. [13] introduced an efficient scheme for computing approximate eigenvectors in O(N) time.
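For concreteness, the reduced solve of Eqn. 2 can be sketched as follows, assuming U and the eigenvalues are already available (e.g. from the approximation of [13]); the names are ours:

import numpy as np

def solve_reduced(U, sigma, y, labeled_mask, lam=1.0):
    # Solve (Sigma + U^T Lambda U) alpha = U^T Lambda y, return f = U alpha.
    #   U            : N x k (approximate) eigenvectors of L
    #   sigma        : length-k vector of the corresponding eigenvalues
    #   y            : length-N label vector (nonzero only where labeled)
    #   labeled_mask : length-N boolean array marking labeled points
    Lam = lam * labeled_mask.astype(float)           # diagonal of Lambda
    lhs = np.diag(sigma) + U.T @ (Lam[:, None] * U)  # k x k system
    rhs = U.T @ (Lam * y)
    alpha = np.linalg.solve(lhs, rhs)
    return U @ alpha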


[Figure 3 graphic: (a) semantic affinity matrix (color scale 0–1); (b) closeup over 10 classes: true cat, gelding, motorcar, camion, cargo ship, lippizaner, patrol car, stealth bomber, jetliner, van; (c) WordNet sub-tree from “entity” down to leaf classes such as “tabby cat” and “tractor trailer”.]

Figure 3. WordNet sub-tree for a subset of 386 classes used in our experiments. The associated semantic affinity matrix A is shown in (a), along with a closeup of 10 randomly chosen rows and columns in (b).

This approach proceeds by first computing numerical approximations to the eigenfunctions (the limit of the eigenvectors as N → ∞). Approximations to the eigenvectors are then computed via a series of 1D interpolations into the numerical eigenfunctions. The resulting approximate eigenvectors (and associated eigenvalues) can be used in place of U and Σ in Eqn. 2.

Extending the above formulation to the multi-class scenario is straightforward. In a multi-class problem, the labels are held in an N × C binary matrix Y, replacing y in Eqn. 2. We then solve for the N × C matrix F using the approach of Fergus et al. [13]. Utilizing the semantic sharing from Section 2 is simple, with Y being replaced by YA.
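Combining this with the sharing of Section 2, the whole multi-class solve is a loop over columns; a sketch reusing solve_reduced from above:

import numpy as np

def solve_all_classes(U, sigma, Y, A, labeled_mask, lam=1.0):
    # Replace the binary labels Y with the shared, real-valued labels YA,
    # then solve Eqn. 2 once per class (one regression per column of YA).
    Y_shared = Y @ A
    F = np.column_stack([solve_reduced(U, sigma, Y_shared[:, c], labeled_mask, lam)
                         for c in range(Y_shared.shape[1])])
    return F                         # N x C matrix of class scores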

4. Experiments

Our experiments use the Tiny Images database [23]. This data is diverse and highly variable, having been collected directly from Internet search engines. The first set of experiments is performed on 63,000 images from this dataset, as covered by the CIFAR-10 label set [14]. This set of labels allows us to accurately measure the performance of our algorithm, while using data typical of the large-scale Internet settings for which our algorithm is designed. The second group of experiments uses the full Tiny Images database, consisting of 79,302,017 images spread over 74,569 categories.

4.1. Experiments with Hand-Labeled Data

The CIFAR-10 labels comprise human image-level annotations (+ve/−ve) for 10 distinct classes in the Tiny Images dataset. Each class is made up of a number of keywords (the query terms used to gather the images from the search engine), subordinate to the class as given by the WordNet tree structure. We select the subset of 126 keywords from the CIFAR set which have at least 200 positive labels and 300 negative labels, giving a total of 63,000 images. The gist descriptor for each image is mapped down to a 64D space using PCA. These keywords and their semantic relationships to one another are shown in Fig. 3. For each keyword, we randomly choose a fixed test set of 100 positive and 200 negative examples, reflecting the typical signal-to-noise ratio found in images from Internet search engines. Note that we do not make use of any rank information. The training examples consist of +ve/−ve pairs drawn from the remaining pool of 100 positive/negative images for each keyword.

In Fig. 4 (top) we explore the effects of semantic sharing with the semi-supervised learning approach. The application of the semantic affinity matrix can be seen to help performance, particularly when only a few training examples are present. If the semantic matrix is randomly permuted (but with the diagonal fixed to be 1), then it provides no benefit over the no-sharing case. But if the sharing is inverted (by replacing A with 1 − A and setting the diagonal to 1), it clearly hinders performance. The same pattern of results can be seen in Fig. 4 (bottom) for a nearest-neighbor classifier. Hence the semantic matrix must reflect the relationship between classes if it is to be effective; both control constructions are sketched below.
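A hypothetical reading of the two control constructions above (the paper gives no code, so the details here are our assumptions):

import numpy as np

def random_sharing(A, seed=0):
    # Apply one random permutation to the rows and columns of A: the
    # affinity values survive, but they are assigned to scrambled class
    # pairs relative to the label matrix Y, and the diagonal remains 1.
    p = np.random.default_rng(seed).permutation(A.shape[0])
    return A[np.ix_(p, p)]

def inverse_sharing(A):
    # Invert the sharing so that semantically distant classes share most.
    R = 1.0 - A
    np.fill_diagonal(R, 1.0)
    return R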

In Fig. 7 we show examples of the semi-supervised learning scheme, in conjunction with the WordNet affinity matrix, re-ranking images in the CIFAR set.

In Fig. 5 we perform a more systematic exploration of the effects of sharing using the semi-supervised learning approach. Both the number of classes and the number of images are varied, and the performance is recorded with and without the semantic affinity matrix. In both cases, performance improves as more training data is used. However, the affinity matrix dramatically assists performance for small numbers of training examples.


[Figure 4 graphic: two plots of mean precision at 15% recall (averaged over 126 classes) versus log2 number of training pairs. Top, semi-supervised learning: semantic sharing, no sharing, random sharing, inverse sharing. Bottom, nearest neighbors: WordNet, none, random, inverse.]

Figure 4. Top: performance of different sharing strategies with the semi-supervised learning approach as the number of training examples is increased, using 126 classes in the CIFAR label set. Bottom: as for (top), but with a nearest-neighbor classifier.

The sharing behavior can be used to effectively learn classes for which we have zero training examples. In Fig. 6 we explore what happens when we allocate 0 training images to one particular class (the left-out class) from the set of 126, while using 100 training pairs for each of the remaining 125 classes. When the sharing matrix is not used, the performance of the left-out class drops significantly, relative to its performance when training data is available (i.e., the point for each left-out class falls below the diagonal). But when sharing is used, the drop in performance is relatively small, with points spread around the diagonal.

Motivated by Fig. 6, we show in Fig. 1 the approach operating on two classes for which no CIFAR labels exist, obtaining qualitatively good results. However, this ability is limited to keywords relatively close to the 386 keywords for which we have labels, an issue which will be alleviated with a more diverse set of labels.

[Figure 5 graphic: precision (color scale 0.25–0.75) as a function of log2 # classes and # positive examples/class, for “No Sharing” (left) and “Sharing” (right).]

Figure 5. The variation in precision as the number of training examples is increased, using 126 classes. Note the improvement in performance for small numbers of training examples when the WordNet sharing matrix is used.

[Figure 6 graphic, panels (a) “Using semantic sharing” and (b) “No semantic sharing”: per-class performance using 0 training pairs (all other classes having 100 training pairs) plotted against performance using 100 training pairs.]

Figure 6. An exploration of the performance with 0 training examples for a single class, when all the other classes have 100 training pairs. (a): By using the sharing matrix A, we can obtain good performance by transferring labels from semantically similar classes. (b): Without it, the performance drops significantly.


4.2. Experiments with Noisy Labels

Our second set of experiments applies the semantic sharing scheme to the whole of the Tiny Images dataset. This dataset gives a dense coverage of the visual world, with 79,302,017 images distributed over 74,569 classes. With this number of classes, many will be very similar visually, and our sharing scheme can be expected to greatly assist performance.

As before, each image is represented by a single gist descriptor, here mapped down to a 32D space using PCA. k = 48 eigenfunctions are precomputed over the entire dataset and used in the semi-supervised learning scheme. Instead of using the CIFAR labels (which cover a tiny fraction of the 74,569 classes), or any other source of hand-labeled training images, we assume that the first 5 images in each class are correct, taking them to be positive examples. Since search engines often refine their results using user


[Figure 7 grid: columns “Airbus”, “Japanese Spaniel”, “Ostrich”, “Deer”, “Fire truck”, “Appaloosa”, “Honey Eater”, “Auto”, “Boat”, “Leopard Frog”; rows: initial order, output order.]

Figure 7. Test images from 10 keywords drawn from the 126-keyword subset with manual annotations. The border of each image indicates its label (used for evaluation purposes only) with respect to the keyword: green = +ve, red = −ve. The top row shows the initial ranking of the data, while the bottom row shows the re-ranking of our approach trained on 126 classes with 100 training pairs per class.

click-through, these few images are typically high-quality examples of the class. We also draw another 5 images for each class randomly from the dataset to act as negative examples. Thus over the dataset we have a total of 372,845 (noisy) positive examples, and the same number of negative examples. With just 5 noisy positive examples per class, it is challenging to train a standard classifier. However, by using a 74,569 × 74,569 semantic affinity matrix constructed from WordNet, we can apply the semi-supervised framework to a range of tasks.
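One way to encode these noisy labels in the formulation of Section 3 (our own encoding; negative examples are labeled points whose target value is 0, matching the convention of Section 2):

import numpy as np

def noisy_labels_for_class(first_five, random_five, N):
    # first_five  : indices of the class's first 5 search results (positives)
    # random_five : 5 indices drawn at random from the dataset (negatives)
    y = np.zeros(N)
    y[first_five] = 1.0
    labeled = np.zeros(N, dtype=bool)
    labeled[first_five] = True
    labeled[random_five] = True      # negatives: labeled, with target 0
    return y, labeled                # feed to solve_reduced(U, sigma, y, labeled)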

In the first application, we demonstrate how our approach can improve search quality by re-ranking images for a given keyword. In Fig. 8 we show qualitative results for 4 classes. Our algorithm takes around 0.1 seconds to perform each re-ranking (since the eigenfunctions are precomputed), compared to over 1 minute for the nearest-neighbor classifier. These figures show that the semi-supervised learning scheme with semantic sharing clearly improves search performance over the original ranking, and that without the sharing matrix the performance drops significantly. Fig. 9 gives a visualization of our approach for the keyword “rabbiteye blueberry”.

The second application evaluates classification performance using a test set of 3,538 hand-labeled examples (1,560 positive, 1,978 negative) spread over 3,388 classes chosen randomly from the dataset. For each class we trained the semi-supervised learning classifier using the noisy 5 images per class and the WordNet semantic sharing matrix, and scored the labeled images belonging to that class. We then sorted all labeled images from all classes according to their score and plotted the recall–precision curves shown in Fig. 10. Chance-level performance corresponds to 1,560/3,538 = 44%. The use of the WordNet sharing matrix clearly improves the performance.

5. Summary and Future Work

We have introduced a very simple mechanism for sharing training labels between classes. Our experiments demonstrate that it gives significant benefits in situations where there are many classes but only a few examples per class, a common occurrence in large image collections. We have shown how semantic sharing can be combined with simple classifiers to operate on a dataset with 75,000 classes and 80 million images. While the semantic sharing matrix derived from WordNet has proven effective, a goal of future work is to learn it directly from the data.



[Figure 8 grid: rows “Pony”, “Rabbiteye Blueberry”, “Napoleon”, “Pond Lily”; columns: raw images, SSL no sharing, SSL WordNet sharing, NN WordNet sharing.]

Figure 8. Sample results of our semantic label sharing scheme on the 80 million image dataset. No hand-labeled training examples were used; instead, the first 5 images of each of the 74,569 classes were taken as positive examples. Using these labels, classifiers were trained for 4 different query classes: “pony”, “rabbiteye blueberry”, “Napoleon” and “pond lily”. Column 1: the raw image ranking from the Internet search engine. Column 2: re-ranking using the semi-supervised scheme without semantic sharing. Column 3: re-ranking with the semi-supervised scheme and semantic sharing. Column 4: re-ranking with a nearest-neighbor classifier and semantic sharing. Without semantic sharing, the classifier has only 5 positive training examples and thus performs poorly. But with semantic sharing it can leverage the semantically close examples from the pool of 5 × 74,569 = 372,845 positive examples.


Figure 9. A visualization of the semi-supervised semantic label sharing scheme operating on the Tiny Images dataset for the keyword “rabbiteye blueberry”. (a): The WordNet tree mapped down to a 2D plane; each pixel corresponds to one category, its color being the average color of the images of that class. (b): A map of affinity to the keyword (red is high, blue is low), revealing semantic clusters within the map. (c): The label function f produced by the semi-supervised learning. Note that it broadly follows the semantic contours of the WordNet map.


[Figure 10 graphic: precision (0.4–1.0) versus recall (0–1) for “Semantic Sharing” and “No Sharing”.]

Figure 10. Evaluation of our sharing scheme on the 80 million image dataset, using 3,538 test examples. Our classifier was trained using 0 hand-labeled examples and 5 noisy labels per class. Using a WordNet semantic affinity matrix over the 74,569 classes gives a clear boost to performance. See text for details.

References

[1] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
[2] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.
[3] T. Berg and D. Forsyth. Animals on the web. In CVPR, pages 1463–1470, 2006.
[4] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 94:115–147, 1987.
[5] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[6] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.
[8] Oxford English Dictionary, 2009.
[9] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. JAIR, 2:263–286, 1995.
[10] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE PAMI, 28(4):594–611, 2006.
[11] C. Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
[12] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, volume 2, pages 1816–1823, 2005.
[13] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[14] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: automatic online picture collection via incremental model learning. In CVPR, 2007.
[17] E. Miller, N. Matsakis, and P. Viola. Learning from one example through shared densities on transforms. In CVPR, volume 1, pages 464–471, 2000.
[18] A. Opelt, A. Pinz, and A. Zisserman. Incremental learning of object detectors using a visual shape alphabet. In CVPR, pages 3–10, 2006.
[19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1):157–173, 2008.
[20] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[21] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[22] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12:1247–1283, 2000.
[23] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large database for non-parametric object and scene recognition. IEEE PAMI, 30(11):1958–1970, 2008.
[24] A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.
[25] L. von Ahn. The ESP game, 2006.
[26] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: multiple-instance learning for weakly supervised object categorization. In CVPR, 2008.
[27] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.
