Department of Electrical & Computer Engineering
Visual Recognition with Humans in the Loop
Authors: Steve Branson, Catherine Wah, Florian Schroff, Boris
Babenko, Peter Welinder, Pietro Perona, and Serge Belongie
Presented by: Yan Fang
Overview
• Problem Introduction
- Challenge
- Goal
- Related Work
• Approach
- Method Overview
- Incorporating Computer Vision
- User Response
• Experiments & Results
- Datasets & Configuration
- Performance Evaluation
- Results
• Conclusion & Discussion
Problem Introduction
• Multi-class object recognition
• Challenge: computer vision performs poorly on fine-grained categories
Inter-category (basic level): easy for both computer and human
Fine-grained categories: hard for both computer and human
Why do we care?
• The low performance of CV algorithms, even on basic-level categories, is not acceptable
• Most datasets contain only a small number of object categories
• Important problem to study: it helps people recognize types of objects they don't yet know how to identify
Why is it hard?
Difficulties for humans in fine-grained classification:
- Recognizing visual attributes: easy
- Recognizing the sub-class directly: hard
Why is it hard?
Comparing humans with computers:
- Memory, expertise, knowledge: human limited, computer good
- Basic visual capabilities: human good, computer limited
Combine them together
Blue belly? Finch? Bunting?
- Identifying the exact species: hard for both computer and human
- Answering attribute questions: easy for humans
- Exploiting the answers over many classes: easy for computers
Goal
• Build a human-computer framework for multi-class object recognition
• Easy to plug in any object recognition algorithm
• Use human assistance to improve performance
• Minimize the human effort required in the recognition task
• Accurate enough for real-life applications
Related Work
• Recognition of tightly-related categories
- Datasets: Oxford Flowers 102, UIUC Birds, and STONEFLY
- Shortcomings: scaling, object domain, performance
- Similar work: a botanist's field guide
  Differences: intended users (expert vs. layperson), processing of the image
• Areas combining vision and learning with human input
- Relevance feedback, active learning, expert systems
- Similar to, but different from, this work
• Scaling to large numbers of categories
- Class taxonomies, feature sharing, error-correcting output codes (ECOC), attribute-based classification methods
- Can be plugged into this work
Approach
Method Overview
Goal: given an image, classify the bird species
• Pose questions about visual properties that are easy for humans to answer
• Intelligently select each question, exploiting the visual content step by step
• Make the final decision based on the refined probability distribution
Method Overview
Example: a visual 20-questions game for humans (cf. http://20q.net)
A database of C classes needs O(log C) questions;
computer vision can make this faster
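As a back-of-the-envelope illustration of the O(log C) bound (a sketch, not from the slides):

```python
import math

# With ideal binary questions, each answer halves the candidate set,
# so identifying one of C classes needs about ceil(log2(C)) questions.
def questions_needed(num_classes: int) -> int:
    return math.ceil(math.log2(num_classes))

print(questions_needed(200))   # the 200 bird species need 8 ideal questions
```

Noisy answers and weakly discriminative questions raise this number in practice, which is one reason a computer-vision prior helps.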
Algorithm Details
Some terms:
A set of possible questions (e.g. IsRed?, HasStripes?, BellyColor?)
Each answer comes with a confidence value
// Initialize the question set
// Ask questions iteratively:
//   pick the question with maximal expected information gain
//   pose the question to the user and record the answer
// Make the final decision
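The outlined loop can be sketched end to end as below. This is a minimal re-implementation under assumed interfaces (`answer_model`, `ask_user` are hypothetical placeholders), not the authors' code:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a dict mapping class -> probability."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def expected_info_gain(q, p_class, answer_model):
    """Expected entropy reduction from asking question q.
    answer_model(q, c) returns a dict: answer -> p(answer | class c)."""
    p_answer = {}
    for c, pc in p_class.items():
        for a, pa in answer_model(q, c).items():
            p_answer[a] = p_answer.get(a, 0.0) + pa * pc
    h_after = 0.0
    for a, pa in p_answer.items():
        if pa > 0:
            post = {c: answer_model(q, c).get(a, 0.0) * pc / pa
                    for c, pc in p_class.items()}
            h_after += pa * entropy(post)
    return entropy(p_class) - h_after

def recognize(p_class, questions, answer_model, ask_user, max_q=20):
    """Visual 20-questions loop: greedily ask the most informative
    question, then apply the Bayesian update p(c|a) ~ p(a|c) p(c)."""
    remaining = list(questions)
    for _ in range(min(max_q, len(remaining))):
        q = max(remaining,
                key=lambda qq: expected_info_gain(qq, p_class, answer_model))
        remaining.remove(q)
        a = ask_user(q)
        unnorm = {c: answer_model(q, c).get(a, 1e-9) * pc
                  for c, pc in p_class.items()}
        z = sum(unnorm.values())
        p_class = {c: v / z for c, v in unnorm.items()}
    return max(p_class, key=p_class.get)
```

With a uniform prior over two toy species and one discriminative question, a "yes" answer drives the posterior toward the class whose response model favors "yes".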
More notations
Notation (symbols reconstructed from the paper):
At time step t, select question q_{j(t)}
U^{t-1} = {u_1, ..., u_{t-1}} is the history of responses
j(t) is the index of the chosen question in the question set
p(c | x, U^{t-1}) is the current class probability distribution
I(c; u_i | x, U^{t-1}) is the information gain obtained by asking another question
Select Next Question
Maximize expected information gain, as in decision-tree algorithms
- Kullback–Leibler divergence: a measure of the difference between two distributions
- Entropy of the current class distribution
- One term depends on the CV algorithm; another depends on the user response model
- (Cross-entropy?)
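The criterion can be written out as follows (notation reconstructed to match the paper, with U^{t-1} the response history):

```latex
% Expected information gain of posing question i at time t:
I\big(c;\, u_i \mid x, U^{t-1}\big)
  = \mathbb{E}_{u_i}\!\left[\,
      \mathrm{KL}\!\left( p(c \mid x,\, u_i,\, U^{t-1}) \;\middle\|\; p(c \mid x,\, U^{t-1}) \right)
    \right]
% Equivalently, as a drop in entropy:
I\big(c;\, u_i \mid x, U^{t-1}\big)
  = H\big( p(c \mid x, U^{t-1}) \big)
  - \mathbb{E}_{u_i}\!\left[ H\big( p(c \mid x,\, u_i,\, U^{t-1}) \big) \right]
```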
Incorporate Computer Vision
• Any recognition algorithm can be plugged in, e.g. an SVM-style classifier over attributes or low-level features
• The role of computer vision is to evaluate p(c | x), the class probabilities given the image
• This conditional prior helps update the current class distribution and determines which question to ask
• It is fine to use no CV algorithm at all: p(c | x) can be replaced by any probability distribution, or simply by the class prior
Incorporate Computer Vision
• A simple framework using Bayes' rule
• Assume user responses are class-dependent, not image-dependent
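Bayes' rule as used here, reconstructed in the paper's notation:

```latex
% Posterior over classes given the image x and the responses U^t:
p(c \mid x, U^{t})
  = \frac{ p(U^{t} \mid c, x)\, p(c \mid x) }
         { \sum_{c'} p(U^{t} \mid c', x)\, p(c' \mid x) }
% Class-dependence assumption (responses do not depend on the image):
p(U^{t} \mid c, x) = p(U^{t} \mid c)
```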
Modeling User Response
• Assume questions are answered independently given the category (validated experimentally)
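The independence assumption factorizes the response likelihood:

```latex
p(U^{t} \mid c) = \prod_{i=1}^{t} p(u_i \mid c)
```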
Modeling User Response
• Dependencies among the terms (diagram in the original slide)
Modeling User Response
• Still need to estimate p(u_i | c)
• Assumptions that make estimation tractable:
- a weighted Dirichlet prior
- a global attribute prior
- pooling together certainty labels
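A minimal sketch of one way to smooth sparse per-class response counts toward a pooled global prior, in the spirit of the Dirichlet-prior assumption above (the function name, counts, and weight `alpha` are illustrative, not the paper's exact estimator):

```python
from collections import Counter

def response_model(class_counts, global_counts, alpha=2.0):
    """Estimate p(answer | class): smooth sparse per-class answer
    counts toward the pooled global answer distribution, weighted
    by the pseudo-count alpha (a Dirichlet-style prior)."""
    total_global = sum(global_counts.values())
    prior = {a: n / total_global for a, n in global_counts.items()}
    total_class = sum(class_counts.values())
    return {a: (class_counts.get(a, 0) + alpha * prior[a]) / (total_class + alpha)
            for a in prior}

# 3 "yes" / 1 "no" answers for one species, 50/50 globally:
probs = response_model(Counter(yes=3, no=1), Counter(yes=50, no=50))
```

With few class-specific answers the estimate stays close to the global prior; as counts grow, the data dominates.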
Modeling User Response
Example of user response
Experiments & Results
Dataset & Configuration
Bird200 Dataset
• 6,033 images, 200 species
• Difficult for a layperson
Questions:
• 25 questions, 288 binary attributes
• Deterministic attribute values from whatbird.com
Answer Collection
Mechanical Turk interface:
• Answers from non-experts
• Each question illustrated with a prototypical image and supplementary material
• A randomly selected user answer is used for each question
Evaluation
Method configurations:
• No computer vision
• A classifier based on SIFT (VLFeat, by Andrea Vedaldi)
• A classifier based on attributes
Evaluation:
• Ask T questions, then measure classification accuracy
• Show images of the highest-probability class after each question; the user stops the process by verifying these images
Results & Performance
• Without computer vision
• Modeling the user responses contributes to accuracy
• Non-expert users are not ideal
Results & Performance
• Number of questions vs. accuracy
• CV algorithms do improve performance when fewer questions are asked
Results & Performance
• User-stopping tests
• CV algorithms reduce human labor on easy tasks
Results & Performance
• Similar performance on the Animals with Attributes dataset
• The attribute-based classifier works better than 1-vs-all
Case: Computer Vision Helps
• The computer selects the proper question, which leads to the correct recognition
Case: Human Responses Help
• User responses correct the wrong prediction of the computer vision system
Failure Cases
• A cropped image leads to wrong answers to certain questions (e.g. belly attributes)
• Two species are naturally similar, and the questions fail to capture the distinguishing attributes
Conclusion
• Pros
- A framework combining computer vision and human recognition
- Compatible with any CV algorithm
- Human input improves accuracy on hard recognition tasks
- The computer reduces human labor on easy tasks
- Practical for real applications that help non-expert users
• Cons
- Cropped images can lead to wrong answers to questions
- May not work on very similar species
- Attribute selection is complicated and depends on expert knowledge
Future Work
• Trend: reduce or remove human effort from the framework
• Improve CV performance on hard problems
• Develop better question design and selection mechanisms
Discussion & Questions
What's It Going to Cost You?: Predicting Effort vs.
Informativeness for Multi-Label Image Annotations
Sudheendra Vijayanarasimhan and Kristen Grauman
Overview
• Problem
• Method Overview
• Experiments & Results
• Conclusion & Discussion
Problem Introduction
• Annotating training data is essential to visual recognition
• Manual effort is required, and images are not equally informative
• Conventional active learning does not fit visual category learning:
- Images contain multiple objects and need multiple labels
- There are multiple annotation types: tags, regions, segments
- Each annotation costs different effort depending on its type and on the image
Proposed Method
• A new active learning framework that weighs informativeness against annotation effort
• A multiple-instance, multi-label learning (MIML) formulation helps select the most promising annotations
• Capable of choosing both the image and the type of annotation
• Learns from humans to predict the effort cost of annotating different images
Active Learning
Method Overview
Method Overview
Step 1.
Learn object categories from
multi-label images, with a
mixture of weak and strong
labels.
MIML: multiple-instance, multi-label learning
MIML Scenario
Unlabeled images are oversegmented into regions
Each image becomes a bag of regions (instances)
Different levels of annotation provide different amounts of information
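The bag/instance structure above can be sketched as a simple data holder (the names here are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Bag:
    """One multi-label image: a bag of oversegmented regions.
    Labels may exist at the image (bag) level, weak and cheap,
    or at the region (instance) level, strong and costly."""
    image_id: str
    regions: list                                      # per-region feature vectors
    image_labels: set = field(default_factory=set)     # image-level tags
    region_labels: dict = field(default_factory=dict)  # region index -> label

b = Bag("img1", regions=[[0.1, 0.2], [0.3, 0.4]], image_labels={"cow", "grass"})
b.region_labels[0] = "cow"   # one strong label added later
```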
Method Overview
Step 2. Active multi-level selection of multi-label annotations
• Survey unlabeled and partially labeled images
• Predict the tradeoff between informativeness and manual effort
• Select the most promising annotation and update the classifier
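These steps amount to a cost-sensitive value-of-information rule; a minimal sketch, where the scoring functions and the numbers in the toy example are invented for illustration:

```python
def select_annotation(candidates, info_fn, cost_fn):
    """candidates: iterable of (image_id, annotation_type) pairs.
    info_fn predicts informativeness (e.g. expected risk reduction);
    cost_fn predicts manual effort (e.g. seconds of annotation time).
    Returns the candidate maximizing value per unit cost."""
    return max(candidates, key=lambda c: info_fn(c) / cost_fn(c))

# Toy example: a full segmentation is informative but expensive;
# an image-level tag is cheap but weaker.
cands = [("img1", "tag"), ("img1", "segment"), ("img2", "tag")]
info = {("img1", "tag"): 1.0, ("img1", "segment"): 3.0, ("img2", "tag"): 2.0}
cost = {("img1", "tag"): 2.0, ("img1", "segment"): 30.0, ("img2", "tag"): 2.0}
best = select_annotation(cands, info.get, cost.get)
print(best)
```

Here the cheap but informative tag on "img2" wins over the expensive segmentation, matching the paper's intuition that raw informativeness alone over-selects costly annotations.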
Experiments
• The MSRCv2 dataset: 591 images, 21 classes
• Three aspects evaluated:
- Accuracy of learning from multi-label examples
- Accuracy of annotation-cost prediction
- Effectiveness at reducing manual effort
• RBF-kernel SVM, parameters set by cross-validation, "void" regions ignored
Results
• Segment each image and compute texton and color histograms for each blob
• Each image is a bag; each segment is an instance
• Image-level labels
• Accuracy measured on new images and new regions
Results
• Data gathered with Amazon's Mechanical Turk
• Classifiers for "easy" vs. "hard" images
• Regressors predict the actual time cost
Results
• Comparison of different selection strategies
• Accuracy: average of the diagonal of the confusion matrix
• Region-level accuracy
• 80 random images added to the unlabeled pool
Results
• Comparison with and without the cost-prediction function
• Helps on "Tree" and "Airplane", but not on "Sky"
Results
• Quantitative evaluation of active selection
• Active selection takes less effort to reach the same level of accuracy
Contribution
• An active learning framework that chooses annotation examples by balancing manual effort against informativeness
• Handles annotation types at different levels
• Active learning greatly reduces manual effort
• Effectively predicts the cost of annotation
• The multi-level, multi-label strategy outperforms traditional active learning
Discussion & Questions