Kobe University and Kindai University at TRECVID …...M_D_kobe_kindai.18_1 M_D_kobe_kindai.18_2...



Refinement by Object Detection

Topics with requirement on the number of objects or their spatial relationship

Topic 561: Find shots of exactly two men at a conference or meeting table talking in a room.

Topic 584: Find shots of a person lying on a bed.

Object detection by Mask R-CNN

Examination of the top 10000 shots retrieved by the cascade-based approach

Number of objects:

• Use the number of instances with the same label in a keyframe

Spatial relationship between objects:

• Get the center of gravity of each object instance by averaging all pixel coordinates in the instance mask

• Determine the spatial relationship by comparing the center of gravity of one object with that of another object
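The two measurements above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an instance mask is available as a 2D grid of 0/1 values, and the function names `center_of_gravity` and `count_instances` are hypothetical.

```python
from collections import Counter


def center_of_gravity(mask):
    """Return the (x, y) mean of all pixel coordinates where the mask is set."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("empty instance mask")
    return sum(xs) / len(xs), sum(ys) / len(ys)


def count_instances(labels):
    """Count how many detected instances share each label in one keyframe."""
    return Counter(labels)


# Hypothetical 4x4 binary mask of one instance (1 = pixel belongs to it).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
cog = center_of_gravity(mask)

# Hypothetical labels detected in one keyframe.
counts = count_instances(["person", "person", "chair"])
```

Note that in image coordinates the y value grows downward, so an object that appears higher in the keyframe has a smaller y value.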

Shot filtering

Kobe University and Kindai University at TRECVID 2018 AVS Task

Kimiaki Shirahama (1), He Zhenying (2) and Kuniaki Uehara (2)

(1) Department of Informatics, Kindai University
(2) Graduate School of System Informatics, Kobe University

Abstract

This year we addressed the following two points:

• How to fuse concept detection scores for accurate retrieval: a cascade-based approach that uses a sequence of stages to gradually filter out irrelevant shots

• How to deal with a topic that specifies the number of objects or their relation: object detection to analyze detected regions

Concept detection

[Figure: concept vocabulary combining 345 SIN concepts, 1000 ImageNet concepts, 9418 ImageNet concepts, 385 Places concepts and 487 Sports1M concepts]

345 SIN concepts: Detection scores that are provided by ITI-CERTH team and obtained by SVM-based fine-tuning of pre-trained network

1000 ImageNet concepts: ResNet152 implementation in YOLO to detect 1000 concepts in ImageNet

9418 ImageNet concepts: darknet9000 to detect 9418 concepts that are organized into a hierarchical tree

385 Places concepts: ResNet152 fine-tuned for 365 scene concepts defined in Places365, as well as max-pooling to obtain detection scores for their 20 super-concepts

487 Sports1M concepts: C3D to detect 487 concepts defined in the Sports1M dataset.
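The max-pooling used to obtain super-concept scores can be illustrated with a short sketch. The hierarchy and scores below are hypothetical placeholders (the poster does not list the 20 super-concepts), and `super_concept_scores` is an illustrative name.

```python
def super_concept_scores(scores, hierarchy):
    """Score each super-concept as the maximum score among its child concepts."""
    return {
        super_concept: max(scores[child] for child in children)
        for super_concept, children in hierarchy.items()
    }


# Hypothetical scene scores for one shot and a hypothetical two-group hierarchy.
scene_scores = {"kitchen": 0.1, "bedroom": 0.7, "beach": 0.2, "forest": 0.05}
hierarchy = {
    "indoor": ["kitchen", "bedroom"],
    "outdoor": ["beach", "forest"],
}
pooled = super_concept_scores(scene_scores, hierarchy)
```

With max-pooling, a super-concept is considered present as soon as any one of its child scenes is detected with a high score.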

Concept Selection

1. Generality: Use the most general concept, e.g. Topic 566: Find shots of a dog playing outdoors

2. Specificity: Use a specific concept deduced from a phrase in a topic, e.g. Topic 563: Find shots of one or more people on a moving boat in the water (specific concept: boatman)

Negative concept: indoor

[Figure labels: Number of objects; Spatial relationship between objects]

Future work

• Adopt an “embedding-based” approach to avoid cumbersome issues in the concept-based approach, such as concept selection and score fusion/pooling

• Use a deep relational network to specifically predict complex relationships between objects.

• M_D_kobe_kindai.18_1: Baseline that uses the cascade-based approach without object detection.

• M_D_kobe_kindai.18_2: Refinement of shots retrieved by M_D_kobe_kindai.18_1 with object detection.

• M_D_kobe_kindai.18_3: Slightly different sets of concepts from M_D_kobe_kindai.18_1 for some topics.

• M_D_kobe_kindai.18_4: Simple summation of detection scores for the selected concepts.

1. Object detection effectively refines retrieval results, especially when the topic specifies the number of objects.

2. For topics requiring a spatial relationship, object detection did not work as well as we expected.

3. For complex shots, it is difficult to define which shot is correct.

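[Figure: "Before" and "After" example panels]

[Figure: processing pipeline in five steps. 1 Topic → 2 Concepts (manually selected) → 3 Cascade (concepts organized into a cascade with stages) → 4 Refinement (object detection by Mask R-CNN) → 5 Result (get retrieval result)]

[Figure: bar chart comparing M_D_kobe_kindai.18_1 and M_D_kobe_kindai.18_2 on Topics 561, 563, 572, 584 and 586; vertical axis from 0 to 0.35]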

Cascade Construction

Selected concepts are organized into a cascade where each concept is associated with one stage.

a) Order of stages: The more general a concept is, the earlier the corresponding stage is placed

b) Parallel: Multiple concepts representing the same (or very similar) meaning are placed in parallel

c) Separate cascades: Multiple cascades are used for a topic including “or”
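One way to picture these three rules is the data-structure sketch below. It is only an illustration under assumptions: the numeric generality values, the grouping of concepts into stages for Topic 567, and the name `build_cascade` are hypothetical, since the poster describes the cascades as manually organized.

```python
def build_cascade(groups):
    """Order groups of parallel concepts so that more general groups come first.

    Each group is (generality, [concept, ...]). The generality value is a
    hypothetical, manually assigned number; larger means more general.
    """
    ordered = sorted(groups, key=lambda group: group[0], reverse=True)
    return [concepts for _, concepts in ordered]


# Rules a) and b): one cascade, parallel concepts share a stage.
topic_566 = build_cascade([
    (1, ["Dogs", "dog"]),
    (2, ["Outdoor", "outdoor"]),
])

# Rule c): a topic containing "or" is represented by several cascades.
topic_567 = [
    build_cascade([(1, ["Dancing", "dancer"]), (2, ["Nighttime"]), (3, ["Outdoor", "outdoor"])]),
    build_cascade([(1, ["performer"]), (2, ["Nighttime"]), (3, ["Outdoor", "outdoor"])]),
]
```

Here a cascade is a list of stages and each stage is a list of parallel concepts, so the more general "Outdoor" stage precedes the "Dogs" stage for Topic 566.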

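[Figure: example cascades. Concept labels are listed as they appear on the poster, including the number in parentheses after each concept; the grouping into stages is not recoverable from the transcript.]

a) Cascade for Topic 561: Find shots of exactly two men at a conference or meeting table talking in a room. Concepts: Two_People (1), Talking (1), Meeting (1), Conference_Room (1), /c/conference_room (4), conference room (3), table (3), Table (1), table (3)

b) Cascade for Topic 566: Find shots of a dog playing outdoors. Concepts: Dogs (1), dog (3), Outdoor (1), outdoor (4)

c) Two separate cascades for Topic 567: Find shots of people performing or dancing outdoors at night time. Concepts of the first cascade: Dancing (1), Nighttime (1), dancer (3), outdoor (4), Outdoor (1). Concepts of the second cascade: Nighttime (1), performer (3), outdoor (4), Outdoor (1)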

Cascade-based Retrieval

1. Normalize detection scores: power normalization (α = 0.15) and min-max normalization (0~1)

2. Filter out

3. Rank
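The normalization step can be sketched as below. The poster gives only α = 0.15 and the target range 0 to 1; the signed-power form of power normalization and the order of the two steps are assumptions here, and the function names are illustrative.

```python
import math


def power_normalize(scores, alpha=0.15):
    """Signed power normalization: sign(s) * |s| ** alpha (assumed form)."""
    return [math.copysign(abs(s) ** alpha, s) for s in scores]


def min_max_normalize(scores):
    """Rescale scores linearly so that they span the range 0 to 1."""
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


# Hypothetical raw detection scores of one concept over three shots.
raw = [0.0, 0.5, 1.0]
normalized = min_max_normalize(power_normalize(raw))
```

With a small exponent such as 0.15, mid-range scores are pushed toward the top of the range, which reduces the dominance of a few very confident detections.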

[Figure: cascade-based retrieval for Topic 566: Find shots of a dog playing outdoors. One stage scores shots using Outdoor (1), outdoor (4) and Indoor (1), and filters out the half with low scores. Another stage scores shots using Dogs (1) and dog (3), and again filters out the half with low scores. The final score combines Dogs (1), dog (3), Outdoor (1), outdoor (4) and Indoor (1); a factor of × 0.5 appears in the diagram, but the term it applies to is not recoverable from the transcript. Shots are ranked by the final score and the top 1000 are returned.]
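The filter-then-rank procedure for Topic 566 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it sums the normalized scores of the parallel concepts in each stage, keeps the higher-scoring half, and ranks survivors by the sum over all concepts. The negative concept Indoor and the × 0.5 factor are left out because their exact use is not recoverable; the shot identifiers and scores are hypothetical.

```python
def summed_score(shot_scores, concepts):
    """Sum the normalized detection scores of the given concepts for one shot."""
    return sum(shot_scores[c] for c in concepts)


def cascade_retrieve(scores, cascade, top_k=1000):
    """Filter shots stage by stage, then rank the survivors.

    scores: {shot_id: {concept: normalized score}}
    cascade: list of stages, each stage a list of parallel concepts
    """
    survivors = list(scores)
    for stage in cascade:
        survivors.sort(key=lambda s: summed_score(scores[s], stage), reverse=True)
        # Keep the higher-scoring half (rounded up), dropping the low-score half.
        survivors = survivors[: max(1, (len(survivors) + 1) // 2)]
    all_concepts = [c for stage in cascade for c in stage]
    survivors.sort(key=lambda s: summed_score(scores[s], all_concepts), reverse=True)
    return survivors[:top_k]


# Hypothetical normalized scores for four shots.
scores = {
    "shot_a": {"Outdoor": 0.9, "outdoor": 0.8, "Dogs": 0.9, "dog": 0.7},
    "shot_b": {"Outdoor": 0.7, "outdoor": 0.9, "Dogs": 0.1, "dog": 0.2},
    "shot_c": {"Outdoor": 0.1, "outdoor": 0.2, "Dogs": 0.9, "dog": 0.9},
    "shot_d": {"Outdoor": 0.2, "outdoor": 0.1, "Dogs": 0.3, "dog": 0.1},
}
cascade = [["Outdoor", "outdoor"], ["Dogs", "dog"]]
ranked = cascade_retrieve(scores, cascade)
```

Because each stage halves the candidate set, later stages score fewer shots, which is consistent with the reduced search time reported for the cascade-based approach.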

With Mask R-CNN we obtain the label, probability and mask of each object instance.

• Filter out shots that do not contain exactly two people (works well)

• Filter out shots in which the person’s position is lower than the bed’s (doesn’t work well)
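The two filters can be sketched as below, assuming each detection is a dictionary with a label and a probability and that centers of gravity are (x, y) pairs in image coordinates. The probability threshold `min_prob` and both function names are hypothetical and do not come from the poster.

```python
def has_exactly(detections, label, n, min_prob=0.5):
    """True if exactly n instances of the label are detected confidently."""
    confident = [d for d in detections if d["label"] == label and d["prob"] >= min_prob]
    return len(confident) == n


def person_not_lower_than_bed(person_cog, bed_cog):
    """Keep a shot unless the person is lower in the image than the bed.

    Image y coordinates grow downward, so "lower" means a larger y value.
    """
    return person_cog[1] <= bed_cog[1]


# Hypothetical detections for a keyframe of a meeting.
meeting = [
    {"label": "person", "prob": 0.98},
    {"label": "person", "prob": 0.91},
    {"label": "chair", "prob": 0.80},
]
keep_meeting = has_exactly(meeting, "person", 2)

# Hypothetical centers of gravity (x, y) for three bedroom keyframes.
keep_lying = person_not_lower_than_bed((50, 40), (52, 60))
keep_sitting = person_not_lower_than_bed((50, 20), (52, 60))
keep_beside = person_not_lower_than_bed((10, 80), (52, 60))
```

The sitting case passes the filter just like the lying case, which illustrates why comparing centers of gravity alone cannot separate a person sitting on a bed from a person lying on it.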

[Figure: per-topic scores for Topics 561 to 590, comparing M_D_kobe_kindai.18_1, M_D_kobe_kindai.18_2, M_D_kobe_kindai.18_3 and M_D_kobe_kindai.18_4 with MAX in manually-assisted runs and MAX in all runs; vertical axis from 0 to 0.6]

For topics requiring a spatial relationship:

• The person’s center of gravity is higher than the bed’s, but the person is sitting on the bed.

For complex topics:

• Although we obtained masks of object instances, it is difficult to define which situation is correct.


Results

1. M_D_kobe_kindai.18_4 is ranked seventh among 16 runs in the manually-assisted category. (Our team is ranked third among six teams.)

2. Adoption of the large concept vocabulary leads to good performance.

1. Our runs achieved the best average precisions for six topics in the manually-assisted category.

2. No significant difference is observed between using the cascade-based approach and not using it.

3. The cascade-based approach reduces search time (from 5.9s to 4.0s).

For complex relations between objects, such as waving flags and pouring liquid, our current method only considers the co-occurrence of the objects.


[Figure: example keyframes for Topic 584 (a person lying on a bed), annotated with the center of gravity of the bed and the center of gravity of the person, and for Topic 569 (people standing in line outdoors)]