Vdfp audio and video fingerprinting

John Schavemaker, Werner Bailer, Peter-Jan Doets, Jaap Blom audio and video fingerprinting

description

Presentation about audio and video fingerprinting, see for more information

Transcript of Vdfp audio and video fingerprinting

Page 1: Vdfp   audio and video fingerprinting

John Schavemaker, Werner Bailer, Peter-Jan Doets, Jaap Blom

audio and video fingerprinting

Page 2: Vdfp   audio and video fingerprinting

The technology in brief:

Duplicate detection (video fingerprinting)
• Does a video exist in our databases?

Categorization
• What category of video is it? News, sports, film?

Object and logo recognition
• Does an object or logo (image) exist in our databases?

See also our online state-of-the-art report:

http://research.imagesforthefuture.org/index.php/video-fingerprinting-state-of-the-art-report/

Page 3: Vdfp   audio and video fingerprinting

Duplicate detection

QUESTION: Does a video exist in our databases?

Video fingerprints take into account changes in:

• resolution
• codec
• noise
• color
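To make the idea of a change-tolerant fingerprint concrete, the sketch below compares the mean brightness of neighbouring cells in a coarse grid laid over a greyscale frame. It is a generic difference-hash style illustration, not the method of any tool mentioned in this presentation; the function names, the 4x4 grid, and the synthetic frame are all assumptions made for the example.

```python
def block_means(frame, rows=4, cols=4):
    """Mean grey value of each cell in a rows x cols grid laid over the frame."""
    height, width = len(frame), len(frame[0])
    means = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            total = sum(frame[y][x] for y in range(y0, y1) for x in range(x0, x1))
            means.append(total / ((y1 - y0) * (x1 - x0)))
    return means


def frame_fingerprint(frame, rows=4, cols=4):
    """One bit per pair of horizontally adjacent cells: 1 if the left cell is brighter."""
    means = block_means(frame, rows, cols)
    return [
        1 if means[r * cols + c] > means[r * cols + c + 1] else 0
        for r in range(rows)
        for c in range(cols - 1)
    ]


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equally long fingerprints differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def upscale(frame, factor):
    """Nearest-neighbour upscaling, standing in for a change of resolution."""
    return [[v for v in row for _ in range(factor)] for row in frame for _ in range(factor)]


# Synthetic 16x16 greyscale frame (illustrative data only).
frame = [[7 * x + 3 * y for x in range(16)] for y in range(16)]
reference = frame_fingerprint(frame)
rescaled = frame_fingerprint(upscale(frame, 2))
brighter = frame_fingerprint([[v + 10 for v in row] for row in frame])
print(hamming_distance(reference, rescaled), hamming_distance(reference, brighter))
```

Because only the ordering of neighbouring cell means is stored, a uniform brightness shift or a clean rescale leaves the bits of this synthetic frame unchanged. Real footage with cropping, logos, or picture-in-picture would need a considerably more robust design.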

Page 4: Vdfp   audio and video fingerprinting

SWOT video fingerprinting

THREATS
• closed standards for video fingerprints
• video encryption
• clever “users”

OPPORTUNITIES
• larger video databases
• non-produced material
• open standard for video fingerprints
• combination with audio

WEAKNESSES
• many competing vendors: which software package to choose?
• suitability for video material that is not produced?

STRENGTHS
• mature technology
• very good performance on produced material
• many commercial packages available on the market

Page 5: Vdfp   audio and video fingerprinting

Video categorization

QUESTION: What category of video is it? Close-up of a face, indoor sports, outdoor sports?

Images: UvA, http://www.science.uva.nl/research/mediamill/

Page 6: Vdfp   audio and video fingerprinting

SWOT video categorization

THREATS
• variety too large for a category
• choice of categories
• dependent on the annotation of training examples

OPPORTUNITIES
• combination of categories
• faster and better learning
• automatic annotation

WEAKNESSES
• immature technology
• performance (strongly) dependent on the training examples used
• training the system for new categories takes relatively long

STRENGTHS
• very promising technology
• generic recognition possible
• complements duplicate and object recognition
• bridges the ‘semantic gap’

Page 7: Vdfp   audio and video fingerprinting

Object and logo recognition

QUESTION: Does an object or logo exist in our databases?

picture from http://www.omniperception.com/

Page 8: Vdfp   audio and video fingerprinting

SWOT object and logo recognition

THREATS
• pre-processing of all material necessary
• patents

OPPORTUNITIES
• larger video databases
• open standard
• 3D object recognition

WEAKNESSES
• only 2D objects (logos)
• true duplicate detection
• computationally intensive

STRENGTHS
• good, robust performance
• commercial packages
• fast learning and recognition
• revolution in computer vision

Page 9: Vdfp   audio and video fingerprinting


video fingerprinting

Page 10: Vdfp   audio and video fingerprinting

Use of FP: identification

Training phase: labeled multimedia items (audio/visual signal and metadata) → fingerprint extraction → fingerprints and metadata

Identification phase: unlabeled multimedia items (audio/visual signal) → fingerprint extraction → match against the stored fingerprints → metadata (“Which item?”)
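The two phases of this diagram can be sketched as a small lookup structure. This is only a minimal illustration: `extract_fingerprint` is a toy stand-in (one bit per rise or fall of a numeric signal), and the class name, the `max_error_rate` threshold, and the linear search are assumptions, not the design of any system in the pilot.

```python
def extract_fingerprint(signal):
    """Toy extractor (an assumption): one bit per sample pair, 1 if the signal rises."""
    return tuple(1 if b > a else 0 for a, b in zip(signal, signal[1:]))


def bit_errors(fp_a, fp_b):
    """Count differing bits between two equally long fingerprints."""
    return sum(x != y for x, y in zip(fp_a, fp_b))


class FingerprintDatabase:
    def __init__(self):
        self._entries = []  # list of (fingerprint, metadata) pairs

    def enroll(self, signal, metadata):
        """Training phase: store the fingerprint of a labeled item with its metadata."""
        self._entries.append((extract_fingerprint(signal), metadata))

    def identify(self, signal, max_error_rate=0.25):
        """Identification phase: return the metadata of the best match, or None."""
        query = extract_fingerprint(signal)
        best_metadata, best_rate = None, None
        for fingerprint, metadata in self._entries:
            if not query or len(fingerprint) != len(query):
                continue  # this toy version only compares equally long items
            rate = bit_errors(query, fingerprint) / len(query)
            if best_rate is None or rate < best_rate:
                best_metadata, best_rate = metadata, rate
        if best_rate is not None and best_rate <= max_error_rate:
            return best_metadata
        return None


db = FingerprintDatabase()
db.enroll([1, 3, 2, 5, 4, 6, 5, 8, 7], {"title": "item A"})
db.enroll([9, 8, 7, 6, 5, 4, 3, 2, 1], {"title": "item B"})
# A slightly distorted copy of item A keeps the same rise/fall pattern.
print(db.identify([1.1, 2.9, 2.2, 5.1, 3.8, 6.2, 5.1, 7.9, 7.2]))
```

A production system would replace the linear scan with an index and would search for sub-sequences rather than whole items, since a query clip is usually only a fragment of the reference.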

Page 11: Vdfp   audio and video fingerprinting

Sound & Vision Pilot

• Observations
• Problem harder than expected
• Transformations
• Crop & scale
• Brightness/contrast
• Logos, captions
• PIP (picture-in-picture) is very difficult
• many matching sequences of black frames

Page 12: Vdfp   audio and video fingerprinting

Sound & Vision Pilot – results ZiuZ

• TNO has used the ZiuZ video fingerprinting tool on the dataset
• ZiuZ video fingerprinting is optimized for child-abuse material:
  • short clips
  • low resolution
  • low image quality
• Preliminary results on the Sound & Vision dataset show
  • material is very challenging
  • some but limited recall performance
  • application domain differs
  • queries containing multiple clips of reference material were not enabled by this version of the tool

Page 13: Vdfp   audio and video fingerprinting

Sound & Vision Pilot – Results JRS

• Recall: 36% (min. 16%, max. 55%)
• Precision: difficult to determine; many black sequences match, so the results need manual checking
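For reference, recall and precision as used above can be computed from match counts as in the sketch below. The counts in the example are invented for illustration and are not data from the pilot; the `summarize` helper assumes that the minimum and maximum on the slide refer to per-episode values, which the slide does not state.

```python
def recall(true_positives, false_negatives):
    """Fraction of the fragments that should be found which were actually found."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0


def precision(true_positives, false_positives):
    """Fraction of the reported matches which are correct."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


def summarize(values):
    """Mean, minimum and maximum of a non-empty list of per-episode scores."""
    return {"mean": sum(values) / len(values), "min": min(values), "max": max(values)}


# Hypothetical counts for a single episode: 9 fragments found, 16 missed, 3 wrong matches.
print(recall(9, 16), precision(9, 3))
```

Precision needs the number of false positives, which is why it is hard to report when many black sequences match each other and every hit has to be checked by hand.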

Page 14: Vdfp   audio and video fingerprinting

Sound & Vision Pilot - Results

• Transformations our system handles

Page 15: Vdfp   audio and video fingerprinting

Sound & Vision Pilot - Results

• False positives

Page 16: Vdfp   audio and video fingerprinting

Experiments with SIFT (1)

• we do not have a SIFT-based fingerprinting solution in the consortium
• JRS has a SIFT-based interactive tool to locate recurring objects in video
• created a video from episode + source clips and performed analysis and search

Page 17: Vdfp   audio and video fingerprinting


Experiments with SIFT (2)

Page 18: Vdfp   audio and video fingerprinting


Experiments with SIFT (3)

Page 19: Vdfp   audio and video fingerprinting

Experiments with SIFT (4)

• Conclusion
  • SIFT can handle cases of scaling and cropping reliably
  • even PIP with distortions
• Scalability issues
  • time for extraction and especially matching
  • not sure if ranking of matches is still reliable on huge datasets

Page 20: Vdfp   audio and video fingerprinting

Characteristics of the data set - audio

• Not all archive fragments contain audio
• Often the original audio is used – just cut-and-paste, no serious distortions
• Sometimes the audio is replaced or combined with a voice-over
• Time segmentation of the audio in the episode is different from that of the video used. The audio is not always used with the corresponding video fragments. The example on the next slide illustrates this. The other way around, and other variations, also occur.

Page 21: Vdfp   audio and video fingerprinting

Characteristics of the data set – audio example

[Figure: time line of one archive video and time line of one Andere Tijden episode, each with a video track and an audio track. Continuous audio fragment, with several shorter video fragments.]

Page 22: Vdfp   audio and video fingerprinting

Characteristics of the data set - audio

• Limitations of the use of audio
  • the reference material must contain audio
  • the audio track might not originate from the same material as the video track; this depends on the video material used
  • the playout speed must not be changed too much (less than +/- 2%)
• Advantages of the use of audio
  • Highly robust algorithms
  • Usually audio is undistorted; video is cropped, scaled, etc.
  • Audio is usually used continuously, while video fragments are cut and pasted from different sections of the reference video and ‘glued together’

Page 23: Vdfp   audio and video fingerprinting

Identification results - audio

• Only checked whether the correct archive file name is returned

False positive | Missed | Correct | Episode
0 | 3 | 7 | Pim en zijn volk
0 | 1 | 8 | De wording van Paars
0 | 10 | 6 | Burgemeesters in oorlogstijd
2 | 1 | 3 | Modderen in de polder: Lelystad
1 | 6 | 2 | Op zoek naar Nederland
1 | 9 | 1 | Kronkels van de Maas
6 | 1 | 9 | Strijd tegen de file
2 | 5 | 0 | 75 jaar afsluitdijk
1410 | Veertig jaar STER-reclame
0 | 3 | 8 | Liggadjati

silent parts in the video

Page 24: Vdfp   audio and video fingerprinting

Fingerprinting – audio algorithm

• Algorithm well known from the literature:
  • Haitsma, Kalker, “A Highly Robust Audio Fingerprinting System”, in Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR), October 2002.
• Features: energy in 33 audio frequency bands
• Every 11.6 ms a 32-bit sub-fingerprint is computed, consisting of coarsely quantized differences between these energy samples
• Fingerprint consists of a time series of sub-fingerprints
• The implementation returns the best matching fragments only (settings to return no false positives)
• Algorithm is highly robust and highly discriminative
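The sub-fingerprint step described above can be sketched as follows. In the cited paper, each bit is the sign of the difference between adjacent band energies, differenced again between consecutive frames, so 33 bands yield 32 bits. The sketch assumes the band energies have already been computed (framing, windowing and the band layout are not shown); the function names and the bit order within the 32-bit word are choices made for this example, not details of the implementation used in the pilot.

```python
def sub_fingerprints(band_energies):
    """Turn a sequence of frames (each a list of band energies) into sub-fingerprints.

    Bit m of a frame is 1 when
    (E[n][m] - E[n][m+1]) - (E[n-1][m] - E[n-1][m+1]) > 0,
    so 33 bands give one 32-bit integer per frame after the first.
    """
    result = []
    for previous, current in zip(band_energies, band_energies[1:]):
        value = 0
        for m in range(len(current) - 1):
            diff = (current[m] - current[m + 1]) - (previous[m] - previous[m + 1])
            value = (value << 1) | (1 if diff > 0 else 0)
        result.append(value)
    return result


def bit_error_rate(fp_a, fp_b, bits=32):
    """Fraction of differing bits between two sequences of sub-fingerprints."""
    count = min(len(fp_a), len(fp_b))
    if count == 0:
        raise ValueError("need at least one sub-fingerprint in each sequence")
    errors = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return errors / (bits * count)


# Tiny example with 3 bands (2 bits per sub-fingerprint) instead of 33.
print(sub_fingerprints([[0, 0, 0], [3, 2, 5]]))
```

Matching then amounts to finding the position in a reference fingerprint where the bit error rate against the query block is lowest and below a threshold. Because only signs of energy differences are kept, a uniform change in volume does not alter the bits.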

Page 25: Vdfp   audio and video fingerprinting

Future improvements on current results

• Trailing parts contain silence and black frames (no content). The silences give rise to false positives and irrelevant detections. A silence/activity detector is needed to exclude these parts.
• Our current implementation from the literature allows only one fragment per reference file to be returned.
• Our current implementation has only coarse time localization.
• Combination of audio and video fingerprinting
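A silence/activity detector of the kind called for above could, in its simplest form, threshold the RMS level of short audio frames. The sketch below is one possible starting point under that assumption; the frame length, the threshold, and the function names are hypothetical and would need tuning on real archive material.

```python
import math


def frame_rms(samples, frame_length):
    """RMS level of consecutive non-overlapping frames (a trailing partial frame is ignored)."""
    levels = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        levels.append(math.sqrt(sum(s * s for s in frame) / frame_length))
    return levels


def active_frames(samples, frame_length, threshold):
    """True for each frame whose RMS level exceeds the threshold."""
    return [level > threshold for level in frame_rms(samples, frame_length)]


def trim_silence(samples, frame_length, threshold):
    """Drop leading and trailing silent frames; keep everything in between."""
    flags = active_frames(samples, frame_length, threshold)
    if True not in flags:
        return []
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return samples[first * frame_length:(last + 1) * frame_length]


# Synthetic signal: silence, a short burst of activity, silence.
samples = [0.0] * 8 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 8
print(trim_silence(samples, frame_length=4, threshold=0.01))
```

The same thresholding idea applies to black frames on the video side, using mean luminance per frame instead of RMS level.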

Page 26: Vdfp   audio and video fingerprinting

Consortium

http://instituut.beeldengeluid.nl/

http://www.joanneum.at/en/digital.html

http://www.ziuz.com

http://hs-art.com/

http://www.tno.nl