
VISUAL ALGORITHMS FOR POST PRODUCTION

Notes for the Course at SIGGRAPH 2009

Simon Robinson Anil Kokaram Mike Seymour


Contents

1 Course Overview
  1.1 Introduction
  1.2 Schedule
  1.3 A Short History

2 Ingest
  2.1 Defects in Pictures
    2.1.1 Modern Footage
  2.2 Dust Busting Algorithms
    2.2.1 SDIx
    2.2.2 ROD
    2.2.3 Morphological/Median Filter Approaches
    2.2.4 MRF based approaches
    2.2.5 Reconstruction
    2.2.6 Still work to do
  2.3 Noise and Grain
  2.4 Motion Estimation
    2.4.1 Practice
  2.5 Summary

3 Basics of Compositing
  3.1 Node-based Compositing
  3.2 Compositing in 3D
  3.3 Compositing in Stereo-3D
  3.4 Summary

4 Challenges in Visual Compositing
  4.1 Matting
    4.1.1 Making a Practical Matte Puller
  4.2 Inbetweening
    4.2.1 Frame Interpolation
    4.2.2 Outstanding issues

5 Stereo-3D
  5.1 Convergence and Keystoning
  5.2 Disparities should be horizontal
  5.3 Dense Disparity Estimation
    5.3.1 Moving the cameras
    5.3.2 Moving compositing components
  5.4 Optical differences between the eyes


Chapter 1

Course Overview

PRESENTERS

Simon Robinson, The Foundry, Wardour Street, Soho, London, UK
Anil Kokaram, Sigmedia Research Group, EEE Dept., Trinity College, Dublin 2, Ireland
Mike Seymour, fxphd and fxguide

Additional material thanks to Lucy Wilkes, The Foundry.

1.1 Introduction

The modern post production house has evolved to contain a sophisticated workflow in which a large variety of creative and algorithmic tasks are coordinated to achieve a finished product. While the work of SIGGRAPH authors undoubtedly has influence on the evolution of the technology used in post production, not many are aware of the breadth of algorithms and people that are involved. The terminology used and the focus of those in this industry tend to be hard to access for those in academia. What is also surprising is the amount of algorithmic work that is used for manipulating 2D pictures regardless of the requirement for 3D special effects. Tasks such as noise reduction, "dust busting", matting, rotoscoping, colour correction, retiming and brightness balancing are all 2D based and have evolved to include a range of statistical techniques.

This course attempts to demystify part of the world of post production for the SIGGRAPH audience. We expose the lesser known bread-and-butter parts of the workflow and educate the audience about the tools that are commonly used. The course opens with an introduction to the world of post production given by Mike Seymour, who is well known for his educational outreach. Simon Robinson and Anil Kokaram then expose the links between the practice of post production and the algorithms developed in academia.

Prerequisites: We assume some familiarity with basic image and video processing ideas (linear filtering, colour spaces, motion estimation), undergraduate mathematics and signal processing.

1.2 Schedule

8:30 am  Introduction [Mike Seymour]
9:15 am  Ingest [Anil Kokaram]
9:30 am  break
9:40 am  Basic Compositing [Simon Robinson]
10:15 am Challenges in 2D Compositing [Anil Kokaram]
10:50 am break
11:00 am 3D Stereoscopic post production [Simon Robinson]
11:25 am The Future Of Compositing [Simon Robinson]
12:00 pm Summary and wrap up
12:15 pm Close

1.3 A Short History

In film production of the early 1980s, film effects were predominantly optical post-processes or clever live action. Computer graphics was a niche topic, seeing its first mainstream uses in films like Star Trek II: The Wrath of Khan. The early days of such effects work were dominated by 3D effects, possibly because the techniques enabled the creation of unique effects beyond what was possible by other means. The Industrial Light and Magic team behind the film effects later formed the core of Pixar, whose subsequent animated features are well known to children worldwide. 2D image processing lacked the "wow" factor of 3D, and without the compute power to apply such techniques to movies wholesale this remained an underdeveloped area.

The early 1980s was also dominated by custom software. There was very little commercial exploitation of computer-generated graphics until the emergence of companies such as Softimage, Wavefront and Alias Research in the late 1980s, with their 3D modelling and animation packages.

A company called Avid showed the first non-linear editing system in a private suite at the NAB conference in April 1988. The non-linear editing system allowed the image clips comprising a film to be digitally edited together using a computer interface. The resulting edit decisions (the Edit Decision List, EDL) were then used to produce the cuts and dissolves on the film using automated equipment. This was the start of a big industry change: in 1994 three films used the new digital editing system; today almost all do, and the EDL output is almost entirely used to form a digital master. This latter development, where output to film is a final process only (and optional in the case of digital projection), has led to the rise of Digital Intermediate systems, which combine the non-linear aspects of these early systems with real-time playback, as well as effects and grading capabilities.

Initial systems from Avid involved little 2D effects work beyond the wipes, cuts and dissolves necessary to edit a sequence of clips. But also in the late 1980s, a UK company called Quantel began producing the first commercial compositing system/non-linear editor, following on from success they had enjoyed earlier in the decade with the Paintbox system, a TV graphics system. The new system was called "Harry", and relied upon custom hardware to render special effects to a hardware disk array. Unlike Avid's long-form systems, the "Harry" system could only hold 80 seconds of uncompressed video; but the combination of effects, editing and playback made it a unique and successful tool for the commercials market.

Quantel's market dominance was challenged by a newcomer, Discreet Logic from Montreal. Discreet Logic started business marketing an early node-based compositing system called "Eddie", primarily developed by Bruno Nicoletti at a pioneering Australian effects house, Animal Logic. While "Eddie" was an important forerunner of most film-based compositing systems today, Discreet Logic sold their rights to the product and instead produced their first version of "Flame". "Flame" was a clip-based compositing system which shared many of the same selling points as Quantel's "Harry" system, but with one important difference: it was software based and ran primarily on non-custom UNIX-based SGI hardware. As well as quickly becoming significant in what had been thought of as Quantel's untouchable broadcast and commercials market, Discreet Logic were also more successful at capturing the growing digital film effects market. The first year of "Flame", 1993, was the beginning of the end for Quantel's market leadership, and paved the way for the dominance of flexible software systems in the 2D film-effects market.

Discreet Logic's systems, however, left a gap in the market for cheaper effects systems: lower processing power, but arguably more intricate in functionality. Early significant systems built upon the node-based architecture pioneered by "Eddie". Kodak's Cineon system and Avid's Media Illusion (a follow-on from Parallax's Advance system) targeted some of the same market as Discreet Logic, but also showed appeal in the broadening base of digital film effects work. A company called Nothing Real launched the node-based Shake system in 1997, which eventually became (and still remains) the major single-shot film compositing tool worldwide.

Over the growth of software-based systems from the early nineties to today, more and more algorithmic developments have been incorporated into 2D effects systems. It is significant to note that what seem today like "obvious" developments have their own history. Point tracking of a single location, for example, in order to allow the automated attachment of objects to scene elements, first appeared in Discreet Logic and Quantel systems around 1994, and it was seen as ground-breaking at the time.

Today, all compositing systems rely upon multi-point trackers and frequently on 3D tracking systems. Digital Domain's TRACK technology for camera position calculation was first used in production in 1993. In 1998, Dr. Douglas R. Roble won the Technical Achievement Award for his contribution to tracking technology. Today numerous commercial products exist, such as Boujou, PFTrack and 3D Equalizer, and their role extends both into 3D animation and into the 2D and 3D environments typical of modern compositing.

Motion estimation has also become ubiquitous, particularly for retiming image sequences. An early commercial pioneer was the Cinespeed component of the Cineon compositing system, released in 1993. The film "What Dreams May Come", released in 1998, made the first important use of optical flow for visual effects outside conventional retiming. Here it was used to transfer motion captured from live action into synthetic imagery, using optical flow techniques from Pierre Jasmin and Peter Litwinowicz. Pierre and Peter later formed the RE:Vision Effects company, now well known for its Twixtor retimer.

The 1999 film "The Matrix" made famous the "bullet-time" technique, based upon interpolating pixel motion between multiple cameras rather than multiple sequential frames of the same camera. This work was done in conjunction with Dr Bill Collis, then of Snell and Wilcox, now CEO of The Foundry, by some of the same team who had worked on the "What Dreams May Come" sequences. Later at The Foundry, Bill Collis was one of the team awarded an Academy Award for the Furnace plug-in toolset. Furnace is a complex compositor's toolkit built around motion-estimation technology, and uses motion analysis for numerous tasks in image cleanup, object removal and retiming.

Subsequently, in sequels to "The Matrix", optical flow photogrammetry was also used to allow markerless facial capture. Today there is significant research effort into extending these techniques into more general multi-camera frameworks. The thrust of the research and commercial effort is to allow a new, richer extraction of scene information, with a view to improving the functionality and creativity of future compositing systems, as well as helping to bridge the gaps between 2D compositing and 3D scene information.

In what follows, we consider the principal aspects of the workflow and show how technology developed by the video and image processing research community is adapted for use in cinema production.


Chapter 2

Ingest

When pictures first enter the post production house, a number of processes are applied to them as a matter of course. These actions are generally known collectively as Ingest. The data has to be converted into the right file formats and also checked for quality. Typically every post house has some proprietary data network and database tools for handling data while it is being manipulated. Despite the prevalence of digital cameras, analog film is still used today. Some houses therefore have their own scanning facility, although separate film scanning services do exist. We do not consider the data handling and management issues here, but needless to say, the association of descriptive metadata with every picture record has become a key enabler. The ingest stage therefore involves a number of well established pre-processing tasks in the pipeline:

1. Film Scanning / Data Collection

2. Metadata input

3. Image conditioning: dust busting, noise reduction

4. A possible analysis step for the collection of automated metadata

After scanning, the images are viewed and decisions are made about how they should be treated. Shots are catalogued and marked up, and then treatments are scheduled. Shots and clips are reviewed for quality and some are marked for early stage treatment in Ingest. As picture processing technologies mature and become more automated, some tools that would previously be considered part of the creative/interactive element of the post production process have moved into the Ingest stage. Noise/grain reduction and dust busting have therefore increasingly become standard tools. Line scratches often still occur in modern film and their automated removal remains difficult.

One well used process in post production is dust busting. This is a classic bread and butter task required for removing the small corrupted pixels caused by dust stuck to the film or, in the case of old film, data loss caused by tears or holes. In modern digital footage, data loss caused by inactive pixels, or pixels locked in a certain state, does occur. Noise in modern footage occurs quite often because of the low exposure that Directors of Photography sometimes use. Many dust busting algorithms derive from a well founded set of algorithms based on the early work of [55, 28]. It is a sufficiently interesting problem for it to be worth taking a look at some of the ideas behind a successful algorithm. This chapter first reviews some of the defects that can occur on film and video and then highlights some aspects of the algorithms that treat them.

2.1 Defects in Pictures

There is a huge range of defects that can be observed in video and film. As far as archived material is concerned, the BRAVA consortium (brava.ina.fr), during 2000-2002, was the first to make some attempt to catalogue these and educate the community about their nature. Figures 2.1-2.10 present part of this taxonomy of defects in an attempt to educate the SIGGRAPH readership as to their names as they are used in the archive industry. Missing data problems manifest as in Figures 2.1, 2.2, 2.4 and 2.6. Massive loss of data can also occur, as in Figures 2.3 and 2.7. These are better treated through temporal frame interpolation and the reader may see [28, 27] for some treatment of this issue. A possible solution for Kinescope Moire, as illustrated in Figure 2.5, can be found in [50, 51]. Two inch scratches (caused by scratching of old Two Inch video tape) are an example of a specialised missing data problem and a treatment can be found in [46, 32].

Two major defects are missing from the visual description: Shake and Flicker. Those are best viewed as video clips. Shake simply refers to unwanted global motion of the picture caused either by camera movement or problems during scanning. Algorithms for removing shake abound [23, 60, 59, 24, 42, 61]. This is principally because it is related to the global motion estimation problem that is also important for video compression [35, 19, 20, 12, 52, 41].

Flicker manifests itself as a fluctuation in picture brightness from frame to frame. In stills the effect is very difficult to observe indeed, but as a sequence the effect is often very disturbing. Two different types of degradation result in a perceptible flicker artefact. Both changing film exposure (in old silent movies for instance) and varying lighting conditions result in luminance fluctuations; the first realistic de-flicker algorithm for this was developed by van Roosmalen [45, 44] and a real time hardware version was developed by Snell and Wilcox in the late 1990s. However, a misalignment of the two optical paths in a telecine machine also yields the same visible artefact, called Twin Lens Flicker. In that case, the two fields of each interlaced TV frame are incorrectly aligned with respect to the original film frame, and the observed fluctuations are due more to the shake between the fields than to any real luminance changes. Vlachos et al. considered this problem in [62] and a real time implementation was also developed by Snell & Wilcox in the late 1990s.


Figure 2.1: Dirt and Sparkle occurs when material adheres to the film due to electrostatic effects (for instance) and when the film is abraded as it passes through the transport mechanism. It is also referred to as a Blotch in the literature. The visual effect is that of bright and dark flashes at localised instances in the frame. The image indicates where a piece of Dirt is visible.


Figure 2.2: Film Grain Noise is a common effect and is due to the mechanism for the creation of images on film. It manifests itself slightly differently depending on the film stock. The image shows clearly the textured visible effect of noise in the blue sky at the top left. Blotches and noise typically occur together and are the main form of degradation found on archived film and video. A piece of Dirt is indicated on the image.

Figure 2.3: Betacam Dropout manifests itself due to errors on Betacam tape. It is a missing data effect in which several field lines are repeated for a portion of the frame. The repeating field lines are the machine's mechanism for interpolating the missing data.

Video clips showing serious degradation by shake, flicker, lines, grain and blotches can be seen at www.sigmedia.tv/Research/DigitalFilmRestoration. The book by Read and Meyer [43] gives an excellent coverage of the physical nature of archive material and the practices in the film preservation industry.

2.1.1 Modern Footage

As a matter of course, film scanners result in images which are degraded by very slight amounts of dust or holes, giving dark or white pixels. Figure 2.11 shows a typical example of a scanned 2K film frame at 2048 x 1556 showing mild degradation which must be dealt with. Increasingly we are seeing noise in digital footage because of low exposure in cameras such as the RED.

Finally, as digital film material is compressed, and as producers mix high and medium quality media, the post production house has to deal with blocking artefacts and mosquito noise (decompression artefacts near sharp edges) in certain shots.


Figure 2.4: Digital Drop Out occurs because of errors on digital video tape. This example is drop out from D1 tape.

Figure 2.5: Kinescope Moire is caused by aliasing during Telecine conversion and manifests itself as rings of degradation that move slightly from frame to frame.

Figure 2.6: Film Tear is simply the physical tearing of a film frame, sometimes due to a dirty splice nearby.

Figure 2.7: Vinegar Syndrome often results in a catastrophic breakdown of the film emulsion. This example shows long strands of missing data over the frame.

Figure 2.8: Echoes and Overshoots manifest as image shadows slightly displaced from each object in the frame. When the effect is severe it is called Echo and when it is just limited to edges it is called Overshoot, as in this case.

Figure 2.9: Colour Fading implies that the picture colour is not saturated enough, giving the image a washed out look.


Figure 2.10: Line Scratches can appear in much archived footage. They also occur due to accidents in film developing. The colour of the scratch depends on the side of the film layer on which it occurs. It is often the case that not all the image information inside the scratch is lost. They are a challenge to remove because they persist in the same location from frame to frame.

Figure 2.11: Showing three frames from a scanned film sequence at 2K resolution. Dust manifests itself in the central frame as bright spots. In modern film, the blotches are quite small.


2.2 Dust Busting Algorithms

The act of repairing the missing data in film frames has come to be known as Dust Busting. In early academic literature all missing data defects were referred to as blotches. Almost all automated dust busting tools rely on the idea that dust does not exist in the same location in consecutive frames, and so inter-frame differencing along motion trajectories leads to a reasonable detector. The most successful approaches have been detect-then-interpolate approaches: a low cost detector is used to detect blotches and then a hole-filling algorithm is used for the reconstruction or inpainting step.

The earliest work on designing an automatic system to 'electronically' detect Dirt and Sparkle was undertaken by Richard Storey at the BBC [55, 54] as early as 1983. The design was incorporated directly into hardware which was subsequently used in-house for video restoration before broadcast. The idea was to flag a pixel as missing if the forward and backward pixel differences were both high. This idea was of course beset with problems in parts of the image where motion occurred. The natural extension of this idea was presented by Kokaram around 1993 [28, 26], which allowed for motion compensated differences. That type of detector was called a "Spike Detection Index" (SDI) and the most useful variants are defined as follows.

2.2.1 SDIx

The forward and backward motion compensated pixel differences $E_f$, $E_b$ of the observed, corrupted image sequence $G_n(\mathbf{x})$ are defined as follows.

$$E_b = G_n(\mathbf{x}) - I_{n-1}(\mathbf{x} + \mathbf{d}_{n,n-1}(\mathbf{x}))$$
$$E_f = G_n(\mathbf{x}) - I_{n+1}(\mathbf{x} + \mathbf{d}_{n,n+1}(\mathbf{x})) \qquad (2.1)$$

Note that the previous and next frames are assumed to be uncorrupted at the required motion compensated sites, hence $G_{n-1} = I_{n-1}$ etc. Two detectors can then be proposed [28] as follows.

$$b_{\mathrm{SDIa}}(\mathbf{x}) = \begin{cases} 1 & \text{for } (|E_b| > E_t) \text{ AND } (|E_f| > E_t) \\ 0 & \text{otherwise} \end{cases} \qquad (2.2)$$

$$b_{\mathrm{SDIp}}(\mathbf{x}) = \begin{cases} 1 & \text{for } (|E_b| > E_t) \text{ AND } (|E_f| > E_t) \text{ AND } \mathrm{sign}(E_f) = \mathrm{sign}(E_b) \\ 0 & \text{otherwise} \end{cases} \qquad (2.3)$$

Here, $b(\cdot)$ is a detection field variable set to 1 at sites that are corrupted by missing data, and $E_t$ is a user defined threshold for detection of discontinuity. SDIa is based on thresholding $E_f$, $E_b$ only. SDIp additionally applies the constraint that, if corruption does not occur in identical locations in consecutive frames and the brightness constancy assumption $I_{n-1} \approx I_{n+1}$ holds, one should expect the signs of the two difference signals to be the same. It is now accepted that SDIp is the better detector in almost all situations because of this additional constraint.
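To make the detectors concrete, here is a minimal numpy sketch of both rules. It assumes the previous and next frames have already been motion compensated towards frame $n$; the function and argument names are ours, not part of the course material.

```python
import numpy as np

def sdi_detect(g_n, prev_mc, next_mc, e_t, use_sign=True):
    """Flag missing-data pixels with the SDIa rule of (2.2), or the
    SDIp rule of (2.3) when use_sign is True.

    g_n     -- current, possibly corrupted frame G_n
    prev_mc -- frame n-1 motion compensated towards frame n
    next_mc -- frame n+1 motion compensated towards frame n
    e_t     -- discontinuity threshold E_t
    """
    e_b = g_n.astype(float) - prev_mc          # backward MC difference E_b
    e_f = g_n.astype(float) - next_mc          # forward MC difference E_f
    b = (np.abs(e_b) > e_t) & (np.abs(e_f) > e_t)
    if use_sign:
        # a true blotch differs from both neighbours in the same direction
        b &= (np.sign(e_b) == np.sign(e_f))
    return b.astype(np.uint8)
```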

2.2.2 ROD

In 1996, Nadenau and Mitra [34] presented another scheme which used a spatio-temporal window for inference: the Rank Order Detector (ROD). It is generally more robust to motion estimation errors than any of the SDI detectors, although it requires the setting of three thresholds. It uses some spatial information in making its decision. The essence of the detector is the premise that blotched pixels are outliers in the local distribution of intensity.

Defining a list of pixels as

$$\begin{aligned}
p_1 &= I_{n-1}(\mathbf{x} + \mathbf{d}_{n,n-1}(\mathbf{x}) + [0\;\;0])\\
p_2 &= I_{n-1}(\mathbf{x} + \mathbf{d}_{n,n-1}(\mathbf{x}) + [0\;\;1])\\
p_3 &= I_{n-1}(\mathbf{x} + \mathbf{d}_{n,n-1}(\mathbf{x}) + [0\;{-1}])\\
p_4 &= I_{n+1}(\mathbf{x} + \mathbf{d}_{n,n+1}(\mathbf{x}) + [0\;\;0])\\
p_5 &= I_{n+1}(\mathbf{x} + \mathbf{d}_{n,n+1}(\mathbf{x}) + [0\;\;1])\\
p_6 &= I_{n+1}(\mathbf{x} + \mathbf{d}_{n,n+1}(\mathbf{x}) + [0\;{-1}])\\
I_c &= I_n(\mathbf{x})
\end{aligned} \qquad (2.4)$$

where $I_c$ is the pixel to be tested, the algorithm may be enumerated as follows.

1. Sort $p_1$ to $p_6$ into the list $[r_1, r_2, r_3, \ldots, r_6]$ where $r_1$ is the minimum. The median of these pixels is then calculated as $M = (r_3 + r_4)/2$.

2. Three motion compensated difference values are calculated as follows:

If $I_c > M$:
$$e_1 = I_c - r_6, \quad e_2 = I_c - r_5, \quad e_3 = I_c - r_4$$

If $I_c \le M$:
$$e_1 = r_1 - I_c, \quad e_2 = r_2 - I_c, \quad e_3 = r_3 - I_c$$

3. Three thresholds $t_1, t_2, t_3$ are selected, with $t_3 \ge t_2 \ge t_1$. If any of the differences exceeds its threshold, a blotch is flagged as follows:

$$b_{\mathrm{ROD}}(\mathbf{x}) = \begin{cases} 1 & \text{if } (e_1 > t_1) \text{ OR } (e_2 > t_2) \text{ OR } (e_3 > t_3) \\ 0 & \text{otherwise} \end{cases}$$

The choice of $t_1$ is the most important. The detector works by measuring the 'outlierness' of the current pixel when compared to a set of others chosen from other frames. The choice of the shape of the region from which the other pixels are chosen is arbitrary.
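A per-pixel sketch of the three steps, in the same hedged spirit; the six neighbours are assumed to have been gathered by the caller using the motion vectors of (2.4).

```python
import numpy as np

def rod_detect(i_c, p, t1, t2, t3):
    """Rank Order Detector at one pixel. p holds the six motion-compensated
    neighbours p1..p6 from frames n-1 and n+1; thresholds obey t3 >= t2 >= t1."""
    r = np.sort(np.asarray(p, dtype=float))   # step 1: r[0] is the minimum
    m = 0.5 * (r[2] + r[3])                   # median of the six values
    if i_c > m:                               # step 2: outlier distances
        e = (i_c - r[5], i_c - r[4], i_c - r[3])
    else:
        e = (r[0] - i_c, r[1] - i_c, r[2] - i_c)
    # step 3: flag a blotch if any distance exceeds its threshold
    return int(e[0] > t1 or e[1] > t2 or e[2] > t3)
```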


2.2.3 Morphological/Median Filter Approaches

In the 1D case, Paisan and Crise [37] were the first to spot that one could use a median filtered signal as a rough estimate of the original signal before corruption by impulsive noise. The difference between the observed, degraded signal and this rough estimate will be high at sites corrupted by impulsive defects. This is because the rank order filter removes all outliers but leaves lower scale trends untouched. This idea can be extended to treat small missing data artefacts in archived video and film, known as Dust. These are generally just a few pixels in area (around 3 x 3 pixels), and hence only a small median or morphological window need be used. Using a larger window to detect larger artefacts causes problems, since more true image detail would then be removed, causing an increased number of false alarms. Joyeux, Buisson, Decenciere, Harvey, Tenze, Saito, Boukir et al. have been implementing these types of techniques for film restoration since the mid-1990s [21, 14, 17, 16, 7, 15, 8, 57, 58, 48]. Joyeux [21] points out that these techniques are particularly attractive because of their low computational cost. They perform well when the artefact is small and surrounded by a relatively low-activity homogeneous region. The high resolution of film scans therefore suits these tools.
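The whole scheme fits in a few lines. A sketch using a 3 x 3 median as the rough estimate of the clean frame, on a single-channel image; the threshold value is an illustrative assumption and would be tuned per film stock.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_dust_detect(frame, threshold=20.0, window=3):
    """Flag small impulsive defects as large deviations from a median
    filtered estimate; a 3x3 window suits dust of a few pixels in area."""
    rough = median_filter(frame, size=window)   # outliers gone, trends kept
    return (np.abs(frame.astype(float) - rough) > threshold).astype(np.uint8)
```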

2.2.4 MRF based approaches

Generally speaking, a good detector ought to incorporate both spatial and temporal information. The most natural way of doing this is to impose spatial smoothness on the blotch detector by modelling the blotch as a Markov Random Field. A set of MRF based detectors was developed by Kokaram et al. between 1990 and 2004 [28, 25]. The idea is ultimately to pose the problem as the optimal estimation of a detection field that is 1 at the sites of degradation and 0 otherwise.

The model for degradation is a replacement process with additive noise. A binary field $b(\mathbf{x})$ is introduced that is 1 at a site of missing data and zero otherwise. The degradation model can then be written as follows.

$$G_n(\mathbf{x}) = (1 - b(\mathbf{x}))\,I_n(\mathbf{x}) + b(\mathbf{x})\,c(\mathbf{x}) + \mu(\mathbf{x}) \qquad (2.5)$$

where $\mu(\cdot) \sim \mathcal{N}(0, \sigma_\mu^2)$ is the additive noise and $c(\mathbf{x})$ is a field of random variables that cause the corruption at sites where $b(\mathbf{x}) = 1$.
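Equation (2.5) is easy to simulate, which is useful for testing detectors against ground truth. A sketch on an 8-bit frame, with illustrative parameter values; note that real blotches are spatially coherent rather than independent per pixel, which is exactly what the MRF prior described below is meant to capture.

```python
import numpy as np

def degrade(i_n, blotch_prob=0.001, noise_sigma=2.0, seed=0):
    """Simulate the replacement-plus-noise model of (2.5), returning the
    corrupted frame G_n and the ground-truth detection field b."""
    rng = np.random.default_rng(seed)
    b = rng.random(i_n.shape) < blotch_prob            # binary field b(x)
    c = rng.integers(0, 256, size=i_n.shape)           # corruption values c(x)
    mu = rng.normal(0.0, noise_sigma, size=i_n.shape)  # additive noise
    g_n = np.where(b, c, i_n) + mu
    return np.clip(g_n, 0, 255), b.astype(np.uint8)
```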

Degradation information can be included in the pixel site information by defining a pixel as occupying one of six states. Each of these states $s(\mathbf{x}) \in [S_1 \ldots S_6]$ is defined as a combination of three binary variables $[b(\mathbf{x}), O_b(\mathbf{x}), O_f(\mathbf{x})]$ as follows.

001 The pixel is not ‘missing’ and there is occlusion in the forward direction only.

010 The pixel is not ‘missing’ and there is occlusion in the backward direction only.

www.thefoundry.co.uk 16 www.sigmedia.tv

Page 17: VISUAL ALGORITHMS FOR POST PRODUCTIONwebstaff.itn.liu.se/~jonun/web/teaching/2009-TNCG13/Sig...1.2. SCHEDULE CHAPTER 1. COURSE OVERVIEW developed in academia. Prerequisites: We assume

CHAPTER 2. INGEST 2.2. DUST BUSTING ALGORITHMS

000 The pixel is not ‘missing’ and there is no occlusion backward or forward.

100 The pixel is corrupted by a Blotch and there is no occlusion backward or forward.

101 The pixel is corrupted by a Blotch and there is occlusion in the forward direction only.

110 The pixel is corrupted by a Blotch and there is occlusion in the backward direction only.

Note that in this framework the [1 1 1] state is not allowed, since it would imply that the data is missing and yet there is no temporal information for reconstruction. This is an interesting practical omission.

A Bayesian Framework

From the degradation model of (2.5), the principal unknown quantities in frame $n$ are $I_n(\mathbf{x})$, $s(\mathbf{x})$, $c(\mathbf{x})$, the motion $\mathbf{d}_{n,n-1}$ and the model error $\sigma_e^2(\mathbf{x})$. These variables are lumped together into a single vector $\theta(\mathbf{x})$ at each pixel site $\mathbf{x}$. The Bayesian approach presented here infers these unknowns conditional upon the corrupted data intensities from the current and surrounding frames $G_{n-1}(\mathbf{x})$, $G_n(\mathbf{x})$ and $G_{n+1}(\mathbf{x})$. For the purposes of missing data treatment, it is assumed that corruption does not occur at the same location in consecutive frames, thus in effect $G_{n-1} = I_{n-1}$, $G_{n+1} = I_{n+1}$.

Proceeding in a Bayesian fashion, the conditional may be written in terms of a product of a likelihood and a prior as follows:

$$p(\theta | I_{n-1}, G_n, I_{n+1}) \propto p(G_n | \theta, I_{n-1}, I_{n+1})\, p(\theta | I_{n-1}, I_{n+1})$$

This posterior may be expanded at the single pixel scale, exploiting conditional independence in the model, to yield

$$\begin{aligned}
p(\theta(\mathbf{x}) | G_n(\mathbf{x}), I_{n-1}, I_{n+1}, \theta(-\mathbf{x}))
&\propto p(G_n(\mathbf{x}) | \theta(\mathbf{x}), I_{n-1}, I_{n+1})\, p(\theta(\mathbf{x}) | I_{n-1}, I_{n+1}, \theta(-\mathbf{x}))\\
&= p(G_n(\mathbf{x}) | I_n(\mathbf{x}), c(\mathbf{x}), b(\mathbf{x}))\\
&\quad \times p(I_n(\mathbf{x}) | \sigma_e(\mathbf{x})^2, \mathbf{d}(\mathbf{x}), O_b(\mathbf{x}), O_f(\mathbf{x}), I_{n-1}, I_{n+1})\\
&\quad \times p(b(\mathbf{x}) | B)\, p(c(\mathbf{x}) | C)\, p(\mathbf{d}(\mathbf{x}) | D)\, p(\sigma_e(\mathbf{x})^2)\\
&\quad \times p(O_b(\mathbf{x}) | O_b)\, p(O_f(\mathbf{x}) | O_f)
\end{aligned} \qquad (2.6)$$

where $\theta(-\mathbf{x})$ denotes the collection of $\theta$ values in frame $n$ with $\theta(\mathbf{x})$ omitted, and $B$, $C$, $D$, $O_b$, $O_f$ and $I$ denote local dependence neighbourhoods around $\mathbf{x}$ (in frame $n$) for the variables $b$, $c$, $\mathbf{d}$, $O_b$, $O_f$ and $I_n$, respectively.


Figure 2.12: Two consecutive images from a sequence showing severe motion blur due to fast motion. Motion vectors are superimposed on the second frame (right). This pathological motion completely confuses motion estimators, leading to erroneous detection of large blotches. Observe in particular how the motion blurred regions are behind sharp stationary objects.

Needless to say, the final algorithm amounts to some energy minimisation, trading off interframe image differences against spatial smoothness of indicators and image material. For algorithmic details the reader is invited to consult [28]. The optimisation strategy used then has since been superseded by much more complete strategies like Belief Propagation and Graph Cuts. Practical constraints imposed by the massive sizes of images used in post production, however, imply that it is important to derive memory efficient techniques for implementing these methods.

2.2.5 Reconstruction

Once the missing data is detected, the sites have to be filled in. Early work in this area exploited motion interpolation, colour multistage median filters and even autoregressive texture interpolation [28]. The notion of image filling, though, has since matured into the area known as Inpainting [4], and there is a vast array of options that can be used here. However, in post production the fidelity of the interpolated data is extremely important. In addition, the large picture sizes imply that memory efficiency is paramount. Thus the techniques pioneered by Efros [13] or Sapiro [4] are to be used with some care. Because of the strong temporal correlation in a sequence of pictures, and the success of motion interpolation ideas [25] for preserving image integrity, it is generally a variant of those techniques that is used for hole filling in this problem.


2.2.6 Still work to do

Automated techniques for dust busting are commercial products today. However, they all suffer when the underlying sequence model breaks down. In that situation the motion information becomes error prone. Large defects remain difficult to remove automatically. Good reconstruction has been obtained by motion interpolation [29], but the detection of large defects is often confused with pathological motion in sequences. Thus fast moving hands, clothing or vehicles are always difficult for a Dirt Detector to ignore. In addition, dirt does not completely obscure the underlying image, hence a binary detection field is ultimately not good enough. What is missing in the literature is improved detection in the presence of pathological motion combined with a treatment of non-binary occlusion. Figure 2.12 shows the severity of this problem.

2.3 Noise and Grain

It is generally expected by researchers working on noise reduction algorithms that the requirement is to completely remove any noise in the image sequence. This is not so. Flat pictures, even with all the detail preserved, look bad. Film grain is essential for the film look. While most workers in this area concentrate on preserving detail while removing noise, what is required by the post production industry is in fact noise reduction that reduces the level of grain but leaves some behind. The practice today is to remove the grain as much as possible, then add grain of the required type back into the picture. Compounding the problem is that film grain is a multiplicative process, and the amount of degradation/grain varies with brightness: there is more grain at high brightness levels than at low brightness levels. There is very little published work targeted at film grain noise reduction in particular. One might expect that most of the existing noise reduction techniques could be adapted to meet these new requirements, but reducing grain levels while leaving the correlation structure of the grain untouched is a difficult issue [47].

After ten years of research and industrial development in film restoration, sharpening and enhancement are about to become much more important for broadcast applications. This is because of the advent of consumer high definition displays and DVD players. However, it is well known that high levels of noise militate against good sharpening results, e.g. superresolution. It remains to be seen how these problems can be resolved in degraded film material, since the noise levels, especially of film grain, can be very high and possess a non-white correlation structure.


2.4 Motion Estimation

It turns out that estimating the optic flow of every image in a film sequence is one of the most common tasks in post production. It has become so important that some argue it should be done at ingest and the motion information stored as metadata with each clip. Motion estimation has been studied by the academic community for some time and the principal issues are well understood [30, 5]. Those working in Computer Vision tend to refer to the process as Optic Flow estimation, while those in the video processing and coding community refer to it as Motion Estimation or Local Motion Estimation. This is quite different from the process of object tracking, which typically requires establishing a bounding box around an object moving through several frames. The principal requirement for motion estimation information in post production is that it should be usable for manipulating the smallest visible image detail. In addition, processes that exploit motion should not damage the picture in any way. These requirements are much more demanding than the use of motion estimation for video coding or some of the activities in the academic community. In video coding, erroneous motion vectors cause an increase in bit rate but the impact on picture quality is not necessarily visible.

In practice, a usable motion estimator combines a motion detection process with the actual motion estimation process. In locations where motion is detected, some kind of matching process is used to generate motion vectors that explain the evolution of the image sequence and at the same time result in a piecewise smooth vector field.

All the different processes for estimating motion rely on an assumption about the evolution of pixel data through image sequences. A typical model describing translational motion is as follows.

$$I_n(\mathbf{x}) = I_{n-1}(\mathbf{x} + \mathbf{d}_{n,n-1}) + e(\mathbf{x}) \qquad (2.7)$$

Here $I_n(\mathbf{x})$ is the pixel intensity at site $\mathbf{x}$ in frame $n$, $e(\cdot) \sim \mathcal{N}(0, \sigma_e^2)$, and $\mathbf{d}_{n,n-1}$ is the motion that maps site $\mathbf{x}$ in frame $n$ into the corresponding site in the previous frame $n-1$ along the direction of motion. The motion estimation problem is to estimate $\mathbf{d}$ at all pixel sites that are at moving object locations. Solution of this expression for motion is a massive optimisation problem, requiring the choice of values for $\mathbf{d}$ at all sites to minimise some function of the error $e(\mathbf{x})$. Many motion estimators can be derived from the idea of choosing a value of $\mathbf{d}$ that results in the minimum DISPLACED FRAME DIFFERENCE (DFD), defined as

$$\mathrm{DFD}(\mathbf{x}) = I_n(\mathbf{x}) - I_{n-1}(\mathbf{x} + \mathbf{d}_{n,n-1}) \qquad (2.8)$$

The problem is complicated by the fact that the motion variable $\mathbf{d}$ is an argument to the image function $I_{n-1}(\cdot)$. The image function generally cannot be modelled explicitly as a function of position, and this makes access to the motion information difficult. Note that the motion equation above cannot be solved at a single pixel site. There are two unknowns, the x and y components of motion, so at least two equations are required, i.e. at least two pixel sites.
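Exhaustive block matching is the most direct way around this: gather a block of pixel sites, then search for the integer displacement minimising the summed |DFD| of (2.8) over the block. A minimal sketch follows; block and search sizes are illustrative choices, not values from the course notes.

```python
import numpy as np

def block_match(i_n, i_prev, y, x, block=16, search=8):
    """Return the integer (dy, dx) minimising the sum of absolute DFDs of
    (2.8) over the block whose top-left corner is (y, x) in frame n."""
    ref = i_n[y:y + block, x:x + block].astype(float)
    best_cost, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if (yy < 0 or xx < 0 or yy + block > i_prev.shape[0]
                    or xx + block > i_prev.shape[1]):
                continue                      # candidate falls off the frame
            cand = i_prev[yy:yy + block, xx:xx + block].astype(float)
            cost = np.abs(ref - cand).sum()   # summed |DFD| over the block
            if cost < best_cost:
                best_cost, best_d = cost, (dy, dx)
    return best_d
```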


Figure 2.13: Left to right: Previous frame, Current frame, Motion estimation vectors superimposed on the current frame, Next frame. Blue vectors point into the past while green vectors point into the future. Pathological motion of the left hand, due to fast motion leading to blurring and rapid change of object geometry, causes poor motion field estimation. Note how the left hand in the current frame has an appearance that is very divergent compared to the same hand in the other frames. The vectors here point in the same direction whether mapping to the future or past, and this should not be the case. Also note the divergence of the blue vector field in this region.

A Bayesian framework allows the derivation of all the known motion estimation processes. The idea is to choose the motion vector $\mathbf{d}$ such that it maximises the probability $p(\mathbf{d} | I_{n-1}, I_n, I_{n+1})$, which is the probability of motion $\mathbf{d}$ given the image data in the current and surrounding frames. Useful solutions only result when some prior constraints are placed on the motion field, e.g. spatial and temporal smoothness, as well as incorporating the notion that some image areas can be occluded and uncovered as the object moves. There is a large amount of literature on this [53, 30, 5, 28, 49] and the more modern optimisation strategies recently explored by the academic community, e.g. Graph Cuts and Belief Propagation, have yielded much better performance. The expression to be manipulated is typically as follows.

$$p(\mathbf{d} | I_{n-1}, I_n, I_{n+1}) \propto p(I_{n-1}, I_n, I_{n+1} | \mathbf{d})\, p(\mathbf{d} | D) \qquad (2.9)$$

where important motion field constraints are imposed by designing $p(\mathbf{d} | D)$ appropriately.

In hardware, block matching and phase correlation motion estimation are popular. They are both a form of direct solution of (2.9), although most implementations ignore the constraints on the motion field. In software, gradient-based motion estimation is popular. These latter processes linearise the image function about some initial estimate of motion and result in iterative update schemes.


2.4.1 Practice

Useful motion estimation strategies in post production involve multi-resolution estimation with a pyramid of downsampled images. This handles fast motion better and is more memory and speed efficient than a single full resolution strategy. Figure 2.13 shows a typical output from a motion estimator described in [28]. There is no agreement on the required resolution of motion for very high definition scanned images, e.g. 2K or 4K plates. While at standard definition (720 x 576) it is widely agreed that 1/8 pel accuracy is desirable, at 2K or 4K, which are at least ten times larger, that accuracy may be irrelevant. Memory and speed savings can be had by using integer or 0.5 pel (pixel) accuracy without loss of picture detail. Perhaps the most important issue, though, is how to deal with situations in which motion estimation simply does not apply, i.e. pathological motion (PM). Self-occluding clothing, fast motion leading to blurry images, translucent objects and reflections in polished surfaces all cause most motion estimators to fail. Detection of failure is difficult since the DFD is not necessarily a good indicator. Figure 2.13 shows both good motion vector estimation and failure due to PM in the same scene. The motion of the arm in this case is too fast at 25 fps for the motion estimator to perform well. It is reasonable to suggest that the proper action on detecting failure depends on the application of the motion information.
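A sketch of the multi-resolution idea, assuming a block matcher like the one sketched earlier in this chapter: estimate at the coarsest level with a wide search, then at each finer level double the vector and refine it with a small search around that prediction. The pyramid construction and level count are illustrative.

```python
import numpy as np

def build_pyramid(frame, levels=4):
    """Simple 2x2-mean pyramid; level 0 is full resolution."""
    pyr = [frame.astype(float)]
    for _ in range(levels - 1):
        f = pyr[-1]
        h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
        pyr.append(f[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

# Coarse-to-fine use: estimate d at pyr[-1] with a wide search, then at each
# finer level predict d <- 2*d and refine with a search of only 1-2 pixels.
```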

Kent et al. [22] propose a very simple scheme for dealing with PM when motion is used for deblotching. In that scheme, it is assumed that any area undergoing camera or global motion will have a low DFD. In other areas obeying other, local motion, an MRF dirt detector is biased to be more conservative. The local/global mask is generated using motion information, thus being robust to dirt degradation. This deals well with the mild levels of degradation in most post production cases. It causes dirt to be left in small parts of foreground objects that are moving rapidly, but removes dirt in the rest of the image that is well matched with previous frames. Bornard et al. [6] and Corrigan et al. [11] propose a discontinuity detection approach that relies on the observation that a failing motion estimator will cause fake motion mismatches in the same location in many consecutive frames. Corrigan also exploited the idea that local divergence of vector fields generally indicates failure. Rares et al. [2, 3, 1] have proposed schemes based on image classification. In the case of dust busting, it is best to leave the original picture alone in areas of PM. In frame interpolation or frame rate conversion, though, it is much more difficult to propose an effective course of action.

2.5 Summary

The requirements of image integrity and automation at ingest are demanding. Motion estimation has become an important part of the post production chain, but the principal outstanding issue is detection of failure and deciding what action to take in that instance. This issue has not been well addressed in the academic community.


Chapter 3

Basics of Compositing

As discussed in the short history given in the introduction, there is historically a large range of image manipulation systems, and a large number are in current use. It is not the aim of the course to describe the benefits and compromises of the different systems. Instead, we will concentrate on describing the attributes of node-based compositing, which is the most common style of representing a compositing problem in the feature film domain. There are a number of such systems out there; the example in this case is Nuke, but the basic concepts translate easily between various commercial and bespoke alternatives.

Figure 3.1: A simple compositing graph in NUKE.


Figure 3.2: The input and output from the simple compositing graph in Figure 3.1.

3.1 Node-based Compositing

Node-based compositing can be thought of as a visual programming tool for describing an acyclic graph of image processing operations. For example, Figure 3.1 shows a simple graph or tree with no branches. At the top of the tree, the leaf node is a Read node, here loading an image sequence which is a set of dpx files on disk. Underneath that, we have connected the Read node to a Blur node, which performs a Gaussian filtering operation on all the channels in the image. Underneath that we have connected a viewer. The artist can step through the sequence and view the result at each frame. In this case the tree converts the image in the way shown in Figure 3.2.

Obviously this isn't a terribly exciting graph, so let's jazz it up a little and create the tree shown in Figure 3.3. The graph now contains two new nodes. The first is a Merge, which in this case is multiplying each pixel in the Gaussian filtered output by the corresponding pixel from the original. The ColorCorrect node underneath increases the gain of the image by a factor of 4. In this case the output of the Viewer is different, as shown.
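Stripped of its user interface, the graph in Figure 3.3 is just a per-pixel expression. Here is a numpy sketch of what it computes, assuming a float image in [0, 1] and an illustrative blur width; the function name is ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simple_comp(img, blur_sigma=10.0, gain=4.0):
    """Blur the source, multiply it back into the original (the Merge),
    then apply gain (the ColorCorrect). img is H x W x C, float in [0, 1]."""
    blurred = gaussian_filter(img, sigma=(blur_sigma, blur_sigma, 0))
    merged = img * blurred        # 'multiply' Merge of original and blur
    return gain * merged          # gain by a factor of 4
```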

Each node comes with a set of controls which can be used to adjust the parameters of the algorithm inside the node. For example, the controls for the Blur node are shown in Figure 3.4. The top two controls show that we are blurring all the channels in the image by a factor of 65. Many of the typical operations that can be inserted as nodes will be familiar to anyone with an image processing background.

There. Now you can composite film clips like a professional! Bear in mind that a typical film shot will have trees an order of magnitude more complex than our simple example. A more typical tree for a film effects shot is shown in Figure 3.5, zoomed out a long way so you can admire its complexity.


Figure 3.3: The output (right) from a more complex compositing graph (left).

Figure 3.4: Controls associated with a node.


Figure 3.5: A typical, complex tree.


So, why work this way? The tree representation gives us two important attributes. Firstly, the operations applied to the footage are non-destructive: there is a visible audit trail of the operations used to construct the effects for a shot. Secondly, the graph can be applied not just to this footage but to any footage, by swapping the inputs for new ones. This means that the tree is reusable in different shots.

Note that systems such as this are also carefully designed to be efficient. Nuke, for example, will typically be used with images which are either '2K' (say 2048 x 1556 pixels) or '4K' (4096 x 3112 pixels) in size, containing at least three floating-point image channels, frequently more. A typical tree such as the one in Figure 3.5 will have dozens of such inputs, and the system needs to handle the memory management and caching through the tree to allow the artist some measure of interactive feedback on the project they are working on. It is these data sizes which make image processing in this environment quite demanding.

For complex trees, these systems clearly do not perform in real time. To cope with the compute burden, compositing pipelines in major facilities share warehouses of render computers with an even greater resource drain: CG rendering. Weta Digital in New Zealand, for example, runs four computing clusters, each of which features prominently in the list of the top 500 supercomputing facilities worldwide.


Figure 3.6: A 3D environment for compositing. In this case, the camera can be used to introduce natural parallax into a shot without requiring the complexity of a full 3D environment.

3.2 Compositing in 3D

Most of us think of image processing as a 2D problem, but most compositing systems in film work have 3D capabilities. This is actually a fairly natural move to make. The real-world shots these systems operate on are clearly 3D in the first place, even though they are now represented by a 2D projection of that environment. If an artist wants to simplify a particular image processing problem, it makes sense to allow the scene to be represented by at least a minimal 3D representation, perhaps layers in space with a viewing camera. Figure 3.6 shows the 3D visualisation environment for a 2D compositing exercise. Compositing systems with this capability are used to load 3D assets from CG systems, to allow them to be composited in the context of live action. It also gives scope for compositing to exploit some of the 3D algorithms we touch on elsewhere, not least camera tracking.

3.3 Compositing in Stereo-3D

In the case where we are doing compositing work in multiple views (as opposed to true 3D as above), the node-compositing tree needs to be extended to be meaningful. The design shown during the course allows multiple views, two in the case of stereo, to be processed at the same time by the same node graph. This extension allows the use of more than one image stream without adding complexity. Nodes within the tree, or individual parameters within nodes, can be separated where different views do need separate treatment. More on this topic is presented when we discuss stereo-3D.


3.4 Summary

Node-based environments are a natural tinkering ground for algorithm developers. They come with a rich set of common operations, and allow easy configuration of components and examination of results. Invariably, they are also programmable by third parties, who can use built-in SDKs to develop their own nodes. In addition, the most common compositing systems can be batch-run on command lines and scripted using common languages like Python.
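As an illustration of that scripting route, the simple Blur graph from Section 3.1 could be built and rendered headlessly. The sketch below is written against Nuke's Python bindings as we understand them; treat the node constructors and knob names as assumptions to be checked against the Nuke documentation, and the file paths as placeholders.

```python
# Hypothetical batch script for a Nuke-style compositor; node classes and
# knob names are assumptions, and the file paths are placeholders.
import nuke

read = nuke.nodes.Read(file="plate.####.dpx")    # leaf node: sequence on disk
blur = nuke.nodes.Blur(size=65)                  # the Blur node of Figure 3.4
blur.setInput(0, read)
write = nuke.nodes.Write(file="out.####.dpx")    # render the result to disk
write.setInput(0, blur)
nuke.execute(write, 1, 100)                      # process frames 1..100
```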


Chapter 4

Challenges in Visual Compositing

Motion estimation, optic flow, texture synthesis and deblurring are mature areas of study in the academic community. Yet it is only recently that their use has been integrated into the post production workflow with any degree of confidence. This is principally because footage presented for post production is never well behaved. In this section of the course a sample of technologies that have made it into the workflow will be presented. The focus is on highlighting the difference between the initial algorithmic proposal from the academic community and what has to be done to get these algorithms to a useful level of performance.

4.1 Matting

Pulling a matte from a film or video sequence is one of the oldest exercises in film and television post production. It is used for direct manipulation of the position and nature of objects/actors in scenes in order to create new sequences not originally recorded. In the simplest case, the object is filmed against a green or blue screen. Then, in post production a combination of detailed manual contour delineation and colour based segmentation (i.e. all that is not blue or green is probably the object of interest) is used for creating a mask or matte. The mask is non-zero in the region of the object and zero otherwise. It describes the opacity of an object pixel at each location in the image. Thus a mask pixel setting of 1 indicates that that pixel is completely visible as the object of interest, while a mask pixel of 0 indicates that the corresponding object pixel is obscured or not available in some way. This mask or matte can then be applied to mix the captured footage containing the object of interest with footage recorded elsewhere. The act of creating new scenes in this way by combining different object elements onto a background image is called compositing.
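Concretely, applying a matte is a per-pixel linear blend of foreground over background. A minimal sketch in Python/numpy (function and parameter names are our own):

import numpy as np

def composite_over(fg, bg, alpha):
    # fg, bg: float arrays of shape (H, W, 3); alpha: (H, W) matte in [0, 1],
    # where 1 means fully visible foreground and 0 means background only.
    a = alpha[..., None]              # broadcast the matte over colour channels
    return a * fg + (1.0 - a) * bg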

An excellent introduction and background to the Matting problem can be found in the work of Chuang et al [9]. As summarised there, traditional methods for pulling video mattes are blue screen matting, rotoscoping and difference matting. Blue screen matting or chroma keying relies on capturing the foreground objects against a solid colour background and subsequently pulling the foreground matte by segmentation on the basis of colour. Rotoscoping relies on user-drawn editable curves (e.g. splines) around the foreground object of interest. Snap-to-edge operations as found in commercial packages like Adobe Photoshop, and introduced in previous articles [40], are useful user complements here. Difference matting relies on the generation of a scene containing only background elements (e.g. recording without actors). The image difference between this scene and a scene subsequently recorded with actors is then exploited to generate the matte. The matte here is therefore 1 when the difference is large and 0 otherwise (for instance).
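A crude difference matte along these lines takes only a few lines; the hard threshold below is a hypothetical tuning knob and a simplification of what a production keyer would do:

import numpy as np

def difference_matte(frame, clean_plate, threshold=0.1):
    # Per-pixel colour distance between the live frame and the actor-free plate.
    diff = np.linalg.norm(frame.astype(np.float32) - clean_plate.astype(np.float32), axis=-1)
    return (diff > threshold).astype(np.float32)   # 1 where the difference is large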

While chroma keying demands a controlled environment for recording, rotoscoping can be achieved regardless of the complexity of the background environment. Employing tracking together with user assisted rotoscoping can greatly improve the utility of contour based approaches. The main limitation of rotoscoping is the inability to correctly express image formation at the boundary between foreground and background. Useable mattes should express the notion that around the boundaries of objects the recorded light is a mixture of the background and foreground elements [39]. Difference matting and to some extent chroma keying suffer from the problem that in regions where foreground and background colour are similar, user interaction is required to resolve the matte.

In [10, 9], the authors proposed a combination of limited user interaction followed by direct estimation of a non-binary alpha matte. This formulation resolves many of the problems with previous methods. The underlying idea begins with the user specifying what they term a trimap. This map divides the scene into regions known to be background (matte pixels set to 0), known to be foreground (matte pixels set to 1), and unknown matte regions. They correctly exploit the knowledge that the difficulty in pulling a very good and useable matte is the proper delineation of the mixing of light effect at the object edge. Hence they restrict attention to the unknown matte region, exploiting image information from the surrounding known matte regions. Their method is also able to exploit motion information to propagate mattes between user defined and delineated keyframes. They give convincing demonstrations of the matting of translucent material (smoke), traditionally a very difficult task.

Since 2001, the notion of Matting as an inference problem has been explored by several authors [56, 63, 31] with increasing degrees of success. What is interesting, however, is that in practice none of these systems operate with any degree of reliability for a wide variety of material. Artists typically really want the matte pulled for that 2 pixel wide strand of hair, and they really want temporally consistent mattes. Much of this can be achieved by allowing the artist to specify careful boundaries between the known and unknown layers, but this in itself is a painful task. In practice then, a workable matting solution requires both a method of generating garbage mattes as well as a tool for editing trimaps and rotoscoping. This implies further that tools for matte-propagation and roto-propagation are important in a real system.

4.1.1 Making a Practical Matte Puller

We experimented with Chuang's techniques like many others did at the turn of this century. However we found that the mattes, while they looked good when pulled from blue-screen backgrounds, were too active for practical purposes. Hence smoothness of the α matte was important. Furthermore, we found at the time that for real scenes, Gaussian mixture models for layer colours were not rich enough. Two modifications were therefore enough to allow us to build a new matte-puller [63].

We used an MRF prior on α to impose smoothness on the opacity, and in addition used a sampling scheme to improve the textural modelling capability of the foreground and background priors. The prior for alpha is therefore configured as follows.

p(\alpha \mid \alpha_{\mathcal{N}}) \propto \exp\left\{ -\sum_{k \in \mathcal{N}} \lambda_k \, |\alpha - \alpha_k| \right\} \qquad (4.1)

where the λ_k are the usual MRF hyperparameters, an eight-connected neighbourhood system is used, and the α_k are the current values of α at the eight sites surrounding x. The conditional distribution for α is now changed from that used by Chuang et al to yield the following.

p(\alpha \mid \mathbf{f}, \mathbf{b}, \cdot) \propto \exp\left\{ -\frac{\| \mathbf{c} - (\alpha \mathbf{f} + (1-\alpha)\mathbf{b}) \|^2}{2\sigma_e^2} - \sum_{k \in \mathcal{N}} \lambda_k \, |\alpha - \alpha_k| \right\} \qquad (4.2)

The specification for the priors on the foreground and background images is also different. In order to model texture, the work of Efros [13] is used to allow more believable samples for the foreground and background data at the current site to be drawn from the surrounding neighbourhood. Those samples can be used to create the colour models for the clusters as in Chuang, or as part of a Gibbs sampling scheme for sample generation. This yields a more powerful textural model for the foreground and background estimate.

The solution is decomposed into solving for α and f, b in two separate steps. A direct line search for α yields the required estimate. The new technique models the known foreground and background at a pixel site by drawing samples from suitably sized nearby regions in a process inspired by [13], using weights similar to Chuang. The size of this sample region scales adaptively, so for a pixel far from the known region, the region is larger than if it is close by. Given a set of pairs, values for alpha are generated and the triplet maximizing the joint distribution is selected as the MAP estimate.
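A minimal sketch of that triplet search for one unknown pixel follows; the coarse line search over α, the sample counts and the parameter values are our own illustrative choices, not the tuned implementation of [63]:

import numpy as np

def map_alpha_triplet(c, fg_samples, bg_samples, alpha_nbrs, sigma_e=0.05, lam=1.0):
    # Minimise the negative log of Eq. 4.2 over sampled (f, b) pairs and a
    # coarse line search in alpha. alpha_nbrs are the current alphas of the
    # eight surrounding sites; fg/bg samples come from the nearby known regions.
    best, best_energy = None, np.inf
    for f in fg_samples:
        for b in bg_samples:
            for a in np.linspace(0.0, 1.0, 21):
                r = c - (a * f + (1.0 - a) * b)
                data_term = np.dot(r, r) / (2.0 * sigma_e ** 2)       # likelihood
                smooth_term = lam * np.sum(np.abs(a - alpha_nbrs))    # MRF prior
                energy = data_term + smooth_term
                if energy < best_energy:
                    best, best_energy = (f, b, a), energy
    return best   # MAP (foreground, background, alpha) triplet for this pixel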


Issues

In principle this process is computationally cheaper than that of Chuang et al since it eliminates the need for clustering, Gaussian modelling, and the associated matrix manipulations. In addition, the solution yields much more believable and useful mattes for post production. Although other techniques attempt to enforce smoothness on α, it seems that the sampling scheme is as important. The point here is that further constraints on the triplet are necessary to create believable mattes. However, the notion of temporal smoothness remains difficult to impose, aside from using a 3D MRF on the α channel. We get workable results from 3D median filtering, and the user can adjust some smoothing parameters to help this (a minimal sketch of such filtering follows the questions below). Here are some other open questions.

• How do we make sure the composited picture looks good? Good looking layers and mattes are irrelevant if this constraint is not met. What constraints does this idea imply?

• How can rotoscope parameters for trimaps be propagated between frames?

• How can mattes themselves be propagated through a sequence?

• Is it possible to pull a garbage matte automatically?
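The 3D median filtering mentioned above can be prototyped directly on the stacked α volume; a minimal sketch using scipy, where the window sizes stand in for the user-adjustable smoothing parameters:

import numpy as np
from scipy.ndimage import median_filter

def smooth_matte_sequence(alpha_seq, t_window=3, s_window=3):
    # alpha_seq: float array of shape (frames, H, W); median filtering over
    # (time, y, x) suppresses frame-to-frame "chatter" in the matte.
    return median_filter(alpha_seq, size=(t_window, s_window, s_window))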

4.2 Retiming/Inbetweening/Frame Rate Conversion

The process of retiming changes the frame rate of a sequence of images. This implies creating images where none existed before, and that requires interpolating frames at sites inbetween existing frames. In fact this task has been around for a very long time under the guise of standards conversion. Many television broadcasters and makers of television equipment have been converting between 25fps and 30fps as a matter of course for PAL to NTSC conversion since the start of television itself. Notable early work was performed by the BBC R&D Department, Snell and Wilcox, and Philips. See [18] for an overview of now established ideas.

The key idea is shown in Figure 4.1. The problem is ultimately to reconstruct a frame at an arbitrary point in time between given frames. If the motion between that missing frame and the frames around it was known, then building the frame is a matter of some variant on a motion compensated blending operation. So the problem of image reconstruction here is in fact the problem of interpolating the motion field at this unknown location. It turns out that this problem is intimately related to the problem of new-viewpoint interpolation in multi-view scene capture and synthesis. Viewpoint synthesis is a well explored area in the academic community [33, 36] and most solutions there are based on exploiting the camera geometry between different views taken at the same point in time.
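As a baseline, the motion-free alternative to motion compensated blending is a simple cross-dissolve of the bracketing frames; the blurry top row of Figure 4.2 shows its failure mode. A minimal sketch:

import numpy as np

def blend_inbetween(frame_n, frame_n1, t=0.5):
    # Motion-free cross-dissolve at fractional time t between the two frames.
    return (1.0 - t) * frame_n + t * frame_n1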


Figure 4.1: Left: The essence of motion-based inbetweening is to reconstruct the missing motion field (green and blue vectors) at the new temporal location such that some version of cut and paste from frames n and n+1 will build the new inbetween picture showing the right movement. Right: Initial guesses for the new motion field can be had by using a proportion of the motion estimated between n and n+1 (black vectors) to be assigned to the appropriate sites in the inbetween frame. The initial guesses are shown as dashed coloured vectors.


However, in practice new-view synthesis based on object modelling does not necessarily result in pictures of good enough quality for post production. In fact, if the available views are close enough together, an approach based on frame rate conversion tends to yield more reliable inbetweens.

This equivalence between view synthesis and frame rate conversion was exploited in post production for the movie The Matrix. The effects for that movie popularised the notion of new-view synthesis and drew heavily from the artistic work of Tim MacMillan (www.timeslicefilms.com) in the early 1990's. Dr. Bill Collis1 worked on the early algorithms using ideas from standards conversion and not considering the multi-view geometry as such. It turns out the early work of Kokaram [28, 27] for missing frame interpolation can also be adapted for this problem, and together with Collis and Simon Robinson2 an automated inbetweening process was developed and deployed as plug-ins for various platforms. Simultaneous to this effort the team of Litwinowicz and Pierre Jasmin3 had also developed similar tools, but emphasising instead the use of very good user interfaces to allow interaction with the interpolated frames.

4.2.1 Frame Interpolation

Consider reconstructing a frame I_m which is, say, halfway inbetween frames I_n, I_{n+1} at times n and n+1. Assume also that the motion D = \{\mathbf{d}_{n,n+1}, \mathbf{d}_{n+1,n}\} is known, since the frames n, n+1 exist. Other motion fields D' = \{\mathbf{d}_{n,n-1}, \mathbf{d}_{n+1,n+2}\} may also be known. A Bayesian approach to the problem proceeds by stating the following.

p(\mathbf{d}_{m,n}, \mathbf{d}_{m,n+1}, s \mid I_n, I_{n+1}, D) \propto p(I_n, I_{n+1} \mid \mathbf{d}_{m,n}, \mathbf{d}_{m,n+1}) \, p(\mathbf{d}_{m,n}, \mathbf{d}_{m,n+1}, s \mid D_{nay}) \qquad (4.3)

where the prior p(\mathbf{d}_{m,n}, \mathbf{d}_{m,n+1}, s \mid D_{nay}) imposes some smoothness constraints on the interpolated motion vector at a site given the local neighbourhood of vectors D_{nay}, and the likelihood p(I_n, I_{n+1} \mid \cdot) ensures that the interpolated vectors match similar data in the previous and next frames. An occlusion state s allows some incorporation of occlusion and uncovering into the problem. The simplest choice for the likelihood function is a Gaussian distribution of frame differences given an interpolated vector. The choice of prior should encourage the interpolated field to agree with the existing temporal motion information and also result in smooth spatial behaviour. Smoothness is important here as it avoids picture break up as the objects move through the new frames.

The resulting optimisation problem of choosing the best vector field can be approached using any number of schemes including ICM, Belief Propagation, Graph Cuts and even one of the many Markov Chain techniques. Because of the large image sizes, however, memory and speed are important in practice. Hence approaches based on selecting between possible candidates at each pixel site tend to be more efficient. In a way ICM, Belief Propagation and Graph Cuts all require generation of solution candidates as a pre-requisite. In the frame interpolation problem candidates can come from the existing motion fields by simple linear interpolation, as illustrated in Figure 4.1. The reason that this technique tends to be more useful than a camera geometry based approach is that there are fewer assumptions here about the scene content between frames, and issues such as lens distortions and so on tend to be implicitly compensated in the estimation of the optic flow field between frames. This is not to say that a motion interpolation technique is better, just that it can be applied to more situations without too much effort; hence it is useful as an automated tool. Some interesting results are shown in Figure 4.2.

1 www.thefoundry.co.uk
2 www.thefoundry.co.uk
3 www.revisionfx.com
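A heavily simplified sketch of this candidate-based interpolation is given below: the scaled motion vector at each site serves as the initial guess in the spirit of Figure 4.1, with nearest-neighbour sampling and no occlusion handling or candidate refinement (both essential in practice):

import numpy as np

def interpolate_frame(frame_n, frame_n1, flow, t=0.5):
    # flow[y, x] = (dy, dx) is the estimated motion from frame n to frame n+1.
    H, W = frame_n.shape[:2]
    out = np.zeros_like(frame_n)
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y, x]
            # follow the scaled vector back to frame n and forward to frame n+1
            yn = np.clip(int(round(y - t * dy)), 0, H - 1)
            xn = np.clip(int(round(x - t * dx)), 0, W - 1)
            yp = np.clip(int(round(y + (1.0 - t) * dy)), 0, H - 1)
            xp = np.clip(int(round(x + (1.0 - t) * dx)), 0, W - 1)
            # motion compensated blend of the two bracketing frames
            out[y, x] = (1.0 - t) * frame_n[yn, xn] + t * frame_n1[yp, xp]
    return out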

4.2.2 Outstanding issues

The creation of a reliable automated retimer is difficult indeed. Aside from the problems with motion estimation itself, the main unanswered questions worth exploring are: i) how to detect when the motion estimator has failed; ii) what to do when it has failed; iii) how to detect occluded and uncovered regions, and what to do in those regions. When the motion between frames is pathological it is difficult to find a reliable fallback technique. Typically errors tend to be visible as strange warping of object boundaries, or hard edges where they should be soft. Generating consistently smooth interpolated frames also requires multiple passes over the sequence, and that entails a longer processing time. Litwinowicz et al have developed very good interactive tools that allow the user to create just the desired effect. In general, to create cinema quality pictures a good idea is to allow the user to retime the foreground and background data as two separate layers, and this has been used to good effect in tools from The Foundry and Re:VisionEffects. Figure 4.2 also shows problems when the frame rate of the sequence is just not good enough to capture the original motion. Motion interpolation in that case is bound to fail. Even more paradoxical is that when used for slow-motion effects, any small error in the frames is more visible after new frame generation because of the lower effective object speeds. These problems mean that while these tools are popular for many applications, for some time to come they will remain heavily user interactive at the cinema level and very conservative at the broadcast level.


Figure 4.2: Inbetweening with frame blending (top row) and motion interpolation (bottom row). The 2nd and 4th frames are interpolated halfway between frames 1,3 and 3,5 reading from left to right. The motion interpolated images look very good in areas where the motion field can be well estimated, e.g. head and body of the moving foreground. Inbetweening with motion is much better than frame blending without motion, as can be seen from the very blurry images created in the top row. The hand shows very poor reproduction in the motion interpolated frames because i) the frame rate is just not good enough to capture the fast motion there and ii) the camera exposure is too long and so there is substantial motion blurring.


Chapter 5

Stereo-3D

Stereo-3D1 presents interesting algorithmic challenges. Many of the issues are common to other research efforts in multiple-camera work. Some are peculiar to the end-goal of the medium: to provide comfortable stereo viewing.

Stereo-3D presents to the audience an approximation of normal human stereo vision. Why is it an approximation? The principal reason is that the viewer's focus is always on the screen-plane, regardless of object depth. This unnatural relationship with real-world views is uncomfortable for some. Also, some 5% of the adult population can't resolve depth from stereo at all, with maybe up to a quarter of the population having some milder depth-resolution deficit. Any defects in our stereo approximation to the real world can tip the balance and make it impossible to resolve. The full science of human vision perception is beyond the scope of this course. We will instead concentrate on some of the common post production issues which need attention.

As a side-note, the current stereo-3D push is a revival of an old medium. Key factors encouraging the current growth are the reliability of digital capture and the reliability of digital projection. In particular the investment in digital projection in modern cinemas has provided a wider outlet for stereo-3D material outside specialised venues.

5.1 Convergence and Keystoning

The point where cameras converge on an object in shot (ideally the point where their optical axesintersect) is known as the convergence point.

1 Note our nomenclature here. Variously referred to as stereo, stereogrammetry, stereographic, 3D, the concatenation used here is at least familiar to both the image processing community and the marketing side of the business (who favour 3D, which this isn't).


Figure 5.1: Parallax resulting from converging cameras.

At the convergence point, objects in the two views will have zero horizontal disparity, also known as zero parallax, and will appear to be the same distance away as the screen on which the scene is being viewed. For a scene shot with cameras that are converging - i.e. pointing or toeing-in towards each other - this will occur where a ray emerging from the front of one camera, perpendicular to its image plane, would meet a similar ray from the other camera (see Figure 5.1). Anything in front of this point will have negative parallax - the object in the left image will be to the right of the same object in the right image - and appear to be in front of the screen. Similarly, objects behind the point of convergence are said to have positive parallax and appear behind the screen.

Convergence can be changed in post production with a simple horizontal shift of one image relative to the other. The eyes can take a while to adjust to sudden changes in convergence, so it is desirable to try to minimise these when cutting between scenes, for example. Moving the point of convergence nearer to the cameras will have the effect of shifting everything else further away, while converging on a more distant object will bring everything closer. In this second instance, care must be taken to ensure that the scene stays inside the area which can comfortably be seen by the audience. During normal vision, our eyes converge on an object as we look at it, and focus on the object at the same time. However, when viewing a stereo presentation such as a 3D film, our eyes will always be focused on the screen, yet will be required to change convergence as the scene changes or as we look at different parts of it. To some extent, they are able to do this, but when the distance between the focal point and the convergence point becomes too great, viewers will experience discomfort. If an object appears too far in front of the screen, they will feel as if their eyes are crossing and may even be unable to fuse the left and right images into one. Similarly, if objects are pushed too far away, the positive parallax between the views could increase to the extent that the eyes would be required to diverge in order to fuse the two images. Divergence never occurs in normal vision; it is generally accepted that a small amount is acceptable in 3D cinema2, though this can make for an uncomfortable viewing experience, causing eye strain, headaches or nausea for the audience. Beyond this, the 3D effect will be lost and the audience will see two separate images of the object.
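The convergence change described above amounts to a one-line image operation; a minimal sketch with crude edge padding (numpy, our own naming):

import numpy as np

def shift_convergence(view, pixels):
    # Horizontally shift one eye by a whole number of pixels; positive values
    # shift right. Edge columns are padded by repetition rather than wrapped.
    shifted = np.roll(view, pixels, axis=1)
    if pixels > 0:
        shifted[:, :pixels] = shifted[:, pixels:pixels + 1]
    elif pixels < 0:
        shifted[:, pixels:] = shifted[:, pixels - 1:pixels]
    return shifted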

Stereoscopic footage is generally shot with one of two main physical camera configurations: parallel or converging. Each has its advantages and disadvantages3 and some stereographers have strong views on the subject of which is the right configuration to use4. The converging method is more akin to the operation of the human visual system, where our eyes converge to focus on an object of interest, and might therefore seem the more natural choice. The views from parallel cameras do not converge, so the desired convergence distance must be set in post by applying a horizontal shift to one or both images. When converging cameras are used, the convergence can still be adjusted in a similar manner, but is likely to have already been set to the desired value when the footage was acquired. However, although footage shot with converging cameras is less likely to need the convergence adjusted, this method of image acquisition does introduce an additional problem: keystoning.

The term keystoning refers to the perspective distortions introduced by the fact that two converging cameras view the scene from different directions, so that their image planes are not coplanar but are angled slightly with respect to each other (see Figure 5.2 and Figure 5.3).

Avoiding keystoning differences between the eyes can be done by using perspective correction (or shift) lenses, which essentially allow a correction to be applied during a shoot5. Such a lens allows the centre of projection to be shifted away from the centre of the image sensor (film or CCD) while keeping the image plane parallel to the sensor in order to avoid perspective distortion.

5.2 Disparities should be horizontal

When deriving depth cues from stereo-3D material, the human vision system expects objects to have a horizontal separation right-to-left. Filmed stereo-3D material can exhibit a vertical separation, which can cause eye-strain. Vertical separation comes from multiple sources, for example:

• Paired cameras on a rig need to be carefully aligned to avoid vertical disparities through misaligned camera geometry.

• Mismatched zooming.

• Keystoning effects from converging camera rigs.

2 See How a 3D Stereoscopic Movie is Made, 3-D Revolution Productions: http://www.the3drevolution.com/3dscreen.html
3 See Digital Praxis - Stereoscopic 3D: http://www.digitalpraxis.net/stereoscopic3d.htm
4 See Stereo VFX - Convergence: http://www.stereovfx.com/convergence2.html
5 See Perspective Correction Lens, Wikipedia article: http://en.wikipedia.org/wiki/Perspective_correction_lens for examples.

Figure 5.2: Image Planes of Converging Cameras.

Figure 5.3: Keystone Distortion.

For a nodal camera move, the keystoning disparity and other disparities introduced by nodal camera misalignments can be compensated for by calculating a suitable affine correction. There are no hard and fast rules here, but a reasonable approach is to perform feature-matching between left and right views, and then to find the parameters of a warp which minimises the vertical separation of matched features. This works well in practice, but fails to compensate for some rarer occurrences such as non-nodal camera displacement. In these cases, new view synthesis would be required to perform the correction. In practice, however, most defects can be fixed by assuming the camera misalignment is nodal.
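A minimal sketch of that alignment fit follows, solving for the vertical row of an affine warp by least squares; a production tool would use a robust estimator (e.g. RANSAC) to reject bad feature matches:

import numpy as np

def vertical_alignment(pts_left, pts_right):
    # pts_left, pts_right: (N, 2) arrays of matched (x, y) feature positions.
    # Solve y_left ~= a*x_right + b*y_right + c in the least squares sense,
    # so warped right-eye features land on the same scanline as their
    # left-eye partners.
    x, y = pts_right[:, 0], pts_right[:, 1]
    A = np.stack([x, y, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, pts_left[:, 1], rcond=None)
    return coeffs   # corrected right-eye rows: y_new = A @ coeffs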

5.3 Dense Disparity Estimation

An image - such as a single frame of a motion picture - is a two-dimensional (2D) representation of a three-dimensional (3D) scene, which by definition will contain structures at different depths. In an image taken from a slightly different viewpoint - for example, that of the second camera used to produce a stereo pair - the same structures will appear slightly shifted. Stereo disparity is the term used to describe the shift that occurs for a particular point in 3D space between the left and right images. In a stereo pair the cameras are offset horizontally, so this usually corresponds to a purely horizontal shift. However, the amount of the shift is not constant but can vary from pixel to pixel; it will depend on the depth of the corresponding object within the scene. If parallel cameras are used, objects in the foreground - close to the viewer - will shift significantly while those in the distant background might not move at all. In addition, as the viewpoint shifts from one eye to the other, areas that were visible before might become hidden, such as background areas that become occluded when a foreground object shifts across in front of them, or a surface of an object that becomes obscured when the object is viewed from a different angle.

Similarly, previously hidden areas can be revealed. Figure 5.4 shows an example of a stereo pair of views of a scene in which different areas are occluded and revealed in each camera's view.

We can build up a complete picture of the stereo disparity by estimating the change in position of every point in the scene between one view and the other.

Figure 5.4: Left and right view of the same scene.

Disparity estimation is a well-explored topic, and Stereo-3D provides a practical application for these techniques. Disparity estimation can be considered to be similar to local motion estimation, in that the goal is to estimate how each pixel moves from one image to another. However, local motion estimation is typically unconstrained, making no assumptions about the nature of the scene (other than local smoothness), and pixels are allowed to move in any direction. This is because local motion estimation is usually performed on two frames of a sequence, separated in time. The time separation between the frames means objects within them might have moved in the interim, and we have no prior knowledge of the motion of these objects. With a stereo pair of images there is no time separation, so the images should be unaffected by motion within the scene. The only motion between one view and another will be that resulting from the physical separation of the cameras. This has the advantage of being rigid body motion - scaling, rotation and translation only - and applies to the whole of the image, unlike the motion in a typical image sequence which could include local deformations (imagine a sequence showing a man walking down a street, for example). We therefore know that objects within the scene will be transformed only in this rigid way between one view and the other, and can use this additional knowledge to constrain the motion (i.e. disparity) estimation. In practice, we do this by detecting features in both views of the scene first of all, then calculating the stereo disparity of those features6. Given the stereo disparity of these pairs of features, we can then find a transform which maps points in one view to lines in the other view (the epipolar lines)7. The disparity estimation for the rest of the points in the image is then carried out under the condition that the corresponding pixels in the other image must lie along these lines. Because of this additional constraint, motion estimation done in this way tends to give more accurate results for the disparity than would local motion estimation performed on the same pair of images. See Figure 5.5 for an example image.

6 The features in question are points of interest corresponding to recognisable structures within the scene; for example, in a scene containing a wooden table they would include the corners of the table top and perhaps some conspicuous marks on the surface of it, such as a dark knot in the grain.

7 In order to be able to map points to points, we would need to know how far away the corresponding 3D points are from the camera. Without this, there is an additional, unsolved degree of freedom which means the points in the first view can lie anywhere along a line in the second.

Figure 5.5: Disparity estimate (on the left) and one source image (on the right) inside a compositing system. The disparity estimate is represented by using red and green as the magnitude of the x and y components of the field in one direction, and blue and alpha (not shown) in the other direction. The two directions are almost mirror fields of each other, with the exception of the occluded areas.
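A sketch of the feature-matching and epipolar-geometry step described above, leaning on OpenCV; the detector choice and parameters are illustrative, not a statement of what any particular product uses:

import cv2
import numpy as np

def fundamental_matrix(img_left, img_right):
    # Detect and match features, then fit the fundamental matrix robustly.
    # F maps a point [x, y, 1] in the left view to its epipolar line in the
    # right view, along which the dense disparity search is constrained.
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(img_left, None)
    k2, d2 = orb.detectAndCompute(img_right, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    p1 = np.float32([k1[m.queryIdx].pt for m in matches])
    p2 = np.float32([k2[m.trainIdx].pt for m in matches])
    F, inlier_mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC)
    return F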

The real problem with this and other techniques is occlusions. In the extreme example (a dense forest, for example) where few pixels in one eye are visible in the other, the disparity calculation is extremely unreliable. Possible solutions for static scenes would include using multiple frames to build up a dense 3D estimate of the scene data. Solutions for dynamic scenes could involve minimising optical flow over multiple frames. At the higher level of compute cost, however, doing this may be more onerous than simply living with unreliable disparity maps.

5.3.1 Moving the cameras

Disparity maps can be used to compute novel views somewhere in space between the original views, in a similar manner to how retiming can produce new samples in time. This provides a mechanism to modify the apparent camera geometry in post, in order to correct shots with excessive disparity. The occlusion flaws in the disparity maps for moving footage mean this is a good starting point for shots which in any case will involve some manual corrective work. It hasn't been shown to be a reliable automated process for the unassisted re-grading of depth in arbitrary clips.
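A minimal sketch of such view interpolation, forward-warping one eye by a scaled disparity field and ignoring occlusion filling entirely:

import numpy as np

def intermediate_view(left, disparity, t=0.5):
    # Forward-warp the left eye by a fraction t of its horizontal disparity.
    # Collisions simply overwrite and holes are left at zero; a usable tool
    # would need occlusion-aware filling.
    H, W = left.shape[:2]
    out = np.zeros_like(left)
    xs = np.arange(W)
    for y in range(H):
        xt = np.clip(np.round(xs + t * disparity[y]).astype(int), 0, W - 1)
        out[y, xt] = left[y, xs]
    return out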

It is an interesting area for further development, however. Historically, one of the main problems with shooting in stereo has been the need to make certain decisions about the desired 3D effect before the shoot takes place. To some extent, this problem has been reduced by the development of more advanced camera set-ups which allow for more flexibility during a shoot. However, it is still the case that some 3D properties are fixed after filming and have previously been difficult or impossible to alter as the finished film takes shape. A fundamental property of a 3D shot is the interaxial distance; this is the distance between the left and right cameras, and determines the perceived depth of the scene. Usually, this distance should be about the same as the distance between the viewer's left and right eyes8, but it can be adjusted in order to achieve specific effects. For instance, it might be necessary to increase it significantly in order to get a better sense of depth over a distant landscape, where filming with the standard interaxial separation would result in an essentially flat-looking scene (this is sometimes known as infinity flatness9).

Increasing the interaxial distance substantially can also result in hyperstereo or miniaturisation, where the exaggerated stereo effect means that the viewer feels herself to be massive in comparison to the scene being viewed. This is because the human brain is used to experiencing the amount of stereo separation that results from the separation of one's eyes. Anything greater fools the brain into thinking that the eyes must be further apart in relation to the scene in front of them than normal. Since the distance between the eyes is fixed, the brain can only make sense of this by reasoning that the scene must be smaller than it really is. This can have the effect of making the viewer feel like a giant, or alternatively like a normal human looking down at a scale model10. Similarly, a reduced interaxial separation can have the opposite effect, making the viewer feel she has shrunk in relation to the scene. Although it is usually desirable to avoid such effects, sometimes they can be invaluable tools in helping to tell a story or fulfil a director's artistic vision (it is easy to imagine how they could be used to good effect in a 3D version of the children's classic Alice in Wonderland, for example, as the heroine shrinks and grows in response to eating certain things).

In addition, maintaining the standard interaxial separation throughout - while being both realistic and comfortable for the viewer - can make for an unexciting stereo experience. After a while, the viewer's brain will adjust to the stereo effect and they might cease to appreciate the extra dimension once the initial novelty wears off11. Varying this separation - and thus the perceived depth - between scenes from time to time can help to provide the necessary stimulation to keep the viewer's brain alive to the differences between the 3D scene in front of them and their customary 2D viewing experience. However, any variation in depth must be used with caution, as people's eyes take a while to adjust to significant changes in the stereo separation, which means that fast cutting between scenes with significantly different interocular separations will be uncomfortable for the viewer and should be avoided12.

8 The 'interocular' separation, about 65mm for an average adult.
9 See Journey to the Center of the Earth Review, 3-D Revolution Productions: http://www.the3drevolution.com/3dreview.html
10 See Digital Praxis - Stereoscopic 3D: http://www.digitalpraxis.net/stereoscopic3d.htm
11 See Beowulf Review, 3-D Revolution Productions: http://www.the3drevolution.com/3dreview2.html
12 See James Cameron Supercharges 3D, Variety, 10 Apr 2008 (http://www.variety.com/article/VR1117983864.html?categoryid=2868&cs=1)


5.3.2 Moving compositing components

Disparity information can also be used in compositing applications to push data from one view into another. Imagine a rotoscoping task, using spline-based object outlines, performed in one eye only. The point data representing the shapes can be warped from one eye to the other by using the disparity estimate.
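A minimal sketch of that transfer for spline control points, using nearest-sample lookup with no bounds checking or occlusion handling:

import numpy as np

def transfer_roto_points(points, disparity):
    # points: (N, 2) array of (x, y) spline control points drawn in the left
    # eye; disparity: (H, W) horizontal disparity field from left to right.
    out = points.astype(np.float64).copy()
    for i, (x, y) in enumerate(points):
        out[i, 0] = x + disparity[int(round(y)), int(round(x))]
    return out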

5.4 Optical differences between the eyes

Colour discrepancies between the two views of a scene can also make it more difficult for the viewer to fuse objects in the scene or to view it comfortably. In order to reduce interaxial separation of the camera optical axes, flexible stereo rigs can be built with mirrors or prisms. This approach allows the optical axes to be more closely aligned than would otherwise be possible, given the large physical size of the cameras themselves.

While this improves the apparent camera geometry, it introduces a new issue. The use of a mirror or a prism in one view introduces a polarising effect in that view. This means that the appearance of highlights can be significantly different between the left and right views. Even with relatively diffuse lighting conditions, a colour density difference is visible between the eyes. It should also be noted that even without the presence of a mirror or prism rig, it is hard to guarantee that matched cameras will produce the same colour profile.

Correction of these differences in post usually requires skill and can be a painstaking process. However, some automated grading can be applied along the lines of the work of Pitié et al [38]. This work uses histogram matching techniques in order to transfer the colour distribution of one image to another. The transfer of missing highlights is a local phenomenon rather than a global histogram problem, and there are no good techniques currently available to the stereo-3D post production community.
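As a much reduced stand-in for the colour distribution transfer of [38], per-channel histogram matching conveys the basic idea; the published method transfers the full three-dimensional colour statistics rather than each channel independently:

import numpy as np

def match_histograms(src, ref):
    # Per-channel histogram matching of src onto ref; assumes float images.
    out = np.empty_like(src)
    for ch in range(src.shape[-1]):
        s = src[..., ch].ravel()
        r = np.sort(ref[..., ch].ravel())
        order = np.argsort(s)
        matched = np.empty_like(s)
        # the i-th brightest source pixel takes the correspondingly ranked
        # reference value
        matched[order] = r[np.linspace(0, r.size - 1, s.size).astype(int)]
        out[..., ch] = matched.reshape(src[..., ch].shape)
    return out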

The use of mirrors and prisms can also introduce a filtering effect in the reflected view which again is hard to compensate for, although there is some scope to experiment with blind (or even non-blind) deconvolution techniques.


Bibliography

[1] RARES, A., REINDERS, M. J. T., AND BIEMOND, J. Complex event classification in degraded image sequences. In Proceedings of ICIP 2001 (IEEE), ISBN 0-7803-6727-8 (Thessaloniki, Greece, October 2001).

[2] RARES, A., REINDERS, M. J. T., AND BIEMOND, J. Statistical analysis of pathological motion areas. In The 2001 IEE Seminar on Digital Restoration of Film and Video Archives (London, UK, January 2001).

[3] RARES, A., REINDERS, M. J. T., AND BIEMOND, J. Image sequence restoration in the presence of pathological motion and severe artifacts. In Proceedings of ICASSP 2002 (IEEE) (Orlando, Florida, USA, May 2002).

[4] BERTALMIO, M., SAPIRO, G., CASELLES, V., AND BALLESTER, C. Image inpainting. In SIGGRAPH '00: Proceedings of the 27th annual conference on Computer graphics and interactive techniques (New York, NY, USA, 2000), pp. 417–424.

[5] BLACK, M., AND ANANDAN, P. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding 63 (January 1996), 75–104.

[6] BORNARD, R. Probabilistic approaches for the digital restoration of television archives. PhD thesis, Ecole Centrale, Paris, 2002.

[7] BUISSON, O. Analyse de sequences d'images haute resolution, application a la restauration numerique de films cinematographiques. PhD thesis, Universite de La Rochelle, France, December 1997.

[8] BUISSON, O., BESSERER, B., BOUKIR, S., AND HELT, F. Deterioration detection for digital film restoration. In IEEE International Conference on Computer Vision and Pattern Recognition (June 1997), vol. 1, IEEE, pp. 78–84.

[9] CHUANG, Y.-Y., AGARWALA, A., CURLESS, B., SALESIN, D. H., AND SZELISKI, R. Video matting of complex scenes. In Proceedings of ACM SIGGRAPH (2002).


[10] CHUANG, Y.-Y., CURLESS, B., SALESIN, D. H., AND SZELISKI, R. A bayesian approach to digital matting. In Proceedings of CVPR (2001).

[11] CORRIGAN, D., HARTE, N., AND KOKARAM, A. Pathological motion detection for robust missing data treatment. EURASIP Journal on Advances in Signal Processing (2008).

[12] DUFAUX, F., AND KONRAD, J. Efficient, robust and fast global motion estimation for video coding. IEEE Transactions on Image Processing 9 (2000), 497–501.

[13] EFROS, A. A., AND LEUNG, T. K. Texture synthesis by non-parametric sampling. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (September 1999), vol. 2, pp. 1033–1038.

[14] FERRANDIERE, E. D. Motion picture restoration using morphological tools. Kluwer Academic Publishers, May 199, pp. 361–368.

[15] FERRANDIERE, E. D. Restauration automatique de films anciens. PhD thesis, Ecole des Mines de Paris, France, December 1997.

[16] FERRANDIERE, E. D. Mathematical morphology and motion picture restoration. John Wiley and Sons, New York, 2001.

[17] FERRANDIERE, E. D., AND SERRA, J. Detection of local defects in old motion pictures. In VII National Symposium on Pattern Recognition and Image Analysis (April 1997), pp. 145–150.

[18] HAAN, G. D., AND BELLERS, E. Deinterlacing - an overview. In Proceedings of the IEEE (September 1998), vol. 86, no. 9, pp. 1839–1857.

[19] HILL, L., AND VLACHOS, T. On the estimation of global motion using phase correlation for broadcasting applications. In Seventh International Conference on Image Processing and Its Applications (July 1999), vol. 2, pp. 721–725.

[20] HILL, L., AND VLACHOS, T. Global and local motion estimation using higher-order search. In 5th Meeting on Image Recognition and Understanding (MIRU 2000) (July 2000), vol. 1, pp. 18–21.

[21] JOYEUX, L., BOUKIR, S., BESSERER, B., AND BUISSON, O. Reconstruction of degraded image sequences. Application to film restoration. Image and Vision Computing 19 (2001), 503–516.

[22] KENT, B., KOKARAM, A., COLLIS, B., AND ROBINSON, S. Two layer segmentation for handling pathological motion in degraded post production media. In IEEE International Conference on Image Processing (October 2004), pp. 299–302.


[23] KO, S.-J., LEE, S.-H., JEON, S.-W., AND KANG, E.-S. Fast digital image stabilizer based on gray-coded bit-plane matching. IEEE Transactions on Consumer Electronics 45, 3 (August 1999), 598–603.

[24] KO, S.-J., LEE, S.-H., AND LEE, K.-H. Digital image stabilizing algorithms based on bit-plane matching. IEEE Transactions on Consumer Electronics 44, 3 (August 1998), 617–622.

[25] KOKARAM, A. On missing data treatment for degraded video and film archives: a survey and a new bayesian approach. IEEE Transactions on Image Processing (March 2004), 397–415.

[26] KOKARAM, A., MORRIS, R., FITZGERALD, W., AND RAYNER, P. Detection of missing data in image sequences. IEEE Image Processing (November 1995), 1496–1508.

[27] KOKARAM, A. C. Reconstruction of severely degraded image sequences. In Image Analysis and Processing (September 1997), vol. 2, Springer-Verlag, pp. 773–780.

[28] KOKARAM, A. C. Motion Picture Restoration: Digital Algorithms for Artefact Suppression in Degraded Motion Picture Film and Video. Springer Verlag, ISBN 3-540-76040-7, 1998.

[29] KOKARAM, A. C. On missing data treatment for degraded video and film archives: a survey and a new bayesian approach. IEEE Transactions on Image Processing 13 (March 2004), 397–415.

[30] KONRAD, J., AND DUBOIS, E. Bayesian estimation of motion vector fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 9 (September 1992).

[31] LEVIN, A., LISCHINSKI, D., AND WEISS, Y. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (2008), 228–242.

[32] MARSHALL, S., AND HARVEY, N. Film and video archive restoration using mathematical morphology. In IEE Seminar on Digital Restoration of Film and Video Archives (Ref. No. 2001/049) (January 2001), pp. 9/1–9/5.

[33] MANSOURI, A., AND KONRAD, J. Bayesian winner-take-all reconstruction of intermediate views from stereoscopic images. IEEE Image Processing 9, 10 (October 2000), 1710–1722.

[34] NADENAU, M. J., AND MITRA, S. K. Blotch and scratch detection in image sequences based on rank ordered differences. In 5th International Workshop on Time-Varying Image Processing and Moving Object Recognition (September 1996).

[35] ODOBEZ, J.-M., AND BOUTHEMY, P. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation 6 (1995), 348–365.


[36] WOODFORD, O. J., REID, I. D., AND FITZGIBBON, A. W. Efficient new-view synthesis using pairwise dictionary priors. In IEEE International Conference on Computer Vision and Pattern Recognition (June 2007), pp. 1–8.

[37] PAISAN, F., AND CRISE, A. Restoration of signals degraded by impulsive noise by means of a low distortion, non-linear filter. Signal Processing 6 (1984), 67–76.

[38] PITIÉ, F., KOKARAM, A., AND DAHYOT, R. Automated colour grading using colour distribution transfer. Journal of Computer Vision and Image Understanding (February 2007).

[39] PORTER, T., AND DUFF, T. Compositing digital images. In Proceedings of ACM SIGGRAPH (1984), vol. 18, pp. 253–259.

[40] PÉREZ, P., BLAKE, A., AND GANGNET, M. Jetstream: Probabilistic contour extraction with particles. In ICCV 2001, International Conference on Computer Vision (July 2001), vol. II, pp. 524–531.

[41] QI, W., AND ZHONG, Y. New robust global motion estimation approach used in MPEG-4. Journal of Tsinghua University Science and Technology (2001).

[42] RATAKONDA, K. Real-time digital video stabilization for multimedia applications. In Proceedings International Symposium on Circuits and Systems (Monterey, CA, USA, May 1998), vol. 4, IEEE, pp. 69–72.

[43] READ, P., AND MEYER, M.-P. Restoration of Motion Picture Film. Butterworth Heinemann, ISBN 0-7506-2793-X, 2000.

[44] ROOSMALEN, P. M. B. V., LAGENDIJK, R. L., AND BIEMOND, J. Correction of intensity flicker in old film sequences. Submitted to: IEEE Transactions on Circuits and Systems for Video Technology (December 1996).

[45] ROOSMALEN, P. M. B. V., LAGENDIJK, R. L., AND BIEMOND, J. Flicker reduction in old film sequences. In Time-varying Image Processing and Moving Object Recognition 4 (1997), Elsevier Science, pp. 9–17.

[46] ARMSTRONG, S., RAYNER, P. J. W., AND KOKARAM, A. C. Restoring video images taken from scratched 2-inch tape. In Workshop on Non-Linear Model Based Image Analysis, NMBIA'98; Eds: Stephen Marshall, Neal Harvey and Druti Shah (September 1998), Springer Verlag, pp. 83–88.

[47] SADHAR, S., AND RAJAGOPALAN, A. N. Image estimation in film-grain noise. IEEE Signal Processing Letters 12 (March 2005), 238–241.


[48] SAITO, T., KOMATSU, T., OHUCHI, T., AND SETO, T. Image processing for restoration of heavily-corrupted old film sequences. In International Conference on Pattern Recognition 2000 (2000), pp. Vol III: 17–20.

[49] SCHARSTEIN, D., AND SZELISKI, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47 (April 2002), 7–42.

[50] SIDOROV, D., AND KOKARAM, A. Suppression of moiré patterns via spectral analysis. In SPIE Conference on Visual Communications and Image Processing (January 2002), vol. 4671, pp. 895–906.

[51] SIDOROV, D. N., AND KOKARAM, A. C. Removing moiré from degraded video archives. In XIth European Conference in Signal Processing (EUSIPCO 2002) (September 2002).

[52] SMOLIC, A., AND OHM, J.-R. Robust global motion estimation using a simplified M-estimator approach. In IEEE International Conference on Image Processing (Vancouver, Canada, September 2000).

[53] STILLER, C. Motion estimation for coding of moving video at 8kbit/sec with Gibbs modelled vectorfield smoothing. In SPIE VCIP (1990), vol. 1360, pp. 468–476.

[54] STOREY, R. Electronic detection and concealment of film dirt. UK Patent Specification No. 2139039 (1984).

[55] STOREY, R. Electronic detection and concealment of film dirt. SMPTE Journal (June 1985), 642–647.

[56] SUN, J., JIA, J., TANG, C.-K., AND SHUM, H.-Y. Poisson matting. ACM Transactions on Graphics 23, 3 (2004).

[57] TENZE, L., RAMPONI, G., AND CARRATO, S. Blotches correction and contrast enhancement for old film pictures. In IEEE International Conference on Image Processing (2000), p. TP06.05.

[58] TENZE, L., RAMPONI, G., AND CARRATO, S. Robust detection and correction of blotches in old films using spatio-temporal information. In Proceedings of SPIE International Symposium of Electronic Imaging 2002 (January 2002).

[59] TUCKER, J., AND DE SAM LAZARO, A. Image stabilization for a camera on a moving platform. In Proc. of the IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing (May 1993), vol. 2, pp. 734–737.

[60] UOMORI, K., MORIMURA, A., ISHII, H., SAKAGUCHI, T., AND KITAMURA, Y. Automatic image stabilizing system by full-digital signal processing. IEEE Transactions on Consumer Electronics 36, 3 (August 1990), 510–519.


[61] VLACHOS, T. Simple method for estimation of global motion parameters using sparse translational motion vector fields. Electronics Letters 34, 1 (January 1998), 60–62.

[62] VLACHOS, T., AND THOMAS, G. A. Motion estimation for the correction of twin-lens telecine flicker. In IEEE International Conference on Image Processing (September 1996), vol. 1, pp. 109–112.

[63] WHITE, P., COLLIS, B., ROBINSON, S., AND KOKARAM, A. Inference matting. In IEE European Conference on Visual Media Production (November 2005), pp. 161–171.
