
[IEEE 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops) - San Francisco, CA, USA (2010.06.13-2010.06.18)]

Aligning Endoluminal Scene Sequences in Wireless Capsule Endoscopy

Michal Drozdzal1,2

[email protected]

Laura Igual1,2

[email protected]

Jordi Vitrià1,2

[email protected]

Carolina Malagelada3

[email protected]

Fernando Azpiroz3

[email protected]

Petia Radeva1,2

[email protected]

1 Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain
2 Dept. Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Gran Via 585, 08007, Barcelona, Spain

3 Digestive System Research Unit, University Hospital Vall d'Hebron, Barcelona, Spain

Abstract

Intestinal motility analysis is an important examination in the detection of various intestinal malfunctions. One of the big challenges of automatic motility analysis is how to compare sequences of images and extract dynamic patterns, taking into account the high deformability of the intestine wall as well as the capsule motion. From a clinical point of view, the ability to align endoluminal scene sequences will help to find regions of similar intestinal activity and in this way will provide valuable information on intestinal motility problems. This work, for the first time, addresses the problem of aligning endoluminal sequences taking into account the motion and structure of the intestine. To describe motility in a sequence, we propose different descriptors based on the Sift Flow algorithm, namely: (1) Histograms of Sift Flow Directions to describe the flow course, (2) Sift Descriptors to represent the intestinal image structure and (3) Sift Flow Magnitude to quantify intestine deformation. We show that the combination of all three descriptors provides robust information for sequence description in terms of motility. Moreover, we develop a novel methodology to rank the intestinal sequences based on expert feedback about the relevance of the results. The experimental results show that the selected descriptors are useful for alignment and similarity description, and that the proposed method allows the analysis of WCE video.

1. Introduction

Wireless Capsule Endoscopy (WCE) [7] is a recent imaging technique for examination of the Gastrointestinal (GI) tract. This technique consists of a WCE capsule, of dimension 11 mm × 26 mm and 3.7 g, which contains a camera, a battery, LED lamps for illumination and a radio transmitter. The capsule is swallowed by the patient, captures frames at a rate of 2 fps and emits them by a radio-frequency signal to an external device. This technique is able to acquire an internal view of the GI tract in a non-invasive way, compared with manometry, and does not require patient hospitalization. The whole trip of the capsule takes approximately eight hours and provides around 60,000 images per study. Specialists need between four and eight hours to screen the video. This makes examination and interpretation of the capsule recordings time-consuming and tedious and thus motivates the need for automation of video analysis.
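The study size quoted above follows directly from the capture rate; a quick sanity check (assuming the 2 fps rate and 8 h transit time stated in the text):

```python
# Back-of-envelope check of the figures quoted above.
fps = 2          # capture rate stated in the text
hours = 8        # approximate transit time stated in the text
frames = fps * hours * 3600
print(frames)    # 57600, consistent with the "around 60,000 images" per study
```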

The endoluminal scene in WCE is composed of different elements which can be visualized in the inner gut (Fig. 1), namely: intestinal walls, intestinal contents and intestinal lumen. Moreover, the intestine produces peristaltic movements and, as a result, the lumen size changes: it may be open, partly contracted or contracted (Fig. 1). Since the WCE capsule can move freely in the GI tract, a variety of orientations of the scene can be produced. In addition, intestinal contents, which can be seen as a turbid liquid or as a mass of bubbles, may appear and hinder the proper visualization of the scene.

An endoluminal scene sequence is defined as a consecutive series of frames which constitutes an intestinal action unit. Various types of intestinal actions which are diagnostically and clinically interesting can be distinguished. (1) Contraction sequence, defined in WCE video as an open-closed-open movement of the lumen. Generally, when the intestinal wall is closed during the contraction, a star-wise pattern is present [14]. An example of a phasic occlusive contraction is shown in Fig. 1. (2) Tunnel sequence, defined as the absence of contractile activity in the relaxed intestine, visualized as a sequence of open lumens. (3) Sequences of WCE video with no intestinal activity, defined as static sequences. (4) Intestinal content sequence, consecutive frames with intestinal content.

978-1-4244-7030-3/10/$26.00 ©2010 IEEE

Figure 1. Illustration of endoluminal scene and sequences in WCE.

From the intestinal motility point of view, similar sequences are those with resembling movements. In this work, we focus on aligning endoluminal scene sequences based on dynamic patterns. Alignment of endoluminal scene sequences is an important challenge in small-intestine examination, which allows comparing predefined patterns of intestinal activity in WCE and identifying regions of similarity. Intestinal dynamics has a high variability in the manner and duration of deformation, as can be seen in Fig. 1, where several sequence examples are shown. For this reason, the alignment of endoluminal scene sequences should be flexible with respect to sequence length and has to be robust to variations due to capsule motion.

Several works have been developed on endoluminal scene analysis, most of them focused on the characterization of different lesions in the gut, such as polyps [13] or bleeding [6]. Others focus on reducing the time needed for visualization of WCE video [5], [17]. Recently, works to assist in the analysis of intestinal motility disorders using WCE have been developed [15], [8], [16]. The approach in [15] copes with the problem of intestinal contraction detection. It is based on static information from lumen features of fixed-size sequences (nine frames) and does not exploit the motion structure of the capsule recordings. Histograms of gradients are used in [16]. In [8], the authors propose eigenmotion-based contraction detection, where optical flow between consecutive frames is used in fixed-size sequences. The problem of detecting measurable contraction sequences in WCE video is presented in [2]. All of them (except [2]) assume a fixed sequence length and are based on optical flow or static features (e.g., lumen size). On the other hand, intestines are highly deformable. Moreover, due to the capsule motion, intestinal events can be captured from different view angles.

Figure 2. Scene sequence examples of intestinal contractions.

The standard way to extract motion is to use a scheme for optical flow estimation. Classical optical flow techniques are robust and precise when the deformation is smooth, but have difficulties in correctly estimating the visual deformations in the case of sudden (fast) motions. Fast motions are frequent in WCE, due to its low frame rate (2 fps). Therefore, the applicability of optical flow techniques to capture the contraction movements is limited. In order to overcome this problem, we use the Sift Flow (SF) technique [11] to extract the motion information. SF estimates the motion between points using their Sift descriptors and has been shown to be more robust to abrupt variations between consecutive frames. However, since we are interested in describing the structure and deformation of the intestinal wall, we consider three different descriptors to estimate the similarity between sequences: the Histogram of Sift Flow Directions (HSFD) to describe transformation direction; the Sift Descriptor (SD) to measure similarity between image structures; and the Sift Flow Magnitude (SFM) to quantify the intestine deformation. The union of these three descriptors provides a complete motion characterization and allows us to define tools to measure the similarity between intestinal sequences.

In this work, we propose a novel technique for aligning intestinal sequences based on three steps: (1) similarity feature extraction, (2) computation of similarity between sequences and (3) a scheme for optimal weighting of the different features. In the first step, the SF technique is used to estimate the motion in the sequences. In the second step, we apply Dynamic Time Warping (DTW) in order to compare sequences of different length. Finally, in the last step, we define a scheme based on the relevance feedback algorithm [1] to learn the weights assigned to every descriptor. Our methodology allows: (1) comparing dynamic events independently of the sequence length and (2) using motion and structure features that are not restricted to highly smooth deformation. To the authors' knowledge, the work presented here is the first one dealing with the problem of aligning endoluminal scene sequences.

The paper is organized as follows: in Section 2, a short description of the Sift Flow, DTW and relevance feedback algorithms is presented, including the mathematical background; Section 3 introduces the proposed methodology; the experiments and results are presented in Section 4; finally, in Section 5, we expose conclusions and future work.

2. Background

SIFT Flow. Because of the variety of endoluminal scenes, the free movement of the capsule and the diversity of intestinal sequences, we need robust descriptors of intestinal deformation. The descriptors should be flexible and able to manage scale changes, affine distortion, viewpoint change and illumination variations. In SF, a SIFT [12] descriptor is extracted at each pixel to characterize local image structures and encode contextual information. Then, a discrete, discontinuity-preserving optical flow algorithm is used to match the SIFT descriptors between two images. The use of SIFT features allows robust matching across different scene/object appearances, and the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. SIFT features are invariant to image scale and rotation, and have been shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The SF is calculated by optimization of the following cost function:

∑_p ||s1(p) − s2(p + w)||_1 + (1/σ²) ∑_p (u²(p) + v²(p)) + ∑_{(p,q)∈ϵ} [ min(α|u(p) − u(q)|, γ) + min(α|v(p) − v(q)|, γ) ],

where w(p) = (u(p), v(p)) is the displacement vector at pixel p, si(p) is the SIFT descriptor extracted at location p in the i-th image, and ϵ is the spatial neighborhood of a pixel. σ, α and γ are parameters. Fig. 3 presents some examples of SF performance on WCE images.
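For concreteness, the cost function above can be evaluated directly for a given candidate flow. The sketch below only scores a flow; it does not perform the discrete optimization used by the actual SF algorithm of [11], and the clamping of displaced coordinates to the image domain is our assumption:

```python
import numpy as np

def sift_flow_energy(s1, s2, u, v, sigma=1.0, alpha=1.0, gamma=10.0):
    """Evaluate the SF cost for a candidate flow w(p) = (u(p), v(p)).

    s1, s2 : (H, W, D) dense SIFT descriptor fields (D-dim per pixel).
    u, v   : (H, W) integer displacement fields.
    """
    H, W, _ = s1.shape
    # data term: L1 distance between s1(p) and s2(p + w), clamped to the image
    data = 0.0
    for yy in range(H):
        for xx in range(W):
            ys = min(max(yy + int(v[yy, xx]), 0), H - 1)
            xs = min(max(xx + int(u[yy, xx]), 0), W - 1)
            data += np.abs(s1[yy, xx] - s2[ys, xs]).sum()
    # small-displacement term: (1/sigma^2) * sum_p (u^2 + v^2)
    small = (u.astype(float) ** 2 + v.astype(float) ** 2).sum() / sigma ** 2
    # truncated L1 smoothness over 4-connected neighbor pairs
    smooth = 0.0
    for f in (u, v):
        f = f.astype(float)
        smooth += np.minimum(alpha * np.abs(np.diff(f, axis=0)), gamma).sum()
        smooth += np.minimum(alpha * np.abs(np.diff(f, axis=1)), gamma).sum()
    return data + small + smooth
```

With identical descriptor fields and a zero flow, all three terms vanish, as expected from the formula.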

Dynamic Time Warping. Besides describing sequence frames, we need a robust technique to compare sequences of different length. This technique should be flexible enough to allow matching of sequences even under different and not necessarily uniform capsule motion. DTW is an algorithm for measuring similarity between two sequences which may vary in time or speed in different ways. The key of DTW lies in the computation of the distance between the input stream and the template. A detailed description of the DTW approach can be found in [4], [10] and [9]. Rather than comparing the value of the input stream at time t to the template stream at the same time t, the algorithm searches for an optimal mapping from the input stream to the template stream, so that the total distance between corresponding frames is minimized. In this way, DTW aligns the time axes, allowing many-to-one and one-to-many frame matching before calculating the Euclidean distance.

Figure 3. Sift flow results of six pairs of images (six rows). The first two columns are the Sift descriptors of the two images from the third and fourth columns. The fifth column displays the result of applying the inverse of the SF to the second image of the pair. The sixth column shows the SF field of each pair, where the color indicates the deformation direction and the color intensity the deformation magnitude.
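A minimal DTW over a precomputed frame-dissimilarity matrix can be sketched as follows (an illustration, not the authors' implementation; normalization by path length is included here since the paper later normalizes the distance d this way):

```python
import numpy as np

def dtw(cost):
    """DTW over a precomputed frame-dissimilarity matrix.

    cost[j, k] is the dissimilarity between model frame j and query
    frame k.  Returns the accumulated cost of the optimal alignment
    path, normalized by the path length.
    """
    x, y = cost.shape
    acc = np.full((x, y), np.inf)           # accumulated cost
    steps = np.zeros((x, y), dtype=int)     # path length up to each cell
    acc[0, 0] = cost[0, 0]
    steps[0, 0] = 1
    for j in range(x):
        for k in range(y):
            if j == 0 and k == 0:
                continue
            # predecessors: insertion, deletion, match (allows
            # many-to-one and one-to-many frame matching)
            cands = []
            if j > 0:
                cands.append((acc[j - 1, k], steps[j - 1, k]))
            if k > 0:
                cands.append((acc[j, k - 1], steps[j, k - 1]))
            if j > 0 and k > 0:
                cands.append((acc[j - 1, k - 1], steps[j - 1, k - 1]))
            best, n = min(cands)
            acc[j, k] = cost[j, k] + best
            steps[j, k] = n + 1
    return acc[-1, -1] / steps[-1, -1]
```

For example, a cost matrix whose diagonal entries are smallest yields the diagonal path, and a zero-cost matrix yields a normalized distance of 0.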

Relevance feedback algorithm. When looking for the best mapping between two sequences, we need to estimate the parameters that assign the right importance to the different terms of the sequence similarity. We base our approach on the relevance feedback algorithm. Relevance feedback is widely used in content-based image retrieval, where the problem of image relevance ranking is based on content without any additional textual information. As presented in [18], the relevance feedback algorithm follows these steps: (1) the machine provides initial retrieval results; (2) the user judges the currently displayed images as to whether, and to what degree, they are relevant or irrelevant to her/his request; (3) the machine learns and tries again.

In [1], the authors present an algorithm for weight learning using relevance feedback. The basic idea is to minimize the distance to images marked as relevant and to maximize the distance to non-relevant results. Given the distance:

D(Ym, Yq) = ∑_i w_i ∗ D_i(Ym, Yq),

where w_i are weights, and Ym and Yq are the model sequence and the query sequence respectively, the following term is minimized with respect to w_i:

∑_{m∈Q+} ∑_{q+∈Q+\{m}} ∑_{q−∈Q−\{m}} D(Ym, Yq+) / D(Ym, Yq−),

where q+ are relevant results and q− are non-relevant results. As a result, after gradient descent is applied to minimize this function, the number of relevant results in the top retrieval positions increases.
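Since D is linear in the weights, a gradient-descent weight update can be sketched as below. This is a minimal illustration, not the implementation of [1]: it uses the simpler difference form of the objective (sum of relevant distances minus sum of non-relevant distances), and the non-negativity and sum-to-one constraints on the weights are our assumption, added to keep the linear objective bounded:

```python
import numpy as np

def learn_weights(D_pos, D_neg, lr=0.01, iters=1000):
    """Learn descriptor weights from relevance feedback.

    D_pos : (n_pos, n_desc) per-descriptor distances model -> relevant queries
    D_neg : (n_neg, n_desc) per-descriptor distances model -> non-relevant
    Minimizes sum_i w_i * (sum_pos D_i+ - sum_neg D_i-) by gradient
    descent, projecting to non-negative weights that sum to 1.
    """
    n_desc = D_pos.shape[1]
    w = np.ones(n_desc) / n_desc
    # the objective is linear in w, so its gradient is constant
    grad = D_pos.sum(axis=0) - D_neg.sum(axis=0)
    for _ in range(iters):
        w -= lr * grad
        w = np.clip(w, 0.0, None)           # keep weights non-negative
        s = w.sum()
        w = w / s if s > 0 else np.ones(n_desc) / n_desc
    return w
```

A descriptor with large distances to relevant sequences (relative to non-relevant ones) is driven toward a small weight, which matches the intended behavior of the feedback step.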

119

Page 4: [IEEE 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops) - San Francisco, CA, USA (2010.06.13-2010.06.18)] 2010 IEEE Computer

Figure 4. Algorithm scheme for aligning scene sequences in WCE.

3. Methodology

Given a model and a set of query sequences, the objective is to measure the similarity of each query sequence with respect to the model. Based on this measure, a ranking of sequence similarities with respect to the model can be created. For that, we define a pairwise alignment method and a similarity measure between model and query sequences. A model is a sequence of frames with a predefined intestinal pattern, which is used to explore the database of endoluminal scene sequences. A query sequence is an intestinal sequence compared with the model. The proposed methodology to align sequences is based on a three-step procedure: (1) similarity feature extraction, (2) sequence similarity estimation and (3) relevance feedback for optimal weight assignment. A scheme of the algorithm is presented in Fig. 4. In the first step, the descriptors which characterize the intestinal event in the sequences are extracted. The second step deals with similarity calculation between sequences of different length, and finally, in the third step, relevance feedback is used to calculate the weights of the descriptors. In the following sections, we describe the three parts of the algorithm in detail.

3.1. Similarity feature extraction

Figure 5. Vertical and horizontal flow definition.

Figure 6. Computation of S1, S2 and S3 descriptors.

Given two sequences of deforming intestines, we are interested in obtaining the optimal match between their frames, taking into account the object structure and motion. Note that calculating the flow between frames of the two sequences gives us the cost of the image transformation, which is lower the closer the structures of the images are. On the other hand, the flow from each frame to the next one in the same sequence gives us information about the intestine deformation. Since we expect similar sequences to contain intestine appearances with similar deformations, we look for similar horizontal flows. Thus, we define two ways of computing the flow between sequences: vertical and horizontal (see Fig. 5). Horizontal flow is calculated between two consecutive frames of the same sequence; vertical flow is calculated between frames of different video sequences. In order to quantify the similarity level between sequences in a robust way, we introduce three types of descriptors:

- S1: The Sift Descriptors Difference (SDD) is used to measure similarity between the structures of the images.

- S2: The Histogram of Sift Flow Directions (HSFD) is used to describe the transformation directions.

- S3: The Sift Flow Magnitude (SFM) is used to quantify the amount of intestine deformation.

Descriptor S1 is calculated using the vertical flow between sequences; descriptors S2 and S3 are calculated using the horizontal flow (see Fig. 6).

Sift Descriptors Difference S1. The matrix S1(j, k), j = 1, . . . , x, k = 1, . . . , y, where x and y are the lengths of the model and the query sequence respectively, is computed as follows:

S1(j, k) = ||SD(Imj) − SDAW(Iqk)||,

where SD is the Sift Descriptor, SDAW is the Sift Descriptor After Warping, Im and Iq are intestinal images of the two sequences, and ||·|| denotes the Euclidean norm. Each element of S1(j, k) describes the structural similarity between frame j in the model and frame k in the query sequence. To obtain the warping, the vertical Sift flow is calculated between frames of the model and the query sequence. The process is illustrated in Fig. 7.

Figure 7. Illustration of SDD computation process.
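Given the definition above, filling the S1 matrix is straightforward once descriptors and warping are available. The sketch below assumes those are computed upstream; `warped_desc` is a hypothetical helper standing in for the vertical-SF warp plus descriptor extraction, not a function from the paper:

```python
import numpy as np

def sdd_matrix(model_desc, n_query, warped_desc):
    """Sift Descriptors Difference matrix S1 (a sketch).

    model_desc  : list of x flattened dense-SIFT descriptors (model frames)
    n_query     : number y of query frames
    warped_desc : callable (j, k) -> descriptor of query frame k after
                  warping it toward model frame j with the vertical Sift
                  flow (hypothetical helper, assumed computed upstream)
    Each entry is the Euclidean norm ||SD(Im_j) - SDAW(Iq_k)||.
    """
    x = len(model_desc)
    S1 = np.zeros((x, n_query))
    for j in range(x):
        for k in range(n_query):
            S1[j, k] = np.linalg.norm(model_desc[j] - warped_desc(j, k))
    return S1
```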

Histogram of Sift Flow Directions S2. The Histograms of Sift Flow Directions (HSFD) give information about the angles of the transformation, which correspond to the directions of movement between frames, and are calculated using the horizontal flow. As shown in Fig. 8, for every pair of frames we obtain four histograms, one describing the movement in each quadrant of the flow field. The matrix S2 is defined as:

S2(j, k) = EMD(HSFD(Fmj), HSFD(Fqk)),

where Fmj is the horizontal Sift flow field between frames in the model and Fqk is the horizontal Sift flow field in the query sequence. To obtain the similarity measure, the Earth Mover's Distance (EMD) between the histograms is calculated. Each element of S2(j, k) quantifies the flow direction similarity between frames j and j + 1 in the model and frames k and k + 1 in the query sequence.
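The per-quadrant direction histograms and their comparison can be sketched as follows. This is an illustration under our own assumptions: the quadrant split, the bin normalization and the linear-bin 1-D EMD (which ignores the circularity of angle bins) are simplifications, not details given in the paper. The 17-bin setting matches the one reported in the experiments:

```python
import numpy as np

def hsfd(u, v, bins=17):
    """Histograms of Sift Flow Directions for one horizontal flow field.

    u, v : (H, W) flow between consecutive frames.
    Returns four normalized direction histograms, one per quadrant.
    """
    H, W = u.shape
    hists = []
    for ys, xs in ((slice(0, H // 2), slice(0, W // 2)),
                   (slice(0, H // 2), slice(W // 2, W)),
                   (slice(H // 2, H), slice(0, W // 2)),
                   (slice(H // 2, H), slice(W // 2, W))):
        ang = np.arctan2(v[ys, xs], u[ys, xs]).ravel()   # flow directions
        h, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
        hists.append(h / max(h.sum(), 1))                # normalize mass
    return hists

def emd_1d(h1, h2):
    """1-D Earth Mover's Distance between normalized histograms
    (cumulative-sum form for linear bins)."""
    return np.abs(np.cumsum(h1 - h2)).sum()
```

A uniform rightward flow, for instance, concentrates all mass near angle 0 in every quadrant, so the pairwise EMD between quadrant histograms is zero.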

Sift Flow Magnitude S3. To quantify the intestine deformation magnitude, the Sift Flow Magnitude (SFM) is used. It answers the question: how much does the frame have to be deformed to resemble the next frame? Fig. 9 displays the SFM of a sequence example. The similarity of SFM is defined as:

S3(j, k) = ||SFM(Fmj) − SFM(Fqk)||.

Thus, in terms of SFM, the flow between frames j and j + 1 in the model and frames k and k + 1 in the query sequence is similar when the magnitudes of the flows are alike.
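The SFM descriptor and one entry of S3 reduce to a per-pixel magnitude map and a norm of its difference; a minimal sketch (our reading of the definition above, not the authors' code):

```python
import numpy as np

def sfm(u, v):
    """Sift Flow Magnitude of one horizontal flow field: the per-pixel
    displacement magnitude, summarizing how much a frame must deform
    to resemble the next one."""
    return np.hypot(u, v)

def s3_entry(flow_m, flow_q):
    """One entry of S3: Euclidean norm of the SFM difference between
    the model flow (frames j -> j+1) and the query flow (k -> k+1).
    flow_m, flow_q : (u, v) pairs of same-shape arrays."""
    return np.linalg.norm(sfm(*flow_m) - sfm(*flow_q))
```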

3.2. Similarity calculation and relevance feedback

For every two sequences (the model sequence and a query sequence), three matrices S1, S2 and S3 are obtained. The overall similarity between frames of the sequences is defined as follows:

S(j, k) = w1 ∗ S1(j, k) + w2 ∗ S2(j, k) + w3 ∗ S3(j, k),

where w1, w2 and w3 are the weights assigned to each descriptor, j = 1, . . . , x, k = 1, . . . , y, and x, y are the lengths of the model and the query sequence respectively. At the beginning, all weights are equal: w1 = w2 = w3 = 1. The similarity D between two sequences Ym and Yq is computed using the frame similarity matrix S(j, k):

D(Ym, Yq) = DTW(S(j, k)).

Each result d is normalized by the length of the path used in the DTW calculation; in that way, the results obtained by comparing the model with sequences of different size can be compared. The smaller the value of d, the more similar the query sequence is to the model.
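The frame-level combination above is a single weighted sum; a small sketch with the initial uniform weights (the stand-in matrices here are random placeholders for precomputed S1, S2, S3, used only for illustration):

```python
import numpy as np

# S1, S2, S3 stand in for precomputed x-by-y descriptor matrices.
rng = np.random.default_rng(0)
S1, S2, S3 = rng.random((3, 5, 7))   # three 5x7 placeholder matrices

# Initial uniform weights, before relevance feedback updates them.
w1 = w2 = w3 = 1.0
S = w1 * S1 + w2 * S2 + w3 * S3      # frame similarity matrix S(j, k)
```

The matrix S is then fed to DTW, and the accumulated path cost, normalized by path length, gives the sequence-level distance d.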

For the initial sequence ranking, relevance feedback is applied. Expert intervention is needed to flag the top sequences as {relevant, not relevant}. Based on this information, the algorithm learns the weights w1, w2 and w3, minimizing the distances to relevant sequences and maximizing the distances to non-relevant sequences. The optimization is done by gradient descent, minimizing the following energy function with respect to w_i:

∑_i ( ∑_{q+_j ∈ Q+} w_i ∗ D_i(Ym, Y_{q+_j}) − ∑_{q−_k ∈ Q−} w_i ∗ D_i(Ym, Y_{q−_k}) ),

where Y_{q+_j} are the sequences marked as relevant and Y_{q−_k} are the sequences marked as non-relevant. With the new weight values, the algorithm recalculates the S(j, k) matrix and the similarity between the model and each query sequence (see Fig. 4).

4. Results

The evaluation of the proposed approach is twofold: first, we validate the descriptors for sequence alignment and, second, we evaluate the functionality of the relevance feedback procedure. Two databases (DB1 and DB2) are used, containing 222 and 247 contraction sequences respectively. The sequences have variable length, from 5 up to 25 frames, and each frame has a resolution of 256 × 256 pixels. In order to speed up the descriptor calculation, the images are resized to 80 × 80 pixels. The HSFD are represented with 17-bin histograms.

Figure 8. Computation of HSFD.

Index of the ranking   1    2    3    4
Mean experts' note    6.5  6.9  3.3  8.1

Table 1. Mean experts' notes for the rankings of sequences.

Descriptor validation for sequence alignment. In this experiment, we validated the descriptors for sequence alignment with the help of two experts. We computed four different rankings of sequences from DB1: (1) using only the Sift Descriptors Difference, (2) using only the Histograms of Sift Flow Directions, (3) using only the Sift Flow Magnitude, and (4) using the combination of all three. Usually, a ranking is evaluated by displaying the obtained retrieval order to an expert, who rates it on a predefined scale (e.g., from 1 to 10) [3]. We presented the 15 most relevant sequences to the group of experts. They assigned values from 1 to 10 to each of the four sets of sequences, where 1 means that the sequences are not aligned and 10 means perfect alignment. The criteria for the evaluation of sequence alignment were the following: (1) is the query sequence symmetrical/asymmetrical in the same way as the model, (2) are the lumen size changes similar, and (3) do the images in the model and in the query sequences preserve a similar intestinal structure. The sequences are presented in Fig. 10 and Table 1 presents the mean results of the experts' answers. In all the mosaics presented below, the model sequence is displayed in the first row.

Figure 9. Sift Flow Magnitude (SFM) of a sequence example.

As can be seen, the union of the three descriptors provides the best sequence alignment result, whereas applying only one descriptor gives poor results. As expected, in the first ranking, where only the SDD is used, the similarity order is based on the structure of the images. In the second ranking, the directions of transformation are well preserved in all sequences. The third ranking, applying only the SFM difference, gives the worst result; the sequences seem to be unorganized.

Relevance feedback validation. In this experiment, we validate the performance of the relevance feedback algorithm. Given a ranking of a set of sequences with respect to a model sequence, an expert was asked to flag each of the top 15 sequences of the ranking as {relevant or not relevant}. Based on the expert labeling, the system learned the descriptor weights using the relevance feedback algorithm. Then, the reordered set of sequences was presented to the expert together with the initial set (Fig. 11). As can be seen in the image on the right, the majority of the sequences marked as relevant by the expert went up in the similarity ranking. Fig. 12 shows the difference between the mean distances from the model sequence to the non-relevant sequences and to the relevant sequences. As can be seen, at each iteration the algorithm increases the distance to the non-relevant sequences more than the distance to the relevant sequences. Moreover, it can be observed that after 1000 iterations of gradient descent the algorithm stabilizes. Table 2 presents the weights assigned to the descriptors.

Figure 10. Rankings of a set of sequences using as descriptors: (a) only the Sift Descriptors Difference, (b) only the Histograms of Sift Flow Directions, (c) only the Sift Flow Magnitude and (d) the combination of all three.

Figure 11. (a) Initial ranking of the set of sequences and (b) reorganized set of sequences.

Figure 12. Mean difference between not relevant and relevant sequences.

Descriptor   SDD   HSFD   SFM
Weight       0.33  0.54   0.11

Table 2. Descriptor weights calculated by relevance feedback.

The algorithm with the weights from Table 2 was applied to a new database (DB2) of contractions acquired during a different clinical study. The sequence alignment results are presented in Fig. 13. The mean note of the experts' evaluation for this sequence ranking was 8.9.

Figure 13. Alignment result for DB2.

5. Conclusions

In this paper, for the first time, an algorithm for aligning and measuring the similarity of endoluminal scene sequences is presented. We introduce three descriptors to characterize intestinal structure and deformation in consecutive frames: (1) Histograms of Sift Flow Directions to describe the flow course, (2) Sift Descriptors to represent the image structure and (3) Sift Flow Magnitude to quantify the deformation. To handle sequences of different length, the Dynamic Time Warping algorithm is used. Moreover, a technique for weight learning based on expert relevance feedback is introduced to improve the performance of the method. The experimental results show that the proposed algorithm is able to create an endoluminal sequence ranking that fulfills the experts' expectations. The extracted data are suitable for further clinical, diagnostic and therapeutic applications. It is worth remarking that one of the main limitations of our algorithm is the time needed to calculate the Sift Flow between frames; reducing it would meaningfully speed up the alignment and ranking of endoluminal sequences. Another important step is to elaborate confident validation techniques for intestinal sequence relevance ranking.

References

[1] T. Deselaers et al. Learning weighted distances for relevance feedback in image retrieval. In ICPR, 2008.
[2] M. Drozdzal, P. Radeva, et al. Towards detection of measurable contractions using WCE. In Proc. of the 4th CVC Workshop, pages 131-136, 2009.
[3] M. S. E. Horster et al. Unsupervised image ranking. In LS-MMRM, 2009.
[4] A. W. Fu, E. Keogh, et al. Scaling and time warping in time series querying. In Proc. of the 31st VLDB Conf., 2005.
[5] V. Hai, T. Echigo, et al. Adaptive control of video display for diagnostic assistance by analysis of capsule endoscopic images. In Proc. ICPR, pages 980-983, 2006.
[6] S. Hwang, J. Oh, et al. Blood detection in wireless capsule endoscopy using expectation maximization clustering. In Proc. of the SPIE, pages 577-587, 2006.
[7] G. Iddan, G. Meron, et al. Wireless capsule endoscopy. Nature, 405:417, 2000.
[8] L. Igual, S. Segui, et al. Eigenmotion-based detection of intestinal contractions. In Proc. CAIP, volume 4673, pages 293-300, 2007.
[9] M. W. Kadus. A general architecture for supervised classification of multivariate time series, 2009.
[10] E. Keogh, T. Palpanas, et al. Indexing large human motion databases. In Proc. of the 30th VLDB Conf., 2004.
[11] C. Liu, J. Yuen, et al. SIFT Flow: dense correspondence across different scenes. In ECCV, pages 28-42, 2008.
[12] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[13] M. Mackiewicz, J. Berens, et al. Colour and texture based gastrointestinal tissue discrimination. In Proc. of IEEE ICASSP, pages 597-600, 2006.
[14] F. Vilarino. A machine learning approach for intestinal motility assessment. PhD thesis, UAB, Barcelona, 2006.
[15] F. Vilarino, P. Spyridonos, et al. Cascade analysis for intestinal contraction detection. CARS, pages 9-10, 2006.
[16] H. Vu, T. Echigo, et al. Contraction detection in small bowel from an image sequence of wireless capsule endoscopy. In Proc. of MICCAI, pages 775-783, 2007.
[17] Y. Yagi et al. A diagnosis support system for capsule endoscopy. Inflammopharmacology, pages 78-83, 2007.
[18] X. S. Zhou et al. Relevance feedback in image retrieval: a comprehensive review. Multimedia Systems, 2003.
