Research Article Performances Evaluation of a Novel Hadoop...

Research ArticlePerformances Evaluation of a Novel Hadoop and Spark BasedSystem of Image Retrieval for Huge Collections

Luca Costantini and Raffaele Nicolussi

Fondazione Ugo Bordoni Viale del Policlinico 147 Roma Italy

Correspondence should be addressed to Luca Costantini lcostantinifubit

Received 2 October 2015 Revised 24 November 2015 Accepted 25 November 2015

Academic Editor Athanasios V Vasilakos

Copyright copy 2015 L Costantini and R Nicolussi This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

A novel system of image retrieval based onHadoop and Spark is presentedManaging and extracting information fromBig Data isa challenging and fundamental task For these reasons the system is scalable and it is designed to be able tomanage small collectionsof images as well as huge collections of images Hadoop and Spark are based on the MapReduce framework but they have differentcharacteristics The proposed system is designed to take advantage of these two technologies The performances of the proposedsystem are evaluated and analysed in terms of computational cost in order to understand in which context it could be successfullyused The experimental results show that the proposed system is efficient for both small and huge collections

1 Introduction

Due to the dramatic growth of available images and videosand the spread of social networks surveillance cameras andsatellite images an important challenge is how to effectivelymanage the computation and storage requirements imposedby these data [1] The trend towards many-core processorsand multiprocessor systems is thwarted by the complexityin developing applications that effectively utilize them [2]There are many methods available for retrieving images froma collection In themedical industry for example recent stud-ies have identified in CBIR (Content-Based Image Retrieval)systems a very effective technique In these systems input isrepresented by images and output consists in all the similarimages contained in the database [3 4] A CBIR systemtypically operates through three steps

(1) extraction of the features which represent images inthe collection (eg wavelet transform andGabor filterbank)

(2) extraction of the features which represent the queryimage

(3) comparison of the query image with the images in thecollection by using the features extracted

Often these techniques have proved to be inadequate asthe number of images grows Many studies have shownthat the use of MapReduce [5] techniques can howevergreatly speed up the image processing [6 7] while Hadoopis helpful for achieving scalability [8] These performanceimprovements have to be paid in terms of constraint for dataaccess pattern RAM sharing rigid frameworks and algo-rithms major redesign Furthermore a set of hybridizationsof the original MapReduce framework [9] can efficientlyscale the indexing pipeline across a distributed cluster ofmachines [10ndash12] With the objective to fill the gap betweencomplicatedmodern architectures and new image processingalgorithms this work aims to produce a high-performanceimage retrieval system able to hide the software complexityfrom researchers allowing them to focus on designing inno-vative image processing algorithms The proposed systemembeds just one feature to demonstrate its effectiveness andmany other features could be embedded because the systemconsiders the features as an abstract object The technologiesused in the system are chosen to satisfy the requirementsof each specific task To evaluate the effectiveness of theproposed system its performances in terms of computationalcost are evaluatedwith collections of different sizeThe rest ofthe paper is organized as follows In Section 2 theMapReduce

Hindawi Publishing CorporationAdvances in MultimediaVolume 2015 Article ID 629783 7 pageshttpdxdoiorg1011552015629783

2 Advances in Multimedia

framework is explained while in Section 3 the technologiesused in the proposed system are described The architectureof the system is described in Section 4 In Section 5 theexperimental results are reported while the conclusions aredrawn in Section 6

2 MapReduce

The MapReduce framework has been introduced by Google[13] in order to allow a distributed processing over a servercluster Despite the traditional distributed processing frame-work where the data are pushed to the nodes that belongto the cluster which are responsible for the processing inthe MapReduce system the approach is different [10] In thiscase the data are distributed among the nodes and the tasksare pushed to the particular node that stores the data TheMapReduce framework is made up of two steps Map andReduce and the whole framework is based on the conceptof key value pair ⟨119896 V⟩ [14] The Map function known alsoas mapper receives ⟨119896119898in V119898in⟩ as input and produces a listof pairs as output

⟨119896119898out1 V119898out1⟩ ⟨119896119898out1 V119898out(119872minus1)⟩ ⟨119896

119898out119873 V119898out2⟩

⟨119896119898out119873 V119898out119872⟩

(1)

Since the key emitted from the mapper could not be uniquethe Reduce function called reducer groups the values alto-gether for each unique key

⟨119896119898out1 [V119898out1 V119898out

(119872minus1)]⟩ 997888rarr ⟨119896

119903out1 V119903out1⟩ (2)

Depending on the implementation of the MapReduce frame-work the reducer could also producemultiple key value pairsas output There are many advantages in using the MapRe-duce framework related to both data storing and processingaspects Indeed the file system blocks are duplicated acrossthe nodes ensuring in this way a great tolerance in case ofnode failure Furthermore the frameworkmanages the blockin order to minimize the traffic of network data Regardingperformances the framework assigns the task to the nodesthat are not busy balancing in this way the load among thenodes Finally the framework is scalable and the numberof nodes in the cluster depends only on the specific caseof use In the context of image retrieval the MapReduceframework could be employed principally in two ways singleimage processing [15 16] or image collections processing [5]Although it is even possible to combine these two approachesin the proposed architecture the second approach has beenadopted

3 Technologies

The proposed system is based mainly on two technologiesHadoop and Spark which are described in this sectionAlthough these technologies are based on the MapReduceframework for many aspects they are very different and bothof them are used in the proposed system In order to takeadvantage of their potentialities these two technologies are

used in different parts of the system Hadoop and Spark aredescribed in Sections 31 and 32 respectively

31 Apache Hadoop Hadoop developed by the ApacheSoftware Foundation is an open source framework createdby Doug Cutting and Mike Cafarella in 2005 [18] Its aim isto make available a framework for distributed storage and fordistributed processing The main modules that made up theHadoop framework are as follows

(1) Hadoop Common this module contains the librariesand the utilities

(2) Hadoop Distributed File System (HDFS) originallyit was the Google File System This module is adistributed file system used as distributed storage forthe data furthermore it provides an access to the datawith high throughput

(3) Hadoop YARN (MRv2) this module is responsiblefor the job scheduling and for managing the clusterresources

(4) Hadoop MapReduce originally Googlersquos MapReducethis module is a system based on YARN for theparallel processing of the data

There are many projects related to Hadoop such as MahoutHive Hbase and Spark One of the main aspects that char-acterize Hadoop is that the HDFS has a high fault-toleranceto the hardware failure Indeed it is able to automaticallyhandle and resolve these events Furthermore HDFS is ableby the interaction among the nodes belonging to the clusterto manage the data for instance to rebalance them [19 20]The processing of the data stored on the HDFS is performedby the MapReduce framework Although Hadoop is writtenprincipally in Java and C languages it is accessible by manyother programming languages The MapReduce frameworkallows splitting on the nodes belonging to the cluster the tasksthat have to be completed The main drawback of Hadoop isthe lack of performing efficiently real-time tasks Howeverthis is not an important limitation because for these specificaspects other technologies could be employed

32 Apache Spark As well as Hadoop Spark is a project ofthe Apache Software Foundation originally created by theAMPLab at the University of California Berkeley Concern-ing the performances the main advantage of Spark is that itoutperforms Hadoop indeed it is 100x faster for in-memoryoperation and 10x faster for on-disk operations Spark adoptsthe MapReduce paradigm and it is accessible by using theAPI by different programming languages (such as Scala Javaand Python) The core of the system is made up of a setof powerful libraries that currently include parkSQL SparkStreaming MLlib and GraphX Spark is divided into variousindependent layers each one with specific responsibilities

(1) Scala interpreter it takes care of creating an oper-ator graph through RDD (ie responsibility-drivendesign) and applying operators

(2) DAG scheduler it divides operator graph into stagesEvery stage is comprised of tasks based on partitions

Advances in Multimedia 3

Collection of images

Mapper 1

Mapper 2

Reducer

Mapper N

Reducer

Mapper N

Hadoop file system

Query image Ranked collection of images

Hadoop Spark

Index files

Mapper Nminus 1

Mapper 1

Mapper 2

Mapper Nminus 1

Figure 1 Architecture of the system The gray objects represent the server side while the green objects represent the client side

of the input data The DAG scheduler pipelinesoperators work to optimize the graph

(3) Task scheduler it activates tasks via cluster manager

(4) Worker it executes the task on a slave (ie themachine on which the Executor program runs)

Spark can run on the Hadoop YARN cluster and it is able toaccess the HDFS this allows an easy and efficient integrationof Hadoop and Spark within the same system

4 Proposed Architecture System

Theaimof this paper is to propose the architecture of a systemprototype able to manage huge collection of images anddeeply analyse its performances in terms of computationalcost and not in terms of recall precision retrieval rate andso on because these aspects are not important in this contextThe proposed system is able to manage huge collectionof images thanks to its scalability It is based on the useof two important technologies in the context of Big DataHadoop and Spark both of them employ the MapReduceframework In the previous sections the MapReduce frame-work (Section 2) as well as Hadoop (Section 31) and Spark(Section 32) is described

Roughly speaking the system is made up of a client sideand a server side (see Figure 1) The client side allows theinteractions between the users and the system by a web pageIndeed it is used by the users to provide the query image andby the system to show the results The processing and all theoperations with a high computational cost are performed onthe server side For this reason it needs to be optimized and

efficiently designed in order to manage huge collections ofimages The architecture of the system taking into accountonly the server side could be considered to be made up oftwo main parts one used during the indexing phase andthe other one dedicated to the retrieval phase as shown inFigure 1 However both of them are based on a layer that isthe Hadoop file system The indexing phase and the retrievalphase have different requirements and then are separatelydesigned The indexing phase needs to write the index filesto the file system and also has to process all the imagesbelonging to the collection Hadoop succeeds in an optimalway to meet these needs On the other hand the retrievalphase should be as fast as possible then the operations shouldbe done in memory for these reasons Spark is better thanHadoop for this task Combining these two technologies intothe system it is possible to take advantage of their positivesides minimizing the impact on the performances of theirnegative sides Concerning the first part (ie the indexingphase) it makes use of the Hadoop technology and it storesthe result (ie the index files) in the Hadoop file systemBased on the HIPI project [5] a list of images is assigned toeach mapper and the result of the processing activity is indexfiles with the features extracted from the images [21] Once allthe images have been processed the reducermerges the indexfiles created by the mappers and generates the final index fileObviously the computational cost expressed in time of thisoperation is

119905I= 119905

Isys + 119905ip = 119905

Isys +119873

sum119899=1

120591ip (119899) (3)

where


(i) 119905I is the total time spent for the indexing phase(ii) 119905ip is the time related only to the processing of the

images belonging to the collection this is made up ofthe time 120591ip spent from each of the119873mappers

(iii) 119905Isys is the time spent from the system strictly related tothe proposed architecture (eg managing of themap-pers and reducer and the readingwriting operationson the Hadoop file system)

In this paper we focus our attention on 119905Isys The secondpart of the system (see Figure 1) responsible for the retrievalphase is based on the Spark technology but it makes use ofthe Hadoop file system to read the index file The systemperforms the retrieval operation by using an image as queryFrom the index file the information needed to perform thecomparison with the query image is loaded into memory andto eachmapper is assigned a list of images with their featuresto compute the similarity with the query image At the endof these tasks the reducer computes the final ranking basedon the similarity computed between the query image and theimages within the index file As well as for the previous partof the system the computational cost expressed in time is

119905R= 119905

Rsys + 119905ic = 119905

Rsys +119873

sum119899=1

120591ic (119899) (4)

where(i) 119905R is the total time spent for the retrieval phase(ii) 119905ic is the time related only to compare the query image

with all the images of the collection this is made upof the time 120591ic spent from each of the119873mappers

(iii) 119905Rsys is the time spent from the system strictly relatedto the proposed architecture (eg managing of themappers and reducer reading the index file on theHadoop file system performing the ranking andprocessing the query image)

It is important to analyse this last component of the com-putational cost Finally in this context the color feature andits comparison metric described and used in [22 23] havebeen consideredThis is not a limitation because any featurescan be deployed in the system adding them to the sameindex file or creating a new index file for each of themdepending on the specific case of use Obviously when theranking of the images belonging to the collection is basedon more features storing them into the same index file ismore efficient Otherwise if the ranking is based on onefeature the system is more efficient when the index file storesonly that feature Furthermore the computational cost ofthe processing of each image depends on which features areemployed but it does not affect the performances evaluationobject of this paper Indeed in this paper the computationalcost introduced by the system is analysed

5 Experimental Results

In Section 4 the architecture of the system is describedfurthermore the aspects that are analysed in this paper are

highlighted As said in the previous section the aim of thispaper is to investigate the computational cost introduced bythe system in both parts the first one based on Hadooprelated to the indexing operations and the second one basedon Spark related to the retrieval operations In Section 51the experiments performed are deeply described in order toexplain the performances that are reported in Section 52

51 Context of the Experiments Although the technologiesused by the system described in Section 3 are scalableand are designed to be used on a cluster of machinesthe experiments are not conducted on a cluster but on asingle machine Our aim is to analyse the computation costintroduced by the proposed system in order to understandits limitations in an image retrieval scenario Our analysis isnot focused on the technologies employed or on the overallperformances of the system in terms of efficiency or retrievalrate indeed just the color feature is consideredThe analysis iscentred on the complexity introduced by the proposed systemin a context heavy from the computational cost such as thoseof image retrieval The operations related to the indexingphase and related to the retrieval phase are performed byusing different collections with a growing size In particularcollections made up of 100 1 K 10 K 100K and 500K imagesare used In the next section the results and the performancesare extensively described

52 Performances The results of the performances analysisof the proposed system are presented in Tables 1 and 2 Inparticular the results regarding the indexing phase that isbased on Hadoop are presented in Table 1 and those regard-ing the retrieval phase that is based on Spark are presentedin Table 2 Our aim is to investigate and to understandthe behaviour of the proposed system for collections withdifferent size In our analysis the computational costs strictlyrelated to the feature extraction and feature comparisonare isolated These aspects affect the performances in termsof retrieval rate precision and recall but not in terms ofcomputational cost that is analysed in this work The featureextraction algorithm and the feature comparison algorithmare relatively simple and then their computational cost is lowThis feature has been chosen for this reason because this isthe worst case of use for the system Roughly speaking if theproposed system is efficient in this case employing it whenthe features are more complex in other words with a highercomputational cost is even more convenient

Table 1 shows that the computational cost due to thearchitecture of the system becomes relevant when the collec-tion ismade up of 500K imagesThis is also highlighted in thegraph shown in Figure 2 Since adding nodes to the clusterimpacts only on the image processing time this aspect hasto be faced when the collection size is 500K or higher It isimportant to notice that the percentage of architecture timeis related to the time to process an image that as said beforeis very small for the chosen feature In any case a possiblesolution is to split the collection in order to minimize theimpact of the architecture time

Concerning the retrieval phase the percentage of archi-tecture time is comparable for any collection size as shown


Table 1 Indexing phase performances (ie Hadoop based)

Number of images Indexingtime [119904] 119905I

Time to processall images [119904] 119905ip

Architecturetime [119904] 119905Isys

Percentage ofarchitecture time

100 29 28 1 3451000 285 280 5 17510000 2820 2800 20 071100000 28297 28000 297 105500000 165661 140000 25661 1549

Table 2 Retrieval phase performances (ie Spark based)

Number of images Retrievaltime [119904] 119905R

Time to processall images [119904] 119905ic

Architecturetime [119904] 119905Rsys


100 0559 0245 0314 56171000 0625 0262 0363 580810000 1140 0367 0773 6781100000 4603 1472 3131 6802500000 22635 9230 13405 5922

18001600140012001000800600400200000

Arc

hite

ctur

e tim

e (

)

100 1000 10000 100000 500000

Number of images

Figure 2 Percentage of architecture timewith respect to the numberof images indexing phase

in the graph in Figure 3 This is essentially related to the timespent to process the query image and to read the index filefrom the Hadoop file system and write it into the memoryAlthough a more efficient implementation of this operationcould be developed this is a necessary operation in orderto take advantage of the Spark technology For instance theindex file could be loaded into memory only once whenthe application is started or when the comparison methodis set The use of this technology allows adding nodes tothe cluster in order to reduce the time needed to makethe comparison among the query image and the images inthe index file This aspect becomes fundamental when thecomparison algorithms are complex in order to maximisethe retrieval rate performances Summarizing the resultsshow that for both phases the proposed system could beefficiently employedwith small collections (egsim100 images)as well as huge collections (eg sim500K images) even if thechosen feature has a very low computational cost Obviouslythe impact of the architecture time decreases when more

000

1000

2000

3000

4000

5000

6000

7000

8000A

rchi

tect

ure t

ime (

)

100 1000 10000 100000 500000

Number of images

Figure 3 Percentage of architecture timewith respect to the numberof images retrieval phase

complex and efficient features are employed to represent theimages in the collection

In Figure 4 the ratio between the indexing time and theretrieval time with respect to the collection size is shownThis is an important aspect of the image retrieval systembecause it should be as high as possible Indeed since theindexing time grows as well as the collection size grows theretrieval time should be bounded in order to not annoy theusers Furthermore while the indexing phase is performedjust once the retrieval phase is performed much more timesThis ratio for the proposed system has a value higher thanthe value obtained by the system which is based only onHadoop proposed in [17] Figure 5 refers to a collectionmadeup of 160K images and the maximum value of the ratio iswhen the cluster is made up of 30 nodes Also in this casethe ratio is lower than the ratio of the proposed system forthe same collection size as shown in Figure 4 although theproposed system has been tested on a single machine thismeans that the ratio could increase if the nodes are added to


000

100000

200000

300000

400000

500000

600000

700000

800000

100 1000 10000 100000 500000

Number of imagesRatio

bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

731880

614751

247368

456005188

Figure 4 Ratio between indexing time and retrieval time withrespect to the collection size for the proposed system

000

1000

2000

3000

4000

5000

6000

1 5 10 20 30

Number of nodesRatio

bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

Figure 5 Ratio between indexing time and retrieval time withrespect to the number of nodes for the system proposed in [17] witha collection made up of 160K images

the cluster This shows that combining Hadoop and Spark ina system of image retrieval as done in the proposed systemimproves the efficiency of the whole system

6 Conclusions

In this work a system to manage huge collection of imagesbased on theMapReduce framework has been presentedThetechnologies used Hadoop and Spark allow the system tobe completely scalable and to reduce the computational costproportionally to the nodes in the cluster Hadoop has beenused in the indexing phase while Spark has been used in theretrieval phaseThe performances in terms of computationalcost and not in terms of retrieval rate are evaluated A rela-tively cheap feature in terms of computational cost regardingthe algorithms of extraction and of comparison has beenemployed in order to analyse theworst case Furthermore theperformances have been evaluated by using collections madeup of 100 1 K 10 K 100K and 500K images Concerningthe indexing phase the results show that the percentage ofarchitecture time is very low with exception for the collection

made up of 500K imagesThis is not a limitation because thispercentage decreases when a more complex feature is usedfurthermore it could be decreased by splitting the collectioninto smaller subcollections On the other hand regardingthe retrieval phase the percentage of architecture time isquite constant for all the collections but could be decreasedwith more efficient implementations Future work shouldbe focused on improving these two critical aspects and ontesting the system behaviour on a cluster with a variablenumber of nodes Finally the performances show that thesystem is efficient for small collections (eg sim100 images)as well as huge collections (eg sim500K images) even withone simple feature Furthermore the results show that theefficiency of the proposed system based on the combinationof two technologies (ie Hadoop and Spark) is higher thanthe efficiency of a system based only on Hadoop

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] C-W Tsai C-F Lai H-C Chao and A V Vasilakos ldquoBig dataanalytics a surveyrdquo Journal of Big Data vol 2 no 1 2015

[2] B White T Yeh J Lin and L Davis ldquoWeb-scale computervision using mapreduce for multimedia data miningrdquo in Pro-ceedings of the 10th International Workshop onMultimedia DataMining (MDMKDD rsquo10) pp 91ndash910 ACM Washington DCUSA July 2010

[3] S Jai-Andaloussi A Elabdouli A Chaffai N Madrane and ASekkaki ldquoMedical content based image retrieval by using theHadoop frameworkrdquo in Proceedings of the 20th InternationalConference on Telecommunications (ICT rsquo13) pp 1ndash5 IEEECasablanca Morocco May 2013

[4] S Fong R Wong and A Vasilakos ldquoAccelerated pso swarmsearch feature selection for data stream mining big datardquo IEEETransactions on Services Computing 2015

[5] S Arietta J Lawrence L Liu and C Sweeney Hipi-hadoopimage processing interface httphipicsvirginiaeduabouthtml

[6] D Moise D Shestakov G Gudmundsson and L AmsalegldquoIndexing and searching 100M images with map-reducerdquo inProceedings of the 3rd ACM International Conference on Mul-timedia Retrieval (ICMR rsquo13) pp 17ndash24 Dallas Tex USA April2013

[7] L Zhou N Xiong L Shu A Vasilakos and S-S Yeo ldquoContext-aware middleware for multimedia services in heterogeneousnetworksrdquo IEEE Intelligent Systems vol 25 no 2 pp 40ndash472010

[8] Y Yan and L Huang ldquoLarge-scale image processing researchcloudrdquo in Proceedings of the 5th International Conference onCloud Computing GRIDs and Virtualization (CLOUD COM-PUTING rsquo14) Venice Italy May 2014

[9] M H Almeer ldquoCloud hadoop map reduce for remote sensingimage analysisrdquo Journal of Emerging Trends in Computing andInformation Sciences vol 3 no 4 pp 637ndash644 2012


[10] J S Hare S Samangooei and P H Lewis ldquoPractical scalableimage analysis and indexing using Hadooprdquo Multimedia Toolsand Applications vol 71 no 3 pp 1215ndash1248 2014

[11] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedmapreduce cloudmapreduce enhancements and applicationsrdquoIEEE Transactions on Network and Service Management vol 11no 1 pp 101ndash115 2014

[12] A V Vasilakos Z Li G Simon and W You ldquoInformationcentric network research challenges and opportunitiesrdquo Journalof Network and Computer Applications vol 52 pp 1ndash10 2015

[13] J Dean and S Ghemawat ldquoMapreduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[14] N K Alham M Li Y Liu and S Hammoud ldquoA MapReduce-based distributed SVM algorithm for automatic image annota-tionrdquo Computers amp Mathematics with Applications vol 62 no7 pp 2801ndash2811 2011

[15] M Yamamoto andK Kaneko ldquoParallel image database process-ing withmapreduce and performance evaluation in pseudo dis-tribuited moderdquo International Journal of Electronic CommerceStudies vol 3 no 2 2012

[16] S M Banaei and H K Moghaddam ldquoHadoop and its role inmodern image processingrdquoOpen Journal of Marine Science vol4 no 4 pp 239ndash245 2014

[17] D Yin and D Liu ldquoContent-based image retrial based onHadooprdquo Mathematical Problems in Engineering vol 2013Article ID 684615 7 pages 2013

[18] H Karau Learning Spark Lightning-Fast Big Data AnalysisOrsquoReilly Media Sebastopol Calif USA 2015

[19] A Shenoy Hadoop Explained Packt Publishing 2014[20] T White Hadoop The Definitive Guide OrsquoReilly 2012[21] H Kocakulak and T T Temizel ldquoA hadoop solution for

ballistic image analysis and recognitionrdquo in Proceedings of theInternational Conference on High Performance Computing andSimulation (HPCS rsquo11) pp 836ndash842 IEEE Istanbul Turkey July2011

[22] L Costantini P Sita L Capodiferro and A Neri ldquoLaguerregauss analysis for image retrieval based on color texturerdquo inWavelet Applications in Industrial Processing VII vol 7535 ofProceedings of SPIE San Jose Calif USA 2010

[23] L Capodiferro L Costantini F Mangiatordi and E PallottildquoSVM for historical sport video classificationrdquo in Proceedingsof the 5th International Symposium on Communications Controland Signal Processing (ISCCSP rsquo12) pp 1ndash4 Rome Italy May2012

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of



RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



framework is explained while in Section 3 the technologiesused in the proposed system are described The architectureof the system is described in Section 4 In Section 5 theexperimental results are reported while the conclusions aredrawn in Section 6

2 MapReduce

The MapReduce framework has been introduced by Google[13] in order to allow a distributed processing over a servercluster Despite the traditional distributed processing frame-work where the data are pushed to the nodes that belongto the cluster which are responsible for the processing inthe MapReduce system the approach is different [10] In thiscase the data are distributed among the nodes and the tasksare pushed to the particular node that stores the data TheMapReduce framework is made up of two steps Map andReduce and the whole framework is based on the conceptof key value pair ⟨119896 V⟩ [14] The Map function known alsoas mapper receives ⟨119896119898in V119898in⟩ as input and produces a listof pairs as output

⟨119896119898out1 V119898out1⟩ ⟨119896119898out1 V119898out(119872minus1)⟩ ⟨119896

119898out119873 V119898out2⟩

⟨119896119898out119873 V119898out119872⟩

(1)

Since the key emitted from the mapper could not be uniquethe Reduce function called reducer groups the values alto-gether for each unique key

⟨119896119898out1 [V119898out1 V119898out

(119872minus1)]⟩ 997888rarr ⟨119896

119903out1 V119903out1⟩ (2)

Depending on the implementation of the MapReduce frame-work the reducer could also producemultiple key value pairsas output There are many advantages in using the MapRe-duce framework related to both data storing and processingaspects Indeed the file system blocks are duplicated acrossthe nodes ensuring in this way a great tolerance in case ofnode failure Furthermore the frameworkmanages the blockin order to minimize the traffic of network data Regardingperformances the framework assigns the task to the nodesthat are not busy balancing in this way the load among thenodes Finally the framework is scalable and the numberof nodes in the cluster depends only on the specific caseof use In the context of image retrieval the MapReduceframework could be employed principally in two ways singleimage processing [15 16] or image collections processing [5]Although it is even possible to combine these two approachesin the proposed architecture the second approach has beenadopted

3 Technologies

The proposed system is based mainly on two technologiesHadoop and Spark which are described in this sectionAlthough these technologies are based on the MapReduceframework for many aspects they are very different and bothof them are used in the proposed system In order to takeadvantage of their potentialities these two technologies are

used in different parts of the system Hadoop and Spark aredescribed in Sections 31 and 32 respectively

31 Apache Hadoop Hadoop developed by the ApacheSoftware Foundation is an open source framework createdby Doug Cutting and Mike Cafarella in 2005 [18] Its aim isto make available a framework for distributed storage and fordistributed processing The main modules that made up theHadoop framework are as follows

(1) Hadoop Common this module contains the librariesand the utilities

(2) Hadoop Distributed File System (HDFS) originallyit was the Google File System This module is adistributed file system used as distributed storage forthe data furthermore it provides an access to the datawith high throughput

(3) Hadoop YARN (MRv2) this module is responsiblefor the job scheduling and for managing the clusterresources

(4) Hadoop MapReduce originally Googlersquos MapReducethis module is a system based on YARN for theparallel processing of the data

There are many projects related to Hadoop such as MahoutHive Hbase and Spark One of the main aspects that char-acterize Hadoop is that the HDFS has a high fault-toleranceto the hardware failure Indeed it is able to automaticallyhandle and resolve these events Furthermore HDFS is ableby the interaction among the nodes belonging to the clusterto manage the data for instance to rebalance them [19 20]The processing of the data stored on the HDFS is performedby the MapReduce framework Although Hadoop is writtenprincipally in Java and C languages it is accessible by manyother programming languages The MapReduce frameworkallows splitting on the nodes belonging to the cluster the tasksthat have to be completed The main drawback of Hadoop isthe lack of performing efficiently real-time tasks Howeverthis is not an important limitation because for these specificaspects other technologies could be employed

32 Apache Spark As well as Hadoop Spark is a project ofthe Apache Software Foundation originally created by theAMPLab at the University of California Berkeley Concern-ing the performances the main advantage of Spark is that itoutperforms Hadoop indeed it is 100x faster for in-memoryoperation and 10x faster for on-disk operations Spark adoptsthe MapReduce paradigm and it is accessible by using theAPI by different programming languages (such as Scala Javaand Python) The core of the system is made up of a setof powerful libraries that currently include parkSQL SparkStreaming MLlib and GraphX Spark is divided into variousindependent layers each one with specific responsibilities

(1) Scala interpreter it takes care of creating an oper-ator graph through RDD (ie responsibility-drivendesign) and applying operators

(2) DAG scheduler it divides operator graph into stagesEvery stage is comprised of tasks based on partitions



Mapper 1

Mapper 2

Reducer

Mapper N

Reducer

Mapper N

Hadoop file system


Hadoop Spark

Index files

Mapper Nminus 1

Mapper 1

Mapper 2

Mapper Nminus 1










119905I= 119905

Isys + 119905ip = 119905

Isys +119873

sum119899=1

120591ip (119899) (3)

where






119905R= 119905

Rsys + 119905ic = 119905

Rsys +119873

sum119899=1

120591ic (119899) (4)


















100 29 28 1 3451000 285 280 5 17510000 2820 2800 20 071100000 28297 28000 297 105500000 165661 140000 25661 1549






100 0559 0245 0314 56171000 0625 0262 0363 580810000 1140 0367 0773 6781100000 4603 1472 3131 6802500000 22635 9230 13405 5922

18001600140012001000800600400200000

Arc

hite

ctur

e tim

e (

)

100 1000 10000 100000 500000

Number of images



000

1000

2000

3000

4000

5000

6000

7000

8000A

rchi

tect

ure t

ime (

)

100 1000 10000 100000 500000

Number of images





000

100000

200000

300000

400000

500000

600000

700000

800000

100 1000 10000 100000 500000


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

731880

614751

247368

456005188


000

1000

2000

3000

4000

5000

6000

1 5 10 20 30


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime



6 Conclusions





References


























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation











Mapper 1

Mapper 2

Reducer

Mapper N

Reducer

Mapper N

Hadoop file system


Hadoop Spark

Index files

Mapper Nminus 1

Mapper 1

Mapper 2

Mapper Nminus 1










119905I= 119905

Isys + 119905ip = 119905

Isys +119873

sum119899=1

120591ip (119899) (3)

where






119905R= 119905

Rsys + 119905ic = 119905

Rsys +119873

sum119899=1

120591ic (119899) (4)


















100 29 28 1 3451000 285 280 5 17510000 2820 2800 20 071100000 28297 28000 297 105500000 165661 140000 25661 1549






100 0559 0245 0314 56171000 0625 0262 0363 580810000 1140 0367 0773 6781100000 4603 1472 3131 6802500000 22635 9230 13405 5922

18001600140012001000800600400200000

Arc

hite

ctur

e tim

e (

)

100 1000 10000 100000 500000

Number of images



000

1000

2000

3000

4000

5000

6000

7000

8000A

rchi

tect

ure t

ime (

)

100 1000 10000 100000 500000

Number of images





000

100000

200000

300000

400000

500000

600000

700000

800000

100 1000 10000 100000 500000


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

731880

614751

247368

456005188


000

1000

2000

3000

4000

5000

6000

1 5 10 20 30


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime



6 Conclusions





References


























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation














119905R= 119905

Rsys + 119905ic = 119905

Rsys +119873

sum119899=1

120591ic (119899) (4)


















100 29 28 1 3451000 285 280 5 17510000 2820 2800 20 071100000 28297 28000 297 105500000 165661 140000 25661 1549






100 0559 0245 0314 56171000 0625 0262 0363 580810000 1140 0367 0773 6781100000 4603 1472 3131 6802500000 22635 9230 13405 5922

18001600140012001000800600400200000

Arc

hite

ctur

e tim

e (

)

100 1000 10000 100000 500000

Number of images



000

1000

2000

3000

4000

5000

6000

7000

8000A

rchi

tect

ure t

ime (

)

100 1000 10000 100000 500000

Number of images





000

100000

200000

300000

400000

500000

600000

700000

800000

100 1000 10000 100000 500000


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

731880

614751

247368

456005188


000

1000

2000

3000

4000

5000

6000

1 5 10 20 30


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime



6 Conclusions





References


























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation















100 29 28 1 3451000 285 280 5 17510000 2820 2800 20 071100000 28297 28000 297 105500000 165661 140000 25661 1549






100 0559 0245 0314 56171000 0625 0262 0363 580810000 1140 0367 0773 6781100000 4603 1472 3131 6802500000 22635 9230 13405 5922

18001600140012001000800600400200000

Arc

hite

ctur

e tim

e (

)

100 1000 10000 100000 500000

Number of images



000

1000

2000

3000

4000

5000

6000

7000

8000A

rchi

tect

ure t

ime (

)

100 1000 10000 100000 500000

Number of images





000

100000

200000

300000

400000

500000

600000

700000

800000

100 1000 10000 100000 500000


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

731880

614751

247368

456005188


000

1000

2000

3000

4000

5000

6000

1 5 10 20 30


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime



6 Conclusions





References


























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










000

100000

200000

300000

400000

500000

600000

700000

800000

100 1000 10000 100000 500000


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime

731880

614751

247368

456005188


000

1000

2000

3000

4000

5000

6000

1 5 10 20 30


bet

wee

n in

dexi

ng ti

me a

nd re

trie

val t

ime



6 Conclusions





References


























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation

























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









Research Article Performances Evaluation of a Novel Hadoop...

Documents

Transcript of Research Article Performances Evaluation of a Novel Hadoop...