University of Wollongong Research Online
University of Wollongong Thesis Collection
2011
Research on a new peer to cloud and peer model and a deduplication storage system
Zhe Sun, University of Wollongong
Research Online is the open access institutional repository for the University of Wollongong. For further information contact Manager Repository Services: [email protected].
Recommended Citation: Sun, Zhe, Research on a new peer to cloud and peer model and a deduplication storage system, Master of Information Systems and Technology by Research thesis, School of Information Systems and Technology, University of Wollongong, 2011. http://ro.uow.edu.au/theses/3333
RESEARCH ON A NEW PEER TO CLOUD
AND PEER MODEL AND A DEDUPLICATION
STORAGE SYSTEM
A Thesis Submitted in Fulfilment of the Requirements for the Award of the Degree of
Master of Information Systems and Technology by Research
from
UNIVERSITY OF WOLLONGONG
by
Zhe SUN
School of Information Systems and Technology
Faculty of Informatics
2011
CERTIFICATION
I, Zhe SUN, declare that this thesis, submitted in partial fulfilment of the requirements for the award of Master of Information Systems and Technology by Research, in the School of Information Systems and Technology, Faculty of Informatics, University of Wollongong, is wholly my own work unless otherwise referenced or acknowledged. The document has not been submitted for qualifications at any other academic institution.
(Signature Required)
Zhe SUN
30 March 2011
Table of Contents
List of Tables
List of Figures/Illustrations
Acknowledgements
ABSTRACT
Publications
1 Introduction
1.1 Overview
1.2 Major issues
1.3 Motivations of this work
1.4 Methodologies
1.4.1 Method of improving data transmission efficiency
1.4.2 Method of developing deduplicated storage system
1.5 Summary of outcomes
1.6 Structure of the thesis
2 Literature Review
2.1 Introduction
2.2 Cloud
2.2.1 Background
2.2.2 Cloud computing
2.2.3 Cloud storage
2.3 Distributed file systems
2.3.1 Distributed file system
2.3.2 Distributed database
2.4 Distributed storage model
2.4.1 Distributed coordinator storage system
2.4.2 Files storage granularity
2.4.3 System scalability
2.5 Related work
2.5.1 Existing deduplicated storage systems
2.5.2 Method of identifying duplications
2.5.3 Hash algorithm
2.5.4 Study of cloud storage model
2.6 Summary
3 P2CP: Peer To Cloud And Peer
3.1 Introduction
3.2 Existing distributed storage models
3.2.1 Peer to peer storage model
3.2.2 Peer to server and peer
3.2.3 Cloud storage model
3.3 A new cloud model: P2CP
3.4 Comparison
3.4.1 Comparison based on Poisson process
3.4.2 Comparison based on Little's law
3.5 Summary
4 DeDu: A Deduplication Storage System Over Cloud Computing
4.1 Introduction
4.1.1 Identifying the duplication
4.1.2 Storage mechanism
4.2 Design of DeDu
4.2.1 Data organization
4.2.2 Storage of the files
4.2.3 Access to the files
4.2.4 Deletion of files
4.3 Environment of simulation
4.4 System implementation
4.4.1 System architecture
4.4.2 Class diagram
4.4.3 Interface
4.5 Summary
5 Experiment Results and Evaluation
5.1 Evaluations and summary of P2CP
5.1.1 Evaluation from network model availability
5.1.2 Evaluation from particular resource availability
5.1.3 Summary of P2CP
5.2 Performance of DeDu
5.2.1 Deduplication efficiency
5.2.2 Balance of load
5.2.3 Reading efficiency
5.2.4 Writing efficiency
5.2.5 Evaluations and summary of DeDu
5.3 Summary
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
A Appendix of DeDu's Code
List of Tables
4.1 Configuration of virtual machine
5.1 Read efficiency
5.2 Write efficiency without deduplication
5.3 Write efficiency with deduplication
List of Figures
1.1 Digital data worldwide
2.1 GFS Architecture [27]
2.2 Table location hierarchy for BigTable [15]
3.1 P2P Storage Model
3.2 P2SP Storage Model
3.3 Traditional Cloud Storage Model
3.4 P2CP storage model
3.5 Time for download
3.6 Comparing download time
4.1 The architecture of source data and link files
4.2 Architecture of deduplication cloud storage system
4.3 Collaborative relationship of HDFS and HBase [12]
4.4 Data organization in DeDu
4.5 Procedures for storing a file
4.6 Procedures to access a file
4.7 Procedures to delete a file
4.8 Overview of Package
4.9 System Implementation
4.10 Commands' Package
4.11 Task Package
4.12 Utility Package
4.13 DeDu's Main Interface
4.14 Configure HDFS
4.15 Configure HBase
4.16 After connecting to HDFS and HBase
4.17 Uploading process
4.18 Downloading process
4.19 Deletion process
5.1 Deduplication efficiency
5.2 Static load balance
5.3 Dynamic load balance
Acknowledgements
I would like to thank my supervisor, Dr. Jun SHEN. With his patience and encouragement I could carry on my experiments and finish this thesis. I also want to thank Dr. Ghassan BEYDOUN, Professor Xiaolin WANG and Dr. Tania SILVER at the University of Wollongong, and Dr. Jianming YONG at the University of Southern Queensland. Without their help and advice, this thesis would not have been possible.

Secondly, I am particularly indebted to my labmates Xiaojun Zhang, Zhou Sun, Hongda Tian, Juncheng Cui, Xing Su, Hongxiang Hu, and my friends, Di Song and Jinlin Li. Without their great help and assistance, this thesis would not have been possible.

Last but not least, I am also very grateful to my parents, who have been encouraging me and always supporting me through the most difficult times of this work. Without their love and understanding, it would have been impossible to accomplish this work.
Research on a new peer to cloud and peer model and a deduplication storage system
Zhe SUN
A Thesis for Master of Information Systems and Technology by Research
School of Information Systems and TechnologyUniversity of Wollongong
ABSTRACT
Since the concept of cloud computing was proposed, scientists have conducted research on it. After Amazon supplied cloud services, many cloud techniques and cloud online applications have been developed. Recently, some traditional Web services have moved their service platforms to cloud platforms. Most of these Web services and applications concentrate on the functions of computing and storage.
This thesis focuses on improving the performance of cloud storage services. In particular, it presents a new storage model named Peer to Cloud and Peer (P2CP). We assume that the P2CP model follows the Poisson process or Little's law, and we prove by mathematical modelling that the speed and availability of P2CP are better than those of the pure Peer to Peer (P2P) model, the Peer to Server and Peer (P2SP) model, and the cloud model. The main feature of P2CP is that there are three data transmissions in the storage model: the cloud-user data transmission, the clients' data transmission, and the common data transmission. P2CP uses the cloud storage system as a common storage system. When data transmission occurs, the data nodes, cloud users, and non-cloud users are all involved together to complete the transaction.
This thesis also presents a deduplication storage system over cloud computing. Our deduplication storage system consists of two major components: a front-end deduplication application and the Hadoop Distributed File System (HDFS). HDFS is a common back-end distributed file system, which is used with HBase, the Hadoop database. We use HDFS to build up a mass storage system and HBase to build up a fast indexing system. With the deduplication application, a scalable and parallel deduplicated cloud storage system can be effectively built up. We further use VMware to generate a simulated cloud environment. The simulation results demonstrate that the storage efficiency of our deduplication cloud storage system is better than that of traditional cloud storage systems.
KEYWORDS: Cloud, Storage, Delete-duplicates, P2CP, P2P
Publications
Zhe SUN, Jun SHEN, and Jianming YONG (2010). DeDu: Building a Deduplication Storage System over Cloud Computing. Accepted by the 2011 15th International Conference on Computer Supported Cooperative Work in Design (CSCWD'11). Accepted date: 01.03.2011.
Zhe SUN, Jun SHEN, and Ghassan BEYDOUN (2010). P2CP: A New Cloud Storage Model to Enhance Performance of Cloud Services. Accepted by the 2011 International Conference on Information Resources Management in association with the Korea Society of MIS Conference (Conf-IRM-KMIS'11). Accepted date: 07.03.2011.
Chapter 1
Introduction
The research reported in this thesis investigates issues in cloud computing. This chapter presents an overview of the research, including the major issues, motivations, methodologies and structure of the thesis. Section 1.1 introduces the history and development status of cloud computing. Section 1.2 describes the major issues existing in the field. Section 1.3 gives the motivations for this work. Section 1.4 presents the methodologies of this work. Section 1.5 summarizes the main results of this thesis. The structure of the thesis is given in the final section, Section 1.6.
1.1 Overview
In 1966, Douglas Parkhill pointed out the possibility of a computer utility that would include the features of elastic provision, delivery as a utility, online access and the illusion of infinite supply, just like the electricity industry [20]. In the development of cloud computing, Amazon has played an indispensable role: the company launched Amazon Web Services (AWS) on a utility computing basis in 2006, which includes the Elastic Compute Cloud (EC2), the Simple Storage Service (S3), and SimpleDB.
After that, Google, Microsoft, IBM, Apache and others joined the queue of large-scale cloud computing research projects. Google offers Google Apps; Microsoft provides Azure; IBM supplies Blue Cloud; and Apache focuses on research into cloud architecture and cloud computing algorithms.

Modern society is a digital universe; almost all current information could not survive outside it. The size of the digital universe in 2007 was 281 exabytes, and by 2011 it was expected to become 10 times larger than it was in 2007. See Figure 1.1 [25].
Figure 1.1: Digital data worldwide
This has led to research focused on the development of storage systems that can store and manage such masses of data. However, with the development of storage systems, a great volume of backup data is created periodically. As time goes by, these data occupy huge storage space. Furthermore, whenever we need these backup data, the data transmission bottleneck caused by network bandwidth constrains the delivery of data. Thus, the importance of new storage systems, which can delete duplicated data and efficiently deliver data to and from the storage system, becomes significant and obvious.
1.2 Major issues
From the above brief description, we are well aware of two main issues, with several sub-issues, that need to be addressed in this thesis.

The first main issue is that, with the significant increase in data, new technology is needed to create a "huge container" to store it; the second is the bottleneck of data transmission on the Internet.

The sub-issues concerning the "huge container" include building a mass data storage platform and developing a new technique to manage the vast amount of data. Furthermore, it is important to introduce the deduplication technique into storage systems, which will keep them running efficiently.

On the other hand, the sub-issues concerning the bottleneck of data transmission on the Internet include introducing the Peer to Peer (P2P) data transmission model into cloud storage systems; solving the problem of persistent data availability; proving that the efficiency of the new storage model is better than that of others; and finding a reliable mathematical model to evaluate the new storage model.
1.3 Motivations of this work
With the development of information technology, many information systems run over the network. Thus, the transmission of data which supports basic computing has become extremely important. In [39], 10 obstacles were clearly defined, and two of them are related to our work: data transfer bottlenecks and scalable storage.

Based on the current network bandwidth and storage capacity all over the world, it should be possible to satisfy users' demands. The only problem is that the resources are not distributed in a balanced way, so the problem shifts to how we can exploit current resources efficiently. From the research in [10], we find that P2P networks can efficiently exploit network bandwidth; but from the work in [34], we find that P2P networks cannot offer steady and persistent availability. On the other hand, the cloud storage model may solve the problem of data capacity, but at the sacrifice of data space efficiency. The motivation of this work is to improve data transmission efficiency while enhancing the usage efficiency of data space.

For deduplication systems, the works of HYDRAstor [14], MAD2 [55], DeDe [8] and Extreme Binning [11] provide limited solutions for deleting duplications. The results of deleting duplications in Extreme Binning and DeDe are not accurate, and both MAD2 and HYDRAstor cannot efficiently employ network bandwidth.

So, with the aim of improving the accuracy of deduplication systems and enhancing network bandwidth usage, we design and build a new deduplication system over the cloud environment.
1.4 Methodologies
1.4.1 Method of improving data transmission efficiency
The method of improving data transmission efficiency is based on research into stor-
age models and mathematic calculation. Firstly, we study the existing storage models
and compare their advantages and disadvantage. We find out the bottlenecks in data
transmission. Secondly, based on the advantages of different storage models, we pro-
pose a new storage model to eliminate the bottleneck in data transmission and offer
persistent data availability. Thirdly, based on our mathematical model, we prove that
the new storage model is better than others. Finally, we use MATLAB to simulate
the process of data transmission and then obtain the results that we need.
1.4.2 Method of developing deduplicated storage system
To build a deduplicated storage system, we divide the whole development process into
three main stages. First stage: the scalable parallel deduplicated algorithm should be
moved into the cloud. In this stage, the main task is to study a good deduplication
algorithm that is scalable and parallel, and transfers it into the cloud computing en-
vironment. Second stage:the scalable, parallel deduplication system for a chunk-based
file backup system should be optimized to adapt it for the cloud storage system. In this
stage, the main task is to optimize the chunk-based file backup system and transfer
it into the cloud computing environment. Final stage: the deduplication function in
the cloud should be provided to the user via the Internet. In this stage, the main task
is to make delelting duplication function as a service, and provide the deduplication
function to users via the Internet.
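The core mechanism behind these stages can be illustrated with a minimal sketch of hash-based deduplication with link counting (a hypothetical in-memory toy, not DeDu's implementation; DeDu itself uses HDFS for content and HBase for the index, as Chapter 4 describes): files with identical content fingerprints are stored once and shared via a link count.

```python
import hashlib

class DedupStore:
    """Toy sketch of hash-based deduplication with link counting.

    Plain dicts stand in for the mass storage system and the fast
    indexing system; names here are illustrative only.
    """

    def __init__(self):
        self.blocks = {}      # fingerprint -> stored content
        self.link_count = {}  # fingerprint -> number of logical files sharing it

    def put(self, data: bytes) -> str:
        """Store data and return its fingerprint; duplicates take no extra space."""
        h = hashlib.sha1(data).hexdigest()
        if h in self.blocks:
            self.link_count[h] += 1   # duplicate: only bump the link count
        else:
            self.blocks[h] = data
            self.link_count[h] = 1
        return h

    def delete(self, h: str) -> None:
        """Drop one logical reference; reclaim space when none remain."""
        self.link_count[h] -= 1
        if self.link_count[h] == 0:
            del self.blocks[h]
            del self.link_count[h]

store = DedupStore()
a = store.put(b"same payload")
b = store.put(b"same payload")   # deduplicated: stored only once
assert a == b and len(store.blocks) == 1
```

Deleting one logical copy merely decrements the link count; the physical data is reclaimed only when the last reference is gone, which mirrors the deletion procedure discussed later for DeDu.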
1.5 Summary of outcomes
The main outcomes of this thesis are the design and construction of a deduplication storage system that runs over cloud computing, and the proposal of a new distributed storage model, Peer to Cloud and Peer (P2CP). The specific outcomes are the following:
1. In this research, we analyze the advantages and disadvantages of existing distributed storage systems, especially the P2P storage model, the Peer to Server and Peer (P2SP) storage model, and the cloud storage model.

2. In this research, we propose a new distributed storage model. This storage model absorbs the advantages of the P2P storage model in highly efficient use of bandwidth, the advantages of the P2SP storage model in persistent data availability, and the advantages of the cloud storage model in scalability.

3. This research proves that the efficiency of data transmission for the new storage model is better than that of traditional distributed storage models. We use the Poisson process and Little's law to set up a mathematical model to calculate data transmission time in a network under the new storage model, and prove that the result is better.
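For reference, Little's law, one of the two tools used in this mathematical model, relates the long-run average number of requests held in a stable queueing system to the arrival rate and the time each request spends in the system:

```latex
L = \lambda W
```

where $L$ is the average number of requests in the system, $\lambda$ is the average arrival rate, and $W$ is the average time a request spends in the system. For example, if requests arrive at $\lambda = 10$ per second and each spends $W = 0.5$ seconds in the system, then on average $L = 5$ requests are in the system at any moment. The detailed application to P2CP appears in Chapter 3.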
4. A deduplication application is developed. The application, as a front-end, forms part of the deduplication storage system. It includes both a command-line operation mode and a graphical user interface (GUI) mode; users who choose the GUI mode do not need to be familiar with the command line.

5. A cloud storage platform is built. We employ HDFS and HBase to build the cloud storage platform: we exploit HDFS to manage the commodity hardware, and employ HBase as a data manager to handle communication with the front-end. When the front-end and back-end collaborate, the deduplicated storage system works well.
1.6 Structure of the thesis
This thesis is organized as follows:

Chapter 2 reviews current work relevant to our research in the areas of the "cloud", distributed file systems, distributed storage models, and the main deduplication techniques.

Chapter 3 gives an analysis of distributed storage models from P2P to cloud. Based on the study of distributed storage models, it proposes a new P2CP storage model and, using a mathematical model, proves that the efficiency of data transmission for the P2CP storage model is better than for other systems.

Chapter 4 proposes a deduplication storage system which works over cloud computing and has been named "DeDu". This chapter introduces the design of DeDu and describes the system implementation in detail.

Chapter 5 shows DeDu's performance, and evaluates both DeDu and the P2CP storage model.

Chapter 6 concludes this thesis by discussing the advantages and limitations of this work and outlining directions for future work.
Chapter 2
Literature Review
2.1 Introduction
In this chapter, we review the literature on the background of cloud computing. The "cloud" is covered in section 2.2, and related work on distributed file systems is introduced in section 2.3. Section 2.4 discusses related research on distributed storage models, and section 2.5 discusses related work on deduplication storage systems. At the end of the chapter, section 2.6 summarises the chapter.
2.2 Cloud
2.2.1 Background
Over the past two years, the idea of cloud computing has become extremely popular, and numerous companies and research institutes have focused on it. They hope that the idea of the "cloud" will enable them to set up new computing architectures in both infrastructure and applications. The following two sections briefly describe the relationships and differences between cloud computing and cloud storage.
2.2.2 Cloud computing
Narrowly defined, cloud computing is a distributed, parallel, and scalable computation system based on the user's demand through the Internet. Generalized cloud computing is Internet-based computing, which is provided by computers and other devices on demand to share resources, software and information. This technology evolved from distributed computing, grid computing, Web 2.0, virtualization, and other network technologies. The core idea is the Map/Reduce algorithm. Map/Reduce is an associated implementation and a programming model for generating and processing huge data sets. Users define a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [26]. By exploiting the Map/Reduce programming model, computation tasks are automatically parallelized and executed on distributed clusters. With the support of a distributed file system and a distributed database, which can be built on commodity machines, cloud computing can achieve high performance, just like an expensive super-cluster.
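The map/reduce flow described above can be sketched with the canonical word-count example (a minimal single-process illustration; a real Map/Reduce framework runs the same two user-defined functions in parallel across a cluster, with the grouping done by a distributed shuffle):

```python
from collections import defaultdict

def map_fn(document: str):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values associated with one key."""
    return key, sum(values)

def mapreduce(documents):
    # Shuffle phase: group intermediate pairs by their intermediate key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = mapreduce(["the cloud", "the peer and the cloud"])
# counts == {"the": 3, "cloud": 2, "peer": 1, "and": 1}
```

Because each map call and each reduce call is independent, the framework can distribute them freely over the nodes of a cluster; that independence is what makes the automatic parallelization mentioned above possible.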
Cloud computing consists of both applications and hardware delivered to users as services via the Internet [39]. With the rapid development of cloud computing, more and more cloud services have emerged, such as SaaS (software as a service), PaaS (platform as a service) and IaaS (infrastructure as a service).
2.2.3 Cloud storage
Cloud storage is an online storage model. The concept of cloud storage is derived from cloud computing. It refers to a series of distributed storage devices accessed over the Internet via Web service application programming interfaces (APIs). With the rocket-like development of cloud computing, the advantages of cloud storage have increased significantly, and the concept of cloud storage has been accepted by the community. The core techniques are the distributed file system and disk virtualization, which feature large scale, stability, fault tolerance, and scalable capacity. Thus, in cloud storage, dedicated storage servers are not strictly necessary, and most data nodes are commodity machines. With the support of cloud computing, the capacity of cloud storage can easily reach the petabyte level while keeping the response time within a few seconds.
2.3 Distributed file systems

As mentioned in the previous section, the distributed file system and the distributed database are core techniques in the field of cloud computing. These techniques are introduced in the following sections.
2.3.1 Distributed file system
The prototype of the distributed file system can be traced back to 1985-1986. Carnegie Mellon University (CMU) proposed the principles and design of a distributed file system in 1985 [50] and, with IBM, developed Andrew, a distributed personal computing environment, in 1986 [40]. Sun Microsystems designed and implemented the Sun network file system (Sun-NFS) in 1985 [48]. Both of these systems have some, albeit limited, features of cloud storage. For example, Andrew consists of approximately 5000 workstations and is much larger than Sun-NFS; Andrew is a research prototype, not a business product; and there is no difference between clients and servers in Sun-NFS. In the following years, many network storage systems and distributed file systems emerged, such as Venti [43], RADOS [47], Petal [21], Ursa Minor [38], Panasas [13], Farsite [6], Sorrento [29], and FAB [59].
With the development of P2P techniques, a new type of distributed file system appeared. These file systems are based on hierarchical peer-to-peer (P2P) block storage, with some peers playing the role of master while other peers are data nodes; examples are the Eliot file system [51] and Serverless Network Storage [61].
The Google File System (GFS) represents a milestone in modern distributed systems. It is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. The architecture of GFS is shown in Figure 2.1 [27].
Figure 2.1: GFS Architecture [27].
Because of the excellent performance of GFS, many other distributed file systems are modeled on it, such as the Hadoop Distributed File System (HDFS) [12] developed by Apache. The advantages of HDFS are that it is open source, runs on commodity hardware, and provides high-throughput access to Web application data. Detailed information on HDFS is given in Chapter 4, section 4.1.2.
2.3.2 Distributed database
The first-generation distributed database was the System for Distributed Databases (SDD-1), whose design was initiated in 1976 and completed in 1978 by the Computer Corporation of America [46]. In the 1970s, many problems and technical issues concerning distributed databases were encountered and solved, including distributed concurrency control [9], distributed query processing [22], resiliency to component failure [52], and distributed directory management [46].
Among modern distributed databases, BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. "A BigTable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes" [15]. BigTable provides a simple data model which gives clients dynamic control over data layout and format. Furthermore, by exploiting a three-level hierarchy analogous to a B+-tree [19] to store tablet location information, whose architecture is shown in Figure 2.2 [15], and thanks to the GFS architecture, BigTable is easy to scale up to increase capacity, offers high input/output (I/O) throughput, and reduces response delay.
Figure 2.2: Table location hierarchy for BigTable [15].
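The quoted data model, a sparse map indexed by (row key, column key, timestamp) whose values are uninterpreted byte arrays, can be sketched as follows (a hypothetical toy class for illustration, not the BigTable or HBase API; the row/column names echo the web-page example from the BigTable paper):

```python
class ToyBigTable:
    """Sparse, multidimensional map: (row key, column key, timestamp) -> bytes."""

    def __init__(self):
        # Only cells that exist are stored, which is what makes the map sparse.
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        """Write one versioned cell; values are uninterpreted byte strings."""
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row: str, column: str, timestamp=None) -> bytes:
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = self.cells[(row, column)]
        ts = max(t for t in versions if timestamp is None or t <= timestamp)
        return versions[ts]

t = ToyBigTable()
t.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
t.put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
assert t.get("com.cnn.www", "contents:") == b"<html>v5</html>"
assert t.get("com.cnn.www", "contents:", timestamp=4) == b"<html>v3</html>"
```

The timestamp dimension is what lets a reader ask for "the value as of time t", a property HBase inherits and which later proves convenient for indexing in DeDu.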
The most famous imitations of BigTable are HBase, developed by Apache [5], and Cassandra, which was open-sourced by Facebook in 2008 and is now also being developed by Apache [4]. HBase completely inherits the architecture of BigTable and operates over HDFS, so it supports the Map/Reduce algorithm and mass data storage very well. Detailed information on HBase is given in Chapter 4, section 4.1.2. Cassandra is focused on growing into a fully functional distributed database.
2.4 Distributed storage model
In this section, some distributed storage systems are introduced and distributed storage models are classified. According to the features of distributed storage systems, it is easy to classify them into different categories based on the distributed coordinator, file storage granularity, and system scalability.
2.4.1 Distributed coordinator storage system
Serverless Network Storage (SNS) is a persistent peer-to-peer network storage application. It has four layers: operation logic; a file information protocol (FIP), which exploits XML-formatted messages to maintain file and disk information; a proposed security layer; and a serverless layer, which is responsible for routine network state information [61].
FAB [59] is a distributed enterprise disk array built from commodity components. It is a highly reliable, continuously serving enterprise storage system that uses two new mechanisms: a voting-based protocol and a dynamic quorum-reconfiguration protocol. With
the voting-based protocol, each request makes progress after receiving replies from a
random quorum of storage bricks, and by this method, any brick can be a coordinator.
The dynamic quorum-reconfiguration protocol changes the quorum configuration of
segment groups.
Researchers at the University of California developed a distributed file system named Ceph [57], which provides excellent performance, reliability, and scalability. It occupies a unique point in the design space, based on CRUSH, which separates data management from metadata management, and RADOS [47], which exploits intelligent object storage devices (OSDs) to manage data without any central servers. EBOFS [56] provides more appropriate semantics and superior performance by addressing the workloads and interface. Finally, Ceph's scalable approach of dynamic sub-tree partitioning offers both efficiency and the ability to adapt to a varying workload. RADOS is a reliable, automatic, distributed object store that exploits device intelligence to distribute consistent data access and provide redundant storage, failure detection, and failure recovery in clusters. Both Ceph and RADOS target a high-performance cluster or data center environment.
2.4.2 File storage granularity
2.4.2.1 File storage at block level
Petal [21] aims to offer continuously available storage and near-unlimited performance and capacity to large numbers of clients. To a Petal client, the system provides large virtual disks. It is a distributed block-level storage system that tolerates and recovers from single-component failures, dynamically balances workload, and expands capacity.
Venti [43] is a network storage system. It uses unique hash values to identify block contents, and with this method it reduces the storage space that data occupies. Venti provides block-level storage for archival applications and enforces a write-once policy to prevent destruction of data. It emerged in the early stages of network storage, so it is not suitable for dealing with mass data, and the system is not scalable.
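Venti's core idea, addressing each block by the hash of its contents so that identical blocks are stored only once, can be sketched as follows. The `ContentAddressedStore` class is a minimal illustration assuming SHA-1 scores and an in-memory dict in place of the on-disk arena, not Venti's actual implementation.

```python
import hashlib

class ContentAddressedStore:
    """Minimal sketch of a Venti-style write-once block store: blocks are
    addressed by the SHA-1 hash of their contents, so identical blocks are
    stored only once. A dict stands in for the on-disk arena."""

    def __init__(self):
        self._blocks = {}

    def write(self, data: bytes) -> str:
        score = hashlib.sha1(data).hexdigest()   # Venti calls this the "score"
        self._blocks.setdefault(score, data)     # write-once: never overwrite
        return score

    def read(self, score: str) -> bytes:
        return self._blocks[score]

store = ContentAddressedStore()
score1 = store.write(b"hello world")
score2 = store.write(b"hello world")   # duplicate block: same score, no new copy
assert score1 == score2 and len(store._blocks) == 1
```

The write-once policy falls out naturally: since the address is derived from the content, rewriting a block with the same content is a no-op, and different content always gets a different address.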
Panasas [13] is a scalable and parallel file system. It uses storage nodes that run the OSDFS object store and manager nodes that run a file-system metadata manager, a cluster manager, and a Panasas client whose file system can be re-exported via NFS and CIFS. By balancing the resources of each node, the system achieves scalability. It exploits non-volatile memory to hide latency and protect caches, distributes metadata across block files, and maintains storage and data nodes in the storage clusters to achieve good performance.
2.4.2.2 File storage at file level
Farsite [6] is a federated, available, and reliable storage system for an incompletely trusted environment. Its core architecture comprises a collection of interacting, Byzantine-fault-tolerant replica groups arranged in a tree that overlays the file-system namespace hierarchy. Availability and reliability are provided by randomized replicated storage; secrecy of file contents is offered by cryptographic techniques; a Byzantine-fault-tolerant protocol maintains the integrity of files and directories. Scalability is achieved by exploiting a distributed hint mechanism and delegation certificates for pathname translations.
2.4.3 System scalability
FS2You [60] is a large-scale online file-sharing system. It has four main components: a directory server, tracking servers, replication servers, and peers. With the peers' assistance, it makes semi-persistent files available and reduces the server bandwidth cost. FS2You exploits hash values to identify data but does not offer a deduplication function.
The Eliot file system [51] is a reliable mutable file system based on peer-to-peer block
storage. Eliot exploits a metadata service in an auxiliary replicated database which
is separated and generalized to isolate all mutation and client state. The Eliot file
system consists of several components, which are: an untrusted, immutable, reliable
P2P block storage substrate known as the Charles block service; a trusted, replicated
database, known as the metadata service (MS), storing mutable nodes, directories,
symlinks, and superblocks; a set of file system clients; and zero, one, or more cache
servers. Cache servers are intended to improve performance, but are not necessary for correctness.
RUSH [30] is a family of algorithms for scalable, decentralised data distribution. RUSH maps replicated objects to a scalable collection of storage servers or disks. RUSH algorithms distribute objects to servers according to user-specified server weightings. There is no central directory, so different RUSH variants have different look-up times.
Ursa Minor [38] is a cluster-based storage system with four components: storage nodes, an object manager, a client library, and an NFS server. Its protocol family includes a timing model, a storage-node failure model, and a client failure model, allowing data-specific selection of, and on-line changes to, encoding schemes and fault models. In this way, Ursa Minor achieves good results for trace, OLTP, scientific, and campus workloads.
2.5 Related work
2.5.1 Existing deduplicated storage systems
HYDRAstor [14] is a scalable, secondary storage solution, which includes a back-end consisting of a grid of storage nodes with a decentralized hash index, and a traditional file system interface as a front-end. The back-end of HYDRAstor is based on a Directed Acyclic Graph (DAG), which organizes large-scale, variable-size, content-addressed, immutable, and highly resilient data blocks. HYDRAstor detects duplications according to the hash table. This approach targets backup systems; it does not consider the situation where multiple users need to share files.
Extreme Binning [11] is a scalable, parallel deduplication approach aimed at non-traditional backup workloads composed of low-locality individual files. Extreme Binning exploits file similarity instead of locality and requires only one disk access for chunk look-up per file. It organizes similar files into bins and deletes replicated chunks inside each bin. Extreme Binning keeps only the primary index in memory in order to reduce RAM occupation. Since this approach is not an exact deduplication method, duplicate chunks may still exist among bins.
MAD2 [55] is a deduplication network backup service that works at both the file level and the chunk level. It uses four techniques to achieve high performance: a hash bucket matrix (HBM), a Bloom filter array (BFA), a dual cache, and DHT-based load balancing. This approach is not an exact deduplication method.
DeDe [8] is a block-level deduplication cluster file system without central coordination. In the DeDe system, each host keeps summaries of what it has written to the cluster file system. Each host periodically and independently submits its summaries to a shared index and reclaims duplications. These deduplication activities do not occur at the file level, and the results of deduplication are not exact.
Duplicate Data Elimination (DDE) [28] employs a combination of content hashing,
copy-on-write, and lazy updates to achieve the functions of identifying and coalescing
identical data blocks in a storage area network (SAN) file system. It always runs in the background.
2.5.2 Method of identifying duplications
From the previous study we know that the existing approaches for identifying dupli-
cations always work on two different levels. One is the file level, such as MAD2; the
other is the chunk level, for example, HYDRAstor, Extreme Binning, DDE and DeDe.
To handle scalable deduplication, two well-known approaches have been proposed: sparse indexing [35] and Bloom filters with caching [62]. Sparse indexing is a technique to solve the chunk look-up bottleneck, caused by disk access, by using sampling and exploiting the inherent locality within backup streams. It picks a small portion of the chunks in the stream as samples; the sparse index then maps these samples to the existing segments in which they occur. Incoming streams are broken up into relatively large segments, and each segment is deduplicated against only a few of the most similar previous segments. The Bloom filter approach exploits a Summary Vector, a compact in-memory data structure for identifying new segments; Stream-Informed Segment Layout, a data layout method that improves on-disk locality for sequentially accessed segments; and Locality Preserved Caching of cache fragments, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios.
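The Summary Vector idea can be illustrated with a minimal Bloom filter: a bit array that answers "definitely new" or "possibly seen" for a chunk fingerprint without touching the on-disk index. The class name, sizes, and hashing scheme below are illustrative choices, not those of the cited systems.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item over an m-bit array.
    A set bit pattern means 'possibly seen'; any unset bit means 'new'."""

    def __init__(self, m_bits=8192, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k positions by salting SHA-1 with the hash index.
        for i in range(self.k):
            h = hashlib.sha1(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add(b"chunk-fingerprint-1")
assert bf.might_contain(b"chunk-fingerprint-1")    # no false negatives
# A fingerprint never added is almost certainly reported as new:
print(bf.might_contain(b"chunk-fingerprint-2"))
```

The trade-off is a small, tunable false-positive rate: a "possibly seen" answer still requires an index lookup, but a "new" answer skips the disk entirely.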
Walter Santos et al. proposed a scalable parallel deduplication algorithm [49]. The algorithm was developed in an "anthill" programming environment and exploits task and data parallelism, multi-programming, and message-coalescing techniques to achieve scalability. Their parallelization is based on four filters: reader comparator, blocking, classifier, and merger.
2.5.3 Hash algorithm
A hash value is also called a message digest, or simply a digest. Any change in the original data leads to a change in the hash value; thus hash values are widely used in cryptographic messaging and data identification. There are several cryptographic hash functions, and the most famous are the Message-Digest (MD) series. MD2, described by Kaliski [32], takes a message of arbitrary length as input and outputs a 128-bit message digest, and was intended for digital signatures. MD4 was designed by Ronald Rivest in 1990 [45]; it is used in message integrity checks, and its digest length is also 128 bits. MD4 influenced later cryptographic hash functions such as MD5, SHA-1, and the RACE Integrity Primitives Evaluation Message Digest (RIPEMD). However, the first full collision attack against MD4 appeared in 1995. To replace MD4, Ronald Rivest designed MD5 in 1992 [44]. MD5 has been widely used in security applications and is commonly used to check the integrity of files. However, on August 17, 2004, Xiaoyun Wang et al. announced collisions for the full MD5, which were published in [54] in 2005.
The National Security Agency (NSA) in the U.S.A. designed a series of cryptographic hash functions named the Secure Hash Algorithm (SHA), published by the National Institute of Standards and Technology (NIST) [41]. They include SHA-1, SHA-2, and SHA-3. SHA-1 produces a 160-bit message digest based on principles similar to those of MD4 and MD5, but with a more conservative design, and it shows strong resistance to attacks. SHA-2 provides two specifications of the message digest: one a 256/224-bit message digest, the other a 512/384-bit message digest. Furthermore, a new hash standard, SHA-3, is currently being developed. To date, there is no report of a collision attack on this series of cryptographic hash functions, except for SHA-0, which was withdrawn in 1995.
The RACE Integrity Primitives Evaluation Message Digest (RIPEMD) is a 128-bit message-digest algorithm designed by Hans Dobbertin's group based on MD4 and SHA-1. It was broken by Xiaoyun Wang [53]. RIPEMD-160 was designed to strengthen the hash function, and RIPEMD-128, RIPEMD-256, and RIPEMD-320 were also developed [3].
Besides these hash functions, GOST [36], HAVAL [33], Panama, and RadioGatun exist, but they have not been widely adopted.
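The digest sizes quoted above are easy to confirm with Python's standard `hashlib` module (RIPEMD-160 is available only when the underlying OpenSSL build provides it, so it is omitted here):

```python
import hashlib

msg = b"The quick brown fox jumps over the lazy dog"

# Digest sizes of the algorithms discussed above (bits = bytes * 8).
print(hashlib.md5(msg).digest_size * 8)       # 128-bit digest (MD5)
print(hashlib.sha1(msg).digest_size * 8)      # 160-bit digest (SHA-1)
print(hashlib.sha256(msg).digest_size * 8)    # 256-bit digest (SHA-2 family)
print(hashlib.sha512(msg).digest_size * 8)    # 512-bit digest (SHA-2 family)

# Any change to the input changes the digest:
assert hashlib.md5(msg).hexdigest() != hashlib.md5(msg + b"!").hexdigest()
```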
2.5.4 Study of cloud storage model
In Feng and et.al’s paper [24], they analyzed several current existing cloud storage
platforms, such as Simple Storage Service, Secure Data Connector, and Azure Storage
Service, with the focus on the problem of security. Furthermore, they pointed out the
problem of repudiation. They proposed and specifically designed a non-repudiation
protocol suitable for the cloud computing environment by using third authorities cer-
tified (TAC) and secret key sharing (SKS) techniques.
In Fang and et.al’s works [23], they analyze the differences between the pure P2P
network and the P2SP network, and their assumption is that the peer arrival and
departure rates follow the Poisson process or Little’s law. Finally, they proved that
P2SP has higher performance than P2P based on two assumptions.
Tahoe [58] is an open-source grid storage system which has been deployed in a commercial backup service and is currently operating. It uses capabilities for access control, cryptography for confidentiality and integrity, and erasure coding for fault tolerance. Dropbox [31] is an online storage system. Dropbox allows users to sync files automatically and to share files between different users via Web services; it can also be accessed from mobile devices. Its back-end is a cloud storage platform.
2.6 Summary
This chapter reviewed literature relevant to our research from four aspects. In the first section, the background on "cloud", cloud computing, and cloud storage was introduced. In section 2.3, existing distributed file systems and distributed databases were reviewed. In section 2.4, existing distributed storage models were analyzed from three aspects: how the master servers are distributed in the storage system, file storage granularity, and system scalability. In section 2.5, some existing deduplication storage systems were reviewed, and the main deduplication techniques were introduced. Furthermore, related works on both the P2SP and the P2P storage models were reviewed.
Chapter 3
P2CP: Peer To Cloud And Peer
3.1 Introduction
Cloud computing is undergoing rapid development. Google, Amazon, Microsoft, and
many other companies have recently been focusing on cloud computing and releasing
related storage products, such as Google file system (GFS), Amazon Elastic Compute
Cloud (EC2), Azure, etc. All of these are based on cloud distributed storage models.
In a typical cloud storage model, during a download session the data transmission between cloud users is zero. This decreases the utilization rate of bandwidth. The current alternative file-sharing protocol, Peer to Peer (P2P), has a high utilization rate of bandwidth, but it cannot offer continuous availability. To solve these problems, we have studied several existing distributed storage models and propose, in this chapter, the P2CP storage model as the solution. This model exploits the P2P protocol to enhance data transmission performance and, at the same time, uses a cloud storage system to provide continuous availability. We assume that the P2CP model follows the Poisson process or Little's law. We mathematically prove that the speed and availability of P2CP are indeed better than those of the pure P2P model, the Peer to Server and Peer (P2SP) model, or the pure Cloud model.
3.2 Existing distributed storage models
In this section, we study some existing distributed storage models, including the P2P
model, the P2SP model, and the cloud storage model.
3.2.1 Peer to peer storage model
In a pure P2P storage model, each peer is equal: peers act as both clients and servers. There is no master server to manage the network, metadata, and data. Especially in a commodity machine environment, each peer is volatile, which makes the whole network unstable; a particular problem is offering persistent availability of a specific file. Typical applications are Gnutella before version 0.4 [34], Freenet [17], Sorrento [29], etc. The architecture of the pure P2P storage model is shown in Figure 3.1. In a P2P storage model, users get data from each other; when a user joins the network, they become a server, or "seed". The advantage of the P2P storage model is that it exploits the network bandwidth efficiently, but sometimes no server or seed containing the desired resource exists in the network, so the file-sharing process has to stop. The disadvantage of the P2P storage model is therefore that it is hard to offer persistent data availability.
Figure 3.1: P2P Storage Model.
3.2.2 Peer to server and peer
To solve the problem of persistent availability in the pure P2P storage model, a hybrid P2P model emerged: Peer to Server and Peer (P2SP). In this storage model, peers are divided into a client group and a server group. The client group is responsible for handling the data transmission, and the server group acts as a master server to coordinate the P2P structure. However, the workload of the master servers is very heavy, and without the server group the P2P network does not work. Typical P2SP applications are eMule [37], BitTorrent [18], FS2You [60], etc. For example, FS2You is a large-scale online storage system; with the peers' assistance, it makes semi-persistent files available and reduces the server bandwidth cost. When clients are going to download data, they first download data from the server, and then exchange data with each other. If the other peers are not available, the client downloads all the data from the server. Whether the server is a cluster or a distributed server system, the client connects to a single physical server within that cluster or distributed server system. The architecture of P2SP is shown in Figure 3.2.
Figure 3.2: P2SP Storage Model
3.2.3 Cloud storage model
Cloud computing consists of both applications and hardware delivered to users as
services via the Internet. With the rapid development of cloud computing, more and
more cloud services have emerged, such as SaaS (software as a service), PaaS (platform
as a service) and IaaS (infrastructure as a service).
The concept of cloud storage is derived from cloud computing. It refers to storage accessed over the Internet via Web service application program interfaces (APIs). There are many cloud storage systems in existence, for example, Amazon S3 (Amazon, 2006), the Google file system [27], HDFS [12], etc.
The traditional cloud storage system is a scalable, reliable, and available file distribution system with high performance. These system models consist of master nodes and multiple chunk servers. Data is accessed by multiple clients, and all files in the system are divided into fixed-size chunks. The master node (DFS master) maintains all file-system metadata. The master node asks each chunk server about its own chunk information at master start-up and whenever a chunk server joins the cluster. Clients never read or write file data through the master; instead, they ask the master which chunk server they should contact.
The problem is that clients get data from the individual data nodes, but the clients do
not have any communication among themselves. The architecture of the cloud storage
model is shown in Figure 3.3.
Figure 3.3: Traditional Cloud Storage Model
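The read path just described, in which a client asks the master only for metadata and then fetches chunks directly from chunk servers, can be sketched as follows. The classes, chunk size, and round-robin placement policy are simplified assumptions for illustration, not the actual GFS or HDFS API.

```python
CHUNK_SIZE = 4  # bytes; tiny for illustration (real systems use e.g. 64 MB)

class ChunkServer:
    def __init__(self):
        self.chunks = {}                      # chunk handle -> chunk data

class Master:
    """Holds metadata only: which chunk server stores each chunk of a file."""
    def __init__(self):
        self.file_chunks = {}                 # filename -> [(handle, server)]

    def locate(self, filename):
        return self.file_chunks[filename]     # clients get locations, not data

def write_file(master, servers, name, data):
    master.file_chunks[name] = []
    for i in range(0, len(data), CHUNK_SIZE):
        handle = f"{name}#{i // CHUNK_SIZE}"
        server = servers[(i // CHUNK_SIZE) % len(servers)]  # round-robin placement
        server.chunks[handle] = data[i:i + CHUNK_SIZE]
        master.file_chunks[name].append((handle, server))

def read_file(master, name):
    # Step 1: ask the master which chunk servers to contact.
    # Step 2: read the file data directly from those chunk servers.
    return b"".join(srv.chunks[h] for h, srv in master.locate(name))

master, servers = Master(), [ChunkServer(), ChunkServer(), ChunkServer()]
write_file(master, servers, "demo.txt", b"hello cloud storage")
assert read_file(master, "demo.txt") == b"hello cloud storage"
```

Note that in this sketch, as in the real systems, data flows only between the client and the chunk servers; the master stays off the data path, which is what keeps it from becoming a bandwidth bottleneck.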
3.3 A new cloud model: P2CP
We propose a new cloud storage model: the peer to cloud and peer (P2CP) model. In this model, cloud users can download data from the storage cloud and exchange data with other peers at the same time, regardless of whether the other peers are cloud users or not. There are three data transmission tunnels in this storage model. The first is the cloud-user data transmission tunnel, which is responsible for data transactions between the cloud storage system and the cloud users. The second is the clients' data transmission tunnel, which is responsible for data transactions between individual cloud users. The third is the common data transmission tunnel, which is responsible for data transactions between cloud users and non-cloud users. Figure 3.4 is an example showing how the P2CP cloud model works. In Figure 3.4, cloud user 2 is downloading data from data node 1, which is in the cloud, and at the same time is exchanging data with cloud users 1 and 3 and common peers 2, 5, and 6. By exploiting multiple data transmission tunnels, cloud users can achieve a high download speed. On the other hand, the P2CP model avoids extremely high workloads on cloud servers as the number of cloud users increases. Even when cloud resources are committed to other transmission activities, non-cloud users may still gain access to resources in the cloud that are not available in the P2P networks.
Figure 3.4: P2CP storage model
Other existing models, such as Groove [42], which is comparable to Microsoft SharePoint [16], and Tahoe [58], tend to balance loads between peers and cloud servers in different ways. In our P2CP model, however, peers may communicate directly and flexibly with each other without tight dependence on servers, though some advanced features, such as backup, caching, security, and versioning of data, may still be delegated to servers, because peers' storage and computing capacities are presumably inferior to those of cloud servers.
Before the comparison, we will briefly describe the roles and the functions in the
pure P2P storage model, the P2SP storage model, and the Cloud storage model. In
the pure P2P storage model, peers are divided into seeds, which are denoted by S,
and leeches, which are denoted by L. Initially, seeds have the whole file, and leeches
do not have any block of the file, but as time passes, leeches obtain blocks and ex-
change blocks with other peers. When the leeches get the whole file, they may leave
the network or stay in the network as seeds. In the P2SP network storage model, the
difference is that it has a server group. Normally, in the Cloud storage model, there
are three replicas of the file existing in different data nodes, and each data node keeps
different amounts of blocks of the file. In the P2CP storage model, the storage cloud
replaces the role of the server in the P2SP model.
3.4 Comparison
In this section, we evaluate our P2CP storage model against the three key storage models described in Section 3.2: the pure P2P model, the cloud model, and the P2SP model. For network storage models, the two most important performance parameters are average downloading time and usability. We will compare the average downloading times of these storage models using a mathematical model. We assume:
Seed: each seed's upload bandwidth is Us; the number of seeds is Ns.
Peer: each peer's upload bandwidth is Up; the number of peers is Np.
Server: the average upload bandwidth of each server is Use; the number of servers is Nse. The average number of peers and seeds is N.
Cloud: the upload bandwidth of each data node is Uc; the number of data nodes is Nc.
F is the size of the file.
T is the average downloading time.
t is the current elapsed time of the data transmission.
O is usability.
U is the average upload bandwidth of peers and seeds.
λ is the arrival rate of peers joining the network.
µ is the departure rate of peers leaving the network.
λ must be greater than µ; otherwise, the P2P network will not survive.
3.4.1 Comparison based on Poisson process
The Poisson process is very useful for modelling purposes in many practical applications, and it has been found empirically to approximate well many circumstances arising in stochastic processes [1]. We assume that peers arrive and leave approximately according to a Poisson process; this assumption is consistent with the literature [23]. The numbers of peers and seeds existing in the pure P2P network are modelled on the M/G/∞ queue. We assume that two peers constitute the smallest pure P2P network; the smallest P2SP network includes one server and one peer; the smallest cloud includes one master node and one data node; and the smallest P2CP network includes one smallest cloud and one peer. The number of peers and seeds existing in the pure P2P network at time t is then:

N = (λ − µ)t    (3.1)
If it takes a peer time T to download a file of size F in the P2P network, we get:

$$\sum_{k=1}^{N_s} \int_0^T U_s \, dt = F \qquad (3.2)$$

$$\sum_{k=1}^{N_s} \left[(\lambda-\mu)U_s\right] \tfrac{1}{2}T^2 = F \qquad (3.3)$$
If it takes a peer time T to download a file of size F in the P2SP network, we get:

$$\sum_{k=1}^{N_s} \int_0^T U_s \, dt + T \sum_{k=1}^{N_{se}} U_{se} = F \qquad (3.4)$$

$$\sum_{k=1}^{N_s} \left[(\lambda-\mu)U_s\right] \tfrac{1}{2}T^2 + T \sum_{k=1}^{N_{se}} U_{se} = F \qquad (3.5)$$
If it takes a peer time T to download a file of size F in the P2CP network, we get:

$$\sum_{k=1}^{N_s} \int_0^T U_s \, dt + T \sum_{k=1}^{N_c} U_c = F \qquad (3.6)$$

$$\sum_{k=1}^{N_s} \left[(\lambda-\mu)U_s\right] \tfrac{1}{2}T^2 + T \sum_{k=1}^{N_c} U_c = F \qquad (3.7)$$
If it takes a peer time T to download a file of size F in the Cloud system, we get:

$$N_c U_c T = F \qquad (3.8)$$
The relationship between the number of cloud host servers and their relative throughput is volatile; according to a typical Cloud storage system configuration, we assume (3.9) for convenience of computation:

$$N_c \geq 3N_{se} \qquad (3.9)$$
We assume that:

$$A = \sum_{k=1}^{N} (\lambda-\mu)U \qquad (3.10)$$

$$B = \sum_{k=1}^{N_{se}} U_{se} \qquad (3.11)$$

$$C = \sum_{k=1}^{N_c} U_c = 3\sum_{k=1}^{N_{se}} U_{se} = 3B \qquad (3.12)$$
According to (3.3), (3.5), (3.10), and (3.11), we get:

$$\frac{A}{2}T^2 - F = 0 \qquad (3.13)$$

$$\frac{A}{2}T^2 + BT - F = 0 \qquad (3.14)$$

According to (3.12) and (3.14), we get:

$$\frac{A}{2}T^2 + CT - F = \frac{A}{2}T^2 + 3BT - F = 0 \qquad (3.15)$$
Then, solving equations (3.8), (3.13), (3.14), and (3.15) for T, we get:

$$T_c = \frac{2F}{6U_{se}N_{se}} \qquad (3.16)$$

$$T_{P2P} = \frac{2F}{\sqrt{2FA}} \qquad (3.17)$$

$$T_{P2SP} = \frac{2F}{\sqrt{B^2 + 2FA} + B} \qquad (3.18)$$

$$T_{P2CP} = \frac{2F}{\sqrt{C^2 + 2FA} + C} = \frac{2F}{\sqrt{9B^2 + 2FA} + 3B} \qquad (3.19)$$
For our comparative purposes, we assume that the size of the file is 100,000 KB, the upload bandwidth of the peers and seeds is 20 KB/s, the upload bandwidth of a server is 100 KB/s, the arrival rate of peers is 2 peers/s, and the departure rate is 1 peer/s. If the arrival rate were lower than the departure rate, the number of peers and seeds would drop to zero, and the P2P network would cease to exist.
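Substituting these numbers into the closed-form download times (3.16) to (3.19) reproduces the ordering plotted in Figure 3.5. The thesis used MATLAB for these calculations; the sketch below is an illustrative Python equivalent.

```python
from math import sqrt

F = 100_000          # file size (KB)
U = 20               # peer/seed upload bandwidth (KB/s)
Use, Nse = 100, 1    # server upload bandwidth (KB/s) and server count
lam, mu = 2, 1       # peer arrival and departure rates (peers/s)

A = (lam - mu) * U   # growth rate of aggregate peer bandwidth, as in (3.10)
B = Nse * Use        # aggregate server bandwidth, as in (3.11)
C = 3 * B            # aggregate cloud bandwidth, using (3.9) and (3.12)

T_cloud = 2 * F / (6 * Use * Nse)              # (3.16)
T_p2p   = 2 * F / sqrt(2 * F * A)              # (3.17)
T_p2sp  = 2 * F / (sqrt(B**2 + 2*F*A) + B)     # (3.18)
T_p2cp  = 2 * F / (sqrt(C**2 + 2*F*A) + C)     # (3.19)

print(round(T_p2p, 1), round(T_p2sp, 1), round(T_p2cp, 1), round(T_cloud, 1))
assert T_p2cp < T_p2sp < T_p2p                 # the ordering in Figure 3.5
```

With these parameters the P2P download takes 100 s exactly, while P2CP is the fastest of the peer-assisted models and the standalone cloud is by far the slowest.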
Figure 3.5 clearly shows that P2CP has the minimal download time and P2P the maximal, with P2SP in the middle. When there are not too many peers, the difference in download time is quite obvious; as more peers join the network, the download time decreases.
Figure 3.5: Time for download
We ran another test with growing upload bandwidth for the peers. Assume that the size of a file is 100,000 KB and the upload bandwidths for peers and seeds are 20 KB/s, 40 KB/s, or 60 KB/s, while the upload bandwidth of the server is 100 KB/s. The arrival rate of peers is 2 peers/s, and the departure rate is 1 peer/s. Figure 3.6 shows that as the upload bandwidth of the peers increases, the download time decreases correspondingly. At the same time, the differences in download time between P2P, P2SP, and P2CP are also reduced. The performance of the pure Cloud storage model is not shown in Figure 3.6, because its result varies significantly: in some instances it outperforms the P2P and P2SP models, depending on the chunk distribution in the cloud storage system, but it never outperforms our P2CP storage model.
Figure 3.6: Comparing download time
3.4.2 Comparison based on Little’s law
It is difficult to prove that the peer and seed arrival and departure processes exactly follow a Poisson process. Therefore we use Little's law, which relates L (the average number of peers in the system), W (the sojourn time), and λ (the average arrival rate) [1], as in (3.20):

$$L = \lambda W \qquad (3.20)$$

Based on Little's law, we get:

$$N = (\lambda - \mu)T \qquad (3.21)$$

According to (3.10) and (3.21), we obtain:

$$A = \sum_{k=1}^{N} \frac{N}{T}U \qquad (3.22)$$
Then, substituting into the earlier equations, we obtain:

$$T_c = \frac{2F}{6B} \qquad (3.23)$$

$$T_{P2P} = \frac{2F}{NU} \qquad (3.24)$$

$$T_{P2SP} = \frac{2F}{NU + 2B} \qquad (3.25)$$

$$T_{P2CP} = \frac{2F}{NU + 2C} = \frac{2F}{NU + 6B} \qquad (3.26)$$

According to (3.24), (3.25), and (3.26), we obtain:

$$T_{P2P} \geq T_{P2SP} \geq T_{P2CP} \qquad (3.27)$$
Thus, minimum download time is possible with P2CP, then P2SP, and lastly with
P2P.
3.5 Summary
This chapter first introduced several existing distributed storage models, including the P2P, P2SP, and Cloud storage models, and analyzed their respective advantages and disadvantages. Secondly, based on these drawbacks, we proposed a new cloud storage model, the "P2CP" storage model, to improve data transmission efficiency in a distributed storage system. After that, we compared the downloading efficiency of the P2CP storage model with that of the other storage models by mathematical methods, and then used MATLAB to calculate the results.
Chapter 4
DeDu: A Deduplication Storage
System Over Cloud Computing
4.1 Introduction
In the early 1990s, the "once write, multi read" storage concept, realized on a storage medium that was typically an optical disk, was established and widely used. The disadvantages of this concept were the difficulty of sharing data via the Internet and the enormous waste of storage space needed to keep replications, so "once write, multi read" fell into disuse. However, in this network storage era, based on the concept of "once write, multi read", we propose a new network storage system, named "DeDu", to store data and save storage space.
The idea is that, when a user uploads a file for the first time, the system records this file as source data, and the user receives a link file guiding them to the source data. Once the source data has been stored in the system, if the same data is uploaded by other users, the system does not accept the same data as new; rather, the uploading user receives a link file pointing to the original source data. Users are allowed to read the source data but not to write it. Under these conditions, the source data can be reused by many users, and furthermore, there is a great saving of storage space. The architecture is shown in Figure 4.1.
Figure 4.1: The architecture of source data and link files
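The first-upload and link-file behaviour described above can be sketched as follows. `DeDuStore` and its methods are hypothetical illustrations of the idea, not the actual DeDu implementation.

```python
import hashlib

class DeDuStore:
    """Sketch of DeDu's upload path: the first copy of a file is stored as
    source data; later uploads of identical content only receive a link."""

    def __init__(self):
        self.source_data = {}   # fingerprint -> file contents (stored once)
        self.link_count = {}    # fingerprint -> number of link files issued

    def upload(self, data: bytes) -> str:
        # Combined MD5 + SHA-1 fingerprint, the primary key DeDu uses.
        fp = hashlib.md5(data).hexdigest() + hashlib.sha1(data).hexdigest()
        if fp not in self.source_data:
            self.source_data[fp] = data           # first upload: keep source
        self.link_count[fp] = self.link_count.get(fp, 0) + 1
        return fp                                 # the user's read-only link

store = DeDuStore()
link1 = store.upload(b"annual report 2011")
link2 = store.upload(b"annual report 2011")       # second user, same file
assert link1 == link2                             # both point at one copy
assert len(store.source_data) == 1                # stored only once
```

Because links are read-only references, deleting a user's link never touches the source data; the source itself can be reclaimed once no links remain.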
4.1.1 Identifying the duplication
There are two ways to identify duplications in a cloud storage system: one is comparing blocks or files bit by bit, and the other is comparing them by hash values. The advantage of bit-by-bit comparison is that it is accurate, but it is time consuming. The advantage of comparison by hash value is that it is very fast, but there is a chance of accidental collision. This chance depends on the hash algorithm, but it is in fact extremely small; furthermore, combining the digests of MD5 and SHA-1 reduces the probability significantly. Therefore, it is entirely acceptable to use a hash function to identify duplications [19, 20].
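The claim that accidental collisions are negligible can be quantified with the standard birthday bound: for n uniformly random d-bit digests, the probability of any collision is at most about n(n-1)/2^(d+1). The file counts below are illustrative.

```python
# Birthday-bound estimate: the probability that any two of n random d-bit
# digests collide is at most about n*(n-1) / 2**(d+1).
def collision_bound(n_files: int, digest_bits: int) -> float:
    return n_files * (n_files - 1) / 2.0 ** (digest_bits + 1)

# Even for a billion files, a 128-bit digest (MD5 alone) keeps the
# accidental-collision probability astronomically small, and the combined
# 128 + 160 = 288-bit MD5+SHA-1 fingerprint keeps it smaller still.
print(collision_bound(10**9, 128))   # on the order of 1e-21
print(collision_bound(10**9, 288))   # on the order of 1e-69
```

Note that this bound covers only accidental collisions between honest files; deliberately crafted collisions are a separate cryptanalytic question, which is one motivation for combining two different hash functions.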
Existing approaches for identifying duplications work at two different levels: the file level and the chunk level. At the chunk level, data streams are divided into chunks; each chunk is hashed, and all the hash values are kept in an index. The advantage of this approach is that it is convenient for a distributed file system to store chunks, but the drawback is the increased number of hash values, which occupy more RAM and increase the look-up time. At the file level, the hash function is executed once per file, and all hash values are kept in the index. The advantage of this approach is that it decreases the number of hash values significantly; the drawback is that hashing a large file is somewhat slow.
In this thesis, our deduplication method works at the file level, based on the comparison
of hash values. There are several hash algorithms, including MD5, SHA-1, and
RIPEMD. We use both the SHA-1 and the MD5 algorithms to identify duplications.
Although the probability of accidental collision for either alone is extremely small, we
merge the MD5 hash value and the SHA-1 hash value into a single primary value in
order to reduce it further. If the MD5 and SHA-1 algorithms become unsuitable for our
system scale, they can be changed at any time. The reason for choosing file-level
deduplication is that we want to keep the index as small as possible, in order to achieve
high lookup efficiency.
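As an illustration of the merged-digest primary value described above, the sketch below computes an MD5+SHA-1 fingerprint for a file. The name `file_fingerprint` and the chunked-read loop are our own illustrative choices, not DeDu's actual code.

```python
import hashlib

def file_fingerprint(path, chunk_size=1 << 20):
    """Merge the MD5 and SHA-1 hex digests of a file into one primary
    value, as DeDu's index key does. Reading in chunks keeps memory
    use constant even for large files."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            md5.update(block)
            sha1.update(block)
    # 32 hex chars (MD5) + 40 hex chars (SHA-1) = 72-character key
    return md5.hexdigest() + sha1.hexdigest()
```

Two files with identical content always produce the same 72-character key, so a single index lookup decides whether an upload is a duplicate.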
4.1.2 Storage mechanism
We need two storage mechanisms to achieve our data access requirements. One is used
to store mass data, and the other one is used to keep the index. On the one hand,
there are several secondary storage systems, such as CEPH, Petal, Farsite, Sorrento,
Panasas, GFS, Ursa Minor, RADOS, FAB, and HDFS, which can be used as mass data
storage systems. On the other hand, there are several database systems, such as SQL,
Oracle, LDAP, BigTable, and HBase, that can be used as index systems. All these
systems have their own features, but which two systems combined will yield the best
results? With regard to the storage requirements: to store masses of information, the
file system must be stable, scalable, and fault-tolerant, while the index system must
perform well in real-time queries.
Figure 4.2: Architecture of deduplication cloud storage system
Considering these requirements, we use HDFS and HBase as our storage mechanisms.
The advantages of HDFS are that it can be used under high-throughput, large-dataset
conditions, and that it is stable, scalable, and fault-tolerant. HBase is the Hadoop
database, which performs well in queries. Both HDFS and HBase were developed by
Apache to store mass data, modelled on the Google File System and BigTable
respectively. Considering these features, we use HDFS as the storage system and
HBase as the index system. How HDFS and HBase collaborate is introduced in
Section 4.2. Figure 4.2 shows the architecture of the deduplication cloud storage
system.
4.2 Design of DeDu
4.2.1 Data organization
In this system, HDFS and HBase must collaborate to guarantee that the system works
well. Figure 4.3 shows the collaborative relationship of HDFS and HBase.
Figure 4.3: Collaborative relationship of HDFS and HBase[12]
There are two types of files saved in HDFS: one is source files, and the other is
link files. We separate source files and link files into different folders. Figure 4.4
shows the architecture of the data organization.
Figure 4.4: Data organization in DeDu
In the DeDu system, each source file is stored under its primary value in a folder
named by date. As for the link file, the filename is in the form "ABC.ext.lnk",
where "ABC" is the original name of the source file and "ext" is its original
extension. Every link file records the hash value of its source file and the logical
path to the source file, and it is saved in a folder created by the user. Each link
file is 316 bits, and each link file is saved three times in the distributed file
system.
HBase records, for each file, its hash value, the number of links, and the logical
path to the source file. There is only one table in HBase, named dedu, with three
columns: hash value, count, and path. Hash value is the primary key; count records
the number of links for each source file; path records the logical path to the source
file.
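The dedu table therefore holds one row per unique file. A minimal in-memory model of that row layout, standing in for the HBase table (the names `DeduRow` and `dedu_table` are ours), is:

```python
from dataclasses import dataclass

@dataclass
class DeduRow:
    count: int  # number of link files pointing at the source file
    path: str   # logical path of the source file in HDFS

# merged MD5+SHA-1 hash value -> row; in HBase the hash is the row key
dedu_table: dict = {}

# A newly uploaded file gets a row with a link count of one.
dedu_table["<digest>"] = DeduRow(count=1, path="/source/2011-03-30/<digest>")
```

Every later upload of the same content only touches `count`; the source file's path never changes while links remain.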
4.2.2 Storage of the files
In this system, there are three main steps to save a file: firstly, compute a hash value
at the client; secondly, identify any duplication; thirdly, save the file. Figure 4.5 shows
the procedure for storing a file.
Firstly, using a DeDu application, users select the files or folders to be uploaded and
saved. The application uses the MD5 and SHA-1 hash functions to calculate the file's
hash value, and then passes the value to HBase.
Secondly, the table "dedu" in HBase keeps all file hash values. HBase operates in the
HDFS environment and compares the new hash value with the existing values. If the
value does not already exist, it is recorded in the table, and HDFS asks the client to
upload the file and records its logical path. If the value does already exist, HDFS
checks the number of links: if the number is not zero, the counter is incremented by
one and HDFS tells the client that the file has been saved; if the number is zero,
HDFS asks the client to upload the file and updates the logical path.
Thirdly, HDFS stores the source files uploaded by users, together with the corresponding
link files, which are automatically created by DeDu and record the source file's hash
value and logical path.
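The three steps above amount to the following decision procedure. This is a sketch under assumed names (`index` stands in for the HBase dedu table, `upload` for the HDFS transfer, `store_file` is our own), not DeDu's actual code.

```python
def store_file(hash_value, logical_name, index, upload):
    """Decide whether a client upload is needed, following DeDu's three
    cases: new hash, existing hash with live links, or an existing hash
    whose link count has dropped to zero."""
    row = index.get(hash_value)
    if row is None:
        # First copy: upload and record hash, count and logical path.
        path = upload(logical_name)
        index[hash_value] = {"count": 1, "path": path}
        return True
    if row["count"] > 0:
        # Duplicate: only bump the link count; no data is transferred.
        row["count"] += 1
        return False
    # Hash known but source gone (count zero): re-upload, update path.
    row["path"] = upload(logical_name)
    row["count"] = 1
    return True
```

Note that in the duplicate case no file data crosses the network at all; only the counter in the index changes.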
4.2.3 Access to the files
In our system, files are accessed through their link files. Each link file records two
things: the hash value and the logical path to the source file. When clients access a
file, they first access the link file, which passes the logical path of the source file to
HDFS. HDFS then asks the master node for the block locations, and once the clients
have the block locations, they can retrieve the source file from the data nodes.
Figure 4.6 shows the procedure for accessing a file.
Figure 4.6: Procedures to access a file.
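The lookup chain can be sketched as follows; the two-line link-file layout and the function names are assumptions for illustration, not DeDu's on-disk format.

```python
def read_link_file(text):
    """Parse a link file holding the source file's hash value and its
    logical path, one per line (an assumed serialisation)."""
    hash_value, logical_path = text.splitlines()[:2]
    return hash_value, logical_path

def open_source(link_text, hdfs_open):
    """Follow a link file to its source: the logical path is handed to
    HDFS, which resolves block locations via the master node."""
    _, logical_path = read_link_file(link_text)
    return hdfs_open(logical_path)
```

The client never stores the source file's location itself; everything it needs is in the link file plus HDFS's own path resolution.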
4.2.4 Deletion of files
In our system, there are two modes of deletion: pseudo-deletion and full deletion.
This is because different users may have the same right to access and control the
same file, and we do not allow one user to delete a source file that is shared by
other users; pseudo-deletion and full deletion solve this problem. When a user
deletes a file, the
system deletes the link file owned by that user, and the number of links is
decremented by one. The user loses the right to access the file, but the source
file is still stored in HDFS: the file is pseudo-deleted. A source file may have
many link files pointing to it, so deleting one link file has no impact on the
source file. When the last link file is deleted, however, the source file is deleted
as well: the file is now fully deleted. Figure 4.7 shows the procedure for deleting
a file.
Figure 4.7: Procedures to delete a file.
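Pseudo- and full deletion together reduce to reference counting on the dedu table. A minimal sketch with assumed names (`index` models the table, the callbacks model the HDFS deletes):

```python
def delete_file(hash_value, index, remove_link, remove_source):
    """Delete one user's link file; the source file is only removed
    when its link count reaches zero (full deletion)."""
    remove_link()              # the user's own link file always goes
    row = index[hash_value]
    row["count"] -= 1          # pseudo-deletion for this user
    if row["count"] == 0:
        remove_source(row["path"])  # last link gone: full deletion
        del index[hash_value]
        return "fully-deleted"
    return "pseudo-deleted"
```
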
4.3 Environment of simulation
In our experiment, our cloud storage platform was set up on VMware Workstation
7.10. The host machine has a 3.32 GHz CPU, 4 GB of RAM, and a 320 GB hard
disk. Five virtual machines make up the cloud storage platform, each with the same
configuration: a 3.32 GHz CPU, 512 MB of RAM, and a 20 GB hard disk. The
operating system is Linux Mint. The version of HDFS is Hadoop 0.20.2, and the
version of HBase is 0.20.6. Communication between the nodes of HDFS and HBase
uses SSH. Detailed information is listed in Table 4.1.
Hosts   Usage               IP address        DFS
Mint1   MasterNode, HBase   192.168.58.140    Hadoop, HBase
Mint2   DataNode            192.168.58.142    Hadoop, HBase
Mint3   DataNode            192.168.58.143    Hadoop, HBase
Mint4   DataNode            192.168.58.144    Hadoop, HBase
Mint5   DataNode            192.168.58.145    Hadoop, HBase
Table 4.1: Configuration of the virtual machines
4.4 System implementation
In this section, we will introduce the implementation of DeDu in packages, classes,
and interfaces.
4.4.1 System architecture
There are five packages in the system: the main package, GUI package, command
package, task package, and utility package. The relationship between these packages
is shown in Figure 4.8. The main package and the GUI package invoke the command
package to operate DeDu; the command package then calls the task package to act,
and the task package invokes the utility package to execute the tasks.
Figure 4.8: Overview of the Packages
In each package, several classes are included to perform the functions. Figure 4.9
gives an overview of the architecture of DeDu. ICommand and ITask are the interfaces
of the command package and the task package. Detailed information and the functions
of the classes are described in the following sections.
4.4.2 Class diagram
We describe the system classes package by package. The first is the command
package. It contains one interface, ICommand, and one abstract class,
AbstractCommand; ICommand is the interface implemented by the AbstractCommand
class. There are also nine concrete classes, DownloadCommand, DownloadDirCommand,
DeletedCommand, HelpCommand, UpdateCommand, MkDirCommand, RmDirCommand,
InitHBaseCommand, and ListCommand, which inherit from the AbstractCommand class
in this package.
DownloadCommand class is the command to download a single file.
DownloadDirCommand class is the command to download multiple files in the
document folder.
DeletedCommand class is the command for deleting files.
HelpCommand class is the command for calling on all the help information.
UpdateCommand class is the command to upload files.
MkDirCommand class is the command for making a new document folder.
RmDirCommand class is the command for deleting a document folder and files in
the folder.
InitHBaseCommand class is the command for initializing HBase.
ListCommand class is the command for listing all the files in the system.
The task package is shown in Figure 4.11. There is one interface, ITask, and one
abstract class, AbstractTask; ITask is the interface implemented by the AbstractTask
class. There are also nine concrete classes, DownloadTask, DownloadDirTask,
DeletedTask, MkDirTask, RmDirTask, UploadTask, UploadDirTask, ListTask, and
ListDirTask, which inherit from the AbstractTask class in this package.
Figure 4.11: Task Package
DownloadTask class performs the function of downloading a single file.
DownloadDirTask class performs the function of downloading multiple files in the
document folder.
DeletedTask class performs the function of deleting files.
MkDirTask class performs the function of making a new document folder.
RmDirTask class performs the function of deleting a document folder and files in
the folder.
ListTask class performs the function of listing all the files in the system.
ListDirTask class performs the function of listing all the files and folders in the
system.
UploadTask class performs the function of uploading single files.
UploadDirTask class performs the function of uploading multiple files which are in
the same folder.
The utility package contains one abstract class, named HashUtil, one interface,
HashUtilFactory, and five classes: XMLUtil, HBaseManager, HDFSUtil, SHAUtil, and
MD5Util. The structure of the utility package is shown in Figure 4.12.
Figure 4.12: Utility Package
In this package, we used the factory method to build up the hash function.
HashUtilFactory is the interface, and HashUtil decides which hash algorithm should be
invoked. The SHAUtil class and the MD5Util class both implement algorithms for
calculating hash values.
The responsibility of the XMLUtil class is to record the configuration of HDFS and
HBase. The responsibilities of the HBaseManager class are to communicate with HBase
and to operate HBase in executing tasks. The responsibilities of the HDFSUtil class
are to communicate with HDFS and to operate HDFS in executing tasks.
4.4.3 Interface
In order to operate DeDu conveniently, we have developed a graphical user interface
(GUI). In this section, we will show and briefly describe the features of the GUI.
The following picture (Figure 4.13) shows DeDu’s main interface, before connecting
to HDFS and HBase.
Figure 4.13: DeDu’s Main Interface
Using the interface involves clicking Options and then selecting Connection
Configuration. A new configuration window with two folders, named HDFS and HBase,
comes up to perform the connection. Figures 4.14 and 4.15 show the connection
configuration interfaces for HDFS and HBase.
Figure 4.14: Configure HDFS
Figure 4.15: Configure HBase
After connecting to HDFS and HBase, DeDu's main interface shows the file status on
the right side. Figure 4.16 displays the situation after the connection has been set
up.
Figure 4.16: After connecting to HDFS and HBase
Uploading files into DeDu is quite easy: just drag and drop the selected file or
folder from the left side to the right side. Figure 4.17 displays the process of
uploading a file.
Figure 4.17: Uploading process.
To download files to a local disk, drag and drop a file or folder from the right side
to the left side. Figure 4.18 displays the downloading process.
Figure 4.18: Downloading process.
To delete a file or folder, right-click the file or folder to be deleted and select the
delete option; the file or folder will then be deleted. Figure 4.19 displays the
deletion process.
Figure 4.19: Deletion process
4.5 Summary
This chapter firstly introduced DeDu, the deduplication storage system, and described
its method of identifying duplication and its storage mechanisms. Secondly, section
4.2 described the design of DeDu, including the data organization and the procedures
for uploading, accessing, and deleting files. Thirdly, section 4.3 briefly described the
environment of the simulation. Section 4.4 gave details of the system implementation,
covering the system architecture, packages and classes, and interfaces.
Chapter 5
Experiment Results and Evaluation
5.1 Evaluations and summary of P2CP
For a storage service, availability and speed are high-priority considerations. In the
previous section, we proved that the speed of P2CP is superior. In this section, we
compare and discuss the availability of P2CP against other models, from the point of
view of the whole network and of shared resources. According to the work of [7],
common hardware failures often occur in clusters; failures, e.g. of servers, are
expected in a networked environment. In our comparative evaluation, we assume that
the failure rate of each peer is 1% and the failure rate of each server is 0.1%. We
assume that two peers constitute the smallest pure P2P network; the smallest P2SP
network includes one server and one peer; the smallest cloud includes one master node
and one data node; and the smallest P2CP network includes one smallest cloud and one
peer.
5.1.1 Evaluation from network model availability
From the point of view of whole-network availability, and based on the assumptions
above: in the P2P network, as long as one peer remains and the user can connect to
it, the network is up, so the maximum failure rate of the pure P2P network is 1%. In
the P2SP network, the failure of one machine will not bring down the whole network:
if the server shuts down, the network degenerates to P2P; if the peer goes offline,
the network becomes client-server. Only when both the server and the peer break down
at the same time does the whole network shut down, so the maximum failure rate of
the P2SP network is 1% * 0.1% = 0.001%. For the cloud network, by the nature of a
cloud, the loss of a master or data node alone does not fully disable the network; it
fails only when both the master node and the data node are broken at the same time,
so the maximum failure rate of the cloud network is 1% * 0.1% = 0.001%. The P2CP
network can keep running even if the master node or data nodes fail, as long as peers
remain; only after all peers are gone and both the master node and all data nodes are
broken does the whole P2CP network shut down. So the failure rate of P2CP is
0.1% * 1% * 1% = 0.00001%. Thus, in the worst network situation, the most stable
network storage model is P2CP.
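The arithmetic above can be checked directly. This sketch reproduces the assumed rates (peer 1%, server 0.1%) and the combinations used in the text; the variable names are ours.

```python
peer, server = 0.01, 0.001  # assumed failure rates from the text

p2p = peer             # the last peer failing kills the smallest P2P net
p2sp = peer * server   # server and peer must both fail
cloud = peer * server  # master and data node must both fail, as combined above
p2cp = cloud * peer    # the smallest cloud plus the last peer must all fail

assert p2p == 0.01                # 1 %
assert abs(p2sp - 1e-5) < 1e-15   # 0.001 %
assert abs(p2cp - 1e-7) < 1e-15   # 0.00001 %
```
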
5.1.2 Evaluation from particular resource availability
However, from the point of view of a particular shared resource, we know that storage
services follow the long-tail law [2]: a particular resource may be very popular at the
beginning and then in low demand for a long time. In the P2P storage model, the
resource is initially downloaded and uploaded frequently, so users can access it
easily; but once it is no longer popular and the peers holding it leave, the P2P
network is still there, yet the resource is unavailable. Both the cloud storage model
and the P2SP storage model solve this problem: they use a series of servers, or a
single server, to hold the particular resource and guarantee its availability, but with
different transmission efficiency. In the P2SP storage model, transmission efficiency
is high while the resource is popular and low once it is not; the cloud storage model
behaves in the opposite way. Only the P2CP storage model achieves the best of both:
regardless of whether the resource is in fashion, both availability and speed remain
very good.
5.1.3 Summary of P2CP
In summary, from the evaluation results we can clearly see that in every situation the
time cost of P2CP is the lowest and its usability is the highest. P2CP is a new cloud
storage system model, developed through this research to enhance data transmission
performance and provide persistent availability. P2CP not only solves the problem of
the bandwidth utilization rate in cloud storage systems, but also solves the problem
of persistent availability in the pure P2P network model. A mathematical model proves
that, in the same network environment, the bandwidth utilization rate and the
persistent availability of the P2CP model are better than those of the pure P2P
model, the P2SP model, or the cloud model.
5.2 Performance of DeDu
5.2.1 Deduplication efficiency
In our experiment, we uploaded 110,000 files, amounting to 475.2 GB, into DeDu. In a
traditional storage system, they would occupy 475.2 GB, as shown by the blue line in
Figure 5.1; stored in a traditional distributed file system, both the physical storage
space and the number of files would be three times larger, that is, 1425.6 GB and
330,000 files. (We did not show this in the figure, because its scale would become
too large.) In a perfect deduplication distributed file system, the data would take up
37 GB in 3,200 files; in DeDu, we achieved 38.1 GB, just 1.1 GB more than the
perfect situation. The extra 1.1 GB is occupied by the link files and the dedu table
saved in HBase. Figure 5.1 shows the deduplication efficiency of the system.
Figure 5.1: Deduplication efficiency.
By using the distributed hash index, an exact deduplication result is achieved. In our
storage system, each file is kept in three copies at different data nodes, as backup in
case some data nodes fail. This means that if a file is saved into the system fewer
than three times, the deduplication efficiency is low; when a file is put into the
system more than three times, the deduplication efficiency becomes high. Thus the
exact deduplication efficiency depends both on the duplication ratio of the original
data and on how many times the original data has been saved: the higher the
duplication ratio, and the greater the number of times the data is saved, the greater
the deduplication efficiency that can be achieved.
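The reported space figures imply the following overhead and compression ratio (plain arithmetic on the numbers above; the variable names are ours):

```python
logical_gb = 475.2  # total data uploaded (110,000 files)
dedu_gb    = 38.1   # space DeDu actually used
perfect_gb = 37.0   # unique data only, under perfect deduplication

overhead_gb = dedu_gb - perfect_gb  # link files plus the dedu table
assert abs(overhead_gb - 1.1) < 1e-9

# DeDu stored the workload in roughly one twelfth of its logical size.
ratio = logical_gb / dedu_gb
assert 12.4 < ratio < 12.6
```
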
5.2.2 Balance of load
Because each data node keeps a different number of blocks, and clients fetch data
directly from the data nodes, we have to keep an eye on load balance, in case some
data nodes are overloaded while others sit idle.
5.2.2.1 Static load balance
Figure 5.2 shows the balance situation across four data nodes. Without deduplication,
DN1 (mint2) stores 116 gigabytes of data, DN3 (mint4) stores 114 gigabytes, and DN2
(mint3) and DN4 (mint5) each store 115 gigabytes. With deduplication, DN2 stores
6.95 gigabytes, DN3 stores 6.79 gigabytes, and DN1 and DN4 each store 7 gigabytes.
In the perfectly balanced deduplicated situation, each node would store 6.9 gigabytes.
Each data node stores a different amount of data, whether at the hundreds-of-gigabytes
level or the dozens-of-gigabytes level, so the usage of storage space in each node
differs; but the differences in the numbers of blocks and in space occupation in the
same situation are never more than 10%.
Figure 5.2: Static load balance.
5.2.2.2 Dynamic load balance
When we delete a node or add a new node, DeDu rebalances automatically. The default
communication bandwidth for balancing is 1 MB/s, so balancing is slow unless the
balance command is entered manually. Figure 5.3 shows the deduplicated load balanced
across a three-data-node environment (indicated by brackets) and a four-data-node
environment. In the three-data-node environment, each data node stores 9.24 GB of
data. After one more data node is added to the system, DN2 stores 6.95 gigabytes,
DN3 stores 6.79 gigabytes, and DN1 and DN4 each store 7 gigabytes.
Figure 5.3: Dynamic load balance
5.2.3 Reading efficiency
This part presents the system's reading efficiency in two situations: with two data
nodes and with four data nodes. In the two-data-node situation, we tested the system
with two data streams. Firstly, we downloaded 295 items amounting to 3.3 GB in 356
seconds, a download speed of 9.49 MB/s. Secondly, we downloaded 22 items amounting
to 9.2 GB in 510 seconds, a download speed of 18.47 MB/s. In the four-data-node
situation, we also tested with two data streams. Firstly, we downloaded 295 items
amounting to 3.3 GB in 345 seconds, a speed of 9.79 MB/s. Secondly, we downloaded
22 items amounting to 9.2 GB in 475 seconds, a download speed of 19.83 MB/s. Details
are given in Table 5.1.
Read from two nodes Time (Seconds) Items Size (GB) Speed (MB/s)
Testing1 356 295 3.3 9.49
Testing2 510 22 9.2 18.47
Read from four nodes Time (Seconds) Items Size (GB) Speed (MB/s)
Testing1 345 295 3.3 9.79
Testing2 475 22 9.2 19.83
Table 5.1: Read efficiency
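Each speed in Table 5.1 is simply size divided by time, converting gigabytes to megabytes at 1024 MB/GB; a quick check (the helper name is ours):

```python
def speed_mb_s(size_gb, seconds):
    """Transfer speed in MB/s, using 1 GB = 1024 MB as the thesis does."""
    return round(size_gb * 1024 / seconds, 2)

# Reproduce the four speeds reported in Table 5.1.
assert speed_mb_s(3.3, 356) == 9.49
assert speed_mb_s(9.2, 510) == 18.47
assert speed_mb_s(3.3, 345) == 9.79
assert speed_mb_s(9.2, 475) == 19.83
```
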
5.2.4 Writing efficiency
In this part, we consider the system's writing efficiency with two and four data
nodes. In the real world, deduplication happens randomly, so in this thesis we
calculate only the writing efficiency with complete deduplication and the writing
efficiency with no deduplication.
In the two-data-node, no-deduplication situation, we used two data streams for
testing. Firstly, we uploaded 22 items amounting to 9.2 GB in 2628 seconds, an
upload speed of 3.58 MB/s. Secondly, we uploaded 295 items amounting to 3.3 GB in
771 seconds, an upload speed of 4.38 MB/s. In the four-data-node, no-deduplication
situation, we first uploaded 22 items amounting to 9.2 GB in 2644 seconds, an upload
speed of 3.56 MB/s. Secondly, we uploaded 295 items amounting to 3.3 GB in 813
seconds, an upload speed of 4.15 MB/s. Details are listed in Table 5.2.
Two nodes Time (Seconds) Items Size (GB) Speed (MB/s)
Testing1 771 295 3.3 4.38
Testing2 2628 22 9.2 3.58
Four nodes Time (Seconds) Items Size (GB) Speed (MB/s)
Testing1 813 295 3.3 4.15
Testing2 2644 22 9.2 3.56
Table 5.2: Write efficiency without deduplication
In the two-data-node, complete-deduplication situation, we also used two data streams
for testing. Firstly, we uploaded 22 items amounting to 9.2 GB in 462 seconds, an
upload speed of 20.39 MB/s. Secondly, we uploaded 295 items amounting to 3.3 GB in
401 seconds, an upload speed of 8.43 MB/s. With four data nodes in a full-deduplication
environment, we first uploaded 22 items amounting to 9.2 GB in 475 seconds, an upload
speed of 19.83 MB/s. We then uploaded 295 items amounting to 3.3 GB in 356 seconds,
an upload speed of 9.49 MB/s. Details are listed in Table 5.3.
Write into Two nodes Time (Seconds) Items Size (GB) Speed (MB/s)
Testing1 401 295 3.3 8.43
Testing2 462 22 9.2 20.39
Write into Four nodes Time (Seconds) Items Size (GB) Speed (MB/s)
Testing1 356 295 3.3 9.49
Testing2 475 22 9.2 19.83
Table 5.3: Write efficiency with deduplication
5.2.5 Evaluations and summary of DeDu
In our system, the hash value is calculated at the client before data transmission,
and the lookup is executed in HBase. When a duplication is found, no real data
transmission occurs.
From Tables 5.1 to 5.3, we can draw the following results for DeDu:
1. The fewer the data nodes, the higher the writing efficiency but the lower the
reading efficiency.
2. The more data nodes there are, the lower the writing efficiency but the higher
the reading efficiency.
3. When a single file is big, the time to calculate its hash value is higher, but the
transmission cost is lower.
4. When a single file is small, the time to calculate its hash value is lower, but
the transmission cost is higher.
5.3 Summary
This chapter presented the testing results for DeDu and evaluated the P2CP storage
model. Section 5.2 showed the performance of DeDu from four aspects: deduplication
efficiency, load balance, reading efficiency, and writing efficiency; the evaluation of
DeDu was given in section 5.2.5. Section 5.1 evaluated P2CP from two aspects:
network availability and particular-resource availability.
Chapter 6
Conclusion and Future Work
6.1 Conclusion
In conclusion, Chapter 2 contained a literature review of the relevant research from
four points of view, in four sections. Section 2.2 introduced the background of the
"cloud", cloud computing, and cloud storage. Section 2.3 reviewed existing distributed
file systems and distributed databases. Section 2.4 analysed existing distributed
storage models from three aspects: where the master server fits into the storage
system, file storage granularity, and system scalability. Section 2.5 reviewed several
existing deduplication storage systems and introduced the main deduplication
techniques. Reports on both P2SP and P2P storage models were also reviewed.
Chapter 3 introduced a new cloud storage system model, named P2CP, to enhance data
transmission performance and provide persistent availability. The conclusion of the
comparative studies presented in this thesis, based on statistical modelling, is that
P2CP not only solves the problem of the bandwidth utilization rate in cloud storage
systems, but also solves the problem of persistent availability in the P2P network
model. The Poisson process and Little's law were used
to prove that the utilization rate of bandwidth and the persistent availability of the
P2CP model are better than for the pure P2P model, the P2SP model, or the cloud
model.
Chapter 4 introduced a scalable and parallel deduplication storage system, named
DeDu. DeDu is useful not only to enterprises for backup purposes, but also to common
users who want to store data. Our approach exploits a file hash value, saved as an
index in HBase, to attain high lookup performance, and it exploits link files to
manage mass data in the Hadoop distributed file system.
The features of DeDu were presented in Chapter 5: the fewer the data nodes, the
higher the writing efficiency but the lower the reading efficiency; the more data
nodes, the lower the writing efficiency but the higher the reading efficiency. When a
single file was big, the time to calculate its hash value was higher but the
transmission cost was lower; when a single file was small, the opposite held.
Chapter 5 also evaluated the P2CP storage model.
The contributions of this thesis were the following:
1. The research analysed the advantages and disadvantages of existing distributed
storage systems, especially the P2P, P2SP, and cloud storage models.
2. The research proposed a new distributed storage system model named P2CP. This
storage model absorbed the advantages of the P2P storage model in its highly
efficient use of bandwidth, those of the P2SP storage model in persistent data
availability, and those of the cloud storage model in scalability.
3. The research proved, by mathematical methods, that the data transmission
efficiency of the P2CP storage model was better than that of traditional distributed
storage models. The Poisson process and Little's law were used to set up a
mathematical data transmission model of P2CP, which was then used to prove that
P2CP performs better.
4. A deduplication application has been developed. The application is the front-end
of the deduplication storage system. It includes interfaces in both a command mode
and a graphical user interface (GUI) mode, so users do not need to be familiar with
operation commands: with the help of the GUI, everything can be done.
5. A cloud storage platform has been built. The back-end consists of HDFS and
HBase: HDFS manages the commodity hardware, and HBase is employed as the data
manager, handling communication with the front-end. DeDu works well because the
front-end and the back-end collaborate closely.
6.2 Future Work
There are some limitations in this work. DeDu has not been compared with other
benchmark deduplicated storage systems, and P2CP still is a theoretical model.
Thus, in future work with my colleagues, three issues will be addressed. Firstly, we
will compare DeDu with other deduplication storage systems in terms of deduplication
accuracy, data transmission efficiency, and scalability. Secondly, we will develop a
prototype of the P2CP storage system and compare it with other storage systems in
real experiments. Thirdly, we will combine the P2CP storage model with DeDu to
improve data transmission efficiency.
References
[1] I. Adan and J. Resing, “Queueing theory,” February 14, 2001.
[2] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More, 2nd ed.
Hyperion, 2006. [Online]. Available: http://nangchang2blog.files.wordpress.com/
2009/12/17the-long-tail.pdf
[3] H. Dobbertin, A. Bosselaers, and B. Preneel, “Ripemd-160: A
strengthened version of ripemd,” Springer-Verlag, pp. 71–82, 1996.
[4] Apache, “Cassandra.” [Online]. Available: http://cassandra.apache.org/
[5] ——, “Hbase.” [Online]. Available: http://hbase.apache.org/
[6] A. Atul, J. B. William, C. Miguel, C. Gerald, C. Ronnie, R. D. John, H. Jon,
R. L. Jacob, T. Marvin, and P. W. Roger, “Farsite: federated, available, and
reliable storage for an incompletely trusted environment,” SIGOPS Oper. Syst.
Rev. ACM, vol. 36, pp. 1–14, 2002.
[7] S. Austin and T. Vince, “An analysis of reported laptop failures from
malfunctions and accidental damage,” SquareTrade, Tech. Rep., 2009. [Online].
Available: http://www.squaretrade.com/pages/laptop-reliability-1109/.
[8] T. C. Austin, A. Irfan, V. Murali, and L. Jinyuan, “Decentralized deduplication
in san cluster file systems,” in Proceedings of the 2009 conference on USENIX
Annual technical conference. San Diego, California: USENIX Association, 2009,
pp. 101–114.
[9] P. Bernstein, J. Rothnie, J.B., N. Goodman, and C. Papadimitriou, “The concur-
rency control mechanism of sdd-1: A system for distributed databases (the fully
redundant case),” Software Engineering, IEEE Transactions on, vol. SE-4, no. 3,
pp. 154 – 168, May 1978.
[10] Y. Beverly and G.-M. Hector, “Designing a super-peer network.” Los Alamitos,
CA, USA: IEEE Computer Society, 2003, p. 49.
[11] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, “Extreme binning:
Scalable, parallel deduplication for chunk-based file backup,” in 2009 IEEE
International Symposium on Modeling, Analysis & Simulation of Computer and
Telecommunication Systems (MASCOTS), 2009, pp. 237–245.
[12] D. Borthakur. (2007) The hadoop distributed file system: Architecture
and design. [Online]. Available: http://hadoop.apache.org/hdfs/docs/current/
hdfs design.pdf
[13] W. Brent, U. Marc, A. Zainul, G. Garth, M. Brian, S. Jason, Z. Jim, and Z. Bin,
“Scalable performance of the panasas parallel file system,” in Proceedings of the
6th USENIX Conference on File and Storage Technologies. San Jose, California:
USENIX Association, 2008, pp. 17–33.
[14] D. Cezary, G. Leszek, H. Lukasz, K. Michal, K. Wojciech, S. Przemyslaw, S. Jerzy,
U. Cristian, and W. Michal, “Hydrastor: a scalable secondary storage,” in Proc-
cedings of the 7th conference on File and storage technologies. San Francisco,
California: USENIX Association, 2009, pp. 197–210.
[15] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows,
T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: a distributed storage system
for structured data,” in Operating Systems Design and Implementation, Seattle,
2006, pp. 205–218.
[16] Y. Chou, “Get into the groove: Solutions for secure and dynamic collaboration,”
TechNet Magazine, 2006. [Online]. Available: http://technet.microsoft.com/
en-us/magazine/2006.10.intothegroove.aspx
[17] I. Clarke, “Freenet,” 2000. [Online]. Available: http://freenetproject.org/
[18] B. Cohen, “Bittorrent,” 2001. [Online]. Available: http://www.bittorrent.com/
btusers/what-is-bittorrent
[19] D. Comer, “Ubiquitous b-tree,” ACM Comput. Surv., vol. 11, pp. 121–137, June
1979.
[20] P. Douglas, The Challenge of the Computer Utility. Addison-Wesley Publishing
Company, 1966.
[21] K. L. Edward and A. T. Chandramohan, “Petal: distributed virtual disks,” in
Proceedings of the seventh international conference on Architectural support for
programming languages and operating systems. Cambridge, Massachusetts, US:
ACM, 1996, pp. 84–92.
[22] R. Epstein, M. Stonebraker, and E. Wong, “Distributed query processing in a
relational data base system,” in Proceedings of the 1978 ACM SIGMOD inter-
national conference on management of data, ser. SIGMOD ’78. New York, NY,
USA: ACM, 1978, pp. 169–180.
[23] L. Fang, L. Peng, Y. Jie, and L. Zhenming, “Contrastive analysis of p2sp network
and p2p network,” in Wireless Communications, Networking and Mobile Com-
puting, 2009. WiCom ’09. 5th International Conference on, 24-26 Sept 2009, pp.
1–5.
[24] J. Feng, Y. Chen, W.-S. Ku, and P. Liu, “Analysis of integrity vulnerabilities
and a non-repudiation protocol for cloud data storage platforms,” in Parallel
Processing Workshops (ICPPW 39th), Sept 2010, pp. 251–258.
[25] J. F. Gantz, C. Chute, A. Manfrediz, S. Minton, D. Reinsel, W. Schlichting, and
A. Toncheva, The Diverse and Exploding Digital Universe, ser. An IDC White
Paper - sponsored by EMC, March 2008. [Online]. Available: http://www.emc.
com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
[26] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large
clusters,” Operating Systems Design and Implementation, OSDI 2004, p. 13, 2004.
[27] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” in Pro-
ceedings of the nineteenth ACM symposium on Operating systems principles, ser.
SOSP ’03. New York, NY, USA: ACM, 2003, pp. 29–43.
[28] B. Hong and D. D. E. Long, “Duplicate data elimination in a san file system,” in
Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass
Storage Systems and Technologies (MSST), 2004, pp. 301–314.
[29] T. Hong, A. Gulbeden, Z. Jingyu, W. Strathearn, Y. Tao, and C. Lingkun, “A self-
organizing storage cluster for parallel data-intensive applications,” in Proceedings
of the ACM/IEEE Supercomputing 2004 Conference, Nov. 2004, p. 52.
[30] R. J. Honicky and E. L. Miller, “Replication under scalable hashing: a family
of algorithms for scalable decentralized data distribution,” in 18th International
Conference on Parallel and Distributed Processing Symposium, April 2004, p. 96.
[31] D. Houston and A. Ferdowsi, “Dropbox,” 2007. [Online]. Available: http:
//www.dropbox.com/about
[32] B. Kaliski, The MD2 Message-Digest Algorithm. United States: RFC Editor,
1992.
[33] P. Kasselman and W. Penzhorn, “Cryptanalysis of reduced version of haval,”
Electronics Letters, vol. 36, no. 1, pp. 30 –31, Jan. 2000.
[34] P. Kirk, “Gnutella,” 2003. [Online]. Available: http://rfc-gnutella.sourceforge.
net/
[35] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble,
“Sparse indexing: Large scale, inline deduplication using sampling and locality,”
in 7th USENIX Conference on File and Storage Technologies, San Francisco, Cal-
ifornia, 2009, pp. 111–123.
[36] F. Mendel, N. Pramstaller, C. Rechberger, M. Kontak, and J. Szmidt, “Crypt-
analysis of the gost hash function,” in Advances in Cryptology CRYPTO 2008,
ser. Lecture Notes in Computer Science, D. Wagner, Ed. Springer Berlin /
Heidelberg, 2008, vol. 5157, pp. 162–178.
[37] Merkur, “emule,” May 2002. [Online]. Available: http://www.emule-project.net/
home/perl/general.cgi?l=1
[38] A.-E.-M. Michael, I. William V. Courtright, C. Chuck, R. G. Gregory, H. James,
J. K. Andrew, M. Michael, P. Manish, S. Brandon, R. S. Raja, S. Shafeeq, D. S.
John, T. Eno, W. Matthew, and J. W. Jay, “Ursa minor: versatile cluster-based
storage,” in Proceedings of the 4th conference on USENIX Conference on File and
Storage Technologies, vol. 4. San Francisco, CA: USENIX Association, 2005, pp.
59–72.
[39] A. Michael, F. Armando, G. Rean, D. J. Anthony, K. Randy, K. Andy,
L. Gunho, P. David, R. Ariel, S. Ion, and Z. Matei, “Above the clouds:
A berkeley view of cloud computing,” Tech. Rep., 2009. [Online]. Available:
www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
[40] J. H. Morris, M. Satyanarayanan, M. H. Conner, J. H. Howard, D. S.
Rosenthal, and F. D. Smith, “Andrew: a distributed personal computing
environment,” Commun. ACM, vol. 29, pp. 184–201, March 1986. [Online].
Available: http://doi.acm.org/10.1145/5666.5671
[41] Secure Hash Standard, National Institute of Standards and Technology
Std., August 2002. [Online]. Available: http://csrc.nist.gov/publications/fips/
fips180-2/fips180-2withchangenotice.pdf
[42] R. Ozzie, “Microsoft, groove networks to combine forces to create anytime,
anywhere collaboration,” Microsoft News Center, 2005. [Online]. Available: http:
//www.microsoft.com/presspass/features/2005/mar05/03-10GrooveQA.mspx
[43] S. Quinlan and S. Dorward, “Awarded best paper! - venti: A new approach to
archival data storage,” in Proceedings of the 1st USENIX Conference on File and
Storage Technologies, ser. FAST ’02. Berkeley, CA, USA: USENIX Association,
2002, pp. 1–14.
[44] R. Rivest, “The md5 message-digest algorithm,” Internet Activities Board,
United States, Tech. Rep., 1992.
[45] R. L. Rivest, “The md4 message digest algorithm,” in Proceedings of the 10th
Annual International Cryptology Conference on Advances in Cryptology, ser.
CRYPTO ’90. London, UK: Springer-Verlag, 1991, pp. 303–311. [Online].
Available: http://portal.acm.org/citation.cfm?id=646755.705223
[46] J. B. Rothnie, Jr., P. A. Bernstein, S. Fox, N. Goodman, M. Hammer, T. A.
Landers, C. Reeve, D. W. Shipman, and E. Wong, “Introduction to a system for
distributed databases (sdd-1),” ACM Trans. Database Syst., vol. 5, pp. 1–17,
March 1980. [Online]. Available: http://doi.acm.org/10.1145/320128.320129
[47] A. W. Sage, W. L. Andrew, A. B. Scott, and M. Carlos, “Rados: a scal-
able, reliable storage service for petabyte-scale storage clusters,” in Proceed-
ings of the 2nd international workshop on Petascale data storage: held in
conjunction with Supercomputing. Reno, Nevada: ACM, 2007, pp. 35–44,
http://doi.acm.org/10.1145/1374596.1374606.
[48] R. Sandberg, D. Golgberg, S. Kleiman, D. Walsh, and B. Lyon, Innovations in
Internetworking. Norwood, MA, USA: Artech House, Inc., 1988.
[49] W. Santos, T. Teixeira, C. Machado, W. Meira, A. S. Da Silva, D. R. Ferreira,
and D. Guedes, “A scalable parallel deduplication algorithm,” in Computer Ar-
chitecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th In-
ternational Symposium on, 24-27 Oct. 2007, pp. 79–86.
[50] M. Satyanarayanan, J. H. Howard, D. A. Nichols, R. N. Sidebotham, A. Z. Spec-
tor, and M. J. West, “The itc distributed file system: principles and design,” in
Proceedings of the tenth ACM symposium on Operating systems principles, ser.
SOSP ’85. New York, NY, USA: ACM, 1985, pp. 35–50.
[51] C. A. Stein, M. J. Tucker, and M. I. Seltzer, “Building a reliable mutable file sys-
tem on peer-to-peer storage,” in Reliable Distributed Systems, 2002. Proceedings.
21st IEEE Symposium on, 2002, pp. 324–329.
[52] J. S. M. Verhofstad, “Recovery techniques for database systems,” ACM
Comput. Surv., vol. 10, pp. 167–195, June 1978. [Online]. Available:
http://doi.acm.org/10.1145/356725.356730
[53] X. Wang, D. Feng, X. Lai, and H. Yu, “Collisions for hash functions md4, md5,
haval-128 and ripemd,” 2004.
[54] X. Wang and H. Yu, “How to break md5 and other hash functions,” in In Euro-
crypt. Springer-Verlag, 2005.
[55] J. Wei, H. Jiang, K. Zhou, and D. Feng, “Mad2: A scalable high-throughput exact
deduplication approach for network backup services,” in Mass Storage Systems
and Technologies (MSST), 2010 IEEE 26th Symposium on, Incline Village, NV,
USA, 3-7 May 2010, pp. 1–14.
[56] S. A. Weil, “Leveraging intra-object locality with ebofs.”
[57] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, “Ceph: a
scalable, high-performance distributed file system,” in Proceedings of the 7th sym-
posium on Operating systems design and implementation. Seattle, Washington:
USENIX Association, 2006, pp. 307–320.
[58] Z. Wilcox-O’Hearn and B. Warner, “Tahoe: the least-authority filesystem,” in
Proceedings of the 4th ACM international workshop on Storage security and
survivability, ser. StorageSS ’08. New York, NY, USA: ACM, 2008, pp. 21–26.
[Online]. Available: http://doi.acm.org/10.1145/1456469.1456474
[59] S. Yasushi, F. Svend, V. Alistair, M. Arif, and S. Susan, “Fab: building
distributed enterprise disk arrays from commodity components,” in Proceedings
of the 11th international conference on Architectural support for programming
languages and operating systems. Boston, MA, USA: ACM, 2004, pp. 48–58.
[60] S. Ye, L. Fangming, L. Bo, and L. Baochun, “Fs2you: Peer-assisted semi-
persistent online storage at a large scale,” April 2009.
[61] W. Ye, A. I. Khan, and E. A. Kendall, “Distributed network file storage for a
serverless (p2p) network,” in The 11th IEEE International Conference on Net-
works. ICON2003., Sept 2003, pp. 343–347.
[62] B. Zhu, K. Li, and H. Patterson, “Avoiding the disk bottleneck in the data domain
deduplication file system,” in Proceedings of the 6th Usenix Conference on File
and Storage Technologies (FAST ’08), 2008, pp. 269–282.
Appendix A
Appendix of DeDu’s Code
This is Appendix A. It shows the code of DeDu in the order of the dedu package, the
command package, the task package and the utility package. All classes are shown for
each package.
The dedu package includes three files: DeDuMain, DefaultReporter and IReport.
DeDuMain.java is the following:
package edu.uow.isit.dedu;

import edu.uow.isit.dedu.command.DeleteCommand;
import edu.uow.isit.dedu.command.DownloadCommand;
import edu.uow.isit.dedu.command.DownloadDirCommand;
import edu.uow.isit.dedu.command.HelpCommand;
import edu.uow.isit.dedu.command.ICommand;
import edu.uow.isit.dedu.command.InitHBaseCommand;
import edu.uow.isit.dedu.command.ListCommand;
import edu.uow.isit.dedu.command.MkdirCommand;
import edu.uow.isit.dedu.command.UpdateCommand;

public class DeDuMain {
    /**
     * @param args
     */
    public static void main(String[] args) {
        if (args.length == 0) {
            // No arguments: print the usage help and exit.
            ICommand cmd = new HelpCommand();
            cmd.execute(null);
            return;
        }
        ICommand[] commands = new ICommand[] { new UpdateCommand(),
                new MkdirCommand(), new ListCommand(), new InitHBaseCommand(),
                new DownloadCommand(), new DownloadDirCommand(),
                new DeleteCommand(), new HelpCommand() };
        for (int i = 0; i < commands.length; i++) {
            if (args[0].equals(commands[i].getCommandType())) {
                // Dispatch the remaining arguments to the matching command.
                int length = args.length;
                String[] args2 = java.util.Arrays.copyOfRange(args, 1, length);
                commands[i].execute(args2);
            }
        }
    }
}
DefaultReporter.java is the following:
package edu.uow.isit.dedu;

import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class DefaultReporter implements IReport {
    @Override
    public void report(String msg) {
        System.out.println(msg);
    }

    @Override
    public void progress() {
        // Print the bytes written/read for every registered file system.
        List<FileSystem.Statistics> s = FileSystem.getAllStatistics();
        Iterator<FileSystem.Statistics> it = s.iterator();
        while (it.hasNext()) {
            FileSystem.Statistics stat = it.next();
            System.out.println("############### FileSystem.Statistics: "
                    + stat.getScheme() + " Written: " + stat.getBytesWritten());
            System.out.println("############### FileSystem.Statistics: "
                    + stat.getScheme() + " Reading: " + stat.getBytesRead());
        }
    }

    @Override
    public void showResult(FileStatus[] fs) {
        for (int i = 0; i < fs.length; i++) {
            System.out.println(fs[i].getPath().toString());
        }
    }
}
IReport.java is the following:
package edu.uow.isit.dedu;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.util.Progressable;

public interface IReport extends Progressable {
    public void report(String msg);
    public void showResult(FileStatus[] fs);
}
The utility package includes seven files: HashUtil.java, HashUtilFactory.java,
XMLUtil.java, HBaseManager.java, HDFSUtil.java, SHAUtil.java and MD5Util.java.
HashUtil.java is the following:
package edu.uow.isit.dedu.util;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;

public abstract class HashUtil {

    protected MessageDigest messagedigest = null;
    protected static char hexDigits[] = { '0', '1', '2', '3', '4', '5',
            '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

    public String getFileHashString(File file) throws IOException {
        FileInputStream in = new FileInputStream(file);
        FileChannel ch = in.getChannel();
        // Map the whole file into memory and feed it to the digest.
        MappedByteBuffer byteBuffer = ch.map(FileChannel.MapMode.READ_ONLY, 0,
                file.length());
        messagedigest.update(byteBuffer);
        in.close(); // release the stream and its channel
        return bufferToHex(messagedigest.digest());
    }

    public String getHashString(String s) {
        return getHashString(s.getBytes());
    }

    public String getHashString(byte[] bytes) {
        messagedigest.update(bytes);
        return bufferToHex(messagedigest.digest());
    }

    protected String bufferToHex(byte bytes[]) {
        return bufferToHex(bytes, 0, bytes.length);
    }

    private String bufferToHex(byte bytes[], int m, int n) {
        StringBuffer stringbuffer = new StringBuffer(2 * n);
        int k = m + n;
        for (int l = m; l < k; l++) {
            appendHexPair(bytes[l], stringbuffer);
        }
        return stringbuffer.toString();
    }

    private void appendHexPair(byte bt, StringBuffer stringbuffer) {
        // Emit the high and low nibbles of the byte as two hex characters.
        char c0 = hexDigits[(bt & 0xf0) >> 4];
        char c1 = hexDigits[bt & 0xf];
        stringbuffer.append(c0);
        stringbuffer.append(c1);
    }

    public boolean checkPassword(String password, String hashPwdStr) {
        String s = getHashString(password);
        return s.equals(hashPwdStr);
    }
}
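As a self-contained usage sketch (not part of DeDu itself; the class name DigestDemo is hypothetical), the same digest-to-hex technique that HashUtil implements can be exercised with the JDK directly:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestDemo {
    // Convert a digest to lowercase hex, as HashUtil.bufferToHex does.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(2 * bytes.length);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b & 0xf0) >> 4, 16));
            sb.append(Character.forDigit(b & 0x0f, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The MD5 hex string of identical content is always identical,
        // which is what allows DeDu to detect duplicate files.
        MessageDigest md = MessageDigest.getInstance("MD5");
        System.out.println(toHex(md.digest("hello".getBytes())));
        // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```

Any two files with the same bytes produce the same hex string, so the string can serve as the row key in the dedu table.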
HashUtilFactory.java is the following:
package edu.uow.isit.dedu.util;

import java.security.NoSuchAlgorithmException;

public class HashUtilFactory {
    public static HashUtil getHashUtil(String algorithm)
            throws NoSuchAlgorithmException {
        if (algorithm.equals("SHA")) {
            return new SHAUtil();
        } else if (algorithm.equals("MD5")) {
            return new MD5Util();
        } else
            throw new NoSuchAlgorithmException(
                    "Fail to create HashUtil object with " + algorithm);
    }
}
HBaseManager.java is the following:
package edu.uow.isit.dedu.util;

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseManager {
    private static HBaseManager instance = new HBaseManager();
    private HBaseConfiguration config;
    private HBaseAdmin admin;

    private HBaseManager() {
        config = new HBaseConfiguration();
        try {
            admin = new HBaseAdmin(config);
        } catch (MasterNotRunningException e) {
            e.printStackTrace();
        }
    }

    public static HBaseManager getInstance() {
        return instance;
    }

    public void initTable() {
        try { // Delete the old table first
            if (admin.tableExists("dedu")) {
                System.out.println("drop table");
                admin.disableTable("dedu");
                admin.deleteTable("dedu");
            }
            System.out.println("create table");
            HTableDescriptor tableDescripter =
                    new HTableDescriptor("dedu".getBytes());
            tableDescripter.addFamily(new HColumnDescriptor("path:"));
            tableDescripter.addFamily(new HColumnDescriptor("count:"));
            // tableDescripter.addFamily(new HColumnDescriptor("sha:"));
            admin.createTable(tableDescripter);
        } catch (MasterNotRunningException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String getPathWithHash(String hash) {
        HTable table = null;
        String valueStr_path = null;
        try {
            table = new HTable(config, "dedu");
            System.out.println("Get " + hash + " data");
            Get hashGet = new Get(hash.getBytes());
            hashGet.addColumn(Bytes.toBytes("path"));
            hashGet.setMaxVersions();
            Result hashResult = table.get(hashGet);
            byte[] value_path = hashResult.getValue(Bytes.toBytes("path"));
            valueStr_path = Bytes.toString(value_path);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (table != null)
                    table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return valueStr_path;
    }

    public boolean isHashExisted(String hash) throws IOException {
        HTable table = null;
        String valueStr_count = null;
        try {
            table = new HTable(config, "dedu");
            System.out.println("Get " + hash + " data");
            Get hashGet = new Get(hash.getBytes());
            hashGet.addColumn(Bytes.toBytes("count"));
            hashGet.setMaxVersions();
            Result hashResult = table.get(hashGet);
            byte[] value_count = hashResult.getValue(Bytes.toBytes("count"));
            valueStr_count = Bytes.toString(value_count);
            if (valueStr_count == null || valueStr_count.equals("0")) {
                return false;
            } else
                return true;
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            try {
                if (table != null)
                    table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void addNewRecord(String hash, String path) {
        HTable table = null;
        try {
            table = new HTable(config, "dedu");
            System.out.println("add " + hash + " " + path + " data");

            // check md5 in dedu
            Get hashGet = new Get(hash.getBytes());
            hashGet.addColumn(Bytes.toBytes("count"));
            hashGet.setMaxVersions();
            Result hashResult = table.get(hashGet);
            byte[] value_count = hashResult.getValue(Bytes.toBytes("count"));
            String valueStr_count = Bytes.toString(value_count);
            if (valueStr_count == null || valueStr_count.equals("0")) {
                // update count
                Put update = new Put(hash.getBytes());
                update.add(new String("count").getBytes(), new byte[] {},
                        String.valueOf(1).getBytes());
                update.add(new String("path").getBytes(), new byte[] {},
                        path.getBytes());
                table.put(update);
            } else {
                Put hashPut = new Put(hash.getBytes());
                hashPut.add(new String("path").getBytes(), new byte[] {},
                        path.getBytes());
                hashPut.add(new String("count").getBytes(), new byte[] {},
                        new String("1").getBytes());
                table.put(hashPut);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (table != null)
                    table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        } // showTableContent(hash);
    }

    public int deleteRecord(String hash) {
        HTable table = null;
        try {
            table = new HTable(config, "dedu");
            // get count
            Get countGet = new Get(hash.getBytes());
            countGet.addColumn(Bytes.toBytes("count"));
            countGet.setMaxVersions();
            Result countResult = table.get(countGet);
            byte[] value = countResult.getValue(Bytes.toBytes("count"));
            String valueStr = Bytes.toString(value);
            if (valueStr.equals("NONE")) {
                System.out.println("Fail to get a record with hash: " + hash);
                return -1;
            }
            int count = Integer.parseInt(valueStr);
            if (count == 0) {
                System.out.println(
                        "Count equals 0 now! No more link file points to this file.");
                return 0;
            }
            Put update = new Put(hash.getBytes());
            update.add(new String("count").getBytes(), new byte[] {},
                    String.valueOf(--count).getBytes());
            table.put(update);
            showTableContent(hash);
            return count;
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (table != null)
                    table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return -1;
    }

    public void updateRecord(String hash) {
        HTable table = null;
        try {
            table = new HTable(config, "dedu");
            // get count
            Get countGet = new Get(hash.getBytes());
            countGet.addColumn(Bytes.toBytes("count"));
            countGet.setMaxVersions();
            Result countResult = table.get(countGet);
            if (countResult.toString().equals("NONE")) {
                System.out.println("Fail to get a record with hash: " + hash);
                return;
            }
            byte[] b_count = countResult.getValue(Bytes.toBytes("count"));
            String str_count = Bytes.toString(b_count);
            int count = Integer.parseInt(str_count);
            Put update = new Put(hash.getBytes());
            update.add(new String("count").getBytes(), new byte[] {},
                    String.valueOf(++count).getBytes());
            table.put(update);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (table != null)
                    table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        } // showTableContent(hash);
    }

    private void showTableContent(String hash) {
        HTable table = null;
        try {
            table = new HTable(config, "dedu");
            System.out.println("Get MD5 = " + hash);
            Get hashGet = new Get(hash.getBytes());
            hashGet.addColumn(Bytes.toBytes("path"));
            hashGet.addColumn(Bytes.toBytes("count"));
            hashGet.addColumn(Bytes.toBytes("sha"));
            hashGet.setMaxVersions();
            Result hashResult = table.get(hashGet);
            byte[] valuePath = hashResult.getValue(Bytes.toBytes("path"));
            byte[] valueCount = hashResult.getValue(Bytes.toBytes("count"));
            byte[] valueSha = hashResult.getValue(Bytes.toBytes("sha"));
            String valueStr = Bytes.toString(valuePath);
            System.out.println("************************* Path = " + valueStr);
            valueStr = Bytes.toString(valueCount);
            System.out.println("************************* Count = " + valueStr);
            valueStr = Bytes.toString(valueSha);
            System.out.println("************************* SHA = " + valueStr);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (table != null)
                    table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
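The count column above implements reference counting for deduplicated files: addNewRecord and updateRecord raise the count, deleteRecord lowers it, and the stored copy can be reclaimed when it reaches zero. A minimal in-memory sketch of the same bookkeeping (a HashMap standing in for the HBase table; the class and method names here are illustrative, not DeDu's API):

```java
import java.util.HashMap;
import java.util.Map;

public class RefCountDemo {
    // hash -> number of link files pointing at the stored copy
    private final Map<String, Integer> counts = new HashMap<>();

    // A new upload: the first writer stores the data, later writers only link.
    public boolean addLink(String hash) {
        int c = counts.getOrDefault(hash, 0);
        counts.put(hash, c + 1);
        return c == 0; // true: caller must actually store the data
    }

    // Deleting a link: when the count reaches zero the data can be reclaimed.
    public boolean removeLink(String hash) {
        int c = counts.getOrDefault(hash, 0);
        if (c <= 1) {
            counts.remove(hash);
            return true; // last reference gone: delete the stored file
        }
        counts.put(hash, c - 1);
        return false;
    }
}
```

The same invariant holds in DeDu, except that the counter lives in the HBase row keyed by the file's hash rather than in process memory.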
HDFSUtil.java is the following:
package edu.uow.isit.dedu.util;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class HDFSUtil {
    private static Configuration conf = new Configuration();
    private static String FS_DEFAULT_NAME = conf.get("fs.default.name");

    public static FileSystem getFileSystem() throws IOException {
        try {
            return FileSystem.get(URI.create(FS_DEFAULT_NAME), conf);
        } catch (IOException e) {
            throw new IOException("Fail to connect: " + FS_DEFAULT_NAME);
        }
    }

    public static boolean exists(String file) throws IOException {
        FileSystem fs = null;
        String hdfs_full = FS_DEFAULT_NAME + file;
        try {
            fs = getFileSystem();
            Path path_file = new Path(hdfs_full);
            return fs.exists(path_file);
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static String[] getLinkFileContent(String file) throws IOException {
        FileSystem fs = null;
        try {
            fs = getFileSystem();
            String hdfs_full = FS_DEFAULT_NAME + file;
            Path path = new Path(hdfs_full);
            InputStream in = fs.open(path);
            Properties p = new Properties();
            p.load(in);

            String real_file = p.getProperty("path", "");
            String hash = p.getProperty("hash", "");
            return new String[] { real_file, hash };
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static void deleteFile(String file) throws IOException {
        FileSystem fs = null;
        String hdfs_full = FS_DEFAULT_NAME + file;
        try {
            fs = getFileSystem();
            Path path_file = new Path(hdfs_full);
            fs.delete(path_file, true);
        } catch (IOException e) {
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static void updateLinkFile(String lnkfile, String real_file,
            String hash) throws IOException {
        FileSystem fs = null;
        String lnkPath = FS_DEFAULT_NAME + lnkfile;

        try {
            fs = getFileSystem();
            Path path = new Path(lnkPath);
            OutputStream out = fs.create(path);
            Properties p = new Properties();
            p.setProperty("path", real_file);
            p.setProperty("hash", hash);
            p.store(out, "Link file for " + real_file);
            out.close();
        } catch (IOException e) {
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static void updateRealFile(String localSrc, String dstfile,
            Progressable p) throws IOException {
        FileSystem fs = null;
        try {
            fs = getFileSystem();
            String real_file = FS_DEFAULT_NAME + dstfile;
            InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
            OutputStream out = fs.create(new Path(real_file), p);
            IOUtils.copyBytes(in, out, 4096, true);
            in.close();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static boolean isDir(String file) throws IOException {
        String dir = FS_DEFAULT_NAME + file;
        FileSystem fs = null;
        try {
            if (!HDFSUtil.exists(file))
                return false;
            else {
                Path path = new Path(dir);
                fs = getFileSystem();
                FileStatus status = fs.getFileStatus(path);
                return status.isDir();
            }
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static void mkdir(String file) throws IOException {
        String dir = FS_DEFAULT_NAME + file;
        FileSystem fs = null;
        try {
            fs = getFileSystem();
            Path path = new Path(dir);
            fs.mkdirs(path);
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static void downloadFile(String hdfs, String local, Progressable p)
            throws IOException {
        FileSystem fs = null;
        try {
            fs = getFileSystem();
            String hdfs_full = FS_DEFAULT_NAME + hdfs;
            Path hdfs_path = new Path(hdfs_full);
            Path local_path = new Path(local);
            fs.copyToLocalFile(hdfs_path, local_path);
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }

    public static FileStatus[] listFiles(String file) throws IOException {
        FileSystem fs = null;
        String dir = FS_DEFAULT_NAME + file;
        try {
            fs = getFileSystem();
            Path path = new Path(dir);
            return fs.listStatus(path);
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null)
                fs.close();
        }
    }
}
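As the listing shows, a ".lnk" file is simply a java.util.Properties text file holding two keys, path and hash, which updateLinkFile writes and getLinkFileContent reads back. The round trip can be observed without a Hadoop cluster; the following is an illustrative sketch using in-memory streams, not part of the thesis system:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class LinkFileDemo {
    // Produce a link-file image in the same format updateLinkFile() stores on HDFS.
    static byte[] writeLink(String realFile, String hash) throws IOException {
        Properties p = new Properties();
        p.setProperty("path", realFile);
        p.setProperty("hash", hash);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        p.store(out, "Link file for " + realFile);
        return out.toByteArray();
    }

    // Read it back as getLinkFileContent() does: {path, hash}.
    static String[] readLink(byte[] bytes) throws IOException {
        Properties p = new Properties();
        p.load(new ByteArrayInputStream(bytes));
        return new String[] { p.getProperty("path", ""), p.getProperty("hash", "") };
    }

    public static void main(String[] args) throws IOException {
        byte[] lnk = writeLink("/dedu/data/report.doc",
                "d41d8cd98f00b204e9800998ecf8427e");
        String[] content = readLink(lnk);
        System.out.println(content[0] + " " + content[1]);
    }
}
```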
MD5Util.java is the following:
package edu.uow.isit.dedu.util;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MD5Util extends HashUtil {
    public MD5Util() {
        try {
            messagedigest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException nsaex) {
            System.err.println(MD5Util.class.getName()
                    + " Fail to init MessageDigest: does not support MD5Util");
            nsaex.printStackTrace();
        }
    }
}
SHAUtil.java is the following:
package edu.uow.isit.dedu.util;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SHAUtil extends HashUtil {
    public SHAUtil() {
        try {
            messagedigest = MessageDigest.getInstance("SHA");
        } catch (NoSuchAlgorithmException nsaex) {
            System.err.println(SHAUtil.class.getName()
                    + " Fail to init MessageDigest: does not support SHAUtil");
            nsaex.printStackTrace();
        }
    }
}
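MD5Util and SHAUtil only select the algorithm; the shared digest logic lives in the parent class HashUtil, which is not reproduced in this chunk. As an illustrative sketch only (not the thesis's HashUtil), a hex digest can be computed with MessageDigest like this:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestDemo {
    // Compute the lowercase hex digest of a byte array with the named algorithm.
    static String hexDigest(String algorithm, byte[] data)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] digest = md.digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest)
            sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // MD5 of "abc" is a well-known test vector.
        System.out.println(hexDigest("MD5", "abc".getBytes()));
        // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```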
XMLUtil.java is the following:
package edu.uow.isit.dedu.util;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import edu.uow.isit.dedu.gui.HBaseConfig;
import edu.uow.isit.dedu.gui.HDFSConfig;

public class XMLUtil {
    public static HDFSConfig loadHDFSConfigFromXML() throws SAXException,
            IOException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        InputStream in = XMLUtil.class.getResourceAsStream("/core-site.xml");
        Document document = builder.parse(in);

        NodeList nl = document.getElementsByTagName("name");
        Node node = nl.item(0);
        String name = node.getFirstChild().getNodeValue();

        nl = document.getElementsByTagName("value");
        node = nl.item(0);
        String value = node.getFirstChild().getNodeValue();

        return new HDFSConfig(name, value);
    }

    public static void saveHDFSConfigToXML(HDFSConfig config)
            throws TransformerException, ParserConfigurationException,
            SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        InputStream in = XMLUtil.class.getResourceAsStream("/core-site.xml");
        Document document = builder.parse(in);

        NodeList nl = document.getElementsByTagName("value");
        Node node = nl.item(0);
        node.getFirstChild().setNodeValue(config.getValue());

        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();

        DOMSource source = new DOMSource(document);
        File f = new File(XMLUtil.class.getResource("/").getPath());
        f = new File(f.getPath() + "/core-site.xml");
        StreamResult result = new StreamResult(f);
        transformer.transform(source, result);
    }

    public static HBaseConfig loadHBaseConfigFromXML() throws SAXException,
            IOException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        InputStream in = XMLUtil.class.getResourceAsStream("/hbase-site.xml");
        Document document = builder.parse(in);

        NodeList nl = document.getElementsByTagName("name");
        Node node = nl.item(0);
        String name = node.getFirstChild().getNodeValue();

        nl = document.getElementsByTagName("value");
        node = nl.item(0);
        String value = node.getFirstChild().getNodeValue();

        return new HBaseConfig(name, value);
    }

    public static void saveHBaseConfigToXML(HBaseConfig config)
            throws TransformerException, ParserConfigurationException,
            SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        InputStream in = XMLUtil.class.getResourceAsStream("/hbase-site.xml");
        Document document = builder.parse(in);

        NodeList nl = document.getElementsByTagName("value");
        Node node = nl.item(0);
        node.getFirstChild().setNodeValue(config.getValue());

        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();

        DOMSource source = new DOMSource(document);
        File f = new File(XMLUtil.class.getResource("/").getPath());
        f = new File(f.getPath() + "/hbase-site.xml");
        StreamResult result = new StreamResult(f);
        transformer.transform(source, result);
    }
}
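XMLUtil reads only the first name/value pair of a Hadoop-style configuration file. The same extraction can be exercised on an in-memory document; the sketch below is illustrative (the HDFSConfig/HBaseConfig classes and resource file names are the thesis's own and are not redefined here):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ConfigParseDemo {
    // Return the first <name> and <value> text contents,
    // the same walk loadHDFSConfigFromXML() performs.
    static String[] firstNameValue(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        String name = document.getElementsByTagName("name").item(0)
                .getFirstChild().getNodeValue();
        String value = document.getElementsByTagName("value").item(0)
                .getFirstChild().getNodeValue();
        return new String[] { name, value };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration><property>"
                + "<name>fs.default.name</name>"
                + "<value>hdfs://localhost:9000</value>"
                + "</property></configuration>";
        String[] nv = firstNameValue(xml);
        System.out.println(nv[0] + " = " + nv[1]);
    }
}
```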
The command package is used only to invoke the task package, so this part shows the
code of the task package. The task package consists of one interface, ITask; one
abstract class, AbstractTask; and nine concrete classes: DownloadTask, DownloadDirTask,
DeleteTask, MkdirTask, RmdirTask, UploadTask, UploadDirTask, ListTask and
ListLocalFilesTask.
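Every task follows the same template: precessTask() runs before(), runs doTask() only if before() returned true, and always runs after(). A minimal stand-alone sketch of that flow (the ITask/AbstractTask here are trimmed illustrative copies, not the thesis classes themselves):

```java
public class TaskFlowDemo {
    interface ITask { void precessTask() throws Exception; }

    static class AbstractTask implements ITask {
        protected boolean before() throws Exception { return true; }
        protected void doTask() throws Exception {}
        protected void after() {}
        public void precessTask() throws Exception {
            try {
                if (before())
                    doTask();
            } finally {
                after();  // always runs, even when before() fails or doTask() throws
            }
        }
    }

    static StringBuilder log = new StringBuilder();

    // A task whose precondition fails: doTask() must be skipped, after() still runs.
    static class FailingTask extends AbstractTask {
        protected boolean before() { log.append("before;"); return false; }
        protected void doTask()    { log.append("do;"); }
        protected void after()     { log.append("after;"); }
    }

    public static void main(String[] args) throws Exception {
        new FailingTask().precessTask();
        System.out.println(log);  // prints "before;after;" -- "do;" is skipped
    }
}
```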
AbstractTask.java is the following:
package edu.uow.isit.dedu.task;

public class AbstractTask implements ITask {
    protected boolean before() throws Exception {
        return true;
    }

    protected void doTask() throws Exception {}

    protected void after() {}

    @Override
    public void precessTask() throws Exception {
        try {
            if (before())
                doTask();
        } catch (Exception e) {
            throw e;
        } finally {
            after();
        }
    }
}
ITask.java is the following:
package edu.uow.isit.dedu.task;

public interface ITask {
    public void precessTask() throws Exception;
}
DeleteTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.IOException;
import edu.uow.isit.dedu.DefaultReporter;
import edu.uow.isit.dedu.HBaseManager;
import edu.uow.isit.dedu.HDFSUtil;
import edu.uow.isit.dedu.IReport;

public class DeleteTask extends AbstractTask {
    private String delete_file;
    private IReport reporter;

    public DeleteTask(String file) {
        this.delete_file = file + ".lnk";
        this.reporter = new DefaultReporter();
    }

    public DeleteTask(String file, IReport reporter) {
        this.delete_file = file + ".lnk";
        if (reporter == null)
            this.reporter = new DefaultReporter();
        else
            this.reporter = reporter;
    }

    protected boolean before() throws IOException {
        boolean result = HDFSUtil.exists(this.delete_file);
        if (!result) {
            reporter.report("[ERROR] Fail to find " + this.delete_file);
        }
        return result;
    }

    protected void doTask() throws IOException {
        String[] lines = HDFSUtil.getLinkFileContent(this.delete_file);
        String hash = lines[1];
        int count = HBaseManager.getInstance().deleteRecord(hash);
        // delete link file as well
        HDFSUtil.deleteFile(this.delete_file);
        // delete real file
        if (count == 0) {
            HDFSUtil.deleteFile(lines[0]);
        }
    }

    protected void after() {
        this.reporter.report("Deleted File: " + this.delete_file);
    }
}
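DeleteTask's key decision is driven by the reference count that HBaseManager.deleteRecord(hash) returns: the link file is always removed, but the real file is removed only when no other link still points at it (count == 0). A sketch of that bookkeeping, with an in-memory map standing in for the HBase table (illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

public class RefCountDemo {
    // hash -> number of link files referring to the stored real file
    final Map<String, Integer> refs = new HashMap<String, Integer>();

    void addLink(String hash) {
        Integer c = refs.get(hash);
        refs.put(hash, c == null ? 1 : c + 1);
    }

    // Decrement and return the remaining count, like deleteRecord(hash).
    int deleteRecord(String hash) {
        int c = refs.get(hash) - 1;
        if (c == 0)
            refs.remove(hash);
        else
            refs.put(hash, c);
        return c;
    }

    public static void main(String[] args) {
        RefCountDemo hbase = new RefCountDemo();
        hbase.addLink("h1");
        hbase.addLink("h1");                            // two links share one real file
        System.out.println(hbase.deleteRecord("h1"));   // prints 1: keep real file
        System.out.println(hbase.deleteRecord("h1"));   // prints 0: real file can go
    }
}
```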
DownloadDirTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import edu.uow.isit.dedu.DefaultReporter;
import edu.uow.isit.dedu.HDFSUtil;
import edu.uow.isit.dedu.IReport;

public class DownloadDirTask extends AbstractTask {

    private String local;
    private String hdfs;
    private IReport reporter;

    public DownloadDirTask(String local, String hdfs) {
        this.local = local;
        this.hdfs = hdfs;
        this.reporter = new DefaultReporter();
    }

    public DownloadDirTask(String local, String hdfs, IReport reporter) {
        this.local = local;
        this.hdfs = hdfs;
        if (reporter != null)
            this.reporter = reporter;
        else
            this.reporter = new DefaultReporter();
    }

    protected boolean before() throws IOException {
        if (!HDFSUtil.exists(this.hdfs)) {
            reporter.report("[ERROR] Fail to find: " + this.hdfs);
            return false;
        } else if (!HDFSUtil.isDir(this.hdfs)) {
            reporter.report("[ERROR]: " + this.hdfs + " is not a directory.");
            return false;
        } else
            return true;
    }

    protected void doTask() throws Exception {
        copyFilesFromHDFSDir(this.hdfs, this.local);
    }

    private void copyFilesFromHDFSDir(String dir, String local) throws Exception {
        if (HDFSUtil.isDir(dir)) {
            String[] paths = dir.split(Path.SEPARATOR);

            File localFile = new File(local + System.getProperty("file.separator")
                    + paths[paths.length - 1]);
            if (!localFile.exists()) {
                localFile.mkdir();
            }
            FileStatus status[] = HDFSUtil.listFiles(dir);
            for (int i = 0; i < status.length; i++) {
                copyFilesFromHDFSDir(
                        dir + Path.SEPARATOR + status[i].getPath().getName(),
                        localFile.getAbsolutePath());
            }
        } else {
            ITask task = new DownloadTask(local, dir);
            task.precessTask();
        }
    }

    protected void after() {
        this.reporter.report("Downloaded Files: " + this.hdfs + " to " + this.local);
    }
}
DownloadTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.IOException;
import edu.uow.isit.dedu.DefaultReporter;
import edu.uow.isit.dedu.HDFSUtil;
import edu.uow.isit.dedu.IReport;

public class DownloadTask extends AbstractTask {

    private String local;
    private String hdfs;
    private IReport reporter;

    public DownloadTask(String local, String hdfs) {
        this.local = local;
        if (!hdfs.endsWith(".lnk"))
            this.hdfs = hdfs + ".lnk";
        else
            this.hdfs = hdfs;
        this.reporter = new DefaultReporter();
    }

    public DownloadTask(String local, String hdfs, IReport reporter) {
        this.local = local;
        if (!hdfs.endsWith(".lnk"))
            this.hdfs = hdfs + ".lnk";
        else
            this.hdfs = hdfs;
        if (reporter != null)
            this.reporter = reporter;
        else
            this.reporter = new DefaultReporter();
    }

    protected boolean before() throws IOException {
        if (HDFSUtil.exists(this.hdfs)) {
            return true;
        } else {
            reporter.report("[ERROR] Fail to find: " + this.hdfs);
            return false;
        }
    }

    protected void doTask() throws IOException {
        String[] lines = HDFSUtil.getLinkFileContent(this.hdfs);
        this.reporter.report("Starting Downloading File: " + this.hdfs + " to "
                + this.local);
        HDFSUtil.downloadFile(lines[0], local, reporter);
    }

    protected void after() {
        this.reporter.report("Downloaded File: " + this.hdfs + " to " + this.local);
    }
}
ListLocalFilesTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.IOException;
import edu.uow.isit.dedu.DefaultReporter;
import edu.uow.isit.dedu.IReport;

public class ListLocalFilesTask extends AbstractTask {

    private String dir;
    private IReport reporter;

    public ListLocalFilesTask(String path) {
        this.dir = path;
        this.reporter = new DefaultReporter();
    }

    public ListLocalFilesTask(String path, IReport reporter) {
        this.dir = path;
        if (reporter == null)
            this.reporter = new DefaultReporter();
        else
            this.reporter = reporter;
    }

    protected boolean before() throws IOException {
        return true;
    }

    protected void doTask() throws IOException {
        // reporter.showResult(HDFSUtil.listLocalFiles(dir));
    }

    protected void after() {
        this.reporter.report("Show Files of Dir: " + this.dir);
    }
}
ListTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.IOException;
import edu.uow.isit.dedu.DefaultReporter;
import edu.uow.isit.dedu.HDFSUtil;
import edu.uow.isit.dedu.IReport;

public class ListTask extends AbstractTask {

    private String dir;
    private IReport reporter;

    public ListTask(String path) {
        this.dir = path;
        this.reporter = new DefaultReporter();
    }

    public ListTask(String path, IReport reporter) {
        this.dir = path;
        if (reporter == null)
            this.reporter = new DefaultReporter();
        else
            this.reporter = reporter;
    }

    protected boolean before() throws IOException {
        boolean result = HDFSUtil.isDir(dir);
        if (!result) {
            reporter.report("[ERROR] Fail to find " + dir);
        }
        return result;
    }

    protected void doTask() throws IOException {
        reporter.showResult(HDFSUtil.listFiles(dir));
    }

    protected void after() {
        this.reporter.report("Show Files of Dir: " + this.dir);
    }
}
MkdirTask.java is the following:
package edu.uow.isit.dedu.task;

import edu.uow.isit.dedu.*;
import java.io.IOException;

public class MkdirTask extends AbstractTask {

    private String dir;
    private IReport reporter;

    public MkdirTask(String path) {
        this.dir = path;
        this.reporter = new DefaultReporter();
    }

    public MkdirTask(String path, IReport reporter) {
        this.dir = path;
        if (reporter != null)
            this.reporter = reporter;
        else
            this.reporter = new DefaultReporter();
    }

    protected boolean before() throws IOException {
        if (HDFSUtil.exists(dir)) {
            reporter.report("[ERROR] Fail to mkdir at " + dir + ". It exists!");
            return false;
        } else
            return true;
    }

    protected void doTask() throws IOException {
        HDFSUtil.mkdir(dir);
    }

    protected void after() {
        this.reporter.report("Mkdir: " + dir + " finished!");
    }
}
RmdirTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.IOException;
import edu.uow.isit.dedu.DefaultReporter;
import edu.uow.isit.dedu.HDFSUtil;
import edu.uow.isit.dedu.IReport;

public class RmdirTask extends AbstractTask {

    private String dir;
    private IReport reporter;

    public RmdirTask(String path) {
        this.dir = path;
        this.reporter = new DefaultReporter();
    }

    public RmdirTask(String path, IReport reporter) {
        this.dir = path;
        if (reporter != null)
            this.reporter = reporter;
        else
            this.reporter = new DefaultReporter();
    }

    protected boolean before() throws IOException {
        boolean result = HDFSUtil.isDir(dir);
        if (!result) {
            reporter.report("[ERROR] Fail to find " + dir);
        }
        return result;
    }

    protected void doTask() throws IOException {
        HDFSUtil.deleteFile(dir);
    }

    protected void after() {
        this.reporter.report("Deleted File: " + this.dir);
    }
}
UploadTask.java is the following:
package edu.uow.isit.dedu.task;

import java.io.File;
import java.io.IOException;
import edu.uow.isit.dedu.*;

public class UploadTask extends AbstractTask {
    private String src;
    private String dst;
    private IReport reporter;
    private String hash;

    public UploadTask(String src, String dst) {
        this.src = src;
        this.dst = dst;
        this.reporter = new DefaultReporter();
    }

    public UploadTask(String src, String dst, IReport reporter) {
        this.src = src;
        this.dst = dst;
        if (reporter != null)
            this.reporter = reporter;
        else
            this.reporter = new DefaultReporter();
    }

    private boolean isSrcExisted() {
        File localFile = new File(src);
        if (!localFile.exists()) {
            reporter.report("[ERROR] File: " + src + " does not exist!");
            return false;
        } else
            return true;
    }

    private boolean isDstExisted() throws IOException {
        if (HDFSUtil.exists(dst)) {
            reporter.report("[ERROR] File: " + dst + " already existed!");
            return true;
        } else
            return false;
    }

    private boolean hasHashOnHBase() throws IOException {
        File localFile = new File(src);
        hash = MD5Util.getFileMD5String(localFile);
        return HBaseManager.getInstance().isHashExisted(hash);
    }

    @Override
    protected boolean before() throws IOException {
        return this.isSrcExisted() && (!this.isDstExisted());
    }

    protected void doTask() throws IOException {
        String link = dst + ".lnk";
        if (this.hasHashOnHBase()) {
            HBaseManager.getInstance().updateRecord(hash);
            String path = HBaseManager.getInstance().getPathWithHash(hash);
            HDFSUtil.updateLinkFile(link, path, hash);
        } else {
            HBaseManager.getInstance().addNewRecord(hash, dst);
            HDFSUtil.updateRealFile(src, dst, reporter);
            HDFSUtil.updateLinkFile(link, dst, hash);
        }
    }

    protected void after() {
        this.reporter.report("Updated File: " + src + " to " + dst);
    }
}
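UploadTask implements the deduplication decision itself: hash the local file; if the hash is already recorded, write only a link file pointing at the existing copy; otherwise store the real file plus a link. A compact sketch of that branch, with in-memory maps standing in for HBase and HDFS (illustrative only, not the thesis implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class DedupUploadDemo {
    final Map<String, String> hashToPath = new HashMap<String, String>(); // "HBase" rows
    final Map<String, String> links = new HashMap<String, String>();      // ".lnk" files
    int realWrites = 0;                                                   // real files stored

    // Mirror of UploadTask.doTask(): link-only when the hash is already known.
    void upload(String dst, String hash) {
        String link = dst + ".lnk";
        String existing = hashToPath.get(hash);
        if (existing != null) {
            links.put(link, existing);    // duplicate content: no second real copy
        } else {
            hashToPath.put(hash, dst);
            realWrites++;                 // store the real file exactly once
            links.put(link, dst);
        }
    }

    public static void main(String[] args) {
        DedupUploadDemo store = new DedupUploadDemo();
        store.upload("/a/report.doc", "h1");
        store.upload("/b/copy.doc", "h1");   // same content, different name
        System.out.println(store.realWrites); // prints 1: only one real write
    }
}
```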