Software-as-a-Service (SaaS)
The traditional model of software distribution, in which software is purchased for and
installed on personal computers, is sometimes referred to as Software-as-a-Product.
Software-as-a-Service is a software distribution model in which applications are hosted
by a vendor or service provider and made available to customers over a network,
typically the Internet. SaaS is becoming an increasingly prevalent delivery model as
underlying technologies that support web services and service-oriented architecture
(SOA) mature and new development approaches become popular. SaaS is also often
associated with a pay-as-you-go subscription licensing model. Meanwhile, broadband
service has become increasingly available to support user access from
more areas around the world. The huge strides made by Internet Service Providers
(ISPs) to increase bandwidth, and the constant introduction of ever more powerful
microprocessors coupled with inexpensive data storage devices, are providing a huge
platform for designing, deploying, and using software across all areas of business and
personal computing. SaaS applications also must be able to interact with other data and
other applications in an equally wide variety of environments and platforms. SaaS is
closely related to other service delivery models we have described. IDC identifies two
slightly different delivery models for SaaS. The hosted application management model is
similar to an Application Service Provider (ASP) model. Here, an ASP hosts
commercially available software for customers and delivers it over the Internet. The
other model is a software on demand model where the provider gives customers
network-based access to a single copy of an application created specifically for SaaS
distribution. IDC predicted that SaaS would make up30% of the software market by
2007 and would be worth $10.7 billion by the end of 2009.
SaaS is most often implemented to provide business software functionality to enterprise
customers at a low cost while allowing those customers to obtain the same benefits
of commercially licensed, internally operated
software without the associated complexity of installation, management, support,
licensing, and high initial cost.
Most customers have little interest in the how or why of software implementation,
deployment, etc., but all have a need to use software in their work. Many types of
software are well suited to the SaaS model (e.g., accounting, customer relationship
management, email software, human resources, IT security, IT service management,
videoconferencing, web analytics, web content management). The distinction between
SaaS and earlier applications delivered over the Internet is that SaaS solutions were
developed specifically to work within a web browser. The architecture of SaaS-based
applications is specifically designed to support many concurrent users (multitenancy).
This is a big difference from the traditional client/server or application service
provider (ASP)-based solutions that cater to a contained audience. SaaS providers, on
the other hand, leverage enormous economies of scale in the deployment, management,
support, and maintenance of their offerings.
SaaS Implementation Issues
Many types of software components and applications frameworks may be employed in
the development of SaaS applications. Using new technology found in these modern
components and application frameworks can drastically reduce the time to market and
cost of converting a traditional on-premises product into a SaaS solution. According to
Microsoft, SaaS architectures can be classified into one of four maturity levels whose key
attributes are ease of configuration, multitenant efficiency, and scalability. Each level is
distinguished from the previous one by the addition of one of these three attributes. The
levels described by Microsoft are as follows:
SaaS Architectural Maturity Level 1—Ad-Hoc/Custom.
The first level of maturity is actually no maturity at all. Each customer has a unique,
customized version of the hosted application. The application runs its own instance on
the host’s servers. Migrating a traditional non-networked or client-server application to
this level of SaaS maturity typically requires the least development effort and reduces
operating costs by consolidating server hardware and administration.
SaaS Architectural Maturity Level 2—Configurability.
The second level of SaaS maturity provides greater program flexibility through
configuration metadata. At this level, many customers can use separate instances of the
same application. This allows a vendor to meet the varying needs of each customer by
using detailed configuration options. It also allows the vendor to ease the maintenance
burden by being able to update a common code base.
SaaS Architectural Maturity Level 3—Multitenant Efficiency.
The third maturity level adds multitenancy to the second level. This results in a single
program instance that has the capability to serve all of the vendor’s customers. This
approach enables more efficient use of server resources without any apparent difference
to the end user, but ultimately this level is limited in its ability to scale massively.
SaaS Architectural Maturity Level 4—Scalable.
At the fourth SaaS maturity level, scalability is added by using a multitiered
architecture. This architecture is capable of supporting a load-balanced farm of identical
application instances running on a variable number of servers, sometimes in the
hundreds or even thousands. System capacity can be dynamically increased or
decreased to match load demand by adding or removing servers, with no need for
further alteration of the application software architecture.
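To make the configurability and multitenancy attributes concrete, here is a minimal sketch (with invented tenant names and settings) of a single application instance serving many tenants from shared configuration metadata, rather than from per-customer custom code:

    # Minimal sketch of multitenant configuration lookup (Levels 2-3).
    # A single application instance serves every tenant; per-tenant
    # behavior comes from configuration metadata, not custom code.

    from dataclasses import dataclass

    @dataclass
    class TenantConfig:
        tenant_id: str
        locale: str = "en_US"
        features: tuple = ()

    # Configuration metadata; in a real service this lives in a shared store.
    TENANT_CONFIGS = {
        "acme": TenantConfig("acme", locale="en_GB", features=("reports",)),
        "globex": TenantConfig("globex", features=("reports", "exports")),
    }

    def handle_request(tenant_id: str, action: str) -> str:
        # One code path for all tenants; metadata decides what each one sees.
        config = TENANT_CONFIGS[tenant_id]
        if action not in config.features:
            return f"{tenant_id}: '{action}' is not enabled"
        return f"{tenant_id}: running '{action}' with locale {config.locale}"

    print(handle_request("acme", "exports"))    # not enabled for acme
    print(handle_request("globex", "exports"))  # enabled for globex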
Key Characteristics of SaaS
Deploying applications in a service-oriented architecture is a more complex problem
than is usually encountered in traditional models of software deployment. As a result,
SaaS applications are generally priced based on the number of users that can have
access to the service. There are often additional fees for the use of help desk services,
extra bandwidth, and storage. SaaS revenue streams to the vendor are usually lower
initially than traditional software license fees. However, the trade-off for lower license
fees is a monthly recurring revenue stream, which is viewed by most corporate CFOs as
a more predictable gauge of how the business is faring quarter to quarter. These
monthly recurring charges are viewed much like maintenance fees for licensed software.
The key characteristics of SaaS software are the following:
1. Network-based management and access to commercially available software from
central locations rather than at each customer’s site, enabling customers to access
applications remotely via the Internet.
2. Application delivery from a one-to-many model (single-instance, multitenant
architecture), as opposed to a traditional one-to-one model.
3. Centralized enhancement and patch updating that obviates any need for
downloading and installing by a user. SaaS is often used in conjunction with a
larger network of communications and collaboration software, sometimes as a
plug-in to a PaaS architecture.
Benefits of the SaaS Model
Application deployment cycles inside companies can take years, consume massive
resources, and yield unsatisfactory results. Although the initial decision to relinquish
control is a difficult one, it is one that can lead to improved efficiency, lower risk, and a
generous return on investment.
An increasing number of companies want to use the SaaS model for corporate
applications such as customer relationship management and those that fall under the
Sarbanes-Oxley Act compliance umbrella (e.g., financial recording and human
resources). The SaaS model helps enterprises ensure that all locations are using the
correct application version and, therefore, that the format of the data being recorded
and conveyed is consistent, compatible, and accurate. By placing the responsibility for
an application onto the doorstep of a SaaS provider, enterprises can reduce
administration and management burdens they would otherwise have for their own
corporate applications. SaaS also helps to increase the availability of applications to
global locations. SaaS also ensures that all application transactions are logged for
compliance purposes. The benefits of SaaS to the customer are very clear:
1. Streamlined administration
2. Automated update and patch management services
3. Data compatibility across the enterprise (all users have the same version of
software)
4. Facilitated, enterprise-wide collaboration
5. Global accessibility
Server virtualization can be used in SaaS architectures, either in place of or in addition
to multitenancy. A major benefit of platform virtualization is that it can increase a
system’s capacity without any need for additional programming. Conversely, a huge
amount of programming may be required in order to construct more efficient,
multitenant applications. The effect of combining multitenancy and platform
virtualization into a SaaS solution provides greater flexibility and performance to the
end user. In this chapter, we have discussed how the computing world has moved from
stand-alone, dedicated computing to client/network computing and on into the cloud
for remote computing. The advent of web-based services has given rise to a variety of
service offerings, sometimes known collectively as XaaS. We covered these service
models, focusing on the type of service provided to the customer (i.e., communications,
infrastructure, monitoring, outsourced platforms, and software).
Platform-as-a-Service (PaaS)
Cloud computing has evolved to include platforms for building and running custom
web-based applications, a concept known as Platform-as-a-Service. PaaS is an
outgrowth of the SaaS application delivery model. The PaaS model makes all of the
facilities required to support the complete lifecycle of building and delivering web
applications and services entirely available from the Internet, all with no software
downloads or installation for developers, IT managers, or end users. Unlike the IaaS
model, where developers may create a specific operating system instance with home-
grown applications running, PaaS developers are concerned only with web-based
development and generally do not care what operating system is used. PaaS services
allow users to focus on innovation rather than complex infrastructure. Organizations
can redirect a significant portion of their budgets to creating applications that provide
real business value instead of worrying about all the infrastructure issues in a roll-your-
own delivery model. The PaaS model is thus driving a new era of mass innovation. Now,
developers around the world can access unlimited computing power. Anyone with an
Internet connection can build powerful applications and easily deploy them to users
globally.
The Traditional On-Premises Model
The traditional approach of building and running on-premises applications has always
been complex, expensive, and risky. Building your own solution has never offered any
guarantee of success. Each application was designed to meet specific business
requirements. Each solution required a specific set of hardware, an operating system, a
database, often a middleware package, email and web servers, etc. Once the
hardware and software environment was created, a team of developers had to navigate
complex programming development platforms to build their applications. Additionally,
a team of network, database, and system management experts was needed to keep
everything up and running. Inevitably, a business requirement would force the
developers to make a change to the application. The changed application then required
new test cycles before being distributed. Large companies often needed
specialized facilities to house their data centers. Enormous amounts of electricity also
were needed to power the servers as well as to keep the systems cool. Finally, all of this
required use of fail-over sites to mirror the data center so that information could be
replicated in case of a disaster. Old days, old ways. Now, let's fly into the silver lining of
today's cloud.
The New Cloud Model
PaaS offers a faster, more cost-effective model for application development and delivery.
PaaS provides all the infrastructure needed to run applications over the Internet. Such
is the case with companies such as Amazon.com, eBay, Google, iTunes, and YouTube.
The new cloud model has made it possible to deliver such new capabilities to new
markets via the web browser. PaaS is based on a metering or subscription model, so
users pay only for what they use.
PaaS offerings include workflow facilities for application design, application
development, testing, deployment, and hosting, as well as application services such as
virtual offices, team collaboration, database integration, security, scalability, storage,
persistence, state management, dashboard instrumentation, etc.
Key Characteristics of PaaS
Chief characteristics of PaaS include services to develop, test, deploy, host, and manage
applications to support the application development life cycle. Web-based
user interface creation tools typically provide some level of support to simplify the
creation of user interfaces, based either on common standards such as HTML and
JavaScript or on other, proprietary technologies. Supporting a multitenant architecture
helps to remove developer concerns regarding the use of the application by many
concurrent users. PaaS providers often include services for concurrency management,
scalability, failover, and security. Another characteristic is integration with web
services and databases. Support for Simple Object Access Protocol (SOAP) and other
interfaces allows PaaS offerings to create combinations of web services (called mashups)
as well as having the ability to access databases and reuse services maintained inside
private networks. The ability to form and share code with ad-hoc, predefined, or
distributed teams greatly enhances the productivity of PaaS offerings. Integrated PaaS
offerings provide an opportunity for developers to have much greater insight into the
inner workings of their applications and the behavior of their users by implementing
dashboard-like tools to view the inner workings based on measurements such as
performance, number of concurrent accesses, etc. Some PaaS offerings leverage this
instrumentation to enable pay-per-use billing models.
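As a hedged illustration of the mashup idea, the sketch below stitches two hypothetical REST endpoints into one combined result; the URLs and JSON fields are invented for the example, and a real PaaS catalog would supply its own SOAP or REST interfaces:

    # Minimal mashup sketch: combine two web services into one response.
    # Both endpoints and their JSON shapes are hypothetical.

    import requests

    def get_weather(city: str) -> dict:
        resp = requests.get("https://weather.example.com/api", params={"city": city})
        resp.raise_for_status()
        return resp.json()  # assumed shape: {"tempC": 21}

    def get_events(city: str) -> dict:
        resp = requests.get("https://events.example.com/api", params={"city": city})
        resp.raise_for_status()
        return resp.json()  # assumed shape: {"events": ["...", "..."]}

    def city_dashboard(city: str) -> dict:
        # The "mashup": one combined view built from two independent services.
        return {
            "city": city,
            "temperature_c": get_weather(city).get("tempC"),
            "events": get_events(city).get("events", []),
        }

    if __name__ == "__main__":
        print(city_dashboard("Amsterdam"))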
Infrastructure-as-a-Service (IaaS)
According to the online reference Wikipedia, Infrastructure-as-a-Service (IaaS) is the
delivery of computer infrastructure (typically a platform virtualization environment) as
a service.
IaaS leverages significant technology, services, and data center investments to deliver
IT as a service to customers. Unlike traditional outsourcing, which requires extensive
due diligence, negotiations ad infinitum, and complex, lengthy contract vehicles, IaaS is
centered around a model of service delivery that provisions a predefined, standardized
infrastructure specifically optimized for the customer's applications. Simplified
statements of work and à la carte service-level choices make it easy to tailor a solution to
a customer’s specific application requirements. IaaS providers manage the transition
and hosting of selected applications on their infrastructure. Customers maintain
ownership and management of their application(s) while off-loading hosting operations
and infrastructure management to the IaaS provider. Provider-owned implementations
typically include the following layered components:
• Computer hardware (typically set up as a grid for massive horizontal scalability)
• Computer network (including routers, firewalls, load balancing, etc.)
• Internet connectivity (often on OC-192 backbones)
• Platform virtualization environment for running client-specified virtual machines
• Service-level agreements
• Utility computing billing
Rather than purchasing data center space, servers, software, network equipment, etc.,
IaaS customers essentially rent those resources as a fully outsourced service. Usually,
the service is billed on a monthly basis, just like a utility company bills customers. The
customer is charged only for resources consumed.
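The utility-style billing described above reduces to metering consumption and multiplying by a rate. A minimal sketch, with invented rates and usage figures purely to show the shape of the calculation:

    # Sketch of utility-style billing: charge only for resources consumed.
    # Rates and usage figures are illustrative, not any provider's prices.

    RATES = {
        "cpu_hours": 0.10,   # $ per CPU-hour
        "storage_gb": 0.15,  # $ per GB-month
        "egress_gb": 0.12,   # $ per GB transferred out
    }

    def monthly_bill(usage: dict) -> float:
        # Sum rate * quantity over every metered resource.
        return round(sum(RATES[k] * v for k, v in usage.items()), 2)

    print(monthly_bill({"cpu_hours": 400, "storage_gb": 50, "egress_gb": 120}))
    # 400*0.10 + 50*0.15 + 120*0.12 = 40.0 + 7.5 + 14.4 = 61.9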
The chief benefits of using this type of outsourced service include:
• Ready access to a preconfigured environment that is generally ITIL-based (The
Information Technology Infrastructure Library [ITIL] is a customized framework
of best practices designed to promote quality computing services in the IT
sector.)
• Use of the latest technology for infrastructure equipment
• Secured, “sandboxed” (protected and insulated) computing platforms that are
usually security monitored for breaches
• Reduced risk by having off-site resources maintained by third parties
• Ability to manage service-demand peaks and valleys
• Lower costs that allow expensing service costs instead of making capital
investments
• Reduced time, cost, and complexity in adding new features or capabilities
Modern On-Demand Computing
On-demand computing is an increasingly popular enterprise model
in which computing resources are made available to the user as needed. Computing
resources that are maintained on a user’s site are becoming fewer and fewer, while those
made available by a service provider are on the rise. The on-demand model evolved to
overcome the challenge of being able to meet fluctuating resource demands efficiently.
Because demand for computing resources can vary drastically from one time to another,
maintaining sufficient resources to meet peak requirements can be costly.
Overengineering a solution can be just as adverse as a situation where the enterprise cuts
costs by maintaining only minimal computing resources, resulting in insufficient
resources to meet peak load requirements. Concepts such as clustered computing, grid
computing, utility computing, etc., may all seem very similar to the concept of on-
demand computing, but they can be better understood if one thinks of them as building
blocks that evolved over time to achieve the modern cloud computing model we think of
and use today (see Figure 2.1). One example we will examine is Amazon’s Elastic
Compute Cloud (Amazon EC2). This is a web service that
provides resizable computing capacity in the cloud. It is designed to make web-scale
computing easier for developers and offers many advantages to customers:
• Its web service interface allows customers to obtain and configure capacity with
minimal effort.
• It provides users with complete control of their (leased) computing resources and
lets them run on a proven computing environment.
• It reduces the time required to obtain and boot new server instances to minutes,
allowing customers to quickly scale capacity as their computing demands dictate.
• It changes the economics of computing by allowing clients to pay only for
capacity they actually use.
• It provides developers the tools needed to build failure-resilient applications and
isolate themselves from common failure scenarios.
Amazon’s Elastic Cloud
Amazon EC2 presents a true virtual computing environment, allowing clients to use a
web-based interface to obtain and manage services needed to launch one or more
instances of a variety of operating systems (OSs). Clients can load the OS environments
with their customized applications. They can manage their network’s access permissions
and run as many or as few systems as needed. In order to use Amazon EC2, clients first
need to create an Amazon Machine Image (AMI). This image contains the applications,
libraries, data, and associated configuration settings used in the virtual computing
environment. Amazon EC2 offers the use of preconfigured images built with templates
to get up and running immediately. Once users have defined and configured their AMI,
they use the Amazon EC2 tools provided for storing the AMI by uploading the AMI into
Amazon S3. Amazon S3 is a repository that provides safe, reliable, and fast access to a
client AMI. Before clients can use the AMI, they must use the Amazon EC2 web service
to configure security and network access.
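For a sense of what this workflow looks like in code, here is a sketch that launches an instance from an existing AMI using boto3, the current AWS SDK for Python. The AMI ID, key pair, and security group are placeholders, and the manual AMI-to-S3 upload step described above is nowadays largely handled by Amazon's own tooling:

    # Sketch of launching an EC2 instance with boto3 (the AWS SDK for Python).
    # The AMI ID, key name, and security group below are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # the AMI: OS plus your applications
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        KeyName="my-key-pair",                      # controls SSH access
        SecurityGroupIds=["sg-0123456789abcdef0"],  # network access rules
    )

    instance_id = response["Instances"][0]["InstanceId"]
    print("Launched:", instance_id)

    # Later, release the capacity you no longer need:
    # ec2.terminate_instances(InstanceIds=[instance_id])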
Amazon EC2 Service Characteristics
There are quite a few characteristics of the EC2 service that provide significant benefits
to an enterprise. First of all, Amazon EC2 provides financial benefits. Because of
Amazon’s massive scale and large customer base, it is an inexpensive alternative to
many other possible solutions. The costs incurred to set up and run an operation are
shared over many customers, making the overall cost to any single customer much lower
than almost any other alternative. Customers pay a very low rate for the compute
capacity they actually consume. Security is also provided through Amazon EC2 web
service interfaces. These allow users to configure firewall settings that control
network access to and between groups of instances.
Amazon EC2 offers a highly reliable environment where replacement instances can be
rapidly provisioned.
When one compares this solution to the significant up-front expenditures traditionally
required to purchase and maintain hardware, either in-house or hosted, the decision to
outsource is not hard to make. Outsourced solutions like EC2 free customers from many
of the complexities of capacity planning and allow clients to move from large capital
investments and fixed costs to smaller, variable, expensed costs. This approach removes
the need to over buy and over build capacity to handle periodic traffic spikes. The EC2
service runs within Amazon’s proven, secure, and reliable network infrastructure and
data center locations.
Dynamic Scalability
Amazon EC2 enables users to increase or decrease capacity in a few minutes. Users can
invoke a single instance, hundreds of instances, or even thousands of instances
simultaneously. Of course, because this is all controlled with web service APIs, an
application can automatically scale itself up or down depending on its needs. This type
of dynamic scalability is very attractive to enterprise customers because it allows them
to meet their customers’ demands without having to overbuild their infrastructure.
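A hedged sketch of such self-scaling logic follows: poll a load figure and adjust capacity through an API call. The group name, thresholds, and policy are invented for illustration; production systems would normally rely on the provider's built-in scaling policies instead:

    # Sketch of self-scaling via web service APIs (boto3).
    # Group name, thresholds, and policy are illustrative only.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    GROUP = "my-web-tier"

    def rescale(current_load: float, capacity: int) -> int:
        # Toy policy: grow under heavy load, shrink when mostly idle.
        if current_load > 0.80:
            capacity = min(capacity + 2, 20)
        elif current_load < 0.20:
            capacity = max(capacity - 1, 1)
        return capacity

    desired = rescale(current_load=0.9, capacity=4)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=GROUP,
        DesiredCapacity=desired,
    )
    print("Requested capacity:", desired)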
Configuration Flexibility
Configuration settings can vary widely among users. They have the choice of multiple
instance types, operating systems, and software packages. Amazon EC2 allows them to
select a configuration of memory, CPU, and instance storage that is optimal for their
choice of operating system and application. For example, a user’s choice of operating
systems may also include numerous Linux distributions, Microsoft Windows Server, and
even an OpenSolaris environment, all running on virtual servers.
Integration with Other Amazon Web Services
Amazon EC2 works in conjunction with a variety of other Amazon web services. For
example, Amazon Simple Storage Service (Amazon S3), Amazon SimpleDB, Amazon
Simple Queue Service (Amazon SQS), and Amazon CloudFront are all integrated to
provide a complete solution for computing, query processing, and storage across a wide
range of applications. Amazon S3 provides a web services interface that allows users
to store and retrieve any amount of data from the Internet at any time, anywhere. It
gives developers direct access to the same highly scalable, reliable, fast, and inexpensive data
storage infrastructure Amazon uses to run its own global network of web sites. The S3
service aims to maximize benefits of scale and to pass those benefits on to developers.
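To ground the store-and-retrieve claim, here is a minimal boto3 sketch of writing and reading an S3 object; the bucket name is a placeholder and must already exist in your account:

    # Sketch of storing and retrieving an object in Amazon S3 with boto3.
    # The bucket name is a placeholder; buckets must be created first.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-example-bucket"

    # Store any amount of data under a key...
    s3.put_object(Bucket=BUCKET, Key="reports/2024.txt", Body=b"hello, cloud")

    # ...and retrieve it from anywhere, at any time.
    obj = s3.get_object(Bucket=BUCKET, Key="reports/2024.txt")
    print(obj["Body"].read().decode())  # -> hello, cloud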
Monitoring-as-a-Service (MaaS)
Monitoring-as-a-Service (MaaS) is the outsourced provisioning of security, primarily on
business platforms that leverage the Internet to conduct business.
MaaS has become increasingly popular over the last decade. Since the advent of cloud
computing, its popularity has grown even more. Security monitoring involves
protecting an enterprise or government client from cyber threats. A security team plays
a crucial role in securing and maintaining the confidentiality, integrity, and availability
of IT assets. However, time and resource constraints limit security operations and their
effectiveness for most companies. This requires constant vigilance over the security
infrastructure and critical information assets. Many industry regulations require
organizations to monitor their security environment, server logs, and other information
assets to ensure the integrity of these systems. However, conducting effective security
monitoring can be a daunting task because it requires advanced technology, skilled
security experts, and scalable processes—none of which come cheap. MaaS security
monitoring services offer real-time, 24/7 monitoring and nearly immediate incident
response across a security infrastructure—they help to protect critical information assets
of their customers. Prior to the advent of electronic security systems, security
monitoring and response were heavily dependent on human resources and human
capabilities, which also limited the accuracy and effectiveness of monitoring efforts.
Over the past two decades, the adoption of information technology into facility security
systems, and their ability to be connected to security operations centers (SOCs) via
corporate networks, has significantly changed that picture. This means two important
things:
(1) The total cost of ownership (TCO) for traditional SOCs is much higher than for a
modern-technology SOC; and
(2) achieving lower security operations costs and higher security effectiveness means
that modern SOC architecture must use security and IT technology to address security
risks.
Protection Against Internal and External Threats
SOC-based security monitoring services can improve the effectiveness of a customer
security infrastructure by actively analyzing logs and alerts from infrastructure devices
around the clock and in real time. Monitoring teams correlate information from various
security devices to provide security analysts with the data they need to eliminate false
positives and respond to true threats against the enterprise. Having consistent access to
the skills needed to maintain the level of service an organization requires for enterprise-
level monitoring is a huge issue. The information security team can assess system
performance on a periodically recurring basis and provide recommendations for
improvements as needed. Typical services provided by many MaaS vendors
are described below.
Early Detection
An early detection service detects and reports new security vulnerabilities shortly after
they appear. Generally, the threats are correlated with third-party sources, and an alert
or report is issued to customers. This report is usually sent by email to the person
designated by the company. Security vulnerability reports, aside from containing a
detailed description of the vulnerability and the platforms affected, also include
information on the impact the exploitation of this vulnerability would have on the
systems or applications previously selected by the company receiving the report. Most
often, the report also indicates specific actions to be taken to minimize the effect of the
vulnerability, if that is known.
Platform, Control, and Services Monitoring
Platform, control, and services monitoring is often implemented as a dashboard
interface and makes it possible to know the operational status of the platform being
monitored at any time. It is accessible from a web interface, making remote access
possible. Each operational element that is monitored usually provides an operational
status indicator, always taking into account the critical impact of each element. This
service aids in determining which elements may be operating at or near capacity or
beyond the limits of established parameters. By detecting and identifying such
problems, preventive measures can be taken to prevent loss of service.
Intelligent Log Centralization and Analysis
Intelligent log centralization and analysis is a monitoring solution based mainly on the
correlation and matching of log entries. Such analysis helps to establish a baseline of
operational performance and provides an index of security threats. Alarms can be raised
in the event an incident moves the established baseline parameters beyond a stipulated
threshold. These types of sophisticated tools are used by a team of security experts who
are responsible for incident response once such a threshold has been crossed and the
threat has generated an alarm or warning picked up by security analysts monitoring the
systems.
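The baseline-and-threshold logic described here can be sketched in a few lines of Python; the metric, window, and threshold multiplier are invented for illustration:

    # Sketch of baseline-plus-threshold alerting over centralized log counts.
    # Window size and threshold multiplier are illustrative choices.

    from statistics import mean

    def check_for_alarm(event_counts: list[int], threshold: float = 3.0) -> bool:
        # Baseline: average events per interval over the history window.
        baseline = mean(event_counts[:-1])
        latest = event_counts[-1]
        # Alarm when the newest interval exceeds baseline * threshold.
        return latest > baseline * threshold

    failed_logins_per_hour = [4, 6, 5, 7, 5, 42]
    if check_for_alarm(failed_logins_per_hour):
        print("ALARM: failed-login rate far above baseline; escalate to analysts")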
Vulnerabilities Detection and Management
Vulnerabilities detection and management enables automated verification and
management of the security level of information systems. The service periodically
performs a series of automated tests for the purpose of identifying system weaknesses
that may be exposed over the Internet, including the possibility of unauthorized access
to administrative services, the existence of services that have not been updated, the
detection of vulnerabilities such as phishing, etc. The service performs periodic follow-
up of tasks performed by security professionals managing information systems security
and provides reports that can be used to implement a plan for continuous improvement
of the system’s security level.
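As a hedged, standard-library-only illustration of such automated tests, the sketch below probes whether common administrative services are reachable; the address is an example from a documentation range, and checks like this should only be run against systems you are authorized to test:

    # Sketch of an automated exposure check: are admin services reachable?
    # Only run checks like this against systems you are authorized to test.

    import socket

    ADMIN_PORTS = {22: "SSH", 23: "Telnet", 3389: "RDP"}

    def exposed_admin_services(host: str, timeout: float = 1.0) -> list[str]:
        findings = []
        for port, name in ADMIN_PORTS.items():
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                # connect_ex returns 0 when the TCP connection succeeds.
                if sock.connect_ex((host, port)) == 0:
                    findings.append(f"{name} open on {host}:{port}")
        return findings

    for finding in exposed_admin_services("203.0.113.10"):  # example address
        print("WEAKNESS:", finding)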
Continuous System Patching/Upgrade and Fortification
Security posture is enhanced with continuous system patching and upgrading of systems
and application software. New patches, updates, and service packs for the equipment’s
operating system are necessary to maintain adequate security levels and support new
versions of installed products. Keeping abreast of all the changes to all the software and
hardware requires a committed effort to stay informed and to communicate gaps in
security that can appear in installed systems and applications.
Intervention, Forensics, and Help Desk Services
Quick intervention when a threat is detected is crucial to mitigating the effects of a
threat. This requires security engineers with ample knowledge in the various
technologies and with the ability to support applications as well as infrastructures on a
24/7 basis. MaaS platforms routinely provide this service to their customers. When a
detected threat is analyzed, it often requires forensic analysis to determine what it is,
how much effort it will take to fix the problem, and what effects are likely to be seen.
When problems are encountered, the first thing customers tend to do is pick up the
phone. Help desk services provide assistance on questions or issues about the operation
of running systems. This service includes assistance in writing failure reports, managing
operating problems, etc.
Delivering Business Value
Some consider balancing the overall economic impact of any build-versus-buy decision
as a more significant measure than simply calculating a return on investment (ROI). The
key cost categories that are most often associated with MaaS are:
(1) Service fees for security event monitoring for all firewalls and intrusion detection
devices, servers, and routers;
(2) Internal account maintenance and administration costs; and
(3) Preplanning and development costs. Based on the total cost of ownership, whenever
a customer evaluates the option of an in-house security information monitoring team
and infra-structure compared to outsourcing to a service provider, it does not take long
to realize that establishing and maintaining an in-house capability is not as attractive as
outsourcing the service to a provider with an existing infrastructure. Having an in-house
security operations center forces a company to deal with issues such as staff attrition,
scheduling, around-the-clock operations, etc. Losses incurred from external and internal
incidents are extremely significant, as evidenced by a regular stream of high-profile
cases in the news. The generally accepted method of valuing the risk of losses from
external and internal incidents is to look at the amount of a potential loss, assume
a frequency of loss, and estimate a probability for incurring the loss. Although this
method is not perfect, it provides a means for tracking information security metrics.
Risk is used as a filter to capture uncertainty about varying cost and benefit estimates. If
a risk-adjusted ROI demonstrates a compelling business case, it raises confidence that
the investment is likely to succeed because the risks that threaten the project have been
considered and quantified. Flexibility represents an investment in additional capacity or
agility today that can be turned into future business benefits at some additional cost.
This provides an organization with the ability to engage in future initiatives, but not the
obligation to do so. The value of flexibility is unique to each organization, and
willingness to measure its value varies from company to company.
Real-Time Log Monitoring Enables Compliance
Security monitoring services can also help customers comply with industry regulations
by automating the collection and reporting of specific events of interest, such as log-in
failures. Regulations and industry guidelines often require log monitoring of critical
servers to ensure the integrity of confidential data. MaaS providers’ security monitoring
services automate this time-consuming process.
Communication-as-a-Service (CaaS)
CaaS is an outsourced enterprise communications solution. Providers of this type of
cloud-based solution (known as CaaS vendors) are responsible for the management of
hardware and software required for delivering Voice over IP (VoIP) services, Instant
Messaging (IM), and video conferencing capabilities to their customers. This model
began its evolutionary process from within the telecommunications (Telco) industry, not
unlike how the SaaS model arose from the software delivery services sector. CaaS
vendors are responsible for all of the hardware and software management consumed by
their user base. CaaS vendors typically offer guaranteed quality of service (QoS) under a
service-level agreement (SLA). A CaaS model allows a CaaS provider’s
business customers to selectively deploy communications features and services
throughout their company on a pay-as-you-go basis for service(s) used. CaaS is designed
on a utility-like pricing model that provides users with comprehensive, flexible, and
(usually) simple-to-understand service plans. According to Gartner, the CaaS market is
expected to total $2.3 billion in 2011, representing a compound annual growth rate of
more than 105% for the period. CaaS service offerings are often bundled and may
include integrated access to traditional voice (or VoIP) and data, advanced unified
communications functionality such as video calling, web collaboration, chat, real-time
presence and unified messaging, a handset, local and long-distance voice services, voice
mail, advanced calling features (such as caller ID, three-
way and conference calling, etc.) and advanced PBX functionality.
A CaaS solution includes redundant switching, network, POP and circuit diversity,
customer premises equipment redundancy, and WAN fail-over that specifically
addresses the needs of their customers. All VoIP transport components are located in
geographically diverse, secure data centers for high availability and survivability. CaaS
offers flexibility and scalability that small and medium-sized businesses might not
otherwise be able to afford. CaaS service providers are usually prepared to handle peak
loads for their customers by providing services capable of allowing more capacity,
devices, modes or area coverage as their customer demand necessitates. Network
capacity and feature sets can be changed dynamically, so functionality keeps pace with
consumer demand and provider-owned resources are not wasted. From the service
provider customer’s perspective, there is very little to virtually no risk of the service
becoming obsolete, since the provider’s responsibility is to perform periodic upgrades or
replacements of hardware and software to keep the platform technologically current.
CaaS requires little to no management oversight from customers. It eliminates the
business customer’s need for any capital investment in infrastructure, and it eliminates
expense for ongoing maintenance and operations overhead for infrastructure. With a
CaaS solution, customers are able to leverage enterprise-class communication services
without having to build a premises-based solution of their own. This allows those
customers to reallocate budget and personnel resources to where their business can best
use them.
Advantages of CaaS
From the handset found on each employee’s desk to the PC-based software client on
employee laptops, to the VoIP private backbone, and all modes in between, every
component in a CaaS solution is managed 24/7 by the CaaS vendor. As we said
previously, the expense of managing a carrier-grade data center is shared across the
vendor’s customer base, making it more economical for businesses to implement CaaS
than to build their own VoIP network. Let’s look at some of the advantages of a hosted
approach for CaaS.
Hosted and Managed Solutions
Remote management of infrastructure services provided by third parties once seemed
an unacceptable situation to most companies. However, over the past decade, with
enhanced technology, networking, and software, the attitude has changed. This is, in
part, due to cost savings achieved in using those services. However, unlike the “one-off”
services offered by specialist providers, CaaS delivers a complete communications
solution that is entirely managed by a single vendor. Along with features such as VoIP
and unified communications, the integration of core PBX features with advanced
functionality is managed by one vendor, who is responsible for all of the integration and
delivery of services to users.
Fully Integrated, Enterprise-Class Unified Communications
With CaaS, the vendor provides voice and data access and manages
LAN/WAN, security, routers, email, voice mail, and data storage. By managing the
LAN/WAN, the vendor can guarantee consistent quality of service from a user’s desktop
across the network and back. Advanced unified communications features that are most
often a part of a standard CaaS deployment include:
• Chat
• Multimedia conferencing
• Microsoft Outlook integration
• Real-time presence
• “Soft” phones (software-based telephones)
• Video calling
• Unified messaging and mobility
Providers are constantly offering new enhancements (in both performance and features)
to their CaaS services. The development process and subsequent introduction of new
features in applications is much faster, easier, and more economical than ever before.
This is, in large part, because the service provider is doing work that benefits many end
users across the provider’s scalable platform infrastructure. Because many end users of
the provider’s service ultimately share this cost (which, from their perspective, is
miniscule compared to shouldering the burden alone), services can be offered to
individual customers at a cost that is attractive to them.
No Capital Expenses Needed
When businesses outsource their unified communications needs to a CaaS service
provider, the provider supplies a complete solution that fits the company’s exact needs.
Customers pay a fee (usually billed monthly) for what they use. Customers are not
required to purchase equipment, so there is no capital outlay. Bundled in these types of
services are ongoing maintenance and upgrade costs, which are incurred by the service
provider. The use of CaaS services allows companies the ability to collaborate across any
workspace. Advanced collaboration tools are now used to create high-quality, secure,
adaptive work spaces throughout any organization. This allows a company’s workers,
partners, vendors, and customers to communicate and collaborate more effectively.
Better communication allows organizations to adapt quickly to market changes and to
build competitive advantage. CaaS can also accelerate decision making within an
organization. Innovative unified communications capabilities (such as presence, instant
messaging, and rich media services) help ensure that information quickly reaches
whoever needs it.
Flexible Capacity and Feature Set
When customers outsource communications services to a CaaS provider, they pay for
the features they need when they need them. The service provider can distribute the cost
of services and delivery across a large customer base. As previously stated, this makes the
use of shared feature functionality more economical for customers to implement.
Economies of scale allow service providers enough flexibility that they are not tied to a
single vendor investment. They are able to leverage best-of-breed providers such as
Avaya, Cisco, Juniper, Microsoft, Nortel, and ShoreTel more economically
than any independent enterprise.
No Risk of Obsolescence
Rapid technology advances, predicted long ago and known as Moore’s law,
have brought about product obsolescence in increasingly shorter periods of time.
Moore’s law describes a trend Gordon Moore recognized that has held true since the beginning of the
use of integrated circuits (ICs) in computing hardware. Since the invention of the
integrated circuit in 1958, the number of transistors that can be placed inexpensively on
an integrated circuit has increased exponentially, doubling approximately every two
years. Unlike IC components, the average life cycles for PBXs and key communications
equipment and systems range anywhere from five to 10 years. With the constant
introduction of newer models for all sorts of technology (PCs, cell phones, video
software and hardware, etc.), these types of products now face much shorter life cycles,
sometimes as short as a single year. CaaS vendors must absorb this burden for the user
by continuously upgrading the equipment in their offerings to meet changing demands
in the marketplace.
No Facilities and Engineering Costs Incurred
CaaS providers host all of the equipment needed to provide their services to their
customers, virtually eliminating the need for customers to maintain data center space
and facilities. There is no extra expense for the constant power consumption that such a
facility would demand. Customers receive the benefit of multiple carrier-grade data
centers with full redundancy—and it’s all included in the monthly payment.
Guaranteed Business Continuity
If a catastrophic event occurred at your business’s physical location, would your
company disaster recovery plan allow your business to continue operating without a
break? If your business experienced a serious or extended communications outage, how
long could your company survive? For most businesses, the answer is “not long.”
Distributing risk by using geographically dispersed data centers has become the norm
today. It mitigates risk and allows companies in a location hit by a catastrophic event to
recover as soon as possible. This process is implemented by CaaS providers because
most companies don’t even contemplate voice continuity if catastrophe strikes. Unlike
data continuity, eliminating single points of failure for a voice network is usually cost-
prohibitive because of the large scale and management complexity of the project. With a
CaaS solution, multiple levels of redundancy are built into the system, with no single
point of failure.
Database-as-a-Service (DBaaS)
Database as a service (DBaaS) is a cloud computing service model that provides users
with some form of access to a database without the need for setting up physical
hardware, installing software or configuring for performance. All of the administrative
tasks and maintenance are taken care of by the service provider so that all the user or
application owner needs to do is use the database. Of course, if the customer opts for
more control over the database, this option is available and may vary depending on the
provider. Database-as-a-Service (DBaaS) is the fastest-growing cloud service and a
component of Platform-as-a-Service; it provides dramatic improvements in the
productivity, performance, standardization, and data security of databases.
The term “Database-as-a-Service” (DBaaS) refers to software that enables users to
provision, manage, consume, configure, and operate database software using a common
set of abstractions (primitives), without having to know or care about the exact
implementations of those abstractions for the specific database software.
In other words, a DBaaS user could provision a MySQL database, manage, configure
and operate it using the same set of API calls as he (or she) would use if it were an
Oracle or MongoDB database. The user would be able, for example, to request a backup
of the database using an API call which did the right thing(s) for the database that was
being used. Similarly, the user could request a MySQL cluster or a MongoDB cluster,
and then resize that cluster using the same API call(s), without having to know exactly
how that operation was being performed for each of those database technologies.
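One way to picture this common set of primitives is an abstract interface with engine-specific implementations behind it. The sketch below is illustrative only; the class and method names are invented, not any product's API:

    # Sketch of DBaaS-style abstractions: one API, many database engines.
    # Class and method names are illustrative, not a real product's API.

    from abc import ABC, abstractmethod

    class ManagedDatabase(ABC):
        @abstractmethod
        def backup(self) -> str: ...
        @abstractmethod
        def resize_cluster(self, nodes: int) -> None: ...

    class MySQLService(ManagedDatabase):
        def backup(self) -> str:
            return "mysqldump-based snapshot"  # engine-specific detail, hidden
        def resize_cluster(self, nodes: int) -> None:
            print(f"reconfiguring MySQL replication for {nodes} nodes")

    class MongoDBService(ManagedDatabase):
        def backup(self) -> str:
            return "mongodump-based snapshot"
        def resize_cluster(self, nodes: int) -> None:
            print(f"resizing MongoDB replica set to {nodes} members")

    def nightly_maintenance(db: ManagedDatabase) -> None:
        # The caller neither knows nor cares which engine is underneath.
        print("backup:", db.backup())
        db.resize_cluster(3)

    nightly_maintenance(MySQLService())
    nightly_maintenance(MongoDBService())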
DBaaS is often considered to be a component of a Platform-as-a-Service, the “platform”
in this case being the database (or a number of databases). The DBaaS solution would
consume resources of the underlying Infrastructure-as-a-Service (IaaS), for example
provisioning compute, storage and networking from that IaaS.
DBaaS in the context of other cloud components
It is important to understand that, like other cloud technologies, DBaaS has two
primary consumers:
• The IT organization (which operates the cloud and very often also includes the DBAs)
• The developer (sometimes DevOps, or the end user who uses the cloud resources)
An IT organization deploys DBaaS that enables end users (developers) to provision a
database of their choice from a catalog of supported databases. These could include
popular relational and non-relational databases and the IT organization can configure
the DBaaS to support specific releases of these software titles. The IT organization can
further restrict the configurations that specific users can provision (for example,
developers can provision only servers with a small memory footprint and traditional
disks, while DevOps engineers can provision higher-capacity servers with SSDs). Finally,
the IT organization can set up policies for standard database operations, like backups, to
ensure that the data
is properly saved from time to time to allow for recovery when required.
Typically an end user would access the DBaaS system through a portal that allows him
or her to choose from a number of database titles, and in a variety of different
configuration options. With a few clicks, the requested database is provisioned for
them. The DBaaS system quickly provisions the database and returns a queryable
endpoint like:
mysql://192.168.15.243:3306/
and the application developer can use this in his or her application directly. The DBaaS
system provides simple mechanisms to add users, create databases (schemas) and grant
permissions to different users as required by the application.
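As a sketch of what the developer does next, the following PyMySQL session connects to a provisioned endpoint, creates a database (schema), and grants a user permissions; the host, credentials, and names are placeholders that a real DBaaS portal would supply:

    # Sketch of using a provisioned DBaaS endpoint (PyMySQL client).
    # Host, credentials, and names are placeholders from a hypothetical portal.

    import pymysql

    conn = pymysql.connect(
        host="192.168.15.243", port=3306,
        user="admin", password="change-me",
    )
    with conn.cursor() as cur:
        # Create a database (schema) for the application...
        cur.execute("CREATE DATABASE IF NOT EXISTS orders_app")
        # ...add an application user and grant it permissions.
        cur.execute("CREATE USER IF NOT EXISTS 'app'@'%' IDENTIFIED BY 's3cret'")
        cur.execute("GRANT SELECT, INSERT, UPDATE ON orders_app.* TO 'app'@'%'")
    conn.commit()
    conn.close()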
The benefits of DBaaS
A DBaaS solution provides an organization a number of benefits, the chief among them
being:
• Developer agility
• DBA productivity
• Application reliability, performance and security
We now examine each of these in turn.
Developer agility
When a developer wishes to provision a database, the steps involved include
provisioning compute, storage and networking components, configuring them properly
and then installing database software. Finally, the database software must be configured
properly to utilize the underlying infrastructure components.
This multi-step process leaves many opportunities for errors, omissions and non-
standard modes of operation. When the thing that is being provisioned (a database) is
the “system of record”, this is unacceptable.
The IT organization in configuring the DBaaS establishes the standards by which
databases will be provisioned. By standardizing the provisioning model, DBaaS ensures
that a database can be provisioned in a single operation, and that databases are
provisioned in a consistent way, and in a manner that is aligned with the best practices
for that particular database and business.
Once in operation, complex database operations like resizing a cluster become a simple
API call, and the developer need not be concerned with the minutiae of how this
operation should be performed for the specific database and version. The
abstraction provided by the DBaaS handles all of that and allows the developer to focus
his or her energy on the application rather than the underlying database.
Finally, the activities of a developer are often iterative and involve spinning up, using,
and then destroying database servers. Abstractions in the DBaaS allow for the final step
in this process to be automated as well, securely erasing all storage used for the database
and ensuring that all the resources are released and that data integrity is preserved at all
times.
DBA productivity
When an enterprise operates hundreds of instances of many different databases,
considerable resources are consumed on maintenance and upkeep. This includes things
like tuning, configuration, patching, periodic backups, and so on; all the things that
DBAs have to do to keep databases in proper working order.
DBaaS solutions provide abstractions that allow DBAs to manage groups of databases
and perform operations like upgrades and configuration changes on a fleet of databases
in a simplified way. This frees up the DBAs to focus more on activities like establishing
the standards of operation for the enterprise and verifying that they have the best tools
available for themselves and the developers who they serve.
Application reliability, performance and security
Databases are often the “system of record” and are the repository of valuable
information in the organization. A database outage could have catastrophic impact.
Through automation and standardization, DBaaS ensures that all common workflows
involved in the provisioning, configuration, management, and operation of databases
are consistent.
Through this standardization, a DBaaS ensures that all databases are operated in the
same way, and in keeping with the best practices established by the IT organization.
This frees up the developer and the DBA to work on more important things like the
application and innovation rather than the boring minutiae of running a database.
It is important to realize that most enterprises today operate applications that require
many different database technologies, a departure from recent years where the
‘corporate standard’ mandated a single database solution for all application needs. With
this diversity in database technologies, DBaaS solutions allow IT organizations to ensure
application reliability, performance and data security no matter what database solution
is in use, without requiring that the IT organization or the developer team have deep
knowledge of the finer points of each of the technologies. DBaaS solutions encapsulate
those best practices and codify the proper way(s) to deploy, manage and operate all of
the different technologies thereby freeing up the DBAs and developers from these
chores.
Comparison of some DBaaS solutions
The most widely used DBaaS in the market today is Amazon Relational Database Service
(RDS). RDS provides support for a number of databases including MySQL, MariaDB,
Oracle, PostgreSQL and SQL Server. In addition, Amazon also provides Aurora and
DynamoDB. Aurora is a scalable relational database compatible with MySQL or
PostgreSQL while DynamoDB is a scalable NoSQL database.
Microsoft offers SQL Database as part of the Azure Cloud platform, and Google offers
Cloud SQL, a fully managed MySQL database service.
With the exception of DynamoDB, all of these are DBaaS solutions that provide
management abstractions but no data API. The application that uses the database
interacts directly with the managed database in these cases. In DynamoDB, however, the
service offers a data API as well.
In the OpenStack ecosystem, the Trove project offers a DBaaS that supports a number of
relational and non-relational database packages including most commonly used FOSS
databases.
The value of a DBaaS comes through the standardization of the abstractions, and
through the common API. Since the most widely used Cloud API in the world today is
Amazon’s AWS API, there is considerable value in implementing a solution that exposes
its services using that same API. This is the approach that Stratoscale Symphony has
adopted. Symphony exposes the same APIs defined by AWS and allows you to
provision an AWS region and RDS-compatible DBaaS in your own data center.
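For a flavor of that AWS-style API, the sketch below provisions a small managed MySQL instance through the RDS API using boto3; the identifiers and credentials are placeholders, and an RDS-compatible endpoint would accept the same call shape:

    # Sketch of provisioning a managed MySQL database via the RDS API (boto3).
    # Identifiers and credentials are placeholders.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.create_db_instance(
        DBInstanceIdentifier="orders-db",
        DBInstanceClass="db.t3.micro",
        Engine="mysql",
        MasterUsername="admin",
        MasterUserPassword="change-me",
        AllocatedStorage=20,  # GB
    )

    # Poll until the endpoint is ready, then hand it to the application.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="orders-db")
    info = rds.describe_db_instances(DBInstanceIdentifier="orders-db")
    print(info["DBInstances"][0]["Endpoint"]["Address"])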
SERVICE PROVIDERS
Google App Engine
Google App Engine enables developers to build their web apps on the same
infrastructure that powers Google’s own applications.
Features
Leveraging Google App Engine, developers can accomplish the following tasks:
• Write code once and deploy. Provisioning and configuring multiple machines for web
serving and data storage can be expensive and time-consuming. Google App Engine
makes it easier to deploy web applications by dynamically providing computing
resources as they are needed. Developers write the code, and Google App Engine takes
care of the rest.
• Absorb spikes in traffic. When a web app surges in popularity, the sudden increase in
traffic can be overwhelming for applications of all sizes, from startups to large
companies that find themselves re-architecting their databases and entire systems
several times a year. With automatic replication and load balancing, Google App Engine
makes it easier to scale from one user to one million by taking advantage of Bigtable and
other components of Google’s scalable infrastructure.
• Easily integrate with other Google services. It’s unnecessary and inefficient for
developers to write components like authentication and email from scratch for each new
application. Developers using Google App Engine can make use of built-in components
and Google’s broader library of APIs that provide plug-and-play functionality for simple
but important features.
“Google has spent years developing infrastructure for scalable web applications,” said
Pete Koomen, a product manager at Google. “We’ve brought Gmail and Google search to
hundreds of millions of people worldwide, and we’ve built out a powerful network of
datacenters to support those applications. Today we’re taking the first step in making
this infrastructure available to all developers.”
Cost
Google enticed developers by offering App Engine for free when it launched, but after a
few months slapped on some fees. As of this writing, developers using Google App
Engine can expect to pay:
• Free quota to get started: 500MB storage and enough CPU and bandwidth for about 5
million pageviews per month
• $0.10–$0.12 per CPU core-hour
• $0.15–$0.18 per GB-month of storage
• $0.11–$0.13 per GB of outgoing bandwidth
• $0.09–$0.11 per GB of incoming bandwidth
In response to developer feedback, Google App Engine will provide new APIs. The
image-manipulation API enables developers to scale, rotate, and crop images on the
server. The memcache API is a high-performance caching layer designed to make page
rendering faster for developers. More information about Google App Engine is available
at http://code.google.com/appengine/
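To illustrate the caching layer just mentioned, here is a short sketch against the legacy Python App Engine memcache API; the key and the render_page() helper are invented, and newer App Engine runtimes use external cache services instead:

    # Sketch of page caching with the legacy App Engine memcache API (Python).
    # Key name and render_page() are invented for the example.

    from google.appengine.api import memcache

    def render_page() -> str:
        return "<html>...expensive page...</html>"

    def get_page() -> str:
        page = memcache.get("home_page")      # fast path: cache hit
        if page is None:
            page = render_page()              # slow path: rebuild the page
            memcache.set("home_page", page, time=60)  # cache for 60 seconds
        return page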
Amazon Elastic Compute Cloud (Amazon EC2)
Amazon may be the most widely known cloud vendor. They offer services on many
different fronts, from storage to platform to databases. Amazon seems to have their
finger in a number of cloud technologies.
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that offers resizable
compute capacity in the cloud and is designed to make web-scale computing easier for developers.
Amazon EC2 provides a simple web interface that allows you to obtain and configure
capacity with little difficulty. It gives you control of your computing resources. Amazon
EC2 cuts the time it takes to obtain and boot new server instances to a few minutes,
allowing you to change scale as your needs change. For instance, Amazon EC2 can run
Microsoft Windows Server 2003 and is a way to deploy applications using the Microsoft
Web Platform, including ASP.NET, ASP.NET AJAX, Silverlight, and Internet
Information Server (IIS).
Amazon EC2 allows you to run Windows-based applications on Amazon’s cloud
computing platform. This might be web sites, web-service hosting, high-performance
computing, data processing, media transcoding, ASP.NET application hosting, or any
other application requiring Windows software. EC2 also supports SQL Server Express
and SQL Server Standard and makes those offerings available to customers on an hourly
basis.
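As a rough illustration of how a developer might obtain capacity programmatically, the
sketch below uses the boto Python library (from the same era as EC2’s early Windows
support) to launch an instance. The AMI ID, key pair name, and security group are
placeholders rather than real resources, and credentials are assumed to be configured
in the environment.

# Launch a single instance on Amazon EC2 using the boto (version 2) API.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')   # reads AWS credentials
                                                 # from env/config
reservation = conn.run_instances(
    'ami-00000000',                  # placeholder Windows Server AMI ID
    instance_type='m1.small',
    key_name='my-keypair',           # placeholder key pair
    security_groups=['my-group'],    # placeholder security group
)
instance = reservation.instances[0]
print('Launched instance %s' % instance.id)

Terminating the instance when it is no longer needed (conn.terminate_instances) is
what keeps the hourly billing model economical.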
MICROSOFT AZURE
Microsoft offers a number of cloud services for organizations of any size—from
enterprises all the way down to mom-and-pop shops or individuals. A good portion of
Microsoft’s cloud offerings are cloud variants of products that people already use, so the
cloud versions are not difficult to adopt.
Azure Services Platform
The cornerstone of Microsoft’s offerings is the Azure Services Platform. The Azure
Services Platform is a cloud computing and services platform hosted in Microsoft
datacenters. The Azure Services Platform supplies a broad range of functionality to build
applications to serve individuals or large enterprises, and everyone in between. The
platform offers a cloud operating system and developer tools. Applications can be
developed with industry standard protocols like REST and SOAP. Azure services can be
used individually or in conjunction with one another to build new applications or to
enhance existing ones. Let’s take a closer look at the Azure Services Platform
components.
Windows Azure
Windows Azure is a cloud-based operating system that provides the development,
hosting, and service management environment for the Azure Services Platform.
Windows Azure gives developers an on-demand compute and storage environment that
they can use to host, scale, and manage web applications through Microsoft datacenters.
To build applications and services, developers can use the Visual Studio skills they
already have. Further, Azure supports existing standards like SOAP, REST, and XML.
Windows Azure can be used to
• Add web service capabilities to existing applications
• Build and modify applications and then move them onto the Web
• Make, test, debug, and distribute web services efficiently and inexpensively
• Reduce the costs of IT management
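Because the platform speaks industry-standard protocols, clients on any stack can
reach it with ordinary HTTP. As a hedged sketch, the snippet below reads a hypothetical
publicly readable blob over plain REST; the storage account and container names are
invented, and an authenticated request would additionally carry a signed Authorization
header, which is omitted here.

# Fetch a publicly readable blob from Azure storage over plain REST.
import requests

url = 'https://myaccount.blob.core.windows.net/mycontainer/hello.txt'
resp = requests.get(url)       # a simple HTTP GET; no SDK required
resp.raise_for_status()
print(resp.text)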
SALESFORCE.COM
Salesforce.com made its name with the success of its flagship sales force automation
application. Today, the company has three primary areas of focus:
• The Sales Cloud: The popular cloud computing sales application
• The Service Cloud: The platform for customer service that lets companies tap into the
power of customer conversations no matter where they take place
• Your Cloud: Powerful capabilities to develop custom applications on its cloud
computing platform, Force.com
The company has made its platform available to other companies as a place to build and
deploy their software services. Force.com offers
• A relational database
• User interface options
• Business logic
• Apex, an integrated programming language
• Workflow and approvals engine
• Programmable interface
• Automatic mobile device deployment
• Web services integration
• Reporting and analytics
Using Apex, programmers can test their applications in Force.com’s Sandboxes and
then offer the finalized code on Salesforce.com’s site. Developers initially used
Force.com to create add-ons to the Salesforce CRM, but now it is possible to develop
applications that are unrelated to Salesforce.com’s offerings. For instance, gaming giant
Electronic Arts created an employee-recruiting application on Force.com and software
vendor Coda made a general ledger application. Meanwhile, Salesforce.com promotes its
own applications, which are used by more than 1.1 million people. Salesforce.com offers
other cloud services as well.
In April 2007 it moved into enterprise content management with Salesforce.com
Content. This makes it possible to store, classify, and share information in a manner
similar to Microsoft SharePoint. The company employs a multitenant architecture,
similar to Google, Amazon, and eBay. As such, servers and other resources are shared by
customers, rather than dedicated to a single account. This allows for better performance, better
scalability, better security, and faster innovation through automatic upgrades.
Multitenancy also allows apps to be elastic—they can scale up to tens of thousands of
users, or down to just a few—always something to consider when moving to cloud-based
solutions. As with other providers, upgrades are taken care of by Salesforce.com for
their customers, so apps get security and performance enhancements automatically.
Because the company generates all its income based on cloud computing,
Salesforce.com is a good bellwether for assessing the growth rate of the application side
of cloud computing. Salesforce.com’s revenue grew to US$290 million in the quarter
ending January 31, 2009— a 34 percent increase year-over-year.
Force.com
Force.com is Salesforce.com’s on-demand cloud computing platform—billed by
Salesforce.com as the world’s first PaaS. Force.com features Visualforce, a technology
that makes it much simpler for end customers, developers, and independent software
vendors (ISVs) to design almost any type of cloud application for a wide range of uses.
The Force.com platform offers global infrastructure and services for database, logic,
workflow, integration, user interface, and application exchange. Visualforce is
essentially a framework for creating new interface designs and enables user interactions
that can be built and delivered with no software or hardware infrastructure
requirements.
PaaS
Force.com delivers PaaS, a way to create and deploy business apps that allows
companies and developers to focus on what their applications do, rather than the
software and infrastructure to run them. The Force.com platform can run multiple
applications within the same Salesforce.com instance, allowing all of a company’s
Salesforce.com applications to share a common security model, data model, and user
interface. This is a major benefit found in cloud computing solutions. Add to that an on-
demand operating system, the ability to create any database on demand, a workflow
engine for managing collaboration between users, and a programming language for
building complex logic. A web services API for programmatic access, mash-ups, and
integration with other applications and data is another key feature.
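As a hedged sketch of that programmatic interface, the snippet below issues a SOQL
query through the Force.com REST endpoint from Python. The instance URL, API
version, and access token are placeholders; obtaining the token (for example, via
OAuth) is not shown.

# Query Force.com data through its web services (REST) API.
import requests

INSTANCE = 'https://na1.salesforce.com'   # placeholder instance URL
TOKEN = '<access-token>'                  # placeholder OAuth access token

resp = requests.get(
    INSTANCE + '/services/data/v20.0/query',
    params={'q': 'SELECT Name FROM Account LIMIT 5'},
    headers={'Authorization': 'Bearer ' + TOKEN},
)
resp.raise_for_status()
for record in resp.json()['records']:
    print(record['Name'])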
Visualforce
As part of the Force.com platform, Visualforce provides the ability to design application
user interfaces for practically any experience on any screen. Visualforce uses HTML,
AJAX, and Flex for business applications. Visualforce provides a page-based model,
built on standard HTML and web presentation technologies, and is complemented with
both a component library for implementing common user interface elements, and a
controller model for creating new interactions between those elements. Visualforce
features and capabilities include
• Pages: Enables the design definition of an application’s user interface.
• Components: Provides the ability to create new applications that automatically
match the look and feel of Salesforce.com applications or easily customize and extend
the Salesforce.com user interface to specific requirements.
• Logic Controllers: The controller enables customers to build any user interface
behavior.
Salesforce.com CRM
Salesforce.com is a leader in cloud computing customer relationship management
(CRM) applications. Its CRM offering consists of the Sales Cloud and the Service Cloud
and can be broken down into five core applications:
• Sales: Easily the most popular cloud computing sales application. Salesforce.com says
that CRM Sales is used by more than 1.1 million customers around the world. Its claim
to fame is that it is comprehensive and easy to customize. Its value proposition is that it
empowers companies to manage people and processes more effectively, so reps can
spend more time selling and less time on administrative tasks.
• Marketing: With Salesforce.com CRM Marketing, marketers can put the latest web
technologies to work building pipeline while collaborating seamlessly with their sales
organization. The application empowers customers to manage multichannel campaigns
and provide up-to-date messaging to sales. And since the application is integrated with
the Salesforce.com CRM Sales application, the handoff of leads is automated.
• Service: The Service Cloud is the new platform for customer service. Companies can
tap into the power of customer conversations no matter where they take place. Because
it’s on the Web, the Service Cloud allows companies to instantly connect to collaborate
in real time, share sales information, and follow joint processes. Connecting with
partners is made to be as easy as connecting with people on LinkedIn: companies
instantly share leads, opportunities, accounts, contacts, and tasks with their partners.
• Collaboration: Salesforce.com CRM can help an organization work more efficiently
with customers, partners, and employees by allowing them to collaborate among
themselves in the cloud. Some of the capabilities include
• Create and share content in real time using Google Apps and Salesforce.com
• Track and deliver presentations using Content Library
• Give your community a voice using Ideas and Facebook
• Tap into the collective wisdom of the sales team with Genius
• Analytics: Force.com offers real-time reporting, calculations, and dashboards so a
business is better able to optimize performance, decision making, and resource
allocation.
• Custom Applications: Custom applications can be quickly created by leveraging one
data model, one sharing model, and one user interface.
MapReduce
What does MapReduce mean?
MapReduce is a programming model introduced by Google for processing and
generating large data sets on clusters of computers. Google first formulated the
framework for the purpose of serving Google’s Web page indexing, and the new
framework replaced earlier indexing algorithms. Even beginner developers find the
MapReduce framework beneficial because library routines can be used to create parallel
programs without any worries about intra-cluster communication, task monitoring, or
failure-handling processes.
How MapReduce Works
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as an input and combines those
data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job. A MapReduce job passes
through the following phases (a small worked sketch appears after this list):
• Input Phase − Here we have a Record Reader that translates each record in an input
file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as
input and applies a user-defined code to aggregate the values in a small scope of one
mapper. It is not a part of the main MapReduce algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key into a larger data list. The data
list groups the equivalent keys together so that their values can be iterated easily in the
Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each group. Here, the data can be aggregated, filtered, and
combined in a number of ways, which can require a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
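The pure-Python sketch below walks word count through the phases just described. It
runs on one machine purely for illustration (the sample records and split boundaries
are our own); a real framework executes the same steps in parallel across a cluster.

# Word count traced through the MapReduce phases on a single machine.
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Input phase: the record reader hands records (lines) to each mapper.
splits = [['the quick brown fox', 'the lazy dog'],   # mapper 1's split
          ['the quick dog']]                         # mapper 2's split

def mapper(record):
    # Map: emit zero or more intermediate (key, value) pairs per record.
    for word in record.split():
        yield (word, 1)

def combiner(pairs):
    # Combiner (optional): locally pre-aggregate one mapper's output.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

map_outputs = []
for split in splits:
    pairs = [p for record in split for p in mapper(record)]
    map_outputs.extend(combiner(pairs))

# Shuffle and sort: collect all intermediate pairs, ordered by key.
map_outputs.sort(key=itemgetter(0))

# Reduce: fold each key's grouped values into a final pair; the output
# phase's record writer would then persist these results.
for word, group in groupby(map_outputs, key=itemgetter(0)):
    print(word, sum(value for _, value in group))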
Each day, numerous MapReduce programs and MapReduce jobs are executed on Google's
clusters. Programs are automatically parallelized and executed on a large cluster of commodity
machines. The runtime system deals with partitioning the input data, scheduling the program's
execution across a set of machines, machine failure handling and managing required
intermachine communication. Programmers without any experience with parallel and distributed
systems can easily use the resources of a large distributed system.
GOOGLE FILE SYSTEM (GFS)
Google File System (GFS) is a scalable distributed file system (DFS) created by Google Inc. and
developed to accommodate Google’s expanding data processing requirements. GFS provides
fault tolerance, reliability, scalability, availability and performance to large networks and
connected nodes. GFS is made up of several storage systems built from low-cost commodity
hardware components. It is optimized to accommodate Google's different data use and storage
needs, such as its search engine, which generates huge amounts of data that must be stored.
The Google File System capitalized on the strength of off-the-shelf servers while minimizing
hardware weaknesses. GFS is also known as GoogleFS.
The GFS node cluster is a single master with multiple chunk servers that are continuously
accessed by different client systems. Chunk servers store data as Linux files on local disks.
Stored data is divided into large chunks (64 MB), which are replicated in the network a minimum
of three times. The large chunk size reduces network overhead.
GFS is designed to accommodate Google’s large cluster requirements without burdening
applications. Files are stored in hierarchical directories identified by path names. Metadata - such
as namespace, access control data, and mapping information - is controlled by the master, which
interacts with and monitors the status updates of each chunk server through timed heartbeat
messages.
GFS features include:
• Fault tolerance
• Critical data replication
• Automatic and efficient data recovery
• High aggregate throughput
• Reduced client and master interaction because of the large chunk size
• Namespace management and locking
• High availability
The largest GFS clusters have more than 1,000 nodes with 300 TB disk storage capacity. This
can be accessed by hundreds of clients on a continuous basis.
General architecture of Google File System
GFS is organized as clusters of computers. A cluster is simply a network of computers, and
each cluster might contain hundreds or even thousands of machines. In each GFS cluster
there are three main entities:
1. Clients
2. Master servers
3. Chunk servers
Clients can be other computers or computer applications that make file requests. Requests can range from retrieving and manipulating existing files to creating new files on the system. Clients can be thought of as the customers of the GFS.
The master server is the coordinator for the cluster. Its tasks include:
1. Maintaining an operation log that keeps track of the activities of the cluster. The operation log helps keep service interruptions to a minimum: if the master server crashes, a replacement server that has monitored the operation log can take its place.
2. Keeping track of metadata, which is the information that describes chunks. The metadata tells the master server to which files the chunks belong and where they fit within the overall file.
Chunk Servers are the workhorses of the GFS. They store 64-MB file chunks. The chunk servers don't send chunks to the master server. Instead, they send requested chunks directly to the client. The GFS copies every chunk multiple times and stores it on different chunk servers. Each copy is called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired.
Managing the load on the single master in Google File System
Having a single master enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, the involvement of master in reads and writes must be minimized so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunk servers it should contact. It caches this information for a limited time and interacts with the chunk servers directly for many subsequent operations.
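The toy sketch below illustrates that pattern: the client consults the master only for
chunk locations, caches the reply for a limited time, and streams data from the chunk
servers directly. The class names, TTL value, and server names are all invented for
illustration.

# Why the single master is not a bottleneck: clients cache chunk locations.
import time

class Master:
    def __init__(self, locations):
        self.locations = locations        # (file, chunk_index) -> servers

    def lookup(self, filename, chunk_index):
        return self.locations[(filename, chunk_index)]

class Client:
    CACHE_TTL = 60.0                      # seconds; illustrative value

    def __init__(self, master):
        self.master = master
        self.cache = {}                   # key -> (servers, expiry time)

    def chunk_servers(self, filename, chunk_index):
        key = (filename, chunk_index)
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]               # cache hit: master not contacted
        servers = self.master.lookup(filename, chunk_index)
        self.cache[key] = (servers, time.time() + self.CACHE_TTL)
        return servers                    # data I/O now goes to these servers

master = Master({('logs/day1', 0): ['chunkserver-3', 'chunkserver-7']})
client = Client(master)
print(client.chunk_servers('logs/day1', 0))  # asks the master once
print(client.chunk_servers('logs/day1', 0))  # answered from the cache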
General scenario of client request handling by GFS
File requests follow a standard work flow. A read request is simple; the client sends a request to the master server to find out where the client can find a particular file on the system. The server responds with the location for the primary replica of the respective chunk. The primary replica holds a lease from the master server for the chunk in question.
If no replica currently holds a lease, the master server designates a chunk as the primary. It does this by comparing the IP address of the client to the addresses of the chunk servers containing the replicas. The master server chooses the chunk server closest to the client. That chunk server's chunk becomes the primary. The client then contacts the appropriate chunk server directly, which sends the replica to the client.
Write requests are a little more complicated. The client still sends a request to the master server, which replies with the location of the primary and secondary replicas. The client stores this information in a memory cache. That way, if the client needs to refer to the same
replica later on, it can bypass the master server. If the primary replica becomes unavailable or the replica changes then the client will have to consult the master server again before contacting a chunk server.
The client then sends the write data to all the replicas, starting with the closest replica and ending with the furthest one. It doesn't matter if the closest replica is a primary or secondary. Google compares this data delivery method to a pipeline.
Once the replicas receive the data, the primary replica begins to assign consecutive serial numbers to each change to the file. Changes are called mutations. The serial numbers instruct the replicas on how to order each mutation. The primary then applies the mutations in sequential order to its own data. Then it sends a write request to the secondary replicas, which follow the same application process. If everything works as it should, all the replicas across the cluster incorporate the new data. The secondary replicas report back to the primary once the application process is over.
At that time, the primary replica reports back to the client. If the process was successful, it ends here. If not, the primary replica tells the client what happened. For example, if one secondary replica failed to update with a particular mutation, the primary replica notifies the client and retries the mutation application several more times. If the secondary replica doesn't update correctly, the primary replica tells the secondary replica to start over from the beginning of the write process. If that doesn't work, the master server will identify the affected replica as garbage.
Advantages and disadvantages of large sized chunks in Google File System
Chunk size is one of the key design parameters. In GFS it is 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunk server and is extended only as needed.
Advantages
1. It reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
2. Since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunk server over an extended period of time.
3. It reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages.
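A quick back-of-envelope calculation shows why advantage 3 matters. The figure of
roughly 64 bytes of metadata per chunk is an assumption adopted here for illustration.

# Estimate the master's in-memory metadata for a petabyte of file data.
TB = 10**12
data_bytes = 1000 * TB               # one petabyte of stored file data
chunk_size = 64 * 2**20              # 64 MB chunks
meta_per_chunk = 64                  # assumed bytes of metadata per chunk

chunks = data_bytes // chunk_size
print('chunks: %d' % chunks)                                    # ~15 million
print('metadata: %.2f GB' % (chunks * meta_per_chunk / 2**30))  # ~0.9 GB

Under these assumptions, even a petabyte of data needs under a gigabyte of chunk
metadata, which comfortably fits in the master's memory.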
Disadvantages
1. Internal fragmentation: a small file could leave most of a large chunk unused, although in practice lazy space allocation avoids wasting disk space due to internal fragmentation.
2. Even with lazy space allocation, a small file consists of a small number of chunks, perhaps just one. The chunk servers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because the applications mostly read large multi-chunk files sequentially. To mitigate hot spots, such files can be stored with a higher replication factor, and clients can be allowed to read from other clients.
HDFS (Hadoop Distributed File System)
HDFS is a distributed file system allowing multiple files to be stored and retrieved at the same time at high speed. It is one of the basic components of the Hadoop framework. Hadoop File System was developed using distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant even though it is built from low-cost hardware. HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes data available to applications for parallel processing.
HDFS is a key part of the many Hadoop ecosystem technologies, as it provides a
reliable means for managing pools of big data and supporting related big data
analytics applications.
How HDFS works
HDFS supports the rapid transfer of data between compute nodes. At its outset, it
was closely coupled with MapReduce, a programmatic framework for data
processing.
When HDFS takes in data, it breaks the information down into separate blocks and
distributes them to different nodes in a cluster, thus enabling highly efficient parallel
processing.
Moreover, the Hadoop Distributed File System is specially designed to be
highly fault-tolerant. The file system replicates, or copies, each piece of data multiple
times and distributes the copies to individual nodes, placing at least one copy on a
different server rack than the others. As a result, the data on nodes that crash can
be found elsewhere within a cluster. This ensures that processing can continue
while data is recovered.
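The sketch below mimics that ingest behavior: the data is cut into fixed-size blocks,
and each block's replicas are assigned so that at least one copy lands on a different
rack. The node names, cluster layout, and simplified placement rule are our own;
HDFS's real placement logic is considerably more involved.

# Split data into blocks and assign rack-aware replica locations.
BLOCK_SIZE = 64 * 2**20              # 64 MB, the classic HDFS default

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, nodes_by_rack, replication=3):
    # Pick `replication` nodes spanning at least two racks.
    racks = sorted(nodes_by_rack)
    first = racks[block_id % len(racks)]         # rack for the first copy
    other = racks[(block_id + 1) % len(racks)]   # a different rack
    targets = [nodes_by_rack[first][0],
               nodes_by_rack[other][0],
               nodes_by_rack[other][1]]
    return targets[:replication]

cluster = {'rack1': ['node1', 'node2'], 'rack2': ['node3', 'node4']}
demo = b'x' * 150                    # scaled down: pretend bytes are MB
for i, block in enumerate(split_into_blocks(demo, block_size=64)):
    print('block %d (%d "MB") -> %s' % (i, len(block),
                                        place_replicas(i, cluster)))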
HDFS uses a master/slave architecture. In its initial incarnation, each Hadoop
cluster consisted of a single NameNode that managed file system operations and
supporting DataNodes that managed data storage on individual compute nodes. The
HDFS elements combine to support applications with large data sets.
This master node "data chunking" architecture takes as its design guides elements
from the Google File System (GFS), a proprietary file system outlined in Google
technical papers, as well as IBM's General Parallel File System (GPFS), a format
that boosts I/O by striping blocks of data over multiple disks, writing blocks in
parallel. While HDFS is not Portable Operating System Interface (POSIX) compliant, it
echoes POSIX design style in some aspects.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check the status
of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is commodity hardware running the GNU/Linux operating
system and the namenode software. The system hosting the namenode acts as the
master server, and it does the following tasks −
• Manages the file system namespace.
• Regulates client’s access to files.
• Executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating system
and the datanode software. For every node (commodity hardware/system) in a
cluster, there will be a datanode. These nodes manage the data storage of their
system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is
divided into one or more segments, which are stored in individual data nodes. These file
segments are called blocks. In other words, the minimum amount of data that HDFS can
read or write is called a block. The default block size is 64 MB, but it can be increased as
needed by changing the HDFS configuration.
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of
commodity hardware components, failure of components is frequent. Therefore HDFS should
have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data − A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, this reduces network traffic and increases throughput.
Hadoop Framework
Hadoop is an Apache Software Foundation project that processes large volumes of data. It is a
Big Data technology that stores and processes really huge amounts of data by distributing the
data to different nodes.
Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming
models. The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from single server to thousands of machines, each offering local computation and
storage.
Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the center of a
growing ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine
learning applications. Hadoop can handle various forms of structured and unstructured data,
giving users more flexibility for collecting, processing and analyzing data than relational
databases and data warehouses provide.
Hadoop and big data
Hadoop runs on clusters of commodity servers and can scale up to support thousands of
hardware nodes and massive amounts of data. It uses a namesake distributed file
system that's designed to provide rapid data access across the nodes in a cluster, plus fault-
tolerant capabilities so applications can continue to run if individual nodes fail.
Consequently, Hadoop became a foundational data management platform for big
data analytics uses after it emerged in the mid-2000s.
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to
support processing in the Nutch open source search engine and web crawler. After Google
published technical papers detailing its Google File System (GFS)
and MapReduce programming framework in 2003 and 2004, respectively, Cutting and
Cafarella modified earlier technology plans and developed a Java-based MapReduce
implementation and a file system modeled on Google's.
In early 2006, those elements were split off from Nutch and became a separate Apache
subproject, which Cutting named Hadoop after his son's stuffed elephant. At the same time,
Cutting was hired by internet services company Yahoo, which became the first production
user of Hadoop later in 2006. (Cafarella, then a graduate student, went on to become a
university professor.)
Use of the framework grew over the next few years, and three independent Hadoop vendors
were founded: Cloudera in 2008, MapR a year later and Hortonworks as a Yahoo spinoff in
2011. In addition, AWS launched a Hadoop cloud service called Elastic MapReduce in 2009.
That was all before Apache released Hadoop 1.0.0, which became available in December
2011 after a succession of 0.x releases.
Hadoop Architecture
At its core, Hadoop has two major layers, namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
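To show how these two layers meet in practice, the sketch below revisits the word-count
example in the Hadoop Streaming style, in which the MapReduce layer runs ordinary
programs that read records from standard input and emit tab-separated key-value lines
on standard output, while HDFS supplies the input splits. The --mode flag is our own
convention for packing both roles into one file; the streaming contract itself
(tab-separated pairs, reducer input sorted by key) is standard.

# Word-count mapper and reducer written in the Hadoop Streaming style.
import sys

def map_stream(lines):
    # Mapper: emit one "<word>\t1" line per word seen.
    for line in lines:
        for word in line.split():
            print('%s\t1' % word)

def reduce_stream(lines):
    # Reducer: input arrives sorted by key, so counts accumulate until
    # the key changes.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print('%s\t%d' % (current, total))

if __name__ == '__main__':
    if sys.argv[1:] == ['--mode', 'map']:
        map_stream(sys.stdin)
    else:
        reduce_stream(sys.stdin)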