Visvesvaraya Technological University “Jnana Sangama”, Santhibastawad Road, Machhe, Belgaum-14
2016-2017
A Dissertation Report On
“Big Data Analytics Framework to identify
Agriculture/Aquaculture Diseases and recommendation
of a solution.”
Submitted in partial fulfillment of the requirements for the award of degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
by
ANANYA. R (1IC13CS002)
KIRAN VASISHTA. T. S (1IC13CS009)
SHILPARANI. J (1IC13CS027)
Under the guidance of
Mrs. Rekha M. S
Assistant Professor
Department of CSE
ICEAS, Bangalore
Department of Computer Science and Engineering
IMPACT COLLEGE OF ENGINEERING AND APPLIED SCIENCE
SAHAKAR NAGAR, BANGALORE-560092
2016-2017
IMPACT COLLEGE OF ENGINEERING AND APPLIED SCIENCE
SAHAKAR NAGAR, BANGALORE-560092
CERTIFICATE
This is to certify that the project entitled “Big Data Analytics Framework to
identify Agriculture/Aquaculture Diseases and recommendation of a solution” is a
bonafide work carried out by Ms. ANANYA R (1IC13CS002), Mr. KIRAN
VASISHTA T S (1IC13CS009) and Ms. SHILPARANI J (1IC13CS027) in partial
fulfillment for the award of Bachelor of Engineering in Computer Science and
Engineering of Visvesvaraya Technological University, Belgaum during the year
2016-2017. The project report has been approved as it satisfies the academic
requirements in respect of the project work prescribed for the said degree.
Signature of Internal Guide Signature of HOD Signature of Principal
Mrs. Rekha M. S Mrs. Neenu Rana Dr. Narayan Singh
Assistant Professor Professor & HOD Principal
Department of CSE Department of CSE ICEAS, Bangalore
ICEAS, Bangalore. ICEAS, Bangalore.
Internal Examiner External Examiner
Name: __________________ Name: __________________
Signature: _______________ Signature: _______________
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ACKNOWLEDGEMENTS
It gives me immense pleasure to convey my sincere thanks to all those intellectuals
who have supported me, with their guidance and encouragement, in submitting my
project titled “Big Data Analytics Framework to identify Agriculture/Aquaculture
Diseases and recommendation of a solution.”
I wish to express my gratitude to Dr. NARAYAN SINGH, PRINCIPAL for his
encouragement and also for providing all the facilities for accomplishing this project.
I extend my sincere gratitude to Mrs NEENU RANA, Head of Department of Computer
Science and Engineering for her support and advice.
I also extend my sincere gratitude to Mrs. REKHA M. S, Assistant Professor,
Department of Computer Science and Engineering, for her guidance, support and
continuous encouragement.
I express my gratitude to the management of IMPACT COLLEGE OF ENGINEERING
AND APPLIED SCIENCES, BANGALORE for providing me an opportunity to fulfil
my cherished goal of taking up the project work as a part of the undergraduate program.
I also thank all those who have helped me directly or indirectly in ways of time,
resources, moral and technical support.
ANANYA R (1IC13CS002)
KIRAN VASISHTA T S (1IC13CS009)
SHILPARANI J (1IC13CS027)
ABSTRACT
With rapid advancements in technology, data in many fields, including agriculture, has
entered the era of big data. Traditional tools and techniques are unable to store and
analyze this massive amount of data; storing and analyzing it requires a parallel
computing and analysis paradigm, and big data analytics is used as the solution. In this
project a big data analytics framework for agriculture and aquaculture is developed that
identifies a disease based on symptom similarity and recommends a solution based on
the highest similarity. To achieve this objective, the Hadoop and Hive tools have been
used. The data is collected, cleansed and normalized. Data is collected from laboratory
reports, websites, etc.; then cleansing of the data is done, that is, important information
is extracted from unstructured, redundant data. In the next step normalization is done,
i.e., features are extracted from the cleaned data. The normalized data is uploaded to
HDFS and saved in a file format supported by Hive. HiveQL, an SQL-like query
language, is used to analyze the data. The framework finds the disease name based on
the crop/fish disease symptoms and proposes a solution based on evidence from
historical data. The result is useful for recommending the solution that is most used or
has the highest symptom similarity.
CONTENTS
CHAPTER Nos. TITLE PAGE Nos.
1 Introduction 1
1.1 Motivation 1
1.2 Objective 2
1.3 Methodology 2
1.4 Existing System 4
1.5 Proposed System 5
2 Literature Survey 6
2.1 Software Description 8
2.1.1 Java Technology 8
2.1.2 IntelliJ IDEA 9
2.1.3 IntelliJ IDEA Platform 10
2.1.4 IntelliJ IDEA IDE 12
3 System Analysis 13
3.1 Functional Requirements 13
3.2 Non-Functional Requirements 14
3.3 System Requirements 14
3.3.1 Hardware Requirement Specification 14
3.3.2 Software Requirement Specification 14
4 System Design 15
4.1 System Architecture 17
4.2 Use-Case Diagram 21
4.3 Dataflow Diagram 22
4.4 Sequence Diagrams 25
5 Implementation 27
5.1 MapReduce Algorithm 28
5.2 Partitioner 31
5.3 Combiner 32
6 Testing 36
6.1 Introduction to Testing 36
6.1.1 Functional and Non-Functional Testing 37
6.1.2 Compatibility Testing 37
6.1.3 Verification and Validation 37
6.2 Testing Methodologies 40
6.3 Testing Levels 40
6.3.1 Unit Testing 40
6.3.2 Integration Testing 41
6.3.3 System Testing 41
6.4 Unit Testing of Main Modules 41
6.4.1 Unit Testing for User 41
7 Results 43
7.1 Snapshots of Browsing HDFS 43
7.2 Snapshots of Hadoop and Hive Interfaces 47
7.3 Snapshots of the project deployed on IntelliJ IDEA 50
7.4 Snapshots of Web Application/Framework 52
8 Conclusion And Future Work 54
References
List of Figures
Figure Nos. Title Page Nos.
1 3V’s of Big Data 6
2 Cloud Computing 7
3 System Architecture 17
4 Hadoop Architecture 18
5 HDFS Architecture 20
6 Use-Case Diagram 21
7 Data Flow diagram 22
8 Host and Symptom DFD 23
9 Splitting Data DFD 23
10 Searching keyword-DFD 24
11 Main sequence diagram 25
12 Login failure sequence diagram 26
13 MapReduce 27
14 MapReduce Architecture 28
15 Working of MapReduce Classes 29
16 Working of MapReduce 30
17 Combiner 32
18 Example of MapReduce 35
19 Overview of HDFS setup 42
20 Summary of HDFS setup 44
21 NameNode status of HDFS 45
22 Datanode information of HDFS 46
23 Hadoop and Hive initializations 47
24 Processes running on Hadoop user 47
25 Hive tables 48
26 Hosttable rows 48
27 Query Execution on Hive table 49
28 Project on IntelliJ 50
29 Core module of project 50
30 Web module of project Testing 51
31 Front web page of the application 52
32 Searching for required data 52
33 Result 53
List of Tables
Table Nos. Title Page Nos.
1 Input, output for key-value pairs 31
2 Unit Test Case 1 41
3 Unit Test Case 2 42
4 Unit Test Case 3 42
5 Unit Test Case 4 42
Chapter 1
INTRODUCTION
Big data is a term used to describe the enormous growth of data. The data may be in a
file system or in a database, and it cannot be processed by traditional software
techniques and databases. The main aim of this project is to develop a recommendation
system to identify and provide solutions for agricultural crop diseases and aquacultural
fish diseases. With the help of big data analytics, researchers can easily make decisions
from historical data. It will be a great innovation and pioneering work if big data
analytics is used in agriculture and aquaculture. Agriculture and aquaculture data is
increasing day by day at an astonishing rate. The solution for this is to use big data
analytics, and for the analysis of such data Hadoop and its tools are used.
Apache Hadoop is an open-source software framework used for the distributed
storage and processing of big data sets using the MapReduce programming model.
It consists of computer clusters built from commodity hardware. The core of Apache
Hadoop consists of a storage part, known as the Hadoop Distributed File System
(HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into
large blocks and distributes them across nodes in a cluster. It then transfers packaged
code into nodes to process the data in parallel. This approach takes advantage of data
locality, where nodes manipulate the data they have access to. This allows the dataset to
be processed faster and more efficiently than it would be in a more
conventional supercomputer architecture that relies on a parallel file system where
computation and data are distributed via high-speed networking.
1.1 Motivation
In the present scenario there is a huge dependency on agricultural data. As we are in
the 21st century, the generations are already immersed in digital systems. Hence the
motivation for building an agriculture and aquaculture disease identification and
recommendation system in this project.
1.2 Objective
The objective of this project is to develop a web application that takes from the user the
symptoms of a crop or fish suspected of an infectious disease, queries the huge data set
using the Hadoop and Hive tools, and recommends, from the historical data, which
disease the symptoms point to and what the prevention for it might be.
1.3 Methodology
The primary motive of generating results from the collected data is to serve researchers
by giving a solution for various diseases of crops. It was not an easy task to develop a
new framework to identify a disease and recommend a solution based on symptom
similarity. The framework provides the solution based on historical data. Data for this
framework is collected from various sources.
This model basically works as a recommendation system. Recommendation
systems use historical data or knowledge of the product. Many e-commerce
companies use recommendation systems for sales (e.g., Amazon.in). In the proposed
model the recommendation system is applied to the agriculture domain.
Firstly, data is collected from various sources, e.g., lab reports, agriculture websites,
etc. The collected data is known as raw data because it contains irregularities and
unwanted information; it is unformatted and needs formatting. This data is
stored on HDFS. The NameNode of HDFS keeps track of how files are broken down into
blocks and of which nodes store those blocks. Clients communicate directly with the
DataNodes to process the local file data corresponding to the blocks.
Feasibility Study
Feasibility studies aim to objectively and rationally uncover the strengths and weaknesses
of the existing system and the proposed venture, the resources required to carry it through,
and ultimately the prospects for success. A detailed feasibility study was conducted to
assess the technical and financial feasibility of the project, and it was found that the project
is feasible to design, develop, use and maintain in all respects.
Requirement Analysis and Project Planning
The requirements of this project was analysed in detail which includes system
requirements specification, software and hardware requirements. The project plan was
developed with the help of requirements gathered in this phase.
Design
After successful analysis of the system requirement, design of the project started where
various design constraints were analysed. The design phase consists of various modules
to be developed for generation of data of historic and live nature which is a need for the
testing of the product. A functional design methodology and top-down strategy is used
in this design phase. The flow diagrams along with the activity diagrams are indicated to
show the flow of control at various stages of the project.
Coding
The design of the system developed during the design phase is converted into code using
the Java environment and Perl as and when required at different stages of the project. The
coding is done according to the design strategy which aligns with the functional
requirements that are categorized as in the previous step.
Testing
The program is tested by executing it with a set of test cases in different setup
environments, including standalone systems. The output of the program for the test cases
is then evaluated to determine whether the program is performing as expected. I have used
an incremental testing strategy to ensure functional testing. First, some main parts of the
project were tested independently. Then these parts were combined into
subsystems, which were tested separately. Due to the integration of various other modules
into the system, testing was carried out to ensure the performance behaviour of the web
application.
1.4 Existing system
At present there are many agricultural websites and apps which help in the cultivation and
crop yielding technics. Some of the existing systems are as follows:
MySmartFarm: MySmartFarm is a one-stop shop for all of a farmer's data and
technology. Hosted in the cloud, driven by statistics and powered by intelligent models
and machine learning, it is designed for easy real-time 'anywhere access', to empower
farmers with scientific advice, optimize decisions and save time and money.
aWhere: aWhere’s Agricultural Intelligence platform provides users with accurate global
weather information for all agricultural needs. In order to accomplish this, aWhere
employs 13,000 ground weather stations around the world to collect specific data to
create a continuous weather map for the planet’s surface. By collecting and organizing
this data, aWhere is able to create a valuable network of more than 1.5 million virtual
weather stations that provide hourly forecasts as well as 10+ years of historical data. To
ensure that its customers are getting the most precise data possible, aWhere excludes
outliers from its data sets. With the exclusion of these outliers, and the use of other
algorithm-based methods, aWhere is able to provide the most accurate and up-to-date
weather information on the planet.
Phenonet: Phenonet collects, processes and visualizes sensor data from the field in near
real-time. It is helping plant scientists and farmers identify the best crop varieties to
increase yield and efficiency of digital agriculture.
Farmlogs: Managing your nitrogen efficiently is one of the key ways to drive higher
yields and higher profit. With FarmLogs, you'll have the tools you need to make nitrogen
management easier and more efficient.
Datafloq: Datafloq offers information, insights, knowledge and opportunities to drive
innovation through data. You can read high-quality articles, find big data and technology
vendors, post jobs, connect with talent, find or publish events and register for our online
training.
1.5 Proposed System
The primary motive of generating results from the collected data is to serve researchers by
giving a solution for various diseases of crops. It was not an easy task to develop a new
framework to identify a disease and recommend a solution based on symptom similarity.
The framework provides the solution based on historical data. Data for this framework is
collected from various sources.
This model basically works as a recommendation system. Recommendation systems
use historical data or knowledge of the product. Many e-commerce companies use
recommendation systems for sales (e.g., Amazon.in). In the proposed model the
recommendation system is applied to the agriculture domain.
Chapter 2
LITERATURE SURVEY
Laney D., Meta Group Inc., Application Delivery Strategies, February 2001: 3D Data
Management: Controlling Data Volume, Velocity and Variety [1] addresses the current
business conditions and mediums that are pushing traditional data management to its
limits, giving rise to novel, more formalized approaches.
Fig.1 3V’s of Big Data
Xue-Wen Chen, Xiaotong Lin, IEEE Access, May 2014: Big Data Deep Learning:
Challenges and Perspectives [2] discusses how, with the sheer size of data available today,
big data brings big opportunities and transformative potential for various sectors; on the
other hand, it also presents unprecedented challenges to harnessing data and information.
As the data keeps getting bigger, deep learning is coming to play a key role in providing
big data predictive analytics solutions. The paper provides a brief overview of deep
learning, and highlights current research efforts and the challenges of big data, as well as
future trends.
Marx V., Nature, January 2013: Biology: The Big Challenges of Big Data [3]
explains that, in cloud computing, large data sets are processed on remote Internet
servers rather than on researchers' local computers. Large files with the big data problem
on local systems are passed through the security firewalls and sent to, or mounted on, the
systems of data centers, which store the data on the cloud platform, as shown in the figure.
Fig.2 Cloud Computing
David B. Lobell, Vol. 143 (2013): The use of satellite data for crop yield gap analysis [4]
reviews the various approaches people have used in the past to identify crop yield
and its variations. One such approach defined in this paper is the analysis of satellite
images, in combination with other factors such as weather and land characteristics, to
determine the crop yield. This approach involves communication with the satellite and
involves a cost factor. Another approach is to use data from soil management sensors and
weather information to predict field crop yield; we are also working on one such data set.
The advantage of the satellite image analysis approach, however, is that it is much faster
than other methods, as this communication happens in real time and provides realistic
results.
J. Ben Schafer, Joseph A. Konstan, Kluwer Academic Publishers, Manufactured
in the Netherlands, 2001: E-Commerce Recommendation Applications: Data Mining and
Knowledge Discovery [5]. Its primary focus is to help the consumer choose the product
he/she is looking for much more quickly, by analyzing his/her search history and
interests. It also helps the e-commerce sites recommend products to consumers
while they are looking at specific products. This helps improve sales and reduce the
overall buying time online. The analysis is done either on the basis of predefined rules
provided by experts or on data mined from the behavior of the consumer while shopping
on the sites, providing a feel of "the business knows the consumer best". The accuracy of
the recommendation improves with more interaction of the system with the consumer, as
it is a self-learning system.
2.1 Software Description
2.1.1 Java Technology
Java is a general-purpose, concurrent, class-based, object-oriented computer programming
language that is specifically designed to have as few implementation dependencies as
possible. It is intended to let application developers "write once, run anywhere" (WORA),
meaning that code that runs on one platform does not need to be recompiled to run on
another. Java applications are typically compiled to bytecode (class files) that can run on
any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2012,
one of the most popular programming languages in use, particularly for client-server web
applications, with a reported 10 million users. Java was originally developed by James
Gosling at Sun Microsystems (which has since merged into Oracle Corporation) and
released in 1995 as a core component of Sun Microsystems' Java platform. The language
derives much of its syntax from C and C++, but it has fewer low-level facilities than
either of them.
James Gosling, Mike Sheridan, and Patrick Naughton initiated the Java language
project in June 1991. Java was originally designed for interactive television, but it was too
advanced for the digital cable television industry at the time. The language was initially
called Oak after an oak tree that stood outside Gosling's office; it went by the
name Green later, and was later renamed Java, from Java coffee, said to be consumed in
large quantities by the language's creators. Gosling aimed to implement a virtual
machine and a language that had a familiar C/C++ style of notation.
Sun Microsystems released the first public implementation as Java 1.0 in 1995. It
promised "Write Once, Run Anywhere" (WORA), providing no-cost run-times on
popular platforms. Fairly secure and featuring configurable security, it allowed network-
and file-access restrictions. Major web browsers soon incorporated the ability to run Java
applets within web pages, and Java quickly became popular. With the advent of Java
2 (released initially as J2SE 1.2 in December 1998 – 1999), new versions had multiple
configurations built for different types of platforms. For example, J2EE targeted
enterprise applications and the greatly stripped-down version J2ME for mobile
applications (Mobile Java). J2SE designated the Standard Edition. In 2006, for marketing
purposes, Sun renamed new J2 versions as Java EE, Java ME, and Java SE, respectively.
In 1997, Sun Microsystems approached the ISO/IEC JTC1 standards body and
later the Ecma International to formalize Java, but it soon withdrew from the
process. Java remains a de facto standard, controlled through the Java Community
Process. At one time, Sun made most of its Java implementations available without
charge, despite their proprietary software status. Sun generated revenue from Java
through the selling of licenses for specialized products such as the Java Enterprise
System. Sun distinguishes between its Software Development Kit (SDK) and Runtime
Environment (JRE) (a subset of the SDK); the primary distinction involves the JRE's lack
of the compiler, utility programs, and header files.
On November 13, 2006, Sun released much of Java as free and open source
software, (FOSS), under the terms of the GNU General Public License (GPL). On May 8,
2007, Sun finished the process, making all of Java's core code available under free
software/open-source distribution terms, aside from a small portion of code to which Sun
did not hold the copyright.
Sun's vice-president Rich Green said that Sun's ideal role with regards to Java was
as an "evangelist." Following Oracle Corporation's acquisition of Sun Microsystems in
2009–2010, Oracle has described itself as the "steward of Java technology with a
relentless commitment to fostering a community of participation and transparency". This
did not hold Oracle, however, from filing a lawsuit against Google shortly after that for
using Java inside the Android SDK (see Google section below). Java software runs
on laptops to data centers, game consoles to scientific supercomputers.
There are 930 million Java Runtime Environment downloads each year and 3
billion mobile phones run Java. On April 2, 2010, James Gosling resigned from Oracle.
There were five primary goals in the creation of the Java language:
1. It should be "simple, object-oriented and familiar"
2. It should be "robust and secure"
3. It should be "architecture-neutral and portable"
4. It should execute with "high performance"
5. It should be "interpreted, threaded, and dynamic"
2.1.2 IntelliJ IDEA
IntelliJ IDEA is a Java integrated development environment (IDE) for developing
computer software. It is developed by JetBrains (formerly known as IntelliJ), and is
available as an Apache 2 licensed community edition and in a proprietary commercial
edition. Both can be used for commercial development.
The first version of IntelliJ IDEA was released in January 2001, and was one of
the first available Java IDEs with advanced code navigation and code
refactoring capabilities integrated.
In a 2010 InfoWorld report, IntelliJ received the highest test center score out of
the four top Java programming tools: Eclipse, IntelliJ IDEA, NetBeans and JDeveloper.
In December 2014, Google announced version 1.0 of Android Studio, an open source IDE
for Android apps, based on the open source community edition of IntelliJ IDEA. Other
development environments based on IntelliJ's framework
include AppCode, CLion, PhpStorm, PyCharm, RubyMine, WebStorm, and MPS.
2.1.3 IntelliJ IDEA Platform
IntelliJ supports plugins, through which one can add additional functionality to the IDE.
One can download and install plugins either from IntelliJ's plugin repository website or
through the IDE's inbuilt plugin search-and-install feature. Currently, the IntelliJ IDEA
Community edition has 1495 plugins available, whereas the Ultimate edition has 1626
plugins available.
The Community and Ultimate editions differ in their support for various programming
languages like:
Java
Clojure
Dart
Erlang
Go
Groovy
Haxe
Perl
Scala
XML/XSL
Kotlin
ActionScript/MXML
CoffeeScript
Haskell
HTML/XHTML/CSS
JavaScript
Lua
PHP
Python
Ruby/JRuby
SQL
TypeScript
Community Edition supports the following technologies and frameworks:
Android
Ant
Gradle
JavaFX
JUnit
Maven
SBT
TestNG
Ultimate Edition supports the following technologies and frameworks:
Django
EJB
FreeMarker
Google App Engine
Google Web Toolkit
Grails
Hibernate/JPA
Java ME MIDP/CLDC
JBoss Seam
JSF
JSP
Jelastic
Node.js
OSGi
Play
Ruby on Rails
Spring
Struts 2
Struts
Tapestry
Velocity
Web services
2.1.4 IntelliJ IDEA IDE
The IntelliJ IDEA IDE provides features such as code completion by analyzing the
context, code navigation where one can jump to a class or declaration in the code directly,
code refactoring, and options to fix inconsistencies via suggestions.
The IDE provides integration with build/packaging tools like Grunt, Bower, Gradle
and SBT. It supports version control systems like Git, Mercurial, Perforce and SVN.
Databases like Microsoft SQL Server, Oracle, PostgreSQL and MySQL can be
accessed directly from the IDE.
Chapter 3
SYSTEM ANALYSIS
System analysis is a phase of the systems approach to problem solving using
computers. It is a process of gathering and interpreting facts, diagnosing problems and
using the information to recommend improvements to the existing system.
The proposed system for the project entitled “Big Data Analytics Framework to
identify Agriculture/Aquaculture Diseases and recommendation of a solution.”
includes:
To make a web application, that handles the big data problems of agriculture and
aquaculture diseases.
To build a framework that is user friendly.
The agriculture and aquaculture data is maintained in the form of a Hive
database. The Hive Query Language (HiveQL) is used to query Hive.
The Hadoop technology is used to handle big data.
The farmer/researcher inputs the host name and the symptom of the infected crop
or fish.
The user receives the output on the same web page, with details of the identified
disease, the locations it can affect and the various steps to prevent it from causing
further damage.
3.1 Functional Requirements
The definition for a functional requirement specifies what the system should do. A
requirement specifies a function that a system or component must be able to perform.
Functional requirements specify specific behavior or functions. The functional
requirements are those that refer to the functionality of the system. The functional
requirements of the project are given below:
A real-time monitor should display the details of all the data flowing in and out
of the system.
The application takes the data from the client, interacts with the server and
displays the output on the screen.
The web application is deployed on the cloud and is accessible through the web.
3.2 Non-Functional Requirements
The definition of a non-functional requirement specifies how the system should behave:
a non-functional requirement is a statement of how a system must behave; it is a
constraint upon the system's behavior. Non-functional requirements specify all the
remaining requirements not covered by the functional requirements. They specify criteria
that judge the operation of a system, rather than specific behaviors. Non-functional
requirements in software engineering represent a systematic and pragmatic approach to
'building quality into' software systems. Systems must exhibit software quality attributes,
such as accuracy, performance, security and modifiability.
The non-functional requirements are:
The application must be easily operated without needing much knowledge
of the algorithm or of coding.
It should provide an easy interface for adding more features for other
applications.
3.3 System Requirements
3.3.1 Hardware Requirement Specification
SYSTEMS: 3 systems for multi-clustering in Hadoop.
PROCESSOR: Intel® Core™ i3-2330M CPU @ 2.20 GHz.
HARD DISK: 40 GB or more.
RAM: 256 MB or more.
3.3.2 Software Requirement Specification
OPERATING SYSTEM: Windows XP or later.
LANGUAGE USED: Java (JDK 1.8 or later).
TOOLS USED: Apache Hadoop 2.8, Apache Tomcat 8.
IDE USED: IntelliJ IDEA (2015 or later).
Chapter 4
SYSTEM DESIGN
The primary motive of generating results from the collected data is to serve researchers by
giving a solution for various diseases of crops. It was not an easy task to develop a new
framework to identify a disease and recommend a solution based on symptom similarity.
The framework provides the solution based on historical data. Data for this framework is
collected from various sources.
This model basically works as a recommendation system. Recommendation systems
use historical data or knowledge of the product. Many e-commerce companies use
recommendation systems for sales (e.g., Amazon.in). In the proposed model the
recommendation system is applied to the agriculture domain.
Firstly, data is collected from various sources, e.g., lab reports, agriculture websites, etc.
The collected data is known as raw data because it contains irregularities and unwanted
information; it is unformatted and needs formatting. This data is stored on HDFS. The
NameNode of HDFS keeps track of how files are broken down into blocks and of which
nodes store those blocks. Clients communicate directly with the DataNodes to process
the local file data corresponding to the blocks.
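As a concrete illustration, the following is a minimal Java sketch of uploading the cleansed data file to HDFS; the NameNode URI and the file paths are illustrative assumptions, not values taken from this project.

// A minimal sketch, assuming the cleansed data is in a local CSV file and
// the cluster's NameNode listens at hdfs://localhost:9000 (both assumptions).
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Copy the cleansed, normalized data set into HDFS; the NameNode
        // records how it is split into blocks and which DataNodes hold them.
        fs.copyFromLocalFile(new Path("/tmp/cleaned_disease_data.csv"),
                             new Path("/user/hadoop/agri/disease_data.csv"));
        fs.close();
    }
}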
Data sources are:
Laboratory Test reports:
These are a crucial source of data for researchers. The tests conducted are soil, water,
manure and plant analysis, etc.
Agriculture/Aquaculture info websites:
These websites act like a mentor for farmers. They give information related to agricultural
economic entities, commonly used pesticides, etc. Agriculture information websites tell
farmers which crop to plant, where and when, and suggest solutions to various problems
related to crops. Through these sites farmers gain knowledge about new techniques and
tools.
Agriculture/Aquaculture department reports:
Using these reports, decision making is easy for the crops of a particular area. These
reports are important for providing information regarding a particular field of a
geographical area.
The data collected from the above sources is stored on the Hadoop distributed file
system in the form of text files. The collected data is unstructured and contains
irrelevant data.
Firstly, unimportant data is removed and relevant data is extracted from the collected data.
Then features are selected and extracted from the relevant data and saved into a text file in
the Hive data warehouse. Hive is used for querying the data in a distributed environment.
Hive is an open-source software tool used for data warehousing. To extract data from the
Hadoop system, Hive provides an interface similar to the SQL interface, termed HiveQL
(Hive Query Language).
A query can be submitted to the distributed environment in three ways:
By using command line interface
Application programming interface
Web user interface
A Thrift server is used as an interface when the client and server use different languages.
HiveQL extracts data from the Hive data warehouse and saves the query results into a text
file stored on HDFS. The text file is then submitted to the distributed environment to
identify the crop disease name based on crop disease symptom similarity. In this process,
after splitting, the text file is submitted to the mapper to calculate pair-based symptom
similarity; pair-based similarity tolerates spelling mistakes and word reordering, which
increases the efficiency of the recommendation system.
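A minimal sketch of one such pair-based measure is given below, assuming a Dice-coefficient formulation over character bigrams; the class and method names are illustrative, not the project's actual implementation.

// A minimal pair-based (character-bigram) similarity sketch, assuming a
// Dice-coefficient formulation; names here are illustrative.
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class PairSimilarity {

    // Collect the set of adjacent character pairs of each word in the text.
    static Set<String> bigrams(String text) {
        Set<String> pairs = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            for (int i = 0; i < word.length() - 1; i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    // Dice coefficient: 2 * |A intersect B| / (|A| + |B|), in [0, 1].
    static double similarity(String a, String b) {
        Set<String> pa = bigrams(a);
        Set<String> pb = bigrams(b);
        if (pa.isEmpty() && pb.isEmpty()) return 1.0;
        Set<String> common = new HashSet<>(pa);
        common.retainAll(pb);
        return 2.0 * common.size() / (pa.size() + pb.size());
    }

    public static void main(String[] args) {
        // Word reordering and a small spelling slip barely change the score.
        System.out.println(similarity("yellow leaf spots", "leaf spots yelow"));
    }
}

Because adjacent character pairs are compared as a set, reordering words or misspelling a single letter changes only a few pairs, so the similarity score degrades gracefully rather than dropping to zero.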
4.1 System Architecture
Fig.3 System Architecture
As explained in the proposed system, the data is collected from various data sources;
this data is called raw data. The raw data is then cleansed by the cleaning process, which
removes the unwanted entries from the data. The required data is written into a .csv file.
This file is then stored in the HDFS (Hadoop Distributed File System) of Apache
Hadoop.
Fig.4 Hadoop Architecture
Apache Hadoop is an open-source software framework for storage and large-scale
processing of data-sets on clusters of commodity hardware. There are mainly five
building blocks inside this runtime environment.
The cluster is the set of host machines (nodes). Nodes may be partitioned
in racks. This is the hardware part of the infrastructure.
The YARN Infrastructure (Yet Another Resource Negotiator) is the framework
responsible for providing the computational resources (e.g., CPUs, memory, etc.)
needed for application executions. Two important elements are:
The Resource Manager (one per cluster) is the master. It knows where
the slaves are located (Rack Awareness) and how many resources they
have. It runs several services; the most important is the Resource
Scheduler which decides how to assign the resources.
The Node Manager (many per cluster) is the slave of the infrastructure.
When it starts, it announces itself to the Resource Manager.
Periodically, it sends a heartbeat to the Resource Manager. Each Node
Manager offers some resources to the cluster. Its resource capacity is
the amount of memory and the number of cores. At run-time, the
Resource Scheduler will decide how to use this capacity: a Container is
a fraction of the NM capacity and it is used by the client for running a
program.
The HDFS Federation is the framework responsible for providing permanent,
reliable and distributed storage. This is typically used for storing inputs and outputs
(but not intermediate ones).
Other alternative storage solutions exist; for instance, Amazon uses the Simple
Storage Service (S3).
The MapReduce Framework is the software layer implementing the MapReduce
paradigm.
The Hadoop Distributed File System was developed using a distributed file system
design. It runs on commodity hardware. Unlike other distributed systems, HDFS is
highly fault-tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such
huge data, the files are stored across multiple machines. These files are stored in a
redundant fashion to rescue the system from possible data losses in case of failure.
HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanode help users to easily check the
status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Fig.5 HDFS Architecture
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is software that can be run on commodity
hardware. The system having the namenode acts as the master server and it does the
following tasks:
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening
files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and the
datanode software. For every node (commodity hardware/system) in a cluster, there will
be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is
divided into one or more segments and/or stored on individual data nodes. These file
segments are called blocks. In other words, the minimum amount of data that HDFS
can read or write is called a block. The default block size is 64 MB, but it can be
increased as needed by changing the HDFS configuration.
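A minimal sketch of requesting a non-default block size from the Java client API is shown below; the NameNode URI, the path, the replication factor and the 128 MB figure are illustrative assumptions.

// A minimal sketch of writing a file with a non-default block size; the
// URI, path and sizes used here are illustrative assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Create a file with a 128 MB block size instead of the default.
        try (FSDataOutputStream out = fs.create(
                new Path("/user/hadoop/large_dataset.txt"),
                true,                  // overwrite if the file exists
                4096,                  // write buffer size in bytes
                (short) 3,             // replication factor
                128L * 1024 * 1024)) { // block size: 128 MB
            out.writeBytes("example record\n");
        }
        fs.close();
    }
}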
After receiving the data from the UI, the data is sent to Hive to be checked against the
data already stored in the Hive database. This connection is made through the Hive driver
classes and Thrift drivers, written in Java with the help of the JDBC/ODBC drivers.
Further, the data from Hive is managed by the various shuffling, mapping and
reducing algorithms inside Hadoop. The final result is sent back from Hadoop to the
UI.
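A minimal sketch of such a Java-side Hive lookup over JDBC is given below, assuming a HiveServer2 instance listening on localhost:10000; the table and column names are illustrative, not the project's actual schema.

// A minimal Hive-over-JDBC lookup sketch; the connection URL, credentials
// and the hosttable schema used here are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveLookup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hadoop", "");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT disease, solution FROM hosttable "
               + "WHERE host = ? AND symptoms LIKE ?")) {
            ps.setString(1, "paddy");
            ps.setString(2, "%yellow leaf%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                }
            }
        }
    }
}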
4.2 Use-Case Diagram
Fig.6 Use-Case Diagram
In the use-case diagram of this project there is one actor and the backend, which is the
server. The actor interacts with the backend through the framework that is built, that is,
the web application. The web application's name is ASK-Agri. The symptom and the
host are entered on the web application. The data entered is filtered and sent to the
backend of the system. At the backend are the Hadoop system and the Hive tools. The
already collected crop disease data is in the Hive table. The data to be searched from the
Hive table is queried by HiveQL. The result is sent back to the backend, which in turn
returns the data to the web application, where it is shown to the user through the UI of
the web application.
4.3 Dataflow Diagram
Fig.7 Data Flow diagram
The user enters the data by logging in to the web application. The host name and the
symptom are entered. This data is read and split into keywords by the splitter algorithms
in Hadoop. Each keyword is then passed to the matching algorithm, which matches it
against the data. The matched keyword is searched for in the database that is already
present. The corresponding row of the database is selected and returned to the UI, where
it is shown as output.
Host and symptom-DFD
Fig.8 Host and Symptom DFD
The data from the user takes two forms: the host name and the symptom. These are
typed on the UI of the system and sent through a Java servlet to the Hadoop
system.
Splitting Data-DFD
Fig.9 Splitting Data DFD
The input given is split into chunks of data in the Hadoop system. Each chunk of data is
stored on DataNodes. The details of which DataNode holds what type of data are stored
on the NameNode.
Searching keyword-DFD
Fig.10 Searching keyword-DFD
The keyword, after it is matched, is sent to the Hive database, where the Hive tables are
present, through the Hive driver class written in Java.
4.4 Sequence Diagrams
Fig.11 Main sequence diagram
The main sequence diagram of the project consists of four entities: User, ASK-Agri,
Hadoop and Database.
The user entity starts its activity by logging into the ASK-Agri framework. After
the login is successful, the host name and symptom are entered.
ASK-Agri then generates the keyword from the entered data, and that keyword
is sent to the Hadoop system. The Hadoop system checks for the details of the disease in
the already collected data stored in the database. If the search is successful, the
retrieved data is sent back to Hadoop and then to the framework. The framework
creates an activity that displays the result back to the user.
During the login process, the sequence flow is between the user, the ASK-Agri
framework and the login database.
The user starts an activity by requesting the login page from the web application. If
the request is granted, the page is sent, and the user enters credentials such as the
username and password. The credentials of the registered user are searched in the login
database.
If the user is found, the login is successful. If the user data is not available, a login
error occurs.
Fig.12 Login failure sequence diagram
Chapter 5
IMPLEMENTATION
The objective of the implementation step is to create the code, test it for the required
output and debug the errors occurring during the execution of the program. System
implementation involves testing the tool created on the setup and verifying that the data
is generated in the central manager database.
Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and process
data. The following illustration depicts a schematic view of a traditional enterprise
system. The traditional model is certainly not suitable for processing huge volumes of
scalable data, which cannot be accommodated by standard database servers. Moreover,
the centralized system creates too much of a bottleneck while processing multiple files
simultaneously.
Fig.13 MapReduce
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.
Fig.14 MapReduce Architecture
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those
data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
5.1 MapReduce Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class
is used as input by Reducer class, which in turn searches matching pairs and reduces
them.
Fig.15 Working of MapReduce Classes
MapReduce implements various mathematical algorithms to divide a task into
small parts and assign them to multiple systems. In technical terms, MapReduce
algorithm helps in sending the Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Indexing
TF-IDF
Sorting Algorithm
Sorting is one of the basic MapReduce algorithms used to process and analyze data.
MapReduce implements a sorting algorithm to automatically sort the output key-value
pairs from the mapper by their keys.
Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (user-defined class) collects the matching valued keys as a
collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the
help of RawComparator class to sort the key-value pairs.
The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are
presented to the Reducer.
Searching Algorithm
Searching plays an important role in MapReduce algorithm. It helps in the combiner
phase (optional) and in the Reducer phase.
Generally MapReduce paradigm is based on sending map-reduce programs to
computers where the actual data resides.
During a MapReduce job, Hadoop sends Map and Reduce tasks to appropriate
servers in the cluster.
The framework manages all the details of data-passing like issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
Most of the computing takes place on the nodes with data on local disks that
reduces the network traffic.
After completing a given task, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
Fig.16 Working of MapReduce
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on key-value pairs; that is, the framework views
the input to the job as a set of key-value pairs and produces a set of key-value pairs as
the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence, it
is required to implement the Writable interface. Additionally, the key classes have to
implement the WritableComparable interface to facilitate sorting by the framework.
Both the input and the output of a MapReduce job are in the form of key-value pairs:

(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output)

          Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)

Table 1. Input and output key-value pairs
5.2 Partitioner
A partitioner works like a condition in processing an input dataset. The partition phase
takes place after the Map phase and before the Reduce phase.
The number of partitioners is equal to the number of reducers. That means a
partitioner will divide the data according to the number of reducers. Therefore, the data
passed from a single partitioner is processed by a single Reducer.
A partitioner partitions the key-value pairs of the intermediate Map outputs. It
partitions the data using a user-defined condition, which works like a hash function. The
total number of partitions is the same as the number of Reducer tasks for the job. Let us
take an example to understand how the partitioner works.
Map Tasks
The map task accepts key-value pairs as input, even though the raw data resides as text
in a text file.
Partitioner Task
The partitioner task accepts the key-value pairs from the map task as its input. Partition
implies dividing the data into segments.
Reduce Tasks
The number of partitioner tasks is equal to the number of reducer tasks.
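As an example, a minimal custom Partitioner sketch is shown below; hashing the key is an illustrative choice of the user-defined condition, not the project's actual rule.

// A minimal custom Partitioner sketch; the class name and the hash-based
// condition are illustrative assumptions.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Guard against a single-reducer or local run.
        if (numReduceTasks == 0) {
            return 0;
        }
        // Hash-like condition: records with the same key always go to the
        // same reducer, so each reducer sees a complete key group.
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}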
5.3 Combiner
A Combiner, also known as a semi-reducer, is an optional class that operates by
accepting the inputs from the Map class and thereafter passing the output key-value pairs
to the Reducer class.
The main function of a Combiner is to summarize the map output records with
the same key. The output (key-value collection) of the combiner will be sent over the
network to the actual Reducer task as input.
The Combiner class is used in between the Map class and the Reduce class to
reduce the volume of data transfer between Map and Reduce. Usually, the output of the
map task is large and the data transferred to the reduce task is high.
The following MapReduce task diagram shows the COMBINER PHASE.
Fig.17 Combiner
How Combiner Works?
Here is a brief summary on how MapReduce Combiner works −
Step 1: A combiner does not have a predefined interface and it must implement the
Reducer interface’s reduce() method.
Step 2: A combiner operates on each map output key. It must have the same output key-
value types as the Reducer class.
Step 3: A combiner can produce summary information from a large dataset because it
replaces the original Map output.
Although the Combiner is optional, it helps segregate the data into multiple groups for
the Reduce phase, which makes the data easier to process.
The important phases of the MapReduce program with Combiner are discussed below.
Record Reader
This is the first phase of MapReduce where the Record Reader reads every line from the
input text file as text and yields output as key-value pairs.
Input − Line by line text from the input file.
Output − Forms the key-value pairs.
Map Phase
The Map phase takes input from the Record Reader, processes it, and produces the
output as another set of key-value pairs.
Input − The key-value pairs produced by the Record Reader.
The Map phase reads each key-value pair, divides each word from the value using
StringTokenizer, and treats each word as a key and the count of that word as a value.
The following code snippet shows the Mapper class and the map function.
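A minimal sketch consistent with this description is given below; the class name WordMapper and the exact type parameters are illustrative assumptions.

// A minimal Mapper sketch: the input key is the byte offset of the line,
// the input value is the line of text, and the output is a (word, 1) pair
// per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // each word becomes a key with count 1
        }
    }
}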
Combiner Phase
The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces the output as key-value collection pairs.
Input − The key-value pairs produced by the Map phase.
The Combiner phase reads each key-value pair, combining the common words as the
key and the values as a collection. Usually, the code and operation of a Combiner are
similar to those of a Reducer.
Reducer Phase
The Reducer phase takes each key-value collection pair from the Combiner phase,
processes it, and passes the output on as key-value pairs. Note that the Combiner
functionality is the same as that of the Reducer.
Input − The key-value collection pairs produced by the Combiner phase.
The Reducer phase reads each key-value pair and reduces the value collection for each
key.
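A minimal Reducer sketch for this phase is given below; because its input and output types match, the same class can also be registered as the Combiner. The class name is an illustrative assumption.

// A minimal Reducer sketch that sums the counts for each key; since its
// input and output types match, it can double as the Combiner.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}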
Record Writer
This is the last phase of MapReduce, where the Record Writer writes every key-value
pair from the Reducer phase and sends the output as text.
Input − Each key-value pair from the Reducer phase, along with the output format.
Output − The key-value pairs in text format.
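Putting the phases together, a minimal driver sketch is given below; it assumes the WordMapper and WordReducer classes sketched above, and the HDFS paths are illustrative.

// A minimal driver sketch wiring the phases above; it assumes the
// WordMapper and WordReducer sketches and uses illustrative HDFS paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordMapper.class);    // Map phase
        job.setCombinerClass(WordReducer.class); // optional Combiner phase
        job.setReducerClass(WordReducer.class);  // Reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The Record Reader reads these input files line by line; the
        // Record Writer emits the final key-value pairs as text.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Registering the reducer as the combiner shrinks the intermediate data shuffled across the network, exactly as described in this section.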
Fig.18 Example of MapReduce
Chapter 6
TESTING
6.1 Introduction to Testing
Software testing is an investigation conducted to provide stakeholders with information
about the quality of the product or service under test. Software testing can also provide an
objective, independent view of the software to allow the business to appreciate and
understand the risks of software implementation. Test techniques include, but are not
limited to, the process of executing a program or application with the intent of
finding software bugs (errors or other defects).
Software testing can be stated as the process of validating and verifying that a
software program/application/product meets the requirements that guided its
design and development, works as expected, and can be implemented with the same
characteristics.
Software testing, depending on the testing method employed, can be implemented
at any time in the development process. However, most of the test effort traditionally
occurs after the requirements have been defined and the coding process has been
completed, even though it has been shown that fixing a bug is less expensive when it is
found earlier in the development process. In Agile approaches, by contrast, most of the
test effort is on-going. As such, the methodology of the test is governed by the software
development methodology adopted.
Different software development models will focus the test effort at different points
in the development process. Newer development models, such as Agile, often employ test
driven development and place an increased portion of the testing in the hands of the
developer, before it reaches a formal team of testers. In a more traditional model, most of
the test execution occurs after the requirements have been defined and the coding process
has been completed.
Based on various parameters there are different methods of testing. A few
commonly used ones are as follows:
6.1.1 Functional and Non-functional testing
Functional testing refers to activities that verify a specific action or function of the code.
These are usually found in the code requirements documentation, although some
development methodologies work from use cases or user stories. Functional tests tend to
answer the question of "can the user do this" or "does this particular feature work."
Non-functional testing refers to aspects of the software that may not be related to a
specific function or user action, such as scalability or other performance characteristics,
behaviour under certain constraints, or security. Testing will determine the breaking
point, the point at which extremes of scalability or performance lead to unstable
execution. Non-functional requirements tend to be those that reflect the quality of the
product, particularly in the context of the suitability perspective of its users.
6.1.2 Compatibility testing
A common cause of software failure (real or perceived) is a lack of its compatibility with
other application software, operating systems (or operating system versions, old or new),
or target environments that differ greatly from the original (such as
a terminal or GUI application intended to be run on the desktop now being required to
become a web application, which must render in a web browser). For example, in the case
of a lack of backward compatibility, this can occur because the programmers develop and
test software only on the latest version of the target environment, which not all users may
be running. This results in the unintended consequence that the latest work may not function on earlier versions of the target environment, or on older hardware that earlier versions of the target environment were capable of using. Sometimes such issues can be
fixed by proactively abstracting operating system functionality into a separate
program module or library.
6.1.3 Verification and Validation
Validation testing is a concern which overlaps with integration testing. Ensuring that the
application fulfils its specification is a major criterion for the construction of an
integration test. Validation testing also overlaps to a large extent with System Testing,
where the application is tested with respect to its typical working environment.
Consequently, for many processes, no clear division between validation and system testing
can be made. Specific tests which can be performed in either or both stages include the
following.
Regression Testing: Where this version of the software is tested with the automated test harness used with previous versions, to ensure that the required features of the previous version are still working in the new version.
Recovery Testing: Where the software is deliberately interrupted in a number of ways, to ensure that the appropriate techniques for restoring any lost data will function.
Security Testing: Where unauthorized attempts to operate the software, or parts of it, are made. It might also include attempts to obtain access to the data, or to harm the software installation or even the system software. As with all types of security, it must be assumed that a sufficiently determined attacker will be able to obtain unauthorized access, and the best that can be achieved is to make this process as difficult as possible.
Stress Testing: Where abnormal demands are made upon the software by increasing the rate at which it is asked to accept information, or the rate at which it is asked to produce information. More complex tests may attempt to create very large data sets or cause the software to make excessive demands on the operating system (a minimal sketch of such a test follows after this list).
Performance Testing: Where the performance requirements, if any, are checked. These may include the size of the software when installed, the amount of main memory and/or secondary storage it requires, the demands made of the operating system when running within normal limits, and the response time.
Usability Testing: The process of usability measurement was introduced in the previous chapter. Even if usability prototypes have been tested whilst the application was being constructed, a validation test of the finished product will always be required.
Alpha and Beta Testing: This is where the software is released to the actual end users. An initial release, the alpha release, might be made to selected users who would be expected to report bugs and other detailed observations back to the production team. Once the application changes necessitated by the alpha phase have been made, a beta release might be made to a larger, more representative set of users, before the final release is made to all users.
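As mentioned in the stress testing item above, the following minimal, self-contained sketch in Java fires a large number of concurrent lookup requests and reports the elapsed time and failure count. The lookup method here is a stand-in that only sleeps, kept so the example runs on its own; in a real stress test it would be replaced by a call into the application under test.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LookupStressTest {

    // Stand-in for the real disease lookup call; replace with the actual client.
    static void lookup(String hostname, String symptom) throws Exception {
        Thread.sleep(5); // simulates a short round trip
    }

    public static void main(String[] args) throws InterruptedException {
        final int THREADS = 50;
        final int REQUESTS = 1000;
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        final AtomicInteger failures = new AtomicInteger();
        long start = System.nanoTime();
        for (int i = 0; i < REQUESTS; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        lookup("Tomato", "leaf curl");
                    } catch (Exception e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println(REQUESTS + " requests in " + elapsedMs
                + " ms, failures: " + failures.get());
    }
}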
The final process should be a software audit, where the complete software project is checked to ensure that it meets production management requirements. This ensures that all required documentation has been produced, is in the correct format and is of acceptable quality. The purpose of this review is firstly to assure the quality of the production process and, by implication, the quality of the product, before the maintenance phase commences. A formal hand-over from the development team at the end of the audit will mark the transition between the two phases.
Top-down testing can proceed in a depth-first or a breadth-first manner. For depth-first integration each module is tested in increasing detail, replacing more and more levels of detail with actual code rather than stubs. Alternatively, breadth-first integration would proceed by refining all the modules at the same level of control throughout the application. In practice a combination of the two techniques would be used. At the initial stages all the modules might be only partly functional, possibly being implemented only to deal with non-erroneous data. These would be tested in a breadth-first manner, but over a period of time each would be replaced with successive refinements closer to the full functionality. This allows depth-first testing of a module to be performed simultaneously with breadth-first testing of all the modules.
The other major category of integration testing is bottom-up integration testing, where an individual module is tested using a test harness. Once a set of individual modules have been tested, they are combined into collections of modules, known as builds, which are then tested by a second test harness. This process can continue until the build consists of the entire application. In practice a combination of top-down and bottom-up testing would be used. In a large software project developed by a number of sub-teams, or a smaller project where different modules are built by individuals, the sub-teams or individuals would conduct bottom-up testing of the modules they are constructing before releasing them to an integration team, which would assemble them together for top-down testing.
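To illustrate how stubs make top-down testing possible, the hedged sketch below shows a hypothetical search controller being exercised against a stub while the real Hive-backed lookup module is still under construction. All class and method names here are illustrative, not the project's actual code.

// Interface separating the lookup contract from its implementation.
interface DiseaseLookup {
    String findDisease(String hostname, String symptom);
}

// Stub standing in for the unfinished Hive-backed module: it returns
// canned data so the layers above it can be tested top-down.
class DiseaseLookupStub implements DiseaseLookup {
    public String findDisease(String hostname, String symptom) {
        if ("Tomato".equals(hostname) && "leaf curl".equals(symptom)) {
            return "Tomato leaf curl virus";
        }
        return null; // mimics "not found in database"
    }
}

// The controller depends only on the interface, so the stub can later
// be replaced by the real implementation without changing this class.
public class SearchController {
    private final DiseaseLookup lookup;

    SearchController(DiseaseLookup lookup) {
        this.lookup = lookup;
    }

    String handleSearch(String hostname, String symptom) {
        String disease = lookup.findDisease(hostname, symptom);
        return disease != null ? disease : "No result";
    }

    public static void main(String[] args) {
        SearchController controller = new SearchController(new DiseaseLookupStub());
        System.out.println(controller.handleSearch("Tomato", "leaf curl"));
    }
}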
Validation ensures that the product actually meets the user's needs, and that the
specifications were correct in the first place, while verification is ensuring that the
product has been built according to the requirements and design specifications. Validation
ensures that ‘you built the right thing’. Verification ensures that ‘you built it right’.
Validation confirms that the product, as provided, will fulfil its intended use.
6.2 Testing Methodologies
Software testing methods are traditionally divided into white and black-box testing. These
two approaches are used to describe the point of view that a test engineer takes when
designing test cases.
White-box testing is when the tester has access to the internal data structures and
algorithms including the code that implements these.
Black-box testing treats the software as a "black box"—without any knowledge of
internal implementation.
Grey-box testing involves having knowledge of internal data structures and algorithms
for purposes of designing tests, while executing those tests at the user, or black-box level.
6.3 Testing Levels
Tests are frequently grouped by where they are added in the software development process, or by the level of specificity of the test. The main levels during the development process are unit, integration, and system testing, distinguished by the test target without implying a specific process model.
6.3.1 Unit testing
Unit testing, also known as component testing, refers to tests that verify the functionality
of a specific section of code, usually at the function level. In an object-oriented
environment, this is usually at the class level, and the minimal unit tests include the
constructors and destructors.
These types of tests are usually written by developers as they work on code
(white-box style), to ensure that the specific function is working as expected. One
function might have multiple tests, to catch corner cases or other branches in the code.
Unit testing alone cannot verify the functionality of a piece of software, but rather is used
to assure that the building blocks the software uses work independently of each other.
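As a hedged illustration of such a developer-written test, assuming JUnit 4 is on the classpath, the sketch below unit-tests a small hypothetical helper that normalizes a symptom string; neither the helper nor the test is taken from the project code.

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class SymptomNormalizerTest {

    // Hypothetical helper under test: trims and lower-cases a symptom string.
    static String normalize(String symptom) {
        return symptom == null ? "" : symptom.trim().toLowerCase();
    }

    @Test
    public void trimsAndLowerCases() {
        assertEquals("leaf curl", normalize("  Leaf Curl "));
    }

    @Test
    public void handlesNullSafely() {
        assertEquals("", normalize(null));
    }
}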
6.3.2 Integration testing
Integration testing is any type of software testing that seeks to verify the interfaces
between components against a software design. Software components may be integrated
in an iterative way or all together ("big bang"). Normally the former is considered a better
practice since it allows interface issues to be localised more quickly and fixed. Integration
testing works to expose defects in the interfaces and interaction between integrated
components (modules). Progressively larger groups of tested software components
corresponding to elements of the architectural design are integrated and tested until the
software works as a system.
6.3.3 System testing
System testing tests a completely integrated system to verify that it meets its
requirements.
6.4 Unit Testing of Main Modules
Here different modules are tested independently and their functionality is checked. The following tables show the details of the unit test cases and the results obtained.
6.4.1 Unit testing for user
Test Case ID Unit Test Case 1
Description Unit testing when Hostname is entered correctly and Symptom is not entered
Input Hostname which is already present in the Hive database
Expected Output Error: enter empty fields
Actual Output Got the expected output
Remarks Test passed
Table 2. Unit Test Case 1
Test Case ID Unit Test Case 2
Description Unit testing when neither Hostname nor Symptom is entered
Input No input given
Expected Output Error: enter empty fields
Actual Output Got the expected output
Remarks Test passed
Table 3. Unit Test Case 2
Test Case ID Unit Test Case 3
Description Unit testing when both Hostname and Symptom are entered but are not available in the database
Input Hostname and Symptom of infected crop/fish
Expected Output No output; the same page is displayed
Actual Output Got the expected output
Remarks Test passed
Table 4. Unit Test Case 3
Test Case ID Unit Test Case 4
Description Unit testing when a valid Hostname and Symptom are given
Input Hostname and Symptom of infected crop/fish
Expected Output Disease and details of the virus infecting the crop/fish
Actual Output Got the expected output
Remarks Test passed
Table 5. Unit Test Case 4
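To relate the tables above to executable tests, the sketch below expresses Unit Test Cases 1, 2 and 4 as JUnit 4 tests against a hypothetical empty-field validator. The validator is a stand-in written for illustration, not the project's actual validation code; its error message mirrors the expected output listed in the tables.

import org.junit.Test;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

public class SearchFormValidatorTest {

    // Stand-in validator mirroring the web form's empty-field check.
    static String validate(String hostname, String symptom) {
        if (hostname == null || hostname.trim().isEmpty()
                || symptom == null || symptom.trim().isEmpty()) {
            return "Error: enter empty fields";
        }
        return null; // null means the input passes validation
    }

    @Test // Unit Test Case 1: Hostname given, Symptom missing
    public void hostnameWithoutSymptomIsRejected() {
        assertEquals("Error: enter empty fields", validate("Tomato", ""));
    }

    @Test // Unit Test Case 2: neither field given
    public void emptyFormIsRejected() {
        assertEquals("Error: enter empty fields", validate("", ""));
    }

    @Test // Unit Test Case 4: both fields given
    public void validInputIsAccepted() {
        assertNull(validate("Tomato", "leaf curl"));
    }
}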
Chapter 7
RESULTS
7.1 Snapshot of browsing HDFS:
Fig.19 Overview of HDFS setup
An overview of HDFS showing when the session was started, its version, and some ID information.
Fig.20 Summary of HDFS setup
Summary of the setup, showing the status of the various nodes.
Fig.21 NameNode status of HDFS
NameNode Status information.
Fig.22 Datanode information of HDFS
DataNode information, such as how many DataNodes have been created and how many are in service, is shown in this snapshot.
7.2 Snapshots of the Hadoop and hive interfaces:
Fig.23 Hadoop and Hive initializations
Hadoop and Hive tools are initialised.
Fig.24 Processes running on Hadoop user
The Hadoop processes running at the current time are shown.
Fig.25 Hive tables
The Hive tables present in the database are listed using the show tables; command.
Fig.26 Hosttable rows
The number of rows present in Hosttable is shown in this screenshot.
Fig.27 Query Execution on Hive table
The query that retrieves the details of a disease from the given hostname and symptom is shown in this snapshot.
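A query of this kind can also be issued programmatically from the Java application through the HiveServer2 JDBC driver. The hedged sketch below assumes HiveServer2 is running on its default local port; the table name hosttable follows Fig.26, while the column names disease and details are assumptions made for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveDiseaseQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 endpoint; credentials depend on the cluster setup.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hadoop", "");
        PreparedStatement ps = con.prepareStatement(
                "SELECT disease, details FROM hosttable "
              + "WHERE hostname = ? AND symptom = ?");
        ps.setString(1, "Tomato");
        ps.setString(2, "leaf curl");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1) + " : " + rs.getString(2));
        }
        rs.close();
        ps.close();
        con.close();
    }
}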
7.3 Snapshots of the project deployed on IntelliJ IDEA:
Fig.28 Project on IntelliJ
The project, named AgriHelper, is a web application. It is divided into two modules, Core and Web.
Fig.29 Core module of project
The snapshot shows the various Java programs written under the Core module.
Fig.30 Web module of project
The Web module contains the HTML, CSS, and JavaScript files related to the front-end development of the web application.
7.4 Snapshots of the web application/framework:
Fig.31 Front web page of the application
The basic home page of the web application is shown in the snapshot.
Fig.32 Searching for required data
In this snapshot, the data has been entered and the search button clicked; the notification indicating that the application is searching and waiting for results is shown.
Fig.33 Result
The result for the entered data is shown.
Chapter 8
CONCLUSION AND FUTURE WORK
Using the Hadoop and Hive tools, a big data analytics framework has been developed that handles agricultural crop and aquaculture fish disease problems. The developed web application is useful to farmers and researchers for recommending a solution based on highly similar symptoms. The developed big data analytics framework is location specific.
The recommended solutions are collected from various government institutions such as GKVK and IIHR. This application is useful to researchers working on crop virus diseases and general fish diseases.
Further, this project can be enhanced by converting the web application into an Android app.