CHAPTER 3
PERFORMANCE ANALYSIS OF EXISTING REGRESSION TECHNIQUES
3.1 FORECASTING PRODUCT DEMAND
Forecasting product demand is crucial to any supplier, manufacturer, or
retailer. Forecasts of future demand will determine the quantities that should be
purchased, produced and shipped. Demand forecasts are necessary since the basic
operations process, moving from the supplier's raw materials to finished goods in
the customer's hands, takes time. Firms should not simply wait for demand to
emerge and then react to it. Instead, they must anticipate and plan for future
demand so that they can react immediately to customer orders as they occur.
In other words, most manufacturers 'make to stock' rather than 'make to order':
they plan ahead and then deploy inventories of finished goods into field locations.
Thus, once a customer order materializes, it can be fulfilled immediately, since
most customers are not willing to wait for the time it would take to actually
process their order throughout the supply chain and make the product based on
their order. An order cycle could take weeks or months to go back through part
suppliers and sub-assemblers, through manufacture of the product, and through to
the eventual shipment of the order to the customer.
All firms forecast demand, but it would be difficult to find any two firms
that forecast demand in exactly the same way. Over the last few decades, many
different forecasting techniques have been developed in a number of different
application areas, including engineering and economics. Many such procedures
have been applied to the practical problem of forecasting demand in a logistics
system, with varying degrees of success. Most commercial software packages that
support demand forecasting in a logistics system include dozens of different
forecasting algorithms that the analyst can use to generate alternative demand
forecasts. While scores of different forecasting techniques exist, almost any
forecasting procedure can be broadly classified into one of the following four
basic categories based on the fundamental approach towards the forecasting
problem that is employed by the technique.
(i) Judgmental Approaches - The essence of the judgmental approach is to
address the forecasting issue by assuming that someone knows and can
tell the right answer. That is, judgment-based techniques gather the
knowledge and opinions of people who are in a position to know what
demand there will be. For example, a survey of the customer base may
be conducted to estimate what sales will be in the following months.
(ii) Experimental Approaches - Another approach to demand forecasting,
which is appealing when an item is "new" and when there is no other
information upon which to base a forecast, is to conduct a demand
experiment on a small group of customers and to extrapolate the results
to a larger population. For example, firms will often test a new consumer
product in a geographically isolated "test market" to establish its
probable market share. This experience is then extrapolated to the
national market to plan the new product launch. Experimental
approaches are very useful and necessary for new products, but for
existing products that have an accumulated historical demand record, it
seems intuitive that demand forecasts should somehow be based on this
demand experience. For most firms (with some very notable exceptions)
the large majority of Stock Keeping Units (SKUs) in the product line
have long demand histories.
(iii) Relational/Causal Approaches - The assumption behind a causal or
relational forecast is that there is a reason why people buy a product.
A demand forecast can be developed when the reasons for buying are
understood.
(iv) Time Series Approaches - A time series procedure is fundamentally
different from the first three approaches discussed. In a pure time series
technique, no judgment or expertise or opinion is sought. Causes or
relationships or factors which somehow 'drive' demand are not required.
Time series procedures are applied to demand data that are longitudinal
rather than cross-sectional. That is, the demand data represent experience
that is repeated over time rather than across items or locations. The
essence of the approach is to recognize (or assume) that demand occurs
over time in patterns that repeat themselves, at least approximately. If
these general patterns or tendencies are described without regard to their
"causes", then this description forms the basis of a forecast.
All forecasting procedures involve the analysis of historical experience into
patterns and the projection of those patterns into the future in the belief that the
future will somehow resemble the past. The differences in the four approaches are
in the way the "search for pattern" is conducted. Judgmental approaches rely on
the subjective, ad-hoc analyses of external individuals. Experimental tools
extrapolate results from small numbers of customers to large populations. Causal
methods search for reasons for demand. Time series techniques simply analyze the
demand data themselves to identify temporal patterns that emerge and persist.
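To make the time series idea concrete, the short sketch below applies simple exponential smoothing, one elementary way of projecting a repeating pattern forward; the monthly demand figures are invented for illustration and the smoothing constant is an arbitrary choice, so this is only a sketch, not a method prescribed later in this chapter.

```python
# Minimal sketch: simple exponential smoothing applied to a monthly demand
# history. The demand values and smoothing constant are illustrative only.
def exponential_smoothing(demand, alpha=0.3):
    """Return one-step-ahead forecasts; forecasts[t] predicts demand[t]."""
    forecasts = [demand[0]]              # initialise with the first observation
    for actual in demand[:-1]:
        prev = forecasts[-1]
        forecasts.append(alpha * actual + (1 - alpha) * prev)
    return forecasts

if __name__ == "__main__":
    history = [120, 132, 128, 141, 150, 149, 158]   # hypothetical monthly demand
    fitted = exponential_smoothing(history)
    next_month = 0.3 * history[-1] + 0.7 * fitted[-1]
    print("fitted:", [round(f, 1) for f in fitted])
    print("forecast for next month:", round(next_month, 1))
```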
3.2 SOFTWARE DEVELOPMENT EFFORT ESTIMATION
Software effort estimation is used to determine the amount of effort
necessary to complete a software project, in terms of its scheduling, the allocation
of resources and the meeting of budget requirements. Software defect prediction
strives to improve software quality and testing efficiency by constructing
predictive classification models from code attributes to enable a timely
identification of fault-prone modules. This is an essential activity in the software
project planning phase, because major problems usually surface in the first three
months of a software development project and are the result of hasty
scheduling, irrational commitments and unprofessional estimating techniques
(Marchewka, 2009). The accurate prediction of software development costs is a
critical issue for project managers, system analysts and developers, since it
underpins good management decisions and the determination of how much effort and
time a project requires. Estimation is defined as "the action of appraising,
assessing or valuing" or "the process of forming an approximate notion of
numbers, quantities, magnitudes, etc. without actual enumeration or measurement".
From these definitions it follows that the task of estimation is not easy to do precisely.
Many software cost estimation methods are available. No single method is
necessarily better or worse than the others; in fact, their strengths and
weaknesses are often complementary. Estimating the effort required for
software development is a challenging job that requires expertise, experience,
a good understanding of process, project management and metrics, and, most
importantly, the use of proper estimation models, tools and techniques.
Good software estimation models can significantly help the software project
manager and project stakeholders to make informed decisions about bidding,
project planning, resource management and delivering the project on time and
within budget.
3.2.1 SOFTWARE ESTIMATION TECHNIQUES
The Software Engineering Body Of Knowledge (SWEBOK) has identified
Knowledge Areas (KAs) such as software requirements, software design, software
construction, software testing, software maintenance, software configuration
management, software engineering management, software engineering process,
software engineering tools and methods, and software quality. Figure 3.1 explains
the Software Estimation process.
Figure 3.1 Software Estimation Process
The Software Engineering Management KA addresses management and
measurement, including Software Project Planning, which in turn addresses effort,
schedule and cost estimation. Based on the breakdown of tasks, inputs and
outputs, the expected effort range required for each task is determined using a
calibrated estimation model built on available historical size-effort data;
otherwise, a method such as expert judgment is applied.
Figure 3.2 depicts estimation techniques, methods, tools and their
categorization. The functional form of a software estimation model is determined
either by theory or by experimentation.
Figure 3.2: Software Estimation Techniques
SLIM is based on Putnam's analysis of the software lifecycle in terms of the
so-called Rayleigh distribution of project personnel versus time. In SLIM,
productivity is used to link the basic Rayleigh manpower distribution model to the
software characteristics of size and technology factors. Checkpoint is a
knowledge-based software project estimation tool. It has a proprietary database of about 8,000
software projects. It uses Function Points as its primary input and focuses on areas
that need to be managed to improve software quality and productivity. Checkpoint
predicts effort at four levels of granularity: project, phase, activity and task.
Estimates also include resources, deliverables, defects, costs and schedules.
The PRICE-S model was originally developed for use internally on software
projects that were part of the Apollo moon program. It consists of the following
three sub-models:
(i) Acquisition Sub-model : forecasts cost and schedule
(ii) Sizing Sub-model : facilitates estimating size
(iii) Life-cycle Sub-model : predicts cost of maintenance and support phase
Foresight 2.0 is the latest version released by PRICE Systems for estimating time,
effort and cost for commercial and non-military government software projects.
ESTIMACS stresses approaching the estimating task in business terms. It also
stresses the need to be able to do sensitivity and trade-off analysis early on, in
terms of staffing/cost estimates and associated risks. SEER-SEM has evolved into a
sophisticated tool supporting both top-down and bottom-up methodologies. Its
modeling equations are proprietary, but they take a parametric approach to
estimation. The scope of this model is wide: it covers all phases of the project
life-cycle, from early specification through design, development, delivery and
maintenance. It handles a variety of environmental and application configurations,
such as client-server, stand-alone, distributed, graphics, etc. It also supports
development modes such as object-oriented, reuse, COTS, spiral, waterfall,
prototype and incremental. It allows staff capability, required design and process
standards, and levels of acceptable development risk to be input as constraints.
SELECT Estimator was released in 1998. It is designed for large scale
distributed systems development. It is object oriented, basing its estimates on
business objects and components. It assumes an incremental development life-cycle.
The nature of its inputs allows the model to be used at any stage of the software
development life-cycle, most significantly even at the feasibility stage when little
detailed project information is known. In later stages, as more information is
available, its estimates become correspondingly more reliable. The actual estimation
technique is based upon ObjectMetrix, developed by Object Factory. It works by
measuring the size of a project, counting and classifying the software elements
within it. Applying the qualifier and technology adjustments to the base
metric effort for each project element produces an overall estimate of effort in
person-days, by activity. Using the total effort estimate, the schedule is
determined as a function of the number of developers, which is input as an
independent variable.
COCOMO cost and schedule estimation model was originally published in
1981. It became one of the most popular parametric cost estimation models of the
1980s. It has experienced difficulties in estimating the costs of software developed
by following new life-cycle processes and capabilities. The COCOMO II research
started in 1994 at USC to address the issues of non-sequential and rapid
development process models, reengineering, reuse driven approaches and object
oriented approaches. Delphi technique was developed at the Rand Corporation in
the late 1940s originally as a way of making predictions about future events.
More recently, the technique has been used as a means of guiding a group of
informed individuals to a consensus of opinion on some issue. Participants are
asked to make some assessment regarding an issue, individually in a preliminary
round, without consulting the other participants in the exercise. First round results
are then collected, tabulated and returned to each participant for a second round,
during which participants are again asked to make an assessment regarding the
same issue.
Abts and Boehm used the technique to estimate initial parameter values for
Effort Adjustment Factors appearing in the glue code effort estimation
components of the COCOTS integration model. This technique has been used by
Chulani and Boehm to estimate software defect introduction and removal rates
during various phases of the software development lifecycle. These factors appear
in COQUALMO (COnstructive QUALity MOdel), which predicts the residual
defect density in terms of number of defects/unit of size.
Learning-oriented techniques include both the oldest and the newest techniques
applied to estimation activities. The former are represented by case studies, among
the most traditional of manual techniques; the latter are represented by neural
networks, which attempt to automate improvements in the estimation process by
building models that "learn" from previous experience. Case studies represent inductive
learning, whereby estimators and planners try to learn useful general lessons and
estimation heuristics by extrapolation from specific examples. They examine in
detail elaborate studies describing the environmental conditions and constraints
that obtained during the development of previous projects, the technical and
managerial decisions that were made and final successes or failures that resulted.
Neural networks are the most common software estimation model-building
technique used as an alternative to mean least squares regression. These are
estimation models that can be trained on historical data to produce progressively
better results by automatically adjusting their algorithmic parameter values to
reduce the difference between known actual values and the model's predictions.
Dynamics-based
techniques explicitly acknowledge that software project effort or cost factors
change over the duration of the system development. Factors like deadlines,
staffing levels, design requirements, training needs, budget etc. all fluctuate over
the course of development and cause corresponding fluctuations in the
productivity of project personnel.
Regression-based techniques are used in conjunction with model-based
techniques and include standard regression, robust regression etc. Standard
regression refers to the classical statistical approach of general linear regression
modeling using least squares; this technique was used to calibrate
COCOMO II 1997. Robust regression alleviates the common problem of outliers
in observed software engineering data; the Least Median of Squares method falls
in this category. Parametric cost models such as COCOMO II, SLIM, Checkpoint,
etc. use some form of regression-based technique due to its simplicity and wide
acceptance.
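As an illustration of the standard regression approach, the sketch below calibrates a log-linear size-effort relationship (effort = a * size^b) by ordinary least squares; the size and effort figures are hypothetical, and numpy is assumed to be available.

```python
# Minimal sketch: calibrating a log-linear size-effort model
# effort = a * size^b by ordinary least squares. The data are hypothetical.
import numpy as np

size_kloc = np.array([5.0, 10.0, 20.0, 30.0, 65.0, 120.0])     # project sizes (KLOC)
effort_pm = np.array([18.0, 50.0, 60.0, 120.0, 300.0, 480.0])  # effort (person-months)

# Taking logs turns the power law into a straight line: log e = log a + b * log s
X = np.column_stack([np.ones_like(size_kloc), np.log(size_kloc)])
y = np.log(effort_pm)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = np.exp(coeffs[0]), coeffs[1]

print(f"calibrated model: effort = {a:.2f} * size^{b:.2f}")
print("predicted effort for 40 KLOC:", round(a * 40 ** b, 1), "person-months")
```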
Composite techniques incorporate a combination of two or more techniques
to formulate the most appropriate functional form for estimation. Bayesian
analysis is a mode of inductive reasoning that provides a formal process by which
a priori expert judgment can be combined with sampling data, in a logically
consistent manner, to produce a robust a posteriori model. This approach has been
used in COCOMO II. Estimating software development effort, however, remains a
complex problem. An accurate cost estimate of a software development effort is
critical for good management decision making. The precision and reliability of the
effort estimate are very important for the software industry because both
overestimates and underestimates of software effort are harmful to software
companies. As a result, from an organizational perspective, an early and accurate
cost estimate reduces the possibility of organizational conflict during the later
stages. Predicting software development effort with high precision is a
challenging task for project managers. An objective of the software engineering
community is to develop
useful models that accurately estimate the software effort (Xu et al., 2004).
Among the available techniques for software cost estimation, COCOMO
(COnstructive COst MOdel) is a widely used algorithmic cost modeling
technique because it is simple to estimate the effort in person-months for a project
at different stages. Figure 3.3 gives the different types of COCOMO Model.
Figure 3.3 Types of COCOMO Model
COCOMO uses mathematical formulae for predicting project estimates. The most
commonly adopted neural architecture for estimating software effort is the
feed-forward multilayer perceptron with the back-propagation learning algorithm
and the sigmoid activation function. One of the major drawbacks of
back-propagation learning, however, is slow convergence, largely because the
sigmoid activation function is used in the hidden and output layers. A network
with a sigmoidal function is also more sensitive to losing parameters, and the
selection of network patterns and learning rules may cause difficulties in
network performance and training, so the number of layers and nodes should be
minimized to improve performance. To overcome the above-mentioned limitations
and improve performance, the COCOMO model is taken as the
solution (Reddy, 2009). Some major models that are being
used as benchmarks for software effort estimation are as follows:
(i) Halstead
(ii) Walston-Felix
(iii) Bailey-Basili
(iv) Doty (for KLOC > 9)
These models have been derived from the study of a large number of completed
software projects from various organizations and applications, in order to explore
how project sizes map onto project effort. Still, these models are unable to
predict software effort accurately (Kaur et al., 2010).
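For comparison with the feed-forward multilayer perceptron described above, a minimal sketch is given below; scikit-learn's MLPRegressor with a logistic (sigmoid) activation is used purely as a convenient stand-in, and the single-feature dataset, network size and hyperparameters are illustrative assumptions rather than the configuration used in this work.

```python
# Minimal sketch: a feed-forward multilayer perceptron (sigmoid activation)
# mapping project size (KLOC) to effort (person-months).
# Data, layer sizes and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

size_kloc = np.array([[5.0], [10.0], [20.0], [30.0], [65.0], [120.0]])
effort_pm = np.array([18.0, 50.0, 60.0, 120.0, 300.0, 480.0])

scaler = StandardScaler()
X = scaler.fit_transform(size_kloc)

mlp = MLPRegressor(hidden_layer_sizes=(4,), activation="logistic",
                   solver="lbfgs", max_iter=5000, random_state=0)
mlp.fit(X, effort_pm)

query = scaler.transform([[40.0]])
print("predicted effort for 40 KLOC:", round(float(mlp.predict(query)[0]), 1))
```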
3.2.2 SOFTWARE METRICS
Software metrics are used for defect prediction. Some of the commonly
used metrics are LOC (Lines Of Code), N (length), V (volume), E (effort),
D (difficulty), CC (cyclomatic complexity), EC (essential complexity) and
DC (design complexity).
Conte (1986) has defined the LOC as: A line of code is any line of program text
that is not a comment or a blank line, regardless of the number of statements or
fragments of statements on the line. This specifically includes all lines containing
program headers, declarations, and executable and non-executable statements.
Number of Lines of Code (NLOC) can be represented as Number of Delivered
Source Instructions (NDSI) and number of Thousands of Delivered Source
Instructions (KDSI). Halstead's model uses various parameters, such as the
number of distinct operators (instruction types, keywords, etc.) in a program,
denoted as n1; the number of distinct operands (variables and constants) denoted
as n2; the total number of occurrences of the operators denoted as N1; and the
total number of occurrences of the operands represented as N2. The summation of
n1 and n2 is denoted as n while the sum of N1 and N2 is called N.
Using these four quantities many useful measures have been obtained.
The number of bits needed to represent a program is defined as the volume V of
the program and is calculated by using the equation (3.1)
V = N log2 n (3.1)
The level of the program denotes the difficulty of understanding a program and it
is computed by equation (3.2)
L = (2 × n2) / (n1 × N2) (3.2)
The intelligence content of a program is given in equation (3.3)
I = L × V (3.3)
Estimated Program Length is defined by equation (3.4)
N̂ = n1 log2 n1 + n2 log2 n2 (3.4)
In an attempt to add some of the psychological aspects of complexity to the
measures, Halstead used the cognitive processes related to the perception and
retention of simple stimuli. The idea has been to represent the mean number of
mental discriminations per second in an average human being as the Stroud
number, which lies in the range from 5 to 20; Halstead used this idea, taking 18
as a reference point in his research. The number of discriminations made in the
preparation of a program is called the effort and is calculated by equation (3.5)
E = V / L (3.5)
The programming time is denoted as T, which is an estimate of the number
of mental discriminations needed to complete a program divided by the average
number of discriminations per second, i.e. the Stroud number S. This estimate
assumes that the programmer devotes all of his or her discriminations to the
programming task. The programming time is represented as
Reasonable time, T = E / S
The difficulty of programming is defined by
Difficulty = 1 / (language level)
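To make the definitions above concrete, the sketch below computes the Halstead measures from the four basic counts n1, n2, N1 and N2; the example counts are invented for illustration, and the Stroud number is taken as 18 following the text.

```python
# Minimal sketch: Halstead measures computed from the four basic counts.
# The example counts are invented for illustration.
import math

def halstead(n1, n2, N1, N2, stroud=18):
    n = n1 + n2                      # program vocabulary
    N = N1 + N2                      # program length
    V = N * math.log2(n)             # volume, eq. (3.1)
    L = (2 * n2) / (n1 * N2)         # program level, eq. (3.2)
    I = L * V                        # intelligence content, eq. (3.3)
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)   # estimated length, eq. (3.4)
    E = V / L                        # effort, eq. (3.5)
    T = E / stroud                   # programming time using Stroud number S = 18
    return {"volume": V, "level": L, "intelligence": I,
            "estimated_length": N_hat, "effort": E, "time": T}

if __name__ == "__main__":
    # e.g. 10 distinct operators, 15 distinct operands, 60 and 40 total occurrences
    for name, value in halstead(10, 15, 60, 40).items():
        print(f"{name}: {value:.2f}")
```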
In the technical report of IEEE (1990), software complexity is defined as
the degree to which a system or component has a design or implementation that is
difficult to understand and verify. Cyclomatic complexity is a graph-oriented metric
developed by Thomas J. McCabe in 1976 (McCabe, T. J., 1976). The fundamental
assumption is that software complexity is intimately related to the
number of control paths generated by the code. This metric can be defined in two
equivalent ways. As an example, cyclomatic complexity can be computed as
Complexity = the number of decision statements in a program + 1
Otherwise, in a graph G with n vertices, e edges and p connected
components, the complexity is defined by equation (3.6)
v(G) = e - n + 2p (3.6)
Finally, the number of branches can be counted from the graph, and the McCabe
complexity C can be defined by equation (3.7)
C = Σ (degree(i) - 1) + 1, summed over the n nodes of the graph (3.7)
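As an illustration of equation (3.6), the sketch below computes v(G) = e - n + 2p for a small control-flow graph represented as an adjacency list; the graph itself is invented for the example.

```python
# Minimal sketch: cyclomatic complexity v(G) = e - n + 2p for a control-flow
# graph given as an adjacency list. The example graph is invented.
def cyclomatic_complexity(graph, p=1):
    n = len(graph)                                              # number of vertices
    e = sum(len(successors) for successors in graph.values())   # number of edges
    return e - n + 2 * p

if __name__ == "__main__":
    # an if/else followed by a while loop: two decision nodes -> v(G) = 3
    cfg = {
        "entry": ["if"],
        "if":    ["then", "else"],
        "then":  ["while"],
        "else":  ["while"],
        "while": ["body", "exit"],
        "body":  ["while"],
        "exit":  [],
    }
    print("v(G) =", cyclomatic_complexity(cfg))
```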
A limitation of McCabe complexity is that the complexity of an
expression within a conditional statement is never acknowledged. No penalty is
applied for nested loops versus a series of single loops; both are represented
with the same complexity. Complexity can also be classified as essential or
accidental. The complexity of software is largely an essential property, not an
accidental one; hence, descriptions of a software entity that abstract away its
complexity often abstract away its essence. Essential complexity arises from the
nature of the problem and how deep a skill set is required to understand it
clearly. Accidental complexity arises from poor attempts made to solve the
problem and may be equivalent to what some call complication. Applying the
wrong design pattern or selecting an inappropriate data structure increases the
accidental complexity of a problem. Software design complexity denotes the
mapping of the problem domain into a given representation, and procedural
complexity is related to the logical structure of a program (Da-Wei, 2007).
3.3 PULPWOOD
Forests and forest products play a significant role in the socio-economic
development of the country. Forests provide various services which include direct
and indirect benefits. Since independence, the country's forest area has been under
pressure due to industrialization, urbanization and the associated development of
science and technology, which has resulted in forest cover of 23.81% against the
mandatory requirement of 33%. The pace of development in the country has also
accelerated the demand for forest products, which has ushered in a major gap
between demand and supply. Hence, wood based industries in the country
have been directed to generate their own raw material without depending on forest
department supply. Pulpwood is wood used for pulping, and "wood pulp" is pulp
made from wood. Pulp for papermaking is produced mostly from wood fibres,
which contain many different chemical substances, viz. cellulose, hemicellulose,
lignin and extractives. Wood density is widely regarded as the most important
controlling factor for pulp and paper quality.
Globally, paper consumption has increased by a factor of 20 this century
and has more than tripled over the past 30 years (Grieg‐Gran et al., 1997).
In the developing world, paper consumption is growing rapidly, too, but average
per capita consumption is still just 17.5 kg/year. This is well below the 30-40 kg of
paper per capita per year considered the minimum level necessary to meet basic
needs for communication and literacy. However, total paper and paperboard
consumption in Asia already exceeds that in Europe and is projected to grow
3-4 percent per year until 2010 as income and population increase. Such a rate of
increase would eventually make the region the biggest paper consumer in the
world.
Figure 3.4 Pulpwood stacked in the Processing Yard
Figure 3.4 shows how pulpwood is stacked in the processing yard for
processing. In Asia, the per capita consumption of paper during 2008 was 7 kg
per annum. The Indian paper industry is poised to grow at 8 per cent a year and to
touch 11.5 million tonnes in 2011-2012 from 9.18 million tonnes in 2009-2010. In
India, per capita consumption increased to 9.18 kilograms in 2009-10 compared to
8.3 kilograms during 2008-2009 (FAO, 2010). The consumption of paper is
directly tied to the growth of the economy. With economic growth, the use of
paper has risen tremendously in areas such as packaging, education and
documentation. Hence, there is a growing demand to raise quality
pulpwood through organised plantations.
a. Pulp and Paper Industry in India
The paper industry in India is more than a century old. The Indian paper
industry has a highly fragmented structure consisting of small, medium and large
sized paper mills having capacities ranging from 10 to 1150 tons per day.
The geographical spread of the industry, as well as of the market, is mainly responsible
for the regional balance of production and consumption. The Indian Paper Industry is
among the top 12 global players today, with an output of more than 13.5 million
tonnes per annum with an estimated turnover of Rs. 35000 Crores. About 850
small, large and medium paper mills operate in India with a combined annual
capacity of 12.7 million tons requiring about 9.83 million tons of wood per year.
Thirty-one percent of the paper industry's output comes from the top 26 paper mills,
which are fully or partially wood based (Kulkarni, 2013), and the remaining mills
are based on non-conventional raw materials (agro residues and recycled fibre,
i.e. waste paper). These industries manufacture various types of paper materials
required for different purposes. Today the paper industry in India is in search of
technologically advanced methods to reduce the cost of production and augment
the existing technologies to meet the global standards. The government of India
has introduced various rules and regulations to encourage joint ventures and
investments in this field. Many old mills are under revival and/or new greenfield
projects are under consideration. The lack of sustained availability of good quality
raw material is one of the major factors inhibiting the growth of the paper industry
(Mathur et al., 2009).
b. Indian Paper industry growth
The installed capacity of the Indian paper and paperboard industry is 12.7 million tpa,
and production is 10.11 million tpa, constituting about 2.6% of total world
paper production. The consumption of paper and paperboard is 11.15 million tpa
(including newsprint), which includes 1.04 million tpa of imported paper.
The industry provides employment to more than 0.37 million people directly and
1.3 million people indirectly. The estimated turnover of the industry is Rs 35,000 crore
and its contribution to the exchequer is around Rs.3000 crore. India is the fastest
growing market for paper globally at 8% per annum. Paper consumption is poised
for a big leap forward and is estimated to touch 13.95 million tons by 2015-16.
By 2025, total consumption will be about 24 million tpa and per capita
consumption about 17 kg, against the present consumption of 9.3 kg. An increase in
consumption of one kg per capita would lead to an increase in demand of
1 million tonnes. The production and consumption of paper are depicted in Figure 3.5.
Figure 3.5: Indian paper industry growth
Prospects of the paper industry - Production and Consumption
The forecast for consumption of paper has been derived considering two
alternate scenarios. Scenario 1: the past trend in the growth of consumption has been
used as the basis to determine the growth trend in the 12th Five Year Plan (2012-17),
and the forecast for the next 15 years has been made on that basis. Scenario 2:
the consumption forecast has been made based on the following assumptions. For
writing paper, the elasticity of consumption has been taken at 0.9. Taking GDP
growth at 9% during 2012-17 and beyond, the growth of demand for writing paper
has been assumed at 8.1% per annum. With the universalisation of education and an
increase in the period spent on education, the elasticity of consumption of writing
paper could be higher than one. However, considering the increasing access to the
internet and the substitution of writing/printing material by electronic media, and
despite a lower per capita consumption relative to other countries, the elasticity of
consumption has been taken at 0.9. For packaging paper, the tracking variable is
the likely manufacturing growth. Since the share of the manufacturing sector is
proposed to be increased from the existing 16% to 25% in the next 10 years,
manufacturing growth is expected to remain higher than GDP growth. The
approach paper to the 12th Five Year Plan has taken manufacturing growth of 9.8%
as the base case scenario. A growth of 10% in packaging paper is expected. For
newsprint, the average annual growth in the first two years is taken at 7%. In
subsequent years, the growth has been taken assuming an elasticity of consumption
of 0.9, or growth of 8.1% per annum. Based on the above assumptions, the expected
pattern of paper consumption emerges as shown in Table 3.1.
Table 3.1: Projected consumption of Paper (Million Tonnes)

Year      Writing paper  Packaging paper  Newsprint  Total consumption  Baseline scenario
2010-11   4.0            5.4              1.7        11.2               11.2
2011-12   4.3            5.9              1.8        12.0               12.1
2012-13   4.6            6.4              1.9        13.0               13.0
2013-14   5.0            7.1              2.1        14.2               13.8
2014-15   5.4            7.8              2.2        15.4               14.7
2015-16   5.8            8.6              2.4        16.8               15.6
2016-17   6.3            9.4              2.6        18.4               16.5
2021-22   9.3            15.2             3.9        28.4               21.8
2024-25   11.8           20.2             4.9        36.9               23.5
2026-27   13.8           24.5             5.7        43.9               25.3
Overall, the paper consumption in the baseline scenario is projected to
increase to 16.5 million tonnes in 2016-17 and to reach 25.3 million tonnes in 2026-27.
In the alternative scenario, which appears to be more realistic, consumption
increases to 18.4 million tonnes in 2016-17 (the terminal year of the 12th plan) and to
43.9 million tonnes in 2026-27. The estimates for production during the 12th Five
Year Plan (2012-17) and over the next 15 years have also been derived for both
alternate scenarios. Estimates of the production of various paper grades based on
wood, agro residues and recycled paper have also been projected. In the baseline
scenario, it is assumed that the growth in availability of raw material will continue
to be the same as in the past. In Scenario 2, the following growth rates are assumed
for the three alternate raw material sources:
i) Wood based sector: the availability is projected to increase at an annual rate of 8%.
ii) Recycled paper: the growth in availability of recycled paper is assumed at 10%.
Initiatives have been proposed for an increased availability of used paper for
recycling. Based on the above assumptions, the paper production in the baseline
scenario and the alternative scenario is indicated in Table 3.2 (a sketch of the
compounding involved follows this list).
iii) Agro based sector: the projected growth assumed is also 8%. This growth would,
however, be feasible provided technology is developed for the use of rice straw in
paper making, particularly for packaging paper, and also assuming that bagasse
will be available for the paper industry.
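A minimal sketch of the compounding arithmetic behind such projections is given below; it simply applies the assumed annual growth rates (8% for wood, 8% for agro, 10% for recycled paper) to the 2010-11 base-year production, which reproduces the Scenario 2 figures of Table 3.2 approximately and is intended only as an illustration of the calculation.

```python
# Minimal sketch: compounding the assumed annual growth rates from the
# 2010-11 base-year production (million tonnes) to later years.
BASE_2010_11 = {"wood": 3.2, "agro": 2.2, "recycled": 4.7}
GROWTH = {"wood": 0.08, "agro": 0.08, "recycled": 0.10}

def project(years_ahead):
    """Production after `years_ahead` years of compound growth."""
    return {src: round(BASE_2010_11[src] * (1 + GROWTH[src]) ** years_ahead, 1)
            for src in BASE_2010_11}

if __name__ == "__main__":
    for label, years in [("2016-17", 6), ("2026-27", 16)]:
        p = project(years)
        print(label, p, "total:", round(sum(p.values()), 1))
```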
Table 3.2: Projected production of Paper (Million Tonnes)

Year      Wood resources  Agro based paper  Recycled paper  Total production  Baseline production
2010-11   3.2             2.2               4.7             10.1              10.1
2011-12   3.4             2.3               5.1             10.9              10.9
2012-13   3.7             2.5               5.7             11.8              11.7
2013-14   4.0             2.7               6.2             12.9              12.5
2014-15   4.3             2.9               6.8             14.1              13.3
2015-16   4.6             3.2               7.5             15.3              14.1
2016-17   5.0             3.4               8.3             16.7              14.8
2021-22   8.0             5.4               14.7            28.0              19.6
2024-25   9.3             6.3               17.8            33.4              22.0
2026-27   10.8            7.4               21.5            39.7              23.5
c. Status of raw material
The major challenge is access to raw materials, including wood, agro
residues like bagasse, wheat straw, rice husk and reused paper, at economical
prices. The cost of these raw materials is rising sharply. Wood prices in India have
risen by as much as 25-40 per cent in the last six months. This is a strain on the
industry's margins and cripples its ability to 'plough back' for enhancing
capacity. In the early seventies, the share of wood based raw material was 84%,
whereas waste paper based was 9% and agro based was 7%. Due to the scarce
availability of wood based raw material, the share of recycled waste paper and agro
based raw material has increased remarkably, and the share in production is
furnished in Table 3.3.
Table 3.3: Wood, Recycled and Agro based Mills production status

S. No  Segment     No of Mills  Production (M tpa)  Percent
1      Wood        26           3.19                31
2      Recycled    674          4.72                47
3      Agro waste  150          2.20                22
       Total       850          10.11               100
d. Wood based Mills
There are 30 large integrated paper mills located in India. The major raw
materials are hard wood species / bamboo. The production of wood based raw
material is about 3.1 million tons/annum, which contributes about 31% of the total
production. The present consumption of wood as raw material is 9 million tons per
annum, which is about 75% of the wood demand; it is being met through
farm/social forestry sources. Future demand will require an additional 12 million
tons by the year 2025. The average growing stock in forest areas is very low at
62 m3 per ha, with a poor mean annual increment in the range of 1-5 m3 per ha per
year. The poor increments, extremely low sustainable yields and increasing demand
have led to a growing shortage of wood resources. Modernization, growth and
expansion of wood based industries have suffered for want of sustained supplies of
industrial wood at reasonable prices. Considering a yield of 50 tons of wood per
hectare with a felling cycle of 5 years, approximately 2.5 million hectares of
land need to be covered under pulpwood plantations.
e. Recycled fibre based mills
In India, more than two thirds of the mills use recycled/waste paper as the
primary fibre source. Recycled paper contributes about 4.72 million tons per
annum, or 47% of the country's total production of paper/paperboard and
newsprint. Nearly 1.33 tons of recycled/waste paper is required to produce one ton
of paper. Currently, the availability of recycled or waste paper is a challenge.
Every year 1 million tons of waste paper is received, but the collection of waste
paper is not organized and waste disposal is also not systematized in the country.
Recycled or waste paper is best suited for end products like newsprint, duplex
board and kraft paper. Processing of recycled or waste paper to obtain clean stock
for paper making involves a number of cleaning stages to remove contaminants
present in the waste paper, such as iron clips, latex, wax, inks, etc., which require
an appropriate process configuration with state of the art technologies to produce
clean stock. The majority of the paper mills are lacking in state of the art
processing technologies. Table 3.4 depicts the status of availability of
recycled/waste paper.
Table 3.4: Status of availability of Recycled / waste paper

S. No  Particulars                                                     Million tonnes  % Share
1      Indigenously recovered waste paper (27% of total consumption)  3.00            43.00
2      Waste paper import                                              4.00            57.00
       Total                                                           7.00            100.00
f. Agro residue based mills
There are about 150 paper mills producing paper from agro residues in
India. Bagasse and straws are used as the major raw materials, which contribute
22% of the total production (2.2 million tons/annum). To produce 1 ton of paper,
nearly 2.5 tons Oven Dry (OD) of bagasse or 2.3 tons OD of wheat straw is
required. The sustainable availability of agro residues is a challenge.
The quality and environmental issues associated with manufacturing, the
cogeneration of power from bagasse by sugar mills, and increasing coal prices
have forced the agro mills to shift to wood fibre. The projected growth in the agro
based sector is assumed at only 8%. Table 3.5 shows the availability of agro based
raw materials. This growth would, however, be feasible provided technology is
developed for the use of rice straw in paper making, particularly for packaging
paper, and also assuming that bagasse will be available for the paper industry.
Variety wise production of paper from different raw materials is depicted in
Table 3.6.
Table 3.5: Availability of Agro based raw materials (Million Tonnes)

S. No  Particulars         Bagasse  Wheat straw  Rice straw  Jute/Kenaf  Total
1      Gross availability  53.0     115.0        58.0        3.0         229.00
2      Net availability    5.2      2.6          16.0        0.5         24.30
Table 3.6: Variety wise production of paper from different raw materials (2010-11) (Million Tonnes)

Variety of Paper   Wood based  Agro based  Recycled / Waste paper based  Total
Writing/Printing   2.36        0.73        0.81                          3.90
Packaging          0.77        1.50        3.15                          5.42
Newsprint          0.03        -           0.76                          0.79
Grand Total        3.16        2.23        4.72                          10.11
g. The paper production process
The process whereby timber is converted into paper involves six steps.
The first four steps convert the logs into a mass of cellulose fibres with some
residual lignin using a mixture of physical and chemical processes. This pulp is
then bleached to remove the remaining lignin and finally spread out into smooth,
pressed sheets. For some papers (e.g. cardboards and 'brown paper') the bleaching
step is not needed, but all white and coloured papers require bleaching.
h. Wood preparation
The logs have their bark removed, either by passing through a drum
debarker or by being treated in a hydraulic debarker. The removed bark is a good
fuel, and is normally burnt in a boiler for generating steam. After debarking, the
logs are chipped by multi knife chippers into suitable sized pieces, and are then
screened to remove overlarge chips.
i. Pulping process
Pulping of wood can be done in two ways: mechanically or chemically.
Mechanical pulp
It was first developed in the early 1800s and is used today for newsprint.
Wood is mechanically ground to produce fibers for paper pulp. Grinding creates
very short paper fibres, which are also highly acidic due to the retention of the
wood's lignin. Lignin is a naturally occurring substance in wood that darkens and
breaks down into acidic by-products as it ages. Ground wood pulp paper is born
acidic and rapidly becomes brittle; therefore, ground wood pulp paper has a
relatively short life expectancy. In the case of mechanical pulp, the wood is
processed into fibre form by grinding it against a quickly rotating stone with the
addition of water. The yield of this pulp amounts to approximately 95%. The result is
called wood pulp or MP – mechanical pulp. The disadvantage of this type of pulp
is that the fibre is strongly damaged and that there are all sorts of impurities in the
pulp mass. Mechanical wood pulp yields a high opacity, but it is not very strong.
It has a yellowish colour and low light resistance. Figure 3.6 (www.knowpulp.com)
shows the pulping process.
Figure 3.6: Pulping process
Chemical pulp
Chemical pulp is called soda, sulfite, sulfate or kraft paper, depending upon
how it is processed; chemical wood pulp paper was first developed in the mid-
1800s. Chemical wood pulp paper is used today in printer and notebook paper. For
the production of wood pulp, the pure fibre has to be set free, which means that the
lignin has to be removed as well. To achieve this, the wood chips are cooked in a
chemical solution. In case of wood pulp obtained by means of chemical pulping,
we differentiate between sulphate and sulphite pulp, depending on the chemicals
used. The yield of chemical pulping amounts to approximately 50%. The fibres in
the resulting pulp are very clean and undamaged. It is this type of pulp which is
used for all Sappi fine papers. The sulphate process is an alkaline process.
It allows for the processing of strongly resinous wood types, but this requires
expensive installations and intensive use of chemicals. The sulphite process
utilises a cooking acid consisting of a combination of free sulphurous acid and
sulphurous acid bound as magnesium bisulphite (the magnesium bisulphite process).
The 'cooking process' is where the main part of the delignification takes
place. Here the chips are mixed with "white liquor" (a solution of sodium
hydroxide and sodium sulphide), heated to increase the reaction rate and then
disintegrated into fibres by 'blowing' – subjecting them to a sudden decrease in
pressure. Typically some 150 kg of NaOH and 50 kg of Na2S are required per
tonne of dry wood. This process is, like any chemical reaction, affected by time,
temperature and concentration of chemical reactants. Time and temperature can be
traded off against each other to a certain extent, but to achieve reasonable cooking
times, it is necessary to have a temperature of about 150-165 °C, so pressure
cookers are used. However, if the temperature is too high then the chips are
delignified unevenly, so a balance must be achieved. The dilute liquor from the
pulp washing (containing the dissolved inorganic and organic solids) is called
'black liquor'. The dissolved organics have to be removed for environmental
reasons and their burning also generates most of the heat energy required by the
kraft mill. The dissolved sodium hydroxide and sodium sulphide are regenerated
so that they can be reused in the white liquor, and thus the escape of an
environmental pollutant is prevented.
Pulp washing
Because of the high amounts of chemicals used in the cooking wood in
kraft pulping, the recovery of the chemicals is of crucial importance. The process
where the chemicals are separated from the cooked pulp is called pulp washing.
A good removal of chemicals (inorganic and organic) is necessary for several
reasons:
(i) The dissolved chemicals interfere with the downstream processing of the pulp.
(ii) The chemicals are expensive to replace.
(iii) The chemicals (especially the dissolved lignin) are detrimental to the
environment.
There are many types of machinery used for pulp washing. Most of them
rely on displacing the dissolved solids (inorganic and organic) in a pulp mat by hot
water, but some use pressing to squeeze out the chemicals with the liquid. An old,
but still common method is to use a drum, covered by a wire mesh, which rotates
in a diluted suspension of the fibres. The fibres form a mat on the drum, and
showers of hot water are then sprayed onto the fibre mat.
Pulp screening
Apart from fibres, the cooked pulp also contains partially uncooked fibre
bundles and knots. Modern cooking processes (together with good chip screening
to achieve consistent chip thickness) have good control over the delignification
and produce less 'rejects'. Knots and shives are removed by passing the pulp over
pulp screens equipped with fine holes or slots.
Bleaching
Bleaching is done in two stages. Firstly the pulp is treated with NaOH in
the presence of O2. The NaOH removes hydrogen ions from the lignin and then
the O2 breaks down the polymer. Then the pulp is treated with ClO2, then with a
mixture of NaOH, O2 and peroxide, and finally with ClO2 again to remove the
remaining lignin.
Paper Making
Paper making is the process whereby pulp fibres are mechanically and
chemically treated, formed into a dilute suspension, spread over a mesh surface,
the water removed by suction, and the resulting pad of cellulose fibres pressed and
dried to form paper. The mechanical treatment of the fibre normally takes place by
passing it between moving steel bars which are attached to revolving metal discs -
the so-called refiners. This treatment has two effects: it shortens the fibre (fibre
cutting) and it fibrillates the fibre. The latter action increases the surface area, and
as the fibres bond together in the paper sheet by hydrogen bonding, the increased
surface area greatly increases the bonding and strength of the paper. Paper strength
is dependent on the individual fibre strength and the strength of the bonds between
the fibres. It is usually the latter, which is the limiting factor. Refining increases
the interfibre bonding at the expense of the individual fibre strength, but the net
result will be an increase in paper strength. Pressing and calendering (feeding
through rollers) increase density and promote smoothness.
Tamil Nadu Newsprint and Papers Ltd (TNPL)
Tamil Nadu Newsprint and Papers Limited (TNPL) was established by the
Government of Tamil Nadu during the early eighties to produce newsprint and
printing & writing paper using bagasse, a sugarcane residue, as the primary raw
material. The Company commenced production in the year 1984 with an initial
capacity of 90,000 tonnes per annum (tpa). Over the years, the production capacity
has increased to 2,45,000 tpa and the Company has emerged as the largest bagasse
based Paper Mill in the world consuming about one million tonnes of bagasse
every year. TNPL exports about 1/5th of its production to more than 50 countries.
The manufacture of quality paper from bagasse for the past two and a half decades
is an index of the company's technological competence. A strong record in adopting
minimum impact best process technology, responsible waste management,
reduced pollution load and commitment to the corporate social responsibility make
the company one of the most environmentally compliant paper mills in the world.
The primary data for the past 10 years were collected from TNPL, and the
forecasting is carried out on these data.
3.4 METHODOLOGY
In this chapter, it is proposed to investigate the performance of popular
regression algorithms for predicting pulpwood demand and for estimating effort in
software development. For effort estimation, the COnstructive COst MOdel
(COCOMO) dataset is used, as this helps to benchmark the present investigations
against other works in the literature. Two popular regression techniques, M5 and
linear regression, are used for predicting the output for the two datasets.
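A minimal sketch of this experimental setup is shown below. Since M5 model trees are not part of scikit-learn, a decision-tree regressor is used here purely as a rough stand-in for M5; the tiny synthetic dataset, the numeric encoding of one cost driver and the evaluation choice (5-fold cross-validated mean absolute error) are all illustrative assumptions, not the configuration reported later.

```python
# Minimal sketch of the experimental setup: linear regression and a tree-based
# regressor (a rough stand-in for M5) compared on a synthetic effort dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
loc = rng.uniform(2, 150, size=40)                  # project size in KLOC
cplx = rng.uniform(0.7, 1.4, size=40)               # one numerically encoded cost driver
effort = 3.0 * loc ** 1.05 * cplx * rng.normal(1.0, 0.1, size=40)

X = np.column_stack([loc, cplx])
models = {
    "Linear regression": LinearRegression(),
    "Tree regressor (M5 stand-in)": DecisionTreeRegressor(max_depth=4, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    mae = -cross_val_score(model, X, effort, cv=cv,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: mean absolute error = {mae:.1f} person-months")
```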
3.4.1 CONSTRUCTIVE COST MODEL (COCOMO) DATA SET
In the early stages of a software development life cycle, effort
estimation plays a critical role in helping project managers identify the demands
of a software development project with respect to budgeting, scheduling, and
allocation of resources. The most significant effort estimation models, which are
used in software development projects, are the COCOMO, the System Evaluation
and Estimation of Resource Software Evaluation Model (SEER-SEM), and the
Software Life Cycle Management (SLIM) model. COCOMO, which was developed
by Barry Boehm in the 1980s, is the most popular and most widely used
estimation model for software projects. COCOMO estimates the amount of
software project effort based on the scale and cost factors of a software project
(Manalif, 2013). The COCOMO model is selected for the following reasons:
(i) It has a long history, from its original version to its most recent version.
(ii) It is detailed and well documented.
(iii) Its datasets are available to the public in the PROMISE repository.
(iv) It provides commercial implementations such as Costar.
COCOMO is a regression based software cost estimation model developed
by Barry Boehm. It was first published in the year 1981 (hence named COCOMO 81),
and COCOMO II is the latest extension of the original COCOMO (COCOMO 81).
Some differences between COCOMO 81 and COCOMO II are as follows:
(i) COCOMO 81 consists of 63 data points; it uses Kilo Delivered
Source Instructions (KDSI) to measure project size, and three
development modes are represented by scale factors.
(ii) In contrast, COCOMO II consists of 161 data points; it uses KSLOC
for project size and five scale factors (Jodpimai, 2009).
COCOMO II consists of three different models, which differ from the
COCOMO 81 model, as follows:
(i) Application Composition Model: This model is suited for projects
that are built with modern GUI-builder tools and is based on object points.
(ii) Early Design Model: This model helps to get a rough estimate
of the project cost and its duration before the architecture is fully
determined. It uses a set of new cost drivers and an estimating equation, and
is based on KSLOC or unadjusted function points.
(iii) Post-Architecture Model: This model is used only after the
architecture has been designed. Here LOC is used as the size estimate, and the
model covers the actual development and maintenance of a software product.
COCOMO II takes as input a set of seventeen Effort Multipliers (EM), or
cost drivers, which are used to adjust the nominal effort to reflect the software
product being developed. The seventeen COCOMO II factors (cost drivers) are
shown in Table 3.7.
Table 3.7: COCOMO II Cost drivers
Cost Driver Description Value
RELY Required Software Reliability 1.1
DATA Data base size 1
RUSE Developed for Reusability 1
DOCU Documentation needs 1.23
CPLX Product Complexity 1.34
TIME Execution Time Constraints 1.29
STOR Main storage Constraints 1.05
PVOL Platform Volatility 0.87
ACAP Analyst Capability 0.71
PCAP Programmer Capability 0.88
APEX Application Experience 0.81
PLEX Platform Experience 0.85
LTEX Language and Tool Experience 0.84
PCON Personnel Continuity 0.9
TOOL Use of Software Tools 0.78
SITE Multisite Development 0.86
SCED Required Development Schedule 1
COCOMO II includes 17 cost drivers in the Post-Architecture model, and the effort
is given as follows:
Effort = A × [Size]^B × ∏(i = 1 to 17) EffortMultiplier_i (3.8)
where
B = 1.01 + 0.01 × Σ(j = 1 to 5) ScaleFactor_j (3.9)
A: multiplicative constant
Size: the size of the software project, measured in terms of KSLOC.
The selection of a Scale Factor (SF) is based on the rationale that it is a significant
source of exponential variation in a project's effort (Attarzadeh, 2010).
Scale Factors
This leads to the conclusion that the most critical input to the COCOMO II
model is size, so a good size estimate is very important for any good model
estimate. Size in COCOMO II is treated as a special cost driver, so it has an
exponential factor, E (Musilek et al., 2002). The exponent E in the equation is an
aggregation of five scale factors. All scale factors have rating levels. These rating
levels are Very Low (VL), Low (L), Nominal (N), High (H), Very High (VH) and
Extra High (XH). Each rating level has a weight W, which is a quantitative value
used in the COCOMO II model. The five COCOMO II scale factors are shown in
Table 3.8:
E = B + 0.01 × Σ(j = 1 to N) SF_j (3.10)
where B is a constant = 0.91. A and B are constant values devised by the
COCOMO team by calibrating to the actual effort values for the 161 projects
currently in COCOMO II database.
Table 3.8: COCOMO II Scale factors (Yahya 2010)

Scale Factor                    Description                                                                  Value
Precedentedness (PREC)          Reflects the previous experience of the organization                         3.72
Development Flexibility (FLEX)  Reflects the degree of flexibility in the development process                1.01
Risk Resolution (RSEL)          Reflects the extent of risk analysis carried out                             2.83
Team Cohesion (TEAM)            Reflects how well the development team knows each other and works together   2.19
Process Maturity (PMAT)         Reflects the process maturity of the organization                            1.59
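Putting equations (3.8) to (3.10) together, the sketch below computes a COCOMO II effort estimate; the multiplicative constant A = 2.94 is the commonly cited COCOMO II.2000 calibration and is an assumption here, while B = 0.91 and the scale-factor and effort-multiplier values are simply the illustrative figures from the text and Tables 3.7 and 3.8.

```python
# Minimal sketch of the COCOMO II Post-Architecture effort equation (3.8)-(3.10).
# A = 2.94 is the usual COCOMO II.2000 calibration (an assumption here);
# B = 0.91 as stated in the text. Scale-factor and effort-multiplier values are
# the illustrative ones from Tables 3.8 and 3.7.
from math import prod

A, B = 2.94, 0.91

scale_factors = {"PREC": 3.72, "FLEX": 1.01, "RSEL": 2.83, "TEAM": 2.19, "PMAT": 1.59}
effort_multipliers = {
    "RELY": 1.10, "DATA": 1.00, "RUSE": 1.00, "DOCU": 1.23, "CPLX": 1.34,
    "TIME": 1.29, "STOR": 1.05, "PVOL": 0.87, "ACAP": 0.71, "PCAP": 0.88,
    "APEX": 0.81, "PLEX": 0.85, "LTEX": 0.84, "PCON": 0.90, "TOOL": 0.78,
    "SITE": 0.86, "SCED": 1.00,
}

def cocomo_ii_effort(size_ksloc):
    """Effort in person-months: A * size^E * product of the 17 effort multipliers."""
    E = B + 0.01 * sum(scale_factors.values())                        # equation (3.10)
    return A * size_ksloc ** E * prod(effort_multipliers.values())    # equation (3.8)

if __name__ == "__main__":
    print("estimated effort for a 100 KSLOC project:",
          round(cocomo_ii_effort(100), 1), "person-months")
```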
Table 3.9 tabulates the COCOMO Dataset. This model helps to estimate the
cost, effort and schedule while planning a new software development activity.
Table 3.9: COCOMO Dataset
(Rating levels abbreviated as defined above: VL = Very Low, L = Low, N = Nominal, H = High, VH = Very High, XH = Extra High)

RELY DATA CPLX TIME STOR VIRT TURN ACAP AEXP PCAP VEXP LEXP MODP TOOL SCED   LOC    ACT_EFFORT
N    H    VH   N    N    L    N    H    N    VH   L    N    H    N    L      70      278
VH   H    H    VH   VH   N    N    VH   VH   VH   N    H    H    H    L      227     1181
N    H    H    VH   H    L    H    H    N    H    L    H    H    N    L      177.9   1248
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      115.8   480
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      29.5    120
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      19.7    60
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      66.6    300
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      5.5     18
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      10.4    50
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      14      60
N    N    H    H    N    N    N    N    N    N    N    N    N    N    N      16      114
H    N    H    N    N    N    N    N    N    N    N    N    N    N    N      6.5     42
N    N    H    N    N    N    N    N    N    N    N    N    N    N    N      13      60
N    N    H    N    N    N    N    N    N    N    N    N    N    N    N      8       42
N    N    H    N    N    N    N    N    N    H    N    H    H    H    N      90      450
H    N    N    H    N    N    N    N    H    H    N    N    N    N    N      15      90
H    N    H    N    N    N    N    N    H    H    N    N    N    N    N      38      210
N    N    N    N    N    N    N    N    H    H    N    N    N    N    N      10      48
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      161.1   815
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      48.5    239
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      32.6    170
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      12.8    62
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      15.4    70
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      16.3    82
N    VH   H    VH   VH   L    H    VH   H    N    L    H    VH   VH   L      35.5    192
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      25.9    117.6
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      24.6    117.6
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      7.7     31.2
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      9.7     25.2
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      2.2     8.4
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      3.5     10.8
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      8.2     36
H    L    H    N    N    L    L    N    N    N    N    H    H    N    L      66.6    352.8
N    L    H    N    XH   L    L    H    VH   VH   N    H    N    N    N      150     324
N    L    H    N    N    L    L    H    N    N    N    VL   N    N    N      100     360
N    L    H    N    N    H    L    H    H    H    L    VL   N    N    N      100     215
N    L    H    N    N    L    L    H    VH   VH   N    H    N    N    N      100     360
N    L    H    N    N    L    L    H    VH   H    N    H    N    N    N      15      48
N    L    H    N    XH   L    L    H    H    N    N    H    N    N    N      32.5    60
N    L    H    N    N    L    L    H    H    H    N    H    N    N    N      31.5    60
N    L    H    N    N    L    L    H    VH   H    N    H    N    N    N      6       24
N    L    H    N    N    L    L    H    VH   N    N    L    N    N    N      11.3    36
N    L    H    N    N    L    L    H    VH   VH   N    H    N    N    N      20      72
N    L    H    N    N    L    L    H    VH   H    N    H    N    N    N      20      48
H    L    H    XH   XH   L    H    H    H    H    N    H    H    H    N      7.5     72
H    L    H    N    N    L    L    N    N    H    N    N    H    VL   N      302     2400
H    N    H    H    H    L    H    N    H    N    N    N    L    VH   N      370     3240
H    N    H    H    H    L    H    N    H    N    N    N    L    VH   N      219     2120
H    N    H    H    H    L    H    N    H    N    N    N    L    VH   N      50      370
H    N    VH   H    H    L    H    H    N    N    H    H    L    VH   H      101     750
N    N    N    N    N    L    N    H    VH   VH   L    H    H    N    N      190     420
N    N    H    N    H    N    N    H    H    N    N    H    H    N    H      47.5    252
VH   N    XH   H    H    L    L    N    H    N    N    N    L    H    N      21      107
L    N    N    N    N    L    L    H    H    VH   N    H    L    L    H      423     2300
H    H    N    N    N    L    L    N    H    H    N    H    N    N    N      79      400
H    H    L    N    N    N    H    H    H    N    N    N    H    N    N      284.7   973
N    H    L    N    N    H    N    H    H    N    N    N    H    H    N      282.1   1368
N    H    H    VH   N    N    H    H    H    H    N    H    L    L    H      78      571.4
N    H    H    VH   N    N    H    H    H    H    N    H    L    L    H      11.4    98.8
N    H    H    VH   N    N    H    H    H    H    N    H    L    L    H      19.3    155
3.4.2 M5 Algorithm
Tree-building algorithms such as C4.5 repeatedly select the attribute that best classifies the
remaining data, so tree construction is iterative. A key advantage of decision trees is that they can
be converted immediately into rules that decision-makers can interpret. Regression trees and model
trees are used for numeric prediction in data mining. Both build decision trees in which each leaf
performs a local regression for a specific region of the input space; the difference is that regression
trees produce constant output values for each subset of the input data, whereas model trees fit
linear (first-order) models to those subsets. The M5 algorithm builds trees whose leaves are
associated with multivariate linear models, and the nodes of the tree are chosen over the attribute
that maximizes the expected error reduction as a function of the standard deviation of the output
parameter (Rodriguez et al., 2006). M5 model trees split the input space progressively. The set N of
examples is either associated with a leaf, or some test is chosen that splits N into subsets
corresponding to the test outcomes, and the same process is applied recursively to the subsets.
Splits are based on minimizing the intra-subset variation in the output values down each branch,
and the attribute that maximizes the expected error reduction is chosen. The Standard Deviation
Reduction (SDR) is calculated by equation (3.11)
SDR = sd(N) - Σi (|Ni| / |N|) × sd(Ni)        (3.11)
where N is the set of examples that reach the node, Ni is the subset of examples that have the
ith outcome of the potential test, and sd is the standard deviation.
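As a concrete illustration of equation (3.11), the short sketch below (not taken from the thesis; the data values and the candidate split are purely illustrative) computes the SDR of a single binary split:

import statistics

def sdr(parent_outputs, child_subsets):
    # SDR = sd(N) - sum(|Ni| / |N| * sd(Ni)) over the subsets Ni, equation (3.11)
    sd_parent = statistics.pstdev(parent_outputs)
    weighted_child_sd = sum(
        len(subset) / len(parent_outputs) * statistics.pstdev(subset)
        for subset in child_subsets if subset
    )
    return sd_parent - weighted_child_sd

outputs = [18, 60, 120, 278, 300, 480, 1181, 1248]        # output values reaching the node
left, right = [18, 60, 120, 278], [300, 480, 1181, 1248]  # one candidate binary split
print(sdr(outputs, [left, right]))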
Standard M5 adopts a greedy algorithm which constructs a model tree with a non-fixed structure
using a certain stopping criterion. M5 minimizes the error at each interior node, one node at a
time. This process starts at the root and is repeated recursively until all or almost all of the
instances are correctly classified. In constructing this initial tree M5 is greedy, and this can be
improved. In principle, it is possible to build a fully non-greedy algorithm; however, the
computational cost of such an approach would be too high. The M5 algorithm
employs a 'divide-and-conquer' principle. The set N is either associated with a leaf, or
some test is chosen that splits N into subsets corresponding to the test outcomes and
the same process is applied recursively to the subsets. The splitting criterion for the
M5 algorithm is based on treating the standard deviation of the class values that reach
a node as a measure of the error at that node, and calculating the expected reduction in
this error as a result of testing each attribute at that node.
After examining all possible splits, M5 chooses the one that maximizes the
expected error reduction. Splitting in M5 ceases when the class values of all the
instances that reach a node vary just slightly, or only a few instances remain.
The relentless division often produces over-elaborate structures that must be
pruned back, for instance by replacing a subtree with a leaf. In the final stage, a
smoothing process is performed to compensate for the sharp discontinuities that
will inevitably occur between adjacent linear models at the leaves of the pruned
tree, particularly for some models constructed from a smaller number of training
examples. In smoothing, the adjacent linear equations are updated in such a way
that the predicted outputs for the neighbouring input vectors corresponding to the
different equations become close in value (Wang and Witten, 1997). Figure 3.7
shows the M5 model tree algorithm. Pseudo-code for the M5 algorithm is given in
Figure 3.8.
Figure 3.7: M5 Model Tree Algorithm
MakeModelTree(instances)
{
    SD = sd(instances)
    for each k-valued nominal attribute
        convert into k-1 synthetic binary attributes
    root = newNode
    root.instances = instances
    split(root)
    prune(root)
    printTree(root)
}

split(node)
{
    if sizeof(node.instances) < 4 or sd(node.instances) < 0.05 * SD
        node.type = LEAF
    else
        node.type = INTERIOR
        for each attribute
            for all possible split positions of the attribute
                calculate the attribute's SDR
        node.attribute = attribute with maximum SDR
        split(node.left)
        split(node.right)
}

prune(node)
{
    if node = INTERIOR then
        prune(node.leftChild)
        prune(node.rightChild)
        node.model = linearRegression(node)
        if subtreeError(node) > error(node) then
            node.type = LEAF
}

subtreeError(node)
{
    l = node.left; r = node.right
    if node = INTERIOR then
        return (sizeof(l.instances) * subtreeError(l)
                + sizeof(r.instances) * subtreeError(r)) / sizeof(node.instances)
    else return error(node)
}

Figure 3.8: Pseudo-code for M5 Algorithm
3.4.3 Linear Regression
Regression analysis is a statistical tool for the investigation of relationships
between variables. The simple linear regression mean function is shown in equation (3.12)

E(Y | X = x) = β0 + β1x        (3.12)

In most problems, more than one predictor variable will be available. This leads to
the following 'multiple regression' mean function, as shown in equation (3.13)

E(Y | X) = β0 + β1x1 + β2x2 + … + βpxp        (3.13)
Linear regression is a very powerful statistical technique, where graphs
with straight lines are overlaid on scatter plots. Linear models can be used for
prediction or to evaluate whether there is a linear relationship between two
numerical variables.
A perfect linear relationship would mean that the exact value of y could be determined just by
knowing the value of x. This is unrealistic in almost any natural process. The height and weight of
school children may be considered as an example: their height x gives some information about their
weight y, but there is still a lot of variability, even for children of the same height.
The linear regression line is often written mathematically as in equation (3.14)
y = β0 + β1x        (3.14)

where β0 and β1 are the two linear model parameters to be identified. Usually x represents an
explanatory or predictor variable and y represents a response; the variable x is used to predict the
response y. Usually, b0 and b1 are used to denote the point estimates of β0 and β1
(Montgomery et al., 2012).
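A minimal sketch of how the point estimates b0 and b1 of equation (3.14) can be obtained by ordinary least squares is given below; the height/weight numbers are hypothetical and only echo the school-children example above.

def fit_simple_linear_regression(x, y):
    # Ordinary least-squares point estimates:
    # b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2),  b0 = mean(y) - b1 * mean(x)
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    b0 = y_mean - b1 * x_mean
    return b0, b1

heights = [120, 125, 130, 135, 140, 145]   # hypothetical heights (cm)
weights = [22, 25, 26, 30, 33, 36]         # hypothetical weights (kg)
b0, b1 = fit_simple_linear_regression(heights, weights)
print(f"weight ~ {b0:.2f} + {b1:.2f} * height")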
3.4.4 Error Statistics
Error is nothing but the deviation of the observed value from the true value
of the measured quantity. There are two types of errors, absolute error and relative
error. Absolute error is the difference between the magnitude of the true value and
the observed one. It gives us the exact number with the units of the quantity that is
deviated from the true one. Unlike absolute error, relative error is expressed in
percentage and it helps to compare how incorrect a quantity is from the value
considered to be true. Relative error is defined as the absolute error divided by the
true value. It is generally expressed as percentage and calculates the ratio between
absolute error and the true value.
Relative Error Equation
Relative error is determined by using the equation (3.15):
Relative error = (x - x0) / x        (3.15)
where, x = true value of a quantity,
x0 = observed value of the quantity,
x - x0 = absolute error.
It is necessary to estimate accuracy for evaluation and validation.
A common evaluation criterion used in engineering (Setiono et al., 2010) is the Magnitude of
Relative Error (MRE), which computes the absolute error percentage between the actual and
predicted efforts for the reference samples.
MREi = |actuali - estimatedi| / actuali        (3.16)
The Mean Magnitude of Relative Error (MMRE) (Stensrud et al., 2003) is the de facto standard
evaluation criterion to assess the accuracy of software prediction models. MMRE is a summary
statistic, i.e., a single number aggregating the fundamental metric MRE, a relative residual error.
MMRE is used for at least two kinds of assessments. One purpose of MMRE is to select between
competing prediction models: the model that obtains the lowest MMRE is preferred. Another
purpose is to provide a quantitative measure of the uncertainty of a prediction, where a low MMRE
is taken to mean low uncertainty or inaccuracy. MMRE is the average of the MREs over all
reference samples. As MMRE is sensitive to individual outlying predictions, the median of the MREs
(MdMRE) over the n samples is also adopted, since the median is less sensitive to extreme MRE
values when there are many observations. Despite the use of MMRE for estimation accuracy, there
is much discussion of its efficacy in estimation procedures; MMRE has been criticized as being
unbalanced in many validation circumstances, often resulting in overestimation
(Bhatnagar et al., 2010).
MMRE = (1/n) Σ (i = 1 to n) MREi        (3.17)

MdMRE = median(MREi)        (3.18)
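The following sketch (not from the thesis; the actual and estimated efforts are hypothetical) shows how equations (3.16) to (3.18) are computed in practice:

from statistics import mean, median

def mre(actual, estimated):
    # Magnitude of Relative Error for one sample, equation (3.16)
    return abs(actual - estimated) / actual

actual_efforts = [278, 1181, 1248, 480, 120]       # hypothetical reference efforts
estimated_efforts = [300, 1000, 1100, 520, 150]    # hypothetical model predictions

mres = [mre(a, e) for a, e in zip(actual_efforts, estimated_efforts)]
print(f"MMRE  = {mean(mres):.4f}")    # equation (3.17)
print(f"MdMRE = {median(mres):.4f}")  # equation (3.18)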
Advantages of MMRE
(i) Comparisons can be made across data sets.
(ii) It is independent of units. Independence of units means that it does not
matter whether effort is reported in work hours or work months.
An MMRE will be, say, 10 percent whatever unit is used.
(iii) Comparisons can be made across all kinds of prediction model types.
This means, for example, that it is considered as a valid and reliable
measure to compare AFAs with linear models.
(iv) It is scale independent. Scale independence means that the expected value
of MRE does not vary with size. In other words, an implicit assumption in
using MRE as a measure of predictive accuracy is that the error is
proportional to the size (effort) of the project (Myrtveit et al., 2005).
3.5 RESULTS AND DISCUSSION
In the first experiment, the COCOMO Dataset for software effort
estimation is used to evaluate the performance of the regression techniques M5
and Linear Regression. This dataset is used for benchmarking, and the techniques are then
evaluated on the real-time pulpwood dataset. The experimental setup consisted of
using the attributes of the COCOMO dataset as they are. Table 3.10 tabulates the
results achieved for COCOMO dataset. Figure 3.9 shows the Magnitude Relative
Error achieved for COCOMO data set using M5 algorithm and linear regression
techniques. Appendix I lists the Magnitude Relative Error and absolute error
achieved for the algorithms.
Table 3.10: Average MMRE and MdMRE achieved for M5 and
Linear Regression Technique - COCOMO Dataset
Technique Used MMRE MdMRE
M5 0.9884 39.90155
Linear Regression 1.655818 57.97808
Figure 3.9: Magnitude Relative Error achieved for COCOMO Dataset
It is observed from Figure 3.10 that the M5 algorithm achieves a lower MMRE than linear
regression; the performance of the linear regression algorithm is poor in comparison. The
percentage difference for MMRE between M5 and Linear Regression is 50.48% for the
COCOMO dataset.
Figure 3.10: MMRE achieved for COCOMO Dataset
Figure 3.11: MdMRE achieved for COCOMO Dataset
Figure 3.11 depicts the MdMRE achieved for the COCOMO dataset using the M5 algorithm and
linear regression techniques. The percentage difference for MdMRE between M5 and Linear
Regression is 36.93% for the COCOMO dataset.
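The percentage differences appear to be computed in the symmetric form |a - b| / ((a + b) / 2) × 100; this formula is an assumption, since the thesis does not state it, but it reproduces the reported figures from Table 3.10: for MMRE, |1.6558 - 0.9884| / 1.3221 × 100 ≈ 50.48%, and for MdMRE, |57.978 - 39.902| / 48.940 × 100 ≈ 36.93%.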
The study is based on the data collected at the TNPL in Karur District,
Tamil Nadu. Data on the demand patterns of pulpwood have been collected over ten years. The
collected data are normalized. Table 3.11 shows sample demand data for pulpwood in metric
tonnes.
Table 3.11: Sample Demand data of Pulpwood in MT (Metric Tonne)
(Source- TNPL Management Plan.)
Year Demand (MT)
2003 133719
2004 147505
2005 164804
2006 166471
2007 180577
2008 383315
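The thesis only states that the collected data are normalized, without giving the method; the sketch below assumes simple min-max scaling of the Table 3.11 values to the range [0, 1], purely for illustration.

# Min-max normalization is an assumption; the thesis does not specify its method.
demand = {2003: 133719, 2004: 147505, 2005: 164804,
          2006: 166471, 2007: 180577, 2008: 383315}

lo, hi = min(demand.values()), max(demand.values())
normalized = {year: (value - lo) / (hi - lo) for year, value in demand.items()}
for year, value in normalized.items():
    print(year, round(value, 4))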
The MMRE and MdMRE are evaluated for the pulpwood dataset using the M5 and Linear
Regression techniques. Table 3.12 provides the average MMRE and MdMRE for these
techniques, and Figures 3.12 and 3.13 depict the same.
Table 3.12: Average MMRE and MdMRE for pulpwood Dataset
Technique Used MMRE MdMRE
M5 0.314705 23.75804
Linear Regression 0.337028 25.63157
Figure 3.12: MMRE for M5 and Linear Regression Technique - Pulpwood
It is observed that the M5 algorithm achieves lower MMRE. The percentage
difference for MMRE between M5 and Linear Regression is 6.85% for the
pulpwood dataset.
Figure 3.13: MdMRE for M5 and Linear Regression Technique- Pulpwood
It is observed that the M5 algorithm achieves lower MMRE and MdMRE than linear regression.
The percentage difference for MdMRE between M5 and Linear Regression is 7.61% for the
pulpwood dataset.
3.6 CONCLUSION
The study evaluates regression algorithms for forecasting demand data from Tamil Nadu
Newsprint and Papers Limited. The M5 algorithm and Linear Regression are used for evaluation.
The COCOMO Dataset for software effort estimation is used as a benchmark for the pulpwood
dataset. The MMRE and MdMRE are used as evaluation criteria. The percentage difference
between M5 and Linear Regression for MMRE is 50.48% for the COCOMO dataset and 6.85% for
the pulpwood dataset. Similarly, the percentage difference for MdMRE between the two algorithms
is 36.93% for the COCOMO dataset and 7.61% for the pulpwood dataset. Simulation results
demonstrate that the M5 algorithm is more effective. From the results, it can also be concluded
that linear regression can perform reasonably well when predicting data with a small number of
attributes. The reason for the better performance of the M5 algorithm is that its leaves are
associated with multivariate linear models and the nodes of the decision tree are constructed over
the attribute that maximizes the expected error reduction. Further studies are required to reduce
the error in forecasting.