
CHAPTER 3

PERFORMANCE ANALYSIS OF EXISTING REGRESSION

TECHNIQUES

3.1 FORECASTING PRODUCT DEMAND

Forecasting product demand is crucial to any supplier, manufacturer, or

retailer. Forecasts of future demand will determine the quantities that should be

purchased, produced and shipped. Demand forecasts are necessary since the basic

operations process, moving from the supplier's raw materials to finished goods in
the customer's hands, takes time. Firms should not simply wait for demand to
emerge and then react to it. Instead, they must anticipate and plan for future
demand so that they can react immediately to customer orders as they occur.
In other words, most manufacturers 'make to stock' rather than 'make to order' –

they plan ahead and then deploy inventories of finished goods into field locations.

Thus, once a customer order materializes, it can be fulfilled immediately – since

most customers are not willing to wait for the time it would take to actually

process their order throughout the supply chain and make the product based on

their order. An order cycle could take weeks or months to go back through part

suppliers and sub-assemblers, through manufacture of the product and through to

the eventual shipment of the order to the customer.

All firms forecast demand, but it would be difficult to find any two firms

that forecast demand in exactly the same way. Over the last few decades, many

different forecasting techniques have been developed in a number of different

application areas, including engineering and economics. Many such procedures

have been applied to the practical problem of forecasting demand in a logistics

system, with varying degrees of success. Most commercial software packages that

support demand forecasting in a logistics system include dozens of different

forecasting algorithms that the analyst can use to generate alternative demand


forecasts. While scores of different forecasting techniques exist, almost any

forecasting procedure can be broadly classified into one of the following four

basic categories based on the fundamental approach towards the forecasting

problem that is employed by the technique.

(i) Judgmental Approaches - The essence of the judgmental approach is to

address the forecasting issue by assuming that someone knows and can

tell the right answer. That is, judgment-based techniques gather the

knowledge and opinions of people who are in a position to know what

demand there will be. For example, a survey of the customer base may

be conducted to estimate what the firm's sales will be in the following months.

(ii) Experimental Approaches - Another approach to demand forecasting,

which is appealing when an item is "new" and when there is no other

information upon which to base a forecast, is to conduct a demand

experiment on a small group of customers and to extrapolate the results

to a larger population. For example, firms will often test a new consumer

product in a geographically isolated "test market" to establish its

probable market share. This experience is then extrapolated to the

national market to plan the new product launch. Experimental

approaches are very useful and necessary for new products, but for

existing products that have an accumulated historical demand record, it

seems intuitive that demand forecasts should somehow be based on this

demand experience. For most firms (with some very notable exceptions)

the large majority of Stock Keeping Units (SKUs) in the product line

have long demand histories.

(iii) Relational/Causal Approaches - The assumption behind a causal or

relational forecast is that there is a reason why people buy a product.
A demand forecast can be developed when the reasons for buying are

understood.


(iv) Time Series Approaches - A time series procedure is fundamentally

different from the first three approaches discussed. In a pure time series

technique, no judgment or expertise or opinion is sought. Causes or

relationships or factors which somehow 'drive' demand are not required.

Time series procedures are applied to demand data that are longitudinal

rather than cross-sectional. That is, the demand data represent experience

that is repeated over time rather than across items or locations. The

essence of the approach is to recognize (or assume) that demand occurs

over time in patterns that repeat themselves, at least approximately. If

these general patterns or tendencies are described without regard to their

"causes", then this description forms the basis of a forecast.

All forecasting procedures involve the analysis of historical experience into

patterns and the projection of those patterns into the future in the belief that the

future will somehow resemble the past. The differences in the four approaches are

in the way the "search for pattern" is conducted. Judgmental approaches rely on

the subjective, ad-hoc analyses of external individuals. Experimental tools

extrapolate results from small numbers of customers to large populations. Causal

methods search for reasons for demand. Time series techniques simply analyze the

demand data themselves to identify temporal patterns that emerge and persist.
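To illustrate the time series idea, the short sketch below applies simple exponential smoothing to a hypothetical monthly demand series; the demand figures and the smoothing constant are assumptions chosen purely for demonstration and are not taken from the datasets used later in this thesis.

# Minimal sketch of a time series forecast: simple exponential smoothing.
# The demand figures and alpha below are illustrative assumptions only.

def exponential_smoothing(demand, alpha=0.3):
    """Return one-step-ahead forecasts for a demand history."""
    forecast = demand[0]              # initialise with the first observation
    forecasts = [forecast]
    for actual in demand[:-1]:
        forecast = alpha * actual + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts

if __name__ == "__main__":
    monthly_demand = [120, 132, 101, 134, 90, 110, 128, 125]   # hypothetical units
    fitted = exponential_smoothing(monthly_demand, alpha=0.3)
    next_period = 0.3 * monthly_demand[-1] + 0.7 * fitted[-1]
    print("one-step-ahead forecast:", round(next_period, 1))

The forecast for the next period is simply a weighted blend of the latest observation and the previous forecast, which is the "pattern projection" idea described above in its simplest form.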

3.2 SOFTWARE DEVELOPMENT EFFORT ESTIMATION

Software effort estimation is used to determine the amount of effort

necessary to complete a software project, in terms of its scheduling, the allocation

of resources and the meeting of budget requirements. Software defect prediction

strives to improve software quality and testing efficiency by constructing

predictive classification models from code attributes to enable a timely

identification of fault-prone modules. This is an essential activity in the software

project planning phase, because major problems usually surface in the
first three months of a software development project and are the result of hasty


scheduling, irrational commitments and unprofessional estimating techniques

(Marchewka, 2009). The accurate prediction of software development costs is a
critical issue for making good management decisions and for determining precisely
how much effort and time a project will require, for project managers as well as
system analysts and developers. Estimation is defined as "the action of appraising,
assessing or valuing" or "the process of forming an approximate notion of
numbers, quantities, magnitudes etc. without actual enumeration or measurement".
From these definitions it follows that the task of estimation is not easy to do precisely.

There are many software cost estimation methods that are available. No one

method is necessarily better or worse than the others; in fact, their strengths and
weaknesses are often complementary to each other. Estimating the effort required for
software development is a challenging and demanding job that requires
expertise and experience, as well as a good understanding of process, project management,
metrics and, most importantly, the use of proper estimation models, tools and techniques.
Good software estimation models can significantly help the software project
manager and project stakeholders to make informed decisions about bidding values,
planning the project, resource management and delivering the project on time

within budget.

3.2.1 SOFTWARE ESTIMATION TECHNIQUES

The Software Engineering Body Of Knowledge (SWEBOK) has identified

Knowledge Areas (KAs) such as software requirements, software design, software

construction, software testing, software maintenance, software configuration

management, software engineering management, software engineering process,

software engineering tools and methods, and software quality. Figure 3.1 explains

the Software Estimation process.


Figure 3.1 Software Estimation Process

Software Engineering Management KA addresses management and

measurement including Software Project Planning, which further addresses Effort,

Schedule and Cost Estimation. Based on the breakdown of tasks, inputs and
outputs, the expected effort range required for each task is determined using a
calibrated estimation model based on the historical size-effort data available;
otherwise, a method such as expert judgment is applied.

Figure 3.2 depicts estimation techniques, methods, tools and their

categorization. The functional form of a software estimation model is determined
either from theory or derived empirically from data.


Figure 3.2: Software Estimation Techniques


SLIM is based on Putnam's analysis of the life cycle in terms of the so-called
Rayleigh distribution of project personnel versus time. In SLIM, productivity is
used to link the basic Rayleigh manpower distribution model to the software
characteristics of size and technology factors. Checkpoint is a knowledge-based
software project estimation tool. It has a proprietary database of about 8000

software projects. It uses Function Points as its primary input and focuses on areas

that need to be managed to improve software quality and productivity. Checkpoint

predicts effort at four levels of granularity: project, phase, activity and task.

Estimates also include resources, deliverables, defects, costs and schedules.

The PRICE-S model was originally developed for use internally on software

projects that were part of the Apollo moon program. It consists of the following three
sub-models:

(i) Acquisition Sub-model : forecasts cost and schedule

(ii) Sizing Sub-model : facilitates estimating size

(iii) Life-cycle Sub-model : predicts cost of maintenance and support phase

Foresight 2.0 is the latest version from PRICE Systems for estimating time, effort

and cost for commercial and non-military government software projects.

ESTIMACS stresses approaching the estimating task in business terms. It also

stresses the need to be able to do sensitivity and trade analysis early on in terms of

staffing/cost estimates and associated risks. SEER-SEM has evolved into a
sophisticated tool supporting top-down and bottom-up methodologies. Its modeling
equations are proprietary, but they take a parametric approach to estimation.
The scope of this model is wide. It covers all phases of the project life-cycle, from

early specification through design, development, delivery and maintenance.

It handles a variety of environmental and application configurations, such as

client-server, stand-alone, distributed, graphics, etc. It also supports development

modes such as object oriented, reuse, COTS, spiral, waterfall, prototype and


incremental. It allows staff capability, required design and process standards and

levels of acceptable development risk to be input as constraints.

SELECT Estimator was released in 1998. It is designed for large scale

distributed systems development. It is object oriented, basing its estimates on

business objects and components. It assumes an incremental development life-cycle.

The nature of its inputs allows the model to be used at any stage of the software

development life-cycle, most significantly even at the feasibility stage when little

detailed project information is known. In later stages, as more information is

available, its estimates become correspondingly more reliable. The actual estimation

technique is based upon ObjectMetrix, developed by Object Factory. It works by
measuring the size of a project by counting and classifying the software elements

within a project. Applying the qualifier and technology adjustments to the base

metric effort for each project element produces an overall estimate of effort in

person-days, by activity. Using the total one-person effort estimate, the schedule is
determined as a function of the number of developers, which is input as an independent
variable.

COCOMO cost and schedule estimation model was originally published in

1981. It became one of the most popular parametric cost estimation models of the

1980s. It has experienced difficulties in estimating the costs of software developed

by following new life-cycle processes and capabilities. The COCOMO II research

started in 1994 at USC to address the issues of non-sequential and rapid

development process models, reengineering, reuse driven approaches and object

oriented approaches. The Delphi technique was developed at the Rand Corporation in

the late 1940s originally as a way of making predictions about future events.

More recently, the technique has been used as a means of guiding a group of
informed individuals to a consensus of opinion on some issue. Participants are

asked to make some assessment regarding an issue, individually in a preliminary

round, without consulting the other participants in the exercise. First round results


are then collected, tabulated and returned to each participant for a second round,

during which participants are again asked to make an assessment regarding the

same issue.

Abts and Boehm used the technique to estimate initial parameter values for

Effort Adjustment Factors appearing in the glue code effort estimation

components of the COCOTS integration model. This technique has been used by

Chulani and Boehm to estimate software defect introduction and removal rates

during various phases of the software development lifecycle. These factors appear

in COQUALMO (COnstructive QUALity MOdel), which predicts the residual

defect density in terms of number of defects/unit of size.

Learning-oriented techniques include the oldest as well as the newest techniques
applied to estimation activities. The former are represented by case studies, among the
most traditional of manual techniques; the latter by neural networks,
which attempt to automate improvements in the estimation process by building
models that "learn" from previous experience. Case studies represent inductive

learning, whereby estimators and planners try to learn useful general lessons and

estimation heuristics by extrapolation from specific examples. They examine in

detail elaborate studies describing the environmental conditions and constraints

that obtained during the development of previous projects, the technical and

managerial decisions that were made, and the final successes or failures that resulted.

Neural networks are the most common software estimation model-building

technique used as an alternative to least squares regression. These are

estimation models that can be trained using historical data to produce ever better

results by automatically adjusting their algorithmic parameter values to reduce
the difference between the known actuals and the model predictions. Dynamics-based

techniques explicitly acknowledge that software project effort or cost factors

change over the duration of the system development. Factors like deadlines,

staffing levels, design requirements, training needs, budget etc. all fluctuate over


the course of development and cause corresponding fluctuations in the

productivity of project personnel.

Regression-based techniques are used in conjunction with model-based

techniques and include standard regression, robust regression etc. Standard

regression refers to the classical statistical approach of general linear regression

modeling using least squares. This regression technique was used to calibrate
COCOMO II 1997. Robust regression alleviates the common problem of outliers
in observed software engineering data; the Least Median of Squares method falls in this

category. Parametric cost models such as COCOMO II, SLIM, Checkpoint etc.

use some form of regression based techniques due to their simplicity and wide

acceptance.
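As a rough illustration of the difference between the two families, the sketch below fits an ordinary least squares line and a crude random-sampling approximation of the Least Median of Squares idea to the same data containing one outlier; the data points, the number of candidate lines and the random seed are assumptions made only for this example.

import numpy as np

# Illustrative comparison of ordinary least squares (OLS) with a crude
# Least Median of Squares (LMS) approximation on data containing an outlier.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, size=10)
y[7] = 40.0                                   # inject a gross outlier

# OLS fit: minimises the sum of squared residuals, so the outlier pulls the line.
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)

# Crude LMS: try many candidate lines through random point pairs and keep the
# one whose median squared residual is smallest (the median is barely affected
# by a single outlier).
best = None
for _ in range(500):
    i, j = rng.choice(len(x), size=2, replace=False)
    if x[i] == x[j]:
        continue
    slope = (y[j] - y[i]) / (x[j] - x[i])
    intercept = y[i] - slope * x[i]
    med = np.median((y - (slope * x + intercept)) ** 2)
    if best is None or med < best[0]:
        best = (med, slope, intercept)

print("OLS fit:        slope", round(slope_ols, 2), "intercept", round(intercept_ols, 2))
print("LMS (approx.):  slope", round(best[1], 2), "intercept", round(best[2], 2))

The OLS slope is visibly distorted by the single outlier, while the median-based fit stays close to the underlying line, which is the practical motivation for robust regression on noisy software engineering data.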

Composite techniques incorporate a combination of two or more techniques

to formulate the most appropriate functional form for estimation. Bayesian
analysis is a mode of inductive reasoning that provides a formal process by which
a-priori expert judgment can be combined with sampling data to produce robust
a-posteriori model information in a logically consistent manner for making
inferences. This has been used in COCOMO II, but estimating software

development effort is a complex problem. An accurate cost estimation of a

software development effort is critical for good management decision making.

The precision and reliability of the effort estimation is very important for the

software industry, because both overestimates and underestimates of the software
effort are harmful to software companies. As a result, from an

organizational perspective, an early and accurate cost estimate reduces the

possibility of organizational conflict during the later stages. Predicting software
development effort with high precision is a challenging task for project
managers. An objective of the software engineering community is to develop

useful models that accurately estimate the software effort (Xu et al., 2004).

Among the available techniques for software cost estimation, COCOMO


(COnstructive COst MOdel) is the most widely used algorithmic cost modeling
technique, because it makes it simple to estimate the effort in person-months for a project

at different stages. Figure 3.3 gives the different types of COCOMO Model.

Figure 3.3 Types of COCOMO Model

COCOMO uses mathematical formulae for predicting the project
estimate. The most commonly adopted neural architecture for estimating software
effort is the feed-forward multilayer perceptron with the back-propagation learning
algorithm and the sigmoid activation function. One of the major drawbacks of
back-propagation learning is slow convergence, the main reason being that the
sigmoid activation function is used in the hidden and output layers. A network
with a sigmoidal function is also more sensitive to the loss of parameters, and the
selection of network patterns and learning rules may cause difficulties in network
performance and training, so the number of layers and nodes should be minimized to
improve performance. Hence, to overcome the above-mentioned limitations and to
improve the performance of the network, the COCOMO model has been proposed as the
solution (Reddy, 2009). Some major models that are being

used as benchmarks for software effort estimation are as follows:


(i) Halstead

(ii) Walston-Felix

(iii) Bailey-Basili

(iv) Doty (for KLOC > 9)

These models have been derived from studies of a large number of completed
software projects from various organizations and applications to explore how
project sizes map into project effort. Even so, these models are unable to
predict software effort accurately (Kaur et al., 2010).
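For reference, the forms of these benchmark models most often quoted in the effort estimation literature express effort in person-months as a function of size in KLOC; the sketch below encodes those commonly quoted forms, and the exact constants should be treated as assumptions to be checked against the original publications.

# Commonly cited benchmark effort models (effort in person-months, size in KLOC).
# The constants below follow forms frequently quoted in the effort estimation
# literature and are assumptions to be verified against the original papers.

def halstead(kloc):       return 0.7 * kloc ** 1.50
def walston_felix(kloc):  return 5.2 * kloc ** 0.91
def bailey_basili(kloc):  return 5.5 + 0.73 * kloc ** 1.16
def doty(kloc):                                   # intended for KLOC > 9
    return 5.288 * kloc ** 1.047

if __name__ == "__main__":
    size = 33.3   # hypothetical project size in KLOC
    for name, model in [("Halstead", halstead), ("Walston-Felix", walston_felix),
                        ("Bailey-Basili", bailey_basili), ("Doty", doty)]:
        print(f"{name:15s} {model(size):8.1f} person-months")

Running the four models on the same hypothetical size makes the point of the paragraph above concrete: the estimates differ widely, which is why no single size-only model predicts effort accurately.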

3.2.2 SOFTWARE METRICS

Software metrics are used for the prediction of defects. Some of the commonly
used metrics are LOC (Lines Of Code), N – Length, V – Volume, E – Effort, D – Difficulty,
CC – Cyclomatic Complexity, EC – Essential Complexity and DC – Design Complexity.

Conte (1986) has defined the LOC as: A line of code is any line of program text

that is not a comment or a blank line, regardless of the number of statements or

fragments of statements on the line. This specifically includes all lines containing

program headers, declarations, and executable and non-executable statements.

Number of Lines of Code (NLOC) can be represented as Number of Delivered

Source Instructions (NDSI) and number of Thousands of Delivered Source

Instructions (KDSI). The Halstead model uses various parameters, such as the

number of distinct operators (instruction types, keywords, etc.) in a program,

denoted as n1; the number of distinct operands (variables and constants) denoted

as n2; the total number of occurrences of the operators denoted as N1; and the

total number of occurrences of the operands represented as N2. The summation of

n1 and n2 is denoted as n while the sum of N1 and N2 is called N.

Using these four quantities many useful measures have been obtained.

The number of bits needed to represent a program is defined as the volume V of

the program and is calculated by using the equation (3.1)


$V = N \log_2 n$                                                  (3.1)

The level of the program denotes the difficulty of understanding a program and it

is computed by equation (3.2)

$L = \dfrac{2 n_2}{n_1 N_2}$                                      (3.2)

The intelligence content of a program is given in equation (3.3)

$I = L \times V$                                                  (3.3)

Estimated Program Length is defined by equation (3.4)

$\hat{N} = n_1 \log_2 n_1 + n_2 \log_2 n_2$                       (3.4)

In an attempt to add some of the psychological aspects of complexity to the
measures, Halstead drew on the cognitive processes related to the perception and
retention of simple stimuli. The idea is to represent the mean number of mental
discriminations per second in an average human being as the Stroud number, which lies
in the range from 5 to 20. Halstead used this idea and took 18 as the reference point for
his research. The number of discriminations made in the preparation of a
program is called the effort and is calculated by equation (3.5)

$E = V / L$                                                       (3.5)

The programming time is denoted as T which is an estimate of the number

of mental discriminations essential to complete a program divided by the average

number of discriminations per second or Stroud number S. But this estimate uses

the assumption that the programmer is devoting all of their discriminations to the

programming task. The programming time is represented as

Reasonable time, $T = E / S$


The difficulty of programming is defined by

$\text{Difficulty} = 1 / \text{language level}$
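To make the preceding definitions concrete, the following sketch computes the Halstead measures of equations (3.1)-(3.5) from operator and operand counts; the counts for the example program are hypothetical, and the Stroud number of 18 is the reference value mentioned above.

import math

# Halstead measures computed from assumed (hypothetical) counts for a small program.
n1, n2 = 12, 20      # distinct operators and distinct operands
N1, N2 = 70, 95      # total occurrences of operators and operands

n = n1 + n2          # program vocabulary
N = N1 + N2          # program length

V = N * math.log2(n)                              # volume, eq. (3.1)
L = (2 * n2) / (n1 * N2)                          # program level, eq. (3.2)
I = L * V                                         # intelligence content, eq. (3.3)
N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)   # estimated length, eq. (3.4)
E = V / L                                         # effort, eq. (3.5)
S = 18                                            # Stroud number used as reference
T = E / S                                         # reasonable programming time

print(f"V={V:.1f}  L={L:.4f}  I={I:.1f}  N_hat={N_hat:.1f}  E={E:.0f}  T={T:.0f}")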

In the technical report of IEEE (1990), software complexity is defined as

the degree to which a system or component has a design or implementation that is

difficult to understand and verify. Cyclomatic complexity is a graph-oriented metric
developed by Thomas J. McCabe in 1976 (McCabe, 1976). The fundamental
assumption is that software complexity is intimately related to the
number of control paths generated by the code. This metric can be defined in two
equivalent ways. The first definition of cyclomatic complexity is

Complexity = the number of decision statements in a program + 1

Alternatively, in a graph G with n vertices, e edges and p connected

components, the complexity is defined by equation (3.6)

$v(G) = e - n + 2p$                                               (3.6)

Finally, the number of branches can be counted from the graph. The McCabe
complexity C can then be defined by equation (3.7)

$C = \sum_{n} \bigl(\deg(n) - 1\bigr) + 1$                        (3.7)

where the sum runs over the branch (decision) nodes n of the graph.
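A rough sketch of the decision-count form of the metric is given below; counting keywords in raw source text is a simplifying assumption made only for illustration, since a real tool would work on the parsed control-flow graph.

# Rough sketch: estimate McCabe cyclomatic complexity by counting decision
# points in source text. A real tool would build the control-flow graph;
# keyword counting is a simplifying assumption used only for illustration.
import re

DECISION_KEYWORDS = ("if", "elif", "for", "while", "case", "and", "or", "catch")

def cyclomatic_complexity(source: str) -> int:
    decisions = 0
    for keyword in DECISION_KEYWORDS:
        decisions += len(re.findall(rf"\b{keyword}\b", source))
    return decisions + 1        # complexity = number of decisions + 1

example = """
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    for _ in range(3):
        if x > 100 and x % 2 == 0:
            x -= 1
    return "positive"
"""
print(cyclomatic_complexity(example))   # counts 5 decision points, so prints 6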

A limitation of McCabe complexity is that the complexity of an
expression within a conditional statement is never acknowledged. No penalty is
applied for nested loops versus a series of single loops; both are
represented with the same complexity. Complexity can also be classified as
essential or accidental. The complexity of software, however, is an essential property, not
an accidental one; hence, descriptions of a software entity that abstract away its
complexity often abstract away its essence. Essential complexity arises from
the nature of the problem and how deep a skill set is required to clearly
understand the problem. Accidental complexity arises because of poor
attempts made to solve the problem and may be equivalent to what some call


complication. Applying the wrong design pattern or selecting an inappropriate
data structure increases the accidental complexity of a problem. Software
design complexity denotes the mapping of the problem domain into a given
representation, and procedural complexity is related to the logical structure of a
program (Da-Wei, 2007).

3.3 PULPWOOD

Forests and forest products play a significant role in the socio-economic
development of the country. Forests provide various services which include direct
and indirect benefits. Since independence, the country's forest area has been under
pressure due to industrialization, urbanization and the associated science and
technology development, which has resulted in forest cover of 23.81% against the
mandatory requirement of 33%. The pace of development in the country has also
accelerated the demand for forest products, which has ushered in a major gap
between demand and supply. Hence, wood based industries in the country
have been directed to generate their own raw material without depending on forest
department supply. Pulpwood is the wood used for pulping, and "wood pulp" is pulp
made from wood. Pulp for papermaking is produced mostly from wood fibres,
which contain many different chemical substances, viz. cellulose, hemicellulose,
lignin and extractives. Wood density is widely regarded as the most important

controlling factor for pulp and paper quality.

Globally, paper consumption has increased by a factor of 20 this century

and has more than tripled over the past 30 years (Grieg‐Gran et al., 1997).

In the developing world, paper consumption is growing rapidly, too, but average

per capita consumption is still just 17.5 kg/year. This is well below the 30-40 kg of

paper per capita per year considered the minimum level necessary to meet basic

needs for communication and literacy. However, total paper and paperboard

consumption in Asia already exceeds that in Europe and is projected to grow

3-4 percent per year until 2010 as income and population increase. Such a rate of


increase would eventually make the region the biggest paper consumer in the

world.

Figure 3.4 Pulpwood stacked in the Processing Yard

Figure 3.4 shows how pulpwood is stacked in the processing yard for
processing. In Asia, the per capita consumption of paper during 2008 was 7 kg
per annum. The Indian paper industry is poised to grow at 8 per cent a year and to
touch 11.5 million tonnes in 2011-2012 from 9.18 million tonnes in 2009-2010. In
India, per capita consumption increased to 9.18 kilograms in 2009-10 compared to
8.3 kilograms during 2008-2009 (FAO, 2010). The consumption of paper is
directly linked to the growth of the economy. With the growth of the economy, the
use of paper has risen tremendously in packaging, education and
documentation. Hence, there is a growing demand for quality
pulpwood raised through organised plantations.

a. Pulp and Paper Industry in India

The paper industry in India is more than a century old. The Indian paper

industry has a highly fragmented structure consisting of small, medium and large

sized paper mills having capacities ranging from 10 to 1150 tons per day.


The geographical spread of the industry, as well as of the market, is mainly responsible

for regional balance of production and consumption. The Indian Paper Industry is

among the top 12 global players today, with an output of more than 13.5 million

tonnes per annum with an estimated turnover of Rs. 35000 Crores. About 850

small, large and medium paper mills operate in India with a combined annual

capacity of 12.7 million tons requiring about 9.83 million tons of wood per year.

Thirty-one percent of the paper industry's output comes from the top 26 paper mills,
which are fully or partially wood based mills (Kulkarni, 2013), and the remaining mills
are based on non-conventional raw materials (agro residues and recycled fibre –
waste paper). These industries manufacture various types of paper materials

required for different purposes. Today the paper industry in India is in search of

technologically advanced methods to reduce the cost of production and augment

the existing technologies to meet the global standards. The government of India

has introduced various rules and regulations to encourage joint ventures and

investments in this field. Many old mills are under revival and/or new greenfield
projects are under consideration. The lack of sustained availability of good quality
raw material is one of the major factors inhibiting the growth of the paper industry
(Mathur et al., 2009).

b. Indian Paper industry growth

The installed capacity of the Indian paper and paperboard industry is 12.7 million tpa,
the production is 10.11 million tpa, and this constitutes about 2.6% of total world
paper production. The consumption of paper and paperboard is 11.15 million tpa
(including newsprint), which includes 1.04 million tpa of imported paper.

The industry provides employment to more than 0.37 million people directly and

1.3 million people indirectly. The estimated turnover of the industry is Rs 35,000 crore

and its contribution to the exchequer is around Rs.3000 crore. India is the fastest

growing market for paper globally at 8% per annum. Paper consumption is poised

for a big leap forward and is estimated to touch 13.95 million tons by 2015-16.


By 2025 the total consumption will be about 24 million tpa, with a per capita
consumption of 17 kg against the present consumption of 9.3 kg. An increase in
consumption of one kg per capita would lead to an increase in demand of

1 million tonnes. The production and consumption of paper is depicted in Figure 3.5.

Figure 3.5: Indian paper industry growth

Prospects of the paper industry - Production and Consumption

The forecast for consumption of paper has been derived considering two

alternate scenarios. Scenario 1: the trend in growth of consumption in the past has been
used as the basis to determine the growth trend in the 12th Five Year Plan (2012-17), and
the forecast for the next 15 years has been made on that basis. Scenario 2: the consumption
forecast has been made based on the following assumptions. For writing paper, the
elasticity of consumption has been taken at 0.9. Taking GDP growth at 9%
during 2012-17 and beyond, the growth of demand for writing paper has been
assumed at 8.1% per annum. With the universalisation of education and the increase in
the period spent on education, the elasticity of consumption of writing paper could be

higher than one. However, despite a lower per capita consumption relative to other

countries, increasing access to internet and substitution of writing/ printing


material by the electronic mode, elasticity of consumption has been taken at 0.9.

For packaging paper, the tracking variable is the likely manufacturing growth.

Since the share of the manufacturing sector is proposed to be increased from the
existing 16% to 25% in the next 10 years, manufacturing growth is expected to remain

higher than the GDP growth. The approach paper to the 12th Five Year Plan has
taken manufacturing growth of 9.8% as the base case scenario. A growth of 10%
for packaging paper is expected. For newsprint, the average annual growth
in the first two years is taken at 7%. In subsequent years, the growth has been taken
assuming an elasticity of consumption of 0.9, i.e. growth of 8.1% per annum. Based

on the above assumptions, the expected pattern of paper consumption emerges as

shown in Table 3.1.

Table 3.1: Projected consumption of Paper (Million Tonnes)

Year       Writing paper   Packaging Paper   Newsprint   Total Consumption   Baseline Scenario
2010-11         4.0              5.4            1.7            11.2               11.2
2011-12         4.3              5.9            1.8            12.0               12.1
2012-13         4.6              6.4            1.9            13.0               13.0
2013-14         5.0              7.1            2.1            14.2               13.8
2014-15         5.4              7.8            2.2            15.4               14.7
2015-16         5.8              8.6            2.4            16.8               15.6
2016-17         6.3              9.4            2.6            18.4               16.5
2021-22         9.3             15.2            3.9            28.4               21.8
2024-25        11.8             20.2            4.9            36.9               23.5
2026-27        13.8             24.5            5.7            43.9               25.3

Overall, paper consumption in the baseline scenario is projected to
increase to 16.5 million tonnes in 2016-17 and reach 25.3 million tonnes in 2026-27.


In the alternative scenario, which appears to be more realistic, the consumption
increases to 18.4 million tonnes in 2016-17 (the terminal year of the 12th plan) and to
43.9 million tonnes in 2026-27. The estimates for production during the 12th Five
Year Plan (2012-17) and in the next 15 years have also been derived for both
alternate scenarios. Estimates of production of various paper grades based on
wood, agro residues and recycled paper have also been projected. In the baseline
scenario, it is assumed that the growth in availability of raw material will continue
to be the same as in the past. In scenario 2, the following growth rates are assumed for
the three alternate raw material sources.

i) Wood based sector : The availability is projected to increase at an annual

rate of 8%

ii) Recycled paper : The growth in availability of recycled paper is assumed
at 10%. Initiatives have been proposed for an increased
availability of used paper for recycling. Based on
the above assumptions, the paper production in the
baseline scenario and the alternative scenario is
indicated in Table 3.2.

iii) Agro based sector : The projected growth assumed is also 8%. This growth

would, however, be feasible provided technology is

developed for the use of rice straw in paper making,

particularly for the packaging paper and also assuming

that bagasse will be available for the paper industry.


Table 3.2: Projected production of Paper (Million Tonnes)

Year       Wood resources   Agro based paper   Recycled paper   Total production   Baseline production
2010-11         3.2                2.2               4.7              10.1                10.1
2011-12         3.4                2.3               5.1              10.9                10.9
2012-13         3.7                2.5               5.7              11.8                11.7
2013-14         4.0                2.7               6.2              12.9                12.5
2014-15         4.3                2.9               6.8              14.1                13.3
2015-16         4.6                3.2               7.5              15.3                14.1
2016-17         5.0                3.4               8.3              16.7                14.8
2021-22         8.0                5.4              14.7              28.0                19.6
2024-25         9.3                6.3              17.8              33.4                22.0
2026-27        10.8                7.4              21.5              39.7                23.5

c. Status of raw material

The major challenge is access to raw materials, including wood, agro
residues like bagasse, wheat straw, rice husk and reused paper, at economical
prices. The cost of these raw materials is rising sharply. Wood prices in India have
risen by as much as 25-40 per cent in the last six months. This is a strain on the
industry's margins, and cripples its ability to 'plough back' for enhancing
capacity. In the early seventies, the share of wood based raw material was 84%,
whereas waste paper based material accounted for 9% and agro based material for 7%.
Due to the scarce availability of wood based raw material, the share of recycled waste
paper and agro based raw material has increased remarkably, and the share in
production is furnished in Table 3.3.


Table 3.3: Wood, Recycled and Agro based Mills production status

S. No   Segment       No of Mills   Production (M tpa)   Percent
1       Wood               26              3.19             31
2       Recycled          674              4.72             47
3       Agro waste        150              2.20             22
        Total             850             10.11            100

d. Wood based Mills

There are 30 large integrated paper mills located in India. The major raw

materials are hardwood species and bamboo. The production of wood based raw
material is about 3.1 million tons per annum, which contributes about 31% of the total
production. The present consumption of wood as raw material is 9 million tons per
annum, which is about 75% of the wood demand; it is being met through
farm/social forestry sources. Future demand will be an additional 12 million tons by
the year 2025. The average growing stock in forest areas is very low at 62 m3 per ha,
with a poor mean annual increment in the range of 1-5 m3 per ha per year.
The poor increments, extremely low sustainable yields and increasing demand
have led to a growing shortage of wood resources. Modernization, growth and
expansion of wood based industries have suffered for want of sustained supplies of
industrial wood at reasonable prices. Considering a yield of 50 tons of wood per
hectare with a felling cycle of 5 years, approximately 2.5 million hectares of

land need to be covered under pulp wood plantations.

e. Recycled fibre based mills

In India, more than two thirds of the mills use recycled/waste paper as the
primary fibre source. Recycled paper contributes about 4.72 million tons per
annum, or 47% of the country's total production of paper/paper board and


newsprint. Nearly 1.33 tons of recycled/waste paper is required to produce one ton
of paper. Currently, the availability of recycled or waste paper is a
challenge. Every year 1 million tons of waste paper is received, but the
collection of waste paper is not organized and waste disposal is not systematized
in the country. Recycled or waste paper is best suited for end products like
newsprint, duplex board and kraft paper. Processing of recycled or waste paper
to obtain clean stock for paper making involves a number of cleaning stages to
remove contaminants present in the waste paper, such as iron clips, latex, wax, inks
etc., which require an appropriate process configuration with state-of-the-art technologies
to produce clean stock. The majority of the paper mills lack state-of-the-art
processing technologies. Table 3.4 depicts the status of availability of recycled/

waste paper.

Table 3.4: Status of availability of Recycled / waste paper

S.No.   Particulars                                                    Million tonnes   % Share
1.      Indigenously recovered waste paper (27% of total consumption)        3.00         43.00
2.      Waste paper import                                                    4.00         57.00
        Total                                                                 7.00        100.00

f. Agro residue based mills

There are about 150 paper mills producing paper by using agro residues in

India. Bagasse and straws are used as the major raw materials, which contribute 22%
of the total production (2.2 million tons per annum). To produce 1 ton of paper,
nearly 2.5 tons Oven Dry (OD) of bagasse or 2.3 tons OD of wheat straw is
required. The sustainable availability of agro residues is a challenge.
Quality and environmental issues associated with manufacturing, the
cogeneration of power from bagasse by sugar mills, and increasing coal prices have


forced the agro mills to shift to wood fibre. The projected growth in the agro based
sector is assumed at only 8%. Table 3.5 shows the availability of agro based raw
materials. This growth would, however, be feasible provided technology is
developed for the use of rice straw in paper making, particularly for packaging
paper, and also assuming that bagasse will be available for the paper industry.

Variety wise production of paper from different raw materials is depicted in

Table 3.6.

Table 3.5: Availability of Agro based raw materials (Million Tonnes)

S. No.   Particulars          Bagasse   Wheat straw   Rice straw   Jute/Kenaf   Total
1.       Gross availability     53.0        115           58.0         3.0      229.00
2.       Net availability        5.2          2.6         16.0         0.5       24.30

Table 3.6: Variety wise production of paper from different raw materials (2010-11) (Million Tonnes)

Variety of Paper     Wood based   Agro based   Recycled / Waste paper based   Total
Writing/Printing         2.36         0.73                0.81                 3.90
Packaging                0.77         1.50                3.15                 5.42
Newsprint                0.03          -                  0.76                 0.79
Grand Total              3.16         2.23                4.72                10.11

g. The paper production process

The process whereby timber is converted into paper involves six steps.

The first four steps convert the logs into a mass of cellulose fibres with some

residual lignin using a mixture of physical and chemical processes. This pulp is


then bleached to remove the remaining lignin and finally spread out into smooth,

pressed sheets. For some papers (e.g. cardboards and 'brown paper') the bleaching

step is not needed, but all white and coloured papers require bleaching.

h. Wood preparation

The logs have their bark removed, either by passing through a drum

debarker or by being treated in a hydraulic debarker. The removed bark is a good

fuel, and is normally burnt in a boiler for generating steam. After debarking, the

logs are chipped by multi knife chippers into suitable sized pieces, and are then

screened to remove overlarge chips.

i. Pulping process

Pulping of wood can be done in two ways: mechanically or chemically.

Mechanical pulp

It was first developed in the early 1800s and is used today for newsprint.

Wood is mechanically ground to produce fibers for paper pulp. Grinding creates

very short paper fibers, which are also highly acidic due to the retention of the

wood‘s lignin. Lignin is a naturally occurring substance in wood that darkens and

breaks down into acidic by products as it ages. Ground wood pulp paper is born

acidic and rapidly becomes brittle. Therefore ground wood pulp paper has a

relatively short life expectancy. In the case of mechanical pulp, the wood is

processed into fibre form by grinding it against a quickly rotating stone under

addition of water. The yield of this pulp amounts to approximately 95%. The result is

called wood pulp or MP – mechanical pulp. The disadvantage of this type of pulp

is that the fibre is strongly damaged and that there are all sorts of impurities in the

pulp mass. Mechanical wood pulp yields a high opacity, but it is not very strong.

It has a yellowish colour and low light resistance. Figure 3.6 (www.knowpulp.com)

shows the pulping process.


Figure 3.6: Pulping process


Chemical pulp

Chemical pulp is called soda, sulfite, sulfate or kraft pulp (depending upon
how it is processed); chemical wood pulp paper was first developed in the mid-
1800s and is used today in printer and notebook paper. For

the production of wood pulp, the pure fibre has to be set free, which means that the

lignin has to be removed as well. To achieve this, the wood chips are cooked in a

chemical solution. In the case of wood pulp obtained by means of chemical pulping,

we differentiate between sulphate and sulphite pulp, depending on the chemicals

used. The yield of chemical pulping amounts to approximately 50%. The fibres in

the resulting pulp are very clean and undamaged. It is this type of pulp which is

used for all Sappi fine papers. The sulphate process is an alkaline process.

It allows for the processing of strongly resinous wood types, but this requires

expensive installations and intensive use of chemicals. The sulphite process

utilises a cooking acid consisting of a combination of free sulphurous acid and
sulphurous acid bound as magnesium bisulphite (the magnesium bisulphite process).

The 'cooking process' is where the main part of the delignification takes

place. Here the chips are mixed with "white liquor" (a solution of sodium

hydroxide and sodium sulphide), heated to increase the reaction rate and then

disintegrated into fibres by 'blowing' – subjecting them to a sudden decrease in

pressure. Typically some 150 kg of NaOH and 50 kg of Na2S are required per

tonne of dry wood. This process is, like any chemical reaction, affected by time,

temperature and concentration of chemical reactants. Time and temperature can be

traded off against each other to a certain extent, but to achieve reasonable cooking

times, it is necessary to have a temperature of about 150-165°C, so pressure

cookers are used. However, if the temperature is too high then the chips are

delignified unevenly, so a balance must be achieved. The dilute liquor from the

pulp washing (containing the dissolved inorganic and organic solids) is called
'black liquor'. The dissolved organics have to be removed for environmental


reasons and their burning also generates most of the heat energy required by the

kraft mill. The dissolved sodium hydroxide and sodium sulphide are regenerated

so that they can be reused in the white liquor, and thus the escape of an

environmental pollutant is prevented.

Pulp washing

Because of the high amounts of chemicals used in cooking the wood in

kraft pulping, the recovery of the chemicals is of crucial importance. The process

where the chemicals are separated from the cooked pulp is called pulp washing.

A good removal of chemicals (inorganic and organic) is necessary for several

reasons:

(i) The dissolved chemicals interfere with the downstream processing of the pulp.

(ii) The chemicals are expensive to replace.

(iii) The chemicals (especially the dissolved lignin) are detrimental to the

environment.

There are many types of machinery used for pulp washing. Most of them

rely on displacing the dissolved solids (inorganic and organic) in a pulp mat by hot

water, but some use pressing to squeeze out the chemicals with the liquid. An old,

but still common method is to use a drum, covered by a wire mesh, which rotates

in a diluted suspension of the fibres. The fibres form a mat on the drum, and

showers of hot water are then sprayed onto the fibre mat.

Pulp screening

Apart from fibres, the cooked pulp also contains partially uncooked fibre

bundles and knots. Modern cooking processes (together with good chip screening

to achieve consistent chip thickness) have good control over the delignification

and produce fewer 'rejects'. Knots and shives are removed by passing the pulp over

pulp screens equipped with fine holes or slots.


Bleaching

Bleaching is done in two stages. Firstly the pulp is treated with NaOH in

the presence of O2. The NaOH removes hydrogen ions from the lignin and then

the O2 breaks down the polymer. Then the pulp is treated with ClO2, then with a
mixture of NaOH, O2 and peroxide, and finally with ClO2 again to remove the

remaining lignin.

Paper Making

Paper making is the process whereby pulp fibres are mechanically and

chemically treated, formed into a dilute suspension, spread over a mesh surface,

the water removed by suction, and the resulting pad of cellulose fibres pressed and

dried to form paper. The mechanical treatment of the fibre normally takes place by

passing it between moving steel bars which are attached to revolving metal discs -

the so-called refiners. This treatment has two effects: it shortens the fibre (fibre

cutting) and it fibrillates the fibre. The latter action increases the surface area, and

as the fibres bond together in the paper sheet by hydrogen bonding, the increased

surface area greatly increases the bonding and strength of the paper. Paper strength

is dependent on the individual fibre strength and the strength of the bonds between

the fibres. It is usually the latter that is the limiting factor. Refining increases

the interfibre bonding at the expense of the individual fibre strength, but the net

result will be an increase in paper strength. Pressing and calendering (feeding

through rollers) increase density and promote smoothness.

Tamil Nadu Newsprint and Papers Ltd (TNPL)

Tamil Nadu Newsprint and Papers Limited (TNPL) was established by the

Government of Tamil Nadu during the early eighties to produce Newsprint and

Printing & Writing Paper using bagasse, a sugarcane residue, as primary raw

material. The Company commenced production in the year 1984 with an initial

capacity of 90,000 tonnes per annum (tpa). Over the years, the production capacity

has increased to 2,45,000 tpa and the Company has emerged as the largest bagasse


based Paper Mill in the world consuming about one million tonnes of bagasse

every year. TNPL exports about 1/5th of its production to more than 50 countries.

Manufacturing quality paper from bagasse for the past two and a half decades is
an index of the company's technological competence. A strong record in adopting

minimum impact best process technology, responsible waste management,

reduced pollution load and commitment to corporate social responsibility make

the company one of the most environmentally compliant paper mills in the world.

The primary data for the past 10 years were collected from TNPL and the forecasting is

carried out.

3.4 METHODOLOGY

In this chapter, it is proposed to investigate the performance of popular

regression algorithms for predicting pulpwood demand and estimating effort in

software development. For effort estimation, the COnstructive COst MOdel
(COCOMO) is used, as this helps to benchmark the present investigation against
other works in the literature. Two popular regression techniques, M5 and linear
regression, are used for predicting the output for the two datasets.
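A minimal sketch of how such a regression model could be fitted to an effort dataset is shown below; it uses scikit-learn's LinearRegression on made-up size/effort pairs purely for illustration, while the actual experiments use the COCOMO and pulpwood datasets (M5 model trees are typically fitted with a tool such as Weka's M5P).

# Minimal sketch: fit a linear regression to hypothetical size/effort pairs.
# The numbers below are made up for illustration; the thesis experiments use the
# actual COCOMO and pulpwood datasets.
import numpy as np
from sklearn.linear_model import LinearRegression

size_kloc = np.array([[10.0], [23.0], [41.0], [7.5], [62.0], [15.0]])   # hypothetical
effort_pm = np.array([24.0, 60.0, 118.0, 16.0, 190.0, 38.0])            # hypothetical

model = LinearRegression().fit(size_kloc, effort_pm)
print("coefficient:", model.coef_[0], "intercept:", model.intercept_)
print("predicted effort for 30 KLOC:", model.predict([[30.0]])[0])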

3.4.1 CONSTRUCTIVE COST MODEL (COCOMO) DATA SET

In the early stages of a software development life cycle, the effort

estimation plays a critical role to help the project managers identify the demands

of software development project with respect to the budgeting, scheduling, and

allocation of resources. The most significant effort estimation models, which are

used in software development projects, are the COCOMO, the System Evaluation

and Estimation of Resource Software Evaluation Model (SEER-SEM), and the

Software Life Cycle Management (SLIM) model. COCOMO, which was developed
by Barry Boehm in the 1980s, is the most popular and most widely used
estimation model for software projects. COCOMO estimates the amount of
software project effort based on the scale and cost factors of a software project
(Manalif, 2013). The COCOMO model is selected for the following reasons:


(i) It consists of a long history from its original version till its most recent version.

(ii) It is detailed and well documented.

(iii) Its datasets are available to the public in the PROMISE repository.

(iv) It provides commercial implementations such as Costar.

COCOMO is a regression-based software cost estimation model developed by Barry Boehm. It was first published in 1981 (hence the name COCOMO 81), and COCOMO II is the latest extension of the original COCOMO 81. Some differences between COCOMO 81 and COCOMO II are as follows:

(i) COCOMO 81 consists of 63 data points, uses Kilo Delivered Source Instructions (KDSI) to measure project size, and uses three development modes (which COCOMO II later represents through scale factors).

(ii) In contrast, COCOMO II consists of 161 data points, uses Kilo Source Lines Of Code (KSLOC) for project size, and has five scale factors (Jodpimai, 2009).

COCOMO II consists of three models, which differ from the COCOMO 81 model and are as follows:

(i) Application Composition Model: This model is suited for projects that are built with modern GUI-builder tools and is based on object points.

(ii) Early Design Model: This model helps to obtain a rough estimate of the project cost and duration before the architecture is fully determined. It uses a set of new cost drivers and an estimating equation, and is based on KSLOC or unadjusted function points.

(iii) Post-Architecture Model: This model is used only after the architecture has been designed. Here LOC is used as the size estimate, and the model covers the actual development and maintenance of a software product.

COCOMO II takes as input a set of seventeen Effort Multipliers (EM), or cost drivers, which are used to adjust the nominal effort to reflect the software


product being developed. The seventeen COCOMO II factors (cost drivers) are

shown in Table 3.7.

Table 3.7: COCOMO II Cost drivers

Cost Driver Description Value

RELY Required Software Reliability 1.1

DATA Data base size 1

RUSE Developed for Reusability 1

DOCU Documentation needs 1.23

CPLX Product Complexity 1.34

TIME Execution Time Constraints 1.29

STOR Main storage Constraints 1.05

PVOL Platform Volatility 0.87

ACAP Analyst Capability 0.71

PCAP Programmer Capability 0.88

APEX Application Experience 0.81

PLEX Platform Experience 0.85

LTEX Language and Tool Experience 0.84

PCON Personnel Continuity 0.9

TOOL Use of Software Tools 0.78

SITE Multisite Development 0.86

SCED Required Development Schedule 1

COCOMO II includes 17 cost drivers in the Post-Architecture model, and the nominal effort is adjusted by them as follows:

Effort = A × (Size)^B × ∏ (i = 1 to 17) EffortMultiplier_i        (3.8)


where

B = 1.01 + 0.01 × Σ (j = 1 to 5) ScaleFactor_j        (3.9)

A: Multiplicative constant.

Size: Size of the software project, measured in KSLOC.

The selection of a Scale Factor (SF) is based on the rationale that it accounts for variation in a project's effort (Attarzadeh 2010).

Scale Factors

The most critical input to the COCOMO II model is size, so a good size estimate is very important for any good model estimate. Size in COCOMO II is treated as a special cost driver in that it has an exponential factor, E (Musilek et al., 2002). The exponent E in the effort equation is an aggregation of five scale factors. All scale factors have rating levels. These rating

levels are Very Low (VL), Low (L), Nominal (N), High (H), Very High (VH) and

Extra High (XH). Each rating level has a weight W, which is a quantitative value

used in the COCOMO II model. The five COCOMO II scale factors are shown in

Table 3.8:

E = B + 0.01 × Σ (j = 1 to 5) SF_j        (3.10)

where B is a constant = 0.91. A and B are constant values devised by the COCOMO team by calibrating to the actual effort values of the 161 projects currently in the COCOMO II database.


Table 3.8: COCOMO II Scale factors (Yahya 2010)

Scale Factor Description Value

Precedentedness (PREC) Reflects the previous experience of the organization 3.72

Development Flexibility (FLEX) Reflects the degree of flexibility in the development process 1.01

Risk Resolution (RESL) Reflects the extent of risk analysis carried out 2.83

Team Cohesion (TEAM) Reflects how well the development team knows each other and works together 2.19

Process Maturity (PMAT) Reflects the process maturity of the organization 1.59
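As an illustration of equations (3.8) to (3.10), the following minimal Python sketch combines the cost driver values of Table 3.7 with the scale factor values of Table 3.8 to obtain an effort estimate. The multiplicative constant A = 2.94 and the project size of 50 KSLOC are assumed illustrative values (A = 2.94 is the commonly quoted COCOMO II.2000 calibration), not figures taken from this study.

    from math import prod

    # Effort multipliers (cost drivers) from Table 3.7
    effort_multipliers = [1.1, 1, 1, 1.23, 1.34, 1.29, 1.05, 0.87, 0.71,
                          0.88, 0.81, 0.85, 0.84, 0.9, 0.78, 0.86, 1]

    # Scale factors from Table 3.8
    scale_factors = [3.72, 1.01, 2.83, 2.19, 1.59]

    A = 2.94           # multiplicative constant (assumed; COCOMO II.2000 calibration)
    B = 0.91           # constant used in equation (3.10)
    size_ksloc = 50.0  # assumed project size in KSLOC

    E = B + 0.01 * sum(scale_factors)                         # equation (3.10)
    effort = A * size_ksloc ** E * prod(effort_multipliers)   # equation (3.8)
    print(f"E = {E:.4f}, estimated effort = {effort:.1f} person-months")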

Table 3.9 tabulates the COCOMO Dataset. This model helps to estimate the

cost, effort and schedule while planning a new software development activity.


Table 3.9: COCOMO Dataset

(Each record of the dataset lists the rating levels, from Very Low to Extra High, of the fifteen cost drivers RELY, DATA, CPLX, TIME, STOR, VIRT, TURN, ACAP, AEXP, PCAP, VEXP, LEXP, MODP, TOOL and SCED, together with the project size (LOC) and the actual effort (ACT_EFFORT) of each project.)


3.4.2 M5 Algorithm

Tree-building algorithms such as C4.5 determine, at each step, the attribute that best classifies the remaining data; tree construction is iterative. An advantage of decision trees is that they can be converted immediately into rules that decision-makers can interpret. Regression and model trees are used for numeric prediction in data mining. Both build decision trees where each leaf carries out a local regression for a specific region of the input space; the difference is that regression trees generate constant output values for input data subsets, whereas model trees generate linear (first-order) models for those subsets. The M5 algorithm builds trees whose leaves are associated with multivariate linear models, and the nodes of the tree are chosen over the attribute that maximizes the expected error reduction as a function of the standard deviation of the output parameter (Rodriguez et al., 2006). M5 model trees split the input space progressively. The set N of examples is either associated with a leaf, or some test is chosen that splits N into subsets corresponding to the test outcomes and

the same process is applied recursively to the subsets. Splits are based on minimizing

the intra-subset variation in the output values down each branch. The attribute that

maximizes the expected error reduction is chosen. The Standard Deviation Reduction

(SDR) is calculated by equation (3.11)

SDR = sd(N) − Σ_i ( |N_i| / |N| ) × sd(N_i)        (3.11)

where N - set of examples that reach the node;

Ni - subset of examples that have the ith outcome of the potential test;

sd - standard deviation
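A minimal Python sketch of how the SDR of one candidate split can be computed from equation (3.11) is given below; the output values and the split are assumed toy data, not drawn from the datasets used in this study.

    import statistics

    def sdr(parent, subsets):
        # Standard Deviation Reduction of one candidate split, equation (3.11)
        n = len(parent)
        return statistics.pstdev(parent) - sum(
            len(s) / n * statistics.pstdev(s) for s in subsets)

    # Assumed toy output values reaching a node, and one candidate binary split
    outputs = [60, 80, 48, 50, 42, 420, 450, 210]
    left, right = [60, 80, 48, 50, 42], [420, 450, 210]
    print(f"SDR of this split = {sdr(outputs, [left, right]):.2f}")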

Standard M5 adopts a greedy algorithm which constructs a model tree with a non-fixed structure by using a certain stopping criterion. M5 minimizes the error at each interior node, one node at a time. This process starts at the root and is repeated recursively until all or almost all of the instances are correctly classified. In constructing this initial tree M5 is greedy, and this can be improved. In principle, it is possible to build a fully non-greedy algorithm; however, the


computational cost of such an approach would be too high. The M5 algorithm employs a 'divide-and-conquer' principle. The set N is either associated with a leaf, or

some test is chosen that splits N into subsets corresponding to the test outcomes and

the same process is applied recursively to the subsets. The splitting criterion for the

M5 algorithm is based on treating the standard deviation of the class values that reach

a node as a measure of the error at that node, and calculating the expected reduction in

this error as a result of testing each attribute at that node.

After examining all possible splits, M5 chooses the one that maximizes the

expected error reduction. Splitting in M5 ceases when the class values of all the

instances that reach a node vary just slightly, or only a few instances remain.

The relentless division often produces over-elaborate structures that must be

pruned back, for instance by replacing a subtree with a leaf. In the final stage, a

smoothing process is performed to compensate for the sharp discontinuities that

will inevitably occur between adjacent linear models at the leaves of the pruned

tree, particularly for some models constructed from a smaller number of training

examples. In smoothing, the adjacent linear equations are updated in such a way

that the predicted outputs for the neighbouring input vectors corresponding to the

different equations become close in value (Wang and Witten, 1997). Figure 3.7

shows the M5 model tree algorithms. Pseudo-Code for M5 Algorithm is given in

Figure 3.8.

Figure 3.7: M5 Model Tree Algorithm


MakeModelTree (instances)
{
    SD = sd(instances)
    for each k-valued nominal attribute
        convert into k-1 synthetic binary attributes
    root = newNode
    root.instances = instances
    split(root)
    prune(root)
    printTree(root)
}

split (node)
{
    if sizeof(node.instances) < 4 or sd(node.instances) < 0.05 * SD
        node.type = LEAF
    else
        node.type = INTERIOR
        for each attribute
            for all possible split positions of the attribute
                calculate the attribute's SDR
        node.attribute = attribute with maximum SDR
        split(node.left)
        split(node.right)
}

prune (node)
{
    if node = INTERIOR then
        prune(node.leftChild)
        prune(node.rightChild)
        node.model = linearRegression(node)
        if subtreeError(node) > error(node) then
            node.type = LEAF
}

subtreeError (node)
{
    l = node.left; r = node.right
    if node = INTERIOR then
        return (sizeof(l.instances) * subtreeError(l)
                + sizeof(r.instances) * subtreeError(r)) / sizeof(node.instances)
    else
        return error(node)
}

Figure 3.8: Pseudo-code for M5 Algorithm
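The smoothing step described above can be illustrated with a small Python sketch. It assumes the linear blending formula p' = (n·p + k·q)/(n + k) usually quoted for M5, with smoothing constant k = 15; the numeric inputs are assumed illustrative values, not outputs of this study.

    def smooth(p_below, q_node, n_below, k=15):
        # Blend the prediction p passed up from the subtree with the prediction q
        # of the linear model at the current node (assumed M5 smoothing formula,
        # smoothing constant k = 15).
        return (n_below * p_below + k * q_node) / (n_below + k)

    # A leaf model predicts 120 from 8 training instances; the parent node's
    # linear model predicts 150 for the same input (assumed illustrative values).
    print(smooth(p_below=120.0, q_node=150.0, n_below=8))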

3.4.3 Linear Regression

Regression analysis is a statistical tool for the investigation of relationships between variables. For a single predictor, the mean function is shown in equation (3.12):

E(Y | X = x) = β0 + β1x        (3.12)

In most problems, more than one predictor variable will be available. This leads to the following 'multiple regression' mean function, as shown in equation (3.13):

E(Y | X) = β0 + β1x1 + β2x2 + … + βpxp        (3.13)

Linear regression is a very powerful statistical technique, where graphs

with straight lines are overlaid on scatter plots. Linear models can be used for

prediction or to evaluate whether there is a linear relationship between two

numerical variables.

In a perfect linear relationship, the exact value of y is known just from the value of x. This is unrealistic in almost any natural process. The height and weight of school children may be considered as an example. Their height, x, gives some information about their weight, y, but there is still a lot of variability, even for children of the same height.

The linear regression line is often written mathematically as in equation (3.14):


y = β0+β1x (3.14)

where β0 and β1 are the two linear model parameters to identify. Usually x represents an explanatory or predictor variable and y represents a response; the variable x is used to predict the response y. The point estimates of β0 and β1 are usually denoted b0 and b1 (Montgomery et al., 2012).
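A minimal Python sketch of obtaining the point estimates b0 and b1 by least squares is given below; the height and weight values are assumed illustrative numbers in the spirit of the school-children example above, not measured data.

    import numpy as np

    # Assumed illustrative heights (cm) and weights (kg) of school children
    x = np.array([120, 125, 130, 135, 140, 145], dtype=float)
    y = np.array([22, 25, 26, 30, 33, 35], dtype=float)

    b1, b0 = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
    print(f"y_hat = {b0:.2f} + {b1:.3f} * x")
    print("predicted weight at height 132 cm:", b0 + b1 * 132)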

3.4.4 Error Statistics

Error is the deviation of the observed value from the true value of the measured quantity. There are two types of error, absolute error and relative error. Absolute error is the difference between the true value and the observed one; it gives the exact deviation, in the units of the quantity, from the true value. Relative error, in contrast, is defined as the absolute error divided by the true value; it is generally expressed as a percentage and helps to compare how incorrect a quantity is relative to the value considered to be true.

Relative Error Equation

Relative error is determined by using the equation (3.15):

Relative error = (x - x0) /x (3.15)

where, x = true value of a quantity,

x0 = observed value of the quantity,

x - x0 = absolute error.

It is necessary to estimate accuracy for evaluation and validation.

A common evaluation criterion used in engineering (Setiono et al., 2010) is the Magnitude of Relative Error (MRE), which computes the absolute relative error between the actual and predicted efforts for each reference sample:


MRE_i = |actual_i − estimated_i| / actual_i        (3.16)

The Mean Magnitude of Relative Error (MMRE) (Stensrud et al., 2003) is the de facto standard evaluation criterion to assess the accuracy of software

prediction models. MMRE is a summary statistic, i.e., a single number,

aggregating the fundamental metric MRE, a relative residual error. MMRE is used

for two kinds of assessments (at least). One purpose of MMRE is to select

between competing prediction models: The model that obtains the lowest MMRE

is preferred. Another purpose is to provide a quantitative measure of the

uncertainty of a prediction (where a low MMRE is taken to mean low uncertainty

or inaccuracy). MMRE calculates MREs average over all reference samples.

As MMRE is sensitive to individual outlying predictions, the median of the MREs over the n samples (MdMRE) is also adopted, as it is less sensitive to extreme MRE values. Despite the use of MMRE for estimation

accuracy, there exists much discussion on its efficacy in estimation procedures.

MMRE has been criticized as being unbalanced in many validation circumstances,

resulting often in overestimation (Bhatnagar et al., 2010).

MMRE = (1/n) × Σ (i = 1 to n) MRE_i        (3.17)

MdMRE = median_i (MRE_i)        (3.18)
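The following minimal Python sketch computes MRE, MMRE and MdMRE from equations (3.16) to (3.18); the actual and predicted effort values are assumed illustrative numbers, not results from this study.

    import numpy as np

    def mre(actual, estimated):
        # Magnitude of Relative Error per sample, equation (3.16)
        actual = np.asarray(actual, dtype=float)
        estimated = np.asarray(estimated, dtype=float)
        return np.abs(actual - estimated) / actual

    # Assumed illustrative actual vs. predicted efforts
    actual    = [278, 1181, 248, 80, 120]
    predicted = [300,  900, 260, 95, 100]

    errors = mre(actual, predicted)
    print("MMRE  =", errors.mean())       # equation (3.17)
    print("MdMRE =", np.median(errors))   # equation (3.18)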


Advantages of MMRE

(i) Comparisons can be made across data sets.

(ii) It is independent of units. Independence of units means that it does not

matter whether effort is reported in work hours or work months.

An MMRE will be, say, 10 percent whatever unit is used.

(iii) Comparisons can be made across all kinds of prediction model types.

This means, for example, that it is considered as a valid and reliable

measure to compare AFAs with linear models.

(iv) It is scale independent. Scale independence means that the expected value

of MRE does not vary with size. In other words, an implicit assumption in

using MRE as a measure of predictive accuracy is that the error is

proportional to the size (effort) of the project (Myrtveit et al., 2005).

3.5 RESULTS AND DISCUSSION

In the first experiment, the COCOMO Dataset for software effort

estimation is used to evaluate the performance of the regression techniques M5

and Linear Regression. This is used for benchmarking, and the techniques are then evaluated on the real-time pulpwood dataset. The experimental setup consisted of using the attributes of the COCOMO dataset as they are. Table 3.10 tabulates the results achieved for the COCOMO dataset. Figure 3.9 shows the Magnitude of Relative Error achieved for the COCOMO dataset using the M5 algorithm and the linear regression technique. Appendix I lists the Magnitude of Relative Error and the absolute error achieved for the algorithms.


Table 3.10: Average MMRE and MdMRE achieved for M5 and

Linear Regression Technique - COCOMO Dataset

Technique Used MMRE MdMRE

M5 0.9884 39.90155

Linear Regression 1.655818 57.97808

Figure 3.9: Magnitude Relative Error achieved for COCOMO Dataset

It is observed from Figure 3.10 that the M5 algorithm achieves a lower MMRE than linear regression, whose performance is poorer on this dataset. The percentage difference in MMRE between M5 and Linear Regression is 50.48% for the COCOMO dataset.
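The percentage difference quoted here is the absolute difference between the two error values expressed relative to their mean; the short Python sketch below reproduces the 50.48% figure from the MMRE values in Table 3.10 under that definition.

    def percentage_difference(a, b):
        # Absolute difference between two error values relative to their mean
        return abs(a - b) / ((a + b) / 2) * 100

    # MMRE values from Table 3.10 (M5 vs. Linear Regression, COCOMO dataset)
    print(round(percentage_difference(0.9884, 1.655818), 2))   # 50.48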


Figure 3.10: MMRE achieved for COCOMO Dataset

Figure 3.11: MdMRE achieved for COCOMO Dataset

Figure 3.11 depicts the MdMRE achieved for the COCOMO dataset using the M5 algorithm and the linear regression technique. The percentage difference in MdMRE between M5 and Linear Regression is 36.93% for the COCOMO dataset.

The study is based on the data collected at the TNPL in Karur District,


Tamil Nadu. Over ten years of data have been collected on the demand pattern of pulpwood. The collected data are normalized. Table 3.11 shows sample demand data for pulpwood in metric tonnes.

Table 3.11: Sample Demand data of Pulpwood in MT (Metric Tonne)

(Source- TNPL Management Plan.)

Year Demand (MT)

2003 133719

2004 147505

2005 164804

2006 166471

2007 180577

2008 383315
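As a minimal sketch of the normalization step mentioned above, the demand values of Table 3.11 can be scaled as follows; min-max scaling to [0, 1] is an assumption here, since the exact normalization scheme is not specified.

    import numpy as np

    # Demand values (MT) from Table 3.11
    demand = np.array([133719, 147505, 164804, 166471, 180577, 383315], dtype=float)

    # Assumed min-max scaling to [0, 1]
    normalized = (demand - demand.min()) / (demand.max() - demand.min())
    print(np.round(normalized, 3))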

The MMRE and MdMRE are evaluated for the M5 and Linear Regression techniques on the pulpwood dataset. Table 3.12 provides the average MMRE and MdMRE for the two techniques, and Figures 3.12 and 3.13 depict the same.

Table 3.12: Average MMRE and MdMRE for pulpwood Dataset

Technique Used MMRE MdMRE

M5 0.314705 23.75804

Linear Regression 0.337028 25.63157


Figure 3.12: MMRE for M5 and Linear Regression Technique - Pulpwood

It is observed that the M5 algorithm achieves lower MMRE. The percentage

difference for MMRE between M5 and Linear Regression is 6.85% for the

pulpwood dataset.

Figure 3.13: MdMRE for M5 and Linear Regression Technique- Pulpwood


It is observed that the M5 algorithm achieves lower MMRE and MdMRE than linear regression. The percentage difference in MdMRE between M5 and Linear Regression is 7.61% for the pulpwood dataset.

3.6 CONCLUSION

The study evaluates regression algorithms for forecasting demand data from Tamil Nadu Newsprint and Papers Limited. The M5 algorithm and Linear Regression are used for evaluation. The COCOMO dataset for software effort estimation is used to benchmark the techniques before applying them to the pulpwood dataset. The MMRE and MdMRE are used as evaluation criteria. The percentage difference between M5 and Linear Regression for MMRE is 50.48% for the COCOMO dataset and 6.85% for the pulpwood dataset. Similarly, the percentage difference in MdMRE between the two algorithms is 36.93% for the COCOMO dataset and 7.61% for the pulpwood dataset. Simulation results demonstrate that the performance of the M5 algorithm is more effective. From the results it can also be concluded that linear regression can perform reasonably well when predicting data with a low number of attributes. The reason for the better performance of the M5 algorithm is that its leaves are associated with multivariate linear models and the nodes of the tree are constructed over the attribute that maximizes the expected error reduction. Further studies are required to reduce the error in forecasting.
