
Allocation of hard real-time periodic tasks for reliability maximization in distributed systems

Hamid Reza Faragardi, Reza Shojaee, Maziar Mirzazad-Barijough

School of Electrical and Computer Engineering University of Tehran

Tehran, Iran {h.faragardi, r.shojaee, mirzazad}@ut.ac.ir

Roozbeh Nosrati Department of Computing and Technology

Faculty of Science and Technology Anglia Ruskin University

Essex, UK [email protected]

Abstract— A real-time parallel application can be divided into a number of tasks and executed concurrently on distinct nodes of a Distributed System (DS). Distributed System Reliability (DSR) is defined as the probability that all the tasks in the system run successfully. Because nodes and links have different hazard rates, DSR critically depends on how the tasks are allocated onto the available nodes. In this paper, we present a mathematical model for analyzing DSR in a DS that executes hard real-time periodic tasks. In addition, we propose an offline task allocation algorithm that maximizes reliability while satisfying the system constraints. The algorithm is a new swarm intelligence approach based on Ant Colony Optimization (ACO). To evaluate the algorithm, ACO is compared with Honey Bee Mating Optimization (HBMO) and Particle Swarm Optimization (PSO). Simulation results show that ACO produces better solutions than PSO and HBMO while requiring shorter execution time. The results also demonstrate the flexibility and scalability of the proposed algorithm.

Keywords- distributed system; reliability; hard real-time system; task allocation; Ant Colony Optimization.

I. INTRODUCTION

In hard real-time systems, timing correctness is as important as logical correctness. Therefore, for a successful operation, the application must not only be logically correct but must also complete its work before the specified deadline. Many real-time systems are by nature distributed or parallel (e.g., cars) and involve many microprocessors that serve different special-purpose demands (e.g., cruise control, ABS, engine management). Furthermore, some of today’s real-time applications can be divided into a number of tasks and executed on the nodes of a Distributed System (DS). Major advantages of DSs over centralized systems are better performance, availability, reliability, resource sharing, and extensibility [1].

Task allocation is a significant challenge in the design of DSs. Real-world task allocation problems are often complex and impose many constraints to which the solution must adhere. Most work on task allocation in DSs focuses on performance measures in order to improve system efficiency. However, for the problem addressed here in the context of real-time distributed systems, the most important aspect is to guarantee hard real-time requirements (i.e., meeting deadlines). On the other hand, reliability is a critical requirement for most hard real-time systems (e.g., military systems, nuclear plants). Many previous studies on improving reliability in such systems have focused on redundancy and diversity [2][3][4][5][6]; however, these techniques impose extra hardware or software costs. An alternative for improving reliability in DSs is optimal task allocation. This method does not require additional hardware or software and improves system reliability purely through a proper allocation of the tasks to the nodes [7][8][9].

We consider a DS composed of several nodes. The nodes are connected by point-to-point communication links and may have different hazard rates, computation power, and memory capacity. The links, too, may have different hazard rates and bandwidths. A parallel application with hard real-time constraints is executed on the system. The application is divided into several periodic tasks, where each task can run on any node with a node-specific execution time. Because of the heavy communication overhead, we assume that task migration is not allowed; therefore, all instances of a task must execute on the same node. The tasks can communicate with each other through the communication links. Moreover, each periodic task has a deadline equal to its period. Distributed System Reliability (DSR) is defined as the probability that all of the tasks in the system run successfully [10]. The problem is to find a task allocation under which DSR is maximized while real-time requirements as well as other system and application constraints are satisfied.

This problem can be formulated as an optimization problem with a cost function representing the unreliability caused by two factors: the execution time of tasks on nodes and the inter-processor communication time. In this paper, a mathematical model for analyzing DSR in real-time DSs is presented. It should be noted that redundancy and task precedence constraints are neglected in the problem formulation. As shown in [5], this problem is NP-hard, so exact methods cannot produce an optimal solution in reasonable time for large-scale inputs. Hence, to solve the problem and find a near-optimal solution, an offline task allocation algorithm has been proposed. The algorithm is a new swarm intelligence technique based on Ant Colony Optimization (ACO). To evaluate the algorithm, we compared the reliability and execution time of ACO with Honey Bee Mating Optimization (HBMO) and Particle Swarm Optimization (PSO) for various numbers of nodes and tasks. The results show that, in contrast to HBMO and PSO, ACO obtains satisfactory reliability in reasonable execution time. The results also demonstrate the flexibility and scalability of the proposed algorithm.

The rest of this paper is organized as follows. Section II presents a brief survey of previous work. Section III gives the problem definition and a mathematical model. Section IV deals with the solution approach and introduces an Ant Colony Optimization-based algorithm to solve the problem. Section V discusses comparative experimental results. Finally, Section VI gives concluding remarks and future work.

II. RELATED WORKS

Many studies have aimed to improve reliability in distributed systems without considering real-time requirements. Kumar et al. presented reliability evaluation algorithms for distributed systems in 1988 [11]. One year later, Shatz and Wang introduced models and algorithms for reliability-oriented task allocation in redundant distributed computer systems [12]. Chen and Huang focused their research on fast algorithms for analyzing reliability in distributed systems [13]. Shatz and his colleagues were the first to use task allocation for maximizing the reliability of distributed computing systems, and they also presented an analytical model for the problem [10]. Since then, many algorithms based on this model have been proposed to find optimal or near-optimal solutions. Exact methods can produce optimal solutions and are usually based on the branch-and-bound idea. Kartik and Murthy used branch and bound with underestimates and reordered the tasks according to task independence to reduce the required computation; they also proved that reliability-oriented task allocation in distributed computing systems is NP-hard [5]. Thus, exact algorithms only work for problems of small and moderate size. Most research has therefore concentrated on developing heuristic and meta-heuristic algorithms. Vidyarthi and Tripathi proposed a solution based on a simple genetic algorithm to find a near-optimal allocation [14]. In 2006, Attiya and Hamam developed a simulated annealing algorithm for the problem and compared its performance with a branch-and-bound technique [7]. Yin et al. proposed a hybrid algorithm combining particle swarm optimization and a hill-climbing heuristic [8]. In 2010, Kang et al. used a honeybee mating optimization technique [9]. Faragardi et al. presented a hybrid of simulated annealing and tabu search with a non-monotonic cooling schedule for solving the problem [15]. In 2012, Shojaee et al. proposed a new swarm intelligence technique based on the Cat Swarm Optimization (CSO) algorithm to find near-optimal solutions for the problem [16]. In addition, during recent years many efforts have been made toward efficient task allocation in distributed systems under timing constraints, although their goal was not reliability maximization. In 2006, Chen and Kuo proposed an approximation algorithm for task allocation in real-time DSs that minimizes energy [17]. Emberson and Bate concentrated on determining the number of replicas for the tasks so as to minimize energy in redundant systems [18]. Zhu and Aydin focused on online algorithms for task allocation in real-time DSs, presenting a reliability-aware energy management scheme that uses EDF to meet deadlines [19].

III. PROBLEM STATEMENT

We consider a heterogeneous DS in which each processing node may have different computation power, memory size, and failure rate. The nodes are connected to each other by point-to-point communication links, and we assume that each pair of nodes communicates only through a direct link. In addition, the communication links between the nodes may have different bandwidths and failure rates. Each component of the system (node or communication link) can be in one of two states: operational or failed. The failure of a component follows a Poisson process with a constant failure rate, and failures of components are statistically independent. These assumptions are widely used in the reliability analysis of computing systems [20][21][22][23]. The reliability of the considered DS depends on:

• The number of computing nodes composing the system and their individual likelihoods of failure.

• The likelihoods of failure for each path between a pair of nodes.

A real-time parallel application can be divided into a set of M periodic tasks that are executed on a system comprising N processors. The tasks communicate at known rates. In this system, task execution times are processor dependent, i.e., the execution time of a task varies across processors. Furthermore, each task has a deadline equal to its period, and the execution of each instance of a task must finish before its deadline. Since we assume a hard real-time application, resource allocation must be performed such that all instances of all tasks meet their deadlines. Because Earliest Deadline First (EDF) is an optimal scheduling algorithm [24], we use it for task scheduling on each processor.

Formally, DSR is defined as the probability that all of the tasks in the system run successfully. Accordingly, even a single transient node failure during the execution of any task instance, or a transient link failure during data transmission, causes the mission to fail. However, if a component fails during an idle period, we do not consider this a critical failure; the component is replaced by a spare.

The goal is to find a task-to-processor assignment under which DSR is maximized while the memory, communication path load, and real-time constraints are met. In the following, we formally define this problem.

A. Notations

The notations that are used throughout this paper can be summarized as follows:

• N is the number of nodes (processors).
• M is the number of tasks.
• P = {P1, ..., PN} is the set of nodes.
• PHRi is the hazard rate of processor Pi.
• T = {T1, ..., TM} is the task set, ordered by task deadlines in ascending order.
• Di is the relative deadline of task Ti.
• τi is the period of task Ti (we assume τi = Di).
• eij is the execution time of one instance of Ti on Pj (we assume that all instances of a task have the same execution time).
• Ki is the number of instances of Ti during the mission.
• Eij is the total execution time of task Ti on node Pj (Eij = Ki · eij).
• uj is the utilization of node Pj.
• CLij is the communication link between nodes Pi and Pj.
• X = [xij] is the task-to-node assignment matrix, where xij = 1 if and only if Ti is assigned to Pj, and xij = 0 otherwise.
• CBWij is the communication bandwidth of CLij.
• CRij is the total communication rate between all instances of tasks Ti and Tj.
• PLij is the path load of CLij.
• CHRij is the communication hazard rate of CLij.
• Memi is the memory capacity of node Pi.
• memi is the memory needed by an instance of task Ti.
• Rs(X) is the system reliability under assignment X.
• Rs'(X) is the system reliability without considering link failures.
• Rs''(X) is the system reliability without considering node failures.
• C(X) is the cost of assignment X (defined shortly).
• TC(X) is the total cost of assignment X (defined shortly).

B. Constraints

In this section we outline the principal constraints of the system and formally define them.

• Memory constraint: The total amount of memory requirements for the tasks assigned to a processor should not exceed the available memory of that processor. This constraint is formulated in Eq. 1.

• Path load constraint: The total amount of communication rate requirements of the tasks which communicate through a specific path should not exceed the load of that path. This constraint is formulated in Eq. 2.

• Deadline constraint: To formulate the deadline constraint, we assume EDF as the scheduling algorithm. The necessary and sufficient schedulability condition for EDF is stated in Theorem 1; consequently, EDF meets all deadlines if and only if the CPU utilization of each node is at most one. This constraint is formulated in Eq. 3. Theorem 1: A set of n independent, preemptive periodic tasks with relative deadlines equal to their periods is schedulable by the EDF algorithm if and only if the CPU utilization is less than or equal to 1 [24].

$\sum_{i=1}^{M} mem_i \, x_{ik} \le Mem_k$  for all $k$, $1 \le k \le N$.  (1)

$\sum_{i=1}^{M} \sum_{j=1}^{M} CR_{ij} \, x_{ip} \, x_{jq} \le PL_{pq}$  for all paths $pq$, $1 \le p < q \le N$.  (2)

$u_j = \sum_{i=1}^{M} \frac{e_{ij}}{\tau_i} \, x_{ij} \le 1$  for all $j$, $1 \le j \le N$.  (3)
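As a concrete illustration of Eqs. (1)-(3), the following Python sketch checks whether a given assignment matrix is feasible. It is only an illustration of the model, not part of the paper's formulation; the array names (x, mem, Mem, CR, PL, e, tau) are placeholders mirroring the notation defined above.

import numpy as np

def feasible(x, mem, Mem, CR, PL, e, tau):
    """Check constraints (1)-(3) for a 0/1 task-to-node assignment matrix x of shape (M, N)."""
    M, N = x.shape

    # Eq. (1): memory demand of the tasks on each node must not exceed its capacity
    if np.any(mem @ x > Mem):
        return False

    # Eq. (2): traffic between every node pair must not exceed the path load bound
    for p in range(N):
        for q in range(p + 1, N):
            if x[:, p] @ CR @ x[:, q] > PL[p, q]:
                return False

    # Eq. (3): EDF utilization of each node must not exceed 1
    util = ((e / tau[:, None]) * x).sum(axis=0)
    return bool(np.all(util <= 1.0))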

C. Reliability modeling

We formulate the system reliability in three steps. First, reliability of nodes is considered. Afterwards, reliability of paths will be formulated. Finally, using node and path reliabilities, system reliability will be formally defined.

1. Reliability of nodes: We assume that the reliability of a node equals the reliability of its processor (i.e., the memory and other parts of a node are perfect). The reliability of processing node p over the time interval [0, t] is

$R_p(t) = e^{-\int_0^{t} \lambda_p(x)\,dx}$  (4)

where $\lambda_p(x)$ is the hazard rate at time $x$. Assuming a constant hazard rate (as in [10] and [25]), Eq. 4 reduces to

$R_p(t) = e^{-PHR_p \, t}$  (5)

Then, since the total time spent executing the tasks assigned to node k under assignment X is $\sum_{i=1}^{M} K_i \, e_{ik} \, x_{ik} = \sum_{i=1}^{M} E_{ik} \, x_{ik}$, the corresponding reliability of processor $P_k$ is

$R_k(X) = e^{-PHR_k \sum_{i=1}^{M} E_{ik} x_{ik}}$  (6)

As node failures can be assumed independent [10][25], the reliability of the system without considering link failures is obtained by multiplying the node reliabilities:

$R_s'(X) = \prod_{k=1}^{N} e^{-PHR_k \sum_{i=1}^{M} E_{ik} x_{ik}}$  (7)

$\;\;\;\;\;\;\;\;\;\;\; = e^{-\sum_{k=1}^{N} PHR_k \sum_{i=1}^{M} E_{ik} x_{ik}}$  (8)

2. Reliability of paths: Similarly, assuming a hazard rate $\lambda_{pq}(x)$ for path $CL_{pq}$, the reliability of the path $CL_{pq}$ is

$R_{pq}(t) = e^{-\int_0^{t} \lambda_{pq}(x)\,dx}$  (9)

Under the constant hazard rate assumption, this simplifies to

$R_{pq}(t) = e^{-CHR_{pq} \, t}$  (10)

Then, since under assignment X the total communication time between the tasks assigned to nodes p and q is $\sum_{i=1}^{M} \sum_{j=1}^{M} x_{ip} \, x_{jq} \, (CR_{ij} / CBW_{pq})$, Eq. 10 is rewritten as

$R_{pq}(X) = e^{-CHR_{pq} \sum_{i=1}^{M} \sum_{j=1}^{M} x_{ip} x_{jq} (CR_{ij}/CBW_{pq})}$  (11)

As we assume independent failures of different paths, the reliability of the system without considering node failures is obtained by multiplying the path reliabilities:

$R_s''(X) = \prod_{p=1}^{N-1} \prod_{q=p+1}^{N} e^{-CHR_{pq} \sum_{i=1}^{M} \sum_{j=1}^{M} x_{ip} x_{jq} (CR_{ij}/CBW_{pq})} = e^{-\sum_{p=1}^{N-1} \sum_{q=p+1}^{N} CHR_{pq} \sum_{i=1}^{M} \sum_{j=1}^{M} x_{ip} x_{jq} (CR_{ij}/CBW_{pq})}$  (12)

3. System reliability: Due to the independence of node and path failures, the system reliability can be formulated as

$R_s(X) = R_s'(X) \cdot R_s''(X)$  (13)

Using (8), (12), and (13), $R_s(X)$ can be calculated by

$R_s(X) = e^{-\left(\sum_{k=1}^{N} PHR_k \sum_{i=1}^{M} E_{ik} x_{ik} \;+\; \sum_{p=1}^{N-1} \sum_{q=p+1}^{N} CHR_{pq} \sum_{i=1}^{M} \sum_{j=1}^{M} x_{ip} x_{jq} \frac{CR_{ij}}{CBW_{pq}}\right)}$  (14)
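For concreteness, a minimal sketch of the evaluation of Eq. (14) is given below. The parameter names (PHR, E, CHR, CR, CBW) are placeholders for the quantities of Section III.A; the paper itself does not prescribe an implementation.

import numpy as np

def system_reliability(x, PHR, E, CHR, CR, CBW):
    """Evaluate Rs(X) of Eq. (14) for a 0/1 assignment matrix x of shape (M, N)."""
    M, N = x.shape

    # exponent of Eq. (8): processing-related unreliability
    node_term = float(PHR @ (E * x).sum(axis=0))

    # exponent of Eq. (12): communication-related unreliability
    link_term = 0.0
    for p in range(N):
        for q in range(p + 1, N):
            link_term += CHR[p, q] * float(x[:, p] @ CR @ x[:, q]) / CBW[p, q]

    return float(np.exp(-(node_term + link_term)))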

D. Problem formulation

According to the stated constraints and reliability model, we can define our problem as an integer linear programming (ILP) problem as follows:

maximize Rs(X) subject to (1), (2), and (3)

To incorporate this problem into our solution framework, it is more convenient to integrate the constraints and the objective function into a single cost function whose value is to be minimized. Therefore, we model the violation of each constraint as a penalty function. The penalty functions indicating violation of the memory, path load, and deadline constraints are formulated in Eqs. 15, 16, and 17, respectively:

$P_M = \sum_{k=1}^{N} \max\!\left(0,\; \sum_{i=1}^{M} mem_i \, x_{ik} - Mem_k\right)$  (15)

$P_C = \sum_{p=1}^{N-1} \sum_{q=p+1}^{N} \max\!\left(0,\; \sum_{i=1}^{M} \sum_{j=1}^{M} CR_{ij} \, x_{ip} \, x_{jq} - PL_{pq}\right)$  (16)

$P_D = \sum_{j=1}^{N} \max\!\left(0,\; u_j - 1\right)$  (17)

In addition, to take the reliability aspect into account, we define a cost function C(X) as

$C(X) = \sum_{k=1}^{N} PHR_k \sum_{i=1}^{M} E_{ik} x_{ik} \;+\; \sum_{p=1}^{N-1} \sum_{q=p+1}^{N} CHR_{pq} \sum_{i=1}^{M} \sum_{j=1}^{M} x_{ip} x_{jq} \frac{CR_{ij}}{CBW_{pq}}$  (18)

From (14) and (18) we have $R_s(X) = e^{-C(X)}$. Therefore, maximizing the system reliability $R_s(X)$ is equivalent to minimizing the cost function $C(X)$. Based on the penalty functions and this cost function, we define the total cost of an assignment X as a weighted sum of its cost and all penalties:

$TC(X) = C(X) + \alpha P_M + \beta P_C + \gamma P_D$  (19)

where the coefficients $\alpha$, $\beta$, and $\gamma$ express the importance of each penalty function. They should be selected such that solving the problem stated above is equivalent to minimizing the total cost TC(X); in other words, they should guide the search towards valid solutions and away from invalid ones. Because the memory and path load penalties have the same importance in our model (neither may be violated and they have roughly the same scale), we assume the coefficients $\alpha$ and $\beta$ are equal and replace them with a common value $\lambda$, which yields

$TC(X) = C(X) + \lambda (P_M + P_C) + \gamma P_D$  (20)

Accordingly, the main goal is to minimize the total cost function TC(X).

One option for determining the coefficients ($\lambda$ and $\gamma$) is for the decision-maker to tune their values according to the importance of the corresponding penalty function. For example, in a soft real-time system, where missing a small number of deadlines is tolerable, $\gamma$ can be set to a lower value to reflect this flexibility.
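To show how the cost and the penalties fit together, here is a sketch of the total cost TC(X) of Eq. (20), under the assumption that the penalties of Eqs. (15)-(17) take the max(0, overflow) form given above. The names lam and gamma stand for λ and γ, and their default values here are arbitrary; all array names are placeholders mirroring Section III.A.

import numpy as np

def total_cost(x, PHR, E, CHR, CR, CBW, mem, Mem, PL, e, tau, lam=1.0, gamma=1.0):
    """Evaluate C(X) of Eq. (18) plus the weighted penalties of Eq. (20)."""
    M, N = x.shape

    # C(X), Eq. (18): the exponent of Eq. (14), so that Rs(X) = exp(-C(X))
    cost = float(PHR @ (E * x).sum(axis=0))
    for p in range(N):
        for q in range(p + 1, N):
            cost += CHR[p, q] * float(x[:, p] @ CR @ x[:, q]) / CBW[p, q]

    # PM, Eq. (15): memory overflow summed over the nodes
    PM = float(np.maximum(0.0, mem @ x - Mem).sum())

    # PC, Eq. (16): path-load overflow summed over the node pairs
    PC = 0.0
    for p in range(N):
        for q in range(p + 1, N):
            PC += max(0.0, float(x[:, p] @ CR @ x[:, q]) - PL[p, q])

    # PD, Eq. (17): EDF utilization overflow summed over the nodes
    util = ((e / tau[:, None]) * x).sum(axis=0)
    PD = float(np.maximum(0.0, util - 1.0).sum())

    return cost + lam * (PM + PC) + gamma * PD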

IV. SOLUTION APPROACH

In this section, a swarm intelligence approach based on the Ant Colony Optimization algorithm is suggested for solving the problem. First, we describe ACO, and then we introduce our offline task allocation algorithm in detail.


A. Ant Colony

The Ant Colony Optimization meta-heuristic was proposed by M. Dorigo [26][27]. The basic idea is to imitate the behavior of real ants in order to find good solutions to combinatorial optimization problems. Ants initially search for food in a random manner. During their trips, they leave an olfactory and volatile substance, named pheromone, on the ground. The probability that an ant chooses a path is proportional to the amount of pheromone deposited on it. After a while, the amount of pheromone decreases on some paths because of evaporation and increases on others because of reinforcement, and eventually the colony converges to the shortest path.

B. Applying ACO to solve the task allocation problem

In the proposed algorithm, each ant represents a solution to the task allocation problem, encoded as a vector whose ith element is the index of the processor assigned to the ith task. For example, Fig. 1 shows an allocation for a problem with M tasks and N processors, in which task 1 is assigned to processor 4.

Figure 1. An example of the representation of the ith solution: (Si1, Si2, Si3, Si4, Si5) = (4, 2, 2, 1, 3).

In the initialization step of the algorithm, the ants' solutions are generated randomly. A local search is then applied to the solutions, and the pheromone matrix is updated based on the cost of the best solution. In the main loop, each solution is modified based on the pheromone trail matrix and the local search is applied again. Finally, the pheromone trails are decreased due to evaporation and updated according to the costs of the solutions found in the iteration. This continues until a stopping criterion, such as a maximum number of iterations, is met.

C. Pheromone trail

The pheromone trails are the main components of ant systems and are used to modify the existing solutions in two different phases. One is the process of choosing a path probabilistically, which is called exploration; the other is exploitation, which selects the component that maximizes a blend of the trail values and a partial objective function. During each iteration, a function is called for every ant that assigns a processor j to each task i of the solution with a probability proportional to the pheromone trail value $F_{ij}$. The trails are modified as reinforcement and evaporation take place over the iterations: the better a solution is, or the more often a component has been chosen, the more the corresponding pheromone trail increases. The pheromone trail is updated as

$F_{ij}(t+1) = \rho \cdot F_{ij}(t) + \sum_{k=1}^{B} x_{ij}^{k} \, \Delta_{ij}^{k}$  (21)

where $\rho$ is the evaporation factor, $x_{ij}^{k}$ equals 1 if task i is assigned to node j in the kth ant's solution (and 0 otherwise), and $\Delta_{ij}^{k}$ is the amount of pheromone that the kth ant deposits when the assignment of processor j to task i exists in the solution found in the current iteration. $\Delta_{ij}^{k}$ is computed as follows (Q is a parameter):

$\Delta_{ij}^{k} = Q \,/\, TC(S_k)$  (22)
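A minimal sketch of the trail update of Eqs. (21)-(22) is shown below. F, solutions, costs, rho, and Q are placeholder names (solutions[k][i] is assumed to give the node chosen for task i by ant k); the sketch is illustrative, not the paper's implementation.

import numpy as np

def update_pheromone(F, solutions, costs, rho, Q):
    """Evaporate the (M x N) trail matrix F, then reinforce the (task, node) pairs used by each ant."""
    F *= rho                               # evaporation term of Eq. (21)
    for sol, tc in zip(solutions, costs):  # one pass per ant
        delta = Q / tc                     # Eq. (22): deposit proportional to 1 / TC(S_k)
        for task, node in enumerate(sol):
            F[task, node] += delta
    return F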

D. Local search

To help the algorithm converge to a good solution faster, a search in the neighborhood of each solution is performed. The local search in this algorithm evaluates the costs of only half of the neighborhood: it randomly chooses half of the tasks in a solution and checks the other possible assignments for them, while the remaining tasks stay unchanged.
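One plausible reading of this local search, written as a sketch: cost_fn stands for the total cost TC and is not defined in the paper; the acceptance rule (keep any improving move) is an assumption.

import random

def local_search(solution, N, cost_fn):
    """Randomly pick half of the tasks and try every other node for each of them,
    keeping any move that lowers the cost; the remaining tasks stay unchanged."""
    best = list(solution)
    best_cost = cost_fn(best)
    for task in random.sample(range(len(best)), len(best) // 2):
        for node in range(N):
            if node == best[task]:
                continue
            candidate = list(best)
            candidate[task] = node
            c = cost_fn(candidate)
            if c < best_cost:
                best, best_cost = candidate, c
    return best, best_cost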

E. Initialization

In this algorithm, each solution represents an artificial ant; it is first initialized randomly and then improved by the local search. The pheromone trail matrix is then initialized based on the value of the best solution found so far: all trails are set to the same initial value, 1/(Q · cost(Sbest)), where cost(Sbest) is the cost of the best solution found so far.

Algorithm 1. ACO

Randomly initialize the solutions S1(1), …, SB(1), one per ant
Apply local search to improve the initial solutions
Let Sbest be the solution with the minimum cost
Initialize the pheromone trail matrix F based on the cost of Sbest
intense = true
diverse = true
/* main loop */
For t = 1 to max_no_of_iterations
    /* diversification */
    If (NI >= MNI) and (diverse = true)
        Re-initialize the pheromone trail matrix F
        For i = 1 to no_of_ants
            If cost(Si(t-1)) > best_cost
                Re-initialize Si(t-1) randomly
        diverse = false
    /* solution manipulation */
    For u = 1 to no_of_ants
        For i = 1 to M
            Sfu(t)[i] = Lottery(F(i))
        Slu(t) = improvement of Sfu(t) by local search
    /* intensification */
    For u = 1 to no_of_ants
        If intense = true
            Su(t+1) = the better of Su(t) and Slu(t)
        Else
            Su(t+1) = Slu(t)
    If Su(t+1) = Su(t) for all the ants
        intense = false    // deactivate intensification
    If there is an ant with cost(Slu(t)) < cost(Sbest)
        Sbest = Slu(t)
        intense = true     // activate intensification
    /* updating pheromone trail (evaporation) */
    For i = 1 to M
        For j = 1 to N
            F[i][j] = F[i][j] * p    // p is the evaporation factor
    /* reinforcement */
    For i = 1 to B
        For j = 1 to M
            F[j][Si[j]] = F[j][Si[j]] + Q / cost(Si)
    If cost(S*(t+1)) < cost(S*(t))
        NI = 0
    Else
        NI = NI + 1
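The Lottery(F(i)) step above is a roulette-wheel selection over row i of the pheromone matrix. A sketch of one way it could be implemented is shown below; the paper does not spell out the sampling routine, so this is an assumed realization.

import random

def lottery(pheromone_row):
    """Pick a node with probability proportional to its pheromone value in this row."""
    total = sum(pheromone_row)
    threshold = random.uniform(0.0, total)
    acc = 0.0
    for node, trail in enumerate(pheromone_row):
        acc += trail
        if acc >= threshold:
            return node
    return len(pheromone_row) - 1  # fallback for floating-point round-off

Together with the local search and pheromone update sketches above, this covers the constructive step of Algorithm 1.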


Table I Initial Values for ACO parameters


F. Intensification

Intensification is used to explore the neighborhood of the better solutions more completely. When intensification is active, a solution is replaced only if it has been improved after applying the pheromone-based modification and the local search in the current iteration.

G. Diversification

When the solutions have not improved for a particular number of iterations, indicated by MNI, diversification is used to restart the algorithm. In this step, the pheromone trail matrix is re-initialized, and every ant whose solution cost exceeds the cost of the best solution found so far is re-initialized as well. The pseudo code of ACO is provided in Alg. 1. The algorithm has several parameters that must be initialized properly; we experimented extensively with different values to tune them and enhance solution accuracy. The initial values used for ACO are given in Tab. I.

V. PERFORMANCE EVALUATION

To evaluate the efficiency of the proposed algorithm, extensive experiments were conducted and the algorithm was compared with HBMO and PSO. PSO was originally proposed by Kennedy and Eberhart [28] and is inspired by the social behavior of natural swarms such as flocks of birds and schools of fish; mimicking the communication within real swarms, PSO combines self-experience with social experience. HBMO is a newer swarm intelligence technique originally proposed by Abbass in 2001 [29]. The attractiveness of HBMO lies in its natural metaphor, simplicity, and high-quality solutions: the HBMO-based approach combines the power of simulated annealing and genetic algorithms with a fast problem-specific local search heuristic to find the best possible solution within a reasonable computation time. PSO and HBMO have recently been used to solve optimal task allocation in DSs [8][9]. Our results indicate that ACO is superior to these two algorithms in terms of both solution quality and execution time. PSO, HBMO, and ACO were implemented in C++. The system specifications are: Fedora 14 as the operating system, a 2.13 GHz Intel Core i3 processor, and 4 GB of dual-channel RAM.

The simulation program contains three major parts. To compute system reliability in a real-time DS, many parameters, such as the hazard rate of each node and path, the memory of each node, and the bandwidth of each link, must be determined; the first part therefore sets the values of the DS parameters, which are tabulated in Tab. II. These values are similar to the ones used in [10][8]. The second part reads the number of tasks (M) and the number of nodes (N) as inputs and generates an application task graph and the corresponding parameters, such as the execution time of each instance, the communication rates between tasks, the number of instances of each task, and the task deadlines; the application parameters are also tabulated in Tab. II. The last part executes the algorithms and generates the results. Ten sets of randomly generated problems of size M = 16, 20, 24, 28, and 32 (for N = 8) and M = 24, 28, 32, 36, and 40 (for N = 10) were used. For each problem size (N, M), 20 simulation runs were conducted with PSO, HBMO, and ACO. The average reliability, the average execution time, and the standard deviation of the reliability are tabulated in Tab. III. The simulation results indicate that ACO produces better solutions than PSO and HBMO while requiring shorter execution time. The standard deviations also show that the results of the proposed algorithm deviate less from the average reliability than those of HBMO and PSO. The results also demonstrate the flexibility and scalability of the proposed algorithm.

Table II System parameters and the corresponding value ranges


VI. CONCLUSION

In this paper, we considered a heterogeneous DS that runs a hard real-time application. The application is divided into a number of hard real-time periodic tasks that execute concurrently on the nodes. We assumed EDF as the scheduling algorithm on each node and presented a mathematical model for analyzing reliability in real-time distributed systems with respect to the schedulability condition of EDF. We also presented an offline task allocation algorithm that finds a task allocation maximizing system reliability while satisfying the application and system constraints.

Table III. Experimental results for various numbers of nodes and tasks for PSO, HBMO, and ACO (average reliability, standard deviation of reliability, and execution time for each algorithm)


The algorithm is based on Ant Colony Optimization, strengthened by an intensification mechanism. To decrease the risk of early convergence, we employed a diversification mechanism that periodically erases all pheromone trails. For evaluating the algorithm, ACO was compared with Honey Bee Mating Optimization and Particle Swarm Optimization. The computational evaluations clearly support the high performance of ACO relative to the other meta-heuristic algorithms that have been applied to finding optimal task allocations in DSs. The standard deviations also show that the results of the proposed algorithm deviate less from the average reliability than those of HBMO and PSO. As future work, we plan to consider correlated failures in order to model real-world systems more accurately.

REFERENCES

[1] K. K. Aggarwal and S. Rai, “Reliability Evaluation in Computer-Communication Networks,” IEEE Transactions on Reliability, vol. R-30, no. 1, pp. 32–35, Apr. 1981.

[2] A. Elegbede, K. Adjallah, and F. Yalaoui, “Reliability allocation through cost minimization,” IEEE Transactions on Reliability, vol. 52, no. 1, pp. 106–111, Mar. 2003.

[3] C. C. Chiu, Y. S. Yeh, and J. S. Chou, “A fast algorithm for reliability-oriented task assignment in a distributed system,” Computer Communications, vol. 25, no. 17, pp. 1622–1630, Nov. 2002.

[4] C. C. Hsieh, “Optimal task allocation and hardware redundancy policies in distributed computing systems,” European Journal of Operational Research, vol. 147, no. 2, pp. 430–447, Jun. 2003.

[5] S. Kartik and C. Siva Ram Murthy, “Improved task-allocation algorithms to maximize reliability of redundant distributed computing systems,” Reliability, IEEE Transactions on, vol. 44, no. 4, pp. 575–586, 1995.

[6] A. Kumar and D. P. Agrawal, “A generalized algorithm for evaluating distributed-program reliability,” Reliability, IEEE Transactions on, vol. 42, no. 3, pp. 416–426, 1993.

[7] G. Attiya and Y. Hamam, “Task allocation for maximizing reliability of distributed systems: A simulated annealing approach,” Journal of Parallel and Distributed Computing, vol. 66, no. 10, pp. 1259–1266, Oct. 2006.

[8] P. Y. Yin, S. S. Yu, P. P. Wang, and Y. T. Wang, “A hybrid particle swarm optimization algorithm for optimal task assignment in distributed systems,” Computer Standards & Interfaces, vol. 28, no. 4, pp. 441–450, 2006.

[9] Q. M. Kang, H. He, H. M. Song, and R. Deng, “Task allocation for maximizing reliability of distributed computing systems using honeybee mating optimization,” Journal of Systems and Software, vol. 83, no. 11, pp. 2165–2174, 2010.

[10] S. M. Shatz, J. P. Wang, and M. Goto, “Task Allocation for Maximizing Reliability of Distributed Computer Systems,” vol. 11, no. 0, 1992.

[11] A. Kumar, S. Rai, and D. P. Agrawal, “Reliability evaluation algorithms for distributed systems,” in INFOCOM’88. Networks: Evolution or Revolution, Proceedings. Seventh Annual Joint Conference of the IEEE Computer and Communications Societies, IEEE, 1988, pp. 851–860.

[12] S. M. Shatz and J. P. Wang, “Models and algorithms for reliability-oriented task-allocation in redundant distributed-computer systems,” Reliability, IEEE Transactions on, vol. 38, no. 1, pp. 16–27, 1989.

[13] T. H. Huang, D. J. Chen, and M. C. Sheng, “An algorithm to generate FSTs for the reliability analysis of distributed systems,” in TENCON 90. 1990 IEEE Region 10 Conference on Computer and Communication Systems, 1990, pp. 150–154.

[14] D. P. Vidyarthi and A. K. Tripathi, “Maximizing reliability of distributed computing system with task allocation using simple genetic algorithm,” Journal of Systems Architecture, vol. 47, no. 6, pp. 549–554, 2001.

[15] H. R. Faragardi, R. Shojaee, and N. Yazdani, “Reliability-Aware Task Allocation in Distributed Computing Systems using Hybrid Simulated Annealing and Tabu Search,” IEEE International Conference on High Performance Computing and Communications, 2012.

[16] R. Shojaee, H. R. Faragardi, S. Alaee, and N. Yazdani, “A New Cat Swarm Optimization based Algorithm for Reliability-Oriented Task Allocation in Distributed Systems,” International Symposium on Telecommunication, 2012.


[17] J. Chen and T. Kuo, “Allocation Cost Minimization for Periodic Hard Real-Time Tasks in Energy-Constrained DVS Systems,” 2006 IEEE/ACM International Conference on Computer Aided Design, pp. 255–260, Nov. 2006.

[18] P. Emberson and I. Bate, “Extending a task allocation algorithm for graceful degradation of real-time distributed embedded systems,” in Real-Time Systems Symposium, 2008, 2008, pp. 270–279.

[19] H. Aydin and D. Zhu, “Reliability-aware energy management for periodic real-time tasks,” Computers, IEEE Transactions on, vol. 58, no. 10, pp. 1382–1397, 2009.

[20] C. H. Sauer and K. M. Chandy, Computer systems performance modeling, vol. 21. Prentice-Hall Englewood Cliffs, NJ, 1981.

[21] J. F. Lawless, “Statistical models and methods for lifetime data,” 1982.

[22] C. Singh, “Calculating the time-specific frequency of system failure,” Reliability, IEEE Transactions on, vol. 28, no. 2, pp. 124–126, 1979.

[23] C. S. Raghavendra and S. V. Makam, “Reliability modeling and analysis of computer networks,” Reliability, IEEE Transactions on, vol. 35, no. 2, pp. 156–160, 1986.

[24] J. W. S. Liu, Real-Time Systems. Prentice Hall PTR, 2000.

[25] S. Kartik and C. Siva Ram Murthy, “Task allocation algorithms for maximizing reliability of distributed computing systems,” Computers, IEEE Transactions on, vol. 46, no. 6, pp. 719–724, 1997.

[26] V. Maniezzo, M. Dorigo, and A. Colorni, “The ant system: Optimization by a colony of cooperating agents,” IEEE Trans. Sys. Man Cybernet, no. Part B, pp. 29–41, 1996.

[27] L. M. Gambardella, E. D. Taillard, M. Dorigo, and others, “Ant colonies for the quadratic assignment problem,” Journal of the operational research society, vol. 50, no. 2, pp. 167–176, 1999.

[28] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Neural Networks, 1995. Proceedings., IEEE International Conference on, 1995, vol. 4, pp. 1942–1948.

[29] H. A. Abbass, “A single queen single worker honey bees approach to 3-sat,” in Proceedings of Genetic Evolutionary Computation Conference. San Mateo, CA: Morgan Kaufmann, 2001.
