
Hindawi Publishing Corporation
Journal of Industrial Engineering
Volume 2013, Article ID 295604, 11 pages
http://dx.doi.org/10.1155/2013/295604

Research Article: Chip Attach Scheduling in Semiconductor Assembly

Zhicong Zhang, Kaishun Hu, Shuai Li, Huiyu Huang, and Shaoyong Zhao

Department of Industrial Engineering, School of Mechanical Engineering, Dongguan University of Technology, Songshan Lake District, Dongguan, Guangdong 523808, China

Correspondence should be addressed to Zhicong Zhang; stephen1998@gmail.com

Received 13 December 2012; Accepted 20 February 2013

Academic Editor: Josefa Mula

Copyright © 2013 Zhicong Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Chip attach is the bottleneck operation in semiconductor assembly. Chip attach scheduling is in nature unrelated parallel machine scheduling considering practical issues, for example, machine-job qualification, sequence-dependent setup times, initial machine status, and engineering time. The major scheduling objective is to minimize the total weighted unsatisfied Target Production Volume in the schedule horizon. To apply the Q-learning algorithm, the scheduling problem is converted into a reinforcement learning problem by constructing an elaborate system state representation, actions, and reward function. We select five heuristics as actions and prove the equivalence of the reward function and the scheduling objective function. We also conduct experiments with industrial datasets to compare the Q-learning algorithm, the five action heuristics, and the Largest Weight First (LWF) heuristic used in industry. Experiment results show that Q-learning is remarkably superior to the six heuristics. Compared with LWF, Q-learning reduces three performance measures, objective function value, unsatisfied Target Production Volume index, and unsatisfied job type index, by considerable amounts of 80.92%, 52.20%, and 31.81%, respectively.

1. Introduction

Semiconductor manufacturing consists of four basic steps: wafer fabrication, wafer sort, assembly, and test. Assembly and test are back-end steps. Semiconductor assembly contains many operations, such as reflow, wafer mount, saw, chip attach, deflux, EPOXY, cure, and PEVI. IS factory is a site for back-end semiconductor manufacturing where chip attach is the bottleneck operation in the assembly line. In terms of the Theory of Constraints (TOC), the capacity of a shop floor depends on the capacity of the bottleneck, and a bottleneck operation has a tremendous impact on the performance of the whole shop floor. Consequently, scheduling of the chip attach station has a significant effect on the performance of the assembly line. Chip attach is performed in a station which consists of ten parallel machines; thus, chip attach scheduling is in nature a form of unrelated parallel machine scheduling under certain realistic restrictions.

Research on unrelated parallel machine scheduling focuses on two sorts of criteria: completion time or flow time related criteria and due date related criteria. Weng et al. [1] proposed a heuristic algorithm called "Algorithm 9" to minimize the total weighted completion time with setup consideration. Algorithm 9 was demonstrated to be superior to six heuristic algorithms. Gairing et al. [2] presented an effective combinatorial approximation algorithm for the makespan objective. Mosheiov [3] and Mosheiov and Sidney [4] converted an unrelated parallel machine scheduling problem with a total flow time objective into a polynomial number of assignment problems. The scheduling problem was tackled by solving the derived assignment problems. Yu et al. [5] formulated unrelated parallel machine scheduling problems as mixed integer programs and dealt with them using Lagrangian Relaxation. They examined six measures such as makespan and mean flow time. Promising results were achieved compared with a modified FIFO method.

Besides completion time or flow time related criteria, tardiness objectives are also employed frequently. Dispatching rules are widely applied to production scheduling with a tardiness objective, such as Earliest Due Date (EDD), Shortest Processing Time (SPT), Critical Ratio (CR), Minimal Slack (MS), Modified Due Date (MDD) [6, 7], Apparent Tardiness Cost (ATC) [8, 9], and COVERT [10–12]. More complicated heuristic algorithms and local search methods have also been developed. Bank and Werner [13] addressed the problem of minimizing the weighted sum of linear earliness and tardiness penalties in unrelated parallel machine scheduling. They derived some structural properties useful for searching for an approximate solution and proposed various constructive and iterative heuristic algorithms. Liaw et al. [14] found efficient lower and upper bounds for minimizing the total weighted tardiness by a two-phase heuristic based on the solution to an assignment problem. They also presented a branch-and-bound algorithm incorporating various dominance rules. Kim et al. [15] studied batch scheduling of unrelated parallel machines with a total weighted tardiness objective and setup times consideration. They examined four search heuristics for this problem: the earliest weighted due date, the shortest weighted processing time, the two-level batch scheduling heuristic, and the simulated annealing method.

We are concerned in this paper with a particular Target Production Volume (TPV) oriented optimization objective. In real production in IS factory, the planning department figures out the TPV of each job type on the chip attach operation in a schedule horizon. Thus, the major objective of chip attach scheduling is to meet TPVs to the fullest extent (see Section 2.1 for details). We apply reinforcement learning (RL), an artificial intelligence method, in this study. We first present a brief introduction to reinforcement learning.

1.1. Q-Learning. Reinforcement learning is a machine learning method proposed to approximately solve large-scale Markov Decision Process (MDP) or Semi-Markov Decision Process (SMDP) problems. A reinforcement learning problem is a model in which an agent learns to select optimal or near-optimal actions for achieving its long-term goals (to maximize the total or average reward) through trial-and-error interactions with a dynamic environment. In this paper we address RL problems of episodic tasks, that is, problems with a terminal state. Sutton and Barto [16] defined four key elements of RL algorithms: policy, reward function, value function, and model of the environment. A policy determines the agent's action at each state. A reward function determines the payment on transition from one state to another. A value function specifies the value of a state or a state-action pair in the long run, that is, the expected total reward for an episode. By learning from interaction between the agent and its environment, value-based RL algorithms aim to approximate the optimal state or action value function through iteration and thus find a near-optimal policy. Compared with dynamic programming, RL algorithms do not need to know the transition probabilities and reduce the computational effort.

Q-learning is one of the most widely applied RL algorithms based on value iteration. Q-learning was first proposed by Watkins [17]. Convergence results of tabular Q-learning were obtained by Watkins and Dayan [18], Jaakkola et al. [19], and Tsitsiklis [20]. Bertsekas and Tsitsiklis [21] demonstrated that Q-learning produces the optimal policy in discounted reward problems under certain conditions. Q-learning uses Q(s, a), called the Q-value, to represent the value of a state-action pair. Q(s, a) is defined as follows:

$$Q(s,a) = \sum_{s' \in S} p(s,a,s')\left[r(s,a,s') + \gamma V^*(s')\right] \quad (1)$$

where S denotes the state space, p(s,a,s') denotes the transition probability from s to s' taking action a, r(s,a,s') denotes the reward on transition from s to s' taking action a, γ (0 < γ ≤ 1) is a discount factor, and V*(·) is the optimal state value function.

In terms of the Bellman optimality equation, the following holds for arbitrary s ∈ S, where A(s) denotes the set of actions available at state s:

$$V^*(s) = \max_{a \in A(s)} Q(s,a) \quad (2)$$

From (1) and (2), the following equation holds:

$$Q(s,a) = \sum_{s' \in S} p(s,a,s')\left[r(s,a,s') + \gamma \max_{a' \in A(s')} Q(s',a')\right] \quad \forall (s,a) \quad (3)$$

Equation (3) is the basic transformation of the Q-learning algorithm. The step-size version of Q-learning is

$$Q(s,a) = Q(s,a) + \alpha\left[r(s,a,s') + \gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a)\right] \quad \forall (s,a) \quad (4)$$

where α (0 < α ≤ 1) is the learning rate. Using historical samples or simulation experiments, Q-learning obtains a near-optimal policy by driving the action-value function Q(s, a) towards the optimal action-value function Q*(s, a) through iteration based on formula (4).
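To make the update in (4) concrete, the following is a minimal, self-contained sketch (not from the paper) of a tabular Q-learning episode in Python. The environment object and its reset/step interface are hypothetical placeholders; the update line is a direct transcription of formula (4).

```python
import random

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Run one episode and update Q in place; Q is a dict mapping (state, action) -> value."""
    s = env.reset()          # hypothetical environment interface
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q.get((s, x), 0.0))
        s_next, r, done = env.step(a)
        # formula (4): Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = 0.0 if done else max(Q.get((s_next, x), 0.0) for x in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        s = s_next
    return Q
```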

Recently, RL has drawn attention in production scheduling. S. Riedmiller and M. Riedmiller [22] used Q-learning to solve a stochastic and dynamic job shop scheduling problem with an overall tardiness objective. Some typical heuristic dispatching rules, SPT, LPT, EDD, and MS, were chosen as actions and compared with the Q-learning method. Aydin and Oztemel [23] applied a Q-learning algorithm to minimize the mean tardiness of dynamic job shop scheduling. Their results showed that the RL-scheduling system outperformed the use of each of the three rules (SPT, COVERT, and CR) individually with the mean tardiness objective in most of the testing cases. Hong and Prabhu [24] formulated a setup minimization problem (minimizing the sum of due date deviation and setup cost) in JIT manufacturing systems as an SMDP and solved it by a tabular Q-learning method. Experiment results showed that Q-learning algorithms achieved significant performance improvement over usual dispatching rules such as EDD in complex real-time shop floor control problems for JIT production. Wang and Usher [25] applied Q-learning to select dispatching rules for the single machine scheduling problem. Csaji et al. [26] proposed an adaptive iterative distributed scheduling algorithm operated in a market-based production control system, where every machine and job is associated with its own software agent. Singh et al. [27] proposed an online reinforcement learning algorithm for call admission control. The approach optimized the SMDP performance criterion with respect to a family of parameterized policies. Multi-agent reinforcement learning systems have also been applied to scheduling or control problems, for example, Kaya and Alhajj [28], Paternina-Arboleda and Das [29], Mariano-Romero et al. [30], Vengerov [31], and Iwamura et al. [32].

Applications of RL algorithms to scheduling problems have not been thoroughly explored in prior studies. In this study, we employ a Q-learning algorithm to solve the chip attach scheduling problem and achieve overwhelming experimental results compared with six heuristic algorithms. The remainder of this paper is organized as follows. We describe the problem and convert it into an RL problem explicitly in Section 2, present the RL algorithm in Section 3, conduct the computational experiments and analysis in Section 4, and draw conclusions in Section 5.

2. RL Formulation

2.1. Problem Statement. The scheduling problem concerned in this paper is described as follows. The work station for the chip attach operation consists of m parallel machines and processes n types of jobs. The bigger the weight of a job type is, the more important it is. Each job needs to be processed on one machine only, and one machine processes at most one job at a time. Any job type (say j) is only allowed to be processed on a subset M_j of the m parallel machines. The jobs of the same type j have a deterministic processing time p_{ij} (1 ≤ i ≤ m, 1 ≤ j ≤ n) if they are processed on machine i. The machines are unrelated, that is, p_{ij} is independent of p_{kj} for all jobs j and all machines i ≠ k. The production is lot based. Normally one lot contains more than 1000 units. Thus, the processing time is the time for processing one lot, and processing is nonpreemptive (i.e., once a machine starts processing one lot, it cannot process another one until it completely processes this lot). The setup time between job types j1 and j2 is s_{j1 j2} (1 ≤ j1, j2 ≤ n). The setup times are deterministic and sequence dependent. Trivially, s_{jj} = 0 holds for arbitrary j (1 ≤ j ≤ n), and s_{jx} + s_{xq} > s_{jq} holds for arbitrary j, x, q (1 ≤ j, x, q ≤ n).

The usage of a machine is considered to be in one of two categories: engineering time (e.g., maintenance time) and production time. We only need to schedule the production in production time, that is, the total available time in a schedule horizon minus the engineering time. Production time is divided into initial production time and normal production time. We consider the initial machine status in the schedule horizon. If a machine is processing a lot, called the "initial lot", at the beginning of a schedule horizon, it is not allowed to process any other lot until it completely processes the remaining units in the initial lot (called the initial volume). The time for processing the unprocessed initial volume in the initial lot is called the "initial production time". Since the production of nonbottleneck operations is determined by the bottleneck operation, we assume that the jobs are always available for processing on the chip attach operation when they are needed.

The primary objective of chip attach scheduling is to minimize the total weighted unsatisfied TPV of a schedule horizon. Since equipment in semiconductor manufacturing is very expensive, machine utilization should be kept at a high level. Hence, on the premise that the TPVs of all job types are entirely satisfied, the secondary objective of chip attach scheduling is to process as much weighted excess volume as possible to relieve the burden of the next schedule horizon. The objective function is formulated as follows:

$$\min \sum_{j=1}^{n} w_j (D_j - Y_j)^+ - \sum_{j=1}^{n} \frac{w_j}{M}(Y_j - D_j)^+ \quad (5)$$

where w_j (1 ≤ j ≤ n) is the weight per unit of job type j, D_j (1 ≤ j ≤ n) is the predetermined TPV of job type j (including the initial volume in the initial lots), and Y_j (1 ≤ j ≤ n) is the processed volume of job type j. D_j can be represented as follows:

$$D_j = \sum_{i=1}^{m} \omega(i,j)\, I_i + k_j L \quad (k_j = 0, 1, \ldots) \quad (6)$$

where I_i denotes the initial volume in the initial lot processed by machine i at the beginning of the schedule horizon, L is the lot size, and

$$\omega(i,j) = \begin{cases} 1 & \text{if machine } i \text{ is processing job type } j \text{ at the beginning of the schedule horizon} \\ 0 & \text{otherwise} \end{cases} \quad (7)$$
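As an illustration of objective (5), the following is a minimal sketch (assumed variable names, not from the paper) that evaluates the objective value from weights, TPVs, and processed volumes:

```python
def objective_value(w, D, Y, M):
    """Objective (5): weighted unsatisfied TPV minus (1/M)-scaled weighted excess volume.

    w, D, Y are sequences indexed by job type; M is the large positive constant from (9).
    """
    unsatisfied = sum(w[j] * max(D[j] - Y[j], 0.0) for j in range(len(w)))
    excess = sum(w[j] / M * max(Y[j] - D[j], 0.0) for j in range(len(w)))
    return unsatisfied - excess
```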

The calculation of Y_j is rate based, interpreted as follows. Suppose machine i processes lot LQ (belonging to job type q) preceding lot LJ (belonging to job type j). Let t^s_{LJ} denote the start time of the setup for LJ; then the completion time of LJ is t^s_{LJ} + s_{qj} + p_{ij}. Let ΔY_{ij}(t) denote the increase of the processed volume of job type j because of processing LJ on machine i from time t^s_{LJ} to t, defined as follows:

$$\Delta Y_{ij}(t) = \frac{(t - t^s_{LJ})\, L}{s_{qj} + p_{ij}} \quad (t^s_{LJ} \le t \le t^s_{LJ} + s_{qj} + p_{ij}) \quad (8)$$
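A small sketch (assumed names, not from the paper) of the rate-based volume accrual in (8):

```python
def delta_volume(t, t_start, lot_size, setup, proc):
    """Increase in processed volume per (8): the lot's L units accrue linearly
    over the setup-plus-processing interval [t_start, t_start + setup + proc]."""
    t = min(max(t, t_start), t_start + setup + proc)  # clamp t into the valid interval
    return (t - t_start) * lot_size / (setup + proc)
```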

M is a positive number which is large enough. M is set following the next inequality:

$$M > \max\left\{ \frac{(w_q + w_x)(s_{vj} + p_{ij})}{w_j\, p_{iq}},\ \frac{w_q (s_{vx} + p_{ix})(s_{xj} + p_{ij})}{p_{ij}\, \min_{1 \le c \le m,\ 1 \le a,b,k \le n} w_k (s_{ab} + p_{cb})} \right\} \quad (\forall\, 1 \le i \le m,\ 1 \le j, q, v, x \le n) \quad (9)$$


For an optimal schedule minimizing objective function (5), if (9) holds and there exists j (1 ≤ j ≤ n) such that Y_j > D_j, then

$$Y_j + \sum_{i=1}^{m} \beta(i,j)\, U_i \ge D_j \quad (\forall\, 1 \le j \le n) \quad (10)$$

where U_i denotes the unprocessed volume in the last lot processed by machine i at the end of this schedule horizon (i.e., the initial volume of the next schedule horizon), and

$$\beta(i,j) = \begin{cases} 1 & \text{if machine } i \text{ is processing job type } j \text{ at the end of the schedule horizon} \\ 0 & \text{otherwise} \end{cases} \quad (11)$$

According to inequality (9), in any schedule minimizing objective function (5), no machine will process a lot belonging to a job type whose TPV has been satisfied until the TPVs of all other job types are also fully satisfied. In other words, inequality (9) guarantees that the objective function takes minimization of the total weighted unsatisfied TPV (the first term of objective function (5)) as the first priority. The fundamental problem in applying reinforcement learning to scheduling is to convert scheduling problems into RL problems, including representation of the state, construction of actions, and definition of the reward function.

2.2. State Representation and Transition Probability. We first define the state variables. State variables describe the major characteristics of the system and are capable of tracking the change of the system status. The system state can be represented by the vector

$$\varphi = [T^0_i\ (1 \le i \le m),\ T_i\ (1 \le i \le m),\ t_i\ (1 \le i \le m),\ d_j\ (1 \le j \le n),\ e_i\ (1 \le i \le m)] \quad (12)$$

where T^0_i (1 ≤ i ≤ m) denotes the job type of the latest lot completely processed on machine i, T_i (1 ≤ i ≤ m) denotes the job type of the lot being processed on machine i (T_i equals zero if machine i is idle), t_i (1 ≤ i ≤ m) denotes the time elapsed since the beginning of the latest setup on machine i (for convenience, we assume that there is a zero-time setup if T^0_i = T_i), d_j (1 ≤ j ≤ n) is the unsatisfied TPV (i.e., (D_j − Y_j)^+), and e_i (1 ≤ i ≤ m) represents the unscheduled normal production time of machine i.

Considering the initial status of the machines, the initial system state of the schedule horizon is

$$s_0 = [T^0_{i0}\ (1 \le i \le m),\ T_{i0}\ (1 \le i \le m),\ t_{i0}\ (1 \le i \le m),\ D_j\ (1 \le j \le n),\ \mathrm{TH} - \sigma_i - \mathrm{TE}_i\ (1 \le i \le m)] \quad (13)$$

where TH denotes the overall available time in the schedule horizon, σ_i denotes the initial production time of machine i, and TE_i denotes the engineering time of machine i.
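A minimal sketch (with assumed field names, not from the paper) of the state vector (12) as a Python data structure:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SchedulingState:
    """State vector (12): one entry per machine or job type as defined in Section 2.2."""
    prev_job_type: List[int]      # T0_i: job type of the latest lot completed on machine i
    current_job_type: List[int]   # T_i: job type being processed on machine i (0 if idle)
    setup_elapsed: List[float]    # t_i: time since the latest setup started on machine i
    unsatisfied_tpv: List[float]  # d_j: (D_j - Y_j)^+ per job type
    remaining_time: List[float]   # e_i: unscheduled normal production time of machine i
```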

There are two kinds of events triggering state transitions: (1) completion of processing a lot on one or more machines; (2) some machine's normal production time is entirely scheduled. If the triggering event is completion of processing, the state at the decision-making epoch is represented as

$$s_d = [T^0_{id}\ (1 \le i \le m),\ T_{id}\ (1 \le i \le m),\ t_{id}\ (1 \le i \le m),\ d_{jd}\ (1 \le j \le n),\ e_{id}\ (1 \le i \le m)] \quad (14)$$

where {i | T_{id} = 0, 1 ≤ i ≤ m} ≠ Φ. If T_{id} = 0 (machine i is idle), then t_{id} = 0. If the triggering event is using up a machine's normal production time, then {i | e_{id} = 0, 1 ≤ i ≤ m} ≠ Φ.

119898 =ΦAssume that after taking action 119886 the system state

immediately transfers form 119904119889to an interim state 119904 as follows

119904 = [1198790

119894(1 le 119894 le 119898) 119879

119894(1 le 119894 le 119898) 119905

119894(1 le 119894 le 119898)

119889119895(1 le 119895 le 119899) 119890

119894(1 le 119894 le 119898)]

(15)

where 119879119894gt 0 for all 119894 (1 le 119894 le 119898) that is all machines are

busyLet Δ119905 denote the sojourn time at state 119904 then Δ119905 =

minmin1le119894le119898

1199041198790

119894119879119894

+ 119901119894119879119894

minus 119905119894 min

1le119894le119898119890119894| 119890119894gt 0 Let

Λ = 119894 | 1199041198790

119894119879119894

+ 119901119894119879119894

minus 119905119894= Δ119905 then the state at the next

decision-making epoch is represented as

1199041015840

= [119879119894(119894 isin Λ) 119879

0

119894(119894 notin Λ) 119879

119894= 0 (119894 isin Λ) 119879

119894(119894 notin Λ)

0 (119894 isin Λ) 119905119894+ Δ119905 (119894 notin Λ)

119889119895minusΔ119905119871sum

119898

119894=1120575119884(119879119894 119895)

1199041198790

119894119895+ 119901119894119895

(1 le 119895 le 119899)

max 119890119894minus Δ119905 0 (1 le 119894 le 119898)]

(16)

where

120575119884(119879119894 119895) =

1 if 119879119894= 119895

0 if 119879119894= 119895

(17)

Apparently, we have P^a_{s_d s'} = 1, where P^a_{s_d s'} denotes the one-step transition probability from state s_d to state s' under action a. Let s_u and τ_u denote the system state and time, respectively, at the u-th decision-making epoch. It is easy to show that

$$P\{ s_{u+1} = X,\ \tau_{u+1} - \tau_u \le t \mid s_0, s_1, \ldots, s_u, \tau_0, \tau_1, \ldots, \tau_u \} = P\{ s_{u+1} = X,\ \tau_{u+1} - \tau_u \le t \mid s_u, \tau_u \} \quad (18)$$

where τ_{u+1} − τ_u is the sojourn time at state s_u. That is, the decision process associated with (s, τ) is a Semi-Markov Decision Process with particular transition probabilities and sojourn times. The terminal state of an episode is

$$s_e = [T^0_{ie}\ (1 \le i \le m),\ T_{ie}\ (1 \le i \le m),\ t_{ie}\ (1 \le i \le m),\ d_{je}\ (1 \le j \le n),\ 0\ (1 \le i \le m)] \quad (19)$$


2.3. Action. Prior domain knowledge can be utilized to fully exploit the agent's learning ability. Apparently, an optimal schedule must be nonidle (i.e., no machine has idle time during the whole schedule). It may happen that more than one machine is free at the same decision-making epoch. An action determines which lot is to be processed on which machine. In the following, we define seven actions using heuristic algorithms.

Action 1. Select jobs by the WSPT heuristic as follows.

Algorithm 1 (WSPT heuristic).

Step 1. Let SM denote the set of free machines at a decision-making epoch.

Step 2. Choose machine k to process job type q with (k, q) = argmin_{(i,j)} { (s_{T^0_i j} + p_{ij}) / w_j | 1 ≤ j ≤ n, i ∈ M_j and i ∈ SM }.

Step 3. Remove k from SM. If SM ≠ Φ, go to Step 2; otherwise the algorithm halts.
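A minimal sketch (assumed data structures, not from the paper) of the WSPT selection in Algorithm 1:

```python
def wspt_assign(free_machines, job_types, eligible, setup, proc, prev_job, weight):
    """Greedy WSPT assignment (Algorithm 1): repeatedly pick the machine-job pair
    minimizing (setup + processing time) / weight among free, qualified machines."""
    assignment = {}
    free = set(free_machines)
    while free:
        best = min(
            ((i, j) for i in free for j in job_types if i in eligible[j]),
            key=lambda ij: (setup[prev_job[ij[0]]][ij[1]] + proc[ij[0]][ij[1]]) / weight[ij[1]],
            default=None,
        )
        if best is None:  # no free machine is qualified for any job type
            break
        i, j = best
        assignment[i] = j
        free.remove(i)
    return assignment
```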

Action 2. Select jobs by the MWSPT (modified WSPT) heuristic as follows.

Algorithm 2 (MWSPT heuristic).

Step 1. Define SM as in Step 1 of Algorithm 1, and let SJ denote the set of job types whose TPVs have not been satisfied at a decision-making epoch, that is, SJ = {j | Y_j < D_j, 1 ≤ j ≤ n}. If SJ = Φ, go to Step 4.

Step 2. Choose job type q to process on machine k with (k, q) = argmin_{(i,j)} { (s_{T^0_i j} + p_{ij}) / w_j | j ∈ SJ, i ∈ M_j and i ∈ SM }.

Step 3. Remove k from SM. Set Y_q = Y_q + L and update SJ. If SJ ≠ Φ and SM ≠ Φ, go to Step 2; if SJ = Φ and SM ≠ Φ, go to Step 4; otherwise the algorithm halts.

Step 4. Choose machine k to process job type q with (k, q) = argmin_{(i,j)} { (s_{T^0_i j} + p_{ij}) / w_j | 1 ≤ j ≤ n, i ∈ M_j and i ∈ SM }.

Step 5. Remove k from SM. If SM ≠ Φ, go to Step 4; otherwise the algorithm halts.

Action 3. Select jobs by the Ranking Algorithm (RA) as follows.

Algorithm 3 (Ranking Algorithm).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2. If SJ = Φ, go to Step 5.

Step 2. For each job type j (j ∈ SJ), sort the machines in increasing order of (s_{V_i j} + p_{ij}) (1 ≤ i ≤ m), where V_i is defined as follows:

$$V_i = \begin{cases} T_i & \text{if machine } i \text{ is busy} \\ T^0_i & \text{if machine } i \text{ is free} \end{cases} \quad (1 \le i \le m) \quad (20)$$

Let g_{ij} (1 ≤ g_{ij} ≤ m) denote the order of machine i (1 ≤ i ≤ m) for job type j (1 ≤ j ≤ n).

Step 3. Choose job type q to process on machine k with (k, q) = argmin_{(i,j)} { g_{ij} | j ∈ SJ, i ∈ M_j and i ∈ SM }. If there exist two or more machine-job combinations (say machine-job combinations (i_1, j_1), (i_2, j_2), ..., (i_h, j_h)) with the same minimal order, that is, (i_e, j_e) = argmin_{(i,j)} { g_{ij} | j ∈ SJ, i ∈ M_j and i ∈ SM } holds for every e (1 ≤ e ≤ h), then choose job type j_e to process on machine i_e with (i_e, j_e) = argmin_{(i,j)} { (s_{V_{i_e} j_e} + p_{i_e j_e}) / w_{j_e} | 1 ≤ e ≤ h }.

Step 4. Remove k or i_e from SM. Set Y_q = Y_q + L or Y_{j_e} = Y_{j_e} + L and update SJ. If SJ ≠ Φ and SM ≠ Φ, go to Step 3; if SJ = Φ and SM ≠ Φ, go to Step 5; otherwise the algorithm halts.

Step 5. Choose job type q to process on machine k with (k, q) = argmin_{(i,j)} { g_{ij} | 1 ≤ j ≤ n, i ∈ M_j and i ∈ SM }. If there exist two or more machine-job combinations (say machine-job combinations (i_1, j_1), (i_2, j_2), ..., (i_h, j_h)) with the same minimal order, choose job type j_e to process on machine i_e with (i_e, j_e) = argmin_{(i,j)} { (s_{V_{i_e} j_e} + p_{i_e j_e}) / w_{j_e} | 1 ≤ e ≤ h }.

Step 6. Remove k or i_e from SM. If SM ≠ Φ, go to Step 5; otherwise the algorithm halts.

Action 4. Select jobs by the LFM-MWSPT heuristic as follows.

Algorithm 4 (LFM-MWSPT heuristic).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2.

Step 2. Select a free machine (say k) from SM by the LFM (Least Flexible Machine, see [33]) rule and choose a job type to process on machine k following the MWSPT heuristic.

Step 3. Remove k from SM. If SM ≠ Φ, go to Step 2; otherwise the algorithm halts.

Action 5. Select jobs by the LFM-RA heuristic as follows.

Algorithm 5 (LFM-RA heuristic).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2.

Step 2. Select a free machine (say k) from SM by the LFM rule and choose a job type to process on machine k following the Ranking Algorithm.

Step 3. Remove k from SM. If SM ≠ Φ, go to Step 2; otherwise the algorithm halts.

Action 6. Each free machine selects the same job type as the latest one it processed.

Action 7. Select no job.

At the start of a schedule horizon, the system is at the initial state s_0. If there are free machines, they select jobs to process by taking one of Actions 1-6; otherwise Action 7 is chosen. Afterwards, when any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new state s_u. The agent selects an action at this decision-making epoch, and the system state transfers into an interim state s. When again any machine completes processing a lot or any machine's normal production time is used up, the system transfers into the next decision-making state s_{u+1}, and the agent receives reward r_{u+1}, which is computed from s_u and the sojourn time between the two transitions into s_u and s_{u+1} (as shown in Section 2.4). The previous procedure is repeated until a terminal state is attained. An episode is a trajectory from the initial state to a terminal state of a schedule horizon. Action 7 is available only at decision-making states where all machines are busy.

2.4. Reward Function. A reward function follows several disciplines. It indicates the instant impact of an action on the schedule, that is, it links the action with an immediate reward. Moreover, the accumulated reward indicates the objective function value, that is, the agent receives a large total reward for a small objective function value.

Definition 6 (reward function). Let K denote the number of decision-making epochs during an episode, t_u (0 ≤ u < K) the time at the u-th decision-making epoch, T_{iu} (1 ≤ i ≤ m, 1 ≤ u ≤ K) the job type of the lot which machine i processes during the time interval (t_{u−1}, t_u], T^0_{iu} the job type of the lot which precedes the lot machine i processes during the time interval (t_{u−1}, t_u], and Y_j(t_u) the processed volume of job type j by time t_u. It follows that

$$Y_j(t_u) - Y_j(t_{u-1}) = \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\, \delta(i,j)\, L}{s_{T^0_{iu} T_{iu}} + p_{i T_{iu}}} \quad (21)$$

where δ(i, j) is an indicator function defined as

$$\delta(i,j) = \begin{cases} 1 & T_{iu} = j \\ 0 & T_{iu} \ne j \end{cases} \quad (22)$$

Let r_u (u = 1, 2, ..., K) denote the reward at the u-th decision-making epoch. r_u is defined as

$$r_u = \sum_{j=1}^{n} \left\{ \min\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\, \delta(i,j)\, L}{s_{T^0_{iu} T_{iu}} + p_{i T_{iu}}},\ [D_j - Y_j(t_{u-1})]^+ \right\} w_j + \max\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\, \delta(i,j)\, L}{s_{T^0_{iu} T_{iu}} + p_{i T_{iu}}} - [D_j - Y_j(t_{u-1})]^+,\ 0 \right\} \frac{w_j}{M} \right\} \quad (23)$$
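A minimal sketch (assumed variable names, not from the paper) of the reward in (23): each job type's newly processed volume is credited at full weight up to its remaining unsatisfied TPV and at weight w_j/M beyond it.

```python
def reward(delta_volume, remaining_tpv, weight, M):
    """Reward r_u per (23).

    delta_volume[j]: volume of job type j processed between the two epochs,
                     i.e. the inner sum over machines in (21)/(23).
    remaining_tpv[j]: [D_j - Y_j(t_{u-1})]^+ at the previous epoch.
    """
    r = 0.0
    for j in range(len(weight)):
        satisfied = min(delta_volume[j], remaining_tpv[j])
        excess = max(delta_volume[j] - remaining_tpv[j], 0.0)
        r += weight[j] * satisfied + (weight[j] / M) * excess
    return r
```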

The reward function has the following property.

Theorem 7. Maximization of the total reward R in an episode is equivalent to minimization of objective function (5).

Proof. The total reward in an episode is

$$R = \sum_{u=1}^{K} r_u = \sum_{j=1}^{n} \sum_{u=1}^{K} \left\{ \min\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\, \delta(i,j)\, L}{s_{T^0_{iu} T_{iu}} + p_{i T_{iu}}},\ [D_j - Y_j(t_{u-1})]^+ \right\} w_j + \max\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\, \delta(i,j)\, L}{s_{T^0_{iu} T_{iu}} + p_{i T_{iu}}} - [D_j - Y_j(t_{u-1})]^+,\ 0 \right\} \frac{w_j}{M} \right\} \quad (24)$$

where the order of summation over u and j has been exchanged. It is easy to show that

$$Y_j = \sum_{u=1}^{K} \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\, \delta(i,j)\, L}{s_{T^0_{iu} T_{iu}} + p_{i T_{iu}}} \quad (25)$$

It follows that

$$\begin{aligned} R &= \sum_{j=1}^{n} \left[ w_j \min\{D_j, Y_j\} + \frac{w_j}{M} \max\{0,\ Y_j - D_j\} \right] \\ &= \sum_{j \in \Omega_1} \left[ w_j D_j + \frac{w_j}{M}(Y_j - D_j) \right] + \sum_{j \in \Omega_2} w_j Y_j \\ &= \sum_{j=1}^{n} w_j D_j - \left\{ \sum_{j \in \Omega_1} \left[ -\frac{w_j}{M}(Y_j - D_j) \right] + \sum_{j \in \Omega_2} w_j (D_j - Y_j) \right\} \\ &= \sum_{j=1}^{n} w_j D_j - \sum_{j=1}^{n} \left[ w_j (D_j - Y_j)^+ - \frac{w_j}{M}(Y_j - D_j)^+ \right] \end{aligned} \quad (26)$$

where Ω_1 = {j | Y_j > D_j} and Ω_2 = {j | Y_j ≤ D_j}. Since Σ_{j=1}^{n} w_j D_j is a constant, it follows that

$$\max R \iff \min \sum_{j=1}^{n} \left[ w_j (D_j - Y_j)^+ - \frac{w_j}{M}(Y_j - D_j)^+ \right] \quad (27)$$


3. The Reinforcement Learning Algorithm

The chip attach scheduling problem is converted into an RL problem with a terminal state in Section 2. To apply Q-learning to solve this RL problem, another issue arises, namely, how to tailor the Q-learning algorithm in this particular context. Since some state variables are continuous, the state space is infinite. This RL system is not in tabular form, and it is impossible to maintain Q-values for all state-action pairs. Thus, we use a linear function with the gradient-descent method to approximate the Q-value function. Q-values are represented as a linear combination of a set of basis functions Φ_k(s) (1 ≤ k ≤ 4m + n), as shown in the next formula:

$$Q(s,a) = \sum_{k=1}^{4m+n} c^a_k\, \Phi_k(s) \quad (28)$$

where c^a_k (1 ≤ a ≤ 6, 1 ≤ k ≤ 4m + n) are the weights of the basis functions. Each state variable corresponds to a basis function. The following basis functions are defined to normalize the state variables:

$$\Phi_k(s) = \begin{cases} T^0_k / n & (1 \le k \le m) \\ T_{k-m} / n & (m + 1 \le k \le 2m) \\ t_{k-2m} / \max\{ s_{j_1 j_2} + p_{j_2} \mid 1 \le j_1 \le n,\ 1 \le j_2 \le n \} & (2m + 1 \le k \le 3m) \\ d_{k-3m} / D_{k-3m} & (3m + 1 \le k \le 3m + n) \\ e_{k-3m-n} / \mathrm{TH} & (3m + n + 1 \le k \le 4m + n) \end{cases} \quad (29)$$
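A minimal sketch (assumed names, not from the paper; it reuses the SchedulingState fields sketched in Section 2.2) of the linear Q-value approximation in (28)-(29): the state is normalized into a feature vector and each action keeps its own weight vector.

```python
import numpy as np

def features(state, n_jobs, max_setup_plus_proc, tpv, horizon):
    """Normalized basis functions Phi_k(s) per (29) for a SchedulingState-like object."""
    tpv = np.maximum(np.asarray(tpv, dtype=float), 1e-9)  # guard against division by zero
    return np.concatenate([
        np.asarray(state.prev_job_type, dtype=float) / n_jobs,
        np.asarray(state.current_job_type, dtype=float) / n_jobs,
        np.asarray(state.setup_elapsed, dtype=float) / max_setup_plus_proc,
        np.asarray(state.unsatisfied_tpv, dtype=float) / tpv,
        np.asarray(state.remaining_time, dtype=float) / horizon,
    ])

def q_value(weights, state_features, action):
    """Q(s, a) = C^a . Phi(s) per (28); weights has one row of coefficients per action."""
    return float(np.dot(weights[action], state_features))
```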

Let C^a denote the vector of weights of the basis functions as follows:

$$C^a = (c^a_1, c^a_2, \ldots, c^a_{4m+n})^T \quad (30)$$

The RL algorithm is presented as Algorithm 8, where α is the learning rate, γ is a discount factor, E(a) is the vector of eligibility traces for action a, δ(a) is an error variable for action a, and λ is a factor for updating eligibility traces.

Algorithm 8 (Q-learning with linear gradient-descent function approximation for chip attach scheduling).

Initialize C^a and E(a) randomly. Set the parameters α, γ, and λ.
Let num_episode denote the number of episodes having been run. Set num_episode = 0.
While num_episode < MAX_EPISODE do
    Set the current decision-making state s ← s_0.
    While at least one of the state variables e_i (1 ≤ i ≤ m) is larger than zero do
        Select action a for state s by the ε-greedy policy.
        Implement action a. Determine the next event triggering a state transition and the sojourn time. Once any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new decision-making state s' (e'_i (1 ≤ i ≤ m) is a component of s').
        Compute the reward r^a_{ss'}.
        Update the vector of weights in the approximate Q-value function of action a:

        $$\begin{aligned} \delta(a) &\leftarrow r^a_{ss'} + \gamma \max_{a'} Q(s', a') - Q(s, a) \\ E(a) &\leftarrow \lambda E(a) + \nabla_{C^a} Q(s, a) \\ C^a &\leftarrow C^a + \alpha\, \delta(a)\, E(a) \end{aligned} \quad (31)$$

        Set s ← s'.
    If e_i = 0 holds for all i (1 ≤ i ≤ m), set num_episode = num_episode + 1.
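The following is a minimal sketch (hypothetical environment interface, not from the paper) of the training loop in Algorithm 8 with linear function approximation and eligibility traces:

```python
import numpy as np

def train(env, n_actions, n_features, max_episodes=1000,
          alpha=0.05, gamma=0.95, lam=0.7, epsilon=0.1):
    """Q-learning with linear gradient-descent approximation (Algorithm 8 sketch).

    env.reset() -> feature vector Phi(s_0); env.step(a) -> (next_features, reward, done).
    """
    C = np.random.uniform(-0.01, 0.01, size=(n_actions, n_features))  # weights C^a
    for _ in range(max_episodes):
        E = np.zeros((n_actions, n_features))  # eligibility traces E(a)
        phi = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection over approximate Q-values
            q = C @ phi
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(q))
            phi_next, r, done = env.step(a)
            target = r if done else r + gamma * np.max(C @ phi_next)
            delta = target - q[a]              # delta(a) in (31)
            E[a] = lam * E[a] + phi            # E(a) <- lambda*E(a) + grad_{C^a} Q(s,a)
            C[a] += alpha * delta * E[a]       # C^a <- C^a + alpha*delta(a)*E(a)
            phi = phi_next
    return C
```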

4. Experiment Results

In the past, the company used a manual process to conduct chip attach scheduling. A heuristic algorithm called Largest Weight First (LWF) was used, as follows.

Algorithm 9 (Largest Weight First (LWF) heuristic). Initialize SM with the set of all machines (i.e., SM = {i | 1 ≤ i ≤ m}) and define SJ as in Step 1 of Algorithm 2. Initialize e_i (1 ≤ i ≤ m) with each machine's normal production time. Set Y_j = I_j, where I_j is the initial production volume of job type j.

Step 1. Schedule the job types in decreasing order of their weights in order to meet their TPVs.

While SJ ≠ Φ and SM ≠ Φ do
    Choose job type q with q = argmax{ w_j | j ∈ SJ }.
    While SM ∩ M_q ≠ Φ and Y_q < D_q do
        Choose machine i to process job type q with i = argmin{ p_{kq}/w_q | k ∈ SM ∩ M_q }.
        If e_i − s_{T^0_i q} < (D_q − Y_q) p_{iq}/L, then set Y_q ← Y_q + L(e_i − s_{T^0_i q})/p_{iq}, e_i = 0, and remove i from SM;
        else set e_i ← e_i − s_{T^0_i q} − (D_q − Y_q) p_{iq}/L and Y_q = D_q.
        Set T^0_i = q.

Step 2. Allocate the excess production capacity.

If SM ≠ Φ, then for each machine i (i ∈ SM):
    Choose job type j with j = argmax{ (e_i − s_{T^0_i q}) w_q / p_{iq} | 1 ≤ q ≤ n }, and set Y_j ← Y_j + L(e_i − s_{T^0_i j})/p_{ij}, e_i = 0.


Table 1: Comparison of objective function values using heuristics and Q-learning.

Dataset no.   WSPT     MWSPT    RA        LFM-MWSPT  LFM-RA     LWF      Q-learning
1             8.8867   7.8116   5.9758    8.7689     5.7582     4.2253   -0.38613
2             13.844   13.586   11.0747   12.669     10.907     9.5926    0.76657
3             11.901   10.875   12.409    10.425     12.190     8.3332    2.3775
4             8.3681   6.0797   3.9073    6.9405     4.5575     3.3920   -0.44275
5             12.938   12.847   9.6960    10.917     9.9827     8.9863    2.1414
6             7.0840   5.5692   5.1108    6.6213     5.1041     1.6930   -0.54467
7             12.090   10.060   9.5399    10.933     9.0754     7.6422    2.7374
8             10.242   10.780   11.656    10.333     10.762     9.3663    1.1840
9             9.4606   8.7914   8.1763    8.8812     8.0331     6.0164    3.3036
10            9.0803   8.8164   9.0773    8.7926     8.856293   5.6307    2.2798
11            11.113   8.8287   8.2916    9.7882     8.5605     6.0160    1.6493
12            10.029   8.9005   8.6692    9.5836     7.8342     6.0744   -0.38617
Average       10.419   9.4123   8.6321    9.5547     8.4685     6.4147    1.2233

Table 2: Comparison of unsatisfied TPV index using heuristics and Q-learning.

Dataset no.   WSPT     MWSPT    RA       LFM-MWSPT  LFM-RA   LWF      Q-learning
1             0.1179   0.1025   0.0789   0.1170     0.0754   0.0554   0.0081
2             0.1651   0.1611   0.1497   0.1499     0.1482   0.1421   0.0137
3             0.1421   0.1289   0.1475   0.1227     0.1455   0.0987   0.0691
4             0.1540   0.1104   0.0716   0.1258     0.0854   0.0614   0.0088
5             0.1588   0.1564   0.1186   0.1303     0.1215   0.1094   0.0571
6             0.1053   0.0819   0.0757   0.1006     0.0762   0.0248   0.0137
7             0.1462   0.1209   0.1150   0.1292     0.1082   0.0917   0.0582
8             0.1266   0.1324   0.1437   0.1272     0.1309   0.1150   0.0381
9             0.1315   0.1211   0.1133   0.1249     0.1127   0.0828   0.0815
10            0.1154   0.1112   0.1151   0.1105     0.1110   0.0709   0.0536
11            0.1544   0.1215   0.1146   0.1387     0.1204   0.0827   0.0690
12            0.1262   0.1112   0.1088   0.1194     0.1002   0.0758   0.0118
Average       0.1370   0.1216   0.1112   0.1247     0.1113   0.0842   0.0402

The chip attach station consists of 10 machines and normally processes more than ten job types. We selected 12 sets of industrial data for experiments comparing the Q-learning algorithm (Algorithm 8) and the six heuristics (Algorithms 1-5 and 9): WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. For each dataset, Q-learning repeatedly solves the scheduling problem 1000 times and selects the optimal schedule of the 1000 solutions. Table 1 shows the objective function values of all datasets using the seven algorithms. Individually, any of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA obtains larger objective function values than LWF for every dataset. Nevertheless, taking WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA as actions, the Q-learning algorithm achieves an objective function value much smaller than LWF for each dataset. In Tables 1–4, the bottom row presents the average value over all datasets. As shown in Table 1, the average objective function value of Q-learning is only 1.2233, less than that of LWF, 6.4147, by a large amount of 80.92%.

Besides the objective function value, we propose two indices, unsatisfied TPV index and unsatisfied job type index, to measure the performance of the seven algorithms. The unsatisfied TPV index (UPI) is defined in formula (32) and indicates the weighted proportion of unfinished Target Production Volume. Table 2 compares the UPIs of all datasets using the seven algorithms. Again, any of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA individually obtains a larger UPI than LWF for each dataset. However, the Q-learning algorithm achieves a smaller UPI than LWF for each dataset. The average UPI of Q-learning is only 0.0402, less than that of LWF, 0.0842, by a large amount of 52.20%. Let J denote the set {j | 1 ≤ j ≤ n, Y_j < D_j}. The unsatisfied job type index (UJTI) is defined in formula (33) and indicates the weighted proportion of the job types whose TPVs are not completely satisfied. Table 3 compares the UJTIs of all datasets using the seven algorithms. With most datasets, the Q-learning algorithm achieves smaller UJTIs than LWF. The average UJTI of Q-learning is 0.0802, which is remarkably less than that of LWF, 0.1176, by 31.81%. Consider

$$\mathrm{UPI} = \frac{\sum_{j=1}^{n} w_j (D_j - Y_j)^+}{\sum_{j=1}^{n} w_j D_j} \quad (32)$$

$$\mathrm{UJTI} = \frac{\sum_{j \in J} w_j}{\sum_{j=1}^{n} w_j} \quad (33)$$
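A minimal sketch (assumed names, not from the paper) computing the two indices in (32) and (33):

```python
def upi(w, D, Y):
    """Unsatisfied TPV index (32): weighted proportion of unfinished TPV."""
    return sum(w[j] * max(D[j] - Y[j], 0.0) for j in range(len(w))) / sum(
        w[j] * D[j] for j in range(len(w)))

def ujti(w, D, Y):
    """Unsatisfied job type index (33): weighted proportion of job types with unmet TPV."""
    return sum(w[j] for j in range(len(w)) if Y[j] < D[j]) / sum(w)
```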


Table 3: Comparison of unsatisfied job type index using heuristics and Q-learning.

Dataset no.   WSPT     MWSPT    RA       LFM-MWSPT  LFM-RA   LWF      Q-learning
1             0.1290   0.2615   0.1667   0.1650     0.1793   0.0921   0.0678
2             0.1924   0.2953   0.2302   0.2650     0.2097   0.1320   0.0278
3             0.2250   0.3287   0.2126   0.2564     0.2278   0.0921   0.0921
4             0.0781   0.2987   0.0278   0.1290     0.0828   0.0278   0.0278
5             0.2924   0.3169   0.3002   0.2224     0.2632   0.2055   0.0571
6             0.2290   0.3062   0.1817   0.2290     0.1632   0.0571   0.0278
7             0.2650   0.2987   0.2160   0.2529     0.2075   0.1320   0.1647
8             0.1924   0.3225   0.2067   0.1813     0.1696   0.1320   0.1320
9             0.2221   0.3250   0.1403   0.1892     0.2073   0.1320   0.0749
10            0.2621   0.3304   0.2667   0.2859     0.2708   0.1781   0.0678
11            0.2029   0.2896   0.2578   0.2194     0.2220   0.1381   0.1542
12            0.1924   0.3271   0.2302   0.1838     0.2182   0.0921   0.0678
Average       0.2069   0.3084   0.2031   0.2149     0.2018   0.1176   0.0802

Table 4: Comparison of the total setup time using heuristics and Q-learning.

Dataset no.   WSPT     MWSPT    RA       LFM-MWSPT  LFM-RA   LWF      Q-learning
1             0.8133   0.8497   0.3283   0.7231     0.3884   0.3895   1.0000
2             0.8333   1.3000   0.4358   0.7564     0.5263   0.4094   1.0000
3             0.8712   1.1633   0.4207   0.6361     0.4937   0.4298   1.0000
4             1.1280   0.7123   0.4629   0.8516     0.5139   0.4318   1.0000
5             0.9629   1.3121   0.4179   0.8597     0.5115   0.3873   1.0000
6             0.7489   1.0393   0.4104   0.7489     0.4542   0.4074   1.0000
7             1.7868   2.2182   0.8223   1.4069     1.0125   0.4174   1.0000
8             0.6456   0.8508   0.4055   0.6694     0.5053   0.3795   1.0000
9             0.9245   0.9946   0.5013   0.7821     0.6694   0.4163   1.0000
10            1.1025   1.7875   0.6703   1.0371     0.9079   0.4894   1.0000
11            0.9973   1.3655   0.3994   0.9686     0.5129   0.4066   1.0000
12            0.7904   1.1111   0.4419   0.6195     0.5081   0.4258   1.0000
Average       0.9671   1.2254   0.4764   0.8383     0.5837   0.4158   1.0000

Table 4 shows the total setup time of all datasets using the seven algorithms. For reasons of commercial confidentiality, we used normalized data, with the setup time of a dataset divided by the result of that dataset using Q-learning. Thus, the total setup times of all datasets by Q-learning are converted into one, and the data of the six heuristics are adjusted accordingly. The Q-learning algorithm requires more than twice the setup time that LWF does for each dataset. The average accumulated setup time of LWF is only 41.58 percent of that of Q-learning.

The previous experimental results reveal that, for the whole scheduling task, any individual one of the five action heuristics (WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA) for Q-learning performs worse than the LWF heuristic. However, Q-learning greatly outperforms LWF in terms of the three performance measures: the objective function value, UPI, and UJTI. This demonstrates that some action heuristics provide better actions than the LWF heuristic at some states. While repeatedly solving the scheduling problem, the Q-learning system perceives the insights of the scheduling problem automatically and adjusts its actions towards the optimal ones facing different system states. The actions at all states form a new optimized policy, which is different from any policy following any individual action heuristic or the LWF heuristic. That is, Q-learning incorporates the merits of the five alternative heuristics, uses them to schedule jobs flexibly, and obtains results much better than any individual action heuristic and the LWF heuristic. In the experiments, Q-learning achieves high-quality schedules at the cost of inducing more setup time. In other words, Q-learning utilizes the machines more efficiently by increasing conversions among a variety of job types.

5. Conclusions

We apply Q-learning to study lot-based chip attach scheduling in back-end semiconductor manufacturing. To apply reinforcement learning to scheduling, the critical issue is the conversion of scheduling problems into RL problems. We convert the chip attach scheduling problem into a particular SMDP problem by means of a Markovian state representation. Five heuristic algorithms, WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA, are selected as actions so as to utilize prior domain knowledge. The reward function is directly related to the scheduling objective function, and we prove that maximizing the accumulated reward is equivalent to minimizing the objective function. Gradient-descent linear function approximation is combined with the Q-learning algorithm.

Q-learning exploits the internal structure of the scheduling problem by solving it repeatedly. It learns a domain-specific policy from the experienced episodes through interaction and then applies it to later episodes. We define two indices, unsatisfied TPV index and unsatisfied job type index, together with the objective function value, to measure the performance of Q-learning and the heuristics. Experiments with industrial datasets show that Q-learning clearly outperforms the six heuristic algorithms: WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. Compared with LWF, Q-learning achieves reductions of the three performance measures, respectively, by average levels of 52.20%, 31.81%, and 80.92%. With Q-learning, chip attach scheduling is optimized through increasing effective job type conversions.

Disclosure

Given the sensitive and proprietary nature of the semiconductor manufacturing environment, we use normalized data in this paper.

Acknowledgments

This project is supported by the National Natural Science Foundation of China (Grant no. 71201026), the Science and Technological Program for Dongguan's Higher Education, Science and Research, and Health Care Institutions (no. 2011108102017), and the Humanities and Social Sciences Program of the Ministry of Education of China (no. 10YJC630405).

References

[1] M X Weng J Lu and H Ren ldquoUnrelated parallel machinescheduling with setup consideration and a total weightedcompletion time objectiverdquo International Journal of ProductionEconomics vol 70 no 3 pp 215ndash226 2001

[2] M Gairing B Monien and A Woclaw ldquoA faster combinato-rial approximation algorithm for scheduling unrelated parallelmachinesrdquo inAutomata Languages and Programming vol 3580of Lecture Notes in Computer Science pp 828ndash839 2005

[3] G Mosheiov ldquoParallel machine scheduling with a learningeffectrdquo Journal of the Operational Research Society vol 52 no10 pp 1ndash5 2001

[4] G Mosheiov and J B Sidney ldquoScheduling with general job-dependent learning curvesrdquo European Journal of OperationalResearch vol 147 no 3 pp 665ndash670 2003

[5] L Yu H M Shih M Pfund W M Carlyle and J W FowlerldquoScheduling of unrelated parallel machines an application toPWB manufacturingrdquo IIE Transactions vol 34 no 11 pp 921ndash931 2002

[6] K R Baker and J W M Bertrand ldquoA dynamic priorityrule for scheduling against due-datesrdquo Journal of OperationsManagement vol 3 no 1 pp 37ndash42 1982

[7] J J Kanet and X Li ldquoA weighted modified due date rulefor sequencing to minimize weighted tardinessrdquo Journal ofScheduling vol 7 no 4 pp 261ndash276 2004

[8] R V Rachamadugu and T E Morton ldquoMyopic heuristicsfor the single machine weighted tardiness problemrdquo WorkingPaper 28-81-82 Graduate School of Industrial AdministrationGarnegie-Mellon University 1981

[9] A. Volgenant and E. Teerhuis, "Improved heuristics for the n-job single-machine weighted tardiness problem," Computers and Operations Research, vol. 26, no. 1, pp. 35–44, 1999.

[10] D. C. Carroll, Heuristic sequencing of jobs with single and multiple components [Ph.D. thesis], Sloan School of Management, MIT, 1965.

[11] A. P. J. Vepsalainen and T. E. Morton, "Priority rules for job shops with weighted tardiness costs," Management Science, vol. 33, no. 8, pp. 1035–1047, 1987.

[12] R. S. Russell, E. M. Dar-El, and B. W. Taylor, "A comparative analysis of the COVERT job sequencing rule using various shop performance measures," International Journal of Production Research, vol. 25, no. 10, pp. 1523–1540, 1987.

[13] J. Bank and F. Werner, "Heuristic algorithms for unrelated parallel machine scheduling with a common due date, release dates, and linear earliness and tardiness penalties," Mathematical and Computer Modelling, vol. 33, no. 4-5, pp. 363–383, 2001.

[14] C. F. Liaw, Y. K. Lin, C. Y. Cheng, and M. Chen, "Scheduling unrelated parallel machines to minimize total weighted tardiness," Computers and Operations Research, vol. 30, no. 12, pp. 1777–1789, 2003.

[15] D. W. Kim, D. G. Na, and F. F. Chen, "Unrelated parallel machine scheduling with setup times and a total weighted tardiness objective," Robotics and Computer-Integrated Manufacturing, vol. 19, no. 1-2, pp. 173–181, 2003.

[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.

[17] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], Cambridge University, 1989.

[18] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[19] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms," Neural Computation, vol. 6, pp. 1185–1201, 1994.

[20] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.

[21] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass, USA, 1996.

[22] S. Riedmiller and M. Riedmiller, "A neural reinforcement learning approach to learn local dispatching policies in production scheduling," in Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[23] M. E. Aydin and E. Oztemel, "Dynamic job-shop scheduling using reinforcement learning agents," Robotics and Autonomous Systems, vol. 33, no. 2, pp. 169–178, 2000.

[24] J. Hong and V. V. Prabhu, "Distributed reinforcement learning control for batch sequencing and sizing in just-in-time manufacturing systems," Applied Intelligence, vol. 20, no. 1, pp. 71–87, 2004.

[25] Y. C. Wang and J. M. Usher, "Application of reinforcement learning for agent-based production scheduling," Engineering Applications of Artificial Intelligence, vol. 18, no. 1, pp. 73–82, 2005.


[26] B. C. Csaji, L. Monostori, and B. Kadar, "Reinforcement learning in a distributed market-based production control system," Advanced Engineering Informatics, vol. 20, no. 3, pp. 279–288, 2006.

[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, no. 3, pp. 808–818, 2007.

[28] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning utilizing OLAP mining in the learning process," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 35, no. 4, pp. 582–590, 2005.

[29] C. D. Paternina-Arboleda and T. K. Das, "A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem," Simulation Modelling Practice and Theory, vol. 13, no. 5, pp. 389–406, 2005.

[30] C. E. Mariano-Romero, V. H. Alcocer-Yamanaka, and E. F. Morales, "Multi-objective optimization of water-using systems," European Journal of Operational Research, vol. 181, no. 3, pp. 1691–1707, 2007.

[31] D. Vengerov, "A reinforcement learning framework for utility-based scheduling in resource-constrained systems," Future Generation Computer Systems, vol. 25, no. 7, pp. 728–736, 2009.

[32] K. Iwamura, N. Mayumi, Y. Tanimizu, and N. Sugimura, "A study on real-time scheduling for holonic manufacturing systems – determination of utility values based on multi-agent reinforcement learning," in Proceedings of the 4th International Conference on Industrial Applications of Holonic and Multi-Agent Systems, pp. 135–144, Linz, Austria, 2009.

[33] M. Pinedo, Scheduling: Theory, Algorithms, and Systems, Prentice Hall, Englewood Cliffs, NJ, USA, 2nd edition, 2002.

Page 3: Research Article Chip Attach Scheduling in Semiconductor ...downloads.hindawi.com/journals/jie/2013/295604.pdf · Chip Attach Scheduling in Semiconductor Assembly ... and PEVI. IS

Journal of Industrial Engineering 3

and Usher [25] applied Q-learning to select dispatching rules for the single machine scheduling problem. Csaji et al. [26] proposed an adaptive iterative distributed scheduling algorithm operated in a market-based production control system, where every machine and job is associated with its own software agent. Singh et al. [27] proposed an online reinforcement learning algorithm for call admission control. The approach optimized the SMDP performance criterion with respect to a family of parameterized policies. Multi-agent reinforcement learning systems have also been applied to scheduling or control problems, for example, Kaya and Alhajj [28], Paternina-Arboleda and Das [29], Mariano-Romero et al. [30], Vengerov [31], and Iwamura et al. [32].

Applications of RL algorithms to scheduling problems have not been thoroughly explored in prior studies. In this study, we employ the Q-learning algorithm to solve the chip attach scheduling problem and achieve substantially better experimental results than six heuristic algorithms. The remainder of this paper is organized as follows. We describe the problem and convert it into an RL problem in Section 2, present the RL algorithm in Section 3, conduct the computational experiments and analysis in Section 4, and draw conclusions in Section 5.

2. RL Formulation

2.1. Problem Statement. The scheduling problem concerned in this paper is described as follows. The work station for the chip attach operation consists of $m$ parallel machines and processes $n$ types of jobs. The bigger the weight of a job type is, the more important it is. Each job needs to be processed on one machine only, and one machine processes at most one job at a time. Any job type (say $j$) is only allowed to be processed on a subset $M_{j}$ of the $m$ parallel machines. The jobs of the same type $j$ have a deterministic processing time $p_{ij}$ ($1 \le i \le m$, $1 \le j \le n$) if they are processed on machine $i$. The machines are unrelated; that is, $p_{ij}$ is independent of $p_{kj}$ for all jobs $j$ and all machines $i \ne k$. The production is lot based. Normally one lot contains more than 1000 units. Thus, the processing time is the time for processing one lot, and processing is nonpreemptive (i.e., once a machine starts processing one lot, it cannot process another one until it completely processes this lot). The setup time between job types $j_{1}$ and $j_{2}$ is $s_{j_{1} j_{2}}$ ($1 \le j_{1}, j_{2} \le n$). The setup times are deterministic and sequence dependent. Trivially, $s_{jj} = 0$ holds for arbitrary $j$ ($1 \le j \le n$), and $s_{jx} + s_{xq} > s_{jq}$ holds for arbitrary $j, x, q$ ($1 \le j, x, q \le n$).

The usage of a machine is considered to be in one of two categories: engineering time (e.g., maintenance time) and production time. We only need to schedule the production in production time, that is, the total available time in a schedule horizon minus the engineering time. Production time is divided into initial production time and normal production time. We consider the initial machine status in the schedule horizon. If a machine is processing a lot, called an "initial lot", at the beginning of a schedule horizon, it is not allowed to process any other lot until it completely processes the remaining units in the initial lot (called the initial volume). The time for processing the unprocessed initial volume in the initial lot is called the "initial production time". Since the production of nonbottleneck operations is determined by the bottleneck operation, we assume that the jobs are always available for processing on the chip attach operation when they are needed.
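As an illustration only (not part of the original formulation), the problem data introduced above can be collected in a small container; all field names below are our own:

    # A minimal sketch of the problem data; field names are illustrative assumptions.
    from dataclasses import dataclass
    from typing import List, Set

    @dataclass
    class ChipAttachInstance:
        m: int                      # number of parallel machines
        n: int                      # number of job types
        w: List[float]              # w[j]: weight per unit of job type j
        D: List[float]              # D[j]: target production volume (TPV) of job type j
        p: List[List[float]]        # p[i][j]: lot processing time of job type j on machine i
        s: List[List[float]]        # s[j1][j2]: sequence-dependent setup time, with s[j][j] == 0
        qualified: List[Set[int]]   # qualified[j]: machines allowed to process type j (the set M_j)
        lot_size: float = 1000.0    # units per lot (L); the paper states one lot holds over 1000 units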

The primary objective of chip attach scheduling is to minimize the total weighted unsatisfied TPV of a schedule horizon. Since semiconductor manufacturing equipment is very expensive, machine utilization should be kept at a high level. Hence, on the premise that the TPVs of all job types are entirely satisfied, the secondary objective of chip attach scheduling is to process as much weighted excess volume as possible to relieve the burden of the next schedule horizon. The objective function is formulated as follows:

\[
\min\; \sum_{j=1}^{n} w_{j} \, ( D_{j} - Y_{j} )^{+} \;-\; \sum_{j=1}^{n} \frac{w_{j}}{M} \, ( Y_{j} - D_{j} )^{+},
\tag{5}
\]

where $w_{j}$ ($1 \le j \le n$) is the weight per unit of job type $j$, $D_{j}$ ($1 \le j \le n$) is the predetermined TPV of job type $j$ (including the initial volume in the initial lots), and $Y_{j}$ ($1 \le j \le n$) is the processed volume of job type $j$. $D_{j}$ can be represented as follows:

\[
D_{j} = \sum_{i=1}^{m} \omega(i, j) \, I_{i} + k_{j} L \quad ( k_{j} = 0, 1, \ldots ),
\tag{6}
\]

where $I_{i}$ denotes the initial volume in the initial lot processed by machine $i$ at the beginning of the schedule horizon, $L$ is the lot size, and

\[
\omega(i, j) =
\begin{cases}
1, & \text{if machine } i \text{ is processing job type } j \text{ at the beginning of the schedule horizon},\\
0, & \text{otherwise}.
\end{cases}
\tag{7}
\]
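For concreteness, a small sketch (ours, with hypothetical names) of evaluating objective (5) once the processed volumes of a schedule are known:

    # Sketch: evaluate objective (5) given weights w, targets D, processed volumes Y,
    # and the large constant M from inequality (9). All names are illustrative.
    def objective_value(w, D, Y, M):
        unsatisfied = sum(w[j] * max(D[j] - Y[j], 0.0) for j in range(len(w)))
        excess = sum(w[j] / M * max(Y[j] - D[j], 0.0) for j in range(len(w)))
        return unsatisfied - excess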

Calculation of $Y_{j}$ is rate based, interpreted as follows. Suppose machine $i$ processes lot $LQ$ (belonging to job type $q$) immediately before lot $LJ$ (belonging to job type $j$). Let $t^{s}_{LJ}$ denote the start time of the setup for $LJ$; then the completion time of $LJ$ is $t^{s}_{LJ} + s_{qj} + p_{ij}$. Let $\Delta Y_{ij}(t)$ denote the increase of the processed volume of job type $j$ due to processing $LJ$ on machine $i$ from time $t^{s}_{LJ}$ to $t$, defined as follows:

\[
\Delta Y_{ij}(t) = \frac{( t - t^{s}_{LJ} ) \, L}{s_{qj} + p_{ij}}
\quad ( t^{s}_{LJ} \le t \le t^{s}_{LJ} + s_{qj} + p_{ij} ).
\tag{8}
\]

$M$ is a sufficiently large positive number, set according to the following inequality:

\[
M > \max\left\{
\frac{( w_{q} + w_{x} ) ( s_{vj} + p_{ij} )}{w_{j} \, p_{iq}},\;
\frac{w_{q} ( s_{vx} + p_{ix} ) ( s_{xj} + p_{ij} )}{p_{ij} \, \min_{1 \le c \le m,\, 1 \le a, b, k \le n} w_{k} ( s_{ab} + p_{cb} )}
\right\}
\quad ( \forall\, 1 \le i \le m,\ 1 \le j, q, v, x \le n ).
\tag{9}
\]


For an optimal schedule minimizing objective function (5), if (9) holds and there exists $j$ ($1 \le j \le n$) such that $Y_{j} > D_{j}$, then

\[
Y_{j} + \sum_{i=1}^{m} \beta(i, j) \, U_{i} \ge D_{j} \quad ( \forall\, 1 \le j \le n ),
\tag{10}
\]

where $U_{i}$ denotes the unprocessed volume in the last lot processed by machine $i$ at the end of this schedule horizon (i.e., the initial volume of the next schedule horizon) and

\[
\beta(i, j) =
\begin{cases}
1, & \text{if machine } i \text{ is processing job type } j \text{ at the end of the schedule horizon},\\
0, & \text{otherwise}.
\end{cases}
\tag{11}
\]

According to inequality (9), in any schedule minimizing objective function (5), a machine will not process a lot belonging to a job type whose TPV has been satisfied until the TPV of every other job type is also fully satisfied. In other words, inequality (9) guarantees that the objective function takes minimization of the total weighted unsatisfied TPV (the first term of objective function (5)) as the first priority. The fundamental problem in applying reinforcement learning to scheduling is to convert scheduling problems into RL problems, including representation of states, construction of actions, and definition of the reward function.

2.2. State Representation and Transition Probability. We first define the state variables. State variables describe the major characteristics of the system and are capable of tracking the change of the system status. The system state can be represented by the vector

\[
\varphi = \big[\, T^{0}_{i}\ (1 \le i \le m),\; T_{i}\ (1 \le i \le m),\; t_{i}\ (1 \le i \le m),\; d_{j}\ (1 \le j \le n),\; e_{i}\ (1 \le i \le m) \,\big],
\tag{12}
\]

where $T^{0}_{i}$ ($1 \le i \le m$) denotes the job type of the latest lot completely processed on machine $i$, $T_{i}$ ($1 \le i \le m$) denotes the job type of the lot being processed on machine $i$ ($T_{i}$ equals zero if machine $i$ is idle), $t_{i}$ ($1 \le i \le m$) denotes the time elapsed since the beginning of the latest setup on machine $i$ (for convenience, we assume that there is a zero-time setup if $T^{0}_{i} = T_{i}$), $d_{j}$ ($1 \le j \le n$) is the unsatisfied TPV (i.e., $( D_{j} - Y_{j} )^{+}$), and $e_{i}$ ($1 \le i \le m$) represents the unscheduled normal production time of machine $i$.

Considering the initial status of the machines, the initial system state of the schedule horizon is

\[
s_{0} = \big[\, T^{0}_{i0}\ (1 \le i \le m),\; T_{i0}\ (1 \le i \le m),\; t_{i0}\ (1 \le i \le m),\; D_{j}\ (1 \le j \le n),\; \mathrm{TH} - \sigma_{i} - \mathrm{TE}_{i}\ (1 \le i \le m) \,\big],
\tag{13}
\]

where TH denotes the overall available time in the schedule horizon, $\sigma_{i}$ denotes the initial production time of machine $i$, and $\mathrm{TE}_{i}$ denotes the engineering time of machine $i$.

There are two kinds of events triggering state transitions: (1) completion of processing a lot on one or more machines; (2) some machine's normal production time is entirely scheduled. If the triggering event is completion of processing, the state at the decision-making epoch is represented as

\[
s_{d} = \big[\, T^{0}_{id}\ (1 \le i \le m),\; T_{id}\ (1 \le i \le m),\; t_{id}\ (1 \le i \le m),\; d_{jd}\ (1 \le j \le n),\; e_{id}\ (1 \le i \le m) \,\big],
\tag{14}
\]

where $\{ i \mid T_{id} = 0,\ 1 \le i \le m \} \ne \Phi$. If $T_{id} = 0$ (machine $i$ is idle), then $t_{id} = 0$. If the triggering event is using up a machine's normal production time, then $\{ i \mid e_{id} = 0,\ 1 \le i \le m \} \ne \Phi$.

Assume that, after taking action $a$, the system state immediately transfers from $s_{d}$ to an interim state $s$ as follows:

\[
s = \big[\, T^{0}_{i}\ (1 \le i \le m),\; T_{i}\ (1 \le i \le m),\; t_{i}\ (1 \le i \le m),\; d_{j}\ (1 \le j \le n),\; e_{i}\ (1 \le i \le m) \,\big],
\tag{15}
\]

where $T_{i} > 0$ for all $i$ ($1 \le i \le m$); that is, all machines are busy.

Let $\Delta t$ denote the sojourn time at state $s$; then

\[
\Delta t = \min\left\{ \min_{1 \le i \le m}\left( s_{T^{0}_{i} T_{i}} + p_{i T_{i}} - t_{i} \right),\; \min_{1 \le i \le m}\left\{ e_{i} \mid e_{i} > 0 \right\} \right\}.
\]

Let $\Lambda = \{ i \mid s_{T^{0}_{i} T_{i}} + p_{i T_{i}} - t_{i} = \Delta t \}$; then the state at the next decision-making epoch is represented as

\[
\begin{aligned}
s' = \Big[\, & T_{i}\ (i \in \Lambda),\ T^{0}_{i}\ (i \notin \Lambda);\;\; 0\ (i \in \Lambda),\ T_{i}\ (i \notin \Lambda);\;\; 0\ (i \in \Lambda),\ t_{i} + \Delta t\ (i \notin \Lambda); \\
& d_{j} - \Delta t \, L \sum_{i=1}^{m} \frac{\delta_{Y}( T_{i}, j )}{s_{T^{0}_{i} j} + p_{ij}}\ (1 \le j \le n);\;\; \max\{ e_{i} - \Delta t,\, 0 \}\ (1 \le i \le m) \,\Big],
\end{aligned}
\tag{16}
\]

where

\[
\delta_{Y}( T_{i}, j ) =
\begin{cases}
1, & \text{if } T_{i} = j,\\
0, & \text{if } T_{i} \ne j.
\end{cases}
\tag{17}
\]
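A compact sketch (ours, with illustrative names) of the sojourn time and the rate-based TPV update in (16), assuming arrays T0, T, t_setup, d, and e for the interim state and 0-based job type indices (with −1 marking an idle machine):

    # Sketch of one state transition per (16); assumes every machine is busy (interim state)
    # and at least one machine still has unscheduled normal production time.
    def advance(T0, T, t_setup, d, e, s, p, L):
        m, n = len(T), len(d)
        # time until each busy machine finishes its current setup plus lot
        remaining = [s[T0[i]][T[i]] + p[i][T[i]] - t_setup[i] for i in range(m)]
        dt = min(min(remaining), min(x for x in e if x > 0))   # sojourn time Delta t
        finished = {i for i in range(m) if abs(remaining[i] - dt) < 1e-9}
        for j in range(n):
            # rate-based decrease of the unsatisfied TPV d_j (clamped at zero as a sketch)
            rate = sum(L / (s[T0[i]][j] + p[i][j]) for i in range(m) if T[i] == j)
            d[j] = max(d[j] - dt * rate, 0.0)
        for i in range(m):
            e[i] = max(e[i] - dt, 0.0)
            if i in finished:
                T0[i], T[i], t_setup[i] = T[i], -1, 0.0        # machine becomes idle
            else:
                t_setup[i] += dt
        return dt, finished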

Apparently, we have $P^{a}_{s_{d} s'} = 1$, where $P^{a}_{s_{d} s'}$ denotes the one-step transition probability from state $s_{d}$ to state $s'$ under action $a$. Let $s_{u}$ and $\tau_{u}$ denote the system state and the time, respectively, at the $u$th decision-making epoch. It is easy to show that

\[
P\{ s_{u+1} = X,\ \tau_{u+1} - \tau_{u} \le t \mid s_{0}, s_{1}, \ldots, s_{u}, \tau_{0}, \tau_{1}, \ldots, \tau_{u} \}
= P\{ s_{u+1} = X,\ \tau_{u+1} - \tau_{u} \le t \mid s_{u}, \tau_{u} \},
\tag{18}
\]

where $\tau_{u+1} - \tau_{u}$ is the sojourn time at state $s_{u}$. That is, the decision process associated with $( s, \tau )$ is a Semi-Markov Decision Process with particular transition probabilities and sojourn times. The terminal state of an episode is

\[
s_{e} = \big[\, T^{0}_{ie}\ (1 \le i \le m),\; T_{ie}\ (1 \le i \le m),\; t_{ie}\ (1 \le i \le m),\; d_{je}\ (1 \le j \le n),\; 0\ (1 \le i \le m) \,\big].
\tag{19}
\]


2.3. Action. Prior domain knowledge can be utilized to fully exploit the agent's learning ability. Apparently, an optimal schedule must be nonidle (i.e., no machine has idle time during the whole schedule). It may happen that more than one machine is free at the same decision-making epoch. An action determines which lot is to be processed on which machine. In the following, we define seven actions using heuristic algorithms.

Action 1. Select jobs by the WSPT heuristics as follows.

Algorithm 1 (WSPT heuristics).

Step 1. Let SM denote the set of free machines at a decision-making epoch.

Step 2. Choose machine $k$ to process job type $q$ with $( k, q ) = \arg\min_{(i,j)} \{ ( s_{T^{0}_{i} j} + p_{ij} ) / w_{j} \mid 1 \le j \le n,\ i \in M_{j},\ \text{and}\ i \in \mathrm{SM} \}$.

Step 3. Remove $k$ from SM. If SM $\ne \Phi$, go to Step 2; otherwise the algorithm halts.
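A brief sketch (ours) of the repeated selection in Algorithm 1; free_machines, qualified, T0, s, p, and w follow the notation above, with 0-based job type indices:

    # Sketch of Algorithm 1 (WSPT): repeatedly pick the free machine / job type pair
    # minimizing (setup + processing time) / weight. Names are illustrative.
    def wspt_assign(free_machines, qualified, T0, s, p, w, n):
        assignment = {}
        free = set(free_machines)
        while free:
            candidates = [(i, j) for i in free for j in range(n) if i in qualified[j]]
            k, q = min(candidates,
                       key=lambda ij: (s[T0[ij[0]]][ij[1]] + p[ij[0]][ij[1]]) / w[ij[1]])
            assignment[k] = q      # machine k will process one lot of job type q
            free.remove(k)
        return assignment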

Action 2. Select jobs by the MWSPT (modified WSPT) heuristics as follows.

Algorithm 2 (MWSPT heuristics).

Step 1. Define SM as in Step 1 of Algorithm 1, and let SJ denote the set of job types whose TPVs have not been satisfied at a decision-making epoch; that is, SJ $= \{ j \mid Y_{j} < D_{j},\ 1 \le j \le n \}$. If SJ $= \Phi$, go to Step 4.

Step 2. Choose job type $q$ to process on machine $k$ with $( k, q ) = \arg\min_{(i,j)} \{ ( s_{T^{0}_{i} j} + p_{ij} ) / w_{j} \mid j \in \mathrm{SJ},\ i \in M_{j},\ \text{and}\ i \in \mathrm{SM} \}$.

Step 3. Remove $k$ from SM. Set $Y_{q} = Y_{q} + L$ and update SJ. If SJ $\ne \Phi$ and SM $\ne \Phi$, go to Step 2; if SJ $= \Phi$ and SM $\ne \Phi$, go to Step 4; otherwise the algorithm halts.

Step 4. Choose machine $k$ to process job type $q$ with $( k, q ) = \arg\min_{(i,j)} \{ ( s_{T^{0}_{i} j} + p_{ij} ) / w_{j} \mid 1 \le j \le n,\ i \in M_{j},\ \text{and}\ i \in \mathrm{SM} \}$.

Step 5. Remove $k$ from SM. If SM $\ne \Phi$, go to Step 4; otherwise the algorithm halts.

Action 3. Select jobs by the Ranking Algorithm (RA) as follows.

Algorithm 3 (Ranking Algorithm).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2. If SJ $= \Phi$, go to Step 5.

Step 2. For each job type $j$ ($j \in \mathrm{SJ}$), sort the machines in increasing order of $( s_{V_{i} j} + p_{ij} )$ ($1 \le i \le m$), where $V_{i}$ is defined as follows:

\[
V_{i} =
\begin{cases}
T_{i}, & \text{if machine } i \text{ is busy},\\
T^{0}_{i}, & \text{if machine } i \text{ is free},
\end{cases}
\quad (1 \le i \le m).
\tag{20}
\]

Let $g_{ij}$ ($1 \le g_{ij} \le m$) denote the order of machine $i$ ($1 \le i \le m$) for job type $j$ ($1 \le j \le n$).

Step 3. Choose job $q$ to process on machine $k$ with $( k, q ) = \arg\min_{(i,j)} \{ g_{ij} \mid j \in \mathrm{SJ},\ i \in M_{j},\ \text{and}\ i \in \mathrm{SM} \}$. If there exist two or more machine-job combinations (say machine-job combinations $( i_{1}, j_{1} ), ( i_{2}, j_{2} ), \ldots, ( i_{h}, j_{h} )$) with the same minimal order, that is, $( i_{e}, j_{e} ) = \arg\min_{(i,j)} \{ g_{ij} \mid j \in \mathrm{SJ},\ i \in M_{j},\ \text{and}\ i \in \mathrm{SM} \}$ holds for every $e$ ($1 \le e \le h$), then choose job type $j_{e}$ to process on machine $i_{e}$ with $( i_{e}, j_{e} ) = \arg\min \{ ( s_{V_{i_{e}} j_{e}} + p_{i_{e} j_{e}} ) / w_{j_{e}} \mid 1 \le e \le h \}$.

Step 4. Remove $k$ or $i_{e}$ from SM. Set $Y_{q} = Y_{q} + L$ or $Y_{j_{e}} = Y_{j_{e}} + L$, and update SJ. If SJ $\ne \Phi$ and SM $\ne \Phi$, go to Step 3; if SJ $= \Phi$ and SM $\ne \Phi$, go to Step 5; otherwise the algorithm halts.

Step 5. Choose job $q$ to process on machine $k$ with $( k, q ) = \arg\min_{(i,j)} \{ g_{ij} \mid 1 \le j \le n,\ i \in M_{j},\ \text{and}\ i \in \mathrm{SM} \}$. If there exist two or more machine-job combinations (say machine-job combinations $( i_{1}, j_{1} ), ( i_{2}, j_{2} ), \ldots, ( i_{h}, j_{h} )$) with the same minimal order, choose job type $j_{e}$ to process on machine $i_{e}$ with $( i_{e}, j_{e} ) = \arg\min \{ ( s_{V_{i_{e}} j_{e}} + p_{i_{e} j_{e}} ) / w_{j_{e}} \mid 1 \le e \le h \}$.

Step 6. Remove $k$ or $i_{e}$ from SM. If SM $\ne \Phi$, go to Step 5; otherwise the algorithm halts.

Action 4. Select jobs by the LFM-MWSPT heuristics as follows.

Algorithm 4 (LFM-MWSPT heuristics).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2.

Step 2. Select a free machine (say $k$) from SM by the LFM (Least Flexible Machine, see [33]) rule, and choose a job type to process on machine $k$ following the MWSPT heuristics.

Step 3. Remove $k$ from SM. If SM $\ne \Phi$, go to Step 2; otherwise the algorithm halts.
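As an illustration only, one common reading of the LFM rule [33] selects, among the free machines, the machine that is qualified for the fewest job types; a minimal sketch under that assumption:

    # Sketch of one plausible LFM (Least Flexible Machine) selection: among the free
    # machines, pick the one qualified for the fewest job types. Names are illustrative.
    def least_flexible_machine(free_machines, qualified, n):
        def flexibility(i):
            return sum(1 for j in range(n) if i in qualified[j])
        return min(free_machines, key=flexibility)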

Action 5. Select jobs by the LFM-RA heuristics as follows.

Algorithm 5 (LFM-RA heuristics).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2.

Step 2. Select a free machine (say $k$) from SM by the LFM rule, and choose a job type to process on machine $k$ following the Ranking Algorithm.

Step 3. Remove $k$ from SM. If SM $\ne \Phi$, go to Step 2; otherwise the algorithm halts.

Action 6. Each free machine selects the same job type as the latest one it processed.

Action 7. Select no job.

At the start of a schedule horizon, the system is at the initial state $s_{0}$. If there are free machines, they select jobs to process by taking one of Actions 1–6; otherwise Action 7 is chosen. Afterwards, when any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new state $s_{u}$. The agent selects an action at this decision-making epoch, and the system state transfers into an interim state $s$. When again any machine completes processing a lot or any machine's normal production time is used up, the system transfers into the next decision-making state $s_{u+1}$, and the agent receives reward $r_{u+1}$, which is computed from $s_{u}$ and the sojourn time between the two transitions into $s_{u}$ and $s_{u+1}$ (as shown in Section 2.4). This procedure is repeated until a terminal state is attained. An episode is a trajectory from the initial state to a terminal state of a schedule horizon. Action 7 is available only at the decision-making states in which all machines are busy.

2.4. Reward Function. A reward function follows several principles. It indicates the instant impact of an action on the schedule; that is, it links the action with an immediate reward. Moreover, the accumulated reward indicates the objective function value; that is, the agent receives a large total reward for a small objective function value.

Definition 6 (reward function). Let $K$ denote the number of decision-making epochs during an episode, $t_{u}$ ($0 \le u < K$) the time at the $u$th decision-making epoch, $T_{iu}$ ($1 \le i \le m$, $1 \le u \le K$) the job type of the lot which machine $i$ processes during the time interval $( t_{u-1}, t_{u} ]$, $T^{0}_{iu}$ the job type of the lot which precedes the lot machine $i$ processes during the time interval $( t_{u-1}, t_{u} ]$, and $Y_{j}( t_{u} )$ the processed volume of job type $j$ by time $t_{u}$. It follows that

\[
Y_{j}( t_{u} ) - Y_{j}( t_{u-1} ) = \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}},
\tag{21}
\]

where $\delta(i, j)$ is an indicator function defined as

\[
\delta(i, j) =
\begin{cases}
1, & T_{iu} = j,\\
0, & T_{iu} \ne j.
\end{cases}
\tag{22}
\]

Let $r_{u}$ ($u = 1, 2, \ldots, K$) denote the reward at the $u$th decision-making epoch. $r_{u}$ is defined as

\[
\begin{aligned}
r_{u} = \sum_{j=1}^{n} \Bigg\{ &
\min\left\{ \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}},\;
\big[ D_{j} - Y_{j}( t_{u-1} ) \big]^{+} \right\} w_{j} \\
& + \max\left\{ \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}}
- \big[ D_{j} - Y_{j}( t_{u-1} ) \big]^{+},\; 0 \right\} \frac{w_{j}}{M} \Bigg\}.
\end{aligned}
\tag{23}
\]
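A small sketch (ours, with hypothetical names) of the reward in (23), given the volume of each job type processed during the interval $( t_{u-1}, t_{u} ]$ and the unsatisfied TPV at the start of the interval:

    # Sketch of reward (23): delta_volume[j] is the volume of type j processed during the
    # interval; remaining[j] = [D_j - Y_j(t_{u-1})]^+. Names are illustrative.
    def step_reward(delta_volume, remaining, w, M):
        r = 0.0
        for j in range(len(w)):
            useful = min(delta_volume[j], remaining[j])         # volume counting toward TPV
            excess = max(delta_volume[j] - remaining[j], 0.0)   # volume beyond TPV
            r += useful * w[j] + excess * w[j] / M
        return r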

The reward function has the following property.

Theorem 7. Maximization of the total reward $R$ in an episode is equivalent to minimization of objective function (5).

Proof. The total reward in an episode is

\[
\begin{aligned}
R &= \sum_{u=1}^{K} r_{u} \\
&= \sum_{u=1}^{K} \sum_{j=1}^{n} \Bigg\{
\min\left\{ \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}},\;
\big[ D_{j} - Y_{j}( t_{u-1} ) \big]^{+} \right\} w_{j}
+ \max\left\{ \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}}
- \big[ D_{j} - Y_{j}( t_{u-1} ) \big]^{+},\; 0 \right\} \frac{w_{j}}{M} \Bigg\} \\
&= \sum_{j=1}^{n} \sum_{u=1}^{K} \Bigg\{
\min\left\{ \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}},\;
\big[ D_{j} - Y_{j}( t_{u-1} ) \big]^{+} \right\} w_{j}
+ \max\left\{ \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}}
- \big[ D_{j} - Y_{j}( t_{u-1} ) \big]^{+},\; 0 \right\} \frac{w_{j}}{M} \Bigg\}.
\end{aligned}
\tag{24}
\]

It is easy to show that

\[
Y_{j} = \sum_{u=1}^{K} \sum_{i=1}^{m} \frac{( t_{u} - t_{u-1} ) \, \delta(i, j) \, L}{s_{T^{0}_{iu} T_{iu}} + p_{i T_{iu}}}.
\tag{25}
\]

It follows that

\[
\begin{aligned}
R &= \sum_{j=1}^{n} \left[ w_{j} \min\{ D_{j}, Y_{j} \} + \frac{w_{j}}{M} \max\{ 0, Y_{j} - D_{j} \} \right] \\
&= \sum_{j \in \Omega_{1}} \left[ w_{j} D_{j} + \frac{w_{j}}{M} ( Y_{j} - D_{j} ) \right] + \sum_{j \in \Omega_{2}} w_{j} Y_{j} \\
&= \sum_{j=1}^{n} w_{j} D_{j} - \left\{ \sum_{j \in \Omega_{1}} \left[ -\frac{w_{j}}{M} ( Y_{j} - D_{j} ) \right] + \sum_{j \in \Omega_{2}} w_{j} ( D_{j} - Y_{j} ) \right\} \\
&= \sum_{j=1}^{n} w_{j} D_{j} - \sum_{j=1}^{n} \left[ w_{j} ( D_{j} - Y_{j} )^{+} - \frac{w_{j}}{M} ( Y_{j} - D_{j} )^{+} \right],
\end{aligned}
\tag{26}
\]

where $\Omega_{1} = \{ j \mid Y_{j} > D_{j} \}$ and $\Omega_{2} = \{ j \mid Y_{j} \le D_{j} \}$. Since $\sum_{j=1}^{n} w_{j} D_{j}$ is a constant, it follows that

\[
\max R \Longleftrightarrow \min \sum_{j=1}^{n} \left[ w_{j} ( D_{j} - Y_{j} )^{+} - \frac{w_{j}}{M} ( Y_{j} - D_{j} )^{+} \right].
\tag{27}
\]


3. The Reinforcement Learning Algorithm

The chip attach scheduling problem is converted into an RL problem with a terminal state in Section 2. To apply Q-learning to this RL problem, another issue arises, namely, how to tailor the Q-learning algorithm to this particular context. Since some state variables are continuous, the state space is infinite. This RL system is not in tabular form, and it is impossible to maintain Q-values for all state-action pairs. Thus, we use a linear function with the gradient-descent method to approximate the Q-value function. Q-values are represented as a linear combination of a set of basis functions $\Phi_{k}(s)$ ($1 \le k \le 4m + n$), as shown in the next formula:

\[
Q( s, a ) = \sum_{k=1}^{4m+n} c^{a}_{k} \, \Phi_{k}( s ),
\tag{28}
\]

where $c^{a}_{k}$ ($1 \le a \le 6$, $1 \le k \le 4m + n$) are the weights of the basis functions. Each state variable corresponds to a basis function. The following basis functions are defined to normalize the state variables:

\[
\Phi_{k}( s ) =
\begin{cases}
\dfrac{T^{0}_{k}}{n}, & 1 \le k \le m,\\[6pt]
\dfrac{T_{k-m}}{n}, & m + 1 \le k \le 2m,\\[6pt]
\dfrac{t_{k-2m}}{\max\{ s_{j_{1} j_{2}} + p_{j_{2}} \mid 1 \le j_{1} \le n,\ 1 \le j_{2} \le n \}}, & 2m + 1 \le k \le 3m,\\[6pt]
\dfrac{d_{k-3m}}{D_{k-3m}}, & 3m + 1 \le k \le 3m + n,\\[6pt]
\dfrac{e_{k-3m-n}}{\mathrm{TH}}, & 3m + n + 1 \le k \le 4m + n.
\end{cases}
\tag{29}
\]
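A sketch (ours) of assembling the normalized feature vector of (29) for one state, keeping the same ordering of the $4m + n$ components; max_setup_plus_proc stands for the $\max\{\cdot\}$ term in the third case:

    # Sketch of the basis functions (29): each state variable is normalized into a feature.
    # Job types are encoded as integers 1..n (0 for an idle machine); names are illustrative.
    def features(T0, T, t_setup, d, e, D, n, TH, max_setup_plus_proc):
        phi = []
        phi += [T0[i] / n for i in range(len(T0))]                               # components 1..m
        phi += [T[i] / n for i in range(len(T))]                                 # m+1..2m
        phi += [t_setup[i] / max_setup_plus_proc for i in range(len(t_setup))]   # 2m+1..3m
        phi += [d[j] / D[j] for j in range(n)]                                   # 3m+1..3m+n
        phi += [e[i] / TH for i in range(len(e))]                                # 3m+n+1..4m+n
        return phi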

Let $C^{a}$ denote the vector of the weights of the basis functions as follows:

\[
C^{a} = \left( c^{a}_{1}, c^{a}_{2}, \ldots, c^{a}_{4m+n} \right)^{T}.
\tag{30}
\]

The RL algorithm is presented as Algorithm 8, where $\alpha$ is the learning rate, $\gamma$ is a discount factor, $E(a)$ is the vector of eligibility traces for action $a$, $\delta(a)$ is an error variable for action $a$, and $\lambda$ is a factor for updating the eligibility traces.

Algorithm 8 (Q-learning with linear gradient-descent function approximation for chip attach scheduling).

Initialize $C^{a}$ and $E(a)$ randomly. Set the parameters $\alpha$, $\gamma$, and $\lambda$.
Let num_episode denote the number of episodes having been run. Set num_episode = 0.
While num_episode < MAX_EPISODE do
    Set the current decision-making state $s \leftarrow s_{0}$.
    While at least one of the state variables $e_{i}$ ($1 \le i \le m$) is larger than zero do
        Select action $a$ for state $s$ by the $\varepsilon$-greedy policy.
        Implement action $a$. Determine the next event triggering a state transition and the sojourn time. Once any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new decision-making state $s'$ ($e'_{i}$ ($1 \le i \le m$) is a component of $s'$).
        Compute the reward $r^{a}_{s s'}$.
        Update the vector of weights in the approximate Q-value function of action $a$:
        \[
        \begin{aligned}
        \delta(a) &\leftarrow r^{a}_{s s'} + \gamma \max_{a'} Q( s', a' ) - Q( s, a ),\\
        E(a) &\leftarrow \lambda E(a) + \nabla_{C^{a}} Q( s, a ),\\
        C^{a} &\leftarrow C^{a} + \alpha \, \delta(a) \, E(a).
        \end{aligned}
        \tag{31}
        \]
        Set $s \leftarrow s'$.
    If $e_{i} = 0$ holds for all $i$ ($1 \le i \le m$), set num_episode = num_episode + 1.
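A runnable sketch (ours, simplified) of one update step of (31) with the linear approximation (28); for a linear approximator, the gradient of $Q(s, a)$ with respect to $C^{a}$ is simply the feature vector:

    import random

    # Sketch of Q-learning with linear function approximation and eligibility traces.
    # C[a] and E[a] are lists of length 4m+n; phi_s is the feature vector of state s.
    def q_value(C, a, phi):
        return sum(c * f for c, f in zip(C[a], phi))

    def select_action(C, phi, num_actions, epsilon):
        if random.random() < epsilon:                      # epsilon-greedy exploration
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: q_value(C, a, phi))

    def update(C, E, a, reward, phi_s, phi_next, alpha, gamma, lam, num_actions):
        target = reward + gamma * max(q_value(C, b, phi_next) for b in range(num_actions))
        delta = target - q_value(C, a, phi_s)              # TD error delta(a)
        E[a] = [lam * e + f for e, f in zip(E[a], phi_s)]  # eligibility trace update
        C[a] = [c + alpha * delta * e for c, e in zip(C[a], E[a])]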

4. Experiment Results

In the past, the company used a manual process to conduct chip attach scheduling. A heuristic algorithm called Largest Weight First (LWF) was used, as follows.

Algorithm 9 (Largest Weight First (LWF) heuristics). Initialize SM with the set of all machines (i.e., SM $= \{ i \mid 1 \le i \le m \}$), and define SJ as in Step 1 of Algorithm 2. Initialize $e_{i}$ ($1 \le i \le m$) with each machine's normal production time. Set $Y_{j} = I_{j}$, where $I_{j}$ is the initial production volume of job type $j$.

Step 1. Schedule the job types in decreasing order of weights in order to meet their TPVs.

While SJ $\ne \Phi$ and SM $\ne \Phi$ do
    Choose job $q$ with $q = \arg\max\{ w_{j} \mid j \in \mathrm{SJ} \}$.
    While SM $\cap M_{q} \ne \Phi$ and $Y_{q} < D_{q}$ do
        Choose machine $i$ to process job $q$ with $i = \arg\min\{ p_{kq} / w_{q} \mid k \in \mathrm{SM} \cap M_{q} \}$.
        If $e_{i} - s_{T^{0}_{i} q} < ( D_{q} - Y_{q} ) \, p_{iq} / L$, then
            set $Y_{q} \leftarrow Y_{q} + L ( e_{i} - s_{T^{0}_{i} q} ) / p_{iq}$, $e_{i} = 0$, and remove $i$ from SM;
        else set $e_{i} \leftarrow e_{i} - s_{T^{0}_{i} q} - ( D_{q} - Y_{q} ) \, p_{iq} / L$, $Y_{q} = D_{q}$, and $T^{0}_{i} = q$.

Step 2. Allocate the excess production capacity.

If SM $\ne \Phi$, then for each machine $i$ ($i \in \mathrm{SM}$):
    Choose job $j$ with $j = \arg\max\{ ( e_{i} - s_{T^{0}_{i} q} ) \, w_{q} / p_{iq} \mid 1 \le q \le n \}$; set $Y_{j} \leftarrow Y_{j} + L ( e_{i} - s_{T^{0}_{i} j} ) / p_{ij}$, $e_{i} = 0$.


Table 1: Comparison of objective function values using heuristics and Q-learning.

Dataset no.  WSPT    MWSPT   RA      LFM-MWSPT  LFM-RA   LWF     Q-learning
1            88867   78116   59758   87689      57582    42253   −38613
2            13844   13586   110747  12669      10907    95926   76657
3            11901   10875   12409   10425      12190    83332   23775
4            83681   60797   39073   69405      45575    33920   −44275
5            12938   12847   96960   10917      99827    89863   21414
6            70840   55692   51108   66213      51041    16930   −54467
7            12090   10060   95399   10933      90754    76422   27374
8            10242   10780   11656   10333      10762    93663   11840
9            94606   87914   81763   88812      80331    60164   33036
10           90803   88164   90773   87926      8856293  56307   22798
11           11113   88287   82916   97882      85605    60160   16493
12           10029   89005   86692   95836      78342    60744   −38617
Average      10419   94123   86321   95547      84685    64147   12233

Table 2: Comparison of unsatisfied TPV index using heuristics and Q-learning.

Dataset no.  WSPT     MWSPT    RA       LFM-MWSPT  LFM-RA   LWF      Q-learning
1            0.1179   0.1025   0.0789   0.1170     0.0754   0.0554   0.0081
2            0.1651   0.1611   0.1497   0.1499     0.1482   0.1421   0.0137
3            0.1421   0.1289   0.1475   0.1227     0.1455   0.0987   0.0691
4            0.1540   0.1104   0.0716   0.1258     0.0854   0.0614   0.0088
5            0.1588   0.1564   0.1186   0.1303     0.1215   0.1094   0.0571
6            0.1053   0.0819   0.0757   0.1006     0.0762   0.0248   0.0137
7            0.1462   0.1209   0.1150   0.1292     0.1082   0.0917   0.0582
8            0.1266   0.1324   0.1437   0.1272     0.1309   0.1150   0.0381
9            0.1315   0.1211   0.1133   0.1249     0.1127   0.0828   0.0815
10           0.1154   0.1112   0.1151   0.1105     0.1110   0.0709   0.0536
11           0.1544   0.1215   0.1146   0.1387     0.1204   0.0827   0.0690
12           0.1262   0.1112   0.1088   0.1194     0.1002   0.0758   0.0118
Average      0.1370   0.1216   0.1112   0.1247     0.1113   0.0842   0.0402

The chip attach station consists of 10 machines and normally processes more than ten job types. We selected 12 sets of industrial data for experiments comparing the Q-learning algorithm (Algorithm 8) and the six heuristics (Algorithms 1–5 and 9): WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. For each dataset, Q-learning repeatedly solves the scheduling problem 1000 times and selects the optimal schedule of the 1000 solutions. Table 1 shows the objective function values of all datasets using the seven algorithms. Individually, any of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA obtains larger objective function values than LWF for every dataset. Nevertheless, taking WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA as actions, the Q-learning algorithm achieves an objective function value much smaller than LWF for each dataset. In Tables 1–4, the bottom row presents the average value over all datasets. As shown in Table 1, the average objective function value of Q-learning is only 12233, less than that of LWF, 64147, by a large amount of 80.92%.

Besides the objective function value, we propose two indices, the unsatisfied TPV index and the unsatisfied job type index, to measure the performance of the seven algorithms. The unsatisfied TPV index (UPI) is defined as formula (32) and indicates the weighted proportion of unfinished Target Production Volume. Table 2 compares the UPIs of all datasets using the seven algorithms. Again, any of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA individually obtains a larger UPI than LWF for each dataset. However, the Q-learning algorithm achieves a smaller UPI than LWF does for each dataset. The average UPI of Q-learning is only 0.0402, less than that of LWF, 0.0842, by a large amount of 52.20%. Let $J$ denote the set $\{ j \mid 1 \le j \le n,\ Y_{j} < D_{j} \}$. The unsatisfied job type index (UJTI) is defined as formula (33) and indicates the weighted proportion of the job types whose TPVs are not completely satisfied. Table 3 compares the UJTIs of all datasets using the seven algorithms. With most datasets, the Q-learning algorithm achieves smaller UJTIs than LWF. The average UJTI of Q-learning is 0.0802, which is remarkably less than that of LWF, 0.1176, by 31.81%. Consider

\[
\mathrm{UPI} = \frac{\sum_{j=1}^{n} w_{j} ( D_{j} - Y_{j} )^{+}}{\sum_{j=1}^{n} w_{j} D_{j}},
\tag{32}
\]

\[
\mathrm{UJTI} = \frac{\sum_{j \in J} w_{j}}{\sum_{j=1}^{n} w_{j}}.
\tag{33}
\]
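A short sketch (ours) of computing the two indices (32) and (33) from the final processed volumes:

    # Sketch of UPI (32) and UJTI (33); w, D, Y are lists indexed by job type.
    def upi(w, D, Y):
        return (sum(w[j] * max(D[j] - Y[j], 0.0) for j in range(len(w)))
                / sum(w[j] * D[j] for j in range(len(w))))

    def ujti(w, D, Y):
        unsatisfied = [j for j in range(len(w)) if Y[j] < D[j]]
        return sum(w[j] for j in unsatisfied) / sum(w)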


Table 3: Comparison of unsatisfied job type index using heuristics and Q-learning.

Dataset no.  WSPT     MWSPT    RA       LFM-MWSPT  LFM-RA   LWF      Q-learning
1            0.1290   0.2615   0.1667   0.1650     0.1793   0.0921   0.0678
2            0.1924   0.2953   0.2302   0.2650     0.2097   0.1320   0.0278
3            0.2250   0.3287   0.2126   0.2564     0.2278   0.0921   0.0921
4            0.0781   0.2987   0.0278   0.1290     0.0828   0.0278   0.0278
5            0.2924   0.3169   0.3002   0.2224     0.2632   0.2055   0.0571
6            0.2290   0.3062   0.1817   0.2290     0.1632   0.0571   0.0278
7            0.2650   0.2987   0.2160   0.2529     0.2075   0.1320   0.1647
8            0.1924   0.3225   0.2067   0.1813     0.1696   0.1320   0.1320
9            0.2221   0.3250   0.1403   0.1892     0.2073   0.1320   0.0749
10           0.2621   0.3304   0.2667   0.2859     0.2708   0.1781   0.0678
11           0.2029   0.2896   0.2578   0.2194     0.2220   0.1381   0.1542
12           0.1924   0.3271   0.2302   0.1838     0.2182   0.0921   0.0678
Average      0.2069   0.3084   0.2031   0.2149     0.2018   0.1176   0.0802

Table 4: Comparison of the total setup time using heuristics and Q-learning.

Dataset no.  WSPT     MWSPT    RA       LFM-MWSPT  LFM-RA   LWF      Q-learning
1            0.8133   0.8497   0.3283   0.7231     0.3884   0.3895   1.0000
2            0.8333   1.3000   0.4358   0.7564     0.5263   0.4094   1.0000
3            0.8712   1.1633   0.4207   0.6361     0.4937   0.4298   1.0000
4            1.1280   0.7123   0.4629   0.8516     0.5139   0.4318   1.0000
5            0.9629   1.3121   0.4179   0.8597     0.5115   0.3873   1.0000
6            0.7489   1.0393   0.4104   0.7489     0.4542   0.4074   1.0000
7            1.7868   2.2182   0.8223   1.4069     1.0125   0.4174   1.0000
8            0.6456   0.8508   0.4055   0.6694     0.5053   0.3795   1.0000
9            0.9245   0.9946   0.5013   0.7821     0.6694   0.4163   1.0000
10           1.1025   1.7875   0.6703   1.0371     0.9079   0.4894   1.0000
11           0.9973   1.3655   0.3994   0.9686     0.5129   0.4066   1.0000
12           0.7904   1.1111   0.4419   0.6195     0.5081   0.4258   1.0000
Average      0.9671   1.2254   0.4764   0.8383     0.5837   0.4158   1.0000

Table 4 shows the total setup time of all datasets using the seven algorithms. For reasons of commercial confidentiality, we used normalized data, with the setup time of each dataset divided by the result of this dataset using Q-learning. Thus, the total setup times of all datasets by Q-learning are converted into one, and the data of the six heuristics are adjusted accordingly. The Q-learning algorithm requires more than twice the setup time that LWF does for each dataset. The average accumulated setup time of LWF is only 41.58 percent of that of Q-learning.

The previous experimental results reveal that, for the whole scheduling task, any individual one of the five action heuristics (WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA) for Q-learning performs worse than the LWF heuristics. However, Q-learning greatly outperforms LWF in terms of the three performance measures: the objective function value, UPI, and UJTI. This demonstrates that some action heuristics provide better actions than the LWF heuristics at some states. While repeatedly solving the scheduling problem, the Q-learning system perceives the insights of the scheduling problem automatically and adjusts its actions towards the optimal ones facing different system states. The actions at all states form a new optimized policy, which is different from any policy following any individual action heuristics or the LWF heuristics. That is, Q-learning incorporates the merits of the five alternative heuristics, uses them to schedule jobs flexibly, and obtains results much better than any individual action heuristics and the LWF heuristics. In the experiments, Q-learning achieves high-quality schedules at the cost of inducing more setup time. In other words, Q-learning utilizes the machines more efficiently by increasing conversions among a variety of job types.

5. Conclusions

We apply Q-learning to study lot-based chip attach scheduling in back-end semiconductor manufacturing. To apply reinforcement learning to scheduling, the critical issue is the conversion of scheduling problems into RL problems. We convert the chip attach scheduling problem into a particular SMDP problem by a Markovian state representation. Five heuristic algorithms, WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA, are selected as actions so as to utilize prior domain knowledge. The reward function is directly related to the scheduling objective function, and we prove that maximizing the accumulated reward is equivalent to minimizing the objective function. Gradient-descent linear function approximation is combined with the Q-learning algorithm.

Q-learning exploits the inner structure of the scheduling problem by solving it repeatedly. It learns a domain-specific policy from the experienced episodes through interaction and then applies it to later episodes. We define two indices, the unsatisfied TPV index and the unsatisfied job type index, together with the objective function value to measure the performance of Q-learning and the heuristics. Experiments with industrial datasets show that Q-learning clearly outperforms the six heuristic algorithms WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. Compared with LWF, Q-learning achieves reductions of the three performance measures, respectively, by average levels of 52.20%, 31.81%, and 80.92%. With Q-learning, chip attach scheduling is optimized through increasing effective job type conversions.

Disclosure

Given the sensitive and proprietary nature of the semiconductor manufacturing environment, we use normalized data in this paper.

Acknowledgments

This project is supported by the National Natural Science Foundation of China (Grant no. 71201026), the Science and Technological Program for Dongguan's Higher Education, Science and Research, and Health Care Institutions (no. 2011108102017), and the Humanities and Social Sciences Program of the Ministry of Education of China (no. 10YJC630405).

References

[1] M. X. Weng, J. Lu, and H. Ren, "Unrelated parallel machine scheduling with setup consideration and a total weighted completion time objective," International Journal of Production Economics, vol. 70, no. 3, pp. 215–226, 2001.

[2] M. Gairing, B. Monien, and A. Woclaw, "A faster combinatorial approximation algorithm for scheduling unrelated parallel machines," in Automata, Languages and Programming, vol. 3580 of Lecture Notes in Computer Science, pp. 828–839, 2005.

[3] G. Mosheiov, "Parallel machine scheduling with a learning effect," Journal of the Operational Research Society, vol. 52, no. 10, pp. 1–5, 2001.

[4] G. Mosheiov and J. B. Sidney, "Scheduling with general job-dependent learning curves," European Journal of Operational Research, vol. 147, no. 3, pp. 665–670, 2003.

[5] L. Yu, H. M. Shih, M. Pfund, W. M. Carlyle, and J. W. Fowler, "Scheduling of unrelated parallel machines: an application to PWB manufacturing," IIE Transactions, vol. 34, no. 11, pp. 921–931, 2002.

[6] K. R. Baker and J. W. M. Bertrand, "A dynamic priority rule for scheduling against due-dates," Journal of Operations Management, vol. 3, no. 1, pp. 37–42, 1982.

[7] J. J. Kanet and X. Li, "A weighted modified due date rule for sequencing to minimize weighted tardiness," Journal of Scheduling, vol. 7, no. 4, pp. 261–276, 2004.

[8] R. V. Rachamadugu and T. E. Morton, "Myopic heuristics for the single machine weighted tardiness problem," Working Paper 28-81-82, Graduate School of Industrial Administration, Carnegie-Mellon University, 1981.

[9] A. Volgenant and E. Teerhuis, "Improved heuristics for the n-job single-machine weighted tardiness problem," Computers and Operations Research, vol. 26, no. 1, pp. 35–44, 1999.

[10] D. C. Carroll, Heuristic sequencing of jobs with single and multiple components [Ph.D. thesis], Sloan School of Management, MIT, 1965.

[11] A. P. J. Vepsalainen and T. E. Morton, "Priority rules for job shops with weighted tardiness costs," Management Science, vol. 33, no. 8, pp. 1035–1047, 1987.

[12] R. S. Russell, E. M. Dar-El, and B. W. Taylor, "A comparative analysis of the COVERT job sequencing rule using various shop performance measures," International Journal of Production Research, vol. 25, no. 10, pp. 1523–1540, 1987.

[13] J. Bank and F. Werner, "Heuristic algorithms for unrelated parallel machine scheduling with a common due date, release dates, and linear earliness and tardiness penalties," Mathematical and Computer Modelling, vol. 33, no. 4-5, pp. 363–383, 2001.

[14] C. F. Liaw, Y. K. Lin, C. Y. Cheng, and M. Chen, "Scheduling unrelated parallel machines to minimize total weighted tardiness," Computers and Operations Research, vol. 30, no. 12, pp. 1777–1789, 2003.

[15] D. W. Kim, D. G. Na, and F. F. Chen, "Unrelated parallel machine scheduling with setup times and a total weighted tardiness objective," Robotics and Computer-Integrated Manufacturing, vol. 19, no. 1-2, pp. 173–181, 2003.

[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.

[17] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], Cambridge University, 1989.

[18] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[19] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms," Neural Computation, vol. 6, pp. 1185–1201, 1994.

[20] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.

[21] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass, USA, 1996.

[22] S. Riedmiller and M. Riedmiller, "A neural reinforcement learning approach to learn local dispatching policies in production scheduling," in Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[23] M. E. Aydin and E. Oztemel, "Dynamic job-shop scheduling using reinforcement learning agents," Robotics and Autonomous Systems, vol. 33, no. 2, pp. 169–178, 2000.

[24] J. Hong and V. V. Prabhu, "Distributed reinforcement learning control for batch sequencing and sizing in just-in-time manufacturing systems," Applied Intelligence, vol. 20, no. 1, pp. 71–87, 2004.

[25] Y. C. Wang and J. M. Usher, "Application of reinforcement learning for agent-based production scheduling," Engineering Applications of Artificial Intelligence, vol. 18, no. 1, pp. 73–82, 2005.

[26] B. C. Csaji, L. Monostori, and B. Kadar, "Reinforcement learning in a distributed market-based production control system," Advanced Engineering Informatics, vol. 20, no. 3, pp. 279–288, 2006.

[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, no. 3, pp. 808–818, 2007.

[28] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning: utilizing OLAP mining in the learning process," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 35, no. 4, pp. 582–590, 2005.

[29] C. D. Paternina-Arboleda and T. K. Das, "A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem," Simulation Modelling Practice and Theory, vol. 13, no. 5, pp. 389–406, 2005.

[30] C. E. Mariano-Romero, V. H. Alcocer-Yamanaka, and E. F. Morales, "Multi-objective optimization of water-using systems," European Journal of Operational Research, vol. 181, no. 3, pp. 1691–1707, 2007.

[31] D. Vengerov, "A reinforcement learning framework for utility-based scheduling in resource-constrained systems," Future Generation Computer Systems, vol. 25, no. 7, pp. 728–736, 2009.

[32] K. Iwamura, N. Mayumi, Y. Tanimizu, and N. Sugimura, "A study on real-time scheduling for holonic manufacturing systems—determination of utility values based on multi-agent reinforcement learning," in Proceedings of the 4th International Conference on Industrial Applications of Holonic and Multi-Agent Systems, pp. 135–144, Linz, Austria, 2009.

[33] M. Pinedo, Scheduling: Theory, Algorithms, and Systems, Prentice Hall, Englewood Cliffs, NJ, USA, 2nd edition, 2002.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 4: Research Article Chip Attach Scheduling in Semiconductor ...downloads.hindawi.com/journals/jie/2013/295604.pdf · Chip Attach Scheduling in Semiconductor Assembly ... and PEVI. IS

4 Journal of Industrial Engineering

For an optimal schedule minimizing objective function(5) if (9) holds and there exists 119895 (1 le 119895 le 119899) such that 119884

119895gt

119863119895 then

119884119895+

119898

sum

119894=1

120573 (119894 119895) 119880119894ge 119863119895

(forall1 le 119895 le 119899) (10)

where 119880119894denotes the unprocessed volume in the last lot

processed by machine 119894 at the end of this schedule horizon(ie the initial volume of the next schedule horizon) and

120573 (119894 119895)

=

1 if machine 119894 is processing job type 119895 atthe end of the schedule horizon

0 otherwise

(11)

According to inequality (9) in any schedule minimizingobjective function (5) any machine will not process a lotbelonging to a job type whose TPV has been satisfied untilTPV of any other job types is also fully satisfied In otherwords inequality (9) guarantees that the objective functiontakesminimization of the total weighted unsatisfied TPV (thefirst item of objective function (5)) as the first priority Thefundamental problem in applying reinforcement learningto scheduling is to convert scheduling problems into RLproblems including representation of state construction ofactions and definition of reward function

22 State Representation and Transition Probability We firstdefine the state variables State variables describe the majorcharacteristics of the system and are capable of trackingthe change of the system status The system state can berepresented by the vector

120593 = [1198790

119894(1 le 119894 le 119898) 119879

119894(1 le 119894 le 119898) 119905

119894(1 le 119894 le 119898)

119889119895(1 le 119895 le 119899) 119890

119894(1 le 119894 le 119898)]

(12)

where 1198790119894(1 le 119894 le 119898) denotes the job type of which the

latest lot completely processed on machine 119894 119879119894(1 le 119894 le 119898)

denotes the job type of which the lot being processed onmachine 119894 (119879

119894equals zero if machine 119894 is idle) 119905

119894(1 le 119894 le 119898)

denotes the time starting from the beginning of the latestsetup on machine 119894 (for convenience we assume that thereis a zero-time setup if 1198790

119894= 119879119894) 119889119895(1 le 119895 le 119899) is unsatisfied

TPV (ie (119863119895minus 119884119895)+) and 119890

119894(1 le 119894 le 119898) represents the

unscheduled normal production time of machine 119894Considering the initial status of machines the initial

system state of the schedule horizon is

1199040= [1198790

1198940(1 le 119894 le 119898) 119879

1198940(1 le 119894 le 119898) 119905

1198940(1 le 119894 le 119898)

119863119895(1 le 119895 le 119899) TH minus 120590

119894minus TE119894(1 le 119894 le 119898)]

(13)

where TH denotes the overall available time in the schedulehorizon 120590

119894denotes the initial production time of machine 119894

and TE119894denotes the engineering time of machine 119894

There are two kinds of events triggering state transitions(1) completion of processing a lot on one or more machines(2) any machinersquos normal production time is entirely sched-uled If the triggering event is completion of processing thestate at the decision-making epoch is represented as

119904119889= [1198790

119894119889(1 le 119894 le 119898) 119879

119894119889(1 le 119894 le 119898) 119905

119894119889(1 le 119894 le 119898)

119889119895119889(1 le 119895 le 119899) 119890

119894119889(1 le 119894 le 119898)]

(14)

where 119894 | 119879119894119889= 0 1 le 119894 le 119898 =Φ If 119879

119894119889= 0 (machine

119894 is idle) then 119905119894119889= 0 If the triggering event is using up a

machinersquos normal production time then 119894 | 119890119894119889= 0 1 le 119894 le

119898 =ΦAssume that after taking action 119886 the system state

immediately transfers form 119904119889to an interim state 119904 as follows

119904 = [1198790

119894(1 le 119894 le 119898) 119879

119894(1 le 119894 le 119898) 119905

119894(1 le 119894 le 119898)

119889119895(1 le 119895 le 119899) 119890

119894(1 le 119894 le 119898)]

(15)

where 119879119894gt 0 for all 119894 (1 le 119894 le 119898) that is all machines are

busyLet Δ119905 denote the sojourn time at state 119904 then Δ119905 =

minmin1le119894le119898

1199041198790

119894119879119894

+ 119901119894119879119894

minus 119905119894 min

1le119894le119898119890119894| 119890119894gt 0 Let

Λ = 119894 | 1199041198790

119894119879119894

+ 119901119894119879119894

minus 119905119894= Δ119905 then the state at the next

decision-making epoch is represented as

1199041015840

= [119879119894(119894 isin Λ) 119879

0

119894(119894 notin Λ) 119879

119894= 0 (119894 isin Λ) 119879

119894(119894 notin Λ)

0 (119894 isin Λ) 119905119894+ Δ119905 (119894 notin Λ)

119889119895minusΔ119905119871sum

119898

119894=1120575119884(119879119894 119895)

1199041198790

119894119895+ 119901119894119895

(1 le 119895 le 119899)

max 119890119894minus Δ119905 0 (1 le 119894 le 119898)]

(16)

where

120575119884(119879119894 119895) =

1 if 119879119894= 119895

0 if 119879119894= 119895

(17)

Apparently we have 1198751198861199041198891199041015840 = 1 where 119875119886

1199041198891199041015840 denotes the one-

step transition probability from state 119904119889to state 1199041015840 under

action 119886 Let 119904119906and 120591

119906denote the system state and time

respectively at the 119906th decision-making epoch It is easy toshow that119875 119904119906+1

= 119883 120591119906+1

minus 120591119906le 119905 | 119904

0 1199041 119904

119906 1205910 1205911 120591

119906

= 119875 119904119906+1

= 119883 120591119906+1

minus 120591119906le 119905 | 119904

119906 120591119906

(18)

where 120591119906+1

minus 120591119906is the sojourn time at state 119904

119906 That is

the decision process associated with (119904 120591) is a Semi-MarkovDecision Process with particular transition probability andsojourn times The terminal state of an episode is

119904119890= [1198790

119894119890(1 le 119894 le 119898) 119879

119894119890(1 le 119894 le 119898) 119905

119894119890(1 le 119894 le 119898)

119889119895119890(1 le 119895 le 119899) 0 (1 le 119894 le 119898)]

(19)

Journal of Industrial Engineering 5

23 Action Prior domain knowledge can be utilized to fullyexploit the agentrsquos learning ability Apparently an optimalschedule must be nonidle (ie any machine has no idle timeduring the whole schedule) It may happen that more thanone machine are free at the same decision-making epochAn action determines which lot to be processed on whichmachine In the following we define seven actions usingheuristic algorithms

Action 1 Select jobs by WSPT heuristics as follows

Algorithm 1 WSPT heuristics

Step 1 Let SM denote the set of free machines at a decision-making epoch

Step 2 Choose machine 119896 to process job type 119902 with (119896 119902) =argmin

(119894119895)(1199041198790

119894119895+ 119901119894119895)119908119895| 1 le 119895 le 119899 119894 isin 119872

119895and 119894 isin SM

Step 3 Remove 119896 fromSM If SM = Φ go to Step 2 otherwisethe algorithm halts

Action 2 Select jobs byMWSPT (modifiedWSPT) heuristicsas follows

Algorithm 2 MWSPT heuristics

Step 1 Define SM as Step 1 in Algorithm 1 and let SJ denotethe set of job types whose TPVs have not been satisfied at adecision-making epoch that is SJ = 119895 | 119884

119895lt 119863119895 1 le 119895 le 119899

If SJ = Φ go to Step 4

Step 2 Choose job type 119902 to process on machine 119896 with(119896 119902) = argmin

(119894119895)(1199041198790

119894119895+ 119901119894119895)119908119895| 119895 isin SJ 119894 isin 119872

119895and

119894 isin SM

Step 3 Remove 119896 from SM Set 119884119902= 119884119902+ 119871 and update SJ If

SJ = Φ and SM = Φ go to Step 2 if SJ = Φ and SM = Φ go toStep 4 otherwise the algorithm halts

Step 4 Choose machine 119896 to process job type 119902 with (119896 119902) =argmin

(119894119895)(1199041198790

119894119895+ 119901119894119895)119908119895| 1 le 119895 le 119899 119894 isin 119872

119895and 119894 isin SM

Step 5 Remove 119896 fromSM If SM = Φ go to Step 4 otherwisethe algorithm halts

Action 3. Select jobs by the Ranking Algorithm (RA) as follows.

Algorithm 3 (Ranking Algorithm).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2. If SJ = Φ, go to Step 5.

Step 2. For each job type $j$ $(j \in \mathrm{SJ})$, sort the machines in increasing order of $(s_{V_i,j} + p_{ij})$ $(1 \le i \le m)$, where $V_i$ is defined as follows:

$$V_i = \begin{cases} T_i, & \text{if machine } i \text{ is busy}, \\ T^0_i, & \text{if machine } i \text{ is free}, \end{cases} \quad (1 \le i \le m). \quad (20)$$

Let $g_{ij}$ $(1 \le g_{ij} \le m)$ denote the order of machine $i$ $(1 \le i \le m)$ for job type $j$ $(1 \le j \le n)$.

Step 3. Choose job $q$ to process on machine $k$ with $(k,q) = \arg\min_{(i,j)}\{g_{ij} \mid j \in \mathrm{SJ},\ i \in M_j \text{ and } i \in \mathrm{SM}\}$. If there exist two or more machine-job combinations (say, machine-job combinations $(i_1, j_1), (i_2, j_2), \ldots, (i_h, j_h)$) with the same minimal order, that is, $(i_e, j_e) = \arg\min_{(i,j)}\{g_{ij} \mid j \in \mathrm{SJ},\ i \in M_j \text{ and } i \in \mathrm{SM}\}$ holds for every $e$ $(1 \le e \le h)$, then choose job type $j_e$ to process on machine $i_e$ with $(i_e, j_e) = \arg\min\{(s_{V_{i_e},j_e} + p_{i_e j_e})/w_{j_e} \mid 1 \le e \le h\}$.

Step 4. Remove $k$ or $i_e$ from SM. Set $Y_q = Y_q + L$ or $Y_{j_e} = Y_{j_e} + L$ and update SJ. If SJ ≠ Φ and SM ≠ Φ, go to Step 3; if SJ = Φ and SM ≠ Φ, go to Step 5; otherwise, the algorithm halts.

Step 5. Choose job $q$ to process on machine $k$ with $(k,q) = \arg\min_{(i,j)}\{g_{ij} \mid 1 \le j \le n,\ i \in M_j \text{ and } i \in \mathrm{SM}\}$. If there exist two or more machine-job combinations (say, machine-job combinations $(i_1, j_1), (i_2, j_2), \ldots, (i_h, j_h)$) with the same minimal order, choose job type $j_e$ to process on machine $i_e$ with $(i_e, j_e) = \arg\min\{(s_{V_{i_e},j_e} + p_{i_e j_e})/w_{j_e} \mid 1 \le e \le h\}$.

Step 6. Remove $k$ or $i_e$ from SM. If SM ≠ Φ, go to Step 5; otherwise, the algorithm halts.
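The selection step of the Ranking Algorithm can be sketched in Python as follows; eff[i][j] stands in for $s_{V_i,j} + p_{ij}$ under the current machine status, and candidates is SJ (Step 3) or the full set of job types (Step 5). The names are illustrative assumptions.

def ra_pick(free_machines, candidates, eff, weight, qualified):
    m = len(eff)
    rank = {}                               # g_ij for the candidate job types
    for j in candidates:
        order = sorted(range(m), key=lambda i: eff[i][j])
        for pos, i in enumerate(order, start=1):
            rank[(i, j)] = pos
    feasible = [(i, j) for (i, j) in rank
                if i in free_machines and i in qualified[j]]
    if not feasible:
        return None
    best_rank = min(rank[p] for p in feasible)
    tied = [p for p in feasible if rank[p] == best_rank]
    # tie-break by (setup + processing time) / weight, as in Steps 3 and 5
    return min(tied, key=lambda p: eff[p[0]][p[1]] / weight[p[1]])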

Action 4. Select jobs by the LFM-MWSPT heuristics as follows.

Algorithm 4 (LFM-MWSPT heuristics).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2.

Step 2. Select a free machine (say, $k$) from SM by the LFM (Least Flexible Machine, see [33]) rule and choose a job type to process on machine $k$ following the MWSPT heuristics.

Step 3. Remove $k$ from SM. If SM ≠ Φ, go to Step 2; otherwise, the algorithm halts.

Action 5. Select jobs by the LFM-RA heuristics as follows.

Algorithm 5 (LFM-RA heuristics).

Step 1. Define SM and SJ as in Step 1 of Algorithm 2.

Step 2. Select a free machine (say, $k$) from SM by the LFM rule and choose a job type to process on machine $k$ following the Ranking Algorithm.

Step 3. Remove $k$ from SM. If SM ≠ Φ, go to Step 2; otherwise, the algorithm halts.
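Assuming the usual reading of the LFM rule in [33], namely that the least flexible machine is the one qualified for the fewest job types, a minimal sketch of the machine-selection step shared by Algorithms 4 and 5 is:

def lfm_pick(free_machines, qualified, num_job_types):
    # Least Flexible Machine (assumed definition): among the free machines,
    # pick the one qualified for the fewest job types, so inflexible machines
    # are committed first and flexible ones stay available for later choices.
    def flexibility(i):
        return sum(1 for j in range(num_job_types) if i in qualified[j])
    return min(free_machines, key=flexibility)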

Action 6. Each free machine selects the same job type as the latest one it processed.

Action 7. Select no job.

At the start of a schedule horizon, the system is at the initial state $s_0$. If there are free machines, they select jobs to process by taking one of Actions 1–6; otherwise, Action 7 is chosen. Afterwards, when any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new state $s_u$. The agent selects an action at this decision-making epoch, and the system state transfers into an interim state. When again any machine completes processing a lot or any machine's normal production time is used up, the system transfers into the next decision-making state $s_{u+1}$, and the agent receives reward $r_{u+1}$, which is computed from $s_u$ and the sojourn time between the two transitions into $s_u$ and $s_{u+1}$ (as shown in Section 2.4). This procedure is repeated until a terminal state is attained. An episode is a trajectory from the initial state to a terminal state of a schedule horizon. Action 7 is available only at decision-making states at which all machines are busy.

2.4. Reward Function. A reward function follows several principles. It indicates the instant impact of an action on the schedule; that is, it links the action with an immediate reward. Moreover, the accumulated reward indicates the objective function value; that is, the agent receives a large total reward for a small objective function value.

Definition 6 (reward function). Let $K$ denote the number of decision-making epochs during an episode, $t_u$ $(0 \le u < K)$ the time at the $u$th decision-making epoch, $T_{iu}$ $(1 \le i \le m,\ 1 \le u \le K)$ the job type of the lot which machine $i$ processes during the time interval $(t_{u-1}, t_u]$, $T^0_{iu}$ the job type of the lot which precedes the lot machine $i$ processes during the time interval $(t_{u-1}, t_u]$, and $Y_j(t_u)$ the processed volume of job type $j$ by time $t_u$. It follows that

$$Y_j(t_u) - Y_j(t_{u-1}) = \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}}, \quad (21)$$

where $\delta(i,j)$ is an indicator function defined as

$$\delta(i,j) = \begin{cases} 1, & T_{iu} = j, \\ 0, & T_{iu} \ne j. \end{cases} \quad (22)$$

Let $r_u$ $(u = 1, 2, \ldots, K)$ denote the reward function at the $u$th decision-making epoch. $r_u$ is defined as

$$r_u = \sum_{j=1}^{n} \left\{ \min\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}},\ [D_j - Y_j(t_{u-1})]^+ \right\} w_j + \max\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}} - [D_j - Y_j(t_{u-1})]^+,\ 0 \right\} \times \frac{w_j}{M} \right\}. \quad (23)$$
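To make (23) concrete, the following Python sketch evaluates the reward of one transition; delta_t stands for $t_u - t_{u-1}$, job_on[i] for $T_{iu}$, prev_on[i] for $T^0_{iu}$, and target/done for $D_j$ and $Y_j(t_{u-1})$. All names, and the flat setup/proc tables, are illustrative assumptions rather than the authors' implementation.

def reward(delta_t, job_on, prev_on, setup, proc, weight, target, done, L, M):
    r = 0.0
    for j in range(len(weight)):
        # volume of job type j produced on all machines during (t_{u-1}, t_u]
        vol = sum(delta_t * L / (setup[prev_on[i]][job_on[i]] + proc[i][job_on[i]])
                  for i in range(len(job_on)) if job_on[i] == j)
        shortfall = max(target[j] - done[j], 0.0)       # [D_j - Y_j(t_{u-1})]^+
        r += min(vol, shortfall) * weight[j]            # volume that meets the TPV
        r += max(vol - shortfall, 0.0) * weight[j] / M  # excess volume, discounted by M
    return r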

The reward function has the following property.

Theorem 7. Maximization of the total reward $R$ in an episode is equivalent to minimization of objective function (5).

Proof. The total reward in an episode is

$$\begin{aligned} R &= \sum_{u=1}^{K} r_u \\ &= \sum_{u=1}^{K} \sum_{j=1}^{n} \left\{ \min\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}},\ [D_j - Y_j(t_{u-1})]^+ \right\} w_j + \max\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}} - [D_j - Y_j(t_{u-1})]^+,\ 0 \right\} \times \frac{w_j}{M} \right\} \\ &= \sum_{j=1}^{n} \sum_{u=1}^{K} \left\{ \min\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}},\ [D_j - Y_j(t_{u-1})]^+ \right\} w_j + \max\left\{ \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}} - [D_j - Y_j(t_{u-1})]^+,\ 0 \right\} \times \frac{w_j}{M} \right\}. \end{aligned} \quad (24)$$

It is easy to show that

$$Y_j = \sum_{u=1}^{K} \sum_{i=1}^{m} \frac{(t_u - t_{u-1})\,\delta(i,j)\,L}{s_{T^0_{iu},T_{iu}} + p_{i,T_{iu}}}. \quad (25)$$

Since $Y_j(t_u)$ is nondecreasing in $u$, the per-epoch production of job type $j$ in (24) is weighted by $w_j$ until the target $D_j$ is reached and by $w_j/M$ thereafter, so combining (24) and (25) gives

$$\begin{aligned} R &= \sum_{j=1}^{n} \left[ w_j \min\{D_j, Y_j\} + \frac{w_j}{M} \max\{0, Y_j - D_j\} \right] \\ &= \sum_{j \in \Omega_1} \left[ w_j D_j + \frac{w_j}{M}(Y_j - D_j) \right] + \sum_{j \in \Omega_2} w_j Y_j \\ &= \sum_{j=1}^{n} w_j D_j - \left\{ \sum_{j \in \Omega_1} \left[ -\frac{w_j}{M}(Y_j - D_j) \right] + \sum_{j \in \Omega_2} w_j (D_j - Y_j) \right\} \\ &= \sum_{j=1}^{n} w_j D_j - \sum_{j=1}^{n} \left[ w_j (D_j - Y_j)^+ - \frac{w_j}{M}(Y_j - D_j)^+ \right], \end{aligned} \quad (26)$$

where $\Omega_1 = \{j \mid Y_j > D_j\}$ and $\Omega_2 = \{j \mid Y_j \le D_j\}$. Since $\sum_{j=1}^{n} w_j D_j$ is a constant, it follows that

$$\max R \iff \min \sum_{j=1}^{n} \left[ w_j (D_j - Y_j)^+ - \frac{w_j}{M}(Y_j - D_j)^+ \right]. \quad (27)$$


3. The Reinforcement Learning Algorithm

The chip attach scheduling problem is converted into an RL problem with a terminal state in Section 2. To apply $Q$-learning to this RL problem, another issue arises: how to tailor the $Q$-learning algorithm to this particular context. Since some state variables are continuous, the state space is infinite. This RL system is not in tabular form, and it is impossible to maintain $Q$-values for all state-action pairs. Thus, we use a linear function with the gradient-descent method to approximate the $Q$-value function. $Q$-values are represented as a linear combination of a set of basis functions $\Phi_k(s)$ $(1 \le k \le 4m + n)$, as shown in the next formula:

$$Q(s, a) = \sum_{k=1}^{4m+n} c^a_k \Phi_k(s), \quad (28)$$

where $c^a_k$ $(1 \le a \le 6,\ 1 \le k \le 4m + n)$ are the weights of the basis functions. Each state variable corresponds to a basis function. The following basis functions are defined to normalize the state variables:

$$\Phi_k(s) = \begin{cases} \dfrac{T^0_k}{n}, & 1 \le k \le m, \\[4pt] \dfrac{T_{k-m}}{n}, & m + 1 \le k \le 2m, \\[4pt] \dfrac{t_{k-2m}}{\max\{s_{j_1 j_2} + p_{j_2} \mid 1 \le j_1 \le n,\ 1 \le j_2 \le n\}}, & 2m + 1 \le k \le 3m, \\[4pt] \dfrac{d_{k-3m}}{D_{k-3m}}, & 3m + 1 \le k \le 3m + n, \\[4pt] \dfrac{e_{k-3m-n}}{\mathrm{TH}}, & 3m + n + 1 \le k \le 4m + n. \end{cases} \quad (29)$$

Let $C^a$ denote the vector of weights of the basis functions as follows:

$$C^a = (c^a_1, c^a_2, \ldots, c^a_{4m+n})^T. \quad (30)$$
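A minimal Python sketch of the linear approximation in (28)-(30) follows, with one weight vector per action and a dot product giving $Q(s,a)$; the feature vector is assumed to be the normalized state variables of (29), produced elsewhere. The class and its interface are illustrative.

import numpy as np

class LinearQ:
    def __init__(self, num_actions, num_features):
        self.C = np.zeros((num_actions, num_features))   # weights c^a_k

    def value(self, phi_s, a):
        """Q(s, a) = C^a . Phi(s) for a feature vector phi_s."""
        return float(self.C[a] @ phi_s)

    def best(self, phi_s):
        """Greedy action and its value, max_a Q(s, a)."""
        q = self.C @ phi_s
        return int(np.argmax(q)), float(np.max(q))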

The RL algorithm is presented as Algorithm 8, where $\alpha$ is the learning rate, $\gamma$ is a discount factor, $E(a)$ is the vector of eligibility traces for action $a$, $\delta(a)$ is an error variable for action $a$, and $\lambda$ is a factor for updating the eligibility traces.

Algorithm 8 ($Q$-learning with linear gradient-descent function approximation for chip attach scheduling).

Initialize $C^a$ and $E(a)$ randomly. Set parameters $\alpha$, $\gamma$, and $\lambda$. Let num_episode denote the number of episodes having been run. Set num_episode = 0.

While num_episode < MAX_EPISODE do
  Set the current decision-making state $s \leftarrow s_0$.
  While at least one of the state variables $e_i$ $(1 \le i \le m)$ is larger than zero do
    Select action $a$ for state $s$ by the $\varepsilon$-greedy policy. Implement action $a$. Determine the next event triggering a state transition and the sojourn time. Once any machine completes processing a lot or any machine's normal production time is completely scheduled, the system transfers into a new decision-making state $s'$ ($e'_i$ $(1 \le i \le m)$ is a component of $s'$).
    Compute the reward $r^a_{ss'}$.
    Update the vector of weights in the approximate $Q$-value function of action $a$:
    $$\delta(a) \leftarrow r^a_{ss'} + \gamma \max_{a'} Q(s', a') - Q(s, a),$$
    $$E(a) \leftarrow \lambda E(a) + \nabla_{C^a} Q(s, a),$$
    $$C^a \leftarrow C^a + \alpha\,\delta(a)\,E(a). \quad (31)$$
    Set $s \leftarrow s'$.
  If $e_i = 0$ holds for all $i$ $(1 \le i \le m)$, set num_episode = num_episode + 1.
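The episode loop and the update (31) can be sketched as follows, building on the hypothetical LinearQ class above. The environment object env, with reset() and step(a) returning the next feature vector, the reward, and a done flag, is an assumed interface, not the authors' simulator; for $Q(s,a) = C^a \cdot \Phi(s)$ the gradient $\nabla_{C^a} Q(s,a)$ is simply $\Phi(s)$.

import numpy as np

def q_learning(env, q, episodes, alpha=0.05, gamma=0.95, lam=0.7,
               epsilon=0.1, rng=np.random.default_rng(0)):
    num_actions, _ = q.C.shape
    for _ in range(episodes):
        E = np.zeros_like(q.C)      # eligibility traces E(a), one row per action
        phi = env.reset()           # features of the initial state s_0
        done = False
        while not done:
            # epsilon-greedy selection over the action heuristics
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a, _ = q.best(phi)
            phi_next, r, done = env.step(a)
            _, q_next = q.best(phi_next)
            delta = r + gamma * q_next - q.value(phi, a)   # TD error delta(a)
            E[a] = lam * E[a] + phi                        # grad of Q(s,a) w.r.t. C^a
            q.C[a] += alpha * delta * E[a]                 # gradient-descent update
            phi = phi_next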

4. Experiment Results

In the past, the company used a manual process to conduct chip attach scheduling. A heuristic algorithm called Largest Weight First (LWF) was used, as follows.

Algorithm 9 (Largest Weight First (LWF) heuristics). Initialize SM with the set of all machines (i.e., SM $= \{i \mid 1 \le i \le m\}$) and define SJ as in Step 1 of Algorithm 2. Initialize $e_i$ $(1 \le i \le m)$ with each machine's normal production time. Set $Y_j = I_j$, where $I_j$ is the initial production volume of job type $j$.

Step 1. Schedule the job types in decreasing order of weights in order to meet their TPVs:

While SJ ≠ Φ and SM ≠ Φ do
  Choose job $q$ with $q = \arg\max\{w_j \mid j \in \mathrm{SJ}\}$.
  While SM $\cap\, M_q$ ≠ Φ and $Y_q < D_q$ do
    Choose machine $i$ to process job $q$ with $i = \arg\min\{p_{kq}/w_q \mid k \in \mathrm{SM} \cap M_q\}$.
    If $e_i - s_{T^0_i,q} < (D_q - Y_q)\,p_{iq}/L$, then set $Y_q \leftarrow Y_q + L(e_i - s_{T^0_i,q})/p_{iq}$, $e_i = 0$, and remove $i$ from SM; else set $e_i \leftarrow e_i - s_{T^0_i,q} - (D_q - Y_q)\,p_{iq}/L$ and $Y_q = D_q$.
    $T^0_i = q$.

Step 2. Allocate the excess production capacity:

If SM ≠ Φ, then for each machine $i$ $(i \in \mathrm{SM})$: choose job $j$ with $j = \arg\max\{(e_i - s_{T^0_i,q})\,w_q/p_{iq} \mid 1 \le q \le n\}$; set $Y_j \leftarrow Y_j + L(e_i - s_{T^0_i,j})/p_{ij}$, $e_i = 0$.


Table 1: Comparison of objective function values using heuristics and $Q$-learning.

Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning
1 | 88867 | 78116 | 59758 | 87689 | 57582 | 42253 | −38613
2 | 13844 | 13586 | 110747 | 12669 | 10907 | 95926 | 76657
3 | 11901 | 10875 | 12409 | 10425 | 12190 | 83332 | 23775
4 | 83681 | 60797 | 39073 | 69405 | 45575 | 33920 | −44275
5 | 12938 | 12847 | 96960 | 10917 | 99827 | 89863 | 21414
6 | 70840 | 55692 | 51108 | 66213 | 51041 | 16930 | −54467
7 | 12090 | 10060 | 95399 | 10933 | 90754 | 76422 | 27374
8 | 10242 | 10780 | 11656 | 10333 | 10762 | 93663 | 11840
9 | 94606 | 87914 | 81763 | 88812 | 80331 | 60164 | 33036
10 | 90803 | 88164 | 90773 | 87926 | 8856293 | 56307 | 22798
11 | 11113 | 88287 | 82916 | 97882 | 85605 | 60160 | 16493
12 | 10029 | 89005 | 86692 | 95836 | 78342 | 60744 | −38617
Average | 10419 | 94123 | 86321 | 95547 | 84685 | 64147 | 12233

Table 2: Comparison of unsatisfied TPV index using heuristics and $Q$-learning.

Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning
1 | 0.1179 | 0.1025 | 0.0789 | 0.1170 | 0.0754 | 0.0554 | 0.0081
2 | 0.1651 | 0.1611 | 0.1497 | 0.1499 | 0.1482 | 0.1421 | 0.0137
3 | 0.1421 | 0.1289 | 0.1475 | 0.1227 | 0.1455 | 0.0987 | 0.0691
4 | 0.1540 | 0.1104 | 0.0716 | 0.1258 | 0.0854 | 0.0614 | 0.0088
5 | 0.1588 | 0.1564 | 0.1186 | 0.1303 | 0.1215 | 0.1094 | 0.0571
6 | 0.1053 | 0.0819 | 0.0757 | 0.1006 | 0.0762 | 0.0248 | 0.0137
7 | 0.1462 | 0.1209 | 0.1150 | 0.1292 | 0.1082 | 0.0917 | 0.0582
8 | 0.1266 | 0.1324 | 0.1437 | 0.1272 | 0.1309 | 0.1150 | 0.0381
9 | 0.1315 | 0.1211 | 0.1133 | 0.1249 | 0.1127 | 0.0828 | 0.0815
10 | 0.1154 | 0.1112 | 0.1151 | 0.1105 | 0.1110 | 0.0709 | 0.0536
11 | 0.1544 | 0.1215 | 0.1146 | 0.1387 | 0.1204 | 0.0827 | 0.0690
12 | 0.1262 | 0.1112 | 0.1088 | 0.1194 | 0.1002 | 0.0758 | 0.0118
Average | 0.1370 | 0.1216 | 0.1112 | 0.1247 | 0.1113 | 0.0842 | 0.0402

The chip attach station consists of 10 machines and normally processes more than ten job types. We selected 12 sets of industrial data for experiments comparing the $Q$-learning algorithm (Algorithm 8) and the six heuristics (Algorithms 1–5 and 9): WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. For each dataset, $Q$-learning repeatedly solves the scheduling problem 1000 times and selects the best schedule among the 1000 solutions. Table 1 shows the objective function values of all datasets using the seven algorithms. Individually, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA obtains larger objective function values than LWF for every dataset. Nevertheless, taking WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA as actions, the $Q$-learning algorithm achieves an objective function value much smaller than LWF for each dataset. In Tables 1–4, the bottom row presents the average value over all datasets. As shown in Table 1, the average objective function value of $Q$-learning is only 12233, less than that of LWF, 64147, by a large margin of 80.92%.

Besides the objective function value, we propose two indices, the unsatisfied TPV index and the unsatisfied job type index, to measure the performance of the seven algorithms. The unsatisfied TPV index (UPI) is defined in formula (32) and indicates the weighted proportion of unfinished Target Production Volume. Table 2 compares the UPIs of all datasets using the seven algorithms. Again, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA individually obtains a larger UPI than LWF for each dataset. However, the $Q$-learning algorithm achieves a smaller UPI than LWF for each dataset. The average UPI of $Q$-learning is only 0.0402, less than that of LWF, 0.0842, by a large margin of 52.20%. Let $J$ denote the set $\{j \mid 1 \le j \le n,\ Y_j < D_j\}$. The unsatisfied job type index (UJTI) is defined in formula (33) and indicates the weighted proportion of the job types whose TPVs are not completely satisfied. Table 3 compares the UJTIs of all datasets using the seven algorithms. With most datasets, the $Q$-learning algorithm achieves smaller UJTIs than LWF. The average UJTI of $Q$-learning is 0.0802, which is remarkably less than that of LWF, 0.1176, by 31.81%. Consider

$$\mathrm{UPI} = \frac{\sum_{j=1}^{n} w_j (D_j - Y_j)^+}{\sum_{j=1}^{n} w_j D_j}, \quad (32)$$

$$\mathrm{UJTI} = \frac{\sum_{j \in J} w_j}{\sum_{j=1}^{n} w_j}. \quad (33)$$
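The two indices can be computed directly from the weights, targets, and produced volumes, as in the following Python sketch; the argument names are illustrative.

def upi(weight, target, done):
    """Unsatisfied TPV index (32): weighted share of unfinished TPV."""
    unmet = sum(w * max(d - y, 0.0) for w, d, y in zip(weight, target, done))
    total = sum(w * d for w, d in zip(weight, target))
    return unmet / total

def ujti(weight, target, done):
    """Unsatisfied job type index (33): weighted share of job types with unmet TPV."""
    unmet = sum(w for w, d, y in zip(weight, target, done) if y < d)
    return unmet / sum(weight)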


Table 3: Comparison of unsatisfied job type index using heuristics and $Q$-learning.

Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning
1 | 0.1290 | 0.2615 | 0.1667 | 0.1650 | 0.1793 | 0.0921 | 0.0678
2 | 0.1924 | 0.2953 | 0.2302 | 0.2650 | 0.2097 | 0.1320 | 0.0278
3 | 0.2250 | 0.3287 | 0.2126 | 0.2564 | 0.2278 | 0.0921 | 0.0921
4 | 0.0781 | 0.2987 | 0.0278 | 0.1290 | 0.0828 | 0.0278 | 0.0278
5 | 0.2924 | 0.3169 | 0.3002 | 0.2224 | 0.2632 | 0.2055 | 0.0571
6 | 0.2290 | 0.3062 | 0.1817 | 0.2290 | 0.1632 | 0.0571 | 0.0278
7 | 0.2650 | 0.2987 | 0.2160 | 0.2529 | 0.2075 | 0.1320 | 0.1647
8 | 0.1924 | 0.3225 | 0.2067 | 0.1813 | 0.1696 | 0.1320 | 0.1320
9 | 0.2221 | 0.3250 | 0.1403 | 0.1892 | 0.2073 | 0.1320 | 0.0749
10 | 0.2621 | 0.3304 | 0.2667 | 0.2859 | 0.2708 | 0.1781 | 0.0678
11 | 0.2029 | 0.2896 | 0.2578 | 0.2194 | 0.2220 | 0.1381 | 0.1542
12 | 0.1924 | 0.3271 | 0.2302 | 0.1838 | 0.2182 | 0.0921 | 0.0678
Average | 0.2069 | 0.3084 | 0.2031 | 0.2149 | 0.2018 | 0.1176 | 0.0802

Table 4: Comparison of the total setup time using heuristics and $Q$-learning.

Dataset no. | WSPT | MWSPT | RA | LFM-MWSPT | LFM-RA | LWF | Q-learning
1 | 0.8133 | 0.8497 | 0.3283 | 0.7231 | 0.3884 | 0.3895 | 1.0000
2 | 0.8333 | 1.3000 | 0.4358 | 0.7564 | 0.5263 | 0.4094 | 1.0000
3 | 0.8712 | 1.1633 | 0.4207 | 0.6361 | 0.4937 | 0.4298 | 1.0000
4 | 1.1280 | 0.7123 | 0.4629 | 0.8516 | 0.5139 | 0.4318 | 1.0000
5 | 0.9629 | 1.3121 | 0.4179 | 0.8597 | 0.5115 | 0.3873 | 1.0000
6 | 0.7489 | 1.0393 | 0.4104 | 0.7489 | 0.4542 | 0.4074 | 1.0000
7 | 1.7868 | 2.2182 | 0.8223 | 1.4069 | 1.0125 | 0.4174 | 1.0000
8 | 0.6456 | 0.8508 | 0.4055 | 0.6694 | 0.5053 | 0.3795 | 1.0000
9 | 0.9245 | 0.9946 | 0.5013 | 0.7821 | 0.6694 | 0.4163 | 1.0000
10 | 1.1025 | 1.7875 | 0.6703 | 1.0371 | 0.9079 | 0.4894 | 1.0000
11 | 0.9973 | 1.3655 | 0.3994 | 0.9686 | 0.5129 | 0.4066 | 1.0000
12 | 0.7904 | 1.1111 | 0.4419 | 0.6195 | 0.5081 | 0.4258 | 1.0000
Average | 0.9671 | 1.2254 | 0.4764 | 0.8383 | 0.5837 | 0.4158 | 1.0000

Table 4 shows the total setup time of all datasets using the seven algorithms. For reasons of commercial confidentiality, we used normalized data, dividing the total setup time of each dataset by the result of that dataset using $Q$-learning. Thus, the total setup times of all datasets by $Q$-learning are converted into one, and the data of the six heuristics are adjusted accordingly. The $Q$-learning algorithm requires more than twice the setup time of LWF for each dataset. The average accumulated setup time of LWF is only 41.58% of that of $Q$-learning.

The previous experimental results reveal that, for the whole scheduling task, any individual one of the five action heuristics (WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA) for $Q$-learning performs worse than the LWF heuristics. However, $Q$-learning greatly outperforms LWF in terms of the three performance measures: the objective function value, UPI, and UJTI. This demonstrates that some action heuristics provide better actions than the LWF heuristics at some states. By repeatedly solving the scheduling problem, the $Q$-learning system perceives the insights of the scheduling problem automatically and adjusts its actions towards the optimal ones when facing different system states. The actions at all states form a new optimized policy, which differs from any policy following an individual action heuristic or the LWF heuristics. That is, $Q$-learning incorporates the merits of the five alternative heuristics, uses them to schedule jobs flexibly, and obtains results much better than any individual action heuristic and the LWF heuristics. In the experiments, $Q$-learning achieves high-quality schedules at the cost of inducing more setup time. In other words, $Q$-learning utilizes the machines more efficiently by increasing conversions among a variety of job types.

5. Conclusions

We apply $Q$-learning to study lot-based chip attach scheduling in back-end semiconductor manufacturing. To apply reinforcement learning to scheduling, the critical issue is the conversion of the scheduling problem into an RL problem. We convert the chip attach scheduling problem into a particular SMDP problem by a Markovian state representation. Five heuristic algorithms, WSPT, MWSPT, RA, LFM-MWSPT,


and LFM-RA, are selected as actions so as to utilize prior domain knowledge. The reward function is directly related to the scheduling objective function, and we prove that maximizing the accumulated reward is equivalent to minimizing the objective function. Gradient-descent linear function approximation is combined with the $Q$-learning algorithm.

$Q$-learning exploits the inner structure of the scheduling problem by solving it repeatedly. It learns a domain-specific policy from the experienced episodes through interaction and then applies it to later episodes. We define two indices, the unsatisfied TPV index and the unsatisfied job type index, together with the objective function value, to measure the performance of $Q$-learning and the heuristics. Experiments with industrial datasets show that $Q$-learning clearly outperforms the six heuristic algorithms WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. Compared with LWF, $Q$-learning reduces the three performance measures, respectively, by average levels of 52.20%, 31.81%, and 80.92%. With $Q$-learning, chip attach scheduling is optimized through increasing effective job type conversions.

Disclosure

Given the sensitive and proprietary nature of the semiconductor manufacturing environment, we use normalized data in this paper.

Acknowledgments

This project is supported by the National Natural Science Foundation of China (Grant no. 71201026), the Science and Technological Program for Dongguan's Higher Education, Science and Research, and Health Care Institutions (no. 2011108102017), and the Humanities and Social Sciences Program of the Ministry of Education of China (no. 10YJC630405).

References

[1] M. X. Weng, J. Lu, and H. Ren, "Unrelated parallel machine scheduling with setup consideration and a total weighted completion time objective," International Journal of Production Economics, vol. 70, no. 3, pp. 215–226, 2001.

[2] M. Gairing, B. Monien, and A. Woclaw, "A faster combinatorial approximation algorithm for scheduling unrelated parallel machines," in Automata, Languages and Programming, vol. 3580 of Lecture Notes in Computer Science, pp. 828–839, 2005.

[3] G. Mosheiov, "Parallel machine scheduling with a learning effect," Journal of the Operational Research Society, vol. 52, no. 10, pp. 1–5, 2001.

[4] G. Mosheiov and J. B. Sidney, "Scheduling with general job-dependent learning curves," European Journal of Operational Research, vol. 147, no. 3, pp. 665–670, 2003.

[5] L. Yu, H. M. Shih, M. Pfund, W. M. Carlyle, and J. W. Fowler, "Scheduling of unrelated parallel machines: an application to PWB manufacturing," IIE Transactions, vol. 34, no. 11, pp. 921–931, 2002.

[6] K. R. Baker and J. W. M. Bertrand, "A dynamic priority rule for scheduling against due-dates," Journal of Operations Management, vol. 3, no. 1, pp. 37–42, 1982.

[7] J. J. Kanet and X. Li, "A weighted modified due date rule for sequencing to minimize weighted tardiness," Journal of Scheduling, vol. 7, no. 4, pp. 261–276, 2004.

[8] R. V. Rachamadugu and T. E. Morton, "Myopic heuristics for the single machine weighted tardiness problem," Working Paper 28-81-82, Graduate School of Industrial Administration, Carnegie-Mellon University, 1981.

[9] A. Volgenant and E. Teerhuis, "Improved heuristics for the n-job single-machine weighted tardiness problem," Computers and Operations Research, vol. 26, no. 1, pp. 35–44, 1999.

[10] D. C. Carroll, Heuristic sequencing of jobs with single and multiple components [Ph.D. thesis], Sloan School of Management, MIT, 1965.

[11] A. P. J. Vepsalainen and T. E. Morton, "Priority rules for job shops with weighted tardiness costs," Management Science, vol. 33, no. 8, pp. 1035–1047, 1987.

[12] R. S. Russell, E. M. Dar-El, and B. W. Taylor, "A comparative analysis of the COVERT job sequencing rule using various shop performance measures," International Journal of Production Research, vol. 25, no. 10, pp. 1523–1540, 1987.

[13] J. Bank and F. Werner, "Heuristic algorithms for unrelated parallel machine scheduling with a common due date, release dates, and linear earliness and tardiness penalties," Mathematical and Computer Modelling, vol. 33, no. 4-5, pp. 363–383, 2001.

[14] C. F. Liaw, Y. K. Lin, C. Y. Cheng, and M. Chen, "Scheduling unrelated parallel machines to minimize total weighted tardiness," Computers and Operations Research, vol. 30, no. 12, pp. 1777–1789, 2003.

[15] D. W. Kim, D. G. Na, and F. F. Chen, "Unrelated parallel machine scheduling with setup times and a total weighted tardiness objective," Robotics and Computer-Integrated Manufacturing, vol. 19, no. 1-2, pp. 173–181, 2003.

[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.

[17] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], Cambridge University, 1989.

[18] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[19] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms," Neural Computation, vol. 6, pp. 1185–1201, 1994.

[20] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.

[21] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass, USA, 1996.

[22] S. Riedmiller and M. Riedmiller, "A neural reinforcement learning approach to learn local dispatching policies in production scheduling," in Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[23] M. E. Aydin and E. Oztemel, "Dynamic job-shop scheduling using reinforcement learning agents," Robotics and Autonomous Systems, vol. 33, no. 2, pp. 169–178, 2000.

[24] J. Hong and V. V. Prabhu, "Distributed reinforcement learning control for batch sequencing and sizing in just-in-time manufacturing systems," Applied Intelligence, vol. 20, no. 1, pp. 71–87, 2004.

[25] Y. C. Wang and J. M. Usher, "Application of reinforcement learning for agent-based production scheduling," Engineering Applications of Artificial Intelligence, vol. 18, no. 1, pp. 73–82, 2005.

[26] B. C. Csaji, L. Monostori, and B. Kadar, "Reinforcement learning in a distributed market-based production control system," Advanced Engineering Informatics, vol. 20, no. 3, pp. 279–288, 2006.

[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, no. 3, pp. 808–818, 2007.

[28] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning: utilizing OLAP mining in the learning process," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 35, no. 4, pp. 582–590, 2005.

[29] C. D. Paternina-Arboleda and T. K. Das, "A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem," Simulation Modelling Practice and Theory, vol. 13, no. 5, pp. 389–406, 2005.

[30] C. E. Mariano-Romero, V. H. Alcocer-Yamanaka, and E. F. Morales, "Multi-objective optimization of water-using systems," European Journal of Operational Research, vol. 181, no. 3, pp. 1691–1707, 2007.

[31] D. Vengerov, "A reinforcement learning framework for utility-based scheduling in resource-constrained systems," Future Generation Computer Systems, vol. 25, no. 7, pp. 728–736, 2009.

[32] K. Iwamura, N. Mayumi, Y. Tanimizu, and N. Sugimura, "A study on real-time scheduling for holonic manufacturing systems - determination of utility values based on multi-agent reinforcement learning," in Proceedings of the 4th International Conference on Industrial Applications of Holonic and Multi-Agent Systems, pp. 135–144, Linz, Austria, 2009.

[33] M. Pinedo, Scheduling: Theory, Algorithms, and Systems, Prentice Hall, Englewood Cliffs, NJ, USA, 2nd edition, 2002.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 5: Research Article Chip Attach Scheduling in Semiconductor ...downloads.hindawi.com/journals/jie/2013/295604.pdf · Chip Attach Scheduling in Semiconductor Assembly ... and PEVI. IS

Journal of Industrial Engineering 5

23 Action Prior domain knowledge can be utilized to fullyexploit the agentrsquos learning ability Apparently an optimalschedule must be nonidle (ie any machine has no idle timeduring the whole schedule) It may happen that more thanone machine are free at the same decision-making epochAn action determines which lot to be processed on whichmachine In the following we define seven actions usingheuristic algorithms

Action 1 Select jobs by WSPT heuristics as follows

Algorithm 1 WSPT heuristics

Step 1 Let SM denote the set of free machines at a decision-making epoch

Step 2 Choose machine 119896 to process job type 119902 with (119896 119902) =argmin

(119894119895)(1199041198790

119894119895+ 119901119894119895)119908119895| 1 le 119895 le 119899 119894 isin 119872

119895and 119894 isin SM

Step 3 Remove 119896 fromSM If SM = Φ go to Step 2 otherwisethe algorithm halts

Action 2 Select jobs byMWSPT (modifiedWSPT) heuristicsas follows

Algorithm 2 MWSPT heuristics

Step 1 Define SM as Step 1 in Algorithm 1 and let SJ denotethe set of job types whose TPVs have not been satisfied at adecision-making epoch that is SJ = 119895 | 119884

119895lt 119863119895 1 le 119895 le 119899

If SJ = Φ go to Step 4

Step 2 Choose job type 119902 to process on machine 119896 with(119896 119902) = argmin

(119894119895)(1199041198790

119894119895+ 119901119894119895)119908119895| 119895 isin SJ 119894 isin 119872

119895and

119894 isin SM

Step 3 Remove 119896 from SM Set 119884119902= 119884119902+ 119871 and update SJ If

SJ = Φ and SM = Φ go to Step 2 if SJ = Φ and SM = Φ go toStep 4 otherwise the algorithm halts

Step 4 Choose machine 119896 to process job type 119902 with (119896 119902) =argmin

(119894119895)(1199041198790

119894119895+ 119901119894119895)119908119895| 1 le 119895 le 119899 119894 isin 119872

119895and 119894 isin SM

Step 5 Remove 119896 fromSM If SM = Φ go to Step 4 otherwisethe algorithm halts

Action 3 Select jobs by Ranking Algorithm (RA) as follows

Algorithm 3 Ranking Algorithm

Step 1 Define SM and SJ as Step 1 in Algorithm 2 If SJ = Φgo to Step 5

Step 2 For each job type 119895 (119895 isin SJ) sort the machines inincreasing order of (119904

119881119894119895+119901119894119895) (1 le 119894 le 119898) where119881

119894is defined

as follows

119881119894=

119879119894 if machine 119894 is busy

1198790

119894 if machine 119894 is free

(1 le 119894 le 119898) (20)

Let 119892119894119895(1 le 119892

119894119895le 119898) denote the order of machine 119894 (1 le 119894 le

119898) for job type 119895 (1 le 119895 le 119899)

Step 3 Choose job 119902 to process on machine 119896 with (119896 119902) =argmin

(119894119895)119892119894119895| 119895 isin SJ 119894 isin 119872

119895and 119894 isin SM If there

exist two or more machine-job combinations (say machine-job combination (119894

1 1198951) (1198942 1198952) (119894

ℎ 119895ℎ)) with the same

minimal order that is (119894119890 119895119890) = argmin

(119894119895)119892119894119895| 119895 isin SJ 119894 isin

119872119895and 119894 isin SM holds for 119890 (1 le 119890 le ℎ) then choose job type

119895119890to process onmachine 119894

119890 with (119894

119890 119895119890) = argmin

(119894119895)(119904119881119894119890119895119890

+

119901119894119890119895119890

)119908119895119890

| 1 le 119890 le ℎ

Step 4 Remove 119896 or 119894119890fromSM Set119884

119902= 119884119902+119871 or119884

119895119890

= 119884119895119890

+119871

and update SJ If SJ = Φ and SM = Φ go to Step 3 if SJ = Φ andSM = Φ go to Step 5 otherwise the algorithm halts

Step 5 Choose job 119902 to process on machine 119896 with (119896 119902) =argmin

(119894119895)119892119894119895| 1 le 119895 le 119899 119894 isin 119872

119895and 119894 isin SM If there

exist two or more machine-job combinations (say machine-job combinations (119894

1 1198951) (1198942 1198952) (119894

ℎ 119895ℎ)) with the same

minimal order choose job type 119895119890to process on machine 119894

119890

with (119894119890 119895119890) = argmin

(119894119895)(119904119881119894119890119895119890

+ 119901119894119890119895119890

)119908119895119890

| 1 le 119890 le ℎ

Step 6 Remove 119896 or 119894119890from SM If SM = Φ go to Step 5

otherwise the algorithm halts

Action 4 Select jobs by LFM-MWSPT heuristics as follows

Algorithm 4 LFM-MWSPT heuristics

Step 1 Define SM and SJ as Step 1 in Algorithm 2

Step 2 Select a free machine (say 119896) from SM by LFM (LeastFlexible Machine see [33]) rule and choose a job type toprocess on machine 119896 following MWSPT heuristics

Step 3 Remove 119896 fromSM If SM = Φ go to Step 2 otherwisethe algorithm halts

Action 5 Select jobs by LFM-RA heuristics as follows

Algorithm 5 LFM-RA heuristics

Step 1 Define SM and SJ as Step 1 in Algorithm 2

Step 2 Select a free machine (say 119896) from SM by LFM ruleand choose a job type to process on machine 119896 followingRanking Algorithm

Step 3 Remove 119896 fromSM If SM = Φ go to Step 2 otherwisethe algorithm halts

Action 6 Each free machine selects the same job type as thelatest one it processed

Action 7 Select no job

At the start of a schedule horizon the system is atinitial state 119904

0 If there are free machines they select jobs to

process by taking one of Actions 1ndash6 otherwise Action 7 is

6 Journal of Industrial Engineering

chosen Afterwards when anymachine completes processinga lot or any machinersquos normal production time is completelyscheduled the system transfers into a new state 119904

119906 The

agent selects an action at this decision-making epoch and thesystem state transfers into an interim state 119904 When againany machine completes processing a lot or any machinersquosnormal production time used is up the system transfers intothe next decision-making state 119904

119906+1and the agent receive

reward 119903119906+1

which is computed due to 119904119906and the sojourn

time between the two transitions into 119904119906and 119904119906+1

(as shownin Section 24) The previous procedure is repeated until aterminal state is attained An episode is a trajectory from theinitial state to a terminal state of a schedule horizon Action7 is available only at the decision-making states when allmachines are busy

24 Reward Function A reward function follows severaldisciplines It indicates the instant impact of an action on theschedule that is to link the action with immediate rewardMoreover the accumulated reward indicates the objectivefunction value that is the agent receives large total rewardfor small objective function value

Definition 6 (reward function) Let 119870 denote the number ofdecision-making epoch during an episode 119905

119906(0 le 119906 lt 119870)

the time at the 119906th decision-making epoch 119879119894119906

(1 le 119894 le 1198981 le 119906 le 119870) the job type of the lot which machine 119894 processesduring time interval (119905

119906minus1 119905119906]1198790119894119906the job type of the lotwhich

precedes the lot machine 119894 processes during time interval(119905119906minus1 119905119906] and 119884

119895(119905119906) the processed volume of job type 119895 by

time 119905119906 It follows that

119884119895(119905119906) minus 119884119895(119905119906minus1) =

119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

(21)

where 120575(119894 119895) is an indicator function defined as

120575 (119894 119895) = 1 119879

119894119906= 119895

0 119879119894119906

= 119895(22)

Let 119903119906(119906 = 1 2 119870) denote the reward function at the

119906th decision-making epoch 119903119906is defined as

119903119906=

119899

sum

119895=1

min119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

[119863119895minus 119884119895(119905119906minus1)]+

119908119895

+max119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

minus [119863119895minus 119884119895(119905119906minus1)]+

0

times119908119895

119872

(23)

The reward function has the following property

Theorem 7 Maximization of the total reward 119877 in an episodeis equivalent to minimization of objective function (5)

Proof The total reward in an episode is

119877 =

119870

sum119906=1

119903119906

=

119870

sum119906=1

119899

sum

119895=1

min119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

[119863119895minus 119884119895(119905119906minus1)]+

119908119895

+max119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

minus [119863119895minus 119884119895(119905119906minus1)]+

0

times119908119895

119872

=

119899

sum

119895=1

119870

sum119906=1

min119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

[119863119895minus 119884119895(119905119906minus1)]+

times 119908119895

+max119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

minus [119863119895minus 119884119895(119905119906minus1)]+

0

times119908119895

119872

(24)

It is easy to show that

119884119895=

119870

sum119906=1

119898

sum

119894=1

(119905119906minus 119905119906minus1) 120575 (119894 119895) 119871

1199041198790

119894119906119879119894119906

+ 119901119894119879119894119906

(25)

It follows that

119877 =

119899

sum

119895=1

[119908119895min 119863

119895 119884119895 +

119908119895

119872max 0 119884

119895minus 119863119895]

= sum

119895isinΩ1

[119908119895119863119895+119908119895

119872(119884119895minus 119863119895)] + sum

119895isinΩ2

119908119895119884119895

=

119899

sum

119895=1

119908119895119863119895minus

sum

119895isinΩ1

[minus119908119895

119872(119884119895minus 119863119895)]

+ sum

119895isinΩ2

119908119895(119863119895minus 119884119895)

=

119899

sum

119895=1

119908119895119863119895minus

119899

sum

119895=1

[119908119895(119863119895minus 119884119895)+

minus119908119895

119872(119884119895minus 119863119895)+

]

(26)

where Ω1= 119895 | 119884

119895gt 119863119895 and Ω

2= 119895 | 119884

119895le 119863119895 Since

sum119899

119895=1119908119895119863119895is a constant it follows that

max119877 lArrrArr min119899

sum

119895=1

[119908119895(119863119895minus 119884119895)+

minus119908119895

119872(119884119895minus 119863119895)+

]

(27)

Journal of Industrial Engineering 7

3 The Reinforcement Learning Algorithm

The chip attach scheduling problem is converted into anRL problem with terminal state in Section 2 To apply 119876-learning to solve this RL problem another issue arises that ishow to tailor119876-learning algorithm in this particular contextSince some state variables are continuous the state spaceis infinite This RL system is not in tabular form and it isimpossible to maintain 119876-values for all state-action pairsThus we use linear functionwith gradient-descentmethod toapproximate the 119876-value function 119876-values are representedas linear combination of a set of basis functions Φ

119896(119904) (1 le

119896 le 4119898 + 119899) as shown in the next formula

119876 (119904 119886) =

4119898+119899

sum

119896=1

119888119886

119896Φ119896(119904) (28)

where 119888119886119896(1 le 119886 le 6 1 le 119896 le 4119898 + 119899) are the weights of basis

functions Each state variable corresponds to a basis functionThe following basis functions are defined to normalize thestate variables

Φ119896(119904)

=

1198790

119896

119899(1 le 119896 le 119898)

119879119896minus119898

119899(119898 + 1 le 119896 le 2119898)

119905119896minus2119898

max 11990411989511198952

+ 1199011198952| 1 le 1198951 le 119899 1 le 1198952 le 119899

(2119898 + 1 le 119896 le 3119898)

119889119896minus3119898

119863119896minus3119898

(3119898 + 1 le 119896 le 3119898 + 119899)

119890119896minus3119898minus119899

TH(3119898 + 119899 + 1 le 119896 le 4119898 + 119899)

(29)

Let 119862119886 denote the vector of weights of basis functions asfollows

119862119886

= (119888119886

1 119888119886

2 119888

119886

4119898+119899)119879

(30)

The RL algorithm is presented as Algorithm 8 where 120572is learning rate 120574 is a discount factor 119864(119886) is the vector ofeligibility traces for action 119886 120575(119886) is an error variable foraction 119886 and 120582 is a factor for updating eligibility traces

Algorithm 8 119876-learning with linear gradient-descent func-tion approximation for chip attach scheduling

Initialize 119862119886 and 119864(119886) randomly Set parameters 120572 120574 and120582

Let num episode denote the number of episodes havingbeen run Set num episode = 0

While num episode ltMAX EPISODE doSet the current decision-making state 119904 larr 119904

0

While at least one of state variables 119890119894(1 le 119894 le 119898) is larger

than zero do

Select action 119886 for state 119904 by 120576-greedy policyImplement action a Determine the next event fortriggering state transition and the sojourn timeOnce any machine completes processing a lot orany machinersquos normal production time is completelyscheduled the system transfers into a new decision-making state 1199041015840(1198901015840

119894(1 le 119894 le 119898) is a component of 1199041015840)

Compute reward 1199031198861199041199041015840

Update the vector of weights in the approximate 119876-value function of action a

120575 (119886) larr997888 119903119886

1199041199041015840 + 120574max1198861015840

119876(1199041015840

1198861015840

) minus 119876 (119904 119886)

119864 (119886) larr997888 120582119864 (119886) + nabla119862119886119876 (119904 119886)

119862119886

larr997888 119862119886

+ 120572120575 (119886) 119864 (119886)

(31)

Set 119904 larr 1199041015840

If 119890119894= 0 holds for all 119894 (1 le 119894 le 119898) set num episode =

num episode + 1

4 Experiment Results

In the past the company used a manual process to conductchip attach scheduling A heuristic algorithm called LargestWeight First (LWF) was used as follows

Algorithm 9 (Largest Weight First (LWF) heuristics) Initial-ize SMwith the set of all machines (ie SM = 119894 | 1 le 119894 le 119898)and define SJ as Step 1 inAlgorithm 2 Initialize 119890

119894(1 le 119894 le 119898)

with each machinersquos normal production time Set 119884119895= 119868119895

where 119868119895is the initial production volume of job type 119895

Step 1 Schedule the job types in decreasing order of weightsin order to meet their TPVs

While SJ = Φ and SM = Φ doChoose job 119902 with 119902 = argmax119908

119895| 119895 isin SJ

While SM cap119872119902= Φ and 119884

119902lt 119863119902do

Choose machine 119894 to process job 119902 with 119894 =argmin119901

119896119902119908119902| 119896 isin SM cap119872

119902

If 119890119894minus 1199041198790

119894119902lt (119863119902minus 119884119902)119901119894119902 then

set 119884119902larr 119884119902+ 119871(119890119894minus 1199041198790

119894119902)119901119894119902 119890119894= 0 and

remove 119894 from SMelse set 119890

119894larr 119890119894minus 1199041198790

119894119902minus (119863119902minus 119884119902)119901119894119902119871 and

119884119902= 119863119902

1198790

119894= 119902

Step 2 Allocate the excess production capacity

If SM = Φ thenFor each machine 119894 (119894 isin SM)Choose job 119895with 119895 = argmax(119890

119894minus1199041198790

119894119902)119908119902119901119894119902|

1 le 119902 le 119899 set 119884119895larr 119884119895+ 119871(119890119894minus 1199041198790

119894119895)119901119894119895 119890119894= 0

8 Journal of Industrial Engineering

Table 1 Comparison of objective function values using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-Learning1 88867 78116 59758 87689 57582 42253 minus386132 13844 13586 110747 12669 10907 95926 766573 11901 10875 12409 10425 12190 83332 237754 83681 60797 39073 69405 45575 33920 minus442755 12938 12847 96960 10917 99827 89863 214146 70840 55692 51108 66213 51041 16930 minus544677 12090 10060 95399 10933 90754 76422 273748 10242 10780 11656 10333 10762 93663 118409 94606 87914 81763 88812 80331 60164 3303610 90803 88164 90773 87926 8856293 56307 2279811 11113 88287 82916 97882 85605 60160 1649312 10029 89005 86692 95836 78342 60744 minus38617Average 10419 94123 86321 95547 84685 64147 12233

Table 2 Comparison of unsatisfied TPV index using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-learning1 01179 01025 00789 01170 00754 00554 000812 01651 01611 01497 01499 01482 01421 001373 01421 01289 01475 01227 01455 00987 006914 01540 01104 00716 01258 00854 00614 000885 01588 01564 01186 01303 01215 01094 005716 01053 00819 00757 01006 00762 00248 001377 01462 01209 01150 01292 01082 00917 005828 01266 01324 01437 01272 01309 01150 003819 01315 01211 01133 01249 01127 00828 0081510 01154 01112 01151 01105 01110 00709 0053611 01544 01215 01146 01387 01204 00827 0069012 01262 01112 01088 01194 01002 00758 00118Average 01370 01216 01112 01247 01113 00842 00402

The chip attach station consists of 10 machines andnormally processes more than ten job types We selected12 sets of industrial data for experiments comparingthe119876-learning algorithm (Algorithm 8) and the six heuristics(Algorithms 1ndash5 9) WSPT MWSPT RA LFM-MWSPTLFM-RA and LWF For each dataset 119876-learning repeatedlysolves the scheduling problem 1000 times and selects theoptimal schedule of the 1000 solutions Table 1 shows theobjective function values of all datasets using the sevenalgorithms Individually any of WSPT MWSPT RA LFM-MWSPT and LFM-RA obtains larger objective functionvalues than LWF for every dataset Nevertheless takingWSPTMWSPT RA LFM-MWSPT and LFM-RA as actions119876-learning algorithm achieves an objective function valuemuch smaller than LWF for each dataset In Tables 1ndash4 thebottom row presents the average value over all datasets Asshown in Table 1 the average objective function value of 119876-learning is only 12233 less than that of LWF 66147 by a largeamount of 8092

Besides objective function value we propose two indicesunsatisfied TPV index and unsatisfied job type index tomea-sure the performance of the seven algorithms Unsatisfied

TPV index (UPI) is defined as formula (32) and indicatesthe weighted proportion of unfinished Target ProductionVolume Table 2 compares UPIs of all datasets using sevenalgorithms Also any of WSPT MWSPT RA LFM-MWSPTand LFM-RA individually obtains larger UPI than LWFfor each dataset However 119876-learning algorithm achievessmaller UPI than LWF does for each datasetThe average UPIof 119876-learning is only 00402 less than that of LWF 00842by a large amount of 5220 Let 119869 denote the set 119895 | 1 le119895 le 119899 119884

119895lt 119863119895 Unsatisfied job type index (UJTI) is defined

as formula (33) and indicates the weighted proportion of thejob types whose TPVs are not completely satisfied Table 3compares UJTIs of all datasets using seven algorithms Withmost datasets 119876-learning algorithm achieves smaller UJTIsthan LWFThe average UJTI of119876-learning is 00802 which isremarkably less than that of LWF 01176 by 3181 Consider

UPI =sum119899

119895=1119908119895(119863119895minus 119884119895)+

sum119899

119895=1119908119895119863119895

(32)

UJTI =sum119895=119869119908119895

sum119899

119895=1119908119895

(33)

Journal of Industrial Engineering 9

Table 3 Comparison of unsatisfied job type index using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-learning1 01290 02615 01667 01650 01793 00921 006782 01924 02953 02302 02650 02097 01320 002783 02250 03287 02126 02564 02278 00921 009214 00781 02987 00278 01290 00828 00278 002785 02924 03169 03002 02224 02632 02055 005716 02290 03062 01817 02290 01632 00571 002787 02650 02987 02160 02529 02075 01320 016478 01924 03225 02067 01813 01696 01320 013209 02221 03250 01403 01892 02073 01320 0074910 02621 03304 02667 02859 02708 01781 0067811 02029 02896 02578 02194 02220 01381 0154212 01924 03271 02302 01838 02182 00921 00678Average 02069 03084 02031 02149 02018 01176 00802

Table 4 Comparison of the total setup time using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-learning1 08133 08497 03283 07231 03884 03895 100002 08333 13000 04358 07564 05263 04094 100003 08712 11633 04207 06361 04937 04298 100004 11280 07123 04629 08516 05139 04318 100005 09629 13121 04179 08597 05115 03873 100006 07489 10393 04104 07489 04542 04074 100007 17868 22182 08223 14069 10125 04174 100008 06456 08508 04055 06694 05053 03795 100009 09245 09946 05013 07821 06694 04163 1000010 11025 17875 06703 10371 09079 04894 1000011 09973 13655 03994 09686 05129 04066 1000012 07904 11111 04419 06195 05081 04258 10000Average 09671 12254 04764 08383 05837 04158 10000

Table 4 shows the total setup time of all datasets usingseven algorithms For the reason of commercial confiden-tiality we used the normalized data with the setup time of adataset divided by the result of this dataset using 119876-learningThus the total setup times of all datasets by 119876-learning areconverted into one and the data of the six heuristics areadjusted accordingly 119876-learning algorithm requires morethan twice of setup time than LWF does for each dataset Theaverage accumulated setup time of LWF is only 4158 percentsof that of 119876-learning

The previous experimental results reveal that for thewhole scheduling tasks any individual one of the five actionheuristics (WSPT MWSPT RA LFM-MWSPT and LFM-RA) for 119876-learning performs worse than LWF heuristicsHowever 119876-learning greatly outperforms LWF in termsof the three performance measures the objective functionvalue UPI and UJTI This demonstrates that some actionheuristics provide better actions than LWF heuristics at somestates During repeatedly solving the scheduling problem119876-learning system perceives the insights of the schedulingproblem automatically and adjusts its actions towards the

optimal ones facing different system states The actions atall states form a new optimized policy which is differentfrom any policies following any individual action heuristicsor LWF heuristics That is119876-learning incorporates the meritof five alternative heuristics uses them to schedule jobsflexibly and obtains results much better than any individualaction heuristics and LWF heuristics In the experiments119876-learning achieves high-quality schedules at the cost ofinducingmore setup time In other words119876-learning utilizesthe machines more efficiently by increasing conversionsamong a variety of job types

5 Conclusions

We apply 119876-learning to study lot-based chip attach schedul-ing in back-end semiconductor manufacturing To applyreinforcement learning to scheduling the critical issue beingconversion of scheduling problems into RL problems Weconvert chip attach scheduling problem into a particularSMDP problem by Markovian state representation Fiveheuristic algorithms WSPT MWSPT RA LFM-MWSPT

10 Journal of Industrial Engineering

and LFM-RA are selected as actions so as to utilize priordomain knowledge Reward function is directly related toscheduling objective function and we prove that maximizingthe accumulated reward is equivalent to minimizing theobjective function Gradient-descent linear function approx-imation is combined with 119876-learning algorithm

119876-learning exploits the insight structure of the schedul-ing problem by solving it repeatedly It learns a domain-specific policy from the experienced episodes through inter-action and then applies it to latter episodes We definetwo indices unsatisfied TPV index and unsatisfied job typeindex together with objective function value to measure theperformance of 119876-learning and the heuristics Experimentswith industrial datasets show that 119876-learning apparentlyoutperforms six heuristic algorithms WSPT MWSPT RALFM-MWSPT LFM-RA and LWF Compared with LWF119876-learning achieves reduction of the three performancemeasures respectively by an average level of 5220 3181and 8092 With 119876-learning chip attach scheduling isoptimized through increasing effective job type conversions

Disclosure

Given the sensitive and proprietary nature of the semicon-ductor manufacturing environment we use normalized datain this paper

Acknowledgments

This project is supported by the National Natural ScienceFoundation of China (Grant no 71201026) Science andTechnological Program for Dongguanrsquos Higher EducationScience and Research and Health Care Institutions (no2011108102017) andHumanities and Social Sciences Programof Ministry of Education of China (no 10YJC630405)

References

[1] M. X. Weng, J. Lu, and H. Ren, "Unrelated parallel machine scheduling with setup consideration and a total weighted completion time objective," International Journal of Production Economics, vol. 70, no. 3, pp. 215–226, 2001.

[2] M. Gairing, B. Monien, and A. Woclaw, "A faster combinatorial approximation algorithm for scheduling unrelated parallel machines," in Automata, Languages and Programming, vol. 3580 of Lecture Notes in Computer Science, pp. 828–839, 2005.

[3] G. Mosheiov, "Parallel machine scheduling with a learning effect," Journal of the Operational Research Society, vol. 52, no. 10, pp. 1–5, 2001.

[4] G. Mosheiov and J. B. Sidney, "Scheduling with general job-dependent learning curves," European Journal of Operational Research, vol. 147, no. 3, pp. 665–670, 2003.

[5] L. Yu, H. M. Shih, M. Pfund, W. M. Carlyle, and J. W. Fowler, "Scheduling of unrelated parallel machines: an application to PWB manufacturing," IIE Transactions, vol. 34, no. 11, pp. 921–931, 2002.

[6] K. R. Baker and J. W. M. Bertrand, "A dynamic priority rule for scheduling against due-dates," Journal of Operations Management, vol. 3, no. 1, pp. 37–42, 1982.

[7] J. J. Kanet and X. Li, "A weighted modified due date rule for sequencing to minimize weighted tardiness," Journal of Scheduling, vol. 7, no. 4, pp. 261–276, 2004.

[8] R. V. Rachamadugu and T. E. Morton, "Myopic heuristics for the single machine weighted tardiness problem," Working Paper 28-81-82, Graduate School of Industrial Administration, Carnegie-Mellon University, 1981.

[9] A. Volgenant and E. Teerhuis, "Improved heuristics for the n-job single-machine weighted tardiness problem," Computers and Operations Research, vol. 26, no. 1, pp. 35–44, 1999.

[10] D. C. Carroll, Heuristic sequencing of jobs with single and multiple components [Ph.D. thesis], Sloan School of Management, MIT, 1965.

[11] A. P. J. Vepsalainen and T. E. Morton, "Priority rules for job shops with weighted tardiness costs," Management Science, vol. 33, no. 8, pp. 1035–1047, 1987.

[12] R. S. Russell, E. M. Dar-El, and B. W. Taylor, "A comparative analysis of the COVERT job sequencing rule using various shop performance measures," International Journal of Production Research, vol. 25, no. 10, pp. 1523–1540, 1987.

[13] J. Bank and F. Werner, "Heuristic algorithms for unrelated parallel machine scheduling with a common due date, release dates, and linear earliness and tardiness penalties," Mathematical and Computer Modelling, vol. 33, no. 4-5, pp. 363–383, 2001.

[14] C. F. Liaw, Y. K. Lin, C. Y. Cheng, and M. Chen, "Scheduling unrelated parallel machines to minimize total weighted tardiness," Computers and Operations Research, vol. 30, no. 12, pp. 1777–1789, 2003.

[15] D. W. Kim, D. G. Na, and F. F. Chen, "Unrelated parallel machine scheduling with setup times and a total weighted tardiness objective," Robotics and Computer-Integrated Manufacturing, vol. 19, no. 1-2, pp. 173–181, 2003.

[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.

[17] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], Cambridge University, 1989.

[18] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[19] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms," Neural Computation, vol. 6, pp. 1185–1201, 1994.

[20] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.

[21] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass, USA, 1996.

[22] S. Riedmiller and M. Riedmiller, "A neural reinforcement learning approach to learn local dispatching policies in production scheduling," in Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[23] M. E. Aydin and E. Oztemel, "Dynamic job-shop scheduling using reinforcement learning agents," Robotics and Autonomous Systems, vol. 33, no. 2, pp. 169–178, 2000.

[24] J. Hong and V. V. Prabhu, "Distributed reinforcement learning control for batch sequencing and sizing in just-in-time manufacturing systems," Applied Intelligence, vol. 20, no. 1, pp. 71–87, 2004.

[25] Y. C. Wang and J. M. Usher, "Application of reinforcement learning for agent-based production scheduling," Engineering Applications of Artificial Intelligence, vol. 18, no. 1, pp. 73–82, 2005.

[26] B. C. Csaji, L. Monostori, and B. Kadar, "Reinforcement learning in a distributed market-based production control system," Advanced Engineering Informatics, vol. 20, no. 3, pp. 279–288, 2006.

[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, no. 3, pp. 808–818, 2007.

[28] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning utilizing OLAP mining in the learning process," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 35, no. 4, pp. 582–590, 2005.

[29] C. D. Paternina-Arboleda and T. K. Das, "A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem," Simulation Modelling Practice and Theory, vol. 13, no. 5, pp. 389–406, 2005.

[30] C. E. Mariano-Romero, V. H. Alcocer-Yamanaka, and E. F. Morales, "Multi-objective optimization of water-using systems," European Journal of Operational Research, vol. 181, no. 3, pp. 1691–1707, 2007.

[31] D. Vengerov, "A reinforcement learning framework for utility-based scheduling in resource-constrained systems," Future Generation Computer Systems, vol. 25, no. 7, pp. 728–736, 2009.

[32] K. Iwamura, N. Mayumi, Y. Tanimizu, and N. Sugimura, "A study on real-time scheduling for holonic manufacturing systems - determination of utility values based on multi-agent reinforcement learning," in Proceedings of the 4th International Conference on Industrial Applications of Holonic and Multi-Agent Systems, pp. 135–144, Linz, Austria, 2009.

[33] M. Pinedo, Scheduling: Theory, Algorithms, and Systems, Prentice Hall, Englewood Cliffs, NJ, USA, 2nd edition, 2002.



5 Conclusions

We apply 119876-learning to study lot-based chip attach schedul-ing in back-end semiconductor manufacturing To applyreinforcement learning to scheduling the critical issue beingconversion of scheduling problems into RL problems Weconvert chip attach scheduling problem into a particularSMDP problem by Markovian state representation Fiveheuristic algorithms WSPT MWSPT RA LFM-MWSPT

10 Journal of Industrial Engineering

and LFM-RA are selected as actions so as to utilize priordomain knowledge Reward function is directly related toscheduling objective function and we prove that maximizingthe accumulated reward is equivalent to minimizing theobjective function Gradient-descent linear function approx-imation is combined with 119876-learning algorithm

119876-learning exploits the insight structure of the schedul-ing problem by solving it repeatedly It learns a domain-specific policy from the experienced episodes through inter-action and then applies it to latter episodes We definetwo indices unsatisfied TPV index and unsatisfied job typeindex together with objective function value to measure theperformance of 119876-learning and the heuristics Experimentswith industrial datasets show that 119876-learning apparentlyoutperforms six heuristic algorithms WSPT MWSPT RALFM-MWSPT LFM-RA and LWF Compared with LWF119876-learning achieves reduction of the three performancemeasures respectively by an average level of 5220 3181and 8092 With 119876-learning chip attach scheduling isoptimized through increasing effective job type conversions

Disclosure

Given the sensitive and proprietary nature of the semicon-ductor manufacturing environment we use normalized datain this paper

Acknowledgments

This project is supported by the National Natural ScienceFoundation of China (Grant no 71201026) Science andTechnological Program for Dongguanrsquos Higher EducationScience and Research and Health Care Institutions (no2011108102017) andHumanities and Social Sciences Programof Ministry of Education of China (no 10YJC630405)

References

[1] M X Weng J Lu and H Ren ldquoUnrelated parallel machinescheduling with setup consideration and a total weightedcompletion time objectiverdquo International Journal of ProductionEconomics vol 70 no 3 pp 215ndash226 2001

[2] M Gairing B Monien and A Woclaw ldquoA faster combinato-rial approximation algorithm for scheduling unrelated parallelmachinesrdquo inAutomata Languages and Programming vol 3580of Lecture Notes in Computer Science pp 828ndash839 2005

[3] G Mosheiov ldquoParallel machine scheduling with a learningeffectrdquo Journal of the Operational Research Society vol 52 no10 pp 1ndash5 2001

[4] G Mosheiov and J B Sidney ldquoScheduling with general job-dependent learning curvesrdquo European Journal of OperationalResearch vol 147 no 3 pp 665ndash670 2003

[5] L Yu H M Shih M Pfund W M Carlyle and J W FowlerldquoScheduling of unrelated parallel machines an application toPWB manufacturingrdquo IIE Transactions vol 34 no 11 pp 921ndash931 2002

[6] K R Baker and J W M Bertrand ldquoA dynamic priorityrule for scheduling against due-datesrdquo Journal of OperationsManagement vol 3 no 1 pp 37ndash42 1982

[7] J J Kanet and X Li ldquoA weighted modified due date rulefor sequencing to minimize weighted tardinessrdquo Journal ofScheduling vol 7 no 4 pp 261ndash276 2004

[8] R V Rachamadugu and T E Morton ldquoMyopic heuristicsfor the single machine weighted tardiness problemrdquo WorkingPaper 28-81-82 Graduate School of Industrial AdministrationGarnegie-Mellon University 1981

[9] A Volgenant and E Teerhuis ldquoImproved heuristics for then-job single-machine weighted tardiness problemrdquo Computersand Operations Research vol 26 no 1 pp 35ndash44 1999

[10] D C Carroll Heuristic sequencing of jobs with single and mul-tiple components [PhD thesis] Sloan School of ManagementMIT 1965

[11] A P J Vepsalainen and T E Morton ldquoPriority rules for jobshops with weighted tardiness costsrdquoManagement Science vol33 no 8 pp 1035ndash1047 1987

[12] R S Russell E M Dar-El and B W Taylor ldquoA comparativeanalysis of the COVERT job sequencing rule using various shopperformance measuresrdquo International Journal of ProductionResearch vol 25 no 10 pp 1523ndash1540 1987

[13] J Bank and F Werner ldquoHeuristic algorithms for unrelated par-allelmachine schedulingwith a commondue date release datesand linear earliness and tardiness penaltiesrdquoMathematical andComputer Modelling vol 33 no 4-5 pp 363ndash383 2001

[14] C F Liaw Y K Lin C Y Cheng and M Chen ldquoSchedulingunrelated parallel machines to minimize total weighted tardi-nessrdquo Computers and Operations Research vol 30 no 12 pp1777ndash1789 2003

[15] DWKimDGNa andF F Chen ldquoUnrelated parallelmachinescheduling with setup times and a total weighted tardinessobjectiverdquo Robotics and Computer-Integrated Manufacturingvol 19 no 1-2 pp 173ndash181 2003

[16] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge Mass USA 1998

[17] C J CHWatkinsLearning fromdelayed rewards [PhD thesis]Cambridge University 1989

[18] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[19] T JaakkolaM I Jordan and S P Singh ldquoOn the convergence ofstochastic iterative dynamic programming algorithmsrdquo NeuralComputation vol 6 pp 1185ndash1201 1994

[20] J N Tsitsiklis ldquoAsynchronous stochastic approximation and Q-learningrdquoMachine Learning vol 16 no 3 pp 185ndash202 1994

[21] D P Bertsekas and J N Tsitsiklis Neuro-Dynamic Program-ming Athena Scientific Belmont Mass USA 1996

[22] S Riedmiller and M Riedmiller ldquoA neural reinforcementlearning approach to learn local dispatching policies in produc-tion schedulingrdquo in Proceedings of the 16th International JointConference on Artificial Intelligence Stockholm Sweden 1999

[23] M E Aydin and E Oztemel ldquoDynamic job-shop schedulingusing reinforcement learning agentsrdquoRobotics and AutonomousSystems vol 33 no 2 pp 169ndash178 2000

[24] J Hong and V V Prabhu ldquoDistributed reinforcement learningcontrol for batch sequencing and sizing in just-in-time manu-facturing systemsrdquo Applied Intelligence vol 20 no 1 pp 71ndash872004

[25] Y C Wang and J M Usher ldquoApplication of reinforcementlearning for agent-based production schedulingrdquo EngineeringApplications of Artificial Intelligence vol 18 no 1 pp 73ndash822005

Journal of Industrial Engineering 11

[26] B C Csaji L Monostori and B Kadar ldquoReinforcement learn-ing in a distributed market-based production control systemrdquoAdvanced Engineering Informatics vol 20 no 3 pp 279ndash2882006

[27] S S Singh V B Tadic and A Doucet ldquoA policy gradientmethod for semi-Markov decision processes with applicationto call admission controlrdquo European Journal of OperationalResearch vol 178 no 3 pp 808ndash818 2007

[28] M Kaya and R Alhajj ldquoA novel approach to multiagentreinforcement learning utilizing OLAP mining in the learningprocessrdquo IEEE Transactions on Systems Man and CyberneticsPart C vol 35 no 4 pp 582ndash590 2005

[29] C D Paternina-Arboleda and T K Das ldquoA multi-agent rein-forcement learning approach to obtaining dynamic controlpolicies for stochastic lot scheduling problemrdquo SimulationModelling Practice andTheory vol 13 no 5 pp 389ndash406 2005

[30] C E Mariano-Romero V H Alcocer-Yamanaka and E FMorales ldquoMulti-objective optimization of water-using systemsrdquoEuropean Journal of Operational Research vol 181 no 3 pp1691ndash1707 2007

[31] D Vengerov ldquoA reinforcement learning framework for utility-based scheduling in resource-constrained systemsrdquo Future Gen-eration Computer Systems vol 25 no 7 pp 728ndash736 2009

[32] K Iwamura N Mayumi Y Tanimizu and N SugimuraldquoA study on real-time scheduling for holonic manufacturingsystemsmdashdetermination of utility values based on multi-agentreinforcement learningrdquo in Proceedings of the 4th InternationalConference on Industrial Applications of Holonic and Multi-Agent Systems pp 135ndash144 Linz Austria 2009

[33] M Pinedo Scheduling Theory Algorithms and Systems Pren-tice Hall Englewoods Cliffs NJ USA 2nd edition 2002

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 7: Research Article Chip Attach Scheduling in Semiconductor ...downloads.hindawi.com/journals/jie/2013/295604.pdf · Chip Attach Scheduling in Semiconductor Assembly ... and PEVI. IS

Journal of Industrial Engineering 7

3 The Reinforcement Learning Algorithm

The chip attach scheduling problem is converted into anRL problem with terminal state in Section 2 To apply 119876-learning to solve this RL problem another issue arises that ishow to tailor119876-learning algorithm in this particular contextSince some state variables are continuous the state spaceis infinite This RL system is not in tabular form and it isimpossible to maintain 119876-values for all state-action pairsThus we use linear functionwith gradient-descentmethod toapproximate the 119876-value function 119876-values are representedas linear combination of a set of basis functions Φ

119896(119904) (1 le

119896 le 4119898 + 119899) as shown in the next formula

119876 (119904 119886) =

4119898+119899

sum

119896=1

119888119886

119896Φ119896(119904) (28)

where 119888119886119896(1 le 119886 le 6 1 le 119896 le 4119898 + 119899) are the weights of basis

functions Each state variable corresponds to a basis functionThe following basis functions are defined to normalize thestate variables

Φ119896(119904)

=

1198790

119896

119899(1 le 119896 le 119898)

119879119896minus119898

119899(119898 + 1 le 119896 le 2119898)

119905119896minus2119898

max 11990411989511198952

+ 1199011198952| 1 le 1198951 le 119899 1 le 1198952 le 119899

(2119898 + 1 le 119896 le 3119898)

119889119896minus3119898

119863119896minus3119898

(3119898 + 1 le 119896 le 3119898 + 119899)

119890119896minus3119898minus119899

TH(3119898 + 119899 + 1 le 119896 le 4119898 + 119899)

(29)

Let 119862119886 denote the vector of weights of basis functions asfollows

119862119886

= (119888119886

1 119888119886

2 119888

119886

4119898+119899)119879

(30)

The RL algorithm is presented as Algorithm 8 where 120572is learning rate 120574 is a discount factor 119864(119886) is the vector ofeligibility traces for action 119886 120575(119886) is an error variable foraction 119886 and 120582 is a factor for updating eligibility traces

Algorithm 8 119876-learning with linear gradient-descent func-tion approximation for chip attach scheduling

Initialize 119862119886 and 119864(119886) randomly Set parameters 120572 120574 and120582

Let num episode denote the number of episodes havingbeen run Set num episode = 0

While num episode ltMAX EPISODE doSet the current decision-making state 119904 larr 119904

0

While at least one of state variables 119890119894(1 le 119894 le 119898) is larger

than zero do

Select action 119886 for state 119904 by 120576-greedy policyImplement action a Determine the next event fortriggering state transition and the sojourn timeOnce any machine completes processing a lot orany machinersquos normal production time is completelyscheduled the system transfers into a new decision-making state 1199041015840(1198901015840

119894(1 le 119894 le 119898) is a component of 1199041015840)

Compute reward 1199031198861199041199041015840

Update the vector of weights in the approximate 119876-value function of action a

120575 (119886) larr997888 119903119886

1199041199041015840 + 120574max1198861015840

119876(1199041015840

1198861015840

) minus 119876 (119904 119886)

119864 (119886) larr997888 120582119864 (119886) + nabla119862119886119876 (119904 119886)

119862119886

larr997888 119862119886

+ 120572120575 (119886) 119864 (119886)

(31)

Set 119904 larr 1199041015840

If 119890119894= 0 holds for all 119894 (1 le 119894 le 119898) set num episode =

num episode + 1

4 Experiment Results

In the past the company used a manual process to conductchip attach scheduling A heuristic algorithm called LargestWeight First (LWF) was used as follows

Algorithm 9 (Largest Weight First (LWF) heuristics) Initial-ize SMwith the set of all machines (ie SM = 119894 | 1 le 119894 le 119898)and define SJ as Step 1 inAlgorithm 2 Initialize 119890

119894(1 le 119894 le 119898)

with each machinersquos normal production time Set 119884119895= 119868119895

where 119868119895is the initial production volume of job type 119895

Step 1 Schedule the job types in decreasing order of weightsin order to meet their TPVs

While SJ = Φ and SM = Φ doChoose job 119902 with 119902 = argmax119908

119895| 119895 isin SJ

While SM cap119872119902= Φ and 119884

119902lt 119863119902do

Choose machine 119894 to process job 119902 with 119894 =argmin119901

119896119902119908119902| 119896 isin SM cap119872

119902

If 119890119894minus 1199041198790

119894119902lt (119863119902minus 119884119902)119901119894119902 then

set 119884119902larr 119884119902+ 119871(119890119894minus 1199041198790

119894119902)119901119894119902 119890119894= 0 and

remove 119894 from SMelse set 119890

119894larr 119890119894minus 1199041198790

119894119902minus (119863119902minus 119884119902)119901119894119902119871 and

119884119902= 119863119902

1198790

119894= 119902

Step 2 Allocate the excess production capacity

If SM = Φ thenFor each machine 119894 (119894 isin SM)Choose job 119895with 119895 = argmax(119890

119894minus1199041198790

119894119902)119908119902119901119894119902|

1 le 119902 le 119899 set 119884119895larr 119884119895+ 119871(119890119894minus 1199041198790

119894119895)119901119894119895 119890119894= 0

8 Journal of Industrial Engineering

Table 1 Comparison of objective function values using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-Learning1 88867 78116 59758 87689 57582 42253 minus386132 13844 13586 110747 12669 10907 95926 766573 11901 10875 12409 10425 12190 83332 237754 83681 60797 39073 69405 45575 33920 minus442755 12938 12847 96960 10917 99827 89863 214146 70840 55692 51108 66213 51041 16930 minus544677 12090 10060 95399 10933 90754 76422 273748 10242 10780 11656 10333 10762 93663 118409 94606 87914 81763 88812 80331 60164 3303610 90803 88164 90773 87926 8856293 56307 2279811 11113 88287 82916 97882 85605 60160 1649312 10029 89005 86692 95836 78342 60744 minus38617Average 10419 94123 86321 95547 84685 64147 12233

Table 2 Comparison of unsatisfied TPV index using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-learning1 01179 01025 00789 01170 00754 00554 000812 01651 01611 01497 01499 01482 01421 001373 01421 01289 01475 01227 01455 00987 006914 01540 01104 00716 01258 00854 00614 000885 01588 01564 01186 01303 01215 01094 005716 01053 00819 00757 01006 00762 00248 001377 01462 01209 01150 01292 01082 00917 005828 01266 01324 01437 01272 01309 01150 003819 01315 01211 01133 01249 01127 00828 0081510 01154 01112 01151 01105 01110 00709 0053611 01544 01215 01146 01387 01204 00827 0069012 01262 01112 01088 01194 01002 00758 00118Average 01370 01216 01112 01247 01113 00842 00402

The chip attach station consists of 10 machines andnormally processes more than ten job types We selected12 sets of industrial data for experiments comparingthe119876-learning algorithm (Algorithm 8) and the six heuristics(Algorithms 1ndash5 9) WSPT MWSPT RA LFM-MWSPTLFM-RA and LWF For each dataset 119876-learning repeatedlysolves the scheduling problem 1000 times and selects theoptimal schedule of the 1000 solutions Table 1 shows theobjective function values of all datasets using the sevenalgorithms Individually any of WSPT MWSPT RA LFM-MWSPT and LFM-RA obtains larger objective functionvalues than LWF for every dataset Nevertheless takingWSPTMWSPT RA LFM-MWSPT and LFM-RA as actions119876-learning algorithm achieves an objective function valuemuch smaller than LWF for each dataset In Tables 1ndash4 thebottom row presents the average value over all datasets Asshown in Table 1 the average objective function value of 119876-learning is only 12233 less than that of LWF 66147 by a largeamount of 8092

Besides objective function value we propose two indicesunsatisfied TPV index and unsatisfied job type index tomea-sure the performance of the seven algorithms Unsatisfied

TPV index (UPI) is defined as formula (32) and indicatesthe weighted proportion of unfinished Target ProductionVolume Table 2 compares UPIs of all datasets using sevenalgorithms Also any of WSPT MWSPT RA LFM-MWSPTand LFM-RA individually obtains larger UPI than LWFfor each dataset However 119876-learning algorithm achievessmaller UPI than LWF does for each datasetThe average UPIof 119876-learning is only 00402 less than that of LWF 00842by a large amount of 5220 Let 119869 denote the set 119895 | 1 le119895 le 119899 119884

119895lt 119863119895 Unsatisfied job type index (UJTI) is defined

as formula (33) and indicates the weighted proportion of thejob types whose TPVs are not completely satisfied Table 3compares UJTIs of all datasets using seven algorithms Withmost datasets 119876-learning algorithm achieves smaller UJTIsthan LWFThe average UJTI of119876-learning is 00802 which isremarkably less than that of LWF 01176 by 3181 Consider

UPI =sum119899

119895=1119908119895(119863119895minus 119884119895)+

sum119899

119895=1119908119895119863119895

(32)

UJTI =sum119895=119869119908119895

sum119899

119895=1119908119895

(33)

Journal of Industrial Engineering 9

Table 3 Comparison of unsatisfied job type index using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-learning1 01290 02615 01667 01650 01793 00921 006782 01924 02953 02302 02650 02097 01320 002783 02250 03287 02126 02564 02278 00921 009214 00781 02987 00278 01290 00828 00278 002785 02924 03169 03002 02224 02632 02055 005716 02290 03062 01817 02290 01632 00571 002787 02650 02987 02160 02529 02075 01320 016478 01924 03225 02067 01813 01696 01320 013209 02221 03250 01403 01892 02073 01320 0074910 02621 03304 02667 02859 02708 01781 0067811 02029 02896 02578 02194 02220 01381 0154212 01924 03271 02302 01838 02182 00921 00678Average 02069 03084 02031 02149 02018 01176 00802

Table 4 Comparison of the total setup time using heuristics and 119876-learning

Dataset no WSPT MWSPT RA LFM-MWSPT LFM-RA LWF Q-learning1 08133 08497 03283 07231 03884 03895 100002 08333 13000 04358 07564 05263 04094 100003 08712 11633 04207 06361 04937 04298 100004 11280 07123 04629 08516 05139 04318 100005 09629 13121 04179 08597 05115 03873 100006 07489 10393 04104 07489 04542 04074 100007 17868 22182 08223 14069 10125 04174 100008 06456 08508 04055 06694 05053 03795 100009 09245 09946 05013 07821 06694 04163 1000010 11025 17875 06703 10371 09079 04894 1000011 09973 13655 03994 09686 05129 04066 1000012 07904 11111 04419 06195 05081 04258 10000Average 09671 12254 04764 08383 05837 04158 10000

Table 4 shows the total setup time of all datasets usingseven algorithms For the reason of commercial confiden-tiality we used the normalized data with the setup time of adataset divided by the result of this dataset using 119876-learningThus the total setup times of all datasets by 119876-learning areconverted into one and the data of the six heuristics areadjusted accordingly 119876-learning algorithm requires morethan twice of setup time than LWF does for each dataset Theaverage accumulated setup time of LWF is only 4158 percentsof that of 119876-learning

The previous experimental results reveal that for thewhole scheduling tasks any individual one of the five actionheuristics (WSPT MWSPT RA LFM-MWSPT and LFM-RA) for 119876-learning performs worse than LWF heuristicsHowever 119876-learning greatly outperforms LWF in termsof the three performance measures the objective functionvalue UPI and UJTI This demonstrates that some actionheuristics provide better actions than LWF heuristics at somestates During repeatedly solving the scheduling problem119876-learning system perceives the insights of the schedulingproblem automatically and adjusts its actions towards the

optimal ones facing different system states The actions atall states form a new optimized policy which is differentfrom any policies following any individual action heuristicsor LWF heuristics That is119876-learning incorporates the meritof five alternative heuristics uses them to schedule jobsflexibly and obtains results much better than any individualaction heuristics and LWF heuristics In the experiments119876-learning achieves high-quality schedules at the cost ofinducingmore setup time In other words119876-learning utilizesthe machines more efficiently by increasing conversionsamong a variety of job types

5 Conclusions

We apply 119876-learning to study lot-based chip attach schedul-ing in back-end semiconductor manufacturing To applyreinforcement learning to scheduling the critical issue beingconversion of scheduling problems into RL problems Weconvert chip attach scheduling problem into a particularSMDP problem by Markovian state representation Fiveheuristic algorithms WSPT MWSPT RA LFM-MWSPT

10 Journal of Industrial Engineering

and LFM-RA are selected as actions so as to utilize priordomain knowledge Reward function is directly related toscheduling objective function and we prove that maximizingthe accumulated reward is equivalent to minimizing theobjective function Gradient-descent linear function approx-imation is combined with 119876-learning algorithm

119876-learning exploits the insight structure of the schedul-ing problem by solving it repeatedly It learns a domain-specific policy from the experienced episodes through inter-action and then applies it to latter episodes We definetwo indices unsatisfied TPV index and unsatisfied job typeindex together with objective function value to measure theperformance of 119876-learning and the heuristics Experimentswith industrial datasets show that 119876-learning apparentlyoutperforms six heuristic algorithms WSPT MWSPT RALFM-MWSPT LFM-RA and LWF Compared with LWF119876-learning achieves reduction of the three performancemeasures respectively by an average level of 5220 3181and 8092 With 119876-learning chip attach scheduling isoptimized through increasing effective job type conversions

Disclosure

Given the sensitive and proprietary nature of the semicon-ductor manufacturing environment we use normalized datain this paper

Acknowledgments

This project is supported by the National Natural ScienceFoundation of China (Grant no 71201026) Science andTechnological Program for Dongguanrsquos Higher EducationScience and Research and Health Care Institutions (no2011108102017) andHumanities and Social Sciences Programof Ministry of Education of China (no 10YJC630405)

References

[1] M X Weng J Lu and H Ren ldquoUnrelated parallel machinescheduling with setup consideration and a total weightedcompletion time objectiverdquo International Journal of ProductionEconomics vol 70 no 3 pp 215ndash226 2001

[2] M Gairing B Monien and A Woclaw ldquoA faster combinato-rial approximation algorithm for scheduling unrelated parallelmachinesrdquo inAutomata Languages and Programming vol 3580of Lecture Notes in Computer Science pp 828ndash839 2005

[3] G Mosheiov ldquoParallel machine scheduling with a learningeffectrdquo Journal of the Operational Research Society vol 52 no10 pp 1ndash5 2001

[4] G Mosheiov and J B Sidney ldquoScheduling with general job-dependent learning curvesrdquo European Journal of OperationalResearch vol 147 no 3 pp 665ndash670 2003

[5] L Yu H M Shih M Pfund W M Carlyle and J W FowlerldquoScheduling of unrelated parallel machines an application toPWB manufacturingrdquo IIE Transactions vol 34 no 11 pp 921ndash931 2002

[6] K R Baker and J W M Bertrand ldquoA dynamic priorityrule for scheduling against due-datesrdquo Journal of OperationsManagement vol 3 no 1 pp 37ndash42 1982

[7] J J Kanet and X Li ldquoA weighted modified due date rulefor sequencing to minimize weighted tardinessrdquo Journal ofScheduling vol 7 no 4 pp 261ndash276 2004

[8] R V Rachamadugu and T E Morton ldquoMyopic heuristicsfor the single machine weighted tardiness problemrdquo WorkingPaper 28-81-82 Graduate School of Industrial AdministrationGarnegie-Mellon University 1981

[9] A Volgenant and E Teerhuis ldquoImproved heuristics for then-job single-machine weighted tardiness problemrdquo Computersand Operations Research vol 26 no 1 pp 35ndash44 1999

[10] D C Carroll Heuristic sequencing of jobs with single and mul-tiple components [PhD thesis] Sloan School of ManagementMIT 1965

[11] A P J Vepsalainen and T E Morton ldquoPriority rules for jobshops with weighted tardiness costsrdquoManagement Science vol33 no 8 pp 1035ndash1047 1987

[12] R S Russell E M Dar-El and B W Taylor ldquoA comparativeanalysis of the COVERT job sequencing rule using various shopperformance measuresrdquo International Journal of ProductionResearch vol 25 no 10 pp 1523ndash1540 1987

[13] J Bank and F Werner ldquoHeuristic algorithms for unrelated par-allelmachine schedulingwith a commondue date release datesand linear earliness and tardiness penaltiesrdquoMathematical andComputer Modelling vol 33 no 4-5 pp 363ndash383 2001

[14] C F Liaw Y K Lin C Y Cheng and M Chen ldquoSchedulingunrelated parallel machines to minimize total weighted tardi-nessrdquo Computers and Operations Research vol 30 no 12 pp1777ndash1789 2003

[15] DWKimDGNa andF F Chen ldquoUnrelated parallelmachinescheduling with setup times and a total weighted tardinessobjectiverdquo Robotics and Computer-Integrated Manufacturingvol 19 no 1-2 pp 173ndash181 2003

[16] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge Mass USA 1998

[17] C J CHWatkinsLearning fromdelayed rewards [PhD thesis]Cambridge University 1989

[18] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[19] T JaakkolaM I Jordan and S P Singh ldquoOn the convergence ofstochastic iterative dynamic programming algorithmsrdquo NeuralComputation vol 6 pp 1185ndash1201 1994

[20] J N Tsitsiklis ldquoAsynchronous stochastic approximation and Q-learningrdquoMachine Learning vol 16 no 3 pp 185ndash202 1994

[21] D P Bertsekas and J N Tsitsiklis Neuro-Dynamic Program-ming Athena Scientific Belmont Mass USA 1996

[22] S Riedmiller and M Riedmiller ldquoA neural reinforcementlearning approach to learn local dispatching policies in produc-tion schedulingrdquo in Proceedings of the 16th International JointConference on Artificial Intelligence Stockholm Sweden 1999

[23] M E Aydin and E Oztemel ldquoDynamic job-shop schedulingusing reinforcement learning agentsrdquoRobotics and AutonomousSystems vol 33 no 2 pp 169ndash178 2000

[24] J Hong and V V Prabhu ldquoDistributed reinforcement learningcontrol for batch sequencing and sizing in just-in-time manu-facturing systemsrdquo Applied Intelligence vol 20 no 1 pp 71ndash872004

[25] Y C Wang and J M Usher ldquoApplication of reinforcementlearning for agent-based production schedulingrdquo EngineeringApplications of Artificial Intelligence vol 18 no 1 pp 73ndash822005

Journal of Industrial Engineering 11

[26] B C Csaji L Monostori and B Kadar ldquoReinforcement learn-ing in a distributed market-based production control systemrdquoAdvanced Engineering Informatics vol 20 no 3 pp 279ndash2882006

[27] S S Singh V B Tadic and A Doucet ldquoA policy gradientmethod for semi-Markov decision processes with applicationto call admission controlrdquo European Journal of OperationalResearch vol 178 no 3 pp 808ndash818 2007

[28] M Kaya and R Alhajj ldquoA novel approach to multiagentreinforcement learning utilizing OLAP mining in the learningprocessrdquo IEEE Transactions on Systems Man and CyberneticsPart C vol 35 no 4 pp 582ndash590 2005

[29] C D Paternina-Arboleda and T K Das ldquoA multi-agent rein-forcement learning approach to obtaining dynamic controlpolicies for stochastic lot scheduling problemrdquo SimulationModelling Practice andTheory vol 13 no 5 pp 389ndash406 2005

[30] C E Mariano-Romero V H Alcocer-Yamanaka and E FMorales ldquoMulti-objective optimization of water-using systemsrdquoEuropean Journal of Operational Research vol 181 no 3 pp1691ndash1707 2007

[31] D Vengerov ldquoA reinforcement learning framework for utility-based scheduling in resource-constrained systemsrdquo Future Gen-eration Computer Systems vol 25 no 7 pp 728ndash736 2009

[32] K Iwamura N Mayumi Y Tanimizu and N SugimuraldquoA study on real-time scheduling for holonic manufacturingsystemsmdashdetermination of utility values based on multi-agentreinforcement learningrdquo in Proceedings of the 4th InternationalConference on Industrial Applications of Holonic and Multi-Agent Systems pp 135ndash144 Linz Austria 2009

[33] M Pinedo Scheduling Theory Algorithms and Systems Pren-tice Hall Englewoods Cliffs NJ USA 2nd edition 2002


Table 1: Comparison of objective function values using heuristics and Q-learning.

Dataset no.  WSPT    MWSPT   RA      LFM-MWSPT  LFM-RA   LWF     Q-learning
1            88867   78116   59758   87689      57582    42253   -38613
2            13844   13586   110747  12669      10907    95926   76657
3            11901   10875   12409   10425      12190    83332   23775
4            83681   60797   39073   69405      45575    33920   -44275
5            12938   12847   96960   10917      99827    89863   21414
6            70840   55692   51108   66213      51041    16930   -54467
7            12090   10060   95399   10933      90754    76422   27374
8            10242   10780   11656   10333      10762    93663   11840
9            94606   87914   81763   88812      80331    60164   33036
10           90803   88164   90773   87926      8856293  56307   22798
11           11113   88287   82916   97882      85605    60160   16493
12           10029   89005   86692   95836      78342    60744   -38617
Average      10419   94123   86321   95547      84685    64147   12233

Table 2: Comparison of unsatisfied TPV index using heuristics and Q-learning.

Dataset no.  WSPT    MWSPT   RA      LFM-MWSPT  LFM-RA  LWF     Q-learning
1            0.1179  0.1025  0.0789  0.1170     0.0754  0.0554  0.0081
2            0.1651  0.1611  0.1497  0.1499     0.1482  0.1421  0.0137
3            0.1421  0.1289  0.1475  0.1227     0.1455  0.0987  0.0691
4            0.1540  0.1104  0.0716  0.1258     0.0854  0.0614  0.0088
5            0.1588  0.1564  0.1186  0.1303     0.1215  0.1094  0.0571
6            0.1053  0.0819  0.0757  0.1006     0.0762  0.0248  0.0137
7            0.1462  0.1209  0.1150  0.1292     0.1082  0.0917  0.0582
8            0.1266  0.1324  0.1437  0.1272     0.1309  0.1150  0.0381
9            0.1315  0.1211  0.1133  0.1249     0.1127  0.0828  0.0815
10           0.1154  0.1112  0.1151  0.1105     0.1110  0.0709  0.0536
11           0.1544  0.1215  0.1146  0.1387     0.1204  0.0827  0.0690
12           0.1262  0.1112  0.1088  0.1194     0.1002  0.0758  0.0118
Average      0.1370  0.1216  0.1112  0.1247     0.1113  0.0842  0.0402

The chip attach station consists of 10 machines and normally processes more than ten job types. We selected 12 sets of industrial data for experiments comparing the Q-learning algorithm (Algorithm 8) and the six heuristics (Algorithms 1-5 and 9): WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. For each dataset, Q-learning repeatedly solves the scheduling problem 1000 times and selects the best schedule among the 1000 solutions. Table 1 shows the objective function values of all datasets under the seven algorithms. Individually, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA obtains a larger objective function value than LWF for every dataset. Nevertheless, taking WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA as actions, the Q-learning algorithm achieves an objective function value much smaller than that of LWF for each dataset. In Tables 1-4, the bottom row presents the average value over all datasets. As shown in Table 1, the average objective function value of Q-learning is only 12233, less than that of LWF (64147) by a large amount of 80.92%.
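To make the selection procedure concrete, the following Python sketch shows one way to keep the best of 1000 independent runs. Here solve_once and objective are hypothetical placeholders for a single Q-learning pass over a dataset and for the scheduling objective; they do not reproduce the implementation used in this study.

import random

def best_of_n(solve_once, objective, n_runs=1000):
    # Keep the schedule with the smallest objective value over n_runs runs.
    best_schedule, best_value = None, float("inf")
    for _ in range(n_runs):
        schedule = solve_once()
        value = objective(schedule)
        if value < best_value:
            best_schedule, best_value = schedule, value
    return best_schedule, best_value

# Toy stand-ins: a "schedule" is a random number and the objective is the
# number itself; in the experiments these would be one Q-learning episode
# and the weighted unsatisfied-TPV objective, respectively.
print(best_of_n(lambda: random.random(), lambda s: s))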

Besides the objective function value, we propose two indices, the unsatisfied TPV index and the unsatisfied job type index, to measure the performance of the seven algorithms. The unsatisfied TPV index (UPI) is defined in formula (32) and indicates the weighted proportion of unfinished Target Production Volume. Table 2 compares the UPIs of all datasets using the seven algorithms. Again, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA individually obtains a larger UPI than LWF for each dataset, whereas the Q-learning algorithm achieves a smaller UPI than LWF for each dataset. The average UPI of Q-learning is only 0.0402, less than that of LWF (0.0842) by a large amount of 52.20%. Let J denote the set {j | 1 ≤ j ≤ n, Y_j < D_j}. The unsatisfied job type index (UJTI) is defined in formula (33) and indicates the weighted proportion of the job types whose TPVs are not completely satisfied. Table 3 compares the UJTIs of all datasets using the seven algorithms. For most datasets, the Q-learning algorithm achieves a smaller UJTI than LWF. The average UJTI of Q-learning is 0.0802, which is remarkably less than that of LWF (0.1176), a reduction of 31.81%. Consider

UPI = \frac{\sum_{j=1}^{n} w_j (D_j - Y_j)^{+}}{\sum_{j=1}^{n} w_j D_j},   (32)

UJTI = \frac{\sum_{j \in J} w_j}{\sum_{j=1}^{n} w_j}.   (33)
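As an illustration of formulas (32) and (33), the Python sketch below computes both indices from the job weights, Target Production Volumes, and produced volumes of one dataset. The array names and the example numbers are invented for illustration only.

def unsatisfied_indices(w, D, Y):
    # w[j]: weight, D[j]: Target Production Volume, Y[j]: produced volume
    n = len(w)
    # Formula (32): weighted proportion of unfinished Target Production Volume
    upi = sum(w[j] * max(D[j] - Y[j], 0) for j in range(n)) \
        / sum(w[j] * D[j] for j in range(n))
    # Formula (33): weighted proportion of job types with Y_j < D_j
    ujti = sum(w[j] for j in range(n) if Y[j] < D[j]) / sum(w)
    return upi, ujti

# Example with three job types: only the second job type misses its target.
w = [2.0, 1.0, 1.0]
D = [100, 80, 50]
Y = [100, 60, 50]
print(unsatisfied_indices(w, D, Y))  # UPI = 20/330 ≈ 0.0606, UJTI = 1/4 = 0.25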


Table 3: Comparison of unsatisfied job type index using heuristics and Q-learning.

Dataset no.  WSPT    MWSPT   RA      LFM-MWSPT  LFM-RA  LWF     Q-learning
1            0.1290  0.2615  0.1667  0.1650     0.1793  0.0921  0.0678
2            0.1924  0.2953  0.2302  0.2650     0.2097  0.1320  0.0278
3            0.2250  0.3287  0.2126  0.2564     0.2278  0.0921  0.0921
4            0.0781  0.2987  0.0278  0.1290     0.0828  0.0278  0.0278
5            0.2924  0.3169  0.3002  0.2224     0.2632  0.2055  0.0571
6            0.2290  0.3062  0.1817  0.2290     0.1632  0.0571  0.0278
7            0.2650  0.2987  0.2160  0.2529     0.2075  0.1320  0.1647
8            0.1924  0.3225  0.2067  0.1813     0.1696  0.1320  0.1320
9            0.2221  0.3250  0.1403  0.1892     0.2073  0.1320  0.0749
10           0.2621  0.3304  0.2667  0.2859     0.2708  0.1781  0.0678
11           0.2029  0.2896  0.2578  0.2194     0.2220  0.1381  0.1542
12           0.1924  0.3271  0.2302  0.1838     0.2182  0.0921  0.0678
Average      0.2069  0.3084  0.2031  0.2149     0.2018  0.1176  0.0802

Table 4: Comparison of the total setup time using heuristics and Q-learning.

Dataset no.  WSPT    MWSPT   RA      LFM-MWSPT  LFM-RA  LWF     Q-learning
1            0.8133  0.8497  0.3283  0.7231     0.3884  0.3895  1.0000
2            0.8333  1.3000  0.4358  0.7564     0.5263  0.4094  1.0000
3            0.8712  1.1633  0.4207  0.6361     0.4937  0.4298  1.0000
4            1.1280  0.7123  0.4629  0.8516     0.5139  0.4318  1.0000
5            0.9629  1.3121  0.4179  0.8597     0.5115  0.3873  1.0000
6            0.7489  1.0393  0.4104  0.7489     0.4542  0.4074  1.0000
7            1.7868  2.2182  0.8223  1.4069     1.0125  0.4174  1.0000
8            0.6456  0.8508  0.4055  0.6694     0.5053  0.3795  1.0000
9            0.9245  0.9946  0.5013  0.7821     0.6694  0.4163  1.0000
10           1.1025  1.7875  0.6703  1.0371     0.9079  0.4894  1.0000
11           0.9973  1.3655  0.3994  0.9686     0.5129  0.4066  1.0000
12           0.7904  1.1111  0.4419  0.6195     0.5081  0.4258  1.0000
Average      0.9671  1.2254  0.4764  0.8383     0.5837  0.4158  1.0000

Table 4 shows the total setup time of all datasets under the seven algorithms. For reasons of commercial confidentiality, we use normalized data: the total setup time of each dataset is divided by the result obtained for that dataset by Q-learning. Thus the total setup time of every dataset under Q-learning is converted into one, and the data of the six heuristics are scaled accordingly. The Q-learning algorithm requires more than twice the setup time of LWF for each dataset; the average accumulated setup time of LWF is only 41.58 percent of that of Q-learning.
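The normalization behind Table 4 can be expressed in a few lines of Python; the setup-time figures below are invented placeholders, since the real values are confidential.

# Divide every algorithm's total setup time for a dataset by the
# Q-learning value, so that Q-learning maps to 1.0000.
setup_times = {"WSPT": 5.2, "MWSPT": 5.4, "RA": 2.1, "LFM-MWSPT": 4.6,
               "LFM-RA": 2.5, "LWF": 2.4, "Q-learning": 6.1}  # toy numbers
q_time = setup_times["Q-learning"]
normalized = {alg: round(t / q_time, 4) for alg, t in setup_times.items()}
print(normalized)  # Q-learning becomes 1.0; the heuristics are scaled accordingly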

The preceding experimental results reveal that, over the whole scheduling task, any individual one of the five action heuristics (WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA) used by Q-learning performs worse than the LWF heuristic. However, Q-learning greatly outperforms LWF in terms of the three performance measures: the objective function value, UPI, and UJTI. This demonstrates that some action heuristics provide better actions than the LWF heuristic in some states. By solving the scheduling problem repeatedly, the Q-learning system automatically perceives the structure of the scheduling problem and adjusts its actions toward the optimal ones for different system states. The actions at all states form a new, optimized policy that differs from any policy following an individual action heuristic or the LWF heuristic. That is, Q-learning incorporates the merits of the five alternative heuristics, uses them to schedule jobs flexibly, and obtains results much better than any individual action heuristic or the LWF heuristic. In the experiments, Q-learning achieves high-quality schedules at the cost of incurring more setup time. In other words, Q-learning utilizes the machines more efficiently by increasing conversions among a variety of job types.
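The reported reductions can be reproduced directly from the Average rows of Tables 1-3; the small discrepancies with the quoted 52.20%, 31.81%, and 80.92% presumably come from rounding in the printed averages.

# Percentage reduction of each measure by Q-learning relative to LWF,
# using the Average rows of Tables 1-3.
averages = {
    "objective value": (64147, 12233),    # (LWF, Q-learning), Table 1
    "UPI":             (0.0842, 0.0402),  # Table 2
    "UJTI":            (0.1176, 0.0802),  # Table 3
}
for measure, (lwf, q) in averages.items():
    print(f"{measure}: {100 * (lwf - q) / lwf:.2f}% reduction")
# prints roughly: objective value 80.93%, UPI 52.26%, UJTI 31.80%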

5. Conclusions

We apply Q-learning to study lot-based chip attach scheduling in back-end semiconductor manufacturing. To apply reinforcement learning to scheduling, the critical issue is the conversion of the scheduling problem into an RL problem. We convert the chip attach scheduling problem into a particular SMDP problem through a Markovian state representation. Five heuristic algorithms, WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA, are selected as actions so as to utilize prior domain knowledge. The reward function is directly related to the scheduling objective function, and we prove that maximizing the accumulated reward is equivalent to minimizing the objective function. Gradient-descent linear function approximation is combined with the Q-learning algorithm.

Q-learning exploits the inherent structure of the scheduling problem by solving it repeatedly. It learns a domain-specific policy from the experienced episodes through interaction and then applies it to later episodes. We define two indices, the unsatisfied TPV index and the unsatisfied job type index, together with the objective function value, to measure the performance of Q-learning and the heuristics. Experiments with industrial datasets show that Q-learning clearly outperforms the six heuristic algorithms WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. Compared with LWF, Q-learning reduces the three performance measures, respectively, by average amounts of 52.20%, 31.81%, and 80.92%. With Q-learning, chip attach scheduling is optimized through increasing effective job type conversions.

Disclosure

Given the sensitive and proprietary nature of the semiconductor manufacturing environment, we use normalized data in this paper.

Acknowledgments

This project is supported by the National Natural Science Foundation of China (Grant no. 71201026), the Science and Technological Program for Dongguan's Higher Education, Science and Research, and Health Care Institutions (no. 2011108102017), and the Humanities and Social Sciences Program of the Ministry of Education of China (no. 10YJC630405).

References

[1] M. X. Weng, J. Lu, and H. Ren, "Unrelated parallel machine scheduling with setup consideration and a total weighted completion time objective," International Journal of Production Economics, vol. 70, no. 3, pp. 215–226, 2001.

[2] M. Gairing, B. Monien, and A. Woclaw, "A faster combinatorial approximation algorithm for scheduling unrelated parallel machines," in Automata, Languages and Programming, vol. 3580 of Lecture Notes in Computer Science, pp. 828–839, 2005.

[3] G. Mosheiov, "Parallel machine scheduling with a learning effect," Journal of the Operational Research Society, vol. 52, no. 10, pp. 1–5, 2001.

[4] G. Mosheiov and J. B. Sidney, "Scheduling with general job-dependent learning curves," European Journal of Operational Research, vol. 147, no. 3, pp. 665–670, 2003.

[5] L. Yu, H. M. Shih, M. Pfund, W. M. Carlyle, and J. W. Fowler, "Scheduling of unrelated parallel machines: an application to PWB manufacturing," IIE Transactions, vol. 34, no. 11, pp. 921–931, 2002.

[6] K. R. Baker and J. W. M. Bertrand, "A dynamic priority rule for scheduling against due-dates," Journal of Operations Management, vol. 3, no. 1, pp. 37–42, 1982.

[7] J. J. Kanet and X. Li, "A weighted modified due date rule for sequencing to minimize weighted tardiness," Journal of Scheduling, vol. 7, no. 4, pp. 261–276, 2004.

[8] R. V. Rachamadugu and T. E. Morton, "Myopic heuristics for the single machine weighted tardiness problem," Working Paper 28-81-82, Graduate School of Industrial Administration, Carnegie-Mellon University, 1981.

[9] A. Volgenant and E. Teerhuis, "Improved heuristics for the n-job single-machine weighted tardiness problem," Computers and Operations Research, vol. 26, no. 1, pp. 35–44, 1999.

[10] D. C. Carroll, Heuristic sequencing of jobs with single and multiple components [Ph.D. thesis], Sloan School of Management, MIT, 1965.

[11] A. P. J. Vepsalainen and T. E. Morton, "Priority rules for job shops with weighted tardiness costs," Management Science, vol. 33, no. 8, pp. 1035–1047, 1987.

[12] R. S. Russell, E. M. Dar-El, and B. W. Taylor, "A comparative analysis of the COVERT job sequencing rule using various shop performance measures," International Journal of Production Research, vol. 25, no. 10, pp. 1523–1540, 1987.

[13] J. Bank and F. Werner, "Heuristic algorithms for unrelated parallel machine scheduling with a common due date, release dates, and linear earliness and tardiness penalties," Mathematical and Computer Modelling, vol. 33, no. 4-5, pp. 363–383, 2001.

[14] C. F. Liaw, Y. K. Lin, C. Y. Cheng, and M. Chen, "Scheduling unrelated parallel machines to minimize total weighted tardiness," Computers and Operations Research, vol. 30, no. 12, pp. 1777–1789, 2003.

[15] D. W. Kim, D. G. Na, and F. F. Chen, "Unrelated parallel machine scheduling with setup times and a total weighted tardiness objective," Robotics and Computer-Integrated Manufacturing, vol. 19, no. 1-2, pp. 173–181, 2003.

[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.

[17] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], Cambridge University, 1989.

[18] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[19] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms," Neural Computation, vol. 6, pp. 1185–1201, 1994.

[20] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.

[21] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass, USA, 1996.

[22] S. Riedmiller and M. Riedmiller, "A neural reinforcement learning approach to learn local dispatching policies in production scheduling," in Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[23] M. E. Aydin and E. Oztemel, "Dynamic job-shop scheduling using reinforcement learning agents," Robotics and Autonomous Systems, vol. 33, no. 2, pp. 169–178, 2000.

[24] J. Hong and V. V. Prabhu, "Distributed reinforcement learning control for batch sequencing and sizing in just-in-time manufacturing systems," Applied Intelligence, vol. 20, no. 1, pp. 71–87, 2004.

[25] Y. C. Wang and J. M. Usher, "Application of reinforcement learning for agent-based production scheduling," Engineering Applications of Artificial Intelligence, vol. 18, no. 1, pp. 73–82, 2005.

[26] B. C. Csaji, L. Monostori, and B. Kadar, "Reinforcement learning in a distributed market-based production control system," Advanced Engineering Informatics, vol. 20, no. 3, pp. 279–288, 2006.

[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, no. 3, pp. 808–818, 2007.

[28] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning utilizing OLAP mining in the learning process," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 35, no. 4, pp. 582–590, 2005.

[29] C. D. Paternina-Arboleda and T. K. Das, "A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem," Simulation Modelling Practice and Theory, vol. 13, no. 5, pp. 389–406, 2005.

[30] C. E. Mariano-Romero, V. H. Alcocer-Yamanaka, and E. F. Morales, "Multi-objective optimization of water-using systems," European Journal of Operational Research, vol. 181, no. 3, pp. 1691–1707, 2007.

[31] D. Vengerov, "A reinforcement learning framework for utility-based scheduling in resource-constrained systems," Future Generation Computer Systems, vol. 25, no. 7, pp. 728–736, 2009.

[32] K. Iwamura, N. Mayumi, Y. Tanimizu, and N. Sugimura, "A study on real-time scheduling for holonic manufacturing systems—determination of utility values based on multi-agent reinforcement learning," in Proceedings of the 4th International Conference on Industrial Applications of Holonic and Multi-Agent Systems, pp. 135–144, Linz, Austria, 2009.

[33] M. Pinedo, Scheduling: Theory, Algorithms, and Systems, Prentice Hall, Englewood Cliffs, NJ, USA, 2nd edition, 2002.
