
Research Article

Deep Reinforcement Learning for Scheduling in an Edge Computing-Based Industrial Internet of Things

Jingjing Wu,1 Guoliang Zhang,1 Jiaqi Nie,1 Yuhuai Peng,1 and Yunhou Zhang2

1School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
2Northeast Branch of State Grid Corporation of China, Shenyang 110180, China

Correspondence should be addressed to Yunhou Zhang; [email protected]

Received 4 June 2021; Accepted 22 July 2021; Published 10 August 2021

Wireless Communications and Mobile Computing, Volume 2021, Article ID 8017334, 12 pages. https://doi.org/10.1155/2021/8017334

Academic Editor: Xiaojie Wang

Copyright © 2021 Jingjing Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The demand for improving productivity in manufacturing systems makes the industrial Internet of things (IIoT) an important research area spawned by the Internet of things (IoT). In IIoT systems, there is an increasing demand for different types of industrial equipment to exchange stream data with different delays. Communications between massive heterogeneous industrial devices and clouds will cause high latency and require high network bandwidth. The introduction of edge computing in the IIoT can address unacceptable processing latency and reduce the heavy link burden. However, the limited resources in edge computing servers are one of the difficulties in formulating communication scheduling and resource allocation strategies. In this article, we use deep reinforcement learning (DRL) to solve the scheduling problem in edge computing to improve the quality of services provided to users in IIoT applications. First, we propose a hierarchical scheduling model considering the central-edge computing heterogeneous architecture. Then, according to the model characteristics, a deep intelligent scheduling algorithm (DISA) based on a double deep Q network (DDQN) framework is proposed to make scheduling decisions for communication. We compare DISA with other baseline solutions using various performance metrics. Simulation results show that the proposed algorithm is more effective than other baseline algorithms.

1. Introduction

With the rapid development of the Internet of things (IoT), an increasing number of daily services can easily obtain seamless network connectivity everywhere. Among the IoT extensions, the industrial Internet of things (IIoT) is considered a promising technology and has attracted much attention [1–3]. With the popularization of modern industry, IIoT devices, such as wireless sensors, programmable logic controllers (PLCs), remote terminal units (RTUs), and smart switches, are used to improve manufacturing efficiency and realize traditional industry intellectualization. In recent years, the IIoT has been utilized in many fields, including mining, healthcare monitoring, energy generation, and smart factories. Nevertheless, with the explosive growth in IIoT applications, the network environment becomes increasingly complex, which leads to unprecedented challenges, e.g., intermittent wireless connections, scarce spectrum resources, and high propagation delay. Further research needs to be performed to address the aforementioned problems.

Interconnected sensors or smart devices can collect data and interact through modern industrial network infrastructure via the Internet. In this case, these sensors and devices generate large amounts of data that need further processing, which provides intelligence to both continuous environmental monitoring and data analysis [4]. In traditional cloud-based network architecture, all data must be uploaded to a centralized server, and the processed solutions need to be sent back from the cloud to terminal sensors and devices. This process creates high communication latency between end users and the cloud located far away from most end users. Introducing the edge computing [5, 6] paradigm into the IIoT is widely accepted as a promising technology to address the aforementioned problems. According to this paradigm, heavy computational tasks or multiple functions can be delivered at the edge of the network. In edge computing, the massive on-site data generated by different types of end users can be analyzed at the network edge, rather than transmitting data to distant centers, to address delay and bandwidth concerns. Compared with cloud computing, the introduction of edge computing can reduce network congestion and promote resource optimization. Edge computing is more suitable for integrating with the IIoT with a large number of end users, and the architecture can be considered for future IIoT infrastructures.

With the rapid growth in edge computing-based IIoT, to be more competitive, a manufacturing system should have good flexibility, fast response capabilities, and sophisticated heterogeneous structures. Therefore, scheduling [7, 8] plays an important role in ensuring the reliability and responsiveness of a manufacturing system. Communication devices are dedicated to transmitting data, while control devices perform real-time scheduling for dynamic decision-making information. In an IIoT system, most of the industrial devices generate regular packets, such as the packets obtained by detecting or collecting the on-site environment. However, when emergency events occur, some urgent packets are generated, which need to be delivered to the destination node within a specified deadline [9]. Edge computing servers must provide different levels of services to each request, which means that some types of requests may have higher priorities than others. Accordingly, scheduling algorithms for edge computing must satisfy expectations for each type of IIoT request in heterogeneous applications without wasting resources.

Additionally, the existing network protocols face tremendous pressure in industrial applications with massive data, limited transmission bandwidth, high data rates, and low-latency requirements. When the actual states of the network change and the transmission strategies need to be adjusted, the existing network protocols lack intelligence. Therefore, with the growth in network scale and the explosion in data, artificial intelligence (AI) [10] has been drastically promoted in recent years. The AI technique has already made breakthroughs in a variety of areas, such as robot control, automatic driving, and speech recognition. Compared with conventional methods, deep reinforcement learning (DRL) [11], which emerged from AI, has shown great advantages in large-scale networks. For example, trained architectures can be utilized to monitor data processing, classification, and decision-making with high accuracy. Once abnormal traffic occurs, the DRL architectures can quickly make judgments before deterioration spreads in the networks. The reason for the abovementioned performance improvement is that DRL can efficiently extract features from sample data and learn the relationships among multiple metrics by training on a large quantity of data. By accumulating action experiences from interactions with the target and reinforcing actions that lead to higher rewards, DRL can learn successful policies progressively. However, modeling and prediction of communication networks are very difficult because they have become more sophisticated and dynamic. Hence, deploying a more intelligent scheduling algorithm in networks is a necessary condition for the rational allocation of network resources.

In this paper, we propose a deep intelligent scheduling algorithm (DISA), an intelligence-driven experiential network architecture that exploits edge computing and DRL for scheduling. Our principal contributions are summarized as follows.

(1) We use the idea of DRL to describe the scheduling problem in edge computing-based IIoT and define the corresponding state space, action space, reward function, and value function

(2) We propose DISA, an edge computing-based network architecture for communication scheduling. DISA adapts the DDQN scheme, which guarantees the stability of dynamically generated policies

(3) Effective simulation experiments demonstrate that the proposed DISA can effectively create scheduling strategies for traffic flows. Numerical results verify the superiority of DISA over the typical baseline algorithms

The remaining sections of this paper are organized as follows. Section 2 reviews the related work done in other studies for the proposed problem. Then, problem descriptions and scheduling models are discussed in Section 3. Section 4 introduces the implementation details of our DISA. Thereafter, simulation results are analyzed in Section 5. Finally, we conclude the paper in Section 6.

2. Related Work

This section discusses the major works related to the problem raised in the preceding section.

2.1. Edge Computing-Based IIoT. There have been many studies on edge computing in recent years; edge computing provides computational and processing facilities at the edge of the network. The authors in [12] presented an edge-cloud interplay based on software-defined networks (SDN) to handle flow scheduling among edge and cloud devices. In this respect, a multiobjective evolutionary algorithm based on Tchebycheff decomposition was designed for scheduling in an IIoT environment. The authors in [13] expounded the development and integration process of IIoT and edge computing through an extensive review of the research achievements. They proposed a reference architecture of IIoT edge computing and carried out a comprehensive explanation from a number of performance indicators. The work in [14] drew on the idea of blockchain and solved the problem of traffic classification in the Internet of things. A voting-based consensus algorithm was designed to synchronize and update the binary coding tree set and hash table required for the classification of extended hash streams based on edge nodes. In a survey paper [15], the authors investigated the motivation of the edge cloud environment, the latest research results, key enabling technologies, and possible future applications. The purpose was to fully understand the edge computing issue through this comprehensive discussion. The authors in [16] roughly summarized the main structure of edge computing technology as fog computing, mobile edge computing, and cloudlets. Under each model structure, they gave detailed tutorials on principles, system architecture, standards, and applications.

2.2. Scheduling Technologies in IIoT. Currently, an increasing number of researchers and practitioners are attempting to solve dynamic scheduling problems in the field of wireless networks. The work in [17] proposed an efficient packet emergency sensing scheduling algorithm for smart cities. The algorithm divided data packets into three priority levels: high-priority data packets were sent to the destination node first, while low-priority data packets only needed to be delivered before the deadline. The sensor scheduling problem in power-constrained wireless networks was studied in [18]. The communication channels in the wireless network were modeled as ergodic Markov chains, and different transmission power levels were used on different channels to ensure the success of service transmission. The authors in [19] proposed an offline scheduling algorithm based on imitation learning. Specifically, they described the scheduling problem as an optimization problem, established the system model, designed the imitation learning scheduling algorithm, and obtained the optimal scheduling results. In [20], a real-time scheduling scheme based on a bargaining game was presented to achieve real-time scheduling in the manufacturing workshop. This paper proposed a flexible workshop architecture, which provided a new paradigm for manufacturing enterprises to improve real-time scheduling efficiency and eliminate the impact of abnormal events. The authors in [21] adopted Lyapunov optimization technology to find the asymptotically optimal solution to the mobile edge computing offloading scheduling problem under the condition of partial network knowledge. The Lyapunov optimization problem was decomposed into a knapsack problem for solving asymptotically optimal scheduling.

2.3. Intelligence-Driven Architecture in IIoT. Because traditional schemes rely heavily on manual processes when configuring data transmission strategies, it is a great challenge to design dynamic near-optimal control decisions in large networks. Currently, much research has shifted its focus to how to make the industrial IoT more intelligent in data transmission and network management. An online task scheduling algorithm based on imitation learning was proposed in [22] to minimize system energy consumption while meeting task delay requirements. This article was an early endeavor to use intelligent learning for online task scheduling in the vehicular edge computing network, which allowed the learning agent to consistently follow the expert's strategy and had a tolerable theoretical performance gap. The authors in [23] introduced deep learning of the IoT to edge computing to optimize network performance and protect user privacy when uploading packets. The edge computing technology reduced the network data volume from IoT terminals to cloud servers because the edge nodes uploaded intermediate packets instead of input packets. The work in [24] proposed a priority-aware reinforcement learning-based integrated design of network subsystems. This method automatically assigned sampling rates and backoff delays to the control and network subsystems in the industrial Internet of things system. In order to improve the system performance of highly coupled industrial IoT, according to the characteristics of industrial systems, the authors in [25] leveraged reinforcement learning technology to automatically configure control and network systems in dynamic industrial environments. Three new strategies were designed to accelerate the convergence of reinforcement learning. The authors in [26] proposed a service quality-aware secure routing protocol (DQSP) based on deep reinforcement learning. While ensuring QoS, this method extracted knowledge from historical traffic demands by interacting with the underlying network environment and dynamically optimized routing strategies.

However, further studies are still necessary on dynamic scheduling considering several performance metrics in IIoT applications. Moreover, few studies have investigated scheduling algorithms supported by intelligence-driven architectures.

3. Problem Definition and Models

3.1. Network Framework. In this section, we adopt a hierarchical structure to generalize all the contents in the IIoT network, as shown in Figure 1. There are three layers in this structure: the device layer, the edge intelligence layer, and the centralized intelligence layer. The device layer is composed of all objects, workmen, users, and smart terminals that can collect industrial data from live environments. The edge intelligence layer provides distributed, low-latency, and limited computing resources between the device layer and the higher layer. Lightweight DRL-based distributed data computing and edge processing features are implemented in this layer. We introduce the DRL agent into the edge intelligence layer to maintain equivalent performance and offload computational tasks from the cloud. The centralized intelligence layer consists of cloud data centers that aggregate data from lower layers. DRL-based data validation and central processing features are implemented in this layer. We introduce the DRL agent into this layer to optimize network performance and take global control.

The basic method by which smart devices, edge computing servers, and cloud data centers operate and interact is as follows. In the network, different applications generate different traffic types, which occupy different network resources. Here, the collected industrial traffic flows are divided into two categories: computing-intensive traffic flows and time-sensitive traffic flows. Computing-intensive traffic flows require more bandwidth, and the quality of transmission is more crucial. In contrast, time-sensitive traffic flows are sensitive to delay, and latency is more crucial. Thus, all the forwarding decisions are determined by the gateway node set in the corresponding layer. The device layer gateway can process a traffic flow locally, transmit it to an edge computing server, or transmit it to a cloud data center. The control flow can be adopted by the gateway based on the traffic flow classification. The edge intelligence layer is an intermediate layer and is closer to users than the cloud. Edge computing servers can process traffic flow scheduling and routing or forward the control flows to the higher layer. Time-sensitive traffic flows can be handled in this layer, reducing the overall service delay. The centralized intelligence layer mainly processes computing-intensive traffic flows or the flows forwarded by the lower layer and sends the response back to the lower users. The DRL module is installed in the data center to provide higher service quality and more efficient resource utilization.

3.2. Scheduling Model. We describe the edge computing-based IIoT scheduling problem in this section. The network topology is modeled as an undirected graph G(V, E). Here, V is the set of IIoT devices, and E = {(i, j) | i, j ∈ V} is the set of wireless links. A traffic flow is denoted as a tuple F = (s, d, b, t_D), where s ∈ V is the source node, d ∈ V is the destination node, b is the size of the traffic flow generated by the device, and t_D is the deadline before which the flow must be transmitted. We denote the set of traffic flows as F = {1, 2, ..., f} and use f to refer to the f-th traffic flow. The system operates in a frame-based time-division-multiplexing manner, and the set of time slots for scheduling is denoted as T = {0, 1, ..., t_max} with frame length |t_max|, as shown in Figure 2. Assume there are K channels between two wireless nodes and that the frame lengths on each channel are the same. Additional notations used in this paper are summarized in Table 1.
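To make the model concrete, the following minimal Python sketch encodes the flow tuple F = (s, d, b, t_D) and the topology G(V, E). All names, the toy topology, and the frame length are illustrative assumptions, not the paper's implementation; only K = 8 is taken from the simulation setup in Section 5.1.

```python
from dataclasses import dataclass

@dataclass
class TrafficFlow:
    """A traffic flow F = (s, d, b, t_D) from Section 3.2."""
    src: int          # source node s in V
    dst: int          # destination node d in V
    size: float       # data size b generated by the device
    deadline: float   # t_D, the latest acceptable completion time

# Topology G(V, E): nodes are IIoT devices, edges are wireless links.
# Each link carries K channels, each divided into frames of t_max slots.
K = 8          # channels per link (the value used in Section 5.1)
T_MAX = 16     # time slots per frame (illustrative)
V = set(range(6))
E = {(i, j) for i in V for j in V if i < j}  # a fully meshed toy example
```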

In our scheduling model, we consider an incremental traffic model. In this model, each traffic flow generated by the end device is individual, and once resources are allocated to it, the traffic flow in the network cannot be redistributed. When network resources cannot successfully provide services for a certain traffic flow, the flow is rejected immediately without suspension. At every scheduling interval, the gateway in the device layer classifies the traffic flows. This step determines at which higher layer the traffic flow can be handled. If the traffic flow is assigned to the edge layer, then the edge layer gateway determines whether the traffic flow can be processed locally, submitted to the centralized layer, or rejected based on its deadline. We assume that the feedback returned after analysis is small, so the feedback transmission delay is treated as identical in all cases. Hence, the analysis latency in edge computing server l can be represented by

T_f^{ANA} = T_{t,f}^{l} + T_{c,f}^{l} + T_f^{FB}.  (1)

If the traffic flow is submitted to the centralized layer because of the lack of computing resources in the edge layer, the analysis latency of flow f running on the cloud can be calculated as

T_f^{ANA} = T_{t,f}^{l} + T_{c,f}^{l'} + T_{t,f}^{l-cloud} + T_{c,f}^{cloud} + T_f^{FB}.  (2)

Here, T_{c,f}^{l'} represents the analysis latency of flow f when the edge layer computing resources are insufficient, and T_{c,f}^{l'} < T_{c,f}^{l}. Thus, the analysis redundancy on the edge layer can be reduced as much as possible.

Figure 1: Network frame. The device layer (cameras, mobile terminals, robots, PLCs, digital terminals, PDAs, monitoring devices, and workmen) connects through gateways to edge computing servers in the edge intelligence layer and to cloud data centers in the centralized intelligence layer; DRL agents at the edge servers and the cloud exchange state, reward, and action signals, with control flows, wireless links, and data transmission paths connecting the layers.

Figure 2: Scheduling model. Each wireless link between nodes i and j carries channels cn_{i,j}^1, ..., cn_{i,j}^k, and each channel is divided into frames (Frame 1, Frame 2, ...) of time slots t_1, t_2, ..., t_max.

In the cloud analysis model, the analysis latency of flow f can be represented by

T_f^{ANA} = T_{t,f}^{cloud} + T_{c,f}^{cloud} + T_f^{FB}.  (3)

We assume that T_{t,f}^{l} + T_{t,f}^{l-cloud} = T_{t,f}^{cloud}, and we can infer that the largest analysis latency comes from equation (2). As long as T_f^{ANA} ≤ t_D, the traffic flow can be served.
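As a sanity check on equations (1)-(3), the sketch below computes T_f^{ANA} for the three possible placements and applies the admission rule T_f^{ANA} ≤ t_D. The dictionary keys are illustrative stand-ins for the Table 1 delay components, not names from the paper.

```python
def analysis_latency(placement, d):
    """T_f^ANA for the three placements of equations (1)-(3).
    d maps Table 1 delay components to values (key names illustrative)."""
    if placement == "edge":            # equation (1)
        return d["T_t_l"] + d["T_c_l"] + d["T_fb"]
    if placement == "edge_to_cloud":   # equation (2)
        return (d["T_t_l"] + d["T_c_l_pre"] + d["T_t_l_cloud"]
                + d["T_c_cloud"] + d["T_fb"])
    if placement == "cloud":           # equation (3)
        return d["T_t_cloud"] + d["T_c_cloud"] + d["T_fb"]
    raise ValueError(f"unknown placement: {placement}")

def can_serve(placement, d, deadline):
    # A flow is served as long as T_f^ANA <= t_D; otherwise it is rejected.
    return analysis_latency(placement, d) <= deadline
```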

In the scheduling process, the network calculates the channel and the number of time slots for the traffic flow. For instance, when traffic flow f is considered, the size of the traffic flow determines the transmission time slots needed to accommodate the flow:

t_n^f = b / c_{i,j}.  (4)

We define the execution interval period for a traffic flow, denoted by EIP_{i,j}^f, as

EIP_{i,j}^f = (N_F · t_max) / t_n^f.  (5)

If a traffic flow has a large quantity of data, it has an intensive execution frequency within the scheduling process; here, N_F represents the number of frames during scheduling. Since a large data size results in a dense execution interval period, computing-intensive traffic flows prefer to choose the channel with more bandwidth, while time-sensitive traffic flows prefer to choose the channel with minimum delay. Once an appropriate channel is allocated, several time slots are allocated to each of the flows corresponding to their individual sizes. To avoid the overallocation of network resources, flows on the same channel can be distinguished by time slots, as shown in Figure 2.
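A small sketch of the slot bookkeeping in equations (4) and (5) follows. Rounding b/c_{i,j} up to whole slots is our assumption (a flow cannot occupy a fraction of a slot); the function names are illustrative.

```python
import math

def transmission_slots(b, c_ij):
    """Equation (4): time slots t_n^f needed for a flow of size b on a
    link whose per-slot data rate is c_ij (rounded up, by assumption)."""
    return math.ceil(b / c_ij)

def execution_interval_period(n_frames, t_max, t_n):
    """Equation (5): EIP_{i,j}^f = N_F * t_max / t_n^f. Larger flows get
    smaller intervals, i.e., a denser execution frequency."""
    return n_frames * t_max / t_n
```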

4. Proposed DISA Mechanism

In this section, we formulate the scheduling problem as a DRL process, including the state space S, action space A, and rewards R. Then, we consider the unique characteristics of dynamic time slot provisioning and enable the DRL agent to optimize the problem.

4.1. DRL Formulation

4.1.1. State. Let s_t denote the network state at time t (s_t ∈ S), which is composed of the source node, destination node, number of transmission time slots, and time slot occupancy. Therefore, we define the array as

s_t = {{i, j, t_n^f, t_1^{m,k}, t_2^{m,k}, t_3^k, t_4^k} | m ∈ [1, M], k ∈ [1, K]},  (6)

which summarizes the network information at time interval Δt. The features are explained as follows.

t_1^{m,k} is the number of available time slots in each execution interval period, out of the |M| execution periods determined by the size of the traffic flow, in the kth channel of the |K| channels for link (i, j).

t_2^{m,k} is the initial allocation index number of the available time slots in every execution interval period. If ∃ t_1^{m,k} = 0, m ∈ [1, 2, ..., M], k ∈ [1, 2, ..., K], then the corresponding t_2^{m,k} is invalid.

t_3^k is the total number of available time slots, which reflects the degree of occupancy along the link.

t_4^k is the total number of continuously available time slot blocks, which reflects the degree of fragmentation of resources along the link. This feature helps the agent identify links that potentially divide the time slots into fragments; too much fragmentation in time slots may affect access to subsequent traffic flows. Similar to traditional scheduling problems, these features enable the agent to perceive the capacity, status, traffic load, and security of each wireless link.

Figure 3 shows an example of constructing s_t in DISA. For the sake of simplicity, we assume that there is only one frame. We assume that all the nodes in the network are equal and that there are three traffic flows that arrive in order. There are 2 channels (K = 2) in the network, and the slot bitmask on each wireless link corresponds to its time slot utilization. Based on the previous definition, we assume that the three traffic flows require 1, 4, and 2 time slots. In this example, one frame can be divided into 4 execution interval periods at most (M = 4).

Table 1: Definitions of notations.

t_n^f: the transmission time slots of flow f
cn_{i,j}^k: the kth channel connecting i and j
Δt: the time interval of time slot t, t ≥ 0
c_{i,j}: the data rate a time slot can support for link (i, j)
T_f^{ANA}: the analysis latency of flow f
T_{t,f}^{l}: the transmission time between the device and the edge
T_{t,f}^{l-cloud}: the transmission time between the edge and the cloud
T_{t,f}^{cloud}: the transmission time between the device and the cloud
T_{c,f}^{l}: the analysis time on edge computing server l
T_{c,f}^{l'}: the preanalysis time on edge computing server l
T_{c,f}^{cloud}: the analysis time on the cloud data center
T_f^{FB}: the analysis feedback of flow f

The first flow needs to select 1 time slot in one frame time of the 2 channels. For instance, the available time slots in channel 1 (m = 1, k = 1) have a value of t_1^{1,1} = 2, and the initial allocation index number is t_2^{1,1} = 1. The remaining array elements for channel 1 are t_3^1 = 6 and t_4^1 = 3. Since the traffic flow needs to be executed once within a frame, time slot 1 in channel 1 is allocated based on the principle of early processing. For the second traffic flow, we determine that not every t_2^{m,k} of the 4 execution interval periods in channel 1 is valid, so we can only find idle time slots on channel 2. The four execution interval periods have 3, 5, 5, and 3 available time slots (m = 1, 2, 3, 4; k = 2). Their initial allocation index numbers are 2, 0, 0, and 0. To maintain equal execution intervals, we choose the time slots with index number 3 in each execution interval period. For the third traffic flow, which requires 2 time slots during scheduling, we need to allocate two execution interval periods in a frame. Other values of state s_t can be seen in Table 2. For each s_t, we assume that both M and K are constants. When the number of possible candidate execution frequencies or channels is less than the array dimension, we assign a constant array to ensure a unified format of s_t.
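The per-channel features t_1 through t_4 can be derived from a slot bitmask like the one pictured in Figure 3. The hedged sketch below assumes 1 marks a free slot, 0 an occupied one, and equally sized execution interval periods; these encoding choices are ours, not necessarily the paper's.

```python
def channel_features(bitmask, M):
    """Compute (t1, t2, t3, t4) of equation (6) for one channel.
    bitmask: list of 0/1 slot flags for one frame (1 = free, by assumption);
    M: number of execution interval periods. Names are illustrative."""
    per_eip = len(bitmask) // M
    t1, t2 = [], []
    for m in range(M):
        eip = bitmask[m * per_eip:(m + 1) * per_eip]
        t1.append(sum(eip))                             # free slots in EIP m
        t2.append(eip.index(1) if 1 in eip else None)   # first free index; invalid if none
    t3 = sum(t1)  # total free slots on the channel
    # t4: maximal runs of consecutive free slots (fragmentation measure)
    t4 = sum(1 for i, s in enumerate(bitmask)
             if s == 1 and (i == 0 or bitmask[i - 1] == 0))
    return t1, t2, t3, t4
```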

4.1.2. Action. In the approach, the agent determines which channel and time slot combination is available to assign to the current network state, and the action space is denoted as

A = {a_1, a_2, ..., a_{K·M}}.  (7)

An action refers to a channel from the K candidates and one of the M time slot positions on the selected channel obtained by the gateway. Therefore, the action space includes K·M actions.

4.1.3. Reward. The reward is the objective of the algorithm. The agent relies on rewards to evaluate the effectiveness of the action and further improve its policies. For any state s_t ∈ S, r_t ∈ R is the immediate reward that numerically characterizes the performance of an action a_t from the discrete set.

The network receives a reward r_t = 0 if traffic flow F is successfully received; otherwise, r_t = −1. As a result, to avoid congestion for computing-intensive traffic flows and reduce delay for time-sensitive traffic flows, the objective of the algorithm should be expressed as finding the optimal policy. The details are described in the next section.
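Taken together, the action and reward definitions admit a very small encoding. The flat-index-to-(channel, period) mapping below is one plausible layout of the K·M action space, not necessarily the authors' exact ordering; the reward values are exactly the 0/−1 scheme above.

```python
def decode_action(a, M):
    """Map a flat index a in {0, ..., K*M - 1} to a (channel k,
    execution interval period m) pair, per the action space of (7).
    The row-major layout is an assumption."""
    return a // M, a % M

def reward(flow_served):
    # r_t = 0 if the traffic flow is successfully received, -1 otherwise.
    return 0 if flow_served else -1
```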

4.2. Process of DISA. To allocate channels and time slots efficiently, we use the double deep Q network (DDQN) architecture [24] with experience replay and a greedy policy to solve the reinforcement learning problem. This architecture not only yields more accurate value estimations but also leads to much higher learning stability.

Figure 4 illustrates the DISA architecture, in which a DRL agent trains and optimizes the actions to address channel selection and transmission scheduling. DISA takes advantage of the edge computing networking paradigm for centralized and automated control of the IIoT device layer management. Specifically, a corresponding gateway interacts with the current DRL agent to collect network states and traffic flow requests and develop scheduling strategies. Upon receiving a traffic flow F (step 1) generated by the end device, the layer gateway fetches the current network state, including the in-service wireless channels, time slot resources, and topology abstraction, and then generates tailored state data s_t for DISA (step 2). The neural network input is a given state s_t, while the output is the value of each action. The action values can be represented by Q(s_t, a_t; θ), where θ denotes the parameters of the neural network. For each action a_t ∈ A in that given state (step 3), which corresponds to a particular channel and time slot combination (step 4), the layer gateway attempts to set up the corresponding wireless connection (step 5). The network receives the scheduling strategies related to the previous operations as feedback and produces an immediate reward r_t for the agent; then, the network moves to the next state s_{t+1}. Then, r_t, s_t, a_t, and s_{t+1} are stored in a replay memory denoted by D (step 6), from which DISA derives training signals for updating the DRL agent (step 7).

Figure 4: An illustration of the DDQN architecture. The layer gateway sits between the environment and the double DQN: traffic flows and network states enter the gateway; the evaluate DQN Q(s_t, a_t; θ) and the target DQN Q(s_t, a_t; θ') feed the loss function; the replay memory stores (s_t, a_t, r_t, s_{t+1}); and an action translator module maps chosen actions to channel and time slot assignments, which are returned as scheduling schemes.

The important ingredient for training the traditional DQN is that it maintains two independent and identical neural networks, a target DQN (Q(s_t, a_t; θ')) and an evaluate DQN (Q(s_t, a_t; θ)). The evaluate DQN is utilized to compute the Q value for each action, while the target DQN produces the Q values used to train the parameters of the evaluate DQN. Afterward, the action with the maximum Q value is chosen to set up the transmission for F. Both the evaluate DQN and the target DQN employ the same neural network structure as the basic module, which uses a simple fully connected neural network including one hidden layer. The neural network starts in state s_t and follows the value of each action. It attempts to minimize the loss function defined as

L_{DQN}(θ) = E[(Y_t^Q − Q(s_t, a_t; θ))^2].  (8)

Figure 3: An illustration of state status in DISA. One frame on each of channels 1 and 2 is divided into execution interval periods EIP 0 to EIP 3 (slot indices 0 to 4 each); the figure shows channel occupancy before and after flows F1, F2, and F3 are allocated.

Table 2: State values.

Channel 1 states before allocation: t_1^{1,1} = 2, t_1^{2,1} = 0, t_1^{3,1} = 4, t_1^{4,1} = 0; t_2^{1,1} = 1, t_2^{2,1} = ∅, t_2^{3,1} = 1, t_2^{4,1} = ∅; t_3^1 = 6; t_4^1 = 3.

Channel 2 states before allocation: t_1^{1,2} = 3, t_1^{2,2} = 5, t_1^{3,2} = 5, t_1^{4,2} = 3; t_2^{1,2} = 2, t_2^{2,2} = 0, t_2^{3,2} = 0, t_2^{4,2} = 0; t_3^2 = 16; t_4^2 = 2.


Here, Y_t^Q is the target Q value, represented as

Y_t^Q = r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ').  (9)

In equation (9), state s_{t+1} is the next state after performing action a_t in state s_t, and action a_{t+1} is an optional action in state s_{t+1}. γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and later rewards. As mentioned above, we can use the experience tuples (s_t, a_t, r_t, s_{t+1}) stored in the replay memory to train the neural network. The target Q value Y_t^Q is determined according to the immediate reward r_{t+1} and the maximum value of Q(s_{t+1}, a_{t+1}) obtained by inputting s_{t+1} into the target DQN. Therefore, Y_t^Q can be further expanded as

Y_t^Q = r_{t+1} + γ Q(s_{t+1}, argmax_{a_t} Q(s_{t+1}, a_t; θ'); θ').  (10)

DQN uses the same network parameters for the selection and evaluation of an action, which leads to overoptimistic action values. To avoid this situation, selection and evaluation can be decoupled, which is how the double DQN method was proposed. In DDQN, the target value of (10) can be written as

Y_t^{DoubleQ} = r_{t+1} + γ Q(s_{t+1}, argmax_{a_t} Q(s_{t+1}, a_t; θ); θ').  (11)

We choose actions according to the parameters of the evaluate DQN and use the target DQN parameters to measure the value of Q(s_{t+1}, a_{t+1}). Then, the loss function is defined as

L_{DDQN}(θ) = E[(Y_t^{DoubleQ} − Q(s_t, a_t; θ))^2].  (12)

The overall algorithm is summarized in Algorithm 1.

5. Simulation and Analysis

In this section, we first present the experimental setup and then demonstrate the performance of the proposed DISA compared with several baseline schemes.

5.1. Simulation Setup. All simulation experiments are implemented in a Python environment with TensorFlow. We use a computer with a 5.0 GHz Intel i7 CPU and 16 GB of RAM. We generate topologies of three sizes, namely, small, medium, and large. The networks comprise 15, 22, and 30 gateways, respectively, and every gateway is attached to 3 to 5 end devices. Figure 5 shows the experimental environment for this layered structure. It is also assumed that all wireless links have equal bandwidth and that the number of available channels on each wireless link is set to 8. The number of time slots on each channel is the least common multiple of the numbers of time slots required by the traffic flows, and the length of each time slot is 0.5 ms. We generate two kinds of traffic flows between nodes. Each traffic flow contains 100 data packets. The average data size of computing-intensive traffic flows is set to 200 bytes, and the average data size of time-sensitive traffic flows is set to 50 bytes. The period of each flow is randomly picked within the range of 27~10 ms. The relative deadline of each flow is equal to its period. Each simulation experiment without a training process is repeated 1000 times, and the average value is taken as the result of the experiment.

5.2. Result Analysis. We first study the training phase performance of the proposed DISA. The loss function evaluation and the analysis latency (T_f^{ANA}) are two ways to determine how well the model is trained. Figure 6 shows the loss function against the iteration steps of the proposed DISA algorithm at different discount factors γ. The loss value in Figure 6 is obtained during the training process according to (12). It can be seen that all loss values decrease and converge with the increase in the number of iteration steps. After approximately 2,000 iteration steps, the loss value is stable at a low level, which shows the convergence of the DISA algorithm and the effectiveness of the training method. The discount factors γ are set as 0.9, 0.8, and 0.7. We can see that DISA has the fastest convergence when γ = 0.9.

Figure 7 depicts the impact of the three network topology scales (small, medium, and large) with 200 data flows on the convergence performance of the analysis latency. In Figure 7, we can observe that the analysis latency is very high at the beginning of the training process. However, as the number of episodes increases, the curves descend and then fluctuate slightly. This is because the Q value estimation needs to gradually improve, and the accumulative performance reaches a stable state approximately 60 episodes after the model has been fully trained. In addition, our scheme is not sensitive to the network topology setting because the trends of the three curves are similar regardless of the specific values. When the network topology is large, the analysis latency after convergence is the highest, and the value is higher than 6 ms. When the network topology is small, the analysis latency after convergence is the lowest, and the difference between the two topologies is nearly 4 ms. We can conclude that the larger the network topology is, the higher the network complexity and the longer the analysis latency.

1. Initialize the evaluate network with random weights and biases θ;
2. Initialize the target network as a copy of the evaluate network weights and biases θ';
3. Initialize replay memory D;
4. for i = 1 to MaxEpisode do
5.   Initialize state s_t as in equation (6);
6.   Input the system state s_t into the evaluate DQN;
7.   Compute the Q value Q(s_t, a_t; θ);
8.   With probability ε, choose an action a_t;
9.   Execute action a_t, receive a reward r_t, and observe the next state s_{t+1};
10.  Store the interaction tuple (s_t, a_t, r_t, s_{t+1}) in D;
11.  for j = 1 to MaxStep do
12.    Sample a random transition (s_j, a_j, r_j, s_{j+1}) from D;
13.    Compute the target Q value y_j = r_{j+1} + γ Q(s_{j+1}, argmax_{a_j} Q(s_{j+1}, a_j; θ); θ');
14.    Train the network to minimize the loss function L(θ) = E[(y_j − Q(s_j, a_j; θ))^2];
15.    Perform gradient descent with respect to θ;
16.    Update the target network every N_DDQN steps: θ' ← θ;
17.  end for
18. end for

Algorithm 1: Procedures of DDQN-based DISA.
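For reference, a condensed Python rendering of the Algorithm 1 loop follows, reusing ddqn_targets from the sketch above. The network constructor, hyperparameter values, and replay interface are illustrative assumptions rather than the paper's code; only the one-hidden-layer fully connected structure comes from Section 4.2.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

NUM_ACTIONS = 8 * 4            # K * M, illustrative
GAMMA, EPSILON, BATCH, N_DDQN = 0.9, 0.1, 32, 100

def build_net():
    # One hidden fully connected layer, as described in Section 4.2.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS)])

eval_net, target_net = build_net(), build_net()
optimizer = tf.keras.optimizers.Adam(1e-3)
memory = deque(maxlen=10_000)  # replay memory D

def select_action(state):
    # Epsilon-greedy over the evaluate DQN's Q values (Algorithm 1, line 8).
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    return int(tf.argmax(eval_net(state[None, :]), axis=1)[0])

def train_step(step):
    # Algorithm 1, lines 12-16: sample, build DDQN targets, descend, sync.
    s, a, r, s1 = map(np.array, zip(*random.sample(memory, BATCH)))
    y = ddqn_targets(r.astype(np.float32), s1.astype(np.float32),
                     eval_net, target_net, GAMMA)
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(tf.one_hot(a, NUM_ACTIONS)
                          * eval_net(s.astype(np.float32)), axis=1)
        loss = tf.reduce_mean(tf.square(y - q))   # equation (12)
    grads = tape.gradient(loss, eval_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, eval_net.trainable_variables))
    if step % N_DDQN == 0:                        # periodic target sync
        target_net.set_weights(eval_net.get_weights())
```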

Figure 5: Topology considered in our experiment. Every gateway is attached to 3 to 5 end devices.

Figure 6: DISA loss function for different discount factors (γ = 0.9, 0.8, and 0.7; loss versus training steps).


At a real industrial site, the scheduling scheme is required to generate an available schedule within seconds when a traffic flow occurs. We compare the network performance of our proposed DISA against the following three schemes.

(1) Rate monotonic (RM) scheduling [27]: in this method, traffic priorities are assigned statically and inversely proportional to the traffic periods

(2) Earliest deadline first (EDF) scheduling [28]: EDF assigns priorities based on absolute deadlines, and the traffic flow with the earliest deadline has the highest priority

(3) Genetic algorithm (GA) [29]: the GA converts two separate sets of routing and scheduling constraints into one set of constraints and uses a single step to solve the scheduling problem

The experiment considers a traffic set schedulable if the scheduler returns a feasible solution within the time limit. Figure 8 illustrates the schedulability of the RM, GA, EDF, and DISA algorithms under different network traffic loads. As shown in Figure 8, all the algorithms are schedulable when the number of traffic flows is below 125, and the schedulability drops dramatically after the network load increases. As the network load increases, additional constraints make it harder for the scheduler to find a feasible solution. The schedulability of the four algorithms is always 100% with a network load of less than 100. In addition, the value obtained by DISA is higher than those of RM, GA, and EDF. We can see that, compared to the RM algorithm, DISA can provide scheduling for more than 50 additional traffic flows because DISA can usually select more suitable time slots and make more intelligent decisions than traditional algorithms due to the DRL process.

Next, we present the bandwidth consumption of the RM, GA, EDF, and DISA algorithms under different network traffic loads. Here, bandwidth consumption is defined as the ratio of the bandwidth occupied by the traffic flows to the total bandwidth of the occupied wireless links. This ratio reflects the frequency of use of the wireless link: the larger the value is, the more consistently the resources are allocated. In Figure 9, we test the bandwidth consumption performance of the four algorithms. We assume that the network load is below 150 so that all traffic flows are scheduled. As shown in Figure 9, the bandwidth consumption of the four algorithms increases as the traffic load increases because the new incoming traffic flows need more bandwidth resources. We can see that the RM algorithm and the EDF algorithm have the highest bandwidth consumption because they use a fixed scheduling strategy every time. The constraints in the GA algorithm include load balancing, while DISA can intelligently arrange the network status, so their bandwidth consumption is relatively low.

Figure 7: Analysis latency of DISA in different network topologies (small, middle, and large; analysis latency in ms versus episode).

Figure 8: Schedulability of different scheduling algorithms (RM, GA, EDF, and DISA; schedulability in % versus number of flows).

Figure 9: Bandwidth utilization in different scheduling algorithms (RM, GA, EDF, and DISA; bandwidth consumption in % versus number of flows).


We also evaluate the average delivery time of the RM, GA, EDF, and DISA algorithms under different network traffic loads. Here, the average delivery time includes the end-to-end transmission time of the traffic flows and the analysis latency of the traffic flows. As shown in Figure 10, the average delivery time of the four algorithms increases as the traffic load increases because the transmission delay is amortized over each traffic flow. It can be observed that the average delivery time is closely related to the bandwidth consumption of the links in the network. High bandwidth consumption means that these links have fewer bandwidth resources, and traffic flowing through these links will incur higher transmission delay. The high schedulability of the DISA algorithm translates into shorter end-to-end delay and better utilization of link bandwidth. DISA can achieve load balancing on different links and achieves the optimal average delivery time. We can see that the RM algorithm and the EDF algorithm have the longest average delivery time, which is nearly 60 ms longer than that of DISA, the shortest.

We analyze the probability of successful scheduling when the network load is heavy and packet loss occurs. The comparison result for the four algorithms under different packet loss ratios is shown in Figure 11. Here, we assume that the experiment runs on a large topology with 175 traffic flows. Due to insufficient network resources, packet loss occurs randomly. The larger the packet loss ratio is, the fewer idle time slots remain in the network for scheduling. In Figure 11, we can see that when packet loss occurs in the network, all scheduling algorithms have a failure ratio, and the situation becomes increasingly worse. For instance, when the packet loss ratio changes from 5% to 25%, the successful scheduling ratio of DISA drops by nearly 50%. Nevertheless, the successful scheduling ratio of DISA is significantly higher than that of the other three algorithms. DISA ensures network resource allocation by sensing the network status and thus achieves successful scheduling with a higher probability.

Finally, in order to verify the efficiency of the DISA algorithm, we analyze the runtime of the algorithm under different network traffic loads. Here, we only compare the DISA algorithm and the EDF algorithm because they are similar in time scale. As shown in Figure 12, the running time of both algorithms increases as the traffic load increases. The running time of the DISA algorithm shows a gentle upward trend, while the trend of the EDF algorithm is more pronounced. This is because the DISA algorithm has a training process, and its requirement for computing resources in the network does not change much. On the other hand, with the increase in traffic load, the computational complexity of the EDF algorithm increases, and its need for CPU computing resources grows.

Figure 10: Average delivery time in different scheduling algorithms (RM, GA, EDF, and DISA; average delivery time in ms versus number of flows).

Figure 11: Successful scheduling ratio in different scheduling algorithms (RM, GA, EDF, and DISA; successful scheduling ratio in % versus packet loss ratio in %).

Figure 12: Runtime in different scheduling algorithms (DISA and EDF; runtime in seconds versus number of flows).

6. Conclusions

Millions of IIoT end devices generate billions of bytes of data at the edge of the network. Driven by this trend, the combination of edge computing and deep reinforcement learning has received a tremendous amount of attention. In this paper, we proposed a deep intelligent scheduling algorithm (DISA) for the cloud-edge environment. By employing the classic double deep Q network (DDQN) architecture in intelligent scheduling, our DISA can identify reasonable channel and time slot combinations with competitive performance. Following the DDQN framework, the agent separates action selection from the target Q network, leading to more effective parameter training. Extensive simulation experiments were implemented, demonstrating that our DISA can obtain better network performance than the traditional scheduling schemes.

Data Availability

The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1702000 and in part by the Fundamental Research Funds for the Central Universities under Grants N2116012, N2116013, and N180708009.

References

[1] K. K. Raymond, G. Stefanos, and J. H. Park, "Cryptographic solutions for industrial Internet-of-things: research challenges and opportunities," IEEE Transactions on Industrial Informatics, vol. 14, no. 8, pp. 3567–3569, 2018.

[2] M. Z. Hasan and H. Al-Rizzo, "Optimization of sensor deployment for industrial Internet of things using a multiswarm algorithm," IEEE Internet of Things Journal, vol. 6, no. 6, pp. 10344–10362, 2019.

[3] K. Tange, M. De Donno, X. Fafoutis, and N. Dragoni, "A systematic survey of industrial Internet of things security: requirements and fog computing opportunities," IEEE Communications Surveys & Tutorials, vol. 22, no. 4, pp. 2489–2520, 2020.

[4] X. Li, D. Li, J. Wan, C. Liu, and M. Imran, "Adaptive transmission optimization in SDN-based industrial Internet of things with edge computing," IEEE Internet of Things Journal, vol. 5, no. 3, pp. 1351–1360, 2018.

[5] Z. Ning, P. Dong, X. Wang et al., "Mobile edge computing enabled 5G health monitoring for Internet of medical things: a decentralized game theoretic approach," IEEE Journal on Selected Areas in Communications, vol. 39, no. 2, pp. 463–478, 2021.

[6] M. Aazam, K. A. Harras, and S. Zeadally, "Fog computing for 5G tactile industrial Internet of things: QoE-aware resource allocation model," IEEE Transactions on Industrial Informatics, vol. 15, no. 5, pp. 3085–3092, 2019.

[7] V. Kotsiou, G. Z. Papadopoulos, P. Chatzimisios, and F. Theoleyre, "LDSF: low-latency distributed scheduling function for industrial Internet of things," IEEE Internet of Things Journal, vol. 7, no. 9, pp. 8688–8699, 2020.

[8] Z. Ning, P. Dong, X. Wang et al., "Partial computation offloading and adaptive task scheduling for 5G-enabled vehicular networks," IEEE Transactions on Mobile Computing, p. 1, 2020.

[9] M. O. Ojo, S. Giordano, D. Adami, and M. Pagano, "Throughput maximizing and fair scheduling algorithms in industrial Internet of things networks," IEEE Transactions on Industrial Informatics, vol. 15, no. 6, pp. 3400–3410, 2019.

[10] Z. Ning, S. Sun, X. Wang et al., "Blockchain-enabled intelligent transportation systems: a distributed crowdsensing framework," IEEE Transactions on Mobile Computing, p. 1, 2021.

[11] Z. Lv, Y. Han, A. K. Singh, G. Manogaran, and H. Lv, "Trustworthiness in industrial IoT systems based on artificial intelligence," IEEE Transactions on Industrial Informatics, vol. 17, no. 2, pp. 1496–1504, 2021.

[12] K. Kaur, S. Garg, G. S. Aujla, N. Kumar, J. J. Rodrigues, and M. Guizani, "Edge computing in the industrial Internet of things environment: software-defined networks-based edge-cloud interplay," IEEE Communications Magazine, vol. 56, no. 2, pp. 44–51, 2018.

[13] T. Qiu, J. Chi, X. Zhou, Z. Ning, M. Atiquzzaman, and D. O. Wu, "Edge computing in industrial Internet of things: architecture, advances and challenges," IEEE Communications Surveys & Tutorials, vol. 22, no. 4, pp. 2462–2488, 2020.

[14] H. Qi, J. Wang, W. Li, Y. Wang, and T. Qiu, "A blockchain-driven IIoT traffic classification service for edge computing," IEEE Internet of Things Journal, vol. 8, no. 4, pp. 2124–2134, 2021.

[15] J. Pan and J. McElhannon, "Future edge cloud and edge computing for Internet of things applications," IEEE Internet of Things Journal, vol. 5, no. 1, pp. 439–449, 2018.

[16] Y. Ai, M. Peng, and K. Zhang, "Edge computing technologies for Internet of things: a primer," Journal of Digital Communications and Networks, vol. 4, pp. 77–86, 2018.

[17] T. Qiu, K. Zheng, M. Han, C. P. Chen, and M. Xu, "A data-emergency-aware scheduling scheme for Internet of things in smart cities," IEEE Transactions on Industrial Informatics, vol. 14, no. 5, pp. 2042–2051, 2018.

[18] H. Tang, J. Wang, L. Song, and J. Song, "Minimizing age of information with power constraints: multi-user opportunistic scheduling in multi-state time-varying channels," IEEE Journal on Selected Areas in Communications, vol. 38, no. 5, pp. 854–868, 2020.

[19] X. Wang, Z. Ning, S. Guo, M. Wen, and V. Poor, "Minimizing the age-of-critical-information: an imitation learning-based scheduling approach under partial observations," IEEE Transactions on Mobile Computing, 2021.

[20] J. Wang, Y. Zhang, Y. Liu, and N. Wu, "Multiagent and bargaining-game-based real-time scheduling for Internet of things-enabled flexible job shop," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2518–2531, 2019.

[21] X. Lyu, W. Ni, H. Tian et al., "Optimal schedule of mobile edge computing for Internet of things using partial information," IEEE Journal on Selected Areas in Communications, vol. 35, no. 11, pp. 2606–2615, 2017.

[22] X. Wang, Z. Ning, S. Guo, and L. Wang, "Imitation learning enabled task scheduling for online vehicular edge computing," IEEE Transactions on Mobile Computing, 2020.

[23] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: deep learning for the Internet of things with edge computing," IEEE Network, vol. 64, no. 6, pp. 96–101, 2018.

[24] H. Xu, X. Liu, W. G. Hatcher, G. Xu, W. Liao, and W. Yu, "Priority-aware reinforcement learning-based integrated design of networking and control for industrial Internet of things," IEEE Internet of Things Journal, vol. 8, no. 6, pp. 4668–4680, 2021.

[25] H. Xu, X. Liu, W. Yu, D. Griffith, and N. Golmie, "Reinforcement learning-based control and networking co-design for industrial Internet of things," IEEE Journal on Selected Areas in Communications, vol. 38, no. 5, pp. 885–898, 2020.

[26] X. Guo, H. Lin, Z. Li, and M. Peng, "Deep-reinforcement-learning-based QoS-aware secure routing for SDN-IoT," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6242–6251, 2020.

[27] H. Wang, L. Shu, W. Yin, Y. Xiao, and J. Cao, "Hyperbolic utilization bounds for rate monotonic scheduling on homogeneous multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1510–1521, 2014.

[28] A. A. Ayele, V. S. Rao, K. G. Dileep, and R. K. Bokka, "Combining EDF and LST to enhance the performance of real-time task scheduling," in International Conference on ICT in Business Industry & Government (ICTBIG), Indore, India, 2016.

[29] M. Pahlevan and R. Obermaisser, "Genetic algorithm for scheduling time-triggered traffic in time-sensitive networks," in IEEE 23rd International Conference on Emerging Technologies and Factory Automation (ETFA), vol. 1, pp. 337–344, Turin, Italy, 2018.
