
Integrating Imitation Learning with Human Driving Data into Reinforcement Learning to Improve Training Efficiency for Autonomous Driving

Heidi Lu
The Harker School, 500 Saratoga Avenue, San Jose, CA 95129

Abstract

Two current methods used to train autonomous cars are reinforcement learning (RL) and imitation learning (IL). This research develops a new learning methodology and systematic approach, in both a simulated and a smaller real-world environment, by integrating supervised imitation learning into reinforcement learning to make the RL training data collection process more effective and efficient. By combining the two methods, the proposed approach leverages the advantages of both RL and IL. First, a real mini-scale robot car was assembled and trained on a 6-foot by 9-foot real-world track using imitation learning. During this process, a handle controller was used to drive the mini-scale robot car around the track in imitation of a human expert driver, and the actions were recorded using Microsoft AirSim's API, yielding 331 accurate, human-like reward training samples. An agent was then trained in the Microsoft AirSim simulator using reinforcement learning for 6 hours, with the 331 reward samples from imitation learning supplied as initial training data. After the 6-hour training period, the mini-scale robot car was able to drive full laps around the 6-foot by 9-foot track autonomously, whereas the car trained with pure RL was unable to complete one full lap even after 30 hours of training. With 80% less training time, the new methodology produced significantly more average reward per hour. The new methodology therefore saves a significant amount of training time and can be used to accelerate the adoption of RL in autonomous driving, which would help produce more efficient and better results in the long run when applied to real-life scenarios.

Key Words: Reinforcement Learning (RL), Imitation Learning (IL), Autonomous Driving, Human Driving Data, CNN

I. Introduction:

The World Health Organization reports that approximately 1.3 million people die every year in road traffic accidents, which means nearly 3,500 lives per day could be saved [1]. According to Accenture and the Stevens Institute of Technology in Hoboken, N.J., the auto insurance industry, worth more than $225 billion, will see traditional premiums drop by nearly 20% over the next 30 years [3]. Autonomous driving is quickly becoming a reality. In the imitation learning process, the agent observes and learns from expert demonstrations [4]. However, collecting sufficiently diverse data is challenging because samples tend to be positively biased and the agent's performance is limited by the expert driver's performance. In the reinforcement learning process, the agent learns from trial and error guided by a designated reward function, but data collection is cumbersome: focusing only on positive scenarios leads to biased sampling, and collecting negative samples can be extremely costly because it risks physical damage to the car [5].

Most common autonomous driving cases have been solved so far, and many test cars driving around today still have a safety driver behind the wheel. Most autonomous driving companies are currently focused on solving the last 10% of corner cases so they can remove the safety driver (ensuring 99.99% safety) and begin operating commercially. Waymo has removed the safety driver from its robot taxi service in a very limited region of Phoenix, but not in all areas [10]. Supervised machine learning (ML) is commonly used by autonomous driving companies today, but it requires heavy data labelling and is not a good solution for the last 10% of corner cases: it only knows how to handle labelled scenarios, whereas corner cases are typically the situations that have not been experienced or labelled before. RL is a different approach, since it learns from its own trial and error and thus does not require human-labelled data [8]. Like a human driver, it can handle unseen corner cases by generalizing from similar situations it has experienced before. Unfortunately, collecting diverse driving data for RL is very challenging, which is why RL is rarely used by most autonomous driving companies; current RL development and methods remain largely confined to research papers [13]. In the future, however, RL will be implemented by autonomous driving companies alongside supervised ML in their software, focusing on motion planning, which will help solve the complicated corner cases. Other, routine driving cases can be handled by rule-based coding and supervised ML trained on labelled data.

This paper proposes a new methodology that integrates imitation learning with efficient human driving data to accelerate reinforcement learning for autonomous driving through a more efficient and effective training data collection process. By combining both methods, the advantages of each can be leveraged: reinforcement learning benefits from the guidance of a human driver, while imitation learning serves as a foundation that bootstraps reinforcement learning training by providing the necessary initial training data and a basic pretrained model. Ultimately, reinforcement learning does not limit the agent's performance and requires no expensive data labelling, which are major advantages over imitation learning, and it can therefore be used to effectively solve the last 10% of autonomous driving corner cases.

II. Related Works

As can be seen from Table 1, most previous research focuses on proving either that RL is more effective than supervised IL, or that combining supervised IL with RL outperforms pure IL or pure RL. With the development of deep reinforcement learning, the field has become a powerful learning framework capable of learning complex policies in high-dimensional environments [18]. A 2019 survey summarizes deep reinforcement learning (DRL), provides a taxonomy of automated driving tasks where (D)RL methods have been employed, and highlights key challenges and methods to evaluate, test, and robustify existing solutions in RL and IL. The authors formalized and organized RL applications for autonomous driving and found that the domain suits RL but still suffers from many challenges that remain to be resolved. In October 2019, Tianqi Wang et al. [5] combined imitation learning and reinforcement learning in a simulated environment to train a car to drive under different weather conditions. Also in October 2019, Fenjiro Youssef et al. [21] studied the optimal combination of IL and RL for self-driving cars. Testing in a simulated environment, their Advantage Actor-Critic from Demonstrations optimally Constrained (A2CfDoC) model outperformed several existing algorithms in terms of speed and accuracy and surpassed the expert's level of performance using RL. Additionally, in September 2020, Rousslan F. J. Dossa et al. [11] conducted an experiment on a hybrid of RL and IL for human-like agents; they showed that these hybrid agents demonstrated behavior similar to that of human experts and were able to surpass the human expert in each given scenario.

In summary, previous research has focused on proving that combining the two methods is better than either one alone, or on testing enhanced algorithms in simulated environments. However, it is equally important to focus on how to apply these methods to real-world autonomous driving work. This research aims to fill that gap; it is also the first work focused on pretraining with IL to efficiently generate good reward datasets for bootstrapping RL, since RL is the ultimate solution to the last 10% of corner cases. A mini-scale robot car was built and used to test the proposed methodology in a smaller, real-world environment. In the test, a handle controller was used to drive the mini-scale robot car, and the actions were recorded using Microsoft AirSim's API to generate accurate, human-like performance for the "reward" datasets used in subsequent RL training. The test results, based on robot car performance, show that the proposed methodology saved 80% of training time compared to pure RL while also producing better performance.

III. Methodology

The complete comparison of the proposed new architecture with the previously existing architecture is shown in Figure 2. Figure 2A illustrates how driving using machine learning can be better than human driving, and how the existing architecture focuses on proving and making the rewards generated by reinforcement learning better than the reward data generated by human expert driving. Figure 2B illustrates how quickly machine learning can reach a level similar to human expert driving, and how the new architecture works on making the reward generated by reinforcement learning better than the reward generated by human expert driving data in less training time.

The IL agent can only be as good as the expert's demonstrations, and because it lacks generalization due to expensive labelling costs, pure IL is not a good choice. When facing unknown environments, however, RL and IL combined play to each method's greatest strengths and can achieve the highest agent performance; their capacity to manage new situations by learning through trial, error, and exploration makes them favorable.

The key difference between the proposed architecture shown in Figure 2B and the pre-existing architecture shown in Figure 2A is that RExpert, generated by AExpert, was used as input to accelerate ROriginal, the reward generated by the reinforcement learning agent AOriginal, whereas other researchers used RExpert generated by AExpert only as a baseline for comparison against ROriginal. The pre-existing architecture focused on proving and making ROriginal better than RExpert, while the proposed architecture worked on making ROriginal similar to RExpert but achieving it in less training time. This research holds that the robot should drive as well as or better than a human, while retaining some of the high performance of RL.

R = Wd · Rdistance + Ws · Rspeed
Rdistance = min(1, D / Di), Rspeed = min(1, V / Vi)
Wd = Ws = 0.5
Di = 10 cm (ideal distance to the track edge), Vi = 10 cm/s (ideal speed)
One reward is generated when R is close to 1.

In the proposed new architecture, R is the reward function defined in Figure 2B. The distance to the nearest obstacle and the current vehicle speed are the two components of the reward function, so that only staying far from obstacles and driving fast at the same time can result in a high reward. RExpert is the reward data generated from the handle controller, i.e. by the "expert" human driver acting as AExpert. It serves as the input reward samples for DDPG (Deep Deterministic Policy Gradient), the popular RL algorithm used here as the reinforcement learning algorithm. Rdistance is based on the distance to the nearest obstacle, in this case the road edges, and Rspeed is based on the jetbot robot car's current speed. Two threshold values define the ideal behavior: Di is the ideal distance (chosen in this research as 10 cm from the edge) and Vi is the ideal speed (10 cm/s). Wd and Ws are the weights of the distance and speed rewards; they sum to one, and in this research Wd = Ws = 0.5. With this setting, the reward function is squeezed into the interval [0, 1], and a reward is counted only when R is close to 1. 10 cm was chosen as the ideal distance because the jetbot robot car is centered when it is 10 cm away from the edge. ROriginal is the actual reward generated by reinforcement learning, and AOriginal is the agent used in the RL simulator.
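To make the reward shaping concrete, the sketch below implements the definitions above in Python. It is a minimal illustration rather than the project's actual code: the argument names and the 0.9 threshold used to decide when R is "close to 1" are assumptions.

```python
# Minimal sketch of the reward function described above: a weighted sum of a
# distance term and a speed term, each clipped to [0, 1].

D_IDEAL_CM = 10.0    # Di: ideal distance to the track edge (cm)
V_IDEAL_CM_S = 10.0  # Vi: ideal speed (cm/s)
W_DISTANCE = 0.5     # Wd
W_SPEED = 0.5        # Ws


def reward(distance_to_edge_cm: float, speed_cm_s: float) -> float:
    """Return R in [0, 1]: high only when the car is both far from the
    edge and driving fast, per the definitions in the text."""
    r_distance = min(1.0, distance_to_edge_cm / D_IDEAL_CM)
    r_speed = min(1.0, speed_cm_s / V_IDEAL_CM_S)
    return W_DISTANCE * r_distance + W_SPEED * r_speed


def is_reward_sample(distance_to_edge_cm: float, speed_cm_s: float,
                     threshold: float = 0.9) -> bool:
    """A sample counts as a reward when R is close to 1
    (the threshold is an illustrative choice)."""
    return reward(distance_to_edge_cm, speed_cm_s) >= threshold
```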

IV. Experimental Set-up

As seen in Figures 3A and 3B, the mini-scale robot car is composed of various hardware and software parts. Combined, it is a multi-functional jetbot robot car powered by the Nvidia Jetson Nano AI platform. The Jetson Nano Nvidia firmware image was flashed to a 64 GB SD card. After the car was assembled and its firmware environment was set up, the jetbot robot car was assigned an IP address and connected to Wi-Fi.

The vehicle's chassis is made of green aluminum alloy and follows a unique tractor-style mechanical structure. It is also equipped with a 3-degree-of-freedom lifting platform and an 8-megapixel HD Raspberry Pi camera, which gives a real-time view of the scene ahead. The microprocessor is a quad-core ARM A57 CPU paired with a 128-core Nvidia Maxwell GPU. The operating system is Ubuntu 18.04 LTS; the input is images from the HD camera, and the outputs drive an L-type 370 motor, a buzzer, the 3-DOF camera platform, and an OLED display. Power is supplied by an 18650 battery pack at 12.6 V [23].

A. Setting up the track:
As can also be seen in Figure 3C, the track used as the testing environment is a 6-foot by 9-foot autonomous driving track map. It is made of durable, waterproof tarpaulin material and is large enough to replicate a real-life road and track environment. This track was specifically chosen because it was best suited to the small front-looking camera used by the jetbot robot car. It is composed of two lanes of suitable width, with white dotted lines and yellow solid lines on both sides that can be used as reference objects to control the movement direction of the jetbot robot car. Compared with an ordinary circular track, this track has some 90-degree angles, corners of different curvature, and a simulated sidewalk, which makes it possible to test features such as stop signs and pedestrian avoidance.

B. Collecting the data:
One of the most important new breakthroughs was the reward data collection process, which used the AirSim API to visualize and record images labelled with rewards. Each image generates a pair of x, y values corresponding to speed and steering angle. As seen in Figure 3D, the handle controller was used to manually drive the robot car to different locations on the track, and a "green dot" was placed at the target direction in each location as a reward; this process produced a total of 331 initial reward datasets. As seen in Figure 4, the dotted light blue line with triangles represents the 331 collected reward datasets used for pure imitation learning.
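A minimal sketch of how such labelled reward samples could be stored is shown below. The camera and handle-controller interfaces (`get_camera_frame()`, `read_controller()`, `record_button_pressed()`) are hypothetical placeholders, and the file-naming convention simply encodes the x and y label in the file name so that the training step described in Section IV-C can parse it back out.

```python
import uuid
from pathlib import Path

import cv2  # OpenCV, assumed available on the Jetson Nano

DATASET_DIR = Path("dataset_xy")
DATASET_DIR.mkdir(exist_ok=True)


def save_reward_sample(frame, x_speed: float, y_steering: float) -> Path:
    """Save one camera frame with its (x, y) label encoded in the file name,
    e.g. 'xy_050_120_<uuid>.jpg', so training can parse the label later."""
    name = f"xy_{int(x_speed):03d}_{int(y_steering):03d}_{uuid.uuid4().hex}.jpg"
    path = DATASET_DIR / name
    cv2.imwrite(str(path), frame)
    return path


# Hypothetical collection loop: drive with the handle controller and press a
# button to record the current frame plus the chosen (speed, steering) target.
# while collecting:
#     frame = get_camera_frame()               # placeholder camera interface
#     x_speed, y_steering = read_controller()  # placeholder controller interface
#     if record_button_pressed():
#         save_reward_sample(frame, x_speed, y_steering)
```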

C. Training the model:
A block diagram of the end-to-end RL training system from Nvidia is shown in Figure 1. Images collected for the steering gear (rotation) and the motor (speed) are fed into a convolutional neural network (CNN), which computes a proposed steering command. The proposed command is compared to the desired command for that image, and the CNN weights are adjusted to bring the CNN output closer to the desired output. The adjustment is accomplished using back-propagation as implemented in Nvidia's machine learning package. Once trained, the CNN can generate steering angle and speed from the video images of the single center front camera. The steering and speed commands generated by the CNN control the car through a drive-by-wire interface.

For this project, the smallest network, ResNet-18, was used because it provides a good balance of performance and efficiency on the Jetson Nano platform. The chosen action (speed, steering) is concatenated into the final fully connected layer's input. The neural network was trained to take an input image and output a set of x (speed) and y (steering) values corresponding to a target. The model training consisted of 5 main steps:

1. Using the PyTorch deep learning framework to train the model to identify road conditions for autonomous driving.

2. Creating a custom torch.utils.data.Dataset for loading the 331 collected images and parsing the x and y values from each image file name.

3. Splitting the dataset into training data (90%) and testing data (10%), the latter used to verify the accuracy of the trained model.

4. Training the regression model with a batch size of 64 (based on the Jetson Nano GPU) and 50 iterations, then training the pure RL models for increasing increments of time (3 hours, 6 hours, 15 hours, 30 hours) and the newly proposed model for only 6 hours. In this project, 50 iterations took approximately 1 hour to train.

5. Generating .pth files of the trained models for testing (a minimal sketch of these steps is given below).
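The sketch below illustrates steps 2 through 5 with PyTorch. It is an assumed reconstruction rather than the project's actual training script: the dataset directory name, file-name label format, Adam optimizer, learning rate, MSE loss, and output file name are all illustrative choices.

```python
import glob
import os

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from PIL import Image
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms


class XYDataset(Dataset):
    """Loads images named like 'xy_<x>_<y>_<uuid>.jpg' and parses the
    (x, y) regression target from the file name."""

    def __init__(self, directory):
        self.paths = sorted(glob.glob(os.path.join(directory, "*.jpg")))
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = self.transform(Image.open(path).convert("RGB"))
        parts = os.path.basename(path).split("_")
        x, y = float(parts[1]), float(parts[2])
        return image, torch.tensor([x, y], dtype=torch.float32)


dataset = XYDataset("dataset_xy")
n_test = len(dataset) // 10                      # 90/10 train/test split
train_set, test_set = random_split(dataset, [len(dataset) - n_test, n_test])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)    # regress (x, y)
model = model.to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(50):                          # 50 iterations, per the text
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "best_steering_model_xy.pth")
```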

D. Uploading the trained algorithm:
Next, the trained model was uploaded to the jetbot robot car, and the motion algorithm was implemented using a proportional-integral-derivative (PID) controller to control the car. Values were then assigned and adjusted through sliders for the following indexes to get the jetbot robot car driving in its best condition: A) speed_gain_index, which sets the base speed to start the car; B) steering_gain_index, which is reduced if the car oscillates, until the driving becomes smooth; C) steering_bias_index, which is adjusted if the car leans too far to the right or left of the track, until the robot car is centered on the track again.
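A minimal sketch of how the uploaded model and these three sliders could drive the motors is given below. The PD-style steering correction and the `set_motors(left, right)` placeholder are assumptions; the paper itself only specifies that a PID controller and the three indexes above are used.

```python
import torch

# Gains tuned via the sliders described above (illustrative starting values).
speed_gain = 0.3        # speed_gain_index
steering_gain = 0.2     # steering_gain_index
steering_bias = 0.0     # steering_bias_index
steering_kd = 0.1       # derivative gain for the PD-style steering correction

steering_last = 0.0


def execute(image_tensor, model, set_motors):
    """One control step: run the trained model on a preprocessed camera frame
    and turn its (x = speed, y = steering) output into motor commands.
    `set_motors(left, right)` is a placeholder for the robot's motor interface."""
    global steering_last
    with torch.no_grad():
        x_speed, y_steering = model(image_tensor).squeeze().tolist()

    # PD-style correction plus bias, per the slider tuning described above.
    steering = (y_steering * steering_gain
                + (y_steering - steering_last) * steering_kd
                + steering_bias)
    steering_last = y_steering

    speed = x_speed * speed_gain
    left = max(min(speed + steering, 1.0), 0.0)
    right = max(min(speed - steering, 1.0), 0.0)
    set_motors(left, right)
```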

E. Testing and Optimization:
The final step is track testing of the autonomous driving algorithms on the 6-foot by 9-foot driving track map. The autonomous driving test performs the following steps: a) processing camera images; b) running the neural network based on the trained model; c) calculating the approximate steering value and vehicle speed for each camera image; d) controlling the motor of the jetbot robot car using the PID controller. The test result was measured by the number of interventions when the car runs off the track. In the end, the proposed new methodology achieved 100%, fully autonomous driving after only 6 hours of training (testing results can be found in reference file 1).

V. Results and Discussion:

In this paper, the number of rewards collected after training for a given amount of time was used to compare the learning efficiency of the three methods: pure RL, pure IL, and the newly proposed method. Figure 4 shows the key graph of average accumulated reward, where the orange stars represent the new method, the blue triangles represent pure IL, and the black squares represent pure RL. One can clearly see the difference between the slow progress of the pure RL method and the flat line of pure IL. These reward values are important and heavily affect the performance of the trained agent.

The original IL generated 331 collected rewards, while the reward policy obtained from pure RL (DDPG), trained from scratch, generated only 11 rewards in 30 hours of training, never performed well, and showed very little improvement. The reward policy obtained from the new method achieved a considerable performance boost over the original IL-trained policy: 376 rewards after 3 hours of training and 431 rewards after 6 hours of training.

The following trained models were loaded to run the autonomous driving program: three pure RL models trained for 3 hours, 6 hours, and 30 hours. The results were as follows:

1. After 3 hours, the robot car was unable to move autonomously for even 2 feet.

2. After 6 hours, the robot car did not show much progress; it was unable to turn corners or drive autonomously for 4 feet.

3. After 30 hours, the robot car was still unable to complete a half lap of the track.

However, after training the new RL model with IL rewards for only 6 hours, the robot car was able to finish multiple full track loops smoothly, using 80% less training time.

The percentage of time the trained model can drive autonomously is determined by counting simulated human interventions. In this project, an intervention occurs when the mini-scale robot car runs completely off the track. In real life, an actual intervention would take about six seconds: the time required for a human to retake control of the vehicle, re-center it, and restart the self-steering mode.


Autonomy Value = (1 − (number of interventions) · 6 [seconds] / testing time [seconds]) × 100%

As a result, approximately 95 interventions in 600 seconds (10 minutes) were recorded during testing of pure RL, while 0 interventions in 600 seconds were recorded during testing of the new method.

● The calculated autonomy value for pure RL is (1 − 95 · 6 / 600) × 100% = 5% (i.e., the car is off the track 95% of the time).

● The autonomy value of the new method is 100%.
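For reference, the calculation above can be reproduced with a small helper; the function name and defaults are illustrative.

```python
def autonomy_value(num_interventions: int, testing_time_s: float,
                   seconds_per_intervention: float = 6.0) -> float:
    """Autonomy Value = (1 - interventions * 6 s / testing time) * 100%."""
    return (1.0 - num_interventions * seconds_per_intervention / testing_time_s) * 100.0


print(autonomy_value(95, 600))  # pure RL: 5.0 (% of time autonomous)
print(autonomy_value(0, 600))   # new method: 100.0
```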

VI. Real World Autonomous Driving Applications

The proposed new method has significant potential when applied to real-world autonomous driving development, primarily because it can be easily scaled by having one human-driven car act as an efficient agent driving on real roads for data collection. The experimental testing environment can be quickly extended to cover thousands of miles of road, which helps increase RL's sampling efficiency, avoids data overfitting, and thus improves the versatility and generality of the RL algorithm. The suggested methodology therefore saves a significant amount of training time and accelerates the adoption of RL in autonomous driving, which would produce more efficient and better results in the long run when applied to real-life scenarios.

VII. Conclusion

This research has presented and demonstrated, for the first time, the use of pre-collected reward data from IL to bootstrap RL, and it showed a significant advantage in RL training efficiency. The training results show that although training via pure RL and training via combined IL and RL would eventually reach similar performance, around 80% of the training time was cut by initially training with the reward data collected through imitation learning. The RL method relies heavily on the computing power of the control unit, and the Nvidia Jetson Nano is a relatively small platform with only a quad-core ARM CPU and a 128-core Nvidia Maxwell GPU. With more powerful control units than the Jetson Nano, the actual saving in training time from combining imitation learning and reinforcement learning may be significantly less than 80%.

Compared to supervised IL, which requires explicit data labelling, the proposed method is much simpler; a main advantage is that the experiments needed only one remote-controlled robot car and one simulated agent car to learn how to handle interaction. The jetbot robot car trained using the proposed methodology appears to drive better and more smoothly than the robot car trained using pure RL, because the driving policy is rewarded for imitating the collected human expert driving behaviors. Still, the jetbot robot car was only tested on a very simple loop track without any complicated scenarios such as lane merging, interacting with other objects (bikers, human-driven cars, pedestrians), stop signs, or traffic lights. For solving the last 10% of corner cases in autonomous driving, there are not enough "bad" examples, and one cannot simply hire someone to crash cars on purpose to produce the "penalized" driving data for training the reinforcement learning model. Human expert driving helps collect only "reward" data for reinforcement learning, while generating the "bad" or "penalized" data through human expert driving would be costly and tragic. In reality, most car accidents are caused by "bad" driving behaviors, which is the main challenge all research projects face. In a really complicated driving environment with other objects, the car may not be able to drive better or more smoothly. In conclusion, the suggested methodology is shown to save a significant amount of RL training time and can be quickly applied to real-world autonomous driving development, but more work is needed to verify its robustness in solving the rare, complicated autonomous driving corner cases.

References:

1. "Road Traffic Injuries." World Health Organization, 7 Feb. 2020, www.who.int/news-room/fact-sheets/detail/road-traffic-injuries.

2. Plumer, Brad. "Cars take up way too much space in cities. New technology could change that." Vox.com, 2016, www.vox.com/a/new-economy-future/cars-cities-technologies.

3. Notte, Jason. "How Self-Driving Cars Affect Insurance." The Simple Dollar, 11 Mar. 2020, www.thesimpledollar.com/insurance/auto/how-do-self-driving-car-features-affect-your-insurance/.

4. Pan, Yunpeng, et al. Agile Autonomous Driving Using End-to-End Deep Imitation Learning. Georgia Institute of Technology, 9 Aug. 2019, arxiv.org/pdf/1709.07174.pdf.

5. Wang, Tianqi, and Dong Eui Chang. Improved Reinforcement Learning through Imitation Learning Pretraining Towards Image-Based Autonomous Driving. 19th International Conference on Control, Automation and Systems, 2019, arxiv.org/pdf/1907.06838.pdf.

6. Lynberg, Matthew. "Automated Vehicles for Safety." NHTSA, 15 June 2020, www.nhtsa.gov/technology-innovation/automated-vehicles.

7. "A Brief History of Autonomous Vehicle Technology." Wired, 2016, www.wired.com/brandlab/2016/03/a-brief-history-of-autonomous-vehicle-technology/.

8. Chopra, Rohan, and Sanjiban Sekhar Roy. "End-to-End Reinforcement Learning for Self-Driving Car." Advances in Intelligent Systems and Computing: Advanced Computing and Intelligent Engineering, Apr. 2019, pp. 53–61, doi:10.1007/978-981-15-1081-6_5.

9. Wang, Sen, et al. "Deep Reinforcement Learning for Autonomous Driving." 19 May 2019.

10. Bussewitz, Cathy. "Waymo Removing Backup Drivers from Its Autonomous Vehicles." ABC News, abcnews.go.com/Technology/wireStory/waymo-removing-backup-drivers-autonomous-vehicles-73502551.

11. Dossa, Rousslan F. J., et al. Hybrid of Reinforcement and Imitation Learning for Human-Like Agents. 2020.

12. "Self-Driving Cars – Facts and Figures." Driverless Guru, Driverless Media Ltd, www.driverlessguru.com/self-driving-cars-facts-and-figures.

13. Chopra, Rohan, and Sanjiban Roy. End-to-End Reinforcement Learning for Self-Driving Car. 2019, web.stanford.edu/~anayebi/projects/CS_239_Final_Project_Writeup.pdf.

14. Zhang, Jiakai, and Kyunghyun Cho. Query-Efficient Imitation Learning for End-to-End Autonomous Driving. arXiv preprint arXiv:1605.06450, 2016.

15. El Sallab, Ahmad, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. End-to-End Deep Reinforcement Learning for Lane Keeping Assist. arXiv preprint arXiv:1612.04340, 2016.

16. Wang, Sen, et al. Deep Reinforcement Learning for Autonomous Driving. arXiv, 19 May 2019, arxiv.org/pdf/1811.11329.pdf.

17. Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. CoRR, abs/1602.01783, 2016, arxiv.org/abs/1602.01783.

18. Kiran, B. Ravi, et al. Deep Reinforcement Learning for Autonomous Driving: A Survey. arXiv, 23 Jan. 2021.

19. Bojarski, Mariusz, et al. End to End Learning for Self-Driving Cars. 2016.

20. Pan, Xinlei. Virtual to Real Reinforcement Learning for Autonomous Driving. arXiv, 26 Sept. 2017.

21. Youssef, F., and B. Houda. "Optimal Combination of Imitation and Reinforcement Learning for Self-Driving Cars." Revue d'Intelligence Artificielle, vol. 33, no. 4, 2019, pp. 265–273, doi.org/10.18280/ria.330402.

22. Zuo, Sixiang, Zhiyang Wang, Xiaorui Zhu, and Yongsheng Ou. Continuous Reinforcement Learning from Human Demonstrations with Integrated Experience Replay for Autonomous Driving. 2017, doi:10.1109/ROBIO.2017.8324787.

23. "Yahboom." JetBot AI Robot Car, www.yahboom.net/study/JETBOT.


Reference file 1: https://www.youtube.com/watch?v=BeFd6wfJTxI

Table 1. Summary of the proposed approach compared with previously reported research.

● Improved Reinforcement Learning through Imitation Learning Pre-training Towards Image-Based Autonomous Driving
Research group: School of Electrical Engineering, Korea Advanced Institute of Science and Technology; Tianqi Wang et al., Oct. 2019 [5]
Methods: Imitation and reinforcement learning, AirSim, ResNet-34, DDPG
Main conclusions: Combined IL and RL showed better performance compared to both pure imitation learning and pure DDPG
Advantages: Drove the car in various weather conditions and environments
Limitations: Only tested in a simulated environment

● Optimal Combination of Imitation and Reinforcement Learning for Self-Driving Cars
Research group: National School of Computer Science and Systems Analysis (ENSIAS), Mohammed V University; Fenjiro Youssef et al., Oct. 2019 [21]
Methods: DQfD, A2C model, imitation and reinforcement learning
Main conclusions: The resulting A2CfDoC model outperformed several existing algorithms in terms of speed and accuracy
Advantages: Surpassed the expertise level using RL
Limitations: Only tested in a simulated environment

● Hybrid of Reinforcement and Imitation Learning for Human-Like Agents
Research group: Graduate School of System Informatics, Kobe University; Rousslan F. J. Dossa et al., Sep. 2020 [11]
Methods: Imitation and reinforcement learning, DDPG, sensitivity test
Main conclusions: The proposed hybrid agent exhibits behavior similar to that of human experts
Advantages: Hybrid agents surpassed the human expert in each scenario
Limitations: Only benefits from a larger population

● Proposed Research: New Methodology to Accelerate Reinforcement Learning for Autonomous Driving Using Imitation Learning & Human Driving Data
Research group: The Harker Upper School; Heidi Lu and Olivia Xu, Mar. 2021
Methods: New imitation learning CNN, optimized reinforcement learning reward collection, DDPG, AirSim's API
Main conclusions: Saved 80% of training time compared to pure reinforcement learning training
Advantages: Used one human-controlled car to collect sufficient training data to speed up the RL training process
Limitations: Only tested in a smaller real-world environment with a mini-scale car


Figure 1. Design of the supervised imitation learning framework developed to train the models, as well as the integration of imitation learning and reinforcement learning. (A) shows a demonstration of human driving and how the data is used to pretrain a model for reinforcement learning and to shape the reward policy, together with the new IL CNN developed, including specific parts like the camera and how the steering wheel inputs are adjusted for shift and rotation. (B) shows a more detailed view of the CNN, the ResNet-18 network, which consists of 18 layers, including a normalization layer, convolutional layers, and fully connected layers.


Figure 2. Two diagrams: the previously existing architecture and the newly proposed architecture. (A) shows a demonstration of how driving using machine learning is better than human driving, and how the existing architecture focuses on proving and making the rewards generated by reinforcement learning better than the reward data generated by human expert driving. (B) shows how quickly machine learning can achieve a level similar to human expert driving, and how the new architecture works on making the reward generated by reinforcement learning better than the reward generated by human expert driving data in less training time.


Figure 3. Testing procedures and hardware design, specifically the jetbot mini-scale robot used. (A) and (B) show the assembled car body as well as the integrated Raspberry Pi camera, steering gear, battery pack, etc. (C) shows the real-world environment set-up of a road map scenario (a 9-foot by 6-foot racing track) and testing the jetbot on it. (D) shows the reward data collection, where human driving data is collected for training.


Figure 4. Graph of the average accumulated reward, where the orange stars represent the new method, the blue triangles represent pure IL, and the black squares represent pure RL. One can see the significant difference between the slow progress of the pure RL method and the flat line of pure IL. These reward values are important and heavily affect the performance of the trained agent. For pure RL, over a period of 30 hours the total accumulated reward was only 11.