Positive Train Control: Developing Lessons Learned About Automation and High Reliability from the Aviation Industry

Greg Placencia, PhD [email protected],

Epstein Department of Industrial and Systems Engineering 3715 McClintock Ave., GER 240 University of Southern California

Los Angeles, California, USA

Number of Words: 6,016

Abstract

Advocates of Positive Train Control (PTC) have long recognized its potential for eliminating or mitigating certain types of rail incidents. However, the effects of its full-scale integration on Class 1, intercity passenger, and commuter railroads are still not well understood. Over the past 30 years, the aviation industry has learned many valuable lessons about automation and technological integration that could provide invaluable insights into integrating PTC and other technologies into today's rail industry. This paper outlines several of these lessons learned and establishes the potential for harvesting others for the rail industry. For example, automation commonly diverts pilot attention from flying to managing flight systems. The crashes of Asiana Flight 214 (AF 214) on July 6, 2013 and Air France Flight 447 (AFF 447) on June 1, 2009 highlighted the dangers of automation overreliance and scenario-based simulations at the expense of pilots developing a basic "stick and rudder" feel for their aircraft. In the context of PTC, anecdotal evidence suggests train operators may face similar challenges, fixating on the onboard PTC system and the information it generates rather than developing a "feel" for the routes they travel. In such cases, operators often develop diminished situational awareness that would otherwise enable them to more quickly recognize and recover from adverse conditions PTC cannot detect, such as trespassers on the rails, or, as was the case at WMATA Fort Totten, to override a faulty automated system. Adapting the lessons learned from catastrophes in other industries promises to better establish High Reliability and Safety within an increasingly automated rail industry.

Background and Introduction

The recent derailment of Amtrak Northeast Regional No. 188 in Philadelphia has renewed demand for better crash prevention technologies like Positive Train Control (PTC). As has been widely reported, the northbound section of track upon which the event occurred did not yet have Amtrak's version of PTC – the Advanced Civil Speed Enforcement System (ACSES) – enabled, which should have detected the overspeed violation and stopped the train [1] – [3], even if the operator was incapacitated as has been suggested. [4] While there is little argument that technologies like PTC or Automatic Train Control (ATC) could have prevented the overspeed violation that directly contributed to the derailment and the resulting deaths of 8 people and over 200 injuries, little has been said about automation's potential for inducing comparable errors into the rail industry. Careful distinctions, though, must be made between automated systems designed to assist operators but that result in "technology induced error," and errors caused by using non-assistive technologies like cell phones. For example, much attention has focused on whether the operator of Amtrak 188 was distracted by technology like a cell call, texting, or an app [5], as was the case in the September 12, 2008 Metrolink 111 collision at Chatsworth, CA. This paper focuses exclusively on automation designed to assist operators.

The emerging incorporation of automation into the rail industry can be likened to that of aviation automation in the early 1980s. On the positive side, empirical data indicate that automation has improved aviation safety considerably.
[15], [22] In 2013, there were 265 passenger fatalities in 29 accidents, a remarkable figure considering a peak of about 2,400 in 1975, and that the commercial aviation industry has roughly doubled in size over the past 15 years. This improvement has been attributed to a new generation of automated fly-by-wire aircraft, more reliable turbine engines, and more effective application of human factors engineering that created improved safety cultures, crew resource management, leadership, and management. However, it has been argued that this trend could reverse given unexpected influences. [15]

In 2013, the FAA identified an increase in manual handling errors based on analysis of flight operations data. [6] As the FAA warned:

Autoflight systems are useful tools for pilots and have improved safety and workload management … However, continuous use of autoflight systems could lead to degradation of the pilot’s ability to quickly recover the aircraft from an undesired state. (author’s emphasis) Operators are encouraged to take an integrated approach by incorporating emphasis of manual flight operations into both line operations and training (initial/upgrade and recurrent). Operational policies should be developed or reviewed to ensure there are appropriate opportunities for pilots to exercise manual flying skills ... In addition, policies should be developed or reviewed to ensure that pilots understand when to use the automated systems ... Augmented crew operations may also limit the ability of some pilots to obtain practice in manual flight operations. Airline operational policies should ensure that all pilots have the appropriate opportunities to exercise the aforementioned knowledge and skills in flight operations. (author’s emphasis) [6]

Aviation disasters like AFF 447 and AF 214 highlighted the FAA's concerns. Several rail accidents could also be considered precursor events indicating the potential for what might be called "technology induced error." For example, the June 22, 2009 WMATA 112 collision at Fort Totten, Washington, DC involved elements of technology induced error involving Automatic Train Control (ATC). [7] The July 24, 2013 Renfe derailment at Santiago de Compostela, Spain, also shows some eerily similar characteristics to the Amtrak derailment, including what could be seen as overreliance on operator-assisting technology. [8] Moreover, cases like the February 24, 2015 Metrolink collision with a pickup truck in Oxnard, California could not have been prevented with current PTC implementations [9], but could falsely be thought to be preventable using automation. Hence, drawing "lessons learned" from incidents involving automation in aviation could provide valuable insight into understanding how assistive technologies could adversely affect rail operators and operations.

Automation in Aviation

Automation has undoubtedly made air travel much safer since researchers began to study its effects on flight safety in 1980. [10] Still, despite nearly 35 years of active research, accidents associated with automation persist. And while automation in rail cabs is much less complex, these concerns still apply. Dr. David Woods notes that several dimensions along which automation operates – team integration, user flexibility, availability of multiple data sources, coupling of divergent systems – can help or hurt human performance depending on how workloads change. [10, pp. 3 – 17] Barry Kantowitz and John Campbell note that automation that makes pilot workloads too high or too low can cause errors. [10, pp. 118 – 121] Automation typically shifts the pilot's role from manual control to supervision, which inevitably decreases piloting skill and overall familiarity with the aircraft "system." It also introduces new forms of error associated with the automated system and requires that one become proficient not only in knowing how to fly, but in how the automation works. [10, pp. 123 – 126] Dr. Najmedin Meshkati adds that, "System designers can neither anticipate all possible scenarios nor foresee all aspects of unfolding emergency. Front-line operators' improvisation via dynamic problem solving and reconfiguration of available resources provide the last resort for preventing a total system failure. Despite advances in automation, operators should remain in charge of controlling and monitoring of safety-critical systems." [20] Captain Dennis Landry echoes these concerns from a practitioner's standpoint, which he labels automation addiction. [11]

Automation Addiction / Fixation

Captain Dennis Landry described automation addiction as follows:

During transition training from the DC-9 to the Airbus A320, I was trained to deal with most events while operating with all available automation. Training activities were devoted to establishing proficiency in the use of automation. There was little emphasis on operating the aircraft without all of the automation, unless the specific automation feature was inoperative or specifically denied as part of the training. … … I found little need or perceived opportunity to practice basic attitude instrument skills; the aircraft generally operated flawlessly. Many of the automation exceptions I
experienced were induced by operator errors or ATC [Air Traffic Controller] demands rather than equipment malfunctions. … When the automation was intentionally failed or was out of sequence with the desired flight path, I found myself scrambling to maintain aircraft control. My cognitive efforts were devoted to the simple task of maintaining airspeed, altitude and heading control while navigating somewhere without the benefit of the flight director and “green line” on the Navigation Display (a solid green line on the Navigation Display indicates the aircraft is on a course programmed in the FMGC). I found myself nearly overwhelmed with these tasks and unable to focus on the training or proper analysis of the other tasks. My instrument scan and management of navigational radios was virtually non-existent. This was an alarming change from my basic attitude instrument proficiency level during initial training. I found my experiences were not unique; many other pilots expressed similar concerns regarding the effects of automation on their flying skills. Discomfort with various levels of reduced automation was a constant refrain. (author’s emphasis) [11]

This author has heard similar concerns about PTC among longtime rail people. As one discussed with me, they were already worried that rail operator skills would inevitably degrade, with newer operators relying solely on PTC feedback rather than developing a "feel" and appreciation for their routes, such as where certain obstructions were or how conditions varied during the day and/or season. In addition, they imagined that operators would "play" to see how closely they could operate within the performance envelope allowed by PTC, e.g., seeing how close they could come to activating automatic braking without doing so. The author had already heard anecdotal evidence about degraded skills among rail dispatchers from another rail person. They explained that prior to the development of modern dispatching software, dispatchers developed mental models and manual tools that actively allowed them to keep track of trains along their routes. Modern dispatchers, they thought, relied far too heavily on dispatching software.

In the course of observing PTC implementation at Metrolink / SCRRA, this author made observations similar to those of his rail sources. For example, I examined cab designs of different locomotives and cab cars in the Metrolink fleet in Southern California during the initial integration and testing of the onboard PTC components to be used by operators. By design, the onboard PTC interface is located toward the left periphery of the operator's vision so as to encourage normal operations. I have also observed PTC during a live demonstration run with an experienced operator. The operator displayed a practiced sense of where braking should be applied along the route, but the operator's braking patterns did not match those calculated by PTC, setting off multiple warning alarms during the demonstration. One interpretation of this discrepancy is that PTC's more conservative pattern encourages more cautious operator behavior so as to avoid the alarms. However, we can also see it as encouraging the operator to focus on monitoring the onboard PTC system rather than on operating the train, including observing rail conditions. In addition, I observed and spoke with dispatchers who told me about cases in which the dispatching software had caused them significant problems during operations. While these are not empirical studies, they point to a definite need for future study.

The WMATA 112 crash is a critical example of potential automation addiction in the rail industry. The NTSB investigation noted that WMATA policy required operators to use the automated ATC mode over manual operation. The operator of train 214 (which would be struck) had violated standard operating procedure on several occasions prior to the 2009 crash because of concern that the system was not making proper station stops. But as he could not provide evidence to justify operating in manual mode, he was subsequently reprimanded for violating the prohibition on manual operation and counseled about using manual mode. However, legitimate concerns about the automated system near Fort Totten still existed. [7]

On December 18, 2007, WMATA technicians replaced the GRS impedance bond at chain marker 311+71 on track B2 as part of a traction upgrade program. The work crew verified the track circuit was operating properly along that section at the time of replacement. Shortly thereafter, track circuits along the section, including circuit B2-304, failed to detect trains. On February 28, 2008, a work order was opened for track circuit B2-304 because of a "bobbing" signal. The order was subsequently closed on September 26, 2008, with no indication of corrective measures. Advanced Information Management (AIM) historical records further indicated that intermittent bobbing continued, despite the closed work order, with no indication of potential corrective measures. [7]

On June 22, 2009, B2-304 appeared to have failed, rendering train 214 a "ghost" to the ATC system. The operator of train 214, who was operating manually, stopped the train following the faulty signal command to stop. Train 112's operator, though, was operating automatically per WMATA policy; as soon as train 214 "disappeared," train 112 accelerated to 55 mph, and its operator applied emergency braking too late to avoid the tragic collision. Sadly, on June 7, 2005, nearly 4 years before the Fort Totten accident, a faulty track circuit caused similar conditions in the tunnel between WMATA's Foggy Bottom and Rosslyn stations. In that case, however, the two train operators overrode the ATC system and manually stopped their trains to prevent a collision. Even more tragically, the lessons learned from this near-miss event had been forgotten by 2009, rather than being integrated into WMATA's safety and operating culture. [7] While it is debatable whether the collision could have been avoided given the sight obstructions at the curve near the Fort Totten station, clear indications of automation addiction exist. In particular, the operator of train 112, in compliance with WMATA operating standards, relied on ATC, particularly Automatic Train Operation, to control movement.

The crashes of AFF 447 on June 1, 2009 and AF 214 on July 6, 2013 bear striking similarities to the events of WMATA 112, particularly because of blind cooperation with automation. The AFF 447 crash of an Airbus A330 on June 1, 2009 was such a distinctive incident that Vanity Fair dedicated 26 pages of its October 2014 issue to an exposé on the crash and the problems with automation. [12] In that case, AFF 447's Pilot Flying engaged the autopilot 4 minutes after takeoff, as was Air France's standard operating procedure. The Pilot in Command (a Pilot Not Flying in this case) signaled a second co-pilot (also a Pilot Not Flying), who was sleeping in the flight-rest compartment (a small cabin containing two berths just behind the cockpit), to relieve him in the cockpit. The two changed places so the Pilot in Command could sleep. The Pilot Flying interacted with the autopilot only by "programming" it to avoid thunderstorms and climb to 36,000 ft (the recommended maximum altitude was 37,000 ft) when they encountered turbulence and ice. The autopilot relied on data from air-pressure probes known as pitot tubes, mounted under the cockpit, to function. As a safety feature, the autopilot was programmed to disengage and return command to the pilot in the event that data became unreliable. It was also known that pitot tubes on that particular model of A330 clogged under rare high-altitude conditions such as those present in this case, but no accident had yet resulted, and replacement probes were apparently soon to be installed. Unfortunately, the pilots were unaware that ice was forming in the tubes. It was only after the A330 autopilot suddenly sounded an alarm and disengaged that the pilots realized there was a potential problem. In response, the Pilot Flying took control of the stick, but his limited manual flying time led him to "oversteer" like a panicked driver and continually pull back on the stick, thinking the plane was stalling. This pushed the nose upward and sent the plane into a stall for the remaining 4 minutes and 20 seconds of the flight.
In all likelihood, the crash could have been averted had either pilot in the cockpit simply pushed the stick forward early on, leveling the plane or placing it into a very shallow descent. [13]

On July 6, 2013, Asiana Flight 214 was on final approach to San Francisco International Airport (SFO). Part of the airport's automated landing system was inoperable, which required the trainee pilot – an experienced pilot who nonetheless had little experience flying the Boeing 777 – to rely on a visual approach, the autothrottle, and other automated features to guide the plane's glide path. The plane's altitude and airspeed were higher than normal for the approach, to which the trainee responded by adjusting the autopilot several times during the descent to compensate. At about 2.25 miles from the airport, the trainee disengaged the autopilot, which inadvertently disengaged the autothrottle. This caused the plane to continue to lose airspeed without the pilot's knowledge. At about 1 mile from the airport, AF 214 was well below the desired glide path and airspeed, which triggered alarms several seconds thereafter. The 777's tail section hit the seawall at the edge of the runway, separating the tail from the fuselage, which skidded and tumbled several hundred feet before oil from a ruptured oil tank ignited into flames. The crash resulted in 3 fatalities. [14]

In all three cases, operators slavishly heeded the input of automation. WMATA 112's operator, per organizational policy, kept her train in automated mode, allowing her train to proceed without
consideration of potential obstructions, given the lowered visibility in the curve ahead. When she did take manual control by applying emergency braking, it was too late to intervene. AFF 447's Pilot Flying, per organizational policy, engaged his A330's automation almost immediately after takeoff, relying on it, with minimal "programming" input, to guide the plane safely to Paris, where he would take control shortly before landing. Instead, when the autopilot suddenly disengaged, the lack of situational awareness left the two men in the cockpit disoriented and ill prepared to assess and recover from the situation because of a lack of "stick time." A pilot with more manual flight time would most likely have known that pulling back on the stick was the worst thing to do. In the case of AF 214, an experienced pilot flying an unfamiliar aircraft relied on automation to assist him by controlling airspeed while he focused on maneuvering the plane. In the past, pilots developed an innate sense of their aircraft, which they guided using stick and throttle as an integrated control input. The decoupling of the two in this case left the pilot with a decided lack of appreciation for what his plane was doing, leading to the crash.

The Cycle of Skill Degradation

From the cases we have examined, we note a vicious cycle of degrading operator skill that is at the core of the hidden danger of automation. So-called "black swan" events [15] create what Captain Landry called automation exceptions [11], from which automation inevitably cannot recover. Unfortunately, in such cases automated systems transfer control back to pilots whose basic skills have atrophied from disuse, and who lack the basic situation awareness needed to recover because of the sudden transition from automation. Sadly, as we have seen, the operators' respective organizations propagated the cycle of skill degradation by stressing the use of automation through organizational policies. Inevitably these policies contribute to the loss of basic skill – or what pilots call basic attitude instrument flying proficiency – in three distinct ways [11].

1. Continuous and repetitive use of automation

Pilots are encouraged and trained to follow the "green lines" wherever they lead them when automated flight directors are active. [11], [15] Such blind adherence tends to degrade pilots' cognitive processes, as pilots spend more time monitoring and managing automation rather than "flying miles ahead" of the aircraft and developing a strategic understanding of where to guide it. We see an example of this in the case of WMATA 112. Both operators responded slavishly to commands issued by the automated system without any readily apparent manual means of establishing whether the system was operating properly or of knowing the position of other trains. Train 112 was kept completely automated until the fateful seconds before the crash, when its operator applied emergency braking. And while train 214's operator was in manual operation, he did not question whether the sudden command to stop was erroneous. [7]

2. Automation is emphasized over practicing, building, or maintaining basic skills

Automation is considered much more reliable than human flight skills; therefore, using it is seen as the best way to maintain passenger comfort, improve workload management and safety, and reduce liability concerns for operators and pilots' certificates.
As a result, operational practices stress maximum use of automation, neglecting issues of automation addiction and exception. But as the FAA noted during the hearing on the AF 214 crash, "pilots [are trained] to rely on the systems all the time, but they are not taught to question the systems. They expect the system to work when they use it and when it doesn't they get caught short." [16] The NTSB report on the WMATA 112 crash noted WMATA's emphasis on automated operations over manual. Train 214's operator had in fact been reprimanded for operating his train manually, despite his concerns that the automated system was making improper stops at stations. Moreover, at the time of the accident, he was manually operating his train, keeping it at a much lower speed – about 20 mph – than the automated system dictated – 55 mph – while approaching the curve toward Fort Totten. In contrast, it can be argued that the reactions of train 112's operator suggest she was unaccustomed to operating her train in manual mode, as she relied on the automated system to set the train's speed until she applied emergency braking in the final seconds before the crash.
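The mechanism behind that sequence of events can be illustrated with a deliberately simplified sketch. The Python below is hypothetical (the block layout, speed codes, and function names are invented, and it is not WMATA's actual ATC logic); it only shows how a fixed-block system that derives speed commands from track-circuit occupancy behaves when one circuit stops detecting the train standing in it: the block reads as vacant, and the following train receives a full-speed command into an occupied block.

    # Hypothetical, simplified fixed-block speed-command scheme (not WMATA's ATC).
    BLOCK_SPEED_CODES = {0: 0, 1: 15, 2: 35, 3: 55}  # mph, keyed by clear blocks ahead

    def detected_occupancy(true_occupancy, failed_circuits):
        # A failed track circuit always reads "vacant," hiding any train in it.
        return [occupied and (i not in failed_circuits)
                for i, occupied in enumerate(true_occupancy)]

    def speed_command(detected, train_block):
        # Count clear blocks ahead of the following train and map to a speed code.
        clear = 0
        for occupied in detected[train_block + 1:]:
            if occupied:
                break
            clear += 1
        return BLOCK_SPEED_CODES[min(clear, 3)]

    # Blocks 0..4: the following train is in block 0; a stopped train occupies block 3.
    true_occupancy = [True, False, False, True, False]

    # Healthy circuits: the stopped train ahead limits the command (35 mph here).
    print(speed_command(detected_occupancy(true_occupancy, set()), train_block=0))
    # Circuit 3 fails to detect its train: block 3 reads vacant (a "ghost" train)
    # and the following train is commanded to full speed into an occupied block.
    print(speed_command(detected_occupancy(true_occupancy, {3}), train_block=0))

The point of the sketch is that nothing in such a scheme looks abnormal from the cab: the automation simply issues a higher speed command, which is why an operator's independent sense of where other trains should be, and a willingness to override the system, matter so much.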

3. Operators are uncomfortable or unwilling to participate in activities with which they lack proficiency

Good aviation training typically develops a broad range of skills to acceptable levels. However, the training programs the airlines have used for decades fail to develop or improve pilot skills to such levels because they are too short and too predictable. [15] There is also infrequent and insufficient simulator time in which to employ the military training method of 'Demonstrate, then Direct, then Monitor' sequences often needed to respond to the situations during which automation exceptions occur. [15] When pilots do engage in simulations, they are limited to very critical but improbable emergencies with set symptoms and outcomes, and rarely have an opportunity to repeat the sequences or practice them. [15] This leaves pilots with little confidence and a hope that they will never see such situations. There is no indication that WMATA used simulators at the time of the WMATA 112 crash in 2009, nor any kind of training outside operators' regular job. Therefore we cannot say definitively how this element translates into the rail industry. However, a decisive underutilization of simulation within the rail industry has been found. [17] Given this, we can extrapolate that rail operators have little or no practice identifying or recovering from potential accident scenarios.

Required Skills for Operator Proficiency

Given the criticality of skill degradation caused by automation addiction / exceptions, identifying the roots of their effect on operators is important. Captain Dennis Landry astutely recognized that proficient pilots use rule-based behaviors for aircraft flight path control and knowledge-based behaviors to determine how to resolve an aircraft flight path or automation issue. [11] Rule-based behaviors result from multiple practice efforts instilling a specific response that eventually occurs without conscious thought in reaction to the stimulus. In contrast, knowledge-based behaviors occur when a pilot analyzes the information from various cockpit instruments. Actions are based on a rational process that requires time to gather, analyze, and react to specific situations. Knowledge-based behaviors are useful when a pilot has time to work through a complex issue, which rarely occurs in practice. [11] Rule-based behaviors are basic operating skills like steering, using the gas pedal (throttling), and braking in most automobiles. Such skills can only be developed and maintained through training and practice, which allows them to go from a deliberate thought process to instinctive responses. Using the automotive example, this is like the process of going from focusing on how to drive a car to being able to navigate to and from a destination. Captain Landry emphatically points out:

As pilots become accustomed to automation it becomes more difficult for them to deal with the occasional events that demand that automation be disregarded. During an automation exception, knowledge–based behaviors are required for recognition of the need to challenge or disregard the automation. These knowledge–based behaviors require a high degree of confidence by the pilot that his choice of action will not result in an undesired outcome. An automation exception requires direct and correct action on the part of the pilot, and this is where proficiency in rule–based behaviors will stand the pilot in good stead. These behaviors will allow the pilots to perform the basic attitude instrument flying tasks required to respond to the automation exception. Rule–based behaviors must be developed and maintained during normal flight operations.

In other words, knowledge-based behaviors are better suited to the problem solving required by automation exceptions. However, when operators must manage two sets of knowledge-based behaviors – basic operator skills as well as problem solving – the ability to resolve the exception decreases dramatically. In some sense it is like a driver accustomed to an automatic transmission switching to a manual transmission: while the driver is still capable of guiding the car, the additional cognitive load of operating the clutch can take a significant amount of attention away from navigating to a destination. The take-away from this discussion is that when operator skills atrophy in a complex transportation system because of lack of training or practice, a significant danger evolves in which operator proficiency effectively
becomes that of a novice just when advanced skills are needed. As Dr. Najmedin Meshkati observes, "Improvisation requires mastery of the subject matter [operator skill], a total system comprehension ... and ability to extrapolate the behaviour of the newly 'improvised' and patched up system, and to shepherd it to the safe state." [20] Hence there is a real need for operators to establish and maintain baseline skills by assuming manual control during non-essential operations. For example, in all our case studies operators were unable to recognize when automation exceptions had occurred. In the case of WMATA 112, the problem of the faulty track circuit was analogous to the blocked pitot tubes in the case of AFF 447. Both operators delegated complete control to the automated systems of their vehicles with little consideration of their limitations. In the case of WMATA 112, the operator allowed automation to guide her train with little appreciation that a faulty signal could cause problems, especially when there is little time to recognize and recover from an automation error. In contrast, train 214's operator entered the block at lower than the recommended speed. Whether he did this because he recognized the danger of going into a blind curve is unclear, but his actions show a decided lack of trust in the automated system, most likely because he apparently had seen cases of it failing.

Potential Recommendations for Automation in the Rail Industry

Proactively developing strategies to avoid automation pitfalls within rail automation is essential. Both Captain Dennis Landry and Captain Richard Champion de Crespigny give good recommendations for countering the problems of automation based on rule-based versus knowledge-based behavior. We adapt these to the rail industry.

1. Develop and maintain basic manual operating skills as rule-based behavior. [11], [15]

Training should enable operators to immediately intervene with minimal transition when automation exceptions occur. The current transition to PTC-enabled trains would be a perfect time to tap experienced train operators for input on what constitute best methods and basic operating skills for manual train operation. Optimally, these skills would be cultivated and maintained by the current generation of operators and communicated to the next generation of operators, who will train with PTC.

2. Encourage and allow operators to control trains manually during normal operations. [11]

While automation reduces operator error, operators should use the numerous opportunities to practice basic skills during normal, non-critical operations. Such operations should be identified, but can typically be considered those most operators would rate of easy to medium difficulty. Overall, this should enable train operators to improve their feel for how a train should accelerate, decelerate, and move through different phases of operation, using precise inputs without conscious thought to effectively manage a number of variables [11] through multiple regular and emergency scenarios. The exact time needed to do this is unclear; however, Captain Landry suggests 15 to 30 minutes per month. Fortunately, current implementations of PTC do not automate train operations, but rather act as a safety "overlay" that activates when operators fail to intervene by slowing or stopping within a pre-computed distance. In addition, in the event PTC is activated, the activation would be treated as a serious "near miss" requiring investigation.
However, as observed before, there is a danger that operators will use the PTC monitor to establish what counts as acceptable behavior. Furthermore, while PTC uses redundant communication modes to prevent data gaps like those at WMATA, overreliance on PTC for information will discourage the development of "route sense," which is strategically similar to the "flying ahead of the aircraft" sense pilots use during manual control. Moreover, such overreliance could create a false sense that PTC can identify and react to hazards the system does not recognize, like vehicles, trespassers, and debris on the track.
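As a rough illustration of that "overlay" behavior, the sketch below is hypothetical Python, not any vendor's PTC logic: the braking rate, margins, and thresholds are invented. It simply compares the distance needed to stop from the current speed against the distance remaining to an enforcement target, warns as the margin shrinks, and applies a penalty brake only if the operator has not already slowed the train.

    # Hypothetical sketch of a PTC-style braking-curve check (not a real system).
    # The overlay does not drive the train; it warns and, as a last resort,
    # applies a penalty brake when the operator has not acted in time.
    BRAKE_DECEL = 0.6      # assumed guaranteed braking deceleration, m/s^2
    WARN_MARGIN = 150.0    # start warning this many meters before enforcement
    SAFETY_MARGIN = 30.0   # buffer added to the computed stopping distance, m

    def stopping_distance(speed_mps):
        # Distance needed to stop from the current speed at BRAKE_DECEL.
        return speed_mps ** 2 / (2 * BRAKE_DECEL)

    def overlay_action(speed_mps, distance_to_target_m):
        # Returns 'clear', 'warn', or 'penalty_brake' for one update cycle.
        required = stopping_distance(speed_mps) + SAFETY_MARGIN
        if distance_to_target_m <= required:
            return "penalty_brake"   # operator failed to act in time
        if distance_to_target_m <= required + WARN_MARGIN:
            return "warn"            # alarm: the braking curve is approaching
        return "clear"

    # Example: 35 m/s (about 78 mph) approaching a stop target.
    for dist in (1500.0, 1200.0, 1050.0):
        print(dist, overlay_action(35.0, dist))   # clear, warn, penalty_brake

Because the computed curve has to assume conservative braking performance, an experienced operator who brakes later than the curve, even safely, will trigger the warnings, which is consistent with the alarms observed during the demonstration run described earlier.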

3. Train operators to recognize and recover from automation exceptions. [11], [15]

It is imperative for operators to understand the capabilities and limitations of the automated systems with which they interact. They should focus on "flight-path management" first and foremost, [16] so that they "operate the train and manage the automation," not "manage automation to operate the train." [11] Automation should be treated as a tool that helps them drive effectively, not as a substitute for the operator. They must learn to appreciate that automation will not always act correctly; therefore, it must be adequately monitored and supervised. Moreover, operators should learn to be ever vigilant and prepared to take manual control, especially at critical transition points. [16]

4. Regulatory bodies and organizations must develop and maintain policies and operational practices

that promote good automation integration practices.

With the integration of PTC and newer automation technology into the rail industry, there has never been a greater opportunity to establish and promote better training and operating practices within the industry. However, such improvement cannot and will not occur without sufficient regulatory and industry support. For example, while train operators typically use their basic skills on a daily basis, the lack of skill and route variation creates ample opportunity for complacency. The potential of training for emergency scenarios, though, has never been greater given the availability of both low- and high-fidelity rail simulators. Unfortunately, the decisive underutilization of simulation within the rail industry can arguably be attributed to a lack of industry and regulatory support. However, as this author has observed, the introduction of PTC has created a need for simulator training to prepare operators for the transition. Still, much to the author's frustration, the data generated from such sessions is neither stored nor analyzed at any point. Such data would be invaluable for identifying individual operators' strengths and weaknesses so as to further develop their basic and potentially advanced operator skills. Moreover, such data could be used to further organizational learning by helping identify particularly tricky regions to cross, or certain skills that many operators lack. Knowing these problem areas is the first step in developing mitigations like training or procedures to address those weaknesses. But this will only occur should the industry and regulatory bodies within rail recognize and act upon this incredible opportunity to establish better training and management of operator skills.

The Need for Incorporating High Reliability, Psychological, and Cultural Elements within Automated Environments

It has been argued that PTC can enable the creation of High Reliability Organizations (HROs) within the rail industry. [18] Moreover, it has also been argued that positive organizational and industrial changes occur more easily using psychological and cultural elements. [19] The principles of HROs were conceived to manage tightly scheduled operations (e.g., launching aircraft from an aircraft carrier) while maintaining low risks within inherently high-hazard environments, using numerous organizational processes. HROs instill an inherent climate of safety so that an organization "... repeatedly accomplishes its high hazard mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies." [18] HROs, though, cannot be developed without recognizing that organizational cultures, like natural environments, are often conducive to certain outcomes. Hence, properly managing the underlying psychological and cultural factors that develop and sustain organizational environments, including the intrinsic (internal) and extrinsic (external) factors that drive worker actions within a working environment, is essential to promoting robust, healthy organizations that operate efficiently and safely. [19] Becoming an HRO is not a goal, but rather a process characterized by 5 basic principles [18]:

1. Preoccupation with failure
2. Reluctance to simplify interpretations
3. Sensitivity to operations
4. Commitment to resilience
5. Deference to expertise

Certain processes can also help develop an HRO [18]:

1. Develop a system of process checks to spot expected and unexpected safety problems (see the sketch following this list)
2. Develop a reward system to incentivize proper individual and organizational behavior
3. Avoid degradation of current processes or the development of inferior processes
4. Develop a good sense of risk perception
5. Develop a good organizational command and control structure
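As a small, hypothetical illustration of the first process above, the sketch below scans maintenance records of the kind described in the Fort Totten narrative and flags two patterns worth attention: work orders closed without any documented corrective action, and defects that recur on the same asset after an order was closed. The record format and the post-closure report are invented for illustration; this is not an interface to WMATA's AIM system or any real maintenance database.

    # Hypothetical process check over maintenance work-order records.
    from collections import defaultdict
    from datetime import date

    # Invented record format; the B2-304 dates follow the narrative above,
    # and the post-closure defect report is illustrative only.
    work_orders = [
        {"asset": "track circuit B2-304", "opened": date(2008, 2, 28),
         "closed": date(2008, 9, 26), "corrective_action": None},
    ]
    defect_reports = [
        {"asset": "track circuit B2-304", "reported": date(2008, 11, 3),
         "symptom": "bobbing signal"},
    ]

    def closed_without_fix(orders):
        # Orders closed with no documented corrective action.
        return [o for o in orders if o["closed"] and not o["corrective_action"]]

    def recurring_defects(orders, defects):
        # Defects reported on an asset after a work order on it was closed.
        last_closed = defaultdict(lambda: None)
        for o in orders:
            if o["closed"] and (last_closed[o["asset"]] is None
                                or o["closed"] > last_closed[o["asset"]]):
                last_closed[o["asset"]] = o["closed"]
        return [d for d in defects
                if last_closed[d["asset"]] and d["reported"] > last_closed[d["asset"]]]

    for o in closed_without_fix(work_orders):
        print("Closed without corrective action:", o["asset"])
    for d in recurring_defects(work_orders, defect_reports):
        print("Defect recurred after closure:", d["asset"], "-", d["symptom"])

Run routinely, even a check this simple could surface histories like that of circuit B2-304 well before they align with other failures, which is the kind of preoccupation with failure and sensitivity to operations the HRO principles describe.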

Many of these components are found within the recommendations developed above. But it is critical to understand, as the rail industry continues to incorporate automation into its operations, that the aviation industry's automation successes are the result of steadfastly incorporating HRO principles and processes, as well as of learning from heartbreaking cases like AFF 447 and AF 214. For example, Chow, Yortsos, and Meshkati examined the critical role culture played in the crash of AF 214 despite the state-of-the-art automation in the Boeing 777. [21] The FAA also recognized that "all stakeholders must remain vigilant to ensure that risks are continuously evaluated and mitigated. The ongoing evolution in airspace operations will require careful attention to change management to maintain or improve safety." [22] It included a comprehensive list of recommendations that the rail industry would be wise to use as a point of further discussion about automation. [22]

Conclusion

In 2014, the Honorable Chris Hart – now NTSB Chairman – spoke about improving rail safety. [23]

“The importance of building relationships between management and employees that foster a vibrant safety culture cannot be overlooked. Trust is an essential ingredient in those relationships. A culture in which front-line employees may openly report operational errors and safety issues without fear of reprisal is absolutely critical, and, as we have seen in the aviation context, improves safety. The NTSB will continue to urge Federal regulators, such as FRA and the Federal Transit Administration (FTA), to facilitate establishment of appropriate safety cultures. The WMATA [112] accident, in particular, underscored the critical need for rail mass transit operators to enhance and nourish safety cultures. Our 2014 Most Wanted List reaffirms our view … [t]he FTA should consider the elements of safety culture, crew resource management, fatigue risk management, and technology, as well as lessons learned from the rail industry, as it moves forward with [new legislative authority to set and enforce new safety standards and conduct investigations]. Identifying and implementing these will be key to saving lives and preventing injuries.”

Despite the prevailing sentiment that automation can cure rail safety issues, Chairman Hart wisely recognizes that it is only part of a larger consideration. For example, the February 3, 2015 collision at Valhalla, NY, in which a Metro-North Harlem Line train collided with an SUV, would have been an automation exception, as current PTC implementations cannot practically sense obstacles on the rails like vehicles, people, or debris. As the late Dr. Richard Feynman wrote, "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled." [24] If PTC is to succeed in its goal of providing a safer rail system, we would be wise to listen to his advice.

References

[1] Mouawad, J. Technology Could Have Prevented Derailment Has Not Yet Been Installed. New York Times, May 14, 2015.
[2] Shear, M. D. and Mouawad, J. Hurdles Stalled Safety System, Railroad Says. New York Times, May 15, 2015.
[3] Susman, T. Amtrak crash hearing focuses on safeguards. Los Angeles Times, June 3, 2015.
[4] Phillips, D. and Schmidt, M. S. Bullet Did Not Strike Windshield of Wrecked Train, Investigators Say. New York Times, May 19, 2015.
[5] Nixon, R. No Evidence That Amtrak Engineer Used Phone During Crash, Safety Board Says. New York Times, June 11, 2015.
[6] Safety Alert for Operators #13002 (SAFO 13002). U.S. Department of Transportation, Federal Aviation Administration, Flight Standards Service, Washington, DC, January 4, 2013. http://www.faa.gov/other_visit/aviation_industry/airline_operators/airline_safety/safo/all_safos/media/2013/SAFO13002.pdf (accessed June 21, 2015).
[7] Collision of Two Washington Metropolitan Area Transit Authority Metrorail Trains Near Fort Totten Station, Washington, D.C. National Transportation Safety Board, Railroad Accident Report NTSB/RAR-10/02. http://www.ntsb.gov/investigations/AccidentReports/Reports/RAR1002.pdf.
[8] Informe Final Sobre el Accidente Grave Ferroviario No. 0054/2013 Ocurrido el Día 24.07.2013 en las Proximidades de la Estación de Santiago de Compostela (A Coruña). Ministerio de Fomento, Comisión de Investigación de Accidentes Ferroviarios, 2014.
[9] Covarrubias, A., Rocha, V., and Sahagun, L. For Metrolink riders, an explosion, then 'everything started flying'. Los Angeles Times, February 23, 2015.
[10] Parasuraman, R. and Mouloua, M. Automation and Human Performance. Lawrence Erlbaum Associates, Mahwah, New Jersey, 1996.
[11] Landry, D. J. (2006). Automation Addiction: Skill Losses Induced by Continuous Reliance on Flight Management and Guidance Systems. 59th Annual International Air Safety Seminar (IASS), Flight Safety Foundation (FSF), International Federation of Airworthiness (IFA) 36th International Conference, and International Air Transport Association (IATA), Paris, France, October 23–26, 2006.
[12] Langewiesche, W. The Human Factor. Vanity Fair, October 2014.
[13] Final Report on the Accident on 1st June 2009 to the Airbus A330-203 Registered F-GZCP Operated by Air France Flight AF 447 Rio de Janeiro – Paris. Bureau d'Enquêtes et d'Analyses pour la sécurité de l'aviation civile, Ministère de l'Écologie, du Développement durable, des Transports et du Logement, July 2012.
[14] National Transportation Safety Board (2014). Descent Below Visual Glidepath and Impact With Seawall, Asiana Airlines Flight 214, Boeing 777-200ER, HL7742, San Francisco, California, July 6, 2013. Aircraft Accident Report NTSB/AAR-14/01, Washington, DC.
[15] de Crespigny, R. (2015). Resilience – Recovering Pilots' Lost Flying Skills. Air Transport, June 2015.
[16] Johnson, D. (2015). Children of the Magenta. Aviation Safety Spotlight, January 2015.
[17] Naweed, A. (2013). Simulator Integration in the Rail Industry: The Robocop Problem. Proceedings of the Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit, Vol. 227, No. 5, September 2013, pp. 407–418.
[18] Placencia, G., Meshkati, N., Moore, J., and Khashe, Y. (2014). Technology and High Reliability Organizations in Railroad Operations Safety: A Case Study of Metrolink and Positive Train Control (PTC) Implementation. Proceedings of the Joint Rail Conference, 2014, Colorado Springs, CO.
[19] Placencia, G. (2015). Psychological and Cultural Components Affecting Rail Worker Culture: A Literature Review. Proceedings of the Joint Rail Conference, March 23–26, 2015, San Jose, CA.
[20] Meshkati, N. and Khashe, Y. (2015). Operators' Improvisation in Complex Technological Systems: Successfully Tackling Ambiguity, Enhancing Resiliency and the Last Resort to Averting Disaster. Journal of Contingencies and Crisis Management, Vol. 23, No. 2, June 2015.
[21] Chow, S., Yortsos, S., and Meshkati, N. (2014). Asiana Airlines Flight 214: Investigating Cockpit Automation and Culture Issues in Aviation Safety. Aviation Psychology and Applied Human Factors.
[22] Operational Use of Flight Path Management Systems: Final Report of the Performance-Based Operations Aviation Rulemaking Committee / Commercial Aviation Safety Team Flight Deck Automation Working Group, September 5, 2013.
[23] Hart, C. (2014). Testimony of the Honorable Christopher A. Hart, Vice Chairman, on Behalf of the National Transportation Safety Board, Before the Subcommittee on Surface Transportation and Merchant Marine Infrastructure, Safety, and Security, Committee on Commerce, Science and Transportation, United States Senate, Hearing on Enhancing Our Rail Safety: Current Challenges for Passenger and Freight Rail, Washington, DC, March 6, 2014.
[24] Feynman, R. P. (1988). "What Do You Care What Other People Think?" Further Adventures of a Curious Character. Bantam Books, Toronto, 1988.

Appendix A: Personal Observations on the Reliability of the Shuttle, by R. P. Feynman

Here in its entirety is Dr. Richard Feynman’s (in)famous Appendix F to the Rogers Commission Report on the Space Shuttle Challenger disaster. Dr. Feynman had a knack for explaining complex concepts in simple terms and his observations are a fine example of this. While not directly related to the rail industry they are well worth reading and sharing with others. The BBC television movie, The Challenger Disaster, is also highly recommended, though less informative. The reader is encouraged to develop their own lessons learned from his work, but some key points to consider are:

1. Do we ignore small failures / errors and excuse them by modifying our quality measures?
2. Do we investigate the source / root cause of systematic failures / errors?
3. Do we think something will continue to work properly if it hasn't failed in the past?
4. Does our working culture focus on common goals rather than individual obligations when responding to problems?
5. Do we consider top down (strategic) and bottom up (tactical / nuts & bolts) issues concurrently?
6. Are the efforts of good teams recognized, studied, and replicated by others?

I’d like to express my gratitude to AREMA for allowing me to share this seminal article.

Appendix F – Personal observations on the reliability of the Shuttle by R. P. Feynman

Introduction It appears that there are enormous differences of opinion as to the probability of a failure with loss of vehicle and of human life. The estimates range from roughly 1 in 100 to 1 in 100,000. The higher figures come from the working engineers, and the very low figures from management. What are the causes and consequences of this lack of agreement? Since 1 part in 100,000 would imply that one could put a Shuttle up each day for 300 years expecting to lose only one, we could properly ask "What is the cause of management's fantastic faith in the machinery?" We have also found that certification criteria used in Flight Readiness Reviews often develop a gradually decreasing strictness. The argument that the same risk was flown before without failure is often accepted as an argument for the safety of accepting it again. Because of this, obvious weaknesses are accepted again and again, sometimes without a sufficiently serious attempt to remedy them, or to delay a flight because of their continued presence. There are several sources of information. There are published criteria for certification, including a history of modifications in the form of waivers and deviations. In addition, the records of the Flight Readiness Reviews for each flight document the arguments used to accept the risks of the flight. Information was obtained from the direct testimony and the reports of the range safety officer, Louis J. Ullian, with respect to the history of success of solid fuel rockets. There was a further study by him (as chairman of the launch abort safety panel (LASP)) in an attempt to determine the risks involved in possible accidents leading to radioactive contamination from attempting to fly a plutonium power supply (RTG) for future planetary missions. The NASA study of the same question is also available. For the History of the Space Shuttle Main Engines, interviews with management and engineers at Marshall, and informal interviews with engineers at Rocketdyne, were made. An independent (Cal Tech) mechanical engineer who consulted for NASA about engines was also interviewed informally. A visit to Johnson was made to gather information on the reliability of the avionics (computers, sensors, and effectors). Finally there is a report "A Review of Certification Practices, Potentially Applicable to Man-rated Reusable Rocket Engines," prepared at the Jet Propulsion Laboratory by N. Moore, et al., in February, 1986, for NASA Headquarters, Office of Space Flight. It deals with the methods used by the FAA and the military to certify their gas turbine and rocket engines. These authors were also interviewed informally. Solid Rockets (SRB) An estimate of the reliability of solid rockets was made by the range safety officer, by studying the experience of all previous rocket flights. Out of a total of nearly 2,900 flights, 121 failed (1 in 25). This includes, however, what may be called, early errors, rockets flown for the first few times in which design errors are discovered and fixed. A more reasonable figure for the mature rockets might be 1 in 50. With special care in the selection of parts and in inspection, a figure of below 1 in 100 might be achieved but 1 in 1,000 is probably not attainable with today's technology. (Since there are two rockets on the Shuttle, these rocket failure rates must be doubled to get Shuttle failure rates from Solid Rocket Booster failure.) NASA officials argue that the figure is much lower. 
They point out that these figures are for unmanned rockets but since the Shuttle is a manned vehicle "the probability of mission success is necessarily very close to 1.0." It is not very clear what this phrase means. Does it mean it is close to 1 or that it ought to be close to 1? They go on to explain "Historically this extremely high degree of mission success has given rise to a difference in philosophy between manned space flight programs and unmanned programs; i.e., numerical probability usage versus engineering judgment." (These quotations are from "Space Shuttle Data for Planetary Mission RTG Safety Analysis," Pages 3-1, 3-1, February 15, 1985, NASA, JSC.) It is true that if the probability of failure was as low as 1 in 100,000 it would take an inordinate number of tests to determine it ( you would get nothing but a string of perfect flights from which no precise figure, other than that the probability is likely less than the number of such flights in the string so far). But, if the real probability is not so small, flights would show troubles, near failures, and possible actual failures with a reasonable number of trials. and standard statistical methods could give a reasonable estimate. In fact, previous NASA experience had shown, on occasion, just such difficulties, near accidents, and accidents, all giving warning that the probability of flight failure was not so very small. The inconsistency of the


The inconsistency of the argument not to determine reliability through historical experience, as the range safety officer did, is that NASA also appeals to history, beginning "Historically this high degree of mission success..." Finally, if we are to replace standard numerical probability usage with engineering judgment, why do we find such an enormous disparity between the management estimate and the judgment of the engineers? It would appear that, for whatever purpose, be it for internal or external consumption, the management of NASA exaggerates the reliability of its product, to the point of fantasy.

The history of the certification and Flight Readiness Reviews will not be repeated here. (See other parts of the Commission reports.) The phenomenon of accepting for flight seals that had shown erosion and blow-by in previous flights is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette, the fact that the first shot got off safely is little comfort for the next.

The origin and consequences of the erosion and blow-by were not understood. They did not occur equally on all flights and all joints; sometimes more, and sometimes less. Why not sometime, when whatever conditions determined it were right, still more, leading to catastrophe? In spite of these variations from case to case, officials behaved as if they understood it, giving apparently logical arguments to each other, often depending on the "success" of previous flights. For example, in determining if flight 51-L was safe to fly in the face of ring erosion in flight 51-C, it was noted that the erosion depth was only one-third of the radius. It had been noted in an experiment cutting the ring that cutting it as deep as one radius was necessary before the ring failed. Instead of being very concerned that variations of poorly understood conditions might reasonably create a deeper erosion this time, it was asserted that there was "a safety factor of three."

This is a strange use of the engineer's term "safety factor." If a bridge is built to withstand a certain load without the beams permanently deforming, cracking, or breaking, it may be designed for the materials used to actually stand up under three times the load. This "safety factor" is to allow for uncertain excesses of load, or unknown extra loads, or weaknesses in the material that might have unexpected flaws, etc. If now the expected load comes onto the new bridge and a crack appears in a beam, this is a failure of the design. There was no safety factor at all, even though the bridge did not actually collapse because the crack went only one-third of the way through the beam. The O-rings of the Solid Rocket Boosters were not designed to erode. Erosion was a clue that something was wrong. Erosion was not something from which safety can be inferred.
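
As a purely numerical illustration of the distinction drawn here (an editorial sketch, not a calculation from the report), the snippet below contrasts the ratio that was quoted as a "safety factor of three" with the margin measured against the design expectation, which for the O-rings was no erosion at all.

```python
# Editorial illustration of the 'safety factor' point above, using the
# figures quoted in the text: erosion reached one-third of the ring radius,
# and cutting a full radius was needed in a test before the ring failed.
observed_erosion   = 1.0 / 3.0   # fraction of radius eroded on flight 51-C
failure_erosion    = 1.0         # fraction of radius at which the ring failed in test
design_expectation = 0.0         # the rings were not designed to erode at all

quoted_factor = failure_erosion / observed_erosion
print(f"Ratio quoted as a 'safety factor': {quoted_factor:.1f}")

# A genuine safety factor is margin beyond the design condition. Any erosion
# at all already exceeds the design expectation, so by that standard the
# margin is spent before the ratio above is even computed.
within_design_basis = observed_erosion <= design_expectation
print(f"Operating within the design expectation: {within_design_basis}")
```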

There was no way, without full understanding, that one could have confidence that conditions the next time might not produce erosion three times more severe than the time before. Nevertheless, officials fooled themselves into thinking they had such understanding and confidence, in spite of the peculiar variations from case to case.

A mathematical model was made to calculate erosion. This was a model based not on physical understanding but on empirical curve fitting. To be more detailed, it was supposed a stream of hot gas impinged on the O-ring material, and the heat was determined at the point of stagnation (so far, with reasonable physical, thermodynamic laws). But to determine how much rubber eroded, it was assumed this depended only on this heat by a formula suggested by data on a similar material. A logarithmic plot suggested a straight line, so it was supposed that the erosion varied as the .58 power of the heat, the .58 being determined by a nearest fit. At any rate, adjusting some other numbers, it was determined that the model agreed with the erosion (to a depth of one-third the radius of the ring). There is nothing much so wrong with this as believing the answer! Uncertainties appear everywhere. How strong the gas stream might be was unpredictable; it depended on holes formed in the putty. Blow-by showed that the ring might fail even though not, or only partially, eroded through. The empirical formula was known to be uncertain, for it did not go directly through the very data points by which it was determined. There was a cloud of points, some twice above and some twice below the fitted curve, so erosions twice those predicted were reasonable from that cause alone.


Similar uncertainties surrounded the other constants in the formula, etc., etc. When using a mathematical model, careful attention must be given to uncertainties in the model.
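
The curve-fitting procedure criticized here is easy to reproduce in outline. The sketch below is an editorial illustration with invented data points (only the form of the fit, a power law found as a straight line on a log-log plot, follows the text); it shows how a scatter of roughly a factor of two about the fitted curve translates directly into a factor-of-two uncertainty in any predicted erosion.

```python
import numpy as np

# Editorial sketch of the empirical erosion model described above. The data
# points are invented; only the form of the fit (a power law, i.e. a straight
# line on a log-log plot) follows the text.
heat  = np.array([1.0, 1.5, 2.2, 3.0, 4.5, 6.0])        # relative heating at stagnation
depth = np.array([0.06, 0.23, 0.08, 0.36, 0.14, 0.57])  # relative erosion depth

slope, intercept = np.polyfit(np.log(heat), np.log(depth), 1)
a = np.exp(intercept)
print(f"Fitted model: depth ~ {a:.3f} * heat^{slope:.2f}")
# (On the real data the reported fit gave an exponent of 0.58.)

predicted = a * heat ** slope
ratio = depth / predicted
print("data / fit ratios:", np.round(ratio, 2))
print(f"Spread about the curve in this sketch: {ratio.min():.2f}x to {ratio.max():.2f}x")

# With points lying up to about twice above and twice below the curve, an
# erosion twice the model's prediction needs no new physics at all.
```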

Liquid Fuel Engine (SSME)

During the flight of 51-L the three Space Shuttle Main Engines all worked perfectly, even, at the last moment, beginning to shut down the engines as the fuel supply began to fail. The question arises, however, as to whether, had it failed, and we were to investigate it in as much detail as we did the Solid Rocket Booster, we would find a similar lack of attention to faults and a deteriorating reliability. In other words, were the organizational weaknesses that contributed to the accident confined to the Solid Rocket Booster sector, or were they a more general characteristic of NASA? To that end the Space Shuttle Main Engines and the avionics were both investigated. No similar study of the Orbiter or the External Tank was made.

The engine is a much more complicated structure than the Solid Rocket Booster, and a great deal more detailed engineering goes into it. Generally, the engineering seems to be of high quality, and apparently considerable attention is paid to deficiencies and faults found in operation. The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (for turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge, larger component parts (such as bearings) are designed and tested individually. As deficiencies and design errors are noted they are corrected and verified with further testing. Since one tests only parts at a time, these tests and modifications are not overly expensive. Finally one works up to the final design of the entire engine, to the necessary specifications. There is a good chance, by this time, that the engine will generally succeed, or that any failures are easily isolated and analyzed because the failure modes, limitations of materials, etc., are so well understood. There is a very good chance that the modifications to the engine to get around the final difficulties are not very hard to make, for most of the serious problems have already been discovered and dealt with in the earlier, less expensive, stages of the process.

The Space Shuttle Main Engine was handled in a different manner: top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the materials and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes. For example, cracks have been found in the turbine blades of the high pressure oxygen turbopump. Are they caused by flaws in the material, the effect of the oxygen atmosphere on the properties of the material, the thermal stresses of startup or shutdown, the vibration and stresses of steady running, or mainly at some resonance at certain speeds, etc.? How long can we run from crack initiation to crack failure, and how does this depend on power level? Using the completed engine as a test bed to resolve such questions is extremely expensive. One does not wish to lose an entire engine in order to find out where and how failure occurs. Yet an accurate knowledge of this information is essential to acquire confidence in the engine's reliability in use. Without detailed understanding, confidence cannot be attained.
A further disadvantage of the top-down method is that, if an understanding of a fault is obtained, a simple fix, such as a new shape for the turbine housing, may be impossible to implement without a redesign of the entire engine. The Space Shuttle Main Engine is a very remarkable machine. It has a greater ratio of thrust to weight than any previous engine. It is built at the edge of, or outside of, previous engineering experience. Therefore, as expected, many different kinds of flaws and difficulties have turned up. Because, unfortunately, it was built in the top-down manner, they are difficult to find and fix. The design aim of a lifetime of 55 mission-equivalent firings (27,000 seconds of operation, either in a mission of 500 seconds or on a test stand) has not been obtained. The engine now requires very frequent maintenance and replacement of important parts, such as turbopumps, bearings, sheet metal housings, etc.


The high-pressure fuel turbopump had to be replaced every three or four mission equivalents (although that may have been fixed now), and the high-pressure oxygen turbopump every five or six. This is at most ten percent of the original specification. But our main concern here is the determination of reliability. In a total of about 250,000 seconds of operation, the engines have failed seriously perhaps 16 times. Engineering pays close attention to these failings and tries to remedy them as quickly as possible. This it does by test studies on special rigs experimentally designed for the flaws in question, by careful inspection of the engine for suggestive clues (like cracks), and by considerable study and analysis. In this way, in spite of the difficulties of top-down design, through hard work many of the problems have apparently been solved. A list of some of the problems follows; those followed by an asterisk (*) are probably solved:

1. Turbine blade cracks in high pressure fuel turbopumps (HPFTP). (May have been solved.)
2. Turbine blade cracks in high pressure oxygen turbopumps (HPOTP).
3. Augmented Spark Igniter (ASI) line rupture.*
4. Purge check valve failure.*
5. ASI chamber erosion.*
6. HPFTP turbine sheet metal cracking.
7. HPFTP coolant liner failure.*
8. Main combustion chamber outlet elbow failure.*
9. Main combustion chamber inlet elbow weld offset.*
10. HPOTP subsynchronous whirl.*
11. Flight acceleration safety cutoff system (partial failure in a redundant system).*
12. Bearing spalling (partially solved).
13. A vibration at 4,000 Hertz making some engines inoperable, etc.

Many of these solved problems are the early difficulties of a new design, for 13 of them occurred in the first 125,000 seconds and only three in the second 125,000 seconds. Naturally, one can never be sure that all the bugs are out, and, for some, the fix may not have addressed the true cause. Thus it is not unreasonable to guess there may be at least one surprise in the next 250,000 seconds, a probability of 1/500 per engine per mission. On a mission there are three engines, but some accidents would possibly be contained and only affect one engine. The system can abort with only two engines. Therefore let us say that the unknown surprises do not, even of themselves, permit us to guess that the probability of mission failure due to the Space Shuttle Main Engine is less than 1/500. To this we must add the chance of failure from known, but as yet unsolved, problems (those without the asterisk in the list above). These we discuss below. (Engineers at Rocketdyne, the manufacturer, estimate the total probability as 1/10,000. Engineers at Marshall estimate it as 1/300, while NASA management, to whom these engineers report, claims it is 1/100,000. An independent engineer consulting for NASA thought 1 or 2 per 100 a reasonable estimate.)
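
The per-mission figures quoted in this paragraph follow from a few lines of arithmetic, restated in the sketch below (an editorial illustration; the 500-second burn time, engine count, and the various estimates are the ones given in the text).

```python
# Restating the engine reliability arithmetic given above (editorial sketch;
# all figures are the ones quoted in the text).
total_seconds = 250_000        # accumulated SSME operating experience
mission_burn = 500             # seconds of operation per engine per mission
engines_per_mission = 3

# "At least one surprise in the next 250,000 seconds" spread over 500-second
# burns gives the per-engine, per-mission figure:
per_engine_per_mission = mission_burn / total_seconds
print(f"Surprise risk per engine per mission: 1 in {1 / per_engine_per_mission:.0f}")

# If none of those surprises were contained, three engines would roughly
# triple the exposure (an upper bound only, as the text notes):
per_mission_upper = 1 - (1 - per_engine_per_mission) ** engines_per_mission
print(f"Upper bound with three engines: about 1 in {1 / per_mission_upper:.0f}")

# For comparison, the estimates quoted in the text:
estimates = {
    "Rocketdyne engineers": 1 / 10_000,
    "Marshall engineers": 1 / 300,
    "NASA management": 1 / 100_000,
    "independent consultant": 1.5 / 100,   # "1 or 2 per 100"
}
for who, p in estimates.items():
    print(f"{who:>22}: about 1 in {1 / p:,.0f}")
```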


The history of the certification principles for these engines is confusing and difficult to explain. Initially the rule seems to have been that two sample engines must each have had twice the time operating without failure as the operating time of the engine to be certified (rule of 2x). At least that is the FAA practice, and NASA seems to have adopted it, originally expecting the certified time to be 10 missions (hence 20 missions for each sample). Obviously the best engines to use for comparison would be those of greatest total (flight plus test) operating time -- the so-called "fleet leaders." But what if a third sample and several others fail in a short time? Surely we will not be safe because two were unusual in lasting longer. The short time might be more representative of the real possibilities, and in the spirit of the safety factor of 2, we should only operate at half the time of the short-lived samples.

The slow shift toward decreasing safety factor can be seen in many examples. We take that of the HPFTP turbine blades. First of all the idea of testing an entire engine was abandoned. Each engine number has had many important parts (like the turbopumps themselves) replaced at frequent intervals, so that the rule must be shifted from engines to components. We accept an HPFTP for a certification time if two samples have each run successfully for twice that time (and of course, as a practical matter, no longer insisting that this time be as large as 10 missions). But what is "successfully?" The FAA calls a turbine blade crack a failure, in order, in practice, to really provide a safety factor greater than 2. There is some time that an engine can run between the time a crack originally starts until the time it has grown large enough to fracture. (The FAA is contemplating new rules that take this extra safety time into account, but only if it is very carefully analyzed through known models within a known range of experience and with materials thoroughly tested. None of these conditions apply to the Space Shuttle Main Engine.)

Cracks were found in many second stage HPFTP turbine blades. In one case three were found after 1,900 seconds, while in another they were not found after 4,200 seconds, although usually these longer runs showed cracks. To follow this story further we shall have to realize that the stress depends a great deal on the power level. The Challenger flight was to be at, and previous flights had been at, a power level called 104% of rated power level during most of the time the engines were operating. Judging from some material data, it is supposed that at 104% of rated power level the time to crack is about twice that at 109%, or full power level (FPL). Future flights were to be at this higher level because of heavier payloads, and many tests were made at this level. Therefore, dividing time at 104% by 2, we obtain units called equivalent full power level (EFPL). (Obviously, some uncertainty is introduced by that, but it has not been studied.) The earliest cracks mentioned above occurred at 1,375 seconds EFPL.

Now the certification rule becomes "limit all second stage blades to a maximum of 1,375 seconds EFPL." If one objects that the safety factor of 2 is lost, it is pointed out that the one turbine ran for 3,800 seconds EFPL without cracks, and half of this is 1,900, so we are being more conservative. We have fooled ourselves in three ways. First, we have only one sample, and it is not the fleet leader, for the other two samples of 3,800 or more seconds had 17 cracked blades between them. (There are 59 blades in the engine.) Next, we have abandoned the 2x rule and substituted equal time. And finally, 1,375 is where we did see a crack. We can say that no crack had been found below 1,375, but the last time we looked and saw no cracks was 1,100 seconds EFPL. We do not know when the crack formed between these times; for example, cracks may have formed at 1,150 seconds EFPL. (Approximately 2/3 of the blade sets tested in excess of 1,375 seconds EFPL had cracks. Some recent experiments have, indeed, shown cracks as early as 1,150 seconds.) It was important to keep the number high, for the Challenger was to fly an engine very close to the limit by the time the flight was over.

Finally it is claimed that the criteria are not abandoned, and the system is safe, by giving up the FAA convention that there should be no cracks, and considering only a completely fractured blade a failure. With this definition no engine has yet failed. The idea is that since there is sufficient time for a crack to grow to a fracture, we can ensure that all is safe by inspecting all blades for cracks. If they are found, replace them; if none are found, we have enough time for a safe mission. This makes the crack problem not a flight safety problem, but merely a maintenance problem. This may in fact be true. But how well do we know that cracks always grow slowly enough that no fracture can occur in a mission? Three engines have run for long times with a few cracked blades (about 3,000 seconds EFPL) with no blades broken off. But a fix for this cracking may have been found: by changing the blade shape, shot-peening the surface, and covering with insulation to exclude thermal shock, the blades have not cracked so far.
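
The equivalent-full-power-level bookkeeping, and the quiet loss of the factor of two, can be followed with a few lines of arithmetic. The sketch below simply restates the numbers given in the text (the factor of 2 between 104% and full power level, the single 3,800-second sample, and the 1,375-second crack observation); it is an editorial illustration, not part of the original analysis.

```python
# Editorial restatement of the EFPL bookkeeping described above.

def to_efpl(seconds_at_104_percent: float) -> float:
    """Convert time at 104% rated power to equivalent full power level (EFPL),
    using the factor of two quoted in the text (cracking is supposed to be
    about twice as slow at 104% as at full power)."""
    return seconds_at_104_percent / 2.0

print(f"A 500-second mission burn at 104% counts as {to_efpl(500):.0f} s EFPL")

longest_uncracked_run = 3_800    # seconds EFPL on the single favorable sample
earliest_observed_crack = 1_375  # seconds EFPL at which cracks had been seen
last_clean_inspection = 1_100    # seconds EFPL of the last crack-free inspection

# The original certification logic would demand a factor of two on the sample:
print(f"Limit under the 2x rule (one sample): {longest_uncracked_run / 2:.0f} s EFPL")

# The limit actually adopted was the time at which cracks had already been
# observed, leaving no margin at all relative to the observed damage:
print(f"Adopted limit: {earliest_observed_crack} s EFPL "
      f"(a crack could have formed any time after {last_clean_inspection} s)")
```
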
A very similar story appears in the history of certification of the HPOTP, but we shall not give the details here. It is evident, in summary, that the Flight Readiness Reviews and certification rules show a deterioration for some of the problems of the Space Shuttle Main Engine that is closely analogous to the deterioration seen in the rules for the Solid Rocket Booster.

Avionics

By "avionics" is meant the computer system on the Orbiter as well as its input sensors and output actuators. At first we will restrict ourselves to the computers proper and not be concerned with the reliability of the input information from the sensors of temperature, pressure, etc., nor with whether the computer output is faithfully followed by the actuators of rocket firings, mechanical controls, displays to astronauts, etc.


The computer system is very elaborate, having over 250,000 lines of code. It is responsible, among many other things, for the automatic control of the entire ascent to orbit, and for the descent until well into the atmosphere (below Mach 1) once one button is pushed deciding the landing site desired. It would be possible to make the entire landing automatically (except that the landing gear lowering signal is expressly left out of computer control, and must be provided by the pilot, ostensibly for safety reasons) but such an entirely automatic landing is probably not as safe as a pilot controlled landing. During orbital flight it is used in the control of payloads, in displaying information to the astronauts, and the exchange of information to the ground. It is evident that the safety of flight requires guaranteed accuracy of this elaborate system of computer hardware and software. In brief, the hardware reliability is ensured by having four essentially independent identical computer systems. Where possible each sensor also has multiple copies, usually four, and each copy feeds all four of the computer lines. If the inputs from the sensors disagree, depending on circumstances, certain averages, or a majority selection is used as the effective input. The algorithm used by each of the four computers is exactly the same, so their inputs (since each sees all copies of the sensors) are the same. Therefore at each step the results in each computer should be identical. From time to time they are compared, but because they might operate at slightly different speeds a system of stopping and waiting at specific times is instituted before each comparison is made. If one of the computers disagrees, or is too late in having its answer ready, the three which do agree are assumed to be correct and the errant computer is taken completely out of the system. If, now, another computer fails, as judged by the agreement of the other two, it is taken out of the system, and the rest of the flight canceled, and descent to the landing site is instituted, controlled by the two remaining computers. It is seen that this is a redundant system since the failure of only one computer does not affect the mission. Finally, as an extra feature of safety, there is a fifth independent computer, whose memory is loaded with only the programs of ascent and descent, and which is capable of controlling the descent if there is a failure of more than two of the computers of the main line four. There is not enough room in the memory of the main line computers for all the programs of ascent, descent, and payload programs in flight, so the memory is loaded about four times from tapes, by the astronauts. Because of the enormous effort required to replace the software for such an elaborate system, and for checking a new system out, no change has been made to the hardware since the system began about fifteen years ago. The actual hardware is obsolete; for example, the memories are of the old ferrite core type. It is becoming more difficult to find manufacturers to supply such old-fashioned computers reliably and of high quality. Modern computers are very much more reliable, can run much faster, simplifying circuits, and allowing more to be done, and would not require so much loading of memory, for the memories are much larger. The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. 
The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin is studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all of this verification is not an aspect of program safety; it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.
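
Stepping back to the hardware side of the avionics, the four-computer voting and fall-back arrangement described earlier in this section lends itself to a short sketch. This is a simplified editorial illustration of the voting logic only; the computer names and output values are invented, and it is not the actual flight software.

```python
from collections import Counter

def vote(outputs: dict) -> tuple:
    """Majority-vote the outputs of the active computers.

    Returns the agreed value and the computers that disagreed (to be dropped
    from the active set). Simplified editorial sketch of the scheme described
    above, not the actual flight software."""
    majority_value, _ = Counter(outputs.values()).most_common(1)[0]
    dissenters = [name for name, value in outputs.items() if value != majority_value]
    return majority_value, dissenters

# Invented example: four identical computers, one producing a faulty result.
active = {"GPC1": 42, "GPC2": 42, "GPC3": 41, "GPC4": 42}

value, bad = vote(active)
print(f"Agreed output: {value}; removing: {bad}")
for name in bad:
    del active[name]

# Per the description above, a second disagreement would leave two computers,
# the remainder of the mission would be cancelled, and descent would be flown
# on the remaining pair (with a fifth, independently loaded backup in reserve).
if len(active) <= 2:
    print("Abort: descend under two-computer control")
else:
    print(f"{len(active)} computers still in agreement; mission continues")
```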


To summarize, then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards, so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history. This must be resisted, for it does not appreciate the mutual subtle influences, and sources of error, generated by even small changes of one part of a program on another. There are perpetual requests for changes as new payloads and new demands and modifications are suggested by the users. Changes are expensive because they require extensive testing. The proper way to save money is to curtail the number of requested changes, not the quality of testing for each. One might add that the elaborate system could be very much improved by more modern hardware and programming techniques. Any outside competition would have all the advantages of starting over, and whether that is a good idea for NASA now should be carefully considered.

Finally, returning to the sensors and actuators of the avionics system, we find that the attitude toward system failure and reliability is not nearly as good as for the computer system. For example, a difficulty was found with certain temperature sensors sometimes failing. Yet 18 months later the same sensors were still being used, still sometimes failing, until a launch had to be scrubbed because two of them failed at the same time. Even on a succeeding flight this unreliable sensor was used again. Again, the reaction control systems, the rocket jets used for reorienting and control in flight, still are somewhat unreliable. There is considerable redundancy, but a long history of failures, none of which has yet been extensive enough to seriously affect flight. The action of the jets is checked by sensors, and if they fail to fire the computers choose another jet to fire. But they are not designed to fail, and the problem should be solved.

Conclusions

If a reasonable launch schedule is to be maintained, engineering often cannot be done fast enough to keep up with the expectations of originally conservative certification criteria designed to guarantee a very safe vehicle. In these situations, subtly, and often with apparently logical arguments, the criteria are altered so that flights may still be certified in time. They therefore fly in a relatively unsafe condition, with a chance of failure of the order of a percent (it is difficult to be more accurate). Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA's perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.

In any event, this has had very unfortunate consequences, the most serious of which is to encourage ordinary citizens to fly in such a dangerous machine, as if it had attained the safety of an ordinary airliner. The astronauts, like test pilots, should know their risks, and we honor them for their courage. Who can doubt that McAuliffe was equally a person of great courage, who was closer to an awareness of the true risk than NASA management would have us believe? Let us make recommendations to ensure that NASA officials deal in a world of reality, understanding technological weaknesses and imperfections well enough to be actively trying to eliminate them.
They must live in reality in comparing the costs and utility of the Shuttle to other methods of entering space. And they must be realistic in making contracts, in estimating costs, and the difficulty of the projects. Only realistic flight schedules should be proposed, schedules that have a reasonable chance of being met. If in this way the government would not support them, then so be it. NASA owes it to the citizens from whom it asks support to be frank, honest, and informative, so that these citizens can make the wisest decisions for the use of their limited resources. For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.


AREMA 2015 ANNUAL CONFERENCE
Minneapolis, MN | October 4-7, 2015

“The train of events at Chernobyl NPS, which led to the tragedy, was in no way reminiscent of even one of the emergency situations at other nuclear power stations, but was very, very similar, right down to the last details, to what happened at the chemical works at Bhopal in 1984. The lesson of Bhopal went unheeded....”

Dr. Valery Legasov, 1986


Air France 447

CBS Report – www.youtube.com/watch?v=kERSSRJant0

Vanity Fair, October, 2014

www.vanityfair.com/news/business/2014/10/air-france-flight-447-crash

Documentary – www.youtube.com/watch?v=TsgyBqlFixo


Asiana 214

NTSB – www.youtube.com/watch?v=8MFPSfGoT1U

Automation – www.youtube.com/watch?v=tH_F6Ekf3js


Automation

Addiction / Fixation

Exception


Automation’s Cycle of Skill Degradation

Continuous and Repetitive Use

Emphasized over basic operator skills: practice, building, or maintenance

Operators uncomfortable / unwilling to participate in activities with which they lack proficiency


Different Skills

Rule-based behavior

Knowledge-based behavior

  



Some Recommendations

Develop and maintain basic manual operating skills

Encourage and allow operators to control trains manually

Train operators to recognize / recover from automation exceptions

Develop / maintain policies and practices to promote good automation integration practices (via regulations and culture)

Incorporate High Reliability into the Organization


High Reliability Organizations

Process characterized by 5 basic principles

Preoccupation with failure

Reluctance to simplify interpretations

Sensitivity to operations

Commitment to resilience

Deference to expertise

Processes that can develop HRO

Develop a system of process checks to spot expected and unexpected safety problems

Develop a reward system to incentivize proper individual and organizational behavior

Avoid degradation of current process or inferior process development

Develop a good sense of risk perception

Develop a good organizational command and control structure


WMATA 112

A tale of potential things to come?

NTSB – www.youtube.com/watch?v=KHMosix9bQ0


Discussion


“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.”

Dr. Richard Feynman, 1986

Personal Observations on the Reliability of the Space Shuttle System