Real-Time Performance Improvement and...

1
BSA improvement plan separate out time-critical and non-time-critical parts, and queue the non-time-critical processing into a lower priority thread implement thread level code instead of the record based processing to avoid overhead and performance dragging of record processing. Lessons from Prototype test prototype shows improvement epics record global locking and lockset issue need more test with the EPICS snapshot version for multi-core and parallel callbacks K. H. Kim # , S. Allison, T. Straumann, E. Williams SLAC National Accelerator Laboratory, Menlo Park, CA 94025, U. S. A. Abstract Real-Time Performance Improvement and Consideration of Parallel Processing for Beam Synchronous Acquisition (BSA) * Beam Synchronous Acquisition (BSA) Definition / Requirements BSA Improvement Plan Summary and Discussion *Work supported by the the U.S. Department of Energy, Office of Science under Contract DE-AC02-76SF00515 for LCLS I and LCLS II. # [email protected] Acquire all beam-dependent scalars across multiple IOCs on the same pulse over multiple pulses of a certain kind up to 120Hz Acquire up to 2800 values per scalar in one acquisition Each value of the 2800 values can be an average/RMS of up to 1000 values Configure and process 20 independent acquisitions simultaneously Each acquisition request can specify: beam code (defines project) machine conditions of interest rate, timeslot, permit, and etc BSA record processing Improvement Required BSA failures in operation Beam Synchronous Acquisition (BSA) provides a common infrastructure for aligning data to each individual beam pulse, as required by the Linac Coherent Light Source (LCLS). BSA allows 20 independent acquisitions simultaneously for the entire LCLS facility and is used extensively for beam physics, machine diagnostics and operation. BSA is designed as a part of LCLS timing system and is currently an EPICS record based implementation, allowing timing receiver EPICS applications to easily add BSA functionality to their own record processing. However the lack of real-time performance of EPICS record processing and the increasing number of BSA devices has brought real-time performance issues. We think the major reason for the performance problem is due to the lack of separation between time-critical BSA upstream processing and non-critical downstream processing. We are improving BSA with thread level programming, breaking the global lock in each BSA device, adding a queue between upstream and downstream processing, and moving out the non-critical downstream to a lower priority worker thread. We are also investigating the use of multiple worker threads for parallel processing in Symmetric Multi-Processor (SMP) system. BSA operation Acquisition EDEF setup and request activation, done on the EVG IOC 360Hz checking on the EVG IOC with user notification when finished 360Hz requests (acquisition control) sent by the EVG IOC to all EVR IOCs via timing fiber optical link Data checking, averaging, and array update per scalar record per request on the EVR IOCs Figure 1: Data Acquisition across IOCs Figure 2: BSA EDEF reservation, mask setup, and control screens 360Hz Task in EVG IOC The 360Hz event task wakes up on interrupt which indicates timeslot change on the AC powerline, and performs the following steps: Generates 196 bits timing pattern using pre-programmed scenario Moves the timing pattern pipeline Matches pattern for event codes and program the EVG sequence ram for next timeslot Matches pattern for BSA BSA Pattern Matching Checks for a match with the new pulse’s pattern, beam code and each EDEFs Counts the number of measurements and number of values for the current average per EDEF The timing pattern is pipelined the new pattern is actually for 3 pulses ahead Figure 3: 360Hz tasks in EVG and EVR 360Hz Task in EVR IOC EDEF bit masks included in 360Hz data sent by EVG to EVR EVR IOC caches the data on data interrupt EVR IOC 360Hz event task is activated on the next fiducial interrupt Copies pattern data to the end of pipeline Moves the pipeline Updates EDEF tables after the pipeline update preparing for BSA processing done later in the same pulse Figure 4: 360Hz tasks and EDEF table update Data source PV triggers BSA record processing, after the EDEF table updating by the 360Hz processing The BSA record processing performs following: Matches timestamp to decide if the data is used in the BSA buffer Calculates RMS and Average, if required Checks Severity Process BSA buffer (compress record) Updates BSA management related information Figure 5: BSA record processing Figure 6: BSA fail scenario LCLS-I has total ~1,150 BSA PVs ~982 PVs succeed for BSA (85.4%) but, ~168 PVs fail (14.6%) The reason for failure varies and depends on the situations misconfigured data source PVs wrong trigger rates data is ready too late due to slow device and communication, and does not meet the time budget for BSA Figure 6: Typical BSA fail scenario with unexpected delay, resulting in wrong timestamp in record Improvement Plan Separate BSA processing into two parts: time-critical and non-time-critical Provide C/C++ APIs for data source device/driver code directly calls the API to put data into BSA keep data receptor (record interface) for backward compatibility The time-critical part (BSA upstream processing) is done by device/driver context in the API method or it is done by high-priority callback thread in the record interface method. The BSA upstream processing quickly do the timestamp checks and if the data matches, queues up the information for the lower priority agent thread The lower priority agent thread processes the non-time-critical part (BSA downstream processing) whenever it gets CPU time. BSA fail scenario BSA is based on epics record processing and is driven by the data source PV Both time-critical and non-time-critical parts are processed by a high priority callback in epics IOC: time-critical part: gather data from data source and compare timestamp with EDEF table non-time-critical part: calculate RMS, average, maintain counters, and post out BSA buffers to epics PVs The high priority callback is a single thread in epics IOC, the non-time- critical part and other record processing, adds a delay on the time-critical part The delay could make BSA fail due to timestamp mismatching. The overloaded high priority callback thread could also delay the data source PV processing and result it a late timestamp for the data. Consideration for multi-core platform and parallel callback for new EPICS base Figure 7: Processing timeline for BSA improvement (for single core system) Figure 8: Consideration for parallel callbacks for multi-core system The improvement allows to make multiple agent threads which waits a same queue to get BSA request from the upstream BSA processing. It allows parallel processing for downstream part in the multi-core system The improvement also allow to use parallel callbacks in new epics base release. It allows parallel processing for the BSA upstreaming processing Prototyping and Preliminary Test Results Quick prototype code record processing based code separates the upstream processing and the down stream processing high priority callback processes the upstream part and queues downstream processing to the lower priority agent thread. Test for single-core system MVME6100/RTEMS/epics-R3.15.1 stress test under 100% CPU load (lowest priority dummy thread runs an infinite loop) find spending ~ 5 usec for time-critical part and ~ 10 usec for non-time-critical for each BSA cycle capable for 64 BSA PVs @ 120Hz for 8 EDEFs (~61k BSA acquisition per sec) improve almost x2 capacity compare to the original Test for multi-core system x86/LinuxRT/epics-R3.15.1 stress test under 100% CPU load by lower priority scan threads run dummy record processing apply parallel callbacks and multiple agent threads BSA capacity saturated around x2 compare to the original even having a large number of cores and parallel callback threads find there are record global locking and lock set issue to improve epics base itself request EPICS core development team to solve the global locking and lock set issues. Ongoing BSA test with the EPICS base snapshot version

Transcript of Real-Time Performance Improvement and...

Page 1: Real-Time Performance Improvement and …icalepcs.synchrotron.org.au/posters/wepgf122_poster.pdfMoves the timing pattern pipeline ... 360Hz tasks and EDEF table update ... request

BSA improvement plan separate out time-critical and non-time-critical parts, and queue the non-time-critical

processing into a lower priority thread

implement thread level code instead of the record based processing to avoid overhead

and performance dragging of record processing.

Lessons from Prototype test prototype shows improvement

epics record global locking and lockset issue

need more test with the EPICS snapshot version for multi-core and parallel callbacks

K. H. Kim#, S. Allison, T. Straumann, E. Williams SLAC National Accelerator Laboratory, Menlo Park, CA 94025, U. S. A.

Abstract

Real-Time Performance Improvement and

Consideration of Parallel Processing for

Beam Synchronous Acquisition (BSA)*

Beam Synchronous Acquisition (BSA)

Definition / Requirements

BSA Improvement Plan

Summary and Discussion

*Work supported by the the U.S. Department of Energy, Office of Science

under Contract DE-AC02-76SF00515 for LCLS I and LCLS II. #[email protected]

Acquire all beam-dependent scalars across multiple IOCs on the same pulse

over multiple pulses of a certain kind up to 120Hz

Acquire up to 2800 values per scalar in one acquisition

Each value of the 2800 values can be an average/RMS of up to 1000 values

Configure and process 20 independent acquisitions simultaneously

Each acquisition request can specify: beam code (defines project)

machine conditions of interest – rate, timeslot, permit, and etc

BSA record processing

Improvement Required

BSA failures in operation

Beam Synchronous Acquisition (BSA) provides a common infrastructure

for aligning data to each individual beam pulse, as required by the Linac

Coherent Light Source (LCLS). BSA allows 20 independent acquisitions

simultaneously for the entire LCLS facility and is used extensively for beam

physics, machine diagnostics and operation. BSA is designed as a part of LCLS

timing system and is currently an EPICS record based implementation, allowing

timing receiver EPICS applications to easily add BSA functionality to their own

record processing. However the lack of real-time performance of EPICS record

processing and the increasing number of BSA devices has brought real-time

performance issues. We think the major reason for the performance problem is

due to the lack of separation between time-critical BSA upstream processing and

non-critical downstream processing. We are improving BSA with thread level

programming, breaking the global lock in each BSA device, adding a queue

between upstream and downstream processing, and moving out the non-critical

downstream to a lower priority worker thread. We are also investigating the use

of multiple worker threads for parallel processing in Symmetric Multi-Processor

(SMP) system.

BSA operation

Acquisition EDEF setup and request activation, done on the EVG IOC

360Hz checking on the EVG IOC with user notification when finished

360Hz requests (acquisition control) sent by the EVG IOC to all EVR IOCs

via timing fiber optical link

Data checking, averaging, and array update per scalar record per request on

the EVR IOCs

Figure 1: Data Acquisition across IOCs

Figure 2: BSA EDEF reservation, mask setup, and control screens

360Hz Task in EVG IOC

The 360Hz event task wakes up on interrupt which indicates timeslot change

on the AC powerline, and performs the following steps: Generates 196 bits timing pattern using pre-programmed scenario

Moves the timing pattern pipeline

Matches pattern for event codes and program the EVG sequence ram for next timeslot

Matches pattern for BSA

BSA Pattern Matching Checks for a match with the new pulse’s pattern, beam code and each EDEFs

Counts the number of measurements and number of values for the current average

per EDEF

The timing pattern is pipelined –the new pattern is actually for 3 pulses ahead

Figure 3: 360Hz tasks in EVG and EVR

360Hz Task in EVR IOC

EDEF bit masks included in 360Hz data sent by EVG to EVR

EVR IOC caches the data on data interrupt

EVR IOC 360Hz event task is activated on the next fiducial interrupt Copies pattern data to the end of pipeline

Moves the pipeline

Updates EDEF tables after the pipeline update preparing for BSA processing

done later in the same pulse

Figure 4: 360Hz tasks and EDEF table update

Data source PV triggers BSA record processing, after the EDEF table

updating by the 360Hz processing

The BSA record processing performs following: Matches timestamp to decide if the data is used in the BSA buffer

Calculates RMS and Average, if required

Checks Severity

Process BSA buffer (compress record)

Updates BSA management related information

Figure 5: BSA record processing

Figure 6: BSA fail scenario

LCLS-I has total ~1,150 BSA PVs ~982 PVs succeed for BSA (85.4%)

but, ~168 PVs fail (14.6%)

The reason for failure varies and depends on the situations misconfigured data source PVs

wrong trigger rates

data is ready too late due to slow device and communication, and does not meet

the time budget for BSA

Figure 6: Typical BSA fail scenario with unexpected delay,

resulting in wrong timestamp in record

Improvement Plan

Separate BSA processing into two parts: time-critical and non-time-critical

Provide C/C++ APIs for data source

device/driver code directly calls the API to put data into BSA

keep data receptor (record interface) for backward compatibility

The time-critical part (BSA upstream processing) is done by device/driver

context in the API method or it is done by high-priority callback thread in

the record interface method.

The BSA upstream processing quickly do the timestamp checks and if the

data matches, queues up the information for the lower priority agent thread

The lower priority agent thread processes the non-time-critical part (BSA

downstream processing) whenever it gets CPU time.

BSA fail scenario

BSA is based on epics record processing and is driven by the data source PV

Both time-critical and non-time-critical parts are processed by a high priority

callback in epics IOC: time-critical part: gather data from data source

and compare timestamp with EDEF table

non-time-critical part: calculate RMS, average, maintain counters,

and post out BSA buffers to epics PVs

The high priority callback is a single thread in epics IOC, the non-time-

critical part and other record processing, adds a delay on the time-critical part

The delay could make BSA fail due to timestamp mismatching.

The overloaded high priority callback thread could also delay the data source

PV processing and result it a late timestamp for the data.

Consideration for multi-core platform and parallel callback

for new EPICS base

Figure 7: Processing timeline for BSA improvement (for single core system)

Figure 8: Consideration for parallel callbacks for multi-core system

The improvement allows to make multiple agent threads which waits a

same queue to get BSA request from the upstream BSA processing. It

allows parallel processing for downstream part in the multi-core system

The improvement also allow to use parallel callbacks in new epics base

release. It allows parallel processing for the BSA upstreaming processing

Prototyping and Preliminary Test Results

Quick prototype code record processing based code

separates the upstream processing and the down stream processing

high priority callback processes the upstream part and queues downstream processing to

the lower priority agent thread.

Test for single-core system MVME6100/RTEMS/epics-R3.15.1

stress test under 100% CPU load (lowest priority dummy thread runs an infinite loop)

find spending ~ 5 usec for time-critical part and ~ 10 usec for non-time-critical

for each BSA cycle

capable for 64 BSA PVs @ 120Hz for 8 EDEFs (~61k BSA acquisition per sec)

improve almost x2 capacity compare to the original

Test for multi-core system x86/LinuxRT/epics-R3.15.1

stress test under 100% CPU load by lower priority scan threads run dummy

record processing

apply parallel callbacks and multiple agent threads

BSA capacity saturated around x2 compare to the original even having a large number

of cores and parallel callback threads

find there are record global locking and lock set issue to improve epics base itself

request EPICS core development team to solve the global locking and lock set issues.

Ongoing BSA test with the EPICS base snapshot version