
UNIFIED MEMORY ON P100

Peng Wang, HPC Developer Technology, NVIDIA

Oak Ridge Leadership Computing Facility, 1/11/2017

OVERVIEW

How UM works in P100

UM optimization

UM and directives


HOW UNIFIED MEMORY WORKS IN P100


HETEROGENEOUS ARCHITECTURES Memory hierarchy

[Diagram: a CPU with its system memory alongside GPUs 0..N, each with its own GPU memory]

UNIFIED MEMORY Starting with Kepler and CUDA 6

[Diagram: custom data management across separate system and GPU memories vs. the developer view with a single Unified Memory]

UNIFIED MEMORY IMPROVES DEV CYCLE

Without Unified Memory: Identify Parallel Regions -> Allocate and Move Data -> Implement GPU Kernels -> Optimize GPU Kernels. The allocate-and-move-data step is often time-consuming and bug-prone.

With Unified Memory: Identify Parallel Regions -> Implement GPU Kernels -> Optimize Data Locality -> Optimize GPU Kernels.

UNIFIED MEMORY Single pointer for CPU and GPU

CPU code:

    void sortfile(FILE *fp, int N) {
      char *data;
      data = (char *)malloc(N);

      fread(data, 1, N, fp);

      qsort(data, N, 1, compare);

      use_data(data);

      free(data);
    }

GPU code with Unified Memory:

    void sortfile(FILE *fp, int N) {
      char *data;
      cudaMallocManaged(&data, N);

      fread(data, 1, N, fp);

      qsort<<<...>>>(data, N, 1, compare);
      cudaDeviceSynchronize();

      use_data(data);

      cudaFree(data);
    }

UNIFIED MEMORY ON PRE-PASCAL Code example explained

    cudaMallocManaged(&ptr, ...);  // pages are populated in GPU memory
    *ptr = 1;                      // CPU page fault: data migrates to CPU
    qsort<<<...>>>(ptr);           // kernel launch: data migrates to GPU

Pages are allocated before they are used, so the GPU cannot be oversubscribed.

Pages migrate to the GPU only on kernel launch; they cannot migrate on demand.

UNIFIED MEMORY ON PRE-PASCAL Kernel launch triggers bulk page migrations

[Timeline: after cudaMallocManaged, CPU accesses page-fault data from GPU memory (~0.3 TB/s) to system memory (~0.1 TB/s) over PCI-E; the next kernel launch migrates all touched pages back to the GPU in bulk]

PAGE MIGRATION ENGINE Supports virtual memory demand paging

49-bit virtual addresses: sufficient to cover the 48-bit CPU address space plus all GPU memory

GPU page-faulting capability: can handle thousands of simultaneous page faults

Up to 2 MB page size: better TLB coverage of GPU memory

UNIFIED MEMORY ON PASCAL Now supports GPU page faults

    cudaMallocManaged(&ptr, ...);  // empty, no pages anywhere (similar to malloc)
    *ptr = 1;                      // CPU page fault: data is allocated on the CPU
    qsort<<<...>>>(ptr);           // GPU page fault: data migrates to the GPU

If the GPU does not have a VA translation, it issues an interrupt to the CPU.

The Unified Memory driver decides whether to map or to migrate based on heuristics.

Pages are populated and data migrated on first touch.

UNIFIED MEMORY ON PASCAL True on-demand page migrations

[Timeline: after cudaMallocManaged, individual page faults migrate pages on demand between GPU memory (~0.7 TB/s) and system memory (~0.1 TB/s) over the interconnect; the driver can also map a VA to system memory instead of migrating]

UNIFIED MEMORY ON PASCAL Improvements over previous GPU generations

On-demand page migration

GPU memory oversubscription is now practical (*)

Concurrent access to memory from CPU and GPU (page-level coherency)

Can access OS-controlled memory on supporting systems

(*) On pre-Pascal hardware you can use zero-copy instead, but the data then always stays in system memory. A minimal oversubscription sketch follows below.
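As a sketch of what oversubscription enables (our example, not from the slides; it needs a Pascal-class GPU and enough system memory to back the allocation):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(char *p, size_t n) {
      size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) p[i] = 1;                 // first touch demand-pages into GPU memory
    }

    int main() {
      size_t n = 32ULL << 30;              // 32 GB: twice the P100's 16 GB
      char *p;
      cudaMallocManaged(&p, n);            // succeeds even though n > GPU memory
      touch<<<(unsigned)((n + 255) / 256), 256>>>(p, n);
      cudaDeviceSynchronize();             // the driver migrates and evicts pages on demand
      printf("%d\n", p[0]);                // a CPU page fault migrates this page back
      cudaFree(p);
      return 0;
    }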

UM USE CASE: HPGMG

Fine grids are offloaded to the GPU (TOC, throughput-optimized cores); coarse grids are processed on the CPU (LOC, latency-optimized cores).

[Diagram: V-cycle and F-cycle schematics with a threshold level; grids above the threshold run on the GPU, grids below it on the CPU]

HPGMG AMR PROXY Data locality and reuse of AMR levels

OVERSUBSCRIPTION RESULTS

[Chart: application throughput (MDOF/s, 0-200) vs. application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for x86, K40, P100 (x86 PCI-e), and P100 (P8 NVLINK). The P100 memory size is 16 GB: at the smallest working sets all 5 levels fit in GPU memory, then only 2 levels fit, then only 1 level fits.]

x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads)

UNIFIED MEMORY: ATOMICS

Pre-Pascal: atomics from the GPU are atomic only for that GPU

• GPU atomics to peer memory are not atomic for the remote GPU
• GPU atomics to CPU memory are not atomic for CPU operations

Pascal: Unified Memory enables a wider scope for atomic operations

• NVLink supports native atomics in hardware (both CPU-GPU and GPU-GPU)
• PCI-E gets software-assisted atomics (CPU-GPU only)

UNIFIED MEMORY ATOMICS System-wide atomics

    __global__ void mykernel(int *addr) {
      atomicAdd_system(addr, 10);
    }

    void foo() {
      int *addr;
      cudaMallocManaged(&addr, 4);
      *addr = 0;

      mykernel<<<...>>>(addr);
      __sync_fetch_and_add(addr, 10);
    }

System-wide atomics are not available on Kepler / Maxwell.

Pascal enables system-wide atomics • Direct support of atomics over NVLink • Software-assisted over PCIe
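As a quick sanity check (our addition, assuming mykernel is launched with a single thread): after synchronizing, both increments should be visible to the CPU.

    cudaDeviceSynchronize();   // wait for the GPU-side atomic to complete
    printf("%d\n", *addr);     // expect 20: one atomicAdd_system + one __sync_fetch_and_add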

UNIFIED MEMORY: MULTI-GPU

Pre-Pascal: direct access requires P2P support; otherwise the data falls back to system memory

• Use CUDA_MANAGED_FORCE_DEVICE_ALLOC to mitigate this

Pascal: Unified Memory works much like the CPU-GPU scenario

• When GPU A accesses GPU B's memory, GPU A takes a page fault (see the sketch below)
• The driver can decide to migrate the page from GPU B to GPU A, or to map it into GPU A
• GPUs can map each other's memory, but the CPU cannot access GPU memory directly
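A minimal sketch of the Pascal multi-GPU flow (our example; devices 0 and 1, the size n, and the produce/consume kernels are illustrative):

    // One managed allocation touched by two GPUs.
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));

    cudaSetDevice(0);
    produce<<<grid, block>>>(buf, n);      // GPU 0 faults pages in on first touch
    cudaDeviceSynchronize();               // finish before handing the data to GPU 1

    cudaSetDevice(1);
    // Optional: prefetch to GPU 1 instead of taking remote page faults there
    cudaMemPrefetchAsync(buf, n * sizeof(float), 1);
    consume<<<grid, block>>>(buf, n);      // GPU 1 migrates or maps on access
    cudaDeviceSynchronize();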

UNIFIED MEMORY OPTIMIZATION

GENERAL GUIDELINES

UM overhead:

• Migration: moving the data, limited by CPU-GPU interconnect bandwidth
• Page faults: updating the page table takes ~10s of microseconds per page while execution stalls. Solution: prefetch.

Redundant transfers of read-only data. Solution: duplication.

Thrashing: when access is infrequent, migration overhead can exceed the locality benefit. Solution: mapping.

NEW HINTS API IN CUDA 8

cudaMemPrefetchAsync(ptr, length, destDevice, stream)

• Migrates data to destDevice, overlapping with compute
• Updates the page table with much lower overhead than a page fault in a kernel
• Async operation that follows CUDA stream semantics

cudaMemAdvise(ptr, length, advice, device)

• Specifies an allocation and usage policy for a memory region
• The user can set and unset advice at any time

PREFETCH Simple code example

    void foo(cudaStream_t s) {
      char *data;
      cudaMallocManaged(&data, N);

      init_data(data, N);

      // GPU faults are expensive; prefetch to avoid excess faults
      cudaMemPrefetchAsync(data, N, myGpuId, s);
      // potentially other compute ...
      mykernel<<<..., s>>>(data, N, 1, compare);

      // CPU faults are less expensive, but may still be worth avoiding
      cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);
      // potentially other compute ...
      cudaStreamSynchronize(s);

      use_data(data, N);
      cudaFree(data);
    }

PREFETCH EXPERIMENT

    __global__ void inc(float *a, int n)
    {
      int gid = blockIdx.x*blockDim.x + threadIdx.x;
      if (gid < n) a[gid] += 1;
    }

    cudaMallocManaged(&a, n*sizeof(float));
    for (int i = 0; i < n; i++)
      a[i] = 1;

    timer.Start();
    cudaMemPrefetchAsync(a, n*sizeof(float), 0, NULL);
    inc<<<grid, block>>>(a, n);
    cudaDeviceSynchronize();
    timer.Stop();

Results for n = 128M; times in ms, bandwidth in GB/s:

                  Total time    Kernel time    Prefetch time
    No prefetch   476 (2.3)     476 (2.3)      N/A
    Prefetch      49 (22)       2 (536)        47 (11)

Page faults within a kernel are expensive.
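The timer object above comes from the author's test harness; a minimal stand-in using CUDA events (our sketch) could be:

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);                          // timer.Start()
    cudaMemPrefetchAsync(a, n*sizeof(float), 0, NULL);
    inc<<<grid, block>>>(a, n);
    cudaEventRecord(t1);                          // timer.Stop()
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);            // elapsed time in milliseconds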

HPGMG: DATA PREFETCHING

Prefetch the next level while performing computations on the current level.

Use cudaMemPrefetchAsync on a non-blocking stream to overlap with the default stream.

[Diagram: pipeline across grid levels 0-4: while one level is computed, the next level is prefetched and a level that is no longer needed is evicted]

RESULTS WITH USER HINTS

[Chart: application throughput (MDOF/s, 0-200) vs. application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for x86, K40, P100 (x86 PCI-e), P100 + hints (x86 PCI-e), P100 (P8 NVLINK), and P100 + hints (P8 NVLINK). The P100 memory size is 16 GB: at the smallest working sets all 5 levels fit in GPU memory, then only 2 levels fit, then only 1 level fits.]

x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads)

READ DUPLICATION

cudaMemAdviseSetReadMostly: use when data is mostly read and occasionally written.

    init_data(data, N);
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
    mykernel<<<...>>>(data, N);   // a read-only copy is created on GPU page fault
    use_data(data, N);            // CPU reads will not page fault

READ DUPLICATION

Prefetching creates the read-duplicated copy of the data up front and avoids page faults.

Note: writes are still allowed, but they will generate a page fault and remapping.

    init_data(data, N);
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
    cudaMemPrefetchAsync(data, N, myGpuId, cudaStreamLegacy);   // read-only copy is created during the prefetch
    mykernel<<<...>>>(data, N);                                 // neither CPU nor GPU reads will fault
    use_data(data, N);
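Since advice can be set and unset at any time, one option before a write-heavy phase is to withdraw the hint first. A minimal sketch, assuming a hypothetical write_kernel:

    // Drop read duplication before many writes to avoid repeated
    // collapse-and-remap of the duplicated pages.
    cudaMemAdvise(data, N, cudaMemAdviseUnsetReadMostly, myGpuId);
    write_kernel<<<...>>>(data, N);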

DIRECT MAPPING Preferred location and direct access

cudaMemAdviseSetPreferredLocation

• Sets the preferred location of the data to avoid migrations
• The first access will page fault and establish the mapping

cudaMemAdviseSetAccessedBy

• Pre-maps the data to avoid page faults; the first access will not fault
• The actual data location can be anywhere

DIRECT MAPPING CUDA 8 performance hints (works with cudaMallocManaged)

    // set preferred location to CPU to avoid migrations
    cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    // keep this region mapped to my GPU to avoid page faults
    cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, myGpuId);

    // prefetch data to CPU and establish GPU mapping
    cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId, cudaStreamLegacy);

HYBRID IMPLEMENTATION Data sharing between CPU and GPU

Level N (large) and level N+1 (small) are both shared between the CPU and the GPU.

[Diagram: on level N the smoother, residual, and restriction run as GPU kernels; on level N+1 the smoother and residual run as CPU functions; both levels touch the same memory]

To avoid frequent migrations of level N+1, pin its pages in CPU memory.

DIRECT MAPPING Use case: HPGMG

Problem: excessive faults and migrations at the CPU-GPU crossover point.

Solution: pin the coarse levels to the CPU and map them into the GPU page tables, so there are no page faults. This gives a ~5% performance improvement on Tesla P100.

Pre-Pascal: allocate the data with cudaMallocHost, or with malloc + cudaHostRegister (a sketch follows below).
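A sketch of that pre-Pascal variant (our example; the coarse, nCoarse, and smoother names are illustrative): the coarse level lives in pinned host memory that the GPU accesses in place, zero-copy style.

    // Option 1: allocate the coarse level directly in pinned host memory
    double *coarse;
    cudaMallocHost(&coarse, nCoarse * sizeof(double));

    // Option 2: pin an existing heap allocation instead
    // coarse = (double *)malloc(nCoarse * sizeof(double));
    // cudaHostRegister(coarse, nCoarse * sizeof(double), cudaHostRegisterDefault);

    // The GPU then reads/writes the level over PCI-E with no page migrations
    smoother<<<grid, block>>>(coarse, nCoarse);   // hypothetical coarse-level kernel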

MPI AND UNIFIED MEMORY

Using Unified Memory with CUDA-aware MPI needs explicit support from the MPI implementation: check with your MPI implementation of choice for its support.

Unified Memory is supported in OpenMPI since 1.8.5 and in MVAPICH2-GDR since 2.2b.

Setting a preferred location may help improve the performance of CUDA-aware MPI with managed pointers (see the sketch below).
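A sketch of the managed-buffer pattern (our example; count, peer, tag, and the packing kernel are illustrative), assuming one of the CUDA-aware MPI builds above:

    // Managed halo buffer passed straight to CUDA-aware MPI.
    float *halo;
    cudaMallocManaged(&halo, count * sizeof(float));

    // Per the hint above: a preferred location can reduce migrations
    // while MPI stages the buffer.
    cudaMemAdvise(halo, count * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, myGpuId);

    pack_halo<<<grid, block>>>(halo, count);   // hypothetical packing kernel
    cudaDeviceSynchronize();                   // the data must be ready before the send
    MPI_Send(halo, count, MPI_FLOAT, peer, tag, MPI_COMM_WORLD);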

MULTI-GPU PERFORMANCE: HPGMG

MPI buffers can be allocated with cudaMalloc, cudaMallocHost, or cudaMallocManaged.

CUDA-aware MPI can stage managed buffers through system or device memory.

[Chart: throughput (MDOF/s, 0-90) on 2x K40 with P2P support for four configurations: managed, managed + CUDA-aware, zero-copy, and device + CUDA-aware]

UNIFIED MEMORY AND DIRECTIVES

W/o UM:

    module.F90:
      module particle_array
      real,dimension(:,:),allocatable :: zelectron
      !$acc declare create(zelectron)

    setup.F90:
      allocate(zelectron(6,me))
      call init(zelectron)
      !$acc update device(zelectron)

    pushe.F90:
      real mpgc(4,me)
      !$acc declare create(mpgc)
      !$acc parallel loop
      do m=1,me
        zelectron(1,m)=mpgc(1,m)
      enddo
      !$acc end parallel

      !$acc update host(zelectron)
      call foo(zelectron)

W/ UM:

    module.F90:
      module particle_array
      real,dimension(:,:),allocatable :: zelectron

    setup.F90:
      allocate(zelectron(6,me))
      call init(zelectron)

    pushe.F90:
      real mpgc(4,me)
      !$acc parallel loop
      do m=1,me
        zelectron(1,m)=...
      enddo
      !$acc end parallel

      call foo(zelectron)

Remove all data directives and add “-ta=managed” to the compile line.

UM OPENACC USE CASE: GTC

GTC is a plasma code for fusion science.

Turning on Unified Memory caused a 10X slowdown in a key compute routine:

    module.F90:
      module particle_array
      real,dimension(:,:),allocatable :: zelectron

    setup.F90:
      allocate(zelectron(6,me))
      call init(zelectron)

    pushe.F90:
      real mpgc(4,me)        ! automatic array
      !$acc parallel loop
      do m=1,me
        zelectron(1,m)=mpgc(1,m)
      enddo
      !$acc end parallel

      call foo(zelectron)

Problem: mpgc is an automatic array, so it is allocated and deallocated EACH time the routine is entered, and the page-fault cost is paid EACH time, even though no data migration occurs.
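The fix is to make the array persistent so its pages are faulted in once rather than on every call: in the Fortran source, that means replacing the automatic array with a module-level (or save) allocatable that is allocated once. A CUDA C++ analogue of the same idea (our sketch; names and launch shape are illustrative):

    // Persist the scratch buffer across calls instead of re-allocating
    // managed memory on every entry to the hot routine.
    float *mpgc = nullptr;                               // lives across calls

    void pushe_step(int me) {
      if (mpgc == nullptr)
        cudaMallocManaged(&mpgc, 4 * (size_t)me * sizeof(float));
      pushe_kernel<<<(me + 255) / 256, 256>>>(mpgc, me); // hypothetical kernel
    }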

UNIFIED MEMORY SUMMARY

On-demand page migration

Performance overheads: migration and page faults

Optimizations: prefetching, read duplication, and mapping