
Modern Hardware for DBMS

Mrutyunjay (Mjay), University of Colorado, Denver

Motivation: Hardware Trends
• Multi-core CPUs
• Many-core co-processors: GPUs (NVIDIA, AMD Radeon)
• Huge main-memory capacity with complex access characteristics (caches, NUMA)
• Non-volatile storage: flash SSDs (Solid State Drives)

Multi-Core CPU: Motivation
Around 2005, CPUs hit the frequency-scaling wall. Since then, improvements have come from adding multiple processing cores to the same CPU chip, forming chip multiprocessors, and from building servers with multiple CPU sockets of multicore processors (SMP of CMP).

The Multi-core Alternative
Use Moore's law to place more cores per chip:
• 2× cores/chip with each CMOS generation
• Roughly the same clock frequency
• Known as multi-core chips or chip multiprocessors (CMP)

The good news:
• Exponentially scaling peak performance
• No power problems due to clock frequency
• Easier design and verification

The bad news:
• Need a parallel program if we want to run a single app faster
• Power density is still an issue as transistors shrink

Multi-Core CPU: Challenges
This is how we think it works...

...and this is how it ACTUALLY works.

Multi-Core CPU: Challenges
• Type of cores: e.g., a few OOO (out-of-order) cores vs. many simple cores
• Memory hierarchy: which caching levels are shared and which are private; cache coherence; synchronization
• On-chip interconnect: bus vs. ring vs. scalable interconnect (e.g., mesh); flat vs. hierarchical

Multi-Core CPU
All processors have access to unified physical memory; they can communicate using plain loads and stores (see the sketch after this list).

Advantages:
• Looks like a better multithreaded processor (multitasking)
• Requires only evolutionary changes to the OS
• Threads within an app communicate implicitly without using the OS: simpler to code for, with low overhead
• App development: first focus on correctness, then on performance

Disadvantages:
• Implicit communication is hard to optimize
• Synchronization can get tricky
• Higher hardware complexity for cache management
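To make "communicate using loads and stores" concrete, here is a minimal sketch (illustrative, not from the original slides; all names are made up, and C11 atomics stand in for the synchronization the slides only allude to): one thread stores a value into shared memory, another loads it, with no OS call in the data path.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int shared_value;   /* lives in the unified physical memory */
    static atomic_int ready;

    static void *producer(void *arg)
    {
        (void)arg;
        atomic_store(&shared_value, 42);   /* communicate with a store... */
        atomic_store(&ready, 1);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load(&ready))       /* ...and a load (spin-wait) */
            ;
        printf("got %d\n", atomic_load(&shared_value));
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }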

NUMA Architecture
NUMA: Non-Uniform Memory Access. Each CPU socket has its own local memory; accessing memory attached to another socket goes over the interconnect and is slower. (A NUMA-aware allocation sketch follows below.)
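A minimal sketch of what NUMA awareness looks like in code (an assumption-laden illustration: it presumes Linux with libnuma installed, linked with -lnuma, and the node number 0 is arbitrary):

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return 1;
        }
        size_t size = 1 << 20;
        /* Allocate 1 MB on NUMA node 0: local (fast) for CPUs on that
           socket, remote (slower) for CPUs on other sockets. */
        void *buf = numa_alloc_onnode(size, 0);
        if (buf == NULL)
            return 1;
        /* ... use buf from a thread pinned to node 0 ... */
        numa_free(buf, size);
        return 0;
    }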

Many-Core: GPU / GPGPU
A GPU (Graphics Processing Unit) is a specialized microprocessor for accelerating graphics rendering.
• GPUs were traditionally used for graphics computing
• GPUs now allow general-purpose computing easily
• GPGPU: using the GPU for general-purpose computing in physics, finance, biology, geosciences, medicine, etc.
• Main vendors: NVIDIA and AMD (Radeon)

GPU vs CPU
• GPU designs with up to a thousand cores enable massively parallel computing
• GPU architectures built from streaming multiprocessors take the form of SIMD processors
[Figure: CPU vs. GPU core layout]

SIMD Processor
SIMD: Single Instruction, Multiple Data: one instruction is applied to many data elements at once.
[Figures: distributed-memory SIMD computer; shared-memory SIMD computer]

NVIDIA GPUs with SIMD Processors
• Each GPU has ≥ 1 Streaming Multiprocessor (SM)
• Each SM has the design of a simple SIMD processor with 8-192 Streaming Processors (SPs)
• Applies to NVIDIA GeForce 8-Series GPUs and later

Questions from Previous Session

SMP of CMP:
• SMP: sockets of multicore processors (multiple CPUs in a single system)
• CMP: Chip Multiprocessor (a single chip with multiple/many cores)

Streaming Multiprocessor components:
• SP: Streaming Processor
• SFU: Special Function Unit
• Double-precision unit
• Multithreaded instruction unit
• Hardware thread scheduling

GPU Cores
14 Streaming Multiprocessors per GPU, 32 cores per Streaming Multiprocessor: 14 × 32 = 448 cores in total.

Development tools for GPU

Two main approaches: CUDA (NVIDIA's framework, covered next) and OpenCL (an open standard).

Another tool: OpenACC.

What is CUDA?
• CUDA = Compute Unified Device Architecture
• A development framework for NVIDIA GPUs
• Extends the C language
• Supports NVIDIA GeForce 8-Series GPUs and later

Definitions:
• Host = CPU; Device = GPU
• Host memory = RAM; Device memory = RAM on the GPU
[Figure: host (CPU) and its host memory connected to the device (GPU) and its device memory via the PCI Express bus]

CUDA Compute Model
1. CPU sends data to the GPU
2. CPU instructs the processing on the GPU
3. GPU processes the data
4. CPU collects the results from the GPU


CUDA Example
The host and device code below implement the four steps:
1. CPU sends data to the GPU
2. CPU instructs the processing on the GPU
3. GPU processes the data
4. CPU collects the results from the GPU

Host Code:

    int N = 1000;
    int size = N * sizeof(float);
    float A[1000], *dA;

    /* 1. CPU sends data to the GPU */
    cudaMalloc((void **)&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);

    /* 2. CPU instructs the processing: launch the kernel
       (50 blocks x 20 threads = 1000 threads, one per element) */
    ComputeArray<<<50, 20>>>(dA, N);

    /* 4. CPU collects the results from the GPU */
    cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);

Device Code:

    /* 3. GPU processes the data: each thread squares one array element */
    __global__ void ComputeArray(float *A, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            A[i] = A[i] * A[i];
    }

CUDA Example
• A kernel is executed as a grid of blocks
• A block is a batch of threads that can cooperate with each other by:
  – Sharing data through shared memory
  – Synchronizing their execution
• Threads from different blocks cannot cooperate (a sketch of block-level cooperation follows below)
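As a minimal sketch of this cooperation (illustrative, not from the original slides; the BlockSum kernel and the 256-thread block size are assumptions), each block below sums its tile of the input through shared memory and __syncthreads(); combining the per-block results is left to the host, since blocks cannot cooperate with each other.

    /* Launch with 256 threads per block, e.g.:
       BlockSum<<<(N + 255) / 256, 256>>>(dIn, dSums, N); */
    __global__ void BlockSum(const float *in, float *blockSums, int N)
    {
        __shared__ float tile[256];          /* shared by this block's threads */
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        tile[tid] = (i < N) ? in[i] : 0.0f;  /* each thread loads one element */
        __syncthreads();                     /* wait until the tile is loaded */

        /* Tree reduction: threads cooperate through shared memory. */
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }

        /* One partial sum per block; blocks cannot combine these among
           themselves, so the host (or a second launch) adds them up. */
        if (tid == 0)
            blockSums[blockIdx.x] = tile[0];
    }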

GPU Computation Challenges
• Limit kernel launches
• Limit data transfers (solution: overlapped transfers, sketched below)
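A minimal sketch of overlapped transfers (illustrative: it reuses N, size, and ComputeArray from the example above and assumes N is even). Splitting the work across two CUDA streams lets the copy of one half overlap the kernel running on the other half; asynchronous copies require pinned host memory, hence cudaMallocHost.

    float *hA, *dA;
    int half = N / 2;
    cudaMallocHost((void **)&hA, size);   /* pinned host buffer */
    cudaMalloc((void **)&dA, size);

    cudaStream_t streams[2];
    for (int s = 0; s < 2; s++)
        cudaStreamCreate(&streams[s]);

    /* Each stream copies in, processes, and copies back its own half;
       the hardware overlaps one stream's copies with the other's kernel. */
    for (int s = 0; s < 2; s++) {
        int off = s * half;
        cudaMemcpyAsync(dA + off, hA + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        ComputeArray<<<(half + 255) / 256, 256, 0, streams[s]>>>(dA + off, half);
        cudaMemcpyAsync(hA + off, dA + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();              /* wait for both streams */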

GPU in Databases & Data Mining
GPU strengths are useful here:
• Memory bandwidth
• Parallel processing
Applications:
• Accelerating SQL queries: roughly 10× improvement
• Also well suited for stream mining: continuous queries on streaming data instead of one-time queries on a static database

Memory/Storage

Memory Hierarchy

The slowest parts are main memory and fixed disk. Can we decrease the latency gap between main memory and fixed disk?

Solution: the SSD.

SSD: New-Generation Non-Volatile Memory
A Solid-State Drive (SSD) is a data storage device that emulates a hard disk drive (HDD) but has no moving parts. NAND-flash SSDs are essentially arrays of flash memory devices plus a controller that electrically and mechanically emulates, and is software-compatible with, a magnetic HDD.

SSD: Architecture
• Host interface logic
• SSD controller
• RAM buffer
• Flash memory packages

Flash Memory

NAND-flash cells have a limited lifespan due to their limited number of P/E (program/erase) cycles.

What will be the initial state of an SSD? Ans: still looking for it.


Read, Write and Erase
• Reads are aligned on page size: it is not possible to read less than one page at once. One can of course request just one byte from the operating system, but a full page will be retrieved in the SSD, forcing a lot more data to be read than necessary.
• Writes are aligned on page size: when writing to an SSD, writes happen in increments of the page size. So even if a write operation affects only one byte, a whole page is written anyway. Writing more data than necessary is known as write amplification.
• Pages cannot be overwritten: a NAND-flash page can be written to only if it is in the "free" state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a "free" page, an operation called "read-modify-write".
• Erases are aligned on block size: pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages; only whole blocks can be erased at once.

Example of a write: with 4 KB pages (a typical page size), updating even a single byte forces the SSD to write a full 4 KB page, i.e., 4096× more data than the update itself.

Buffer small writes: to maximize throughput, whenever possible keep small writes in a RAM buffer, and when the buffer is full, perform a single large write that batches all the small writes.

Align writes: align writes on the page size, and write chunks of data that are multiples of the page size. (Both tips are sketched in code below.)
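A minimal sketch of both tips (illustrative assumptions: a 4 KB page size and a plain POSIX file descriptor standing in for the device): records accumulate in a RAM buffer and reach the drive only as whole, page-sized writes.

    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096                 /* assumed flash page size */

    static char   buf[PAGE_SIZE];          /* RAM staging buffer    */
    static size_t used = 0;                /* bytes buffered so far */

    /* Emit one full page; pad the tail so the write stays page-aligned. */
    static void flush_page(int fd)
    {
        memset(buf + used, 0, PAGE_SIZE - used);
        write(fd, buf, PAGE_SIZE);         /* one large write, not many small ones */
        used = 0;
    }

    /* Buffer a small record, flushing whenever a page boundary is reached. */
    void buffered_write(int fd, const char *rec, size_t len)
    {
        while (len > 0) {
            size_t n = PAGE_SIZE - used;
            if (n > len)
                n = len;
            memcpy(buf + used, rec, n);
            used += n; rec += n; len -= n;
            if (used == PAGE_SIZE)
                flush_page(fd);
        }
    }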

SSD: How it stores data?
Flash cells can store one or more bits per cell (SLC, MLC, TLC). There is a latency difference for each type: more levels per cell increase the latency, delaying reads and writes. Solution: hybrid SSDs that mix cell types.

Garbage collection
• The garbage-collection process in the SSD controller ensures that "stale" pages are erased and restored to the "free" state so that incoming write commands can be processed.
• Split cold and hot data: hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data is copied along every time the hot data is updated in a read-modify-write operation, and is moved along during garbage collection for wear leveling. Splitting cold and hot data into separate pages as much as possible makes the garbage collector's job easier (see the sketch below).
• Buffer hot data: extremely hot data should be buffered as much as possible and written to the drive as infrequently as possible.
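A minimal sketch of the hot/cold split (illustrative assumptions: 4 KB pages, records no larger than one page, a file descriptor standing in for the device): each class gets its own staging buffer, so no flushed page mixes hot and cold records.

    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    struct stage { char buf[PAGE_SIZE]; size_t used; };
    static struct stage hot, cold;        /* one staging buffer per class */

    static void flush_stage(int fd, struct stage *s)
    {
        memset(s->buf + s->used, 0, PAGE_SIZE - s->used);  /* pad to a page */
        write(fd, s->buf, PAGE_SIZE);
        s->used = 0;
    }

    /* Route each record by its update frequency (assumes len <= PAGE_SIZE),
       so every flushed page holds only hot or only cold data. */
    void staged_write(int fd, const char *rec, size_t len, int is_hot)
    {
        struct stage *s = is_hot ? &hot : &cold;
        if (s->used + len > PAGE_SIZE)
            flush_stage(fd, s);
        memcpy(s->buf + s->used, rec, len);
        s->used += len;
    }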

Flash Translation Layer
• The main factor that made adoption of SSDs so easy is that they use the same host interfaces as HDDs. Although presenting an array of Logical Block Addresses (LBAs) makes sense for HDDs, whose sectors can be overwritten, it is not fully suited to the way flash memory works.
• For this reason, an additional component is required to hide the inner characteristics of NAND flash memory and expose only an array of LBAs to the host. This component, the Flash Translation Layer (FTL), resides in the SSD controller.
• The FTL is critical and has two main purposes: logical block mapping and garbage collection.
• The mapping takes the form of a table which, for any LBA, gives the corresponding physical block address (PBA). This mapping table is stored in the RAM of the SSD for speed of access and is persisted in flash memory in case of power failure. When the SSD powers up, the table is read from the persisted version and reconstructed in the SSD's RAM. (A minimal sketch of such a table follows below.)
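A minimal sketch of the logical block mapping (illustrative: the table size, names, and the caller-supplied free page are assumptions): an in-RAM array maps each LBA to its current PBA, and every overwrite is redirected to a fresh page, leaving the old one stale for the garbage collector.

    #include <stdint.h>

    #define NUM_LBAS 1024
    #define INVALID  UINT32_MAX

    static uint32_t lba_to_pba[NUM_LBAS];   /* the mapping table, kept in SSD RAM */

    /* Mark every LBA unmapped; a real SSD would instead rebuild this table
       from the copy persisted in flash at power-up. */
    void ftl_init(void)
    {
        for (uint32_t i = 0; i < NUM_LBAS; i++)
            lba_to_pba[i] = INVALID;
    }

    /* Look up the physical page backing a logical address. */
    uint32_t ftl_read(uint32_t lba)
    {
        return lba_to_pba[lba];             /* INVALID if never written */
    }

    /* Redirect a logical write to a free physical page: flash pages cannot
       be overwritten in place, so the FTL remaps the LBA and leaves the old
       PBA "stale" for the garbage collector to erase later. */
    void ftl_write(uint32_t lba, uint32_t free_pba)
    {
        lba_to_pba[lba] = free_pba;         /* old mapping (if any) goes stale */
    }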

Internal Parallelism in SSDs
Internally, several levels of parallelism allow writing to several blocks at once on different NAND-flash chips, forming what is called a "clustered block". The multiple levels of parallelism:
• Channel-level parallelism
• Package-level parallelism
• Chip-level parallelism
• Plane-level parallelism

Characteristics and latencies of NAND-flash memory

Advantages & Disadvantages
SSD advantages:
• Reads and writes are much faster than on a traditional HDD
• Allows PCs to boot up and launch programs far more quickly
• More physically robust
• Uses less power and generates less heat

SSD disadvantages:
• Lower capacity than HDDs
• Higher storage cost per GB
• Limited number of data write cycles
• Performance degradation over time

Questions???