Post on 09-Jan-2017
Anurag Sekhsaria
Introduction GPU features Applications
Two Versions - Logan and Denver
Logan - 32-bit quad-core 4-PLUS-1 ARM Cortex-A15 CPU; up to 2.3 GHz; 28 nm process
Logan - Two part nos. available: CD575M and CD575MI
Denver - 64-bit dual-core based on ARMv8 architecture; up to 2.5 GHz
64 kB L1 (32 kB I-cache and 32 kB D-cache); 2 MB L2 cache
OUR FOCUS → LOGAN
Vector Graphics, Rasterisation, Variable Symmetric Multiprocessing (vSMP), Streaming Multiprocessor (SMX), Dynamic Parallelism, Hyper-Q, Polymorph Engine, Bindless Textures
Vector graphics is the use of geometrical primitives such as points, lines, curves, and shapes or polygons—all of which are based on mathematical expressions—to represent images in computer graphics
Rasterisation (or rasterization) is the task of taking an image described in a vector graphics format (shapes) and converting it into a raster image (pixels or dots) for output on a video display or printer, or for storage in a bitmap file format.
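To make the vector-to-raster step concrete, here is a minimal sketch that rasterises one vector primitive (a line segment) into pixel coordinates. It uses Bresenham's line algorithm, which is only one common technique and is not taken from the slides; the function name `rasterise_line` is hypothetical.

```python
def rasterise_line(x0, y0, x1, y1):
    """Return the pixels covering the segment (x0, y0)-(x1, y1).

    Integer-only Bresenham line algorithm; endpoints are included.
    """
    pixels = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1   # step direction along x
    sy = 1 if y0 < y1 else -1   # step direction along y
    err = dx + dy               # running error term
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:            # error says: advance in x
            err += dy
            x0 += sx
        if e2 <= dx:            # error says: advance in y
            err += dx
            y0 += sy
    return pixels


if __name__ == "__main__":
    for px in rasterise_line(0, 0, 5, 2):
        print(px)
```

The vector description is just the two endpoints; the raster result is the list of discrete pixels a display would light up.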
vSMP reason: mobile devices are in standby state for almost 80% of the time → power saving
4-PLUS-1 CPU → 4 high-performance, more power-intensive cores and 1 low-power, low-performance core
Switching between cores is done on the basis of the processing required; intelligent switching hysteresis
Total Power = Leakage + Dynamic
Dynamic Power ∝ Frequency × Voltage²
Fast Process = Optimized for high frequency operation, but higher leakage
Low Power Process = Operates at lower frequency with lower leakage
High to low performance crossover at 600MHz
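A small worked sketch of the dynamic power relation (proportional to frequency times voltage squared) shows why running slower at a lower voltage pays off. The function names and the example ratios are illustrative assumptions, not measured Tegra K1 figures.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic power relative to a baseline, using P_dyn proportional to f * V^2."""
    return freq_ratio * voltage_ratio ** 2


def total_power(leakage_w, baseline_dynamic_w, freq_ratio, voltage_ratio):
    """Total power = leakage + dynamic, with dynamic scaled from a baseline."""
    return leakage_w + baseline_dynamic_w * relative_dynamic_power(
        freq_ratio, voltage_ratio
    )


if __name__ == "__main__":
    # Hypothetical operating point: half the frequency at 80% of the voltage.
    print(relative_dynamic_power(0.5, 0.8))
```

Because voltage enters squared, a modest voltage drop saves more dynamic power than the same relative drop in frequency. Leakage is a separate term, which is why the low power process matters at low frequencies.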
Low power core has a peak frequency of 1 GHz
Both cores are OS transparent
Not all 4 high performance cores active; dynamic enable/disable
Note: all the 5 cores cannot be active simultaneously
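The switching rule with hysteresis can be sketched as below. This is a simplified model under stated assumptions: the 650/550 MHz thresholds around the 600 MHz crossover and the class name `ClusterSelector` are hypothetical, and the real hardware/software policy may also use time-based filtering.

```python
class ClusterSelector:
    """Choose between the low-power core and the fast cores with hysteresis."""

    def __init__(self, up_mhz=650, down_mhz=550):
        # Assumed thresholds placed either side of the 600 MHz crossover.
        self.up_mhz = up_mhz
        self.down_mhz = down_mhz
        self.active = "low_power"

    def update(self, required_mhz):
        """Return the cluster to run on for the requested frequency."""
        if self.active == "low_power" and required_mhz > self.up_mhz:
            self.active = "fast"
        elif self.active == "fast" and required_mhz < self.down_mhz:
            self.active = "low_power"
        return self.active


if __name__ == "__main__":
    selector = ClusterSelector()
    for demand in (400, 600, 700, 600, 560, 500):
        print(demand, selector.update(demand))
```

The gap between the two thresholds stops the system from flipping between clusters when the load hovers near the crossover point.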
Dynamic Parallelism
Motive is to free the CPU
Handle varied workloads and use the GPU efficiently
Run complex, less structured tasks
Any kernel can launch another kernel and can create the necessary streams, events and dependencies needed to process additional work without the need for host CPU interaction.
Hyper-Q
GPU core can be used by multiple CPUs
Enables multiple CPU cores to launch work on a single GPU simultaneously
Increases GPU utilization and slashes CPU idle times
32 simultaneous, hardware-managed connections(?)
Applications: Internet of Things (IoT), Medical, Traffic Monitoring, Video Analytics
Snoop Control Unit (SCU)
About the SCU
The SCU connects one to four Cortex-A9 processors to the memory system through the AXI interfaces.
The SCU functions are to:
- maintain data cache coherency between the Cortex-A9 processors
- initiate L2 AXI memory accesses
- arbitrate between Cortex-A9 processors requesting L2 accesses
- manage ACP accesses.
AGENDA: AVP, MPIO, Interrupt Controllers, Clock, Boot, Power States, PMC, Flow Controller, Power Architecture, Memory Controller, Peripherals
AVP- Audio Video Processor
Functions:
- Manage initial boot stages
- Control and assist hardware audio decoding blocks, BSEA and VCP2
- Control and assist hardware video decoder, VDE
256 kB local RAM (IRAM)
8 kB cache
Multi-purpose I/O: MPIO
Each MPIO consists of:
Output driver with:
- Tri-state capability
- Drive strength controls
- Push-pull mode, open-drain mode or both
Input receiver with Schmitt mode, CMOS mode or both
Weak pull-up or pull-down
They stay in their POR state until changed by software (bootloader or OS)
Default pad drive impedance is 50 ohms
5 types of MPIO pads:
- ST (standard)
- DD (dual driver): 3.3 V tolerant (pull-up resistor) regardless of input voltage; must be set to open-drain mode; special power sequencing considerations apply
- OD (open drain): 5 V tolerant; no push-pull driver
- CZ (controlled Z): tightly controlled impedance
- LV: 1.8 V tolerant
Each MPIO can have up to 5 functions: up to 4 SFIOs (special functions, where the pin serves a peripheral) and 1 as GPIO
Pinmux controller handles MPIO functionality and has one register per MPIO
GPIO Controller
GPIO controller is divided into 8 banks
Each bank handles up to 32 MPIOs
Within each bank, GPIOs are arranged as 4 ports of 8 bits each
162 GPIOs in all
Individually configurable as input, output, or interrupt source with edge/level triggering
Lock bit functionality (optional) ensures GPIO configuration is not modified during runtime; a system reset can clear this bit
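The bank/port/bit arrangement above (8 banks, 4 ports per bank, 8 bits per port) can be sketched as a simple index calculation. The linear GPIO numbering and the function name `locate_gpio` are assumptions for illustration; the actual pin-to-port assignment is defined in the TRM.

```python
BANKS = 8
PORTS_PER_BANK = 4
BITS_PER_PORT = 8


def locate_gpio(gpio):
    """Map a linear GPIO number to (bank, port within bank, bit within port).

    Assumes GPIOs are numbered consecutively through banks and ports.
    """
    capacity = BANKS * PORTS_PER_BANK * BITS_PER_PORT  # 256 slots in total
    if not 0 <= gpio < capacity:
        raise ValueError(f"GPIO number {gpio} out of range 0..{capacity - 1}")
    bank, offset = divmod(gpio, PORTS_PER_BANK * BITS_PER_PORT)
    port, bit = divmod(offset, BITS_PER_PORT)
    return bank, port, bit


if __name__ == "__main__":
    print(locate_gpio(37))
```

The controller has room for 256 positions, of which 162 are populated as GPIOs, so not every (bank, port, bit) combination corresponds to a real pin.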
Unused Pin- Power Saving
Assert tri-state and disable input buffer
If all pins in a pad control group are unused, set the drive strengths and slew rates to a minimum
If all pins on a power rail are unused, assert E_NO_IOPOWER for that rail in the PMC registers
Interrupt Controllers
Two: vGIC (Virtual Generic Interrupt Controller) and LIC (Legacy Interrupt Controller)
vGIC is for the Cortex-A15 CPUs and LIC is for the ARM7 AVP
160 hardware interrupts grouped into slices of 32, where each slice can be configured independently
There is one vGIC per CPU cluster, and it runs at half the clock frequency of that cluster
vGIC supports 256 interrupts, each with a unique ID
Interrupt sources for vGIC:
- Software Generated Interrupts (SGI)
- Private Peripheral Interrupts (PPI)
- Shared Peripheral Interrupts (SPI)
SGIs (also called IPIs, i.e. Inter Processor Interrupts) are generated by writing to vGIC registers; max. of 16 in number, ID 0 to 15
PPIs are generated by a peripheral that is specific to a CPU. 7 PPIs per CPU. nFIQ and nIRQ provided as pins.(?)
SPIs are external hardware interrupts given via IRQ pins and also by internal SoC units. Level triggered
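A small sketch of how an interrupt ID maps to its source type. The SGI range (0 to 15) and the 256-ID total come from the notes above; the PPI range (16 to 31) and SPI start (32) follow the standard ARM GIC ID layout. The function name `classify_interrupt` is hypothetical.

```python
def classify_interrupt(int_id, num_ids=256):
    """Return 'SGI', 'PPI' or 'SPI' for a GIC interrupt ID.

    ID ranges follow the ARM GIC convention: 0-15 SGI, 16-31 PPI, 32+ SPI.
    """
    if not 0 <= int_id < num_ids:
        raise ValueError(f"interrupt ID {int_id} out of range 0..{num_ids - 1}")
    if int_id < 16:
        return "SGI"
    if int_id < 32:
        return "PPI"
    return "SPI"


if __name__ == "__main__":
    for example_id in (3, 27, 100):
        print(example_id, classify_interrupt(example_id))
```

Only some of the 16 PPI IDs are actually used per CPU (7 according to the notes); the rest of that range is simply reserved.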
Two external clocks: 32.768 kHz (for PMC and RTC) and 12 MHz
16 PLLs
For saving power by clock gating, refer to page 78 of the TRM
Each peripheral has its own CLOCK_SOURCE register: 2 bits to select from 4 clock sources and 8 bits for the clock divider (7 for integer and 1 for fraction)
CL-DVFS (Closed Loop Dynamic Voltage and Frequency Scaling) registers help control the clock and power supply to the FCPU (fast CPU) complex
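The CLOCK_SOURCE field layout described above (2 source-select bits plus an 8-bit divider with 7 integer bits and 1 fraction bit) can be decoded as in the sketch below. The bit positions (source select in the top two bits of a 32-bit register, divider in the low byte) and the function name are assumptions; whether the hardware adds a fixed offset to the divider value must be checked in the TRM.

```python
def decode_clock_source(reg, src_shift=30):
    """Split a CLOCK_SOURCE register value into (source select, divider field).

    Assumed layout: 2-bit source select at bit `src_shift`, 8-bit divider in
    bits 7:0 read as 7.1 fixed point (1 fraction bit = steps of 0.5).
    """
    source_sel = (reg >> src_shift) & 0x3   # one of 4 clock sources
    div_field = reg & 0xFF                  # raw 8-bit divider field
    integer_part = div_field >> 1           # upper 7 bits
    half = div_field & 0x1                  # lowest bit adds 0.5
    return source_sel, integer_part + 0.5 * half


if __name__ == "__main__":
    example = (2 << 30) | 0b00000101        # hypothetical register value
    print(decode_clock_source(example))
```

A single fraction bit means the divider can only move in steps of one half, which is enough to hit common peripheral clock rates from a handful of PLL outputs.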
RTC
Maintains seconds and milliseconds counters
5 alarm registers
Always-ON power domain
Can issue interrupts in LP states
Hardware adjusts for clock drift due to PPM variations of the oscillator
All registers (except BUSY) use the 32 kHz clock domain
TIMERS
RTC
NVIDIA Generic Timers (10 nos)
WDT- 5 nos: 1 per FCPU and 1 for COP (AVP) [LP CPU doesn't have WDT?]
GIT- ARM CPU Generic Timers (4 timers per CPU: Secure and Non-Secure Physical Timers, Hypervisor Timer and Virtual Timer)
TSC- Generic Time System Counter; reference for GIT. It is a part of the PMC
Note: any timer can be used as WDT
Power On Reset (POR)- deasserted externally (SYS_RESET_N pin)
Reset by thermal sensor
Watchdog Timer- two types:
- Deadman Timer (legacy) WDT: interrupt issued on 1st expiry; reset on 2nd expiry, but only of some subunits
- WDT2: interrupt issued on 1st expiry, FIQ on 2nd, CPU reset on 3rd, full system reset on 4th
Software reset- config bit in PMC; resets whole chip
LP0 wakeup reset- PMC logic controlled
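The WDT2 escalation sequence (interrupt, then FIQ, then CPU reset, then full system reset) can be modelled as a small state sketch. The class name `Wdt2Model` is hypothetical, and the assumption that servicing ("kicking") the watchdog restarts the escalation from the first stage is not stated in the notes.

```python
WDT2_ACTIONS = ("interrupt", "FIQ", "CPU reset", "full system reset")


class Wdt2Model:
    """Toy model of WDT2: each consecutive expiry escalates the response."""

    def __init__(self):
        self.expiries = 0

    def kick(self):
        # Assumption: servicing the watchdog clears the escalation count.
        self.expiries = 0

    def expire(self):
        """Record one expiry and return the action taken for it."""
        self.expiries = min(self.expiries + 1, len(WDT2_ACTIONS))
        return WDT2_ACTIONS[self.expiries - 1]


if __name__ == "__main__":
    wdt = Wdt2Model()
    for _ in range(4):
        print(wdt.expire())
```

The staged response gives software two chances (IRQ, then FIQ) to recover before any reset is forced.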
During POR or system reset, the reset controller deasserts reset to the boot blocks first, and to the CPU and COP only after 511 oscillator clock periods, so that the COP/CPU cannot talk to these boot devices while they are still in reset
Non boot devices are brought into operation from reset by software
At POR bits of registers RST_DEVICES_L/H/U/V/W/X and CLK_OUT_ENB_L/H/U/V/W/X are set by hardware(pg 90 of TRM)
Blocks necessary for the boot are:
- AVP with its L1
- All system buses like AHB, APB etc.
- Timer
- RTC
- NOR flash controller
- eFUSE
- GPIO
- CoreSight- debug controller; one per cluster
BOOT SOURCES: SPI Flash, eMMC, USB Recovery
Power States: Active, Suspend (LP1), Deep Sleep (LP0), OFF
Power States: LP2
Cluster switch (a variant of LP2):
- Cluster 1 to 0 switch
- Cluster 0 to 1 switch: CPU3, i.e. the last CPU of cluster 0, initiates this switch
Power States: LP3 (per CPU)
If a CPU is idle for a short time its clock is gated, i.e. the CPU is halted (the CPU is not power gated; only its clock is stopped)
Only the small wake-up logic clock stays enabled; the others are gated
LP3 is exited on detection of an IRQ or FIQ
Flow controller not needed; clock gating/ungating is internal to the FCPUs and LPCPU
AVP Low Power States
No specific instruction to halt the AVP
However, its memory bus can be put into WAIT state by the flow controller (HALT state)
IRQ/FIQ and other wake events can bring the AVP out of halt state
During halt, the AVP clock is automatically gated by hardware
AVP is NOT power gated
Power Management Controller (PMC)
Provides interface to external PMIC
Controls voltage switching/transitions as the processor changes power states (e.g. LP0, LP1)
Processes power/clock requests (acts as slave) from various peripherals
To speed up operation, the PMC register file operates in the local peripheral interface bus domain (APB) rather than in the 32 kHz clock domain used for PMC processing
Flow Controller- IMPORTANT*
Provides sequencing of hardware-controlled CPU power states
Handles switching between CPU clusters 0 & 1 and also switching them OFF
Receives CPU power state requests from CPUs, sends power ON/OFF requests to the PMC, which power gates/ungates the corresponding CPUs
Monitors per-CPU interrupts and events to determine CPU wake events
Initiates CPU wake
WFI (wait for interrupt) command used to trigger low power states
Note: Flow controller has 3 different state machines:
* Main CPU flow controller state machines shown in fig. above
* CPU rail power UP state machine
* State machine for COP
Flow controller uses CPU-ID (in the MPIDR register) to identify the cores
Power Architecture
There are sense pins for the various system voltage domains, which assess them continuously
Power Gating and Ungating
For CCPLEX PG partitions, sequencing is ensured by hardware when power gating is done via the flow controller
For SoC (non-CCPLEX) PG partitions, sequencing is done by software
Power gating controllers- two in number:
1. SoC PG controller
2. GPU PG controller
SoC PG Controller
Controls 8 zones and uses a fixed power ON/OFF sequence with a fixed set of delays
Power OFF sequence is the reverse of power ON
Same programming register for all zones
GPU PG Controller
GPU PG is controlled by the GPMU unit inside the Kepler GPU
Independent of SoC/CPU PG
If CPU and GPU share the same voltage rail (for cost reduction), software settings should ensure that the CPU and GPU are not power gated simultaneously, to avoid di/dt issues
Fast CPU PG Controller
Used to power gate fast CPU partitions
Functioning similar to SoC PG controller
Power Gating
Flow controller uses a separate state machine for PG of each CPU
PG done based on CPU-ID
Only one request handled at a time to avoid power noise issues
Flow controller - PMC interface has core ID and not cluster ID
As shown in figure, CPU and non-CPU components can be power gated separately
At boot, the CPU rail is OFF by default. It can be enabled by the AVP using a register write to the PMC registers
CPU rail can also be switched ON by the PMIC (I2C write)
COP can switch OFF the FCPUs
CPU and non-CPU blocks cannot be switched simultaneously
Hardware Accelerators: NEON, ISP
Memory Controller- RAM
Only DDR3L and LPDDR3 supported and tested by NVIDIA
x32 bit or x64 bit configuration
4 chip selects
4 individually controllable clock enables
4 individually controllable ODTs
Rank 0 size ≥ rank 1
3 BA (bank address bits)
Column width- 9 to 12 bits
Row width- 12 to 16 bits
DDR3 up to 966 MHz
Up to 4 GB supported (as per datasheet)
1T and 2T support
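The address widths listed above determine how much memory one rank can hold. A minimal sketch, assuming each (bank, row, column) address selects one bus-width word; the function name `rank_capacity_bytes` and the example geometries are illustrative, not taken from the datasheet.

```python
def rank_capacity_bytes(row_bits, col_bits, bank_bits=3, bus_width_bits=32):
    """Capacity of one DRAM rank from its address geometry.

    Assumes every (bank, row, column) combination addresses one word as wide
    as the data bus.
    """
    if not 12 <= row_bits <= 16:
        raise ValueError("row width must be 12 to 16 bits")
    if not 9 <= col_bits <= 12:
        raise ValueError("column width must be 9 to 12 bits")
    if bus_width_bits not in (32, 64):
        raise ValueError("bus must be x32 or x64")
    locations = 1 << (row_bits + col_bits + bank_bits)
    return locations * (bus_width_bits // 8)


if __name__ == "__main__":
    gib = 1 << 30
    print(rank_capacity_bytes(15, 10) / gib, "GiB")
    print(rank_capacity_bytes(16, 10, bus_width_bits=64) / gib, "GiB")
```

With 16 row bits, 10 column bits and 3 bank bits on a x64 bus, a single rank already reaches the 4 GB maximum quoted above.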
Peripherals- USB
USB_OTG supports USB recovery boot
USB2 and USB3 support host mode only
XUSB supports host mode only
Peripherals- AUDIO
Features:
- I2S controllers
- 1 S/PDIF controller
Peripherals- Display Controller
2 independent display controllers which can support 2 independent displays
Peripherals- MIPI CSI 2.0
2 CSI interfaces, each supports up to 4 lanes
2 image sensors can be used simultaneously (e.g. stereo apps)
CSI B can support one additional single-lane input
Peripherals- Video Input(VI)
Peripherals- SD/MMC Controller
Peripherals- SATA & PCIe
SATA spec Rev 3.1 and AHCI spec Rev 1.3.1
5 lane PCIe; Gen 1 (2.5 GT/s) and Gen 2 (5 GT/s) supported
Peripherals- I2C
6 I2C interfaces
I2C 3.0 spec compliant
Modes supported:
- Standard (up to 100 kbps)
- Fast Mode (up to 400 kbps)
- Fast Mode Plus (up to 1 Mbps)
- High Speed mode (up to 3.4 Mbps)
Peripherals- UART, SPI & Misc.
4 UART interfaces (with RTS and CTS); up to 12.5 Mbps baud rate
SPI master up to 65 MHz and slave up to 45 MHz, six CS
JTAG
4 PWFM interfaces
Serial Transport Stream (TS) Controller for Digital TV