Post on 01-Nov-2014
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Application OS performance What does it depend on?
Greg Tinker – HP Master Technologist
Chris Tinker – HP Master Technologist
My background
Title: HP Master Technologist
IT industry experience
• Published author
• Patents pending
• Social media/white papers
Professional information
• HP MVP
• Social media ambassador
Years at HP: 14
Current responsibilities
• Lead technologist for HP’s Global Solution Support Engineering (GSSE) team
Name: Chris Tinker
E-mail: chris.tinker@hp.com
My background
Title: HP Master Technologist
IT industry experience
• Published author
• Patents pending
• Social media/white papers
Professional information
• HP MVP
• Social media ambassador
Years at HP: 14
Current responsibilities
• Lead technologist for HP’s Global Solution Support Engineering (GSSE) team
Name: Greg Tinker
E-mail: greg.tinker@hp.com
Application performance
The stack
Layer overview
User space
• Applications ~ user code
• GNU C library
Kernel space
• System call interface
• VFS (ext3, NTFS, VxFS, etc.), page allocation
• MPIO (device mapper), LVM, VxVM, sd<alpha>, character devices
• Block device drivers: SCSI, IDE, etc.
• Sockets, memory, process
• Tasks, scheduler, interrupts, CPU
• VM, logical protocols
• Network drivers, bus drivers
Overview
Application Performance
Application
Execution
Data Access
Managing resources
Platform Architecture
Architecture CPU
IA32 program on an X86_64 machine – can it run on a PA_RISC?
Can an executable run on a machine for which it was not compiled?
Performance trade offs
MAGIC
Originally used to determine the binary object type: exec_magic, demand_magic, shared_magic, shmem_magic. Around 1999/2000, ELF was adopted as the new file format, replacing those magic types; an ELF file is still identified by the magic bytes at the start of its header.
Architecture CPU
• Instruction set – leverage branch prediction
• Frequency
• Bus
• Cache – L3, L2, and L1 (by distance from the cores: registers, ALUs, branch units, load/store units, FP units, etc.)
• CPU bus:
– QPI – Intel QuickPath Interconnect
– HTB – AMD HyperTransport bus
– Front-side bus – older Intel/AMD
– Runway bus – PA-RISC
• NUMA
Architecture Execution – access to address space
• Locality domains
• Memory interleaving: node, channel, bank, cell (depends on hardware)
• The OS’s ability to discover locality domains and to differentiate the access cost from each domain to every other
• SLIT – advanced performance tuning option in the BIOS of HP ProLiant systems
• Integrity supports LDOMs – locality domains
Architecture Execution – access to address space: interleaving
Memory bank interleaving
When you use memory bank interleaving, data goes alternately to memory banks through the common memory channel connecting the DIMM banks and the integrated memory controller. Memory bank interleaving increases the probability that more DIMMs will remain in an active state (requiring more power) because the memory controller alternates between memory banks and between DIMMs.
Memory bank interleaving is automatically enabled on a processor node under the following conditions:
• Two single-rank DIMMs per channel result in two-way bank interleaving.
• Two dual-rank DIMMs per channel result in four-way bank interleaving.
• Two quad-rank DIMMs per channel result in eight-way bank interleaving.
• Two dual-rank DIMMs and one quad-rank DIMM result in eight-way bank interleaving, in servers using three DIMMs per channel.
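The four cases listed above are all consistent with the interleave width equaling the total number of ranks on the channel. The following Python sketch encodes only that observation; the function name is hypothetical, and a real platform may apply further rules that the slide does not mention.

```python
def bank_interleave_ways(ranks_per_dimm):
    """Return the bank-interleave width for one memory channel.

    Assumption: the width equals the total rank count on the channel,
    which reproduces the four listed cases and nothing more.
    """
    return sum(ranks_per_dimm)


# The four configurations from the list above.
print(bank_interleave_ways([1, 1]))     # two single-rank DIMMs
print(bank_interleave_ways([2, 2]))     # two dual-rank DIMMs
print(bank_interleave_ways([4, 4]))     # two quad-rank DIMMs
print(bank_interleave_ways([2, 2, 4]))  # two dual-rank DIMMs + one quad-rank DIMM
```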
Architecture Execution – access to address space: interleaving
Memory channel interleaving
Memory channel interleaving transfers data by alternate routing through the two available
memory channels. As a result, when the memory controller must access a block of logically
contiguous memory, the requests don’t stack up in the queue of a single channel. Alternate
routing decreases memory access latency and increases performance. However, memory
channel interleaving increases the probability that more DIMMs must remain in an active state.
Memory channel interleaving is always active on AMD Opteron 6200 Series processors.
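To illustrate why alternate routing keeps requests from stacking up on one channel, here is a minimal Python sketch. The 64-byte interleave granularity and the modulo mapping are assumptions chosen for illustration; actual memory controllers use their own address-mapping schemes.

```python
CACHE_LINE = 64   # bytes per interleave unit (assumed granularity)
CHANNELS = 2      # the text describes two memory channels


def channel_for(address):
    """Map a physical address to a channel by alternating per cache line."""
    return (address // CACHE_LINE) % CHANNELS


def queue_depths(start, length):
    """Count how many cache-line requests land on each channel."""
    depths = [0] * CHANNELS
    for addr in range(start, start + length, CACHE_LINE):
        depths[channel_for(addr)] += 1
    return depths


# A 1 KiB contiguous read is split across both channels
# instead of queuing on a single one.
print(queue_depths(0, 1024))
```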
Architecture Execution – access to address space: interleaving
Memory node interleaving
Node interleaving can interleave memory across any subset of nodes in the multi-processor
system.
Memory Cell interleaving
How a multi-cell machine interleaves memory (cell-local vs. global; see Superdome partitioning)
Architecture PA - Runway
[Figure: legacy Superdome cell. Labels: CC; CPUs P0 to P3, each on a Runway bus; memory Quad 0 to Quad 3 (M2); MID0 and MID1 data and address + control paths]
Architecture INTEL - FSB
Legacy FSB
Architecture AMD HTB
DL685 Hyper Transport BUS
Architecture Intel QPI
*http://www.intel.com/content/dam/staging/image/Kim/quickpath-technology.png
Architecture BUS limits
Bandwidth is limited by the lanes and the protocols
Manufacturers standardize on a PCI bus for the cards and slots
• 2X 32-bit PCI @ 33 MHz ~ 125 MB/s
• 4X 64-bit PCI @ 33/66 MHz
• 4X 64-bit PCI-X @ 66 MHz
• 4X 32-bit PCI-X @ 133 MHz
• 8X 64-bit PCI-X @ 133 MHz ~ 1024 MB/s
PCI-e replaces the older PCI architecture above and is capable of significantly higher signaling rates: 8 Gbit/s per lane.
Expect this to increase as protocols become more efficient
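The parallel-bus figures above follow from bus width times clock rate. A small Python sketch of that arithmetic is shown here; the function name is hypothetical, and it computes a theoretical peak assuming one transfer per clock, so the results land slightly above the slide's rounded values.

```python
def parallel_bus_peak_mb_s(width_bits, clock_mhz):
    """Theoretical peak of a parallel bus: one transfer of width_bits per clock.

    Assumes decimal megabytes and no protocol overhead.
    """
    return width_bits / 8 * clock_mhz


print(parallel_bus_peak_mb_s(32, 33))    # 32-bit PCI at 33 MHz
print(parallel_bus_peak_mb_s(64, 66))    # 64-bit PCI or PCI-X at 66 MHz
print(parallel_bus_peak_mb_s(64, 133))   # 64-bit PCI-X at 133 MHz
```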
Architecture BUS limits
Different types of memory have very different performance profiles.
• Anywhere from 800 MHz to 1333 MHz
• http://h18004.www1.hp.com/products/servers/options/tool/hp_memtool.html
Architecture BUS limits
SLIT
• Allows the BIOS to send the hardware layout to the OS
• System Locality Information Table
• The OS must support SLIT in order to leverage these latency factors
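On Linux, the relative node distances the kernel is using can be read from sysfs, which is one way to check whether locality information reached the OS. The sketch below is an illustration under that assumption: the helper names are hypothetical, and the directory is simply absent or single-node on systems without NUMA information.

```python
import glob
import os


def parse_distance_line(text):
    """Turn one sysfs 'distance' line such as '10 21' into a list of ints."""
    return [int(field) for field in text.split()]


def read_numa_distances(base="/sys/devices/system/node"):
    """Return {node_id: [distance to node 0, node 1, ...]} if sysfs has it."""
    table = {}
    for path in sorted(glob.glob(os.path.join(base, "node[0-9]*", "distance"))):
        node_name = os.path.basename(os.path.dirname(path))
        node_id = int(node_name[len("node"):])
        with open(path) as handle:
            table[node_id] = parse_distance_line(handle.read())
    return table


if __name__ == "__main__":
    # Prints nothing on systems that expose no NUMA nodes in sysfs.
    for node, row in sorted(read_numa_distances().items()):
        print(f"node{node}: {row}")
```

A row with a larger value for a remote node than for the local node indicates that the OS can differentiate the cost of reaching each locality domain.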
Execution
Execution Objects
Compiled or interpreted
• speed vs. agility
– Interpreted code can change at runtime
Interpreted code is executed indirectly
Compiled code is executed directly
Many languages today implement just-in-time compilers
• Perl source is compiled by the Perl engine before it is executed (so it is first parsed, then compiled, then executed). You can also compile Perl to produce an executable object.
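The same compile-then-execute sequence can be observed in Python, which is used here only as a stand-in for the Perl behavior described above. This is a minimal sketch; the variable names and the source string are illustrative.

```python
import dis

SOURCE = "answer = 6 * 7"

# Step 1: compile the source text into a code object (nothing runs yet).
code_object = compile(SOURCE, "<demo>", "exec")

# Step 2: execute the compiled object in its own namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["answer"])

# Optional: show the bytecode that the interpreter actually executes.
dis.dis(code_object)
```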
CPU Executable types Cross platform
IA-64 ~ RX8600
32bit ELF
X86_64 ~ DL980
PA RISC
MIPS
IA64 ELF
IA32
ELF-64 / X86_64
PARISC
MIPS
Use of emulation engines
ARIES
− HP-UX engine that allows PA-RISC binaries to execute on an IA-64 OS kernel and platform
Binfmt
− Linux kernel module (binfmt_misc) that allows binaries of many architecture types to be run through an emulator
Objects
Execution
Execution Language examples
Compiled | Interpreted
C, C++, C# | BASIC
Visual Basic .NET | PostScript
Python | Python
Lisp | Scripting languages
Java |
Perl | Perl*
Execution Determine object type
# file <string>
• Uses the magic to determine file type
# file /boot/vmlinuz-3.0.0-26-generic-pae
/boot/vmlinuz-3.0.0-26-generic-pae: Linux kernel x86 boot executable bzImage, version 3.0.0-26-generic-pae
(buildd@roseapple) #42-Ubuntu SMP Wed Sep , RO-rootFS, root_dev 0x801, swap_dev 0x4, Normal VGA
# file /bin/ls
/bin/ls: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV)
# readelf -a /bin/ls | head -50
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Intel 80386
Version: 0x1
Entry point address: 0x804be34
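The first fields of the readelf output above come from the 16-byte identification block at the start of the file. The Python sketch below decodes that block; the function name and the returned dictionary layout are illustrative choices, and only the class and data-encoding values needed for this example are mapped.

```python
import struct

ELF_MAGIC = b"\x7fELF"
CLASSES = {1: "ELF32", 2: "ELF64"}
ENCODINGS = {1: "little endian", 2: "big endian"}


def describe_elf_ident(header):
    """Decode the first 16 bytes (e_ident) of an ELF file."""
    if len(header) < 16 or header[:4] != ELF_MAGIC:
        return None  # not an ELF object
    ei_class, ei_data, ei_version, ei_osabi = struct.unpack_from("4B", header, 4)
    return {
        "class": CLASSES.get(ei_class, "unknown"),
        "data": ENCODINGS.get(ei_data, "unknown"),
        "version": ei_version,
        "osabi": ei_osabi,
    }


# The magic bytes shown in the readelf output above.
sample = bytes.fromhex("7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00")
print(describe_elf_ident(sample))

# To inspect a real file (the path is only an example):
# with open("/bin/ls", "rb") as handle:
#     print(describe_elf_ident(handle.read(16)))
```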
Sharing resources
System V message queues
Mutex locks
Data sharing
Context switching
Data access
The never-forgiving sleep(): waiting on an interrupt is a better way to go
Execution
Execution Processes and Threads
execve()
#include <unistd.h>
int execve(const char *filename, char *const argv[],
char *const envp[]);
filename ~ must be a binary executable, or a script whose first line calls out its interpreter with “#!”
Execution Processes and threads
exec(), fork(), clone(), vfork(), clone2(), etc.
Examples:
16935 fork() = 17424 <-- new task (HWP)
17424 execve("/bin/ls", ["ls", "-F", "--color=auto", "-l", "test"], [/* 56 vars */]) = 0
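The fork() followed by execve() sequence in the trace above can be reproduced with a short Python sketch on a UNIX-like system. The helper name run_child and the child command are illustrative assumptions; sys.executable is used so the example does not depend on a particular binary being installed.

```python
import os
import sys


def run_child(argv, env=None):
    """fork() a child, execve() argv[0] in it, and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image; execve only returns on failure.
        try:
            os.execve(argv[0], argv, env if env is not None else os.environ)
        finally:
            os._exit(127)
    # Parent: wait for the child created by fork().
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print(run_child([sys.executable, "-c", "raise SystemExit(7)"]))
```

Running this under strace -f should show the same pattern as the trace: a new task created by the parent, then an execve() in the child.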
Processes and threads
HWP – heavyweight process – fork() creates a new process
LWP – lightweight process – thread ~ clone()
The major difference is in the sharing of resources
An HWP shares only the parent's text, whereas an LWP can share everything but its private stack.
HWPs use pipes, PF_UNIX (UNIX sockets), signals, or the inter-process communication facilities (shared memory, message queues, and semaphores) to share data.
Execution
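The difference in resource sharing between a thread and a forked process can be demonstrated with a small Python sketch for UNIX-like systems. The names counter, bump, bump_in_thread, and bump_in_child_process are illustrative; the point is only that a thread's write is visible to its creator, while a forked child's write lands in the child's own copy of the data.

```python
import os
import threading

counter = [0]  # lives in this process's data/heap


def bump():
    counter[0] += 1


def bump_in_thread():
    """LWP: the thread shares the address space, so the change is visible."""
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    return counter[0]


def bump_in_child_process():
    """HWP: the forked child changes its own copy; the parent's is untouched."""
    pid = os.fork()
    if pid == 0:
        bump()
        os._exit(0)
    os.waitpid(pid, 0)
    return counter[0]


if __name__ == "__main__":
    print(bump_in_thread())         # includes the thread's increment
    print(bump_in_child_process())  # unchanged by the child's increment
```

To get the child's result back to the parent, one of the mechanisms listed above (a pipe, a socket, or shared memory) would be needed.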
Processes and threads
UNIX Processes
Single threaded process
Multithreaded process
Linux Processes
Single threaded process
Multithreaded process
Task group
Process/Task --
Thread(s) -- Execution
Execution Basic portions of address space
Text ~ machine code instructions.
• The OS usually maps this read-only, which allows many instances of the same executable to reference a single copy; the application code normally does not change.
Data
• Initialized Read only
• Initialized read/write
• Uninitialized Data
• Heap – dynamically allocated memory
Stack – local variables, stack frames
Shared memory
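On Linux, the regions listed above can be seen for a live process in /proc/<pid>/maps. The following Python sketch parses that file; the function names are hypothetical, and the output format is an arbitrary choice for illustration.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into (start, end, perms, name)."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    name = fields[5].strip() if len(fields) > 5 else ""
    return start, end, fields[1], name


def summarize_address_space(path="/proc/self/maps"):
    """Print the size and permissions of each mapping (Linux only)."""
    with open(path) as handle:
        for line in handle:
            start, end, perms, name = parse_maps_line(line)
            size_kib = (end - start) // 1024
            print(f"{size_kib:>8} KiB  {perms}  {name or '[anonymous]'}")


if __name__ == "__main__":
    summarize_address_space()
```

Text mappings typically show up with r-x permissions, while the entries named [heap] and [stack] are read/write.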
Memory – user address space
[Figure: two address-space layouts. Single-threaded process: text (Main(), routine1(), routine2(), …), data (Array1, Array2, …), heap, and one stack (routine, var1(), var2(), …). Multithreaded process: the same text, data, and heap, with one thread stack per thread (routine1, var1(), var2())]
Execution
Execution Tempered by
logic
• Compiler optimization
• Execution flow
CPU
• Hardware
• Scheduler – task switching
Data fetch
• Memory
• IO
Locks and/or IPC
Profiling
Profiling Toolbag
Application instrumentation
• gprof, Valgrind, Visual Studio, komodo, Xcode – many others
Compiler instrumentation
• At time of compile – use flags to leverage trace pointers
Kernel tracing
• Great for understanding what the application is doing when it enters KERNEL space
System profiling
Environment profiling
Profiling The layer involved and the precision required determine the toolbox
What is the application waiting on?
• CPU
• Networking
• Disk
• Filesystem
• locks?
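A first, coarse answer to "what is the application waiting on?" is to compare wall-clock time with CPU time: if the two are close, the work is CPU-bound; if wall-clock time is much larger, the process spent the difference waiting. The Python sketch below shows the idea; the function names are hypothetical, and sleep() merely stands in for a disk, network, or lock wait.

```python
import time


def cpu_vs_wait(func):
    """Run func and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def busy():
    sum(i * i for i in range(200_000))   # burns CPU


def waiting():
    time.sleep(0.2)                      # stands in for an I/O or lock wait


for name, func in (("busy", busy), ("waiting", waiting)):
    wall, cpu = cpu_vs_wait(func)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

This only separates "on CPU" from "off CPU"; the tracing tools in the toolbox are what identify which resource the off-CPU time was spent on.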
IPC Network Access
Semaphores
semop(), semctl()
Locking of resources
Message queues
msgsnd() / msgrcv()
Shared memory
shmget() shmat()
RPC – (request /response framework)
Normally leverages sockets but can leverage Pipes (no network)
Socket (layer 5)
TCP/IP (transport)
Segments – frames!
RTT
Sliding windows
BDP (bandwidth delay product)
Latency
Throughput/bandwidth
Serialization/parallelization
Flow Control
PROFILING
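The relationship between RTT, sliding windows, and the bandwidth delay product listed above can be worked through numerically. The Python sketch below uses assumed example figures (a 1 Gbit/s path, a 10 ms round trip, and a 64 KiB window); the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_seconds):
    """Bandwidth delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bits_per_s * rtt_seconds / 8


def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound in bits/s when only one window can be sent per round trip."""
    return window_bytes * 8 / rtt_seconds


# Example figures (assumed): a 1 Gbit/s path with a 10 ms round-trip time.
print(bdp_bytes(1e9, 0.010))

# A 64 KiB window on the same path caps throughput well below line rate.
print(window_limited_throughput(65536, 0.010))
```

When the window is smaller than the bandwidth delay product, latency rather than link bandwidth sets the throughput ceiling.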
Profiling The toolbox : Example
Linux | Windows | HP-UX | Solaris | AIX | ESX
Collectl / Glance | Perfmon / Sysinternals | Glance | Glance | topas | esxtop
strace | Sysinternals, Xperf | tusc | truss / strace | truss |
Kitrace / Oprofile | Logman / perfmon / PAL | Kitrace, caliper | | trace |
Profiling Glance
Application
Object
Execution
Profiling
Labs
Platform Architecture
Labs
Labs Scenario 1
1. Where do you start?
2. What data would you collect?
3. How would you analyze it?
Thank you