Enterprise Supercomputers (Ike Nassi, v5)

Enterprise Supercomputers. Ike Nassi (work done while at SAP; joint work with Rainer Brendle, Keith Klemba, and David Reed). UCSC, April 2012.

Description

Talk given at UC Santa Cruz, April 2012.

Transcript of Enterprise Supercomputers (Ike Nassi, v5)

1. Enterprise Supercomputers
Ike Nassi (work done while at SAP; joint work with Rainer Brendle, Keith Klemba, and David Reed)
UCSC, April 2012

2. A Personal Reflection
Let me start with a story. The setting, early 1982: Digital Equipment Corporation (DEC) had introduced the VAX (a super-minicomputer) three years earlier. Sales were going well. The VAX was a popular product. And then one day...

3. The Moral of the Story
We see that unmet needs (powerful personal workstations), a large market (the market for workstations), and a set of new technologies (bitmapped displays, 10 Mb Ethernet, powerful processors, open-source UNIX) add up to an inflection point ($$$$).

4. What are the big questions and issues (IMHO)?
Data: scale up vs. scale out vs. hybrid. Shared kernel (or not)? Shared memory and/or message passing? Just what is the role for flash: memory, SSD, other? Can we innovate without disrupting? (HW moves faster than software.) How far can we push scalable coherent shared memory? System design: simplify complex problems and connect the dots.

5. Is the Cloud an inflection point?
The emergence of the cloud indicates that we have been missing something. Current systems represent the work of the last two decades, while the drivers and emerging needs are mobility, cloud, big data, new UIs, elasticity, social, and new business models. Is there a gap?

6. Cloud computing has emerged as a viable mode for the enterprise
Ease of access. Elasticity. Mobility and sensors. More pervasive data storage. Computing as a utility, devices as appliances. Expandable storage. New business models are emerging (SaaS, PaaS, IaaS).

7. Today I'll talk a bit about
Current commodity server design; how this creates a computing gap for the enterprise; how businesses of tomorrow will function; and how enterprise supercomputers will support them, as a different layer of cloud. The perception is that cloud computing sits directly on current systems; my reality, on a closer look, is that there is a gap between the two.

8. The short story
Current view: Virtual Machine 1, Virtual Machine 2, Virtual Machine 3, ..., Virtual Machine N all running on one physical machine. Emerging view: Physical Machine 1, Physical Machine 2, Physical Machine 3, ..., Physical Machine M all running under one virtual machine.

9. Enterprise HW Computing Landscape
And what has emerged: the chip, the board, the blade server, the cloud.

10. And this creates a perceived technology gap
Problems with today's servers: cores and memory scale in lock step; access latencies to large memories from modern cores; addressing those memories at all. Current and emerging business needs: low-latency decision support and analytics; "what if?" simulation; market/sentiment analysis; pricing and inventory analysis. In other words, we need real real time. This creates a technology conflict: cores and memory need to scale more independently rather than in lock step. How can we resolve this conflict?

11. Flattening the layers
We need to innovate without disrupting customers (like your multicore laptops). Seamless platform transitions are difficult, but doable (Digital, Apple, Intel, Microsoft). Large-memory technology is poised to take off, but it needs appropriate hardware. Early results in High Performance Computing (HPC, i.e. scientific/technical computing) can now be applied to enterprise software. But today it's different: we have a lot of cores per chip now and better interconnects; we can flatten the layers and simplify; we can build systems that improve our software now without making any major modifications. The "Nail Soup" argument: if we are willing to modify our software, we can win bigger, but we do this on our own schedule. (See "Stone Soup" on Wikipedia.)

12. The Chip
How do we leverage advances in: processor architecture; multiprocessing (cores and threads); NUMA (e.g. QPI, HyperTransport); connecting to the board (sockets); the core-to-memory ratio? We have 64 bits of address, but can we reach those addresses?

13. Typical Boards
Highlights: sockets (2, 4, 8); speed versus access. How do we make these choices for the enterprise? (Figure: an 8-socket board layout.)

14. Storage
Considerations for storage: cost per terabyte (for the data explosion); speed of access (for real-time access); the storage network (for global business); disk, flash, DRAM. How can applications be optimized for storage? (Example: a 2 TB OCZ card.)

15. Memory Considerations
Highlights: today, cores and memory scale in lock step; large-memory computing (to handle modern business needs); collapsing the multi-tiered architecture levels; NUMA support is built in to Linux; physical limits (board design, connections, etc.). (Figure: a memory riser.)

16. The Magic Cable

17. Reincarnating the mainframe
In-memory computing provides the real-time performance we're looking for, and it requires more memory per core, in an asymmetric way. I remember what Dave Cutler said about the VAX: "There is nothing that helps virtual memory like real memory." Ike's corollary: there is nothing that helps an in-memory database like real memory. The speed of business should drive the need for more flexible server architectures. Current server architectures are limiting, and we must change the architecture and do so inexpensively. And how do we do this?

18. Taking In-Memory Computing Seriously
Basic assumptions and observations: hard disks are for archives only; all active data must be in DRAM. Good caching and data locality are essential, otherwise CPUs stall on too many cache misses, and there can be many levels of caches. Problems and opportunities for in-memory computing: addressable DRAM per box is limited by the processor's physical address pins, but we need to scale memory independently from physical boxes. Scaling architecture: arbitrary scaling of the amount of data stored in DRAM, and arbitrary, independent scaling of the number of active users and the associated computing load. DRAM access times have been the limiting factor for remote communications; adjust the architecture to DRAM latencies (~100 ns?). Inter-process communication is slow and hard to program (latencies are in the area of ~0.5-1 ms). We can do better.
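
The ~100 ns DRAM latency that slide 18 treats as the design target is easy to sanity-check. The pointer-chasing sketch below is mine, not part of the talk: it builds a 1 GiB cycle of pointers and times dependent loads, which defeats the prefetcher and approximates main-memory latency. The working-set size and iteration count are arbitrary choices, and the result varies with cache sizes and NUMA placement.

/* latency.c: rough DRAM latency probe via dependent pointer chasing.
 * Build: cc -O2 latency.c -o latency   (add -lrt on older glibc)
 * The 1 GiB working set and iteration count are arbitrary choices. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 27)   /* 2^27 pointers * 8 bytes = 1 GiB            */
#define ITERS (1u << 25)   /* number of dependent loads to time          */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's shuffle: a single random cycle, so every load depends on
     * the previous one and the hardware prefetcher cannot help. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < ITERS; i++) p = next[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (sink=%zu)\n", ns / ITERS, p);
    return 0;
}

Local DRAM latency on Xeon-class boards is commonly reported in the 80-120 ns range, so the slide's "~100 ns?" is a reasonable planning figure; remote (cross-socket or cross-node) accesses cost more.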

19. Putting it together: the Big Iron initiative

20. What is Big Iron technology today?
Big Iron-1: extreme performance at low cost; a traditional cluster. 10 x 4U nodes (Intel Xeon X7560, 2.26 GHz); 320 cores (640 hyper-threads), 32 cores per node; 5 TB memory (10 TB max), 20 TB solid-state disk; 1 NAS (72 TB, expandable to 180 TB). System architecture: SAP.
Big Iron-2: extreme performance, scalability, and a much simpler system model; an enterprise supercomputer that developers can treat as a single large physical server, with large shared coherent memory across nodes. 5 x 4U nodes (Intel Xeon X7560, 2.26 GHz); 160 cores (320 hyper-threads), 32 cores per node; 5 TB memory total, 30 TB solid-state disk; 160 Gb/s InfiniBand interconnect per node; scalable coherent shared memory (via ScaleMP). System architecture: SAP.

21. The next-generation servers, a.k.a. Big Iron, are a reality
Big Iron-1: fully operational September 30, 2010, 2:30 pm; all 10 Big Iron-1 nodes up and running. Big Iron-2: fully operational February 4, 2011; all 5 Big Iron-2 nodes up and running.

22. At a modern data center
These bad boys are huge. (Photo captions: backup generators; special cooling, which has won awards; Big Iron's pod, way down there.)

23. Progress: Sapphire and others
Sapphire Type 1 Machine (Senior): 32 nodes (5 racks, 2.5 tons); 1024 cores; 16 TB of memory; 64 TB of solid-state storage; 108 TB of network-attached storage.
Sapphire Type 1 Machine (Junior): 1 node (1 box, on stage); 80 cores; 1 TB of memory; 2 TB of SSD; no attached storage.

24. Coherent Shared Memory: a much simpler alternative to remote communication
It uses high-speed, low-latency networks (optical fiber or copper at 40 Gb/s or above); typical latencies are in the area of 1-5 microseconds, and aggregated throughput matches CPU consumption. An L4 cache is needed to balance the longer latency of non-local access (cache-coherent non-uniform memory access across different physical machines). The data transport and cache layers are separated into a tier below the operating system, (almost) never seen by the application or the operating system. Applications and database code can just reference data; the data is just there, i.e. it is a load/store architecture, not network datagrams. Application-level caches are possibly not necessary, since the system does this for you (?). Streaming of query results is simplified: the L4 cache schedules the read operations for you. Communication is much lighter weight; data is accessed directly and thread calls are simple and fast (higher quality from less code). Application designers do not confront communications protocol design issues. Parallelization of analytics and combining simulation with data are far simpler, enabling powerful new business capabilities of mixed analytics and decision support at scale.
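
To make slide 24's "the data is just there" point concrete, here is a minimal sketch of load/store sharing within one address space; it is my illustration, not code from the project. One thread publishes a result set and another consumes it through an ordinary pointer, with no serialization, no datagrams, and no protocol design. All names and sizes are illustrative.

/* shared_ref.c: load/store sharing between threads in one address space.
 * Build: cc -O2 -pthread shared_ref.c -o shared_ref
 * In a message-passing design the producer would have to serialize the
 * rows into a buffer and send them; here the consumer just follows a
 * pointer. Field names and sizes are illustrative only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS 1000000

struct result_set {
    size_t  nrows;
    double *values;          /* lives in memory shared by all threads */
};

static struct result_set rs;
static atomic_int ready;     /* release/acquire flag: "data is there" */

static void *producer(void *arg)
{
    (void)arg;
    rs.values = malloc(ROWS * sizeof *rs.values);
    for (size_t i = 0; i < ROWS; i++) rs.values[i] = (double)i * 0.5;
    rs.nrows = ROWS;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                    /* spin; a condition variable would be kinder */
    double sum = 0.0;
    for (size_t i = 0; i < rs.nrows; i++) sum += rs.values[i];  /* plain loads */
    printf("consumed %zu rows by reference, sum=%.1f\n", rs.nrows, sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    free(rs.values);
    return 0;
}

The slide's claim is that under coherent shared memory across nodes this same pattern keeps working when producer and consumer land on different physical machines, with the L4 cache layer hiding the remote access; only the latency changes.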

25. Software implications of coherent shared memory
On application servers: access to the database can be by reference. There are no caches on the application server side; the application can refer to database query results, including metadata, master data, etc., within the database process. Caches are handled by hardware and are guaranteed to be coherent. The result is a lean and fast application server for in-memory computing.
On database code: a physically distributed database can approach DRAM-like latencies. Database programmers can focus on database problems. Data replication and sharding are handled by touching the data; the L4 cache does the distribution automatically.
In fact, do we need to separate application servers from database servers at all? There is no lock-in to fixed machine configurations or clock rates, and no need to make application-level design tradeoffs between communication and memory access. One OS instance, one security model, easier to manage.

26. But what about...
Distributed transactions? We don't need no stinkin' distributed transactions!
What about traditional relational databases? In the future, databases become data structures! (Do I really believe this? Well, not really. I just wanted to make the point.)
Why not build a mainframe? From a software perspective, it is a mainframe (with superior price/performance for in-memory computing).
Is it virtualization? In traditional virtualization, you take multiple virtual machines and multiplex them onto the same physical hardware. We are taking physical hardware instances and running them on a single virtual instance.

27. A simplified programming model and landscape
Current view: Virtual Machine 1, Virtual Machine 2, Virtual Machine 3, ..., Virtual Machine N all running on one physical machine. Emerging view: Physical Machine 1, Physical Machine 2, Physical Machine 3, ..., Physical Machine M all running under one virtual machine (a single OS instance).

28. The business benefits of coherent shared memory
Improved in-memory computing performance at dramatically lower cost: the ability to build high-performance, mainframe-like computing systems from commodity cluster server components; the ability to scale memory capacity in a more in-memory-computing-friendly fashion; a simplified software system landscape, using a system architecture that can be made invisible to application software.
Minimal changes to applications: applications can scale seamlessly without changes to the application code or additional programming effort. Traditional cluster-based parallel processing requires specialized programming skills held only by top developers, and this constrains an organization's ability to scale development for in-memory computing. With coherent shared memory, the bulk of developers can develop as they do today and let the underlying hardware and lower-level software handle some of the resource allocation, unlike today. But existing, standard NUMA capabilities in Linux remain available.

29. Big Iron research
Next-generation server architectures for in-memory computing; processor architectures; interconnect technology; systems software for coherent memory; OS considerations; storage; fast data-load utilities; reliability and failover; performance engineering for enterprise computing loads; price/performance analysis; private and public cloud approaches; and more.

30. Some experiments that we have run
Industry-standard system benchmarks; evaluation of shared-memory program scalability; scalability of Big Data analytics (just beginning).

31. Industry standard benchmarks
STREAM: large-memory vector processing; measures the scalability of memory access; Big Iron would rank as the 12th highest recorded result. SPECjbb: highly concurrent Java transactions; more computational than STREAM; nearly linear in system size; Big Iron would rank as the 13th highest recorded result. (A minimal triad kernel, for illustration, follows the last slide.)

32. To sum up
Cloud computing has emerged. The businesses of tomorrow will function in real time. Really! There is a growing computing technology gap, and an unprecedented wave of advances in hardware. The mainframe has been reincarnated, with its attendant benefits, but it can be made from commodity parts. Enterprise supercomputers leveraging coherent-shared-memory principles and the wave of SW+HW advances will enable this new reality.

33. Questions?
Contact information: Ike Nassi. Email: [email protected]
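
As noted under slide 31, here is the STREAM "triad" kernel in minimal form, included only to show what that benchmark exercises: sustained memory bandwidth rather than arithmetic. This is my sketch, not the official benchmark; the array length and the OpenMP parallelization are illustrative choices.

/* triad.c: the STREAM "triad" kernel, a[i] = b[i] + q*c[i].
 * Build: cc -O2 -fopenmp triad.c -o triad
 * Array size is illustrative; the real benchmark requires arrays much
 * larger than the last-level cache. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000   /* 3 arrays * 8 B * 2e7 elements = ~480 MB of traffic */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    const double q = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) a[i] = b[i] + q * c[i];   /* the triad */
    double t1 = omp_get_wtime();

    /* 3 doubles move per element: two reads (b, c) and one write (a). */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("triad: %.3f s, ~%.1f GB/s\n", t1 - t0, gbytes / (t1 - t0));
    return 0;
}

The real benchmark, John McCalpin's STREAM, runs four such kernels (copy, scale, add, triad) over arrays much larger than the last-level cache and reports the best sustained bandwidth.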