CPR: Composable Performance Regression for Scalable...
Transcript of CPR: Composable Performance Regression for Scalable...
CPR: Composable Performance Regressionfor Scalable Multiprocessor Models
Benjamin C. LeeComputer Architecture Group
Microsoft Research
Jamison Collins, Hong WangMicroarchitecture Research Lab
Intel Corporation
David BrooksEngineering and Applied Sciences
Harvard University
International Symposium on Microarchitecture11 November 2008
Benjamin C. Lee, et al 1 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Technology TrendsSimulation ChallengesSimulation Paradigm
Technology TrendsMoore’s Law and increasing transistor densities
Performance and power efficiency
Transition to multi-core and parallelism
Benjamin C. Lee, et al 2 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Technology TrendsSimulation ChallengesSimulation Paradigm
Simulation Challenges
Cycle-Accurate SimulationAccurately identifies trends in design spaceTracks instructions’ progress through microprocessor
Simulation CostsCosts per simulation :: minutes, hours per designNumber of simulations :: scales exponentially (mp)
p,m :: parameter count, resolutionCosts per simulation :: scales superlinearity (nγ)
n cores, γ > 1
Benjamin C. Lee, et al 3 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Technology TrendsSimulation ChallengesSimulation Paradigm
Statistical InferenceConstruct inferential models from samples [ASPLOS’06]
Use models as efficient surrogates for simulator
Benjamin C. Lee, et al 4 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Technology TrendsSimulation ChallengesSimulation Paradigm
Multiprocessor Inference
Expensive CMP SimulationsPhysical resource contention increases host cyclesLogical resource contention increases simulated cyclesSynchronization increases cost per simulated cycle
Composable Performance RegressionLeverage core models to minimize CMP simulationsCore :: Uniprocessor performanceContention :: Shared resource contentionPenalty :: Performance penalty from contention
Benjamin C. Lee, et al 5 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Outline
MotivationTechnology TrendsSimulation ChallengesSimulation Paradigm
UniprocessorRegression TheoryModel EvaluationEvolutionary Design
MultiprocessorCPRModel EvaluationScalability
Benjamin C. Lee, et al 6 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Outline
MotivationTechnology TrendsSimulation ChallengesSimulation Paradigm
UniprocessorRegression TheoryModel EvaluationEvolutionary Design
MultiprocessorCPRModel EvaluationScalability
Benjamin C. Lee, et al 7 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Regression Theory
Statistical InferenceModels relationships between dataRequires initial data to train, formulate modelLeverages correlation from initial data for prediction
Regression ModelsLow training costs (sample 300 from 4.3B designs)Accurate inference (1.5% median error)Efficient computation (100’s of predictions per second)
Benjamin C. Lee, et al 8 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Formulationn simulated design samples, p design parameters
Response :: Y design metrics (e.g., performance)
Predictor :: X design parameters (e.g., ROB, cache)
Y =
y1...
yn
X =
x11 . . . x1p...
. . ....
xn1 . . . xnp
Coefficients :: β = [β0, . . . , βp]
T
Errors :: ε = [ε1, . . . , εn]T where εi ∼ N(0, σ2)
Y = Xβ + ε
F(Y) = G(X)βG + ε
Benjamin C. Lee, et al 9 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Prediction
Requirementsβ known from least squares model trainingX known for a given set of queries
Expected ResponseResponse as weighted sum of predictor valuesComputed efficiently as matrix-vector product
E[Y] = E[Xβ + ε]= E[Xβ] + E[ε]= Xβ
Benjamin C. Lee, et al 10 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Experimental Methodology
Intel Product SimulatorsModels consecutive generations of x86 µ-archSupports dual-, quad-core architectures.
Sampling Uniformly at Random (UAR)Parameter space includes predictors, ROB, caches15 parameters, 4.3B designsSample 300 designs for simulation
Statistical FrameworkR :: software environment for statistical computingHmisc and Design packages [Harrell]
Benjamin C. Lee, et al 11 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
BenchmarksDigital Home1 audio audio conversion2 video video compression3 photo photoshop albumGames4 unreal Unreal Tournament5 halflife Half-Life, modified Quake engine6 homeworld Homeworld, three-dimensional movementMultimedia7 mentalray rendering, ray tracing8 painter raster graphics package9 tachyon ray tracingOffice10 outlook personal information manager11 access relational database management system12 excel spreadsheet applicationProductivity13 md2 OpenSSL cryptographic hash function14 encrypt file encryption15 flash multimedia playerServer16 specweb web server17 tpcc on-line transaction processing18 specjapp J2EE 1.3 application servers
Benjamin C. Lee, et al 12 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Uniprocessor Model AccuracyObtain 50 additional random samples for validation
Core :: 1.5% median error
Benjamin C. Lee, et al 13 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Evolutionary Design
Evolutionary ApproachOptimize ProcXDesign ProcY, enhancing ProcX with µ-arch featuresRe-construct models, accounting for µ-arch featuresOptimize ProcY
Case StudyConsecutive generations of x86 µ-archImprove FE (e.g., branch prediction)Improve MEM (e.g., prefetching)Improve OOO (e.g., memory disambiguation)
Benjamin C. Lee, et al 14 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Evolving CachesImprove MEM: similar performance with smaller caches
Benjamin C. Lee, et al 15 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
Regression TheoryModel EvaluationEvolutionary Design
Evolving ROBImprove FE: more instructions inflight, suggests larger ROB
Improve MEM: fewer cache misses, suggests smaller ROB
Benjamin C. Lee, et al 16 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
Outline
MotivationTechnology TrendsSimulation ChallengesSimulation Paradigm
UniprocessorRegression TheoryModel EvaluationEvolutionary Design
MultiprocessorCPRModel EvaluationScalability
Benjamin C. Lee, et al 17 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
Composable Performance Regression
CPR :: build separate core, contention, penalty models
Requires simulations Nuni > Ncon ≥ Npen
Suppose core sims require T1, multi-core sims require T1nγ
Benjamin C. Lee, et al 18 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
CPR: Core
Train with uniprocessor sims from full parameter space
Estimate per core delay from all design parameters
Benjamin C. Lee, et al 19 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
CPR: Contention
Train with CMP, cache-only sims from reduced subspace
Estimate cache hits/misses from shared cache parameters
Benjamin C. Lee, et al 20 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
CPR: Penalty
Train with composed predictions, few CMP sims from full space
Estimate CMP core delays from core, contention predictions
Benjamin C. Lee, et al 21 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
BenchmarksDual-Core BenchmarksSet .1 .21 painter homeworld2 access mentalray3 specjapp specweb4 homeworld tachyon5 dense flash
Quad-Core BenchmarksSet .1 .2 .3 .41 dense excel flash md22 video specjapp specweb tachyon3 excel homeworld audio unreal4 outlook encrypt halflife homeworld5 painter mentalray outlook encrypt
Benjamin C. Lee, et al 22 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
Multiprocessor Model AccuracyDual-core :: 6.6% median error
Quad-core :: 4.8% median error
Benjamin C. Lee, et al 23 :: MICRO :: 11 Nov 08
MotivationUniprocessor
Multiprocessor
CPRModel EvaluationScalability
Scaling TrendsLower bound CPR costs 0.33x of naïve costs
Approach lower bound as uniprocessor models built
Benjamin C. Lee, et al 24 :: MICRO :: 11 Nov 08
Conclusion
Inference in IndustryEffective inference for x86 µ-arch1.5% median errors relative to simulationEvolutionary design for new features across generations
Composable Performance RegressionLeverage core models to minimize CMP simulationsConstruct separate core, contention, penalty models4.8 to 6.6% median errors for dual-, quad-core0.33x training costs of prior approaches
Benjamin C. Lee, et al 25 :: MICRO :: 11 Nov 08
Future Directions
Efficient Multiprogramming AnalysisEvaluate combinations without modeling every combination
Multi-Threaded WorkloadsExtend for homogeneous, heterogeneous threads.Account for synchronization events
Many-Core ArchitecturesConstruct models without many-core simulatorsConsider other shared resources (e.g., network)
Benjamin C. Lee, et al 26 :: MICRO :: 11 Nov 08
CPR: Composable Performance Regressionfor Scalable Multiprocessor Models
Benjamin C. LeeComputer Architecture Group
Microsoft Research
Jamison Collins, Hong WangMicroarchitecture Research Lab
Intel Corporation
David BrooksEngineering and Applied Sciences
Harvard University
International Symposium on Microarchitecture11 November 2008
Benjamin C. Lee, et al 27 :: MICRO :: 11 Nov 08