Exam.spring10.Sols


EE282: Computer Systems Architecture Spring 2010 Stanford University June 9th, 2010

EE282 Final Exam Solutions

Exam Instructions: Answer each of the questions included in the exam. Write all of your answers directly on the examination paper, including any work that you wish to be considered for partial credit. The examination is closed book, but you can make use of one page of notes and a calculator. You may not use a computer or browser of any kind.

On equations: Wherever possible, make sure to first write the equation with symbolic terms, then the equation rewritten with the numerical values, and then the final solution. Partial credit will be weighted appropriately for each component of the problem, and providing more information improves the likelihood that partial credit can be awarded.

On writing code: Unless otherwise stated, for any answers that require code examples or fragments, you should write C-like pseudocode. You do not need to optimize your code unless specifically instructed to do so. Comments for any code are not strictly required on the exam, but are highly recommended. They may help you receive partial credit on a problem, if they help us determine what you were trying to do.

On time: You will have three hours (180 minutes) to complete this exam. Budget your time and try to leave some at the end to go over your work. The point weightings correspond roughly to the difficulty of each problem. If you find a problem too difficult at first, move on to the other problems and revisit it later.

Name (print) ___________________________________________________________________

Leland Username ___________________________________ SCPD Student (Y/N) ___________

THE STANFORD UNIVERSITY HONOR CODE

The Honor Code is an undertaking of the students, individually and collectively:

(1) that they will not give or receive aid in examinations; that they will not give or receive unpermitted aid in class work, in the preparation of reports, or in any other work that is to be used by the instructor as the basis of grading;

(2) that they will do their share and take an active part in seeing to it that others as well as themselves uphold the spirit and letter of the Honor Code.

I acknowledge and accept the Honor Code.

Name (sign) __________________________________________________________

Score Grader

Problem 1 27 ________ ______

Problem 2 40 ________ ______

Problem 3 33 ________ ______

Total (100) ______


Problem 1: The Truth will Set You Free [27 points]

Indicate if the following statements are True or False. Provide a single-sentence justification for your answer. Answers without justification will receive no credit. [3 points per statement]

a) Assume that you are doing a DMA transfer from memory to an I/O device. Also assume that, in addition to reading DRAM, you also need to read the processor caches that may be caching addresses involved in the DRAM transfer. If an L2 cache read produces a hit, then it must be forwarded to the L1 cache for an additional lookup at that level.
F – it needs to be forwarded to the L1 cache only if the cache line is dirty (or suspected to be dirty).

b) Software prefetching can only be implemented if the hardware implements non-blocking caches.
T – otherwise the processor would immediately stall on the prefetch instruction.

c) The memory allocation policies used by software can affect the power consumption of the system.
T – they can affect the number of DRAM banks/ranks/DIMMs/channels that are active in order to serve a program (as opposed to being in standby or low-power modes).

d) The most important metric when building a data center is low energy consumption.
F – total cost of ownership (TCO) is the most important one.

e) When scaling down the voltage and clock frequency of a processor, the right order is to first reduce the power supply voltage and then reduce the clock frequency.
F – you first need to reduce the frequency, as electronics work slower at lower voltages.

f) Using virtually-addressed caches in a processor leads to lower energy consumption compared to a processor with physically-addressed caches.
T – potentially yes, because you can skip translation for accesses that hit in the L1.

g) For RAID-5, the actual number of disk accesses necessary to write a single byte is 2.
F – it is actually 4 (read old value and old parity, write new value and new parity).

h) RAID-1 improves the performance of read accesses.
T – you can send a read access to either disk.

i) In a virtual machine environment, I/O interrupts are first processed by the virtual machine monitor and then by the interrupt handler of the guest OS.
T – the other way around is unsafe, as an I/O interrupt may actually be for another guest.
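The four accesses in statement g) can be illustrated with a minimal sketch of a RAID-5 small write. The function name `raid5_small_write` and the in-memory arrays standing in for the data and parity disks are assumptions made for illustration only; a real array would work on whole sectors or stripes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a RAID-5 small write: the two arrays stand in for
 * the data disk and the parity disk of one stripe.
 * Returns the number of disk accesses performed.
 */
int raid5_small_write(uint8_t *data_disk, uint8_t *parity_disk,
                      size_t offset, uint8_t new_value)
{
    uint8_t old_value  = data_disk[offset];    /* access 1: read old data   */
    uint8_t old_parity = parity_disk[offset];  /* access 2: read old parity */

    /* XOR out the old data and XOR in the new data; the other data disks
       in the stripe never need to be read. */
    uint8_t new_parity = (uint8_t)(old_parity ^ old_value ^ new_value);

    data_disk[offset]   = new_value;           /* access 3: write new data   */
    parity_disk[offset] = new_parity;          /* access 4: write new parity */
    return 4;
}
```

The parity update relies only on XOR being its own inverse, which is why two reads and two writes are enough regardless of how many data disks share the stripe.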


Problem 2: In theory, practice and theory are the same [40 points]

Provide short answers to the following questions. Typically, a few sentences or a short bulleted list will be sufficient. A long explanation is likely to include some incorrect statement, so keep it short and to the point.

a) Provide one specific example for each of the following "top-10" approaches for improving energy efficiency in computer systems. Each example should be no longer than one sentence. The example can be from any class of systems (notebooks, smartphones, datacenters) and can be a system-level, chip-level, or software technique. [10 points]

- Use energy-efficient technologies: use flash instead of disks
- Match power to work: dynamic voltage-frequency scaling when load is low
- Match work to power: reduce the frame rate for video playback when low on battery
- Piggyback energy events: interrupt coalescing to amortize overheads
- Special-purpose solutions: use GPUs, DSPs, or other special function units
- Cross-layer efficiency: workload consolidation and scale-down in a datacenter
- Trade off some other metric: store 2 instead of 3 copies of data in a data center
- Trade off the uncommon case: provision for a lower performance load to avoid excessive energy costs in power supply or cooling
- Spend somebody else's power: client sends computation to server (assuming communication cost is lower than computation cost)
- Spend power to save power: compress data to be able to turn off some memory/disk components


b) Processor vendors are using the exponentially increasing transistor budgets to include multiple cores per chip. Even if we assume that we have a large number of independent programs or tasks to run in parallel on the cores, what may be two factors that limit the usefulness of a multi-core chip? [4 points – 2 points each]

- Power consumption and power density: you may not be able to provide power or remove heat if all the processors in the chip are working concurrently. For example, the heat removal capabilities are proportional to the area of the chip, so they remain fixed as we put an increasing number of processors in the same space.
- Memory & I/O bandwidth: the collective memory and I/O bandwidth of all the applications may exceed the bandwidth available for off-chip communication. The off-chip bandwidth depends on the number of pins of the chip, which in turn depends on the area of the chip. Hence, the bandwidth does not scale with the number of processors we squeeze into one chip.

c) Certain companies propose that we should operate data centers without active air conditioning (aka air-side economization). This implies that the servers in the data center will be operating at a higher temperature. What is the tradeoff you should study to evaluate if this is a good idea? [4 points]

It is an issue of balancing costs. On the one hand you have the cost of buying machines. In classical data centers with air conditioning, you pay for new machines every 3 years. Without air conditioning, you will replace them more often. If that extra cost per year is less than what you save from not paying for air conditioning (equipment, energy, etc.), then it is a good idea.

2 points to realize it is a cost issue, 2 points to explain a tradeoff between the cost of HW replacement and the cost of cooling.


d) Messages in interconnect networks typically use error-detecting but not error-correcting codes (as is the case in memories and disks). Describe briefly how you can provide error correction in networks without the use of error-correcting codes. What are the advantages of your proposal over just using error-correcting codes for the contents of each message? What are the implementation requirements of your proposal? [6 points]

You can use retransmission to do error correction. Once a message is determined to be incorrect or lost, the sender retransmits it. Retransmission requires buffering of messages at the sender, reordering capabilities in the receiver, an acknowledgement protocol, and a timeout mechanism to detect lost messages. The advantages are:

- it works even if you get many errors in one message (more than what a cost-effective error-correcting code can correct)
- it works even if the whole message is lost

2 points to mention retransmission, 2 points to explain a little how it works and its requirements, 1 point for each advantage.

e) A system can recover from errors by taking periodic checkpoints of its state and reverting to one of them when an error is detected. List the factors you would consider to select the frequency of checkpointing the system state and the number of active checkpoints maintained by the system. [6 points]

- The latency of error detection
- The storage overhead (for both a single checkpoint and all the checkpoints)
- The time to restore one or more checkpoints for recovery

6 points, 2 for each factor.
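The retransmission scheme from part d) can be sketched as a sender that keeps the message buffered and resends it until the receiver's check passes. This is only an illustration under stated assumptions: the additive `checksum` stands in for a real error-detecting code such as a CRC, and `struct channel`, `channel_send`, and `send_reliable` are hypothetical names that simulate a faulty link rather than any real network API. Timeouts and reordering are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy additive checksum; a stand-in for a real error-detecting code. */
static uint8_t checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + buf[i]);
    return sum;
}

/* Hypothetical link that corrupts the first `bad_attempts` transmissions. */
struct channel {
    int bad_attempts;  /* number of initial transmissions to corrupt */
    int attempts;      /* transmissions seen so far */
};

static void channel_send(struct channel *ch, const uint8_t *src,
                         uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    if (ch->attempts < ch->bad_attempts && len > 0)
        dst[0] ^= 0x01;  /* flip one bit to model a transmission error */
    ch->attempts++;
}

/*
 * Sends msg until the receiver's checksum matches or max_tries is reached.
 * Returns the number of transmissions used, or -1 on failure.
 */
int send_reliable(struct channel *ch, const uint8_t *msg, size_t len,
                  uint8_t *rx, int max_tries)
{
    uint8_t expected = checksum(msg, len);  /* would travel with the message */
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        channel_send(ch, msg, rx, len);
        if (checksum(rx, len) == expected)
            return attempt;  /* receiver acknowledges; sender frees buffer */
        /* negative acknowledgement: msg is still buffered, so resend it */
    }
    return -1;
}
```

Note that the sender never needs to know which bits were wrong, only that the check failed, which is why the scheme tolerates more errors per message than a correcting code of reasonable cost.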


f) Assume an I/O system with a DMA controller that can support multiple, concurrently active DMA requests. Since all DMA requests go over the same memory bus, some arbitration mechanism is necessary. List the factors that the DMA controller could take into account in arbitrating between the requests and why they are important. Note: you should not explain a specific arbitration policy, but the factors that could be taken into account in various policies. [4 points]

- Channel status: some DMA transfers may be blocked. E.g., one channel may be moving data from memory to disk. Since disks are slow, that DMA channel will be blocked quite often.
- Locality/Granularity: due to locality effects, it may be faster to group requests from one channel and execute them back-to-back rather than switch between channels after every request.
- Software-defined priorities
- Fairness (if there are no priorities)

g) A processor uses 32-bit physical addresses and 32-bit virtual addresses with 1-KByte pages. The processor's TLB has 128 entries and is 4-way set associative. What is the storage, in number of SRAM bits (or Kbits = 1024 bits), required to implement the TLB? Assume that each entry includes three permission bits (R, W, X) and that replacement uses a randomized algorithm. [6 points]

The 32-bit virtual address includes two fields: a 10-bit page offset and a 22-bit virtual page number (VPN). The 22-bit VPN is the address for the TLB. The TLB has 128 entries organized in 4 ways of 32 entries each, so we need to extract a 5-bit index from the VPN in order to select one of these 32 sets. The remaining 22 - 5 = 17 bits of the VPN will be the tag for the TLB. So, each TLB entry has a valid bit, 17 bits of tag, 22 bits of physical page number (PPN, the translation result), and 3 permission bits, for a total of 43 bits. There is no need for LRU bits (randomized replacement).

So the total cost of the TLB is 128 entries * 43 bits = 5504 bits = 5.375 Kbits.

Grading:
-1 point for forgetting the valid bit
-1 point for forgetting the permission bits
-1 point for adding other random bits to the entry
-1 point if you get the PPN length wrong, -2 if you forget it completely
-2 points for a 22-bit tag (-1 if the calculation of a 17-bit tag field is wrong)
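The arithmetic in part g) can be checked with a short sketch. The function names `tlb_entry_bits` and `tlb_total_bits` and their parameter lists are assumptions made for illustration; the sketch also assumes that the page size and the number of sets are powers of two.

```c
#include <assert.h>

/* Integer log2, valid for exact powers of two. */
static unsigned log2u(unsigned x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Bits per TLB entry: valid bit + tag + physical page number + permissions. */
unsigned tlb_entry_bits(unsigned va_bits, unsigned pa_bits, unsigned page_bytes,
                        unsigned entries, unsigned ways, unsigned perm_bits)
{
    assert(ways > 0 && entries % ways == 0);

    unsigned offset_bits = log2u(page_bytes);      /* 1 KB page -> 10 bits */
    unsigned index_bits  = log2u(entries / ways);  /* 32 sets   -> 5 bits  */
    unsigned tag_bits    = va_bits - offset_bits - index_bits;
    unsigned ppn_bits    = pa_bits - offset_bits;

    return 1 + tag_bits + ppn_bits + perm_bits;
}

/* Total SRAM bits; random replacement needs no per-set LRU state. */
unsigned tlb_total_bits(unsigned va_bits, unsigned pa_bits, unsigned page_bytes,
                        unsigned entries, unsigned ways, unsigned perm_bits)
{
    return entries * tlb_entry_bits(va_bits, pa_bits, page_bytes,
                                    entries, ways, perm_bits);
}
```

With the parameters of the problem, `tlb_total_bits(32, 32, 1024, 128, 4, 3)` evaluates to 128 * 43 = 5504 bits.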


Problem 3: But in practice, they are different [33 points]

a) Assume the following C code that scans a linked-list data structure.

    current = head;                    // start from the head of the linked list
    while (current != NULL) {          // while list is not empty
        process(current->element);     // do some work on the current element
        current = current->next;       // go to the next element
    }

Given a system with two processors that share the same first-level data cache, how would you prefetch the linked-list data for the above traversal? Under what conditions would the prefetching scheme be successful? [10 points]

A simple prefetching scheme is to have the second processor execute a "simplified" version of the loop that does no work but prefetches elements for the first processor. The prefetch loop will look like:

    current = head;                    // start from the head of the linked list
    while (current != NULL) {          // while list is not empty
        fetch(current->element);       // no work, just fetch in cache
        current = current->next;       // go to the next element
    }

This approach will work well if the process() function in the first processor includes a significant amount of work, enough to hide the memory latency of the miss for each element. This allows the second processor to be a few elements ahead of the first one. We should also note that if the latency of process() is much higher than that of the miss, the second processor may run too far ahead, causing destructive interference in the cache. It probably makes sense to synchronize the two processors periodically.

6 points for describing a scheme that seems to work, 4 points for discussing the plus/minus.
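One way to keep the helper from running too far ahead is to bound how many nodes it touches between synchronization points. The sketch below assumes GCC or Clang, whose `__builtin_prefetch` builtin issues a prefetch hint; the names `struct node`, `prefetch_ahead`, and `max_ahead` are hypothetical and not part of the exam solution.

```c
#include <stddef.h>

/* Assumed list layout; the exam code only implies `element` and `next`. */
struct node {
    int element;
    struct node *next;
};

/*
 * Helper-thread body: walk at most max_ahead nodes starting at `start`,
 * pulling each one into the shared cache without doing any real work.
 * Returns the number of nodes touched.
 */
size_t prefetch_ahead(const struct node *start, size_t max_ahead)
{
    size_t touched = 0;
    const struct node *current = start;

    while (current != NULL && touched < max_ahead) {
        /* Read hint (0) with high temporal locality (3); GCC/Clang only. */
        __builtin_prefetch(current, 0, 3);
        /* Following the pointer is a demand load, so the helper itself
           takes the miss that the main thread would otherwise see. */
        current = current->next;
        touched++;
    }
    return touched;
}
```

In the two-processor setting, the second processor would call this repeatedly, restarting from the first processor's current position at each synchronization point, so `max_ahead` caps the cache footprint of the prefetched data.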


b) Assume you are designing a new, large-scale datacenter with 100,000 servers. Your goal is to operate the datacenter with a single technician responsible for hardware repairs. Each server repair takes 1 hour and costs $150 in labor and 10% of the server's cost for replacement parts. Assume a full-time technician works 40 hours a week, 48 weeks out of the year. What is the maximum annual failure rate that you can tolerate for the servers? [4 points]

Assume that the failure rate is X. We want the total time needed to repair servers to be less than the time a full-time technician can work in a year.

Repair time < technician's time
⇔ X * 1 h * 100,000 servers < 48 weeks * 40 hours/week = 1,920 hours
⇔ X < 0.0192, i.e., X < 1.9% failure rate.

Assume that you have two choices of servers for your datacenter. Server A costs $2,000 per unit and has an annual failure rate of 0.05. Server B costs $2,500 per unit and has an annual failure rate of 0.015. Server B also provides higher performance, so 90,000 servers will be sufficient for the datacenter. Assuming a 3-year lifetime for servers, which server type should you use for your data center? Show your work. [6 points]

For each type of server, there are two costs to consider: capital expenses (cost of buying the servers) and operational expenses (cost of repairing the servers). The operational expenses include replacement parts and repair time.

Cost A = $2K/server * 100,000 servers + 3 years * 5% * 100,000 servers * (1 h * $150/hour + 10% * $2,000)
       = $200M + $5.25M = $205.25M

Cost B = $2.5K/server * 90,000 servers + 3 years * 1.5% * 90,000 servers * (1 h * $150/hour + 10% * $2,500)
       = $225M + $1.62M = $226.62M

The capital expenses dominate, so server A is the right way to go.
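The two calculations in part b) can be reproduced with a small sketch. The function names `max_failure_rate` and `server_cost` and their parameters are assumptions made for illustration; the cost model is the one stated in the solution (purchase cost plus labor and parts for each expected repair).

```c
/* Largest annual failure rate one technician can keep up with. */
double max_failure_rate(double tech_hours_per_year, double servers,
                        double hours_per_repair)
{
    return tech_hours_per_year / (servers * hours_per_repair);
}

/*
 * Cost over the server lifetime: capital expenses plus repair expenses.
 * Each repair costs labor_per_repair plus parts_fraction of the unit cost.
 */
double server_cost(double unit_cost, double servers, double annual_failure_rate,
                   double years, double labor_per_repair, double parts_fraction)
{
    double capex   = unit_cost * servers;
    double repairs = years * annual_failure_rate * servers;
    double opex    = repairs * (labor_per_repair + parts_fraction * unit_cost);
    return capex + opex;
}
```

For example, `server_cost(2000.0, 100000.0, 0.05, 3.0, 150.0, 0.10)` gives roughly $205.25M for server A and `server_cost(2500.0, 90000.0, 0.015, 3.0, 150.0, 0.10)` gives roughly $226.62M for server B, up to floating-point rounding. Note that server A's failure rate of 5% exceeds the 1.92% bound from the first part, so this comparison implicitly assumes enough repair labor is available at $150 per repair.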


c) Consider the following graph that explores the latency of strided accesses on the cache hierarchy of a well-known microprocessor chip. Given this graph, answer the following questions. Provide a one-sentence justification for each answer. [13 points]

What is the L1 D-cache line size? 64B
What is the associativity of the L1 D-cache? 4-way
What is the size of the L1 D-cache? 16KB
What is the size of the L2 cache? 256KB
What is the associativity of the L2 cache? 8-way
