fos: an operating system for cloud and many-core...

Transcript of fos: an operating system for cloud and many-core...

  • fos: an operating system for cloud and many-core computing

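    Jianfeng Zhan (詹剑锋)

    Advanced Computer Systems Laboratory

    State Key Laboratory of Computer Architecture

    Institute of Computing Technology, Chinese Academy of Sciences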

    http://weibo.com/jfzhan

  • Motivation

    Cloud computers and multicore processors are two emerging classes of computational hardware that have the potential to provide unprecedented compute capacity to the average user.

    Existing multicore operating systems do not scale to large numbers of cores and do not support clouds.

    Current cloud systems push much of the complexity onto the user, who must manage individual virtual machines (VMs) and deal with many system-level concerns.

  • Motivation (cont)

    The next decade will also bring single-chip microprocessors containing hundreds or even thousands of computing cores.

    Making operating systems scale, designing scalable internal OS data structures, and managing these growing resources will be a tremendous challenge.

  • fos provides a single system image across all the cloud nodes

  • Solutions

    First, fos factors a full-featured OS into its component system services.

    Second, fos further factors and parallelizes each system service into an Internet-style collection, or fleet, of cooperating servers that are distributed among the underlying cores and machines. The servers communicate by message passing.
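
    The fleet idea can be illustrated with a minimal Python sketch. It is not fos code: the names Message, Server, and Fleet are hypothetical, and routing requests by a hash of the key is an assumption made for illustration, since the slides do not say how a request picks a server. Each server keeps private state and is reached only through its inbox.

```python
import zlib
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    op: str              # "put" or "get" (hypothetical operations)
    key: str
    value: object = None


@dataclass
class Server:
    """One member of a fleet: private state plus an inbox, no shared memory."""
    name: str
    inbox: deque = field(default_factory=deque)
    store: dict = field(default_factory=dict)

    def step(self):
        """Handle one pending message and return (result, server name)."""
        if not self.inbox:
            return None
        msg = self.inbox.popleft()
        if msg.op == "put":
            self.store[msg.key] = msg.value
            return ("ok", self.name)
        if msg.op == "get":
            return (self.store.get(msg.key), self.name)
        return ("error: unknown op", self.name)


class Fleet:
    """A system service split into several cooperating servers."""

    def __init__(self, service, size):
        self.servers = [Server(f"{service}-{i}") for i in range(size)]

    def send(self, msg):
        # Assumption: a stable hash of the key selects the server, so the
        # same key always reaches the same fleet member.
        index = zlib.crc32(msg.key.encode()) % len(self.servers)
        server = self.servers[index]
        server.inbox.append(msg)
        return server.step()
```

    Because every request for a given key lands on the same server, the servers never need to share the stored data, which is the property the message-only design relies on.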

  • An overview of the fos server architecture

  • Design decisions

    Space multiplexing replaces time multiplexing.

    With the growing number of cores, the number of cores in a system will soon exceed the number of active processes.

    Scheduling then becomes a layout problem, not a time-multiplexing problem.

    The operating system will run on different cores from the application. This gives spatially partitioned working sets, so the OS does not interfere with the application’s cache.
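
    To make "scheduling as a layout problem" concrete, here is a small Python sketch under stated assumptions: cores form a 2D mesh, every service and application gets a core of its own, and each application is placed greedily on the free core closest (by Manhattan distance) to the service it uses most. The function names and the greedy policy are illustrative only and are not taken from fos.

```python
from itertools import product


def manhattan(a, b):
    """Hop distance between two cores on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def layout(width, height, services, apps):
    """Give every service and application a dedicated core.

    services: list of service names
    apps: dict mapping application name -> name of the service it uses most
    Returns a dict mapping name -> (x, y) core coordinate.
    """
    free = list(product(range(width), range(height)))
    if len(services) + len(apps) > len(free):
        raise ValueError("more tasks than cores; time multiplexing would be needed")

    placement = {}
    # OS services take cores first, so applications never share a core with them.
    for name in services:
        placement[name] = free.pop(0)

    # Each application takes the free core nearest to its main service.
    for app, service in apps.items():
        target = placement[service]
        core = min(free, key=lambda c: manhattan(c, target))
        free.remove(core)
        placement[app] = core
    return placement
```

    Nothing in this sketch decides when a task runs; the only decision is where it runs, which is the shift the slide describes.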

  • Design decisions (cont)

    The OS is factored into function-specific services, each implemented as a parallel, distributed service.

    In fos, services collaborate and communicate only via messages, although applications can use shared memory if it is supported.

    Services are bound to a core, improving cache locality.

    Through a library layer, libfos, applications communicate with services via messages.
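
    The role of a library layer such as libfos can be sketched as follows. This is a hypothetical illustration, not the real libfos interface: NameService, serve, and lib_call are invented names, and Python threads with queue.Queue mailboxes stand in for cores and the underlying messaging hardware. The point is that an ordinary function call in the application becomes a request/reply message exchange with a named service.

```python
import queue
import threading


class NameService:
    """Maps service names to mailboxes (hypothetical stand-in)."""

    def __init__(self):
        self._mailboxes = {}
        self._lock = threading.Lock()

    def register(self, name):
        mailbox = queue.Queue()
        with self._lock:
            self._mailboxes[name] = mailbox
        return mailbox

    def lookup(self, name):
        with self._lock:
            return self._mailboxes[name]


def serve(mailbox, handler):
    """Server loop: answer (request, reply_box) pairs until None arrives."""
    while True:
        item = mailbox.get()
        if item is None:
            break
        request, reply_box = item
        reply_box.put(handler(request))


def lib_call(names, service, request, timeout=1.0):
    """Library stub: send a request message and wait for the reply message."""
    reply_box = queue.Queue(maxsize=1)
    names.lookup(service).put((request, reply_box))
    return reply_box.get(timeout=timeout)
```

    A service would be started with something like threading.Thread(target=serve, args=(names.register("/sys/echo"), handler)), after which the application only ever calls lib_call and never touches the service's state directly.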

  • Design decisions (cont)

    The utilization of active services is measured, and highly loaded services are provisioned with more cores (or other resources).

    Faults are detected and handled by the OS. OS services are monitored by a watchdog process. If a service fails, a new instance is spawned to meet demand, and the naming service reassigns communication channels.
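
    The two mechanisms on this slide, load-driven provisioning and watchdog recovery, can be sketched in Python as below. Everything here is an assumption for illustration: the thresholds, the one-core-per-round policy, the heartbeat-based failure detection, and the plain dict that stands in for the naming service are not described in the slides.

```python
def provision(fleets, utilization, free_cores, high=0.8, low=0.2):
    """Run one provisioning round and return (new fleet sizes, free cores).

    fleets: dict service -> number of cores currently serving it
    utilization: dict service -> measured mean utilization in [0, 1]
    """
    sizes = dict(fleets)
    # Shrink idle fleets first so their cores can be reused (keep at least one).
    for name, load in utilization.items():
        if load < low and sizes[name] > 1:
            sizes[name] -= 1
            free_cores += 1
    # Grow the most heavily loaded fleets first while spare cores remain.
    for name in sorted(utilization, key=utilization.get, reverse=True):
        if utilization[name] > high and free_cores > 0:
            sizes[name] += 1
            free_cores -= 1
    return sizes, free_cores


def watchdog(names, heartbeats, now, timeout, spawn):
    """Respawn services whose last heartbeat is older than `timeout`.

    names: dict service -> current endpoint (stands in for the naming service)
    heartbeats: dict service -> time of the last heartbeat
    spawn: callable(service) -> endpoint of a freshly started instance
    Returns the list of services that were restarted.
    """
    restarted = []
    for service, last_seen in heartbeats.items():
        if now - last_seen > timeout:
            # Rebinding the name redirects later messages to the new instance.
            names[service] = spawn(service)
            heartbeats[service] = now
            restarted.append(service)
    return restarted
```

    Clients that always resolve a service by name before sending, rather than caching an endpoint forever, pick up the replacement instance without any change to application code.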

  • Anatomy of a File System Access

  • Performance data (1)

  • Performance data (2)

  • Performance data (3)

  • Performance data (4)

  • Thank you

  • Reading list

    David Wentzlaff et al. An Operating System for Multicore and Clouds: Mechanisms and Implementation. ACM Symposium on Cloud Computing (SoCC), June 2010.

    Jianfeng Zhan et al. Cost-aware Cooperative Resource Provisioning for Heterogeneous Workloads in Data Centers. Accepted by IEEE Transactions on Computers (TC).

    Matei Zaharia et al. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys 2010, Paris, France, April 2010.

    Benjamin Hindman et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011.

    Lei Wang et al. In Cloud, Can Scientific Communities Benefit from the Economies of Scale? IEEE Trans. Parallel Distrib. Syst. 23(2): 296-303 (2012).