HPC/HTC and Cloud · PDF file Rajul Kumar Northeastern University [email protected]

  • HPC/HTC and Cloud: Making them work together efficiently

    Rajul Kumar

    Northeastern University

    [email protected]

  • Our group

    Rajul Kumar

    Northeastern University [email protected]

    Evan Weinberg

    Boston University [email protected]

    Chris Hill

    Massachusetts Institute of Technology [email protected]

  • HPC and Cloud convergence

    High Performance Computing (HPC)

    • HPC users have effectively unbounded demand for resources

    Cloud

    • Overprovisioned to meet peak workloads, so it stays underutilized most of the time

    Can we make HPC soak up these idle cycles without impacting the cloud workload?

  • Simple Case: Single node HTC jobs

    • High Throughput Computing (HTC) jobs focus on efficient execution of loosely-coupled tasks

    • Backfilled HTC jobs get killed to release resources for the HPC workload

    • The compute cycles already invested are lost, and the job must be rerun from the start

    Instead, suspend the virtual machine running the jobs when its resources are needed, and resume it when resources become available again
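    The cost difference between killing and suspending a job can be illustrated with a small model. This is a minimal sketch, not the authors' measurement: the function name `hours_consumed`, the availability windows, and the job length are all hypothetical, and suspend/resume overhead is ignored.

```python
def hours_consumed(job_hours, windows, policy):
    """Return the compute hours consumed until the job finishes.

    job_hours: uninterrupted run time the job needs.
    windows:   hours of compute available before each interruption (hypothetical).
    policy:    "kill" restarts the job from zero after every interruption;
               "suspend" keeps the progress made so far.
    Returns None if the job does not finish within the given windows.
    """
    if policy not in ("kill", "suspend"):
        raise ValueError("policy must be 'kill' or 'suspend'")
    progress = 0   # useful work that survives interruptions
    consumed = 0   # all compute hours spent, including lost work
    for window in windows:
        remaining = job_hours - progress
        if window >= remaining:
            return consumed + remaining
        consumed += window
        if policy == "suspend":
            progress += window  # a suspended VM keeps its state
        # under "kill", progress stays at zero and the window is wasted
    return None


# A 10-hour job that is interrupted after 4 hours, then after 7 more hours.
killed = hours_consumed(10, [4, 7, 12], "kill")
suspended = hours_consumed(10, [4, 7, 12], "suspend")
```

    Under these assumptions the killed job consumes 21 compute hours to deliver 10 hours of work, while the suspended job consumes exactly 10.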

  • Implementation

    HPC cluster OpenStack cloud

    Resource monitor

    HPC

    HTC

    Cloud

  • Implementation

    HPC cluster OpenStack cloud

    Resource monitor

    OpenVPN

  • Implementation

    Control daemon

    HPC cluster OpenStack cloud

    Resource monitors

    OpenVPN

  • Implementation

    Control daemon

    HPC cluster OpenStack cloud

    Resource monitors

    OpenVPN

    HPC jobs

    HPC job arrives

  • Implementation

    Control daemon

    Resource monitors

    OpenVPN

    HPC cluster OpenStack cloud

    HTC jobs moved to Cloud

  • Implementation

    Control daemon

    Resource monitors

    OpenVPN

    HPC cluster OpenStack cloud

    Cloud utilization increases

  • Implementation

    Control daemon

    Resource monitors

    OpenVPN

    HPC cluster OpenStack cloud

    HTC job suspended to release resources for cloud

  • Implementation

    Control daemon

    Resource monitors

    OpenVPN

    HPC cluster OpenStack cloud

    Cloud utilization drops

  • Implementation

    Control daemon

    Resource monitors

    OpenVPN

    HPC cluster OpenStack cloud

    HTC jobs resumed on cloud
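    The suspend/resume cycle walked through in the slides above can be sketched as a simple decision loop. This is an illustrative model only, not the authors' control daemon: the two utilization thresholds, the state names, and the functions `next_action` and `run_control_loop` are hypothetical. A real daemon would read utilization from the resource monitors and issue the corresponding suspend and resume requests to the cloud; the sketch only records the decisions.

```python
HIGH_WATERMARK = 0.80  # hypothetical: suspend HTC VMs at or above this cloud utilization
LOW_WATERMARK = 0.40   # hypothetical: resume HTC VMs at or below this cloud utilization


def next_action(cloud_utilization, vm_state):
    """Decide what to do with the HTC virtual machine for one utilization sample."""
    if vm_state == "ACTIVE" and cloud_utilization >= HIGH_WATERMARK:
        return "suspend"   # the cloud workload needs the resources back
    if vm_state == "SUSPENDED" and cloud_utilization <= LOW_WATERMARK:
        return "resume"    # idle cycles are available again
    return "none"


def run_control_loop(utilization_samples, vm_state="ACTIVE"):
    """Apply next_action to each sample and return (sample, action, state) tuples."""
    history = []
    for utilization in utilization_samples:
        action = next_action(utilization, vm_state)
        if action == "suspend":
            vm_state = "SUSPENDED"
        elif action == "resume":
            vm_state = "ACTIVE"
        history.append((utilization, action, vm_state))
    return history


# Cloud utilization rises, stays high, then falls again.
history = run_control_loop([0.30, 0.85, 0.90, 0.60, 0.35, 0.20])
actions = [action for _, action, _ in history]
```

    Using two separate thresholds is a design choice of this sketch: it keeps the VM from being suspended and resumed repeatedly when utilization hovers around a single cutoff.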

  • Modifications to Slurm

    Slurm: a workload manager for HPC clusters

    • Manages resources and job scheduling

    • Marks an unreachable node DOWN and removes its jobs

    • Treats a suspended virtual node the same way

    We modified Slurm to manage suspended nodes and keep their job states intact
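    The behavioral difference described above can be pictured with a toy model. This sketch does not reproduce Slurm's source code or the authors' patch: the `Node` class, the functions, the `keep_suspended_jobs` switch, and the SUSPENDED node state name are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: str = "IDLE"
    jobs: dict = field(default_factory=dict)  # job id -> job state


def handle_unreachable(node, vm_suspended, keep_suspended_jobs):
    """Model the scheduler's reaction when a node stops responding.

    Returns the list of job ids removed from the node.
    """
    if vm_suspended and keep_suspended_jobs:
        # Modified behavior (hypothetical model): remember the jobs.
        node.state = "SUSPENDED"
        for job_id in node.jobs:
            node.jobs[job_id] = "SUSPENDED"
        return []
    # Default behavior: the node is marked DOWN and its jobs are removed.
    node.state = "DOWN"
    removed = list(node.jobs)
    node.jobs.clear()
    return removed


def handle_resumed(node):
    """Model the node coming back after its virtual machine is resumed."""
    node.state = "ALLOCATED" if node.jobs else "IDLE"
    for job_id in node.jobs:
        node.jobs[job_id] = "RUNNING"


# Default handling loses the job; the modified handling keeps it.
default_node = Node("vnode1", "ALLOCATED", {101: "RUNNING"})
lost = handle_unreachable(default_node, vm_suspended=True, keep_suspended_jobs=False)

patched_node = Node("vnode2", "ALLOCATED", {102: "RUNNING"})
kept = handle_unreachable(patched_node, vm_suspended=True, keep_suspended_jobs=True)
state_while_suspended = dict(patched_node.jobs)
handle_resumed(patched_node)
```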

  • Future prospects

    • Harden the system and utilize full data center performance (hardware, network, etc.)

    • Run multi-node jobs in a virtual environment

    • Move jobs between virtual machines and bare-metal nodes

    • Experiment with container frameworks

  • Conclusion

    • Dynamic HPC/HTC cluster with minimal overhead and impact

    • More productive utilization of the HPC/HTC cluster

    • Better resource utilization of the cloud

    http://info.massopencloud.org