WestGrid Town Hall: June 26, 2020 Patrick Mann, WestGrid Director of Operations Martin Siegert, Director Research Computing, SFU Scott Baker, Manager, Sensitive Research, UBC Ahmad Bisher, WestGrid Partnership Development Alex Razoumov, WestGrid Training Visualization

WestGrid Town Hall: June 26, 2020

Patrick Mann, WestGrid Director of Operations
Martin Siegert, Director Research Computing, SFU
Scott Baker, Manager, Sensitive Research, UBC
Ahmad Bisher, WestGrid Partnership Development
Alex Razoumov, WestGrid Training Visualization

Admin

To ask questions during today’s session:

From Webstream: Email [email protected]

From Vidyo: Use the GROUP CHAT to ask questions.

Please mute your mic unless you have a question.

Outline

1. Recent incident report - Martin Siegert, SFU

2. System Access Best Practices - Scott Baker, UBC

3. Associate Member Program - Ahmad Bisher, WestGrid

4. Operations - Patrick Mann, WestGrid

5. User Training - Alex Razoumov, WestGrid

Cedar Security Incident: Debrief & lessons learned

Martin Siegert
Director Research Computing, Simon Fraser University;
WestGrid Site Lead;
Compute Canada SFU/Cedar Site Lead

Incident

What happened:

● System compromised around March 23rd
● Used for crypto-currency mining
● Symptoms: reduced performance of research applications
● Crypto-mining processes well hidden from common tools (ps, top, etc.)
● Detected Apr. 15 after receiving information from one research group that the slowdown of applications only happened between midnight and 6am
● Compromise stopped on Apr. 15

Cedar Team response

Two days of detective work:

● crypto-mining processes were started through a cronjob at midnight
● processes hidden through a kernel module installed by the intruder
● intruder gained access through a compromised user account
● that account had elevated privileges on servers for the ATLAS project
● gained root privileges on compute nodes by inserting an suid-root shell in a file system exported from ATLAS servers
● unlikely that any user data were accessed

Big thank-you to the team: pan-Canadian effort SFU-Dalhousie-UVic
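
The suid-root shell on an exported file system was the step that turned a user account into root on the compute nodes. As a rough illustration only (this is not the Cedar team's tooling), the sketch below walks a directory tree and lists regular files that are setuid and owned by root. The function names are made up for this example. Because the intruder's kernel module hid processes, a scan like this would be more trustworthy when run from a clean host that mounts the file system, not from the compromised node itself.

```python
import os
import stat
from typing import Iterator


def is_suid_root(mode: int, uid: int) -> bool:
    """True for a regular file that has the setuid bit set and is owned by uid 0."""
    return stat.S_ISREG(mode) and bool(mode & stat.S_ISUID) and uid == 0


def find_suid_root_files(root: str) -> Iterator[str]:
    """Yield paths under `root` that are setuid-root regular files.

    Hypothetical helper for illustration; symlinks are not followed.
    """
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_suid_root(info.st_mode, info.st_uid):
                yield path


if __name__ == "__main__":
    # Example: scan a mounted export (the path is a placeholder).
    for hit in find_suid_root_files("/tmp"):
        print(hit)
```

Mounting exported file systems with the `nosuid` option is a common complementary control, since the kernel then ignores setuid bits on that mount.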

Cedar Team response (cont’d)

Actions taken:

● stopped cronjob

● disabled compromised account

● removed suid-root program and disabled exports of suid-root programs

● rebuilt all compute nodes

● rebuilt all ATLAS servers

● removed all authorized_keys files from all users’ .ssh directories

● revoked ssh keys of users who had the private key stored on Cedar

Lessons Learned

● Coincidences happen: the compromise happened a day after Cedar was expanded and underwent a complete update

● User accounts are the weakest link: most common entry point for intruders

  ○ Weak passwords
  ○ Compromised desktop/laptop
    ■ Gives access to passwords
    ■ Gives access to ssh keys and their passwords

● Need better monitoring/alert system

● Multi-Factor Authentication (MFA) would have prevented compromise

Questions?

System Access Best Practices

Scott Baker
Manager Sensitive Research, ARC/UBC;
Member of the National Security Council

Over to Scott...

[Slide layout: rating bars for Implementation, Interference, and Improvement (scale 1, 5, 10); sections WHAT, WHY, HOW; Bonus Tip: But wait, there’s more!]

Unique Passwords (Secrets)

The single most important good habit: prevent one site’s breach from exposing all your accounts.

Passwords shouldn’t be memorized or guess-able.

Size matters – use passphrases

Only change them when necessary

Do not allow anything except your vault to “memorise passwords”

Bonus Tip: Consider how hard it will be to type in on your mobile: minimize keyboard switching.
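
To make "size matters" concrete, here is a minimal sketch of a passphrase generator using Python's standard `secrets` module. The function names and the tiny `DEMO_WORDS` list are assumptions for illustration. A real passphrase should draw from a large word list (diceware-style lists have 7776 words). Lowercase words joined by spaces also follow the bonus tip about avoiding keyboard switching on mobile.

```python
import math
import secrets
from typing import Sequence

# Illustrative only: far too small for real use. Substitute a large word list.
DEMO_WORDS = ["cedar", "graham", "niagara", "arbutus", "beluga", "scratch", "project", "nearline"]


def make_passphrase(words: Sequence[str], n_words: int = 6, sep: str = " ") -> str:
    """Pick `n_words` words uniformly at random with a cryptographic RNG."""
    if len(set(words)) < 2:
        raise ValueError("word list needs at least two distinct words")
    return sep.join(secrets.choice(words) for _ in range(n_words))


def entropy_bits(wordlist_size: int, n_words: int) -> float:
    """Entropy of a passphrase of `n_words` uniform picks from `wordlist_size` words."""
    return n_words * math.log2(wordlist_size)


if __name__ == "__main__":
    print(make_passphrase(DEMO_WORDS, n_words=5))
    print(f"{entropy_bits(7776, 6):.1f} bits for 6 words from a 7776-word list")
```

Each added word from a 7776-word list contributes about 12.9 bits, which is why length beats clever substitutions.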

https://haveibeenpwned.com/

Most people have at least one email listed at least once…or will eventually.

The email address I have had since 1994 is listed in 12

Depending on the breach, different information is included.

Use a Vault for Secrets

It’s actually faster and easier than typing.

Vaults store much more than passwords

The ultimate exit strategy

Browser or Standalone. Either is better than nothing

Many open-source and free options exist

Bonus Tip: Learn the keyboard shortcuts to save even more time.

Practice Good Key Management

Keys are like passwords (but more) and the same rules apply.

Generate strong keys – learn the current ciphers: https://infosec.mozilla.org/guidelines/key_management

Always protect them with a password

The private key should never leave the system where it was generated.

Don’t sweat the public keys… they are public after all.

Bonus Tip: Change keys like passwords – if there is a possibility of compromise.

Don’t Copy and Paste Code

It’s easy to hide content on the web through CSS or JS.

Malicious code can be hidden in a multitude of ways on web sites.

Hidden code in snippets may or may not originate with the content publisher.

In all cases, pasting into a “dumb” editor first will help confirm what is actually on the clipboard before it goes into the shell or application.

Bonus Tip: Set your terminal to warn about multi-line paste.
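
The "dumb editor" check can also be approximated in code. This minimal sketch takes text already copied from a web page and reports line breaks, control characters, and invisible Unicode format characters (such as zero-width spaces) that would not be visible on the page. The function name is an assumption, and reading the clipboard itself is left out because that is platform-specific.

```python
import unicodedata
from typing import List, Tuple


def suspicious_characters(text: str) -> List[Tuple[int, str]]:
    """Return (index, description) for characters worth a second look before pasting."""
    findings = []
    for index, ch in enumerate(text):
        if ch in "\n\r":
            findings.append((index, "line break (a shell may run the line on paste)"))
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Cc = control characters, Cf = invisible format characters.
            findings.append((index, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return findings


if __name__ == "__main__":
    snippet = "echo hello\nrm -rf ~/important"
    for position, what in suspicious_characters(snippet):
        print(position, what)
```

Tabs are flagged as well, since many shells treat a pasted tab as a completion request.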

Sign-out

If someone gains access to your device… they have access to everything you’re already signed into.

Signing out also prevents information disclosure across different web sites and/or social media sites.

Signing out adds another layer of security.

Modern browsers sometimes allow different accounts, containers, or private browsing that segments access across different apps.

Bonus Tip: Wiping cookies will sign you out of many services in a single step.

Encryption at Rest

Encryption protects confidentiality on multiple levels; the more mobile a device, the more it matters.

Most modern systems include the ability to turn on disk encryption. This includes desktop and laptop disk drives, mobile phones, even some USB keys.

In general, encryption should always be enabled, but the more likely a device is to be lost or stolen, the more important it becomes to encrypt it.

Bonus Tip: The encryption key (or recovery code when applicable) must be very carefully secured.

Watch Your File Permissions

Captain obvious says: if you set it public… It’s public.

File permissions are powerful and entirely under your control.

Security is largely about “Need To Know”

Friends don’t let friends chmod 777

Make it sticky if that helps

Bonus Tip: Incorrect permissions on cloud storage are the most common source of breaches
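
As a rough illustration of auditing for "chmod 777" leftovers, the sketch below walks a directory tree on a POSIX system and reports anything writable by every user. The function names are assumptions. A world-writable directory with the sticky bit set is reported separately, because the sticky bit limits deleting and renaming entries to their owners, which is what makes shared directories such as `/tmp` workable.

```python
import os
import stat
from typing import Iterator, Optional, Tuple


def world_writable_problem(mode: int) -> Optional[str]:
    """Describe a world-writable mode, or return None if 'other' cannot write."""
    if not mode & stat.S_IWOTH:
        return None
    if stat.S_ISDIR(mode) and mode & stat.S_ISVTX:
        return "world-writable directory (sticky bit set)"
    return "world-writable"


def scan_tree(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (path, problem) for world-writable files and directories under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # unreadable or vanished entry
            if stat.S_ISLNK(mode):
                continue  # a symlink's own mode bits are not meaningful
            problem = world_writable_problem(mode)
            if problem:
                yield path, problem


if __name__ == "__main__":
    for path, problem in scan_tree(os.path.expanduser("~")):
        print(f"{problem}: {path}")
```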

Delete Unnecessary Information

What isn’t there can’t be stolen or mishandled.

Consider what data you have and why.

Just like shredding paper documents – ‘securely’ delete data that is no longer required in any particular location.

Yes, it’s really just that simple.

Bonus Tip: Knowing where your sensitive data is makes this step easier.

Apply Software Updates

Smart people are working hard to fix vulnerabilities: take advantage of that (usually free) protection.

Computers, phones, TVs, cars, and now even toasters run on software/firmware. Keep that patched by applying reputable updates from known sources.

In some respects, the closer the device is to the outside, the more critical it is to patch.

Do not rely only on automatic updates.

Bonus Tip: Remember to check for patches to equipment & appliances like your router or smart-<thing>.

Have Backups and Plan to Restore

A backup that can’t be easily restored is useless, and restoring from backup might be the only option after a ransomware attack.

Avoid becoming too entangled in a proprietary system that only works after it’s installed.

Remember to think about the security of your backup as well (e.g. high-profile iCloud breaches)

A simple external hard drive caddy is reliable and cost effective.

Bonus Tip: Test your restoration plan periodically – ensure it works.
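
A restore test is only convincing if the restored files are compared against the originals. A minimal sketch, assuming you have restored a backup into a scratch directory: hash every file on both sides with SHA-256 and report anything missing or different. The function names are made up for this example, and it does not check permissions, timestamps, or extra files in the restored tree.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_problems(original: Path, restored: Path) -> List[str]:
    """List files that are missing from, or differ in, the restored copy."""
    before, after = manifest(original), manifest(restored)
    problems = [f"missing: {name}" for name in sorted(before.keys() - after.keys())]
    problems += [
        f"changed: {name}"
        for name in sorted(before.keys() & after.keys())
        if before[name] != after[name]
    ]
    return problems
```

An empty list from `restore_problems(Path("data"), Path("restore-test"))` (both paths are placeholders) means every original file came back byte-for-byte.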

Use a VPN

All unencrypted traffic over a network is subject to snooping.
Prevent credential and data theft on un-trusted networks.
Prevent DNS cache poisoning attacks.

Basic: Subscribe to a trusted VPN service and/or use your institutional VPN service

Configure this for all mobile devices and use it any time you’re on an un-trusted network.

Advanced: Set up your own VPN

Bonus Tip: DNS lookup can sometimes be faster over a VPN.

Use Multi-Factor Authentication

Passwords are a single point of failure. MFA therefore provides dramatically increased protection.

Something you Have, Know, or Are

SMS is insecure, but still better than nothing

Many options exist, many systems support it.

EG: YubiKey, Google Authenticator, SecurID, etc...

Bonus Tip: Also avoid social media single sign on to prevent creating a back door on yourself
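
To show what a "something you have" factor computes, here is a minimal sketch of the time-based one-time password scheme (TOTP, RFC 6238, built on HOTP from RFC 4226) that authenticator apps commonly implement. It is for understanding only: it omits secret provisioning, clock-drift windows, and rate limiting, and the secret below is the RFC's published test value, not something to use in practice.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA-1."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low 4 bits pick the offset
    code = struct.unpack_from(">I", mac, offset)[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)


def totp(secret: bytes, at_time: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the current 30 s window."""
    if at_time is None:
        at_time = time.time()
    return hotp(secret, int(at_time // step), digits)


if __name__ == "__main__":
    rfc_test_secret = b"12345678901234567890"  # test vector secret from the RFCs
    print(totp(rfc_test_secret))
```

The server and the device share the secret once, and afterwards only short-lived codes cross the network, so a stolen password alone is not enough to log in.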

Resources

• Mozilla Key Mgt: https://infosec.mozilla.org/guidelines/key_management
• KeePass Password Safe: https://keepass.info/download.html
• Cryptomator: https://cryptomator.org/
• Interpol: https://www.nomoreransom.org/

WestGrid Associate Member Program

Ahmad Bisher
Director, Strategy & Partnership Development

WestGrid

Introduction

● Over the years, institutions outside of WestGrid’s 7 founding members have been reaching out to get some help

● Growing number of requests and inquiries

● In response we want to extend our reach to them

● Created a program to help them and give them the opportunity to connect with the rest of the research community

Benefits

● Representation on WestGrid’s Associate Member Research Advisory Committee

● ‘Fast-tracked’ access to specialized support in multiple research domains, and expertise in software development, scientific visualization, machine learning, and research data management.

● Targeted and customizable training to build skills and practical experience in advanced research computing

● Discounted access to WestGrid-organized events, including training events and conferences

● Opportunities to participate in regional and national discussions in digital research infrastructure (DRI) platform and policy planning

Program Launch

● Send communications to those who expressed interest

● Send to other potential institutions who do research

● Use different communications channels to create awareness

● Set up meetings to walk them through the program and how we can help

● Once a few have joined, establish the Advisory Group

Questions?

For more information, email Ahmad: [email protected]

or visit: www.westgrid.ca/become_member

Operations

Patrick Mann, Director of Operations, WestGrid

Outages and Maintenance I

Graham
● May 28, 2020: Planned maintenance requiring power cutoff. About 2 hours.
● May 22, 2020: Power outage; all jobs lost.
● May 11, 2020 - Jun 15, 2020: Security incident on gra-vdi.computecanada.ca resulting in no VNC access. No data lost or accessed, but it required re-designing the integration of gra-vdi within the Graham system.
● Apr 24, 2020: Globus endpoint outage due to hardware failure on gra-dtn1.

Cedar
● Apr 23-current: Slow and sometimes unresponsive /scratch filesystem. Parts replacement gave an improvement, but the Cedar team is continuing to work with the vendor.
● Apr 23, 2020: Filesystem issue. Faulty hardware and storage node required replacing.

Béluga
● Jun 11, 2020: Cooling outage - GPU nodes were shut down for a couple of hours.
● May 7, 2020: Unresponsive /home and /scratch
● May 6, 2020: Outage on connection through commercial link.
● Apr 21, 2020: /scratch hardware problem.

June hot weather - nodes drained due to request from utility.

Outages and Maintenance II

Niagara
● All well!

Arbutus
● June 16, 2020: DNS resolution - some sites not resolving properly (Docker and Yum downloads). Switched to a new internal DNS resolver.
● May 18, 2020: DHCP issues for some users in particular tenant networks.
● May 6, 2020: API access issues due to high load.
● Apr 17, 2020: Login problem to OpenStack dashboard.

Cedar /scratch issues

Cedar has had intermittent /scratch issues for quite a while.

● Storage system had a habit of losing drives but the drives themselves seemed fine.

Tuesday June 23:

We may have finally found and cleared the backend storage issue. It looks like a faulty SAS cable to one of two controllers was causing many drives to go into a missing/available/rebuild loop. There are now 33 full rebuilds in progress, and we expect them to finish by the end of today. Due to the heavy work on these rebuilds, the filesystem performance may be marginally affected.

Very tough problem to track down. Congratulations to the Cedar team:

● Lance Couture
● Lixin Liu

Cedar CPU usage

[Figure: Cedar CPU usage chart, annotated with "New hardware" and "Security Incident".]

Cedar GPU Utilization

GPU usage took a while to ramp-up.

● To ensure that legacy job scripts worked the default was the older Pascal GPUs.

● New GPUs (Volta) needed to be specifically identified in job scripts. (mentioned in last Townhall)

[Figure: Cedar GPU utilization chart, annotated with "New hardware" and "Security Incident".]

Cedar Storage

Filesystem   Current Capacity   Current Utilization   RAC 2021 target (75%)   RAC 2020 Allocation
/project     21 PB              9.3 PB (46%)          15.8 PB                 11.3 PB

June 21, 2020

● So we’re ok currently for storage especially after the latest purchases and upgrades.
● RAC 2020 was difficult: intense discussion
  ○ Moved many folks to /nearline
● We’re expecting a significant increase in RAC 2021 storage asks.
● No new funds for storage.
● And note that storage is very difficult to re-allocate or scale.

/nearline is now in production. Ask at [email protected]

Arbutus Usage

[Figure: Arbutus usage, Jan 1, 2019 - Present. Panels: Instances, Projects, vCPUs used, Memory used.]

User Training & Outreach

Alex Razoumov, Visualization & Training Coordinator, WestGrid

2020 Summer School

● “Regular” years
  ○ two in-person schools, each 4 days, 2-3 parallel streams, 12-15 courses * 3-6 hours each
● This year: single online school for all WestGrid (and beyond) institutions
  ○ 7 weeks (May-25 to July-10)
  ○ single stream (no parallel courses)
  ○ 16 courses + 3 repeat courses (everyone on the waitlist)
  ○ each course: 1-3 days of mostly at-your-own-pace learning
  ○ format: reading materials + pre-recorded videos + exercises + live Zoom sessions
  ○ in most courses interactive hands-on on our training cluster or Cedar
● 300 registrants and 300+ on the waitlist
● Zoom session invites only to participants with Canadian institutional email addresses (83%)
● presentation materials available to all

https://wgschool.netlify.app

2020 Summer School

❏ Bash command line
❏ Intro to HPC
❏ Programming in Julia
❏ Python in Jupyter
❏ Scientific visualization
❏ Machine learning with PyTorch
❏ UBC Sockeye cluster
❏ Parallel programming in Chapel
❏ Gromacs optimization
❏ Singularity containers
❏ Bioinformatics sessions
❏ Git version control
❏ Databases on Cedar
❏ CC cloud
❏ Docker in your VM
❏ MATLAB sessions

https://wgschool.netlify.app/program

Already started running repeat courses, so too late to join the waitlist …

However:
1. all presentation materials are available!
2. we will likely repeat a smaller version of this school in several months

Big thanks to our instructors: Marie-Helene Burle, Ian Allison, Roman Baranowski, Ryan Thomson, Olivier Fisette, Grigory Shamov, Phillip Richmond, Matthew Douglas, Brian McConeghy, Alex Lopes, Wolfgang Richter, Venkat Mahadevan, Jacob Boschee, Raymond Norris, Reece Teramoto

Other Online Training

● WestGrid fall webinars will start in September
  ○ up to 1 hour, usually not interactive
  ○ please email us your topic suggestions [email protected]
  ○ the schedule will be finalized by late August
● Collaborating with UBC ARC for their summer school in August
● Collaborating with UBC Research Commons for their fall workshops (monthly WestGrid series)
● ECCC custom workshops in September

We are very much looking forward to the day when we can resume our in-person training!

Questions?

To ask questions about today’s presentations:

From Webstream: email [email protected]

From Vidyo: Two options...

Unmute your mic

Use the group chat

Questions after this session? Email us anytime:

[email protected]

[email protected]

We also advocate on behalf of WestGrid member and user concerns within Compute Canada.