WestGrid Town Hall: June 26, 2020
Patrick Mann, WestGrid Director of Operations
Martin Siegert, Director Research Computing, SFU
Scott Baker, Manager, Sensitive Research, UBC
Ahmad Bisher, WestGrid Partnership Development
Alex Razoumov, WestGrid Training & Visualization
Admin
To ask questions during today’s session:
From Webstream: Email [email protected]
From Vidyo: Use the GROUP CHAT to ask questions.
Please mute your mic unless you have a question.
Outline
1. Recent incident report - Martin Siegert, SFU
2. System Access Best Practices - Scott Baker, UBC
3. Associate Member Program - Ahmad Bisher, WestGrid
4. Operations - Patrick Mann, WestGrid
5. User Training - Alex Razoumov, WestGrid
Cedar Security Incident: Debrief & lessons learned
Martin Siegert
Director Research Computing, Simon Fraser University;
WestGrid Site Lead;
Compute Canada SFU/Cedar Site Lead
Incident
What happened:
● System compromised around March 23rd
● Used for crypto-currency mining
● Symptoms: reduced performance of research applications
● Crypto-mining processes well hidden from common tools (ps, top, etc.)
● Detected Apr. 15 after receiving information from one research group that the slowdown of applications only happened between midnight and 6am
● Compromise stopped on Apr. 15
Cedar Team response
Two days of detective work:
● crypto-mining processes were started through a cronjob at midnight
● processes hidden through a kernel module installed by the intruder
● intruder gained access through a compromised user account
● that account had elevated privileges on servers for the ATLAS project
● gained root privileges on compute nodes by inserting an suid-root shell in a file system exported from ATLAS servers
● unlikely that any user data were accessed
Big thank-you to the team: pan-Canadian effort SFU-Dalhousie-UVic
Cedar Team response (cont’d)
Actions taken:
● stopped cronjob
● disabled compromised account
● removed suid-root program and disabled exports of suid-root programs
● rebuilt all compute nodes
● rebuilt all ATLAS servers
● removed all authorized_keys files from all users’ .ssh directories
● revoked the ssh keys of users who had the private key stored on Cedar
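The suid-root shell described above is the kind of file a periodic scan can surface. A minimal sketch, assuming a plain directory walk is acceptable for the file system in question; the function name `find_setuid_files` and its `owner_uid` parameter are hypothetical and not part of the Cedar team's tooling:

```python
import os
import stat
from typing import Iterator, Optional


def find_setuid_files(root: str, owner_uid: Optional[int] = 0) -> Iterator[str]:
    """Yield regular files under `root` that have the setuid bit set.

    owner_uid=0 restricts results to root-owned files (suid-root);
    pass None to report setuid files owned by any user.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mode & stat.S_ISUID and (owner_uid is None or st.st_uid == owner_uid):
                yield path
```

Usage would look like `for path in find_setuid_files("/some/exported/fs"): print(path)`, where the path is a placeholder. A scan like this only reports files; mounting or exporting file systems without suid support is the preventive control.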
Lessons Learned
● Coincidences happen: the compromise happened a day after Cedar was expanded and underwent a complete update
● User accounts are the weakest link: most common entry point for intruders
○ Weak passwords
○ Compromised desktop/laptop
■ Gives access to passwords
■ Gives access to ssh keys and their passwords
● Need better monitoring/alert system
● Multi-Factor Authentication (MFA) would have prevented compromise
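As an illustration of the monitoring point, the symptom in this incident (slowdowns only between midnight and 6am) is a time-of-day pattern that a simple check could flag. The sketch below is an assumption-laden example, not the alerting the Cedar team uses: `slow_hours`, the `(hour, runtime)` sample format, and the 1.5x threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Iterable, List, Tuple


def slow_hours(samples: Iterable[Tuple[int, float]], factor: float = 1.5) -> List[int]:
    """Return hours (0-23) whose mean runtime exceeds `factor` times the
    median of all hourly mean runtimes.

    `samples` is an iterable of (hour_of_day, runtime_seconds) pairs for a
    repeated reference workload.
    """
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, runtime in samples:
        by_hour[hour].append(runtime)
    hourly_mean = {h: mean(v) for h, v in by_hour.items()}
    if not hourly_mean:
        return []
    baseline = median(hourly_mean.values())
    return sorted(h for h, m in hourly_mean.items() if m > factor * baseline)
```

Comparing against the median of the hourly means keeps a few slow hours from shifting the baseline, which matters when the slowdown covers a quarter of the day.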
Questions?
System Access Best Practices
Scott Baker
Manager Sensitive Research, ARC/UBC;
Member of the National Security Council
Over to Scott...
Implementation, Interference, Improvement
(scale: 1 5 10)
Bonus Tip: But wait, there’s more!
WHAT
WHY
HOW
Bonus Tip: Consider how hard it will be to type in on your mobile: minimize keyboard switching.
Unique Passwords (Secrets)
The single most important good habit: Prevent one site’s breach from exposing all your accounts.
Passwords shouldn’t be memorized or guess-able.
Size matters – use passphrases
Only change them when necessary
Do not allow anything except your vault to “memorise passwords”
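To make "size matters" concrete, here is a minimal sketch of generating a random passphrase and estimating its strength. The word list `WORDS` is a hypothetical eight-word placeholder for illustration only; a real list should contain thousands of words. The helper names are assumptions as well.

```python
import math
import secrets
from typing import Sequence

# Hypothetical tiny word list, for illustration only.
WORDS = ["cedar", "graham", "beluga", "niagara", "arbutus", "orbit", "maple", "quartz"]


def make_passphrase(words: Sequence[str], count: int = 5, sep: str = " ") -> str:
    """Join `count` words chosen uniformly at random with a cryptographic RNG."""
    return sep.join(secrets.choice(words) for _ in range(count))


def entropy_bits(list_size: int, count: int) -> float:
    """Entropy in bits of `count` independent uniform picks from `list_size` words."""
    return count * math.log2(list_size)
```

With a 7776-word list, six words give roughly 77.5 bits (`entropy_bits(7776, 6)`). Lowercase words separated by spaces also fit the bonus tip about minimizing keyboard switching on mobile.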
https://haveibeenpwned.com/
Most people have at least one email listed at least once…or will eventually.
The email address I have had since 1994 is listed in 12
Depending on the breach, different information is included.
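The same site also offers a Pwned Passwords range lookup, which checks a password against known breaches without sending the password itself: only the first five characters of its SHA-1 hash leave your machine. The sketch below assumes that endpoint (`https://api.pwnedpasswords.com/range/`) and its `SUFFIX:COUNT` response format. The function names are hypothetical, and the per-email breach lookup shown on the slide is not covered here.

```python
import hashlib
import urllib.request
from typing import Tuple

RANGE_URL = "https://api.pwnedpasswords.com/range/"  # assumed range endpoint


def split_sha1(password: str) -> Tuple[str, str]:
    """Return (5-character prefix, 35-character suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_from_body(suffix: str, body: str) -> int:
    """Find `suffix` in a response of 'SUFFIX:COUNT' lines; 0 if absent."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count.strip())
    return 0


def pwned_count(password: str) -> int:
    """Network call: only the hash prefix is sent to the service."""
    prefix, suffix = split_sha1(password)
    req = urllib.request.Request(RANGE_URL + prefix,
                                 headers={"User-Agent": "pwned-check-sketch"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return count_from_body(suffix, resp.read().decode("utf-8"))
```

A non-zero result from `pwned_count(...)` means the password has appeared in a known breach and should not be used anywhere.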
Bonus Tip: Learn the keyboard shortcuts to save even more time.
Use a Vault for Secrets
It’s actually faster and easier than typing.
Vaults store much more than passwords
The ultimate exit strategy
Browser or Standalone. Either is better than nothing
Many open-source and free options exist
Bonus Tip: Change keys like passwords – if there is a possibility of compromise.
Practice Good Key Management
Keys are like passwords (but more) and the same rules apply.
Generate strong keys – learn the current ciphers: https://infosec.mozilla.org/guidelines/key_management
Always protect them with a password
The private key should never leave the system where it was generated.
Don’t sweat the public keys… they are public after all.
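One way to audit the "always protect them with a password" rule is to read the header of a private key file. A minimal sketch, assuming the key is in the OpenSSH private key format (the `openssh-key-v1` container), where the first field after the magic string names the cipher and `none` means no passphrase. The function names are hypothetical; older PEM-style keys are rejected with an error rather than analysed.

```python
import base64
import struct

BEGIN = "-----BEGIN OPENSSH PRIVATE KEY-----"
END = "-----END OPENSSH PRIVATE KEY-----"
MAGIC = b"openssh-key-v1\x00"


def openssh_key_cipher(key_text: str) -> str:
    """Return the cipher name stored in an OpenSSH-format private key header."""
    lines = [ln.strip() for ln in key_text.strip().splitlines()]
    if not lines or lines[0] != BEGIN or lines[-1] != END:
        raise ValueError("not an OpenSSH-format private key")
    blob = base64.b64decode("".join(lines[1:-1]))
    if not blob.startswith(MAGIC):
        raise ValueError("missing openssh-key-v1 magic")
    offset = len(MAGIC)
    (length,) = struct.unpack_from(">I", blob, offset)  # big-endian string length
    return blob[offset + 4:offset + 4 + length].decode("ascii")


def is_passphrase_protected(key_text: str) -> bool:
    """True when the key header names a cipher other than 'none'."""
    return openssh_key_cipher(key_text) != "none"
```

The check reads only the unencrypted header, so it never needs the passphrase and never touches the key material.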
Bonus Tip: Set your terminal to warn about multi-line paste.
Don’t Copy and Paste Code
It’s easy to hide content on the web through CSS or JS.
Malicious code can be hidden in a multitude of ways on web sites.
Hidden code in snippets may or may not originate with the content publisher.
In all cases pasting into a “dumb” editor first will help confirm what is actually on the clipboard before it goes into the shell or application.
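The "dumb editor" check can also be approximated in code. The sketch below inspects a string (for example, text you have pasted into a file) and reports the two things that are hardest to see by eye: line breaks, which can make a shell run a command immediately, and invisible or control characters. The function name `paste_warnings` is hypothetical.

```python
import unicodedata
from typing import List


def paste_warnings(text: str) -> List[str]:
    """Return human-readable warnings about risky content in `text`."""
    warnings = []
    if "\n" in text or "\r" in text:
        warnings.append("contains line breaks: a shell may execute it immediately")
    # Unicode categories Cc (control) and Cf (format) cover escape characters,
    # zero-width spaces and direction overrides.
    hidden = sorted({
        f"U+{ord(ch):04X}"
        for ch in text
        if ch not in "\n\r\t" and unicodedata.category(ch) in ("Cc", "Cf")
    })
    if hidden:
        warnings.append("contains invisible or control characters: " + ", ".join(hidden))
    return warnings
```

An empty list means neither pattern was found; it does not mean the command itself is safe to run.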
Bonus Tip: Wiping cookies will sign you out of many services in a single step.
Sign-out
If someone gains access to your device… they have access to everything you’re already signed into.
This also prevents information disclosure across different web sites and/or social media sites.
Signing out adds another layer of security.
Modern browsers sometimes allow different accounts, containers, or private browsing that segments access across different apps.
Bonus Tip: The encryption key (or recovery code when applicable) must be very carefully secured.
Encryption at Rest
Encryption protects confidentiality on multiple levels; the more mobile a device, the more it matters.
Most modern systems include the ability to turn on disk encryption. This includes desktop and laptop disk drives, mobile phones, even some USB keys.
In general encryption should always be enabled, but the more likely a device is to be lost or stolen the more important it becomes to encrypt it.
Bonus Tip: Incorrect permissions on cloud storage are the most common source of breaches.
Watch Your File Permissions
Captain obvious says: if you set it public… It’s public.
File permissions are powerful and entirely under your control.
Security is largely about “Need To Know”
Friends don’t let friends chmod 777
Make it sticky if that helps
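A quick way to audit your own directories is to list everything that is world-writable and note whether the sticky bit is set. This is a minimal sketch; the function name `world_writable` is hypothetical, and it only reports, it does not change any permissions.

```python
import os
import stat
from typing import Iterator, Tuple


def world_writable(root: str) -> Iterator[Tuple[str, bool]]:
    """Yield (path, has_sticky_bit) for world-writable entries under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if stat.S_ISLNK(st.st_mode):
                continue  # symlink modes are not meaningful
            if st.st_mode & stat.S_IWOTH:
                yield path, bool(st.st_mode & stat.S_ISVTX)
```

On a shared directory the sticky bit stops users from deleting or renaming each other's files, which is why a world-writable directory without it deserves a second look.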
Bonus Tip: Knowing where your sensitive data is makes this step easier.
Delete Unnecessary Information
What isn’t there, can’t be stolen or mishandled.
Consider what data you have and why.
Just like shredding paper documents – ‘securely’ delete data that is no longer required in any particular location.
Yes, it’s really just that simple.
Bonus Tip: Remember to check for patches to equipment & appliances like your router or smart-<thing>.
Apply Software Updates
Smart people are working hard to fix vulnerabilities: Take advantage of that (usually free) protection.
Computers, phones, TVs, cars, and now even toasters run on software/firmware. Keep that patched by applying reputable updates from known sources.
In some respects, the closer the device is to the outside – the more critical it is to patch.
Do not rely only on automatic updates.
Bonus Tip: Test your restoration plan periodically – ensure it works.
Have Backups and Plan to Restore
A backup that can’t be easily restored is useless, and a backup might be your only option after a ransomware attack.
Avoid becoming too entangled in a proprietary system that only works after it’s installed.
Remember to think about the security of your backup as well (eg: high-profile iCloud breaches)
A simple external hard drive caddy is reliable and cost effective.
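Testing a restore can be as simple as comparing checksums of the original files against a restored copy. The sketch below assumes both trees are available as local directories; the function names are hypothetical.

```python
import hashlib
import os
from typing import Dict, List


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: str) -> Dict[str, str]:
    """Map each file's path relative to `root` to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = sha256_file(full)
    return digests


def restore_mismatches(original: str, restored: str) -> List[str]:
    """Return relative paths that are missing or differ between the two trees."""
    a, b = tree_digests(original), tree_digests(restored)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty result from `restore_mismatches(original_dir, restored_dir)` means every file came back byte-for-byte identical.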
Bonus Tip: DNS lookup can sometimes be faster over a VPN.
Use a VPN
All unencrypted traffic over a network is subject to snooping.
Prevent credential and data theft on un-trusted networks.
Prevent DNS Cache poisoning attacks.
Basic: Subscribe to a trusted VPN service
And/Or Use your Institutional VPN service
Configure this for all mobile devices and use it any time you’re on an un-trusted network.
Advanced: Set up your own VPN
Bonus Tip: Also avoid social media single sign on to prevent creating a back door on yourself
Use Multi-Factor Authentication
Passwords are a single point of failure. MFA therefore provides dramatically increased protection.
Something you Have, Know, or Are
SMS is insecure, but still better than nothing
Many options exist, many systems support it.
EG: YubiKey, Google Authenticator, SecurID, etc...
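Authenticator apps such as Google Authenticator generate time-based one-time passwords (TOTP, RFC 6238) from a shared secret and the current time. The sketch below is a minimal standard-library illustration of that algorithm, not the code of any particular product; the SHA-1, 30-second step and 6-digit defaults are common choices assumed here.

```python
import hashlib
import hmac
import struct


def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA-1."""
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = int.from_bytes(mac[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(key: bytes, unix_time: float, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the time-step counter."""
    return hotp(key, int(unix_time) // step, digits)
```

In practice you would call `totp(secret, time.time())`. The server runs the same calculation, so a stolen password alone is not enough to sign in.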
Resources
• Mozilla Key Mgt: https://infosec.mozilla.org/guidelines/key_management
• KeePass Password Safe: https://keepass.info/download.html
• Cryptomator: https://cryptomator.org/
• Interpol: https://www.nomoreransom.org/
WestGrid Associate Member Program
Ahmad Bisher
Director, Strategy & Partnership Development
WestGrid
Introduction
● Over the years, institutions outside of WestGrid’s 7 founding members have been reaching out to get some help
● Growing number of requests and inquiries
● In response we want to extend our reach to them
● Created a program to help them and give them the opportunity to connect with the rest of the research community
Benefits
● Representation on WestGrid’s Associate Member Research Advisory Committee
● ‘Fast-tracked’ access to specialized support in multiple research domains, and expertise in software development, scientific visualization, machine learning, and research data management.
● Targeted and customizable training to build skills and practical experience in advanced research computing
● Discounted access to WestGrid-organized events, including training events and conferences
● Opportunities to participate in regional and national discussions in digital research infrastructure (DRI) platform and policy planning
Program Launch
● Send communications to those who expressed interest
● Send to other potential institutions who do research
● Use different communications channels to create awareness
● Set up meeting and walk them through the program and how we can help
● Once we have a few joining then establish the Advisory Group
Questions?
For more information: Email Ahmad:
or visit: www.westgrid.ca/become_member
Operations
Patrick Mann, Director of Operations, WestGrid
Outages and Maintenance I
Graham
● May 28, 2020: Planned maintenance requiring power cutoff. About 2 hours.
● May 22, 2020: Power outage; all jobs lost.
● May 11, 2020 - Jun 15, 2020: Security incident on gra-vdi.computecanada.ca resulting in no VNC access. No data lost or accessed, but required re-designing the integration of gra-vdi within the graham system.
● Apr 24, 2020: Globus endpoint outage due to hardware failure on gra-dtn1.
Cedar
● Apr 23 - current: Slow and sometimes unresponsive /scratch filesystem. Parts replacement gave an improvement but cedar team continuing to work with vendor.
● Apr 23, 2020: Filesystem issue. Faulty hardware and storage node required replacing.
Béluga
● Jun 11, 2020: Cooling outage - GPU nodes were shut down for a couple of hours.
● May 7, 2020: Unresponsive /home and /scratch.
● May 6, 2020: Outage on connection through commercial link.
● Apr 21, 2020: /scratch hardware problem.
● June: Hot weather - nodes drained due to request from utility.
Outages and Maintenance II
Niagara
● All well!
Arbutus
● June 16, 2020: DNS resolution - some sites not resolving properly (Docker and Yum downloads). Switched to a new internal DNS resolver.
● May 18, 2020: DHCP issues for some users in particular tenant networks.
● May 6, 2020: API access issues due to high load.
● Apr 17, 2020: Login problem to OpenStack dashboard.
Cedar /scratch issues
Cedar has had intermittent /scratch issues for quite a while.
● Storage system had a habit of losing drives but the drives themselves seemed fine.
Tuesday June 23:
We may have finally found and cleared the backend storage issue. It looks like a faulty SAS cable to one of two controllers was causing many drives to go into a missing/available/rebuild loop. There are now 33 full rebuilds in progress, and we expect them to finish by the end of today. Due to the heavy load from these rebuilds, filesystem performance may be marginally affected.
Very tough problem to track down. Congratulations to the cedar team:
● Lance Couture
● Lixin Liu
Cedar CPU usage
[Chart: Cedar CPU usage, annotated with “New hardware” and “Security Incident”]
Cedar GPU Utilization
GPU usage took a while to ramp up.
● To ensure that legacy job scripts worked, the default was the older Pascal GPUs.
● New GPUs (Volta) needed to be specifically identified in job scripts (mentioned in the last Town Hall).
[Chart: Cedar GPU utilization, annotated with “New hardware” and “Security Incident”]
Cedar Storage (June 21, 2020)
Filesystem: /project
● Current capacity: 21 PB
● Current utilization: 9.3 PB (46%)
● RAC 2021 target (75%): 15.8 PB
● RAC 2020 allocation: 11.3 PB
● So we’re ok currently for storage especially after the latest purchases and upgrades.
● RAC 2020 was difficult: intense discussion
○ Moved many folks to /nearline
● We’re expecting a significant increase in RAC 2021 storage asks.
● No new funds for storage.
● And note that storage is very difficult to re-allocate or scale.
/nearline is now in production. Ask at [email protected]
Arbutus Usage
[Charts, Jan 1, 2019 - Present: Instances; Projects; vCPUs used; Memory used]
User Training & Outreach
Alex Razoumov, Visualization & Training Coordinator, WestGrid
2020 Summer School
● “Regular” years
○ two in-person schools, each 4 days, 2-3 parallel streams, 12-15 courses * 3-6 hours each
● This year: single online school for all WestGrid (and beyond) institutions
○ 7 weeks (May-25 to July-10)
○ single stream (no parallel courses)
○ 16 courses + 3 repeat courses (everyone on the waitlist)
○ each course: 1-3 days of mostly at-your-own-pace learning
○ format: reading materials + pre-recorded videos + exercises + live Zoom sessions
○ in most courses interactive hands-on on our training cluster or Cedar
● 300 registrants and 300+ on the waitlist
● Zoom session invites only to participants with Canadian institutional email addresses (83%)
● presentation materials available to all
https://wgschool.netlify.app
2020 Summer School
❏ Bash command line
❏ Intro to HPC
❏ Programming in Julia
❏ Python in Jupyter
❏ Scientific visualization
❏ Machine learning with PyTorch
❏ UBC Sockeye cluster
❏ Parallel programming in Chapel
❏ Gromacs optimization
❏ Singularity containers
❏ Bioinformatics sessions
❏ Git version control
❏ Databases on Cedar
❏ CC cloud
❏ Docker in your VM
❏ MATLAB sessions
https://wgschool.netlify.app/program
Already started running repeat courses, so too late to join the waitlist …
However:
1. all presentation materials are available!
2. we will likely repeat a smaller version of this school in several months
Big thanks to our instructors: Marie-Helene Burle, Ian Allison, Roman Baranowski, Ryan Thomson, Olivier Fisette, Grigory Shamov, Phillip Richmond, Matthew Douglas, Brian McConeghy, Alex Lopes, Wolfgang Richter, Venkat Mahadevan, Jacob Boschee, Raymond Norris, Reece Teramoto
Other Online Training
● WestGrid fall webinars will start in September
○ up to 1 hour, usually not interactive
○ please email us your topic suggestions [email protected]
○ the schedule will be finalized by late August
● Collaborating with UBC ARC for their summer school in August
● Collaborating with UBC Research Commons for their fall workshops (monthly WestGrid series)
● ECCC custom workshops in September
We are very much looking forward to the day when we can resume our in-person training!
Questions?
To ask questions about today’s presentations:
From Webstream: email [email protected]
From Vidyo: Two options...
Unmute your mic
Use the group chat
Questions after this session? Email us anytime: [email protected]
We also advocate on behalf of WestGrid member and user concerns within Compute Canada.