Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Greg Leffler, Manager, Site Reliability
https://linkedin.com/in/gleffler
[email protected]


Transcript of Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Page 1: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Greg Leffler
Manager, Site Reliability
https://linkedin.com/in/gleffler
[email protected]

Page 2: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Who am I?

Site Reliability Manager (New York)
MS in Industrial/Organizational Psychology
Responsible for the interview process for SREs
- Took on this responsibility as an IC, so the process originated from the bottom up
Team grew 10x from August 2011 to May 2014

Page 3: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Who are SREs at LinkedIn?

100+ SREs
5 sites, 2 countries
1000+ SW Engineers
8th busiest website in the world
10k+ prod machines per DC: 2 DCs today, +1 in 2014
300+ RESTful services, 300MM+ members
Services with 99th %ile latencies as low as 10 ms

Page 4: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

What matters for a great company?

Funding?
Good idea?
Execution?
Product?

People.

Page 5: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Obligatory LinkedIn culture plug

Talent is our #1 operating priority.
Our culture is what sets us apart.
We are committed to supporting the career transformation of our employees.
Transparency is encouraged and emphasized at every company all-hands meeting
- These occur every other week
Our commitment to our employees is emphasized in how we behave
- Everyone is encouraged to do interviews! Yes, everyone.
- ~60% of SREs participate in the interview process

Page 6: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

What do we want from SREs?

Excited about LinkedIn and the SRE role
- We have the luxury of being picky
Fit our culture and embody our values
- These matter. If you haven’t set them or can’t articulate them, you need to do that first
Have the skills needed to do the job
- These also matter. You need to know what these are before you screen for them
AND NOTHING ELSE

Page 7: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

These don’t always work

Coding puzzles
“Fermi problems”
Algorithm design questions
If you were a zebra, what pattern would your stripes have?
Homework
Personality tests
Trivia (quick, which signal is #7 in RHEL 6.4 on x86?)

Page 8: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Here’s why

Industrial Psychology has figured this out already
Schmidt & Hunter, 1998
- The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings
Even if they hadn’t, you should collect your own data
- And not rely on hunches or cargo cults
Further reading in the notes on this slide.

Page 9: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

What does work?

Good funnel at the start

Realistic job previews

Structured interviews

Situational judgment tests

Page 10: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

The LinkedIn SRE funnel

Sourcing/screening
Recruiter prescreen
Operationally-focused phone screen (TPS 1)
Code-focused phone screen (TPS 2)

By the time candidates come onsite, we expect that they will pass.

(Funnel figure; percentages shown on the slide: 82%, 24%)

Page 11: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Realistic job preview

Page 12: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Structured Interview

Page 13: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Situational Judgment Test

Page 14: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

How do we implement these?

Live Troubleshooting (Realistic job preview)
Systems Internals, Web Architecture (Structured interviews)
Triage & Investigation (Situational judgment test)
Host Manager (structured interview for culture and role fit)
Lunch (not an interview… or is it?)

Page 15: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Live Troubleshooting

Here’s a broken service (in EC2)
Fix it
(As realistic as it gets)
No ‘man voldemort’
You are probably the 1st person in the world to troubleshoot the exact situation in question

Page 16: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Technical Modules

Added structure and scoring guidelines
Scoring guidelines are what matter
Consistency is the only way you can scientifically show whether these are working
High # of interviewers = need to be able to compare results

Page 17: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Triage & Investigation Module

Situational Judgment and Triage
- It’s your first day oncall and the NOC calls to say the site is on fire. Here’s the alert board – what do you look at first? Why?
Assesses standard troubleshooting/investigation ability
- “The CEO calls you and says ‘the site is slow’ – what do you do?”
- “Disk is full. You delete a file but df still shows the disk being full. What’s wrong?”
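
For readers unfamiliar with the last question, one common cause on POSIX systems (not necessarily the only answer an interviewer would accept) is that a running process still holds the deleted file open, so the filesystem cannot reclaim its blocks until the descriptor is closed. A minimal Python sketch of that situation follows; the function name and the 4096-byte size are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def deleted_but_open_demo(size=4096):
    """Create a file, delete its name while it is still open, and inspect it.

    Assumes a POSIX filesystem (for example Linux); behavior differs on Windows.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * size)
        os.unlink(path)          # the "rm": removes the directory entry only
        info = os.fstat(fd)      # the open descriptor still refers to the data
        return {
            "path_exists": os.path.exists(path),
            "link_count": info.st_nlink,
            "bytes_still_held": info.st_size,
        }
    finally:
        os.close(fd)             # only now can the filesystem free the space


if __name__ == "__main__":
    print(deleted_but_open_demo())
```

On a real host, `lsof +L1` lists open files whose link count is below one, which is one way to find the process holding the space.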

Page 18: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Results of implementing changes

Happier candidates
- In fact, no unhappy candidates
- “The troubleshooting module was the most fun I’ve ever had in an interview”
- “I thought the troubleshooting module was hard but I learned so much”
Happier interviewers
- Some hesitation at first
- Live Troubleshooting is stressful for the interviewer too!
- Solve with training and apprenticeships

Page 19: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Data, data, data

We’re collecting scores from each module
Correlating them to performance ratings
Re-evaluating the utility of each module
- If a module doesn’t predict performance, get rid of it
- This is hard, especially with things people ‘need’
- However, if there’s no correlation, it is worthless.
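
The correlation step described on this slide could be sketched roughly as below. This is an illustration under stated assumptions, not the analysis LinkedIn actually ran: the Pearson coefficient is one reasonable choice of statistic, and the function names, the module names, the scores, and the 0.2 cutoff are all hypothetical. With a small number of hires the estimate is noisy, so a low value is a prompt to look closer rather than proof on its own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no predictive signal
    return cov / sqrt(var_x * var_y)


def modules_to_review(module_scores, performance, threshold=0.2):
    """Return module names whose scores correlate weakly with performance.

    The threshold is an arbitrary illustrative cutoff (an assumption).
    """
    return [
        name
        for name, scores in module_scores.items()
        if abs(pearson(scores, performance)) < threshold
    ]


if __name__ == "__main__":
    # Hypothetical data: one interview score per hire, per module,
    # and one later performance rating per hire.
    scores = {
        "example_module_a": [1, 2, 3, 4, 5],
        "example_module_b": [2, 4, 3, 4, 2],
    }
    ratings = [1, 2, 3, 4, 5]
    print(modules_to_review(scores, ratings))
```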

Page 20: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

How to make your process better

Make talent your first priority
Implement the good stuff from I/O psych
- Realistic job previews
- Situational judgment tests
- STRUCTURED interviews
Collect data on interview performance (module scores)
- Correlate this to job performance!
- Re-evaluate your process

Page 21: Growing the Site Reliability Team at LinkedIn: Hiring is Hard

Want to experience it for real?

We’re hiring. See me afterwards.
Office hours are at 2 pm
- Any hiring or culture related questions are fair game