Facilitating Collaborative Life Science Research in Commercial & Enterprise Environments

1

Photo credit: Aaron Gardner

Bridging the Gap - Facilitating Collaborative Life Science Research in Commercial & Enterprise Environments

March 2017 - NEREN SEMINAR

2

I’m Chris.

I’m an infrastructure geek (and failed scientist).

I work for the BioTeam.

Photo credit: Cindy Jessel

@chris_dag

3

www.BioTeam.net

Independent Consulting Shop

Run by scientists forced to learn IT to “get science done”

Virtual company with nationwide staff

15+ years “bridging the gap” between hardcore science, HPC & IT

Honest. Objective. Vendor & Technology Agnostic.

We are hiring :)

4

Content Warning

I am not an “expert” … or a “thought leader”

I try to speak honestly about what I see, do and experience “on the ground” as an IT worker

My views are biased by the types of work I perform. Filter my words through your own expertise …

I’m worried about time so I may skip slides — full PDF of slide deck will be available.

5

Q1’17 Current State: Commercial LifeSci Research Computing

6

01: Science Evolves Faster Than IT

‣ Rate of scientific innovation is incredible

‣ Same innovation rate seen with lab side instruments

‣ Scientific and instrument requirements change far faster than IT organizations can build, rebuild or refresh complex infrastructure

‣ In the face of science world changing month-to-month:

‣ … best funded, most aggressive shops can only refresh large installations every ~2 years. Most refresh on 3-4 year cycles.

‣ Gulp!

7

02: We’ve lost the centralization battle

‣ Old way:

‣ Centralize all HPC and Research Computing functions into a single-site, centrally managed & supported environment

‣ Bring the users and the data to the shared environment

‣ This no longer works as well as it used to …

‣ Terabyte-scale instruments have diffused EVERYWHERE and will continue to pop up “everywhere”

‣ Building/campus LANs can’t support tera|peta-scale data movement

‣ Does not address external collaborators or data sources well
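A back-of-envelope calculation makes the data-movement point concrete. The sketch below is illustrative only: the `transfer_hours` helper, the decimal unit sizes and the 80% link-efficiency figure are assumptions chosen for the example, not numbers from the talk.

```python
def transfer_hours(size_bytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move size_bytes over a link rated at link_gbps gigabits/s.

    efficiency is an assumed fraction of line rate that a real transfer achieves.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency must be in (0, 1]")
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    seconds = size_bytes * 8 / usable_bits_per_second
    return seconds / 3600


TB = 1e12  # decimal terabyte, in bytes
PB = 1e15  # decimal petabyte, in bytes

if __name__ == "__main__":
    for label, size in (("1 TB", TB), ("1 PB", PB)):
        for gbps in (1, 10, 40, 100):
            hours = transfer_hours(size, gbps)
            print(f"{label} over {gbps:>3} Gb/s: {hours:10.1f} hours")
```

Under these assumptions a single petabyte needs roughly 278 hours (about 11.6 days) of sustained transfer on a 10 Gb/s link, which is why terabyte-scale instruments scattered across a campus strain building LANs.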

8

03: Petabytes for “free”

‣ There are petabytes of very interesting open-access data available for free on the internet

‣ There are many valid business and scientific reasons for a research computing user wanting to bring some of this data in-house to facilitate new or existing research programs but …

‣ Massive technical challenges (Ingest, ‘trash tier storage’, etc.)

‣ Massive organizational challenges:

‣ It takes a ton of work and resources to host peta-scale “free” data

‣ Organizations struggling to build governance/approval models tied to actual business or scientific goals

9

04: Userbase now spanning the enterprise

‣ Life was a lot easier when the only users of research computing were scientists and R&D organizations

‣ Easy to build domain expertise and bias our infrastructure to favor power and capability over 99.99% uptime. Researchers will tolerate occasional downtime if the “payoff” is faster systems or bigger storage

‣ Much harder when the full enterprise needs “data intensive science”

‣ Those pesky corporate types want SLAs and 24x7 support :)

‣ Userbase diversity is incredible: manufacturing, process optimization, commercial operations, sales operations, compliance, risk management, etc.

‣ Far far harder to support, train, enable and “mentor”

10

05: Data Types Getting Weird

‣ We are very good at handling terabytes and petabytes of static structured or unstructured data - storage tech and operational practices for this have evolved over DECADES

‣ Ingesting, storing and computing against data streams requires entirely new tech, skills and infrastructure

‣ Sensor telemetry from bioreactors in manufacturing

‣ Environmental sensor data streams from greenhouses

‣ Website clickstream and advertising metrics from Commercial Ops

‣ etc. etc.
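One way to see why streams differ from static files: the computation has to keep up with readings as they arrive while holding only a bounded window in memory. The following is a minimal sketch of that idea, not a recommendation for any particular streaming stack; the `rolling_mean` helper, the window size and the sample readings are all invented for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(readings: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` readings as each one arrives."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # only the last `window` readings are kept
    total = 0.0
    for value in readings:
        if len(recent) == window:
            total -= recent[0]  # the oldest reading is about to be evicted
        recent.append(value)
        total += value
        yield total / len(recent)


if __name__ == "__main__":
    # Hypothetical bioreactor temperature readings, one per tick.
    for mean in rolling_mean([37.0, 37.2, 36.9, 37.4], window=2):
        print(f"{mean:.2f}")
```

Because `readings` can be any iterable, the same function works on a finite list or on a generator that never ends, which is the property static-file tooling usually lacks.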

11

06: Our Networks Suck

‣ Enterprise network architectures are optimized for lots of small concurrent traffic flows. They have issues with “elephant flows” where a single network flow may be using 1gb, 10gb or 40gb of bandwidth to move a big data file

‣ Our network cores can barely handle 10gig when they should be running at 40gig and 100gig so they can do 10gig to top-of-rack trivially

‣ Our building-to-building and lab-to-lab links are woefully undersized

‣ Our connections to the outside world are woefully undersized

‣ Cost of Cisco networking at 40gb and higher is simply ludicrous

12

07: Our Firewalls Suck

‣ Stuck with legacy model and operational assumptions (“Yes we can do deep packet inspection on EVERYTHING …” & “Yeah it makes total sense to only put a firewall at the perimeter of our network”)

‣ That $90,000 firewall advertised as “10gig ready” can’t actually handle a large scientific data transfer because inside the box they are actually aggregating 10x cheap 1gig network paths and calling it “10 gig”

‣ Feed it a single file transfer stream @ 10gbps and watch it thrash and drop throughput by 90%.
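The arithmetic behind that 90% drop can be sketched with a toy model. The talk does not say how such a firewall spreads traffic across its internal paths; the code below assumes each flow is pinned to one path by a hash of its 5-tuple, which is how link aggregation commonly behaves. The function names, addresses and rates are hypothetical.

```python
import zlib


def lane_for_flow(flow: tuple, lanes: int) -> int:
    """Pick an internal lane from a stable hash of the flow's 5-tuple."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % lanes


def delivered_gbps(flows: dict, lanes: int = 10, lane_gbps: float = 1.0) -> float:
    """Total throughput when every flow is pinned to exactly one lane.

    `flows` maps a 5-tuple to the rate (in Gb/s) that flow is trying to send.
    """
    offered_per_lane = [0.0] * lanes
    for flow, rate in flows.items():
        offered_per_lane[lane_for_flow(flow, lanes)] += rate
    # A lane delivers at most lane_gbps, however much traffic is offered to it.
    return sum(min(offered, lane_gbps) for offered in offered_per_lane)


if __name__ == "__main__":
    # One "elephant flow": a single 10 Gb/s file transfer.
    elephant = {("10.0.0.5", "192.0.2.9", 50000, 443, "tcp"): 10.0}
    # Many small flows adding up to the same 10 Gb/s of offered load.
    mice = {("10.0.0.5", "192.0.2.9", 40000 + i, 443, "tcp"): 0.01 for i in range(1000)}
    print("elephant:", delivered_gbps(elephant), "Gb/s")
    print("mice:    ", round(delivered_gbps(mice), 2), "Gb/s")
```

In this model the single transfer can never exceed one lane's 1 Gb/s, while the same offered load split into many small flows spreads across all ten lanes, matching the traffic pattern these boxes were designed for.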

13

Summarizing our key challenges

What keeps us from the collaborative computing promised land?

Collaborative Research: Key Challenges

‣ Network speeds: Internal & External

‣ Deploying ScienceDMZ architectures to take “data intensive science” load off of networks built for business users

‣ Network security methods: Core & Edge

‣ Federated Identity Management

‣ Obtaining the domain expertise required to enable, mentor and fully support the massively expanding class of collaborative researchers who need sophisticated compute and analytics

14

15

Ok dude. All your challenges are tech related. What about the human side of research facilitation?

16

Collaborative Research Challenges: Human Factors

‣ Wishful thinking rather than critical thinking about what the organization REALLY wants to encourage. We see a lot of “build the database/catalog/warehouse/repository/lake/commons and they will come” pitches with zero support for follow-through.

‣ Collab/research facilitators with enough seniority to be thinking “Where are the collaborative opportunities, how do they align with the business needs, what data is actually useful to others?”

17

Collaborative Research Challenges: Human Factors, 2

‣ The BIGGEST ISSUE OF ALL:

‣ What’s in it personally for the collaborating parties?

‣ Does this get them promoted, published, solve their research problem, answer their burning questions, etc. or does it detract from these things by taking time away from activities more beneficial to the org or person?

‣ Does the ‘system’ support or inhibit collaboration through activities like budget allocations, staffing, approval processes, etc.?

‣ Org charts, corporate culture and operating models can either encourage or stifle any collaborative efforts that may exist. h/t - Simon Twigger!

18

Collaborative Research Challenges: Human Factors, 3

‣ Research Facilitators in Industry: Someone needs to be out there learning about the ‘silos of excellence’ and seeing the opportunities for collaboration

‣ Some scientists are too heads down in their own area to see beyond immediate needs. Having a human to make this happen could be huge, way more effective than all the technological ‘solutions’ we usually throw at this problem.

‣ Impedance mismatch: A real issue. We need something like an E-Harmony for matchmaking between collaborators with the same motivation levels!

h/t - Simon Twigger!

19

Collaborative Research Computing: Internal

Supporting internal efforts in commercial pharma/biotech

20

Facilitating Internal Collaboration

‣ Harder than multi-party collaboration in some ways

‣ Few companies incentivize or otherwise actively encourage collaboration across departmental boundaries

‣ Or if they do “encourage” it, it is often just empty talk; the reality on the ground when it comes to performance reviews, HR and local management may be different

‣ Talk is cheap. Taking steps to encourage, track and reward people is not.

‣ Other main issue is “impedance mismatch” between potential collaborators

‣ Often two groups that may wish to collaborate may have different timeframes, interest levels and available resources. Tough to find perfect alignment

21

Internal Collaboration: How we do it (1)

‣ Regular HPC/computing training classes where all are welcome and attendees span various business units. Serendipitous opportunities abound

‣ Mailing list, Slack etc. methods for consumers of research computing services to actively communicate, share code and troubleshooting assistance

‣ Road-shows and “lunch and learn” sessions with rotating cast of speakers, delivered across multiple sites. Speakers are often users/consumers with great stories and data to talk about

‣ Having most apps and data sets on a large single namespace storage system makes the act of collaboration easier for all comers; Private GitLab or other code hosting portal for users to share code and tooling also helps

22

Internal Collaboration: How we do it (2)

‣ Publishing data catalogs so people understand what is available for use and exploration is very helpful. Does not have to be complex - even a simple Wiki or web page can work

‣ “Research Facilitators” who can embed with departments or groups for weeklong or monthlong periods are very useful

‣ … at driving new use cases and collaborations

‣ … at collecting valuable domain knowledge needed for long term support of users and departments

‣ … dissolving barriers between IT and people asking interesting questions

23

Internal Collaboration: Challenges

‣ Fighting for permission to deploy real, useful collaboration tools vs. management who just keep saying “SharePoint, SharePoint, SharePoint …”

‣ The new crop of potential collaborators may sit at sites not previously covered by research computing infrastructure or support resources

‣ As data types and tooling get more diverse and more complex it is a constant battle to retain the internal IT “domain knowledge” necessary to help compute consumers be successful in their efforts

‣ Research IT / R&D organizations have long known the value of hiring “research facilitators” or embeddable support/consultants. This awareness is far less common outside of Research.

‣ Non-research/Non-product groups are often not funded at levels that allow them to think about novel support / staffing / collaboration structures

24

Collaborative Research Computing: Multi-Party

Supporting multi-party collaboration in commercial pharma/biotech

25

Multi-Party Collaboration: How we do it (1)

‣ Supporting this work is straightforward. We don’t have to evangelize or encourage: they know what they want to do and “our” job is to deploy & facilitate

‣ We usually don’t even have to train people. The collaborators know their data and tooling far better than we do

‣ Important to understand in the commercial space that it is common for organizations to be collaborators in one area and fierce competitors in other areas/markets

‣ This means that NOBODY is punching holes in firewalls and adding external people to the local Active Directory server.

‣ Almost all of the complex multi-party collaborations that BioTeam is involved with in this space are occurring within dedicated IaaS cloud environments

26

Multi-Party Collaboration: How we do it (2)

‣ IaaS cloud environments like a private Amazon AWS VPC are the default neutral meeting ground for complex multi-party/multi-organization collaborations

‣ Why?

• Nobody has to invite strangers behind their firewall or VPN

• Vast amounts of storage, compute and analytics resources at-hand

• Security controls are powerful and very fine-grained. Often 1000x more capable than the security controls we typically see “in-house”

• Data sets may already be hosted on Amazon and if not, high-velocity data ingest is something that can be engineered and built

• AWS is on Internet2 — good access to national research centers and academia

27

Multi-Party Collaboration: Challenges

‣ The biggest challenge is identity management, authorization and access control

‣ Building a federated ID service that can do role based access control amongst multiple people and institutions is neither quick nor simple

‣ The people who control Active Directory “at-home” rarely interact with mere mortals and securing approval to expose/federate an internal directory to “the cloud” can be a long and complex process

‣ In a 40,000 person global enterprise there may be only 2 folk who truly understand the deep technical details involved with ADFS, AD, SAML, Federation and related topics. Finding those people and stealing them for your team is hard work.

‣ Those crazy academic collaborators use weird stuff for ID management like “Shibboleth” :) that corporate IT suits have a very hard time understanding and dealing with

‣ Other challenges: Long term storage and hosting of data if terabyte or petabyte volumes of data are involved. Where does this go after active collaboration ends?
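To make the role-based access control problem concrete, here is a deliberately small sketch of the decision a federated service has to make once a login has been accepted. It does not use any real SAML, ADFS or Shibboleth API. The issuers, group names, roles and permissions are all hypothetical, and a real deployment would also have to validate the assertion itself.

```python
from dataclasses import dataclass

# Hypothetical mapping from (identity provider, group asserted by that provider)
# to a role inside the shared collaboration environment.
ROLE_MAP = {
    ("pharma-a.example", "compbio"): "analyst",
    ("pharma-b.example", "data-stewards"): "data_steward",
    ("university.example", "lab-members"): "analyst",
}

# Hypothetical permissions granted to each role.
PERMISSIONS = {
    "analyst": {"read_data", "run_jobs"},
    "data_steward": {"read_data", "write_data", "grant_access"},
}


@dataclass(frozen=True)
class Assertion:
    """The minimum this sketch assumes a federated login provides."""
    issuer: str    # which institution vouched for the user
    subject: str   # the user's identifier at that institution
    groups: tuple  # group memberships claimed by the issuer


def roles_for(assertion: Assertion) -> set:
    """Roles are granted per (issuer, group) pair, never by group name alone."""
    return {
        ROLE_MAP[(assertion.issuer, group)]
        for group in assertion.groups
        if (assertion.issuer, group) in ROLE_MAP
    }


def is_allowed(assertion: Assertion, action: str) -> bool:
    """True if any role held by the user permits the requested action."""
    return any(action in PERMISSIONS[role] for role in roles_for(assertion))
```

Keying the role map on the issuer as well as the group is the design choice that matters here: a group called "compbio" asserted by an institution nobody agreed to trust grants nothing.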

28

A reasonable question to ask …

‣ Why is all this collaborative scientific computing stuff on Amazon instead of a regional specialty facility like MGHPCC?

‣ Lots of reasons but none are insurmountable …

• Awareness & ease of access

• Inertia and laziness

• 3rd party vendor & solution presence within AWS

• …

29

But …

‣ There is an interesting trend BioTeam has observed that may play into this …

‣ We are predicting a number of high-profile ‘cloud pullback’ projects this year and next. We are actively working on at least one right now involving large-scale scientific computing and petabyte+ volume of data.

‣ The VERY INTERESTING thing is that these projects that are being “pulled back” from public clouds ARE NOT going back on-premise.

• … they are going to specialty facilities that appear similar in nature/mission to MGHPCC

• End result: You may see a larger commercial/industrial presence at shared facilities more commonly associated with academic or .gov supercomputing. Industry/Academic collaborations may get much easier in the future if this trend holds up.

30

end; Thanks! slideshare.net/chrisdag/ [email protected] @chris_dag
