Scaling Servers and Storage for Film Assets


In the past two years, Pixar has grown from a handful of Perforce servers to over 90 servers. In this session, members of the Pixar team will discuss how they met the challenges in scaling out and being prudent about storage usage, from automating server creation to de-duplicating the repositories.


Mike Sundy, Digital Asset System Administrator

David Baraff, Senior Animation Research Scientist

Pixar Animation Studios

•  Environment Overview
•  Scaling Storage
•  Scaling Servers
•  Challenges with Scaling

Environment Overview

Environment

As of March 2011:
•  ~1000 Perforce users (80% of the company)
•  70 GB db.have
•  12 million p4 ops per day (on the busiest server)
•  30+ VMware server instances
•  40 million submitted changelists (across all servers)
•  On 2009.1, planning to upgrade to 2010.1 soon

Growth & Types of Data

Pixar grew from one code server in 2007 to 90+ Perforce servers storing all types of assets:
•  art – reference and concept art; inspirational art for a film.
•  tech – show-specific data, e.g. models, textures, pipeline.
•  studio – company-wide reference libraries, e.g. animation reference, config files, a Flickr-like company photo site.
•  tools – code for our central tools team, software projects.
•  dept – department-specific files, e.g. Creative Resources has "blessed" marketing images.
•  exotics – patent data, casting audio, data for live-action shorts, story gags, theme park concepts, the intern art show.

Scaling Storage

Storage Stats

•  115 million files in Perforce.
•  20+ TB of versioned files.

Techniques to Manage Storage

•  Use the +S filetype for the majority of generated data. Saved 40% of storage for Toy Story 3 (1.2 TB). (Typemap sketch below.)
•  Work with teams to migrate versionless data out of Perforce. Saved 2 TB by moving binary scene data out.
•  De-dupe files – saved 1 million files and 1 TB.
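The typemap itself isn't shown in the talk; as a rough sketch with hypothetical depot paths, +S can be applied to generated data automatically at add time via p4 typemap, so only the most recent revision(s) keep their archive contents:

$ p4 typemap
TypeMap:
	binary+S //tech/.../renders/....exr
	binary+S2 //tech/.../baked/....bgeo

Here binary+S purges all but the head revision's contents, and binary+S2 keeps the two most recent; older revisions stay in the history but take no archive space.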

De-dupe Trigger Cases

•  p4 submit file1 file2 ... fileN
   p4 submit file1 file2 ... fileN    # only file2 actually modified

•  p4 submit file    # contents: revision n
   # five seconds later: "crap!"
   p4 submit file    # contents: revision n-1

•  p4 delete file
   p4 submit file    # user deletes file (revision n)
   # five seconds later: "crap!"
   p4 add file
   p4 submit file    # contents: revision n
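The talk doesn't show how the de-dupe trigger is registered; purely as a hypothetical illustration (the script name and path are made up), a change-commit entry in the triggers table could look like:

Triggers:
	dedupe change-commit //... "/usr/local/bin/dedupe.py %change%"

A change-commit trigger fires after the changelist is committed, which is the point at which the new archive files exist and can be compared against earlier revisions.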

De-dupe Trigger Mechanics

[Diagram: archive files repfile.14, repfile.15, repfile.24, repfile.26, repfile.34, and repfile.38 all hold identical contents (AABBCC…), while repfile.25 holds different contents (XXYYZZ…); successive revisions file#n, file#n+1, file#n+2 can therefore point at byte-identical archive files.]

De-dupe Trigger Mechanics

[Diagram: repfile.24 and repfile.26 hold identical contents (AABBCC…); repfile.25 differs (XXYYZZ…).]

•  +F for all files; detect duplicates via checksums.
•  Safely discard the duplicate:

   $ ln repfile.24 repfile.26.tmp
   $ rename repfile.26.tmp repfile.26

[Diagram: the new hard link repfile.26.tmp points at repfile.24's data and is then renamed over repfile.26, so both revisions share a single copy on disk.]
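A minimal sketch of such a de-dupe pass, assuming +F archive files named repfile.N sitting in one archive directory and MD5 checksums for duplicate detection; this is an illustration, not Pixar's actual trigger code:

import hashlib
import os

def file_digest(path, bufsize=1 << 20):
    # Checksum a file's contents in chunks (MD5 here).
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_archive_dir(archive_dir):
    # Hard-link byte-identical repfile.N archives so duplicates share one copy on disk.
    seen = {}  # digest -> path of the first archive seen with that content
    for name in sorted(os.listdir(archive_dir)):
        if not name.startswith("repfile."):
            continue
        path = os.path.join(archive_dir, name)
        digest = file_digest(path)
        original = seen.setdefault(digest, path)
        if original == path or os.path.samefile(original, path):
            continue  # first copy, or already hard-linked
        tmp = path + ".tmp"
        os.link(original, tmp)   # hard link to the surviving copy
        os.rename(tmp, path)     # atomically replace the duplicate

The hard link plus rename mirrors the diagram above: the duplicate's data blocks are released once no name references them, while Perforce still sees a distinct revision.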

Scaling Servers

Scale Up vs. Scale Out

Why did we choose to scale out?
•  Shows are self-contained.
•  Performance of one depot won't affect another.*
•  Easy to browse other depots.
•  Easier administration/downtime scheduling.
•  Fits with the workflow (e.g. no merging of art).
•  Central code server – share where it matters.

Pixar Perforce Server Spec

•  VMware ESX version 4.
•  RHEL 5 (Linux 2.6).
•  4 GB RAM.
•  50 GB "local" data volume (on an EMC SAN).
•  Versioned files on NetApp GFX.
•  90 Perforce depots on a 6-node VMware cluster – a special 2-node cluster for the "hot" tech show.
•  For more details, see our 2009 conference paper.

Virtualization Benefits

•  Quick to spin up new servers.
•  Stable and fault-tolerant.
•  Easy to administer remotely.
•  Cost-effective.
•  Reduces datacenter footprint, cooling, power, etc.

Reduce Dependencies

•  Clone all servers from a VM template.
•  RHEL vs. Fedora.
•  Reduce triggers to a minimum.
•  Default tables and p4d startup options (example below).
•  Versioned files stored on NFS.
•  VMs run on a cluster.
•  Can build a new VM quickly if one ever dies.
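As an example of the kind of default p4d startup options meant here (the paths and port are hypothetical, not Pixar's actual configuration), every instance can be launched the same way from its init.d script:

$ p4d -r /p4/<depot>/root -J /p4/<depot>/journal -L /p4/<depot>/log -p 1666 -d

where -r sets the server root, -J the journal file, -L the error log, -p the listen port, and -d runs p4d as a daemon.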

Virtualization Gotchas

•  Had a severe performance problem when one datastore grew to over 90% full.
•  Requires some jockeying to ensure load stays balanced across multiple nodes – manual vs. auto.
•  Physical host performance issues can cause cross-depot issues.

Speed of Virtual Perforce Servers

•  Used the Perforce Benchmark Results Database tools.
•  Virtualized servers reached 95% of physical performance on the branchsubmit benchmark.
•  85% on the browse benchmark (not as critical to us).
•  VMware's flexibility outweighed the minor performance hit.

Quick Server Setup

•  Critical to be able to spin up new servers quickly.
•  Went from 2-3 days for setup to 1 hour.

1-hour setup:
•  Clone a p4 template VM. (30 minutes)
•  Prep the VM. (15 minutes)
•  Run the "squire" script to build out the p4 instance. (8 seconds)
•  Validate and test. (15 minutes)

Squire

A script that automates p4 server setup. It sets up:
•  p4 binaries
•  metadata tables (protect/triggers/typemap/counters)
•  cron jobs (checkpoint/journal/verify)
•  monitoring
•  permissions (filesystem and p4)
•  init.d startup script
•  linkatron namespace
•  pipeline integration (for tech depots)
•  config files
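As a rough illustration of one of the steps squire automates (the real script isn't shown in the talk; the port, user, and template paths below are hypothetical), loading template metadata tables into a freshly started instance amounts to piping saved specs into "p4 <table> -i":

import subprocess

P4 = ["p4", "-p", "newdepot:1666", "-u", "p4admin"]  # hypothetical server and admin user

def load_spec(table, template_path):
    # Feed a saved template spec (protect, triggers, typemap, ...) to "p4 <table> -i".
    with open(template_path) as f:
        spec = f.read()
    p = subprocess.Popen(P4 + [table, "-i"], stdin=subprocess.PIPE,
                         universal_newlines=True)
    p.communicate(spec)
    if p.returncode != 0:
        raise RuntimeError("failed to load %s from %s" % (table, template_path))

for table in ("protect", "triggers", "typemap"):
    load_spec(table, "/p4/templates/%s.spec" % table)  # hypothetical template location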

Superp4

A script for managing p4 metadata tables across multiple servers.
•  Preferable to hand-editing 90 tables.
•  Database-driven (i.e. a list of depots).
•  Scopable by depot domain (art, tech, etc.).
•  Rollback functionality.

Superp4 example

$ cd /usr/anim/ts3
$ p4 triggers -o
Triggers:
	noHost form-out client "removeHost.py %formfile%"

$ cat fix-noHost.py
def modify(data, depot):
    return [line.replace("noHost form-out",
                         "noHost form-in")
            for line in data]

$ superp4 -table triggers -script fix-noHost.py -diff

•  Copies triggers to a restore dir.
•  Runs fix-noHost.py to produce new triggers, for each depot.
•  Shows me a diff of the above.
•  Asks for confirmation; finally, modifies triggers on each depot.
•  Tells me where the restore dir is.

Superp4 options

$ superp4 -help
  -n                         Don't actually modify data.
  -diff                      Show diffs for each depot using xdiff.
  -category category         Pick depots by category (art, tech, etc.).
  -units unit1 unit2 ...     Specify an explicit depot list (regexps allowed).
  -script script             Python file to be execfile()'d; must define a function named modify().
  -table tableType           Table to operate on (triggers, typemap, ...).
  -configFile configFile     Config file to modify (e.g. admin/values-config).
  -outDir outDir             Directory to store working files, and for restoral.
  -restoreDir restoreDir     Directory previously produced by running superp4, for when you screw up.
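A minimal sketch of the read/modify/diff/write cycle superp4 runs per depot (the depot-to-port map and helper names are hypothetical, and the real tool also handles categories, config files, and restore directories):

import difflib
import subprocess

DEPOTS = {"ts3": "ts3-p4:1666", "art": "art-p4:1666"}  # hypothetical depot -> P4PORT map

def run_p4(port, args, stdin=None):
    # Run one p4 command against one server and return its stdout as text.
    p = subprocess.Popen(["p4", "-p", port] + args, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, universal_newlines=True)
    out, _ = p.communicate(stdin)
    return out

def update_table(table, modify, dry_run=True):
    # modify(data, depot) is the user-supplied function, as in fix-noHost.py above.
    for depot, port in sorted(DEPOTS.items()):
        old = run_p4(port, [table, "-o"]).splitlines(True)
        new = modify(old, depot)
        diff = "".join(difflib.unified_diff(old, new, "old/" + depot, "new/" + depot))
        print(diff if diff else "%s: no change" % depot)
        if diff and not dry_run:
            run_p4(port, [table, "-i"], stdin="".join(new))  # write the edited spec back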

Challenges With Scaling

Gotchas

•  //spec/client filled up.
•  User-written triggers were sub-optimal.
•  "Shadow files" consumed server space.
•  Monitoring is difficult – cue templaRX and mayday.
•  Cap renderfarm ops.
•  Beware of automated tests and clueless GUIs.
•  verify can be dangerous to your health (cross-depot).

Summary

•  Perforce scales well for large amounts of binary data.
•  Virtualization = fast and cost-effective server setup.
•  Use the +S filetype and de-duping to reduce storage usage.

Q & A

Questions?