
National Aeronautics and Space Administration
Jet Propulsion Laboratory, California Institute of Technology

WSDC Hardware Status Peer Review – March 19, 2009

WSDC Hardware Architecture

Tim Conrow, Lead Engineer
Heidi Brandenburg

IPAC/Caltech


Overview

• Hardware System Architecture as presented at the Critical Design Review
• RFAs from the CDR board
• Additional tasks since CDR
• Current (and currently planned) Hardware System Architecture
• Status & Deployment Timeline
• Backup
• Resource Reporting
• Testing with v2
• Open Issues


WSDC Data Flow

[Diagram: WSDC data flow. Raw telemetry (images, H/K, maneuvers) flows from White Sands (HRP) and the MOS at JPL to Ingest at Caltech/WSDC; the SOC at UCLA and ancillary data also feed in. Ingest supplies the Scan/Frame Pipelines and the Multi-Frame Pipeline, which exchange data with the Ops Archive: Level-0/1b frames, coadds, frame index, calibration data, ancillary data, the 2MASS reference catalog, QA data, and metadata. The Ops Archive also holds the working source DB (coadd and frame/epoch), QA and image metadata, and the final products (catalogs, Atlas), which flow to the IRSA Archive. QA, FPG, SSOID, and mission status reach the Science Team and the expectant public via the web and the Ingest Server.]


CDR Hardware Plan - Overview


CDR - Cluster

• ~25 dual quad-core commodity Dell servers: 1U, 8 GB RAM, 500 GB SATA storage
• Some machines have better resources: more RAM / faster CPUs
• RHEL4
• The current frame pipeline scales with increasing cores and increasing number of cluster nodes*

* Scales as long as we can push data in and out fast enough, of course.


CDR - Operations Archive

• The operations archive hosts the WSDC software tree and functions as the persistent data store for L0, L1, and L3 products
• Accessed by pipelines to retrieve their inputs and push their (minimal) products
• It is critically important that the operations archive support the data rate required by the mission operations scenario, currently calculated as 2 Gb/s (a rough sanity check on that figure follows below)
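The 2 Gb/s figure can be sanity-checked against the daily volumes quoted later in this deck. The following is a minimal sketch, assuming the ~800 GB/day of uncompressed products moves within an ~8-hour daily processing window and that pipeline reads are roughly double the write volume; the window and the read factor are assumptions, not slide figures.

    #!/usr/bin/perl
    # Rough sanity check on the 2 Gb/s ops-archive data-rate requirement.
    # Assumptions (not slide figures): pipeline I/O is concentrated in an
    # ~8-hour daily processing window, and reads are ~2x the write volume.
    # The 800 GB/day figure is from the data-flow slide later in this deck.
    use strict;
    use warnings;

    my $daily_gb    = 800;  # uncompressed products per day (GB)
    my $window_hr   = 8;    # assumed daily processing window (hours)
    my $read_factor = 2;    # assumed read amplification over writes

    my $bits    = $daily_gb * 1e9 * 8 * (1 + $read_factor);
    my $seconds = $window_hr * 3600;
    printf "Sustained archive rate: %.2f Gb/s\n", $bits / $seconds / 1e9;
    # => ~0.67 Gb/s sustained; concurrent peaks across many cluster
    #    nodes push the requirement toward the quoted 2 Gb/s.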


CDR - Backup

• Current backup provided by IPAC Systems Group

• Future backup of operations archive occurs on WISE hardware

• Utilize hot-swappable disk for local, cycling backup

• Utilize LTO for long-term offsite backup (telemetry, expanded L0 products, archive)


RFAs

At the CDR we received two RFAs relevant to this review:

RFA #5 - Hardware architecture
  Concern: The hardware architecture is at PDR level.
  Recommended action: Suggest a delta peer-level review of the hardware architecture in mid-2009 to review the final configuration and projected system loading.

RFA #8 - Hardware system development
  Concern: The plan to construct the WSDC hardware system is not adequately concrete. The team needs a plan that tracks system resource use; that plan can then be used to construct the hardware system based on recorded usage patterns.
  Recommended action (summary): The system engineer should specify a resource allocation budget for each system module, including the Exec infrastructure. Resource allocations should include at least CPU use and memory. Other system constraints, such as the I/O required to access RAID storage, should be considered.


Additional Tasks since CDR

• 6-month mission extension (the Extended Mission)
  – Doubles the mission length to achieve a second pass over the whole sky
  – Doubles the operations archive size and doubles the backup size
• Near Earth Object identification & tracking (NEO-WISE)
  – Hardware and budgets in planning
  – Preliminary estimate anticipates a 1-in-7 increase in the number of nodes, which currently means 5 machines


Current Hardware Plan - Overview

[Diagram: WSDC network layout. On the IPAC internal network, a WISE subnet joins the operations archive and the cluster through a WISE switch, and the IPAC backup infrastructure provides backup services over NFS. Behind an internal network firewall sits the IPAC perimeter network, hosting the ingest server, webserver, test data archive, and a web application server (QA, Science Issue Tracking). A perimeter network firewall faces the outside: White Sands and the MOS deliver via FastCopy and SFTP; the SOC, MOS, and Science Team connect via HTTPS; the Science Team also uses SFTP; the public reaches the webserver via HTTP.]


Current Hardware Plan - Data Flow & Throughput

[Diagram: data flow with throughput figures.]
• White Sands / MOS → Ingest: 25 GB/day raw telemetry (images and housekeeping) in 4 downlinks
• Ingest → Cluster: 800 GB/day uncompressed
• Operations Archive: 150 TB total (7-month mission)
• IPAC Backup Infrastructure: 150 TB total (7-month mission)
• L0: 13 MB/FS → 100 GB/day
• L1: 75 MB/FS → 570 GB/day
• L3: <100 GB/day

A sanity check on the per-FS figures follows.
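The per-FS and per-day figures are mutually consistent if FS denotes a frameset taken at the survey cadence. A minimal cross-check, assuming one frameset every ~11 s — the cadence is an assumption, not a number from this slide:

    #!/usr/bin/perl
    # Cross-check of the per-FS vs. per-day volumes quoted above.
    # The frameset cadence (~11 s) is an assumption, not a slide figure.
    use strict;
    use warnings;

    my $framesets_per_day = int(86400 / 11);   # ~7854 framesets/day (assumed)
    my %mb_per_fs = (L0 => 13, L1 => 75);      # from the slide

    for my $level (sort keys %mb_per_fs) {
        printf "%s: %.0f GB/day\n",
            $level, $mb_per_fs{$level} * $framesets_per_day / 1024;
    }
    # L0: ~100 GB/day, L1: ~575 GB/day -- consistent with the quoted
    # 100 GB/day and 570 GB/day.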


Current Plan - Cluster

• Same as CDR plan, with more and better
• ~35 dual quad-core commodity Dell servers: 1U, 16 GB RAM, 500 GB SATA storage
• RHEL4, kernel 2.6.9-78.0.8
• Storage is temporary work space (headroom estimated in the sketch below)
  – Max: 130 GB/day/node
  – Min: 75 GB/day/node
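Since node storage is scratch space only, a quick look at headroom under the quoted daily rates may be useful. A rough sketch, assuming the full 500 GB disk is usable work space (ignoring OS and filesystem overhead):

    #!/usr/bin/perl
    # Scratch-space headroom per node at the quoted daily rates.
    # Assumes the whole 500 GB SATA disk is usable work space.
    use strict;
    use warnings;

    my $disk_gb    = 500;
    my %gb_per_day = (max => 130, min => 75);

    printf "%s load: %.1f days of scratch headroom\n",
        $_, $disk_gb / $gb_per_day{$_} for sort keys %gb_per_day;
    # max: ~3.8 days, min: ~6.7 days -- work space must be cleaned
    # up every few days.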


Current Plan - Operations Archive

• Same as CDR plan with more and better

• ~160 TB RAIDed storage controlled by 4 File Servers

[Diagram: four Sun X4150 fileservers, each attached to J4400/J4500 disk arrays, connected to the cluster through a Cisco 3750 network switch.]


Current Plan - Backup

• Future backup of operations archive occurs on WISE hardware

• Backup during entire mission will take place on the shared IPAC backup infrastructure, run by the IPAC Systems Group

• More information on backup in later slides


Current Plan - Perimeter Network Services

• Not covered at CDR
• Test Data Archive is an Xserve attached to Xserve RAID storage
• Other services hosted on commodity Linux servers, interchangeable with cluster nodes
• Disk backup via IPAC backup infrastructure

[Diagram: perimeter network services on the IPAC perimeter network — Ingest (FastCopy), Test Data Archive (FTP), public web and wiki (HTTP), QA Reports and Science Issue Tracking (web applications), and a mailserver. The public Internet connects over HTTPS; backup services are reached across the IPAC internal network via NFS (read only).]


Status & Timeline - Cluster & Archive

• 20 nodes to be installed upstairs week of March 23, 2009
• Extant 12 nodes (11 workers + 1 master) moved from downstairs at a later date
• 140 TB + 3 Sun X4150 fileservers to be installed upstairs by beginning of April
• Extant 20 TB of storage & fileservers moved upstairs at a later date
• All WSDC hardware installed or moved by April 1
• These deployments complete the minimum pre-launch build-out of the cluster and operations archive


Status & Timeline - Perimeter Network Services

• Ingest server, test data archive, public web pages, and internal wiki currently deployed

• Science Issue Tracking and QA hosted on the same machine, nominally of the same configuration as a cluster node, to be installed 5/1/09 to support application deployment

• Science Issue Tracking web application to be deployed 5/15/09 (WSDC project milestone)

• QA web application deployed as part of v3.0 integration, test, & delivery


Backup

• Raw telemetry archive
  – 25 GB/day spread over 4 deliveries (average)
  – Copy kept on disk at White Sands throughout mission
  – Kept on disk at IPAC throughout mission
  – 3 tape archive copies: at IPAC, elsewhere at CIT, and off-campus
  – Manual backups, once/day
  – Hardware: TBD (advice welcome)

• Options (roughly compared in the sketch below)
  – DAT: matched to daily volume; no need to rotate media
  – LTO: much used at IPAC already; need to rotate media

• Redundant units, separate from ops archive backup units
  – Roll out by July ‘09
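As a rough comparison of the two media options, the sketch below estimates how many days of telemetry fit on one tape. The DAT capacity is an assumption (DAT-72, 36 GB native); the LTO-3 figure (400 GB/tape) is quoted on the ops-archive backup slide.

    #!/usr/bin/perl
    # Days of raw telemetry (25 GB/day) per tape for each candidate medium.
    # DAT capacity is assumed (DAT-72, 36 GB native); LTO-3 capacity is
    # the 400 GB/tape figure from the next slide.
    use strict;
    use warnings;

    my $telemetry_gb_per_day = 25;
    my %native_gb = (DAT => 36, 'LTO-3' => 400);

    for my $medium (sort keys %native_gb) {
        printf "%-6s %5.1f days of telemetry per tape\n",
            $medium, $native_gb{$medium} / $telemetry_gb_per_day;
    }
    # DAT    ~1.4 days/tape -- roughly matched to the daily volume
    # LTO-3  16.0 days/tape -- media must be rotated between daily sets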


Backup

• Critical software & reference data is mirrored offsite• Ops archive backup

– ~800 GB/day uncompressed– Kept on-line throughout mission– Existing IPAC disk-to-disk backup with LTO-3 (400 GB/tape) tape

library archive– Takes about 6 hours to backup or restore one day’s products– ~4 tapes written per day: one pair stored elsewhere at CIT– No time for full tape backups; all incrementals– System commissioned by July ’09

• No node disk backup
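The numbers on this slide pin down the sustained backup rate and the tape count. A minimal check using only the slide's own figures (the 2x multiplier reflects the duplicate pair stored elsewhere at CIT):

    #!/usr/bin/perl
    # Implied backup throughput and tape count for the ops archive.
    # All inputs are from the slide above.
    use strict;
    use warnings;

    my $gb_per_day  = 800;   # uncompressed products per day
    my $hours       = 6;     # time to back up one day's products
    my $gb_per_tape = 400;   # LTO-3 native capacity

    printf "Sustained backup rate: %.0f MB/s\n",
        $gb_per_day * 1000 / ($hours * 3600);
    printf "Tapes per day (with duplicate pair): %d\n",
        2 * int($gb_per_day / $gb_per_tape + 0.999);
    # ~37 MB/s sustained; 4 tapes/day, matching the slide's figures.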


Backup

• Disaster recovery
  – Off-site “mini-cluster” able to perform minimal processing for 30 days:
    • Ingest and raw telemetry backup
    • L0 archive creation
    • Quicklook (<5% of data processed) and minimal QA
  – Should require no more than:
    • Two 8-core Dell 1950’s
    • 2 tape drives (?)
    • <10 TB disk (1 RAID-5 array)
  – Recovery process:
    • Run from mini-cluster for up to 1 month
    • Rebuild primary cluster
    • Restore old ops DB as processing is taken up by the new cluster
  – Full, detailed plan TBD


Resource Reporting

• All pipelines create log files with resource usage tags:

    >>>> END iam=>'Spawn_instruframecal', host=>'wcnode12', pid=>"20782.20820", starttime=>'09/03/10_22:17:34Z',
    >>>>+ endtime=>'09/03/10_22:17:51Z', status=>0, signal=>0, retcode=>0, upcpu=>0.43, spcpu=>0.08, uccpu=>7.56, sccpu=>3.14,
    >>>>+ totcpu=>11.21, util=>0.713, elapt=>16.260, ttime=>0, load=>0, rssk=>146500, rssmb=>143.066, rdmb=>0.000, wtmb=>0.000,
    >>>>+ rxmb=>34.166, txmb=>3.085, pkcoll=>0

• Tag info comes from standard system calls and kernel data structures (accessed via the Perl module Sys::Statistics::Linux)
• Log files are parseable with an internal API; a minimal stand-in parser is sketched after this list
• RscRpt, a Perl program, is run after pipeline runs to aggregate resource information from the logs and print a report
• RscRpt can aggregate information for an entire scan’s worth of log files, or even multiple scans
• TODO: include resource reporting in the integration & test plan
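The internal API itself is not shown in the slides; the sketch below is a minimal, hypothetical stand-in that joins the >>>>+ continuation lines onto their >>>> END record, splits the key=>value tags into a hash, and computes one RscRpt-style aggregate (mean elapt per executable). Names like parse_end_records are illustrative, not the real API.

    #!/usr/bin/perl
    # Minimal stand-in for the internal log-parsing API (illustrative).
    use strict;
    use warnings;

    sub parse_end_records {
        my ($fh) = @_;
        my (@records, $buf);
        while (my $line = <$fh>) {
            chomp $line;
            if ($line =~ s/^>>>> END\s*//) {      # a new END record starts
                push @records, { _parse_tags($buf) } if defined $buf;
                $buf = $line;
            } elsif ($line =~ s/^>>>>\+\s*//) {   # continuation line
                $buf .= ' ' . $line if defined $buf;
            }
        }
        push @records, { _parse_tags($buf) } if defined $buf;
        return @records;
    }

    sub _parse_tags {
        my ($text) = @_;
        # tags look like: key=>'value', key=>"value", or key=>number
        return $text =~ /(\w+)=>['"]?([^,'"]*)['"]?,?/g;
    }

    # Example: mean elapsed time per executable, RscRpt-style.
    my @recs = parse_end_records(\*STDIN);
    my (%sum, %n);
    for my $r (@recs) {
        $sum{ $r->{iam} } += $r->{elapt};
        $n{ $r->{iam} }++;
    }
    printf "%-24s mean-elapt %.3f s (n=%d)\n", $_, $sum{$_} / $n{$_}, $n{$_}
        for sort keys %sum;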


Testing ScanFrame in v2

• Hardware for test
  – Cluster at 1/3 size (11 nodes + master)
  – 1 fileserver attached to 1 disk array

• Data
  – 8 scans from a full-orbit data simulation
  – Each scan trimmed to 1/3 nominal size: ~84 frames

• Method
  – Launch all 8 scans concurrently
  – After the run, process log files with RscRpt


Testing ScanFrame in v2

• Results at 1/3 scale
  – Ingest (1 delivery): ~0.6 hours
  – 8-scan (delivery) run time: ~1 hour
  – Max. memory usage (1 process): ~1 GB
  – Max. network rate: ~1 Gbps (saturated ~10 min)
  – Typical network rate: ~0.1 Gbps (sustained)
  – 1 scan run time: ~1 hour (8 at a time)
  – 1 frame run time: ~6 min (88 at a time)
  – Ingest + scans for a day complete in <8 hours


Resource Usage - ScanFrame

\starttime = '09/03/06_21:03:22Z'
\endtime   = '09/03/06_22:03:34Z'
\rssk      = 1084524

| executable       | num_entries | mean-elapt | mean-util | mean-totcpu | max-rssk | max-rdmb | max-wtmb | max-rxmb | max-txmb |
| c                | i           | r          | r         | r           | r        | r        | r        | r        | r        |
| spawn_awaic      | 4064        | 6.51498622 | 0.8963647 | 6.475718504 | 411492   | 5.669    | 10760.79 | 71.732   | 99.161   |
| spawn_fpcal      | 546         | 0.22542125 | 0.845793  | 0.748791209 | 9060     | 0.512    | 82.012   | 25.646   | 45.853   |
| spawn_instrufram | 2460        | 10.6754972 | 0.7469504 | 8.628126016 | 361876   | 8.827    | 7095.656 | 86.343   | 152.068  |
| spawn_mdet       | 3074        | 7.90560735 | 0.8858051 | 7.80951529  | 284132   | 10.115   | 25543.94 | 111.143  | 125.388  |
| spawn_wsspipepos | 8           | 309.525375 | 0.692625  | 218.875     | 22944    | 0        | 0        | 293.532  | 194.51   |
| spawn_wphot      | 3072        | 17.9151452 | 0.9289518 | 16.97088216 | 497328   | 85.794   | 36396.22 | 544.397  | 583.172  |
| spawn_frameqa    | 615         | 39.2788976 | 0.8995154 | 288.1356098 | 941164   | 57.078   | 7708.197 | 376.292  | 254.023  |
| spawn_flag       | 615         | 11.2369545 | 0.9115382 | 4.757626016 | 492316   | 207.414  | 7.933    | 389.368  | 131.764  |
| spawn_sfprex     | 1217        | 4.16515612 | 0.9322769 | 3.344864421 | 5148     | 7.568    | 2762.738 | 113.971  | 107.738  |
| spawn_wssflag    | 8           | 185.953    | 0.671125  | 129.8375    | 43396    | 0        | 0        | 485.728  | 291.472  |
| spawn_spcal      | 8           | 11.917625  | 0.903625  | 2.04        | 16980    | 0        | 0        | 210.173  | 196.773  |
| spawn_get_latent | 615         | 2.7657561  | 0.9233707 | 2.672471545 | 19240    | 3.421    | 3584.339 | 49.526   | 84.722   |


Open Issues

• Unknown processing requirements for the extended mission and NEO-WISE
• Pipeline performance on real sky instead of simulations
• Administering clusters is not easy; we’d like to have better tools
• We’d like better tools for monitoring network performance


Lessons Learned

• The commercial FORTRAN compiler from Intel produces significantly faster executables for the Xeon processors than the free gfortran.

• Multiple obscure, hard-to-debug problems when using Linux as an NFS server; we’re in the process of migrating the last of our busy filesystems to Solaris servers.

• The various disk arrays have different “sweet spots” for file sizes; it’s necessary to tune the arrays for the average file size of our products. ISG has been a big help in getting our storage operating as efficiently as possible.
