Best Practices at the National Center for Atmospheric Research
Gene Harano
National Center for Atmospheric Research
Computational & Information Systems Laboratory
Scientific Computing Division
High-End Services Section
[email protected]
Outline
• NCAR Overview
• Scientific Computing Division Services
• Best Practices
• Questions
Best Practices Workshop May 11-12, 2005 Copyright© 2005 University Corporation for Atmospheric Research
National Center for Atmospheric Research
SCD Services
• Supercomputing
• Mass Storage & Data Archiving
• Networking
• Enterprise Services - 7x24x365 Operations
• Support & Consulting Services
• Scientific Data Archives and Data Portals
• Visualization, Access Grid
• Research & Development
Areas of Best Practices
• Organization
• Planning
• Vendor Relationships
• Operational Procedures
• Documentation
• Infrastructure Management
• Tools
Organization
• Dependency charts
  – Hardware (networking) & services (functional)
  – Change management
  – Crisis management
• Well-defined roles and staff expectations
  – Clear responsibilities
• Resource allocation management
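A dependency chart of the kind listed above can support change and crisis management by answering "what is affected if this component goes down?". The following is a minimal sketch, not NCAR's actual chart: the component names and the `DEPENDS_ON` table are hypothetical, chosen only to illustrate the idea.

```python
from collections import deque

# Hypothetical dependency chart: each key depends on the items in its list.
DEPENDS_ON = {
    "batch_queue": ["supercomputer", "mass_storage"],
    "supercomputer": ["core_switch", "power"],
    "mass_storage": ["core_switch", "power"],
    "data_portal": ["mass_storage"],
    "core_switch": ["power"],
    "power": [],
}


def affected_by(failed, depends_on=DEPENDS_ON):
    """Return every item that directly or indirectly depends on `failed`."""
    # Invert the chart: component -> items that depend on it.
    dependents = {name: [] for name in depends_on}
    for name, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(name)

    # Breadth-first walk over the inverted chart.
    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    print(affected_by("core_switch"))
```

Running the same query before a planned change gives a list of services to include in the downtime notification.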
Planning
• Understand user community
  – Requirements: computing/data/network
  – Reliability: major influence on infrastructure requirements
• Maintenance & evolution of benchmark suites and regression tests
  – Key NCAR models
  – Algorithmic “kernels”
  – I/O and workload tests
• Deploying systems
  – Installation planning process
• Test/checkout systems
Planning
• Master planning (strategic planning)
  – Short term/long term
• Evolution
  – Planning/flexibility
  – “Technology ooze”
  – Managing legacy systems (trimming the trailing edge)
• Coordination across entities
Vendor Relationships
• Develop strong vendor sales, management, and technical relationships
• Establish (reasonable) expectations with vendor maintenance and service personnel
• Solid relationships with extant computational, storage, and networking gear vendors
Operational Procedures
• 7x24x365 Operations Center
• User Consulting/Help desk
• Downtime notification procedures
• Change control
• Daily Bulletin, email groups, targeted email
• Status notification procedures
Operational Procedures
• Real-time system monitoring
  – Automated notification/alert systems for “out-of-nominal” situations (email, pages, etc.)
  – Human-observed (routine checks)
• On-call procedures
• Problem escalation procedures
  – In-house (contact lists, management chain …)
  – With vendors
• Training
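The automated "out-of-nominal" check mentioned above can be sketched as a comparison of current readings against a table of nominal ranges. This is an illustrative sketch only: the metric names and limits in `NOMINAL` are assumptions, and a real system would hand the resulting messages to email or a pager rather than return them.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    low: float
    high: float


# Hypothetical nominal ranges; real limits depend on the site and equipment.
NOMINAL = {
    "room_temp_c": Limit(18.0, 27.0),
    "ups_load_pct": Limit(0.0, 80.0),
    "disk_used_pct": Limit(0.0, 90.0),
}


def out_of_nominal(readings, nominal=NOMINAL):
    """Return one alert message per reading that falls outside its range."""
    alerts = []
    for name, value in sorted(readings.items()):
        limit = nominal.get(name)
        if limit is None:
            continue  # no range defined for this metric, so nothing to check
        if not (limit.low <= value <= limit.high):
            alerts.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
    return alerts


if __name__ == "__main__":
    for message in out_of_nominal({"room_temp_c": 31.5, "ups_load_pct": 40.0}):
        print(message)
```

Keeping the ranges in one table makes it easier to review them alongside the routine human checks.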
Operational Procedures
• Problem tracking and reporting system
• Physical security
  – Access control
• Resource control & management
  – Quality of Service
  – Service Level Agreements
  – Usage quotas
Operational Procedures
• Statistics gathering and evaluation
  – Uptime/downtime
  – Component/subsystem failure
  – Performance
  – Workload assessments: job attribute statistics, queue wait, system utilization
  – Trend analysis on all of the above
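Two of the statistics listed above, uptime and queue wait, reduce to simple arithmetic once outage durations and job timestamps are recorded. The sketch below assumes hypothetical inputs (outage lengths in hours, and jobs as `(submit, start)` pairs in the same time unit); it does not reflect any particular accounting system.

```python
def availability_pct(period_hours, outage_hours):
    """Percentage of the reporting period during which the system was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    downtime = sum(outage_hours)
    return 100.0 * (period_hours - downtime) / period_hours


def mean_queue_wait(jobs):
    """Average time between submission and start for (submit, start) pairs."""
    waits = [start - submit for submit, start in jobs]
    return sum(waits) / len(waits) if waits else 0.0


if __name__ == "__main__":
    # A 30-day month (720 hours) with two outages of 2.0 and 1.6 hours.
    print(availability_pct(720, [2.0, 1.6]))
    # Two jobs, with submit and start times given in minutes.
    print(mean_queue_wait([(0, 10), (5, 25)]))
```

Computing the same figures for each reporting period gives the series needed for trend analysis.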
Documentation
• Service Level Agreements (SLAs)
• Labeling standards
• Computer room maps (e.g. CAD diagrams)
• Business continuity plan
Documentation
• Emergency procedures (for system, subsystem, shutdown & recovery)
• Operational processes and procedures
• Develop a master space plan
  – Incorporate infrastructure staff early in equipment procurements
Infrastructure Management
• Power requirements
• HVAC requirements
• Cables
  – Routing
  – Protection: raceways, conduit
  – Labeling
Infrastructure Management
• Routine testing of emergency subsystems (e.g. generators, UPSs, fire detection & suppression)
• Floor layout
  – Service clearances
  – Air flow
  – Placement of future machines
• Physical security and physical access
Tools
• Access Gold Card reader system
• HP OpenView
• Easimap
• Homegrown supercomputer monitoring tools
• Remedy/Extraview trouble ticket system
Tools
• Big Brother
• Foreseer Datatrax
• AutoCAD
• Tileflow
Questions
[email protected]
www.scd.ucar.edu