Best Practices at the National Center for Atmospheric Research

Author: Daniela Fisher
Gene Harano
National Center for Atmospheric Research
Computational & Information Systems Laboratory
Scientific Computing Division
High-End Services Section
[email protected]

Outline
• NCAR Overview
• Scientific Computing Division Services
• Best Practices
• Questions

Best Practices Workshop May 11-12, 2005 Copyright© 2005 University Corporation for Atmospheric Research


National Center for Atmospheric Research


SCD Services
• Supercomputing
• Mass Storage & Data Archiving
• Networking
• Enterprise Services - 7x24x365 Operations
• Support & Consulting Services
• Scientific Data Archives and Data Portals
• Visualization, Access Grid
• Research & Development

Areas of Best Practices
• Organization
• Planning
• Vendor Relationships
• Operational procedures
• Documentation
• Infrastructure Management
• Tools

Organization
• Dependency Charts
  – hardware (networking) & services (functional)
  – Change management
  – Crisis management
• Well-defined roles and staff expectations
  – Clear responsibilities
• Resource allocations management
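
A dependency chart of the kind listed on the Organization slide can be kept in machine-readable form so that change and crisis management can ask "what is affected if this component goes down?". The sketch below is only an illustration: the component and service names are hypothetical and do not describe NCAR's actual systems.

```python
from collections import defaultdict, deque

# Hypothetical dependency chart: each key depends on the items in its list.
DEPENDS_ON = {
    "batch_scheduler": ["supercomputer", "core_switch"],
    "mass_storage": ["core_switch", "tape_library"],
    "data_portal": ["mass_storage", "web_server"],
    "web_server": ["core_switch"],
}


def impacted_by(failed, depends_on):
    """Return every item that depends, directly or indirectly, on `failed`."""
    # Invert the chart: component -> items that depend on it directly.
    dependents = defaultdict(set)
    for item, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(item)

    # Breadth-first walk over the inverted chart.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for item in dependents[current]:
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return sorted(seen)


if __name__ == "__main__":
    print(impacted_by("core_switch", DEPENDS_ON))
```

Calling `impacted_by("tape_library", DEPENDS_ON)` on this example chart lists the mass storage service and the data portal that sits on top of it, which is the kind of answer a change-review meeting needs before scheduling maintenance.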

Planning
• Understand user community
  – Requirements – computing/data/network
  – Reliability – major influence on infrastructure requirements
• Maintenance & evolution of benchmark suites and regression tests
  – key NCAR models
  – algorithmic “kernels”
  – I/O and workload tests
• Deploying Systems – installation planning process
• Test/checkout systems
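
One way to put a benchmark suite to work as a regression test is to compare each run against stored baseline timings and flag anything that slowed down by more than an agreed tolerance. The following is a minimal sketch under assumed inputs; the benchmark names, timings, and the 5% tolerance are hypothetical and are not taken from the slides.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return benchmarks whose runtime grew by more than `tolerance`.

    `baseline` and `current` map benchmark name -> runtime in seconds.
    The result maps benchmark name -> fractional slowdown.
    """
    regressions = {}
    for name, base_time in baseline.items():
        if name not in current:
            continue  # benchmark was not run this time; nothing to compare
        slowdown = (current[name] - base_time) / base_time
        if slowdown > tolerance:
            regressions[name] = round(slowdown, 3)
    return regressions


if __name__ == "__main__":
    # Hypothetical timings before and after a system change.
    baseline = {"climate_model": 100.0, "fft_kernel": 10.0, "io_test": 40.0}
    current = {"climate_model": 102.0, "fft_kernel": 12.0, "io_test": 39.0}
    print(find_regressions(baseline, current))
```

With these example numbers only `fft_kernel` is reported, because its 20% slowdown exceeds the tolerance while the model run stays within it.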

Planning
• Master Planning (strategic planning) – short term/long term
• Evolution
  – planning/flexibility
  – “technology ooze”
  – managing legacy systems (trimming the trailing edge)
• Coordination across entities

Vendor Relationships
• Develop strong vendor sales, management, and technical relationships
• Establish (reasonable) expectations with vendor maintenance and service personnel
• Solid relationships with extant computational, storage, networking gear vendors

Operational Procedures
• 7x24x365 Operations Center
• User Consulting/Help desk
• Downtime notification procedures
• Change control
• Daily Bulletin, email groups, targeted email
• Status notification procedures

Operational Procedures
• Real-time system monitoring
  – Automated notification/alert systems for “out-of-nominal” situations (email, pages, etc.)
  – Human-observed (routine checks)
• On-call procedures
• Problem escalation procedures
  – in-house (contact lists, management chain …)
  – with vendors
• Training
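
The automated alerting for "out-of-nominal" situations mentioned on the monitoring slide can be illustrated with a simple range check. This is a sketch only: the metric names and nominal ranges are hypothetical, and delivery of the alert (email, pager) is left out.

```python
# Hypothetical nominal ranges: metric name -> (lowest, highest) acceptable value.
NOMINAL_RANGES = {
    "room_temp_c": (18.0, 27.0),
    "ups_load_pct": (0.0, 80.0),
    "disk_free_pct": (10.0, 100.0),
}


def out_of_nominal(readings, ranges):
    """Return one alert message per reading that falls outside its range."""
    alerts = []
    for metric, value in readings.items():
        if metric not in ranges:
            continue  # no nominal range defined for this metric
        low, high = ranges[metric]
        if not (low <= value <= high):
            alerts.append(
                f"{metric}={value} outside nominal range [{low}, {high}]"
            )
    return alerts


if __name__ == "__main__":
    readings = {"room_temp_c": 31.5, "ups_load_pct": 42.0}
    for message in out_of_nominal(readings, NOMINAL_RANGES):
        print(message)  # a real system would send email or a page here
```

Keeping the ranges in a table separate from the checking logic lets operations staff adjust thresholds without touching code, which fits the change-control practice listed earlier.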

Operational Procedures
• Problem tracking and reporting system
• Physical security – access control
• Resource control & management
  – Quality of Service
  – Service Level Agreements
  – Usage quotas

Operational Procedures
• Statistics gathering and evaluation
  – Uptime/downtime
  – Component/subsystem failure
  – Performance
  – Workload assessments: job attributes statistics, queue wait, system utilization
  – Trend analysis on all of above
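
Two of the statistics above lend themselves to a short worked example: availability computed from uptime and downtime, and a trend estimated as the least-squares slope of a series of periodic measurements. The sketch below uses made-up numbers and is not a description of the tooling actually used at NCAR.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of the reporting period during which the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("reporting period must be longer than zero hours")
    return uptime_hours / total


def linear_trend(values):
    """Least-squares slope of `values` against their index (change per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points for a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical month: 714 hours up, 6 hours down.
    print(f"availability: {availability(714, 6):.4f}")
    # Hypothetical average queue wait (minutes) over four months.
    print(f"queue wait trend: {linear_trend([12.0, 14.0, 13.0, 17.0]):+.2f} min/month")
```

A positive slope on queue wait or a negative slope on availability is the kind of signal that feeds back into the planning and procurement practices described earlier in the talk.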

Documentation
• Service Level Agreements (SLAs)
• Labeling standards
• Computer Room maps (e.g. CAD diagrams)
• Business continuity plan

Documentation
• Emergency procedures (for system, subsystem, shutdown & recovery)
• Operational Processes and Procedures
• Develop a master space plan
  – Incorporate infrastructure staff early in equipment procurements

Infrastructure Management
• Power requirements
• HVAC requirements
• Cables
  – Routing
  – Protection – raceways, conduit
  – Labeling

Infrastructure Management
• Routine testing of emergency subsystems (e.g. generators, UPSs, fire detection & suppression)
• Floor layout
  – Service clearances
  – Air flow
  – Placement – future machines
• Physical security and physical access

Tools
• Access Gold Card reader System
• HP Openview
• Easimap
• Homegrown supercomputer monitoring tools
• Remedy/Extraview Trouble Ticket system

Tools
• Big Brother
• Foreseer
• Datatrax
• AutoCAD
• Tileflow

Questions

[email protected] www.scd.ucar.edu

