Jan 4, 2013

25 Things Every High Availability NOC Should Have

Below I've listed the top 25 things I believe every Network Operations Center (NOC) should have in order to be highly available. This applies to NOCs that are operational 24/5 or require uptime 365 days a year. I've broken the list into several sections to help you consider every aspect of what makes up a NOC.

Basics:

1. Hunt Groups and Call Forwarding
You should be able to have multiple phones all ring at once when calls come in. When no one is in the NOC (which should be very rare), calls should forward to an on-call cell number.

2. Displays
Multiple displays can help increase your NOC's overall monitoring capabilities. Typically, 4 monitors per person and at least a couple of large-screen TVs are enough to cover all of the areas you monitor. The large displays also help demonstrate your monitoring capabilities to upper management when they tour the NOC.

3. Automated Monitoring and Alerting 
Automated monitoring and alerting tools are probably the most crucial thing any NOC can have. You honestly don't have a NOC without the right monitoring visibility into your environment. There are countless free and paid tools that will do alerting for you. The two I recommend most are Intermapper and Orion NPM. Intermapper gives you the best dashboard view, and Orion gives you the most additional features.
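To make the idea of automated alerting concrete, here is a minimal sketch of a threshold check in Python. It is not how Intermapper or Orion work internally; the metric names, threshold values, and the `evaluate_readings` function are all assumptions made up for illustration.

```python
def evaluate_readings(readings, thresholds):
    """Return one alert message for each reading above its threshold.

    readings and thresholds are dicts keyed by a (hypothetical) metric name.
    Metrics without a configured threshold are ignored.
    """
    alerts = []
    for name, value in sorted(readings.items()):
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append(f"ALERT {name}: {value} exceeds threshold {limit}")
    return alerts


# Hypothetical thresholds and a single polling result.
THRESHOLDS = {"cpu_percent": 90.0, "disk_percent": 85.0}
SAMPLE_READINGS = {"cpu_percent": 97.0, "disk_percent": 40.0}

if __name__ == "__main__":
    for message in evaluate_readings(SAMPLE_READINGS, THRESHOLDS):
        print(message)
```

A real tool would poll devices on a schedule and route these messages to email or pagers; the sketch only shows the comparison step.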


Advancements:

4. Automated Reporting
Reports are one of my favorite ways to get statistics on what's really going on in an environment. The thing about reports is that they mean nothing if no one reads them. This is where the NOC can really help make the best of reporting. You only need to run reports 2-3 times a day to have a great holistic view of your environment. Reports should include not just numerical information, but also graphs.

5. Historical Analytics
An important part of maintaining and supporting any network or system is having insight into how the statistics you monitor trend over time. To get this, you will need some kind of system that records those statistics over a period of time. Data should typically be retained for at least 3 months, if not longer.
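As a rough sketch of what a 3-month retention window means in practice, the Python snippet below drops samples older than 90 days. The `(timestamp, value)` sample format, the 90-day figure as an approximation of 3 months, and the `prune_samples` name are my assumptions for illustration, not features of any particular product.

```python
from datetime import datetime, timedelta

# Assumption: "3 months" approximated as 90 days.
RETENTION = timedelta(days=90)


def prune_samples(samples, now, retention=RETENTION):
    """Keep only (timestamp, value) samples inside the retention window."""
    cutoff = now - retention
    return [(ts, value) for ts, value in samples if ts >= cutoff]


if __name__ == "__main__":
    now = datetime(2013, 1, 4)
    samples = [
        (datetime(2012, 9, 1), 41.0),   # older than 90 days, dropped
        (datetime(2012, 12, 1), 55.5),  # within the window, kept
    ]
    print(prune_samples(samples, now))
```

If you want longer trends without the storage cost, a common approach is to average older samples into coarser intervals instead of deleting them outright.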

6. Configuration Management
For network equipment and application configurations, it's crucial to be able to record and easily restore configuration files. The best way to do this is to have a configuration management system. It will store configurations over the course of months and allow for easy downloading of configuration files, or even running diffs between older and newer versions.
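Here is a minimal sketch of the diff step, assuming your configurations are saved as plain text. It uses Python's standard `difflib` module; the device configuration lines, the date labels, and the `config_diff` helper are made up for illustration.

```python
import difflib


def config_diff(old_text, new_text, old_label="old", new_label="new"):
    """Return a unified diff between two configuration snapshots."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical snapshots of the same device config taken on two dates.
OLD_CONFIG = "hostname core-sw1\ninterface Gi0/1\n description uplink\n"
NEW_CONFIG = "hostname core-sw1\ninterface Gi0/1\n description uplink-to-fw\n"

if __name__ == "__main__":
    print(config_diff(OLD_CONFIG, NEW_CONFIG, "2013-01-01", "2013-01-04"))
```

An empty result means the two snapshots are identical, which is a handy check to run after a change window to confirm nothing unexpected was touched.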


Communication Tactics for your NOC:

7. Paging System
Many people think that with the dawn of email on the phone, there is no longer any need for a paging system. However, the best way to bypass email filters during an emergency and get someone to pick up their phone is to send them a page. Most paging systems support groups, similar to distribution lists.
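Paging groups behave much like distribution lists: you page a group name and the system fans the message out to its members. The Python sketch below shows that expansion step only. The group names, pager numbers, and the `recipients_for` function are hypothetical, and real paging systems each have their own way of defining groups.

```python
# Hypothetical paging groups, similar to email distribution lists.
PAGING_GROUPS = {
    "noc": ["555-0101", "555-0102"],
    "network-tier3": ["555-0102", "555-0199"],
}


def recipients_for(group_names, groups=PAGING_GROUPS):
    """Expand group names into an ordered list of pager numbers.

    A person who belongs to several groups is only paged once.
    Unknown group names are skipped.
    """
    seen = set()
    recipients = []
    for name in group_names:
        for number in groups.get(name, []):
            if number not in seen:
                seen.add(number)
                recipients.append(number)
    return recipients


if __name__ == "__main__":
    print(recipients_for(["noc", "network-tier3"]))
```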

8. Email Distribution Lists
You want to make sure that your NOC has its own email list with the direct manager included. Beyond that, every other level of escalation should have its own email list. The email list for the NOC should receive reports, alerts, and detailed communication threads with customers and vendors. It's the single best way to make sure that everyone in your NOC is on the same page.

9. Company-wide Audible Alert System
This is an alert system that is manually triggered. The way I've seen this implemented is with loudspeakers sounding a generally monotone alarm, different from a fire alarm but distinctive in its own way. It allows the NOC to notify management and tier 3 engineers that they need to report to the NOC for escalated troubleshooting. Along with paging, it is probably the most effective way to get the attention you need.


Documentation:

10. Run Book
A run book is a living document that contains everything from the most basic procedures to troubleshooting steps. Hard copies of run books are rare nowadays; they are best preserved in an online wiki.

11. Escalation Diagram
Your NOC needs to know exactly who to escalate issues to and which teams to involve depending on the issue at hand. Keep an escalation diagram available for them to reference. It should be printed out and posted where it is visible within the work environment.

12. Escalation Contact List
Along with an Escalation Diagram, you should have an up-to-date contact list that your NOC can reference in emergencies. It should contain the cell and home numbers of the escalation contacts.

13. Service Provider and Vendor Contact List
The NOC should always have a copy of all of the vendor and service provider contacts.  This will include your sales rep in the event that the regular support lines aren't getting you anywhere.

14. Service Contracts
Your NOC needs to have access to all service contract information, serial numbers, and circuit IDs. Unless this information is kept up to date, your NOC will have a tough time calling service providers and vendors to address urgent issues.

15. Full Hardware Inventory
One of the major initiatives that a NOC should take part in is gathering information on all of the hardware within the environment. They should have serial numbers, model numbers, and detailed specs of anything installed on the hardware as well.

16. Datacenter Elevation Diagram
When a NOC utilizes remote hands at the datacenter, it often helps to have an actual elevation diagram for everything that is located there. The elevation diagram should include the cage number, cabinet numbers, front and rear views, hardware vendor, model, and serial numbers. It can be as simple as a spreadsheet or as detailed as a Visio diagram with vendor stencils.


Security Best Practices for the NOC:

17. Access cards for NOC access
Many companies, depending on the services they perform, require that their operations center be siloed off from the rest of the company. While this may seem like an annoyance to many, it is one of the most basic ways to protect your production systems from intruders. But in order for this to work at its best, the next point MUST be followed.

18. Separate Production and Corporate Networks
The best way I've ever seen this handled in a NOC was to give the operations personnel 2 computers. One was dedicated to corporate office access (e.g. email and IMs) and the other was dedicated to production access. The production PC had no access to the internet or even to internal email systems. This ensures that people who plug into an open port at a desk in some nearby cubicle cannot reach your production systems.

19. Access to the internet from the NOC needs to be limited
Without the right tools, it can be extremely difficult to understand how much garbage traffic goes over the internet from our computers. Unfortunately, not all of this traffic is just harmless spam and spyware. Many times host machines are hijacked without users even knowing it. These hijackings can give an attacker access to a seemingly secure network. This is why limiting internet access goes hand in hand with isolating your production and corporate networks. It's better for a hacker to take down your email server than for them to take down your trading server.


Policies and Procedures:

20. Policies
As vague as this may sound, policies covering security standards, standard operating procedures in the NOC, and change management, along with an up-to-date HR policy manual, make it clear to your operations personnel exactly how to conduct themselves within the NOC. Remember that your policies are only as good as your enforcement of them.

21. Shift Turnovers
Every operational shift should provide a turnover to the next shift for a full understanding of all open and even closed items that the NOC has been working on.  These items should be held over for at least 24 hours so all shifts are aware of work that has happened throughout the entire day.
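The 24-hour holdover rule is simple enough to sketch in Python: every open item stays on the turnover, and closed items stay until 24 hours after they were closed. The item fields (`title`, `closed_at`) and the `turnover_items` function are assumptions for illustration, not the format of any specific ticketing or turnover tool.

```python
from datetime import datetime, timedelta

HOLD = timedelta(hours=24)  # how long closed items stay on the turnover


def turnover_items(items, now, hold=HOLD):
    """Return items that belong on the next shift's turnover.

    Each item is a dict with 'title' and 'closed_at' (a datetime, or
    None if the item is still open).
    """
    result = []
    for item in items:
        closed_at = item["closed_at"]
        if closed_at is None or now - closed_at <= hold:
            result.append(item)
    return result


if __name__ == "__main__":
    now = datetime(2013, 1, 4, 8, 0)
    items = [
        {"title": "Circuit flapping", "closed_at": None},
        {"title": "Disk replaced", "closed_at": datetime(2013, 1, 3, 20, 0)},
        {"title": "Old outage", "closed_at": datetime(2013, 1, 1, 9, 0)},
    ]
    for item in turnover_items(items, now):
        print(item["title"])
```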

22. Ticketing System
A ticketing system will give you the ability to track old problems and incidents over the course of months or even years. The best ticketing systems will even build charts and statistics from the tickets you've created and give you a great way to track the productivity of your team.


Staffing and Support:

23. Diversifying your team
The best NOCs always have cross-training of some kind so that the team understands and can work with different systems and apps. Having your team know and understand the following areas will give you the best of all worlds: routing/switching, firewalls, Linux, Windows, server hardware, storage, virtualization, scripting, and application-specific support. These cover just about everything your NOC will need to know. You won't need expert-level understanding in your NOC. You'll only need people who have a low to mid level of understanding of the environment and their areas of specialty. You should always be able to rely on your tier 3 engineers for high-level support.

24. The right head count
The best way to calculate the number of heads needed is to look at a combination of the number of servers, network devices, employees, hours of usage for your production systems, and the quantity of customers.  I'm still working on a formula for this, so as soon as I get it together I will post it here.

25. Remote Hands or Onsite Staff for Datacenter Support
When you need fast access to hardware at the datacenter, don't skimp on datacenter support. If you can't afford to have an actual datacenter technician on site during production hours, make sure that your equipment is housed in a datacenter that provides "Hot Hands" or "Remote Hands". This is definitely a lifesaver for the HA NOC.

Thanks for reading and I hope this helps you put together your NOC.  I'm definitely interested in feedback.

-David Pagan
