Introduction
What Does “Five Nines” Mean?
Five nines signifies a system or service availability of 99.999%, which allows less than 5.26 minutes of downtime per year, whether planned or unplanned. Maintaining such high availability involves:
- Eliminating single points of failure
- Designing for reliability
- Detecting failures as they occur
However, achieving and sustaining this level of availability can be costly and resource-intensive. It often requires additional hardware purchases, leading to increased complexity in system configuration and higher risks of component failures.
| Availability | Downtime per Year |
|---|---|
| 99% | 87 hours 36 mins |
| 99.5% | 43 hours 48 mins |
| 99.95% | 4 hours 23 mins |
| 99.99% | 53 mins |
| 99.999% | 5 mins |
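The downtime figures above follow directly from the availability percentage. Below is a minimal sketch of the arithmetic, assuming a 365-day year; the function name is illustrative.

```python
# Convert an availability percentage into the allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year."""
    unavailability = 1 - availability_percent / 100
    return unavailability * MINUTES_PER_YEAR


for level in (99.0, 99.5, 99.95, 99.99, 99.999):
    print(f"{level}% -> {downtime_minutes_per_year(level):.2f} minutes of downtime per year")
```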
Environments that Require Five Nines
While sustaining high availability can be costly, certain industries require “five nines” reliability:
- Finance: For continuous trading, compliance, and customer trust.
- Healthcare: To provide uninterrupted patient care.
- Public Safety: To ensure security and services.
- Retail: For efficient supply chains and customer satisfaction.
- News Media: To deliver real-time information to the public.
Availability
Threats to Availability
The following threats pose a high risk to data and information availability:
- Unauthorized access to the primary database
- Successful DoS attack
- Significant loss of confidential data
- Outage of mission-critical application
- Compromise of Admin or root user
- Detection of a cross-site scripting attack or an unauthorized file server share
- Website defacement affecting public relations
- Severe weather like hurricanes or tornadoes
- Catastrophic events like terrorist attacks or building fires
- Long-term utility/service provider outage
- Water damage from flooding or sprinkler failure
Designing a High Availability System
High availability incorporates three major principles to achieve the goal of uninterrupted access to data and services:
- Elimination or reduction of single points of failure:
It’s crucial to address single points of failure, which can be central routers or switches, network services, or key IT staff. Any loss in these areas can severely disrupt the entire system. To mitigate this risk, it’s important to implement processes, resources, and components that reduce these single points of failure. One effective strategy is to use high availability clusters, where a group of interconnected servers shares access to the same storage and network configurations. This setup allows all servers to handle services simultaneously, appearing as a single, resilient system. If one server fails, the others seamlessly continue processing without interruption.
- System Resiliency:
System resiliency means maintaining data and operations even during attacks or disruptions. It involves redundant power and processing systems, so if one fails, the other can seamlessly take over without interruption. Resilience goes beyond just securing devices; it ensures that data and services remain available even under attack.
- Fault Tolerance:
Fault tolerance allows a system to keep working even if one or more components fail. Data mirroring is an example, where a mirrored system provides uninterrupted service by supplying requested data if a fault occurs, like in a disk controller, without the user noticing any disruption.
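As a rough illustration of the mirroring idea, the sketch below duplicates every write and transparently serves reads from the mirror when the primary copy is unavailable. The class and its fields are hypothetical, not taken from any real storage product.

```python
class MirroredStore:
    """Toy key-value store that mirrors every write to a second copy."""

    def __init__(self) -> None:
        self.primary: dict[str, bytes] = {}
        self.mirror: dict[str, bytes] = {}
        self.primary_healthy = True

    def write(self, key: str, value: bytes) -> None:
        # Duplicate the write so either copy can serve future reads.
        if self.primary_healthy:
            self.primary[key] = value
        self.mirror[key] = value

    def read(self, key: str) -> bytes:
        # Fail over to the mirror transparently if the primary is down.
        if self.primary_healthy and key in self.primary:
            return self.primary[key]
        return self.mirror[key]


store = MirroredStore()
store.write("report.txt", b"quarterly numbers")
store.primary_healthy = False            # simulate a disk controller fault
print(store.read("report.txt"))          # still served, now from the mirror
```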
Measures to Improve Availability
Asset management
Asset Identification
Asset management, including a comprehensive inventory of hardware and software, is essential for determining configuration parameters. This inventory should cover all components susceptible to security risks:
- Hardware systems
- Operating systems
- Hardware network devices
- Network device operating systems
- Software applications
- Firmware
- Language runtime environments
- Individual libraries
Organizations can opt for automated solutions to track assets. Administrators should promptly investigate any configuration change to confirm that it is authorized and that the inventory remains up to date.
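One way automated tracking can surface unauthorized configuration changes is to keep a fingerprint of each asset’s known-good configuration and compare it against what is currently deployed. The sketch below is illustrative; the inventory entries and device names are made up.

```python
import hashlib


def fingerprint(config: str) -> str:
    """Hash a configuration so changes are cheap to detect."""
    return hashlib.sha256(config.encode()).hexdigest()


# Authorized baseline captured when the assets were inventoried.
baseline = {
    "core-switch-01": fingerprint("vlan 10\nvlan 20\n"),
    "web-server-01": fingerprint("nginx 1.24, TLS 1.3 only"),
}

# Configurations as they are observed right now.
observed = {
    "core-switch-01": fingerprint("vlan 10\nvlan 20\nvlan 99\n"),  # drifted
    "web-server-01": fingerprint("nginx 1.24, TLS 1.3 only"),
}

for asset, digest in observed.items():
    if baseline.get(asset) != digest:
        print(f"Investigate {asset}: configuration differs from the baseline")
```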
Asset Classification
Asset classification groups an organization’s resources based on shared traits. It should be applied to documents, data records, files, and disks. Critical information requires the highest protection, possibly needing special handling. An organization can use a labeling system based on information value, sensitivity, and criticality. To identify and classify assets:
- Define asset categories.
- Assign owners to all assets and software.
- Set classification criteria.
- Implement the classification system.
Asset Standardization
Asset management oversees the lifecycle and inventory of technology assets such as devices and software. In an IT asset management system, organizations define the IT assets that are acceptable for meeting their goals, which reduces asset diversity; for instance, only compliant applications are installed, and eliminating non-compliant applications enhances security. Asset standards specify the hardware and software products the organization supports. Standardization also enables quick action during failures, preserving both access and security: without standardized hardware, personnel may struggle to find replacements, and maintaining a non-standard environment requires more expertise and increases cost.
Risk Analysis
Risk analysis evaluates the threats that natural and human-caused events pose to an organization’s assets. Asset identification aids in deciding which assets to safeguard. Risk analysis has four key goals:
- Identify assets and their value
- Identify vulnerabilities and threats
- Quantify the probability and impact of the identified threats
- Balance the impact of the threat against the cost of the countermeasure
There are two approaches to risk analysis:
- Quantitative Risk Analysis:
It is a method that assigns numerical values to the elements of the risk analysis process. It involves calculating the potential financial impact of a risk from factors such as asset value, exposure factor (EF), annualized rate of occurrence (ARO), and annual loss expectancy (ALE). This approach provides management with concrete data for informed decisions about risk mitigation and resource allocation (a worked example follows this list).
- Qualitative Risk Analysis:
It is a method that relies on subjective assessments and scenarios to evaluate risks. It involves a team evaluating each threat based on its likelihood and impact, usually plotted on a table. The results are used as a guide for decision-making, often focusing on threats within a specified risk zone. Unlike quantitative analysis, the numerical values in qualitative analysis are not directly proportional and are more subjective in nature.
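As a worked example of the quantitative approach, the sketch below applies the standard formulas SLE = asset value × EF and ALE = SLE × ARO, where SLE (single loss expectancy) is the intermediate per-incident loss; all figures are invented for illustration.

```python
# Illustrative inputs for a single threat against one asset.
asset_value = 250_000                 # value of a customer database, in dollars
exposure_factor = 0.40                # fraction of the asset's value lost per incident
annualized_rate_of_occurrence = 0.5   # expected incidents per year (one every two years)

single_loss_expectancy = asset_value * exposure_factor                            # $100,000
annual_loss_expectancy = single_loss_expectancy * annualized_rate_of_occurrence   # $50,000

print(f"SLE = ${single_loss_expectancy:,.0f}")
print(f"ALE = ${annual_loss_expectancy:,.0f} per year")
```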
Mitigation
Mitigation aims to lessen the impact or likelihood of a loss. Technical controls like authentication systems, file permissions, and firewalls are examples. It’s crucial to balance the potential negative impact of mitigation measures with the benefits of risk reduction. There are four common ways to reduce risk:
- Accept the risk and periodically re-assess
- Reduce the risk by implementing controls
- Avoid the risk by totally changing the approach
- Transfer the risk to a third party
In the short term, one strategy is to accept the risk and create contingency plans. Risk acceptance is a common practice for individuals and organizations. Modern methods reduce risk by incremental software development and regular updates to address vulnerabilities. Risk transfer includes outsourcing, purchasing insurance, or maintenance contracts. Hiring specialists for critical tasks can reduce risk effectively. A good risk mitigation plan can include two or more strategies.
Defense in Depth
Defense in depth, while not foolproof, helps organizations stay ahead of cyber threats by minimizing risk. Relying on a single defense leaves data vulnerable, so multiple layers of protection are crucial.
The following terms describe the practices that make up a defense-in-depth strategy, each covered in the sections below:
- Layering
- Limiting
- Diversity
- Obscurity
- Simplicity
Layering
A layered approach offers comprehensive security, as attackers must breach each layer, which becomes increasingly complex. Layering involves coordinating multiple defenses, like storing sensitive data in a server within a secured facility.
Limiting
Limiting data access minimizes threats. Access should be restricted to what’s necessary for each user’s role. For instance, the marketing team doesn’t need payroll access. Technology like file permissions helps, but procedural measures are also vital. Employees should be barred from taking sensitive documents off-site.
Diversity
For effective protection, layers of security must vary. If all layers are the same, breaching one means breaching all. Diverse layers make it harder for attackers because each layer uses different techniques; even if one layer is breached, the remaining layers still protect the system. To achieve diversity, organizations can use products from different companies for multifactor authentication. For instance, a server with sensitive data might require a swipe card from one company and biometric authentication from another.
Obscurity
Obscuring information safeguards data. Organizations should avoid disclosing details like server operating system versions or equipment types that cybercriminals could exploit. Error messages should also be generic to prevent revealing vulnerabilities. This concealment makes it harder for attackers to target a system.
Simplicity
Complexity doesn’t always ensure security. Overly complex systems can be hard to manage and may even increase vulnerability. If employees struggle to configure them correctly, they can become easy targets for cybercriminals. A good security solution is simple internally but presents a complex front to deter attacks.
Redundancy
Single Points of Failure
A single point of failure is a crucial part of an organization’s operations that, if it fails, can halt other operations dependent on it. This can be hardware, a process, data, or a utility. Single points of failure are weak links that disrupt operations. The solution is to modify the critical operation to remove reliance on a single element or to introduce redundant components that can take over if a point fails.
N+1 Redundancy
N+1 redundancy is a system design principle where critical components (N) have at least one backup component (+1) to ensure system availability in case of failure. For example, in a data center, N+1 redundancy means the system can continue operating if one of its components, such as servers, power supplies, switches, or routers, fails. The +1 represents the additional backup component or system ready to take over if needed. While N+1 redundancy provides backup components, it does not create a fully redundant system.
RAID
RAID (Redundant Array of Independent Disks) combines multiple physical hard drives into a single logical unit for data redundancy and performance improvement. It spreads data across drives, so if one disk fails, data can be recovered from the others. RAID also speeds up data retrieval by using multiple drives. There are hardware-based and software-based RAID solutions, with the former requiring a specialized hardware controller. RAID uses different methods to store data across disks:
- Parity: Detects data errors and provides the information needed to rebuild a lost drive (see the sketch after this list).
- Striping: Writes data across multiple drives.
- Mirroring: Stores duplicate data on a second drive.
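The parity method can be illustrated with XOR, which is how many RAID levels compute parity: the parity stripe is the XOR of the data stripes, so any single lost stripe can be rebuilt from the survivors. The sketch below uses short byte strings purely for illustration; real controllers work on disk blocks.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


stripe_1 = b"DATA"                        # written to disk 1
stripe_2 = b"MORE"                        # written to disk 2
parity = xor_bytes(stripe_1, stripe_2)    # written to disk 3

# Disk 1 fails: rebuild its stripe from the surviving stripe and the parity.
rebuilt = xor_bytes(stripe_2, parity)
assert rebuilt == stripe_1
print(rebuilt)  # b'DATA'
```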
Spanning Tree
Spanning Tree Protocol (STP) is a network protocol that prevents loops in a network’s topology when switches are interconnected via multiple paths. Its primary function is to ensure that redundant physical links between switches are loop-free, allowing only one logical path between all network destinations. STP achieves this by intentionally blocking redundant paths that could cause loops, while still maintaining them for redundancy. When a network cable or switch fails, STP recalculates the paths and unblocks the necessary ports to activate the redundant path, ensuring network availability.
Router Redundancy
The default gateway, usually a router, provides access to the network or the Internet. Relying on a single router as the default gateway poses a single point of failure. To mitigate this, organizations can set up a standby router. In a redundancy setup, routers use a protocol to determine which one forwards traffic. Each router has a physical and a virtual IP address; end devices use the virtual IP as the default gateway. Routers exchange periodic messages using their physical IPs to check availability. If the standby router stops receiving these messages from the forwarding router, it takes over. This ability to recover from gateway failures is called first-hop redundancy.
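A rough sketch of the takeover logic just described: the standby router tracks hello messages from the forwarding router and claims the virtual IP when the hold time expires. The class, timer value, and method names are illustrative and do not follow any specific first-hop redundancy protocol.

```python
import time

HOLD_TIME = 10   # seconds of silence after which the standby assumes the gateway role


class StandbyRouter:
    def __init__(self) -> None:
        self.last_hello = time.monotonic()
        self.owns_virtual_ip = False

    def receive_hello(self) -> None:
        # Called whenever a hello message arrives from the forwarding router.
        self.last_hello = time.monotonic()

    def check_peer(self) -> None:
        # If the forwarding router has gone quiet, take over the virtual IP
        # so end devices keep using the same default gateway address.
        if time.monotonic() - self.last_hello > HOLD_TIME:
            self.owns_virtual_ip = True


standby = StandbyRouter()
standby.last_hello -= HOLD_TIME + 1   # simulate missed hellos from the peer
standby.check_peer()
print("standby owns the virtual IP:", standby.owns_virtual_ip)  # True
```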
The list below outlines router redundancy options based on the communication protocol between network devices.
- Hot Standby Router Protocol (HSRP): HSRP ensures high network availability by providing first-hop routing redundancy. A group of routers uses HSRP to select an active device and a standby device. The active device routes packets, while the standby device takes over if the active one fails.
- Virtual Router Redundancy Protocol (VRRP): VRRP routers run the VRRP protocol with one or more routers on a LAN. In this setup, one router is elected as the virtual router master, and the others act as backups in case the master fails.
- Gateway Load Balancing Protocol (GLBP): GLBP protects data traffic from a failed router or circuit like HSRP and VRRP do, while also enabling load balancing between a group of redundant routers.
Location Redundancy
An organization may need to consider location redundancy depending on its needs. The following outlines three forms of location redundancy.
Synchronous Replication:
- Real-time synchronization.
- Requires high bandwidth.
- Locations must be close to reduce latency.
Asynchronous Replication:
- Not real-time synchronized but close.
- Requires less bandwidth.
- Sites can be further apart due to reduced latency concerns.
Point-in-time Replication:
- Periodic updates for backup data.
- Most bandwidth conservative, no constant connection needed.
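The sketch below contrasts the first two forms: a synchronous write is acknowledged only after both sites hold the data, while an asynchronous write is acknowledged locally and shipped to the remote site shortly afterwards. The in-memory "sites" and queue are stand-ins for real storage arrays and replication links.

```python
from collections import deque

local_site: dict[str, str] = {}
remote_site: dict[str, str] = {}
replication_queue: deque[tuple[str, str]] = deque()


def synchronous_write(key: str, value: str) -> None:
    # Acknowledged only after both sites hold the data (needs low latency).
    local_site[key] = value
    remote_site[key] = value


def asynchronous_write(key: str, value: str) -> None:
    # Acknowledged as soon as the local copy exists; the remote copy lags.
    local_site[key] = value
    replication_queue.append((key, value))


def drain_replication_queue() -> None:
    # Runs continuously in the background in a real system; called explicitly here.
    while replication_queue:
        key, value = replication_queue.popleft()
        remote_site[key] = value


synchronous_write("order-1001", "confirmed")
asynchronous_write("order-1002", "pending")
drain_replication_queue()
print(remote_site)  # both keys present once replication has caught up
```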
System Resilience
Resilient Design
Resiliency involves methods and configurations to tolerate system or network failures. For instance, redundant links in a network using STP provide alternate paths in case of link failures, but switchover may not be immediate without optimal configuration. Routing protocols also enhance resiliency, with fine-tuning improving switchover times. Testing non-default settings in a controlled environment can help optimize network recovery. Resilient design goes beyond redundancy, requiring an understanding of business needs to create a truly resilient network.
Application Resilience
Application resilience refers to an application’s ability to function despite component issues. Downtime can result from application errors, infrastructure failures, or planned maintenance. Achieving high availability in applications involves balancing infrastructure costs with potential business losses due to failures. The complexity and cost of solutions for application resilience increase with higher availability factors.
IOS Resilience
The Internetwork Operating System (IOS) for Cisco routers and switches includes a resilient configuration feature for faster recovery from malicious or unintentional data loss. It maintains a secure working copy of the IOS image file and the running configuration and prevents their removal. These secure files, known as the primary bootset, ensure system integrity.
On devices that support this feature, the global configuration command `secure boot-image` is used to protect the Cisco IOS image file from unauthorized modification or deletion, and the companion command `secure boot-config` secures an archive of the running configuration.
Incident Response
Incident Response Phases
Preparation
Incident response involves an organization’s procedures following an event outside the normal range, such as a data breach that exposes sensitive information to an untrusted environment. This breach can result from accidental or intentional acts. To address incidents, organizations need an incident response plan and a Computer Security Incident Response Team (CSIRT) to manage the response. The team performs the following functions:
- Maintains the incident response plan
- Ensures its members are knowledgeable about the plan
- Tests the plan
- Gets management’s approval of the plan

The CSIRT can be a formal team within the organization or an ad hoc one. It follows predefined steps to ensure a uniform approach and completeness. National CSIRTs handle incident response at a country level.
Detection and Analysis
Detection begins when an incident is discovered. While organizations may invest in advanced detection systems, their effectiveness relies on administrators reviewing logs and monitoring alerts. Proper detection involves understanding the incident’s cause, the data and systems affected, and promptly notifying senior management and relevant managers for remediation. Detection and analysis includes the following:
- Alerts and notifications
- Monitoring and follow-up

Incident analysis identifies the source, extent, impact, and details of a data breach. Depending on the situation, the organization may decide to bring in forensic experts for further investigation.
Containment, Eradication, and Recovery
Containment involves immediate actions like disconnecting systems to stop information leaks. After identifying a breach, the organization must contain and eradicate it, which may require additional system downtime. Recovery involves resolving the breach and restoring systems to their pre-breach state.
Post-Incident Follow-Up
After returning to normal operations, the organization should investigate the incident by asking:
- What actions will prevent the incident from reoccurring?
- What preventive measures need strengthening?
- How can it improve system monitoring?
- How can it minimize downtime during the containment, eradication, and recovery phases?
- How can management minimize the impact to the business?

Reviewing lessons learned can help the organization improve its incident response plan.
Incident Response Technologies
Network Admission Control
Network Admission Control (NAC) ensures that only authorized users with compliant systems can access the network. It evaluates incoming devices against network policies, quarantines non-compliant systems, and manages their remediation. This can be achieved using existing network infrastructure and third-party software, or through a dedicated NAC appliance that controls access, evaluates compliance, and enforces security policies for all endpoints. Common NAC system checks include the following (a simplified posture-check sketch follows the list):
- Updated virus detection
- Operating system patches and updates
- Complex password enforcement
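A simplified sketch of how a NAC system might apply these checks before admitting an endpoint; the Endpoint fields and thresholds are hypothetical, not taken from any particular NAC product.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    av_signature_age_days: int   # how stale the virus definitions are
    os_patches_current: bool     # operating system patches and updates applied
    password_length: int         # stand-in for complex password enforcement


def admit(endpoint: Endpoint) -> bool:
    """Return True if the endpoint meets policy, False to quarantine it."""
    return (
        endpoint.av_signature_age_days <= 7
        and endpoint.os_patches_current
        and endpoint.password_length >= 12
    )


laptop = Endpoint(av_signature_age_days=30, os_patches_current=True, password_length=8)
print("admit to network:", admit(laptop))  # False -> quarantine and remediate
```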
Intrusion Detection Systems
Intrusion Detection Systems (IDSs) passively monitor network traffic by copying it for analysis. Working offline, an IDS compares the captured traffic with known malicious signatures, similar to how antivirus software checks for viruses. Working offline means several things:
- IDS works passively
- IDS device is physically positioned in the network so that traffic must be mirrored in order to reach it
- Network traffic does not pass through the IDS unless it is mirrored

In passive mode, an IDS monitors and reports on traffic without taking any action. This is known as operating in promiscuous mode. Operating with a copy of the traffic allows the IDS to monitor without affecting the packet flow, but it cannot stop single-packet attacks. An IDS often needs assistance from routers and firewalls to respond to attacks. A more effective solution is to use an Intrusion Prevention System (IPS) that can detect and stop attacks in real time.
Intrusion Prevention Systems
An Intrusion Prevention System (IPS) operates in inline mode, meaning all traffic passes through it for analysis. It can immediately detect and address network issues, including sophisticated attacks, by analyzing packet contents and payloads. Unlike an Intrusion Detection System (IDS), an IPS does not allow malicious traffic to pass through. However, a misconfigured IPS can disrupt normal traffic flow.
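To make the IDS/IPS distinction concrete, the sketch below applies the same signature check in two ways: against a mirrored copy of traffic (IDS, alert only) and inline (IPS, where a match prevents the packet from being forwarded). The signatures and packet are purely illustrative.

```python
SIGNATURES = [b"' OR 1=1 --", b"<script>"]   # toy attack signatures


def matches_signature(payload: bytes) -> bool:
    return any(signature in payload for signature in SIGNATURES)


def ids_inspect(mirrored_payload: bytes) -> None:
    # Passive: the original packet was already forwarded; we can only alert.
    if matches_signature(mirrored_payload):
        print("IDS alert: suspicious payload observed")


def ips_inspect(payload: bytes) -> bytes | None:
    # Inline: returning None drops the packet before it is forwarded.
    if matches_signature(payload):
        print("IPS: packet dropped")
        return None
    return payload


packet = b"GET /search?q=<script>alert(1)</script>"
ids_inspect(packet)          # alert only; the traffic already went through
print(ips_inspect(packet))   # None -> blocked in real time
```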
NetFlow and IPFIX
NetFlow is a Cisco technology that gathers packet statistics from routers and switches and has become a de facto standard for collecting network operational data. IP Flow Information Export (IPFIX) is based on NetFlow Version 9 and is used to export traffic flow information from routers to data collection devices. This data helps network managers and applications that support the protocol optimize network performance, and applications that support IPFIX can display statistics from any router that uses the standard. Collecting, storing, and analyzing data from IPFIX-supported devices offers several benefits:
- Secures the network against internal and external threats
- Troubleshoots network failures quickly and precisely
- Analyzes network flows for capacity planning
Advanced Threat Intelligence
With the right information, advanced threat intelligence can help organizations detect cyberattacks at various stages, sometimes even before they occur. Organizations can watch their logs and system reports for attack indicators such as the following security alerts:
- Account lockouts
- All database events
- Asset creation and deletion
- Configuration modifications to systems

Advanced threat intelligence, consisting of event or profile data, enhances security monitoring and response. Understanding malware tactics is crucial as cybercriminals become more sophisticated. Improved visibility into attack methods allows organizations to respond faster to incidents.
Disaster Recovery
Disaster Recovery Planning
Types of Disasters
Maintaining organizational function during a disaster is critical. Disasters encompass natural or human-caused events that damage assets and hinder operations.
Natural Disasters
Natural disasters vary by location and can be challenging to predict. They generally fall into the following categories:
- Geological disasters: earthquakes, landslides, volcanoes, tsunamis
- Meteorological disasters: hurricanes, tornadoes, snow storms, lightning, hail
- Health disasters: widespread illnesses, quarantines, pandemics
- Miscellaneous disasters: fires, floods, solar storms, avalanches
Human-caused Disasters
Human-caused disasters involve people or organizations and fall into the following categories:
- Labor events: strikes, walkouts, slowdowns
- Social-political events: vandalism, blockades, protests, sabotage, terrorism, war
- Materials events: hazardous spills, fires
- Utilities disruptions: power failures, communication outages, fuel shortages, radioactive fallout
Disaster Recovery Plan
During an ongoing disaster, the organization implements its Disaster Recovery Plan (DRP) to swiftly restore critical systems, including assessing, salvaging, repairing, and restoring damaged facilities and assets. To create the DRP, answer the following questions:
- Who is responsible for this process?
- What does the individual need to perform the process?
- Where does the individual perform this process?
- What is the process?
- Why is the process critical?

A DRP must prioritize critical processes within the organization. When recovering, the organization focuses on restoring its mission-critical systems first.
Implementing Disaster Recovery Controls
Disaster recovery controls reduce the impact of a disaster, ensuring resources and business processes can quickly resume operations. There are three types of IT disaster recovery controls:
- Preventive measures aim to keep disasters from occurring by identifying risks in advance.
- Detective measures discover unwanted events, uncovering new threats.
- Corrective measures restore systems after disasters or events.
Business Continuity Planning
Need for Business Continuity
Business continuity is crucial in computer security. While companies strive to prevent disasters and data loss, it’s impossible to predict every scenario, so having a business continuity plan is vital. This plan, broader than a DRP, involves relocating critical systems to another site while the original facility is repaired, with personnel adapting to alternative methods until normal operations resume. Availability ensures that the resources the organization needs remain accessible to personnel and systems.
Business Continuity Considerations
Business continuity controls are more than data backups and redundant hardware. They also rely on properly trained employees who can configure and operate systems effectively; backed-up data is of little use until people and processes turn it back into usable information and working services. An organization should look at the following:
- Getting the right people to the right places
- Documenting configurations
- Establishing alternate communications channels for both voice and data
- Providing power
- Identifying all dependencies for applications and processes so that they are properly understood
- Understanding how to carry out automated tasks manually
Business Continuity Best Practices
The National Institute of Standards and Technology (NIST) developed the following best practices:
- Write a policy that provides guidance for developing the business continuity plan and assigns roles for task execution.
- Identify critical systems and processes and prioritize them based on necessity.
- Identify vulnerabilities, threats, and calculate risks.
- Identify and implement controls and countermeasures to reduce risk.
- Devise methods to quickly restore critical systems.
- Write procedures to maintain organizational functionality during chaos.
- Test the plan.
- Regularly update the plan.