Top 10 must-haves for data center resiliency: checklist (part 2)

Chris Kapusta
September 9, 2024

Updated: June 30, 2025

Robust, resilient data infrastructure is key to keeping your organization secure and avoiding the challenges that arise from data breaches or loss. But it isn’t just a risk mitigation strategy — a well-architected and well-maintained data center empowers your organization to move quickly, serve customers well, streamline processes, and keep your teams focused on the tasks that move the needle.

In part two of our cyber resiliency blog series, we cover five more ways to secure your data center against threats. As a refresher, the previous installment covered network infrastructure; physical security; power, cooling, and fire suppression; cybersecurity/ransomware protection; and data backup and recovery. Read on to see how checklist items 6 through 10 can help you build cyber resiliency.

6. Redundancy and failover

We touched on this briefly in the network section of part one, but both redundancy and failover are key design elements that help to prevent downtime and improve availability. Redundancy is the set of duplicate paths or components that keeps services running should a certain node fail. Failover is the programmed mechanism that switches from the failed node to a working, redundant one.
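As a rough illustration of that distinction, the Python sketch below models redundancy as a prioritized list of interchangeable nodes and failover as the logic that picks the first healthy one. The `Node` class, its `healthy` flag, and the node names are hypothetical. A real deployment would rely on health probes and the failover features of its load balancer or cluster manager rather than hand-written code like this.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A hypothetical server or network node with a health flag."""
    name: str
    healthy: bool = True


def select_active(nodes: Sequence[Node]) -> Optional[Node]:
    """Failover logic: return the first healthy node in priority order.

    The list of nodes represents redundancy; this function represents
    failover. Returns None when every node is down.
    """
    for node in nodes:
        if node.healthy:
            return node
    return None


if __name__ == "__main__":
    pool = [Node("primary"), Node("standby-a"), Node("standby-b")]
    print(select_active(pool).name)  # the primary is healthy, so it is selected
    pool[0].healthy = False          # simulate a primary failure
    print(select_active(pool).name)  # selection moves to the next healthy node
```

Without the redundant entries in `pool`, the failover function has nothing to switch to, which is why the two concepts are planned together.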

Server redundancy: Having multiple servers with identical configurations helps to ensure critical applications are available should the primary server fail. Not only does this strategy help to support the business in the event of a server failure, but it also provides the opportunity to distribute workloads across multiple servers for better performance and business continuity.

Storage redundancy: One option for storage redundancy is a redundant array of independent disks (RAID), which is available in several configurations. The RAID configuration that’s right for your storage system depends on whether your organization needs more emphasis on speed, redundancy, or both. Other types of storage replication include zone-redundant storage, geo-redundant storage, and object replication.

Geographical redundancy: Depending on your business, having only one data center may be equivalent to putting all your eggs in one basket. Taking advantage of backup servers and data centers in different geographical locations can help support your critical applications and keep the business running if or when one location becomes compromised.
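To make the RAID trade-off mentioned under storage redundancy more concrete, here is a minimal Python sketch that compares usable capacity and guaranteed disk-failure tolerance for a few common RAID levels. The function name `raid_summary` and the returned keys are illustrative. The sketch assumes identical disks and the standard definitions of each level, and it ignores controller-specific behavior, hot spares, and rebuild times.

```python
def raid_summary(level: str, disks: int, disk_tb: float) -> dict:
    """Return usable capacity (TB) and guaranteed disk-failure tolerance.

    "Guaranteed" means the number of disk failures the array survives
    in the worst case, assuming identical disks.
    """
    if level == "RAID 0":     # striping only: full capacity, no redundancy
        min_disks, usable, tolerance = 2, disks * disk_tb, 0
    elif level == "RAID 1":   # mirroring: capacity of a single disk
        min_disks, usable, tolerance = 2, disk_tb, disks - 1
    elif level == "RAID 5":   # single parity: one disk of capacity lost
        min_disks, usable, tolerance = 3, (disks - 1) * disk_tb, 1
    elif level == "RAID 6":   # double parity: two disks of capacity lost
        min_disks, usable, tolerance = 4, (disks - 2) * disk_tb, 2
    elif level == "RAID 10":  # striped two-way mirrors: half the capacity
        if disks % 2:
            raise ValueError("RAID 10 needs an even number of disks")
        min_disks, usable, tolerance = 4, disks / 2 * disk_tb, 1
    else:
        raise ValueError(f"unsupported level: {level}")
    if disks < min_disks:
        raise ValueError(f"{level} needs at least {min_disks} disks")
    return {"level": level, "usable_tb": usable, "tolerates_failures": tolerance}


if __name__ == "__main__":
    for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 6", "RAID 10"):
        print(raid_summary(level, disks=4, disk_tb=4.0))
```

Running the comparison with the same disk count shows the pattern the checklist item describes: levels that favor speed or capacity tolerate fewer failures, while levels that tolerate more failures give up usable space.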

7. Monitoring and alerting systems

Real-time monitoring and alerts are key to detecting anomalies and risks in your data center environment. A resilient data center strategy will incorporate several advanced systems to stay ahead of potential risks.

AI detection: Using AI to monitor environmental conditions and system security can help you spot unusual patterns, interpret them, and understand how your environments vary over time.

Environmental monitoring: Even without AI, environmental monitoring is important. It helps ensure temperature, humidity, and other environmental factors stay within safe parameters, and that risk factors are addressed quickly when they are detected.

System alerts: Enable real-time alerts to help your teams stay aware of and address any power, network, or hardware failures as soon as they happen.

Logging and auditing: Track anomalies with consistent logs and audits to stay abreast of user activity and enable security teams to look into breaches and compliance as needed.
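The environmental monitoring, alerting, and logging items above can be sketched together in a few lines of Python. This is a simplified illustration, not a monitoring product: the metric names, the `SAFE_RANGES` limits, and the logger name are assumptions chosen for the example, and real limits should come from your equipment specifications and facility guidelines.

```python
import logging
from typing import Dict, List, Tuple

# Illustrative limits only (assumptions, not recommendations).
SAFE_RANGES: Dict[str, Tuple[float, float]] = {
    "temperature_c": (18.0, 27.0),
    "humidity_pct": (20.0, 80.0),
}

logger = logging.getLogger("datacenter.monitoring")  # hypothetical logger name


def check_readings(readings: Dict[str, float]) -> List[str]:
    """Compare sensor readings to safe ranges and return alert messages.

    A missing reading is treated as an alert, because a silent sensor
    is itself a risk. Every alert is also written to the log so there
    is an audit trail.
    """
    alerts: List[str] = []
    for metric, (low, high) in SAFE_RANGES.items():
        value = readings.get(metric)
        if value is None:
            alerts.append(f"{metric}: no reading received")
        elif not low <= value <= high:
            alerts.append(f"{metric}: {value} outside safe range {low}-{high}")
    for message in alerts:
        logger.warning(message)
    return alerts


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(check_readings({"temperature_c": 31.5, "humidity_pct": 45.0}))
```

In practice the returned alerts would be routed to an on-call or ticketing system, and the log records retained for later audits.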

8. Compliance and documentation

Managing data center risk and resilience isn’t just about mitigating cyberthreats and accidents. Managing compliance is necessary and often complex, with different regulatory standards set depending on location, industry, and other factors. And noncompliance is a risk in and of itself, leading to potential fines, loss of trust, disruption of operations, and more.

Compliance audits: Every industry has its own set of standards to follow (ISO, GDPR, HIPAA, etc.). Ensure your workforce understands yours and that your team is regularly reviewing adherence.

Documentation: Keep documentation up to date, including architectural diagrams, contacts, emergency and escalation procedures, and infrastructure build documentation. In a rebuild scenario, you’ll be glad you did.

Training and awareness: All staff should participate in regular and required training as it pertains to security, access, emergency protocols, phishing, social engineering, etc. Insider threats play a large role in data loss and are often accidental. These can be mitigated with proper training.

9. Vendor and third-party management

Vendors, partners, and other third parties may require access to your organization’s infrastructure and/or data. Ensuring all third parties are carefully vetted and have access only to what is necessary can help save your organization a headache down the road.

Service-level agreements (SLAs) and contracts: Make sure your SLAs with vendors spell out requirements and access protocols and meet your organization’s standards for resiliency.

Third-party reviews: Rely on trusted resources and independent reviews to regularly assess the resilience and risk mitigation practices of the third-party vendors you work with.

Third-party appliances/systems: Ensure your team understands the security practices of the third-party appliances running within your organization’s data center. Staying aware of third-party update timing and other practices can help prevent gaps that could turn into breach points.

10. Regular testing and drills

Testing is a sore spot for many organizations. “Of course, we have a plan in place, but who has time for testing?” Overburdened security and operations teams may struggle to make time to regularly review, test, and practice security protocols. But testing is the only way to identify previously unseen flaws or gaps in your plan and to gain a realistic picture of how long response and recovery actually take. Ensure your team performs the following regularly:

Disaster recovery drills: Your disaster recovery plan depends on timing, and drilling is the only way to improve and maintain emergency response times.

Failover and redundancy testing: Ensure your systems are free of flaws that would prevent proper backup performance by regularly testing your redundancy and failover mechanisms.

Security breach simulations: Penetration tests and breach simulations are a core part of a strong security program and resilience strategy. These will help your teams identify vulnerabilities and address weaknesses.
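One way to act on the point about timing is to record how long each system took to recover during a drill and compare that with a recovery time target. The Python sketch below assumes your organization defines such a target per system (often called a recovery time objective, or RTO). The function name `drill_gaps`, the system names, and the durations are hypothetical.

```python
from datetime import timedelta
from typing import Dict


def drill_gaps(
    measured: Dict[str, timedelta],
    targets: Dict[str, timedelta],
) -> Dict[str, timedelta]:
    """Map each system that missed its target to the time it ran over.

    Systems with a target but no measurement were not exercised in
    this drill and are skipped here; they deserve their own follow-up.
    """
    gaps: Dict[str, timedelta] = {}
    for system, target in targets.items():
        actual = measured.get(system)
        if actual is not None and actual > target:
            gaps[system] = actual - target
    return gaps


if __name__ == "__main__":
    targets = {
        "database": timedelta(minutes=60),
        "web": timedelta(minutes=30),
        "email": timedelta(minutes=120),
    }
    measured = {
        "database": timedelta(minutes=95),
        "web": timedelta(minutes=20),
    }
    for system, overrun in drill_gaps(measured, targets).items():
        print(f"{system} exceeded its target by {overrun}")
```

Tracking these overruns from drill to drill gives a simple, measurable view of whether response and recovery times are improving.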

Creating a resilient, responsive data center infrastructure is not a simple task. But it doesn’t need to be overwhelming. For support in identifying and filling potential gaps in your data center resiliency, consider scheduling a complimentary AI infrastructure workshop with GDT experts.

And, if you missed “Top 10 Data Center Resiliency Checklist Must-Haves: Part 1,” you can access it now here.

Author

Chris Kapusta is the director of Advisory and Transformation at GDT, where he is responsible for the Data Center, Hybrid Cloud, and AI Practice. Chris and his team help clients with their digital transformation journey, engaging on projects involving data center and infrastructure modernization, developing hybrid cloud and multi-cloud strategies, and helping with AI initiatives. Chris has more than 25 years of experience in areas ranging from application development, data center infrastructure architecture, and performance management to, more recently, cloud and AI technologies. In his spare time, Chris enjoys hiking and downhill skiing with his family, reading horror novels, and playing taxi driver for his daughter’s way too many activities and hobbies.
