How to Create a Disaster Recovery Plan for Your IT Team

2020-08-21 11:07

You know the old joke: there are two kinds of companies, those that have been hit with an IT disaster and those that don't yet realize they've been hit with one.

But what they all have in common is that there are plenty more disasters to come. So ask yourself whether you're ready for the next one.

This article, which is based on my Pluralsight course, Linux System Maintenance and Troubleshooting, is intended to start you thinking about what building an effective protocol will take.

What you need to have in place

It all begins with the business continuity plan (BCP). This is a formal plan that's meant to define the procedures an organization would use to ensure survival in the event of an emergency.

BCPs generally include sub-plans to secure the immediate safety of employees and customers, to restore previously designated critical operations as soon as possible, and, eventually, to restore full normal operations.

An effective BCP will also include two sub-plans that are specific to IT operations: the incident management protocol and the disaster recovery plan.

The disaster recovery plan (DRP) aims to protect an organization's IT infrastructure in the event of a disaster. Its primary goals are to minimize damage and to restore functionality as quickly as possible.

We call this a "plan" because it simply won't work without serious prior preparation. Infrastructure protection, threat detection, and corrective protocols are critical parts of the plan.

An Incident Management Plan (IMP) is meant to address the specific threat of cyber attacks against IT infrastructure. Its goals are to minimize damage and remove the threat.

As you can easily tell, there will be some overlap between your DRP and IMP. But the key focus of disaster recovery is to get your infrastructure back on its feet, while incident management is much more closely aligned with the world of IT security.

For the rest of this short article, we'll look at what goes into creating incident management and disaster recovery plans, and at how to make sure those plans are sound and will actually work when executed.

Developing an Incident Management Protocol

Since incident management is going to be your first response to trouble, we'll begin there.

The first indication that there's trouble can come from a user who notices that something's not right with the system. Or, if you've done a particularly good job configuring your infrastructure, it could also come to you in the form of an automated alert triggered by monitoring software.

When that alert comes in, it'll be the job of the technician or admin on call to decide how it's going to be handled and who has to handle it.

Escalation can happen through a direct phone call or email, through a ticket submitted via a collaboration tool like Jira, or through a purpose-built Security Information and Event Management (SIEM) tool.

Again, though, the more smart automation you build into the process, the faster and more efficient it's likely to be.
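
As a rough illustration of that kind of automation, here is a minimal Python sketch of an escalation router. The severity levels, the channel names, and the `route_alert` function are all hypothetical and invented for this example; a real setup would hand off to your ticketing or SIEM tooling rather than return a string.

```python
from dataclasses import dataclass

# Hypothetical severity-to-channel mapping; adapt it to your own on-call policy.
ESCALATION_CHANNELS = {
    "low": "ticket",    # file a ticket in a collaboration tool
    "medium": "email",  # notify the responsible team by email
    "high": "phone",    # call the on-call admin directly
}


@dataclass
class Alert:
    source: str    # where the alert came from (user report, monitor, ...)
    severity: str  # expected to be "low", "medium", or "high"
    summary: str   # short human-readable description


def route_alert(alert: Alert) -> str:
    """Return the escalation channel for an alert.

    Unknown severities fall through to "phone" so nothing is silently dropped.
    """
    return ESCALATION_CHANNELS.get(alert.severity, "phone")


if __name__ == "__main__":
    alert = Alert(source="monitoring", severity="high", summary="disk full on db01")
    print(route_alert(alert))
```

Defaulting unknown severities to the loudest channel is a design choice made here for safety; your own policy may prefer something quieter.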

Whoever ends up with the ultimate responsibility will coordinate efforts to definitively diagnose and resolve the problem. Ideally, where necessary, such coordination will include admins, developers, and other key stakeholders to ensure you've got all the resources you'll need to address the problem.

Once you've confirmed the problem is resolved, you'll want to close the incident by assessing what went wrong and what went right, how your response could have been better, and how you can rework things to reduce the risk of a repeat.

But what does all this have to do with IT administration? Well, responsible IT managers must be able to build resiliency into their infrastructure.

That will mean spending serious time fine-tuning monitoring software so that it catches real problems and alerts you to them while raising as few false positives as possible.
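
One common way to cut false positives is to alert only when a metric stays above its threshold for several samples in a row, so that a single spike is ignored. The sketch below shows the idea in plain Python; the class name, the threshold, and the sample values are assumptions made for illustration and are not tied to any particular monitoring product.

```python
from collections import deque


class ConsecutiveThresholdAlert:
    """Fire only after `required` consecutive samples exceed `threshold`."""

    def __init__(self, threshold: float, required: int = 3) -> None:
        self.threshold = threshold
        self.required = required
        # Keep only the most recent `required` pass/fail results.
        self._recent = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if an alert should fire."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self.required and all(self._recent)


if __name__ == "__main__":
    cpu_alert = ConsecutiveThresholdAlert(threshold=90.0, required=3)
    # Illustrative CPU readings: one isolated spike, then a sustained breach.
    for sample in [95, 40, 96, 97, 98]:
        print(sample, cpu_alert.observe(sample))
```

With these made-up readings, only the final sample triggers an alert, because it is the first point at which three consecutive readings exceed 90.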

And it'll probably also involve intelligently automating logging and intrusion detection systems and generally getting a good idea of how things are supposed to look.

Developing a Disaster Recovery Plan

Disaster recovery planning requires you to:

  • Define exactly what recovery means
  • Identify the resources that achieving recovery will require
  • Convert those observations into a formal plan format
  • Communicate the plan to the players who will one day have to carry it out

What does recovery mean? It's when your poor, stricken infrastructure has returned to the shape it was in the moment before disaster hit.

What you'll need to get back to that point can be defined by establishing a Recovery Time Objective (RTO) and a Recovery Point Objective (RPO) that fit your organization's needs.

A Recovery Time Objective represents the maximum number of minutes, hours, or days that your organization could survive an IT service outage. So your recovery plan will need to incorporate that hard deadline into its protocols.

Of course, that means you'll need team members who can make it into the office, even in the small hours of the night, quickly enough to make a difference.

But it also means, say, that if your RTO is six hours, but restoring critical data from your backups would take a minimum of eight hours just to handle the transfer, then you'll have to rethink those numbers before signing off on the plan.
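
That sanity check is simple arithmetic, and it's worth scripting so it gets rerun whenever your data grows. Here is a minimal sketch; the data size and throughput figures are invented purely so that the transfer works out to the eight hours in the example, and `overhead_hours` is a hypothetical allowance for everything other than the transfer itself.

```python
def restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate the hours needed just to transfer `data_gb` of backup data."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def meets_rto(data_gb: float, throughput_mb_per_s: float,
              rto_hours: float, overhead_hours: float = 0.0) -> bool:
    """Return True if transfer time plus other overhead fits within the RTO."""
    return restore_hours(data_gb, throughput_mb_per_s) + overhead_hours <= rto_hours


if __name__ == "__main__":
    # Assumed figures: 2812.5 GB restored at a sustained 100 MB/s.
    print(restore_hours(2812.5, 100))          # 8.0
    print(meets_rto(2812.5, 100, rto_hours=6)) # False
```

If the check fails, you either need a faster restore path or a more honest RTO.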

A Recovery Point Objective is the amount of transaction data your organization could afford to lose during an outage and survive.

To illustrate, an e-commerce website that normally processes 25 transactions each minute could, perhaps, afford to issue apologies and refunds to 30 minutes worth of angry customers wondering why their credit cards were billed but their electric train sets weren't delivered. Refunding more than 30 minutes worth, however, could deplete your financial reserves to the point that you're no longer viable.
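
To put a number on that example: 25 transactions a minute over 30 minutes comes to 750 transactions at risk. Since the most you can lose is whatever has changed since the last backup or replication, the interval between backups has to be no longer than the RPO. A small sketch of both calculations follows; the function names and the 60-minute backup interval are assumptions chosen for illustration.

```python
def transactions_at_risk(tx_per_minute: float, rpo_minutes: float) -> float:
    """Number of transactions that could be lost within the RPO window."""
    return tx_per_minute * rpo_minutes


def interval_within_rpo(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    """Worst-case data loss equals the backup interval, so it must not exceed the RPO."""
    return backup_interval_minutes <= rpo_minutes


if __name__ == "__main__":
    print(transactions_at_risk(25, 30))  # 750
    print(interval_within_rpo(60, 30))   # False: hourly backups miss a 30-minute RPO
```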

In any case, calculating accurate and reliable RTOs and RPOs is how you set the limits within which your recovery plan will have to operate. Or, in other words, you'll have defined what recovery means.

Now what about resources? By which I mean the data backups and, when necessary, the physical equipment you'll need to get your application back on its feet.

To make that work you'll have to decide on an infrastructure backup system. Whether you choose to go with incremental or differential, on-site or off-site, and single or multiple media types, you'll have to map out exactly how the recovery will go and whether or not it'll meet your RTO and RPO limits.
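
The incremental versus differential choice directly affects how many backup sets a restore has to read, and therefore how long it takes. An incremental backup holds changes since the previous backup of any kind, while a differential holds changes since the last full backup. The sketch below works out which sets a restore would need; the `(label, kind)` tuple format and the weekday labels are hypothetical.

```python
from typing import List, Tuple

Backup = Tuple[str, str]  # (label, kind): kind is "full", "incremental", or "differential"


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the backups needed to restore to the newest one, oldest first.

    Assumes `backups` is ordered oldest to newest.
    """
    chain: List[Backup] = []
    skip_to_full = False
    for label, kind in reversed(backups):
        if kind == "full":
            chain.append((label, kind))
            break
        if skip_to_full:
            continue  # a differential already covers everything back to the full
        chain.append((label, kind))
        if kind == "differential":
            skip_to_full = True
    else:
        raise ValueError("no full backup found")
    chain.reverse()
    return chain


if __name__ == "__main__":
    incremental_week = [("Sun", "full"), ("Mon", "incremental"),
                        ("Tue", "incremental"), ("Wed", "incremental")]
    differential_week = [("Sun", "full"), ("Mon", "differential"),
                         ("Tue", "differential"), ("Wed", "differential")]
    print(len(restore_chain(incremental_week)))   # 4 sets to read
    print(len(restore_chain(differential_week)))  # 2 sets to read
```

Incrementals are smaller to take but produce a longer restore chain; differentials grow through the week but restore from just two sets. Which trade-off fits depends on your RTO.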

Of course there's no end of really bad things that can happen to make those plans utterly useless. What if your local server facility just burns down? What if it's lost to some kind of political upheaval or widespread power disruption?

Even if you've conscientiously maintained up-to-date data backups off-site, what good will they do you if your hardware effectively no longer exists?

Thinking about all those horrors can make preparing a cloud-based backup protocol using platforms like AWS and Azure sound mighty attractive. The big public clouds have the resources to distribute their infrastructure widely enough that it's virtually impossible for the whole thing to ever go down.

So you could, for instance, maintain a reliably replicated data store on a public cloud platform that mirrors your main deployment. You could also design an infrastructure template that could be loaded up with your backup data and then launched on demand to take over in the event of an outage. Because nothing in that second design is kept running until it's actually needed, it can take a good few minutes to bring it up to speed.

A warm standby recovery design might maintain your data running 24/7 on a minimal number of virtual servers. In an emergency, you can hit the switch and the platform's auto scaling will fire up all the instances you'll need.

You could set the scaling to kick in when triggered by an alert from your primary system. The public cloud presents endless possibilities, but they all require planning and preparation.
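
The logic behind that trigger is simple enough to model in a few lines. The sketch below is a toy stand-in, not a real cloud API: the `StandbyPool` class, its fields, and the `"down"` status string are all hypothetical, and an actual deployment would call your provider's scaling service instead of updating a counter.

```python
from dataclasses import dataclass


@dataclass
class StandbyPool:
    """Toy model of a warm standby group (not a real cloud API)."""
    running: int        # instances currently running
    minimum: int        # warm standby floor kept running 24/7
    full_capacity: int  # instances needed to carry production load

    def on_primary_alert(self, status: str) -> int:
        """Scale up to full capacity when the primary reports "down".

        Scaling back down is left as a deliberate manual step in this sketch.
        """
        if status == "down":
            self.running = self.full_capacity
        return self.running


if __name__ == "__main__":
    pool = StandbyPool(running=2, minimum=2, full_capacity=12)
    print(pool.on_primary_alert("degraded"))  # 2: stays at the warm floor
    print(pool.on_primary_alert("down"))      # 12: takes over production load
```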

A solid disaster recovery plan must be effectively communicated long before crunch time. Practically speaking, that means it'll all be written up, printed, and distributed to each of the key players who will carry out the plan.

That's not to say it ends there: those players will, of course, have to actually read the thing and, ideally, engage in realistic simulations until they're confident they can make it work under pressure.

What goes in this written plan?

  • An enumeration of all the stuff that could go wrong and bring down your system
  • An inventory of exactly what you've got running in your server room and what would be needed to replace it
  • The information you'll need to access and restore backed up data
  • An up-to-date contact list of the people who will be responsible for every aspect of the plan
  • The exact sequence of the tasks and events that will make up the recovery
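
If you keep a machine-readable copy of the plan alongside the printed one, the five items above map naturally onto a simple record, and a script can flag sections that are still empty. This is only a sketch: the class, its field names, and the sample entries are assumptions invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DisasterRecoveryPlan:
    threats: List[str] = field(default_factory=list)         # what could go wrong
    inventory: Dict[str, str] = field(default_factory=dict)  # item -> how to replace it
    backup_access: List[str] = field(default_factory=list)   # how to reach and restore backups
    contacts: Dict[str, str] = field(default_factory=dict)   # responsibility -> person
    recovery_steps: List[str] = field(default_factory=list)  # ordered recovery tasks

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    plan = DisasterRecoveryPlan(
        threats=["server room fire", "regional power outage"],
        contacts={"backup restore": "on-call admin"},
    )
    print(plan.missing_sections())
```

Running a check like this on a schedule is one cheap way to notice when the plan has fallen out of date.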

That's a lot of detail. But it's barely a drop in the bucket when compared with the total amount of preparation and plain old hard work that goes into creating a real-world recovery plan.

But for now, the key takeaway from this article is simply to keep all this in mind. Why? Because the next time you sit down to configure a monitoring package or administration framework, you'll think about incident management protocols and disaster recovery plans and wonder how to build them into your configuration.

There's much more administration goodness in the form of books, courses, and articles available at my site, bootstrap-it.com.

Source: https://www.freecodecamp.org/news/disaster-recovery-plan/
