Scaling Picnic Systems in Corona Times

2020-08-21 13:13

As an online groceries company, Picnic saw an enormous increase in demand during the start of the Corona crisis. Our systems suddenly experienced traffic peaks of up to 10–20 times the pre-Corona peak traffic. Even though we build our systems for scale, surges like these exposed challenges we didn’t know we had. In this post, we’ll share what we learned while scaling our systems during these times. If you prefer, you can also watch the recording of an online meetup we organised around this same topic.

When we talk about scaling systems, you might be tempted to think about infrastructure and raw processing power. These are definitely important pieces of the puzzle. However, just throwing more resources at a scaling problem is rarely the best way forward. We’ll see that scaling our infrastructure goes hand in hand with improving the services that run on this infrastructure.

Before diving into solutions, let’s first get a better understanding of how Picnic operates and what issues arose.

The Picnic Promise

Customers use the Picnic app to order groceries for the next day, or further into the future. Picnic then fulfills this order by picking the products in our fulfillment centers, and delivering them to the door using our iconic electric vehicles. Unfortunately, we don’t have infinite stock, vehicles, and picking capacity. That’s why a customer can choose a delivery slot for a certain day and time, as long as it has capacity available.

It’s mid-March, right after the ‘intelligent lockdown’ in the Netherlands became a fact. Our story begins with many existing and new customers turning to Picnic for their essentials in these uncertain times. Slots are filling up quickly, and customers are eagerly waiting for new slots to become available. New slots are made available at a fixed time every day, and we communicate this to give everyone a fair chance to place an order. Of course, this does lead to a big influx of customers around these slot opening times. These moments brought about the toughest scaling challenges.

Now, let’s briefly sketch the Picnic system landscape before we move on to the lessons we learned.

Picnic’s System Landscape

Our systems run on AWS. Kubernetes (EKS) is used to run compute workloads, and for data storage we use managed services like Amazon RDS and MongoDB Atlas. When users order groceries through the Picnic app, it communicates with a service we call Storefront. Behind Storefront, there are many other internal services handling every aspect of our business, from stock management to order fulfilment. They all communicate with REST over HTTP, and most of our services are implemented on a relatively uniform Java 11 + Spring Boot stack.

Scaling Kubernetes Pods

With the first surges of traffic, the Kubernetes pods running Storefront receive more traffic than they can handle. Scaling up the number of replicas is the obvious first step.

Doing so uncovered some interesting bottlenecks in the service itself, which we’ll talk about in a bit. For now, let’s focus on how the infrastructure evolved to cope with traffic surges. Manually increasing the number of replicas works, but Kubernetes also offers a way to automatically scale pods. We first tried using the Horizontal Pod Autoscaler (HPA) functionality. Based on configurable CPU utilization thresholds, replicas are automatically added or removed to match the actual load.
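
For reference, a minimal HPA manifest of this kind could look as follows. The names, CPU threshold, and replica bounds are purely illustrative, not our actual configuration.

```yaml
# Illustrative HPA for the Storefront deployment (autoscaling/v2beta2 was the
# current API version at the time); all numbers are examples only.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```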

Only, there’s an issue: the highest traffic peaks occur when new delivery slots are opened up for customers. People start anticipating this moment, and traffic surges so quickly that the Autoscaler just can’t keep up.

Since we know the delivery slot opening moments beforehand, we combined the HPA with a simple Kubernetes CronJob. It automatically increases the `minReplicas` and `maxReplicas` values in the HPA configuration some time before the slots open. Note that we don’t touch the replica count of the Kubernetes deployment itself, that’s the HPA’s job!
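
A sketch of such a CronJob is shown below, assuming a service account with RBAC permission to patch HPAs; the schedule, image, and replica numbers are made up for illustration.

```yaml
# Illustrative CronJob that raises the HPA bounds some time before slots open.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: storefront-hpa-scale-up
spec:
  schedule: "45 17 * * *"                 # example: some time before slot opening
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher  # hypothetical SA allowed to patch HPAs
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl       # any image that ships kubectl will do
              command:
                - kubectl
                - patch
                - hpa
                - storefront
                - --patch
                - '{"spec": {"minReplicas": 12, "maxReplicas": 40}}'
```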

Now, additional pods are started and warmed up before the peak hits. Of course, after the peak we reduce the counts in the HPA configuration commensurately. This way, we are certain to have additional capacity in place before the peak occurs. At the same time, the HPA retains the flexibility to kick in at other moments when traffic increases or decreases unexpectedly.

Scaling Kubernetes Nodes

Just scaling pods isn’t enough: the Kubernetes cluster itself must have enough capacity across all nodes to support the increased number of pods. Costs now also come into the picture: the more and heavier nodes we use, the higher the EKS bill will be. Fortunately, another open source project called Cluster Autoscaler solves this challenge. We created an instance group that can be automatically scaled up and down by the Cluster Autoscaler based on utilization. The nodes in this instance group were used exclusively for applications subject to the HPA scaling discussed earlier.
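
One way to reserve such an instance group is to label and taint its nodes, and give only the HPA-scaled deployments the matching node selector and toleration. The label and taint names below are hypothetical.

```yaml
# Illustrative Deployment that pins Storefront to the dedicated instance group
# managed by the Cluster Autoscaler.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      nodeSelector:
        node-group: hpa-autoscaled       # label applied to the instance group's nodes
      tolerations:
        - key: dedicated
          operator: Equal
          value: hpa-autoscaled
          effect: NoSchedule             # matching taint keeps other workloads off
      containers:
        - name: storefront
          image: storefront:example      # placeholder image
```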

Combining HPA and Cluster Autoscaler helped us to elastically scale the compute parts of our services. However, the managed MongoDB and RDS clusters also needed to be scaled up. Scaling up these managed services is a costly affair, and cannot be done as easily as scaling up and down Kubernetes nodes. Hence, besides scaling up, we also had to address the ways services utilize these managed resources.

MongoDB Usage Patterns

It’s not just that scaling up a MongoDB cluster is expensive. We were even hitting physical limits of the best available underlying hardware! A single MongoDB cluster was serving Storefront and all the other services shown earlier on. Even after scaling up this cluster, at some point, we were saturating the 10 Gbit Ethernet link to the master node. Of course, this leads to slowdowns and outages across services.

You can’t scale your way out of every problem. That’s why, in addition to all of the infrastructure efforts discussed so far, we also started looking critically at how our services were interacting with MongoDB. In order to reduce bandwidth to the cluster (especially the master node), we applied several code optimizations:

  • Read data from secondaries, wherever we can tolerate the reduced consistency guarantees.
  • Don’t read-your-writes by default. Developers do this out of habit, but often it’s not necessary because you already have all the information you need.
  • When you do read data, apply projections to reduce bytes-on-the-wire. In some cases this means introducing multiple specialized queries with projections for different use-cases, trading a bit of code reuse for performance (see the sketch after this list).
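
As an illustration of the first and third optimizations, here is a sketch using the MongoDB Java driver directly; the database, collection, and field names are invented, and whether this goes through the plain driver or Spring Data is an implementation detail left out here.

```java
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.excludeId;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;

// Illustrative repository: names are hypothetical, not the actual schema.
public class DeliverySlotReads {

    private final MongoCollection<Document> slots;

    public DeliverySlotReads(MongoClient client) {
        this.slots = client.getDatabase("storefront")
                .getCollection("delivery_slots")
                // Read from secondaries where slightly stale data is acceptable,
                // taking load off the primary (master) node.
                .withReadPreference(ReadPreference.secondaryPreferred());
    }

    // Specialized query: project only the fields this use-case needs,
    // keeping bytes-on-the-wire to a minimum.
    public Document findSlotSummary(String slotId) {
        return slots.find(eq("_id", slotId))
                .projection(fields(include("start", "end", "available"), excludeId()))
                .first();
    }
}
```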

A somewhat bigger change we applied for some services was to provision them with their own MongoDB clusters. This is a good idea anyway (also from a resilience perspective, for example), and the Corona scaling challenges only made this more apparent.

Observability

Optimizing and re-deploying services during a crisis is stressful, to say the least. Where do you start, and how do you know your changes are actually helping? For us, observability and metrics were key in finding the right spots to tune service implementations. Fortunately, we already had Micrometer, Prometheus, Grafana, and New Relic in place to track and monitor metrics.

This was especially useful when we encountered an issue that stumped us for a bit. Requests from the app to Storefront were painfully slow, yet the Storefront pods themselves were not exhibiting any signs of excessive memory or CPU use. Also, outgoing requests from Storefront to internal services were reasonably fast.

So what was causing the slowdown of all these user requests? Finally, a hypothesis emerged: it might be a connection pool starvation issue of the Apache HTTP client we use to make internal service calls. We could try to tweak these settings and see if it helps, but in order to really understand what’s going on, you need accurate metrics. Apache HTTP client Micrometer bindings are available out-of-the-box, so integrating them was a relatively small change. The metrics indeed indicated that we had a large amount of pending connections whenever these slowdowns occurred. With the numbers clearly visible, we tweaked the connection pool sizes and this bottleneck was cleared.
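
As a sketch of what this looks like in code, assuming Apache HttpClient 4.x and Micrometer's out-of-the-box binder for it (exact class names can differ between versions), configuring and instrumenting the pool might be done like this; the pool sizes are illustrative.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.httpcomponents.PoolingHttpClientConnectionManagerMetricsBinder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class InternalHttpClientFactory {

    public CloseableHttpClient create(MeterRegistry registry) {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        // The HttpClient 4.x defaults (20 total connections, 2 per route) are far too
        // small under peak traffic; the numbers below are examples, not real values.
        pool.setMaxTotal(200);
        pool.setDefaultMaxPerRoute(50);

        // Expose available/leased/pending connection counts as Micrometer metrics,
        // so pool starvation shows up on a dashboard instead of as a mystery slowdown.
        new PoolingHttpClientConnectionManagerMetricsBinder(pool, "internal-services")
                .bindTo(registry);

        return HttpClients.custom().setConnectionManager(pool).build();
    }
}
```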

Service Interactions

Removing one bottleneck often reveals another, and this time was no different. Increasing the amount of concurrent calls from Storefront to internal services (by enlarging the HTTP client connection pool sizes) puts additional load on downstream services, revealing new areas for improvement. This additional load caused us to take a hard look at all the internal service interactions stemming from a single user request.

The best internal service call is the one you never make. Through careful log and code analysis, we identified several calls that could be eliminated, either because they were duplicated in different code paths, or not strictly necessary for certain requests.

Most calls, however, are there for a reason. Judicious use of caching is the next step to reduce stress on downstream services. We already had per-JVM in-memory caches for several scenarios. Again, metrics helped locate high-volume service calls where introducing caching would help the most.

Part of our caching setup also worked against us during scaling events. One particular type of caching we use is a polling cache: it periodically retrieves a collection of data from another service. When the service starts, these polling caches initialize themselves, until they expire and reload data. Normally, that works just fine. When the whole system is already under high load, and Kubernetes automatically starts adding many more pods, things get dicey. These new pods try to fill their polling caches simultaneously, putting a lot of load on the downstream services providing the data. Subsequently, these caches all expire at the same time and reload data, again causing peak loads. To prevent such a ‘thundering herd’, we introduced random jitter in the schedule of the polling caches. This spreads the same load out over time.
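
A minimal sketch of such a jittered polling cache follows; the class and its parameters are invented for illustration, and a real cache obviously carries more machinery (error handling, metrics, shutdown).

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative polling cache that adds random jitter to every (re)load, so many
// pods starting or refreshing together don't hit the same downstream service
// at the same moment.
public class JitteredPollingCache<T> {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Supplier<T> loader;
    private final Duration baseInterval;
    private final Duration maxJitter;
    private volatile T value;   // null until the first load completes

    public JitteredPollingCache(Supplier<T> loader, Duration baseInterval, Duration maxJitter) {
        this.loader = loader;
        this.baseInterval = baseInterval;
        this.maxJitter = maxJitter;
        // Spread the initial fill across [0, maxJitter] instead of loading immediately.
        scheduler.schedule(this::reload, randomJitterMillis(), TimeUnit.MILLISECONDS);
    }

    public T get() {
        return value;
    }

    private void reload() {
        try {
            value = loader.get();
        } finally {
            // Re-schedule with a fresh random offset so reloads never line up across pods.
            scheduler.schedule(this::reload,
                    baseInterval.toMillis() + randomJitterMillis(), TimeUnit.MILLISECONDS);
        }
    }

    private long randomJitterMillis() {
        return ThreadLocalRandom.current().nextLong(maxJitter.toMillis() + 1);
    }
}
```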

Wrapping Up

To solve scaling challenges, you need a holistic approach spanning infrastructure and software development. Every improvement described in this post (and many more) was found and implemented over the course of several weeks. During these first weeks, we had daily check-ins where both developers and infrastructure specialists analyzed the current state and discussed upcoming changes. Improvements on both code and infrastructure were closely monitored with custom metrics in Grafana and generic service metrics in New Relic. Every day, our services performed a little bit better, and we could soon prevent downtime like we had in the first days.

Eventually, the traffic peaks flattened out and somewhat subsided, as society adapted to the new reality. Our improvements, however, will be with us for a long time!

Translated from: https://blog.picnic.nl/scaling-picnic-systems-in-corona-times-701699aa513b
