Interview with Yunong Xiao: How Is Netflix Exploring and Adopting FaaS?

June 21, 2018, 05:56

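In 2014, serverless architecture came into public view. At the time, the industry generally believed that going serverless could significantly lower IT costs, cutting cloud spending by 10%-90%, while also improving service deployment efficiency.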

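After several years of maturation, some companies are already practicing serverless, with clear results. For ArchSummit, the global architect summit held in Shenzhen on July 6, we invited Yunong Xiao, principal software engineer at Netflix, to share how Netflix has explored FaaS technology. We hope it will be useful to practitioners.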


Xiao: I currently lead the FaaS and API platform at Netflix. The Netflix API is a tier-1 service through which every single request from all Netflix clients flows. It allows us to integrate the hundreds of microservices on the backend into one coherent service for clients to access. We're building a FaaS platform to enable engineers to quickly develop, test, and operate these API services -- which generally are bespoke to each device.

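InfoQ: What improvements has practicing serverless brought to Netflix? In your view, which business scenarios suit a serverless architecture, and which do not? (How much cost can the serverless model save Netflix?)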

Xiao: At Netflix, we design our product with innovation in mind. What this means is that we're constantly A/B testing our product and launching many new features each week. In order to enable this kind of velocity, we require an API services platform that lets client engineers rapidly deploy changes to their services to production. FaaS achieves this by abstracting away all of the platform components usually associated with a service, down to just the business logic itself -- allowing engineers to focus on developing great new features instead of writing boilerplate code.

Additionally, operating services at more than four 9s of availability is difficult -- even for seasoned server engineers. Thus a serverless model where we centralize the operations allows us to provide a platform on which even engineers without server and operational experience can develop highly available services.

InfoQ: Could you describe the architecture of the API Platform in more detail? How has the API Platform put serverless into practice so far?

Xiao: At a very high level, the API platform consists of a FaaS platform which allows engineers to deploy functions with custom business logic as highly available production services.
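To make the idea concrete, here is a minimal sketch of a function that carries only business logic while a platform layer owns routing, metrics, and error handling. Everything in it is an assumption made for illustration: `FunctionPlatform`, the `function` decorator, the `/home` route, and `home_screen` are hypothetical names, not the Netflix platform's API.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Response = Dict[str, Any]


class FunctionPlatform:
    """Hypothetical stand-in for the platform: routing, counters, error handling."""

    def __init__(self) -> None:
        self._routes: Dict[str, Callable[[Request], Any]] = {}
        self.request_count = 0
        self.error_count = 0

    def function(self, path: str):
        """Register a business-logic function under a route."""
        def register(fn):
            self._routes[path] = fn
            return fn
        return register

    def handle(self, path: str, request: Request) -> Response:
        """Dispatch a request; the platform, not the function, deals with failures."""
        self.request_count += 1
        fn = self._routes.get(path)
        if fn is None:
            return {"status": 404, "body": "no such function"}
        try:
            return {"status": 200, "body": fn(request)}
        except Exception:
            self.error_count += 1
            return {"status": 500, "body": "function failed"}


platform = FunctionPlatform()


@platform.function("/home")
def home_screen(request: Request) -> Dict[str, int]:
    # Only business logic lives here (illustrative device-specific shaping).
    rows = 3 if request.get("device") == "tv" else 2
    return {"rows": rows}
```

Calling `platform.handle("/home", {"device": "tv"})` dispatches to `home_screen` and wraps its result in a response. The function itself never touches routing or error reporting, which is the separation the answer above describes.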

InfoQ: Is serverless architecture the ultimate form of microservices? What will your team focus on optimizing next?

Xiao: There are tradeoffs to consider with serverless. By adopting the FaaS model, you are essentially trading customization for velocity and perhaps availability. There are some applications where FaaS for services works really well -- as is the case for the Netflix API, where we run relatively uniform microservices that only need to access and mutate data from downstream services. However, if a service requires customization, such as needing to change various parts of the service platform (e.g. RPC, data access, caching, authentication), then the FaaS model may not provide enough flexibility for such services.

Our focus currently is to finish migrating the legacy API services over to the new stack. After that our focus could include many areas such as performance -- both to reduce cost and improve customer experience -- and other areas such as infrastructure and platform improvements.

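InfoQ: Could you use concrete examples to discuss what kinds of function dependencies are reasonable in serverless, and how to assess, from a business-logic perspective, which critical paths need alerting and which are allowed to fail? (How do you prevent mistakes from consuming large amounts of resources and driving up costs?)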

Xiao: Functions are deployed as isolated services -- which means we're not deploying functions from different services on the same instance. This is really important for us as we wouldn't want one misbehaving service to take down all of Netflix. This isolation helps us prevent large scale outages across all of Netflix. We also integrate with our internal metrics, alerting, and monitoring systems, which gives us visibility into the health of each service. The service platform contains modern load-shedding technologies such as concurrency limits and circuit breaking -- these generally help prevent large scale outages. We've also invested heavily in runtime debugging, profiling, and sampling, which provide the observability we need to operate many services at scale. There are many other components in the platform that help us run reliably; come to the talk, "Going FaaSter: Function as a Service at Netflix", to find out more!
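The two load-shedding techniques named above, concurrency limits and circuit breaking, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Netflix's implementation: the class names, thresholds, and the injectable `clock` parameter are all hypothetical, and the circuit breaker is not thread-safe.

```python
import threading
import time


class LoadShedError(Exception):
    """Raised when a call is rejected instead of being executed."""


class ConcurrencyLimiter:
    """Caps in-flight calls; excess calls are rejected, not queued."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            raise LoadShedError("concurrency limit reached")
        try:
            return fn(*args)
        finally:
            self._slots.release()


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock=time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock  # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise LoadShedError("circuit open")
            # Cool-down elapsed: half-open, so one more failure reopens the circuit.
            self._opened_at = None
            self._failures = self._threshold - 1
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Both wrappers fail fast with `LoadShedError` rather than letting work pile up, which is what keeps one overloaded or failing dependency from exhausting the resources of the service that calls it.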

In terms of dependencies we allow users to import third party libraries at will -- but of course this means engineers need to exercise judgement with respect to things like security and performance.


InfoQ: How should one decide between, or compare, a public cloud FaaS service and a self-built FaaS service on a private cloud?

Xiao: This comes down to the classic build vs buy question. I think one should be pragmatic when faced with this decision. When we were first designing our FaaS platform, we considered public options such as Lambda and App Engine. We would be happy to use off-the-shelf solutions if they fit our use case.

As it turns out, we needed a platform that integrated with the existing Netflix service platform components such as metrics, alerts, service discovery, and many others, and integrating these with high level FaaS platforms would be difficult.

Additionally, we needed full visibility into the services using the FaaS platform. Building it ourselves meant that we have full control all the way down to the operating system -- and we can give operators (ourselves) the tools and visibility to debug the services and platform.

Obviously there's a huge amount of effort, time, and cost that went into building our own FaaS platform -- so we don't make these decisions lightly. However, at the time we couldn't find an open source or public FaaS option that satisfied our requirements.

This doesn't mean others should follow in our footsteps. If there is an open source or public FaaS option that suits your requirements, then absolutely go and use it. Opportunity cost is also an important metric. Technology is just a means to an end -- and people should absolutely use the best tool for the job -- often this means buying and not building.

InfoQ: What advice do you have for combining CI/CD with FaaS?

Xiao: Providing a robust first class testing framework is important. We designed our FaaS platform with testing in mind. As a result, we created a testing framework with features such as first class mocks and tight integration with the developer tooling to make it very easy for engineers to write unit, integration, and end-to-end tests using the FaaS platform.

One of the main advantages of our test framework is that it allows engineers to test their functions in isolation, either locally or on Jenkins -- without having to deploy code to the cloud. This ease of use incentivizes our customers to write tests -- which helps us improve the reliability of the service.
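As an illustration of testing a function in isolation, the sketch below uses Python's standard `unittest.mock` to stand in for downstream services, so the test runs locally with nothing deployed. It is not Netflix's test framework: `build_profile`, the `clients` object, and its `users` and `recommendations` attributes are hypothetical names assumed for the example.

```python
from unittest import mock


def build_profile(request, clients):
    """Hypothetical function: combines data from two downstream services."""
    user = clients.users.get(request["user_id"])
    titles = clients.recommendations.top(request["user_id"], limit=2)
    return {"name": user["name"], "titles": titles}


def test_build_profile():
    # Mocks replace the downstream services; no network or deployment needed.
    clients = mock.Mock()
    clients.users.get.return_value = {"name": "Ada"}
    clients.recommendations.top.return_value = ["A", "B"]

    result = build_profile({"user_id": 7}, clients)

    assert result == {"name": "Ada", "titles": ["A", "B"]}
    clients.users.get.assert_called_once_with(7)
    clients.recommendations.top.assert_called_once_with(7, limit=2)
```

Because the downstream clients are passed in rather than created inside the function, the same test can run unchanged on a laptop or in a CI job.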

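InfoQ: Industry-wide adoption of serverless is still far off, and there is no unified standard for building it. How do you make sure your direction is the right one? Could you share the lessons you have learned over the years?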

Xiao: Today most serverless solutions are geared towards batch and event-driven tasks which are not latency sensitive. However, we believe serverless should also be considered for production services, since it reduces operational and code complexity by abstracting away the platform and infrastructure.

For us, there was a clear need within the Netflix API organization for a FaaS model which supported service style workloads. We believe, through conversations with other companies, that there is an appetite for service style FaaS platforms. For most teams, a service is a means to an end -- they're not opinionated about how the service is implemented, only that it performs the business logic they need reliably, with good developer ergonomics.

I think FaaS is a natural evolution. Many years ago most services used bespoke software up and down the entire stack, running inside data centers owned by each company. We're moving towards a model today where we're commoditizing the components further and further up the stack. We started with the commoditizing of hardware and data centers with IaaS (think AWS EC2), and then moved towards commoditizing some parts of the platform with PaaS (think Heroku, or Google Cloud Platform). The natural evolution of this is toward FaaS, where everything is provided by the platform except for the business logic, which is the function itself.

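InfoQ: With the rise of containers and Kubernetes, many serverless architectures are now built on these two technologies, such as Fn, Kubeless, OpenFaaS, and IronFunctions. How do you see the opportunities that container technology, and Kubernetes in particular, brings to serverless architecture?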

Xiao: One of the reasons we see so many FaaS platforms built on top of K8s is that K8s abstracts away the infrastructure and platform required for building scalable and reliable services on top of containers. This is powerful as it means that FaaS frameworks can focus on the function runtime.

This space will continue to evolve and I hope to see additional FaaS frameworks emerge -- especially ones that can fulfill the need for service style workloads at scale (think rich metrics, autoscaling, performance optimizations). I believe K8s will evolve in terms of its ability to run at larger scales -- this would make it an even better fit for use cases exceeding 5000 physical nodes.


Xiao:Engineers should be pragmatic and look to make incremental changes to the architecture. Changing everything at once significantly increases the complexity, risk, and timeline of the project. Making incremental changes means we can shorten the feedback loop, realize gains more quickly for the business, and reduce the risk by changing only a few components at a time.


We should balance the tradeoffs of each decision and seek to get broad alignment within the company and mine for dissent. Be judicious when it comes to adopting new technology -- ask yourself the question, "why are you picking this technology?" If you can't answer it in a way that satisfies your team or organization -- then you should think twice. Think about the implications of adopting new technologies. Does it have a broad user and support base? Does it provide a good set of tooling to operate and debug? What about documentation? How about the maintenance cycle? What is the impact to the organization as a whole by adopting a new technology -- will platform teams now need to support this new technology across the entire organization?


For example, we adopted containers for the FaaS platform, for very specific reasons. It allowed us to enable engineers to run their services everywhere, and gave us immutable build artifacts. This decision didn't just impact our team -- it required us to create a new team at Netflix, which was tasked with building a container orchestration system. The decision to use new technology can often have rippling and unforeseen consequences up and down the entire company.

InfoQ: What do engineers care about most when developing FaaS services?

Xiao: For the development experience, we focused on the ergonomics of our FaaS platform. This was the biggest feedback from engineers using the FaaS platform. As a result we focused on building developer tooling that allows engineers to develop and debug their functions locally on their dev machines -- including the ability to tail logs and attach debuggers.


Xiao: Engineers should focus on the things that matter to their teams -- for most this no longer means the infrastructure or service platform. For our engineers who use the FaaS platform, this allows them to focus on product innovation -- improving the Netflix experience for our more than 125 million members.



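Yunong Xiao is currently a principal software engineer at Netflix in Los Gatos, California, where he leads the design and architecture team for the Netflix API platform. He previously worked at AWS and Joyent, focusing on distributed systems, and helped design and build several cloud computing products, such as AWS IAM and Manta. He also maintains restify, an open source Node.js framework. Yunong holds an honours degree in computer engineering from the University of Waterloo.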