SRE

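The earliest discussion of SRE comes from Google's book "Site Reliability Engineering: How Google Runs Production Systems". In it, key members of Google's SRE team explain how they attend to the entire software lifecycle, and why doing so helps Google build, deploy, monitor, and maintain some of the largest software systems in the world.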

SRE is defined as follows:

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems.

One line captures the work of an SRE especially vividly:

SRE is “what happens when a software engineer is tasked with what used to be called operations.”

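In other words, the goal of SRE is to build scalable, highly reliable software systems by applying software engineering methods to infrastructure and operations problems. The Google SRE book describes the day-to-day split precisely: at most 50% of an SRE's time goes to operations work, and at least 50% goes to software engineering that keeps the infrastructure stable and scalable.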
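As a rough illustration of the 50% rule, the sketch below computes the share of logged hours spent on operations work and checks it against the cap. The WeekLog type, its field names, and the sample hours are hypothetical and not taken from the book; only the 50% threshold comes from the text above.

```python
from dataclasses import dataclass

OPS_CAP = 0.5  # at most 50% of time on operations work


@dataclass
class WeekLog:
    """Hypothetical weekly time log for one SRE (illustrative only)."""
    ops_hours: float          # tickets, on-call, manual tasks
    engineering_hours: float  # automation, tooling, other coding work


def ops_share(log: WeekLog) -> float:
    """Return the fraction of logged hours spent on operations work."""
    total = log.ops_hours + log.engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.ops_hours / total


def within_cap(log: WeekLog, cap: float = OPS_CAP) -> bool:
    """True when operations work stays at or below the cap."""
    return ops_share(log) <= cap


if __name__ == "__main__":
    week = WeekLog(ops_hours=25, engineering_hours=15)
    print(f"ops share: {ops_share(week):.1%}, within cap: {within_cap(week)}")
```

A week that exceeds the cap suggests operations work is crowding out the engineering work meant to reduce it.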

Based on the descriptions above, my understanding of SRE is:

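Responsibility: keep the infrastructure stable and scalable
Core: solve problems
Method: build up experience with problems through operations work, then use coding and similar means to solve them more efficiently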


Links