Quora Infrastructure: How does Quora solve problems in their deployment process?
2015-08-03 14:31
Let's say you want to delete a column from a database table. You would first push code that no longer uses that column, then wait until that
push is finished, and then delete the column.
If you want to rename a column, it's trickier.
The right way to do this would be roughly:
1. Create a new column with the new name.
2. Push code that duplicates writes to both columns, but still reads only from the old column.
3. Run a query to copy all data from the old column into the new column. At this point the columns are identical and will be maintained as identical because of the duplicate writes.
4. Push code that switches reads to use the new column.
5. Push code that stops the duplicate writes and just writes to the new column.
6. Drop the old column.
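As a rough illustration of the steps above (not Quora's actual code), here is a minimal sketch using Python's built-in sqlite3 module. The table `users`, the columns `name` and `full_name`, and the helper functions are all made up for the example. In production each "push" would be a separate deploy, not consecutive lines in one script.

```python
import sqlite3

# Hypothetical schema for illustration: users.name is being renamed to users.full_name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")

# Create the new column alongside the old one.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# First push: writes go to both columns, reads still use the old column.
def set_name_dual(user_id, value):
    conn.execute(
        "UPDATE users SET name = ?, full_name = ? WHERE id = ?",
        (value, value, user_id),
    )

def get_name_old(user_id):
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]

# Traffic that arrives while dual writes are live.
conn.execute("INSERT INTO users (id, name, full_name) VALUES (2, 'Grace', 'Grace')")
set_name_dual(2, "Grace Hopper")

# Backfill: copy the old column into the new one. Rows written before the
# dual-write push (id 1 here) get their value now; the rest are unchanged.
conn.execute("UPDATE users SET full_name = name")

# Second push: reads switch to the new column.
def get_name_new(user_id):
    row = conn.execute("SELECT full_name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]

# Third push: writes go only to the new column.
def set_name_new(user_id, value):
    conn.execute("UPDATE users SET full_name = ? WHERE id = ?", (value, user_id))

set_name_new(1, "Ada Lovelace")

# Finally, drop the old column. ALTER TABLE ... DROP COLUMN needs SQLite 3.35+,
# so older versions fall back to rebuilding the table.
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE users DROP COLUMN name")
else:
    conn.executescript("""
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT);
        INSERT INTO users_new SELECT id, full_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
    """)
conn.commit()
```

At every point in the sequence, code from the previous push and code from the next push can run side by side without either one touching a column that is missing or stale for its purposes.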
This generalizes to any kind of data migration. You probably wouldn't go through the overhead for this just to rename a column, but imagine changing from one data format to a more compressed representation: it's the same process.
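To make the format-change case concrete, here is a small sketch of the same process applied to a compressed representation. Everything in it is an assumption for illustration: the in-memory `store` dict stands in for a real database, and the field names `body` (plain text) and `body_z` (zlib-compressed bytes) are hypothetical.

```python
import zlib

# Hypothetical record store: "body" is the old plain-text field,
# "body_z" is the new zlib-compressed field.
store = {1: {"body": "hello " * 100}}

def encode(text):
    return zlib.compress(text.encode("utf-8"))

def decode(blob):
    return zlib.decompress(blob).decode("utf-8")

# Dual-write phase: every write populates both representations.
def write_dual(key, text):
    store[key] = {"body": text, "body_z": encode(text)}

# Backfill: convert records that predate the dual-write push.
def backfill():
    for record in store.values():
        if "body_z" not in record:
            record["body_z"] = encode(record["body"])

# After the read switch: readers use only the new representation.
def read_new(key):
    return decode(store[key]["body_z"])

# After writes stop touching the old field: remove it everywhere.
def drop_old():
    for record in store.values():
        record.pop("body", None)

write_dual(2, "world")
backfill()
drop_old()
```

The only difference from the column rename is that the backfill transforms the data instead of copying it verbatim.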
Do all the new engineers know this? No, they learn as necessary. Often we just tolerate some errors during the restart window and don't go through a process like this.
Doesn't this make developers' lives difficult? Occasionally, but there is no way around it. This is a consequence of running a service that never stops. It's not feasible to make an atomic code deployment process when you have hundreds of servers and don't want to have downtime. And if there is going to be a data migration that will take time to run, you'd need to do this even if you could deploy code atomically. I'd also suggest that this is not that difficult. Because of all the continuous deployment infrastructure, our pushes are really lightweight. We always have the option of taking a service down to do a migration to avoid some of this overhead, and we have done that on a few occasions.
Doesn't this make the code look dirty? Temporarily, while this process is happening, it does, but at the end, a few hours later, the code is back to being clean.