An Analysis of File Directories During the Application Instance Lifecycle in Cloud Foundry

2014-01-20 21:28

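In Cloud Foundry, applications run on DEAs (Droplet Execution Agents), and an application's file directory changes as the application moves through the stages of its lifecycle.

This article analyzes how an application instance's directory changes in the following situations: starting an app, stopping an app, deleting an app, restarting an app, an app crash, stopping the DEA, starting the DEA, and restarting the DEA after it exits abnormally.

This article covers only Cloud Foundry v1; v2 will be covered in a follow-up.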

start an app

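start an app means that a user sends a request asking Cloud Foundry to create or start an application. Note that before the app is started, no DEA in Cloud Foundry holds any of the app's files. When a DEA receives the start request, it must download the droplet from droplet storage, unpack it under a path on the DEA node, and then run the start script of the unpacked droplet. As a result, a directory for that application now exists in the DEA's file system.

The code that implements these operations is in the process_dea_start method of /dea/lib/dea/agent.rb: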

    tgz_file = File.join(@staged_dir, "#{sha1}.tgz")
    instance_dir = File.join(@apps_dir, "#{name}-#{instance_index}-#{instance_id}")

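This code computes the path of the app's compressed package and the path of the directory the instance runs from on the DEA. The subsequent call success = stage_app_dir(bits_file, bits_uri, sha1, tgz_file, instance_dir, runtime) downloads the droplet and unpacks it into instance_dir. Once startup completes, instance_dir is the file path of the application instance.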
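
As a minimal sketch of how the two paths are derived, the snippet below rebuilds them outside the DEA. The helper name instance_paths, the base directories, and the sample values are assumptions chosen for illustration; only the two File.join patterns come from the code above.

```ruby
# Build the two per-instance paths the same way process_dea_start does.
# staged_dir and apps_dir are hypothetical parameters standing in for
# @staged_dir and @apps_dir.
def instance_paths(staged_dir, apps_dir, sha1, name, instance_index, instance_id)
  {
    tgz_file: File.join(staged_dir, "#{sha1}.tgz"),
    instance_dir: File.join(apps_dir, "#{name}-#{instance_index}-#{instance_id}")
  }
end

paths = instance_paths('/tmp/dea/staged', '/tmp/dea/apps', 'abc123', 'myapp', 0, 'f3a1c2')
puts paths[:tgz_file]      # /tmp/dea/staged/abc123.tgz
puts paths[:instance_dir]  # /tmp/dea/apps/myapp-0-f3a1c2
```

Because the instance ID is part of the directory name, two instances of the same app on one DEA get separate directories.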

Summary: start an app creates the application's file directory on a DEA and starts the application.

stop an app

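stop an app means that a user sends a request asking Cloud Foundry to stop a running application. Before the stop, the application must of course be running, so its directory and source code already exist in the file system of some DEA. When the Cloud Controller receives the user's stop request, it first locates the DEA nodes running the application and sends each of them a request to stop it. The DEA subscribes to this message and handles it in the process_dea_stop method, as follows: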

    NATS.subscribe('dea.stop') { |msg| process_dea_stop(msg) }

process_dea_stop stops the application, including all of its instances that match the request. The code is as follows:

    return unless instances = @droplets[droplet_id]
    instances.each_value do |instance|
      version_matched = version.nil? || instance[:version] == version
      instance_matched = instance_ids.nil? || instance_ids.include?(instance[:instance_id])
      index_matched = indices.nil? || indices.include?(instance[:instance_index])
      state_matched = states.nil? || states.include?(instance[:state].to_s)
      if (version_matched && instance_matched && index_matched && state_matched)
        instance[:exit_reason] = :STOPPED if [:STARTING, :RUNNING].include?(instance[:state])
        if instance[:state] == :CRASHED
          instance[:state] = :DELETED
          instance[:stop_processed] = false
        end
        stop_droplet(instance)
      end
    end
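
The matching rules in the loop above treat a nil filter as "match everything". The sketch below isolates that rule in a standalone predicate; the method name stop_filter_match? and the keyword arguments are hypothetical, and the instance hash is sample data.

```ruby
# Returns true when an instance passes every filter in a stop request.
# A nil filter means "do not filter on this field", as in process_dea_stop.
def stop_filter_match?(instance, version: nil, instance_ids: nil, indices: nil, states: nil)
  version_matched  = version.nil? || instance[:version] == version
  instance_matched = instance_ids.nil? || instance_ids.include?(instance[:instance_id])
  index_matched    = indices.nil? || indices.include?(instance[:instance_index])
  state_matched    = states.nil? || states.include?(instance[:state].to_s)
  version_matched && instance_matched && index_matched && state_matched
end

instance = { version: 'v1', instance_id: 'a1', instance_index: 0, state: :RUNNING }
puts stop_filter_match?(instance)                       # true
puts stop_filter_match?(instance, indices: [1, 2])      # false
puts stop_filter_match?(instance, states: ['RUNNING'])  # true
```

Note that states are compared as strings, which is why the state symbol is converted with to_s before the lookup.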

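The method first looks up the ID of the application to stop in the @droplets hash, then iterates over all instances of that application. After adjusting each matching instance's state, it calls stop_droplet. In other words, the actual work of stopping an instance happens in stop_droplet, shown here: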

    def stop_droplet(instance)
      return if (instance[:stop_processed])
      send_exited_message(instance)
      username = instance[:secure_user]

      # if system thinks this process is running, make sure to execute stop script
      if instance[:pid] || [:STARTING, :RUNNING].include?(instance[:state])
        instance[:state] = :STOPPED unless instance[:state] == :CRASHED
        instance[:state_timestamp] = Time.now.to_i
        stop_script = File.join(instance[:dir], 'stop')
        insecure_stop_cmd = "#{stop_script} #{instance[:pid]} 2> /dev/null"
        stop_cmd =
          if @secure
            "su -c \"#{insecure_stop_cmd}\" #{username}"
          else
            insecure_stop_cmd
          end
        unless (RUBY_PLATFORM =~ /darwin/ and @secure)
          Bundler.with_clean_env { system(stop_cmd) }
        end
      end

      ………………

      cleanup_droplet(instance)
    end

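As the method shows, the stop an app request is carried out mainly by running the application's stop script. stop_script = File.join(instance[:dir], 'stop') locates the stop script; insecure_stop_cmd = "#{stop_script} #{instance[:pid]} 2> /dev/null" builds the command line; stop_cmd is then derived from it depending on the @secure variable, which wraps the command in su when it is set; finally, Bundler.with_clean_env { system(stop_cmd) } has the operating system run stop_cmd in a clean environment.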
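
To make the resulting command strings concrete, here is a small sketch that only builds the string and does not execute anything. The method name build_stop_cmd, its parameters, and the sample directory, pid, and user name are hypothetical; the string templates are the ones from stop_droplet.

```ruby
# Build the stop command string the way stop_droplet does.
# secure and username stand in for @secure and instance[:secure_user].
def build_stop_cmd(dir, pid, secure: false, username: nil)
  stop_script = File.join(dir, 'stop')
  insecure_stop_cmd = "#{stop_script} #{pid} 2> /dev/null"
  secure ? "su -c \"#{insecure_stop_cmd}\" #{username}" : insecure_stop_cmd
end

puts build_stop_cmd('/tmp/apps/myapp-0-f3a1c2', 4242)
puts build_stop_cmd('/tmp/apps/myapp-0-f3a1c2', 4242, secure: true, username: 'vcap-user-1')
```

In secure mode the whole insecure command is passed to su as a single quoted argument, so the stop script runs as the instance's own user.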

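What this article cares about most is the DEA's next step, cleanup_droplet, because that is the part that actually touches the application's directory in the DEA file system. Here is the cleanup_droplet method: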

    def cleanup_droplet(instance)
      remove_instance_resources(instance)
      @usage.delete(instance[:pid]) if instance[:pid]
      if instance[:state] != :CRASHED || instance[:flapping]
        if droplet = @droplets[instance[:droplet_id].to_s]
          droplet.delete(instance[:instance_id])
          @droplets.delete(instance[:droplet_id].to_s) if droplet.empty?
          schedule_snapshot
        end
        unless @disable_dir_cleanup
          @logger.debug("#{instance[:name]}: Cleaning up dir #{instance[:dir]}#{instance[:flapping]?' (flapping)':''}")
          EM.system("rm -rf #{instance[:dir]}")
        end
      else
        @logger.debug("#{instance[:name]}: Chowning crashed dir #{instance[:dir]}")
        EM.system("chown -R #{Process.euid}:#{Process.egid} #{instance[:dir]}")
      end
    end

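After checking the instance's state, if the state is not :CRASHED, or instance[:flapping] is truthy, the method deletes the instance's ID from the @droplets hash and then calls schedule_snapshot, whose implementation and purpose are analyzed later. The instance's directory is then deleted by the following code: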

    unless @disable_dir_cleanup
      @logger.debug("#{instance[:name]}: Cleaning up dir #{instance[:dir]}#{instance[:flapping]?' (flapping)':''}")
      EM.system("rm -rf #{instance[:dir]}")
    end

In other words, if @disable_dir_cleanup is true, the command rm -rf #{instance[:dir]} is not executed; if it is false, the command runs and the instance's entire directory is deleted. @disable_dir_cleanup is initialized in the initialize() method of the agent class from config['disable_dir_cleanup']; this setting is empty by default, which means false.
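
The branch structure of cleanup_droplet can be summarized as a small decision function. This is a sketch under assumptions: the name cleanup_action and its return symbols are made up for illustration, and the function only reports what would happen without touching the file system.

```ruby
# Decide what cleanup_droplet would do with an instance directory.
# Returns :remove, :keep or :chown; no file system access happens here.
def cleanup_action(state, flapping: false, disable_dir_cleanup: false)
  if state != :CRASHED || flapping
    disable_dir_cleanup ? :keep : :remove
  else
    :chown
  end
end

puts cleanup_action(:STOPPED)                             # remove
puts cleanup_action(:CRASHED)                             # chown
puts cleanup_action(:CRASHED, flapping: true)             # remove
puts cleanup_action(:STOPPED, disable_dir_cleanup: true)  # keep
```

The sketch makes one detail easy to see: a crashed instance keeps its directory unless it is flapping, regardless of the disable_dir_cleanup setting.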

Now for the schedule_snapshot method mentioned above. After cleanup_droplet removes the instance's entry from @droplets, it immediately calls schedule_snapshot, which is implemented as follows:

    def schedule_snapshot
      return if @snapshot_scheduled
      @snapshot_scheduled = true
      EM.next_tick { snapshot_app_state }
    end

Its main job is to arrange for snapshot_app_state to run on the next EventMachine tick. Here is that method:

    def snapshot_app_state
      start = Time.now
      tmp = File.new("#{@db_dir}/snap_#{Time.now.to_i}", 'w')
      tmp.puts(JSON.pretty_generate(@droplets))
      tmp.close
      FileUtils.mv(tmp.path, @app_state_file)
      @logger.debug("Took #{Time.now - start} to snapshot application state.")
      @snapshot_scheduled = false
    end

The method first records the current time, then creates a file with tmp = File.new("#{@db_dir}/snap_#{Time.now.to_i}", 'w'), serializes @droplets to JSON, and writes the JSON into the tmp file. After closing the file, it calls FileUtils.mv(tmp.path, @app_state_file) to rename the tmp file to @app_state_file, which is defined as @app_state_file = File.join(@db_dir, APP_STATE_FILE), where APP_STATE_FILE = 'applications.json'.
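
The write-then-rename pattern can be tried in isolation with a throwaway directory. In this sketch the method name snapshot_state, its parameters, and the sample droplet data are assumptions; the sequence of File.new, JSON.pretty_generate, close, and FileUtils.mv follows snapshot_app_state.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

# Write droplets as JSON to a temporary file in db_dir, then rename it over
# the state file, mirroring the write-then-move pattern of snapshot_app_state.
def snapshot_state(droplets, db_dir, app_state_file)
  tmp = File.new(File.join(db_dir, "snap_#{Time.now.to_i}"), 'w')
  tmp.puts(JSON.pretty_generate(droplets))
  tmp.close
  FileUtils.mv(tmp.path, app_state_file)
end

Dir.mktmpdir do |db_dir|
  state_file = File.join(db_dir, 'applications.json')
  snapshot_state({ '1' => { 'a1' => { 'state' => 'RUNNING' } } }, db_dir, state_file)
  puts File.read(state_file)
  # The temporary snap_ file is gone after the move.
  puts (Dir.entries(db_dir) - ['.', '..']).inspect
end
```

Writing to a separate file first means the state file is replaced in one step once the new content is fully written, so a reader is unlikely to observe a half-written applications.json.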

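Summary: when an app is stopped, the DEA does the following for each of the app's instances:

run the instance's stop script;
delete the instance's entry from @droplets;
schedule a snapshot that writes the remaining @droplets records to @app_state_file;
delete the instance's file directory.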

delete an app

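delete an app means that a user sends a request to delete an application. The Cloud Controller receives the request, first stops all instances of the application, and then deletes the application's droplet. All information about the application is therefore removed while handling this request, including the instances' directories on the DEAs.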

restart an app

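restart an app means that a user sends a request to restart an application. In vmc this is implemented as two requests: a stop followed by a start. The stop request stops the application on a DEA and deletes its directory; the start request downloads the application's droplet onto a DEA, which creates a directory, and then starts the application. Note that the DEA handling the stop and the DEA handling the start are not necessarily the same one. The stop is handled by the DEA where the application is currently running, while the DEA for the start is chosen by the Cloud Controller.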

app crashes

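app crashes refers to an application crashing while it is running. In other words, the DEA has no advance knowledge of the crash, which is very different from stop an app. On a real cluster, a crash can be simulated by forcibly killing the application's process.

Because the crash does not go through the DEA's stop path, the DEA does not run stop_droplet or cleanup_droplet at that moment, so in principle the application's directory remains in the DEA file system and continues to occupy disk space. Left unchecked, this would waste a noticeable amount of disk space over time. To deal with it, the DEA periodically cleans up crashed applications and deletes their directories.

Specifically, once the application has crashed, its former pid no longer exists (in theory). When the DEA periodically runs monitor_apps, it collects information about all processes and then calls monitor_apps_helper. For every instance of every application in @droplets, the helper compares the instance's pid with the pids actually present on the DEA node. If no match is found, the instance recorded in @droplets is no longer running and can be considered to have exited abnormally. The code is as follows: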

    def monitor_apps_helper(startup_check, ma_start, du_start, du_all_out, pid_info, user_info)
      …………
      @droplets.each_value do |instances|
        instances.each_value do |instance|
          if instance[:pid] && pid_info[instance[:pid]]
            …………
          else
            # App *should* no longer be running if we are here
            instance.delete(:pid)
            # Check to see if this is an orphan that is no longer running, clean up here if needed
            # since there will not be a cleanup proc or stop call associated with the instance..
            stop_droplet(instance) if (instance[:orphaned] && !instance[:stop_processed])
          end
        end
      end
      …………
    end
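
The pid comparison in the loop above can be sketched on its own. Here partition_by_liveness is a hypothetical helper, and pid_info is modeled as a plain hash keyed by pid, which is an assumption about its shape based only on the lookup pid_info[instance[:pid]] in the source.

```ruby
# Split tracked instances into those whose pid is still present in pid_info
# and those that are no longer running. pid_info is a hypothetical hash keyed
# by pid, standing in for the process table the DEA collects.
def partition_by_liveness(droplets, pid_info)
  alive = []
  gone = []
  droplets.each_value do |instances|
    instances.each_value do |instance|
      if instance[:pid] && pid_info[instance[:pid]]
        alive << instance[:instance_id]
      else
        gone << instance[:instance_id]
      end
    end
  end
  [alive, gone]
end

droplets = {
  '1' => {
    'a' => { instance_id: 'a', pid: 10 },
    'b' => { instance_id: 'b', pid: 11 },
    'c' => { instance_id: 'c' }
  }
}
alive, gone = partition_by_liveness(droplets, { 10 => {} })
puts alive.inspect
puts gone.inspect
```

An instance counts as gone both when its pid is missing from the process table and when it has no recorded pid at all.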

When the DEA finds that an instance is no longer actually running, it executes instance.delete(:pid) and then stop_droplet(instance) if (instance[:orphaned] && !instance[:stop_processed]). That is, if (instance[:orphaned] && !instance[:stop_processed]) is true, stop_droplet is called, and stop_droplet begins by calling send_exited_message:

    def stop_droplet(instance)
      # On stop from cloud controller, this can get called twice. Just make sure we are re-entrant..
      return if (instance[:stop_processed])

      # Unplug us from the system immediately, both the routers and health managers.
      send_exited_message(instance)

      ……

      cleanup_droplet(instance)
    end

The send_exited_message method is implemented as follows:

    def send_exited_message(instance)
      return if instance[:notified]
      unregister_instance_from_router(instance)
      unless instance[:exit_reason]
        instance[:exit_reason] = :CRASHED
        instance[:state] = :CRASHED
        instance[:state_timestamp] = Time.now.to_i
        instance.delete(:pid) unless instance_running? instance
      end
      send_exited_notification(instance)
      instance[:notified] = true
    end

It first unregisters the instance's URL from the router. An instance that terminated abnormally has no instance[:exit_reason] value, so, as expected, both :exit_reason and :state are set to :CRASHED.
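
The defaulting rule is the key point: an exit with no recorded reason is classified as a crash. The sketch below shows just that rule; mark_exit is a hypothetical name, and the router and notification calls from send_exited_message are deliberately left out.

```ruby
# Apply the defaulting rule from send_exited_message: an instance that exits
# without an :exit_reason is treated as crashed.
def mark_exit(instance, now = Time.now.to_i)
  unless instance[:exit_reason]
    instance[:exit_reason] = :CRASHED
    instance[:state] = :CRASHED
    instance[:state_timestamp] = now
  end
  instance
end

crashed = mark_exit({ state: :RUNNING }, 100)
stopped = mark_exit({ state: :STOPPED, exit_reason: :STOPPED }, 100)
puts crashed[:state]  # CRASHED
puts stopped[:state]  # STOPPED
```

This is why process_dea_stop and shutdown set :exit_reason before calling stop_droplet: doing so keeps a deliberate stop from being recorded as a crash.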

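After send_exited_message, stop_droplet finally calls cleanup_droplet. Inside cleanup_droplet, because the instance's :state is now :CRASHED (and assuming it is not flapping), the instance does not take the branch that deletes the directory; the chown command runs instead: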

    def cleanup_droplet(instance)
      ……
      if instance[:state] != :CRASHED || instance[:flapping]
        ……
      else
        @logger.debug("#{instance[:name]}: Chowning crashed dir #{instance[:dir]}")
        EM.system("chown -R #{Process.euid}:#{Process.egid} #{instance[:dir]}")
      end
    end

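At this point the crashed instance has only been marked :CRASHED; its directory still exists in the DEA file system and has not been deleted.

Leaving a crashed instance's directory around forever would be unreasonable, and the designers of Cloud Foundry took this into account. When the DEA starts running, it registers a periodic task, crashes_reaper: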

    EM.add_periodic_timer(CRASHES_REAPER_INTERVAL) { crashes_reaper }

CRASHES_REAPER_INTERVAL is set to 3600, so crashes_reaper runs once every hour. Here is its implementation:

    def crashes_reaper
      @droplets.each_value do |instances|
        # delete all crashed instances that are older than an hour
        instances.delete_if do |_, instance|
          delete_instance = instance[:state] == :CRASHED && Time.now.to_i - instance[:state_timestamp] > CRASHES_REAPER_TIMEOUT
          if delete_instance
            @logger.debug("Crashes reaper deleted: #{instance[:instance_id]}")
            EM.system("rm -rf #{instance[:dir]}") unless @disable_dir_cleanup
          end
          delete_instance
        end
      end

      @droplets.delete_if do |_, droplet|
        droplet.empty?
      end
    end

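The code is simple: if an instance is in the :CRASHED state and has been in that state for longer than CRASHES_REAPER_TIMEOUT, its entry is removed from @droplets and, unless @disable_dir_cleanup is set, its directory is deleted.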
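
The age check is easy to miss, so here is a sketch of the reaping rule that returns the directories it would delete instead of running rm -rf. The name reap_crashed and the default timeout of 3600 seconds are assumptions for illustration; the source shows only the code comment "older than an hour", not the definition of CRASHES_REAPER_TIMEOUT.

```ruby
# Remove crashed instances older than the timeout from droplets and return
# the directories that would be deleted. The 3600 default is an assumption.
def reap_crashed(droplets, now, timeout = 3600)
  reaped_dirs = []
  droplets.each_value do |instances|
    instances.delete_if do |_, instance|
      expired = instance[:state] == :CRASHED && now - instance[:state_timestamp] > timeout
      reaped_dirs << instance[:dir] if expired
      expired
    end
  end
  # Drop apps that have no instances left, as crashes_reaper does.
  droplets.delete_if { |_, instances| instances.empty? }
  reaped_dirs
end

droplets = {
  '1' => {
    'old' => { state: :CRASHED, state_timestamp: 1_000, dir: '/tmp/apps/old' },
    'new' => { state: :CRASHED, state_timestamp: 9_000, dir: '/tmp/apps/new' }
  }
}
puts reap_crashed(droplets, 10_000).inspect
puts droplets['1'].keys.inspect
```

A recently crashed instance survives the pass, which leaves its directory available for inspection until the timeout has elapsed.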

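Summary: when an application instance crashes, it can no longer be accessed, but its directory remains in the file system of the DEA node. The DEA marks the instance as :CRASHED, and the crashes_reaper task, which runs every hour, later deletes the directory.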

stop DEA

stop DEA means that a developer operating Cloud Foundry stops the DEA component using the scripts Cloud Foundry provides. The stop arrives at the DEA process as a signal, which the DEA traps:

    ['TERM', 'INT', 'QUIT'].each { |s| trap(s) { shutdown() } }

On receiving one of these signals, the DEA runs the shutdown method:

    def shutdown()
      @shutting_down = true
      @logger.info('Shutting down..')
      @droplets.each_pair do |id, instances|
        @logger.debug("Stopping app #{id}")
        instances.each_value do |instance|
          # skip any crashed instances
          instance[:exit_reason] = :DEA_SHUTDOWN unless instance[:state] == :CRASHED
          stop_droplet(instance)
        end
      end

      # Allows messages to get out.
      EM.add_timer(0.25) do
        snapshot_app_state
        @file_viewer_server.stop!
        NATS.stop { EM.stop }
        @logger.info('Bye..')
        @pid_file.unlink()
      end
    end

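As the code shows, shutdown walks every instance of every application in @droplets, sets :exit_reason to :DEA_SHUTDOWN for each instance that is not in the :CRASHED state, and then calls stop_droplet, which in turn calls cleanup_droplet. The directories of these instances are therefore all deleted. Once that is done, the DEA process exits. The snapshot written to applications.json on the way out no longer contains the instances that were running normally.

Summary: stopping a DEA first stops all normally running instances and then deletes their directories.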
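
The way shutdown tags instances before stopping them can be sketched as follows. mark_for_shutdown is a hypothetical helper that performs only the :exit_reason assignment from the loop in shutdown; it does not stop anything.

```ruby
# Tag every non-crashed instance with :DEA_SHUTDOWN, as the loop in shutdown does.
# Crashed instances keep whatever exit reason they already have.
def mark_for_shutdown(droplets)
  droplets.each_value do |instances|
    instances.each_value do |instance|
      instance[:exit_reason] = :DEA_SHUTDOWN unless instance[:state] == :CRASHED
    end
  end
  droplets
end

droplets = {
  '1' => {
    'a' => { state: :RUNNING },
    'b' => { state: :CRASHED, exit_reason: :CRASHED }
  }
}
mark_for_shutdown(droplets)
puts droplets['1']['a'][:exit_reason]  # DEA_SHUTDOWN
puts droplets['1']['b'][:exit_reason]  # CRASHED
```

Since crashed instances stay in the :CRASHED state, cleanup_droplet sends them down the chown branch and their directories survive the shutdown.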

start DEA

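start DEA means that a developer operating Cloud Foundry starts the DEA component using the scripts Cloud Foundry provides. When the DEA starts, the important part is the creation and running of the agent object. The following lines from the agent's run code are the ones that concern instance directories: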

    # Recover existing application state.
    recover_existing_droplets
    delete_untracked_instance_dirs

recover_existing_droplets runs first. Its implementation is:

    def recover_existing_droplets
      …………
      File.open(@app_state_file, 'r') { |f| recovered = Yajl::Parser.parse(f) }

      # Whip through and reconstruct droplet_ids and instance symbols correctly for droplets, state, etc..
      recovered.each_pair do |app_id, instances|
        @droplets[app_id.to_s] = instances
        instances.each_pair do |instance_id, instance|
          …………
        end
      end

      @recovered_droplets = true

      # Go ahead and do a monitoring pass here to detect app state
      monitor_apps(true)
      send_heartbeat
      schedule_snapshot
    end

This method rebuilds @droplets from the contents of @app_state_file and then calls monitor_apps, send_heartbeat, and schedule_snapshot.
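
The code comment about reconstructing "instance symbols" hints at why recovery needs a conversion pass: JSON has no symbol type, so symbol keys and values come back as strings. The sketch below demonstrates this with Ruby's standard json library rather than Yajl, which the DEA uses. restore_instance is a hypothetical helper, and the choice of which fields to convert is an assumption, because the source elides that part of the loop.

```ruby
require 'json'

# Convert string keys back to symbols and the :state value back to a symbol.
# Which fields the DEA actually converts is elided in the source, so this
# selection is an assumption made for illustration.
def restore_instance(raw)
  instance = {}
  raw.each { |key, value| instance[key.to_sym] = value }
  instance[:state] = instance[:state].to_sym if instance[:state]
  instance
end

saved = JSON.generate({ state: :RUNNING, pid: 4242 })
parsed = JSON.parse(saved)
puts parsed.key?('state')                  # true: the key is now a string
puts restore_instance(parsed)[:state] == :RUNNING
```

Without such a pass, checks like instance[:state] == :CRASHED would never match after a restart, because the recovered hash would hold the string "CRASHED" under the string key "state".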

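Next, delete_untracked_instance_dirs runs; it deletes any instance directory that has no matching entry in @droplets.

Summary: if the DEA previously exited normally and all crashed instances had been cleaned up beforehand, applications.json contains no entries and the apps directory contains no instance directories, so this method deletes nothing. If crashed instances had not yet been removed when the DEA exited normally, they are still present after startup and wait for crashes_reaper to delete them. If the DEA crashed, so that the contents of the apps directory or applications.json no longer match the actual instances, the directories of the unmatched instances are deleted.

The implementation is as follows: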

    # Removes any instance dirs without a corresponding instance entry in @droplets
    # NB: This is run once at startup, so not using EM.system to perform the rm is fine.
    def delete_untracked_instance_dirs
      tracked_instance_dirs = Set.new
      for droplet_id, instances in @droplets
        for instance_id, instance in instances
          tracked_instance_dirs << instance[:dir]
        end
      end

      all_instance_dirs = Set.new(Dir.glob(File.join(@apps_dir, '*')))
      to_remove = all_instance_dirs - tracked_instance_dirs
      for dir in to_remove
        @logger.warn("Removing instance dir '#{dir}', doesn't correspond to any instance entry.")
        FileUtils.rm_rf(dir)
      end
    end
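
The set difference at the heart of this method can be exercised safely in a temporary directory. In the sketch below, untracked_instance_dirs is a hypothetical helper that returns the directories to remove, and the directory names and droplet data are sample values.

```ruby
require 'set'
require 'tmpdir'
require 'fileutils'

# Return the directories under apps_dir that no tracked instance refers to.
def untracked_instance_dirs(droplets, apps_dir)
  tracked = Set.new
  droplets.each_value do |instances|
    instances.each_value { |instance| tracked << instance[:dir] }
  end
  Set.new(Dir.glob(File.join(apps_dir, '*'))) - tracked
end

Dir.mktmpdir do |apps_dir|
  kept   = File.join(apps_dir, 'myapp-0-a1')
  orphan = File.join(apps_dir, 'oldapp-0-b2')
  FileUtils.mkdir_p([kept, orphan])
  droplets = { '1' => { 'a1' => { dir: kept } } }

  untracked_instance_dirs(droplets, apps_dir).each { |dir| FileUtils.rm_rf(dir) }
  puts Dir.glob(File.join(apps_dir, '*')).map { |d| File.basename(d) }.inspect
end
```

Only the directory that is still referenced from the droplets hash remains; the comparison is by exact path string, so tracked paths must be recorded in the same form that Dir.glob returns.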

DEA crashes

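DEA crashes means that the DEA terminates abnormally while running. This can be simulated by forcibly killing the DEA process.

When the DEA process exits, the running application instances are not directly affected, so their directories still exist and the applications remain accessible. Restarting the DEA afterwards follows exactly the same steps as start DEA. If all previously running applications are still running at restart, recover_existing_droplets, through monitor_apps, brings every instance back under monitoring; send_heartbeat then re-establishes communication with the other components, and schedule_snapshot persists the recovered state. If some of the previously running instances have crashed in the meantime, the processing that follows monitor_apps deletes their directories.

That concludes my analysis of how file directories change over the lifecycle of an application instance in Cloud Foundry.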

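About the author:

Sun Hongliang is a software engineer at DAOCLOUD. For the past two years he has focused on PaaS-related knowledge and technology within cloud computing. He firmly believes that lightweight container virtualization will deeply influence the PaaS field and may even determine the future direction of PaaS technology.

Please credit the source when reposting.

This document is largely based on my own understanding, so it certainly contains shortcomings and errors. I hope it helps people who are looking into how file directories change over the application instance lifecycle in Cloud Foundry. If you are interested in this area and have better ideas or suggestions, please contact me.
My email: allen.sun@daocloud.io

Sina Weibo: @莲子弗如清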
