Things to Watch Out for When Collecting Data with Scribe

2013-10-08 13:47

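While using Scribe we ran into a very strange problem: after running for a while, the server hosting Scribe's central collection node would stop responding. SSH logins failed even though the host still answered ping, and only a reboot restored normal service. While looking for the cause, we found the following errors in /var/log/messages: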

Oct  4 18:31:07 aggr01 automount[4811]: expire_proc: expire thread create for /net failed
Oct  4 18:31:07 aggr01 automount[4811]: expire_proc: expire thread create for /misc failed
Oct  4 18:32:22 aggr01 automount[4811]: expire_proc: expire thread create for /net failed
Oct  4 18:32:22 aggr01 automount[4811]: expire_proc: expire thread create for /misc failed
Oct  4 18:33:37 aggr01 automount[4811]: expire_proc: expire thread create for /net failed
Oct  4 18:33:37 aggr01 automount[4811]: expire_proc: expire thread create for /misc failed
Oct  4 18:34:52 aggr01 automount[4811]: expire_proc: expire thread create for /net failed
Oct  4 18:34:52 aggr01 automount[4811]: expire_proc: expire thread create for /misc failed
Oct  4 18:36:07 aggr01 automount[4811]: expire_proc: expire thread create for /net failed

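After some research, we found that this happens when the number of threads running on the system exceeds the maximum set by a kernel parameter (cat /proc/sys/kernel/threads-max; on our production systems this value is 400,000). Looking more closely at how we use Scribe, it turned out that Scribe's own mechanism really does cause this problem.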
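To see how close a host is to that limit, the check can be scripted. The following is a minimal sketch, not something from the original setup: the function name thread_headroom is made up for illustration, and it assumes a Linux host where the fourth field of /proc/loadavg has the form "runnable/total" scheduling entities.

```python
from pathlib import Path


def thread_headroom(loadavg_path="/proc/loadavg",
                    threads_max_path="/proc/sys/kernel/threads-max"):
    """Return (existing, limit, fraction_used) for the whole system.

    Hypothetical helper for illustration; the paths are parameters so the
    function can also be pointed at copies of the files.
    """
    # The fourth field of /proc/loadavg looks like "2/1375":
    # runnable entities / total scheduling entities (processes and threads).
    fields = Path(loadavg_path).read_text().split()
    existing = int(fields[3].split("/")[1])
    limit = int(Path(threads_max_path).read_text().strip())
    return existing, limit, existing / limit


# Example use on the collector host:
#   existing, limit, used = thread_headroom()
#   print(f"{existing} of {limit} threads in use ({used:.1%})")
```

Running this periodically and alerting when the fraction gets high would flag the problem before the host stops accepting SSH logins.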

Business scenario:

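All data is partitioned by product, date, and data type; that is, every product and every data type uses a new category each day to receive data. The most direct consequence is that the number of categories per day is very large (roughly a thousand or more). The Scribe configuration file contains the setting new_thread_per_category=yes, which creates a new thread to handle each category, and Scribe has no mechanism for releasing old threads. (Flume does handle this case: it closes files that have not been updated for a while and frees the system resources.) As a result, Scribe gradually pushes the system thread count up over time (check the thread count with pstack 11045 | grep Thread | wc -l, where 11045 is the Scribe process ID) until it exceeds the system limit, at which point SSH connections fail and all externally facing services become unavailable.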
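The per-process thread count can also be read from /proc without attaching to the process the way pstack does. This is a small sketch under the assumption of a Linux /proc layout; count_threads is a hypothetical helper name, and 11045 is just the example PID used above.

```python
import os


def count_threads(pid, proc_root="/proc"):
    """Count the threads of one process by listing <proc_root>/<pid>/task.

    Hypothetical helper for illustration only.
    """
    # Each thread of the process appears as one numeric directory under task/.
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())


# Example use: count_threads(11045)
```

Logging this number once an hour makes the steady growth caused by new categories easy to see.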

New problems encountered while fixing this:

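The most direct fix is to have all data of the same data type (the number of data types is fixed) handled by a single thread. The store configuration can be changed like this: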

port=1465
#max_msg_per_second=2000000
max_queue_size=2000000
check_interval=3
new_thread_per_category=no
num_thrift_server_threads=32

###############################store for XXX###
<store>
category=xxx_*
type=buffer
target_write_size=20480
buffer_send_rate=2
retry_interval=30
retry_interval_range=6
use_hostname_sub_directory=no
replay_buffer=yes
<primary>
type=file
fs_type=std
file_path=/data/scribe/xxx
create_symlink=no
#base_filename=thisisoverwriten
max_size=600000000
add_newlines=1
max_write_size=16384
</primary>
<secondary>
type=file
fs_type=std
file_path=/data/scribe3/xxx0
#base_filename=thisisoverwriten
max_size=3000000
max_write_size=4096
</secondary>
</store>

###############################END store for XXX###

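This does make all data of type xxx be handled by a single thread. In practice, however, the category information is lost when the data lands on the local disk: only xxx_ remains, and the product and date that followed it are gone. That is clearly unacceptable, because the data's metadata is lost. So in the end we had to abandon this approach and settle for a compromise: run a reload on Scribe once a day. The operation is meant to reload the configuration file, but in the process Scribe also closes all open file descriptors, so we can exploit this "side effect" and use a scheduled job to reload Scribe daily: scribe_ctrl reload. This amounts to releasing resources on a daily schedule. The best solution would still be to add a Flume-like timeout mechanism at the implementation level, but development of the Scribe project appears to have stalled, so for new projects we recommend collecting logs with the more mature Flume.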
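The daily reload can be wrapped in a small script so that failures are at least reported. This is only a sketch: reload_scribe is a hypothetical name, the only command taken from the article is scribe_ctrl reload, and any extra arguments a particular installation needs are not covered here and would have to be passed in through the command parameter.

```python
import subprocess
import sys


def reload_scribe(command=("scribe_ctrl", "reload"), timeout=60):
    """Run the reload command and return its exit code.

    Hypothetical wrapper for illustration; the default command is the one
    quoted in the article.
    """
    result = subprocess.run(list(command), capture_output=True,
                            text=True, timeout=timeout)
    if result.returncode != 0:
        # Surface the error so the scheduler's mail or log captures it.
        print(f"reload failed: {result.stderr.strip()}", file=sys.stderr)
    return result.returncode


# A script that calls reload_scribe() could be scheduled with a crontab
# entry such as (script path and time are assumptions):
#   0 4 * * * /usr/bin/python3 /opt/scripts/reload_scribe.py
```

Picking a quiet hour for the job limits how much traffic arrives while files are being reopened.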
