Flume Study Notes (8): Custom Interceptors

2016-12-09 22:07

We return to the requirement from Part 7, but this time we implement it a different way: with an interceptor.

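First, recall that the spooldir source can write the file name into each event's header under the key basename. Suppose an interceptor could intercept the event, read the value of that key from the header, split it into three segments, and put each segment back into the header. That would satisfy the requirement.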

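Unfortunately, Flume does not provide an interceptor that extracts from a header. It does have one that extracts from the body, RegexExtractorInterceptor, which looks quite powerful. Here is the example from the official documentation: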

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used

a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)

a1.sources.r1.interceptors.i1.serializers = s1 s2 s3

a1.sources.r1.interceptors.i1.serializers.s1.name = one

a1.sources.r1.interceptors.i1.serializers.s2.name = two

a1.sources.r1.interceptors.i1.serializers.s3.name = three

The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3

In short: with this configuration, if the event body contains 1:2:3.4foobar5, the capture groups of the regular expression are extracted from it and written into the header.
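To make the mapping concrete, here is a minimal plain-Java sketch of the idea, using only java.util.regex. The class RegexHeaderSketch and its extract method are hypothetical names made up for illustration; this is not Flume's implementation, only the same group-to-header-name pairing applied to the documentation's sample input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Flume.
class RegexHeaderSketch {

  // Pairs capture group i (1-based) with names[i - 1], stopping when either
  // the groups or the names run out.
  static Map<String, String> extract(String input, String regex, String... names) {
    Map<String, String> headers = new LinkedHashMap<>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    if (matcher.find()) {
      int count = Math.min(matcher.groupCount(), names.length);
      for (int i = 0; i < count; i++) {
        headers.put(names[i], matcher.group(i + 1));
      }
    }
    return headers;
  }

  public static void main(String[] args) {
    Map<String, String> headers =
        extract("1:2:3.4foobar5", "(\\d):(\\d):(\\d)", "one", "two", "three");
    System.out.println(headers); // prints {one=1, two=2, three=3}
  }
}
```

The body is left untouched; only the header map gains entries, which matches the behavior the documentation describes.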

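So I decided to borrow this interceptor, figuring that a small change to the code, from matching the body to matching a specific key in the header, would be enough. The source turned out to be tidy and easy to modify. Below is the new interceptor I added, RegexExtractorExtInterceptor: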


package com.besttone.flume;

import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Charsets;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import com.google.common.collect.Lists;

/**
 * Interceptor that extracts matches using a specified regular expression and
 * appends the matches to the event headers using the specified serializers.
 * Note that all regular expression matching occurs through Java's built-in
 * java.util.regex package. Properties:
 *
 * regex: The regex to use
 *
 * serializers: Specifies the group the serializer will be applied to, and the
 * name of the header that will be added. If no serializer is specified for a
 * group the default {@link RegexExtractorInterceptorPassThroughSerializer} will
 * be used
 *
 * Sample config:
 *
 * agent.sources.r1.channels = c1
 * agent.sources.r1.type = SEQ
 * agent.sources.r1.interceptors = i1
 * agent.sources.r1.interceptors.i1.type = REGEX_EXTRACTOR
 * agent.sources.r1.interceptors.i1.regex = (WARNING)|(ERROR)|(FATAL)
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.type = com.blah.SomeSerializer
 * agent.sources.r1.interceptors.i1.serializers.s1.name = warning
 * agent.sources.r1.interceptors.i1.serializers.s2.type = org.apache.flume.interceptor.RegexExtractorInterceptorTimestampSerializer
 * agent.sources.r1.interceptors.i1.serializers.s2.name = error
 * agent.sources.r1.interceptors.i1.serializers.s2.dateFormat = yyyy-MM-dd
 *
 * <pre>
 * Example 1:
 *
 * EventBody: 1:2:3.4foobar5
 *
 * Configuration:
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 s3
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 * agent.sources.r1.interceptors.i1.serializers.s3.name = three
 *
 * results in an event with the following
 *
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2, three=>3
 *
 * Example 2:
 *
 * EventBody: 1:2:3.4foobar5
 *
 * Configuration:
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 *
 * results in an event with the following
 *
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2
 * </pre>
 */
public class RegexExtractorExtInterceptor implements Interceptor {

  static final String REGEX = "regex";
  static final String SERIALIZERS = "serializers";

  // Added code: begin
  static final String EXTRACTOR_HEADER = "extractorHeader";
  static final boolean DEFAULT_EXTRACTOR_HEADER = false;
  static final String EXTRACTOR_HEADER_KEY = "extractorHeaderKey";
  // Added code: end

  private static final Logger logger = LoggerFactory
      .getLogger(RegexExtractorExtInterceptor.class);

  private final Pattern regex;
  private final List<NameAndSerializer> serializers;

  // Added code: begin
  private final boolean extractorHeader;
  private final String extractorHeaderKey;
  // Added code: end

  private RegexExtractorExtInterceptor(Pattern regex,
      List<NameAndSerializer> serializers, boolean extractorHeader,
      String extractorHeaderKey) {
    this.regex = regex;
    this.serializers = serializers;
    this.extractorHeader = extractorHeader;
    this.extractorHeaderKey = extractorHeaderKey;
  }

  @Override
  public void initialize() {
    // NO-OP...
  }

  @Override
  public void close() {
    // NO-OP...
  }


  @Override
  public Event intercept(Event event) {
    String tmpStr;
    if (extractorHeader) {
      // Match against the value of the configured header key.
      tmpStr = event.getHeaders().get(extractorHeaderKey);
      if (tmpStr == null) {
        // The header is absent, so there is nothing to match against.
        // Pass the event through unchanged instead of failing on null.
        return event;
      }
    } else {
      // Original behavior: match against the event body.
      tmpStr = new String(event.getBody(), Charsets.UTF_8);
    }

    Matcher matcher = regex.matcher(tmpStr);
    Map<String, String> headers = event.getHeaders();
    if (matcher.find()) {
      for (int group = 0, count = matcher.groupCount(); group < count; group++) {
        int groupIndex = group + 1;
        if (groupIndex > serializers.size()) {
          if (logger.isDebugEnabled()) {
            logger.debug(
                "Skipping group {} to {} due to missing serializer",
                group, count);
          }
          break;
        }
        NameAndSerializer serializer = serializers.get(group);
        if (logger.isDebugEnabled()) {
          logger.debug("Serializing {} using {}",
              serializer.headerName, serializer.serializer);
        }
        headers.put(serializer.headerName, serializer.serializer
            .serialize(matcher.group(groupIndex)));
      }
    }
    return event;
  }


  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> intercepted = Lists.newArrayListWithCapacity(events.size());
    for (Event event : events) {
      Event interceptedEvent = intercept(event);
      if (interceptedEvent != null) {
        intercepted.add(interceptedEvent);
      }
    }
    return intercepted;
  }


  public static class Builder implements Interceptor.Builder {

    private Pattern regex;
    private List<NameAndSerializer> serializerList;

    // Added code: begin
    private boolean extractorHeader;
    private String extractorHeaderKey;
    // Added code: end

    private final RegexExtractorInterceptorSerializer defaultSerializer =
        new RegexExtractorInterceptorPassThroughSerializer();

    @Override
    public void configure(Context context) {
      String regexString = context.getString(REGEX);
      Preconditions.checkArgument(!StringUtils.isEmpty(regexString),
          "Must supply a valid regex string");

      regex = Pattern.compile(regexString);
      regex.pattern();
      regex.matcher("").groupCount();
      configureSerializers(context);

      // Added code: begin
      extractorHeader = context.getBoolean(EXTRACTOR_HEADER,
          DEFAULT_EXTRACTOR_HEADER);

      if (extractorHeader) {
        extractorHeaderKey = context.getString(EXTRACTOR_HEADER_KEY);
        Preconditions.checkArgument(
            !StringUtils.isEmpty(extractorHeaderKey),
            "Must specify the header key to extract from");
      }
      // Added code: end
    }


    private void configureSerializers(Context context) {
      String serializerListStr = context.getString(SERIALIZERS);
      Preconditions.checkArgument(
          !StringUtils.isEmpty(serializerListStr),
          "Must supply at least one name and serializer");

      String[] serializerNames = serializerListStr.split("\\s+");

      Context serializerContexts = new Context(
          context.getSubProperties(SERIALIZERS + "."));

      serializerList = Lists
          .newArrayListWithCapacity(serializerNames.length);
      for (String serializerName : serializerNames) {
        Context serializerContext = new Context(
            serializerContexts.getSubProperties(serializerName + "."));
        String type = serializerContext.getString("type", "DEFAULT");
        String name = serializerContext.getString("name");
        Preconditions.checkArgument(!StringUtils.isEmpty(name),
            "Supplied name cannot be empty.");

        if ("DEFAULT".equals(type)) {
          serializerList.add(new NameAndSerializer(name,
              defaultSerializer));
        } else {
          serializerList.add(new NameAndSerializer(name,
              getCustomSerializer(type, serializerContext)));
        }
      }
    }


    private RegexExtractorInterceptorSerializer getCustomSerializer(
        String clazzName, Context context) {
      try {
        RegexExtractorInterceptorSerializer serializer =
            (RegexExtractorInterceptorSerializer) Class
                .forName(clazzName).newInstance();
        serializer.configure(context);
        return serializer;
      } catch (Exception e) {
        logger.error("Could not instantiate event serializer.", e);
        Throwables.propagate(e);
      }
      return defaultSerializer;
    }


    @Override
    public Interceptor build() {
      Preconditions.checkArgument(regex != null,
          "Regex pattern was misconfigured");
      Preconditions.checkArgument(serializerList.size() > 0,
          "Must supply a valid group match id list");
      return new RegexExtractorExtInterceptor(regex, serializerList,
          extractorHeader, extractorHeaderKey);
    }
  }

  static class NameAndSerializer {
    private final String headerName;
    private final RegexExtractorInterceptorSerializer serializer;

    public NameAndSerializer(String headerName,
        RegexExtractorInterceptorSerializer serializer) {
      this.headerName = headerName;
      this.serializer = serializer;
    }
  }
}

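A brief note on what changed.

Two configuration parameters were added:

extractorHeader: whether to extract from the header. The default is false, which keeps the behavior of the original interceptor and extracts from the event body.

extractorHeaderKey: the header key whose value is matched against the regex. This parameter is required when extractorHeader is true.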
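How the two parameters interact can be sketched in isolation. The snippet below is a simplified stand-in that uses a plain Map and byte array in place of a Flume Event; ExtractionTargetSketch and selectInput are hypothetical names for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; a Map and byte[] stand in for a Flume Event.
class ExtractionTargetSketch {

  // Returns the text the regex would be matched against,
  // or null when header mode is on but the header key is absent.
  static String selectInput(Map<String, String> headers, byte[] body,
      boolean extractorHeader, String extractorHeaderKey) {
    if (extractorHeader) {
      return headers.get(extractorHeaderKey);
    }
    return new String(body, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    headers.put("basename", "a.log.2014-07-31");
    byte[] body = "some log line".getBytes(StandardCharsets.UTF_8);

    // Header mode: the file name is matched.
    System.out.println(selectInput(headers, body, true, "basename"));
    // Default mode: the event body is matched, as in the original interceptor.
    System.out.println(selectInput(headers, body, false, null));
  }
}
```

The null return is worth noticing: an event that arrives without the configured header has nothing to match, so the caller needs to handle that case rather than pass null to the regex matcher.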

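Following the approach from Part 7, I packaged the class into a jar and placed it, as a Flume plugin, under /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib, then restarted Flume so that the interceptor is loaded onto the classpath.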

The final flume.conf is as follows:


tier1.sources=source1
tier1.channels=channel1
tier1.sinks=sink1
tier1.sources.source1.type=spooldir
tier1.sources.source1.spoolDir=/opt/logs
tier1.sources.source1.fileHeader=true
tier1.sources.source1.basenameHeader=true
tier1.sources.source1.interceptors=i1
tier1.sources.source1.interceptors.i1.type=com.besttone.flume.RegexExtractorExtInterceptor$Builder
tier1.sources.source1.interceptors.i1.regex=(.*)\\.(.*)\\.(.*)
tier1.sources.source1.interceptors.i1.extractorHeader=true
tier1.sources.source1.interceptors.i1.extractorHeaderKey=basename
tier1.sources.source1.interceptors.i1.serializers=s1 s2 s3
tier1.sources.source1.interceptors.i1.serializers.s1.name=one
tier1.sources.source1.interceptors.i1.serializers.s2.name=two
tier1.sources.source1.interceptors.i1.serializers.s3.name=three
tier1.sources.source1.channels=channel1
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.channel=channel1
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}
tier1.sinks.sink1.hdfs.round=true
tier1.sinks.sink1.hdfs.roundValue=10
tier1.sinks.sink1.hdfs.roundUnit=minute
tier1.sinks.sink1.hdfs.fileType=DataStream
tier1.sinks.sink1.hdfs.writeFormat=Text
tier1.sinks.sink1.hdfs.rollInterval=0
tier1.sinks.sink1.hdfs.rollSize=10240
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.idleTimeout=60
tier1.channels.channel1.type=memory
tier1.channels.channel1.capacity=10000
tier1.channels.channel1.transactionCapacity=1000
tier1.channels.channel1.keep-alive=30

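I switched the source type back to the built-in spooldir instead of the custom source from the previous part, then added an interceptor i1 whose type is the custom interceptor com.besttone.flume.RegexExtractorExtInterceptor$Builder. The regular expression splits the file name on "." into three parts and stores them in the header under the keys one, two, and three. For a file named a.log.2014-07-31, the interceptor adds three keys to the header: one=a, two=log, three=2014-07-31. With tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}, this satisfies exactly the same requirement as in Part 7.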
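The file-name split and the resulting HDFS path can be checked with a small plain-Java sketch. BasenamePathSketch, splitBasename, and expand are hypothetical names; expand is only a simplified stand-in for the sink's %{header} substitution, not Flume's own code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Flume.
class BasenamePathSketch {

  // Splits a file name such as a.log.2014-07-31 into the three header values,
  // using the same regex as the interceptor configuration.
  static Map<String, String> splitBasename(String basename) {
    Map<String, String> headers = new LinkedHashMap<>();
    Matcher m = Pattern.compile("(.*)\\.(.*)\\.(.*)").matcher(basename);
    if (m.find()) {
      headers.put("one", m.group(1));
      headers.put("two", m.group(2));
      headers.put("three", m.group(3));
    }
    return headers;
  }

  // Simplified stand-in for %{name} substitution; unknown names become empty.
  static String expand(String template, Map<String, String> headers) {
    Matcher m = Pattern.compile("%\\{(\\w+)\\}").matcher(template);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String value = headers.getOrDefault(m.group(1), "");
      m.appendReplacement(out, Matcher.quoteReplacement(value));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, String> headers = splitBasename("a.log.2014-07-31");
    System.out.println(headers); // prints {one=a, two=log, three=2014-07-31}
    System.out.println(expand(
        "hdfs://master68:8020/flume/events/%{one}/%{three}", headers));
    // prints hdfs://master68:8020/flume/events/a/2014-07-31
  }
}
```

One thing to keep in mind: because (.*) is greedy, a file name with more than two dots, such as a.b.log.2014-07-31, puts a.b into the first group, so the naming convention of the spooled files matters.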

As you can see, a custom interceptor costs far less to write than a custom source: adding a single class was enough to implement the feature.
