我有一个用例,其中单个Kafka主题中出现多种类型的Avro记录,因为我们在架构注册表中向主题起诉TopicRecordNameStrategy。
现在,我已经写了一个消费者来阅读该主题并构建GenericRecord的数据流。现在,由于该流包含不同类型的架构记录,因此我无法以木地板格式将此流下沉到hdfs / s3。因此,我通过应用过滤器并创建不同的流,然后分别下沉每个流,来过滤每种类型的不同记录。
下面是我正在使用的代码-”
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;
public class EventStreamProcessor {
private static final Logger LOGGER = LoggerFactory.getLogger(EventStreamProcessor.class);
private static final String KAFKA_TOPICS = "events";
private static Properties properties = new Properties();
private static String schemaRegistryUrl = "";
private static CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);
public static void main(String args[]) throws Exception {
ParameterTool para = ParameterTool.fromArgs(args);
InputStream inputStreamProperties = EventStreamProcessor.class.getClassLoader().getResourceAsStream(para.get("properties"));
properties.load(inputStreamProperties);
int numSlots = para.getInt("numslots", 1);
int parallelism = para.getInt("parallelism");
String outputPath = para.get("output");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(parallelism);
env.getConfig().enableForceAvro();
env.enableCheckpointing(60000);
ExecutionConfig executionConfig = env.getConfig();
executionConfig.disableForceKryo();
executionConfig.enableForceAvro();
FlinkKafkaConsumer kafkaConsumer010 = new FlinkKafkaConsumer(KAFKA_TOPICS,
new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
properties);
Path path = new Path(outputPath);
DataStream<GenericRecord> dataStream = env.addSource(kafkaConsumer010).name("bike_flow_source");
try {
final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_list")))
.withBucketAssigner(new EventTimeBucketAssigner())
.build();
dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
if (genericRecord.get(Constants.EVENT_NAME).toString().equals("search_list")) {
return true;
}
return false;
}).addSink(sink).name("search_list_sink").setParallelism(parallelism);
final StreamingFileSink<GenericRecord> sink_search_details = StreamingFileSink.forBulkFormat
(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_details")))
.withBucketAssigner(new EventTimeBucketAssigner())
.build();
dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
if (genericRecord.get(Constants.EVENT_NAME).toString().equals("search_details")) {
return true;
}
return false;
}).addSink(sink_search_details).name("search_details_sink").setParallelism(parallelism);
final StreamingFileSink<GenericRecord> search_list = StreamingFileSink.forBulkFormat
(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_list")))
.withBucketAssigner(new EventTimeBucketAssigner())
.build();
dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
if (genericRecord.get(Constants.EVENT_NAME).toString().equals("search_list")) {
return true;
}
return false;
}).addSink(search_list).name("search_list_sink").setParallelism(parallelism);
} catch (Exception e) {
LOGGER.info("exception in sinking event");
}
env.execute("event_stream_processor");
}
}
``
因此,对于我来说,这看起来效率很低1.每次添加新事件时,我都必须进行代码更改。2.我必须通过过滤器和全部创建多个流。
因此,请向我建议是否可以在不创建多个流的情况下编写GenericRecord流。如果不是,如何使用某些配置文件使此代码更具动态性,因此每次不必为新事件再次编写相同的代码?
请提出解决此问题的更好方法。
那么,您可以简单地将可能的消息类型列表作为config参数传递,然后对其进行简单的迭代。您将有类似这样的内容:
messageTypes.foreach( msgType => {
final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema(msgType)))
.withBucketAssigner(new EventTimeBucketAssigner())
.build();
dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
if (genericRecord.get(Constants.EVENT_NAME).toString().equals(msgType)) {
return true;
}
return false;
}).addSink(sink).name(msgType+"_sink").setParallelism(parallelism);
}})
这意味着您仅需要在收到新消息类型时使用更改的配置重新启动作业。