Rewriting Watermarks in Flink: A Practical Guide
Apache Flink is a powerful stream processing framework for handling real-time data streams. A key concept in processing such data is the Watermark. Watermarks are special timestamped markers that flow with the stream in event-time processing and let Flink cope with out-of-order events and late data. Sometimes, however, Watermark generation must be customized to fit specific business logic. This article explores how to rewrite Watermarks in Flink and offers some practical tips and examples.
1 What Is a Watermark
In Flink, a Watermark is a marker of event-time progress: a Watermark with timestamp t declares that no more events with a timestamp at or before t are expected. Flink uses Watermarks to decide when event-time-based window operations can be triggered. An event whose timestamp is at or before the current Watermark is considered “late”; in window operations it might be discarded or routed to a special side output stream.
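To make the definition concrete, the following is a minimal sketch in plain Java that does not use Flink at all. It models a Watermark that trails the largest timestamp seen so far by a fixed delay, which mirrors the formula used by Flink's built-in bounded-out-of-orderness strategy, and it flags an event as late when its timestamp is at or before that Watermark. The class name WatermarkModel and the 2-second delay are assumptions made for illustration; real Flink emits Watermarks periodically, so actual lateness also depends on emit timing.

```java
/** Toy model of a Watermark that lags the maximum seen timestamp (no Flink involved). */
class WatermarkModel {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkModel(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Current Watermark: max timestamp minus the allowed delay minus 1 ms. */
    long currentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing observed yet, so no progress to report
        }
        return maxTimestampSeen - maxOutOfOrdernessMs - 1;
    }

    /** Records an event and returns true if it arrived at or behind the Watermark. */
    boolean observe(long eventTimestampMs) {
        boolean late = eventTimestampMs <= currentWatermark();
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return late;
    }

    public static void main(String[] args) {
        WatermarkModel model = new WatermarkModel(2_000L); // assumed 2 s tolerance
        System.out.println(model.observe(1_000L)); // false
        System.out.println(model.observe(5_000L)); // false, Watermark is now 2999
        System.out.println(model.observe(2_500L)); // true, 2500 <= 2999
        System.out.println(model.observe(4_000L)); // false, out of order but within tolerance
    }
}
```

The event at 4000 ms arrives out of order yet is not late, because it is still ahead of the Watermark; only the event at 2500 ms falls behind it.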
2 Why Rewrite Watermarks
In practical applications, the Watermarks of the original stream may not meet our needs. For instance, we might need to adjust the Watermark generation strategy to fit business logic, or to handle special circumstances such as data delays or system failures. In such cases, we need to customize the logic that generates Watermarks.
3 How to Rewrite Watermarks in Flink
In Flink, we can customize the generation of Watermarks by implementing the WatermarkStrategy interface. Generally, we need to do the following four things:
- Define Watermark Strategy: Create a class that implements WatermarkStrategy and overrides the createTimestampAssigner and createWatermarkGenerator methods.
- Implement TimestampAssigner: In the createTimestampAssigner method, return a TimestampAssigner instance responsible for assigning a timestamp to each event.
- Implement WatermarkGenerator: In the createWatermarkGenerator method, return a WatermarkGenerator instance responsible for generating Watermarks.
- Apply Watermark Strategy: Call assignTimestampsAndWatermarks on the DataStream and pass in the WatermarkStrategy to set the strategy for generating Watermarks.
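The four steps above can be sketched as follows, assuming Flink 1.11 or later on the classpath. The UserEvent type, its fields, the class names, and the 5-second delay are hypothetical and chosen only for illustration; the generator shown simply trails the largest timestamp seen by a fixed delay.

```java
import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.common.eventtime.TimestampAssignerSupplier;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

/** Hypothetical event type: a user id plus an event time in epoch milliseconds. */
class UserEvent {
    public String userId;
    public long timestampMs;
}

/** Step 1: a class that implements the WatermarkStrategy interface. */
class UserEventWatermarkStrategy implements WatermarkStrategy<UserEvent> {

    private static final long MAX_DELAY_MS = 5_000L; // assumed out-of-order tolerance

    /** Step 2: return the TimestampAssigner that reads the event time from each event. */
    @Override
    public TimestampAssigner<UserEvent> createTimestampAssigner(
            TimestampAssignerSupplier.Context context) {
        return (event, recordTimestamp) -> event.timestampMs;
    }

    /** Step 3: return the WatermarkGenerator that decides how far event time has advanced. */
    @Override
    public WatermarkGenerator<UserEvent> createWatermarkGenerator(
            WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<UserEvent>() {
            // Start high enough that subtracting the delay cannot underflow.
            private long maxTimestamp = Long.MIN_VALUE + MAX_DELAY_MS + 1;

            @Override
            public void onEvent(UserEvent event, long eventTimestamp, WatermarkOutput output) {
                maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput output) {
                output.emitWatermark(new Watermark(maxTimestamp - MAX_DELAY_MS - 1));
            }
        };
    }

    /** Step 4: apply the strategy to an existing stream. */
    static DataStream<UserEvent> apply(DataStream<UserEvent> events) {
        return events.assignTimestampsAndWatermarks(new UserEventWatermarkStrategy());
    }
}
```

When only one of the two parts needs customizing, the static factory methods on WatermarkStrategy, such as forBoundedOutOfOrderness combined with withTimestampAssigner, are usually shorter than a full implementation of the interface.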
4 Example: Custom Watermark Generation Strategy
Suppose we have a log data stream containing timestamps of user activity. We wish to dynamically adjust Watermarks based on the frequency of user activity to better handle late data. Here is a simple example:
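One possible reading of "adjusting by activity frequency" is sketched below, assuming Flink 1.11 or later. The generator counts events between two periodic emits: a busy interval uses a short delay, and a quiet interval uses a longer one so that stragglers get more time. The UserEvent type, the class names, both delays, and the threshold of 100 events are hypothetical values, not recommendations.

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

/** Hypothetical event type: a user id plus an event time in epoch milliseconds. */
class UserEvent {
    public String userId;
    public long timestampMs;
}

/** Picks the Watermark delay from the number of events seen since the last emit. */
class ActivityAwareWatermarkGenerator implements WatermarkGenerator<UserEvent> {
    private static final long BUSY_DELAY_MS = 2_000L;   // assumed delay under high activity
    private static final long QUIET_DELAY_MS = 10_000L; // assumed delay under low activity
    private static final int BUSY_THRESHOLD = 100;      // assumed events per emit interval

    private long maxTimestamp = Long.MIN_VALUE;
    private long lastWatermark = Long.MIN_VALUE;
    private int eventsSinceLastEmit = 0;

    @Override
    public void onEvent(UserEvent event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        eventsSinceLastEmit++;
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        if (maxTimestamp == Long.MIN_VALUE) {
            return; // no event seen yet, nothing to report
        }
        long delay = eventsSinceLastEmit >= BUSY_THRESHOLD ? BUSY_DELAY_MS : QUIET_DELAY_MS;
        eventsSinceLastEmit = 0;
        // Switching to the longer delay must not move the Watermark backwards.
        lastWatermark = Math.max(lastWatermark, maxTimestamp - delay - 1);
        output.emitWatermark(new Watermark(lastWatermark));
    }
}

class ActivityAwareJob {
    /** Wires the generator and a timestamp assigner into one strategy and applies it. */
    static DataStream<UserEvent> withWatermarks(DataStream<UserEvent> events) {
        WatermarkStrategy<UserEvent> strategy = WatermarkStrategy
                .<UserEvent>forGenerator(context -> new ActivityAwareWatermarkGenerator())
                .withTimestampAssigner((event, recordTimestamp) -> event.timestampMs);
        return events.assignTimestampsAndWatermarks(strategy);
    }
}
```

How often onPeriodicEmit runs is governed by the auto-watermark interval of the job (ExecutionConfig.setAutoWatermarkInterval), so the threshold only makes sense relative to that interval.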
This example creates a custom Watermark strategy that adjusts Watermarks dynamically based on the events it observes. This way, late data is handled better, and less information is lost to an overly strict Watermark strategy.
4.1 Transforming Time Format
How should event times in a format such as “2022/10/22 10:34” be handled?
If this string has already been parsed into a Java time object, such as java.util.Date or java.time.Instant, the timestamp assigner of the strategy passed to assignTimestampsAndWatermarks only needs to convert that object to epoch milliseconds, for instance with Date.getTime() or Instant.toEpochMilli(). For example:
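A minimal sketch of this case, assuming Flink 1.11 or later and a hypothetical LogEvent type whose eventTime field is already a java.time.Instant. The 5-second out-of-orderness bound is likewise an assumed value.

```java
import java.time.Duration;
import java.time.Instant;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

/** Hypothetical event type whose time field is already a parsed Instant. */
class LogEvent {
    public Instant eventTime;
}

class InstantTimestamps {
    /** Assigns event-time timestamps by converting the Instant to epoch milliseconds. */
    static DataStream<LogEvent> withEventTime(DataStream<LogEvent> logs) {
        WatermarkStrategy<LogEvent> strategy = WatermarkStrategy
                .<LogEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // assumed bound
                .withTimestampAssigner(
                        (event, recordTimestamp) -> event.eventTime.toEpochMilli());
        return logs.assignTimestampsAndWatermarks(strategy);
    }
}
```

For a java.util.Date field, the lambda would return the result of getTime() on that field instead.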
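If the string has not yet been parsed, it must first be parsed into a time object and then converted to epoch milliseconds in the timestamp assigner used with assignTimestampsAndWatermarks. The SimpleDateFormat class can parse the time string with the pattern yyyy/MM/dd HH:mm; note that SimpleDateFormat is not thread-safe, and java.time.format.DateTimeFormatter is a thread-safe alternative: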
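The parsing step can be sketched in plain Java as follows. This sketch uses java.time.format.DateTimeFormatter rather than SimpleDateFormat because the formatter is immutable and thread-safe. The class name EventTimeParser and the choice of UTC are assumptions; the string carries no zone information, so the zone in which the logs were written has to be supplied.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Converts strings such as "2022/10/22 10:34" into epoch milliseconds. */
class EventTimeParser {
    // DateTimeFormatter is immutable, so one shared instance is safe.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm");

    /** Parses the text as a UTC date-time (an assumption) and returns epoch milliseconds. */
    static long toEpochMillis(String text) {
        return LocalDateTime.parse(text, FORMAT)
                .atZone(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2022/10/22 10:34")); // 1666434840000
    }
}
```

Inside a Flink job, a timestamp assigner passed to withTimestampAssigner could call toEpochMillis on the string field of each event and return the result.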
5 Summary
Rewriting Watermarks is an advanced Flink technique for adapting stream processing behavior to specific business requirements. A custom Watermark strategy gives control over how long Flink waits for out-of-order events, trading result latency against completeness, so real-time data streams can be handled more flexibly. Used appropriately in practical applications, it makes the processing of out-of-order and late data more accurate.