
Practice of Rewriting Watermarks in Flink

Apache Flink is a powerful stream processing framework for real-time data streams. When working with event time, watermarks are the key tool: a watermark is a special marker carrying a timestamp that flows with the stream and asserts that no events with earlier timestamps are expected. Flink uses watermarks to decide when event-time window operations can be triggered, which is how it copes with out-of-order and late-arriving data.

To meet specific business needs, the watermark generation logic can be customized. In Flink this is done through the WatermarkStrategy interface: you define a strategy, implement a TimestampAssigner (to extract each event's timestamp) and a WatermarkGenerator (to decide when and what watermark to emit), and apply the strategy to the stream with assignTimestampsAndWatermarks.

This article walks through an example that dynamically adjusts the watermark's out-of-orderness bound based on user activity frequency, so that late data is handled more gracefully. It also discusses rewriting watermarks for specific time formats: if the event time arrives as a string, it can be parsed into a Java time object and converted to epoch milliseconds inside the timestamp assigner passed to assignTimestampsAndWatermarks. With custom watermark strategies, real-time streams can be handled more flexibly, improving both the accuracy and the efficiency of data processing.
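As a concrete sketch of what such a strategy can look like, the Java example below combines the pieces described above: a custom WatermarkGenerator that adapts its out-of-orderness bound to the observed event rate, and a timestamp assigner that parses a string event time. The UserEvent type and its fields, the 100-events-per-period threshold, the "yyyy-MM-dd HH:mm:ss" format, and the UTC offset are all assumptions made for illustration, not anything prescribed by Flink.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class AdaptiveWatermarkExample {

    // Hypothetical event type; the field names are assumptions for this sketch.
    public static class UserEvent {
        public String userId;
        public String eventTime; // e.g. "2024-01-15 10:30:00"
    }

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // A WatermarkGenerator that widens or narrows the out-of-orderness bound
    // depending on how many events arrived since the last periodic emit.
    public static class AdaptiveGenerator implements WatermarkGenerator<UserEvent> {
        private long maxTimestamp = Long.MIN_VALUE;
        private long eventsSinceLastEmit = 0;

        @Override
        public void onEvent(UserEvent event, long eventTimestamp, WatermarkOutput output) {
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
            eventsSinceLastEmit++;
        }

        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            // Illustrative heuristic (an assumption, not a Flink built-in):
            // busy streams get a tight 1 s bound, quiet streams a loose 5 s bound.
            long bound = eventsSinceLastEmit > 100 ? 1_000L : 5_000L;
            eventsSinceLastEmit = 0;
            if (maxTimestamp != Long.MIN_VALUE) {
                output.emitWatermark(new Watermark(maxTimestamp - bound - 1));
            }
        }
    }

    // Applies the strategy: parse the string event time into epoch millis,
    // then attach the custom generator to the stream.
    public static DataStream<UserEvent> withWatermarks(DataStream<UserEvent> events) {
        WatermarkStrategy<UserEvent> strategy = WatermarkStrategy
                .<UserEvent>forGenerator(ctx -> new AdaptiveGenerator())
                .withTimestampAssigner((event, recordTs) ->
                        LocalDateTime.parse(event.eventTime, FORMAT)
                                .toInstant(ZoneOffset.UTC) // the time zone is an assumption
                                .toEpochMilli());
        return events.assignTimestampsAndWatermarks(strategy);
    }
}
```

Because the strategy is built with forGenerator, onPeriodicEmit runs at the auto-watermark interval (200 ms by default, adjustable via env.getConfig().setAutoWatermarkInterval(...)), so the heuristic re-evaluates the bound periodically rather than on every event.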

Big Data Architecture Course Review Notes

The requirements of a big data system span data, functionality, and performance, with the goals of high performance, high availability, fault tolerance, and scalability. Big data and cloud computing are closely related: cloud computing supplies the computing resources for big data processing, while big data is a typical workload for cloud services. Cloud computing delivers dynamically scalable computing services over the network, is characterized by resource virtualization, massive scale, and elasticity, and comes in three service models: IaaS, PaaS, and SaaS. Public cloud, private cloud, community cloud, and hybrid cloud are the four main deployment models, each with its own advantages and disadvantages. The core technologies of cloud computing are virtualization and containerization: virtualization abstracts physical computer resources, while containerization provides a lightweight virtualized environment.

The big data processing pipeline consists of data collection, preprocessing, storage, analysis, and visualization, with distributed computing as its key enabling technology. Hadoop is the core framework for big data processing; it comprises HDFS, MapReduce, and YARN and supports the storage and computation of large-scale data. Distributed systems achieve high availability and fault tolerance through sharding and replication, and the CAP theorem states that a distributed system must make trade-offs among consistency, availability, and partition tolerance.
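The MapReduce model mentioned in these notes is easiest to grasp from the canonical word-count job. The sketch below uses the standard Hadoop MapReduce API: the map phase emits (word, 1) pairs, the framework shuffles them by key, and the reduce phase sums the counts; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it with, for example, hadoop jar wordcount.jar WordCount /input /output (the paths are placeholders); the same map-shuffle-reduce pattern underlies most MapReduce jobs.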