Leveraging Apache Kafka for the Distribution of Large Messages

In this article, we will explore the architectural approach for separating the actual payload (the large video file) from the message intended to be circulated via Kafka.

By Gautam Goswami · Dec. 19, 23 · Tutorial

In today's data-driven world, the capability to transport and circulate large amounts of data, especially video files, in real time is crucial for news media companies. For example, suppose an incident occurs in a specific location and a news reporter promptly films the entire situation. The complete video then has to be distributed for broadcasting across the company's multiple studios, situated in geographically distant locations.

To build a comprehensive solution for this problem statement, we can utilize Apache Kafka in conjunction with external storage for uploading large video files. The external storage may take the form of a cloud store, such as Amazon S3, or an on-premises large-file storage system, such as a network file system or HDFS (Hadoop Distributed File System). Of course, Apache Kafka was not designed to handle large data files as direct messages for communication between publishers and subscribers/consumers. Instead, it serves as a modern event streaming platform where, by default, the maximum size of a message is limited to 1 MB.

There are several reasons why Kafka limits the message size by default. One of the major reasons is that large messages are expensive to handle: they can slow down the brokers and eventually increase the memory pressure on the broker JVM.
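
For context, that 1 MB ceiling is governed by a few configuration properties: message.max.bytes on the broker, max.message.bytes at the topic level, and max.request.size on the producer. The small Java sketch below (the broker address is a placeholder) simply shows where the producer-side limit sits; the approach in this article deliberately keeps messages tiny instead of raising these values.

Java

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class MessageSizeLimits {
    public static void main(String[] args) {
        // Producer-side cap for a single request (max.request.size), roughly 1 MB by default.
        // The broker enforces its own cap via message.max.bytes, and individual topics can
        // override it with max.message.bytes. The pattern in this article keeps messages far
        // below these limits by sending only a pointer to the video, not the video itself.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        producerProps.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "1048576");         // ~1 MB

        System.out.println("max.request.size = "
                + producerProps.getProperty(ProducerConfig.MAX_REQUEST_SIZE_CONFIG));
    }
}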

In this article, we will explore the architectural approach for separating the actual payload (the large video file) from the message intended to be circulated via Kafka. This involves notifying subscribed consumers, who can then proceed to download the video files for further processing or broadcast after editing.

The first storage option that might cross our mind is a database, but databases don't recognize video files: there is no "image" or "video" data type. We would have to manually convert video files into blobs and manage them ourselves in whatever format we choose (Base64, binary, etc.). This is an extra operation that needs to be handled every time we perform any transaction involving the large video files.
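
To illustrate that extra operation, here is a minimal Java sketch, assuming a hypothetical local file path, that converts a video file into a raw binary blob and a Base64 string before it could be written to a database column:

Java

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class VideoBlobEncoder {
    public static void main(String[] args) throws Exception {
        // Hypothetical local video file; in practice this could be hundreds of MBs,
        // which is exactly why loading and encoding it in memory is costly.
        Path video = Path.of("/tmp/news.mp4");

        byte[] rawBytes = Files.readAllBytes(video);                        // binary blob
        String base64Blob = Base64.getEncoder().encodeToString(rawBytes);   // ~33% larger than binary

        System.out.println("Raw size: " + rawBytes.length + " bytes, "
                + "Base64 size: " + base64Blob.length() + " chars");
    }
}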

To mitigate that obstacle with the database, we can instead store large video files on a distributed file system like HDFS (the Hadoop Distributed File System). When we store a large file, it gets broken into blocks that are replicated across the cluster. This redundancy is very helpful for increasing data availability and accessibility. An on-premises large-file storage system such as a network file system or HDFS would be the best option if we cannot rely on cloud storage such as Amazon S3 or Microsoft Azure because of cost/revenue constraints.

HDFS

The data is kept on HDFS in blocks of a predetermined size (128 MB by default in Hadoop 2.x). Although files appear as whole files when listed with the hdfs dfs commands, HDFS internally stores them as blocks. In a nutshell, we could mount a remote folder/location where large video files can be dumped and then, by executing a script periodically, transfer those files into HDFS. The script might internally use distcp, a general utility for copying large data sets between the remotely mounted folder and the HDFS cluster's storage folder.
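
Besides a distcp-based script, the same transfer can be done programmatically with the HDFS Java client. The sketch below is only an illustration; the NameNode URI, HDFS user, mount point, and target path are assumptions chosen to match the message example later in the article.

Java

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VideoHdfsUploader {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode endpoint and HDFS user.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(
                new URI("hdfs://Remote-namenode:8020"), conf, "sampleuser")) {

            // Locally mounted drop folder and target HDFS folder -- both illustrative.
            Path localVideo = new Path("/mnt/video-dropbox/news.mp4");
            Path hdfsTarget = new Path("/user/sampleuser/data/videos/news.mp4");

            // copyFromLocalFile(delSrc, overwrite, src, dst): keep the source, overwrite if present.
            fs.copyFromLocalFile(false, true, localVideo, hdfsTarget);
            System.out.println("Uploaded to " + hdfsTarget);
        }
    }
}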

Once the file is stored successfully on the HDFS cluster, the HDFS folder location and file name can be combined with access details of the HDFS cluster, such as the IP address and security credentials, into a Kafka message via a Kafka publisher and eventually published to the Kafka topic.

The message structure could look like the example below:

JSON
 
{
  "headers": {
    "Content-Type": "application/json",
    "Authorization": "Authorization token like eyJhb$#@67^EDf*g&",
    "Custom-Header": "some custom value related to video"
  },
  "key": "Key of video file like name of the file",
  "value": {
    "field1": "video file extension like mp4, avi etc",
    "field2": "Location of HDFS folder where video file has been uploaded, like hdfs://Remote-namenode:8020/user/sampleuser/data/videos/news.mp4"
  }
}
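
A minimal publisher sketch for such a message might look like the following; the broker address, the topic name (video-uploads), and the hand-built JSON string are illustrative assumptions, and a real implementation would typically serialize the full structure (headers, key, value) with a JSON library.

Java

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class VideoLocationPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Key = file name, value = small JSON document pointing at the HDFS location,
        // mirroring the message structure shown above.
        String key = "news.mp4";
        String value = "{\"field1\":\"mp4\","
                + "\"field2\":\"hdfs://Remote-namenode:8020/user/sampleuser/data/videos/news.mp4\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("video-uploads", key, value)); // assumed topic name
            producer.flush();
        }
    }
}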


On the other hand, the consumers subscribed to that topic can consume these messages and extract the location where the large video file has been uploaded on the remote HDFS cluster. We might need to integrate different downstream data pipelines with the consumers to execute the workflow that downloads the video file from the HDFS cluster to the local system. Typically, to download a file from the Hadoop Distributed File System (HDFS) to the local system, we can use the hadoop fs command or the distcp (distributed copy) command. We can also use a tool like Apache NiFi to transfer data from HDFS to a local filesystem. Once the video file is available in a geographically separated location, it can be edited/modified and finally broadcast via various news channels.
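
Under the same assumptions (broker address, video-uploads topic, HDFS endpoint), a consumer-side sketch that downloads each announced file could look like the one below. For simplicity, it treats the message value as the bare HDFS path; a real consumer would parse the JSON structure shown earlier to extract field2.

Java

import java.net.URI;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class VideoDownloadConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "studio-downloaders");         // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        Configuration hadoopConf = new Configuration();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             FileSystem fs = FileSystem.get(
                     new URI("hdfs://Remote-namenode:8020"), hadoopConf, "sampleuser")) {

            consumer.subscribe(Collections.singletonList("video-uploads")); // assumed topic
            while (true) { // poll loop runs until the process is stopped
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    // Value is assumed to be the HDFS path of the uploaded video.
                    Path source = new Path(record.value());
                    Path local = new Path("/var/studio/incoming/" + record.key()); // illustrative target
                    fs.copyToLocalFile(source, local);
                    System.out.println("Downloaded " + record.key() + " to " + local);
                }
            }
        }
    }
}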

In summary, this represents a high-level architectural concept leveraging Apache Kafka to inform subscribed message consumers about the upload of a large video file to remote big data storage, including all relevant credentials and the file's location. By processing these events through integrated downstream data pipelines, consumers can download and process the large video file, preparing it for subsequent broadcasting. However, it's important to acknowledge potential gray areas or challenges in the discussed approach that may emerge during the detailed implementation.

I hope you enjoyed reading this. If you found this content valuable, please consider liking and sharing.

Tags: Big Data, File System, Kafka

Published at DZone with permission of Gautam Goswami, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
