Make Your Jobs More Robust With Automatic Safety Switches

This article delves into enhancing error management in batch processing programs through the strategic implementation of automatic safety switches and their critical role in safeguarding data integrity during technical errors.

By Bertrand Florat · Oct. 08, 23 · Analysis


In this article, a "job" means a batch processing program, as defined in JSR 352. A job can be written in any language; it is scheduled to run periodically and processes bulk data automatically, in contrast to interactive processing (CLI or GUI) driven by end users. Error handling in jobs differs significantly from interactive processing. In the interactive case, a failed backend call often need not be retried because a human can react to the error, while jobs need robust error recovery precisely because nobody is watching. Moreover, jobs often run with elevated privileges and can potentially damage large volumes of data.

Consider a scenario: What if a job fails due to a backend or dependency component issue? If a job is scheduled hourly and faces a major downtime just minutes before execution, what should be done?

Based on my experience with various large projects, implementing automatic safety switches for handling technical errors is a best practice.

Enhancing Failure Handling With Automatic Safety Switches

When a technical error occurs (e.g., timeout, storage shortage, database failure), the job should attempt several retries (as per best practices outlined below) and halt immediately at the current processing step. It's advisable to record the current step position, allowing for intelligent restarts once the system is operational again.
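A minimal sketch of such a switch, assuming a file-based stop flag; the class and method names are illustrative, not part of JSR 352 or any specific library:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// Minimal sketch of a file-based automatic safety switch.
public class JobSafetySwitch {

    private final Path stopFlag;

    public JobSafetySwitch(Path stopFlag) {
        this.stopFlag = stopFlag;
    }

    // A scheduled job calls this first and exits with a log if false.
    public boolean canStart() {
        return !Files.exists(stopFlag);
    }

    // Called once retries are exhausted: persist the reason, the last
    // step reached, and the timestamp, then let the job halt.
    public void trip(String reason, String lastStep) {
        try {
            Files.writeString(stopFlag,
                    "reason=" + reason + "\n"
                  + "lastStep=" + lastStep + "\n"
                  + "trippedAt=" + Instant.now() + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Resetting is a deliberate human action (the equivalent of 'rm').
    public void reset() {
        try {
            Files.deleteIfExists(stopFlag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Recording the last step in the flag's metadata is what makes the intelligent restart possible later.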

Only human intervention, after thorough analysis and resolution, should reset the switch. While the job is disabled, any attempt to schedule it should log that it is inactive and abort. This is also the opportune moment to write a post-mortem report, valuable for future failure analysis and for adjusting code or configuration toward better robustness (e.g., tuning timeouts, adding retries, or tightening input controls).

The switch can then be removed, enabling the job to recommence or complete outstanding steps (if supported) during the next scheduled run. Alternatively, immediate execution can be forced to prevent prolonged downtime delays, especially if job frequency is low. Delaying a job's execution excessively can lead to end-user latency and potential accumulation of such delays, eventually overwhelming the job's capacity.

Rationale for Automatic Safety Switches

  • Prevention of Data Corruption: They can avert significant data corruption resulting from bugs by halting activity during unexpected states.

  • Error Log Management: They help prevent the system from being flooded with repetitive error logs (such as database access stack traces). Uncontrolled log volumes can also trigger secondary problems, like filling up filesystems.

  • Facilitating System Repair: A system without an automatic safety switch is much harder to diagnose and fix. Human operators cannot work with a clear head while the system remains enabled, since it could jam again the moment it is next scheduled.

  • Resource Exhaustion Mitigation: Continuing periodic jobs during technical errors caused by resource exhaustion (memory, CPU, storage, network bandwidth, etc.) worsens the situation. Automatic safety switches act as circuit breakers, stopping jobs and freeing up resources. After resolving the root problem, operators can restart jobs sequentially and securely.

  • Security Enhancement: Many attacks, including brute force attacks, SQL injections, or Server-Side Injection (SSI), involve injecting malicious data into a system. Such data might be processed later by jobs, potentially triggering technical errors. Stopping the job improves security by forcing a human or team analysis of the data. Similarly, halting a job after a timeout can help foil a resource exhaustion-type attack, such as a ReDoS (regular expression denial of service).

  • Promoting System Analysis: Organizations that overlook job robustness often let failed jobs simply run again at the next schedule, a risky approach. Automatic safety switches ensure every failure is detected and requires human intervention. This encourages systematic analysis, post-mortem documentation, and long-term improvements.

  • Code Reuse: Besides emergency handling, the code written for this purpose can be repurposed to disable a job without altering the scheduling. This is similar to the suspend: true attribute of Kubernetes CronJobs. In a recent project, we used this functionality to conveniently initiate job maintenance: the maintenance script sets the stop flag and then waits for all running jobs to complete.

Implementing Effective Safety Switches

  • Simple Implementation: The most straightforward approach is for each job, when scheduled, to check for a persistent stop flag. If the flag is present, the job exits with a log. The flag can be implemented through a file, a database record, or a REST API result, for example. For robustness, a stop file per job is preferable, containing metadata like the reason for stopping and the date. This flag is set on technical errors and removed only at a human operator's initiative (using rm or a dedicated shell script, for instance).

  • Coupling With a Retry Mechanism: Safety switches must work alongside a robust retry solution. Jobs shouldn't halt and demand human intervention at the first sign of an intermittent issue, such as database connection saturation or an occasional timeout caused by backups slowing down the SAN. Effective solutions, such as the Spring Retry library, implement exponential backoff with jitter. For instance, with 10 tries (including the initial call) and a 1-second initial delay that doubles each time (1 s, 2 s, 4 s, and so on), the cumulative delay comes to about eight and a half minutes, so the job only fails after roughly ten minutes if the root cause hasn't been resolved in the meantime. Jitter adds small random intervals to each wait to avoid retry storms in which all jobs retry simultaneously.

  • Ensure Exclusive Job Launches: As with any batch processing solution, guarantee that jobs are mutually exclusive: a new run must not start while a previous instance is still running.

  • Business Error Handling: Business errors (e.g., poorly formatted data) shouldn't trigger safety switches, unless the code lacks defensive measures and unexpected errors arise. In such cases, it's a code bug and qualifies as a technical error, warranting the safety switch trigger and requiring hotfix deployment or data correction.

  • Facilitate Smooth Restarts: When possible, allow seamless restarts using batch checkpoints, storing the current step, processing data context, or even the presently processed item.

  • Monitoring and Alerting: Ensure that monitoring and alerting systems are aware of job stoppage triggered by automatic safety switches. For example, email alerts could be sent or jobs could be highlighted in red within a monitoring system.

  • Semi-Automatic Restarts: While we always advocate for thorough system analysis during production issues, there are moments when keeping jobs halted until a human intervenes isn't practical, especially during weekends. A middle ground between routine automatic restarts and a complete halt is to authorize an automatic restart after a predetermined period. In our case, we set up a mechanism that removes the stop flag after 8 hours, allowing the job to try restarting if no human has addressed the issue by then. This approach keeps some benefits of the automatic safety switch, such as preventing data corruption and log overflow, but sacrifices others, notably the guarantee of a systematic analysis and the continuous improvement that flows from it. Hence, we believe this solution should be applied judiciously.
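The retry schedule described above can be sketched as a small standalone helper. Libraries such as Spring Retry provide this out of the box; this version only computes the wait times so the shape of the schedule is easy to inspect (the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of an exponential-backoff-with-jitter schedule.
public class BackoffPolicy {

    // Returns the waits (in ms) between tries: with maxAttempts tries
    // there are maxAttempts - 1 waits, each double the previous one,
    // plus a random jitter to avoid synchronized retry storms.
    public static List<Long> delaysMillis(int maxAttempts, long initialMillis,
                                          long maxJitterMillis) {
        List<Long> delays = new ArrayList<>();
        long delay = initialMillis;
        for (int i = 1; i < maxAttempts; i++) {
            long jitter = maxJitterMillis == 0
                    ? 0
                    : ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
            delays.add(delay + jitter);
            delay *= 2; // exponential growth: 1 s, 2 s, 4 s, ...
        }
        return delays;
    }
}
```

With 10 tries and a 1-second initial delay, the nine waits sum to 511 seconds, about eight and a half minutes before the actual call durations are added.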

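The semi-automatic restart can be sketched as a check run by a wrapper before each scheduled execution, assuming the file-based stop flag above; the 8-hour grace period matches our setup, and passing the clock in as a parameter is just a testability convenience:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Sketch of a semi-automatic restart: expire a stop flag once it is
// older than the grace period, so the next scheduled run may proceed.
public class StopFlagExpiry {

    public static boolean expireIfStale(Path stopFlag, Duration gracePeriod,
                                        Instant now) {
        try {
            if (!Files.exists(stopFlag)) {
                return false; // switch not tripped, nothing to do
            }
            Instant trippedAt = Files.getLastModifiedTime(stopFlag).toInstant();
            if (now.isAfter(trippedAt.plus(gracePeriod))) {
                Files.delete(stopFlag); // allow the next run to start
                return true;
            }
            return false; // still within the grace period: stay halted
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```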
Conclusion

Automatic safety switches prove invaluable in handling unexpected technical errors. They significantly reduce the risk of data corruption, empower operators to address issues thoughtfully, and foster a culture of post-mortems and robustness improvements. However, their effectiveness hinges on not being overly sensitive, as excessive interventions can burden operators. Thus, coupling these switches with well-designed retry mechanisms is crucial.

Tags: Batch processing, Data corruption, Processing systems

Published at DZone with permission of Bertrand Florat. See the original article here.

Opinions expressed by DZone contributors are their own.
