AIOps Now: Scaling Kubernetes With AI and Machine Learning

Using AI and digital twins, optimize Kubernetes apps and address SRE challenges with continuous learning for improved outcomes.

By Raj Nair · Feb. 04, 24 · Analysis

If you are a site reliability engineer (SRE) for a large Kubernetes-powered application, optimizing resources and performance is a daunting job. Some spikes, like a busy shopping day, can be broadly planned for. Doing it right, however, requires a painstaking understanding of the behavior of hundreds of microservices and their interdependencies, and that understanding has to be re-evaluated with each new release. That approach does not scale, and the monotony adds to the SRE's stress. Moreover, there will always be unexpected peaks to respond to. Continually keeping tabs on performance and putting the optimal amount of resources in the right place is essentially impossible.

The way this is being solved now is through gross overprovisioning, or a combination of guesswork and endless alerts that require support teams to review and intervene. It’s simply not sustainable or practical, and certainly not scalable. But it’s just the kind of problem that machine learning and AI thrive on. We have spent the last decade dealing with such problems, and the arrival of the latest generation of AI tools such as generative AI has opened the possibility of applying machine learning to the real problems of the SRE to realize the promise of AIOps.

Turning Up the Compute Knob…to Be Safe

No matter how great your observability dashboard is, the amount of data and the need for agility are just too much. You have to provision adequate resources to achieve the desired response times and error rates. It is not unusual for people in this role to peg compute utilization at 30 percent “to be safe” and be prepared to monitor hundreds of microservices to ensure the desired service-level agreement (SLA) is achieved. The end result is costly, not just in compute resources but also in the DevOps resources dedicated to maintaining the SLA.

It seems that, for all it has brought us, Kubernetes has gone beyond the comprehension of those charged with operating it. Horizontal pod autoscaling (HPA) and reactive scaling solutions still leave SREs guessing at which CPU utilization threshold will work across varying traffic loads and service graph dependencies. Traffic does not map linearly to microservice load, and therefore to performance, and traffic is not the only reason to change the state of the application deployment: SREs are also monitoring issues like temperature, faults, and latency.
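
To see why the threshold is so hard to pick, it helps to look at the replica calculation that the Kubernetes documentation gives for HPA: desired replicas = ceil(current replicas × current metric / target metric), with no change when that ratio is within a tolerance of 1.0 (0.1 by default). The sketch below illustrates that rule only. The function name and the sample numbers are made up for this example, and real HPA behavior also involves stabilization windows, min/max replica bounds, and pod readiness, none of which are modeled here.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float,
                         tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA rule (simplified, hypothetical helper)."""
    ratio = current_utilization / target_utilization
    # HPA leaves the replica count alone when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_utilization / target_utilization)


if __name__ == "__main__":
    # Same workload (10 replicas at 60% CPU), three candidate thresholds.
    for target in (30, 50, 70):
        print(target, hpa_desired_replicas(10, 60, target))
```

With these illustrative numbers, the same workload ends up with 20, 12, or 9 replicas depending only on the chosen target, which is the guess the SRE is being asked to make.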

A typical Kubernetes application comprises, on average, several hundred microservices. Each microservice depends on others in a web of interconnected relationships. It is not easy for a person to view and understand it all, make detailed changes, and then repeat the exercise for every release of each microservice, every week. SREs figuratively “turn up the compute knob” and hope that it improves whatever has dropped below the service-level objective (SLO). In reality, adding resources to a microservice is useless when the actual bottleneck is another microservice that it depends on.
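
The bottleneck point can be illustrated with a toy model. The sketch below assumes an acyclic service graph in which every request to a service makes one call to each of its dependencies, so a service can sustain no more traffic than its slowest downstream dependency. The service names, capacities, and the `effective_capacity` helper are all hypothetical and chosen only for illustration.

```python
def effective_capacity(service, capacity, depends_on):
    """Requests/sec a service can sustain, capped by its slowest downstream dependency.

    Assumes an acyclic graph and one downstream call per request (a simplification).
    """
    downstream = [effective_capacity(dep, capacity, depends_on)
                  for dep in depends_on.get(service, [])]
    return min([capacity[service], *downstream])


if __name__ == "__main__":
    depends_on = {"frontend": ["cart"], "cart": ["inventory-db"]}
    capacity = {"frontend": 500, "cart": 400, "inventory-db": 150}

    print(effective_capacity("frontend", capacity, depends_on))

    # "Turning up the knob" on the frontend changes nothing...
    scaled_frontend = {**capacity, "frontend": 1000}
    print(effective_capacity("frontend", scaled_frontend, depends_on))

    # ...while scaling the actual bottleneck does.
    scaled_db = {**capacity, "inventory-db": 300}
    print(effective_capacity("frontend", scaled_db, depends_on))
```

In this toy graph, doubling the frontend leaves the end-to-end capacity unchanged, while scaling the database it ultimately depends on raises it. A scaler that is not aware of the service graph cannot tell these two cases apart.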

An Ideal Use Case for AI

In 2024, when someone says AI, the next thought is almost inevitably ChatGPT. ChatGPT is generative AI that selects the best next word. While the architecture required for a strong AIOps platform is very different from ChatGPT (more on that later), the goal is similar — choose the best next state for the application.

The intricately interconnected ecosystems of modern microservice applications are too big and complex for the SRE team to comprehend in detail and make those decisions. Most efforts to autoscale these applications fail to take into account the nuanced requirements and performance needs of individual services. I’ve been hearing about this problem continuously for over 20 years (starting with the L5 network load balancer we invented at Arrowpoint Communications). 

The Digital Twin Goes Through the Paces

Training data is the fuel for AI. To teach an application to operate a mission-critical Kubernetes instance, we need to develop good information about how performance can be optimized. Digital twins have been used for decades in multiple fields, including manufacturing and racing, to recreate a digital equivalent of a real subject and study its behavior. In our case, we use performance metrics to build a digital twin of each microservice.
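
As a rough illustration of what a metrics-driven twin of one microservice could look like, the sketch below fits a single parameter (mean service time) from observed load and latency samples, then predicts latency for other loads and replica counts. This is not the model described in this article: the M/M/1-style queueing formula, the `MicroserviceTwin` class, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MicroserviceTwin:
    """Toy stand-in for one microservice (hypothetical, M/M/1-style model)."""
    service_time_s: float  # mean time to serve one request on one replica

    @classmethod
    def from_metrics(cls, samples):
        """Estimate service time from (rps_per_replica, latency_s) samples.

        Inverts latency = s / (1 - rps * s), which gives s = latency / (1 + rps * latency).
        """
        return cls(mean(lat / (1 + rps * lat) for rps, lat in samples))

    def predict_latency(self, rps, replicas):
        """Predicted latency in seconds when load is spread evenly over replicas."""
        utilization = rps * self.service_time_s / replicas
        if utilization >= 1.0:
            return float("inf")  # saturated: the queue grows without bound
        return self.service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # Illustrative samples: (requests/sec per replica, observed latency in seconds).
    twin = MicroserviceTwin.from_metrics([(10, 0.025), (25, 0.040)])
    for replicas in (2, 4, 8):
        print(replicas, twin.predict_latency(rps=100, replicas=replicas))
```

The value of even a crude twin like this is that candidate replica counts can be tried against the model instead of against production traffic.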

In reinforcement learning (RL), digital twins are used to create a simulation environment that generates an observation space. Within it, a model can be trained to discover and learn the best paths (also known as "trajectories") that guide the system to states with the desired target properties, such as cost and performance. In our case, we use proximal policy optimization (PPO) as the RL training algorithm. Our approach is service-graph aware, taking into account the microservice dependencies that impact scaling. Ultimately, we will have a model-free network that is continually learning based on operational experience.
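
For readers unfamiliar with PPO, its central piece is the clipped surrogate objective: for each sampled step, take the smaller of r × A and clip(r, 1 − ε, 1 + ε) × A, where r is the probability ratio between the new and old policy and A is the advantage estimate. The sketch below computes that objective in plain Python. It is not the training code behind the approach described here; the function name, the clip value of 0.2, and the sample numbers are assumptions, and a real trainer would add value and entropy terms and rely on an automatic differentiation library.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled steps.

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the policy
        # further than the clip range in a single update.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


if __name__ == "__main__":
    # Hypothetical batch: e.g. "add a replica" looked good (A > 0) in the first
    # step and "remove a replica" looked bad (A < 0) in the second.
    print(ppo_clipped_objective([1.5, 0.5, 1.0], [1.0, -1.0, 2.0]))
```

The clipping is what keeps each policy update small, which matters when the policy being updated decides how a production deployment scales.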

Better Responsiveness and Ongoing Improvement

Kubernetes has come a long way. There is extensive tool-level automation, but not a lot of effective system-level automation. Perhaps that has a lot to do with the vast amount of activity within a Kubernetes instance. We boiled the problem down to deciding the best next state for the application. 

People have been playing with generative AI that can produce words and images for a general audience. We are seeing how the same technology can transform our digital experience. 

For SREs Now and Developers of the Future

SREs today could benefit from a transformation. Talking to SRE teams, we have learned that they are asked to contribute to their own SLOs and they simply don’t know where to begin. It seems that the complexity of Kubernetes has outpaced the ability of humans alone to operate it. 

Looking ahead, applying AIOps models and moving toward autonomous infrastructure can allow for a new level of complexity and scale for microservices applications.


Published at DZone with permission of Raj Nair. See the original article here.

Opinions expressed by DZone contributors are their own.
