Unpacking Our Findings From Assessing Numerous Infrastructures (Part 2)

Making superior performance accessible. Get better at assessing your core infrastructure needs and find out where engineering teams often falter.

By Komal J Prabhakar · Mar. 05, 24 · Opinion

When superior performance comes at a higher price tag, innovation makes it accessible. This is quite evident from the way AWS has been evolving its services:

  • gp3, the successor to gp2 volumes: Offers the same durability, supported volume size, max IOPS per volume, and max IOPS per instance as gp2. The main difference is that gp3 decouples IOPS, throughput, and volume size. The flexibility to configure each of them independently is where the savings come in.
  • AWS Graviton3 processors: Offer up to 25% better compute performance, up to double the floating-point performance, and improved cryptographic performance compared with Graviton2. They also deliver up to 3x better performance for machine learning workloads and support DDR5 memory, which provides 50% more memory bandwidth than DDR4 (used by Graviton2).
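To make the gp3 decoupling concrete, here is a minimal sketch in Python. It assumes the documented gp2 baseline of 3 IOPS per GiB (floor of 100, ceiling of 16,000) and the 3,000 IOPS baseline that gp3 includes at any size. The function names are made up for illustration, and burst credits, throughput, and pricing are deliberately left out.

```python
import math

# Assumed gp2 baseline rule: 3 IOPS per GiB, floor of 100, ceiling of 16,000.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000

# Assumed gp3 baseline, included regardless of volume size.
GP3_BASELINE_IOPS = 3_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS of a gp2 volume, which is tied to its size."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_size_needed_for(target_iops):
    """Smallest gp2 size (GiB) whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 cannot exceed 16,000 IOPS per volume")
    if target_iops <= GP2_MIN_IOPS:
        return 1  # every gp2 volume gets at least the 100 IOPS floor
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)


def gp3_extra_iops_for(target_iops):
    """IOPS to provision on top of the gp3 baseline, independent of size."""
    return max(0, target_iops - GP3_BASELINE_IOPS)
```

Under these assumptions, a workload that needs 6,000 IOPS but only 200 GiB of data would need a 2,000 GiB gp2 volume (`gp2_size_needed_for(6000)`), while on gp3 it keeps the 200 GiB volume and provisions 3,000 extra IOPS (`gp3_extra_iops_for(6000)`).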

To get better at assessing your core infrastructure needs, knowing the AWS services is only half the battle. In my previous post, Unpacking Our Findings From Assessing Numerous Infrastructures (Part 1), I discussed several areas where engineering teams often falter. Do give it a read!

Here, we'll discuss:

  • Are your systems truly reliable?
  • How do you respond to a security incident?
  • How do you reduce defects, ease remediation, and improve flow into production? (Operational Excellence)

Are Your Systems Truly Reliable?

Nearly 67% of teams showed high risk on questions around resilience testing. The problem starts with a lack of basic thinking ahead about how things might fail and a lack of plans for what to do when they do. Of course, teams did perform root cause analysis after things actually went wrong, which we can count as learning from mistakes. But the majority of them had no playbook or procedure for investigating failures and running post-incident analysis.

How Do You Plan for Disaster Recovery?

Eighty percent of the workloads we reviewed scored high risk in this area. Although disaster recovery is a vital necessity, many organizations avoid it because of its perceived complexity and cost. Other common reasons were insufficient time, inadequate resources, and an inability to prioritize it due to a lack of skilled personnel.

An easy way to begin is to write down your:

  • Recovery point objective: How much data are you prepared to lose?
  • Recovery time objective: How long can you afford to be down before you are serving customers again?
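Once the two objectives are written down, they can be checked against what you actually do today. The sketch below is one way to express that check. The class and function names are hypothetical, and it assumes the worst-case data loss equals the interval between backups and that the restore time comes from a real recovery test.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


def objective_violations(objectives, backup_interval, measured_restore_time):
    """Return a list of violated objectives (an empty list means both are met)."""
    problems = []
    # Assumption: a failure just before the next backup loses one full interval.
    if backup_interval > objectives.rpo:
        problems.append("RPO: backups are too infrequent")
    # Assumption: the restore time was measured in an actual recovery drill.
    if measured_restore_time > objectives.rto:
        problems.append("RTO: restore takes too long")
    return problems
```

For example, with an RPO of 15 minutes and an RTO of 1 hour, hourly backups and a 45-minute restore would fail on the RPO only.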

The next important step is planning and working on recovery strategies. Take a Lambda function as an example. Error scenarios worth thinking through include:

  • Manual deployment errors: Risk of deploying incorrect code or configuration changes.
  • Cold start delay: Lambda takes time to initialize a new execution environment, so the first request takes longer to serve. This often happens after an idle environment has been reclaimed due to inactivity, and it results in a poor user experience.
  • Lambda concurrency limit: Risk of hitting the default concurrency limit. Once it is exceeded, additional invocations are throttled, and those requests can be lost unless the caller retries.
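For the throttling scenario, one common mitigation on the caller's side is to retry with exponential backoff and jitter. The sketch below is illustrative only: `ThrottledError` is a hypothetical stand-in for whatever throttling error your client raises, and `invoke` is any callable that performs the invocation.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling (HTTP 429) response."""


def call_with_backoff(invoke, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call invoke(), retrying on ThrottledError with capped, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return invoke()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The jitter matters because many clients retrying on the same schedule would hit the concurrency limit again at the same moment.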

Or consider questions such as: what happens to your application if your database goes away? Does it reconnect? Does it reconnect properly? Does it re-resolve the DNS name?
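The DNS question is easy to get wrong, so here is a rough sketch of a reconnect loop that looks the name up again on every attempt. It uses only the Python standard library, the function name is hypothetical, and it assumes the resolver itself is not serving a stale cached answer.

```python
import socket
import time


def connect_with_fresh_dns(host, port, attempts=3, delay=1.0,
                           resolve=socket.getaddrinfo,
                           open_socket=socket.create_connection,
                           sleep=time.sleep):
    """Open a TCP connection, re-resolving the host name on each attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Resolve on every attempt so that a failover which repoints the
            # DNS name is picked up instead of reusing an old address.
            infos = resolve(host, port, type=socket.SOCK_STREAM)
            address = infos[0][4][:2]  # (ip, port) from the first result
            return open_socket(address, timeout=5)
        except OSError as exc:  # includes DNS failures (socket.gaierror)
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

Many database drivers and connection pools hold on to a resolved address, so it is worth testing this behavior with your actual driver rather than assuming it.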

While the cloud does take away most of your “heavy lifting” with infrastructure management, this doesn’t include managing your application and business requirements.

Some Best Practices To Follow

  • Be aware of fixed service quotas, service constraints, and physical resource limits to prevent service interruptions or financial overruns.
  • Validate your backup integrity and processes by performing recovery tests.
  • Ensure a sufficient gap exists between the current quotas and the maximum usage to accommodate failover.
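The last point, headroom for failover, can be checked with simple arithmetic. The sketch below assumes that when one region fails, its load is spread evenly across the remaining regions. The function name, the region names, and the numbers in the usage note are all hypothetical.

```python
def regions_lacking_headroom(usage_by_region, quota_by_region):
    """Return regions that would exceed their quota if any single other region failed."""
    at_risk = set()
    regions = list(usage_by_region)
    for failed in regions:
        survivors = [r for r in regions if r != failed]
        if not survivors:
            continue  # a single-region setup has nowhere to fail over to
        # Assumption: the failed region's load is split evenly among survivors.
        extra = usage_by_region[failed] / len(survivors)
        for region in survivors:
            if usage_by_region[region] + extra > quota_by_region[region]:
                at_risk.add(region)
    return sorted(at_risk)
```

With usage of 900 in `us-east-1` and 300 in `us-west-2`, and quotas of 1,500 and 1,000 respectively, only `us-west-2` is flagged: it would need to carry 1,200 against a quota of 1,000.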

How Do You Respond to a Security Incident?

75% of technology teams are not doing a good job of responding to security incidents. They're not planning ahead for what is happening in the security landscape. Only 30% of teams knew what tooling they would use to either mitigate or investigate a security incident.

Now, we’re talking about security incidents caused by exploited frameworks. Some of the common tell-tale signs observed were:

  • Allowing untrusted code execution on your machines.
  • Failure to set up adequate access controls on storage services, leading to data leakage, such as an S3 bucket whose contents are unintentionally public.
  • Accidental exposure of API keys, such as when checked into a public Git repository.
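The API key case is one of the easier ones to catch before it reaches a repository. As a minimal sketch, the scan below looks for strings shaped like long-term AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits). It is a simplification: dedicated secret scanners cover many more credential formats, and the function name is hypothetical.

```python
import re

# Assumed format of a long-term AWS access key ID: "AKIA" plus 16 characters
# drawn from uppercase letters and digits.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(text):
    """Return (line_number, key_id) pairs for every match in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((number, match.group(1)))
    return hits
```

A check like this could run as a pre-commit hook or a CI step, failing the build when `find_access_key_ids` returns anything.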

Another aspect of security is understanding the health of your workload, which means monitoring and telemetry. The framework distinguishes user behavior monitoring and real user monitoring from workload behavior monitoring. This is notable because teams are undoubtedly collecting all sorts of data but are not doing much with it.

  • More than half of them have clearly defined their KPIs, but fewer have actually established baselines for what normal looks like. 
  • The number drops further when it comes to setting up alerts for those monitored items. 
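Going from "we collect the metric" to "we know what normal looks like and get alerted when it isn't" does not need much. The sketch below derives a baseline from historical samples and flags values more than three standard deviations away. The three-sigma threshold and the function names are assumptions for illustration, and real metrics with seasonality would need something less naive.

```python
import statistics


def build_baseline(samples):
    """Summarize what 'normal' looks like as (mean, standard deviation)."""
    # statistics.stdev needs at least two samples.
    return statistics.fmean(samples), statistics.stdev(samples)


def should_alert(value, baseline, sigmas=3.0):
    """True when value deviates from the baseline mean by more than sigmas * stdev."""
    mean, stdev = baseline
    return abs(value - mean) > sigmas * stdev
```

For instance, with latency samples of 100, 102, 98, 101, and 99 ms, a reading of 103 ms stays within the baseline while 120 ms triggers an alert.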

Then comes access and the principle of least privilege. Although teams understood what work they do and what access they should have, few were enforcing it. There was a complete absence of:

  • Role-based access control
  • Multi-factor authentication
  • Password rotation
  • Secret vaults such as AWS Secrets Manager or HashiCorp Vault (secrets were instead simply baked into application config)

In short, automation of credential management is pretty much nonexistent.
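As a sketch of the alternative to baking secrets into config, the class below fetches a secret at runtime and refreshes it after a time-to-live, so a rotated password is picked up without a redeploy. It assumes a client exposing `get_secret_value(SecretId=...)`, as the boto3 Secrets Manager client does, and a secret stored as a JSON string. The class name, secret name, and TTL are hypothetical.

```python
import json
import time


class SecretCache:
    """Fetch a JSON secret on demand and refresh it once the TTL expires."""

    def __init__(self, client, secret_id, ttl_seconds=300, clock=time.monotonic):
        self._client = client
        self._secret_id = secret_id
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            # Assumption: the secret is stored as a JSON string.
            response = self._client.get_secret_value(SecretId=self._secret_id)
            self._value = json.loads(response["SecretString"])
            self._fetched_at = now
        return self._value
```

With boto3 installed, this could be used as `SecretCache(boto3.client("secretsmanager"), "prod/app/db").get()["password"]`, where `prod/app/db` is a made-up secret name.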

How Do You Reduce Defects, Ease Remediation, and Enhance the Production Deployment Process?

Finally, we come to the operational excellence pillar. Teams are familiar with version control and mostly use Git. They run a lot of automated tests in CI, mainly smoke tests and integration tests.

Operational excellence focuses on defining, executing, measuring, and improving the standard operating procedures in response to incidents and client requests. Following the DevOps philosophy is not enough if the tools and workflows don't support it. The absence of proper documentation and the sole dependence on DevOps engineers to use automation have led to burnout. When DevOps engineers manually stitch together a solution for every situation, the result is slow workflow development and brittle operations.

As per Gartner, platform engineering is an emerging trend within digital transformation efforts that “improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations.”  Beyond commercial hype, an Internal Developer Platform is a curated set of tools, capabilities, and processes packaged together for easy consumption by development teams. Reduced human dependency and standardized workflows empower engineering teams to scale efficiently.

The primary takeaway for us from these reviews is that, today, people are better at building platforms than they are at securing or running them. This is the real lesson, and there's a high chance that it applies to you as well.

What’s Next?

Over time, your workloads evolve to accommodate demanding business needs and highly reliant customers, which makes it essential to keep them secure, reliable, and performant.

You should try the AWS Well-Architected Tool, which is available right in your AWS console. You can begin by working through its questions and following the linked information to better understand your own practice.

Strip off the 'AWS Label' from the WAR tool, and you're left with best practices helping you deliver a consistent approach to architecting secure and scalable systems on the AWS Cloud. 


Published at DZone with permission of Komal J Prabhakar. See the original article here.

Opinions expressed by DZone contributors are their own.
