nervico-team · cloud-architecture · 12 min read
AWS Well-Architected Framework: The 6 Pillars Explained
Complete guide to the AWS Well-Architected Framework: all 6 pillars explained with practical examples, common anti-patterns, and how to apply each principle in real architectures.
AWS has published the Well-Architected Framework since 2015. It is a set of best practices, evaluation questions, and design patterns that define what a well-built cloud architecture should look like. It is not a product you buy or a service you activate. It is a reference framework that AWS has distilled from operating infrastructure for millions of customers.
The framework has 6 pillars. Each pillar covers a fundamental aspect of architecture. And while AWS presents it as an integrated whole, the reality is that the pillars compete with each other: optimizing costs can compromise security, maximizing reliability can inflate the bill, and pursuing peak performance can complicate operations.
This article explains each pillar with technical depth, practical examples, and the most common anti-patterns we encounter in architecture audits.
What the Well-Architected Framework Is
Origin and Evolution
The framework started as an internal AWS document for evaluating customer architectures. It became public in 2015 with 4 original pillars. Operational Excellence was added in 2016 and Sustainability at the end of 2021, bringing the framework to its current 6 pillars.
AWS also publishes specialized “lenses” for specific industries and use cases: Serverless, SaaS, Machine Learning, IoT, Financial Services. Each lens adds industry-specific questions and practices to the base framework.
The Well-Architected Tool
AWS provides a free tool within the console called the AWS Well-Architected Tool. It allows you to:
- Evaluate your workloads against the 6 pillars by answering a structured questionnaire.
- Identify high and medium risk areas.
- Generate an improvement plan with prioritized actions.
- Compare evaluations over time to measure progress.
The tool does not analyze your infrastructure automatically. It is a manual questionnaire that requires your team to answer honestly. Its value lies in forcing the conversation about architectural decisions that many teams never document.
Pillar 1: Operational Excellence
Principle
Operational excellence focuses on running and monitoring systems to deliver business value, and continuously improving processes and procedures.
Design Principles
- Perform operations as code: All infrastructure must be defined as code. No “manual console changes” are acceptable in production.
- Make frequent, small, reversible changes: Small, frequent deployments instead of large, infrequent releases. If something fails, the rollback affects little.
- Refine operations procedures frequently: Runbooks and playbooks are updated with every incident. What is not documented is lost.
- Anticipate failure: Practice game days and chaos engineering. If you have never tested what happens when an availability zone goes down, you do not know whether your architecture survives it.
- Learn from all operational failures: Blameless post-mortems. Every incident is an opportunity to improve the system, not an opportunity to point fingers.
Common Anti-Patterns
Manual infrastructure: Teams that create resources from the AWS console and have no record of what exists or why. When someone leaves, the knowledge leaves with them.
No proactive alerts: Monitoring is not having CloudWatch. It is having alerts configured that notify you before users detect the problem. If your team learns about an incident because a customer calls, your observability has failed.
Manual deployments: If the deployment process requires more than one command or more than one person, it is a fragile process. CI/CD is not optional. It is the foundation of operational excellence.
Essential Practices
- Infrastructure as Code with Terraform or AWS CDK.
- CI/CD with automatic rollback.
- Alerts based on SLOs (Service Level Objectives), not just infrastructure metrics.
- Documented runbooks for the 10 most likely incidents.
- Written post-mortems shared after every significant incident.
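SLO-based alerting starts from a simple calculation: how much downtime does the SLO actually allow? A minimal Python sketch (the function names are ours, not an AWS API):

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime a monthly availability SLO allows."""
    return (1 - slo) * days * 24 * 60

def budget_burned(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Fraction of the monthly error budget already consumed."""
    return downtime_minutes / monthly_error_budget_minutes(slo, days)

# A 99.9% SLO allows about 43.2 minutes of downtime in a 30-day month.
print(round(monthly_error_budget_minutes(0.999), 1))
```

Alerting when `budget_burned` crosses a threshold (say, 50% with half the month left) ties the pager to what users experience, not to CPU graphs.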
Pillar 2: Security
Principle
Protect data, systems, and assets while delivering business value. Security is the pillar that is never negotiated.
Design Principles
- Implement a strong identity foundation: IAM is the most important AWS service. Every person, every service, and every resource must have the minimum permissions necessary.
- Enable traceability: Every change, every access, every action must be recorded. CloudTrail enabled in all accounts and regions.
- Apply security at all layers: Do not rely solely on the perimeter firewall. Security Groups on instances, Network ACLs on subnets, encryption in transit and at rest, validation in the application.
- Automate security best practices: AWS Config rules, GuardDuty, Security Hub. Manual security checks do not scale.
- Protect data in transit and at rest: TLS for all traffic. KMS encryption for all stored data. No exceptions.
- Minimize data access: Principle of least privilege. If a service does not need access to a database, it should not have it. Period.
Common Anti-Patterns
Overly permissive IAM: Policies with "Action": "*" and "Resource": "*". It is the equivalent of giving every employee the keys to the entire building. In audits, we find this in more than 40% of the AWS accounts we review.
Secrets in code: API keys, database passwords, and access tokens hardcoded in source code or in unencrypted environment variables. Use AWS Secrets Manager or Systems Manager Parameter Store.
Unnecessary public subnets: Databases accessible from the internet because “it was easier to set up.” RDS, ElastiCache, and any datastore must be in private subnets. Always.
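The overly permissive IAM anti-pattern is easy to detect mechanically. A minimal sketch of such a check in Python, working on the standard IAM JSON policy shape (the function name and example policies are ours):

```python
def has_admin_wildcard(policy: dict) -> bool:
    """Flag Allow statements that combine Action "*" with Resource "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions and "*" in resources:
            return True
    return False

too_broad = {"Version": "2012-10-17",
             "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
scoped = {"Version": "2012-10-17",
          "Statement": [{"Effect": "Allow",
                         "Action": ["s3:GetObject"],
                         "Resource": ["arn:aws:s3:::my-bucket/*"]}]}
```

In practice, IAM Access Analyzer and AWS Config managed rules cover this class of check without custom code; the sketch only shows how little it takes to catch the worst case.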
Essential Practices
- IAM with least-privilege policies. Review permissions with IAM Access Analyzer.
- Mandatory MFA for all console users.
- AWS Organizations with SCPs (Service Control Policies) to limit what member accounts can do.
- VPC with private subnets for all datastores.
- KMS encryption for S3, EBS, RDS, and any service that stores data.
- GuardDuty enabled for threat detection.
Pillar 3: Reliability
Principle
The ability of a system to recover from infrastructure or service failures, dynamically acquire compute resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
Design Principles
- Test recovery procedures: If you have never restored an RDS backup, you do not know if it works. Schedule periodic DR (Disaster Recovery) tests.
- Scale horizontally: Increase capacity by adding more small instances, not by making instances larger. Vertical scaling has a ceiling. Horizontal does not.
- Stop guessing capacity: Use auto-scaling. If you are planning capacity manually, you are either wasting money or risking an outage.
- Manage change through automation: Manual changes cause more incidents than automated changes. Automate everything you can.
Common Anti-Patterns
Single point of failure: An RDS database without Multi-AZ. A service running on a single EC2 instance without an auto-scaling group. A single availability zone. If something can fail, it will.
Unverified backups: Having automated backups configured but never having tested a restore. The most dangerous backup is the one you assume works.
Missing circuit breakers: A service that depends on another service and blocks when the dependency fails. Without circuit breakers, a local failure becomes a cascading failure.
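A circuit breaker is conceptually small. An illustrative minimal version in Python (thresholds and names are ours; production systems should reach for a hardened library):

```python
import time

class CircuitBreaker:
    """Open the circuit after max_failures consecutive errors;
    fail fast until reset_after seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The point of failing fast is that the caller gets an immediate error instead of a hung connection, which is what turns a local failure into a cascading one.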
Essential Practices
- Multi-AZ for all stateful services (RDS, ElastiCache, OpenSearch).
- Auto-scaling groups with appropriate health checks.
- Automated backups with periodic restore tests.
- Multi-region architecture for critical workloads.
- Circuit breakers and retry with exponential backoff for all inter-service calls.
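Retry with exponential backoff and full jitter, the last item above, fits in a few lines. An illustrative Python sketch (the defaults are ours, not an SDK configuration):

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
    """Call fn, retrying on exception with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # full jitter: sleep a random amount up to base * 2^attempt, capped
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters: if every client retries on the same fixed schedule, the retries arrive in synchronized waves and keep the struggling dependency down.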
Pillar 4: Performance Efficiency
Principle
Use compute resources efficiently to meet system requirements, and maintain that efficiency as demand and available technologies change.
Design Principles
- Democratize advanced technologies: Use managed services instead of operating your own infrastructure. Do not run your own Elasticsearch cluster if OpenSearch Service covers your use case.
- Go global in minutes: CloudFront for static content, read replicas across multiple regions for data, Route 53 with latency-based routing.
- Use serverless architectures: Eliminate the need to manage servers and let the team focus on business logic.
- Experiment more frequently: The cloud lets you test different configurations without long-term commitment. Test, measure, adjust.
- Have mechanical sympathy: Understand how services work under the hood. Knowing that DynamoDB distributes data by partition key helps you design better keys. Knowing that S3 request limits apply per key prefix helps you structure object names.
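As a concrete example of mechanical sympathy with DynamoDB: a common mitigation for a hot partition key is write sharding, spreading one logical key across several physical partitions with a bounded suffix. A minimal Python sketch (the key format and shard count are illustrative, not a DynamoDB requirement):

```python
import random

NUM_SHARDS = 10  # illustrative; size to your write throughput

def sharded_key(base_key: str) -> str:
    """Append a bounded random suffix so writes spread across partitions."""
    return f"{base_key}#{random.randrange(NUM_SHARDS)}"

def all_shard_keys(base_key: str) -> list[str]:
    """Readers must query every shard and merge the results."""
    return [f"{base_key}#{i}" for i in range(NUM_SHARDS)]
```

The trade-off is visible in the code: writes scale out, but reads now fan out across all shards, which is why the shard count should stay small and deliberate.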
Common Anti-Patterns
Oversized instances: 40% of EC2 instances are oversized, according to AWS Compute Optimizer data. Every oversized instance is money wasted.
Poorly chosen databases: Using RDS PostgreSQL for everything. Sometimes DynamoDB, ElastiCache, or OpenSearch is the right choice. Database selection should be based on access patterns, not team familiarity.
No CDN: Serving static assets directly from the application server instead of using CloudFront. This adds latency, consumes unnecessary bandwidth, and increases server load.
Essential Practices
- Periodic right-sizing with AWS Compute Optimizer.
- CloudFront for all static assets and, where possible, for APIs with caching.
- Databases chosen by access pattern, not by habit.
- Periodic performance benchmarks with tools like k6, Locust, or Artillery.
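Whichever tool you benchmark with, the numbers to watch are tail latencies, not averages. A small Python sketch of nearest-rank percentiles over raw latency samples (the samples are made up):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[max(0, k - 1)]

latencies_ms = [12, 15, 14, 13, 210, 16, 15, 14, 13, 12]  # one slow outlier
print(percentile(latencies_ms, 50))  # median hides the outlier: 14
print(percentile(latencies_ms, 95))  # p95 exposes it: 210
```

A p95 of 210 ms with a median of 14 ms is exactly the kind of signal an average (33 ms here) would bury.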
Pillar 5: Cost Optimization
Principle
Run systems at the lowest price point while delivering business value.
Design Principles
- Implement cloud financial management: Someone on the team must be responsible for costs. If nobody watches them, they grow.
- Adopt a consumption model: Pay only for resources consumed. Shut down development environments outside working hours.
- Measure overall efficiency: Do not optimize the cost of an individual service if the impact on the overall system is negative.
- Stop spending money on undifferentiated heavy lifting: Use managed services instead of running your own infrastructure. Your team’s time is more expensive than the AWS bill.
- Analyze and attribute expenditure: Cost allocation tags on all resources. Without tags, you cannot know which team, project, or environment generates each cost.
Common Anti-Patterns
Orphaned resources: Unattached EBS volumes, unassociated Elastic IPs, old snapshots, load balancers with no traffic. They accumulate cost without generating value.
On-Demand for everything: Not using Savings Plans or Reserved Instances for predictable workloads. You can save between 30% and 72% with 1-3 year commitments.
24/7 development environments: Development and staging environments running around the clock when they are only used 8 hours a day. Scheduling automatic shutdown saves 66% on those environments.
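The arithmetic behind that figure, in Python:

```python
# An environment used 8 hours a day but billed 24 hours a day
# wastes two-thirds of its cost.
hours_needed_per_day = 8
hours_billed_per_day = 24
savings = 1 - hours_needed_per_day / hours_billed_per_day
print(f"{savings:.1%}")  # 66.7%

# Shutting it down on weekends as well pushes the savings further.
weekday_hours = 8 * 5            # 8 hours, Monday to Friday
week_hours = 24 * 7
weekend_aware = 1 - weekday_hours / week_hours   # roughly 76%
```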
Essential Practices
- AWS Budgets with alerts at 50%, 80%, and 100%.
- Mandatory cost allocation tags.
- Monthly cost reviews with the team.
- Savings Plans for compute with predictable load.
- Automation to shut down non-production environments outside working hours.
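That shutdown automation usually reduces to a schedule check evaluated by a cron job or a scheduled Lambda. A minimal Python sketch (the working-hours window is an example, not a standard):

```python
from datetime import datetime

WORK_START, WORK_END = 8, 20   # 08:00-20:00, an illustrative window
WORKDAYS = range(0, 5)         # Monday=0 .. Friday=4

def should_be_running(now: datetime) -> bool:
    """Decide whether a non-production environment should be up right now."""
    return now.weekday() in WORKDAYS and WORK_START <= now.hour < WORK_END
```

The scheduler compares this decision against the environment's actual state and starts or stops instances accordingly; AWS Instance Scheduler is a packaged version of the same idea.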
Pillar 6: Sustainability
Principle
Minimize the environmental impact of cloud workloads.
Design Principles
- Understand your impact: The AWS Customer Carbon Footprint Tool shows the CO2 emissions associated with your account.
- Establish sustainability goals: Define energy efficiency metrics per transaction or per user.
- Maximize utilization: Underutilized resources consume energy without generating value. Right-sizing and auto-scaling are sustainability practices as much as cost practices.
- Use managed services: AWS can run workloads with greater energy efficiency than most corporate data centers.
- Reduce downstream impact: Optimize response sizes, use compression, minimize unnecessary data transfer.
Common Anti-Patterns
Unnecessary processing: Running ETL jobs that process entire datasets when only incremental data has changed. Processing data that nobody queries.
Indefinite storage: Keeping all logs forever “just in case.” Define retention policies based on actual requirements, not fear.
Excessive data transfer: Architectures that move data between regions or between services without necessity. Every GB transferred consumes energy.
Essential Practices
- Lifecycle policies on S3 to move data to Glacier or delete it when no longer needed.
- Incremental processing instead of full reprocessing.
- Graviton (ARM) for compatible workloads: up to 60% more energy-efficient than equivalent x86 instances.
- AWS regions with renewable energy when the use case permits.
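A lifecycle rule of the kind in the first item, expressed as the dict that boto3's put_bucket_lifecycle_configuration expects (the prefix and day counts are examples to adapt):

```python
# Move logs to Glacier after 30 days and delete them after a year.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applied with (requires credentials, shown for context):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-log-bucket", LifecycleConfiguration=lifecycle_configuration)
```

Once the rule exists, retention stops being a per-object decision: objects under the prefix age into Glacier and expire on schedule without further intervention.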
Pillar Interactions: The Inevitable Trade-Offs
Security vs Cost Optimization
Encrypting everything with KMS adds cost per encryption/decryption operation. Separate accounts per environment (a fundamental security practice) multiply base service costs like NAT Gateways, VPNs, and monitoring tools. The principle of least privilege in IAM requires engineering time to maintain granular policies instead of using AdministratorAccess.
Security always wins this trade-off. The cost of a security incident (GDPR fines, customer loss, reputational damage) far exceeds the savings from ignoring best practices.
Reliability vs Cost Optimization
Multi-AZ doubles database costs. Multi-region can triple them or more. Reserved Instances save money but reduce flexibility to change instance types.
The right strategy depends on the SLA your business needs. 99.9% availability (43 minutes of downtime per month) is achievable with Multi-AZ at a reasonable cost. 99.99% (4 minutes per month) requires multi-region and a significantly larger budget.
Performance vs Operational Complexity
Using the right service for each use case (DynamoDB for key-value access, ElastiCache for caching, OpenSearch for search) optimizes performance but adds operational complexity. Every additional service is another service to monitor, scale, patch, and train the team on.
For small teams, consolidating into fewer services (even if they are not optimal for every case) can be the right decision. Perfect performance has no value if the team cannot operate the infrastructure.
AWS Well-Architected Lenses
AWS publishes specialized lenses that extend the base framework for specific use cases:
Serverless Lens
Adds questions and practices specific to serverless architectures: cold starts, Lambda function design, event-driven patterns, state management in stateless systems.
SaaS Lens
Focused on multi-tenant architectures: tenant data isolation, tenancy models (silo vs pool), automated onboarding, per-tenant billing.
Machine Learning Lens
Covers the complete ML lifecycle: data preparation, training, inference, model monitoring, drift detection, model governance.
Financial Services Lens
Financial sector-specific requirements: regulatory compliance, encryption, auditing, resilience, mandatory DR testing.
Each lens adds between 20 and 40 additional questions to the base framework. You do not need to apply all lenses: only those relevant to your use case.
How to Conduct a Well-Architected Review
The Process
- Select the workload: Do not review the entire account. Choose a specific application or service.
- Assemble the right team: You need the architect, the development team, the operations team, and ideally someone from the business side.
- Answer the questions: The Well-Architected Tool has structured questions by pillar. Answer honestly, not aspirationally.
- Prioritize the risks: You cannot solve everything at once. Prioritize by business impact and implementation effort.
- Create an improvement plan: Concrete actions, assigned to people, with dates. If it has no date and no owner, it will not happen.
- Iterate: Repeat the review every 6-12 months. Architectures evolve, and what was acceptable a year ago may be a risk today.
The Most Common Mistake
The most common mistake is treating the Well-Architected review as a compliance exercise: answering the questions to “pass” and filing the report. The real value is in the conversations it generates, in the decisions it documents, and in the risks it makes visible.
Conclusion
The AWS Well-Architected Framework is not a certification or a quality seal. It is a structured reflection tool that forces teams to question their architectural decisions against a proven standard.
The 6 pillars are interdependent and, in many cases, compete with each other. Excellence is not about maximizing each pillar individually but about finding the right balance for your context: your budget, your team, your business requirements, and your risk tolerance.
If you have never conducted a Well-Architected review of your AWS infrastructure, now is a good time to start. At NERVICO, we perform architecture audits based on this framework as a starting point. Request a free audit and we will help you identify the highest-risk areas and the highest-impact improvements.