I have spent the better part of twelve years placing cloud engineers and solutions architects at companies ranging from early-stage startups to Fortune 100 infrastructure teams. I have sat across the table from hundreds of hiring managers and watched them evaluate candidates in real time. And the single most consistent pattern I have noticed is this: the engineers who get offers are not necessarily the ones with the most certifications or the longest AWS experience. They are the ones who can demonstrate that their knowledge translates to judgment.
A junior candidate can recite the Shared Responsibility Model. A senior candidate can tell you exactly which controls are theirs to own, give you a concrete example from a service they have actually run, and name the thing that would go wrong if they got it wrong. That gap, from definition to decision, is what every question on this list is really testing.
The 25 questions below were selected by cross-referencing what is actually being asked in AWS interview loops right now, across cloud engineering, solutions architect, and DevOps/SRE roles. I focused on the topics that show up repeatedly and specifically on the ones where I consistently see candidates stumble, not because they do not know the concept, but because they rehearsed the textbook answer and stopped there. The sample answers are written the way I coach strong candidates: direct, honest about trade-offs, and light on jargon for its own sake.
One more thing before we get into it: these questions span networking, identity, storage, compute, databases, operations, and cost. That is intentional. Modern AWS interview loops at mid-to-senior level mix all of these. If you are only drilling one category, you are leaving something on the table.
Part 1: Fundamentals and Well-Architected Thinking (And Why They Reveal More Than You Think)
Here is something I notice consistently after years of debrief conversations: the fundamentals round tells me more about a candidate than almost any other. Not because the questions are hard – they are not. Because the answers reveal whether someone has actually operated AWS workloads or just read about them. The candidates who stumble here are not confused about definitions. They are unprepared for the follow-up: "Give me a real example."
1. Explain the AWS Shared Responsibility Model with one real example.
What the interviewer is really testing:
Can you move from a definition to a concrete operational implication? Strong candidates do not stop at "AWS secures the cloud, customers secure what they put in the cloud." They name specific controls, specific services, and specific things that go wrong when engineers misunderstand the split.
Sample answer:
The model splits responsibility at the layer where AWS control ends and customer control begins. AWS is responsible for the physical infrastructure, the hypervisor, the managed service internals – the cloud itself. I am responsible for everything I put in it: my IAM configuration, my network rules, my encryption choices, my OS patches on EC2.
The place this bites teams most often is with managed services. Developers sometimes assume that because RDS is "managed," security is handled. AWS manages the database engine patching and the underlying hardware. But the security group rules, the IAM permissions, whether the instance is in a private subnet, whether encryption at rest is enabled – all of that is still mine. I have seen teams with RDS instances in public subnets and no encryption because someone assumed "managed" meant "secure by default."
Follow-up to expect:
"How does the responsibility split change between EC2 and Lambda?" / "Where do you see teams most often misunderstand what they own?"
2. What is the difference between a Region and an Availability Zone, and how does that shape how you design for high availability?
What the interviewer is really testing:
Whether you design for failure domains or just hope for uptime. The definition is trivial. The interesting part is the architectural judgment that follows from it – specifically when multi-AZ is enough and when you actually need multi-Region.
Sample answer:
A Region is a geographic area with multiple, isolated Availability Zones. AZs are physically separate data centers within that Region with independent power, cooling, and networking, but connected with low-latency links. For most HA requirements, spreading across multiple AZs within a single Region is the right starting point. You get fault isolation from single-AZ failures, low latency between tiers, and relatively simple operations.
Multi-Region is a different level of investment. You add it when the business genuinely requires it – a latency requirement that cannot be met from one Region, a compliance requirement for data sovereignty, or an RTO so aggressive that even a full Region failure cannot exceed it. The mistake I see most often is teams reaching for multi-Region because it sounds more resilient, without calculating whether the added complexity and cost is justified by actual business requirements. Multi-Region replication, failover orchestration, and data consistency across Regions are hard problems. I want a clear RTO/RPO requirement before I take that on.
Follow-up to expect:
"Walk me through a specific architecture you would build for 99.99% availability." / "How does this change for a stateful vs stateless application?"
Part 2: Networking and VPC (Where Most Production Problems Actually Live)
Networking questions are where I see the biggest split between engineers who have shipped production systems and engineers who have only built things in tutorials. Tutorials skip the parts that matter: subnets, route tables, NAT, security group chaining. The questions in this section are bread-and-butter for any AWS role, and the answers are surprisingly specific once you have actually debugged a connectivity issue at 2am.
3. Design a VPC for a public web tier, private application tier, and private database tier.
What the interviewer is really testing:
Network boundary thinking and blast radius awareness. A candidate who puts databases in public subnets, or who cannot explain why the DB tier has no internet route, is not ready to own a production AWS environment.
Sample answer:
I would start with at least two Availability Zones for every tier. Public subnets in each AZ hold the load balancer and the NAT gateways – those are the only things that need direct internet exposure. Private subnets in each AZ hold the application servers. A second set of private subnets, further isolated, holds the databases.
Route tables are how you enforce the boundaries. Public subnets have a route to the internet gateway for 0.0.0.0/0. Private app subnets have a route to the NAT gateway for outbound traffic like patch updates, but no inbound internet route. DB subnets have no internet route at all – the only traffic they accept comes from the application layer, enforced by security groups.
Security groups do the fine-grained work: the load balancer accepts 443 from the internet; the app layer accepts only from the load balancer's security group; the database accepts only from the app layer's security group. That layering is what gives you blast radius containment. If the web tier is compromised, the attacker still cannot reach the database directly.
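The chaining logic above can be sketched as a tiny reachability model. The tier names and structure here are purely illustrative, not any real AWS API:

```python
# Toy model of the security-group chain: each tier only accepts traffic
# whose source is the previous tier's security group.
allowed_sources = {
    "alb": {"internet"},  # load balancer accepts 443 from anywhere
    "app": {"alb"},       # app tier accepts only the ALB's security group
    "db":  {"app"},       # database accepts only the app tier's security group
}

def can_reach(src: str, dst: str) -> bool:
    """True if dst's security group admits traffic directly from src."""
    return src in allowed_sources.get(dst, set())

# Blast radius containment: a compromised web tier still cannot
# reach the database directly.
assert can_reach("internet", "alb")
assert can_reach("app", "db")
assert not can_reach("internet", "db")
```

In an interview, being able to narrate that containment property, not just the subnet layout, is what distinguishes the answer.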
Follow-up to expect:
"What happens to outbound traffic from private app subnets if a NAT gateway fails?" / "How would you isolate dev and prod environments – same VPC or separate?"
4. Security groups vs network ACLs – what is the difference and when do you actually reach for NACLs?
What the interviewer is really testing:
Whether you understand stateful vs stateless packet filtering and can articulate where each control belongs in a layered defense model. "They both control traffic" is not an answer.
Sample answer:
Security groups are stateful. If I allow an inbound connection, the return traffic is automatically allowed without a separate rule. They operate at the instance level and I use them as the primary control for almost everything: they are flexible, specific, and easy to audit.
NACLs are stateless and operate at the subnet level. Because they are stateless, I have to explicitly allow traffic in both directions for every flow. That makes them more cumbersome to manage for typical workloads. Where I actually reach for NACLs: coarse subnet-level guardrails, and hard deny rules. If I need to block a specific known-malicious CIDR range and I want that block to be enforced regardless of what security group rules someone might add later, a NACL deny is the right tool. It is a backstop layer, not a replacement for security groups.
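The stateful/stateless distinction is easy to model. This sketch is illustrative only, not the real evaluation engine, but it shows why a stateless NACL needs an explicit rule for return traffic on ephemeral ports, with rules evaluated in number order and first match winning:

```python
# NACL rules as (rule_number, direction, (port_from, port_to), action).
# Evaluated in ascending rule number; first match wins; default is deny.
def nacl_allows(rules, direction, port):
    for _, rule_dir, (lo, hi), action in sorted(rules):
        if rule_dir == direction and lo <= port <= hi:
            return action == "allow"
    return False

rules = [(100, "inbound", (443, 443), "allow")]

# Inbound HTTPS is allowed, but the response on an ephemeral port is not:
# stateless filtering has no memory of the original connection.
assert nacl_allows(rules, "inbound", 443)
assert not nacl_allows(rules, "outbound", 50000)

# Add an explicit return-traffic rule and the flow works.
rules.append((110, "outbound", (1024, 65535), "allow"))
assert nacl_allows(rules, "outbound", 50000)
```

This is also why the follow-up about rule order matters: a low-numbered deny placed before a broad allow changes the outcome for every packet it matches.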
Follow-up to expect:
"Walk me through what happens to a packet that is allowed by a security group but denied by a NACL." / "How do you manage NACL rule order in practice?"
5. EC2 instances in a private subnet need outbound internet access for OS patching. What do you do?
What the interviewer is really testing:
Real-world "private but not stuck" networking. This is a scenario that comes up in every AWS environment, and the right answer has a nuance most candidates skip: if you only need AWS service access, there is a better option than NAT.
Sample answer:
For general internet access – OS patches, third-party package repositories – I put a NAT gateway in a public subnet and add a route in the private subnet's route table pointing 0.0.0.0/0 to that NAT gateway. The NAT gateway allows outbound traffic while preventing any inbound connections from the internet. The instances stay private.
But I always ask: do these instances actually need general internet access, or do they only need to reach AWS services? If it is just S3 for artifact downloads, or Systems Manager for patching orchestration, VPC endpoints are a better answer. A gateway endpoint for S3 costs nothing and keeps traffic off the internet path entirely. An interface endpoint for Systems Manager gives you private connectivity without NAT. I would rather eliminate the internet dependency than manage NAT gateway costs and egress traffic at scale.
Follow-up to expect:
"What is the cost model for NAT gateways at scale and when does it become a problem?" / "How would you patch instances that have no internet access at all, including no NAT?"
6. VPC endpoints – when do you use gateway endpoints vs interface endpoints?
What the interviewer is really testing:
This is a cost and security question disguised as a networking question. Candidates who know endpoints exist but cannot explain the difference between gateway and interface endpoints, or name which services use each, have surface-level knowledge.
Sample answer:
Gateway endpoints are a special case – they are free, region-scoped, and currently only support S3 and DynamoDB. They work by adding an entry to the route table that directs traffic for those services to the endpoint rather than the internet. No additional network interface, no cost per hour, no data processing charge.
Interface endpoints use AWS PrivateLink. They create an elastic network interface in your subnet with a private IP, and traffic flows over that interface to the target service. They cover a much broader range of services – Secrets Manager, KMS, SQS, SNS, most things with a PrivateLink option. They do have an hourly cost and a data processing charge, so I factor that in before deploying them at scale.
My decision rule: for S3 and DynamoDB, always use gateway endpoints – there is no reason not to. For other AWS services I need to reach privately, interface endpoints. For accessing services from on-premises over Direct Connect or VPN, interface endpoints are the path because gateway endpoints do not extend outside the VPC.
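That decision rule condenses to a few lines. A hypothetical helper, just to make the logic explicit:

```python
def choose_endpoint_type(service: str, from_on_prem: bool = False) -> str:
    """Gateway endpoints are free but cover only S3 and DynamoDB, and they
    do not extend outside the VPC; everything else goes over PrivateLink."""
    if service in {"s3", "dynamodb"} and not from_on_prem:
        return "gateway"
    return "interface"

assert choose_endpoint_type("s3") == "gateway"
assert choose_endpoint_type("kms") == "interface"
# On-premises traffic over Direct Connect/VPN cannot use a gateway endpoint.
assert choose_endpoint_type("s3", from_on_prem=True) == "interface"
```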
Follow-up to expect:
"How do you enforce that traffic to S3 from your VPC always uses the endpoint and never goes over the internet?"
Part 3: Identity, Security, and Governance (Where the Expensive Mistakes Happen)
If I had to pick one category that separates engineers who are ready to own a production AWS environment from those who are not, it is IAM. Not because IAM is intellectually difficult – it is not. Because IAM mistakes are often invisible until something goes wrong, and when they go wrong the consequences range from a minor permission denied error to a full account compromise. The candidates who answer these questions well have usually been burned by IAM at some point. That experience shows.
7. Walk me through IAM policy evaluation – implicit deny, explicit deny, and what wins.
What the interviewer is really testing:
A crisp, correct mental model. There is a right answer here, and hedging or getting it wrong signals that this person should not be owning access control decisions in a production account.
Sample answer:
AWS evaluates IAM policies with a clear decision logic. The starting state for every request is an implicit deny – unless something explicitly allows the action, it is denied. An explicit Allow in an applicable policy can override that implicit deny and grant access. But an explicit Deny in any applicable policy overrides all Allows, full stop. Explicit deny wins.
In an AWS Organizations setup, this gets a layer more complex. Service Control Policies act as guardrails on accounts. Even if a user's identity policy explicitly allows an action, if the account's SCP does not allow it, the request is denied. Effective permissions are always the intersection of what identity policies allow and what SCPs permit. I think of SCPs as the ceiling and IAM policies as what you allocate within that ceiling. You cannot grant yourself access to something the SCP has already blocked, regardless of what your IAM policy says.
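A deliberately simplified model of that logic helps it stick. This sketch ignores resource policies, permission boundaries, and session policies, but it captures the deny-wins rule and the SCP intersection:

```python
def evaluate(statements, action):
    """statements: list of (effect, set_of_actions). Start from implicit
    deny; an Allow can flip that; an explicit Deny beats everything."""
    decision = "implicit-deny"
    for effect, actions in statements:
        if action in actions:
            if effect == "Deny":
                return "deny"  # explicit deny short-circuits all Allows
            decision = "allow"
    return decision

def request_allowed(identity_policy, scp, action):
    # Effective permissions are the intersection: both must say "allow".
    return (evaluate(identity_policy, action) == "allow"
            and evaluate(scp, action) == "allow")

identity = [("Allow", {"s3:GetObject", "ec2:StartInstances"})]
scp      = [("Allow", {"s3:GetObject"})]  # the SCP ceiling excludes EC2

assert request_allowed(identity, scp, "s3:GetObject")
assert not request_allowed(identity, scp, "ec2:StartInstances")
```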
Follow-up to expect:
"What is the evaluation order when both an identity policy and a resource policy are involved?" / "Can an SCP ever grant permissions on its own?"
8. IAM users vs IAM roles – when do you use roles, and how does cross-account access actually work?
What the interviewer is really testing:
Whether you default to roles for workloads the way experienced AWS engineers do. Anyone who still creates IAM users with long-lived access keys for applications has not operated a production environment long enough to get burned by it.
Sample answer:
IAM users have long-lived credentials – passwords and access keys that do not automatically rotate. Roles use temporary credentials issued by STS, typically valid for minutes to hours. For any workload – an EC2 instance, a Lambda function, a container, a CI/CD pipeline – I use roles. The EC2 instance assumes an instance profile role; the Lambda has an execution role. No long-lived keys, no key rotation problem, no keys accidentally committed to a repository.
For cross-account access, the pattern is: in the target account, create a role with the required permissions and a trust policy that specifies who is allowed to assume it – the principal in the source account. In the source account, the identity or service calls STS AssumeRole to get temporary credentials scoped to that role. The temporary credentials expire. Nothing persistent is shared between accounts. That model scales cleanly across many accounts without the key management overhead that comes with IAM users.
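In JSON terms, the target-account role's trust policy looks like this. The account ID, role name, and external ID below are placeholders, not real values:

```python
import json

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # The only principal allowed to assume this role: a specific role
        # in the source account (the account ID here is made up).
        "Principal": {"AWS": "arn:aws:iam::111111111111:role/ci-deployer"},
        "Action": "sts:AssumeRole",
        # Optional extra guard, common when granting third-party access.
        "Condition": {"StringEquals": {"sts:ExternalId": "example-id"}},
    }],
}

# The source side then calls STS (boto3 sketch):
#   creds = sts.assume_role(RoleArn=role_arn, RoleSessionName="deploy")
# and receives temporary credentials that expire on their own.
print(json.dumps(trust_policy, indent=2))
```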
Follow-up to expect:
"How do you limit what an assumed role can do to prevent privilege escalation?" / "How would you audit who has been assuming a cross-account role and when?"
9. How do Service Control Policies work in AWS Organizations?
What the interviewer is really testing:
Whether you understand the difference between granting permissions and constraining them. SCPs are one of the most important controls in a multi-account AWS setup and one of the most misunderstood.
Sample answer:
SCPs define the maximum permissions available in an account or OU. They do not grant permissions on their own – a user in an account still needs an IAM policy that allows the action. What SCPs do is set the ceiling. If an SCP denies an action or simply does not allow it, no IAM policy in that account can override it.
The way I use SCPs in practice: deny actions that no team in a given OU should ever be able to take – like leaving the organization, disabling CloudTrail, or creating IAM users with console access in a workload account. I also use SCPs to enforce region restrictions, blocking API calls to regions we have no business operating in. That reduces blast radius if credentials are compromised. The important thing to communicate to teams: SCPs are guardrails, not permission grants. Teams still need their own IAM policies for anything they want to do. SCPs just ensure they cannot accidentally or maliciously step outside defined boundaries.
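A region-restriction SCP of the kind described above might look like the following. The approved regions and the list of exempted global services are assumptions to adapt per organization:

```python
region_guardrail_scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideApprovedRegions",
        "Effect": "Deny",
        # Global services are exempted so the deny does not break IAM,
        # Organizations, or STS calls that resolve outside a region.
        "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {
                # Example regions only; use your own approved list.
                "aws:RequestedRegion": ["eu-west-1", "eu-central-1"]
            }
        },
    }],
}
```

Note the shape: it is a Deny with a condition, which is exactly the guardrail-not-grant model. Nothing in this document grants anyone anything.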
Follow-up to expect:
"How do you roll out a new SCP without breaking existing workloads?" / "What is the difference between a deny-list SCP strategy and an allow-list strategy, and when would you use each?"
10. KMS basics – what is envelope encryption and why do key policies matter more than IAM policies?
What the interviewer is really testing:
Security depth. Candidates who treat encryption as "just turn it on" have not thought about who controls the keys, what happens if key access is revoked, or how to audit key usage.
Sample answer:
Envelope encryption solves a practical problem: you do not want to send large data payloads to KMS for encryption because KMS has throughput limits and is not designed for bulk data operations. Instead, KMS generates a data key. You use that data key to encrypt your actual data locally. Then you use KMS to encrypt the data key itself. The encrypted data and the encrypted data key are stored together. To decrypt, you call KMS to decrypt the data key, then use it locally to decrypt the data. KMS only ever handles the small key, not the large payload.
Key policies matter because they are the foundational access control for a KMS key. Unlike almost everything else in AWS, where IAM policies are the primary control, every KMS key has a key policy, and it is the root of trust. Unless the key policy explicitly allows a principal to use the key – or explicitly delegates that decision to IAM through the account-root statement – IAM policies alone cannot grant access. Every KMS key I create gets a key policy that explicitly lists which roles and services can use it for which operations – encrypt, decrypt, generate data key – and I keep that list as small as possible. Key usage is logged to CloudTrail automatically, which gives me a full audit trail of every encrypt and decrypt operation.
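The envelope flow can be sketched end to end. Everything here is a stand-in: a real implementation would call `kms.generate_data_key()` and `kms.decrypt()`, and would use a proper authenticated cipher rather than this illustrative XOR keystream:

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR against a SHA-256 counter keystream.
    Illustrative only -- never use this for real encryption."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

master_key = secrets.token_bytes(32)  # stands in for the KMS key, which never leaves KMS
data_key = secrets.token_bytes(32)    # kms.generate_data_key() would return this
wrapped_key = keystream_xor(master_key, data_key)  # KMS returns it already wrapped

# Bulk encryption happens locally; KMS never sees the payload.
payload = b"large payload that never goes to KMS"
ciphertext = keystream_xor(data_key, payload)
# Store ciphertext and wrapped_key together.

# Decrypt: unwrap the data key (a kms.decrypt() call in real life),
# then decrypt the payload locally.
recovered = keystream_xor(master_key, wrapped_key)
assert keystream_xor(recovered, ciphertext) == payload
```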
Follow-up to expect:
"What happens to your encrypted data if you delete a KMS key?" / "How do you handle KMS key access in a disaster recovery scenario where your primary Region is unavailable?"
Part 4: Storage (Where Bad Assumptions Cause Silent Data Loss)
Storage questions are underweighted in candidate prep and overweighted in actual production incidents. The candidates I most want to hire can articulate not just what each storage option is, but what happens to your data under failure conditions. That is the question underneath every storage question in an AWS interview.
11. How would you secure an S3 bucket used for centralized logs?
What the interviewer is really testing:
Whether you treat logs as sensitive, immutable evidence or as files that happen to be stored in S3. The answer reveals your security posture and your operational maturity around audit trails.
Sample answer:
First, access control. Only the specific services writing logs – CloudTrail, ALB, VPC flow logs – and a small, audited set of security roles should have any access. I enforce this with a bucket policy that explicitly denies all other principals, including account root unless MFA is present. Public access block settings are enabled at the bucket and account level.
Encryption: server-side encryption with KMS so I have key-level audit logs of every access. The bucket policy includes a condition that denies PUT requests without encryption – that prevents any misconfigured writer from accidentally storing unencrypted data.
For immutability: versioning enabled, and for CloudTrail logs specifically, S3 Object Lock in compliance mode. That prevents anyone – including the root user – from deleting or modifying log objects during the retention period. That matters for incident investigations where log tampering is a concern.
Lifecycle rules handle cost: logs are in Standard for the first 30 days for fast incident access, transition to a cheaper tier for the next 60 days, then archive or delete based on the compliance retention requirement.
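The deny-unencrypted-writes condition mentioned above is one short bucket policy statement. The bucket name is hypothetical; the condition key is the standard S3 one:

```python
bucket = "central-logs-example"  # placeholder name

deny_unencrypted_puts = {
    "Sid": "DenyUnencryptedPuts",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": f"arn:aws:s3:::{bucket}/*",
    # Reject any PUT that is not using SSE-KMS.
    "Condition": {
        "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
    },
}
```

Because it is an explicit Deny, no misconfigured writer identity can bypass it with an Allow elsewhere.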
Follow-up to expect:
"How do you ensure log completeness – that nothing is missing from your audit trail?" / "How would you grant a third-party auditor read access without exposing more than necessary?"
12. Explain S3 storage classes and lifecycle rules with a real retention example.
What the interviewer is really testing:
Cost literacy tied to operational requirements. The wrong answer is "move everything to Glacier because it is cheaper." The right answer connects retention tiers to actual retrieval patterns and business needs.
Sample answer:
S3 Standard is where everything starts – fast access, high durability, no retrieval fees. I keep objects in Standard for as long as there is a realistic operational need to access them quickly. For application logs, that is typically 14 to 30 days – the window where engineers are debugging recent incidents.
After that window, I use lifecycle rules to transition based on access probability. S3 Standard-IA for objects that might still be needed but rarely – things like logs from 30 to 90 days ago. The retrieval cost is acceptable for infrequent access patterns. For longer retention required by compliance – say, 1 year – Glacier Instant Retrieval gives archive pricing with millisecond retrieval, which matters if a compliance audit requires pulling a specific log file on short notice. For truly cold archival beyond a year, Glacier Deep Archive is the cheapest option but comes with a 12-hour retrieval time, so I only use it when I am confident retrieval will be planned, not reactive.
The lifecycle rule configuration is straightforward: transition to Standard-IA at 30 days, transition to Glacier Instant Retrieval at 90 days, expire at 365 days unless there is a compliance hold. That gives me cost control without sacrificing the access speed I actually need.
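That rule, expressed in the shape boto3's `put_bucket_lifecycle_configuration` expects (the rule ID and prefix are placeholders):

```python
lifecycle_configuration = {
    "Rules": [{
        "ID": "log-retention",          # placeholder rule name
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},  # placeholder prefix
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER_IR"},  # Glacier Instant Retrieval
        ],
        "Expiration": {"Days": 365},    # unless a compliance hold applies
    }]
}
```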
Follow-up to expect:
"How do you handle retrieval costs when the access pattern is unpredictable?" / "What happens to lifecycle rules when versioning is enabled?"
13. What is Amazon S3's data consistency model today, and why does it matter for how you build on it?
What the interviewer is really testing:
Whether candidates keep up with platform changes and avoid building workarounds for problems that no longer exist. Getting this wrong signals outdated knowledge.
Sample answer:
S3 has provided strong consistency for all object operations since December 2020. Strong read-after-write consistency for PUTs and DELETEs, and strongly consistent LIST operations. What you write is what you immediately read back. List operations reflect the current state of the bucket.
Before that change, S3 had eventual consistency for overwrite PUTs and DELETEs in some cases, and eventually consistent LIST operations. A lot of older architecture patterns include workarounds for that – retry logic for missing objects after writes, delays before listing newly created objects. Those workarounds are now unnecessary and I actively remove them when I encounter them in existing systems, because they add latency and complexity for a problem that no longer exists.
In practice, strong consistency means I can write an object and immediately list or read it without coordination overhead. It simplifies pipeline designs where downstream steps depend on upstream S3 writes.
Follow-up to expect:
"Are there any remaining eventual consistency behaviors in S3 I should be aware of?" / "How does this affect how you design event-driven architectures on top of S3?"
14. EBS vs instance store – which do you use for what?
What the interviewer is really testing:
Whether you understand data durability under failure conditions. This catches engineers who have never lost data to an instance termination – and those who have.
Sample answer:
EBS is durable block storage that exists independently from the EC2 instance. I can snapshot it, reattach it to a different instance, and it survives instance termination. It is the right choice for anything I cannot lose: the OS volume, application data, database storage.
Instance store is ephemeral storage physically attached to the host. When the instance is stopped, terminated, or the host fails, that data is gone. It is not a durability option – it is a performance option. Instance store provides very high throughput and low latency because there is no network hop, which makes it a good fit for workloads that need high-speed scratch space: temporary sort buffers, intermediate data in a processing pipeline, cache data that can be rebuilt. The key requirement is that the workload is designed to treat that storage as disposable and rebuild from a durable source if needed.
The mistake I see most often: engineers using instance store for something they expect to persist and not realizing it until the instance is replaced.
Follow-up to expect:
"How do EBS Multi-Attach and EBS io2 Block Express change your storage design options?" / "What is your snapshot strategy for EBS volumes backing a production database?"
Part 5: Compute and Scaling (Matching the Tool to the Workload)
Compute questions test judgment more than any other category. There is no universally correct compute choice – the right answer depends on the workload, the team, the traffic pattern, and the operational context. Candidates who answer with "always use X" without qualifying it are telling me they have not operated enough different workloads to understand the trade-offs.
15. EC2 purchasing options – On-Demand, Reserved Instances, Savings Plans, Spot – how do you choose?
What the interviewer is really testing:
Cost ownership. Whether they can match a purchase model to a workload profile rather than defaulting to the cheapest option or the most familiar one.
Sample answer:
On-Demand is the baseline – full flexibility, no commitment, highest per-hour cost. I use it for workloads with unpredictable traffic, short-lived environments, and anything I am not yet sure will stick around.
Reserved Instances and Savings Plans both trade commitment for discount. RIs commit to specific instance attributes – family, size, Region, OS – in exchange for up to 72% off On-Demand pricing. Savings Plans are more flexible – you commit to a spend level in dollars per hour rather than specific instance attributes, and the discount applies broadly across matching usage. For stable, predictable workloads running continuously, one of these is almost always justified for the cost savings.
Spot is the right choice for fault-tolerant, interruption-tolerant workloads: batch processing, data transformation pipelines, CI/CD build workers, rendering. The discount is significant – up to 90% – but instances can be reclaimed with two minutes' notice. Any workload using Spot needs to be designed to handle interruption gracefully: checkpoint state, use SQS to queue work, or run a mixed Auto Scaling group that falls back to On-Demand capacity when Spot is reclaimed.
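The break-even arithmetic from the follow-up is worth having at your fingertips. With a no-upfront commitment you pay the reserved rate for every hour of the term whether or not the instance runs, so the comparison reduces to a utilization threshold (the rates below are made up, not real AWS pricing):

```python
def breakeven_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of hours an instance must actually run before a no-upfront RI
    (billed for every hour of the term) beats paying On-Demand only for the
    hours used. Both rates are in $/hour."""
    return reserved_rate / on_demand_rate

# Illustrative numbers: $0.10/hr On-Demand vs $0.06/hr reserved.
# Above ~60% utilization the commitment pays off; below it, On-Demand wins.
assert abs(breakeven_utilization(0.10, 0.06) - 0.6) < 1e-9
```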
Follow-up to expect:
"How do you model the break-even point between On-Demand and Reserved Instances for a new workload?" / "What happens to your Spot instances when there is a capacity crunch in a particular AZ?"
16. ALB vs NLB – what is the difference and when do you pick one?
What the interviewer is really testing:
Network layer awareness and real service selection, not just knowing both exist.
Sample answer:
ALB operates at Layer 7 – it understands HTTP and HTTPS. That means it can make routing decisions based on the request itself: path-based routing, host-based routing, header conditions, query string matching. It integrates natively with WAF, supports gRPC, and provides detailed request-level metrics. For web applications, APIs, and microservices communicating over HTTP, ALB is almost always the right choice.
NLB operates at Layer 4 – it routes TCP, UDP, and TLS traffic without looking at application-layer content. The trade-off is simplicity and performance: NLB handles extreme throughput with very low latency and provides static IP addresses per AZ, which matters for clients that require IP-based whitelisting. I reach for NLB when the protocol is not HTTP – game servers, IoT devices, custom TCP protocols – when I need static IPs for the load balancer endpoint, or when I am putting an NLB in front of an ALB, with the ALB as a target, to combine static IPs with Layer 7 routing.
Follow-up to expect:
"Can you put an NLB in front of an ALB and why would you?" / "How does connection draining work differently between ALB and NLB?"
17. How would you scale an EC2 workload for unpredictable traffic spikes without manually managing it?
What the interviewer is really testing:
Autoscaling policy depth – not just knowing Auto Scaling Groups exist, but understanding which metric to scale on and what guardrails to set.
Sample answer:
I would use an Auto Scaling group spread across multiple AZs behind an ALB. The group size responds to a scaling policy, and the policy choice matters a lot. Target tracking scaling on a meaningful metric is usually the right starting point. For a web service, I prefer request count per target from the ALB over raw CPU – it more directly reflects the actual work each instance is handling, rather than a secondary metric that might lag or be noisy.
I set a minimum capacity based on the baseline traffic floor – enough to handle normal load without scaling down to zero. I set a maximum capacity based on the largest realistic spike I would want to absorb – and at some point beyond that, I would rather see requests queue or degrade gracefully than spin up unbounded capacity. Health checks need to be configured at both the EC2 level and the ALB level so unhealthy instances are replaced rather than continuing to receive traffic.
I also tune scale-in behavior carefully. Aggressive scale-in during a spike can create thrash – the group scales out, then starts scaling back in before traffic has fully normalized, then scales out again. Instance warm-up periods and conservative cooldowns reduce that.
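As a concrete sketch, a target-tracking policy on ALB request count per target takes roughly this shape in a `put_scaling_policy` call. The policy name, resource label values, target value, and warmup below are placeholders to tune per workload:

```python
scaling_policy = {
    "PolicyName": "req-per-target",  # placeholder name
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # ALB + target group identifier; the format is real, values made up.
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/fedcba0987654321",
        },
        "TargetValue": 800.0,   # requests per instance before scaling out
        "DisableScaleIn": False,
    },
    "EstimatedInstanceWarmup": 120,  # seconds before a new instance's metrics count
}
```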
Follow-up to expect:
"How do you handle long-running requests that are in flight when an instance is being terminated by scale-in?" / "What is the difference between target tracking and step scaling and when would you choose step scaling?"

18. Lambda throttling and concurrency – how do you prevent one hot function from consuming all available concurrency?
What the interviewer is really testing:
Whether they understand the account-level concurrency limit and the blast radius problem that comes from it. "Lambda scales automatically" is not a complete answer.
Sample answer:
Lambda scales by adding concurrent executions for incoming requests, up to the account-level concurrency limit in a Region. That limit is shared across all functions in the account. If one function goes viral – say, an upstream service starts hammering it – it can consume enough concurrency to throttle other functions that have nothing to do with it. That is the blast radius problem.
Reserved concurrency solves this in two ways simultaneously: it caps a specific function's concurrency so it cannot consume more than its allocation, and it guarantees that allocation is available so other functions cannot steal it. I use reserved concurrency for both: limiting noisy functions and protecting critical functions.
Provisioned concurrency is a different tool – it pre-warms Lambda execution environments to eliminate cold starts. I use it for latency-sensitive functions where the first request after a quiet period cannot afford the cold start penalty. It costs more than on-demand execution, so I apply it selectively – usually to the functions that are customer-facing with p99 latency SLOs.
For downstream protection: if Lambda is writing to a database or calling an API, I put a queue between them. Lambda writes to SQS, a separate Lambda reads from SQS with a concurrency limit set to match what the downstream system can handle. That decouples the input throughput from the downstream processing rate.
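The arithmetic behind the blast radius is worth internalizing. A minimal model of how reserved concurrency carves up the shared account pool (function names and limits are hypothetical; the real evaluation inside Lambda is more involved):

```python
def unreserved_pool(account_limit, reserved_allocations):
    """Concurrency left for every function WITHOUT reserved concurrency.
    Each reserved allocation is carved out of the shared account pool,
    whether or not the function is actually using it."""
    return account_limit - sum(reserved_allocations.values())

def is_throttled(function, in_flight, reserved_allocations,
                 account_limit, unreserved_in_use):
    """A function with reserved concurrency throttles at its own cap;
    everything else shares whatever the unreserved pool has left."""
    if function in reserved_allocations:
        return in_flight >= reserved_allocations[function]
    return unreserved_in_use >= unreserved_pool(account_limit,
                                                reserved_allocations)
```

With a 1,000-execution account limit and 250 units reserved across two functions, the unreserved pool is 750, so a viral function without a reservation can starve every other unreserved function but can never touch the reserved allocations.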
Follow-up to expect:
"How do you handle backpressure when Lambda is consuming from a Kinesis stream faster than downstream can handle?" / "What happens to requests that hit a throttled Lambda – are they retried automatically?"
19. ECS vs EKS vs Fargate – how do you decide for a new containerized service?
What the interviewer is really testing:
Ability to choose the simplest tool that fits the actual requirements. Engineers who default to EKS for everything have often not been accountable for operating a Kubernetes cluster.
Sample answer:
ECS is AWS-native container orchestration. It integrates tightly with the rest of AWS – IAM, ALB, CloudWatch, Secrets Manager – and the operational model is straightforward. If the team does not need Kubernetes-specific tooling, ecosystem integrations, or multi-cloud portability, ECS is almost always simpler to operate and easier to onboard engineers onto.
EKS makes sense when there is a genuine need for the Kubernetes ecosystem: Helm charts the team is already using, operators for stateful workloads, multi-cloud strategy where Kubernetes provides a consistent plane, or a large organization that has already standardized on Kubernetes tooling. But EKS brings real operational overhead – cluster upgrades, node group management, the Kubernetes control plane complexity. I want a clear reason before I take that on.
Fargate is the compute layer – it eliminates node management for both ECS and EKS. You do not provision or patch EC2 instances; you define the CPU and memory for a task and Fargate handles placement. I use Fargate when I want to minimize infrastructure management overhead and the workload characteristics fit its constraints – no host-level access required, no GPU workloads, tolerable per-vCPU cost relative to what I would spend managing node groups.
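The decision logic above can be reduced to a sketch. The inputs are simplified to booleans for illustration; a real decision weighs team skills and cost modeling too:

```python
def choose_orchestrator(needs_k8s_ecosystem, multi_cloud, team_standardized_on_k8s):
    """Mirror of the reasoning above: take on EKS only with a concrete
    reason; otherwise default to the operationally simpler ECS."""
    if needs_k8s_ecosystem or multi_cloud or team_standardized_on_k8s:
        return "EKS"
    return "ECS"

def choose_compute(needs_host_access, needs_gpu, fargate_pricing_acceptable):
    """Fargate removes node management but has hard constraints."""
    if needs_host_access or needs_gpu or not fargate_pricing_acceptable:
        return "EC2 launch type"
    return "Fargate"
```

The point of writing it this way is that "no reason fires" is itself the answer: the default is the simplest tool, and every step up in complexity needs a named justification.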
Follow-up to expect:
"How do you handle persistent storage for a containerized application on ECS or EKS?" / "When would you choose EC2 launch type for ECS instead of Fargate?"
Part 6: Databases, DevOps, and Operations (The Stuff That Actually Predicts Production Readiness)
The questions in this section are where I calibrate seniority most precisely. Almost every candidate can describe what RDS is. Far fewer can articulate the failure mode when someone confuses read replicas with Multi-AZ. And very few can describe a CloudFormation drift detection workflow the way someone who has actually used it in production can. These questions are easy to attempt with surface-level knowledge and nearly impossible to pass with surface-level knowledge alone.
20. RDS Multi-AZ vs read replicas – what is the difference and what does each one actually solve?
What the interviewer is really testing:
Whether they can clearly separate high availability from read scaling. Confusing these two is a real production risk – teams that treat read replicas as an HA mechanism get a surprise when the primary fails and the replica does not automatically take over.
Sample answer:
Multi-AZ is a high availability feature. RDS maintains a synchronous standby replica in a different AZ. If the primary instance fails – hardware issue, AZ disruption, maintenance event – RDS automatically promotes the standby and updates the endpoint DNS. The failover typically completes in 60 to 120 seconds. The standby is not accessible for reads or any other workload – its only job is to be ready for failover.
Read replicas are for scaling read workloads. The replication is asynchronous, which means there is a replication lag – typically small, but not zero. Read replicas have their own endpoints and can serve read queries. They can be in the same Region or a different Region. They can be promoted to standalone instances, which makes them useful for DR patterns, but promotion is a manual or scripted process – it does not happen automatically during a primary failure.
My standard production setup for a critical database: Multi-AZ enabled for HA, read replicas added when read traffic grows to the point where the primary's read capacity is a bottleneck. I make sure the application is routing reads to replica endpoints – that distinction matters at the code level.
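That code-level distinction is simple but easy to get wrong. A minimal routing sketch (the endpoint hostnames are hypothetical placeholders):

```python
# Hypothetical endpoints -- the primary handles writes and failover;
# the replica endpoint serves asynchronously replicated reads.
PRIMARY = "mydb.cluster-xyz.us-east-1.rds.amazonaws.com"
REPLICA = "mydb-ro.cluster-xyz.us-east-1.rds.amazonaws.com"

def pick_endpoint(operation, needs_fresh_read=False):
    """Writes always go to the primary. Reads go to the replica unless
    the caller needs read-after-write consistency, in which case the
    asynchronous replication lag makes the replica unsafe."""
    if operation == "write" or needs_fresh_read:
        return PRIMARY
    return REPLICA
```

The `needs_fresh_read` escape hatch is the part candidates forget: because replica lag is nonzero, a read that must observe a write the same request just made cannot be routed to the replica.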
Follow-up to expect:
"How do you minimize the cutover time during an RDS failover event?" / "How does Aurora's storage architecture change the HA model compared to standard RDS Multi-AZ?"
21. DynamoDB on-demand vs provisioned capacity – when do you pick each?
What the interviewer is really testing:
Capacity planning instincts and cost literacy. "Always use on-demand because you do not have to think about it" is not a senior answer.
Sample answer:
On-demand mode is the right starting point for new tables or tables with genuinely unpredictable traffic. It scales to accommodate any request rate automatically and you pay per read and write request. There is no capacity planning, no throttling from under-provisioning, and no waste from over-provisioning. For early-stage products or workloads with highly variable access patterns, that simplicity is worth the slightly higher per-request cost.
Provisioned mode gives you cost predictability and potentially lower cost per request when traffic patterns are stable and well-understood. You define the read and write capacity units you need, pay for that provisioned capacity whether or not you use it, and get throttled if you exceed it. I pair provisioned capacity with auto scaling in most cases – the auto scaling policy adjusts provisioned capacity based on actual utilization, which reduces the operational overhead of managing it manually while still capturing the cost benefit of provisioned pricing at steady-state.
I typically start new tables on on-demand, establish traffic patterns, and revisit the capacity mode once I have 4 to 8 weeks of production access-pattern data. If the traffic is stable enough that provisioned plus auto scaling would consistently cost less, I switch.
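The break-even comparison is straightforward to model. The prices below are illustrative placeholders, not current AWS rates, which vary by Region and change over time:

```python
def on_demand_cost(reads_per_month, writes_per_month,
                   read_price_per_million=0.25, write_price_per_million=1.25):
    """Illustrative on-demand pricing: pay per request, nothing idle."""
    return (reads_per_month / 1e6) * read_price_per_million \
         + (writes_per_month / 1e6) * write_price_per_million

def provisioned_cost(avg_rcu, avg_wcu, hours=730,
                     rcu_price_per_hour=0.00013, wcu_price_per_hour=0.00065):
    """Provisioned capacity is billed per unit-hour whether used or not."""
    return avg_rcu * hours * rcu_price_per_hour \
         + avg_wcu * hours * wcu_price_per_hour
```

Run both against the 4 to 8 weeks of observed traffic: a table doing 100M reads and 20M writes a month at steady state is usually far cheaper provisioned, while a table that is idle most of the month and spikes unpredictably usually is not.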
Follow-up to expect:
"How do you handle DynamoDB hot partitions and what does that tell you about your key design?" / "What are the limits of DynamoDB auto scaling and when might it not react fast enough?"
22. CloudWatch vs CloudTrail – which one tells you what broke and which one tells you who changed what?
What the interviewer is really testing:
Operational maturity. Engineers who know what each service is but cannot use them together in an incident investigation are not ready to be the person on-call.
Sample answer:
CloudWatch is my operational visibility layer – metrics, logs, alarms, and dashboards. When something is going wrong right now, CloudWatch tells me what is happening: error rates are elevated, latency is spiking, CPU is saturated, the queue depth is climbing. I instrument applications to emit metrics and structured logs to CloudWatch, set alarms on the signals that matter, and build dashboards for the key health indicators I would look at during an incident.
CloudTrail is my audit and forensics layer – it records every API call made to AWS services: who called it, from where, at what time, with what parameters, and whether it succeeded. During an incident investigation, after I have used CloudWatch to understand what the system is doing, I use CloudTrail to understand what changed. A security group was modified 20 minutes before the outage started. An IAM policy was updated. A resource was deleted. CloudTrail gives me that timeline.
In practice I use them together in every non-trivial incident: CloudWatch to triage the impact, CloudTrail to find the cause. I also send CloudTrail logs to CloudWatch Logs and S3 so I have both real-time alerting on suspicious API calls and a long-term immutable record.
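The "CloudTrail to find the cause" step is essentially a filter over event records. A sketch of that lookback query, run over records shaped like CloudTrail's `eventName`/`eventTime` fields (the suspect-call list is an illustrative starting set, not exhaustive):

```python
from datetime import datetime, timedelta, timezone

# A starting watchlist of high-risk calls -- extend for your environment.
SUSPECT_CALLS = {"AuthorizeSecurityGroupIngress", "RevokeSecurityGroupIngress",
                 "ModifySecurityGroupRules", "StopLogging"}

def changes_before_incident(records, incident_time, window_minutes=30):
    """Return suspect API calls inside the lookback window, newest first:
    the 'what changed' step after CloudWatch has told you what broke."""
    start = incident_time - timedelta(minutes=window_minutes)
    hits = [r for r in records
            if r["eventName"] in SUSPECT_CALLS
            and start <= datetime.fromisoformat(r["eventTime"]) <= incident_time]
    return sorted(hits, key=lambda r: r["eventTime"], reverse=True)
```

In production you would run the equivalent query in CloudTrail Lake or Athena over the S3 trail rather than in application code, but the shape of the investigation is the same: bounded time window, high-risk event names, newest first.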
Follow-up to expect:
"How would you alert on a specific high-risk API call – like someone disabling CloudTrail – in near real time?" / "What is the difference between management events and data events in CloudTrail and why does that matter for cost?"
23. CloudFormation change sets and drift detection – what do they actually do for you in production?
What the interviewer is really testing:
IaC operational maturity. Engineers who treat CloudFormation as a one-time provisioning tool rather than an ongoing state management system have not operated stacks in a real change management process.
Sample answer:
Change sets are CloudFormation's "plan before apply" mechanism. When I submit a stack update, instead of immediately executing it, I create a change set first. CloudFormation analyzes the difference between the current stack state and the proposed template and shows me exactly what will be created, modified, or deleted – including whether any changes are replacements, which means the resource will be deleted and recreated. That is critical for production changes involving databases, load balancers, or anything stateful where a replacement would cause downtime or data loss.
In my change management process, change sets are mandatory for production stacks. The change set output goes into a pull request review before anyone executes it. That review catches "this is going to delete and recreate the RDS instance" before it happens in production rather than during the apply.
Drift detection answers a different question: has the real world diverged from what CloudFormation thinks it manages? If an engineer made a manual change in the console – modified a security group rule, changed an instance type – drift detection surfaces that. Undetected drift is dangerous because the next CloudFormation update will overwrite those manual changes without warning. I run drift detection on a scheduled basis for production stacks and treat any detected drift as something that needs to be resolved – either by updating the template to reflect the intended state or by reverting the manual change.
Follow-up to expect:
"How do you handle importing existing resources into a CloudFormation stack without downtime?" / "What is your strategy for rolling back a CloudFormation stack update that has already started?"
Part 7: Reliability, Disaster Recovery, and Cost (The Questions That Test Whether You Think Like an Owner)
The final category is where I separate engineers who think like engineers from engineers who think like owners. Reliability and cost are not afterthoughts – they are first-class design constraints. Candidates who treat DR as something to bolt on later, or who cannot connect cost decisions to business risk, are telling me they have not yet been accountable for what they built.
24. Define RTO and RPO, then pick an AWS DR strategy for a real workload.
What the interviewer is really testing:
Whether they map strategy to business requirements rather than defaulting to the most expensive option. "We should do active/active" with no discussion of requirements is not a senior answer.
Sample answer:
RPO – Recovery Point Objective – is how much data loss the business can tolerate, expressed as a time window from the last recovery point. RPO of 15 minutes means we can lose at most 15 minutes of transactions. RTO – Recovery Time Objective – is how long the business can tolerate being offline before the service is restored. These are business constraints, not engineering preferences, and I want them defined by stakeholders before I design a DR strategy.
AWS describes four DR patterns on a cost-versus-recovery-speed spectrum. Backup and restore is the simplest and cheapest: data is backed up to S3, and recovery means restoring from backup into a new environment. RTO is measured in hours. Right for non-critical workloads where extended downtime is acceptable.
Pilot light keeps a minimal version of the core infrastructure running – the database replicating, the most critical components warm – but compute is scaled to zero. Recovery means scaling up the compute tier. RTO is measured in tens of minutes.
Warm standby keeps a scaled-down but fully operational version of the environment running continuously. Recovery means scaling up. RTO is measured in minutes.
Multi-site active/active runs the full workload in multiple Regions simultaneously. Failover is near-instantaneous because capacity is already present. RTO is measured in seconds. This is also the most expensive pattern and the most operationally complex – data replication, latency, conflict resolution, and cost all become harder problems.
My starting question is always: what is the actual RTO/RPO requirement and what is the cost of downtime per hour for this specific workload? That anchors the conversation in business value rather than technical preference.
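The spectrum above maps cleanly to a selection rule: pick the cheapest pattern that still meets the RTO. The minute thresholds below are illustrative orders of magnitude from the discussion above, not SLAs:

```python
# (pattern, roughly the best RTO in minutes it can realistically meet),
# ordered from most expensive/fastest to cheapest/slowest.
DR_PATTERNS = [
    ("multi-site active/active", 1),
    ("warm standby", 15),
    ("pilot light", 60),
    ("backup and restore", 24 * 60),
]

def cheapest_pattern(rto_minutes):
    """Walk from the cheapest pattern up, returning the first one whose
    achievable recovery time fits inside the business's RTO."""
    for name, achievable_rto in reversed(DR_PATTERNS):
        if achievable_rto <= rto_minutes:
            return name
    return "multi-site active/active"  # nothing cheaper meets the RTO
```

The useful property of framing it this way in an interview: the pattern is an output of the business requirement, never the starting point.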
Follow-up to expect:
"How do you test your DR plan without taking down production?" / "What is the difference between a DR test and a game day exercise?"
25. In practice, how do you track and reduce AWS spend without degrading reliability?
What the interviewer is really testing:
Cost ownership. Whether they can reduce spend safely and systematically, rather than cutting costs in ways that introduce risk or reduce observability.
Sample answer:
Visibility has to come before optimization. If I do not know where the spend is, I will optimize the wrong things. I start with cost allocation tags – every resource tagged by team, application, and environment – and Cost Explorer to understand the actual cost drivers. I set Budgets with alerts at 80% and 100% of expected spend so anomalies surface before they become a surprise on the invoice. I also enable AWS Cost Anomaly Detection so unusual spending patterns get flagged automatically.
Once I can see where money is going, I prioritize the high-ROI, low-risk wins first. Right-sizing over-provisioned instances – looking at CloudWatch metrics to find instances consistently under 10% CPU with headroom for the actual peak. Removing idle resources – stopped EC2 instances still paying for EBS volumes, unused load balancers, old snapshots. Storage lifecycle rules on S3 buckets that are in Standard when the access pattern justifies a cheaper tier.
After the easy wins, I look at compute commitment. If a workload has been running for 3 to 6 months with stable utilization, a Savings Plan or Reserved Instance is usually justified and the ROI is clear. I model the break-even against current On-Demand spend before committing.
The thing I am careful about: cost reduction changes get validated with monitoring before and after. Right-sizing an instance, for example – I verify with CloudWatch that the smaller instance is not now running at 90% CPU under normal load. The goal is lower cost at the same reliability, not lower cost at the expense of headroom.
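The commitment break-even mentioned above is a two-line calculation worth being able to do on a whiteboard. Rates below are hypothetical, not actual AWS prices:

```python
HOURS_PER_MONTH = 730

def commitment_savings(on_demand_hourly, committed_hourly, utilization):
    """Monthly savings from a commitment vs staying on-demand. You pay
    the committed rate every hour whether the workload runs or not, so
    savings shrink (and go negative) as utilization drops."""
    on_demand_cost = on_demand_hourly * utilization * HOURS_PER_MONTH
    committed_cost = committed_hourly * HOURS_PER_MONTH
    return on_demand_cost - committed_cost

def break_even_utilization(on_demand_hourly, committed_hourly):
    """Fraction of hours the workload must actually run for the
    commitment to beat on-demand."""
    return committed_hourly / on_demand_hourly
```

For example, a 30% discount breaks even at 70% utilization: a workload running around the clock clearly wins, while one running half the time loses money on the commitment. That is why the answer above insists on 3 to 6 months of stable utilization data first.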
Follow-up to expect:
"How do you handle cost attribution for shared infrastructure that benefits multiple teams?" / "What do you do when a team's AWS spend suddenly spikes 3x and they do not know why?"
How to Use This Guide Properly
Do not memorize these answers. Read them once, understand the structure, then practice articulating them in your own voice. The version that gets you hired is the one that sounds like you, informed by real experience, not recited from a prep guide.
The pattern you should take from every answer above is: definition, then operational implication, then failure mode, then what you would actually do. That four-part structure is what separates a strong answer from a textbook answer across every category on this list.
The candidates I have seen get hired for senior AWS roles are not the ones with the most certifications or the longest list of services they have touched. They are the ones who can walk into an ambiguous problem, name what is uncertain, propose a reasonable path, and explain what they would watch to know if it was working. That is what these questions are ultimately testing.
Frequently Asked Questions
What are the most common AWS interview questions for cloud engineers?
The most common AWS interview questions for cloud and solutions architect roles cover networking fundamentals (VPC design, security groups, NAT gateways), identity and access management (IAM policy evaluation, roles, SCPs), storage (S3 consistency, EBS vs instance store, lifecycle rules), compute (EC2 purchasing options, Auto Scaling, Lambda concurrency), and operational topics (CloudWatch vs CloudTrail, disaster recovery strategies, cost optimization). For senior roles, expect additional depth on cross-account access patterns, encryption key management, IaC practices with CloudFormation or Terraform, and well-architected design trade-offs.
How do AWS Solutions Architect interviews differ from AWS DevOps Engineer interviews?
Solutions Architect interviews tend to emphasize broad architectural judgment: choosing the right services for a given workload, designing for HA and DR, cost trade-offs between service options, and explaining AWS fundamentals like the Shared Responsibility Model and Well-Architected pillars. DevOps Engineer interviews go deeper on operational topics: CI/CD pipeline design, IaC practices, monitoring and alerting, incident response, and automation. There is significant overlap in networking, IAM, compute, and storage – both roles need solid foundations in those areas.
How important are AWS certifications for passing AWS interviews?
Certifications demonstrate that you have studied AWS systematically and understand the breadth of services. They are a useful signal, particularly for roles where you will be advising customers or evaluating architectural options across many services. But certifications do not substitute for operational experience in interviews. The questions that filter senior candidates – "walk me through a VPC design," "explain what happens during an RDS failover," "how do you debug a Lambda throttling problem" – require experience, not just certification knowledge. Certifications help you get the interview. Experience helps you get the offer.
What should I prepare for an AWS system design interview?
For AWS system design questions, practice the following structure: clarify requirements and constraints first (traffic scale, RTO/RPO, cost sensitivity, compliance requirements), then propose a high-level architecture with named AWS services, explain why you chose each service over alternatives, identify the failure modes and how you handle them, and describe what monitoring and operational runbooks you would build around the system. Common scenarios include designing a multi-tier web application for high availability, designing a data ingestion pipeline, designing a serverless event-driven architecture, and designing a disaster recovery strategy for a critical database workload.
What is the AWS Well-Architected Framework and do I need to know it for interviews?
The Well-Architected Framework is AWS's structured approach to evaluating cloud architectures across six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. You do not need to memorize the framework by name or recite its pillars in an interview. But the underlying principles – shipping safely, securing by default, designing for failure, matching resources to actual workload needs, and measuring everything – should inform every answer you give. Candidates who naturally reason through these dimensions, even without naming the framework, are demonstrating exactly what interviewers want to see.

