SQS Key Points
Preface
Before getting into the technology itself, I will explain why we need a queue in our system and why we chose AWS SQS as our queueing system.
Our system was developed with limited time before go-live, so back then we did not really have time to identify which logic could run asynchronously or to build a system that supports asynchronous processing (there is so much to consider when you adopt a microservices pattern, such as consistency and error handling between services).
After several months in production, once the system was stable enough, we decided to enhance some of the logic and the architecture. First, we looked for smaller features that met certain criteria and could be executed asynchronously, before going directly to the core features, at least to prove that the change really improves performance.
What technology can help us handle asynchronous requests from one service to another? A queue.
Why AWS SQS?
How we chose AWS SQS is actually pretty straightforward :)
- Easy to use
- Price. We consider SQS pricing pretty cheap, and since our services have only been live for a few months, the request volume is not that high XD. AWS gives you 1 million requests per month for free!
- Highly available and reliable, since it is managed by AWS.
- Most of our services already run on AWS (EKS, ECR, CI/CD, CloudWatch, etc.).
At first, we had other options such as:
RabbitMQ: We would need to build the infrastructure and maintain the service ourselves, and the cost depends on what type of EC2 instance we use, while with SQS you only pay for what you use.
Kafka: We know that Kafka is powerful and feature-rich as a messaging system, but even though AWS provides managed Kafka (MSK), the cost difference compared to SQS is pretty significant.
Standard Queue OR FIFO Queue
Scalability or preserving message order? In our case, we chose scalability and high throughput. Since we clearly knew which attributes our logic needs, it was pretty easy for us to choose. So it's important to know what the business logic wants to achieve. A minimal sketch of the configuration difference is shown below.
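Here is a minimal sketch using boto3 (the queue names and region are just examples): a Standard queue needs no special attributes, while a FIFO queue needs the .fifo suffix and the FifoQueue attribute.

```python
import boto3

sqs = boto3.client("sqs")

# Standard queue: nearly unlimited throughput, at-least-once delivery,
# best-effort ordering.
standard = sqs.create_queue(QueueName="order-events")

# FIFO queue: strict ordering and exactly-once processing, but with
# throughput limits. The name must end with ".fifo".
fifo = sqs.create_queue(
    QueueName="order-events.fifo",
    Attributes={
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",  # deduplicate by message-body hash
    },
)

print(standard["QueueUrl"], fifo["QueueUrl"])
```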
Short Polling and Long Polling
There are two ways to retrieve messages from SQS. Most of the time, the preferable way is long polling. Why?
Short Polling

- An expensive operation in terms of both price (AWS still counts every ReceiveMessage request from the consumer, even when the response is empty) and HTTP connections (you open and close connections frequently).
Long Polling

- Allows you to receive a message as soon as it arrives in the queue, while also reducing SQS cost, since it reduces the number of requests and empty responses. See the sketch after this list.
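To illustrate the difference, here is a minimal sketch with boto3 (the queue URL is a placeholder): setting WaitTimeSeconds above 0 turns a ReceiveMessage call into long polling, so the call waits for a message instead of returning empty immediately.

```python
import boto3

sqs = boto3.client("sqs")
# Placeholder queue URL; use your own queue's URL here.
queue_url = "https://sqs.ap-southeast-1.amazonaws.com/123456789012/order-events"

# Short polling: WaitTimeSeconds=0 (the default) returns immediately,
# often with an empty response that is still billed as a request.
short_poll = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=0
)

# Long polling: wait up to 20 seconds (the maximum) for a message to arrive,
# which cuts down on empty responses and total requests.
long_poll = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
)

print(long_poll.get("Messages", []))
```

Long polling can also be enabled for the whole queue by setting the ReceiveMessageWaitTimeSeconds queue attribute, so every consumer gets it without passing the parameter on each call.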
Duplicated Messages
Our system is built on top of Kubernetes, so this architecture is pretty normal for load balancing requests:

Then when you integrate a queue between services, it becomes like this:

If you look at the architecture above, there is a possibility that two pods of the same service consume the same message at the same time, which results in handling duplicated data.
Visibility Timeout
Fortunately, the visibility timeout helps us resolve this. When a message is consumed by one of the pods (Pod A), that message is temporarily hidden from SQS for the visibility timeout duration (the default is 30 seconds), so that Pod B and Pod C will not get the exact same message that Pod A received.
After Pod A consumes the message, one of two conditions will occur:
- Pod A processes the message successfully; the consumer should then delete that message permanently from SQS so it won't be consumed again by another consumer.
- Pod A somehow crashes or has a networking issue after receiving the message. Since the message has not been deleted by Pod A, it will appear in SQS again after the visibility timeout passes, so it can be consumed and processed again.
The second situation is a retry mechanism handled by SQS, and it helps us remove complex retry logic from the service layer.
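Here is a minimal consumer sketch (boto3 again, with a placeholder queue URL and a hypothetical handle function): the message is deleted only after it is processed successfully, so if the pod crashes before the delete, SQS makes the message visible again after the visibility timeout and another pod can retry it.

```python
import boto3

sqs = boto3.client("sqs")
# Placeholder queue URL.
queue_url = "https://sqs.ap-southeast-1.amazonaws.com/123456789012/order-events"

def handle(body: str) -> None:
    """Business logic for one message (hypothetical)."""
    ...

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,    # long polling
        VisibilityTimeout=30,  # hide the message from other pods for 30 seconds
    )
    for msg in resp.get("Messages", []):
        try:
            handle(msg["Body"])
        except Exception:
            # Do NOT delete: once the visibility timeout expires the message
            # becomes visible again and will be retried (SQS-side retry).
            continue
        # Success: delete the message permanently so no other consumer gets it.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```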
But even with a visibility timeout, a Standard queue still has a small chance of delivering a duplicated message to a consumer. The reason is that a Standard queue guarantees at-least-once delivery: SQS stores copies of each message across its own highly available infrastructure, and it is possible for a duplicate to be delivered.
Idempotency is one of the important factors for solving the duplicated-message problem. It is a good idea to store message information such as the message ID and status (with status, you will need to consider whether the message comes from a retry/DLQ) in a database, and use it to decide whether the message should be consumed or not.
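A minimal sketch of that idea, assuming a relational database with a unique constraint on the SQS MessageId (the table and helper names here are hypothetical): inserting the message ID first and skipping processing when it already exists makes the consumer idempotent.

```python
import sqlite3  # stand-in for whatever database the service actually uses

db = sqlite3.connect("messages.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS processed_messages ("
    " message_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL)"  # e.g. PROCESSING / DONE / FROM_DLQ
)

def should_consume(message_id: str) -> bool:
    """Return True only the first time this SQS MessageId is seen."""
    try:
        db.execute(
            "INSERT INTO processed_messages (message_id, status) VALUES (?, 'PROCESSING')",
            (message_id,),
        )
        db.commit()
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery (or a retry/DLQ redrive): decide here whether to
        # skip or reprocess based on the stored status.
        return False
```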
Dead-Letter Queue
When you have a retry policy, what if your service is unavailable for a period of time? Or what if our business logic has a bug and cannot process a particular message? Doesn't that mean it will retry forever and never be deleted from the queue?

SQS allows us to enable a redrive policy, and it is recommended by AWS. The original queue will move messages that cannot be consumed and have reached the threshold receive count (Maximum Receives in SQS) to the DLQ for later processing.
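As a sketch, the redrive policy can be attached with boto3 like this (queue names and the receive count are placeholders): after a message has been received maxReceiveCount times without being deleted, SQS moves it to the DLQ.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queues: the main queue and its dead-letter queue.
dlq = sqs.create_queue(QueueName="order-events-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

main = sqs.create_queue(QueueName="order-events")
sqs.set_queue_attributes(
    QueueUrl=main["QueueUrl"],
    Attributes={
        # After 5 failed receives ("Maximum Receives"), SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```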
Wrapping Up
Things I don’t like about SQS
While I have said nice things about SQS, I'll also mention a point that I don't like about it.
SQS does not have topics like Kafka. When we want to adopt an event-driven architecture in our system, each service may listen to multiple different events. So if you have 4 events, you will have 4 queues plus their DLQs (Dead-Letter Queues) = 8 queues. If you have 10 events, you will have 20 queues in total!
There are some things not covered here, such as integrating with SNS to distribute a message to multiple different consumers, among other topics.
Thank you for reading! I hope it helps, and feel free to correct me if I'm wrong :)