SQS guarantees at-least-once delivery. This means a message can, under normal failure conditions, be delivered more than once. For most workloads this is fine – you design your consumer to be idempotent and move on. But when each message triggers an expensive operation (a Fargate task launch, a download, a decryption, a write to S3), duplicates are worth preventing properly.

In the download pipeline covered earlier, two independent deduplication mechanisms stack on top of each other: SQS FIFO content-based deduplication on the queue, and DynamoDB optimistic locking in the consumer. This post breaks down how each works, why both are needed, and how the retry and dead-letter pattern is configured around them.

SQS FIFO vs Standard: What Actually Differs

SQS has two queue types. Standard queues offer higher throughput but make no ordering guarantee and allow duplicate deliveries beyond the at-least-once baseline. FIFO queues add:

  • Exactly-once processing within a deduplication window (a fixed 5-minute interval)
  • Strict ordering within a message group
  • Message group IDs – the key concept for parallelism within a FIFO queue

The tradeoff is throughput: FIFO queues are limited to 3,000 messages per second with batching (300 without), compared to effectively unlimited for standard queues. For a daily batch of a few dozen files, this is irrelevant.

The Queue Configuration

resource "aws_sqs_queue" "data_download_fifo" {
  name                        = "data-download-queue.fifo"
  fifo_queue                  = true
  content_based_deduplication = true
  kms_master_key_id           = "alias/aws/sqs"

  visibility_timeout_seconds = 3600    # 60 minutes
  message_retention_seconds  = 1209600 # 14 days
}

resource "aws_sqs_queue" "data_download_dlq" {
  name              = "data-download-dlq.fifo"
  fifo_queue        = true
  kms_master_key_id = "alias/aws/sqs"
  message_retention_seconds = 1209600
}

resource "aws_sqs_queue_redrive_policy" "data_download_redrive" {
  queue_url = aws_sqs_queue.data_download_fifo.id

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.data_download_dlq.arn
    maxReceiveCount     = 3
  })
}

Several decisions are worth unpacking here.

content_based_deduplication = true

With content-based deduplication enabled, SQS computes a SHA-256 hash of the message body and uses it as the deduplication ID. Any message with the same body sent within the 5-minute deduplication window is silently discarded.
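Under the hood, the content-based deduplication ID is just the SHA-256 hex digest of the raw message body. A minimal sketch of that computation (the hashing behaviour is documented; the helper name is mine):

```typescript
import { createHash } from "node:crypto";

// Content-based deduplication hashes the raw message body (not message
// attributes) with SHA-256 and uses the hex digest as the dedup ID.
function contentDedupId(messageBody: string): string {
  return createHash("sha256").update(messageBody).digest("hex");
}

// Byte-identical bodies collapse to one message within the window...
const a = contentDedupId(JSON.stringify({ FileID: "accounts-2026-02-28" }));
const b = contentDedupId(JSON.stringify({ FileID: "accounts-2026-02-28" }));

// ...but any difference in the body (key order, whitespace, an extra
// field) yields a different ID, and both messages are delivered.
const c = contentDedupId(JSON.stringify({ FileID: "accounts-2026-02-28", ts: 1 }));
```

This is exactly why the hash is a fragile dedup key: it is sensitive to every byte of the body, not to the logical identity of the job.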

The alternative is to supply an explicit MessageDeduplicationId per message, which is what the producer does:

await sqs.send(new SendMessageCommand({
  QueueUrl:               QUEUE_URL,
  MessageBody:            JSON.stringify({ FileID: fileId, Entity: entity, BatchID: batchId }),
  MessageGroupId:         'data-download-group',
  MessageDeduplicationId: fileId              // e.g. "accounts-2026-02-28"
}));

When MessageDeduplicationId is provided explicitly, it takes precedence over content-based deduplication. This is the right approach here: the dedup key is the logical identity of the job (fileId), not the hash of the message body. If the producer ever adds metadata to the message body (a timestamp, a batch ID) without changing fileId, content-based dedup would fail to catch the duplicate – but the explicit ID wouldn't.

The combination of content_based_deduplication = true in Terraform and an explicit MessageDeduplicationId in the SDK call means: explicit ID is used when provided, content-based acts as a fallback for any send path that forgets to set one.

MessageGroupId

Every FIFO message must belong to a group. Messages within the same group are delivered in strict order and are processed one at a time – a message is not delivered until the previous message in the group is no longer in flight.

All messages here use MessageGroupId: 'data-download-group'. This means:

  • Files are processed in the order they were queued
  • Only one Fargate task can be actively processing a message from this group at a time (from the queue’s perspective)

For this workload, strict ordering is not a hard requirement – files are independent of each other. Using a single group is a pragmatic choice: it simplifies configuration, and the throughput limit (one in-flight message per group) is not a bottleneck when tasks run for ~60 seconds and a batch contains a handful of files.

If you had hundreds of files and needed true parallelism, you would use a unique MessageGroupId per file (e.g. the file ID itself). Each group would then be independent, and multiple tasks could process simultaneously. The tradeoff is that you lose ordering guarantees across groups (though you retain them within each group).
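A sketch of what the per-file grouping would look like on the producer side (buildFileMessage is a hypothetical helper; relative to the single-group version, only MessageGroupId changes):

```typescript
// Sketch: one message group per file, so each group is independent and
// multiple Fargate tasks can run in parallel. The body shape matches the
// producer's existing message; the helper itself is an assumption.
interface SendParams {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

function buildFileMessage(queueUrl: string, fileId: string): SendParams {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({ FileID: fileId }),
    MessageGroupId: fileId,         // unique group: no cross-file ordering
    MessageDeduplicationId: fileId, // still deduped per file for 5 minutes
  };
}
```

The result of buildFileMessage would be spread into a SendMessageCommand exactly as in the earlier producer snippet.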

visibility_timeout_seconds = 3600

When SQS delivers a message to a consumer, it hides that message from all other consumers for the visibility timeout period. If the consumer acknowledges (deletes) the message before the timeout expires, it is gone. If it does not – because the consumer crashed, the task failed, or processing took too long – the message becomes visible again and is redelivered.

The 3600-second (60-minute) timeout is intentionally generous. A Fargate task needs to download a file from an external URL, decrypt it in memory, and upload to S3. On a slow upstream host or for a large file, this could take well over the 5 or 15 minutes common for Lambda workloads. Setting the timeout too low causes false redeliveries: the task is still running, but SQS thinks it has failed and delivers the message again.

With EventBridge Pipes as the consumer, message deletion happens automatically when the Fargate task starts – the Pipe manages the SQS receive/delete lifecycle, not the container. This means the visibility timeout is effectively a ceiling on the gap between “task launched” and “task exits”. If the task hangs indefinitely, the message will eventually reappear and trigger a retry.

maxReceiveCount = 3

After a message has been received (and made visible again) three times without being deleted, it is moved to the dead-letter queue. This covers:

  • Fargate tasks that crash immediately (exit code 1)
  • Tasks that time out (message becomes visible again after 60 minutes)
  • Transient infrastructure failures (ECS task placement failures, network errors)

Three attempts is a reasonable balance. One retry catches transient issues; a second catches most intermittent failures. By the third failure, something is probably wrong with the file or the upstream source – moving it to the DLQ for manual inspection is the right call rather than retrying indefinitely.

The Idempotency Problem SQS Dedup Doesn’t Solve

SQS FIFO deduplication has a 5-minute window. After 5 minutes, a message with the same dedup ID is considered a new message and will be delivered. This covers accidental double-sends but does not protect against:

  • Re-queuing after a system restart (the producer sends again the next day, same file ID, more than 5 minutes after the original send)
  • Durable function replay: if the durable Lambda poller replays its fetch-and-queue-files step after a re-invocation, it will try to send the same messages again
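The window behaviour itself can be sketched as a toy in-memory model (DedupWindow is purely illustrative – the real bookkeeping lives inside SQS):

```typescript
// Toy model of the SQS FIFO deduplication window: a send with the same
// dedup ID within 5 minutes of an accepted send is dropped; after the
// window expires, the same ID is treated as a brand-new message.
const WINDOW_MS = 5 * 60 * 1000;

class DedupWindow {
  private seen = new Map<string, number>(); // dedup ID -> last accepted time

  // Returns true if the message is accepted (enqueued), false if dropped.
  accept(dedupId: string, nowMs: number): boolean {
    const last = this.seen.get(dedupId);
    if (last !== undefined && nowMs - last < WINDOW_MS) {
      return false; // duplicate inside the window: silently dropped
    }
    this.seen.set(dedupId, nowMs);
    return true;
  }
}
```

A replay one minute after the original send is dropped; the same replay the next day is accepted, which is precisely the gap the DynamoDB conditional write below has to close.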

This is why the producer wraps its SQS send in a DynamoDB conditional write:

try {
  await dynamodb.send(new PutCommand({
    TableName: JOBS_TABLE,
    Item: {
      FileID: fileId,
      Status: 'available',
      // ...
    },
    ConditionExpression: 'attribute_not_exists(FileID)' // fails if already written
  }));

  // Only send to SQS if the DynamoDB write succeeded
  await sqs.send(new SendMessageCommand({
    QueueUrl:               QUEUE_URL,
    MessageBody:            JSON.stringify({ FileID: fileId, BatchID: batchId }),
    MessageGroupId:         'data-download-group',
    MessageDeduplicationId: fileId
  }));

  fileIds.push(fileId);
} catch (error) {
  if (error instanceof ConditionalCheckFailedException) {
    logger.info('File already exists, skipping SQS send', { fileId });
    fileIds.push(fileId); // still track as expected
  } else {
    throw error;
  }
}

attribute_not_exists(FileID) means “only write if this item does not already exist.” If the producer replays and the item is already in DynamoDB, the write throws ConditionalCheckFailedException, the SQS send is skipped entirely, and the file ID is still tracked in the expected list (so the durable function’s completion check remains accurate).

This makes the producer side idempotent beyond the 5-minute SQS window.

The Consumer: A Second Optimistic Lock

Even with all of the above, SQS still guarantees at-least-once, not exactly-once. In rare scenarios – a task crash after the SQS receive but before the DynamoDB update, or a visibility timeout expiry during a slow download – a message can be redelivered while a first task is still running or after it has already completed.

The Fargate container handles this with a second optimistic lock in DynamoDB, on the consumer side:

async function claimFile(fileId: string): Promise<JobRecord | null> {
  const getResult = await dynamodb.send(new GetCommand({
    TableName: TABLE_NAME,
    Key: { FileID: fileId }
  }));

  const record = getResult.Item as JobRecord;

  if (!record || record.Status !== 'available') {
    // Already claimed, downloading, completed, or failed
    logger.warn('File not available for claiming', {
      fileId,
      status: record?.Status ?? 'not found'
    });
    return null; // exit gracefully
  }

  try {
    await dynamodb.send(new UpdateCommand({
      TableName: TABLE_NAME,
      Key: { FileID: fileId },
      UpdateExpression:    'SET #status = :downloading, StatusUpdatedAt = :ts',
      ConditionExpression: '#status = :available', // optimistic lock
      ExpressionAttributeNames:  { '#status': 'Status' },
      ExpressionAttributeValues: {
        ':downloading': 'downloading',
        ':available':   'available',
        ':ts':          Date.now()
      }
    }));

    return record; // we hold the lock
  } catch (error: any) {
    if (error.name === 'ConditionalCheckFailedException') {
      logger.warn('Race: another task claimed this file first', { fileId });
      return null;
    }
    throw error;
  }
}

The status machine has four states – one initial, one in-progress, two terminal:

available → downloading → completed
                       ↘ failed

The conditional update only succeeds when Status = 'available'. Once a task transitions the record to downloading, any other task that receives the same message will see a non-available status and exit with code 0. The DLQ counter does not increment on a clean process.exit(0) – only process.exit(1) (or an uncaught exception) counts as a failure.

This means a duplicate Fargate task launch for a file that is already downloading or completed is entirely benign: it exits cleanly in a few milliseconds having done nothing.
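The transition rules can be written down as a small lookup table (a sketch; the authoritative enforcement is the ConditionExpression in DynamoDB, not client-side code):

```typescript
type Status = "available" | "downloading" | "completed" | "failed";

// Allowed transitions in the status machine. Anything not listed here is
// rejected – which is exactly what the conditional update enforces
// server-side when it requires Status = 'available' before claiming.
const transitions: Record<Status, Status[]> = {
  available:   ["downloading"],
  downloading: ["completed", "failed"],
  completed:   [], // terminal
  failed:      [], // terminal
};

function canTransition(from: Status, to: Status): boolean {
  return transitions[from].includes(to);
}
```

Because completed and failed have no outgoing edges, a duplicate delivery for a finished file can never re-enter the download path.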

The Status Machine as the Source of Truth

Because DynamoDB holds authoritative state, the pipeline is resilient to failures at any point:

For each failure point: how SQS behaves, what DynamoDB records, and how the system recovers.

  • Task crashes before claimFile. SQS: message becomes visible after 60 min. DynamoDB: available. Recovery: task is relaunched, claims the file, and processes normally.
  • Task crashes after claim, during download. SQS: message becomes visible after 60 min. DynamoDB: downloading (stuck). Recovery: the redelivered task sees downloading and exits cleanly; the message goes to the DLQ after 3 attempts.
  • Task completes but the SQS delete fails. SQS: message becomes visible again. DynamoDB: completed. Recovery: the redelivered task sees completed and exits cleanly.
  • Producer replays within the 5-minute window. SQS: dedup drops the message. DynamoDB: available (existing). Recovery: no action needed.
  • Producer replays after the 5-minute window. SQS: delivers the message. DynamoDB: available (existing; the conditional write blocks a re-insert). Recovery: the consumer claims and processes; idempotent because the DynamoDB record is unchanged.

The one case that ends up in the DLQ is a task that crashes after claiming the file – the status is stuck at downloading. A retry task exits cleanly (status is not available), so nothing re-processes the file. The DLQ message is the signal that this file is in a stuck state and needs manual resolution.

A scheduled cleanup job could handle this automatically: query DynamoDB for records stuck in downloading for more than N minutes, reset them to available, and republish to SQS. Whether this is worth building depends on how often it happens in practice.
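If it were worth building, the core of such a job is a pure selection function over job records (findStuck is a sketch; the record shape reuses the FileID, Status, and StatusUpdatedAt fields from the consumer code above):

```typescript
// Sketch: pick out job records stuck in 'downloading' for longer than a
// threshold. A scheduled Lambda would feed this from a DynamoDB query,
// reset the matches to 'available', and republish them to SQS.
interface JobRecord {
  FileID: string;
  Status: string;
  StatusUpdatedAt: number; // epoch millis, set by the conditional update
}

function findStuck(
  records: JobRecord[],
  nowMs: number,
  maxAgeMs: number
): JobRecord[] {
  return records.filter(
    (r) => r.Status === "downloading" && nowMs - r.StatusUpdatedAt > maxAgeMs
  );
}
```

Keeping the selection logic pure makes the threshold easy to test independently of DynamoDB.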

Monitoring the Pattern

Three CloudWatch alarms cover the meaningful failure modes:

DLQ message count > 0 – the most important alarm. A message in the DLQ means three consecutive task failures for a specific file. It requires manual investigation: check the Fargate task logs for that file ID to understand why it failed three times.

resource "aws_cloudwatch_metric_alarm" "dlq_message_count" {
  alarm_name          = "data-download-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "DLQ has messages, manual investigation required"

  dimensions = {
    QueueName = aws_sqs_queue.data_download_dlq.name
  }
}

ApproximateAgeOfOldestMessage > threshold on the main queue – if messages are sitting in the queue much longer than expected, tasks are either not launching or are taking too long. This could indicate an EventBridge Pipe failure or a VPC/security group misconfiguration preventing task placement.

ApproximateNumberOfMessagesNotVisible > expected – messages in flight (currently being processed). In steady state this should equal the number of actively running Fargate tasks. A sustained spike could indicate tasks are hanging.

Putting It All Together

The full deduplication and retry chain, from producer to consumer:

Producer
  1. DynamoDB conditional write (attribute_not_exists)
     → prevents re-queuing an already-registered file
  2. SQS send with explicit MessageDeduplicationId
     → prevents duplicate messages within 5-min window

SQS FIFO Queue
  3. Content-based dedup fallback
  4. 60-min visibility timeout
     → gives Fargate task time to finish without false redelivery
  5. maxReceiveCount = 3 → DLQ after 3 failures

EventBridge Pipe (batch_size = 1)
  6. One Fargate task per message

Consumer (Fargate)
  7. DynamoDB conditional update (status = 'available')
     → prevents duplicate processing regardless of how the duplicate arrived
  8. process.exit(0) on graceful skip (no DLQ increment)
  9. process.exit(1) on real failure (DLQ counter increments)

Each layer covers failures the previous one does not. SQS deduplication is fast but time-bounded. DynamoDB conditional writes are durable but only cover the producer replay case. The consumer optimistic lock is the last line of defence and handles every remaining race. The DLQ is the safety valve for genuine, repeated failures that no automatic retry can resolve.

None of these layers is complex on its own. The design principle is just: assume each layer will sometimes fail, and make sure the next layer handles it.