DynamoDB design starts with access patterns, not with data. In a relational database you normalise first, then write queries. In DynamoDB you enumerate every read and write your application needs, then design the table and indexes to serve those patterns efficiently. If you start from the data model and work outward, you usually end up with scans, which are expensive, slow, and a sign the design missed something.

This post covers the two-table design used in the daily batch download pipeline: what access patterns each table and GSI serves, why the keys are shaped the way they are, and the supporting decisions around billing mode, TTL, and conditional writes.

The Two Tables

The pipeline uses two tables with a deliberate separation of concerns:

  • data-download-jobs – one record per file, per day. Tracks the lifecycle of a single download job from queued to complete.
  • data-download-batches – one record per daily run. Tracks the orchestration-level view: how many files were expected, how many finished, and whether the batch timed out.

This is a multi-table design, not a single-table design. Single-table design is often advocated in DynamoDB circles for the efficiency gains it brings to highly relational access patterns. Here the two entities – batches and jobs – have mostly independent access patterns and different retention periods (30 vs 90 days). Keeping them separate makes each table's key schema easier to reason about and avoids the cognitive overhead of generic PK/SK attributes that mean different things per entity type.

Table 1: data-download-jobs

resource "aws_dynamodb_table" "data_download_jobs" {
  name         = "data-download-jobs"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "FileID"

  attribute {
    name = "FileID"
    type = "S"
  }

  attribute {
    name = "Status"
    type = "S"
  }

  attribute {
    name = "StatusUpdatedAt"
    type = "N"
  }

  attribute {
    name = "BatchID"
    type = "S"
  }

  global_secondary_index {
    name            = "StatusIndex"
    hash_key        = "Status"
    range_key       = "StatusUpdatedAt"
    projection_type = "ALL"
  }

  global_secondary_index {
    name            = "BatchIndex"
    hash_key        = "BatchID"
    range_key       = "Status"
    projection_type = "ALL"
  }

  ttl {
    enabled        = true
    attribute_name = "TTL"
  }

  point_in_time_recovery {
    enabled = var.environment == "prod"
  }
}

Primary Key: FileID

FileID is a composite string of the form {entity}-{date}, for example accounts-2026-02-28. This gives every record a natural, stable, business-meaningful key that encodes both what the file is and when it is from.

The implications:

  • GetItem is the fast path. The durable Lambda poller tracks expected file IDs explicitly in its polling state. When checking completion, it iterates over those IDs and calls GetItem on each one directly – no index query needed.
  • Idempotent writes work correctly. ConditionExpression: 'attribute_not_exists(FileID)' is the producer’s guard against re-inserting an already-queued record. A natural key means this check is meaningful; a surrogate UUID would not help here.
  • The key survives replay. If the durable function re-runs fetch-and-queue-files after a re-invocation, the same FileID is computed for the same entity+date combination. The conditional write fails safely, the SQS send is skipped, and the file is still tracked as expected.
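The producer's idempotent insert can be sketched as follows. The builder is shown separately from the SDK call so the shape of the request is clear; buildJobPut and queueJobRecord are illustrative helper names, not part of the actual codebase, and in the pipeline the built input would be sent via a DynamoDBDocumentClient PutCommand.

```typescript
// Build the idempotent insert for a job record. Re-running this with the
// same entity+date produces the same FileID, so the conditional write is
// the replay guard.
export function buildJobPut(entity: string, date: string, batchId: string) {
  const now = Date.now();
  return {
    TableName: "data-download-jobs",
    Item: {
      FileID: `${entity}-${date}`, // natural key: same inputs → same key on replay
      BatchID: batchId,
      Status: "available",
      StatusUpdatedAt: now,                             // epoch ms (GSI range key)
      TTL: Math.floor(now / 1000) + 30 * 24 * 60 * 60,  // epoch seconds, +30 days
    },
    // Fails with ConditionalCheckFailedException if the record already
    // exists – the producer treats that as "already queued, skip the SQS send".
    ConditionExpression: "attribute_not_exists(FileID)",
  };
}

// Usage sketch:
//   await doc.send(new PutCommand(buildJobPut("accounts", "2026-02-28", batchId)));
//   // catch ConditionalCheckFailedException → record exists, skip SQS send
```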

The full record shape:

interface FileJobRecord {
  FileID:          string;  // "accounts-2026-02-28"
  BatchID:         string;  // "batch-2026-02-28-<uuid>"
  Status:          'available' | 'downloading' | 'completed' | 'failed' | 'error';
  Entity:          string;  // "accounts"
  DownloadDate:    string;  // "2026-02-28"
  PresignedURL:    string;  // S3 pre-signed URL (expires in 30 min)
  EncryptedAESKey: string;  // base64-encoded RSA-wrapped AES key
  EncodedAESIV:    string;  // base64-encoded AES IV
  CreatedAt:       string;  // ISO 8601
  StatusUpdatedAt: number;  // epoch ms (used as GSI range key)
  CompletedAt?:    string;
  ErrorMessage?:   string;
  TTL:             number;  // epoch seconds, 30 days from creation
}

GSI 1: StatusIndex – Status (hash) + StatusUpdatedAt (range)

This index answers: “Which files are currently in a given status, ordered by when they entered that status?”

// Query: all 'available' files, oldest first
const result = await dynamodb.send(new QueryCommand({
  TableName: JOBS_TABLE,
  IndexName: 'StatusIndex',
  KeyConditionExpression: '#s = :available',
  ExpressionAttributeNames:  { '#s': 'Status' },
  ExpressionAttributeValues: { ':available': 'available' },
  ScanIndexForward: true // ascending by StatusUpdatedAt
}));

In the current implementation, the downloader receives its FileID from SQS and goes straight to GetItem – it does not use StatusIndex to find work. But StatusIndex is still valuable as an operational query:

  • Debugging a stalled batch: query Status = 'downloading' with a StatusUpdatedAt older than 30 minutes to find stuck jobs.
  • Manual recovery tooling: find all available records that were never picked up (e.g. the SQS message was lost before reaching the Pipe).
  • Metrics and dashboards: count records by status across all batches without scanning the full table.

StatusUpdatedAt is a Number (epoch milliseconds), not an ISO string. DynamoDB sorts number types numerically, so range key comparisons (<, >, between) work correctly on it. If it were stored as a string, lexicographic sorting would give wrong results for timestamps that cross a digit boundary (e.g. "999" sorts after "1000").
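The "stuck jobs" debugging query above can be sketched as a query-input builder. This relies on the numeric range key: the cutoff comparison is a plain numeric inequality. The helper name is illustrative; the index and attribute names match the table definition.

```typescript
// Find files that entered 'downloading' more than 30 minutes ago.
// StatusUpdatedAt is a Number (epoch ms), so '<' compares numerically –
// the reason it is not stored as an ISO string.
export function buildStuckJobsQuery(nowMs: number = Date.now()) {
  const cutoff = nowMs - 30 * 60 * 1000; // 30 minutes ago, epoch ms
  return {
    TableName: "data-download-jobs",
    IndexName: "StatusIndex",
    KeyConditionExpression: "#s = :downloading AND StatusUpdatedAt < :cutoff",
    ExpressionAttributeNames: { "#s": "Status" },
    ExpressionAttributeValues: { ":downloading": "downloading", ":cutoff": cutoff },
  };
}

// Usage sketch:
//   const stuck = await doc.send(new QueryCommand(buildStuckJobsQuery()));
```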

GSI 2: BatchIndex – BatchID (hash) + Status (range)

This index answers: “For a given batch, how many files are in each status?”

// Query: all completed files for a specific batch
const result = await dynamodb.send(new QueryCommand({
  TableName: JOBS_TABLE,
  IndexName: 'BatchIndex',
  KeyConditionExpression: 'BatchID = :bid AND #s = :completed',
  ExpressionAttributeNames:  { '#s': 'Status' },
  ExpressionAttributeValues: {
    ':bid':       'batch-2026-02-28-abc123',
    ':completed': 'completed'
  }
}));

Using Status as the range key means you can constrain a batch query to a single status bucket in the key condition, which is more efficient than filtering after the fact.

In the durable poller’s completion check, the current implementation iterates over each expected FileID and calls GetItem individually rather than querying this GSI. For small batches (a handful of files), individual GetItem calls have lower latency than a Query. For large batches (hundreds of files), a single BatchIndex query with pagination would be more efficient – this is a clear scaling boundary to know about before increasing batch size.
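The paginated alternative for large batches could look like the sketch below. The send callback is abstracted so the pagination loop is self-contained; in practice it would be (input) => doc.send(new QueryCommand(input)). countCompleted is a hypothetical helper, not part of the pipeline.

```typescript
type QueryPage = {
  Items?: Record<string, unknown>[];
  LastEvaluatedKey?: Record<string, unknown>;
};

// Count a batch's completed files via BatchIndex, following pagination.
export async function countCompleted(
  batchId: string,
  send: (input: Record<string, unknown>) => Promise<QueryPage>
): Promise<number> {
  let count = 0;
  let startKey: Record<string, unknown> | undefined;
  do {
    const page = await send({
      TableName: "data-download-jobs",
      IndexName: "BatchIndex",
      KeyConditionExpression: "BatchID = :bid AND #s = :completed",
      ExpressionAttributeNames: { "#s": "Status" },
      ExpressionAttributeValues: { ":bid": batchId, ":completed": "completed" },
      ExclusiveStartKey: startKey, // resume where the previous page stopped
    });
    count += page.Items?.length ?? 0;
    startKey = page.LastEvaluatedKey;
  } while (startKey); // DynamoDB signals "more pages" via LastEvaluatedKey
  return count;
}
```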

Why projection_type = "ALL" on Both GSIs

ALL means the GSI contains every attribute from the base table, not just the key attributes. This costs more storage but eliminates the need for GetItem to retrieve non-key attributes after a GSI query (the “double read” problem). For operational queries that need the full record – a debugging tool, a monitoring script – ALL is the right choice. For hot query paths where only key attributes are needed, KEYS_ONLY or a custom INCLUDE list would reduce storage and read costs.

Here the GSIs are primarily for operational visibility rather than primary application access patterns, so ALL keeps queries simple at a modest storage premium.

Table 2: data-download-batches

resource "aws_dynamodb_table" "data_download_batches" {
  name         = "data-download-batches"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "BatchID"

  attribute {
    name = "BatchID"
    type = "S"
  }

  attribute {
    name = "PollingDate"
    type = "S"
  }

  attribute {
    name = "Status"
    type = "S"
  }

  global_secondary_index {
    name            = "PollingDateIndex"
    hash_key        = "PollingDate"
    range_key       = "Status"
    projection_type = "ALL"
  }

  ttl {
    enabled        = true
    attribute_name = "TTL"
  }

  point_in_time_recovery {
    enabled = var.environment == "prod"
  }
}

Primary Key: BatchID

BatchID is a synthetic key of the form batch-{date}-{uuid}, for example batch-2026-02-28-f47ac10b. The UUID suffix means multiple batch records can exist for the same date – relevant if a batch is manually re-triggered, or if the poller runs twice due to a durable execution anomaly.

The full record shape:

interface BatchRecord {
  BatchID:        string;   // "batch-2026-02-28-<uuid>"
  PollingDate:    string;   // "2026-02-28"
  Status:         'initiated' | 'queued' | 'completed' | 'timeout' | 'failed';
  StartTime:      string;   // ISO 8601
  ExpectedFiles:  string[]; // ["accounts-2026-02-28", "transactions-2026-02-28"]
  CompletedFiles: string[];
  FailedFiles:    string[];
  CreatedAt:      string;
  UpdatedAt:      string;
  TimeoutAt?:     string;
  TTL:            number;   // epoch seconds, 90 days from creation
}

ExpectedFiles, CompletedFiles, and FailedFiles are stored as StringSet attributes rather than as child records in a separate table. For a batch of a few dozen file IDs, this is appropriate: the set fits comfortably within DynamoDB’s 400 KB item size limit (a file ID like accounts-2026-02-28 is 19 bytes; a 400-item set would be ~8 KB). Denormalising the file lists into the batch record means reading a batch’s status is a single GetItem, not a join.

If file lists grew into the thousands, you would move them to a separate table and use a GSI to aggregate – but that is a scaling problem to solve if and when you reach it.
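Appending to one of these sets could be sketched as below. An ADD on a string set is itself idempotent (adding an element that is already present is a no-op), which suits a pipeline that may replay steps. With DynamoDBDocumentClient, a JavaScript Set marshals to a DynamoDB StringSet. buildMarkFileCompleted is an illustrative helper name.

```typescript
// Append a file ID to the batch's CompletedFiles set and bump UpdatedAt.
// Both the ADD and SET clauses live in one UpdateExpression.
export function buildMarkFileCompleted(batchId: string, fileId: string) {
  return {
    TableName: "data-download-batches",
    Key: { BatchID: batchId },
    UpdateExpression: "ADD CompletedFiles :f SET UpdatedAt = :now",
    ExpressionAttributeValues: {
      ":f": new Set([fileId]),         // single-element StringSet
      ":now": new Date().toISOString(),
    },
  };
}

// Usage sketch:
//   await doc.send(new UpdateCommand(buildMarkFileCompleted(batchId, fileId)));
```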

GSI: PollingDateIndex – PollingDate (hash) + Status (range)

This index answers: “What is the state of all batches for a given date?”

// Query: all batches for today, filter to those that timed out
const result = await dynamodb.send(new QueryCommand({
  TableName:  BATCH_TABLE,
  IndexName:  'PollingDateIndex',
  KeyConditionExpression: 'PollingDate = :date AND #s = :timeout',
  ExpressionAttributeNames:  { '#s': 'Status' },
  ExpressionAttributeValues: {
    ':date':    '2026-02-28',
    ':timeout': 'timeout'
  }
}));

The primary use case is operational: looking up today’s batch without knowing the BatchID in advance. The durable function knows its own BatchID (it created it), but a monitoring dashboard, a CloudWatch alarm runbook, or a manual debugging session would start from a date and work inward.

Status as the range key lets you scope the query to a specific outcome. If you queried without a range key condition, you would get all batches for that date – typically just one, but the index design handles the re-run case cleanly.

The Status Machines

Both tables use a Status attribute that moves through a defined set of states. It is worth drawing these explicitly because DynamoDB conditional writes enforce transitions – attempting a transition from the wrong source state throws ConditionalCheckFailedException, which is the application’s primary concurrency and idempotency mechanism.

FileJobRecord status machine:

            ┌────────────────────────────────────────┐
            ▼                                        │
  (producer write)      (downloader claims)          │    (downloader done)
    ──────► available ─────────────► downloading ────┼───────────► completed
                                                     ├───────────► failed
                                                     └───────────► error

Transitions are enforced by the consumer with ConditionExpression: '#status = :available' on the claim update. This means downloading → downloading is impossible (two tasks cannot both claim the same file), and completed → downloading is impossible (a finished file cannot be reprocessed by a stale message).
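The claim transition can be sketched as an update-input builder. The condition is what makes the claim atomic: of two workers racing on the same stale message, exactly one UpdateItem succeeds and the other receives ConditionalCheckFailedException. buildClaimUpdate is an illustrative helper name.

```typescript
// Transition a file from 'available' to 'downloading'. The condition
// rejects claims from any other source state, so downloading → downloading
// and completed → downloading are both impossible.
export function buildClaimUpdate(fileId: string, nowMs: number = Date.now()) {
  return {
    TableName: "data-download-jobs",
    Key: { FileID: fileId },
    UpdateExpression: "SET #s = :downloading, StatusUpdatedAt = :now",
    ConditionExpression: "#s = :available", // only claim from 'available'
    ExpressionAttributeNames: { "#s": "Status" },
    ExpressionAttributeValues: {
      ":downloading": "downloading",
      ":available": "available",
      ":now": nowMs, // epoch ms, keeps the StatusIndex range key current
    },
  };
}

// Usage sketch:
//   await doc.send(new UpdateCommand(buildClaimUpdate("accounts-2026-02-28")));
//   // catch ConditionalCheckFailedException → someone else claimed it, drop the message
```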

BatchRecord status machine:

  (poller init)       (files queued)      (completion check)
  ──────► initiated ──────► queued ──────────────► completed
                                   ├─────────────► timeout
                                   └─────────────► failed

Batch status is updated by the durable function at each step. No concurrent writer updates the batch record – the durable execution model ensures only one instance of the orchestrator runs at a time – so conditional writes on the batch table are primarily for correctness rather than concurrency safety.

TTL: Two Different Retention Periods

// Job records: 30-day TTL
TTL: Math.floor(Date.now() / 1000) + (30 * 24 * 60 * 60)

// Batch records: 90-day TTL
TTL: Math.floor(Date.now() / 1000) + (90 * 24 * 60 * 60)

Job records are operational: needed while a batch is running and for short-term debugging. 30 days covers a month of history. Batch records are audit-level: a month-end audit might ask “did we receive all expected data for March?” – so 90 days covers a full quarter.

TTL values must be stored in epoch seconds, not milliseconds. Date.now() returns milliseconds; dividing by 1000 converts it. A millisecond value read as seconds parses as a date tens of thousands of years in the future, so a bug that stores milliseconds instead of seconds would result in records that never expire – worth knowing.

DynamoDB TTL deletion is eventually consistent and can lag behind the expiry time (AWS has historically documented a lag of up to 48 hours). For these workloads that is fine, but note that expired records remain queryable – and continue to count toward storage – until DynamoDB actually deletes them, so a reader that must exclude logically expired items should filter on the TTL attribute itself.
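Because deletion lags, an operational query that must ignore logically expired records can filter on the TTL attribute. A FilterExpression runs on top of the key condition (it does not turn the query into a scan, though filtered items still consume read capacity). buildUnexpiredStatusQuery is an illustrative helper name.

```typescript
// Query a status bucket on StatusIndex, excluding records whose TTL has
// logically passed but which DynamoDB has not yet deleted. '#ttl' aliases
// the attribute name defensively.
export function buildUnexpiredStatusQuery(
  status: string,
  nowSec: number = Math.floor(Date.now() / 1000)
) {
  return {
    TableName: "data-download-jobs",
    IndexName: "StatusIndex",
    KeyConditionExpression: "#s = :status",
    FilterExpression: "#ttl > :now", // TTL is stored in epoch seconds
    ExpressionAttributeNames: { "#s": "Status", "#ttl": "TTL" },
    ExpressionAttributeValues: { ":status": status, ":now": nowSec },
  };
}

// Usage sketch:
//   const live = await doc.send(new QueryCommand(buildUnexpiredStatusQuery("available")));
```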

PAY_PER_REQUEST Billing

Both tables use on-demand billing (PAY_PER_REQUEST) rather than provisioned throughput. For a workload that fires once a day and processes a handful of records, provisioned capacity would require guessing at a read/write unit baseline and paying for idle capacity. On-demand charges per request unit consumed, with no idle cost.

The daily access pattern looks like this in terms of read/write units:

  Operation                                            Count   RCU/WCU
  Batch PutItem                                        1       1 WCU
  File PutItem (per entity)                            ~10     ~10 WCU
  File UpdateItem (claim, per entity)                  ~10     ~10 WCU
  GetItem (completion polling × 6 polls × 10 files)    ~60     ~60 RCU
  Batch UpdateItem (status, timeout)                   ~2      ~2 WCU
  File UpdateItem (complete/fail, per entity)          ~10     ~10 WCU

Total: roughly 60 RCU and 33 WCU per day – a trivially small workload. At on-demand rates ($0.25 per million read request units, $1.25 per million write request units at the time of writing) the monthly cost is a fraction of a cent. Even the smallest provisioned-capacity setting would cost more in idle charges than the entire workload consumes.

Point-in-Time Recovery

point_in_time_recovery {
  enabled = var.environment == "prod"
}

PITR is enabled in production only. It allows restoring a table to any second within the past 35 days, which is relevant if records are corrupted, accidentally deleted, or if an audit requires reconstructing historical state. In dev and test environments, the data is synthetic and disposable, so the PITR cost ($0.20 per GB per month) is not justified.

Putting the Access Patterns Together

A complete map of access patterns to table/index:

  Access pattern                           Table    Key/Index          Operation
  Write new job record (idempotent)        jobs     FileID (primary)   PutItem + attribute_not_exists
  Claim a file for processing              jobs     FileID (primary)   UpdateItem + condition
  Update file status (complete/fail)       jobs     FileID (primary)   UpdateItem
  Check if specific file is done           jobs     FileID (primary)   GetItem
  Find stuck ‘downloading’ files           jobs     StatusIndex        Query + range
  All files for a batch                    jobs     BatchIndex         Query
  Completed files for a batch              jobs     BatchIndex         Query + range
  Create batch record                      batches  BatchID (primary)  PutItem
  Update batch (expected files, timeout)   batches  BatchID (primary)  UpdateItem
  Look up today’s batch by date            batches  PollingDateIndex   Query
  Look up timed-out batches for a date     batches  PollingDateIndex   Query + range

Every pattern is served by a GetItem or an indexed Query. There are no scans. This is the goal of DynamoDB design: enumerate your access patterns before writing a line of schema, then verify that every pattern resolves to a primary key lookup or an indexed query.

If you find yourself adding a FilterExpression to a Scan for a production read path, that is the signal to add a GSI – or to reconsider whether DynamoDB is the right store for that access pattern at all.