
AWS DynamoDB

Analyze items in DynamoDB using Export to S3

To analyze items in DynamoDB, export the table to S3 and then process the exported data with the tools of your choice.

This is generally more efficient and cost-effective than querying the data directly from DynamoDB, especially for large datasets, because the export does not consume read capacity from the table.

Pre-processing the exported data before analysis (for example, keeping only the attributes you need) can further improve performance and reduce costs.

Use Export to S3 if:

  • You don’t need live data (up to 5 minutes old is OK).
  • You want to use tools like Amazon Athena, Amazon EMR, or AWS Glue for analysis.
  • You want to store the data in a more cost-effective manner.
  • You want to avoid throttling and RCU costs.

Note: Point-in-time recovery (PITR) must be enabled on your DynamoDB table to use Export to S3.
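
If PITR is not yet enabled on the table, you can turn it on from the console or with boto3. A minimal sketch, assuming TABLE_NAME is replaced with your table name:

import boto3

client = boto3.client('dynamodb')

# Enable point-in-time recovery so the table can be exported to S3
client.update_continuous_backups(
    TableName='TABLE_NAME',
    PointInTimeRecoverySpecification={'PointInTimeRecoveryEnabled': True}
)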

Steps to Export DynamoDB Table to S3

  1. Using AWS Console (UI):

    1. Go to the DynamoDB Console.
    2. Select your table.
    3. In the left navigation pane, choose Exports to S3.
    4. Click Export to S3.
    5. Configure the S3 bucket and export options.
    6. Click Export to start the export process.
  2. Using boto3 (Python SDK):

    import boto3
    
    client = boto3.client('dynamodb')
    
    response = client.export_table_to_point_in_time(
        TableArn='arn:aws:dynamodb:REGION:ACCOUNT_ID:table/TABLE_NAME',
        S3Bucket='your-s3-bucket-name',
        ExportFormat='DYNAMODB_JSON'  # or 'ION'
    )
    print(response)

    Replace REGION, ACCOUNT_ID, and TABLE_NAME with your values.

    Make sure the S3 bucket exists and your IAM role has the necessary permissions.

Note: The export process may take some time to complete, depending on the size of your DynamoDB table.
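
You can monitor progress with describe_export. A minimal sketch that polls using the ExportArn returned by the call above:

import time

export_arn = response['ExportDescription']['ExportArn']

# Poll until the export leaves the IN_PROGRESS state (COMPLETED or FAILED)
while True:
    status = client.describe_export(ExportArn=export_arn)['ExportDescription']['ExportStatus']
    print(f"Export status: {status}")
    if status != 'IN_PROGRESS':
        break
    time.sleep(60)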

Exporting data to S3 allows you to take advantage of the scalability and cost-effectiveness of S3 for data storage and analysis.

Analyze the data

Once your DynamoDB data is exported to S3 in DYNAMODB_JSON format, you can analyze it with AWS services (Athena, Glue, EMR) or locally.

The exported files are gzip-compressed JSON Lines (.json.gz), where each line is a separate JSON object representing a single DynamoDB item.
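
For illustration, a single exported line looks roughly like this (the attribute names are hypothetical; every value is wrapped in a DynamoDB type descriptor such as "S" or "N"):

{"Item": {"id": {"S": "user#123"}, "created_at": {"N": "1700000000"}, "sample-attribute": {"S": "some value"}}}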

Download the exported files from S3, then parse the JSONL data and flatten it to CSV for analysis:

import csv
import gzip
import json
import os

import boto3

profile_name = ""  # AWS CLI profile to use for credentials

s3c = boto3.session.Session(profile_name=profile_name).client('s3')

bucket_name = "sample-bucket"
prefix = "sample-prefix/"
local_dir = "local-dir"

os.makedirs(local_dir, exist_ok=True)

# Page through every object under the export prefix and download it locally
paginator = s3c.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        filename = os.path.basename(key)
        if not filename:  # skip folders
            continue

        local_path = os.path.join(local_dir, filename)
        print(f"Downloading {key}{local_path}")
        s3c.download_file(bucket_name, key, local_path)

output_csv = os.path.join(local_dir, "output.csv")
fieldnames = ["sample-attribute"]  # Attribute names you want to extract (CSV columns)

rows = []

# Parse each gzipped JSONL export file and pull out the attributes of interest
for filename in os.listdir(local_dir):
    if filename.endswith(".json.gz"):
        file_path = os.path.join(local_dir, filename)
        print(f"Reading {file_path}...")

        with gzip.open(file_path, 'rt', encoding='utf-8') as f:
            for line in f:
                data = json.loads(line)
                item = data.get("Item", {})
                # DynamoDB JSON wraps every value in a type descriptor, e.g. {"S": "text"}
                row = {
                    "sample-attribute": item.get("sample-attribute", {}).get("S", "")
                }
                rows.append(row)

# Write the extracted rows to a single CSV file
with open(output_csv, "w", newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

print(f"\n✅ All data extracted to {output_csv}.")