How to Build a Serverless Data Pipeline with AWS Glue and Athena

Data & Analytics · Palaniappan P · 7 min read

Quick summary: AWS Glue automates ETL (Extract, Transform, Load) workflows while Athena provides serverless SQL queries. This guide covers building a complete data pipeline: ingesting raw data, transforming it, and querying at scale without managing servers.

Key Takeaways

  • AWS Glue automates ETL (Extract, Transform, Load) workflows while Athena provides serverless SQL queries

AWS Glue automates data pipelines: extract raw data (logs, CSVs, databases), transform it (clean, enrich, aggregate), and load it into a data warehouse for analysis. Athena provides serverless SQL to query that data without spinning up databases.

Together, Glue + Athena eliminate infrastructure management. You define transformations, Glue runs them on a schedule, and Athena surfaces insights via SQL — all serverless.

This guide covers building a complete pipeline: ingesting data, transforming it with Glue jobs, and querying with Athena.

Building Data Pipelines on AWS? FactualMinds helps organizations architect serverless data infrastructure for analytics at scale. See our AWS data services or talk to our team.

Step 1: Understand the Glue + Athena Architecture

Raw Data (S3, RDS, APIs)

AWS Glue Job (scheduled ETL)
  → Extract (read from source)
  → Transform (clean, dedupe, enrich)
  → Load (write to S3 in Parquet format)

Athena (SQL queries on S3 data)
  → Run analytics
  → Generate reports

Key components:

  • Glue Crawler: Auto-discovers schema from raw data
  • Glue Job: Executes PySpark ETL code
  • Glue Data Catalog: Metadata about your tables
  • Athena: SQL engine for querying S3 data
  • Partitioning: Organize S3 data by date/region for fast queries
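The Data Catalog is what ties these components together: crawlers write table metadata into it, jobs read from it, and Athena queries through it. A quick way to see what a crawler has registered, sketched with boto3's Glue client (the `raw_data` database name follows this guide; the live AWS call is left commented so the helper stands alone):

```python
def table_names(get_tables_response):
    """Pull table names out of a glue.get_tables() response dict."""
    return [t["Name"] for t in get_tables_response["TableList"]]

# In practice (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# print(table_names(glue.get_tables(DatabaseName="raw_data")))
```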

Step 2: Set Up S3 and Data Catalog

Create S3 buckets for raw data, transformed data, and Athena results:

# Raw data (source)
aws s3api create-bucket --bucket raw-data-bucket --region us-east-1

# Transformed data (processed)
aws s3api create-bucket --bucket transformed-data-bucket --region us-east-1

# Athena results (query output)
aws s3api create-bucket --bucket athena-results-bucket --region us-east-1

# Enable versioning (optional but recommended)
aws s3api put-bucket-versioning --bucket raw-data-bucket --versioning-configuration Status=Enabled

Step 3: Crawl Raw Data with Glue Crawler

A Glue Crawler automatically discovers schema from your raw data:

  1. Go to AWS Glue → Crawlers
  2. Click Create crawler
  3. Crawler name: raw-data-crawler
  4. Data sources:
    • Type: S3
    • S3 path: s3://raw-data-bucket/logs/
  5. IAM role: Create or select a role with S3 access
  6. Output configuration:
    • Database: Create new database raw_data
    • Table prefix: raw_ (tables will be named raw_logs, etc.)
  7. Scheduling:
    • Frequency: Daily at 2 AM
    • (or manual for testing)
  8. Click Create crawler

Run the crawler:

aws glue start-crawler --name raw-data-crawler

Check results in Glue → Databases → raw_data → Tables. The crawler will have created a table with schema inferred from your data.
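You can also verify the inferred schema programmatically. A sketch using boto3's Glue client — `get_table` nests the columns under `StorageDescriptor` — with the live call commented out so the helper stands alone:

```python
def inferred_columns(get_table_response):
    """Extract (name, type) pairs from a glue.get_table() response dict."""
    cols = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in cols]

# In practice (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# resp = glue.get_table(DatabaseName="raw_data", Name="raw_logs")
# print(inferred_columns(resp))
```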

Step 4: Create a Glue ETL Job

Create a Glue job to transform raw data:

  1. Go to AWS Glue → Jobs
  2. Click Create job
  3. Job name: transform-logs
  4. Type: Spark (for large datasets) or Python Shell (for small ones)
  5. IAM role: Select the role with S3 access
  6. Script path: s3://glue-scripts/transform-logs.py
  7. Click Create job

Write the ETL script:

# transform-logs.py
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_timestamp, year, month, dayofmonth

# Glue passes job arguments on the command line; resolve the standard JOB_NAME
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read raw data from the Glue Data Catalog
raw_logs = glueContext.create_dynamic_frame.from_catalog(
    database="raw_data",
    table_name="raw_logs"
)

# Convert to a DataFrame for transformations
df = raw_logs.toDF()

# Transformations
df = df.filter(col("event_type").isNotNull())  # Remove rows with null event_type
df = df.withColumn("timestamp", to_timestamp(col("timestamp")))  # Parse date
df = df.dropDuplicates()  # Dedupe
df = df.withColumn("year", year(col("timestamp")))
df = df.withColumn("month", month(col("timestamp")))
df = df.withColumn("day", dayofmonth(col("timestamp")))

# Write to S3 as Parquet in Hive-style date partitions (year=.../month=.../day=...)
df.write \
    .partitionBy("year", "month", "day") \
    .mode("overwrite") \
    .parquet("s3://transformed-data-bucket/logs/")

job.commit()

Upload the script to S3:

aws s3 cp transform-logs.py s3://glue-scripts/

Step 5: Run Glue Job on Schedule

  1. In the Glue job, go to Job details
  2. Trigger: schedule daily at 3 AM (after the crawler runs at 2 AM)
  3. Max retries: 1
  4. Click Save

Or trigger manually:

aws glue start-job-run --job-name transform-logs

Monitor the job:

aws glue get-job-run --job-name transform-logs --run-id xxxxx
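get-job-run returns the state at a single instant; for scripting it is more useful to poll until the run reaches a terminal state. A sketch of such a helper (the boto3 client is passed in as a parameter, and creating it is left commented so the function stands alone):

```python
import time

# Terminal JobRunState values for a Glue job run
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def wait_for_job_run(glue, job_name, run_id, poll_seconds=30):
    """Poll glue.get_job_run() until the run finishes; return the final state."""
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)

# In practice (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# run = glue.start_job_run(JobName="transform-logs")
# print(wait_for_job_run(glue, "transform-logs", run["JobRunId"]))
```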

Step 6: Create Athena Table from Transformed Data

Once Glue writes transformed data to S3, create an Athena table to query it.

Option A: Glue Crawler (auto-detect schema)

Run a second crawler on the transformed data bucket:

aws glue start-crawler --name transformed-data-crawler

Option B: Manual Athena SQL

CREATE EXTERNAL TABLE IF NOT EXISTS transformed_logs (
    event_type STRING,
    user_id STRING,
    `timestamp` TIMESTAMP,
    message STRING
)
PARTITIONED BY (
    year INT,
    month INT,
    day INT
)
STORED AS PARQUET
LOCATION 's3://transformed-data-bucket/logs/';

(`timestamp` is a reserved word in Athena DDL, so it is backtick-escaped; its type is TIMESTAMP to match the parsed values the Glue job writes.)

Add partitions:

ALTER TABLE transformed_logs ADD IF NOT EXISTS
    PARTITION (year=2026, month=4, day=3)
        LOCATION 's3://transformed-data-bucket/logs/year=2026/month=4/day=3/'

Step 7: Query Data with Athena

Go to Amazon Athena → Query editor:

-- Count events by type
SELECT event_type, COUNT(*) AS count
FROM transformed_logs
WHERE year = 2026 AND month = 4
GROUP BY event_type
ORDER BY count DESC;

-- Find events for a specific user
SELECT timestamp, event_type, message
FROM transformed_logs
WHERE user_id = 'user-12345' AND year = 2026
ORDER BY timestamp DESC;

-- Daily aggregation
SELECT year, month, day, COUNT(*) AS total_events
FROM transformed_logs
WHERE year = 2026
GROUP BY year, month, day
ORDER BY year, month, day;

Step 8: Optimize for Cost and Performance

Partition Strategy

Organize data in S3 with partitions so Athena only scans relevant data:

s3://transformed-data-bucket/logs/
├── year=2026/
│   ├── month=1/
│   │   ├── day=1/
│   │   ├── day=2/
│   │   └── ...
│   ├── month=2/
│   └── ...
└── year=2025/

Query with partition filters to reduce scans:

-- Scans ONLY April 2026 data
SELECT COUNT(*) FROM transformed_logs
WHERE year = 2026 AND month = 4

-- Scans ALL partitions (can be orders of magnitude more expensive)
SELECT COUNT(*) FROM transformed_logs

Compress Data

Use Parquet format with compression:

# In Glue job
df.write.mode("overwrite").option("compression", "snappy").parquet("s3://bucket/data/")

Parquet with Snappy compression is typically 50-80% smaller than the same data as CSV/JSON.

Use Columnar Format

Query only required columns:

-- Efficient: scans only 3 columns
SELECT user_id, event_type, timestamp FROM logs

-- Inefficient: scans all columns
SELECT * FROM logs

Step 9: Automate Reporting

Create a scheduled Athena query that runs daily and saves results:

# lambda_function.py
import boto3
from datetime import datetime

athena = boto3.client('athena')

def run_daily_report():
    query = """
    SELECT event_type, COUNT(*) as count
    FROM transformed_logs
    WHERE year = 2026 AND month = 4 AND day = 3
    GROUP BY event_type
    """

    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'default'},
        ResultConfiguration={'OutputLocation': 's3://athena-results-bucket/'}
    )

    print(f"Query started: {response['QueryExecutionId']}")

def lambda_handler(event, context):
    run_daily_report()
    return {'statusCode': 200}
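start_query_execution is asynchronous, so the handler above returns before the query finishes. If you need the results within the same invocation, poll get_query_execution until the query reaches a terminal state. A sketch (the Athena client is a parameter so the helper stands alone; creating it is left commented):

```python
import time

def wait_for_query(athena, query_execution_id, poll_seconds=2):
    """Poll athena.get_query_execution() until the query finishes; return the final state."""
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)

# In practice (requires AWS credentials):
# athena = boto3.client("athena")
# state = wait_for_query(athena, response["QueryExecutionId"])
# if state == "SUCCEEDED":
#     rows = athena.get_query_results(QueryExecutionId=response["QueryExecutionId"])
```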

Schedule via Amazon EventBridge (formerly CloudWatch Events; the CLI still uses the events namespace):

aws events put-rule --name daily-athena-report --schedule-expression "cron(0 8 * * ? *)"
aws events put-targets --rule daily-athena-report --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:daily-report"

Note that the Lambda function also needs a resource-based policy allowing events.amazonaws.com to invoke it (aws lambda add-permission); without it the rule fires but the function never runs.

Step 10: Production Patterns

Pattern 1: Multi-Stage Pipeline

Raw → Staging (light transformation)

Staging → Curated (business logic)

Curated → Analytics (final queries)

Each stage has its own Glue job and S3 location. This allows:

  • Reusing staging data for multiple downstream jobs
  • Easy debugging (inspect data at each stage)
  • Auditing transformation steps

Pattern 2: Quality Checks

Add data quality validation in Glue jobs:

# Count nulls
null_count = df.filter(col("user_id").isNull()).count()
if null_count > 0:
    print(f"WARNING: {null_count} rows with null user_id")
    # Send SNS alert

# Check for unexpected values
invalid_types = df.filter(~df.event_type.isin(['login', 'logout', 'click'])).count()
if invalid_types > 0:
    print(f"ERROR: {invalid_types} rows with invalid event_type")
    # Fail job
    sys.exit(1)

# If we reach here, data is valid
df.write.mode("overwrite").parquet("s3://bucket/data/")

Pattern 3: Incremental Processing

Avoid reprocessing the same data:

# Only process data from last 24 hours
from datetime import datetime, timedelta

cutoff_time = (datetime.now() - timedelta(days=1)).timestamp()
df = df.filter(col("timestamp") > cutoff_time)

This reduces processing time and cost on large datasets.
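One caveat with the snippet above: comparing against .timestamp() only works when the timestamp column holds epoch seconds. If the column has already been parsed to a real timestamp (as in the Step 4 job), filter against a datetime instead. A minimal sketch of both cutoffs (pure Python; the Spark filters are shown as comments since they need a running job):

```python
from datetime import datetime, timedelta

# Cutoff as epoch seconds -- for a raw column holding Unix timestamps
cutoff_epoch = (datetime.now() - timedelta(days=1)).timestamp()

# Cutoff as a datetime -- for a column already parsed with to_timestamp()
cutoff_dt = datetime.now() - timedelta(days=1)

# In the Glue job (not runnable outside Spark):
# df = df.filter(col("timestamp") > cutoff_epoch)  # epoch-seconds column
# df = df.filter(col("timestamp") > cutoff_dt)     # timestamp column
```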

Common Mistakes to Avoid

  1. Not partitioning data

    • Unpartitioned: Every query scans all data (expensive)
    • Partitioned: Queries scan only relevant data (cheap)
    • Always partition by date, region, or other common filters
  2. Using CSV instead of Parquet

    • CSV: Slow, uncompressed, row-oriented
    • Parquet: Fast, compressed, columnar
    • Convert to Parquet in Glue jobs
  3. Not adding partition metadata

    • Athena needs partition info to optimize queries
    • Use MSCK REPAIR TABLE to add partitions:
      MSCK REPAIR TABLE my_table
  4. Querying without WHERE clauses

    • SELECT * FROM logs scans all data
    • Always add partition filters: WHERE year=2026 AND month=4
  5. Ignoring job failures

    • Set up SNS alerts for failed Glue jobs
    • Monitor job logs in CloudWatch

Cost Estimation

For 1GB of raw data ingested daily:

Component                                        Cost
Glue (2 DPUs, ~30 mins/day, at $0.44/DPU-hour)   ~$13/month
S3 storage (30GB/month retention)                ~$0.69/month
Athena (50 queries/month, ~100MB scanned each)   ~$0.02/month
Total                                            ~$14/month
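The Athena component is easy to sanity-check: Athena bills $5 per TB of data scanned (with a 10MB minimum per query), so small, well-partitioned queries cost fractions of a cent. A quick back-of-the-envelope in Python, using the query volume assumed above:

```python
TB = 1024 ** 4          # bytes per terabyte
PRICE_PER_TB = 5.00     # USD -- Athena's per-TB scan price in most regions

queries_per_month = 50
bytes_scanned_per_query = 100 * 1024 ** 2   # ~100MB each, per the estimate above

monthly_bytes = queries_per_month * bytes_scanned_per_query
athena_cost = monthly_bytes / TB * PRICE_PER_TB
print(f"~${athena_cost:.2f}/month")   # prints ~$0.02/month
```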

Next Steps

  1. Create S3 buckets (5 mins)
  2. Run Glue Crawler on raw data (15 mins)
  3. Write a simple Glue job (1 hour)
  4. Create Athena table and run a query (30 mins)
  5. Set up daily job schedule (15 mins)
  6. Talk to FactualMinds if you need help scaling to production or optimizing costs for large datasets
Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

AWS Architecture · Cloud Migration · GenAI on AWS · Cost Optimization · DevOps

Ready to discuss your AWS strategy?

Our certified architects can help you implement these solutions.
