Skip to content

Athena Cost Optimization

This guide covers cost optimization strategies for AWS Athena, including query optimization, data format selection, partitioning strategies, and cost monitoring.

Athena uses a pay-per-query pricing model where you’re charged based on the amount of data scanned during query execution. Cost optimization focuses on minimizing data scanning while maintaining query performance.

  • Data Scanned: Primary cost driver - amount of data processed
  • Query Complexity: Complex queries may scan more data
  • Data Format: Different formats have different scanning costs
  • Partitioning: Proper partitioning reduces data scanning
  • Compression: Better compression reduces storage and scanning costs

Always filter by partition columns to enable automatic partition pruning.

-- Good: Use partition columns in WHERE clause
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM analytics.user_events
WHERE year = '2024'
AND month = '01'
AND day = '15'
AND event_type IN ('click', 'view', 'purchase')
GROUP BY 1, 2
-- Avoid: Don't use partition columns
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM analytics.user_events
WHERE event_type IN ('click', 'view', 'purchase')
GROUP BY 1, 2

Use appropriate WHERE clauses to limit data scope.

-- Good: Filter early and effectively
SELECT
user_id,
event_type,
event_timestamp,
session_id
FROM analytics.user_events
WHERE year = '2024'
AND month = '01'
AND day = '15'
AND event_type = 'click'
AND user_id IS NOT NULL
AND session_id IS NOT NULL
-- Avoid: Ineffective filtering
SELECT
user_id,
event_type,
event_timestamp,
session_id
FROM analytics.user_events
WHERE event_type = 'click'

Limit data processing for exploration and testing queries.

-- Good: Use LIMIT for exploration
SELECT * FROM analytics.large_table
WHERE year = '2024' AND month = '01'
LIMIT 1000
-- Avoid: Process all data for exploration
SELECT * FROM analytics.large_table
WHERE year = '2024' AND month = '01'

Choose columnar formats like Parquet or ORC for better compression and performance.

-- Good: Use Parquet for efficiency
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM analytics.user_events_parquet
WHERE year = '2024' AND month = '01'
GROUP BY 1, 2
-- Avoid: Use CSV for large datasets
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01'
GROUP BY 1, 2

Use appropriate compression to reduce storage and scanning costs.

-- Create compressed Parquet table
CREATE TABLE analytics.user_events_compressed (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties map<string, string>
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events_compressed/'
TBLPROPERTIES (
'parquet.compression'='SNAPPY'
)

Optimize file sizes for better performance and cost efficiency.

-- Good: Use appropriate file sizes
-- Target 128MB - 1GB per file for optimal performance
CREATE TABLE analytics.user_events_optimized (
user_id string,
event_type string,
event_timestamp timestamp
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events_optimized/'
TBLPROPERTIES (
'parquet.block.size'='134217728' -- 128MB
)

Choose partition columns that align with common query patterns.

-- Good: Use date-based partitioning
CREATE TABLE analytics.user_events (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events/'

Don’t create too many small partitions.

-- Good: Reasonable partition granularity
PARTITIONED BY (
year string,
month string,
day string
)
-- Avoid: Over-partitioning
PARTITIONED BY (
year string,
month string,
day string,
hour string,
minute string
)

Enable partition projection for better performance.

-- Enable partition projection
CREATE TABLE analytics.user_events_projected (
user_id string,
event_type string,
event_timestamp timestamp
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.year.type'='integer',
'projection.year.range'='2020,2030',
'projection.month.type'='integer',
'projection.month.range'='1,12',
'projection.day.type'='integer',
'projection.day.range'='1,31'
)

Choose the right JOIN type and optimize JOIN conditions.

-- Good: Efficient JOIN with filtering
SELECT
u.user_id,
u.email,
e.event_count
FROM users u
INNER JOIN (
SELECT
user_id,
COUNT(*) as event_count
FROM user_events
WHERE year = '2024' AND month = '01'
GROUP BY 1
) e ON u.user_id = e.user_id
WHERE u.status = 'active'
-- Avoid: Inefficient JOIN
SELECT
u.user_id,
u.email,
COUNT(e.event_id) as event_count
FROM users u
LEFT JOIN user_events e ON u.user_id = e.user_id
WHERE u.status = 'active'
GROUP BY 1, 2

Use efficient aggregation patterns.

-- Good: Efficient aggregation
SELECT
user_id,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count,
SUM(duration) as total_duration
FROM analytics.user_events
WHERE year = '2024' AND month = '01'
GROUP BY 1
-- Avoid: Inefficient aggregation
SELECT
user_id,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count,
SUM(duration) as total_duration
FROM analytics.user_events
GROUP BY 1

Optimize window functions for better performance.

-- Good: Efficient window function
SELECT
user_id,
event_timestamp,
event_type,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp) as event_sequence
FROM analytics.user_events
WHERE year = '2024' AND month = '01'
AND user_id IS NOT NULL
-- Avoid: Inefficient window function
SELECT
user_id,
event_timestamp,
event_type,
ROW_NUMBER() OVER (ORDER BY event_timestamp) as global_sequence
FROM analytics.user_events
WHERE year = '2024' AND month = '01'

Choose the right data types to minimize storage and scanning costs.

-- Good: Appropriate data types
SELECT
CAST(user_id AS VARCHAR) as user_id,
CAST(event_count AS INTEGER) as event_count,
CAST(duration AS DOUBLE) as duration,
CAST(created_at AS TIMESTAMP) as created_at
FROM staging.events
WHERE year = '{{ ds | date_format("%Y") }}'
-- Avoid: Unnecessary data type conversions
SELECT
user_id,
event_count,
duration,
created_at
FROM staging.events
WHERE year = '{{ ds | date_format("%Y") }}'

Avoid unnecessary casting and conversions.

-- Good: Minimal conversions
SELECT
user_id,
event_type,
DATE(event_timestamp) as event_date,
COUNT(*) as event_count
FROM staging.events
WHERE year = '{{ ds | date_format("%Y") }}'
GROUP BY 1, 2, 3
-- Avoid: Excessive conversions
SELECT
CAST(user_id AS VARCHAR) as user_id,
CAST(event_type AS VARCHAR) as event_type,
CAST(DATE(event_timestamp) AS DATE) as event_date,
CAST(COUNT(*) AS INTEGER) as event_count
FROM staging.events
WHERE year = '{{ ds | date_format("%Y") }}'
GROUP BY 1, 2, 3

Track data scanning volume to identify cost optimization opportunities.

-- Monitor data scanning
SELECT
DATE(execution_date) as query_date,
SUM(data_scanned_bytes) as total_bytes_scanned,
COUNT(*) as query_count,
AVG(data_scanned_bytes) as avg_bytes_per_query
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
GROUP BY 1
ORDER BY 1 DESC

Find queries that scan the most data.

-- Find expensive queries
SELECT
query_id,
execution_date,
data_scanned_bytes,
cost_usd,
query
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
ORDER BY data_scanned_bytes DESC
LIMIT 20

Monitor cost trends over time.

-- Track cost trends
SELECT
DATE(execution_date) as query_date,
SUM(cost_usd) as daily_cost,
SUM(data_scanned_bytes) as daily_bytes_scanned,
COUNT(*) as daily_query_count
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 1 DESC

Set up alerts for daily cost thresholds.

-- Check daily costs
SELECT
DATE(execution_date) as query_date,
SUM(cost_usd) as daily_cost
FROM athena_query_logs
WHERE execution_date = CURRENT_DATE
GROUP BY 1
HAVING SUM(cost_usd) > 50.00 -- Alert threshold

Monitor monthly spending patterns.

-- Check monthly costs
SELECT
YEAR(execution_date) as query_year,
MONTH(execution_date) as query_month,
SUM(cost_usd) as monthly_cost
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 MONTH)
GROUP BY 1, 2
HAVING SUM(cost_usd) > 1000.00 -- Alert threshold

Always filter by partition columns first.

-- Good: Partition filter first
SELECT * FROM analytics.user_events
WHERE year = '2024' -- Partition filter
AND month = '01' -- Partition filter
AND day = '15' -- Partition filter
AND user_id = '12345'
AND event_type = 'click'

Choose the right JOIN type for your use case.

-- Good: Use INNER JOIN when appropriate
SELECT
u.user_id,
u.email,
e.event_count
FROM users u
INNER JOIN user_events e ON u.user_id = e.user_id
WHERE u.year = '2024' AND u.month = '01'
AND e.year = '2024' AND e.month = '01'

Use efficient aggregation patterns.

-- Good: Efficient aggregation
SELECT
user_id,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count,
SUM(duration) as total_duration
FROM analytics.user_events
WHERE year = '2024' AND month = '01'
GROUP BY 1

Set up data retention and archival policies.

-- Implement data retention
SELECT * FROM analytics.user_events
WHERE year >= '2023' -- Keep last 2 years
AND year = '{{ ds | date_format("%Y") }}'

Implement S3 storage class transitions.

-- Use S3 lifecycle policies
-- Transition to IA after 30 days
-- Transition to Glacier after 90 days
-- Delete after 2 years

Compress older data to reduce storage costs.

-- Compress historical data
CREATE TABLE analytics.user_events_compressed
WITH (
format = 'PARQUET',
external_location = 's3://your-bucket/analytics/user_events_compressed/'
)
AS
SELECT * FROM analytics.user_events
WHERE year < '2023'

Regularly monitor and optimize query performance.

-- Monitor slow queries
SELECT
query_id,
execution_time_ms,
data_scanned_bytes,
cost_usd,
query
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 24 HOUR)
AND execution_time_ms > 30000 -- Slow queries
ORDER BY execution_time_ms DESC

Set up cost monitoring and controls.

-- Monitor costs
SELECT
DATE(execution_date) as query_date,
SUM(data_scanned_bytes) as total_bytes_scanned,
SUM(cost_usd) as total_cost_usd,
COUNT(*) as query_count
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
GROUP BY 1
ORDER BY 1 DESC

Conduct regular cost optimization reviews.

  • Analyze cost trends and patterns
  • Identify optimization opportunities
  • Implement cost-saving measures
  • Monitor cost impact of changes

Use AWS Cost Explorer to analyze Athena costs.

-- Analyze costs by service
SELECT
service_name,
SUM(cost_usd) as total_cost,
SUM(data_scanned_bytes) as total_bytes_scanned
FROM aws_cost_explorer
WHERE service_name = 'Amazon Athena'
AND usage_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY 1
-- Analyze costs by resource
SELECT
resource_id,
SUM(cost_usd) as total_cost,
COUNT(*) as usage_count
FROM aws_cost_explorer
WHERE service_name = 'Amazon Athena'
AND usage_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY 1
ORDER BY total_cost DESC

Create custom dashboards for cost monitoring.

-- Daily cost dashboard
SELECT
DATE(execution_date) as query_date,
SUM(cost_usd) as daily_cost,
SUM(data_scanned_bytes) as daily_bytes_scanned,
COUNT(*) as daily_query_count,
AVG(cost_usd) as avg_cost_per_query
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 1 DESC
-- Monthly cost dashboard
SELECT
YEAR(execution_date) as query_year,
MONTH(execution_date) as query_month,
SUM(cost_usd) as monthly_cost,
SUM(data_scanned_bytes) as monthly_bytes_scanned,
COUNT(*) as monthly_query_count
FROM athena_query_logs
WHERE execution_date >= DATE_SUB(CURRENT_DATE, INTERVAL 12 MONTH)
GROUP BY 1, 2
ORDER BY 1 DESC, 2 DESC

Problem: Sudden increase in Athena costs

Solutions:

  • Check for new queries or increased usage
  • Review data scanning patterns
  • Implement cost controls and alerts
  • Optimize expensive queries

Debug Steps:

  1. Analyze cost breakdown by query
  2. Check for new data sources
  3. Review query patterns and frequency
  4. Implement cost monitoring

Problem: Queries scanning too much data

Solutions:

  • Optimize query structure
  • Implement proper partitioning
  • Use appropriate data formats
  • Add effective filtering

Debug Steps:

  1. Analyze query execution plans
  2. Check partition pruning effectiveness
  3. Review data format and compression
  4. Optimize query patterns

Problem: High S3 storage costs

Solutions:

  • Implement data lifecycle policies
  • Use appropriate storage classes
  • Compress historical data
  • Archive old data

Debug Steps:

  1. Analyze storage usage by bucket
  2. Check data retention policies
  3. Review storage class transitions
  4. Implement data archival
  • Weekly cost analysis
  • Monthly optimization reviews
  • Quarterly cost planning
  • Annual cost budgeting
  • Set up cost alerts
  • Implement cost dashboards
  • Monitor cost trends
  • Track optimization impact
  • Regular query performance analysis
  • Implement query optimization
  • Monitor query patterns
  • Optimize expensive queries
  • Implement data lifecycle policies
  • Use appropriate storage classes
  • Compress historical data
  • Archive old data