Athena Data Formats

This guide covers the data formats supported by Amazon Athena, including format selection, optimization strategies, and best practices for different use cases.

Overview

Athena supports multiple data formats, each with different characteristics for storage efficiency, query performance, and use case suitability. Choosing the right format is crucial for optimal performance and cost management.

Supported Formats

Parquet: Columnar format, optimal for analytical workloads
ORC: Optimized Row Columnar format, good alternative to Parquet
JSON: Semi-structured format, flexible schema
CSV: Comma-separated values, simple but inefficient
TSV: Tab-separated values, similar to CSV
Avro: Row-based format with schema evolution
Ion: Amazon’s richly typed format for semi-structured data, with text and binary encodings

Parquet Format

Overview

Parquet is a columnar storage format that provides excellent compression and query performance for analytical workloads.

Benefits

High Compression: Up to 80% reduction in storage size compared with uncompressed text, depending on the data
Columnar Storage: Optimized for analytical queries
Schema Evolution: Built-in support for schema changes
Predicate Pushdown: Efficient filtering at storage level
Athena Optimization: Native support and optimization

Creating Parquet Tables

-- Create Parquet table (Athena requires the EXTERNAL keyword for tables over S3 data)
CREATE EXTERNAL TABLE analytics.user_events_parquet (
    user_id string,
    event_type string,
    event_timestamp timestamp,
    session_id string,
    properties map<string, string>
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events_parquet/'
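
Tables partitioned by year, month, and day are commonly backed by Hive-style key=value prefixes under the table location. The guide does not say how the files are laid out, so the sketch below assumes that layout. It builds the S3 prefix for one day and a matching ALTER TABLE ... ADD PARTITION statement. The helper names are hypothetical.

```python
from datetime import date


def partition_prefix(table_location: str, day: date) -> str:
    """Return the Hive-style S3 prefix for one day's partition.

    Assumes objects are laid out as year=YYYY/month=MM/day=DD/ under the
    table location; the table definition itself does not enforce this.
    """
    base = table_location.rstrip("/")
    return f"{base}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def add_partition_statement(table: str, table_location: str, day: date) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one day."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{day:%Y}', month='{day:%m}', day='{day:%d}') "
        f"LOCATION '{partition_prefix(table_location, day)}'"
    )


example = add_partition_statement(
    "analytics.user_events_parquet",
    "s3://your-bucket/analytics/user_events_parquet/",
    date(2024, 1, 15),
)
print(example)
```

Zero-padding the month and day keeps the partition values consistent with the string literals used in the queries in this guide ('01', '15').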

Parquet Configuration

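-- Configure Parquet settings (alternative definition of the same table;
-- these properties affect how files are written, not how existing files are read)
CREATE EXTERNAL TABLE analytics.user_events_parquet (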
    user_id string,
    event_type string,
    event_timestamp timestamp,
    session_id string,
    properties map<string, string>
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events_parquet/'
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',
    'parquet.enable.dictionary'='true',
    'parquet.page.size'='1048576'
)

Querying Parquet Data

-- Query Parquet data efficiently
SELECT
    user_id,
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT session_id) as session_count
FROM analytics.user_events_parquet
WHERE year = '2024'
  AND month = '01'
  AND day = '15'
  AND event_type IN ('click', 'view', 'purchase')
GROUP BY 1, 2
ORDER BY event_count DESC

Best Practices for Parquet

1. Use Appropriate Compression

-- Good: Use Snappy compression for balance
TBLPROPERTIES ('parquet.compression'='SNAPPY')

-- For maximum compression (slower writes)
TBLPROPERTIES ('parquet.compression'='GZIP')

-- For fastest writes (less compression)
TBLPROPERTIES ('parquet.compression'='UNCOMPRESSED')

2. Enable Dictionary Encoding

-- Enable dictionary encoding for repeated values
TBLPROPERTIES ('parquet.enable.dictionary'='true')

3. Optimize Row Group Size

-- Set appropriate row group size
TBLPROPERTIES ('parquet.block.size'='134217728')  -- 128MB
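
The size properties in this section are given in bytes, which is easy to mistype. As a small sketch, the helper below makes the arithmetic explicit: 128 MiB is 128 × 1024 × 1024 = 134217728 bytes. The function names are illustrative and not part of any Athena tooling.

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return n * 1024 * 1024


def tblproperties(props: dict) -> str:
    """Render a dict as a TBLPROPERTIES clause with quoted keys and values."""
    pairs = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return f"TBLPROPERTIES ({pairs})"


# Row group size used in this guide (128 MiB) and the page size (1 MiB).
print(tblproperties({"parquet.block.size": mib(128)}))
print(tblproperties({"parquet.page.size": mib(1)}))
```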

ORC Format

Overview

ORC (Optimized Row Columnar) is another columnar format that provides good compression and performance, with some advantages over Parquet in certain scenarios.

Benefits

Good Compression: Efficient storage compression
Columnar Storage: Optimized for analytical queries
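ACID Support: ORC underpins Hive transactional tables; Athena does not read Hive transactional tables, so this matters only when Hive also manages the data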
Hive Compatibility: Native Hive format support
Bloom Filters: Built-in bloom filter support

Creating ORC Tables

-- Create ORC table (Athena requires the EXTERNAL keyword for tables over S3 data)
CREATE EXTERNAL TABLE analytics.user_events_orc (
    user_id string,
    event_type string,
    event_timestamp timestamp,
    session_id string,
    properties map<string, string>
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
STORED AS ORC
LOCATION 's3://your-bucket/analytics/user_events_orc/'

ORC Configuration

-- Configure ORC settings (alternative definition of the same table)
CREATE EXTERNAL TABLE analytics.user_events_orc (
    user_id string,
    event_type string,
    event_timestamp timestamp,
    session_id string,
    properties map<string, string>
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
STORED AS ORC
LOCATION 's3://your-bucket/analytics/user_events_orc/'
TBLPROPERTIES (
    'orc.compress'='SNAPPY',
    'orc.stripe.size'='67108864',
    'orc.row.index.stride'='10000'
)

Querying ORC Data

-- Query ORC data efficiently
SELECT
    user_id,
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT session_id) as session_count
FROM analytics.user_events_orc
WHERE year = '2024'
  AND month = '01'
  AND day = '15'
  AND event_type = 'click'
GROUP BY 1, 2

Best Practices for ORC

1. Use Appropriate Compression

-- Good: Use Snappy compression
TBLPROPERTIES ('orc.compress'='SNAPPY')

-- For maximum compression
TBLPROPERTIES ('orc.compress'='ZLIB')

-- For fastest writes
TBLPROPERTIES ('orc.compress'='NONE')

2. Optimize Stripe Size

-- Set appropriate stripe size
TBLPROPERTIES ('orc.stripe.size'='67108864')  -- 64MB

3. Configure Row Index Stride

-- Set row index stride for better performance
TBLPROPERTIES ('orc.row.index.stride'='10000')

JSON Format

Overview

JSON format provides flexibility for semi-structured data but with lower performance compared to columnar formats.

Benefits

Schema Flexibility: Easy to work with semi-structured data
Human Readable: Easy to inspect and debug
Rapid Prototyping: Quick to set up and test
Schema Evolution: Natural support for changing schemas
Simple Integration: Easy to work with application data

Creating JSON Tables

-- Create JSON table (JSON is read through a SerDe)
CREATE EXTERNAL TABLE analytics.user_events_json (
    user_id string,
    event_type string,
    event_timestamp timestamp,
    session_id string,
    properties string
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/analytics/user_events_json/'

Querying JSON Data

-- Query JSON data
SELECT
    user_id,
    event_type,
    event_timestamp,
    JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
    JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json
WHERE year = '2024'
  AND month = '01'
  AND day = '15'
  AND event_type = 'click'

JSON Functions

Extract Scalar Values

-- Extract scalar values from JSON
SELECT
    user_id,
    JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
    JSON_EXTRACT_SCALAR(properties, '$.device') as device,
    JSON_EXTRACT_SCALAR(properties, '$.location.country') as country
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'
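
To reason about what JSON_EXTRACT_SCALAR returns before running a query, it can help to approximate it locally. The sketch below is a rough emulation for simple dotted paths only, such as '$.location.country'. It returns a string for scalar values and None when the path is missing or points at an object or array. The function name and the sample document are hypothetical, and treating unparseable input as None is an assumption of this sketch.

```python
import json
from typing import Optional


def extract_scalar(document: str, path: str) -> Optional[str]:
    """Approximate JSON_EXTRACT_SCALAR for simple dotted paths like '$.a.b'.

    Not a full JSONPath implementation: array indexes and bracket syntax
    are not handled.
    """
    if not path.startswith("$."):
        raise ValueError("only paths of the form '$.key.subkey' are supported")
    try:
        node = json.loads(document)
    except json.JSONDecodeError:
        return None  # assumption of this sketch: unparseable input yields no value
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # missing path
        node = node[key]
    if isinstance(node, bool):
        return "true" if node else "false"
    if isinstance(node, (str, int, float)):
        return str(node)
    return None  # objects, arrays and JSON null are not scalars


properties = '{"browser": "firefox", "location": {"country": "DE"}, "tags": ["a", "b"]}'
print(extract_scalar(properties, "$.browser"))
print(extract_scalar(properties, "$.location.country"))
print(extract_scalar(properties, "$.tags"))
```

The last call returns None because the path points at an array, which is why array values are handled with JSON_EXTRACT instead.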

Extract Arrays

-- Extract arrays from JSON
SELECT
    user_id,
    JSON_EXTRACT(properties, '$.tags') as tags,
    CAST(JSON_EXTRACT(properties, '$.categories') AS ARRAY(VARCHAR)) as categories
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'

Complex JSON Queries

-- Complex JSON queries
SELECT
    user_id,
    event_type,
    JSON_EXTRACT_SCALAR(properties, '$.user_agent') as user_agent,
    JSON_EXTRACT_SCALAR(properties, '$.session.duration') as session_duration,
    JSON_EXTRACT_SCALAR(properties, '$.location.latitude') as latitude,
    JSON_EXTRACT_SCALAR(properties, '$.location.longitude') as longitude
FROM analytics.user_events_json
WHERE year = '2024'
  AND month = '01'
  AND JSON_EXTRACT_SCALAR(properties, '$.device') = 'mobile'

Best Practices for JSON

1. Use Appropriate Data Types

-- Good: Use string for JSON data (SerDe and LOCATION clauses omitted for brevity)
CREATE EXTERNAL TABLE analytics.user_events_json (
    user_id string,
    properties string  -- Store as string; parse with JSON functions at query time
)

-- Avoid: Using complex types for JSON whose keys or nesting vary
CREATE EXTERNAL TABLE analytics.user_events_json (
    user_id string,
    properties map<string, string>  -- Less flexible
)

2. Optimize JSON Structure

-- Good: Flatten JSON when possible
SELECT
    user_id,
    JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
    JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json

-- Avoid: Deeply nested JSON queries
SELECT
    user_id,
    JSON_EXTRACT_SCALAR(properties, '$.user.session.device.browser.name') as browser
FROM analytics.user_events_json

3. Use JSON Functions Efficiently

-- Good: Extract once and reuse
WITH extracted_properties AS (
  SELECT
    user_id,
    JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
    JSON_EXTRACT_SCALAR(properties, '$.device') as device
  FROM analytics.user_events_json
  WHERE year = '2024' AND month = '01'
)
SELECT
  browser,
  device,
  COUNT(*) as event_count
FROM extracted_properties
GROUP BY 1, 2

CSV Format

Overview

CSV is a simple text format that’s easy to work with but generally inefficient for large-scale analytical workloads.

Benefits

Simplicity: Easy to understand and work with
Compatibility: Works with many tools and systems
Human Readable: Easy to inspect and debug
Quick Setup: Fast to implement and test

Limitations

Poor Compression: Minimal compression benefits
No Schema: No built-in schema support
Performance: Slower query performance
Cost: Higher storage and query costs

Creating CSV Tables

-- Create CSV table (ROW FORMAT must come before STORED AS and LOCATION)
CREATE EXTERNAL TABLE analytics.user_events_csv (
    user_id string,
    event_type string,
    event_timestamp timestamp,
    session_id string,
    properties string
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://your-bucket/analytics/user_events_csv/'
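
A table declared with ROW FORMAT DELIMITED and a backslash escape character does not interpret quoted fields, so a comma inside a value has to be escaped when the file is written. As a minimal sketch of a writer that matches that declaration, the Python standard library csv module can be configured to escape instead of quote. The helper name and sample row are hypothetical.

```python
import csv
import io


def to_delimited_line(fields: list) -> str:
    """Serialize one row as comma-delimited text with backslash escaping.

    QUOTE_NONE disables quoting, so a comma inside a field is written as a
    backslash followed by the comma instead of wrapping the field in quotes.
    """
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quoting=csv.QUOTE_NONE,
        escapechar="\\",
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buffer.getvalue()


line = to_delimited_line(["u1", "click", "2024-01-15 10:00:00", "s1", "ref=a,b"])
print(line, end="")
```

If the source files already use quoted fields, a quote-aware SerDe is usually a better fit than rewriting the data.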

Querying CSV Data

-- Query CSV data
SELECT
    user_id,
    event_type,
    COUNT(*) as event_count
FROM analytics.user_events_csv
WHERE year = '2024'
  AND month = '01'
  AND day = '15'
GROUP BY 1, 2

Best Practices for CSV

1. Use Only for Small Datasets

-- Good: Use CSV for small, simple datasets
CREATE EXTERNAL TABLE staging.simple_data_csv (
    id string,
    name string,
    value double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-bucket/staging/simple_data_csv/'  -- placeholder path

2. Avoid for Large Datasets

-- Avoid: Don't use CSV for large datasets
-- Use Parquet or ORC instead
CREATE EXTERNAL TABLE analytics.large_dataset_parquet (
    user_id string,
    event_data string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/large_dataset_parquet/'  -- placeholder path

Format Selection Guide

Choose Parquet When:

Analytical Workloads: Complex analytical queries
Large Datasets: Datasets > 1GB
Cost Optimization: Minimizing storage and query costs
Performance Critical: Query performance is important
Stable Schema: Schema doesn’t change frequently

Choose ORC When:

Hive Compatibility: Need Hive ecosystem compatibility
ACID Transactions: The data is also written by Hive transactional workflows (Athena itself does not provide ACID semantics on plain ORC tables)
Good Compression: Need efficient compression
Columnar Benefits: Want columnar storage benefits

Choose JSON When:

Semi-structured Data: Working with flexible schemas
Schema Evolution: Schema changes frequently
Prototyping: Rapid development and testing
Small to Medium Datasets: Datasets < 1GB
Human Readability: Need to inspect data easily

Choose CSV When:

Simple Data: Basic, flat data structures
Small Datasets: Datasets < 100MB
Compatibility: Need broad tool compatibility
Quick Setup: Rapid prototyping needs
Human Inspection: Need to easily view data

Avoid CSV When:

Large Datasets: Datasets > 100MB
Cost Optimization: Cost is a concern
Performance Critical: Query performance matters
Analytical Workloads: Complex analytical queries
Production Use: Production workloads
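
The selection rules above can be written as a small decision function. This is only one possible reading of the guide: the order in which the rules are checked, the binary interpretation of "1GB" and "100MB", and the function and parameter names are all assumptions of this sketch.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def suggest_format(size_bytes: int, semi_structured: bool = False,
                   needs_hive_compat: bool = False) -> str:
    """Suggest a storage format from the thresholds in this guide."""
    if needs_hive_compat:
        return "ORC"        # Hive ecosystem compatibility takes priority
    if size_bytes > GB:
        return "PARQUET"    # large datasets: columnar storage pays off
    if semi_structured:
        return "JSON"       # flexible schema, small to medium data
    if size_bytes < 100 * MB:
        return "CSV"        # small, flat data
    return "PARQUET"        # default for everything in between


print(suggest_format(5 * GB))
print(suggest_format(50 * MB))
print(suggest_format(500 * MB, semi_structured=True))
```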

Performance Comparison

Storage Efficiency

Format    Compression   Storage Efficiency   Query Performance
Parquet   Excellent     High                 Excellent
ORC       Good          High                 Good
JSON      Poor          Low                  Poor
CSV       None          Very Low             Very Poor

Query Performance

-- Performance comparison query (assumes a query_performance_logs table that you populate yourself)
SELECT
    format_type,
    AVG(execution_time_ms) as avg_execution_time,
    AVG(data_scanned_bytes) as avg_data_scanned,
    AVG(cost_usd) as avg_cost
FROM query_performance_logs
WHERE query_date >= '2024-01-01'
GROUP BY 1
ORDER BY avg_execution_time

Cost Analysis

-- Cost comparison query (assumes a cost_analysis table that you populate yourself)
SELECT
    format_type,
    SUM(storage_cost_usd) as total_storage_cost,
    SUM(query_cost_usd) as total_query_cost,
    SUM(storage_cost_usd + query_cost_usd) as total_cost
FROM cost_analysis
WHERE analysis_date >= '2024-01-01'
GROUP BY 1
ORDER BY total_cost
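
Athena bills queries by the amount of data scanned, which is why format choice shows up directly in query cost. The sketch below estimates the cost of a single query from its scanned bytes. The default price per terabyte and the per-query minimum are assumptions for illustration, so check the current pricing for your region. The scan sizes in the example are hypothetical.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def estimate_scan_cost(bytes_scanned: int, price_per_tb: float = 5.0,
                       minimum_bytes: int = 10 * MB) -> float:
    """Estimate the cost in USD of one query from the bytes it scans.

    price_per_tb and minimum_bytes are assumed defaults, not values taken
    from this guide.
    """
    billable = max(bytes_scanned, minimum_bytes)
    return billable / TB * price_per_tb


# Hypothetical scan sizes for the same query against two copies of a dataset.
csv_cost = estimate_scan_cost(1 * TB)
parquet_cost = estimate_scan_cost(TB // 10)
print(f"CSV: ${csv_cost:.2f}, Parquet: ${parquet_cost:.2f}")
```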

Migration Strategies

Converting Between Formats

CSV to Parquet

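-- Convert CSV to Parquet with CTAS (the target table must not already exist
-- and the external location must be empty)
CREATE TABLE analytics.user_events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/analytics/user_events_parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
)
AS
-- SELECT * returns the partition columns last, as partitioned_by requires
SELECT * FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01'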

JSON to Parquet

-- Convert JSON to Parquet
CREATE TABLE analytics.user_events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/analytics/user_events_parquet/'
)
AS
SELECT
    user_id,
    event_type,
    event_timestamp,
    session_id,
    JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
    JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'

Migration Best Practices

1. Gradual Migration

-- Migrate data gradually by partition
INSERT INTO analytics.user_events_parquet
SELECT * FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01'

2. Validate Data

-- Validate migration
SELECT
    COUNT(*) as csv_count
FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01';

SELECT
    COUNT(*) as parquet_count
FROM analytics.user_events_parquet
WHERE year = '2024' AND month = '01';
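
A single total can hide a partition that was dropped or duplicated, so it can be worth comparing counts per partition as well. The sketch below shows the idea using an in-memory SQLite database as a stand-in for Athena. The tables, rows, and helper names are hypothetical, and the same GROUP BY queries would be run against the real source and target tables.

```python
import sqlite3


def partition_counts(conn, table):
    """Return {(year, month, day): row_count} for one table."""
    rows = conn.execute(
        f"SELECT year, month, day, COUNT(*) FROM {table} GROUP BY year, month, day"
    )
    return {(y, m, d): n for y, m, d, n in rows}


def count_mismatches(conn, source, target):
    """List partitions whose row counts differ between source and target."""
    src = partition_counts(conn, source)
    tgt = partition_counts(conn, target)
    return sorted(
        (key, src.get(key, 0), tgt.get(key, 0))
        for key in src.keys() | tgt.keys()
        if src.get(key, 0) != tgt.get(key, 0)
    )


# Stand-in data: the target is missing one row from the 2024-01-15 partition.
conn = sqlite3.connect(":memory:")
for table in ("user_events_csv", "user_events_parquet"):
    conn.execute(f"CREATE TABLE {table} (user_id TEXT, year TEXT, month TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO user_events_csv VALUES (?, ?, ?, ?)",
    [("u1", "2024", "01", "15"), ("u2", "2024", "01", "15"), ("u3", "2024", "01", "16")],
)
conn.executemany(
    "INSERT INTO user_events_parquet VALUES (?, ?, ?, ?)",
    [("u1", "2024", "01", "15"), ("u3", "2024", "01", "16")],
)
mismatches = count_mismatches(conn, "user_events_csv", "user_events_parquet")
print(mismatches)
```

An empty list means every partition has the same row count in both tables. It does not prove the values are identical, only that no rows were lost or added.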

3. Performance Testing

-- Inspect the query plan (EXPLAIN does not run the query;
-- use EXPLAIN ANALYZE to run it and report runtime statistics)
EXPLAIN
SELECT
    user_id,
    COUNT(*) as event_count
FROM analytics.user_events_parquet
WHERE year = '2024' AND month = '01'
GROUP BY 1

Troubleshooting

Common Issues

Format Compatibility

Problem: Format not supported or recognized

Solutions:

Check format specification
Verify file structure
Use a SerDe that matches the file format

Debug Steps:

Check file format in S3
Verify table definition
Test with sample data
Check Athena documentation

Performance Issues

Problem: Poor query performance

Solutions:

Use appropriate format
Optimize compression
Implement partitioning
Review query patterns

Debug Steps:

Analyze query execution plan
Check data format and compression
Review partitioning strategy
Monitor performance metrics

Schema Evolution

Problem: Schema changes breaking queries

Solutions:

Use flexible formats (JSON, Parquet)
Implement schema versioning
Use appropriate data types
Plan for schema evolution

Debug Steps:

Check schema compatibility
Review data type changes
Test with sample data
Implement gradual migration
