Skip to content

Athena Data Formats

This guide covers data formats supported by AWS Athena, including format selection, optimization strategies, and best practices for different use cases.

Athena supports multiple data formats, each with different characteristics for storage efficiency, query performance, and use case suitability. Choosing the right format is crucial for optimal performance and cost management.

  • Parquet: Columnar format, optimal for analytical workloads
  • ORC: Optimized Row Columnar format, good alternative to Parquet
  • JSON: Semi-structured format, flexible schema
  • CSV: Comma-separated values, simple but inefficient
  • TSV: Tab-separated values, similar to CSV
  • Avro: Row-based format with schema evolution
  • Ion: Amazon’s binary format for semi-structured data

Parquet is a columnar storage format that provides excellent compression and query performance for analytical workloads.

  • High Compression: Up to 80% reduction in storage size
  • Columnar Storage: Optimized for analytical queries
  • Schema Evolution: Built-in support for schema changes
  • Predicate Pushdown: Efficient filtering at storage level
  • Athena Optimization: Native support and optimization
-- Create Parquet table
CREATE TABLE analytics.user_events_parquet (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties map<string, string>
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events_parquet/'
-- Configure Parquet settings
CREATE TABLE analytics.user_events_parquet (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties map<string, string>
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/analytics/user_events_parquet/'
TBLPROPERTIES (
'parquet.compression'='SNAPPY',
'parquet.enable.dictionary'='true',
'parquet.page.size'='1048576'
)
-- Query Parquet data efficiently
SELECT
user_id,
event_type,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count
FROM analytics.user_events_parquet
WHERE year = '2024'
AND month = '01'
AND day = '15'
AND event_type IN ('click', 'view', 'purchase')
GROUP BY 1, 2
ORDER BY event_count DESC
-- Good: Use Snappy compression for balance
TBLPROPERTIES ('parquet.compression'='SNAPPY')
-- For maximum compression (slower writes)
TBLPROPERTIES ('parquet.compression'='GZIP')
-- For fastest writes (less compression)
TBLPROPERTIES ('parquet.compression'='UNCOMPRESSED')
-- Enable dictionary encoding for repeated values
TBLPROPERTIES ('parquet.enable.dictionary'='true')
-- Set appropriate row group size
TBLPROPERTIES ('parquet.block.size'='134217728') -- 128MB

ORC (Optimized Row Columnar) is another columnar format that provides good compression and performance, with some advantages over Parquet in certain scenarios.

  • Good Compression: Efficient storage compression
  • Columnar Storage: Optimized for analytical queries
  • ACID Support: Transaction support for data consistency
  • Hive Compatibility: Native Hive format support
  • Bloom Filters: Built-in bloom filter support
-- Create ORC table
CREATE TABLE analytics.user_events_orc (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties map<string, string>
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS ORC
LOCATION 's3://your-bucket/analytics/user_events_orc/'
-- Configure ORC settings
CREATE TABLE analytics.user_events_orc (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties map<string, string>
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS ORC
LOCATION 's3://your-bucket/analytics/user_events_orc/'
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'orc.stripe.size'='67108864',
'orc.row.index.stride'='10000'
)
-- Query ORC data efficiently
SELECT
user_id,
event_type,
COUNT(*) as event_count,
AVG(duration) as avg_duration
FROM analytics.user_events_orc
WHERE year = '2024'
AND month = '01'
AND day = '15'
AND event_type = 'click'
GROUP BY 1, 2
-- Good: Use Snappy compression
TBLPROPERTIES ('orc.compress'='SNAPPY')
-- For maximum compression
TBLPROPERTIES ('orc.compress'='ZLIB')
-- For fastest writes
TBLPROPERTIES ('orc.compress'='NONE')
-- Set appropriate stripe size
TBLPROPERTIES ('orc.stripe.size'='67108864') -- 64MB
-- Set row index stride for better performance
TBLPROPERTIES ('orc.row.index.stride'='10000')

JSON format provides flexibility for semi-structured data but with lower performance compared to columnar formats.

  • Schema Flexibility: Easy to work with semi-structured data
  • Human Readable: Easy to inspect and debug
  • Rapid Prototyping: Quick to set up and test
  • Schema Evolution: Natural support for changing schemas
  • Simple Integration: Easy to work with application data
-- Create JSON table
CREATE TABLE analytics.user_events_json (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties string
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS JSON
LOCATION 's3://your-bucket/analytics/user_events_json/'
-- Query JSON data
SELECT
user_id,
event_type,
event_timestamp,
JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json
WHERE year = '2024'
AND month = '01'
AND day = '15'
AND event_type = 'click'
-- Extract scalar values from JSON
SELECT
user_id,
JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
JSON_EXTRACT_SCALAR(properties, '$.device') as device,
JSON_EXTRACT_SCALAR(properties, '$.location.country') as country
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'
-- Extract arrays from JSON
SELECT
user_id,
JSON_EXTRACT(properties, '$.tags') as tags,
JSON_EXTRACT_SCALAR_ARRAY(properties, '$.categories') as categories
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'
-- Complex JSON queries
SELECT
user_id,
event_type,
JSON_EXTRACT_SCALAR(properties, '$.user_agent') as user_agent,
JSON_EXTRACT_SCALAR(properties, '$.session.duration') as session_duration,
JSON_EXTRACT_SCALAR(properties, '$.location.latitude') as latitude,
JSON_EXTRACT_SCALAR(properties, '$.location.longitude') as longitude
FROM analytics.user_events_json
WHERE year = '2024'
AND month = '01'
AND JSON_EXTRACT_SCALAR(properties, '$.device') = 'mobile'
-- Good: Use string for JSON data
CREATE TABLE analytics.user_events_json (
user_id string,
properties string -- Store as string
)
-- Avoid: Using complex types for JSON
CREATE TABLE analytics.user_events_json (
user_id string,
properties map<string, string> -- Less flexible
)
-- Good: Flatten JSON when possible
SELECT
user_id,
JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json
-- Avoid: Deeply nested JSON queries
SELECT
user_id,
JSON_EXTRACT_SCALAR(properties, '$.user.session.device.browser.name') as browser
FROM analytics.user_events_json
-- Good: Extract once and reuse
WITH extracted_properties AS (
SELECT
user_id,
JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'
)
SELECT
browser,
device,
COUNT(*) as event_count
FROM extracted_properties
GROUP BY 1, 2

CSV is a simple text format that’s easy to work with but generally inefficient for large-scale analytical workloads.

  • Simplicity: Easy to understand and work with
  • Compatibility: Works with many tools and systems
  • Human Readable: Easy to inspect and debug
  • Quick Setup: Fast to implement and test
  • Poor Compression: Minimal compression benefits
  • No Schema: No built-in schema support
  • Performance: Slower query performance
  • Cost: Higher storage and query costs
-- Create CSV table
CREATE TABLE analytics.user_events_csv (
user_id string,
event_type string,
event_timestamp timestamp,
session_id string,
properties string
)
PARTITIONED BY (
year string,
month string,
day string
)
STORED AS TEXTFILE
LOCATION 's3://your-bucket/analytics/user_events_csv/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
-- Query CSV data
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM analytics.user_events_csv
WHERE year = '2024'
AND month = '01'
AND day = '15'
GROUP BY 1, 2
-- Good: Use CSV for small, simple datasets
CREATE TABLE staging.simple_data_csv (
id string,
name string,
value double
)
STORED AS TEXTFILE
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
-- Avoid: Don't use CSV for large datasets
-- Use Parquet or ORC instead
CREATE TABLE analytics.large_dataset_parquet (
user_id string,
event_data string
)
STORED AS PARQUET
  • Analytical Workloads: Complex analytical queries
  • Large Datasets: Datasets > 1GB
  • Cost Optimization: Minimizing storage and query costs
  • Performance Critical: Query performance is important
  • Stable Schema: Schema doesn’t change frequently
  • Hive Compatibility: Need Hive ecosystem compatibility
  • ACID Transactions: Require transaction support
  • Good Compression: Need efficient compression
  • Columnar Benefits: Want columnar storage benefits
  • Semi-structured Data: Working with flexible schemas
  • Schema Evolution: Schema changes frequently
  • Prototyping: Rapid development and testing
  • Small to Medium Datasets: Datasets < 1GB
  • Human Readability: Need to inspect data easily
  • Simple Data: Basic, flat data structures
  • Small Datasets: Datasets < 100MB
  • Compatibility: Need broad tool compatibility
  • Quick Setup: Rapid prototyping needs
  • Human Inspection: Need to easily view data
  • Large Datasets: Datasets > 100MB
  • Cost Optimization: Cost is a concern
  • Performance Critical: Query performance matters
  • Analytical Workloads: Complex analytical queries
  • Production Use: Production workloads
FormatCompressionStorage EfficiencyQuery Performance
ParquetExcellentHighExcellent
ORCGoodHighGood
JSONPoorLowPoor
CSVNoneVery LowVery Poor
-- Performance comparison query
SELECT
format_type,
AVG(execution_time_ms) as avg_execution_time,
AVG(data_scanned_bytes) as avg_data_scanned,
AVG(cost_usd) as avg_cost
FROM query_performance_logs
WHERE query_date >= '2024-01-01'
GROUP BY 1
ORDER BY avg_execution_time
-- Cost comparison query
SELECT
format_type,
SUM(storage_cost_usd) as total_storage_cost,
SUM(query_cost_usd) as total_query_cost,
SUM(storage_cost_usd + query_cost_usd) as total_cost
FROM cost_analysis
WHERE analysis_date >= '2024-01-01'
GROUP BY 1
ORDER BY total_cost
-- Convert CSV to Parquet
CREATE TABLE analytics.user_events_parquet
WITH (
format = 'PARQUET',
external_location = 's3://your-bucket/analytics/user_events_parquet/'
)
AS
SELECT * FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01'
-- Convert JSON to Parquet
CREATE TABLE analytics.user_events_parquet
WITH (
format = 'PARQUET',
external_location = 's3://your-bucket/analytics/user_events_parquet/'
)
AS
SELECT
user_id,
event_type,
event_timestamp,
session_id,
JSON_EXTRACT_SCALAR(properties, '$.browser') as browser,
JSON_EXTRACT_SCALAR(properties, '$.device') as device
FROM analytics.user_events_json
WHERE year = '2024' AND month = '01'
-- Migrate data gradually by partition
INSERT INTO analytics.user_events_parquet
SELECT * FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01'
-- Validate migration
SELECT
COUNT(*) as csv_count
FROM analytics.user_events_csv
WHERE year = '2024' AND month = '01';
SELECT
COUNT(*) as parquet_count
FROM analytics.user_events_parquet
WHERE year = '2024' AND month = '01';
-- Test query performance
EXPLAIN
SELECT
user_id,
COUNT(*) as event_count
FROM analytics.user_events_parquet
WHERE year = '2024' AND month = '01'
GROUP BY 1

Problem: Format not supported or recognized

Solutions:

  • Check format specification
  • Verify file structure
  • Use appropriate SERDE

Debug Steps:

  1. Check file format in S3
  2. Verify table definition
  3. Test with sample data
  4. Check Athena documentation

Problem: Poor query performance

Solutions:

  • Use appropriate format
  • Optimize compression
  • Implement partitioning
  • Review query patterns

Debug Steps:

  1. Analyze query execution plan
  2. Check data format and compression
  3. Review partitioning strategy
  4. Monitor performance metrics

Problem: Schema changes breaking queries

Solutions:

  • Use flexible formats (JSON, Parquet)
  • Implement schema versioning
  • Use appropriate data types
  • Plan for schema evolution

Debug Steps:

  1. Check schema compatibility
  2. Review data type changes
  3. Test with sample data
  4. Implement gradual migration