Athena Data Formats
This guide covers data formats supported by AWS Athena, including format selection, optimization strategies, and best practices for different use cases.
Overview
Section titled “Overview”Athena supports multiple data formats, each with different characteristics for storage efficiency, query performance, and use case suitability. Choosing the right format is crucial for optimal performance and cost management.
Supported Formats
Section titled “Supported Formats”- Parquet: Columnar format, optimal for analytical workloads
- ORC: Optimized Row Columnar format, good alternative to Parquet
- JSON: Semi-structured format, flexible schema
- CSV: Comma-separated values, simple but inefficient
- TSV: Tab-separated values, similar to CSV
- Avro: Row-based format with schema evolution
- Ion: Amazon’s binary format for semi-structured data
Parquet Format
Section titled “Parquet Format”Overview
Section titled “Overview”Parquet is a columnar storage format that provides excellent compression and query performance for analytical workloads.
Benefits
Section titled “Benefits”- High Compression: Up to 80% reduction in storage size
- Columnar Storage: Optimized for analytical queries
- Schema Evolution: Built-in support for schema changes
- Predicate Pushdown: Efficient filtering at storage level
- Athena Optimization: Native support and optimization
Creating Parquet Tables
Section titled “Creating Parquet Tables”-- Create Parquet tableCREATE TABLE analytics.user_events_parquet ( user_id string, event_type string, event_timestamp timestamp, session_id string, properties map<string, string>)PARTITIONED BY ( year string, month string, day string)STORED AS PARQUETLOCATION 's3://your-bucket/analytics/user_events_parquet/'Parquet Configuration
Section titled “Parquet Configuration”-- Configure Parquet settingsCREATE TABLE analytics.user_events_parquet ( user_id string, event_type string, event_timestamp timestamp, session_id string, properties map<string, string>)PARTITIONED BY ( year string, month string, day string)STORED AS PARQUETLOCATION 's3://your-bucket/analytics/user_events_parquet/'TBLPROPERTIES ( 'parquet.compression'='SNAPPY', 'parquet.enable.dictionary'='true', 'parquet.page.size'='1048576')Querying Parquet Data
Section titled “Querying Parquet Data”-- Query Parquet data efficientlySELECT user_id, event_type, COUNT(*) as event_count, COUNT(DISTINCT session_id) as session_countFROM analytics.user_events_parquetWHERE year = '2024' AND month = '01' AND day = '15' AND event_type IN ('click', 'view', 'purchase')GROUP BY 1, 2ORDER BY event_count DESCBest Practices for Parquet
Section titled “Best Practices for Parquet”1. Use Appropriate Compression
Section titled “1. Use Appropriate Compression”-- Good: Use Snappy compression for balanceTBLPROPERTIES ('parquet.compression'='SNAPPY')
-- For maximum compression (slower writes)TBLPROPERTIES ('parquet.compression'='GZIP')
-- For fastest writes (less compression)TBLPROPERTIES ('parquet.compression'='UNCOMPRESSED')2. Enable Dictionary Encoding
Section titled “2. Enable Dictionary Encoding”-- Enable dictionary encoding for repeated valuesTBLPROPERTIES ('parquet.enable.dictionary'='true')3. Optimize Row Group Size
Section titled “3. Optimize Row Group Size”-- Set appropriate row group sizeTBLPROPERTIES ('parquet.block.size'='134217728') -- 128MBORC Format
Section titled “ORC Format”Overview
Section titled “Overview”ORC (Optimized Row Columnar) is another columnar format that provides good compression and performance, with some advantages over Parquet in certain scenarios.
Benefits
Section titled “Benefits”- Good Compression: Efficient storage compression
- Columnar Storage: Optimized for analytical queries
- ACID Support: Transaction support for data consistency
- Hive Compatibility: Native Hive format support
- Bloom Filters: Built-in bloom filter support
Creating ORC Tables
Section titled “Creating ORC Tables”-- Create ORC tableCREATE TABLE analytics.user_events_orc ( user_id string, event_type string, event_timestamp timestamp, session_id string, properties map<string, string>)PARTITIONED BY ( year string, month string, day string)STORED AS ORCLOCATION 's3://your-bucket/analytics/user_events_orc/'ORC Configuration
Section titled “ORC Configuration”-- Configure ORC settingsCREATE TABLE analytics.user_events_orc ( user_id string, event_type string, event_timestamp timestamp, session_id string, properties map<string, string>)PARTITIONED BY ( year string, month string, day string)STORED AS ORCLOCATION 's3://your-bucket/analytics/user_events_orc/'TBLPROPERTIES ( 'orc.compress'='SNAPPY', 'orc.stripe.size'='67108864', 'orc.row.index.stride'='10000')Querying ORC Data
Section titled “Querying ORC Data”-- Query ORC data efficientlySELECT user_id, event_type, COUNT(*) as event_count, AVG(duration) as avg_durationFROM analytics.user_events_orcWHERE year = '2024' AND month = '01' AND day = '15' AND event_type = 'click'GROUP BY 1, 2Best Practices for ORC
Section titled “Best Practices for ORC”1. Use Appropriate Compression
Section titled “1. Use Appropriate Compression”-- Good: Use Snappy compressionTBLPROPERTIES ('orc.compress'='SNAPPY')
-- For maximum compressionTBLPROPERTIES ('orc.compress'='ZLIB')
-- For fastest writesTBLPROPERTIES ('orc.compress'='NONE')2. Optimize Stripe Size
Section titled “2. Optimize Stripe Size”-- Set appropriate stripe sizeTBLPROPERTIES ('orc.stripe.size'='67108864') -- 64MB3. Configure Row Index Stride
Section titled “3. Configure Row Index Stride”-- Set row index stride for better performanceTBLPROPERTIES ('orc.row.index.stride'='10000')JSON Format
Section titled “JSON Format”Overview
Section titled “Overview”JSON format provides flexibility for semi-structured data but with lower performance compared to columnar formats.
Benefits
Section titled “Benefits”- Schema Flexibility: Easy to work with semi-structured data
- Human Readable: Easy to inspect and debug
- Rapid Prototyping: Quick to set up and test
- Schema Evolution: Natural support for changing schemas
- Simple Integration: Easy to work with application data
Creating JSON Tables
Section titled “Creating JSON Tables”-- Create JSON tableCREATE TABLE analytics.user_events_json ( user_id string, event_type string, event_timestamp timestamp, session_id string, properties string)PARTITIONED BY ( year string, month string, day string)STORED AS JSONLOCATION 's3://your-bucket/analytics/user_events_json/'Querying JSON Data
Section titled “Querying JSON Data”-- Query JSON dataSELECT user_id, event_type, event_timestamp, JSON_EXTRACT_SCALAR(properties, '$.browser') as browser, JSON_EXTRACT_SCALAR(properties, '$.device') as deviceFROM analytics.user_events_jsonWHERE year = '2024' AND month = '01' AND day = '15' AND event_type = 'click'JSON Functions
Section titled “JSON Functions”Extract Scalar Values
Section titled “Extract Scalar Values”-- Extract scalar values from JSONSELECT user_id, JSON_EXTRACT_SCALAR(properties, '$.browser') as browser, JSON_EXTRACT_SCALAR(properties, '$.device') as device, JSON_EXTRACT_SCALAR(properties, '$.location.country') as countryFROM analytics.user_events_jsonWHERE year = '2024' AND month = '01'Extract Arrays
Section titled “Extract Arrays”-- Extract arrays from JSONSELECT user_id, JSON_EXTRACT(properties, '$.tags') as tags, JSON_EXTRACT_SCALAR_ARRAY(properties, '$.categories') as categoriesFROM analytics.user_events_jsonWHERE year = '2024' AND month = '01'Complex JSON Queries
Section titled “Complex JSON Queries”-- Complex JSON queriesSELECT user_id, event_type, JSON_EXTRACT_SCALAR(properties, '$.user_agent') as user_agent, JSON_EXTRACT_SCALAR(properties, '$.session.duration') as session_duration, JSON_EXTRACT_SCALAR(properties, '$.location.latitude') as latitude, JSON_EXTRACT_SCALAR(properties, '$.location.longitude') as longitudeFROM analytics.user_events_jsonWHERE year = '2024' AND month = '01' AND JSON_EXTRACT_SCALAR(properties, '$.device') = 'mobile'Best Practices for JSON
Section titled “Best Practices for JSON”1. Use Appropriate Data Types
Section titled “1. Use Appropriate Data Types”-- Good: Use string for JSON dataCREATE TABLE analytics.user_events_json ( user_id string, properties string -- Store as string)
-- Avoid: Using complex types for JSONCREATE TABLE analytics.user_events_json ( user_id string, properties map<string, string> -- Less flexible)2. Optimize JSON Structure
Section titled “2. Optimize JSON Structure”-- Good: Flatten JSON when possibleSELECT user_id, JSON_EXTRACT_SCALAR(properties, '$.browser') as browser, JSON_EXTRACT_SCALAR(properties, '$.device') as deviceFROM analytics.user_events_json
-- Avoid: Deeply nested JSON queriesSELECT user_id, JSON_EXTRACT_SCALAR(properties, '$.user.session.device.browser.name') as browserFROM analytics.user_events_json3. Use JSON Functions Efficiently
Section titled “3. Use JSON Functions Efficiently”-- Good: Extract once and reuseWITH extracted_properties AS ( SELECT user_id, JSON_EXTRACT_SCALAR(properties, '$.browser') as browser, JSON_EXTRACT_SCALAR(properties, '$.device') as device FROM analytics.user_events_json WHERE year = '2024' AND month = '01')SELECT browser, device, COUNT(*) as event_countFROM extracted_propertiesGROUP BY 1, 2CSV Format
Section titled “CSV Format”Overview
Section titled “Overview”CSV is a simple text format that’s easy to work with but generally inefficient for large-scale analytical workloads.
Benefits
Section titled “Benefits”- Simplicity: Easy to understand and work with
- Compatibility: Works with many tools and systems
- Human Readable: Easy to inspect and debug
- Quick Setup: Fast to implement and test
Limitations
Section titled “Limitations”- Poor Compression: Minimal compression benefits
- No Schema: No built-in schema support
- Performance: Slower query performance
- Cost: Higher storage and query costs
Creating CSV Tables
Section titled “Creating CSV Tables”-- Create CSV tableCREATE TABLE analytics.user_events_csv ( user_id string, event_type string, event_timestamp timestamp, session_id string, properties string)PARTITIONED BY ( year string, month string, day string)STORED AS TEXTFILELOCATION 's3://your-bucket/analytics/user_events_csv/'ROW FORMAT DELIMITEDFIELDS TERMINATED BY ','ESCAPED BY '\\'LINES TERMINATED BY '\n'Querying CSV Data
Section titled “Querying CSV Data”-- Query CSV dataSELECT user_id, event_type, COUNT(*) as event_countFROM analytics.user_events_csvWHERE year = '2024' AND month = '01' AND day = '15'GROUP BY 1, 2Best Practices for CSV
Section titled “Best Practices for CSV”1. Use Only for Small Datasets
Section titled “1. Use Only for Small Datasets”-- Good: Use CSV for small, simple datasetsCREATE TABLE staging.simple_data_csv ( id string, name string, value double)STORED AS TEXTFILEROW FORMAT DELIMITEDFIELDS TERMINATED BY ','2. Avoid for Large Datasets
Section titled “2. Avoid for Large Datasets”-- Avoid: Don't use CSV for large datasets-- Use Parquet or ORC insteadCREATE TABLE analytics.large_dataset_parquet ( user_id string, event_data string)STORED AS PARQUETFormat Selection Guide
Section titled “Format Selection Guide”Choose Parquet When:
Section titled “Choose Parquet When:”- Analytical Workloads: Complex analytical queries
- Large Datasets: Datasets > 1GB
- Cost Optimization: Minimizing storage and query costs
- Performance Critical: Query performance is important
- Stable Schema: Schema doesn’t change frequently
Choose ORC When:
Section titled “Choose ORC When:”- Hive Compatibility: Need Hive ecosystem compatibility
- ACID Transactions: Require transaction support
- Good Compression: Need efficient compression
- Columnar Benefits: Want columnar storage benefits
Choose JSON When:
Section titled “Choose JSON When:”- Semi-structured Data: Working with flexible schemas
- Schema Evolution: Schema changes frequently
- Prototyping: Rapid development and testing
- Small to Medium Datasets: Datasets < 1GB
- Human Readability: Need to inspect data easily
Choose CSV When:
Section titled “Choose CSV When:”- Simple Data: Basic, flat data structures
- Small Datasets: Datasets < 100MB
- Compatibility: Need broad tool compatibility
- Quick Setup: Rapid prototyping needs
- Human Inspection: Need to easily view data
Avoid CSV When:
Section titled “Avoid CSV When:”- Large Datasets: Datasets > 100MB
- Cost Optimization: Cost is a concern
- Performance Critical: Query performance matters
- Analytical Workloads: Complex analytical queries
- Production Use: Production workloads
Performance Comparison
Section titled “Performance Comparison”Storage Efficiency
Section titled “Storage Efficiency”| Format | Compression | Storage Efficiency | Query Performance |
|---|---|---|---|
| Parquet | Excellent | High | Excellent |
| ORC | Good | High | Good |
| JSON | Poor | Low | Poor |
| CSV | None | Very Low | Very Poor |
Query Performance
Section titled “Query Performance”-- Performance comparison querySELECT format_type, AVG(execution_time_ms) as avg_execution_time, AVG(data_scanned_bytes) as avg_data_scanned, AVG(cost_usd) as avg_costFROM query_performance_logsWHERE query_date >= '2024-01-01'GROUP BY 1ORDER BY avg_execution_timeCost Analysis
Section titled “Cost Analysis”-- Cost comparison querySELECT format_type, SUM(storage_cost_usd) as total_storage_cost, SUM(query_cost_usd) as total_query_cost, SUM(storage_cost_usd + query_cost_usd) as total_costFROM cost_analysisWHERE analysis_date >= '2024-01-01'GROUP BY 1ORDER BY total_costMigration Strategies
Section titled “Migration Strategies”Converting Between Formats
Section titled “Converting Between Formats”CSV to Parquet
Section titled “CSV to Parquet”-- Convert CSV to ParquetCREATE TABLE analytics.user_events_parquetWITH ( format = 'PARQUET', external_location = 's3://your-bucket/analytics/user_events_parquet/')ASSELECT * FROM analytics.user_events_csvWHERE year = '2024' AND month = '01'JSON to Parquet
Section titled “JSON to Parquet”-- Convert JSON to ParquetCREATE TABLE analytics.user_events_parquetWITH ( format = 'PARQUET', external_location = 's3://your-bucket/analytics/user_events_parquet/')ASSELECT user_id, event_type, event_timestamp, session_id, JSON_EXTRACT_SCALAR(properties, '$.browser') as browser, JSON_EXTRACT_SCALAR(properties, '$.device') as deviceFROM analytics.user_events_jsonWHERE year = '2024' AND month = '01'Migration Best Practices
Section titled “Migration Best Practices”1. Gradual Migration
Section titled “1. Gradual Migration”-- Migrate data gradually by partitionINSERT INTO analytics.user_events_parquetSELECT * FROM analytics.user_events_csvWHERE year = '2024' AND month = '01'2. Validate Data
Section titled “2. Validate Data”-- Validate migrationSELECT COUNT(*) as csv_countFROM analytics.user_events_csvWHERE year = '2024' AND month = '01';
SELECT COUNT(*) as parquet_countFROM analytics.user_events_parquetWHERE year = '2024' AND month = '01';3. Performance Testing
Section titled “3. Performance Testing”-- Test query performanceEXPLAINSELECT user_id, COUNT(*) as event_countFROM analytics.user_events_parquetWHERE year = '2024' AND month = '01'GROUP BY 1Troubleshooting
Section titled “Troubleshooting”Common Issues
Section titled “Common Issues”Format Compatibility
Section titled “Format Compatibility”Problem: Format not supported or recognized
Solutions:
- Check format specification
- Verify file structure
- Use appropriate SERDE
Debug Steps:
- Check file format in S3
- Verify table definition
- Test with sample data
- Check Athena documentation
Performance Issues
Section titled “Performance Issues”Problem: Poor query performance
Solutions:
- Use appropriate format
- Optimize compression
- Implement partitioning
- Review query patterns
Debug Steps:
- Analyze query execution plan
- Check data format and compression
- Review partitioning strategy
- Monitor performance metrics
Schema Evolution
Section titled “Schema Evolution”Problem: Schema changes breaking queries
Solutions:
- Use flexible formats (JSON, Parquet)
- Implement schema versioning
- Use appropriate data types
- Plan for schema evolution
Debug Steps:
- Check schema compatibility
- Review data type changes
- Test with sample data
- Implement gradual migration