Skip to content

BigQuery Performance Optimization

This guide covers performance optimization techniques for BigQuery SQL tasks, including query optimization, partitioning strategies, clustering, and cost management.

BigQuery performance optimization focuses on reducing query execution time, minimizing data processing costs, and improving overall pipeline efficiency. The Datablast platform provides built-in optimizations and best practices for BigQuery development.

  • Query Design: Efficient SQL patterns and structures
  • Partitioning: Data organization for optimal performance
  • Clustering: Data ordering within partitions
  • Cost Management: Reducing BigQuery costs
  • Resource Utilization: Optimizing compute resources

Always filter by partition columns to enable partition pruning.

-- Good: Filter by partition column
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE dt = '{{ ds }}' -- Partition pruning enabled
AND user_id IS NOT NULL
GROUP BY 1, 2
-- Avoid: No partition filter
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE user_id = '12345' -- No partition pruning
GROUP BY 1, 2

Use appropriate JOIN types and conditions for better performance.

-- Good: Efficient JOIN with proper filtering
SELECT
u.user_id,
u.email,
o.order_count,
o.total_revenue
FROM users u
JOIN user_orders o ON u.user_id = o.user_id
WHERE u.dt = '{{ ds }}' -- Filter early
AND o.dt = '{{ ds }}' -- Filter early
AND u.status = 'active'
-- Avoid: Inefficient JOIN without filtering
SELECT
u.user_id,
u.email,
o.order_count,
o.total_revenue
FROM users u
JOIN user_orders o ON u.user_id = o.user_id
WHERE u.status = 'active' -- Filter after JOIN

Optimize window functions for better performance.

-- Good: Efficient window function
SELECT
user_id,
event_timestamp,
event_type,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp) as event_sequence
FROM staging.user_events
WHERE dt = '{{ ds }}'
AND user_id IS NOT NULL
-- Avoid: Inefficient window function
SELECT
user_id,
event_timestamp,
event_type,
ROW_NUMBER() OVER (ORDER BY event_timestamp) as global_sequence
FROM staging.user_events
WHERE dt = '{{ ds }}'

Choose the right data types to minimize storage and improve performance.

-- Good: Appropriate data types
SELECT
CAST(user_id AS STRING) as user_id,
CAST(event_count AS INT64) as event_count,
CAST(duration AS FLOAT64) as duration,
CAST(created_at AS TIMESTAMP) as created_at
FROM staging.events
WHERE dt = '{{ ds }}'
-- Avoid: Unnecessary data type conversions
SELECT
user_id,
event_count,
duration,
created_at
FROM staging.events
WHERE dt = '{{ ds }}'

Avoid unnecessary casting and conversions.

-- Good: Minimal conversions
SELECT
user_id,
event_type,
DATE(event_timestamp) as event_date,
COUNT(*) as event_count
FROM staging.events
WHERE dt = '{{ ds }}'
GROUP BY 1, 2, 3
-- Avoid: Excessive conversions
SELECT
CAST(user_id AS STRING) as user_id,
CAST(event_type AS STRING) as event_type,
CAST(DATE(event_timestamp) AS DATE) as event_date,
CAST(COUNT(*) AS INT64) as event_count
FROM staging.events
WHERE dt = '{{ ds }}'
GROUP BY 1, 2, 3

Break down complex queries into manageable parts.

-- Good: Use CTEs for clarity and performance
WITH base_events AS (
SELECT
user_id,
event_type,
event_timestamp,
session_id
FROM staging.user_events
WHERE dt = '{{ ds }}'
AND user_id IS NOT NULL
),
aggregated_events AS (
SELECT
user_id,
event_type,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count
FROM base_events
GROUP BY 1, 2
)
SELECT
user_id,
event_type,
event_count,
session_count,
CURRENT_TIMESTAMP() as processed_at
FROM aggregated_events
ORDER BY event_count DESC

Use efficient subquery patterns.

-- Good: Efficient subquery
SELECT
u.user_id,
u.email,
COALESCE(o.order_count, 0) as order_count
FROM users u
LEFT JOIN (
SELECT
user_id,
COUNT(*) as order_count
FROM orders
WHERE dt = '{{ ds }}'
GROUP BY 1
) o ON u.user_id = o.user_id
WHERE u.dt = '{{ ds }}'
-- Avoid: Inefficient subquery
SELECT
u.user_id,
u.email,
(SELECT COUNT(*) FROM orders WHERE user_id = u.user_id AND dt = '{{ ds }}') as order_count
FROM users u
WHERE u.dt = '{{ ds }}'

Enable automatic partition pruning for better performance.

-- Good: Partition pruning enabled
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE dt = '{{ ds }}' -- Partition column filter
AND user_id IS NOT NULL
GROUP BY 1, 2
-- Avoid: No partition pruning
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE user_id = '12345' -- No partition filter
GROUP BY 1, 2

Choose the right partitioning strategy for your use case.

Best for time-series data.

materialization:
type: "table"
partition_by: "dt"
-- Optimized for date partitioning
SELECT
DATE(event_timestamp) as dt,
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE DATE(event_timestamp) = '{{ ds }}'
GROUP BY 1, 2, 3

Good for user ID or other integer-based partitioning.

materialization:
type: "table"
partition_by: "user_id"
-- Optimized for integer partitioning
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE user_id BETWEEN 1000 AND 2000
GROUP BY 1, 2

Cluster by frequently queried columns.

materialization:
type: "table"
cluster_by: ["user_id"]
-- Optimized for single column clustering
SELECT
user_id,
event_type,
event_timestamp,
session_id
FROM staging.user_events
WHERE dt = '{{ ds }}'
AND user_id = '12345' -- Use clustered column
ORDER BY user_id

Cluster by multiple columns for complex query patterns.

materialization:
type: "table"
cluster_by: ["user_id", "event_type", "dt"]
-- Optimized for multi-column clustering
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE dt = '{{ ds }}'
AND user_id = '12345' -- Use clustered columns
AND event_type = 'click'
GROUP BY 1, 2
ORDER BY user_id, event_type

Leverage clustering for better performance.

-- Good: Use clustered columns
SELECT * FROM analytics.user_events
WHERE dt = '{{ ds }}'
AND user_id = '12345'
AND event_type = 'click'
-- Avoid: Don't use clustered columns
SELECT * FROM analytics.user_events
WHERE duration > 1000
AND session_id = 'abc123'

Use clustered columns in ORDER BY for optimal performance.

-- Good: Order by clustered columns
SELECT * FROM analytics.user_events
WHERE dt = '{{ ds }}'
ORDER BY user_id, event_type
-- Avoid: Order by non-clustered columns
SELECT * FROM analytics.user_events
WHERE dt = '{{ ds }}'
ORDER BY duration, session_id

Limit data processing for exploration queries.

-- Good: Use LIMIT for exploration
SELECT * FROM analytics.large_table
WHERE dt = '{{ ds }}'
LIMIT 1000
-- Avoid: Process all data for exploration
SELECT * FROM analytics.large_table
WHERE dt = '{{ ds }}'

Use approximate functions when exact counts aren’t required.

-- Good: Use approximate functions
SELECT
APPROX_COUNT_DISTINCT(user_id) as unique_users,
APPROX_QUANTILES(duration, 100)[OFFSET(50)] as median_duration
FROM staging.user_events
WHERE dt = '{{ ds }}'
-- Avoid: Use exact functions when approximation is sufficient
SELECT
COUNT(DISTINCT user_id) as unique_users,
PERCENTILE_CONT(duration, 0.5) OVER() as median_duration
FROM staging.user_events
WHERE dt = '{{ ds }}'

Use efficient data types to reduce storage costs.

-- Good: Efficient data types
SELECT
CAST(user_id AS STRING) as user_id,
CAST(event_count AS INT64) as event_count,
CAST(duration AS FLOAT64) as duration
FROM staging.events
WHERE dt = '{{ ds }}'
-- Avoid: Inefficient data types
SELECT
user_id,
event_count,
duration
FROM staging.events
WHERE dt = '{{ ds }}'

Set up data retention policies to manage storage costs.

-- Good: Implement data retention
SELECT * FROM analytics.user_events
WHERE dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
AND dt = '{{ ds }}'
-- Avoid: Keep all historical data
SELECT * FROM analytics.user_events
WHERE dt = '{{ ds }}'

Leverage BigQuery’s automatic compression.

-- Good: Let BigQuery handle compression
SELECT
user_id,
event_type,
event_timestamp,
session_id
FROM staging.user_events
WHERE dt = '{{ ds }}'
-- Avoid: Unnecessary data expansion
SELECT
user_id,
event_type,
event_timestamp,
session_id,
'processed' as status,
CURRENT_TIMESTAMP() as processed_at
FROM staging.user_events
WHERE dt = '{{ ds }}'

Monitor query execution time to identify performance issues.

-- Monitor query performance
SELECT
job_id,
creation_time,
total_bytes_processed,
total_slot_ms,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
ORDER BY creation_time DESC

Track data processing to optimize costs.

-- Monitor data processing
SELECT
DATE(creation_time) as job_date,
SUM(total_bytes_processed) as total_bytes,
COUNT(*) as query_count,
AVG(total_slot_ms) as avg_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
GROUP BY 1
ORDER BY 1 DESC

Analyze query execution plans to identify optimization opportunities.

-- Analyze query performance
EXPLAIN
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE dt = '{{ ds }}'
GROUP BY 1, 2

Monitor resource utilization for optimization.

-- Monitor resource utilization
SELECT
job_id,
total_slot_ms,
total_bytes_processed,
(total_slot_ms / 1000) / (total_bytes_processed / 1024 / 1024 / 1024) as slot_ms_per_gb
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
ORDER BY slot_ms_per_gb DESC

Always filter by partition columns first.

-- Good: Partition filter first
SELECT * FROM analytics.user_events
WHERE dt = '{{ ds }}' -- Partition filter
AND user_id = '12345'
AND event_type = 'click'

Choose the right JOIN type for your use case.

-- Good: Use INNER JOIN when appropriate
SELECT
u.user_id,
u.email,
o.order_count
FROM users u
INNER JOIN user_orders o ON u.user_id = o.user_id
WHERE u.dt = '{{ ds }}'
AND o.dt = '{{ ds }}'

Use efficient aggregation patterns.

-- Good: Efficient aggregation
SELECT
user_id,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count,
SUM(duration) as total_duration
FROM staging.user_events
WHERE dt = '{{ ds }}'
GROUP BY 1

Regularly monitor and optimize query performance.

-- Monitor slow queries
SELECT
job_id,
creation_time,
total_slot_ms,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
AND total_slot_ms > 1000000 -- Slow queries
ORDER BY total_slot_ms DESC

Set up cost monitoring and controls.

-- Monitor costs
SELECT
DATE(creation_time) as job_date,
SUM(total_bytes_processed) as total_bytes,
SUM(total_bytes_processed) * 5 / 1024 / 1024 / 1024 as estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
GROUP BY 1
ORDER BY 1 DESC

Conduct regular performance optimization reviews.

  • Analyze query performance trends
  • Identify optimization opportunities
  • Implement performance improvements
  • Monitor cost impact

Problem: Queries running slowly

Solutions:

  • Check partition pruning
  • Optimize JOIN operations
  • Use appropriate clustering
  • Review query structure

Debug Steps:

  1. Analyze query execution plan
  2. Check partition filtering
  3. Review JOIN strategies
  4. Monitor resource usage

Problem: Unexpected high query costs

Solutions:

  • Optimize data processing
  • Use appropriate data types
  • Implement cost controls
  • Review query patterns

Debug Steps:

  1. Analyze cost breakdown
  2. Check data processing volume
  3. Review query efficiency
  4. Monitor cost trends

Problem: Resource limitations affecting performance

Solutions:

  • Optimize query patterns
  • Use appropriate materialization
  • Implement resource management
  • Monitor resource utilization

Debug Steps:

  1. Monitor slot usage
  2. Check query concurrency
  3. Review resource allocation
  4. Analyze performance bottlenecks

Use EXPLAIN to analyze query execution plans.

EXPLAIN
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM staging.user_events
WHERE dt = '{{ ds }}'
GROUP BY 1, 2

Monitor query performance metrics.

-- Monitor query performance
SELECT
job_id,
creation_time,
total_bytes_processed,
total_slot_ms,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC

Analyze query costs and trends.

-- Analyze costs
SELECT
DATE(creation_time) as job_date,
SUM(total_bytes_processed) as total_bytes,
COUNT(*) as query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
GROUP BY 1
ORDER BY 1 DESC