GBQ compaction scheduler

Learn how the GBQ compaction scheduler automatically evaluates and runs data compaction jobs to reduce duplicate records in BigQuery.

The GBQ compaction scheduler is part of the Reltio Data Pipeline for Google BigQuery (GBQ). It automatically checks each tenant to determine whether a compaction job should run to reduce duplicate records in raw GBQ tables. Compaction helps control storage costs and improve query performance by retaining only the most recent version of each record.

Unlike manual compaction, the scheduler operates on a defined schedule and uses tenant-specific logic to decide whether action is needed. It evaluates each tenant independently and ensures compaction runs only when appropriate, based on both data volume and timing.

How the scheduler works

The scheduler runs in the background at regular intervals defined by an environment-level cron expression. This schedule is configured using the datapipeline.gbq.compaction.cron.expression property. For example:

datapipeline.gbq.compaction.cron.expression=0 1 * * * *

The example above runs the scheduler at 1 minute past every hour (that is, at HH:01:00). The six cron fields are, in order: second, minute, hour, day-of-month, month, day-of-week.

For a tenant to qualify:

  • The time since the last compaction must exceed the configured hoursToCompaction value.
  • The calculated percentage of duplicate records must exceed the compactionThreshold.
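The two conditions above can be sketched as a simple check. This is an illustrative sketch only, not Reltio's internal implementation; all names are hypothetical:

```python
from datetime import datetime, timedelta

def should_compact(last_compaction: datetime,
                   now: datetime,
                   hours_to_compaction: int,
                   duplicate_ratio: float,
                   compaction_threshold: float) -> bool:
    """A tenant qualifies only when BOTH conditions hold."""
    # Condition 1: enough time has passed since the last compaction
    enough_time = now - last_compaction > timedelta(hours=hours_to_compaction)
    # Condition 2: the duplicate ratio exceeds the configured threshold
    enough_duplicates = duplicate_ratio > compaction_threshold
    return enough_time and enough_duplicates

# Example: last compaction 8 days ago, 70% duplicates, 7-day minimum interval
qualifies = should_compact(
    last_compaction=datetime(2024, 1, 1),
    now=datetime(2024, 1, 9),
    hours_to_compaction=168,   # 7 days
    duplicate_ratio=0.7,
    compaction_threshold=0.5,
)
print(qualifies)  # True
```

If either condition fails, for example when the tenant was compacted yesterday, the scheduler skips that tenant and re-evaluates it on the next run.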

The scheduler runs on a single node in multi-node environments to prevent duplicate execution. To enable or disable it, configure the com.reltio.datapipeline.gbq.compaction.enabled property in the Data Pipeline Hub (DPH) configuration.

For more information about configuring the compaction scheduler, see Configure the GBQ compaction scheduler.

Duplicate calculation and example

The scheduler determines duplication using this formula:


(total records − unique records) ÷ total records

For example, if a raw table has 100 total records and 30 of them are unique, the duplicate ratio is (100 − 30) ÷ 100 = 0.7, or 70%. If compactionThreshold is 0.5, the table qualifies for compaction. This evaluation is applied to all records in the table, not just those recently modified.
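The calculation above can be expressed as a small helper. This is an illustrative sketch of the formula, not Reltio's code; the zero-record guard is an assumption:

```python
def duplicate_ratio(total_records: int, unique_records: int) -> float:
    """Duplicate ratio as described above: (total − unique) ÷ total."""
    if total_records == 0:
        return 0.0  # guard against an empty table (assumption, not documented behavior)
    return (total_records - unique_records) / total_records

ratio = duplicate_ratio(100, 30)
print(ratio)        # 0.7
print(ratio > 0.5)  # True: exceeds a compactionThreshold of 0.5, so the table qualifies
```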

Configuration options

You can control how the scheduler behaves using a combination of tenant and environment properties:

compactionThreshold
The minimum duplicate ratio, expressed as a decimal between 0 and 1, required to trigger compaction. Default: 0.8.
hoursToCompaction
The minimum number of hours between scheduled compactions for the tenant. Default: 168 hours (7 days).
com.reltio.datapipeline.gbq.compaction.enabled
Optional. Set to false to disable the scheduler for specific tenants.
datapipeline.gbq.compaction.cron.expression
An environment property that defines when the scheduler runs across the environment. Uses 6-field cron syntax (the first field is seconds).

To apply these settings in your tenant or environment configuration, see Configure the GBQ compaction scheduler.

Manual compaction

You can trigger compaction manually using the following API endpoint:


POST {{hub-url}}/api/tenants/{{tenantID}}/adapters/bigquery/actions/compact_tables

This API does not consider compactionThreshold or hoursToCompaction. It runs immediately for the tenant. Use it for validation, testing, or one-time intervention.
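As a sketch, the request can be assembled with Python's standard library. The Bearer authorization header and the example hub URL and tenant ID are assumptions; substitute your environment's actual authentication and values, and uncomment the final line to send the request:

```python
import urllib.request

def build_compact_request(hub_url: str, tenant_id: str,
                          token: str) -> urllib.request.Request:
    """Build the POST request for the compact_tables action."""
    url = f"{hub_url}/api/tenants/{tenant_id}/adapters/bigquery/actions/compact_tables"
    # The Authorization scheme below is an assumption; use your
    # environment's actual authentication mechanism.
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_compact_request("https://hub.example.com", "myTenant", "<access-token>")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # sends the request; requires valid credentials
```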

Compaction log table

Every compaction run is recorded in a compaction_log table within the same GBQ dataset as your raw tables. This table tracks which tables were compacted, when, and with what outcome.

Fields include:

  • compactionId — unique ID per run
  • timestamp — when the log entry was created
  • status — SUCCESS or ERROR
  • message — status message
  • table — raw table name
  • duplicate_before — number of duplicates before compaction
  • expected_duplicates_after — number of duplicates remaining

Troubleshooting scheduler behavior

If compaction doesn't seem to be running, it's often due to configuration rather than failure. First, verify that com.reltio.datapipeline.gbq.compaction.enabled is not set to false. Then check the tenant's hoursToCompaction and compactionThreshold values. If the tenant was recently compacted or the table doesn't exceed the threshold, the job will be skipped.

Also make sure the environment-level cron expression is set correctly. The scheduler will not run at all without it. You can use the compaction_log table to confirm when the last run occurred and what the result was.

If needed, use the compaction API to trigger compaction manually while you review configuration.