How to clean up your BigQuery data warehouse?

BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyze all their data very quickly. It stores data in a columnar format, which makes it efficient for querying large datasets. "This ease of use leads to a rapid increase in the use of BigQuery, especially in terms of storage. Don't panic, Biq Blue explains how to deal with it.

Get an overview of all your storage

The first step is to have a clear view of your storage by project and by dataset. You can use INFORMATION_SCHEMA.TABLE_STORAGE view, to calculate the short-term data and long-term data.

    table_schema AS dataset,
    SUM(total_logical_bytes) AS total_logical_bytes,
    SUM(active_logical_bytes) AS active_logical_bytes,
    SUM(long_term_logical_bytes) AS long_term_logical_bytes,
    SUM(total_physical_bytes) AS total_physical_bytes,
    SUM(active_physical_bytes) AS active_physical_bytes,
    SUM(long_term_physical_bytes) AS long_term_physical_bytes,
    SUM(time_travel_physical_bytes) AS time_travel_physical_bytes,

Data that is frequently used is considered active data. It automatically changes to long-term data after 3 months. Since July 2023, BigQuery automatically compresses stored data and offers two pricing models for calculating storage costs at the dataset level: logical or physical (compressed). Depending on the nature of your data, one or the other will be the most appropriate. The sum of all these costs belong the sku is your monthly storage bill.

Now, if you want to save time, use Biq Blue, which will give you an overview of your entire organization in a single click.

Check the GCP official documentation too.

Delete unused storage

Now it is time to ask yourself if you need to store all this data for so long. Any data generates a cost, so it is preferable to keep only the data that you use. This is a good reminder that data storage costs money. It is important to regularly review your data storage needs to ensure that you are only storing the data that you need.

Delete unused storage

Set up the right storage billing model for each dataset

Once you have no more unused data stored, you can choose the storage billing model with no doubt, for each dataset.

Set up storage billing model

Configure the time travel window

Time travel is a very useful feature to recover modified or deleted data, up to one week. When you choose to use the physical storage billing model on a dataset, the storage used for time travel is invoiced too, and can be quite expensive. You can configure the time travel window which will impact your storage cost.


By following these tips, you will have a Bigquery data warehouse that only stores the data it needs, much less expensive, and much more resilient.