How to clean up your BigQuery data warehouse?
BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyze all their data very quickly. It stores data in a columnar format, which makes it efficient for querying large datasets. "This ease of use leads to a rapid increase in the use of BigQuery, especially in terms of storage. Don't panic, Biq Blue explains how to deal with it.
Get an overview of all your storage
The first step is to have a clear view of your storage by project and by dataset. You can use INFORMATION_SCHEMA.TABLE_STORAGE
view, to calculate the short-term data and long-term data.
SELECT table_schema AS dataset, SUM(total_logical_bytes) AS total_logical_bytes, SUM(active_logical_bytes) AS active_logical_bytes, SUM(long_term_logical_bytes) AS long_term_logical_bytes, SUM(total_physical_bytes) AS total_physical_bytes, SUM(active_physical_bytes) AS active_physical_bytes, SUM(long_term_physical_bytes) AS long_term_physical_bytes, SUM(time_travel_physical_bytes) AS time_travel_physical_bytes, FROM `INFORMATION_SCHEMA.TABLE_STORAGE` GROUP BY 1 ORDER BY 1 ASC;
Data that is frequently used is considered active
data. It automatically changes to long-term
data after 3 months. Since July 2023, BigQuery automatically compresses stored data and offers two pricing models for calculating storage costs at the dataset level: logical
or physical
(compressed). Depending on the nature of your data, one or the other will be the most appropriate. The sum of all these costs belong the sku is your monthly storage bill.
Now, if you want to save time, use Biq Blue, which will give you an overview of your entire organization in a single click.
Check the GCP official documentation too.
Delete unused storage
Now it is time to ask yourself if you need to store all this data for so long. Any data generates a cost, so it is preferable to keep only the data that you use. This is a good reminder that data storage costs money. It is important to regularly review your data storage needs to ensure that you are only storing the data that you need.
⭐ Delete unused storage
Set up the right storage billing model for each dataset
Once you have no more unused data stored, you can choose the storage billing model with no doubt, for each dataset.
⭐ Set up storage billing model
Configure the time travel window
Time travel is a very useful feature to recover modified or deleted data, up to one week. When you choose to use the physical storage billing model on a dataset, the storage used for time travel is invoiced too, and can be quite expensive. You can configure the time travel window which will impact your storage cost.
Conclusion
By following these tips, you will have a Bigquery data warehouse that only stores the data it needs, much less expensive, and much more resilient.