The book "Google BigQuery. All About Data Warehousing, Analytics, and Machine Learning"

Hello, Habr readers! Are you intimidated by the need to process petabyte-scale datasets? Meet Google BigQuery, a data warehouse that can consolidate data across the enterprise, facilitate interactive analysis, and enable machine learning. Now you can efficiently store, query, ingest, and explore data in one convenient environment. Valliappa Lakshmanan and Jordan Tigani will teach you how to work in a modern data warehouse using the full power of a scalable, serverless public cloud. With this book, you will:

- Dive into the internals of BigQuery
- Explore the data types, functions, and operators that BigQuery supports
- Optimize queries and implement schemas to improve performance or reduce costs
- Learn about GIS, time travel, DDL/DML, user-defined functions, and SQL scripting
- Solve many machine learning problems
- Learn how to protect data, track performance, and authenticate users



Minimizing network costs



BigQuery is a regional service that is accessible globally. If, for example, you query a dataset stored in the EU region, the query will run on servers located in a data center in the European Union. For you to be able to store the query results in a table, the destination table must belong to a dataset that is also in the EU region. However, the BigQuery REST API can be invoked (i.e., a query can be run) from anywhere in the world, even from machines outside of GCP. When working with other GCP resources, such as Google Cloud Storage or Cloud Pub/Sub, you get the best performance if they are in the same region as the dataset. Therefore, if a query is issued from a Compute Engine instance or Cloud Dataproc cluster, the network overhead is minimal if the instance or cluster is located in the same region as the dataset being queried. When accessing BigQuery from outside of GCP, consider your network topology and try to minimize the number of hops between the client machine and the GCP data center where the dataset resides.
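
If you are unsure where a dataset lives, you can check its region programmatically before deciding where to place Compute Engine or Cloud Dataproc resources. The sketch below uses the Python BigQuery client library; the project ID is a placeholder.

from google.cloud import bigquery

# A minimal sketch: look up a dataset's region so that compute resources
# can be placed in the same region. The project ID is a placeholder.
client = bigquery.Client(project="my-project")
dataset = client.get_dataset("bigquery-public-data.london_bicycles")
print(dataset.location)  # e.g. "EU"; queries against this dataset run there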



Compressed, partial responses



When you access the REST API directly, you can reduce network overhead by accepting compressed, partial responses. To accept compressed responses, specify in the HTTP headers that you are willing to accept a gzip archive and make sure the string "gzip" appears in the User-Agent header, for example:



Accept-Encoding: gzip
User-Agent: programName (gzip)


In this case, all responses are compressed with gzip. By default, BigQuery responses contain all of the fields listed in the documentation. However, if we know which part of the response we are interested in, we can ask BigQuery to send only that part, thereby reducing network overhead. For example, earlier in this chapter we saw how to get complete information about a job using the Jobs API. If you are interested in only a subset of the full response (for example, only the steps of the query plan), you can specify the fields of interest to limit the size of the response:



JOBSURL="https://www.googleapis.com/bigquery/v2/projects/$PROJECT/jobs"
FIELDS="statistics(query(queryPlan(steps)))"
curl --silent \
    -H "Authorization: Bearer $access_token" \
    -H "Accept-Encoding: gzip" \
    -H "User-Agent: get_job_details (gzip)" \
    -X GET \
    "${JOBSURL}/${JOBID}?fields=${FIELDS}" \
| zcat


Note that the command also specifies that we accept gzip-compressed data.



Batching multiple requests



When using the REST API, you can batch several BigQuery API calls together by using the multipart/mixed content type and nesting an HTTP request in each part. The body of each part specifies the HTTP operation (GET, PUT, etc.), the URL path, the headers, and the body. In response, the server sends back a single HTTP response with the multipart/mixed content type, in which each part contains the response (in order) to the corresponding request in the batch. Although the responses are returned in a specific order, the server may process the calls in any order, so a batch request can be thought of as a group of requests executed in parallel. Here is an example of sending a batch request to get a few details from the execution plans of the five most recent queries in our project. First, we use the BigQuery command-line tool to get the last five successful jobs:



# Get the last five successful jobs
JOBS=$(bq ls -j -n 50 | grep SUCCESS | head -5 | awk '{print $1}')


The requests are sent to the BigQuery batch endpoint:



BATCHURL="https://www.googleapis.com/batch/bigquery/v2"
JOBSPATH="/projects/$PROJECT/jobs"
FIELDS="statistics(query(queryPlan(steps)))"


The individual requests are defined by their URL paths:



request=""
for JOBID in $JOBS; do
read -d '' part << EOF
--batch_part_starts_here
GET ${JOBSPATH}/${JOBID}?fields=${FIELDS}
EOF
request=$(echo "$request"; echo "$part")
done


Then you can send the batch as a multipart request:



curl --silent \
   -H "Authorization: Bearer $access_token" \
   -H "Content-Type: multipart/mixed; boundary=batch_part_starts_here" \
   -X POST \
   -d "$request" \
   "${BATCHURL}"


Bulk Read Using BigQuery Storage API



In Chapter 5, we discussed using the BigQuery REST API and client libraries to list tables and retrieve query results. The REST API returns data as paginated records, which is best suited to relatively small result sets. However, with the advent of machine learning and distributed extract, transform, and load (ETL) tools, external tools now need fast and efficient bulk access to BigQuery managed storage. This bulk read access is provided by the BigQuery Storage API over the Remote Procedure Call (RPC) protocol. With the BigQuery Storage API, structured data is transmitted over the network in a binary serialization format that closely matches the columnar storage format. This allows the result set to be further parallelized across multiple consumers.



End users do not use the BigQuery Storage API directly. Instead, they work with Cloud Dataflow, Cloud Dataproc, TensorFlow, AutoML, and other tools that use the Storage API to read data directly from managed storage rather than through the BigQuery API.



Because the Storage API accesses the stored data directly, permission to use the BigQuery Storage API is separate from the permissions for the existing BigQuery API. BigQuery Storage API permissions must be configured independently of BigQuery permissions.



The BigQuery Storage API provides several benefits to tools that read data directly from BigQuery managed storage. For example, consumers can read disjoint sets of records from a table using multiple streams (allowing, for instance, distributed reads from different workers in a Cloud Dataproc cluster), dynamically shard those streams (thereby reducing tail latency, which can be a serious problem for MapReduce jobs), select a subset of columns to read (so that machine learning frameworks receive only the features used by the model), filter on column values (reducing the amount of data transmitted over the network), and still get snapshot consistency (that is, read data as of a specific point in time).
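
As an illustration, here is a hedged sketch of a direct read with the Python client for the BigQuery Storage API. It assumes the google-cloud-bigquery-storage v2 library with the fastavro extra installed, and the billing project ID is a placeholder.

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

# Sketch only: assumes google-cloud-bigquery-storage v2 is installed and
# that "my-project" is replaced with a real billing project.
client = bigquery_storage.BigQueryReadClient()
table = "projects/bigquery-public-data/datasets/london_bicycles/tables/cycle_hire"

requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["start_station_name", "duration"],  # column projection
        row_restriction="duration > 3600",                   # server-side filter
    ),
)
session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=2,  # non-overlapping streams that can be read in parallel
)

# Read the first stream; each stream could be handed to a separate worker.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row["start_station_name"], row["duration"])
    break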



In Chapter 5, we covered using the %%bigquery extension in Jupyter Notebook to load query results into DataFrames. However, those examples used relatively small datasets, from a dozen to a few hundred records. Is it possible to load the entire london_bicycles dataset (24 million records) into a DataFrame? Yes, it is, but in this case you should use the Storage API rather than the BigQuery API to load the data into the DataFrame. First, you need to install the Python Storage API client library with Avro and pandas support. This can be done with the command



%pip install google-cloud-bigquery-storage[fastavro,pandas]


Then all that remains is to use the %%bigquery extension as before, but with an added parameter that requests the use of the Storage API:



%%bigquery df --use_bqstorage_api --project $PROJECT
SELECT 
   start_station_name 
   , end_station_name 
   , start_date 
   , duration
FROM `bigquery-public-data`.london_bicycles.cycle_hire


Note that here we are taking advantage of the Storage API's ability to provide direct access to individual columns; it is not necessary to read the entire BigQuery table into the DataFrame. If the query returns only a small amount of data, the extension automatically falls back to the BigQuery API, so there is no harm in always specifying this flag in notebook cells. To enable the --use_bqstorage_api flag in all cells of the notebook, you can set a context flag:



import google.cloud.bigquery.magics
google.cloud.bigquery.magics.context.use_bqstorage_api = True


Choosing an efficient storage format



The performance of a query depends on where and in what format the underlying table data is stored. In general, the fewer lookups and type conversions a query has to perform, the better its performance.



Internal and External Data Sources



BigQuery supports querying external data sources such as Google Cloud Storage, Cloud Bigtable, and Google Sheets, but you get the best performance with BigQuery's native tables.



We recommend using BigQuery as the analytics data warehouse for all your structured and semi-structured data. External data sources are best used for staging (Google Cloud Storage), live ingestion (Cloud Pub/Sub, Cloud Bigtable), or periodically updated data (Cloud SQL, Cloud Spanner). Then set up a data pipeline to load data on a schedule from these external sources into BigQuery (see Chapter 4).



If you need to query data in Google Cloud Storage, store it in a compressed, columnar format (such as Parquet) whenever possible. Use row-based formats such as JSON or CSV only as a last resort.
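
For example, Parquet files staged in Cloud Storage can be loaded into a native BigQuery table with the Python client library. This is only a sketch; the bucket path and destination table name below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Placeholder URI and table name; Parquet carries its own schema,
# so no explicit schema definition is needed here.
load_job = client.load_table_from_uri(
    "gs://some_bucket/staging/*.parquet",
    "ch07.staged_data",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete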



Staging Bucket Lifecycle Management



If you load data into BigQuery after staging it in Google Cloud Storage, be sure to delete it from Cloud Storage after the load. If you use an ETL pipeline to load data into BigQuery (significantly transforming it or keeping only part of the data along the way), you may want to keep the original data in Google Cloud Storage. In such cases, you can reduce costs by defining bucket lifecycle management rules that downgrade the storage class of objects in Google Cloud Storage.



Here is how you can turn on lifecycle management for a bucket so that data in the Multi-Regional or Standard storage classes that is more than 30 days old is automatically moved to Nearline Storage, and data that has been in Nearline Storage for more than 90 days is moved to Coldline Storage:



gsutil lifecycle set lifecycle.yaml gs://some_bucket/


In this example, the lifecycle.yaml file contains the following code:



{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "NEARLINE"
        },
        "condition": {
          "age": 30,
          "matchesStorageClass": ["MULTI_REGIONAL", "STANDARD"]
        }
      },
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "COLDLINE"
        },
        "condition": {
          "age": 90,
          "matchesStorageClass": ["NEARLINE"]
        }
      }
    ]
  }
}


You can use lifecycle management not only to change the storage class of an object, but also to delete objects that are older than a certain threshold.
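
For illustration, a deletion rule can also be added programmatically with the google-cloud-storage Python client. This is only a sketch; the bucket name and the 365-day threshold are just examples.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("some_bucket")

# Add a rule that deletes objects older than a year (example threshold),
# keeping any lifecycle rules the bucket already has.
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()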



Storing data as arrays and structures



Among its other publicly available datasets, BigQuery hosts a dataset containing information about cyclonic storms (hurricanes, typhoons, cyclones, etc.) from meteorological services around the world. Cyclonic storms can last up to several weeks, and their meteorological parameters are measured approximately every three hours. Suppose you want to find, in this dataset, all the storms that occurred in 2018, the maximum wind speed reached by each storm, and the time and location of the storm when that maximum speed was reached. The following query retrieves all of this information from the public dataset:



SELECT
  sid, number, basin, name,
  ARRAY_AGG(STRUCT(iso_time, usa_latitude, usa_longitude, usa_wind) ORDER BY
usa_wind DESC LIMIT 1)[OFFSET(0)].*
FROM
  `bigquery-public-data`.noaa_hurricanes.hurricanes
WHERE
  season = '2018'
GROUP BY
  sid, number, basin, name
ORDER BY number ASC


The query retrieves the storm identifier (sid), its number, basin, and name (if one was assigned), and then builds an array of observations made for that storm, ordering the observations by descending wind speed and keeping only the observation with the maximum wind speed for each storm. The storms themselves are ordered by sequence number. The result contains 88 records and looks something like this:





The query took 1.4 seconds and processed 41.7 MB. The first record describes storm Bolaven, which reached a maximum speed of 29 m/s on January 2, 2018, at 18:00 UTC.



Since observations are made by multiple meteorological agencies, this data can be standardized using nested fields and stored in BigQuery as shown below:



CREATE OR REPLACE TABLE ch07.hurricanes_nested AS

SELECT sid, season, number, basin, name, iso_time, nature, usa_sshs,
    STRUCT(usa_latitude AS latitude, usa_longitude AS longitude, usa_wind AS
wind, usa_pressure AS pressure) AS usa,
    STRUCT(tokyo_latitude AS latitude, tokyo_longitude AS longitude,
tokyo_wind AS wind, tokyo_pressure AS pressure) AS tokyo,
    ... AS cma,
    ... AS hko,
    ... AS newdelhi,
    ... AS reunion,
    ... bom,
    ... AS wellington,
    ... nadi
FROM `bigquery-public-data`.noaa_hurricanes.hurricanes


Queries on this table look the same as queries on the original table, but with a slight change in the column names (usa.latitude instead of usa_latitude):



SELECT
  sid, number, basin, name,
  ARRAY_AGG(STRUCT(iso_time, usa.latitude, usa.longitude, usa.wind) ORDER BY
usa.wind DESC LIMIT 1)[OFFSET(0)].*
FROM
  ch07.hurricanes_nested
WHERE
  season = '2018'
GROUP BY
  sid, number, basin, name
ORDER BY number ASC


This query processes the same amount of data and runs in the same amount of time as the original query against the public dataset. Using nested fields (structures) does not change the speed or cost of the query, but it can make the query more readable. Since there are many observations of the same storm over its lifetime, we can change the storage layout so that each storm's entire array of observations fits in a single record:



CREATE OR REPLACE TABLE ch07.hurricanes_nested_track AS

SELECT sid, season, number, basin, name,
 ARRAY_AGG(
   STRUCT(
    iso_time,
    nature,
    usa_sshs,
    STRUCT(usa_latitude AS latitude, usa_longitude AS longitude, usa_wind AS
wind, usa_pressure AS pressure) AS usa,
    STRUCT(tokyo_latitude AS latitude, tokyo_longitude AS longitude,
      tokyo_wind AS wind, tokyo_pressure AS pressure) AS tokyo,
    ... AS cma,
    ... AS hko,
    ... AS newdelhi,
    ... AS reunion,
    ... bom,
    ... AS wellington,
    ... nadi
  ) ORDER BY iso_time ASC ) AS obs
FROM `bigquery-public-data`.noaa_hurricanes.hurricanes
GROUP BY sid, season, number, basin, name


Note that we now store sid, season, and the other storm-level characteristics as scalar columns, because they do not change over the lifetime of the storm.



The rest of the data, which changes with each observation, is stored as an array of structures. Here is what a query against the new table looks like:



SELECT
  number, name, basin,
  (SELECT AS STRUCT iso_time, usa.latitude, usa.longitude, usa.wind
     FROM UNNEST(obs) ORDER BY usa.wind DESC LIMIT 1).*
FROM ch07.hurricanes_nested_track
WHERE season = '2018'
ORDER BY number ASC


This query returns the same result, but this time it processes only 14.7 MB (roughly a threefold cost reduction) and completes in one second (a 30% speedup). What caused this performance improvement? When the data is stored as arrays, the number of rows in the table drops dramatically (from 682,000 to 14,000), because there is now only one row per storm instead of one row per observation. Then, when we filter rows by season, BigQuery can discard all of the related observations at once, as shown in Figure 7.13.





Another advantage is that there is no need to duplicate data when records with different levels of detail are stored in the same table. A single table can store both the detailed latitude and longitude track of each storm and high-level data such as the storm name and season. And because BigQuery stores table data in columns and compresses it, you can query and process the high-level data without worrying about the cost of the detailed data: it is now stored as a separate array of values for each storm.



For example, to find out the number of storms by year, you can query only the required columns:



WITH hurricane_detail AS (
SELECT sid, season, number, basin, name,
 ARRAY_AGG(
  STRUCT(
    iso_time,
    nature,
    usa_sshs,
    STRUCT(usa_latitude AS latitude, usa_longitude AS longitude, usa_wind AS
wind, usa_pressure AS pressure) AS usa,
    STRUCT(tokyo_latitude AS latitude, tokyo_longitude AS longitude,
        tokyo_wind
AS wind, tokyo_pressure AS pressure) AS tokyo
  ) ORDER BY iso_time ASC ) AS obs
FROM `bigquery-public-data`.noaa_hurricanes.hurricanes
GROUP BY sid, season, number, basin, name
)
SELECT
  COUNT(sid) AS count_of_storms,
  season
FROM hurricane_detail
GROUP BY season
ORDER BY season DESC


The previous query processed 27 MB, about half the 56 MB that would have to be processed if nested, repeated fields were not used.



Nested fields by themselves do not improve performance, although they can improve readability, because they act as if the related tables had already been joined. Nested, repeated fields, on the other hand, are extremely useful from a performance standpoint. Consider using them in your schema, because they can dramatically increase the speed and reduce the cost of queries that filter on a column that is neither nested nor repeated (in our case, season).



The key disadvantage of nested, repeated fields is the difficulty of streaming into such a table if the streaming updates involve adding elements to existing arrays. This is much harder than appending new records: you have to modify an existing record. That is a significant drawback for the storm table, because new observations arrive constantly, and it explains why this public dataset does not use nested, repeated fields.
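
To make the contrast concrete, here is a hedged sketch comparing the two cases with the Python client library. The table names, the simplified obs element type, and the storm ID are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Flat layout: a new observation is just another streamed row (hypothetical table).
client.insert_rows_json(
    "ch07.hurricanes_flat",
    [{"sid": "some_storm_id", "iso_time": "2018-01-02 18:00:00", "usa_wind": 29}],
)

# Nested layout: the same observation must be appended to an existing array with
# a DML UPDATE (hypothetical table whose obs column is
# ARRAY<STRUCT<iso_time TIMESTAMP, usa_wind INT64>>).
client.query("""
    UPDATE ch07.hurricanes_simple_track
    SET obs = ARRAY_CONCAT(
        obs,
        [STRUCT(TIMESTAMP '2018-01-02 18:00:00' AS iso_time, 29 AS usa_wind)])
    WHERE sid = 'some_storm_id'
""").result()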



Practicing with arrays



Experience shows that it takes some practice to use nested, repeated fields successfully. The Google Analytics sample dataset in BigQuery is ideal for this purpose. The easiest way to identify nested data in a schema is to look for the word RECORD in the Type column, which corresponds to the STRUCT data type, and the word REPEATED in the Mode column, as shown below:





In this example, the totals field is a STRUCT (but not repeated), and the hits field is a STRUCT and is repeated. This makes sense, because Google Analytics tracks visitor session data both at the aggregate level (a single totals.hits value per session) and at the detailed level (separate hits.time values for each page and image retrieved from your site). Storing data at these different levels of detail without duplicating visitorId across records is possible only with arrays. Once the data is stored in this repeated format, you need to unnest the arrays in your queries using UNNEST, for example:



SELECT DISTINCT
  visitId
  , totals.pageviews
  , totals.timeOnsite
  , trafficSource.source
  , device.browser
  , device.isMobile
  , h.page.pageTitle
FROM
  `bigquery-public-data`.google_analytics_sample.ga_sessions_20170801,
  UNNEST(hits) AS h
WHERE
  totals.timeOnSite IS NOT NULL AND h.page.pageTitle =
'Shopping Cart'
ORDER BY pageviews DESC
LIMIT 10

Here UNNEST(hits), aliased as h, turns the hits array of each session into a set of rows, one row per array element, in the same way that UNNEST would turn an array such as [1, 2, 3, 4, 5] into five rows:

1
2
3
4
5


You can then apply normal SQL operations, such as a WHERE filter, to keep only the hits on pages titled Shopping Cart. Try it!
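
If you prefer to find nested and repeated fields programmatically rather than by reading the schema in the web UI, a short sketch with the Python client library can list them for any table:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_20170801")

# Print the fields that are STRUCTs (RECORD) and/or arrays (REPEATED).
for field in table.schema:
    if field.field_type == "RECORD" or field.mode == "REPEATED":
        print(field.name, field.field_type, field.mode)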



On the other hand, the public dataset of GitHub commit information (bigquery-public-data.github_repos.commits) uses a nested, repeated field (repo_name) to store the list of repositories affected by a commit. This list does not change over time, and its presence speeds up queries that filter on any of the other fields.



Storing data as geographic types



The BigQuery public datasets include a table of US ZIP code area boundaries (bigquery-public-data.utility_us.zipcode_area) and another table with polygons describing the boundaries of US cities (bigquery-public-data.utility_us.us_cities_area). The zipcode_geom column in the first table is a string, while the city_geom column in the second is of the GEOGRAPHY type.



By joining these two tables, you can get a list of all the ZIP codes for Santa Fe, New Mexico, as shown below:



SELECT name, zipcode
FROM `bigquery-public-data`.utility_us.zipcode_area
JOIN `bigquery-public-data`.utility_us.us_cities_area
ON ST_INTERSECTS(ST_GeogFromText(zipcode_geom), city_geom)
WHERE name LIKE '%Santa Fe%'


This query takes 51.9 seconds, processes 305.5 MB of data, and returns the following results:





Why does this query take so long? It is not because of the ST_INTERSECTS operation, but mainly because the ST_GeogFromText function must compute the S2 cells and construct the GEOGRAPHY value corresponding to each ZIP code.



We can avoid this by modifying the ZIP code table to perform this conversion ahead of time and store the geometry as a GEOGRAPHY value:



CREATE OR REPLACE TABLE ch07.zipcode_area AS
SELECT 
  * REPLACE(ST_GeogFromText(zipcode_geom) AS zipcode_geom)
FROM 
  `bigquery-public-data`.utility_us.zipcode_area


REPLACE (see the previous query) is a convenient way to replace a column in a SELECT * expression.
The new table is 131.8 MB, somewhat larger than the 116.5 MB of the original table. However, queries against this table can take advantage of the S2 coverings and are much faster. For example, the following query takes 5.3 seconds (a 10x speedup) and processes 320.8 MB (a slight increase in cost under on-demand pricing):



SELECT name, zipcode
FROM ch07.zipcode_area
JOIN `bigquery-public-data`.utility_us.us_cities_area
ON ST_INTERSECTS(zipcode_geom, city_geom)
WHERE name LIKE '%Santa Fe%'


The performance benefit of storing geographic data in a GEOGRAPHY column is more than compelling. This is why the utility_us dataset has been deprecated (it remains available so that queries already written against it keep working). We recommend using the bigquery-public-data.geo_us_boundaries.us_zip_codes table instead, which stores the geographic information in a GEOGRAPHY column and is continually updated.



» More details about the book can be found on the publisher's website

» Table of Contents

» Excerpt



For Habr readers, there is a 25% discount with the coupon code Google.



When you pay for the paper version of the book, an e-book is sent to your e-mail.


