Learn to access and query data using Google BigQuery. With examples in Python and R

Hello, Habr!



We recently published a detailed book on working with the Google BigQuery data warehouse. Today we decided to briefly revisit this topic and walk through a small case study on how to query BigQuery data in Python and R.



Let us know in the comments if you are interested in a post on machine learning using BigQuery.







Overview



In this article, we'll look at how to load Google BigQuery data using Python and R, and then we'll talk about querying that data to draw useful insights. We'll use the Google Cloud BigQuery library to connect to BigQuery in Python, and the bigrquery library to do the same in R.



We'll also discuss two stages of BigQuery data manipulation using Python / R:



  • Google BigQuery connectivity and data access
  • Querying data with Python / R


This article assumes that all of your user data is stored in Google BigQuery.



Python



Python is one of the most popular general-purpose languages for manipulating data. It is in demand thanks to its flexibility and ease of use, and in data science it boasts a wide variety of libraries and tools for interacting with third-party systems.



Connecting to Google BigQuery with Python



To query Google BigQuery data using Python, you need to connect a Python client to your BigQuery instance. To do so, we use the Google Cloud client library for the BigQuery API. There are also alternative solutions for connecting to BigQuery using Python; for example, tylertreat's BigQuery-Python library is great.



We will be working with the Google Cloud BigQuery library as it is stable and officially supported by Google.



This assumes that you already have a Python development environment set up.

To install the library, run the following command at the command line:



pip install --upgrade google-cloud-bigquery


Next, we connect the client to the database. To do this, you need a JSON file containing the credentials for a BigQuery service account. If you don't have one, here's how to create one. Save this JSON file to your local machine.



Now that we have everything set up, we proceed to initialize the connection. The following Python code is used for this:



from google.cloud import bigquery
from google.oauth2 import service_account

# Load the service account credentials from the JSON key file.
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file.json')

project_id = 'my-bq'
client = bigquery.Client(credentials=credentials, project=project_id)


In the above snippet, you will also need to specify your project_id and the location of the JSON key file, replacing 'path/to/file.json' with the actual path to the JSON file saved on your local machine.
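As an alternative to loading the key file explicitly, the client library can also pick up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable. Here is a minimal sketch of that approach (the path and project ID are placeholders):

import os
from google.cloud import bigquery

# Point the client library at the service account key file (placeholder path).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/file.json"

# With the variable set, the client picks up the credentials automatically.
client = bigquery.Client(project='my-bq')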



In Google BigQuery, a project is a top-level container and provides default access control over all datasets.
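To see this container in action, you can, for example, list the datasets that the connected project holds. A minimal sketch, reusing the client object created above:

# Print the ID of every dataset in the connected project.
for dataset in client.list_datasets():
    print(dataset.dataset_id)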



Querying BigQuery Data with Python



Now that our BigQuery client is set up and ready to use, we can query BigQuery data.



For this we use the query method, which puts the query job on a BigQuery queue. Queries are then executed asynchronously: we do not specify any delays ourselves, and the client waits for the job to be completed. Once this happens, the method returns a QueryJob instance containing the results.



You can read more about how this method works in the official documentation here.



This is what the Python code of interest looks like:



query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000 """)

results = query_job.result()  # Wait for the query to finish and fetch the results.
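The results object is a row iterator: you can loop over it directly and access fields by column name, or, with pandas installed, load the whole result set into a DataFrame. A minimal sketch (the name column is hypothetical; substitute one of your own):

# Fields are accessible by column name while iterating
# (the "name" column here is hypothetical).
for row in results:
    print(row["name"])

# Alternatively, with pandas installed, fetch the result set as a DataFrame.
df = query_job.to_dataframe()
print(df.head())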


Please note that the above query defaults to standard SQL syntax. If you want to use legacy SQL, the code looks like this:



# Configure the job to run with legacy SQL instead of standard SQL.
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True

query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000""", job_config=job_config)

results = query_job.result()  # Wait for the query to finish and fetch the results.


R



The R language is a popular alternative to Python and is actively used in data science. If you are interested in detailed and methodical statistical analysis of data, few languages can compete with R.



When working with Google BigQuery, R also offers a reliable and easy-to-use library for querying and manipulating data. Here we will be working with the bigrquery library, created and maintained by Hadley Wickham, Chief Scientist at RStudio.



This assumes that you have already set up your development environment in R. If not, use this guide to set up RStudio.



Connecting to Google BigQuery with R



To install bigrquery, run the following command from the R console:



install.packages("bigrquery")


It's that simple! We are ready to go.



As with Python, our R client requires authorization to access Google Cloud services. As the bigrquery documentation explains, follow the prompt in the R console to open the authorization URL, and then paste the code back into the console.



Please note: this authorization needs to be done only once. All subsequent requests will automatically update the access credentials.



Querying BigQuery Data with R



To query BigQuery data in R, follow these steps:



  • Specify the project ID from the Google Cloud console, as we did in Python.
  • Form a query string with which to request the data.
  • Call query_exec with our project ID and query string.


Here's the code to do all of this:



# Load the library
library(bigrquery)

# Your project ID from the Google Cloud console
project_id <- "your-project-id"

# The query string
sql_string <- "SELECT * FROM dataset.my_table LIMIT 1000"

# Run the query and store the results
query_results <- query_exec(sql_string, project = project_id, useLegacySql = FALSE)


As with Python, you can run queries written in legacy SQL: just change the value of useLegacySql to TRUE in your query_exec call.



Conclusion



So, we've examined how quick and easy it is to work with data stored in Google BigQuery by querying it from Python and R.



In these two languages, it is not difficult to build a statistical model on top of data processed this way and to use that model for different purposes: to understand how users behave in an application, to predict churn rate, and so on.


