Choose the right Python library
When working with BigQuery from Python, you can choose among the following three libraries, depending on your needs:
- BigQuery DataFrames: provides pandas- and scikit-learn-compatible APIs with server-side processing, well suited to data processing and machine learning tasks (see the short sketch after this list).
- pandas-gbq: a thin client library for reading and writing BigQuery data from pandas, suited to simple data processing and analysis.
- google-cloud-bigquery: the Google-maintained client library that exposes the full BigQuery API, suited to complex data management and analysis.
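BigQuery DataFrames is not demonstrated in the examples below, so here is a minimal sketch of what it looks like. It assumes the separate bigframes package is installed (pip install bigframes) and that a default Google Cloud project and location are already configured; the table is the same public table used later in this article.

import bigframes.pandas as bpd

# Read a public table into a BigQuery DataFrame; the data stays in BigQuery
# and pandas-style operations are pushed down to the server.
df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_current")

# Familiar pandas-style filtering; execution happens server-side.
print(df[df["state"] == "TX"].head(10))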
Install the library
To use these libraries, you need to install the following packages:
pip install --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]'
Run the query
Using GoogleSQL syntax
The following examples show how to run a GoogleSQL query with pandas-gbq and with google-cloud-bigquery:
pandas-gbq
import pandas sql = """ SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` WHERE state = 'TX' LIMIT 100 """ # Use standard SQL querydf = pandas.read_gbq(sql, dialect="standard") # Specify the project IDproject_id = "your-project-id" df = pandas.read_gbq(sql, project_id=project_id, dialect="standard")
google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = 'TX'
LIMIT 100
"""

# GoogleSQL (standard SQL) is the default dialect
df = client.query(sql).to_dataframe()

# Optionally specify the project ID
project_id = "your-project-id"
df = client.query(sql, project=project_id).to_dataframe()
Using legacy SQL syntax
If you need to use legacy SQL syntax, you can enable it as follows:
pandas-gbq
import pandas sql = """ SELECT name FROM [bigquery-public-data:usa_names.usa_1910_current] WHERE state = 'TX' LIMIT 100 """ df = pandas.read_gbq(sql, dialect="legacy")
google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT name
FROM [bigquery-public-data:usa_names.usa_1910_current]
WHERE state = 'TX'
LIMIT 100
"""

# Enable legacy SQL via the job configuration
query_config = bigquery.QueryJobConfig(use_legacy_sql=True)
df = client.query(sql, job_config=query_config).to_dataframe()
Accelerate data downloads using the BigQuery Storage API
The BigQuery Storage API can significantly speed up downloads of large result sets. The following examples show how to use it:
pandas-gbq
import pandas

sql = "SELECT * FROM `bigquery-public-data.irs_990.irs_990_2012`"

# Use the BigQuery Storage API to speed up downloads
df = pandas.read_gbq(sql, dialect="standard", use_bqstorage_api=True)
google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

sql = "SELECT * FROM `bigquery-public-data.irs_990.irs_990_2012`"

# If the BigQuery Storage API is available, it is used automatically
df = client.query(sql).to_dataframe()
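Recent versions of google-cloud-bigquery use the BigQuery Storage API automatically when the google-cloud-bigquery-storage package is installed (which the bqstorage extra above provides). If you prefer to be explicit, a sketch like the following also works; creating and reusing a single BigQueryReadClient is my own suggestion, not something the library requires:

from google.cloud import bigquery, bigquery_storage

client = bigquery.Client()

# Create the Storage API client once and reuse it across downloads.
bqstorage_client = bigquery_storage.BigQueryReadClient()

sql = "SELECT * FROM `bigquery-public-data.irs_990.irs_990_2012`"
df = client.query(sql).to_dataframe(bqstorage_client=bqstorage_client)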
Configure queries
Parameterized queries
The following examples show how to run parameterized queries:
pandas-gbq
import pandas sql = """ SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` WHERE state = @state LIMIT @limit """ query_config = { "query": { "parameterMode": "NAMED", "queryParameters": [ { "name": "state", "parameterType": {"type": "STRING"}, "parameterValue": {"value": "TX"}, }, { "name": "limit", "parameterType": {"type": "INTEGER"}, "parameterValue": {"value": 100}, }, ], } } df = pandas.read_gbq(sql, configuration=query_config)
google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = @state
LIMIT @limit
"""

# Pass named query parameters via the job configuration
query_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("state", "STRING", "TX"),
        bigquery.ScalarQueryParameter("limit", "INTEGER", 100),
    ]
)

df = client.query(sql, job_config=query_config).to_dataframe()
Load a pandas DataFrame into a BigQuery table
The following examples show how to load a pandas DataFrame into a BigQuery table:
pandas-gbq
import pandas

df = pandas.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_timestamp": [
            pandas.Timestamp("1998-09-04T16:03:14"),
            pandas.Timestamp("2010-09-13T12:03:45"),
            pandas.Timestamp("2015-10-02T16:00:00"),
        ],
    }
)

table_id = "my_dataset.new_table"
df.to_gbq(table_id)
google-cloud-bigquery
from google.cloud import bigquery
import pandas

df = pandas.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_timestamp": [
            pandas.Timestamp("1998-09-04T16:03:14"),
            pandas.Timestamp("2010-09-13T12:03:45"),
            pandas.Timestamp("2015-10-02T16:00:00"),
        ],
    }
)

client = bigquery.Client()
table_id = "my_dataset.new_table"

# Provide a partial schema to ensure the correct BigQuery data types
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("my_string", "STRING"),
    ]
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)

# Wait for the load job to complete
job.result()
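As a quick sanity check after the load finishes, you can fetch the table metadata and confirm the row count. This verification step is my own addition, not part of the original sample:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_dataset.new_table"

# Fetch the table metadata and print the number of loaded rows.
table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")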
Limitations of pandas-gbq
- Dataset management: creating, updating, or deleting datasets is not supported.
- Data format support: only the CSV format is supported when writing data, so nested or array values are not supported.
- Table management: listing, copying, or deleting tables is not supported.
- Data export: exporting data directly to Cloud Storage is not supported. (For these tasks, fall back to google-cloud-bigquery, as sketched below.)
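Here is a minimal sketch of how google-cloud-bigquery covers those gaps. The dataset, table, and Cloud Storage bucket names below are placeholders of my own, not values used elsewhere in this article:

from google.cloud import bigquery

client = bigquery.Client()

# Dataset management: create a new dataset.
dataset = client.create_dataset("my_new_dataset", exists_ok=True)

# Table management: list and delete tables in a dataset.
for table in client.list_tables("my_dataset"):
    print(table.table_id)
client.delete_table("my_dataset.old_table", not_found_ok=True)

# Data export: extract a table to Cloud Storage.
extract_job = client.extract_table(
    "my_dataset.new_table",
    "gs://my-bucket/exports/new_table-*.csv",
)
extract_job.result()  # wait for the export job to finish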
Resolve connection pooling errors
If you encounter connection pool errors, you can increase the connection pool size as follows:
import requests
from google.cloud import bigquery

client = bigquery.Client()

# Enlarge the underlying HTTP connection pool used by the client
adapter = requests.adapters.HTTPAdapter(
    pool_connections=128, pool_maxsize=128, max_retries=3
)
client._http.mount("https://", adapter)
client._http._auth_request.session.mount("https://", adapter)
That concludes this walkthrough of using Python to interact with BigQuery. For more on working with BigQuery from Python, please check out my other related articles!