How can I automatically back up and version BigQuery code such as stored procedures?

Question

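What are some options for backing up BigQuery DDL, in particular the code for views, stored procedures, and functions?

We have a large amount of code in BigQuery that we want to back up automatically, ideally with version control as well. I'd like to know how others are doing this.

Any help is appreciated.

Thanks!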

Answer 1

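To save and track our BigQuery structure and code, we use Terraform to manage every resource in BigQuery. More specifically for your question, we use the google_bigquery_routine resource, which ensures that changes are reviewed by other team members and gives us all the other benefits of using a VCS.

Another important part of our Terraform code is that we version our BigQuery module (via GitHub releases/tags). The module includes the table structures and routines, and we use the versioned module across multiple environments.

It looks like this:

main.tf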

module "bigquery" {
  # "//bigquery" selects the subdirectory; "?ref=" pins the release tag
  source = "github.com/sample-org/terraform-modules.git//bigquery?ref=0.0.2"

  project_id = var.project_id

  # ...
  # ... other vars for the module
  # ...
}

terraform-modules/bigquery/main.tf

resource "google_bigquery_dataset" "test" {
  dataset_id = "dataset_id"
  project    = var.project_id # same variable name the root module passes in
}

resource "google_bigquery_routine" "sproc" {
  dataset_id      = google_bigquery_dataset.test.dataset_id
  routine_id      = "routine_id"
  routine_type    = "PROCEDURE"
  language        = "SQL"
  definition_body = "CREATE FUNCTION Add(x FLOAT64, y FLOAT64) RETURNS FLOAT64 AS (x + y);"
}

This lets us promote infrastructure changes through all of our environments without changing any additional code.

Answer 2

We ended up backing up DDL and routines using INFORMATION_SCHEMA. A scheduled job extracts the relevant metadata and then uploads the content to GCS.

Example SQL:

select * from <schema>.INFORMATION_SCHEMA.ROUTINES;
select * from <schema>.INFORMATION_SCHEMA.VIEWS;
select *, DDL from <schema>.INFORMATION_SCHEMA.TABLES;

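You must name the DDL column explicitly in the select list for the table DDL to be returned.

Check the documentation, as these features are evolving quickly.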
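A job like this could be sketched in Python roughly as follows. This is only an illustration, not the job described above: the function names (build_ddl_query, rows_to_files, safe_filename), the one-file-per-object layout, and writing to a local directory before any upload to GCS are all assumptions. Rows are passed in as plain dictionaries so the file-writing logic can be tried without BigQuery access.

```python
import re
from pathlib import Path

# INFORMATION_SCHEMA views that expose a ddl column, mapped to the column
# holding each object's name. Views appear in TABLES alongside tables.
VIEWS = {
    "TABLES": "table_name",
    "ROUTINES": "routine_name",
}


def build_ddl_query(dataset: str, view: str) -> str:
    """Return a query selecting object names and DDL from one view."""
    name_col = VIEWS[view]
    return (
        f"SELECT {name_col} AS name, ddl "
        f"FROM {dataset}.INFORMATION_SCHEMA.{view}"
    )


def safe_filename(name: str) -> str:
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)


def rows_to_files(rows, out_dir, prefix: str) -> list:
    """Write one .sql file per row; each row needs 'name' and 'ddl' keys."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for row in rows:
        path = out / f"{prefix}_{safe_filename(row['name'])}.sql"
        # Normalise trailing whitespace so unchanged DDL produces no diff.
        path.write_text(row["ddl"].rstrip() + "\n", encoding="utf-8")
        written.append(path)
    return written
```

If you use the google-cloud-bigquery client library, the rows returned by client.query(build_ddl_query("my_dataset", "ROUTINES")).result() can be passed to rows_to_files directly, since they support lookup by column name. The dataset name here is a placeholder.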

Answer 3

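Every night I use Cloud Run to write table/view and routine (stored procedure and function) definition files to Cloud Storage. See this tutorial for information on the setup. The Cloud Run service exposes an HTTP endpoint that Cloud Scheduler calls on a schedule. It essentially runs this script: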

#!/usr/bin/env bash

set -eo pipefail

GCLOUD_REPORT_BUCKET="myproject-code/backups"

# use a single timestamp so both report names match
timestamp="$(date +%s)"
objects_report="gs://${GCLOUD_REPORT_BUCKET}/objects-backup-report-${timestamp}.txt"
routines_report="gs://${GCLOUD_REPORT_BUCKET}/routines-backup-report-${timestamp}.txt"
project_id="myproject-dw"
# definitions are accumulated as strings of concatenated JSON objects
table_defs=""
routine_defs=""

# get list of datasets, skipping fivetran ones; tail drops the two header lines
datasets=$(bq ls --max_results=1000 | grep -v -e "fivetran" | awk '{print $1}' | tail -n +3)

for dataset in $datasets
do
  echo "${project_id}:${dataset}"

  # collect table and view definitions
  tables=$(bq ls --max_results 1000 "${project_id}:${dataset}" | awk '{print $1}' | tail -n +3)
  for table in $tables
  do
    echo "${project_id}:${dataset}.${table}"
    table_defs+="$(bq show --format=prettyjson "${project_id}:${dataset}.${table}")"
  done

  # collect routine (stored proc and function) definitions
  routines=$(bq ls --max_results 1000 --routines=true "${project_id}:${dataset}" | awk '{print $1}' | tail -n +3)
  for routine in $routines
  do
    echo "${project_id}:${dataset}.${routine}"
    routine_defs+="$(bq show --format=prettyjson --routine=true "${project_id}:${dataset}.${routine}")"
  done

done

# quote the variables so characters such as * in view SQL are not glob-expanded
printf '%s\n' "${table_defs}" | jq '.' | gsutil -q cp -J - "${objects_report}"
printf '%s\n' "${routine_defs}" | jq '.' | gsutil -q cp -J - "${routines_report}"

# /dev/stderr is sent to Cloud Logging.
echo "objects-backup-report: wrote to ${objects_report}" >&2
echo "Wrote objects report to ${objects_report}"
echo "routines-backup-report: wrote to ${routines_report}" >&2
echo "Wrote routines report to ${routines_report}"

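The output is essentially the same as running the bq ls and bq show commands for every dataset and piping the results to a date-stamped text file. I could add this to git, but the file names already include a timestamp, so you can see the state of BigQuery on a given date by looking at the file for that date.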
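Because each run produces its own dated file, two of them can also be compared to see what changed in between. Here is a minimal sketch of that idea, not part of the script above. It assumes each file is a series of JSON objects written back to back (which is what jq '.' emits for concatenated input) and that every table object carries an "id" field. The function names and the default key are assumptions for illustration, and for routine files you would pass a different key function.

```python
import json


def parse_json_stream(text: str) -> list:
    """Parse a string holding several JSON values written back to back."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # Skip whitespace between values; raw_decode does not accept it.
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


def diff_snapshots(old_text: str, new_text: str, key=lambda obj: obj["id"]) -> dict:
    """Return the keys of objects added, removed, or changed between snapshots."""
    old = {key(o): o for o in parse_json_stream(old_text)}
    new = {key(o): o for o in parse_json_stream(new_text)}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Note that any field that changes on every run would make an object show up as changed, so you may want to drop such fields before comparing.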

Answer 4

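I know I'm late to the party, but I ran into this problem today and wrote a PowerShell script that exports the DDL using the Google Cloud CLI.

It isn't perfect (and I don't use PowerShell often), but it works for me, and others can use it as a starting point.

The script creates one file per "object" (table, view, procedure, etc.), so you can add the files to source control.

Update the project ID, dataset, and output folder to match your setup.

As mentioned, you need to have the Google Cloud CLI installed and authenticated!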

# Set your Project and Dataset IDs
$projectId = "<YOUR PROJECT ID>"
$dataset = "<YOUR DATASET NAME>"
$output = "${PSScriptRoot}\<YOUR SUB-FOLDER>"

# Delete output files
Get-ChildItem -Path $output -File -Recurse | ForEach-Object { $_.Delete() }

# Get the list of all tables in the dataset
$tables = bq ls --format=prettyjson --project_id $projectId $dataset | ConvertFrom-Json

# Process each table
foreach ($table in $tables) {
    # Get the table ID
    $tableName = $table.tableReference.tableId
    $type = $table.type

    # Get the DDL of the table
    $query = "select ddl from ${projectId}.${dataset}.INFORMATION_SCHEMA.TABLES where table_name='${tableName}';"
    $ddl = bq query --format=prettyjson --use_legacy_sql=false $query | ConvertFrom-Json

    # Define the file name for saving the DDL
    $fileName = "${type}_${tableName}.sql"
    $fileName

    # Save the DDL to a file
    $contents = $ddl.ddl
    $contents | Out-File -FilePath "${output}\${fileName}"
}


# Get the list of all routines (Stored Procedures and Functions) in the dataset
$routines = bq ls --format=prettyjson --routines --project_id $projectId $dataset | ConvertFrom-Json

# Process each routine
foreach ($routine in $routines) {
    # Get routine type (PROCEDURE or FUNCTION)
    $type = $routine.routineType

    # Get the routine ID
    $routineId = $routine.routineReference.routineId

    # Get the DDL of the routine
    $query = "select ddl from ${projectId}.${dataset}.INFORMATION_SCHEMA.ROUTINES where routine_name='${routineId}';"
    $ddl = bq query --format=prettyjson --use_legacy_sql=false $query | ConvertFrom-Json

    # Define the file name for saving the DDL
    $fileName = "${type}_${routineId}.sql"
    $fileName

    # Save the DDL to a file
    $contents = $ddl.ddl
    $contents | Out-File -FilePath "${output}\${fileName}"
}

Hope this helps.

Answer 5

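We store all of our code in GitHub and deploy it to BigQuery automatically with GitHub Actions. That way, all of our code is managed in GitHub.

We use GitHub Actions to build a continuous deployment pipeline from GitHub to BigQuery.

This is the structure of our GitHub repository: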

data-project/
├── .github/workflows/
│   └── bigquery.yml
├── sql/
│   ├── models/
│   └── reports/
└── README.md

Create a YAML file in .github/workflows to control the continuous deployment pipeline. Here is an example:

name: bigquery queries

on:
  push:
    branches:
      - master
  workflow_dispatch:

jobs:
  run-queries:
    runs-on: ubuntu-latest

    steps:
      - name: checkout code
        uses: actions/checkout@v2

      # Authenticate with Google Cloud using service account key
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      # Set up gcloud with the authenticated account
      - name: setup gcloud
        uses: google-github-actions/setup-gcloud@v1
        with:
          project_id: ${{ secrets.GCP_PROJECT }}

      # Run your BigQuery queries
      - name: run bigquery queries
        run: |
          find ./sql -name '*.sql' | while read -r sql_file; do
            echo "Running $sql_file"
            bq query --use_legacy_sql=false < "$sql_file"
          done

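You need to create a service account with access to BigQuery and add its credentials to your GitHub repository secrets (GCP_SA_KEY and GCP_PROJECT in the workflow above). Whenever you merge a change to a .sql file in the sql folder, GitHub deploys the latest code to BigQuery. More details here: https://erikedin.com/2024/09/09/version-control-for-data-analysts/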
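If the shell loop in the workflow grows, the same deploy step could be written as a small Python script. This is a sketch under assumptions rather than what our pipeline runs: the function names are made up for illustration, and running files in sorted path order and stopping at the first failure are design choices that the workflow above does not make explicitly. It assumes the bq CLI is installed and authenticated, as in the workflow.

```python
import subprocess
from pathlib import Path


def find_sql_files(root) -> list:
    """Return every .sql file under root, sorted by path for a stable order."""
    return sorted(Path(root).rglob("*.sql"))


def run_with_bq(sql_file: Path) -> None:
    """Feed one file to `bq query` on stdin, as the workflow step does."""
    with sql_file.open("rb") as handle:
        subprocess.run(
            ["bq", "query", "--use_legacy_sql=false"],
            stdin=handle,
            check=True,  # raise CalledProcessError if bq exits non-zero
        )


def deploy(root, run=run_with_bq) -> list:
    """Run each file in order; the first failure raises and stops the loop."""
    deployed = []
    for sql_file in find_sql_files(root):
        print(f"Running {sql_file}")
        run(sql_file)
        deployed.append(sql_file)
    return deployed
```

Calling deploy("./sql") would then replace the find loop. The run parameter exists so the loop can be exercised with a stand-in function instead of calling bq.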

Another popular option is to use dbt to manage your code.
