Python for Azure Synapse Lake database

Question

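What I did in AWS: I have a data lake in AWS. I use AWS Glue Crawlers to infer the schema of data (.parquet format) stored in different partitions (e.g. year=2000/month=1/day=1/file1.parquet, year=2000/month=1/day=1/file2.parquet) and populate tables in an AWS Glue Catalog database. I query the tables in that database with AWS Athena via pyathena and use the data locally for some processing.

What I want to do: I want to replicate the above in Azure. I was able to create an Azure Synapse workspace, add a Lake database, and specify partition columns when adding tables. SQL commands run inside the Azure Synapse workspace work fine. However, I cannot query the database and tables from a Python script (on my local machine, Windows 11) and save the results in a DataFrame.

I looked into pyodbc and tried running the following code: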

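import pandas as pd
import pyodbc

# driver, server, database, username and password must be defined before this line
connection_string = f'DRIVER={driver};SERVER={server};PORT=1433;DATABASE={database};UID={username};PWD={password}'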

try:
    # Connect to the database
    conn = pyodbc.connect(connection_string)
    print("Connection successful!")

    cursor = conn.cursor()

    query = "SELECT TOP 10 * FROM dbo.tablename"

    cursor.execute(query)
    rows = cursor.fetchall()

    if rows:
        columns = [column[0] for column in cursor.description]
        df = pd.DataFrame.from_records(rows, columns=columns)
        print(df)
    else:
        print("No results found for the query.")

    cursor.close()
    conn.close()
    
except Exception as e:
    print(f"An error occurred: {e}")

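I get the following error: An error occurred: ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Cannot find the CREDENTIAL 'https://.dfs.core.windows.net//year= /month=/day=/.parquet', because it does not exist or you do not have permission. (15151) (SQLExecDirectW)")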

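Can someone help me solve this? Since I am very new to Azure, it would be very helpful if you could guide me from scratch (maybe I missed some steps while creating the Azure Synapse workspace and Lake database).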

Answer 1

As you mentioned, you want to query the Lake database and load the result into a DataFrame.

I tried the following approach:

Query the Lake database table and filter on the partition column:

df = spark.sql("SELECT * FROM `Database1`.`emp` WHERE dept = 'HR'")
df.show()
df.printSchema()

Result:

+---+----+----+
| id|name|dept|
+---+----+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- dept: string (nullable = false)
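
If you need to filter on several partition columns (for example year, month and day, as in your folder layout), building the query string by hand gets error-prone. Below is a minimal sketch of a helper that assembles the same kind of query as above. The function names (build_partition_query, quote_identifier, quote_literal) are hypothetical and made up for illustration, not part of any Synapse or Spark API, and the helper deliberately rejects values containing quotes or backslashes rather than trying to escape them.

```python
def quote_identifier(name: str) -> str:
    """Wrap a database, table, or column name in backticks for Spark SQL."""
    if "`" in name:
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def quote_literal(value) -> str:
    """Render a Python value as a SQL literal: numbers bare, strings quoted."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    # Assumption: refuse risky characters instead of guessing escape rules.
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported literal: {text!r}")
    return f"'{text}'"


def build_partition_query(database: str, table: str, filters: dict) -> str:
    """Build SELECT * with one equality condition per partition column."""
    query = f"SELECT * FROM {quote_identifier(database)}.{quote_identifier(table)}"
    conditions = [
        f"{quote_identifier(column)} = {quote_literal(value)}"
        for column, value in filters.items()
    ]
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query


query = build_partition_query("Database1", "emp", {"dept": "HR"})
print(query)

# Inside a Synapse notebook, where `spark` is already defined, the result
# can then be pulled into a pandas DataFrame:
# pandas_df = spark.sql(query).toPandas()
```

Note that toPandas() collects the whole result onto the driver, so it is best used after the partition filter has reduced the data to a manageable size.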
