Pandas / Polars: Writing a list of JSON strings to a database fails with "Object of type ndarray is not JSON serializable"


I have several JSON columns that I concatenate into a single column holding a list of JSON strings. The DataFrame looks like this:

┌─────────────────────────────────┐
│ json_concat                     │
│ ---                             │
│ list[str]                       │
╞═════════════════════════════════╡
│ ["{"integer_col":52,"string_co… │
│ ["{"integer_col":93,"string_co… │
│ ["{"integer_col":15,"string_co… │
│ ["{"integer_col":72,"string_co… │
│ ["{"integer_col":61,"string_co… │
│ ["{"integer_col":21,"string_co… │
│ ["{"integer_col":83,"string_co… │
│ ["{"integer_col":87,"string_co… │
│ ["{"integer_col":75,"string_co… │
│ ["{"integer_col":75,"string_co… │
└─────────────────────────────────┘

This is the output of Polars' glimpse():

Rows: 10
Columns: 1
$ json_concat <list[str]> ['{"integer_col":52,"string_col":"v"}', '{"float_col":86.61761457749351,"bool_col":true}', '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'], ['{"integer_col":93,"string_col":"l"}', '{"float_col":60.11150117432088,"bool_col":false}', '{"datetime_col":"2021-01-02 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":15,"string_col":"y"}', '{"float_col":70.80725777960456,"bool_col":false}', '{"datetime_col":"2021-01-03 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":72,"string_col":"q"}', '{"float_col":2.0584494295802447,"bool_col":true}', '{"datetime_col":"2021-01-04 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":61,"string_col":"j"}', '{"float_col":96.99098521619943,"bool_col":true}', '{"datetime_col":"2021-01-05 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":21,"string_col":"p"}', '{"float_col":83.24426408004217,"bool_col":true}', '{"datetime_col":"2021-01-06 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":83,"string_col":"o"}', '{"float_col":21.233911067827616,"bool_col":true}', '{"datetime_col":"2021-01-07 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":87,"string_col":"o"}', '{"float_col":18.182496720710063,"bool_col":true}', '{"datetime_col":"2021-01-08 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":75,"string_col":"s"}', '{"float_col":18.34045098534338,"bool_col":true}', '{"datetime_col":"2021-01-09 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":75,"string_col":"l"}', '{"float_col":30.42422429595377,"bool_col":true}', '{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}']

I want to write the JSON column to a table named testing. I tried both pd.DataFrame.to_sql() and pl.DataFrame.write_database(); both fail with a similar error.

Error

The most important part is this line: sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable

File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable
[SQL: INSERT INTO some_schema.testing (json_concat) VALUES (%(json_concat)s)]
[parameters: [{'json_concat': array(['{"integer_col":52,"string_col":"v"}',
       '{"float_col":86.61761457749351,"bool_col":true}',
       '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'],
      dtype=object)}, 
      # ... abbreviated
      dtype=object)}, {'json_concat': array(['{"integer_col":75,"string_col":"l"}',
       '{"float_col":30.42422429595377,"bool_col":true}',
       '{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}'],
      dtype=object)}]]
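
The [parameters: ...] part of the traceback shows that each cell reaches the driver as a NumPy object array, and Python's json encoder has no rule for that type. The following is a minimal sketch of that failure outside the database, assuming only numpy and the standard json module; the helper name try_dumps is made up for illustration.

```python
import json

import numpy as np

# One cell of json_concat as it appears in the traceback:
# an object ndarray holding JSON-encoded strings.
cell = np.array(
    [
        '{"integer_col":52,"string_col":"v"}',
        '{"float_col":86.61761457749351,"bool_col":true}',
    ],
    dtype=object,
)


def try_dumps(value):
    """Return the JSON text, or the TypeError message if encoding fails."""
    try:
        return json.dumps(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


as_array = try_dumps(cell)          # ndarray is not a type the encoder knows
as_list = try_dumps(cell.tolist())  # a plain list of str encodes fine

print(as_array)
print(as_list)
```

The second call succeeds because tolist() turns the array into a built-in list, which is one of the types json.dumps handles natively.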

Code that produces the error

(using pandas as the example)

df_pandas.to_sql(
    "testing",
    con=engines.engine,
    schema=schema,
    index=False,
    if_exists="append",
    dtype=DTYPE,
)

Question

How do I need to prepare the concatenated JSON column so that it becomes JSON serializable?

MRE (creates the example data)

from typing import Any
import numpy as np
import pandas as pd
import polars as pl
from myengines import engines
from sqlalchemy import dialects, text

schema = "some_schema"
# Seed for reproducibility
np.random.seed(42)

n = 10

# Generate random data
integer_col = np.random.randint(1, 100, n)
float_col = np.random.random(n) * 100
string_col = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), n)
bool_col = np.random.choice([True, False], n)
datetime_col = pd.date_range(start="2021-01-01", periods=n, freq="D")
categorical_col = np.random.choice(["Category1", "Category2", "Category3"], n)

# Creating the DataFrame
df = pl.DataFrame(
    {
        "integer_col": integer_col,
        "float_col": float_col,
        "string_col": string_col,
        "bool_col": bool_col,
        "datetime_col": datetime_col,
        "categorical_col": categorical_col,
    }
)



df = df.select(
    pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
    pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
    pl.struct(pl.col("datetime_col", "categorical_col"))
    .struct.json_encode()
    .alias("json3"),
).select(pl.concat_list(pl.col(["json1", "json2", "json3"])).alias("json_concat"))


DTYPE: dict[str, Any] = {"json_concat": dialects.postgresql.JSONB}
Tags: python json pandas sqlalchemy python-polars

1 Answer

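Unfortunately, there is no Polars function that serializes a list to a JSON array. Here is how to do it manually: join the already-encoded JSON strings with commas and wrap the result in square brackets, so the column becomes a single string per row instead of a list.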

df = df.select(
    pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
    pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
    pl.struct(pl.col("datetime_col", "categorical_col")).struct.json_encode().alias("json3"),
).select(
    pl.format("[{}]", pl.concat_list(pl.col(["json1", "json2", "json3"])).list.join(",")).alias("json_concat")
)

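from sqlalchemy import create_engine

# The connection URL is an example; adjust the credentials, host, and database to your setup.
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres", echo=True)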
df.write_database(
    "testing",
    connection=engine,
    if_table_exists="append",
)
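
What the pl.format("[{}]", ...list.join(",")) expression does to each row can be sketched in plain Python. This is an illustration using only the standard library, not the Polars implementation; the function name join_json_strings is hypothetical, and the sample row is copied from the glimpse output in the question.

```python
import json


def join_json_strings(parts):
    """Wrap already-encoded JSON documents in brackets, separated by commas."""
    return "[" + ",".join(parts) + "]"


# First row of json_concat from the question.
row = [
    '{"integer_col":52,"string_col":"v"}',
    '{"float_col":86.61761457749351,"bool_col":true}',
    '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}',
]

json_concat = join_json_strings(row)

# The result is one string that parses as a JSON array of three objects.
parsed = json.loads(json_concat)
print(len(parsed), parsed[0]["integer_col"], parsed[2]["categorical_col"])
```

Plain concatenation is only safe here because every element is already valid JSON produced by struct.json_encode(); joining arbitrary strings this way would not yield valid JSON.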

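Also, functions such as pl.struct and pl.concat_list parse plain strings as column names, so wrapping them in pl.col is not needed. Here is the cleaned-up code: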

df = df.select(
    pl.struct("integer_col", "string_col").struct.json_encode().alias("json1"),
    pl.struct("float_col", "bool_col").struct.json_encode().alias("json2"),
    pl.struct("datetime_col", "categorical_col").struct.json_encode().alias("json3"),
).select(
    pl.format("[{}]", pl.concat_list("json1", "json2", "json3").list.join(",")).alias("json_concat")
)
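
If you would rather stay on the pandas to_sql() route with the question's DTYPE mapping, a different preparation may fit better. As far as I understand, SQLAlchemy's JSON/JSONB types run each bound value through json.dumps, so a pre-joined string would be stored as a JSON string scalar rather than an array. The sketch below, standard library only, parses each element back into Python objects first; the helper name to_json_ready is hypothetical, and the pandas line in the comment is an untested assumption.

```python
import json


def to_json_ready(cell):
    """Parse a sequence of JSON-encoded strings into a list of Python objects."""
    return [json.loads(s) for s in cell]


row = [
    '{"integer_col":52,"string_col":"v"}',
    '{"float_col":86.61761457749351,"bool_col":true}',
]

ready = to_json_ready(row)  # list of dicts, serializable by json.dumps

# What a JSON-typed column would receive after json.dumps in each case:
from_objects = json.dumps(ready)                      # a JSON array
from_string = json.dumps("[" + ",".join(row) + "]")   # a JSON string (double-encoded)

print(type(json.loads(from_objects)).__name__)
print(type(json.loads(from_string)).__name__)

# Assumed pandas usage (not run here):
# df_pandas["json_concat"] = df_pandas["json_concat"].map(to_json_ready)
```

Because to_json_ready only iterates over its argument, it should accept the object ndarray cells from the traceback as well as plain lists.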