I have multiple JSON columns that I concatenate into a single column of JSON string arrays. The DataFrame looks like this:
┌─────────────────────────────────┐
│ json_concat │
│ --- │
│ list[str] │
╞═════════════════════════════════╡
│ ["{"integer_col":52,"string_co… │
│ ["{"integer_col":93,"string_co… │
│ ["{"integer_col":15,"string_co… │
│ ["{"integer_col":72,"string_co… │
│ ["{"integer_col":61,"string_co… │
│ ["{"integer_col":21,"string_co… │
│ ["{"integer_col":83,"string_co… │
│ ["{"integer_col":87,"string_co… │
│ ["{"integer_col":75,"string_co… │
│ ["{"integer_col":75,"string_co… │
└─────────────────────────────────┘
Here is the output of Polars' glimpse:
Rows: 10
Columns: 1
$ json_concat <list[str]> ['{"integer_col":52,"string_col":"v"}', '{"float_col":86.61761457749351,"bool_col":true}', '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'], ['{"integer_col":93,"string_col":"l"}', '{"float_col":60.11150117432088,"bool_col":false}', '{"datetime_col":"2021-01-02 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":15,"string_col":"y"}', '{"float_col":70.80725777960456,"bool_col":false}', '{"datetime_col":"2021-01-03 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":72,"string_col":"q"}', '{"float_col":2.0584494295802447,"bool_col":true}', '{"datetime_col":"2021-01-04 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":61,"string_col":"j"}', '{"float_col":96.99098521619943,"bool_col":true}', '{"datetime_col":"2021-01-05 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":21,"string_col":"p"}', '{"float_col":83.24426408004217,"bool_col":true}', '{"datetime_col":"2021-01-06 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":83,"string_col":"o"}', '{"float_col":21.233911067827616,"bool_col":true}', '{"datetime_col":"2021-01-07 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":87,"string_col":"o"}', '{"float_col":18.182496720710063,"bool_col":true}', '{"datetime_col":"2021-01-08 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":75,"string_col":"s"}', '{"float_col":18.34045098534338,"bool_col":true}', '{"datetime_col":"2021-01-09 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":75,"string_col":"l"}', '{"float_col":30.42422429595377,"bool_col":true}', '{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}']
I want to write the JSON column to a table named
testing
. I tried both pd.DataFrame.to_sql()
and pl.DataFrame.write_database()
; both fail with similar errors.
The most important part is this: sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable
[SQL: INSERT INTO some_schema.testing (json_concat) VALUES (%(json_concat)s)]
[parameters: [{'json_concat': array(['{"integer_col":52,"string_col":"v"}',
'{"float_col":86.61761457749351,"bool_col":true}',
'{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'],
dtype=object)},
# ... abbreviated
dtype=object)}, {'json_concat': array(['{"integer_col":75,"string_col":"l"}',
'{"float_col":30.42422429595377,"bool_col":true}',
'{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}'],
dtype=object)}]]
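The error can be reproduced outside SQLAlchemy: its JSONB parameter binding ultimately calls json.dumps on each cell, and json.dumps rejects NumPy arrays. A minimal sketch (not from the original traceback; the value is illustrative):

```python
import json

import numpy as np

# A NumPy object array of JSON strings, like each cell of json_concat.
# json.dumps cannot serialize ndarray, which is exactly what the
# traceback above reports.
cell = np.array(['{"integer_col":52,"string_col":"v"}'], dtype=object)

try:
    json.dumps(cell)
except TypeError as e:
    print(e)  # Object of type ndarray is not JSON serializable
```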
(pandas example)
df_pandas.to_sql(
"testing",
con=engines.engine,
schema=schema,
index=False,
if_exists="append",
dtype=DTYPE,
)
How do I need to prepare the concatenated JSON column so that it is JSON serializable?
from typing import Any
import numpy as np
import pandas as pd
import polars as pl
from myengines import engines
from sqlalchemy import dialects, text
schema = "some_schema"
# Seed for reproducibility
np.random.seed(42)
n = 10
# Generate random data
integer_col = np.random.randint(1, 100, n)
float_col = np.random.random(n) * 100
string_col = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), n)
bool_col = np.random.choice([True, False], n)
datetime_col = pd.date_range(start="2021-01-01", periods=n, freq="D")
categorical_col = np.random.choice(["Category1", "Category2", "Category3"], n)
# Creating the DataFrame
df = pl.DataFrame(
{
"integer_col": integer_col,
"float_col": float_col,
"string_col": string_col,
"bool_col": bool_col,
"datetime_col": datetime_col,
"categorical_col": categorical_col,
}
)
df = df.select(
pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
pl.struct(pl.col("datetime_col", "categorical_col"))
.struct.json_encode()
.alias("json3"),
).select(pl.concat_list(pl.col(["json1", "json2", "json3"])).alias("json_concat"))
DTYPE: dict[str, Any] = {"json_concat": dialects.postgresql.JSONB}
Unfortunately, there is no Polars function that serializes a list column into a JSON array. Here is how to do it manually:
df = df.select(
pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
pl.struct(pl.col("datetime_col", "categorical_col")).struct.json_encode().alias("json3"),
).select(
pl.format("[{}]", pl.concat_list(pl.col(["json1", "json2", "json3"])).list.join(",")).alias("json_concat")
)
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres", echo=True)
df.write_database(
"testing",
connection=engine,
if_table_exists="append",
)
Also, inside expressions strings are parsed as column names, so
pl.col
is not needed. Here is the cleaned-up code:
df = df.select(
pl.struct("integer_col", "string_col").struct.json_encode().alias("json1"),
pl.struct("float_col", "bool_col").struct.json_encode().alias("json2"),
pl.struct("datetime_col", "categorical_col").struct.json_encode().alias("json3"),
).select(
pl.format("[{}]", pl.concat_list("json1", "json2", "json3").list.join(",")).alias("json_concat")
)
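For the pandas path from the question, the same idea applies: the JSONB dtype calls json.dumps on each cell, so the ndarray of JSON strings has to become a plain Python object first. A sketch (names mirror the question; the to_sql call itself is left commented out since it needs a live engine):

```python
import json

import numpy as np
import pandas as pd

# One-row stand-in for the question's DataFrame: each cell is a NumPy
# object array of JSON strings, which is exactly what to_sql choked on.
cell = np.array(
    ['{"integer_col":52,"string_col":"v"}', '{"float_col":86.6,"bool_col":true}'],
    dtype=object,
)
df_pandas = pd.DataFrame({"json_concat": [cell]})

# Parse each string so the cell becomes a list of dicts; json.dumps
# (called by the JSONB dtype) handles plain lists and dicts fine.
df_pandas["json_concat"] = df_pandas["json_concat"].map(
    lambda arr: [json.loads(s) for s in arr]
)

# df_pandas.to_sql("testing", con=engines.engine, schema=schema,
#                  index=False, if_exists="append", dtype=DTYPE)
```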