Custom decoding of binary data in Polars

Question

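When working with binary data, I use a custom function to decode it. This requires using apply in Polars. Because the processing is element-wise, computation time increases significantly on large datasets.

I tried casting the binary data to List(UInt8), but this is not yet implemented.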

exceptions.ArrowErrorException: NotYetImplemented("Casting from LargeBinary to LargeList(Field { name: \"item\", data_type: UInt8, is_nullable: true, metadata: {} }) not supported")

Is there a more efficient way?

import polars as pl
import struct
import io

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00',b'\x10\x00\x20\x00\x30\x00'], "id": [1,2]}
schema = {"binary": pl.Binary, "id":pl.Int16}

df = pl.DataFrame(data, schema)

This returns:

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ binary        ┆ i16 │
╞═══════════════╪═════╡
│ [binary data] ┆ 1   │
│ [binary data] ┆ 2   │
└───────────────┴─────┘

Now, when we apply a function to decode the binary column:

def custom_decode(data):
   bytestream = io.BytesIO(data)
   lst = []

   while bytestream.tell() < 6:
      lst.append(struct.unpack('<H', bytestream.read(2))[0])

   return lst

df = df.with_columns(
      pl.col('binary').map_elements(lambda x: custom_decode(x))
   )

Result:

shape: (2, 2)
┌─────────────────┬─────┐
│ binary          ┆ id  │
│ ---             ┆ --- │
│ list[i64]       ┆ i16 │
╞═════════════════╪═════╡
│ [253, 254, 255] ┆ 1   │
│ [16, 32, 48]    ┆ 2   │
└─────────────────┴─────┘

Answer 1

I have added this cast upstream. As of the next Polars release (polars>=0.18.1), you can do this:

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00',b'\x10\x00\x20\x00\x30\x00'], "id": [1,2]}
schema = {"binary": pl.Binary, "id":pl.Int16}

(
    pl.DataFrame(data, schema)
    .with_columns(
        pl.col("binary").cast(pl.List(pl.UInt8))
    )
)

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ list[u8]      ┆ i16 │
╞═══════════════╪═════╡
│ [253, 0, … 0] ┆ 1   │
│ [16, 0, … 0]  ┆ 2   │
└───────────────┴─────┘
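
Note that the cast yields the raw bytes, not the 16-bit values shown in the question. A minimal sketch of the remaining arithmetic follows, written in plain Python for clarity rather than as a Polars expression. The helper name `bytes_to_u16_le` is hypothetical, and it assumes each row holds an even number of bytes in little-endian order, so every (low, high) pair becomes low + 256 * high.

```python
def bytes_to_u16_le(byte_values):
    """Combine consecutive (low, high) byte pairs into unsigned 16-bit integers."""
    if len(byte_values) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    lows = byte_values[0::2]   # bytes at even positions: low-order byte of each pair
    highs = byte_values[1::2]  # bytes at odd positions: high-order byte of each pair
    return [lo | (hi << 8) for lo, hi in zip(lows, highs)]


# The two rows from the example, as the lists of u8 values the cast produces.
row_1 = list(b'\xFD\x00\xFE\x00\xFF\x00')
row_2 = list(b'\x10\x00\x20\x00\x30\x00')

print(bytes_to_u16_le(row_1))  # [253, 254, 255]
print(bytes_to_u16_le(row_2))  # [16, 32, 48]
```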

Answer 2

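A few ideas that might give marginal improvements.

Instead of a while loop, use a list comprehension, since you know you are doing this 3 times. The improvement here is that the list size is allocated up front rather than appended to. Appending to a list is more expensive than starting from a preallocated list.
Do the loop inside your custom function and have it return a Series, so that you can use map instead of apply (I'm not sure what the relative overhead of a Polars loop versus a Python loop is).

You could write your function like this: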

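def custom_decode(data):
    # data is the whole column (a Series of bytes values), not a single element
    retL = [None] * len(data)
    for i, datum in enumerate(data):
        bytestream = io.BytesIO(datum)
        # read three little-endian unsigned 16-bit integers per row
        retL[i] = [struct.unpack('<H', bytestream.read(2))[0] for _ in range(3)]
    return pl.Series(retL)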

Then do:

df.with_columns([
    pl.col('binary').map(custom_decode)
])
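
As a further sketch of the first idea above, the per-row decode can be written without `io.BytesIO` or an explicit loop counter: `struct.iter_unpack` from the standard library walks the buffer in fixed-size steps. The name `decode_u16_le` is hypothetical, and this assumes every value has an even number of bytes (otherwise `struct.error` is raised). Whether it is actually faster on your data is something to measure rather than assume.

```python
import struct


def decode_u16_le(data):
    """Decode a bytes object into a list of little-endian unsigned 16-bit integers."""
    # iter_unpack yields one tuple per 2-byte chunk; each tuple holds a single value
    return [value for (value,) in struct.iter_unpack('<H', data)]


print(decode_u16_le(b'\xFD\x00\xFE\x00\xFF\x00'))  # [253, 254, 255]
print(decode_u16_le(b'\x10\x00\x20\x00\x30\x00'))  # [16, 32, 48]
```

Unlike the fixed `range(3)` version, this handles rows of any even length, so it could replace the inner list comprehension in either the element-wise or the batch function.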
