Convert a column from hex string to uint64?

Question

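As part of a Kaggle competition (https://www.kaggle.com/competitions/amex-default-prediction/overview), I'm trying to use a trick shared by other competitors in their solutions: reducing the size of a column by interpreting a hex string as a base-16 uint64. I'm trying to work out whether this is possible in Polars/Rust: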

# The python approach - this is used via .apply in pandas.
string = "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"
def func(x):
    return int(x[-16:], 16)
func(string)
# 13914591055249847850

My attempt at a solution in Polars yields almost the right answer, but the final digits are off, which is a bit confusing:

import polars as pl
def func(x: str) -> int:
    return int(x[-16:], 16)

strings = [
    "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a",
    "00000fd6641609c6ece5454664794f0340ad84dddce9a267a310b5ae68e9d8e5",
]

df = pl.DataFrame({"id": strings})

result_polars = df.with_columns(pl.col("id").map_elements(func).cast(pl.UInt64)).to_series().to_list()
result_python = [func(x) for x in strings]

result_polars, result_python
# ([13914591055249848320, 11750091188498716672],
#  [13914591055249847850, 11750091188498716901])

I also tried casting directly from utf-8 to uint64, but I get the following error, and passing strict=False yields null instead.

df.with_columns(pl.col("id").str.slice(-16).cast(pl.UInt64)).to_series().to_list()

# InvalidOperationError: conversion from `str` to `u64` failed 
# in column 'id' for 2 out of 2 values: ["c11a8ed792feb62a", "a310b5ae68e9d8e5"]

Answer 1

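The values you return from func are:

13914591055249847850
11750091188498716901

These values are larger than what pl.Int64 can represent, and pl.Int64 is what Polars uses for Python's int type. When a value overflows, Polars falls back to Float64 instead, but this results in a loss of precision.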
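The precision loss can be reproduced without Polars. The sketch below assumes only that the value passes through a 64-bit float on the way, as described above. Between 2^63 and 2^64, adjacent floats are 2048 apart, so the low digits cannot survive the round trip.

```python
# Value returned by func for the first ID in the question.
exact = 13914591055249847850

# Round-trip through a 64-bit float, mimicking the Float64 fallback.
via_float = int(float(exact))

print(via_float)          # 13914591055249848320, the value Polars reported
print(via_float - exact)  # 470
```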

A better solution

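Taking only the last 16 characters of the string throws away a lot of information, which means collisions can easily occur. It is better to use a hash function that tries to avoid collisions.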
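To make the collision risk concrete, here is a minimal sketch. The two IDs are made up for illustration and share only their last 16 characters; the helper name last16_as_int is likewise hypothetical.

```python
def last16_as_int(x: str) -> int:
    """Interpret the last 16 hex characters of x as an integer."""
    return int(x[-16:], 16)

# Two made-up 64-character IDs that differ in their first 48 characters.
suffix = "c11a8ed792feb62a"
id_a = "0" * 48 + suffix
id_b = "f" * 48 + suffix

# Distinct IDs, identical keys: everything before the suffix is ignored.
collide = id_a != id_b and last16_as_int(id_a) == last16_as_int(id_b)
print(collide)  # True
```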

You can use the hash expression. This gives you a higher-quality hash, and it will be faster because no Python code is run.

df.with_columns(
    pl.col("id").hash(seed=0)
)

shape: (2, 1)
┌─────────────────────┐
│ id                  │
│ ---                 │
│ u64                 │
╞═════════════════════╡
│ 478697168017298650  │
│ 7596707240263070258 │
└─────────────────────┘
