Replace substrings in a string using a list or dictionary in Polars (Python)

Question

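I'm currently working on an NLP model and optimizing its preprocessing steps. Because I'm using a custom Python function, Polars can't parallelize the operation.

I've tried a few things with Polars' `replace_all` and some `.when.then.otherwise` approaches, but haven't found a solution yet.

In this case I'm expanding contractions (e.g. I'm -> I am).

I currently use this: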

import re

import polars as pl

# These are only a few of the example contractions that I use.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

c_re = re.compile("(%s)" % "|".join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]

    return c_re.sub(replace, text)


df = pl.DataFrame({"Text": ["i'm i've, isn't"]})
df["Text"].map_elements(expandContractions)

Output

shape: (1, 1)
┌─────────────────────┐
│ Text                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘
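A side note on the regex approach above: the keys are joined into the pattern unescaped, so a key containing regex metacharacters (such as `.`) would be treated as a pattern rather than as literal text. The sketch below is one way to guard against that, using only the standard library. The `"e.g."` entry and the names `contractions`, `pattern`, and `expand_contractions` are made up for illustration and are not part of the original code.

```python
import re

# Hypothetical mapping for illustration; "e.g." contains regex metacharacters.
contractions = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "e.g.": "for example",
}

# Escape each key so it is matched literally, and try longer keys first so a
# short key cannot shadow a longer key that starts with the same characters.
keys = sorted(contractions, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))


def expand_contractions(text: str) -> str:
    # Replace every matched key with its value from the mapping.
    return pattern.sub(lambda m: contractions[m.group(0)], text)


result = expand_contractions("i'm sure, e.g. it isn't egg.")
print(result)
```

Without `re.escape`, the unescaped pattern `e.g.` would also match `egg.`, and the dictionary lookup in the replacement function would then fail with a `KeyError`.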

But I'd like to take full advantage of Polars' performance benefits, since the datasets I process are very large.

Performance test:

import functools

# This dict has 100+ key/value pairs in my test case
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

def base_case(sr: pl.Series) -> pl.Series:
    c_re = re.compile("(%s)" % "|".join(cList.keys()))
    def expandContractions(text, c_re=c_re):
        def replace(match):
            return cList[match.group(0)]

        return c_re.sub(replace, text)

    sr = sr.map_elements(expandContractions)
    return sr


def loop_case(sr: pl.Series) -> pl.Series:

    for old, new in cList.items():
        sr = sr.str.replace_all(old, new, literal=True)

    return sr



def iter_case(sr: pl.Series) -> pl.Series:
    sr = functools.reduce(
        lambda res, pair: res.str.replace_all(pair[0], pair[1], literal=True),
        cList.items(),
        sr,
    )
    return sr

They all return the same result. These are the average times over 15 loops, with about 10,000 samples of roughly 500 characters each.

Base case: 16.112362766265868
Loop case: 7.028670716285705
Iter case: 7.112465214729309

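So either of these two approaches is more than twice as fast, mainly thanks to the Polars API call `replace_all`. I ended up using the loop case, since that leaves me with one less module to import. See this question answered by jqurious.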

Answer 1

How about

(
    df['Text']
    .str.replace_all("i'm", "i am", literal=True)
    .str.replace_all("i've", "i have", literal=True)
    .str.replace_all("isn't", "is not", literal=True)
)

?

Or:

functools.reduce(
    lambda res, x: getattr(
        getattr(res, "str"), "replace_all"
    )(x[0], x[1], literal=True),
    cList.items(),
    df["Text"],
)
