Replace specific Chinese punctuation marks with the corresponding English punctuation marks in Python

Question  Votes: 0  Answers: 3

Given the following test dataset:

   id                   company
0   1                  xyz,ltd。
1   2  wall street english (bj)
2   3                 James(sh)
3   4                       NaN
4   5                    黑石(上海)

I need to replace the Chinese punctuation marks with the corresponding English ones:

( becomes (
) becomes )
。 becomes .
, becomes ,

I tried

dd.company.str.replace('(', '(').replace(')', ')').replace('。', '.').replace(',', ',')

but it is not a Pythonic solution, and it does not work either.

Out:

0                    xyz,ltd。
1    wall street english (bj)
2                   James(sh)
3                         NaN
4                      黑石(上海)
Name: company, dtype: object

How can I replace them correctly? Many thanks.

python-3.x regex pandas string dataframe
3 Answers
2
votes

One idea is to pass two lists or a dictionary and enable substring replacement with regex=True:

dd.company.replace(['(', ')', '。', ','], ['(', ')', '.', ','], regex=True)

dd.company.replace({'(': '(', ')': ')', '。': '.', ',': ','}, regex=True)

2
votes

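I suggest matching any punctuation (for example, with the regex [^\w\s], which matches any character that is neither a word character nor whitespace) and applying unidecode to each match. Chinese characters count as word characters, so they are left alone: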
import pandas as pd
import numpy as np
import unidecode

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'company': ['xyz,ltd。', 'wall street english (bj)', 'James(sh)', np.nan, '黑石(上海)']})
# A callable replacement needs regex=True (the default became regex=False in pandas 2.0)
>>> df['company'].str.replace(r'[^\w\s]+', lambda x: unidecode.unidecode(x.group()), regex=True)
0                   xyz,ltd. 
1    wall street english (bj)
2                   James(sh)
3                         NaN
4                      黑石(上海)
Name: company, dtype: object
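
If installing unidecode is not an option, a standard-library alternative (a different technique from the one in this answer) is Unicode NFKC normalization, which folds the full-width ASCII variants (U+FF01 to U+FF5E) to plain ASCII. It leaves CJK punctuation such as 。 untouched, so a small extra table is still needed. The sketch below assumes the hypothetical names CJK_EXTRA and to_ascii_punct, and its table is illustrative, not exhaustive. NFKC also folds other compatibility characters, such as full-width letters and digits, which may or may not be wanted.

```python
import unicodedata

# Extra CJK punctuation that NFKC does not fold; illustrative, not exhaustive.
CJK_EXTRA = str.maketrans({"。": ".", "、": ",", "【": "[", "】": "]"})

def to_ascii_punct(text):
    """Fold full-width forms via NFKC, then map a few CJK-only marks."""
    if not isinstance(text, str):  # pass NaN/None through unchanged
        return text
    return unicodedata.normalize("NFKC", text).translate(CJK_EXTRA)

print(to_ascii_punct("xyz,ltd。"))
print(to_ascii_punct("黑石(上海)"))
```

On a DataFrame this could be applied with df['company'].map(to_ascii_punct).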

0
votes

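Late to the party. I tried unidecode, but it did not work as well as I expected. Here is my mapping, which I think can help people looking for a customizable solution: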


# Expanded dictionary mapping Chinese and full-width punctuation to English equivalents.
# Entries are applied in order, so a replacement may be mapped again by a later entry.
punctuation_map = {
    "·": "-",      # Middle dot
    "。": ".",      # Ideographic full stop
    ",": ",",      # Comma (full-width)
    "、": ",",      # Enumeration comma
    ";": ";",      # Semicolon (full-width)
    ":": ":",      # Colon (full-width)
    "?": "?",      # Question mark (full-width)
    "!": "!",      # Exclamation mark (full-width)
    "。": ".",      # Ideographic full stop (half-width)
    "\uff02": "\"", # Quotation mark (full-width), escaped to avoid a quoting clash
    "#": "#",      # Hash (full-width)
    "$": "$",      # Dollar sign (full-width)
    "%": "%",      # Percent sign (full-width)
    "&": "&",      # Ampersand (full-width)
    "'": "'",      # Apostrophe (full-width)
    "(": "(",      # Left parenthesis (full-width)
    ")": ")",      # Right parenthesis (full-width)
    "*": "*",      # Asterisk (full-width)
    "+": "+",      # Plus sign (full-width)
    "-": "-",      # Hyphen-minus (full-width)
    "/": "/",      # Slash (full-width)
    "<": "<",      # Less-than sign (full-width)
    "=": "=",      # Equals sign (full-width)
    ">": ">",      # Greater-than sign (full-width)
    "@": "@",      # At sign (full-width)
    "[": "[",      # Left square bracket (full-width)
    "\uff3c": "\\", # Backslash (full-width), escaped to avoid a quoting clash
    "]": "]",      # Right square bracket (full-width)
    "^": "^",      # Caret (full-width)
    "_": "_",      # Underscore (full-width)
    "`": "`",      # Backtick (full-width)
    "{": "{",      # Left curly brace (full-width)
    "|": "|",      # Pipe (full-width)
    "}": "}",      # Right curly brace (full-width)
    "~": "~",      # Tilde (full-width)
    "⦅": "[",      # Left white parenthesis (full-width)
    "⦆": "]",      # Right white parenthesis (full-width)
    "「": "「",      # Left corner bracket (half-width), widened here and converted below
    "」": "」",      # Right corner bracket (half-width), widened here and converted below
    "、": ",",      # Ideographic comma (half-width)
    "〃": "\"",     # Ditto mark
    "》": "\"",     # Right double angle bracket (book title mark)
    "《": "\"",     # Left double angle bracket (book title mark)
    "「": "“",      # Left corner bracket
    "」": "”",      # Right corner bracket
    "『": "‘",      # Left white corner bracket
    "』": "’",      # Right white corner bracket
    "【": "[",      # Left black lenticular bracket
    "】": "]",      # Right black lenticular bracket
    "〔": "[",      # Left tortoise shell bracket
    "〕": "]",      # Right tortoise shell bracket
    "〖": "[",      # Left white lenticular bracket
    "〗": "]",      # Right white lenticular bracket
    "〘": "[",      # Left white tortoise shell bracket
    "〙": "]",      # Right white tortoise shell bracket
    "〚": "[",      # Left white square bracket
    "〛": "]",      # Right white square bracket
    "〜": "~",      # Wave dash
    "〝": "\"",     # Reversed double prime quotation mark
    "〞": "\"",     # Double prime quotation mark
    "〟": "\"",     # Low double prime quotation mark
    "〰": "-",      # Wavy dash
    "〾": "~",      # Ideographic variation indicator
    "〿": "~",      # Ideographic half fill space
    "–": "-",      # En dash
    "—": "-",      # Em dash
    "‘": "'",      # Left single quotation mark
    "’": "'",      # Right single quotation mark
    "‛": "'",      # Single high-reversed-9 quotation mark
    "“": "\"",     # Left double quotation mark
    "”": "\"",     # Right double quotation mark
    "„": "\"",     # Double low-9 quotation mark
    "‟": "\"",     # Double high-reversed-9 quotation mark
    "…": "...",    # Horizontal ellipsis
    "‧": ".",      # Hyphenation point
    "﹏": "_",      # Wavy low line
    "»": ">",      # Right-pointing double angle quotation mark
    "«": "<",      # Left-pointing double angle quotation mark
    "‹": "<",      # Single left-pointing angle quotation mark
    "›": ">",      # Single right-pointing angle quotation mark
    "〈": "<",      # Left angle bracket (CJK)
    "〉": ">",      # Right angle bracket (CJK)
}

def convert_punctuation(text):
    # Replace each Chinese and full-width punctuation character with its English equivalent
    for ch, replacement in punctuation_map.items():
        text = text.replace(ch, replacement)
        
    return text

Usage:


import pandas as pd

data = {
    'company': [
        '上海·阿里巴巴…', 
        '深圳·腾讯〉', 
        'xyz,ltd。',
        'wall street english (bj)',
        '黑石(上海)'
    ]
}
df = pd.DataFrame(data, columns=['company'])

df['company'] = df['company'].apply(convert_punctuation)
print(df['company'])

Output:

0                  上海-阿里巴巴...
1                      深圳-腾讯>
2                    xyz,ltd.
3    wall street english (bj)
4                      黑石(上海)
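
Since every key in the mapping is meant to be a single character, the loop in convert_punctuation could probably be replaced by a single str.translate call. Below is a minimal sketch with a shortened, hypothetical map (small_map and convert_punctuation_fast are illustrative names, not part of the answer). One difference to keep in mind: translation is a single pass, so a replacement is not mapped again by a later entry the way it can be in the loop.

```python
# Shortened stand-in for punctuation_map; hypothetical, for illustration only.
small_map = {"(": "(", ")": ")", "。": ".", ",": ",", "…": "...", "·": "-"}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
table = str.maketrans(small_map)

def convert_punctuation_fast(text):
    # One pass over the string; characters missing from the table are kept as-is.
    return text.translate(table)

print(convert_punctuation_fast("上海·阿里巴巴…"))
print(convert_punctuation_fast("黑石(上海)"))
```

With pandas, the same table can be passed to df['company'].str.translate(table), which also leaves missing values alone.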