Replacing specific Chinese punctuation marks with their English equivalents in Python

Question

Given the following test dataset:

   id                   company
0   1                  xyz，ltd。
1   2  wall street english （bj）
2   3                 James（sh）
3   4                       NaN
4   5                    黑石（上海）

I need to replace the Chinese punctuation marks with the corresponding English ones: ( for （, ) for ）, . for 。, and , for ，.

I tried

dd.company.str.replace('（', '(').replace('）', ')').replace('。', '.').replace('，', ',')

but this is not a Pythonic solution, and it does not work either.

Output:

0                    xyz，ltd。
1    wall street english (bj）
2                   James(sh）
3                         NaN
4                      黑石(上海）
Name: company, dtype: object

How can I replace them correctly? Thanks a lot.

Answer 1

One idea is to use 2 lists or a dictionary, together with regex=True so that substrings are replaced:

dd.company.replace(['（','）', '。', '，'], ['(',')','.', ','], regex=True)

dd.company.replace({'（':'(', '）':')', '。':'.', '，':','}, regex=True)
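
The attempt in the question fails because only its first call is Series.str.replace; the chained .replace calls are Series.replace, which without regex=True compares whole cell values. A minimal sketch of the difference, using a made-up two-row Series (the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["James（sh）", "（"])

# Series.replace without regex=True only matches entire cell values,
# so the first row is left untouched.
whole_cell = s.replace("（", "(")
print(whole_cell.tolist())   # ['James（sh）', '(']

# Series.str.replace works on substrings; regex=False treats the
# pattern as a literal string.
substring = (
    s.str.replace("（", "(", regex=False)
     .str.replace("）", ")", regex=False)
)
print(substring.tolist())    # ['James(sh)', '(']
```

Passing regex=True to Series.replace, as in the two lines above, switches it to substring matching, which is why a single call with lists or a dictionary is enough.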

Answer 2

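I suggest matching any punctuation (for example, with the [^\w\s] regex, which matches any character that is neither a word character nor whitespace) and applying unidecode to each match: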

import pandas as pd
import numpy as np
import unidecode  # third-party package: pip install Unidecode

df = pd.DataFrame({'id':[1,2,3,4,5], 'company': ['xyz，ltd。', 'wall street english （bj）', 'James（sh）', np.nan, '黑石（上海）']})
# A callable replacement requires regex=True (the default is regex=False since pandas 2.0)
>>> df['company'].str.replace(r'[^\w\s]+', lambda x: unidecode.unidecode(x.group()), regex=True)
0                   xyz,ltd. 
1    wall street english (bj)
2                   James(sh)
3                         NaN
4                      黑石(上海)
Name: company, dtype: object
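
To see why the [^\w\s]+ pattern leaves the Chinese characters alone, it may help to run it with the standard re module: for str patterns in Python 3, \w matches Unicode word characters, which includes CJK ideographs. A small sketch (the PUNCT name and the sample list are illustrative, not part of the answer):

```python
import re

# Same pattern as in the answer: one or more characters that are
# neither word characters (\w) nor whitespace (\s).
PUNCT = re.compile(r'[^\w\s]+')

samples = ['xyz，ltd。', 'wall street english （bj）', '黑石（上海）']

# Collect the runs the pattern would hand to the replacement callable.
matches = [PUNCT.findall(s) for s in samples]
print(matches)
# [['，', '。'], ['（', '）'], ['（', '）']]
```

Note that the pattern also matches ASCII punctuation, so every punctuation run in a cell is passed through the callable, not only the Chinese ones.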

Answer 3

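Late to the party. I tried unidecode, but it did not work as well as I expected, so here is my mapping. I think it can help people looking for a customizable solution.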


# Expanded dictionary for Chinese and full-width punctuation to English equivalents
punctuation_map = {
    "·": "-",     # Middle dot
    "。": ".",      # Full stop
    "，": ",",      # Comma
    "、": ",",      # Enumeration comma (same as comma)
    "；": ";",      # Semicolon
    "：": ":",      # Colon
    "？": "?",      # Question mark
    "！": "!",      # Exclamation mark
    "｡": ".",      # Japanese full stop (｡)
    "＂": "\"",     # Quotation mark (full-width)
    "＃": "#",      # Hash (full-width)
    "＄": "$",      # Dollar sign (full-width)
    "％": "%",      # Percent sign (full-width)
    "＆": "&",      # Ampersand (full-width)
    "＇": "'",     # Apostrophe (full-width)
    "（": "(",   # Parentheses (full-width)
    "）": ")",   # Parentheses (full-width)
    "＊": "*",      # Asterisk (full-width)
    "＋": "+",      # Plus sign (full-width)
    "，": ",",      # Comma (full-width)
    "－": "-",      # Minus (full-width)
    "／": "/",      # Slash (full-width)
    "：": ":",      # Colon (full-width)
    "；": ";",      # Semicolon (full-width)
    "＜": "<",      # Less than sign (full-width)
    "＝": "=",      # Equals sign (full-width)
    "＞": ">",      # Greater than sign (full-width)
    "＠": "@",      # At sign (full-width)
    "［": "[",      # Left square bracket (full-width)
    "＼": "\\",     # Backslash (full-width)
    "］": "]",      # Right square bracket (full-width)
    "＾": "^",      # Caret (full-width)
    "＿": "_",      # Underscore (full-width)
    "｀": "`",      # Backtick (full-width)
    "｛": "{",      # Left curly brace (full-width)
    "｜": "|",      # Pipe (full-width)
    "｝": "}",      # Right curly brace (full-width)
    "～": "~",      # Tilde (full-width)
    "｟": "[",      # Left white parenthesis (full-width)
    "｠": "]",      # Right white parenthesis (full-width)
    "｢": "「",      # Left corner bracket (half-width); "「" is mapped again further down
    "｣": "」",      # Right corner bracket (half-width); "」" is mapped again further down
    "､": ",",      # Ideographic comma (half-width)
    "、": ",",      # Japanese enumeration comma
    "〃": "\"",     # Ditto mark (used to repeat a phrase)
    "》": "\"",      # Chinese right angle quotation mark
    "《": "\"",      # Chinese left angle quotation mark
    "「": "“",      # Left Chinese quotation mark
    "」": "”",      # Right Chinese quotation mark
    "『": "‘",      # Left corner quotation mark
    "』": "’",      # Right corner quotation mark
    "【": "[",      # Left square bracket (Chinese)
    "】": "]",      # Right square bracket (Chinese)
    "〔": "[",      # Left tortoise shell bracket
    "〕": "]",      # Right tortoise shell bracket
    "〖": "[",      # Left white lenticular bracket
    "〗": "]",      # Right white lenticular bracket
    "〘": "[",      # Left white tortoise shell bracket
    "〙": "]",      # Right white tortoise shell bracket
    "〚": "[",      # Left white square bracket
    "〛": "]",      # Right white square bracket
    "〜": "~",      # Wave dash
    "〝": "\"",     # Reversed double prime quotation mark
    "〞": "\"",     # Double prime quotation mark
    "〟": "\"",     # Low double prime quotation mark
    "〰": "-",      # Wavy dash
    "〾": "~",      # Ideographic variation indicator
    "〿": "~",      # Ideographic half fill space
    "–": "-",      # En dash (similar to English dash)
    "—": "-",     # Em dash (double dash)
    "‘": "'",      # Left single quotation mark (English)
    "’": "'",      # Right single quotation mark (English)
    "‛": "'",      # Reversed single quotation mark
    "“": "\"",     # Left double quotation mark
    "”": "\"",     # Right double quotation mark
    "„": "\"",     # Low double quotation mark
    "‟": "\"",     # Reversed low double quotation mark
    "…": "...",    # Ellipsis
    "‧": ".",      # Dotted symbol, similar to period
    "﹏": "_",      # Low line or underscore
    "»": ">",     # Right angle quotation mark
    "«": "<",      # Left angle quotation mark
    "‹": "<",      # Left angle bracket
    "›": ">",      # Right angle bracket
    "〈": "<",      # Left angle bracket (Chinese)
    "〉": ">",      # Right angle bracket (Chinese)
}

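def convert_punctuation(text):
    # Replace each Chinese and full-width punctuation character with its English equivalent.
    # Replacements run in dictionary order, so a replacement that is itself a key
    # further down (e.g. “ produced from 「) is converted again by the later entry.
    for ch, replacement in punctuation_map.items():
        text = text.replace(ch, replacement)

    return text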
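
As a possible variation on the loop above, the same kind of mapping can be applied in one pass with str.translate. This is only a sketch under stated assumptions: SMALL_MAP is a hypothetical, shortened mapping used for illustration, and convert_punctuation_fast is a made-up name. It also passes non-string values such as NaN through unchanged, which the loop version does not handle.

```python
# Hypothetical, shortened mapping for illustration; extend it as needed.
SMALL_MAP = {"（": "(", "）": ")", "。": ".", "，": ",", "…": "..."}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
TRANSLATION_TABLE = str.maketrans(SMALL_MAP)

def convert_punctuation_fast(text):
    """Translate mapped characters in one pass; return non-strings (e.g. NaN) as-is."""
    if not isinstance(text, str):
        return text
    return text.translate(TRANSLATION_TABLE)

print(convert_punctuation_fast("黑石（上海）"))  # 黑石(上海)
```

Unlike the loop, str.translate does not chain replacements: each input character is looked up exactly once, so entries such as "「": "“" would have to map straight to the final ASCII character. With pandas it can be used the same way, for example df['company'].apply(convert_punctuation_fast).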

Usage:


import pandas as pd

data = {
    'company': [
        '上海·阿里巴巴…', 
        '深圳·腾讯〉', 
        'xyz，ltd。',
        'wall street english （bj）',
        '黑石（上海）'
    ]
}
df = pd.DataFrame(data, columns=['company'])

df['company'] = df['company'].apply(convert_punctuation)
print(df['company'])

Output:

0                  上海-阿里巴巴...
1                      深圳-腾讯>
2                    xyz,ltd.
3    wall street english (bj)
4                      黑石(上海)
