Given a test dataset as follows:
   id                   company
0   1                  xyz,ltd。
1   2  wall street english (bj)
2   3                 James(sh)
3   4                       NaN
4   5                    黑石(上海)
I need to replace the Chinese punctuation marks with the corresponding English ones: ( for (, ) for ), . for 。, and , for ,.
I tried the following, which is not a Pythonic solution and does not work either:
dd.company.str.replace('(', '(').replace(')', ')').replace('。', '.').replace(',', ',')
Out:
0                   xyz,ltd。
1    wall street english (bj)
2                  James(sh)
3                         NaN
4                     黑石(上海)
Name: company, dtype: object
How can I replace them correctly? Many thanks.
One idea is to pass two lists or a dictionary to Series.replace together with regex=True, which makes it replace substrings:
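# two lists: characters to find, and their replacements
dd.company.replace(['(', ')', '。', ','], ['(', ')', '.', ','], regex=True)
# or a dictionary mapping each character to its replacement
dd.company.replace({'(': '(', ')': ')', '。': '.', ',': ','}, regex=True)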
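As a rough illustration of what the dictionary form amounts to, here is the same substring substitution sketched in plain Python with the standard `re` module. This is not how pandas implements `replace`; the names `PUNCT_MAP`, `PATTERN`, and `to_ascii_punct` are made up for this example. Note that without `regex=True`, `Series.replace` only swaps cells whose entire value equals a key, which is why the later `.replace(...)` calls in the chained attempt from the question had no effect.

```python
import re

# Hypothetical mapping: full-width punctuation -> ASCII equivalent.
PUNCT_MAP = {"(": "(", ")": ")", "。": ".", ",": ","}

# One alternation of all keys; re.escape guards against regex metacharacters.
PATTERN = re.compile("|".join(re.escape(ch) for ch in PUNCT_MAP))

def to_ascii_punct(text):
    """Replace every mapped character in a single pass over the string."""
    return PATTERN.sub(lambda m: PUNCT_MAP[m.group()], text)

samples = ["xyz,ltd。", "James(sh)", "黑石(上海)"]
converted = [to_ascii_punct(s) for s in samples]
print(converted)
```

Strings that contain none of the mapped characters come back unchanged.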
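I suggest matching any punctuation (for example, with the regex [^\w\s], which matches any character that is neither a word character nor whitespace) and applying unidecode to each match: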
import pandas as pd
import numpy as np
import unidecode  # third-party package: pip install Unidecode
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'company': ['xyz,ltd。', 'wall street english (bj)', 'James(sh)', np.nan, '黑石(上海)']})
# regex=True is needed for a callable replacement (the default is regex=False since pandas 2.0)
>>> df['company'].str.replace(r'[^\w\s]+', lambda x: unidecode.unidecode(x.group()), regex=True)
0 xyz,ltd.
1 wall street english (bj)
2 James(sh)
3 NaN
4 黑石(上海)
Name: company, dtype: object
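To see which characters the `[^\w\s]+` pattern actually hands to the callback, the sketch below runs it with the standard-library `re` module. In Python 3, `\w` is Unicode-aware for `str` patterns, so CJK characters such as 黑石 count as word characters and are left alone. As a stand-in for `unidecode` (an assumption, made only so the example needs no third-party package), it uses `unicodedata.normalize('NFKC', ...)`, which folds full-width forms such as ( and , to ASCII but leaves the ideographic full stop 。 unchanged. `PUNCT_RUN` and `fold_punct` are hypothetical names.

```python
import re
import unicodedata

# Runs of characters that are neither word characters nor whitespace.
PUNCT_RUN = re.compile(r"[^\w\s]+")

def fold_punct(text):
    # NFKC is a swapped-in substitute for unidecode, applied only to matched punctuation.
    return PUNCT_RUN.sub(lambda m: unicodedata.normalize("NFKC", m.group()), text)

matches = PUNCT_RUN.findall("黑石(上海)")   # only the two parentheses match
folded = fold_punct("黑石(上海)")           # full-width parentheses become ASCII
partly_folded = fold_punct("xyz,ltd。")     # comma is folded, 。 is not
print(matches, folded, partly_folded)
```

The last case shows why the choice of conversion function matters: a character with no compatibility mapping passes through NFKC untouched, so it would still need an explicit mapping.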
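Late to the party. I tried unidecode, but it did not work as well as I expected, so here is my own mapping. I think it can help people looking for a customizable solution: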
# Expanded dictionary for Chinese and full-width punctuation to English equivalents
punctuation_map = {
    "·": "-",    # Middle dot
    "。": ".",   # Ideographic full stop
    ",": ",",   # Comma (full-width)
    "、": ",",   # Enumeration comma (mapped to a comma)
    ";": ";",   # Semicolon (full-width)
    ":": ":",   # Colon (full-width)
    "?": "?",   # Question mark (full-width)
    "!": "!",   # Exclamation mark (full-width)
    "。": ".",    # Ideographic full stop (half-width)
    """: "\"",  # Quotation mark (full-width)
    "#": "#",   # Hash (full-width)
    "$": "$",   # Dollar sign (full-width)
    "%": "%",   # Percent sign (full-width)
    "&": "&",   # Ampersand (full-width)
    "'": "'",   # Apostrophe (full-width)
    "(": "(",   # Left parenthesis (full-width)
    ")": ")",   # Right parenthesis (full-width)
    "*": "*",   # Asterisk (full-width)
    "+": "+",   # Plus sign (full-width)
    "-": "-",   # Hyphen-minus (full-width)
    "/": "/",   # Slash (full-width)
    "<": "<",   # Less-than sign (full-width)
    "=": "=",   # Equals sign (full-width)
    ">": ">",   # Greater-than sign (full-width)
    "@": "@",   # At sign (full-width)
    "[": "[",   # Left square bracket (full-width)
    "\": "\\",  # Backslash (full-width)
    "]": "]",   # Right square bracket (full-width)
    "^": "^",   # Caret (full-width)
    "_": "_",   # Underscore (full-width)
    "`": "`",   # Backtick (full-width)
    "{": "{",   # Left curly brace (full-width)
    "|": "|",   # Pipe (full-width)
    "}": "}",   # Right curly brace (full-width)
    "~": "~",   # Tilde (full-width)
    "⦅": "[",    # Left white parenthesis
    "⦆": "]",    # Right white parenthesis
    "「": "\"",   # Left corner bracket (half-width)
    "」": "\"",   # Right corner bracket (half-width)
    "、": ",",    # Ideographic comma (half-width)
    "〃": "\"",  # Ditto mark
    "》": "\"",  # Right double angle bracket
    "《": "\"",  # Left double angle bracket
    # Quotation brackets map straight to ASCII quotes so the result
    # does not depend on the order in which entries are applied.
    "「": "\"",  # Left corner bracket (Chinese quotation)
    "」": "\"",  # Right corner bracket (Chinese quotation)
    "『": "'",   # Left white corner bracket
    "』": "'",   # Right white corner bracket
    "【": "[",   # Left black lenticular bracket
    "】": "]",   # Right black lenticular bracket
    "〔": "[",   # Left tortoise shell bracket
    "〕": "]",   # Right tortoise shell bracket
    "〖": "[",   # Left white lenticular bracket
    "〗": "]",   # Right white lenticular bracket
    "〘": "[",   # Left white tortoise shell bracket
    "〙": "]",   # Right white tortoise shell bracket
    "〚": "[",   # Left white square bracket
    "〛": "]",   # Right white square bracket
    "〜": "~",   # Wave dash
    "〝": "\"",  # Reversed double prime quotation mark
    "〞": "\"",  # Double prime quotation mark
    "〟": "\"",  # Low double prime quotation mark
    "〰": "-",   # Wavy dash
    "〾": "~",   # Ideographic variation indicator
    "〿": "~",   # Ideographic half fill space
    "–": "-",    # En dash
    "—": "-",    # Em dash
    "‘": "'",    # Left single quotation mark
    "’": "'",    # Right single quotation mark
    "‛": "'",    # Reversed single quotation mark
    "“": "\"",   # Left double quotation mark
    "”": "\"",   # Right double quotation mark
    "„": "\"",   # Low double quotation mark
    "‟": "\"",   # Reversed double quotation mark
    "…": "...",  # Ellipsis
    "‧": ".",    # Hyphenation point
    "﹏": "_",   # Wavy low line
    "»": ">",    # Right double angle quotation mark
    "«": "<",    # Left double angle quotation mark
    "‹": "<",    # Left single angle quotation mark
    "›": ">",    # Right single angle quotation mark
    "〈": "<",   # Left angle bracket (Chinese)
    "〉": ">",   # Right angle bracket (Chinese)
}
def convert_punctuation(text):
    # Replace each Chinese and full-width punctuation character with its English equivalent
    for ch, replacement in punctuation_map.items():
        text = text.replace(ch, replacement)
    return text
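Because the loop above calls `str.replace` once per dictionary entry, each string is scanned many times, and a replacement can be rewritten again by a later entry whose key matches it. A single-pass variant can be sketched with `str.maketrans` and `str.translate`. The small `sample_map` below is an illustrative subset rather than the full table, and `TABLE` and `convert_punctuation_fast` are hypothetical names.

```python
# Illustrative subset of the mapping; every key must be exactly one character.
sample_map = {"·": "-", "。": ".", ",": ",", "(": "(", ")": ")", "…": "...", "〉": ">"}

# str.maketrans accepts a dict of single characters -> replacement strings.
TABLE = str.maketrans(sample_map)

def convert_punctuation_fast(text):
    """Translate all mapped characters in one pass; outputs are never re-scanned."""
    return text.translate(TABLE)

result = [convert_punctuation_fast(s) for s in ["上海·阿里巴巴…", "深圳·腾讯〉", "黑石(上海)"]]
print(result)
```

One trade-off of this design: `str.maketrans` rejects keys longer than one character, so multi-character sequences would still need `str.replace` or a regex.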
Usage:
import pandas as pd
data = {
    'company': [
        '上海·阿里巴巴…',
        '深圳·腾讯〉',
        'xyz,ltd。',
        'wall street english (bj)',
        '黑石(上海)'
    ]
}
df = pd.DataFrame(data, columns=['company'])
df['company'] = df['company'].apply(convert_punctuation)
print(df['company'])
Output:
0 上海-阿里巴巴...
1 深圳-腾讯>
2 xyz,ltd.
3 wall street english (bj)
4 黑石(上海)
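The sample above contains no missing values, but the data in the question has a NaN row, and calling `text.replace` on a float raises `AttributeError`. One possible guard, sketched under the assumption that non-string cells should pass through untouched, is a small wrapper. The names `skip_non_strings`, `convert`, and the two-entry `demo_map` are hypothetical stand-ins for this example.

```python
demo_map = {"(": "(", ")": ")"}  # tiny stand-in for the full punctuation_map

def convert(text):
    # Same loop-and-replace approach as convert_punctuation, on the demo map.
    for ch, replacement in demo_map.items():
        text = text.replace(ch, replacement)
    return text

def skip_non_strings(func):
    """Wrap func so that non-string values (e.g. float NaN) are returned as-is."""
    def wrapper(value):
        return func(value) if isinstance(value, str) else value
    return wrapper

safe_convert = skip_non_strings(convert)
cells = ["James(sh)", float("nan"), "黑石(上海)"]
cleaned = [safe_convert(c) for c in cells]
print(cleaned)
```

With pandas, `df['company'].map(convert_punctuation, na_action='ignore')` should have a similar effect, since `na_action='ignore'` propagates NaN values without passing them to the function.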