Python: using re.sub to search a string and replace it from a dictionary

Question

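I have a Python script that reads a CSV file, and on one column of that file I want to search and replace using a dictionary. My code works when the field matches 100% in both the CSV and the dictionary. My problem is that in the dictionary, the key that should match the CSV string can sit inside a long comma-separated string, so I want to search within the dictionary keys and, on any partial match, substitute the value from the dictionary. For example, newdata in the code below looks like this: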

u'46764967051,46490797539,4639238933': u'google.com', u'46104376787335,46739600111': u'bt.se', u'46700961026,4638138399': u'lake.se'

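My CSV field 2 holds the number I want to run re.sub on. Say the CSV field contains the number 4638138399. In that case I want to search the dictionary (newdata) and change the number to the domain "lake.se", because the number appears in the last dictionary entry. So my question is what I can change on the line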

domain = re.sub(domain, lambda find_all: newdata.get(find_all.group(0), domain), domain)

to make it find any match, not only an exact match?

My code:

import codecs
import csv
import re

from pymongo import MongoClient

client = MongoClient('mongodb://ip-addr:27017/user')
db = client['user']

x   = []
cur = db.user.find()
for i in cur:
    x.append(i)

newdata = {}    
for entry in x:
    numbers = entry.pop('numbers')
    numbers = numbers.replace("+","")
    domain = entry.pop('domain')
    newdata[numbers] = domain

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')


reader = unicode_csv_reader(codecs.open("201807_12_49333_N29069.csv",
                                        encoding="iso-8859-1"))
for row in reader:
    domain = row[2].encode('ascii') 
    domain = str(domain)
    domain = re.sub(domain, lambda find_all: newdata.get(find_all.group(0), domain), domain)
    row[2] = domain
    print(row[2], row[3]) 
python mongodb subprocess
1 Answer

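One approach is to restructure newdata so that, instead of each key consisting of several comma-separated numbers, every number gets its own key. This makes sense because a dictionary entry is most easily looked up by its exact key, not by a substring of a key. Just replace the line newdata[numbers] = domain with for n in numbers.split(','): newdata[n] = domain. Here is a self-contained example: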

import re

x = [
    dict(numbers = u'46764967051,46490797539,4639238933',
        domain = u'google.com'),
    dict(numbers = u'46104376787335,46739600111',
        domain = u'bt.se'),
    dict(numbers = u'46700961026,4638138399',
        domain = u'lake.se')]
newdata = {}
for entry in x:
    numbers = entry.pop('numbers')
    numbers = numbers.replace("+","")
    domain = entry.pop('domain')
    for n in numbers.split(','):
        newdata[n] = domain

s = "my favorite site is 46490797539"
s = re.sub(r"\d+", lambda m: newdata[m.group(0)], s)
print(s)
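
Note that the example above indexes newdata directly, so a number that is missing from the dictionary would raise a KeyError. The following is a minimal sketch of one way to leave unknown numbers unchanged, assuming that is the desired fallback. The helper names build_lookup and replace_numbers are hypothetical and do not appear in the original code.

```python
import re


def build_lookup(entries):
    """Map every individual number to its domain (hypothetical helper)."""
    lookup = {}
    for entry in entries:
        # Strip "+" as the original code does, then split the comma-separated key.
        numbers = entry['numbers'].replace("+", "")
        for n in numbers.split(','):
            lookup[n.strip()] = entry['domain']
    return lookup


def replace_numbers(text, lookup):
    """Replace each run of digits found in lookup; keep unknown numbers as-is."""
    # dict.get with the matched text as default avoids a KeyError on misses.
    return re.sub(r"\d+", lambda m: lookup.get(m.group(0), m.group(0)), text)


entries = [
    {'numbers': u'46764967051,46490797539,4639238933', 'domain': u'google.com'},
    {'numbers': u'46104376787335,46739600111', 'domain': u'bt.se'},
    {'numbers': u'46700961026,4638138399', 'domain': u'lake.se'},
]
lookup = build_lookup(entries)

print(replace_numbers("4638138399", lookup))  # lake.se
print(replace_numbers("123456", lookup))      # 123456 (not in the dictionary)
```

In the CSV loop from the question, the same call could be applied to the field, for example row[2] = replace_numbers(row[2], lookup), assuming the field contains only the number or the number embedded in other text.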