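I posted a similar question before, but after redesigning the project, this is where I am now:
I have two csv files (new.csv, scrapers.csv).
scrapers.csv contains two columns: 'scraper_dom' = a simplified form of a specific URL domain, and 'scraper_id' = the associated scraper_id, used to import URLs into a separately managed database.
My goal is to iterate through new.csv (parsing out the netloc with urlparse, via my fnetloc function) and perform a lookup against scrapers.csv to return the set of matching 'scraper_id' values for a given set of 'urls' (the way VLOOKUP works, or a JOIN in SQL), once urlparse has isolated the netloc within each URL.
urlparse does not reduce the URLs (from new.csv) to the exact simplified form found in scrapers.csv, so I will have to rely on some kind of partial match until I can work out the regular expressions for that part. I imported pandas because in earlier attempts I created DataFrames and ran pd.merge, but I could not get that to work either.
Current code is below. The commented-out pieces at the bottom are failed attempts; I thought I'd include what I have tried so far. (The ## lines are just intermediate print lines I use to check the program's output.)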
import pandas as pd, re
from urllib.parse import urlparse
import csv

sd = {}
sid = {}
#INT = []

def fnetloc(any):
    try:
        p = urlparse(any)
        return p.netloc
    except IndexError:
        return 'Error'

def dom(any):
    try:
        r = any.split(',')
        return r[0]
    except IndexError:
        return 'Error'

def ids(any):
    try:
        e = any.split(',')
        return e[0]
    except IndexError:
        return 'Error'

with open('scrapers.csv',encoding='utf-8',newline='') as s:
    reader = enumerate(csv.reader(s))
    s.readline()
    for j, row in reader:
        dict1 = dict({'scraper_dom':dom(row[0]), 'scraper_id':ids(row[1])})
        sid[j + 1] = dict1

for di in sid.keys():
    id = di
    ##print(sid[di]['scraper_dom'],sid[di]['scraper_id'])

with open('new.csv',encoding='UTF-8',newline='') as f:
    reader = enumerate(csv.reader(f))
    f.readline()
    for i, row in reader:
        dict2 = dict({'scraper_domain': fnetloc(row[0])})
        sd[i + 1] = dict2

for d in sd.keys():
    id = d
    ##print(sd[d]['scraper_domain'])

#def tryme( ):
    #return filter(sd.has_key, sid)
#print(list(filter(sid, sd.keys())))
Sample of the desired output.
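def fnetloc_to_scraperid(fnetloc: str, scrapers: List[Scraper]) -> str:
    try:
        # First scraper whose domain matches the given netloc
        return next(x.scraper_id for x in scrapers if x.matches(fnetloc))
    except StopIteration:
        # next() raises StopIteration when no scraper matches
        return "[no scraper id found]"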
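To see how this lookup behaves, here is a minimal self-contained sketch. The Scraper dataclass, the sample domains and the ids are made up for illustration. Its matches method is also a bit stricter than a plain endswith: it accepts the domain itself or any subdomain of it, so that notexample.com is not treated as a match for example.com. That stricter rule is my assumption, not something the question asks for.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class Scraper:
    # Hypothetical stand-in for one row of scrapers.csv
    scraper_dom: str
    scraper_id: str

    def matches(self, netloc: str) -> bool:
        # Match the domain itself or any subdomain of it (assumed rule)
        host = netloc.lower()
        dom = self.scraper_dom.lower()
        return host == dom or host.endswith("." + dom)


def fnetloc_to_scraperid(netloc: str, scrapers: List[Scraper]) -> str:
    # next() with a default avoids the try/except around StopIteration
    return next(
        (s.scraper_id for s in scrapers if s.matches(netloc)),
        "[no scraper id found]",
    )


# Made-up sample data
scrapers = [Scraper("example.com", "101"), Scraper("example.org", "202")]
urls = [
    "https://www.example.com/a",
    "http://notexample.com/",
    "https://news.example.org/x",
]

for url in urls:
    print(url, fnetloc_to_scraperid(urlparse(url).netloc, scrapers), sep="\t")
```

With a plain endswith, the second URL would pick up the id for example.com, which is probably not what you want.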
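I'd also suggest using some classes rather than keeping everything in csv row objects; in the long run it reduces bugs and greatly improves sanity. This script handled the sample data I fed it: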
import csv
from urllib.parse import urlparse
from typing import List


def fnetloc(any) -> str:
    try:
        p = urlparse(any)
        return p.netloc
    except ValueError:
        # urlparse signals a malformed URL with ValueError
        return 'Error'


class Scraper:
    def __init__(self, scraper_dom: str, scraper_id: str):
        self.scraper_dom = scraper_dom
        self.scraper_id = scraper_id

    def matches(self, fnetloc: str) -> bool:
        # Partial match: the netloc only has to end with the simplified domain
        return fnetloc.endswith(self.scraper_dom)


class Site:
    def __init__(self, url: str):
        self.url = url
        self.fnetloc = fnetloc(url)

    def get_scraperid(self, scrapers: List[Scraper]) -> str:
        try:
            return next(x.scraper_id for x in scrapers if x.matches(self.fnetloc))
        except StopIteration:
            # No scraper matched this site's netloc
            return "[no scraper id found]"


with open("new.csv", newline="") as f:
    sites = [Site(row[0]) for row in csv.reader(f)]

with open("scrapers.csv", newline="") as f:
    scrapers = [Scraper(row[0], row[1]) for row in csv.reader(f)]

for site in sites:
    print(site.url, site.get_scraperid(scrapers), sep="\t")
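If scrapers.csv gets large, scanning every scraper for every URL can become slow. One possible alternative, closer to how VLOOKUP or a JOIN behaves, is to load the scrapers into a dict keyed by scraper_dom and then try the full host followed by each parent domain. The sketch below is only an illustration: it uses in-memory strings as hypothetical stand-ins for the two files, the sample rows are made up, and it assumes both files have a header row naming the columns 'urls', 'scraper_dom' and 'scraper_id'.

```python
import csv
import io
from typing import Dict, Optional
from urllib.parse import urlparse

# Hypothetical stand-ins for scrapers.csv and new.csv (header row included)
SCRAPERS_CSV = "scraper_dom,scraper_id\nexample.com,101\nexample.org,202\n"
NEW_CSV = (
    "urls\n"
    "https://www.example.com/a\n"
    "https://news.example.org/x\n"
    "https://unknown.net/c\n"
)


def load_scrapers(fileobj) -> Dict[str, str]:
    """Map each scraper_dom to its scraper_id, using the header row for column names."""
    return {
        row["scraper_dom"].lower(): row["scraper_id"]
        for row in csv.DictReader(fileobj)
    }


def lookup(url: str, by_domain: Dict[str, str]) -> Optional[str]:
    """Try the full host, then each parent domain, e.g. a.b.com -> b.com -> com."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in by_domain:
            return by_domain[candidate]
    return None


by_domain = load_scrapers(io.StringIO(SCRAPERS_CSV))
results = [
    (row["urls"], lookup(row["urls"], by_domain))
    for row in csv.DictReader(io.StringIO(NEW_CSV))
]

for url, scraper_id in results:
    print(url, scraper_id or "[no scraper id found]", sep="\t")
```

To run it against the real files, replace the io.StringIO(...) arguments with file objects opened via open("scrapers.csv", newline="") and open("new.csv", newline="").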