Looping over strings in a list

Question

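I am trying to run the code below to take a list of URLs (in the file out.txt) and extract the text content from each page using XPath. The code finds the domain in the URL, then looks that domain up in a JSON file I created that maps domains to XPaths. It then uses the XPath to find the content.

However, if I run the code outside of the loop it works fine (page = 200), but if I run it inside the loop I get page = 404.

I am sure this is a syntax error in the loop, and probably something very simple. What am I doing wrong?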

import json
from urllib.parse import urlparse

import requests
from lxml import html

URLList = open("out.txt").readlines()
for item in URLList:
    inputurl = item
    print (inputurl)
    type(inputurl)

    # this takes a URL and finds the xpath - it uses an external
    # domainlookup.json that is manually created
    # inputurl = input("PLEASE PROVIDE A URL FROM AN APPROVED DOMAIN: ")
    t = urlparse(inputurl).netloc
    domain = ('.'.join(t.split('.')[1:]))

    with open('domainlookup.json') as json_data:
        domainlookup = json.load(json_data)

    for i in domainlookup:
        if i['DOMAIN'] == domain:
            xpath = i['XPATH']

    #this requests the xpath from the URL and scrapes the text content

    page = requests.get(inputurl)
    tree = html.fromstring(page.content)
    content = tree.xpath(xpath)

Answer 1

You can find the bug in your code with the following snippet:

URLList = open("out.txt").readlines()
for item in URLList:
    inputurl = item
    print("[{0}]".format(inputurl) )

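As you can see from the output (the closing bracket is printed on the next line), you are not removing the newline character from the URL, which is why requests cannot load it later. Just call strip() before using it: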

inputurl = item.strip()
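
To apply the same fix to the whole list at once, a minimal sketch along these lines may help. The helper name clean_urls and the sample URLs are hypothetical and only for illustration; it also skips blank lines, which is an extra assumption beyond the answer above.

```python
def clean_urls(lines):
    """Strip surrounding whitespace (including the trailing newline) from
    each line and drop lines that are empty after stripping."""
    return [line.strip() for line in lines if line.strip()]


# Sample input that mimics what readlines() returns (placeholder URLs).
raw_lines = ["https://www.example.com/a\n", "\n", "https://www.example.org/b\n"]

for url in clean_urls(raw_lines):
    # repr() makes any leftover whitespace visible while debugging.
    print(repr(url))
```

With the real file, the helper can be fed the open file object directly, for example: with open("out.txt") as f: URLList = clean_urls(f)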
