Parsing a text file and scraping an image from each link on each line with Python


I am trying to open a txt file that has an http link on every line, then have Python go to each link, find a specific image, and print out a direct link to that image, for EACH page listed in the txt file.

However, I have no idea what I'm doing. (I started Python a few days ago.)

Here is my current code, which doesn't work...

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

txt = open('links.txt').read().splitlines()
page = urlopen(txt)  # bug: urlopen() takes a single URL string, but txt is a list of lines
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links

Update 1:

OK, here is where I need to be more specific. I have a script that prints a lot of links into a txt file, each link on its own line, i.e.

http://link.com/1
http://link.com/2

and so on.

What I am trying to achieve right now is something that opens that text file containing those links, runs the regex I already posted against each of them, and then prints the image links it finds on link.com/1 and so on into another text file, which should look like

http://link.com/1/image.jpg
http://link.com/2/image.jpg

and so on.

From there on I don't need any help, because I already have a Python script that downloads the images from that txt file.

Update 2: Basically, what I need is this script:

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)

print image_links

but instead of looking at the one specific url in the url variable, it should crawl all of the urls in a text file that I specify, and then print out the results.
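In other words, a minimal sketch of the loop I am after (Python 2 with urllib2, as above; image_links.txt is just a placeholder name for the output file):

from urllib2 import urlopen
import re

# read one URL per line, skipping blank lines
with open('links.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

image_links = []
for url in urls:
    html = urlopen(url).read()  # fetch each page in turn
    # run the same regex from above against this page's HTML
    image_links.extend(re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html))

# write every image link found, one per line
with open('image_links.txt', 'w') as f:
    for link in image_links:
        f.write(link + '\n')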

python web-scraping beautifulsoup findall
1 Answer

I suggest you use a Scrapy spider.

Here is an example:

from scrapy.item import Item
from scrapy.http import Request
from scrapy.spider import BaseSpider


def NextURL():
    # generator that yields one URL per line of the file
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()

class YourScrapingSpider(BaseSpider):

    name = "imagespider"

    allowed_domains = []

    url = NextURL()

    start_urls = []

    def start_requests(self):
        # kick off the crawl with the first URL from the file
        start_url = self.url.next()
        yield Request(start_url, dont_filter=True)

    def parse(self, response):
        # scrape the page here, then queue the next URL from the file;
        # StopIteration from the exhausted generator simply ends the crawl
        scraped_item = Item()
        yield scraped_item
        next_url = self.url.next()
        yield Request(next_url, dont_filter=True)
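With the file of URLs in place, the spider would be run from inside a Scrapy project with: scrapy crawl imagespider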

I am creating a spider here that will read the URLs from a file, make the requests, and download the images.

To do that, we have to use the ImagesPipeline.
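For example, a minimal sketch of the wiring (using the old scrapy.contrib path to match the spider above; the IMAGES_STORE directory is a placeholder and ImageItem is just an illustrative name):

# settings.py
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/path/to/images'  # placeholder: where downloaded images land

# items.py
from scrapy.item import Item, Field

class ImageItem(Item):
    image_urls = Field()  # the pipeline downloads every URL in this field
    images = Field()      # filled in by the pipeline with download results

The parse callback would then yield ImageItem(image_urls=[...]) instead of the empty Item() above.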

It will be difficult at the beginning, but I suggest you learn Scrapy. Scrapy is a web crawling framework in Python.

Update:

import urllib
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    # spoof a browser user agent so sites do not reject the request
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    # print every <img> tag found on the page
    for tag in soup.findAll('img'):
        print (tag)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)


if __name__ == "__main__":
    main()

Output:

<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"

Update 2:

# write each scraped image link to the output file, one per line
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')