Parsing a text file and scraping an image from each link on each line with Python


I am trying to open a txt file that has an http link on every line, then have Python go to each link, find a specific image, and print out a direct link to that image, for EACH page listed in the txt file.

However, I have no idea what I'm doing. (I started Python a few days ago.)

Here is my current code, which doesn't work...

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

txt = open('links.txt').read().splitlines()
page = urlopen(txt)  # bug: urlopen() takes a single URL string, but txt is a list of lines
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links

Update 1:

OK, here is where I need to be more specific. I have a script that prints a lot of links into a txt file, each link on its own line, i.e.

http://link.com/1
http://link.com/2

and so on.

What I am trying to achieve right now is something that opens that text file containing those links, runs the regex I already posted against each of them, and then prints the image links it finds on link.com/1 and so on into another text file, which should look like

http://link.com/1/image.jpg
http://link.com/2/image.jpg

and so on.

From there on I don't need any help, because I already have a Python script that downloads the images from that txt file.

Update 2: Basically, what I need is this script:

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)

print image_links

but instead of looking at the one specific url in the url variable, it should crawl all of the urls in a text file that I specify, and then print out the results.
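In other words, a minimal sketch of the loop I am after (Python 2 with urllib2, as above; image_links.txt is just a placeholder name for the output file):

from urllib2 import urlopen
import re

# read one URL per line, skipping blank lines
with open('links.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

image_links = []
for url in urls:
    html = urlopen(url).read()  # fetch each page in turn
    # run the same regex from above against this page's HTML
    image_links.extend(re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html))

# write every image link found, one per line
with open('image_links.txt', 'w') as f:
    for link in image_links:
        f.write(link + '\n')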

python web-scraping beautifulsoup findall
1 Answer

I suggest you use a Scrapy spider.

Here is an example:

from scrapy.item import Item
from scrapy.http import Request
from scrapy.spider import BaseSpider


def NextURL():
    # generator that yields one URL per line of the file
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()

class YourScrapingSpider(BaseSpider):

    name = "imagespider"

    allowed_domains = []

    url = NextURL()

    start_urls = []

    def start_requests(self):
        # kick off the crawl with the first URL from the file
        start_url = self.url.next()
        yield Request(start_url, dont_filter=True)

    def parse(self, response):
        # scrape the page here, then queue the next URL from the file;
        # StopIteration from the exhausted generator simply ends the crawl
        scraped_item = Item()
        yield scraped_item
        next_url = self.url.next()
        yield Request(next_url, dont_filter=True)
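With the file of URLs in place, the spider would be run from inside a Scrapy project with: scrapy crawl imagespider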

I am creating a spider here that will read the URLs from a file, make the requests, and download the images.

To do that, we have to use the ImagesPipeline.
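For example, a minimal sketch of the wiring (using the old scrapy.contrib path to match the spider above; the IMAGES_STORE directory is a placeholder and ImageItem is just an illustrative name):

# settings.py
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/path/to/images'  # placeholder: where downloaded images land

# items.py
from scrapy.item import Item, Field

class ImageItem(Item):
    image_urls = Field()  # the pipeline downloads every URL in this field
    images = Field()      # filled in by the pipeline with download results

The parse callback would then yield ImageItem(image_urls=[...]) instead of the empty Item() above.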

It will be difficult at the beginning, but I suggest you learn Scrapy. Scrapy is a web crawling framework in Python.

Update:

import urllib
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    # spoof a browser user agent so sites do not reject the request
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    # print every <img> tag found on the page
    for tag in soup.findAll('img'):
        print (tag)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)


if __name__ == "__main__":
    main()

Output:

<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"

Update 2:

# write each scraped image link to the output file, one per line
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')