I'm trying to open a txt file that has one http link on each line, then have Python go to each link, find a specific image, and print out a direct link to that image, FOR EACH page listed in the txt file.
However, I don't really know what I'm doing. (I started Python a few days ago.)
Here is my current code, which doesn't work...
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
txt = open('links.txt').read().splitlines()
page = urlopen(txt)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
Update 1:
OK, this is where I need to be more specific. I have a script that prints a lot of links into a txt file, each link on its own line, i.e.
http://link.com/1 http://link.com/2 and so on, and so on.
What I'm trying to achieve right now is something that opens that text file containing those links, runs the regex I've already posted against each of them, and then prints the image links it finds on link.com/1 etc. into another text file, which should look like
http://link.com/1/image.jpg http://link.com/2/image.jpg
and so on.
After that I won't need any help, since I already have a Python script that downloads the images from that txt file.
Update 2: Basically, what I need is this script:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
But instead of looking at one specific url in the url variable, it should crawl all of the urls in a text file I specify, and then print out the results.
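One minimal way to do this without any extra frameworks is to loop over the file's lines and apply the same regex to each fetched page. Here is a sketch assuming Python 3 (where urllib2's urlopen lives in urllib.request); links.txt comes from the question, while images.txt and the function names are placeholders I've chosen:

```python
import re
from urllib.request import urlopen

# The same pattern as in the question: src="...media.tumblr...tumblr_...jpg
IMAGE_RE = re.compile(r'src."(\S*?media.tumblr\S*?tumblr_\S*?jpg)')

def extract_image_links(html):
    """Return all tumblr image URLs found in a page's HTML."""
    return IMAGE_RE.findall(html)

def write_image_links(infile_name, outfile_name):
    """Fetch every URL listed in infile_name and write found image links to outfile_name."""
    with open(infile_name) as infile, open(outfile_name, 'w') as outfile:
        for url in infile.read().splitlines():
            if not url:
                continue  # skip blank lines
            html = urlopen(url).read().decode('utf-8', 'replace')
            for link in extract_image_links(html):
                outfile.write(link + '\n')

# write_image_links('links.txt', 'images.txt')  # 'images.txt' is a placeholder output name
```

The resulting images.txt would then have one image URL per line, ready for the download script mentioned in Update 1.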
I suggest you use a Scrapy spider.
Here is an example:
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider

def NextURL():
    # Yield the URLs from the file one at a time (a generator, so next() works below)
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()

class YourScrapingSpider(XMLFeedSpider):

    name = "imagespider"
    allowed_domains = []
    url = NextURL()
    start_urls = []

    def start_requests(self):
        start_url = next(self.url)
        request = Request(start_url, dont_filter=True)
        yield request

    def parse(self, response, node):
        scraped_item = Item()
        # ... populate the item from the response here ...
        yield scraped_item

        # queue up the next URL from the file
        next_url = next(self.url)
        yield Request(next_url)
I am creating a spider that will read the URLs from a file, make the requests, and download the images.
For that, we have to use ImagesPipeline.
It will be hard at the starting stage, but I suggest you learn about Scrapy. Scrapy is a web crawling framework in Python.
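The ImagesPipeline wiring can be sketched roughly as follows. The module path matches older Scrapy releases that shipped a scrapy.contrib package (newer versions use scrapy.pipelines.images.ImagesPipeline), and the storage path is a placeholder:

```python
# items.py -- ImagesPipeline expects these two fields on the item
from scrapy.item import Item, Field

class ImageItem(Item):
    image_urls = Field()   # list of image URLs you want downloaded
    images = Field()       # filled in by the pipeline after download

# settings.py -- enable the pipeline and choose a download directory
# ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
# IMAGES_STORE = '/path/to/downloaded/images'
```

The spider then yields ImageItem instances with image_urls set, and the pipeline fetches the files into IMAGES_STORE.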
更新:
import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text)
    print(soup)
    for tag in soup.findAll('img'):
        print(tag)

# process(url)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)

if __name__ == "__main__":
    main()
Output:
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"
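The output above prints the whole <img> tags. If only the src attribute is needed, it can be pulled out of each tag; as a standard-library-only sketch (Python 3's html.parser, no BeautifulSoup required), the values can be collected into an image_links list like the one written out in Update 2 below. The example URL fed to the parser is made up:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.image_links.append(value)

parser = ImgSrcCollector()
parser.feed('<img src="https://example.org/a.jpg"><img alt="no src here">')
print(parser.image_links)  # ['https://example.org/a.jpg']
```

Tags without a src attribute are simply skipped, so the list holds only usable image URLs.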
Update 2:
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')