Python to list HTTP files and directories

Question

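How can I list files and folders if I only have an IP address?

With urllib and similar libraries, I am only able to display the content of the index.html file. But what if I also want to see which files are in the root directory?

I am looking for an example that shows how to supply a username and password when needed. (Most of the time index.html is public, but sometimes the other files are not.)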

Answer 1

Use requests to get the page content and BeautifulSoup to parse the result.
For example, to find all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:

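from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'

def listFD(url, ext=''):
    # Fetch the directory index page and parse the HTML it returns.
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # href=True skips anchors without an href; urljoin resolves relative
    # links against url without producing a double slash.
    return [urljoin(url, node['href']) for node in soup.find_all('a', href=True) if node['href'].endswith(ext)]

for file in listFD(url, ext):
    print(file)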
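
The question also asks about a username and password. If the server protects its files with HTTP Basic authentication (an assumption; digest or form-based logins need different handling), one possible approach is to send an Authorization header with the request. The sketch below uses only the standard library so it runs without extra packages. The names LinkCollector, extract_links, basic_auth_header and list_protected are made up for this illustration.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, ext=''):
    """Return absolute URLs for all links in html whose href ends with ext."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, h) for h in collector.hrefs if h.endswith(ext)]


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = f'{username}:{password}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')


def list_protected(url, username, password, ext=''):
    """Fetch a listing page that requires Basic auth and return its links.

    Assumes the server accepts Basic auth; urlopen raises
    urllib.error.HTTPError (for example 401) if the credentials are rejected.
    """
    headers = {'Authorization': basic_auth_header(username, password)}
    with urlopen(Request(url, headers=headers)) as response:
        html = response.read().decode('utf-8', errors='replace')
    return extract_links(html, url, ext)
```

With requests, the same credentials can be passed as requests.get(url, auth=(username, password)). Basic auth sends the password in an easily decoded form, so it is best used over https.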

Answer 2

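As another answer says, you cannot get a directory listing directly via HTTP. It is the HTTP server that "decides" what to give you. Some servers return an HTML page with links to all the files inside a "directory", some return a particular page (index.html), and some do not even interpret the "directory" as one.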

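For example, you might have a link to "http://localhost/user-login/". This does not mean that the server's document root contains a directory called user-login. The server simply interprets it as a "link" to some page.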

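Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "IP address" you want to access would do the job), or set up an HTTP server on that machine that returns, for each path (http://192.168.2.100/directory), a list of the files in it (in whatever format), and parse that list with Python.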
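
As a rough sketch of the second option, Python's own http.server module already generates a listing page for any directory that has no index.html. The example assumes you control the remote machine and can run Python there. The names HrefParser, start_listing_server and parse_listing are made up for this illustration.

```python
import functools
import threading
from html.parser import HTMLParser
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class HrefParser(HTMLParser):
    """Collects the href of every <a> tag in a listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            self.hrefs.append(href)


def start_listing_server(directory, host='127.0.0.1', port=0):
    """Serve directory over HTTP in a background thread.

    port=0 lets the OS pick a free port; read it back from
    server.server_address. The default host only accepts local
    connections, so pass another address to expose the listing.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def parse_listing(html):
    """Return the entries of a listing page; directories end with '/'."""
    parser = HrefParser()
    parser.feed(html)
    return parser.hrefs
```

Running python -m http.server in the directory gives the same kind of listing page from the command line. The client fetches a path and passes the returned HTML to parse_listing.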

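If the server returns an "Index of /bla/bla" kind of page (the directory listings that Apache servers generate), you can parse the HTML output to find the names of the files and directories. If not (for example a custom index.html, or whatever else the server decides to give you), then you are out of luck :( and cannot do it.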
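
The check described above can be sketched as a heuristic: look at the page title, and only treat the links as directory entries when it looks auto-generated. The title prefixes below ("Index of" as used by Apache and nginx autoindex pages, "Directory listing for" as used by Python's http.server) and the rules for skipping links are assumptions that will not fit every server. The name classify_listing is made up for this illustration.

```python
from html.parser import HTMLParser


class _IndexParser(HTMLParser):
    """Records the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def classify_listing(html):
    """Return (directories, files) for a listing page, or None otherwise."""
    parser = _IndexParser()
    parser.feed(html)
    # Heuristic: auto-generated listings usually announce themselves in the title.
    if not parser.title.strip().startswith(('Index of', 'Directory listing for')):
        return None
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links (?C=N;O=D), absolute paths, parent links
        # and links to other sites; none of these are entries of this directory.
        if href.startswith(('?', '/', '..')) or '://' in href:
            continue
        (dirs if href.endswith('/') else files).append(href)
    return dirs, files
```

A return value of None corresponds to the "out of luck" case: the server sent a custom page rather than a listing.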

Answer 3

Zety provides a nice compact solution. I would add to his example by making the requests part more robust and functional:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def get_url_paths(url, ext='', params=None):
    response = requests.get(url, params=params)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Add a trailing slash so urljoin treats url as a directory.
    base = url if url.endswith('/') else url + '/'
    return [urljoin(base, node['href']) for node in soup.find_all('a', href=True) if node['href'].endswith(ext)]

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)

Answer 4

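You can use the following script to get the names of all files in a directory and its subdirectories on an HTTP server. A file writer can then be used to download them.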

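from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_url(url):
    url = url.replace(" ", "%20")
    # Treat url as a directory so relative links resolve beneath it.
    if not url.endswith('/'):
        url += '/'
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    for link in soup.find_all('a', href=True):
        # Use the href rather than the link text, which may be shortened for display.
        href = link['href']
        # Skip sort links, absolute paths, hidden or parent entries, and external links.
        if href.startswith(('?', '/', '.')) or '://' in href:
            continue
        url_new = urljoin(url, href).replace(" ", "%20")
        if href.endswith('/'):
            read_url(url_new)  # Recurse into the subdirectory.
        print(url_new)

# urlopen needs a full URL including the scheme.
read_url("http://www.example.com/")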
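
The "file writer" step could look roughly like the sketch below, which streams one URL to disk. The function name download and the choice to take the file name from the last path segment of the URL are assumptions made for this illustration.

```python
import shutil
from pathlib import Path
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def download(url, dest_dir='.'):
    """Save the resource at url into dest_dir and return the local path."""
    # Take the last path segment as the file name and decode %20 and friends.
    name = unquote(urlsplit(url).path.rsplit('/', 1)[-1])
    if not name:
        raise ValueError(f'URL has no file name (is it a directory?): {url}')
    dest = Path(dest_dir) / name
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so large files are never held in memory in full.
    with urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling download for each printed URL that does not end in '/' would mirror the files into one local directory. Files with the same name in different subdirectories would overwrite each other in this simple form.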

Answer 5

HTTP does not work with "files" and "directories". Pick a different protocol.

Answer 6

The htmllistparse Python module retrieves HTML directory listings in a structured way:

import htmllistparse

cwd, listing = htmllistparse.fetch_listing("https://www.kernel.org/pub/linux/kernel/SillySounds/")
print(listing[0])

FileEntry(name='english.au', modified=time.struct_time(tm_year=1994, tm_mon=3, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=77, tm_isdst=-1), size=41984, description=None)

For cases where retrieving the listing is more complicated, you can do that step separately and pass the result to htmllistparse as a BeautifulSoup object:

from bs4 import BeautifulSoup
import htmllistparse
import requests

response = requests.get("https://www.kernel.org/pub/linux/kernel/SillySounds/")
html = response.text
soup = BeautifulSoup(html, "html.parser")
cwd, listing = htmllistparse.parse(soup)
