Identifying the file extension of a URL

Question

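I want to extract the file extension from a URL when one is present. The goal is to identify links that point to extensions on my exclusion list, such as .jpg, .exe, and so on.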

So, given the URL www.example.com/image.jpg, I want to extract the extension jpg, and also handle URLs that have no extension, such as www.example.com/file (that is, return nothing).

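I can't work out how to implement this properly. One approach I thought of is to take everything after the last dot. If there is an extension, I can check it against my list; if there isn't, then for a URL such as www.example.com/file it would return com/file (which is not in my list of excluded extensions, so that's fine).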

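There may be a better way, using a package I don't know about, that can tell what is and isn't an actual extension (that is, one that handles URLs with no extension at all).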

Answer 1

The urlparse module (urllib.parse in Python 3) provides tools for working with URLs. It has no function for extracting the file extension from a URL, but you can get one by combining it with os.path.splitext:

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
from os.path import splitext

def get_ext(url):
    """Return the filename extension from url, or ''."""
    parsed = urlparse(url)
    root, ext = splitext(parsed.path)
    return ext  # or ext[1:] if you don't want the leading '.'

Usage example:

>>> get_ext("www.example.com/image.jpg")
'.jpg'
>>> get_ext("https://www.example.com/page.html?foo=1&bar=2#fragment")
'.html'
>>> get_ext("https://www.example.com/resource")
''
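
To connect this back to the original goal of skipping links with unwanted extensions, here is a minimal sketch. The name has_blocked_ext, the BLOCKED_EXTENSIONS set, and the sample links are illustrative assumptions rather than part of the answer; the extension is lower-cased so that .JPG and .jpg are treated alike.

```python
from os.path import splitext
from urllib.parse import urlparse

# Hypothetical block list, taken from the examples in the question.
BLOCKED_EXTENSIONS = {".jpg", ".exe"}


def has_blocked_ext(url, blocked=BLOCKED_EXTENSIONS):
    """Return True if the URL path ends in one of the blocked extensions."""
    path = urlparse(url).path  # query string and fragment are not part of the path
    ext = splitext(path)[1].lower()  # normalise case, e.g. '.EXE' -> '.exe'
    return ext in blocked


# Illustrative sample links (assumed for this sketch).
links = [
    "www.example.com/image.jpg",
    "https://www.example.com/setup.EXE?download=1",
    "https://www.example.com/page.html",
    "www.example.com/file",
]

# Keep only the links whose extension is not on the block list.
wanted = [link for link in links if not has_blocked_ext(link)]
print(wanted)
```

Note that splitext only looks for a dot in the last path component, so www.example.com/file yields an empty extension rather than com/file. A bare host without a scheme, such as www.example.com, would still be reported as having the extension .com, because urlparse treats the whole string as a path.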

Answer 2

If the URL has no extension, you can use the response's 'Content-Type' header to work out the type instead, like this:

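from urllib.request import urlopen

def get_ext(url):
    """Return the Content-Type subtype, e.g. 'jpeg' for 'image/jpeg'."""
    resp = urlopen(url)  # url needs a scheme, e.g. 'https://'
    # Drop parameters such as '; charset=utf-8' before taking the subtype.
    content_type = resp.info()['Content-Type'].split(";")[0]
    ext = content_type.split("/")[-1].strip()
    return ext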
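
The subtype of a Content-Type is not always the usual file extension (for example, image/jpeg versus .jpg, or text/plain versus .txt). One possible refinement, sketched below, is to pass the media type to the standard library's mimetypes.guess_extension. The helper name ext_from_content_type is an assumption made for this illustration, and it takes the header value as a string so it can be tried without a network request.

```python
import mimetypes


def ext_from_content_type(header_value):
    """Map a Content-Type header value to a file extension, or None if unknown."""
    # Keep only the media type, dropping parameters such as "; charset=utf-8".
    media_type = header_value.split(";")[0].strip().lower()
    # guess_extension() returns an extension with a leading dot, or None.
    return mimetypes.guess_extension(media_type)


print(ext_from_content_type("image/png"))
print(ext_from_content_type("text/html; charset=utf-8"))
print(ext_from_content_type("application/x-not-a-real-type"))
```

The exact extension returned for some types can vary with the Python version and the system's MIME tables, so it is safer to compare the result against a set of acceptable extensions than against a single value.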
