Using Regex and BeautifulSoup in Python to count the number of distinct URLs on a web page

Question

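I currently have an assignment to write a Python script that searches a web page for unique URLs and reports how many it finds. The assignment requires using regular expressions to find the URLs, so that is what the code does, regardless of whether a better alternative exists (for example, reading the href attributes).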

Assume that every URL is enclosed in double quotes, so I do not need to count cases such as "mtv (http://mtv.com)".

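Local file references such as /news, /sports and /rankings are not counted as distinct URLs, whereas the same URL with different parameters, such as ?login=misterT and ?login=missesK, counts as two different URLs.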
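
The counting rules above can be sketched as follows. This is only an illustration under stated assumptions: the pattern `QUOTED_URL`, the function name `count_unique_urls` and the `example.com` sample markup are hypothetical, not part of the assignment. It treats a URL as any absolute http(s) address wrapped in double quotes and uses a set to drop duplicates.

```python
import re

# Hypothetical pattern: an absolute http(s) URL wrapped in double quotes.
# With exactly one capturing group, findall returns just the URL text.
QUOTED_URL = re.compile(r'"(https?://[^"\s]+)"', re.IGNORECASE)


def count_unique_urls(text):
    """Return the number of distinct quoted absolute URLs found in text."""
    return len(set(QUOTED_URL.findall(text)))


# Made-up markup used only to exercise the rules.
sample = '''
<a href="http://example.com/page?login=misterT">a</a>
<a href="http://example.com/page?login=missesK">b</a>
<a href="http://example.com/page?login=misterT">duplicate</a>
<a href="/news">local reference, ignored</a>
mtv (http://mtv.com)
'''

print(count_unique_urls(sample))  # 2
```

The two login variants count separately because the query string is part of the matched text, while /news and the unquoted mtv address never match the pattern.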

Here is what I have so far:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re
import ssl
import sys

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#Ask the user for URL:
url = input('Enter the URL you would like to check: ')

try:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
except:
    print("Cannot access - please check your URL!")
    sys.exit()

#Initialize counter
count = 0

##Search for http, whether or not it is secure, optional www group, any one or more word character after,
##any one word character one or more times(Groups for domain, then .com, .org, .gov, etc.), followed by forward slash, 
##followed by any one word character one or more times,
##followed by a question mark (to account for arguments in URLs), enclosed by double quotes, regardless of case 
def websitecount(url):
    re_flags = ( re.MULTILINE | re.IGNORECASE | re.UNICODE )
    websites = re.findall('/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', str(soup), re_flags)

    return len(websites)

#Count the number of unique URLs
print('The count of unique URLs is:', websitecount(url))

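When I run the code, I get the error "TypeError: expected string or bytes-like object" on the regex line (websites = re.findall). How should I modify the code to achieve the desired result?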
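
For context, `re.findall` raises this TypeError when its second argument is neither `str` nor `bytes`, for example when a parsed-document object is passed without `str()`. A minimal sketch of that behavior, using a hypothetical `ParsedPage` class as a stand-in for a BeautifulSoup object so that it runs without third-party packages:

```python
import re


class ParsedPage:
    """Hypothetical stand-in for a parsed-document object such as BeautifulSoup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


page = ParsedPage('<a href="http://www.mtv.com/">mtv</a>')
pattern = r'https?://[^"\s]+'

try:
    re.findall(pattern, page)  # page is not a str, so this raises TypeError
except TypeError as exc:
    print('TypeError:', exc)

# Converting the object to text first lets the search run.
print(re.findall(pattern, str(page)))  # ['http://www.mtv.com/']
```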

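Edit: I have modified the code to use a function, but I still get a return value of 0. When I enter the URL of this thread, it should return 4, given the following 4 URLs:

"http://www.google.com/" "http://www.google2.com/" "http://www.google3.com/" "http://www.mtv.com/"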

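One likely reason for the 0 result is that the pattern begins and ends with a literal "/", as a regex literal would in some other languages. Python's `re` matches those slashes as ordinary characters, so "http" would have to be preceded by a slash. The sketch below compares the question's pattern with one possible rewrite on the four sample strings. The `revised` pattern is an assumption offered for illustration, not the assignment's required regex.

```python
import re

# The four quoted URLs from the question, run together as they appear there.
sample = ('"http://www.google.com/""http://www.google2.com/"'
          '"http://www.google3.com/""http://www.mtv.com/"')

# Pattern from the question, written as a raw string. The leading and
# trailing "/" are matched literally, so nothing in the sample matches.
original = r'/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/'
print(len(re.findall(original, sample, re.IGNORECASE)))  # 0

# Hypothetical rewrite: no slash delimiters, double quotes as boundaries,
# and a single capturing group so findall returns plain strings.
revised = r'"(https?://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,}(?:/[^"\s]*)?)"'
urls = set(re.findall(revised, sample, re.IGNORECASE))
print(len(urls))  # 4
```

Collecting the matches in a set matters for the stated goal: `len(re.findall(...))` counts every occurrence, while the set counts each distinct URL once.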