I'm trying to strip all extraneous tags, URLs, and scripts out of HTML before running the text through an LLM. Right now I have the following Python function.
from urllib.parse import unquote, urlparse

from bs4 import BeautifulSoup

def remove_tags(html) -> str:
    # First we decode any percent-encoded text
    html = unquote(html)
    # Next we strip out all of the HTML tags
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script']):
        # Remove <style> and <script> blocks entirely
        data.decompose()
    # Now we get rid of the URLs
    tag_free = ' '.join(soup.stripped_strings)
    words = tag_free.split()
    for i, word in enumerate(words):
        parsed_url = urlparse(word)
        if parsed_url.scheme and parsed_url.netloc:
            words[i] = "[URL Removed]"
    final_text = ' '.join(words)
    # Finally we remove any unwanted returns
    # (str.replace returns a new string, so the result must be reassigned)
    final_text = final_text.replace("\t", " ").replace("\n", " ").replace("\r", " ")
    return final_text
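For context, the URL-removal step relies on `urlparse`: a token only counts as a URL when it has both a scheme and a netloc. A minimal illustration of that check (the `example.com` tokens are made up for the demo):

```python
from urllib.parse import urlparse

def looks_like_url(word: str) -> bool:
    # urlparse reports a scheme (e.g. "https") and a netloc
    # (e.g. "example.com") only for well-formed absolute URLs
    parsed = urlparse(word)
    return bool(parsed.scheme and parsed.netloc)

print(looks_like_url("https://example.com/a"))    # True
print(looks_like_url("https - //example.com/a"))  # False: no "scheme:" prefix
```

This is why the mangled `https - //...` form slips through: once the scheme is separated from `//`, `urlparse` no longer sees a scheme-plus-netloc pair.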
This works on everything except content URLs like this one:
content - url(https - //link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP)
These scripted URLs are everywhere and bloat my content, so I need to remove them.
I've tried various regex options, such as
^\['content - url'\]+\[)\]$
but it doesn't work.
I'm calling re like this:

import re

start = "content - url"
test_string = ("sonos-logo content - url(https - "
"//link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA"
"~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) !important; u + .body .arrow-icon content - url(https - //link.sonos.com/f/a/1vHBAmM0w7VCBDGBsH-ADg~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWViOTlhMGIyNWY4MDA0ZGU2MzVhYS9vcmlnaW5hbC5wbmc_MTcwNDkwMTAxOFcDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) !important; u + .body .facebook-icon")
clean_string = re.sub('^[' + start + ']+[)]$', '', test_string)
Can anyone help?
It looks like you want something like
clean_string = re.sub(
    r'content\s*-\s*url\s*\([^()]+\)',
    '[URL removed]', test_string)
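The pattern matches the literal `content - url(...)` construct (with flexible whitespace around the hyphen) rather than trying to parse the URL inside it, and `[^()]+` stops at the closing parenthesis. A quick check on a short made-up sample in the same mangled style (the `example.com` host is invented for the demo):

```python
import re

sample = ("logo content - url(https - //example.com/abc) !important; "
          "u + .body .icon")
clean = re.sub(r'content\s*-\s*url\s*\([^()]+\)', '[URL removed]', sample)
print(clean)  # logo [URL removed] !important; u + .body .icon
```

Your original attempt failed because `^[' + start + ']+[)]$` is a character class (any run of the letters in "content - url") anchored to the whole string, not a match for the phrase itself.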