What regular expression can I use to remove content-url() expressions?


I am trying to strip all extraneous tags, URLs, and scripts from HTML before running the text through an LLM. Right now I have the following Python function.

from urllib.parse import unquote, urlparse

from bs4 import BeautifulSoup


def remove_tags(html) -> str:
    # First we decode any percent-encoded text
    html = unquote(html)

    # Next we parse the HTML and drop the <style> and <script> elements
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script']):
        data.decompose()

    # Now we get rid of the URLs
    tag_free = ' '.join(soup.stripped_strings)
    words = tag_free.split()
    for i, word in enumerate(words):
        parsed_url = urlparse(word)
        if parsed_url.scheme and parsed_url.netloc:
            words[i] = "[URL Removed]"

    final_text = ' '.join(words)

    # Finally we remove any unwanted tabs and newlines; str.replace returns
    # a new string, so the result has to be assigned back
    final_text = final_text.replace("\t", " ").replace("\n", " ").replace("\r", " ")

    return final_text

This works for everything except content URLs, which look like this:

content - url(https - //link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) 

These scripted URLs are everywhere and bloat my content, so I need to remove them.

I have tried various regex options, such as ^['content - url']+[)]$, but it doesn't work.

I am using re:

start = "content - url"
test_string = ("sonos-logo  content -  url(https - "
           "//link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA"
           "~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) !important;  u + .body .arrow-icon  content -  url(https - //link.sonos.com/f/a/1vHBAmM0w7VCBDGBsH-ADg~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWViOTlhMGIyNWY4MDA0ZGU2MzVhYS9vcmlnaW5hbC5wbmc_MTcwNDkwMTAxOFcDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) !important;  u + .body .facebook-icon")

clean_string = re.sub('^[' + start + ']+[)]$', '', test_string)

Can anyone help?

python python-3.x regex
1 Answer

It looks like you want something like

clean_string = re.sub(
    r'content\s*-\s*url\s*\([^()]+\)',
    '[URL removed]', test_string)
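
To spell out why the attempt in the question matches nothing: square brackets define a character class, so '[content - url]' means "any one of these characters", not the literal phrase, and the ^ and $ anchors require the whole string to consist of such characters followed by a single ")". The sketch below contrasts the two patterns on a shortened sample. The helper name strip_content_urls, its placeholder parameter, and the abbreviated URLs are made up for illustration and are not from the question.

```python
import re

# Literal "content", "-", "url" with optional whitespace between them, then a
# parenthesised argument; [^()]+ stops at the first closing parenthesis.
CONTENT_URL = re.compile(r'content\s*-\s*url\s*\([^()]+\)')


def strip_content_urls(text: str, placeholder: str = "[URL removed]") -> str:
    """Replace every content - url(...) expression with a placeholder."""
    return CONTENT_URL.sub(placeholder, text)


# Shortened stand-in for the test string in the question (URLs abbreviated).
sample = ("sonos-logo  content -  url(https - //link.sonos.com/f/a/abc~~/x) "
          "!important;  u + .body .arrow-icon  content -  url(https - "
          "//link.sonos.com/f/a/def~~/y) !important;  u + .body .facebook-icon")

# The pattern from the question: a character class plus anchors, which cannot
# match this string, so re.sub returns it unchanged.
original_attempt = re.sub('^[' + "content - url" + ']+[)]$', '', sample)

cleaned = strip_content_urls(sample)
print(cleaned)
```

If this suits your data, one option is to call the helper on final_text inside remove_tags just before the return. Note that [^()]+ assumes the URL itself contains no parentheses, which holds for the sample shown but may not hold for every input.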