How do I convert plain-text URLs into clickable links using Python?

Question

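For example, I have some plain text. Consider the following sentence:

I was surfing www.google.com, and I found an interesting site www.stackoverflow.com. It's amazing!

In the example above, www.google.com is plain text, and I need to convert it so that it is wrapped in an anchor tag with a link to google.com. www.stackoverflow.com, on the other hand, is already wrapped in an anchor tag in the source, and I want to leave it as it is. How can I do this with Python regular expressions?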

Answer 1

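This task has to be split into two parts:

Extract all the text that is not already inside an a tag.
Find (or, more accurately, guess) all the URLs in that text and wrap them.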

For the first part, I suggest BeautifulSoup. You could also use html.parser from the standard library directly, but that would be a lot of extra work.
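To give a sense of what the standard-library route involves, here is a minimal sketch that only collects the text found outside of a tags. The class name TextOutsideLinks, its attributes anchor_depth and chunks, and the sample string are made up for illustration; they are not part of the answer.

```python
from html.parser import HTMLParser


class TextOutsideLinks(HTMLParser):
    """Collect text nodes that are not inside an <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # greater than 0 while we are inside an <a> element
        self.chunks = []       # pieces of text found outside of links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not part of an existing link.
        if self.anchor_depth == 0:
            self.chunks.append(data)


sample = 'I was surfing <a href="...">www.google.com</a>, and I found www.stackoverflow.com.'
collector = TextOutsideLinks()
collector.feed(sample)
collector.close()  # flush any text still buffered by the parser
print("".join(collector.chunks))
```

Note that this only reads the text. Writing new links back into the right places of the document would need additional bookkeeping, which is the extra work that BeautifulSoup saves.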

Use a recursive function to find the text:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

your_text = """I was surfing <a href="...">www.google.com</a>, and I found an interesting site https://www.stackoverflow.com/. It's amazing! I also liked Heroku (http://heroku.com/pricing)
more.domains.tld/at-the-end-of-line
https://at-the_end_of-text.com"""

soup = BeautifulSoup(your_text, "html.parser")

def wrap_plaintext_links(bs_tag):
    for element in bs_tag.children:
        if type(element) is NavigableString:
            pass  # now we have a text node, process it
        # so it is a Tag (or the soup object, which is for most purposes a tag as well)
        elif element.name != "a":
            # if it isn't the a tag, process it recursively
            wrap_plaintext_links(element)

wrap_plaintext_links(soup)  # call the recursive function

You can test that it finds only the desired values by replacing pass with print(element).

Now on to finding the URLs and replacing them. How complex the regular expression needs to be really depends on how precise you want to be. I would go with this:

(https?://)?        # match http(s):// in separate group if present
(                   # start of the main capturing group, what will be between the tags
  (?:[\w-]+\.)+     #   at least one domain and any subdomains before TLD
  [a-z]+            #   TLD
  (?:/\S*?)?        #   /[anything except whitespace] if present - URL path
)                   # end of the group
(?=[\.,)]?(?:\s|$)) # prevent matching any of ".,)" that might appear immediately after the URL as the text goes...

The function and the code additions, including the replacement:

import re

def create_replacement(matchobj):
    if matchobj.group(1):  # if there's http(s)://, keep it
        full_url = matchobj.group(0)
    else:  # otherwise prepend it. it would be a long discussion if https or http. decide.
        full_url = "http://" + matchobj.group(2)
    tag = soup.new_tag("a", href=full_url)
    tag.string = matchobj.group(2)
    return str(tag)

# compile the pattern beforehand, as it's going to be used many times
r = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")

def wrap_plaintext_links(bs_tag):
    # iterate over a copy, because the loop replaces children of bs_tag
    for element in list(bs_tag.children):
        if type(element) is NavigableString:
            replaced = r.sub(create_replacement, str(element))
            # make it a Soup so that the tags aren't escaped
            element.replace_with(BeautifulSoup(replaced, "html.parser"))
        elif element.name != "a":
            wrap_plaintext_links(element)

Note: you can also keep the pattern explanation in the code, written out as above; see the re.VERBOSE flag.
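As an illustration of the re.VERBOSE idea, the sketch below compiles the same pattern with its comments inline and applies it to a plain-text fragment using only the standard library. The names URL_PATTERN, linkify and to_anchor are made up for this example, and falling back to http:// for URLs without a scheme is an assumption carried over from the code above. It works on a single piece of plain text (such as one text node) and does not skip existing a tags on its own.

```python
import re
from html import escape

# The answer's pattern, written with inline comments via re.VERBOSE.
URL_PATTERN = re.compile(
    r"""
    (https?://)?          # optional scheme, captured separately
    (                     # main group: the text shown between the tags
      (?:[\w-]+\.)+       #   one or more "label." parts before the TLD
      [a-z]+              #   TLD
      (?:/\S*?)?          #   optional path, matched lazily
    )
    (?=[.,)]?(?:\s|$))    # stop before a trailing ".,)" followed by whitespace or end of text
    """,
    re.VERBOSE,
)


def linkify(text):
    """Wrap every URL-like substring of a plain-text fragment in an <a> element."""

    def to_anchor(match):
        # Keep an explicit scheme; otherwise assume http:// (an arbitrary choice).
        href = match.group(0) if match.group(1) else "http://" + match.group(2)
        return '<a href="{}">{}</a>'.format(escape(href), escape(match.group(2)))

    return URL_PATTERN.sub(to_anchor, text)


print(linkify("I liked Heroku (http://heroku.com/pricing) and www.google.com."))
```

Inside a verbose pattern, whitespace and everything after an unescaped # are ignored, so the commented layout compiles to the same expression as the one-line version.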
