Regex - extract subdomain and domain

Question

I am trying to write a regular expression (JavaScript/Node.js) that extracts the subdomain and domain parts from any given URL. This is what I ended up with:

[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)

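Right now I am only considering http and https as protocols, and excluding the "www." part from the subdomain+domain portion of the URL. I checked the expression and it almost works. But here is the problem: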

Succeeds

'http://mplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://lplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

Fails

'http://play.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://tplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

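I am only using the first element of the resulting array. I don't understand why "play." and "tplay." don't work. Can anyone help me with this?

Do "/p" and "/t" have any special meaning to the regex evaluator?

Is there any other way to extract the subdomain and domain from any given URL using a regex?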

Edit:

Examples:

https://play.google.com/store/apps/details?id=com.skgames.trafficracer => play.google.com

https://mail.google.com/mail/u/0/#inbox => mail.google.com

Answer 1

Your regex does not look right. Try this regex:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/img

RegEx Demo
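
As a rough illustration of how the pattern above might be used, the sketch below wraps it in a helper. The function name extractHost is hypothetical, and the g flag is dropped on purpose: with g, String.prototype.match returns only the whole matches and discards the capture group that holds the host.

```javascript
// Same pattern as above, minus the `g` flag so that exec() exposes capture group 1.
const hostPattern = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/im;

// Hypothetical helper: returns the host without a leading "www.",
// or null when nothing matches (for example, an empty string).
function extractHost(url) {
  const match = hostPattern.exec(url);
  return match ? match[1] : null;
}

console.log(extractHost('http://play.google.co.in/sadfask/asdkfals?dk=10')); // play.google.co.in
console.log(extractHost('https://mail.google.com/mail/u/0/#inbox'));          // mail.google.com
```

The host ends up in group 1, so callers would read match[1] rather than the first element of the result array.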

Answer 2

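You are roughly the millionth person to try to parse URLs in JavaScript. I am a bit surprised you did not find any of the existing questions on SO, which go back years. The last thing you want to do is write yet another broken regex, with all due respect to the people who have offered answers to your question.

There are plenty of well-documented libraries and approaches for handling this. Google it. The simplest way is to create an a element in memory, assign it an href, and then read its hostname and other properties. See http://tutorialzine.com/2013/07/quick-tip-parse-urls/. If that doesn't float your boat, use a library such as uri.js.

If you really don't want to use a library and insist on reinventing the wheel, then at least do something like this: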

function get_domain_from_url(url) {
    // Let the browser parse the URL via an in-memory anchor element (requires a DOM).
    var a = document.createElement('a');
    a.setAttribute('href', url);
    return a.hostname;
}

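Essentially, you are delegating the extraction of the subdomain/domain part of the URL to the browser's URL parsing logic, which is far better than anything you are going to write.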
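
Since the question mentions Node.js, where document is not available, the same delegation idea can be sketched with the WHATWG URL class, which is a global in Node.js and in modern browsers. The helper name getHostname is hypothetical, and stripping a leading "www." is an assumption taken from the question rather than something URL does on its own.

```javascript
// Hypothetical helper built on the global URL class.
function getHostname(url) {
  // new URL() throws a TypeError for strings it cannot parse as an absolute URL.
  const { hostname } = new URL(url);
  // Assumption from the question: drop a leading "www." from the result.
  return hostname.replace(/^www\./i, '');
}

console.log(getHostname('https://play.google.com/store/apps/details?id=com.skgames.trafficracer')); // play.google.com
console.log(getHostname('https://www.google.co.in/imghp'));                                          // google.co.in
```

Because new URL() rejects relative and malformed input, callers that accept arbitrary strings would probably want a try/catch around the call.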

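See also Parse URL with jquery/javascript?, Parse URL with Javascript, How do I parse a URL into hostname and path in javascript?, or Parsing URLs with JavaScript or jQuery. How did you miss those? Sorry, I have to vote to close this as a duplicate.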

Answer 3

The same RegExp as in anubhava's answer, only with added support for protocol-relative URLs such as //google.com:

/^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im

RegEx Demo
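
To make the protocol-relative case concrete, here is a small sketch that runs the pattern above over a few input shapes. The helper name hostOf and the sample URLs are illustrative assumptions.

```javascript
// Pattern from this answer: the scheme and the "//" are matched independently, both optional.
const relativeAwarePattern = /^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im;

// Hypothetical helper: capture group 1 holds the host, without a leading "www.".
function hostOf(url) {
  const match = relativeAwarePattern.exec(url);
  return match ? match[1] : null;
}

const samples = [
  '//google.com',                       // protocol-relative
  'https://play.google.com/store/apps', // absolute
  'www.example.com/path',               // no scheme at all
];
for (const sample of samples) {
  console.log(sample, '=>', hostOf(sample));
}
```

One thing to watch for: unlike the earlier pattern, this character class does not exclude ?, so a query string that follows the host directly (as in //example.com?x=1) ends up inside the capture.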

Answer 4

Here is a solution that ignores everything up to and including ://

.*\://?([^\/]+)

In case you also want to ignore www.

.*\://(?:www\.)?([^\/]+)

Answer 5

Your regex works just fine. You only need to remove the square brackets. The final expression is:

^(?:http:\/\/|www\.|https:\/\/)([^\/]+)

Hope it helps!
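
To see why the square brackets matter, the sketch below compares the two patterns on the URLs from the question. Inside [...] the text is not a group: [^...] is a negated character class that consumes exactly one character that is none of ( ) ? : / | . h t p s w. The variable names are illustrative.

```javascript
// Question's pattern: the leading [^...] matches ONE character outside the listed set.
const withBrackets = /[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i;
// Pattern from this answer: a real non-capturing group, anchored at the start.
const withoutBrackets = /^(?:http:\/\/|www\.|https:\/\/)([^\/]+)/i;

const failing = 'http://play.google.co.in/sadfask/asdkfals?dk=10';
const passing = 'http://mplay.google.co.in/sadfask/asdkfals?dk=10';

// "p" is in the excluded set, so the match can only start at "l".
console.log(failing.match(withBrackets)[0]);    // lay.google.co.in
// "m" is not in the set, so the match happens to start in the right place.
console.log(passing.match(withBrackets)[0]);    // mplay.google.co.in
// Without the brackets the host is in capture group 1.
console.log(failing.match(withoutBrackets)[1]); // play.google.co.in
```

So "/p" and "/t" are not special; p and t are simply among the characters the class excludes. Note also that the alternation consumes only one prefix, so http://www.example.com would still yield www.example.com in group 1.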

Answer 6

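I know I'm late to the party, but I'd like to answer this with some additional useful information.

Use this regex to get the domain name from a link: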

^(https?:\/\/)?(www\.)?([^\/]+)

Here is a link to the above regex.

If you want to get the subdomain, split one of the match results of the above regex at the first occurrence of "."
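
A minimal sketch of that split, assuming the separator is the first "." in the third capture group. The helper name splitHost and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: runs the regex above, then splits the host at its first dot.
function splitHost(url) {
  const match = /^(https?:\/\/)?(www\.)?([^\/]+)/.exec(url);
  if (!match) return null;
  const host = match[3]; // host without scheme and without "www."
  const dot = host.indexOf('.');
  if (dot === -1) return { subdomain: '', rest: host }; // e.g. "localhost"
  return { subdomain: host.slice(0, dot), rest: host.slice(dot + 1) };
}

console.log(splitHost('https://play.google.com/store/apps'));
// { subdomain: 'play', rest: 'google.com' }
```

This only gives a meaningful subdomain when the host actually has one: for https://www.google.co.in/imghp the first label is google, which is the domain itself, so a real implementation would need some knowledge of public suffixes.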

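Note: regex is faster than the language's built-in module here. In the example below, the regex came out about 15x faster than the built-in module in a single timed run (0.055 ms versus 0.840 ms); keep in mind that the second timing also includes the require('url') call.

JavaScript example with regex: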

console.time('time2');
const pttrn = /^(https?:\/\/)?(www\.)?([^\/]+)/gm
const urlInfo = pttrn.exec("https://www.google.co.in/imghp");
console.timeEnd('time2');

//time2: 0.055ms
console.log(urlInfo[0]) // https://www.google.co.in
console.log(urlInfo[1]) // https://
console.log(urlInfo[2]) // www.
console.log(urlInfo[3]) // google.co.in

Node.js built-in url module:

console.time('time');
const url = require('url');
const urlInfo = url.parse("https://www.google.co.in/imghp");
console.timeEnd('time');

//time: 0.840ms;
console.log(urlInfo.hostname) //www.google.co.in

Answer 7

This JavaScript regex, which uses named capture groups, breaks a URL down into its functional components:

console.log("https://www.sub.domain.google.com:443/maps/place/Arc+De+Triomphe/@48.8737917,2.2928388,17z?query=1&foo#hash".match(/^(?<protocol>https?:\/\/)(?=(?<fqdn>[^:/]+))(?:(?<service>www|ww\d|cdn|ftp|mail|pop\d?|ns\d?|git)\.)?(?:(?<subdomain>[^:/]+)\.)*(?<domain>[^:/]+\.[a-z0-9]+)(?::(?<port>\d+))?(?<path>\/[^?]*)?(?:\?(?<query>[^#]*))?(?:#(?<hash>.*))?/).groups)

Output:

{
  "protocol": "https://",
  "fqdn": "www.sub.domain.google.com",
  "service": "www",
  "subdomain": "sub.domain",
  "domain": "google.com",
  "port": "443",
  "path": "/maps/place/Arc+De+Triomphe/@48.8737917,2.2928388,17z",
  "query": "query=1&foo",
  "hash": "hash"
}

So you can use whichever components you like.
