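I am trying to extract the domain name from a URL string. I'm almost there; I'm using the Uri class.
I have a string. My first thought was to use a Regex, but then I decided to use the Uri class.
I need to convert the above into google.com and into google, without the www.
I did the following: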
Uri test = new Uri(referrer);
log.Info("Domain part : " + test.Host);
Basically this returns www.google.com. If possible, I would like to return the two forms mentioned above:
google.com and google
Is this possible with Uri?
Yes, this is possible. Use:
Uri.GetLeftPart( UriPartial.Authority )
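For reference, a minimal sketch of what these Uri members return. The sample URL is an assumption, since the question's string is not shown. Note that GetLeftPart(UriPartial.Authority) includes the scheme and still contains the www label, so on its own it does not give google.com.

```csharp
using System;

// Assumed sample URL; the question's original string is not shown.
var uri = new Uri("http://www.google.com/search?q=test");

// Scheme plus authority: everything before the path.
string leftPart = uri.GetLeftPart(UriPartial.Authority);

// Host name only, without scheme, port, or path.
string host = uri.Host;

Console.WriteLine(leftPart); // http://www.google.com
Console.WriteLine(host);     // www.google.com
```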
Use Nager.PublicSuffix.
Install the package Nager.PublicSuffix, then:
var domainParser = new DomainParser(new WebTldRuleProvider());
var domainName = domainParser.Get("sub.test.co.uk");
//domainName.Domain = "test";
//domainName.Hostname = "sub.test.co.uk";
//domainName.RegistrableDomain = "test.co.uk";
//domainName.SubDomain = "sub";
//domainName.TLD = "co.uk";
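I tried pretty much everything, but nothing gave the desired result. So here is my approach, adapted from servermanfail.
The TLD file is available at https://publicsuffix.org/list/. I take the file from https://publicsuffix.org/list/effective_tld_names.dat, parse it, and search for the TLDs. If new TLDs are published, just download the latest file.
Have fun.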
using System;
using System.Collections.Generic;
using System.IO;
namespace SearchWebsite
{
internal class NetDomain
{
static public string GetDomainFromUrl(string Url)
{
return GetDomainFromUrl(new Uri(Url));
}
static public string GetDomainFromUrl(string Url, bool Strict)
{
return GetDomainFromUrl(new Uri(Url), Strict);
}
static public string GetDomainFromUrl(Uri Url)
{
return GetDomainFromUrl(Url, false);
}
static public string GetDomainFromUrl(Uri Url, bool Strict)
{
if (Url == null) return null;
initializeTLD();
var dotBits = Url.Host.Split('.');
if (dotBits.Length == 1) return Url.Host; //eg http://localhost/blah.php = "localhost"
if (dotBits.Length == 2) return Url.Host; //eg http://blah.co/blah.php = "blah.co"
string bestMatch = "";
foreach (var tld in DOMAINS)
{
// Match on a label boundary so that e.g. "notco.uk" does not match the suffix "co.uk"
if (Url.Host.EndsWith("." + tld, StringComparison.OrdinalIgnoreCase))
{
if (tld.Length > bestMatch.Length) bestMatch = tld;
}
}
if (string.IsNullOrEmpty(bestMatch))
return Url.Host; //eg http://domain.com/blah = "domain.com"
//add the domain name onto tld
string[] bestBits = bestMatch.Split('.');
string[] inputBits = Url.Host.Split('.');
int getLastBits = bestBits.Length + 1;
bestMatch = "";
for (int c = inputBits.Length - getLastBits; c < inputBits.Length; c++)
{
if (bestMatch.Length > 0) bestMatch += ".";
bestMatch += inputBits[c];
}
return bestMatch;
}
static private void initializeTLD()
{
if (DOMAINS.Count > 0) return;
string line;
// The using block closes the file even if reading throws
using (StreamReader reader = File.OpenText("effective_tld_names.dat"))
{
while ((line = reader.ReadLine()) != null)
{
// Skip blank lines and comment lines
if (!string.IsNullOrEmpty(line) && !line.StartsWith("//"))
{
DOMAINS.Add(line);
}
}
}
}
// This file was taken from https://publicsuffix.org/list/effective_tld_names.dat
static public List<String> DOMAINS = new List<String>();
}
}
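google.com is not guaranteed to be the same as www.google.com (well, for this example they technically are the same, but they could be different).
Maybe what you actually need is to remove the "top-level" domain and the "www" subdomain? Then just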
split('.')
and take the part before the last part!
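A minimal sketch of that idea. NameBeforeTld is a hypothetical helper name, and the sketch assumes the host ends in a single-label TLD such as com; a host like www.google.co.uk would need a public suffix list instead.

```csharp
using System;

// Hypothetical helper: returns the label immediately before the last one.
// Assumes a single-label TLD ("com", "org", ...).
static string NameBeforeTld(string host)
{
    string[] parts = host.Split('.');
    // With a single label (e.g. "localhost") there is no TLD to drop.
    return parts.Length < 2 ? host : parts[parts.Length - 2];
}

Console.WriteLine(NameBeforeTld("www.google.com")); // google
Console.WriteLine(NameBeforeTld("google.com"));     // google
Console.WriteLine(NameBeforeTld("localhost"));      // localhost
```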
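Below is some code that returns just the SLD plus the gTLD or ccTLD extension (note the exception below). I don't care about DNS.
The theory is as follows: a host with fewer than three tokens is kept as is; otherwise the last token is the gTLD or ccTLD, and the token before it counts as part of the extension if it is shorter than three characters or appears in the exception list.
As for the code, short and sweet: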
private static string GetDomainName(string url)
{
string domain = new Uri(url).DnsSafeHost.ToLower();
var tokens = domain.Split('.');
if (tokens.Length > 2)
{
//Add only second level exceptions to the < 3 rule here
string[] exceptions = { "info", "firm", "name", "com", "biz", "gen", "ltd", "web", "net", "pro", "org" };
var validTokens = 2 + ((tokens[tokens.Length - 2].Length < 3 || exceptions.Contains(tokens[tokens.Length - 2])) ? 1 : 0);
domain = string.Join(".", tokens, tokens.Length - validTokens, validTokens);
}
return domain;
}
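The obvious exception is that this will not handle two-letter domain names. So if you are lucky enough to own ab.com, you will need to tweak the code slightly. For us mere mortals, this code covers almost all gTLDs and ccTLDs, apart from a few very exotic ones.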
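To make the rule and its exception concrete, here is a small usage sketch. The function body is copied from the answer above (as a local function, so the snippet is self-contained); the sample URLs are assumptions chosen for illustration.

```csharp
using System;
using System.Linq;

// Copied from the answer above; only the modifiers differ.
static string GetDomainName(string url)
{
    string domain = new Uri(url).DnsSafeHost.ToLower();
    var tokens = domain.Split('.');
    if (tokens.Length > 2)
    {
        string[] exceptions = { "info", "firm", "name", "com", "biz", "gen", "ltd", "web", "net", "pro", "org" };
        var validTokens = 2 + ((tokens[tokens.Length - 2].Length < 3 || exceptions.Contains(tokens[tokens.Length - 2])) ? 1 : 0);
        domain = string.Join(".", tokens, tokens.Length - validTokens, validTokens);
    }
    return domain;
}

Console.WriteLine(GetDomainName("http://sub.example.com/"));    // example.com
Console.WriteLine(GetDomainName("http://www.google.co.uk/"));   // google.co.uk
Console.WriteLine(GetDomainName("http://www.example.com.au/")); // example.com.au
// The two-letter exception: "ab" is mistaken for part of the extension.
Console.WriteLine(GetDomainName("http://www.ab.com/"));         // www.ab.com
```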
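I think you have a misunderstanding of what constitutes a "domain name". There is no such thing as a "pure domain name" in common usage; this is something you need to define yourself if you want consistent results.
Do you just want to strip off the "www" part?
And then have another version that strips off the top-level domain (e.g. the ".com" or ".co.uk" part)?
Another answer mentions split("."). You will need something like that if you want to exclude specific parts of the host name manually. Nothing in the .NET Framework does exactly what you are asking for, so you will need to implement these things yourself.
I came up with the following solution (using Linq):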
public string MainDomainFromHost(string host)
{
string[] parts = host.Split('.');
if (parts.Length <= 2)
return host; // host is probably already a main domain
if (parts[parts.Length - 1].All(char.IsNumber))
return host; // host is probably an IPV4 address
if (parts[parts.Length - 1].Length == 2 && parts[parts.Length - 2].Length == 2)
return string.Join(".", parts.TakeLast(3)); // this is the case for co.uk, co.in, etc...
return string.Join(".", parts.TakeLast(2)); // all others, take only the last 2
}
Yes, I posted the solution here: http://pastebin.com/raw.php?i=raxNQkCF
If you want to remove the extension, just add
if (url.IndexOf(".") > -1) { url = url.Substring(0, url.IndexOf(".")); }
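Uri.Host always returns the full domain (www.google.com), including the label (www) and the top-level domain (com). But often you want to extract the middle part. This is how I do it: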
Uri uri;
bool result = Uri.TryCreate(returnUri, UriKind.Absolute, out uri);
if (result == false)
return false;
//if you are sure it's not "localhost"
string[] domainParts = uri.Host.Split('.');
string topLevel = domainParts[domainParts.Length - 1];
string hostBody = domainParts[domainParts.Length - 2];
string label = domainParts[domainParts.Length - 3];
But you do need to check domainParts.Length, because the given URI is often something like "google.com".
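A minimal sketch of that length check. TrySplitHost is a hypothetical helper name, and returning an empty label when there is no third part is an assumption about the desired behaviour. Like the snippet above, it does not special-case IP addresses or multi-label suffixes.

```csharp
using System;

// Hypothetical helper: splits the host into label / body / top level,
// indexing only the parts that actually exist.
static bool TrySplitHost(string url, out string label, out string hostBody, out string topLevel)
{
    label = hostBody = topLevel = "";
    if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri))
        return false;

    string[] parts = uri.Host.Split('.');
    if (parts.Length < 2)
        return false; // e.g. "localhost": nothing to split

    topLevel = parts[parts.Length - 1];
    hostBody = parts[parts.Length - 2];
    if (parts.Length >= 3)
        label = parts[parts.Length - 3]; // stays empty for "google.com"
    return true;
}

if (TrySplitHost("http://google.com/", out string sub, out string name, out string tld))
    Console.WriteLine($"'{sub}' | {name} | {tld}"); // '' | google | com
```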
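I found a solution for myself that does not use a TLD list or anything else.
It relies on the assumption that the so-called host name is always in the second-to-last position of the Uri's host part: subdomains are always in front of the name, and the TLD is always after it. (This does not hold for two-label suffixes such as co.uk.)
See here: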
private static string GetNameFromHost(string host)
{
if (host.Count(f => f == '.') == 1)
{
return host.Split('.')[0];
}
else
{
var _list = host.Split('.').ToList();
return _list.ElementAt(_list.Count - 2);
}
}
See this UriHostWithoutSubdomain gist.
See the gist for some tests, but the usage is as follows:
string subdomain = UriHostWithoutSubdomain.GetHostWithoutSubdomain("foo.example.com");
// subdomain == "example.com"
// this depends on the following more lower level:
bool hasSub = UriHostWithoutSubdomain.HasSubdomain("foo.example.com", out int index);
// hasSub == true, and index == 4. The latter can be used to get both the subdomain and the domain without subdomain
/// <summary>
/// `Uri` class does not allow getting a subdomain from the `uri.Host`.
/// This class provides a highly efficient method of getting a subdomain
/// if it exists on an input host / domain string (ideally sent in via
/// <see cref="Uri.Host"/>).
/// </summary>
public class UriHostWithoutSubdomain
{
/// <summary>
/// If a subdomain is detected, returns the host without the subdomain, else returns the input value.
/// If null returns null.
/// </summary>
/// <param name="host">Send in `Uri.Host`</param>
public static string GetHostWithoutSubdomain(string host)
{
if(host == null)
return null;
if(!HasSubdomain(host, out int registrableDomainIndex))
return host;
string regDomain = host.Substring(registrableDomainIndex);
return regDomain;
}
/// <summary>
/// Detects if an input host string contains a subdomain. Is a highly efficient
/// implementation: with NO allocations, and with two for loops sharing a single `int i`:
/// one up to the first period, and one that continues from that point
///
/// and is NOT a validator of a domain string. For efficiency purposes
/// that is not desirable, as the intended use-case as well as separation of concerns
/// is that a `Uri.Host` value was input, or something like that.
/// Relying on this allows us to provide this highly efficient algorithm, which essentially
/// only needs to walk once through the string chars looking first for the first period,
/// and then, if a second period is detected, it is established that a subdomain exists.
/// If so, the out arg allows one to know where to take a substring to get the subdomain,
/// or the second-level domain without it, etc.
/// <para />
/// </summary>
/// <param name="host">Uri host / domain string</param>
/// <param name="postSubdomainIndex">If subdomain exists, this will be set to the
/// index at which the 'second-level domain' begins (after the subdomain and 1 AFTER the period,
/// so in foo.example.com, index will be at 'e' in 'example.com').
/// But if no subdomain (if FALSE), then this will be the position of the 'top-level' domain,
/// e.g. at the 'c' in "com" in "example.com". If one wants they could use this to
/// efficiently get the top-level domain and etc.</param>
public static bool HasSubdomain(string host, out int postSubdomainIndex)
{
ArgumentNullException.ThrowIfNull(host);
int i = 0;
int len = host.Length;
for(; i < len; i++)
if(host[i] == '.')
break;
if(i >= len) {
// no period at all
postSubdomainIndex = 0;
return false;
}
postSubdomainIndex = ++i;
for(; i < len; i++)
if(host[i] == '.')
break;
if(i >= len) {
// no second period was found, so period was for TLD (top-level domain)
return false;
}
// second period WAS found, and we are NOT at the end yet,
// but we do NOT need to go further now, subdomain ends before first period,
// we're now already at second
return true;
}
}
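Because domain names come in so many variations, and there is no real authoritative list of what constitutes a "pure domain name" as you describe it, I have just resorted to using Uri.Host in the past. To avoid www.google.com and google.com showing up as two different domains, I have often stripped the "www." from every domain that contains it, since it is almost (almost) guaranteed to point to the same site. It is really the only simple way to do it without risking the loss of some data.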
string domain = new Uri(HttpContext.Current.Request.Url.AbsoluteUri).GetLeftPart(UriPartial.Authority);
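A minimal sketch of the www-stripping step described above, working on Uri.Host rather than on the full left part. NormalizeHost is a hypothetical helper name, and the sample URLs are assumptions.

```csharp
using System;

// Hypothetical helper: returns the host with a leading "www." label removed,
// so that www.google.com and google.com compare as the same domain.
static string NormalizeHost(Uri uri)
{
    const string prefix = "www.";
    string host = uri.Host;
    return host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
        ? host.Substring(prefix.Length)
        : host;
}

var a = NormalizeHost(new Uri("http://www.google.com/search"));
var b = NormalizeHost(new Uri("http://google.com/maps"));
Console.WriteLine(a == b); // True
```

Only the exact "www." label is removed, so a host such as wwwexample.com is left unchanged.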