Parsing domain names in a useful way is a bitch. Top-level domains have lots of exceptions and weird rules that are impossible to reflect in a sensible regular expression.
The only reliable way of approaching this is using a regularly updated list of all known public suffixes and their basic rules.
Thankfully, the Mozilla guys maintain the Public Suffix List, which is exactly such a list.
For most domain names it is pretty straightforward to figure out what the “tld”, “apex domain” and “subdomain” are. For example, given www.foo.com we can easily tell that com is the “tld”, while the “apex domain” is foo.com and www is a subdomain.
Now consider domain names like a.b.c.d.foo.uk.com. If you are familiar with domain name registrations you probably know where this is going. Exceptional rules apply for many “public suffixes”, and in practical terms, uk.com should be considered the “tld” and not com.
psl allows you to easily determine which part of the domain name is the public suffix.
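To make the idea concrete, here is a minimal sketch of longest-suffix matching against a tiny hardcoded rule set. This is not how psl is implemented: the splitDomain helper and the RULES set are made up for illustration, and a real parser would load the full Public Suffix List rather than four entries.

```javascript
// Toy rule set: a handful of Public Suffix List entries, hardcoded for illustration only.
const RULES = new Set(['com', 'uk', 'co.uk', 'uk.com']);

// Hypothetical helper (not part of psl): split a domain name into its public
// suffix ("tld"), apex domain and subdomain using the longest matching rule.
function splitDomain(input) {
  const labels = input.toLowerCase().split('.');
  // Walk from the longest candidate suffix to the shortest, so that
  // "uk.com" is found before "com".
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join('.');
    if (RULES.has(candidate)) {
      return {
        tld: candidate,
        // The apex domain is the suffix plus one more label, if there is one.
        domain: i > 0 ? labels.slice(i - 1).join('.') : null,
        // Everything left of the apex domain is the subdomain.
        subdomain: i > 1 ? labels.slice(0, i - 1).join('.') : null,
      };
    }
  }
  return null; // no known public suffix in our toy list
}

const simple = splitDomain('www.foo.com');
const tricky = splitDomain('a.b.c.d.foo.uk.com');
```

With these rules, www.foo.com splits into com / foo.com / www, while a.b.c.d.foo.uk.com ends up with uk.com as the suffix and foo.uk.com as the apex domain.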
It handles all kinds of special rules like the ones affecting .jp, where the registry reserves domains for each prefecture and government body, but domains can also be registered directly under the top-level domain.
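As a rough illustration of what such rules look like, the sketch below applies a few .jp entries in the Public Suffix List format, including a wildcard rule (leading *) and an exception rule (leading !). The JP_RULES array, ruleMatches and publicSuffix are hypothetical names for this example, the rule set is a tiny hand-picked subset, and none of this is psl's actual code.

```javascript
// A few .jp rules in Public Suffix List syntax, hardcoded for illustration only.
const JP_RULES = ['jp', 'co.jp', 'tokyo.jp', '*.kawasaki.jp', '!city.kawasaki.jp'];

// True when the rule's labels match the rightmost labels of the domain.
// A "*" label matches any single label.
function ruleMatches(ruleLabels, domainLabels) {
  if (ruleLabels.length > domainLabels.length) return false;
  const tail = domainLabels.slice(-ruleLabels.length);
  return ruleLabels.every((label, i) => label === '*' || label === tail[i]);
}

// Hypothetical helper: return the public suffix of a domain, or null if no rule matches.
function publicSuffix(domain) {
  const labels = domain.toLowerCase().split('.');
  let best = 0; // number of labels in the public suffix found so far
  for (const rule of JP_RULES) {
    const isException = rule.startsWith('!');
    const ruleLabels = (isException ? rule.slice(1) : rule).split('.');
    if (!ruleMatches(ruleLabels, labels)) continue;
    if (isException) {
      // An exception rule wins outright; the suffix is the rule minus its leftmost label.
      best = ruleLabels.length - 1;
      break;
    }
    // Otherwise the matching rule with the most labels prevails.
    best = Math.max(best, ruleLabels.length);
  }
  return best > 0 ? labels.slice(-best).join('.') : null;
}

const direct = publicSuffix('example.jp');              // registered right under jp
const prefecture = publicSuffix('example.tokyo.jp');    // under a prefecture suffix
const wildcard = publicSuffix('foo.bar.kawasaki.jp');   // matched by *.kawasaki.jp
const exception = publicSuffix('www.city.kawasaki.jp'); // carved out by !city.kawasaki.jp
```

Here example.jp and example.tokyo.jp are both registrable names even though they sit at different depths, and city.kawasaki.jp is registrable despite the wildcard covering everything else under kawasaki.jp.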
Finally, another very important thing to bear in mind is internationalised domain names.
psl handles both Punycode ASCII domains and Unicode ones.
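To show what the two forms look like, here is a small sketch using the domainToASCII and domainToUnicode functions from Node's built-in url module. It only demonstrates the conversion itself and assumes a Node build with full ICU support; it says nothing about how psl does this internally.

```javascript
const { domainToASCII, domainToUnicode } = require('url');

// The same internationalised domain in its two spellings.
const unicodeName = 'español.com';
const asciiName = domainToASCII(unicodeName); // Punycode form: 'xn--espaol-zwa.com'
const roundTrip = domainToUnicode(asciiName); // back to 'español.com'

// Normalising both spellings to ASCII lets them be checked against the same rules.
const sameDomain = domainToASCII('xn--espaol-zwa.com') === asciiName;
```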
Feedback and pull requests are welcome 😉