Parsing domain names in Node.js and the browser

Parsing domain names in a useful way is a bitch. Top-level domains have lots of exceptions and weird rules that are impossible to capture in a sensible regular expression.

The only reliable way of approaching this is using a regularly updated list of all known public suffixes and their basic rules.

Thankfully, the folks at Mozilla maintain the Public Suffix List, which is exactly such a list.

Based on this list I have written a JavaScript module that allows you to parse domain names into meaningful parts: psl.

For most domain names it is pretty straightforward to figure out what the “tld”, “apex domain” and “subdomain” are. For example, given www.foo.com we can easily tell that com is the “tld”, foo.com is the “apex domain” and www is the “subdomain”.

Now consider domain names like a.b.c.d.foo.uk.com. If you are familiar with domain name registrations you probably know where this is going. Special rules apply to many “public suffixes”, and in practical terms uk.com should be treated as the “tld” here, not com.

psl allows you to easily determine which part of the domain name is the tld or public suffix.
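To make the idea concrete, here is a rough sketch of longest-suffix matching. This is not psl's actual code or API: splitDomain and the four-entry SUFFIXES set are hypothetical names made up for illustration, and the real Public Suffix List has thousands of entries.

```javascript
// Tiny, hand-picked rule set for illustration only; the real Public Suffix
// List is far larger and changes over time.
const SUFFIXES = new Set(['com', 'uk.com', 'uk', 'co.uk']);

// Hypothetical helper (not psl's API): split a hostname into public suffix,
// apex domain and subdomain by finding the longest matching suffix.
function splitDomain(hostname) {
  const labels = hostname.toLowerCase().split('.');
  // Try the longest candidate first, so "uk.com" wins over "com".
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join('.');
    if (SUFFIXES.has(candidate)) {
      return {
        tld: candidate,
        // The apex domain is the suffix plus one more label to its left.
        domain: i > 0 ? labels.slice(i - 1).join('.') : null,
        // Everything further left is the subdomain.
        subdomain: i > 1 ? labels.slice(0, i - 1).join('.') : null,
      };
    }
  }
  return { tld: null, domain: null, subdomain: null };
}

console.log(splitDomain('www.foo.com'));
// { tld: 'com', domain: 'foo.com', subdomain: 'www' }
console.log(splitDomain('a.b.c.d.foo.uk.com'));
// { tld: 'uk.com', domain: 'foo.uk.com', subdomain: 'a.b.c.d' }
```

Checking candidates from longest to shortest is what keeps com from swallowing uk.com. A regular expression would need every suffix spelled out to do the same.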

It handles all kinds of special rules like the ones affecting .jp, where the registry reserves domains for each prefecture and government body, but domains can also be registered at the top level domain. Consider a.b.ide.kyoto.jp and www.sony.jp.
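The list expresses these special cases with plain rules, wildcard rules (*.something) and exception rules (!something). Below is a minimal sketch of how such rules can be evaluated. The RULES array is a hand-written sample in the list's format, assumed for illustration (check the live list for the real, current entries), and publicSuffix is a hypothetical helper, not part of psl.

```javascript
// Hand-written sample rules in the Public Suffix List format (assumed for
// illustration; consult the live list for the real, current entries).
const RULES = ['jp', 'kyoto.jp', 'ide.kyoto.jp', '*.kawasaki.jp', '!city.kawasaki.jp'];

// Does a rule match the hostname's labels, comparing from the right?
function matches(ruleLabels, hostLabels) {
  if (ruleLabels.length > hostLabels.length) return false;
  const offset = hostLabels.length - ruleLabels.length;
  return ruleLabels.every((label, i) => label === '*' || label === hostLabels[offset + i]);
}

// Hypothetical helper (not psl's API): return the public suffix of a hostname.
function publicSuffix(hostname) {
  const host = hostname.toLowerCase().split('.');
  let best = null; // longest matching plain or wildcard rule seen so far
  for (const rule of RULES) {
    const isException = rule.startsWith('!');
    const labels = (isException ? rule.slice(1) : rule).split('.');
    if (!matches(labels, host)) continue;
    // An exception rule wins outright: the suffix is the rule minus its leftmost label.
    if (isException) return labels.slice(1).join('.');
    if (!best || labels.length > best.length) best = labels;
  }
  // If no rule matches, fall back to treating the last label as the suffix.
  const length = best ? best.length : 1;
  return host.slice(-length).join('.');
}

console.log(publicSuffix('a.b.ide.kyoto.jp'));     // ide.kyoto.jp
console.log(publicSuffix('www.sony.jp'));          // jp
console.log(publicSuffix('foo.bar.kawasaki.jp'));  // bar.kawasaki.jp
console.log(publicSuffix('www.city.kawasaki.jp')); // kawasaki.jp
```

With these sample rules, b.ide.kyoto.jp would be the apex domain of a.b.ide.kyoto.jp, while sony.jp is registered directly under jp.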

Finally, another very important thing to bear in mind is internationalised domain names. psl handles both Punycode (ASCII) and Unicode domain names.
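If you have not run into the two forms before, this small sketch shows the same name written both ways. It uses Node's built-in url module rather than psl, so it only illustrates the encoding and says nothing about how psl converts names internally. It also will not run in the browser as is.

```javascript
// Node.js built-in helpers for converting between the two forms of an
// internationalised domain name. This is Node's url module, not psl.
const { domainToASCII, domainToUnicode } = require('node:url');

const unicodeName = 'español.com';

// Unicode -> Punycode (the ASCII form that is actually used in DNS).
const asciiName = domainToASCII(unicodeName);
console.log(asciiName); // xn--espaol-zwa.com

// Punycode -> Unicode (the form people read and type).
console.log(domainToUnicode(asciiName)); // español.com
```

Both strings name the same domain, which is why a parser has to accept either one.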

Feedback and pull requests are welcome 😉