When looking for a .com name, it can be frustrating to find many that are already taken but appear to be unused.

Is there rampant domain speculation or some other explanation?

Let’s look at the data…

There are currently 137 million .com domain names registered [1]. Of these, roughly a third are “in use” (businesses, personal websites, email, etc.), another third appear to be unused, and the last third serve a variety of speculative purposes.

.com Domain Usage, from a sample of 2,188 domains
99% CI ±3% [2]
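The sample size and margin in this caption are consistent with the worst-case formula from the Thompson paper cited in footnote 2. The sketch below assumes that is how the figure was derived, with a significance level of 0.01 and a half-width of 0.03; the function name and the `max_m` cut-off are illustrative choices, not something the article states.

```python
from statistics import NormalDist

def thompson_sample_size(alpha: float, d: float, max_m: int = 50) -> tuple[float, int]:
    """Worst-case n so that every multinomial proportion is simultaneously
    within +/- d of its true value with probability at least 1 - alpha."""
    best_n, best_m = 0.0, 0
    for m in range(2, max_m + 1):
        # Worst case considered: m categories with probability 1/m each, the rest empty.
        z = NormalDist().inv_cdf(1 - alpha / (2 * m))
        n = z * z * (1 / m) * (1 - 1 / m) / (d * d)
        if n > best_n:
            best_n, best_m = n, m
    return best_n, best_m

n, m = thompson_sample_size(alpha=0.01, d=0.03)
print(f"worst-case sample size: {n:.1f} (at m = {m})")
```

Under these assumptions the result lands between 2,188 and 2,189, which agrees with the 2,188 domains that were categorized.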

How I determined these numbers

I started by crawling a random sample of the domains from the top-level .com DNS zone file [3] until reaching 100,000 valid domains [4].
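The article does not say how the random sample was drawn. One way to do it in a single pass over a very large zone file is reservoir sampling, sketched below. The three-field record layout (`<owner> NS <name server>`), the adjacency of a domain's NS records, and both function names are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator

def domains_from_zone(lines: Iterable[str], origin: str = "com") -> Iterator[str]:
    """Yield each delegated domain once, assuming its NS records are adjacent."""
    previous = None
    for line in lines:
        fields = line.split()
        # Skip anything that is not an NS record, e.g. name server address records.
        if len(fields) != 3 or fields[1].upper() != "NS":
            continue
        name = f"{fields[0].lower()}.{origin}"
        if name != previous:
            previous = name
            yield name

def reservoir_sample(items: Iterable[str], k: int, rng: random.Random) -> list[str]:
    """Uniform sample of k items from a stream of unknown length (Algorithm R)."""
    sample: list[str] = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Made-up records, for illustration only.
zone = [
    "EXAMPLE NS NS1.EXAMPLE.NET.",
    "EXAMPLE NS NS2.EXAMPLE.NET.",
    "NS1.FOO A 192.0.2.1",
    "FOO NS NS1.FOO.COM.",
    "BAR NS NS1.FOO.COM.",
]
picked = reservoir_sample(domains_from_zone(zone), k=2, rng=random.Random(0))
print(picked)
```

Filtering on the record type also drops the name server entries that footnote 4 says were excluded from the analysis.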

For each domain, I collected the following:

  • the WHOIS record [5],
  • all DNS records for the bare domain (e.g. example.com) and its www subdomain [6],
  • HTTP and HTTPS [7] responses (status code, headers, and bodies) for the root page of the bare domain and its www subdomain,
  • screenshots of the root page as viewed by Mozilla Firefox 64.0 on Linux.
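As a rough illustration of the HTTP part of this list, the sketch below builds the four root-page URLs for a domain and fetches one of them with Python's standard library. It is not the crawler used for the study: the function names, the user agent string, and the timeout are assumptions, and `urlopen` follows redirects, so the recorded status is that of the final page.

```python
import http.client
import urllib.error
import urllib.request

def crawl_targets(domain: str) -> list[str]:
    """Root-page URLs for the bare domain and its www subdomain, over HTTP and HTTPS."""
    hosts = (domain, f"www.{domain}")
    return [f"{scheme}://{host}/" for host in hosts for scheme in ("http", "https")]

def fetch(url: str, timeout: float = 10.0) -> dict:
    """Return status, headers, and body, or the error that prevented a response."""
    request = urllib.request.Request(url, headers={"User-Agent": "example-crawler"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return {"url": url, "status": response.status,
                    "headers": dict(response.headers), "body": response.read()}
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx or 5xx status.
        return {"url": url, "status": error.code,
                "headers": dict(error.headers), "body": error.read()}
    except (urllib.error.URLError, http.client.HTTPException, OSError) as error:
        # DNS failure, refused connection, timeout, or failed certificate check.
        return {"url": url, "status": None, "error": str(error)}

print(crawl_targets("example.com"))
```

By default `urlopen` verifies certificates, so an HTTPS site with an invalid certificate ends up in the last branch, which matches the behaviour described in footnote 7.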

The crawl took a little over 48 hours from a single server located in a Singapore data centre.

I ran a follow-up crawl for any domains that failed to connect over HTTP or HTTPS (in case of transient errors). Finally, among the 2,188 domains to be categorized, I manually checked any that had failed, in case the crawler had timed out or had its DOM events blocked by JavaScript.

Then, I wrote a script to help me categorize websites based on their screenshot and body.

The categorization script presents the possible categories as a list of buttons, with Content being the default.

I used the script to categorize domains over the next two days [8]. In some cases the screenshot and body were not sufficient, so I manually opened the domain in a web browser for inspection.
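Footnote 8 mentions bulk categorization with regular expressions for repetitive cases. The sketch below shows one way that could look, including a preview step before anything is applied. The patterns and category assignments are hypothetical, since the article does not give the expressions or page titles that were actually used.

```python
import re
from typing import Optional

# Hypothetical rules, for illustration only.
TITLE_RULES = [
    (re.compile(r"\bfor sale\b", re.IGNORECASE), "For Sale"),
    (re.compile(r"\bunder construction\b", re.IGNORECASE), "Empty"),
]
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def page_title(body: str) -> Optional[str]:
    """Extract the text of the first <title> element, if any."""
    match = TITLE_RE.search(body)
    return match.group(1).strip() if match else None

def bulk_category(body: str) -> Optional[str]:
    """Return a category when a rule matches the title, else None (manual review)."""
    title = page_title(body)
    if title is None:
        return None
    for pattern, category in TITLE_RULES:
        if pattern.search(title):
            return category
    return None

def preview(bodies: dict[str, str], pattern: re.Pattern) -> list[tuple[str, str]]:
    """List (domain, title) pairs a pattern would match, to check it is not too broad."""
    hits = []
    for domain, body in bodies.items():
        title = page_title(body)
        if title is not None and pattern.search(title):
            hits.append((domain, title))
    return hits

bodies = {
    "a.example": "<html><head><title>a.example Domain For Sale</title></head></html>",
    "b.example": "<html><head><title>My blog</title></head></html>",
}
print(preview(bodies, TITLE_RULES[0][0]))
print({domain: bulk_category(body) for domain, body in bodies.items()})
```

Anything that returns `None` would still go through the manual, button-based script.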

Summary statistics and insights

Top 10 .com Domain Registrars, from a sample of 100,000 domains

  • GoDaddy is the registrar for a third of all .com domain names, or roughly 45 million domains. Of those, one in three shows a parking page. In other words, more than 10 per cent of all .com domain names host GoDaddy ad pages.
  • While there are 1,851 registrars in the sample, the majority of those are controlled by a smaller number of operators [9].
  • 25 per cent of domains were registered within the last year.

Domain ages according to WHOIS creation dates, from a sample of 100,000 domains
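Domain ages like these can be derived from the creation date in each WHOIS record. The sketch below assumes a `Creation Date:` field with an ISO 8601 UTC timestamp; the sample record is made up for illustration, and the function names are not from the article.

```python
import re
from datetime import datetime, timezone
from typing import Optional

CREATION_RE = re.compile(r"^\s*Creation Date:\s*(\S+)", re.MULTILINE)

def creation_date(whois_text: str) -> Optional[datetime]:
    """Parse the creation timestamp from a WHOIS record, if present."""
    match = CREATION_RE.search(whois_text)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def share_registered_within(dates: list[datetime], days: int, now: datetime) -> float:
    """Fraction of domains whose creation date is less than `days` days before `now`."""
    if not dates:
        return 0.0
    recent = sum(1 for created in dates if (now - created).days < days)
    return recent / len(dates)

# Made-up record, for illustration only.
sample_record = """\
   Domain Name: EXAMPLE.COM
   Creation Date: 2018-06-01T12:00:00Z
   Registrar: Example Registrar, Inc.
"""
crawl_start = datetime(2019, 1, 21, tzinfo=timezone.utc)
dates = [creation_date(sample_record), datetime(2010, 1, 1, tzinfo=timezone.utc)]
print(share_registered_within(dates, days=365, now=crawl_start))
```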

Domain categories

These categories evolved as I worked.

For example, I hadn’t anticipated the high number of gambling domains (aliases).

For most categories, I’ve included a random sample of screenshots from that category excluding redundant ones.

Content (31% or ~43 million)

Content is the category of any domain with a website displaying unique content. It doesn’t matter what the content is, as long as it appears to be unique for the domain and publicly accessible.

When I was unsure, I placed domains in this category by default.

Ads (23% or ~31 million)

Note that half the domains in this category are GoDaddy parking pages, on which GoDaddy places Google ads based on the keywords related to the domain name.
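The GoDaddy figures can be cross-checked with simple arithmetic. The sketch below uses only the rounded fractions quoted in the article and the Verisign total from footnote 1, so the results are approximate; the variable and function names are illustrative.

```python
TOTAL_COM = 137_756_106  # Verisign count quoted in footnote 1

# Route 1: a third of domains are at GoDaddy, and one in three of those is parked with ads.
via_registrar = (1 / 3) * (1 / 3)

# Route 2: half of the Ads category (23% of all domains) is GoDaddy parking pages.
via_ads_category = 0.23 * 0.5

def estimate_millions(share: float) -> float:
    """Scale a sample share to the whole .com zone, in millions of domains."""
    return round(share * TOTAL_COM / 1e6, 1)

print(f"via registrar share: {via_registrar:.1%}")
print(f"via Ads category:    {via_ads_category:.1%}")
print(f"Content (31%):       ~{estimate_millions(0.31)} million")
```

Both routes give a little over 11 per cent, which supports the statement that more than 10 per cent of all .com domains show GoDaddy ad pages.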

No Web Server (11% or ~16 million)

If I could not connect to, or get a valid response from, port 80 or 443 on either the bare domain or its www subdomain, and the domain had no MX records, I placed it in this category.

Some of these domains likely have some non-web use, such as an FTP or video game server, but I expect them to be a small fraction.

Additionally, the crawling server was only configured for IPv4, so any IPv6-only websites would have been grouped here.
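The rule for this category, together with the Mail category defined further down, can be expressed as a small decision function. This is one reading of the text rather than the script that was used; the function name and the input representation are assumptions.

```python
from typing import Optional

def fallback_category(web_statuses: list[Optional[int]], has_mx: bool) -> Optional[str]:
    """Categorize a domain that may have no web presence.

    web_statuses holds one entry per endpoint (HTTP and HTTPS on the bare
    domain and the www subdomain): a status code if a valid response arrived,
    otherwise None.
    """
    if any(status is not None for status in web_statuses):
        # A web server answered somewhere; the page itself decides the category.
        return None
    return "Mail" if has_mx else "No Web Server"

print(fallback_category([None, None, None, None], has_mx=False))
print(fallback_category([None, None, None, None], has_mx=True))
print(fallback_category([None, 200, None, None], has_mx=False))
```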

Empty (9.2% or ~13 million)

An Empty domain is one for which a web server is answering requests, but returning empty pages, 404s, or unfilled templates (such as default WordPress installs).

The difference between an Empty domain and a Parked domain is that the Empty domain has presumably been configured by the user, but no content has been added yet.

For Sale (7.1% or ~9.8 million)

Many domains are listed For Sale, usually by domain investors, through various brokers and marketplaces.

Nearly half of this category appears to be domains sold by HugeDomains, although their website lists only “over 200,000” domains available for purchase (a fraction of their ~4 million domains if the sample is representative).

I only included domains listed on recognizable marketplaces, or whose contact details were not part of an ad placement. Ad networks and domain brokers often falsely claim to represent a domain owner, so I categorized all such domains as Ads instead.

Error (5.7% or ~7.9 million)

If a domain returned any type of error, whether an HTTP error or an in-page error, it belongs to this category.

Note that I might have miscategorized some Private domains as Errors if they used basic authentication, as I did not distinguish responses that reject a request for lack of credentials (typically 401 Unauthorized, sometimes 403 Forbidden) from other errors.

Parked (4.8% or ~6.5 million)

Parked domains are those that display a page from the registrar or host explaining that the domain has not been set up yet.

To qualify as Parked, a domain had to serve a page without any external ads. It could advertise its own services, but it couldn’t place ads from an ad network.

Gambling (3.0% or ~4 million)

All websites in this category are in Chinese and are operating under aliases, often short strings of numbers or consonants (e.g. 17770012 or tdwhtr).

They follow common templates and contain similar images, often with automatically-generated logos. I assume their purpose is to attract people who think the names are lucky.

Mail (2.6% or ~3.5 million)

Any domain not in any other category, but with MX DNS records (for email), I categorized as Mail.

I did not attempt to see if the mail server was working or if delivery was possible. It’s possible that many of these domains are not actually used for email, but I’ve given them the benefit of the doubt.

Redirect (1.1% or ~1.6 million)

Redirects include vanity domains pointing to Facebook pages, alternative names for businesses, etc.

Private (0.64% or ~0.9 million)

Private domains did not appear to have any content accessible without first logging in (or in some cases registering).

Porn (0.59% or ~0.8 million)

Similar to gambling websites, a number of pornographic websites operate under various aliases. The websites were predominantly in Chinese and the domains followed similar naming patterns. As many of the sites display pornographic material directly (not after a warning), I’ve not included the screenshots here.

  1. ^ According to Verisign there are 137,756,106 .com domains in the “active zone” as of 2019-01-27. I had previously verified this number against the DNS zone file downloaded on 2019-01-21.
  2. ^ Steven K. Thompson. Sample Size for Estimating Multinomial Proportions. The American Statistician, 41(1):42-46, February 1987.
  3. ^ I downloaded the zone file from Verisign at 2019-01-21 02:00 UTC and crawled the domains from 2019-01-21 11:20:52 UTC to 2019-01-23 14:04:40 UTC.
  4. ^ Not all records in the zone file are valid domains. Some do not have a WHOIS record and may act as honeypots to catch people distributing and using zone files without permission. It’s possible that there are also valid domains that act as honeypots, but without any way to identify them, I’ve ignored that possibility for the purpose of this study. Additionally, approximately 1 per cent of the records in the zone file are for name servers, not registered domains. I excluded them from all analysis (i.e. only 98,854 of the 100,000 crawled records are used).
  5. ^ WHOIS records are directly from Verisign’s WHOIS server.
  6. ^ I collected DNS records by issuing a DNS ANY query directly to the name servers listed in the domain’s WHOIS record (in order to avoid inaccuracies due to caching and recursive resolution). A small number of DNS providers do not respond correctly or at all to ANY queries.
  7. ^ The crawler verified SSL certificates, so any HTTPS-only websites with invalid SSL certificates were classified as Error.
  8. ^ I did not manually categorize every website. When I noticed repetitive and obvious cases, such as when the title of the page was, I used an appropriate regular expression to bulk categorize website bodies that matched. I previewed the matches beforehand to ensure that they were not overly broad, but it’s possible that I misclassified some edge cases.
  9. ^ DropCatch.com uses numbered LLCs like DropCatch.com 1000 LLC, DropCatch.com 1001 LLC, DropCatch.com 1002 LLC, etc. Other drop catching operators have similar collections of names, but not all alternate registrars are named so obviously.

Christopher Forno is CTO at the Singapore Data Company, which specialises in data science, data engineering, data security and DevOps consulting.

Image Credit: ojogabonitoo

This article first appeared on singaporedatacompany.com
