This afternoon there was an issue which affected all .io domains across the internet. As there has been no official word from ICB (the registry which runs .io domains, among others), I wanted to write a quick post to explain a) what we saw happening and b) how this affected us and what we've done to prevent future issues.
A bit of background
To convert a domain name into an IP address, we use a system called DNS (Domain Name System). In its most basic form, DNS allows a name (for example codebase.atech.io) to be converted into an IP address. Each time you enter a domain into your browser, your computer uses your ISP's DNS servers to look up that domain and return the result to you. So that owners can control their own DNS, the system allows individual domains to be delegated to other nameservers which are under the control of the domain owner. To work out which nameservers to use, DNS follows a simple path to find the authoritative nameserver(s) for the domain. For example, imagine we are looking up www.atech.io:
- The root nameservers (of which there are currently 13) are queried to find the nameservers responsible for the .io top level domain. The root nameservers will return a number of different nameservers which are able to tell us where to look next. In the case of .io domains there are 7 possible servers to query next.
- We will ask one of the nameservers provided where to look for atech.io. It will return a list of nameservers to ask next. For these, you will often only see two or three possible nameservers.
- One of the atech.io nameservers will now be queried for www.atech.io. If it exists, the appropriate record will be determined and returned to you.
Using the dig tool you can interrogate this process. An example output is shown below:
; <<>> DiG 9.8.3-P1 <<>> +trace www.atech.io
;; global options: +cmd
.           2165    IN  NS  a.root-servers.net.
.           2165    IN  NS  b.root-servers.net.
.           2165    IN  NS  c.root-servers.net.
.           2165    IN  NS  d.root-servers.net.
.           2165    IN  NS  e.root-servers.net.
.           2165    IN  NS  f.root-servers.net.
.           2165    IN  NS  g.root-servers.net.
.           2165    IN  NS  h.root-servers.net.
.           2165    IN  NS  i.root-servers.net.
.           2165    IN  NS  j.root-servers.net.
.           2165    IN  NS  k.root-servers.net.
.           2165    IN  NS  l.root-servers.net.
.           2165    IN  NS  m.root-servers.net.
;; Received 228 bytes from 8.8.8.8#53(8.8.8.8) in 1453 ms
io.         172800  IN  NS  ns1.communitydns.net.
io.         172800  IN  NS  a.nic.io.
io.         172800  IN  NS  b.nic.ac.
io.         172800  IN  NS  ns3.icb.co.uk.
io.         172800  IN  NS  b.nic.io.
io.         172800  IN  NS  b.ns13.net.
io.         172800  IN  NS  a.ns13.net.
;; Received 354 bytes from 192.36.148.17#53(192.36.148.17) in 757 ms
atech.io.       86400   IN  NS  dns1.atech.io.
atech.io.       86400   IN  NS  dns2.atech.io.
;; Received 100 bytes from 49.212.31.192#53(49.212.31.192) in 286 ms
www.atech.io.       3600    IN  CNAME   atech-web.vips.atech.io.
atech-web.vips.atech.io. 3600   IN  A   185.22.208.202
vips.atech.io.      3600    IN  NS  dns1.atech.io.
vips.atech.io.      3600    IN  NS  dns2.atech.io.
;; Received 201 bytes from 185.22.210.2#53(185.22.210.2) in 14 ms
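To make the walk in the trace above a little more concrete, here is a minimal Python sketch of the same delegation chain. It makes no network queries: the `DELEGATIONS` table is simply copied from the dig output (with only two of the 13 root servers listed, for brevity), and the function names `zone_chain` and `delegation_path` are made up for this illustration. Treat it as a toy model of the idea rather than how a real resolver is written.

```python
# Toy delegation table copied from the dig trace above. A real resolver
# sends network queries; this dictionary only stands in for the answers.
DELEGATIONS = {
    ".": ["a.root-servers.net.", "m.root-servers.net."],  # 2 of the 13 roots, for brevity
    "io.": [
        "ns1.communitydns.net.", "a.nic.io.", "b.nic.ac.", "ns3.icb.co.uk.",
        "b.nic.io.", "b.ns13.net.", "a.ns13.net.",
    ],
    "atech.io.": ["dns1.atech.io.", "dns2.atech.io."],
}


def zone_chain(name):
    """Return the candidate zones for a name, from the root downwards."""
    labels = name.rstrip(".").split(".")
    zones = ["."]
    # Build "io.", then "atech.io.", then "www.atech.io." and so on.
    for i in range(len(labels) - 1, -1, -1):
        zones.append(".".join(labels[i:]) + ".")
    return zones


def delegation_path(name, delegations=DELEGATIONS):
    """List (zone, nameservers) for every delegated zone on the way to name."""
    return [(zone, delegations[zone]) for zone in zone_chain(name) if zone in delegations]


if __name__ == "__main__":
    for zone, servers in delegation_path("www.atech.io"):
        print(zone, "->", len(servers), "nameservers known")
```

Each entry in the returned path corresponds to one hop in the trace: the root, then the .io nameservers, then the atech.io nameservers which finally answer for www.atech.io.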
What happened to .io domains today?
Today, an issue occurred in step 2 of the above process. Some of the authoritative nameservers for .io were unable to tell us where to look next, so lookups which relied on them failed.
We don't have any specifics about why this happened and things looked to be restored after about an hour of intermittent issues. We could speculate on what caused this, but we'd rather wait for an official response.
Why did this affect us?
Although we don't use many .io domains for our services, we do use them heavily in our backend infrastructure. All our domains (regardless of their top level domain) were configured to use dns1.atech.io and dns2.atech.io as their authoritative nameservers. This meant that when those nameservers could not be found through .io, these other domains could not be resolved at all.
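The hidden dependency is easy to miss, so here is a rough Python sketch of it. It assumes a simplified model in which a domain needs its own top level domain plus the top level domain of each of its nameservers to be working, and it ignores caching and glue records. The helper names `tld` and `tlds_required`, and the domain `example.com`, are purely illustrative.

```python
def tld(name):
    """Return the top level domain of a DNS name, e.g. 'io' for 'dns1.atech.io'."""
    return name.rstrip(".").rsplit(".", 1)[-1].lower()


def tlds_required(domain, nameservers):
    """Top level domains which must be working for `domain` to resolve.

    Simplified model: the domain's own TLD, plus the TLD of every
    nameserver it is delegated to. Caching and glue are ignored.
    """
    return {tld(domain)} | {tld(ns) for ns in nameservers}


# The nameservers all our domains were using at the time of the outage.
ATECH_NS = ["dns1.atech.io", "dns2.atech.io"]

if __name__ == "__main__":
    # example.com is a hypothetical stand-in for one of our non-.io domains.
    print(sorted(tlds_required("example.com", ATECH_NS)))
```

In this model a .com domain pointed at our nameservers depends on both .com and .io, which is why a problem at .io reached well beyond our .io domains.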
What did we do to mitigate issues like this?
Unfortunately, it does seem as though the .io nameservers aren't as reliable as those provided for other top level domains like .com or .net. Another fact which disturbs us is that during this outage there has been zero communication from the registry responsible for managing the failed nameservers.
Therefore we have taken the decision to set all our domains to use nameservers on the .com top level domain. This means that any future issues with smaller top level domain registries will not affect all our services in such a significant fashion. We will be moving all our domains over to use a.atechdns.com and b.atechdns.com over the next few days.
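As a rough before-and-after comparison, the Python sketch below checks whether an outage of one top level domain would stop a domain resolving under the old and new nameserver sets. It assumes a deliberately simple model: a domain breaks if its own top level domain is down, or if every one of its nameservers lives under the failed one. The function `affected_by` and the domain `example.com` are hypothetical names for this illustration only.

```python
def tld(name):
    """Return the top level domain of a DNS name, e.g. 'com' for 'a.atechdns.com'."""
    return name.rstrip(".").rsplit(".", 1)[-1].lower()


def affected_by(failed_tld, domain, nameservers):
    """True if an outage of failed_tld would stop domain from resolving.

    Simplified model: the domain breaks if its own TLD is down, or if
    every one of its nameservers lives under the failed TLD.
    """
    if tld(domain) == failed_tld:
        return True
    return all(tld(ns) == failed_tld for ns in nameservers)


OLD_NS = ["dns1.atech.io", "dns2.atech.io"]
NEW_NS = ["a.atechdns.com", "b.atechdns.com"]

if __name__ == "__main__":
    # example.com is a hypothetical stand-in for one of our non-.io domains.
    for label, servers in (("old", OLD_NS), ("new", NEW_NS)):
        print(label, "nameservers, .io outage breaks example.com:",
              affected_by("io", "example.com", servers))
```

Note that a domain which is itself under .io, such as atech.io, would still be affected by a .io outage whichever nameservers it uses. The change only removes the dependency for our domains on other top level domains.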