This morning’s outage demonstrated how a problem on Roon’s servers can bork the audio systems of Roon’s whole worldwide customer base. It showed that Roon are operating a live service, across many timezones, not just delivering software which we run locally.
Although Roon’s engineers cracked on with the problem right away, there wasn’t any customer communication until quite a few hours later - nothing in the community forum, on Twitter or on the website. By then, customers had created hundreds of messages in dozens of threads reporting various random symptoms. New users who were mid-installation thought they had broken something. Users posted things they didn’t like about the new version without realising that it was broken at the server end. General chaos. No communication.
It would be helpful, once the dust has settled, if Roon could review their critical incident processes. Good practices include having a status page which is hosted completely independently (Atlassian’s statuspage.io product is one commercial example); showing the live availability of Roon components and underlying services (e.g. Tidal, Qobuz) on that status page; having a specific incident escalation process which includes notifying customers that a fault has been recognised and is being addressed; and having a person responsible for customer communication who stays active on that and is not sucked into problem fixing.
Furthermore, this morning’s forum posts demonstrated how many users are outside US timezones. Roon’s incident processes should span all timezones - e.g. a nominated person in Europe could have communicated with customers while waking someone up in the US.
Roon works very hard to give us great software, but we would appreciate a review of incident management along the above lines - as Roon take-up grows, so does the need for very precise incident management and communication processes. Thank you.