A service status page, and associated critical incident processes

This morning’s outage demonstrated how a problem on Roon’s servers can bork the audio systems of Roon’s whole worldwide customer base. It showed that Roon are operating a live service, across many timezones, not just delivering software which we run locally.

Although Roon’s engineers cracked on with the problem right away, there wasn’t any customer communication until quite a few hours later - nothing in the community forum, on Twitter or on the website. By then, customers had created hundreds of messages in dozens of threads reporting various random symptoms. New users who were mid-installation thought they had broken something. Users posted things they didn’t like about the new version without realising that it was broken at the server end. General chaos. No communication.

It would be helpful, once the dust has cleared, if Roon could review their critical incident processes. Good practices include having a status page which is hosted completely independently (Atlassian’s statuspage.io product is one commercial example); showing the live availability of Roon components and underlying services (e.g. Tidal, Qobuz) on that status page; having a specific incident escalation process which includes notifying customers that a fault has been recognised and is being addressed; and having a person responsible for customer communication who will be active and not sucked into problem fixing.
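To make the status page idea a bit more concrete, here is a minimal Python sketch of how per-component availability could roll up into one overall status. It is purely illustrative: the component names, the `Status` levels and the `overall_status` and `render` helpers are my own assumptions, not anything Roon or statuspage.io actually provides, and a real page would be fed by independent health checks rather than a hard-coded list.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Higher value means worse; these levels are an assumption for illustration.
    OPERATIONAL = 0
    DEGRADED = 1
    OUTAGE = 2


@dataclass
class Component:
    name: str
    status: Status
    third_party: bool = False  # e.g. a streaming service the product depends on


def overall_status(components):
    """Return the worst status across all components (OPERATIONAL if none)."""
    return max(
        (c.status for c in components),
        key=lambda s: s.value,
        default=Status.OPERATIONAL,
    )


def render(components):
    """Build a plain-text status summary, one line per component."""
    lines = [f"Overall: {overall_status(components).name}"]
    for c in components:
        tag = " (third party)" if c.third_party else ""
        lines.append(f"- {c.name}{tag}: {c.status.name}")
    return "\n".join(lines)


# Hypothetical components; the names are made up for this example.
components = [
    Component("Roon account login", Status.OUTAGE),
    Component("Roon metadata", Status.OPERATIONAL),
    Component("Tidal", Status.OPERATIONAL, third_party=True),
    Component("Qobuz", Status.DEGRADED, third_party=True),
]

print(render(components))
```

Listing the third-party services separately would let customers tell at a glance whether a fault sits with Roon or with a streaming provider.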

Furthermore, this morning’s forum posts demonstrated how many users are outside the US timezones. Roon’s incident processes should span all timezones - e.g. there could have been a nominated person in Europe who would have communicated to customers while waking up someone in the US.

Roon works very hard to give us great software, but we would appreciate a review of incident management along the above lines - as Roon take-up grows, so does the need for very precise incident management and communication processes. Thank you.

17 Likes

I agree. We will definitely review this incident during the week.

6 Likes

OK, so here we are, today another service incident, and there is still no status page – the app doesn’t know the servers are down, it just partially stops working. We do have a post in the community this time, which is helpful (although would that have happened if the failure had been outside local working hours?).

@brian Please can you pick up the suggestion from last year and think about how to manage status information?

5 Likes

+1
10 characters

1 Like

+1
after recent Tidal communication/log-in issues

1 Like

Amen to this request.
Dirk

1 Like

+1. This is a good idea.

1 Like

+1 Yupp, there’s no way not to do this.