Roon Radio: Unexpected Error limiting radio to library

It’s been a long “morning” spent fixing this. Things should be better now. Sorry for the lack of communication. We’ve had developers focused on the issue for many hours, but I wasn’t aware of this thread until a few minutes ago, when I finally got a chance to check messages for the first time today (I woke up straight into this problem).

Our database server ran into a major issue right about when our developers in NY were falling asleep. This wasn’t a straightforward problem that could be solved in the operations domain, so we had to get a few developers working together on it (in US time zones) to figure out what was happening and why.

About 90 minutes ago, we deployed some new code with a workaround for the problem. There may be some residual effects as a backlog of work that was queued up while things were going wrong flushes through the systems. We will continue to monitor the situation through the weekend.

Our apologies for the turbulence. This is really not how things are supposed to go. We’ll be revisiting this next week to consider ways of making our systems more robust to this sort of thing in the future.

(For the technical people: at about 2AM local time, a table’s row count crept ever so slightly past an invisible threshold, and Postgres’s query planner changed how it executes one of the Roon Radio queries in a way that made it 300 times slower than before. The planner decided that hashing an 8-million-row table on each query, so that it could perform a hash join, was a good optimization compared to ~2-3 primary key lookups against that table.)
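
(For anyone curious how a plan can flip like that, here is a toy sketch in Python. It is not Postgres's actual cost model: the function names, the cost constants, and the idea of driving the choice from an estimated lookup count are all assumptions made for illustration. It only shows how a cost-based planner that overestimates the number of lookups can prefer a full hash build, even though the real work is just a few primary key lookups.)

```python
import math

# Toy cost constants, invented for illustration (not Postgres settings).
PAGE_COST = 4.0   # assumed cost per B-tree level touched in a key lookup
ROW_COST = 0.01   # assumed cost to hash or probe a single row


def index_lookup_cost(lookups: float, table_rows: int) -> float:
    """Cost of doing `lookups` primary key lookups against the table."""
    return lookups * PAGE_COST * math.log2(table_rows)


def hash_join_cost(lookups: float, table_rows: int) -> float:
    """Cost of hashing the whole table, then probing it `lookups` times."""
    return table_rows * ROW_COST + lookups * ROW_COST


def choose_plan(estimated_lookups: float, table_rows: int) -> str:
    """Pick the plan with the lower estimated cost, as a cost-based planner would."""
    if index_lookup_cost(estimated_lookups, table_rows) <= hash_join_cost(
        estimated_lookups, table_rows
    ):
        return "index_lookups"
    return "hash_join"


TABLE_ROWS = 8_000_000
ACTUAL_LOOKUPS = 3  # the query really needs only a few key lookups

for estimate in (800, 900):
    plan = choose_plan(estimate, TABLE_ROWS)
    cost_fn = index_lookup_cost if plan == "index_lookups" else hash_join_cost
    print(estimate, plan, round(cost_fn(ACTUAL_LOOKUPS, TABLE_ROWS), 1))
```

(In this toy model, a small change in the estimate moves the choice from key lookups to the hash join, and the hash join then pays for hashing every row on every query no matter how few rows are actually needed.)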
