Modernized Roon SDK

Hi! I’m a new roon user and loving it! I’ve spent the last 4 years or so building my vinyl collection and somehow missed out on this whole roon thing until now. :slight_smile: I’m also an SDK Architect at Microsoft, so less than a week into roon there are already some improvements I’d like to make…

I was surprised to find that roon doesn’t have Alexa integration and it’s bugging the crap out of me that I just can’t tell Alexa to mute the current roon zone when I get a call or have a meeting. I designed the Bot Builder SDKs for the Microsoft Bot Framework so conversational experiences are kind of my jam, and I’m planning to try building an extension that lets you control roon using your voice. To be clear, I’m not wanting to stream audio to the Alexa, just command and control.

Anyway, I started diving into code last night and thought I’d introduce myself to the community. I’ve been digging into some of the existing extension code on github (thank you to everyone for sharing your work.) I was just wondering if anyone can give a summary of the state of the node SDKs and who’s actively working on projects in the community? I see that it’s been many years since any new work on the node SDK, but does everything still work? I’m assuming yes… Any gotchas I should watch out for?

Second… The first bit of work I’m doing is to modernize the current SDK some. I’ve created TypeScript definitions for most of the SDK packages and I have a new RoonExtension class I’m working on that bundles together all of the core SDK services and makes things a little easier to work with. It adds things like:

  • EventEmitter support so you can have multiple components listen to core pair/unpair events.
  • Promises instead of callbacks so you can use modern async/await patterns.
  • Revocable Proxy support which is just a safety mechanism to ensure your code can’t talk to a core after it’s been lost/unpaired.
  • whatever else I can think of :slight_smile:

I’m building this class as part of my extension, but I’m wondering if others would be interested in using it for their projects as well?
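To give a feel for the shape I’m going for, here’s a rough sketch. The Core type and the pairing hooks below are just placeholders to illustrate the pattern, not the real node-roon-api surface:

```typescript
import { EventEmitter } from "events";

// Placeholder for the paired core handle; not the real node-roon-api type.
type Core = object;

class RoonExtension extends EventEmitter {
    private core?: Core;
    private revoke?: () => void;

    // Called by the underlying pairing plumbing when a core pairs.
    protected onCorePaired(core: Core): void {
        // Hand listeners a revocable proxy so any reference they keep stops
        // working the moment the core is lost.
        const { proxy, revoke } = Proxy.revocable(core, {});
        this.core = proxy;
        this.revoke = revoke;
        this.emit("core_paired", this.core);
    }

    protected onCoreUnpaired(): void {
        this.emit("core_unpaired");
        this.revoke?.(); // stale references now throw instead of silently failing
        this.core = undefined;
        this.revoke = undefined;
    }

    // Promise-based accessor so callers can use async/await instead of callbacks.
    getCore(): Promise<Core> {
        if (this.core) return Promise.resolve(this.core);
        return new Promise(resolve => this.once("core_paired", resolve));
    }
}
```

The idea is that other components can just `await extension.getCore()` or subscribe to the pair/unpair events without ever holding onto a dead core reference.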

Nice to meet everyone…

Steve


FYI


Thanks @Martin_Webster I saw that… That’s why I made the comment that I’m only looking to add command-and-control support via voice, not audio streaming to an Alexa. You would be able to issue commands like “Alexa, ask roon to mute the office” or “Alexa, ask roon to play Pink Floyd in the living room”.

Even that has its challenges. You need a service in the cloud that Alexa talks to, and that service needs to talk to something running on your local network. That’s the piece I’m working on first. I’m building an app that will open a WebSocket against a service in the cloud and create something we at Microsoft call a Digital Twin. The Alexa skill will be able to talk to a digital twin for a local roon network and make API calls as if it were local.
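Roughly, the local side of that bridge looks something like the sketch below. The relay URL, message shape, and handleCommand() are all hypothetical placeholders, not a real service:

```typescript
// Minimal sketch of the local bridge: hold a WebSocket open to a cloud relay
// so the Alexa skill can reach the local network without any inbound ports.
const RELAY_URL = "wss://example-relay.invalid/roon-bridge"; // placeholder URL

interface BridgeCommand { id: string; action: string; zone?: string }

function connect(): void {
    // Standard WebSocket API (global in recent Node.js, or via the `ws` package).
    const ws = new WebSocket(RELAY_URL);

    ws.addEventListener("message", async (event) => {
        // The cloud side (the Alexa skill) sends a command; the bridge executes it
        // against the local Roon core and sends the result back on the same socket.
        const cmd: BridgeCommand = JSON.parse(String(event.data));
        const result = await handleCommand(cmd);
        ws.send(JSON.stringify({ id: cmd.id, result }));
    });

    // Keep the "digital twin" connection alive; reconnect if the socket drops.
    ws.addEventListener("close", () => setTimeout(connect, 5000));
}

async function handleCommand(cmd: BridgeCommand) {
    // Placeholder: dispatch to the Roon transport API here (mute, play, etc.).
    return { ok: true, action: cmd.action, zone: cmd.zone };
}

connect();
```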


That’s all the Roon team were seeking to do. Alexa is, in my experience, pretty poor with any kind of background noise and similar-sounding words. Quite often, when turning on my hi-fi, Alexa will respond: “I don’t know high five.” As soon as you have to raise your voice to control something using Alexa, it’s lost its utility. Nonetheless, if you get this working, I’m sure there will be some happy people.


Ah… That means they weren’t doing something we call “priming”. You need to train Alexa on the keywords found in the user’s media library. That’s one of the main issues I expect to have to work on. As I said… This is my area of expertise…


I’ve no idea what they were doing, only that they weren’t satisfied with the quality of their implementation.


Thanks for the feedback @Martin_Webster. It should be a fun little project. In addition to priming there are some tricks you can do using things like stemming that expand the word alternatives. Worst case, I can build my own local index of the user’s library so I can implement my own search for things like “high five” using all of the alternatives… I built the Desktop Search experience for Windows 95 & Windows 7, and I geared the entire experience around just being able to efficiently search my large collection of ripped CDs :slight_smile:


I posted a design for something like this a couple years ago (using the Google Assistant framework, but remarkably similar). Your Google Assistant needs to pass the spoken command over to another public service, which has a Web socket opened from the LAN, and it passes the command on down to the local server, which then can act on it. I wanted to play actual tracks and artists, though, and I didn’t see a good way to disambiguate the spoken names of songs and people, particularly with accents and allophones, the bit you refer to as “priming”. I figured I’d have to get the user to speak a set of phrases which would contain all usable phones, then analyze the song titles to get their phonemes, and use the map of user phones to general phonemes to do the title selection. Seems like a lot of work.


So priming will help minimize the errors you get from your speech rec system. For Alexa, that translates to more accurate slot filling. But that’s only half the battle. You then need to be able to find items in a fuzzier way on the backend using features like stemming. That may mean having to build a custom Lucene index (a popular search engine) that can be tuned for speech, but I’m planning to start with roon’s existing search engine and I’ll just have to see if I need to further tune things. I’ll need to write a crawler to collect the priming data anyway, so adding a custom Lucene index isn’t that much more work.
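The local index doesn’t have to be fancy to start with. Something like the sketch below, where the titles and the stem() function are purely illustrative stand-ins (a real implementation would use a proper stemmer like Porter and feed from the library crawl):

```typescript
// Key every library title by its stemmed tokens so near-miss word forms
// ("hopes", "hoped", "hoping") all land in the same bucket.
function stem(word: string): string {
    // Extremely naive stand-in for a real stemmer, just for illustration.
    return word.toLowerCase().replace(/(ing|ed|es|s|'s)$/, "");
}

function buildIndex(titles: string[]): Map<string, Set<string>> {
    const index = new Map<string, Set<string>>();
    for (const title of titles) {
        for (const token of title.split(/\s+/)) {
            const key = stem(token);
            if (!index.has(key)) index.set(key, new Set());
            index.get(key)!.add(title);
        }
    }
    return index;
}

// Look up a (possibly misheard) query by its stems.
function lookup(index: Map<string, Set<string>>, query: string): string[] {
    const hits = new Set<string>();
    for (const token of query.split(/\s+/)) {
        for (const title of index.get(stem(token)) ?? []) hits.add(title);
    }
    return [...hits];
}

const index = buildIndex(["Wish You Were Here", "Echoes", "High Hopes"]);
console.log(lookup(index, "play high hopes")); // ["High Hopes"]
```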


Adding an edit distance component to your ranking algorithm is another good way to add fuzziness to your search engine. Edit distance measures how many single-character changes it takes to turn the user’s query into the field that was matched, so the closer the match, the higher it ranks.
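As a quick illustration of the idea (the titles here are just made-up examples):

```typescript
// Classic single-row Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
    const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        let prev = dp[0];
        dp[0] = i;
        for (let j = 1; j <= b.length; j++) {
            const tmp = dp[j];
            dp[j] = Math.min(
                dp[j] + 1,                              // deletion
                dp[j - 1] + 1,                          // insertion
                prev + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
            );
            prev = tmp;
        }
    }
    return dp[b.length];
}

// Rank candidate titles by how close they are to the (possibly misheard) query.
function rank(query: string, titles: string[]): string[] {
    return [...titles].sort(
        (x, y) => levenshtein(query.toLowerCase(), x.toLowerCase())
                - levenshtein(query.toLowerCase(), y.toLowerCase()));
}

// "pink floid" (as a speech rec system might hear it) still ranks "Pink Floyd" first.
console.log(rank("pink floid", ["Pink Floyd", "Pink Martini", "Fleetwood Mac"]));
```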


Yep, I’m there with you. But it’s more than just stemming. You have to do phoneme analysis on the stems, then create a user-specific map of phones to phonemes, to be able to understand the user’s speech patterns. For instance, Google Play Music, when I used it, was consistently unable to play the track “Krazy Kat”, presumably because what it heard was “crazy cat”. The perils of using textual matching in a phoneme-based problem.


You shouldn’t need to do your own phoneme analysis. That’s what providing the priming list to Alexa does. You give them the priming list (I think it’s capped at 400,000 terms or something) and they do all of the analysis.

Sure. There are various automated tools. Somehow you have to take the list of words you need to recognize, and produce their phoneme maps, for a particular dialect (British versus American versus Australian, etc.). But the hard part for me was creating the phone-to-phoneme map on a per-user basis. And I don’t believe Google Assistant would give me the phone breakdown either; it was the actual speech, plus their guess at the transcription of that speech. So I would have needed a phone analysis piece, too.

Even with priming you still probably want a secondary mapping that happens on your side… The example I always like to use is zero. If a voice assistant asks you for your account number and you reply “one zero seven …” for some reason, speech rec systems almost always want to give you “1 zero 7” back, so you need to be prepared for that and map it into “107” on your side. Keyword priming won’t help with that because the speech rec system doesn’t know that you’re expecting a number back. This is why Alexa slots are typed. By specifying a “number” slot, Alexa can do post-processing to map “1 zero 7” to “107” before it gives you the slot value.
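That secondary mapping step is usually trivial code; the trick is remembering you need it. A toy version of the “1 zero 7” example (the word list is just illustrative):

```typescript
// Normalize whatever the speech rec hands back (e.g. "1 zero 7") before using it.
const WORD_DIGITS: Record<string, string> = {
    zero: "0", oh: "0", one: "1", two: "2", three: "3", four: "4",
    five: "5", six: "6", seven: "7", eight: "8", nine: "9",
};

function normalizeNumber(utterance: string): string {
    return utterance
        .toLowerCase()
        .split(/\s+/)
        .map(token => WORD_DIGITS[token] ?? token) // map spoken digits to numerals
        .join("");
}

console.log(normalizeNumber("1 zero 7")); // "107"
```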

I looked up the discussions from four years ago…


I haven’t looked at Google Home in a while, but in general all assistant platforms want to hide internal details like that as much as they can. I’m not even sure if Google lets you provide priming information? They must…

Most of my experience is using a Language Understanding service my old team built called LUIS. https://www.luis.ai/

Normally I’d prefer to use LUIS for everything because a) I understand how it thinks, and b) it’s super flexible. But for speech you really need to get everything right at the very beginning of the pipe when you do the recognition task. That means that for this project I’ll just have to use keyword priming plus Alexa slots, which should be fine.
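For what it’s worth, the Alexa side of a single command ends up being pretty small once the slots are in place. Something roughly like this, using the ask-sdk-core library; the intent name, slot name, and sendToBridge() helper are all made up for illustration, not the actual skill:

```typescript
import * as Alexa from "ask-sdk-core";

// Hypothetical helper that forwards the command over the cloud relay to the
// local bridge on the roon network.
async function sendToBridge(cmd: { action: string; zone: string }): Promise<void> {
    console.log("forwarding to local bridge (placeholder)", cmd);
}

const MuteZoneIntentHandler: Alexa.RequestHandler = {
    canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === "IntentRequest"
            && Alexa.getIntentName(handlerInput.requestEnvelope) === "MuteZoneIntent";
    },
    async handle(handlerInput) {
        // Slot filled by Alexa, primed with the zone names collected by the crawler.
        const zone = Alexa.getSlotValue(handlerInput.requestEnvelope, "zone");
        await sendToBridge({ action: "mute", zone });
        return handlerInput.responseBuilder.speak(`Muting ${zone}.`).getResponse();
    },
};

export const handler = Alexa.SkillBuilders.custom()
    .addRequestHandlers(MuteZoneIntentHandler)
    .lambda();
```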

Check out Phonemizer

It does look like Google Home supports priming; they just call it “speech adaptation”. You would think as an industry we could get together and agree on what we call everything :).

https://medium.com/google-cloud/mastering-auto-speech-adaptation-in-dialogflow-for-voice-agents-25d5b65a1cf9

That would mean that if I can get all of this to work well for Alexa, it shouldn’t be too difficult for someone to adapt it to work with Google Home. I tried all three for a while: Alexa, Google Home, and Cortana (something I indirectly worked on), and Alexa was light-years ahead tech-wise, so that’s what I use everywhere and I don’t even own a Google Home puck anymore.