Madmom and MIREX and genres

I’ve been playing around with madmom, a Python library for extracting machine-learning features from music tracks: key, tempo, beats, and so on. Very interesting, and there’s a set of conference papers behind each of the extracted features.

It’s led me to MIREX, the evaluation exchange run alongside the ISMIR conference, which is currently underway in Delft (Nov 4-8). Like many engineering-oriented conferences, this features a competition where the attendees bring their software and run it against a standard set of data to accomplish various tasks.

If you look at this link, you’ll see that the various tasks are of exceptional relevance to library music management. There are tasks which involve genre classification, mood classification, cover song identification, alignment of lyrics to the music, tempo and key estimation, melody extraction. There’s a task to query a database of songs by humming or singing. Some of these tasks are further along than others. The hard part is usually building the databases of ground truth to test against.

This is the sort of music analysis I’d imagined Roon would be doing on our music. But it looks like we may have to do it ourselves. I’m going to get the conference proceedings and read up.


Yes please!
Very interesting topic here

If I’m reading the results properly, the winning codebase for genre and mood classification would appear to be this. But I can’t find a write-up. The software seems conventional; it uses the long-standing music analysis library librosa, and standard machine learning libraries like sklearn and PyTorch. Probably should say, “now-standard”.

By the way, the videos from the conference are online at

The papers are at


It’s interesting to look at the confusion matrices for the 2019 competition results.

If you’re not familiar with confusion matrices, they’re often used as a summary of how well a particular system does on classification tasks (the link above is for the mood classifiers, for instance). If you have N categories to classify things into, the confusion matrix is an N-by-N checkerboard with numbers in each square that count how many things of type X (the rows of the checkerboard) were classified as type Y (the columns). In a perfect result, the only squares with non-zero numbers are on the upper-left to lower-right diagonal. Often this is visualized as a black-and-white checkerboard, with whiter squares being low numbers, and darker ones being the higher numbers. A white matrix with black diagonal is optimal.
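For anyone who wants to see the counting spelled out, here is a minimal sketch in plain Python. The mood labels and the classifier output are invented for illustration, and `confusion_matrix` is a helper written for this post, not the MIREX scoring code.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, categories):
    """Return an N-by-N list of lists: rows are true categories, columns are predictions."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in categories] for t in categories]

# Hypothetical mood labels and classifier output, made up for this example.
categories = ["happy", "sad", "angry"]
truth      = ["happy", "happy", "sad", "sad", "angry", "angry"]
predicted  = ["happy", "sad",   "sad", "sad", "angry", "happy"]

matrix = confusion_matrix(truth, predicted, categories)
for name, row in zip(categories, matrix):
    print(f"{name:>6}: {row}")

# Correct classifications sit on the upper-left to lower-right diagonal.
correct = sum(matrix[i][i] for i in range(len(categories)))
accuracy = correct / len(truth)
print(f"accuracy: {accuracy:.2f}")
```

Anything off the diagonal is a mistake: here one "happy" track was called "sad" and one "angry" track was called "happy".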

You can see that the top-scoring classifier has a pretty good matrix:

whereas the lowest-scoring one has some issues:


Super interesting. Thanks for the links and the thoughts.

The ISMIR 2019 conference proceedings have a lot of interesting papers, but one that stands out is the Erlangen International Audio Labs (yes, the people who brought you MP3) report on their FMP toolkit for music information retrieval (the “MIR” in MIREX). They’ve put together a set of Jupyter notebooks on computational music analysis, designed to complement the textbook Fundamentals of Music Processing.

If you’ve never used a Jupyter notebook, it’s pretty cool. Basically, it’s a dynamic interactive web page, divided into a vertical list of cells. Each cell can have some text, or a figure, or a fragment of some programming language (typically Python or some other language with a read-eval-print loop), or a variety of other things. You can fiddle with the code fragments and execute them, and display the results inline, then try something else and see how that works. Everything is preserved in the background as a file, and once you have a coherent page you want to save or show to someone else, you can just save the file.

Anyway, to quote from the report:

The FMP notebooks provide detailed textbook-like explanations of central techniques and algorithms in combination with Python code examples that illustrate how to implement the theory. All components including the introductions of MIR scenarios, illustrations, sound examples, technical concepts, mathematical details, and code examples are integrated into a consistent and comprehensive framework

Seems like a fantastic resource.



That looks awesome

Been poking around a bit more in the ISMIR archives.

The very first paper in the 2019 proceedings, Data Usage in MIR: History & Future Recommendations, isn’t a bad place to start:

This paper examines the unique issues of data access that MIR has faced over the last 20 years. We explore datasets used in ISMIR papers, examine the evolution of data access over time, and offer three proposals to increase equity of access to data.

The key issue is copyright and the problems of licensing data to run experiments over. Many of the early data sets were collections of oddball or made-up music. What’s more, this had a kind of snowball effect in that it made it hard to know exactly what a MIR system should do. As J. Stephen Downie of UIUC reports in a 2003 paper:

There is a much-lamented paucity of formal literature reporting upon the analyses of the real-world information needs and uses of MIR/MDL users (Downie, 2003; Byrd and Crawford, 2002; Futrelle and Downie, 2002). In fairness, this paucity is partially caused by the non-existence of MIR/MDL systems containing music that users actually want.

You don’t know what users want because you don’t have any music that users want, to run user studies with.

Data is the key. If you’re doing machine learning (ML) for instance, you start out with an empty or neutral model of how songs should be classified. You give it a song, and tell it what the output should be, and it adjusts its internal springs and channels (not really, it’s all numbers inside, but I like to think of them as stylized Rube Goldberg devices [that’s Heath Robinson to you Brits]) so that when presented with that song, it will produce that output. And then you repeat, with a different song. This time the internal adjustments are a bit harder to make, and usually less extensive, because now it has to accommodate both its previous settings and the new song. Then you repeat with a third song, and so on ad nauseam.

Note that you need both the songs and the desired output (together, these are often referred to as “the ground truth”). And you need lots of it, sampled pretty evenly from across the entire catalog of music. Otherwise you’ll get lacunae in your model, where the result can be wildly off, and/or overfitting, where the model has memorized its training songs and does poorly on anything new. The model is really a statistical artifact, and statistics is really deeply in love with the Law of Large Numbers.
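The one-song-at-a-time adjustment described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: each "song" is stood in for by two made-up numbers, the labels "dance" and "ambient" are invented, and the model is a tiny logistic regression trained by stochastic gradient descent, which is one common technique and not necessarily what any MIREX entry uses.

```python
import math

def predict(weights, bias, features):
    """Return the model's estimate (0..1) that a song belongs to class 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, learning_rate=0.5):
    weights = [0.0] * len(examples[0][0])  # the "empty or neutral" starting model
    bias = 0.0
    for _ in range(epochs):
        for features, target in examples:  # one song at a time
            error = predict(weights, bias, features) - target
            # Nudge every internal number a little in the direction that reduces the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical ground truth: (tempo, loudness), both scaled to 0..1,
# paired with the desired output: 1 = "dance", 0 = "ambient".
examples = [
    ([0.9, 0.8], 1),
    ([0.8, 0.9], 1),
    ([0.2, 0.3], 0),
    ([0.1, 0.2], 0),
]

weights, bias = train(examples)
new_song = [0.85, 0.85]  # a song the model has never seen
print(f"P(dance) for new song: {predict(weights, bias, new_song):.2f}")
```

With only four examples the model has seen almost nothing of the space of possible songs, which is exactly why the size and coverage of the ground truth matter so much.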

So where do you get this large collection of music that people actually are interested in, to experiment with? It’s a conundrum. Downie and his collaborators at UIUC got together with the supercomputer center at NCSA to propose

the development of a secure, yet accessible, research environment that allows researchers to remotely access the large-scale testbed collection.

and to get the Naxos label to donate their entire catalog, about 30,000 tracks, to be placed in this repository, along with the entire AllMusic metadata database. This formed the basis for the first MIREX server, which runs the various competition entries on the NCSA supercomputer frameworks over various subsets of the data. More data has been added to the repository over time.

The 2019 overview paper calls for extending this 2003 framework to be a distributed system, with multiple MIREX centers linked over the net, so that record labels could run their own MIREX which would contain their own collection, without having to send their tracks to UIUC to be incorporated in the central repository.

There are a number of other existing test data sets, and I’ll report more on them as I read the papers.

A couple of other papers from the 2019 ISMIR cover databases of tracks that have been created to support various experiments.

“Da-TACOS: A Dataset for Cover Song Identification and Understanding” has been designed to support CSI – cover song identification – when is one song a version of another song? Every so often we see a thread here in the forum asking about why Roon can’t show the best, or the local, or the Qobuz, or whatever, version of a song. Well, how can it know one song is a version of another song? Lots of ongoing research here. Da-TACOS contains “25K songs represented by unique editorial meta-data plus 9 low- and mid-level features pre-computed with open source libraries” to help with that. There are 1000 groups, each containing 13 versions of a song, plus 2000 additional songs which don’t belong to any of the 1000 groups. This allows you to benchmark your algorithm. The paper also includes a nice overview of current CSI algorithms and results.

By the way – features? When I said previously that to train a ML model, you “give it a song”, I simplified a bit. You can’t really give those gizmos songs; you can only give them numbers. So what you do is what’s called feature extraction, by computing some numbers that describe the song. They can be simple, like length in seconds, more complicated like dynamic range, or even musical, like the key or tempo. Those are what you really use to train the model, or classify the song using a pre-trained model.
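To make "computing some numbers that describe the song" concrete, here is a rough sketch in plain Python. It synthesizes a two-second test tone instead of reading a real track, and the feature names, the sample rate, and the use of crest factor (peak over RMS, in dB) as a crude stand-in for dynamic range are all my own assumptions for illustration, not what madmom or AcousticBrainz actually compute.

```python
import math

SAMPLE_RATE = 8000  # samples per second, chosen arbitrarily for this example

def synth_tone(freq_hz, seconds, amplitude):
    """Generate a sine tone as a list of float samples in the range -1..1."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def extract_features(samples, sample_rate):
    """Reduce a list of audio samples to a handful of descriptive numbers."""
    duration = len(samples) / sample_rate
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crest_db = 20 * math.log10(peak / rms)  # crude proxy for dynamic range
    return {"duration_s": duration, "peak": peak, "rms": rms, "crest_factor_db": crest_db}

# A quiet second followed by a loud second stands in for a real track.
track = synth_tone(440, 1.0, 0.1) + synth_tone(440, 1.0, 0.8)
features = extract_features(track, SAMPLE_RATE)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

The resulting dictionary of numbers is the kind of thing that gets fed to the model in place of the audio itself.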

And features are how these databases get around the copyright issues. They extract the features from each track, and make the features public, instead of making the whole track public.

As, for instance, with “The AcousticBrainz Genre Dataset: Multi-Source, Multi-Level, Multi-Label, and Large-Scale”, to support genre classification experiments.

Each of the four datasets contains multiple labels featuring hundreds of sub-genres covering in total over 2,086,000 recordings, which are connected to AcousticBrainz via MBIDs. We refer to the combination of the four datasets and the music features from AcousticBrainz as the AcousticBrainz Genre Dataset.

That’s a lot of data to run experiments on.

To avoid further sidetracking the “why I won’t buy Roon” thread, I really wonder what one could do with access to a couple of hundred thousand libraries curated by audiophiles… just sayin’. It’s “analysis” after all. And if I’m not mistaken, you should be able to do ML with NUC cores (be they CPU or iGPU), so while it might not be the most ethical way to go at it as far as electricity costs are concerned, or the fastest in terms of computing, other than lack of quality ground truth from mediocre genre metadata, I don’t see a real practical problem if RoonLabs wanted to give it a go…

Make it opt-in. Might be jazz-heavy, though. :grinning:


Nah - keep with the startup ethos and make it opt-out :wink: :stuck_out_tongue:

Jazz, Diana Krall, and home-recorded-but-perfectly-tagged nose-whistle bootlegs is what I’d assume, yes… That, and incomprehensible-to-the-machine variations of classical pieces… just brace for the “My library has been analysing for 6 months, and Roon still confuses my Rabin, Tetzlaff, Heifetz and Cepicky 3rd Sonata recordings ! This is completely unacceptable, and I want a refund !” posts.
