Been poking around a bit more in the ISMIR archives.
The very first paper in the 2019 proceedings, Data Usage in MIR: History & Future Recommendations, isn’t a bad place to start:
This paper examines the unique issues of data access that MIR has faced over the last 20 years. We explore datasets used in ISMIR papers, examine the evolution of data access over time, and offer three proposals to increase equity of access to data.
The key issue is copyright and the difficulty of licensing data to run experiments on. Many of the early datasets were collections of oddball or made-up music. What’s more, this had a kind of snowball effect: it made it hard to know exactly what a MIR system should even do. As J. Stephen Downie of UIUC reports in a 2003 paper:
There is a much-lamented paucity of formal literature reporting upon the analyses of the real-world information needs and uses of MIR/MDL users (Downie, 2003; Byrd and Crawford, 2002; Futrelle and Downie, 2002). In fairness, this paucity is partially caused by the non-existence of MIR/MDL systems containing music that users actually want.
You don’t know what users want because you don’t have any music users actually want to run user studies with.
Data is the key. If you’re doing machine learning (ML), for instance, you start out with an empty or neutral model of how songs should be classified. You give it a song and tell it what the output should be, and it adjusts its internal springs and channels (not really, it’s all numbers inside, but I like to think of them as stylized Rube Goldberg devices [that’s Heath Robinson to you Brits]) so that when presented with that song, it will produce that output. Then you repeat with a different song. This time the internal adjustments are a bit harder to make, and usually less extensive, because now it has to accommodate both its previous settings and the new song. Then you repeat with a third song, and so on ad nauseam.
Note that you need both the songs and the desired outputs (together, these are often referred to as “the ground truth”). And you need lots of them, sampled fairly evenly from across the entire catalog of music. Otherwise you’ll get lacunae in your model, where the results can be wildly off, and/or overfitting, where the model memorizes its training examples and fails on anything new. The model is really a statistical artifact, and statistics is deeply in love with the Law of Large Numbers.
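That train-on-pairs loop can be sketched in a few lines of Python. Everything here is a made-up toy, not how any real MIR system works: the “songs” are hypothetical two-number feature vectors, the labels (+1/−1) are the ground truth, and the model’s “springs and channels” are a perceptron-style weight vector that gets nudged after each mistake.

```python
def train(examples, n_features, epochs=20):
    """examples: list of (features, label) pairs -- the 'ground truth'."""
    weights = [0.0] * n_features  # the "empty or neutral" starting model
    for _ in range(epochs):
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features))
            predicted = 1 if score >= 0 else -1
            if predicted != label:
                # Adjust the internal "springs and channels" so this
                # song now produces (something closer to) the desired
                # output, while earlier settings are only nudged.
                weights = [w + label * x for w, x in zip(weights, features)]
    return weights

def predict(weights, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) >= 0 else -1

# Hypothetical ground truth: two "songs" of each class,
# described by made-up two-dimensional feature vectors.
ground_truth = [
    ([1.0, 0.1], +1),
    ([0.9, 0.2], +1),
    ([0.1, 1.0], -1),
    ([0.2, 0.9], -1),
]
w = train(ground_truth, n_features=2)
```

Even this toy shows why coverage matters: the model can only learn distinctions that actually appear in its examples, so a catalog that misses whole regions of music leaves blind spots in the weights.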
So where do you get a large collection of music that people are actually interested in, to experiment with? It’s a conundrum. Downie and his collaborators at UIUC got together with the National Center for Supercomputing Applications (NCSA) to propose
the development of a secure, yet accessible, research environment that allows researchers to remotely access the large-scale testbed collection.
and to get the Naxos label to donate its entire catalog, about 30,000 tracks, to be placed in this repository, along with the entire AllMusic metadata database. This formed the basis for the first MIREX server, which runs the various competition entries on NCSA supercomputing infrastructure over various subsets of the data. More data has been added to the repository over time.
The 2019 overview paper calls for extending this 2003 framework to be a distributed system, with multiple MIREX centers linked over the net, so that record labels could run their own MIREX which would contain their own collection, without having to send their tracks to UIUC to be incorporated in the central repository.
There are a number of other existing test data sets, and I’ll report more on them as I read the papers.