An idea for improved parallelization in HQPlayer

I am an experienced developer and have my own heavily threaded product. And for whatever reason, I started thinking about how to increase the parallelization in HQPlayer, since that’s one of its “less-strong” parts. And I think I have a working idea, even though @jussi_laako might have considered it and rejected it already. But just in case, here it is.

The idea is to divide the execution into many parts, where each part can be started as soon as the previous part is done and before any later parts are started. This is very similar to how pipelining works in modern RISC CPUs, except there is no guarantee that each part takes equally long. As many parts as possible should be identified and used, since each part will execute in its own thread/process and increase parallelization. And the more equal the parts are in execution time, the better.

Below is an example for how this could look for DSD up-sampling with convolution and some additional DSP.
Part A - Fetch, for example, 0.1 seconds of data from the input stream and perform volume calculation
Part B - Perform convolution on output from A
Part C - Perform other DSP on output from B
Part D - Up-sample output from C to PCM
Part E - Convert output from D to DSD
Part F - Run the modulator on output from E

Execution would then look like this

Thread/process 1: ABCDEF
Thread/process 2: -ABCDEF
Thread/process 3: --ABCDEF
Thread/process 4: ---ABCDEF
Thread/process 5: ----ABCDEF
Thread/process 6: -----ABCDEF
Thread/process 1: ------ABCDEF

and so on

Note that threads/processes can be kept alive to avoid the overhead of starting and stopping them. Some synchronization will be needed between execution units, but nothing too advanced (lock + FIFO queue etc.).
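To make the scheme concrete, here is a minimal sketch of one way to realize it: one long-lived thread per part, with FIFO queues as the only synchronization between stages. The stage functions are toy placeholders, not HQPlayer's actual DSP, and the whole structure is an assumption about how the idea could be wired up.

```python
# Hypothetical pipeline sketch: each "part" runs in its own long-lived
# thread and passes chunks downstream through a FIFO queue.
import threading
import queue

SENTINEL = None  # marks end of stream

def stage(fn, inbox, outbox):
    """Pull chunks from inbox, process them, push results to outbox."""
    while True:
        chunk = inbox.get()
        if chunk is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(chunk))

# Toy stand-ins for Parts A..F from the post (NOT real DSP).
stages = [
    lambda c: [x * 0.5 for x in c],  # A: volume
    lambda c: [x + 1 for x in c],    # B: "convolution" placeholder
    lambda c: c,                     # C: other DSP
    lambda c: c * 2,                 # D: "up-sample" placeholder (duplicate)
    lambda c: c,                     # E: PCM -> DSD placeholder
    lambda c: c,                     # F: modulator placeholder
]

def run_pipeline(chunks):
    qs = [queue.Queue(maxsize=4) for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for c in chunks:
        qs[0].put(c)
    qs[0].put(SENTINEL)
    out = []
    while (c := qs[-1].get()) is not SENTINEL:
        out.append(c)
    for t in threads:
        t.join()
    return out
```

Because each queue is FIFO, chunk order is preserved end to end, and the bounded `maxsize` naturally stalls a fast stage when a slower one falls behind.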

Total latency should be the same as before, maybe a few ms extra for overhead, and sound output is ready to start playing as soon as thread 1 is done with its last task.

And this example is for one channel, so a stereo output could use 12 threads and multi-channel would use even more.

And as mentioned, it's not unlikely that more parts can be identified and used, which would increase the parallelization (for example when up-sampling to non-integer rates). Maybe each *2 up-sampling could be its own part. And maybe internal parts can be identified in the modulator and/or PCM → DSD. If CUDA is used, CUDA execution will be its own part (which it probably already is).

Note that if correctly implemented, having more parts than available CPU cores will have very little impact, but if needed, execution can stall until a core is available (assume one thread or process per core, even though the OS might switch them around).
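The "stall until a core is available" rule above can be expressed with a counting semaphore sized to the core count. This is a hypothetical sketch, not how HQPlayer does it; the function name is made up for illustration.

```python
# Hypothetical core-limiting sketch: at most one running worker per core,
# extra work blocks until a "core slot" is released.
import os
import threading

CORES = os.cpu_count() or 4
slots = threading.BoundedSemaphore(CORES)

def run_when_core_free(work, *args):
    """Run work(*args), but only once a core slot is free."""
    with slots:  # blocks while all CORES slots are taken
        return work(*args)
```

An equivalent and often simpler approach is a `concurrent.futures.ThreadPoolExecutor` with `max_workers` set to the core count.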

The key in this example is probably the modulator, and it's likely that it needs to be divided into its own parts, especially the EC modulator, for this to work very well (it will still work, but won't yield as good parallelization).


Except that it won’t work for any of the modulators.

Other than that, why do you think I have not already done everything possible to do things in parallel?

Of course I don't know what you have tried, but even if the modulator can't be divided into subparts, at the very least you would get the rest parallelized. This post is not meant to criticize, just to give some ideas for possible improvement.

And it's not only about DSD + modulator; non-integer up-sampling with some filters is also terribly heavy on the CPU.


It already is…

Those are already parallelized… At both vector (SIMD) level and CPU core level.

A non-integer factor, for example 44.1 → 768 instead of 44.1 → 705.6, is not practically any heavier in terms of raw processing instructions, but it is in terms of memory access. If you want to make it faster you need larger caches and faster memory buses. So your best bet is to get something like 4 or 8 channels of DDR4-3600 with CL16. Or even better, a GPU.
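As a back-of-envelope illustration of why the non-integer case touches more memory: a standard rational resampler uses a ratio L/M reduced to lowest terms, and a polyphase implementation needs one coefficient set per phase (L of them). This is generic resampler arithmetic, not a claim about HQPlayer's internals.

```python
# Reduce the resampling ratio out/in to lowest terms (L, M).
# A polyphase rational resampler needs L sub-filters (one per phase).
from math import gcd

def ratio(out_rate, in_rate):
    g = gcd(out_rate, in_rate)
    return out_rate // g, in_rate // g  # (L, M)

print(ratio(705600, 44100))  # 44.1k -> 705.6k: (16, 1)   -> 16 phases
print(ratio(768000, 44100))  # 44.1k -> 768k:   (2560, 147) -> 2560 phases
```

Going from 16 phases to 2560 phases means a far larger coefficient table, which explains why the instruction count barely changes while cache and memory-bus pressure grows.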


The reason I thought parallelization can be improved comes from these two points:

  1. I assume both channels of normal stereo music can be handled individually for up-sampling and (most) DSP and convolution, so that would in itself be a parallelization over 2 cores. Which, as far as I can tell, is what HQPlayer already does, since at most two cores seem to be used.
  2. Memory cannot be the main bottleneck, considering performance scales with CPU frequency up to at least 5 GHz for Intel CPUs. Even my old i7-7700 with old memory and motherboard (slow buses etc.) scaled with CPU frequency, at least for DSD up-sampling using the EC modulator.

With this in mind, and even if some tight loops cannot be parallelized, I still feel more can be done. Btw, a SIMD vector operation would work excellently as one "part" in my idea.

If you have “Multicore DSP” set to “auto” (grayed), HQPlayer uses all physical cores. If you see only load on two cores, those are likely the modulators, and your filter selection is so light load that you don’t even notice it when spread over the remaining cores.

Try for example with poly-sinc-ext3 or poly-sinc-xtr (non-2s) or poly-sinc-gauss-xl(a) for some additional load.

No, RAM speed is independent of the CPU clock frequency.

For optimal performance, you’d need full RAM running at speed of L1 cache. On modern CPUs, only L1 cache runs at the core speed. But L1 cache is tiny. All the rest, L2 and L3 plus RAM run at slower speeds.

GPUs can have, for example, a 512- or even 4096-bit wide memory bus, and thus are not as constrained by RAM speed.


Yes, of course, my point was that if an application scales with CPU speed, its bottleneck is not memory or cache speed.

HQPlayer does seem to use a lot of threads though, maybe a few too many, although some of them are OS-created (GUI threads etc.) :slight_smile:
(screenshot showing HQPlayer's thread count)

For comparison, a normal desktop app like Word used 157 threads (quick test)

That depends on algorithm. Modulators scale quite well, filters not as much, with some factors I stated earlier.

I don't get that many, but it depends on what you are doing etc. A few hundred is a normal figure, of which more than half come from libraries and such. DNS resolvers typically have a pool, as does a web server, etc.

I’m using the desktop HQPlayer app in a normally setup Windows 10, with Roon running on same computer. My CPU has 32 logical cores though (16 physical).

About parallelization: I assume then that the modulator is hard to parallelize efficiently (a tight loop with values that depend on the last iteration or something similar). Can the modulator be separated from PCM-to-DSD and DSD up-sampling? Would the sound quality suffer a lot if you for example went PCM-to-DSD64, then EC modulator, and finally DSD64 to DSD256?

No, the modulator is an integral, fundamental part of producing SDM output.

To some extent, without winning anything in terms of saving CPU time. In fact, that would be a heavier process.

For the modulator, the only thing that matters is how many instructions you can execute on a single core per output sample. While for filters, what matters is memory bandwidth and total parallel processing power.

That’s why combination of fast GPU + fast CPU is best.

Ok, I'm just brainstorming. It seems to me there is (or should be) a lot of A → B → C (and so on) steps, but if one step takes like 90% of the CPU power and can't be split into more parts (or internally parallelized), then I guess it's hard to improve.

If multi-core DSP is checked and not only grey, how do you parallelize then? A lot of threads are used at least, but I'm guessing with too much overhead.

It just blindly enables all parallel features without smartness. While auto decides the best configuration for the machine, based on extensive testing on the type of machine in question.

Best way to improve is to improve the hardware… Like combination of best parts of Apple’s M1, Intel’s Core and AMD’s Zen.

Btw, did you "fix" something with PCM up-sampling? I could not run gauss-xla before without using "adaptive"; now with 4.12.2 I can run it even without CUDA. And it seems to spread the load quite nicely over the cores (total CPU usage is about 24% with hyperthreading, so a little less than half the CPU used in total).

Btw, I can't run gauss-xla with CUDA though; it chokes the gfx card and won't work.

No… No changes…

(but Windows is pretty hopeless platform)

I wonder if I was fooled by the "grey out" feature, which usually means the app chooses on or off automatically (like how adaptive works). You might want to replace that with a combobox or 3 radio buttons, to make it clearer that it's actually a 3-way selector: for example with none, adaptive, or full.

Yes, I was (still am a bit) confused by the auto/adaptive greyed box. It doesn’t just choose multicore ON or OFF at start of processing based on some heuristic. It’s actually a different performance profile/process than checked enabled. I guess it comes down to this as per HQP manual:

When the selection box is grayed, automatic detection and configuration is active and can utilize any number of cores … When the box is not checked, processing is optimized for dual-core CPUs. When the box is checked, processing is optimized for 4/6-core CPUs.

This is why in my tests of gauss-xla on an 8-core i9, with all else being equal, multicore enabled fully uses 4 cores (+ 2 partial), whereas auto uses 6 cores fully (+ 2 partial).

It’s not entirely intuitive as at first it might seem like checked is the “max” setting for multicore and thus would utilize all/more of the cores vs auto/adaptive, when in reality auto/adaptive usually “maxes out” the cpus more fully.

Just wondering if Intel's new NUC that comes with 11th gen CPUs and space for a full-size graphics card would be good enough… Of course adding an Nvidia GeForce RTX™ 30 Series GPU would deliver the ultimate performance…

Sorry for hijacking the thread, but a related question: I'm only using PCM filters. Fortunately my favourite ones have no delay when starting to play, but some of them, like SINC-M, have a 5-second delay. How come?

Not complaining! It's just that I'm watching the CPU resource manager and GPU usage and there is never more than 2-5 percent usage for the CPU and 1% usage of my 1080Ti… So what is taking 5 seconds? Perhaps a buffer? Can it be "shortened"?

The SINC-M filter does 1 million taps no matter what, and the delay will inversely correlate with the output sampling rate. It's a simple equation - 500,000 / outputSampleRate. Plus buffering. You will get big delays if you are only upsampling to 96/192. See Which HQP Filter are you using? - #1307 by jussi_laako. If you like the sound of the Sinc-xxx filters but want a constant time delay, try the Sinc-Mx filter - but it won't sound as good at low input rates. No free lunch unfortunately.
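The quoted rule of thumb is easy to check numerically: half of a 1-million-tap filter divided by the output rate gives the start delay (before buffering).

```python
# Delay estimate from the post: half the filter length / output rate.
def sinc_m_delay_seconds(output_rate, taps=1_000_000):
    return (taps / 2) / output_rate

print(round(sinc_m_delay_seconds(96_000), 2))   # ~5.21 s at 96 kHz
print(round(sinc_m_delay_seconds(192_000), 2))  # ~2.6 s at 192 kHz
print(round(sinc_m_delay_seconds(768_000), 2))  # ~0.65 s at 768 kHz
```

So the observed ~5 seconds at a 96 kHz output rate matches the formula, and upsampling to higher output rates shrinks the delay proportionally.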
