This is a technical post about PulseAudio internals and the upcoming protocol improvements in the upcoming PulseAudio 6.0 release.
PulseAudio memory copies and buffering
PulseAudio is said to have a “zero-copy” architecture. So let’s look at what copies and buffers are involved in a typical playback scenario.
When PulseAudio server and client runs as the same user, PulseAudio enables shared memory (SHM) for audio data. (In other cases, SHM is disabled for security reasons.) Applications can use pa_stream_begin_write to get a pointer directly into the SHM buffer. When using pa_stream_write or through the ALSA plugin, there will be one memory copy into the SHM.
Server resampling and remapping
On the server side, the server might need to convert the stream into a format that fits the hardware (and potential other streams that might be running simultaneously). This step is skipped if deemed unnecessary.
First, the samples are converted to either signed 16 bit or float 32 bit (mainly depending on resampler requirements).
In case resampling is necessary, we make use of external resampler libraries for this, the default being speex.
Second, if remapping is necessary, e g if the input is mono and the output is stereo, that is performed as well. Finally, the samples are converted to a format that the hardware supports.
So, in worst case, there might be up to four different buffers involved here (first: after converting to “work format”, second: after resampling, third: after remapping, fourth: after converting to hardware supported format), and in best case, this step is entirely skipped.
Mixing and hardware output
PulseAudio’s built in mixer multiplies each channel of each stream with a volume factor and writes the result to the hardware. In case the hardware supports mmap (memory mapping), we write the mix result directly into the DMA buffers.
The best we can do is one copy in total, from the SHM buffer directly into the DMA hardware buffer. I hope this clears up any confusion about what PulseAudio’s advertised “zero copy” capabilities means in practice.
However, memory copies is not the only thing you want to avoid to get good performance, which brings us to the next point:
Protocol improvements in 6.0
PulseAudio does pretty well CPU wise for high latency loads (e g music playback), but a bit worse for low latency loads (e g VOIP, gaming). Or to put it another way, PulseAudio has a low per sample cost, but there is still some optimisation that can be done per packet.
For every playback packet, there are three messages sent: from server to client saying “I need more data”, from client to server saying “here’s some data, I put it in SHM, at this address”, and then a third from server to client saying “thanks, I have no more use for this SHM data, please reclaim the memory”. The third message is not sent until the audio has actually been played back.
For every message, it means syscalls to write, read, and poll a unix socket. This overhead turned out to be significant enough to try to improve.
So instead of putting just the audio data into SHM, as of 6.0 we also put the messages into two SHM ringbuffers, one in each direction. For signalling we use eventfds. (There is also an optimisation layer on top of the eventfd that tries to avoid writing to the eventfd in case no one is currently waiting.) This is not so much for saving memory copies but to save syscalls.
From my own unscientific benchmarks (i e, running “top”), this saves us ~10% – 25% of CPU power in low latency use cases, half of that being on the client side.