The Goal

I want a fast and lightweight library that handles 3D audio sources in a game environment as realistically as possible. This includes occlusion, realistic attenuation and delay based on the distance, as well as all kinds of indirect effects of sound reflecting off the level geometry.
It should also be able to use an HRTF when using headphones and do correct panning for different speaker layouts.
But I also want to be able to somewhat correctly play stereo music or sound effects that don't have a position in the game world.
And of course it should work on all the big platforms.

The Video

Here is a video of what it currently looks and sounds like. (There are some sound artifacts in the video; I didn't notice them while recording, so I am not entirely sure where they come from...)

The Libraries

It turns out that the still very new Steam Audio, with its very permissive licence, can fulfill most of those requirements for 3D audio sources. But it is very specialized: the audio still has to come from somewhere and go somewhere, and it can't even do attenuation itself, beyond providing a factor to use.
Fortunately I already have a working ogg file loader that just loads and decodes ogg files into RAM and could in the future also be tweaked to stream big files from disk. It mainly just uses the ogg vorbis reference implementation.
This still leaves the output, and there are many different solutions. I could implement my own backends for the different platforms, but that seemed like a lot of work, especially considering that I found three different open source projects fulfilling my requirements: PortAudio, RtAudio and libsoundio. While all three are probably acceptable options, I somewhat randomly picked libsoundio.

The Oculus Rift

My main goal is to do realistic audio in VR for headphones, and the Oculus Rift has integrated ones. It is possible to select a different output device in the Oculus settings, and the SDK has functionality to get the preferred audio device, but only as a device GUID.
Fortunately it turned out that libsoundio also has an ID for each device, which on Windows happens to be the same as the device GUID returned by the Oculus SDK.
My solution is to enumerate all audio devices with libsoundio and just use the first one with a matching ID.
On my system most IDs exist twice in that list, but the first one also happens to be the correct one. Since the second one happens to be the "raw" device, I could probably just ignore those completely for my use case.
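
A minimal sketch of that lookup, assuming the GUID from the Oculus SDK has already been converted to the string form libsoundio uses for device IDs, and that soundio_connect and soundio_flush_events have already run (the helper name is mine):

    #include <string.h>
    #include <soundio/soundio.h>

    // Find the first non-raw output device whose libsoundio ID matches the
    // GUID string returned by the Oculus SDK. Returns a referenced device
    // (to be released with soundio_device_unref) or NULL if there is none.
    static struct SoundIoDevice *find_output_device(struct SoundIo *soundio,
                                                    const char *oculus_guid) {
        int count = soundio_output_device_count(soundio);
        for (int i = 0; i < count; i++) {
            struct SoundIoDevice *device = soundio_get_output_device(soundio, i);
            // The raw variant shares its ID with the shared one, so skipping
            // raw devices leaves exactly the match we want.
            if (!device->is_raw && strcmp(device->id, oculus_guid) == 0)
                return device;
            soundio_device_unref(device);
        }
        return NULL;
    }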

The Ambisonics

Looking at the Steam Audio documentation, going ambisonics all the way seems to be the best compromise between speed and quality. The convolution filter that is applied to all sources for the indirect sound effects already returns its results encoded in ambisonics and mixed for all sources. And while Steam Audio takes care of all the bells and whistles for those indirect effects given an environment, all it does for direct audio is calculate a couple of parameters needed to render the sound source correctly. But there is also a panning effect that can encode such a direct sound source into ambisonics.

The Pipeline

My audio pipeline starts with an asset, representing the data from an audio file, and a sampler that, given a time, channel and asset, returns a sample from that asset, currently by doing a linear interpolation between the two nearest stored samples.
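
A minimal sketch of such a sampler, assuming the asset holds interleaved float samples decoded into RAM (all names here are my own):

    // Assumed asset layout: interleaved float samples, fully decoded.
    typedef struct {
        float *samples;    // channel_count * frame_count interleaved entries
        int channel_count;
        int frame_count;
        int sample_rate;
    } AudioAsset;

    // Return one channel's sample at an arbitrary time, linearly interpolating
    // between the two nearest stored samples (bounds checks omitted).
    float sample_asset(const AudioAsset *asset, int channel, double time) {
        double position = time * asset->sample_rate;
        int index = (int)position;
        float fraction = (float)(position - index);
        float a = asset->samples[index * asset->channel_count + channel];
        float b = asset->samples[(index + 1) * asset->channel_count + channel];
        return a + (b - a) * fraction;
    }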

Then there is the audio source, which has a sampler and an asset, a position, gain, radius (it is only used for Steam Audio's volumetric occlusion effect and has nothing to do with the range, which is infinite, with the loudness depending on the gain value) and a pitch property.
It internally keeps track of the current playback time (and progresses it with every audio frame). The audio source also has a method that is called per frame, feeds the samples for the complete frame to its Steam Audio effects and returns the resulting direct audio as ambisonics data.
Audio sources only play a single channel of an asset.
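
Condensed into a struct, the state of a source looks roughly like this (a sketch with my own field names; Vec3 stands in for the engine's vector type):

    // Assumed shape of an audio source; a per-frame method turns this state
    // into an ambisonics buffer via the Steam Audio effects.
    typedef struct { float x, y, z; } Vec3;

    typedef struct {
        const AudioAsset *asset;
        int channel;   // a source only plays a single channel of its asset
        Vec3 position;
        float gain;    // loudness; the range itself is infinite
        float radius;  // only feeds Steam Audio's volumetric occlusion
        float pitch;   // multiplier on the per-sample time step
        double time;   // playback time, advanced every audio frame
        double delay;  // previous frame's propagation delay (used later)
    } AudioSource;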

For audio that has no position, such as background music and maybe some effects, there is the audio player object. It is very similar to a source, but without all the effects and the position, and instead it mixes directly into the final audio. I wanted to use iplConvertAudioBufferFormat to convert an input format to the desired output layout, but it doesn't work for all kinds of combinations and requires deinterleaved data while all my other data is interleaved, which complicated things. Instead I am now just doing the conversion myself, only supporting mono and stereo source material, and maybe extending it in the future.
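
A sketch of that manual conversion, assuming interleaved float buffers on both sides (the function is my own, not Steam Audio's):

    // Spread mono or stereo input frames across an interleaved output buffer;
    // additional speakers beyond the first two stay silent.
    void convert_channels(const float *in, int in_channels,
                          float *out, int out_channels, int frame_count) {
        for (int frame = 0; frame < frame_count; frame++) {
            const float *src = in + frame * in_channels;
            float *dst = out + frame * out_channels;
            float left = src[0];
            float right = in_channels > 1 ? src[1] : src[0]; // mono: duplicate
            if (out_channels == 1) {
                dst[0] = 0.5f * (left + right); // stereo to mono downmix
            } else {
                dst[0] = left;
                dst[1] = right;
                for (int c = 2; c < out_channels; c++)
                    dst[c] = 0.0f;
            }
        }
    }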

All sources and players are combined in an audio world, which takes care of initializing libsoundio and Steam Audio and does the final mixing. The world also handles the geometry data.
And because all the positional audio needs to know where the listener is, the world has a listener property which can be any scene node.

The Indirect Audio

Each audio source can produce indirect audio using the Steam Audio convolution effect. It just needs to be created and provided with the new audio data every audio frame.
The audio data is just a buffer I create by sampling the asset, multiplying each sample with the gain property and applying the pitch property as a multiplier on the time delta used to progress the source's internal time.
Everything else is handled by Steam Audio.
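
Filling that buffer is little more than running the sampler once per output sample (a sketch built on the sample_asset helper from above):

    // Fill one audio frame, applying gain to each sample and pitch to the
    // time step that advances the source's internal playback time.
    void fill_frame(AudioSource *source, float *buffer, int frame_size,
                    int sample_rate) {
        double time_step = 1.0 / sample_rate;
        for (int i = 0; i < frame_size; i++) {
            buffer[i] = source->gain
                      * sample_asset(source->asset, source->channel, source->time);
            source->time += source->pitch * time_step;
        }
    }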

The Direct Audio

For the direct audio there is the function iplGetDirectSoundPath which, given some information about the listener and the sound position, will return a direction, an attenuation factor, an occlusion factor and the time it takes the sound to travel from the source to the listener (and also air absorption values, which happen to not be supported and are always 0).

Because I want to output the resulting audio encoded as ambisonics, I am using the panning effect. Just like the convolution effect, the panning effect takes the audio for the current audio frame, but it also takes a direction (the one returned for the direct sound path) and will then immediately write the result into an output buffer.
The gain and pitch are applied just like before, and the attenuation and occlusion factors can simply be multiplied into each sample.

I haven't tried it without, but in theory the propagation delay should be somewhat important in combination with the indirect audio. The change in delay between samples, caused by the sound source and/or listener moving, should also produce the Doppler effect.
My solution is to calculate a delay per sample by incrementing the previous frame's delay by the change in delay per sample ((new_delay - old_delay) / sample_count) for every sample, so that the last sample will have a delay corresponding to the new frame's delay. This per-sample delay is then used as a negative offset on the lookup time for each sample.
It works quite well, but could probably be improved by smoothing the change in delay and maybe reaching the frame's target delay at the sample in the center of the frame, though this would also introduce new issues.
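
Combined with the gain, pitch, attenuation and occlusion factors, the per-sample loop then becomes something like this (again my own sketch, not Steam Audio code):

    // Fill a frame while interpolating the propagation delay from the previous
    // frame's value to the new one; the last sample ends up exactly at
    // new_delay. A changing delay automatically produces the Doppler effect.
    void fill_direct_frame(AudioSource *source, float *buffer, int frame_size,
                           int sample_rate, double new_delay,
                           float attenuation, float occlusion) {
        double time_step = 1.0 / sample_rate;
        double delay_step = (new_delay - source->delay) / frame_size;
        for (int i = 0; i < frame_size; i++) {
            source->delay += delay_step;
            // The delay acts as a negative offset on the lookup time.
            double lookup_time = source->time - source->delay;
            buffer[i] = source->gain * attenuation * occlusion
                      * sample_asset(source->asset, source->channel, lookup_time);
            source->time += source->pitch * time_step;
        }
    }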

The final ambisonics buffer is then passed on to the audio world.

The Mixing

The audio world loops over all sources, mixes their direct audio output buffers using iplMixAudioBuffers, and mixes the result with the already mixed indirect sound returned by iplGetMixedEnvironmentalAudio.
Depending on the target speaker layout, either the ambisonics binaural effect or the ambisonics panning effect is then used to decode the final ambisonics buffer, to two headphone channels using an HRTF or to any other speaker layout respectively.
The result is then mixed with the converted audio player buffers and passed to the system as the final audio.
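
Per audio frame, the world's mixing therefore looks structurally like this (an outline with my own names; the Steam Audio calls are reduced to comments since their exact parameters depend on the chosen formats and handles):

    // Assumed minimal world state for this outline.
    typedef struct AudioPlayer {
        float *converted_buffer;  // output-layout audio for this frame
        struct AudioPlayer *next;
    } AudioPlayer;

    typedef struct {
        AudioPlayer *players;     // all positionless players
        /* sources, Steam Audio handles, libsoundio state ... */
    } AudioWorld;

    // Outline of the audio world's per-frame mixing.
    void mix_frame(AudioWorld *world, float *out, int out_sample_count) {
        // 1. Mix the direct ambisonics buffers of all sources
        //    with iplMixAudioBuffers.
        // 2. Add the indirect sound from iplGetMixedEnvironmentalAudio,
        //    which is already mixed across all sources.
        // 3. Decode the ambisonics mix into 'out': ambisonics binaural
        //    effect (HRTF) for headphones, ambisonics panning effect for
        //    any other speaker layout.
        // 4. Accumulate the converted buffers of all positionless players.
        for (AudioPlayer *player = world->players; player; player = player->next)
            for (int i = 0; i < out_sample_count; i++)
                out[i] += player->converted_buffer[i];
    }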

The Geometry

To use the indirect audio features, Steam Audio needs geometry data. This is created as a scene object, which is used to build an environment, which is used to build an environment renderer, which is then used to create the convolution effect (and also to calculate the direct sound path! The scene can be null in that case though...).

I implemented a mechanism that allows adding materials, and meshes with a material id, position and rotation, to the audio world. Calling an update method will then recreate all scene dependent Steam Audio objects using the new materials and geometry. There was nothing complicated about this and it just works.
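
The resulting interface is small (a sketch; these names are mine, and the material struct mirrors the three absorption bands Steam Audio materials use):

    // Assumed geometry interface of the audio world.
    typedef struct {
        float low_absorption, mid_absorption, high_absorption;
    } AudioMaterial;

    typedef struct { float x, y, z, w; } Quat; // engine rotation type, assumed

    // Returns the id under which meshes can reference this material.
    int audio_world_add_material(AudioWorld *world, AudioMaterial material);

    void audio_world_add_mesh(AudioWorld *world,
                              const float *vertices, int vertex_count,
                              const int *triangles, int triangle_count,
                              int material_id, Vec3 position, Quat rotation);

    // Recreates the scene, environment, environment renderer and all
    // convolution effects from the current materials and meshes.
    void audio_world_update_geometry(AudioWorld *world);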

The Other Things

Most additional effects on the per source audio could probably be added with some effect pipeline as part of the sampler.

Most of the per frame memory blocks can be reused between sources.

The linear interpolation used to sample between samples does not appear to do such a great job. It could also be the source audio, but some frequencies sound a bit dirty.

The current Steam Audio release contains a static library, but it has additional dependencies and will be dropped in the future. The Windows DLL, on the other hand, is 50 MB and thus adds massively to the size of any project using it. At least it compresses pretty well, to about 15 MB...

All Steam Audio effects for a source have internal state, so one instance per source should be used.

I am not using any of the serialization and audio source baking functionality yet, but both seem like a good idea and shouldn't be much more than calling another Steam Audio function (baking will require probes to be placed in the level though).

The center frequencies for the three frequency bands used for Steam Audio materials are 800 Hz, 4 kHz, and 15 kHz.

The reverb effect could be used for audio sources without a convolution effect to fake a similar result. The convolution effect should be used sparingly due to the CPU overhead it adds.

Direct audio occlusion, with mainly just a flat ground plane and a wall blocking the audio, makes it impossible to hear the source. This feels quite wrong, as in reality there are just so many small surfaces reflecting the sound everywhere. There might be ways to solve this, but it could get very tricky.

Finding good settings for the raycasting turned out to be a bit tricky. Especially a low number of rays seems to cause artifacts. An irDuration of 2.0 turned out way too slow, while everything is great when set to 1.0, somewhat independent of all other settings...
I also noticed that when using an ambisonics order higher than three, something is seriously wrong with the output (the source seems to be behind me while it is in front of me, among other things).

The End

While not everything is perfect, the resulting audio is quite convincing and just works, without tweaking each source individually to have the right sound for its environment. I also learned a few things about audio :).