Object-based Audio and Podcasting

Photo by George Milton from Pexels

Object-based audio was once thought to be limited to gaming, installations, and film/TV productions, but podcasters and audio storytellers are now beginning to explore the potential of this technology. Formats like Dolby Atmos, MPEG-H, and DTS:X all have the potential to make audio storytelling content like podcasts and audiobooks more accessible, interactive, and exciting. Let’s take a closer look.

What is Object-Based Audio?

The term "object-based audio" describes recordings in which each individual source (like a voice, instrument, or sound effect) is stored as its own audio file (usually mono) along with metadata that defines its level, panning, and much more. The audio objects (a project can have hundreds) are then combined with some pre-mixed audio elements (often referred to as channel-based or scene-based “beds”) and rendered by software or hardware on the listener’s end. That renderer decides how best to represent the mix on whatever technology the listener is using, whether it’s headphones, a home entertainment system with a few speakers, or a carefully tuned movie theater with dozens of speakers surrounding the audience. This offers major advantages for spatial audio content delivery. One of the goals of these formats is to limit the number of different deliverables a mixer has to create (stereo, 5.1, 7.1, etc.) and instead let the rendering equipment interpret the mixer’s intentions.
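To make the idea of "audio plus metadata, rendered at the last step" a little more concrete, here is a minimal sketch in Python. It is not how Dolby Atmos, MPEG-H, or DTS:X actually store or render objects. The AudioObject class, its field names, and the stereo_gains function are all assumptions made up for illustration, and the position is reduced to a single left/right pan value rather than a full 3D coordinate. The "renderer" here only handles the stereo case, using ordinary constant-power panning.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AudioObject:
    """Hypothetical, simplified audio object: a name plus mixing metadata."""
    name: str       # e.g. "narrator"
    gain_db: float  # level chosen by the mixer, in decibels
    pan: float      # -1.0 = hard left, 0.0 = centre, 1.0 = hard right


def stereo_gains(obj: AudioObject) -> Tuple[float, float]:
    """Turn an object's metadata into (left, right) linear gains.

    Uses constant-power panning, so perceived loudness stays roughly
    the same as the object moves across the stereo field.
    """
    linear = 10 ** (obj.gain_db / 20)      # decibels to linear amplitude
    angle = (obj.pan + 1) * math.pi / 4    # map [-1, 1] onto [0, pi/2]
    return linear * math.cos(angle), linear * math.sin(angle)


# The same metadata could be handed to a different renderer (5.1, binaural,
# and so on); only this final step would change.
narrator = AudioObject(name="narrator", gain_db=0.0, pan=0.0)
left, right = stereo_gains(narrator)
```

The point of the sketch is that the mix decision (gain and pan) travels with the object as data, and the conversion to actual speaker signals happens only at playback.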

Because these audio objects get passed all the way through the delivery format, this can also offer the listener an enhanced level of control over the recorded audio. For example: a listener could turn up the narrator to hear them better over the incidental music in a podcast, or turn off the commentary while watching sports, or choose a different language for all of the dialogue in a film.

This ability to control the audio is very different from the way we’ve conventionally delivered audio recordings to listeners: as a stereo or multichannel mix where all the individual audio decisions get “baked in” to that final mix. Object-based formats wait until the final rendering step to do the “baking”, which means listeners get unprecedented control over the audio they’re listening to, and the rendering technologies can make smart decisions that create a great-sounding experience on whatever playback system is available. We can even imagine audio players that customize playback based on a listener’s profile, preferences, and environment. For example: the player could identify that you’re listening to a podcast in a noisy car and make dynamic range adjustments that work well for that situation. But if you come back and listen to the rest of that podcast on headphones in a quiet library, it can adjust for that situation and provide a more nuanced listening experience.

As this technology becomes more widely available and the content becomes easier to distribute, we could see a significant leap in user-friendly features that mainstream podcast and audiobook audiences could benefit from. Of particular focus will be the areas of podcast accessibility and podcast translation, which may dramatically expand the ways in which audiences engage with audio content. One key result could be a significant increase in consumption of content, and, as a by-product, more opportunities for advertisers to target audiences thoughtfully.

Including Metadata With Podcasts and Audiobooks

Normal audio files that are played on music services like Apple Music and Spotify contain metadata like the name of the artist, band, song title, artwork, lyrics, copyright information, etc. With object-based audio, we could imagine even more metadata being included about each individual component within that musical work, including gain, position, and identifying characteristics like who played bass, where each track was recorded, or maybe even musical notation for each track.

When we think about how this might apply to podcasts and audiobooks, there could be individual audio objects and metadata for each guest or actor in the recording. The metadata could include information such as who the speaker is, when and where they were recorded, and more. In theory, it would be possible for a listener to use their podcast app to search for all the podcasts that feature a specific guest, let’s say Kamala Harris, and then bring up the entire catalogue of her guest appearances, across all of the podcasts that she’s appeared on or been quoted within. Wouldn’t that be handy?
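The guest search described above could work something like the following Python sketch. It assumes that a podcast app has already read speaker names out of per-object metadata and stored them per episode. The Episode class, its fields, and the episodes_featuring function are hypothetical names invented for this example, not part of any existing podcast app or format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Episode:
    """Hypothetical catalogue entry built from per-object speaker metadata."""
    show: str
    title: str
    speakers: List[str] = field(default_factory=list)


def episodes_featuring(episodes: List[Episode], speaker: str) -> List[Episode]:
    """Return every episode, across all shows, that lists the given speaker.

    The comparison ignores letter case so "kamala harris" still matches.
    """
    wanted = speaker.casefold()
    return [
        ep for ep in episodes
        if any(name.casefold() == wanted for name in ep.speakers)
    ]
```

A real service would need to deal with spelling variations and people who share a name, which is one reason a stable speaker identifier in the metadata would probably be more useful than a plain name string.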

There are already some types of metadata included in podcast deliveries, like chapters and descriptions of the shows, and so you can easily jump to different segments or access links provided by the creator. We could take it several steps further with object-based audio, and have the speakers’ names appearing in real-time, and the ability to swipe or tap to see their bio and other information while they’re talking. This could also facilitate sharing quotations or segments from podcasts easily from within a piece of content, helping to curate the best portions of an interview and disseminate them to a broader audience.

Podcasts and Audiobooks in Multiple Languages

The increased demand for interactive types of content coupled with a global growth of podcast consumption points towards another potential use for object-based audio: allowing listeners to access multiple language translations of the same podcast or audiobook within the same deliverable. With the flick of a proverbial switch, you could change from an English-speaking narrator to a French-speaking narrator in real-time. The creator wouldn’t have to publish separate translated versions for each language, which can also help to simplify metrics tracking, reviews, and rankings. We could even imagine advertisements that are targeted based on whatever language version the listener was accessing.

This ability to control the translated versions opens up a myriad of opportunities, including for people who want to learn foreign languages, or for those who are multilingual but have a preferred language. And it can improve global sharing and relatability of content, opening up new avenues of communication between people across cultures.
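As a rough illustration of how a player might switch languages inside a single deliverable, here is a small Python sketch. It assumes each audio object carries a language tag, and that music and effects are tagged as language-neutral. The Track class, the language codes, and the select_tracks function are assumptions for this example only and do not reflect how any particular format labels its objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    """Hypothetical audio object with an optional language tag."""
    name: str
    language: Optional[str]  # None means language-neutral (music, effects)


def select_tracks(tracks: List[Track], preferred: str,
                  fallback: str = "en") -> List[Track]:
    """Pick the objects to play for a listener's preferred language.

    Language-neutral objects are always kept. If the preferred language
    is not present in the deliverable, the fallback language is used.
    """
    available = {t.language for t in tracks if t.language is not None}
    chosen = preferred if preferred in available else fallback
    return [t for t in tracks if t.language is None or t.language == chosen]
```

Because the music and sound design are shared across languages, only the dialogue objects need to be duplicated, and the player can swap them at any point during playback.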

Accessibility and Podcasting

Another potential benefit of object-based audio is for listeners with hearing impairments. For example, if the music bed of a podcast or audiobook is too loud to be able to decipher the dialogue, then the listener could reduce the volume of the music bed (or mute it entirely) to hear the dialogue more clearly. They could also set preferences for the priority of content that the audio player can interpret, like prioritizing dialogue first, sound design second, and incidental music third. We could also imagine metadata stored along with each audio object that includes the transcribed dialogue, making it easy to follow along with an accurate transcription and navigate through a podcast based on the transcription, not just chapter markers and timings.
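The priority idea above can be sketched in a few lines of Python. This is only one possible interpretation: each step down the listener's priority list turns that category down by a fixed amount, and muted categories are dropped entirely. The function names, the category labels, and the 6 dB step are assumptions chosen for illustration, not values taken from any real player.

```python
def priority_offsets(priority_order, step_db=6.0):
    """Map each category to a gain offset in dB.

    The first category is left untouched; each later one is turned
    down by a further step_db decibels.
    """
    return {category: -i * step_db for i, category in enumerate(priority_order)}


def apply_preferences(object_gains_db, categories, priority_order, muted=()):
    """Adjust each object's gain according to the listener's preferences.

    object_gains_db: object name -> gain in dB set by the mixer
    categories:      object name -> category such as "dialogue" or "music"
    muted:           categories the listener has switched off (gain of None)
    """
    offsets = priority_offsets(priority_order)
    adjusted = {}
    for name, gain in object_gains_db.items():
        category = categories[name]
        if category in muted:
            adjusted[name] = None  # object is not played at all
        else:
            adjusted[name] = gain + offsets.get(category, 0.0)
    return adjusted
```

A listener who sets the order to dialogue, sound design, music would hear the narrator at the mixer's original level, with everything else pulled back underneath it.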

Object-based Audio Distribution

Currently, there aren’t simple means for distributing object-based audio content to conventional podcast players or apps, but it does seem to be on the horizon. It’s our hope that podcasters and audio storytellers will soon be able to integrate a fantastic set of new content creation tools to help recognize and grow the diversity of their audiences, and also generate content that takes on all-new interactive possibilities.