Intelligent detection for audiovisual augmented reality


Image Credit: Zapp2Photo /

This article concerns augmented reality and artificial intelligence technologies that capture real audio-visual scenes through digital sensors and render enhanced virtual objects. Users experience the result by wearing a headset display or smart glasses and listening through headphones or loudspeakers.

Augmented reality headsets in today’s consumer electronics market are equipped with micro-electro-mechanical systems (MEMS) sensors with embedded processing capability and artificial intelligence elements. These sensors respond with high precision to stimuli from the actual physical world to blend holographic data with real-world environments.

The environmental data is collected through sensors such as microphones, accelerometers, gyroscopes, and cameras to detect sound pressure, vibration acceleration, rotational (or inertial) direction, and visual information, respectively. The integration of this data is known as multi-sensor data fusion.

As discussed below, sensor output accuracy and data fusion are essential to the perceived performance quality of augmented reality wearable devices. Users have high expectations and demand extremely responsive systems that align dynamic real-world environments with their virtual counterparts.

This consumer demand promotes product standardization and generates economies of scale, presenting both ground-breaking opportunities and significant challenges in harnessing the benefits of intelligent sensing for a wide range of applications. These include audio-visual communication, gaming, navigation, autonomous driving, smart homes, health monitoring, and robotics.

Spatial Audio Augmented Reality Applications and Sensor Performance

Over the last decade or so, there has been intensive augmented reality research[1]. This research has focused largely on visual perception, whereas the rendering of virtual auditory objects in three-dimensional space has seen little use of artificial intelligence methods such as deep learning for motion tracking, directional tracking, and distance (or depth) detection.

Recent deep learning methods are advancing automatic sound description, classification, and recognition in contexts other than augmented reality. Novel automatic scene description and room layout reconstruction can be used to localize virtual auditory objects when using an augmented reality headset to identify and classify visual features such as room size, sound source angular direction, and distance[2].

Novel deep learning methods for video captioning have also shown improved performance and discovery of audio-visual correlation and hence their potential for audio-visual localization and synchronization[3].

There is also growing interest in spatial audio capture and rendering techniques that automatically adapt augmented audio reproduction to bespoke speaker arrangements using object-based audio content. This involves capturing and controlling the spatial distribution of sound objects in musical events such as orchestral settings[4].

To this effect, intelligent spherical microphone array technology[5] delivers outstanding performance through ultra-small build geometries, low power consumption, and excellent stability of sensor properties in terms of sensitivity, repeatability, and frequency response accuracy.

These three-dimensional microphone configurations are capable of dynamic beamforming, which adjusts the microphone directivity patterns. They are inspired by positional tracking (i.e., measuring) techniques such as six degrees of freedom, which is based on detecting both rotational and translational movement using inertial sensors such as gyroscopes. This resembles the ability of the human inner ear to sense body pose (i.e., the combination of position and orientation).
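To make the idea of beamforming more concrete, the following is a minimal sketch of delay-and-sum beamforming, one of the simplest ways to steer a microphone array. It is an illustration only, not the method used by any particular product: it assumes a far-field source, a planar array, integer-sample delays, and a fixed speed of sound. The function names `steering_delays` and `delay_and_sum` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_delays(mic_positions, azimuth_deg, sample_rate):
    """Delay (in whole samples) of each microphone relative to the one
    that hears a far-field plane wave from azimuth_deg first."""
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector pointing at the source
    # A microphone further along the source direction hears the wavefront earlier.
    arrival = [-(x * ux + y * uy) / SPEED_OF_SOUND * sample_rate
               for x, y in mic_positions]
    earliest = min(arrival)
    return [round(a - earliest) for a in arrival]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay so the wavefronts line up, then average."""
    n = len(channels[0]) - max(delays)
    return [sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(n)]


# Hypothetical two-microphone array, 10 cm apart along the x axis.
SAMPLE_RATE = 34300  # chosen so that 10 cm corresponds to exactly 10 samples
MICS = [(0.0, 0.0), (0.1, 0.0)]

# Simulate an impulse arriving from azimuth 0 degrees (along +x):
# the microphone at x = 0.1 m hears it 10 samples before the one at x = 0.
near = [0.0] * 64
near[20] = 1.0
far = [0.0] * 64
far[30] = 1.0
channels = [far, near]  # same order as MICS

on_target = delay_and_sum(channels, steering_delays(MICS, 0.0, SAMPLE_RATE))
off_target = delay_and_sum(channels, steering_delays(MICS, 180.0, SAMPLE_RATE))
```

Steering toward the true source direction aligns the two impulses so that they add coherently, while steering in the opposite direction leaves them misaligned and the peak is halved. Practical arrays typically use fractional delays and frequency-dependent weights, which this sketch omits.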

Motion Tracking and Sensor Data Fusion Techniques

As a user of smart glasses or an augmented reality headset can see real and virtual objects simultaneously, positional information and motion tracking are critical to providing meaningful sensory feedback for interactivity[6].

Tracking a headset pose can be particularly demanding when the device’s wearer makes rapid head movements, as visual misalignment may occur due to poor inertial sensor data fusion. Optical tracking techniques can alleviate this problem by using a visual camera, for example, to reconstruct either the pose of the camera in its surroundings or the pose and the spatial depth of the tracked object. Hybrid techniques can also be more efficient by fusing visual with inertial orientation data.
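As a rough illustration of hybrid tracking, the sketch below fuses a gyroscope's angular rate with an absolute yaw estimate from a camera using a complementary filter. This is a deliberately simple stand-in chosen for clarity, not the technique of any specific headset. The function name `complementary_filter`, the blending weight `alpha`, and the single-axis (yaw only) model are all assumptions, and angle wrap-around is ignored.

```python
def complementary_filter(gyro_rates, camera_yaws, dt, alpha=0.98, initial_yaw=0.0):
    """Fuse gyroscope rates (rad/s) with camera yaw readings (rad).

    gyro_rates  : angular rate sample per time step
    camera_yaws : absolute yaw per time step, or None when no visual fix exists
    dt          : time step in seconds
    alpha       : trust placed in the integrated gyroscope (assumed value)
    """
    yaw = initial_yaw
    fused = []
    for rate, cam in zip(gyro_rates, camera_yaws):
        predicted = yaw + rate * dt  # fast but drifting inertial prediction
        if cam is None:
            yaw = predicted  # no camera fix: rely on the gyroscope alone
        else:
            # Slow but drift-free camera reading pulls the estimate back.
            yaw = alpha * predicted + (1.0 - alpha) * cam
        fused.append(yaw)
    return fused


# Hypothetical scenario: the head is still (true yaw = 0), but the
# gyroscope reports a constant bias of 0.1 rad/s.
steps, dt, bias = 2000, 0.01, 0.1
gyro = [bias] * steps

gyro_only = complementary_filter(gyro, [None] * steps, dt)
hybrid = complementary_filter(gyro, [0.0] * steps, dt)
```

With the gyroscope alone, the bias integrates into a steadily growing yaw error. With the camera correction blended in, the error settles at a small bounded value. A larger `alpha` tracks rapid head movement more closely at the cost of slower drift correction.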

Microphone arrays can estimate sound source distance for environmental localization[7]. However, the level of signal noise and environmental interference may depend on the directivity of the sensor and on hardware interface stages such as signal amplification and digitization.
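Array-based localization commonly builds on the time difference of arrival (TDOA) between microphones. The sketch below covers only the simpler, related problem of estimating direction (not distance) with two microphones: it finds the lag that maximizes the cross-correlation of the two signals and converts it to a far-field angle. The helper names `estimate_lag` and `lag_to_angle`, the speed of sound, and the sample rate are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def estimate_lag(left, right, max_lag):
    """Number of samples by which `right` trails `left`, found as the lag
    that maximizes the cross-correlation of the two signals."""
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def lag_to_angle(lag, sample_rate, mic_spacing):
    """Far-field angle in degrees from the array broadside.
    A positive angle means the source is nearer the left microphone."""
    ratio = lag / sample_rate * SPEED_OF_SOUND / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise-induced overshoot
    return math.degrees(math.asin(ratio))


# Hypothetical pair of microphones 10 cm apart, sampled at 34.3 kHz so that
# the largest physically possible lag is 10 samples.
SAMPLE_RATE = 34300
SPACING = 0.1

pulse = [0.5, 1.0, 0.5]
left = [0.0] * 64
right = [0.0] * 64
left[19:22] = pulse   # pulse reaches the left microphone first
right[24:27] = pulse  # and the right microphone 5 samples later

lag = estimate_lag(left, right, max_lag=10)
angle = lag_to_angle(lag, SAMPLE_RATE, SPACING)
```

Estimating distance as well as direction generally requires more microphones or additional cues, and real recordings call for noise-robust correlation methods beyond this sketch.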

To provide accurate or reliable signal information, popular multi-sensor data fusion schemes such as state-estimation methods that are based on Kalman filtering or particle filtering[8] tend to use a relatively small number of MEMS microphones in symmetrical network configurations. For example, one-dimensional network configurations are common in biologically inspired systems that mimic human binaural perception.
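For readers unfamiliar with Kalman filtering, the following is a minimal one-dimensional sketch that smooths a sequence of noisy readings of a slowly varying quantity, such as a source distance estimate. It assumes a random-walk state model with hand-picked noise variances. The function name `kalman_1d` and all numeric values are hypothetical and not taken from the cited work.

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a slowly varying quantity.

    process_var     : assumed variance of the state's random walk per step
    measurement_var : assumed variance of the sensor noise
    x0, p0          : initial state estimate and its variance
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement according to the gain.
        k = p / (p + measurement_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical readings of a 2.0 m source distance, corrupted by an
# alternating +/-0.2 m error (deterministic, to keep the example repeatable).
readings = [2.0 + (0.2 if i % 2 == 0 else -0.2) for i in range(200)]
smoothed = kalman_1d(readings, process_var=1e-5, measurement_var=0.04)
```

The gain `k` starts high, so the filter trusts the first readings, and then shrinks as the estimate becomes more certain, which suppresses the measurement error. Multi-sensor fusion extends the same predict-and-update cycle to vector states and several measurement sources.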

This approach may not enhance the signal-to-noise ratio as efficiently as larger condenser microphone technologies, but it is more straightforward to configure than other configurations (e.g., pyramidal or cubic).


This article has explored some current and emerging trends in the performance and configuration of intelligent MEMS sensors. It has also considered multi-sensor data fusion, deep learning, and relevant techniques such as motion tracking in wearable devices used for augmented audio-visual reality and immersive spatial audio.

The highlighted developments can help researchers and the broader community in industry and society to better understand some essential concepts that enable manufacturers and developers to create more meaningful interactive experiences.

The growing demand for human-computer interaction and sensor technologies in augmented reality products indicates the importance of these products to societal well-being and of the perceived performance quality of wearable devices. This applies not only to multimedia entertainment and communication but also to a wide range of multidisciplinary applications in industries such as medicine, transport, and urban infrastructure.

References and Further Reading

[1] K. Kim, et al. (2018, accessed March 8, 2021). Revisiting trends in augmented reality research: a review of the 2nd decade of ISMAR (2008-2017). IEEE Transactions on Visualization and Computer Graphics 24 (11), 2947-2962. Available:

[2] H. Kim, et al. (2020, accessed March 8, 2021). Immersive virtual reality audio rendering adapted to the listener and the room. Real VR – Immersive Digital Reality: How to Import the Real World into Head-Mounted Immersive Displays, 293-318. Available:

[3] H. Zhu, et al. (2020, accessed March 8, 2021). Deep audio-visual learning: a survey. arXiv:2001.04758. Available:

[4] P. Coleman, et al. (2018, accessed March 8, 2021). An audiovisual system for object-based audio: from recording to listening. IEEE Transactions on Multimedia PP, 1-1. Available:

[5] J. Y. Hong, et al. (2017, accessed March 8, 2021). Spatial audio for soundscape design: recording and reproduction. Applied Sciences 7 (6), 627. Available:

[6] G. A. Koulieris, et al. (2019, accessed March 8, 2021). Near-eye display and tracking technologies for virtual and augmented reality. Computer Graphics Forum 38 (2), 493-519. Available:

[7] C. Rascon and I. V. M. Ruiz. (2017, accessed March 8, 2021). Localization of sound sources in robotics: a review. Robotics and Autonomous Systems 96, 184-210. Available:

[8] F. Castanedo. (2013, accessed March 8, 2021). A review of data fusion techniques. The Scientific World Journal 2013, 704504. Available:

Disclaimer: The opinions expressed here are those of the author, expressed in a private capacity and do not necessarily represent the views of Limited T / A AZoNetwork, the owner and operator of this website. This disclaimer is part of the terms and conditions of use of this website.
