A group of academic researchers has devised a technique to extract sounds from still images captured using smartphone cameras with rolling shutter and movable lens structures.
The movement of camera hardware, such as the moving lenses used for Optical Image Stabilization (OIS) and Auto Focus (AF), combined with Complementary Metal-Oxide-Semiconductor (CMOS) rolling shutters, causes ambient sounds to be modulated into images as imperceptible distortions.
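To see why a rolling shutter can carry audio-rate information, consider this minimal sketch. It is not the researchers' pipeline: it assumes a hypothetical 1080-row sensor read out at 30 frames per second (ignoring blanking intervals), a hypothetical 500 Hz tone, and a sub-pixel lens displacement, and it skips the hard step of estimating per-row shifts from real pixel data. Because each row is captured at a slightly different time, the rows of a single frame act as a time series sampled far faster than the frame rate.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
ROWS = 1080            # rows in one frame
ROW_RATE_HZ = 32_400   # rows read per second (1080 rows x 30 fps, no blanking)
TONE_HZ = 500          # hypothetical sound frequency shaking the lens
AMPLITUDE_PX = 0.2     # hypothetical sub-pixel image displacement


def row_shifts(rows=ROWS, row_rate=ROW_RATE_HZ, tone=TONE_HZ, amp=AMPLITUDE_PX):
    """Horizontal displacement seen by each row of one frame.

    A rolling shutter exposes rows one after another, so row i is
    captured at time i / row_rate and sees the lens position at that instant.
    """
    t = np.arange(rows) / row_rate
    return amp * np.sin(2 * np.pi * tone * t)


def dominant_frequency(shifts, row_rate=ROW_RATE_HZ):
    """Estimate the strongest frequency in the per-row shift sequence."""
    spectrum = np.abs(np.fft.rfft(shifts - shifts.mean()))
    freqs = np.fft.rfftfreq(len(shifts), d=1.0 / row_rate)
    return freqs[np.argmax(spectrum)]


shifts = row_shifts()
estimate = dominant_frequency(shifts)
print(f"injected tone: {TONE_HZ} Hz, estimated from one frame: {estimate:.0f} Hz")
```

Under these assumptions a single frame resolves frequency only in steps of ROW_RATE_HZ / ROWS (30 Hz here), so the estimate lands on the nearest bin rather than exactly on 500 Hz. A global shutter would sample the lens position once per frame and could not capture a tone this high.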
These types of smartphone cameras, the researchers explain in a research paper, create a “point-of-view (POV) optical-acoustic side channel for acoustic eavesdropping” that requires no line of sight, nor the presence of an object within the camera’s field of view.
Focusing on the limitations of this side channel, which relies on a “suitable mechanical path from the sound source to the smartphone” to support sound propagation, the researchers extracted and analyzed the leaked acoustic information, identifying different speakers, genders, and spoken digits with high accuracy.
The academics relied on machine learning to recover information from human speech broadcast by loudspeakers, in the context of an attacker who has a malicious application running on the smartphone but does not have access to the device’s microphone.
However, the threat model assumes that the attacker can capture video with the victim’s camera and can acquire speech samples of the target individuals beforehand, to use them as part of the learning process.
Using a dataset of 10,000 samples of single-digit utterances, the researchers performed three classification tasks (gender, identity, and digit recognition) and trained their model for each task. They used Google Pixel, Samsung Galaxy, and Apple iPhone devices for the experiments.
“Our evaluation with 10 smartphones on a spoken digit dataset reports 80.66%, 91.28%, and 99.67% accuracies on recognizing 10 spoken digits, 20 speakers, and 2 genders respectively,” the academics say.
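For context, the reported accuracies can be compared with what random guessing would achieve on each task. The short calculation below assumes balanced classes, which the article does not state, so the chance figures are only a rough baseline; the `reported` dictionary and the `chance_accuracy` helper are illustrative names.

```python
# Reported accuracies from the researchers' evaluation: (number of classes, accuracy in %).
reported = {
    "digit": (10, 80.66),
    "speaker": (20, 91.28),
    "gender": (2, 99.67),
}


def chance_accuracy(num_classes):
    """Expected accuracy (%) of uniform random guessing, assuming balanced classes."""
    return 100.0 / num_classes


for task, (num_classes, accuracy) in reported.items():
    baseline = chance_accuracy(num_classes)
    print(f"{task}: reported {accuracy:.2f}% vs. chance {baseline:.1f}% "
          f"({accuracy / baseline:.1f}x)")
```

Seen this way, speaker identification stands out: with 20 candidates, guessing would be right only about one time in twenty.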
Lower quality cameras, the researchers say, would limit the potential information leakage associated with this type of attack. Keeping smartphones away from speakers and adding vibration-isolation dampening materials between the phone and the transmitting surface should also help.
Smartphone makers can mitigate the attack through higher rolling shutter frequencies, random-code rolling shutters, tougher lens suspension springs, and lens locking mechanisms.
“We believe the high classification accuracies obtained in our evaluation and the related work using motion sensors suggest this optical-acoustic side channel can support more diverse malicious applications by incorporating speech reconstruction functionality in the signal processing pipeline,” the researchers added.