Recognizing Speech from Gyroscope Signals

Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

(1) Stanford University
(2) National Research and Simulation Center, Rafael Ltd.


Want to own mic.

Got no permissions.

What else can we do?

What if...

Gyroscopes are susceptible to acoustic noise... No way it's gonna work.

What do we have?

  • Low sampling frequency - 200 Hz
    • Can sample the 0-100 Hz range (up to the Nyquist frequency)
  • Low SNR
  • Sensitivity to sound angle of arrival
  • Do we have aliasing?

Experimental setup

Room. Simple computer speakers. Desk.
(Yes, we tried an anechoic chamber, not beneficial at this stage.)

MEMS Gyroscopes

MEMS gyroscopes measure Coriolis force. $$F = 2m\vec{v} \times \vec{\omega}$$
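The Coriolis relation above can be evaluated directly; a minimal sketch with numpy (the mass, velocity, and rotation-rate numbers are illustrative, not from any real MEMS datasheet):

```python
import numpy as np

def coriolis_force(m, v, omega):
    """Coriolis force F = 2m (v x omega) on a proof mass m
    vibrating with velocity v in a frame rotating at omega."""
    return 2.0 * m * np.cross(v, omega)

# Illustrative numbers only:
m = 1e-9                             # proof mass [kg]
v = np.array([1e-3, 0.0, 0.0])       # drive velocity [m/s]
omega = np.array([0.0, 0.0, 0.1])    # rotation rate [rad/s]
print(coriolis_force(m, v, omega))   # force along the sense axis
```

The force appears on the axis orthogonal to both the drive velocity and the rotation, which is what the sense electrodes pick up.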

Major vendors:

  • STMicroelectronics (Samsung Galaxy)
  • InvenSense (Google Nexus)

Gyroscope as a microphone

  • Sampling frequency:
    • InvenSense: up to 8000 Hz
    • STMicroelectronics: 800 Hz
    • Sampling rate is limited by software
  • Acoustic sensitivity threshold: ~70 dB.
    Comparable to a loud conversation.
  • Accessible to
    • Applications
    • Browsers (through JavaScript API)

Software limitation of the sampling rate



(InvenSense driver example)
hardware/invensense/65xx/libsensors_iio/MPLSensor.cpp
static int hertz_rate = 200;
#define DEFAULT_GYRO_RATE (20000L) //us ...

Initial experiments

70 Hz tone power spectral density.
50 Hz tone power spectral density.

We've got a nice peak!

Let's hear how a recorded chirp would sound.
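The tone-PSD experiment is easy to reproduce on synthetic data; a sketch assuming a 200 Hz sampling rate and a clean 70 Hz tone, with a hand-rolled periodogram:

```python
import numpy as np

fs = 200.0                         # gyro sampling rate [Hz]
t = np.arange(0, 2.0, 1.0 / fs)    # 2 seconds of samples
x = np.sin(2 * np.pi * 70.0 * t)   # 70 Hz test tone

# Simple periodogram: squared magnitude of the FFT
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
psd = np.abs(np.fft.rfft(x)) ** 2

peak_hz = freqs[np.argmax(psd)]
print(peak_hz)  # peak lands at 70 Hz, inside the 0-100 Hz band
```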

Problem: how do we look into higher frequencies?

The voiced speech of a typical adult male will have a fundamental frequency from 85 to 180 Hz, and that of a typical adult female from 165 to 255 Hz.
- Wikipedia


If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.

In the absence of a low-pass pre-sampling filter, frequencies above the Nyquist frequency are folded into the range below it.


The result of recording tones between 120 and 160 Hz on a Nexus 7 device.

We've got aliasing!

(Less so on Samsung devices)
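The folding itself is simple arithmetic: a tone at f lands at f mod Fs, reflected about the Nyquist frequency. A small sketch (the helper name is ours):

```python
def alias_frequency(f, fs):
    """Frequency at which a tone f appears after sampling at fs
    with no anti-aliasing filter (folding about Nyquist)."""
    f = f % fs
    return f if f <= fs / 2 else fs - f

fs = 200.0  # gyro sampling rate [Hz]
for f in (120.0, 130.0, 150.0, 160.0):
    print(f, '->', alias_frequency(f, fs))
# 120 -> 80, 130 -> 70, 150 -> 50, 160 -> 40
```

So the 120-160 Hz tones of the Nexus 7 experiment should reappear at 40-80 Hz, which is exactly the band where the aliased peaks show up.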

Speech analysis using a single gyroscope

Rise (and fall) of the Sphinx

  • CMU Sphinx is a well known speech recognition engine.
  • An attempt to train Sphinx on our data yielded a 14% successful identification rate.
  • Hey, it's better than a random guess.
  • Encouraging, but not good enough

Sphinx is geared toward recognizing human speech. The techniques used might not be completely suitable for our data.


  • MFCC - Mel-Frequency Cepstral Coefficients
    • Statistical features are used (mean and standard deviation)
    • delta-MFCC
  • Spectral centroid
  • RMS energy
  • STFT - Short-Time Fourier Transform
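Two of the simpler features are easy to sketch with numpy (hand-rolled here for illustration; the talk's actual feature extraction may differ):

```python
import numpy as np

def rms_energy(x):
    """Root-mean-square energy of a signal frame."""
    return np.sqrt(np.mean(x ** 2))

def spectral_centroid(x, fs):
    """Magnitude-weighted mean frequency of the spectrum."""
    mags = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.sum(freqs * mags) / np.sum(mags)

fs = 200.0
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 40.0 * t)   # 40 Hz tone as a toy "frame"
print(rms_energy(x), spectral_centroid(x, fs))
```

For a pure tone the centroid sits on the tone frequency and the RMS is amplitude / sqrt(2); on gyro recordings these become per-frame statistical features.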


  • SVM (and Multi-class SVM)
  • GMM (Gaussian Mixture Model)
  • DTW (Dynamic Time Warping)
  • Perhaps HMM could do even better?

Dynamic Time Warping
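DTW aligns two sequences that vary in speed by warping the time axis; a minimal pure-Python/numpy sketch of the distance computation (illustrative, not the talk's implementation, which runs on STFT feature vectors rather than raw samples):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences,
    allowing non-linear time alignment."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-stretched copy of a sequence stays close under DTW:
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 1, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(x, y))   # 0.0 despite the different lengths
```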


  • All samples are converted to audio files in WAV format
  • Upsampled to 8 kHz
  • Silence removal (based on voiced/unvoiced segment classification)
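The silence-removal step can be approximated by a crude frame-energy threshold; a sketch (the frame length and threshold are made-up values, standing in for a real voiced/unvoiced classifier):

```python
import numpy as np

def remove_silence(x, frame_len=160, threshold=0.02):
    """Drop frames whose RMS energy falls below a fixed threshold.
    A stand-in for proper voiced/unvoiced segment classification."""
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.sqrt(np.mean(f ** 2)) >= threshold]
    return np.concatenate(voiced) if voiced else np.array([])

fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
speech = 0.5 * np.sin(2 * np.pi * 150 * t)   # toy "voiced" segment
silence = np.zeros(int(0.5 * fs))
x = np.concatenate([silence, speech, silence])
print(len(x), '->', len(remove_silence(x)))  # only the middle survives
```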


  • Subset of TIDIGITS (isolated digits pronunciations)
  • 11 words per speaker
  • Each speaker recorded each word twice
  • There are 10 speakers (5 female and 5 male)
  • Digitized at 20 kHz

$$10 \times 11 \times 2 = 220$$ recordings

Gender identification

  • Binary SVM with spectral features
  • DTW with STFT features
  • STFT features:
    • Window size: 512 samples - corresponds to 64 ms at the 8 kHz sampling rate

Speaker identification

  • Multi-class SVM and GMM with spectral features
  • DTW with STFT features (same as before)

Isolated words recognition

Speaker independent

Speaker dependent

Multi-device setup

How can we leverage eavesdropping on several co-located devices simultaneously?

Multi-device setup

Similar to time-interleaved ADCs

All ADCs have a sampling rate of $$F_s = 1/T = 200$$ Hz
Sub-interval $$T_Q = \frac{T}{N}$$, with time-skews $$t_p \in [0, T_Q]$$


  • Offset mismatch: DC component removal
  • Gain mismatch: Normalization / use a reference signal
  • Time mismatch: background or foreground calibration
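The first two mismatch corrections are straightforward; a sketch of DC removal plus gain normalization against a reference device's signal (helper name ours):

```python
import numpy as np

def correct_mismatch(x, reference):
    """Offset and gain correction for one device's samples:
    remove the DC component, then scale to match the RMS of the
    reference device's (DC-removed) signal."""
    x = x - np.mean(x)                 # offset mismatch
    rms_x = np.sqrt(np.mean(x ** 2))
    ref = reference - np.mean(reference)
    rms_ref = np.sqrt(np.mean(ref ** 2))
    return x * (rms_ref / rms_x)       # gain mismatch

ref = np.sin(2 * np.pi * np.linspace(0, 4, 800))
dev = 0.3 * ref + 0.1                  # attenuated copy with a DC offset
out = correct_mismatch(dev, ref)
print(np.mean(out), np.sqrt(np.mean(out ** 2)))  # zero mean, matched RMS
```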

We can afford heavy offline processing involving time-consuming algorithms.

Non-uniform reconstruction

Filterbank interpolation based on Eldar and Oppenheim's paper

$$h_{p}\left(t\right)=a_{p}\operatorname{sinc}\left(\frac{t}{T}\right)\prod_{q=0,q\neq p}^{N-1}\sin\left(\frac{\pi\left(t+t_{p}-t_{q}\right)}{T}\right)$$

Requires knowing precise time-skews

Practical compromise

Interleaving samples from multiple devices
(Note: we can also try scanning all possible values for time-skews)
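With an ideal half-period time-skew between two devices, interleaving doubles the effective sampling rate; a sketch under that idealized assumption (in practice the skews are unknown, hence the scanning note above):

```python
import numpy as np

fs = 200.0            # per-device sampling rate [Hz]
T = 1.0 / fs
t = np.arange(0, 1.0, T)

# Two devices sample the same 30 Hz tone, the second offset by T/2
sig = lambda tt: np.sin(2 * np.pi * 30.0 * tt)
dev_a = sig(t)
dev_b = sig(t + T / 2)

# Interleave a[0], b[0], a[1], b[1], ... -> effective 400 Hz stream
merged = np.empty(2 * len(t))
merged[0::2] = dev_a
merged[1::2] = dev_b
print(len(merged))    # twice the samples over the same second
```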


Tested for the case of speaker dependent word recognition

  • Exhibits improvement over using a single device
  • Using even more devices might yield even better results
  • Not a proper non-uniform reconstruction


There's much more work to do...

Further Attacks

One time patch

Having a one time root access we can patch a driver

hardware/invensense/65xx/libsensors_iio/MPLSensor.cpp

static int hertz_rate = 8000; // was 200

Source separation

Use angle of arrival information to do source separation/speaker identification. Perhaps learn the number of sound sources around.

Ambient sound recognition

Is the user in a restaurant/outdoors/on a street?


Secure system design requires consideration of the whole system and a clear definition of the power of the attacker.

Defending against user-level attacks

  • Low-pass filter the raw samples
  • 0-20 Hz range should be enough for browser based applications
    (according to WebKit)
  • Access to high sampling rate should require a special permission
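A user-level defense could low-pass the raw samples before handing them to applications; a minimal one-pole IIR sketch with a 20 Hz cutoff (illustrative only; a real defense would use a proper anti-aliasing filter design):

```python
import numpy as np

def lowpass(x, fs, cutoff_hz):
    """Single-pole IIR low-pass filter: y[i] = y[i-1] + a*(x[i] - y[i-1])."""
    dt = 1.0 / fs
    rc = 1.0 / (2 * np.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    y = np.empty_like(x)
    y[0] = alpha * x[0]
    for i in range(1, len(x)):
        y[i] = y[i - 1] + alpha * (x[i] - y[i - 1])
    return y

fs = 200.0
t = np.arange(0, 1.0, 1.0 / fs)
low = np.sin(2 * np.pi * 5.0 * t)    # legitimate motion band
high = np.sin(2 * np.pi * 70.0 * t)  # speech-band component
print(np.std(lowpass(low, fs, 20.0)), np.std(lowpass(high, fs, 20.0)))
```

The 5 Hz motion signal passes nearly untouched while the 70 Hz speech-band component is attenuated; a single pole rolls off slowly, so a deployed filter would need a steeper design.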

Defending against escalated attacks

  • Hardware filtering of sensor signals
    (Not subject to configuration)
  • Acoustic masking



Gyro sampling via JavaScript

This presentation

Appendix: Sampling via JavaScript

if (window.DeviceMotionEvent) {
    window.addEventListener('devicemotion', function(event) {
        var r = event.rotationRate;
        if (r != null) {
            console.log('Rotation at [x,y,z] is: [' +
                r.alpha + ', ' + r.beta + ', ' + r.gamma + ']');
        }
    });
}