Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

(1) Stanford University

(2) National Research and Simulation Center, Rafael Ltd.

Want to own mic.

Got no permissions.

What else can we do?

- Low sampling frequency - 200 Hz
- Can sample the range of 0 - 100 Hz
- Low SNR
- Sensitivity to sound angle of arrival
- Do we have aliasing?

(Yes, we tried an anechoic chamber, not beneficial at this stage.)

Major vendors:

- STMicroelectronics (Samsung Galaxy)
- InvenSense (Google Nexus)

- Sampling frequency:
  - InvenSense: up to 8000 Hz
  - STMicroelectronics: 800 Hz
- Sampling rate is limited by software

- Acoustic sensitivity threshold: ~70 dB, comparable to a loud conversation
- Accessible to:
  - Applications
  - Browsers (through JavaScript API)

(InvenSense driver example)

*hardware/invensense/65xx/libsensors_iio/MPLSensor.cpp*

```
static int hertz_rate = 200;
...
#define DEFAULT_GYRO_RATE (20000L) //us
```

We've got a nice peak!

Let's hear how a recorded chirp would sound.

The voiced speech of a typical adult male will have a fundamental frequency from 85 to 180 Hz, and that of a typical adult female from 165 to 255 Hz.

If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.

In the absence of a low-pass pre-sampling (anti-aliasing) filter, frequencies above the Nyquist frequency are folded into the range below it.
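To make the folding concrete, here is a minimal sketch (the helper name is ours) of where a tone lands after sampling at 200 Hz with no anti-aliasing filter:

```python
def folded_frequency(f, fs):
    """Apparent frequency of a tone at f Hz after sampling at fs Hz
    with no anti-aliasing filter: it folds into [0, fs/2]."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)

# A 130 Hz male fundamental folds to 70 Hz, inside the gyro's 0-100 Hz band:
print(folded_frequency(130, 200))
# A 440 Hz tone folds to 40 Hz:
print(folded_frequency(440, 200))
```

So even speech energy above 100 Hz leaves an aliased trace in the gyroscope samples.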

We've got aliasing!

(Less so on Samsung devices)

- CMU Sphinx is a well-known speech recognition engine.
- An attempt to train Sphinx on our data yielded a 14% successful identification rate.
- Hey, it's better than a random guess.
- Encouraging, but not good enough.

Sphinx is geared toward recognizing human speech. The techniques used might not be completely suitable for our data.

- MFCC - Mel-Frequency Cepstral Coefficients
- Statistical features are used (mean and standard deviation)
- delta-MFCC
- Spectral centroid
- RMS energy
- STFT - Short-Time Fourier Transform
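As a sketch of two of the simpler features above, spectral centroid and RMS energy of a single frame (dependency-free direct DFT, fine for short frames; the function name is ours, not from the original pipeline):

```python
import math

def frame_features(frame, fs):
    """Spectral centroid (Hz) and RMS energy of one frame, via a direct DFT."""
    n = len(frame)
    mags, freqs = [], []
    for k in range(n // 2 + 1):
        re = sum(frame[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(frame[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        mags.append(math.hypot(re, im))       # magnitude spectrum
        freqs.append(k * fs / n)              # bin center frequency in Hz
    centroid = sum(f * m for f, m in zip(freqs, mags)) / sum(mags)
    rms = math.sqrt(sum(x * x for x in frame) / n)
    return centroid, rms
```

For a pure 50 Hz tone sampled at 200 Hz the centroid comes out at 50 Hz and the RMS at 1/√2, as expected.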

- SVM (and Multi-class SVM)
- GMM (Gaussian Mixture Model)
- DTW (Dynamic Time Warping)
- Perhaps HMM could do even better?
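For readers unfamiliar with DTW, a textbook dynamic-programming implementation (a sketch, not the exact code used in the evaluation):

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    inf = float('inf')
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: insertion, deletion, or match/substitution.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

DTW tolerates local time stretching, which suits utterances of the same word spoken at slightly different speeds.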

- All samples are converted to audio files in WAV format
- Upsampled to 8 kHz
- Silence removal (based on voiced/unvoiced segment classification)
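The upsampling step can be sketched with naive linear interpolation (a real pipeline would typically use a proper polyphase resampler; the function name is ours):

```python
def upsample_linear(samples, fs_in, fs_out):
    """Resample by linear interpolation from fs_in to fs_out (fs_out > fs_in)."""
    ratio = fs_out / fs_in
    n_out = int((len(samples) - 1) * ratio) + 1
    out = []
    for k in range(n_out):
        pos = k / ratio              # position in input-sample units
        i = int(pos)
        frac = pos - i
        if i + 1 < len(samples):
            out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        else:
            out.append(samples[i])   # clamp at the final sample
    return out
```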

- Subset of TIDIGITS (isolated digits pronunciations)
- 11 words per speaker
- Each speaker recorded each word twice
- There are 10 speakers (5 female and 5 male)
- Digitized at 20 kHz

$$10 \times 11 \times 2 = 220$$ recordings

- Binary SVM with spectral features
- DTW with STFT features
- STFT features:
- Window size: 512 samples - corresponds to 64 ms at the 8 kHz sampling rate

- Multi-class SVM and GMM with spectral features
- DTW with STFT features (same as before)

How can we leverage eavesdropping on several co-located devices simultaneously?

All ADCs have sampling rate $$F_s = 1/T = 200 \text{ Hz}$$

$$T_Q = \frac{T}{N}$$
Time-skews $$t_p \in [0, T_Q]$$
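With $N$ devices whose skews are spread over $[0, T)$, merging the streams yields samples on a finer (generally nonuniform) grid. A sketch of the bookkeeping (hypothetical names; times kept as integer microseconds for exactness):

```python
def interleave(streams, skews, T):
    """Merge per-device streams (device p sampled at t = n*T + skews[p])
    into one time-ordered list of (time, value) pairs."""
    merged = []
    for p, stream in enumerate(streams):
        merged += [(n * T + skews[p], v) for n, v in enumerate(stream)]
    merged.sort()
    return merged

# Two 200 Hz devices (T = 5000 us) offset by T/2 act like one 400 Hz sampler:
print(interleave([[0, 2], [1, 3]], [0, 2500], 5000))
```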

- Offset mismatch: DC component removal
- Gain mismatch: Normalization / use a reference signal
- Time mismatch: background or foreground calibration
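Offset and gain mismatch can be corrected per device before merging; a minimal sketch:

```python
def calibrate(samples):
    """Remove the DC offset and normalize to unit peak amplitude."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]           # offset correction
    peak = max(abs(s) for s in centered) or 1.0      # avoid divide-by-zero
    return [s / peak for s in centered]              # gain correction
```

In practice a shared reference signal gives a better gain estimate than peak normalization, but the idea is the same.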

We can afford heavy offline processing involving time-consuming algorithms.

Filterbank interpolation based on Eldar and Oppenheim's paper

$$h_{p}\left(t\right)=a_{p}\operatorname{sinc}\left(\frac{t}{T}\right)\prod_{q=0,q\neq p}^{N-1}\sin\left(\frac{\pi\left(t+t_{p}-t_{q}\right)}{T}\right)$$

Requires knowing precise time-skews

(Note: we can also try scanning all possible values for time-skews)

- Exhibits improvement over using a single device
- Using even more devices might yield even better results
- Not a proper non-uniform reconstruction

Having one-time root access, we can patch the driver

*hardware/invensense/65xx/libsensors_iio/MPLSensor.cpp*

```
static int hertz_rate = 8000; // was 200
```

Secure system design requires consideration of the whole system and a clear definition of the power of the attacker.

- Low-pass filter the raw samples
  - A 0-20 Hz range should be enough for browser-based applications (according to WebKit)
- Access to a high sampling rate should require a special permission
- Hardware filtering of sensor signals (not subject to configuration)
- Acoustic masking
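A driver-side mitigation can be as simple as a one-pole IIR low-pass applied before samples reach user space; a sketch (the cutoff and names are illustrative, not from any shipped driver):

```python
import math

def lowpass(samples, fs, cutoff):
    """One-pole IIR low-pass filter: passes slow motion, attenuates speech-band energy."""
    dt = 1.0 / fs
    alpha = dt / (dt + 1.0 / (2 * math.pi * cutoff))  # smoothing factor
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out
```

A DC or slowly varying signal (device orientation) passes through unchanged, while content near the Nyquist frequency is strongly attenuated.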

```
if (window.DeviceMotionEvent) {
  window.addEventListener('devicemotion', function(event) {
    var r = event.rotationRate;
    if (r != null) {
      console.log('Rotation at [x,y,z] is: [' +
        r.alpha + ',' + r.beta + ',' + r.gamma + ']\n');
    }
  });
}
```