{"title": "Adversarial Music: Real world Audio Adversary against Wake-word Detection System", "book": "Advances in Neural Information Processing Systems", "page_first": 11931, "page_last": 11941, "abstract": "Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on wake-word detection to respond to people's commands, which could potentially be vulnerable to audio adversarial examples. In this work, we target our attack on the wake-word detection system. Our goal is to jam the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present. We implemented an emulated wake-word detection system of Amazon Alexa based on recent publications. We validated our models against the real Alexa in terms of wake-word detection accuracy. Then we computed our audio adversaries with consideration of expectation over transform and we implemented our audio adversary with a differentiable synthesizer. Next we verified our audio adversaries digitally on hundreds of samples of utterances collected from the real world. Our experiments show that we can effectively reduce the recognition F1 score of our emulated model from 93.4% to 11.0%. Finally, we tested our audio adversary over the air, and verified it works effectively against Alexa, reducing its F1 score from 92.5% to 11.0%. To the best of our knowledge, this is the first real-world adversarial attack against a commercial grade VA wake-word detection system. Our demo video is included in the supplementary material.", "full_text": "Adversarial Music: Real World Audio Adversary\n\nAgainst Wake-word Detection System\n\nJuncheng B. Li1\n\njunchenl@cs.cmu.edu\n\nShuhui Qu3\n\nshuhuiq@stanford.edu\n\nXinjian Li1\n\nxinjianl@cs.cmu.edu\n\nJoseph Szurley2\n\njszurley@bosch.com\n\nJ. 
Zico Kolter1,2\n\nzkolter@cs.cmu.edu\n\nFlorian Metze1\n\nfmetze@cs.cmu.edu\n\n1Carnegie Mellon University, 2Bosch Center for Arti\ufb01cial Intelligence, 3Stanford University\n\nAbstract\n\nVoice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on wake-\nword detection to respond to people\u2019s commands, which could potentially be\nvulnerable to audio adversarial examples. In this work, we target our attack on\nthe wake-word detection system, jamming the model with some inconspicuous\nbackground music to deactivate the VAs while our audio adversary is present.\nWe implemented an emulated wake-word detection system of Amazon Alexa\nbased on recent publications. We validated our models against the real Alexa in\nterms of wake-word detection accuracy. Then we computed our audio adversaries\nwith consideration of expectation over transform and we implemented our audio\nadversary with a differentiable synthesizer. Next we veri\ufb01ed our audio adversaries\ndigitally on hundreds of samples of utterances collected from the real world. Our\nexperiments show that we can effectively reduce the recognition F1 score of our\nemulated model from 93.4% to 11.0%. Finally, we tested our audio adversary over\nthe air, and veri\ufb01ed it works effectively against Alexa, reducing its F1 score from\n92.5% to 11.0%.1 We also veri\ufb01ed that non-adversarial music does not disable\nAlexa as effectively as our music at the same sound level. To the best of our\nknowledge, this is the \ufb01rst real-world adversarial attack against a commercial grade\nVA wake-word detection system.\n\n1\n\nIntroduction\n\nAdversarial attacks on machine learning systems are a topic of growing importance. As machine\nlearning becomes ever more prevalent in all aspects of modern life, concerns about safety tend to gain\nprominence as well. 
Recent demonstrations of the ease with which machine learning systems can be\n\u201cfooled\u201d have caused a strong impact in the \ufb01eld and in the general media. Systems that use voice\nand audio such as Amazon Alexa, Google Assistant, Apple Siri and Microsoft Cortana are growing\nin popularity, and their applications maybe safety critical in cases like in a car. The hidden risk of\nthose advancements is that those systems are potentially vulnerable to adversarial attacks from an\nill-intended third-party. Despite the recent growth in consumer presence of audio-based arti\ufb01cial\nintelligence products, attacks on audio and speech systems have received much less attention than the\nimage and language domains so far.\n\n1Our code and demo videos can be accessed at https://www.junchengbillyli.com/\nAdversarialMusic; F1 score is the metric we chose here since detection and false alarms are equally\ndamaging\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fDespite a number of recent works attempting to create adversarial examples against Automatic\nSpeech Recognition (ASR) systems [Carlini and Wagner, 2018, Sch\u00f6nherr et al., 2018, Qin et al.,\n2019], we believe robust playable-over-the-air real-time audio adversaries against commercial ASR\nsystems still have not been demonstrated. Attacks that worked digitally against speci\ufb01c models are\nnot effective played over-the-air against commercial ASR models. Moreover, as consumer-facing\nproducts, Voice Assistants (VAs) such as Amazon Alexa and Google Assistant are well-maintained\nby their infrastructure teams. It is revealed that these teams retrain and update ASR models very\nfrequently on their cloud back-end, making a robust audio adversary that can consistently work\nagainst these ASR systems almost impossible to craft Hoffmeister [2019]. 
Attackers would not only\nlack knowledge of the backend models\u2019 parameters and gradients, but would also struggle to keep up\nwith the ever-evolving models.\nHowever, all the existing VAs rely solely on wake-word detection to respond to people\u2019s commands,\nwhich could potentially make them vulnerable to audio adversarial examples. In this work, rather than\ndirectly attacking the general ASR models, we target our attack on the wake-word detection system.\nWake-word detection models always have to be stored and executed on-board within smart-home\nhardware, which is usually very limited in terms of computing power due to form factors. It is also\nrevealed that updates to the wake-word models are much more infrequent and dif\ufb01cult compared to\nthe backend ASR models Hoffmeister [2019]. We therefore propose to attack VAs by \u201cjamming\u201d\nthe wake-word detection, effectively deactivating the VA while our adversary is present. In security\nterm, this is a type of denial-of-service (DoS) attack. Speci\ufb01cally, we create a parametric attack that\nresembles a piece of background music, making the attack inconspicuous to humans.\nWe reimplemented the wake-word detection system used in Amazon Alexa based on their latest\npublications on the architecture [Wu et al., 2018]. We leveraged a large amount of open-sourced\nspeech data to train our wake-word model, testing and making sure it has on par performance\ncompared with the real Alexa. We collected 100 samples of \"Alexa\" utterances from 10 people and\naugmented the data set by varying the volume, tempo and speed. We created a synthetic data set\nusing publicly available data sets as background noise and negative speech examples. This collected\ndatabase is used to validate our emulated model and be compared with the real Alexa.\nAfter successfully training a high-performance emulated wake-word model, we synthesized guitar-\nlike music to craft our audio perturbations. 
Projected Gradient Descent (PGD) is used here to\nmaximize the effect of our attack while keeping the parameters of our adversarial music within\nrealistic boundaries. During the training of our audio perturbation, we considered expectation over\ntransforms [Athalye et al., 2018] including psychoacoustic masking and room impulse responses.\nFinally, we tested our perturbation over the air against our emulated model in parallel with the\nreal Amazon Echo and veri\ufb01ed that our perturbation works effectively against both of them in a\nreal world setting. Speci\ufb01cally, the recognition F1 score of our emulated model was reduced from\n93.4% to 11.0%, and the F1 score of the real Alexa was also brought down from 92.5% to 11.0%.\nOur adversarial music poses a real threat against commercial grade VAs, leading to potential safety\nconcerns such as distraction in driving and malicious manipulations of smart devices, and thus, our\n\ufb01nding calls for future speech research looking into defenses against potential risks and general\nrobustness.\n\n2 Background and Related Work\n\nMost current adversarial attacks work by trying to \ufb01nd a way to modify a given input (hopefully by a\nvery small amount) in such a way that the machine learning system\u2019s proper functioning is disrupted.\nA classic example is to take an image classi\ufb01er and modify an input with a very small perturbation\n(dif\ufb01cult for human to tell apart from the original image) that still changes the output classi\ufb01cation\nto a completely distinct (and incorrect) one. To achieve such a goal, the general idea behind many\nof the attack algorithms is to optimize an objective that involves maximizing the likelihood of the\nintended (incorrect) behavior, while being constrained to a small perturbation. For differentiable\nsystems such as deep networks, which are the current state-of-the-art for many classi\ufb01cation tasks,\nutilizing gradient-based methods is a common approach. 
We describe such methods and their relation\nto our work in more depth in Section. 3.5. In this work, our target of attack is wake-word detection\nsystems.\n\n2\n\n\fAdversarial attacks were initially introduced for images [Szegedy et al., 2013] and have been studied\nthe most in the domain of computer vision [Nguyen et al., 2015, Kurakin et al., 2016, Moosavi-\nDezfooli et al., 2016, Elsayed et al., 2018, Li et al., 2019]. Following successful demonstrations in\nthe vision domain, adversarial attacks were also successfully applied to natural language processing\n[Papernot et al., 2016, Ebrahimi et al., 2018, Reddy and Knight, 2016, Iyyer et al., 2018, Naik et al.,\n2018]. This trend gives rise to defensive systems such as Cisse et al. [2017], Wong and Kolter [2018],\nand thus provides a guideline to the community about how to build robust machine learning models.\nAttacks on audio and speech systems have received much less attention so far. Two years ago, Zhang\net al. [2017] pioneered a proof-of-concept that proved the feasibility of real-world attacks on speech\nrecognition models. This work however, had a larger focus on the hardware part of the Automatic\nSpeech Recognition (ASR) system, instead of its machine learning component. Until very recently,\nthere was not much work done on exploring adversarial perturbation on speech recognition models.\nCarlini et al. [2016] was the \ufb01rst to demonstrate that attack against HMM models are possible.\nThey claimed to effectively attack based on the inversion of feature extractions. This work was\npreliminary since it only showcased a limited number of discrete voice commands, and the majority\nof perturbations are not able to be played over the air. As a follow-up work, Carlini and Wagner\n[2018], Qin et al. [2019] showcased that curated white-box attacks based on adversarial perturbation\ncan easily fool the Mozilla speech recognition system. 
Again, their attacks only work\nwith their special setups and are very brittle in the real world. Meanwhile, Sch\u00f6nherr et al. [2018]\nattempted to leverage psycho-acoustic hiding to improve the chance of success of playable attacks.\nThey verified their attacks against the Kaldi ASR system, but the real-world success rate was\nstill not satisfactory, and the adversary itself cannot be played from a different source. Unfortunately,\nstate-of-the-art audio adversaries only work digitally against the specific model they were trained\nagainst, and do not work robustly over the air against state-of-the-art commercial grade ASR systems.\nWe verified that none of the adversaries mentioned above is effective against Alexa or Siri: Alexa and\nSiri can still transcribe correctly in the presence of these audio perturbations. Acknowledging\nthe strict limit on the potency of such an over-the-air attack, we aim at a different target rather than\nfocusing on the ASR models deployed in commercial products. Our proposed attack targets the\nmore manageable but equally critical wake-word detection system, and we demonstrate that\nit can be played over the air.\nHere are our main contributions in this work compared with previous works:\n\n1. We create a parametric threat model in the audio domain that allows us to disguise our\nadversarial attack as a piece of music playable over the air in the physical space.\n\n2. Our adversarial attack is a \u201cgray-box\u201d attack2 that leverages the domain transferability of\nour perturbation. This is a lot more challenging than previous works on white-box attacks,\nwhere attackers have perfect information about the model.\n\n3. Our adversarial attack jointly optimizes the attack nature while fitting the threat model\nto the perturbation achievable by the microphone hearing response of Amazon Alexa. 
Our\nattack budget is very limited compared with previous works, which makes this challenging.\n4. Our adversarial attack demonstrated its effect in the real world under separate audio source\nsettings, which is the \ufb01rst real-time \u201cgray-box\u201d adversarial attack against commercial grade\nVoice Assist\u2019s wake-word detection system to our knowledge. 3\n\n3 Synthesizing Adversarial Music against Alexa\n\nThis section contains the main methodological contributions of our paper: the algorithmic and\npractical pipeline for synthesizing our adversarial music. To begin, we will \ufb01rst describe the training\nsetup and the performance of our emulated wake-word detection model. We then describe the general\nthreat model we consider in this work. Unlike past works which often considered Lp norm bounded\nperturbations on the entire spectrum, we require a threat model that can generate a piece of audible\nmusic. Next, we describe the approach we used to make our threat model robust over the air. Finally,\n\n2We avoid using the term \u201cblack-box\u201d here since we admit that the gradients of the real Alexa model can be\n\nexpected to be similar to those of models of Panchapagesan et al. [2016]\n\n3Our demo video is included in the supplementary material\n\n3\n\n\fwe describe how we optimized the parameters of our threat model to synthesize a robust over-the-air\nattack.\n\n3.1 Emulate the Wake-word Detection Model\n\nWake-word detection is the \ufb01rst important step before any interactions with distant speech recognition.\nDue to the compacted space of embedded platforms and the need for quick re\ufb02ection time, models of\nwake-word detection are usually truncated and vulnerable to attacks. 
Thus, we target our attack at the\nwake-word detection function.\nSince the Alexa model is a black-box to us, the only way to attack it is either to estimate its gradients,\nor to emulate its architecture and later transfer the white-box attack against the emulated model to its\noriginal model. Estimating its gradient using \ufb01rst order optimization techniques would be extremely\ncomputationally expensive [Ilyas et al., 2018], making it dif\ufb01cult if not impossible to implement.\nLuckily, the architecture of Amazon Alexa was published in Panchapagesan et al. [2016], Kumatani\net al. [2017], Guo et al. [2018], allowing us to emulate the model as if it is a white-box attack. We\nimplemented the time-delayed bottleneck highway networks with Discrete Fourier Transform (DFT)\nfeatures as shown in Figure. 1. The architecture is following the details in Guo et al. [2018], which is\nthe most up-to-date information on the model architecture.\nThe architecture of the emulated model is shown in Figure. 1. The model is a two-level architecture\nwhich uses the highway block as the basic building block. Speci\ufb01cally, our architecture contains\na 4-layer highway block as a feature extractor, a linear layer acting as the bottleneck, a temporal\ncontext window that concatenates features from adjacent frames, and a 6-layer highway block for\nclassi\ufb01cation.\nThe training data for wake-word detection systems is very limited, so our model is \ufb01rst pre-trained\nwith several large corpora [Cieri et al., 2004, Godfrey et al., 1992, Rousseau et al., 2012] to train a\ngeneral acoustic model. Then it is adapted to the wake-word detection model by using a small amount\nof wake-word detection training data. The emulated model could detect the wake-word \u201cAlexa\" by\nrecognizing the corresponding phonemes of AX, L, EH, K, S and AX.\nWe trained the model as a binary classi\ufb01cation problem over a time sequence, distinguishing between\nwake-words and non wake-words. 
The performance is evaluated over a reserved test set. Care has\nbeen taken to ensure that augmented copies of the same raw audio sample never appear in both the train\nand test sets. Common performance metrics are listed in Table 1. Given that our\nemulated model\u2019s architecture is very similar to the Alexa model, the gradients of the two models\nshould also share great similarities. We believe that white-box attacks computed by gradient-based\nmethods against our emulated model would be effective against Alexa\u2019s original model. Figure 2\nshows the Detection Error Tradeoff (DET) curve in a similar style to Guo et al. [2018]. We note that\nour performance is not exactly the same as that of the Alexa model due to the lack of training data, but it\nis of the same order of magnitude. Thus, we believe we have a high-fidelity\nemulation of the Alexa wake-word detection model.\n\n3.2 Adversarial Music Synthesizer (Threat Model)\n\nTo perform the adversarial attack in the audio domain, we introduce a parametric model to define\na realistic construction of our adversary \u03b4_\u03b8 parameterized by \u03b8. We use the Karplus-Strong\nalgorithm [Jaffe and Smith, 1983] to synthesize guitar-timbre sounds. The goal is to generate a sequence of\nL guitar notes {(fr_i, d_i, vol_i)}_{i=1}^L, with given Beats Per Minute (BPM) bpm and Sampling Frequency\nfr_s, where fr_i is the pitch frequency of the ith note, d_i is the note duration, and vol_i is the note\nvolume. In this work, we update the parameters \u03b8 = {(fr_i, vol_i)}_{i=1}^L, i.e. the frequency fr and volume vol of all notes,\nwhile fixing the notes\u2019 durations {d_i}_{i=1}^L.\nThe Karplus-Strong algorithm is an example of digital waveguide synthesis, which simulates string\ninstruments physically. 
It mimics the damping of a real guitar string as it vibrates by taking\nthe decaying average of two consecutive samples: y[n] = \u03b3(y[n \u2212 D] + y[n \u2212 D \u2212 1])/2, where \u03b3\nis the decay factor, y is the output wave, and D is the sampling phase delay. This is similar to a one-zero\nlow-pass filter with a transfer function H(z) = (1 + z^{\u22121})/2 [Smith, 1983], as shown in Figure 3.\n\nFigure 1: Emulated Wake-word model\n\nFigure 2: Detection Error Tradeoff Curve (Miss Rate % against False Alarm Rate % for the Emulated Model and the Alexa Model). The\ncurve of the Alexa model is shown as a flat line as its\nFalse Alarm Rate is not published\n\nFigure 3: String instruments with the one-zero low-pass filter approximation. The synthesis process\nfirst generates a short excitation D-length waveform. It is then fed into the filter iteratively to generate\nthe sound.\n\nWhen a guitar string is plucked, the string vibrates and emits sound. To simulate this, the Karplus-Strong\nalgorithm consists of two phases, as shown in Figure 3:\n\nPlucking the string: The string is \u201cplucked\u201d by a random initial displacement and initial velocity\ndistribution y[0 : D \u2212 1] \u223c N(0, \u03b2), where \u03b2 is the displacement factor. The plucking time, i.e. the\nsampling phase delay D for the note frequency fr, is calculated using D = fr_s/fr.\nThe resulting vibrations: The pluck causes a wave-like displacement over time with a decaying\nsound. The decay factor depends on the note frequency fr and the delay period n_D = d \u00d7 fr/fr_s.\nIt is calculated as: \u03b3 = (4/log(fr))^{1/n_D}. The volume of the output of a note is adjusted by\nv_output = p_v \u00d7 vol using a frequency-specific volume factor p_v = 1 + 0.8 \u00d7 (log(fr) \u2212 3)/5.5 \u00d7\ncos(\u03c0/5.3(log(fr) \u2212 3)) [Woodhouse, 2004]. 
Jaffe and Smith [1983] suggested using linear\ninterpolation to obtain a fractional delay, which generates a better result:\n\ny[n] = \u03b3 v_output (\u03c9 \u00d7 y[n \u2212 D] + (1 \u2212 \u03c9) \u00d7 y[n \u2212 D \u2212 1])/2,\n\n(1)\nwhere \u03c9 \u2208 (0, 1) is the weight factor. Overall, the Karplus-Strong algorithm is shown in Algorithm 1.\n\nAlgorithm 1: Karplus-Strong algorithm\nSimulate the plucking phase of each note i by initializing y_i[0 : D \u2212 1] \u223c N(0, \u03b2);\nfor i = 1, ..., L do\n    for n = D, ..., d_i do\n        y_i[n] = \u03b3 v_output (\u03c9 \u00d7 y_i[n \u2212 D] + (1 \u2212 \u03c9) \u00d7 y_i[n \u2212 D \u2212 1])/2;\nReturn y;\n\nThe resulting synthesizing function \u03c0(x; \u03b8) is differentiable with respect to part of the parameters \u03b8: its\nvalue is a continuous function of the frequency fr and the volume vol of each note\n(we fix the note duration d). Therefore, we can implement the perturbation model within an automatic\ndifferentiation toolkit (we implement it with the PyTorch library), a feature that will be exploited\nboth to fit the parametric perturbation model to real data and to construct real-world adversarial attacks.\n\n3.3 Psychoacoustic Effect\n\nOur ultimate task is to deceive the voice assistant with minimal impact on human hearing, so it is\nnatural to leverage the psychoacoustic effect of human hearing. The principles of the psychoacoustic\nmodel are similar to those used in the compression of audio files, e.g. compressing the lossless\nfile format \"wav\" to the lossy file format \"mp3\". In this process, the information carried by the\naudio file is actually altered, while the human ear cannot tell the difference between the two\nsounds. Specifically, a louder signal (the \u201cmasker\u201d) can make other signals at nearby frequencies\n(the \u201cmaskees\u201d) imperceptible [Mitchell, 2004]. In this work, we adopted the same setup as Qin et al.\n[2019]. 
When we add a perturbation \u03b4, we require the normalized power spectral density (PSD) estimate of the\nperturbation \u00afp_\u03b4(k) to lie under the frequency masking threshold of the original audio \u03b7_x(k),\n\n\u00afp_\u03b4(k) = 92 \u2212 max_k {p_x(k)} + p_\u03b4(k)\n\n(2)\nwhere p_\u03b4(k) = 10 log_10 |(1/N) s_\u03b4(k)|^2 and p_x(k) = 10 log_10 |(1/N) s_x(k)|^2 are the power spectral density estimates\nof the perturbation and the original audio input, and s_x(k) is the kth bin of the spectrum of frame x.\nThis results in the loss function term:\n\nL_\u03b7(x, \u03b4) = 1/(\u230aN/2\u230b + 1) \u2211_{k=0}^{\u230aN/2\u230b} max{\u00afp_\u03b4(k) \u2212 \u03b7_x(k), 0}\n\n(3)\n\nwhere N is the predefined window size and \u230ax\u230b outputs the greatest integer no larger than x.\n\n3.4 Expectation Over Transform\n\nWhen using the voice assistant in a room, the real sound caught by the microphone includes both\nthe original sound spoken by the human and the reflected sound. The \"room impulse response\" function\ndescribes the transformation from the original audio to the audio caught by the microphone. Therefore, to\nmake our adversarial attack effective in the physical domain, i.e. to attack the voice assistant over the\nair, it is necessary to consider the room impulse response in our work. Here, we use the classic Image\nSource Method introduced in Allen and Berkley [1979], Scheibler et al. [2018] to create the room\nimpulse response r based on the room configurations (dimension, source location, reverberation time):\nt(x) = x \u2217 r, where x denotes clean audio and \u2217 denotes the convolution operation. 
The transformation\nfunction t follows a chosen distribution T over different room configurations.\n\n3.5 Projected Gradient Descent and Loss Formulation\n\nOriginally, the wake-word detection problem is formulated as a minimization of E_{x,y\u223cD}[L(f(x), y)],\nwhere L is the loss function, f is the classifier mapping from input x to label y, and D is the data\ndistribution. We evaluate the quality of our classifier based on the loss, and a smaller loss usually indicates\na better classifier. In this work, since we are interested in attacks, we form max_\u03b4 E_{x,y\u223cD}[L(f(x\u2032), y)],\nwhere x\u2032 = x + \u03b4 is our perturbed audio. In a completely differentiable system, an immediately\nobvious initial approach to this would be to use gradient ascent in order to search for an x\u2032 that\nmaximizes this loss. However, for this maximization to be interesting both practically and theoretically,\nwe need x\u2032 to be close to the original datapoint x, according to some measure. It is thus common\nto define a perturbation set C(x) that constrains x\u2032, such that the maximization problem becomes\nmax_{x\u2032\u2208C(x)} E_{x,y\u223cD}[L(f(x\u2032), y)]. The set C(x) is usually defined as a ball of small radius (in either the\n\u2113_\u221e, \u2113_2 or \u2113_1 norm) around x.\nSince we have to solve such a constrained optimization problem, we cannot simply apply the gradient\nascent method to maximize the loss, as this could take us out of the constrained region. 
One of the\nmost common methods utilized to circumvent this issue is called Projected Gradient Descent (PGD).\nTo actually implement gradient descent methods, instead of maximizing the aforementioned loss L,\nwe invert the sign and minimize the negative loss, i.e., min_{x\u2032\u2208C(x)} [\u2212l(x, \u03b4, y)].\n\nFigure 4: Physical Testing Illustration\n\nFigure 5: A sample adversarial music\n\nIn our case, we attack the voice assistant via a parametric threat model while considering the psychoacoustic\neffect, as described in Section 3.2 and Section 3.3. The loss function of the attack can thus be rewritten\nas:\n\nmax l(x, \u03b4_\u03b8, y) = E_{x,y\u223cD}[L_wake(f(x + \u03b4_\u03b8), y) \u2212 \u03b1 \u00b7 L_\u03b7(x, \u03b4_\u03b8)]\n\n(4)\nwhere L_wake is the original loss of the emulated model for wake-word detection, and \u03b1 is the balancing\nparameter. \u03b4_\u03b8 is our adversarial music defined in Section 3.2. In addition, we consider the room\nimpulse response defined in Section 3.4, which models the attack in the physical world over the air.\nWe simulated our testing environments using the parameters shown in Table 2, applying the transformation t to our\ndigital adversary to account for the room impulse responses. We want to optimize our \u03b4 by tuning\n\u03b8 to maximize this loss to synthesize our adversary:\n\nmax l(x, \u03b4_\u03b8, y) = E_{t\u2208T, x,y\u223cD}[L_wake(f(t(x + \u03b4_\u03b8)), y) \u2212 \u03b1 \u00b7 L_\u03b7(x, \u03b4_\u03b8)]\n\n(5)\n\n4 Experiments and Results\n\nDatasets\nWe collected 100 positive speech samples (speaking \"Alexa\") from 10 people (4 males and 6 females;\n4 native speakers of English, 6 non-native speakers of English). Each person provided 10 utterances,\nunder the requirement of varying their tone and pitch as much as possible. 
We further augmented the\ndata 20x by varying the speed, tempo and volume of the utterances, resulting in 2000 samples.\nWe used the LJ Speech dataset [Ito, 2017] for background noise and negative speech examples (speaking\nanything but \"Alexa\"). We created a synthetic data set by randomly adding positive and negative\nspeech examples onto a 10s background noise and created binary labels accordingly. While \"hearing\"\npositive speech examples, we set label values as 1.\n\n4We have also experimented with Alienware MX 18R2 speakers and Logitech Z506 speakers; MacBook Pro\nspeakers generated the best results. Since we are not comparing different speakers in this paper, we report the\nresults from the MacBook Pro speakers.\n\nTraining and Testing the Adversarial Music\nFollowing the methodology described in Section 3.5, with the loss function defined by Eq. 5\nand the parametric method described in Section 3.2, we optimized the parameters of the music to\nmaximize our attack. We used the emulated model developed in Section 3.1 to estimate the gradient\nand maximize the classification loss following Eq. 5. When using PGD to train, we restricted\nthe frequency to 27.5Hz \u223c 4186Hz in the 88 notes space, and restricted the volume to 0 dBA\n\u223c 100 dBA. Other parameters are defined in the code; we fixed some parameters to speed up the\ntraining. The trained perturbations are directly added on top of the clean signals to perform our\ndigital evaluation. Our physical experiments are conducted in a home environment which can be\nseen in the video attached in the supplementary material, and the setup is shown in Figure 4. The\nadversarial music is played by a MacBook Pro (15-inch, 2018) speaker.4 The tester stands dt away\nfrom the Amazon Echo or the computer running our emulated model, and the adversary is placed\nda away from the Echo. The angle between the two lines is \u03d5. 
We tested against the model for 100\nsamples with and without the audio adversary. We used a decibel meter to ensure that our adversary\nis never louder than the wake-word command. Our human spoken wake-word command is measured\nto be 70 dBA on average. We experimented with adversaries being played at levels from 40 dBA to 70\ndBA (measured by the decibel meter) with the background reference noise at 20 dBA, and we also\nexperimented with 2 different testers (a male and a female). We also experimented with the influence\nof the timing offset toffset on the performance of the adversarial music.5\n\nModels | Attack | Digital / Physical | Precision | Recall | F1 Score | # Samples\nEmulated Model | No | Digital | 0.94 | 0.97 | 0.955 | 4000\nEmulated Model | No | Physical | 0.91 | 0.96 | 0.934 | 100\nAlexa | No | Physical | 0.92 | 0.93 | 0.925 | 100\nEmulated Model | Yes | Digital | 0.11 | 0.14 | 0.117 | 4000\nEmulated Model | Yes | Physical | 0.09 | 0.12 | 0.110 | 100\nAlexa | Yes | Physical | 0.10 | 0.11 | 0.110 | 100\n\nTable 1: Performance of the models with and without attacks in digital and physical testing\nenvironments given the number of testing samples\n\nThe performance metrics of the emulated model on adversarial examples are shown in Table 1. An\nexample of our adversarial music is shown in Figure 5. As we can see, our attack\nis effective against both the emulated model and the real Alexa model in both digital and physical\ntests. In particular, the precision score takes a heavier hit from the adversary than the recall\nscore. We also noticed that False Positives are relatively uncommon; this might be due to the fact that\nwe are running an untargeted attack against the wake-word detection, and our adversaries are not\nencouraging the model to predict a specific target label. 
Overall, our experiments showcased that\naudio adversaries masked as a piece of music can be played over the air, and disable a commercial\ngrade wake-word detection system.\n\nTest Against Alexa\n\ndt =\n\n4.2f t\nda = 4.7f t, 70dBA 0/10\nda = 6.2f t, 70dBA 1/10\nda = 7.7f t, 70dBA 2/10\nda = 4.7f t, 60dBA 0/10\nda = 6.2f t, 60dBA 1/10\nda = 7.7f t, 60dBA 2/10\nda = 4.7f t, 50dBA 1/10\nda = 6.2f t, 50dBA 2/10\nda = 7.7f t, 50dBA 2/10\nda = 4.7f t, 40dBA 3/10\nda = 6.2f t, 40dBA 3/10\nda = 7.7f t, 40dBA 3/10\nTable 2: Times of the real Amazon Alexa being able to respond to the wake-word under the in\ufb02uence\nof our adversarial music with different settings. (The female and male tester each tests 5 utterances.)\nThe testing set up is illustrated in Figure. 4, and it is also showed in the demo video.\n\n10.2f t\n0/10\n0/10\n0/10\n0/10\n0/10\n0/10\n1/10\n2/10\n2/10\n3/10\n4/10\n4/10\n\n10.2f t\n0/10\n0/10\n1/10\n0/10\n0/10\n1/10\n2/10\n2/10\n3/10\n4/10\n4/10\n4/10\n\n4.2f t\n0/10\n1/10\n3/10\n0/10\n3/10\n3/10\n2/10\n3/10\n3/10\n4/10\n4/10\n5/10\n\n\u03d5 = 180\u25e6\n7.2f t\n0/10\n2/10\n3/10\n0/10\n2/10\n3/10\n2/10\n3/10\n3/10\n4/10\n4/10\n4/10\n\n10.2f t\n0/10\n1/10\n1/10\n0/10\n0/10\n1/10\n1/10\n2/10\n3/10\n4/10\n4/10\n5/10\n\n4.2f t\n0/10\n1/10\n3/10\n0/10\n2/10\n4/10\n2/10\n2/10\n4/10\n4/10\n4/10\n5/10\n\n\u03d5 = 0\u25e6\n7.2f t\n0/10\n0/10\n0/10\n0/10\n1/10\n1/10\n2/10\n3/10\n3/10\n4/10\n4/10\n4/10\n\n\u03d5 = 90\u25e6\n7.2f t\n0/10\n0/10\n1/10\n0/10\n1/10\n2/10\n2/10\n3/10\n2/10\n3/10\n4/10\n5/10\n\nIn our physical experiments against the real Alexa, we measured the performance of our audio\nperturbation under several different physical settings shown in Table 2. As we can observe, our\nmodel successfully attacked against Alexa in most cases. 
The distance d_a between Alexa and the attacker is critical for a successful attack: a shorter distance d_a generally leads to a more effective adversary, while the number of successful attacks declines when the distance d_t is shorter. The adversary effectively fooled Alexa at most volume levels. The success rate increased as the volume of the adversarial music increased from 40 dBA to 70 dBA. We also observed that it is not effective when played below 40 dBA. We also experimented with different starting times of the adversary relative to the wake-word. As shown in Table 3, the adversary was effective when it overlapped with the entire length of the wake-word, while it was ineffective if the wake-word was uttered first. Our adversaries proved effective at various physical locations and angles \u03d5 relative to Alexa and the human tester, although the larger \u03d5 is, the less effective the adversary becomes. This is relevant because Alexa\u2019s 7-microphone array is supposed to be very robust at source separation. This suggests that our consideration of the room impulse responses was useful: when we removed the room impulse response transform, the adversary lost its effect.\n\n5 Please refer to the demo video at https://www.junchengbillyli.com/AdversarialMusic\n\nTo verify that Alexa would not be affected easily by other non-adversarial noises, we experimented with three different baselines: random music generated by the Karplus-Strong (KS) algorithm, single-note sounds generated by the KS algorithm, and random pieces of real music (not generated by the KS algorithm).6 The baseline noises were played at the same volume as our adversarial music. 
As shown in Table 4, these three baselines could rarely fool Alexa under various testing conditions. This demonstrates that the Projected Gradient Descent (PGD) adversarial music trained using the KS algorithm is effective.\n\nTest Against Alexa  \u03d5 = 0\u00b0                     \u03d5 = 90\u00b0                    \u03d5 = 180\u00b0\nd_t =               4.2 ft   7.2 ft   10.2 ft  4.2 ft   7.2 ft   10.2 ft  4.2 ft   7.2 ft   10.2 ft\nt_offset = +1 s     10/10    10/10    10/10    10/10    10/10    10/10    10/10    10/10    10/10\nt_offset = \u22121 s     0/10     0/10     0/10     0/10     0/10     0/10     0/10     0/10     0/10\nloop                0/10     1/10     0/10     0/10     1/10     0/10     1/10     0/10     1/10\n\nTable 3: Number of times (out of 10) the real Amazon Alexa responded to the wake-word under the influence of our adversarial music with different time offsets. The adversarial music was placed at d_a = 7.7 ft from the Echo and played at 70 dBA. Here, t_offset = +1 s indicates that the adversarial music starts 1 second after the wake-word; loop indicates non-stop adversarial music.\n\nTest Against Alexa  \u03d5 = 0\u00b0                     \u03d5 = 90\u00b0                    \u03d5 = 180\u00b0\nd_t =               4.2 ft   7.2 ft   10.2 ft  4.2 ft   7.2 ft   10.2 ft  4.2 ft   7.2 ft   10.2 ft\nRandom Music        10/10    10/10    10/10    10/10    9/10     10/10    10/10    10/10    10/10\nRandom Notes        9/10     10/10    10/10    9/10     10/10    10/10    10/10    9/10     9/10\nReal Music          10/10    10/10    10/10    10/10    10/10    10/10    10/10    10/10    9/10\n\nTable 4: Number of times (out of 10) the real Amazon Alexa responded to the wake-word under the influence of baseline noises: random music and random single notes generated by the Karplus-Strong (KS) algorithm, and real guitar music, which is not generated by the KS algorithm. The baseline audio was placed at d_a = 7.7 ft from the Echo and played at 70 dBA.\n\n5 Conclusion\n\nIn this work, we demonstrated a DoS attack on the wake-word detection system of Amazon Alexa with real-world, inconspicuous adversarial background music. 
We first created an emulated model for wake-word detection on Amazon Alexa following the implementation details published in Guo et al. [2018]. This model is the basis for designing our \u201cgray-box\u201d attacks. We collected, augmented, and synthesized a data set for training and testing. We implemented our threat model in PyTorch; it disguises our attack as a sequence of inconspicuous background music notes. Our experiments verified that our adversarial music is not only effective against our emulated model digitally, but can also effectively disable Amazon Alexa\u2019s wake-word detection function over the air. We also verified that non-adversarial music does not disable Alexa as effectively as our adversary at the same sound level. Our work shows the potential of adversarial attacks in the audio domain, specifically against the widely applied wake-word detection systems in voice assistants. Overall, this suggests a real concern about attacks against commercial-grade machine learning algorithms, highlighting the importance of adversarial robustness from a practical, security-based point of view.\n\nAcknowledgement: The authors would like to thank Bingqing Chen and Zhuoran Zhang for their help with data collection and preliminary studies.\n\n6 We picked 10 different top-ranking pieces of guitar music on YouTube.\n\nReferences\n\nJ.B. Allen and D.A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65(4):943\u2013950, 1979.\n\nAnish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. International Conference on Machine Learning, pages 284\u2013293, 2018.\n\nNicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944, 2018.\n\nNicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. 
Hidden voice commands. 25th USENIX Security Symposium (USENIX Security 16), pages 513\u2013530, 2016.\n\nChristopher Cieri, David Miller, and Kevin Walker. The fisher corpus: a resource for the next generations of speech-to-text. LREC, 4:69\u201371, 2004.\n\nMoustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.\n\nJavid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2:31\u201336, 2018.\n\nGamaleldin F Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both human and computer vision. NIPS, 2018.\n\nJohn J Godfrey, Edward C Holliman, and Jane McDaniel. Switchboard: Telephone speech corpus for research and development. [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1:517\u2013520, 1992.\n\nJinxi Guo, Kenichi Kumatani, Ming Sun, Minhua Wu, Anirudh Raju, Nikko Str\u00f6m, and Arindam Mandal. Time-delayed bottleneck highway networks using a dft feature for keyword spotting. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5489\u20135493, 2018.\n\nBj\u00f6rn Hoffmeister. Alexa pittsburgh\u2019s science networking event, keynote. May 30, 2019, 2019.\n\nAndrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv preprint arXiv:1807.07978, 2018.\n\nKeith Ito. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.\n\nMohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 
Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059, 2018.\n\nDavid A Jaffe and Julius O Smith. Extensions of the karplus-strong plucked-string algorithm. Computer Music Journal, 7(2):56\u201369, 1983.\n\nKenichi Kumatani, Sankaran Panchapagesan, Minhua Wu, Minjae Kim, Nikko Strom, Gautam Tiwari, and Arindam Mandal. Direct modeling of raw audio with dnns for wake word detection. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 252\u2013257, 2017.\n\nAlexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.\n\nJuncheng B Li, Frank R Schmidt, and J Zico Kolter. Adversarial camera stickers: A physical camera attack on deep learning classifier. arXiv preprint arXiv:1904.00759, 2019.\n\nJoan L Mitchell. Introduction to digital audio coding and standards. Journal of Electronic Imaging, 13(2):399, 2004.\n\nSeyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574\u20132582, 2016.\n\nAakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. Stress test evaluation for natural language inference. arXiv preprint arXiv:1806.00692, 2018.\n\nAnh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.\n\nSankaran Panchapagesan, Ming Sun, Aparna Khare, Spyros Matsoukas, Arindam Mandal, Bj\u00f6rn Hoffmeister, and Shiv Vitaladevuni. Multi-task learning and weighted cross-entropy for dnn-based keyword spotting. Interspeech, pages 760\u2013764, 2016.\n\nNicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 
Crafting adversarial input sequences for recurrent neural networks. Military Communications Conference, MILCOM 2016-2016 IEEE, pages 49\u201354, 2016.\n\nY. Qin, N. Carlini, I. Goodfellow, G. Cottrell, and C. Raffel. Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition. arXiv e-prints, March 2019.\n\nSravana Reddy and Kevin Knight. Obfuscating gender in social media writing. Proceedings of the First Workshop on NLP and Computational Social Science, pages 17\u201326, 2016.\n\nAnthony Rousseau, Paul Del\u00e9glise, and Yannick Esteve. Ted-lium: an automatic speech recognition dedicated corpus. LREC, pages 125\u2013129, 2012.\n\nRobin Scheibler, Eric Bezzam, and Ivan Dokmani\u0107. Pyroomacoustics: A python package for audio room simulation and array processing algorithms. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 351\u2013355, 2018.\n\nLea Sch\u00f6nherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665, 2018.\n\nJulius Orion Smith. Techniques for digital filter design and system identification with application to the violin. Number 14. CCRMA, Dept. of Music, Stanford University, 1983.\n\nChristian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.\n\nEric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning, pages 5283\u20135292, 2018.\n\nJim Woodhouse. Plucked guitar transients: Comparison of measurements and synthesis. 
Acta Acustica united with Acustica, 90(5):945\u2013965, 2004.\n\nMinhua Wu, Sankaran Panchapagesan, Ming Sun, Jiacheng Gu, Ryan Thomas, Shiv Naga Prasad Vitaladevuni, Bjorn Hoffmeister, and Arindam Mandal. Monophone-based background modeling for two-stage on-device wake word detection. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5494\u20135498, 2018.\n\nGuoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dolphinattack: Inaudible voice commands. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 103\u2013117, 2017.\n", "award": [], "sourceid": 6421, "authors": [{"given_name": "Juncheng", "family_name": "Li", "institution": "Carnegie Mellon University"}, {"given_name": "Shuhui", "family_name": "Qu", "institution": "Stanford University"}, {"given_name": "Xinjian", "family_name": "Li", "institution": "Carnegie Mellon University"}, {"given_name": "Joseph", "family_name": "Szurley", "institution": "Bosch Center for Artificial Intelligence"}, {"given_name": "J. Zico", "family_name": "Kolter", "institution": "Carnegie Mellon University / Bosch Center for AI"}, {"given_name": "Florian", "family_name": "Metze", "institution": "Carnegie Mellon University"}]}