From microphone to .WAV with getUserMedia and Web Audio

Update: the new MediaStream Recording specification aims to solve this use case through a much simpler API. Follow the conversations on the mailing list.
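For the curious, here is a rough sketch of what that simpler API looks like (based on the MediaRecorder interface from that spec; this is illustrative only, support varies by browser and it typically produces compressed audio such as Opus/WebM rather than a .WAV file):

navigator.mediaDevices.getUserMedia({ audio: true }).then(function (stream) {
    // MediaRecorder does the capture and encoding for us
    var recorder = new MediaRecorder(stream);
    var chunks = [];
    recorder.ondataavailable = function (e) { chunks.push(e.data); };
    recorder.onstop = function () {
        // a single Blob containing the whole recording
        var blob = new Blob(chunks, { type: recorder.mimeType });
        // hand the blob off (play it back, upload it, etc.)
    };
    recorder.start();
    // ... later, call recorder.stop() to finish the recording
});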

A few years ago, I wrote a little ActionScript 3 library called MicRecorder, which allowed you to record the microphone input and export it to a .WAV file. Very simple, but pretty handy. The other day I thought it would be cool to port it to JavaScript. I quickly realized that it is not as easy. In Flash, the SampleDataEvent directly provides the byte stream (the PCM samples) from the microphone. With getUserMedia, the Web Audio API is required to extract the samples. Note that getUserMedia and Web Audio are not broadly supported yet, but support is coming. Firefox has also landed Web Audio recently, which is great news.

Because I did not find an article that went through all the steps involved, here is a short write-up on how it works, from getting access to the microphone to the final .WAV file; it may be useful to you in the future. The most helpful resource I came across was this nice HTML5 Rocks article, which pointed to Matt Diamond's example, which contained the key piece I was looking for to get the Web Audio API hooked up. Thanks so much Matt! Credit also goes to Matt for the buffer merging and interleaving code, which works very nicely.

First, we need to get access to the microphone, and we use the getUserMedia API for that.

navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia ||
                         navigator.mozGetUserMedia || navigator.msGetUserMedia;

if (navigator.getUserMedia){
    navigator.getUserMedia({audio:true}, success, function(e) {
        alert('Error capturing audio.');
    });
} else {
    alert('getUserMedia not supported in this browser.');
}

The first argument of the getUserMedia API describes what we want to get access to (here, the microphone). If we wanted access to the camera, we would pass an object with the video flag set instead:

navigator.getUserMedia({video:true}, success, function(e) {
    alert('Error capturing video.');
});

The other two arguments are callbacks that handle successful access to the hardware, or failure. The success callback is triggered once the user clicks "Allow" in this panel:

[Screenshot: the browser permission prompt asking to allow microphone access]

Once the user has allowed access to the microphone, we need to start querying the PCM samples, and this is where it gets tricky: the Web Audio API comes into play. If you have not checked the Web Audio spec yet, its surface is very large and quite scary the first time you look at it, and that's because the Web Audio API can do a lot: audio filters, synthesized music, 3D audio engines and more. But all we need here are the PCM samples, which we will store and pack inside a WAV container using a simple ArrayBuffer.
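One note before diving into the code: the snippets below keep their recording state in a handful of variables that are assumed to be declared up front, something along these lines:

// recording state used by the code below
var leftchannel = [];     // arrays of Float32Array chunks, one per audioprocess event
var rightchannel = [];
var recordingLength = 0;  // total number of sample-frames captured so far
var sampleRate = 44100;   // overwritten with the real rate once the context exists
var audioContext, context, volume, audioInput, recorder;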

So our user has clicked "Allow"; we now go ahead, create an audio context and start capturing the audio data:

function success(e){
    // creates the audio context
    audioContext = window.AudioContext || window.webkitAudioContext;
    context = new audioContext();

    // retrieve the current sample rate to be used for WAV packaging
    sampleRate = context.sampleRate;
    
    // creates a gain node
    volume = context.createGain();

    // creates an audio node from the microphone incoming stream
    audioInput = context.createMediaStreamSource(e);

    // connect the stream to the gain node
    audioInput.connect(volume);

    /* From the spec: This value controls how frequently the audioprocess event is 
    dispatched and how many sample-frames need to be processed each call. 
    Lower values for buffer size will result in a lower (better) latency. 
    Higher values will be necessary to avoid audio breakup and glitches */
    var bufferSize = 2048;
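    // 2 input channels, 2 output channels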
    recorder = context.createScriptProcessor(bufferSize, 2, 2);

    recorder.onaudioprocess = function(e){
        console.log ('recording');
        var left = e.inputBuffer.getChannelData (0);
        var right = e.inputBuffer.getChannelData (1);
        // we clone the samples
        leftchannel.push (new Float32Array (left));
        rightchannel.push (new Float32Array (right));
        recordingLength += bufferSize;
    }

    // we connect the recorder
    volume.connect (recorder);
    recorder.connect (context.destination); 
}

The createScriptProcessor API takes as its first argument the buffer size you want to retrieve; as noted in the comments, this value dictates how frequently the audioprocess event is dispatched. For the best latency, choose a low value like 2048 (remember, it needs to be a power of two). Every time the event is dispatched, we call the getChannelData API for each channel (left and right), get back a Float32Array for each, clone it (sorry, GC), and store the copy in one of two separate Arrays. This code would be much simpler and more GC friendly if it were possible to write each channel into a single Float32Array directly, but since these cannot have an undefined length, we have to fall back to plain Arrays.

So why do we have to clone the channels? It actually drove me nuts for many hours. What happens is that the returned channel buffers are just pointers to the samples currently coming in, so you need to snapshot (clone) them; otherwise every entry you stored ends up reflecting the sound coming from the microphone at the instant you stopped recording.
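To make the difference concrete (illustrative only; the second line is what the handler above already does):

// stores a live reference: every entry can end up aliasing the same underlying samples
leftchannel.push (left);
// snapshots the current samples into a fresh buffer
leftchannel.push (new Float32Array (left));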

Once we have our arrays of buffers, we need to flatten each channel:

function mergeBuffers(channelBuffer, recordingLength){
  var result = new Float32Array(recordingLength);
  var offset = 0;
  var lng = channelBuffer.length;
  for (var i = 0; i < lng; i++){
    var buffer = channelBuffer[i];
    result.set(buffer, offset);
    offset += buffer.length;
  }
  return result;
}

Once flattened, we can interleave the two channels together:

function interleave(leftChannel, rightChannel){
  var length = leftChannel.length + rightChannel.length;
  var result = new Float32Array(length);

  var inputIndex = 0;

  for (var index = 0; index < length; ){
    result[index++] = leftChannel[inputIndex];
    result[index++] = rightChannel[inputIndex];
    inputIndex++;
  }
  return result;
}

We also need a little writeUTFBytes utility function to write the ASCII chunk identifiers into the DataView:

function writeUTFBytes(view, offset, string){ 
  var lng = string.length;
  for (var i = 0; i < lng; i++){
    view.setUint8(offset + i, string.charCodeAt(i));
  }
}

We are now ready for WAV packaging. You can change the volume variable if needed (from 0 to 1):

// we flatten the left and right channels down
var leftBuffer = mergeBuffers ( leftchannel, recordingLength );
var rightBuffer = mergeBuffers ( rightchannel, recordingLength );
// we interleave both channels together
var interleaved = interleave ( leftBuffer, rightBuffer );

// create the buffer and view to create the .WAV file
var buffer = new ArrayBuffer(44 + interleaved.length * 2);
var view = new DataView(buffer);

// write the WAV container, check spec at: https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
// RIFF chunk descriptor
writeUTFBytes(view, 0, 'RIFF');
// file length minus the first 8 bytes of the RIFF descriptor
view.setUint32(4, 36 + interleaved.length * 2, true);
writeUTFBytes(view, 8, 'WAVE');
// FMT sub-chunk
writeUTFBytes(view, 12, 'fmt ');
view.setUint32(16, 16, true);   // sub-chunk size: 16 for PCM
view.setUint16(20, 1, true);    // audio format: 1 = uncompressed PCM
// stereo (2 channels)
view.setUint16(22, 2, true);
view.setUint32(24, sampleRate, true);
view.setUint32(28, sampleRate * 4, true); // byte rate: sampleRate * channels * bytes per sample
view.setUint16(32, 4, true);              // block align: channels * bytes per sample
view.setUint16(34, 16, true);             // bits per sample
// data sub-chunk
writeUTFBytes(view, 36, 'data');
view.setUint32(40, interleaved.length * 2, true);

// write the PCM samples
var lng = interleaved.length;
var index = 44;
var volume = 1;
for (var i = 0; i < lng; i++){
    view.setInt16(index, interleaved[i] * (0x7FFF * volume), true);
    index += 2;
}

// our final binary blob that we can hand off
var blob = new Blob ( [ view ], { type : 'audio/wav' } );

Obviously, if WAV packaging becomes too expensive, it is an ideal task to offload to a background worker ;)
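To give an idea, here is a rough sketch of that approach; wav-worker.js is a hypothetical worker script that would run the same packaging code shown above and post the resulting Blob back:

var wavWorker = new Worker('wav-worker.js');

wavWorker.onmessage = function(e){
    var blob = e.data; // the packaged audio/wav Blob
    // hand the blob off (save it, upload it, etc.)
};

// transfer the interleaved samples to the worker (no copy is made,
// so the main thread can no longer use the "interleaved" buffer afterwards)
wavWorker.postMessage({ samples: interleaved, sampleRate: sampleRate }, [ interleaved.buffer ]);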
Once done, we can save the blob to a file, send it to a server, or even post-process it. You can also check the live demo here for more fun.
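For instance, to hand the recording to the user as a downloadable file (a minimal sketch, the file name is just an example):

var url = (window.URL || window.webkitURL).createObjectURL(blob);
var link = document.createElement('a');
link.href = url;
link.download = 'recording.wav';
document.body.appendChild(link);
link.click();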

Comments (2)

  1. Tyler wrote:

    Thanks for covering this. I’m slowly moving into web audio, and recording is something I’ll be facing in the future. You guys made it too easy for us with Flash. :-) I hope that you get to cover the MediaStream api version of this when it’s ready. All of your audio articles have been very helpful, as this is still an area that has seen pretty light coverage.

    Thursday, November 20, 2014 at 11:43 pm #
  2. Thibault Imbert wrote:

    Hi Tyler,

    Happy to hear that was helpful. Yeah, definitely trickier than with Flash ;) I will try to write more about this in the future.

    Tuesday, January 13, 2015 at 1:47 am #