I’m almost done implementing a first version the TIMIT dataset in Pylearn2. You can find the code in my public research repository. Let’s look at what the problem was and how I solved it.

### The challenge

My goal was to implement a dataset that can provide as training examples sequences of k audio frames as features and the following frame as target. The frames have a fixed length, and can overlap by a fixed number of samples.

A naive approach would be to allocate a new 2D numpy array and fill it with every example you can generate from your audio sequence. This approach does not scale, and here’s why: say you have 200 acoustic samples that you need to separate into 20-samples-long frames overlapping by 5 samples. Allocating memory for each frame would involve having 13 distinct frames: frame 1 gets samples 1 through 20, frame 2 gets samples 15 through 35, …, and frame 13 gets samples 180 through 200. Already, you can see the overlap adds 60 duplicated frames if you were to enumerate them explicitly. It gets worse, though: say you want to predict the next frame based on the two previous frames. Then your training set would have 11 examples: the first example gets frames 1 and 2 as features and frame 3 as target, the second example gets frames 2 and 3 as features and frame 4 as target, …, and the eleventh example gets frame 11 and 12 as features and frame 13 as target. If you were to list all examples explicitly, you would have 660 acoustic samples, more than three times the length of your original audio sequence. When dealing with thousands of audio sequences of thousands of acoustic samples each, this quickly becomes impractical.

### The solution

Obviously, any practical solution would involve keeping a compact representation of the data in memory and having some sort of mapping to the training examples.

One nice thing about numpy is that it gives you the ability to manipulate the strides of your arrays. This makes it possible to create a view of a numpy array in which data is segmented into overlapping frames without touching to the actual array (see this script).

If you have a numpy array of numpy arrays (all of your audio sequences), you can segment each sequence by calling the segment_axis method on it and then build two additional numpy arrays whose rows represent training examples: the first one maps to a sequence index and the starting (inclusive) and ending (exclusive) frames of the example’s features, and the second one maps to a sequence index and the example’s target frame. You can then write a get() method which takes a list of example indexes and builds the example batch by using the two “mapping” arrays and the array of sequences.

This way, you only have to change a small part of the iterator: instead of acting directly upon a reference pointing to the raw data of your dataset, it calls the dataset’s get() method, which builds and returns the batch of example needed.

### A caveat

For now the dataset only manages acoustic samples; this means no phones / auxiliary information. I’m working on this with Laurent Dinh, and I’ll keep you informed of our progress.

### Example YAML file

You can look here for a (completely unrealistic) example on how to use the dataset in a YAML file.