icefall - Audio read | Sangeet Sagar

Where exactly are audio reading and feature extraction (Kaldifeat fbank) done in Lhotse?

Audio read

The process begins with the creation of an UnsupervisedWaveformDataset object in Lhotse.

dataset = UnsupervisedWaveformDataset(collate=collate)

A DataLoader is then created from the dataset (see lhotse/cut/set.py).

dloader = DataLoader(dataset, batch_size=None, sampler=sampler, num_workers=num_workers)

Inside the UnsupervisedWaveformDataset class (lhotse/lhotse/dataset/unsupervised.py), __getitem__ loads the audio of each cut with c.load_audio() and returns a dict:

{
    "cuts": cuts,
    "audio": audio,
}

So when we iterate over dloader, each batch gives us the cuts and the audio tensor:

for batch in dloader:
    cuts = batch["cuts"]
    waves = batch["audio"]
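
With batch_size=None, PyTorch's DataLoader turns off automatic batching, so whatever the sampler yields (here, a whole batch of cuts) is handed to the dataset's __getitem__ as-is. Below is a minimal pure-Python sketch of that flow. ToyCut, ToyWaveformDataset and toy_loader are hypothetical stand-ins made up for illustration, not Lhotse or PyTorch classes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class ToyCut:
    """Hypothetical stand-in for a Lhotse cut: an ID plus fake samples."""
    id: str
    samples: List[float]

    def load_audio(self) -> List[float]:
        # A real cut would read samples from disk; here they live in memory.
        return self.samples


class ToyWaveformDataset:
    """Hypothetical dataset whose __getitem__ receives a whole batch of cuts."""

    def __getitem__(self, cuts: List[ToyCut]) -> Dict[str, list]:
        audio = [c.load_audio() for c in cuts]
        return {"cuts": cuts, "audio": audio}


def toy_loader(
    dataset: ToyWaveformDataset, sampler: Iterable[List[ToyCut]]
) -> Iterator[Dict[str, list]]:
    # Mimics DataLoader(batch_size=None): each sampler item is already a batch.
    for cut_batch in sampler:
        yield dataset[cut_batch]


# Two pre-built batches, playing the role of the sampler.
sampler = [
    [ToyCut("a", [0.1, 0.2]), ToyCut("b", [0.3])],
    [ToyCut("c", [0.4, 0.5, 0.6])],
]
batches = list(toy_loader(ToyWaveformDataset(), sampler))
for batch in batches:
    print([c.id for c in batch["cuts"]], [len(w) for w in batch["audio"]])
```

The point of the sketch is only that batching is decided by the sampler, not by the loader.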

To get the audio of each cut directly, we can simply do:

for cut in cut_set:
    print(cut.load_audio())

Next, the features are extracted in set.py:

features = extractor.extract_batch( waves, sampling_rate=cuts[0].sampling_rate, lengths=wave_lens )
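
The lengths argument is needed because the waveforms in a batch usually differ in length and get zero-padded to a common size. Here is a rough sketch of that collation step in plain Python. pad_waves is a hypothetical helper written for this note, and Lhotse's own collation code may differ in detail.

```python
from typing import List, Tuple


def pad_waves(waves: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Zero-pad each waveform to the longest one and return the true lengths."""
    wave_lens = [len(w) for w in waves]
    max_len = max(wave_lens)
    padded = [w + [0.0] * (max_len - len(w)) for w in waves]
    return padded, wave_lens


padded, wave_lens = pad_waves([[0.1, 0.2, 0.3], [0.4]])
print(padded)     # every row now has the same number of samples
print(wave_lens)  # original lengths, so the padding can be ignored later
```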

The extractor has been explained here. The function extract_batch is defined in lhotse/features/kaldifeat.py, and this is also where the actual Kaldifeat fbank feature extraction takes place, because the extractor specifies the type of feature (i.e. Kaldifeat fbank):

import kaldifeat

self.extractor = kaldifeat.Fbank(
    kaldifeat.FbankOptions.from_dict(self.config.to_dict())
)
# Actual feature extraction.
result = self.extractor(samples, chunk_size=self.config.chunk_size)
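
To get a feel for the shape of result, the number of frames can be estimated with the Kaldi-style framing rule. This is a small sketch, assuming a 25 ms window, a 10 ms shift and 16 kHz audio; the values in the actual extractor config may differ, and num_fbank_frames is a hypothetical helper, not part of kaldifeat or Lhotse.

```python
def num_fbank_frames(
    num_samples: int,
    sampling_rate: int = 16000,
    frame_length: float = 0.025,  # assumed window size in seconds
    frame_shift: float = 0.01,    # assumed hop size in seconds
    snip_edges: bool = True,
) -> int:
    """Estimate the frame count using the Kaldi-style framing rule."""
    win = round(frame_length * sampling_rate)
    hop = round(frame_shift * sampling_rate)
    if snip_edges:
        # Only windows that fit entirely inside the signal are kept.
        if num_samples < win:
            return 0
        return 1 + (num_samples - win) // hop
    # Without snipping, the count depends only on the hop size.
    return (num_samples + hop // 2) // hop


print(num_fbank_frames(16000))                    # one second, snip_edges=True
print(num_fbank_frames(16000, snip_edges=False))  # one second, snip_edges=False
```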

Once the Kaldifeat fbank features are computed, they are passed on for further processing:

_save_worker(cuts, features), which is defined in lhotse/cut/set.py

No matter what start or end times are specified, the feature extraction process will compute features for the entire audio file. If the associated transcript for the audio file specifies a segment, such as from 5 seconds to 8 seconds within a 20-minute audio file, the computed features should be trimmed to this specific segment before being saved to the lilcom-compressed .lca file. Let's see how this is done in _save_worker(cuts, features).
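
Before looking at that function, the time-to-frame arithmetic behind such a trim can be sketched as follows. This is only an illustration under the assumption of a 10 ms frame shift; trim_features is a hypothetical helper and not the code used in _save_worker.

```python
from typing import List, Sequence


def trim_features(
    feats: Sequence[Sequence[float]],
    start: float,
    duration: float,
    frame_shift: float = 0.01,  # assumed 10 ms hop between frames
) -> List[Sequence[float]]:
    """Keep only the frames that fall inside [start, start + duration)."""
    first = round(start / frame_shift)
    last = round((start + duration) / frame_shift)
    return list(feats[first:last])


# 20 seconds of fake 2-dim features; the first column stores the frame index.
feats = [[float(i), 0.0] for i in range(2000)]
segment = trim_features(feats, start=5.0, duration=3.0)
print(len(segment), segment[0][0], segment[-1][0])
```

With these assumptions, the 5 s to 8 s segment maps to frames 500 through 799, i.e. 300 frames.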

The actual trimming is done inside this function (set.py), where the manifest records the start and duration of the cut:

# Features is lhotse.features.base.Features
feat_manifest = Features(
    start=cut.start,
    duration=cut.duration,
    type=extractor.name,
    num_frames=feat_mat.shape[0],
    num_features=feat_mat.shape[1],
    frame_shift=frame_shift,
    sampling_rate=cut.sampling_rate,
    channels=cut.channel,
    storage_type=feats_writer.name,
    storage_path=str(feats_writer.storage_path),
    storage_key=storage_key,
)

The call below is what actually writes the features to disk. The storage_key it returns is the value passed to the manifest above:

storage_key = feats_writer.write(cut.id, feat_mat) 

It returns keys like “14601,31,23,42”. The first number is the offset for the whole array, and the following numbers are relative offsets for each chunk, each one measured from the start of the previous chunk.
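
Reading that description literally, the absolute positions can be recovered from such a key with a running sum. This is a small sketch; chunk_starts is a hypothetical helper written for this note, not a Lhotse function.

```python
from itertools import accumulate
from typing import List


def chunk_starts(storage_key: str) -> List[int]:
    """Turn 'offset,rel1,rel2,...' into a list of absolute offsets."""
    offsets = [int(x) for x in storage_key.split(",")]
    # Each relative offset is added to the position of the previous entry.
    return list(accumulate(offsets))


print(chunk_starts("14601,31,23,42"))  # [14601, 14632, 14655, 14697]
```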
