icefall- Audio read
Where exactly are audio reading and feature extraction (KaldiFeat fbank) done in Lhotse?
Audio read
The process begins with the creation of an UnsupervisedWaveformDataset object in Lhotse.
dataset = UnsupervisedWaveformDataset(collate=collate)
A DataLoader is then created from the dataset (see lhotse/cut/set.py).
dloader = DataLoader( dataset, batch_size=None, sampler=sampler, num_workers=num_workers )
Inside the UnsupervisedWaveformDataset class (lhotse/lhotse/dataset/unsupervised.py), __getitem__ loads the audio of each cut with c.load_audio() and returns:
{
"cuts": cuts,
"audio": audio,
}
So when we iterate through the dloader, we get the cuts and the audio tensor:
for batch in dloader:
    cuts = batch["cuts"]
    waves = batch["audio"]
To get the audio of each cut directly, we can simply do:
for cut in cut_set:
    print(cut.load_audio())
Next, the features are extracted in set.py:
features = extractor.extract_batch( waves, sampling_rate=cuts[0].sampling_rate, lengths=wave_lens )
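The lengths=wave_lens argument suggests that the waveforms in a batch can have different lengths. As a rough illustration only, assuming wave_lens holds the number of samples in each waveform before padding, a batch could be padded like this. The helper pad_waveforms is hypothetical and is not part of Lhotse:

```python
from typing import List, Sequence, Tuple


def pad_waveforms(
    waves: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[int]]:
    """Right-pad waveforms to a common length and return the original lengths.

    Hypothetical helper for illustration; not Lhotse's collation code.
    """
    lengths = [len(w) for w in waves]
    max_len = max(lengths, default=0)
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waves]
    return padded, lengths


# Three toy "waveforms" of different lengths.
waves = [[0.1, 0.2, 0.3], [0.5], [0.7, 0.8]]
padded, wave_lens = pad_waveforms(waves)
print(padded)
print(wave_lens)
```

Keeping the original lengths next to the padded batch lets a later step ignore the frames that come only from padding.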
The extractor has been explained here. The function extract_batch is defined in lhotse/features/kaldifeat.py, and that is also where the actual Kaldifeat fbank feature extraction takes place, because the extractor specifies the type of feature (i.e. Kaldifeat fbank):
import kaldifeat
self.extractor = kaldifeat.Fbank(kaldifeat.FbankOptions.from_dict(self.config.to_dict()) )
# Actual feature extraction.
result = self.extractor(samples, chunk_size=self.config.chunk_size)
Once the Kaldifeat fbank features are computed, they are passed on for further processing in _save_worker(cuts, features), which is defined in lhotse/cut/set.py.
No matter what start or end times are specified, the feature extraction process computes features for the entire audio file. If the transcript associated with the audio file specifies a segment, such as 5 seconds to 8 seconds within a 20-minute recording, the computed features should be trimmed to that segment before being saved to the LCA-compressed file. Let's see how this is done in _save_worker(cuts, features).
The actual trimming is done inside the function below (set.py):
# Features= lhotse.features.base.Features
feat_manifest = Features(
start=cut.start,
duration=cut.duration,
type=extractor.name,
num_frames=feat_mat.shape[0],
num_features=feat_mat.shape[1],
frame_shift=frame_shift,
sampling_rate=cut.sampling_rate,
channels=cut.channel,
storage_type=feats_writer.name,
storage_path=str(feats_writer.storage_path),
storage_key=storage_key
)
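To make the relation between start, duration and frame_shift concrete, here is a minimal sketch of how a segment given in seconds maps to a range of feature frames. It assumes a frame shift of 0.01 s and simple rounding. The helpers segment_to_frames and trim_features are hypothetical and are not taken from Lhotse:

```python
from typing import List, Sequence, Tuple


def segment_to_frames(
    start: float, duration: float, frame_shift: float
) -> Tuple[int, int]:
    """Convert a segment in seconds to (first_frame, num_frames)."""
    first_frame = round(start / frame_shift)
    num_frames = round(duration / frame_shift)
    return first_frame, num_frames


def trim_features(
    features: Sequence[List[float]],
    start: float,
    duration: float,
    frame_shift: float,
) -> List[List[float]]:
    """Keep only the frames that fall inside the requested segment."""
    first, num = segment_to_frames(start, duration, frame_shift)
    return list(features[first : first + num])


frame_shift = 0.01  # assumed: 10 ms between consecutive frames
# Stand-in for the features of a 20-minute recording, one value per frame.
full_features = [[float(i)] for i in range(120_000)]

# The example from the text: 5 s to 8 s, i.e. start=5.0, duration=3.0.
segment = trim_features(full_features, start=5.0, duration=3.0, frame_shift=frame_shift)
print(segment_to_frames(5.0, 3.0, frame_shift))
print(len(segment))
```

With these assumptions, the number of rows in the trimmed matrix corresponds to num_frames in the manifest, and start and duration describe where those frames sit in the recording.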
This call actually writes the features to disk:
storage_key = feats_writer.write(cut.id, feat_mat)
The write call returns keys like: “14601,31,23,42”. The first number is the offset for the whole array, and the following numbers are relative offsets for each chunk. These offsets are relative to the previous chunk start.
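As an illustration of that key format, the sketch below turns such a key into absolute positions. It follows one plausible reading of the description above: the first number is taken as the starting position, and each following number is added to the previous position. The function decode_chunk_offsets is hypothetical and is not Lhotse's reader code, which may interpret the numbers differently:

```python
from typing import List


def decode_chunk_offsets(storage_key: str) -> List[int]:
    """Turn a key like "14601,31,23,42" into absolute positions.

    Assumption: the first number is the offset of the whole array, and every
    following number is relative to the previous position.
    """
    array_offset, *relative = (int(x) for x in storage_key.split(","))
    positions = [array_offset]
    for rel in relative:
        positions.append(positions[-1] + rel)
    return positions


print(decode_chunk_offsets("14601,31,23,42"))
```

Storing small relative numbers instead of absolute ones keeps the keys short even when the file is large.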