icefall: Data Preparation Pipeline

Data Preparation

icefall relies on the Lhotse library for data preparation. Lhotse handles audio data preparation for all recipes. Let's walk through what happens at each stage.

First, manifests of the data are prepared:

 lhotse prepare commonvoice -j 1 $commonvoice_dir data/manifests

In data/manifests you can find files like:

 data/manifests/commonvoice_recordings_train.jsonl.gz 
 data/manifests/commonvoice_supervisions_train.jsonl.gz

Let’s take a look at the contents of these files:

 gzip -cd data/manifests/commonvoice_recordings_train.jsonl.gz | head -n 1

Sample content:

 {
 "id":"sample_09901.wav",
 "sources":[
     {
         "type":"file",
         "channels":[
             0
         ],
         "source":"/home/audio/sample_09901.wav"
     }
 ],
 "sampling_rate":16000,
 "num_samples":1769149,
 "duration":110.5718125,
 "channel_ids":[
     0
 ]
 }

 gzip -cd data/manifests/commonvoice_supervisions_train.jsonl.gz | head -n 1

Sample content:

 {
 "id":"sample_09901.wav",
 "recording_id":"sample_09901.wav",
 "start":0.0,
 "duration":110.5704,
 "channel":0,
 "text":"~MUSIC i like spongebob.. so does my neighbour",
 "language":"de",
 "speaker":"0000000013-spk1_deu",
 "custom":{
     "utt_id":"rec-037864_deu",
     "end":110.5704
 }
 }
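The two manifests are linked: each supervision refers to its recording through `recording_id`, and the recording's `duration` follows from `num_samples` and `sampling_rate`. The following is a minimal sketch of those relationships using only the Python standard library, not the Lhotse API. The entries are abridged copies of the samples above, and the temporary file names are hypothetical, written only so the sketch is self-contained.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Abridged copies of the sample entries shown above.
recording = {
    "id": "sample_09901.wav",
    "sampling_rate": 16000,
    "num_samples": 1769149,
    "duration": 110.5718125,
}
supervision = {
    "id": "sample_09901.wav",
    "recording_id": "sample_09901.wav",
    "start": 0.0,
    "duration": 110.5704,
}


def read_first_entry(path):
    """Rough equivalent of `gzip -cd FILE | head -n 1`, followed by JSON parsing."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.loads(f.readline())


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, used only inside this temporary directory.
    rec_path = Path(tmp) / "recordings.jsonl.gz"
    sup_path = Path(tmp) / "supervisions.jsonl.gz"
    for path, entry in [(rec_path, recording), (sup_path, supervision)]:
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
    rec = read_first_entry(rec_path)
    sup = read_first_entry(sup_path)

# The recording duration is the sample count divided by the sampling rate.
computed_duration = rec["num_samples"] / rec["sampling_rate"]
# The supervision must point at its recording and fit inside it.
links_to_recording = sup["recording_id"] == rec["id"]
fits_inside = sup["start"] + sup["duration"] <= rec["duration"]
print(computed_duration, links_to_recording, fits_inside)
```

Checks like these can be a quick way to spot manifests where a supervision runs past the end of its audio.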

Pre-processing of Manifests
The next step is the pre-processing of these manifests using:
```
 ./local/preprocess_commonvoice.py
```
This involves cleaning and normalizing the transcripts for the train/test/valid data. It takes the following manifests as input:
```
 data/manifests/commonvoice_recordings_train.jsonl.gz
 data/manifests/commonvoice_supervisions_train.jsonl.gz
```
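To give a feel for what transcript cleaning can look like, here is a small sketch applied to the sample text from the supervision above. The rules (dropping `~`-prefixed tags, stripping punctuation, lowercasing) and the function name `normalize_text` are assumptions made for illustration; the actual rules in `preprocess_commonvoice.py` may differ.

```python
import re


def normalize_text(text: str) -> str:
    """Hypothetical normalization; not the rules used by the real script."""
    text = re.sub(r"~\w+", " ", text)      # drop noise tags such as ~MUSIC
    text = re.sub(r"[^\w\s']", " ", text)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip().lower()


raw = "~MUSIC i like spongebob.. so does my neighbour"
cleaned = normalize_text(raw)
print(cleaned)
```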
The processed cuts are stored in data/fbank/commonvoice_cuts_train_raw.jsonl.gz.
This cut set is then split into 1000 pieces (num_splits):
```
 split_dir=data/fbank/commonvoice_train_split_1000/
 lhotse split 1000 -i 1 ./data/fbank/commonvoice_cuts_train_raw.jsonl.gz $split_dir
```
This split is applied only to the train set, not to the other sets. The resulting pieces are stored in:
- data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.1.jsonl.gz
- data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.2.jsonl.gz
- data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.3.jsonl.gz
- … and so on.
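The idea behind the split can be sketched in a few lines. This is an illustration under assumptions, not the algorithm `lhotse split` uses internally: it cuts a list into contiguous pieces whose sizes differ by at most one and numbers the output files from 1, matching the file names listed above. The helper name `split_evenly` and the toy cut IDs are hypothetical.

```python
def split_evenly(items, num_splits):
    """Split a list into num_splits contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_splits)
    pieces, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # the first `extra` pieces get one more item
        pieces.append(items[start:start + size])
        start += size
    return pieces


cut_ids = [f"cut-{i}" for i in range(10)]  # toy stand-in for a cut set
pieces = split_evenly(cut_ids, 3)

# 1-based file names, following the commonvoice_cuts_train_raw.N.jsonl.gz pattern.
names = [f"commonvoice_cuts_train_raw.{i}.jsonl.gz" for i in range(1, len(pieces) + 1)]
print([len(p) for p in pieces], names[0])
```

Splitting the train set this way lets the feature extraction that follows run piece by piece, and a failed piece can be redone without recomputing the rest.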

Each of these pieces is then processed, and fbank features are computed:

 ./local/compute_fbank_commonvoice_splits.py --num-workers $nj --batch-duration 200 --start 0 --num-splits $num_splits

Here is how the processing takes place:

 Processing 1/1000
 Loading data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.1.jsonl.gz
 Filtering out super short utts
 Removed 0 cuts from 1000 cuts. 0.000% data is removed.
 Doing speed perturb (train set only)
 Splitting cuts into smaller chunks.
 Computing features
 Computing features in batches
 Saving to data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train.1.jsonl.

At the end of each processing step, two files are created in data/fbank/commonvoice_train_split_1000/:

- commonvoice_feats_train_1.lca: fbank features stored in a compressed format (lilcom_chunky, *.lca)
- commonvoice_cuts_train.1.jsonl.gz: the new cut set saved after filtering short utterances, applying speed perturbation, and so on. The next steps use this cut set, not the raw one.
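The filtering and speed perturbation steps from the log can be illustrated with plain Python. This is a sketch under assumptions: the 1.0 s threshold and the 0.9/1.1 speed factors are hypothetical values chosen for the example, and the helper names are not taken from the script. Only the wording of the summary line mirrors the log shown above.

```python
def filter_short(cuts, min_duration):
    """Drop cuts shorter than min_duration and build a log-style summary line."""
    kept = [c for c in cuts if c["duration"] >= min_duration]
    removed = len(cuts) - len(kept)
    percent = 100.0 * removed / len(cuts) if cuts else 0.0
    summary = (
        f"Removed {removed} cuts from {len(cuts)} cuts. "
        f"{percent:.3f}% data is removed."
    )
    return kept, summary


def perturbed_duration(duration, factor):
    """Playing audio `factor` times faster divides its duration by `factor`."""
    return duration / factor


# Toy cuts; the durations are made up for illustration.
cuts = [{"id": f"cut-{i}", "duration": d} for i, d in enumerate([0.4, 2.0, 5.0, 9.0])]
kept, summary = filter_short(cuts, min_duration=1.0)  # hypothetical threshold
print(summary)

# Hypothetical speed factors: each kept cut yields one slower and one faster copy.
durations = {f: [perturbed_duration(c["duration"], f) for c in kept] for f in (0.9, 1.1)}
```

With perturbed copies at two extra speeds, the processed cut set holds roughly three times as many cuts as the filtered one, which is one reason the later steps must use the new cut set and not the raw one.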

Feature Combination
Next, the per-piece cut sets of the train set are combined using:
```
 lhotse combine $pieces data/fbank/commonvoice_cuts_train.jsonl.gz
```
This takes all the new cut sets and combines them into a single file, data/fbank/commonvoice_cuts_train.jsonl.gz. It also shuffles them.
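Conceptually, combining means concatenating the JSONL lines of all pieces into one manifest. The sketch below does this with the standard library and shuffles the lines in memory with a fixed seed. It is an assumption-laden illustration, not how `lhotse combine` is implemented, and the function name `combine_jsonl_gz` and the toy file names are hypothetical.

```python
import gzip
import json
import random
import tempfile
from pathlib import Path


def combine_jsonl_gz(piece_paths, output_path, seed=0):
    """Concatenate gzipped JSONL pieces into one file, shuffling the lines."""
    lines = []
    for path in piece_paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            lines.extend(line for line in f if line.strip())
    random.Random(seed).shuffle(lines)  # in-memory shuffle; fine for a toy example
    with gzip.open(output_path, "wt", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)


with tempfile.TemporaryDirectory() as tmp:
    # Write three toy pieces with two cuts each.
    piece_paths = []
    for n in (1, 2, 3):
        path = Path(tmp) / f"cuts_train.{n}.jsonl.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for k in range(2):
                f.write(json.dumps({"id": f"cut-{n}-{k}"}) + "\n")
        piece_paths.append(path)

    out_path = Path(tmp) / "cuts_train.jsonl.gz"
    count = combine_jsonl_gz(piece_paths, out_path)
    with gzip.open(out_path, "rt", encoding="utf-8") as f:
        combined_ids = [json.loads(line)["id"] for line in f]

print(count, combined_ids)
```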

Transcript Creation and BPE Models
Finally, the train transcripts are created from data/fbank/commonvoice_cuts_train.jsonl.gz, and BPE models are created.
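Transcript creation amounts to collecting the `text` field of every supervision in the combined cut set, one line per utterance. Below is a minimal standard-library sketch of that step. It assumes each cut stores its supervisions in a `supervisions` list, and the toy cuts and file name are hypothetical. The resulting lines would then be fed to a BPE trainer (for instance SentencePiece, which is an assumption here, as the text does not name the tool).

```python
import gzip
import json
import tempfile
from pathlib import Path


def cut_texts(cuts_path):
    """Yield the transcript of every supervision in a gzipped JSONL cuts manifest."""
    with gzip.open(cuts_path, "rt", encoding="utf-8") as f:
        for line in f:
            cut = json.loads(line)
            for sup in cut.get("supervisions", []):
                yield sup["text"]


# Toy cuts with made-up transcripts, for illustration only.
toy_cuts = [
    {"id": "cut-0", "supervisions": [{"text": "i like spongebob"}]},
    {"id": "cut-1", "supervisions": [{"text": "so does my neighbour"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    cuts_path = Path(tmp) / "cuts_train.jsonl.gz"  # hypothetical file name
    with gzip.open(cuts_path, "wt", encoding="utf-8") as f:
        for cut in toy_cuts:
            f.write(json.dumps(cut) + "\n")
    # Consume the generator while the temporary file still exists.
    transcript = list(cut_texts(cuts_path))

print("\n".join(transcript))
```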
