icefall - Data Preparation Pipeline
Data Preparation
icefall relies on the Lhotse library for data preparation; it handles audio data preparation for all recipes. Let's study what happens at each stage.
First, manifests of the data are prepared:

lhotse prepare commonvoice -j 1 $commonvoice_dir data/manifests
In data/manifests you can find files like:

data/manifests/commonvoice_recordings_train.jsonl.gz
data/manifests/commonvoice_supervisions_train.jsonl.gz
Let’s take a look at the contents of these files:
gzip -cd data/manifests/commonvoice_recordings_train.jsonl.gz | head -n 1
Sample content:
{ "id":"sample_09901.wav", "sources":[ { "type":"file", "channels":[ 0 ], "source":"/home/audio/sample_09901.wav" } ], "sampling_rate":16000, "num_samples":1769149, "duration":110.5718125, "channel_ids":[ 0 ] }
gzip -cd data/manifests/commonvoice_supervisions_train.jsonl.gz | head -n 1
Sample content:
{ "id":"sample_09901.wav", "recording_id":"sample_09901.wav", "start":0.0, "duration":110.5704, "channel":0, "text":"~MUSIC i like spongebob.. so does my neighbour", "language":"de", "speaker":"0000000013-spk1_deu", "custom":{ "utt_id":"rec-037864_deu", "end":110.5704 } }
Pre-processing of Manifests

The next step is the pre-processing of these manifests using ./local/preprocess_commonvoice.py.
This involves cleaning and normalizing the transcripts for the train/test/valid data. It takes as input:

data/manifests/commonvoice_recordings_train.jsonl.gz
data/manifests/commonvoice_supervisions_train.jsonl.gz

The processed cuts are stored in data/fbank/commonvoice_cuts_train_raw.jsonl.gz.
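The details are recipe-specific, but conceptually this step normalizes the supervision text and binds recordings and supervisions into cuts. Here is a rough sketch of that idea; the normalize_text() helper is a hypothetical stand-in for the recipe's own normalization rules:

import re
from pathlib import Path
from lhotse import CutSet, load_manifest

def normalize_text(text: str) -> str:
    # Hypothetical normalization: drop tags such as "~MUSIC" and collapse whitespace.
    text = re.sub(r"~\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

recordings = load_manifest("data/manifests/commonvoice_recordings_train.jsonl.gz")
supervisions = load_manifest("data/manifests/commonvoice_supervisions_train.jsonl.gz")

for sup in supervisions:
    sup.text = normalize_text(sup.text)

# Bind recordings and supervisions together into cuts (audio spans plus transcripts).
Path("data/fbank").mkdir(parents=True, exist_ok=True)
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
cuts.to_file("data/fbank/commonvoice_cuts_train_raw.jsonl.gz")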
Further, this cut set is split into 1000 pieces (num_splits):

split_dir=data/fbank/commonvoice_train_split_1000/
lhotse split 1000 -i 1 ./data/fbank/commonvoice_cuts_train_raw.jsonl.gz $split_dir
However, this split is only done for the train set and not for the other sets. These pieces are stored in:
data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.1.jsonl.gz
data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.2.jsonl.gz
data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.3.jsonl.gz
… and so on.
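The same splitting can be done from Python; here is a minimal sketch, assuming lhotse's CutSet.split API:

from pathlib import Path
from lhotse import load_manifest

split_dir = Path("data/fbank/commonvoice_train_split_1000")
split_dir.mkdir(parents=True, exist_ok=True)

cuts = load_manifest("data/fbank/commonvoice_cuts_train_raw.jsonl.gz")
pieces = cuts.split(1000)  # a list of 1000 smaller CutSets
for i, piece in enumerate(pieces, start=1):
    piece.to_file(split_dir / f"commonvoice_cuts_train_raw.{i}.jsonl.gz")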
Each of these cuts is then processed, and fbank features are computed:

./local/compute_fbank_commonvoice_splits.py --num-workers $nj --batch-duration 200 --start 0 --num-splits $num_splits

Here is how the processing takes place:
Processing 1/1000
Loading data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train_raw.1.jsonl.gz
Filtering out super short utts
Removed 0 cuts from 1000 cuts. 0.000% data is removed.
Doing speed perturb (train set only)
Splitting cuts into smaller chunks.
Computing features
Computing features in batches
Saving to data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train.1.jsonl.gz
At the end of each processing step, two files are created in data/fbank/commonvoice_train_split_1000/:

- commonvoice_feats_train_1.lca: the fbank features stored in compressed format (lilcom_chunky: *.lca)
- commonvoice_cuts_train.1.jsonl.gz: the new cut set, saved after filtering out short utterances, applying speed perturbation, etc. The next steps use this cut set, not the raw one.
Feature Combination

Next, the per-split cuts are combined for the train set using:

lhotse combine $pieces data/fbank/commonvoice_cuts_train.jsonl.gz

This takes all the new cuts, combines them into one file, data/fbank/commonvoice_cuts_train.jsonl.gz, and shuffles them.
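The same combination can be expressed in Python; here is a minimal sketch, assuming the per-split cut manifests produced in the previous step, with an explicit shuffle to mirror the shuffling mentioned above:

from glob import glob
from lhotse import combine, load_manifest

pieces = sorted(glob("data/fbank/commonvoice_train_split_1000/commonvoice_cuts_train.*.jsonl.gz"))

# Concatenate all per-split cut sets into one, then shuffle and save.
cuts = combine(*[load_manifest(p) for p in pieces])
cuts = cuts.shuffle()
cuts.to_file("data/fbank/commonvoice_cuts_train.jsonl.gz")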
Transcript Creation and BPE Models
Finally, the train transcripts are created from data/fbank/commonvoice_cuts_train.jsonl.gz, and BPE models are trained on them.
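Here is a minimal sketch of this last step, assuming sentencepiece for the BPE model; the output paths and vocabulary size are illustrative assumptions, not the recipe's exact values:

from pathlib import Path
import sentencepiece as spm
from lhotse import load_manifest_lazy

Path("data/lang_bpe").mkdir(parents=True, exist_ok=True)
cuts = load_manifest_lazy("data/fbank/commonvoice_cuts_train.jsonl.gz")

# Write one transcript per line.
with open("data/lang_bpe/transcript_words.txt", "w", encoding="utf-8") as f:
    for cut in cuts:
        for sup in cut.supervisions:
            f.write(sup.text + "\n")

# Train a BPE model on the transcripts.
spm.SentencePieceTrainer.train(
    input="data/lang_bpe/transcript_words.txt",
    model_prefix="data/lang_bpe/bpe",
    vocab_size=500,
    model_type="bpe",
)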