
What is the suggested way to create features (Mel-Spectrograms) from a speech signal for classification with ResNet?

Data Science, asked by 3r1c on February 25, 2021

At the moment I have this piece of code, which cuts a spectrogram into fixed-length tensors:

def chunks(l, n):
    """Yield successive n-sized chunks along the time axis (dim 2) of l."""
    for i in range(0, l.shape[2], n):
        # only yield full-length chunks; any shorter remainder at the end is dropped
        if i + n <= l.shape[2]:
            yield l.narrow(2, i, n)
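
For what it's worth, I think the same chunking could also be expressed with torch.split along the time dimension (just a sketch of the idea, not something I have verified in my pipeline):

import torch

def chunks_split(spec, n):
    # sketch: split along the time axis (dim 2) into pieces of length n;
    # torch.split leaves a shorter remainder at the end, which is filtered out here
    return [p for p in torch.split(spec, n, dim=2) if p.shape[2] == n]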

The following piece of code:

  1. downsamples the audio,
  2. creates a mel spectrogram and takes its log,
  3. applies cepstral mean and variance normalization (CMVN),
  4. then cuts the spectrogram into fixed-length chunks with the code above and appends them to an array.
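
For completeness, the loop below assumes a few variables are already set up; the values here are only illustrative placeholders, not part of my actual configuration:

import numpy as np
import pandas as pd
import speechpy
import torch
import torchaudio

df = pd.read_csv("metadata.csv")   # placeholder; needs "path" and "y" columns
downsample_rate = 8000             # placeholder target sample rate
max_total_context = 150            # placeholder chunk length in spectrogram frames
X, y = [], []                      # collected feature chunks and their labels
_min, _max = float("inf"), float("-inf")  # running min/max over all features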
for index, row in df.iterrows():
    # 1. load and resample the audio
    wave_form, sample_rate = torchaudio.load(row["path"], normalization=True)
    downsample_resample = torchaudio.transforms.Resample(
        sample_rate, downsample_rate, resampling_method='sinc_interpolation')
    wav = downsample_resample(wave_form)

    # 2. mel spectrogram, then log
    mel = torchaudio.transforms.MelSpectrogram(downsample_rate)(wav)
    mellog = np.log(mel + 1e-9)

    # 3. cepstral mean and variance normalization
    X_sample = speechpy.processing.cmvnw(mellog.squeeze(), win_size=301, variance_normalization=True)
    X_sample = torch.tensor(X_sample).unsqueeze(0)

    # track the global min/max for later scaling
    _min = min(np.amin(X_sample.numpy()), _min)
    _max = max(np.amax(X_sample.numpy()), _max)

    # 4. cut into fixed-length chunks and keep only the full-length ones
    for chunked_X_sample in chunks(X_sample, max_total_context):
        if chunked_X_sample.shape[2] == max_total_context:
            X.append(chunked_X_sample)
            y.append(row["y"])
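
Right now the chunks just end up in Python lists; stacking and scaling them afterwards would look roughly like this (the min-max scaling to [0, 1] and the batch shape are only my assumption of what a ResNet-style model with one input channel expects):

X_tensor = torch.stack(X)                      # [N, 1, n_mels, max_total_context]
X_tensor = (X_tensor - _min) / (_max - _min)   # min-max scale to [0, 1] (assumed)
y_tensor = torch.tensor(y)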

My question: Is this a common way to create features for deep learning?
Do you have any suggestions for optimizing this code?
Furthermore, I'm not sure whether it is right to split the mel spectrograms, or whether I should split the audio earlier instead.
