Suppose we have audio samples of human speech from which we must predict the spoken text (speech-to-text). Which of the following data augmentation techniques are appropriate to use?

