Hello,
First, thank you for making the code publicly available. I am trying to reproduce the SDR results of the BS-RoFormer paper and am using your code as a reference. However, I am seeing significantly lower performance in my reproduction and have a few questions I hope you can answer.
The hyperparameters I used are as follows:
dim: 384
depth: 6
stereo: True
num_stems: 1
time_transformer_depth: 2
freq_transformer_depth: 2
dim_head: 64
heads: 8
ff_dropout: 0.1
attn_dropout: 0.1
flash_attn: True
mask_estimator_depth: 2
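For reference, here are the settings above collected as keyword arguments. The `BSRoformer` class name and the assumption that its constructor accepts exactly these argument names come from my reading of the repository, so treat the commented-out instantiation as a sketch rather than a verified call:

```python
# The hyperparameters listed above, gathered as keyword arguments.
# Assumption: the repository exposes a `BSRoformer` class whose
# constructor accepts exactly these names (they mirror the list above).
hparams = dict(
    dim=384,
    depth=6,
    stereo=True,
    num_stems=1,
    time_transformer_depth=2,
    freq_transformer_depth=2,
    dim_head=64,
    heads=8,
    ff_dropout=0.1,
    attn_dropout=0.1,
    flash_attn=True,
    mask_estimator_depth=2,
)

# from bs_roformer import BSRoformer  # requires the bs-roformer package
# model = BSRoformer(**hparams)
```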
Here are my questions:
Could you please share the model and training hyperparameters you used during training?
The paper describes using a complex spectrogram as input, but I noticed the code applies torch.view_as_real to the STFT output, handling the input in a CaC (complex-as-channels) manner. I believe this differs from the paper. Could you explain the reason for this difference?
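To make the question concrete, here is a minimal sketch of what I understand the CaC step to do; the STFT parameters here are illustrative values I chose, not necessarily the repository's:

```python
import torch

# 8 s of 44.1 kHz stereo audio, as in the paper's setup
audio = torch.randn(2, 8 * 44100)

# Complex spectrogram: (channels, freq_bins, frames) with a complex dtype.
# n_fft / hop_length are illustrative, not necessarily the repo's values.
spec = torch.stft(
    audio, n_fft=2048, hop_length=512,
    window=torch.hann_window(2048), return_complex=True,
)
print(spec.shape, spec.dtype)

# CaC (complex-as-channels): view_as_real appends a real/imag axis of
# size 2, turning the complex tensor into a real-valued one.
cac = torch.view_as_real(spec)
print(cac.shape, cac.dtype)
```

So the network still receives the full complex spectrogram, just with real and imaginary parts stacked along an extra channel axis, which is how I read "CaC" in the code.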
I am training on an H100 80 GB GPU with the hyperparameters above. Despite the slight differences from the paper's settings, a batch size of 4 already fills the 80 GB of memory. Could you share what batch size you used and whether you took any additional steps to improve memory efficiency? For reference, I used 44.1 kHz, 8-second audio for both the target and the mixture, as in the paper's setup.
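In case it is useful context, the workaround I am currently considering is gradient accumulation, which simulates a larger effective batch size without extra activation memory (mixed precision and activation checkpointing being the other common options). This is my own sketch with a stand-in model, not something taken from the paper or the repository:

```python
import torch

# Stand-in model; in practice this would be the BS-RoFormer.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

accum_steps = 4   # effective batch = accum_steps * micro_batch
micro_batch = 1   # what actually fits in memory at once

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 64)
    target = torch.zeros(micro_batch, 1)
    # Divide by accum_steps so the summed gradients average correctly.
    loss = torch.nn.functional.mse_loss(model(x), target) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
opt.step()           # one update with the averaged gradient
opt.zero_grad()
```

With this, a micro-batch of 1 and 4 accumulation steps would match an effective batch size of 4 at roughly a quarter of the activation memory, at the cost of slower steps.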
Your answers would be a great help. Thank you very much!