Optimizing the RAM consumption when preparing data for training #7

@Peter-72

Description

The load_chunk_data method consumes huge amounts of RAM when concatenating NumPy arrays.

I am currently trying to implement something that will reduce the RAM consumption.
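
For reference, the main trick I am trying is to preallocate the output once instead of calling np.concatenate repeatedly; a minimal sketch with placeholder names (stack_chunks and the (n_channels, n_frames, n_bins) layout are assumptions, not the actual load_chunk_data code):

import numpy as np

def stack_chunks(chunk_arrays):
    # Repeated np.concatenate copies all data on every call, so peak RAM
    # can approach twice the final array size. Allocating the output once
    # and filling it in place avoids the intermediate copies.
    n_channels, _, n_bins = chunk_arrays[0].shape
    total_frames = sum(a.shape[1] for a in chunk_arrays)
    out = np.empty((n_channels, total_frames, n_bins), dtype=chunk_arrays[0].dtype)
    offset = 0
    for a in chunk_arrays:
        out[:, offset:offset + a.shape[1], :] = a
        offset += a.shape[1]
    return out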

@karnwatcharasupat @thomeou I am happy to open a PR when I am done, if that is acceptable to you.

PS: I noticed that the previous method never worked, and I apologize for not properly testing it; I am trying something new now.

@karnwatcharasupat The splitting idea didn't work. Even after I fixed it to actually concatenate the chunks, I still end up concatenating NumPy arrays that eventually reach a shape of (7, 1920000, 200), which is unmanageable anyway (assuming float32, that is 7 × 1,920,000 × 200 × 4 bytes ≈ 10.7 GB for a single array). My next idea is to not concatenate them at all, but to export them separately into db_data in the get_split method, for example like this:

db_data = {
    'features': features,
    'features_2': features_2,
    'features_3': features_3,
    'features_4': features_4,
    'sed_targets': sed_targets,
    'doa_targets': doa_targets,
    'feature_chunk_idxes': feature_chunk_idxes,
    'gt_chunk_idxes': gt_chunk_idxes,
    'filename_list': filename_list,
    'test_batch_size': test_batch_size,
    'feature_chunk_len': self.chunk_len,
    'gt_chunk_len': self.chunk_len // self.label_upsample_ratio
}
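
Concretely, the chunking could be produced like this; a sketch in which per_file_features is a hypothetical stand-in for the list of per-file arrays that load_chunk_data currently concatenates into one block:

import numpy as np

# Split the file list into 4 groups and concatenate each group separately,
# so no single allocation ever covers the whole split.
quarters = np.array_split(np.arange(len(per_file_features)), 4)
chunks = [np.concatenate([per_file_features[i] for i in idxs], axis=1)
          for idxs in quarters]
features, features_2, features_3, features_4 = chunks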

Here, features, features_2, features_3, and features_4 are just the features split into 4 chunks. The rest of the project would then be adjusted so that every use of features consumes the four arrays sequentially. I have already written a method that exports the 4 arrays, but I am still exploring the code to better understand it before changing how it works. Currently, I can see that the get_split method is called during training in the datamodule.py file, specifically in

train_db = self.feature_db.get_split(split=self.train_split, split_meta_dir=self.split_meta_dir, stage='fit')

and in

val_db = self.feature_db.get_split(split=self.val_split, split_meta_dir=self.split_meta_dir, stage='inference')

The call that builds train_db is currently my problem.
If you have an idea of how to add the chunks part to the code, please let me know.
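
For what it's worth, here is the rough shape I am imagining for the consuming side; a hypothetical sketch (the ChunkedFeatures class and its method names are mine, not the project's actual code):

import numpy as np

class ChunkedFeatures:
    # Wraps the four sub-arrays so the training code can keep addressing
    # frames through one logical time axis.
    def __init__(self, db_data):
        self.parts = [db_data['features'], db_data['features_2'],
                      db_data['features_3'], db_data['features_4']]
        # Cumulative frame offset of each part along the time axis.
        self.offsets = np.cumsum([0] + [p.shape[1] for p in self.parts[:-1]])
        self.total_frames = sum(p.shape[1] for p in self.parts)

    def frames(self, start, stop):
        # Gather frames [start, stop), even when the range crosses a
        # boundary between two of the four parts.
        pieces = []
        for part, off in zip(self.parts, self.offsets):
            lo = max(start - off, 0)
            hi = min(stop - off, part.shape[1])
            if lo < hi:
                pieces.append(part[:, lo:hi, :])
        return pieces[0] if len(pieces) == 1 else np.concatenate(pieces, axis=1)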

Metadata

Labels

enhancement (New feature or request)
will-take-a-while (It will likely be a while before a proper fix is made. Please do not expect an immediate fix.)
