-
Notifications
You must be signed in to change notification settings - Fork 0
Bugfix/package config edgecases #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: bugfix/package-config-edgecases
Are you sure you want to change the base?
Bugfix/package config edgecases #8
Conversation
ntomita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR. I have some minor requests that should be addressed before merging. Thanks!
MaskHIT/maskhit/cross_validation.py
Outdated
| config = Config(config_file_default, config_file) | ||
| folds = [int(i) for i in args.folds.split(',')] if args.folds else list(range(config.dataset.num_folds)) | ||
| print(f"Testing on folds: {list(folds)}") | ||
| print(f"[INFO] Conducting 5 fold Cross Validation on folds: {list(folds)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this change: fold number is arbitrary and we should not hard code it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the specific fold number
| def parse(self): | ||
| args = self.parser.parse_args() | ||
| print(args) | ||
| # print(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why commenting out? It's okay if we printing the same info somewhere else, but should keep it if not.
| pattern = r'--timestr=[^\s]+' | ||
| org_cmd = re.sub(pattern, '', org_cmd) | ||
|
|
||
| timestr_new += '-test' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a helpful postfix to distinguish one from training so please revert this.
timestr_new variable sounds terrible. Could you rename it to timestr_test ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tested the original version it was appending '-test' a new time for fold so that by the last fold the timestr was '2023_2_6-vit-test-test-test-test-test' for example. To fix I removed that code and instead append "-test" to the timestr passed from cross_validation.py. Let me know if you'd still recommend changes there, I'll add the rename to timestr_test now though.
| patient_ids.append(patient_id) # adding patient id to the list | ||
| meta_split['id_patient'] = patient_ids # adding column to the meta_split dataframe | ||
| # formatting rows in meta_file of the id patients so they match that of meta_split df | ||
| meta_file['id_patient'] = meta_file['id_patient'].apply(lambda x: pd.Series(x.split(' ')[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes for 272-307 seems making sense for non-ibd users. Then next step is whoever affected by this change should correctly update their meta data files in SlidePrep library part. The assumption here seems both meta files has id_patient column with consistent values so it can be merged at 318. Could you add this note after the line 272 (if-branch) so people know what should be fixed in case this change breaks some non-ibd users' code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| # use the model name | ||
| model_name = config.model.resume | ||
| TIMESTR = model_name.split('-')[0] | ||
| elif config.model.resume: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we stick to one instead of making compatible for both if both args are meant to be the same thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into a problem with that setup when using cross_validation.py. My dataset config file doesn't resume a pretrained model but I then need to load a different model for each fold when testing. Since the config file doesn't change between test folds, I had to specify which model to evaluate (ie resume) in the form of args to train.py. This seemed like the best way to fix the issue when running cross validation while also allowing for fine tuning a pretrained model specified in config, let me know if have an idea for something better.
| data_dict = {"train": df_train, "val": df_test} | ||
|
|
||
| df_test.to_csv('fold0.csv') | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure about this fold0 too so I'm okay to remove this. Based on file name it's probably for debugging in the early stage of development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I added that for debugging with the IBD Project, so it is safe to be removed
| # print(f"[INFO] Loading config files:") | ||
| # print(f"[INFO] Default config: {default_config_file}") | ||
| # print(f"[INFO] User config: {user_config_file}") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why you have to move this to crossvalidation script? Functionally here seems better place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll switch it back
| self.writer['meta'].info('\t'+ data_dict['val'][['id_patient']].to_string()) | ||
| elif procedure == 'test': | ||
| self.writer['meta'].info('Testing patients:') | ||
| self.writer['meta'].info('\t'+ data_dict['val'][['id_patient']].to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why these lines have to be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that it would be useful to have a record in the log file of which slides are being trained/validated/tested on. That's useful for me but I can also remove it from the PR if other users would rather not include that info in the log files.
Simplified and improved the readability of logging outputs. Also fixed bugs introduced by hardcoded dataset processing and args in quicktest.py