
Conversation

@jack-mcmahon (Collaborator)

Simplified the logging output and improved its readability. Also fixed bugs introduced by hardcoded dataset processing and hardcoded args in quicktest.py.

@ntomita ntomita requested review from adas2125 and ntomita February 8, 2024 18:35
@ntomita (Contributor) left a comment

Thanks for your PR. I have some minor requests that should be addressed before merging.

config = Config(config_file_default, config_file)
folds = [int(i) for i in args.folds.split(',')] if args.folds else list(range(config.dataset.num_folds))
print(f"Testing on folds: {list(folds)}")
print(f"[INFO] Conducting 5 fold Cross Validation on folds: {list(folds)}")
Contributor

Please revert this change: the fold number is arbitrary and we should not hard-code it.
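A minimal sketch of a non-hardcoded version, deriving the fold count from the selected folds themselves (the helper name is hypothetical; the message format follows the snippet above):

```python
def cv_log_message(folds):
    # Derive the fold count from the list of selected folds rather than
    # hardcoding "5" into the log message
    return f"[INFO] Conducting {len(folds)}-fold cross validation on folds: {list(folds)}"
```

This keeps the log accurate when a user runs a subset of folds via `--folds`.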

Collaborator Author

Removed the specific fold number.

def parse(self):
args = self.parser.parse_args()
print(args)
# print(args)
Contributor

Why comment this out? It's okay if we're printing the same info somewhere else, but we should keep it if not.

pattern = r'--timestr=[^\s]+'
org_cmd = re.sub(pattern, '', org_cmd)

timestr_new += '-test'
Contributor

This is actually a helpful suffix for distinguishing a test run from training, so please revert this.
Also, the timestr_new variable name sounds terrible. Could you rename it to timestr_test?

@jack-mcmahon (Collaborator, Author) Feb 18, 2024

When I tested the original version, it was appending '-test' once per fold, so that by the last fold the timestr was, for example, '2023_2_6-vit-test-test-test-test-test'. To fix this I removed that code and instead append '-test' to the timestr passed from cross_validation.py. Let me know if you'd still recommend changes there; I'll add the rename to timestr_test now, though.
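Another way to guard against the accumulation bug described above is to make the suffix idempotent; a hedged sketch (helper name is hypothetical):

```python
def with_test_suffix(timestr):
    # Append '-test' at most once, even if this is called once per fold
    return timestr if timestr.endswith('-test') else timestr + '-test'
```

With this, repeated calls across folds leave the timestr unchanged after the first append.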

patient_ids.append(patient_id) # adding patient id to the list
meta_split['id_patient'] = patient_ids # adding column to the meta_split dataframe
# formatting rows in meta_file of the id patients so they match that of meta_split df
meta_file['id_patient'] = meta_file['id_patient'].apply(lambda x: pd.Series(x.split(' ')[0]))
Contributor

The changes for 272-307 seem to make sense for non-IBD users. The next step is that whoever is affected by this change should correctly update their metadata files in the SlidePrep library. The assumption here seems to be that both meta files have an id_patient column with consistent values, so they can be merged at 318. Could you add this note after line 272 (the if-branch) so people know what should be fixed in case this change breaks some non-IBD users' code?
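The normalization that the merge assumption rests on can be sketched without pandas; the helper below mirrors the `split(' ')[0]` logic in the diff (the function name is hypothetical):

```python
def normalize_patient_id(raw_id):
    # Keep only the first whitespace-separated token so id_patient values
    # match between meta_file and meta_split before they are merged
    return raw_id.split(' ')[0]
```

If a user's meta file stores extra tokens after the patient id, this is the step they would need to reconcile.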

Collaborator Author

Added

# use the model name
model_name = config.model.resume
TIMESTR = model_name.split('-')[0]
elif config.model.resume:
Contributor

Shouldn't we stick to one instead of supporting both, if both args are meant to be the same thing?

Collaborator Author

I ran into a problem with that setup when using cross_validation.py. My dataset config file doesn't resume a pretrained model, but I then need to load a different model for each fold when testing. Since the config file doesn't change between test folds, I had to specify which model to evaluate (i.e., resume) via args to train.py. This seemed like the best way to fix the issue when running cross-validation while still allowing fine-tuning of a pretrained model specified in the config; let me know if you have an idea for something better.
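The precedence described above can be sketched as follows (the function and parameter names are hypothetical; it assumes a falsy value means "not set"):

```python
def resolve_resume(arg_resume, config_resume):
    # The per-fold CLI arg passed by cross_validation.py wins over the
    # static config value; fall back to the config so fine-tuning a
    # pretrained model specified there still works
    return arg_resume if arg_resume else config_resume
```

This keeps a single code path for both workflows instead of two independent resume mechanisms.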

data_dict = {"train": df_train, "val": df_test}

df_test.to_csv('fold0.csv')

Contributor

I wasn't sure about this fold0 either, so I'm okay with removing it. Based on the file name, it was probably for debugging in the early stage of development.

Collaborator

Yes, I added that for debugging with the IBD project, so it is safe to remove.

# print(f"[INFO] Loading config files:")
# print(f"[INFO] Default config: {default_config_file}")
# print(f"[INFO] User config: {user_config_file}")

Contributor

I'm not sure why you had to move this to the cross-validation script; functionally, this seems like the better place for it.

Collaborator Author

Sure, I'll switch it back.

self.writer['meta'].info('\t'+ data_dict['val'][['id_patient']].to_string())
elif procedure == 'test':
self.writer['meta'].info('Testing patients:')
self.writer['meta'].info('\t'+ data_dict['val'][['id_patient']].to_string())
Contributor

Could you explain why these lines have to be added?

Collaborator Author

I thought it would be useful to have a record in the log file of which slides are being trained/validated/tested on. That's useful for me, but I can also remove it from the PR if other users would rather not include that info in the log files.
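A minimal sketch of the logging described above (names are hypothetical; the real code writes through self.writer['meta'] and a pandas DataFrame):

```python
def split_log_lines(procedure, patient_ids):
    # Build log lines recording which patients are in the current split,
    # so each run is auditable from the log file alone
    headers = {'train': 'Training patients:',
               'val': 'Validation patients:',
               'test': 'Testing patients:'}
    return [headers[procedure]] + ['\t' + pid for pid in patient_ids]
```

Keeping this in the meta log makes it easy to verify after the fact that no patient leaked between splits.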

@jack-mcmahon jack-mcmahon requested a review from ntomita February 18, 2024 22:50
