
Refactor PSF Dataset Generation Script for Maintainability and Testing #172

@jeipollack

The PSF dataset generation script (data_generation_script.py) is built around a single large, monolithic function and would benefit from refactoring to improve maintainability, testability, and code reuse.

Current Issues

Code Structure

  • Monolithic design: Single 800+ line main() function handling multiple responsibilities
  • Poor separation of concerns: Configuration, data processing, visualization, and I/O all mixed together
  • Complex control flow: Deeply nested conditionals make the code difficult to follow and debug

Testing & Reliability

  • No unit tests: Critical data generation logic lacks test coverage
  • Missing error handling: File operations and array manipulations lack proper exception handling
  • No input validation: Configuration parameters aren't validated before use
  • Sign convention mismatch: The data generation script appears to use different sign conventions than the main codebase, which prevents the saved ground-truth information from being used correctly for verification
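
The validation and error-handling gaps listed above could be closed along the lines of the following minimal sketch. The `DatasetConfig` class, its field names, and the `write_dataset` helper are hypothetical and only illustrate the pattern; the script's actual parameters and file formats are not shown in this issue.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical configuration; field names are illustrative only."""

    n_train_stars: int
    n_test_stars: int
    output_dir: str

    def __post_init__(self):
        # Fail fast, before any expensive simulation starts.
        for name in ("n_train_stars", "n_test_stars"):
            value = getattr(self, name)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        if not self.output_dir:
            raise ValueError("output_dir must not be empty")


def write_dataset(config, filename, payload):
    """Write payload as JSON, turning low-level I/O errors into a clear message."""
    path = Path(config.output_dir) / filename
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w") as handle:
            json.dump(payload, handle)
    except OSError as exc:
        raise RuntimeError(f"Could not write dataset to {path}") from exc
    return path
```

With this shape, a bad parameter such as `n_train_stars=0` raises a `ValueError` at construction time instead of failing deep inside the generation loop.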

Code Quality

  • Inconsistent naming: Mix of mixed-case identifiers (sim_PSF_toolkit) and plain snake_case (train_positions)
  • Magic numbers: Hard-coded values scattered throughout (e.g., selected_id_SED = np.random.randint(low=0, high=13))
  • Duplicate code: Similar operations repeated for train/test datasets
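
As a rough illustration of the last two points, the sketch below replaces the literal `13` with a named constant and routes train and test through one shared helper. It assumes that `high=13` is the number of available SED templates, which this issue does not confirm. The helper names are hypothetical. The sketch uses the standard-library `random` module to stay dependency-free; in the script itself the draw would remain a NumPy call with the literal replaced by the constant.

```python
import random

# Assumption: the hard-coded `high=13` is the number of available SED templates.
N_SED_TEMPLATES = 13


def sample_sed_ids(n_stars, rng):
    """Draw one SED template index per star, each in [0, N_SED_TEMPLATES)."""
    return [rng.randrange(N_SED_TEMPLATES) for _ in range(n_stars)]


def build_split(n_stars, rng):
    """Build one dataset split; the same code path serves train and test."""
    return {"n_stars": n_stars, "sed_ids": sample_sed_ids(n_stars, rng)}


rng = random.Random(0)  # seeded so runs are reproducible
train = build_split(n_stars=5, rng=rng)
test = build_split(n_stars=3, rng=rng)
```

Passing the random generator in explicitly, rather than relying on global state, also makes the helper straightforward to test with a fixed seed.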

Proposed Refactoring

  1. Create a Class-Based Architecture

  2. Separate Concerns into Modules

  3. Add Comprehensive Testing

  4. Improve Error Handling

  5. Code Quality Improvements
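
The five steps above are listed without detail, so the following is only one possible reading of steps 1 to 3: a small class whose methods each do one thing, with generation kept separate from I/O, plus a pytest-style test. The class name `PSFDatasetGenerator`, its parameters, and the uniform position sampling are all assumptions made for illustration, not the script's actual design.

```python
import random


class PSFDatasetGenerator:
    """Hypothetical orchestrator: each step is a small method that can be tested alone."""

    def __init__(self, n_train, n_test, field_size=1.0, seed=None):
        self.n_train = n_train
        self.n_test = n_test
        self.field_size = field_size
        self.rng = random.Random(seed)

    def sample_positions(self, n_stars):
        """Draw (x, y) star positions uniformly inside the field of view."""
        return [
            (self.rng.uniform(0.0, self.field_size), self.rng.uniform(0.0, self.field_size))
            for _ in range(n_stars)
        ]

    def build_split(self, n_stars):
        """Assemble one split; a real version would also simulate the PSFs here."""
        return {"positions": self.sample_positions(n_stars)}

    def generate(self):
        """Return both splits without touching the disk, so I/O stays separate."""
        return {
            "train": self.build_split(self.n_train),
            "test": self.build_split(self.n_test),
        }


def test_generate_returns_requested_sizes():
    # Written in pytest style; it also runs as a plain function call.
    data = PSFDatasetGenerator(n_train=4, n_test=2, seed=42).generate()
    assert len(data["train"]["positions"]) == 4
    assert len(data["test"]["positions"]) == 2
```

Because `generate()` returns plain data, saving and plotting can live in separate modules and be tested independently.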

Benefits

  1. Maintainability: Smaller, focused functions are easier to understand and modify
  2. Testability: Individual components can be unit tested in isolation
  3. Reusability: Modular design allows components to be reused in other contexts
  4. Debugging: Easier to isolate and fix issues in specific components
  5. Documentation: Clear separation makes the codebase more approachable for new contributors

This refactoring should be completed before the next release cycle. If that is not possible, the script will be moved out of develop into a dedicated feature branch, to be refactored when time allows, so that it does not block the develop→main merge.
