Neural signal modeling toolkit for MEG/ephys research. It bundles preprocessing, dataset utilities, tokenizers, autoregressive/diffusion/flow/conv models, and PyTorch Lightning training + evaluation entrypoints.
- Python 3.13 (see `setup.py`).
- Create/activate an environment, then `pip install -e .`.
- Preprocessing depends on `osl-ephys` and MNE in addition to the packages in `requirements.txt`.
- `preprocess.py` – CLI for multi-stage MEG preprocessing.
- `run.py` – Unified train/test/eval/tokenizer driver built on `ExperimentDL`.
- `evals.py` – Lightweight evaluation runner for automated checkpoint sweeps.
- `ephys_gpt/preprocessing` – Dataset-specific preprocessors (Ωmega/MEG and MOUS) with OSL wrappers and post-processing steps.
- `ephys_gpt/dataset` – Dataset builders, datasplitters, augmentations, mixup dataloader, and text helpers.
- `ephys_gpt/models` – Model zoo (GPT-style AR, Wavenets, CNN/LSTM, diffusion/flow, tokenizers, MEGFormer, BrainOmni, etc.) plus `layers/` and `mdl/` submodules.
- `ephys_gpt/training` – Lightning `LitModel`, experiment runners, optimizer/scheduler/PL logging glue, tokenizer trainers, and evaluation utilities.
- `ephys_gpt/eval` – Evaluation classes invoked via `run.py --mode eval` (quantized AR, diffusion, flow/image, continuous, flat, text, VQ systems).
- `ephys_gpt/utils` – Quantizers, plotting, sampling, metrics, and test helpers.
- `tests/` – Shape/causality checks for datasets and models.
- `configs/` – Example configs for preprocessing, models, training/eval, augmentations, and local development.
The preprocessing script orchestrates OSL-ephys (stage 1) and custom transforms (stage 2/3) for different datasets.
```bash
python preprocess.py --dataset <omega|mous|mous_conditioned> \
    --stage <stage_1|stage_2|stage_3|both|all> \
    --args configs/local/preprocess.yaml
```

- `--dataset` selects the dataset-specific pipeline (Ωmega, MOUS, or MOUS with conditioning).
- `--stage` chooses which stages to run (`both` = stage_1 + stage_2, `all` runs all available stages).
- `--args` points to a YAML file containing stage parameters.
Outputs are stored under `preprocessed/<sub-...>/<i>.npy` with data arrays and metadata (`sfreq`, `ch_names`, `pos_2d`, etc.).
- Paths: `data_path` (raw dataset root), `out_root` (optional override for `preprocessed/`), `osl_config` (OSL YAML path or inline dict to forward to `osl_ephys.preprocessing.run_proc_chain`).
- Stage selection: `stage_1` keys mirror OSL-ephys parameters (filtering, ICA, bad channel detection). `stage_2` handles chunking/normalization/quantization via `preproc_config`. `stage_3` handles post-processing such as conditioning labels for MOUS.
- Performance: `n_workers` (Dask workers), `max_bad_channels` and `max_chunks` to bound work, `chunk_seconds` to control sequence length for downstream models.
- Transforms: `preproc_config` toggles `normalize`, `clip`, `mulaw_quantize`, `random_sign_flip`, etc. Use the `mulaw_quantize` parameters to set the codebook size for downstream quantizers.
- Metadata: `save_info` enables saving sensor metadata alongside `.npy` chunks for plotting/evaluation.
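Putting those keys together, a hypothetical preprocessing YAML might look like the sketch below. Key names follow the list above, while the concrete values and the exact nesting (e.g., the shape of the `mulaw_quantize` block) are illustrative assumptions, not a copy of `configs/local/preprocess.yaml`.

```yaml
# Hypothetical preprocessing config sketch; key names follow the README,
# values and nesting details are illustrative assumptions.
data_path: /data/omega/raw            # raw dataset root
out_root: preprocessed/               # optional override for the output location
osl_config: configs/osl/stage_1.yaml  # YAML (or inline dict) forwarded to run_proc_chain

n_workers: 4                          # Dask workers
max_bad_channels: 20                  # bound stage-1 work
max_chunks: 500
chunk_seconds: 30                     # sequence length for downstream models

preproc_config:
  normalize: true
  clip: 20
  mulaw_quantize:
    n_bins: 256                       # codebook size reused downstream (assumed sub-key)
  random_sign_flip: false

save_info: true                       # save sfreq/ch_names/pos_2d alongside .npy chunks
```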
`run.py` loads a YAML config, resolves model/dataset/loss classes, and dispatches to the requested mode.
```bash
python run.py --mode train --args configs/gpt2meg/train.yaml
python run.py --mode test --args configs/gpt2meg/train.yaml
python run.py --mode eval --args configs/gpt2meg/eval.yaml
python run.py --mode tokenizer --args <tokenizer_yaml>
python run.py --mode tokenizer-text --args <text_tokenizer_yaml>
```

- Composition: the top-level YAML (`configs/*/train.yaml`) references a `model_config` (e.g., `configs/gpt2meg/model.yaml`) that defines architecture + tokenizer settings. Fields in the top-level file override/extend the nested model file when merged (a condensed sketch follows this list).
- Datasets: `datasplitter` chooses a dataset class (`OmegaDataset`, `QuantizedOmegaDataset`, `MousDataset`, etc.) plus paths and sizing (`dataset_root`, `example_seconds`, `step_seconds`, `val_ratio`, `example_overlap_seconds`). `dataset_kwargs` configures quantization bins, positional encodings, and conditioning labels.
- Dataloaders: `dataloader` configures batch size, workers, and persistence. For tokenizer training, use the `text_dataloader` and `tokenizer_data` blocks to specify text corpora or image shards.
- Models/losses: `model_name` and `loss_name` are strings resolved by `training/utils.py`. `loss` holds task-specific knobs (e.g., `label_smoothing`, `alpha_l1`, KL weights). `model_config` adds architecture depth/width, attention/windowing, quantizer heads, and tokenizer checkpoints.
- Optim/Lightning: `lightning` carries the optimizer choice (`optimizer`, `lr`, `weight_decay`, betas), scheduler (`lr_scheduler`, `warmup_steps`), AMP/compile toggles, gradient clipping, and logging frequency.
- Trainer: `trainer` mirrors PyTorch Lightning arguments (accelerator, devices, precision, `max_epochs`, `max_steps`, `accumulate_grad_batches`, checkpointing/logging cadence).
- Saving/resume: `save_dir`, `resume_from`, and `ckpt_path` control where checkpoints and logs land. Use `version` to segregate runs under the same save directory.
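The condensed sketch below shows how those blocks might sit in a top-level train YAML. Block names follow the bullets above; the concrete values, the dataset-class key inside `datasplitter`, and the contents of `dataset_kwargs` are assumptions for illustration only.

```yaml
# Hypothetical top-level train config sketch; block names follow the README,
# values and some sub-keys are assumptions.
model_name: GPT2MEG
loss_name: cross_entropy                  # assumed loss identifier string
model_config: configs/gpt2meg/model.yaml  # nested architecture + tokenizer settings

datasplitter:
  dataset: QuantizedOmegaDataset          # assumed key for the dataset class
  dataset_root: preprocessed/
  example_seconds: 10
  step_seconds: 5
  example_overlap_seconds: 0
  val_ratio: 0.1
  dataset_kwargs:
    n_bins: 256                           # align with preproc_config.mulaw_quantize

dataloader:
  batch_size: 32
  num_workers: 8
  persistent_workers: true

loss:
  label_smoothing: 0.1

lightning:
  optimizer: adamw
  lr: 3.0e-4
  weight_decay: 0.01
  lr_scheduler: cosine
  warmup_steps: 1000

trainer:
  accelerator: gpu
  devices: 1
  precision: bf16-mixed
  max_epochs: 100
  accumulate_grad_batches: 1

save_dir: runs/gpt2meg
version: v0
resume_from: null
```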
- `eval_class` chooses the evaluator (`EvalQuant`, `EvalDiffusion`, `EvalFlow`, `EvalCont`, `EvalVQ`, `EvalFlat`, `EvalText`).
- The `eval` block configures checkpoint selection (`ckpt_path`, `version`, `step`, `best`), data sizing (`max_batches`, `num_examples`), and generation settings (`gen_sampling`, `temperature`, `top_k`, `top_p`, `future_steps`, `gen_seconds`).
- `plot`/`sample` options enable PSD plots, spectrograms, and sample dumping.
- `shape_hints` can enforce target sequence length or channels for diffusion/flow/image evaluators.
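For orientation, an eval config built from those keys might look like the sketch below. The top-level keys come from the bullets above, while the values and the internal layout of `plot`, `sample`, and `shape_hints` are assumptions.

```yaml
# Hypothetical eval config sketch; key names follow the README,
# values and sub-structure are assumptions.
eval_class: EvalQuant

eval:
  ckpt_path: runs/gpt2meg/v0/checkpoints/last.ckpt
  best: true                # or select by version/step
  max_batches: 50
  num_examples: 8
  gen_sampling: top_k
  temperature: 1.0
  top_k: 50
  future_steps: 200
  gen_seconds: 5

plot:
  psd: true                 # assumed flags for PSD/spectrogram plots
  spectrogram: true
sample:
  save: true                # assumed flag for dumping generated samples

shape_hints:
  seq_len: 2048             # assumed keys for diffusion/flow/image evaluators
  channels: 272
```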
`evals.py` is a thin wrapper around `ephys_gpt.training.eval_runner.EvaluationRunner`, useful for periodic checkpoint checks.
```bash
# YAML config
python evals.py --args configs/gpt2meg/eval.yaml

# Inline JSON config
python evals.py --dict --args '{"save_dir": "...", "eval_runner": {"ckpt_path": "..."}, ...}'
```

`eval_runner` options support selecting a checkpoint (`ckpt_path`), version/step filtering, `max_batches`, `num_examples`, and optional generation settings under `eval_runner.generate` (strategy, temperature, top-k/p). The runner rebuilds the validation dataloader from the provided datasplitter/dataloader configuration before scoring.
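A YAML equivalent of the inline JSON above could look like the sketch below. `ckpt_path`, version/step filtering, `max_batches`, `num_examples`, and the `generate` block come from the option list; the exact key names under `generate` (e.g., `strategy`) are assumptions.

```yaml
# Hypothetical eval_runner config sketch; option names follow the README,
# key names under `generate` are assumptions.
save_dir: runs/gpt2meg
eval_runner:
  ckpt_path: runs/gpt2meg/v0/checkpoints/step_50000.ckpt
  version: v0               # optional version filtering
  step: 50000               # optional step filtering
  max_batches: 20
  num_examples: 4
  generate:
    strategy: top_p         # assumed key name for the sampling strategy
    temperature: 0.9
    top_p: 0.95
```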
- Start from the closest example under `configs/` (e.g., `configs/gpt2meg/train.yaml` for Ωmega AR training, `configs/cnnlstm/train.yaml` for LibriBrain classification, `configs/local/preprocess.yaml` for preprocessing). Copy and adjust paths rather than editing the originals.
- Keep architecture in `model_config` and data/training knobs in the top-level file. Shared defaults can be placed in a base YAML and extended via YAML anchors/`include` if desired (see the sketch after this list).
- Quantized pipelines: set `datasplitter.dataset_root` to the `preprocessed/` outputs and align tokenizer vocab/codebook sizes between `preproc_config`, `dataset_kwargs`, and the model config.
- Tokenizer training: use `--mode tokenizer`/`--mode tokenizer-text` with configs under `configs/tokenizers/`. These mirror the train configs but swap in `ExperimentTokenizer`/`ExperimentTokenizerText` and text/image dataset specs.
- Remote runs: adjust `trainer.accelerator`/`devices` for multi-GPU and ensure `save_dir` points to a shared filesystem; Lightning loggers are wired in `training/logging`.
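If you take the shared-defaults route, plain YAML anchors are one way to express it. The sketch below is generic YAML with illustrative keys; whether `<<` merge keys or an `include` mechanism are honored depends on the project's YAML loader.

```yaml
# Generic YAML-anchor sketch for shared defaults; keys are illustrative.
_trainer_defaults: &trainer_defaults
  accelerator: gpu
  devices: 1
  precision: bf16-mixed

trainer:
  <<: *trainer_defaults     # merge the anchored defaults...
  max_epochs: 200           # ...then override per-experiment fields
```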
- Preprocessing: dataset wrappers (`Omega`, `MOUS`, `MOUSConditioned`) orchestrating OSL-ephys and custom transforms.
- Datasets: chunked Ωmega loaders (continuous/quantized/image), LibriBrain grouped classification helpers, text loaders, and a `MixupDataLoader`.
- Models: GPT2MEG/STGPT2MEG/VQGPT2MEG, Wavenet variants, CNN/LSTM hybrids, diffusion (`NTD`), flow/image models (`MEGFormer`, `ChronoFlowSSM`), tokenizers (`VideoGPTTokenizer`, `Emu3VisionVQ`, `BrainOmniTokenizer`), and research attention stacks (LITRA/TACA/CK3D/TASA3D/LatteAR).
- Layers/MDL: reusable attention/convolution blocks plus model wrappers in `ephys_gpt.layers` and `ephys_gpt.mdl` (e.g., `Perceiver`, `LinearAttention`, `S4`, `CKConv`, `MLPMixer`).
- Losses: task-specific objectives in `ephys_gpt.losses` (e.g., contrastive, reconstruction, classification, multi-task losses).
- Training: `ExperimentDL`, `ExperimentTokenizer`, and `ExperimentTokenizerText` wire configs into Lightning modules, optimizers, schedulers, and loggers; `eval_runner` and evaluation classes provide reusable scoring/generation hooks.
- Eval: task-specific evaluators (`EvalQuant`, `EvalDiffusion`, `EvalFlow`, `EvalCont`, `EvalVQ`, `EvalFlat`, `EvalText`) that share dataset splitting logic and plotting/sampling utilities.
- Logging: wrappers for WandB/CSV/TensorBoard in `ephys_gpt.logging` plus experiment helpers under `scripts/`.
- Utils: quantizers (µ-law, residual VQ helpers), PSD/cov plotting, sampling helpers, YAML loaders, and gradient-causality assertions for tests.
- Notebooks/Scripts: exploratory notebooks and utility scripts (e.g., dataset inspection, checkpoint export) under `notebooks/` and `scripts/`.
Run the lightweight test suite with:
```bash
pytest -q
```

The suite checks dataset shapes/shift semantics/image mapping, forward passes across key models (GPT2MEG, STGPT2MEG, MEGFormer, BENDR, NTD, Wavenet, tokenizers, etc.), and causality/gradient safety via utilities like `tests/utils.assert_future_grad_zero`. Tests use synthetic CPU inputs by default.
- Open issues or discussions before large changes; small fixes/docs are welcome via PR.
- Keep README/config examples synchronized with new pipelines or arguments.
- Ensure new models/datasets include minimal tests and synthetic-forward coverage. Prefer CPU-friendly fixtures in `tests/`.
- Run `pytest -q` (and tokenizer-specific smoke tests if you add a tokenizer) before submitting.