Guglielmo Camporese · Practical Notes on Training Neural Networks

A living collection of practical notes on training neural networks — things that have saved time, things that go wrong, and things worth keeping in mind.

Architecture

Backbones. ResNet and its variants (ResNeXt, WideResNet, ResNeSt) are still reliable defaults for vision tasks, especially with ImageNet-pretrained weights. Don't reinvent the backbone unless you have a strong reason.

Bias before BatchNorm. Set bias=False on any layer immediately followed by BatchNorm. BN subtracts the batch mean, which cancels any constant bias, and its own learnable shift already plays that role, so the extra bias only adds redundant parameters.
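
The cancellation can be checked numerically. Below is a minimal standard-library sketch, not a PyTorch implementation: batch_norm is a hypothetical helper that normalises a batch of scalar activations, and the activation and bias values are made up for illustration.

```python
import statistics


def batch_norm(values, eps=1e-5):
    """Normalise a batch of scalar activations to zero mean and unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical pre-BatchNorm activations of one channel, and a hypothetical bias.
pre_activations = [0.5, -1.2, 3.0, 0.7]
bias = 2.5

without_bias = batch_norm(pre_activations)
with_bias = batch_norm([v + bias for v in pre_activations])

# The mean subtraction removes the constant offset, so both outputs agree
# up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff)
```

In PyTorch terms this is why a layer such as nn.Conv2d or nn.Linear placed directly before a BatchNorm layer can be built with bias=False.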

Training Loop

train() / eval() modes. Call model.train() at the start of every training epoch and model.eval() before validation or inference. Forgetting either switch changes the behaviour of BatchNorm (batch statistics in train mode, running statistics in eval mode) and Dropout (active in train mode, bypassed in eval mode), producing wrong results that are subtle and hard to debug.
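
A minimal sketch of the mode switch, assuming PyTorch is installed. The toy model and input are hypothetical and only there to make the Dropout behaviour visible.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model: Dropout is the layer whose behaviour depends on the mode.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

model.train()  # Dropout active; BatchNorm layers (if any) would use batch statistics
train_out = model(x)

model.eval()  # Dropout bypassed; BatchNorm layers (if any) would use running statistics
with torch.no_grad():  # eval() does not disable gradient tracking, so this is separate
    eval_out_a = model(x)
    eval_out_b = model(x)

# In eval mode the forward pass is deterministic for a fixed input.
eval_is_deterministic = torch.equal(eval_out_a, eval_out_b)
print(model.training, eval_is_deterministic)
```

When validation runs inside the epoch loop, the model.train() call belongs at the top of each epoch so that the previous model.eval() does not leak into training.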

Optimizer. Adam is a safe default. For fine-tuning or when you want better generalisation, AdamW (Adam with decoupled weight decay) is usually a better choice. [paper]
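
To make "decoupled" concrete, here is a scalar, standard-library sketch of a single update step. adam_step is a hypothetical helper, not a library API; the decoupled branch follows the AdamW idea of shrinking the weight directly instead of adding the penalty to the gradient.

```python
def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update on a scalar weight, with coupled or decoupled weight decay."""
    if weight_decay and not decoupled:
        # Coupled (L2-style): the penalty is folded into the gradient and is
        # then rescaled by Adam's adaptive denominator like any other gradient.
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    if weight_decay and decoupled:
        # Decoupled (AdamW-style): shrink the weight directly.
        w = w - lr * weight_decay * w
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v


# Zero loss gradient isolates the effect of the decay term.
w_adam, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                         weight_decay=0.1, decoupled=False)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                          weight_decay=0.1, decoupled=True)
print(w_adam, w_adamw)
```

With a zero loss gradient, the coupled step moves the weight by roughly lr (to about 0.9) because the adaptive scaling normalises the decay term, while the decoupled step shrinks it by lr × weight_decay (to 0.99).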

Learning rate schedule. ReduceLROnPlateau is robust and requires little tuning — set patience generously relative to your total epoch budget. Cosine annealing is a good alternative if you know your total training length in advance.
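
For reference, the cosine schedule is short enough to write by hand. This is a standard-library sketch of the usual cosine annealing formula; the function name, lr_max, lr_min and the 10-epoch budget are illustrative assumptions.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min over total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical budget: 10 epochs, decaying from 0.1 down to 0.001.
schedule = [cosine_annealing_lr(epoch, 10, lr_max=0.1, lr_min=0.001)
            for epoch in range(11)]
for epoch, lr in enumerate(schedule):
    print(epoch, round(lr, 5))
```

The schedule needs total_steps up front, which is why it only fits when the training length is known in advance.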

Early stopping. Track the validation metric and keep the checkpoint with the best value (highest accuracy, lowest loss, and so on) rather than the one from the final epoch. This is free regularisation.
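
A small standard-library sketch of the bookkeeping, assuming a metric where higher is better. BestCheckpoint is a hypothetical helper, the validation accuracies are made up, and the dict stands in for a real model state.

```python
import copy


class BestCheckpoint:
    """Remember the model state with the highest validation metric seen so far."""

    def __init__(self):
        self.best_metric = float("-inf")
        self.best_epoch = None
        self.best_state = None

    def update(self, epoch, metric, state):
        """Store a copy of `state` if `metric` improves on the best so far."""
        if metric > self.best_metric:
            self.best_metric = metric
            self.best_epoch = epoch
            # Copy, so that later training steps cannot modify the snapshot.
            self.best_state = copy.deepcopy(state)
            return True
        return False


# Made-up validation accuracies for five epochs.
val_accuracy = [0.61, 0.70, 0.74, 0.73, 0.72]

tracker = BestCheckpoint()
for epoch, acc in enumerate(val_accuracy):
    state = {"epoch": epoch, "weights": [0.1 * epoch]}  # stand-in for a model state
    tracker.update(epoch, acc, state)

print(tracker.best_epoch, tracker.best_metric)
```

For a metric where lower is better, such as validation loss, the comparison and the initial value would be flipped.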

Augmentation

MixUp. Simple and broadly effective for classification. Mix two training samples (both inputs and labels) with a Beta-distributed coefficient. [paper]
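
The mixing step can be sketched with the standard library alone. The mixup function below is a hypothetical helper working on plain lists; alpha=0.2 is an illustrative value, not a recommendation from these notes, and labels are assumed to be one-hot vectors.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # seeded only to make the example repeatable
x_mixed, y_mixed, lam = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],  # sample A with one-hot label for class 0
    [0.0, 4.0, 2.0], [0.0, 1.0],  # sample B with one-hot label for class 1
    rng=rng,
)
print(lam, x_mixed, y_mixed)
```

The mixed label is a soft target, so the loss has to accept probability vectors rather than class indices.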

SpecAugment. For audio/spectrogram tasks: mask random frequency bands and time steps. Surprisingly effective and cheap. [paper]
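
A standard-library sketch of the two masking operations on a spectrogram stored as a list of rows (frequency bins) by columns (time steps). spec_augment is a hypothetical helper, the mask widths are illustrative, and masked cells are simply set to zero.

```python
import random


def spec_augment(spec, max_freq_width=2, max_time_width=3, rng=random):
    """Return a copy of `spec` with one random frequency band and one time band zeroed.

    Assumes max_freq_width <= number of rows and max_time_width <= number of columns.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: zero out `f` consecutive rows starting at row f0.
    f = rng.randint(0, max_freq_width)
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: zero out `t` consecutive columns starting at column t0.
    t = rng.randint(0, max_time_width)
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0
    return out


spectrogram = [[1.0] * 8 for _ in range(6)]  # toy 6 x 8 spectrogram of ones
augmented = spec_augment(spectrogram, rng=random.Random(0))
for row in augmented:
    print(row)
```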

Test-time augmentation (TTA). At inference, run several augmented versions of the same input through the model and average the predictions. Costs N× inference time for N augmentations, but often gives an accuracy bump without any retraining.
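
The averaging itself is a few lines. In this standard-library sketch, toy_model, identity and horizontal_flip are hypothetical stand-ins for a real classifier and real augmentations; only predict_with_tta carries the idea.

```python
import math


def predict_with_tta(model, x, augmentations):
    """Average the model's class probabilities over several augmented views of x."""
    predictions = [model(augment(x)) for augment in augmentations]
    n_views = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / n_views for k in range(n_classes)]


def toy_model(x):
    """Hypothetical binary classifier returning [p, 1 - p]."""
    weights = [0.5, -0.2, 0.1, 0.3]
    score = sum(w * v for w, v in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-score))
    return [p, 1.0 - p]


def identity(x):
    return list(x)


def horizontal_flip(x):
    return list(reversed(x))


sample = [3.0, 1.0, 1.0, 1.0]
tta_probs = predict_with_tta(toy_model, sample, [identity, horizontal_flip])
print(tta_probs)
```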

Evaluation & Ensembling

K-Fold cross-validation. K=5 is a good default. More reliable than a single train/val split, especially on smaller datasets.
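
A standard-library sketch of how the K splits can be generated from sample indices. k_fold_indices is a hypothetical helper; the dataset size of 23 and the seed are arbitrary.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


splits = list(k_fold_indices(23, k=5))
for fold_id, (train_idx, val_idx) in enumerate(splits):
    print(fold_id, len(train_idx), len(val_idx))
```

Each sample lands in the validation set exactly once, so the K validation scores together cover the whole dataset.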

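Stochastic Weight Averaging (SWA). Average the weights of checkpoints collected along the training trajectory, typically during the last part of training, rather than taking the final checkpoint. Often improves generalisation at almost no extra training cost. If the model uses BatchNorm, recompute the running statistics for the averaged weights before evaluating. [paper]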
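
The averaging can be kept as a running mean so that only one extra copy of the weights is stored. This is a standard-library sketch with weights held as plain lists; update_swa and the three checkpoints are hypothetical.

```python
def update_swa(swa_weights, new_weights, n_averaged):
    """Fold one more checkpoint into a running average of the weights."""
    if swa_weights is None:
        return {name: list(values) for name, values in new_weights.items()}, 1
    averaged = {
        name: [s + (w - s) / (n_averaged + 1)
               for s, w in zip(swa_weights[name], new_weights[name])]
        for name in swa_weights
    }
    return averaged, n_averaged + 1


# Hypothetical checkpoints collected late in training.
checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 2.0], "b": [0.3]},
    {"w": [2.0, 5.0], "b": [0.6]},
]

swa_weights, n_averaged = None, 0
for checkpoint in checkpoints:
    swa_weights, n_averaged = update_swa(swa_weights, checkpoint, n_averaged)

print(swa_weights)
```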

Ensemble. The arithmetic average of output probabilities is simple and usually works. The geometric average (equivalent to averaging log-probabilities and renormalising) can be slightly better calibrated. Diversity across ensemble members matters more than individual model strength.
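
Both averaging rules, sketched with the standard library. The function names and the three members' probabilities are made up for illustration; eps is an assumed small constant that guards against log(0).

```python
import math


def arithmetic_ensemble(prob_lists):
    """Average class probabilities across ensemble members."""
    n = len(prob_lists)
    return [sum(p[k] for p in prob_lists) / n for k in range(len(prob_lists[0]))]


def geometric_ensemble(prob_lists, eps=1e-12):
    """Average log-probabilities across members, then renormalise to sum to 1."""
    n = len(prob_lists)
    mean_log = [sum(math.log(p[k] + eps) for p in prob_lists) / n
                for k in range(len(prob_lists[0]))]
    unnormalised = [math.exp(v) for v in mean_log]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Made-up predictions of three ensemble members over three classes.
member_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]

arith = arithmetic_ensemble(member_probs)
geo = geometric_ensemble(member_probs)
print(arith)
print(geo)
```

The geometric mean of probabilities does not sum to 1 on its own, hence the renormalisation step.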

Guglielmo Camporese (gool-yell-moe)

AI Researcher at Disney Research · Zurich

guglielmocamporese [at] gmail [dot] com
