Similar loss, different results

I have trained multiple CNNs for image classification. I suspect there is something wrong with my training pipeline, since many of my experiments get very similar training loss at the end of training, but perform very differently on the test set.

What could be causing this?

Training a neural network is a stochastic process. Unless you set a fixed random seed at the start, your model is initialized with different values on every run, and that leads to a different trained model each time.
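
As a rough illustration of fixing the seed, here is a minimal sketch. The helper names `set_seed` and `init_weights` are made up for this example, and `init_weights` is only a stand-in for real model initialization. The NumPy and PyTorch calls run only if those libraries happen to be installed.

```python
import random


def set_seed(seed):
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch's default generators
    except ImportError:
        pass


def init_weights(n):
    """Hypothetical stand-in for a model's random weight initialization."""
    return [random.uniform(-0.1, 0.1) for _ in range(n)]


set_seed(42)
first = init_weights(5)

set_seed(42)
second = init_weights(5)

# Same seed, same initial weights
print(first == second)  # True
```

Even with a fixed seed, some GPU operations can be non-deterministic, so two runs may still differ slightly. To see how much of your test-set gap is just noise, it can help to train the same configuration with several different seeds and compare the spread.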

