Inference Services with long periods of idle time can automatically be scaled to zero. When scaled down, the Inference Service consumes no resources and incurs no billing. As soon as a new request comes in, a Pod is instantiated and the request is served. For small models this can take as little as five seconds; for larger models, spin-up times can be 30 to 60 seconds. Model loading times depend heavily on the code responsible for loading the model onto the GPU, and many popular PyTorch frameworks are not optimized for fast loading.
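
The sketch below illustrates one way loading code can shorten cold starts. It is a minimal example, not the platform's loading path: it assumes PyTorch >= 2.1, uses a torchvision ResNet-50 as a stand-in for your model, and the checkpoint path `resnet50.pt` is hypothetical. The idea is to skip eager weight initialization (via the `meta` device), memory-map the checkpoint instead of reading it fully into RAM, and assign the mapped tensors directly into the module before moving it to the GPU.

```python
import time

import torch
from torchvision.models import resnet50  # stand-in for your own model class


def load_model_fast(checkpoint_path: str) -> torch.nn.Module:
    """Sketch of a faster cold-start load path (assumes PyTorch >= 2.1)."""
    # Build the module on the "meta" device: no parameter memory is
    # allocated and no random initialization is performed.
    with torch.device("meta"):
        model = resnet50()

    # Memory-map the checkpoint rather than copying it all into RAM first.
    state_dict = torch.load(checkpoint_path, map_location="cpu", mmap=True)

    # assign=True places the mmapped tensors into the module directly
    # instead of copying them into freshly allocated parameters.
    model.load_state_dict(state_dict, assign=True)

    # Single host-to-device transfer, then switch to inference mode.
    return model.to("cuda").eval()


if __name__ == "__main__":
    start = time.perf_counter()
    model = load_model_fast("resnet50.pt")  # hypothetical checkpoint path
    print(f"model ready in {time.perf_counter() - start:.1f}s")
```

Compared with the common pattern of constructing the model on CPU, initializing it, loading the full state dict into memory, and copying tensors twice, this kind of load path can shave several seconds off the Pod's time-to-first-response after a scale-from-zero event.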