A Flood of Droplets

Aug 4, 2015 | Daniel Jones

If you play the dangerous game of scaling Cloud Foundry DEAs up instead of out, make sure you have plenty of CPU headroom to start with. If you don’t, there’s a good chance you’ll see a heap of “no available stagers” errors upon cf push and “insufficient resources to start” errors upon cf start. These will last for an hour before magically fixing themselves.

Why is this? What causes all these errors, and what can be done if one gets it wrong? First, a little background. A PaaS operations team I work with received customer reports that apps could no longer be pushed, citing a “no available stagers” error message.

Upon investigation we realised that we had let our CF usage exceed the load that was reasonable for our DEAs and that they had started running out of ephemeral disk.

No problem - we’ll just use BOSH to give them all another few gigabytes of disk, and all will be well with the world. Right?

Wrong. After the update had finished, our first app push worked. Shortly afterwards our continuous acceptance tests tried pushing apps, and received “no available stagers”. Wasn’t that the problem we were trying to fix? Had BOSH not updated the DEAs properly? Was there some lag in the DEAs reporting their available resources? Soon, other continuous acceptance tests started failing when restarting apps, this time with “insufficient resources to start”.

All of the above happened on a non-production environment towards the end of the working day. We agreed to remedy the problem in the morning, but one of the team noticed some time later that the errors had all cleared without any intervention.

The next day we started sleuthing, and found the following:

It was then that we stumbled across an old Cloud Foundry mailing list thread that mentioned that the DEA retains crashed apps for a configurable amount of time, defaulting to one hour. The memory used by these retained apps is not subtracted from the available total, but the disk they use is.

This meant crashed apps were taking up 43GiB out of the 48GiB available.
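To make that accounting concrete, here is a minimal sketch in Python of the behaviour the mailing list thread describes. It is not the DEA's actual code: the `Instance` type, the `available` function, the 1 GiB disk quota per app and the 32 GiB memory total are all assumptions chosen for illustration. Only the 48 GiB disk total and the 43 GiB taken by crashed apps come from what we saw.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical model of one app instance held by a DEA."""
    name: str
    memory_mib: int
    disk_mib: int
    state: str  # "RUNNING" or "CRASHED"


def available(total_memory_mib, total_disk_mib, instances):
    """Return (memory, disk) still advertised as free, in MiB.

    Mirrors the behaviour described above: retained crashed instances
    stop counting against memory but keep counting against disk.
    """
    memory_used = sum(i.memory_mib for i in instances if i.state == "RUNNING")
    disk_used = sum(i.disk_mib for i in instances)  # crashed instances included
    return total_memory_mib - memory_used, total_disk_mib - disk_used


# Assumed numbers: 43 crashed instances, each with a 1 GiB disk quota.
crashed = [Instance(f"app-{n}", 512, 1024, "CRASHED") for n in range(43)]
free_memory, free_disk = available(32 * 1024, 48 * 1024, crashed)
print(free_memory, free_disk)  # 32768 5120
```

Under these assumptions the DEA still advertises all of its memory but only 5 GiB of disk, which would be enough to turn away most new work.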

One of the team suggested that perhaps BOSH had not shut down the DEAs cleanly, and that all the apps had been hard-stopped, somehow marking them as CRASHED. This didn’t square with the way BOSH drain scripts behave: BOSH will wait indefinitely if a drain script tells it to, and the DEA drain script does indeed tell BOSH to come back later if it hasn’t cleanly shut down all apps.
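For anyone unfamiliar with the mechanism: a BOSH drain script replies by printing an integer to stdout, and a negative value asks BOSH to wait that many seconds and then call the script again. The sketch below illustrates only that contract. It is not the DEA's real drain script, and the `drain_reply` function and the five-second retry interval are made up for the example.

```python
def drain_reply(apps_still_running: int, retry_seconds: int = 5) -> str:
    """Return the value a drain script would print to stdout.

    A negative number means "not finished, ask me again in this many
    seconds"; zero means "safe to stop the job now".
    """
    if apps_still_running > 0:
        return f"-{retry_seconds}"
    return "0"


print(drain_reply(3))  # -5
print(drain_reply(0))  # 0
```

As long as the script keeps answering with a negative number, BOSH keeps coming back, which is why a hard stop during the update looked so unlikely.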

Whilst looking for hints in the crash logs, an ls in /var/vcap/data/dea_next/crashes yielded a clue: all the app crash directories had been created within a few minutes of each other, and all of them after the DEA had started up.
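We did this check by eye, but it is easy to script. The following is a rough sketch, assuming each crash lives in its own subdirectory and that the directory's modification time is a good enough stand-in for when the crash happened. The function names are ours, not part of any Cloud Foundry tooling.

```python
import os
from datetime import datetime, timezone


def crash_dir_times(crashes_path):
    """Return (name, mtime) pairs for each crash directory, oldest first."""
    entries = []
    with os.scandir(crashes_path) as it:
        for entry in it:
            if entry.is_dir():
                mtime = datetime.fromtimestamp(entry.stat().st_mtime, tz=timezone.utc)
                entries.append((entry.name, mtime))
    return sorted(entries, key=lambda pair: pair[1])


def spread_seconds(entries):
    """Seconds between the oldest and newest crash directory."""
    if not entries:
        return 0.0
    return (entries[-1][1] - entries[0][1]).total_seconds()
```

Pointing `crash_dir_times` at the crashes directory and comparing the oldest timestamp with the DEA's start time would show the same pattern: a tight cluster of crashes just after start-up.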

So what had happened?

  1. BOSH executes the drain script of the first DEA in order to scale up its ephemeral disk
  2. The drain script evacuates all apps
  3. Cloud Foundry starts those on other DEAs (can you see where this is heading?)
  4. DEA comes back online with bigger disk
  5. BOSH calls drain script on the next DEA
  6. Our first DEA now has to try and start 43 apps all at once, within 180 seconds
  7. This fails miserably, marking all the apps as crashed, reserving vast amounts of the available disk
  8. Repeat for all DEAs until you have a Cloud Foundry that can’t push or start apps
  9. An hour later the DEAs reap all the crashed apps, and good times resume
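The sequence above can be played out in a toy simulation. This is a deliberately crude model, not how Cloud Foundry actually places apps. It assumes evacuated apps always go to whichever other DEA currently holds the fewest instances (crashed ones included, since they still reserve disk), that a DEA receiving more than `max_concurrent_starts` apps in one burst fails to start any of them in time, and that crashed apps are not rescheduled elsewhere. The function name and all the numbers are invented for illustration.

```python
def rolling_update(dea_count, apps_per_dea, max_concurrent_starts):
    """Simulate BOSH draining each DEA in turn.

    Returns (running, crashed): per-DEA instance counts after the update.
    """
    running = [apps_per_dea] * dea_count
    crashed = [0] * dea_count

    for draining in range(dea_count):
        # The drain script evacuates every app from this DEA.
        evacuees = running[draining]
        running[draining] = 0
        others = [i for i in range(dea_count) if i != draining]
        received = [0] * dea_count

        # Assumed placement rule: each app goes to the least loaded other DEA.
        for _ in range(evacuees):
            target = min(others, key=lambda i: running[i] + crashed[i] + received[i])
            received[target] += 1

        # A DEA flooded beyond what it can start in time crashes the whole burst.
        for i in others:
            if received[i] > max_concurrent_starts:
                crashed[i] += received[i]
            else:
                running[i] += received[i]

    return running, crashed


print(rolling_update(dea_count=3, apps_per_dea=4, max_concurrent_starts=3))
# ([0, 0, 0], [6, 6, 0])
```

With three DEAs of four apps each, the first evacuation spreads out harmlessly, but every later one lands entirely on the freshly updated, empty DEA and crashes there. Raising `max_concurrent_starts` far enough (in other words, giving the DEAs enough CPU headroom) lets the same update finish with nothing crashed.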

The real solution? Scale out, and not up.

Sadly, in this case we couldn’t scale out because there were concerns about increasing the number of DEAs without being able to tie them to individual hosts - the thinking being that the more DEAs there are, the more likely it is that all of any app’s instances will end up on the same physical host.
