Last year, during the networking after one of the Berlin AWS meetup sessions, which I often attend, @freenerd from Mapbox mentioned the spot market: how much cheaper it is for them to run instances there, but also that their instances were sometimes terminated in the middle of the batch processing job that prepares the map for the entire world.
A few weeks later, at another session of the AWS meetup, I took part in a similar discussion where someone mentioned the possibility of attaching instances to an on-demand AutoScaling group, a feature AWS had just released at that time. I don’t remember if spot was mentioned in the same discussion or if it was all in my mind, but somehow these concepts got connected and I thought this was a nice problem to hack on.
I thought about the problem for a while, and after a couple of weeks I came up with an algorithm based on the instance attach/detach mechanism supported by AutoScaling. I tested it manually and quickly confirmed that AutoScaling happily allows attaching spot instances and detaching on-demand ones while keeping the capacity constant. I also noticed that it often tries to rebalance the availability zones. To stop that from interfering with the automation, the trick is to keep the group more or less balanced across availability zones, so that AutoScaling has no reason to rebalance it.
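To illustrate the balancing concern, here is a minimal sketch in Go. It is not taken from my prototype: the function names are made up for this example, and it assumes that "balanced" can be approximated by the difference between the most and least populated zones. It only shows that swapping an instance within the same zone leaves the spread unchanged, while a cross-zone swap does not.

```go
package main

import "fmt"

// azSpread returns the difference between the most and least populated
// availability zones, given a map of zone name to instance count.
func azSpread(counts map[string]int) int {
	first := true
	lo, hi := 0, 0
	for _, n := range counts {
		if first {
			lo, hi = n, n
			first = false
			continue
		}
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return hi - lo
}

// afterSwap returns the per-zone counts after detaching one instance from
// detachAZ and attaching one in attachAZ. The input map is not modified.
func afterSwap(counts map[string]int, detachAZ, attachAZ string) map[string]int {
	out := make(map[string]int, len(counts)+1)
	for az, n := range counts {
		out[az] = n
	}
	out[detachAZ]--
	out[attachAZ]++
	return out
}

func main() {
	// Hypothetical group with two instances in each of two zones.
	counts := map[string]int{"zone-a": 2, "zone-b": 2}
	same := afterSwap(counts, "zone-a", "zone-a")
	cross := afterSwap(counts, "zone-a", "zone-b")
	fmt.Println("spread after same-zone swap:", azSpread(same))
	fmt.Println("spread after cross-zone swap:", azSpread(cross))
}
```

Under these assumptions, launching the replacement spot instance in the same zone as the on-demand instance it replaces is the simplest way to avoid giving AutoScaling a reason to rebalance.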
I soon started coding a prototype in my spare time, which is actually my first non-trivial program written in a while, and to make it even more interesting, I chose to write it in golang.
After a few weeks of coding, in which I rewrote it at least twice (and even now I’m still nowhere near happy with how it looks), I realized it’s quite a bit harder and more complex than I initially thought. Other things happened and I kind of lost interest, so I stopped working on it and it all got stuck.
A few months later, at the re:invent conference, I attended some talks where I met other folks interested in this problem and saw other ways of attacking it, using multiple AutoScaling groups. That was also when I first got in touch with someone from spotinst.com, who was promoting their solution and handing out business cards.
After re:invent I became a bit more active for a while. I also tried to get some collaborators but failed at it, so I kept working on it in my spare time every now and then and got closer to making it work. Then I recently had a long vacation, and immediately after I returned I attended the Berlin AWS Summit, where I met the SpotInst folks once again. It seems they now have a full-fledged solution for the problem, based on pretty much a reimplementation of AutoScaling, using machine learning and with a beautiful UI, and they are really successful with it. Funnily enough, they even contacted me to sell that solution to my company, and we are seriously evaluating it.
After the Berlin AWS Summit, having my batteries charged, I resumed my work and after a few coding nights I managed to make my prototype work. It took much longer than expected, but at least I got there, yay!
What I have so far
- A CloudFormation template that creates an SNS topic, a Lambda function written in golang (with a small JS wrapper that downloads and runs it) subscribed to that topic, and a few IAM settings to make it all work
- A golang binary, for now closed source, but I’m going to open it up once I get it in a good enough shape so that I’m not ashamed of it and after I get all the approvals from my corporate overlords, who according to my employment contract need to approve the publishing of such non-trivial code
How does it work
The lambda function is executed by a custom CloudFormation resource when creating the CloudFormation stack from the template, and it subscribes to both your topic and a topic that I run, which fires it every 30 minutes, using a scheduled event.
When my scheduled function runs the lambda function, it will concurrently inspect the AutoScaling groups from all the AWS regions and it will ignore all those that are not tagged with the EC2 tags it expects.
The AutoScaling groups marked with the expected tag are processed concurrently. In each of them the on-demand instances are gradually replaced with compatible spot instances, one at a time. Each run either launches a single spot instance, or attaches an already launched spot instance to the AutoScaling group after detaching the on-demand one it is meant to replace. The spot instance is not attached while its uptime is less than the AutoScaling group’s grace period.
The spot instance bid price matches the price of the on-demand instance it is meant to replace. If your spot request is outbid, AutoScaling will handle it as a regular instance failure, and will immediately replace it with an on-demand instance. That instance will later be replaced by the cheapest available compatible spot instance, likely of a different type and with a different spot price.
In practice the group should converge to the most stable instance pricing.
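The bidding logic can be sketched as follows. Treat it as an approximation: the meaning of "compatible" used here (at least as many vCPUs and as much memory as the instance being replaced) is an assumption made for this example, and the type names and prices are made-up placeholders rather than real AWS data.

```go
package main

import "fmt"

// InstanceType holds the figures needed to compare candidates.
// All prices are in USD per hour.
type InstanceType struct {
	Name          string
	VCPUs         int
	MemoryGiB     float64
	OnDemandPrice float64
	SpotPrice     float64 // current spot market price
}

// compatible is an assumed definition: the candidate must offer at least
// the vCPUs and memory of the instance it replaces.
func compatible(candidate, base InstanceType) bool {
	return candidate.VCPUs >= base.VCPUs && candidate.MemoryGiB >= base.MemoryGiB
}

// cheapestSpot returns the cheapest compatible type whose current spot
// price does not exceed the bid, together with the bid itself. The bid is
// the on-demand price of the instance being replaced.
func cheapestSpot(base InstanceType, candidates []InstanceType) (InstanceType, float64, bool) {
	bid := base.OnDemandPrice
	var best InstanceType
	found := false
	for _, c := range candidates {
		if !compatible(c, base) || c.SpotPrice > bid {
			continue
		}
		if !found || c.SpotPrice < best.SpotPrice {
			best, found = c, true
		}
	}
	return best, bid, found
}

func main() {
	// Placeholder types and prices, for illustration only.
	base := InstanceType{Name: "medium", VCPUs: 2, MemoryGiB: 4, OnDemandPrice: 0.10, SpotPrice: 0.12}
	candidates := []InstanceType{
		base,
		{Name: "small", VCPUs: 1, MemoryGiB: 2, OnDemandPrice: 0.05, SpotPrice: 0.01},
		{Name: "large", VCPUs: 4, MemoryGiB: 8, OnDemandPrice: 0.20, SpotPrice: 0.03},
		{Name: "xlarge", VCPUs: 8, MemoryGiB: 16, OnDemandPrice: 0.40, SpotPrice: 0.06},
	}
	if best, bid, ok := cheapestSpot(base, candidates); ok {
		fmt.Printf("launch %s, bidding %.2f\n", best.Name, bid)
	} else {
		fmt.Println("no compatible spot type under the on-demand price")
	}
}
```

In this made-up data the same-type spot price is above the on-demand price, so a larger but currently cheaper type is picked, which is the "likely of a different type" case mentioned above.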
How to use it
All you need to do is set an EC2 tag on the AutoScaling group where you want to test it. Any other AutoScaling groups will be ignored.
The tag should have the following attributes:
Feedback is more than welcome
If you find any bugs or you would like to suggest any improvements, please comment below.
This is experimental, only briefly tested and likely full of bugs, so you should not run it in production, but it should be safe enough for evaluation purposes.
Anyway, use it at your own risk, and don’t hold me responsible for any misuse, bugs or damage this may cause you.