Backup and restore workflows are critical for any production MongoDB cluster. Beyond the core backup and restore functionality, you also have to consider non-functional requirements such as availability of backups, security, recovery time, and recovery granularity. At a high level, you have three options to back up your MongoDB server:
1. Mongodump
2. MongoDB Cloud manager
3. Disk snapshots
Each of these three techniques has its own pros and cons, described in more detail below.
1. Mongodump
Mongodump is the “getting started” backup tool for most MongoDB developers, and it is probably how most teams first back up their database. It is simple to use and dumps all the data in the database in binary format (BSON), which you can then store at a location of your choice.
Pros
1. Simple to use
2. Flexibility in where the backup is stored – once the dump is complete you can move it to any location of your choice – NFS shares, AWS S3, etc.
Cons
1. Full backup, every time – it is a full backup, not a diff from your previous backup. So as your database gets large, the backup can take hours to complete and becomes unwieldy to store.
2. Not point-in-time – a backup created by mongodump is not a point-in-time snapshot by default. So if your data changes during the backup, you can end up with a dump that is inconsistent from an application perspective. You can remedy this with the “--oplog” option, which captures oplog entries written during the dump so that the restore reflects a consistent point in time as of the end of the dump. However, this option is not available for standalone databases, since it requires a replica set oplog.
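As a rough sketch, a mongodump backup and restore of a replica set might look like the following. The host name and backup path are placeholders, and because the commands need a live cluster, the sketch only assembles and prints them:

```shell
#!/bin/sh
# Sketch of a mongodump-based backup; host and paths are placeholders.
BACKUP_DIR="/backups/mongo-$(date +%F)"

# --oplog captures oplog entries written while the dump runs, giving a
# consistent point-in-time backup; it only works against replica sets.
DUMP_CMD="mongodump --host rs0/db1.example.com:27017 --oplog --out $BACKUP_DIR"

# On restore, --oplogReplay replays the captured oplog entries so the
# restored data reflects the moment the dump finished.
RESTORE_CMD="mongorestore --host rs0/db1.example.com:27017 --oplogReplay --dir $BACKUP_DIR"

# Printed only, since running them requires a live mongod.
echo "$DUMP_CMD"
echo "$RESTORE_CMD"
```

Note that the dump directory grows with your full data set on every run, which is exactly the "full backup, every time" drawback described above.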
2. MongoDB Cloud manager
Cloud Manager is a cloud service from the MongoDB team that helps you back up your MongoDB cluster.
Pros
1. Simple to use – install the MongoDB Cloud Manager agent to manage the backup/restore of your cluster. It is a little more complicated than using mongodump, but not by a whole lot.
2. Continuous backup – Cloud Manager continuously queries and backs up your oplog, so you can restore to any point in time instead of only the specific times when a backup was taken. This minimizes your exposure to data loss.
Cons
1. Data control – the backup data is stored in a MongoDB data center outside your control. In some parts of the world (e.g. Europe), and depending on your security needs, this can be a big problem.
2. Extra expense – you pay by the size of the data and the volume of oplog changes. If you have a large database or a high write rate, this cost can add up.
3. Slow restores – to restore your data from Cloud Manager, the data must be physically downloaded from the Cloud Manager data center. If you have a large database, this can be a very time-consuming operation; e.g. if your data is 1 TB, it can take several hours to download before you can use it.
3. Disk snapshots
Snapshots can be taken either at the cloud level (e.g. AWS EBS disk snapshots) or at the OS level (LVM snapshots). LVM snapshots, although convenient, are not easily portable outside the machine, so for the rest of this discussion we will focus on cloud disk snapshots such as AWS EBS snapshots.
Pros
1. Simple and easy to use – it is relatively trivial to trigger a snapshot of an EBS disk
2. Portability – You can move your snapshots to other data centers if you need higher availability for your backups
3. Incremental snapshots – the snapshots are incremental, so they store only the changes from your previous snapshot. This reduces the amount of storage needed by your backups
4. No data copy – there is no bulk data copy involved to restore your data. E.g. if you want to restore a 1 TB snapshot, you can just create a new volume from the snapshot, and this does not result in any up-front data copy. This is a *big deal* when dealing with large amounts of data.
5. Backup control – The backups remain in the same datacenter as your primary data and are secured by the same authentication mechanisms as your primary data servers
Cons
1. Not a continuous backup – it is a point-in-time backup and can only be recovered to the backup points
2. Physical machines – on-premise physical machines cannot be backed up using this technique
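The snapshot and restore steps above can be sketched with the AWS CLI. The volume ID, snapshot ID, and availability zone are placeholders, and since the calls need live AWS credentials, the sketch only assembles and prints the commands:

```shell
#!/bin/sh
# Sketch of an EBS snapshot backup/restore; IDs and zone are placeholders.
VOLUME_ID="vol-0123456789abcdef0"
SNAPSHOT_ID="snap-0123456789abcdef0"
AZ="us-east-1a"

# If the journal lives on a separate volume from the data files, pause
# writes first with db.fsyncLock() in the mongo shell (and db.fsyncUnlock()
# afterwards) so the snapshot is consistent.
SNAP_CMD="aws ec2 create-snapshot --volume-id $VOLUME_ID --description mongodb-backup"

# Restore: create a fresh volume from the snapshot. Blocks are loaded
# lazily, so even a 1 TB restore provisions quickly with no bulk copy.
RESTORE_CMD="aws ec2 create-volume --snapshot-id $SNAPSHOT_ID --availability-zone $AZ"

# Printed only, since running them requires AWS credentials.
echo "$SNAP_CMD"
echo "$RESTORE_CMD"
```

The new volume can then be attached to a replacement instance and mounted in place of the original data directory.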
At the end of the day, if your data is small, all three options will work well. As your data grows, you will have to spend time choosing the option that works best for your scenario.