1. All backups are by research group, not user. Each group needs to nominate a single person to co-ordinate their backups.
2. 10 TB of backup space is available per group, due to limits on available hardware and the need to reserve capacity for future data. We hope to increase this in future.
3. Due to the difficulties in implementing a user-customisable partial-backup system with acceptable performance, backups are on a per-directory basis. It is not possible to selectively back up individual files in directories.
Within each of those specified locations, each directory that you want backed up needs to contain a hidden file named .backup. This file signifies that the directory and everything it contains - including other directories - should be backed up. Adding .backup files to subdirectories of a directory which has already been marked for backup is not merely unnecessary, it is invalid. If further .backup files are found within subdirectories of a directory already marked for backup, the entire backup will not run at all. Backups cannot be nested.
When the backup runs, only locations specified in the configuration file will be searched for .backup files and only those directories which are subsequently found to contain .backup files will be backed up. All other data will be ignored.
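The no-nesting rule can be checked by hand before a backup runs. The following is a minimal sketch, not part of the backup system itself: the function name find_nested_backups is hypothetical, the demonstration tree is created in a temporary directory, and it assumes a find(1) that supports -mindepth (GNU and BSD versions do).

```shell
# Hypothetical helper: list .backup files nested inside an already-marked
# directory. Prints nothing if the layout under "$1" is valid.
find_nested_backups() {
    root=$1
    # Every directory holding a .backup marker...
    find "$root" -type f -name .backup | while IFS= read -r marker; do
        dir=$(dirname "$marker")
        # ...must not hold another marker anywhere below it.
        # (-mindepth 2 skips the directory's own marker.)
        find "$dir" -mindepth 2 -type f -name .backup
    done | sort -u
}

# Demonstration on a throwaway tree.
demo=$(mktemp -d)
mkdir -p "$demo/results/ChipSeq" "$demo/data"
touch "$demo/results/.backup"            # valid: marks results/ and its contents
nested_before=$(find_nested_backups "$demo")
touch "$demo/results/ChipSeq/.backup"    # invalid: nested inside results/
nested_after=$(find_nested_backups "$demo")
echo "before: [$nested_before]"
echo "after:  [$nested_after]"
```

Any path printed by the helper is a .backup file that should be removed before the next run.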
                            /t1-data/
                                |
                       +--------+--------+
                       |                 |
                   /backup/           /user/
                       |                 |
              +--------+        +--------+--------+
              |                 |                 |
           staylor           /nicki/          /dtooke/
              |                                   |
  +----------------------+                 +------+------+
  |/t1-data/user/dtooke  |                 |             |
  |/t1-data/user/smcgowan|              /data/       /results/
  |/t1-data/user/macmahon|                               |
  +----------------------+                      +--------+--------+
                                                |                 |
                                             .backup          /ChipSeq/
The backups for this example will run in the following manner.
- Based on the configuration file, only the directories for dtooke/, smcgowan/ and macmahon/ are searched
- The directory for nicki/ will not be searched and is not backed up even if it contains a .backup file
- The directory results/ within the directory dtooke/ contains a .backup file, and is backed up along with the ChipSeq/ directory inside it
- The ChipSeq/ directory must not contain a .backup file - it is already being backed up and attempting to manually specify nested backups is invalid
- The data/ directory does not contain a .backup file and is not backed up even though it has been searched; however, if a .backup file is subsequently added it will be found by the search and the directory backed up the next time the backup runs.
When editing the configuration file, please remember that this is a list of paths to search and not a list of directories to back up. You must still create .backup files in the relevant locations or your data will not be backed up.
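To make the difference between search paths and marked directories concrete, here is a small sketch that rebuilds part of the example in a temporary directory and lists what would be picked up. It only imitates the behaviour described above: the function list_marked_dirs is hypothetical, and skipping search paths that do not exist is a choice made for this demonstration rather than documented behaviour of the real backup.

```shell
# Throwaway copy of the example layout (the real tree lives under /t1-data).
root=$(mktemp -d)
mkdir -p "$root/backup" "$root/user/nicki" \
         "$root/user/dtooke/data" "$root/user/dtooke/results/ChipSeq"
touch "$root/user/dtooke/results/.backup"   # marks results/ and ChipSeq/ inside it
touch "$root/user/nicki/.backup"            # ignored: nicki/ is not a search path

# Configuration file: one full path per line, nothing else.
config="$root/backup/staylor"
printf '%s\n' "$root/user/dtooke" "$root/user/smcgowan" "$root/user/macmahon" > "$config"

# Hypothetical helper: print each directory under the configured search
# paths that contains a .backup file.
list_marked_dirs() {
    while IFS= read -r search_path; do
        [ -d "$search_path" ] || continue   # demo only: skip paths not created here
        find "$search_path" -type f -name .backup -exec dirname {} \;
    done < "$1"
}

marked=$(list_marked_dirs "$config")
echo "$marked"
```

Only dtooke/results is listed. The .backup file in nicki/ has no effect because nicki/ is not in the configuration, and data/ is searched but not listed because it has no .backup file.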
- The complexity of your backup structure
- The total number of files
- The total size of the backup
Ensuring that your backup structure is simple and effective will reduce the likelihood of problems. Reducing the number of files and/or the total size will both make your backups faster and enable you to store more data. Here are some tips for getting successful backups:
- Don’t add data to a backup and then change your mind and remove it for next time. Your next backup may well run, but eventually you’ll go over your quota on the back-end storage. At that point, we may have no way to untangle data that you no longer want backed up from data that you do. In that scenario we may have to delete your entire backup and let it recreate from scratch on the next run, which will be slow and leave you without a backup until it’s finished.
- Be careful about reorganising your data into a new structure to simplify the backups, unless you’re sure you know what you’re doing. If you have symbolic links you may break them and have difficulty fixing them.
- Add data to your backups incrementally. Backups run every night, unless a previous backup is still running. Backing up a small amount to start with, then slowly adding more over the course of several days, is a good way to avoid blocking errors or overshooting your quota.
- If possible, back up a few large directories rather than many small, scattered directories. This is far less likely to result in problems later, easier to manage and easier to fix if you get it wrong.
- Use du(1) to understand where your data is. Bear in mind that you can only calculate the size of data which you have read access to.
- Keep your configuration file simple. Don’t add a directory to search if you’re not sure it contains anything that will need to be backed up - you can always add it later. Also, making the search location more specific helps - /t1-data/project/taylorlab/duncan/ will take longer to search than /t1-data/project/taylorlab/duncan/results/ - but don’t make the configuration complicated in doing so.
- If you have data or output files which you will not need again, delete them. Be absolutely sure before you do this as there is no default backup and we cannot recover your data if you make a mistake.
- Use gzip(1) to compress old files. Be sure to test this on one file before compressing a large number, as some file types are already compressed or do not compress well. A good starting point is any plain text files (such as job output files) and any .fastq and .fq files (.fastq.gz and .fq.gz files are already compressed).
- Avoid backing up large numbers of small files, such as log files from jobs, .jpeg images or software directories. If you are an experienced Linux user, use tar(1) to create an archive containing many smaller files and/or directories, and consider compressing this as well.
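The du, gzip and tar tips above can be tried out safely on sample data first. This is a minimal sketch run in a temporary directory; the file names and contents are made up for illustration, and the options used (du -k, gzip -c, tar -czf, tar -tzf) are standard in both GNU and BSD tools.

```shell
work=$(mktemp -d)
mkdir -p "$work/logs"

# Sample data: a repetitive plain-text file and a handful of small job logs.
i=0
while [ "$i" -lt 2000 ]; do
    echo "read $i ACGTACGTACGTACGT"
    i=$((i + 1))
done > "$work/sample.txt"
for n in 1 2 3 4 5; do
    echo "job $n finished" > "$work/logs/job$n.log"
done

# 1. See where the space goes (sizes in KB, one line per directory).
du -k "$work"

# 2. Test compression on ONE file first. -c writes to stdout, so the
#    original is left untouched while you compare sizes.
gzip -c "$work/sample.txt" > "$work/sample.txt.gz"
orig_bytes=$(( $(wc -c < "$work/sample.txt") ))
gz_bytes=$(( $(wc -c < "$work/sample.txt.gz") ))
echo "sample.txt: $orig_bytes bytes -> $gz_bytes bytes compressed"

# 3. Bundle many small files into one compressed archive, then list its
#    contents to verify it before removing anything.
tar -czf "$work/logs.tar.gz" -C "$work" logs
archived=$(tar -tzf "$work/logs.tar.gz" | grep -c '\.log$')
echo "archive holds $archived log files"
```

If the compressed size is not noticeably smaller than the original, the file type is probably already compressed and is not worth running through gzip.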
1. Decide which member of your group will be the point of contact for your backups. It will help if this person has a solid understanding of your bioinformatics and of Linux.
2. E-mail email@example.com and provide the contact person’s details. This can only be done by the PI.
3. We will let you know your quota.
4. Decide which data you want to be backed up for your group. Please note that only data in collaboration projects can be backed up. Consider leaving some space in your quota for future data.
5. Check the size of the data you’ve chosen using du - note that you will need read permission on the data. Be sure that the total is under your quota.
6. Let us know that you’re ready for your first backup. We will create an empty configuration file for your group.
7. Based on the data you’ve chosen, list the paths that need to be searched for backups in the configuration file. Note that each line must contain the full path to a directory, and that each directory must be on a separate line. No other text is allowed, not even blank lines.
8. Make sure that the directories you’ve chosen to be backed up contain a .backup file - the command touch(1) will help here.
9. Check that your configuration is working correctly by running the command:
$ t1-backup-du path_to_your_configuration_file
Note that this will only give the correct result if you have read access to all the data.
10. If you have too much data, check the output carefully and make the necessary changes to your proposed backup scheme.
11. Let us know that you’ve successfully configured your backups. We will set them up to run for you.
12. At this point, you’ll be all set up. You will receive an e-mail with your backup results each time the backup runs. If there’s a problem, let us know and we’ll help you to fix it. If all went well, the e-mail will look something like this:
--- Backup of /t1-data for staylor group ---
Quota 1073741824 KB, found 194 KB marked for backup. OK.
Starting Fri 27 Nov 08:22:41 GMT 2020
Only new and changed files will be transferred
sending incremental file list
Done Fri 27 Nov 08:22:42 GMT 2020
--- Backup completed successfully ---
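Before running t1-backup-du in step 9, it can help to check the configuration file against the format rules in step 7. The sketch below is not part of the backup tooling: check_config is a hypothetical helper, and treating a path that is not an existing directory as an error is an assumption rather than documented behaviour.

```shell
# Hypothetical checker: print a message for every line that breaks the
# format rules (full path, one directory per line, no blank lines).
check_config() {
    lineno=0
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        case $line in
            '')  echo "line $lineno: blank line" ;;
            /*)  [ -d "$line" ] || echo "line $lineno: not a directory: $line" ;;
            *)   echo "line $lineno: not a full path: $line" ;;
        esac
    done < "$1"
}

# Demonstration with one valid and one invalid file in a throwaway directory.
demo=$(mktemp -d)
mkdir -p "$demo/results"
printf '%s\n' "$demo/results" > "$demo/good.conf"
printf '%s\n' "$demo/results" "" "results" "$demo/missing" > "$demo/bad.conf"

good_report=$(check_config "$demo/good.conf")
bad_report=$(check_config "$demo/bad.conf")
echo "$bad_report"
```

An empty report means every line is a full path to an existing directory. In the invalid file, lines 2, 3 and 4 are reported as a blank line, a relative path and a missing directory respectively.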