OVERVIEW

Each group has a configuration file owned by their nominated account. This file can be edited at any time, and is stored at /t1-data/backup/GROUPNAME. It specifies which locations within /t1-data/ should be searched for directories to back up for the group, with one location listed per line. Targeting the search in this way improves performance and ensures that only that group’s data counts towards their backup quota. In general, this list is unlikely to contain significantly more entries than the number of members in the group.

Within each of those specified locations, every directory that is to be backed up must contain a hidden file named .backup. This file signifies that the directory and everything it contains - including other directories - should be backed up. Adding .backup files to subdirectories of a directory which has already been marked for backup is not merely unnecessary, it is invalid. If further .backup files are found within subdirectories of a directory already marked for backup, the entire backup will not run. Backups cannot be nested.
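One way to look for nested markers before a backup runs is sketched below. It is an illustration only, not the backup system's own check: the function name find_nested_markers is hypothetical, and the -mindepth option assumes GNU or BSD find.

```shell
# find_nested_markers SEARCH_ROOT   (hypothetical helper)
# Prints every .backup file that sits below a directory which is itself
# already marked with a .backup file. No output means no nesting was found.
find_nested_markers() {
    find "$1" -type f -name .backup | while IFS= read -r marker; do
        marked_dir=$(dirname "$marker")
        # The directory's own marker is at depth 1, so anything found at
        # depth 2 or deeper is a nested marker.
        find "$marked_dir" -mindepth 2 -type f -name .backup
    done | sort -u
}
```

For example, find_nested_markers /t1-data/user/dtooke should print nothing when the layout is valid; any path it prints is a .backup file that would need to be removed.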

When the backup runs, only locations specified in the configuration file will be searched for .backup files and only those directories which are subsequently found to contain .backup files will be backed up. All other data will be ignored.

EXAMPLE

The following example illustrates the backup configuration. Items enclosed in slashes (/) are directories; those without slashes are files. The text in the box is the contents of the file /t1-data/backup/staylor.

                         /t1-data/
                             |
                    +--------+--------+
                    |                 |
                 /backup/           /user/
                    |                 |
           +--------+        +--------+--------+
           |                 |                 |
        staylor           /nicki/           /dtooke/
           |                                   |
+----------------------+                +------+------+
|/t1-data/user/dtooke  |                |             |
|/t1-data/user/smcgowan|             /data/       /results/
|/t1-data/user/macmahon|                              |
+----------------------+                     +--------+--------+
                                             |                 |
                                          .backup          /ChipSeq/

The backups for this example will run in the following manner.

- Based on the configuration file, only the directories dtooke/, smcgowan/ and macmahon/ are searched

- The directory for nicki/ will not be searched and is not backed up even if it contains a .backup file

- The directory results/ within the directory dtooke/ contains a .backup file, and is backed up along with the ChipSeq/ directory inside it

- The ChipSeq/ directory must not contain a .backup file - it is already being backed up and attempting to manually specify nested backups is invalid

- The data/ directory does not contain a .backup file and is not backed up even though it has been searched; however, if a .backup file is subsequently added it will be found by the search and the directory backed up the next time the backup runs.
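The outcome described above can be reproduced on a scratch copy of the example tree. This is a sketch only: the tree is built in a temporary directory rather than under /t1-data/, the loop is an assumed stand-in for the real backup search, and a .backup file is placed in nicki/ purely to show that unlisted locations are ignored.

```shell
# Build a scratch copy of the example tree.
root=$(mktemp -d)
mkdir -p "$root/backup" \
         "$root/user/nicki" \
         "$root/user/dtooke/data" \
         "$root/user/dtooke/results/ChipSeq"
touch "$root/user/dtooke/results/.backup"
touch "$root/user/nicki/.backup"    # present, but nicki/ is never searched

# Scratch version of the configuration file /t1-data/backup/staylor.
cat > "$root/backup/staylor" <<EOF
$root/user/dtooke
$root/user/smcgowan
$root/user/macmahon
EOF

# Search only the listed locations (smcgowan/ and macmahon/ do not exist in
# this scratch tree, so they are skipped) and print each marked directory
# relative to the scratch root.
marked=$(
    while IFS= read -r location; do
        [ -d "$location" ] || continue
        find "$location" -type f -name .backup
    done < "$root/backup/staylor" |
    while IFS= read -r marker; do
        dir=$(dirname "$marker")
        printf '%s/\n' "${dir#"$root"}"
    done
)
echo "$marked"
# prints: /user/dtooke/results/

rm -rf "$root"
```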

THE CONFIGURATION FILE

The format of the configuration file is simple. It is a plain text file which should contain a list of directories to search for .backup files, one directory per line. There should be no blank lines, and spaces are not permitted in the directory paths; if you wish to back up a directory with a space in its name, list the parent directory in the configuration file and then place a .backup file in the directory as normal.

When editing the configuration file, please remember that this is a list of paths to search and not a list of directories to back up. You must still create .backup files in the relevant locations or your data will not be backed up.
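The format rules can be checked mechanically before a backup runs. The function below is a minimal sketch, not a tool provided on the system: the name check_config and its messages are hypothetical, and it only tests the rules stated here (no blank lines, no spaces, full paths within /t1-data/).

```shell
# check_config CONFIG_FILE   (hypothetical helper)
# Reports each line that breaks the format rules and returns non-zero
# if any such line was found.
check_config() {
    awk '
        /^$/ { printf "line %d: blank line\n", NR; bad = 1; next }
        / /  { printf "line %d: contains a space: %s\n", NR, $0; bad = 1; next }
        !/^\/t1-data\// {
            printf "line %d: not a full path within /t1-data/: %s\n", NR, $0
            bad = 1
        }
        END  { exit bad + 0 }
    ' "$1"
}
```

Running check_config /t1-data/backup/GROUPNAME prints nothing and succeeds when the file is well formed. It does not check that .backup files exist in the listed locations.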

TIPS FOR SIMPLE AND EFFECTIVE BACKUPS

The success of your backups is affected by three key factors:

- The complexity of your backup structure

- The total number of files

- The total size of the backup

Ensuring that your backup structure is simple and effective will reduce the likelihood of problems. Reducing the number of files and/or the total size will both make your backups faster and enable you to store more data. Here are some tips for getting successful backups:

- Don’t add data to a backup and then change your mind and remove it for next time. Your next backup may well run, but eventually you’ll go over your quota on the back-end storage. At that point, we may have no way to untangle data that you no longer want backed up from data that you do. In that scenario we may have to delete your entire backup and let it recreate from scratch on the next run, which will be slow and leave you without a backup until it’s finished.

- Be careful before moving your data around into a new structure that makes the backups simpler, unless you’re sure you know what you’re doing. If you have symbolic links you may break them and have difficulty fixing them.

- Add data to your backups incrementally. Backups run every night, unless a previous backup is still running. Backing up a small amount to start with, then slowly adding more over the course of several days, is a good way to avoid blocking errors or overshooting your quota.

- If possible, back up a few large directories rather than many small, scattered directories. This is far less likely to result in problems later, easier to manage and easier to fix if you get it wrong.

- Use du(1) to understand where your data is. Bear in mind that you can only calculate the size of data which you have read access to.

- Keep your configuration file simple. Don’t add a directory to search if you’re not sure it contains anything that will need to be backed up - you can always add it later. Also, if you can make the search location more specific, that’s great - /t1-data/project/taylorlab/duncan/ will take longer to search than /t1-data/project/taylorlab/duncan/results/ - but don’t make the configuration complicated in doing so.

- If you have data or output files which you will not need again, delete them. Be absolutely sure before you do this as there is no default backup and we cannot recover your data if you make a mistake.

- Use gzip(1) to compress old files. Be sure to test this on one file before compressing a large number, as some file types are already compressed or do not compress well. A good starting point is any plain text files (such as job output files) and any .fastq and .fq files (.fastq.gz and .fq.gz files are already compressed).

- Avoid backing up large numbers of small files, such as log files from jobs, .jpeg images or software directories. For experienced Linux users, use tar(1) to create an archive containing many smaller files and/or directories and consider compressing this as well.
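The du, gzip and tar tips above can be tried safely on throwaway data first. The sketch below works entirely in a temporary directory; the file names (reads.fastq, logs/) are made-up stand-ins, and the -z option to tar assumes GNU or BSD tar.

```shell
# Create some throwaway data in a scratch directory.
work=$(mktemp -d)
mkdir -p "$work/logs"
for i in 1 2 3; do
    echo "job $i finished" > "$work/logs/job$i.out"
done
# A repetitive plain-text file standing in for uncompressed sequence data.
awk 'BEGIN { for (i = 0; i < 5000; i++) print "ACGTACGTACGT" }' > "$work/reads.fastq"

# 1. See where the space is going (sizes in KB).
du -sk "$work"/*

# 2. Compress a single file first and compare the sizes before doing more.
before=$(wc -c < "$work/reads.fastq" | tr -d ' ')
gzip "$work/reads.fastq"            # replaces the file with reads.fastq.gz
after=$(wc -c < "$work/reads.fastq.gz" | tr -d ' ')
echo "reads.fastq: $before bytes before, $after bytes after"

# 3. Bundle many small files into one compressed archive, check that the
#    archive can be listed, and only then remove the originals.
tar -czf "$work/logs.tar.gz" -C "$work" logs
tar -tzf "$work/logs.tar.gz" > /dev/null && rm -r "$work/logs"

# Remove the scratch directory with: rm -r "$work"
```

On real data, keep the originals until you have confirmed that the compressed file or archive is readable, since there is no default backup to fall back on.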

SETTING UP YOUR BACKUP

In practice, setting up your backup is not difficult. The most time-consuming part of the process is deciding which data to back up. We recommend that you read the warnings and advice in the previous sections if you haven’t done so already.

1. Decide which member of your group will be the point of contact for your backups. It will help if this person has a solid understanding of your bioinformatics and of Linux.

2. E-mail genmail@molbiol.ox.ac.uk and provide the contact person’s details. This can only be done by the PI.

3. We will let you know your quota.

4. Decide which data you want to be backed up for your group. Please note that only data in collaboration projects can be backed up. Consider leaving some space in your quota for future data.

5. Check the size of the data you’ve chosen using du - note that you will need read permission on the data. Be sure that the total is under your quota.

6. Let us know that you’re ready for your first backup. We will create an empty configuration file for your group.

7. Based on the data you’ve chosen, configure the paths you need searching for backups in the configuration file. Note that each line must contain the full path to the directory, and that each directory must be on a separate line. No other text is allowed, including blank lines.

8. Make sure that the directories you’ve chosen to be backed up contain a .backup file - the command touch(1) will help here.

9. Check that your configuration is working correctly by running the command:

$ t1-backup-du path_to_your_configuration_file

Note that this will only give the correct result if you have read access to all the data.

10. If you have too much data, check the output carefully and make the necessary changes to your proposed backup scheme.

11. Let us know that you’ve successfully configured your backups. We will set it up to run for you.

12. At this point, you'll be all set up. You will receive an e-mail with your backup results each time it runs. If there’s a problem, let us know and we’ll help you to fix it. If all went well, the e-mail will look something like this:

--- Backup of /t1-data for staylor group ---
Quota 1073741824 KB, found 194 KB marked for backup. OK.
Starting Fri 27 Nov 08:22:41 GMT 2020
Only new and changed files will be transferred
sending incremental file list
Done Fri 27 Nov 08:22:42 GMT 2020
--- Backup completed successfully ---
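Steps 5, 7 and 8 can be rehearsed on scratch data before touching the real configuration. The following is a sketch under stated assumptions: every path is a temporary stand-in for a location under /t1-data/, the helper total_kb is hypothetical, and the quota figure is simply the one from the sample e-mail above.

```shell
# Scratch stand-ins for a project directory and the group's configuration file.
base=$(mktemp -d)
config="$base/GROUPNAME"
mkdir -p "$base/project/results" "$base/project/scratch"
echo "example data" > "$base/project/results/counts.txt"

# total_kb DIR...   (hypothetical helper)
# Sums the sizes, in KB, that du reports for the given directories.
total_kb() {
    du -sk "$@" | awk '{ sum += $1 } END { print sum + 0 }'
}

# Step 5: check the chosen data against the quota (use your own figure).
quota_kb=1073741824
used_kb=$(total_kb "$base/project/results")
if [ "$used_kb" -le "$quota_kb" ]; then
    echo "OK: $used_kb KB of $quota_kb KB"
else
    echo "Over quota: $used_kb KB of $quota_kb KB"
fi

# Step 7: one full path per line, no blank lines, no other text.
printf '%s\n' "$base/project" > "$config"

# Step 8: mark the directory to be backed up with an empty .backup file.
touch "$base/project/results/.backup"
```

Only results/ is marked here, so scratch/ would be searched but not backed up. As with du itself, the total is only accurate for data you have read access to.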

GETTING HELP

You can email the CCB team using the email address genmail@molbiol.ox.ac.uk. Using this address ensures your email is logged and assigned a tracking number, and will go to all the core team, which means the appropriate person or people will be able to pick it up.

COPYRIGHT

This text is copyright University of Oxford and MRC and may not be reproduced or redistributed without permission.

AUTHOR

Duncan Tooke <duncan.tooke@imm.ox.ac.uk>
