Walkthrough with a real data set (GSE67959)
The prerequisite for a data-rich SNP-style run is a SNP of interest closer to a RE cut site than half the --ampliconSize (default 350b, i.e. within 175b). Preferably the SNP of interest should lie within +/- 50 bases of the RE cut site, and the closer to the cut site, the better. If all your SNPs are farther from RE cut sites than this, you cannot probe them via this method. However, note that having SNPs of interest under your capture oligo (within the very oligonucleotide you used to capture the data) may affect your pull-down.
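The distance rule above can be sketched as a quick shell check; the SNP and cut-site coordinates below are hypothetical, not taken from GSE67959:

```shell
# Sketch of the distance rule: a SNP is probe-able if it lies within
# half the --ampliconSize (default 350b) of the nearest RE cut site.
# Both coordinates here are made-up illustration values.
ampliconHalf=175        # half of the default --ampliconSize (350b)
snpPos=35504589         # hypothetical SNP coordinate
cutSite=35504500        # hypothetical nearest RE cut site coordinate

dist=$(( snpPos - cutSite ))
if [ "$dist" -lt 0 ]; then dist=$(( -dist )); fi

if [ "$dist" -le "$ampliconHalf" ]; then
    verdict="usable (${dist}b from the cut site)"
else
    verdict="too far (${dist}b from the cut site)"
fi
echo "$verdict"
```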
As this method only probes reads which overlap the SNPs of interest, the data depth is greatly reduced (the farther the SNP of interest is from the closest RE cut site, the greater the reduction). Do not expect a high-quality interaction profile, but rather a plot from which you can roughly determine whether the general landscape of interactions differs between the allelic forms.
You are recommended to use the --bowtie2 flag in your CCseqBasic SNP-specific run, as bowtie2 tolerates SNPs better in mapping. However, the duplicate filtering of CCseqBasic struggles a little with bowtie2 multimapped reads, so the best practice (at the time of writing this, 01Feb2019) is to map with both --bowtie1 and --bowtie2, check that no artifacts accumulate in bowtie2, and check that SNPs get treated fairly in bowtie1. The very best way would be to generate an N-masked genome (and its associated bowtie1 build), which replaces the SNPs of interest with Ns, and add that genome to the /conf/genomeBuildSetup.sh file. This genome can then be used in bowtie1 mapping without biasing the true SNP distribution over the sites of interest.
You should now have run the CapSequm web tool (the first step of the analysis), and the --onlyBlat run of CCseqBasic (the second step of the analysis).
The above are optional pre-requisites of the CCseqBasic run (but are highly recommended).
To run CCseqBasic, you need the CapSequm output file called "CCseqBasicFragmentFile.txt", which contains the coordinates of the RE fragments for all capture sites of your design.
You can also make this file manually, but the CapSequm output is good for first testing: it is guaranteed to be in the right format, and you can thus use it without problems.
A readymade CapSequm output file "CCseqBasicFragmentFile.txt", for the 30g data set used in this walk-through, is available here:
The above file contains only 7 columns. To make the run SNP-specific, we need to add the 8th and 9th columns, which describe the desired SNP and the desired allele.
Each SNP-specific file can contain each capture site ONLY ONCE (if you want to run multiple alleles of the base of interest, you need several runs).
In this example we run the SNP-specific analysis for one of the capture sites (Oct-4) of the sample. The selected SNP has 2 alleles of interest (G, A) at the position of interest, which leads us to have 2 runs. Even though only one of our capture sites has a SNP of interest, we also include the other capture sites in the analysis, to enable the run to properly filter the ambiguous interactions which are assigned to more than one capture site.
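For illustration only, a SNP-specific line could look like the sketch below: the first 7 columns are the ordinary CapSequm output (shown here with '...' placeholders and made-up coordinates), and the added 8th and 9th columns give the SNP coordinate and the allele for that run. See fragmentfile.html for the exact column definitions.

```
# allele-G run: hypothetical coordinates; '...' marks the remaining CapSequm columns
Oct-4   chr17   35503000   35506000   ...   ...   ...   35504500   G

# allele-A run: the identical line, with allele A in the 9th column
Oct-4   chr17   35503000   35506000   ...   ...   ...   35504500   A
```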
The readymade "CCseqBasicFragmentFile.txt" files, for the 2 SNP runs, are available here:
More about the above file, its file format, and editing instructions is available here: fragmentfile.html
In addition to this file, you need the Fastq files from the sequencing of the Capture-C library.
Below are instructions on how to fetch them for the 30g data set, which is used as the example data set in this walk-through.
Here is a readymade script to fetch the fastqs for GSE67959, stored in the European Nucleotide Archive (ENA):
You need to run the above script (or something similar) to fetch the fastq files to your own data area.
Here are detailed instructions on how the above script was constructed, i.e. how to get from the GEO archive number GSE67959 to the European Nucleotide Archive file download locations:
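As a rough sketch of the path logic such a script follows: ENA fastq download locations are typically laid out as /vol1/fastq/&lt;first 6 characters&gt;/&lt;00 + last digit, for 10-character accessions&gt;/&lt;accession&gt;/. The accession below is the real one from this walkthrough, but double-check the final URL against the ENA file report for the run before relying on it:

```shell
# Sketch: derive the typical ENA FTP location for a 10-character run accession.
# Verify the resulting URL against the ENA file report before downloading.
acc="SRR1976563"
prefix="${acc:0:6}"            # SRR197
subdir="00${acc:9:1}"          # 003 (last digit of a 10-character accession)
url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/${subdir}/${acc}/${acc}_1.fastq.gz"
echo "$url"
# then fetch with e.g.: wget "$url"
```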
In the below example we only use the 'ery1' sample (not all 6 downloaded samples).
The expected file sizes are:

ery1: 3.0G  SRR1976563_1.fastq.gz
      3.2G  SRR1976563_2.fastq.gz
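Before starting the run it is worth verifying that the downloads are intact; a minimal sketch, where check_fastq is a hypothetical helper name:

```shell
# Sketch: verify a downloaded fastq.gz. gzip -t checks the compressed
# stream end to end; du -h reports the size for comparison with the
# expected ~3.0G / ~3.2G listed above.
check_fastq () {
    if gzip -t "$1" 2>/dev/null; then
        echo "$1 OK ($(du -h "$1" | cut -f1))"
    else
        echo "$1 corrupt or truncated - re-download" >&2
        return 1
    fi
}
# usage: check_fastq GSE67959_fastqs/ery1/SRR1976563_1.fastq.gz
```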
Now you should have these files :
CCseqBasicFragmentFile_SNP1.txt
CCseqBasicFragmentFile_SNP2.txt

and

GSE67959_fastqs
`-- ery1
    |-- SRR1976563_1.fastq.gz
    `-- SRR1976563_2.fastq.gz
These files can now be used to set up the CCseqBasic runs.
Below is a .zip archive of all the needed files, in the correct folder structure, to start both SNP runs:
After unpacking, you should see these files and folders :
|-- CCseqBasicFragmentFile_SNP_allele1.txt
|-- CCseqBasicFragmentFile_SNP_allele2.txt
|-- ery1_SNP_allele1
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
`-- ery1_SNP_allele2
    |-- PIPE_fastqPaths.txt
    `-- run.sh
The ery1_SNP_allele1 and ery1_SNP_allele2 folders are the run folders: each sample needs to be started in its own empty folder (containing nothing except the PIPE_fastqPaths.txt and run.sh files).
You need to edit the PIPE_fastqPaths.txt files, to reflect the actual location of the fastq files in your system.
The ery1_SNP_allele1/PIPE_fastqPaths.txt and ery1_SNP_allele2/PIPE_fastqPaths.txt files show the 2 supported file formats of this file:
ery1_SNP_allele1/PIPE_fastqPaths.txt
------------------------
/t1-data/GSE67959_fastqs/ery1/SRR1976563_1.fastq.gz
/t1-data/GSE67959_fastqs/ery1/SRR1976563_2.fastq.gz
------------------------

ery1_SNP_allele2/PIPE_fastqPaths.txt
------------------------
SRR1976563_1.fastq.gz
SRR1976563_2.fastq.gz
/t1-data/GSE67959_fastqs/ery2
------------------------
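If you prefer to script the edit, a sed substitution works; in this sketch, newPrefix is a placeholder for your own data area, and a scratch copy of the allele1 example content is written first so the snippet is self-contained:

```shell
# Sketch: rewrite the shipped example prefix to your own data area.
# newPrefix is a hypothetical placeholder; the scratch file stands in for
# ery1_SNP_allele1/PIPE_fastqPaths.txt so the snippet runs anywhere.
newPrefix="/home/me/data/GSE67959_fastqs"
printf '%s\n' \
    "/t1-data/GSE67959_fastqs/ery1/SRR1976563_1.fastq.gz" \
    "/t1-data/GSE67959_fastqs/ery1/SRR1976563_2.fastq.gz" \
    > PIPE_fastqPaths.txt
sed -i "s|/t1-data/GSE67959_fastqs|${newPrefix}|" PIPE_fastqPaths.txt
cat PIPE_fastqPaths.txt
```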
You also need to edit the run.sh file, to reflect the location of the other input files in your system:
capturesiteFile="/t1-data/user/hugheslab/telenius/3_CCseqBasic/CCseqBasicFragmentFile_SNP1.txt"
PublicPath="/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS"
reuseBLATdir='/t1-data/user/hugheslab/telenius/2_CCseqBasic_onlyBlat/BlatPloidyFilterRun/REUSE_blat'
You also need to change the pipePath to reflect the correct location of the pipeline:
pipePath="/t1-home/molhaem2/telenius/CCseqBasic/CS5/RELEASE/"
The PublicPath is the location where you want to store your run output data, for browsing via the UCSC genome browser. More about setting this in the installation instructions page: CCseqBasic/1_setup/index.html on step (9.) "server address".
The reuseBLATdir is the directory where your CCseqBasic --onlyBlat run results are. If you didn't run the --onlyBlat run, set :
reuseBLATdir='.'
Here you can download readymade BlatPloidyFilterRun/REUSE_blat folder for the 30g data set, to be set to reuseBLATdir : BlatPloidyFilterRun.zip
Notice that if you are using the REUSE_blat directory from your earlier (non-SNP) 30g test runs, you need to generate correctly named copies (or symlinks) for the SNP-run capture sites, like so:
cp TEMP_Oct-4_blat.psl TEMP_Oct-4_1_blat.psl
cp TEMP_Oct-4_blat.psl TEMP_Oct-4_2_blat.psl
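Symlinks avoid duplicating the .psl files; a loop version of the copies above, to be run inside the REUSE_blat folder (the Oct-4 name and the _1/_2 suffixes follow the cp example):

```shell
# Sketch: symlink instead of copy; run inside BlatPloidyFilterRun/REUSE_blat.
# The _1 and _2 suffixes correspond to the two SNP-run capture-site names.
for suffix in 1 2; do
    ln -sf TEMP_Oct-4_blat.psl "TEMP_Oct-4_${suffix}_blat.psl"
done
```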
Now you are ready to run.
This is a long production run, and needs to be submitted to a queue system (if you have one). The exact submitting command depends on the queue system used. The below commands were used when the script was run in the SGE (Sun Grid Engine) cluster at the WIMM:
cd ery1_SNP1
qsub -cwd -e qsub.err -o qsub.out -N ery1_SNP1 < ./run.sh
cd ../ery1_SNP2/
qsub -cwd -e qsub.err -o qsub.out -N ery1_SNP2 < ./run.sh
cd ..
You can keep an eye on the run by reading the output and error logs qsub.out and qsub.err.
Once the run has finished, the last 10 lines of the qsub.out file will give the location of the visualisation data hubs :
[telenius@deva ery1]$ tail -n 10 qsub.out
Generated a data hub for RAW data in : http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
( pre-filtered data for DEBUGGING purposes is here : http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
Generated a data hub for FILTERED data in : http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
Generated a data hub for FILTERED flashed+nonflashed combined data in : http://userweb.molbiol.ox.ac.uk/p
Generated a COMBINED data hub (of all the above) in : http://userweb.molbiol.ox.ac.uk/public/telenius/tes
How to load this hub to UCSC : http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_Allgr
[telenius@deva 1]$
Of these, the most important is the COMBINED data hub (of all the above); the others are needed mostly for troubleshooting, if runs crash unexpectedly before finishing.
As you can see, these lines also point to a help document on how to load this data hub into UCSC: http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_AllGroups_160813.pdf
More information on these data hubs in: 6_visualisation/index.html and 7_QC/index.html
The output folders of the run are explained more thoroughly here: 4_outputfolders/index.html
The interaction counters are explained here: 5_counters/index.html
The workflow of the CCseqBasic tool is explained here: 2_workflow/index.html
Optional run parameters, and other fine-tuning of the run, are explained here: 3_run/index.html
Here are the readymade COMBINED data hubs for both the ery1_SNP1 and ery1_SNP2 samples, for loading into UCSC:
How to load the above data hubs into UCSC : HUBtutorial_AllGroups_160813.pdf