Walkthrough with real data set (GSE67959)
You should now have run the CapSequm web tool (the first step of the analysis) and the --onlyBlat run of CCseqBasic (the second step of the analysis).
Both are optional, but highly recommended, prerequisites of the CCseqBasic run.
To run CCseqBasic, you need the CapSequm output file called "CCseqBasicFragmentFile.txt", which contains the coordinates of the RE fragments for all capture sites of your design.
You can also make this file manually, but the CapSequm output is a good choice for first tests : it is in the right format, so you can use it as it is.
A ready-made CapSequm output file "CCseqBasicFragmentFile.txt", for the 30g data set used in this walkthrough, is available here :
More about this file, its format, and how to edit it : fragmentfile.html
In addition to this file, you need the Fastq files from the sequencing of the Capture-C library.
Below are instructions on how to fetch them for the 30g data set, which is used as the example data set in this walkthrough.
Here is a ready-made script to fetch the fastq files for GSE67959, which are stored in the European Nucleotide Archive (ENA) :
You need to run the above script (or something similar) to fetch the fastq files to your own data area.
Here are detailed instructions on how the above script was constructed, i.e. how to get from the GEO accession number GSE67959 to the ENA file download locations :
The expected file sizes are :
ery1:
3.0G SRR1976563_1.fastq.gz
3.2G SRR1976563_2.fastq.gz
ery2:
1.8G SRR1976564_1.fastq.gz
1.9G SRR1976564_2.fastq.gz
ery3:
2.7G SRR1976565_1.fastq.gz
2.9G SRR1976565_2.fastq.gz
es1:
2.7G SRR1976567_1.fastq.gz
2.9G SRR1976567_2.fastq.gz
es2:
2.2G SRR1976568_1.fastq.gz
2.3G SRR1976568_2.fastq.gz
es3:
2.6G SRR1976569_1.fastq.gz
2.4G SRR1976569_2.fastq.gz
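If the ready-made fetch script is not at hand, the download step could be sketched roughly as below. This is not the original script: the server name ftp.sra.ebi.ac.uk and the directory layout in ena_fastq_url are assumptions about how the archive organises its files, and the function names are made up for this illustration. Check the addresses against the archive record of each run before relying on them. The sketch only prints the wget commands (a dry run).

```shell
#!/bin/sh
# Sketch: print (dry run) download commands for the GSE67959 fastq files.
# ASSUMPTION: archive layout vol1/fastq/<first 6 chars>/00<last digit>/<run>/
# for 10-character run accessions. Verify before use.

# Sample name and run accession pairs, taken from the file listing above.
SAMPLES="ery1:SRR1976563 ery2:SRR1976564 ery3:SRR1976565 es1:SRR1976567 es2:SRR1976568 es3:SRR1976569"

# Build the assumed download address for one run accession and read number (1 or 2).
ena_fastq_url() {
    run=$1
    read_no=$2
    prefix=$(printf '%s' "$run" | cut -c1-6)   # e.g. SRR197
    last=$(printf '%s' "$run" | cut -c10)      # last digit of the accession
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/00%s/%s/%s_%s.fastq.gz\n' \
        "$prefix" "$last" "$run" "$run" "$read_no"
}

# Print one wget command per file; remove "echo" to really download.
print_fetch_commands() {
    for pair in $SAMPLES; do
        sample=${pair%%:*}
        run=${pair##*:}
        for read_no in 1 2; do
            echo wget -P "GSE67959_fastqs/$sample" "$(ena_fastq_url "$run" "$read_no")"
        done
    done
}

print_fetch_commands
```

After a real download, ls -sh in each sample folder gives sizes to compare with the list above.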
Now you should have these files :
CCseqBasicFragmentFile.txt
and
GSE67959_fastqs
|-- ery1
|   |-- SRR1976563_1.fastq.gz
|   `-- SRR1976563_2.fastq.gz
|-- ery2
|   |-- SRR1976564_1.fastq.gz
|   `-- SRR1976564_2.fastq.gz
|-- ery3
|   |-- SRR1976565_1.fastq.gz
|   `-- SRR1976565_2.fastq.gz
|-- es1
|   |-- SRR1976567_1.fastq.gz
|   `-- SRR1976567_2.fastq.gz
|-- es2
|   |-- SRR1976568_1.fastq.gz
|   `-- SRR1976568_2.fastq.gz
`-- es3
    |-- SRR1976569_1.fastq.gz
    `-- SRR1976569_2.fastq.gz
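A quick way to confirm that the folder matches the tree above is a small check like the following. It is only a sketch: the function names are invented here, and the base folder passed to check_fastq_layout is whatever location you downloaded the files to.

```shell
#!/bin/sh
# Sketch: report which of the 12 expected fastq files are missing under a base folder.

# Expected files, relative to the base folder, as in the tree above.
expected_fastqs() {
    for pair in ery1:SRR1976563 ery2:SRR1976564 ery3:SRR1976565 \
                es1:SRR1976567 es2:SRR1976568 es3:SRR1976569; do
        for read_no in 1 2; do
            echo "${pair%%:*}/${pair##*:}_${read_no}.fastq.gz"
        done
    done
}

# Print each missing (or empty) file; return non-zero if anything is missing.
check_fastq_layout() {
    base=$1
    missing=0
    for rel in $(expected_fastqs); do
        if [ ! -s "$base/$rel" ]; then
            echo "MISSING or empty: $base/$rel"
            missing=1
        fi
    done
    return $missing
}

# Usage (the path is only an example):
# check_fastq_layout /t1-data/GSE67959_fastqs && echo "all fastq files found"
```

This only checks presence. Running gzip -t on each file additionally tests that the compressed data is intact, at the cost of reading every file once.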
These files can now be used to set up the CCseqBasic runs.
Below is a .zip archive of all the needed files, in the correct folder structure, to start all 6 runs :
After unpacking, you should see these files and folders :
|-- CCseqBasicFragmentFile.txt
|-- ery1
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- ery2
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- ery3
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- es1
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- es2
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
`-- es3
    |-- PIPE_fastqPaths.txt
    `-- run.sh
The ery1, ery2, ery3, es1, es2 and es3 folders are the run folders : each sample needs to be started in its own folder, which must be empty apart from the PIPE_fastqPaths.txt and run.sh files.
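Before starting (or restarting) a run, the condition that a run folder holds nothing but these two files can be checked with a small helper like the one below. This is a sketch only; check_run_folder is a name made up for this example and is not part of CCseqBasic.

```shell
#!/bin/sh
# Sketch: warn if a run folder holds anything besides the two expected input files.

check_run_folder() {
    dir=$1
    status=0
    # Both input files must be present.
    for f in PIPE_fastqPaths.txt run.sh; do
        [ -f "$dir/$f" ] || { echo "$dir: missing $f"; status=1; }
    done
    # Anything else (including hidden files) means the folder is not clean.
    extra=$(ls -A "$dir" | grep -v -x -e PIPE_fastqPaths.txt -e run.sh || true)
    if [ -n "$extra" ]; then
        echo "$dir: unexpected content:"
        echo "$extra"
        status=1
    fi
    return $status
}

# Usage:
# for s in ery1 ery2 ery3 es1 es2 es3; do check_run_folder "$s"; done
```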
You need to edit the PIPE_fastqPaths.txt files, to reflect the actual location of the fastq files in your system.
ery1/PIPE_fastqPaths.txt and ery2/PIPE_fastqPaths.txt show the 2 supported formats of this file :
ery1/PIPE_fastqPaths.txt
------------------------
/t1-data/GSE67959_fastqs/ery1/SRR1976563_1.fastq.gz /t1-data/GSE67959_fastqs/ery1/SRR1976563_2.fastq.gz
------------------------

ery2/PIPE_fastqPaths.txt
------------------------
SRR1976564_2.fastq.gz SRR1976564_1.fastq.gz /t1-data/GSE67959_fastqs/ery2
------------------------
If you have multiple lanes for the same sample, you can give multiple lines (one fastq pair per line), like so :
ery1_and_ery2/PIPE_fastqPaths.txt (if we wanted to merge ery1 and ery2 replicates before analysis) :
------------------------
SRR1976563_1.fastq.gz SRR1976563_2.fastq.gz /t1-data/user/hugheslab/telenius/GSE67959_fastqs/ery1
SRR1976564_2.fastq.gz SRR1976564_1.fastq.gz /t1-data/user/hugheslab/telenius/GSE67959_fastqs/ery2
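Since a wrong path in PIPE_fastqPaths.txt only shows up once the run has started, it can help to test the file first. The sketch below handles both formats shown above. It assumes that the columns are separated by whitespace (tabs or spaces) and that paths contain no spaces; check_fastq_paths is a made-up helper name, not a CCseqBasic command.

```shell
#!/bin/sh
# Sketch: check that every fastq file named in a PIPE_fastqPaths.txt exists.
# ASSUMPTION: whitespace-separated columns, no spaces inside paths.
# 2 columns = two full paths; 3 columns = file1 file2 folder.

check_fastq_paths() {
    paths_file=$1
    status=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r col1 col2 col3 rest || [ -n "$col1" ]; do
        [ -z "$col1" ] && continue                  # skip empty lines
        if [ -n "$rest" ] || [ -z "$col2" ]; then
            echo "unexpected column count: $col1 $col2 $col3 $rest"
            status=1
            continue
        fi
        if [ -n "$col3" ]; then                     # 3-column format
            f1="$col3/$col1"
            f2="$col3/$col2"
        else                                        # 2-column format
            f1=$col1
            f2=$col2
        fi
        for f in "$f1" "$f2"; do
            [ -f "$f" ] || { echo "not found: $f"; status=1; }
        done
    done < "$paths_file"
    return $status
}

# Usage:
# check_fastq_paths ery1/PIPE_fastqPaths.txt && echo "ery1 paths OK"
```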
You also need to edit the run.sh file, to reflect the location of the other input files in your system :
capturesiteFile="/t1-data/user/hugheslab/telenius/3_CCseqBasic/CCseqBasicFragmentFile.txt"
PublicPath="/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS"
reuseBLATdir='/t1-data/user/hugheslab/telenius/2_CCseqBasic_onlyBlat/BlatPloidyFilterRun/REUSE_blat'
You also need to change the pipePath to reflect the correct location of the pipeline :
pipePath="/t1-home/molhaem2/telenius/CCseqBasic/CS5/RELEASE/"
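With six run.sh files to edit, the same change can be applied with sed instead of by hand. The helper below is a sketch: set_run_variable is a name invented for this example, it assumes each variable is assigned once at the start of its own line (as in the lines shown above), and it assumes the new value contains no '|', '&', backslash or double-quote characters.

```shell
#!/bin/sh
# Sketch: replace the value of one variable assignment in a run.sh file.
# ASSUMPTION: the assignment starts at the beginning of a line, e.g. pipePath="..."

set_run_variable() {
    script=$1
    name=$2
    value=$3
    tmp_file="$script.tmp"
    # '|' is used as the sed delimiter so that '/' in paths needs no escaping.
    sed "s|^${name}=.*|${name}=\"${value}\"|" "$script" > "$tmp_file" || return 1
    # Copy the content back so the original file keeps its permissions.
    cat "$tmp_file" > "$script" && rm "$tmp_file"
}

# Usage (the paths are examples only):
# for s in ery1 ery2 ery3 es1 es2 es3; do
#     set_run_variable "$s/run.sh" pipePath "/my/install/CCseqBasic/RELEASE/"
# done
```

Reading the edited run.sh once afterwards is still worthwhile, since the helper silently does nothing when the variable name is not found.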
The public path is the location where you want to store your run output data, for browsing via the UCSC genome browser. More about setting this is in the installation instructions page : CCseqBasic/1_setup/index.html, step (9.) "server address".
The reuseBLATdir is the directory where your CCseqBasic --onlyBlat run results are. If you didn't run the --onlyBlat run, set :
reuseBLATdir='.'
Here you can download a ready-made BlatPloidyFilterRun/REUSE_blat folder for the 30g data set, to be set as reuseBLATdir : BlatPloidyFilterRun.zip
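The choice between a real REUSE_blat folder and '.' can also be made by a small helper, sketched below. choose_reuse_blat_dir is a hypothetical name; the only behaviour taken from the text above is that '.' is the value to use when there are no --onlyBlat results.

```shell
#!/bin/sh
# Sketch: print the value to use for reuseBLATdir.
# Falls back to '.' (the value for "no --onlyBlat run") when the folder is absent.

choose_reuse_blat_dir() {
    candidate=$1
    if [ -d "$candidate" ]; then
        echo "$candidate"
    else
        # Warn on stderr, so that a mistyped path does not go unnoticed.
        echo "warning: $candidate not found, using '.'" >&2
        echo "."
    fi
}

# Usage (the path is an example only):
# reuseBLATdir=$(choose_reuse_blat_dir /my/data/BlatPloidyFilterRun/REUSE_blat)
```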
Now you are ready to run.
This is a long production run, and needs to be submitted to a queue system (if you have one). The exact submission command depends on the queue system used. The commands below were used when the script was run on the SGE (Sun Grid Engine) cluster at the WIMM :
cd ery1
qsub -cwd -e qsub.err -o qsub.out -N ery1 < ./run.sh
cd ../ery2/
qsub -cwd -e qsub.err -o qsub.out -N ery2 < ./run.sh
cd ../ery3/
qsub -cwd -e qsub.err -o qsub.out -N ery3 < ./run.sh
cd ../es1
qsub -cwd -e qsub.err -o qsub.out -N es1 < ./run.sh
cd ../es2/
qsub -cwd -e qsub.err -o qsub.out -N es2 < ./run.sh
cd ../es3/
qsub -cwd -e qsub.err -o qsub.out -N es3 < ./run.sh
cd ..
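The same six submissions can be written as one loop. The sketch below uses the qsub options from the commands above; the SUBMIT variable is an addition made up for this example, so that the loop can be tried with a stand-in command on a machine without a queue system. Adapt the submit command if your cluster does not use SGE.

```shell
#!/bin/sh
# Sketch: submit all six runs with one loop, using the same qsub options as above.
# SUBMIT is a hypothetical override naming another command or shell function
# to call instead of qsub (for example, for testing).

submit_all_runs() {
    submit_cmd=${SUBMIT:-qsub}
    for sample in ery1 ery2 ery3 es1 es2 es3; do
        # The subshell keeps the cd local, so no "cd .." is needed afterwards.
        (
            cd "$sample" || exit 1
            "$submit_cmd" -cwd -e qsub.err -o qsub.out -N "$sample" < ./run.sh
        ) || echo "submission failed for $sample" >&2
    done
}

# Usage, from the folder that contains the six run folders:
# submit_all_runs
```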
You can keep an eye on the run by reading the output and error logs qsub.out and qsub.err.
Once the run has finished, the last 10 lines of the qsub.out file will give the location of the visualisation data hubs :
[telenius@deva ery1]$ tail -n 10 qsub.out
Generated a data hub for RAW data in : http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
( pre-filtered data for DEBUGGING purposes is here : http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
Generated a data hub for FILTERED data in : http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
Generated a data hub for FILTERED flashed+nonflashed combined data in : http://userweb.molbiol.ox.ac.uk/p
Generated a COMBINED data hub (of all the above) in : http://userweb.molbiol.ox.ac.uk/public/telenius/tes
How to load this hub to UCSC : http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_Allgr
[telenius@deva 1]$
Of these, the most important is the COMBINED data hub (of all the above). The others are needed mostly for troubleshooting, if runs crash unexpectedly before finishing.
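To collect the COMBINED data hub address of every sample in one go, the relevant line can be picked out of each qsub.out. The sketch below assumes the line has the same wording as in the example output above ("Generated a COMBINED data hub ... in : <address>"); the function names are invented for this illustration.

```shell
#!/bin/sh
# Sketch: print the COMBINED data hub address found in each run's qsub.out.
# ASSUMPTION: the log line reads "Generated a COMBINED data hub ... in : <address>".

# Print the address from one log file (nothing if the line is absent).
combined_hub_url() {
    log=$1
    sed -n 's/^Generated a COMBINED data hub.* in : *//p' "$log" | tail -n 1
}

# Print one line per run folder given as argument.
report_hubs() {
    for sample in "$@"; do
        url=""
        if [ -f "$sample/qsub.out" ]; then
            url=$(combined_hub_url "$sample/qsub.out")
        fi
        if [ -n "$url" ]; then
            echo "$sample: $url"
        else
            echo "$sample: no COMBINED data hub line yet (still running, or stopped early)"
        fi
    done
}

# Usage, from the folder that contains the six run folders:
# report_hubs ery1 ery2 ery3 es1 es2 es3
```

A missing line does not by itself tell whether the run is still in progress or has crashed; the end of qsub.err is the place to look in that case.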
As you can see, these lines also point to a help document on how to load this data hub into UCSC : http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_AllGroups_160813.pdf
More information on these data hubs is in : 6_visualisation/index.html and 7_QC/index.html
The output folders of the run are explained more thoroughly here : 4_outputfolders/index.html
The interaction counters are explained here : 5_counters/index.html
The workflow of the CCseqBasic tool is explained here : 2_workflow/index.html
Optional run parameters and other fine-tuning of the run are explained here : 3_run/index.html
Here are the ready-made COMBINED data hubs for the ery1, ery2, ery3, es1, es2 and es3 samples, for loading into UCSC :
How to load the above data hubs into UCSC : HUBtutorial_AllGroups_160813.pdf