Page updated by Jelena Telenius - 17:00 28/Nov/2018

CCseqBasic
CaptureC data analysis and QC

Walkthrough with real data set ( GSE67959 )

You should by now have run the CapSequm web tool (the first step of the analysis),
and the --onlyBlat run of CCseqBasic (the second step of the analysis).

Both are optional prerequisites of the CCseqBasic run, but are highly recommended.

To run CCseqBasic, you need the CapSequm output file called "CCseqBasicFragmentFile.txt",
which contains the coordinates of the RE fragments for all capture sites of your design.

You can also make this file manually, but the CapSequm output is a good choice for first testing :
it is already in the right format, so you can use it as it is.

A readymade CapSequm output file "CCseqBasicFragmentFile.txt" ,
for the 30g data set used in this walk-through, is available here :

CCseqBasicFragmentFile.txt

More about the above file, its file format, and editing instructions are available here : fragmentfile.html

In addition to this file, you need the Fastq files from the sequencing of the Capture-C library.

Below are instructions on how to fetch them for the 30g data set, which is used as the example data set in this walk-through.

Here is a readymade script to fetch the fastqs for GSE67959, stored in the European Nucleotide Archive :

downloadFastqs.sh.txt

You need to run the above script (or something similar) to fetch the fastq files to your own data area.

Here are detailed instructions on how the above script was constructed,
i.e. how to get from the GEO archive number GSE67959
to the European Nucleotide Archive file download locations :

fromGEOtoFastq.txt

The expected file sizes are :

ery1:   3.0G    SRR1976563_1.fastq.gz
        3.2G    SRR1976563_2.fastq.gz

ery2:   1.8G    SRR1976564_1.fastq.gz
        1.9G    SRR1976564_2.fastq.gz

ery3:   2.7G    SRR1976565_1.fastq.gz
        2.9G    SRR1976565_2.fastq.gz

es1:    2.7G    SRR1976567_1.fastq.gz
        2.9G    SRR1976567_2.fastq.gz

es2:    2.2G    SRR1976568_1.fastq.gz
        2.3G    SRR1976568_2.fastq.gz

es3:    2.6G    SRR1976569_1.fastq.gz
        2.4G    SRR1976569_2.fastq.gz

Now you should have these files :

CCseqBasicFragmentFile.txt

and

GSE67959_fastqs
|-- ery1
|   |-- SRR1976563_1.fastq.gz
|   `-- SRR1976563_2.fastq.gz
|-- ery2
|   |-- SRR1976564_1.fastq.gz
|   `-- SRR1976564_2.fastq.gz
|-- ery3
|   |-- SRR1976565_1.fastq.gz
|   `-- SRR1976565_2.fastq.gz
|-- es1
|   |-- SRR1976567_1.fastq.gz
|   `-- SRR1976567_2.fastq.gz
|-- es2
|   |-- SRR1976568_1.fastq.gz
|   `-- SRR1976568_2.fastq.gz
`-- es3
    |-- SRR1976569_1.fastq.gz
    `-- SRR1976569_2.fastq.gz
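
The folder layout above can be checked with a small shell function before setting up the runs. This is only a minimal sketch : the function name check_fastqs is made up for this example, and it assumes the sample folders and SRR file names are exactly as listed in the tree above.

```shell
#!/bin/sh
# check_fastqs DIR   (hypothetical helper, not part of CCseqBasic)
# Reports, for each of the 12 expected fastq files under DIR, whether it
# exists and is non-empty. Succeeds only if none of them is missing.
check_fastqs() {
    cf_missing=0
    for cf_entry in ery1:SRR1976563 ery2:SRR1976564 ery3:SRR1976565 \
                    es1:SRR1976567 es2:SRR1976568 es3:SRR1976569; do
        cf_sample=${cf_entry%%:*}   # part before the colon, e.g. ery1
        cf_run=${cf_entry##*:}      # part after the colon, e.g. SRR1976563
        for cf_mate in 1 2; do
            cf_file="$1/$cf_sample/${cf_run}_${cf_mate}.fastq.gz"
            if [ -s "$cf_file" ]; then
                echo "OK       $cf_file"
            else
                echo "MISSING  $cf_file"
                cf_missing=$((cf_missing + 1))
            fi
        done
    done
    [ "$cf_missing" -eq 0 ]
}
```

After defining the function, it could be called as : check_fastqs GSE67959_fastqs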

These files can now be used to set up the CCseqBasic runs.

Below is a .zip archive of all the needed files, in the correct folder structure, to start all 6 runs :

CCseqBasic_runExample.zip

After unpacking, you should see these files and folders :

|-- CCseqBasicFragmentFile.txt
|-- ery1
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- ery2
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- ery3
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- es1
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
|-- es2
|   |-- PIPE_fastqPaths.txt
|   `-- run.sh
`-- es3
    |-- PIPE_fastqPaths.txt
    `-- run.sh

The ery1 ery2 ery3 es1 es2 es3 folders are the run folders :
each sample needs to be started in its own folder, which must be empty apart from the PIPE_fastqPaths.txt and run.sh files.
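
The "empty apart from two files" requirement can be verified per run folder with a short check. This is a sketch under the assumption that hidden files also count as extra content ; the function name check_run_folder is made up for this example.

```shell
#!/bin/sh
# check_run_folder DIR   (hypothetical helper, not part of CCseqBasic)
# Succeeds only if DIR holds PIPE_fastqPaths.txt and run.sh and nothing else.
check_run_folder() {
    [ -f "$1/PIPE_fastqPaths.txt" ] || { echo "$1: PIPE_fastqPaths.txt missing"; return 1; }
    [ -f "$1/run.sh" ] || { echo "$1: run.sh missing"; return 1; }
    # Count all entries, hidden ones included (ls -A skips only . and ..)
    crf_count=$(ls -A "$1" | wc -l | tr -d ' ')
    if [ "$crf_count" -ne 2 ]; then
        echo "$1: expected 2 entries, found $crf_count"
        return 1
    fi
    echo "$1: ready"
}
```

To check all six folders in one go : for s in ery1 ery2 ery3 es1 es2 es3; do check_run_folder "$s"; done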

You need to edit the PIPE_fastqPaths.txt files, to reflect the actual location of the fastq files on your system.

ery1/PIPE_fastqPaths.txt and ery2/PIPE_fastqPaths.txt show the 2 supported formats of this file
(two full paths, or two file names followed by their folder) :

ery1/PIPE_fastqPaths.txt
------------------------
/t1-data/GSE67959_fastqs/ery1/SRR1976563_1.fastq.gz	/t1-data/GSE67959_fastqs/ery1/SRR1976563_2.fastq.gz
------------------------

ery2/PIPE_fastqPaths.txt
------------------------
SRR1976564_2.fastq.gz	SRR1976564_1.fastq.gz	/t1-data/GSE67959_fastqs/ery2
------------------------
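
Instead of editing the six files by hand, they can be written by a loop. The sketch below uses the first format shown above (two full paths, separated by a tab). The function name write_fastq_paths is made up for this example, and it assumes the fastq folder layout of this walk-through ; give the fastq folder as an absolute path.

```shell
#!/bin/sh
# write_fastq_paths FASTQ_ROOT RUN_ROOT   (hypothetical helper)
# Writes RUN_ROOT/<sample>/PIPE_fastqPaths.txt for all six samples, one line
# each : full path of the _1 file, a tab, full path of the _2 file.
write_fastq_paths() {
    for wf_entry in ery1:SRR1976563 ery2:SRR1976564 ery3:SRR1976565 \
                    es1:SRR1976567 es2:SRR1976568 es3:SRR1976569; do
        wf_sample=${wf_entry%%:*}
        wf_run=${wf_entry##*:}
        mkdir -p "$2/$wf_sample"    # harmless if the run folder already exists
        printf '%s\t%s\n' \
            "$1/$wf_sample/${wf_run}_1.fastq.gz" \
            "$1/$wf_sample/${wf_run}_2.fastq.gz" \
            > "$2/$wf_sample/PIPE_fastqPaths.txt"
    done
}
```

For example, when called in the unpacked folder : write_fastq_paths /t1-data/GSE67959_fastqs .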

If you have multiple lanes for the same sample, you can give multiple lines (one fastq pair per line), like so :

ery1_and_ery2/PIPE_fastqPaths.txt (if we wanted to merge ery1 and ery2 replicates before analysis) :
------------------------
SRR1976563_1.fastq.gz	SRR1976563_2.fastq.gz	/t1-data/user/hugheslab/telenius/GSE67959_fastqs/ery1
SRR1976564_2.fastq.gz	SRR1976564_1.fastq.gz	/t1-data/user/hugheslab/telenius/GSE67959_fastqs/ery2
------------------------
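
A quick sanity check of a PIPE_fastqPaths.txt file, before starting a run, could look like the sketch below. It only tests that every non-empty line has 2 or 3 whitespace-separated fields, matching the two formats shown above ; it assumes the paths contain no spaces, and it does not test that the files exist. The function name validate_fastq_paths is made up for this example.

```shell
#!/bin/sh
# validate_fastq_paths FILE   (hypothetical helper, not part of CCseqBasic)
# Succeeds if every non-empty line of FILE has 2 fields (two full paths)
# or 3 fields (two file names and their folder).
validate_fastq_paths() {
    awk '
        NF == 0 { next }                      # skip empty lines
        NF != 2 && NF != 3 {
            printf "line %d: expected 2 or 3 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example : validate_fastq_paths ery1/PIPE_fastqPaths.txt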

You also need to edit the run.sh file, to reflect the location of the other input files on your system :

capturesiteFile="/t1-data/user/hugheslab/telenius/3_CCseqBasic/CCseqBasicFragmentFile.txt"
PublicPath="/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS"
reuseBLATdir='/t1-data/user/hugheslab/telenius/2_CCseqBasic_onlyBlat/BlatPloidyFilterRun/REUSE_blat'

You also need to change the pipePath to reflect the correct location of the pipeline :

pipePath="/t1-home/molhaem2/telenius/CCseqBasic/CS5/RELEASE/"

The public path (PublicPath) is the location where you want to store your run output data, for browsing via the UCSC genome browser.
More about setting this is in the installation instructions page : CCseqBasic/1_setup/index.html ,
in step (9.) "server address".

The reuseBLATdir is the directory where your CCseqBasic --onlyBlat run results are.
If you didn't run the --onlyBlat run, set :

reuseBLATdir='.'

Here you can download a readymade BlatPloidyFilterRun/REUSE_blat folder for the 30g data set, to be used as reuseBLATdir : BlatPloidyFilterRun.zip
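
The same run.sh edits have to be made in all six run folders, so a small helper may save some typing. The sketch below is an assumption-laden example, not part of CCseqBasic : the function name set_run_var is made up, and it assumes that each variable is set on its own line, starting at the beginning of the line, as in the excerpts above. It always writes the value in double quotes, so values containing $ or " would need editing by hand.

```shell
#!/bin/sh
# set_run_var FILE NAME VALUE   (hypothetical helper)
# Replaces every line of FILE that starts with NAME= by NAME="VALUE".
# All other lines are copied unchanged.
set_run_var() {
    sv_tmp="$1.tmp.$$"
    awk -v name="$2" -v value="$3" '
        index($0, name "=") == 1 { print name "=\"" value "\""; next }
        { print }
    ' "$1" > "$sv_tmp" &&
    cat "$sv_tmp" > "$1" &&     # write back in place, keeping file permissions
    rm -f "$sv_tmp"
}
```

For example, if the --onlyBlat run was skipped : for s in ery1 ery2 ery3 es1 es2 es3; do set_run_var "$s/run.sh" reuseBLATdir . ; done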

Now you are ready to run.

This is a long production run, and should be submitted to a queue system (if you have one).
The exact submission command depends on the queue system used.
The commands below were used when the script was run on the SGE (Sun Grid Engine) cluster at the WIMM :

cd ery1
qsub -cwd -e qsub.err -o qsub.out -N ery1 < ./run.sh
cd ../ery2/
qsub -cwd -e qsub.err -o qsub.out -N ery2 < ./run.sh
cd ../ery3/
qsub -cwd -e qsub.err -o qsub.out -N ery3 < ./run.sh
cd ../es1
qsub -cwd -e qsub.err -o qsub.out -N es1 < ./run.sh
cd ../es2/
qsub -cwd -e qsub.err -o qsub.out -N es2 < ./run.sh
cd ../es3/
qsub -cwd -e qsub.err -o qsub.out -N es3 < ./run.sh
cd ..
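
The six submissions above follow one pattern, so they can also be written as a loop. The sketch below wraps the same qsub command line in a function with a dry-run mode, which only prints the commands ; the function name submit_all and its "dry" / "run" argument are made up for this example. Other queue systems will need a different submission command.

```shell
#!/bin/sh
# submit_all [dry|run]   (hypothetical helper)
# "dry" (the default) only prints the commands ; "run" submits each sample
# with the same qsub options as above. Call it in the folder that holds the
# ery1 ... es3 run folders.
submit_all() {
    sa_mode=${1:-dry}
    for sa_sample in ery1 ery2 ery3 es1 es2 es3; do
        if [ "$sa_mode" = "run" ]; then
            # Subshell, so the working directory of the caller is not changed
            ( cd "$sa_sample" && qsub -cwd -e qsub.err -o qsub.out -N "$sa_sample" < ./run.sh )
        else
            echo "cd $sa_sample && qsub -cwd -e qsub.err -o qsub.out -N $sa_sample < ./run.sh"
        fi
    done
}
```

Running submit_all first, and submit_all run only after the printed commands look right, avoids submitting half-edited run folders.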

You can keep an eye on the run by reading the output and error logs qsub.out and qsub.err.

Once the run has finished, the last 10 lines of the qsub.out file will give the location of the visualisation data hubs :

[telenius@deva ery1]$ tail -n 10 qsub.out 

Generated a data hub for RAW data in :                http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
( pre-filtered data for DEBUGGING purposes is here :  http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
Generated a data hub for FILTERED data in :           http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
Generated a data hub for FILTERED flashed+nonflashed combined data in :     http://userweb.molbiol.ox.ac.uk/p

Generated a COMBINED data hub (of all the above) in :     http://userweb.molbiol.ox.ac.uk/public/telenius/tes
How to load this hub to UCSC : http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_Allgr

[telenius@deva 1]$

Of these, the most important is the COMBINED data hub (of all the above)
( the others are needed mostly for troubleshooting, if runs crash unexpectedly before finishing )
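
To pick out just the COMBINED hub address from a finished run, the log can be filtered for the line shown above. This is a minimal sketch : the function name combined_hub_url is made up, and it assumes the log line starts with the text "Generated a COMBINED data hub" and ends with the address, as in the example output.

```shell
#!/bin/sh
# combined_hub_url FILE   (hypothetical helper)
# Prints the last field of the last "Generated a COMBINED data hub" line
# of FILE (normally qsub.out). Prints nothing if the run has not finished.
combined_hub_url() {
    grep 'Generated a COMBINED data hub' "$1" | tail -n 1 | awk '{ print $NF }'
}
```

For all six samples : for s in ery1 ery2 ery3 es1 es2 es3; do combined_hub_url "$s/qsub.out"; done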

As you can see, these lines also point to a help document on how to load this data hub into UCSC :
http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_AllGroups_160813.pdf

More information on these data hubs is in : 6_visualisation/index.html and 7_QC/index.html

The output folders of the run are explained more thoroughly here : 4_outputfolders/index.html
The interaction counters are explained here : 5_counters/index.html
And the workflow of the CCseqBasic tool is explained here : 2_workflow/index.html

Optional run parameters, and other fine-tuning of the run, are explained here : 3_run/index.html

Here are the readymade COMBINED data hubs for all ery1 ery2 ery3 es1 es2 es3 samples, for loading to UCSC :

Ery1

http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_ery1/CS5_dpnII/30g_ery1_CS5_hub.txt

Ery2

http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_ery2/CS5_dpnII/30g_ery2_CS5_hub.txt

Ery3

http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_ery3/CS5_dpnII/30g_ery3_CS5_hub.txt

Es1

http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_es1/CS5_dpnII/30g_es1_CS5_hub.txt

Es2

http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_es2/CS5_dpnII/30g_es2_CS5_hub.txt