Setup

The base files for the analysis should be in data/

real_R1.fastq.gz, real_R2.fastq.gz
trim_R1.fastq, trim_R2.fastq.gz

You can get these from (must use browser, not wget)

https://copy.com/jxoJKLOhWzjz

You can then generate the trimmed versions as:

bash src/trim.sh data/real_R{1,2}.fastq.gz 15
bash src/trim.sh data/sim_R{1,2}.fastq.gz 20

Where the number indicates the threshold.

Simulated

The simulated data were generated with Sherman v0.1.6. using: src/gen-simulated.sh

To generate another version of the the simulated data, use src/gen-simulated.sh which should require only minimal changes to point to the reference fasta for mm10 on your system.

Once the simulated data is generated, run:

bash src/trim.sh data/sim_R{1,2}.fastq.gz

Adjust ./common.sh so that REF points the the mm10 on your system.

Index

You will then need to create the appropriate indexes for each of the alingers you wish to test.

Bowtie: bismark_genome_preparation /path/to/reference/files/

GSNAP: python src/gsnap-meth.py index $REF 15

bwa-meth: python ../bwameth.py index $REF

last: lastdb -w 2 -u /path/to/last-hg/examples/bisulfite_f.seed $REF.last_f $REF lastdb -w 2 -u /path/to/last-hg/examples/bisulfite_r.seed $REF.last_r $REF

bsmap: no indexing

bison: bison_index $REF

bsmooth: bswc_bowtie2_index.pl –name=bsmooth/mm10 $REF

Align

Once you have the indexes in place and REF= in common.sh in place, you can use the run-{method} scripts to run each of the aligners.

All of the run-{method} scripts source common.sh so make sure that has what you want.

Assessment

For simulated reads:

$ python src/sim-roc.py \
    --reads data/sim_R1.fastq.gz \ # this is to get the number of input reads
    results/*-sim.bam > sim-trim-qual-summ.txt

Used:

$ python src/sim-roc.py --reads 1000000 results/*-sim.bam results/bsmooth/bsmooth-sim.bam results/bison-sim/sim_R1.bam > sim.quals.txt 

And:

$ python src/sim-roc.py --reads 1000000 results/trim/*-sim.bam results/trim//bsmooth/bsmooth-sim.bam results/trim/bison-sim/sim_R1.trim.bam > sim.trim.quals.txt 

And:

$ python src/sim-roc.py --reads 1000000 results/trim/*-noerrorsim.bam results/trim//bsmooth/bsmooth-noerrorsim.bam results/trim/bison-noerrorsim/noerrorsim_R1.trim.bam > noerrorsim.trim.quals.txt 

And:

$ python src/sim-roc.py --reads 1000000 results/*-noerrorsim.bam results/bsmooth/bsmooth-noerrorsim.bam results/bison-noerrorsim/noerrorsim_R1.bam > noerrorsim.quals.txt 

This will make a plot with matplotlib that’s not very pretty. One can then create a nice ggplot, plot with:

Rscript src/plot-quals.R sim-trim-quals.txt sim-trim-quals.png
Rscript src/plot-quals.R sim-quals.txt sim-quals.png

And the output will appear in the png (or .pdf or .eps) specified as the 2nd arg.

For real reads:

$ python src/target-roc.py \
    data/mm10.capture-regions.bed.gz \
    --reads data/real_R1.fastq.gz \ # this is to get the number of input reads
    results/*-real.bam

Both real and simulated data can also be after trimming. BAMs from trimmed input appear in results/trim/

Note

If you want to simply align some reads. The entire syntax is:

bwameth.py index /path/to/ref.fasta
bwameth.py --reference /path/to/ref.fasta reads_R1.fastq reads_R2.fastq -p myoutput

and the result will appear in myoutput.bam