fq2SortedBAM: OpenOmics’ Genomics Secondary Analysis Pipeline
Overview:
The pipeline takes FASTQ files as input and produces a sorted BAM file through the following stages:
- Sequence Alignment: bwa-mem2 for short reads, mm2-fast (Accelerated Minimap2) for long reads (PacBio, ONT)
- SAMSort (Using SAMTools)
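For orientation, the two stages can be sketched for the short-read, paired-end case on a single node. This is only an illustration of the align-then-sort flow using the standard `bwa-mem2 mem` and `samtools sort` commands; it is not how fq2SortedBAM itself schedules the work, and the function name `align_and_sort` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative single-node sketch of the two stages (alignment, then sort).
# Assumes bwa-mem2 and samtools are on PATH and that the reference has
# already been indexed once with: bwa-mem2 index <ref>
# align_and_sort is a hypothetical helper, not part of the pipeline.
align_and_sort() {
  local ref="$1" read1="$2" read2="$3" out_bam="$4" threads="${5:-8}"

  # Stage 1: align paired-end short reads; SAM is written to stdout.
  # Stage 2: sort the SAM stream ("-" reads stdin) into a BAM file.
  bwa-mem2 mem -t "$threads" "$ref" "$read1" "$read2" \
    | samtools sort -@ "$threads" -o "$out_bam" -
}

# Example call (file names are placeholders):
# align_and_sort ref.fa R1.fastq.gz R2.fastq.gz sorted.bam 16
```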
Modes:
fq2SortedBAM supports 4 different modes:
- sortedbam: It takes fastq reads files and reference genome as input and outputs sorted BAM file
- flatmode: It takes fastq reads files and reference genome as input, and outputs multiple (equal to the number of ranks created) unsorted SAM files
- fqprocessonly: Custom mode, not for general use
- multifq: Custom mode, not for general use
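The mode name is passed as the first argument to run_bwa.sh, as in the run commands in this README. A minimal sketch of a guard that rejects misspelled modes before launching is shown here; `check_mode` is a hypothetical helper, and passing modes other than sortedbam in the same argument position is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a mode name before handing it to run_bwa.sh.
check_mode() {
  case "$1" in
    sortedbam|flatmode)    echo "general" ;;  # documented general-use modes
    fqprocessonly|multifq) echo "custom"  ;;  # custom modes, not for general use
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

mode="${1:-sortedbam}"
if kind="$(check_mode "$mode")"; then
  echo "mode=$mode ($kind)"
  # Printed for review only; remove the leading 'echo' to launch the run.
  echo bash run_bwa.sh "$mode" ./config.yaml
fi
```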
Use Docker
Docker build:
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz
tar -xzf Source_code_with_submodules.tar.gz
cp Open-Omics-Acceleration-Framework/pipelines/fq2sortedbam/Dockerfile .
cp Open-Omics-Acceleration-Framework/pipelines/fq2sortedbam/config.yaml <inputdir>
docker build -t fq2bam .
docker save fq2bam:latest > fq2bam.tar ## this step is optional
Setup Input Parameters:
Set up <inputdir>/config.yaml (described below) with appropriate values
Docker run:
docker load -i fq2bam.tar ## optional if the image was built on this machine or is already loaded
docker run -v <inputdir>:/input -v <outdir>:/out -v <refdir>:/refdir -v <tempdir>:/tempdir fq2bam:latest bash run_bwa.sh sortedbam /input/config.yaml
Note:
<inputdir>: Location of the local directory containing read files read1 & read2
<refdir>: Location of the local directory containing reference sequence file ref
<outdir>: Location of the local directory for output files SAM/BAM
<tempdir>: Location of the local directory for temporary files (defaults to <outdir>)
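The mount layout described in these notes can be assembled in a small script. The sketch below only builds and prints the command; the host paths are placeholders (assumptions), and pointing the temporary directory at the output directory when TEMP_DIR is unset mirrors the documented default.

```shell
#!/usr/bin/env bash
# Placeholder host paths (assumptions); point these at your own directories.
# Docker bind mounts need absolute paths, hence the $PWD prefix.
INPUT_DIR="${INPUT_DIR:-$PWD/input}"   # read1/read2 and config.yaml
REF_DIR="${REF_DIR:-$PWD/ref}"         # reference genome and its index
OUT_DIR="${OUT_DIR:-$PWD/out}"         # output SAM/BAM files
TEMP_DIR="${TEMP_DIR:-$OUT_DIR}"       # temporary files, defaults to OUT_DIR

docker_cmd=(docker run
  -v "$INPUT_DIR:/input"
  -v "$OUT_DIR:/out"
  -v "$REF_DIR:/refdir"
  -v "$TEMP_DIR:/tempdir"
  fq2bam:latest bash run_bwa.sh sortedbam /input/config.yaml)

# Print the command for review; to execute it, run: "${docker_cmd[@]}"
printf '%s\n' "${docker_cmd[*]}"
```

Keeping the command in an array preserves paths that contain spaces when it is eventually executed.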
Use Source Code
Installation:
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz
tar -xzf Source_code_with_submodules.tar.gz
cd Open-Omics-Acceleration-Framework/pipelines/fq2sortedbam/
bash install.sh <onprem/cloud> ## onprem mode: manually install the dependencies listed in basic_setup_ubuntu.sh, as this needs sudo access
Setup Input Parameters:
Set up ./config.yaml (described below) with appropriate values
Run:
bash run_bwa.sh sortedbam ./config.yaml
General Notes:
- Individual pipeline tools are present in the applications folder
- To understand the various parameters of these tools, you can access their man pages
- You can set up the parameters of these tools using the params variable in config.yaml
Setup config.yaml [1]:
- bwa: bwa-mem2 related parameters
  - dindex: dtype=bool, values="True/False", if "True" it creates .fai index files for the input read files
  - params: dtype=string, the command-line parameters for the bwa-mem2 mapping run, e.g. '+R "@RG\tID:RG1\tSM:RGSN1"'
  - rindex: dtype=bool, values="True/False", if "True" it creates the bwa-mem2 index for the reference genome
- dataset: data details
  - index: dtype=string, input reference genome file name, e.g. "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
  - input: dtype=string, folder location of the input read files, e.g. "/input"
  - outfile: dtype=string, output SAM/BAM file name(s), e.g. short.se.sam. Default value: "final_fq2bam"
  - output: dtype=string, folder location of the output SAM/BAM files, e.g. "/out"
  - read1: dtype=string, input reads file1 name, e.g. "HG001.novaseq.pcr-free.30x.R1.fastq.gz"
  - read2: dtype=string, input reads file2 name, e.g. "HG001.novaseq.pcr-free.30x.R2.fastq.gz"
  - read_type: dtype=string, values=short/long, short read mapping uses bwa-mem2, long read mapping uses mm2-fast
  - refdir: dtype=string, folder location of the reference genome and its index files, e.g. "/refdir"
  - tempdir: dtype=string, folder location for storing intermediate files, e.g. "/out/". If not set, the output folder is used
- fqprocess: custom mode variables
  - bam_size: dtype=int, value=5
  - barcode_orientation: dtype=string, values=FIRST_BP_RC
  - output_format: dtype=string, value=FASTQ
  - prefix: dtype=string, value=multiome-practice-may15_arcgtf
  - read3: dtype=string, value=''
  - read_structure: dtype=string, value=16C
  - readi1: dtype=string, value=''
  - sample_id: dtype=int, value=''
  - suffix: dtype=string, value=trimmed_adapters.fastq.gz
  - whitelist: dtype=string, value=whitelist.txt
- mm2: mm2-fast related parameters
  - params: dtype=string, the command-line parameters for the mm2-fast mapping run, e.g. ' -ax map-hifi '
[1] Parameters in bold are mandatory ones
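As a starting point, the parameters listed in this section can be written out as a config.yaml for a short-read sortedbam run under Docker. The sketch below is hedged: the nesting of keys under bwa, dataset and mm2 is inferred from the grouping of the list, the boolean values are quoted as strings following the quoted form in the parameter description, and the fqprocess group is left out because it belongs to the custom modes. Compare it against the config.yaml shipped with the release before relying on it.

```shell
#!/usr/bin/env bash
# Write an example config.yaml; the target path is a placeholder (assumption).
cfg="${1:-$(mktemp -d)/config.yaml}"

# The quoted 'EOF' keeps backslashes such as \t literal in the file.
cat > "$cfg" <<'EOF'
bwa:
  dindex: "False"
  params: '+R "@RG\tID:RG1\tSM:RGSN1"'
  rindex: "True"
dataset:
  index: "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
  input: "/input"
  outfile: "final_fq2bam"
  output: "/out"
  read1: "HG001.novaseq.pcr-free.30x.R1.fastq.gz"
  read2: "HG001.novaseq.pcr-free.30x.R2.fastq.gz"
  read_type: "short"
  refdir: "/refdir"
  tempdir: "/tempdir"
mm2:
  params: ' -ax map-hifi '
EOF

echo "wrote $cfg"
```

The folder values match the container-side mount points used in the Docker run command; for a source-code run they would be host paths instead.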