A Comprehensive Guide to a Hybrid Assembly Pipeline for Genomic Sequencing
Introduction
Microbial genomics provides a direct window into microbial diversity, function, and metabolic potential. Genomic sequencing has become a standard approach for characterizing both cultured and uncultured organisms.
This guide walks through a hybrid assembly pipeline that combines long reads (ONT) and short reads (Illumina), covering quality control, assembly, polishing, and assessment.
Requirements
Using Anaconda for Package Management
- Anaconda - simplifies installing and managing bioinformatics tools.
conda create --name myenv
conda activate myenv
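Before running the pipeline, it can help to confirm that each tool is actually on the PATH of the active environment. The following is a minimal sketch; the function name check_tools is hypothetical, and the executable names in the commented example are the usual ones but may differ depending on how the tools were installed.

```shell
# check_tools: report which of the named executables are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
# (Hypothetical helper for illustration; not part of any tool in this guide.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (commented out so the block has no side effects when sourced):
# check_tools NanoPlot fastqc filtlong flye bwa
```

Because Medaka and BUSCO are activated from their own environments later in this guide, they would need to be checked after switching to those environments.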
Quality Control Tools
- NanoPlot - visualize quality metrics of long reads (ONT/PacBio).
- FastQC - generate quality reports for short reads (Illumina).
- Filtlong - filter and trim long reads by quality and length.
Assembly
- Flye - long-read assembler optimized for ONT and PacBio data.
Polishing
- Medaka - ONT neural-network-based polishing tool.
- BWA - short-read aligner for mapping Illumina reads to assemblies.
- Polypolish - polishing tool that uses short-read alignments to correct assembly errors.
Assembly Assessment
- BUSCO - assess genome completeness using universal single-copy orthologs.
- QUAST - evaluate assembly quality with metrics like N50, misassemblies, and GC content.
Hybrid Assembly Pipeline
Quality Control
For Long Reads (ONT):
mkdir -p nanoplot_lr_raw
NanoPlot --fastq LR_input.fastq --N50 --verbose --outdir nanoplot_lr_raw/ -t 8
For Short Reads (Illumina):
mkdir -p fastqc_reports
fastqc SR_input_1.fastq SR_input_2.fastq -o fastqc_reports/ -t 8
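Paired-end files should contain the same number of reads; a mismatch usually points to a truncated or corrupted file and tends to cause problems at the alignment step. The sketch below shows one way to check this, assuming uncompressed FASTQ with the standard four lines per record. The function names are hypothetical.

```shell
# count_reads: number of reads in an uncompressed FASTQ file,
# assuming the standard four lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# check_pairs: print both counts and succeed only if they are equal.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    echo "R1=$r1 R2=$r2"
    [ "$r1" -eq "$r2" ]
}

# Example (commented out so the block has no side effects when sourced):
# check_pairs SR_input_1.fastq SR_input_2.fastq || echo "read counts differ" >&2
```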
Filter Long Reads
filtlong -1 SR_input_1.fastq -2 SR_input_2.fastq --min_length 1000 --keep_percent 90 LR_input.fastq > LR_filtered.fastq
mkdir -p nanoplot_lr_filtered
NanoPlot --fastq LR_filtered.fastq --N50 --verbose --outdir nanoplot_lr_filtered/ -t 8
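NanoPlot gives the full picture, but a few lines of awk are enough to cross-check the headline numbers before and after filtering. The sketch below assumes uncompressed four-line FASTQ records and prints the read count, total bases, and read-length N50. The function name fastq_stats is hypothetical.

```shell
# fastq_stats: print "reads=<n> bases=<n> n50=<n>" for an uncompressed FASTQ file.
# N50 here is the length L such that reads of length >= L together
# hold at least half of all bases.
fastq_stats() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 half = total / 2; run = 0; n50 = 0
                 # Walk lengths from longest to shortest until half the bases are covered.
                 for (i = 1; i <= NR; i++) {
                     run += len[i]
                     if (run >= half) { n50 = len[i]; break }
                 }
                 printf "reads=%d bases=%.0f n50=%d\n", NR, total, n50
             }'
}

# Example (commented out so the block has no side effects when sourced):
# fastq_stats LR_input.fastq
# fastq_stats LR_filtered.fastq
```

Dividing the reported bases by the expected genome size gives a rough estimate of sequencing depth, which is useful for judging whether filtering removed too much data.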
Long Reads Assembly (Flye)
flye --nano-raw LR_filtered.fastq --out-dir Flye/ --threads 8 --scaffold -g 6m
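Flye writes a per-contig summary alongside the assembly, and it is worth a glance before polishing: a bacterial chromosome that assembled cleanly usually shows up as a single circular contig. The sketch below assumes the summary file (Flye/assembly_info.txt) is tab-separated with the columns seq_name, length, cov., and circ. in that order, and a header line starting with '#'; check this against the file your Flye version produces. The function name is hypothetical.

```shell
# flye_summary: print name, length, coverage and circularity for each contig.
# Assumption: tab-separated columns seq_name, length, cov., circ. (in that order),
# with a header line starting with '#'.
flye_summary() {
    awk -F'\t' '!/^#/ {
        shape = ($4 == "Y") ? "circular" : "linear"
        printf "%s\t%s bp\t%sx\t%s\n", $1, $2, $3, shape
    }' "$1"
}

# Example (commented out so the block has no side effects when sourced):
# flye_summary Flye/assembly_info.txt
```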
First Polishing (Medaka)
conda activate medaka
medaka_consensus -i LR_filtered.fastq -d Flye/assembly.fasta -o Polish1/ -m r941_min_fast_g303 -t 8
Second Polishing (Polypolish)
mkdir -p Polish2
bwa index Polish1/consensus.fasta
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_1.fastq > Polish2/alignments_1.sam
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_2.fastq > Polish2/alignments_2.sam
polypolish_insert_filter.py --in1 Polish2/alignments_1.sam --in2 Polish2/alignments_2.sam --out1 Polish2/filtered_1.sam --out2 Polish2/filtered_2.sam
polypolish Polish1/consensus.fasta Polish2/filtered_1.sam Polish2/filtered_2.sam > final_assembly.fasta
Assembly Quality Assessment
conda activate busco
busco -i final_assembly.fasta -l bacteria_odb10 -m genome -o busco_final_assembly
quast -o quast_final_assembly -t 8 final_assembly.fasta
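As a quick sanity check that does not depend on QUAST, the contig count, total length, and GC content can be computed directly from the FASTA file. This is a minimal sketch with a hypothetical function name; it computes GC over all bases, so ambiguous bases such as N lower the percentage slightly, and the figure may differ a little from what QUAST reports.

```shell
# fasta_stats: print "contigs=<n> length=<n> gc=<pct>" for a FASTA file.
# GC is computed over all bases, including ambiguous ones.
fasta_stats() {
    awk '/^>/ { contigs++; next }
         {
             seq = toupper($0)
             total += length(seq)
             gc += gsub(/[GC]/, "", seq)   # gsub returns the number of replacements
         }
         END {
             pct = (total > 0) ? 100 * gc / total : 0
             printf "contigs=%d length=%.0f gc=%.2f\n", contigs, total, pct
         }' "$1"
}

# Example (commented out so the block has no side effects when sourced):
# fasta_stats Flye/assembly.fasta
# fasta_stats final_assembly.fasta
```

Running it on the assembly before and after polishing shows whether polishing changed the total length, which should shift only slightly if the polishing steps behaved as expected.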
Annotation for 16S rRNA
Prokka
conda install -c conda-forge -c bioconda prokka
prokka --outdir prokka_annotation --prefix final_assembly final_assembly.fasta
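Since this step targets the 16S rRNA gene, the relevant sequences can be pulled out of Prokka's nucleotide FASTA of predicted features (the .ffn file in the output directory). The sketch below assumes the rRNA records carry "16S ribosomal RNA" in the header line, which is how Prokka typically labels them; verify this against your own output. The function name extract_16s is hypothetical.

```shell
# extract_16s: print FASTA records whose header mentions "16S ribosomal RNA".
# Assumption: headers look like ">LOCUS_00042 16S ribosomal RNA".
extract_16s() {
    awk '/^>/ { keep = ($0 ~ /16S ribosomal RNA/) } keep' "$1"
}

# Example (commented out so the block has no side effects when sourced):
# extract_16s prokka_annotation/final_assembly.ffn > 16S.fasta
```

Genomes often carry several rRNA operons, so more than one record may be returned.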
RAST
Submit the genome via the RAST website for functional annotation.
Bottom Line
Combining long and short reads with thorough QC and polishing produces reliable, high-quality genome assemblies. Prokka and RAST provide complementary annotation, covering both gene prediction and functional assignment.