A Comprehensive Guide to a Hybrid Assembly Pipeline for Genomic Sequencing

Introduction
Microbes, life’s unseen workhorses, hold immense potential for bioremediation, medicine, and understanding our planet. Yet their inner workings remain largely unknown. Microbial genomics addresses this by decoding their genetic content, and genome sequencing has transformed our understanding of microbial diversity and function.
In this guide, we’ll walk through a hybrid assembly pipeline that combines Oxford Nanopore (ONT) long reads with Illumina short reads, using tools for quality control, assembly, polishing, and assessment.
Requirements
Using Anaconda for Package Management
Anaconda - simplifies installing and managing bioinformatics tools.
conda create --name myenv
conda activate myenv
Quality Control Tools
- NanoPlot - visualize quality metrics of long reads (ONT/PacBio).
- FastQC - generate quality reports for short reads (Illumina).
- Filtlong - filter and trim long reads by quality and length.
Assembly
- Flye - long-read assembler optimized for ONT and PacBio data.
Polishing
- Medaka - ONT neural-network-based polishing tool.
- BWA - short-read aligner for mapping Illumina reads to assemblies.
- Polypolish - polishing tool that uses short-read alignments to correct assembly errors.
Assembly Assessment
- BUSCO - assess genome completeness using universal single-copy orthologs.
- QUAST - evaluate assembly quality with metrics like N50, misassemblies, and GC content.
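Before starting, it can help to confirm that the tools listed above are actually on the PATH of the active environment. The helper below is a minimal sketch; the function name `missing_tools` is hypothetical, and the command names in the usage comment are assumptions based on typical Bioconda installs.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing name per line and returns 1 if any are missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example (command names are assumptions, adjust to your installation):
# missing_tools NanoPlot fastqc filtlong flye bwa polypolish quast prokka
```

Since this guide activates separate environments for Medaka and BUSCO, those two are probably best checked after switching to their own environments.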
Hybrid Assembly Pipeline
Quality Control
For Long Reads (ONT):
mkdir -p nanoplot_lr_raw
NanoPlot --fastq LR_input.fastq --N50 --verbose --outdir nanoplot_lr_raw/ -t 8
For Short Reads (Illumina):
mkdir -p fastqc_reports
fastqc SR_input_1.fastq SR_input_2.fastq -o fastqc_reports/ -t 8
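Alongside the FastQC reports, a quick consistency check on the paired files can catch truncated downloads early. The sketch below assumes uncompressed FASTQ files with exactly four lines per read; the function names are hypothetical.

```shell
# Count records in an uncompressed FASTQ file, assuming four lines per read.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print both counts and succeed only if the two files hold the same number of reads.
check_pair_counts() {
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "R1=$r1 R2=$r2"
    [ "$r1" -eq "$r2" ]
}

# Example:
# check_pair_counts SR_input_1.fastq SR_input_2.fastq || echo "Read counts differ"
```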
Filter Long Reads
filtlong -1 SR_input_1.fastq -2 SR_input_2.fastq --min_length 1000 --keep_percent 90 LR_input.fastq > LR_filtered.fastq
mkdir -p nanoplot_lr_filtered
NanoPlot --fastq LR_filtered.fastq --N50 --verbose --outdir nanoplot_lr_filtered/ -t 8
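In addition to the second NanoPlot report, the effect of `--min_length 1000` can be spot-checked directly. The following is a rough sketch, assuming an uncompressed four-line-per-record FASTQ; `count_reads_shorter_than` is a hypothetical helper name.

```shell
# Count reads in an uncompressed FASTQ whose sequence is shorter than min_len.
# Assumes four lines per record, with the sequence on the second line.
count_reads_shorter_than() {
    awk -v min="$2" 'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }' "$1"
}

# Example:
# count_reads_shorter_than LR_input.fastq 1000      # reads that the length filter targets
# count_reads_shorter_than LR_filtered.fastq 1000   # expected to be 0 if filtering worked
```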
Long Reads Assembly (Flye)
flye --nano-raw LR_filtered.fastq --out-dir Flye/ --threads 8 --scaffold -g 6m
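Flye writes a per-contig summary next to the assembly, which is a convenient place to see whether a chromosome or plasmid was closed. The sketch below assumes the file is `Flye/assembly_info.txt`, tab-separated, with name, length, coverage, and a circularity flag in the first four columns ("Y" in recent releases, "+" in older ones); check the header line of your own file before relying on it. The function name is hypothetical.

```shell
# Print name, length, and coverage of contigs flagged as circular.
# Assumes a tab-separated file whose first four columns are
# seq_name, length, coverage, and a circularity flag ("Y" or "+").
list_circular_contigs() {
    awk -F '\t' '!/^#/ && ($4 == "Y" || $4 == "+") { print $1 "\t" $2 "\t" $3 }' "$1"
}

# Example:
# list_circular_contigs Flye/assembly_info.txt
```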
First Polishing (Medaka)
conda activate medaka
medaka_consensus -i LR_filtered.fastq -d Flye/assembly.fasta -o Polish1/ -m r941_min_fast_g303 -t 8
Second Polishing (Polypolish)
mkdir -p Polish2
bwa index Polish1/consensus.fasta
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_1.fastq > Polish2/alignments_1.sam
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_2.fastq > Polish2/alignments_2.sam
polypolish_insert_filter.py --in1 Polish2/alignments_1.sam --in2 Polish2/alignments_2.sam --out1 Polish2/filtered_1.sam --out2 Polish2/filtered_2.sam
polypolish Polish1/consensus.fasta Polish2/filtered_1.sam Polish2/filtered_2.sam > final_assembly.fasta
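Polishing should change individual bases, not the overall shape of the assembly, so comparing contig counts and total lengths across the three stages is a cheap sanity check. This is a minimal sketch with a hypothetical helper name; it prints the number of contigs and the total number of bases, separated by a tab.

```shell
# Print "<contig count><TAB><total bases>" for a FASTA file.
fasta_summary() {
    awk '/^>/ { n++; next }
         { gsub(/[ \t\r]/, ""); bases += length($0) }
         END { printf "%d\t%d\n", n, bases }' "$1"
}

# Example:
# for f in Flye/assembly.fasta Polish1/consensus.fasta final_assembly.fasta; do
#     printf '%s\t' "$f"; fasta_summary "$f"
# done
```

Large differences in contig count or total length between stages would suggest that a polishing step did not run on the intended input.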
Assembly Quality Assessment
conda activate busco
busco -i final_assembly.fasta -l bacteria_odb10 -m genome -o busco_final_assembly
quast -o quast_final_assembly -t 8 final_assembly.fasta
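To make the N50 value in the QUAST report less of a black box, here is a small sketch of how it can be computed by hand: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `fasta_n50` is hypothetical.

```shell
# Print the N50 of a FASTA file: the length L such that contigs of length >= L
# together cover at least half of the total assembly length.
fasta_n50() {
    # First awk: emit one length per contig (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second awk: walk the sorted lengths until half the total is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 sum += lens[i]
                 if (sum * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Example:
# fasta_n50 final_assembly.fasta
```

The result may differ slightly from QUAST when the assembly contains very short contigs, because QUAST ignores contigs below a minimum length by default.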
Annotation for 16S rRNA
Prokka
conda install -c conda-forge -c bioconda prokka
prokka --outdir prokka_annotation --prefix final_assembly final_assembly.fasta
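Once Prokka has finished, the 16S rRNA genes can be pulled out of its GFF output. The sketch below assumes the GFF is written to `prokka_annotation/final_assembly.gff` and that rRNA features carry a `product=16S ribosomal RNA` attribute; both are assumptions worth verifying against your own output. The function name is hypothetical.

```shell
# List contig, start, end, and strand of features annotated as 16S rRNA.
# Assumes a tab-separated GFF3 with feature type in column 3 and
# a "product=16S ribosomal RNA" entry in the attributes column.
list_16s_features() {
    awk -F '\t' '$3 == "rRNA" && $9 ~ /product=16S ribosomal RNA/ {
        print $1 "\t" $4 "\t" $5 "\t" $7
    }' "$1"
}

# Example:
# list_16s_features prokka_annotation/final_assembly.gff
```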
RAST
Submit the genome via the RAST website for functional annotation.
Bottom Line
This hybrid assembly pipeline, coupled with annotation tools like Prokka and RAST, gives researchers a practical route from raw reads to an annotated microbial genome. Combining long and short reads with QC and assessment at each stage helps produce a reliable, accurate representation of the genome.