A Comprehensive Guide to a Hybrid Assembly Pipeline for Genomic Sequencing

Jan 11, 2024·
Yassir Boulaamane
· 2 min read

Introduction

Microbial genomics provides a direct window into microbial diversity, function, and metabolic potential. Genomic sequencing has become a standard approach for characterizing both cultured and uncultured organisms.

This guide walks through a hybrid assembly pipeline that combines long reads (ONT) and short reads (Illumina), covering quality control, assembly, polishing, and assessment.


Requirements

Using Anaconda for Package Management

Anaconda - simplifies installing and managing bioinformatics tools.

conda create --name myenv
conda activate myenv

Quality Control Tools

  • NanoPlot - visualize quality metrics of long reads (ONT/PacBio).
  • FastQC - generate quality reports for short reads (Illumina).
  • Filtlong - filter and trim long reads by quality and length.

Assembly

  • Flye - long-read assembler optimized for ONT and PacBio data.

Polishing

  • Medaka - ONT neural-network-based polishing tool.
  • BWA - short-read aligner for mapping Illumina reads to assemblies.
  • Polypolish - polishing tool that uses short-read alignments to correct assembly errors.

Assembly Assessment

  • BUSCO - assess genome completeness using universal single-copy orthologs.
  • QUAST - evaluate assembly quality with metrics like N50, misassemblies, and GC content.
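
Before starting, it can help to confirm that the tools listed above are actually reachable from the active environment. The following is a minimal sketch; `check_tools` is a hypothetical helper name, and the executable names in the example call are assumed to match a typical bioconda install. Since Medaka and BUSCO are activated in their own environments later in this guide, the check would need to be repeated inside each environment with the relevant subset of tools.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions, adjust to your install):
# check_tools NanoPlot fastqc filtlong flye bwa polypolish quast
```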

Hybrid Assembly Pipeline

Quality Control

For Long Reads (ONT):

mkdir -p nanoplot_lr_raw
NanoPlot --fastq LR_input.fastq --N50 --verbose --outdir nanoplot_lr_raw/ -t 8

For Short Reads (Illumina):

mkdir -p fastqc_reports
fastqc SR_input_1.fastq SR_input_2.fastq -o fastqc_reports/ -t 8
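
Because the two Illumina files are later aligned separately and then matched up as pairs, a quick check that both mates contain the same number of reads can catch truncated or mismatched files early. This is a sketch rather than part of FastQC; `count_fastq_reads` and `check_pairs` are hypothetical helper names, and the code assumes uncompressed FASTQ with exactly four lines per record.

```shell
# Count reads in an uncompressed FASTQ file (assumes 4 lines per record).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print both read counts; succeed only if the two mates agree.
check_pairs() {
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "R1: $r1 reads, R2: $r2 reads"
    [ "$r1" -eq "$r2" ]
}

# Example call:
# check_pairs SR_input_1.fastq SR_input_2.fastq || echo "read counts differ"
```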


Filter Long Reads

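# -1/-2 pass the Illumina reads as an external reference for scoring long reads;
# reads shorter than 1 kb are dropped and the best 90% of bases are kept
filtlong -1 SR_input_1.fastq -2 SR_input_2.fastq --min_length 1000 --keep_percent 90 LR_input.fastq > LR_filtered.fastq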
mkdir -p nanoplot_lr_filtered
NanoPlot --fastq LR_filtered.fastq --N50 --verbose --outdir nanoplot_lr_filtered/ -t 8
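
For a quick text-only comparison of the raw and filtered read sets, alongside the NanoPlot reports, the basic numbers can be computed directly from the FASTQ files. The sketch below assumes uncompressed FASTQ with four lines per record; `fastq_stats` is a hypothetical helper, not a NanoPlot or Filtlong feature.

```shell
# Print read count, total bases, and read-length N50 for a FASTQ file.
# N50 here is the length L such that reads of length >= L hold at least
# half of all bases.
fastq_stats() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 half = total / 2
                 n50 = 0
                 for (i = 1; i <= NR; i++) {
                     run += len[i]
                     if (run >= half) { n50 = len[i]; break }
                 }
                 printf "reads=%d bases=%.0f n50=%d\n", NR, total, n50
             }'
}

# Example calls:
# fastq_stats LR_input.fastq
# fastq_stats LR_filtered.fastq
```

After filtering, the read count should drop while the N50 stays the same or rises, since short reads are removed first.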

Long Reads Assembly (Flye)

# -g is the expected genome size (6 Mb here); adjust it for your organism
flye --nano-raw LR_filtered.fastq --out-dir Flye/ --threads 8 --scaffold -g 6m
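
The genome size passed with `-g` also gives a rough way to estimate long-read depth before committing to an assembly run: total read bases divided by the expected genome size. The sketch below illustrates that arithmetic; `estimate_depth` is a hypothetical helper, the genome size must be given in base pairs (6000000 for the `6m` used here), and the input is assumed to be uncompressed FASTQ with four lines per record.

```shell
# Estimate mean sequencing depth: total read bases / expected genome size (bp).
# Prints "NA" if the genome size is not a positive number.
estimate_depth() {
    awk -v g="$2" 'NR % 4 == 2 { total += length($0) }
                   END {
                       if (g > 0) printf "%.1f\n", total / g
                       else print "NA"
                   }' "$1"
}

# Example call:
# estimate_depth LR_filtered.fastq 6000000
```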

First Polishing (Medaka)

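conda activate medaka
# -m must match the flow cell and basecaller used for sequencing; r941_min_fast_g303 is one example
medaka_consensus -i LR_filtered.fastq -d Flye/assembly.fasta -o Polish1/ -m r941_min_fast_g303 -t 8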

Second Polishing (Polypolish)

mkdir -p Polish2
bwa index Polish1/consensus.fasta
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_1.fastq > Polish2/alignments_1.sam
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_2.fastq > Polish2/alignments_2.sam

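# Note: Polypolish v0.6.0 and later replace the two commands below with the
# subcommands `polypolish filter` and `polypolish polish`.

# Remove alignments whose insert size is inconsistent with the library
polypolish_insert_filter.py \
    --in1 Polish2/alignments_1.sam --in2 Polish2/alignments_2.sam \
    --out1 Polish2/filtered_1.sam --out2 Polish2/filtered_2.sam

# Polish the Medaka consensus with the filtered short-read alignments
polypolish Polish1/consensus.fasta Polish2/filtered_1.sam Polish2/filtered_2.sam > final_assembly.fasta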

Assembly Quality Assessment

conda activate busco
busco -i final_assembly.fasta -l bacteria_odb10 -m genome -o busco_final_assembly

quast -o quast_final_assembly -t 8 final_assembly.fasta
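
To make the headline QUAST metrics less of a black box, the sketch below computes contig count, total length, N50, and GC content directly from a FASTA file. It is an illustration of how these numbers are defined, not a replacement for QUAST, and its values may differ slightly from QUAST's (for example, QUAST applies a minimum contig length by default). `fasta_stats` is a hypothetical helper name.

```shell
# Print contig count, total length, N50, and GC% for a FASTA file.
# Handles wrapped sequence lines; counts G and C case-insensitively.
fasta_stats() {
    awk '/^>/ { if (seqlen > 0) print seqlen, gc; seqlen = 0; gc = 0; next }
         { seqlen += length($0); line = $0; gc += gsub(/[GCgc]/, "", line) }
         END { if (seqlen > 0) print seqlen, gc }' "$1" |
        sort -k1,1rn |
        awk '{ len[NR] = $1; total += $1; gc += $2 }
             END {
                 half = total / 2
                 n50 = 0
                 for (i = 1; i <= NR; i++) {
                     run += len[i]
                     if (run >= half) { n50 = len[i]; break }
                 }
                 pct = (total > 0) ? 100 * gc / total : 0
                 printf "contigs=%d length=%.0f n50=%d gc=%.1f\n", NR, total, n50, pct
             }'
}

# Example call:
# fasta_stats final_assembly.fasta
```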

Annotation for 16S rRNA

Prokka

conda install -c conda-forge -c bioconda prokka
prokka --outdir prokka_annotation --prefix final_assembly final_assembly.fasta
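
Prokka writes its nucleotide feature sequences to a `.ffn` file named after the prefix, which here would be `prokka_annotation/final_assembly.ffn`. The sketch below pulls out the 16S rRNA records from that file. It assumes the 16S features are labeled "16S ribosomal RNA" in the FASTA headers, which should be checked against the actual output; `extract_16s` is a hypothetical helper name.

```shell
# Print FASTA records whose header mentions "16S ribosomal RNA".
# The header label is an assumption about Prokka's product naming.
extract_16s() {
    awk '/^>/ { keep = ($0 ~ /16S ribosomal RNA/) } keep' "$1"
}

# Example call:
# extract_16s prokka_annotation/final_assembly.ffn > 16S_sequences.fasta
```

The resulting sequences can then be used for taxonomic identification, for example with a BLAST search against a 16S reference database.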

RAST

Submit the genome via the RAST website for functional annotation.


Bottom Line

Combining long and short reads with thorough QC and polishing produces reliable, high-quality genome assemblies. Prokka and RAST provide complementary annotation, covering both gene prediction and functional assignment.