A Comprehensive Guide to a Hybrid Assembly Pipeline for Genomic Sequencing

Jan 11, 2024·
Yassir Boulaamane

Introduction

Microbes, life’s unseen workhorses, hold immense potential for bioremediation, medicine, and understanding our planet. Yet, their intricate workings remain largely a mystery. This is where microbial genomics steps in, offering a powerful tool to decode their genetic language. Genomic sequencing has revolutionized our understanding of microbial diversity and function.

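In this guide, we’ll walk through a hybrid assembly pipeline that combines long reads (ONT), which span repeats and give a contiguous assembly, with short reads (Illumina), whose higher per-base accuracy helps correct the remaining errors. Along the way we’ll use tools for quality control, assembly, polishing, and assessment.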


Requirements

Using Anaconda for Package Management

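Anaconda - simplifies installing and managing bioinformatics tools. The commands below create and activate an empty environment named myenv; the tools listed in the following sections still need to be installed into it, or into their own environments (this guide later activates separate medaka and busco environments).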

conda create --name myenv
conda activate myenv

Quality Control Tools

  • NanoPlot - visualize quality metrics of long reads (ONT/PacBio).
  • FastQC - generate quality reports for short reads (Illumina).
  • Filtlong - filter and trim long reads by quality and length.

Assembly

  • Flye - long-read assembler optimized for ONT and PacBio data.

Polishing

  • Medaka - ONT neural-network-based polishing tool.
  • BWA - short-read aligner for mapping Illumina reads to assemblies.
  • Polypolish - polishing tool that uses short-read alignments to correct assembly errors.

Assembly Assessment

  • BUSCO - assess genome completeness using universal single-copy orthologs.
  • QUAST - evaluate assembly quality with metrics like N50, misassemblies, and GC content.
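Once the tools listed in this section are installed, it can save time to confirm they are actually on PATH before starting a long run. The sketch below is one way to do that; check_tools is a hypothetical helper name, and the executable names in the usage comment are taken from the commands used later in this guide.

```shell
# Report any required executables that are not on PATH.
# Usage (names as invoked later in this guide):
#   check_tools NanoPlot fastqc filtlong flye bwa polypolish quast
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Exit status is 0 only when every tool was found.
    return "$missing"
}
```

Since this guide activates separate medaka and busco environments, medaka_consensus and busco would need to be checked after activating the matching environment.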

Hybrid Assembly Pipeline

Quality Control

For Long Reads (ONT):

mkdir -p nanoplot_lr_raw
NanoPlot --fastq LR_input.fastq --N50 --verbose --outdir nanoplot_lr_raw/ -t 8

For Short Reads (Illumina):

mkdir -p fastqc_reports
fastqc SR_input_1.fastq SR_input_2.fastq -o fastqc_reports/ -t 8
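Paired-end files are expected to contain the same number of reads, and a mismatch usually points to a truncated download or copy. As a quick sanity check alongside FastQC, something like the following could be used. It is a minimal sketch that assumes uncompressed FASTQ with exactly four lines per record; count_fastq_reads and check_pairs are hypothetical helper names.

```shell
# Count records in an uncompressed FASTQ file (assumes 4 lines per record).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Print both counts and succeed only when the two mate files agree.
check_pairs() {
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "R1: $r1 reads, R2: $r2 reads"
    [ "$r1" = "$r2" ]
}
```

For the files in this guide, the call would be check_pairs SR_input_1.fastq SR_input_2.fastq. A fractional count means the file does not have a whole number of four-line records.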


Filter Long Reads

filtlong -1 SR_input_1.fastq -2 SR_input_2.fastq --min_length 1000 --keep_percent 90 LR_input.fastq > LR_filtered.fastq
mkdir -p nanoplot_lr_filtered
NanoPlot --fastq LR_filtered.fastq --N50 --verbose --outdir nanoplot_lr_filtered/ -t 8
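NanoPlot gives the full picture, but a one-line summary is handy for confirming that filtering behaved as expected (no read shorter than 1000 bp) and that enough data remains for assembly. The sketch below assumes uncompressed four-line FASTQ and takes the expected genome size in bases as its second argument; fastq_depth is a hypothetical helper name, and depth is simply total bases divided by genome size.

```shell
# Print read count, total bases, shortest read, and approximate depth
# for an uncompressed 4-line FASTQ.
# $1 = FASTQ file, $2 = expected genome size in bases (e.g. 6000000 for 6 Mb)
fastq_depth() {
    awk -v g="$2" '
        NR % 4 == 2 {                      # sequence line of each record
            len = length($0)
            bases += len
            reads++
            if (min == "" || len < min) min = len
        }
        END {
            printf "reads=%d bases=%.0f min_len=%d depth=%.1fx\n",
                   reads, bases, min, bases / g
        }' "$1"
}
```

With the 6 Mb genome size used for Flye in this guide, the call would be fastq_depth LR_filtered.fastq 6000000.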

Long Reads Assembly (Flye)

flye --nano-raw LR_filtered.fastq --out-dir Flye/ --threads 8 --scaffold -g 6m
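After Flye finishes, it is worth looking at how many contigs came out and whether the chromosome closed into a circle. Flye writes a per-contig table (assembly_info.txt) into its output directory. The sketch below assumes that file is tab-separated, starts with a "#" header line, and has the contig name, length, coverage, and circularity flag in its first four columns; the exact layout can differ between Flye versions, so treat the column positions as an assumption. summarize_flye_info is a hypothetical helper name.

```shell
# Summarize Flye's assembly_info.txt as: name, length, coverage, topology.
# Assumed columns: 1=seq_name 2=length 3=cov. 4=circ. ("Y"/"N"; "+" is also
# accepted here in case an older release used "+"/"-").
summarize_flye_info() {
    awk -F '\t' '
        !/^#/ && NF >= 4 {
            topo = ($4 == "Y" || $4 == "+") ? "circular" : "linear"
            printf "%s\t%s\t%sx\t%s\n", $1, $2, $3, topo
        }' "$1"
}
```

With the output directory used above, the call would be summarize_flye_info Flye/assembly_info.txt.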

First Polishing (Medaka)

conda activate medaka
medaka_consensus -i LR_filtered.fastq -d Flye/assembly.fasta -o Polish1/ -m r941_min_fast_g303 -t 8

Second Polishing (Polypolish)

mkdir -p Polish2
bwa index Polish1/consensus.fasta
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_1.fastq > Polish2/alignments_1.sam
bwa mem -t 8 -a Polish1/consensus.fasta SR_input_2.fastq > Polish2/alignments_2.sam

polypolish_insert_filter.py \
  --in1 Polish2/alignments_1.sam \
  --in2 Polish2/alignments_2.sam \
  --out1 Polish2/filtered_1.sam \
  --out2 Polish2/filtered_2.sam

polypolish Polish1/consensus.fasta \
  Polish2/filtered_1.sam \
  Polish2/filtered_2.sam > final_assembly.fasta
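Polishing should change individual bases and small indels, not the overall shape of the assembly, so the contig count should stay the same and the total length should move only slightly between stages. A minimal sketch for comparing the three FASTA files produced so far follows; fasta_summary is a hypothetical helper name.

```shell
# Print the number of contigs and total sequence length of a FASTA file.
fasta_summary() {
    awk '
        /^>/ { contigs++; next }          # header line starts a new contig
        { bases += length($0) }           # sequence lines add to the total
        END { printf "contigs=%d bases=%.0f\n", contigs, bases }' "$1"
}
```

Running it on Flye/assembly.fasta, Polish1/consensus.fasta, and final_assembly.fasta in turn gives three lines to compare; a dropped contig or a large change in length suggests a step went wrong.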

Assembly Quality Assessment

conda activate busco
busco -i final_assembly.fasta -l bacteria_odb10 -m genome -o busco_final_assembly

quast -o quast_final_assembly -t 8 final_assembly.fasta
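To make the N50 figure in the QUAST report less of a black box, here is how it can be computed by hand: sort contigs from longest to shortest, then report the length of the contig at which the running total first reaches half of the assembly size. This is an illustrative sketch, not QUAST's own code; fasta_n50 is a hypothetical helper name.

```shell
# Compute N50 of a FASTA file.
fasta_n50() {
    # Step 1: emit one length per contig (sequences may span several lines).
    awk '
        /^>/ { if (len) print len; len = 0; next }
        { len += length($0) }
        END { if (len) print len }' "$1" |
    # Step 2: longest first.
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered.
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= total / 2) { print lens[i]; exit }
            }
        }'
}
```

For example, contigs of 5, 4, 3, 2, and 1 bases total 15; the running sum reaches 9 (more than 7.5) at the second contig, so N50 is 4.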

Annotation for 16S rRNA

Prokka

conda install -c conda-forge -c bioconda prokka
prokka --outdir prokka_annotation --prefix final_assembly final_assembly.fasta
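With the --outdir and --prefix values above, Prokka writes its annotation to prokka_annotation/final_assembly.gff. To locate the 16S rRNA genes in that file, a filter along these lines could be used. It is a sketch that assumes rRNA features carry a product attribute containing "16S ribosomal RNA" and that the GFF ends with an embedded ##FASTA section; list_16s is a hypothetical helper name.

```shell
# List 16S rRNA features from a Prokka GFF as: contig, start, end, strand.
list_16s() {
    awk -F '\t' '
        /^##FASTA/ { exit }               # stop before the embedded sequences
        $3 == "rRNA" && $9 ~ /product=16S ribosomal RNA/ {
            printf "%s\t%s\t%s\t%s\n", $1, $4, $5, $7
        }' "$1"
}
```

The call would be list_16s prokka_annotation/final_assembly.gff; the reported coordinates can then be used to pull the sequence out of final_assembly.fasta for taxonomic identification.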

RAST

Submit the genome via the RAST website for functional annotation.


Bottom Line

This hybrid assembly pipeline, coupled with annotation tools like Prokka and RAST, gives researchers a practical route from raw reads to an annotated microbial genome. Combining long and short reads with QC at each step and a final assessment with BUSCO and QUAST helps produce a reliable, accurate assembly.