WGS Data Analysis

In this tutorial, we present the main tools and steps needed to run a de novo assembly of WGS data.

 

Data

We analyze WGS data obtained by paired-end sequencing of the genomes of microbes present in milk and sewage samples:

  • sewage_R1.fastq.gz
  • sewage_R2.fastq.gz
  • milk_R1.fastq.gz
  • milk_R2.fastq.gz

 

1. Checking the quality of the reads:

 

First of all, we need to explore and inspect the reads to figure out which kind of pre-processing should be carried out to improve their overall quality and usability. For instance, the adapters used for sequencing must be removed before any downstream analysis, as they do not belong to the genomes of the microbes. The reads we are going to use have already had their adapters removed by the sequencing facility. Nevertheless, we present here the quality control tool falco, which can be used as follows:

 

falco milk_R1.fastq.gz

 

This command outputs .txt and .html files summarizing different aspects of the reads, such as the presence/absence of adapters, read lengths, etc. We recommend opening the .html file in a browser, as it contains graphs that are easy to interpret (a description of each graph/metric can be found here).
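
To check all four files at once, a simple shell loop can be used. This is only a minimal sketch, assuming the four file names listed above and falco's FastQC-style -o/--outdir option to keep one report directory per file:

for f in milk_R1 milk_R2 sewage_R1 sewage_R2; do
    mkdir -p "falco_${f}"                    # one report directory per input file
    falco -o "falco_${f}" "${f}.fastq.gz"    # writes the .txt and .html reports there
done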

 

2. Filtering out of short reads:

 

The longer the reads, the more accurate the assembly will be. As a rule of thumb, reads shorter than 50 bp should be discarded. BBTools is a suite of tools widely used for WGS data analysis. Here, we use bbduk.sh to filter out short reads as follows:

 

bbduk.sh in=milk_R1.fastq.gz in2=milk_R2.fastq.gz out1=filtered_milk_R1.fastq.gz out2=filtered_milk_R2.fastq.gz minlen=50 (to do also for sewage)

 

The minimum read length is specified by the "minlen" argument.

 

3. Removal of the host genome:

 

Milk and sewage samples also contain other genomes, coming, for instance, from cow cells. These eukaryotic genomes must be removed so that only the microbial genomes are kept:

 

1. We need to download the reference cow genome:

 

wget "https://hgdownload.soe.ucsc.edu/goldenPath/bosTau9/bigZips/bosTau9.fa.gz"  > bosTau9.fa.gz

 

2. We need to index the cow genome using bowtie2, then map the filtered reads to it:

 

bowtie2-build --large-index --threads 30 bosTau9.fa.gz indexed_genome

bowtie2 --time --threads 30 -x indexed_genome -1 filtered_milk_R1.fastq.gz -2 filtered_milk_R2.fastq.gz -S milk.sam (to do also for filtered_sewage R1 and R2)

 

3. Convert SAM files (Sequence Alignment Map) to BAM (Binary Alignment Map) using samtools:

 

samtools view -bS milk.sam > milk.bam (to do also for sewage)

 

4. Sort the BAM files because sorted files are parsed (read) faster, which expedites the downstream analysis:

 

samtools sort milk.bam -o milk_sorted.bam (to do also for sewage)

 

5. Extract the reads that were not mapped to the cow genome. These reads correspond to the microbial genome:

 

samtools view -b -f 12 -F 256 milk_sorted.bam > milk_microbial_reads.bam (to do also for sewage)

 

-f 12 keeps only the reads for which both the read and its mate are unmapped (flag 4 + flag 8 = 12), i.e. read pairs where neither end mapped to the cow genome.

-F 256 excludes reads carrying flag 256, i.e. secondary (non-primary) alignments.

 

You can find the meaning of the flags on this website.
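
To decode a flag value directly from the command line, you can also use samtools' built-in helper (shown here as a quick check; the exact output formatting may vary between samtools versions):

samtools flags 12     # prints UNMAP,MUNMAP: read unmapped and mate unmapped
samtools flags 256    # prints SECONDARY: not a primary alignment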

 

6. Now, we can split the microbial reads into forward and reverse files to restore the initial format (R1.fastq.gz and R2.fastq.gz):

 

samtools fastq -@ 30 milk_microbial_reads.bam -1 milk_clean_R1.fastq.gz -2 milk_clean_R2.fastq.gz -n (to do also for sewage)
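
Since every command in this section has to be repeated for the sewage sample, the whole host-removal pipeline can be wrapped in a single loop. This is a minimal sketch that simply chains the commands above, assuming the filtered_*_R1/R2.fastq.gz files produced by bbduk.sh in section 2 and the bowtie2 index indexed_genome:

for s in milk sewage; do
    # map the filtered reads against the cow genome
    bowtie2 --time --threads 30 -x indexed_genome \
            -1 "filtered_${s}_R1.fastq.gz" -2 "filtered_${s}_R2.fastq.gz" -S "${s}.sam"
    # convert to BAM, sort, and keep only read pairs where neither mate mapped
    samtools view -bS "${s}.sam" > "${s}.bam"
    samtools sort "${s}.bam" -o "${s}_sorted.bam"
    samtools view -b -f 12 -F 256 "${s}_sorted.bam" > "${s}_microbial_reads.bam"
    # back to paired FASTQ files
    samtools fastq -@ 30 "${s}_microbial_reads.bam" \
            -1 "${s}_clean_R1.fastq.gz" -2 "${s}_clean_R2.fastq.gz" -n
done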

 

Assembly: 

 

Now that the raw data are processed, we can start the assembly of the reads. Two widely used tools for de novo assembly are MEGAHIT and metaSPAdes. MEGAHIT is faster than metaSPAdes and requires less memory; metaSPAdes, on the other hand, is slower but generally yields higher-quality assemblies. Since we only have two samples to assemble (one milk and one sewage), we can use metaSPAdes.

 

1. Just run the assembly on the pre-processed reads (>= 50 bp, and cleaned of the cow genome):

 

metaspades.py -t 30 -1 milk_clean_R1.fastq.gz -2 milk_clean_R2.fastq.gz -o milk (to do also for sewage)

 

2. As with reads, the longer the contigs, the better the assembly. So, we need to filter out short contigs (< 1000 bp):

 

bbduk.sh in=scaffolds.fasta out=scaffolds_1000.fasta minlen=1000 (to do for milk and sewage scaffolds.fasta)

 

scaffolds.fasta is the output of metaspades.py.

 

3. Once we get the contigs (i.e. scaffolds), we can check the quality of the assembly using QUAST:

 

quast.py scaffolds.fasta -o quality_reports (to do for milk and sewage)

 

Binning

 

Binning tries to group contigs according to their organism of origin.

 

1. Beforehand, we need to index the contigs:

 

bowtie2-build --large-index --threads 30 milk/scaffolds.fasta indexed_milk (to do also for sewage)

 

2. Then, we map the reads to the contigs:

 

bowtie2 --time --threads 30 -x indexed_milk -1 milk_clean_R1.fastq.gz -2 milk_clean_R2.fastq.gz -S milk_binning.sam (to do also for sewage)

samtools view -bS milk_binning.sam > milk_binning.bam (to do also for sewage)

samtools sort --write-index milk_binning.bam -o milk_binning_sorted.bam (to do also for sewage)

 

3. Now, we can start binning using MetaBAT:

 

jgi_summarize_bam_contig_depths milk_binning_sorted.bam --outputDepth milk.dep (to do also for sewage)

metabat2 -t 30 --verbose -i milk/scaffolds_1000.fasta -a milk.dep -m 1500 -o bins/milk (to do also for sewage)
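
All of the binning commands above (indexing, mapping, depth computation, and MetaBAT2) can be wrapped in one loop for both samples. This is only a sketch that chains the commands already given, under the same file-name assumptions as in the previous sections:

for s in milk sewage; do
    bowtie2-build --large-index --threads 30 "${s}/scaffolds.fasta" "indexed_${s}"
    bowtie2 --time --threads 30 -x "indexed_${s}" \
            -1 "${s}_clean_R1.fastq.gz" -2 "${s}_clean_R2.fastq.gz" -S "${s}_binning.sam"
    samtools view -bS "${s}_binning.sam" > "${s}_binning.bam"
    samtools sort --write-index "${s}_binning.bam" -o "${s}_binning_sorted.bam"
    # per-contig coverage, then binning with MetaBAT2
    mkdir -p bins    # make sure the output directory exists
    jgi_summarize_bam_contig_depths "${s}_binning_sorted.bam" --outputDepth "${s}.dep"
    metabat2 -t 30 --verbose -i "${s}/scaffolds_1000.fasta" -a "${s}.dep" -m 1500 -o "bins/${s}"
done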

 

Note: Bin refinement can be done when several binning algorithms are used: it merges the outputs of these algorithms to obtain refined bins. Here, we only used one binning algorithm, so we cannot run bin refinement. Bin curation: users can manually intervene to further improve the (refined) bins using tools like KBase, Galaxy, and Anvi'o.

 

4. Check binning quality:

 

checkm2 predict --threads 30 --extension fa --input *.fa --output-directory quality (to do for milk and sewage)

 

We keep only bins with Completeness > 90% and Contamination < 5%. For instance, 100% Completeness means that the bin (the draft genome) contains all the genes that should be there, and 2% Contamination means that 2% of the genes present belong to another organism(s).
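
A quick way to list the bins that pass these thresholds is a small awk filter on the checkm2 report. This is only a sketch: it assumes the usual layout of quality_report.tsv, whose first three columns are Name, Completeness, and Contamination; check the header of your report before relying on the column positions.

# keep bin names with Completeness > 90 and Contamination < 5
awk -F'\t' 'NR > 1 && $2 > 90 && $3 < 5 {print $1}' quality/quality_report.tsv > bins_to_keep.txt

Note that dRep (next step) can apply the same thresholds itself through its -comp and -con options.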
 

5. Bin dereplication: some bins may be very similar because they represent the same organism. To reduce redundancy, we can perform bin dereplication, which merges very similar bins, using dRep:

 

dRep dereplicate . -p 30 -comp 90 -con 5 -g bins/*.fa --genomeInfo quality_report.csv (quality_report.csv is a CSV with the columns genome, completeness, and contamination; it can be derived from the quality_report.tsv created by checkm2)

 

Annotation:

Now that we have the draft genomes (i.e. bins), we can perform taxonomic and functional annotation in order to determine, respectively, which organisms the bins represent and which functions their genes encode.

 

1. Taxonomic annotation can be run using GTDB-Tk:

 

gtdbtk classify_wf --genome_dir bins/dereplicated_genomes --extension "fa" --out_dir . --cpus 30

 

2. Functional annotation using DRAM:

 

DRAM-setup.py prepare_databases --output_dir dram_db

DRAM.py annotate --threads 30 -i 'dereplicated_genomes/*.fa' -o annotation