Skip to content

Data processing workflow

This section describes the tools and workflows used to process the data, for reproducibility purpose.

Genome assembly

Genome assembly pipeline used with Illumina data only

Data QC & cleaning DNAseq reads

1- DNAseq reads quality control (FastQC v.0.11.9)

fastqc $DNAseq_R1.fq.gz \
        $DNAseq_R2.fq.gz

2- DNAseq reads cleaning (Trimmomatic v0.36)

trimmomatic PE -threads 4 -phred33 \
        $DNAseq_R1.fq.gz \
        $DNAseq_R2.fq.gz \
        $paired_clean_DNAseq_R1.fq.gz \
        $unpaired_clean_DNAseq_R1.fq.gz \
        $paired_clean_DNAseq_R2.fq.gz \
        $unpaired_clean_DNAseq_R2.fq.gz \
        ILLUMINACLIP:adapters.fa:2:30:10 LEADING:24 TRAILING:24 MINLEN:50 AVGQUAL:28

3- Cleaned DNAseq reads quality control (FastQC v.0.11.9)

fastqc  $paired_clean_DNAseq_R1.fq.gz \
        $unpaired_clean_DNAseq_R1.fq.gz \
        $paired_clean_DNAseq_R2.fq.gz \
        $unpaired_clean_DNAseq_R2.fq.gz

Fast metagenome assembly

SPAdes v3.12.0

metaspades.py -1 $paired_clean_DNAseq_R1.fq.gz -2 $paired_clean_DNAseq_R2.fq.gz -o $assemblyDIR --restart-from k55 -k 77,91 --tmp-dir $TMP

Removing reads of bacterial contaminant

blobtools v1.1.1; diamond v.2.0.9; bowtie v2.4.1; samtools v1.9

1- Preparing diamond databases and taxonomy files

  • files from NCBI: nr.fsa; nt.fsa; nodes.dmp; names.dmp; prot.accession2taxid
  • files from UNIPROT: uniprot_sprot.fasta; uniprot_trembl.fasta; idmapping.dat
cat uniprot_sprot.fasta uniprot_trembl.fasta > uniprot_ref_proteomes.fasta
cat uniprot_ref_proteomes.fasta | sed -r 's/(^>sp\|)|(^>tr\|)/>/g' | cut -f 1 -d "|" > temp; mv temp uniprot_ref_proteomes.fasta
grep "NCBI_TaxID" idmapping.dat > uniprot_ref_proteomes.NCBI-taxids
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nr.fsa --taxonmap prot.accession2taxid --taxonnodes nodes.dmp -d ./nr
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nt.fsa -d ./nt
diamond makedb --threads $SLURM_CPUS_PER_TASK --in uniprot_ref_proteomes.fasta -d uniprot_ref_proteomes

2- Mapping back DNAseq on metagenome assembly

bowtie2-build --threads $SLURM_CPUS_PER_TASK --large-index -f $MetaGenome.fa $MetaGenome
bowtie2 --threads $SLURM_CPUS_PER_TASK  -x $MetaGenome -q -1 $paired_clean_DNAseq_R1.fq.gz -2 $paired_clean_DNAseq_R2.fq.gz -S $Mapping_on_MetaGenome.sam
samtools view -S -b $Mapping_on_MetaGenome.sam | samtools sort -o $Mapping_on_MetaGenome.bam

3- Blast against uniprot

diamond blastx --sensitive -p 8 -q $MetaGenome.fa --outfmt 6 --out $Uniprotblast --max-target-seqs 1 --db $DBDIR/uniprot_ref_proteomes.dmnd --evalue 1e-25
blobtools taxify -f $UniprotblastTAX --taxid_mapping_file $DBDIR/uniprot_ref_proteomes.NCBI-taxids --map_col_sseqid 0 --map_col_taxid 2

4- Blast against nr

diamond blastx --sensitive -p $SLURM_CPUS_PER_TASK -q $MetaGenome.fa --outfmt 102 -o $NRblastTAX --taxonmap $DBDIR/prot.accession2taxid.gz --taxonnodes $DBDIR/nodes.dmp -d $DBDIR/nr

5- Megablast against nt

blastn -task megablast -query $MetaGenome.fa -db $DBDIR/nt -outfmt '6 qseqid staxids bitscore std' -max_target_seqs 1 -max_hsps 1 -num_threads 8 -evalue 1e-25 -out $NTmegablast

6- Blobtools database constuction

blobtools create -i $Genome -o $sample -t $NTmegablast -t $NRblastTAX -t $Uniprotblast -t $UniprotblastTAX -b $Mapping_on_MetaGenome.bam --nodes $dbDir/nodes.dmp --names $dbDir/names.dmp --db $dbDir"/nod\

7- Contig taxonomy assignation and generation of plot at "superkingdom" ranks

blobtools view --rank 'superkingdom' -i $sample".blobDB.json" -o $sample".superkingdom"
blobtools plot --rank 'superkingdom' -i $sample".blobDB.json" -o $sample".superkingdom"
blobtools plot --rank 'superkingdom' --multiplot -i $sample".blobDB.json" -o $sample".superkingdom"

8- Contig taxonomy assignation and generation of plot at "phylum" ranks

blobtools view -i $sample".blobDB.json" -o $sample".phylum"
blobtools plot -i $sample".blobDB.json" -o $sample".phylum"
blobtools plot --multiplot -i $sample".blobDB.json" -o $sample".phylum"

9- Extraction of Illumina DNAseq reads that map onto contig from selected phylum

grep -v "#" $CONTIG | awk '{if ( ($6=="Eukaryota")|| ($6=="undef") ) print $1 }' > $TOKEEP
blobtools bamfilter --read_format fq --noninterleaved --exclude_unmapped --bam $Mapping_on_MetaGenome.bam --include $TOKEEP --out $sample"_BLOBcleaned_Eukaryota+undef"

Genome assembly

SPAdes v.3.15.2

spades.py --threads $SLURM_CPUS_PER_TASK --memory $SLURM_MEM_PER_NODE \
        --pe1-1 $sample"_BLOBcleaned_Eukaryota+undef.ExIn.1.fq" \
        --pe1-2 $sample"_BLOBcleaned_Eukaryota+undef.InIn.1.fq" \
        --pe2-1 $sample"_BLOBcleaned_Eukaryota+undef.ExIn.2.fq" \
        --pe2-2 $sample"_BLOBcleaned_Eukaryota+undef.InIn.2.fq" \
        -k 21,51,81,111 -o $sample --tmp-dir $TMP

Genome assembly QC

QUAST v.5.0.2; BUSCO v.3.0.2

quast.py -t $SLURM_CPUS_PER_TASK -o "QUASTv5_"$genome $genome
run_BUSCO.py -i $genome -o "BUSCOv3-geno-euk09_"$genome -l $BUSCOdir/eukaryota_odb9 -m geno -c $SLURM_CPUS_PER_TASK

Genome assembly pipeline used with PacBio data

Illumina data QC & cleaning DNAseq reads

1- DNAseq reads quality control (FastQC v.0.11.9)

fastqc $DNAseq_R1.fq.gz \
        $DNAseq_R2.fq.gz

2- DNAseq reads cleaning (Trimmomatic v0.36)

trimmomatic PE -threads 4 -phred33 \
        $DNAseq_R1.fq.gz \
        $DNAseq_R2.fq.gz \
        $paired_clean_DNAseq_R1.fq.gz \
        $unpaired_clean_DNAseq_R1.fq.gz \
        $paired_clean_DNAseq_R2.fq.gz \
        $unpaired_clean_DNAseq_R2.fq.gz \
        ILLUMINACLIP:adapters.fa:2:30:10 LEADING:24 TRAILING:24 MINLEN:50 AVGQUAL:28

3- Cleaned DNAseq reads quality control (FastQC v.0.11.9)

fastqc  $paired_clean_DNAseq_R1.fq.gz \
        $unpaired_clean_DNAseq_R1.fq.gz \
        $paired_clean_DNAseq_R2.fq.gz \
        $unpaired_clean_DNAseq_R2.fq.gz

PacBio data QC & cleaning DNAseq reads

QUAST v.5.0.2; blobtools v1.1.1; diamond v.2.0.9; bowtie v2.4.1; samtools v1.9

1- PacBio

2- Preparing diamond databases and taxonomy files

  • files from NCBI: nr.fsa; nt.fsa; nodes.dmp; names.dmp; prot.accession2taxid
  • files from UNIPROT: uniprot_sprot.fasta; uniprot_trembl.fasta; idmapping.dat
cat uniprot_sprot.fasta uniprot_trembl.fasta > uniprot_ref_proteomes.fasta
cat uniprot_ref_proteomes.fasta | sed -r 's/(^>sp\|)|(^>tr\|)/>/g' | cut -f 1 -d "|" > temp; mv temp uniprot_ref_proteomes.fasta
grep "NCBI_TaxID" idmapping.dat > uniprot_ref_proteomes.NCBI-taxids
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nr.fsa --taxonmap prot.accession2taxid --taxonnodes nodes.dmp -d ./nr
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nt.fsa -d ./nt
diamond makedb --threads $SLURM_CPUS_PER_TASK --in uniprot_ref_proteomes.fasta -d uniprot_ref_proteomes

Genome assembly

CANU v.2.0 ; FLYE v2.6

1- Assembly with CANU

canu useGrid=true gridEngine=slurm gridEngineThreadsOption=THREADS gridEngineMemoryOption=MEMORY gridOptions="-p long" \
         -p $sample"_canu" \
         -d $wkgDIR genomeSize=75m \ 
         -pacbio $cleaned_subreads

2- Assembly with FLYE

flye --threads $SLURM_CPUS_PER_TASK --pacbio-raw $cleaned_subreads --out-dir $wkgDIR --genome-size 75m --iterations 2

3- Genome assemblies QC (QUAST v.5.0.2; BUSCO v.3.0.2) and selection of best assembly

quast.py -t $SLURM_CPUS_PER_TASK -o "QUASTv5_"$genome $genome
run_BUSCO.py -i $genome -o "BUSCOv3-geno-euk09_"$genome -l $BUSCOdir/eukaryota_odb9 -m geno -c $SLURM_CPUS_PER_TASK

Genome assembly polishing and sequencing error correction

racon v.1.4.20; pilon v.1.23; minimap2 v.2.18; bowtie v2.4.1; samtools v1.9

1- Iterations of genome assembly polishing

minimap2 -x map-pb -t $SLURM_CPUS_PER_TASK $genome $cleaned_subreads > $PAFmapping1
racon --threads $SLURM_CPUS_PER_TASK $cleaned_subreads $PAFmapping1 $genome > $genome2
minimap2 -x map-pb -t $SLURM_CPUS_PER_TASK $genome2 $cleaned_subreads > $PAFmapping2
racon --threads $SLURM_CPUS_PER_TASK $cleaned_subreads $PAFmapping2 $genome2 > $genome3
minimap2 -x map-pb -t $SLURM_CPUS_PER_TASK $genome3 $cleaned_subreads > $PAFmapping3
racon --threads $SLURM_CPUS_PER_TASK $cleaned_subreads $PAFmapping3 $genome3 > $genome4

2- Sequencing error correction using Illumina DNAseq

bowtie2-build --threads $SLURM_CPUS_PER_TASK -f $genome4 $genome4INDEX
bowtie2 --threads $SLURM_CPUS_PER_TASK -x $genome4INDEX -q -1 $illuminaR1 -2 $illuminaR2 -S $SAMgenome4
samtools view -S -b $SAMgenome4 | samtools sort -O BAM -o $BAMgenome4
samtools index $BAMgenome4
java -Xmx100G -jar pilon-1.23.jar --genome $genome4 --frags $BAMgenome4 --vcf --tracks --threads 4 --output $FINALgenome

Genome assembly QC

QUAST v.5.0.2; BUSCO v.3.0.2

quast.py -t $SLURM_CPUS_PER_TASK -o "QUASTv5_"$FINALgenome $FINALgenome
run_BUSCO.py -i $FINALgenome -o "BUSCOv3-geno-euk09_"$FINALgenome -l $BUSCOdir/eukaryota_odb9 -m geno -c $SLURM_CPUS_PER_TASK

Genome annotation

1- Transposon PSI (v1.0.0)

transposonPSI.pl $GENOME nuc

2- RepeatScout (v1.0.6)

build_lmer_table -sequence $GENOME -freq $LMER_FREQ
RepeatScout -sequence $GENOME -output $OUT -freq $LMER_FREQ

3- RepeatMasker (v4.0.9)

RepeatMasker -pa 8 -qq -nocut -dir $DIR -xsmall -lib repeats.fasta -gff $GFF

4- Trimmomatic (v0.39)

trimmomatic PE -threads 6 -trimlog $index.LogFile -summary $index.statsSummaryFile -quiet -validatePairs $FQ1 $FQ2 -baseout ${index}trimmed  ILLUMINACLIP:contaminants2trimm.fa:2:30:10 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

5- Hisat2 (v2.2.1)

hisat2 -p 4 -x $GENOME_INDEX -1 $FORWARD -2 $REVERSE -U $UNPAIR1,$UNPAIR2 | samtools view -@ 6 -bh | samtools sort -@ 6 >$OUT

6- BRAKER2 (v2.1.6)

braker.pl --species $SPECIES --genome $GENOME.masked \
--bam $BAM1 $BAM2
--prot_seq=$SPECIES_training.pep.fasta --prg=gth --gth2traingenes --gff3 --cores 8 --softmasking --useexisting