Data processing workflow
This section describes the tools and workflows used to process the data, for reproducibility purpose.
Genome assembly
Genome assembly pipeline used with Illumina data only
Data QC & cleaning DNAseq reads
1- DNAseq reads quality control (FastQC v.0.11.9)
fastqc $DNAseq_R1.fq.gz \
$DNAseq_R2.fq.gz
2- DNAseq reads cleaning (Trimmomatic v0.36)
trimmomatic PE -threads 4 -phred33 \
$DNAseq_R1.fq.gz \
$DNAseq_R2.fq.gz \
$paired_clean_DNAseq_R1.fq.gz \
$unpaired_clean_DNAseq_R1.fq.gz \
$paired_clean_DNAseq_R2.fq.gz \
$unpaired_clean_DNAseq_R2.fq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:24 TRAILING:24 MINLEN:50 AVGQUAL:28
3- Cleaned DNAseq reads quality control (FastQC v.0.11.9)
fastqc $paired_clean_DNAseq_R1.fq.gz \
$unpaired_clean_DNAseq_R1.fq.gz \
$paired_clean_DNAseq_R2.fq.gz \
$unpaired_clean_DNAseq_R2.fq.gz
Fast metagenome assembly
SPAdes v3.12.0
metaspades.py -1 $paired_clean_DNAseq_R1.fq.gz -2 $paired_clean_DNAseq_R2.fq.gz -o $assemblyDIR --restart-from k55 -k 77,91 --tmp-dir $TMP
Removing reads of bacterial contaminant
blobtools v1.1.1; diamond v.2.0.9; bowtie v2.4.1; samtools v1.9
1- Preparing diamond databases and taxonomy files
- files from NCBI:
nr.fsa
;nt.fsa
;nodes.dmp
;names.dmp
;prot.accession2taxid
- files from UNIPROT:
uniprot_sprot.fasta
;uniprot_trembl.fasta
;idmapping.dat
cat uniprot_sprot.fasta uniprot_trembl.fasta > uniprot_ref_proteomes.fasta
cat uniprot_ref_proteomes.fasta | sed -r 's/(^>sp\|)|(^>tr\|)/>/g' | cut -f 1 -d "|" > temp; mv temp uniprot_ref_proteomes.fasta
grep "NCBI_TaxID" idmapping.dat > uniprot_ref_proteomes.NCBI-taxids
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nr.fsa --taxonmap prot.accession2taxid --taxonnodes nodes.dmp -d ./nr
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nt.fsa -d ./nt
diamond makedb --threads $SLURM_CPUS_PER_TASK --in uniprot_ref_proteomes.fasta -d uniprot_ref_proteomes
2- Mapping back DNAseq on metagenome assembly
bowtie2-build --threads $SLURM_CPUS_PER_TASK --large-index -f $MetaGenome.fa $MetaGenome
bowtie2 --threads $SLURM_CPUS_PER_TASK -x $MetaGenome -q -1 $paired_clean_DNAseq_R1.fq.gz -2 $paired_clean_DNAseq_R2.fq.gz -S $Mapping_on_MetaGenome.sam
samtools view -S -b $Mapping_on_MetaGenome.sam | samtools sort -o $Mapping_on_MetaGenome.bam
3- Blast against uniprot
diamond blastx --sensitive -p 8 -q $MetaGenome.fa --outfmt 6 --out $Uniprotblast --max-target-seqs 1 --db $DBDIR/uniprot_ref_proteomes.dmnd --evalue 1e-25
blobtools taxify -f $UniprotblastTAX --taxid_mapping_file $DBDIR/uniprot_ref_proteomes.NCBI-taxids --map_col_sseqid 0 --map_col_taxid 2
4- Blast against nr
diamond blastx --sensitive -p $SLURM_CPUS_PER_TASK -q $MetaGenome.fa --outfmt 102 -o $NRblastTAX --taxonmap $DBDIR/prot.accession2taxid.gz --taxonnodes $DBDIR/nodes.dmp -d $DBDIR/nr
5- Megablast against nt
blastn -task megablast -query $MetaGenome.fa -db $DBDIR/nt -outfmt '6 qseqid staxids bitscore std' -max_target_seqs 1 -max_hsps 1 -num_threads 8 -evalue 1e-25 -out $NTmegablast
6- Blobtools database constuction
blobtools create -i $Genome -o $sample -t $NTmegablast -t $NRblastTAX -t $Uniprotblast -t $UniprotblastTAX -b $Mapping_on_MetaGenome.bam --nodes $dbDir/nodes.dmp --names $dbDir/names.dmp --db $dbDir"/nod\
7- Contig taxonomy assignation and generation of plot at "superkingdom" ranks
blobtools view --rank 'superkingdom' -i $sample".blobDB.json" -o $sample".superkingdom"
blobtools plot --rank 'superkingdom' -i $sample".blobDB.json" -o $sample".superkingdom"
blobtools plot --rank 'superkingdom' --multiplot -i $sample".blobDB.json" -o $sample".superkingdom"
8- Contig taxonomy assignation and generation of plot at "phylum" ranks
blobtools view -i $sample".blobDB.json" -o $sample".phylum"
blobtools plot -i $sample".blobDB.json" -o $sample".phylum"
blobtools plot --multiplot -i $sample".blobDB.json" -o $sample".phylum"
9- Extraction of Illumina DNAseq reads that map onto contig from selected phylum
grep -v "#" $CONTIG | awk '{if ( ($6=="Eukaryota")|| ($6=="undef") ) print $1 }' > $TOKEEP
blobtools bamfilter --read_format fq --noninterleaved --exclude_unmapped --bam $Mapping_on_MetaGenome.bam --include $TOKEEP --out $sample"_BLOBcleaned_Eukaryota+undef"
Genome assembly
SPAdes v.3.15.2
spades.py --threads $SLURM_CPUS_PER_TASK --memory $SLURM_MEM_PER_NODE \
--pe1-1 $sample"_BLOBcleaned_Eukaryota+undef.ExIn.1.fq" \
--pe1-2 $sample"_BLOBcleaned_Eukaryota+undef.InIn.1.fq" \
--pe2-1 $sample"_BLOBcleaned_Eukaryota+undef.ExIn.2.fq" \
--pe2-2 $sample"_BLOBcleaned_Eukaryota+undef.InIn.2.fq" \
-k 21,51,81,111 -o $sample --tmp-dir $TMP
Genome assembly QC
QUAST v.5.0.2; BUSCO v.3.0.2
quast.py -t $SLURM_CPUS_PER_TASK -o "QUASTv5_"$genome $genome
run_BUSCO.py -i $genome -o "BUSCOv3-geno-euk09_"$genome -l $BUSCOdir/eukaryota_odb9 -m geno -c $SLURM_CPUS_PER_TASK
Genome assembly pipeline used with PacBio data
Illumina data QC & cleaning DNAseq reads
1- DNAseq reads quality control (FastQC v.0.11.9)
fastqc $DNAseq_R1.fq.gz \
$DNAseq_R2.fq.gz
2- DNAseq reads cleaning (Trimmomatic v0.36)
trimmomatic PE -threads 4 -phred33 \
$DNAseq_R1.fq.gz \
$DNAseq_R2.fq.gz \
$paired_clean_DNAseq_R1.fq.gz \
$unpaired_clean_DNAseq_R1.fq.gz \
$paired_clean_DNAseq_R2.fq.gz \
$unpaired_clean_DNAseq_R2.fq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:24 TRAILING:24 MINLEN:50 AVGQUAL:28
3- Cleaned DNAseq reads quality control (FastQC v.0.11.9)
fastqc $paired_clean_DNAseq_R1.fq.gz \
$unpaired_clean_DNAseq_R1.fq.gz \
$paired_clean_DNAseq_R2.fq.gz \
$unpaired_clean_DNAseq_R2.fq.gz
PacBio data QC & cleaning DNAseq reads
QUAST v.5.0.2; blobtools v1.1.1; diamond v.2.0.9; bowtie v2.4.1; samtools v1.9
1- PacBio
2- Preparing diamond databases and taxonomy files
- files from NCBI:
nr.fsa
;nt.fsa
;nodes.dmp
;names.dmp
;prot.accession2taxid
- files from UNIPROT:
uniprot_sprot.fasta
;uniprot_trembl.fasta
;idmapping.dat
cat uniprot_sprot.fasta uniprot_trembl.fasta > uniprot_ref_proteomes.fasta
cat uniprot_ref_proteomes.fasta | sed -r 's/(^>sp\|)|(^>tr\|)/>/g' | cut -f 1 -d "|" > temp; mv temp uniprot_ref_proteomes.fasta
grep "NCBI_TaxID" idmapping.dat > uniprot_ref_proteomes.NCBI-taxids
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nr.fsa --taxonmap prot.accession2taxid --taxonnodes nodes.dmp -d ./nr
diamond makedb --threads $SLURM_CPUS_PER_TASK --in nt.fsa -d ./nt
diamond makedb --threads $SLURM_CPUS_PER_TASK --in uniprot_ref_proteomes.fasta -d uniprot_ref_proteomes
Genome assembly
CANU v.2.0 ; FLYE v2.6
1- Assembly with CANU
canu useGrid=true gridEngine=slurm gridEngineThreadsOption=THREADS gridEngineMemoryOption=MEMORY gridOptions="-p long" \
-p $sample"_canu" \
-d $wkgDIR genomeSize=75m \
-pacbio $cleaned_subreads
2- Assembly with FLYE
flye --threads $SLURM_CPUS_PER_TASK --pacbio-raw $cleaned_subreads --out-dir $wkgDIR --genome-size 75m --iterations 2
3- Genome assemblies QC (QUAST v.5.0.2; BUSCO v.3.0.2) and selection of best assembly
quast.py -t $SLURM_CPUS_PER_TASK -o "QUASTv5_"$genome $genome
run_BUSCO.py -i $genome -o "BUSCOv3-geno-euk09_"$genome -l $BUSCOdir/eukaryota_odb9 -m geno -c $SLURM_CPUS_PER_TASK
Genome assembly polishing and sequencing error correction
racon v.1.4.20; pilon v.1.23; minimap2 v.2.18; bowtie v2.4.1; samtools v1.9
1- Iterations of genome assembly polishing
minimap2 -x map-pb -t $SLURM_CPUS_PER_TASK $genome $cleaned_subreads > $PAFmapping1
racon --threads $SLURM_CPUS_PER_TASK $cleaned_subreads $PAFmapping1 $genome > $genome2
minimap2 -x map-pb -t $SLURM_CPUS_PER_TASK $genome2 $cleaned_subreads > $PAFmapping2
racon --threads $SLURM_CPUS_PER_TASK $cleaned_subreads $PAFmapping2 $genome2 > $genome3
minimap2 -x map-pb -t $SLURM_CPUS_PER_TASK $genome3 $cleaned_subreads > $PAFmapping3
racon --threads $SLURM_CPUS_PER_TASK $cleaned_subreads $PAFmapping3 $genome3 > $genome4
2- Sequencing error correction using Illumina DNAseq
bowtie2-build --threads $SLURM_CPUS_PER_TASK -f $genome4 $genome4INDEX
bowtie2 --threads $SLURM_CPUS_PER_TASK -x $genome4INDEX -q -1 $illuminaR1 -2 $illuminaR2 -S $SAMgenome4
samtools view -S -b $SAMgenome4 | samtools sort -O BAM -o $BAMgenome4
samtools index $BAMgenome4
java -Xmx100G -jar pilon-1.23.jar --genome $genome4 --frags $BAMgenome4 --vcf --tracks --threads 4 --output $FINALgenome
Genome assembly QC
QUAST v.5.0.2; BUSCO v.3.0.2
quast.py -t $SLURM_CPUS_PER_TASK -o "QUASTv5_"$FINALgenome $FINALgenome
run_BUSCO.py -i $FINALgenome -o "BUSCOv3-geno-euk09_"$FINALgenome -l $BUSCOdir/eukaryota_odb9 -m geno -c $SLURM_CPUS_PER_TASK
Genome annotation
1- Transposon PSI (v1.0.0)
transposonPSI.pl $GENOME nuc
2- RepeatScout (v1.0.6)
build_lmer_table -sequence $GENOME -freq $LMER_FREQ
RepeatScout -sequence $GENOME -output $OUT -freq $LMER_FREQ
3- RepeatMasker (v4.0.9)
RepeatMasker -pa 8 -qq -nocut -dir $DIR -xsmall -lib repeats.fasta -gff $GFF
4- Trimmomatic (v0.39)
trimmomatic PE -threads 6 -trimlog $index.LogFile -summary $index.statsSummaryFile -quiet -validatePairs $FQ1 $FQ2 -baseout ${index}trimmed ILLUMINACLIP:contaminants2trimm.fa:2:30:10 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
5- Hisat2 (v2.2.1)
hisat2 -p 4 -x $GENOME_INDEX -1 $FORWARD -2 $REVERSE -U $UNPAIR1,$UNPAIR2 | samtools view -@ 6 -bh | samtools sort -@ 6 >$OUT
6- BRAKER2 (v2.1.6)
braker.pl --species $SPECIES --genome $GENOME.masked \
--bam $BAM1 $BAM2
--prot_seq=$SPECIES_training.pep.fasta --prg=gth --gth2traingenes --gff3 --cores 8 --softmasking --useexisting