TANGO: Taxonomic Assignment in Metagenomics

Copyright © 2012-2014 Gabriel Valiente

Original TANGO code © 2010 José C. Clemente, Jesper Jansson, Gabriel Valiente
Revised TANGO code © 2012 Daniel Alonso-Alemany, Aurélien Barré, Stefano Beretta, Paola Bonizzoni, Macha Nikolski, Gabriel Valiente

Latest update:

NCBI release 2013/10/29 (832,581 reference sequences)
RDP release 11.1 (2,872,235 reference sequences)
Greengenes release 13.5 (1,262,986 reference sequences)

INPUT

The input consists of

  1. A choice of reference taxonomy (NCBI, RDP, or Greengenes) for the taxonomic assignment
  2. One or more text files containing the parsed output of a mapping program, one read per line, in the format “read_id” “species_id_1” ... “species_id_n”, where all “species_id” values are NCBI, RDP, or Greengenes identifiers. Each file is submitted together with a sample name and the reference database (NCBI, RDP, or Greengenes) to which the reads were mapped
  3. A value for the “q” parameter, which balances the taxonomic assignment between precision (q = 0) and recall (q = 1), with q = 0.5 corresponding to the F-measure (harmonic mean of precision and recall)
  4. The desired level of detail for the taxonomy table (1 = Kingdom, 2 = Phylum, 3 = Class, 4 = Order, 5 = Family, 6 = Genus, 7 = Species)

Example:

Global Parameters

Reference Taxonomy: NCBI
q: 0.5 (for the F-measure)
Detail Level: 3 (for a Class taxonomy table)
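
The role of q can be illustrated with a small Python sketch. It assumes a penalty of the form q * FN/TP + (1 - q) * FP/TP, modeled on the published description of TANGO rather than on anything stated on this page. The candidate nodes and their counts are hypothetical.

```python
def penalty_score(tp: int, fp: int, fn: int, q: float) -> float:
    """Assumed penalty form: q * FN/TP + (1 - q) * FP/TP (lower is better)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if tp <= 0:
        return float("inf")  # a node covering no matches is never preferred
    return q * fn / tp + (1.0 - q) * fp / tp


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0


# Hypothetical candidate nodes for one read: (TP, FP, FN) counts.
candidates = {
    "genus": (3, 0, 2),
    "family": (4, 2, 1),
    "order": (5, 20, 0),
}


def best_node(q: float) -> str:
    """Return the candidate with the lowest penalty for the given q."""
    return min(candidates, key=lambda name: penalty_score(*candidates[name], q))
```

Under this assumed form, q = 0 penalizes only false positives (favoring precision) and q = 1 penalizes only false negatives (favoring recall). At q = 0.5 the penalty equals (FP + FN) / (2 TP), which is 1/F - 1, so minimizing it is the same as maximizing the F-measure.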

Sample 1

Name: Sample 1
Reference database: RDP
Mapping File 1:
EKQJ6TS02JZIAQ S000381989 S000381993 S000413971 S000381991 S000014058 S000127286 S000381980 S000381981
               S000381988 S000381990 S000413959 S000414717
EKQJ6TS02FMP8N S000544250 S000381286 S000128654 S000544216 S000544225 S000439343 S000544219 S000439345
               S000002270 S000005595 S000439340 S000544257 S000570539 S000544224 S000544227 S000001861
               S000544239 S000367091 S000016032 S000351476 S000351464 S000462003 S000021808 S000544764
               S000627041 S000544255 S000084736 S000129856 S000003179 S000544258 S000544259 S000426477
               S000438872 S000544223 S000544231 S000003569 S000005908 S000614212 S000544215 S000544214
               S000544237 S000544254 S000428473 S000428468
EKQJ6TS02JKNY4 S000000438
...

Sample 2

Name: Sample 2
Reference database: RDP
Mapping File 2:
EZ0R7OU01EGCC0 S000414599
EZ0R7OU02HXMV6 S000383173 S000539267
EZ0R7OU01EWYUU S000530395
...

Sample 3

Name: Sample 3
Reference database: RDP
Mapping File 3:
D4WT9DQ09FQQZP S000504763 S000381395 S000001859 S000015791 S000015790 S000006319 S000381397 S000381398
               S000015432 S000389072 S000381394 S000004412 S000010893 S000381399 S000322431 S000503245
               S000503244 S000494000 S000002814 S000000841 S000381396 S000022475 S000008975 S000013156
D4WT9DQ09FL3D4 S000352704 S000436053 S000420325
D4WT9DQ09FP8T0 S000541016
...
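
Mapping files such as those above can be read with a short Python sketch like the following. It assumes whitespace-separated fields and treats indented lines as continuations of the previous read, as in the wrapped display above; an actual mapping file may simply hold one read per line. The function name `parse_mapping` is hypothetical.

```python
from typing import Dict, Iterable, List


def parse_mapping(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each read_id to the list of species identifiers it matched."""
    matches: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and the "..." elision used in the examples.
        if not line.strip() or line.strip() == "...":
            continue
        fields = line.split()
        if line[0].isspace():
            # Assumption: an indented line continues the previous read.
            if current is None:
                raise ValueError("continuation line before any read_id")
            matches[current].extend(fields)
        else:
            current = fields[0]
            matches.setdefault(current, []).extend(fields[1:])
    return matches
```

For example, `parse_mapping(open("sample1.txt"))` would return a dictionary keyed by read_id, where `sample1.txt` is a placeholder file name.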

OUTPUT

The output consists of

  1. A text file for each input file containing, for each line of the input file with at least one match, the read_id, the penalty score of the taxonomic assignment, and the lineage in the chosen reference taxonomy to which the read was assigned with optimal precision and recall
  2. A taxonomy table showing, for each input file, the number of reads assigned to each taxon, down to the chosen level of detail
  3. A UniFrac table showing the UniFrac distance between each pair of samples

Example:

Sample 1

EKQJ6TS02JZIAQ k_Bacteria;p_Firmicutes;c_Bacilli;o_Bacillales;f_Staphylococcaceae;g_Staphylococcus;S000414717
EKQJ6TS02FMP8N k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Actinomycetales;f_Streptomycetaceae;g_Streptomyces;S000544227
EKQJ6TS02JKNY4 k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Actinomycetales;f_Micrococcaceae;g_Kocuria;S000000438
...

Sample 2

EZ0R7OU01EGCC0 k_Bacteria;p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_Lachnospiracea_incertae_sedis;S000414599
EZ0R7OU02HXMV6 k_Bacteria;p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_Blautia;S000539267
EZ0R7OU01EWYUU k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Bacteroidaceae;g_Bacteroides;S000530395
...

Sample 3

D4WT9DQ09FQQZP k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Xanthomonadales;f_Xanthomonadaceae;g_Pseudoxanthomonas;S000494000
D4WT9DQ09FL3D4 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Alteromonadales;f_Alteromonadaceae;g_Marinobacter;S000420325
D4WT9DQ09FP8T0 k_Bacteria;p_Proteobacteria;c_Alphaproteobacteria;o_Kordiimonadales;f_Kordiimonadaceae;g_Kordiimonas;S000541016
...

Taxonomy Table

Taxonomic Rank                Sample 1  Sample 2  Sample 3
k_Bacteria                        1000      1000      1000
  p_Actinobacteria                   5         2         9
    c_Actinobacteria                 5         2         9
  p_Bacteroidetes                            636        46
    c_Bacteroidia                            636
    c_Flavobacteria                                     46
  p_Firmicutes                     995       361         2
    c_Bacilli                        9         9         2
    c_Clostridia                   940       174
    c_Erysipelotrichia                       175
    c_Negativicutes                 46         3
  p_Lentisphaerae                                        1
    c_Lentisphaeria                                      1
  p_Proteobacteria                             1       942
    c_Alphaproteobacteria                              124
    c_Betaproteobacteria                       1        60
    c_Deltaproteobacteria                                2
    c_Gammaproteobacteria                              756
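
A table like the one above can be tallied from the per-read lineages. The following Python sketch is only an illustration of the counting, not the server's own profiling code. It assumes semicolon-separated lineages, as in the output examples, and the function names `tally` and `render` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def tally(lineages: Iterable[str], level: int) -> Counter:
    """Count reads under every taxon from rank 1 down to `level`."""
    counts: Counter = Counter()
    for lineage in lineages:
        ranks = tuple(lineage.split(";"))[:level]
        # A read counts once for each ancestor on its truncated lineage.
        for depth in range(1, len(ranks) + 1):
            counts[ranks[:depth]] += 1
    return counts


def render(counts: Counter) -> List[str]:
    """Format counts as indented rows, each parent before its children."""
    rows = []
    for path in sorted(counts):  # tuple order keeps subtrees contiguous
        indent = "  " * (len(path) - 1)
        rows.append(f"{indent}{path[-1]} {counts[path]}")
    return rows
```

With `level=3`, the three Sample 1 lineages shown earlier would give one row for k_Bacteria, one row per phylum, and one row per class.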

UniFrac Table

            Sample 2            Sample 3
Sample 1    0.895652173913044   0.980487804878049
Sample 2                        0.98
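
For intuition about the distances above, here is a minimal Python sketch of an unweighted UniFrac-style distance computed on the taxonomy itself. It assumes unit branch lengths and ignores read abundances. This page does not state which UniFrac variant the server computes, so the sketch should not be expected to reproduce the values in the table.

```python
from typing import Iterable, Set, Tuple


def branches(lineages: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Collect every branch (lineage prefix) observed in a sample."""
    seen: Set[Tuple[str, ...]] = set()
    for lineage in lineages:
        ranks = tuple(lineage.split(";"))
        for depth in range(1, len(ranks) + 1):
            seen.add(ranks[:depth])
    return seen


def unweighted_unifrac(sample_a: Iterable[str], sample_b: Iterable[str]) -> float:
    """Fraction of observed branches that occur in only one of the two samples."""
    a, b = branches(sample_a), branches(sample_b)
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)
```

Two samples with identical lineages have distance 0, and two samples that share no branch at all have distance 1.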