PhyloSift Outputs – Understanding Probability Distributions

Investigating the PhyloSift output files today. I wanted to figure out how many reads PhyloSift was reporting as placed, and Aaron gave me this awesome set of Unix commands to run:

cut -f 1 sequence_taxa.txt > sequence_taxa_cut

sort sequence_taxa_cut > sequence_taxa_sort

uniq sequence_taxa_sort > sequence_taxa_uniq

wc -l sequence_taxa_uniq
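
The four steps above can also be chained into a single pipeline, which avoids the intermediate files. This is just a sketch: the function name count_unique_reads is made up for illustration, and the toy input below uses invented read names, not real PhyloSift output.

```shell
# Count the distinct values in column 1 of a tab-delimited file.
# Same steps as above: cut, sort, uniq (via sort -u), then wc -l.
count_unique_reads() {
    cut -f 1 "$1" | sort -u | wc -l | tr -d ' '
}

# Toy input with made-up read names (NOT real PhyloSift output).
demo=$(mktemp)
printf 'read1/1\tnodeA\nread1/1\tnodeB\nread2/1\tnodeA\n' > "$demo"
count_unique_reads "$demo"   # prints 2
rm -f "$demo"
```

On the real data this would be called as count_unique_reads sequence_taxa.txt. The tr at the end only strips the padding that some versions of wc add in front of the count.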

There is a discordance between the number of placed reads that PhyloSift reports in the sequence_taxa.txt file and in the taxasummary.txt file.

For example, in the output listed at /home/hollybik/phylosift_devel_20120511/PS_temp/pool5_AACACCT_test.fastq,
sequence_taxa.txt reports 11310 unique reads placed via PhyloSift, but the taxasummary.txt file (summed value for column 4) reports 47414 reads placed.
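
For reference, a column total like the one quoted above can be computed with awk. This is a minimal sketch, assuming the file is tab-delimited; sum_column is a hypothetical helper name, and the toy rows below are invented, not taken from taxasummary.txt.

```shell
# Sum one column of a tab-delimited file.
# $1 = file, $2 = 1-based column number.
# Non-numeric cells (e.g. a header row) count as 0 in awk.
sum_column() {
    awk -F '\t' -v col="$2" '{ total += $col } END { print total + 0 }' "$1"
}

# Toy input with made-up values (NOT real taxasummary.txt rows).
demo=$(mktemp)
printf '1\tA\tx\t1.5\n2\tB\tx\t2.5\n' > "$demo"
sum_column "$demo" 4   # prints 4
rm -f "$demo"
```

On the real data this would be sum_column taxasummary.txt 4.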

I also asked for clarification about what each of the different output files contains:

sequence_taxa.txt – for each read, this shows the different NODES (and associated probability distributions) at which the read is placed. The /1 and /2 sequences are listed separately for Illumina paired-end data.

sequence_taxa_summary.txt – expands the sequence_taxa.txt information by walking up the tree until the summed probability reaches 1. I noticed some discrepancies in this file as it stands, and have opened an issue on GitHub.

taxasummary.txt – this file directly summarizes sequence_taxa.txt: it takes the probabilities reported for each read and sums them up by NCBI taxon ID.
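
To make the "sum by taxon ID" idea concrete, here is a rough sketch of that kind of aggregation in awk. It is not PhyloSift's own code. The column positions are passed in as arguments because I am not assuming a particular layout for sequence_taxa.txt, and sum_by_taxon and the toy rows are hypothetical.

```shell
# Sum a probability column per taxon ID in a tab-delimited file.
# $1 = file, $2 = taxon ID column, $3 = probability column.
# The column positions are assumptions supplied by the caller.
sum_by_taxon() {
    awk -F '\t' -v id="$2" -v p="$3" '
        { mass[$id] += $p }
        END { for (t in mass) print t "\t" mass[t] }
    ' "$1" | sort
}

# Toy input: read, taxon ID, probability (made-up values).
demo=$(mktemp)
printf 'read1\t1224\t0.5\nread1\t1236\t0.5\nread2\t1224\t1\n' > "$demo"
sum_by_taxon "$demo" 2 3
rm -f "$demo"
```

In the toy input, taxon 1224 ends up with 1.5 and taxon 1236 with 0.5, so a per-taxon total is a sum of probabilities rather than a count of distinct reads.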

taxa_90pct_HPD.txt – gets rid of taxa in the bottom 10% of the probability distribution.
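
One way to picture this kind of filter is: rank taxa by their summed probability, then keep taxa from the top until 90% of the total is covered. The sketch below does that. It is my own illustration, not PhyloSift's actual implementation, and it assumes a two-column tab-delimited input (taxon, then mass). The name top_mass_taxa and the toy rows are hypothetical.

```shell
# Keep the highest-mass taxa that together cover a given fraction of the total.
# $1 = tab-delimited file with "taxon<TAB>mass" rows (assumed layout)
# $2 = fraction of the total mass to keep, e.g. 0.90
top_mass_taxa() {
    sort -t "$(printf '\t')" -k2,2nr "$1" |
    awk -F '\t' -v keep="$2" '
        { line[NR] = $0; mass[NR] = $2; total += $2 }
        END {
            # Walk down the ranked list until the kept mass reaches the cutoff.
            for (i = 1; i <= NR && cum < keep * total; i++) {
                print line[i]
                cum += mass[i]
            }
        }'
}

# Toy input with made-up masses (total = 100).
demo=$(mktemp)
printf 'taxD\t5\ntaxA\t50\ntaxC\t15\ntaxB\t30\n' > "$demo"
top_mass_taxa "$demo" 0.90   # prints taxA, taxB, taxC rows; taxD is dropped
rm -f "$demo"
```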