Threshold parameters for PhyloSift search steps

After discussions with Aaron, I’ve removed references to the probability thresholds and e-values that are currently hardcoded into the PhyloSift code. We’re now talking about moving all these hardcoded values into a global parameter file, which will tell users exactly what values are used in the analysis.

Candidate marker sequences identified in LAST searches are next screened against profile alignments that have been pre-computed for the reference marker genes (housed in the local directory /share/phylosift/markers/). To keep the search stringent for short-read data, PhyloSift relies on threshold e-values to accept or reject candidate sequences after the initial LAST searches. For rRNA sequences, screening and alignment rely on Covariance Model profiles (CMs; a class of stochastic context-free grammar models that use stem/loop information in rRNA secondary structure) and are carried out via the cmalign algorithm in the SSU-ALIGN software, using probability thresholds of 1×10⁻⁶ for sequences >1000 bp and 1×10⁻²⁰ for sequences <1000 bp. Protein-coding genes rely on profile Hidden Markov Models (computed via the HMMER software suite; Eddy 2010), with a threshold e-value set at 10. These profiles can be found in the /share/phylosift/markers/ directory as *.cm (rRNA) and *.hmm (protein-coding genes) files.
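As a rough illustration of what a single, user-visible set of thresholds could look like, here is a minimal Python sketch. It is not PhyloSift code: the names SearchThresholds and passes_threshold are hypothetical, and the sketch assumes that a candidate is kept when its reported value is at or below the threshold. The text above does not say how a sequence of exactly 1000 bp is treated, so grouping it with the shorter sequences is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchThresholds:
    """Hypothetical container for the values quoted in the text."""
    rrna_long_threshold: float = 1e-6    # rRNA sequences > 1000 bp
    rrna_short_threshold: float = 1e-20  # rRNA sequences < 1000 bp
    rrna_length_cutoff_bp: int = 1000
    protein_evalue: float = 10.0         # profile HMM e-value threshold


DEFAULTS = SearchThresholds()


def passes_threshold(marker_type, seq_length_bp, value, params=DEFAULTS):
    """Return True if a candidate would be kept under the given thresholds.

    Assumption: a candidate is accepted when its value is <= the threshold.
    Assumption: a sequence of exactly 1000 bp uses the short-sequence value.
    """
    if marker_type == "rrna":
        if seq_length_bp > params.rrna_length_cutoff_bp:
            limit = params.rrna_long_threshold
        else:
            limit = params.rrna_short_threshold
    elif marker_type == "protein":
        limit = params.protein_evalue
    else:
        raise ValueError(f"unknown marker type: {marker_type!r}")
    return value <= limit


if __name__ == "__main__":
    # Print the values in use, which is the point of a global parameter file.
    print(DEFAULTS)
    print(passes_threshold("rrna", 1500, 1e-8))
    print(passes_threshold("rrna", 500, 1e-8))
```

Keeping every value in one structure means an analysis can print or log exactly which thresholds were applied, and a user can override one of them without touching the search code.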


