Expanding support for euks in PhyloSift

PhyloSift genome data lives on Merlot in /share/eisen-z2/phylosift/

Talked with Guillaume about euk markers

  • marker updates – ONLY updating tree and its associated taxonomy (not the HMM, because we don’t want the HMM to shift over time). We throw away all the old reference sequences and start afresh by scanning the downloaded genomes.
  • For euks – I’m worried that we’re getting rid of a lot of taxonomic diversity with this marker update method. Original Parfrey sequences definitely seem to be getting thrown away. Need to figure out if we’re representing all the deep protist lineages during euk marker updates (99% tree pruning won’t do much harm if we already have these taxa present).

To investigate:

  • How many euk genomes do we have in the Phylosift directories (draft, ebi, WGS, etc)? How many bacterial/archaeal in comparison?
  • Work with Guillaume to get full NCBI taxonomic hierarchies placed into the trees. This will help to evaluate which lineages are present/absent in our reference markers.

PhyloSift analysis of Deepsea OTUs

Prepping for lab meeting tomorrow, so looking at the results of the PhyloSift runs for the Deepsea OTU data.

Edge PCA (produces an .xml tree file) :

./guppy pca --out-dir ~/phylosift_v1.0.0_01/ ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowCalif.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowGulf.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic22.1.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic25.2.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic29.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic43.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic45.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific128.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific237.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific321.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific422.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific528.fna/treeDir/18s_reps.1.jplace --prefix guppyDS

Squash clustering:

./guppy squash --out-dir ~/phylosift_v1.0.0_01/ ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowCalif.fna/treeDir/ShallowCalif_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowGulf.fna/treeDir/ShallowGulf_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic22.1.fna/treeDir/Atlantic22.1_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic25.2.fna/treeDir/Atlantic25.2_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic29.fna/treeDir/Atlantic29_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic43.fna/treeDir/Atlantic43_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic45.fna/treeDir/Atlantic45_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific128.fna/treeDir/Pacific128_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific237.fna/treeDir/Pacific237_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific321.fna/treeDir/Pacific321_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific422.fna/treeDir/Pacific422_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific528.fna/treeDir/Pacific528_18Sreps.1.jplace --prefix guppyDeepsea_squash

Kantorovich-Rubinstein Distance:

~/phylosift_v1.0.0_01/bin$ ./guppy kr --out-dir ~/phylosift_v1.0.0_01/ ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowCalif.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowGulf.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic22.1.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic25.2.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic29.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic43.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic45.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific128.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific237.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific321.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific422.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific528.fna/treeDir/18s_reps.1.jplace -o guppy_Deepsea_KRdistance
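
The guppy commands above repeat the same twelve paths by hand, which is easy to get wrong. A minimal sketch of building the list in a loop instead, using the sample names and directory layout from the commands above. DRY_RUN and the run helper are my own additions (not part of guppy or PhyloSift), so by default this only prints the commands rather than running them. The squash run uses per-sample file names (e.g. ShallowCalif_18Sreps.1.jplace), so it would need its own list.

```shell
#!/bin/sh
# Sketch: build the per-sample .jplace list once, reuse it for guppy pca and kr.
# PS_DIR and the sample names come from the commands above.
# DRY_RUN is an assumption added here: 1 (default) prints commands, 0 runs them.
PS_DIR="$HOME/phylosift_v1.0.0_01"
DRY_RUN="${DRY_RUN:-1}"

samples="ShallowCalif ShallowGulf Atlantic22.1 Atlantic25.2 Atlantic29 Atlantic43 Atlantic45 Pacific128 Pacific237 Pacific321 Pacific422 Pacific528"

# Collect the paths in the positional parameters (the POSIX stand-in for an array).
set --
for s in $samples; do
  set -- "$@" "$PS_DIR/PS_temp/QiimeSplit_F04_${s}.fna/treeDir/18s_reps.1.jplace"
done
n_jplaces=$#
echo "Using $n_jplaces placement files"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run ./guppy pca --out-dir "$PS_DIR/" --prefix guppyDS "$@"
run ./guppy kr --out-dir "$PS_DIR/" -o guppy_Deepsea_KRdistance "$@"
```

Run from the bin directory with DRY_RUN=0 once the printed commands look right.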

More updates to PhyloSift website

Finished updating the outputs section of the PhyloSift website. Going to close the GitHub issue but here’s what I still need to update:

  • Column 2 of sequence_taxa files – what is it?
  • Confirm that marker_summary.txt in main output directory summarizes the alignDir marker info
  • Clearer info on the .jnlp and .xml output files that Aaron is working on for the fat tree visualization.

Phylosift website – update info about output files

Went through the updated list of output files with Guillaume. Here are the deets for all the files now being created in the PS_temp directory for each run:


protein coding markers

  • *1.unmasked – aligned protein with no masking – not used in downstream analyses
  • *codon.updated.1.fasta – nucleotide, aligned and masked
  • *.newCandidate.aa.1 – same file (unaligned version of hits)
  • *.updated.1.fasta – protein, aligned and masked

.1 refers to chunk number – so if you have duplicate files with .2, etc., those are the later chunks of the same run
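
A quick way to check how many chunks a run produced is to pull the chunk number out of the alignment file names. This is only a sketch: it assumes the masked alignments follow the <marker>.updated.<chunk>.fasta pattern listed above, and ALIGN_DIR is a hypothetical path that needs to point at a real alignDir. The chunk_of and list_chunks helpers are my own names, not PhyloSift commands.

```shell
#!/bin/sh
# Sketch: count alignment files per chunk number in an alignDir.
# ALIGN_DIR is a hypothetical default; override it with a real path.
ALIGN_DIR="${ALIGN_DIR:-PS_temp/example.fna/alignDir}"

# Print the chunk number of one file named <marker>.updated.<chunk>.fasta.
chunk_of() {
  base="${1%.fasta}"            # drop the trailing ".fasta"
  printf '%s\n' "${base##*.}"   # keep what follows the last remaining dot
}

# Print "<file count> <chunk number>" for every chunk found in directory $1.
# The glob also matches the *codon.updated.<chunk>.fasta nucleotide alignments.
list_chunks() {
  for f in "$1"/*.updated.*.fasta; do
    [ -e "$f" ] || continue     # glob matched nothing
    chunk_of "$f"
  done | sort -n | uniq -c
}

list_chunks "$ALIGN_DIR"
```

If only chunk 1 shows up, the run was small enough to fit in a single chunk.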


  • .1.unmasked – aligned nucleotide with no masking
  • .short.1.fasta – alignment using cmalign, with masking
  • .long.1.fasta – will we have this too? and a separate file for unmasked long seqs?

Do we get two .unmasked files if we have a mix of short and long sequences?

no unaligned file in alignDir for 16S/18S data


  • marker_summary.txt – how many hits per marker for each gene

Search mode --keep_search – flag that retains all the search info in the BLAST directory; automatically retains the temp blast files

--keep_search – just undocumented, need to document this under output for all mode


  • enolase.codon.updated.sub1.1.jplace — nucleotide jplace
  • enolase.updated.1.jplace — aa jplace

How is the information from codon and aa trees used in phylosift summaries?

Main output directory – Krona reports

  • filename_allmarkers.html – all markers in treeDir with jplaces
  • filename.html – core markers DNGNGWU only
  • filename.jnlp – Java Web Start (JNLP) launch file for the fat tree visualization
  • filename.xml – fat tree viz itself?

Main output directory – summary files

  • marker_summary.txt – based off of the taxon summary files
  • run_info.txt – going to be updated in the next few days; lists commands and md5 sums and step completion status (start/end time and duration for each chunk at each step – search, align, place, summarize)
  • sequence_taxa_summary.1.txt – summary of chunk
  • sequence_taxa_summary.txt – combined info from all chunks
  • sequence_taxa.1.txt – summary of chunk
  • sequence_taxa.txt – combined information from all the chunks
  • taxa_90pct_HPD.txt –
  • taxasummary.txt


PhyloSift paper and web updates, continued

More progress toward the PhyloSift paper and web updates. Here are the things I need to follow up on this weekend:

Intro edits – say something about:

  • If you want to test for an organism’s presence, people would do a BLAST search with a homology test. But that doesn’t tell you what it means if you have several hits, or anything about the evolutionary relationships of what’s in your sample.
  • People also do tree-based analysis right now (manual inspection) – need a better way to do this (a method for the HTP era), and one that is statistically robust (not just exploratory methods).

***Put each sentence of the intro on a separate line – for LaTeX purposes

***Go through Aaron’s Nov 1 email outline to make sure I mention all the necessary things in the intro


Test out Bayes factor test and start putting together the web tutorial 

run all mode with --bayes flag

and then run test_lineage mode with the relevant flags.

Aaron is going to prep an analysis for the paper, and then we can make a tutorial with this data

PhyloSift website updates

General questions:

  • What are our “levels” of PhyloSift markers – need to standardize terminology
    • “elite markers” – only DNGNGWU?
    • “core markers” – all devel markers?
    • “extended markers” – protein families; additional download
  • What is the ‘web’ folder that now appears in PS download folder?  – Eric’s scripts; Aaron will remove.
  • Still finding the --help dialogue flag structure really confusing. These do nothing: “commands” (list the application’s commands) and “help” (display a command’s help screen).

Intro Tutorial – web updates needed

  • Update screenshot of output directory with newest collection of output files
  • Update the names of the krona files that get generated in the output directory
  • Add info about the automatic fat tree visualization that Aaron just added to outputs

Bayes factor tests – new page creation

  • Screenshot of equation used
  • Explanation about what Bayes factor tests do
  • Commands/workflow needed to run Bayes factor tests
  • How to interpret the outputs generated – biological context, uncertainty/detection thresholds

Output files – major page update

  • Go through outputs from HMP data and update and explain new file types

General Web Updates

  • Check all pages to make sure double dashes are inserted – Done already – Intro tutorial, phylosift RC file page, Monkey, Kangaroo, DBupdate
  • Update example command lines on Monkey, Kangaroo, DBupdate (e.g. with new flag structure). Done- 11/9/12