B.anthracis metagenome work

Transferring notes over from GoogleDocs. Here is the recent work I’ve been doing:


Downloaded Holly Gantz data to iMac and converted fastq to FASTA.

cat Holly_G_3C2.r1.fastq | perl -e ‘$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}’ > Holly_G_3C2.r1.fasta

cat Holly_G_3C2.r2.fastq | perl -e ‘$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}’ > Holly_G_3C2.r2.fasta

Removed the /n characters that were interrupting the DNA sequences in the grinder sim read outputs. Now running through PhyloSift to see if this will give us hits.

perl /Users/hollybik/Dropbox/perl/scripts/Grinder_remove_newlines_fasta.pl grinder-reads_Banthracis_1000000.fa

  • Still not getting any results after removing /n characters. Now going to use other tools to see if it is a problem with the reads themselves.


Running Holly Gantz’s data through QIIME to see if we have 16S hits on any Bacillus taxa. Doing this on qiime@localhost on Edhar.

parallel_pick_otus_uclust_ref.py -i /home/qiime/data/hbik/B_anthracis/Holly_G_3C2.r1.fasta -o HollyG_3C2_r1_qiime16S -r /home/qiime/data/hbik/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -s 0.99 –enable_rev_strand_match

pick_rep_set.py -i HollyG_3C2_r1_qiime16S/Holly_G_3C2.r1_otus.txt -f Holly_G_3C2.r1.fasta -o HollyG_3C2_r1_qiime16S/Holly_G_3C2.r1_rep_set.fna

assign_taxonomy.py -i HollyG_3C2_r1_qiime16S/Holly_G_3C2.r1_rep_set.fna -m rdp

  • RDP taxonomy doesn’t indicate that we have any Bacillus anthracis, although we do have a number of Bacilli in the data –the main OTU is Stapylococcus, which I think correlates with the PhyloSift results? We got a total of 317 rRNA OTUs from this QIIME scan.

Set up a BLAST DB on Edhar to compare marker genes (cat file of marker gene hits pulled out by PhyloSift run) with the simulated grinder data.

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 95 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter.txt

  • No hits at all using 95% sequence identity…tried again with only 90% sequence identity:

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 90 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout2.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout2.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter2.txt

  • No hits at all even using 90% similarity, so I went down to 60% sequence identity:

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 60 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout3.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout3.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter3.txt

  • Still no hits. So it might be related to Illumina error

Re-simulated Grinder reads, but removed the Illuimina-specific error simulation and instead opted for a general 0.1 uniform error rate.

grinder -read_dist 105 -insert_dist 400 normal 50 uniform 0.1 -mr 95 5 -genome_file B_anthracis_genome.fasta -total_reads 100000

And then ran another megablast, this time at 80% to see the difference

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 80 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_Error1pct.fa -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout4.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout4.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter4.txt

  • Still nope. Maybe I don’t have enough sequences in the dataset to have any marker gene hits show up?
  • So now I’m going to try with 1000000 PE reads. Perhaps the increased read count will give us hits:

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 80 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_1000000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_1000000_NLrem_BLASTout.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_1000000_NLrem_BLASTout.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_1000000_NLrem_BLASTfilter.txt

Aaron looked into the issues and said we needed to change the LAST search parameter to -e40 in the FastSeach.pm file. After running a couple datasets (phylosift_devel_20120801 folder), I am now getting hits in the blastDir, but nothing transferring over to the alignDir


After scrumming this morning it sounds like the HMMer search parameters are set as sensitive as they can be – Aaron suggested that perhaps we aren’t pulling out the relevant marker genes from genome contigs during the database update (DBupdate script), so when we go to search against the reference marker genes, we’re not getting any hits because we don’t actually have the B.anthracis markers represented in the PhyloSift core markers. Aaron is looking into this and going to see if we can clarify further.


So all the simulation problems turned out to be a bug in Grinder. Hopefully we can proceed forward more easily with the new simulator problem.

We also seemed to have fixed the problem with the genome contigs — the lack of summaries/kronas was because all the input FASTA headers had the same name. Fixed once you change to unique headers (did this with perl script and then ran AA and nt sequences on master branch w/master markers).


Switched to artsim after finding a bug in grinder. We’ve gone from # of reads to fold coverage.

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.01 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.01x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.25 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.25x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.5 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.5x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 1 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_1x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 2 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_2x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.1 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.1x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 5 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_5x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 10 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_10_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 25 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_25x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 50 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_50x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 100 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_100x_


Running all the artsim B.anthracis data through PhyloSift

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.1x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.1x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.5x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.5x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.01x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.01x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.25x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.25x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_1x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_1x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_2x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_2x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_5x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_5x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_10x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_10x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_25x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_25x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_50x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_50x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_100x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_100x_2.fq


Holly G is going to give us another soil sample on Tues/Wed – this time she’s going to PCR-confirm that there is B.anthracis in the sample.

Also lowering the coverage more for the artsim data

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.005 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.005x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.0025 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.0025x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.001 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.001x_

…and then running through PhyloSift:

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.005x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.005x_2.fq –disable_updates

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.0025x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.0025x_2.fq –disable_updates

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.001x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.001x_2.fq –disable_updates

Yatsunenko and GOM – more QIIMEing

Continuing where I left off with the QIIME analyses. For the Yatsunenko data, trying to finish up the 16S amplicon analyses before the sprint ends on Friday (have to do this manually because the core analyses workflow just wasn’t working for the huge dataset):

pick_rep_set.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus.txt -m first -o 16S_amplicon_yatsunenko_combo_otus_rep_set.txt -s otu -r /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta

align_seqs.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/ 16S_amplicon_yatsunenko_combo_otus_rep_set.txt -t /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta -m pynast -a uclust -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/ -e 150 -p 0.75

filter_alignment.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/ 16S_amplicon_yatsunenko_combo_otus_rep_set_aligned.fasta -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/ -g 0.999999 -t 3.0 –suppress_lane_mask_filter

assign_taxonomy.py -i /home/ubun
tu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus_rep_set.txt -t /home/
ubuntu/gg_otus_4feb2011/taxonomies/greengenes_tax_rdp_train.txt -r /home/ubuntu/gg_otus_4feb2011/rep_
set/gg_97_otus_4feb2011.fasta -m rdp -c 0.8 -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept

make_otu_table.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus.txt -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/ 16S_amplicon_yatsunenko_combo_otu_table.biom -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/assign_taxonomy/ 16S_amplicon_yatsunenko_combo_otus_rep_set_tax_assignments.txt

beta_diversity_through_plots.py -i 16S_amplicon_yatsunenko_combo_otu_table.biom -m /home/ubuntu/qiime_16S_amplicon_mappingfile.txt -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/beta_diversity/ -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/ 16S_amplicon_yatsunenko_combo_otus_rep_set_aligned_pfiltered.tre -p /home/ubuntu/qiime_parameters_ps_amplicon.txt

And for the GOM Data, proceeding manually (core analyses ran into an error at the RDP step). Commands as follows:

align_seqs.py -i /home/ubuntu/uclust_99_GOMillumina_fwd_1to12_1/otus/rep_set/ GOM_illumina_demultiplexed_fwd_concat_1to12_1_rep_set.fasta -t /home/ubuntu/Silva_108/core_aligned/Silva_108_core_aligned_seqs.fasta -m pynast -a uclust -o /home/ubuntu/uclust_99_GOMillumina_fwd_1to12_1/otus/aligned_seqs/ -e 150 -p 0.75

GOM Illumina and Yatsunenko rRNA analyses

Kicked off some more analyses this morning on the cloud. First off is the GOM Illumina data – forward reads give me about a 6GB file, so running the core analyses on a cloud server with 32GB memory:

core_qiime_analyses.py -i /home/ubuntu/data_GOM/GOM_illumina_demultiplex
ed_fwd_concat_1to12_1 -o /home/ubuntu/uclust_99_GOMillumina_fwd_1to12_1 -m /home/ubuntu/QIIMEmappin
gfile_GOM_Illumina_fakebarcodes.txt –suppress_split_libraries -p /home/ubuntu/qiime_parameters_GOM

Transferred the Yatsunenko 16S amplicon data over to a High-Memory Instance (transfer speeds are FAST! <20 mins for a 40GB file) – 68GB memory on this one. Using the same core_qiime_analyses command as yesterday (w/greengenes tree).

Error on the Cloud

When running Yatsunenko 16S amplicon data (40 gig combo file). Was running on Instance with ~32GB memory, so am now going to up that to the largest possible instance at 68GB:

ubuntu@ip-10-218-81-140:~$ core_qiime_analyses.py -i /home/ubuntu/data_yatsunenko/ps_metagenome_extract/ps_metagenome_combined.fna -o /home/ubuntu/ps_metagenome_coreanalyses_14sept -m qiime_ps_metagenomes_mappingfile.txt -f –suppress_split_libraries -t /home/ubuntu/gg_otus_4feb2011/trees/gg_97_otus_4feb2011.tre -p /home/ubuntu/qiime_parameters_ps_amplicon.txt
ubuntu@ip-10-218-81-140:~$ core_qiime_analyses.py -i /home/ubuntu/data_yatsunenko/yatsunenko/raw_data/16S_amplicon_yatsunenko_combo.fna -o /home/ubuntu/ps_16S_amplicon_coreanalyses_14sept -m /home/ubuntu/qiime_16S_amplicon_mappingfile.txt -f –suppress_split_libraries -t /home/ubuntu/gg_otus_4feb2011/trees/gg_97_otus_4feb2011.tre -p /home/ubuntu/qiime_parameters_ps_amplicon.txt
Traceback (most recent call last):
File “/home/ubuntu/qiime_software/qiime-1.5.0-release/bin/core_qiime_analyses.py”, line 191, in <module>
File “/home/ubuntu/qiime_software/qiime-1.5.0-release/bin/core_qiime_analyses.py”, line 188, in main
File “/home/ubuntu/qiime_software/qiime-1.5.0-release/lib/qiime/workflow.py”, line 1411, in run_core_qiime_analyses
File “/home/ubuntu/qiime_software/qiime-1.5.0-release/lib/qiime/workflow.py”, line 471, in run_pick_otus_through_otu_table
File “/home/ubuntu/qiime_software/qiime-1.5.0-release/lib/qiime/workflow.py”, line 135, in call_commands_serially
raise WorkflowError, msg

Command run was:
/home/ubuntu/qiime_software/python-2.7.1-release/bin/python /home/ubuntu/qiime_software/qiime-1.5.0-release/bin/pick_otus.py -i /home/ubuntu/data_yatsunenko/yatsunenko/raw_data/16S_amplicon_yatsunenko_combo.fna -o /home/ubuntu/ps_16S_amplicon_coreanalyses_14sept/otus//uclust_ref_picked_otus –max_rejects 8 –otu_picking_method uclust_ref –similarity 0.97 –max_cdhit_memory 400 –refseqs_fp /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta –clustering_algorithm furthest –word_length 8 –max_accepts 1 –stepwords 8 –max_e_value 1e-10
Command returned exit status: 137


Looking for Xeno DNA in the metagenome

Trying to figure out if we actually have any Xeno DNA in the sequencing run. First thing I’m doing is looking through the .jplace files for the Parfrey marker genes. The closest relatives for the Xenophyophore are Rhizaria, and the species included in the Parfrey study are listed below:

PhyloSift Marker

Rhizaria Species

  • Reticulomyxa filosa
40S (none)
  • Ammonia sp. T7
  • Ovammina opaca
  • Reticulomyxa filosa
  • Corallomyxa tenera
  • Massisteria marina
  • Gymnophrys sp
  • Bodomorpha minima
  • Dimorpha sp.
  • Capsellina sp.
  • Plasmodiophora brassicae
  • Proleptomonas faecicola
  • Thaumatomonas seravini
  • Gymnochlora stellata
  • Allogromia sp.
  • Gromia oviformis
  • Spongomonas
  • Lotharella globosa
  • Clathrulina elegans
  • Hedriocystis reticulata
  • Euglypha rotunda
  • Amphisorus hemprichii
  • Globobulimina turgida
  • Bonamia ostreae
  • Haplosporidium costale, H. louisiana, H. nelsoni
  • Minchinia chitonis
  • Urosporidium crescens
  • Lecythium sp.
  • Sorosphaera veronicae
  • Spongospora subterranea
  • Collozoum inerme
  • Thalassicolla pellucida
  • Ovammina opaca
  • Reticulomyxa filosa
  • Corallomyxa tenera
  • Massisteria marina
  • Gymnophrys sp
  • Bodomorpha minima
  • Dimorpha sp.
  • Capsellina sp.
  • Proleptomonas faecicola
  • Thaumatomonas seravini
  • Rhabdammina cornuta
  • Ovammina opaca
  • Reticulomyxa filosa
  • Corallomyxa tenera
  • Massisteria marina
  • Gymnophrys sp
  • Bodomorpha minima
  • Dimorpha sp.
  • Capsellina sp.
  • Plasmodiophora brassicae
  • Proleptomonas faecicola
  • Thaumatomonas seravini
  • Allogromia sp.
  • Rhabdammina cornuta
  • Gromia oviformis
  • Spongomonas
  • Astrammina rara
  • Corallomyxa tenera
  • Corallomyxa tenera
  • Reticulomyxa filosa
  • Corallomyxa tenera
  • Reticulomyxa filosa
  • Reticulomyxa filosa
  • Reticulomyxa filosa
  • Massisteria marina
  • Gymnophrys sp
  • Reticulomyxa filosa
Rps22a (none)
  • Gymnochlora stellata
  • Plasmodiophora brassicae

New Core Marker Names for PhyloSift

Updated the PhyloSift Website to reflect the update to the Core markers (we’re now using Dongying’s newest set of 40 markers and have changed the reference marker names to reflect this.) The old table that was deleted is listed here – in case I need this HTML code in the future.

PhyloSift Marker

Gene Name

PMPROK00003 50S Ribosomal Protein L14P (rpl14P)
PMPROK00014 O-sialoglycoprotein endopeptidase (gcp)
PMPROK00015 30S Ribosomal Protein S3P (rps3P)
PMPROK00019 50S Ribosomal Protein S5 (rps5P)
PMPROK00020 50S Ribosomal Protein L6P (rpl6P)
PMPROK00022 50S Ribosomal Protein L1P (rpl1P)
PMPROK00024 50S Ribosomal Protein L22P (rpl22P)
PMPROK00025 30S Ribosomal Protein S19P (rps19P)
PMPROK00028 30S Ribosmal Protein S2P (rps2P)
PMPROK00029 50S Ribosomal Protein L11P (rpl11P)
PMPROK00031 GTP-binding signal recognition particle G-domain protein (srp54)
PMPROK00034 50S Ribosomal Protein L4P (rpl4P)
PMPROK00041 30S Ribsomal Protein S17P (rps17P)
PMPROK00047 CTP synthase, UTP-ammonia lyase (pyrG)
PMPROK00048 50S Ribosomal Protein L2P (rpl2P)
PMPROK00050 30S Ribosomal Protein S15P/S13e (rps15P)
PMPROK00051 30S Ribosomal Protein S9P/S13 (rps9P)
PMPROK00052 50S Ribsomal Protein L18P (rpl18P)
PMPROK00053 50S Ribosomal Protein L5P (rpl5P)
PMPROK00054 30S Ribosomal Protein S7P (rps7P)
PMPROK00060 Translation elongation factor aEF-2
PMPROK00064 Translation initiation factor aIF-2
PMPROK00067 50S Ribosomal Protein L24P (rpl24P)
PMPROK00068 30S Ribosomal Protein S11P (rps11P)
PMPROK00069 50S Ribosomal Protein L10E (rp10E)
PMPROK00071 50S Ribosomal Protein L16/L10E
PMPROK00074 30S Ribosomal Protein S8P (rps8E)
PMPROK00075 50S Ribosomal Protein L3P (rpl3P)
PMPROK00081 30S Ribosomal Protein S13P (rps13P)
PMPROK00086 Phenylalanyl-tRNA synthetase, beta subunit (pheT)
PMPROK00087 Phenylalanyl-tRNA synthetase, alpha subunit (pheS)
PMPROK00092 50S Ribosomal Protein L15P (rpl15P)
PMPROK00093 30S Ribosomal Protein S10P (rps10P)
PMPROK00094 30S Ribosomal Protein S12P (rps12P)
PMPROK00097 Triosephosphate isomerase (tpiA)
PMPROK00106 tRNA-ribosyltransferase / tRNA-guanine transglycosylase
PMPROK00123 rdgB/HAM1 protein family
PMPROK00126 Ribonuclease HII (rnhB)

Threshold parameters for PhyloSift search steps

After discussions with Aaron, I’ve removed references to the probability thresholds and e-values that are currently hardcoded into the PhyloSift code. We’re now talking about moving all these hardcoded values into a global parameter file, which will tell users exactly what values are used in the analysis.

Candidate marker sequences identified in LAST searches are next screened against profile alignments that have been pre-computed for reference marker genes (housed in the local directory: /share/phylosift/markers/ ). In order to take a stringent search approach towards short read data, PhyloSift relies on threshold e-values to accept or reject candidate sequences after initial LAST searches. For rRNA sequences, screening and alignment relies on Covariance Model profiles (CMs; a class of Stochastic Context Free Grammar Models that utilize stem/loop information in rRNA secondary structure) and is carried out via the cmalign algorithm in the SSU-align software, using probability thresholds of 1×10-6 for sequences >1000 bp and 1×10-20 for sequences <1000 bp. Protein coding genes rely on profile Hidden Markov Models (computed via the HMMer software suite; Eddy 2010), with a threshold e-value set at 10. These profile alignments can be found in the /share/phylosift/markers/ directory as *.cm (rRNA) and *.hmm (protein coding genes) files.