QIIME open ref OTU picking

Been trying to use the new QIIME 1.7 scripts on the GOM Illumina data. I keep running out of memory, so this is still a work in progress (heading over to the Amazon cloud soon).

Standard workflow:

pick_open_reference_otus.py -i /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /Users/hollybik/Desktop/Data/Illumina_GOM/uclust_99_fwd -r /macqiime/silva_111/rep_set/Silva_111_full_unique.fasta --parallel -O 2 -s 0.1 --suppress_taxonomy_assignment --suppress_align_and_tree

Skipping prefiltering (thought this would speed things up but no…)

pick_open_reference_otus.py -i /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /Users/hollybik/Desktop/Data/Illumina_GOM/uclust_99_fwd -r /macqiime/silva_111/rep_set/Silva_111_full_unique.fasta --parallel -O 2 -s 0.1 --suppress_taxonomy_assignment --suppress_align_and_tree --prefilter_percent_id 0.0

Expanding support for euks in PhyloSift

PhyloSift genome data lives on Merlot in /share/eisen-z2/phylosift/

Talked with Guillaume about euk markers

  • Marker updates – ONLY updating the tree and its associated taxonomy (not the HMM, because we don’t want the HMM to shift over time). We throw away all the old reference sequences and start afresh by scanning the downloaded genomes.
  • For euks – I’m worried that this marker update method discards a lot of taxonomic diversity. The original Parfrey sequences definitely seem to be getting thrown away. Need to figure out whether all the deep protist lineages are represented during euk marker updates (99% tree pruning won’t do much harm if we already have these taxa present).

To investigate:

  • How many euk genomes do we have in the Phylosift directories (draft, ebi, WGS, etc)? How many bacterial/archaeal in comparison?
  • Work with Guillaume to get full NCBI taxonomic hierarchies placed into the trees. This will help to evaluate which lineages are present/absent in our reference markers.
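For the first question, a rough file count per source directory is probably the quickest starting point. The sketch below is only that: it assumes one FASTA-like file per genome (*.fasta, *.fa or *.fna) and that the sources (draft, ebi, WGS, etc.) sit as subdirectories under the genome root, neither of which I have checked. The function name count_genomes is made up for this note. It also can't separate euks from bacteria/archaea on its own; that still needs the taxonomy.

```shell
# Count FASTA-like files in each immediate subdirectory of a genome root.
# Assumption (unchecked): one file per genome, named *.fasta, *.fa or *.fna.
count_genomes() {
    root="$1"
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        n=$(find "$dir" -type f \( -name '*.fasta' -o -name '*.fa' -o -name '*.fna' \) | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$(basename "$dir")" "$n"
    done
}

# Example call on Merlot (not run here):
#   count_genomes /share/eisen-z2/phylosift
```

Output is one tab-separated line per subdirectory (name, file count), so it can be pasted straight into a spreadsheet.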

PhyloSift analysis of Deepsea OTUs

Prepping for lab meeting tomorrow, so looking at the results of the PhyloSift runs for the Deepsea OTU data.

Edge PCA (produces an .xml tree file):

./guppy pca --out-dir ~/phylosift_v1.0.0_01/ ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowCalif.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowGulf.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic22.1.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic25.2.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic29.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic43.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic45.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific128.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific237.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific321.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific422.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific528.fna/treeDir/18s_reps.1.jplace --prefix guppyDS

Squash clustering:

./guppy squash --out-dir ~/phylosift_v1.0.0_01/ ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowCalif.fna/treeDir/ShallowCalif_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowGulf.fna/treeDir/ShallowGulf_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic22.1.fna/treeDir/Atlantic22.1_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic25.2.fna/treeDir/Atlantic25.2_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic29.fna/treeDir/Atlantic29_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic43.fna/treeDir/Atlantic43_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic45.fna/treeDir/Atlantic45_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific128.fna/treeDir/Pacific128_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific237.fna/treeDir/Pacific237_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific321.fna/treeDir/Pacific321_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific422.fna/treeDir/Pacific422_18Sreps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific528.fna/treeDir/Pacific528_18Sreps.1.jplace --prefix guppyDeepsea_squash

Kantorovich-Rubinstein Distance:

~/phylosift_v1.0.0_01/bin$ ./guppy kr --out-dir ~/phylosift_v1.0.0_01/ ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowCalif.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_ShallowGulf.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic22.1.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic25.2.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic29.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic43.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Atlantic45.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific128.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific237.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific321.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific422.fna/treeDir/18s_reps.1.jplace ~/phylosift_v1.0.0_01/PS_temp/QiimeSplit_F04_Pacific528.fna/treeDir/18s_reps.1.jplace -o guppy_Deepsea_KRdistance
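Typing twelve .jplace paths by hand for every guppy call is error-prone, so here is a rough sketch of building the list from the sample names instead. The variable and function names (PS_HOME, SAMPLES, jplace_paths) are made up for this note, and the path pattern is the one used in the pca and kr commands (the squash command uses per-sample file names, so the pattern would need adjusting there). It only echoes the command as a dry run, and it assumes none of the paths contain spaces.

```shell
# Hypothetical helper names; the paths mirror the guppy pca/kr commands above.
PS_HOME="$HOME/phylosift_v1.0.0_01"
SAMPLES="ShallowCalif ShallowGulf Atlantic22.1 Atlantic25.2 Atlantic29 Atlantic43 Atlantic45 Pacific128 Pacific237 Pacific321 Pacific422 Pacific528"

# Print one .jplace path per sample, one per line.
jplace_paths() {
    for s in $SAMPLES; do
        printf '%s\n' "$PS_HOME/PS_temp/QiimeSplit_F04_${s}.fna/treeDir/18s_reps.1.jplace"
    done
}

# Dry run: show the command rather than executing it.
# The unquoted $(jplace_paths) is deliberate so each path becomes its own argument.
echo ./guppy pca --out-dir "$PS_HOME" $(jplace_paths) --prefix guppyDS
```

Dropping the echo (and running from the bin directory) would execute the real thing; swapping pca for kr and the --prefix for -o gives the KR distance call.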

QIIME GOM data

For some reason I couldn’t get the parallel_assign_taxonomy_rdp.py script to work, so I had to fall back to the normal assign_taxonomy.py script. That script threw an error until I increased the max memory flag, so remember to set it again next time.

assign_taxonomy.py -m rdp -i /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set.fasta -r /home/ubuntu/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -t /home/ubuntu/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_Eukarya_only_genus.txt -o /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/rdp_genus_assigntaxon/ --rdp_max_memory 50000

assign_taxonomy.py -m rdp -i /home/ubuntu/GOM_demulti2_rev_OpenRef_uclust99_12Oct/GOM_concat_rev_demultirepeat_1to12_2_otus_rep_set.fasta -r /home/ubuntu/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -t /home/ubuntu/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_Eukarya_only_genus.txt -o /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/rdp_genus_assigntaxon/ --rdp_max_memory 50000

Making OTU tables of all seqs:

make_otu_table.py -i GOM_concat_fwd_demultirepeat_1to12_1_otus.txt -t rdp_genus_assigntaxon/GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set_tax_assignments.txt -o GOM_concat_fwd_demultirepeat_1to12_1_otu_table_allotus.biom

Chimera checking (used the parallel script for the actual data analysis, but listing the non-parallel script here for reference):

identify_chimeric_seqs.py -i GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set.fasta -t /home/ubuntu/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_Eukarya_only_genus.txt -r /home/ubuntu/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o GOM_concat_fwd_demultirepeat_1to12_1_chimeric_seqs.txt -m blast_fragments

parallel_identify_chimeric_seqs.py -m blast_fragments -i GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set.fasta -t /home/ubuntu/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_Eukarya_only_genus.txt -r /home/ubuntu/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o GOM_concat_fwd_demultirepeat_1to12_1_chimeric_seqs.txt -O 6

 

GOM data analysis – yet more QIIME

Continuing next steps with QIIME. Filtering alignments next:

filter_alignment.py -i /home/ubuntu/GOM_demulti2_rev_OpenRef_uclust99_12Oct/aligned_seqs/GOM_concat_rev_demultirepeat_1to12_2_otus_rep_set_aligned.fasta -o /home/ubuntu/GOM_demulti2_rev_OpenRef_uclust99_12Oct/aligned_seqs/GOM_concat_rev_demultirepeat_1to12_2_otus_rep_set_filtered_aligned.fasta -s -e 0.10 -g 0.90

filter_alignment.py -i /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/aligned_seqs/GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set_aligned.fasta -o /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/aligned_seqs/GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set_filtered_aligned.fasta -s -e 0.10 -g 0.90
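The fwd and rev commands differ only in the read direction and the read-number suffix (1 for fwd, 2 for rev), so they could be generated from one template. A minimal sketch, with a made-up function name (filter_cmd) and the same paths and flags as the two commands above; it only prints the commands as a dry run so they can be checked before running.

```shell
# Print the filter_alignment.py command for one read direction.
# $1 = fwd or rev, $2 = read-number suffix (1 for fwd, 2 for rev).
# filter_cmd is a hypothetical helper, not part of QIIME.
filter_cmd() {
    dir="/home/ubuntu/GOM_demulti2_${1}_OpenRef_uclust99_12Oct/aligned_seqs"
    base="GOM_concat_${1}_demultirepeat_1to12_${2}_otus_rep_set"
    echo filter_alignment.py -i "$dir/${base}_aligned.fasta" \
        -o "$dir/${base}_filtered_aligned.fasta" -s -e 0.10 -g 0.90
}

filter_cmd fwd 1
filter_cmd rev 2
```

Piping the output to sh would run both commands once they look right.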

Now assigning taxonomy:

parallel_assign_taxonomy_rdp.py -i /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set.fasta -r /home/ubuntu/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -t /home/ubuntu/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_Eukarya_only_species.txt -o /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/rdp_taxonomy/ -c 0.7 -O 6

GOM Illumina – next steps with QIIME

Getting back to GOM data analysis on the Amazon cloud. I had run the fwd OTU picking too but forgot to note down the command (finished on Oct 16th):

pick_otus.py -i ~/GOM_demultiplexed/GOM_concat_fwd_demultirepeat_1to12_1.fna -m uclust_ref -o /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct -r /home/ubuntu/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -s 0.99 --enable_rev_strand_match

Proceeding with picking rep set of sequences:

pick_rep_set.py -i GOM_concat_fwd_demultirepeat_1to12_1_otus.txt -f ~/GOM_demultiplexed/GOM_concat_fwd_demultirepeat_1to12_1.fna -m first -l pick_rep_set.log -o GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set.fasta

pick_rep_set.py -i GOM_concat_rev_demultirepeat_1to12_2_otus.txt -f ~/GOM_demultiplexed/GOM_concat_rev_demultirepeat_1to12_2.fna -m first -l pick_rep_set.log -o GOM_concat_rev_demultirepeat_1to12_2_otus_rep_set.fasta

Next need to align sequences:

parallel_align_seqs_pynast.py -i /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/GOM_concat_fwd_demultirepeat_1to12_1_otus_rep_set.fasta -t /home/ubuntu/Silva_108/core_aligned/Silva_108_core_aligned_seqs.fasta -a uclust -o /home/ubuntu/GOM_demulti2_fwd_OpenRef_uclust99_12Oct/aligned_seqs/ -e 70 -O 6

parallel_align_seqs_pynast.py -i /home/ubuntu/GOM_demulti2_rev_OpenRef_uclust99_12Oct/GOM_concat_rev_demultirepeat_1to12_2_otus_rep_set.fasta -t /home/ubuntu/Silva_108/core_aligned/Silva_108_core_aligned_seqs.fasta -a uclust -o /home/ubuntu/GOM_demulti2_rev_OpenRef_uclust99_12Oct/aligned_seqs/ -e 70 -O 6

More updates to PhyloSift website

Finished updating the outputs section of the PhyloSift website. Going to close the GitHub issue but here’s what I still need to update:

  • Column 2 of sequence_taxa files – what is it?
  • Confirm that marker_summary.txt in main output directory summarizes the alignDir marker info
  • Clearer info on the .jnlp and .xml output files that Aaron is working on for the fat tree visualization.