PhyloSift data uploads to FigShare

Final files are sitting here (for the moment, but will transfer to external hard drive soon): /Users/hollybik/Dropbox/UC Davis Projects/PhyloSift/Yatsunenko_phylosift/Figshare Fies/ 

One of the 107 metagenome samples seemed to disappear from the QIIME analyses…did some sleuthing and the most likely culprit seems to be the following rRNA file mined from PhyloSift (saw this on the QIIME_yatsunenko AWS instance). It's only 8.7K:

-rw-r----- 1 ubuntu ubuntu 8.7K 2012-09-13 17:13 4461121.3_ps_16S_bac_extract.fna
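To check whether any other samples are in the same boat, a quick pass with generic Unix tools will flag undersized extracts and count the reads in each (a minimal sketch; run it from whatever directory holds the *_ps_16S_bac_extract.fna files on the instance):

# Flag any 16S extract files under 10 KB (candidates for the missing sample)
find . -name '*_ps_16S_bac_extract.fna' -size -10k -exec ls -lh {} \;

# Count sequences per extract (FASTA headers start with ">")
for f in *_ps_16S_bac_extract.fna; do
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
done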

Exploring Figshare

Today I’m uploading data to Figshare for the first time – getting our PhyloSift analysis published before we submit the manuscript.

On Edhar, the final data folders are qiime_analyses_yatsunenko_16S_amplicons and qiime_analyses_yatsunenko_metagenomes_7Oct. Figured this out after I looked through my e-mails from October (when I was running these analyses):

So I spent the weekend looking through the Yatsunenko data after my discovery about QIIME's reference-based OTU picking protocol. It turns out that our 16S data is OK (this was a closed-reference process, where all the reads not matching Greengenes were discarded), but the 16S derived from the PhyloSift metagenome analysis was an open-reference process (where QIIME created new de novo clusters for sequences that didn't match Greengenes, i.e. the opposite of what Yatsunenko did).
 
Because I wanted to keep things consistent in the manuscript (keeping the amplicon and metagenome workflows the same, and directly comparable with the Yatsunenko methods), I re-ran the metagenome data using a closed-reference OTU picking process. That means we have a lot fewer OTUs, and might see different patterns in the PCoAs. Sorry about the confusion over this, just glad I caught it early enough.
 
New data has been downloaded to Edhar: /home/hollybik/yatsunenko_QIIME/qiime_analyses_yatsunenko_metagenomes_7oct

PhyloSifting updates

So I spent the weekend looking through the Yatsunenko data after my discovery about QIIME's reference-based OTU picking protocol. It turns out that our 16S data is OK (this was a closed-reference process, where all the reads not matching Greengenes were discarded), but the 16S derived from the PhyloSift metagenome analysis was an open-reference process (where QIIME created new de novo clusters for sequences that didn't match Greengenes, i.e. the opposite of what Yatsunenko did).

Because I wanted to keep things consistent in the manuscript (keeping the amplicon and metagenome workflows the same, and directly comparable with the Yatsunenko methods), I re-ran the metagenome data using a closed-reference OTU picking process. That means we have a lot fewer OTUs, and might see different patterns in the PCoAs.

New data has been downloaded to Edhar: /home/hollybik/yatsunenko_QIIME/qiime_analyses_yatsunenko_metagenomes_7oct
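For reference, the difference between the two runs comes down to a few lines in the workflow parameters file. Roughly, the closed-reference setup looks like this (a sketch from memory, so treat the exact parameter names as assumptions; the Greengenes path is the rep set on the instance):

# Closed-reference: cluster only against Greengenes; discard reads that don't match
pick_otus:otu_picking_method uclust_ref
pick_otus:refseqs_fp /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta
pick_otus:suppress_new_clusters True
pick_otus:similarity 0.97

# The earlier (open-reference) run was effectively the same except suppress_new_clusters
# was False, so reads with no Greengenes hit seeded new de novo clusters.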

Also have started playing around again with the PhyloSift devel branch, particularly for 18S rRNA data and Pam Brannock's Euk mRNA contigs from the GOM project.

For 18S rRNA data, it seems like PhyloSift is not pulling down all the sequences. I used two PE input files, each about 1.4 GB in size, and only got 6 chunks of 18S sequences, with the alignDir files at ~3 MB each. These file sizes seem a bit small for such big input files of 18S amplicon data, no? The combined fasta file of demultiplexed GOM sequences didn't seem to produce any useful output (need to re-check this, though).
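To put numbers on that, a rough comparison of reads in vs. sequences out would show how much PhyloSift is actually recovering (generic Unix; the filenames here are placeholders, and it assumes fastq input with 4 lines per record):

# Reads in each paired-end input file
echo $(( $(wc -l < GOM_18S_R1.fastq) / 4 ))
echo $(( $(wc -l < GOM_18S_R2.fastq) / 4 ))

# Sequences recovered into the alignDir output (exact file names depend on the marker)
grep -c '^>' alignDir/*.fasta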

Also updated the website with info about fastq trimming feature.

Re-running Yatsunenko and GOM QIIME

Re-running the Yatsunenko et al. Core Analyses because the old run used open-reference OTU picking, not the closed-reference protocol (my previous command didn't turn off split libraries)

core_qiime_analyses.py -i /home/ubuntu/data_yatsunenko/ps_metagenome_extract/ps_metagenome_combined.fna -o /home/ubuntu/ps_metagenome_coreanalyses_7oct -m /home/ubuntu/qiime_ps_metagenomes_mappingfile.txt -f --suppress_split_libraries -t /home/ubuntu/gg_otus_4feb2011/trees/gg_97_otus_4feb2011.tre -p /home/ubuntu/qiime_parameters_metagenome_7Oct.txt --parallel

But for some reason the alpha rarefaction step didn't work, so I had to run this script separately after getting the error:

alpha_rarefaction.py -i /home/ubuntu/ps_metagenome_coreanalyses_7oct/otus/otu_table.biom -o /home/ubuntu/ps_metagenome_coreanalyses_7oct/alpha_rarefaction/ -t /home/ubuntu/ps_metagenome_coreanalyses_7oct/otus/rep_set.tre -m /home/ubuntu/qiime_ps_metagenomes_mappingfile.txt

**Note: for parallel uclust_ref core analyses, you must delete the following parameters from the parameters file: --clustering_algorithm, --max_e_value, and --max_cdhit_memory (all from the pick_otus parameters) and --e_value (from assign_taxonomy using rep)
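One way to strip those lines out before a parallel run rather than editing by hand (generic shell; the output filename is a placeholder):

# Drop the parameters that break parallel uclust_ref core analyses, keep the rest
grep -v -E 'clustering_algorithm|max_e_value|max_cdhit_memory|e_value' /home/ubuntu/qiime_parameters_metagenome_7Oct.txt > /home/ubuntu/qiime_parameters_metagenome_parallel.txt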

Also moving forward with more GOM QIIME-ing, using both closed- and open-reference OTU picking. Going to run a bunch of different analyses, including uclust_ref and usearch_ref.

core_qiime_analyses.py -i /home/ubuntu/GOM_demultiplexed/GOM_concat_fwd_demulti_1to12_1.fna -o /home/ubuntu/GOM_coreanalyses_ClosedRef_99_7Oct -m /home/ubuntu/QIIMEmappingfile_GOM_Illumina_fakebarcodes.txt -f --suppress_split_libraries -p /home/ubuntu/qiime_parameters_GOM_ClosedRef.txt --parallel

Also figured out why the PyNAST euk alignment wasn't working: I had align_seqs:min_length set to 150 originally, meaning that no Illumina sequence would ever align because they're way shorter than this. Now changed this parameter to 70 in all the qiime parameter files.
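One way to make that change across all of the parameter files in one go (a sed sketch; the file glob is a placeholder, and it assumes the old line reads exactly "align_seqs:min_length 150"):

# Drop the PyNAST minimum alignment length from 150 to 70 in every parameters file
sed -i 's/^align_seqs:min_length 150$/align_seqs:min_length 70/' qiime_parameters_*.txt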

Yatsunenko more analysis yay

Getting more analyses done in prep for the paper. We wanted to know what taxa were driving the PCoA groupings across components. The QIIME forum suggested we make biplots for this. Summarized the OTU table and then made the plot (in /home/hollybik/yatsunenko_QIIME/ps_16Samplicons_parallel_picktotus_17sept/beta_diversity/biplots on Edhar):

summarize_taxa.py -i 16S_amplicon_yatsunenko_combo_otu_table.biom -L 6 -o summarize_taxa/

make_3d_plots.py -i weighted_unifrac_pc.txt -m /home/ubuntu/qiime_16S_amplicon_mappingfile.txt -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/summarize_taxa/16S_amplicon_yatsunenko_combo_otu_table_L6.txt -o biplots/

Also ran alpha diversity metrics – this command wasn’t working before, but all seemed to be OK the second time I tried running it on the 16S amplicon data.

alpha_diversity.py -i 16S_amplicon_yatsunenko_combo_otu_table.biom -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/alpha_diversity -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/16S_amplicon_yatsunenko_combo_otus_rep_set_aligned_pfiltered.tre
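The command above relies on QIIME's default metrics; to see what's available, or to pin down specific metrics, something like this should work (from memory of the QIIME 1.x interface, so treat the flags, metric names, and output path as assumptions):

# List the alpha diversity metrics QIIME knows about
alpha_diversity.py -s

# Request specific metrics explicitly (phylogenetic metrics need the tree via -t); output path is a placeholder
alpha_diversity.py -i 16S_amplicon_yatsunenko_combo_otu_table.biom -m PD_whole_tree,chao1,observed_species -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/16S_amplicon_yatsunenko_combo_otus_rep_set_aligned_pfiltered.tre -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/alpha_diversity_selected_metrics.txt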

Yatsunenko and GOM – more QIIMEing

Continuing where I left off with the QIIME analyses. For the Yatsunenko data, trying to finish up the 16S amplicon analyses before the sprint ends on Friday (have to do this manually because the core analyses workflow just wasn’t working for the huge dataset):

pick_rep_set.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus.txt -m first -o 16S_amplicon_yatsunenko_combo_otus_rep_set.txt -s otu -r /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta

align_seqs.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus_rep_set.txt -t /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011_aligned.fasta -m pynast -a uclust -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/ -e 150 -p 0.75

filter_alignment.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/16S_amplicon_yatsunenko_combo_otus_rep_set_aligned.fasta -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/ -g 0.999999 -t 3.0 --suppress_lane_mask_filter

assign_taxonomy.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus_rep_set.txt -t /home/ubuntu/gg_otus_4feb2011/taxonomies/greengenes_tax_rdp_train.txt -r /home/ubuntu/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta -m rdp -c 0.8 -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/assign_taxonomy/

make_otu_table.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otus.txt -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otu_table.biom -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/assign_taxonomy/16S_amplicon_yatsunenko_combo_otus_rep_set_tax_assignments.txt
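Before moving on to beta diversity it's worth a quick sanity check on the new OTU table, mainly to confirm that every sample made it in (given the disappearing-sample issue noted above). per_library_stats.py is the QIIME 1.x script I remember for this, so treat the script name as an assumption:

# Report the number of samples and the sequence counts per sample in the new table
per_library_stats.py -i /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/16S_amplicon_yatsunenko_combo_otu_table.biom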

beta_diversity_through_plots.py -i 16S_amplicon_yatsunenko_combo_otu_table.biom -m /home/ubuntu/qiime_16S_amplicon_mappingfile.txt -o /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/beta_diversity/ -t /home/ubuntu/ps_16Samplicons_parallel_picktotus_17sept/aligned_seqs/16S_amplicon_yatsunenko_combo_otus_rep_set_aligned_pfiltered.tre -p /home/ubuntu/qiime_parameters_ps_amplicon.txt

And for the GOM data, proceeding manually (core analyses ran into an error at the RDP step). Commands as follows:

align_seqs.py -i /home/ubuntu/uclust_99_GOMillumina_fwd_1to12_1/otus/rep_set/GOM_illumina_demultiplexed_fwd_concat_1to12_1_rep_set.fasta -t /home/ubuntu/Silva_108/core_aligned/Silva_108_core_aligned_seqs.fasta -m pynast -a uclust -o /home/ubuntu/uclust_99_GOMillumina_fwd_1to12_1/otus/aligned_seqs/ -e 150 -p 0.75

GOM Illumina and Yatsunenko rRNA analyses

Kicked off some more analyses this morning on the cloud. First off is the GOM Illumina data – forward reads give me about a 6GB file, so running the core analyses on a cloud server with 32GB memory:

core_qiime_analyses.py -i /home/ubuntu/data_GOM/GOM_illumina_demultiplexed_fwd_concat_1to12_1 -o /home/ubuntu/uclust_99_GOMillumina_fwd_1to12_1 -m /home/ubuntu/QIIMEmappingfile_GOM_Illumina_fakebarcodes.txt --suppress_split_libraries -p /home/ubuntu/qiime_parameters_GOM.txt

Transferred the Yatsunenko 16S amplicon data over to a High-Memory Instance (transfer speeds are FAST! <20 mins for a 40GB file) – 68GB memory on this one. Using the same core_qiime_analyses command as yesterday (w/greengenes tree).
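For anyone repeating this, the transfer is just a copy between instances over ssh; a minimal sketch of that kind of command (the key file, hostname, and paths are placeholders, not the real instance details):

# Pull the amplicon data onto the high-memory instance over ssh
rsync -av --progress -e 'ssh -i mykey.pem' ubuntu@<old-instance>:/home/ubuntu/data_yatsunenko/ /home/ubuntu/data_yatsunenko/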