Metagenomes (xeno, gom_fungi) and rerunning Open Ref OTUs

Xeno HiSeq data – Talked with David, still trying to figure out if we have xeno in there (David will run the raw reads through PhyloSift). What I’m doing:

  • Running Xeno data through MG-RAST. Trying to get an initial overview of the shotgun data
  • Running Xeno data through QIIME (prefiltering, ref-based picking only at 60%) to pull out any rRNA reads that might be in there. Hopefully we can get a better picture of the microbial community. Command run:
! -i /Users/hollybik/Desktop/Data/metagenomes/HB_RN_March2013_XENO_unzip.fasta -r /macqiime/silva_111/rep_set/Silva_111_full_unique.fasta -o /Users/hollybik/Desktop/Data/metagenomes/xeno_qiime60prefilter -p /Users/hollybik/Dropbox/QIIME/qiime_parameters_filterMGforrRNA.txt --parallel -O 2

Also uploaded GOM_Fungi data to MG-RAST to get an idea of what’s in the sample – data is processing through the pipeline now.

Made some final tweaks to the open ref OTU picking protocol on StarCluster. This should hopefully be the final command that will run to completion after changing the SC script in qiime_config:

! -i /gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /gom_data/uclust_openref96_ref_22Aug -r /gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /gom_data/qiime_parameters_18Sopenref96_GOMamazon.txt --prefilter_percent_id 0.0 -f

Exploring Figshare

Today I’m uploading data to Figshare for the first time – getting our PhyloSift analysis published before we submit the manuscript.

On Edhar, the final data folders are qiime_analyses_yatsunenko_16S_amplicons and qiime_analyses_yatsunenko_metagenomes_7Oct. Figured this out after I looked through my e-mails from October (when I was running these analyses):

So I spent the weekend looking through the yatsunenko data after my discovery about QIIME’s reference-based OTU picking protocol. It turns out that our 16S data is OK (this was a closed-reference process, where all the reads not matching Greengenes were discarded), but the 16S derived from the PhyloSift metagenome analysis was an open-reference process (where QIIME created new de novo clusters for sequences that didn’t match Greengenes, i.e. the opposite of what Yatsunenko did).
Because I wanted to keep things consistent in the manuscript (keeping amplicon and metagenome workflows the same, and directly comparable with the Yatsunenko methods), I re-ran the metagenome data using a closed-reference OTU picking process. That means we have far fewer OTUs, and might see different patterns in the PCoAs. Sorry about the confusion over this, just glad I caught it early enough.
New data has been downloaded to Edhar: /home/hollybik/yatsunenko_QIIME/qiime_analyses_yatsunenko_metagenomes_7oct
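
To make the closed-reference versus open-reference distinction concrete, here is a toy Python sketch. It is not QIIME's implementation: the function name pick_otus, the read and reference IDs, and the "denovo" labels are all made up for illustration, and the de novo step is a stand-in in which each unmatched read becomes its own cluster (real open-reference picking clusters the unmatched reads against each other).

```python
def pick_otus(read_hits, mode="closed"):
    """Toy OTU assignment.

    read_hits maps a read ID to the reference ID it matched at the chosen
    similarity threshold, or to None if it matched nothing.
    """
    if mode not in ("closed", "open"):
        raise ValueError("mode must be 'closed' or 'open'")

    otus = {}
    unmatched = []
    for read_id, ref_id in read_hits.items():
        if ref_id is None:
            unmatched.append(read_id)
        else:
            otus.setdefault(ref_id, []).append(read_id)

    if mode == "open":
        # Stand-in for de novo clustering: one new OTU per unmatched read.
        for index, read_id in enumerate(unmatched):
            otus[f"denovo{index}"] = [read_id]
    # In closed mode the unmatched reads are simply discarded.
    return otus


# Hypothetical reads: two hit refA, one hits refB, two hit nothing.
hits = {"r1": "refA", "r2": "refA", "r3": None, "r4": "refB", "r5": None}
closed_otus = pick_otus(hits, mode="closed")
open_otus = pick_otus(hits, mode="open")
```

With the same reads, the closed-reference run keeps only the OTUs anchored to reference sequences, while the open-reference run also keeps the extra de novo clusters. This is why the re-run metagenome data ends up with fewer OTUs.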

Xeno data analysis progress

Transferring info over from Google Docs. Here is the recent progress with the xeno dataset:


Ran Illumina FASTA files through QIIME to mine 18S sequences from raw reads (to see if we have any hits to Rhizaria), using reference-based OTU picking against the SILVA 108 database: -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -r /home/qiime/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o /home/qiime/Desktop/Shared_Folder/xeno_data/QIIME_18S_Silva_scan/ -t /home/qiime/Silva_108/taxa_mapping/Silva_108_taxa_mapping.txt


The QIIME virtual box kept crashing, so I re-installed it (upgrading to the QIIME 1.5 release in the process), and am now trying the parallel script to pick reference OTUs through uclust: -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -o xeno_raw_s_2_1_qiime18S -r /home/qiime/Desktop/Shared_Folder/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -s 0.99 --enable_rev_strand_match


QIIME still giving me trouble, particularly the parallel scripts. Now trying your bog standard ref-based OTU picking, hopefully this will work. If not, I’ll run this on Edhar next because the VB approach seems to be failing me… -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -o xeno_raw_s_2_1_qiime18S -m uclust_ref -r /home/qiime/Desktop/Shared_Folder/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -s 0.99 -t -z


Still having problems running the 18S reference picking through QIIME. Have now uploaded the xeno FASTA file of raw Illumina reads to qiime@localhost on Edhar, and am running closed-reference picking: -i /home/qiime/data/hbik/xeno_raw/xeno_raw_s_2_1_sequence.fasta -r /home/qiime/data/hbik/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o /home/qiime/data/hbik/xeno_18S/ -t /home/qiime/data/hbik/Silva_108/taxa_mapping/Silva_108_taxa_mapping.txt --parameter_fp /home/qiime/data/hbik/xeno_raw/qiime_parameters_xeno.txt -f


Also running the xeno contigs on the latest build of PhyloSift devel (phylosift_devel_20120806) and master (phylosift_master_20120806) using both the core markers and the devel markers. Yesterday I noticed that the two marker sets don’t give you overlapping results in terms of the contigs they pull out – even when you’re apparently using the same markers (e.g. Viral – although it might be different *versions* of the markers)

No hits to devel branch for some reason – xeno data not producing any outputs at all?!

Getting hits using devel branch and devel markers – but not all the files in the alignDir are giving me .jplace files (16S/18S specifically)


I was getting problems with the FASTA file on Edhar, even after fixing all the parameter files it said that QIIME needed…so I’ve upped my memory on my iMac and am trying again with ref-based OTU picking against SILVA -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -r /home/qiime/Desktop/Shared_Folder/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_18S/ -t /home/qiime/Desktop/Shared_Folder/Silva_108/taxa_mapping/Silva_108_taxa_mapping.txt

Running xeno data using Master markers (master branch 20120810) and Devel markers (devel branch 20120810). Going to compare the outputs and see how the markers used affect the taxonomy summary and scaffold mining (i.e. to clarify the discrepancies I was seeing across old PhyloSift builds)
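
One simple way to quantify the overlap between the two marker sets is a set comparison of the contig IDs each run pulls out. This is a minimal sketch only: the function name and the contig IDs below are hypothetical, and extracting the IDs from the PhyloSift output folders is not shown.

```python
def compare_contig_sets(master_ids, devel_ids):
    """Split contig IDs into shared, master-only and devel-only groups."""
    master, devel = set(master_ids), set(devel_ids)
    return {
        "shared": sorted(master & devel),
        "master_only": sorted(master - devel),
        "devel_only": sorted(devel - master),
    }


# Hypothetical contig IDs recovered by each marker set.
master_hits = ["contig_1", "contig_2", "contig_5"]
devel_hits = ["contig_2", "contig_3", "contig_5", "contig_8"]
overlap = compare_contig_sets(master_hits, devel_hits)
```

If the "shared" list is short relative to the two "only" lists, the marker sets really are recovering different contigs, rather than the difference being down to how the taxonomy summary is built.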

Next Steps

  1. Look at probability distributions for a subset of contigs, across time for PhyloSift analyses. Do they get better or worse?
  2. Run Illumina FASTA files through QIIME to mine 18S sequences from raw reads. See if we have any hits to Rhizaria.
  3. Look in the Eukaryote-specific .jplace files to see if we are getting hits to the foraminifera for the xeno data. Need to re-run on non-updated markers (delete .updated files from devel markers)
  4. Look for xeno data in raw reads – use 18S built marker packages (still haven’t resolved this – GitHub issue #322)
  5. Run raw reads through PhyloSift to see difference in taxonomy summary versus contigs (raw reads taking up too much memory to run through PhyloSift at the moment).
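
For step 3, a quick first check is to list which query sequences were actually placed in a given .jplace file. A .jplace file is JSON with a "placements" list, where each entry names its queries under "n" (a list of names) or "nm" (name and multiplicity pairs). The sketch below assumes that layout; the function names and the example record are made up for illustration. It does not map placements to taxa, so finding foraminifera hits would still mean checking the placement edges against the reference tree.

```python
import json


def load_jplace(path):
    """Read a .jplace file (plain JSON) into a dictionary."""
    with open(path) as handle:
        return json.load(handle)


def placed_query_names(jplace):
    """Return the names of all query sequences that received a placement."""
    names = []
    for placement in jplace.get("placements", []):
        # Names are stored either as "n" or as "nm" (name, multiplicity).
        names.extend(placement.get("n", []))
        names.extend(name for name, _multiplicity in placement.get("nm", []))
    return names


# Hypothetical minimal record with two placements covering three queries.
example = {
    "version": 3,
    "tree": "(A:0.1{0},B:0.2{1}):0.0{2};",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[0, -10.0, 0.9, 0.01, 0.02]], "n": ["contig_1"]},
        {"p": [[1, -12.0, 0.8, 0.01, 0.02]],
         "nm": [["contig_2", 1], ["contig_3", 2]]},
    ],
    "metadata": {},
}
example_names = placed_query_names(example)
```

Running placed_query_names(load_jplace(path)) over each Eukaryote-specific file would show at a glance which markers have no placed contigs at all.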