Xeno data analysis progress

Transferring info over from Google Docs. Here is the recent progress with the xeno dataset:


Ran Illumina FASTA files through QIIME to mine 18S sequences from the raw reads and see whether we have any hits to Rhizaria, using reference-based OTU picking against the SILVA 108 database.

pick_reference_otus_through_otu_table.py -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -r /home/qiime/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o /home/qiime/Desktop/Shared_Folder/xeno_data/QIIME_18S_Silva_scan/ -t /home/qiime/Silva_108/taxa_mapping/Silva_108_taxa_mapping.txt
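Once a run like this finishes, a quick way to check for Rhizaria is to cross the OTU map against the SILVA taxonomy mapping. Below is a minimal Python sketch of that idea. It assumes the OTU map is tab-separated with the reference ID first and the read IDs after it, and that the taxa mapping file is "reference ID, tab, taxonomy string"; the function names are mine, not part of QIIME.

```python
from collections import Counter


def load_taxonomy(lines):
    """Map reference ID -> taxonomy string from tab-separated lines."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        ref_id, _, tax = line.partition("\t")
        taxonomy[ref_id] = tax
    return taxonomy


def count_reads_for_taxon(otu_map_lines, taxonomy, keyword="Rhizaria"):
    """Count reads per reference OTU whose taxonomy string mentions `keyword`."""
    hits = Counter()
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        otu_id = fields[0]
        read_ids = [f for f in fields[1:] if f]
        # Case-insensitive substring match on the taxonomy string (an assumption:
        # it relies on the lineage being spelled out in the mapping file).
        if keyword.lower() in taxonomy.get(otu_id, "").lower():
            hits[otu_id] += len(read_ids)
    return hits


# Tiny in-memory demo with made-up IDs, just to show the call pattern.
demo_tax = load_taxonomy(["ref1\tEukaryota;Rhizaria;Foraminifera\n",
                          "ref2\tEukaryota;Metazoa;Nematoda\n"])
demo_hits = count_reads_for_taxon(["ref1\tread_a\tread_b\n", "ref2\tread_c\n"], demo_tax)
for otu_id, n_reads in demo_hits.most_common():
    print(otu_id, n_reads, demo_tax[otu_id], sep="\t")
```

For the real files, pass open file handles for the OTU map and Silva_108_taxa_mapping.txt in place of the demo lists.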


The QIIME Virtual Box kept crashing, so I re-installed it (upgrading to the QIIME 1.5 release in the process) and am now trying the parallel script to pick reference OTUs through uclust.

parallel_pick_otus_uclust_ref.py -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -o xeno_raw_s_2_1_qiime18S -r /home/qiime/Desktop/Shared_Folder/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -s 0.99 --enable_rev_strand_match


QIIME still giving me trouble, particularly the parallel scripts. Now trying your bog standard ref-based OTU picking, hopefully this will work. If not, I’ll run this on Edhar next because the VB approach seems to be failing me…

pick_otus.py -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -o xeno_raw_s_2_1_qiime18S -m uclust_ref -r /home/qiime/Desktop/Shared_Folder/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -s 0.99 -t -z


Still having problems running the 18S reference picking through QIIME. I have now uploaded the xeno FASTA file of raw Illumina reads to qiime@localhost on Edhar, and am running closed-reference picking:

pick_reference_otus_through_otu_table.py -i /home/qiime/data/hbik/xeno_raw/xeno_raw_s_2_1_sequence.fasta -r /home/qiime/data/hbik/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o /home/qiime/data/hbik/xeno_18S/ -t /home/qiime/data/hbik/Silva_108/taxa_mapping/Silva_108_taxa_mapping.txt --parameter_fp /home/qiime/data/hbik/xeno_raw/qiime_parameters_xeno.txt -f


Also running the xeno contigs on the latest build of PhyloSift devel (phylosift_devel_20120806) and master (phylosift_master_20120806) using both the core markers and the devel markers. Yesterday I noticed that the two marker sets don’t give you overlapping results in terms of the contigs they pull out – even when you’re apparently using the same markers (e.g. Viral – although it might be different *versions* of the markers)

No hits to devel branch for some reason – xeno data not producing any outputs at all?!

Getting hits using devel branch and devel markers – but not all the files in the alignDir are giving me .jplace files (16S/18S specifically)
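To pin down exactly which markers are missing placements, something like the following Python sketch could list alignment files that have no matching .jplace file. The directory layout and the file suffixes are assumptions on my part (PhyloSift's actual file names may carry extra tokens), so the suffix arguments will probably need adjusting.

```python
from pathlib import Path


def stems(directory, suffix):
    """Names of files in `directory` that end with `suffix`, with the suffix removed."""
    return {
        p.name[: -len(suffix)]
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    }


def markers_without_placements(align_dir, tree_dir,
                               align_suffix=".fasta", place_suffix=".jplace"):
    """Return sorted marker names that have an alignment but no .jplace file.

    Both suffixes are hypothetical defaults, not confirmed PhyloSift naming.
    """
    aligned = stems(align_dir, align_suffix)
    placed = stems(tree_dir, place_suffix)
    return sorted(aligned - placed)
```

Calling markers_without_placements() with the alignDir and the directory holding the .jplace files should then show whether it is only the 16S/18S markers that drop out.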


I kept having problems with the FASTA file on Edhar, even after fixing all the parameter files that QIIME said it needed, so I’ve increased the memory on my iMac and am trying ref-based OTU picking against SILVA again:

pick_reference_otus_through_otu_table.py -i /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_raw_s_2_1_sequence.fasta -r /home/qiime/Desktop/Shared_Folder/Silva_108/rep_set/Silva_108_rep_set_Eukarya_only.fna -o /home/qiime/Desktop/Shared_Folder/xeno_data/xeno_18S/ -t /home/qiime/Desktop/Shared_Folder/Silva_108/taxa_mapping/Silva_108_taxa_mapping.txt

Running xeno data using Master markers (master branch 20120810) and Devel markers (devel branch 20120810). Going to compare the outputs and see how the markers used affect the taxonomy summary and scaffold mining (e.g. clarify the discrepancies I was seeing across old PhyloSift builds)
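For the comparison itself, a simple set-based check on contig IDs is probably enough as a first pass. This Python sketch assumes I have already pulled the IDs of the contigs hit by each marker set into plain text files, one ID per line; those files and the function names are hypothetical, not something PhyloSift writes out in that form.

```python
def read_ids(path):
    """Read one contig ID per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def compare_contig_sets(master_ids, devel_ids):
    """Split contig IDs into shared, master-only and devel-only groups."""
    master, devel = set(master_ids), set(devel_ids)
    return {
        "shared": sorted(master & devel),
        "master_only": sorted(master - devel),
        "devel_only": sorted(devel - master),
    }


def summarize(comparison):
    """One-line count summary, e.g. for pasting into the notebook."""
    return ", ".join(f"{key}={len(ids)}" for key, ids in comparison.items())
```

If "shared" turns out to be small even for markers that are nominally the same (e.g. Viral), that would support the idea that the two branches carry different versions of those markers.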

Next Steps

  1. Look at probability distributions for a subset of contigs across successive PhyloSift analyses. Do they get better or worse over time?
  2. Run Illumina FASTA files through QIIME to mine 18S sequences from raw reads. See if we have any hits to Rhizaria.
  3. Look in the Eukaryote-specific .jplace files to see if we are getting hits to the foraminifera for the xeno data. Need to re-run on non-updated markers (delete .updated files from devel markers)
  4. Look for xeno data in raw reads – use 18S built marker packages (still haven’t resolved this – GitHub issue #322)
  5. Run raw reads through PhyloSift to see difference in taxonomy summary versus contigs (raw reads taking up too much memory to run through PhyloSift at the moment).
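For steps 1 and 3, the .jplace files are JSON, so they can be inspected directly. Here is a rough Python sketch that pulls out, for each placed sequence, the edge with the highest like_weight_ratio. It assumes the standard jplace fields "edge_num" and "like_weight_ratio" are present; mapping an edge number back to a taxon such as the foraminifera still needs the reference tree and is not attempted here.

```python
import json


def load_jplace(path):
    """Parse a .jplace file (JSON) into a dict."""
    with open(path) as handle:
        return json.load(handle)


def best_placements(jplace):
    """Map each placed sequence name to (edge_num, like_weight_ratio) of its best placement."""
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    results = {}
    for placement in jplace.get("placements", []):
        # Each row of "p" is one candidate placement; keep the most probable one.
        best = max(placement["p"], key=lambda row: row[lwr_i])
        n = placement.get("n", [])
        names = [n] if isinstance(n, str) else list(n)
        # "nm" entries are [name, multiplicity] pairs.
        names += [entry[0] for entry in placement.get("nm", [])]
        for name in names:
            results[name] = (best[edge_i], best[lwr_i])
    return results
```

Running best_placements() on the same contigs across different PhyloSift builds would give a first look at whether the placement probabilities improve or degrade.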

BLAST Dongying’s markers and RAxML run on thesis nems

Making progress on BLASTing Dongying’s vs. Parfrey’s eukaryotic markers. Ran command on Edhar:

blastall -p blastp -d /home/hollybik/euks_vs_dongying_markers/DW_BacArch_ComboMarkers.faa -i /home/hollybik/Euks_ParfreyMarkers/Euk_ParfeyMarkers_allgenes_unaligned.fasta -o Parfrey_vs_Dongying_blastp.txt -v 3 -b 3

Having major problems with file conversions, so trying to run RAxML locally now on Edhar:

raxmlHPC -s SSUalign_BikThesisNems_Phylip_21Apr.txt -n SSUalign_RAxML_GTRCAT -m GTRCAT -f a -T 4

That didn’t work so Guillaume fixed the Phylip file and we ran the following command:

raxmlHPC -s gjospin_23Apr_out.phylip -n SSUalign_RAxML_GTRCAT -m GTRCAT -f a -T 4 -x 12345 -# 100 -p 12345
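To catch file-conversion problems before handing an alignment to RAxML, a basic sanity check on the PHYLIP file might help. This Python sketch only handles the simple relaxed layout with one taxon per line (name, whitespace, sequence), not interleaved files, and I don't know whether it would have caught whatever was wrong with the original file.

```python
def check_phylip(lines):
    """Return a list of problems in a relaxed, one-line-per-taxon PHYLIP alignment."""
    rows = [line.strip() for line in lines if line.strip()]
    if not rows:
        return ["file is empty"]
    header = rows[0].split()
    if len(header) != 2 or not all(tok.isdigit() for tok in header):
        return ["header must be '<ntaxa> <nsites>'"]
    ntaxa, nsites = int(header[0]), int(header[1])

    problems = []
    records = rows[1:]
    if len(records) != ntaxa:
        problems.append(f"header says {ntaxa} taxa, found {len(records)}")

    seen = set()
    for row in records:
        parts = row.split(None, 1)
        if len(parts) != 2:
            problems.append(f"no sequence for '{parts[0]}'")
            continue
        name = parts[0]
        seq = "".join(parts[1].split())  # drop any spacing inside the sequence
        if name in seen:
            problems.append(f"duplicate taxon name '{name}'")
        seen.add(name)
        if len(seq) != nsites:
            problems.append(f"'{name}' has {len(seq)} sites, expected {nsites}")
    return problems
```

An empty list means the header, taxon names and sequence lengths are at least consistent with each other; to check a file, pass an open handle, e.g. check_phylip(open("alignment.phylip")), where the file name is a placeholder.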

Also re-downloaded the newest version of PhyloSift (including all new markers), and re-ran it on xeno_assembly_low_cov.fa:

./phylosift all /home/hollybik/TestData/xeno_assembly_low_cov.fa