Rerunning GOM Illumina on QIIME 1.8

I’ve spent the last few days re-parsing data to get around some issues with the way the sequencing facility handed me the GOM Illumina data (I ended up needing to separate out primer sets and renumber the demultiplexed sequences using the script in the GOM_Illumina/QIIME_files/Dec_2013 folder on Dropbox). In any case, continuing with analysis on QIIME 1.8:

-i /home/ubuntu/data/GOM_concat1.7_allF04combo_10Jan14.fna -o /home/ubuntu/data/uclust_openref99_F04_10Jan -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref99_GOMamazon_16sept.txt --prefilter_percent_id 0.0

-i /home/ubuntu/data/GOM_concat1.7_allR22combo_10Jan14.fna -o /home/ubuntu/data/uclust_openref99_R22_10Jan -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref99_GOMamazon_16sept.txt --prefilter_percent_id 0.0

Got a weird error on AWS when I started running the commands above: “OpenBLAS : Your OS does not support AVX instructions. OpenBLAS is using Nehalem kernels as a fallback, which may give poorer performance.” Not sure if this will affect anything, but the open ref OTU picking seems to be progressing OK regardless (for now…).

Filtering Fasta files in QIIME

Just discovered this script to filter my input fasta sequences in QIIME (e.g. post-demultiplexed samples with SampleId_SeqID header format). Needed to do some filtering on the GOM Illumina data, as follows:

-f /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_rev_demulti_1to12_2.fna -o /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_rev_demulti_1to12_2_F04only.fna --sample_id_fp /Users/hollybik/Dropbox/Projects/GOM_Illumina_Dauphin/QIIME_files/Dec_2013/QIIMEmappingfile_GOM_Illumina_fakebarcodes_F04only.txt

-f /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_rev_demulti_1to12_2.fna -o /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/split_by_sample_Dec2013/GOM_concat1.7_rev_demulti_1to12_2_R22only.fna --sample_id_fp /Users/hollybik/Dropbox/Projects/GOM_Illumina_Dauphin/QIIME_files/Dec_2013/QIIMEmappingfile_GOM_Illumina_fakebarcodes_R22only.txt

-f /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/split_by_sample_Dec2013/GOM_concat1.7_fwd_demulti_1to12_1_F04only.fna --sample_id_fp /Users/hollybik/Dropbox/Projects/GOM_Illumina_Dauphin/QIIME_files/Dec_2013/QIIMEmappingfile_GOM_Illumina_fakebarcodes_F04only.txt

-f /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /Users/hollybik/Desktop/Data/Illumina_GOM/demultiplexed_qiime1.7/split_by_sample_Dec2013/GOM_concat1.7_fwd_demulti_1to12_1_R22only.fna --sample_id_fp /Users/hollybik/Dropbox/Projects/GOM_Illumina_Dauphin/QIIME_files/Dec_2013/QIIMEmappingfile_GOM_Illumina_fakebarcodes_R22only.txt
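
For reference, here is a minimal Python sketch of what filtering by sample ID amounts to for SampleId_SeqID headers. It is not QIIME’s implementation, only an illustration. The function names, the sample IDs (F04.A, R22.B) and the tiny mapping file are all made up for the example, and it assumes the sample ID is everything before the last underscore in the sequence label.

```python
def read_sample_ids(mapping_lines):
    """Collect sample IDs from the first column of a tab-separated mapping file."""
    ids = set()
    for line in mapping_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header/comment lines
        ids.add(line.split("\t")[0])
    return ids


def filter_fasta_by_sample(fasta_lines, sample_ids):
    """Yield only the FASTA records whose header sample ID is in sample_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            seq_label = line[1:].split()[0]          # e.g. "F04.A_1"
            sample_id = seq_label.rsplit("_", 1)[0]  # assumed SampleId_SeqID format
            keep = sample_id in sample_ids
        if keep:
            yield line


# Hypothetical data, for illustration only.
mapping = [
    "#SampleID\tBarcodeSequence\tDescription",
    "F04.A\tACGT\tforward primer set",
]
fasta = [
    ">F04.A_1 read1", "ACGT",
    ">R22.B_2 read2", "GGCC",
    ">F04.A_3 read3", "TTAA",
]
kept = list(filter_fasta_by_sample(fasta, read_sample_ids(mapping)))
print(kept)
```

Only the two F04.A records survive, which mirrors what the F04only and R22only mapping files are doing in the commands above.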

Generating .biom tables with taxonomy

The scripts just don’t seem to be working for most of the 18S_chimera and GOM_illumina runs. Did a bit of poking on the QIIME forums and it seems like the easier way to do this (confirmed via my fiddling) is to just re-generate the OTU tables, passing in the taxonomy mapping file with the -t flag. This is quick and easy to do on my MacBook Retina:

Commands run:

-i /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref99_rev_16Sept/final_otu_map_mc2.txt -o /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref99_rev_16Sept/otu_table_mc2_w_tax.biom -t /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref99_rev_16Sept/rdp_assigned_taxonomy/rep_set_tax_assignments.txt

-i /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref96_fwd_16Sept/final_otu_map_mc2.txt -o /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref96_fwd_16Sept/otu_table_mc2_w_tax.biom -t /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref96_fwd_16Sept/rdp_assigned_taxonomy/rep_set_tax_assignments.txt

-i /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref99_fwd_20Sept/final_otu_map_mc2.txt -o /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref99_fwd_20Sept/otu_table_mc2_w_tax.biom -t /Users/hollybik/Desktop/Alaska\ Analyses/GOM_Illumina/uclust_openref99_f
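
Before rebuilding a table with taxonomy, it can help to check that every OTU in the OTU map actually has a taxonomy assignment. The sketch below is a rough illustration under assumptions, not part of QIIME. It assumes the OTU map is tab-separated with the OTU ID in the first column, and that the taxonomy assignments file is tab-separated with the OTU ID first and the taxonomy string second. The function names and the OTU and sequence IDs are hypothetical.

```python
def otu_ids_from_map(otu_map_lines):
    """Return OTU IDs (first tab-separated field) from an OTU map."""
    ids = []
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:  # skip empty lines
            ids.append(fields[0])
    return ids


def taxonomy_from_assignments(tax_lines):
    """Map OTU ID -> taxonomy string from a taxonomy assignments file."""
    tax = {}
    for line in tax_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            tax[fields[0]] = fields[1]
    return tax


def otus_missing_taxonomy(otu_map_lines, tax_lines):
    """List OTUs present in the OTU map but absent from the taxonomy file."""
    tax = taxonomy_from_assignments(tax_lines)
    return [otu for otu in otu_ids_from_map(otu_map_lines) if otu not in tax]


# Hypothetical data, for illustration only.
otu_map = [
    "OTU1\tS1_1\tS1_2\n",
    "OTU2\tS2_5\n",
    "OTU3\tS1_9\tS2_7\n",
]
assignments = [
    "OTU1\tEukaryota;Metazoa\t0.98\n",
    "OTU3\tEukaryota;Fungi\t0.81\n",
]
missing = otus_missing_taxonomy(otu_map, assignments)
print(missing)
```

With real data the two inputs would be the open file handles for final_otu_map_mc2.txt and rep_set_tax_assignments.txt. An empty result means every OTU has a taxonomy line.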

Organizing GOM Illumina data

Organizing GOM analyses run to date – downloaded completed runs onto a 1TB external hard drive, along with the parameter files (and copied the command that was run into a comment line at the top of each parameter file). Proceeding with more AWS analysis.

Forward reads at 96% (m2.4xlarge was running out of memory, so dropped down to 6 parallel jobs): -i /home/ubuntu/data/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /home/ubuntu/data/uclust_openref96_fwd_16Sept -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 6 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref96_GOMamazon_16sept.txt --prefilter_percent_id 0.0

(9/20/13) Forward reads at 99% (kept at 6 parallel jobs): -i /home/ubuntu/data/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /home/ubuntu/data/uclust_openref99_fwd_20Sept -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 6 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref99_GOMamazon_16sept.txt --prefilter_percent_id 0.0

Reverse reads at 99%: -i /home/ubuntu/data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /home/ubuntu/data/uclust_openref99_rev_16Sept -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref99_GOMamazon_16sept.txt --prefilter_percent_id 0.0

Script not working on EC2

The script finished for the GOM Illumina reverse reads clustered at 96%; however, the script was not working. The log file indicated the script had been executed, but the top command showed nothing running. I tried running the script manually (command below), both with and without the -T flag. That didn’t seem to work either: the parallel jobs would start but then would all end for some reason.

-i /home/ubuntu/gom_data/uclust_openref96_ref_27Aug/rep_set.fna -o /home/ubuntu/gom_data/uclust_openref96_ref_27Aug/pynast_aligned_seqs_manual -O 8 -t /home/ubuntu/gom_data/99_Silva_111_rep_set_euk_aligned.fasta -a uclust -p 70.0

So instead of worrying about that for now, I’m going to move on to 99% clustering on the reverse reads to see how long this takes. (note: the 96% cutoff finished overnight, for a ~1.5GB file on an m2.4xlarge instance) -i /home/ubuntu/gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /home/ubuntu/gom_data/uclust_openref99_28Aug -r /home/ubuntu/gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 10 -s 0.1 -p /home/ubuntu/gom_data/qiime_parameters_uclust99gom_aws28Aug.txt --prefilter_percent_id 0.0

Reverting to AWS SSHing

Running long jobs with StarCluster/iPython notebook is giving me issues (need to troubleshoot these with the QIIME forum)…so just to get the data run I’m moving back to standard SSHing into Amazon AWS for the GOM Illumina data: -i /home/ubuntu/gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /home/ubuntu/gom_data/uclust_openref96_ref_27Aug -r /home/ubuntu/gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /home/ubuntu/gom_data/qiime_parameters_uclust96gom_aws27aug.txt --prefilter_percent_id 0.0

QIIME parameters for new subsampled Open Ref workflow

Been thinking about my choices of parameters and playing around with the new workflow.

Some key things to remember:

  • The QIIME site recommends running 8 parallel jobs for the m2.4xlarge Amazon AWS instances (they state this here)
  • I was on the fence about the prefiltering step but I think the 60% reference-based OTU picking will do a good job at reducing error, even for eukaryotes. So I re-enabled this command. UPDATE (8/23): I changed my mind after thinking about it more. While prefiltering is a good idea for 16S (where you might get chloroplast DNA), I don’t think the 18S primers hit anything else that’s fishy. But I will test this out with some of the GOM data anyway.
  • The QIIME parameter file MUST contain the pick_otus:enable_rev_strand_match command (I forgot to add this in on the last run). This is vital unless you want to lose data because you fail to reverse complement! Also be careful to check the align_seqs:min_length, contingent on the dataset – e.g. I set this at 50 for the HiSeq GOM Illumina data, but upped it to 150 for the merged MiSeq 18S_chimera data. Finally, assign_taxonomy:evalue is ONLY required when you are using BLAST. If you have any value listed here when running RDP, you’ll get an error and the taxonomy assignment won’t complete (annoying if running a workflow script…)
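
The parameter-file pitfalls above can be checked before launching a long run. What follows is a minimal sketch, not a QIIME tool. It assumes each line of the parameter file has the form "script:option value", and it uses the three keys exactly as written above. The assign_taxonomy:assignment_method key, the function names and the example file contents are assumptions for illustration.

```python
def parse_qiime_params(lines):
    """Parse 'script:option value' lines into a dict, ignoring comments."""
    params = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def check_params(params):
    """Return a list of warnings for the pitfalls described in the notes."""
    problems = []
    if params.get("pick_otus:enable_rev_strand_match", "").lower() != "true":
        problems.append("pick_otus:enable_rev_strand_match is not set to True")
    if "align_seqs:min_length" not in params:
        problems.append("align_seqs:min_length is not set; check it for this dataset")
    # Assumed key name for the taxonomy assignment method.
    method = params.get("assign_taxonomy:assignment_method", "").lower()
    if params.get("assign_taxonomy:evalue") and method != "blast":
        problems.append("assign_taxonomy:evalue is set but the method is not BLAST")
    return problems


# Hypothetical parameter file contents, for illustration only.
example = [
    "# command used for this run goes in a comment line at the top",
    "pick_otus:similarity 0.96",
    "align_seqs:min_length 50",
    "assign_taxonomy:assignment_method rdp",
    "assign_taxonomy:evalue 0.001",
]
problems = check_params(parse_qiime_params(example))
for problem in problems:
    print(problem)
```

For this example the check reports the missing reverse strand match setting and the stray evalue, the two mistakes that are easiest to make when copying a parameter file between runs.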

So the new command I ran is as follows – note I’m running at 96% this time to get a quick answer so I can look through the results:

! -i /gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /gom_data/uclust_openref96_ref_22Aug -r /gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /gom_data/qiime_parameters_18Sopenref_GOMamazon.txt -f