Parallel Aligning Sequences, yet again

Reattempting to align sequences so I can move forward with some cursory core diversity analyses. This time I opted for an AWS m2.2xlarge (only 32GB memory), since I the larger instance was just not wanting to align anything. Same errors with this script, even after changing the qiime_config file. -i /home/ubuntu/data/18S_chimera_openref96_alldenovo_18Oct13/rep_set.fna -o /home/ubuntu/data/18S_chimera_openref96_alldenovo_18Oct13/pynast_aligned_seqs -T –jobs_to_start 4 –template_fp /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk_aligned.fasta –pairwise_alignment_method uclust –min_percent_id 70.0 –min_length 150

Finally posted a help message on the QIIME Google Group, because this issue has been frustrating me long enough:!topic/qiime-forum/f2tA2a97OxE


18S_chimera – 100% subsampled OTUs

Been thinking about OTU picking, and if we really want to figure out how chimeric sequences are being incorporated into OTUs, I have to cluster 100% of the failure OTUs. So running another round of analyses:

96 clustering: -i /home/ubuntu/data/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /home/ubuntu/data/18S_chimera_openref96_alldenovo_18Oct13 -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 1.0 --prefilter_percent_id 0.0 -p /home/ubuntu/data/qiime_parameters_18Schimera_96_amazon.txt

99% clustering: -i /home/ubuntu/data/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /home/ubuntu/data/18S_chimera_openref99_alldenovo_20Oct13 -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 1.0 --prefilter_percent_id 0.0 -p /home/ubuntu/data/qiime_parameters_18Schimera_99_amazon.txt

18S Chimera – rerunning relabeled files

Wrote a script last night to label chimeric sequences with >chimera_ – now rerunning QIIME analyses locally on my iMac -i /Users/hollybik/Desktop/Data/18S_chimera/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /Users/hollybik/Desktop/Data/18S_chimera/chimera_openref96_18Sept -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta --parallel -O 2 -s 0.1 --prefilter_percent_id 0.0 -p /Users/hollybik/Dropbox/QIIME/qiime_parameters_18Schimera_96_iMac.txt 

Update (10/3/13) – iMac taking way too long for OTU picking, so moved over to Amazon AWS. Command for 96% open ref: -i /home/ubuntu/data/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /home/ubuntu/data/18S_chimera_openref96_3oct13 -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 --prefilter_percent_id 0.0 -p /home/ubuntu/data/qiime_parameters_18Schimera_96_amazon.txt

Command for 99% open ref: -i /home/ubuntu/data/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /home/ubuntu/data/18S_chimera_openref99_5oct13 -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 --prefilter_percent_id 0.0 -p /home/ubuntu/data/qiime_parameters_18Schimera_99_amazon.txt

QIIME parameters for new subsampled Open Ref workflow

Been thinking about my choices of parameters and playing around with the new workflow.

Some Key things to remember:

  • The QIIME site reccommends running 8 parallel jobs for the m2.4xlarge Amazon AWS instances (they state this here)
  • I was on the fence about the prefiltering step but I think the 60% reference-based OTU picking will do a good job at reducing error, even for eukaryotes. So I re-enabled this command. UPDATE (8/23): I changed my mind after thinking about it more. While prefiltering is a ¬†good idea for 16S (where you might get chloroplast DNA), I don’t think the 18S primers hit anything else that’s fishy. But I will test this out with some of the GOM data anyway.
  • The QIIME parameter file MUST contain the pick_otus:enable_rev_strand_match command (I forgot to add this in on the last run). This is vital unless you want to lose data because you fail to reverse complment! Also be careful to check the align_seqs:min_length, contingent on the dataset – e.g. I set this at 50 for the HiSeq GOM Illumina data, but upped it to 150 for the merged MiSeq 18S_chimera data. Finally, assign_taxonomy:evalue is ONLY required when you are using BLAST. If you have anything value listed here when running RDP, you’ll get an error and the taxonomy assignment won’t complete (annoying if running a workflow script…)

So the new command I ran is as follows – note I’m running at 96% this time to get a quick answer so I can look through the results:

! -i /gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /gom_data/uclust_openref96_ref_22Aug -r /gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /gom_data/qiime_parameters_18Sopenref_GOMamazon.txt -f

OK, I have played around some more with the script. Figured out that I do need a parameter file to tweak things the way I wanted. Template QIIME parameter files are now posted to Dropbox folder “QIIME” (need to post these to website too). Commands I ran are as follows:

18S Chimera – rerun locally at 99% cutoff:

! -i /Users/hollybik/Desktop/Data/18S_chimera/chim_demux.extendedFrags_primersremoved_fastxtrimmed.fasta -o /Users/hollybik/Desktop/Data/18S_chimera/uclust_99_redo_21Aug -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta --parallel -O 2 -s 0.1 --prefilter_percent_id 0.0 -p /Users/hollybik/Dropbox/QIIME/qiime_parameters_18Sopenref_iMac.txt

GOM Illumina – Starcluster/Amazon run on the cloud:

! -i /gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /gom_data/uclust_openref99_rev_21Aug -r /gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 6 -s 0.1 --prefilter_percent_id 0.0 -p /gom_data/qiime_parameters_18Sopenref_GOMamazon.txt

iPython Notebook 18S Chimera

Started a new iPython notebook for the 18S Chimera project. Running open ref OTU picking over the weekend:

! -i /Users/hollybik/Desktop/Data/18S_chimera/chim_demux.extendedFrags_primersremoved_fastxtrimmed.fasta -o /Users/hollybik/Desktop/Data/18S_chimera/uclust_99_merged -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta --parallel -O 2 -s 0.1 --suppress_taxonomy_assignment --suppress_align_and_tree --prefilter_percent_id 0.0

OTU picking finished over the weekend (yay!) on my Desktop iMac for an input file that was ~2GB in size. Running taxonomy assignment next:

! -i /Users/hollybik/Desktop/Data/18S_chimera/uclust_99_merged/rep_set.fna -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta -t /macqiime/silva_111/eukaryotes_only/taxonomy_euks/99_Silva_111_taxa_map_RDP_7_levels_euks.txt -m rdp -c 0.5

Installing FASTX Toolkit

Not sure how I got away so long without FASTX toolkit on my iMac. Followed these install instructions to get it set up on my computer (need to quality trim the 18S chimera sequences):

First, libgtextutils:

curl -O
tar xvjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
sudo make install

Then the FASTX-Toolkit – note the step to define PKG_CONFIG_PATH:

curl -O
tar xjvf fastx_toolkit-0.0.13.tar.bz2
cd fastx_toolkit-0.0.13
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
sudo make install

Then I used fastq_quality_filter to remove low quality reads:

fastq_quality_filter -i chim_demux.extendedFrags_primersremoved.fastq -o chim_demux.extendedFrags_primersremoved_fastxtrimmed.fastq -q 20 -p 80 -Q 33 -v

Quality cut-off: 20
Minimum percentage: 80
Input: 2829756 reads.
Output: 2746576 reads.
discarded 83180 (2%) low-quality reads.

This site has a good tutorial for using FASTX trim to quality filter reads. And this site is where I got the install help for the FASTX toolkit.

Next step was to convert FASTQ to FASTA:

fastq_to_fasta -i chim_demux.extendedFrags_primersremoved_fastxtrimmed.fastq -o chim_demux.extendedFrags_primersremoved_fastxtrimmed.fasta -n -v -Q 33

Input: 2746576 reads.
Output: 2746576 reads.

I was originally getting an “invalid quality score value” error, but upon further investigation it seems like you need to use the -Q 33 parameter to indicate the new encoding on Illumina quality values (see here: