18S Chimera – rerunning relabeled files

Wrote a script last night to label chimeric sequences with a >chimera_ prefix – now rerunning the QIIME analyses locally on my iMac
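The gist of the relabelling can be sketched as a one-pass awk over the FASTA – this is a hypothetical reconstruction, not the actual script, and the file names (chimeric_ids.txt, demo.fasta) are made up for illustration; the ID list would come from whatever chimera checker flagged the sequences:

```shell
# Demo inputs (made-up names): one chimeric ID per line, plus a tiny FASTA.
cat > chimeric_ids.txt <<'EOF'
seq2
EOF
cat > demo.fasta <<'EOF'
>seq1 sample=A
ACGT
>seq2 sample=B
TTGG
EOF

# First pass (NR==FNR) loads the chimeric IDs into an array; second pass
# prepends "chimera_" to the header of any sequence whose ID is in that array.
awk 'NR==FNR {chim[$1]; next}
     /^>/   {id=substr($1,2); if (id in chim) sub(/^>/, ">chimera_")}
     {print}' chimeric_ids.txt demo.fasta > demo_chimeraslabelled.fasta
```

The relabelled file keeps every record – only the flagged headers change, so downstream scripts can still grep the chimeras out (or keep them) as needed.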

pick_open_reference_otus.py -i /Users/hollybik/Desktop/Data/18S_chimera/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /Users/hollybik/Desktop/Data/18S_chimera/chimera_openref96_18Sept -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta --parallel -O 2 -s 0.1 --prefilter_percent_id 0.0 -p /Users/hollybik/Dropbox/QIIME/qiime_parameters_18Schimera_96_iMac.txt 

Update (10/3/13) – iMac taking way too long for OTU picking, so moved over to Amazon AWS. Command for 96% open ref:

pick_open_reference_otus.py -i /home/ubuntu/data/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /home/ubuntu/data/18S_chimera_openref96_3oct13 -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 --prefilter_percent_id 0.0 -p /home/ubuntu/data/qiime_parameters_18Schimera_96_amazon.txt

Command for 99% open ref:

pick_open_reference_otus.py -i /home/ubuntu/data/chim_demux.extendedFrags_primersremoved_fastxtrimmed_chimeraslabelled.fasta -o /home/ubuntu/data/18S_chimera_openref99_5oct13 -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 --prefilter_percent_id 0.0 -p /home/ubuntu/data/qiime_parameters_18Schimera_99_amazon.txt

Organizing GOM Illumina data

Organizing the GOM analyses run to date – downloaded completed runs onto a 1TB external hard drive, along with the parameter files (and copied the command that was run into a comment line at the top of each parameter file). Proceeding with more AWS analysis.

Forward reads at 96% (m2.4xlarge was running out of memory, so dropped down to 6 parallel jobs):

pick_open_reference_otus.py -i /home/ubuntu/data/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /home/ubuntu/data/uclust_openref96_fwd_16Sept -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 6 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref96_GOMamazon_16sept.txt --prefilter_percent_id 0.0

(9/20/13) Forward reads at 99% – kept at 6 parallel jobs

pick_open_reference_otus.py -i /home/ubuntu/data/GOM_concat1.7_fwd_demulti_1to12_1.fna -o /home/ubuntu/data/uclust_openref99_fwd_20Sept -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 6 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref99_GOMamazon_16sept.txt --prefilter_percent_id 0.0

Reverse reads at 99%:

pick_open_reference_otus.py -i /home/ubuntu/data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /home/ubuntu/data/uclust_openref99_rev_16Sept -r /home/ubuntu/data/silva_111/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /home/ubuntu/data/qiime_parameters_18Sopenref99_GOMamazon_16sept.txt --prefilter_percent_id 0.0

Reverting to AWS SSHing

Running long jobs with StarCluster/iPython Notebook is giving me issues (need to troubleshoot these with the QIIME forum)…so just to get the data run I’m moving back to standard SSHing into Amazon AWS for the GOM Illumina data:

pick_open_reference_otus.py -i /home/ubuntu/gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /home/ubuntu/gom_data/uclust_openref96_ref_27Aug -r /home/ubuntu/gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /home/ubuntu/gom_data/qiime_parameters_uclust96gom_aws27aug.txt --prefilter_percent_id 0.0

QIIME parameters for new subsampled Open Ref workflow

Been thinking about my choices of parameters and playing around with the new pick_open_reference_otus.py workflow.

Some key things to remember:

  • The QIIME site recommends running 8 parallel jobs for the m2.4xlarge Amazon AWS instances (they state this here)
  • I was on the fence about the prefiltering step, but I think the 60% reference-based OTU picking will do a good job of reducing error, even for eukaryotes. So I re-enabled this command. UPDATE (8/23): I changed my mind after thinking about it more. While prefiltering is a good idea for 16S (where you might pick up chloroplast DNA), I don’t think the 18S primers hit anything else that’s fishy. But I will test this out with some of the GOM data anyway.
  • The QIIME parameter file MUST contain the pick_otus:enable_rev_strand_match command (I forgot to add this on the last run). This is vital unless you want to lose data because you failed to reverse complement! Also be careful to check align_seqs:min_length, which is contingent on the dataset – e.g. I set this at 50 for the HiSeq GOM Illumina data, but upped it to 150 for the merged MiSeq 18S_chimera data. Finally, assign_taxonomy:evalue is ONLY required when you are using BLAST. If you have any value listed here when running RDP, you’ll get an error and the taxonomy assignment won’t complete (annoying when running a workflow script…)
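Pulling those points together, a parameter file skeleton along these lines is what I mean – the values shown are illustrative (this is a 96% / merged-MiSeq flavor), so adjust the similarity and min_length per run:

```
# (paste the full pick_open_reference_otus.py command run here, for provenance)
pick_otus:enable_rev_strand_match True
pick_otus:similarity 0.96
align_seqs:min_length 150
assign_taxonomy:assignment_method rdp
assign_taxonomy:confidence 0.5
# NB: no e-value line – that setting is for BLAST only, and RDP errors out on it
```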

So the new command I ran is as follows – note I’m running at 96% this time to get a quick answer so I can look through the results:

!pick_open_reference_otus.py -i /gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /gom_data/uclust_openref96_ref_22Aug -r /gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 8 -s 0.1 -p /gom_data/qiime_parameters_18Sopenref_GOMamazon.txt -f


OK, I have played around some more with the pick_open_reference_otus.py script. Figured out that I do need a parameter file to tweak things the way I want. Template QIIME parameter files are now posted to the Dropbox folder “QIIME” (need to post these to the website too). The commands I ran are as follows:

18S Chimera – rerun locally at 99% cutoff:

!pick_open_reference_otus.py -i /Users/hollybik/Desktop/Data/18S_chimera/chim_demux.extendedFrags_primersremoved_fastxtrimmed.fasta -o /Users/hollybik/Desktop/Data/18S_chimera/uclust_99_redo_21Aug -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta --parallel -O 2 -s 0.1 --prefilter_percent_id 0.0 -p /Users/hollybik/Dropbox/QIIME/qiime_parameters_18Sopenref_iMac.txt

GOM Illumina – Starcluster/Amazon run on the cloud:

!pick_open_reference_otus.py -i /gom_data/GOM_concat1.7_rev_demulti_1to12_2.fna -o /gom_data/uclust_openref99_rev_21Aug -r /gom_data/99_Silva_111_rep_set_euk.fasta --parallel -O 6 -s 0.1 --prefilter_percent_id 0.0 -p /gom_data/qiime_parameters_18Sopenref_GOMamazon.txt

iPython Notebook 18S Chimera

Started a new iPython notebook for the 18S Chimera project. Running open ref OTU picking over the weekend:

!pick_open_reference_otus.py -i /Users/hollybik/Desktop/Data/18S_chimera/chim_demux.extendedFrags_primersremoved_fastxtrimmed.fasta -o /Users/hollybik/Desktop/Data/18S_chimera/uclust_99_merged -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta --parallel -O 2 -s 0.1 --suppress_taxonomy_assignment --suppress_align_and_tree --prefilter_percent_id 0.0

OTU picking finished over the weekend (yay!) on my Desktop iMac for an input file that was ~2GB in size. Running taxonomy assignment next:

!assign_taxonomy.py -i /Users/hollybik/Desktop/Data/18S_chimera/uclust_99_merged/rep_set.fna -r /macqiime/silva_111/eukaryotes_only/rep_set_euks/99_Silva_111_rep_set_euk.fasta -t /macqiime/silva_111/eukaryotes_only/taxonomy_euks/99_Silva_111_taxa_map_RDP_7_levels_euks.txt -m rdp -c 0.5

Installing FASTX Toolkit

Not sure how I got away so long without the FASTX Toolkit on my iMac. Followed these install instructions to get it set up on my computer (need to quality-trim the 18S chimera sequences):

First, libgtextutils:

curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.tar.bz2
tar xvjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
./configure
sudo make install

Then the FASTX-Toolkit – note the step to define PKG_CONFIG_PATH:

curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.tar.bz2
tar xjvf fastx_toolkit-0.0.13.tar.bz2
cd fastx_toolkit-0.0.13
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
./configure
sudo make install

Then I used fastq_quality_filter to remove low quality reads:

fastq_quality_filter -i chim_demux.extendedFrags_primersremoved.fastq -o chim_demux.extendedFrags_primersremoved_fastxtrimmed.fastq -q 20 -p 80 -Q 33 -v

Quality cut-off: 20
Minimum percentage: 80
Input: 2829756 reads.
Output: 2746576 reads.
discarded 83180 (2%) low-quality reads.
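A quick arithmetic check on that report: the input count is exactly output + discarded, and the discard rate works out to ~2.9% (so FASTX’s “(2%)” is a truncated figure, not a rounded one):

```shell
# Verify the fastq_quality_filter bookkeeping from the report above.
echo $((2746576 + 83180))                                # prints 2829756, the input count
awk 'BEGIN { printf "%.1f\n", 83180 / 2829756 * 100 }'   # prints 2.9 (percent discarded)
```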

This site has a good tutorial for using FASTX trim to quality filter reads. And this site is where I got the install help for the FASTX toolkit.

Next step was to convert FASTQ to FASTA:

fastq_to_fasta -i chim_demux.extendedFrags_primersremoved_fastxtrimmed.fastq -o chim_demux.extendedFrags_primersremoved_fastxtrimmed.fasta -n -v -Q 33

Input: 2746576 reads.
Output: 2746576 reads.

I was originally getting an “invalid quality score value” error, but upon further investigation it seems you need the -Q 33 parameter to indicate the newer Phred+33 encoding of Illumina quality values (see here: http://seqanswers.com/forums/archive/index.php/t-7399.html).
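For my own reference, the Phred+33 scheme is just “quality = ASCII code of the character minus 33” – a quick way to decode a quality character by hand (the character 'I' here is just an example):

```shell
# Phred+33: quality score = ASCII code - 33. Without -Q 33, FASTX assumes an
# older Phred+64-style offset and rejects low ASCII codes as "invalid".
q_char='I'                          # a typical high-quality Illumina base call
ascii=$(printf '%d' "'$q_char")     # ASCII code of the character
echo $((ascii - 33))                # prints 40, i.e. Q40 under Phred+33
```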