Aquarium data – Round 2

Guillaume re-demultiplexed the Aquarium data for us – we found one barcode that was used twice, so we had to throw out those samples (42.SedimentCoralpond.2 and 0.WaterFreshwater.2). The newest demultiplexed data allows 2 mismatches per merged dual-index long barcode (so 1 per mate barcode). Concatenated the data on Merlot and downloaded to my iMac:

cat /share/eisen-z2/gjospin/Slims_Eisenlab_backup/AQUEXP1/demul_9_reverse/qiime_ready/AQUEXP1.MPF.*.faa > AQUEXP1.MPF.all_reads_merged_20Jun.faa
cat /share/eisen-z2/gjospin/Slims_Eisenlab_backup/AQUEXP1/demul_9_reverse/qiime_ready/AQUEXP1.MPR.*.faa > AQUEXP1.MPR.all_reads_merged_20Jun.faa
cat /share/eisen-z2/gjospin/Slims_Eisenlab_backup/AQUEXP1/demul_9_reverse/qiime_ready/AQUEXP1.M.*.faa > AQUEXP1.M.all_reads_merged_20Jun.faa
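
A quick sanity check on the concatenated files: count FASTA headers to make sure the read totals look reasonable (assuming one header line per read).

grep -c ">" AQUEXP1.MPF.all_reads_merged_20Jun.faa
grep -c ">" AQUEXP1.MPR.all_reads_merged_20Jun.faa
grep -c ">" AQUEXP1.M.all_reads_merged_20Jun.faa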

And then kicked off the QIIME analysis (uclust open reference OTU picking, with parameters file enabling reverse strand match):

pick_open_reference_otus.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/AQUEXP1.MPF.all_reads_merged_20Jun.faa -r /macqiime/greengenes/gg_12_10_otus/rep_set/99_otus.fasta -o /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun --parallel -O 2 -p /Users/hollybik/Desktop/Data/aquarium_project/mapping_files/aquarium_parametersfile.txt
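
For the record, QIIME parameters files use the script:parameter format; I haven't pasted the whole aquarium_parametersfile.txt here, but the line enabling reverse strand matching should look like this:

pick_otus:enable_rev_strand_match True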

Continuation 6/28/13 – Next, filtered out the Koi pond samples and any samples that had fewer than 1000 observations (sequences):

filter_samples_from_otu_table.py -m /Users/hollybik/Desktop/Data/aquarium_project/mapping_files/AquariumSampleMap8.txt -n 1000 -o /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond.biom -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/otu_table_mc2_w_tax_no_pynast_failures.biom --sample_id_fp /Users/hollybik/Desktop/Data/aquarium_project/mapping_files/samples_to_keep_noKoi.txt
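
The --sample_id_fp file is just a plain-text list of sample IDs to keep, one per line. A hypothetical excerpt of samples_to_keep_noKoi.txt (these particular IDs are made up for illustration):

15.WaterCoralpond.1
30.SedimentFreshwater.1
60.WaterSaltwater.2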

Removed OTUs assigned to the genus Brachybacterium:

filter_taxa_from_otu_table.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond.biom -o /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond_noBrachybacterium.biom -n g__Brachybacterium

Filtered out “dead OTUs” (no sequences left after filtering):

filter_otus_from_otu_table.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond_noBrachybacterium.biom -o /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond_noBrachybacterium_nodeadOTUs.biom -n 1

Then running the core_diversity_analyses.py script:

core_diversity_analyses.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond_noBrachybacterium_nodeadOTUs.biom -o /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/core_diversity_analyses -m /Users/hollybik/Desktop/Data/aquarium_project/mapping_files/AquariumSampleMap8.txt -e 90 -t /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/rep_set.tre -c SampleType,Location,SampleTypeLocation,Days
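
To double-check the even sampling depth passed to -e, the per-sample sequence counts can be printed first; in this QIIME 1.x era something like print_biom_table_summary.py should do it (hedging on the exact script name, which changed across QIIME/biom versions):

print_biom_table_summary.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_2mismatcheslongBC/uclust_openref_20Jun/filtered_otu_tables/otu_table_mc2_w_tax_no_pynast_failures_filtered1000seqs_noKoiPond_noBrachybacterium_nodeadOTUs.biom -o table_summary.txt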

Getting started with Aquarium Project data

The data is in – time to get started in QIIME! Walked David through some workflows, but then executed my own reference-based OTU picking as a first pass through the data. Note: 4 parallel OTU picking jobs were too much for my iMac – suggest dropping this to 2 jobs next time so my computer doesn't slow to a crawl.

Commands I ran:

parallel_pick_otus_uclust_ref.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/input_data/AQUEXP1.MPF.all_reads_merged_30May.faa -o ~/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref -r /macqiime/greengenes/gg_12_10_otus/rep_set/99_otus.fasta -O 4 --enable_rev_strand_match
 
pick_rep_set.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref/AQUEXP1.MPF.all_reads_merged_30May_otus.txt -r /macqiime/greengenes/gg_12_10_otus/rep_set/99_otus.fasta -l pick_rep_set.log
 
assign_taxonomy.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref/AQUEXP1.MPF.all_read_merged_30May_rep_set.fasta -m rdp -r /macqiime/greengenes/gg_12_10_otus/rep_set/99_otus.fasta -t /macqiime/greengenes/gg_12_10_otus/taxonomy/99_otu_taxonomy.txt
 
make_otu_table.py -i /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref/AQUEXP1.MPF.all_reads_merged_30May_otus.txt -o AQUEXP1.MPF.all_reads_merged_30May_otu_table.biom -t /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref/rdp_assigned_taxonomy/AQUEXP1.MPF.all_read_merged_30May_rep_set_tax_assignments.txt
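
Two quick sanity checks on the intermediate files (assuming the standard QIIME formats: one OTU per line in the OTU map, one representative sequence per OTU in the rep set, so the two counts should match):

wc -l /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref/AQUEXP1.MPF.all_reads_merged_30May_otus.txt
grep -c ">" /Users/hollybik/Desktop/Data/aquarium_project/demul_trim_prep_noBCmismatches/uclust_ref/AQUEXP1.MPF.all_read_merged_30May_rep_set.fasta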

Exploring Figshare

Today I’m uploading data to Figshare for the first time – getting our PhyloSift analysis published before we submit the manuscript.

On Edhar, the final data folders are qiime_analyses_yatsunenko_16S_amplicons and qiime_analyses_yatsunenko_metagenomes_7Oct. Figured this out after I looked through my e-mails from October (when I was running these analyses):

So I spent the weekend looking through the Yatsunenko data after my discovery about QIIME's reference-based OTU picking protocol. It turns out that our 16S data is OK (this was a closed-reference process, where all the reads not matching Greengenes were discarded), but the 16S derived from the PhyloSift metagenome analysis was an open-reference process (where QIIME created new de novo clusters for sequences that didn't match Greengenes – i.e., the opposite of what Yatsunenko did).
 
Because I wanted to keep things consistent in the manuscript (keeping the amplicon and metagenome workflows the same, and directly comparable with the Yatsunenko methods), I re-ran the metagenome data using a closed-reference OTU picking process. That means we have a lot fewer OTUs, and might see different patterns in the PCoAs. Sorry about the confusion over this; just glad I caught it early enough.
 
New data has been downloaded to Edhar: /home/hollybik/yatsunenko_QIIME/qiime_analyses_yatsunenko_metagenomes_7oct

Re-running Yatsunenko and GOM QIIME

Re-running the Yatsunenko et al. core analyses because the old run used open-reference OTU picking, not the closed-reference protocol (my previous command didn't turn off split libraries).

core_qiime_analyses.py -i /home/ubuntu/data_yatsunenko/ps_metagenome_extract/ps_metagenome_combined.fna -o /home/ubuntu/ps_metagenome_coreanalyses_7oct -m /home/ubuntu/qiime_ps_metagenomes_mappingfile.txt -f --suppress_split_libraries -t /home/ubuntu/gg_otus_4feb2011/trees/gg_97_otus_4feb2011.tre -p /home/ubuntu/qiime_parameters_metagenome_7Oct.txt --parallel

But for some reason alpha rarefaction didn't work, so I had to run this script separately after getting an error:

alpha_rarefaction.py -i /home/ubuntu/ps_metagenome_coreanalyses_7oct/otus/otu_table.biom -o /home/ubuntu/ps_metagenome_coreanalyses_7oct/alpha_rarefaction/ -t /home/ubuntu/ps_metagenome_coreanalyses_7oct/otus/rep_set.tre -m /home/ubuntu/qiime_ps_metagenomes_mappingfile.txt

**Note: for parallel uclust_ref core analyses, you must delete the following parameters from the parameters file: --clustering_algorithm, --max_e_value, and --max_cdhit_memory (all pick_otus parameters), plus --e_value (from assign_taxonomy).
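
A one-liner sketch for stripping those lines out of a copy of the parameters file (assuming the usual one-parameter-per-line script:parameter format; the output filename is just an example):

grep -v -e 'clustering_algorithm' -e 'max_e_value' -e 'max_cdhit_memory' -e 'assign_taxonomy:e_value' /home/ubuntu/qiime_parameters_metagenome_7Oct.txt > /home/ubuntu/qiime_parameters_metagenome_7Oct_parallel.txt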

Also moving forward with more GOM QIIME-ing, using Closed and Open-ref based OTU picking. Going to use a bunch of different analyses, including uclust_ref and usearch_ref.

core_qiime_analyses.py -i /home/ubuntu/GOM_demultiplexed/GOM_concat_fwd_demulti_1to12_1.fna -o /home/ubuntu/GOM_coreanalyses_ClosedRef_99_7Oct -m /home/ubuntu/QIIMEmappingfile_GOM_Illumina_fakebarcodes.txt -f --suppress_split_libraries -p /home/ubuntu/qiime_parameters_GOM_ClosedRef.txt --parallel

Also figured out why the PYNAST euk alignment wasn't working – I had align_seqs:min_length set to 150 originally, meaning that no Illumina sequence would ever align because they're way shorter than that. Now changed this parameter to 70 in all the QIIME parameter files.
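
For the record, the fixed line in the parameters files is just:

align_seqs:min_length 70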

Continuing progress with GOM analysis

Finished re-demultiplexing the GOM Illumina data yesterday on the Amazon Cloud; now combining files in prep for another round of OTU clustering, etc.

cat KO_1_1/seqs.fna KO_2_1/seqs.fna KO_3_1/seqs.fna KO_4_1/seqs.fna KO_5_1/seqs.fna KO_6_1/seqs.fna KO_7_1/seqs.fna KO_8_1/seqs.fna KO_9_1/seqs.fna KO_10_1/seqs.fna KO_11_1/seqs.fna KO_12_1/seqs.fna > GOM_concat_fwd_demulti_1to12_1.fna

cat KO_1_2/seqs.fna KO_2_2/seqs.fna KO_3_2/seqs.fna KO_4_2/seqs.fna KO_5_2/seqs.fna KO_6_2/seqs.fna KO_7_2/seqs.fna KO_8_2/seqs.fna KO_9_2/seqs.fna KO_10_2/seqs.fna KO_11_2/seqs.fna KO_12_2/seqs.fna > GOM_concat_rev_demulti_1to12_2.fna
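
For next time: bash brace expansion would make these concatenations less error-prone, something like this (assuming bash and the same directory naming):

cat KO_{1..12}_1/seqs.fna > GOM_concat_fwd_demulti_1to12_1.fna
cat KO_{1..12}_2/seqs.fna > GOM_concat_rev_demulti_1to12_2.fna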

Next up: comparing the differences between the old QIIME 1.4 (iMac) demultiplexing and the new QIIME 1.5 (AWS Cloud) demultiplexing.

Re-splitting libraries for Illumina GOM data

Trying to see if I can get better outputs for the _2 PE read files. Libraries were originally split using QIIME 1.4 on my iMac, but now I am re-processing them on a HiMem Cloud instance (QIIME 1.5, 68 GB memory, oh yeah), this time using a minimum PHRED score of 20 (-q 20) and explicitly allowing a max of 1.5 errors in the barcode sequence. Commands run were as follows:

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-1_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-1_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_1_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-1_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-1_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_1_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-2_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-2_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_2_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-2_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-2_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_2_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-3_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-3_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_3_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-3_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-3_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_3_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-4_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-4_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_4_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-4_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-4_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_4_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-5_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-5_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_5_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-5_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-5_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_5_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-6_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-6_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_6_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-6_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-6_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_6_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-7_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-7_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_7_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-7_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-7_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_7_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-8_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-8_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_8_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-8_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-8_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_8_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-9_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-9_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_9_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-9_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-9_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_9_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-10_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-10_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_10_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-10_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-10_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_10_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-11_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-11_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_11_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-11_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-11_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_11_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-12_1_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-12_1_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_12_1/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20

split_libraries_fastq.py -i /home/ubuntu/fastx_trimmed_files/1926-KO-12_2_trimmed.txt -b /home/ubuntu/fastx_trimmed_files/1926-KO-12_2_barcode.txt -o /home/ubuntu/GOM_demultiplexed/KO_12_2/ -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt --barcode_type=5 --max_barcode_errors=1.5 -q 20
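
Since these 24 commands differ only in the sample number and mate (_1/_2), a bash loop covers them all; a sketch (assuming the same file-naming pattern and the single mapping file used above):

for i in $(seq 1 12); do
  for mate in 1 2; do
    split_libraries_fastq.py \
      -i /home/ubuntu/fastx_trimmed_files/1926-KO-${i}_${mate}_trimmed.txt \
      -b /home/ubuntu/fastx_trimmed_files/1926-KO-${i}_${mate}_barcode.txt \
      -o /home/ubuntu/GOM_demultiplexed/KO_${i}_${mate}/ \
      -m /home/ubuntu/QIIME_mapping_files/1926_KO_1.txt \
      --barcode_type=5 --max_barcode_errors=1.5 -q 20
  done
done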

Playing around with GOM Illumina rRNA data

So I’ve been trying to come up with a useful workflow for 18S rRNA eukaryotic data, aiming to incorporate a number of different tools (PhyloSift and QIIME) so I can draw biological comparisons across multiple samples. The test dataset has been the Illumina GOM data.

  1. The QIIME analysis broke at the alignment step – using PYNAST against the SILVA reference database, I ended up with ALL sequences failing and an empty alignment file.
  2. Transferred over to ssu-align on the iMac, which has been running for a couple of days and is slowly making its way through the rep_set OTUs from QIIME (99% de novo clustered OTUs); basic invocation sketched after this list.
  3. Now moving over to hmm-align within PhyloSift (align mode) to see if I can get a speedup on this end. It wasn't working the other day, so I am liaising with Guillaume to push this forward.
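
For reference, the basic ssu-align invocation is along these lines (the input filename is illustrative, and restricting the search to the eukaryotic model with -n eukarya is my assumption for 18S data):

ssu-align -n eukarya rep_set_99_denovo.fasta ssu_align_out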

GOM paper brainstorm

QIIME analyses

  • Another round of demultiplexing – using QIIME 1.5.0 and upping the number of barcode mismatches
  • OTU picking
    • de novo (99%)
    • Open-Reference (SILVA database and de novo 99%)
  • Alignments
    • PYNAST
    • ssu-align (iMac)
    • hmm-align (PhyloSift)
  • Taxonomy assignment
    • RDP
    • BLAST
    • Phylogenetic placement?

PhyloSift analyses

  • Single-end analysis (OTUs from QIIME)
    • 1_1 dataset (big file)
    • 1_2 dataset (much smaller, because of read quality issues?)
  • Paired-end analysis (raw Illumina data)
    • Should I trim off barcodes before running PE-analysis? I would lean toward “no”, because the hmm-align step is going to trim off non-matching parts of the reads.