Rerunning GSL guppy analyses

Guppy names its outputs after the original input files, so I had to rename them to make sure the PCAs have informative labels:

hollybik@edhar:~/GSL_analyses/concat_files$ cp ~/phylosift_v1.0.0_01/PS_temp/2051774008_NA_RozPt_all_assembled.fna.gz/treeDir/concat.updated.1.jplace GSL_concat_NA_RozPt_allassembled.jplace
hollybik@edhar:~/GSL_analyses/concat_files$ cp ~/phylosift_v1.0.0_01/PS_temp/2058419001_SA_AntIs_all_assembled.fna.gz/treeDir/concat.updated.1.jplace GSL_concat_SA_AntIs_allassembled.jplace
hollybik@edhar:~/GSL_analyses/concat_files$ cp ~/phylosift_v1.0.0_01/PS_temp/2058419003_NA_Stram_all_assembled.fna.gz/treeDir/concat.updated.1.jplace GSL_concat_NA_Stram_allassembled.jplace
hollybik@edhar:~/GSL_analyses/concat_files$ cp ~/phylosift_v1.0.0_01/PS_temp/2077657010_SA_Stram_all_assembled.fna.gz/treeDir/concat.updated.1.jplace GSL_concat_SA_Stram_allassembled.jplace
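The four cp commands above all follow the same pattern, so the renaming could be scripted. A minimal sketch, assuming the PS_temp layout shown above (the `rename_jplace` function name is mine, not a PhyloSift tool):

```shell
# Derive the informative GSL_concat_* name from each sample directory,
# instead of typing one cp per file.
rename_jplace() {
  # $1 = PS_temp directory, $2 = destination directory
  for f in "$1"/*_all_assembled.fna.gz/treeDir/concat.updated.1.jplace; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    # Sample dir name, e.g. 2051774008_NA_RozPt_all_assembled.fna.gz
    sample=$(basename "$(dirname "$(dirname "$f")")")
    # Strip numeric prefix and .fna.gz suffix -> NA_RozPt_allassembled
    label=$(echo "$sample" | sed -e 's/^[0-9]*_//' -e 's/\.fna\.gz$//' \
                                 -e 's/_all_assembled/_allassembled/')
    cp "$f" "$2/GSL_concat_${label}.jplace"
  done
}
# Usage: rename_jplace ~/phylosift_v1.0.0_01/PS_temp ~/GSL_analyses/concat_files
```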

And then reran all the guppy analyses:

./guppy pca --out-dir ~/GSL_analyses/guppy/ ~/GSL_analyses/concat_files/GSL_concat_NA_RozPt_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_NA_Stram_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_SA_AntIs_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_SA_Stram_allassembled.jplace --prefix GSL_concat_guppyPCA


./guppy squash --out-dir ~/GSL_analyses/guppy/ ~/GSL_analyses/concat_files/GSL_concat_NA_RozPt_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_NA_Stram_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_SA_AntIs_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_SA_Stram_allassembled.jplace --prefix GSL_concat_guppySquash


./guppy kr --out-dir ~/GSL_analyses/guppy/ ~/GSL_analyses/concat_files/GSL_concat_NA_RozPt_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_NA_Stram_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_SA_AntIs_allassembled.jplace ~/GSL_analyses/concat_files/GSL_concat_SA_Stram_allassembled.jplace -o GSL_concat_guppyKRdist
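All three guppy runs take the same four .jplace inputs, so the argument list could be built once and reused. A sketch, assuming the file names above (the `run_guppy_all` wrapper is mine, not part of the pplacer suite):

```shell
# Run guppy pca, squash, and kr over the same set of placement files.
run_guppy_all() {
  # $1 = guppy binary, $2 = dir holding the .jplace files, $3 = output dir
  set -- "$1" "$3" "$2"/GSL_concat_*_allassembled.jplace
  bin=$1; outdir=$2; shift 2        # $@ is now the list of .jplace files
  "$bin" pca    --out-dir "$outdir" "$@" --prefix GSL_concat_guppyPCA
  "$bin" squash --out-dir "$outdir" "$@" --prefix GSL_concat_guppySquash
  "$bin" kr     --out-dir "$outdir" "$@" -o GSL_concat_guppyKRdist
}
# Usage: run_guppy_all ./guppy ~/GSL_analyses/concat_files ~/GSL_analyses/guppy
```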

Illumina GOM, getting up and running again

Uploaded the last two Illumina files to Edhar that had not yet been transferred from my iMac desktop (the 8 and 9 PE files); confirmed that all files had successfully uploaded by the next morning.

Also starting to run the raw Illumina GOM data through PhyloSift: 

./phylosift all --paired ~/TestData/GOM_Illumina/fastx_trimmed_files/1926-KO-1_1_trimmed.txt ~/TestData/GOM_Illumina/fastx_trimmed_files/1926-KO-1_2_trimmed.txt --debug

Note: PhyloSift is not working on Edhar at the moment because it's using up a ton of memory, so I'm doing these analyses on the iMac desktop.
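Since there is one such command per sample, the per-sample runs could be looped. A minimal sketch, assuming the `*_1_trimmed.txt` / `*_2_trimmed.txt` naming seen in the 1926-KO-1 example (the `run_ps_paired` function name is mine):

```shell
# Pair each forward trimmed file with its reverse mate and run PhyloSift.
run_ps_paired() {
  # $1 = phylosift binary, $2 = directory of fastx-trimmed files
  for r1 in "$2"/*_1_trimmed.txt; do
    [ -e "$r1" ] || continue                 # skip if the glob matched nothing
    r2=$(echo "$r1" | sed 's/_1_trimmed\.txt$/_2_trimmed.txt/')
    [ -e "$r2" ] || { echo "no mate for $r1" >&2; continue; }
    "$1" all --paired "$r1" "$r2" --debug
  done
}
# Usage: run_ps_paired ./phylosift ~/TestData/GOM_Illumina/fastx_trimmed_files
```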

PhyloSift data uploads to FigShare

Final files are sitting here for the moment (I'll transfer them to an external hard drive soon): /Users/hollybik/Dropbox/UC Davis Projects/PhyloSift/Yatsunenko_phylosift/Figshare Fies/

One of the 107 metagenome samples seemed to disappear from the QIIME analyses… did some sleuthing, and the most likely culprit seems to be the following rRNA file mined by PhyloSift (saw this on the QIIME_yatsunenko AWS instance). It's only 8.7K:

-rw-r----- 1 ubuntu ubuntu 8.7K 2012-09-13 17:13 4461121.3_ps_16S_bac_extract.fna
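To catch any other truncated samples before they silently drop out of QIIME, a quick check could list every extract file under some size threshold (10K here is my arbitrary cutoff; the filename pattern follows the file shown above):

```shell
# List suspiciously small 16S extract files (under 10K) below the current dir.
find . -name '*_ps_16S_bac_extract.fna' -size -10k -exec ls -lh {} \;
```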

Exploring Figshare

Today I’m uploading data to Figshare for the first time – getting our PhyloSift analysis published before we submit the manuscript.

On Edhar, the final data folders are qiime_analyses_yatsunenko_16S_amplicons and qiime_analyses_yatsunenko_metagenomes_7Oct. Figured this out after I looked through my e-mails from October (when I was running these analyses):

So I spent the weekend looking through the Yatsunenko data after my discovery about QIIME's reference-based OTU picking protocol. It turns out that our 16S data is OK (this was a closed-reference process, where all reads not matching Greengenes were discarded), but the 16S derived from the PhyloSift metagenome analysis was an open-reference process (where QIIME created new de novo clusters for sequences that didn't match Greengenes, i.e. the opposite of what Yatsunenko did).
Because I wanted to keep things consistent in the manuscript (keeping the amplicon and metagenome workflows the same, and directly comparable with the Yatsunenko methods), I reran the metagenome data using a closed-reference OTU picking process. That means we have a lot fewer OTUs, and might see different patterns in the PCoAs. Sorry about the confusion over this; just glad I caught it early enough.
New data has been downloaded to Edhar: /home/hollybik/yatsunenko_QIIME/qiime_analyses_yatsunenko_metagenomes_7oct

Transferring stuff over from Trello

Trying to get this project-management task sorted out: organizing my Trello board and transferring over material that is no longer active.

Info about PhyloSift plastid marker work (before we decided not to include these in the core set of markers):

[Screenshot: Screen Shot 2013-03-04 at 11.45.13 AM]

Whale Shark microbiome

Running the whale shark data through PhyloSift (phylosift_v1.0.0_01 on Edhar). Starting this now because I suspect it will be a huge run:

./phylosift all --paired ~/TestData/whale_shark/new_scp_download_28Jan/99_Transfer/Whale_Shark_R1_abyss.fastq.bz2 ~/TestDa

All of the Whale Shark data has successfully downloaded to Edhar. Here is proof:

[Screenshot: Screen Shot 2013-03-04 at 11.58.36 AM]

Expanding support for euks in PhyloSift

PhyloSift genome data lives on Merlot in /share/eisen-z2/phylosift/

Talked with Guillaume about euk markers

  • marker updates – ONLY updating the tree and its associated taxonomy (not the HMM, because we don’t want the HMM to shift over time). We throw away all the old reference sequences and start afresh by scanning the downloaded genomes.
  • For euks – I’m worried that we’re getting rid of a lot of taxonomic diversity with this marker update method. The original Parfrey sequences definitely seem to be getting thrown away. Need to figure out if we’re representing all the deep protist lineages during euk marker updates (99% tree pruning won’t do much harm if we already have these taxa present).

To investigate:

  • How many euk genomes do we have in the PhyloSift directories (draft, ebi, WGS, etc.)? How many bacterial/archaeal genomes in comparison?
  • Work with Guillaume to get full NCBI taxonomic hierarchies placed into the trees. This will help evaluate which lineages are present/absent in our reference markers.
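The first to-do could start with a quick file count per source directory under the data root on Merlot. A sketch only: the subdirectory layout and the .fasta extension are assumptions to adjust against the real /share/eisen-z2/phylosift/ contents, and `count_genomes` is my own helper name:

```shell
# Print "<subdirectory> <number of .fasta files>" for each immediate
# subdirectory of a genome data root.
count_genomes() {
  # $1 = root directory
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    n=$(find "$d" -type f -name '*.fasta' | wc -l | tr -d ' ')
    printf '%s %s\n' "$(basename "$d")" "$n"
  done
}
# Usage: count_genomes /share/eisen-z2/phylosift
```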