BMC paper work

Submitted the reduced Enoplid (stem/loops) tree to RAxML for the BMC paper (file 29Mar_EnopDaimiStemLoopsCombo.txt on Mac)

http://phylobench.vital-it.ch/raxml-bb/index.php?jid=70736

Miscellaneous OCTU stats

Typing up some old pages, and here are some stats I calculated for the OCTUs.

Results from Chimera check on 95% OCTUs:

  • 95A had 2769 chimeric OCTUs, and 1179 non-chimeric OCTUs
  • 95B had 1883 chimeric OCTUs, and 2169 non-chimeric OCTUs
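
For a quick sense of scale, the counts above can be turned into totals and proportions. This is only a small arithmetic sketch; the dictionary layout and function name are my own and not part of the OCTUPUS pipeline.

```python
# Counts copied from the chimera-check results above.
chimera_counts = {
    "95A": {"chimeric": 2769, "non_chimeric": 1179},
    "95B": {"chimeric": 1883, "non_chimeric": 2169},
}


def chimeric_fraction(counts):
    """Return (total OCTUs, fraction flagged as chimeric) for one dataset."""
    total = counts["chimeric"] + counts["non_chimeric"]
    return total, counts["chimeric"] / total


for name, counts in chimera_counts.items():
    total, frac = chimeric_fraction(counts)
    print(f"{name}: {total} OCTUs, {frac:.1%} flagged as chimeric")
```

That works out to roughly 70% of the 3948 OCTUs in 95A and roughly 46% of the 4052 OCTUs in 95B being flagged.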

Contaminant OCTUs detected in 95 and 99 datasets:

99A Dataset:

  • Vector OCTU 10409 (Blast match 100%)
  • Human OCTU 23900 (Blast match 99.55%), shared Pacific/Atlantic
  • Human OCTU 31567 (Blast match 99.11%)
  • Human OCTU 30401 (Blast match 98.41%), shared Pacific/Atlantic
  • Human OCTU 79050 (Blast match 97.87%)
  • Human OCTU 78664 (Blast match 98.67%), shared Pacific/Atlantic
  • LOTS more (55 overall total) in the 99A dataset; check spreadsheet for exact OCTU numbers

95A Dataset:

  • Vector OCTU 1312 (Blast match 100%)
  • Human OCTU 807 (Blast match 95.37%)
  • Human OCTU 1829 (Blast match 99.78%), shared Pacific/Atlantic
  • Human OCTU 1860 (Blast match 97.39%), shared Pacific/Atlantic
  • Human OCTU 2917 (Blast match 99.56%), shared Pacific/Atlantic
  • Human OCTU 2453 (Blast match 92.39%)
  • Human OCTU 2335 (Blast match 97.39%)

95B Dataset:

  • Human OCTU 1034 (Blast match 100%), shared Deep/Shallow and Pacific/Atlantic
  • Human OCTU 2850 (Blast match 99.12%), shared Pacific/Atlantic
  • Human OCTU 3017 (Blast match 92.87%)
  • Human OCTU 2291 (Blast match 100%)

Shared OCTUs below base error frequency:

  • 95A deep/shallow (3) = OCTUs 30, 37, 81 ; Pacific/Atlantic (0)
  • 95B deep/shallow (7) = OCTUs 6, 8, 26, 84, 108, 113, 332 ; Pacific/Atlantic (1) = OCTU 381
  • 99A deep/shallow (1) = OCTU 142 ; Pacific/Atlantic (1) = OCTU 16413

19 March Daimi StemLoops combo submitted to CIPRES; partition is Helix (alignment sites 1-1175) and Loops (alignment sites 1176-2154)
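
For reference, a two-block partition like this is normally given to RAxML as a plain text file with one `DNA, name = start-end` line per block. The sketch below just formats those lines from the site ranges above; the function name is made up for illustration, and the `DNA` model label is my assumption about what was used for these runs.

```python
def raxml_partition_lines(partitions, data_type="DNA"):
    """Format (name, start, end) tuples as RAxML partition-file lines."""
    return [f"{data_type}, {name} = {start}-{end}" for name, start, end in partitions]


# Site ranges from the note above (1-based, inclusive).
partitions = [("Helix", 1, 1175), ("Loops", 1176, 2154)]
lines = raxml_partition_lines(partitions)

# Sanity check: each block should start right after the previous one ends.
for (_, _, prev_end), (_, next_start, _) in zip(partitions, partitions[1:]):
    assert next_start == prev_end + 1, "partitions overlap or leave a gap"

print("\n".join(lines))
```

The printed lines can be saved to a text file and uploaded alongside the alignment.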

The way forward?

So I’ve been thinking about how we’re going to proceed towards publication with the data analysis we’ve done so far.

Phylogenetics:

  • The RAxML trees using all 95A OCTUs came out alright, but as expected you can’t infer much about higher phylogeny.  Basically things cluster into ‘Metazoa’ or ‘Non-metazoa’.  Some of the major clade groupings are recovered (Nematoda seems pretty robust), but the bootstraps are abysmal and you get sequences out of place everywhere (hinting at alignment errors or an untrustworthy BLAST result).  Also, hello long branch attraction!
  • I’m having concerns about our BLAST matching system; sometimes when I BLAST an OCTU sequence I get a completely different hit versus what the OCTUPUS pipeline churned out.  Also, my new hit normally corresponds with the phylogenetic placement in ARB (an erroneous observation is why I re-BLASTed the OCTU in the first place).
  • There is the possibility that a lot of sequences aren’t ‘real’ and can’t be fixed by simple alignment tweaks, e.g. when an OCTU sequence is missing 10-20 bp in a conserved rRNA region!  Are these chimeras or are these sequencing errors?

Phylogeography

  • I like the individual OCTU phylogeny approach that Kelley came up with.  It seems to be working really well and gives a good idea of the geographic distribution when you break things down.
  • Need to liaise with Pari regarding a database that will run this phylogeny approach.  At the very least I need to learn how to manipulate FASTA headers and add on information from excel spreadsheets so that I can run some of this stuff myself.
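
As a starting point for the FASTA header problem, here is a minimal sketch that assumes the spreadsheet has been saved from Excel as CSV with one row per OCTU. The column names (`octu`, `ocean`, `depth`), the IDs and the values are all hypothetical placeholders, not our real data.

```python
import csv
import io


def load_metadata(csv_handle, key_column="octu"):
    """Map each OCTU ID to the remaining spreadsheet columns for that row."""
    table = {}
    for row in csv.DictReader(csv_handle):
        key = row.pop(key_column)
        table[key] = row
    return table


def annotate_fasta(fasta_lines, metadata, sep="_"):
    """Yield FASTA lines with metadata values appended to matching headers.

    The OCTU ID is taken to be the first whitespace-delimited token of the
    header; headers without a metadata row are passed through unchanged.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            octu_id = tokens[0] if tokens else ""
            extra = metadata.get(octu_id)
            if extra:
                values = [value.replace(" ", sep) for value in extra.values()]
                line = ">" + sep.join([octu_id] + values)
        yield line


# Hypothetical example data (made-up IDs and values).
csv_text = "octu,ocean,depth\nOCTU_A,Pacific,deep\nOCTU_B,Atlantic,shallow\n"
fasta = [">OCTU_A", "ACGT", ">OCTU_B", "GGCC", ">OCTU_C", "TTAA"]

metadata = load_metadata(io.StringIO(csv_text))
annotated = list(annotate_fasta(fasta, metadata))
print("\n".join(annotated))
```

Joining with underscores rather than spaces keeps the added information from being cut off by tools that truncate names at the first space.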

Phylogenetic species concept

  • Need to read papers.  The microbial people seem to think we can do this with pairwise identity cutoffs? (e.g. 98%?)  But I’m sure there are other algorithms based on branch lengths, etc. that would be worth trying
  • Progress on mtDNA arrays?  Ordering oligos? I know this is a whole other can of worms, but additional loci will be important in determining species boundaries/community compositions in the long run.
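
To make the pairwise identity cutoff idea concrete, here is a rough sketch of how identity between two already-aligned sequences could be computed and compared against a threshold such as 98%. Skipping any column where either sequence has a gap is just one of several conventions, and the function names are my own.

```python
GAP_CHARS = set("-.?")


def pairwise_identity(seq_a, seq_b):
    """Fraction of matching positions among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # ignore gapped or missing columns
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def within_cutoff(seq_a, seq_b, cutoff=0.98):
    """True if the two sequences meet the identity cutoff (98% by default)."""
    return pairwise_identity(seq_a, seq_b) >= cutoff


# Toy sequences, made up for illustration.
print(pairwise_identity("ACGT-A", "ACGTTA"))
print(within_cutoff("ACGT", "ACGA"))
```

Whether gaps should count as mismatches will change the numbers, so it would need to be settled before comparing against published cutoffs.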

Final files for phylogenetics

For the Metagenomics 95A OCTU phylogeny, the final files for phylogenetics are as follows (stored in folder ‘Aligned_OCTUs_95A’ and derived from the MasterDB_95A.arb):

  • CIPRES_Simon95A_phy_Rep.phy (also 18Mar_SimonOCTUs95A.fasta) 2451 characters in alignment
  • CIPRES_Dorota95A_phy_Rep.phy (also 18Mar_DorotaOCTUs95A.fasta) 2808 characters in alignment

For the BMC paper, the Enoplid-only tree file

Websites from David Lunt

David Lunt Correspondence:

http://treethinkers.blogspot.com/2009/05/for-this-unauthorized-installment-of.html

http://treethinkers.blogspot.com/2009/04/when-we-fail-mrbayes.html

http://www.microbesonline.org/fasttree/

There are several ways to concatenate sequence alignments.

There is a utility to concatenate sequences in this web page
http://phylemon.bioinfo.cipf.es/cgi-bin/utilities.cgi

Mesquite will concatenate alignments
http://www.mesquiteproject.org

Check out this site. You can do lots of useful things including join 2
alignments. The sequences have to be in fasta format though.
http://www.daimi.au.dk/~biopv/php/fabox/index.php

Also the perl script seqCat.pl on this page
http://www.molekularesystematik.uni-oldenburg.de/en/34011.html
It can take several formats as input.

I think that the program DnaSp can concatenate also.
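
If none of the tools above are handy, concatenation is also simple to script. The sketch below is a minimal version, assuming each alignment is in FASTA format with all sequences the same length within a file, and that taxa are matched by identical names. Padding absent taxa with `?` is my own choice, and the taxon names are placeholders.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of name -> sequence (insertion-ordered)."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def concatenate(alignments, missing="?"):
    """Join alignments end to end, padding taxa that are absent from a block."""
    names = []
    for aln in alignments:
        for name in aln:
            if name not in names:
                names.append(name)
    combined = {name: "" for name in names}
    for aln in alignments:
        width = len(next(iter(aln.values()))) if aln else 0
        for name in names:
            combined[name] += aln.get(name, missing * width)
    return combined


# Toy alignments with placeholder taxon names.
block_1 = read_fasta([">taxon1", "ACGT", ">taxon2", "AC-T"])
block_2 = read_fasta([">taxon1", "GG", ">taxon3", "TT"])
combined = concatenate([block_1, block_2])
for name, seq in combined.items():
    print(f">{name}\n{seq}")
```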

Something to convert alignment formats might be useful to you
http://www.ii.uib.no/~matthewb/tools/align_convert_in.cgi

My Public Folder
http://public.me.com/dhlunt

Objective 1: Phylogenetics

After going over the data yesterday, Kelley set out 3 main objectives which I should be aiming for with the metagenomics data:

  1. Phylogenetics–Use 95% OCTU dataset to compile a phylogeny that will hopefully be informative for higher clade relationships (deep splits).  For this purpose we have to assume that each 95 OCTU is a monophyletic group (although I need to look into this more!)
  2. Phylogeography–For each 95% OCTU, compile a phylogeny of the 99% OCTUs that make up this grouping.  Look at separation (or lack thereof) of Pacific/Atlantic and Deep/Shallow reads.  Are any patterns taxon-specific?
  3. Phylogenetic species concepts–Find an algorithm that can be implemented on individual OCTU phylogenies that will define species using tree branch lengths and/or Newick tree files.

Today I was working out the kinks for Objective 1.  Mainly these were related to ARB: sorting the data, exporting it with correct labels and formatting the FASTA files.  Still working in the MasterDB_95A ARB database.  I’ve separated the OCTUs according to primer set (respectively labelled ‘Dorota’ [2019 spp] and ‘Simon’ [1923 spp] in the tax_slv field of ARB).  I also set up the PT server for this new, aligned database of 95 OCTUs under User4.arb.  A few OCTUs in the database had to be re-aligned by hand, and two OCTUs (653 and 3826) were completely nonsensical and I think are error/chimera reads.  These two OCTUs have been excluded from tree building, and are labelled as ‘unaligned error’ in the tax_slv field of ARB.

For future information (in case I need to build a filter at some point), Simon’s primer set goes from position 0-10221 and Dorota’s primer set goes from 10222-end of alignment.
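
If a filter is ever needed outside ARB, the same split could be done on exported sequences by slicing at that boundary. This sketch assumes the positions quoted above are 0-based and inclusive, which I have not double-checked against ARB's own numbering; the function name is made up.

```python
SIMON_END = 10221  # last alignment position of Simon's primer set (from the note above)


def split_by_primer_region(sequence, simon_end=SIMON_END):
    """Return (Simon region, Dorota region) of one aligned sequence.

    Assumes 0-based, inclusive positions: Simon covers 0..simon_end and
    Dorota covers simon_end+1..end of alignment.
    """
    boundary = simon_end + 1
    return sequence[:boundary], sequence[boundary:]


# Toy example with a tiny boundary so the result is easy to read.
simon_part, dorota_part = split_by_primer_region("ACGTAC", simon_end=2)
print(simon_part, dorota_part)
```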

Also been working out how to export ARB alignments and retain the taxonomic information (and avoid time-consuming tree annotations).  The best way to do it is to export the files in fasta_acc.eft format and then convert this to Phylip using the Readseq tool on CIPRES.  Readseq will cut off the names at the first space in the FASTA header, so you need to put in underscores (e.g. between binomial species names) to stop the relevant information from being truncated.  Readseq is awesome because it doesn’t truncate at 10 characters like Dataconvert.  After converting the file to Phylip, you need to download this file, replace all . with ? in the alignment, and also delete any prohibited characters from the species names, including . – / ( )  Then, voila!  Upload the modified Phylip file back to CIPRES and it’s ready to run on RAxML.
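
The manual clean-up steps could be scripted along these lines. This is only a sketch of the name and sequence fixes. It does not replace the Readseq conversion, and it works on (name, sequence) pairs rather than parsing a Phylip file. I have treated both the plain hyphen and the en dash as prohibited in names, which is an assumption, and the example name is made up.

```python
import re

# Characters listed as prohibited in species names in the note above.
# Both "-" and the en dash are stripped; that is an assumption.
PROHIBITED = re.compile(r"[.\-–/()]")


def clean_name(name):
    """Replace spaces with underscores, then drop prohibited characters."""
    return PROHIBITED.sub("", name.strip().replace(" ", "_"))


def clean_sequence(seq):
    """Swap '.' for '?' in the alignment, leaving '-' gaps untouched."""
    return seq.replace(".", "?")


# Hypothetical record, for illustration only.
records = [("Genus species (ACC123.1)", "..AC-GT..")]
cleaned = [(clean_name(name), clean_sequence(seq)) for name, seq in records]
print(cleaned)
```

Names and sequences are handled separately so that the hyphen gaps in the alignment survive while hyphens in names do not.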

Metagenetics Phylogeny work

So I am currently trying to organize a reasonable (and accurate) way of constructing these metagenetics phylogenies.  Muscle alignments, although fast and easy to implement through the CIPRES portal, are not at all accurate and the resulting alignments look horrendous.  Plus, CIPRES’s 3-day limit on all jobs meant that even the (few) OCTUs from the 95A dataset were terminated before the alignment was completed.

Next, on to my original idea for using SILVA’s SINA aligner and the ARB program as a database manager.  I had aligned the 95A and 99A OCTUs back in December when we were preparing the NSF grant application; however, I just looked over the 95A ARB database (MasterDB_95A.arb) and there were four OCTUs that somehow got dropped during the alignment process.  These four missing 95 OCTUs are: 2661, 2085, 3062, and 3287.

In the 95A dataset, OCTU 3287 (Oncholaimidae) is identical to OCTU 3464 (100% query coverage and sequence identity), and ARB will not incorporate identical sequences into the database.  Of the remaining three OCTUs, 2085 (99 OCTU 21340, 1 read) and 2661 (99 OCTUs 33262 and 38572, 2 reads total) match to Arabidopsis with 100% query coverage and 99% sequence identity, and didn’t align with anything via SINA either, so we’ll class these as error reads.  The last OCTU (3062) is an N/A match, has no significant similarity in Genbank, and contains only 1 read (as well as only one corresponding 99% OCTU (49011) containing 1 read).  All this suggests it is another error read!
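
To catch identical sequences like 3287/3464 before importing into ARB, duplicates can be found by grouping records on their ungapped sequence. This is a sketch using exact string equality, which is a simpler test than the BLAST coverage/identity check described above; the record names are placeholders.

```python
from collections import defaultdict


def find_identical(records, gap_chars="-."):
    """Group record names whose ungapped, upper-cased sequences are identical."""
    strip_gaps = str.maketrans("", "", gap_chars)
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper().translate(strip_gaps)].append(name)
    # Only groups with more than one member are duplicates.
    return [names for names in groups.values() if len(names) > 1]


# Toy records with placeholder names.
records = [("a", "ACGT"), ("b", "AC-GT"), ("c", "ACGA"), ("d", "acgt")]
duplicates = find_identical(records)
print(duplicates)
```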

So, the final exported alignment contains 3944 OCTUs and was saved as file name ‘95A_AlloctusAlign_17Mar.fasta’ in the Meta95A folder.  However, I couldn’t submit this file to RAxML or CIPRES because the file is too big…

Productive Friday

Today I continued the edits on the BMC paper.  Left off at the start of the Discussion section.

Also commenced work on the Metagenomics Phylogenies.  Uploaded both the 95A and 95B OCTU sequences into CIPRES and am now running the Muscle alignments for those sequences.  Asked for a way to modify the 99A OCTU sequence names to include the corresponding 95A OCTU.

To do for Monday:

  • Send Fangning comments regarding the OCTU database
  • Put together the presentation for Steve Jones on Monday.

Getting started at UNH

Resurrecting my online lab book so that I remember what I do!

David Lunt sent back his comments regarding the (now combined) BMC papers detailing the Enoplid phylogeny.  He pointed out some of my outgroup choices, so I’ve had to go back and re-run a couple of the trees.

Just FYI in ARB, to export secondary structure features:

  • Helix, box reads  . 0 – =
  • Loops, box reads [<>]

I removed the Chromadorid outgroups and many Dorylaimid species from the Enoplid-only ML trees and ran the following jobs:

  • Raxml Job #891091  11Mar10_EnopReducedDS.phylip  (Normal ML run with Dorylaimia OGs)
  • Raxml Job #894968  DO NOT USE!  Forgot to add data partition file
  • Raxml Job #896361 11MarEnop_LSU_ML (Normal ML run with Dorylaimia OGs)
  • Raxml Job #898456 11MarEnop_StemLoopCombo, partition file Raxml_SL_11Mar.txt (Stem/Loop ML run with Dorylaimia OGs)

CIPRES portal was rejecting my Bayesian runs, so will try again today.