microBEnet isolate genomes

Transferring info over from Google Docs. Here is the recent progress with the microBEnet isolate genome data:


QIIME can’t handle raw FASTQ data if you go straight into reference-based OTU picking, so the Illumina files had to be converted to FASTA first:

cat THU.r1.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > THU_r1_out.fasta

cat THU.r2.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > THU_r2_out.fasta
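The perl one-liner above relies on FASTQ records being exactly four lines each (header, sequence, ‘+’, quality): it keeps the first two lines of each record and swaps the leading ‘@’ for ‘>’. The same logic as a readable Python sketch:

```python
import itertools

def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from 4-line FASTQ records (header, sequence,
    '+', quality): keep the first two lines, turn '@' into '>'."""
    it = iter(fastq_lines)
    while True:
        record = list(itertools.islice(it, 4))
        if not record:
            break
        header, seq = record[0], record[1]
        yield ">" + header.lstrip("@").rstrip("\n")
        yield seq.rstrip("\n")

# File usage, with the filenames from the commands above:
# with open("THU.r1.fastq") as fq, open("THU_r1_out.fasta", "w") as fa:
#     fa.write("\n".join(fastq_to_fasta(fq)) + "\n")
```

Like the one-liner, this assumes no wrapped sequence lines in the FASTQ, which holds for raw Illumina output.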

Ran the THU_r1 and THU_r2 data through QIIME to see what 16S OTUs we had in our dataset. Used the QIIME 1.4 VirtualBox image, I think (old .vmi):

pick_reference_otus_through_otu_table.py -i /home/qiime/Desktop/Shared_Folder/David_microBEnet/THU_r1_out.fasta -r /home/qiime/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -o /home/qiime/Desktop/Shared_Folder/David_microBEnet/16S_gg_otus_w_tax_r1/ -t /home/qiime/gg_otus_4feb2011/taxonomies/greengenes_tax.txt

  • Looks like mainly Leucobacter
  • Need to build a tree with these reference sequences.

pick_reference_otus_through_otu_table.py -i /home/qiime/Desktop/Shared_Folder/David_microBEnet/THU_r2_out.fasta -r /home/qiime/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -o /home/qiime/Desktop/Shared_Folder/David_microBEnet/16S_gg_otus_w_tax_r2/ -t /home/qiime/gg_otus_4feb2011/taxonomies/greengenes_tax.txt
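For reference, closed-reference OTU picking assigns each read to its best-matching reference sequence above a similarity threshold and discards anything that doesn’t match; reads hitting the same reference collapse into one OTU. A toy illustration of that idea (naive per-position identity on a dict of sequences, nothing like QIIME’s actual uclust-based search):

```python
def pick_closed_reference(reads, references, threshold=0.97):
    """Toy closed-reference OTU picking: map each read to the reference
    it matches best (fraction of identical positions); drop reads whose
    best identity falls below the threshold."""
    def identity(a, b):
        n = min(len(a), len(b))
        if n == 0:
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / n

    otu_table = {}  # reference id -> list of read ids assigned to it
    for read_id, seq in reads.items():
        best_ref, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_ref, best_score = ref_id, score
        if best_ref is not None and best_score >= threshold:
            otu_table.setdefault(best_ref, []).append(read_id)
    return otu_table
```

The key property for this dataset: reads with no close greengenes match are silently dropped, so the OTU table only reflects the 16S-like fraction of the reads.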


./phylosift.pl all --debug --isolate /home/coild/public_html/THU.r1.fastq
./phylosift.pl all --debug --besthit --isolate /home/coild/public_html/THU.r1.fastq

  • PhyloSift can’t process input data in FASTQ format for isolate mode?
  • lastal: bad symbol: @

./phylosift.pl all --debug --isolate /home/hollybik/TestData/THU_r1_out.fasta

  • getting an isolates.fasta file, but no other output besides this
  • Need to figure out what isolate mode does (or is meant to do)

./phylosift.pl all --debug --besthit /home/coild/public_html/THU.r1.fastq /home/coild/public_html/THU.r2.fastq

  • Only 37 reads being pulled out?! 56 lines in the taxa_90pct file

Running scaffolds through PhyloSift to see if we can get better hits. Starting out with besthit mode:

./phylosift.pl all --debug --besthit ~/TestData/microBEnet/THUdemulti.final.scaffolds.fasta
master_20120720 – master markers

  • Getting more lines in taxa_90pct than in the run below that does *not* use besthit?!
  • Need to figure out exactly *what* files change when running besthit mode; apparently it keeps only the best BLAST hit, not the best-probability hit.
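If besthit really does keep only the top-scoring search hit per read (an assumption here, inferred from the behaviour above rather than confirmed in the PhyloSift docs), the filtering step would look something like:

```python
def best_hits(hits):
    """Keep only the highest-scoring hit per query.

    `hits` is an iterable of (query_id, subject_id, score) tuples, like
    columns pulled from tabular BLAST/LAST output. Note this keeps the
    best *alignment score*, not the placement with the best posterior
    probability -- which seems to be the distinction observed above."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best
```

On ties this keeps the first hit seen, which is another way a best-hit filter can differ from a probability-based placement.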

./phylosift.pl all --debug ~/TestData/microBEnet/THUdemulti.final.scaffolds.fasta
master_20120720 – master markers

./phylosift.pl all --debug --besthit --marker_url="http://edhar.genomecenter.ucdavis.edu/~koadman/phylosift_markers/devel" ~/
devel_20120720 – devel markers


Running the TEU data through PhyloSift (the isolate data with probably mixed genomes).

./phylosift.pl all --debug --paired /home/coild/public_html/TEU.r1.fastq /home/coild/public_html/TEU.r2.fastq
devel_20120720 – core markers


Now running TEU data through QIIME to pull out greengenes 16S OTUs for the potentially mixed isolate genome.

First, convert FASTQ to FASTA:

cat TEU.r1.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > TEU_r1_out.fasta

cat TEU.r2.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > TEU_r2_out.fasta

Reference-based picking of 16S OTUs in QIIME, running on the updated VirtualBox image containing the QIIME 1.5.0 release.

pick_reference_otus_through_otu_table.py -i /home/qiime/Desktop/Shared_Folder/David_microBEnet/TEU_r1_out.fasta -r /home/qiime/Desktop/Shared_Folder/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -o /home/qiime/Desktop/Shared_Folder/David_microBEnet/greengenes_TEU_r1/ -t /home/qiime/Desktop/Shared_Folder/gg_otus_4feb2011/taxonomies/greengenes_tax.txt

pick_reference_otus_through_otu_table.py -i /home/qiime/Desktop/Shared_Folder/David_microBEnet/TEU_r2_out.fasta -r /home/qiime/Desktop/Shared_Folder/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -o /home/qiime/Desktop/Shared_Folder/David_microBEnet/greengenes_TEU_r2/ -t /home/qiime/Desktop/Shared_Folder/gg_otus_4feb2011/taxonomies/greengenes_tax.txt

QIIME 1.5 now uses BIOM format files for OTU tables, so these need to be converted back to “classic” format. Navigate to directory and execute command:

convert_biom.py -i otu_table.biom -o otu_table_classic.txt -b --header_key taxonomy
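Once converted, the “classic” OTU table is plain tab-delimited text: ‘#’ header/comment lines, then one row per OTU with per-sample counts and (since the taxonomy header key was included) a trailing taxonomy column. A minimal parser sketch, assuming that layout:

```python
def parse_classic_otu_table(lines):
    """Parse a 'classic' QIIME OTU table: '#' comment lines, a
    '#OTU ID' header row, then tab-separated rows of OTU id,
    per-sample counts, and a final taxonomy column (assumes taxonomy
    was included in the conversion, as in the command above)."""
    samples, rows = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#OTU ID"):
            fields = line.split("\t")
            samples = fields[1:-1]  # columns between OTU id and taxonomy
        elif line.startswith("#") or not line:
            continue
        else:
            fields = line.split("\t")
            otu_id, taxonomy = fields[0], fields[-1]
            counts = [float(c) for c in fields[1:-1]]
            rows[otu_id] = {"counts": dict(zip(samples, counts)),
                            "taxonomy": taxonomy}
    return rows
```

Handy for quickly tallying, say, how much of the table is Leucobacter without going back through QIIME scripts.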

Pick a representative set of sequences:

pick_rep_set.py -i THU_r1_out_otus.txt -f /home/qiime/Desktop/Shared_Folder/David_microBEnet/THU_r1_out.fasta -o rep_set_THU_r1

pick_rep_set.py -i THU_r2_out_otus.txt -f /home/qiime/Desktop/Shared_Folder/David_microBEnet/THU_r2_out.fasta -o rep_set_THU_r2

pick_rep_set.py -i TEU_r1_out_otus.txt -f /home/qiime/Desktop/Shared_Folder/David_microBEnet/TEU_r1_out.fasta -o rep_set_TEU_r1

pick_rep_set.py -i TEU_r2_out_otus.txt -f /home/qiime/Desktop/Shared_Folder/David_microBEnet/TEU_r2_out.fasta -o rep_set_TEU_r2


  • Wanted to build trees with David’s data today, but I am running into issues
  • Aligned all the rep_set sequences using the online SINA aligner
  • Tried to merge these into the reference SILVA database, but got an error saying “Unknown host: hollyb”. Need to find where ARB is looking for the hostname and change it to ‘localhost’, then re-run the DB merge and see what the error dialogue says. I think I remember it saying something about $ARBHOME…

Next steps

  1. Insert the THU and TEU 16S sequences into the tree (use ARB fast parsimony insertion) to see where the rRNA sequences fall on the topology: clustered, or spread out across the tree?
  2. Run the THU and TEU rep_sets through PhyloSift to see if tree placement on the 16S markers is easier.

Up and running on the Amazon Cloud

This morning I played around with EC2 instances and tried to mount S3 storage as buckets. Followed this s3fs tutorial for an external program that allows S3 mounting; managed to get the program and its dependencies installed, but in the end I couldn’t mount volumes for some reason (and side note, the program had dumb rules, like S3 buckets not being allowed capitalized names…!). After a lot of frustration I decided to look further into alternative storage.

So I think I’ve now finally sorted out the storage issues with the Amazon cloud: it turns out you can just increase the default root volume for each instance to get more storage space (auto-mounted EBS volumes), so there’s no need for S3 and all that bucket-mounting malarkey.

Time to get started with QIIME – first focused on the microBEnet genomes.

wget David’s error-corrected THU files from Edhar.

Convert to FASTA for QIIME

cat THU.r1.fastq.pp.ec.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > THU.r1.fastq.pp.ec.converted.fasta

cat THU.r2.fastq.pp.ec.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > THU.r2.fastq.pp.ec.converted.fasta

Downloaded/unzipped greengenes via wget and ran closed-reference OTU picking:

pick_reference_otus_through_otu_table.py -i THU.r1.fastq.pp.ec.converted.fasta -r /home/ubuntu/microbenet_data/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -o /home/ubuntu/microbenet_data/16S_gg_THU_ECreads_r1/ -t /home/ubuntu/microbenet_data/gg_otus_4feb2011/taxonomies/greengenes_tax.txt

pick_reference_otus_through_otu_table.py -i THU.r2.fastq.pp.ec.converted.fasta -r /home/ubuntu/microbenet_data/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -o /home/ubuntu/microbenet_data/16S_gg_THU_ECreads_r2/ -t /home/ubuntu/microbenet_data/gg_otus_4feb2011/taxonomies/greengenes_tax.txt