B.anthracis metagenome work

Transferring notes over from GoogleDocs. Here is the recent work I’ve been doing:

7/25/12

Downloaded Holly Gantz data to iMac and converted fastq to FASTA.

cat Holly_G_3C2.r1.fastq | perl -e ‘$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}’ > Holly_G_3C2.r1.fasta

cat Holly_G_3C2.r2.fastq | perl -e ‘$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}’ > Holly_G_3C2.r2.fasta

Removed the /n characters that were interrupting the DNA sequences in the grinder sim read outputs. Now running through PhyloSift to see if this will give us hits.

perl /Users/hollybik/Dropbox/perl/scripts/Grinder_remove_newlines_fasta.pl grinder-reads_Banthracis_1000000.fa

  • Still not getting any results after removing /n characters. Now going to use other tools to see if it is a problem with the reads themselves.

7/27/12

Running Holly Gantz’s data through QIIME to see if we have 16S hits on any Bacillus taxa. Doing this on qiime@localhost on Edhar.

parallel_pick_otus_uclust_ref.py -i /home/qiime/data/hbik/B_anthracis/Holly_G_3C2.r1.fasta -o HollyG_3C2_r1_qiime16S -r /home/qiime/data/hbik/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta -s 0.99 –enable_rev_strand_match

pick_rep_set.py -i HollyG_3C2_r1_qiime16S/Holly_G_3C2.r1_otus.txt -f Holly_G_3C2.r1.fasta -o HollyG_3C2_r1_qiime16S/Holly_G_3C2.r1_rep_set.fna

assign_taxonomy.py -i HollyG_3C2_r1_qiime16S/Holly_G_3C2.r1_rep_set.fna -m rdp

  • RDP taxonomy doesn’t indicate that we have any Bacillus anthracis, although we do have a number of Bacilli in the data –the main OTU is Stapylococcus, which I think correlates with the PhyloSift results? We got a total of 317 rRNA OTUs from this QIIME scan.

Set up a BLAST DB on Edhar to compare marker genes (cat file of marker gene hits pulled out by PhyloSift run) with the simulated grinder data.

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 95 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter.txt

  • No hits at all using 95% sequence identity…tried again with only 90% sequence identity:

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 90 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout2.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout2.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter2.txt

  • No hits at all even using 90% similarity, so I went down to 60% sequence identity:

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 60 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout3.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout3.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter3.txt

  • Still no hits. So it might be related to Illumina error

Re-simulated Grinder reads, but removed the Illuimina-specific error simulation and instead opted for a general 0.1 uniform error rate.

grinder -read_dist 105 -insert_dist 400 normal 50 uniform 0.1 -mr 95 5 -genome_file B_anthracis_genome.fasta -total_reads 100000

And then ran another megablast, this time at 80% to see the difference

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 80 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_100000_Error1pct.fa -F F > ~/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout4.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTout4.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_100000_NLrem_BLASTfilter4.txt

  • Still nope. Maybe I don’t have enough sequences in the dataset to have any marker gene hits show up?
  • So now I’m going to try with 1000000 PE reads. Perhaps the increased read count will give us hits:

megablast -d ~/BLAST_db/B_anthracis_PMPROK_markers/Banthracis_lastalcandidate_master_20120717_nts.txt -D 2 -p 80 -a 2 -b 1 -v 1 -i ~/grinder_sims/grinder-reads_Banthracis_1000000_NLrem.fasta -F F > ~/grinder_sims/grinder-reads_Banthracis_1000000_NLrem_BLASTout.txt

perl /home/hollybik/OCTUPUS_scripts/blastFilter.pl /home/hollybik/grinder_sims/grinder-reads_Banthracis_1000000_NLrem_BLASTout.txt /home/hollybik/grinder_sims/grinder-reads_Banthracis_1000000_NLrem_BLASTfilter.txt

Aaron looked into the issues and said we needed to change the LAST search parameter to -e40 in the FastSeach.pm file. After running a couple datasets (phylosift_devel_20120801 folder), I am now getting hits in the blastDir, but nothing transferring over to the alignDir


8/6/12

After scrumming this morning it sounds like the HMMer search parameters are set as sensitive as they can be – Aaron suggested that perhaps we aren’t pulling out the relevant marker genes from genome contigs during the database update (DBupdate script), so when we go to search against the reference marker genes, we’re not getting any hits because we don’t actually have the B.anthracis markers represented in the PhyloSift core markers. Aaron is looking into this and going to see if we can clarify further.


8/10/12

So all the simulation problems turned out to be a bug in Grinder. Hopefully we can proceed forward more easily with the new simulator problem.

We also seemed to have fixed the problem with the genome contigs — the lack of summaries/kronas was because all the input FASTA headers had the same name. Fixed once you change to unique headers (did this with perl script and then ran AA and nt sequences on master branch w/master markers).


8/13/12

Switched to artsim after finding a bug in grinder. We’ve gone from # of reads to fold coverage.

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.01 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.01x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.25 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.25x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.5 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.5x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 1 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_1x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 2 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_2x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.1 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.1x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 5 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_5x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 10 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_10_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 25 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_25x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 50 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_50x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 100 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_100x_


8/15/12

Running all the artsim B.anthracis data through PhyloSift

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.1x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.1x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.5x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.5x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.01x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.01x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.25x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.25x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_1x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_1x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_2x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_2x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_5x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_5x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_10x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_10x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_25x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_25x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_50x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_50x_2.fq

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_100x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_100x_2.fq


8/20/12

Holly G is going to give us another soil sample on Tues/Wed – this time she’s going to PCR-confirm that there is B.anthracis in the sample.

Also lowering the coverage more for the artsim data

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.005 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.005x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.0025 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.0025x_

./art_illumina -i ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/B_anthracis_genome.fasta -p -l 100 -f 0.001 -m 200 -s 10 -o ~/Desktop/PhyloSift/Grinder_B_anthracis_sim/artsim_sim_reads/artsim_0.001x_

…and then running through PhyloSift:

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.005x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.005x_2.fq –disable_updates

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.0025x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.0025x_2.fq –disable_updates

./phylosift.pl all –debug –paired ~/grinder_sims/artsim_sim_reads/artsim_0.001x_1.fq ~/grinder_sims/artsim_sim_reads/artsim_0.001x_2.fq –disable_updates