talk @ metabarcoding.org
Parallel Strategies

Parallel Strategies - cbird - 07-23-2016

We've been using OBITools in my lab for a few months. It has been working well, but most or all of the commands are single-threaded. I actually like that because it gives me more control over how OBITools runs.

I was wondering if anybody was willing to share their strategies for making obitools parallel?

Here is ours:

This past week, I was helping a student with illuminapairedend, obigrep, and ngsfilter. Our workstation (dual Xeon v3, 40 threads of capacity) was taking several hours to plow through illuminapairedend because it was using only 1 thread. I wanted to see how adjusting the alignment score would affect the retention of sequences, so we wanted to run illuminapairedend with 5 different alignment scores. We wrote a bash script that splits the fastq files into several smaller files, then uses GNU parallel (rather than a nested for loop) to run illuminapairedend on the directory of files while iterating through the 5 alignment scores, and it finished very quickly. After the GNU parallel command, we concatenate the files back together. FWIW, changing the alignment score didn't change much. However, after reading the documentation, we changed a couple of arguments and increased our number of retained reads by 5x. We were able to use the same parallel strategy with the other OBITools commands as well.
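A rough Python sketch of that job grid, for anyone who prefers it to bash. It is a stand-in, not the script described above: the chunk file names, the five score values, and the output layout are made up for illustration. `--score-min` and `-r` are the illuminapairedend options as shown in the OBITools tutorial, so check them against your installed version. The sketch only builds the command list; it does not run anything.

```python
from itertools import product
from pathlib import Path


def score_sweep_jobs(chunk_pairs, scores, out_dir):
    """Return one (command, output_path) job per chunk pair and score."""
    jobs = []
    # product() replaces the nested for loop over chunks and scores.
    for (fwd, rev), score in product(chunk_pairs, scores):
        out_path = Path(out_dir) / f"score{score}" / (Path(fwd).stem + ".paired.fastq")
        cmd = ["illuminapairedend", f"--score-min={score}", "-r", str(rev), str(fwd)]
        jobs.append((cmd, out_path))
    return jobs


# Hypothetical inputs: 3 chunk pairs and 5 candidate scores.
pairs = [(f"chunks/F.part{i:03d}.fastq", f"chunks/R.part{i:03d}.fastq") for i in range(3)]
jobs = score_sweep_jobs(pairs, scores=[20, 30, 40, 50, 60], out_dir="paired")
print(len(jobs))  # 15 jobs: 3 chunk pairs x 5 scores
print(" ".join(jobs[0][0]))
```

Every job is independent of the others, which is why the whole grid can be handed to a parallel runner in one go.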

This same strategy works well for other OBITools steps if you are only processing 1 file. To review:
     Divide the fastq into several files (one pair of F and R files per thread)
     Use GNU parallel instead of a for loop to run the OBITools command on each of the sub-files (this gives a dramatic speed increase if you have a lot of threads available)
     Concatenate the files (or not, if you are going to run another OBITools command with GNU parallel)
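The three steps above can be sketched in Python as follows. This swaps GNU parallel for a thread pool that launches one subprocess per chunk, so it is an illustration of the idea rather than the bash script from this post. It assumes plain 4-line FASTQ records and forward and reverse files that list their reads in the same order. The function names, chunk naming, and default score are hypothetical, and the illuminapairedend options (`--score-min`, `-r`) are taken from the OBITools tutorial.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path


def split_fastq(path, n_parts, out_dir):
    """Deal 4-line FASTQ records round-robin into n_parts chunk files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).stem
    chunk_paths = [out_dir / f"{stem}.part{i:03d}.fastq" for i in range(n_parts)]
    handles = [p.open("w") for p in chunk_paths]
    try:
        with open(path) as fh:
            index = 0
            while True:
                record = list(islice(fh, 4))  # one FASTQ record
                if not record:
                    break
                handles[index % n_parts].writelines(record)
                index += 1
    finally:
        for handle in handles:
            handle.close()
    return chunk_paths


def run_jobs(jobs, max_workers):
    """Run (command, output_path) jobs concurrently, stdout to each output file."""
    def run_one(job):
        cmd, out_path = job
        Path(out_path).parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)
        return Path(out_path)

    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))


def concatenate(paths, dest):
    """Append the files in `paths` to `dest`, in the given order."""
    with open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(dest)


def pair_in_parallel(forward, reverse, n_parts, work_dir, score_min=40):
    """Split both read files identically, pair each chunk, merge the results."""
    work_dir = Path(work_dir)
    fwd_chunks = split_fastq(forward, n_parts, work_dir / "fwd")
    rev_chunks = split_fastq(reverse, n_parts, work_dir / "rev")
    jobs = [
        (["illuminapairedend", f"--score-min={score_min}", "-r", str(r), str(f)],
         work_dir / "paired" / f.name)
        for f, r in zip(fwd_chunks, rev_chunks)
    ]
    return concatenate(run_jobs(jobs, max_workers=n_parts), work_dir / "paired.fastq")
```

A call such as `pair_in_parallel("F.fastq", "R.fastq", 40, "work")` (hypothetical file names) would use one chunk pair per thread. Because both files are dealt out by record index, read i of the forward file and read i of the reverse file always land in chunks with the same number. The merged file holds the reads in chunk order rather than the original order.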


RE: Parallel Strategies - cbird - 08-01-2016

Here is the newest parallel strategy we've come up with
(we welcome feedback, comments, suggestions, and criticism):

1) obidistribute to divide the F & R files into 40 files each (40 = # of threads on the workstation)

2) GNU parallel (illuminapairedend)
3) GNU parallel (obigrep to remove unaligned sequences)
4) GNU parallel (ngsfilter)

5) Concatenate files

6) GNU parallel (obigrep) to divide the file by sample

7) GNU parallel (obiuniq)
8) GNU parallel (obiclean)
9) GNU parallel (obigrep for additional filtering)
10) Remove chimeras with other software
11) GNU parallel (ecotag to identify sequences)  *haven't tried this one in parallel yet
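
As an illustration of step 6, here is a minimal Python sketch that divides the concatenated file by sample in a single pass. It is a stand-in for the per-sample obigrep calls, not the method used in this thread. It assumes 4-line FASTQ records whose header line carries a `sample=<name>;` attribute, as ngsfilter writes it, and that sample names are safe to use as file names. The function name and the `unassigned` label for records without a sample tag are hypothetical.

```python
import re
from itertools import islice
from pathlib import Path

# OBITools-style header attribute, e.g. "... sample=29a_F260619; ..."
SAMPLE_TAG = re.compile(r"\bsample=([^;]+);")


def split_by_sample(fastq_path, out_dir):
    """Write each record to <out_dir>/<sample>.fastq; return {sample: count}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, counts = {}, {}
    try:
        with open(fastq_path) as fh:
            while True:
                record = list(islice(fh, 4))  # one FASTQ record
                if not record:
                    break
                match = SAMPLE_TAG.search(record[0])
                sample = match.group(1).strip() if match else "unassigned"
                if sample not in handles:
                    # One open file per sample, created on first sight.
                    handles[sample] = open(out_dir / f"{sample}.fastq", "w")
                    counts[sample] = 0
                handles[sample].writelines(record)
                counts[sample] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

The resulting per-sample files can then be fed to steps 7 to 9 as independent jobs, one sample per thread. With a very large number of samples the one-open-file-per-sample approach may hit the operating system's open-file limit.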