
Computational Techniques for Life Sciences

Part of the TACC Institute Series, Immersive Training in Advanced Computation

Code Optimization

Whenever you write a program in a compiled (non-interpreted) language like C, C++, or Fortran, it gets converted into machine code that the processor understands. Our machines specifically understand x86_64. Other chips, like the ones in your phone, understand ARM (a RISC architecture) code.
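
If you want to check which architecture the machine you are logged into uses, uname and lscpu will tell you; on TACC's x86_64 systems the first command prints x86_64, while an ARM machine would report something like aarch64.

$ uname -m
$ lscpu | grep "Model name"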

If you have ever taken a programming class and had to write an algorithm to print out Fibonacci numbers, you probably know there are many ways to achieve the same result, and some of those ways are faster than others. The same rules apply to the machine code that compilers generate, as the short experiment below illustrates.
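
As a quick illustration (using a throwaway example file, sum.c, not part of FLASH), you can compile the same small function with and without optimization and compare the assembly the compiler produces; the two versions differ in length and in the instructions they use. You can try this anywhere GCC is available.

$ cat > sum.c << 'EOF'
> int sum(int n) {
>     int s = 0;
>     for (int i = 1; i <= n; i++)
>         s += i;
>     return s;
> }
> EOF
$ gcc -O0 -S sum.c -o sum_O0.s     # unoptimized assembly
$ gcc -O2 -S sum.c -o sum_O2.s     # optimized assembly
$ wc -l sum_O0.s sum_O2.s          # compare the size of the two versions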

At TACC, whenever we deploy software, we experiment with different ways to compile code. We know these tools will be used by researchers around the world, and we want to make sure the software not only works, but runs as quickly as possible. We are going to experiment with the program FLASH today and learn some tricks for optimizing code.

Default

When you copied over the data files, you already copied over the FLASH source code, so let’s try compiling it and running it.

First, compile it with the default parameters using the GCC compiler. Notice that we will be running all of this from /dev/shm so that the network filesystems do not introduce any slowdowns; just like on your home wifi, where someone might be streaming ALL of YouTube, a shared filesystem can slow down when other users are hammering it.
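
If you would like to confirm that /dev/shm is a RAM-backed tmpfs rather than a network mount, df can show the filesystem type and available space:

$ df -hT /dev/shm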

$ cp SRR*fastq /dev/shm
$ cd /dev/shm
$ tar -xzf ~/FLASH-1.2.11.tar.gz
$ cd FLASH-1.2.11
$ module load gcc
$ make CC=gcc

Now, we can run it.

$ ./flash -m 20 -M 250 -t 1 ../SRR2014925_1.fastq ../SRR2014925_2.fastq

Conveniently, FLASH reports its runtime at the end, so we can compare runtimes across the different ways we compile it. Let's also run it one more time, but record the output to a log file.

$ ./flash -m 20 -M 250 -t 1 ../SRR2014925_1.fastq ../SRR2014925_2.fastq &> default.log
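
The reported runtime sits in the last few lines of the log, so you can read it back at any point with tail:

$ tail default.log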

Optimization Level

By default, FLASH compiles with an optimization level of 2 (-O2). Let's see what happens with an optimization level of 3 (-O3), the highest standard level.
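
If you are curious exactly which additional optimizations -O3 enables on top of -O2, GCC can print the state of every optimization flag at a given level, and the two reports can be diffed (the output file names here are just examples):

$ gcc -Q --help=optimizers -O2 > O2_flags.txt
$ gcc -Q --help=optimizers -O3 > O3_flags.txt
$ diff O2_flags.txt O3_flags.txt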

$ make clean
$ make CC=gcc CFLAGS="-O3 -Wall -std=c99 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64"

After compiling, we should test again.

$ ./flash -m 20 -M 250 -t 1 ../SRR2014925_1.fastq ../SRR2014925_2.fastq &> O3.log
$ tail O3.log

Was there a speedup?

Speedup = (default runtime) / (O3 runtime)
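
For example, with purely illustrative numbers: if the default build reported 60 seconds and the -O3 build reported 50 seconds, the speedup would be 60 / 50 = 1.2x. bc can do the division for you on the command line:

$ echo "scale=2; 60/50" | bc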

Auto-vectorization

We can also tell the compiler the architecture of our CPU to give it the chance to auto-vectorize some loops. Vectorization lets a single instruction operate on several data elements at once (SIMD), so more work gets done in each clock cycle. Lonestar 5 compute nodes use Intel Haswell processors, so we can target them specifically with the -march=haswell flag.
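
Haswell exposes its vector units through the AVX2 instruction set; you can check that the CPU you are on advertises it by looking at the flags in /proc/cpuinfo (the command prints avx2 once if the flag is present):

$ grep -m1 -o avx2 /proc/cpuinfo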

$ make clean
$ make CC=gcc CFLAGS="-O3 -march=haswell -Wall -std=c99 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64"
$ ./flash -m 20 -M 250 -t 1 ../SRR2014925_1.fastq ../SRR2014925_2.fastq &> avx2.log
$ tail avx2.log
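
If you want to check whether the compiler actually vectorized any loops rather than just accepting the flag, recent GCC releases can print a vectorization report with -fopt-info-vec. This is only a sketch, and the exact messages vary by GCC version, but vectorized loops are reported along with their source locations during the build:

$ make clean
$ make CC=gcc CFLAGS="-O3 -march=haswell -fopt-info-vec -Wall -std=c99 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64"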

For times when you do not know the exact architecture, you can use -march=native. Just remember that this binary may not work on another system.
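
To see what -march=native resolves to on the machine you are compiling on, you can ask GCC to dump its target settings and look for the march entry:

$ gcc -march=native -Q --help=target | grep -- '-march='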

Thread Scaling

Now that we have a well-optimized single-threaded version of our code, let's try running it with different core counts and see how well it scales.

$ for N in {1..24}
> do
>   ./flash -m 20 -M 250 -t $N ../SRR2014925_1.fastq ../SRR2014925_2.fastq &> avx2_${N}.log
> done
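
Once the runs finish, you can skim the end of each log to compare the reported runtimes; adjust the number of lines passed to tail if the timing line sits elsewhere in your version of FLASH:

$ for N in {1..24}
> do
>   echo "=== ${N} threads ==="
>   tail -3 avx2_${N}.log
> done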

Explore

Exercises

Back - Task Distribution   —   Next - Hardware