-------
FFdecsa
-------

Compiling is as easy as running a make command, if you have gcc and are
using a little endian machine. 64 bit machines have not been tested but
may work with little or no changes; big endian machines will certainly
give incorrect results (read technical_background.txt to see where the
problem is).

Before compiling you could edit the Makefile to tweak compiler flags for
optimal performance. If you want to play with different bit-grouping
strategies you have to edit FFdecsa_DBG.c and change the "our choice"
definition. This is highly critical for performance.

After compilation run the FFdecsa_test application. It will test correct
decryption and print the measured speed (use "nice --19 ./FFdecsa_test"
on an idle machine for better results). Or just use "make test".

gcc >=3.3.3 is highly recommended. Older versions could give performance
problems.

icc is currently unusable. In the initial phases of development of
FFdecsa icc was able to compile the code and gave interesting speed
results when using the 8charA grouping mode (arrays of 8 characters are
automatically manipulated through MMX instructions). At some point the
code began to work incorrectly because of a compiler bug (but I found a
workaround). Then the performance dropped for no apparent reason; I
found a workaround by adding an unused variable (alignment problem, grep
for icc in the code to see where it happens). Then, with the
introduction of group modes based on intrinsics, gcc was finally able to
go beyond the speed record originally set by icc. Additional code tweaks
added more speed to gcc, while icc started to segfault on compilation
(both version 7 and 8). In conclusion, icc is buggy and this code is too
hard for it.

gcc on the other hand is great. I tried to inspect the generated
assembler to find weak spots, and the generated code is very good
indeed.

Note: the code can be compiled with gcc or g++. g++ is 3% faster for
some reason. You should not get any errors or warnings.
I only get two "inlining failed" warnings on two functions I asked to be
inlined but gcc doesn't want to inline.

The build process creates additional temp files by running grep
commands. This is how debugging output is handled. All the lines
containing DBG are removed and the temp file is compiled (so the line
numbers change between temp and original files). Don't edit the temp
files, they will be overwritten. If you don't remove the DBG lines (for
example, by changing "grep -v DBG" into "grep -v aaDBG" in the Makefile)
a lot of output will be generated. This is useful to understand what's
wrong when FFdecsa_test is failing. I included a reference "known good"
output in the debug_output directory. Extra debug output is commented
out in the code.

The debug output functionality could be... buggy. This is because I
tested everything using hard coded int grouping mode and then
generalized the debug output to abstract grouping modes. A bug where 4
bytes are printed instead of 8 could be present somewhere. I think it
isn't, but you've been warned.

This code was only tried on Linux. It should work on Windows or other
platforms, but you may encounter problems related to the compiler
quality. If you want to try, begin with the int grouping mode. It is
only 30% slower than the best (MMX) and it should be easily portable
because no intrinsics are used.

I'm particularly interested in hearing what kind of performance can be
obtained on x86_64 processors in int, long long int, mmx, 2mmx, sse
modes.

As a reference, here are the results I get on an Athlon XP 2400+ (this
processor runs at 2000MHz); other processors belonging to the Athlon XP
architecture, including Durons, should have the same speed per MHz.
Cache size and bus speed don't matter.

CPU:      AMD Athlon XP 2400+
Compiler: g++ (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7))
Flags:    -O3 -march=athlon-xp -fexpensive-optimizations -funroll-loops
          --param max-unrolled-insns=500

grouping mode           speed (Mbit/s)   notes
---------------------------------------------------------------------
PARALLEL_32_4CHAR              14
PARALLEL_32_4CHARA             12
PARALLEL_32_INT               125        very good and very portable
PARALLEL_64_8CHAR              17
PARALLEL_64_8CHARA             15        needs a vectorizing compiler
PARALLEL_64_2INT               75        x86 has too few registers
PARALLEL_64_LONG               97        try this on x86_64
PARALLEL_64_MMX               165        the best
PARALLEL_128_16CHAR             6
PARALLEL_128_16CHARA            7
PARALLEL_128_4INT              69
PARALLEL_128_2LONG             52
PARALLEL_128_2MMX              36        slower than expected
PARALLEL_128_SSE              156        just slower than 64_MMX

Best speeds are obtained with native data types: int, mmx, sse (this
could be a compiler artifact). 64 bit processors should try 64_LONG.
Vectorizing compilers should like *CHARA.

64_MMX is faster than 128_SSE on the Athlon; perhaps SSE instructions
are internally split into 64 bit chunks. Could be different on x86_64 or
Intel processors.

128_SSE has a 64 bit (MMX) batch type because SSE has no shifting
instructions; they are only available in SSE2. As the Athlon XP doesn't
support SSE2, I couldn't experiment with that.
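
Since the code only gives correct results on little endian machines, it
can be worth checking the byte order of an unfamiliar platform before
trying to port anything. The following is a minimal sketch of such a
check. It is not part of FFdecsa; the function name is_little_endian is
made up for this example, and the check only looks at how a 32 bit
integer is laid out in memory.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of FFdecsa.
 * Returns 1 if the least significant byte of a 32 bit integer is stored
 * first in memory (little endian), 0 for any other byte order.
 */
int is_little_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    /* Copy the in-memory representation of the integer byte by byte. */
    memcpy(bytes, &probe, sizeof probe);

    /* On a little endian machine the lowest byte (0x04) comes first. */
    return bytes[0] == 0x04;
}
```

Calling is_little_endian() from a small main() and printing the result
is enough to decide whether it makes sense to go on. A return value of 0
means the results of FFdecsa_test should not be trusted on that machine.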