[WireGuard] News about MIPS and ARM optimized code?

Fri Sep 9 15:46:11 CEST 2016

Due to misaligned data fetches, functions like poly1305 cause a
regression on MIPS.

	h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
	h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
	h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
	h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
	h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
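
For anyone following along, here is a minimal sketch of what the five
lines above compute. They split one 16-byte Poly1305 block into five
26-bit limbs using 32-bit little-endian loads at byte offsets 0, 3, 6, 9
and 12, so three of the five loads are misaligned even when src itself
is 4-byte aligned. The names load_le32_unaligned and
poly1305_split_block are hypothetical and only for illustration; this is
not the WireGuard code, and the byte-wise load is just one portable way
to avoid a misaligned word access, not necessarily the change that was
measured here.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value from any
 * address, one byte at a time, so no misaligned word load is issued.
 * The result is the same on big-endian and little-endian hosts. */
static uint32_t load_le32_unaligned(const uint8_t *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/* Hypothetical function: split one 16-byte block into five 26-bit
 * limbs, mirroring the shifts and masks in the lines above.
 * hibit is OR-ed into the top limb (1 << 24 for a full block). */
static void poly1305_split_block(const uint8_t src[16], uint32_t hibit,
				 uint32_t limb[5])
{
	limb[0] = (load_le32_unaligned(src +  0) >> 0) & 0x3ffffff;
	limb[1] = (load_le32_unaligned(src +  3) >> 2) & 0x3ffffff;
	limb[2] = (load_le32_unaligned(src +  6) >> 4) & 0x3ffffff;
	limb[3] = (load_le32_unaligned(src +  9) >> 6) & 0x3ffffff;
	limb[4] = (load_le32_unaligned(src + 12) >> 8) | hibit;
}
```

In the real code each limb is added to the accumulator (h0..h4) rather
than stored. The sketch only isolates the load-and-split step, which is
where the misaligned accesses happen.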

I had 26 Mbit/s before, now it is over 42 Mbit/s.

root@lede:~# iperf3 -c 10.0.0.1 -i 10
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 36216 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.08  sec  51.2 MBytes  42.7 Mbits/sec    0    171 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.08  sec  51.2 MBytes  42.7 Mbits/sec    0             sender
[  4]   0.00-10.08  sec  51.2 MBytes  42.7 Mbits/sec                  receiver

iperf Done.
root@lede:~# iperf3 -c 10.0.0.1 -u -b 1G -i 10
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 60714 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-10.00  sec  56.3 MBytes  47.2 Mbits/sec  7209
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  56.3 MBytes  47.2 Mbits/sec  0.034 ms  0/7209 (0%)
[  4] Sent 7209 datagrams

iperf Done.
root@lede:~#

The work is not done yet, but it is a good start.

Greetings,

René van Dorst.

Quoting René van Dorst <opensource at vdorst.com>:

> I tried writing some MIPS32r2 code.
> I wrote chacha20_keysetup, chacha20_generic_block and
> poly1305_generic_blocks in assembly.
> I tried to keep all needed variables in registers, which should
> reduce the memory overhead.
> But it is very difficult for me to profile the code, or to isolate
> it and build benchmark programs like SUPERCOP.
> So testing was simple: cross-compile the code, copy the module to
> the target and load it, then run the setup script and iperf.
>
> #ifdef CONFIG_CPU_MIPS32_R2
> asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx,
> 	const u8 key[static 32], const u8 nonce[static 8]);
> asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
> asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx,
> 	const u8 *src, unsigned int srclen, u32 hibit);
> #endif
>
> But the speed is equal or lower on my TP-Link WR1043ND device, which
> has a big-endian MIPS32r2 24Kc CPU.
> So GCC does a good job. Also, the 24Kc has no special coprocessors or FPU.
>
> The biggest improvement I got was from changing the buildroot default
> optimization from -Os to -O2.
> This gives around a 1-3% speed improvement.
>
> Ideas:
> - Remove the little-endian conversions on MIPS.
>   Of course, this must also be done on the other side.
>   On this device I can't switch endianness.
>   But I did not see any improvement. Swapping a 32-bit register
>   needs 2 instructions.
>   A quick calculation suggests it could save around 0.4%, which is
>   ~0.1 Mbit/s on this device.
>
> Greetings,
>
> René van Dorst.