<div dir="ltr">Hey René,<div><br></div><div>This is an excellent find. Thanks. These are pretty significant speed improvements. I wonder where else this is happening.</div><div><br></div><div>Have you tested this on both big-endian and little-endian systems?</div><div><br></div><div>The main thing I'm wondering here is why exactly the compiler can't generate more efficient code itself.</div><div><br></div><div>I'll review this and merge it soon if it looks good.</div><div><br></div><div>Regards,</div><div>Jason</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Sep 11, 2016 at 2:06 PM, René van Dorst <span dir="ltr"><<a href="mailto:opensource@vdorst.com" target="_blank">opensource@vdorst.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Fixed a typo: HAVE_EFFICIENT_UNALIGNED_ACCESS --> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.<br>
<br>
From 13fae657624aac6b9c1f411aa6472a91aae7fcc3 Mon Sep 17 00:00:00 2001<span class=""><br>
From: René van Dorst <<a href="mailto:opensource@vdorst.com" target="_blank">opensource@vdorst.com</a>><br>
Date: Sat, 10 Sep 2016 10:58:58 +0200<br></span>
Subject: [PATCH] Add support for platforms which have no efficient unaligned<br>
memory access<br><span class="">
<br>
Without it, throughput on a TP-Link WR1043ND (MIPS32r2 @ 400 MHz) drops to 55.2% of what the aligned-fetch version achieves (23.8 vs. 43.1 Mbits/sec over TCP).<br>
<br></span><span class="">
Simply check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS at compile time.<br>
<br>
Tested on a TP-Link WR1043ND, MIPS32r2 @ 400 MHz.<br>
Setup: <a href="https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html" rel="noreferrer" target="_blank">https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html</a><br>
<br></span><div><div class="h5">
Benchmarks before:<br>
<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10<br>
[ ID] Interval Transfer Bandwidth Retr Cwnd<br>
[ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Retr<br>
[ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 sender<br>
[ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec receiver<br>
<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G<br>
[ ID] Interval Transfer Bandwidth Total Datagrams<br>
[ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams<br>
[ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%)<br>
[ 4] Sent 3982 datagrams<br>
<br>
Benchmarks with aligned memory fetching:<br>
<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10<br>
[ ID] Interval Transfer Bandwidth Retr Cwnd<br>
[ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Retr<br>
[ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 sender<br>
[ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec receiver<br>
<br>
iperf Done.<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G<br>
[ ID] Interval Transfer Bandwidth Total Datagrams<br>
[ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams<br>
[ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%)<br>
[ 4] Sent 7207 datagrams<br></div></div><span class="">
---<br>
src/crypto/chacha20poly1305.c | 31 +++++++++++++++++++++++++++++++<br>
1 file changed, 31 insertions(+)<br>
<br></span>
diff --git a/src/crypto/chacha20poly1305.c b/src/crypto/chacha20poly1305.c<br>
index 5190894..294cbf6 100644<span class=""><br>
--- a/src/crypto/chacha20poly1305.c<br>
+++ b/src/crypto/chacha20poly1305.c<br>
@@ -248,13 +248,29 @@ struct poly1305_ctx {<br>
<br>
static void poly1305_init(struct poly1305_ctx *ctx, const u8 key[static POLY1305_KEY_SIZE])<br>
{<br></span>
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS<span class=""><br>
+ u32 t0, t1, t2, t3;<br>
+#endif<br>
+<br>
memset(ctx, 0, sizeof(struct poly1305_ctx));<br>
/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */<br></span>
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS<span class=""><br>
ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff;<br>
ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03;<br>
ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff;<br>
ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff;<br>
ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;<br>
+#else<br>
+ t0 = le32_to_cpuvp(key + 0);<br>
+ t1 = le32_to_cpuvp(key + 4);<br>
+ t2 = le32_to_cpuvp(key + 8);<br>
+ t3 = le32_to_cpuvp(key + 12);<br>
+ ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6;<br>
+ ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12;<br>
+ ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18;<br>
+ ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8;<br>
+ ctx->r[4] = t3 & 0x00fffff;<br>
+#endif<br>
ctx->s[0] = le32_to_cpuvp(key + 16);<br>
ctx->s[1] = le32_to_cpuvp(key + 20);<br>
ctx->s[2] = le32_to_cpuvp(key + 24);<br>
@@ -267,6 +283,9 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *<br>
u32 s1, s2, s3, s4;<br>
u32 h0, h1, h2, h3, h4;<br>
u64 d0, d1, d2, d3, d4;<br></span>
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS<span class=""><br>
+ u32 t0, t1, t2, t3;<br>
+#endif<br>
<br>
r0 = ctx->r[0];<br>
r1 = ctx->r[1];<br>
@@ -287,11 +306,23 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *<br>
<br>
while (likely(srclen >= POLY1305_BLOCK_SIZE)) {<br>
/* h += m[i] */<br></span>
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS<span class="im HOEnZb"><br>
h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff;<br>
h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff;<br>
h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff;<br>
h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff;<br>
h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;<br>
+#else<br>
+ t0 = le32_to_cpuvp(src + 0);<br>
+ t1 = le32_to_cpuvp(src + 4);<br>
+ t2 = le32_to_cpuvp(src + 8);<br>
+ t3 = le32_to_cpuvp(src + 12);<br>
+ h0 += t0 & 0x3ffffff;<br>
+ h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff;<br>
+ h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff;<br>
+ h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff;<br>
+ h4 += (t3 >> 8) | hibit;<br>
+#endif<br>
<br>
/* h *= r */<br>
d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + mlt(h3, s2) + mlt(h4, s1);<br>
--<br>
2.5.5<br>
<br>
<br></span><div class="HOEnZb"><div class="h5">
_______________________________________________<br>
WireGuard mailing list<br>
<a href="mailto:WireGuard@lists.zx2c4.com" target="_blank">WireGuard@lists.zx2c4.com</a><br>
<a href="http://lists.zx2c4.com/mailman/listinfo/wireguard" rel="noreferrer" target="_blank">http://lists.zx2c4.com/mailman/listinfo/wireguard</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Jason A. Donenfeld<br>Deep Space Explorer<br>fr: +33 6 51 90 82 66<br>us: +1 513 476 1200<br><a href="http://www.jasondonenfeld.com" target="_blank">www.jasondonenfeld.com</a><br><a href="http://www.zx2c4.com" target="_blank">www.zx2c4.com</a><br><a href="http://zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc" target="_blank">zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc</a></div>
</div>
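<div>For reference, here is a minimal standalone sketch of the two ways the patch splits a 16-byte Poly1305 block into 26-bit limbs: five overlapping loads at byte offsets 0, 3, 6, 9 and 12, versus four loads at offsets 0, 4, 8 and 12 stitched together with shifts. It only shows that both forms produce the same limbs; it does not measure speed or explain the compiler's code generation. The names load_le32, limbs_overlapping, limbs_word_loads and limbs_agree are made up for this illustration. load_le32 is a byte-wise stand-in for le32_to_cpuvp, whose definition is not part of the patch, and the hibit term is left out.</div>

```c
#include <assert.h>	/* only needed for the usage example below */
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for le32_to_cpuvp(): assembles a little-endian
 * 32-bit value byte by byte, so it works at any alignment and gives the
 * same result on big-endian and little-endian hosts.
 */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Five overlapping loads at byte offsets 0, 3, 6, 9, 12. */
static void limbs_overlapping(const uint8_t m[16], uint32_t h[5])
{
	h[0] = (load_le32(m + 0) >> 0) & 0x3ffffff;	/* bits   0..25  */
	h[1] = (load_le32(m + 3) >> 2) & 0x3ffffff;	/* bits  26..51  */
	h[2] = (load_le32(m + 6) >> 4) & 0x3ffffff;	/* bits  52..77  */
	h[3] = (load_le32(m + 9) >> 6) & 0x3ffffff;	/* bits  78..103 */
	h[4] = (load_le32(m + 12) >> 8);		/* bits 104..127 */
}

/* Four loads at offsets 0, 4, 8, 12; limbs are cut out with 64-bit shifts. */
static void limbs_word_loads(const uint8_t m[16], uint32_t h[5])
{
	uint32_t t0 = load_le32(m + 0);
	uint32_t t1 = load_le32(m + 4);
	uint32_t t2 = load_le32(m + 8);
	uint32_t t3 = load_le32(m + 12);

	h[0] = t0 & 0x3ffffff;
	h[1] = (uint32_t)((((uint64_t)t1 << 32) | t0) >> 26) & 0x3ffffff;
	h[2] = (uint32_t)((((uint64_t)t2 << 32) | t1) >> 20) & 0x3ffffff;
	h[3] = (uint32_t)((((uint64_t)t3 << 32) | t2) >> 14) & 0x3ffffff;
	h[4] = t3 >> 8;
}

/* Returns 1 when both variants yield identical limbs for the block. */
static int limbs_agree(const uint8_t m[16])
{
	uint32_t a[5], b[5];

	limbs_overlapping(m, a);
	limbs_word_loads(m, b);
	return memcmp(a, b, sizeof(a)) == 0;
}
```

<div>A caller could fill a 17-byte buffer and run assert(limbs_agree(buf + 1)) to exercise a deliberately misaligned block. Because this sketch reads bytes individually, the check is independent of host endianness; the kernel helper may behave differently, so testing the real patch on both big-endian and little-endian hardware is still worthwhile.</div>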