<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title></title>
</head>
<body style="font-family:Arial;font-size:14px">
<p lang="en-US" style="margin-bottom: 0cm; line-height: 100%">Hi Jason,<br>
<br>
I searched a bit for other places where this might be happening, but could not find any.<br>
<br>
> Have you tested this on both endians?<br>
<br>
No, my hardware only supports big endian.<br>
I am not experienced enough to run it in QEMU.</p>
<p>I see it is already applied. Great!<br>
<br>
Greetings,<br>
<br>
René van Dorst.<br>
<br>
Quoting "Jason A. Donenfeld" <<a href="mailto:Jason@zx2c4.com">Jason@zx2c4.com</a>>:</p>
<blockquote style="border-left:2px solid blue;margin-left:2px;padding-left:12px;" type="cite">
<div dir="ltr">Hey René,
<div> </div>
<div>This is an excellent find. Thanks. Pretty significant speed improvements. I wonder where else this is happening too.</div>
<div> </div>
<div>Have you tested this on both endians?</div>
<div> </div>
<div>The main thing I'm wondering here is why exactly the compiler can't generate more efficient code itself. </div>
<div> </div>
<div>I'll review this and merge soon if it looks good.</div>
<div> </div>
<div>Regards,</div>
<div>Jason</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Sun, Sep 11, 2016 at 2:06 PM, René van Dorst <span dir="ltr"><<a href="mailto:opensource@vdorst.com" target="_blank">opensource@vdorst.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<p>Typo HAVE_EFFICIENT_UNALIGNED_ACCES<wbr>S --> CONFIG_HAVE_EFFICIENT_UNALIGNE<wbr>D_ACCESS.<br>
<br>
From 13fae657624aac6b9c1f411aa6472a<wbr>91aae7fcc3 Mon Sep 17 00:00:00 2001<br>
<span>From: René van Dorst</span> <<a href="mailto:opensource@vdorst.com" target="_blank">opensource@vdorst.com</a>><br>
Date: Sat, 10 Sep 2016 10:58:58 +0200<br>
Subject: [PATCH] Add support for platforms which have no efficient unaligned<br>
memory access<br>
<br>
<span>Without it, throughput on a TP-Link WR1043ND (MIPS32r2 @ 400 MHz) drops to 55.2% of the aligned-access figure (23.8 vs. 43.1 Mbits/sec over TCP).</span><br>
<br>
<span>Check for CONFIG_HAVE_EFFICIENT_UNALIGNE<wbr></span>D_ACCESS at compile time and, when it is not defined, load the data at 4-byte offsets and shift the words into place instead.<br>
<br>
Tested on TP-Link WR1043ND, MIPS32r2 @ 400 MHz.<br>
Setup: <a href="https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html" rel="noreferrer" target="_blank">https://lists.zx2c4.com/piperm<wbr>ail/wireguard/2016-August/<wbr>000331.html</a><br>
<br></p>
<div>
<div class="h5">Benchmarks before:<br>
<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10<br>
[ ID] Interval Transfer Bandwidth Retr Cwnd<br>
[ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Retr<br>
[ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 sender<br>
[ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec receiver<br>
<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G<br>
[ ID] Interval Transfer Bandwidth Total Datagrams<br>
[ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams<br>
[ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%)<br>
[ 4] Sent 3982 datagrams<br>
<br>
Benchmarks with aligned memory fetching:<br>
<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10<br>
[ ID] Interval Transfer Bandwidth Retr Cwnd<br>
[ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Retr<br>
[ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 sender<br>
[ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec receiver<br>
<br>
iperf Done.<br>
root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G<br>
[ ID] Interval Transfer Bandwidth Total Datagrams<br>
[ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207<br>
- - - - - - - - - - - - - - - - - - - - - - - - -<br>
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams<br>
[ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%)<br>
[ 4] Sent 7207 datagrams</div>
</div>
<span>---<br>
src/crypto/chacha20poly1305.c | 31 ++++++++++++++++++++++++++++++<wbr></span>+<br>
1 file changed, 31 insertions(+)<br>
<br>
diff --git a/src/crypto/chacha20poly1305.<wbr>c b/src/crypto/chacha20poly1305.<wbr>c<br>
index 5190894..294cbf6 100644<br>
<span>--- a/src/crypto/chacha20poly1305.<wbr></span>c<br>
+++ b/src/crypto/chacha20poly1305.<wbr>c<br>
@@ -248,13 +248,29 @@ struct poly1305_ctx {<br>
<br>
static void poly1305_init(struct poly1305_ctx *ctx, const u8 key[static POLY1305_KEY_SIZE])<br>
{<br>
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNE<wbr>D_ACCESS<br>
<span>+ u32 t0, t1, t2, t3;<br>
+#endif<br>
+<br>
memset(ctx, 0, sizeof(struct poly1305_ctx));<br>
/* r &= 0xffffffc0ffffffc0ffffffc0ffff<wbr></span>fff */<br>
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNE<wbr>D_ACCESS<br>
<span> ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff;<br>
ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03;<br>
ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff;<br>
ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff;<br>
ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;<br>
+#else<br>
+ t0 = le32_to_cpuvp(key + 0);<br>
+ t1 = le32_to_cpuvp(key + 4);<br>
+ t2 = le32_to_cpuvp(key + 8);<br>
+ t3 = le32_to_cpuvp(key +12);<br>
+ ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6;<br>
+ ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12;<br>
+ ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18;<br>
+ ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8;<br>
+ ctx->r[4] = t3 & 0x00fffff;<br>
+#endif<br>
ctx->s[0] = le32_to_cpuvp(key + 16);<br>
ctx->s[1] = le32_to_cpuvp(key + 20);<br>
ctx->s[2] = le32_to_cpuvp(key + 24);<br>
@@ -267,6 +283,9 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *<br>
u32 s1, s2, s3, s4;<br>
u32 h0, h1, h2, h3, h4;<br>
u64 d0, d1, d2, d3, d4;</span><br>
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNE<wbr>D_ACCESS<br>
<span>+ u32 t0, t1, t2, t3;<br>
+#endif<br>
<br>
r0 = ctx->r[0];<br>
r1 = ctx->r[1];<br>
@@ -287,11 +306,23 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *<br>
<br>
while (likely(srclen >= POLY1305_BLOCK_SIZE)) {<br>
/* h += m[i] */</span><br>
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNE<wbr>D_ACCESS<br>
<span class="im HOEnZb"> h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff;<br>
h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff;<br>
h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff;<br>
h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff;<br>
h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;<br>
+#else<br>
+ t0 = le32_to_cpuvp(src + 0);<br>
+ t1 = le32_to_cpuvp(src + 4);<br>
+ t2 = le32_to_cpuvp(src + 8);<br>
+ t3 = le32_to_cpuvp(src + 12);<br>
+ h0 += t0 & 0x3ffffff;<br>
+ h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff;<br>
+ h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff;<br>
+ h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff;<br>
+ h4 += (t3 >> 8) | hibit;<br>
+#endif<br>
<br>
/* h *= r */<br>
d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + mlt(h3, s2) + mlt(h4, s1);<br>
--<br>
2.5.5</span><br>
<br>
<br>
<div class="HOEnZb">
<div class="h5">______________________________<wbr>_________________<br>
WireGuard mailing list<br>
<a href="mailto:WireGuard@lists.zx2c4.com" target="_blank">WireGuard@lists.zx2c4.com</a><br>
<a href="http://lists.zx2c4.com/mailman/listinfo/wireguard" rel="noreferrer" target="_blank">http://lists.zx2c4.com/mailman<wbr>/listinfo/wireguard</a></div>
</div>
</blockquote>
</div>
<br>
<br clear="all">
<div> </div>
--<br>
<div class="gmail_signature" data-smartmail="gmail_signature">Jason A. Donenfeld<br>
Deep Space Explorer<br>
fr: +33 6 51 90 82 66<br>
us: +1 513 476 1200<br>
<a href="http://www.jasondonenfeld.com" target="_blank">www.jasondonenfeld.com</a><br>
<a href="http://www.zx2c4.com" target="_blank">www.zx2c4.com</a><br>
<a href="http://zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc" target="_blank">zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc</a></div>
</div>
</blockquote>
<p><br>
<br></p>
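<p>To illustrate the patch quoted above: the original code reads each 26-bit limb with an overlapping 32-bit load at byte offsets 0, 3, 6, 9 and 12, while the fallback path loads four words at offsets 0, 4, 8 and 12 and recombines them with shifts. The userspace C sketch below checks that both schemes yield the same limbs. It is only an illustration, not code from the patch: load_le32, limbs_overlapping, limbs_word_aligned and limbs_agree are hypothetical names, the typedefs stand in for the kernel's u8 and u32, and the shift-and-or form follows the poly1305_init hunk rather than the sr() form used in poly1305_generic_blocks.</p>
<pre>
```c
/* Userspace stand-ins for the kernel types used in the patch.
 * Assumption: unsigned int is 32 bits wide (checked at compile time). */
typedef unsigned char u8;
typedef unsigned int u32;
typedef char u32_must_be_4_bytes[sizeof(u32) == 4 ? 1 : -1];

/* Hypothetical helper: little-endian 32-bit load assembled from single
 * bytes, so it is valid at any alignment and on either host endianness. */
static u32 load_le32(const u8 *p)
{
	return (u32)p[0] | ((u32)p[1] << 8) | ((u32)p[2] << 16) | ((u32)p[3] << 24);
}

/* Original scheme: five overlapping loads at byte offsets 0, 3, 6, 9, 12. */
void limbs_overlapping(const u8 src[16], u32 limb[5])
{
	limb[0] = (load_le32(src + 0) >> 0) & 0x3ffffff;
	limb[1] = (load_le32(src + 3) >> 2) & 0x3ffffff;
	limb[2] = (load_le32(src + 6) >> 4) & 0x3ffffff;
	limb[3] = (load_le32(src + 9) >> 6) & 0x3ffffff;
	limb[4] = load_le32(src + 12) >> 8;
}

/* Fallback scheme: four loads at offsets 0, 4, 8, 12, then shift and OR
 * neighbouring words together to rebuild the same 26-bit limbs. */
void limbs_word_aligned(const u8 src[16], u32 limb[5])
{
	u32 t0 = load_le32(src + 0);
	u32 t1 = load_le32(src + 4);
	u32 t2 = load_le32(src + 8);
	u32 t3 = load_le32(src + 12);

	limb[0] = t0 & 0x3ffffff;
	limb[1] = ((t0 >> 26) | (t1 << 6)) & 0x3ffffff;
	limb[2] = ((t1 >> 20) | (t2 << 12)) & 0x3ffffff;
	limb[3] = ((t2 >> 14) | (t3 << 18)) & 0x3ffffff;
	limb[4] = t3 >> 8;
}

/* Returns 1 when both schemes produce identical limbs for this block. */
int limbs_agree(const u8 src[16])
{
	u32 a[5], b[5];
	int i;

	limbs_overlapping(src, a);
	limbs_word_aligned(src, b);
	for (i = 0; i < 5; i++)
		if (a[i] != b[i])
			return 0;
	return 1;
}
```
</pre>
<p>Because load_le32 assembles each word from single bytes, the comparison gives the same answer on big- and little-endian hosts. It does not exercise the kernel's le32_to_cpuvp itself, so it checks only the limb arithmetic, not the endianness handling of the real code.</p>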
</body>
</html>