commit 64bb3cd6b3a3724dbca4352a0cb17e8cb694a0f2 Author: Ludvig Strigeus Date: Wed Aug 8 13:12:38 2018 +0200 TunSafe open source (Same as 1.3-rc3 version) diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..fc80387 --- /dev/null +++ b/.gitignore @@ -0,0 +1,18 @@ +/Debug/ +/Release/ +/ipzip2/Debug/ +/Build +/Win32/ +/TunSafe.aps +/ipch +/*.sdf +/*vcxproj.user +/*.opensdf +/*.suo +/.vs/ +/x64/ +/Azire.conf +/*.psess +/*.vspx +/installer/*.zip +/config/ \ No newline at end of file diff --git a/LICENSE.AGPL.TXT b/LICENSE.AGPL.TXT new file mode 100644 index 0000000..a38b98c --- /dev/null +++ b/LICENSE.AGPL.TXT @@ -0,0 +1,76 @@ +AFFERO GENERAL PUBLIC LICENSE +Version 1, March 2002 + +Copyright © 2002 Affero Inc. +510 Third Street - Suite 225, San Francisco, CA 94107, USA + +This license is a modified version of the GNU General Public License copyright (C) 1989, 1991 Free Software Foundation, Inc. made with their permission. Section 2(d) has been added to cover use of software over a computer network. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. + +Preamble + +The licenses for most software are designed to take away your freedom to share and change it. By contrast, the Affero General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This Public License applies to most of Affero's software and to any other program whose authors commit to using it. (Some other Affero software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. + +When we speak of free software, we are referring to freedom, not price. This General Public License is designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. + +To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. + +For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. + +We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. + +Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. + +Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. 
+ +The precise terms and conditions for copying, distribution and modification follow. + +TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + +0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this Affero General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". +Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. + +1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. +You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. + +2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: +a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. +b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. +c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) 
+d) If the Program as you received it is intended to interact with users through a computer network and if, in the version you received, any user interacting with the Program was given the opportunity to request transmission to that user of the Program's complete source code, you must not remove that facility from your modified version of the Program or work based on the Program, and must offer an equivalent opportunity for all users interacting with your Program through a computer network to request immediate transmission by HTTP of the complete source code of your modified version or other derivative work. +These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. + +3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: +a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, +b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, +c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) +The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. + +4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. +5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. +6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. +7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. +If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. + +This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. + +8. 
If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. +9. Affero Inc. may publish revised and/or new versions of the Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. +Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by Affero, Inc. If the Program does not specify a version number of this License, you may choose any version ever published by Affero, Inc. + +You may also choose to redistribute modified versions of this program under any version of the Free Software Foundation's GNU General Public License version 3 or higher, so long as that version of the GNU GPL includes terms and conditions substantially equivalent to those of this license. + +10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by Affero, Inc., write to us; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. +NO WARRANTY + +11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. +12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. diff --git a/README.md b/README.md new file mode 100644 index 0000000..c679487 --- /dev/null +++ b/README.md @@ -0,0 +1,11 @@ +# TunSafe +Source code of the TunSafe client. + +This open sourced TunSafe code is AGPL-1.0 licensed. Do note that the repository contains BSD and OpenSSL licensed files, so if you want to release a version based off of this repository you need to take that into account. 
+ +To build on Windows, open TunSafe.sln and build, or run build.py. + +To build on Linux, run build_linux.sh + +To build on FreeBSD, run build_freebsd.sh + diff --git a/TunSafe.conf b/TunSafe.conf new file mode 100644 index 0000000..073c9e5 --- /dev/null +++ b/TunSafe.conf @@ -0,0 +1,16 @@ +[Interface] +PrivateKey = KMakx+0sYjWKnkY2pO8+CFZ0Sp+Gzzp/GfxwlR+WgXQ= +ListenPort = 51820 +Address = 192.168.2.2/24 +MTU = 1420 + + +[Peer] +PublicKey = 2m1BdGW9AwwF5dqaGm0NgMggdDZDUPFAL4JxCySdgBw= +#AllowedIPs = 0.0.0.0/0, fc00::2/64 +AllowedIPs = 192.168.2.0/24 +Endpoint = 192.168.1.4:8040 +#Endpoint = [fe80::6825:68f4:7c6f:42d4]:8040 +PersistentKeepalive = 25 + + diff --git a/TunSafe.rc b/TunSafe.rc new file mode 100644 index 0000000..7139e02 Binary files /dev/null and b/TunSafe.rc differ diff --git a/TunSafe.sln b/TunSafe.sln new file mode 100644 index 0000000..cc929b1 --- /dev/null +++ b/TunSafe.sln @@ -0,0 +1,46 @@ + +Microsoft Visual Studio Solution File, Format Version 12.00 +# Visual Studio 15 +VisualStudioVersion = 15.0.26403.7 +MinimumVisualStudioVersion = 10.0.40219.1 +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "TunSafe", "TunSafe.vcxproj", "{626FBC16-64C6-407D-BC2B-6C087794E0D0}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|Win32 = Debug|Win32 + Debug|x64 = Debug|x64 + Release|Win32 = Release|Win32 + Release|x64 = Release|x64 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|Win32.ActiveCfg = Debug|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|Win32.Build.0 = Debug|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|x64.ActiveCfg = Debug|x64 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|x64.Build.0 = Debug|x64 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|Win32.ActiveCfg = Release|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|Win32.Build.0 = Release|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|x64.ActiveCfg = Release|x64 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|x64.Build.0 = Release|x64 + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection +EndGlobal diff --git a/TunSafe.vcxproj b/TunSafe.vcxproj new file mode 100644 index 0000000..f9118c6 --- /dev/null +++ b/TunSafe.vcxproj @@ -0,0 +1,268 @@ + + + + + Debug + Win32 + + + Debug + x64 + + + Release + Win32 + + + Release + x64 + + + + {626FBC16-64C6-407D-BC2B-6C087794E0D0} + Win32Proj + TunSafe + 10.0.15063.0 + TunSafe + + + + Application + true + v141 + MultiByte + + + Application + true + v141 + MultiByte + + + Application + false + v141 + true + MultiByte + + + Application + false + v141 + true + MultiByte + + + + + + + + + + + + + + + + + + + + true + TunSafe + $(SolutionDir)$(Platform)\$(Configuration)\ + $(Platform)\$(Configuration)\ + + + true + 
$(VC_ExecutablePath_x64);$(WindowsSDK_ExecutablePath);$(VS_ExecutablePath);$(MSBuild_ExecutablePath);$(FxCopDir);$(PATH);C:\Bin\Dev\nasm + TunSafe + + + false + TunSafe + $(SolutionDir)$(Platform)\$(Configuration)\ + $(Platform)\$(Configuration)\ + + + false + $(VC_ExecutablePath_x64);$(WindowsSDK_ExecutablePath);$(VS_ExecutablePath);$(MSBuild_ExecutablePath);$(FxCopDir);$(PATH);C:\Bin\Dev\nasm + TunSafe + + + + Use + Level3 + Disabled + WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS + . + + + Windows + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib + RequireAdministrator + + + + + Use + Level3 + Disabled + WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS=1 + + + . + + + Windows + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib;Comctl32.lib + + + RequireAdministrator + + + + + Level3 + Use + MaxSpeed + true + true + WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS + MultiThreaded + . + + + Windows + true + true + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib + RequireAdministrator + + + + + Level3 + Use + MinSpace + true + true + WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS=1 + MultiThreaded + Size + + + AnySuitable + true + . + + + Windows + true + true + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib + RequireAdministrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + NotUsing + NotUsing + + + NotUsing + NotUsing + + + NotUsing + NotUsing + NotUsing + NotUsing + + + Create + Create + Create + Create + + + + + + + + + + + + + + + + + true + true + + + true + true + + + true + true + + + Document + true + true + + + true + true + + + true + true + + + + + + + \ No newline at end of file diff --git a/TunSafe.vcxproj.filters b/TunSafe.vcxproj.filters new file mode 100644 index 0000000..220b7f6 --- /dev/null +++ b/TunSafe.vcxproj.filters @@ -0,0 +1,154 @@ + + + + + {4FC737F1-C7A5-4376-A066-2A32D752A2FF} + cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx + + + {cfa17b4c-1bee-434e-81b4-ba780c3f7e2d} + + + {49ba9478-f871-449f-a410-b401e993893f} + + + {d31b1b9f-4a2e-42d4-a26c-7c3daa4ccbe3} + + + + + Source Files + + + Source Files + + + + Source Files + + + Source Files + + + Source Files + + + Source Files\Win32 + + + Source Files\Win32 + + + Source Files\Win32 + + + crypto + + + crypto + + + Source Files + + + Source Files + + + crypto + + + Source Files + + + crypto\aesgcm + + + Source Files + + + Source Files + + + Source Files + + + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files\Win32 + + + Source Files\Win32 + + + Source Files\Win32 + + + crypto + + + crypto + + + crypto + + + crypto + + + Source Files + + + crypto + + + crypto\aesgcm + + + Source Files + + + Source Files + + + + + + + + + + + + + + crypto + + + crypto + + + crypto + + + crypto\aesgcm + + + crypto\aesgcm + + + 
crypto\aesgcm + + \ No newline at end of file diff --git a/benchmark.cpp b/benchmark.cpp new file mode 100644 index 0000000..4d30d80 --- /dev/null +++ b/benchmark.cpp @@ -0,0 +1,94 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus. All Rights Reserved.
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "crypto/chacha20poly1305.h"
+#include "crypto/aesgcm/aes.h"
+#include "tunsafe_cpu.h"
+
+#include <functional>
+#include <string.h>
+
+#if defined(OS_FREEBSD) || defined(OS_LINUX)
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+// Minimal POSIX stand-ins for the Win32 QueryPerformanceCounter API,
+// backed by the monotonic clock; the "frequency" is nanoseconds per second.
+typedef uint64 LARGE_INTEGER;
+void QueryPerformanceCounter(LARGE_INTEGER *x) {
+  struct timespec ts;
+  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
+    fprintf(stderr, "clock_gettime failed\n");
+    exit(1);
+  }
+  *x = (uint64)ts.tv_sec * 1000000000 + ts.tv_nsec;
+}
+
+void QueryPerformanceFrequency(LARGE_INTEGER *x) {
+  *x = 1000000000;
+}
+#elif defined(OS_MACOSX)
+#include <mach/mach_time.h>
+#include <stdio.h>
+#include <stdlib.h>
+typedef uint64 LARGE_INTEGER;
+
+void QueryPerformanceCounter(LARGE_INTEGER *x) {
+  *x = mach_absolute_time();
+}
+
+void QueryPerformanceFrequency(LARGE_INTEGER *x) {
+  mach_timebase_info_data_t timebase = { 0, 0 };
+  if (mach_timebase_info(&timebase) != 0)
+    abort();
+  printf("numer/denom: %d %d\n", timebase.numer, timebase.denom);
+  // Assumes timebase.numer == 1, which holds on Intel Macs.
+  *x = timebase.denom * 1000000000;
+}
+
+#endif
+
+int gcm_self_test();
+
+
+
+void *fake_glb;
+void Benchmark() {
+  int64 a, b, f, t1 = 0, t2 = 0;
+
+#if WITH_AESGCM
+  gcm_self_test();
+#endif // WITH_AESGCM
+
+  PrintCpuFeatures();
+
+  QueryPerformanceFrequency((LARGE_INTEGER*)&f);
+
+  uint8 dst[1500 + 16];
+  uint8 key[32] = {0, 1, 2, 3, 4, 5, 6};
+  uint8 mac[16];
+
+  fake_glb = dst;
+
+  auto RunOneBenchmark = [&](const char *name, const std::function<uint64(size_t)> &ff) {
+    uint64 bytes = 0;
+    QueryPerformanceCounter((LARGE_INTEGER*)&b);
+    size_t i;
+    for (i = 0; bytes < 1000000000; i++)
+      bytes += ff(i);
+    QueryPerformanceCounter((LARGE_INTEGER*)&a);
+    RINFO("%s: %f MB/s", name, (double)bytes * 0.000001 / (a - b) * f);
+  };
+
+  memset(dst, 0, 1500);
+  RunOneBenchmark("chacha20-encrypt", [&](size_t i) -> uint64 { chacha20poly1305_encrypt(dst, dst, 1460, NULL, 0, i, key); return 1460; });
+  RunOneBenchmark("chacha20-decrypt", [&](size_t i) -> uint64 { chacha20poly1305_decrypt_get_mac(dst, dst, 1460, NULL, 0, i, key, mac); return 1460; });
+
+  RunOneBenchmark("poly1305-only", [&](size_t i) -> uint64 { poly1305_get_mac(dst, 1460, NULL, 0, i, key, mac); return 1460; });
+
+#if WITH_AESGCM
+  if (X86_PCAP_AES) {
+    AesGcm128StaticContext sctx;
+    CRYPTO_gcm128_init(&sctx, key, 128);
+
+    RunOneBenchmark("aes128-gcm-encrypt", [&](size_t i) -> uint64 { aesgcm_encrypt(dst, dst, 1460, NULL, 0, i, &sctx); return 1460; });
+    RunOneBenchmark("aes128-gcm-decrypt", [&](size_t i) -> uint64 { aesgcm_decrypt_get_mac(dst, dst, 1460, NULL, 0, i, &sctx, mac); return 1460; });
+  }
+#endif // WITH_AESGCM
+} diff --git a/bit_ops.h b/bit_ops.h new file mode 100644 index 0000000..1e22032 --- /dev/null +++ b/bit_ops.h @@ -0,0 +1,49 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus. All Rights Reserved.
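+//
+// Portable bit-scan helpers: MSVC-style _BitScanReverse wrappers for
+// GCC/Clang (and a 64-bit variant for 32-bit MSVC), plus
+// FindHighestSetBit32/64/128, which return 1 + the index of the highest
+// set bit, or 0 when the input is 0.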
+#pragma once + +#include "tunsafe_types.h" +#include "tunsafe_endian.h" + +#if !defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC) +static inline int _BitScanReverse64(unsigned long *index, uint64 x) { + if (_BitScanReverse(index, x >> 32)) { + (*index) += 32; + return true; + } + return _BitScanReverse(index, (uint32)x); +} +#endif + +#if !defined(COMPILER_MSVC) +static inline int _BitScanReverse64(unsigned long *index, uint64 x) { + *index = 63 - __builtin_clzll(x); + return (x != 0); +} + +static inline int _BitScanReverse(unsigned long *index, uint32 x) { + *index = 31 - __builtin_clz(x); + return (x != 0); +} + +#endif + +static inline int FindHighestSetBit32(uint32 x) { + unsigned long index; + return _BitScanReverse(&index, x) ? (int)(index + 1) : 0; +} + +static inline int FindLastSetBit32(uint32 x) { + unsigned long index; + _BitScanReverse(&index, x); + return index; +} + +static inline int FindHighestSetBit64(uint64 x) { + unsigned long index; + return _BitScanReverse64(&index, x) ? (int)(index + 1) : 0; +} + +static inline int FindHighestSetBit128(uint64 hi, uint64 lo) { + return hi ? 64 + FindHighestSetBit64(hi) : FindHighestSetBit64(lo); +} diff --git a/build.py b/build.py new file mode 100644 index 0000000..934e577 --- /dev/null +++ b/build.py @@ -0,0 +1,95 @@ +# SPDX-License-Identifier: AGPL-1.0-only +# Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +import os +import shutil +import win32crypt +import base64 +import sys +import zipfile +import re + +MSBUILD_PATH = r"C:\Dev\VS2017\MSBuild\15.0\Bin\MSBuild.exe" +NSIS_PATH = r'C:\Dev\NSIS\makeNSIS.EXE' + +SIGNTOOL_PATH = r'c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe' +SIGNTOOL_KEY_PATH = '' # put key here +SIGNTOOL_PASS = '' # put key pass here + +def RmTree(path): + try: + print ('Deleting %s' % path) + shutil.rmtree(path) + except FileNotFoundError: + pass + +def Run(s): + print ('Running %s' % s) + x = os.system(s) + if x: + raise Exception('Command failed (%d) : %s' % (x, s)) + +def CopyFile(src, dst): + shutil.copyfile(src, dst) + +def SignExe(src): + print ('Signing %s' % src) + cmd = r'""c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe" sign /f "%s" /p %s /t http://timestamp.verisign.com/scripts/timstamp.dll "%s"' % (SIGNTOOL_KEY_PATH, SIGNTOOL_PASS, src) + #cmd = r'""c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe" sign %s ' % (SIGNTOOL_KEY_PATH, ) + x = os.system(cmd) + if x: + raise Exception('Signing failed (%d) : %s' % (x, cmd)) + +def GetVersion(): + for line in open(BASE + '/tunsafe_config.h', 'r'): + m = re.match('^#define TUNSAFE_VERSION_STRING "TunSafe (.*)"$', line) + if m: + return m.group(1) + raise Exception('Version not found') + +# + +#os.system(r'""') + +command = sys.argv[1] + +BASE = r'D:\Code\TunSafe' + + +if command == 'build_tap': + Run(r'%s /V4 installer\tap\tap-windows6.nsi' % NSIS_PATH) + SignExe(r'installer\tap\TunSafe-TAP-9.21.2.exe') + sys.exit(0) + +if 1: + RmTree(BASE + r'\Win32\Release') + RmTree(BASE + r'\x64\Release') + Run('%s TunSafe.sln /t:Clean;Rebuild /p:Configuration=Release /p:Platform=x64' % MSBUILD_PATH) + Run('%s TunSafe.sln /t:Clean;Rebuild /p:Configuration=Release /p:Platform=Win32' % MSBUILD_PATH) + +if 1: + CopyFile(BASE + r'\Win32\Release\TunSafe.exe', + BASE + r'\installer\x86\TunSafe.exe') + + SignExe(BASE + r'\installer\x86\TunSafe.exe') + CopyFile(BASE + r'\x64\Release\TunSafe.exe', + BASE + r'\installer\x64\TunSafe.exe') + SignExe(BASE + r'\installer\x64\TunSafe.exe') + +VERSION 
= GetVersion()
+
+Run(r'%s /V4 -DPRODUCT_VERSION=%s installer\tunsafe.nsi ' % (NSIS_PATH, VERSION))
+SignExe(BASE + r'\installer\TunSafe-%s.exe' % VERSION)
+
+zipf = zipfile.ZipFile(BASE + r'\installer\TunSafe-%s-x86.zip' % VERSION, 'w', zipfile.ZIP_DEFLATED)
+zipf.write(BASE + r'\installer\x86\TunSafe.exe', 'TunSafe.exe')
+zipf.write(BASE + r'\installer\License.txt', 'License.txt')
+zipf.write(BASE + r'\installer\ChangeLog.txt', 'ChangeLog.txt')
+zipf.write(BASE + r'\installer\TunSafe.conf', 'Config\\TunSafe.conf')
+zipf.close()
+
+zipf = zipfile.ZipFile(BASE + r'\installer\TunSafe-%s-x64.zip' % VERSION, 'w', zipfile.ZIP_DEFLATED)
+zipf.write(BASE + r'\installer\x64\TunSafe.exe', 'TunSafe.exe')
+zipf.write(BASE + r'\installer\License.txt', 'License.txt')
+zipf.write(BASE + r'\installer\ChangeLog.txt', 'ChangeLog.txt')
+zipf.write(BASE + r'\installer\TunSafe.conf', 'Config\\TunSafe.conf')
+zipf.close() diff --git a/build_config.h b/build_config.h new file mode 100644 index 0000000..953087c --- /dev/null +++ b/build_config.h @@ -0,0 +1,116 @@
+// File is taken from Chromium
+#ifndef BUILD_BUILD_CONFIG_H_
+#define BUILD_BUILD_CONFIG_H_
+
+#if defined(__APPLE__)
+#include <TargetConditionals.h>
+#endif
+
+// A set of macros to use for platform detection.
+#if defined(__APPLE__)
+#define OS_MACOSX 1
+#if defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE
+#define OS_IOS 1
+#endif // defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE
+#elif defined(ANDROID)
+#define OS_ANDROID 1
+#elif defined(__native_client__)
+#define OS_NACL 1
+#elif defined(__FLASHPLAYER)
+#define OS_FLASHPLAYER 1
+#elif defined(__linux__)
+#define OS_LINUX 1
+#elif defined(_WIN32)
+#define OS_WIN 1
+#elif defined(__FreeBSD__)
+#define OS_FREEBSD 1
+#elif defined(__OpenBSD__)
+#define OS_OPENBSD 1
+#elif defined(__sun)
+#define OS_SOLARIS 1
+#elif defined(EMSCRIPTEN)
+#define OS_EMSCRIPTEN 1
+#else
+#error Please add support for your platform in build_config.h
+#endif
+
+// For access to standard BSD features, use OS_BSD instead of a
+// more specific macro.
+#if defined(OS_FREEBSD) || defined(OS_OPENBSD)
+#define OS_BSD 1
+#endif
+
+// For access to standard POSIXish features, use OS_POSIX instead of a
+// more specific macro.
+#if defined(OS_MACOSX) || defined(OS_LINUX) || defined(OS_FREEBSD) || \
+    defined(OS_OPENBSD) || defined(OS_SOLARIS) || defined(OS_ANDROID) || \
+    defined(OS_NACL)
+#define OS_POSIX 1
+#endif
+
+#if defined(OS_POSIX) && !defined(OS_MACOSX) && !defined(OS_ANDROID) && \
+    !defined(OS_NACL)
+#define USE_X11 1 // Use X for graphics.
+#endif
+
+// Compiler detection.
+#if defined(__GNUC__)
+#define COMPILER_GCC 1
+
+#if defined(__clang__)
+#define COMPILER_CLANG 1
+#endif
+#elif defined(_MSC_VER)
+#define COMPILER_MSVC 1
+#elif defined(__TINYC__)
+#define COMPILER_TCC 1
+#else
+#error Please add support for your compiler in build/build_config.h
+#endif
+
+// Processor architecture detection.
For more info on what's defined, see: +// http://msdn.microsoft.com/en-us/library/b0084kay.aspx +// http://www.agner.org/optimize/calling_conventions.pdf +// or with gcc, run: "echo | gcc -E -dM -" +#if defined(_M_X64) || defined(__x86_64__) +#define ARCH_CPU_X86_FAMILY 1 +#define ARCH_CPU_X86_64 1 +#define ARCH_CPU_64_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#define ARCH_CPU_ALLOW_UNALIGNED 1 +#elif defined(_M_IX86) || defined(__i386__) +#define ARCH_CPU_X86_FAMILY 1 +#define ARCH_CPU_X86 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#define ARCH_CPU_ALLOW_UNALIGNED 1 +#define ARCH_CPU_NEED_64BIT_ALIGN 1 +#elif defined(__ARMEL__) || defined(__arm__) && defined(__ARMCC_VERSION) +#define ARCH_CPU_ARM_FAMILY 1 +#define ARCH_CPU_ARMEL 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#elif defined(__pnacl__) +#define ARCH_CPU_32_BITS 1 +#elif defined(__MIPSEL__) +#define ARCH_CPU_MIPS_FAMILY 1 +#define ARCH_CPU_MIPSEL 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#elif defined(EMSCRIPTEN) +#define ARCH_CPU_JS 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#elif defined(__FLASHPLAYER) +#define ARCH_CPU_FLASHPLAYER 1 +#define ARCH_CPU_32_BITS 1 +#else +#error Please add support for your architecture in build_config.h +#endif + +#if defined(ARCH_CPU_LITTLE_ENDIAN) && defined(ARCH_CPU_BIG_ENDIAN) || !defined(ARCH_CPU_LITTLE_ENDIAN) && !defined(ARCH_CPU_BIG_ENDIAN) +#error Please add support for your endianness in build_config.h +#endif + + +#endif // BUILD_BUILD_CONFIG_H_ diff --git a/build_freebsd.sh b/build_freebsd.sh new file mode 100644 index 0000000..93a6236 --- /dev/null +++ b/build_freebsd.sh @@ -0,0 +1,2 @@ +g++7 -I . -O2 -static -mssse3 -o tunsafe benchmark.cpp tunsafe_cpu.cpp wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp crypto/siphash.cpp crypto/chacha20_x64_gas.s crypto/poly1305_x64_gas.s ipzip2/ipzip2.cpp -lrt + diff --git a/build_linux.sh b/build_linux.sh new file mode 100644 index 0000000..63a15bc --- /dev/null +++ b/build_linux.sh @@ -0,0 +1,9 @@ +#!/bin/sh +clang++-6.0 -c -march=skylake-avx512 crypto/poly1305_x64_gas.s crypto/chacha20_x64_gas.s +clang++-6.0 -I . -O3 -mssse3 -pthread -lrt -o tunsafe util.cpp wireguard_config.cpp wireguard.cpp \ +wireguard_proto.cpp network_bsd_mt.cpp tunsafe_cpu.cpp benchmark.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp \ +crypto/curve25519-donna.cpp crypto/siphash.cpp chacha20_x64_gas.o crypto/aesgcm/aesni_gcm_x64_gas.s \ +crypto/aesgcm/aesni_x64_gas.s crypto/aesgcm/aesgcm.cpp poly1305_x64_gas.o ipzip2/ipzip2.cpp \ +crypto/aesgcm/ghash_x64_gas.s + + diff --git a/build_osx.sh b/build_osx.sh new file mode 100644 index 0000000..b95681b --- /dev/null +++ b/build_osx.sh @@ -0,0 +1,17 @@ +set -e + + +clang++ -c -mavx512f -mavx512vl crypto/poly1305_x64_gas_macosx.s crypto/chacha20_x64_gas_macosx.s + +clang++ -g -O3 -I . 
-std=c++11 -DNDEBUG=1 -fno-exceptions -fno-rtti -ffunction-sections -o tunsafe \ +wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd_mt.cpp benchmark.cpp tunsafe_cpu.cpp \ +crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp \ +crypto/siphash.cpp crypto/aesgcm/aesgcm.cpp ipzip2/ipzip2.cpp \ +crypto/aesgcm/aesni_gcm_x64_gas_macosx.s crypto/aesgcm/aesni_x64_gas_macosx.s crypto/aesgcm/ghash_x64_gas_macosx.s \ +chacha20_x64_gas_macosx.o poly1305_x64_gas_macosx.o + +cp tunsafe tunsafe.unstripped +strip tunsafe +rm -f tunsafe_osx.zip +zip tunsafe_osx.zip tunsafe readme_osx.txt + diff --git a/crypto/.gitignore b/crypto/.gitignore new file mode 100644 index 0000000..fb7243a --- /dev/null +++ b/crypto/.gitignore @@ -0,0 +1 @@ +/old/ \ No newline at end of file diff --git a/crypto/aesgcm/aes.h b/crypto/aesgcm/aes.h new file mode 100644 index 0000000..310b1eb --- /dev/null +++ b/crypto/aesgcm/aes.h @@ -0,0 +1,84 @@ +/** + * Downloaded from + * + * http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndael-fst-3.0.zip + * + * rijndael-alg-fst.h + * + * @version 3.0 (December 2000) + * + * Optimised ANSI C code for the Rijndael cipher (now AES) + * + * @author Vincent Rijmen + * @author Antoon Bosselaers + * @author Paulo Barreto + * + * This code is hereby placed in the public domain. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHORS ''AS IS'' AND ANY EXPRESS + * OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE + * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, + * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+#ifndef __RIJNDAEL_ALG_FST_H
+#define __RIJNDAEL_ALG_FST_H
+
+#include "tunsafe_types.h"
+
+#define AESGCM_MAXNR 14
+
+struct AesContext {
+  uint32 rk[(AESGCM_MAXNR + 1) * 4];
+  int rounds;
+};
+
+typedef struct { uint64 hi, lo; } aesgcm_u128;
+
+struct AesGcm128StaticContext {
+  void(*gmult)(uint64 Xi[2], const aesgcm_u128 Htable[16]);
+  void(*ghash)(uint64 Xi[2], const aesgcm_u128 Htable[16], const uint8 *inp, size_t len);
+  bool use_aesni_gcm_crypt;
+
+  // Don't move H and Htable because the asm code depends on them
+  union { uint64 u[2]; uint32 d[4]; uint8 c[16]; size_t t[16 / sizeof(size_t)]; } H;
+  aesgcm_u128 Htable[16];
+  AesContext aes;
+};
+
+struct AesGcm128TempContext {
+  AesGcm128StaticContext *sctx;
+  union { uint64 u[2]; uint32 d[4]; uint8 c[16]; size_t t[16/sizeof(size_t)]; } EKi,EK0,len, Yi, Xi;
+  unsigned int mres, ares;
+};
+
+void CRYPTO_gcm128_init(AesGcm128StaticContext *ctx, const uint8 *key, int key_size);
+
+void CRYPTO_gcm128_setiv(AesGcm128TempContext *ctx, AesGcm128StaticContext *sctx, const unsigned char *iv,size_t len);
+void CRYPTO_gcm128_aad(AesGcm128TempContext *ctx,const uint8 *aad,size_t len);
+void CRYPTO_gcm128_encrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len);
+void CRYPTO_gcm128_decrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len);
+void CRYPTO_gcm128_finish(AesGcm128TempContext *ctx, unsigned char *tag, size_t len);
+
+void aesgcm_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                    const uint8 *ad, const size_t ad_len,
+                    const uint64 nonce, AesGcm128StaticContext *sctx);
+
+void aesgcm_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len,
+                            const uint8 *ad, const size_t ad_len,
+                            const uint64 nonce, AesGcm128StaticContext *sctx,
+                            uint8 mac[16]);
+
+// AES-GCM is only implemented for x86-64; with this set to 0 the
+// implementation is compiled out entirely.
+#if defined(ARCH_CPU_X86_64)
+#define WITH_AESGCM 0
+#endif
+
+
+
+#endif /* __RIJNDAEL_ALG_FST_H */ diff --git a/crypto/aesgcm/aesgcm.cpp b/crypto/aesgcm/aesgcm.cpp new file mode 100644 index 0000000..12ac2cd --- /dev/null +++ b/crypto/aesgcm/aesgcm.cpp @@ -0,0 +1,882 @@
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "tunsafe_endian.h"
+#include "tunsafe_cpu.h"
+#include "crypto/aesgcm/aes.h"
+#include <string.h>
+#include <assert.h>
+#include <stdlib.h>
+//#include
+#include "crypto/chacha20poly1305.h"
+#define AESNIGCM_ASM 1
+#define AESGCM_ASM 1
+#define AESNI_GCM 1
+
+// We only implement AES stuff on X86-64
+#if WITH_AESGCM
+
+extern "C" {
+void gcm_init_clmul(aesgcm_u128 Htable[16],const uint64 Xi[2]);
+void gcm_gmult_clmul(uint64 Xi[2],const aesgcm_u128 Htable[16]);
+void gcm_ghash_clmul(uint64 Xi[2],const aesgcm_u128 Htable[16],const uint8 *inp,size_t len);
+void gcm_init_avx(aesgcm_u128 Htable[16],const uint64 Xi[2]);
+void gcm_gmult_avx(uint64 Xi[2],const aesgcm_u128 Htable[16]);
+void gcm_ghash_avx(uint64 Xi[2],const aesgcm_u128 Htable[16],const uint8 *inp,size_t len);
+void gcm_gmult_4bit(uint64 Xi[2], const aesgcm_u128 Htable[16]);
+void gcm_ghash_4bit(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len);
+
+// ivec points to Yi followed by Xi
+// h_and_htable points at h and htable from the static context
+size_t aesni_gcm_encrypt(const uint8 *in,uint8 *out,size_t len,const void *key,uint8 ivec_and_xi[16],uint64 *h_and_htable);
+size_t aesni_gcm_decrypt(const uint8 *in,uint8 *out,size_t len,const void *key,uint8 ivec_and_xi[16],uint64 *h_and_htable);
+void aesni_ctr32_encrypt_blocks(const void *in, void *out, size_t blocks, const AesContext *key, const uint8 *ivec);
+void aesni_encrypt(const void *inp, void *out, const AesContext *key);
+void aesni_decrypt(const void *inp, void *out, const AesContext *key);
+int aesni_set_encrypt_key(const unsigned char *inp, int bits, AesContext *key);
+int aesni_set_decrypt_key(const unsigned char *inp, int bits, AesContext *key);
+};
+
+
+#define GCM_MUL(ctx,Xi) (*gcm_gmult_p)(ctx->Xi.u,sctx->Htable)
+#define GHASH(ctx,in,len) (*gcm_ghash_p)(ctx->Xi.u,sctx->Htable,in,len)
+#define GHASH_CHUNK (3*1024)
+
+void CRYPTO_gcm128_aad(AesGcm128TempContext *ctx,const uint8 *aad,size_t len) {
+  size_t i;
+  unsigned int n;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  uint64 alen = ctx->len.u[0];
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+
+  assert(!ctx->len.u[1]);
+// if () return -2;
+  alen += len;
+// if (alen>(uint64(1)<<61) || (sizeof(len)==8 && alen<len)) return -2;
+  ctx->len.u[0] = alen;
+
+  n = ctx->ares;
+  if (n) {
+    while (n && len) {
+      ctx->Xi.c[n] ^= *(aad++);
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->ares = n;
+      return;
+    }
+  }
+
+#ifdef GHASH
+  if ((i = (len&(size_t)-16))) {
+    GHASH(ctx,aad,i);
+    aad += i;
+    len -= i;
+  }
+#else
+  while (len>=16) {
+    for (i=0; i<16; ++i) ctx->Xi.c[i] ^= aad[i];
+    GCM_MUL(ctx,Xi);
+    aad += 16;
+    len -= 16;
+  }
+#endif
+  if (len) {
+    n = (unsigned int)len;
+    for (i=0; i<n; ++i) ctx->Xi.c[i] ^= aad[i];
+  }
+
+  ctx->ares = n;
+}
+
+void CRYPTO_gcm128_encrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len) {
+  unsigned int n, ctr;
+  size_t i;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  uint64 mlen = ctx->len.u[1];
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+  mlen += len;
+// if (mlen>((uint64(1)<<36)-32) || (sizeof(len)==8 && mlen<len)) return -2;
+  ctx->len.u[1] = mlen;
+
+  if (ctx->ares) {
+    /* First call to encrypt finalizes GHASH(AAD) */
+    GCM_MUL(ctx,Xi);
+    ctx->ares = 0;
+  }
+  n = ctx->mres;
+  if (n) {
+    while (n && len) {
+      ctx->Xi.c[n] ^= *(out++) = *(in++)^ctx->EKi.c[n];
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->mres = n;
+      return;
+    }
+  }
+
+#if defined(AESNI_GCM)
+  if (sctx->use_aesni_gcm_crypt && len >= 0x120) {
+    // |aesni_gcm_encrypt| may not process all the input given to it. It may
+    // not process *any* of its input if it is deemed too small.
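+    // The returned |bulk| is the prefix the assembly actually consumed; the
+    // remainder (possibly the whole buffer) falls through to the
+    // aesni_ctr32_encrypt_blocks/GHASH loops below.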
+    size_t bulk = aesni_gcm_encrypt(in, out, len, &sctx->aes, ctx->Yi.c, sctx->H.u);
+    in += bulk;
+    out += bulk;
+    len -= bulk;
+  }
+#endif
+  ctr = ReadBE32(ctx->Yi.c + 12);
+
+#if defined(STRICT_ALIGNMENT)
+  if (((size_t)in | (size_t)out) % sizeof(size_t) != 0) {
+    for (i = 0; i < len; ++i) {
+      if (n == 0) {
+        aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+        ++ctr;
+        WriteBE32(ctx->Yi.c + 12, ctr);
+      }
+      ctx->Xi.c[n] ^= out[i] = in[i] ^ ctx->EKi.c[n];
+      n = (n + 1) % 16;
+      if (n == 0)
+        GCM_MUL(ctx, Xi);
+    }
+    ctx->mres = n;
+    return;
+  }
+#endif
+  while (len>=GHASH_CHUNK) {
+    aesni_ctr32_encrypt_blocks(in, out, GHASH_CHUNK / 16, &sctx->aes, ctx->Yi.c);
+    GHASH(ctx, out, GHASH_CHUNK);
+    ctr += GHASH_CHUNK / 16;
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    in += GHASH_CHUNK;
+    out += GHASH_CHUNK;
+    len -= GHASH_CHUNK;
+  }
+  if ((i = (len&(size_t)-16))) {
+    aesni_ctr32_encrypt_blocks(in, out, i / 16, &sctx->aes, ctx->Yi.c);
+    GHASH(ctx, out, i);
+    ctr += (uint32)(i / 16);
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    out += i;
+    in += i;
+    len -= i;
+  }
+  if (len) {
+    aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+    ++ctr;
+    WriteBE32(ctx->Yi.c+12,ctr);
+    while (len--) {
+      ctx->Xi.c[n] ^= out[n] = in[n] ^ ctx->EKi.c[n];
+      ++n;
+    }
+  }
+  ctx->mres = n;
+}
+
+void CRYPTO_gcm128_decrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len) {
+  unsigned int n, ctr;
+  size_t i;
+  uint64 mlen = ctx->len.u[1];
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+
+  mlen += len;
+// if (mlen>((uint64(1)<<36)-32) || (sizeof(len)==8 && mlen<len)) return -2;
+  ctx->len.u[1] = mlen;
+
+  if (ctx->ares) {
+    /* First call to decrypt finalizes GHASH(AAD) */
+    GCM_MUL(ctx,Xi);
+    ctx->ares = 0;
+  }
+
+  n = ctx->mres;
+  if (n) {
+    while (n && len) {
+      uint8 c = *(in++);
+      *(out++) = c^ctx->EKi.c[n];
+      ctx->Xi.c[n] ^= c;
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL (ctx,Xi);
+    else {
+      ctx->mres = n;
+      return;
+    }
+  }
+
+#if defined(AESNI_GCM)
+  if (sctx->use_aesni_gcm_crypt) {
+    // |aesni_gcm_decrypt| may not process all the input given to it. It may
+    // not process *any* of its input if it is deemed too small.
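+    // Unlike the encrypt path there is no |len >= 0x120| pre-check here; a
+    // too-short input presumably just makes the assembly return 0 so that
+    // the loops below process the whole buffer.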
+ size_t bulk = aesni_gcm_decrypt(in, out, len, &sctx->aes, ctx->Yi.c, sctx->H.u); + in += bulk; + out += bulk; + len -= bulk; + } +#endif + ctr = ReadBE32(ctx->Yi.c + 12); + +#if defined(STRICT_ALIGNMENT) + if (((size_t)in|(size_t)out)%sizeof(size_t) != 0) { + for (i=0;iYi.c, ctx->EKi.c, key); + ++ctr; + WriteBE32(ctx->Yi.c+12,ctr); + } + c = in[i]; + out[i] = c^ctx->EKi.c[n]; + ctx->Xi.c[n] ^= c; + n = (n+1)%16; + if (n==0) + GCM_MUL(ctx,Xi); + } + ctx->mres = n; + return; + } +#endif + while (len >= GHASH_CHUNK) { + GHASH(ctx, in, GHASH_CHUNK); + aesni_ctr32_encrypt_blocks(in, out, GHASH_CHUNK / 16, &sctx->aes, ctx->Yi.c); + ctr += GHASH_CHUNK / 16; + WriteBE32(ctx->Yi.c + 12, ctr); + in += GHASH_CHUNK; + out += GHASH_CHUNK; + len -= GHASH_CHUNK; + } + if ((i = (len&(size_t)-16))) { + GHASH(ctx, in, i); + aesni_ctr32_encrypt_blocks(in, out, i / 16, &sctx->aes, ctx->Yi.c); + ctr += (uint32)(i / 16); + WriteBE32(ctx->Yi.c + 12, ctr); + out += i; + in += i; + len -= i; + } + if (len) { + aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes); + ++ctr; + WriteBE32(ctx->Yi.c+12,ctr); + while (len--) { + uint8 c = in[n]; + ctx->Xi.c[n] ^= c; + out[n] = c^ctx->EKi.c[n]; + ++n; + } + } + ctx->mres = n; +} + +void CRYPTO_gcm128_finish(AesGcm128TempContext *ctx,uint8 *tag, size_t len) { + uint64 alen = ctx->len.u[0]<<3; + uint64 clen = ctx->len.u[1]<<3; + AesGcm128StaticContext *sctx = ctx->sctx; + void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult; + + if (ctx->mres || ctx->ares) + GCM_MUL(ctx,Xi); + + alen = ToBE64(alen); + clen = ToBE64(clen); + + ctx->Xi.u[0] ^= alen; + ctx->Xi.u[1] ^= clen; + GCM_MUL(ctx,Xi); + + ctx->Xi.u[0] ^= ctx->EK0.u[0]; + ctx->Xi.u[1] ^= ctx->EK0.u[1]; + + memcpy(tag, ctx->Xi.c,len); +} + +#define REDUCE1BIT(V) do { \ + if (sizeof(size_t)==8) { \ + uint64 T = 0xe100000000000000ull & (0-(V.lo&1)); \ + V.lo = (V.hi<<63)|(V.lo>>1); \ + V.hi = (V.hi>>1 )^T; \ + } else { \ + uint32 T = 0xe1000000U & (0-(uint32)(V.lo&1)); \ + V.lo = (V.hi<<63)|(V.lo>>1); \ + V.hi = (V.hi>>1 )^((uint64)T<<32); \ + } \ +} while(0) + +static void gcm_init_4bit(aesgcm_u128 Htable[16], uint64 H[2]) { + aesgcm_u128 V; + + Htable[0].hi = 0; + Htable[0].lo = 0; + V.hi = H[0]; + V.lo = H[1]; + + Htable[8] = V; + REDUCE1BIT(V); + Htable[4] = V; + REDUCE1BIT(V); + Htable[2] = V; + REDUCE1BIT(V); + Htable[1] = V; + Htable[3].hi = V.hi^Htable[2].hi, Htable[3].lo = V.lo^Htable[2].lo; + V=Htable[4]; + Htable[5].hi = V.hi^Htable[1].hi, Htable[5].lo = V.lo^Htable[1].lo; + Htable[6].hi = V.hi^Htable[2].hi, Htable[6].lo = V.lo^Htable[2].lo; + Htable[7].hi = V.hi^Htable[3].hi, Htable[7].lo = V.lo^Htable[3].lo; + V=Htable[8]; + Htable[9].hi = V.hi^Htable[1].hi, Htable[9].lo = V.lo^Htable[1].lo; + Htable[10].hi = V.hi^Htable[2].hi, Htable[10].lo = V.lo^Htable[2].lo; + Htable[11].hi = V.hi^Htable[3].hi, Htable[11].lo = V.lo^Htable[3].lo; + Htable[12].hi = V.hi^Htable[4].hi, Htable[12].lo = V.lo^Htable[4].lo; + Htable[13].hi = V.hi^Htable[5].hi, Htable[13].lo = V.lo^Htable[5].lo; + Htable[14].hi = V.hi^Htable[6].hi, Htable[14].lo = V.lo^Htable[6].lo; + Htable[15].hi = V.hi^Htable[7].hi, Htable[15].lo = V.lo^Htable[7].lo; +} + + +#if !AESGCM_ASM +#define PACK(s) ((size_t)(s)<<(sizeof(size_t)*8-16)) +static const size_t rem_4bit[16] = { + PACK(0x0000), PACK(0x1C20), PACK(0x3840), PACK(0x2460), + PACK(0x7080), PACK(0x6CA0), PACK(0x48C0), PACK(0x54E0), + PACK(0xE100), PACK(0xFD20), PACK(0xD940), PACK(0xC560), + PACK(0x9180), PACK(0x8DA0), PACK(0xA9C0), PACK(0xB5E0)}; + +void gcm_gmult_4bit(uint64 
Xi[2], const aesgcm_u128 Htable[16]) {
+  aesgcm_u128 Z;
+  int cnt = 15;
+  size_t rem, nlo, nhi;
+  const union { long one; char little; } is_endian = {1};
+
+  nlo = ((const uint8 *)Xi)[15];
+  nhi = nlo>>4;
+  nlo &= 0xf;
+
+  Z.hi = Htable[nlo].hi;
+  Z.lo = Htable[nlo].lo;
+
+  while (1) {
+    rem = (size_t)Z.lo&0xf;
+    Z.lo = (Z.hi<<60)|(Z.lo>>4);
+    Z.hi = (Z.hi>>4);
+    if (sizeof(size_t)==8)
+      Z.hi ^= rem_4bit[rem];
+    else
+      Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+    Z.hi ^= Htable[nhi].hi;
+    Z.lo ^= Htable[nhi].lo;
+
+    if (--cnt<0) break;
+
+    nlo = ((const uint8 *)Xi)[cnt];
+    nhi = nlo>>4;
+    nlo &= 0xf;
+
+    rem = (size_t)Z.lo&0xf;
+    Z.lo = (Z.hi<<60)|(Z.lo>>4);
+    Z.hi = (Z.hi>>4);
+    if (sizeof(size_t)==8)
+      Z.hi ^= rem_4bit[rem];
+    else
+      Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+    Z.hi ^= Htable[nlo].hi;
+    Z.lo ^= Htable[nlo].lo;
+  }
+  Xi[0] = ToBE64(Z.hi);
+  Xi[1] = ToBE64(Z.lo);
+}
+
+void gcm_ghash_4bit(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) {
+  aesgcm_u128 Z;
+  int cnt;
+  size_t rem, nlo, nhi;
+
+  do {
+    cnt = 15;
+    nlo = ((const uint8 *)Xi)[15];
+    nlo ^= inp[15];
+    nhi = nlo>>4;
+    nlo &= 0xf;
+
+    Z.hi = Htable[nlo].hi;
+    Z.lo = Htable[nlo].lo;
+
+    while (1) {
+      rem = (size_t)Z.lo&0xf;
+      Z.lo = (Z.hi<<60)|(Z.lo>>4);
+      Z.hi = (Z.hi>>4);
+      if (sizeof(size_t)==8)
+        Z.hi ^= rem_4bit[rem];
+      else
+        Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+      Z.hi ^= Htable[nhi].hi;
+      Z.lo ^= Htable[nhi].lo;
+
+      if (--cnt<0) break;
+
+      nlo = ((const uint8 *)Xi)[cnt];
+      nlo ^= inp[cnt];
+      nhi = nlo>>4;
+      nlo &= 0xf;
+
+      rem = (size_t)Z.lo&0xf;
+      Z.lo = (Z.hi<<60)|(Z.lo>>4);
+      Z.hi = (Z.hi>>4);
+      if (sizeof(size_t)==8)
+        Z.hi ^= rem_4bit[rem];
+      else
+        Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+      Z.hi ^= Htable[nlo].hi;
+      Z.lo ^= Htable[nlo].lo;
+    }
+    Xi[0] = ToBE64(Z.hi);
+    Xi[1] = ToBE64(Z.lo);
+
+  } while (inp+=16, len-=16);
+}
+#endif
+
+void CRYPTO_gcm128_init(AesGcm128StaticContext *ctx, const uint8 *key, int key_size) {
+  memset(ctx,0,sizeof(*ctx));
+  ctx->use_aesni_gcm_crypt = X86_PCAP_MOVBE;
+  aesni_set_encrypt_key(key, key_size, &ctx->aes);
+  aesni_encrypt(ctx->H.c,ctx->H.c, &ctx->aes);
+  ctx->H.u[0] = ToBE64(ctx->H.u[0]);
+  ctx->H.u[1] = ToBE64(ctx->H.u[1]);
+  if (X86_PCAP_AVX) {
+    gcm_init_avx(ctx->Htable,ctx->H.u);
+    ctx->gmult = gcm_gmult_avx;
+    ctx->ghash = gcm_ghash_avx;
+  } else if (X86_PCAP_PCLMULQDQ) {
+    gcm_init_clmul(ctx->Htable,ctx->H.u);
+    ctx->gmult = gcm_gmult_clmul;
+    ctx->ghash = gcm_ghash_clmul;
+  } else {
+    gcm_init_4bit(ctx->Htable, ctx->H.u);
+    ctx->gmult = gcm_gmult_4bit;
+    ctx->ghash = gcm_ghash_4bit;
+  }
+}
+
+void CRYPTO_gcm128_setiv(AesGcm128TempContext *ctx, AesGcm128StaticContext *sctx, const unsigned char *iv, size_t len) {
+  unsigned int ctr;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+
+  ctx->sctx = sctx;
+  ctx->Yi.u[0] = 0;
+  ctx->Yi.u[1] = 0;
+  ctx->Xi.u[0] = 0;
+  ctx->Xi.u[1] = 0;
+  ctx->len.u[0] = 0; /* AAD length */
+  ctx->len.u[1] = 0; /* message length */
+  ctx->ares = 0;
+  ctx->mres = 0;
+
+  if (len==12) {
+    memcpy(ctx->Yi.c,iv,12);
+    ctx->Yi.c[15]=1;
+    ctr=1;
+  } else {
+    size_t i;
+    uint64 len0 = len;
+
+    while (len>=16) {
+      for (i=0; i<16; ++i) ctx->Yi.c[i] ^= iv[i];
+      GCM_MUL(ctx,Yi);
+      iv += 16;
+      len -= 16;
+    }
+    if (len) {
+      for (i=0; i<len; ++i) ctx->Yi.c[i] ^= iv[i];
+      GCM_MUL(ctx,Yi);
+    }
+    len0 <<= 3;
+    ctx->Yi.u[1] ^= ToBE64(len0);
+
+    GCM_MUL(ctx,Yi);
+
+    ctr = ToBE32(ctx->Yi.d[3]);
+  }
+
+  aesni_encrypt(ctx->Yi.c, ctx->EK0.c, &sctx->aes);
+  ++ctr;
+  ctx->Yi.d[3] = ToBE32(ctr);
+}
+
+union AesGcmIV {
+  uint32 nonce[3];
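+  // Viewed as bytes this is the 12-byte GCM IV handed to
+  // CRYPTO_gcm128_setiv: the 64-bit little-endian packet counter
+  // followed by four zero bytes (see aesgcm_encrypt below).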
+ uint8 nonceb[12]; +}; + +void aesgcm_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, AesGcm128StaticContext *sctx) { + AesGcm128TempContext ctx; + AesGcmIV iv; + + WriteLE64(iv.nonce, nonce); + iv.nonce[2] = 0; + + CRYPTO_gcm128_setiv(&ctx, sctx, iv.nonceb, sizeof(iv)); + CRYPTO_gcm128_aad(&ctx, ad, ad_len); + CRYPTO_gcm128_encrypt_ctr32(&ctx, src, dst, src_len); + CRYPTO_gcm128_finish(&ctx, dst + src_len, 16); +} + +void aesgcm_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, AesGcm128StaticContext *sctx, + uint8 mac[16]) { + AesGcm128TempContext ctx; + AesGcmIV iv; + + WriteLE64(iv.nonce, nonce); + iv.nonce[2] = 0; + + CRYPTO_gcm128_setiv(&ctx, sctx, iv.nonceb, sizeof(iv)); + CRYPTO_gcm128_aad(&ctx, ad, ad_len); + CRYPTO_gcm128_decrypt_ctr32(&ctx, src, dst, src_len); + CRYPTO_gcm128_finish(&ctx, mac, 16); +} + +#if 1 + +/* +* GCM test vectors from: +* +* http://csrc.nist.gov/groups/STM/cavp/documents/mac/gcmtestvectors.zip +*/ +#define MAX_TESTS 6 + +static int key_index[MAX_TESTS] = +{ 0, 0, 1, 1, 1, 1 }; + +static uint8 key[MAX_TESTS][32] = +{ + { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }, + { 0xfe, 0xff, 0xe9, 0x92, 0x86, 0x65, 0x73, 0x1c, + 0x6d, 0x6a, 0x8f, 0x94, 0x67, 0x30, 0x83, 0x08, + 0xfe, 0xff, 0xe9, 0x92, 0x86, 0x65, 0x73, 0x1c, + 0x6d, 0x6a, 0x8f, 0x94, 0x67, 0x30, 0x83, 0x08 }, +}; + +static size_t iv_len[MAX_TESTS] = +{ 12, 12, 12, 12, 8, 60 }; + +static int iv_index[MAX_TESTS] = +{ 0, 0, 1, 1, 1, 2 }; + +static uint8 iv[MAX_TESTS][64] = +{ + { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00 }, + { 0xca, 0xfe, 0xba, 0xbe, 0xfa, 0xce, 0xdb, 0xad, + 0xde, 0xca, 0xf8, 0x88 }, + { 0x93, 0x13, 0x22, 0x5d, 0xf8, 0x84, 0x06, 0xe5, + 0x55, 0x90, 0x9c, 0x5a, 0xff, 0x52, 0x69, 0xaa, + 0x6a, 0x7a, 0x95, 0x38, 0x53, 0x4f, 0x7d, 0xa1, + 0xe4, 0xc3, 0x03, 0xd2, 0xa3, 0x18, 0xa7, 0x28, + 0xc3, 0xc0, 0xc9, 0x51, 0x56, 0x80, 0x95, 0x39, + 0xfc, 0xf0, 0xe2, 0x42, 0x9a, 0x6b, 0x52, 0x54, + 0x16, 0xae, 0xdb, 0xf5, 0xa0, 0xde, 0x6a, 0x57, + 0xa6, 0x37, 0xb3, 0x9b }, +}; + +static size_t add_len[MAX_TESTS] = +{ 0, 0, 0, 20, 20, 20 }; + +int add_index[MAX_TESTS] = +{ 0, 0, 0, 1, 1, 1 }; + +static uint8 additional[MAX_TESTS][64] = +{ + { 0x00 }, + { 0xfe, 0xed, 0xfa, 0xce, 0xde, 0xad, 0xbe, 0xef, + 0xfe, 0xed, 0xfa, 0xce, 0xde, 0xad, 0xbe, 0xef, + 0xab, 0xad, 0xda, 0xd2 }, +}; + +static size_t pt_len[MAX_TESTS] = +{ 0, 16, 64, 60, 60, 60 }; + +static int pt_index[MAX_TESTS] = +{ 0, 0, 1, 1, 1, 1 }; + +static uint8 pt[MAX_TESTS][64] = +{ + { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }, + { 0xd9, 0x31, 0x32, 0x25, 0xf8, 0x84, 0x06, 0xe5, + 0xa5, 0x59, 0x09, 0xc5, 0xaf, 0xf5, 0x26, 0x9a, + 0x86, 0xa7, 0xa9, 0x53, 0x15, 0x34, 0xf7, 0xda, + 0x2e, 0x4c, 0x30, 0x3d, 0x8a, 0x31, 0x8a, 0x72, + 0x1c, 0x3c, 0x0c, 0x95, 0x95, 0x68, 0x09, 0x53, + 0x2f, 0xcf, 0x0e, 0x24, 0x49, 0xa6, 0xb5, 0x25, + 0xb1, 0x6a, 0xed, 0xf5, 0xaa, 0x0d, 0xe6, 0x57, + 0xba, 0x63, 0x7b, 0x39, 0x1a, 0xaf, 0xd2, 0x55 }, +}; + +static uint8 ct[MAX_TESTS * 3][64] = +{ + { 0x00 }, + { 0x03, 0x88, 0xda, 0xce, 0x60, 0xb6, 0xa3, 0x92, + 0xf3, 0x28, 0xc2, 0xb9, 0x71, 0xb2, 0xfe, 0x78 }, + { 0x42, 0x83, 0x1e, 0xc2, 0x21, 0x77, 0x74, 0x24, + 0x4b, 0x72, 
0x21, 0xb7, 0x84, 0xd0, 0xd4, 0x9c, + 0xe3, 0xaa, 0x21, 0x2f, 0x2c, 0x02, 0xa4, 0xe0, + 0x35, 0xc1, 0x7e, 0x23, 0x29, 0xac, 0xa1, 0x2e, + 0x21, 0xd5, 0x14, 0xb2, 0x54, 0x66, 0x93, 0x1c, + 0x7d, 0x8f, 0x6a, 0x5a, 0xac, 0x84, 0xaa, 0x05, + 0x1b, 0xa3, 0x0b, 0x39, 0x6a, 0x0a, 0xac, 0x97, + 0x3d, 0x58, 0xe0, 0x91, 0x47, 0x3f, 0x59, 0x85 }, + { 0x42, 0x83, 0x1e, 0xc2, 0x21, 0x77, 0x74, 0x24, + 0x4b, 0x72, 0x21, 0xb7, 0x84, 0xd0, 0xd4, 0x9c, + 0xe3, 0xaa, 0x21, 0x2f, 0x2c, 0x02, 0xa4, 0xe0, + 0x35, 0xc1, 0x7e, 0x23, 0x29, 0xac, 0xa1, 0x2e, + 0x21, 0xd5, 0x14, 0xb2, 0x54, 0x66, 0x93, 0x1c, + 0x7d, 0x8f, 0x6a, 0x5a, 0xac, 0x84, 0xaa, 0x05, + 0x1b, 0xa3, 0x0b, 0x39, 0x6a, 0x0a, 0xac, 0x97, + 0x3d, 0x58, 0xe0, 0x91 }, + { 0x61, 0x35, 0x3b, 0x4c, 0x28, 0x06, 0x93, 0x4a, + 0x77, 0x7f, 0xf5, 0x1f, 0xa2, 0x2a, 0x47, 0x55, + 0x69, 0x9b, 0x2a, 0x71, 0x4f, 0xcd, 0xc6, 0xf8, + 0x37, 0x66, 0xe5, 0xf9, 0x7b, 0x6c, 0x74, 0x23, + 0x73, 0x80, 0x69, 0x00, 0xe4, 0x9f, 0x24, 0xb2, + 0x2b, 0x09, 0x75, 0x44, 0xd4, 0x89, 0x6b, 0x42, + 0x49, 0x89, 0xb5, 0xe1, 0xeb, 0xac, 0x0f, 0x07, + 0xc2, 0x3f, 0x45, 0x98 }, + { 0x8c, 0xe2, 0x49, 0x98, 0x62, 0x56, 0x15, 0xb6, + 0x03, 0xa0, 0x33, 0xac, 0xa1, 0x3f, 0xb8, 0x94, + 0xbe, 0x91, 0x12, 0xa5, 0xc3, 0xa2, 0x11, 0xa8, + 0xba, 0x26, 0x2a, 0x3c, 0xca, 0x7e, 0x2c, 0xa7, + 0x01, 0xe4, 0xa9, 0xa4, 0xfb, 0xa4, 0x3c, 0x90, + 0xcc, 0xdc, 0xb2, 0x81, 0xd4, 0x8c, 0x7c, 0x6f, + 0xd6, 0x28, 0x75, 0xd2, 0xac, 0xa4, 0x17, 0x03, + 0x4c, 0x34, 0xae, 0xe5 }, + { 0x00 }, + { 0x98, 0xe7, 0x24, 0x7c, 0x07, 0xf0, 0xfe, 0x41, + 0x1c, 0x26, 0x7e, 0x43, 0x84, 0xb0, 0xf6, 0x00 }, + { 0x39, 0x80, 0xca, 0x0b, 0x3c, 0x00, 0xe8, 0x41, + 0xeb, 0x06, 0xfa, 0xc4, 0x87, 0x2a, 0x27, 0x57, + 0x85, 0x9e, 0x1c, 0xea, 0xa6, 0xef, 0xd9, 0x84, + 0x62, 0x85, 0x93, 0xb4, 0x0c, 0xa1, 0xe1, 0x9c, + 0x7d, 0x77, 0x3d, 0x00, 0xc1, 0x44, 0xc5, 0x25, + 0xac, 0x61, 0x9d, 0x18, 0xc8, 0x4a, 0x3f, 0x47, + 0x18, 0xe2, 0x44, 0x8b, 0x2f, 0xe3, 0x24, 0xd9, + 0xcc, 0xda, 0x27, 0x10, 0xac, 0xad, 0xe2, 0x56 }, + { 0x39, 0x80, 0xca, 0x0b, 0x3c, 0x00, 0xe8, 0x41, + 0xeb, 0x06, 0xfa, 0xc4, 0x87, 0x2a, 0x27, 0x57, + 0x85, 0x9e, 0x1c, 0xea, 0xa6, 0xef, 0xd9, 0x84, + 0x62, 0x85, 0x93, 0xb4, 0x0c, 0xa1, 0xe1, 0x9c, + 0x7d, 0x77, 0x3d, 0x00, 0xc1, 0x44, 0xc5, 0x25, + 0xac, 0x61, 0x9d, 0x18, 0xc8, 0x4a, 0x3f, 0x47, + 0x18, 0xe2, 0x44, 0x8b, 0x2f, 0xe3, 0x24, 0xd9, + 0xcc, 0xda, 0x27, 0x10 }, + { 0x0f, 0x10, 0xf5, 0x99, 0xae, 0x14, 0xa1, 0x54, + 0xed, 0x24, 0xb3, 0x6e, 0x25, 0x32, 0x4d, 0xb8, + 0xc5, 0x66, 0x63, 0x2e, 0xf2, 0xbb, 0xb3, 0x4f, + 0x83, 0x47, 0x28, 0x0f, 0xc4, 0x50, 0x70, 0x57, + 0xfd, 0xdc, 0x29, 0xdf, 0x9a, 0x47, 0x1f, 0x75, + 0xc6, 0x65, 0x41, 0xd4, 0xd4, 0xda, 0xd1, 0xc9, + 0xe9, 0x3a, 0x19, 0xa5, 0x8e, 0x8b, 0x47, 0x3f, + 0xa0, 0xf0, 0x62, 0xf7 }, + { 0xd2, 0x7e, 0x88, 0x68, 0x1c, 0xe3, 0x24, 0x3c, + 0x48, 0x30, 0x16, 0x5a, 0x8f, 0xdc, 0xf9, 0xff, + 0x1d, 0xe9, 0xa1, 0xd8, 0xe6, 0xb4, 0x47, 0xef, + 0x6e, 0xf7, 0xb7, 0x98, 0x28, 0x66, 0x6e, 0x45, + 0x81, 0xe7, 0x90, 0x12, 0xaf, 0x34, 0xdd, 0xd9, + 0xe2, 0xf0, 0x37, 0x58, 0x9b, 0x29, 0x2d, 0xb3, + 0xe6, 0x7c, 0x03, 0x67, 0x45, 0xfa, 0x22, 0xe7, + 0xe9, 0xb7, 0x37, 0x3b }, + { 0x00 }, + { 0xce, 0xa7, 0x40, 0x3d, 0x4d, 0x60, 0x6b, 0x6e, + 0x07, 0x4e, 0xc5, 0xd3, 0xba, 0xf3, 0x9d, 0x18 }, + { 0x52, 0x2d, 0xc1, 0xf0, 0x99, 0x56, 0x7d, 0x07, + 0xf4, 0x7f, 0x37, 0xa3, 0x2a, 0x84, 0x42, 0x7d, + 0x64, 0x3a, 0x8c, 0xdc, 0xbf, 0xe5, 0xc0, 0xc9, + 0x75, 0x98, 0xa2, 0xbd, 0x25, 0x55, 0xd1, 0xaa, + 0x8c, 0xb0, 0x8e, 0x48, 0x59, 0x0d, 0xbb, 0x3d, + 0xa7, 0xb0, 0x8b, 0x10, 0x56, 0x82, 0x88, 
0x38, + 0xc5, 0xf6, 0x1e, 0x63, 0x93, 0xba, 0x7a, 0x0a, + 0xbc, 0xc9, 0xf6, 0x62, 0x89, 0x80, 0x15, 0xad }, + { 0x52, 0x2d, 0xc1, 0xf0, 0x99, 0x56, 0x7d, 0x07, + 0xf4, 0x7f, 0x37, 0xa3, 0x2a, 0x84, 0x42, 0x7d, + 0x64, 0x3a, 0x8c, 0xdc, 0xbf, 0xe5, 0xc0, 0xc9, + 0x75, 0x98, 0xa2, 0xbd, 0x25, 0x55, 0xd1, 0xaa, + 0x8c, 0xb0, 0x8e, 0x48, 0x59, 0x0d, 0xbb, 0x3d, + 0xa7, 0xb0, 0x8b, 0x10, 0x56, 0x82, 0x88, 0x38, + 0xc5, 0xf6, 0x1e, 0x63, 0x93, 0xba, 0x7a, 0x0a, + 0xbc, 0xc9, 0xf6, 0x62 }, + { 0xc3, 0x76, 0x2d, 0xf1, 0xca, 0x78, 0x7d, 0x32, + 0xae, 0x47, 0xc1, 0x3b, 0xf1, 0x98, 0x44, 0xcb, + 0xaf, 0x1a, 0xe1, 0x4d, 0x0b, 0x97, 0x6a, 0xfa, + 0xc5, 0x2f, 0xf7, 0xd7, 0x9b, 0xba, 0x9d, 0xe0, + 0xfe, 0xb5, 0x82, 0xd3, 0x39, 0x34, 0xa4, 0xf0, + 0x95, 0x4c, 0xc2, 0x36, 0x3b, 0xc7, 0x3f, 0x78, + 0x62, 0xac, 0x43, 0x0e, 0x64, 0xab, 0xe4, 0x99, + 0xf4, 0x7c, 0x9b, 0x1f }, + { 0x5a, 0x8d, 0xef, 0x2f, 0x0c, 0x9e, 0x53, 0xf1, + 0xf7, 0x5d, 0x78, 0x53, 0x65, 0x9e, 0x2a, 0x20, + 0xee, 0xb2, 0xb2, 0x2a, 0xaf, 0xde, 0x64, 0x19, + 0xa0, 0x58, 0xab, 0x4f, 0x6f, 0x74, 0x6b, 0xf4, + 0x0f, 0xc0, 0xc3, 0xb7, 0x80, 0xf2, 0x44, 0x45, + 0x2d, 0xa3, 0xeb, 0xf1, 0xc5, 0xd8, 0x2c, 0xde, + 0xa2, 0x41, 0x89, 0x97, 0x20, 0x0e, 0xf8, 0x2e, + 0x44, 0xae, 0x7e, 0x3f }, +}; + +static uint8 tag[MAX_TESTS * 3][16] = +{ + { 0x58, 0xe2, 0xfc, 0xce, 0xfa, 0x7e, 0x30, 0x61, + 0x36, 0x7f, 0x1d, 0x57, 0xa4, 0xe7, 0x45, 0x5a }, + { 0xab, 0x6e, 0x47, 0xd4, 0x2c, 0xec, 0x13, 0xbd, + 0xf5, 0x3a, 0x67, 0xb2, 0x12, 0x57, 0xbd, 0xdf }, + { 0x4d, 0x5c, 0x2a, 0xf3, 0x27, 0xcd, 0x64, 0xa6, + 0x2c, 0xf3, 0x5a, 0xbd, 0x2b, 0xa6, 0xfa, 0xb4 }, + { 0x5b, 0xc9, 0x4f, 0xbc, 0x32, 0x21, 0xa5, 0xdb, + 0x94, 0xfa, 0xe9, 0x5a, 0xe7, 0x12, 0x1a, 0x47 }, + { 0x36, 0x12, 0xd2, 0xe7, 0x9e, 0x3b, 0x07, 0x85, + 0x56, 0x1b, 0xe1, 0x4a, 0xac, 0xa2, 0xfc, 0xcb }, + { 0x61, 0x9c, 0xc5, 0xae, 0xff, 0xfe, 0x0b, 0xfa, + 0x46, 0x2a, 0xf4, 0x3c, 0x16, 0x99, 0xd0, 0x50 }, + { 0xcd, 0x33, 0xb2, 0x8a, 0xc7, 0x73, 0xf7, 0x4b, + 0xa0, 0x0e, 0xd1, 0xf3, 0x12, 0x57, 0x24, 0x35 }, + { 0x2f, 0xf5, 0x8d, 0x80, 0x03, 0x39, 0x27, 0xab, + 0x8e, 0xf4, 0xd4, 0x58, 0x75, 0x14, 0xf0, 0xfb }, + { 0x99, 0x24, 0xa7, 0xc8, 0x58, 0x73, 0x36, 0xbf, + 0xb1, 0x18, 0x02, 0x4d, 0xb8, 0x67, 0x4a, 0x14 }, + { 0x25, 0x19, 0x49, 0x8e, 0x80, 0xf1, 0x47, 0x8f, + 0x37, 0xba, 0x55, 0xbd, 0x6d, 0x27, 0x61, 0x8c }, + { 0x65, 0xdc, 0xc5, 0x7f, 0xcf, 0x62, 0x3a, 0x24, + 0x09, 0x4f, 0xcc, 0xa4, 0x0d, 0x35, 0x33, 0xf8 }, + { 0xdc, 0xf5, 0x66, 0xff, 0x29, 0x1c, 0x25, 0xbb, + 0xb8, 0x56, 0x8f, 0xc3, 0xd3, 0x76, 0xa6, 0xd9 }, + { 0x53, 0x0f, 0x8a, 0xfb, 0xc7, 0x45, 0x36, 0xb9, + 0xa9, 0x63, 0xb4, 0xf1, 0xc4, 0xcb, 0x73, 0x8b }, + { 0xd0, 0xd1, 0xc8, 0xa7, 0x99, 0x99, 0x6b, 0xf0, + 0x26, 0x5b, 0x98, 0xb5, 0xd4, 0x8a, 0xb9, 0x19 }, + { 0xb0, 0x94, 0xda, 0xc5, 0xd9, 0x34, 0x71, 0xbd, + 0xec, 0x1a, 0x50, 0x22, 0x70, 0xe3, 0xcc, 0x6c }, + { 0x76, 0xfc, 0x6e, 0xce, 0x0f, 0x4e, 0x17, 0x68, + 0xcd, 0xdf, 0x88, 0x53, 0xbb, 0x2d, 0x55, 0x1b }, + { 0x3a, 0x33, 0x7d, 0xbf, 0x46, 0xa7, 0x92, 0xc4, + 0x5e, 0x45, 0x49, 0x13, 0xfe, 0x2e, 0xa8, 0xf2 }, + { 0xa4, 0x4a, 0x82, 0x66, 0xee, 0x1c, 0x8e, 0xb0, + 0xc8, 0xb5, 0xd4, 0xcf, 0x5a, 0xe9, 0xf1, 0x9a }, +}; + +int gcm_self_test() +{ + uint8 buf[64]; + uint8 tag_buf[16]; + int i, j; + + AesGcm128TempContext ctx; + AesGcm128StaticContext sctx; + + + { + AesContext aes; + uint8 key[16] = {43,126,21,22,40,174,210,166,171,247,21,136,9,207,79,60}; + uint8 in[16] = {107,193,190,226,46,64,159,150,233,61,126,17,115,147,23,42}; + uint8 out[16] = 
{58,215,123,180,13,122,54,96,168,158,202,243,36,102,239,151}, t[16]; + aesni_set_encrypt_key(key, 128, &aes); + aesni_encrypt(in, t, &aes); + if (memcmp(t, out,16)) { printf("AES test fail!\n"); return 1; } + aesni_set_decrypt_key(key, 128, &aes); + aesni_decrypt(out, t, &aes); + if (memcmp(t, in,16)) { printf("AES test fail!\n"); return 1; } + } + + uint8 correct[] = { 62,85,184,249,224,220,4,77,201,216,202,172,121,7,25,200, }; + if (0) { + uint8 buf[512 + 16]; + for (size_t i = 0; i < 512; i++) + buf[i] = (uint8)(i >> 4);// 0x11; + uint8 buf2[512 + 16]; + for (size_t i = 0; i < 512; i++) + buf2[i] = buf[i]; + + size_t pp = 0x60; + + CRYPTO_gcm128_init(&sctx, key[0], 128); + + sctx.use_aesni_gcm_crypt = 1; + + aesgcm_decrypt_get_mac(buf, buf, pp, NULL, 0, 1, &sctx, buf + pp); + sctx.use_aesni_gcm_crypt = 0; + aesgcm_decrypt_get_mac(buf2, buf2, pp, NULL, 0, 1, &sctx, buf2 + pp); + //aesgcm_encrypt(buf, buf, 0x120 + 32, NULL, 0, 1, &sctx); + + for (size_t i = 0; i < 16; i++) + printf("%d,", buf[pp + i]); + printf("\n"); + for (size_t i = 0; i < 16; i++) + printf("%d,", buf2[pp + i]); + printf("\n"); + + if (memcmp(buf2 + pp, buf + pp, 16) == 0) + printf("CORRECT!!\n"); + else + printf("******** FAIL ************\n"); +// for(size_t i = 0; i < 16; i++) +// printf("%d,", buf[pp +i]); + printf("\n"); + } + return 0; + + for( j = 0; j < 3; j++ ) { + int key_len = 128 + 64 * j; + for( i = 0; i < MAX_TESTS; i++ ) { + CRYPTO_gcm128_init(&sctx, key[key_index[i]], key_len); + CRYPTO_gcm128_setiv(&ctx, &sctx, iv[iv_index[i]], iv_len[i]); + CRYPTO_gcm128_aad(&ctx, additional[add_index[i]], add_len[i]); + CRYPTO_gcm128_encrypt_ctr32(&ctx, pt[pt_index[i]], buf, pt_len[i]); + CRYPTO_gcm128_finish(&ctx, tag_buf, 16); + if(memcmp( buf, ct[j * 6 + i], pt_len[i] ) != 0 || + memcmp( tag_buf, tag[j * 6 + i], 16 ) != 0 ) { + printf( "AES-GCM-%3d #%d (%s): failed\n", key_len, i, "enc" ); + return( 1 ); + } + + CRYPTO_gcm128_init(&sctx, key[key_index[i]], key_len); + CRYPTO_gcm128_setiv(&ctx, &sctx, iv[iv_index[i]], iv_len[i]); + CRYPTO_gcm128_aad(&ctx, additional[add_index[i]], add_len[i]); + CRYPTO_gcm128_decrypt_ctr32(&ctx, ct[j * 6 + i], buf, pt_len[i]); + CRYPTO_gcm128_finish(&ctx, tag_buf, 16); + if(memcmp( buf, pt[pt_index[i]], pt_len[i] ) != 0 || + memcmp( tag_buf, tag[j * 6 + i], 16 ) != 0 ) { + printf( "AES-GCM-%3d #%d (%s): failed\n", key_len, i, "dec" ); + return( 1 ); + } + } + } + + return( 0 ); +} + +//int main() { +// gcm_self_test(); +//} +#endif + +#endif // #if WITH_AESGCM diff --git a/crypto/aesgcm/aesni-gcm-x86_64.pl b/crypto/aesgcm/aesni-gcm-x86_64.pl new file mode 100644 index 0000000..f1607c7 --- /dev/null +++ b/crypto/aesgcm/aesni-gcm-x86_64.pl @@ -0,0 +1,1146 @@ +#! /usr/bin/env perl +# Copyright 2013-2016 The OpenSSL Project Authors. All Rights Reserved. + +# Ludde note : This is the stitched AES+GCM code. Min size = 0x60 +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# +# AES-NI-CTR+GHASH stitch. 
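+# +# Note on the "Min size = 0x60" remark above: the stitched loop below +# consumes six counter blocks (96 = 0x60 bytes) per iteration, so +# aesni_gcm_decrypt rejects inputs shorter than 0x60 bytes and +# aesni_gcm_encrypt inputs shorter than 3*0x60 bytes (both return 0 +# bytes processed), leaving short buffers to the caller's generic +# CTR+GHASH path.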
+# +# February 2013 +# +# The OpenSSL GCM implementation is organized in such a way that its +# performance is rather close to the sum of its streamed components, +# in this context parallelized AES-NI CTR and modulo-scheduled +# PCLMULQDQ-enabled GHASH. Unfortunately, as no stitch implementation +# was observed to perform significantly better than the sum of the +# components on contemporary CPUs, the effort was deemed impossible to +# justify. This module is based on a combination of Intel submissions, +# [1] and [2], with a MOVBE twist suggested by Ilya Albrekht and Max +# Locktyukhin of Intel Corp., who verified that it reduces shuffle +# pressure with notable relative improvement, achieving 1.0 cycle per +# byte processed with a 128-bit key on a Haswell processor, 0.74 on +# Broadwell and 0.63 on Skylake... [Mentioned results are raw profiled +# measurements for a favourable packet size, one divisible by 96. +# Applications using the EVP interface will observe a few percent +# worse performance.] +# +# Knights Landing processes 1 byte in 1.25 cycles (measured with EVP). +# +# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest +# [2] http://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf + +$flavour = shift; +$output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/); + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../x86_64-xlate.pl" and -f $xlate) or +die "can't locate x86_64-xlate.pl"; + +# |$avx| in ghash-x86_64.pl must be set to at least 1; otherwise tags will +# be computed incorrectly. +# +# In upstream, this is controlled by shelling out to the compiler to check +# versions, but BoringSSL is intended to be used with pre-generated perlasm +# output, so this isn't useful anyway. +# +# The upstream code uses the condition |$avx>1| even though no AVX2 +# instructions are used, because it assumes MOVBE is supported by the assembler +# if and only if AVX2 is also supported by the assembler; see +# https://marc.info/?l=openssl-dev&m=146567589526984&w=2. +$avx = 2; + +open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""; +*STDOUT=*OUT; + +# See the comment above regarding why the condition is ($avx>1) when there are +# no AVX2 instructions being used. +if ($avx>1) {{{ + +($inp,$out,$len,$key,$ivp,$Xip)=("%rdi","%rsi","%rdx","%rcx","%r8","%r9"); + +($Ii,$T1,$T2,$Hkey, + $Z0,$Z1,$Z2,$Z3,$Xi) = map("%xmm$_",(0..8)); + +($inout0,$inout1,$inout2,$inout3,$inout4,$inout5,$rndkey) = map("%xmm$_",(9..15)); + +($counter,$rounds,$ret,$const,$in0,$end0)=("%ebx","%ebp","%r10","%r11","%r14","%r15"); + +$code=<<___; +.text + +.type _aesni_ctr32_ghash_6x,\@abi-omnipotent +.align 32 +_aesni_ctr32_ghash_6x: +.cfi_startproc + vmovdqu 0x20($const),$T2 # borrow $T2, .Lone_msb + sub \$6,$len + vpxor $Z0,$Z0,$Z0 # $Z0 = 0 + vmovdqu 0x00-0x80($key),$rndkey + vpaddb $T2,$T1,$inout1 + vpaddb $T2,$inout1,$inout2 + vpaddb $T2,$inout2,$inout3 + vpaddb $T2,$inout3,$inout4 + vpaddb $T2,$inout4,$inout5 + vpxor $rndkey,$T1,$inout0 + vmovdqu $Z0,16+8(%rsp) # "$Z3" = 0 + jmp .Loop6x + +.align 32 +.Loop6x: + add \$`6<<24`,$counter + jc .Lhandle_ctr32 # discard $inout[1-5]?
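+ # $counter caches the IV's last dword unswapped, so the big-endian + # counter's low byte sits in bits 24-31 here; the add above bumps it + # by 6, and a carry out means one of the six vpaddb-incremented + # counters would wrap its low byte, so .Lhandle_ctr32 must rebuild + # them all with full 32-bit adds.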
+ vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpaddb $T2,$inout5,$T1 # next counter value + vpxor $rndkey,$inout1,$inout1 + vpxor $rndkey,$inout2,$inout2 + +.Lresume_ctr32: + vmovdqu $T1,($ivp) # save next counter value + vpclmulqdq \$0x10,$Hkey,$Z3,$Z1 + vpxor $rndkey,$inout3,$inout3 + vmovups 0x10-0x80($key),$T2 # borrow $T2 for $rndkey + vpclmulqdq \$0x01,$Hkey,$Z3,$Z2 + + # At this point, the current block of 96 (0x60) bytes has already been + # loaded into registers. Concurrently with processing it, we want to + # load the next 96 bytes of input for the next round. Obviously, we can + # only do this if there are at least 96 more bytes of input beyond the + # input we're currently processing, or else we'd read past the end of + # the input buffer. Here, we set |%r12| to 96 if there are at least 96 + # bytes of input beyond the 96 bytes we're already processing, and we + # set |%r12| to 0 otherwise. In the case where we set |%r12| to 96, + # we'll read in the next block so that it is in registers for the next + # loop iteration. In the case where we set |%r12| to 0, we'll re-read + # the current block and then ignore what we re-read. + # + # At this point, |$in0| points to the current (already read into + # registers) block, and |$end0| points to 2*96 bytes before the end of + # the input. Thus, |$in0| > |$end0| means that we do not have the next + # 96-byte block to read in, and |$in0| <= |$end0| means we do. + xor %r12,%r12 + cmp $in0,$end0 + + vaesenc $T2,$inout0,$inout0 + vmovdqu 0x30+8(%rsp),$Ii # I[4] + vpxor $rndkey,$inout4,$inout4 + vpclmulqdq \$0x00,$Hkey,$Z3,$T1 + vaesenc $T2,$inout1,$inout1 + vpxor $rndkey,$inout5,$inout5 + setnc %r12b + vpclmulqdq \$0x11,$Hkey,$Z3,$Z3 + vaesenc $T2,$inout2,$inout2 + vmovdqu 0x10-0x20($Xip),$Hkey # $Hkey^2 + neg %r12 + vaesenc $T2,$inout3,$inout3 + vpxor $Z1,$Z2,$Z2 + vpclmulqdq \$0x00,$Hkey,$Ii,$Z1 + vpxor $Z0,$Xi,$Xi # modulo-scheduled + vaesenc $T2,$inout4,$inout4 + vpxor $Z1,$T1,$Z0 + and \$0x60,%r12 + vmovups 0x20-0x80($key),$rndkey + vpclmulqdq \$0x10,$Hkey,$Ii,$T1 + vaesenc $T2,$inout5,$inout5 + + vpclmulqdq \$0x01,$Hkey,$Ii,$T2 + lea ($in0,%r12),$in0 + vaesenc $rndkey,$inout0,$inout0 + vpxor 16+8(%rsp),$Xi,$Xi # modulo-scheduled [vpxor $Z3,$Xi,$Xi] + vpclmulqdq \$0x11,$Hkey,$Ii,$Hkey + vmovdqu 0x40+8(%rsp),$Ii # I[3] + vaesenc $rndkey,$inout1,$inout1 + movbe 0x58($in0),%r13 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x50($in0),%r12 + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x20+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x28+8(%rsp) + vmovdqu 0x30-0x20($Xip),$Z1 # borrow $Z1 for $Hkey^3 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x30-0x80($key),$rndkey + vpxor $T1,$Z2,$Z2 + vpclmulqdq \$0x00,$Z1,$Ii,$T1 + vaesenc $rndkey,$inout0,$inout0 + vpxor $T2,$Z2,$Z2 + vpclmulqdq \$0x10,$Z1,$Ii,$T2 + vaesenc $rndkey,$inout1,$inout1 + vpxor $Hkey,$Z3,$Z3 + vpclmulqdq \$0x01,$Z1,$Ii,$Hkey + vaesenc $rndkey,$inout2,$inout2 + vpclmulqdq \$0x11,$Z1,$Ii,$Z1 + vmovdqu 0x50+8(%rsp),$Ii # I[2] + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vpxor $T1,$Z0,$Z0 + vmovdqu 0x40-0x20($Xip),$T1 # borrow $T1 for $Hkey^4 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x40-0x80($key),$rndkey + vpxor $T2,$Z2,$Z2 + vpclmulqdq \$0x00,$T1,$Ii,$T2 + vaesenc $rndkey,$inout0,$inout0 + vpxor $Hkey,$Z2,$Z2 + vpclmulqdq \$0x10,$T1,$Ii,$Hkey + vaesenc $rndkey,$inout1,$inout1 + movbe 0x48($in0),%r13 + vpxor $Z1,$Z3,$Z3 + vpclmulqdq \$0x01,$T1,$Ii,$Z1 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x40($in0),%r12 + vpclmulqdq \$0x11,$T1,$Ii,$T1 + 
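# Each hashed block costs four vpclmulqdq: selectors 0x00 and 0x11 + # give the low and high halves of the 256-bit product while 0x10 and + # 0x01 give the two cross terms; lows accumulate in $Z0, highs in $Z3 + # and cross terms in $Z2 until the two-phase .Lpoly reduction at the + # end of the iteration. +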
vmovdqu 0x60+8(%rsp),$Ii # I[1] + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x30+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x38+8(%rsp) + vpxor $T2,$Z0,$Z0 + vmovdqu 0x60-0x20($Xip),$T2 # borrow $T2 for $Hkey^5 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x50-0x80($key),$rndkey + vpxor $Hkey,$Z2,$Z2 + vpclmulqdq \$0x00,$T2,$Ii,$Hkey + vaesenc $rndkey,$inout0,$inout0 + vpxor $Z1,$Z2,$Z2 + vpclmulqdq \$0x10,$T2,$Ii,$Z1 + vaesenc $rndkey,$inout1,$inout1 + movbe 0x38($in0),%r13 + vpxor $T1,$Z3,$Z3 + vpclmulqdq \$0x01,$T2,$Ii,$T1 + vpxor 0x70+8(%rsp),$Xi,$Xi # accumulate I[0] + vaesenc $rndkey,$inout2,$inout2 + movbe 0x30($in0),%r12 + vpclmulqdq \$0x11,$T2,$Ii,$T2 + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x40+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x48+8(%rsp) + vpxor $Hkey,$Z0,$Z0 + vmovdqu 0x70-0x20($Xip),$Hkey # $Hkey^6 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x60-0x80($key),$rndkey + vpxor $Z1,$Z2,$Z2 + vpclmulqdq \$0x10,$Hkey,$Xi,$Z1 + vaesenc $rndkey,$inout0,$inout0 + vpxor $T1,$Z2,$Z2 + vpclmulqdq \$0x01,$Hkey,$Xi,$T1 + vaesenc $rndkey,$inout1,$inout1 + movbe 0x28($in0),%r13 + vpxor $T2,$Z3,$Z3 + vpclmulqdq \$0x00,$Hkey,$Xi,$T2 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x20($in0),%r12 + vpclmulqdq \$0x11,$Hkey,$Xi,$Xi + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x50+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x58+8(%rsp) + vpxor $Z1,$Z2,$Z2 + vaesenc $rndkey,$inout5,$inout5 + vpxor $T1,$Z2,$Z2 + + vmovups 0x70-0x80($key),$rndkey + vpslldq \$8,$Z2,$Z1 + vpxor $T2,$Z0,$Z0 + vmovdqu 0x10($const),$Hkey # .Lpoly + + vaesenc $rndkey,$inout0,$inout0 + vpxor $Xi,$Z3,$Z3 + vaesenc $rndkey,$inout1,$inout1 + vpxor $Z1,$Z0,$Z0 + movbe 0x18($in0),%r13 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x10($in0),%r12 + vpalignr \$8,$Z0,$Z0,$Ii # 1st phase + vpclmulqdq \$0x10,$Hkey,$Z0,$Z0 + mov %r13,0x60+8(%rsp) + vaesenc $rndkey,$inout3,$inout3 + mov %r12,0x68+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + vmovups 0x80-0x80($key),$T1 # borrow $T1 for $rndkey + vaesenc $rndkey,$inout5,$inout5 + + vaesenc $T1,$inout0,$inout0 + vmovups 0x90-0x80($key),$rndkey + vaesenc $T1,$inout1,$inout1 + vpsrldq \$8,$Z2,$Z2 + vaesenc $T1,$inout2,$inout2 + vpxor $Z2,$Z3,$Z3 + vaesenc $T1,$inout3,$inout3 + vpxor $Ii,$Z0,$Z0 + movbe 0x08($in0),%r13 + vaesenc $T1,$inout4,$inout4 + movbe 0x00($in0),%r12 + vaesenc $T1,$inout5,$inout5 + vmovups 0xa0-0x80($key),$T1 + cmp \$11,$rounds + jb .Lenc_tail # 128-bit key + + vaesenc $rndkey,$inout0,$inout0 + vaesenc $rndkey,$inout1,$inout1 + vaesenc $rndkey,$inout2,$inout2 + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vaesenc $rndkey,$inout5,$inout5 + + vaesenc $T1,$inout0,$inout0 + vaesenc $T1,$inout1,$inout1 + vaesenc $T1,$inout2,$inout2 + vaesenc $T1,$inout3,$inout3 + vaesenc $T1,$inout4,$inout4 + vmovups 0xb0-0x80($key),$rndkey + vaesenc $T1,$inout5,$inout5 + vmovups 0xc0-0x80($key),$T1 + je .Lenc_tail # 192-bit key + + vaesenc $rndkey,$inout0,$inout0 + vaesenc $rndkey,$inout1,$inout1 + vaesenc $rndkey,$inout2,$inout2 + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vaesenc $rndkey,$inout5,$inout5 + + vaesenc $T1,$inout0,$inout0 + vaesenc $T1,$inout1,$inout1 + vaesenc $T1,$inout2,$inout2 + vaesenc $T1,$inout3,$inout3 + vaesenc $T1,$inout4,$inout4 + vmovups 0xd0-0x80($key),$rndkey + vaesenc $T1,$inout5,$inout5 + vmovups 0xe0-0x80($key),$T1 + jmp .Lenc_tail # 256-bit key + +.align 32 +.Lhandle_ctr32: + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + vpshufb $Ii,$T1,$Z2 # byte-swap counter + 
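# Slow path: once byte-swapped to little-endian, the six counters are + # rebuilt with full 32-bit vpaddd adds (+1 via .Lone_lsb, then chained + # +2 via .Ltwo_lsb) and swapped back, replacing the vpaddb fast path. +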
vmovdqu 0x30($const),$Z1 # borrow $Z1, .Ltwo_lsb + vpaddd 0x40($const),$Z2,$inout1 # .Lone_lsb + vpaddd $Z1,$Z2,$inout2 + vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpaddd $Z1,$inout1,$inout3 + vpshufb $Ii,$inout1,$inout1 + vpaddd $Z1,$inout2,$inout4 + vpshufb $Ii,$inout2,$inout2 + vpxor $rndkey,$inout1,$inout1 + vpaddd $Z1,$inout3,$inout5 + vpshufb $Ii,$inout3,$inout3 + vpxor $rndkey,$inout2,$inout2 + vpaddd $Z1,$inout4,$T1 # byte-swapped next counter value + vpshufb $Ii,$inout4,$inout4 + vpshufb $Ii,$inout5,$inout5 + vpshufb $Ii,$T1,$T1 # next counter value + jmp .Lresume_ctr32 + +.align 32 +.Lenc_tail: + vaesenc $rndkey,$inout0,$inout0 + vmovdqu $Z3,16+8(%rsp) # postpone vpxor $Z3,$Xi,$Xi + vpalignr \$8,$Z0,$Z0,$Xi # 2nd phase + vaesenc $rndkey,$inout1,$inout1 + vpclmulqdq \$0x10,$Hkey,$Z0,$Z0 + vpxor 0x00($inp),$T1,$T2 + vaesenc $rndkey,$inout2,$inout2 + vpxor 0x10($inp),$T1,$Ii + vaesenc $rndkey,$inout3,$inout3 + vpxor 0x20($inp),$T1,$Z1 + vaesenc $rndkey,$inout4,$inout4 + vpxor 0x30($inp),$T1,$Z2 + vaesenc $rndkey,$inout5,$inout5 + vpxor 0x40($inp),$T1,$Z3 + vpxor 0x50($inp),$T1,$Hkey + vmovdqu ($ivp),$T1 # load next counter value + + vaesenclast $T2,$inout0,$inout0 + vmovdqu 0x20($const),$T2 # borrow $T2, .Lone_msb + vaesenclast $Ii,$inout1,$inout1 + vpaddb $T2,$T1,$Ii + mov %r13,0x70+8(%rsp) + lea 0x60($inp),$inp + vaesenclast $Z1,$inout2,$inout2 + vpaddb $T2,$Ii,$Z1 + mov %r12,0x78+8(%rsp) + lea 0x60($out),$out + vmovdqu 0x00-0x80($key),$rndkey + vaesenclast $Z2,$inout3,$inout3 + vpaddb $T2,$Z1,$Z2 + vaesenclast $Z3, $inout4,$inout4 + vpaddb $T2,$Z2,$Z3 + vaesenclast $Hkey,$inout5,$inout5 + vpaddb $T2,$Z3,$Hkey + + add \$0x60,$ret + sub \$0x6,$len + jc .L6x_done + + vmovups $inout0,-0x60($out) # save output + vpxor $rndkey,$T1,$inout0 + vmovups $inout1,-0x50($out) + vmovdqa $Ii,$inout1 # 0 latency + vmovups $inout2,-0x40($out) + vmovdqa $Z1,$inout2 # 0 latency + vmovups $inout3,-0x30($out) + vmovdqa $Z2,$inout3 # 0 latency + vmovups $inout4,-0x20($out) + vmovdqa $Z3,$inout4 # 0 latency + vmovups $inout5,-0x10($out) + vmovdqa $Hkey,$inout5 # 0 latency + vmovdqu 0x20+8(%rsp),$Z3 # I[5] + jmp .Loop6x + +.L6x_done: + vpxor 16+8(%rsp),$Xi,$Xi # modulo-scheduled + vpxor $Z0,$Xi,$Xi # modulo-scheduled + + ret +.cfi_endproc +.size _aesni_ctr32_ghash_6x,.-_aesni_ctr32_ghash_6x +___ +###################################################################### +# +# size_t aesni_gcm_[en|de]crypt(const void *inp, void *out, size_t len, +# const AES_KEY *key, struct { u128 Yi, Xi; } *yi_xi, +# struct { u128 H,Htbl[9]; } *H_htbl); +$code.=<<___; +.globl aesni_gcm_decrypt +.type aesni_gcm_decrypt,\@function,6 +.align 32 +aesni_gcm_decrypt: +.cfi_startproc + xor $ret,$ret + + # We call |_aesni_ctr32_ghash_6x|, which requires at least 96 (0x60) + # bytes of input. 
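+ # Shorter inputs are rejected up front: $ret stays 0 and the buffers + # are left untouched, the expectation being that the portable C code + # processes anything below this threshold block by block on its own.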
+ cmp \$0x60,$len # minimal accepted length + jb .Lgcm_dec_abort + + lea (%rsp),%rax # save stack pointer +.cfi_def_cfa_register %rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +___ + +$code .= <<___ if ($win64); + lea -0xa8(%rsp),%rsp + movaps %xmm6,-0xd8(%rax) + movaps %xmm7,-0xc8(%rax) + movaps %xmm8,-0xb8(%rax) + movaps %xmm9,-0xa8(%rax) + movaps %xmm10,-0x98(%rax) + movaps %xmm11,-0x88(%rax) + movaps %xmm12,-0x78(%rax) + movaps %xmm13,-0x68(%rax) + movaps %xmm14,-0x58(%rax) + movaps %xmm15,-0x48(%rax) +.Lgcm_dec_body: +___ + +$code.=<<___; + vzeroupper + + vmovdqu ($ivp),$T1 # input counter value + add \$-128,%rsp + mov 12($ivp),$counter + lea .Lbswap_mask(%rip),$const + lea -0x80($key),$in0 # borrow $in0 + mov \$0xf80,$end0 # borrow $end0 + vmovdqu 0x10($ivp),$Xi # load Xi + and \$-128,%rsp # ensure stack alignment + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + lea 0x80($key),$key # size optimization + lea 0x10+0x20($Xip),$Xip # size optimization + mov 0xf0-0x80($key),$rounds + vpshufb $Ii,$Xi,$Xi + + and $end0,$in0 + and %rsp,$end0 + sub $in0,$end0 + jc .Ldec_no_key_aliasing + cmp \$768,$end0 + jnc .Ldec_no_key_aliasing + sub $end0,%rsp # avoid aliasing with key +.Ldec_no_key_aliasing: + + vmovdqu 0x50($inp),$Z3 # I[5] + lea ($inp),$in0 + vmovdqu 0x40($inp),$Z0 + + # |_aesni_ctr32_ghash_6x| requires |$end0| to point to 2*96 (0xc0) + # bytes before the end of the input. Note, in particular, that this is + # correct even if |$len| is not an even multiple of 96 or 16. XXX: This + # seems to require that |$inp| + |$len| >= 2*96 (0xc0); i.e. |$inp| must + # not be near the very beginning of the address space when |$len| < 2*96 + # (0xc0). 
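+ # Worked example: with |$len| = 0x60 exactly, |$end0| = |$inp| - 0x60, + # so the cmp/setnc in .Loop6x always leaves %r12 = 0 and the loop only + # ever re-reads (and discards) the current block instead of + # prefetching a following one.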
+ lea -0xc0($inp,$len),$end0 + + vmovdqu 0x30($inp),$Z1 + shr \$4,$len + xor $ret,$ret + vmovdqu 0x20($inp),$Z2 + vpshufb $Ii,$Z3,$Z3 # passed to _aesni_ctr32_ghash_6x + vmovdqu 0x10($inp),$T2 + vpshufb $Ii,$Z0,$Z0 + vmovdqu ($inp),$Hkey + vpshufb $Ii,$Z1,$Z1 + vmovdqu $Z0,0x30(%rsp) + vpshufb $Ii,$Z2,$Z2 + vmovdqu $Z1,0x40(%rsp) + vpshufb $Ii,$T2,$T2 + vmovdqu $Z2,0x50(%rsp) + vpshufb $Ii,$Hkey,$Hkey + vmovdqu $T2,0x60(%rsp) + vmovdqu $Hkey,0x70(%rsp) + + call _aesni_ctr32_ghash_6x + + vmovups $inout0,-0x60($out) # save output + vmovups $inout1,-0x50($out) + vmovups $inout2,-0x40($out) + vmovups $inout3,-0x30($out) + vmovups $inout4,-0x20($out) + vmovups $inout5,-0x10($out) + + vpshufb ($const),$Xi,$Xi # .Lbswap_mask + vmovdqu $Xi,0x10($ivp) # output Xi + + vzeroupper +___ + +$code.=<<___ if ($win64); + movaps -0xd8(%rax),%xmm6 + movaps -0xc8(%rax),%xmm7 + movaps -0xb8(%rax),%xmm8 + movaps -0xa8(%rax),%xmm9 + movaps -0x98(%rax),%xmm10 + movaps -0x88(%rax),%xmm11 + movaps -0x78(%rax),%xmm12 + movaps -0x68(%rax),%xmm13 + movaps -0x58(%rax),%xmm14 + movaps -0x48(%rax),%xmm15 +___ +$code.=<<___; + mov -48(%rax),%r15 +.cfi_restore %r15 + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp # restore %rsp +.cfi_def_cfa_register %rsp +.Lgcm_dec_abort: + mov $ret,%rax # return value + ret +.cfi_endproc +.size aesni_gcm_decrypt,.-aesni_gcm_decrypt +___ + +$code.=<<___; +.type _aesni_ctr32_6x,\@abi-omnipotent +.align 32 +_aesni_ctr32_6x: +.cfi_startproc + vmovdqu 0x00-0x80($key),$Z0 # borrow $Z0 for $rndkey + vmovdqu 0x20($const),$T2 # borrow $T2, .Lone_msb + lea -1($rounds),%r13 + vmovups 0x10-0x80($key),$rndkey + lea 0x20-0x80($key),%r12 + vpxor $Z0,$T1,$inout0 + add \$`6<<24`,$counter + jc .Lhandle_ctr32_2 + vpaddb $T2,$T1,$inout1 + vpaddb $T2,$inout1,$inout2 + vpxor $Z0,$inout1,$inout1 + vpaddb $T2,$inout2,$inout3 + vpxor $Z0,$inout2,$inout2 + vpaddb $T2,$inout3,$inout4 + vpxor $Z0,$inout3,$inout3 + vpaddb $T2,$inout4,$inout5 + vpxor $Z0,$inout4,$inout4 + vpaddb $T2,$inout5,$T1 + vpxor $Z0,$inout5,$inout5 + jmp .Loop_ctr32 + +.align 16 +.Loop_ctr32: + vaesenc $rndkey,$inout0,$inout0 + vaesenc $rndkey,$inout1,$inout1 + vaesenc $rndkey,$inout2,$inout2 + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vaesenc $rndkey,$inout5,$inout5 + vmovups (%r12),$rndkey + lea 0x10(%r12),%r12 + dec %r13d + jnz .Loop_ctr32 + + vmovdqu (%r12),$Hkey # last round key + vaesenc $rndkey,$inout0,$inout0 + vpxor 0x00($inp),$Hkey,$Z0 + vaesenc $rndkey,$inout1,$inout1 + vpxor 0x10($inp),$Hkey,$Z1 + vaesenc $rndkey,$inout2,$inout2 + vpxor 0x20($inp),$Hkey,$Z2 + vaesenc $rndkey,$inout3,$inout3 + vpxor 0x30($inp),$Hkey,$Xi + vaesenc $rndkey,$inout4,$inout4 + vpxor 0x40($inp),$Hkey,$T2 + vaesenc $rndkey,$inout5,$inout5 + vpxor 0x50($inp),$Hkey,$Hkey + lea 0x60($inp),$inp + + vaesenclast $Z0,$inout0,$inout0 + vaesenclast $Z1,$inout1,$inout1 + vaesenclast $Z2,$inout2,$inout2 + vaesenclast $Xi,$inout3,$inout3 + vaesenclast $T2,$inout4,$inout4 + vaesenclast $Hkey,$inout5,$inout5 + vmovups $inout0,0x00($out) + vmovups $inout1,0x10($out) + vmovups $inout2,0x20($out) + vmovups $inout3,0x30($out) + vmovups $inout4,0x40($out) + vmovups $inout5,0x50($out) + lea 0x60($out),$out + + ret +.align 32 +.Lhandle_ctr32_2: + vpshufb $Ii,$T1,$Z2 # byte-swap counter + vmovdqu 0x30($const),$Z1 # borrow $Z1, .Ltwo_lsb + vpaddd 0x40($const),$Z2,$inout1 # .Lone_lsb + 
vpaddd $Z1,$Z2,$inout2 + vpaddd $Z1,$inout1,$inout3 + vpshufb $Ii,$inout1,$inout1 + vpaddd $Z1,$inout2,$inout4 + vpshufb $Ii,$inout2,$inout2 + vpxor $Z0,$inout1,$inout1 + vpaddd $Z1,$inout3,$inout5 + vpshufb $Ii,$inout3,$inout3 + vpxor $Z0,$inout2,$inout2 + vpaddd $Z1,$inout4,$T1 # byte-swapped next counter value + vpshufb $Ii,$inout4,$inout4 + vpxor $Z0,$inout3,$inout3 + vpshufb $Ii,$inout5,$inout5 + vpxor $Z0,$inout4,$inout4 + vpshufb $Ii,$T1,$T1 # next counter value + vpxor $Z0,$inout5,$inout5 + jmp .Loop_ctr32 +.cfi_endproc +.size _aesni_ctr32_6x,.-_aesni_ctr32_6x + +.globl aesni_gcm_encrypt +.type aesni_gcm_encrypt,\@function,6 +.align 32 +aesni_gcm_encrypt: +.cfi_startproc + xor $ret,$ret + + # We call |_aesni_ctr32_6x| twice, each call consuming 96 bytes of + # input. Then we call |_aesni_ctr32_ghash_6x|, which requires at + # least 96 more bytes of input. + cmp \$0x60*3,$len # minimal accepted length + jb .Lgcm_enc_abort + + lea (%rsp),%rax # save stack pointer +.cfi_def_cfa_register %rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +___ +$code.=<<___ if ($win64); + lea -0xa8(%rsp),%rsp + movaps %xmm6,-0xd8(%rax) + movaps %xmm7,-0xc8(%rax) + movaps %xmm8,-0xb8(%rax) + movaps %xmm9,-0xa8(%rax) + movaps %xmm10,-0x98(%rax) + movaps %xmm11,-0x88(%rax) + movaps %xmm12,-0x78(%rax) + movaps %xmm13,-0x68(%rax) + movaps %xmm14,-0x58(%rax) + movaps %xmm15,-0x48(%rax) +.Lgcm_enc_body: +___ +$code.=<<___; + vzeroupper + + vmovdqu ($ivp),$T1 # input counter value + add \$-128,%rsp + mov 12($ivp),$counter + lea .Lbswap_mask(%rip),$const + lea -0x80($key),$in0 # borrow $in0 + mov \$0xf80,$end0 # borrow $end0 + lea 0x80($key),$key # size optimization + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + and \$-128,%rsp # ensure stack alignment + mov 0xf0-0x80($key),$rounds + + and $end0,$in0 + and %rsp,$end0 + sub $in0,$end0 + jc .Lenc_no_key_aliasing + cmp \$768,$end0 + jnc .Lenc_no_key_aliasing + sub $end0,%rsp # avoid aliasing with key +.Lenc_no_key_aliasing: + + lea ($out),$in0 + + # |_aesni_ctr32_ghash_6x| requires |$end0| to point to 2*96 (0xc0) + # bytes before the end of the input. Note, in particular, that this is + # correct even if |$len| is not an even multiple of 96 or 16. Unlike in + # the decryption case, there's no caveat that |$out| must not be near + # the very beginning of the address space, because we know that + # |$len| >= 3*96 from the check above, and so we know + # |$out| + |$len| >= 2*96 (0xc0). 
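+ # (For encryption GHASH is computed over the ciphertext just written, + # which is why $in0 and $end0 are derived from $out here rather than + # from $inp as in the decrypt path.)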
+ lea -0xc0($out,$len),$end0 + + shr \$4,$len + + call _aesni_ctr32_6x + + vpshufb $Ii,$inout0,$Xi # save bswapped output on stack + vpshufb $Ii,$inout1,$T2 + vmovdqu $Xi,0x70(%rsp) + vpshufb $Ii,$inout2,$Z0 + vmovdqu $T2,0x60(%rsp) + vpshufb $Ii,$inout3,$Z1 + vmovdqu $Z0,0x50(%rsp) + vpshufb $Ii,$inout4,$Z2 + vmovdqu $Z1,0x40(%rsp) + vpshufb $Ii,$inout5,$Z3 # passed to _aesni_ctr32_ghash_6x + vmovdqu $Z2,0x30(%rsp) + + call _aesni_ctr32_6x + + vmovdqu 0x10($ivp),$Xi # load Xi + lea 0x10+0x20($Xip),$Xip # size optimization + sub \$12,$len + mov \$0x60*2,$ret + vpshufb $Ii,$Xi,$Xi + + call _aesni_ctr32_ghash_6x + vmovdqu 0x20(%rsp),$Z3 # I[5] + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpunpckhqdq $Z3,$Z3,$T1 + vmovdqu 0x20-0x20($Xip),$rndkey # borrow $rndkey for $HK + vmovups $inout0,-0x60($out) # save output + vpshufb $Ii,$inout0,$inout0 # but keep bswapped copy + vpxor $Z3,$T1,$T1 + vmovups $inout1,-0x50($out) + vpshufb $Ii,$inout1,$inout1 + vmovups $inout2,-0x40($out) + vpshufb $Ii,$inout2,$inout2 + vmovups $inout3,-0x30($out) + vpshufb $Ii,$inout3,$inout3 + vmovups $inout4,-0x20($out) + vpshufb $Ii,$inout4,$inout4 + vmovups $inout5,-0x10($out) + vpshufb $Ii,$inout5,$inout5 + vmovdqu $inout0,0x10(%rsp) # free $inout0 +___ +{ my ($HK,$T3)=($rndkey,$inout0); + +$code.=<<___; + vmovdqu 0x30(%rsp),$Z2 # I[4] + vmovdqu 0x10-0x20($Xip),$Ii # borrow $Ii for $Hkey^2 + vpunpckhqdq $Z2,$Z2,$T2 + vpclmulqdq \$0x00,$Hkey,$Z3,$Z1 + vpxor $Z2,$T2,$T2 + vpclmulqdq \$0x11,$Hkey,$Z3,$Z3 + vpclmulqdq \$0x00,$HK,$T1,$T1 + + vmovdqu 0x40(%rsp),$T3 # I[3] + vpclmulqdq \$0x00,$Ii,$Z2,$Z0 + vmovdqu 0x30-0x20($Xip),$Hkey # $Hkey^3 + vpxor $Z1,$Z0,$Z0 + vpunpckhqdq $T3,$T3,$Z1 + vpclmulqdq \$0x11,$Ii,$Z2,$Z2 + vpxor $T3,$Z1,$Z1 + vpxor $Z3,$Z2,$Z2 + vpclmulqdq \$0x10,$HK,$T2,$T2 + vmovdqu 0x50-0x20($Xip),$HK + vpxor $T1,$T2,$T2 + + vmovdqu 0x50(%rsp),$T1 # I[2] + vpclmulqdq \$0x00,$Hkey,$T3,$Z3 + vmovdqu 0x40-0x20($Xip),$Ii # borrow $Ii for $Hkey^4 + vpxor $Z0,$Z3,$Z3 + vpunpckhqdq $T1,$T1,$Z0 + vpclmulqdq \$0x11,$Hkey,$T3,$T3 + vpxor $T1,$Z0,$Z0 + vpxor $Z2,$T3,$T3 + vpclmulqdq \$0x00,$HK,$Z1,$Z1 + vpxor $T2,$Z1,$Z1 + + vmovdqu 0x60(%rsp),$T2 # I[1] + vpclmulqdq \$0x00,$Ii,$T1,$Z2 + vmovdqu 0x60-0x20($Xip),$Hkey # $Hkey^5 + vpxor $Z3,$Z2,$Z2 + vpunpckhqdq $T2,$T2,$Z3 + vpclmulqdq \$0x11,$Ii,$T1,$T1 + vpxor $T2,$Z3,$Z3 + vpxor $T3,$T1,$T1 + vpclmulqdq \$0x10,$HK,$Z0,$Z0 + vmovdqu 0x80-0x20($Xip),$HK + vpxor $Z1,$Z0,$Z0 + + vpxor 0x70(%rsp),$Xi,$Xi # accumulate I[0] + vpclmulqdq \$0x00,$Hkey,$T2,$Z1 + vmovdqu 0x70-0x20($Xip),$Ii # borrow $Ii for $Hkey^6 + vpunpckhqdq $Xi,$Xi,$T3 + vpxor $Z2,$Z1,$Z1 + vpclmulqdq \$0x11,$Hkey,$T2,$T2 + vpxor $Xi,$T3,$T3 + vpxor $T1,$T2,$T2 + vpclmulqdq \$0x00,$HK,$Z3,$Z3 + vpxor $Z0,$Z3,$Z0 + + vpclmulqdq \$0x00,$Ii,$Xi,$Z2 + vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpunpckhqdq $inout5,$inout5,$T1 + vpclmulqdq \$0x11,$Ii,$Xi,$Xi + vpxor $inout5,$T1,$T1 + vpxor $Z1,$Z2,$Z1 + vpclmulqdq \$0x10,$HK,$T3,$T3 + vmovdqu 0x20-0x20($Xip),$HK + vpxor $T2,$Xi,$Z3 + vpxor $Z0,$T3,$Z2 + + vmovdqu 0x10-0x20($Xip),$Ii # borrow $Ii for $Hkey^2 + vpxor $Z1,$Z3,$T3 # aggregated Karatsuba post-processing + vpclmulqdq \$0x00,$Hkey,$inout5,$Z0 + vpxor $T3,$Z2,$Z2 + vpunpckhqdq $inout4,$inout4,$T2 + vpclmulqdq \$0x11,$Hkey,$inout5,$inout5 + vpxor $inout4,$T2,$T2 + vpslldq \$8,$Z2,$T3 + vpclmulqdq \$0x00,$HK,$T1,$T1 + vpxor $T3,$Z1,$Xi + vpsrldq \$8,$Z2,$Z2 + vpxor $Z2,$Z3,$Z3 + + vpclmulqdq \$0x00,$Ii,$inout4,$Z1 + vmovdqu 0x30-0x20($Xip),$Hkey # 
$Hkey^3 + vpxor $Z0,$Z1,$Z1 + vpunpckhqdq $inout3,$inout3,$T3 + vpclmulqdq \$0x11,$Ii,$inout4,$inout4 + vpxor $inout3,$T3,$T3 + vpxor $inout5,$inout4,$inout4 + vpalignr \$8,$Xi,$Xi,$inout5 # 1st phase + vpclmulqdq \$0x10,$HK,$T2,$T2 + vmovdqu 0x50-0x20($Xip),$HK + vpxor $T1,$T2,$T2 + + vpclmulqdq \$0x00,$Hkey,$inout3,$Z0 + vmovdqu 0x40-0x20($Xip),$Ii # borrow $Ii for $Hkey^4 + vpxor $Z1,$Z0,$Z0 + vpunpckhqdq $inout2,$inout2,$T1 + vpclmulqdq \$0x11,$Hkey,$inout3,$inout3 + vpxor $inout2,$T1,$T1 + vpxor $inout4,$inout3,$inout3 + vxorps 0x10(%rsp),$Z3,$Z3 # accumulate $inout0 + vpclmulqdq \$0x00,$HK,$T3,$T3 + vpxor $T2,$T3,$T3 + + vpclmulqdq \$0x10,0x10($const),$Xi,$Xi + vxorps $inout5,$Xi,$Xi + + vpclmulqdq \$0x00,$Ii,$inout2,$Z1 + vmovdqu 0x60-0x20($Xip),$Hkey # $Hkey^5 + vpxor $Z0,$Z1,$Z1 + vpunpckhqdq $inout1,$inout1,$T2 + vpclmulqdq \$0x11,$Ii,$inout2,$inout2 + vpxor $inout1,$T2,$T2 + vpalignr \$8,$Xi,$Xi,$inout5 # 2nd phase + vpxor $inout3,$inout2,$inout2 + vpclmulqdq \$0x10,$HK,$T1,$T1 + vmovdqu 0x80-0x20($Xip),$HK + vpxor $T3,$T1,$T1 + + vxorps $Z3,$inout5,$inout5 + vpclmulqdq \$0x10,0x10($const),$Xi,$Xi + vxorps $inout5,$Xi,$Xi + + vpclmulqdq \$0x00,$Hkey,$inout1,$Z0 + vmovdqu 0x70-0x20($Xip),$Ii # borrow $Ii for $Hkey^6 + vpxor $Z1,$Z0,$Z0 + vpunpckhqdq $Xi,$Xi,$T3 + vpclmulqdq \$0x11,$Hkey,$inout1,$inout1 + vpxor $Xi,$T3,$T3 + vpxor $inout2,$inout1,$inout1 + vpclmulqdq \$0x00,$HK,$T2,$T2 + vpxor $T1,$T2,$T2 + + vpclmulqdq \$0x00,$Ii,$Xi,$Z1 + vpclmulqdq \$0x11,$Ii,$Xi,$Z3 + vpxor $Z0,$Z1,$Z1 + vpclmulqdq \$0x10,$HK,$T3,$Z2 + vpxor $inout1,$Z3,$Z3 + vpxor $T2,$Z2,$Z2 + + vpxor $Z1,$Z3,$Z0 # aggregated Karatsuba post-processing + vpxor $Z0,$Z2,$Z2 + vpslldq \$8,$Z2,$T1 + vmovdqu 0x10($const),$Hkey # .Lpoly + vpsrldq \$8,$Z2,$Z2 + vpxor $T1,$Z1,$Xi + vpxor $Z2,$Z3,$Z3 + + vpalignr \$8,$Xi,$Xi,$T2 # 1st phase + vpclmulqdq \$0x10,$Hkey,$Xi,$Xi + vpxor $T2,$Xi,$Xi + + vpalignr \$8,$Xi,$Xi,$T2 # 2nd phase + vpclmulqdq \$0x10,$Hkey,$Xi,$Xi + vpxor $Z3,$T2,$T2 + vpxor $T2,$Xi,$Xi +___ +} +$code.=<<___; + vpshufb ($const),$Xi,$Xi # .Lbswap_mask + vmovdqu $Xi,0x10($ivp) # output Xi + + vzeroupper +___ +$code.=<<___ if ($win64); + movaps -0xd8(%rax),%xmm6 + movaps -0xc8(%rax),%xmm7 + movaps -0xb8(%rax),%xmm8 + movaps -0xa8(%rax),%xmm9 + movaps -0x98(%rax),%xmm10 + movaps -0x88(%rax),%xmm11 + movaps -0x78(%rax),%xmm12 + movaps -0x68(%rax),%xmm13 + movaps -0x58(%rax),%xmm14 + movaps -0x48(%rax),%xmm15 +___ +$code.=<<___; + mov -48(%rax),%r15 +.cfi_restore %r15 + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp # restore %rsp +.cfi_def_cfa_register %rsp +.Lgcm_enc_abort: + mov $ret,%rax # return value + ret +.cfi_endproc +.size aesni_gcm_encrypt,.-aesni_gcm_encrypt +___ + +$code.=<<___; +.align 64 +.Lbswap_mask: + .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lpoly: + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.Lone_msb: + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Ltwo_lsb: + .byte 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.Lone_lsb: + .byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.asciz "AES-NI GCM module for x86_64, CRYPTOGAMS by " +.align 64 +___ +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___ +.extern __imp_RtlVirtualUnwind +.type gcm_se_handler,\@abi-omnipotent +.align 16 +gcm_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub 
\$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 120($context),%rax # pull context->Rax + + mov -48(%rax),%r15 + mov -40(%rax),%r14 + mov -32(%rax),%r13 + mov -24(%rax),%r12 + mov -16(%rax),%rbp + mov -8(%rax),%rbx + mov %r15,240($context) + mov %r14,232($context) + mov %r13,224($context) + mov %r12,216($context) + mov %rbp,160($context) + mov %rbx,144($context) + + lea -0xd8(%rax),%rsi # %xmm save area + lea 512($context),%rdi # & context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + +.Lcommon_seh_tail: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$154,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size gcm_se_handler,.-gcm_se_handler + +.section .pdata +.align 4 + .rva .LSEH_begin_aesni_gcm_decrypt + .rva .LSEH_end_aesni_gcm_decrypt + .rva .LSEH_gcm_dec_info + + .rva .LSEH_begin_aesni_gcm_encrypt + .rva .LSEH_end_aesni_gcm_encrypt + .rva .LSEH_gcm_enc_info +.section .xdata +.align 8 +.LSEH_gcm_dec_info: + .byte 9,0,0,0 + .rva gcm_se_handler + .rva .Lgcm_dec_body,.Lgcm_dec_abort +.LSEH_gcm_enc_info: + .byte 9,0,0,0 + .rva gcm_se_handler + .rva .Lgcm_enc_body,.Lgcm_enc_abort +___ +} +}}} else {{{ +$code=<<___; # assembler is too old +.text + +.globl aesni_gcm_encrypt +.type aesni_gcm_encrypt,\@abi-omnipotent +aesni_gcm_encrypt: + xor %eax,%eax + ret +.size aesni_gcm_encrypt,.-aesni_gcm_encrypt + +.globl aesni_gcm_decrypt +.type aesni_gcm_decrypt,\@abi-omnipotent +aesni_gcm_decrypt: + xor %eax,%eax + ret +.size aesni_gcm_decrypt,.-aesni_gcm_decrypt +___ +}}} + +$code =~ s/\`([^\`]*)\`/eval($1)/gem; + +print $code; + +close STDOUT; diff --git a/crypto/aesgcm/aesni-x86.pl b/crypto/aesgcm/aesni-x86.pl new file mode 100644 index 0000000..cf1a51e --- /dev/null +++ b/crypto/aesgcm/aesni-x86.pl @@ -0,0 +1,2544 @@ +#! /usr/bin/env perl +# Copyright 2009-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + + +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project.
The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# This module implements support for the Intel AES-NI extension. In +# the OpenSSL context it is used with the Intel engine, but it can +# also be used as a drop-in replacement for crypto/aes/asm/aes-586.pl +# [see below for details]. +# +# Performance. +# +# To start with, see the corresponding paragraph in aesni-x86_64.pl... +# Instead of filling in a table similar to the one found there, I've +# chosen to summarize *comparison* results for raw ECB, CTR and CBC +# benchmarks. The simplified table below represents 32-bit performance +# relative to 64-bit performance at every given point. Ratios vary for +# different encryption modes, hence the interval values. +# +# 16-byte 64-byte 256-byte 1-KB 8-KB +# 53-67% 67-84% 91-94% 95-98% 97-99.5% +# +# Lower ratios for smaller block sizes are perfectly understandable, +# because function call overhead is higher in 32-bit mode. Performance +# for the largest 8-KB blocks is virtually the same: 32-bit code is +# less than 1% slower for ECB, CBC and CCM, and ~3% slower otherwise. + +# January 2011 +# +# See aesni-x86_64.pl for details. Unlike the x86_64 version, this +# module interleaves at most 6 aes[enc|dec] instructions, because there +# are not enough registers for an 8x interleave [which should be +# optimal for Sandy Bridge]. Actually, the performance results for a 6x +# interleave factor presented in aesni-x86_64.pl (except for CTR) are +# for this module. + +# April 2011 +# +# Add aesni_xts_[en|de]crypt. Westmere spends 1.50 cycles processing +# one byte out of 8KB with a 128-bit key; Sandy Bridge, 1.09. + +# November 2015 +# +# Add aesni_ocb_[en|de]crypt. [Removed in BoringSSL] + +###################################################################### +# Current large-block performance in cycles per byte processed with a +# 128-bit key (less is better).
+# +# CBC en-/decrypt CTR XTS ECB OCB +# Westmere 3.77/1.37 1.37 1.52 1.27 +# * Bridge 5.07/0.98 0.99 1.09 0.91 1.10 +# Haswell 4.44/0.80 0.97 1.03 0.72 0.76 +# Skylake 2.68/0.65 0.65 0.66 0.64 0.66 +# Silvermont 5.77/3.56 3.67 4.03 3.46 4.03 +# Goldmont 3.84/1.39 1.39 1.63 1.31 1.70 +# Bulldozer 5.80/0.98 1.05 1.24 0.93 1.23 + +$PREFIX="aesni"; # if $PREFIX is set to "AES", the script + # generates drop-in replacement for + # crypto/aes/asm/aes-586.pl:-) +$inline=1; # inline _aesni_[en|de]crypt + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../../perlasm"); +require "x86asm.pl"; + +$output = pop; +open OUT,">$output"; +*STDOUT=*OUT; + +&asm_init($ARGV[0]); + +&external_label("OPENSSL_ia32cap_P"); +&static_label("key_const"); + +if ($PREFIX eq "aesni") { $movekey=\&movups; } +else { $movekey=\&movups; } + +$len="eax"; +$rounds="ecx"; +$key="edx"; +$inp="esi"; +$out="edi"; +$rounds_="ebx"; # backup copy for $rounds +$key_="ebp"; # backup copy for $key + +$rndkey0="xmm0"; +$rndkey1="xmm1"; +$inout0="xmm2"; +$inout1="xmm3"; +$inout2="xmm4"; +$inout3="xmm5"; $in1="xmm5"; +$inout4="xmm6"; $in0="xmm6"; +$inout5="xmm7"; $ivec="xmm7"; + +# AESNI extension +sub aeskeygenassist +{ my($dst,$src,$imm)=@_; + if ("$dst:$src" =~ /xmm([0-7]):xmm([0-7])/) + { &data_byte(0x66,0x0f,0x3a,0xdf,0xc0|($1<<3)|$2,$imm); } +} +sub aescommon +{ my($opcodelet,$dst,$src)=@_; + if ("$dst:$src" =~ /xmm([0-7]):xmm([0-7])/) + { &data_byte(0x66,0x0f,0x38,$opcodelet,0xc0|($1<<3)|$2);} +} +sub aesimc { aescommon(0xdb,@_); } +sub aesenc { aescommon(0xdc,@_); } +sub aesenclast { aescommon(0xdd,@_); } +sub aesdec { aescommon(0xde,@_); } +sub aesdeclast { aescommon(0xdf,@_); } + +# Inline version of internal aesni_[en|de]crypt1 +{ my $sn; +sub aesni_inline_generate1 +{ my ($p,$inout,$ivec)=@_; $inout=$inout0 if (!defined($inout)); + $sn++; + + &$movekey ($rndkey0,&QWP(0,$key)); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($ivec,$rndkey0) if (defined($ivec)); + &lea ($key,&DWP(32,$key)); + &xorps ($inout,$ivec) if (defined($ivec)); + &xorps ($inout,$rndkey0) if (!defined($ivec)); + &set_label("${p}1_loop_$sn"); + eval"&aes${p} ($inout,$rndkey1)"; + &dec ($rounds); + &$movekey ($rndkey1,&QWP(0,$key)); + &lea ($key,&DWP(16,$key)); + &jnz (&label("${p}1_loop_$sn")); + eval"&aes${p}last ($inout,$rndkey1)"; +}} + +sub aesni_generate1 # fully unrolled loop +{ my ($p,$inout)=@_; $inout=$inout0 if (!defined($inout)); + + &function_begin_B("_aesni_${p}rypt1"); + &movups ($rndkey0,&QWP(0,$key)); + &$movekey ($rndkey1,&QWP(0x10,$key)); + &xorps ($inout,$rndkey0); + &$movekey ($rndkey0,&QWP(0x20,$key)); + &lea ($key,&DWP(0x30,$key)); + &cmp ($rounds,11); + &jb (&label("${p}128")); + &lea ($key,&DWP(0x20,$key)); + &je (&label("${p}192")); + &lea ($key,&DWP(0x20,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(-0x40,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-0x30,$key)); + &set_label("${p}192"); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(-0x20,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-0x10,$key)); + &set_label("${p}128"); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x10,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0x20,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x30,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0x40,$key)); + eval"&aes${p} 
($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x50,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0x60,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x70,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + eval"&aes${p}last ($inout,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt1"); +} + +# void $PREFIX_encrypt (const void *inp,void *out,const AES_KEY *key); +&aesni_generate1("enc") if (!$inline); +&function_begin_B("${PREFIX}_encrypt"); + &mov ("eax",&wparam(0)); + &mov ($key,&wparam(2)); + &movups ($inout0,&QWP(0,"eax")); + &mov ($rounds,&DWP(240,$key)); + &mov ("eax",&wparam(1)); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &pxor ($rndkey0,$rndkey0); # clear register bank + &pxor ($rndkey1,$rndkey1); + &movups (&QWP(0,"eax"),$inout0); + &pxor ($inout0,$inout0); + &ret (); +&function_end_B("${PREFIX}_encrypt"); + +# void $PREFIX_decrypt (const void *inp,void *out,const AES_KEY *key); +&aesni_generate1("dec") if(!$inline); +&function_begin_B("${PREFIX}_decrypt"); + &mov ("eax",&wparam(0)); + &mov ($key,&wparam(2)); + &movups ($inout0,&QWP(0,"eax")); + &mov ($rounds,&DWP(240,$key)); + &mov ("eax",&wparam(1)); + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &pxor ($rndkey0,$rndkey0); # clear register bank + &pxor ($rndkey1,$rndkey1); + &movups (&QWP(0,"eax"),$inout0); + &pxor ($inout0,$inout0); + &ret (); +&function_end_B("${PREFIX}_decrypt"); + +# _aesni_[en|de]cryptN are private interfaces, N denotes interleave +# factor. Why 3x subroutine were originally used in loops? Even though +# aes[enc|dec] latency was originally 6, it could be scheduled only +# every *2nd* cycle. Thus 3x interleave was the one providing optimal +# utilization, i.e. when subroutine's throughput is virtually same as +# of non-interleaved subroutine [for number of input blocks up to 3]. +# This is why it originally made no sense to implement 2x subroutine. +# But times change and it became appropriate to spend extra 192 bytes +# on 2x subroutine on Atom Silvermont account. For processors that +# can schedule aes[enc|dec] every cycle optimal interleave factor +# equals to corresponding instructions latency. 8x is optimal for +# * Bridge, but it's unfeasible to accommodate such implementation +# in XMM registers addressable in 32-bit mode and therefore maximum +# of 6x is used instead... 
+ +sub aesni_generate2 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt2"); + &$movekey ($rndkey0,&QWP(0,$key)); + &shl ($rounds,4); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key)); + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + &add ($rounds,16); + + &set_label("${p}2_loop"); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}2_loop")); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt2"); +} + +sub aesni_generate3 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt3"); + &$movekey ($rndkey0,&QWP(0,$key)); + &shl ($rounds,4); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); + &pxor ($inout2,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key)); + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + &add ($rounds,16); + + &set_label("${p}3_loop"); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + eval"&aes${p} ($inout2,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}3_loop")); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + eval"&aes${p}last ($inout2,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt3"); +} + +# 4x interleave is implemented to improve small block performance, +# most notably [and naturally] 4 block by ~30%. One can argue that one +# should have implemented 5x as well, but improvement would be <20%, +# so it's not worth it... 
+sub aesni_generate4 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt4"); + &$movekey ($rndkey0,&QWP(0,$key)); + &$movekey ($rndkey1,&QWP(16,$key)); + &shl ($rounds,4); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); + &pxor ($inout2,$rndkey0); + &pxor ($inout3,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key)); + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + &data_byte (0x0f,0x1f,0x40,0x00); + &add ($rounds,16); + + &set_label("${p}4_loop"); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p} ($inout3,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + eval"&aes${p} ($inout2,$rndkey0)"; + eval"&aes${p} ($inout3,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}4_loop")); + + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p} ($inout3,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + eval"&aes${p}last ($inout2,$rndkey0)"; + eval"&aes${p}last ($inout3,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt4"); +} + +sub aesni_generate6 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt6"); + &static_label("_aesni_${p}rypt6_enter"); + &$movekey ($rndkey0,&QWP(0,$key)); + &shl ($rounds,4); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); # pxor does better here + &pxor ($inout2,$rndkey0); + eval"&aes${p} ($inout0,$rndkey1)"; + &pxor ($inout3,$rndkey0); + &pxor ($inout4,$rndkey0); + eval"&aes${p} ($inout1,$rndkey1)"; + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + eval"&aes${p} ($inout2,$rndkey1)"; + &pxor ($inout5,$rndkey0); + &$movekey ($rndkey0,&QWP(0,$key,$rounds)); + &add ($rounds,16); + &jmp (&label("_aesni_${p}rypt6_inner")); + + &set_label("${p}6_loop",16); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + &set_label("_aesni_${p}rypt6_inner"); + eval"&aes${p} ($inout3,$rndkey1)"; + eval"&aes${p} ($inout4,$rndkey1)"; + eval"&aes${p} ($inout5,$rndkey1)"; + &set_label("_aesni_${p}rypt6_enter"); + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + eval"&aes${p} ($inout2,$rndkey0)"; + eval"&aes${p} ($inout3,$rndkey0)"; + eval"&aes${p} ($inout4,$rndkey0)"; + eval"&aes${p} ($inout5,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}6_loop")); + + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p} ($inout3,$rndkey1)"; + eval"&aes${p} ($inout4,$rndkey1)"; + eval"&aes${p} ($inout5,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + eval"&aes${p}last ($inout2,$rndkey0)"; + eval"&aes${p}last ($inout3,$rndkey0)"; + eval"&aes${p}last ($inout4,$rndkey0)"; + eval"&aes${p}last ($inout5,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt6"); +} +&aesni_generate2("enc") if ($PREFIX eq "aesni"); +&aesni_generate2("dec"); +&aesni_generate3("enc") if ($PREFIX eq "aesni"); +&aesni_generate3("dec"); +&aesni_generate4("enc") if ($PREFIX eq "aesni"); +&aesni_generate4("dec"); +&aesni_generate6("enc") if ($PREFIX eq "aesni"); +&aesni_generate6("dec"); + +if ($PREFIX eq "aesni") { 
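+# The mode drivers below stash $key and $rounds in $key_ and $rounds_ +# around calls to the shared _aesni_[en|de]cryptN helpers, because the +# interleaved subroutines advance $key and count $rounds down as they +# encrypt.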
+###################################################################### +# void aesni_ecb_encrypt (const void *in, void *out, +# size_t length, const AES_KEY *key, +# int enc); +&function_begin("aesni_ecb_encrypt"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &and ($len,-16); + &jz (&label("ecb_ret")); + &mov ($rounds,&DWP(240,$key)); + &test ($rounds_,$rounds_); + &jz (&label("ecb_decrypt")); + + &mov ($key_,$key); # backup $key + &mov ($rounds_,$rounds); # backup $rounds + &cmp ($len,0x60); + &jb (&label("ecb_enc_tail")); + + &movdqu ($inout0,&QWP(0,$inp)); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); + &sub ($len,0x60); + &jmp (&label("ecb_enc_loop6_enter")); + +&set_label("ecb_enc_loop6",16); + &movups (&QWP(0,$out),$inout0); + &movdqu ($inout0,&QWP(0,$inp)); + &movups (&QWP(0x10,$out),$inout1); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movups (&QWP(0x20,$out),$inout2); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movups (&QWP(0x30,$out),$inout3); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movups (&QWP(0x40,$out),$inout4); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); +&set_label("ecb_enc_loop6_enter"); + + &call ("_aesni_encrypt6"); + + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + &sub ($len,0x60); + &jnc (&label("ecb_enc_loop6")); + + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &add ($len,0x60); + &jz (&label("ecb_ret")); + +&set_label("ecb_enc_tail"); + &movups ($inout0,&QWP(0,$inp)); + &cmp ($len,0x20); + &jb (&label("ecb_enc_one")); + &movups ($inout1,&QWP(0x10,$inp)); + &je (&label("ecb_enc_two")); + &movups ($inout2,&QWP(0x20,$inp)); + &cmp ($len,0x40); + &jb (&label("ecb_enc_three")); + &movups ($inout3,&QWP(0x30,$inp)); + &je (&label("ecb_enc_four")); + &movups ($inout4,&QWP(0x40,$inp)); + &xorps ($inout5,$inout5); + &call ("_aesni_encrypt6"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + jmp (&label("ecb_ret")); + +&set_label("ecb_enc_one",16); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &movups (&QWP(0,$out),$inout0); + &jmp (&label("ecb_ret")); + +&set_label("ecb_enc_two",16); + &call ("_aesni_encrypt2"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &jmp (&label("ecb_ret")); + +&set_label("ecb_enc_three",16); + &call ("_aesni_encrypt3"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &jmp (&label("ecb_ret")); + +&set_label("ecb_enc_four",16); + &call ("_aesni_encrypt4"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &jmp (&label("ecb_ret")); +###################################################################### +&set_label("ecb_decrypt",16); + &mov ($key_,$key); # backup $key + &mov 
($rounds_,$rounds); # backup $rounds + &cmp ($len,0x60); + &jb (&label("ecb_dec_tail")); + + &movdqu ($inout0,&QWP(0,$inp)); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); + &sub ($len,0x60); + &jmp (&label("ecb_dec_loop6_enter")); + +&set_label("ecb_dec_loop6",16); + &movups (&QWP(0,$out),$inout0); + &movdqu ($inout0,&QWP(0,$inp)); + &movups (&QWP(0x10,$out),$inout1); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movups (&QWP(0x20,$out),$inout2); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movups (&QWP(0x30,$out),$inout3); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movups (&QWP(0x40,$out),$inout4); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); +&set_label("ecb_dec_loop6_enter"); + + &call ("_aesni_decrypt6"); + + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + &sub ($len,0x60); + &jnc (&label("ecb_dec_loop6")); + + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &add ($len,0x60); + &jz (&label("ecb_ret")); + +&set_label("ecb_dec_tail"); + &movups ($inout0,&QWP(0,$inp)); + &cmp ($len,0x20); + &jb (&label("ecb_dec_one")); + &movups ($inout1,&QWP(0x10,$inp)); + &je (&label("ecb_dec_two")); + &movups ($inout2,&QWP(0x20,$inp)); + &cmp ($len,0x40); + &jb (&label("ecb_dec_three")); + &movups ($inout3,&QWP(0x30,$inp)); + &je (&label("ecb_dec_four")); + &movups ($inout4,&QWP(0x40,$inp)); + &xorps ($inout5,$inout5); + &call ("_aesni_decrypt6"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_one",16); + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &movups (&QWP(0,$out),$inout0); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_two",16); + &call ("_aesni_decrypt2"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_three",16); + &call ("_aesni_decrypt3"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_four",16); + &call ("_aesni_decrypt4"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + +&set_label("ecb_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &pxor ("xmm6","xmm6"); + &pxor ("xmm7","xmm7"); +&function_end("aesni_ecb_encrypt"); + +###################################################################### +# void aesni_ccm64_[en|de]crypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec,char *cmac); +# +# Handles only complete blocks, operates on 64-bit counter and +# does not update *ivec! 
Nor does it finalize CMAC value +# (see engine/eng_aesni.c for details) +# +{ my $cmac=$inout1; +&function_begin("aesni_ccm64_encrypt_blocks"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &mov ($rounds,&wparam(5)); + &mov ($key_,"esp"); + &sub ("esp",60); + &and ("esp",-16); # align stack + &mov (&DWP(48,"esp"),$key_); + + &movdqu ($ivec,&QWP(0,$rounds_)); # load ivec + &movdqu ($cmac,&QWP(0,$rounds)); # load cmac + &mov ($rounds,&DWP(240,$key)); + + # compose byte-swap control mask for pshufb on stack + &mov (&DWP(0,"esp"),0x0c0d0e0f); + &mov (&DWP(4,"esp"),0x08090a0b); + &mov (&DWP(8,"esp"),0x04050607); + &mov (&DWP(12,"esp"),0x00010203); + + # compose counter increment vector on stack + &mov ($rounds_,1); + &xor ($key_,$key_); + &mov (&DWP(16,"esp"),$rounds_); + &mov (&DWP(20,"esp"),$key_); + &mov (&DWP(24,"esp"),$key_); + &mov (&DWP(28,"esp"),$key_); + + &shl ($rounds,4); + &mov ($rounds_,16); + &lea ($key_,&DWP(0,$key)); + &movdqa ($inout3,&QWP(0,"esp")); + &movdqa ($inout0,$ivec); + &lea ($key,&DWP(32,$key,$rounds)); + &sub ($rounds_,$rounds); + &pshufb ($ivec,$inout3); + +&set_label("ccm64_enc_outer"); + &$movekey ($rndkey0,&QWP(0,$key_)); + &mov ($rounds,$rounds_); + &movups ($in0,&QWP(0,$inp)); + + &xorps ($inout0,$rndkey0); + &$movekey ($rndkey1,&QWP(16,$key_)); + &xorps ($rndkey0,$in0); + &xorps ($cmac,$rndkey0); # cmac^=inp + &$movekey ($rndkey0,&QWP(32,$key_)); + +&set_label("ccm64_enc2_loop"); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + &aesenc ($inout0,$rndkey0); + &aesenc ($cmac,$rndkey0); + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("ccm64_enc2_loop")); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &paddq ($ivec,&QWP(16,"esp")); + &dec ($len); + &aesenclast ($inout0,$rndkey0); + &aesenclast ($cmac,$rndkey0); + + &lea ($inp,&DWP(16,$inp)); + &xorps ($in0,$inout0); # inp^=E(ivec) + &movdqa ($inout0,$ivec); + &movups (&QWP(0,$out),$in0); # save output + &pshufb ($inout0,$inout3); + &lea ($out,&DWP(16,$out)); + &jnz (&label("ccm64_enc_outer")); + + &mov ("esp",&DWP(48,"esp")); + &mov ($out,&wparam(5)); + &movups (&QWP(0,$out),$cmac); + + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &pxor ("xmm6","xmm6"); + &pxor ("xmm7","xmm7"); +&function_end("aesni_ccm64_encrypt_blocks"); + +&function_begin("aesni_ccm64_decrypt_blocks"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &mov ($rounds,&wparam(5)); + &mov ($key_,"esp"); + &sub ("esp",60); + &and ("esp",-16); # align stack + &mov (&DWP(48,"esp"),$key_); + + &movdqu ($ivec,&QWP(0,$rounds_)); # load ivec + &movdqu ($cmac,&QWP(0,$rounds)); # load cmac + &mov ($rounds,&DWP(240,$key)); + + # compose byte-swap control mask for pshufb on stack + &mov (&DWP(0,"esp"),0x0c0d0e0f); + &mov (&DWP(4,"esp"),0x08090a0b); + &mov (&DWP(8,"esp"),0x04050607); + &mov (&DWP(12,"esp"),0x00010203); + + # compose counter increment vector on stack + &mov ($rounds_,1); + &xor ($key_,$key_); + &mov (&DWP(16,"esp"),$rounds_); + &mov (&DWP(20,"esp"),$key_); + &mov (&DWP(24,"esp"),$key_); + &mov (&DWP(28,"esp"),$key_); + + &movdqa ($inout3,&QWP(0,"esp")); # bswap mask + &movdqa ($inout0,$ivec); + + &mov ($key_,$key); + &mov ($rounds_,$rounds); + + 
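+ # Editor's note, mode-level illustration (not generated code): CCM is
+ # CTR encryption interleaved with a CBC-MAC over the *plaintext*, so on
+ # decrypt the keystream block E(ctr) must be produced before the MAC can
+ # absorb the recovered plaintext - hence the lone single-block encryption
+ # of the counter right below, before the interleaved two-track loop.  A
+ # sketch with a caller-supplied 16-byte block cipher $E (all names
+ # hypothetical; counter wrap-around ignored):
+ sub demo_ccm64_decrypt {
+     my ($E, $ctr, $cmac, @ct) = @_;         # 16-byte strings throughout
+     my @pt;
+     for my $c (@ct) {
+         my $p = $c ^ $E->($ctr);            # CTR part first...
+         $cmac = $E->($cmac ^ $p);           # ...then CBC-MAC over plaintext
+         my ($hi, $lo) = unpack "Q> Q>", $ctr;
+         $ctr = pack "Q> Q>", $hi, $lo + 1;  # 64-bit big-endian counter
+         push @pt, $p;
+     }
+     return ($cmac, @pt);                    # caller finalizes the CMAC tag
+ }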
&pshufb ($ivec,$inout3); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &shl ($rounds_,4); + &mov ($rounds,16); + &movups ($in0,&QWP(0,$inp)); # load inp + &paddq ($ivec,&QWP(16,"esp")); + &lea ($inp,&QWP(16,$inp)); + &sub ($rounds,$rounds_); + &lea ($key,&DWP(32,$key_,$rounds_)); + &mov ($rounds_,$rounds); + &jmp (&label("ccm64_dec_outer")); + +&set_label("ccm64_dec_outer",16); + &xorps ($in0,$inout0); # inp ^= E(ivec) + &movdqa ($inout0,$ivec); + &movups (&QWP(0,$out),$in0); # save output + &lea ($out,&DWP(16,$out)); + &pshufb ($inout0,$inout3); + + &sub ($len,1); + &jz (&label("ccm64_dec_break")); + + &$movekey ($rndkey0,&QWP(0,$key_)); + &mov ($rounds,$rounds_); + &$movekey ($rndkey1,&QWP(16,$key_)); + &xorps ($in0,$rndkey0); + &xorps ($inout0,$rndkey0); + &xorps ($cmac,$in0); # cmac^=out + &$movekey ($rndkey0,&QWP(32,$key_)); + +&set_label("ccm64_dec2_loop"); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + &aesenc ($inout0,$rndkey0); + &aesenc ($cmac,$rndkey0); + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("ccm64_dec2_loop")); + &movups ($in0,&QWP(0,$inp)); # load inp + &paddq ($ivec,&QWP(16,"esp")); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &aesenclast ($inout0,$rndkey0); + &aesenclast ($cmac,$rndkey0); + &lea ($inp,&QWP(16,$inp)); + &jmp (&label("ccm64_dec_outer")); + +&set_label("ccm64_dec_break",16); + &mov ($rounds,&DWP(240,$key_)); + &mov ($key,$key_); + if ($inline) + { &aesni_inline_generate1("enc",$cmac,$in0); } + else + { &call ("_aesni_encrypt1",$cmac); } + + &mov ("esp",&DWP(48,"esp")); + &mov ($out,&wparam(5)); + &movups (&QWP(0,$out),$cmac); + + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &pxor ("xmm6","xmm6"); + &pxor ("xmm7","xmm7"); +&function_end("aesni_ccm64_decrypt_blocks"); +} + +###################################################################### +# void aesni_ctr32_encrypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec); +# +# Handles only complete blocks, operates on 32-bit counter and +# does not update *ivec! 
(see crypto/modes/ctr128.c for details) +# +# stack layout: +# 0 pshufb mask +# 16 vector addend: 0,6,6,6 +# 32 counter-less ivec +# 48 1st triplet of counter vector +# 64 2nd triplet of counter vector +# 80 saved %esp + +&function_begin("aesni_ctr32_encrypt_blocks"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &mov ($key_,"esp"); + &sub ("esp",88); + &and ("esp",-16); # align stack + &mov (&DWP(80,"esp"),$key_); + + &cmp ($len,1); + &je (&label("ctr32_one_shortcut")); + + &movdqu ($inout5,&QWP(0,$rounds_)); # load ivec + + # compose byte-swap control mask for pshufb on stack + &mov (&DWP(0,"esp"),0x0c0d0e0f); + &mov (&DWP(4,"esp"),0x08090a0b); + &mov (&DWP(8,"esp"),0x04050607); + &mov (&DWP(12,"esp"),0x00010203); + + # compose counter increment vector on stack + &mov ($rounds,6); + &xor ($key_,$key_); + &mov (&DWP(16,"esp"),$rounds); + &mov (&DWP(20,"esp"),$rounds); + &mov (&DWP(24,"esp"),$rounds); + &mov (&DWP(28,"esp"),$key_); + + &pextrd ($rounds_,$inout5,3); # pull 32-bit counter + &pinsrd ($inout5,$key_,3); # wipe 32-bit counter + + &mov ($rounds,&DWP(240,$key)); # key->rounds + + # compose 2 vectors of 3x32-bit counters + &bswap ($rounds_); + &pxor ($rndkey0,$rndkey0); + &pxor ($rndkey1,$rndkey1); + &movdqa ($inout0,&QWP(0,"esp")); # load byte-swap mask + &pinsrd ($rndkey0,$rounds_,0); + &lea ($key_,&DWP(3,$rounds_)); + &pinsrd ($rndkey1,$key_,0); + &inc ($rounds_); + &pinsrd ($rndkey0,$rounds_,1); + &inc ($key_); + &pinsrd ($rndkey1,$key_,1); + &inc ($rounds_); + &pinsrd ($rndkey0,$rounds_,2); + &inc ($key_); + &pinsrd ($rndkey1,$key_,2); + &movdqa (&QWP(48,"esp"),$rndkey0); # save 1st triplet + &pshufb ($rndkey0,$inout0); # byte swap + &movdqu ($inout4,&QWP(0,$key)); # key[0] + &movdqa (&QWP(64,"esp"),$rndkey1); # save 2nd triplet + &pshufb ($rndkey1,$inout0); # byte swap + + &pshufd ($inout0,$rndkey0,3<<6); # place counter to upper dword + &pshufd ($inout1,$rndkey0,2<<6); + &cmp ($len,6); + &jb (&label("ctr32_tail")); + &pxor ($inout5,$inout4); # counter-less ivec^key[0] + &shl ($rounds,4); + &mov ($rounds_,16); + &movdqa (&QWP(32,"esp"),$inout5); # save counter-less ivec^key[0] + &mov ($key_,$key); # backup $key + &sub ($rounds_,$rounds); # backup twisted $rounds + &lea ($key,&DWP(32,$key,$rounds)); + &sub ($len,6); + &jmp (&label("ctr32_loop6")); + +&set_label("ctr32_loop6",16); + # inlining _aesni_encrypt6's prologue gives ~6% improvement... 
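+ # Editor's note, illustration: the two triplet registers each hold three
+ # byte-swapped 32-bit counters in dwords 3..1 and keep dword 0 zero, so
+ # pshufd with selector 3<<6 / 2<<6 / 1<<6 drops counter n / n+1 / n+2
+ # into the top dword and zeroes the low 96 bits; the pxor with the saved
+ # "counter-less ivec^key[0]" then fills those bits and applies round 0
+ # in the same shot.  String-level equivalent for one block (sub name
+ # hypothetical):
+ sub demo_compose_ctr_block {
+     my ($iv_no_ctr, $ctr32) = @_;  # 16-byte IV, last 4 bytes (counter) zero
+     return $iv_no_ctr | ("\0" x 12) . pack("N", $ctr32 & 0xffffffff);
+ }
+ # six such blocks per iteration, counters n..n+5; afterwards the saved
+ # unswapped triplets are advanced by the (6,6,6,0) addend via paddd.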
+ &pshufd ($inout2,$rndkey0,1<<6); + &movdqa ($rndkey0,&QWP(32,"esp")); # pull counter-less ivec + &pshufd ($inout3,$rndkey1,3<<6); + &pxor ($inout0,$rndkey0); # merge counter-less ivec + &pshufd ($inout4,$rndkey1,2<<6); + &pxor ($inout1,$rndkey0); + &pshufd ($inout5,$rndkey1,1<<6); + &$movekey ($rndkey1,&QWP(16,$key_)); + &pxor ($inout2,$rndkey0); + &pxor ($inout3,$rndkey0); + &aesenc ($inout0,$rndkey1); + &pxor ($inout4,$rndkey0); + &pxor ($inout5,$rndkey0); + &aesenc ($inout1,$rndkey1); + &$movekey ($rndkey0,&QWP(32,$key_)); + &mov ($rounds,$rounds_); + &aesenc ($inout2,$rndkey1); + &aesenc ($inout3,$rndkey1); + &aesenc ($inout4,$rndkey1); + &aesenc ($inout5,$rndkey1); + + &call (&label("_aesni_encrypt6_enter")); + + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout1,$rndkey0); + &movups (&QWP(0,$out),$inout0); + &movdqa ($rndkey0,&QWP(16,"esp")); # load increment + &xorps ($inout2,$rndkey1); + &movdqa ($rndkey1,&QWP(64,"esp")); # load 2nd triplet + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + + &paddd ($rndkey1,$rndkey0); # 2nd triplet increment + &paddd ($rndkey0,&QWP(48,"esp")); # 1st triplet increment + &movdqa ($inout0,&QWP(0,"esp")); # load byte swap mask + + &movups ($inout1,&QWP(0x30,$inp)); + &movups ($inout2,&QWP(0x40,$inp)); + &xorps ($inout3,$inout1); + &movups ($inout1,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); + &movdqa (&QWP(48,"esp"),$rndkey0); # save 1st triplet + &pshufb ($rndkey0,$inout0); # byte swap + &xorps ($inout4,$inout2); + &movups (&QWP(0x30,$out),$inout3); + &xorps ($inout5,$inout1); + &movdqa (&QWP(64,"esp"),$rndkey1); # save 2nd triplet + &pshufb ($rndkey1,$inout0); # byte swap + &movups (&QWP(0x40,$out),$inout4); + &pshufd ($inout0,$rndkey0,3<<6); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + + &pshufd ($inout1,$rndkey0,2<<6); + &sub ($len,6); + &jnc (&label("ctr32_loop6")); + + &add ($len,6); + &jz (&label("ctr32_ret")); + &movdqu ($inout5,&QWP(0,$key_)); + &mov ($key,$key_); + &pxor ($inout5,&QWP(32,"esp")); # restore count-less ivec + &mov ($rounds,&DWP(240,$key_)); # restore $rounds + +&set_label("ctr32_tail"); + &por ($inout0,$inout5); + &cmp ($len,2); + &jb (&label("ctr32_one")); + + &pshufd ($inout2,$rndkey0,1<<6); + &por ($inout1,$inout5); + &je (&label("ctr32_two")); + + &pshufd ($inout3,$rndkey1,3<<6); + &por ($inout2,$inout5); + &cmp ($len,4); + &jb (&label("ctr32_three")); + + &pshufd ($inout4,$rndkey1,2<<6); + &por ($inout3,$inout5); + &je (&label("ctr32_four")); + + &por ($inout4,$inout5); + &call ("_aesni_encrypt6"); + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout1,$rndkey0); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout2,$rndkey1); + &movups ($rndkey1,&QWP(0x40,$inp)); + &xorps ($inout3,$rndkey0); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout4,$rndkey1); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &jmp (&label("ctr32_ret")); + +&set_label("ctr32_one_shortcut",16); + &movups ($inout0,&QWP(0,$rounds_)); # load ivec + &mov ($rounds,&DWP(240,$key)); + +&set_label("ctr32_one"); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &movups ($in0,&QWP(0,$inp)); + &xorps ($in0,$inout0); + &movups (&QWP(0,$out),$in0); + 
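+ # Editor's note: this one-block tail is plain CTR, out = in XOR E(iv);
+ # with a caller-supplied block cipher $E (hypothetical, as above):
+ sub demo_ctr32_one { my ($E, $iv, $in) = @_; return $in ^ $E->($iv); }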
&jmp (&label("ctr32_ret")); + +&set_label("ctr32_two",16); + &call ("_aesni_encrypt2"); + &movups ($inout3,&QWP(0,$inp)); + &movups ($inout4,&QWP(0x10,$inp)); + &xorps ($inout0,$inout3); + &xorps ($inout1,$inout4); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &jmp (&label("ctr32_ret")); + +&set_label("ctr32_three",16); + &call ("_aesni_encrypt3"); + &movups ($inout3,&QWP(0,$inp)); + &movups ($inout4,&QWP(0x10,$inp)); + &xorps ($inout0,$inout3); + &movups ($inout5,&QWP(0x20,$inp)); + &xorps ($inout1,$inout4); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout2,$inout5); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &jmp (&label("ctr32_ret")); + +&set_label("ctr32_four",16); + &call ("_aesni_encrypt4"); + &movups ($inout4,&QWP(0,$inp)); + &movups ($inout5,&QWP(0x10,$inp)); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout0,$inout4); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout1,$inout5); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout2,$rndkey1); + &movups (&QWP(0x10,$out),$inout1); + &xorps ($inout3,$rndkey0); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + +&set_label("ctr32_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &movdqa (&QWP(32,"esp"),"xmm0"); # clear stack + &pxor ("xmm5","xmm5"); + &movdqa (&QWP(48,"esp"),"xmm0"); + &pxor ("xmm6","xmm6"); + &movdqa (&QWP(64,"esp"),"xmm0"); + &pxor ("xmm7","xmm7"); + &mov ("esp",&DWP(80,"esp")); +&function_end("aesni_ctr32_encrypt_blocks"); + +###################################################################### +# void aesni_xts_[en|de]crypt(const char *inp,char *out,size_t len, +# const AES_KEY *key1, const AES_KEY *key2 +# const unsigned char iv[16]); +# +{ my ($tweak,$twtmp,$twres,$twmask)=($rndkey1,$rndkey0,$inout0,$inout1); + +&function_begin("aesni_xts_encrypt"); + &mov ($key,&wparam(4)); # key2 + &mov ($inp,&wparam(5)); # clear-text tweak + + &mov ($rounds,&DWP(240,$key)); # key2->rounds + &movups ($inout0,&QWP(0,$inp)); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); # key1 + + &mov ($key_,"esp"); + &sub ("esp",16*7+8); + &mov ($rounds,&DWP(240,$key)); # key1->rounds + &and ("esp",-16); # align stack + + &mov (&DWP(16*6+0,"esp"),0x87); # compose the magic constant + &mov (&DWP(16*6+4,"esp"),0); + &mov (&DWP(16*6+8,"esp"),1); + &mov (&DWP(16*6+12,"esp"),0); + &mov (&DWP(16*7+0,"esp"),$len); # save original $len + &mov (&DWP(16*7+4,"esp"),$key_); # save original %esp + + &movdqa ($tweak,$inout0); + &pxor ($twtmp,$twtmp); + &movdqa ($twmask,&QWP(6*16,"esp")); # 0x0...010...87 + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + + &and ($len,-16); + &mov ($key_,$key); # backup $key + &mov ($rounds_,$rounds); # backup $rounds + &sub ($len,16*6); + &jc (&label("xts_enc_short")); + + &shl ($rounds,4); + &mov ($rounds_,16); + &sub ($rounds_,$rounds); + &lea ($key,&DWP(32,$key,$rounds)); + &jmp (&label("xts_enc_loop6")); + +&set_label("xts_enc_loop6",16); + for ($i=0;$i<4;$i++) { + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa (&QWP(16*$i,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + } + &pshufd ($inout5,$twtmp,0x13); + 
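+ # Editor's note, illustration: every tweak update in this loop is a
+ # multiplication by x in GF(2^128).  SSE2 has no 128-bit shift, so paddq
+ # doubles the two 64-bit halves separately, and the pcmpgtd
+ # sign-broadcast, pshufd(0x13) and pand with the (0x87,0,1,0) mask
+ # recreate the two lost carries: bit 63 into the high half, bit 127
+ # folded back into the low byte as the reduction polynomial 0x87.
+ # Whole-width equivalent (Math::BigInt is core Perl; sub name
+ # hypothetical):
+ use Math::BigInt;
+ sub demo_xts_next_tweak {
+     my $t = shift->copy->blsft(1);                # tweak << 1
+     my $carry = !$t->copy->brsft(128)->is_zero;   # former bit 127
+     $t->band(Math::BigInt->new(1)->blsft(128)->bsub(1));  # keep 128 bits
+     $t->bxor(Math::BigInt->new(0x87)) if $carry;  # mod x^128+x^7+x^2+x+1
+     return $t;
+ }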
&movdqa (&QWP(16*$i++,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &$movekey ($rndkey0,&QWP(0,$key_)); + &pand ($inout5,$twmask); # isolate carry and residue + &movups ($inout0,&QWP(0,$inp)); # load input + &pxor ($inout5,$tweak); + + # inline _aesni_encrypt6 prologue and flip xor with tweak and key[0] + &mov ($rounds,$rounds_); # restore $rounds + &movdqu ($inout1,&QWP(16*1,$inp)); + &xorps ($inout0,$rndkey0); # input^=rndkey[0] + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout1,$rndkey0); + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout2,$rndkey0); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout3,$rndkey0); + &movdqu ($rndkey1,&QWP(16*5,$inp)); + &pxor ($inout4,$rndkey0); + &lea ($inp,&DWP(16*6,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqa (&QWP(16*$i,"esp"),$inout5); # save last tweak + &pxor ($inout5,$rndkey1); + + &$movekey ($rndkey1,&QWP(16,$key_)); + &pxor ($inout1,&QWP(16*1,"esp")); + &pxor ($inout2,&QWP(16*2,"esp")); + &aesenc ($inout0,$rndkey1); + &pxor ($inout3,&QWP(16*3,"esp")); + &pxor ($inout4,&QWP(16*4,"esp")); + &aesenc ($inout1,$rndkey1); + &pxor ($inout5,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key_)); + &aesenc ($inout2,$rndkey1); + &aesenc ($inout3,$rndkey1); + &aesenc ($inout4,$rndkey1); + &aesenc ($inout5,$rndkey1); + &call (&label("_aesni_encrypt6_enter")); + + &movdqa ($tweak,&QWP(16*5,"esp")); # last tweak + &pxor ($twtmp,$twtmp); + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &xorps ($inout1,&QWP(16*1,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*2,$out),$inout2); + &xorps ($inout4,&QWP(16*4,"esp")); + &movups (&QWP(16*3,$out),$inout3); + &xorps ($inout5,$tweak); + &movups (&QWP(16*4,$out),$inout4); + &pshufd ($twres,$twtmp,0x13); + &movups (&QWP(16*5,$out),$inout5); + &lea ($out,&DWP(16*6,$out)); + &movdqa ($twmask,&QWP(16*6,"esp")); # 0x0...010...87 + + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + + &sub ($len,16*6); + &jnc (&label("xts_enc_loop6")); + + &mov ($rounds,&DWP(240,$key_)); # restore $rounds + &mov ($key,$key_); # restore $key + &mov ($rounds_,$rounds); + +&set_label("xts_enc_short"); + &add ($len,16*6); + &jz (&label("xts_enc_done6x")); + + &movdqa ($inout3,$tweak); # put aside previous tweak + &cmp ($len,0x20); + &jb (&label("xts_enc_one")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &je (&label("xts_enc_two")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($inout4,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &cmp ($len,0x40); + &jb (&label("xts_enc_three")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($inout5,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &movdqa 
(&QWP(16*0,"esp"),$inout3); + &movdqa (&QWP(16*1,"esp"),$inout4); + &je (&label("xts_enc_four")); + + &movdqa (&QWP(16*2,"esp"),$inout5); + &pshufd ($inout5,$twtmp,0x13); + &movdqa (&QWP(16*3,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($inout0,1); + &pand ($inout5,$twmask); # isolate carry and residue + &pxor ($inout5,$tweak); + + &movdqu ($inout0,&QWP(16*0,$inp)); # load input + &movdqu ($inout1,&QWP(16*1,$inp)); + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout1,&QWP(16*1,"esp")); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout2,&QWP(16*2,"esp")); + &lea ($inp,&DWP(16*5,$inp)); + &pxor ($inout3,&QWP(16*3,"esp")); + &movdqa (&QWP(16*4,"esp"),$inout5); # save last tweak + &pxor ($inout4,$inout5); + + &call ("_aesni_encrypt6"); + + &movaps ($tweak,&QWP(16*4,"esp")); # last tweak + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout4,$tweak); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &movups (&QWP(16*4,$out),$inout4); + &lea ($out,&DWP(16*5,$out)); + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_one",16); + &movups ($inout0,&QWP(16*0,$inp)); # load input + &lea ($inp,&DWP(16*1,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(16*0,$out),$inout0); # write output + &lea ($out,&DWP(16*1,$out)); + + &movdqa ($tweak,$inout3); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_two",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &lea ($inp,&DWP(16*2,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + + &call ("_aesni_encrypt2"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &lea ($out,&DWP(16*2,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_three",16); + &movaps ($inout5,$tweak); # put aside last tweak + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &lea ($inp,&DWP(16*3,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + + &call ("_aesni_encrypt3"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); + &lea ($out,&DWP(16*3,$out)); + + &movdqa ($tweak,$inout5); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_four",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &xorps ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movups ($inout3,&QWP(16*3,$inp)); + &lea ($inp,&DWP(16*4,$inp)); + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &xorps ($inout3,$inout4); + + &call ("_aesni_encrypt4"); + + &xorps 
($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,$inout4); + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &lea ($out,&DWP(16*4,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_done6x",16); # $tweak is pre-calculated + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &and ($len,15); + &jz (&label("xts_enc_ret")); + &movdqa ($inout3,$tweak); + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &jmp (&label("xts_enc_steal")); + +&set_label("xts_enc_done",16); + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &pxor ($twtmp,$twtmp); + &and ($len,15); + &jz (&label("xts_enc_ret")); + + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &pshufd ($inout3,$twtmp,0x13); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($inout3,&QWP(16*6,"esp")); # isolate carry and residue + &pxor ($inout3,$tweak); + +&set_label("xts_enc_steal"); + &movz ($rounds,&BP(0,$inp)); + &movz ($key,&BP(-16,$out)); + &lea ($inp,&DWP(1,$inp)); + &mov (&BP(-16,$out),&LB($rounds)); + &mov (&BP(0,$out),&LB($key)); + &lea ($out,&DWP(1,$out)); + &sub ($len,1); + &jnz (&label("xts_enc_steal")); + + &sub ($out,&DWP(16*7+0,"esp")); # rewind $out + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + + &movups ($inout0,&QWP(-16,$out)); # load input + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(-16,$out),$inout0); # write output + +&set_label("xts_enc_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &movdqa (&QWP(16*0,"esp"),"xmm0"); # clear stack + &pxor ("xmm3","xmm3"); + &movdqa (&QWP(16*1,"esp"),"xmm0"); + &pxor ("xmm4","xmm4"); + &movdqa (&QWP(16*2,"esp"),"xmm0"); + &pxor ("xmm5","xmm5"); + &movdqa (&QWP(16*3,"esp"),"xmm0"); + &pxor ("xmm6","xmm6"); + &movdqa (&QWP(16*4,"esp"),"xmm0"); + &pxor ("xmm7","xmm7"); + &movdqa (&QWP(16*5,"esp"),"xmm0"); + &mov ("esp",&DWP(16*7+4,"esp")); # restore %esp +&function_end("aesni_xts_encrypt"); + +&function_begin("aesni_xts_decrypt"); + &mov ($key,&wparam(4)); # key2 + &mov ($inp,&wparam(5)); # clear-text tweak + + &mov ($rounds,&DWP(240,$key)); # key2->rounds + &movups ($inout0,&QWP(0,$inp)); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); # key1 + + &mov ($key_,"esp"); + &sub ("esp",16*7+8); + &and ("esp",-16); # align stack + + &xor ($rounds_,$rounds_); # if(len%16) len-=16; + &test ($len,15); + &setnz (&LB($rounds_)); + &shl ($rounds_,4); + &sub ($len,$rounds_); + + &mov (&DWP(16*6+0,"esp"),0x87); # compose the magic constant + &mov (&DWP(16*6+4,"esp"),0); + &mov (&DWP(16*6+8,"esp"),1); + &mov (&DWP(16*6+12,"esp"),0); + &mov (&DWP(16*7+0,"esp"),$len); # save original $len + &mov (&DWP(16*7+4,"esp"),$key_); # save original %esp + + &mov ($rounds,&DWP(240,$key)); # key1->rounds + &mov ($key_,$key); # backup $key + &mov ($rounds_,$rounds); # backup $rounds + + &movdqa ($tweak,$inout0); + &pxor ($twtmp,$twtmp); + &movdqa ($twmask,&QWP(6*16,"esp")); # 0x0...010...87 + &pcmpgtd($twtmp,$tweak); # 
broadcast upper bits + + &and ($len,-16); + &sub ($len,16*6); + &jc (&label("xts_dec_short")); + + &shl ($rounds,4); + &mov ($rounds_,16); + &sub ($rounds_,$rounds); + &lea ($key,&DWP(32,$key,$rounds)); + &jmp (&label("xts_dec_loop6")); + +&set_label("xts_dec_loop6",16); + for ($i=0;$i<4;$i++) { + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa (&QWP(16*$i,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + } + &pshufd ($inout5,$twtmp,0x13); + &movdqa (&QWP(16*$i++,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &$movekey ($rndkey0,&QWP(0,$key_)); + &pand ($inout5,$twmask); # isolate carry and residue + &movups ($inout0,&QWP(0,$inp)); # load input + &pxor ($inout5,$tweak); + + # inline _aesni_encrypt6 prologue and flip xor with tweak and key[0] + &mov ($rounds,$rounds_); + &movdqu ($inout1,&QWP(16*1,$inp)); + &xorps ($inout0,$rndkey0); # input^=rndkey[0] + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout1,$rndkey0); + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout2,$rndkey0); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout3,$rndkey0); + &movdqu ($rndkey1,&QWP(16*5,$inp)); + &pxor ($inout4,$rndkey0); + &lea ($inp,&DWP(16*6,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqa (&QWP(16*$i,"esp"),$inout5); # save last tweak + &pxor ($inout5,$rndkey1); + + &$movekey ($rndkey1,&QWP(16,$key_)); + &pxor ($inout1,&QWP(16*1,"esp")); + &pxor ($inout2,&QWP(16*2,"esp")); + &aesdec ($inout0,$rndkey1); + &pxor ($inout3,&QWP(16*3,"esp")); + &pxor ($inout4,&QWP(16*4,"esp")); + &aesdec ($inout1,$rndkey1); + &pxor ($inout5,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key_)); + &aesdec ($inout2,$rndkey1); + &aesdec ($inout3,$rndkey1); + &aesdec ($inout4,$rndkey1); + &aesdec ($inout5,$rndkey1); + &call (&label("_aesni_decrypt6_enter")); + + &movdqa ($tweak,&QWP(16*5,"esp")); # last tweak + &pxor ($twtmp,$twtmp); + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &xorps ($inout1,&QWP(16*1,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*2,$out),$inout2); + &xorps ($inout4,&QWP(16*4,"esp")); + &movups (&QWP(16*3,$out),$inout3); + &xorps ($inout5,$tweak); + &movups (&QWP(16*4,$out),$inout4); + &pshufd ($twres,$twtmp,0x13); + &movups (&QWP(16*5,$out),$inout5); + &lea ($out,&DWP(16*6,$out)); + &movdqa ($twmask,&QWP(16*6,"esp")); # 0x0...010...87 + + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + + &sub ($len,16*6); + &jnc (&label("xts_dec_loop6")); + + &mov ($rounds,&DWP(240,$key_)); # restore $rounds + &mov ($key,$key_); # restore $key + &mov ($rounds_,$rounds); + +&set_label("xts_dec_short"); + &add ($len,16*6); + &jz (&label("xts_dec_done6x")); + + &movdqa ($inout3,$tweak); # put aside previous tweak + &cmp ($len,0x20); + &jb (&label("xts_dec_one")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &je (&label("xts_dec_two")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + 
&movdqa ($inout4,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &cmp ($len,0x40); + &jb (&label("xts_dec_three")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($inout5,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &movdqa (&QWP(16*0,"esp"),$inout3); + &movdqa (&QWP(16*1,"esp"),$inout4); + &je (&label("xts_dec_four")); + + &movdqa (&QWP(16*2,"esp"),$inout5); + &pshufd ($inout5,$twtmp,0x13); + &movdqa (&QWP(16*3,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($inout0,1); + &pand ($inout5,$twmask); # isolate carry and residue + &pxor ($inout5,$tweak); + + &movdqu ($inout0,&QWP(16*0,$inp)); # load input + &movdqu ($inout1,&QWP(16*1,$inp)); + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout1,&QWP(16*1,"esp")); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout2,&QWP(16*2,"esp")); + &lea ($inp,&DWP(16*5,$inp)); + &pxor ($inout3,&QWP(16*3,"esp")); + &movdqa (&QWP(16*4,"esp"),$inout5); # save last tweak + &pxor ($inout4,$inout5); + + &call ("_aesni_decrypt6"); + + &movaps ($tweak,&QWP(16*4,"esp")); # last tweak + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout4,$tweak); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &movups (&QWP(16*4,$out),$inout4); + &lea ($out,&DWP(16*5,$out)); + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_one",16); + &movups ($inout0,&QWP(16*0,$inp)); # load input + &lea ($inp,&DWP(16*1,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(16*0,$out),$inout0); # write output + &lea ($out,&DWP(16*1,$out)); + + &movdqa ($tweak,$inout3); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_two",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &lea ($inp,&DWP(16*2,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + + &call ("_aesni_decrypt2"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &lea ($out,&DWP(16*2,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_three",16); + &movaps ($inout5,$tweak); # put aside last tweak + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &lea ($inp,&DWP(16*3,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + + &call ("_aesni_decrypt3"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); 
+ &lea ($out,&DWP(16*3,$out)); + + &movdqa ($tweak,$inout5); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_four",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &xorps ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movups ($inout3,&QWP(16*3,$inp)); + &lea ($inp,&DWP(16*4,$inp)); + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &xorps ($inout3,$inout4); + + &call ("_aesni_decrypt4"); + + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,$inout4); + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &lea ($out,&DWP(16*4,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_done6x",16); # $tweak is pre-calculated + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &and ($len,15); + &jz (&label("xts_dec_ret")); + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &jmp (&label("xts_dec_only_one_more")); + +&set_label("xts_dec_done",16); + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &pxor ($twtmp,$twtmp); + &and ($len,15); + &jz (&label("xts_dec_ret")); + + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($twmask,&QWP(16*6,"esp")); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + +&set_label("xts_dec_only_one_more"); + &pshufd ($inout3,$twtmp,0x13); + &movdqa ($inout4,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($inout3,$twmask); # isolate carry and residue + &pxor ($inout3,$tweak); + + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + + &movups ($inout0,&QWP(0,$inp)); # load input + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(0,$out),$inout0); # write output + +&set_label("xts_dec_steal"); + &movz ($rounds,&BP(16,$inp)); + &movz ($key,&BP(0,$out)); + &lea ($inp,&DWP(1,$inp)); + &mov (&BP(0,$out),&LB($rounds)); + &mov (&BP(16,$out),&LB($key)); + &lea ($out,&DWP(1,$out)); + &sub ($len,1); + &jnz (&label("xts_dec_steal")); + + &sub ($out,&DWP(16*7+0,"esp")); # rewind $out + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + + &movups ($inout0,&QWP(0,$out)); # load input + &xorps ($inout0,$inout4); # input^=tweak + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$inout4); # output^=tweak + &movups (&QWP(0,$out),$inout0); # write output + +&set_label("xts_dec_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &movdqa (&QWP(16*0,"esp"),"xmm0"); # clear stack + &pxor ("xmm3","xmm3"); + &movdqa (&QWP(16*1,"esp"),"xmm0"); + &pxor ("xmm4","xmm4"); + &movdqa (&QWP(16*2,"esp"),"xmm0"); + &pxor ("xmm5","xmm5"); + &movdqa (&QWP(16*3,"esp"),"xmm0"); + &pxor ("xmm6","xmm6"); + &movdqa (&QWP(16*4,"esp"),"xmm0"); + &pxor ("xmm7","xmm7"); + &movdqa 
(&QWP(16*5,"esp"),"xmm0"); + &mov ("esp",&DWP(16*7+4,"esp")); # restore %esp +&function_end("aesni_xts_decrypt"); +} +} + +###################################################################### +# void $PREFIX_cbc_encrypt (const void *inp, void *out, +# size_t length, const AES_KEY *key, +# unsigned char *ivp,const int enc); +&function_begin("${PREFIX}_cbc_encrypt"); + &mov ($inp,&wparam(0)); + &mov ($rounds_,"esp"); + &mov ($out,&wparam(1)); + &sub ($rounds_,24); + &mov ($len,&wparam(2)); + &and ($rounds_,-16); + &mov ($key,&wparam(3)); + &mov ($key_,&wparam(4)); + &test ($len,$len); + &jz (&label("cbc_abort")); + + &cmp (&wparam(5),0); + &xchg ($rounds_,"esp"); # alloca + &movups ($ivec,&QWP(0,$key_)); # load IV + &mov ($rounds,&DWP(240,$key)); + &mov ($key_,$key); # backup $key + &mov (&DWP(16,"esp"),$rounds_); # save original %esp + &mov ($rounds_,$rounds); # backup $rounds + &je (&label("cbc_decrypt")); + + &movaps ($inout0,$ivec); + &cmp ($len,16); + &jb (&label("cbc_enc_tail")); + &sub ($len,16); + &jmp (&label("cbc_enc_loop")); + +&set_label("cbc_enc_loop",16); + &movups ($ivec,&QWP(0,$inp)); # input actually + &lea ($inp,&DWP(16,$inp)); + if ($inline) + { &aesni_inline_generate1("enc",$inout0,$ivec); } + else + { &xorps($inout0,$ivec); &call("_aesni_encrypt1"); } + &mov ($rounds,$rounds_); # restore $rounds + &mov ($key,$key_); # restore $key + &movups (&QWP(0,$out),$inout0); # store output + &lea ($out,&DWP(16,$out)); + &sub ($len,16); + &jnc (&label("cbc_enc_loop")); + &add ($len,16); + &jnz (&label("cbc_enc_tail")); + &movaps ($ivec,$inout0); + &pxor ($inout0,$inout0); + &jmp (&label("cbc_ret")); + +&set_label("cbc_enc_tail"); + &mov ("ecx",$len); # zaps $rounds + &data_word(0xA4F3F689); # rep movsb + &mov ("ecx",16); # zero tail + &sub ("ecx",$len); + &xor ("eax","eax"); # zaps $len + &data_word(0xAAF3F689); # rep stosb + &lea ($out,&DWP(-16,$out)); # rewind $out by 1 block + &mov ($rounds,$rounds_); # restore $rounds + &mov ($inp,$out); # $inp and $out are the same + &mov ($key,$key_); # restore $key + &jmp (&label("cbc_enc_loop")); +###################################################################### +&set_label("cbc_decrypt",16); + &cmp ($len,0x50); + &jbe (&label("cbc_dec_tail")); + &movaps (&QWP(0,"esp"),$ivec); # save IV + &sub ($len,0x50); + &jmp (&label("cbc_dec_loop6_enter")); + +&set_label("cbc_dec_loop6",16); + &movaps (&QWP(0,"esp"),$rndkey0); # save IV + &movups (&QWP(0,$out),$inout5); + &lea ($out,&DWP(0x10,$out)); +&set_label("cbc_dec_loop6_enter"); + &movdqu ($inout0,&QWP(0,$inp)); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movdqu ($inout5,&QWP(0x50,$inp)); + + &call ("_aesni_decrypt6"); + + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,&QWP(0,"esp")); # ^=IV + &xorps ($inout1,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout2,$rndkey0); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout3,$rndkey1); + &movups ($rndkey1,&QWP(0x40,$inp)); + &xorps ($inout4,$rndkey0); + &movups ($rndkey0,&QWP(0x50,$inp)); # IV + &xorps ($inout5,$rndkey1); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &lea ($inp,&DWP(0x60,$inp)); + &movups (&QWP(0x20,$out),$inout2); + &mov ($rounds,$rounds_); # restore $rounds + &movups (&QWP(0x30,$out),$inout3); + &mov ($key,$key_); # restore $key + &movups (&QWP(0x40,$out),$inout4); + &lea ($out,&DWP(0x50,$out)); + &sub ($len,0x60); + &ja 
(&label("cbc_dec_loop6")); + + &movaps ($inout0,$inout5); + &movaps ($ivec,$rndkey0); + &add ($len,0x50); + &jle (&label("cbc_dec_clear_tail_collected")); + &movups (&QWP(0,$out),$inout0); + &lea ($out,&DWP(0x10,$out)); +&set_label("cbc_dec_tail"); + &movups ($inout0,&QWP(0,$inp)); + &movaps ($in0,$inout0); + &cmp ($len,0x10); + &jbe (&label("cbc_dec_one")); + + &movups ($inout1,&QWP(0x10,$inp)); + &movaps ($in1,$inout1); + &cmp ($len,0x20); + &jbe (&label("cbc_dec_two")); + + &movups ($inout2,&QWP(0x20,$inp)); + &cmp ($len,0x30); + &jbe (&label("cbc_dec_three")); + + &movups ($inout3,&QWP(0x30,$inp)); + &cmp ($len,0x40); + &jbe (&label("cbc_dec_four")); + + &movups ($inout4,&QWP(0x40,$inp)); + &movaps (&QWP(0,"esp"),$ivec); # save IV + &movups ($inout0,&QWP(0,$inp)); + &xorps ($inout5,$inout5); + &call ("_aesni_decrypt6"); + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,&QWP(0,"esp")); # ^= IV + &xorps ($inout1,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout2,$rndkey0); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout3,$rndkey1); + &movups ($ivec,&QWP(0x40,$inp)); # IV + &xorps ($inout4,$rndkey0); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &pxor ($inout1,$inout1); + &movups (&QWP(0x20,$out),$inout2); + &pxor ($inout2,$inout2); + &movups (&QWP(0x30,$out),$inout3); + &pxor ($inout3,$inout3); + &lea ($out,&DWP(0x40,$out)); + &movaps ($inout0,$inout4); + &pxor ($inout4,$inout4); + &sub ($len,0x50); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_one",16); + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$ivec); + &movaps ($ivec,$in0); + &sub ($len,0x10); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_two",16); + &call ("_aesni_decrypt2"); + &xorps ($inout0,$ivec); + &xorps ($inout1,$in0); + &movups (&QWP(0,$out),$inout0); + &movaps ($inout0,$inout1); + &pxor ($inout1,$inout1); + &lea ($out,&DWP(0x10,$out)); + &movaps ($ivec,$in1); + &sub ($len,0x20); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_three",16); + &call ("_aesni_decrypt3"); + &xorps ($inout0,$ivec); + &xorps ($inout1,$in0); + &xorps ($inout2,$in1); + &movups (&QWP(0,$out),$inout0); + &movaps ($inout0,$inout2); + &pxor ($inout2,$inout2); + &movups (&QWP(0x10,$out),$inout1); + &pxor ($inout1,$inout1); + &lea ($out,&DWP(0x20,$out)); + &movups ($ivec,&QWP(0x20,$inp)); + &sub ($len,0x30); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_four",16); + &call ("_aesni_decrypt4"); + &movups ($rndkey1,&QWP(0x10,$inp)); + &movups ($rndkey0,&QWP(0x20,$inp)); + &xorps ($inout0,$ivec); + &movups ($ivec,&QWP(0x30,$inp)); + &xorps ($inout1,$in0); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout2,$rndkey1); + &movups (&QWP(0x10,$out),$inout1); + &pxor ($inout1,$inout1); + &xorps ($inout3,$rndkey0); + &movups (&QWP(0x20,$out),$inout2); + &pxor ($inout2,$inout2); + &lea ($out,&DWP(0x30,$out)); + &movaps ($inout0,$inout3); + &pxor ($inout3,$inout3); + &sub ($len,0x40); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_clear_tail_collected",16); + &pxor ($inout1,$inout1); + &pxor ($inout2,$inout2); + &pxor ($inout3,$inout3); + &pxor ($inout4,$inout4); +&set_label("cbc_dec_tail_collected"); + &and ($len,15); + &jnz (&label("cbc_dec_tail_partial")); + &movups (&QWP(0,$out),$inout0); + &pxor ($rndkey0,$rndkey0); + &jmp (&label("cbc_ret")); + +&set_label("cbc_dec_tail_partial",16); + &movaps 
(&QWP(0,"esp"),$inout0); + &pxor ($rndkey0,$rndkey0); + &mov ("ecx",16); + &mov ($inp,"esp"); + &sub ("ecx",$len); + &data_word(0xA4F3F689); # rep movsb + &movdqa (&QWP(0,"esp"),$inout0); + +&set_label("cbc_ret"); + &mov ("esp",&DWP(16,"esp")); # pull original %esp + &mov ($key_,&wparam(4)); + &pxor ($inout0,$inout0); + &pxor ($rndkey1,$rndkey1); + &movups (&QWP(0,$key_),$ivec); # output IV + &pxor ($ivec,$ivec); +&set_label("cbc_abort"); +&function_end("${PREFIX}_cbc_encrypt"); + +###################################################################### +# Mechanical port from aesni-x86_64.pl. +# +# _aesni_set_encrypt_key is private interface, +# input: +# "eax" const unsigned char *userKey +# $rounds int bits +# $key AES_KEY *key +# output: +# "eax" return code +# $round rounds + +&function_begin_B("_aesni_set_encrypt_key"); + &push ("ebp"); + &push ("ebx"); + &test ("eax","eax"); + &jz (&label("bad_pointer")); + &test ($key,$key); + &jz (&label("bad_pointer")); + + &call (&label("pic")); +&set_label("pic"); + &blindpop("ebx"); + &lea ("ebx",&DWP(&label("key_const")."-".&label("pic"),"ebx")); + + &picmeup("ebp","OPENSSL_ia32cap_P","ebx",&label("key_const")); + &movups ("xmm0",&QWP(0,"eax")); # pull first 128 bits of *userKey + &xorps ("xmm4","xmm4"); # low dword of xmm4 is assumed 0 + &mov ("ebp",&DWP(4,"ebp")); + &lea ($key,&DWP(16,$key)); + &and ("ebp",1<<28|1<<11); # AVX and XOP bits + &cmp ($rounds,256); + &je (&label("14rounds")); + &cmp ($rounds,192); + &je (&label("12rounds")); + &cmp ($rounds,128); + &jne (&label("bad_keybits")); + +&set_label("10rounds",16); + &cmp ("ebp",1<<28); + &je (&label("10rounds_alt")); + + &mov ($rounds,9); + &$movekey (&QWP(-16,$key),"xmm0"); # round 0 + &aeskeygenassist("xmm1","xmm0",0x01); # round 1 + &call (&label("key_128_cold")); + &aeskeygenassist("xmm1","xmm0",0x2); # round 2 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x04); # round 3 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x08); # round 4 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x10); # round 5 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x20); # round 6 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x40); # round 7 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x80); # round 8 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x1b); # round 9 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x36); # round 10 + &call (&label("key_128")); + &$movekey (&QWP(0,$key),"xmm0"); + &mov (&DWP(80,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("key_128",16); + &$movekey (&QWP(0,$key),"xmm0"); + &lea ($key,&DWP(16,$key)); +&set_label("key_128_cold"); + &shufps ("xmm4","xmm0",0b00010000); + &xorps ("xmm0","xmm4"); + &shufps ("xmm4","xmm0",0b10001100); + &xorps ("xmm0","xmm4"); + &shufps ("xmm1","xmm1",0b11111111); # critical path + &xorps ("xmm0","xmm1"); + &ret(); + +&set_label("10rounds_alt",16); + &movdqa ("xmm5",&QWP(0x00,"ebx")); + &mov ($rounds,8); + &movdqa ("xmm4",&QWP(0x20,"ebx")); + &movdqa ("xmm2","xmm0"); + &movdqu (&QWP(-16,$key),"xmm0"); + +&set_label("loop_key128"); + &pshufb ("xmm0","xmm5"); + &aesenclast ("xmm0","xmm4"); + &pslld ("xmm4",1); + &lea ($key,&DWP(16,$key)); + + &movdqa ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm2","xmm3"); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(-16,$key),"xmm0"); + &movdqa ("xmm2","xmm0"); + + &dec 
($rounds); + &jnz (&label("loop_key128")); + + &movdqa ("xmm4",&QWP(0x30,"ebx")); + + &pshufb ("xmm0","xmm5"); + &aesenclast ("xmm0","xmm4"); + &pslld ("xmm4",1); + + &movdqa ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm2","xmm3"); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(0,$key),"xmm0"); + + &movdqa ("xmm2","xmm0"); + &pshufb ("xmm0","xmm5"); + &aesenclast ("xmm0","xmm4"); + + &movdqa ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm2","xmm3"); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(16,$key),"xmm0"); + + &mov ($rounds,9); + &mov (&DWP(96,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("12rounds",16); + &movq ("xmm2",&QWP(16,"eax")); # remaining 1/3 of *userKey + &cmp ("ebp",1<<28); + &je (&label("12rounds_alt")); + + &mov ($rounds,11); + &$movekey (&QWP(-16,$key),"xmm0"); # round 0 + &aeskeygenassist("xmm1","xmm2",0x01); # round 1,2 + &call (&label("key_192a_cold")); + &aeskeygenassist("xmm1","xmm2",0x02); # round 2,3 + &call (&label("key_192b")); + &aeskeygenassist("xmm1","xmm2",0x04); # round 4,5 + &call (&label("key_192a")); + &aeskeygenassist("xmm1","xmm2",0x08); # round 5,6 + &call (&label("key_192b")); + &aeskeygenassist("xmm1","xmm2",0x10); # round 7,8 + &call (&label("key_192a")); + &aeskeygenassist("xmm1","xmm2",0x20); # round 8,9 + &call (&label("key_192b")); + &aeskeygenassist("xmm1","xmm2",0x40); # round 10,11 + &call (&label("key_192a")); + &aeskeygenassist("xmm1","xmm2",0x80); # round 11,12 + &call (&label("key_192b")); + &$movekey (&QWP(0,$key),"xmm0"); + &mov (&DWP(48,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("key_192a",16); + &$movekey (&QWP(0,$key),"xmm0"); + &lea ($key,&DWP(16,$key)); +&set_label("key_192a_cold",16); + &movaps ("xmm5","xmm2"); +&set_label("key_192b_warm"); + &shufps ("xmm4","xmm0",0b00010000); + &movdqa ("xmm3","xmm2"); + &xorps ("xmm0","xmm4"); + &shufps ("xmm4","xmm0",0b10001100); + &pslldq ("xmm3",4); + &xorps ("xmm0","xmm4"); + &pshufd ("xmm1","xmm1",0b01010101); # critical path + &pxor ("xmm2","xmm3"); + &pxor ("xmm0","xmm1"); + &pshufd ("xmm3","xmm0",0b11111111); + &pxor ("xmm2","xmm3"); + &ret(); + +&set_label("key_192b",16); + &movaps ("xmm3","xmm0"); + &shufps ("xmm5","xmm0",0b01000100); + &$movekey (&QWP(0,$key),"xmm5"); + &shufps ("xmm3","xmm2",0b01001110); + &$movekey (&QWP(16,$key),"xmm3"); + &lea ($key,&DWP(32,$key)); + &jmp (&label("key_192b_warm")); + +&set_label("12rounds_alt",16); + &movdqa ("xmm5",&QWP(0x10,"ebx")); + &movdqa ("xmm4",&QWP(0x20,"ebx")); + &mov ($rounds,8); + &movdqu (&QWP(-16,$key),"xmm0"); + +&set_label("loop_key192"); + &movq (&QWP(0,$key),"xmm2"); + &movdqa ("xmm1","xmm2"); + &pshufb ("xmm2","xmm5"); + &aesenclast ("xmm2","xmm4"); + &pslld ("xmm4",1); + &lea ($key,&DWP(24,$key)); + + &movdqa ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm0","xmm3"); + + &pshufd ("xmm3","xmm0",0xff); + &pxor ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm3","xmm1"); + + &pxor ("xmm0","xmm2"); + &pxor ("xmm2","xmm3"); + &movdqu (&QWP(-16,$key),"xmm0"); + + &dec ($rounds); + &jnz (&label("loop_key192")); + + &mov ($rounds,11); + &mov (&DWP(32,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("14rounds",16); + &movups ("xmm2",&QWP(16,"eax")); # remaining half of *userKey + &lea 
($key,&DWP(16,$key)); + &cmp ("ebp",1<<28); + &je (&label("14rounds_alt")); + + &mov ($rounds,13); + &$movekey (&QWP(-32,$key),"xmm0"); # round 0 + &$movekey (&QWP(-16,$key),"xmm2"); # round 1 + &aeskeygenassist("xmm1","xmm2",0x01); # round 2 + &call (&label("key_256a_cold")); + &aeskeygenassist("xmm1","xmm0",0x01); # round 3 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x02); # round 4 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x02); # round 5 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x04); # round 6 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x04); # round 7 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x08); # round 8 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x08); # round 9 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x10); # round 10 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x10); # round 11 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x20); # round 12 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x20); # round 13 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x40); # round 14 + &call (&label("key_256a")); + &$movekey (&QWP(0,$key),"xmm0"); + &mov (&DWP(16,$key),$rounds); + &xor ("eax","eax"); + + &jmp (&label("good_key")); + +&set_label("key_256a",16); + &$movekey (&QWP(0,$key),"xmm2"); + &lea ($key,&DWP(16,$key)); +&set_label("key_256a_cold"); + &shufps ("xmm4","xmm0",0b00010000); + &xorps ("xmm0","xmm4"); + &shufps ("xmm4","xmm0",0b10001100); + &xorps ("xmm0","xmm4"); + &shufps ("xmm1","xmm1",0b11111111); # critical path + &xorps ("xmm0","xmm1"); + &ret(); + +&set_label("key_256b",16); + &$movekey (&QWP(0,$key),"xmm0"); + &lea ($key,&DWP(16,$key)); + + &shufps ("xmm4","xmm2",0b00010000); + &xorps ("xmm2","xmm4"); + &shufps ("xmm4","xmm2",0b10001100); + &xorps ("xmm2","xmm4"); + &shufps ("xmm1","xmm1",0b10101010); # critical path + &xorps ("xmm2","xmm1"); + &ret(); + +&set_label("14rounds_alt",16); + &movdqa ("xmm5",&QWP(0x00,"ebx")); + &movdqa ("xmm4",&QWP(0x20,"ebx")); + &mov ($rounds,7); + &movdqu (&QWP(-32,$key),"xmm0"); + &movdqa ("xmm1","xmm2"); + &movdqu (&QWP(-16,$key),"xmm2"); + +&set_label("loop_key256"); + &pshufb ("xmm2","xmm5"); + &aesenclast ("xmm2","xmm4"); + + &movdqa ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm0","xmm3"); + &pslld ("xmm4",1); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(0,$key),"xmm0"); + + &dec ($rounds); + &jz (&label("done_key256")); + + &pshufd ("xmm2","xmm0",0xff); + &pxor ("xmm3","xmm3"); + &aesenclast ("xmm2","xmm3"); + + &movdqa ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm1","xmm3"); + + &pxor ("xmm2","xmm1"); + &movdqu (&QWP(16,$key),"xmm2"); + &lea ($key,&DWP(32,$key)); + &movdqa ("xmm1","xmm2"); + &jmp (&label("loop_key256")); + +&set_label("done_key256"); + &mov ($rounds,13); + &mov (&DWP(16,$key),$rounds); + +&set_label("good_key"); + &pxor ("xmm0","xmm0"); + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &xor ("eax","eax"); + &pop ("ebx"); + &pop ("ebp"); + &ret (); + +&set_label("bad_pointer",4); + &mov ("eax",-1); + &pop ("ebx"); + &pop ("ebp"); + &ret (); +&set_label("bad_keybits",4); + &pxor ("xmm0","xmm0"); + &mov ("eax",-2); + &pop ("ebx"); + 
&pop ("ebp"); + &ret (); +&function_end_B("_aesni_set_encrypt_key"); + +# int $PREFIX_set_encrypt_key (const unsigned char *userKey, int bits, +# AES_KEY *key) +&function_begin_B("${PREFIX}_set_encrypt_key"); + &mov ("eax",&wparam(0)); + &mov ($rounds,&wparam(1)); + &mov ($key,&wparam(2)); + &call ("_aesni_set_encrypt_key"); + &ret (); +&function_end_B("${PREFIX}_set_encrypt_key"); + +# int $PREFIX_set_decrypt_key (const unsigned char *userKey, int bits, +# AES_KEY *key) +&function_begin_B("${PREFIX}_set_decrypt_key"); + &mov ("eax",&wparam(0)); + &mov ($rounds,&wparam(1)); + &mov ($key,&wparam(2)); + &call ("_aesni_set_encrypt_key"); + &mov ($key,&wparam(2)); + &shl ($rounds,4); # rounds-1 after _aesni_set_encrypt_key + &test ("eax","eax"); + &jnz (&label("dec_key_ret")); + &lea ("eax",&DWP(16,$key,$rounds)); # end of key schedule + + &$movekey ("xmm0",&QWP(0,$key)); # just swap + &$movekey ("xmm1",&QWP(0,"eax")); + &$movekey (&QWP(0,"eax"),"xmm0"); + &$movekey (&QWP(0,$key),"xmm1"); + &lea ($key,&DWP(16,$key)); + &lea ("eax",&DWP(-16,"eax")); + +&set_label("dec_key_inverse"); + &$movekey ("xmm0",&QWP(0,$key)); # swap and inverse + &$movekey ("xmm1",&QWP(0,"eax")); + &aesimc ("xmm0","xmm0"); + &aesimc ("xmm1","xmm1"); + &lea ($key,&DWP(16,$key)); + &lea ("eax",&DWP(-16,"eax")); + &$movekey (&QWP(16,"eax"),"xmm0"); + &$movekey (&QWP(-16,$key),"xmm1"); + &cmp ("eax",$key); + &ja (&label("dec_key_inverse")); + + &$movekey ("xmm0",&QWP(0,$key)); # inverse middle + &aesimc ("xmm0","xmm0"); + &$movekey (&QWP(0,$key),"xmm0"); + + &pxor ("xmm0","xmm0"); + &pxor ("xmm1","xmm1"); + &xor ("eax","eax"); # return success +&set_label("dec_key_ret"); + &ret (); +&function_end_B("${PREFIX}_set_decrypt_key"); + +&set_label("key_const",64); +&data_word(0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d); +&data_word(0x04070605,0x04070605,0x04070605,0x04070605); +&data_word(1,1,1,1); +&data_word(0x1b,0x1b,0x1b,0x1b); +&asciz("AES for Intel AES-NI, CRYPTOGAMS by "); + +&asm_finish(); + +close STDOUT; diff --git a/crypto/aesgcm/aesni-x86_64.pl b/crypto/aesgcm/aesni-x86_64.pl new file mode 100644 index 0000000..252c485 --- /dev/null +++ b/crypto/aesgcm/aesni-x86_64.pl @@ -0,0 +1,5136 @@ +#! /usr/bin/env perl +# Copyright 2009-2016 The OpenSSL Project Authors. All Rights Reserved. + +# Ludde note : This is the regular AES code. + +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# This module implements support for Intel AES-NI extension. In +# OpenSSL context it's used with Intel engine, but can also be used as +# drop-in replacement for crypto/aes/asm/aes-x86_64.pl [see below for +# details]. +# +# Performance. +# +# Given aes(enc|dec) instructions' latency asymptotic performance for +# non-parallelizable modes such as CBC encrypt is 3.75 cycles per byte +# processed with 128-bit key. And given their throughput asymptotic +# performance for parallelizable modes is 1.25 cycles per byte. 
Being an
+# asymptotic limit, it's not something you commonly achieve in reality,
+# but how close does one get? Below are results collected for
+# different modes and block sizes. Pairs of numbers are for en-/
+# decryption.
+#
+#	16-byte    64-byte    256-byte   1-KB       8-KB
+# ECB	4.25/4.25  1.38/1.38  1.28/1.28  1.26/1.26  1.26/1.26
+# CTR	5.42/5.42  1.92/1.92  1.44/1.44  1.28/1.28  1.26/1.26
+# CBC	4.38/4.43  4.15/1.43  4.07/1.32  4.07/1.29  4.06/1.28
+# CCM	5.66/9.42  4.42/5.41  4.16/4.40  4.09/4.15  4.06/4.07
+# OFB	5.42/5.42  4.64/4.64  4.44/4.44  4.39/4.39  4.38/4.38
+# CFB	5.73/5.85  5.56/5.62  5.48/5.56  5.47/5.55  5.47/5.55
+#
+# ECB, CTR, CBC and CCM results are free from EVP overhead. This means
+# that the otherwise used 'openssl speed -evp aes-128-??? -engine aesni
+# [-decrypt]' will exhibit 10-15% worse results for smaller blocks.
+# The results were collected with a specially crafted speed.c benchmark
+# in order to compare them with results reported in the "Intel Advanced
+# Encryption Standard (AES) New Instruction Set" White Paper Revision
+# 3.0 dated May 2010. All above results are consistently better. This
+# module also provides better performance for block sizes smaller than
+# 128 bytes at points *not* represented in the above table.
+#
+# Looking at the results for the 8-KB buffer.
+#
+# CFB and OFB results are far from the limit, because the implementation
+# uses "generic" CRYPTO_[c|o]fb128_encrypt interfaces relying on
+# single-block aesni_encrypt, which is not the most optimal way to go.
+# The CBC encrypt result is unexpectedly high and there is no documented
+# explanation for it. Seemingly there is a small penalty for feeding
+# the result back to the AES unit the way it's done in CBC mode. There is
+# nothing one can do and the result appears optimal. The CCM result is
+# identical to CBC, because CBC-MAC is essentially CBC encrypt without
+# saving output. CCM's CTR "stays invisible," because it's neatly
+# interleaved with CBC-MAC. This provides ~30% improvement over a
+# "straightforward" CCM implementation with CTR and CBC-MAC performed
+# disjointly. Parallelizable modes practically achieve the theoretical
+# limit.
+#
+# Looking at how results vary with buffer size.
+#
+# Curves are practically saturated at 1-KB buffer size. In most cases
+# "256-byte" performance is >95%, and "64-byte" is ~90%, of the "8-KB" one.
+# The CTR curve doesn't follow this pattern and is the slowest-changing
+# one, with the "256-byte" result being 87% of "8-KB." This is because the
+# overhead in CTR mode is the most computationally intensive. Small-block
+# CCM decrypt is slower than encrypt, because the first CTR and last
+# CBC-MAC iterations can't be interleaved.
+#
+# Results for 192- and 256-bit keys.
+#
+# EVP-free results were observed to scale perfectly with the number of
+# rounds for larger block sizes, i.e. the 192-bit result being 10/12 times
+# lower and the 256-bit one - 10/14. Well, in the CBC encrypt case the
+# differences are a tad smaller, because the above-mentioned penalty biases
+# all results by the same constant value. In a similar way function call
+# overhead affects small-block performance, as well as OFB and CFB
+# results. Differences are not large; the most common coefficients are
+# 10/11.7 and 10/13.4 (as opposed to 10/12.0 and 10/14.0), but one can
+# observe even 10/11.2 and 10/12.4 (CTR, OFB, CFB)...
+
+# January 2011
+#
+# While the Westmere processor features 6-cycle latency for aes[enc|dec]
+# instructions, which can be scheduled every second cycle, Sandy
+# Bridge spends 8 cycles per instruction, but it can schedule them
+# every cycle.
This means that code targeting Westmere would perform
+# suboptimally on Sandy Bridge. Therefore this update.
+#
+# In addition, non-parallelizable CBC encrypt (as well as CCM) is
+# optimized. The relative improvement might appear modest, 8% on Westmere,
+# but in absolute terms it's 3.77 cycles per byte encrypted with a
+# 128-bit key on Westmere, and 5.07 on Sandy Bridge. These numbers
+# should be compared to the asymptotic limits of 3.75 for Westmere and
+# 5.00 for Sandy Bridge. Actually, the fact that they get this close
+# to the asymptotic limits is quite amazing. Indeed, the limit is
+# calculated as latency times number of rounds, 10 for 128-bit key,
+# divided by 16, the number of bytes in a block; in other words
+# it accounts *solely* for aesenc instructions. But there are extra
+# instructions, and numbers so close to the asymptotic limits mean
+# that it's as if it takes as little as *one* additional cycle to
+# execute all of them. How is that possible? It is possible thanks to
+# out-of-order execution logic, which manages to overlap the post-
+# processing of the previous block, things like saving the output, with
+# the actual encryption of the current block, as well as the pre-processing
+# of the current block, things like fetching input and xor-ing it with the
+# 0-round element of the key schedule, with the actual encryption of the
+# previous block. Keep this in mind...
+#
+# For parallelizable modes, such as ECB, CBC decrypt and CTR, higher
+# performance is achieved by interleaving instructions working on
+# independent blocks, in which case the asymptotic limit for such modes
+# can be obtained by dividing the above-mentioned numbers by the AES
+# instructions' interleave factor. Westmere can execute at most 3
+# instructions at a time, meaning that the optimal interleave factor is 3,
+# and that's where the "magic" number of 1.25 comes from. "Optimal
+# interleave factor" means that a further increase of the interleave
+# factor does not improve performance. The formula has proven to reflect
+# reality pretty well on Westmere... Sandy Bridge on the other hand can
+# execute up to 8 AES instructions at a time, so how does varying the
+# interleave factor affect performance? Here is a table for ECB
+# (numbers are cycles per byte processed with a 128-bit key):
+#
+# instruction interleave factor		3x	6x	8x
+# theoretical asymptotic limit		1.67	0.83	0.625
+# measured performance for 8KB block	1.05	0.86	0.84
+#
+# "as if" interleave factor		4.7x	5.8x	6.0x
+#
+# Further data for other parallelizable modes:
+#
+# CBC decrypt				1.16	0.93	0.74
+# CTR					1.14	0.91	0.74
+#
+# Well, given the 3x column it's probably inappropriate to call the limit
+# asymptotic if it can be surpassed, isn't it? What happens there?
+# Rewind to the CBC paragraph for the answer. Yes, out-of-order execution
+# magic is responsible for this. The processor overlaps not only the
+# additional instructions with the AES ones, but even AES instructions
+# processing adjacent triplets of independent blocks. In the 6x case the
+# additional instructions still claim a disproportionally small amount
+# of additional cycles, but in the 8x case the number of instructions must
+# be a tad too high for the out-of-order logic to cope with, and the AES
+# unit remains underutilized... As you can see, 8x interleave is hardly
+# justifiable, so there is no need to feel bad that 32-bit aesni-x86.pl
+# utilizes 6x interleave because of its limited register bank capacity.
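To make the limit arithmetic above concrete: the serial limit is latency times rounds divided by the 16-byte block, and the parallelizable limit additionally divides the latency by the effective interleave factor, itself capped by how many aes[enc|dec] instructions can be in flight (latency divided by issue interval). A minimal standalone C sketch of that arithmetic, using only the latency/issue figures quoted in these comments (an illustration, not code from this module):

#include <stdio.h>

/* serial limit: latency * rounds / 16 bytes per block */
static double serial_cpb(double latency, int rounds) {
    return latency * rounds / 16.0;
}

/* parallel limit: latency is hidden across min(interleave, in-flight)
 * independent blocks, where in-flight = latency / issue_interval */
static double parallel_cpb(double latency, double issue_interval,
                           int rounds, int interleave) {
    double in_flight = latency / issue_interval;
    double eff = interleave < in_flight ? interleave : in_flight;
    return latency / eff * rounds / 16.0;
}

int main(void) {
    /* Westmere: 6-cycle latency, issue every 2nd cycle, AES-128 = 10 rounds */
    printf("Westmere CBC enc %.2f\n", serial_cpb(6, 10));          /* 3.75  */
    printf("Westmere ECB 3x  %.2f\n", parallel_cpb(6, 2, 10, 3));  /* 1.25  */
    /* Sandy Bridge: 8-cycle latency, issue every cycle */
    printf("SNB CBC enc      %.2f\n", serial_cpb(8, 10));          /* 5.00  */
    printf("SNB ECB 3x       %.2f\n", parallel_cpb(8, 1, 10, 3));  /* 1.67  */
    printf("SNB ECB 8x       %.3f\n", parallel_cpb(8, 1, 10, 8));  /* 0.625 */
    return 0;
}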
+#
+# Higher interleave factors do have a negative impact on Westmere
+# performance: while for ECB mode it's negligible, ~1.5%, other
+# parallelizable modes perform ~5% worse, which is outweighed by ~25%
+# improvement on Sandy Bridge. To balance the regression on Westmere,
+# CTR mode was implemented with a 6x aesenc interleave factor.
+
+# April 2011
+#
+# Add aesni_xts_[en|de]crypt. Westmere spends 1.25 cycles processing
+# one byte out of 8KB with 128-bit key, Sandy Bridge - 0.90. Just like
+# in CTR mode the AES instruction interleave factor was chosen to be 6x.
+
+# November 2015
+#
+# Add aesni_ocb_[en|de]crypt. The AES instruction interleave factor was
+# chosen to be 6x.
+
+######################################################################
+# Current large-block performance in cycles per byte processed with
+# 128-bit key (less is better).
+#
+#		CBC en-/decrypt	CTR	XTS	ECB	OCB
+# Westmere	3.77/1.25	1.25	1.25	1.26
+# Sandy Bridge	5.07/0.74	0.75	0.90	0.85	0.98
+# Haswell	4.44/0.63	0.63	0.73	0.63	0.70
+# Skylake	2.62/0.63	0.63	0.63	0.63
+# Silvermont	5.75/3.54	3.56	4.12	3.87(*)	4.11
+# Knights L	2.54/0.77	0.78	0.85	-	1.50
+# Goldmont	3.82/1.26	1.26	1.29	1.29	1.50
+# Bulldozer	5.77/0.70	0.72	0.90	0.70	0.95
+# Ryzen		2.71/0.35	0.35	0.44	0.38	0.49
+#
+# (*)	Atom Silvermont ECB result is suboptimal because of penalties
+#	incurred by operations on %xmm8-15. As ECB is not considered
+#	critical, nothing was done to mitigate the problem.
+
+$PREFIX="aesni";	# if $PREFIX is set to "AES", the script
+			# generates drop-in replacement for
+			# crypto/aes/asm/aes-x86_64.pl:-)
+
+$flavour = shift;
+$output  = shift;
+if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }
+
+$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
+( $xlate="${dir}../x86_64-xlate.pl" and -f $xlate) or
+die "can't locate x86_64-xlate.pl";
+
+open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
+*STDOUT=*OUT;
+
+$movkey = $PREFIX eq "aesni" ? "movups" : "movups";
+@_4args=$win64?	("%rcx","%rdx","%r8", "%r9") :	# Win64 order
+		("%rdi","%rsi","%rdx","%rcx");	# Unix order
+
+$code=".text\n";
+#$code.=".extern	OPENSSL_ia32cap_P\n";
+
+$rounds="%eax";	# input to and changed by aesni_[en|de]cryptN !!!
+# this is the natural Unix argument order for public $PREFIX_[ecb|cbc]_encrypt ...
+$inp="%rdi";
+$out="%rsi";
+$len="%rdx";
+$key="%rcx";	# input to and changed by aesni_[en|de]cryptN !!!
+$ivp="%r8";	# cbc, ctr, ...
+
+$rnds_="%r10d";	# backup copy for $rounds
+$key_="%r11";	# backup copy for $key
+
+# %xmm register layout
+$rndkey0="%xmm0";	$rndkey1="%xmm1";
+$inout0="%xmm2";	$inout1="%xmm3";
+$inout2="%xmm4";	$inout3="%xmm5";
+$inout4="%xmm6";	$inout5="%xmm7";
+$inout6="%xmm8";	$inout7="%xmm9";
+
+$in2="%xmm6";		$in1="%xmm7";	# used in CBC decrypt, CTR, ...
+$in0="%xmm8";		$iv="%xmm9";
+
+# Inline version of internal aesni_[en|de]crypt1.
+#
+# Why a folded loop? Because aes[enc|dec] is slow enough to accommodate
+# the cycles which take care of loop variables...
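Before the generator itself, a hedged plain-C reference for what the folded loop expands to at run time (a sketch, not the generated code; "rk" stands for the expanded key schedule and "rounds" for key->rounds, i.e. 10/12/14):

#include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

/* Rough equivalent of the folded loop emitted by aesni_generate1:
 * whitening xor with round key 0, one aesenc per middle round,
 * aesenclast for the final round. */
static __m128i aesni_encrypt1_sketch(__m128i block,
                                     const __m128i *rk, int rounds)
{
    block = _mm_xor_si128(block, rk[0]);
    for (int i = 1; i < rounds; i++)
        block = _mm_aesenc_si128(block, rk[i]);
    return _mm_aesenclast_si128(block, rk[rounds]);
}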
+{ my $sn;
+sub aesni_generate1 {
+my ($p,$key,$rounds,$inout,$ivec)=@_;	$inout=$inout0 if (!defined($inout));
+++$sn;
+$code.=<<___;
+	$movkey	($key),$rndkey0
+	$movkey	16($key),$rndkey1
+___
+$code.=<<___ if (defined($ivec));
+	xorps	$rndkey0,$ivec
+	lea	32($key),$key
+	xorps	$ivec,$inout
+___
+$code.=<<___ if (!defined($ivec));
+	lea	32($key),$key
+	xorps	$rndkey0,$inout
+___
+$code.=<<___;
+.Loop_${p}1_$sn:
+	aes${p}	$rndkey1,$inout
+	dec	$rounds
+	$movkey	($key),$rndkey1
+	lea	16($key),$key
+	jnz	.Loop_${p}1_$sn	# loop body is 16 bytes
+	aes${p}last	$rndkey1,$inout
+___
+}}
+# void $PREFIX_[en|de]crypt (const void *inp,void *out,const AES_KEY *key);
+#
+{ my ($inp,$out,$key) = @_4args;
+
+$code.=<<___;
+.globl	${PREFIX}_encrypt
+.type	${PREFIX}_encrypt,\@abi-omnipotent
+.align	16
+${PREFIX}_encrypt:
+	movups	($inp),$inout0		# load input
+	mov	240($key),$rounds	# key->rounds
+___
+	&aesni_generate1("enc",$key,$rounds);
+$code.=<<___;
+	pxor	$rndkey0,$rndkey0	# clear register bank
+	pxor	$rndkey1,$rndkey1
+	movups	$inout0,($out)		# output
+	pxor	$inout0,$inout0
+	ret
+.size	${PREFIX}_encrypt,.-${PREFIX}_encrypt
+
+.globl	${PREFIX}_decrypt
+.type	${PREFIX}_decrypt,\@abi-omnipotent
+.align	16
+${PREFIX}_decrypt:
+	movups	($inp),$inout0		# load input
+	mov	240($key),$rounds	# key->rounds
+___
+	&aesni_generate1("dec",$key,$rounds);
+$code.=<<___;
+	pxor	$rndkey0,$rndkey0	# clear register bank
+	pxor	$rndkey1,$rndkey1
+	movups	$inout0,($out)		# output
+	pxor	$inout0,$inout0
+	ret
+.size	${PREFIX}_decrypt, .-${PREFIX}_decrypt
+___
+}
+
+# _aesni_[en|de]cryptN are private interfaces, N denotes the interleave
+# factor. Why were 3x subroutines originally used in loops? Even though
+# aes[enc|dec] latency was originally 6, it could be scheduled only
+# every *2nd* cycle. Thus 3x interleave was the one providing optimal
+# utilization, i.e. when the subroutine's throughput is virtually the same
+# as that of a non-interleaved subroutine [for up to 3 input blocks].
+# This is why it originally made no sense to implement a 2x subroutine.
+# But times change and it became appropriate to spend the extra 192 bytes
+# on a 2x subroutine for Atom Silvermont's sake. For processors that
+# can schedule aes[enc|dec] every cycle the optimal interleave factor
+# equals the corresponding instruction's latency. 8x is optimal for
+# Sandy Bridge and "super-optimal" for other Intel CPUs...
+
+sub aesni_generate2 {
+my $dir=shift;
+# As already mentioned it takes in $key and $rounds, which are *not*
+# preserved. $inout[0-1] is cipher/clear text...
+$code.=<<___;
+.type	_aesni_${dir}rypt2,\@abi-omnipotent
+.align	16
+_aesni_${dir}rypt2:
+	$movkey	($key),$rndkey0
+	shl	\$4,$rounds
+	$movkey	16($key),$rndkey1
+	xorps	$rndkey0,$inout0
+	xorps	$rndkey0,$inout1
+	$movkey	32($key),$rndkey0
+	lea	32($key,$rounds),$key
+	neg	%rax			# $rounds
+	add	\$16,%rax
+
+.L${dir}_loop2:
+	aes${dir}	$rndkey1,$inout0
+	aes${dir}	$rndkey1,$inout1
+	$movkey	($key,%rax),$rndkey1
+	add	\$32,%rax
+	aes${dir}	$rndkey0,$inout0
+	aes${dir}	$rndkey0,$inout1
+	$movkey	-16($key,%rax),$rndkey0
+	jnz	.L${dir}_loop2
+
+	aes${dir}	$rndkey1,$inout0
+	aes${dir}	$rndkey1,$inout1
+	aes${dir}last	$rndkey0,$inout0
+	aes${dir}last	$rndkey0,$inout1
+	ret
+.size	_aesni_${dir}rypt2,.-_aesni_${dir}rypt2
+___
+}
+sub aesni_generate3 {
+my $dir=shift;
+# As already mentioned it takes in $key and $rounds, which are *not*
+# preserved. $inout[0-2] is cipher/clear text...
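The 3x body below, and the 4x/6x/8x ones after it, all share the shape of _aesni_encrypt2/_aesni_decrypt2 above; as a hedged C illustration of the interleave idea for N=2 (hypothetical helper, not the generated code; the blocks must be independent, as in ECB, CBC decrypt or CTR):

#include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

/* Two independent aesenc chains: the CPU can overlap their 6-8 cycle
 * latencies instead of stalling between dependent rounds. */
static void aesni_encrypt2_sketch(__m128i *b0, __m128i *b1,
                                  const __m128i *rk, int rounds)
{
    __m128i x0 = _mm_xor_si128(*b0, rk[0]);
    __m128i x1 = _mm_xor_si128(*b1, rk[0]);
    for (int i = 1; i < rounds; i++) {
        x0 = _mm_aesenc_si128(x0, rk[i]);
        x1 = _mm_aesenc_si128(x1, rk[i]);
    }
    *b0 = _mm_aesenclast_si128(x0, rk[rounds]);
    *b1 = _mm_aesenclast_si128(x1, rk[rounds]);
}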
+$code.=<<___; +.type _aesni_${dir}rypt3,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt3: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + xorps $rndkey0,$inout1 + xorps $rndkey0,$inout2 + $movkey 32($key),$rndkey0 + lea 32($key,$rounds),$key + neg %rax # $rounds + add \$16,%rax + +.L${dir}_loop3: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop3 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + ret +.size _aesni_${dir}rypt3,.-_aesni_${dir}rypt3 +___ +} +# 4x interleave is implemented to improve small block performance, +# most notably [and naturally] 4 block by ~30%. One can argue that one +# should have implemented 5x as well, but improvement would be <20%, +# so it's not worth it... +sub aesni_generate4 { +my $dir=shift; +# As already mentioned it takes in $key and $rounds, which are *not* +# preserved. $inout[0-3] is cipher/clear text... +$code.=<<___; +.type _aesni_${dir}rypt4,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt4: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + xorps $rndkey0,$inout1 + xorps $rndkey0,$inout2 + xorps $rndkey0,$inout3 + $movkey 32($key),$rndkey0 + lea 32($key,$rounds),$key + neg %rax # $rounds + .byte 0x0f,0x1f,0x00 + add \$16,%rax + +.L${dir}_loop4: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + aes${dir} $rndkey0,$inout3 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop4 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + aes${dir}last $rndkey0,$inout3 + ret +.size _aesni_${dir}rypt4,.-_aesni_${dir}rypt4 +___ +} +sub aesni_generate6 { +my $dir=shift; +# As already mentioned it takes in $key and $rounds, which are *not* +# preserved. $inout[0-5] is cipher/clear text... 
+$code.=<<___; +.type _aesni_${dir}rypt6,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt6: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + pxor $rndkey0,$inout1 + pxor $rndkey0,$inout2 + aes${dir} $rndkey1,$inout0 + lea 32($key,$rounds),$key + neg %rax # $rounds + aes${dir} $rndkey1,$inout1 + pxor $rndkey0,$inout3 + pxor $rndkey0,$inout4 + aes${dir} $rndkey1,$inout2 + pxor $rndkey0,$inout5 + $movkey ($key,%rax),$rndkey0 + add \$16,%rax + jmp .L${dir}_loop6_enter +.align 16 +.L${dir}_loop6: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 +.L${dir}_loop6_enter: + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + aes${dir} $rndkey0,$inout3 + aes${dir} $rndkey0,$inout4 + aes${dir} $rndkey0,$inout5 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop6 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + aes${dir}last $rndkey0,$inout3 + aes${dir}last $rndkey0,$inout4 + aes${dir}last $rndkey0,$inout5 + ret +.size _aesni_${dir}rypt6,.-_aesni_${dir}rypt6 +___ +} +sub aesni_generate8 { +my $dir=shift; +# As already mentioned it takes in $key and $rounds, which are *not* +# preserved. $inout[0-7] is cipher/clear text... +$code.=<<___; +.type _aesni_${dir}rypt8,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt8: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + xorps $rndkey0,$inout1 + pxor $rndkey0,$inout2 + pxor $rndkey0,$inout3 + pxor $rndkey0,$inout4 + lea 32($key,$rounds),$key + neg %rax # $rounds + aes${dir} $rndkey1,$inout0 + pxor $rndkey0,$inout5 + pxor $rndkey0,$inout6 + aes${dir} $rndkey1,$inout1 + pxor $rndkey0,$inout7 + $movkey ($key,%rax),$rndkey0 + add \$16,%rax + jmp .L${dir}_loop8_inner +.align 16 +.L${dir}_loop8: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 +.L${dir}_loop8_inner: + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + aes${dir} $rndkey1,$inout6 + aes${dir} $rndkey1,$inout7 +.L${dir}_loop8_enter: + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + aes${dir} $rndkey0,$inout3 + aes${dir} $rndkey0,$inout4 + aes${dir} $rndkey0,$inout5 + aes${dir} $rndkey0,$inout6 + aes${dir} $rndkey0,$inout7 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop8 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + aes${dir} $rndkey1,$inout6 + aes${dir} $rndkey1,$inout7 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + aes${dir}last $rndkey0,$inout3 + aes${dir}last $rndkey0,$inout4 + aes${dir}last $rndkey0,$inout5 + aes${dir}last $rndkey0,$inout6 + aes${dir}last $rndkey0,$inout7 + ret +.size _aesni_${dir}rypt8,.-_aesni_${dir}rypt8 +___ +} +&aesni_generate2("enc") if ($PREFIX eq "aesni"); +&aesni_generate2("dec"); +&aesni_generate3("enc") if ($PREFIX eq "aesni"); +&aesni_generate3("dec"); +&aesni_generate4("enc") if ($PREFIX eq "aesni"); 
+&aesni_generate4("dec"); +&aesni_generate6("enc") if ($PREFIX eq "aesni"); +&aesni_generate6("dec"); +&aesni_generate8("enc") if ($PREFIX eq "aesni"); +&aesni_generate8("dec"); + +if ($PREFIX eq "aesni") { +if (0) { +######################################################################## +# void aesni_ecb_encrypt (const void *in, void *out, +# size_t length, const AES_KEY *key, +# int enc); +$code.=<<___; +.globl aesni_ecb_encrypt +.type aesni_ecb_encrypt,\@function,5 +.align 16 +aesni_ecb_encrypt: +___ +$code.=<<___ if ($win64); + lea -0x58(%rsp),%rsp + movaps %xmm6,(%rsp) # offload $inout4..7 + movaps %xmm7,0x10(%rsp) + movaps %xmm8,0x20(%rsp) + movaps %xmm9,0x30(%rsp) +.Lecb_enc_body: +___ +$code.=<<___; + and \$-16,$len # if ($len<16) + jz .Lecb_ret # return + + mov 240($key),$rounds # key->rounds + $movkey ($key),$rndkey0 + mov $key,$key_ # backup $key + mov $rounds,$rnds_ # backup $rounds + test %r8d,%r8d # 5th argument + jz .Lecb_decrypt +#--------------------------- ECB ENCRYPT ------------------------------# + cmp \$0x80,$len # if ($len<8*16) + jb .Lecb_enc_tail # short input + + movdqu ($inp),$inout0 # load 8 input blocks + movdqu 0x10($inp),$inout1 + movdqu 0x20($inp),$inout2 + movdqu 0x30($inp),$inout3 + movdqu 0x40($inp),$inout4 + movdqu 0x50($inp),$inout5 + movdqu 0x60($inp),$inout6 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 + sub \$0x80,$len # $len-=8*16 (can be zero) + jmp .Lecb_enc_loop8_enter +.align 16 +.Lecb_enc_loop8: + movups $inout0,($out) # store 8 output blocks + mov $key_,$key # restore $key + movdqu ($inp),$inout0 # load 8 input blocks + mov $rnds_,$rounds # restore $rounds + movups $inout1,0x10($out) + movdqu 0x10($inp),$inout1 + movups $inout2,0x20($out) + movdqu 0x20($inp),$inout2 + movups $inout3,0x30($out) + movdqu 0x30($inp),$inout3 + movups $inout4,0x40($out) + movdqu 0x40($inp),$inout4 + movups $inout5,0x50($out) + movdqu 0x50($inp),$inout5 + movups $inout6,0x60($out) + movdqu 0x60($inp),$inout6 + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 +.Lecb_enc_loop8_enter: + + call _aesni_encrypt8 + + sub \$0x80,$len + jnc .Lecb_enc_loop8 # loop if $len-=8*16 didn't borrow + + movups $inout0,($out) # store 8 output blocks + mov $key_,$key # restore $key + movups $inout1,0x10($out) + mov $rnds_,$rounds # restore $rounds + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + movups $inout5,0x50($out) + movups $inout6,0x60($out) + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + add \$0x80,$len # restore real remaining $len + jz .Lecb_ret # done if ($len==0) + +.Lecb_enc_tail: # $len is less than 8*16 + movups ($inp),$inout0 + cmp \$0x20,$len + jb .Lecb_enc_one + movups 0x10($inp),$inout1 + je .Lecb_enc_two + movups 0x20($inp),$inout2 + cmp \$0x40,$len + jb .Lecb_enc_three + movups 0x30($inp),$inout3 + je .Lecb_enc_four + movups 0x40($inp),$inout4 + cmp \$0x60,$len + jb .Lecb_enc_five + movups 0x50($inp),$inout5 + je .Lecb_enc_six + movdqu 0x60($inp),$inout6 + xorps $inout7,$inout7 + call _aesni_encrypt8 + movups $inout0,($out) # store 7 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + movups $inout5,0x50($out) + movups $inout6,0x60($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_one: +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + movups $inout0,($out) # store one output block + jmp .Lecb_ret +.align 16 +.Lecb_enc_two: + call 
_aesni_encrypt2 + movups $inout0,($out) # store 2 output blocks + movups $inout1,0x10($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_three: + call _aesni_encrypt3 + movups $inout0,($out) # store 3 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_four: + call _aesni_encrypt4 + movups $inout0,($out) # store 4 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_five: + xorps $inout5,$inout5 + call _aesni_encrypt6 + movups $inout0,($out) # store 5 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_six: + call _aesni_encrypt6 + movups $inout0,($out) # store 6 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + movups $inout5,0x50($out) + jmp .Lecb_ret +#--------------------------- ECB DECRYPT ------------------------------# +.align 16 +.Lecb_decrypt: + cmp \$0x80,$len # if ($len<8*16) + jb .Lecb_dec_tail # short input + + movdqu ($inp),$inout0 # load 8 input blocks + movdqu 0x10($inp),$inout1 + movdqu 0x20($inp),$inout2 + movdqu 0x30($inp),$inout3 + movdqu 0x40($inp),$inout4 + movdqu 0x50($inp),$inout5 + movdqu 0x60($inp),$inout6 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 + sub \$0x80,$len # $len-=8*16 (can be zero) + jmp .Lecb_dec_loop8_enter +.align 16 +.Lecb_dec_loop8: + movups $inout0,($out) # store 8 output blocks + mov $key_,$key # restore $key + movdqu ($inp),$inout0 # load 8 input blocks + mov $rnds_,$rounds # restore $rounds + movups $inout1,0x10($out) + movdqu 0x10($inp),$inout1 + movups $inout2,0x20($out) + movdqu 0x20($inp),$inout2 + movups $inout3,0x30($out) + movdqu 0x30($inp),$inout3 + movups $inout4,0x40($out) + movdqu 0x40($inp),$inout4 + movups $inout5,0x50($out) + movdqu 0x50($inp),$inout5 + movups $inout6,0x60($out) + movdqu 0x60($inp),$inout6 + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 +.Lecb_dec_loop8_enter: + + call _aesni_decrypt8 + + $movkey ($key_),$rndkey0 + sub \$0x80,$len + jnc .Lecb_dec_loop8 # loop if $len-=8*16 didn't borrow + + movups $inout0,($out) # store 8 output blocks + pxor $inout0,$inout0 # clear register bank + mov $key_,$key # restore $key + movups $inout1,0x10($out) + pxor $inout1,$inout1 + mov $rnds_,$rounds # restore $rounds + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + movups $inout5,0x50($out) + pxor $inout5,$inout5 + movups $inout6,0x60($out) + pxor $inout6,$inout6 + movups $inout7,0x70($out) + pxor $inout7,$inout7 + lea 0x80($out),$out # $out+=8*16 + add \$0x80,$len # restore real remaining $len + jz .Lecb_ret # done if ($len==0) + +.Lecb_dec_tail: + movups ($inp),$inout0 + cmp \$0x20,$len + jb .Lecb_dec_one + movups 0x10($inp),$inout1 + je .Lecb_dec_two + movups 0x20($inp),$inout2 + cmp \$0x40,$len + jb .Lecb_dec_three + movups 0x30($inp),$inout3 + je .Lecb_dec_four + movups 0x40($inp),$inout4 + cmp \$0x60,$len + jb .Lecb_dec_five + movups 0x50($inp),$inout5 + je .Lecb_dec_six + movups 0x60($inp),$inout6 + $movkey ($key),$rndkey0 + xorps $inout7,$inout7 + call _aesni_decrypt8 + movups $inout0,($out) # store 7 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor 
$inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + movups $inout5,0x50($out) + pxor $inout5,$inout5 + movups $inout6,0x60($out) + pxor $inout6,$inout6 + pxor $inout7,$inout7 + jmp .Lecb_ret +.align 16 +.Lecb_dec_one: +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + movups $inout0,($out) # store one output block + pxor $inout0,$inout0 # clear register bank + jmp .Lecb_ret +.align 16 +.Lecb_dec_two: + call _aesni_decrypt2 + movups $inout0,($out) # store 2 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + jmp .Lecb_ret +.align 16 +.Lecb_dec_three: + call _aesni_decrypt3 + movups $inout0,($out) # store 3 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + jmp .Lecb_ret +.align 16 +.Lecb_dec_four: + call _aesni_decrypt4 + movups $inout0,($out) # store 4 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + jmp .Lecb_ret +.align 16 +.Lecb_dec_five: + xorps $inout5,$inout5 + call _aesni_decrypt6 + movups $inout0,($out) # store 5 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + pxor $inout5,$inout5 + jmp .Lecb_ret +.align 16 +.Lecb_dec_six: + call _aesni_decrypt6 + movups $inout0,($out) # store 6 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + movups $inout5,0x50($out) + pxor $inout5,$inout5 + +.Lecb_ret: + xorps $rndkey0,$rndkey0 # %xmm0 + pxor $rndkey1,$rndkey1 +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps %xmm0,(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + lea 0x58(%rsp),%rsp +.Lecb_enc_ret: +___ +$code.=<<___; + ret +.size aesni_ecb_encrypt,.-aesni_ecb_encrypt +___ +} +{ +###################################################################### +# void aesni_ccm64_[en|de]crypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec,char *cmac); +# +# Handles only complete blocks, operates on 64-bit counter and +# does not update *ivec! 
Nor does it finalize CMAC value +# (see engine/eng_aesni.c for details) +# +if (0) { +my $cmac="%r9"; # 6th argument + +my $increment="%xmm9"; +my $iv="%xmm6"; +my $bswap_mask="%xmm7"; + +$code.=<<___; +.globl aesni_ccm64_encrypt_blocks +.type aesni_ccm64_encrypt_blocks,\@function,6 +.align 16 +aesni_ccm64_encrypt_blocks: +___ +$code.=<<___ if ($win64); + lea -0x58(%rsp),%rsp + movaps %xmm6,(%rsp) # $iv + movaps %xmm7,0x10(%rsp) # $bswap_mask + movaps %xmm8,0x20(%rsp) # $in0 + movaps %xmm9,0x30(%rsp) # $increment +.Lccm64_enc_body: +___ +$code.=<<___; + mov 240($key),$rounds # key->rounds + movdqu ($ivp),$iv + movdqa .Lincrement64(%rip),$increment + movdqa .Lbswap_mask(%rip),$bswap_mask + + shl \$4,$rounds + mov \$16,$rnds_ + lea 0($key),$key_ + movdqu ($cmac),$inout1 + movdqa $iv,$inout0 + lea 32($key,$rounds),$key # end of key schedule + pshufb $bswap_mask,$iv + sub %rax,%r10 # twisted $rounds + jmp .Lccm64_enc_outer +.align 16 +.Lccm64_enc_outer: + $movkey ($key_),$rndkey0 + mov %r10,%rax + movups ($inp),$in0 # load inp + + xorps $rndkey0,$inout0 # counter + $movkey 16($key_),$rndkey1 + xorps $in0,$rndkey0 + xorps $rndkey0,$inout1 # cmac^=inp + $movkey 32($key_),$rndkey0 + +.Lccm64_enc2_loop: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + $movkey -16($key,%rax),$rndkey0 + jnz .Lccm64_enc2_loop + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + paddq $increment,$iv + dec $len # $len-- ($len is in blocks) + aesenclast $rndkey0,$inout0 + aesenclast $rndkey0,$inout1 + + lea 16($inp),$inp + xorps $inout0,$in0 # inp ^= E(iv) + movdqa $iv,$inout0 + movups $in0,($out) # save output + pshufb $bswap_mask,$inout0 + lea 16($out),$out # $out+=16 + jnz .Lccm64_enc_outer # loop if ($len!=0) + + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + pxor $inout0,$inout0 + movups $inout1,($cmac) # store resulting mac + pxor $inout1,$inout1 + pxor $in0,$in0 + pxor $iv,$iv +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps %xmm0,(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + lea 0x58(%rsp),%rsp +.Lccm64_enc_ret: +___ +$code.=<<___; + ret +.size aesni_ccm64_encrypt_blocks,.-aesni_ccm64_encrypt_blocks +___ +###################################################################### +$code.=<<___; +.globl aesni_ccm64_decrypt_blocks +.type aesni_ccm64_decrypt_blocks,\@function,6 +.align 16 +aesni_ccm64_decrypt_blocks: +___ +$code.=<<___ if ($win64); + lea -0x58(%rsp),%rsp + movaps %xmm6,(%rsp) # $iv + movaps %xmm7,0x10(%rsp) # $bswap_mask + movaps %xmm8,0x20(%rsp) # $in8 + movaps %xmm9,0x30(%rsp) # $increment +.Lccm64_dec_body: +___ +$code.=<<___; + mov 240($key),$rounds # key->rounds + movups ($ivp),$iv + movdqu ($cmac),$inout1 + movdqa .Lincrement64(%rip),$increment + movdqa .Lbswap_mask(%rip),$bswap_mask + + movaps $iv,$inout0 + mov $rounds,$rnds_ + mov $key,$key_ + pshufb $bswap_mask,$iv +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + shl \$4,$rnds_ + mov \$16,$rounds + movups ($inp),$in0 # load inp + paddq $increment,$iv + lea 16($inp),$inp # $inp+=16 + sub %r10,%rax # twisted $rounds + lea 32($key_,$rnds_),$key # end of key schedule + mov %rax,%r10 + jmp .Lccm64_dec_outer +.align 16 +.Lccm64_dec_outer: + xorps $inout0,$in0 # inp ^= E(iv) + movdqa $iv,$inout0 + movups $in0,($out) # save output + lea 16($out),$out # $out+=16 + 
pshufb $bswap_mask,$inout0 + + sub \$1,$len # $len-- ($len is in blocks) + jz .Lccm64_dec_break # if ($len==0) break + + $movkey ($key_),$rndkey0 + mov %r10,%rax + $movkey 16($key_),$rndkey1 + xorps $rndkey0,$in0 + xorps $rndkey0,$inout0 + xorps $in0,$inout1 # cmac^=out + $movkey 32($key_),$rndkey0 + jmp .Lccm64_dec2_loop +.align 16 +.Lccm64_dec2_loop: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + $movkey -16($key,%rax),$rndkey0 + jnz .Lccm64_dec2_loop + movups ($inp),$in0 # load input + paddq $increment,$iv + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenclast $rndkey0,$inout0 + aesenclast $rndkey0,$inout1 + lea 16($inp),$inp # $inp+=16 + jmp .Lccm64_dec_outer + +.align 16 +.Lccm64_dec_break: + #xorps $in0,$inout1 # cmac^=out + mov 240($key_),$rounds +___ + &aesni_generate1("enc",$key_,$rounds,$inout1,$in0); +$code.=<<___; + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + pxor $inout0,$inout0 + movups $inout1,($cmac) # store resulting mac + pxor $inout1,$inout1 + pxor $in0,$in0 + pxor $iv,$iv +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps %xmm0,(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + lea 0x58(%rsp),%rsp +.Lccm64_dec_ret: +___ +$code.=<<___; + ret +.size aesni_ccm64_decrypt_blocks,.-aesni_ccm64_decrypt_blocks +___ +} +###################################################################### +# void aesni_ctr32_encrypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec); +# +# Handles only complete blocks, operates on 32-bit counter and +# does not update *ivec! (see crypto/modes/ctr128.c for details) +# +# Overhaul based on suggestions from Shay Gueron and Vlad Krasnov, +# http://rt.openssl.org/Ticket/Display.html?id=3021&user=guest&pass=guest. +# Keywords are full unroll and modulo-schedule counter calculations +# with zero-round key xor. 
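The "zero-round key xor" keyword above means the counter blocks are kept on the stack already xor-ed with round key 0, so each encryption chain can start directly with aesenc. A hedged scalar C sketch of that counter preparation (hypothetical helper; the real code below builds the blocks with pinsrd/movbe and touches only the 32-bit big-endian counter word):

#include <stdint.h>
#include <string.h>

static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0xff00) |
           ((x << 8) & 0xff0000) | (x << 24);
}

/* Build 8 counter blocks: ivec with its last (big-endian) 32-bit word
 * incremented per block, each pre-xor-ed with round key 0. */
static void ctr32_prepare_sketch(uint8_t blocks[8][16],
                                 const uint8_t ivec[16],
                                 const uint8_t rk0[16])
{
    uint32_t ctr;
    memcpy(&ctr, ivec + 12, 4);
    ctr = bswap32(ctr);                    /* counter is big-endian */
    for (int i = 0; i < 8; i++) {
        uint32_t c = bswap32(ctr + (uint32_t)i);
        memcpy(blocks[i], ivec, 16);
        memcpy(blocks[i] + 12, &c, 4);
        for (int j = 0; j < 16; j++)       /* zero-round key xor */
            blocks[i][j] ^= rk0[j];
    }
}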
+{ +my ($in0,$in1,$in2,$in3,$in4,$in5)=map("%xmm$_",(10..15)); +my ($key0,$ctr)=("%ebp","${ivp}d"); +my $frame_size = 0x80 + ($win64?160:0); + +$code.=<<___; +.globl aesni_ctr32_encrypt_blocks +.type aesni_ctr32_encrypt_blocks,\@function,5 +.align 16 +aesni_ctr32_encrypt_blocks: +.cfi_startproc + cmp \$1,$len + jne .Lctr32_bulk + + # handle single block without allocating stack frame, + # useful when handling edges + movups ($ivp),$inout0 + movups ($inp),$inout1 + mov 240($key),%edx # key->rounds +___ + &aesni_generate1("enc",$key,"%edx"); +$code.=<<___; + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + xorps $inout1,$inout0 + pxor $inout1,$inout1 + movups $inout0,($out) + xorps $inout0,$inout0 + jmp .Lctr32_epilogue + +.align 16 +.Lctr32_bulk: + lea (%rsp),$key_ # use $key_ as frame pointer +.cfi_def_cfa_register $key_ + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8($key_) # offload everything + movaps %xmm7,-0x98($key_) + movaps %xmm8,-0x88($key_) + movaps %xmm9,-0x78($key_) + movaps %xmm10,-0x68($key_) + movaps %xmm11,-0x58($key_) + movaps %xmm12,-0x48($key_) + movaps %xmm13,-0x38($key_) + movaps %xmm14,-0x28($key_) + movaps %xmm15,-0x18($key_) +.Lctr32_body: +___ +$code.=<<___; + + # 8 16-byte words on top of stack are counter values + # xor-ed with zero-round key + + movdqu ($ivp),$inout0 + movdqu ($key),$rndkey0 + mov 12($ivp),$ctr # counter LSB + pxor $rndkey0,$inout0 + mov 12($key),$key0 # 0-round key LSB + movdqa $inout0,0x00(%rsp) # populate counter block + bswap $ctr + movdqa $inout0,$inout1 + movdqa $inout0,$inout2 + movdqa $inout0,$inout3 + movdqa $inout0,0x40(%rsp) + movdqa $inout0,0x50(%rsp) + movdqa $inout0,0x60(%rsp) + mov %rdx,%r10 # about to borrow %rdx + movdqa $inout0,0x70(%rsp) + + lea 1($ctr),%rax + lea 2($ctr),%rdx + bswap %eax + bswap %edx + xor $key0,%eax + xor $key0,%edx + pinsrd \$3,%eax,$inout1 + lea 3($ctr),%rax + movdqa $inout1,0x10(%rsp) + pinsrd \$3,%edx,$inout2 + bswap %eax + mov %r10,%rdx # restore %rdx + lea 4($ctr),%r10 + movdqa $inout2,0x20(%rsp) + xor $key0,%eax + bswap %r10d + pinsrd \$3,%eax,$inout3 + xor $key0,%r10d + movdqa $inout3,0x30(%rsp) + lea 5($ctr),%r9 + mov %r10d,0x40+12(%rsp) + bswap %r9d + lea 6($ctr),%r10 + mov 240($key),$rounds # key->rounds + xor $key0,%r9d + bswap %r10d + mov %r9d,0x50+12(%rsp) + xor $key0,%r10d + lea 7($ctr),%r9 + mov %r10d,0x60+12(%rsp) + bswap %r9d +# leaq OPENSSL_ia32cap_P(%rip),%r10 +# mov 4(%r10),%r10d + xor $key0,%r9d +# and \$`1<<26|1<<22`,%r10d # isolate XSAVE+MOVBE + mov %r9d,0x70+12(%rsp) + + $movkey 0x10($key),$rndkey1 + + movdqa 0x40(%rsp),$inout4 + movdqa 0x50(%rsp),$inout5 + + cmp \$8,$len # $len is in blocks + jb .Lctr32_tail # short input if ($len<8) + + sub \$6,$len # $len is biased by -6 +# cmp \$`1<<22`,%r10d # check for MOVBE without XSAVE +# je .Lctr32_6x # [which denotes Atom Silvermont] + + lea 0x80($key),$key # size optimization + sub \$2,$len # $len is biased by -8 + jmp .Lctr32_loop8 + +#.align 16 +#.Lctr32_6x: +# shl \$4,$rounds +# mov \$48,$rnds_ +# bswap $key0 +# lea 32($key,$rounds),$key # end of key schedule +# sub %rax,%r10 # twisted $rounds +# jmp .Lctr32_loop6 + +.align 16 +.Lctr32_loop6: + add \$6,$ctr # next counter value + $movkey -48($key,$rnds_),$rndkey0 + aesenc $rndkey1,$inout0 + mov $ctr,%eax + xor $key0,%eax + aesenc $rndkey1,$inout1 + movbe %eax,`0x00+12`(%rsp) # store next counter value + lea 1($ctr),%eax + aesenc 
$rndkey1,$inout2 + xor $key0,%eax + movbe %eax,`0x10+12`(%rsp) + aesenc $rndkey1,$inout3 + lea 2($ctr),%eax + xor $key0,%eax + aesenc $rndkey1,$inout4 + movbe %eax,`0x20+12`(%rsp) + lea 3($ctr),%eax + aesenc $rndkey1,$inout5 + $movkey -32($key,$rnds_),$rndkey1 + xor $key0,%eax + + aesenc $rndkey0,$inout0 + movbe %eax,`0x30+12`(%rsp) + lea 4($ctr),%eax + aesenc $rndkey0,$inout1 + xor $key0,%eax + movbe %eax,`0x40+12`(%rsp) + aesenc $rndkey0,$inout2 + lea 5($ctr),%eax + xor $key0,%eax + aesenc $rndkey0,$inout3 + movbe %eax,`0x50+12`(%rsp) + mov %r10,%rax # mov $rnds_,$rounds + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey -16($key,$rnds_),$rndkey0 + + call .Lenc_loop6 + + movdqu ($inp),$inout6 # load 6 input blocks + movdqu 0x10($inp),$inout7 + movdqu 0x20($inp),$in0 + movdqu 0x30($inp),$in1 + movdqu 0x40($inp),$in2 + movdqu 0x50($inp),$in3 + lea 0x60($inp),$inp # $inp+=6*16 + $movkey -64($key,$rnds_),$rndkey1 + pxor $inout0,$inout6 # inp^=E(ctr) + movaps 0x00(%rsp),$inout0 # load next counter [xor-ed with 0 round] + pxor $inout1,$inout7 + movaps 0x10(%rsp),$inout1 + pxor $inout2,$in0 + movaps 0x20(%rsp),$inout2 + pxor $inout3,$in1 + movaps 0x30(%rsp),$inout3 + pxor $inout4,$in2 + movaps 0x40(%rsp),$inout4 + pxor $inout5,$in3 + movaps 0x50(%rsp),$inout5 + movdqu $inout6,($out) # store 6 output blocks + movdqu $inout7,0x10($out) + movdqu $in0,0x20($out) + movdqu $in1,0x30($out) + movdqu $in2,0x40($out) + movdqu $in3,0x50($out) + lea 0x60($out),$out # $out+=6*16 + + sub \$6,$len + jnc .Lctr32_loop6 # loop if $len-=6 didn't borrow + + add \$6,$len # restore real remaining $len + jz .Lctr32_done # done if ($len==0) + + lea -48($rnds_),$rounds + lea -80($key,$rnds_),$key # restore $key + neg $rounds + shr \$4,$rounds # restore $rounds + jmp .Lctr32_tail + +.align 32 +.Lctr32_loop8: + add \$8,$ctr # next counter value + movdqa 0x60(%rsp),$inout6 + aesenc $rndkey1,$inout0 + mov $ctr,%r9d + movdqa 0x70(%rsp),$inout7 + aesenc $rndkey1,$inout1 + bswap %r9d + $movkey 0x20-0x80($key),$rndkey0 + aesenc $rndkey1,$inout2 + xor $key0,%r9d + nop + aesenc $rndkey1,$inout3 + mov %r9d,0x00+12(%rsp) # store next counter value + lea 1($ctr),%r9 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + $movkey 0x30-0x80($key),$rndkey1 +___ +for($i=2;$i<8;$i++) { +my $rndkeyx = ($i&1)?$rndkey1:$rndkey0; +$code.=<<___; + bswap %r9d + aesenc $rndkeyx,$inout0 + aesenc $rndkeyx,$inout1 + xor $key0,%r9d + .byte 0x66,0x90 + aesenc $rndkeyx,$inout2 + aesenc $rndkeyx,$inout3 + mov %r9d,`0x10*($i-1)`+12(%rsp) + lea $i($ctr),%r9 + aesenc $rndkeyx,$inout4 + aesenc $rndkeyx,$inout5 + aesenc $rndkeyx,$inout6 + aesenc $rndkeyx,$inout7 + $movkey `0x20+0x10*$i`-0x80($key),$rndkeyx +___ +} +$code.=<<___; + bswap %r9d + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + xor $key0,%r9d + movdqu 0x00($inp),$in0 # start loading input + aesenc $rndkey0,$inout3 + mov %r9d,0x70+12(%rsp) + cmp \$11,$rounds + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + aesenc $rndkey0,$inout6 + aesenc $rndkey0,$inout7 + $movkey 0xa0-0x80($key),$rndkey0 + + jb .Lctr32_enc_done + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + $movkey 0xb0-0x80($key),$rndkey1 + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc 
$rndkey0,$inout5 + aesenc $rndkey0,$inout6 + aesenc $rndkey0,$inout7 + $movkey 0xc0-0x80($key),$rndkey0 + je .Lctr32_enc_done + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + $movkey 0xd0-0x80($key),$rndkey1 + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + aesenc $rndkey0,$inout6 + aesenc $rndkey0,$inout7 + $movkey 0xe0-0x80($key),$rndkey0 + jmp .Lctr32_enc_done + +.align 16 +.Lctr32_enc_done: + movdqu 0x10($inp),$in1 + pxor $rndkey0,$in0 # input^=round[last] + movdqu 0x20($inp),$in2 + pxor $rndkey0,$in1 + movdqu 0x30($inp),$in3 + pxor $rndkey0,$in2 + movdqu 0x40($inp),$in4 + pxor $rndkey0,$in3 + movdqu 0x50($inp),$in5 + pxor $rndkey0,$in4 + pxor $rndkey0,$in5 + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + movdqu 0x60($inp),$rndkey1 # borrow $rndkey1 for inp[6] + lea 0x80($inp),$inp # $inp+=8*16 + + aesenclast $in0,$inout0 # $inN is inp[N]^round[last] + pxor $rndkey0,$rndkey1 # borrowed $rndkey + movdqu 0x70-0x80($inp),$in0 + aesenclast $in1,$inout1 + pxor $rndkey0,$in0 + movdqa 0x00(%rsp),$in1 # load next counter block + aesenclast $in2,$inout2 + aesenclast $in3,$inout3 + movdqa 0x10(%rsp),$in2 + movdqa 0x20(%rsp),$in3 + aesenclast $in4,$inout4 + aesenclast $in5,$inout5 + movdqa 0x30(%rsp),$in4 + movdqa 0x40(%rsp),$in5 + aesenclast $rndkey1,$inout6 + movdqa 0x50(%rsp),$rndkey0 + $movkey 0x10-0x80($key),$rndkey1#real 1st-round key + aesenclast $in0,$inout7 + + movups $inout0,($out) # store 8 output blocks + movdqa $in1,$inout0 + movups $inout1,0x10($out) + movdqa $in2,$inout1 + movups $inout2,0x20($out) + movdqa $in3,$inout2 + movups $inout3,0x30($out) + movdqa $in4,$inout3 + movups $inout4,0x40($out) + movdqa $in5,$inout4 + movups $inout5,0x50($out) + movdqa $rndkey0,$inout5 + movups $inout6,0x60($out) + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + + sub \$8,$len + jnc .Lctr32_loop8 # loop if $len-=8 didn't borrow + + add \$8,$len # restore real remaining $len + jz .Lctr32_done # done if ($len==0) + lea -0x80($key),$key + +.Lctr32_tail: + # note that at this point $inout0..5 are populated with + # counter values xor-ed with 0-round key + lea 16($key),$key + cmp \$4,$len + jb .Lctr32_loop3 + je .Lctr32_loop4 + + # if ($len>4) compute 7 E(counter) + shl \$4,$rounds + movdqa 0x60(%rsp),$inout6 + pxor $inout7,$inout7 + + $movkey 16($key),$rndkey0 + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + lea 32-16($key,$rounds),$key# prepare for .Lenc_loop8_enter + neg %rax + aesenc $rndkey1,$inout2 + add \$16,%rax # prepare for .Lenc_loop8_enter + movups ($inp),$in0 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + movups 0x10($inp),$in1 # pre-load input + movups 0x20($inp),$in2 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + + call .Lenc_loop8_enter + + movdqu 0x30($inp),$in3 + pxor $in0,$inout0 + movdqu 0x40($inp),$in0 + pxor $in1,$inout1 + movdqu $inout0,($out) # store output + pxor $in2,$inout2 + movdqu $inout1,0x10($out) + pxor $in3,$inout3 + movdqu $inout2,0x20($out) + pxor $in0,$inout4 + movdqu $inout3,0x30($out) + movdqu $inout4,0x40($out) + cmp \$6,$len + jb .Lctr32_done # $len was 5, stop store + + movups 0x50($inp),$in1 + xorps 
$in1,$inout5 + movups $inout5,0x50($out) + je .Lctr32_done # $len was 6, stop store + + movups 0x60($inp),$in2 + xorps $in2,$inout6 + movups $inout6,0x60($out) + jmp .Lctr32_done # $len was 7, stop store + +.align 32 +.Lctr32_loop4: + aesenc $rndkey1,$inout0 + lea 16($key),$key + dec $rounds + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey ($key),$rndkey1 + jnz .Lctr32_loop4 + aesenclast $rndkey1,$inout0 + aesenclast $rndkey1,$inout1 + movups ($inp),$in0 # load input + movups 0x10($inp),$in1 + aesenclast $rndkey1,$inout2 + aesenclast $rndkey1,$inout3 + movups 0x20($inp),$in2 + movups 0x30($inp),$in3 + + xorps $in0,$inout0 + movups $inout0,($out) # store output + xorps $in1,$inout1 + movups $inout1,0x10($out) + pxor $in2,$inout2 + movdqu $inout2,0x20($out) + pxor $in3,$inout3 + movdqu $inout3,0x30($out) + jmp .Lctr32_done # $len was 4, stop store + +.align 32 +.Lctr32_loop3: + aesenc $rndkey1,$inout0 + lea 16($key),$key + dec $rounds + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + $movkey ($key),$rndkey1 + jnz .Lctr32_loop3 + aesenclast $rndkey1,$inout0 + aesenclast $rndkey1,$inout1 + aesenclast $rndkey1,$inout2 + + movups ($inp),$in0 # load input + xorps $in0,$inout0 + movups $inout0,($out) # store output + cmp \$2,$len + jb .Lctr32_done # $len was 1, stop store + + movups 0x10($inp),$in1 + xorps $in1,$inout1 + movups $inout1,0x10($out) + je .Lctr32_done # $len was 2, stop store + + movups 0x20($inp),$in2 + xorps $in2,$inout2 + movups $inout2,0x20($out) # $len was 3, stop store + +.Lctr32_done: + xorps %xmm0,%xmm0 # clear register bank + xor $key0,$key0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0x00(%rsp) # clear stack + pxor %xmm8,%xmm8 + movaps %xmm0,0x10(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,0x20(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,0x30(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,0x40(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,0x50(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,0x60(%rsp) + pxor %xmm14,%xmm14 + movaps %xmm0,0x70(%rsp) + pxor %xmm15,%xmm15 +___ +$code.=<<___ if ($win64); + movaps -0xa8($key_),%xmm6 + movaps %xmm0,-0xa8($key_) # clear stack + movaps -0x98($key_),%xmm7 + movaps %xmm0,-0x98($key_) + movaps -0x88($key_),%xmm8 + movaps %xmm0,-0x88($key_) + movaps -0x78($key_),%xmm9 + movaps %xmm0,-0x78($key_) + movaps -0x68($key_),%xmm10 + movaps %xmm0,-0x68($key_) + movaps -0x58($key_),%xmm11 + movaps %xmm0,-0x58($key_) + movaps -0x48($key_),%xmm12 + movaps %xmm0,-0x48($key_) + movaps -0x38($key_),%xmm13 + movaps %xmm0,-0x38($key_) + movaps -0x28($key_),%xmm14 + movaps %xmm0,-0x28($key_) + movaps -0x18($key_),%xmm15 + movaps %xmm0,-0x18($key_) + movaps %xmm0,0x00(%rsp) + movaps %xmm0,0x10(%rsp) + movaps %xmm0,0x20(%rsp) + movaps %xmm0,0x30(%rsp) + movaps %xmm0,0x40(%rsp) + movaps %xmm0,0x50(%rsp) + movaps %xmm0,0x60(%rsp) + movaps %xmm0,0x70(%rsp) +___ +$code.=<<___; + mov -8($key_),%rbp +.cfi_restore %rbp + lea ($key_),%rsp +.cfi_def_cfa_register %rsp +.Lctr32_epilogue: + ret +.cfi_endproc +.size aesni_ctr32_encrypt_blocks,.-aesni_ctr32_encrypt_blocks +___ +} + +###################################################################### +# void aesni_xts_[en|de]crypt(const char *inp,char *out,size_t len, +# const AES_KEY *key1, const AES_KEY *key2 +# const unsigned char iv[16]); +# +if (0) { +my @tweak=map("%xmm$_",(10..15)); +my ($twmask,$twres,$twtmp)=("%xmm8","%xmm9",@tweak[4]); +my 
($key2,$ivp,$len_)=("%r8","%r9","%r9"); +my $frame_size = 0x70 + ($win64?160:0); +my $key_ = "%rbp"; # override so that we can use %r11 as FP + +$code.=<<___; +.globl aesni_xts_encrypt +.type aesni_xts_encrypt,\@function,6 +.align 16 +aesni_xts_encrypt: +.cfi_startproc + lea (%rsp),%r11 # frame pointer +.cfi_def_cfa_register %r11 + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r11) # offload everything + movaps %xmm7,-0x98(%r11) + movaps %xmm8,-0x88(%r11) + movaps %xmm9,-0x78(%r11) + movaps %xmm10,-0x68(%r11) + movaps %xmm11,-0x58(%r11) + movaps %xmm12,-0x48(%r11) + movaps %xmm13,-0x38(%r11) + movaps %xmm14,-0x28(%r11) + movaps %xmm15,-0x18(%r11) +.Lxts_enc_body: +___ +$code.=<<___; + movups ($ivp),$inout0 # load clear-text tweak + mov 240(%r8),$rounds # key2->rounds + mov 240($key),$rnds_ # key1->rounds +___ + # generate the tweak + &aesni_generate1("enc",$key2,$rounds,$inout0); +$code.=<<___; + $movkey ($key),$rndkey0 # zero round key + mov $key,$key_ # backup $key + mov $rnds_,$rounds # backup $rounds + shl \$4,$rnds_ + mov $len,$len_ # backup $len + and \$-16,$len + + $movkey 16($key,$rnds_),$rndkey1 # last round key + + movdqa .Lxts_magic(%rip),$twmask + movdqa $inout0,@tweak[5] + pshufd \$0x5f,$inout0,$twres + pxor $rndkey0,$rndkey1 +___ + # alternative tweak calculation algorithm is based on suggestions + # by Shay Gueron. psrad doesn't conflict with AES-NI instructions + # and should help in the future... + for ($i=0;$i<4;$i++) { + $code.=<<___; + movdqa $twres,$twtmp + paddd $twres,$twres + movdqa @tweak[5],@tweak[$i] + psrad \$31,$twtmp # broadcast upper bits + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + pxor $rndkey0,@tweak[$i] + pxor $twtmp,@tweak[5] +___ + } +$code.=<<___; + movdqa @tweak[5],@tweak[4] + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + pand $twmask,$twres + pxor $rndkey0,@tweak[4] + pxor $twres,@tweak[5] + movaps $rndkey1,0x60(%rsp) # save round[0]^round[last] + + sub \$16*6,$len + jc .Lxts_enc_short # if $len-=6*16 borrowed + + mov \$16+96,$rounds + lea 32($key_,$rnds_),$key # end of key schedule + sub %r10,%rax # twisted $rounds + $movkey 16($key_),$rndkey1 + mov %rax,%r10 # backup twisted $rounds + lea .Lxts_magic(%rip),%r8 + jmp .Lxts_enc_grandloop + +.align 32 +.Lxts_enc_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqa $rndkey0,$twmask + movdqu `16*1`($inp),$inout1 + pxor @tweak[0],$inout0 # input^=tweak^round[0] + movdqu `16*2`($inp),$inout2 + pxor @tweak[1],$inout1 + aesenc $rndkey1,$inout0 + movdqu `16*3`($inp),$inout3 + pxor @tweak[2],$inout2 + aesenc $rndkey1,$inout1 + movdqu `16*4`($inp),$inout4 + pxor @tweak[3],$inout3 + aesenc $rndkey1,$inout2 + movdqu `16*5`($inp),$inout5 + pxor @tweak[5],$twmask # round[0]^=tweak[5] + movdqa 0x60(%rsp),$twres # load round[0]^round[last] + pxor @tweak[4],$inout4 + aesenc $rndkey1,$inout3 + $movkey 32($key_),$rndkey0 + lea `16*6`($inp),$inp + pxor $twmask,$inout5 + + pxor $twres,@tweak[0] # calculate tweaks^round[last] + aesenc $rndkey1,$inout4 + pxor $twres,@tweak[1] + movdqa @tweak[0],`16*0`(%rsp) # put aside tweaks^round[last] + aesenc $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $twres,@tweak[2] + + aesenc $rndkey0,$inout0 + pxor $twres,@tweak[3] + movdqa @tweak[1],`16*1`(%rsp) + aesenc $rndkey0,$inout1 + pxor $twres,@tweak[4] + movdqa @tweak[2],`16*2`(%rsp) + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + pxor $twres,$twmask + movdqa @tweak[4],`16*4`(%rsp) + 
aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey 64($key_),$rndkey0 + movdqa $twmask,`16*5`(%rsp) + pshufd \$0x5f,@tweak[5],$twres + jmp .Lxts_enc_loop6 +.align 32 +.Lxts_enc_loop6: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey -64($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey -80($key,%rax),$rndkey0 + jnz .Lxts_enc_loop6 + + movdqa (%r8),$twmask # start calculating next tweak + movdqa $twres,$twtmp + paddd $twres,$twres + aesenc $rndkey1,$inout0 + paddq @tweak[5],@tweak[5] + psrad \$31,$twtmp + aesenc $rndkey1,$inout1 + pand $twmask,$twtmp + $movkey ($key_),@tweak[0] # load round[0] + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[0],@tweak[1] # copy round[0] + aesenc $rndkey1,$inout5 + $movkey -64($key),$rndkey1 + + movdqa $twres,$twtmp + aesenc $rndkey0,$inout0 + paddd $twres,$twres + pxor @tweak[5],@tweak[0] + aesenc $rndkey0,$inout1 + psrad \$31,$twtmp + paddq @tweak[5],@tweak[5] + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + pand $twmask,$twtmp + movaps @tweak[1],@tweak[2] + aesenc $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movdqa $twres,$twtmp + aesenc $rndkey0,$inout5 + $movkey -48($key),$rndkey0 + + paddd $twres,$twres + aesenc $rndkey1,$inout0 + pxor @tweak[5],@tweak[1] + psrad \$31,$twtmp + aesenc $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + movdqa @tweak[3],`16*3`(%rsp) + pxor $twtmp,@tweak[5] + aesenc $rndkey1,$inout4 + movaps @tweak[2],@tweak[3] + movdqa $twres,$twtmp + aesenc $rndkey1,$inout5 + $movkey -32($key),$rndkey1 + + paddd $twres,$twres + aesenc $rndkey0,$inout0 + pxor @tweak[5],@tweak[2] + psrad \$31,$twtmp + aesenc $rndkey0,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[3],@tweak[4] + aesenc $rndkey0,$inout5 + + movdqa $twres,$rndkey0 + paddd $twres,$twres + aesenc $rndkey1,$inout0 + pxor @tweak[5],@tweak[3] + psrad \$31,$rndkey0 + aesenc $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$rndkey0 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + pxor $rndkey0,@tweak[5] + $movkey ($key_),$rndkey0 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + + pxor @tweak[5],@tweak[4] + aesenclast `16*0`(%rsp),$inout0 + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + aesenclast `16*1`(%rsp),$inout1 + aesenclast `16*2`(%rsp),$inout2 + pand $twmask,$twres + mov %r10,%rax # restore $rounds + aesenclast `16*3`(%rsp),$inout3 + aesenclast `16*4`(%rsp),$inout4 + aesenclast `16*5`(%rsp),$inout5 + pxor $twres,@tweak[5] + + lea `16*6`($out),$out # $out+=6*16 + movups $inout0,`-16*6`($out) # store 6 output blocks + movups $inout1,`-16*5`($out) + movups $inout2,`-16*4`($out) + movups $inout3,`-16*3`($out) + movups $inout4,`-16*2`($out) + movups $inout5,`-16*1`($out) + sub \$16*6,$len + jnc .Lxts_enc_grandloop # loop if $len-=6*16 didn't borrow + + mov \$16+96,$rounds + sub $rnds_,$rounds + mov $key_,$key # restore $key + shr \$4,$rounds # restore original value + +.Lxts_enc_short: + # at the point @tweak[0..5] are populated with tweak values + mov $rounds,$rnds_ # backup $rounds + pxor 
$rndkey0,@tweak[0] + add \$16*6,$len # restore real remaining $len + jz .Lxts_enc_done # done if ($len==0) + + pxor $rndkey0,@tweak[1] + cmp \$0x20,$len + jb .Lxts_enc_one # $len is 1*16 + pxor $rndkey0,@tweak[2] + je .Lxts_enc_two # $len is 2*16 + + pxor $rndkey0,@tweak[3] + cmp \$0x40,$len + jb .Lxts_enc_three # $len is 3*16 + pxor $rndkey0,@tweak[4] + je .Lxts_enc_four # $len is 4*16 + + movdqu ($inp),$inout0 # $len is 5*16 + movdqu 16*1($inp),$inout1 + movdqu 16*2($inp),$inout2 + pxor @tweak[0],$inout0 + movdqu 16*3($inp),$inout3 + pxor @tweak[1],$inout1 + movdqu 16*4($inp),$inout4 + lea 16*5($inp),$inp # $inp+=5*16 + pxor @tweak[2],$inout2 + pxor @tweak[3],$inout3 + pxor @tweak[4],$inout4 + pxor $inout5,$inout5 + + call _aesni_encrypt6 + + xorps @tweak[0],$inout0 + movdqa @tweak[5],@tweak[0] + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + movdqu $inout0,($out) # store 5 output blocks + xorps @tweak[3],$inout3 + movdqu $inout1,16*1($out) + xorps @tweak[4],$inout4 + movdqu $inout2,16*2($out) + movdqu $inout3,16*3($out) + movdqu $inout4,16*4($out) + lea 16*5($out),$out # $out+=5*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_one: + movups ($inp),$inout0 + lea 16*1($inp),$inp # inp+=1*16 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movdqa @tweak[1],@tweak[0] + movups $inout0,($out) # store one output block + lea 16*1($out),$out # $out+=1*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_two: + movups ($inp),$inout0 + movups 16($inp),$inout1 + lea 32($inp),$inp # $inp+=2*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + + call _aesni_encrypt2 + + xorps @tweak[0],$inout0 + movdqa @tweak[2],@tweak[0] + xorps @tweak[1],$inout1 + movups $inout0,($out) # store 2 output blocks + movups $inout1,16*1($out) + lea 16*2($out),$out # $out+=2*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_three: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + lea 16*3($inp),$inp # $inp+=3*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + + call _aesni_encrypt3 + + xorps @tweak[0],$inout0 + movdqa @tweak[3],@tweak[0] + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + movups $inout0,($out) # store 3 output blocks + movups $inout1,16*1($out) + movups $inout2,16*2($out) + lea 16*3($out),$out # $out+=3*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_four: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + xorps @tweak[0],$inout0 + movups 16*3($inp),$inout3 + lea 16*4($inp),$inp # $inp+=4*16 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + xorps @tweak[3],$inout3 + + call _aesni_encrypt4 + + pxor @tweak[0],$inout0 + movdqa @tweak[4],@tweak[0] + pxor @tweak[1],$inout1 + pxor @tweak[2],$inout2 + movdqu $inout0,($out) # store 4 output blocks + pxor @tweak[3],$inout3 + movdqu $inout1,16*1($out) + movdqu $inout2,16*2($out) + movdqu $inout3,16*3($out) + lea 16*4($out),$out # $out+=4*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_done: + and \$15,$len_ # see if $len%16 is 0 + jz .Lxts_enc_ret + mov $len_,$len + +.Lxts_enc_steal: + movzb ($inp),%eax # borrow $rounds ... + movzb -16($out),%ecx # ... 
and $key + lea 1($inp),$inp + mov %al,-16($out) + mov %cl,0($out) + lea 1($out),$out + sub \$1,$len + jnz .Lxts_enc_steal + + sub $len_,$out # rewind $out + mov $key_,$key # restore $key + mov $rnds_,$rounds # restore $rounds + + movups -16($out),$inout0 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movups $inout0,-16($out) + +.Lxts_enc_ret: + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0x00(%rsp) # clear stack + pxor %xmm8,%xmm8 + movaps %xmm0,0x10(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,0x20(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,0x30(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,0x40(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,0x50(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,0x60(%rsp) + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r11),%xmm6 + movaps %xmm0,-0xa8(%r11) # clear stack + movaps -0x98(%r11),%xmm7 + movaps %xmm0,-0x98(%r11) + movaps -0x88(%r11),%xmm8 + movaps %xmm0,-0x88(%r11) + movaps -0x78(%r11),%xmm9 + movaps %xmm0,-0x78(%r11) + movaps -0x68(%r11),%xmm10 + movaps %xmm0,-0x68(%r11) + movaps -0x58(%r11),%xmm11 + movaps %xmm0,-0x58(%r11) + movaps -0x48(%r11),%xmm12 + movaps %xmm0,-0x48(%r11) + movaps -0x38(%r11),%xmm13 + movaps %xmm0,-0x38(%r11) + movaps -0x28(%r11),%xmm14 + movaps %xmm0,-0x28(%r11) + movaps -0x18(%r11),%xmm15 + movaps %xmm0,-0x18(%r11) + movaps %xmm0,0x00(%rsp) + movaps %xmm0,0x10(%rsp) + movaps %xmm0,0x20(%rsp) + movaps %xmm0,0x30(%rsp) + movaps %xmm0,0x40(%rsp) + movaps %xmm0,0x50(%rsp) + movaps %xmm0,0x60(%rsp) +___ +$code.=<<___; + mov -8(%r11),%rbp +.cfi_restore %rbp + lea (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lxts_enc_epilogue: + ret +.cfi_endproc +.size aesni_xts_encrypt,.-aesni_xts_encrypt +___ + +$code.=<<___; +.globl aesni_xts_decrypt +.type aesni_xts_decrypt,\@function,6 +.align 16 +aesni_xts_decrypt: +.cfi_startproc + lea (%rsp),%r11 # frame pointer +.cfi_def_cfa_register %r11 + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r11) # offload everything + movaps %xmm7,-0x98(%r11) + movaps %xmm8,-0x88(%r11) + movaps %xmm9,-0x78(%r11) + movaps %xmm10,-0x68(%r11) + movaps %xmm11,-0x58(%r11) + movaps %xmm12,-0x48(%r11) + movaps %xmm13,-0x38(%r11) + movaps %xmm14,-0x28(%r11) + movaps %xmm15,-0x18(%r11) +.Lxts_dec_body: +___ +$code.=<<___; + movups ($ivp),$inout0 # load clear-text tweak + mov 240($key2),$rounds # key2->rounds + mov 240($key),$rnds_ # key1->rounds +___ + # generate the tweak + &aesni_generate1("enc",$key2,$rounds,$inout0); +$code.=<<___; + xor %eax,%eax # if ($len%16) len-=16; + test \$15,$len + setnz %al + shl \$4,%rax + sub %rax,$len + + $movkey ($key),$rndkey0 # zero round key + mov $key,$key_ # backup $key + mov $rnds_,$rounds # backup $rounds + shl \$4,$rnds_ + mov $len,$len_ # backup $len + and \$-16,$len + + $movkey 16($key,$rnds_),$rndkey1 # last round key + + movdqa .Lxts_magic(%rip),$twmask + movdqa $inout0,@tweak[5] + pshufd \$0x5f,$inout0,$twres + pxor $rndkey0,$rndkey1 +___ + for ($i=0;$i<4;$i++) { + $code.=<<___; + movdqa $twres,$twtmp + paddd $twres,$twres + movdqa @tweak[5],@tweak[$i] + psrad \$31,$twtmp # broadcast upper bits + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + pxor $rndkey0,@tweak[$i] + pxor 
$twtmp,@tweak[5] +___ + } +$code.=<<___; + movdqa @tweak[5],@tweak[4] + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + pand $twmask,$twres + pxor $rndkey0,@tweak[4] + pxor $twres,@tweak[5] + movaps $rndkey1,0x60(%rsp) # save round[0]^round[last] + + sub \$16*6,$len + jc .Lxts_dec_short # if $len-=6*16 borrowed + + mov \$16+96,$rounds + lea 32($key_,$rnds_),$key # end of key schedule + sub %r10,%rax # twisted $rounds + $movkey 16($key_),$rndkey1 + mov %rax,%r10 # backup twisted $rounds + lea .Lxts_magic(%rip),%r8 + jmp .Lxts_dec_grandloop + +.align 32 +.Lxts_dec_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqa $rndkey0,$twmask + movdqu `16*1`($inp),$inout1 + pxor @tweak[0],$inout0 # input^=tweak^round[0] + movdqu `16*2`($inp),$inout2 + pxor @tweak[1],$inout1 + aesdec $rndkey1,$inout0 + movdqu `16*3`($inp),$inout3 + pxor @tweak[2],$inout2 + aesdec $rndkey1,$inout1 + movdqu `16*4`($inp),$inout4 + pxor @tweak[3],$inout3 + aesdec $rndkey1,$inout2 + movdqu `16*5`($inp),$inout5 + pxor @tweak[5],$twmask # round[0]^=tweak[5] + movdqa 0x60(%rsp),$twres # load round[0]^round[last] + pxor @tweak[4],$inout4 + aesdec $rndkey1,$inout3 + $movkey 32($key_),$rndkey0 + lea `16*6`($inp),$inp + pxor $twmask,$inout5 + + pxor $twres,@tweak[0] # calculate tweaks^round[last] + aesdec $rndkey1,$inout4 + pxor $twres,@tweak[1] + movdqa @tweak[0],`16*0`(%rsp) # put aside tweaks^last round key + aesdec $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $twres,@tweak[2] + + aesdec $rndkey0,$inout0 + pxor $twres,@tweak[3] + movdqa @tweak[1],`16*1`(%rsp) + aesdec $rndkey0,$inout1 + pxor $twres,@tweak[4] + movdqa @tweak[2],`16*2`(%rsp) + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + pxor $twres,$twmask + movdqa @tweak[4],`16*4`(%rsp) + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey 64($key_),$rndkey0 + movdqa $twmask,`16*5`(%rsp) + pshufd \$0x5f,@tweak[5],$twres + jmp .Lxts_dec_loop6 +.align 32 +.Lxts_dec_loop6: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey -64($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey -80($key,%rax),$rndkey0 + jnz .Lxts_dec_loop6 + + movdqa (%r8),$twmask # start calculating next tweak + movdqa $twres,$twtmp + paddd $twres,$twres + aesdec $rndkey1,$inout0 + paddq @tweak[5],@tweak[5] + psrad \$31,$twtmp + aesdec $rndkey1,$inout1 + pand $twmask,$twtmp + $movkey ($key_),@tweak[0] # load round[0] + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[0],@tweak[1] # copy round[0] + aesdec $rndkey1,$inout5 + $movkey -64($key),$rndkey1 + + movdqa $twres,$twtmp + aesdec $rndkey0,$inout0 + paddd $twres,$twres + pxor @tweak[5],@tweak[0] + aesdec $rndkey0,$inout1 + psrad \$31,$twtmp + paddq @tweak[5],@tweak[5] + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + pand $twmask,$twtmp + movaps @tweak[1],@tweak[2] + aesdec $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movdqa $twres,$twtmp + aesdec $rndkey0,$inout5 + $movkey -48($key),$rndkey0 + + paddd $twres,$twres + aesdec $rndkey1,$inout0 + pxor @tweak[5],@tweak[1] + psrad \$31,$twtmp + aesdec $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + movdqa @tweak[3],`16*3`(%rsp) + pxor $twtmp,@tweak[5] + aesdec $rndkey1,$inout4 +
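The pshufd/psrad/pand/paddq clusters interleaved with the aesdec rounds here compute the six tweaks for the next iteration while the cipher pipeline is busy. Each update multiplies the running tweak by x in GF(2^128): paddq doubles both 64-bit halves, and a mask built from the replicated sign bits (pshufd \$0x5f, then psrad \$31) ANDed with .Lxts_magic (.long 0x87,0,1,0) supplies both the carry between the halves and the 0x87 reduction. A minimal C sketch of the same update, assuming the tweak is held as a little-endian pair of 64-bit halves (the helper name is ours, not part of this file):

    #include <stdint.h>

    /* Multiply an XTS tweak by x in GF(2^128): shift the 128-bit value left
     * by one bit; if a bit falls off the top, fold it back in as 0x87, the
     * reduction for x^128 + x^7 + x^2 + x + 1 (the .Lxts_magic constant). */
    static void xts_mul_x(uint64_t t[2])        /* t[0] low, t[1] high */
    {
        uint64_t carry = t[1] >> 63;            /* bit 127, about to fall out */
        t[1] = (t[1] << 1) | (t[0] >> 63);      /* 128-bit left shift */
        t[0] = (t[0] << 1) ^ (carry * 0x87);    /* branch-free reduction */
    }

The vector code reaches the same result without a branch: the sign-bit mask selects 0x87 (and the cross-half carry bit) only when the corresponding top bit was set.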
movaps @tweak[2],@tweak[3] + movdqa $twres,$twtmp + aesdec $rndkey1,$inout5 + $movkey -32($key),$rndkey1 + + paddd $twres,$twres + aesdec $rndkey0,$inout0 + pxor @tweak[5],@tweak[2] + psrad \$31,$twtmp + aesdec $rndkey0,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[3],@tweak[4] + aesdec $rndkey0,$inout5 + + movdqa $twres,$rndkey0 + paddd $twres,$twres + aesdec $rndkey1,$inout0 + pxor @tweak[5],@tweak[3] + psrad \$31,$rndkey0 + aesdec $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$rndkey0 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + pxor $rndkey0,@tweak[5] + $movkey ($key_),$rndkey0 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + + pxor @tweak[5],@tweak[4] + aesdeclast `16*0`(%rsp),$inout0 + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + aesdeclast `16*1`(%rsp),$inout1 + aesdeclast `16*2`(%rsp),$inout2 + pand $twmask,$twres + mov %r10,%rax # restore $rounds + aesdeclast `16*3`(%rsp),$inout3 + aesdeclast `16*4`(%rsp),$inout4 + aesdeclast `16*5`(%rsp),$inout5 + pxor $twres,@tweak[5] + + lea `16*6`($out),$out # $out+=6*16 + movups $inout0,`-16*6`($out) # store 6 output blocks + movups $inout1,`-16*5`($out) + movups $inout2,`-16*4`($out) + movups $inout3,`-16*3`($out) + movups $inout4,`-16*2`($out) + movups $inout5,`-16*1`($out) + sub \$16*6,$len + jnc .Lxts_dec_grandloop # loop if $len-=6*16 didn't borrow + + mov \$16+96,$rounds + sub $rnds_,$rounds + mov $key_,$key # restore $key + shr \$4,$rounds # restore original value + +.Lxts_dec_short: + # at the point @tweak[0..5] are populated with tweak values + mov $rounds,$rnds_ # backup $rounds + pxor $rndkey0,@tweak[0] + pxor $rndkey0,@tweak[1] + add \$16*6,$len # restore real remaining $len + jz .Lxts_dec_done # done if ($len==0) + + pxor $rndkey0,@tweak[2] + cmp \$0x20,$len + jb .Lxts_dec_one # $len is 1*16 + pxor $rndkey0,@tweak[3] + je .Lxts_dec_two # $len is 2*16 + + pxor $rndkey0,@tweak[4] + cmp \$0x40,$len + jb .Lxts_dec_three # $len is 3*16 + je .Lxts_dec_four # $len is 4*16 + + movdqu ($inp),$inout0 # $len is 5*16 + movdqu 16*1($inp),$inout1 + movdqu 16*2($inp),$inout2 + pxor @tweak[0],$inout0 + movdqu 16*3($inp),$inout3 + pxor @tweak[1],$inout1 + movdqu 16*4($inp),$inout4 + lea 16*5($inp),$inp # $inp+=5*16 + pxor @tweak[2],$inout2 + pxor @tweak[3],$inout3 + pxor @tweak[4],$inout4 + + call _aesni_decrypt6 + + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + movdqu $inout0,($out) # store 5 output blocks + xorps @tweak[3],$inout3 + movdqu $inout1,16*1($out) + xorps @tweak[4],$inout4 + movdqu $inout2,16*2($out) + pxor $twtmp,$twtmp + movdqu $inout3,16*3($out) + pcmpgtd @tweak[5],$twtmp + movdqu $inout4,16*4($out) + lea 16*5($out),$out # $out+=5*16 + pshufd \$0x13,$twtmp,@tweak[1] # $twres + and \$15,$len_ + jz .Lxts_dec_ret + + movdqa @tweak[5],@tweak[0] + paddq @tweak[5],@tweak[5] # psllq 1,$tweak + pand $twmask,@tweak[1] # isolate carry and residue + pxor @tweak[5],@tweak[1] + jmp .Lxts_dec_done2 + +.align 16 +.Lxts_dec_one: + movups ($inp),$inout0 + lea 16*1($inp),$inp # $inp+=1*16 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movdqa @tweak[1],@tweak[0] + movups $inout0,($out) # store one output block + movdqa @tweak[2],@tweak[1] + lea 16*1($out),$out # $out+=1*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_two: + movups ($inp),$inout0 + movups 
16($inp),$inout1 + lea 32($inp),$inp # $inp+=2*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + + call _aesni_decrypt2 + + xorps @tweak[0],$inout0 + movdqa @tweak[2],@tweak[0] + xorps @tweak[1],$inout1 + movdqa @tweak[3],@tweak[1] + movups $inout0,($out) # store 2 output blocks + movups $inout1,16*1($out) + lea 16*2($out),$out # $out+=2*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_three: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + lea 16*3($inp),$inp # $inp+=3*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + + call _aesni_decrypt3 + + xorps @tweak[0],$inout0 + movdqa @tweak[3],@tweak[0] + xorps @tweak[1],$inout1 + movdqa @tweak[4],@tweak[1] + xorps @tweak[2],$inout2 + movups $inout0,($out) # store 3 output blocks + movups $inout1,16*1($out) + movups $inout2,16*2($out) + lea 16*3($out),$out # $out+=3*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_four: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + xorps @tweak[0],$inout0 + movups 16*3($inp),$inout3 + lea 16*4($inp),$inp # $inp+=4*16 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + xorps @tweak[3],$inout3 + + call _aesni_decrypt4 + + pxor @tweak[0],$inout0 + movdqa @tweak[4],@tweak[0] + pxor @tweak[1],$inout1 + movdqa @tweak[5],@tweak[1] + pxor @tweak[2],$inout2 + movdqu $inout0,($out) # store 4 output blocks + pxor @tweak[3],$inout3 + movdqu $inout1,16*1($out) + movdqu $inout2,16*2($out) + movdqu $inout3,16*3($out) + lea 16*4($out),$out # $out+=4*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_done: + and \$15,$len_ # see if $len%16 is 0 + jz .Lxts_dec_ret +.Lxts_dec_done2: + mov $len_,$len + mov $key_,$key # restore $key + mov $rnds_,$rounds # restore $rounds + + movups ($inp),$inout0 + xorps @tweak[1],$inout0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps @tweak[1],$inout0 + movups $inout0,($out) + +.Lxts_dec_steal: + movzb 16($inp),%eax # borrow $rounds ... + movzb ($out),%ecx # ... 
and $key + lea 1($inp),$inp + mov %al,($out) + mov %cl,16($out) + lea 1($out),$out + sub \$1,$len + jnz .Lxts_dec_steal + + sub $len_,$out # rewind $out + mov $key_,$key # restore $key + mov $rnds_,$rounds # restore $rounds + + movups ($out),$inout0 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movups $inout0,($out) + +.Lxts_dec_ret: + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0x00(%rsp) # clear stack + pxor %xmm8,%xmm8 + movaps %xmm0,0x10(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,0x20(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,0x30(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,0x40(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,0x50(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,0x60(%rsp) + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r11),%xmm6 + movaps %xmm0,-0xa8(%r11) # clear stack + movaps -0x98(%r11),%xmm7 + movaps %xmm0,-0x98(%r11) + movaps -0x88(%r11),%xmm8 + movaps %xmm0,-0x88(%r11) + movaps -0x78(%r11),%xmm9 + movaps %xmm0,-0x78(%r11) + movaps -0x68(%r11),%xmm10 + movaps %xmm0,-0x68(%r11) + movaps -0x58(%r11),%xmm11 + movaps %xmm0,-0x58(%r11) + movaps -0x48(%r11),%xmm12 + movaps %xmm0,-0x48(%r11) + movaps -0x38(%r11),%xmm13 + movaps %xmm0,-0x38(%r11) + movaps -0x28(%r11),%xmm14 + movaps %xmm0,-0x28(%r11) + movaps -0x18(%r11),%xmm15 + movaps %xmm0,-0x18(%r11) + movaps %xmm0,0x00(%rsp) + movaps %xmm0,0x10(%rsp) + movaps %xmm0,0x20(%rsp) + movaps %xmm0,0x30(%rsp) + movaps %xmm0,0x40(%rsp) + movaps %xmm0,0x50(%rsp) + movaps %xmm0,0x60(%rsp) +___ +$code.=<<___; + mov -8(%r11),%rbp +.cfi_restore %rbp + lea (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lxts_dec_epilogue: + ret +.cfi_endproc +.size aesni_xts_decrypt,.-aesni_xts_decrypt +___ +} + +###################################################################### +# void aesni_ocb_[en|de]crypt(const char *inp, char *out, size_t blocks, +# const AES_KEY *key, unsigned int start_block_num, +# unsigned char offset_i[16], const unsigned char L_[][16], +# unsigned char checksum[16]); +# +if (0) { +my @offset=map("%xmm$_",(10..15)); +my ($checksum,$rndkey0l)=("%xmm8","%xmm9"); +my ($block_num,$offset_p)=("%r8","%r9"); # 5th and 6th arguments +my ($L_p,$checksum_p) = ("%rbx","%rbp"); +my ($i1,$i3,$i5) = ("%r12","%r13","%r14"); +my $seventh_arg = $win64 ? 
56 : 8; +my $blocks = $len; + +$code.=<<___; +.globl aesni_ocb_encrypt +.type aesni_ocb_encrypt,\@function,6 +.align 32 +aesni_ocb_encrypt: +.cfi_startproc + lea (%rsp),%rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 +___ +$code.=<<___ if ($win64); + lea -0xa0(%rsp),%rsp + movaps %xmm6,0x00(%rsp) # offload everything + movaps %xmm7,0x10(%rsp) + movaps %xmm8,0x20(%rsp) + movaps %xmm9,0x30(%rsp) + movaps %xmm10,0x40(%rsp) + movaps %xmm11,0x50(%rsp) + movaps %xmm12,0x60(%rsp) + movaps %xmm13,0x70(%rsp) + movaps %xmm14,0x80(%rsp) + movaps %xmm15,0x90(%rsp) +.Locb_enc_body: +___ +$code.=<<___; + mov $seventh_arg(%rax),$L_p # 7th argument + mov $seventh_arg+8(%rax),$checksum_p# 8th argument + + mov 240($key),$rnds_ + mov $key,$key_ + shl \$4,$rnds_ + $movkey ($key),$rndkey0l # round[0] + $movkey 16($key,$rnds_),$rndkey1 # round[last] + + movdqu ($offset_p),@offset[5] # load last offset_i + pxor $rndkey1,$rndkey0l # round[0] ^ round[last] + pxor $rndkey1,@offset[5] # offset_i ^ round[last] + + mov \$16+32,$rounds + lea 32($key_,$rnds_),$key + $movkey 16($key_),$rndkey1 # round[1] + sub %r10,%rax # twisted $rounds + mov %rax,%r10 # backup twisted $rounds + + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + movdqu ($checksum_p),$checksum # load checksum + + test \$1,$block_num # is first block number odd? + jnz .Locb_enc_odd + + bsf $block_num,$i1 + add \$1,$block_num + shl \$4,$i1 + movdqu ($L_p,$i1),$inout5 # borrow + movdqu ($inp),$inout0 + lea 16($inp),$inp + + call __ocb_encrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,($out) + lea 16($out),$out + sub \$1,$blocks + jz .Locb_enc_done + +.Locb_enc_odd: + lea 1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + lea 6($block_num),$block_num + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + shl \$4,$i5 + + sub \$6,$blocks + jc .Locb_enc_short + jmp .Locb_enc_grandloop + +.align 32 +.Locb_enc_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqu `16*1`($inp),$inout1 + movdqu `16*2`($inp),$inout2 + movdqu `16*3`($inp),$inout3 + movdqu `16*4`($inp),$inout4 + movdqu `16*5`($inp),$inout5 + lea `16*6`($inp),$inp + + call __ocb_encrypt6 + + movups $inout0,`16*0`($out) # store output + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + movups $inout3,`16*3`($out) + movups $inout4,`16*4`($out) + movups $inout5,`16*5`($out) + lea `16*6`($out),$out + sub \$6,$blocks + jnc .Locb_enc_grandloop + +.Locb_enc_short: + add \$6,$blocks + jz .Locb_enc_done + + movdqu `16*0`($inp),$inout0 + cmp \$2,$blocks + jb .Locb_enc_one + movdqu `16*1`($inp),$inout1 + je .Locb_enc_two + + movdqu `16*2`($inp),$inout2 + cmp \$4,$blocks + jb .Locb_enc_three + movdqu `16*3`($inp),$inout3 + je .Locb_enc_four + + movdqu `16*4`($inp),$inout4 + pxor $inout5,$inout5 + + call __ocb_encrypt6 + + movdqa @offset[4],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + movups $inout3,`16*3`($out) + movups $inout4,`16*4`($out) + + jmp .Locb_enc_done + +.align 16 +.Locb_enc_one: + movdqa @offset[0],$inout5 # borrow + + call __ocb_encrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,`16*0`($out) + jmp .Locb_enc_done + +.align 16 +.Locb_enc_two: + pxor $inout2,$inout2 + pxor $inout3,$inout3 + + call __ocb_encrypt4 + + movdqa @offset[1],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + + jmp 
.Locb_enc_done + +.align 16 +.Locb_enc_three: + pxor $inout3,$inout3 + + call __ocb_encrypt4 + + movdqa @offset[2],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + + jmp .Locb_enc_done + +.align 16 +.Locb_enc_four: + call __ocb_encrypt4 + + movdqa @offset[3],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + movups $inout3,`16*3`($out) + +.Locb_enc_done: + pxor $rndkey0,@offset[5] # "remove" round[last] + movdqu $checksum,($checksum_p) # store checksum + movdqu @offset[5],($offset_p) # store last offset_i + + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + pxor %xmm8,%xmm8 + pxor %xmm9,%xmm9 + pxor %xmm10,%xmm10 + pxor %xmm11,%xmm11 + pxor %xmm12,%xmm12 + pxor %xmm13,%xmm13 + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 + lea 0x28(%rsp),%rax +.cfi_def_cfa %rax,8 +___ +$code.=<<___ if ($win64); + movaps 0x00(%rsp),%xmm6 + movaps %xmm0,0x00(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + movaps 0x40(%rsp),%xmm10 + movaps %xmm0,0x40(%rsp) + movaps 0x50(%rsp),%xmm11 + movaps %xmm0,0x50(%rsp) + movaps 0x60(%rsp),%xmm12 + movaps %xmm0,0x60(%rsp) + movaps 0x70(%rsp),%xmm13 + movaps %xmm0,0x70(%rsp) + movaps 0x80(%rsp),%xmm14 + movaps %xmm0,0x80(%rsp) + movaps 0x90(%rsp),%xmm15 + movaps %xmm0,0x90(%rsp) + lea 0xa0+0x28(%rsp),%rax +.Locb_enc_pop: +___ +$code.=<<___; + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp +.cfi_def_cfa_register %rsp +.Locb_enc_epilogue: + ret +.cfi_endproc +.size aesni_ocb_encrypt,.-aesni_ocb_encrypt + +.type __ocb_encrypt6,\@abi-omnipotent +.align 32 +__ocb_encrypt6: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + movdqa @offset[0],@offset[4] + pxor @offset[5],@offset[0] + movdqu ($L_p,$i5),@offset[5] + pxor @offset[0],@offset[1] + pxor $inout0,$checksum # accumulate checksum + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor $inout1,$checksum + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor $inout2,$checksum + pxor @offset[2],$inout2 + pxor @offset[3],@offset[4] + pxor $inout3,$checksum + pxor @offset[3],$inout3 + pxor @offset[4],@offset[5] + pxor $inout4,$checksum + pxor @offset[4],$inout4 + pxor $inout5,$checksum + pxor @offset[5],$inout5 + $movkey 32($key_),$rndkey0 + + lea 1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + add \$6,$block_num + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + aesenc $rndkey1,$inout4 + pxor $rndkey0l,@offset[3] + pxor $rndkey0l,@offset[4] + aesenc $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,@offset[5] + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + 
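The offset schedule maintained by __ocb_encrypt6 is OCB's Offset_i = Offset_{i-1} ^ L[ntz(i)], where ntz(i) is the number of trailing zero bits of the 1-based block number: bsf computes ntz and shl \$4 scales it into a 16-byte index into the caller-supplied L table ($L_p). Only the three even-numbered positions in each batch of six need a bsf, because ntz of an odd number is zero, so the odd-numbered blocks all take L_0 (kept in @offset[0]). One step of the chain in C, with a hypothetical helper name and the GCC/Clang ctz builtin standing in for bsf:

    #include <stdint.h>

    static void ocb_next_offset(uint8_t offset[16],
                                const uint8_t L[][16],
                                uint64_t i)            /* 1-based block number */
    {
        unsigned ntz = (unsigned)__builtin_ctzll(i);   /* the bsf above */
        for (int b = 0; b < 16; b++)                   /* Offset ^= L[ntz(i)] */
            offset[b] ^= L[ntz][b];
    }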
$movkey 64($key_),$rndkey0 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + jmp .Locb_enc_loop6 + +.align 32 +.Locb_enc_loop6: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_enc_loop6 + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + shl \$4,$i5 + + aesenclast @offset[0],$inout0 + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + mov %r10,%rax # restore twisted rounds + aesenclast @offset[1],$inout1 + aesenclast @offset[2],$inout2 + aesenclast @offset[3],$inout3 + aesenclast @offset[4],$inout4 + aesenclast @offset[5],$inout5 + ret +.size __ocb_encrypt6,.-__ocb_encrypt6 + +.type __ocb_encrypt4,\@abi-omnipotent +.align 32 +__ocb_encrypt4: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + pxor @offset[5],@offset[0] + pxor @offset[0],@offset[1] + pxor $inout0,$checksum # accumulate checksum + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor $inout1,$checksum + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor $inout2,$checksum + pxor @offset[2],$inout2 + pxor $inout3,$checksum + pxor @offset[3],$inout3 + $movkey 32($key_),$rndkey0 + + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + pxor $rndkey0l,@offset[3] + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey 48($key_),$rndkey1 + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + $movkey 64($key_),$rndkey0 + jmp .Locb_enc_loop4 + +.align 32 +.Locb_enc_loop4: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_enc_loop4 + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey 16($key_),$rndkey1 + mov %r10,%rax # restore twisted rounds + + aesenclast @offset[0],$inout0 + aesenclast @offset[1],$inout1 + aesenclast @offset[2],$inout2 + aesenclast @offset[3],$inout3 + ret +.size __ocb_encrypt4,.-__ocb_encrypt4 + +.type __ocb_encrypt1,\@abi-omnipotent +.align 32 +__ocb_encrypt1: + pxor @offset[5],$inout5 # offset_i + pxor $rndkey0l,$inout5 # offset_i ^ round[0] + pxor $inout0,$checksum # accumulate checksum + pxor $inout5,$inout0 # input ^ round[0] ^ offset_i + $movkey 32($key_),$rndkey0 + + aesenc $rndkey1,$inout0 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,$inout5 # offset_i ^ round[last] + + aesenc $rndkey0,$inout0 + $movkey 64($key_),$rndkey0 + jmp .Locb_enc_loop1 + +.align 32 +.Locb_enc_loop1: + aesenc $rndkey1,$inout0 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_enc_loop1 + + aesenc $rndkey1,$inout0 + $movkey 16($key_),$rndkey1 # 
redundant in tail + mov %r10,%rax # restore twisted rounds + + aesenclast $inout5,$inout0 + ret +.size __ocb_encrypt1,.-__ocb_encrypt1 + +.globl aesni_ocb_decrypt +.type aesni_ocb_decrypt,\@function,6 +.align 32 +aesni_ocb_decrypt: +.cfi_startproc + lea (%rsp),%rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 +___ +$code.=<<___ if ($win64); + lea -0xa0(%rsp),%rsp + movaps %xmm6,0x00(%rsp) # offload everything + movaps %xmm7,0x10(%rsp) + movaps %xmm8,0x20(%rsp) + movaps %xmm9,0x30(%rsp) + movaps %xmm10,0x40(%rsp) + movaps %xmm11,0x50(%rsp) + movaps %xmm12,0x60(%rsp) + movaps %xmm13,0x70(%rsp) + movaps %xmm14,0x80(%rsp) + movaps %xmm15,0x90(%rsp) +.Locb_dec_body: +___ +$code.=<<___; + mov $seventh_arg(%rax),$L_p # 7th argument + mov $seventh_arg+8(%rax),$checksum_p# 8th argument + + mov 240($key),$rnds_ + mov $key,$key_ + shl \$4,$rnds_ + $movkey ($key),$rndkey0l # round[0] + $movkey 16($key,$rnds_),$rndkey1 # round[last] + + movdqu ($offset_p),@offset[5] # load last offset_i + pxor $rndkey1,$rndkey0l # round[0] ^ round[last] + pxor $rndkey1,@offset[5] # offset_i ^ round[last] + + mov \$16+32,$rounds + lea 32($key_,$rnds_),$key + $movkey 16($key_),$rndkey1 # round[1] + sub %r10,%rax # twisted $rounds + mov %rax,%r10 # backup twisted $rounds + + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + movdqu ($checksum_p),$checksum # load checksum + + test \$1,$block_num # is first block number odd? + jnz .Locb_dec_odd + + bsf $block_num,$i1 + add \$1,$block_num + shl \$4,$i1 + movdqu ($L_p,$i1),$inout5 # borrow + movdqu ($inp),$inout0 + lea 16($inp),$inp + + call __ocb_decrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,($out) + xorps $inout0,$checksum # accumulate checksum + lea 16($out),$out + sub \$1,$blocks + jz .Locb_dec_done + +.Locb_dec_odd: + lea 1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + lea 6($block_num),$block_num + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + shl \$4,$i5 + + sub \$6,$blocks + jc .Locb_dec_short + jmp .Locb_dec_grandloop + +.align 32 +.Locb_dec_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqu `16*1`($inp),$inout1 + movdqu `16*2`($inp),$inout2 + movdqu `16*3`($inp),$inout3 + movdqu `16*4`($inp),$inout4 + movdqu `16*5`($inp),$inout5 + lea `16*6`($inp),$inp + + call __ocb_decrypt6 + + movups $inout0,`16*0`($out) # store output + pxor $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + pxor $inout1,$checksum + movups $inout2,`16*2`($out) + pxor $inout2,$checksum + movups $inout3,`16*3`($out) + pxor $inout3,$checksum + movups $inout4,`16*4`($out) + pxor $inout4,$checksum + movups $inout5,`16*5`($out) + pxor $inout5,$checksum + lea `16*6`($out),$out + sub \$6,$blocks + jnc .Locb_dec_grandloop + +.Locb_dec_short: + add \$6,$blocks + jz .Locb_dec_done + + movdqu `16*0`($inp),$inout0 + cmp \$2,$blocks + jb .Locb_dec_one + movdqu `16*1`($inp),$inout1 + je .Locb_dec_two + + movdqu `16*2`($inp),$inout2 + cmp \$4,$blocks + jb .Locb_dec_three + movdqu `16*3`($inp),$inout3 + je .Locb_dec_four + + movdqu `16*4`($inp),$inout4 + pxor $inout5,$inout5 + + call __ocb_decrypt6 + + movdqa @offset[4],@offset[5] + movups $inout0,`16*0`($out) # store output + pxor $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + pxor $inout1,$checksum + movups $inout2,`16*2`($out) + pxor $inout2,$checksum + movups $inout3,`16*3`($out) + 
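OCB's checksum is simply the XOR of all plaintext blocks. On the encrypt side the __ocb_encrypt* helpers fold each block into $checksum before it is enciphered; on the decrypt side the plaintext only exists after the cipher has run, which is why each store in this path is paired with a pxor or xorps of the freshly decrypted block into $checksum, as the instructions that follow keep doing. The equivalent C, as a hypothetical helper:

    #include <stdint.h>

    static void ocb_checksum_add(uint8_t checksum[16], const uint8_t pt[16])
    {
        for (int b = 0; b < 16; b++)    /* Checksum = Checksum ^ P_i */
            checksum[b] ^= pt[b];
    }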
pxor $inout3,$checksum + movups $inout4,`16*4`($out) + pxor $inout4,$checksum + + jmp .Locb_dec_done + +.align 16 +.Locb_dec_one: + movdqa @offset[0],$inout5 # borrow + + call __ocb_decrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,`16*0`($out) # store output + xorps $inout0,$checksum # accumulate checksum + jmp .Locb_dec_done + +.align 16 +.Locb_dec_two: + pxor $inout2,$inout2 + pxor $inout3,$inout3 + + call __ocb_decrypt4 + + movdqa @offset[1],@offset[5] + movups $inout0,`16*0`($out) # store output + xorps $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + xorps $inout1,$checksum + + jmp .Locb_dec_done + +.align 16 +.Locb_dec_three: + pxor $inout3,$inout3 + + call __ocb_decrypt4 + + movdqa @offset[2],@offset[5] + movups $inout0,`16*0`($out) # store output + xorps $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + xorps $inout1,$checksum + movups $inout2,`16*2`($out) + xorps $inout2,$checksum + + jmp .Locb_dec_done + +.align 16 +.Locb_dec_four: + call __ocb_decrypt4 + + movdqa @offset[3],@offset[5] + movups $inout0,`16*0`($out) # store output + pxor $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + pxor $inout1,$checksum + movups $inout2,`16*2`($out) + pxor $inout2,$checksum + movups $inout3,`16*3`($out) + pxor $inout3,$checksum + +.Locb_dec_done: + pxor $rndkey0,@offset[5] # "remove" round[last] + movdqu $checksum,($checksum_p) # store checksum + movdqu @offset[5],($offset_p) # store last offset_i + + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + pxor %xmm8,%xmm8 + pxor %xmm9,%xmm9 + pxor %xmm10,%xmm10 + pxor %xmm11,%xmm11 + pxor %xmm12,%xmm12 + pxor %xmm13,%xmm13 + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 + lea 0x28(%rsp),%rax +.cfi_def_cfa %rax,8 +___ +$code.=<<___ if ($win64); + movaps 0x00(%rsp),%xmm6 + movaps %xmm0,0x00(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + movaps 0x40(%rsp),%xmm10 + movaps %xmm0,0x40(%rsp) + movaps 0x50(%rsp),%xmm11 + movaps %xmm0,0x50(%rsp) + movaps 0x60(%rsp),%xmm12 + movaps %xmm0,0x60(%rsp) + movaps 0x70(%rsp),%xmm13 + movaps %xmm0,0x70(%rsp) + movaps 0x80(%rsp),%xmm14 + movaps %xmm0,0x80(%rsp) + movaps 0x90(%rsp),%xmm15 + movaps %xmm0,0x90(%rsp) + lea 0xa0+0x28(%rsp),%rax +.Locb_dec_pop: +___ +$code.=<<___; + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp +.cfi_def_cfa_register %rsp +.Locb_dec_epilogue: + ret +.cfi_endproc +.size aesni_ocb_decrypt,.-aesni_ocb_decrypt + +.type __ocb_decrypt6,\@abi-omnipotent +.align 32 +__ocb_decrypt6: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + movdqa @offset[0],@offset[4] + pxor @offset[5],@offset[0] + movdqu ($L_p,$i5),@offset[5] + pxor @offset[0],@offset[1] + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor @offset[2],$inout2 + pxor @offset[3],@offset[4] + pxor @offset[3],$inout3 + pxor @offset[4],@offset[5] + pxor @offset[4],$inout4 + pxor @offset[5],$inout5 + $movkey 32($key_),$rndkey0 + + lea 
1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + add \$6,$block_num + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + aesdec $rndkey1,$inout4 + pxor $rndkey0l,@offset[3] + pxor $rndkey0l,@offset[4] + aesdec $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,@offset[5] + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey 64($key_),$rndkey0 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + jmp .Locb_dec_loop6 + +.align 32 +.Locb_dec_loop6: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_dec_loop6 + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + shl \$4,$i5 + + aesdeclast @offset[0],$inout0 + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + mov %r10,%rax # restore twisted rounds + aesdeclast @offset[1],$inout1 + aesdeclast @offset[2],$inout2 + aesdeclast @offset[3],$inout3 + aesdeclast @offset[4],$inout4 + aesdeclast @offset[5],$inout5 + ret +.size __ocb_decrypt6,.-__ocb_decrypt6 + +.type __ocb_decrypt4,\@abi-omnipotent +.align 32 +__ocb_decrypt4: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + pxor @offset[5],@offset[0] + pxor @offset[0],@offset[1] + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor @offset[2],$inout2 + pxor @offset[3],$inout3 + $movkey 32($key_),$rndkey0 + + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + pxor $rndkey0l,@offset[3] + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + $movkey 48($key_),$rndkey1 + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + $movkey 64($key_),$rndkey0 + jmp .Locb_dec_loop4 + +.align 32 +.Locb_dec_loop4: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_dec_loop4 + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + $movkey 16($key_),$rndkey1 + mov %r10,%rax # restore twisted rounds + + aesdeclast @offset[0],$inout0 + aesdeclast @offset[1],$inout1 + aesdeclast @offset[2],$inout2 + aesdeclast @offset[3],$inout3 + ret +.size __ocb_decrypt4,.-__ocb_decrypt4 + +.type __ocb_decrypt1,\@abi-omnipotent +.align 32 +__ocb_decrypt1: + pxor @offset[5],$inout5 # offset_i + pxor $rndkey0l,$inout5 # 
offset_i ^ round[0] + pxor $inout5,$inout0 # input ^ round[0] ^ offset_i + $movkey 32($key_),$rndkey0 + + aesdec $rndkey1,$inout0 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,$inout5 # offset_i ^ round[last] + + aesdec $rndkey0,$inout0 + $movkey 64($key_),$rndkey0 + jmp .Locb_dec_loop1 + +.align 32 +.Locb_dec_loop1: + aesdec $rndkey1,$inout0 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_dec_loop1 + + aesdec $rndkey1,$inout0 + $movkey 16($key_),$rndkey1 # redundant in tail + mov %r10,%rax # restore twisted rounds + + aesdeclast $inout5,$inout0 + ret +.size __ocb_decrypt1,.-__ocb_decrypt1 +___ +} }} + +######################################################################## +# void $PREFIX_cbc_encrypt (const void *inp, void *out, +# size_t length, const AES_KEY *key, +# unsigned char *ivp,const int enc); +if (0) { +my $frame_size = 0x10 + ($win64?0xa0:0); # used in decrypt +my ($iv,$in0,$in1,$in2,$in3,$in4)=map("%xmm$_",(10..15)); + +$code.=<<___; +.globl ${PREFIX}_cbc_encrypt +.type ${PREFIX}_cbc_encrypt,\@function,6 +.align 16 +${PREFIX}_cbc_encrypt: +.cfi_startproc + test $len,$len # check length + jz .Lcbc_ret + + mov 240($key),$rnds_ # key->rounds + mov $key,$key_ # backup $key + test %r9d,%r9d # 6th argument + jz .Lcbc_decrypt +#--------------------------- CBC ENCRYPT ------------------------------# + movups ($ivp),$inout0 # load iv as initial state + mov $rnds_,$rounds + cmp \$16,$len + jb .Lcbc_enc_tail + sub \$16,$len + jmp .Lcbc_enc_loop +.align 16 +.Lcbc_enc_loop: + movups ($inp),$inout1 # load input + lea 16($inp),$inp + #xorps $inout1,$inout0 +___ + &aesni_generate1("enc",$key,$rounds,$inout0,$inout1); +$code.=<<___; + mov $rnds_,$rounds # restore $rounds + mov $key_,$key # restore $key + movups $inout0,0($out) # store output + lea 16($out),$out + sub \$16,$len + jnc .Lcbc_enc_loop + add \$16,$len + jnz .Lcbc_enc_tail + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + movups $inout0,($ivp) + pxor $inout0,$inout0 + pxor $inout1,$inout1 + jmp .Lcbc_ret + +.Lcbc_enc_tail: + mov $len,%rcx # zaps $key + xchg $inp,$out # $inp is %rsi and $out is %rdi now + .long 0x9066A4F3 # rep movsb + mov \$16,%ecx # zero tail + sub $len,%rcx + xor %eax,%eax + .long 0x9066AAF3 # rep stosb + lea -16(%rdi),%rdi # rewind $out by 1 block + mov $rnds_,$rounds # restore $rounds + mov %rdi,%rsi # $inp and $out are the same + mov $key_,$key # restore $key + xor $len,$len # len=16 + jmp .Lcbc_enc_loop # one more spin +#--------------------------- CBC DECRYPT ------------------------------# +.align 16 +.Lcbc_decrypt: + cmp \$16,$len + jne .Lcbc_decrypt_bulk + + # handle single block without allocating stack frame, + # useful in ciphertext stealing mode + movdqu ($inp),$inout0 # load input + movdqu ($ivp),$inout1 # load iv + movdqa $inout0,$inout2 # future iv +___ + &aesni_generate1("dec",$key,$rnds_); +$code.=<<___; + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + movdqu $inout2,($ivp) # store iv + xorps $inout1,$inout0 # ^=iv + pxor $inout1,$inout1 + movups $inout0,($out) # store output + pxor $inout0,$inout0 + jmp .Lcbc_ret +.align 16 +.Lcbc_decrypt_bulk: + lea (%rsp),%r11 # frame pointer +.cfi_def_cfa_register %r11 + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,0x10(%rsp) + movaps %xmm7,0x20(%rsp) + movaps %xmm8,0x30(%rsp) + movaps %xmm9,0x40(%rsp) + movaps 
%xmm10,0x50(%rsp) + movaps %xmm11,0x60(%rsp) + movaps %xmm12,0x70(%rsp) + movaps %xmm13,0x80(%rsp) + movaps %xmm14,0x90(%rsp) + movaps %xmm15,0xa0(%rsp) +.Lcbc_decrypt_body: +___ + +my $inp_=$key_="%rbp"; # reassign $key_ + +$code.=<<___; + mov $key,$key_ # [re-]backup $key [after reassignment] + movups ($ivp),$iv + mov $rnds_,$rounds + cmp \$0x50,$len + jbe .Lcbc_dec_tail + + $movkey ($key),$rndkey0 + movdqu 0x00($inp),$inout0 # load input + movdqu 0x10($inp),$inout1 + movdqa $inout0,$in0 + movdqu 0x20($inp),$inout2 + movdqa $inout1,$in1 + movdqu 0x30($inp),$inout3 + movdqa $inout2,$in2 + movdqu 0x40($inp),$inout4 + movdqa $inout3,$in3 + movdqu 0x50($inp),$inout5 + movdqa $inout4,$in4 + leaq OPENSSL_ia32cap_P(%rip),%r9 + mov 4(%r9),%r9d + cmp \$0x70,$len + jbe .Lcbc_dec_six_or_seven + + and \$`1<<26|1<<22`,%r9d # isolate XSAVE+MOVBE + sub \$0x50,$len # $len is biased by -5*16 + cmp \$`1<<22`,%r9d # check for MOVBE without XSAVE + je .Lcbc_dec_loop6_enter # [which denotes Atom Silvermont] + sub \$0x20,$len # $len is biased by -7*16 + lea 0x70($key),$key # size optimization + jmp .Lcbc_dec_loop8_enter +.align 16 +.Lcbc_dec_loop8: + movups $inout7,($out) + lea 0x10($out),$out +.Lcbc_dec_loop8_enter: + movdqu 0x60($inp),$inout6 + pxor $rndkey0,$inout0 + movdqu 0x70($inp),$inout7 + pxor $rndkey0,$inout1 + $movkey 0x10-0x70($key),$rndkey1 + pxor $rndkey0,$inout2 + mov \$-1,$inp_ + cmp \$0x70,$len # is there at least 0x60 bytes ahead? + pxor $rndkey0,$inout3 + pxor $rndkey0,$inout4 + pxor $rndkey0,$inout5 + pxor $rndkey0,$inout6 + + aesdec $rndkey1,$inout0 + pxor $rndkey0,$inout7 + $movkey 0x20-0x70($key),$rndkey0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + aesdec $rndkey1,$inout6 + adc \$0,$inp_ + and \$128,$inp_ + aesdec $rndkey1,$inout7 + add $inp,$inp_ + $movkey 0x30-0x70($key),$rndkey1 +___ +for($i=1;$i<12;$i++) { +my $rndkeyx = ($i&1)?$rndkey0:$rndkey1; +$code.=<<___ if ($i==7); + cmp \$11,$rounds +___ +$code.=<<___; + aesdec $rndkeyx,$inout0 + aesdec $rndkeyx,$inout1 + aesdec $rndkeyx,$inout2 + aesdec $rndkeyx,$inout3 + aesdec $rndkeyx,$inout4 + aesdec $rndkeyx,$inout5 + aesdec $rndkeyx,$inout6 + aesdec $rndkeyx,$inout7 + $movkey `0x30+0x10*$i`-0x70($key),$rndkeyx +___ +$code.=<<___ if ($i<6 || (!($i&1) && $i>7)); + nop +___ +$code.=<<___ if ($i==7); + jb .Lcbc_dec_done +___ +$code.=<<___ if ($i==9); + je .Lcbc_dec_done +___ +$code.=<<___ if ($i==11); + jmp .Lcbc_dec_done +___ +} +$code.=<<___; +.align 16 +.Lcbc_dec_done: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + pxor $rndkey0,$iv + pxor $rndkey0,$in0 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + pxor $rndkey0,$in1 + pxor $rndkey0,$in2 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + pxor $rndkey0,$in3 + pxor $rndkey0,$in4 + aesdec $rndkey1,$inout6 + aesdec $rndkey1,$inout7 + movdqu 0x50($inp),$rndkey1 + + aesdeclast $iv,$inout0 + movdqu 0x60($inp),$iv # borrow $iv + pxor $rndkey0,$rndkey1 + aesdeclast $in0,$inout1 + pxor $rndkey0,$iv + movdqu 0x70($inp),$rndkey0 # next IV + aesdeclast $in1,$inout2 + lea 0x80($inp),$inp + movdqu 0x00($inp_),$in0 + aesdeclast $in2,$inout3 + aesdeclast $in3,$inout4 + movdqu 0x10($inp_),$in1 + movdqu 0x20($inp_),$in2 + aesdeclast $in4,$inout5 + aesdeclast $rndkey1,$inout6 + movdqu 0x30($inp_),$in3 + movdqu 0x40($inp_),$in4 + aesdeclast $iv,$inout7 + movdqa $rndkey0,$iv # return $iv + movdqu 0x50($inp_),$rndkey1 + $movkey -0x70($key),$rndkey0 + + movups $inout0,($out) # store output + movdqa 
$in0,$inout0 + movups $inout1,0x10($out) + movdqa $in1,$inout1 + movups $inout2,0x20($out) + movdqa $in2,$inout2 + movups $inout3,0x30($out) + movdqa $in3,$inout3 + movups $inout4,0x40($out) + movdqa $in4,$inout4 + movups $inout5,0x50($out) + movdqa $rndkey1,$inout5 + movups $inout6,0x60($out) + lea 0x70($out),$out + + sub \$0x80,$len + ja .Lcbc_dec_loop8 + + movaps $inout7,$inout0 + lea -0x70($key),$key + add \$0x70,$len + jle .Lcbc_dec_clear_tail_collected + movups $inout7,($out) + lea 0x10($out),$out + cmp \$0x50,$len + jbe .Lcbc_dec_tail + + movaps $in0,$inout0 +.Lcbc_dec_six_or_seven: + cmp \$0x60,$len + ja .Lcbc_dec_seven + + movaps $inout5,$inout6 + call _aesni_decrypt6 + pxor $iv,$inout0 # ^= IV + movaps $inout6,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + pxor $in3,$inout4 + movdqu $inout3,0x30($out) + pxor $inout3,$inout3 + pxor $in4,$inout5 + movdqu $inout4,0x40($out) + pxor $inout4,$inout4 + lea 0x50($out),$out + movdqa $inout5,$inout0 + pxor $inout5,$inout5 + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_seven: + movups 0x60($inp),$inout6 + xorps $inout7,$inout7 + call _aesni_decrypt8 + movups 0x50($inp),$inout7 + pxor $iv,$inout0 # ^= IV + movups 0x60($inp),$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + pxor $in3,$inout4 + movdqu $inout3,0x30($out) + pxor $inout3,$inout3 + pxor $in4,$inout5 + movdqu $inout4,0x40($out) + pxor $inout4,$inout4 + pxor $inout7,$inout6 + movdqu $inout5,0x50($out) + pxor $inout5,$inout5 + lea 0x60($out),$out + movdqa $inout6,$inout0 + pxor $inout6,$inout6 + pxor $inout7,$inout7 + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_loop6: + movups $inout5,($out) + lea 0x10($out),$out + movdqu 0x00($inp),$inout0 # load input + movdqu 0x10($inp),$inout1 + movdqa $inout0,$in0 + movdqu 0x20($inp),$inout2 + movdqa $inout1,$in1 + movdqu 0x30($inp),$inout3 + movdqa $inout2,$in2 + movdqu 0x40($inp),$inout4 + movdqa $inout3,$in3 + movdqu 0x50($inp),$inout5 + movdqa $inout4,$in4 +.Lcbc_dec_loop6_enter: + lea 0x60($inp),$inp + movdqa $inout5,$inout6 + + call _aesni_decrypt6 + + pxor $iv,$inout0 # ^= IV + movdqa $inout6,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $in3,$inout4 + mov $key_,$key + movdqu $inout3,0x30($out) + pxor $in4,$inout5 + mov $rnds_,$rounds + movdqu $inout4,0x40($out) + lea 0x50($out),$out + sub \$0x60,$len + ja .Lcbc_dec_loop6 + + movdqa $inout5,$inout0 + add \$0x50,$len + jle .Lcbc_dec_clear_tail_collected + movups $inout5,($out) + lea 0x10($out),$out + +.Lcbc_dec_tail: + movups ($inp),$inout0 + sub \$0x10,$len + jbe .Lcbc_dec_one # $len is 1*16 or less + + movups 0x10($inp),$inout1 + movaps $inout0,$in0 + sub \$0x10,$len + jbe .Lcbc_dec_two # $len is 2*16 or less + + movups 0x20($inp),$inout2 + movaps $inout1,$in1 + sub \$0x10,$len + jbe .Lcbc_dec_three # $len is 3*16 or less + + movups 0x30($inp),$inout3 + movaps $inout2,$in2 + sub \$0x10,$len + jbe .Lcbc_dec_four # $len is 4*16 or less + + movups 0x40($inp),$inout4 # $len is 5*16 or less + movaps $inout3,$in3 + movaps $inout4,$in4 + xorps $inout5,$inout5 + call _aesni_decrypt6 + pxor $iv,$inout0 + movaps $in4,$iv + pxor $in0,$inout1 + movdqu 
$inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + pxor $in3,$inout4 + movdqu $inout3,0x30($out) + pxor $inout3,$inout3 + lea 0x40($out),$out + movdqa $inout4,$inout0 + pxor $inout4,$inout4 + pxor $inout5,$inout5 + sub \$0x10,$len + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_one: + movaps $inout0,$in0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps $iv,$inout0 + movaps $in0,$iv + jmp .Lcbc_dec_tail_collected +.align 16 +.Lcbc_dec_two: + movaps $inout1,$in1 + call _aesni_decrypt2 + pxor $iv,$inout0 + movaps $in1,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + movdqa $inout1,$inout0 + pxor $inout1,$inout1 # clear register bank + lea 0x10($out),$out + jmp .Lcbc_dec_tail_collected +.align 16 +.Lcbc_dec_three: + movaps $inout2,$in2 + call _aesni_decrypt3 + pxor $iv,$inout0 + movaps $in2,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + movdqa $inout2,$inout0 + pxor $inout2,$inout2 + lea 0x20($out),$out + jmp .Lcbc_dec_tail_collected +.align 16 +.Lcbc_dec_four: + movaps $inout3,$in3 + call _aesni_decrypt4 + pxor $iv,$inout0 + movaps $in3,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + movdqa $inout3,$inout0 + pxor $inout3,$inout3 + lea 0x30($out),$out + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_clear_tail_collected: + pxor $inout1,$inout1 # clear register bank + pxor $inout2,$inout2 + pxor $inout3,$inout3 +___ +$code.=<<___ if (!$win64); + pxor $inout4,$inout4 # %xmm6..9 + pxor $inout5,$inout5 + pxor $inout6,$inout6 + pxor $inout7,$inout7 +___ +$code.=<<___; +.Lcbc_dec_tail_collected: + movups $iv,($ivp) + and \$15,$len + jnz .Lcbc_dec_tail_partial + movups $inout0,($out) + pxor $inout0,$inout0 + jmp .Lcbc_dec_ret +.align 16 +.Lcbc_dec_tail_partial: + movaps $inout0,(%rsp) + pxor $inout0,$inout0 + mov \$16,%rcx + mov $out,%rdi + sub $len,%rcx + lea (%rsp),%rsi + .long 0x9066A4F3 # rep movsb + movdqa $inout0,(%rsp) + +.Lcbc_dec_ret: + xorps $rndkey0,$rndkey0 # %xmm0 + pxor $rndkey1,$rndkey1 +___ +$code.=<<___ if ($win64); + movaps 0x10(%rsp),%xmm6 + movaps %xmm0,0x10(%rsp) # clear stack + movaps 0x20(%rsp),%xmm7 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm8 + movaps %xmm0,0x30(%rsp) + movaps 0x40(%rsp),%xmm9 + movaps %xmm0,0x40(%rsp) + movaps 0x50(%rsp),%xmm10 + movaps %xmm0,0x50(%rsp) + movaps 0x60(%rsp),%xmm11 + movaps %xmm0,0x60(%rsp) + movaps 0x70(%rsp),%xmm12 + movaps %xmm0,0x70(%rsp) + movaps 0x80(%rsp),%xmm13 + movaps %xmm0,0x80(%rsp) + movaps 0x90(%rsp),%xmm14 + movaps %xmm0,0x90(%rsp) + movaps 0xa0(%rsp),%xmm15 + movaps %xmm0,0xa0(%rsp) +___ +$code.=<<___; + mov -8(%r11),%rbp +.cfi_restore %rbp + lea (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lcbc_ret: + ret +.cfi_endproc +.size ${PREFIX}_cbc_encrypt,.-${PREFIX}_cbc_encrypt +___ +} +# int ${PREFIX}_set_decrypt_key(const unsigned char *inp, +# int bits, AES_KEY *key) +# +# input: $inp user-supplied key +# $bits $inp length in bits +# $key pointer to key schedule +# output: %eax 0 denoting success, -1 or -2 - failure (see C) +# *$key key schedule +# +{ my ($inp,$bits,$key) = @_4args; + $bits =~ s/%r/%e/; + +$code.=<<___; +.globl ${PREFIX}_set_decrypt_key +.type ${PREFIX}_set_decrypt_key,\@abi-omnipotent +.align 
16 +${PREFIX}_set_decrypt_key: +.cfi_startproc + .byte 0x48,0x83,0xEC,0x08 # sub rsp,8 +.cfi_adjust_cfa_offset 8 + call __aesni_set_encrypt_key + shl \$4,$bits # rounds-1 after __aesni_set_encrypt_key + test %eax,%eax + jnz .Ldec_key_ret + lea 16($key,$bits),$inp # points at the end of key schedule + + $movkey ($key),%xmm0 # just swap + $movkey ($inp),%xmm1 + $movkey %xmm0,($inp) + $movkey %xmm1,($key) + lea 16($key),$key + lea -16($inp),$inp + +.Ldec_key_inverse: + $movkey ($key),%xmm0 # swap and invert + $movkey ($inp),%xmm1 + aesimc %xmm0,%xmm0 + aesimc %xmm1,%xmm1 + lea 16($key),$key + lea -16($inp),$inp + $movkey %xmm0,16($inp) + $movkey %xmm1,-16($key) + cmp $key,$inp + ja .Ldec_key_inverse + + $movkey ($key),%xmm0 # invert middle + aesimc %xmm0,%xmm0 + pxor %xmm1,%xmm1 + $movkey %xmm0,($inp) + pxor %xmm0,%xmm0 +.Ldec_key_ret: + add \$8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_decrypt_key: +.size ${PREFIX}_set_decrypt_key,.-${PREFIX}_set_decrypt_key +___ + +# This is based on a submission from Intel by +# Huang Ying +# Vinodh Gopal +# Kahraman Akdemir +# +# Aggressively optimized with respect to aeskeygenassist's critical path; +# everything is contained in %xmm0-5 to meet the Win64 ABI requirement. +# +# int ${PREFIX}_set_encrypt_key(const unsigned char *inp, +# int bits, AES_KEY * const key); +# +# input: $inp user-supplied key +# $bits $inp length in bits +# $key pointer to key schedule +# output: %eax 0 denoting success, -1 or -2 denoting failure (see C) +# $bits rounds-1 (used in aesni_set_decrypt_key) +# *$key key schedule +# $key pointer to key schedule (used in +# aesni_set_decrypt_key) +# +# The subroutine is frame-less, which means that only volatile registers +# are used. Note that it's declared "abi-omnipotent", which means that +# the number of volatile registers is smaller on Windows.
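Two notes on the key-schedule routines. First, what ${PREFIX}_set_decrypt_key above computes: the AES decryption schedule is the encryption schedule in reverse order, with aesimc (inverse MixColumns) applied to every round key except the first and the last. The assembly swaps pairs in place; an out-of-place sketch with AES-NI intrinsics (helper name ours, build with -maes):

    #include <wmmintrin.h>

    static void make_decrypt_schedule(__m128i dk[], const __m128i ek[], int rounds)
    {
        dk[0]      = ek[rounds];            /* the two end keys are only swapped */
        dk[rounds] = ek[0];
        for (int i = 1; i < rounds; i++)    /* the middle keys also get aesimc */
            dk[i] = _mm_aesimc_si128(ek[rounds - i]);
    }

Second, the ${PREFIX}_set_encrypt_key body that follows avoids aeskeygenassist (the commented-out path) by building SubWord(RotWord(w3)) ^ rcon out of pshufb and aesenclast: the .Lkey_rotate mask broadcasts the rotated last word into all four lanes, and aesenclast against a register holding the round constant performs SubBytes plus the rcon XOR, ShiftRows being a no-op when all columns are equal. Three pslldq/pxor steps then provide the sliding XOR of the previous round key. One 128-bit expansion round as an intrinsics sketch under the same caveats (the real loop also doubles rcon with pslld \$1 each iteration):

    #include <immintrin.h>

    static __m128i aes128_expand_step(__m128i key, __m128i rcon)
    {
        const __m128i rot = _mm_set1_epi32(0x0c0f0e0d);    /* .Lkey_rotate */
        __m128i t = _mm_shuffle_epi8(key, rot);            /* RotWord(w3), broadcast */
        t = _mm_aesenclast_si128(t, rcon);                 /* SubBytes + rcon */
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* sliding XOR: after  */
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* three rounds, key = */
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* k^k<<32^k<<64^k<<96 */
        return _mm_xor_si128(key, t);
    }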
+# +$code.=<<___; +.globl ${PREFIX}_set_encrypt_key +.type ${PREFIX}_set_encrypt_key,\@abi-omnipotent +.align 16 +${PREFIX}_set_encrypt_key: +__aesni_set_encrypt_key: +.cfi_startproc + .byte 0x48,0x83,0xEC,0x08 # sub rsp,8 +.cfi_adjust_cfa_offset 8 + mov \$-1,%rax + test $inp,$inp + jz .Lenc_key_ret + test $key,$key + jz .Lenc_key_ret + + movups ($inp),%xmm0 # pull first 128 bits of *userKey + xorps %xmm4,%xmm4 # low dword of xmm4 is assumed 0 +# leaq OPENSSL_ia32cap_P(%rip),%r10 +# movl 4(%r10),%r10d +# and \$`1<<28|1<<11`,%r10d # AVX and XOP bits + lea 16($key),%rax # %rax is used as modifiable copy of $key + cmp \$256,$bits + je .L14rounds + cmp \$192,$bits + je .L12rounds + cmp \$128,$bits + jne .Lbad_keybits + +.L10rounds: + mov \$9,$bits # 10 rounds for 128-bit key +# cmp \$`1<<28`,%r10d # AVX, bit no XOP +# je .L10rounds_alt +# jmp .L10rounds_alt +# $movkey %xmm0,($key) # round 0 +# aeskeygenassist \$0x1,%xmm0,%xmm1 # round 1 +# call .Lkey_expansion_128_cold +# aeskeygenassist \$0x2,%xmm0,%xmm1 # round 2 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x4,%xmm0,%xmm1 # round 3 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x8,%xmm0,%xmm1 # round 4 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x10,%xmm0,%xmm1 # round 5 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x20,%xmm0,%xmm1 # round 6 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x40,%xmm0,%xmm1 # round 7 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x80,%xmm0,%xmm1 # round 8 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x1b,%xmm0,%xmm1 # round 9 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x36,%xmm0,%xmm1 # round 10 +# call .Lkey_expansion_128 +# $movkey %xmm0,(%rax) +# mov $bits,80(%rax) # 240(%rdx) +# xor %eax,%eax +# jmp .Lenc_key_ret + +#.align 16 +#.L10rounds_alt: + movdqa .Lkey_rotate(%rip),%xmm5 + mov \$8,%r10d + movdqa .Lkey_rcon1(%rip),%xmm4 + movdqa %xmm0,%xmm2 + movdqu %xmm0,($key) + jmp .Loop_key128 + +.align 16 +.Loop_key128: + pshufb %xmm5,%xmm0 + aesenclast %xmm4,%xmm0 + pslld \$1,%xmm4 + lea 16(%rax),%rax + + movdqa %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,-16(%rax) + movdqa %xmm0,%xmm2 + + dec %r10d + jnz .Loop_key128 + + movdqa .Lkey_rcon1b(%rip),%xmm4 + + pshufb %xmm5,%xmm0 + aesenclast %xmm4,%xmm0 + pslld \$1,%xmm4 + + movdqa %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + movdqa %xmm0,%xmm2 + pshufb %xmm5,%xmm0 + aesenclast %xmm4,%xmm0 + + movdqa %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,16(%rax) + + mov $bits,96(%rax) # 240($key) + xor %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L12rounds: + movq 16($inp),%xmm2 # remaining 1/3 of *userKey + mov \$11,$bits # 12 rounds for 192 +# cmp \$`1<<28`,%r10d # AVX, but no XOP +# je .L12rounds_alt + +# $movkey %xmm0,($key) # round 0 +# aeskeygenassist \$0x1,%xmm2,%xmm1 # round 1,2 +# call .Lkey_expansion_192a_cold +# aeskeygenassist \$0x2,%xmm2,%xmm1 # round 2,3 +# call .Lkey_expansion_192b +# aeskeygenassist \$0x4,%xmm2,%xmm1 # round 4,5 +# call .Lkey_expansion_192a +# aeskeygenassist \$0x8,%xmm2,%xmm1 # round 5,6 +# call .Lkey_expansion_192b +# aeskeygenassist \$0x10,%xmm2,%xmm1 # round 7,8 +# call .Lkey_expansion_192a +# aeskeygenassist \$0x20,%xmm2,%xmm1 # round 8,9 +# call 
.Lkey_expansion_192b +# aeskeygenassist \$0x40,%xmm2,%xmm1 # round 10,11 +# call .Lkey_expansion_192a +# aeskeygenassist \$0x80,%xmm2,%xmm1 # round 11,12 +# call .Lkey_expansion_192b +# $movkey %xmm0,(%rax) +# mov $bits,48(%rax) # 240(%rdx) +# xor %rax, %rax +# jmp .Lenc_key_ret + +#.align 16 +#.L12rounds_alt: + movdqa .Lkey_rotate192(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + mov \$8,%r10d + movdqu %xmm0,($key) + jmp .Loop_key192 + +.align 16 +.Loop_key192: + movq %xmm2,0(%rax) + movdqa %xmm2,%xmm1 + pshufb %xmm5,%xmm2 + aesenclast %xmm4,%xmm2 + pslld \$1, %xmm4 + lea 24(%rax),%rax + + movdqa %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm3,%xmm0 + + pshufd \$0xff,%xmm0,%xmm3 + pxor %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm1,%xmm3 + + pxor %xmm2,%xmm0 + pxor %xmm3,%xmm2 + movdqu %xmm0,-16(%rax) + + dec %r10d + jnz .Loop_key192 + + mov $bits,32(%rax) # 240($key) + xor %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L14rounds: + movups 16($inp),%xmm2 # remaining half of *userKey + mov \$13,$bits # 14 rounds for 256 + lea 16(%rax),%rax +# cmp \$`1<<28`,%r10d # AVX, but no XOP +# je .L14rounds_alt +# +# $movkey %xmm0,($key) # round 0 +# $movkey %xmm2,16($key) # round 1 +# aeskeygenassist \$0x1,%xmm2,%xmm1 # round 2 +# call .Lkey_expansion_256a_cold +# aeskeygenassist \$0x1,%xmm0,%xmm1 # round 3 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x2,%xmm2,%xmm1 # round 4 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x2,%xmm0,%xmm1 # round 5 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x4,%xmm2,%xmm1 # round 6 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x4,%xmm0,%xmm1 # round 7 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x8,%xmm2,%xmm1 # round 8 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x8,%xmm0,%xmm1 # round 9 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x10,%xmm2,%xmm1 # round 10 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x10,%xmm0,%xmm1 # round 11 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x20,%xmm2,%xmm1 # round 12 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x20,%xmm0,%xmm1 # round 13 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x40,%xmm2,%xmm1 # round 14 +# call .Lkey_expansion_256a +# $movkey %xmm0,(%rax) +# mov $bits,16(%rax) # 240(%rdx) +# xor %rax,%rax +# jmp .Lenc_key_ret + +#.align 16 +#.L14rounds_alt: + movdqa .Lkey_rotate(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + mov \$7,%r10d + movdqu %xmm0,0($key) + movdqa %xmm2,%xmm1 + movdqu %xmm2,16($key) + jmp .Loop_key256 + +.align 16 +.Loop_key256: + pshufb %xmm5,%xmm2 + aesenclast %xmm4,%xmm2 + + movdqa %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm3,%xmm0 + pslld \$1,%xmm4 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + dec %r10d + jz .Ldone_key256 + + pshufd \$0xff,%xmm0,%xmm2 + pxor %xmm3,%xmm3 + aesenclast %xmm3,%xmm2 + + movdqa %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm3,%xmm1 + + pxor %xmm1,%xmm2 + movdqu %xmm2,16(%rax) + lea 32(%rax),%rax + movdqa %xmm2,%xmm1 + + jmp .Loop_key256 + +.Ldone_key256: + mov $bits,16(%rax) # 240($key) + xor %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.Lbad_keybits: + mov \$-2,%rax +.Lenc_key_ret: + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + add \$8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_encrypt_key: + +#.align 16 
+#.Lkey_expansion_128: +# $movkey %xmm0,(%rax) +# lea 16(%rax),%rax +#.Lkey_expansion_128_cold: +# shufps \$0b00010000,%xmm0,%xmm4 +# xorps %xmm4, %xmm0 +# shufps \$0b10001100,%xmm0,%xmm4 +# xorps %xmm4, %xmm0 +# shufps \$0b11111111,%xmm1,%xmm1 # critical path +# xorps %xmm1,%xmm0 +# ret + +#.align 16 +#.Lkey_expansion_192a: +# $movkey %xmm0,(%rax) +# lea 16(%rax),%rax +#.Lkey_expansion_192a_cold: +# movaps %xmm2, %xmm5 +#.Lkey_expansion_192b_warm: +# shufps \$0b00010000,%xmm0,%xmm4 +# movdqa %xmm2,%xmm3 +# xorps %xmm4,%xmm0 +# shufps \$0b10001100,%xmm0,%xmm4 +# pslldq \$4,%xmm3 +# xorps %xmm4,%xmm0 +# pshufd \$0b01010101,%xmm1,%xmm1 # critical path +# pxor %xmm3,%xmm2 +# pxor %xmm1,%xmm0 +# pshufd \$0b11111111,%xmm0,%xmm3 +# pxor %xmm3,%xmm2 +# ret +# +#.align 16 +#.Lkey_expansion_192b: +# movaps %xmm0,%xmm3 +# shufps \$0b01000100,%xmm0,%xmm5 +# $movkey %xmm5,(%rax) +# shufps \$0b01001110,%xmm2,%xmm3 +# $movkey %xmm3,16(%rax) +# lea 32(%rax),%rax +# jmp .Lkey_expansion_192b_warm +# +#.align 16 +#.Lkey_expansion_256a: +# $movkey %xmm2,(%rax) +# lea 16(%rax),%rax +#.Lkey_expansion_256a_cold: +# shufps \$0b00010000,%xmm0,%xmm4 +# xorps %xmm4,%xmm0 +# shufps \$0b10001100,%xmm0,%xmm4 +# xorps %xmm4,%xmm0 +# shufps \$0b11111111,%xmm1,%xmm1 # critical path +# xorps %xmm1,%xmm0 +# ret +# +#.align 16 +#.Lkey_expansion_256b: +# $movkey %xmm0,(%rax) +# lea 16(%rax),%rax +# +# shufps \$0b00010000,%xmm2,%xmm4 +# xorps %xmm4,%xmm2 +# shufps \$0b10001100,%xmm2,%xmm4 +# xorps %xmm4,%xmm2 +# shufps \$0b10101010,%xmm1,%xmm1 # critical path +# xorps %xmm1,%xmm2 +# ret +.size ${PREFIX}_set_encrypt_key,.-${PREFIX}_set_encrypt_key +.size __aesni_set_encrypt_key,.-__aesni_set_encrypt_key +___ +} + +$code.=<<___; +.align 64 +.Lbswap_mask: + .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lincrement32: + .long 6,6,6,0 +.Lincrement64: + .long 1,0,0,0 +.Lxts_magic: + .long 0x87,0,1,0 +.Lincrement1: + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Lkey_rotate: + .long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +.Lkey_rotate192: + .long 0x04070605,0x04070605,0x04070605,0x04070605 +.Lkey_rcon1: + .long 1,1,1,1 +.Lkey_rcon1b: + .long 0x1b,0x1b,0x1b,0x1b + +.align 64 +___ + +# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame, +# CONTEXT *context,DISPATCHER_CONTEXT *disp) +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___; +.extern __imp_RtlVirtualUnwind +___ +$code.=<<___ if ($PREFIX eq "aesni" && 0); +.type ecb_ccm64_se_handler,\@abi-omnipotent +.align 16 +ecb_ccm64_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + lea 0(%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$8,%ecx # 4*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + lea 0x58(%rax),%rax # adjust stack pointer + + jmp .Lcommon_seh_tail +.size ecb_ccm64_se_handler,.-ecb_ccm64_se_handler +___ +$code.=<<___ if ($PREFIX eq "aesni"); +.type ctr_xts_se_handler,\@abi-omnipotent +.align 16 +ctr_xts_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + 
 push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 208($context),%rax # pull context->R11 + + lea -0xa8(%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + + mov -8(%rax),%rbp # restore saved %rbp + mov %rbp,160($context) # restore context->Rbp + jmp .Lcommon_seh_tail +.size ctr_xts_se_handler,.-ctr_xts_se_handler +___ +$code.=<<___ if ($PREFIX eq "aesni" && 0); +.type ocb_se_handler,\@abi-omnipotent +.align 16 +ocb_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 8(%r11),%r10d # HandlerData[2] + lea (%rsi,%r10),%r10 + cmp %r10,%rbx # context->Rip>=pop label + jae .Locb_no_xmm + + mov 152($context),%rax # pull context->Rsp + + lea (%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + lea 0xa0+0x28(%rax),%rax + +.Locb_no_xmm: + mov -8(%rax),%rbx + mov -16(%rax),%rbp + mov -24(%rax),%r12 + mov -32(%rax),%r13 + mov -40(%rax),%r14 + + mov %rbx,144($context) # restore context->Rbx + mov %rbp,160($context) # restore context->Rbp + mov %r12,216($context) # restore context->R12 + mov %r13,224($context) # restore context->R13 + mov %r14,232($context) # restore context->R14 + + jmp .Lcommon_seh_tail +.size ocb_se_handler,.-ocb_se_handler +___ +$code.=<<___; +.type cbc_se_handler,\@abi-omnipotent +.align 16 +cbc_se_handler: +___ +$code.=<<___ if (0); + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 152($context),%rax # pull context->Rsp + mov 248($context),%rbx # pull context->Rip + + lea .Lcbc_decrypt_bulk(%rip),%r10 + cmp %r10,%rbx # context->Rip<"prologue" label + jb .Lcommon_seh_tail + + mov 120($context),%rax # pull context->Rax + + lea .Lcbc_decrypt_body(%rip),%r10 + cmp %r10,%rbx # context->Rip<cbc_decrypt_body + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + lea .Lcbc_ret(%rip),%r10 + cmp %r10,%rbx # context->Rip>="epilogue" label + jae .Lcommon_seh_tail + + lea 16(%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + + mov 208($context),%rax # pull context->R11 + + mov -8(%rax),%rbp # restore saved %rbp + mov %rbp,160($context) # restore context->Rbp + +___ +$code.=<<___; +.Lcommon_seh_tail: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$154,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; 
rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size cbc_se_handler,.-cbc_se_handler +___ +$code.=<<___ if ($PREFIX eq "aesni"); +.section .pdata +.align 4 +# .rva .LSEH_begin_aesni_ecb_encrypt +# .rva .LSEH_end_aesni_ecb_encrypt +# .rva .LSEH_info_ecb + +# .rva .LSEH_begin_aesni_ccm64_encrypt_blocks +# .rva .LSEH_end_aesni_ccm64_encrypt_blocks +# .rva .LSEH_info_ccm64_enc + +# .rva .LSEH_begin_aesni_ccm64_decrypt_blocks +# .rva .LSEH_end_aesni_ccm64_decrypt_blocks +# .rva .LSEH_info_ccm64_dec + + .rva .LSEH_begin_aesni_ctr32_encrypt_blocks + .rva .LSEH_end_aesni_ctr32_encrypt_blocks + .rva .LSEH_info_ctr32 + +# .rva .LSEH_begin_aesni_xts_encrypt +# .rva .LSEH_end_aesni_xts_encrypt +# .rva .LSEH_info_xts_enc + +# .rva .LSEH_begin_aesni_xts_decrypt +# .rva .LSEH_end_aesni_xts_decrypt +# .rva .LSEH_info_xts_dec + +# .rva .LSEH_begin_aesni_ocb_encrypt +# .rva .LSEH_end_aesni_ocb_encrypt +# .rva .LSEH_info_ocb_enc + +# .rva .LSEH_begin_aesni_ocb_decrypt +# .rva .LSEH_end_aesni_ocb_decrypt +# .rva .LSEH_info_ocb_dec +___ +$code.=<<___; +# .rva .LSEH_begin_${PREFIX}_cbc_encrypt +# .rva .LSEH_end_${PREFIX}_cbc_encrypt +# .rva .LSEH_info_cbc + + .rva ${PREFIX}_set_decrypt_key + .rva .LSEH_end_set_decrypt_key + .rva .LSEH_info_key + + .rva ${PREFIX}_set_encrypt_key + .rva .LSEH_end_set_encrypt_key + .rva .LSEH_info_key +.section .xdata +.align 8 +___ +$code.=<<___ if ($PREFIX eq "aesni"); +#.LSEH_info_ecb: +# .byte 9,0,0,0 +# .rva ecb_ccm64_se_handler +# .rva .Lecb_enc_body,.Lecb_enc_ret # HandlerData[] +#.LSEH_info_ccm64_enc: +# .byte 9,0,0,0 +# .rva ecb_ccm64_se_handler +# .rva .Lccm64_enc_body,.Lccm64_enc_ret # HandlerData[] +#.LSEH_info_ccm64_dec: +# .byte 9,0,0,0 +# .rva ecb_ccm64_se_handler +# .rva .Lccm64_dec_body,.Lccm64_dec_ret # HandlerData[] +.LSEH_info_ctr32: + .byte 9,0,0,0 + .rva ctr_xts_se_handler + .rva .Lctr32_body,.Lctr32_epilogue # HandlerData[] +#.LSEH_info_xts_enc: +# .byte 9,0,0,0 +# .rva ctr_xts_se_handler +# .rva .Lxts_enc_body,.Lxts_enc_epilogue # HandlerData[] +#.LSEH_info_xts_dec: +# .byte 9,0,0,0 +# .rva ctr_xts_se_handler +# .rva .Lxts_dec_body,.Lxts_dec_epilogue # HandlerData[] +#.LSEH_info_ocb_enc: +# .byte 9,0,0,0 +# .rva ocb_se_handler +# .rva .Locb_enc_body,.Locb_enc_epilogue # HandlerData[] +# .rva .Locb_enc_pop +# .long 0 +#.LSEH_info_ocb_dec: +# .byte 9,0,0,0 +# .rva ocb_se_handler +# .rva .Locb_dec_body,.Locb_dec_epilogue # HandlerData[] +# .rva .Locb_dec_pop +# .long 0 +___ +$code.=<<___; +#.LSEH_info_cbc: +# .byte 9,0,0,0 +# .rva cbc_se_handler +.LSEH_info_key: + .byte 0x01,0x04,0x01,0x00 + .byte 0x04,0x02,0x00,0x00 # sub rsp,8 +___ +} + +sub rex { + local *opcode=shift; + my ($dst,$src)=@_; + my $rex=0; + + $rex|=0x04 if($dst>=8); + $rex|=0x01 if($src>=8); + push @opcode,$rex|0x40 if($rex); +} + +sub aesni { + my $line=shift; + my @opcode=(0x66); + + if ($line=~/(aeskeygenassist)\s+\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + rex(\@opcode,$4,$3); + push 
@opcode,0x0f,0x3a,0xdf; + push @opcode,0xc0|($3&7)|(($4&7)<<3); # ModR/M + my $c=$2; + push @opcode,$c=~/^0/?oct($c):$c; + return ".byte\t".join(',',@opcode); + } + elsif ($line=~/(aes[a-z]+)\s+%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my %opcodelet = ( + "aesimc" => 0xdb, + "aesenc" => 0xdc, "aesenclast" => 0xdd, + "aesdec" => 0xde, "aesdeclast" => 0xdf + ); + return undef if (!defined($opcodelet{$1})); + rex(\@opcode,$3,$2); + push @opcode,0x0f,0x38,$opcodelet{$1}; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + return ".byte\t".join(',',@opcode); + } + elsif ($line=~/(aes[a-z]+)\s+([0x1-9a-fA-F]*)\(%rsp\),\s*%xmm([0-9]+)/) { + my %opcodelet = ( + "aesenc" => 0xdc, "aesenclast" => 0xdd, + "aesdec" => 0xde, "aesdeclast" => 0xdf + ); + return undef if (!defined($opcodelet{$1})); + my $off = $2; + push @opcode,0x44 if ($3>=8); + push @opcode,0x0f,0x38,$opcodelet{$1}; + push @opcode,0x44|(($3&7)<<3),0x24; # ModR/M + push @opcode,($off=~/^0/?oct($off):$off)&0xff; + return ".byte\t".join(',',@opcode); + } + return $line; +} + +sub movbe { + ".byte 0x0f,0x38,0xf1,0x44,0x24,".shift; +} + +$code =~ s/\`([^\`]*)\`/eval($1)/gem; +$code =~ s/\b(aes.*%xmm[0-9]+).*$/aesni($1)/gem; +#$code =~ s/\bmovbe\s+%eax/bswap %eax; mov %eax/gm; # debugging artefact +$code =~ s/\bmovbe\s+%eax,\s*([0-9]+)\(%rsp\)/movbe($1)/gem; + +print $code; + +close STDOUT; diff --git a/crypto/aesgcm/aesni_gcm_x64_gas.s b/crypto/aesgcm/aesni_gcm_x64_gas.s new file mode 100644 index 0000000..993e81b --- /dev/null +++ b/crypto/aesgcm/aesni_gcm_x64_gas.s @@ -0,0 +1,831 @@ +.text + +.type _aesni_ctr32_ghash_6x,@function +.align 32 +_aesni_ctr32_ghash_6x: +.cfi_startproc + vmovdqu 32(%r11),%xmm2 + subq $6,%rdx + vpxor %xmm4,%xmm4,%xmm4 + vmovdqu 0-128(%rcx),%xmm15 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpaddb %xmm2,%xmm11,%xmm12 + vpaddb %xmm2,%xmm12,%xmm13 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm15,%xmm1,%xmm9 + vmovdqu %xmm4,16+8(%rsp) + jmp .Loop6x + +.align 32 +.Loop6x: + addl $100663296,%ebx + jc .Lhandle_ctr32 + vmovdqu 0-32(%r9),%xmm3 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm15,%xmm10,%xmm10 + vpxor %xmm15,%xmm11,%xmm11 + +.Lresume_ctr32: + vmovdqu %xmm1,(%r8) + vpclmulqdq $0x10,%xmm3,%xmm7,%xmm5 + vpxor %xmm15,%xmm12,%xmm12 + vmovups 16-128(%rcx),%xmm2 + vpclmulqdq $0x01,%xmm3,%xmm7,%xmm6 + + + + + + + + + + + + + + + + + + xorq %r12,%r12 + cmpq %r14,%r15 + + vaesenc %xmm2,%xmm9,%xmm9 + vmovdqu 48+8(%rsp),%xmm0 + vpxor %xmm15,%xmm13,%xmm13 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm1 + vaesenc %xmm2,%xmm10,%xmm10 + vpxor %xmm15,%xmm14,%xmm14 + setnc %r12b + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vaesenc %xmm2,%xmm11,%xmm11 + vmovdqu 16-32(%r9),%xmm3 + negq %r12 + vaesenc %xmm2,%xmm12,%xmm12 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm3,%xmm0,%xmm5 + vpxor %xmm4,%xmm8,%xmm8 + vaesenc %xmm2,%xmm13,%xmm13 + vpxor %xmm5,%xmm1,%xmm4 + andq $0x60,%r12 + vmovups 32-128(%rcx),%xmm15 + vpclmulqdq $0x10,%xmm3,%xmm0,%xmm1 + vaesenc %xmm2,%xmm14,%xmm14 + + vpclmulqdq $0x01,%xmm3,%xmm0,%xmm2 + leaq (%r14,%r12,1),%r14 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x11,%xmm3,%xmm0,%xmm3 + vmovdqu 64+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 88(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 80(%r14),%r12 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,32+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,40+8(%rsp) + vmovdqu 48-32(%r9),%xmm5 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 48-128(%rcx),%xmm15 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm5,%xmm0,%xmm1 + vaesenc 
%xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm5,%xmm0,%xmm2 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm3,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm5,%xmm0,%xmm3 + vaesenc %xmm15,%xmm11,%xmm11 + vpclmulqdq $0x11,%xmm5,%xmm0,%xmm5 + vmovdqu 80+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor %xmm1,%xmm4,%xmm4 + vmovdqu 64-32(%r9),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 64-128(%rcx),%xmm15 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm1,%xmm0,%xmm2 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm1,%xmm0,%xmm3 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 72(%r14),%r13 + vpxor %xmm5,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm1,%xmm0,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 64(%r14),%r12 + vpclmulqdq $0x11,%xmm1,%xmm0,%xmm1 + vmovdqu 96+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,48+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,56+8(%rsp) + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 96-32(%r9),%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 80-128(%rcx),%xmm15 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm2,%xmm0,%xmm5 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 56(%r14),%r13 + vpxor %xmm1,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm2,%xmm0,%xmm1 + vpxor 112+8(%rsp),%xmm8,%xmm8 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 48(%r14),%r12 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm2 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,64+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,72+8(%rsp) + vpxor %xmm3,%xmm4,%xmm4 + vmovdqu 112-32(%r9),%xmm3 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 96-128(%rcx),%xmm15 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm5 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x01,%xmm3,%xmm8,%xmm1 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 40(%r14),%r13 + vpxor %xmm2,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm3,%xmm8,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 32(%r14),%r12 + vpclmulqdq $0x11,%xmm3,%xmm8,%xmm8 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,80+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,88+8(%rsp) + vpxor %xmm5,%xmm6,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor %xmm1,%xmm6,%xmm6 + + vmovups 112-128(%rcx),%xmm15 + vpslldq $8,%xmm6,%xmm5 + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 16(%r11),%xmm3 + + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm8,%xmm7,%xmm7 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm5,%xmm4,%xmm4 + movbeq 24(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 16(%r14),%r12 + vpalignr $8,%xmm4,%xmm4,%xmm0 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + movq %r13,96+8(%rsp) + vaesenc %xmm15,%xmm12,%xmm12 + movq %r12,104+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + vmovups 128-128(%rcx),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vmovups 144-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm10,%xmm10 + vpsrldq $8,%xmm6,%xmm6 + vaesenc %xmm1,%xmm11,%xmm11 + vpxor %xmm6,%xmm7,%xmm7 + vaesenc %xmm1,%xmm12,%xmm12 + vpxor %xmm0,%xmm4,%xmm4 + movbeq 8(%r14),%r13 + vaesenc %xmm1,%xmm13,%xmm13 + movbeq 0(%r14),%r12 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 160-128(%rcx),%xmm1 + cmpl $11,%ebp + jb .Lenc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 176-128(%rcx),%xmm15 + vaesenc 
%xmm1,%xmm14,%xmm14 + vmovups 192-128(%rcx),%xmm1 + je .Lenc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 208-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 224-128(%rcx),%xmm1 + jmp .Lenc_tail + +.align 32 +.Lhandle_ctr32: + vmovdqu (%r11),%xmm0 + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vmovdqu 0-32(%r9),%xmm3 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm15,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm15,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpshufb %xmm0,%xmm14,%xmm14 + vpshufb %xmm0,%xmm1,%xmm1 + jmp .Lresume_ctr32 + +.align 32 +.Lenc_tail: + vaesenc %xmm15,%xmm9,%xmm9 + vmovdqu %xmm7,16+8(%rsp) + vpalignr $8,%xmm4,%xmm4,%xmm8 + vaesenc %xmm15,%xmm10,%xmm10 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + vpxor 0(%rdi),%xmm1,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 16(%rdi),%xmm1,%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 32(%rdi),%xmm1,%xmm5 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor 48(%rdi),%xmm1,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 64(%rdi),%xmm1,%xmm7 + vpxor 80(%rdi),%xmm1,%xmm3 + vmovdqu (%r8),%xmm1 + + vaesenclast %xmm2,%xmm9,%xmm9 + vmovdqu 32(%r11),%xmm2 + vaesenclast %xmm0,%xmm10,%xmm10 + vpaddb %xmm2,%xmm1,%xmm0 + movq %r13,112+8(%rsp) + leaq 96(%rdi),%rdi + vaesenclast %xmm5,%xmm11,%xmm11 + vpaddb %xmm2,%xmm0,%xmm5 + movq %r12,120+8(%rsp) + leaq 96(%rsi),%rsi + vmovdqu 0-128(%rcx),%xmm15 + vaesenclast %xmm6,%xmm12,%xmm12 + vpaddb %xmm2,%xmm5,%xmm6 + vaesenclast %xmm7,%xmm13,%xmm13 + vpaddb %xmm2,%xmm6,%xmm7 + vaesenclast %xmm3,%xmm14,%xmm14 + vpaddb %xmm2,%xmm7,%xmm3 + + addq $0x60,%r10 + subq $0x6,%rdx + jc .L6x_done + + vmovups %xmm9,-96(%rsi) + vpxor %xmm15,%xmm1,%xmm9 + vmovups %xmm10,-80(%rsi) + vmovdqa %xmm0,%xmm10 + vmovups %xmm11,-64(%rsi) + vmovdqa %xmm5,%xmm11 + vmovups %xmm12,-48(%rsi) + vmovdqa %xmm6,%xmm12 + vmovups %xmm13,-32(%rsi) + vmovdqa %xmm7,%xmm13 + vmovups %xmm14,-16(%rsi) + vmovdqa %xmm3,%xmm14 + vmovdqu 32+8(%rsp),%xmm7 + jmp .Loop6x + +.L6x_done: + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpxor %xmm4,%xmm8,%xmm8 + + ret +.cfi_endproc +.size _aesni_ctr32_ghash_6x,.-_aesni_ctr32_ghash_6x +.globl aesni_gcm_decrypt +.type aesni_gcm_decrypt,@function +.align 32 +aesni_gcm_decrypt: +.cfi_startproc + xorq %r10,%r10 + + + + cmpq $0x60,%rdx + jb .Lgcm_dec_abort + + leaq (%rsp),%rax +.cfi_def_cfa_register %rax + pushq %rbx +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_offset %r15,-56 + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq .Lbswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + vmovdqu 16(%r8),%xmm8 + andq $-128,%rsp + vmovdqu (%r11),%xmm0 + leaq 128(%rcx),%rcx + leaq 16+32(%r9),%r9 + movl 240-128(%rcx),%ebp + vpshufb %xmm0,%xmm8,%xmm8 + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc .Ldec_no_key_aliasing + cmpq $768,%r15 + jnc .Ldec_no_key_aliasing + subq %r15,%rsp +.Ldec_no_key_aliasing: + + vmovdqu 80(%rdi),%xmm7 + leaq (%rdi),%r14 + 
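+# (%xmm7 above and the vmovdqu/vpshufb pairs below stage the first six
+# 16-byte ciphertext blocks in hash byte order, five of them parked at
+# 48..112(%rsp) and one kept in %xmm7, so that .Loop6x can GHASH this
+# group while it runs the AES-CTR rounds for the next one)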
vmovdqu 64(%rdi),%xmm4 + + + + + + + + leaq -192(%rdi,%rdx,1),%r15 + + vmovdqu 48(%rdi),%xmm5 + shrq $4,%rdx + xorq %r10,%r10 + vmovdqu 32(%rdi),%xmm6 + vpshufb %xmm0,%xmm7,%xmm7 + vmovdqu 16(%rdi),%xmm2 + vpshufb %xmm0,%xmm4,%xmm4 + vmovdqu (%rdi),%xmm3 + vpshufb %xmm0,%xmm5,%xmm5 + vmovdqu %xmm4,48(%rsp) + vpshufb %xmm0,%xmm6,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm2,%xmm2 + vmovdqu %xmm6,80(%rsp) + vpshufb %xmm0,%xmm3,%xmm3 + vmovdqu %xmm2,96(%rsp) + vmovdqu %xmm3,112(%rsp) + + call _aesni_ctr32_ghash_6x + + vmovups %xmm9,-96(%rsi) + vmovups %xmm10,-80(%rsi) + vmovups %xmm11,-64(%rsi) + vmovups %xmm12,-48(%rsi) + vmovups %xmm13,-32(%rsi) + vmovups %xmm14,-16(%rsi) + + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 +.cfi_restore %r15 + movq -40(%rax),%r14 +.cfi_restore %r14 + movq -32(%rax),%r13 +.cfi_restore %r13 + movq -24(%rax),%r12 +.cfi_restore %r12 + movq -16(%rax),%rbp +.cfi_restore %rbp + movq -8(%rax),%rbx +.cfi_restore %rbx + leaq (%rax),%rsp +.cfi_def_cfa_register %rsp +.Lgcm_dec_abort: + movq %r10,%rax + ret +.cfi_endproc +.size aesni_gcm_decrypt,.-aesni_gcm_decrypt +.type _aesni_ctr32_6x,@function +.align 32 +_aesni_ctr32_6x: +.cfi_startproc + vmovdqu 0-128(%rcx),%xmm4 + vmovdqu 32(%r11),%xmm2 + leaq -1(%rbp),%r13 + vmovups 16-128(%rcx),%xmm15 + leaq 32-128(%rcx),%r12 + vpxor %xmm4,%xmm1,%xmm9 + addl $100663296,%ebx + jc .Lhandle_ctr32_2 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddb %xmm2,%xmm11,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddb %xmm2,%xmm12,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp .Loop_ctr32 + +.align 16 +.Loop_ctr32: + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + vmovups (%r12),%xmm15 + leaq 16(%r12),%r12 + decl %r13d + jnz .Loop_ctr32 + + vmovdqu (%r12),%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 0(%rdi),%xmm3,%xmm4 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor 16(%rdi),%xmm3,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 32(%rdi),%xmm3,%xmm6 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 48(%rdi),%xmm3,%xmm8 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor 64(%rdi),%xmm3,%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 80(%rdi),%xmm3,%xmm3 + leaq 96(%rdi),%rdi + + vaesenclast %xmm4,%xmm9,%xmm9 + vaesenclast %xmm5,%xmm10,%xmm10 + vaesenclast %xmm6,%xmm11,%xmm11 + vaesenclast %xmm8,%xmm12,%xmm12 + vaesenclast %xmm2,%xmm13,%xmm13 + vaesenclast %xmm3,%xmm14,%xmm14 + vmovups %xmm9,0(%rsi) + vmovups %xmm10,16(%rsi) + vmovups %xmm11,32(%rsi) + vmovups %xmm12,48(%rsi) + vmovups %xmm13,64(%rsi) + vmovups %xmm14,80(%rsi) + leaq 96(%rsi),%rsi + + ret +.align 32 +.Lhandle_ctr32_2: + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpshufb %xmm0,%xmm14,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpshufb %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp .Loop_ctr32 +.cfi_endproc +.size _aesni_ctr32_6x,.-_aesni_ctr32_6x + +.globl aesni_gcm_encrypt +.type aesni_gcm_encrypt,@function +.align 
32 +aesni_gcm_encrypt: +.cfi_startproc + xorq %r10,%r10 + + + + + cmpq $288,%rdx + jb .Lgcm_enc_abort + + leaq (%rsp),%rax +.cfi_def_cfa_register %rax + pushq %rbx +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_offset %r15,-56 + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq .Lbswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + leaq 128(%rcx),%rcx + vmovdqu (%r11),%xmm0 + andq $-128,%rsp + movl 240-128(%rcx),%ebp + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc .Lenc_no_key_aliasing + cmpq $768,%r15 + jnc .Lenc_no_key_aliasing + subq %r15,%rsp +.Lenc_no_key_aliasing: + + leaq (%rsi),%r14 + + + + + + + + + leaq -192(%rsi,%rdx,1),%r15 + + shrq $4,%rdx + + call _aesni_ctr32_6x + + vpshufb %xmm0,%xmm9,%xmm8 + vpshufb %xmm0,%xmm10,%xmm2 + vmovdqu %xmm8,112(%rsp) + vpshufb %xmm0,%xmm11,%xmm4 + vmovdqu %xmm2,96(%rsp) + vpshufb %xmm0,%xmm12,%xmm5 + vmovdqu %xmm4,80(%rsp) + vpshufb %xmm0,%xmm13,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm14,%xmm7 + vmovdqu %xmm6,48(%rsp) + + call _aesni_ctr32_6x + + vmovdqu 16(%r8),%xmm8 + leaq 16+32(%r9),%r9 + subq $12,%rdx + movq $192,%r10 + vpshufb %xmm0,%xmm8,%xmm8 + + call _aesni_ctr32_ghash_6x + vmovdqu 32(%rsp),%xmm7 + vmovdqu (%r11),%xmm0 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm7,%xmm7,%xmm1 + vmovdqu 32-32(%r9),%xmm15 + vmovups %xmm9,-96(%rsi) + vpshufb %xmm0,%xmm9,%xmm9 + vpxor %xmm7,%xmm1,%xmm1 + vmovups %xmm10,-80(%rsi) + vpshufb %xmm0,%xmm10,%xmm10 + vmovups %xmm11,-64(%rsi) + vpshufb %xmm0,%xmm11,%xmm11 + vmovups %xmm12,-48(%rsi) + vpshufb %xmm0,%xmm12,%xmm12 + vmovups %xmm13,-32(%rsi) + vpshufb %xmm0,%xmm13,%xmm13 + vmovups %xmm14,-16(%rsi) + vpshufb %xmm0,%xmm14,%xmm14 + vmovdqu %xmm9,16(%rsp) + vmovdqu 48(%rsp),%xmm6 + vmovdqu 16-32(%r9),%xmm0 + vpunpckhqdq %xmm6,%xmm6,%xmm2 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm5 + vpxor %xmm6,%xmm2,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + + vmovdqu 64(%rsp),%xmm9 + vpclmulqdq $0x00,%xmm0,%xmm6,%xmm4 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm9,%xmm9,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm6,%xmm6 + vpxor %xmm9,%xmm5,%xmm5 + vpxor %xmm7,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vmovdqu 80(%rsp),%xmm1 + vpclmulqdq $0x00,%xmm3,%xmm9,%xmm7 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm4,%xmm7,%xmm7 + vpunpckhqdq %xmm1,%xmm1,%xmm4 + vpclmulqdq $0x11,%xmm3,%xmm9,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpxor %xmm6,%xmm9,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm5,%xmm5 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 96(%rsp),%xmm2 + vpclmulqdq $0x00,%xmm0,%xmm1,%xmm6 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm7,%xmm6,%xmm6 + vpunpckhqdq %xmm2,%xmm2,%xmm7 + vpclmulqdq $0x11,%xmm0,%xmm1,%xmm1 + vpxor %xmm2,%xmm7,%xmm7 + vpxor %xmm9,%xmm1,%xmm1 + vpclmulqdq $0x10,%xmm15,%xmm4,%xmm4 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm5,%xmm4,%xmm4 + + vpxor 112(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm3,%xmm2,%xmm5 + vmovdqu 112-32(%r9),%xmm0 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpxor %xmm6,%xmm5,%xmm5 + vpclmulqdq $0x11,%xmm3,%xmm2,%xmm2 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm1,%xmm2,%xmm2 + vpclmulqdq $0x00,%xmm15,%xmm7,%xmm7 + vpxor %xmm4,%xmm7,%xmm4 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm6 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm1 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm8 + vpxor %xmm14,%xmm1,%xmm1 + vpxor %xmm5,%xmm6,%xmm5 + 
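+# (tail GHASH: each of the final blocks is multiplied by the matching
+# power of H kept relative to (%r9), Karatsuba style with three
+# vpclmulqdq per block; the modular reduction is deferred to the end)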
vpclmulqdq $0x10,%xmm15,%xmm9,%xmm9 + vmovdqu 32-32(%r9),%xmm15 + vpxor %xmm2,%xmm8,%xmm7 + vpxor %xmm4,%xmm9,%xmm6 + + vmovdqu 16-32(%r9),%xmm0 + vpxor %xmm5,%xmm7,%xmm9 + vpclmulqdq $0x00,%xmm3,%xmm14,%xmm4 + vpxor %xmm9,%xmm6,%xmm6 + vpunpckhqdq %xmm13,%xmm13,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm14,%xmm14 + vpxor %xmm13,%xmm2,%xmm2 + vpslldq $8,%xmm6,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + vpxor %xmm9,%xmm5,%xmm8 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm6,%xmm7,%xmm7 + + vpclmulqdq $0x00,%xmm0,%xmm13,%xmm5 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm12,%xmm12,%xmm9 + vpclmulqdq $0x11,%xmm0,%xmm13,%xmm13 + vpxor %xmm12,%xmm9,%xmm9 + vpxor %xmm14,%xmm13,%xmm13 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm3,%xmm12,%xmm4 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm11,%xmm11,%xmm1 + vpclmulqdq $0x11,%xmm3,%xmm12,%xmm12 + vpxor %xmm11,%xmm1,%xmm1 + vpxor %xmm13,%xmm12,%xmm12 + vxorps 16(%rsp),%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm9,%xmm9 + + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm0,%xmm11,%xmm5 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm10,%xmm10,%xmm2 + vpclmulqdq $0x11,%xmm0,%xmm11,%xmm11 + vpxor %xmm10,%xmm2,%xmm2 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpxor %xmm12,%xmm11,%xmm11 + vpclmulqdq $0x10,%xmm15,%xmm1,%xmm1 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm9,%xmm1,%xmm1 + + vxorps %xmm7,%xmm14,%xmm14 + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm3,%xmm10,%xmm4 + vmovdqu 112-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpclmulqdq $0x11,%xmm3,%xmm10,%xmm10 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm11,%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm15,%xmm2,%xmm2 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm7 + vpxor %xmm4,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm15,%xmm9,%xmm6 + vpxor %xmm10,%xmm7,%xmm7 + vpxor %xmm2,%xmm6,%xmm6 + + vpxor %xmm5,%xmm7,%xmm4 + vpxor %xmm4,%xmm6,%xmm6 + vpslldq $8,%xmm6,%xmm1 + vmovdqu 16(%r11),%xmm3 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm1,%xmm5,%xmm8 + vpxor %xmm6,%xmm7,%xmm7 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm2,%xmm8,%xmm8 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm7,%xmm2,%xmm2 + vpxor %xmm2,%xmm8,%xmm8 + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 +.cfi_restore %r15 + movq -40(%rax),%r14 +.cfi_restore %r14 + movq -32(%rax),%r13 +.cfi_restore %r13 + movq -24(%rax),%r12 +.cfi_restore %r12 + movq -16(%rax),%rbp +.cfi_restore %rbp + movq -8(%rax),%rbx +.cfi_restore %rbx + leaq (%rax),%rsp +.cfi_def_cfa_register %rsp +.Lgcm_enc_abort: + movq %r10,%rax + ret +.cfi_endproc +.size aesni_gcm_encrypt,.-aesni_gcm_encrypt +.align 64 +.Lbswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lpoly: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.Lone_msb: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Ltwo_lsb: +.byte 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.Lone_lsb: +.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.byte 65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108,101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.align 64 diff --git 
a/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s b/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s new file mode 100644 index 0000000..184f239 --- /dev/null +++ b/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s @@ -0,0 +1,831 @@ +.text + + +.p2align 5 +_aesni_ctr32_ghash_6x: + + vmovdqu 32(%r11),%xmm2 + subq $6,%rdx + vpxor %xmm4,%xmm4,%xmm4 + vmovdqu 0-128(%rcx),%xmm15 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpaddb %xmm2,%xmm11,%xmm12 + vpaddb %xmm2,%xmm12,%xmm13 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm15,%xmm1,%xmm9 + vmovdqu %xmm4,16+8(%rsp) + jmp L$oop6x + +.p2align 5 +L$oop6x: + addl $100663296,%ebx + jc L$handle_ctr32 + vmovdqu 0-32(%r9),%xmm3 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm15,%xmm10,%xmm10 + vpxor %xmm15,%xmm11,%xmm11 + +L$resume_ctr32: + vmovdqu %xmm1,(%r8) + vpclmulqdq $0x10,%xmm3,%xmm7,%xmm5 + vpxor %xmm15,%xmm12,%xmm12 + vmovups 16-128(%rcx),%xmm2 + vpclmulqdq $0x01,%xmm3,%xmm7,%xmm6 + + + + + + + + + + + + + + + + + + xorq %r12,%r12 + cmpq %r14,%r15 + + vaesenc %xmm2,%xmm9,%xmm9 + vmovdqu 48+8(%rsp),%xmm0 + vpxor %xmm15,%xmm13,%xmm13 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm1 + vaesenc %xmm2,%xmm10,%xmm10 + vpxor %xmm15,%xmm14,%xmm14 + setnc %r12b + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vaesenc %xmm2,%xmm11,%xmm11 + vmovdqu 16-32(%r9),%xmm3 + negq %r12 + vaesenc %xmm2,%xmm12,%xmm12 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm3,%xmm0,%xmm5 + vpxor %xmm4,%xmm8,%xmm8 + vaesenc %xmm2,%xmm13,%xmm13 + vpxor %xmm5,%xmm1,%xmm4 + andq $0x60,%r12 + vmovups 32-128(%rcx),%xmm15 + vpclmulqdq $0x10,%xmm3,%xmm0,%xmm1 + vaesenc %xmm2,%xmm14,%xmm14 + + vpclmulqdq $0x01,%xmm3,%xmm0,%xmm2 + leaq (%r14,%r12,1),%r14 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x11,%xmm3,%xmm0,%xmm3 + vmovdqu 64+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 88(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 80(%r14),%r12 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,32+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,40+8(%rsp) + vmovdqu 48-32(%r9),%xmm5 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 48-128(%rcx),%xmm15 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm5,%xmm0,%xmm1 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm5,%xmm0,%xmm2 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm3,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm5,%xmm0,%xmm3 + vaesenc %xmm15,%xmm11,%xmm11 + vpclmulqdq $0x11,%xmm5,%xmm0,%xmm5 + vmovdqu 80+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor %xmm1,%xmm4,%xmm4 + vmovdqu 64-32(%r9),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 64-128(%rcx),%xmm15 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm1,%xmm0,%xmm2 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm1,%xmm0,%xmm3 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 72(%r14),%r13 + vpxor %xmm5,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm1,%xmm0,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 64(%r14),%r12 + vpclmulqdq $0x11,%xmm1,%xmm0,%xmm1 + vmovdqu 96+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,48+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,56+8(%rsp) + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 96-32(%r9),%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 80-128(%rcx),%xmm15 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm2,%xmm0,%xmm5 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 56(%r14),%r13 + vpxor %xmm1,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm2,%xmm0,%xmm1 + vpxor 112+8(%rsp),%xmm8,%xmm8 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 
48(%r14),%r12 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm2 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,64+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,72+8(%rsp) + vpxor %xmm3,%xmm4,%xmm4 + vmovdqu 112-32(%r9),%xmm3 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 96-128(%rcx),%xmm15 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm5 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x01,%xmm3,%xmm8,%xmm1 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 40(%r14),%r13 + vpxor %xmm2,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm3,%xmm8,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 32(%r14),%r12 + vpclmulqdq $0x11,%xmm3,%xmm8,%xmm8 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,80+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,88+8(%rsp) + vpxor %xmm5,%xmm6,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor %xmm1,%xmm6,%xmm6 + + vmovups 112-128(%rcx),%xmm15 + vpslldq $8,%xmm6,%xmm5 + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 16(%r11),%xmm3 + + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm8,%xmm7,%xmm7 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm5,%xmm4,%xmm4 + movbeq 24(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 16(%r14),%r12 + vpalignr $8,%xmm4,%xmm4,%xmm0 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + movq %r13,96+8(%rsp) + vaesenc %xmm15,%xmm12,%xmm12 + movq %r12,104+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + vmovups 128-128(%rcx),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vmovups 144-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm10,%xmm10 + vpsrldq $8,%xmm6,%xmm6 + vaesenc %xmm1,%xmm11,%xmm11 + vpxor %xmm6,%xmm7,%xmm7 + vaesenc %xmm1,%xmm12,%xmm12 + vpxor %xmm0,%xmm4,%xmm4 + movbeq 8(%r14),%r13 + vaesenc %xmm1,%xmm13,%xmm13 + movbeq 0(%r14),%r12 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 160-128(%rcx),%xmm1 + cmpl $11,%ebp + jb L$enc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 176-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 192-128(%rcx),%xmm1 + je L$enc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 208-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 224-128(%rcx),%xmm1 + jmp L$enc_tail + +.p2align 5 +L$handle_ctr32: + vmovdqu (%r11),%xmm0 + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vmovdqu 0-32(%r9),%xmm3 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm15,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm15,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpshufb %xmm0,%xmm14,%xmm14 + vpshufb %xmm0,%xmm1,%xmm1 + jmp L$resume_ctr32 + +.p2align 5 +L$enc_tail: + vaesenc %xmm15,%xmm9,%xmm9 + vmovdqu %xmm7,16+8(%rsp) + vpalignr $8,%xmm4,%xmm4,%xmm8 + vaesenc %xmm15,%xmm10,%xmm10 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + vpxor 0(%rdi),%xmm1,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 16(%rdi),%xmm1,%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 32(%rdi),%xmm1,%xmm5 + vaesenc 
%xmm15,%xmm13,%xmm13 + vpxor 48(%rdi),%xmm1,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 64(%rdi),%xmm1,%xmm7 + vpxor 80(%rdi),%xmm1,%xmm3 + vmovdqu (%r8),%xmm1 + + vaesenclast %xmm2,%xmm9,%xmm9 + vmovdqu 32(%r11),%xmm2 + vaesenclast %xmm0,%xmm10,%xmm10 + vpaddb %xmm2,%xmm1,%xmm0 + movq %r13,112+8(%rsp) + leaq 96(%rdi),%rdi + vaesenclast %xmm5,%xmm11,%xmm11 + vpaddb %xmm2,%xmm0,%xmm5 + movq %r12,120+8(%rsp) + leaq 96(%rsi),%rsi + vmovdqu 0-128(%rcx),%xmm15 + vaesenclast %xmm6,%xmm12,%xmm12 + vpaddb %xmm2,%xmm5,%xmm6 + vaesenclast %xmm7,%xmm13,%xmm13 + vpaddb %xmm2,%xmm6,%xmm7 + vaesenclast %xmm3,%xmm14,%xmm14 + vpaddb %xmm2,%xmm7,%xmm3 + + addq $0x60,%r10 + subq $0x6,%rdx + jc L$6x_done + + vmovups %xmm9,-96(%rsi) + vpxor %xmm15,%xmm1,%xmm9 + vmovups %xmm10,-80(%rsi) + vmovdqa %xmm0,%xmm10 + vmovups %xmm11,-64(%rsi) + vmovdqa %xmm5,%xmm11 + vmovups %xmm12,-48(%rsi) + vmovdqa %xmm6,%xmm12 + vmovups %xmm13,-32(%rsi) + vmovdqa %xmm7,%xmm13 + vmovups %xmm14,-16(%rsi) + vmovdqa %xmm3,%xmm14 + vmovdqu 32+8(%rsp),%xmm7 + jmp L$oop6x + +L$6x_done: + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpxor %xmm4,%xmm8,%xmm8 + + ret + + +.globl _aesni_gcm_decrypt + +.p2align 5 +_aesni_gcm_decrypt: + + xorq %r10,%r10 + + + + cmpq $0x60,%rdx + jb L$gcm_dec_abort + + leaq (%rsp),%rax + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq L$bswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + vmovdqu 16(%r8),%xmm8 + andq $-128,%rsp + vmovdqu (%r11),%xmm0 + leaq 128(%rcx),%rcx + leaq 16+32(%r9),%r9 + movl 240-128(%rcx),%ebp + vpshufb %xmm0,%xmm8,%xmm8 + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc L$dec_no_key_aliasing + cmpq $768,%r15 + jnc L$dec_no_key_aliasing + subq %r15,%rsp +L$dec_no_key_aliasing: + + vmovdqu 80(%rdi),%xmm7 + leaq (%rdi),%r14 + vmovdqu 64(%rdi),%xmm4 + + + + + + + + leaq -192(%rdi,%rdx,1),%r15 + + vmovdqu 48(%rdi),%xmm5 + shrq $4,%rdx + xorq %r10,%r10 + vmovdqu 32(%rdi),%xmm6 + vpshufb %xmm0,%xmm7,%xmm7 + vmovdqu 16(%rdi),%xmm2 + vpshufb %xmm0,%xmm4,%xmm4 + vmovdqu (%rdi),%xmm3 + vpshufb %xmm0,%xmm5,%xmm5 + vmovdqu %xmm4,48(%rsp) + vpshufb %xmm0,%xmm6,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm2,%xmm2 + vmovdqu %xmm6,80(%rsp) + vpshufb %xmm0,%xmm3,%xmm3 + vmovdqu %xmm2,96(%rsp) + vmovdqu %xmm3,112(%rsp) + + call _aesni_ctr32_ghash_6x + + vmovups %xmm9,-96(%rsi) + vmovups %xmm10,-80(%rsi) + vmovups %xmm11,-64(%rsi) + vmovups %xmm12,-48(%rsi) + vmovups %xmm13,-32(%rsi) + vmovups %xmm14,-16(%rsi) + + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 + + movq -40(%rax),%r14 + + movq -32(%rax),%r13 + + movq -24(%rax),%r12 + + movq -16(%rax),%rbp + + movq -8(%rax),%rbx + + leaq (%rax),%rsp + +L$gcm_dec_abort: + movq %r10,%rax + ret + + + +.p2align 5 +_aesni_ctr32_6x: + + vmovdqu 0-128(%rcx),%xmm4 + vmovdqu 32(%r11),%xmm2 + leaq -1(%rbp),%r13 + vmovups 16-128(%rcx),%xmm15 + leaq 32-128(%rcx),%r12 + vpxor %xmm4,%xmm1,%xmm9 + addl $100663296,%ebx + jc L$handle_ctr32_2 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddb %xmm2,%xmm11,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddb %xmm2,%xmm12,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp L$oop_ctr32 + +.p2align 4 +L$oop_ctr32: + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc 
%xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + vmovups (%r12),%xmm15 + leaq 16(%r12),%r12 + decl %r13d + jnz L$oop_ctr32 + + vmovdqu (%r12),%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 0(%rdi),%xmm3,%xmm4 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor 16(%rdi),%xmm3,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 32(%rdi),%xmm3,%xmm6 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 48(%rdi),%xmm3,%xmm8 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor 64(%rdi),%xmm3,%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 80(%rdi),%xmm3,%xmm3 + leaq 96(%rdi),%rdi + + vaesenclast %xmm4,%xmm9,%xmm9 + vaesenclast %xmm5,%xmm10,%xmm10 + vaesenclast %xmm6,%xmm11,%xmm11 + vaesenclast %xmm8,%xmm12,%xmm12 + vaesenclast %xmm2,%xmm13,%xmm13 + vaesenclast %xmm3,%xmm14,%xmm14 + vmovups %xmm9,0(%rsi) + vmovups %xmm10,16(%rsi) + vmovups %xmm11,32(%rsi) + vmovups %xmm12,48(%rsi) + vmovups %xmm13,64(%rsi) + vmovups %xmm14,80(%rsi) + leaq 96(%rsi),%rsi + + ret +.p2align 5 +L$handle_ctr32_2: + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpshufb %xmm0,%xmm14,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpshufb %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp L$oop_ctr32 + + + +.globl _aesni_gcm_encrypt + +.p2align 5 +_aesni_gcm_encrypt: + + xorq %r10,%r10 + + + + + cmpq $288,%rdx + jb L$gcm_enc_abort + + leaq (%rsp),%rax + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq L$bswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + leaq 128(%rcx),%rcx + vmovdqu (%r11),%xmm0 + andq $-128,%rsp + movl 240-128(%rcx),%ebp + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc L$enc_no_key_aliasing + cmpq $768,%r15 + jnc L$enc_no_key_aliasing + subq %r15,%rsp +L$enc_no_key_aliasing: + + leaq (%rsi),%r14 + + + + + + + + + leaq -192(%rsi,%rdx,1),%r15 + + shrq $4,%rdx + + call _aesni_ctr32_6x + + vpshufb %xmm0,%xmm9,%xmm8 + vpshufb %xmm0,%xmm10,%xmm2 + vmovdqu %xmm8,112(%rsp) + vpshufb %xmm0,%xmm11,%xmm4 + vmovdqu %xmm2,96(%rsp) + vpshufb %xmm0,%xmm12,%xmm5 + vmovdqu %xmm4,80(%rsp) + vpshufb %xmm0,%xmm13,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm14,%xmm7 + vmovdqu %xmm6,48(%rsp) + + call _aesni_ctr32_6x + + vmovdqu 16(%r8),%xmm8 + leaq 16+32(%r9),%r9 + subq $12,%rdx + movq $192,%r10 + vpshufb %xmm0,%xmm8,%xmm8 + + call _aesni_ctr32_ghash_6x + vmovdqu 32(%rsp),%xmm7 + vmovdqu (%r11),%xmm0 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm7,%xmm7,%xmm1 + vmovdqu 32-32(%r9),%xmm15 + vmovups %xmm9,-96(%rsi) + vpshufb %xmm0,%xmm9,%xmm9 + vpxor %xmm7,%xmm1,%xmm1 + vmovups %xmm10,-80(%rsi) + vpshufb %xmm0,%xmm10,%xmm10 + vmovups %xmm11,-64(%rsi) + vpshufb %xmm0,%xmm11,%xmm11 + vmovups %xmm12,-48(%rsi) + vpshufb %xmm0,%xmm12,%xmm12 + vmovups %xmm13,-32(%rsi) + vpshufb %xmm0,%xmm13,%xmm13 + vmovups %xmm14,-16(%rsi) + vpshufb %xmm0,%xmm14,%xmm14 + vmovdqu %xmm9,16(%rsp) + vmovdqu 48(%rsp),%xmm6 + vmovdqu 16-32(%r9),%xmm0 + vpunpckhqdq %xmm6,%xmm6,%xmm2 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm5 + vpxor %xmm6,%xmm2,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + + vmovdqu 64(%rsp),%xmm9 + vpclmulqdq 
$0x00,%xmm0,%xmm6,%xmm4 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm9,%xmm9,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm6,%xmm6 + vpxor %xmm9,%xmm5,%xmm5 + vpxor %xmm7,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vmovdqu 80(%rsp),%xmm1 + vpclmulqdq $0x00,%xmm3,%xmm9,%xmm7 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm4,%xmm7,%xmm7 + vpunpckhqdq %xmm1,%xmm1,%xmm4 + vpclmulqdq $0x11,%xmm3,%xmm9,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpxor %xmm6,%xmm9,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm5,%xmm5 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 96(%rsp),%xmm2 + vpclmulqdq $0x00,%xmm0,%xmm1,%xmm6 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm7,%xmm6,%xmm6 + vpunpckhqdq %xmm2,%xmm2,%xmm7 + vpclmulqdq $0x11,%xmm0,%xmm1,%xmm1 + vpxor %xmm2,%xmm7,%xmm7 + vpxor %xmm9,%xmm1,%xmm1 + vpclmulqdq $0x10,%xmm15,%xmm4,%xmm4 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm5,%xmm4,%xmm4 + + vpxor 112(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm3,%xmm2,%xmm5 + vmovdqu 112-32(%r9),%xmm0 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpxor %xmm6,%xmm5,%xmm5 + vpclmulqdq $0x11,%xmm3,%xmm2,%xmm2 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm1,%xmm2,%xmm2 + vpclmulqdq $0x00,%xmm15,%xmm7,%xmm7 + vpxor %xmm4,%xmm7,%xmm4 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm6 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm1 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm8 + vpxor %xmm14,%xmm1,%xmm1 + vpxor %xmm5,%xmm6,%xmm5 + vpclmulqdq $0x10,%xmm15,%xmm9,%xmm9 + vmovdqu 32-32(%r9),%xmm15 + vpxor %xmm2,%xmm8,%xmm7 + vpxor %xmm4,%xmm9,%xmm6 + + vmovdqu 16-32(%r9),%xmm0 + vpxor %xmm5,%xmm7,%xmm9 + vpclmulqdq $0x00,%xmm3,%xmm14,%xmm4 + vpxor %xmm9,%xmm6,%xmm6 + vpunpckhqdq %xmm13,%xmm13,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm14,%xmm14 + vpxor %xmm13,%xmm2,%xmm2 + vpslldq $8,%xmm6,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + vpxor %xmm9,%xmm5,%xmm8 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm6,%xmm7,%xmm7 + + vpclmulqdq $0x00,%xmm0,%xmm13,%xmm5 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm12,%xmm12,%xmm9 + vpclmulqdq $0x11,%xmm0,%xmm13,%xmm13 + vpxor %xmm12,%xmm9,%xmm9 + vpxor %xmm14,%xmm13,%xmm13 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm3,%xmm12,%xmm4 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm11,%xmm11,%xmm1 + vpclmulqdq $0x11,%xmm3,%xmm12,%xmm12 + vpxor %xmm11,%xmm1,%xmm1 + vpxor %xmm13,%xmm12,%xmm12 + vxorps 16(%rsp),%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm9,%xmm9 + + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm0,%xmm11,%xmm5 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm10,%xmm10,%xmm2 + vpclmulqdq $0x11,%xmm0,%xmm11,%xmm11 + vpxor %xmm10,%xmm2,%xmm2 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpxor %xmm12,%xmm11,%xmm11 + vpclmulqdq $0x10,%xmm15,%xmm1,%xmm1 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm9,%xmm1,%xmm1 + + vxorps %xmm7,%xmm14,%xmm14 + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm3,%xmm10,%xmm4 + vmovdqu 112-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpclmulqdq $0x11,%xmm3,%xmm10,%xmm10 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm11,%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm15,%xmm2,%xmm2 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm7 + vpxor %xmm4,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm15,%xmm9,%xmm6 + vpxor %xmm10,%xmm7,%xmm7 + vpxor %xmm2,%xmm6,%xmm6 
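+# (final reduction ahead: the accumulated 256-bit Karatsuba result is
+# folded back to 128 bits by two vpclmulqdq multiplies with the GHASH
+# polynomial constant at 16(%r11), i.e. L$poly)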
+ + vpxor %xmm5,%xmm7,%xmm4 + vpxor %xmm4,%xmm6,%xmm6 + vpslldq $8,%xmm6,%xmm1 + vmovdqu 16(%r11),%xmm3 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm1,%xmm5,%xmm8 + vpxor %xmm6,%xmm7,%xmm7 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm2,%xmm8,%xmm8 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm7,%xmm2,%xmm2 + vpxor %xmm2,%xmm8,%xmm8 + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 + + movq -40(%rax),%r14 + + movq -32(%rax),%r13 + + movq -24(%rax),%r12 + + movq -16(%rax),%rbp + + movq -8(%rax),%rbx + + leaq (%rax),%rsp + +L$gcm_enc_abort: + movq %r10,%rax + ret + + +.p2align 6 +L$bswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +L$poly: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +L$one_msb: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +L$two_lsb: +.byte 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +L$one_lsb: +.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.byte 65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108,101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.p2align 6 diff --git a/crypto/aesgcm/aesni_gcm_x64_nasm.asm b/crypto/aesgcm/aesni_gcm_x64_nasm.asm new file mode 100644 index 0000000..f3371e8 --- /dev/null +++ b/crypto/aesgcm/aesni_gcm_x64_nasm.asm @@ -0,0 +1,1023 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +section .text code align=64 + + + +ALIGN 32 +_aesni_ctr32_ghash_6x: + + vmovdqu xmm2,XMMWORD[32+r11] + sub rdx,6 + vpxor xmm4,xmm4,xmm4 + vmovdqu xmm15,XMMWORD[((0-128))+rcx] + vpaddb xmm10,xmm1,xmm2 + vpaddb xmm11,xmm10,xmm2 + vpaddb xmm12,xmm11,xmm2 + vpaddb xmm13,xmm12,xmm2 + vpaddb xmm14,xmm13,xmm2 + vpxor xmm9,xmm1,xmm15 + vmovdqu XMMWORD[(16+8)+rsp],xmm4 + jmp NEAR $L$oop6x + +ALIGN 32 +$L$oop6x: + add ebx,100663296 + jc NEAR $L$handle_ctr32 + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpaddb xmm1,xmm14,xmm2 + vpxor xmm10,xmm10,xmm15 + vpxor xmm11,xmm11,xmm15 + +$L$resume_ctr32: + vmovdqu XMMWORD[r8],xmm1 + vpclmulqdq xmm5,xmm7,xmm3,0x10 + vpxor xmm12,xmm12,xmm15 + vmovups xmm2,XMMWORD[((16-128))+rcx] + vpclmulqdq xmm6,xmm7,xmm3,0x01 + + + + + + + + + + + + + + + + + + xor r12,r12 + cmp r15,r14 + + vaesenc xmm9,xmm9,xmm2 + vmovdqu xmm0,XMMWORD[((48+8))+rsp] + vpxor xmm13,xmm13,xmm15 + vpclmulqdq xmm1,xmm7,xmm3,0x00 + vaesenc xmm10,xmm10,xmm2 + vpxor xmm14,xmm14,xmm15 + setnc r12b + vpclmulqdq xmm7,xmm7,xmm3,0x11 + vaesenc xmm11,xmm11,xmm2 + vmovdqu xmm3,XMMWORD[((16-32))+r9] + neg r12 + vaesenc xmm12,xmm12,xmm2 + vpxor xmm6,xmm6,xmm5 + vpclmulqdq xmm5,xmm0,xmm3,0x00 + vpxor xmm8,xmm8,xmm4 + vaesenc xmm13,xmm13,xmm2 + vpxor xmm4,xmm1,xmm5 + and r12,0x60 + vmovups xmm15,XMMWORD[((32-128))+rcx] + vpclmulqdq xmm1,xmm0,xmm3,0x10 + vaesenc xmm14,xmm14,xmm2 + + vpclmulqdq xmm2,xmm0,xmm3,0x01 + lea r14,[r12*1+r14] + vaesenc xmm9,xmm9,xmm15 + vpxor xmm8,xmm8,XMMWORD[((16+8))+rsp] + vpclmulqdq xmm3,xmm0,xmm3,0x11 + vmovdqu xmm0,XMMWORD[((64+8))+rsp] + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[88+r14] + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[80+r14] + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((32+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((40+8))+rsp],r12 + vmovdqu xmm5,XMMWORD[((48-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((48-128))+rcx] + vpxor xmm6,xmm6,xmm1 + vpclmulqdq xmm1,xmm0,xmm5,0x00 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm2 + vpclmulqdq xmm2,xmm0,xmm5,0x10 + vaesenc xmm10,xmm10,xmm15 + vpxor xmm7,xmm7,xmm3 + 
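+; (one pass of $L$oop6x interleaves a vaesenc round for all six counter
+; blocks with the vpclmulqdq work that hashes the previous six-block
+; group, hiding the 128-bit multiplier latency behind the AES rounds)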
vpclmulqdq xmm3,xmm0,xmm5,0x01 + vaesenc xmm11,xmm11,xmm15 + vpclmulqdq xmm5,xmm0,xmm5,0x11 + vmovdqu xmm0,XMMWORD[((80+8))+rsp] + vaesenc xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vpxor xmm4,xmm4,xmm1 + vmovdqu xmm1,XMMWORD[((64-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((64-128))+rcx] + vpxor xmm6,xmm6,xmm2 + vpclmulqdq xmm2,xmm0,xmm1,0x00 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm3 + vpclmulqdq xmm3,xmm0,xmm1,0x10 + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[72+r14] + vpxor xmm7,xmm7,xmm5 + vpclmulqdq xmm5,xmm0,xmm1,0x01 + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[64+r14] + vpclmulqdq xmm1,xmm0,xmm1,0x11 + vmovdqu xmm0,XMMWORD[((96+8))+rsp] + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((48+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((56+8))+rsp],r12 + vpxor xmm4,xmm4,xmm2 + vmovdqu xmm2,XMMWORD[((96-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((80-128))+rcx] + vpxor xmm6,xmm6,xmm3 + vpclmulqdq xmm3,xmm0,xmm2,0x00 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm5 + vpclmulqdq xmm5,xmm0,xmm2,0x10 + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[56+r14] + vpxor xmm7,xmm7,xmm1 + vpclmulqdq xmm1,xmm0,xmm2,0x01 + vpxor xmm8,xmm8,XMMWORD[((112+8))+rsp] + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[48+r14] + vpclmulqdq xmm2,xmm0,xmm2,0x11 + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((64+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((72+8))+rsp],r12 + vpxor xmm4,xmm4,xmm3 + vmovdqu xmm3,XMMWORD[((112-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((96-128))+rcx] + vpxor xmm6,xmm6,xmm5 + vpclmulqdq xmm5,xmm8,xmm3,0x10 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm1 + vpclmulqdq xmm1,xmm8,xmm3,0x01 + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[40+r14] + vpxor xmm7,xmm7,xmm2 + vpclmulqdq xmm2,xmm8,xmm3,0x00 + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[32+r14] + vpclmulqdq xmm8,xmm8,xmm3,0x11 + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((80+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((88+8))+rsp],r12 + vpxor xmm6,xmm6,xmm5 + vaesenc xmm14,xmm14,xmm15 + vpxor xmm6,xmm6,xmm1 + + vmovups xmm15,XMMWORD[((112-128))+rcx] + vpslldq xmm5,xmm6,8 + vpxor xmm4,xmm4,xmm2 + vmovdqu xmm3,XMMWORD[16+r11] + + vaesenc xmm9,xmm9,xmm15 + vpxor xmm7,xmm7,xmm8 + vaesenc xmm10,xmm10,xmm15 + vpxor xmm4,xmm4,xmm5 + movbe r13,QWORD[24+r14] + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[16+r14] + vpalignr xmm0,xmm4,xmm4,8 + vpclmulqdq xmm4,xmm4,xmm3,0x10 + mov QWORD[((96+8))+rsp],r13 + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((104+8))+rsp],r12 + vaesenc xmm13,xmm13,xmm15 + vmovups xmm1,XMMWORD[((128-128))+rcx] + vaesenc xmm14,xmm14,xmm15 + + vaesenc xmm9,xmm9,xmm1 + vmovups xmm15,XMMWORD[((144-128))+rcx] + vaesenc xmm10,xmm10,xmm1 + vpsrldq xmm6,xmm6,8 + vaesenc xmm11,xmm11,xmm1 + vpxor xmm7,xmm7,xmm6 + vaesenc xmm12,xmm12,xmm1 + vpxor xmm4,xmm4,xmm0 + movbe r13,QWORD[8+r14] + vaesenc xmm13,xmm13,xmm1 + movbe r12,QWORD[r14] + vaesenc xmm14,xmm14,xmm1 + vmovups xmm1,XMMWORD[((160-128))+rcx] + cmp ebp,11 + jb NEAR $L$enc_tail + + vaesenc xmm9,xmm9,xmm15 + vaesenc xmm10,xmm10,xmm15 + vaesenc xmm11,xmm11,xmm15 + vaesenc xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vaesenc xmm14,xmm14,xmm15 + + vaesenc xmm9,xmm9,xmm1 + vaesenc xmm10,xmm10,xmm1 + vaesenc xmm11,xmm11,xmm1 + vaesenc xmm12,xmm12,xmm1 + vaesenc xmm13,xmm13,xmm1 + vmovups xmm15,XMMWORD[((176-128))+rcx] + vaesenc xmm14,xmm14,xmm1 + vmovups xmm1,XMMWORD[((192-128))+rcx] + je NEAR $L$enc_tail + + vaesenc xmm9,xmm9,xmm15 + vaesenc xmm10,xmm10,xmm15 + vaesenc xmm11,xmm11,xmm15 + vaesenc 
xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vaesenc xmm14,xmm14,xmm15 + + vaesenc xmm9,xmm9,xmm1 + vaesenc xmm10,xmm10,xmm1 + vaesenc xmm11,xmm11,xmm1 + vaesenc xmm12,xmm12,xmm1 + vaesenc xmm13,xmm13,xmm1 + vmovups xmm15,XMMWORD[((208-128))+rcx] + vaesenc xmm14,xmm14,xmm1 + vmovups xmm1,XMMWORD[((224-128))+rcx] + jmp NEAR $L$enc_tail + +ALIGN 32 +$L$handle_ctr32: + vmovdqu xmm0,XMMWORD[r11] + vpshufb xmm6,xmm1,xmm0 + vmovdqu xmm5,XMMWORD[48+r11] + vpaddd xmm10,xmm6,XMMWORD[64+r11] + vpaddd xmm11,xmm6,xmm5 + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpaddd xmm12,xmm10,xmm5 + vpshufb xmm10,xmm10,xmm0 + vpaddd xmm13,xmm11,xmm5 + vpshufb xmm11,xmm11,xmm0 + vpxor xmm10,xmm10,xmm15 + vpaddd xmm14,xmm12,xmm5 + vpshufb xmm12,xmm12,xmm0 + vpxor xmm11,xmm11,xmm15 + vpaddd xmm1,xmm13,xmm5 + vpshufb xmm13,xmm13,xmm0 + vpshufb xmm14,xmm14,xmm0 + vpshufb xmm1,xmm1,xmm0 + jmp NEAR $L$resume_ctr32 + +ALIGN 32 +$L$enc_tail: + vaesenc xmm9,xmm9,xmm15 + vmovdqu XMMWORD[(16+8)+rsp],xmm7 + vpalignr xmm8,xmm4,xmm4,8 + vaesenc xmm10,xmm10,xmm15 + vpclmulqdq xmm4,xmm4,xmm3,0x10 + vpxor xmm2,xmm1,XMMWORD[rdi] + vaesenc xmm11,xmm11,xmm15 + vpxor xmm0,xmm1,XMMWORD[16+rdi] + vaesenc xmm12,xmm12,xmm15 + vpxor xmm5,xmm1,XMMWORD[32+rdi] + vaesenc xmm13,xmm13,xmm15 + vpxor xmm6,xmm1,XMMWORD[48+rdi] + vaesenc xmm14,xmm14,xmm15 + vpxor xmm7,xmm1,XMMWORD[64+rdi] + vpxor xmm3,xmm1,XMMWORD[80+rdi] + vmovdqu xmm1,XMMWORD[r8] + + vaesenclast xmm9,xmm9,xmm2 + vmovdqu xmm2,XMMWORD[32+r11] + vaesenclast xmm10,xmm10,xmm0 + vpaddb xmm0,xmm1,xmm2 + mov QWORD[((112+8))+rsp],r13 + lea rdi,[96+rdi] + vaesenclast xmm11,xmm11,xmm5 + vpaddb xmm5,xmm0,xmm2 + mov QWORD[((120+8))+rsp],r12 + lea rsi,[96+rsi] + vmovdqu xmm15,XMMWORD[((0-128))+rcx] + vaesenclast xmm12,xmm12,xmm6 + vpaddb xmm6,xmm5,xmm2 + vaesenclast xmm13,xmm13,xmm7 + vpaddb xmm7,xmm6,xmm2 + vaesenclast xmm14,xmm14,xmm3 + vpaddb xmm3,xmm7,xmm2 + + add r10,0x60 + sub rdx,0x6 + jc NEAR $L$6x_done + + vmovups XMMWORD[(-96)+rsi],xmm9 + vpxor xmm9,xmm1,xmm15 + vmovups XMMWORD[(-80)+rsi],xmm10 + vmovdqa xmm10,xmm0 + vmovups XMMWORD[(-64)+rsi],xmm11 + vmovdqa xmm11,xmm5 + vmovups XMMWORD[(-48)+rsi],xmm12 + vmovdqa xmm12,xmm6 + vmovups XMMWORD[(-32)+rsi],xmm13 + vmovdqa xmm13,xmm7 + vmovups XMMWORD[(-16)+rsi],xmm14 + vmovdqa xmm14,xmm3 + vmovdqu xmm7,XMMWORD[((32+8))+rsp] + jmp NEAR $L$oop6x + +$L$6x_done: + vpxor xmm8,xmm8,XMMWORD[((16+8))+rsp] + vpxor xmm8,xmm8,xmm4 + + ret + + +global aesni_gcm_decrypt + +ALIGN 32 +aesni_gcm_decrypt: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_aesni_gcm_decrypt: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + mov r9,QWORD[48+rsp] + + + + xor r10,r10 + + + + cmp rdx,0x60 + jb NEAR $L$gcm_dec_abort + + lea rax,[rsp] + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + + lea rsp,[((-168))+rsp] + movaps XMMWORD[(-216)+rax],xmm6 + movaps XMMWORD[(-200)+rax],xmm7 + movaps XMMWORD[(-184)+rax],xmm8 + movaps XMMWORD[(-168)+rax],xmm9 + movaps XMMWORD[(-152)+rax],xmm10 + movaps XMMWORD[(-136)+rax],xmm11 + movaps XMMWORD[(-120)+rax],xmm12 + movaps XMMWORD[(-104)+rax],xmm13 + movaps XMMWORD[(-88)+rax],xmm14 + movaps XMMWORD[(-72)+rax],xmm15 +$L$gcm_dec_body: + vzeroupper + + vmovdqu xmm1,XMMWORD[r8] + add rsp,-128 + mov ebx,DWORD[12+r8] + lea r11,[$L$bswap_mask] + lea r14,[((-128))+rcx] + mov r15,0xf80 + vmovdqu xmm8,XMMWORD[16+r8] + and rsp,-128 + vmovdqu xmm0,XMMWORD[r11] + lea rcx,[128+rcx] + lea r9,[((16+32))+r9] + mov ebp,DWORD[((240-128))+rcx] + vpshufb 
xmm8,xmm8,xmm0 + + and r14,r15 + and r15,rsp + sub r15,r14 + jc NEAR $L$dec_no_key_aliasing + cmp r15,768 + jnc NEAR $L$dec_no_key_aliasing + sub rsp,r15 +$L$dec_no_key_aliasing: + + vmovdqu xmm7,XMMWORD[80+rdi] + lea r14,[rdi] + vmovdqu xmm4,XMMWORD[64+rdi] + + + + + + + + lea r15,[((-192))+rdx*1+rdi] + + vmovdqu xmm5,XMMWORD[48+rdi] + shr rdx,4 + xor r10,r10 + vmovdqu xmm6,XMMWORD[32+rdi] + vpshufb xmm7,xmm7,xmm0 + vmovdqu xmm2,XMMWORD[16+rdi] + vpshufb xmm4,xmm4,xmm0 + vmovdqu xmm3,XMMWORD[rdi] + vpshufb xmm5,xmm5,xmm0 + vmovdqu XMMWORD[48+rsp],xmm4 + vpshufb xmm6,xmm6,xmm0 + vmovdqu XMMWORD[64+rsp],xmm5 + vpshufb xmm2,xmm2,xmm0 + vmovdqu XMMWORD[80+rsp],xmm6 + vpshufb xmm3,xmm3,xmm0 + vmovdqu XMMWORD[96+rsp],xmm2 + vmovdqu XMMWORD[112+rsp],xmm3 + + call _aesni_ctr32_ghash_6x + + vmovups XMMWORD[(-96)+rsi],xmm9 + vmovups XMMWORD[(-80)+rsi],xmm10 + vmovups XMMWORD[(-64)+rsi],xmm11 + vmovups XMMWORD[(-48)+rsi],xmm12 + vmovups XMMWORD[(-32)+rsi],xmm13 + vmovups XMMWORD[(-16)+rsi],xmm14 + + vpshufb xmm8,xmm8,XMMWORD[r11] + vmovdqu XMMWORD[16+r8],xmm8 + + vzeroupper + movaps xmm6,XMMWORD[((-216))+rax] + movaps xmm7,XMMWORD[((-200))+rax] + movaps xmm8,XMMWORD[((-184))+rax] + movaps xmm9,XMMWORD[((-168))+rax] + movaps xmm10,XMMWORD[((-152))+rax] + movaps xmm11,XMMWORD[((-136))+rax] + movaps xmm12,XMMWORD[((-120))+rax] + movaps xmm13,XMMWORD[((-104))+rax] + movaps xmm14,XMMWORD[((-88))+rax] + movaps xmm15,XMMWORD[((-72))+rax] + mov r15,QWORD[((-48))+rax] + + mov r14,QWORD[((-40))+rax] + + mov r13,QWORD[((-32))+rax] + + mov r12,QWORD[((-24))+rax] + + mov rbp,QWORD[((-16))+rax] + + mov rbx,QWORD[((-8))+rax] + + lea rsp,[rax] + +$L$gcm_dec_abort: + mov rax,r10 + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret + +$L$SEH_end_aesni_gcm_decrypt: + +ALIGN 32 +_aesni_ctr32_6x: + + vmovdqu xmm4,XMMWORD[((0-128))+rcx] + vmovdqu xmm2,XMMWORD[32+r11] + lea r13,[((-1))+rbp] + vmovups xmm15,XMMWORD[((16-128))+rcx] + lea r12,[((32-128))+rcx] + vpxor xmm9,xmm1,xmm4 + add ebx,100663296 + jc NEAR $L$handle_ctr32_2 + vpaddb xmm10,xmm1,xmm2 + vpaddb xmm11,xmm10,xmm2 + vpxor xmm10,xmm10,xmm4 + vpaddb xmm12,xmm11,xmm2 + vpxor xmm11,xmm11,xmm4 + vpaddb xmm13,xmm12,xmm2 + vpxor xmm12,xmm12,xmm4 + vpaddb xmm14,xmm13,xmm2 + vpxor xmm13,xmm13,xmm4 + vpaddb xmm1,xmm14,xmm2 + vpxor xmm14,xmm14,xmm4 + jmp NEAR $L$oop_ctr32 + +ALIGN 16 +$L$oop_ctr32: + vaesenc xmm9,xmm9,xmm15 + vaesenc xmm10,xmm10,xmm15 + vaesenc xmm11,xmm11,xmm15 + vaesenc xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vaesenc xmm14,xmm14,xmm15 + vmovups xmm15,XMMWORD[r12] + lea r12,[16+r12] + dec r13d + jnz NEAR $L$oop_ctr32 + + vmovdqu xmm3,XMMWORD[r12] + vaesenc xmm9,xmm9,xmm15 + vpxor xmm4,xmm3,XMMWORD[rdi] + vaesenc xmm10,xmm10,xmm15 + vpxor xmm5,xmm3,XMMWORD[16+rdi] + vaesenc xmm11,xmm11,xmm15 + vpxor xmm6,xmm3,XMMWORD[32+rdi] + vaesenc xmm12,xmm12,xmm15 + vpxor xmm8,xmm3,XMMWORD[48+rdi] + vaesenc xmm13,xmm13,xmm15 + vpxor xmm2,xmm3,XMMWORD[64+rdi] + vaesenc xmm14,xmm14,xmm15 + vpxor xmm3,xmm3,XMMWORD[80+rdi] + lea rdi,[96+rdi] + + vaesenclast xmm9,xmm9,xmm4 + vaesenclast xmm10,xmm10,xmm5 + vaesenclast xmm11,xmm11,xmm6 + vaesenclast xmm12,xmm12,xmm8 + vaesenclast xmm13,xmm13,xmm2 + vaesenclast xmm14,xmm14,xmm3 + vmovups XMMWORD[rsi],xmm9 + vmovups XMMWORD[16+rsi],xmm10 + vmovups XMMWORD[32+rsi],xmm11 + vmovups XMMWORD[48+rsi],xmm12 + vmovups XMMWORD[64+rsi],xmm13 + vmovups XMMWORD[80+rsi],xmm14 + lea rsi,[96+rsi] + + ret +ALIGN 32 +$L$handle_ctr32_2: + vpshufb xmm6,xmm1,xmm0 + vmovdqu xmm5,XMMWORD[48+r11] + vpaddd 
xmm10,xmm6,XMMWORD[64+r11] + vpaddd xmm11,xmm6,xmm5 + vpaddd xmm12,xmm10,xmm5 + vpshufb xmm10,xmm10,xmm0 + vpaddd xmm13,xmm11,xmm5 + vpshufb xmm11,xmm11,xmm0 + vpxor xmm10,xmm10,xmm4 + vpaddd xmm14,xmm12,xmm5 + vpshufb xmm12,xmm12,xmm0 + vpxor xmm11,xmm11,xmm4 + vpaddd xmm1,xmm13,xmm5 + vpshufb xmm13,xmm13,xmm0 + vpxor xmm12,xmm12,xmm4 + vpshufb xmm14,xmm14,xmm0 + vpxor xmm13,xmm13,xmm4 + vpshufb xmm1,xmm1,xmm0 + vpxor xmm14,xmm14,xmm4 + jmp NEAR $L$oop_ctr32 + + + +global aesni_gcm_encrypt + +ALIGN 32 +aesni_gcm_encrypt: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_aesni_gcm_encrypt: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + mov r9,QWORD[48+rsp] + + + + xor r10,r10 + + + + + cmp rdx,0x60*3 + jb NEAR $L$gcm_enc_abort + + lea rax,[rsp] + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + + lea rsp,[((-168))+rsp] + movaps XMMWORD[(-216)+rax],xmm6 + movaps XMMWORD[(-200)+rax],xmm7 + movaps XMMWORD[(-184)+rax],xmm8 + movaps XMMWORD[(-168)+rax],xmm9 + movaps XMMWORD[(-152)+rax],xmm10 + movaps XMMWORD[(-136)+rax],xmm11 + movaps XMMWORD[(-120)+rax],xmm12 + movaps XMMWORD[(-104)+rax],xmm13 + movaps XMMWORD[(-88)+rax],xmm14 + movaps XMMWORD[(-72)+rax],xmm15 +$L$gcm_enc_body: + vzeroupper + + vmovdqu xmm1,XMMWORD[r8] + add rsp,-128 + mov ebx,DWORD[12+r8] + lea r11,[$L$bswap_mask] + lea r14,[((-128))+rcx] + mov r15,0xf80 + lea rcx,[128+rcx] + vmovdqu xmm0,XMMWORD[r11] + and rsp,-128 + mov ebp,DWORD[((240-128))+rcx] + + and r14,r15 + and r15,rsp + sub r15,r14 + jc NEAR $L$enc_no_key_aliasing + cmp r15,768 + jnc NEAR $L$enc_no_key_aliasing + sub rsp,r15 +$L$enc_no_key_aliasing: + + lea r14,[rsi] + + + + + + + + + lea r15,[((-192))+rdx*1+rsi] + + shr rdx,4 + + call _aesni_ctr32_6x + + vpshufb xmm8,xmm9,xmm0 + vpshufb xmm2,xmm10,xmm0 + vmovdqu XMMWORD[112+rsp],xmm8 + vpshufb xmm4,xmm11,xmm0 + vmovdqu XMMWORD[96+rsp],xmm2 + vpshufb xmm5,xmm12,xmm0 + vmovdqu XMMWORD[80+rsp],xmm4 + vpshufb xmm6,xmm13,xmm0 + vmovdqu XMMWORD[64+rsp],xmm5 + vpshufb xmm7,xmm14,xmm0 + vmovdqu XMMWORD[48+rsp],xmm6 + + call _aesni_ctr32_6x + + vmovdqu xmm8,XMMWORD[16+r8] + lea r9,[((16+32))+r9] + sub rdx,12 + mov r10,0x60*2 + vpshufb xmm8,xmm8,xmm0 + + call _aesni_ctr32_ghash_6x + vmovdqu xmm7,XMMWORD[32+rsp] + vmovdqu xmm0,XMMWORD[r11] + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpunpckhqdq xmm1,xmm7,xmm7 + vmovdqu xmm15,XMMWORD[((32-32))+r9] + vmovups XMMWORD[(-96)+rsi],xmm9 + vpshufb xmm9,xmm9,xmm0 + vpxor xmm1,xmm1,xmm7 + vmovups XMMWORD[(-80)+rsi],xmm10 + vpshufb xmm10,xmm10,xmm0 + vmovups XMMWORD[(-64)+rsi],xmm11 + vpshufb xmm11,xmm11,xmm0 + vmovups XMMWORD[(-48)+rsi],xmm12 + vpshufb xmm12,xmm12,xmm0 + vmovups XMMWORD[(-32)+rsi],xmm13 + vpshufb xmm13,xmm13,xmm0 + vmovups XMMWORD[(-16)+rsi],xmm14 + vpshufb xmm14,xmm14,xmm0 + vmovdqu XMMWORD[16+rsp],xmm9 + vmovdqu xmm6,XMMWORD[48+rsp] + vmovdqu xmm0,XMMWORD[((16-32))+r9] + vpunpckhqdq xmm2,xmm6,xmm6 + vpclmulqdq xmm5,xmm7,xmm3,0x00 + vpxor xmm2,xmm2,xmm6 + vpclmulqdq xmm7,xmm7,xmm3,0x11 + vpclmulqdq xmm1,xmm1,xmm15,0x00 + + vmovdqu xmm9,XMMWORD[64+rsp] + vpclmulqdq xmm4,xmm6,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((48-32))+r9] + vpxor xmm4,xmm4,xmm5 + vpunpckhqdq xmm5,xmm9,xmm9 + vpclmulqdq xmm6,xmm6,xmm0,0x11 + vpxor xmm5,xmm5,xmm9 + vpxor xmm6,xmm6,xmm7 + vpclmulqdq xmm2,xmm2,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((80-32))+r9] + vpxor xmm2,xmm2,xmm1 + + vmovdqu xmm1,XMMWORD[80+rsp] + vpclmulqdq xmm7,xmm9,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((64-32))+r9] + vpxor xmm7,xmm7,xmm4 + 
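+; tail GHASH: fold the final ciphertext blocks, still held in registers,
+; into the hash by multiplying each with its precomputed power of H from
+; the table at r9 (vpunpckhqdq forms the hi^lo halves for the middle
+; Karatsuba term)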
vpunpckhqdq xmm4,xmm1,xmm1 + vpclmulqdq xmm9,xmm9,xmm3,0x11 + vpxor xmm4,xmm4,xmm1 + vpxor xmm9,xmm9,xmm6 + vpclmulqdq xmm5,xmm5,xmm15,0x00 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm2,XMMWORD[96+rsp] + vpclmulqdq xmm6,xmm1,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((96-32))+r9] + vpxor xmm6,xmm6,xmm7 + vpunpckhqdq xmm7,xmm2,xmm2 + vpclmulqdq xmm1,xmm1,xmm0,0x11 + vpxor xmm7,xmm7,xmm2 + vpxor xmm1,xmm1,xmm9 + vpclmulqdq xmm4,xmm4,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((128-32))+r9] + vpxor xmm4,xmm4,xmm5 + + vpxor xmm8,xmm8,XMMWORD[112+rsp] + vpclmulqdq xmm5,xmm2,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((112-32))+r9] + vpunpckhqdq xmm9,xmm8,xmm8 + vpxor xmm5,xmm5,xmm6 + vpclmulqdq xmm2,xmm2,xmm3,0x11 + vpxor xmm9,xmm9,xmm8 + vpxor xmm2,xmm2,xmm1 + vpclmulqdq xmm7,xmm7,xmm15,0x00 + vpxor xmm4,xmm7,xmm4 + + vpclmulqdq xmm6,xmm8,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpunpckhqdq xmm1,xmm14,xmm14 + vpclmulqdq xmm8,xmm8,xmm0,0x11 + vpxor xmm1,xmm1,xmm14 + vpxor xmm5,xmm6,xmm5 + vpclmulqdq xmm9,xmm9,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((32-32))+r9] + vpxor xmm7,xmm8,xmm2 + vpxor xmm6,xmm9,xmm4 + + vmovdqu xmm0,XMMWORD[((16-32))+r9] + vpxor xmm9,xmm7,xmm5 + vpclmulqdq xmm4,xmm14,xmm3,0x00 + vpxor xmm6,xmm6,xmm9 + vpunpckhqdq xmm2,xmm13,xmm13 + vpclmulqdq xmm14,xmm14,xmm3,0x11 + vpxor xmm2,xmm2,xmm13 + vpslldq xmm9,xmm6,8 + vpclmulqdq xmm1,xmm1,xmm15,0x00 + vpxor xmm8,xmm5,xmm9 + vpsrldq xmm6,xmm6,8 + vpxor xmm7,xmm7,xmm6 + + vpclmulqdq xmm5,xmm13,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((48-32))+r9] + vpxor xmm5,xmm5,xmm4 + vpunpckhqdq xmm9,xmm12,xmm12 + vpclmulqdq xmm13,xmm13,xmm0,0x11 + vpxor xmm9,xmm9,xmm12 + vpxor xmm13,xmm13,xmm14 + vpalignr xmm14,xmm8,xmm8,8 + vpclmulqdq xmm2,xmm2,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((80-32))+r9] + vpxor xmm2,xmm2,xmm1 + + vpclmulqdq xmm4,xmm12,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((64-32))+r9] + vpxor xmm4,xmm4,xmm5 + vpunpckhqdq xmm1,xmm11,xmm11 + vpclmulqdq xmm12,xmm12,xmm3,0x11 + vpxor xmm1,xmm1,xmm11 + vpxor xmm12,xmm12,xmm13 + vxorps xmm7,xmm7,XMMWORD[16+rsp] + vpclmulqdq xmm9,xmm9,xmm15,0x00 + vpxor xmm9,xmm9,xmm2 + + vpclmulqdq xmm8,xmm8,XMMWORD[16+r11],0x10 + vxorps xmm8,xmm8,xmm14 + + vpclmulqdq xmm5,xmm11,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((96-32))+r9] + vpxor xmm5,xmm5,xmm4 + vpunpckhqdq xmm2,xmm10,xmm10 + vpclmulqdq xmm11,xmm11,xmm0,0x11 + vpxor xmm2,xmm2,xmm10 + vpalignr xmm14,xmm8,xmm8,8 + vpxor xmm11,xmm11,xmm12 + vpclmulqdq xmm1,xmm1,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((128-32))+r9] + vpxor xmm1,xmm1,xmm9 + + vxorps xmm14,xmm14,xmm7 + vpclmulqdq xmm8,xmm8,XMMWORD[16+r11],0x10 + vxorps xmm8,xmm8,xmm14 + + vpclmulqdq xmm4,xmm10,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((112-32))+r9] + vpxor xmm4,xmm4,xmm5 + vpunpckhqdq xmm9,xmm8,xmm8 + vpclmulqdq xmm10,xmm10,xmm3,0x11 + vpxor xmm9,xmm9,xmm8 + vpxor xmm10,xmm10,xmm11 + vpclmulqdq xmm2,xmm2,xmm15,0x00 + vpxor xmm2,xmm2,xmm1 + + vpclmulqdq xmm5,xmm8,xmm0,0x00 + vpclmulqdq xmm7,xmm8,xmm0,0x11 + vpxor xmm5,xmm5,xmm4 + vpclmulqdq xmm6,xmm9,xmm15,0x10 + vpxor xmm7,xmm7,xmm10 + vpxor xmm6,xmm6,xmm2 + + vpxor xmm4,xmm7,xmm5 + vpxor xmm6,xmm6,xmm4 + vpslldq xmm1,xmm6,8 + vmovdqu xmm3,XMMWORD[16+r11] + vpsrldq xmm6,xmm6,8 + vpxor xmm8,xmm5,xmm1 + vpxor xmm7,xmm7,xmm6 + + vpalignr xmm2,xmm8,xmm8,8 + vpclmulqdq xmm8,xmm8,xmm3,0x10 + vpxor xmm8,xmm8,xmm2 + + vpalignr xmm2,xmm8,xmm8,8 + vpclmulqdq xmm8,xmm8,xmm3,0x10 + vpxor xmm2,xmm2,xmm7 + vpxor xmm8,xmm8,xmm2 + vpshufb xmm8,xmm8,XMMWORD[r11] + vmovdqu XMMWORD[16+r8],xmm8 + + vzeroupper + movaps xmm6,XMMWORD[((-216))+rax] + movaps xmm7,XMMWORD[((-200))+rax] + movaps 
xmm8,XMMWORD[((-184))+rax] + movaps xmm9,XMMWORD[((-168))+rax] + movaps xmm10,XMMWORD[((-152))+rax] + movaps xmm11,XMMWORD[((-136))+rax] + movaps xmm12,XMMWORD[((-120))+rax] + movaps xmm13,XMMWORD[((-104))+rax] + movaps xmm14,XMMWORD[((-88))+rax] + movaps xmm15,XMMWORD[((-72))+rax] + mov r15,QWORD[((-48))+rax] + + mov r14,QWORD[((-40))+rax] + + mov r13,QWORD[((-32))+rax] + + mov r12,QWORD[((-24))+rax] + + mov rbp,QWORD[((-16))+rax] + + mov rbx,QWORD[((-8))+rax] + + lea rsp,[rax] + +$L$gcm_enc_abort: + mov rax,r10 + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret + +$L$SEH_end_aesni_gcm_encrypt: +ALIGN 64 +$L$bswap_mask: +DB 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +$L$poly: +DB 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +$L$one_msb: +DB 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +$L$two_lsb: +DB 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +$L$one_lsb: +DB 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +DB 65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108 +DB 101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82 +DB 89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112 +DB 114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +ALIGN 64 +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +gcm_se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + mov rax,QWORD[120+r8] + + mov r15,QWORD[((-48))+rax] + mov r14,QWORD[((-40))+rax] + mov r13,QWORD[((-32))+rax] + mov r12,QWORD[((-24))+rax] + mov rbp,QWORD[((-16))+rax] + mov rbx,QWORD[((-8))+rax] + mov QWORD[240+r8],r15 + mov QWORD[232+r8],r14 + mov QWORD[224+r8],r13 + mov QWORD[216+r8],r12 + mov QWORD[160+r8],rbp + mov QWORD[144+r8],rbx + + lea rsi,[((-216))+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + ret + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_aesni_gcm_decrypt wrt ..imagebase + DD $L$SEH_end_aesni_gcm_decrypt wrt ..imagebase + DD $L$SEH_gcm_dec_info wrt ..imagebase + + DD $L$SEH_begin_aesni_gcm_encrypt wrt ..imagebase + DD $L$SEH_end_aesni_gcm_encrypt wrt ..imagebase + DD $L$SEH_gcm_enc_info wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_gcm_dec_info: +DB 9,0,0,0 + DD gcm_se_handler wrt ..imagebase + DD $L$gcm_dec_body wrt ..imagebase,$L$gcm_dec_abort wrt ..imagebase +$L$SEH_gcm_enc_info: +DB 9,0,0,0 + DD gcm_se_handler wrt ..imagebase + DD $L$gcm_enc_body wrt ..imagebase,$L$gcm_enc_abort wrt ..imagebase diff --git a/crypto/aesgcm/aesni_x64_gas.s b/crypto/aesgcm/aesni_x64_gas.s new file mode 100644 index 0000000..a1cd80b --- /dev/null +++ b/crypto/aesgcm/aesni_x64_gas.s @@ -0,0 +1,1510 @@ +.text +.globl aesni_encrypt +.type aesni_encrypt,@function +.align 16 
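+# Single-block AES-NI primitives follow. The .byte runs throughout this
+# file are hand-encoded AES-NI instructions (e.g. 102,15,56,220,209 is
+# aesenc %xmm1,%xmm2), emitted as raw bytes so the file assembles even
+# with toolchains that lack the mnemonics.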
+aesni_encrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +.Loop_enc1_1: +.byte 102,15,56,220,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz .Loop_enc1_1 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret +.size aesni_encrypt,.-aesni_encrypt + +.globl aesni_decrypt +.type aesni_decrypt,@function +.align 16 +aesni_decrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +.Loop_dec1_2: +.byte 102,15,56,222,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz .Loop_dec1_2 +.byte 102,15,56,223,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret +.size aesni_decrypt, .-aesni_decrypt +.type _aesni_encrypt2,@function +.align 16 +_aesni_encrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Lenc_loop2: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop2 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 + ret +.size _aesni_encrypt2,.-_aesni_encrypt2 +.type _aesni_decrypt2,@function +.align 16 +_aesni_decrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Ldec_loop2: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop2 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 + ret +.size _aesni_decrypt2,.-_aesni_decrypt2 +.type _aesni_encrypt3,@function +.align 16 +_aesni_encrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Lenc_loop3: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop3 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 + ret +.size _aesni_encrypt3,.-_aesni_encrypt3 +.type _aesni_decrypt3,@function +.align 16 +_aesni_decrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Ldec_loop3: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop3 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 + ret +.size _aesni_decrypt3,.-_aesni_decrypt3 +.type 
_aesni_encrypt4,@function +.align 16 +_aesni_encrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +.Lenc_loop4: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop4 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 + ret +.size _aesni_encrypt4,.-_aesni_encrypt4 +.type _aesni_decrypt4,@function +.align 16 +_aesni_decrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +.Ldec_loop4: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop4 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 + ret +.size _aesni_decrypt4,.-_aesni_decrypt4 +.type _aesni_encrypt6,@function +.align 16 +_aesni_encrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,220,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,220,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Lenc_loop6_enter +.align 16 +.Lenc_loop6: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.Lenc_loop6_enter: +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop6 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 + ret +.size _aesni_encrypt6,.-_aesni_encrypt6 +.type _aesni_decrypt6,@function +.align 16 +_aesni_decrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,222,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,222,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Ldec_loop6_enter +.align 16 +.Ldec_loop6: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.Ldec_loop6_enter: +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 + 
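+# two aesdec rounds per iteration: the keys alternate between %xmm1
+# (reloaded below) and %xmm0, with %rax stepping through the schedule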
movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop6 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 + ret +.size _aesni_decrypt6,.-_aesni_decrypt6 +.type _aesni_encrypt8,@function +.align 16 +_aesni_encrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,220,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Lenc_loop8_inner +.align 16 +.Lenc_loop8: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.Lenc_loop8_inner: +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +.Lenc_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop8 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 +.byte 102,68,15,56,221,192 +.byte 102,68,15,56,221,200 + ret +.size _aesni_encrypt8,.-_aesni_encrypt8 +.type _aesni_decrypt8,@function +.align 16 +_aesni_decrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,222,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Ldec_loop8_inner +.align 16 +.Ldec_loop8: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.Ldec_loop8_inner: +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +.Ldec_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 +.byte 102,68,15,56,222,192 +.byte 102,68,15,56,222,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop8 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 +.byte 102,68,15,56,223,192 +.byte 102,68,15,56,223,200 + ret +.size _aesni_decrypt8,.-_aesni_decrypt8 +.globl 
aesni_ctr32_encrypt_blocks +.type aesni_ctr32_encrypt_blocks,@function +.align 16 +aesni_ctr32_encrypt_blocks: +.cfi_startproc + cmpq $1,%rdx + jne .Lctr32_bulk + + + + movups (%r8),%xmm2 + movups (%rdi),%xmm3 + movl 240(%rcx),%edx + movups (%rcx),%xmm0 + movups 16(%rcx),%xmm1 + leaq 32(%rcx),%rcx + xorps %xmm0,%xmm2 +.Loop_enc1_3: +.byte 102,15,56,220,209 + decl %edx + movups (%rcx),%xmm1 + leaq 16(%rcx),%rcx + jnz .Loop_enc1_3 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + xorps %xmm3,%xmm2 + pxor %xmm3,%xmm3 + movups %xmm2,(%rsi) + xorps %xmm2,%xmm2 + jmp .Lctr32_epilogue + +.align 16 +.Lctr32_bulk: + leaq (%rsp),%r11 +.cfi_def_cfa_register %r11 + pushq %rbp +.cfi_offset %rbp,-16 + subq $128,%rsp + andq $-16,%rsp + + + + + movdqu (%r8),%xmm2 + movdqu (%rcx),%xmm0 + movl 12(%r8),%r8d + pxor %xmm0,%xmm2 + movl 12(%rcx),%ebp + movdqa %xmm2,0(%rsp) + bswapl %r8d + movdqa %xmm2,%xmm3 + movdqa %xmm2,%xmm4 + movdqa %xmm2,%xmm5 + movdqa %xmm2,64(%rsp) + movdqa %xmm2,80(%rsp) + movdqa %xmm2,96(%rsp) + movq %rdx,%r10 + movdqa %xmm2,112(%rsp) + + leaq 1(%r8),%rax + leaq 2(%r8),%rdx + bswapl %eax + bswapl %edx + xorl %ebp,%eax + xorl %ebp,%edx +.byte 102,15,58,34,216,3 + leaq 3(%r8),%rax + movdqa %xmm3,16(%rsp) +.byte 102,15,58,34,226,3 + bswapl %eax + movq %r10,%rdx + leaq 4(%r8),%r10 + movdqa %xmm4,32(%rsp) + xorl %ebp,%eax + bswapl %r10d +.byte 102,15,58,34,232,3 + xorl %ebp,%r10d + movdqa %xmm5,48(%rsp) + leaq 5(%r8),%r9 + movl %r10d,64+12(%rsp) + bswapl %r9d + leaq 6(%r8),%r10 + movl 240(%rcx),%eax + xorl %ebp,%r9d + bswapl %r10d + movl %r9d,80+12(%rsp) + xorl %ebp,%r10d + leaq 7(%r8),%r9 + movl %r10d,96+12(%rsp) + bswapl %r9d + + + xorl %ebp,%r9d + + movl %r9d,112+12(%rsp) + + movups 16(%rcx),%xmm1 + + movdqa 64(%rsp),%xmm6 + movdqa 80(%rsp),%xmm7 + + cmpq $8,%rdx + jb .Lctr32_tail + + subq $6,%rdx + + + + leaq 128(%rcx),%rcx + subq $2,%rdx + jmp .Lctr32_loop8 + + + + + + + + + + +.align 16 +.Lctr32_loop6: + addl $6,%r8d + movups -48(%rcx,%r10,1),%xmm0 +.byte 102,15,56,220,209 + movl %r8d,%eax + xorl %ebp,%eax +.byte 102,15,56,220,217 +.byte 0x0f,0x38,0xf1,0x44,0x24,12 + leal 1(%r8),%eax +.byte 102,15,56,220,225 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,28 +.byte 102,15,56,220,233 + leal 2(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,241 +.byte 0x0f,0x38,0xf1,0x44,0x24,44 + leal 3(%r8),%eax +.byte 102,15,56,220,249 + movups -32(%rcx,%r10,1),%xmm1 + xorl %ebp,%eax + +.byte 102,15,56,220,208 +.byte 0x0f,0x38,0xf1,0x44,0x24,60 + leal 4(%r8),%eax +.byte 102,15,56,220,216 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,76 +.byte 102,15,56,220,224 + leal 5(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,232 +.byte 0x0f,0x38,0xf1,0x44,0x24,92 + movq %r10,%rax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%r10,1),%xmm0 + + call .Lenc_loop6 + + movdqu (%rdi),%xmm8 + movdqu 16(%rdi),%xmm9 + movdqu 32(%rdi),%xmm10 + movdqu 48(%rdi),%xmm11 + movdqu 64(%rdi),%xmm12 + movdqu 80(%rdi),%xmm13 + leaq 96(%rdi),%rdi + movups -64(%rcx,%r10,1),%xmm1 + pxor %xmm2,%xmm8 + movaps 0(%rsp),%xmm2 + pxor %xmm3,%xmm9 + movaps 16(%rsp),%xmm3 + pxor %xmm4,%xmm10 + movaps 32(%rsp),%xmm4 + pxor %xmm5,%xmm11 + movaps 48(%rsp),%xmm5 + pxor %xmm6,%xmm12 + movaps 64(%rsp),%xmm6 + pxor %xmm7,%xmm13 + movaps 80(%rsp),%xmm7 + movdqu %xmm8,(%rsi) + movdqu %xmm9,16(%rsi) + movdqu %xmm10,32(%rsi) + movdqu %xmm11,48(%rsi) + movdqu %xmm12,64(%rsi) + movdqu %xmm13,80(%rsi) + leaq 96(%rsi),%rsi + + subq $6,%rdx + jnc .Lctr32_loop6 + + addq $6,%rdx + jz .Lctr32_done + + leal -48(%r10),%eax 
+ leaq -80(%rcx,%r10,1),%rcx + negl %eax + shrl $4,%eax + jmp .Lctr32_tail + +.align 32 +.Lctr32_loop8: + addl $8,%r8d + movdqa 96(%rsp),%xmm8 +.byte 102,15,56,220,209 + movl %r8d,%r9d + movdqa 112(%rsp),%xmm9 +.byte 102,15,56,220,217 + bswapl %r9d + movups 32-128(%rcx),%xmm0 +.byte 102,15,56,220,225 + xorl %ebp,%r9d + nop +.byte 102,15,56,220,233 + movl %r9d,0+12(%rsp) + leaq 1(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 48-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,16+12(%rsp) + leaq 2(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 64-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,32+12(%rsp) + leaq 3(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 80-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,48+12(%rsp) + leaq 4(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 96-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,64+12(%rsp) + leaq 5(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 112-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,80+12(%rsp) + leaq 6(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 128-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,96+12(%rsp) + leaq 7(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 144-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + xorl %ebp,%r9d + movdqu 0(%rdi),%xmm10 +.byte 102,15,56,220,232 + movl %r9d,112+12(%rsp) + cmpl $11,%eax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 160-128(%rcx),%xmm0 + + jb .Lctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 176-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 192-128(%rcx),%xmm0 + je .Lctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 
208-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 224-128(%rcx),%xmm0 + jmp .Lctr32_enc_done + +.align 16 +.Lctr32_enc_done: + movdqu 16(%rdi),%xmm11 + pxor %xmm0,%xmm10 + movdqu 32(%rdi),%xmm12 + pxor %xmm0,%xmm11 + movdqu 48(%rdi),%xmm13 + pxor %xmm0,%xmm12 + movdqu 64(%rdi),%xmm14 + pxor %xmm0,%xmm13 + movdqu 80(%rdi),%xmm15 + pxor %xmm0,%xmm14 + pxor %xmm0,%xmm15 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movdqu 96(%rdi),%xmm1 + leaq 128(%rdi),%rdi + +.byte 102,65,15,56,221,210 + pxor %xmm0,%xmm1 + movdqu 112-128(%rdi),%xmm10 +.byte 102,65,15,56,221,219 + pxor %xmm0,%xmm10 + movdqa 0(%rsp),%xmm11 +.byte 102,65,15,56,221,228 +.byte 102,65,15,56,221,237 + movdqa 16(%rsp),%xmm12 + movdqa 32(%rsp),%xmm13 +.byte 102,65,15,56,221,246 +.byte 102,65,15,56,221,255 + movdqa 48(%rsp),%xmm14 + movdqa 64(%rsp),%xmm15 +.byte 102,68,15,56,221,193 + movdqa 80(%rsp),%xmm0 + movups 16-128(%rcx),%xmm1 +.byte 102,69,15,56,221,202 + + movups %xmm2,(%rsi) + movdqa %xmm11,%xmm2 + movups %xmm3,16(%rsi) + movdqa %xmm12,%xmm3 + movups %xmm4,32(%rsi) + movdqa %xmm13,%xmm4 + movups %xmm5,48(%rsi) + movdqa %xmm14,%xmm5 + movups %xmm6,64(%rsi) + movdqa %xmm15,%xmm6 + movups %xmm7,80(%rsi) + movdqa %xmm0,%xmm7 + movups %xmm8,96(%rsi) + movups %xmm9,112(%rsi) + leaq 128(%rsi),%rsi + + subq $8,%rdx + jnc .Lctr32_loop8 + + addq $8,%rdx + jz .Lctr32_done + leaq -128(%rcx),%rcx + +.Lctr32_tail: + + + leaq 16(%rcx),%rcx + cmpq $4,%rdx + jb .Lctr32_loop3 + je .Lctr32_loop4 + + + shll $4,%eax + movdqa 96(%rsp),%xmm8 + pxor %xmm9,%xmm9 + + movups 16(%rcx),%xmm0 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + leaq 32-16(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,225 + addq $16,%rax + movups (%rdi),%xmm10 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 + movups 16(%rdi),%xmm11 + movups 32(%rdi),%xmm12 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 + + call .Lenc_loop8_enter + + movdqu 48(%rdi),%xmm13 + pxor %xmm10,%xmm2 + movdqu 64(%rdi),%xmm10 + pxor %xmm11,%xmm3 + movdqu %xmm2,(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm3,16(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm4,32(%rsi) + pxor %xmm10,%xmm6 + movdqu %xmm5,48(%rsi) + movdqu %xmm6,64(%rsi) + cmpq $6,%rdx + jb .Lctr32_done + + movups 80(%rdi),%xmm11 + xorps %xmm11,%xmm7 + movups %xmm7,80(%rsi) + je .Lctr32_done + + movups 96(%rdi),%xmm12 + xorps %xmm12,%xmm8 + movups %xmm8,96(%rsi) + jmp .Lctr32_done + +.align 32 +.Lctr32_loop4: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx),%xmm1 + jnz .Lctr32_loop4 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 + movups (%rdi),%xmm10 + movups 16(%rdi),%xmm11 +.byte 102,15,56,221,225 +.byte 102,15,56,221,233 + movups 32(%rdi),%xmm12 + movups 48(%rdi),%xmm13 + + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm4,32(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm5,48(%rsi) + jmp .Lctr32_done + +.align 32 +.Lctr32_loop3: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx),%xmm1 + jnz .Lctr32_loop3 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 
+.byte 102,15,56,221,225 + + movups (%rdi),%xmm10 + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + cmpq $2,%rdx + jb .Lctr32_done + + movups 16(%rdi),%xmm11 + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + je .Lctr32_done + + movups 32(%rdi),%xmm12 + xorps %xmm12,%xmm4 + movups %xmm4,32(%rsi) + +.Lctr32_done: + xorps %xmm0,%xmm0 + xorl %ebp,%ebp + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0(%rsp) + pxor %xmm8,%xmm8 + movaps %xmm0,16(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,32(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,48(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,64(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,80(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,96(%rsp) + pxor %xmm14,%xmm14 + movaps %xmm0,112(%rsp) + pxor %xmm15,%xmm15 + movq -8(%r11),%rbp +.cfi_restore %rbp + leaq (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lctr32_epilogue: + ret +.cfi_endproc +.size aesni_ctr32_encrypt_blocks,.-aesni_ctr32_encrypt_blocks +.globl aesni_set_decrypt_key +.type aesni_set_decrypt_key,@function +.align 16 +aesni_set_decrypt_key: +.cfi_startproc +.byte 0x48,0x83,0xEC,0x08 +.cfi_adjust_cfa_offset 8 + call __aesni_set_encrypt_key + shll $4,%esi + testl %eax,%eax + jnz .Ldec_key_ret + leaq 16(%rdx,%rsi,1),%rdi + + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 + movups %xmm0,(%rdi) + movups %xmm1,(%rdx) + leaq 16(%rdx),%rdx + leaq -16(%rdi),%rdi + +.Ldec_key_inverse: + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 +.byte 102,15,56,219,192 +.byte 102,15,56,219,201 + leaq 16(%rdx),%rdx + leaq -16(%rdi),%rdi + movups %xmm0,16(%rdi) + movups %xmm1,-16(%rdx) + cmpq %rdx,%rdi + ja .Ldec_key_inverse + + movups (%rdx),%xmm0 +.byte 102,15,56,219,192 + pxor %xmm1,%xmm1 + movups %xmm0,(%rdi) + pxor %xmm0,%xmm0 +.Ldec_key_ret: + addq $8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_decrypt_key: +.size aesni_set_decrypt_key,.-aesni_set_decrypt_key +.globl aesni_set_encrypt_key +.type aesni_set_encrypt_key,@function +.align 16 +aesni_set_encrypt_key: +__aesni_set_encrypt_key: +.cfi_startproc +.byte 0x48,0x83,0xEC,0x08 +.cfi_adjust_cfa_offset 8 + movq $-1,%rax + testq %rdi,%rdi + jz .Lenc_key_ret + testq %rdx,%rdx + jz .Lenc_key_ret + + movups (%rdi),%xmm0 + xorps %xmm4,%xmm4 + + + + leaq 16(%rdx),%rax + cmpl $256,%esi + je .L14rounds + cmpl $192,%esi + je .L12rounds + cmpl $128,%esi + jne .Lbad_keybits + +.L10rounds: + movl $9,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa .Lkey_rotate(%rip),%xmm5 + movl $8,%r10d + movdqa .Lkey_rcon1(%rip),%xmm4 + movdqa %xmm0,%xmm2 + movdqu %xmm0,(%rdx) + jmp .Loop_key128 + +.align 16 +.Loop_key128: + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + leaq 16(%rax),%rax + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,-16(%rax) + movdqa %xmm0,%xmm2 + + decl %r10d + jnz .Loop_key128 + + movdqa .Lkey_rcon1b(%rip),%xmm4 + + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + movdqa %xmm0,%xmm2 + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,16(%rax) + + movl %esi,96(%rax) + 
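+# 128-bit schedule complete: %esi (9 rounds) is stored in the schedule's
+# rounds field above; clearing %eax signals success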
xorl %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L12rounds: + movq 16(%rdi),%xmm2 + movl $11,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa .Lkey_rotate192(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + movl $8,%r10d + movdqu %xmm0,(%rdx) + jmp .Loop_key192 + +.align 16 +.Loop_key192: + movq %xmm2,0(%rax) + movdqa %xmm2,%xmm1 + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + pslld $1,%xmm4 + leaq 24(%rax),%rax + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + + pshufd $0xff,%xmm0,%xmm3 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + + pxor %xmm2,%xmm0 + pxor %xmm3,%xmm2 + movdqu %xmm0,-16(%rax) + + decl %r10d + jnz .Loop_key192 + + movl %esi,32(%rax) + xorl %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L14rounds: + movups 16(%rdi),%xmm2 + movl $13,%esi + leaq 16(%rax),%rax + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa .Lkey_rotate(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + movl $7,%r10d + movdqu %xmm0,0(%rdx) + movdqa %xmm2,%xmm1 + movdqu %xmm2,16(%rdx) + jmp .Loop_key256 + +.align 16 +.Loop_key256: + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + pslld $1,%xmm4 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + decl %r10d + jz .Ldone_key256 + + pshufd $0xff,%xmm0,%xmm2 + pxor %xmm3,%xmm3 +.byte 102,15,56,221,211 + + movdqa %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm3,%xmm1 + + pxor %xmm1,%xmm2 + movdqu %xmm2,16(%rax) + leaq 32(%rax),%rax + movdqa %xmm2,%xmm1 + + jmp .Loop_key256 + +.Ldone_key256: + movl %esi,16(%rax) + xorl %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.Lbad_keybits: + movq $-2,%rax +.Lenc_key_ret: + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + addq $8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_encrypt_key: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.size aesni_set_encrypt_key,.-aesni_set_encrypt_key +.size __aesni_set_encrypt_key,.-__aesni_set_encrypt_key +.align 64 +.Lbswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lincrement32: +.long 6,6,6,0 +.Lincrement64: +.long 1,0,0,0 +.Lxts_magic: +.long 0x87,0,1,0 +.Lincrement1: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Lkey_rotate: +.long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +.Lkey_rotate192: +.long 0x04070605,0x04070605,0x04070605,0x04070605 +.Lkey_rcon1: +.long 1,1,1,1 +.Lkey_rcon1b: +.long 0x1b,0x1b,0x1b,0x1b + +.align 64 diff --git a/crypto/aesgcm/aesni_x64_gas_macosx.s b/crypto/aesgcm/aesni_x64_gas_macosx.s new file mode 100644 index 0000000..13e6806 --- /dev/null +++ b/crypto/aesgcm/aesni_x64_gas_macosx.s @@ -0,0 +1,1510 @@ +.text +.globl _aesni_encrypt + +.p2align 4 +_aesni_encrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +L$oop_enc1_1: +.byte 102,15,56,220,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz L$oop_enc1_1 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret + + +.globl _aesni_decrypt + +.p2align 4 +_aesni_decrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 
16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +L$oop_dec1_2: +.byte 102,15,56,222,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz L$oop_dec1_2 +.byte 102,15,56,223,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret + + +.p2align 4 +_aesni_encrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$enc_loop2: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop2 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 + ret + + +.p2align 4 +_aesni_decrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$dec_loop2: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop2 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 + ret + + +.p2align 4 +_aesni_encrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$enc_loop3: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop3 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 + ret + + +.p2align 4 +_aesni_decrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$dec_loop3: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop3 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 + ret + + +.p2align 4 +_aesni_encrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +L$enc_loop4: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop4 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 + ret + + +.p2align 4 +_aesni_decrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps 
%xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +L$dec_loop4: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop4 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 + ret + + +.p2align 4 +_aesni_encrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,220,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,220,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$enc_loop6_enter +.p2align 4 +L$enc_loop6: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +L$enc_loop6_enter: +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop6 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 + ret + + +.p2align 4 +_aesni_decrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,222,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,222,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$dec_loop6_enter +.p2align 4 +L$dec_loop6: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +L$dec_loop6_enter: +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop6 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 + ret + + +.p2align 4 +_aesni_encrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,220,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$enc_loop8_inner +.p2align 4 +L$enc_loop8: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +L$enc_loop8_inner: +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 
102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +L$enc_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop8 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 +.byte 102,68,15,56,221,192 +.byte 102,68,15,56,221,200 + ret + + +.p2align 4 +_aesni_decrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,222,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$dec_loop8_inner +.p2align 4 +L$dec_loop8: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +L$dec_loop8_inner: +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +L$dec_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 +.byte 102,68,15,56,222,192 +.byte 102,68,15,56,222,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop8 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 +.byte 102,68,15,56,223,192 +.byte 102,68,15,56,223,200 + ret + +.globl _aesni_ctr32_encrypt_blocks + +.p2align 4 +_aesni_ctr32_encrypt_blocks: + + cmpq $1,%rdx + jne L$ctr32_bulk + + + + movups (%r8),%xmm2 + movups (%rdi),%xmm3 + movl 240(%rcx),%edx + movups (%rcx),%xmm0 + movups 16(%rcx),%xmm1 + leaq 32(%rcx),%rcx + xorps %xmm0,%xmm2 +L$oop_enc1_3: +.byte 102,15,56,220,209 + decl %edx + movups (%rcx),%xmm1 + leaq 16(%rcx),%rcx + jnz L$oop_enc1_3 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + xorps %xmm3,%xmm2 + pxor %xmm3,%xmm3 + movups %xmm2,(%rsi) + xorps %xmm2,%xmm2 + jmp L$ctr32_epilogue + +.p2align 4 +L$ctr32_bulk: + leaq (%rsp),%r11 + + pushq %rbp + + subq $128,%rsp + andq $-16,%rsp + + + + + movdqu (%r8),%xmm2 + movdqu (%rcx),%xmm0 + movl 12(%r8),%r8d + pxor %xmm0,%xmm2 + movl 12(%rcx),%ebp + movdqa %xmm2,0(%rsp) + bswapl %r8d + movdqa %xmm2,%xmm3 + movdqa %xmm2,%xmm4 + movdqa %xmm2,%xmm5 + movdqa %xmm2,64(%rsp) + movdqa %xmm2,80(%rsp) + movdqa %xmm2,96(%rsp) + movq %rdx,%r10 + movdqa %xmm2,112(%rsp) + + leaq 1(%r8),%rax + leaq 2(%r8),%rdx + bswapl %eax + bswapl %edx + xorl %ebp,%eax + xorl %ebp,%edx +.byte 102,15,58,34,216,3 + leaq 3(%r8),%rax + movdqa %xmm3,16(%rsp) +.byte 102,15,58,34,226,3 + bswapl %eax + movq %r10,%rdx + leaq 4(%r8),%r10 + movdqa %xmm4,32(%rsp) + xorl %ebp,%eax + bswapl %r10d +.byte 102,15,58,34,232,3 + 
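+# each .byte 102,15,58,34,... is pinsrd, patching the next incremented,
+# byte-swapped counter word (pre-xored with the last word of round key 0,
+# kept in %ebp) into lane 3 of a stacked counter block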
xorl %ebp,%r10d + movdqa %xmm5,48(%rsp) + leaq 5(%r8),%r9 + movl %r10d,64+12(%rsp) + bswapl %r9d + leaq 6(%r8),%r10 + movl 240(%rcx),%eax + xorl %ebp,%r9d + bswapl %r10d + movl %r9d,80+12(%rsp) + xorl %ebp,%r10d + leaq 7(%r8),%r9 + movl %r10d,96+12(%rsp) + bswapl %r9d + + + xorl %ebp,%r9d + + movl %r9d,112+12(%rsp) + + movups 16(%rcx),%xmm1 + + movdqa 64(%rsp),%xmm6 + movdqa 80(%rsp),%xmm7 + + cmpq $8,%rdx + jb L$ctr32_tail + + subq $6,%rdx + + + + leaq 128(%rcx),%rcx + subq $2,%rdx + jmp L$ctr32_loop8 + + + + + + + + + + +.p2align 4 +L$ctr32_loop6: + addl $6,%r8d + movups -48(%rcx,%r10,1),%xmm0 +.byte 102,15,56,220,209 + movl %r8d,%eax + xorl %ebp,%eax +.byte 102,15,56,220,217 +.byte 0x0f,0x38,0xf1,0x44,0x24,12 + leal 1(%r8),%eax +.byte 102,15,56,220,225 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,28 +.byte 102,15,56,220,233 + leal 2(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,241 +.byte 0x0f,0x38,0xf1,0x44,0x24,44 + leal 3(%r8),%eax +.byte 102,15,56,220,249 + movups -32(%rcx,%r10,1),%xmm1 + xorl %ebp,%eax + +.byte 102,15,56,220,208 +.byte 0x0f,0x38,0xf1,0x44,0x24,60 + leal 4(%r8),%eax +.byte 102,15,56,220,216 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,76 +.byte 102,15,56,220,224 + leal 5(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,232 +.byte 0x0f,0x38,0xf1,0x44,0x24,92 + movq %r10,%rax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%r10,1),%xmm0 + + call L$enc_loop6 + + movdqu (%rdi),%xmm8 + movdqu 16(%rdi),%xmm9 + movdqu 32(%rdi),%xmm10 + movdqu 48(%rdi),%xmm11 + movdqu 64(%rdi),%xmm12 + movdqu 80(%rdi),%xmm13 + leaq 96(%rdi),%rdi + movups -64(%rcx,%r10,1),%xmm1 + pxor %xmm2,%xmm8 + movaps 0(%rsp),%xmm2 + pxor %xmm3,%xmm9 + movaps 16(%rsp),%xmm3 + pxor %xmm4,%xmm10 + movaps 32(%rsp),%xmm4 + pxor %xmm5,%xmm11 + movaps 48(%rsp),%xmm5 + pxor %xmm6,%xmm12 + movaps 64(%rsp),%xmm6 + pxor %xmm7,%xmm13 + movaps 80(%rsp),%xmm7 + movdqu %xmm8,(%rsi) + movdqu %xmm9,16(%rsi) + movdqu %xmm10,32(%rsi) + movdqu %xmm11,48(%rsi) + movdqu %xmm12,64(%rsi) + movdqu %xmm13,80(%rsi) + leaq 96(%rsi),%rsi + + subq $6,%rdx + jnc L$ctr32_loop6 + + addq $6,%rdx + jz L$ctr32_done + + leal -48(%r10),%eax + leaq -80(%rcx,%r10,1),%rcx + negl %eax + shrl $4,%eax + jmp L$ctr32_tail + +.p2align 5 +L$ctr32_loop8: + addl $8,%r8d + movdqa 96(%rsp),%xmm8 +.byte 102,15,56,220,209 + movl %r8d,%r9d + movdqa 112(%rsp),%xmm9 +.byte 102,15,56,220,217 + bswapl %r9d + movups 32-128(%rcx),%xmm0 +.byte 102,15,56,220,225 + xorl %ebp,%r9d + nop +.byte 102,15,56,220,233 + movl %r9d,0+12(%rsp) + leaq 1(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 48-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,16+12(%rsp) + leaq 2(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 64-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,32+12(%rsp) + leaq 3(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 80-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,48+12(%rsp) + leaq 4(%r8),%r9 +.byte 102,15,56,220,240 +.byte 
102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 96-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,64+12(%rsp) + leaq 5(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 112-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,80+12(%rsp) + leaq 6(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 128-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,96+12(%rsp) + leaq 7(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 144-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + xorl %ebp,%r9d + movdqu 0(%rdi),%xmm10 +.byte 102,15,56,220,232 + movl %r9d,112+12(%rsp) + cmpl $11,%eax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 160-128(%rcx),%xmm0 + + jb L$ctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 176-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 192-128(%rcx),%xmm0 + je L$ctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 208-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 224-128(%rcx),%xmm0 + jmp L$ctr32_enc_done + +.p2align 4 +L$ctr32_enc_done: + movdqu 16(%rdi),%xmm11 + pxor %xmm0,%xmm10 + movdqu 32(%rdi),%xmm12 + pxor %xmm0,%xmm11 + movdqu 48(%rdi),%xmm13 + pxor %xmm0,%xmm12 + movdqu 64(%rdi),%xmm14 + pxor %xmm0,%xmm13 + movdqu 80(%rdi),%xmm15 + pxor %xmm0,%xmm14 + pxor %xmm0,%xmm15 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movdqu 96(%rdi),%xmm1 + leaq 128(%rdi),%rdi + +.byte 102,65,15,56,221,210 + pxor %xmm0,%xmm1 + movdqu 112-128(%rdi),%xmm10 +.byte 102,65,15,56,221,219 + pxor %xmm0,%xmm10 + movdqa 0(%rsp),%xmm11 +.byte 102,65,15,56,221,228 +.byte 102,65,15,56,221,237 + movdqa 16(%rsp),%xmm12 + movdqa 32(%rsp),%xmm13 +.byte 102,65,15,56,221,246 +.byte 102,65,15,56,221,255 + movdqa 48(%rsp),%xmm14 + movdqa 64(%rsp),%xmm15 +.byte 102,68,15,56,221,193 + movdqa 80(%rsp),%xmm0 + movups 16-128(%rcx),%xmm1 +.byte 102,69,15,56,221,202 + + movups %xmm2,(%rsi) + movdqa %xmm11,%xmm2 + movups %xmm3,16(%rsi) + movdqa %xmm12,%xmm3 + movups %xmm4,32(%rsi) + movdqa 
%xmm13,%xmm4 + movups %xmm5,48(%rsi) + movdqa %xmm14,%xmm5 + movups %xmm6,64(%rsi) + movdqa %xmm15,%xmm6 + movups %xmm7,80(%rsi) + movdqa %xmm0,%xmm7 + movups %xmm8,96(%rsi) + movups %xmm9,112(%rsi) + leaq 128(%rsi),%rsi + + subq $8,%rdx + jnc L$ctr32_loop8 + + addq $8,%rdx + jz L$ctr32_done + leaq -128(%rcx),%rcx + +L$ctr32_tail: + + + leaq 16(%rcx),%rcx + cmpq $4,%rdx + jb L$ctr32_loop3 + je L$ctr32_loop4 + + + shll $4,%eax + movdqa 96(%rsp),%xmm8 + pxor %xmm9,%xmm9 + + movups 16(%rcx),%xmm0 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + leaq 32-16(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,225 + addq $16,%rax + movups (%rdi),%xmm10 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 + movups 16(%rdi),%xmm11 + movups 32(%rdi),%xmm12 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 + + call L$enc_loop8_enter + + movdqu 48(%rdi),%xmm13 + pxor %xmm10,%xmm2 + movdqu 64(%rdi),%xmm10 + pxor %xmm11,%xmm3 + movdqu %xmm2,(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm3,16(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm4,32(%rsi) + pxor %xmm10,%xmm6 + movdqu %xmm5,48(%rsi) + movdqu %xmm6,64(%rsi) + cmpq $6,%rdx + jb L$ctr32_done + + movups 80(%rdi),%xmm11 + xorps %xmm11,%xmm7 + movups %xmm7,80(%rsi) + je L$ctr32_done + + movups 96(%rdi),%xmm12 + xorps %xmm12,%xmm8 + movups %xmm8,96(%rsi) + jmp L$ctr32_done + +.p2align 5 +L$ctr32_loop4: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx),%xmm1 + jnz L$ctr32_loop4 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 + movups (%rdi),%xmm10 + movups 16(%rdi),%xmm11 +.byte 102,15,56,221,225 +.byte 102,15,56,221,233 + movups 32(%rdi),%xmm12 + movups 48(%rdi),%xmm13 + + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm4,32(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm5,48(%rsi) + jmp L$ctr32_done + +.p2align 5 +L$ctr32_loop3: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx),%xmm1 + jnz L$ctr32_loop3 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 +.byte 102,15,56,221,225 + + movups (%rdi),%xmm10 + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + cmpq $2,%rdx + jb L$ctr32_done + + movups 16(%rdi),%xmm11 + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + je L$ctr32_done + + movups 32(%rdi),%xmm12 + xorps %xmm12,%xmm4 + movups %xmm4,32(%rsi) + +L$ctr32_done: + xorps %xmm0,%xmm0 + xorl %ebp,%ebp + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0(%rsp) + pxor %xmm8,%xmm8 + movaps %xmm0,16(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,32(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,48(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,64(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,80(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,96(%rsp) + pxor %xmm14,%xmm14 + movaps %xmm0,112(%rsp) + pxor %xmm15,%xmm15 + movq -8(%r11),%rbp + + leaq (%r11),%rsp + +L$ctr32_epilogue: + ret + + +.globl _aesni_set_decrypt_key + +.p2align 4 +_aesni_set_decrypt_key: + +.byte 0x48,0x83,0xEC,0x08 + + call __aesni_set_encrypt_key + shll $4,%esi + testl %eax,%eax + jnz L$dec_key_ret + leaq 16(%rdx,%rsi,1),%rdi + + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 + movups %xmm0,(%rdi) + movups %xmm1,(%rdx) + leaq 16(%rdx),%rdx + leaq -16(%rdi),%rdi + +L$dec_key_inverse: + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 +.byte 102,15,56,219,192 +.byte 102,15,56,219,201 + leaq 16(%rdx),%rdx + leaq 
-16(%rdi),%rdi + movups %xmm0,16(%rdi) + movups %xmm1,-16(%rdx) + cmpq %rdx,%rdi + ja L$dec_key_inverse + + movups (%rdx),%xmm0 +.byte 102,15,56,219,192 + pxor %xmm1,%xmm1 + movups %xmm0,(%rdi) + pxor %xmm0,%xmm0 +L$dec_key_ret: + addq $8,%rsp + + ret + +L$SEH_end_set_decrypt_key: + +.globl _aesni_set_encrypt_key + +.p2align 4 +_aesni_set_encrypt_key: +__aesni_set_encrypt_key: + +.byte 0x48,0x83,0xEC,0x08 + + movq $-1,%rax + testq %rdi,%rdi + jz L$enc_key_ret + testq %rdx,%rdx + jz L$enc_key_ret + + movups (%rdi),%xmm0 + xorps %xmm4,%xmm4 + + + + leaq 16(%rdx),%rax + cmpl $256,%esi + je L$14rounds + cmpl $192,%esi + je L$12rounds + cmpl $128,%esi + jne L$bad_keybits + +L$10rounds: + movl $9,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa L$key_rotate(%rip),%xmm5 + movl $8,%r10d + movdqa L$key_rcon1(%rip),%xmm4 + movdqa %xmm0,%xmm2 + movdqu %xmm0,(%rdx) + jmp L$oop_key128 + +.p2align 4 +L$oop_key128: + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + leaq 16(%rax),%rax + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,-16(%rax) + movdqa %xmm0,%xmm2 + + decl %r10d + jnz L$oop_key128 + + movdqa L$key_rcon1b(%rip),%xmm4 + + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + movdqa %xmm0,%xmm2 + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,16(%rax) + + movl %esi,96(%rax) + xorl %eax,%eax + jmp L$enc_key_ret + +.p2align 4 +L$12rounds: + movq 16(%rdi),%xmm2 + movl $11,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa L$key_rotate192(%rip),%xmm5 + movdqa L$key_rcon1(%rip),%xmm4 + movl $8,%r10d + movdqu %xmm0,(%rdx) + jmp L$oop_key192 + +.p2align 4 +L$oop_key192: + movq %xmm2,0(%rax) + movdqa %xmm2,%xmm1 + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + pslld $1,%xmm4 + leaq 24(%rax),%rax + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + + pshufd $0xff,%xmm0,%xmm3 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + + pxor %xmm2,%xmm0 + pxor %xmm3,%xmm2 + movdqu %xmm0,-16(%rax) + + decl %r10d + jnz L$oop_key192 + + movl %esi,32(%rax) + xorl %eax,%eax + jmp L$enc_key_ret + +.p2align 4 +L$14rounds: + movups 16(%rdi),%xmm2 + movl $13,%esi + leaq 16(%rax),%rax + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa L$key_rotate(%rip),%xmm5 + movdqa L$key_rcon1(%rip),%xmm4 + movl $7,%r10d + movdqu %xmm0,0(%rdx) + movdqa %xmm2,%xmm1 + movdqu %xmm2,16(%rdx) + jmp L$oop_key256 + +.p2align 4 +L$oop_key256: + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + pslld $1,%xmm4 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + decl %r10d + jz L$done_key256 + + pshufd $0xff,%xmm0,%xmm2 + pxor %xmm3,%xmm3 +.byte 102,15,56,221,211 + + movdqa %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm3,%xmm1 + + pxor %xmm1,%xmm2 + movdqu %xmm2,16(%rax) + leaq 32(%rax),%rax + movdqa 
%xmm2,%xmm1 + + jmp L$oop_key256 + +L$done_key256: + movl %esi,16(%rax) + xorl %eax,%eax + jmp L$enc_key_ret + +.p2align 4 +L$bad_keybits: + movq $-2,%rax +L$enc_key_ret: + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + addq $8,%rsp + + ret + +L$SEH_end_set_encrypt_key: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.p2align 6 +L$bswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +L$increment32: +.long 6,6,6,0 +L$increment64: +.long 1,0,0,0 +L$xts_magic: +.long 0x87,0,1,0 +L$increment1: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +L$key_rotate: +.long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +L$key_rotate192: +.long 0x04070605,0x04070605,0x04070605,0x04070605 +L$key_rcon1: +.long 1,1,1,1 +L$key_rcon1b: +.long 0x1b,0x1b,0x1b,0x1b + +.p2align 6 diff --git a/crypto/aesgcm/aesni_x64_nasm.asm b/crypto/aesgcm/aesni_x64_nasm.asm new file mode 100644 index 0000000..f464cf9 --- /dev/null +++ b/crypto/aesgcm/aesni_x64_nasm.asm @@ -0,0 +1,1723 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +section .text code align=64 + +global aesni_encrypt + +ALIGN 16 +aesni_encrypt: + movups xmm2,XMMWORD[rcx] + mov eax,DWORD[240+r8] + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[16+r8] + lea r8,[32+r8] + xorps xmm2,xmm0 +$L$oop_enc1_1: +DB 102,15,56,220,209 + dec eax + movups xmm1,XMMWORD[r8] + lea r8,[16+r8] + jnz NEAR $L$oop_enc1_1 +DB 102,15,56,221,209 + pxor xmm0,xmm0 + pxor xmm1,xmm1 + movups XMMWORD[rdx],xmm2 + pxor xmm2,xmm2 + ret + + +global aesni_decrypt + +ALIGN 16 +aesni_decrypt: + movups xmm2,XMMWORD[rcx] + mov eax,DWORD[240+r8] + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[16+r8] + lea r8,[32+r8] + xorps xmm2,xmm0 +$L$oop_dec1_2: +DB 102,15,56,222,209 + dec eax + movups xmm1,XMMWORD[r8] + lea r8,[16+r8] + jnz NEAR $L$oop_dec1_2 +DB 102,15,56,223,209 + pxor xmm0,xmm0 + pxor xmm1,xmm1 + movups XMMWORD[rdx],xmm2 + pxor xmm2,xmm2 + ret + + +ALIGN 16 +_aesni_encrypt2: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$enc_loop2: +DB 102,15,56,220,209 +DB 102,15,56,220,217 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop2 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,221,208 +DB 102,15,56,221,216 + ret + + +ALIGN 16 +_aesni_decrypt2: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$dec_loop2: +DB 102,15,56,222,209 +DB 102,15,56,222,217 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop2 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,223,208 +DB 102,15,56,223,216 + ret + + +ALIGN 16 +_aesni_encrypt3: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$enc_loop3: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 + movups 
xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop3 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 + ret + + +ALIGN 16 +_aesni_decrypt3: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$dec_loop3: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop3 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 + ret + + +ALIGN 16 +_aesni_encrypt4: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + xorps xmm5,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax +DB 0x0f,0x1f,0x00 + add rax,16 + +$L$enc_loop4: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop4 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 +DB 102,15,56,221,232 + ret + + +ALIGN 16 +_aesni_decrypt4: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + xorps xmm5,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax +DB 0x0f,0x1f,0x00 + add rax,16 + +$L$dec_loop4: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 +DB 102,15,56,222,232 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop4 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 +DB 102,15,56,223,232 + ret + + +ALIGN 16 +_aesni_encrypt6: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + pxor xmm3,xmm0 + pxor xmm4,xmm0 +DB 102,15,56,220,209 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,220,217 + pxor xmm5,xmm0 + pxor xmm6,xmm0 +DB 102,15,56,220,225 + pxor xmm7,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$enc_loop6_enter +ALIGN 16 +$L$enc_loop6: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +$L$enc_loop6_enter: +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop6 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 +DB 102,15,56,221,232 +DB 102,15,56,221,240 +DB 102,15,56,221,248 + ret + + +ALIGN 16 +_aesni_decrypt6: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + pxor xmm3,xmm0 
+ pxor xmm4,xmm0 +DB 102,15,56,222,209 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,222,217 + pxor xmm5,xmm0 + pxor xmm6,xmm0 +DB 102,15,56,222,225 + pxor xmm7,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$dec_loop6_enter +ALIGN 16 +$L$dec_loop6: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +$L$dec_loop6_enter: +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 +DB 102,15,56,222,232 +DB 102,15,56,222,240 +DB 102,15,56,222,248 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop6 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 +DB 102,15,56,223,232 +DB 102,15,56,223,240 +DB 102,15,56,223,248 + ret + + +ALIGN 16 +_aesni_encrypt8: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + pxor xmm4,xmm0 + pxor xmm5,xmm0 + pxor xmm6,xmm0 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,220,209 + pxor xmm7,xmm0 + pxor xmm8,xmm0 +DB 102,15,56,220,217 + pxor xmm9,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$enc_loop8_inner +ALIGN 16 +$L$enc_loop8: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +$L$enc_loop8_inner: +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 +$L$enc_loop8_enter: + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop8 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 +DB 102,15,56,221,232 +DB 102,15,56,221,240 +DB 102,15,56,221,248 +DB 102,68,15,56,221,192 +DB 102,68,15,56,221,200 + ret + + +ALIGN 16 +_aesni_decrypt8: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + pxor xmm4,xmm0 + pxor xmm5,xmm0 + pxor xmm6,xmm0 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,222,209 + pxor xmm7,xmm0 + pxor xmm8,xmm0 +DB 102,15,56,222,217 + pxor xmm9,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$dec_loop8_inner +ALIGN 16 +$L$dec_loop8: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +$L$dec_loop8_inner: +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 +DB 102,68,15,56,222,193 +DB 102,68,15,56,222,201 +$L$dec_loop8_enter: + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 +DB 102,15,56,222,232 +DB 102,15,56,222,240 +DB 102,15,56,222,248 +DB 102,68,15,56,222,192 +DB 102,68,15,56,222,200 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop8 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 +DB 102,68,15,56,222,193 +DB 102,68,15,56,222,201 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 +DB 102,15,56,223,232 +DB 102,15,56,223,240 +DB 102,15,56,223,248 +DB 102,68,15,56,223,192 +DB 102,68,15,56,223,200 + ret 
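The bulk routine that follows implements CTR mode with a 32-bit big-endian counter, the variant AES-GCM needs: only bytes 12..15 of the 16-byte counter block are incremented, wrapping modulo 2^32, while the upper 96 bits stay fixed. A plain-C model of what the routine computes (not of how the eight-block AES-NI pipeline computes it) is sketched below; aes_encrypt_block() is a hypothetical single-block primitive standing in for the keyed rounds, and the parameter order mirrors the usual OpenSSL convention for this entry point, assumed rather than taken from this source:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical single-block AES primitive (key schedule + rounds). */
    void aes_encrypt_block(const uint8_t in[16], uint8_t out[16], const void *key);

    /* Reference model of 32-bit-counter CTR mode: bytes 12..15 of ivec
     * form a big-endian counter incremented once per 16-byte block; the
     * upper 96 bits never change, and the counter wraps modulo 2^32. */
    static void ctr32_encrypt_ref(const uint8_t *in, uint8_t *out, size_t blocks,
                                  const void *key, const uint8_t ivec[16])
    {
        uint8_t ctr[16], ks[16];
        memcpy(ctr, ivec, 16);
        uint32_t c = ((uint32_t)ctr[12] << 24) | ((uint32_t)ctr[13] << 16) |
                     ((uint32_t)ctr[14] << 8) | (uint32_t)ctr[15];
        while (blocks--) {
            aes_encrypt_block(ctr, ks, key);          /* keystream block */
            for (int i = 0; i < 16; i++)
                out[i] = in[i] ^ ks[i];               /* XOR into output */
            in += 16;
            out += 16;
            c++;                                      /* 32-bit wrap only */
            ctr[12] = (uint8_t)(c >> 24);
            ctr[13] = (uint8_t)(c >> 16);
            ctr[14] = (uint8_t)(c >> 8);
            ctr[15] = (uint8_t)c;
        }
    }
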
+ +global aesni_ctr32_encrypt_blocks + +ALIGN 16 +aesni_ctr32_encrypt_blocks: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_aesni_ctr32_encrypt_blocks: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + + cmp rdx,1 + jne NEAR $L$ctr32_bulk + + + + movups xmm2,XMMWORD[r8] + movups xmm3,XMMWORD[rdi] + mov edx,DWORD[240+rcx] + movups xmm0,XMMWORD[rcx] + movups xmm1,XMMWORD[16+rcx] + lea rcx,[32+rcx] + xorps xmm2,xmm0 +$L$oop_enc1_3: +DB 102,15,56,220,209 + dec edx + movups xmm1,XMMWORD[rcx] + lea rcx,[16+rcx] + jnz NEAR $L$oop_enc1_3 +DB 102,15,56,221,209 + pxor xmm0,xmm0 + pxor xmm1,xmm1 + xorps xmm2,xmm3 + pxor xmm3,xmm3 + movups XMMWORD[rsi],xmm2 + xorps xmm2,xmm2 + jmp NEAR $L$ctr32_epilogue + +ALIGN 16 +$L$ctr32_bulk: + lea r11,[rsp] + + push rbp + + sub rsp,288 + and rsp,-16 + movaps XMMWORD[(-168)+r11],xmm6 + movaps XMMWORD[(-152)+r11],xmm7 + movaps XMMWORD[(-136)+r11],xmm8 + movaps XMMWORD[(-120)+r11],xmm9 + movaps XMMWORD[(-104)+r11],xmm10 + movaps XMMWORD[(-88)+r11],xmm11 + movaps XMMWORD[(-72)+r11],xmm12 + movaps XMMWORD[(-56)+r11],xmm13 + movaps XMMWORD[(-40)+r11],xmm14 + movaps XMMWORD[(-24)+r11],xmm15 +$L$ctr32_body: + + + + + movdqu xmm2,XMMWORD[r8] + movdqu xmm0,XMMWORD[rcx] + mov r8d,DWORD[12+r8] + pxor xmm2,xmm0 + mov ebp,DWORD[12+rcx] + movdqa XMMWORD[rsp],xmm2 + bswap r8d + movdqa xmm3,xmm2 + movdqa xmm4,xmm2 + movdqa xmm5,xmm2 + movdqa XMMWORD[64+rsp],xmm2 + movdqa XMMWORD[80+rsp],xmm2 + movdqa XMMWORD[96+rsp],xmm2 + mov r10,rdx + movdqa XMMWORD[112+rsp],xmm2 + + lea rax,[1+r8] + lea rdx,[2+r8] + bswap eax + bswap edx + xor eax,ebp + xor edx,ebp +DB 102,15,58,34,216,3 + lea rax,[3+r8] + movdqa XMMWORD[16+rsp],xmm3 +DB 102,15,58,34,226,3 + bswap eax + mov rdx,r10 + lea r10,[4+r8] + movdqa XMMWORD[32+rsp],xmm4 + xor eax,ebp + bswap r10d +DB 102,15,58,34,232,3 + xor r10d,ebp + movdqa XMMWORD[48+rsp],xmm5 + lea r9,[5+r8] + mov DWORD[((64+12))+rsp],r10d + bswap r9d + lea r10,[6+r8] + mov eax,DWORD[240+rcx] + xor r9d,ebp + bswap r10d + mov DWORD[((80+12))+rsp],r9d + xor r10d,ebp + lea r9,[7+r8] + mov DWORD[((96+12))+rsp],r10d + bswap r9d +; leaq OPENSSL_ia32cap_P(%rip),%r10 +; mov 4(%r10),%r10d + xor r9d,ebp +; and $71303168,%r10d + mov DWORD[((112+12))+rsp],r9d + + movups xmm1,XMMWORD[16+rcx] + + movdqa xmm6,XMMWORD[64+rsp] + movdqa xmm7,XMMWORD[80+rsp] + + cmp rdx,8 + jb NEAR $L$ctr32_tail + + sub rdx,6 +; cmp $4194304,%r10d +; je .Lctr32_6x + + lea rcx,[128+rcx] + sub rdx,2 + jmp NEAR $L$ctr32_loop8 + +;.align 16 +;.Lctr32_6x: +; shl $4,%eax +; mov $48,%r10d +; bswap %ebp +; lea 32(%rcx,%eax),%rcx +; sub %rax,%r10 +; jmp .Lctr32_loop6 + +ALIGN 16 +$L$ctr32_loop6: + add r8d,6 + movups xmm0,XMMWORD[((-48))+r10*1+rcx] +DB 102,15,56,220,209 + mov eax,r8d + xor eax,ebp +DB 102,15,56,220,217 +DB 0x0f,0x38,0xf1,0x44,0x24,12 + lea eax,[1+r8] +DB 102,15,56,220,225 + xor eax,ebp +DB 0x0f,0x38,0xf1,0x44,0x24,28 +DB 102,15,56,220,233 + lea eax,[2+r8] + xor eax,ebp +DB 102,15,56,220,241 +DB 0x0f,0x38,0xf1,0x44,0x24,44 + lea eax,[3+r8] +DB 102,15,56,220,249 + movups xmm1,XMMWORD[((-32))+r10*1+rcx] + xor eax,ebp + +DB 102,15,56,220,208 +DB 0x0f,0x38,0xf1,0x44,0x24,60 + lea eax,[4+r8] +DB 102,15,56,220,216 + xor eax,ebp +DB 0x0f,0x38,0xf1,0x44,0x24,76 +DB 102,15,56,220,224 + lea eax,[5+r8] + xor eax,ebp +DB 102,15,56,220,232 +DB 0x0f,0x38,0xf1,0x44,0x24,92 + mov rax,r10 +DB 102,15,56,220,240 +DB 102,15,56,220,248 + movups xmm0,XMMWORD[((-16))+r10*1+rcx] + + call $L$enc_loop6 + + movdqu xmm8,XMMWORD[rdi] + movdqu 
xmm9,XMMWORD[16+rdi] + movdqu xmm10,XMMWORD[32+rdi] + movdqu xmm11,XMMWORD[48+rdi] + movdqu xmm12,XMMWORD[64+rdi] + movdqu xmm13,XMMWORD[80+rdi] + lea rdi,[96+rdi] + movups xmm1,XMMWORD[((-64))+r10*1+rcx] + pxor xmm8,xmm2 + movaps xmm2,XMMWORD[rsp] + pxor xmm9,xmm3 + movaps xmm3,XMMWORD[16+rsp] + pxor xmm10,xmm4 + movaps xmm4,XMMWORD[32+rsp] + pxor xmm11,xmm5 + movaps xmm5,XMMWORD[48+rsp] + pxor xmm12,xmm6 + movaps xmm6,XMMWORD[64+rsp] + pxor xmm13,xmm7 + movaps xmm7,XMMWORD[80+rsp] + movdqu XMMWORD[rsi],xmm8 + movdqu XMMWORD[16+rsi],xmm9 + movdqu XMMWORD[32+rsi],xmm10 + movdqu XMMWORD[48+rsi],xmm11 + movdqu XMMWORD[64+rsi],xmm12 + movdqu XMMWORD[80+rsi],xmm13 + lea rsi,[96+rsi] + + sub rdx,6 + jnc NEAR $L$ctr32_loop6 + + add rdx,6 + jz NEAR $L$ctr32_done + + lea eax,[((-48))+r10] + lea rcx,[((-80))+r10*1+rcx] + neg eax + shr eax,4 + jmp NEAR $L$ctr32_tail + +ALIGN 32 +$L$ctr32_loop8: + add r8d,8 + movdqa xmm8,XMMWORD[96+rsp] +DB 102,15,56,220,209 + mov r9d,r8d + movdqa xmm9,XMMWORD[112+rsp] +DB 102,15,56,220,217 + bswap r9d + movups xmm0,XMMWORD[((32-128))+rcx] +DB 102,15,56,220,225 + xor r9d,ebp + nop +DB 102,15,56,220,233 + mov DWORD[((0+12))+rsp],r9d + lea r9,[1+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((48-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + mov DWORD[((16+12))+rsp],r9d + lea r9,[2+r8] +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((64-128))+rcx] + bswap r9d +DB 102,15,56,220,209 +DB 102,15,56,220,217 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + mov DWORD[((32+12))+rsp],r9d + lea r9,[3+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((80-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + mov DWORD[((48+12))+rsp],r9d + lea r9,[4+r8] +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((96-128))+rcx] + bswap r9d +DB 102,15,56,220,209 +DB 102,15,56,220,217 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + mov DWORD[((64+12))+rsp],r9d + lea r9,[5+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((112-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + mov DWORD[((80+12))+rsp],r9d + lea r9,[6+r8] +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((128-128))+rcx] + bswap r9d +DB 102,15,56,220,209 +DB 102,15,56,220,217 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + mov DWORD[((96+12))+rsp],r9d + lea r9,[7+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((144-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 + xor r9d,ebp + movdqu xmm10,XMMWORD[rdi] +DB 102,15,56,220,232 + mov DWORD[((112+12))+rsp],r9d + cmp eax,11 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((160-128))+rcx] + + jb NEAR $L$ctr32_enc_done + +DB 102,15,56,220,209 +DB 
102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((176-128))+rcx] + +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((192-128))+rcx] + je NEAR $L$ctr32_enc_done + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((208-128))+rcx] + +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((224-128))+rcx] + jmp NEAR $L$ctr32_enc_done + +ALIGN 16 +$L$ctr32_enc_done: + movdqu xmm11,XMMWORD[16+rdi] + pxor xmm10,xmm0 + movdqu xmm12,XMMWORD[32+rdi] + pxor xmm11,xmm0 + movdqu xmm13,XMMWORD[48+rdi] + pxor xmm12,xmm0 + movdqu xmm14,XMMWORD[64+rdi] + pxor xmm13,xmm0 + movdqu xmm15,XMMWORD[80+rdi] + pxor xmm14,xmm0 + pxor xmm15,xmm0 +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movdqu xmm1,XMMWORD[96+rdi] + lea rdi,[128+rdi] + +DB 102,65,15,56,221,210 + pxor xmm1,xmm0 + movdqu xmm10,XMMWORD[((112-128))+rdi] +DB 102,65,15,56,221,219 + pxor xmm10,xmm0 + movdqa xmm11,XMMWORD[rsp] +DB 102,65,15,56,221,228 +DB 102,65,15,56,221,237 + movdqa xmm12,XMMWORD[16+rsp] + movdqa xmm13,XMMWORD[32+rsp] +DB 102,65,15,56,221,246 +DB 102,65,15,56,221,255 + movdqa xmm14,XMMWORD[48+rsp] + movdqa xmm15,XMMWORD[64+rsp] +DB 102,68,15,56,221,193 + movdqa xmm0,XMMWORD[80+rsp] + movups xmm1,XMMWORD[((16-128))+rcx] +DB 102,69,15,56,221,202 + + movups XMMWORD[rsi],xmm2 + movdqa xmm2,xmm11 + movups XMMWORD[16+rsi],xmm3 + movdqa xmm3,xmm12 + movups XMMWORD[32+rsi],xmm4 + movdqa xmm4,xmm13 + movups XMMWORD[48+rsi],xmm5 + movdqa xmm5,xmm14 + movups XMMWORD[64+rsi],xmm6 + movdqa xmm6,xmm15 + movups XMMWORD[80+rsi],xmm7 + movdqa xmm7,xmm0 + movups XMMWORD[96+rsi],xmm8 + movups XMMWORD[112+rsi],xmm9 + lea rsi,[128+rsi] + + sub rdx,8 + jnc NEAR $L$ctr32_loop8 + + add rdx,8 + jz NEAR $L$ctr32_done + lea rcx,[((-128))+rcx] + +$L$ctr32_tail: + + + lea rcx,[16+rcx] + cmp rdx,4 + jb NEAR $L$ctr32_loop3 + je NEAR $L$ctr32_loop4 + + + shl eax,4 + movdqa xmm8,XMMWORD[96+rsp] + pxor xmm9,xmm9 + + movups xmm0,XMMWORD[16+rcx] +DB 102,15,56,220,209 +DB 102,15,56,220,217 + lea rcx,[((32-16))+rax*1+rcx] + neg rax +DB 102,15,56,220,225 + add rax,16 + movups xmm10,XMMWORD[rdi] +DB 102,15,56,220,233 +DB 102,15,56,220,241 + movups xmm11,XMMWORD[16+rdi] + movups xmm12,XMMWORD[32+rdi] +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 + + call $L$enc_loop8_enter + + movdqu xmm13,XMMWORD[48+rdi] + pxor xmm2,xmm10 + movdqu xmm10,XMMWORD[64+rdi] + pxor xmm3,xmm11 + movdqu XMMWORD[rsi],xmm2 + pxor xmm4,xmm12 + movdqu XMMWORD[16+rsi],xmm3 + pxor xmm5,xmm13 + movdqu XMMWORD[32+rsi],xmm4 + pxor xmm6,xmm10 + movdqu XMMWORD[48+rsi],xmm5 + movdqu XMMWORD[64+rsi],xmm6 + cmp rdx,6 + jb NEAR $L$ctr32_done + + movups xmm11,XMMWORD[80+rdi] + xorps xmm7,xmm11 + movups XMMWORD[80+rsi],xmm7 + je NEAR $L$ctr32_done + + movups xmm12,XMMWORD[96+rdi] + xorps xmm8,xmm12 + movups XMMWORD[96+rsi],xmm8 + jmp NEAR $L$ctr32_done + +ALIGN 32 +$L$ctr32_loop4: +DB 
102,15,56,220,209 + lea rcx,[16+rcx] + dec eax +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + movups xmm1,XMMWORD[rcx] + jnz NEAR $L$ctr32_loop4 +DB 102,15,56,221,209 +DB 102,15,56,221,217 + movups xmm10,XMMWORD[rdi] + movups xmm11,XMMWORD[16+rdi] +DB 102,15,56,221,225 +DB 102,15,56,221,233 + movups xmm12,XMMWORD[32+rdi] + movups xmm13,XMMWORD[48+rdi] + + xorps xmm2,xmm10 + movups XMMWORD[rsi],xmm2 + xorps xmm3,xmm11 + movups XMMWORD[16+rsi],xmm3 + pxor xmm4,xmm12 + movdqu XMMWORD[32+rsi],xmm4 + pxor xmm5,xmm13 + movdqu XMMWORD[48+rsi],xmm5 + jmp NEAR $L$ctr32_done + +ALIGN 32 +$L$ctr32_loop3: +DB 102,15,56,220,209 + lea rcx,[16+rcx] + dec eax +DB 102,15,56,220,217 +DB 102,15,56,220,225 + movups xmm1,XMMWORD[rcx] + jnz NEAR $L$ctr32_loop3 +DB 102,15,56,221,209 +DB 102,15,56,221,217 +DB 102,15,56,221,225 + + movups xmm10,XMMWORD[rdi] + xorps xmm2,xmm10 + movups XMMWORD[rsi],xmm2 + cmp rdx,2 + jb NEAR $L$ctr32_done + + movups xmm11,XMMWORD[16+rdi] + xorps xmm3,xmm11 + movups XMMWORD[16+rsi],xmm3 + je NEAR $L$ctr32_done + + movups xmm12,XMMWORD[32+rdi] + xorps xmm4,xmm12 + movups XMMWORD[32+rsi],xmm4 + +$L$ctr32_done: + xorps xmm0,xmm0 + xor ebp,ebp + pxor xmm1,xmm1 + pxor xmm2,xmm2 + pxor xmm3,xmm3 + pxor xmm4,xmm4 + pxor xmm5,xmm5 + movaps xmm6,XMMWORD[((-168))+r11] + movaps XMMWORD[(-168)+r11],xmm0 + movaps xmm7,XMMWORD[((-152))+r11] + movaps XMMWORD[(-152)+r11],xmm0 + movaps xmm8,XMMWORD[((-136))+r11] + movaps XMMWORD[(-136)+r11],xmm0 + movaps xmm9,XMMWORD[((-120))+r11] + movaps XMMWORD[(-120)+r11],xmm0 + movaps xmm10,XMMWORD[((-104))+r11] + movaps XMMWORD[(-104)+r11],xmm0 + movaps xmm11,XMMWORD[((-88))+r11] + movaps XMMWORD[(-88)+r11],xmm0 + movaps xmm12,XMMWORD[((-72))+r11] + movaps XMMWORD[(-72)+r11],xmm0 + movaps xmm13,XMMWORD[((-56))+r11] + movaps XMMWORD[(-56)+r11],xmm0 + movaps xmm14,XMMWORD[((-40))+r11] + movaps XMMWORD[(-40)+r11],xmm0 + movaps xmm15,XMMWORD[((-24))+r11] + movaps XMMWORD[(-24)+r11],xmm0 + movaps XMMWORD[rsp],xmm0 + movaps XMMWORD[16+rsp],xmm0 + movaps XMMWORD[32+rsp],xmm0 + movaps XMMWORD[48+rsp],xmm0 + movaps XMMWORD[64+rsp],xmm0 + movaps XMMWORD[80+rsp],xmm0 + movaps XMMWORD[96+rsp],xmm0 + movaps XMMWORD[112+rsp],xmm0 + mov rbp,QWORD[((-8))+r11] + + lea rsp,[r11] + +$L$ctr32_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret + +$L$SEH_end_aesni_ctr32_encrypt_blocks: +global aesni_set_decrypt_key + +ALIGN 16 +aesni_set_decrypt_key: + +DB 0x48,0x83,0xEC,0x08 + + call __aesni_set_encrypt_key + shl edx,4 + test eax,eax + jnz NEAR $L$dec_key_ret + lea rcx,[16+rdx*1+r8] + + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[rcx] + movups XMMWORD[rcx],xmm0 + movups XMMWORD[r8],xmm1 + lea r8,[16+r8] + lea rcx,[((-16))+rcx] + +$L$dec_key_inverse: + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[rcx] +DB 102,15,56,219,192 +DB 102,15,56,219,201 + lea r8,[16+r8] + lea rcx,[((-16))+rcx] + movups XMMWORD[16+rcx],xmm0 + movups XMMWORD[(-16)+r8],xmm1 + cmp rcx,r8 + ja NEAR $L$dec_key_inverse + + movups xmm0,XMMWORD[r8] +DB 102,15,56,219,192 + pxor xmm1,xmm1 + movups XMMWORD[rcx],xmm0 + pxor xmm0,xmm0 +$L$dec_key_ret: + add rsp,8 + + ret + +$L$SEH_end_set_decrypt_key: + +global aesni_set_encrypt_key + +ALIGN 16 +aesni_set_encrypt_key: +__aesni_set_encrypt_key: + +DB 0x48,0x83,0xEC,0x08 + + mov rax,-1 + test rcx,rcx + jz NEAR $L$enc_key_ret + test r8,r8 + jz NEAR $L$enc_key_ret + + movups xmm0,XMMWORD[rcx] + xorps xmm4,xmm4 +; leaq OPENSSL_ia32cap_P(%rip),%r10 +; movl 4(%r10),%r10d +; and $268437504,%r10d + lea rax,[16+r8] + cmp edx,256 
+ je NEAR $L$14rounds + cmp edx,192 + je NEAR $L$12rounds + cmp edx,128 + jne NEAR $L$bad_keybits + +$L$10rounds: + mov edx,9 +; cmp $268435456,%r10d +; je .L10rounds_alt +; jmp .L10rounds_alt +; movups %xmm0,(%r8) +; .byte 102,15,58,223,200,1 +; call .Lkey_expansion_128_cold +; .byte 102,15,58,223,200,2 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,4 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,8 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,16 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,32 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,64 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,128 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,27 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,54 +; call .Lkey_expansion_128 +; movups %xmm0,(%rax) +; mov %edx,80(%rax) +; xor %eax,%eax +; jmp .Lenc_key_ret + +;.align 16 +;.L10rounds_alt: + movdqa xmm5,XMMWORD[$L$key_rotate] + mov r10d,8 + movdqa xmm4,XMMWORD[$L$key_rcon1] + movdqa xmm2,xmm0 + movdqu XMMWORD[r8],xmm0 + jmp NEAR $L$oop_key128 + +ALIGN 16 +$L$oop_key128: + pshufb xmm0,xmm5 +DB 102,15,56,221,196 + pslld xmm4,1 + lea rax,[16+rax] + + movdqa xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm2,xmm3 + + pxor xmm0,xmm2 + movdqu XMMWORD[(-16)+rax],xmm0 + movdqa xmm2,xmm0 + + dec r10d + jnz NEAR $L$oop_key128 + + movdqa xmm4,XMMWORD[$L$key_rcon1b] + + pshufb xmm0,xmm5 +DB 102,15,56,221,196 + pslld xmm4,1 + + movdqa xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm2,xmm3 + + pxor xmm0,xmm2 + movdqu XMMWORD[rax],xmm0 + + movdqa xmm2,xmm0 + pshufb xmm0,xmm5 +DB 102,15,56,221,196 + + movdqa xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm2,xmm3 + + pxor xmm0,xmm2 + movdqu XMMWORD[16+rax],xmm0 + + mov DWORD[96+rax],edx + xor eax,eax + jmp NEAR $L$enc_key_ret + +ALIGN 16 +$L$12rounds: + movq xmm2,QWORD[16+rcx] + mov edx,11 +; cmp $268435456,%r10d +; je .L12rounds_alt + +; movups %xmm0,(%r8) +; .byte 102,15,58,223,202,1 +; call .Lkey_expansion_192a_cold +; .byte 102,15,58,223,202,2 +; call .Lkey_expansion_192b +; .byte 102,15,58,223,202,4 +; call .Lkey_expansion_192a +; .byte 102,15,58,223,202,8 +; call .Lkey_expansion_192b +; .byte 102,15,58,223,202,16 +; call .Lkey_expansion_192a +; .byte 102,15,58,223,202,32 +; call .Lkey_expansion_192b +; .byte 102,15,58,223,202,64 +; call .Lkey_expansion_192a +; .byte 102,15,58,223,202,128 +; call .Lkey_expansion_192b +; movups %xmm0,(%rax) +; mov %edx,48(%rax) +; xor %rax, %rax +; jmp .Lenc_key_ret + +;.align 16 +;.L12rounds_alt: + movdqa xmm5,XMMWORD[$L$key_rotate192] + movdqa xmm4,XMMWORD[$L$key_rcon1] + mov r10d,8 + movdqu XMMWORD[r8],xmm0 + jmp NEAR $L$oop_key192 + +ALIGN 16 +$L$oop_key192: + movq QWORD[rax],xmm2 + movdqa xmm1,xmm2 + pshufb xmm2,xmm5 +DB 102,15,56,221,212 + pslld xmm4,1 + lea rax,[24+rax] + + movdqa xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm0,xmm3 + + pshufd xmm3,xmm0,0xff + pxor xmm3,xmm1 + pslldq xmm1,4 + pxor xmm3,xmm1 + + pxor xmm0,xmm2 + pxor xmm2,xmm3 + movdqu XMMWORD[(-16)+rax],xmm0 + + dec r10d + jnz NEAR $L$oop_key192 + + mov DWORD[32+rax],edx + xor eax,eax + jmp NEAR $L$enc_key_ret + +ALIGN 16 +$L$14rounds: + movups xmm2,XMMWORD[16+rcx] + mov edx,13 + lea rax,[16+rax] +; cmp $268435456,%r10d +; je .L14rounds_alt +; +; movups %xmm0,(%r8) +; movups %xmm2,16(%r8) +; .byte 102,15,58,223,202,1 +; call 
.Lkey_expansion_256a_cold +; .byte 102,15,58,223,200,1 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,2 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,2 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,4 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,4 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,8 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,8 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,16 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,16 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,32 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,32 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,64 +; call .Lkey_expansion_256a +; movups %xmm0,(%rax) +; mov %edx,16(%rax) +; xor %rax,%rax +; jmp .Lenc_key_ret + +;.align 16 +;.L14rounds_alt: + movdqa xmm5,XMMWORD[$L$key_rotate] + movdqa xmm4,XMMWORD[$L$key_rcon1] + mov r10d,7 + movdqu XMMWORD[r8],xmm0 + movdqa xmm1,xmm2 + movdqu XMMWORD[16+r8],xmm2 + jmp NEAR $L$oop_key256 + +ALIGN 16 +$L$oop_key256: + pshufb xmm2,xmm5 +DB 102,15,56,221,212 + + movdqa xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm0,xmm3 + pslld xmm4,1 + + pxor xmm0,xmm2 + movdqu XMMWORD[rax],xmm0 + + dec r10d + jz NEAR $L$done_key256 + + pshufd xmm2,xmm0,0xff + pxor xmm3,xmm3 +DB 102,15,56,221,211 + + movdqa xmm3,xmm1 + pslldq xmm1,4 + pxor xmm3,xmm1 + pslldq xmm1,4 + pxor xmm3,xmm1 + pslldq xmm1,4 + pxor xmm1,xmm3 + + pxor xmm2,xmm1 + movdqu XMMWORD[16+rax],xmm2 + lea rax,[32+rax] + movdqa xmm1,xmm2 + + jmp NEAR $L$oop_key256 + +$L$done_key256: + mov DWORD[16+rax],edx + xor eax,eax + jmp NEAR $L$enc_key_ret + +ALIGN 16 +$L$bad_keybits: + mov rax,-2 +$L$enc_key_ret: + pxor xmm0,xmm0 + pxor xmm1,xmm1 + pxor xmm2,xmm2 + pxor xmm3,xmm3 + pxor xmm4,xmm4 + pxor xmm5,xmm5 + add rsp,8 + + ret + +$L$SEH_end_set_encrypt_key: + +;.align 16 +;.Lkey_expansion_128: +; movups %xmm0,(%rax) +; lea 16(%rax),%rax +;.Lkey_expansion_128_cold: +; shufps $0b00010000,%xmm0,%xmm4 +; xorps %xmm4, %xmm0 +; shufps $0b10001100,%xmm0,%xmm4 +; xorps %xmm4, %xmm0 +; shufps $0b11111111,%xmm1,%xmm1 +; xorps %xmm1,%xmm0 +; ret + +;.align 16 +;.Lkey_expansion_192a: +; movups %xmm0,(%rax) +; lea 16(%rax),%rax +;.Lkey_expansion_192a_cold: +; movaps %xmm2, %xmm5 +;.Lkey_expansion_192b_warm: +; shufps $0b00010000,%xmm0,%xmm4 +; movdqa %xmm2,%xmm3 +; xorps %xmm4,%xmm0 +; shufps $0b10001100,%xmm0,%xmm4 +; pslldq $4,%xmm3 +; xorps %xmm4,%xmm0 +; pshufd $0b01010101,%xmm1,%xmm1 +; pxor %xmm3,%xmm2 +; pxor %xmm1,%xmm0 +; pshufd $0b11111111,%xmm0,%xmm3 +; pxor %xmm3,%xmm2 +; ret +; +;.align 16 +;.Lkey_expansion_192b: +; movaps %xmm0,%xmm3 +; shufps $0b01000100,%xmm0,%xmm5 +; movups %xmm5,(%rax) +; shufps $0b01001110,%xmm2,%xmm3 +; movups %xmm3,16(%rax) +; lea 32(%rax),%rax +; jmp .Lkey_expansion_192b_warm +; +;.align 16 +;.Lkey_expansion_256a: +; movups %xmm2,(%rax) +; lea 16(%rax),%rax +;.Lkey_expansion_256a_cold: +; shufps $0b00010000,%xmm0,%xmm4 +; xorps %xmm4,%xmm0 +; shufps $0b10001100,%xmm0,%xmm4 +; xorps %xmm4,%xmm0 +; shufps $0b11111111,%xmm1,%xmm1 +; xorps %xmm1,%xmm0 +; ret +; +;.align 16 +;.Lkey_expansion_256b: +; movups %xmm0,(%rax) +; lea 16(%rax),%rax +; +; shufps $0b00010000,%xmm2,%xmm4 +; xorps %xmm4,%xmm2 +; shufps $0b10001100,%xmm2,%xmm4 +; xorps %xmm4,%xmm2 +; shufps $0b10101010,%xmm1,%xmm1 +; xorps %xmm1,%xmm2 +; ret + + +ALIGN 64 +$L$bswap_mask: +DB 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +$L$increment32: + DD 6,6,6,0 +$L$increment64: + DD 1,0,0,0 
+$L$xts_magic: + DD 0x87,0,1,0 +$L$increment1: +DB 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +$L$key_rotate: + DD 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +$L$key_rotate192: + DD 0x04070605,0x04070605,0x04070605,0x04070605 +$L$key_rcon1: + DD 1,1,1,1 +$L$key_rcon1b: + DD 0x1b,0x1b,0x1b,0x1b + +ALIGN 64 +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +ctr_xts_se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + mov rax,QWORD[208+r8] + + lea rsi,[((-168))+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + + mov rbp,QWORD[((-8))+rax] + mov QWORD[160+r8],rbp + jmp NEAR $L$common_seh_tail + + +ALIGN 16 +cbc_se_handler: +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + ret + +section .pdata rdata align=4 +ALIGN 4 + + + + + + + + + + + + + DD $L$SEH_begin_aesni_ctr32_encrypt_blocks wrt ..imagebase + DD $L$SEH_end_aesni_ctr32_encrypt_blocks wrt ..imagebase + DD $L$SEH_info_ctr32 wrt ..imagebase + + + + + + + + + + + + + + + + + + + + + DD aesni_set_decrypt_key wrt ..imagebase + DD $L$SEH_end_set_decrypt_key wrt ..imagebase + DD $L$SEH_info_key wrt ..imagebase + + DD aesni_set_encrypt_key wrt ..imagebase + DD $L$SEH_end_set_encrypt_key wrt ..imagebase + DD $L$SEH_info_key wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +;.LSEH_info_ecb: +; .byte 9,0,0,0 +; .rva ecb_ccm64_se_handler +; .rva .Lecb_enc_body,.Lecb_enc_ret +;.LSEH_info_ccm64_enc: +; .byte 9,0,0,0 +; .rva ecb_ccm64_se_handler +; .rva .Lccm64_enc_body,.Lccm64_enc_ret +;.LSEH_info_ccm64_dec: +; .byte 9,0,0,0 +; .rva ecb_ccm64_se_handler +; .rva .Lccm64_dec_body,.Lccm64_dec_ret +$L$SEH_info_ctr32: +DB 9,0,0,0 + DD ctr_xts_se_handler wrt ..imagebase + DD $L$ctr32_body wrt ..imagebase,$L$ctr32_epilogue wrt ..imagebase +;.LSEH_info_xts_enc: +; .byte 9,0,0,0 +; .rva ctr_xts_se_handler +; .rva .Lxts_enc_body,.Lxts_enc_epilogue +;.LSEH_info_xts_dec: +; .byte 9,0,0,0 +; .rva ctr_xts_se_handler +; .rva .Lxts_dec_body,.Lxts_dec_epilogue +;.LSEH_info_ocb_enc: +; .byte 9,0,0,0 +; .rva ocb_se_handler +; .rva .Locb_enc_body,.Locb_enc_epilogue +; .rva .Locb_enc_pop +; .long 0 +;.LSEH_info_ocb_dec: +; .byte 9,0,0,0 +; .rva ocb_se_handler +; .rva .Locb_dec_body,.Locb_dec_epilogue +; .rva .Locb_dec_pop +; .long 0 +;.LSEH_info_cbc: +; .byte 9,0,0,0 +; .rva cbc_se_handler +$L$SEH_info_key: +DB 0x01,0x04,0x01,0x00 +DB 0x04,0x02,0x00,0x00 diff --git a/crypto/aesgcm/ghash-x86.pl b/crypto/aesgcm/ghash-x86.pl new file mode 100644 index 0000000..02edf03 --- /dev/null +++ b/crypto/aesgcm/ghash-x86.pl @@ -0,0 +1,1176 @@ +#! /usr/bin/env perl +# Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved. 
+# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# March, May, June 2010 +# +# The module implements "4-bit" GCM GHASH function and underlying +# single multiplication operation in GF(2^128). "4-bit" means that it +# uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two +# code paths: vanilla x86 and vanilla SSE. Former will be executed on +# 486 and Pentium, latter on all others. SSE GHASH features so called +# "528B" variant of "4-bit" method utilizing additional 256+16 bytes +# of per-key storage [+512 bytes shared table]. Performance results +# are for streamed GHASH subroutine and are expressed in cycles per +# processed byte, less is better: +# +# gcc 2.95.3(*) SSE assembler x86 assembler +# +# Pentium 105/111(**) - 50 +# PIII 68 /75 12.2 24 +# P4 125/125 17.8 84(***) +# Opteron 66 /70 10.1 30 +# Core2 54 /67 8.4 18 +# Atom 105/105 16.8 53 +# VIA Nano 69 /71 13.0 27 +# +# (*) gcc 3.4.x was observed to generate few percent slower code, +# which is one of reasons why 2.95.3 results were chosen, +# another reason is lack of 3.4.x results for older CPUs; +# comparison with SSE results is not completely fair, because C +# results are for vanilla "256B" implementation, while +# assembler results are for "528B";-) +# (**) second number is result for code compiled with -fPIC flag, +# which is actually more relevant, because assembler code is +# position-independent; +# (***) see comment in non-MMX routine for further details; +# +# To summarize, it's >2-5 times faster than gcc-generated code. To +# anchor it to something else SHA1 assembler processes one byte in +# ~7 cycles on contemporary x86 cores. As for choice of MMX/SSE +# in particular, see comment at the end of the file... + +# May 2010 +# +# Add PCLMULQDQ version performing at 2.10 cycles per processed byte. +# The question is how close is it to theoretical limit? The pclmulqdq +# instruction latency appears to be 14 cycles and there can't be more +# than 2 of them executing at any given time. This means that single +# Karatsuba multiplication would take 28 cycles *plus* few cycles for +# pre- and post-processing. Then multiplication has to be followed by +# modulo-reduction. Given that aggregated reduction method [see +# "Carry-less Multiplication and Its Usage for Computing the GCM Mode" +# white paper by Intel] allows you to perform reduction only once in +# a while we can assume that asymptotic performance can be estimated +# as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction +# and Naggr is the aggregation factor. +# +# Before we proceed to this implementation let's have closer look at +# the best-performing code suggested by Intel in their white paper. +# By tracing inter-register dependencies Tmod is estimated as ~19 +# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per +# processed byte. 
As implied, this is quite an optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte out of a 16KB buffer. Note that
+# the result even accounts for pre-computing the powers of the hash
+# key H, but their portion is negligible at a 16KB buffer size.
+#
+# Moving on to the implementation in question. Tmod is estimated as
+# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
+# 2.16. How is it possible that measured performance is better than
+# the optimistic theoretical estimate? There is one thing Intel failed
+# to recognize. By serializing GHASH with CTR in the same subroutine,
+# the former's performance is limited by the (Tmul + Tmod/Naggr)
+# equation above. But if the GHASH procedure is detached, the
+# modulo-reduction can be interleaved with Naggr-1 multiplications at
+# the instruction level and, under ideal conditions, even disappear
+# from the equation. So the optimistic theoretical estimate for this
+# implementation is ... 28/16=1.75, and not 2.16. Well, that's probably
+# way too optimistic, at least for such a small Naggr. I'd argue that
+# (28+Tproc/Naggr)/16, where Tproc is the time required for Karatsuba
+# pre- and post-processing, is a more realistic estimate. In this case
+# it gives ... 1.91 cycles. Or in other words, depending on how well we
+# can interleave reduction and one of the two multiplications, the
+# performance should be between 1.91 and 2.16. As already mentioned,
+# this implementation processes one byte out of an 8KB buffer in 2.10
+# cycles, while the x86_64 counterpart does so in 2.02. x86_64
+# performance is better because the larger register bank allows
+# reduction and multiplication to be interleaved better.
+#
+# Does it make sense to increase Naggr? To start with, it's virtually
+# impossible in 32-bit mode, because of the limited register bank
+# capacity. Otherwise, the improvement has to be weighed against slower
+# setup, as well as increased code size and complexity. As even the
+# optimistic estimate doesn't promise a 30% performance improvement,
+# there are currently no plans to increase Naggr.
+#
+# Special thanks to David Woodhouse for providing access to a
+# Westmere-based system on behalf of Intel Open Source Technology Centre.
+
+# January 2010
+#
+# Tweaked to optimize transitions between integer and FP operations
+# on the same XMM register. The PCLMULQDQ subroutine was measured to
+# process one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on
+# Westmere. The minor regression on Westmere is outweighed by ~15%
+# improvement on Sandy Bridge. Strangely enough, an attempt to modify
+# the 64-bit code in a similar manner resulted in almost 20%
+# degradation on Sandy Bridge, where the original 64-bit code
+# processes one byte in 1.95 cycles.
+
+#####################################################################
+# For reference, AMD Bulldozer processes one byte in 1.98 cycles in
+# 32-bit mode and 1.89 in 64-bit.
+
+# February 2013
+#
+# Overhaul: aggregate Karatsuba post-processing, improve ILP in
+# reduction_alg9. Resulting performance is 1.96 cycles per byte on
+# Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer.
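Plugging the comment's own figures back into its formula makes the three estimates concrete (Tmul = 28 cycles for a single Karatsuba multiplication, Tmod ~ 13, Tproc ~ 5, Naggr = 2, 16 bytes per block):

    serialized with CTR:   (28 + 13/2)/16 = 34.5/16 ~ 2.16 cycles/byte
    ideal interleaving:     28/16                   = 1.75 cycles/byte
    with Karatsuba cost:   (28 + 5/2)/16  = 30.5/16 ~ 1.91 cycles/byte
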
+ +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../../perlasm"); +require "x86asm.pl"; + +$output=pop; +open STDOUT,">$output"; + +&asm_init($ARGV[0],$x86only = $ARGV[$#ARGV] eq "386"); + +$sse2=0; +for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); } + +($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx"); +$inp = "edi"; +$Htbl = "esi"; + +$unroll = 0; # Affects x86 loop. Folded loop performs ~7% worse + # than unrolled, which has to be weighted against + # 2.5x x86-specific code size reduction. + +sub x86_loop { + my $off = shift; + my $rem = "eax"; + + &mov ($Zhh,&DWP(4,$Htbl,$Zll)); + &mov ($Zhl,&DWP(0,$Htbl,$Zll)); + &mov ($Zlh,&DWP(12,$Htbl,$Zll)); + &mov ($Zll,&DWP(8,$Htbl,$Zll)); + &xor ($rem,$rem); # avoid partial register stalls on PIII + + # shrd practically kills P4, 2.5x deterioration, but P4 has + # MMX code-path to execute. shrd runs tad faster [than twice + # the shifts, move's and or's] on pre-MMX Pentium (as well as + # PIII and Core2), *but* minimizes code size, spares register + # and thus allows to fold the loop... + if (!$unroll) { + my $cnt = $inp; + &mov ($cnt,15); + &jmp (&label("x86_loop")); + &set_label("x86_loop",16); + for($i=1;$i<=2;$i++) { + &mov (&LB($rem),&LB($Zll)); + &shrd ($Zll,$Zlh,4); + &and (&LB($rem),0xf); + &shrd ($Zlh,$Zhl,4); + &shrd ($Zhl,$Zhh,4); + &shr ($Zhh,4); + &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); + + &mov (&LB($rem),&BP($off,"esp",$cnt)); + if ($i&1) { + &and (&LB($rem),0xf0); + } else { + &shl (&LB($rem),4); + } + + &xor ($Zll,&DWP(8,$Htbl,$rem)); + &xor ($Zlh,&DWP(12,$Htbl,$rem)); + &xor ($Zhl,&DWP(0,$Htbl,$rem)); + &xor ($Zhh,&DWP(4,$Htbl,$rem)); + + if ($i&1) { + &dec ($cnt); + &js (&label("x86_break")); + } else { + &jmp (&label("x86_loop")); + } + } + &set_label("x86_break",16); + } else { + for($i=1;$i<32;$i++) { + &comment($i); + &mov (&LB($rem),&LB($Zll)); + &shrd ($Zll,$Zlh,4); + &and (&LB($rem),0xf); + &shrd ($Zlh,$Zhl,4); + &shrd ($Zhl,$Zhh,4); + &shr ($Zhh,4); + &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); + + if ($i&1) { + &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); + &and (&LB($rem),0xf0); + } else { + &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); + &shl (&LB($rem),4); + } + + &xor ($Zll,&DWP(8,$Htbl,$rem)); + &xor ($Zlh,&DWP(12,$Htbl,$rem)); + &xor ($Zhl,&DWP(0,$Htbl,$rem)); + &xor ($Zhh,&DWP(4,$Htbl,$rem)); + } + } + &bswap ($Zll); + &bswap ($Zlh); + &bswap ($Zhl); + if (!$x86only) { + &bswap ($Zhh); + } else { + &mov ("eax",$Zhh); + &bswap ("eax"); + &mov ($Zhh,"eax"); + } +} + +if ($unroll) { + &function_begin_B("_x86_gmult_4bit_inner"); + &x86_loop(4); + &ret (); + &function_end_B("_x86_gmult_4bit_inner"); +} + +sub deposit_rem_4bit { + my $bias = shift; + + &mov (&DWP($bias+0, "esp"),0x0000<<16); + &mov (&DWP($bias+4, "esp"),0x1C20<<16); + &mov (&DWP($bias+8, "esp"),0x3840<<16); + &mov (&DWP($bias+12,"esp"),0x2460<<16); + &mov (&DWP($bias+16,"esp"),0x7080<<16); + &mov (&DWP($bias+20,"esp"),0x6CA0<<16); + &mov (&DWP($bias+24,"esp"),0x48C0<<16); + &mov (&DWP($bias+28,"esp"),0x54E0<<16); + &mov (&DWP($bias+32,"esp"),0xE100<<16); + &mov (&DWP($bias+36,"esp"),0xFD20<<16); + &mov (&DWP($bias+40,"esp"),0xD940<<16); + &mov (&DWP($bias+44,"esp"),0xC560<<16); + &mov (&DWP($bias+48,"esp"),0x9180<<16); + &mov (&DWP($bias+52,"esp"),0x8DA0<<16); + &mov (&DWP($bias+56,"esp"),0xA9C0<<16); + &mov (&DWP($bias+60,"esp"),0xB5E0<<16); +} + +if (!$x86only) {{{ + +&static_label("rem_4bit"); + +if (!$sse2) {{ # pure-MMX "May" version... + + # This code was removed since SSE2 is required for BoringSSL. 
The + # outer structure of the code was retained to minimize future merge + # conflicts. + +}} else {{ # "June" MMX version... + # ... has slower "April" gcm_gmult_4bit_mmx with folded + # loop. This is done to conserve code size... +$S=16; # shift factor for rem_4bit + +sub mmx_loop() { +# MMX version performs 2.8 times better on P4 (see comment in non-MMX +# routine for further details), 40% better on Opteron and Core2, 50% +# better on PIII... In other words effort is considered to be well +# spent... + my $inp = shift; + my $rem_4bit = shift; + my $cnt = $Zhh; + my $nhi = $Zhl; + my $nlo = $Zlh; + my $rem = $Zll; + + my ($Zlo,$Zhi) = ("mm0","mm1"); + my $tmp = "mm2"; + + &xor ($nlo,$nlo); # avoid partial register stalls on PIII + &mov ($nhi,$Zll); + &mov (&LB($nlo),&LB($nhi)); + &mov ($cnt,14); + &shl (&LB($nlo),4); + &and ($nhi,0xf0); + &movq ($Zlo,&QWP(8,$Htbl,$nlo)); + &movq ($Zhi,&QWP(0,$Htbl,$nlo)); + &movd ($rem,$Zlo); + &jmp (&label("mmx_loop")); + + &set_label("mmx_loop",16); + &psrlq ($Zlo,4); + &and ($rem,0xf); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); + &mov (&LB($nlo),&BP(0,$inp,$cnt)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &dec ($cnt); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); + &mov ($nhi,$nlo); + &pxor ($Zlo,$tmp); + &js (&label("mmx_break")); + + &shl (&LB($nlo),4); + &and ($rem,0xf); + &psrlq ($Zlo,4); + &and ($nhi,0xf0); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); + &pxor ($Zlo,$tmp); + &jmp (&label("mmx_loop")); + + &set_label("mmx_break",16); + &shl (&LB($nlo),4); + &and ($rem,0xf); + &psrlq ($Zlo,4); + &and ($nhi,0xf0); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); + &pxor ($Zlo,$tmp); + + &psrlq ($Zlo,4); + &and ($rem,0xf); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); + &pxor ($Zlo,$tmp); + + &psrlq ($Zlo,32); # lower part of Zlo is already there + &movd ($Zhl,$Zhi); + &psrlq ($Zhi,32); + &movd ($Zlh,$Zlo); + &movd ($Zhh,$Zhi); + + &bswap ($Zll); + &bswap ($Zhl); + &bswap ($Zlh); + &bswap ($Zhh); +} + +&function_begin("gcm_gmult_4bit_mmx"); + &mov ($inp,&wparam(0)); # load Xi + &mov ($Htbl,&wparam(1)); # load Htable + + &call (&label("pic_point")); + &set_label("pic_point"); + &blindpop("eax"); + &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); + + &movz ($Zll,&BP(15,$inp)); + + &mmx_loop($inp,"eax"); + + &emms (); + &mov (&DWP(12,$inp),$Zll); + &mov (&DWP(4,$inp),$Zhl); + &mov (&DWP(8,$inp),$Zlh); + &mov (&DWP(0,$inp),$Zhh); +&function_end("gcm_gmult_4bit_mmx"); + +###################################################################### +# Below subroutine is "528B" variant of "4-bit" GCM GHASH function +# (see gcm128.c for details). It provides further 20-40% performance +# improvement over above mentioned "May" version. 
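As a fixed point of reference for the table-driven routines above (and the PCLMULQDQ path below), here is a minimal bit-at-a-time sketch of the GF(2^128) multiply that GHASH is built on, with GCM's reflected bit order and the 0xE1 reduction constant; the 4-bit code computes the same product a nibble at a time from its 256-byte per-key table. This is an illustrative reference, not part of this source:

    #include <stdint.h>
    #include <string.h>

    /* Z = X * Y in GCM's GF(2^128). Bits are numbered MSB-first within
     * bytes, per the GCM spec; reduction uses x^128 + x^7 + x^2 + x + 1,
     * which appears here as the byte constant 0xE1. */
    static void gf128_mul_ref(uint8_t Z[16], const uint8_t X[16], const uint8_t Y[16])
    {
        uint8_t V[16], R[16] = {0};
        memcpy(V, Y, 16);
        for (int i = 0; i < 128; i++) {
            if ((X[i >> 3] >> (7 - (i & 7))) & 1)     /* bit i of X, MSB-first */
                for (int j = 0; j < 16; j++)
                    R[j] ^= V[j];
            int carry = V[15] & 1;                    /* bit falling off ... */
            for (int j = 15; j > 0; j--)              /* ... during V >>= 1  */
                V[j] = (uint8_t)((V[j] >> 1) | (V[j - 1] << 7));
            V[0] >>= 1;
            if (carry)
                V[0] ^= 0xE1;                         /* modular reduction */
        }
        memcpy(Z, R, 16);
    }
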
+ +&static_label("rem_8bit"); + +&function_begin("gcm_ghash_4bit_mmx"); +{ my ($Zlo,$Zhi) = ("mm7","mm6"); + my $rem_8bit = "esi"; + my $Htbl = "ebx"; + + # parameter block + &mov ("eax",&wparam(0)); # Xi + &mov ("ebx",&wparam(1)); # Htable + &mov ("ecx",&wparam(2)); # inp + &mov ("edx",&wparam(3)); # len + &mov ("ebp","esp"); # original %esp + &call (&label("pic_point")); + &set_label ("pic_point"); + &blindpop ($rem_8bit); + &lea ($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit)); + + &sub ("esp",512+16+16); # allocate stack frame... + &and ("esp",-64); # ...and align it + &sub ("esp",16); # place for (u8)(H[]<<4) + + &add ("edx","ecx"); # pointer to the end of input + &mov (&DWP(528+16+0,"esp"),"eax"); # save Xi + &mov (&DWP(528+16+8,"esp"),"edx"); # save inp+len + &mov (&DWP(528+16+12,"esp"),"ebp"); # save original %esp + + { my @lo = ("mm0","mm1","mm2"); + my @hi = ("mm3","mm4","mm5"); + my @tmp = ("mm6","mm7"); + my ($off1,$off2,$i) = (0,0,); + + &add ($Htbl,128); # optimize for size + &lea ("edi",&DWP(16+128,"esp")); + &lea ("ebp",&DWP(16+256+128,"esp")); + + # decompose Htable (low and high parts are kept separately), + # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack... + for ($i=0;$i<18;$i++) { + + &mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16); + &movq ($lo[0],&QWP(16*$i+8-128,$Htbl)) if ($i<16); + &psllq ($tmp[1],60) if ($i>1); + &movq ($hi[0],&QWP(16*$i+0-128,$Htbl)) if ($i<16); + &por ($lo[2],$tmp[1]) if ($i>1); + &movq (&QWP($off1-128,"edi"),$lo[1]) if ($i>0 && $i<17); + &psrlq ($lo[1],4) if ($i>0 && $i<17); + &movq (&QWP($off1,"edi"),$hi[1]) if ($i>0 && $i<17); + &movq ($tmp[0],$hi[1]) if ($i>0 && $i<17); + &movq (&QWP($off2-128,"ebp"),$lo[2]) if ($i>1); + &psrlq ($hi[1],4) if ($i>0 && $i<17); + &movq (&QWP($off2,"ebp"),$hi[2]) if ($i>1); + &shl ("edx",4) if ($i<16); + &mov (&BP($i,"esp"),&LB("edx")) if ($i<16); + + unshift (@lo,pop(@lo)); # "rotate" registers + unshift (@hi,pop(@hi)); + unshift (@tmp,pop(@tmp)); + $off1 += 8 if ($i>0); + $off2 += 8 if ($i>1); + } + } + + &movq ($Zhi,&QWP(0,"eax")); + &mov ("ebx",&DWP(8,"eax")); + &mov ("edx",&DWP(12,"eax")); # load Xi + +&set_label("outer",16); + { my $nlo = "eax"; + my $dat = "edx"; + my @nhi = ("edi","ebp"); + my @rem = ("ebx","ecx"); + my @red = ("mm0","mm1","mm2"); + my $tmp = "mm3"; + + &xor ($dat,&DWP(12,"ecx")); # merge input data + &xor ("ebx",&DWP(8,"ecx")); + &pxor ($Zhi,&QWP(0,"ecx")); + &lea ("ecx",&DWP(16,"ecx")); # inp+=16 + #&mov (&DWP(528+12,"esp"),$dat); # save inp^Xi + &mov (&DWP(528+8,"esp"),"ebx"); + &movq (&QWP(528+0,"esp"),$Zhi); + &mov (&DWP(528+16+4,"esp"),"ecx"); # save inp + + &xor ($nlo,$nlo); + &rol ($dat,8); + &mov (&LB($nlo),&LB($dat)); + &mov ($nhi[1],$nlo); + &and (&LB($nlo),0x0f); + &shr ($nhi[1],4); + &pxor ($red[0],$red[0]); + &rol ($dat,8); # next byte + &pxor ($red[1],$red[1]); + &pxor ($red[2],$red[2]); + + # Just like in "May" version modulo-schedule for critical path in + # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor' + # is scheduled so late that rem_8bit[] has to be shifted *right* + # by 16, which is why last argument to pinsrw is 2, which + # corresponds to <<32=<<48>>16... 
+ for ($j=11,$i=0;$i<15;$i++) { + + if ($i>0) { + &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] + &rol ($dat,8); # next byte + &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); + + &pxor ($Zlo,$tmp); + &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); + &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) + } else { + &movq ($Zlo,&QWP(16,"esp",$nlo,8)); + &movq ($Zhi,&QWP(16+128,"esp",$nlo,8)); + } + + &mov (&LB($nlo),&LB($dat)); + &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0); + + &movd ($rem[0],$Zlo); + &movz ($rem[1],&LB($rem[1])) if ($i>0); + &psrlq ($Zlo,8); # Z>>=8 + + &movq ($tmp,$Zhi); + &mov ($nhi[0],$nlo); + &psrlq ($Zhi,8); + + &pxor ($Zlo,&QWP(16+256+0,"esp",$nhi[1],8)); # Z^=H[nhi]>>4 + &and (&LB($nlo),0x0f); + &psllq ($tmp,56); + + &pxor ($Zhi,$red[1]) if ($i>1); + &shr ($nhi[0],4); + &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2) if ($i>0); + + unshift (@red,pop(@red)); # "rotate" registers + unshift (@rem,pop(@rem)); + unshift (@nhi,pop(@nhi)); + } + + &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] + &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); + &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) + + &pxor ($Zlo,$tmp); + &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); + &movz ($rem[1],&LB($rem[1])); + + &pxor ($red[2],$red[2]); # clear 2nd word + &psllq ($red[1],4); + + &movd ($rem[0],$Zlo); + &psrlq ($Zlo,4); # Z>>=4 + + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &shl ($rem[0],4); # rem<<4 + + &pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi] + &psllq ($tmp,60); + &movz ($rem[0],&LB($rem[0])); + + &pxor ($Zlo,$tmp); + &pxor ($Zhi,&QWP(16+128,"esp",$nhi[1],8)); + + &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2); + &pxor ($Zhi,$red[1]); + + &movd ($dat,$Zlo); + &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48 + + &psllq ($red[0],12); # correct by <<16>>4 + &pxor ($Zhi,$red[0]); + &psrlq ($Zlo,32); + &pxor ($Zhi,$red[2]); + + &mov ("ecx",&DWP(528+16+4,"esp")); # restore inp + &movd ("ebx",$Zlo); + &movq ($tmp,$Zhi); # 01234567 + &psllw ($Zhi,8); # 1.3.5.7. + &psrlw ($tmp,8); # .0.2.4.6 + &por ($Zhi,$tmp); # 10325476 + &bswap ($dat); + &pshufw ($Zhi,$Zhi,0b00011011); # 76543210 + &bswap ("ebx"); + + &cmp ("ecx",&DWP(528+16+8,"esp")); # are we done? + &jne (&label("outer")); + } + + &mov ("eax",&DWP(528+16+0,"esp")); # restore Xi + &mov (&DWP(12,"eax"),"edx"); + &mov (&DWP(8,"eax"),"ebx"); + &movq (&QWP(0,"eax"),$Zhi); + + &mov ("esp",&DWP(528+16+12,"esp")); # restore original %esp + &emms (); +} +&function_end("gcm_ghash_4bit_mmx"); +}} + +if ($sse2) {{ +###################################################################### +# PCLMULQDQ version. + +$Xip="eax"; +$Htbl="edx"; +$const="ecx"; +$inp="esi"; +$len="ebx"; + +($Xi,$Xhi)=("xmm0","xmm1"); $Hkey="xmm2"; +($T1,$T2,$T3)=("xmm3","xmm4","xmm5"); +($Xn,$Xhn)=("xmm6","xmm7"); + +&static_label("bswap"); + +sub clmul64x64_T2 { # minimal "register" pressure +my ($Xhi,$Xi,$Hkey,$HK)=@_; + + &movdqa ($Xhi,$Xi); # + &pshufd ($T1,$Xi,0b01001110); + &pshufd ($T2,$Hkey,0b01001110) if (!defined($HK)); + &pxor ($T1,$Xi); # + &pxor ($T2,$Hkey) if (!defined($HK)); + $HK=$T2 if (!defined($HK)); + + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pclmulqdq ($T1,$HK,0x00); ####### + &xorps ($T1,$Xi); # + &xorps ($T1,$Xhi); # + + &movdqa ($T2,$T1); # + &psrldq ($T1,8); + &pslldq ($T2,8); # + &pxor ($Xhi,$T1); + &pxor ($Xi,$T2); # +} + +sub clmul64x64_T3 { +# Even though this subroutine offers visually better ILP, it +# was empirically found to be a tad slower than above version. 
+# At least in gcm_ghash_clmul context. But it's just as well, +# because loop modulo-scheduling is possible only thanks to +# minimized "register" pressure... +my ($Xhi,$Xi,$Hkey)=@_; + + &movdqa ($T1,$Xi); # + &movdqa ($Xhi,$Xi); + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pshufd ($T2,$T1,0b01001110); # + &pshufd ($T3,$Hkey,0b01001110); + &pxor ($T2,$T1); # + &pxor ($T3,$Hkey); + &pclmulqdq ($T2,$T3,0x00); ####### + &pxor ($T2,$Xi); # + &pxor ($T2,$Xhi); # + + &movdqa ($T3,$T2); # + &psrldq ($T2,8); + &pslldq ($T3,8); # + &pxor ($Xhi,$T2); + &pxor ($Xi,$T3); # +} + +if (1) { # Algorithm 9 with <<1 twist. + # Reduction is shorter and uses only two + # temporary registers, which makes it better + # candidate for interleaving with 64x64 + # multiplication. Pre-modulo-scheduled loop + # was found to be ~20% faster than Algorithm 5 + # below. Algorithm 9 was therefore chosen for + # further optimization... + +sub reduction_alg9 { # 17/11 times faster than Intel version +my ($Xhi,$Xi) = @_; + + # 1st phase + &movdqa ($T2,$Xi); # + &movdqa ($T1,$Xi); + &psllq ($Xi,5); + &pxor ($T1,$Xi); # + &psllq ($Xi,1); + &pxor ($Xi,$T1); # + &psllq ($Xi,57); # + &movdqa ($T1,$Xi); # + &pslldq ($Xi,8); + &psrldq ($T1,8); # + &pxor ($Xi,$T2); + &pxor ($Xhi,$T1); # + + # 2nd phase + &movdqa ($T2,$Xi); + &psrlq ($Xi,1); + &pxor ($Xhi,$T2); # + &pxor ($T2,$Xi); + &psrlq ($Xi,5); + &pxor ($Xi,$T2); # + &psrlq ($Xi,1); # + &pxor ($Xi,$Xhi) # +} + +&function_begin_B("gcm_init_clmul"); + &mov ($Htbl,&wparam(0)); + &mov ($Xip,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Hkey,&QWP(0,$Xip)); + &pshufd ($Hkey,$Hkey,0b01001110);# dword swap + + # <<1 twist + &pshufd ($T2,$Hkey,0b11111111); # broadcast uppermost dword + &movdqa ($T1,$Hkey); + &psllq ($Hkey,1); + &pxor ($T3,$T3); # + &psrlq ($T1,63); + &pcmpgtd ($T3,$T2); # broadcast carry bit + &pslldq ($T1,8); + &por ($Hkey,$T1); # H<<=1 + + # magic reduction + &pand ($T3,&QWP(16,$const)); # 0x1c2_polynomial + &pxor ($Hkey,$T3); # if(carry) H^=0x1c2_polynomial + + # calculate H^2 + &movdqa ($Xi,$Hkey); + &clmul64x64_T2 ($Xhi,$Xi,$Hkey); + &reduction_alg9 ($Xhi,$Xi); + + &pshufd ($T1,$Hkey,0b01001110); + &pshufd ($T2,$Xi,0b01001110); + &pxor ($T1,$Hkey); # Karatsuba pre-processing + &movdqu (&QWP(0,$Htbl),$Hkey); # save H + &pxor ($T2,$Xi); # Karatsuba pre-processing + &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 + &palignr ($T2,$T1,8); # low part is H.lo^H.hi + &movdqu (&QWP(32,$Htbl),$T2); # save Karatsuba "salt" + + &ret (); +&function_end_B("gcm_init_clmul"); + +&function_begin_B("gcm_gmult_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa ($T3,&QWP(0,$const)); + &movups ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$T3); + &movups ($T2,&QWP(32,$Htbl)); + + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2); + &reduction_alg9 ($Xhi,$Xi); + + &pshufb ($Xi,$T3); + &movdqu (&QWP(0,$Xip),$Xi); + + &ret (); +&function_end_B("gcm_gmult_clmul"); + +&function_begin("gcm_ghash_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + &mov ($inp,&wparam(2)); + &mov ($len,&wparam(3)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa 
($T3,&QWP(0,$const)); + &movdqu ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$T3); + + &sub ($len,0x10); + &jz (&label("odd_tail")); + + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # + &movdqu ($T1,&QWP(0,$inp)); # Ii + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pshufb ($T1,$T3); + &pshufb ($Xn,$T3); + &movdqu ($T3,&QWP(32,$Htbl)); + &pxor ($Xi,$T1); # Ii+Xi + + &pshufd ($T1,$Xn,0b01001110); # H*Ii+1 + &movdqa ($Xhn,$Xn); + &pxor ($T1,$Xn); # + &lea ($inp,&DWP(32,$inp)); # i+=2 + + &pclmulqdq ($Xn,$Hkey,0x00); ####### + &pclmulqdq ($Xhn,$Hkey,0x11); ####### + &pclmulqdq ($T1,$T3,0x00); ####### + &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 + &nop (); + + &sub ($len,0x20); + &jbe (&label("even_tail")); + &jmp (&label("mod_loop")); + +&set_label("mod_loop",32); + &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi) + &movdqa ($Xhi,$Xi); + &pxor ($T2,$Xi); # + &nop (); + + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pclmulqdq ($T2,$T3,0x10); ####### + &movups ($Hkey,&QWP(0,$Htbl)); # load H + + &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &movdqa ($T3,&QWP(0,$const)); + &xorps ($Xhi,$Xhn); + &movdqu ($Xhn,&QWP(0,$inp)); # Ii + &pxor ($T1,$Xi); # aggregated Karatsuba post-processing + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pxor ($T1,$Xhi); # + + &pshufb ($Xhn,$T3); + &pxor ($T2,$T1); # + + &movdqa ($T1,$T2); # + &psrldq ($T2,8); + &pslldq ($T1,8); # + &pxor ($Xhi,$T2); + &pxor ($Xi,$T1); # + &pshufb ($Xn,$T3); + &pxor ($Xhi,$Xhn); # "Ii+Xi", consume early + + &movdqa ($Xhn,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1 + &movdqa ($T2,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase + &movdqa ($T1,$Xi); + &psllq ($Xi,5); + &pxor ($T1,$Xi); # + &psllq ($Xi,1); + &pxor ($Xi,$T1); # + &pclmulqdq ($Xn,$Hkey,0x00); ####### + &movups ($T3,&QWP(32,$Htbl)); + &psllq ($Xi,57); # + &movdqa ($T1,$Xi); # + &pslldq ($Xi,8); + &psrldq ($T1,8); # + &pxor ($Xi,$T2); + &pxor ($Xhi,$T1); # + &pshufd ($T1,$Xhn,0b01001110); + &movdqa ($T2,$Xi); # 2nd phase + &psrlq ($Xi,1); + &pxor ($T1,$Xhn); + &pxor ($Xhi,$T2); # + &pclmulqdq ($Xhn,$Hkey,0x11); ####### + &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 + &pxor ($T2,$Xi); + &psrlq ($Xi,5); + &pxor ($Xi,$T2); # + &psrlq ($Xi,1); # + &pxor ($Xi,$Xhi) # + &pclmulqdq ($T1,$T3,0x00); ####### + + &lea ($inp,&DWP(32,$inp)); + &sub ($len,0x20); + &ja (&label("mod_loop")); + +&set_label("even_tail"); + &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi) + &movdqa ($Xhi,$Xi); + &pxor ($T2,$Xi); # + + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pclmulqdq ($T2,$T3,0x10); ####### + &movdqa ($T3,&QWP(0,$const)); + + &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &xorps ($Xhi,$Xhn); + &pxor ($T1,$Xi); # aggregated Karatsuba post-processing + &pxor ($T1,$Xhi); # + + &pxor ($T2,$T1); # + + &movdqa ($T1,$T2); # + &psrldq ($T2,8); + &pslldq ($T1,8); # + &pxor ($Xhi,$T2); + &pxor ($Xi,$T1); # + + &reduction_alg9 ($Xhi,$Xi); + + &test ($len,$len); + &jnz (&label("done")); + + &movups ($Hkey,&QWP(0,$Htbl)); # load H +&set_label("odd_tail"); + &movdqu ($T1,&QWP(0,$inp)); # Ii + &pshufb ($T1,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) + &reduction_alg9 ($Xhi,$Xi); + +&set_label("done"); + &pshufb ($Xi,$T3); + &movdqu (&QWP(0,$Xip),$Xi); +&function_end("gcm_ghash_clmul"); + +} else { # Algorithm 5. Kept for reference purposes. 
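Algorithm 9 above and Algorithm 5 below differ only in how they fold the 256-bit carry-less product back into 128 bits; the operation both implement is multiplication in GF(2^128) modulo x^128+x^7+x^2+x+1. As a baseline, a bit-serial C sketch following the NIST SP 800-38D definition (illustrative only; the 0xE1 reduction constant, shifted left one bit to match the "<<1 twist", is the 0x1c2_polynomial constant used above):

    #include <stdint.h>

    /* Z = X * Y in GF(2^128), bit-reflected GCM convention. All values
     * are two big-endian 64-bit words, [0] = high half. */
    static void gf128_mul(uint64_t Z[2],
                          const uint64_t X[2], const uint64_t Y[2]) {
        uint64_t Vh = Y[0], Vl = Y[1];  /* V runs through Y, Y*x, Y*x^2, ... */
        uint64_t Zh = 0, Zl = 0;

        for (int i = 0; i < 128; i++) {
            /* bit i of X, counting from the MSB of X[0] */
            if ((i < 64 ? X[0] >> (63 - i) : X[1] >> (127 - i)) & 1) {
                Zh ^= Vh; Zl ^= Vl;
            }
            uint64_t carry = Vl & 1;                 /* x^127 coefficient */
            Vl = (Vl >> 1) | (Vh << 63);             /* V *= x */
            Vh >>= 1;
            if (carry) Vh ^= 0xe100000000000000ULL;  /* R = 0xE1 || 0^120 */
        }
        Z[0] = Zh; Z[1] = Zl;
    }

Everything in these files is an optimization of this loop: the table methods trade the per-bit reduction for per-nibble or per-byte remainder tables, and the PCLMULQDQ paths compute the whole product first and reduce it in two shift-and-XOR phases.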
+ +sub reduction_alg5 { # 19/16 times faster than Intel version +my ($Xhi,$Xi)=@_; + + # <<1 + &movdqa ($T1,$Xi); # + &movdqa ($T2,$Xhi); + &pslld ($Xi,1); + &pslld ($Xhi,1); # + &psrld ($T1,31); + &psrld ($T2,31); # + &movdqa ($T3,$T1); + &pslldq ($T1,4); + &psrldq ($T3,12); # + &pslldq ($T2,4); + &por ($Xhi,$T3); # + &por ($Xi,$T1); + &por ($Xhi,$T2); # + + # 1st phase + &movdqa ($T1,$Xi); + &movdqa ($T2,$Xi); + &movdqa ($T3,$Xi); # + &pslld ($T1,31); + &pslld ($T2,30); + &pslld ($Xi,25); # + &pxor ($T1,$T2); + &pxor ($T1,$Xi); # + &movdqa ($T2,$T1); # + &pslldq ($T1,12); + &psrldq ($T2,4); # + &pxor ($T3,$T1); + + # 2nd phase + &pxor ($Xhi,$T3); # + &movdqa ($Xi,$T3); + &movdqa ($T1,$T3); + &psrld ($Xi,1); # + &psrld ($T1,2); + &psrld ($T3,7); # + &pxor ($Xi,$T1); + &pxor ($Xhi,$T2); + &pxor ($Xi,$T3); # + &pxor ($Xi,$Xhi); # +} + +&function_begin_B("gcm_init_clmul"); + &mov ($Htbl,&wparam(0)); + &mov ($Xip,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Hkey,&QWP(0,$Xip)); + &pshufd ($Hkey,$Hkey,0b01001110);# dword swap + + # calculate H^2 + &movdqa ($Xi,$Hkey); + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); + &reduction_alg5 ($Xhi,$Xi); + + &movdqu (&QWP(0,$Htbl),$Hkey); # save H + &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 + + &ret (); +&function_end_B("gcm_init_clmul"); + +&function_begin_B("gcm_gmult_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa ($Xn,&QWP(0,$const)); + &movdqu ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$Xn); + + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); + &reduction_alg5 ($Xhi,$Xi); + + &pshufb ($Xi,$Xn); + &movdqu (&QWP(0,$Xip),$Xi); + + &ret (); +&function_end_B("gcm_gmult_clmul"); + +&function_begin("gcm_ghash_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + &mov ($inp,&wparam(2)); + &mov ($len,&wparam(3)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa ($T3,&QWP(0,$const)); + &movdqu ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$T3); + + &sub ($len,0x10); + &jz (&label("odd_tail")); + + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # + &movdqu ($T1,&QWP(0,$inp)); # Ii + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pshufb ($T1,$T3); + &pshufb ($Xn,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 + &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 + + &sub ($len,0x20); + &lea ($inp,&DWP(32,$inp)); # i+=2 + &jbe (&label("even_tail")); + +&set_label("mod_loop"); + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) + &movdqu ($Hkey,&QWP(0,$Htbl)); # load H + + &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &pxor ($Xhi,$Xhn); + + &reduction_alg5 ($Xhi,$Xi); + + ####### + &movdqa ($T3,&QWP(0,$const)); + &movdqu ($T1,&QWP(0,$inp)); # Ii + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pshufb ($T1,$T3); + &pshufb ($Xn,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 + &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 + + &sub ($len,0x20); + &lea ($inp,&DWP(32,$inp)); + &ja (&label("mod_loop")); + +&set_label("even_tail"); + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) + + &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &pxor ($Xhi,$Xhn); + + &reduction_alg5 ($Xhi,$Xi); + + 
&movdqa ($T3,&QWP(0,$const)); + &test ($len,$len); + &jnz (&label("done")); + + &movdqu ($Hkey,&QWP(0,$Htbl)); # load H +&set_label("odd_tail"); + &movdqu ($T1,&QWP(0,$inp)); # Ii + &pshufb ($T1,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) + &reduction_alg5 ($Xhi,$Xi); + + &movdqa ($T3,&QWP(0,$const)); +&set_label("done"); + &pshufb ($Xi,$T3); + &movdqu (&QWP(0,$Xip),$Xi); +&function_end("gcm_ghash_clmul"); + +} + +&set_label("bswap",64); + &data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0); + &data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial +&set_label("rem_8bit",64); + &data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E); + &data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E); + &data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E); + &data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E); + &data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E); + &data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E); + &data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E); + &data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E); + &data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE); + &data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE); + &data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE); + &data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE); + &data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E); + &data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E); + &data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE); + &data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE); + &data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E); + &data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E); + &data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E); + &data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E); + &data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E); + &data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E); + &data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E); + &data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E); + &data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE); + &data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE); + &data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE); + &data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE); + &data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E); + &data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E); + &data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE); + &data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE); +}} # $sse2 + +&set_label("rem_4bit",64); + &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S); + &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S); + &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S); + &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S); +}}} # !$x86only + +&asciz("GHASH for x86, CRYPTOGAMS by "); +&asm_finish(); + +close STDOUT; + +# A question was risen about choice of vanilla MMX. Or rather why wasn't +# SSE2 chosen instead? 
In addition to the fact that MMX runs on legacy +# CPUs such as PIII, "4-bit" MMX version was observed to provide better +# performance than *corresponding* SSE2 one even on contemporary CPUs. +# SSE2 results were provided by Peter-Michael Hager. He maintains SSE2 +# implementation featuring full range of lookup-table sizes, but with +# per-invocation lookup table setup. Latter means that table size is +# chosen depending on how much data is to be hashed in every given call, +# more data - larger table. Best reported result for Core2 is ~4 cycles +# per processed byte out of 64KB block. This number accounts even for +# 64KB table setup overhead. As discussed in gcm128.c we choose to be +# more conservative in respect to lookup table sizes, but how do the +# results compare? Minimalistic "256B" MMX version delivers ~11 cycles +# on same platform. As also discussed in gcm128.c, next in line "8-bit +# Shoup's" or "4KB" method should deliver twice the performance of +# "256B" one, in other words not worse than ~6 cycles per byte. It +# should be also be noted that in SSE2 case improvement can be "super- +# linear," i.e. more than twice, mostly because >>8 maps to single +# instruction on SSE2 register. This is unlike "4-bit" case when >>4 +# maps to same amount of instructions in both MMX and SSE2 cases. +# Bottom line is that switch to SSE2 is considered to be justifiable +# only in case we choose to implement "8-bit" method... diff --git a/crypto/aesgcm/ghash-x86_64.pl b/crypto/aesgcm/ghash-x86_64.pl new file mode 100644 index 0000000..ad94168 --- /dev/null +++ b/crypto/aesgcm/ghash-x86_64.pl @@ -0,0 +1,1766 @@ +#! /usr/bin/env perl +# Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# March, June 2010 +# +# The module implements "4-bit" GCM GHASH function and underlying +# single multiplication operation in GF(2^128). "4-bit" means that +# it uses 256 bytes per-key table [+128 bytes shared table]. GHASH +# function features so called "528B" variant utilizing additional +# 256+16 bytes of per-key storage [+512 bytes shared table]. +# Performance results are for this streamed GHASH subroutine and are +# expressed in cycles per processed byte, less is better: +# +# gcc 3.4.x(*) assembler +# +# P4 28.6 14.0 +100% +# Opteron 19.3 7.7 +150% +# Core2 17.8 8.1(**) +120% +# Atom 31.6 16.8 +88% +# VIA Nano 21.8 10.1 +115% +# +# (*) comparison is not completely fair, because C results are +# for vanilla "256B" implementation, while assembler results +# are for "528B";-) +# (**) it's mystery [to me] why Core2 result is not same as for +# Opteron; + +# May 2010 +# +# Add PCLMULQDQ version performing at 2.02 cycles per processed byte. +# See ghash-x86.pl for background information and details about coding +# techniques. 
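The kernel of every PCLMULQDQ path is the same three-multiplication Karatsuba product that clmul64x64_T2 emits as assembly. A hedged C intrinsics sketch of just that 128x128 -> 256-bit step, reduction omitted (clmul128 is an illustrative name; compile with -mpclmul -msse2):

    #include <emmintrin.h>
    #include <wmmintrin.h>

    /* Carry-less a*b -> (hi:lo), three PCLMULQDQs instead of four:
     * a*b = ah*bh*x^128 ^ ((ah^al)*(bh^bl) ^ ah*bh ^ al*bl)*x^64 ^ al*bl */
    static void clmul128(__m128i a, __m128i b, __m128i *lo, __m128i *hi) {
        __m128i t1 = _mm_xor_si128(a, _mm_shuffle_epi32(a, 0x4e)); /* al^ah */
        __m128i t2 = _mm_xor_si128(b, _mm_shuffle_epi32(b, 0x4e)); /* bl^bh */
        __m128i l  = _mm_clmulepi64_si128(a, b, 0x00);   /* al*bl */
        __m128i h  = _mm_clmulepi64_si128(a, b, 0x11);   /* ah*bh */
        __m128i m  = _mm_clmulepi64_si128(t1, t2, 0x00); /* middle term */
        m   = _mm_xor_si128(m, _mm_xor_si128(l, h));     /* Karatsuba fixup */
        *lo = _mm_xor_si128(l, _mm_slli_si128(m, 8));    /* fold m*x^64 in */
        *hi = _mm_xor_si128(h, _mm_srli_si128(m, 8));
    }

Note that 0x4e is the same 0b01001110 pshufd immediate used throughout the file to swap 64-bit halves, and the "aggregated Karatsuba post-processing" seen in the loops below simply defers the fixup XORs across several such products.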
+# +# Special thanks to David Woodhouse for +# providing access to a Westmere-based system on behalf of Intel +# Open Source Technology Centre. + +# December 2012 +# +# Overhaul: aggregate Karatsuba post-processing, improve ILP in +# reduction_alg9, increase reduction aggregate factor to 4x. As for +# the latter. ghash-x86.pl discusses that it makes lesser sense to +# increase aggregate factor. Then why increase here? Critical path +# consists of 3 independent pclmulqdq instructions, Karatsuba post- +# processing and reduction. "On top" of this we lay down aggregated +# multiplication operations, triplets of independent pclmulqdq's. As +# issue rate for pclmulqdq is limited, it makes lesser sense to +# aggregate more multiplications than it takes to perform remaining +# non-multiplication operations. 2x is near-optimal coefficient for +# contemporary Intel CPUs (therefore modest improvement coefficient), +# but not for Bulldozer. Latter is because logical SIMD operations +# are twice as slow in comparison to Intel, so that critical path is +# longer. A CPU with higher pclmulqdq issue rate would also benefit +# from higher aggregate factor... +# +# Westmere 1.78(+13%) +# Sandy Bridge 1.80(+8%) +# Ivy Bridge 1.80(+7%) +# Haswell 0.55(+93%) (if system doesn't support AVX) +# Broadwell 0.45(+110%)(if system doesn't support AVX) +# Skylake 0.44(+110%)(if system doesn't support AVX) +# Bulldozer 1.49(+27%) +# Silvermont 2.88(+13%) +# Knights L 2.12(-) (if system doesn't support AVX) +# Goldmont 1.08(+24%) + +# March 2013 +# +# ... 8x aggregate factor AVX code path is using reduction algorithm +# suggested by Shay Gueron[1]. Even though contemporary AVX-capable +# CPUs such as Sandy and Ivy Bridge can execute it, the code performs +# sub-optimally in comparison to above mentioned version. But thanks +# to Ilya Albrekht and Max Locktyukhin of Intel Corp. we knew that +# it performs in 0.41 cycles per byte on Haswell processor, in +# 0.29 on Broadwell, and in 0.36 on Skylake. +# +# Knights Landing achieves 1.09 cpb. +# +# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest + +$flavour = shift; +$output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/); + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../x86_64-xlate.pl" and -f $xlate) or +die "can't locate x86_64-xlate.pl"; + +# See the notes about |$avx| in aesni-gcm-x86_64.pl; otherwise tags will be +# computed incorrectly. +# +# In upstream, this is controlled by shelling out to the compiler to check +# versions, but BoringSSL is intended to be used with pre-generated perlasm +# output, so this isn't useful anyway. 
+$avx = 1; + +open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""; +*STDOUT=*OUT; + +$do4xaggr=1; + +# common register layout +$nlo="%rax"; +$nhi="%rbx"; +$Zlo="%r8"; +$Zhi="%r9"; +$tmp="%r10"; +$rem_4bit = "%r11"; + +$Xi="%rdi"; +$Htbl="%rsi"; + +# per-function register layout +$cnt="%rcx"; +$rem="%rdx"; + +sub LB() { my $r=shift; $r =~ s/%[er]([a-d])x/%\1l/ or + $r =~ s/%[er]([sd]i)/%\1l/ or + $r =~ s/%[er](bp)/%\1l/ or + $r =~ s/%(r[0-9]+)[d]?/%\1b/; $r; } + +sub AUTOLOAD() # thunk [simplified] 32-bit style perlasm +{ my $opcode = $AUTOLOAD; $opcode =~ s/.*:://; + my $arg = pop; + $arg = "\$$arg" if ($arg*1 eq $arg); + $code .= "\t$opcode\t".join(',',$arg,reverse @_)."\n"; +} + +{ my $N; + sub loop() { + my $inp = shift; + + $N++; +$code.=<<___; + xor $nlo,$nlo + xor $nhi,$nhi + mov `&LB("$Zlo")`,`&LB("$nlo")` + mov `&LB("$Zlo")`,`&LB("$nhi")` + shl \$4,`&LB("$nlo")` + mov \$14,$cnt + mov 8($Htbl,$nlo),$Zlo + mov ($Htbl,$nlo),$Zhi + and \$0xf0,`&LB("$nhi")` + mov $Zlo,$rem + jmp .Loop$N + +.align 16 +.Loop$N: + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + mov ($inp,$cnt),`&LB("$nlo")` + shr \$4,$Zhi + xor 8($Htbl,$nhi),$Zlo + shl \$60,$tmp + xor ($Htbl,$nhi),$Zhi + mov `&LB("$nlo")`,`&LB("$nhi")` + xor ($rem_4bit,$rem,8),$Zhi + mov $Zlo,$rem + shl \$4,`&LB("$nlo")` + xor $tmp,$Zlo + dec $cnt + js .Lbreak$N + + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + shr \$4,$Zhi + xor 8($Htbl,$nlo),$Zlo + shl \$60,$tmp + xor ($Htbl,$nlo),$Zhi + and \$0xf0,`&LB("$nhi")` + xor ($rem_4bit,$rem,8),$Zhi + mov $Zlo,$rem + xor $tmp,$Zlo + jmp .Loop$N + +.align 16 +.Lbreak$N: + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + shr \$4,$Zhi + xor 8($Htbl,$nlo),$Zlo + shl \$60,$tmp + xor ($Htbl,$nlo),$Zhi + and \$0xf0,`&LB("$nhi")` + xor ($rem_4bit,$rem,8),$Zhi + mov $Zlo,$rem + xor $tmp,$Zlo + + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + shr \$4,$Zhi + xor 8($Htbl,$nhi),$Zlo + shl \$60,$tmp + xor ($Htbl,$nhi),$Zhi + xor $tmp,$Zlo + xor ($rem_4bit,$rem,8),$Zhi + + bswap $Zlo + bswap $Zhi +___ +}} + +$code=<<___; +.text +#.extern OPENSSL_ia32cap_P + +.globl gcm_gmult_4bit +.type gcm_gmult_4bit,\@function,2 +.align 16 +gcm_gmult_4bit: + push %rbx + push %rbp # %rbp and others are pushed exclusively in + push %r12 # order to reuse Win64 exception handler... 
+ push %r13 + push %r14 + push %r15 + sub \$280,%rsp +.Lgmult_prologue: + + movzb 15($Xi),$Zlo + lea .Lrem_4bit(%rip),$rem_4bit +___ + &loop ($Xi); +$code.=<<___; + mov $Zlo,8($Xi) + mov $Zhi,($Xi) + + lea 280+48(%rsp),%rsi + mov -8(%rsi),%rbx + lea (%rsi),%rsp +.Lgmult_epilogue: + ret +.size gcm_gmult_4bit,.-gcm_gmult_4bit +___ + +# per-function register layout +$inp="%rdx"; +$len="%rcx"; +$rem_8bit=$rem_4bit; + +$code.=<<___; +.globl gcm_ghash_4bit +.type gcm_ghash_4bit,\@function,4 +.align 16 +gcm_ghash_4bit: + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + sub \$280,%rsp +.Lghash_prologue: + mov $inp,%r14 # reassign couple of args + mov $len,%r15 +___ +{ my $inp="%r14"; + my $dat="%edx"; + my $len="%r15"; + my @nhi=("%ebx","%ecx"); + my @rem=("%r12","%r13"); + my $Hshr4="%rbp"; + + &sub ($Htbl,-128); # size optimization + &lea ($Hshr4,"16+128(%rsp)"); + { my @lo =($nlo,$nhi); + my @hi =($Zlo,$Zhi); + + &xor ($dat,$dat); + for ($i=0,$j=-2;$i<18;$i++,$j++) { + &mov ("$j(%rsp)",&LB($dat)) if ($i>1); + &or ($lo[0],$tmp) if ($i>1); + &mov (&LB($dat),&LB($lo[1])) if ($i>0 && $i<17); + &shr ($lo[1],4) if ($i>0 && $i<17); + &mov ($tmp,$hi[1]) if ($i>0 && $i<17); + &shr ($hi[1],4) if ($i>0 && $i<17); + &mov ("8*$j($Hshr4)",$hi[0]) if ($i>1); + &mov ($hi[0],"16*$i+0-128($Htbl)") if ($i<16); + &shl (&LB($dat),4) if ($i>0 && $i<17); + &mov ("8*$j-128($Hshr4)",$lo[0]) if ($i>1); + &mov ($lo[0],"16*$i+8-128($Htbl)") if ($i<16); + &shl ($tmp,60) if ($i>0 && $i<17); + + push (@lo,shift(@lo)); + push (@hi,shift(@hi)); + } + } + &add ($Htbl,-128); + &mov ($Zlo,"8($Xi)"); + &mov ($Zhi,"0($Xi)"); + &add ($len,$inp); # pointer to the end of data + &lea ($rem_8bit,".Lrem_8bit(%rip)"); + &jmp (".Louter_loop"); + +$code.=".align 16\n.Louter_loop:\n"; + &xor ($Zhi,"($inp)"); + &mov ("%rdx","8($inp)"); + &lea ($inp,"16($inp)"); + &xor ("%rdx",$Zlo); + &mov ("($Xi)",$Zhi); + &mov ("8($Xi)","%rdx"); + &shr ("%rdx",32); + + &xor ($nlo,$nlo); + &rol ($dat,8); + &mov (&LB($nlo),&LB($dat)); + &movz ($nhi[0],&LB($dat)); + &shl (&LB($nlo),4); + &shr ($nhi[0],4); + + for ($j=11,$i=0;$i<15;$i++) { + &rol ($dat,8); + &xor ($Zlo,"8($Htbl,$nlo)") if ($i>0); + &xor ($Zhi,"($Htbl,$nlo)") if ($i>0); + &mov ($Zlo,"8($Htbl,$nlo)") if ($i==0); + &mov ($Zhi,"($Htbl,$nlo)") if ($i==0); + + &mov (&LB($nlo),&LB($dat)); + &xor ($Zlo,$tmp) if ($i>0); + &movzw ($rem[1],"($rem_8bit,$rem[1],2)") if ($i>0); + + &movz ($nhi[1],&LB($dat)); + &shl (&LB($nlo),4); + &movzb ($rem[0],"(%rsp,$nhi[0])"); + + &shr ($nhi[1],4) if ($i<14); + &and ($nhi[1],0xf0) if ($i==14); + &shl ($rem[1],48) if ($i>0); + &xor ($rem[0],$Zlo); + + &mov ($tmp,$Zhi); + &xor ($Zhi,$rem[1]) if ($i>0); + &shr ($Zlo,8); + + &movz ($rem[0],&LB($rem[0])); + &mov ($dat,"$j($Xi)") if (--$j%4==0); + &shr ($Zhi,8); + + &xor ($Zlo,"-128($Hshr4,$nhi[0],8)"); + &shl ($tmp,56); + &xor ($Zhi,"($Hshr4,$nhi[0],8)"); + + unshift (@nhi,pop(@nhi)); # "rotate" registers + unshift (@rem,pop(@rem)); + } + &movzw ($rem[1],"($rem_8bit,$rem[1],2)"); + &xor ($Zlo,"8($Htbl,$nlo)"); + &xor ($Zhi,"($Htbl,$nlo)"); + + &shl ($rem[1],48); + &xor ($Zlo,$tmp); + + &xor ($Zhi,$rem[1]); + &movz ($rem[0],&LB($Zlo)); + &shr ($Zlo,4); + + &mov ($tmp,$Zhi); + &shl (&LB($rem[0]),4); + &shr ($Zhi,4); + + &xor ($Zlo,"8($Htbl,$nhi[0])"); + &movzw ($rem[0],"($rem_8bit,$rem[0],2)"); + &shl ($tmp,60); + + &xor ($Zhi,"($Htbl,$nhi[0])"); + &xor ($Zlo,$tmp); + &shl ($rem[0],48); + + &bswap ($Zlo); + &xor ($Zhi,$rem[0]); + + &bswap ($Zhi); + &cmp ($inp,$len); + &jb (".Louter_loop"); +} 
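The byte-wide remainder table this loop indexes (.Lrem_8bit below, applied via the <<48 after the movzw lookup) is, like rem_4bit, linear over GF(2) in its index, so the 256 constants in the data section can be cross-checked or regenerated mechanically (an illustrative sketch, same conventions as the C fragments above):

    #include <stdint.h>

    /* rem_8bit[r] = XOR over each set bit i of r of (0x1C2 << i); entry r
     * folds the eight bits shifted out of Z back into the top 16 bits of
     * Z.hi. The hand-written table in the assembly avoids runtime init. */
    static void init_rem_8bit(uint16_t rem_8bit[256]) {
        for (int r = 0; r < 256; r++) {
            uint32_t v = 0;
            for (int i = 0; i < 8; i++)
                if (r & (1 << i)) v ^= (uint32_t)0x1C2 << i;
            rem_8bit[r] = (uint16_t)v;
        }
    }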
+$code.=<<___; + mov $Zlo,8($Xi) + mov $Zhi,($Xi) + + lea 280+48(%rsp),%rsi + mov -48(%rsi),%r15 + mov -40(%rsi),%r14 + mov -32(%rsi),%r13 + mov -24(%rsi),%r12 + mov -16(%rsi),%rbp + mov -8(%rsi),%rbx + lea 0(%rsi),%rsp +.Lghash_epilogue: + ret +.size gcm_ghash_4bit,.-gcm_ghash_4bit +___ + +###################################################################### +# PCLMULQDQ version. + +@_4args=$win64? ("%rcx","%rdx","%r8", "%r9") : # Win64 order + ("%rdi","%rsi","%rdx","%rcx"); # Unix order + +($Xi,$Xhi)=("%xmm0","%xmm1"); $Hkey="%xmm2"; +($T1,$T2,$T3)=("%xmm3","%xmm4","%xmm5"); + +sub clmul64x64_T2 { # minimal register pressure +my ($Xhi,$Xi,$Hkey,$HK)=@_; + +if (!defined($HK)) { $HK = $T2; +$code.=<<___; + movdqa $Xi,$Xhi # + pshufd \$0b01001110,$Xi,$T1 + pshufd \$0b01001110,$Hkey,$T2 + pxor $Xi,$T1 # + pxor $Hkey,$T2 +___ +} else { +$code.=<<___; + movdqa $Xi,$Xhi # + pshufd \$0b01001110,$Xi,$T1 + pxor $Xi,$T1 # +___ +} +$code.=<<___; + pclmulqdq \$0x00,$Hkey,$Xi ####### + pclmulqdq \$0x11,$Hkey,$Xhi ####### + pclmulqdq \$0x00,$HK,$T1 ####### + pxor $Xi,$T1 # + pxor $Xhi,$T1 # + + movdqa $T1,$T2 # + psrldq \$8,$T1 + pslldq \$8,$T2 # + pxor $T1,$Xhi + pxor $T2,$Xi # +___ +} + +sub reduction_alg9 { # 17/11 times faster than Intel version +my ($Xhi,$Xi) = @_; + +$code.=<<___; + # 1st phase + movdqa $Xi,$T2 # + movdqa $Xi,$T1 + psllq \$5,$Xi + pxor $Xi,$T1 # + psllq \$1,$Xi + pxor $T1,$Xi # + psllq \$57,$Xi # + movdqa $Xi,$T1 # + pslldq \$8,$Xi + psrldq \$8,$T1 # + pxor $T2,$Xi + pxor $T1,$Xhi # + + # 2nd phase + movdqa $Xi,$T2 + psrlq \$1,$Xi + pxor $T2,$Xhi # + pxor $Xi,$T2 + psrlq \$5,$Xi + pxor $T2,$Xi # + psrlq \$1,$Xi # + pxor $Xhi,$Xi # +___ +} + +{ my ($Htbl,$Xip)=@_4args; + my $HK="%xmm6"; + +$code.=<<___; +.globl gcm_init_clmul +.type gcm_init_clmul,\@abi-omnipotent +.align 16 +gcm_init_clmul: +.L_init_clmul: +___ +$code.=<<___ if ($win64); +.LSEH_begin_gcm_init_clmul: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x83,0xec,0x18 #sub $0x18,%rsp + .byte 0x0f,0x29,0x34,0x24 #movaps %xmm6,(%rsp) +___ +$code.=<<___; + movdqu ($Xip),$Hkey + pshufd \$0b01001110,$Hkey,$Hkey # dword swap + + # <<1 twist + pshufd \$0b11111111,$Hkey,$T2 # broadcast uppermost dword + movdqa $Hkey,$T1 + psllq \$1,$Hkey + pxor $T3,$T3 # + psrlq \$63,$T1 + pcmpgtd $T2,$T3 # broadcast carry bit + pslldq \$8,$T1 + por $T1,$Hkey # H<<=1 + + # magic reduction + pand .L0x1c2_polynomial(%rip),$T3 + pxor $T3,$Hkey # if(carry) H^=0x1c2_polynomial + + # calculate H^2 + pshufd \$0b01001110,$Hkey,$HK + movdqa $Hkey,$Xi + pxor $Hkey,$HK +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + pshufd \$0b01001110,$Hkey,$T1 + pshufd \$0b01001110,$Xi,$T2 + pxor $Hkey,$T1 # Karatsuba pre-processing + movdqu $Hkey,0x00($Htbl) # save H + pxor $Xi,$T2 # Karatsuba pre-processing + movdqu $Xi,0x10($Htbl) # save H^2 + palignr \$8,$T1,$T2 # low part is H.lo^H.hi... + movdqu $T2,0x20($Htbl) # save Karatsuba "salt" +___ +if ($do4xaggr) { + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); # H^3 + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + movdqa $Xi,$T3 +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); # H^4 + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + pshufd \$0b01001110,$T3,$T1 + pshufd \$0b01001110,$Xi,$T2 + pxor $T3,$T1 # Karatsuba pre-processing + movdqu $T3,0x30($Htbl) # save H^3 + pxor $Xi,$T2 # Karatsuba pre-processing + movdqu $Xi,0x40($Htbl) # save H^4 + palignr \$8,$T1,$T2 # low part is H^3.lo^H^3.hi... 
+ movdqu $T2,0x50($Htbl) # save Karatsuba "salt" +___ +} +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + lea 0x18(%rsp),%rsp +.LSEH_end_gcm_init_clmul: +___ +$code.=<<___; + ret +.size gcm_init_clmul,.-gcm_init_clmul +___ +} + +{ my ($Xip,$Htbl)=@_4args; + +$code.=<<___; +.globl gcm_gmult_clmul +.type gcm_gmult_clmul,\@abi-omnipotent +.align 16 +gcm_gmult_clmul: +.L_gmult_clmul: + movdqu ($Xip),$Xi + movdqa .Lbswap_mask(%rip),$T3 + movdqu ($Htbl),$Hkey + movdqu 0x20($Htbl),$T2 + pshufb $T3,$Xi +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2); +$code.=<<___ if (0 || (&reduction_alg9($Xhi,$Xi)&&0)); + # experimental alternative. special thing about is that there + # no dependency between the two multiplications... + mov \$`0xE1<<1`,%eax + mov \$0xA040608020C0E000,%r10 # ((7..0)·0xE0)&0xff + mov \$0x07,%r11d + movq %rax,$T1 + movq %r10,$T2 + movq %r11,$T3 # borrow $T3 + pand $Xi,$T3 + pshufb $T3,$T2 # ($Xi&7)·0xE0 + movq %rax,$T3 + pclmulqdq \$0x00,$Xi,$T1 # ·(0xE1<<1) + pxor $Xi,$T2 + pslldq \$15,$T2 + paddd $T2,$T2 # <<(64+56+1) + pxor $T2,$Xi + pclmulqdq \$0x01,$T3,$Xi + movdqa .Lbswap_mask(%rip),$T3 # reload $T3 + psrldq \$1,$T1 + pxor $T1,$Xhi + pslldq \$7,$Xi + pxor $Xhi,$Xi +___ +$code.=<<___; + pshufb $T3,$Xi + movdqu $Xi,($Xip) + ret +.size gcm_gmult_clmul,.-gcm_gmult_clmul +___ +} + +{ my ($Xip,$Htbl,$inp,$len)=@_4args; + my ($Xln,$Xmn,$Xhn,$Hkey2,$HK) = map("%xmm$_",(3..7)); + my ($T1,$T2,$T3)=map("%xmm$_",(8..10)); + +$code.=<<___; +.globl gcm_ghash_clmul +.type gcm_ghash_clmul,\@abi-omnipotent +.align 32 +gcm_ghash_clmul: +.L_ghash_clmul: +___ +$code.=<<___ if ($win64); + lea -0x88(%rsp),%rax +.LSEH_begin_gcm_ghash_clmul: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x8d,0x60,0xe0 #lea -0x20(%rax),%rsp + .byte 0x0f,0x29,0x70,0xe0 #movaps %xmm6,-0x20(%rax) + .byte 0x0f,0x29,0x78,0xf0 #movaps %xmm7,-0x10(%rax) + .byte 0x44,0x0f,0x29,0x00 #movaps %xmm8,0(%rax) + .byte 0x44,0x0f,0x29,0x48,0x10 #movaps %xmm9,0x10(%rax) + .byte 0x44,0x0f,0x29,0x50,0x20 #movaps %xmm10,0x20(%rax) + .byte 0x44,0x0f,0x29,0x58,0x30 #movaps %xmm11,0x30(%rax) + .byte 0x44,0x0f,0x29,0x60,0x40 #movaps %xmm12,0x40(%rax) + .byte 0x44,0x0f,0x29,0x68,0x50 #movaps %xmm13,0x50(%rax) + .byte 0x44,0x0f,0x29,0x70,0x60 #movaps %xmm14,0x60(%rax) + .byte 0x44,0x0f,0x29,0x78,0x70 #movaps %xmm15,0x70(%rax) +___ +$code.=<<___; + movdqa .Lbswap_mask(%rip),$T3 + + movdqu ($Xip),$Xi + movdqu ($Htbl),$Hkey + movdqu 0x20($Htbl),$HK + pshufb $T3,$Xi + + sub \$0x10,$len + jz .Lodd_tail + + movdqu 0x10($Htbl),$Hkey2 +___ +if ($do4xaggr) { +my ($Xl,$Xm,$Xh,$Hkey3,$Hkey4)=map("%xmm$_",(11..15)); + +$code.=<<___; +# leaq OPENSSL_ia32cap_P(%rip),%rax +# mov 4(%rax),%eax + cmp \$0x30,$len + jb .Lskip4x + +# and \$`1<<26|1<<22`,%eax # isolate MOVBE+XSAVE +# cmp \$`1<<22`,%eax # check for MOVBE without XSAVE +# je .Lskip4x + + sub \$0x30,$len + mov \$0xA040608020C0E000,%rax # ((7..0)·0xE0)&0xff + movdqu 0x30($Htbl),$Hkey3 + movdqu 0x40($Htbl),$Hkey4 + + ####### + # Xi+4 =[(H*Ii+3) + (H^2*Ii+2) + (H^3*Ii+1) + H^4*(Ii+Xi)] mod P + # + movdqu 0x30($inp),$Xln + movdqu 0x20($inp),$Xl + pshufb $T3,$Xln + pshufb $T3,$Xl + movdqa $Xln,$Xhn + pshufd \$0b01001110,$Xln,$Xmn + pxor $Xln,$Xmn + pclmulqdq \$0x00,$Hkey,$Xln + pclmulqdq \$0x11,$Hkey,$Xhn + pclmulqdq \$0x00,$HK,$Xmn + + movdqa $Xl,$Xh + pshufd \$0b01001110,$Xl,$Xm + pxor $Xl,$Xm + pclmulqdq \$0x00,$Hkey2,$Xl + pclmulqdq \$0x11,$Hkey2,$Xh + pclmulqdq \$0x10,$HK,$Xm + xorps $Xl,$Xln + xorps $Xh,$Xhn + movups 0x50($Htbl),$HK + xorps $Xm,$Xmn + + movdqu 0x10($inp),$Xl + 
movdqu 0($inp),$T1 + pshufb $T3,$Xl + pshufb $T3,$T1 + movdqa $Xl,$Xh + pshufd \$0b01001110,$Xl,$Xm + pxor $T1,$Xi + pxor $Xl,$Xm + pclmulqdq \$0x00,$Hkey3,$Xl + movdqa $Xi,$Xhi + pshufd \$0b01001110,$Xi,$T1 + pxor $Xi,$T1 + pclmulqdq \$0x11,$Hkey3,$Xh + pclmulqdq \$0x00,$HK,$Xm + xorps $Xl,$Xln + xorps $Xh,$Xhn + + lea 0x40($inp),$inp + sub \$0x40,$len + jc .Ltail4x + + jmp .Lmod4_loop +.align 32 +.Lmod4_loop: + pclmulqdq \$0x00,$Hkey4,$Xi + xorps $Xm,$Xmn + movdqu 0x30($inp),$Xl + pshufb $T3,$Xl + pclmulqdq \$0x11,$Hkey4,$Xhi + xorps $Xln,$Xi + movdqu 0x20($inp),$Xln + movdqa $Xl,$Xh + pclmulqdq \$0x10,$HK,$T1 + pshufd \$0b01001110,$Xl,$Xm + xorps $Xhn,$Xhi + pxor $Xl,$Xm + pshufb $T3,$Xln + movups 0x20($Htbl),$HK + xorps $Xmn,$T1 + pclmulqdq \$0x00,$Hkey,$Xl + pshufd \$0b01001110,$Xln,$Xmn + + pxor $Xi,$T1 # aggregated Karatsuba post-processing + movdqa $Xln,$Xhn + pxor $Xhi,$T1 # + pxor $Xln,$Xmn + movdqa $T1,$T2 # + pclmulqdq \$0x11,$Hkey,$Xh + pslldq \$8,$T1 + psrldq \$8,$T2 # + pxor $T1,$Xi + movdqa .L7_mask(%rip),$T1 + pxor $T2,$Xhi # + movq %rax,$T2 + + pand $Xi,$T1 # 1st phase + pshufb $T1,$T2 # + pxor $Xi,$T2 # + pclmulqdq \$0x00,$HK,$Xm + psllq \$57,$T2 # + movdqa $T2,$T1 # + pslldq \$8,$T2 + pclmulqdq \$0x00,$Hkey2,$Xln + psrldq \$8,$T1 # + pxor $T2,$Xi + pxor $T1,$Xhi # + movdqu 0($inp),$T1 + + movdqa $Xi,$T2 # 2nd phase + psrlq \$1,$Xi + pclmulqdq \$0x11,$Hkey2,$Xhn + xorps $Xl,$Xln + movdqu 0x10($inp),$Xl + pshufb $T3,$Xl + pclmulqdq \$0x10,$HK,$Xmn + xorps $Xh,$Xhn + movups 0x50($Htbl),$HK + pshufb $T3,$T1 + pxor $T2,$Xhi # + pxor $Xi,$T2 + psrlq \$5,$Xi + + movdqa $Xl,$Xh + pxor $Xm,$Xmn + pshufd \$0b01001110,$Xl,$Xm + pxor $T2,$Xi # + pxor $T1,$Xhi + pxor $Xl,$Xm + pclmulqdq \$0x00,$Hkey3,$Xl + psrlq \$1,$Xi # + pxor $Xhi,$Xi # + movdqa $Xi,$Xhi + pclmulqdq \$0x11,$Hkey3,$Xh + xorps $Xl,$Xln + pshufd \$0b01001110,$Xi,$T1 + pxor $Xi,$T1 + + pclmulqdq \$0x00,$HK,$Xm + xorps $Xh,$Xhn + + lea 0x40($inp),$inp + sub \$0x40,$len + jnc .Lmod4_loop + +.Ltail4x: + pclmulqdq \$0x00,$Hkey4,$Xi + pclmulqdq \$0x11,$Hkey4,$Xhi + pclmulqdq \$0x10,$HK,$T1 + xorps $Xm,$Xmn + xorps $Xln,$Xi + xorps $Xhn,$Xhi + pxor $Xi,$Xhi # aggregated Karatsuba post-processing + pxor $Xmn,$T1 + + pxor $Xhi,$T1 # + pxor $Xi,$Xhi + + movdqa $T1,$T2 # + psrldq \$8,$T1 + pslldq \$8,$T2 # + pxor $T1,$Xhi + pxor $T2,$Xi # +___ + &reduction_alg9($Xhi,$Xi); +$code.=<<___; + add \$0x40,$len + jz .Ldone + movdqu 0x20($Htbl),$HK + sub \$0x10,$len + jz .Lodd_tail +.Lskip4x: +___ +} +$code.=<<___; + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # + movdqu ($inp),$T1 # Ii + movdqu 16($inp),$Xln # Ii+1 + pshufb $T3,$T1 + pshufb $T3,$Xln + pxor $T1,$Xi # Ii+Xi + + movdqa $Xln,$Xhn + pshufd \$0b01001110,$Xln,$Xmn + pxor $Xln,$Xmn + pclmulqdq \$0x00,$Hkey,$Xln + pclmulqdq \$0x11,$Hkey,$Xhn + pclmulqdq \$0x00,$HK,$Xmn + + lea 32($inp),$inp # i+=2 + nop + sub \$0x20,$len + jbe .Leven_tail + nop + jmp .Lmod_loop + +.align 32 +.Lmod_loop: + movdqa $Xi,$Xhi + movdqa $Xmn,$T1 + pshufd \$0b01001110,$Xi,$Xmn # + pxor $Xi,$Xmn # + + pclmulqdq \$0x00,$Hkey2,$Xi + pclmulqdq \$0x11,$Hkey2,$Xhi + pclmulqdq \$0x10,$HK,$Xmn + + pxor $Xln,$Xi # (H*Ii+1) + H^2*(Ii+Xi) + pxor $Xhn,$Xhi + movdqu ($inp),$T2 # Ii + pxor $Xi,$T1 # aggregated Karatsuba post-processing + pshufb $T3,$T2 + movdqu 16($inp),$Xln # Ii+1 + + pxor $Xhi,$T1 + pxor $T2,$Xhi # "Ii+Xi", consume early + pxor $T1,$Xmn + pshufb $T3,$Xln + movdqa $Xmn,$T1 # + psrldq \$8,$T1 + pslldq \$8,$Xmn # + pxor $T1,$Xhi 
+ pxor $Xmn,$Xi # + + movdqa $Xln,$Xhn # + + movdqa $Xi,$T2 # 1st phase + movdqa $Xi,$T1 + psllq \$5,$Xi + pxor $Xi,$T1 # + pclmulqdq \$0x00,$Hkey,$Xln ####### + psllq \$1,$Xi + pxor $T1,$Xi # + psllq \$57,$Xi # + movdqa $Xi,$T1 # + pslldq \$8,$Xi + psrldq \$8,$T1 # + pxor $T2,$Xi + pshufd \$0b01001110,$Xhn,$Xmn + pxor $T1,$Xhi # + pxor $Xhn,$Xmn # + + movdqa $Xi,$T2 # 2nd phase + psrlq \$1,$Xi + pclmulqdq \$0x11,$Hkey,$Xhn ####### + pxor $T2,$Xhi # + pxor $Xi,$T2 + psrlq \$5,$Xi + pxor $T2,$Xi # + lea 32($inp),$inp + psrlq \$1,$Xi # + pclmulqdq \$0x00,$HK,$Xmn ####### + pxor $Xhi,$Xi # + + sub \$0x20,$len + ja .Lmod_loop + +.Leven_tail: + movdqa $Xi,$Xhi + movdqa $Xmn,$T1 + pshufd \$0b01001110,$Xi,$Xmn # + pxor $Xi,$Xmn # + + pclmulqdq \$0x00,$Hkey2,$Xi + pclmulqdq \$0x11,$Hkey2,$Xhi + pclmulqdq \$0x10,$HK,$Xmn + + pxor $Xln,$Xi # (H*Ii+1) + H^2*(Ii+Xi) + pxor $Xhn,$Xhi + pxor $Xi,$T1 + pxor $Xhi,$T1 + pxor $T1,$Xmn + movdqa $Xmn,$T1 # + psrldq \$8,$T1 + pslldq \$8,$Xmn # + pxor $T1,$Xhi + pxor $Xmn,$Xi # +___ + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + test $len,$len + jnz .Ldone + +.Lodd_tail: + movdqu ($inp),$T1 # Ii + pshufb $T3,$T1 + pxor $T1,$Xi # Ii+Xi +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); # H*(Ii+Xi) + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; +.Ldone: + pshufb $T3,$Xi + movdqu $Xi,($Xip) +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps 0x10(%rsp),%xmm7 + movaps 0x20(%rsp),%xmm8 + movaps 0x30(%rsp),%xmm9 + movaps 0x40(%rsp),%xmm10 + movaps 0x50(%rsp),%xmm11 + movaps 0x60(%rsp),%xmm12 + movaps 0x70(%rsp),%xmm13 + movaps 0x80(%rsp),%xmm14 + movaps 0x90(%rsp),%xmm15 + lea 0xa8(%rsp),%rsp +.LSEH_end_gcm_ghash_clmul: +___ +$code.=<<___; + ret +.size gcm_ghash_clmul,.-gcm_ghash_clmul +___ +} + +$code.=<<___; +.globl gcm_init_avx +.type gcm_init_avx,\@abi-omnipotent +.align 32 +gcm_init_avx: +___ +if ($avx) { +my ($Htbl,$Xip)=@_4args; +my $HK="%xmm6"; + +$code.=<<___ if ($win64); +.LSEH_begin_gcm_init_avx: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x83,0xec,0x18 #sub $0x18,%rsp + .byte 0x0f,0x29,0x34,0x24 #movaps %xmm6,(%rsp) +___ +$code.=<<___; + vzeroupper + + vmovdqu ($Xip),$Hkey + vpshufd \$0b01001110,$Hkey,$Hkey # dword swap + + # <<1 twist + vpshufd \$0b11111111,$Hkey,$T2 # broadcast uppermost dword + vpsrlq \$63,$Hkey,$T1 + vpsllq \$1,$Hkey,$Hkey + vpxor $T3,$T3,$T3 # + vpcmpgtd $T2,$T3,$T3 # broadcast carry bit + vpslldq \$8,$T1,$T1 + vpor $T1,$Hkey,$Hkey # H<<=1 + + # magic reduction + vpand .L0x1c2_polynomial(%rip),$T3,$T3 + vpxor $T3,$Hkey,$Hkey # if(carry) H^=0x1c2_polynomial + + vpunpckhqdq $Hkey,$Hkey,$HK + vmovdqa $Hkey,$Xi + vpxor $Hkey,$HK,$HK + mov \$4,%r10 # up to H^8 + jmp .Linit_start_avx +___ + +sub clmul64x64_avx { +my ($Xhi,$Xi,$Hkey,$HK)=@_; + +if (!defined($HK)) { $HK = $T2; +$code.=<<___; + vpunpckhqdq $Xi,$Xi,$T1 + vpunpckhqdq $Hkey,$Hkey,$T2 + vpxor $Xi,$T1,$T1 # + vpxor $Hkey,$T2,$T2 +___ +} else { +$code.=<<___; + vpunpckhqdq $Xi,$Xi,$T1 + vpxor $Xi,$T1,$T1 # +___ +} +$code.=<<___; + vpclmulqdq \$0x11,$Hkey,$Xi,$Xhi ####### + vpclmulqdq \$0x00,$Hkey,$Xi,$Xi ####### + vpclmulqdq \$0x00,$HK,$T1,$T1 ####### + vpxor $Xi,$Xhi,$T2 # + vpxor $T2,$T1,$T1 # + + vpslldq \$8,$T1,$T2 # + vpsrldq \$8,$T1,$T1 + vpxor $T2,$Xi,$Xi # + vpxor $T1,$Xhi,$Xhi +___ +} + +sub reduction_avx { +my ($Xhi,$Xi) = @_; + +$code.=<<___; + vpsllq \$57,$Xi,$T1 # 1st phase + vpsllq \$62,$Xi,$T2 + vpxor $T1,$T2,$T2 # + vpsllq \$63,$Xi,$T1 + vpxor $T1,$T2,$T2 # + vpslldq \$8,$T2,$T1 # + vpsrldq \$8,$T2,$T2 + vpxor $T1,$Xi,$Xi # + vpxor 
$T2,$Xhi,$Xhi + + vpsrlq \$1,$Xi,$T2 # 2nd phase + vpxor $Xi,$Xhi,$Xhi + vpxor $T2,$Xi,$Xi # + vpsrlq \$5,$T2,$T2 + vpxor $T2,$Xi,$Xi # + vpsrlq \$1,$Xi,$Xi # + vpxor $Xhi,$Xi,$Xi # +___ +} + +$code.=<<___; +.align 32 +.Linit_loop_avx: + vpalignr \$8,$T1,$T2,$T3 # low part is H.lo^H.hi... + vmovdqu $T3,-0x10($Htbl) # save Karatsuba "salt" +___ + &clmul64x64_avx ($Xhi,$Xi,$Hkey,$HK); # calculate H^3,5,7 + &reduction_avx ($Xhi,$Xi); +$code.=<<___; +.Linit_start_avx: + vmovdqa $Xi,$T3 +___ + &clmul64x64_avx ($Xhi,$Xi,$Hkey,$HK); # calculate H^2,4,6,8 + &reduction_avx ($Xhi,$Xi); +$code.=<<___; + vpshufd \$0b01001110,$T3,$T1 + vpshufd \$0b01001110,$Xi,$T2 + vpxor $T3,$T1,$T1 # Karatsuba pre-processing + vmovdqu $T3,0x00($Htbl) # save H^1,3,5,7 + vpxor $Xi,$T2,$T2 # Karatsuba pre-processing + vmovdqu $Xi,0x10($Htbl) # save H^2,4,6,8 + lea 0x30($Htbl),$Htbl + sub \$1,%r10 + jnz .Linit_loop_avx + + vpalignr \$8,$T2,$T1,$T3 # last "salt" is flipped + vmovdqu $T3,-0x10($Htbl) + + vzeroupper +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + lea 0x18(%rsp),%rsp +.LSEH_end_gcm_init_avx: +___ +$code.=<<___; + ret +.size gcm_init_avx,.-gcm_init_avx +___ +} else { +$code.=<<___; + jmp .L_init_clmul +.size gcm_init_avx,.-gcm_init_avx +___ +} + +$code.=<<___; +.globl gcm_gmult_avx +.type gcm_gmult_avx,\@abi-omnipotent +.align 32 +gcm_gmult_avx: + jmp .L_gmult_clmul +.size gcm_gmult_avx,.-gcm_gmult_avx +___ + +$code.=<<___; +.globl gcm_ghash_avx +.type gcm_ghash_avx,\@abi-omnipotent +.align 32 +gcm_ghash_avx: +___ +if ($avx) { +my ($Xip,$Htbl,$inp,$len)=@_4args; +my ($Xlo,$Xhi,$Xmi, + $Zlo,$Zhi,$Zmi, + $Hkey,$HK,$T1,$T2, + $Xi,$Xo,$Tred,$bswap,$Ii,$Ij) = map("%xmm$_",(0..15)); + +$code.=<<___ if ($win64); + lea -0x88(%rsp),%rax +.LSEH_begin_gcm_ghash_avx: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x8d,0x60,0xe0 #lea -0x20(%rax),%rsp + .byte 0x0f,0x29,0x70,0xe0 #movaps %xmm6,-0x20(%rax) + .byte 0x0f,0x29,0x78,0xf0 #movaps %xmm7,-0x10(%rax) + .byte 0x44,0x0f,0x29,0x00 #movaps %xmm8,0(%rax) + .byte 0x44,0x0f,0x29,0x48,0x10 #movaps %xmm9,0x10(%rax) + .byte 0x44,0x0f,0x29,0x50,0x20 #movaps %xmm10,0x20(%rax) + .byte 0x44,0x0f,0x29,0x58,0x30 #movaps %xmm11,0x30(%rax) + .byte 0x44,0x0f,0x29,0x60,0x40 #movaps %xmm12,0x40(%rax) + .byte 0x44,0x0f,0x29,0x68,0x50 #movaps %xmm13,0x50(%rax) + .byte 0x44,0x0f,0x29,0x70,0x60 #movaps %xmm14,0x60(%rax) + .byte 0x44,0x0f,0x29,0x78,0x70 #movaps %xmm15,0x70(%rax) +___ +$code.=<<___; + vzeroupper + + vmovdqu ($Xip),$Xi # load $Xi + lea .L0x1c2_polynomial(%rip),%r10 + lea 0x40($Htbl),$Htbl # size optimization + vmovdqu .Lbswap_mask(%rip),$bswap + vpshufb $bswap,$Xi,$Xi + cmp \$0x80,$len + jb .Lshort_avx + sub \$0x80,$len + + vmovdqu 0x70($inp),$Ii # I[7] + vmovdqu 0x00-0x40($Htbl),$Hkey # $Hkey^1 + vpshufb $bswap,$Ii,$Ii + vmovdqu 0x20-0x40($Htbl),$HK + + vpunpckhqdq $Ii,$Ii,$T2 + vmovdqu 0x60($inp),$Ij # I[6] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Ii,$T2,$T2 + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x10-0x40($Htbl),$Hkey # $Hkey^2 + vpunpckhqdq $Ij,$Ij,$T1 + vmovdqu 0x50($inp),$Ii # I[5] + vpclmulqdq \$0x00,$HK,$T2,$Xmi + vpxor $Ij,$T1,$T1 + + vpshufb $bswap,$Ii,$Ii + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x30-0x40($Htbl),$Hkey # $Hkey^3 + vpxor $Ii,$T2,$T2 + vmovdqu 0x40($inp),$Ij # I[4] + vpclmulqdq \$0x10,$HK,$T1,$Zmi + vmovdqu 0x50-0x40($Htbl),$HK + + vpshufb $bswap,$Ij,$Ij + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor 
$Xhi,$Zhi,$Zhi + vpunpckhqdq $Ij,$Ij,$T1 + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x40-0x40($Htbl),$Hkey # $Hkey^4 + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T2,$Xmi + vpxor $Ij,$T1,$T1 + + vmovdqu 0x30($inp),$Ii # I[3] + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpxor $Zhi,$Xhi,$Xhi + vpshufb $bswap,$Ii,$Ii + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x60-0x40($Htbl),$Hkey # $Hkey^5 + vpxor $Zmi,$Xmi,$Xmi + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x10,$HK,$T1,$Zmi + vmovdqu 0x80-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + + vmovdqu 0x20($inp),$Ij # I[2] + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Xhi,$Zhi,$Zhi + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x70-0x40($Htbl),$Hkey # $Hkey^6 + vpxor $Xmi,$Zmi,$Zmi + vpunpckhqdq $Ij,$Ij,$T1 + vpclmulqdq \$0x00,$HK,$T2,$Xmi + vpxor $Ij,$T1,$T1 + + vmovdqu 0x10($inp),$Ii # I[1] + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpxor $Zhi,$Xhi,$Xhi + vpshufb $bswap,$Ii,$Ii + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x90-0x40($Htbl),$Hkey # $Hkey^7 + vpxor $Zmi,$Xmi,$Xmi + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x10,$HK,$T1,$Zmi + vmovdqu 0xb0-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + + vmovdqu ($inp),$Ij # I[0] + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Xhi,$Zhi,$Zhi + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0xa0-0x40($Htbl),$Hkey # $Hkey^8 + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x10,$HK,$T2,$Xmi + + lea 0x80($inp),$inp + cmp \$0x80,$len + jb .Ltail_avx + + vpxor $Xi,$Ij,$Ij # accumulate $Xi + sub \$0x80,$len + jmp .Loop8x_avx + +.align 32 +.Loop8x_avx: + vpunpckhqdq $Ij,$Ij,$T1 + vmovdqu 0x70($inp),$Ii # I[7] + vpxor $Xlo,$Zlo,$Zlo + vpxor $Ij,$T1,$T1 + vpclmulqdq \$0x00,$Hkey,$Ij,$Xi + vpshufb $bswap,$Ii,$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xo + vmovdqu 0x00-0x40($Htbl),$Hkey # $Hkey^1 + vpunpckhqdq $Ii,$Ii,$T2 + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Tred + vmovdqu 0x20-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + + vmovdqu 0x60($inp),$Ij # I[6] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Zlo,$Xi,$Xi # collect result + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vxorps $Zhi,$Xo,$Xo + vmovdqu 0x10-0x40($Htbl),$Hkey # $Hkey^2 + vpunpckhqdq $Ij,$Ij,$T1 + vpclmulqdq \$0x00,$HK, $T2,$Xmi + vpxor $Zmi,$Tred,$Tred + vxorps $Ij,$T1,$T1 + + vmovdqu 0x50($inp),$Ii # I[5] + vpxor $Xi,$Tred,$Tred # aggregated Karatsuba post-processing + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpxor $Xo,$Tred,$Tred + vpslldq \$8,$Tred,$T2 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vpsrldq \$8,$Tred,$Tred + vpxor $T2, $Xi, $Xi + vmovdqu 0x30-0x40($Htbl),$Hkey # $Hkey^3 + vpshufb $bswap,$Ii,$Ii + vxorps $Tred,$Xo, $Xo + vpxor $Xhi,$Zhi,$Zhi + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x10,$HK, $T1,$Zmi + vmovdqu 0x50-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + vpxor $Xmi,$Zmi,$Zmi + + vmovdqu 0x40($inp),$Ij # I[4] + vpalignr \$8,$Xi,$Xi,$Tred # 1st phase + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpshufb $bswap,$Ij,$Ij + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x40-0x40($Htbl),$Hkey # $Hkey^4 + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Zhi,$Xhi,$Xhi + vpclmulqdq \$0x00,$HK, $T2,$Xmi + vxorps $Ij,$T1,$T1 + vpxor $Zmi,$Xmi,$Xmi + + vmovdqu 0x30($inp),$Ii # I[3] + vpclmulqdq \$0x10,(%r10),$Xi,$Xi + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpshufb $bswap,$Ii,$Ii + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x60-0x40($Htbl),$Hkey # $Hkey^5 + vpunpckhqdq $Ii,$Ii,$T2 + vpxor 
$Xhi,$Zhi,$Zhi + vpclmulqdq \$0x10,$HK, $T1,$Zmi + vmovdqu 0x80-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + vpxor $Xmi,$Zmi,$Zmi + + vmovdqu 0x20($inp),$Ij # I[2] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpshufb $bswap,$Ij,$Ij + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x70-0x40($Htbl),$Hkey # $Hkey^6 + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Zhi,$Xhi,$Xhi + vpclmulqdq \$0x00,$HK, $T2,$Xmi + vpxor $Ij,$T1,$T1 + vpxor $Zmi,$Xmi,$Xmi + vxorps $Tred,$Xi,$Xi + + vmovdqu 0x10($inp),$Ii # I[1] + vpalignr \$8,$Xi,$Xi,$Tred # 2nd phase + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpshufb $bswap,$Ii,$Ii + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x90-0x40($Htbl),$Hkey # $Hkey^7 + vpclmulqdq \$0x10,(%r10),$Xi,$Xi + vxorps $Xo,$Tred,$Tred + vpunpckhqdq $Ii,$Ii,$T2 + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x10,$HK, $T1,$Zmi + vmovdqu 0xb0-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + vpxor $Xmi,$Zmi,$Zmi + + vmovdqu ($inp),$Ij # I[0] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0xa0-0x40($Htbl),$Hkey # $Hkey^8 + vpxor $Tred,$Ij,$Ij + vpclmulqdq \$0x10,$HK, $T2,$Xmi + vpxor $Xi,$Ij,$Ij # accumulate $Xi + + lea 0x80($inp),$inp + sub \$0x80,$len + jnc .Loop8x_avx + + add \$0x80,$len + jmp .Ltail_no_xor_avx + +.align 32 +.Lshort_avx: + vmovdqu -0x10($inp,$len),$Ii # very last word + lea ($inp,$len),$inp + vmovdqu 0x00-0x40($Htbl),$Hkey # $Hkey^1 + vmovdqu 0x20-0x40($Htbl),$HK + vpshufb $bswap,$Ii,$Ij + + vmovdqa $Xlo,$Zlo # subtle way to zero $Zlo, + vmovdqa $Xhi,$Zhi # $Zhi and + vmovdqa $Xmi,$Zmi # $Zmi + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x20($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x10-0x40($Htbl),$Hkey # $Hkey^2 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vpsrldq \$8,$HK,$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x30($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x30-0x40($Htbl),$Hkey # $Hkey^3 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vmovdqu 0x50-0x40($Htbl),$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x40($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x40-0x40($Htbl),$Hkey # $Hkey^4 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vpsrldq \$8,$HK,$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x50($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x60-0x40($Htbl),$Hkey # $Hkey^5 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vmovdqu 0x80-0x40($Htbl),$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x60($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x70-0x40($Htbl),$Hkey # $Hkey^6 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vpsrldq \$8,$HK,$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq 
\$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x70($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x90-0x40($Htbl),$Hkey # $Hkey^7 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vmovq 0xb8-0x40($Htbl),$HK + sub \$0x10,$len + jmp .Ltail_avx + +.align 32 +.Ltail_avx: + vpxor $Xi,$Ij,$Ij # accumulate $Xi +.Ltail_no_xor_avx: + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + + vmovdqu (%r10),$Tred + + vpxor $Xlo,$Zlo,$Xi + vpxor $Xhi,$Zhi,$Xo + vpxor $Xmi,$Zmi,$Zmi + + vpxor $Xi, $Zmi,$Zmi # aggregated Karatsuba post-processing + vpxor $Xo, $Zmi,$Zmi + vpslldq \$8, $Zmi,$T2 + vpsrldq \$8, $Zmi,$Zmi + vpxor $T2, $Xi, $Xi + vpxor $Zmi,$Xo, $Xo + + vpclmulqdq \$0x10,$Tred,$Xi,$T2 # 1st phase + vpalignr \$8,$Xi,$Xi,$Xi + vpxor $T2,$Xi,$Xi + + vpclmulqdq \$0x10,$Tred,$Xi,$T2 # 2nd phase + vpalignr \$8,$Xi,$Xi,$Xi + vpxor $Xo,$Xi,$Xi + vpxor $T2,$Xi,$Xi + + cmp \$0,$len + jne .Lshort_avx + + vpshufb $bswap,$Xi,$Xi + vmovdqu $Xi,($Xip) + vzeroupper +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps 0x10(%rsp),%xmm7 + movaps 0x20(%rsp),%xmm8 + movaps 0x30(%rsp),%xmm9 + movaps 0x40(%rsp),%xmm10 + movaps 0x50(%rsp),%xmm11 + movaps 0x60(%rsp),%xmm12 + movaps 0x70(%rsp),%xmm13 + movaps 0x80(%rsp),%xmm14 + movaps 0x90(%rsp),%xmm15 + lea 0xa8(%rsp),%rsp +.LSEH_end_gcm_ghash_avx: +___ +$code.=<<___; + ret +.size gcm_ghash_avx,.-gcm_ghash_avx +___ +} else { +$code.=<<___; + jmp .L_ghash_clmul +.size gcm_ghash_avx,.-gcm_ghash_avx +___ +} + +$code.=<<___; +.align 64 +.Lbswap_mask: + .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.L0x1c2_polynomial: + .byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.L7_mask: + .long 7,0,7,0 +.L7_mask_poly: + .long 7,0,`0xE1<<1`,0 +.align 64 +.type .Lrem_4bit,\@object +.Lrem_4bit: + .long 0,`0x0000<<16`,0,`0x1C20<<16`,0,`0x3840<<16`,0,`0x2460<<16` + .long 0,`0x7080<<16`,0,`0x6CA0<<16`,0,`0x48C0<<16`,0,`0x54E0<<16` + .long 0,`0xE100<<16`,0,`0xFD20<<16`,0,`0xD940<<16`,0,`0xC560<<16` + .long 0,`0x9180<<16`,0,`0x8DA0<<16`,0,`0xA9C0<<16`,0,`0xB5E0<<16` +.type .Lrem_8bit,\@object +.Lrem_8bit: + .value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E + .value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E + .value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E + .value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E + .value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E + .value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E + .value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E + .value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E + .value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE + .value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE + .value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE + .value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE + .value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E + .value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E + .value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE + .value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE + .value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E + .value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E + .value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E + .value 
0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E + .value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E + .value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E + .value 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E + .value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E + .value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE + .value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE + .value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE + .value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE + .value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E + .value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E + .value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE + .value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE + +.asciz "GHASH for x86_64, CRYPTOGAMS by <appro\@openssl.org>" +.align 64 +___ + +# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame, +# CONTEXT *context,DISPATCHER_CONTEXT *disp) +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___; +.extern __imp_RtlVirtualUnwind +.type se_handler,\@abi-omnipotent +.align 16 +se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lin_prologue + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lin_prologue + + lea 48+280(%rax),%rax # adjust "rsp" + + mov -8(%rax),%rbx + mov -16(%rax),%rbp + mov -24(%rax),%r12 + mov -32(%rax),%r13 + mov -40(%rax),%r14 + mov -48(%rax),%r15 + mov %rbx,144($context) # restore context->Rbx + mov %rbp,160($context) # restore context->Rbp + mov %r12,216($context) # restore context->R12 + mov %r13,224($context) # restore context->R13 + mov %r14,232($context) # restore context->R14 + mov %r15,240($context) # restore context->R15 + +.Lin_prologue: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$`1232/8`,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size se_handler,.-se_handler + +.section .pdata +.align 4 + .rva .LSEH_begin_gcm_gmult_4bit + .rva .LSEH_end_gcm_gmult_4bit + .rva .LSEH_info_gcm_gmult_4bit + + .rva .LSEH_begin_gcm_ghash_4bit + .rva .LSEH_end_gcm_ghash_4bit + .rva .LSEH_info_gcm_ghash_4bit + + .rva .LSEH_begin_gcm_init_clmul + .rva .LSEH_end_gcm_init_clmul + .rva 
.LSEH_info_gcm_init_clmul + + .rva .LSEH_begin_gcm_ghash_clmul + .rva .LSEH_end_gcm_ghash_clmul + .rva .LSEH_info_gcm_ghash_clmul +___ +$code.=<<___ if ($avx); + .rva .LSEH_begin_gcm_init_avx + .rva .LSEH_end_gcm_init_avx + .rva .LSEH_info_gcm_init_clmul + + .rva .LSEH_begin_gcm_ghash_avx + .rva .LSEH_end_gcm_ghash_avx + .rva .LSEH_info_gcm_ghash_clmul +___ +$code.=<<___; +.section .xdata +.align 8 +.LSEH_info_gcm_gmult_4bit: + .byte 9,0,0,0 + .rva se_handler + .rva .Lgmult_prologue,.Lgmult_epilogue # HandlerData +.LSEH_info_gcm_ghash_4bit: + .byte 9,0,0,0 + .rva se_handler + .rva .Lghash_prologue,.Lghash_epilogue # HandlerData +.LSEH_info_gcm_init_clmul: + .byte 0x01,0x08,0x03,0x00 + .byte 0x08,0x68,0x00,0x00 #movaps 0x00(rsp),xmm6 + .byte 0x04,0x22,0x00,0x00 #sub rsp,0x18 +.LSEH_info_gcm_ghash_clmul: + .byte 0x01,0x33,0x16,0x00 + .byte 0x33,0xf8,0x09,0x00 #movaps 0x90(rsp),xmm15 + .byte 0x2e,0xe8,0x08,0x00 #movaps 0x80(rsp),xmm14 + .byte 0x29,0xd8,0x07,0x00 #movaps 0x70(rsp),xmm13 + .byte 0x24,0xc8,0x06,0x00 #movaps 0x60(rsp),xmm12 + .byte 0x1f,0xb8,0x05,0x00 #movaps 0x50(rsp),xmm11 + .byte 0x1a,0xa8,0x04,0x00 #movaps 0x40(rsp),xmm10 + .byte 0x15,0x98,0x03,0x00 #movaps 0x30(rsp),xmm9 + .byte 0x10,0x88,0x02,0x00 #movaps 0x20(rsp),xmm8 + .byte 0x0c,0x78,0x01,0x00 #movaps 0x10(rsp),xmm7 + .byte 0x08,0x68,0x00,0x00 #movaps 0x00(rsp),xmm6 + .byte 0x04,0x01,0x15,0x00 #sub rsp,0xa8 +___ +} + +$code =~ s/\`([^\`]*)\`/eval($1)/gem; + +print $code; + +close STDOUT; diff --git a/crypto/aesgcm/ghash_x64_gas.s b/crypto/aesgcm/ghash_x64_gas.s new file mode 100644 index 0000000..07d5456 --- /dev/null +++ b/crypto/aesgcm/ghash_x64_gas.s @@ -0,0 +1,1795 @@ +.text + + +.globl gcm_gmult_4bit +.type gcm_gmult_4bit,@function +.align 16 +gcm_gmult_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +.Lgmult_prologue: + + movzbq 15(%rdi),%r8 + leaq .Lrem_4bit(%rip),%r11 + xorq %rax,%rax + xorq %rbx,%rbx + movb %r8b,%al + movb %r8b,%bl + shlb $4,%al + movq $14,%rcx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + movq %r8,%rdx + jmp .Loop1 + +.align 16 +.Loop1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + movb (%rdi,%rcx,1),%al + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + movb %al,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + shlb $4,%al + xorq %r10,%r8 + decq %rcx + js .Lbreak1 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + jmp .Loop1 + +.align 16 +.Lbreak1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + xorq %r10,%r8 + xorq (%r11,%rdx,8),%r9 + + bswapq %r8 + bswapq %r9 + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -8(%rsi),%rbx + leaq (%rsi),%rsp +.Lgmult_epilogue: + ret +.size gcm_gmult_4bit,.-gcm_gmult_4bit +.globl gcm_ghash_4bit +.type gcm_ghash_4bit,@function +.align 16 +gcm_ghash_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +.Lghash_prologue: + movq %rdx,%r14 + movq %rcx,%r15 + subq $-128,%rsi + leaq 16+128(%rsp),%rbp + xorl %edx,%edx + movq 0+0-128(%rsi),%r8 + movq 
0+8-128(%rsi),%rax + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq 16+0-128(%rsi),%r9 + shlb $4,%dl + movq 16+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,0(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,0(%rbp) + movq 32+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,0-128(%rbp) + movq 32+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,1(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,8(%rbp) + movq 48+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,8-128(%rbp) + movq 48+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,2(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,16(%rbp) + movq 64+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,16-128(%rbp) + movq 64+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,3(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,24(%rbp) + movq 80+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,24-128(%rbp) + movq 80+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,4(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,32(%rbp) + movq 96+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,32-128(%rbp) + movq 96+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,5(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,40(%rbp) + movq 112+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,40-128(%rbp) + movq 112+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,6(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,48(%rbp) + movq 128+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,48-128(%rbp) + movq 128+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,7(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,56(%rbp) + movq 144+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,56-128(%rbp) + movq 144+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,8(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,64(%rbp) + movq 160+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,64-128(%rbp) + movq 160+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,9(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,72(%rbp) + movq 176+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,72-128(%rbp) + movq 176+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,10(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,80(%rbp) + movq 192+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,80-128(%rbp) + movq 192+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,11(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,88(%rbp) + movq 208+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,88-128(%rbp) + movq 208+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,12(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,96(%rbp) + movq 224+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,96-128(%rbp) + movq 224+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,13(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,104(%rbp) + movq 240+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,104-128(%rbp) + movq 240+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,14(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,112(%rbp) + shlb $4,%dl + movq %rax,112-128(%rbp) + shlq $60,%r10 + movb %dl,15(%rsp) + orq %r10,%rbx + movq %r9,120(%rbp) + movq 
%rbx,120-128(%rbp) + addq $-128,%rsi + movq 8(%rdi),%r8 + movq 0(%rdi),%r9 + addq %r14,%r15 + leaq .Lrem_8bit(%rip),%r11 + jmp .Louter_loop +.align 16 +.Louter_loop: + xorq (%r14),%r9 + movq 8(%r14),%rdx + leaq 16(%r14),%r14 + xorq %r8,%rdx + movq %r9,(%rdi) + movq %rdx,8(%rdi) + shrq $32,%rdx + xorq %rax,%rax + roll $8,%edx + movb %dl,%al + movzbl %dl,%ebx + shlb $4,%al + shrl $4,%ebx + roll $8,%edx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + movb %dl,%al + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + xorq %r8,%r12 + movq %r9,%r10 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 8(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + 
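Every routine in this file accelerates the same GF(2^128) multiply that the textbook bit-serial method spells out; the unrolled loop running here merely consumes eight bits of hash state per table lookup instead of one. A minimal reference sketch in C, following the NIST SP 800-38D description rather than any TunSafe source (gf128_mul_ref is our naming):

    #include <stdint.h>
    #include <string.h>

    /* Reference GF(2^128) multiply in GCM's bit-reflected representation.
     * X and H are 16 big-endian bytes, matching the Xi/Htable arguments of
     * the assembly; the table methods precompute multiples of H so they can
     * take 4 or 8 bits of X per step instead of the 1 bit taken here. */
    static void gf128_mul_ref(uint8_t X[16], const uint8_t H[16])
    {
        uint8_t Z[16] = {0}, V[16];
        memcpy(V, H, 16);
        for (int i = 0; i < 128; i++) {
            if (X[i >> 3] & (0x80 >> (i & 7)))          /* bit i of X */
                for (int j = 0; j < 16; j++)
                    Z[j] ^= V[j];                       /* Z ^= V */
            int carry = V[15] & 1;                      /* V *= x ... */
            for (int j = 15; j > 0; j--)
                V[j] = (uint8_t)((V[j] >> 1) | (V[j - 1] << 7));
            V[0] >>= 1;
            if (carry)
                V[0] ^= 0xe1;       /* ... reduced by the GCM polynomial */
        }
        memcpy(X, Z, 16);
    }
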
shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 0(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + andl $240,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl -4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + movzwq (%r11,%r12,2),%r12 + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + shlq $48,%r12 + xorq %r10,%r8 + xorq %r12,%r9 + movzbq %r8b,%r13 + shrq $4,%r8 + movq %r9,%r10 + shlb $4,%r13b + shrq $4,%r9 + xorq 8(%rsi,%rcx,1),%r8 + movzwq (%r11,%r13,2),%r13 + shlq $60,%r10 + xorq (%rsi,%rcx,1),%r9 + xorq %r10,%r8 + shlq $48,%r13 + bswapq %r8 + xorq %r13,%r9 + bswapq %r9 + cmpq %r15,%r14 + jb .Louter_loop + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -48(%rsi),%r15 + movq -40(%rsi),%r14 + movq -32(%rsi),%r13 + movq -24(%rsi),%r12 + movq -16(%rsi),%rbp + movq -8(%rsi),%rbx + leaq 0(%rsi),%rsp +.Lghash_epilogue: + ret +.size gcm_ghash_4bit,.-gcm_ghash_4bit +.globl gcm_init_clmul +.type gcm_init_clmul,@function +.align 16 +gcm_init_clmul: +.L_init_clmul: + movdqu (%rsi),%xmm2 + pshufd $78,%xmm2,%xmm2 + + + pshufd $255,%xmm2,%xmm4 + movdqa %xmm2,%xmm3 + psllq $1,%xmm2 + pxor %xmm5,%xmm5 + psrlq $63,%xmm3 + pcmpgtd %xmm4,%xmm5 + pslldq $8,%xmm3 + 
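The sequence in flight here (the por and pand just below complete it) is gcm_init_clmul doubling H in the field: shift the 128-bit value left by one, then fold the shifted-out top bit back in with the .L0x1c2_polynomial constant, selected via the pcmpgtd sign trick. A hedged intrinsics sketch of the same step (gcm_shift_h is our illustrative name, not an exported symbol):

    #include <emmintrin.h>          /* SSE2 intrinsics */

    /* H <<= 1 modulo the (bit-reflected) GCM polynomial. The compare
     * builds an all-ones mask exactly when H's top bit is set, mirroring
     * the pshufd $255 / pcmpgtd / pand steps in the assembly above. */
    static __m128i gcm_shift_h(__m128i H)
    {
        const __m128i poly  = _mm_set_epi32((int)0xc2000000, 0, 0, 1);
        __m128i carry = _mm_slli_si128(_mm_srli_epi64(H, 63), 8); /* bit 63 -> 64 */
        __m128i mask  = _mm_cmpgt_epi32(_mm_setzero_si128(),
                                        _mm_shuffle_epi32(H, 0xff));
        H = _mm_or_si128(_mm_slli_epi64(H, 1), carry);
        return _mm_xor_si128(H, _mm_and_si128(mask, poly));
    }
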
por %xmm3,%xmm2 + + + pand .L0x1c2_polynomial(%rip),%xmm5 + pxor %xmm5,%xmm2 + + + pshufd $78,%xmm2,%xmm6 + movdqa %xmm2,%xmm0 + pxor %xmm2,%xmm6 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm2,%xmm3 + pshufd $78,%xmm0,%xmm4 + pxor %xmm2,%xmm3 + movdqu %xmm2,0(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,16(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,32(%rdi) + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm5 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm5,%xmm3 + pshufd $78,%xmm0,%xmm4 + pxor %xmm5,%xmm3 + movdqu %xmm5,48(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,64(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,80(%rdi) + ret +.size gcm_init_clmul,.-gcm_init_clmul +.globl gcm_gmult_clmul +.type gcm_gmult_clmul,@function +.align 16 +gcm_gmult_clmul: +.L_gmult_clmul: + movdqu (%rdi),%xmm0 + movdqa .Lbswap_mask(%rip),%xmm5 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm4 + pshufb %xmm5,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,220,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufb %xmm5,%xmm0 + movdqu %xmm0,(%rdi) + ret +.size gcm_gmult_clmul,.-gcm_gmult_clmul +.globl 
gcm_ghash_clmul +.type gcm_ghash_clmul,@function +.align 32 +gcm_ghash_clmul: +.L_ghash_clmul: + movdqa .Lbswap_mask(%rip),%xmm10 + + movdqu (%rdi),%xmm0 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm7 + pshufb %xmm10,%xmm0 + + subq $0x10,%rcx + jz .Lodd_tail + + movdqu 16(%rsi),%xmm6 + + + cmpq $0x30,%rcx + jb .Lskip4x + + + + + + subq $0x30,%rcx + movq $0xA040608020C0E000,%rax + movdqu 48(%rsi),%xmm14 + movdqu 64(%rsi),%xmm15 + + + + + movdqu 48(%rdx),%xmm3 + movdqu 32(%rdx),%xmm11 + pshufb %xmm10,%xmm3 + pshufb %xmm10,%xmm11 + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm11,%xmm12 +.byte 102,68,15,58,68,222,0 +.byte 102,68,15,58,68,238,17 +.byte 102,68,15,58,68,231,16 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + xorps %xmm12,%xmm4 + + movdqu 16(%rdx),%xmm11 + movdqu 0(%rdx),%xmm8 + pshufb %xmm10,%xmm11 + pshufb %xmm10,%xmm8 + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm8,%xmm0 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 +.byte 102,69,15,58,68,238,17 +.byte 102,68,15,58,68,231,0 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jc .Ltail4x + + jmp .Lmod4_loop +.align 32 +.Lmod4_loop: +.byte 102,65,15,58,68,199,0 + xorps %xmm12,%xmm4 + movdqu 48(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,65,15,58,68,207,17 + xorps %xmm3,%xmm0 + movdqu 32(%rdx),%xmm3 + movdqa %xmm11,%xmm13 +.byte 102,68,15,58,68,199,16 + pshufd $78,%xmm11,%xmm12 + xorps %xmm5,%xmm1 + pxor %xmm11,%xmm12 + pshufb %xmm10,%xmm3 + movups 32(%rsi),%xmm7 + xorps %xmm4,%xmm8 +.byte 102,68,15,58,68,218,0 + pshufd $78,%xmm3,%xmm4 + + pxor %xmm0,%xmm8 + movdqa %xmm3,%xmm5 + pxor %xmm1,%xmm8 + pxor %xmm3,%xmm4 + movdqa %xmm8,%xmm9 +.byte 102,68,15,58,68,234,17 + pslldq $8,%xmm8 + psrldq $8,%xmm9 + pxor %xmm8,%xmm0 + movdqa .L7_mask(%rip),%xmm8 + pxor %xmm9,%xmm1 +.byte 102,76,15,110,200 + + pand %xmm0,%xmm8 + pshufb %xmm8,%xmm9 + pxor %xmm0,%xmm9 +.byte 102,68,15,58,68,231,0 + psllq $57,%xmm9 + movdqa %xmm9,%xmm8 + pslldq $8,%xmm9 +.byte 102,15,58,68,222,0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + movdqu 0(%rdx),%xmm8 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,238,17 + xorps %xmm11,%xmm3 + movdqu 16(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,15,58,68,231,16 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + pshufb %xmm10,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + + movdqa %xmm11,%xmm13 + pxor %xmm12,%xmm4 + pshufd $78,%xmm11,%xmm12 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm1 +.byte 102,69,15,58,68,238,17 + xorps %xmm11,%xmm3 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 + +.byte 102,68,15,58,68,231,0 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jnc .Lmod4_loop + +.Ltail4x: +.byte 102,65,15,58,68,199,0 +.byte 102,65,15,58,68,207,17 +.byte 102,68,15,58,68,199,16 + xorps %xmm12,%xmm4 + xorps %xmm3,%xmm0 + xorps %xmm5,%xmm1 + pxor %xmm0,%xmm1 + pxor %xmm4,%xmm8 + + pxor %xmm1,%xmm8 + pxor %xmm0,%xmm1 + + movdqa %xmm8,%xmm9 + psrldq $8,%xmm8 + pslldq $8,%xmm9 + pxor %xmm8,%xmm1 + pxor %xmm9,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + 
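This tail is the reduction that every CLMUL path in the file shares: three PCLMULQDQs arranged as Karatsuba produce a 256-bit product, and the psllq 5/1/57 cascade just below (then the matching psrlq 1/5/1 cascade) folds it back to 128 bits. A sketch of one whole multiply-and-reduce in intrinsics, assuming byte-reflected inputs and b being the pre-shifted H from gcm_init_clmul (gf128_mul_clmul is our illustrative name):

    #include <emmintrin.h>
    #include <wmmintrin.h>          /* _mm_clmulepi64_si128 (PCLMULQDQ) */

    static __m128i gf128_mul_clmul(__m128i a, __m128i b)
    {
        /* Karatsuba: three carry-less multiplies instead of four. */
        __m128i lo  = _mm_clmulepi64_si128(a, b, 0x00);        /* a.lo*b.lo */
        __m128i hi  = _mm_clmulepi64_si128(a, b, 0x11);        /* a.hi*b.hi */
        __m128i mid = _mm_clmulepi64_si128(
            _mm_xor_si128(a, _mm_shuffle_epi32(a, 0x4e)),      /* pshufd $78 */
            _mm_xor_si128(b, _mm_shuffle_epi32(b, 0x4e)), 0x00);
        mid = _mm_xor_si128(mid, _mm_xor_si128(lo, hi));
        lo  = _mm_xor_si128(lo, _mm_slli_si128(mid, 8));
        hi  = _mm_xor_si128(hi, _mm_srli_si128(mid, 8));

        /* Phase 1: lo * (x^57 + x^62 + x^63), i.e. the psllq 5/1/57 runs. */
        __m128i t = _mm_xor_si128(_mm_slli_epi64(lo, 57),
                    _mm_xor_si128(_mm_slli_epi64(lo, 62),
                                  _mm_slli_epi64(lo, 63)));
        lo = _mm_xor_si128(lo, _mm_slli_si128(t, 8));
        hi = _mm_xor_si128(hi, _mm_srli_si128(t, 8));

        /* Phase 2: the psrlq 1/5/1 runs, then the final accumulation. */
        __m128i r = _mm_xor_si128(_mm_srli_epi64(lo, 1),
                    _mm_xor_si128(_mm_srli_epi64(lo, 2),
                                  _mm_srli_epi64(lo, 7)));
        return _mm_xor_si128(_mm_xor_si128(hi, lo), r);
    }
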
psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + addq $0x40,%rcx + jz .Ldone + movdqu 32(%rsi),%xmm7 + subq $0x10,%rcx + jz .Lodd_tail +.Lskip4x: + + + + + + movdqu (%rdx),%xmm8 + movdqu 16(%rdx),%xmm3 + pshufb %xmm10,%xmm8 + pshufb %xmm10,%xmm3 + pxor %xmm8,%xmm0 + + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + leaq 32(%rdx),%rdx + nop + subq $0x20,%rcx + jbe .Leven_tail + nop + jmp .Lmod_loop + +.align 32 +.Lmod_loop: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + movdqu (%rdx),%xmm9 + pxor %xmm0,%xmm8 + pshufb %xmm10,%xmm9 + movdqu 16(%rdx),%xmm3 + + pxor %xmm1,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm8,%xmm4 + pshufb %xmm10,%xmm3 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm3,%xmm5 + + movdqa %xmm0,%xmm9 + movdqa %xmm0,%xmm8 + psllq $5,%xmm0 + pxor %xmm0,%xmm8 +.byte 102,15,58,68,218,0 + psllq $1,%xmm0 + pxor %xmm8,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm8 + pslldq $8,%xmm0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pshufd $78,%xmm5,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm5,%xmm4 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,234,17 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + pxor %xmm9,%xmm0 + leaq 32(%rdx),%rdx + psrlq $1,%xmm0 +.byte 102,15,58,68,231,0 + pxor %xmm1,%xmm0 + + subq $0x20,%rcx + ja .Lmod_loop + +.Leven_tail: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + pxor %xmm0,%xmm8 + pxor %xmm1,%xmm8 + pxor %xmm8,%xmm4 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + testq %rcx,%rcx + jnz .Ldone + +.Lodd_tail: + movdqu (%rdx),%xmm8 + pshufb %xmm10,%xmm8 + pxor %xmm8,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,223,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 +.Ldone: + pshufb %xmm10,%xmm0 + movdqu %xmm0,(%rdi) + ret +.size gcm_ghash_clmul,.-gcm_ghash_clmul +.globl gcm_init_avx +.type gcm_init_avx,@function +.align 32 +gcm_init_avx: + vzeroupper + + vmovdqu (%rsi),%xmm2 + vpshufd $78,%xmm2,%xmm2 + + + vpshufd $255,%xmm2,%xmm4 + vpsrlq 
$63,%xmm2,%xmm3 + vpsllq $1,%xmm2,%xmm2 + vpxor %xmm5,%xmm5,%xmm5 + vpcmpgtd %xmm4,%xmm5,%xmm5 + vpslldq $8,%xmm3,%xmm3 + vpor %xmm3,%xmm2,%xmm2 + + + vpand .L0x1c2_polynomial(%rip),%xmm5,%xmm5 + vpxor %xmm5,%xmm2,%xmm2 + + vpunpckhqdq %xmm2,%xmm2,%xmm6 + vmovdqa %xmm2,%xmm0 + vpxor %xmm2,%xmm6,%xmm6 + movq $4,%r10 + jmp .Linit_start_avx +.align 32 +.Linit_loop_avx: + vpalignr $8,%xmm3,%xmm4,%xmm5 + vmovdqu %xmm5,-16(%rdi) + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 +.Linit_start_avx: + vmovdqa %xmm0,%xmm5 + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 + vpshufd $78,%xmm5,%xmm3 + vpshufd $78,%xmm0,%xmm4 + vpxor %xmm5,%xmm3,%xmm3 + vmovdqu %xmm5,0(%rdi) + vpxor %xmm0,%xmm4,%xmm4 + vmovdqu %xmm0,16(%rdi) + leaq 48(%rdi),%rdi + subq $1,%r10 + jnz .Linit_loop_avx + + vpalignr $8,%xmm4,%xmm3,%xmm5 + vmovdqu %xmm5,-16(%rdi) + + vzeroupper + ret +.size gcm_init_avx,.-gcm_init_avx +.globl gcm_gmult_avx +.type gcm_gmult_avx,@function +.align 32 +gcm_gmult_avx: + jmp .L_gmult_clmul +.size gcm_gmult_avx,.-gcm_gmult_avx +.globl gcm_ghash_avx +.type gcm_ghash_avx,@function +.align 32 +gcm_ghash_avx: + vzeroupper + + vmovdqu (%rdi),%xmm10 + leaq .L0x1c2_polynomial(%rip),%r10 + leaq 64(%rsi),%rsi + vmovdqu .Lbswap_mask(%rip),%xmm13 + vpshufb %xmm13,%xmm10,%xmm10 + cmpq $0x80,%rcx + jb .Lshort_avx + subq $0x80,%rcx + + vmovdqu 112(%rdx),%xmm14 + vmovdqu 0-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vmovdqu 32-64(%rsi),%xmm7 + + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm14,%xmm9,%xmm9 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 80(%rdx),%xmm14 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 48-64(%rsi),%xmm6 + vpxor %xmm14,%xmm9,%xmm9 + vmovdqu 64(%rdx),%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor 
%xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 48(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 32(%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 16(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu (%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + + leaq 128(%rdx),%rdx + cmpq $0x80,%rcx + jb .Ltail_avx + + vpxor %xmm10,%xmm15,%xmm15 + subq $0x80,%rcx + jmp .Loop8x_avx + +.align 32 +.Loop8x_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 112(%rdx),%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpxor %xmm15,%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm10 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm11 + vmovdqu 0-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm12 + vmovdqu 32-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm3,%xmm10,%xmm10 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vxorps %xmm4,%xmm11,%xmm11 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm5,%xmm12,%xmm12 + vxorps %xmm15,%xmm8,%xmm8 + + vmovdqu 80(%rdx),%xmm14 + vpxor %xmm10,%xmm12,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm11,%xmm12,%xmm12 + vpslldq $8,%xmm12,%xmm9 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vpsrldq $8,%xmm12,%xmm12 + vpxor %xmm9,%xmm10,%xmm10 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vxorps %xmm12,%xmm11,%xmm11 + vpxor %xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 64(%rdx),%xmm15 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vxorps %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + + vmovdqu 48(%rdx),%xmm14 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor 
%xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 32(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + vxorps %xmm12,%xmm10,%xmm10 + + vmovdqu 16(%rdx),%xmm14 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vxorps %xmm11,%xmm12,%xmm12 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu (%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm12,%xmm15,%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + vpxor %xmm10,%xmm15,%xmm15 + + leaq 128(%rdx),%rdx + subq $0x80,%rcx + jnc .Loop8x_avx + + addq $0x80,%rcx + jmp .Ltail_no_xor_avx + +.align 32 +.Lshort_avx: + vmovdqu -16(%rdx,%rcx,1),%xmm14 + leaq (%rdx,%rcx,1),%rdx + vmovdqu 0-64(%rsi),%xmm6 + vmovdqu 32-64(%rsi),%xmm7 + vpshufb %xmm13,%xmm14,%xmm15 + + vmovdqa %xmm0,%xmm3 + vmovdqa %xmm1,%xmm4 + vmovdqa %xmm2,%xmm5 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -32(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -48(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 80-64(%rsi),%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -64(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -80(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 96-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 128-64(%rsi),%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -96(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq 
$0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -112(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 144-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovq 184-64(%rsi),%xmm7 + subq $0x10,%rcx + jmp .Ltail_avx + +.align 32 +.Ltail_avx: + vpxor %xmm10,%xmm15,%xmm15 +.Ltail_no_xor_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + + vmovdqu (%r10),%xmm12 + + vpxor %xmm0,%xmm3,%xmm10 + vpxor %xmm1,%xmm4,%xmm11 + vpxor %xmm2,%xmm5,%xmm5 + + vpxor %xmm10,%xmm5,%xmm5 + vpxor %xmm11,%xmm5,%xmm5 + vpslldq $8,%xmm5,%xmm9 + vpsrldq $8,%xmm5,%xmm5 + vpxor %xmm9,%xmm10,%xmm10 + vpxor %xmm5,%xmm11,%xmm11 + + vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9 + vpalignr $8,%xmm10,%xmm10,%xmm10 + vpxor %xmm9,%xmm10,%xmm10 + + vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9 + vpalignr $8,%xmm10,%xmm10,%xmm10 + vpxor %xmm11,%xmm10,%xmm10 + vpxor %xmm9,%xmm10,%xmm10 + + cmpq $0,%rcx + jne .Lshort_avx + + vpshufb %xmm13,%xmm10,%xmm10 + vmovdqu %xmm10,(%rdi) + vzeroupper + ret +.size gcm_ghash_avx,.-gcm_ghash_avx +.align 64 +.Lbswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.L0x1c2_polynomial: +.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.L7_mask: +.long 7,0,7,0 +.L7_mask_poly: +.long 7,0,450,0 +.align 64 +.type .Lrem_4bit,@object +.Lrem_4bit: +.long 0,0,0,471859200,0,943718400,0,610271232 +.long 0,1887436800,0,1822425088,0,1220542464,0,1423966208 +.long 0,3774873600,0,4246732800,0,3644850176,0,3311403008 +.long 0,2441084928,0,2376073216,0,2847932416,0,3051356160 +.type .Lrem_8bit,@object +.Lrem_8bit: +.value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E +.value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E +.value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E +.value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E +.value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E +.value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E +.value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E +.value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E +.value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE +.value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE +.value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE +.value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE +.value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E +.value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E +.value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE +.value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE +.value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E +.value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E +.value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E +.value 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E +.value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E +.value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E +.value 
0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E +.value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E +.value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE +.value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE +.value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE +.value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE +.value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E +.value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E +.value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE +.value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE + +.byte 71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,108,46,111,114,103,62,0 +.align 64 diff --git a/crypto/aesgcm/ghash_x64_gas_macosx.s b/crypto/aesgcm/ghash_x64_gas_macosx.s new file mode 100644 index 0000000..dfd7cc9 --- /dev/null +++ b/crypto/aesgcm/ghash_x64_gas_macosx.s @@ -0,0 +1,1795 @@ +.text + + +.globl _gcm_gmult_4bit + +.p2align 4 +_gcm_gmult_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +L$gmult_prologue: + + movzbq 15(%rdi),%r8 + leaq L$rem_4bit(%rip),%r11 + xorq %rax,%rax + xorq %rbx,%rbx + movb %r8b,%al + movb %r8b,%bl + shlb $4,%al + movq $14,%rcx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + movq %r8,%rdx + jmp L$oop1 + +.p2align 4 +L$oop1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + movb (%rdi,%rcx,1),%al + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + movb %al,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + shlb $4,%al + xorq %r10,%r8 + decq %rcx + js L$break1 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + jmp L$oop1 + +.p2align 4 +L$break1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + xorq %r10,%r8 + xorq (%r11,%rdx,8),%r9 + + bswapq %r8 + bswapq %r9 + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -8(%rsi),%rbx + leaq (%rsi),%rsp +L$gmult_epilogue: + ret + +.globl _gcm_ghash_4bit + +.p2align 4 +_gcm_ghash_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +L$ghash_prologue: + movq %rdx,%r14 + movq %rcx,%r15 + subq $-128,%rsi + leaq 16+128(%rsp),%rbp + xorl %edx,%edx + movq 0+0-128(%rsi),%r8 + movq 0+8-128(%rsi),%rax + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq 16+0-128(%rsi),%r9 + shlb $4,%dl + movq 16+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,0(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,0(%rbp) + movq 32+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,0-128(%rbp) + movq 32+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,1(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,8(%rbp) + movq 48+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,8-128(%rbp) + movq 48+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,2(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,16(%rbp) + movq 64+0-128(%rsi),%r8 + shlb $4,%dl 
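The .Lrem_8bit table that just closed (it reappears below as L$rem_8bit; this second file is the same code assembled with Mach-O label and symbol spelling) is GF(2)-linear with basis value 0x01C2, the constant that also names .L0x1c2_polynomial. If one wanted to regenerate the .value rows, a short program along these lines would do it (illustrative only, not part of the build):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t rem[256];
        rem[0] = 0;
        rem[1] = 0x01C2;              /* reduction of one shifted-out bit */
        for (int i = 2; i < 256; i <<= 1)
            rem[i] = (uint16_t)(rem[i >> 1] << 1);
        for (int i = 1; i < 256; i++) /* extend by linearity over GF(2) */
            rem[i] = rem[i & -i] ^ rem[i & (i - 1)];
        for (int i = 0; i < 256; i++)
            printf("0x%04X%c", rem[i], (i % 8 == 7) ? '\n' : ',');
        return 0;
    }
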
+ movq %rax,16-128(%rbp) + movq 64+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,3(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,24(%rbp) + movq 80+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,24-128(%rbp) + movq 80+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,4(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,32(%rbp) + movq 96+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,32-128(%rbp) + movq 96+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,5(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,40(%rbp) + movq 112+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,40-128(%rbp) + movq 112+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,6(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,48(%rbp) + movq 128+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,48-128(%rbp) + movq 128+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,7(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,56(%rbp) + movq 144+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,56-128(%rbp) + movq 144+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,8(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,64(%rbp) + movq 160+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,64-128(%rbp) + movq 160+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,9(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,72(%rbp) + movq 176+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,72-128(%rbp) + movq 176+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,10(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,80(%rbp) + movq 192+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,80-128(%rbp) + movq 192+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,11(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,88(%rbp) + movq 208+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,88-128(%rbp) + movq 208+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,12(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,96(%rbp) + movq 224+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,96-128(%rbp) + movq 224+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,13(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,104(%rbp) + movq 240+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,104-128(%rbp) + movq 240+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,14(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,112(%rbp) + shlb $4,%dl + movq %rax,112-128(%rbp) + shlq $60,%r10 + movb %dl,15(%rsp) + orq %r10,%rbx + movq %r9,120(%rbp) + movq %rbx,120-128(%rbp) + addq $-128,%rsi + movq 8(%rdi),%r8 + movq 0(%rdi),%r9 + addq %r14,%r15 + leaq L$rem_8bit(%rip),%r11 + jmp L$outer_loop +.p2align 4 +L$outer_loop: + xorq (%r14),%r9 + movq 8(%r14),%rdx + leaq 16(%r14),%r14 + xorq %r8,%rdx + movq %r9,(%rdi) + movq %rdx,8(%rdi) + shrq $32,%rdx + xorq %rax,%rax + roll $8,%edx + movb %dl,%al + movzbl %dl,%ebx + shlb $4,%al + shrl $4,%ebx + roll $8,%edx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + movb %dl,%al + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + xorq %r8,%r12 + movq %r9,%r10 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb 
%dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 8(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb 
$4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 0(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + andl $240,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl -4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + movzwq (%r11,%r12,2),%r12 + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + shlq $48,%r12 + xorq %r10,%r8 + xorq %r12,%r9 + movzbq %r8b,%r13 + shrq $4,%r8 + movq %r9,%r10 + shlb $4,%r13b + shrq $4,%r9 + xorq 8(%rsi,%rcx,1),%r8 + movzwq (%r11,%r13,2),%r13 + shlq $60,%r10 + xorq (%rsi,%rcx,1),%r9 + xorq %r10,%r8 + shlq $48,%r13 + bswapq %r8 + xorq %r13,%r9 + bswapq %r9 + cmpq %r15,%r14 + jb L$outer_loop + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -48(%rsi),%r15 + movq -40(%rsi),%r14 + movq -32(%rsi),%r13 + movq -24(%rsi),%r12 + movq -16(%rsi),%rbp + movq -8(%rsi),%rbx + leaq 0(%rsi),%rsp +L$ghash_epilogue: + ret + +.globl _gcm_init_clmul + +.p2align 4 +_gcm_init_clmul: +L$_init_clmul: + movdqu (%rsi),%xmm2 + pshufd $78,%xmm2,%xmm2 + + + pshufd $255,%xmm2,%xmm4 + movdqa %xmm2,%xmm3 + psllq $1,%xmm2 + pxor %xmm5,%xmm5 + psrlq $63,%xmm3 + pcmpgtd %xmm4,%xmm5 + pslldq $8,%xmm3 + por %xmm3,%xmm2 + + + pand L$0x1c2_polynomial(%rip),%xmm5 + pxor %xmm5,%xmm2 + + + pshufd $78,%xmm2,%xmm6 + movdqa %xmm2,%xmm0 + pxor %xmm2,%xmm6 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm2,%xmm3 + 
pshufd $78,%xmm0,%xmm4 + pxor %xmm2,%xmm3 + movdqu %xmm2,0(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,16(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,32(%rdi) + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm5 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm5,%xmm3 + pshufd $78,%xmm0,%xmm4 + pxor %xmm5,%xmm3 + movdqu %xmm5,48(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,64(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,80(%rdi) + ret + +.globl _gcm_gmult_clmul + +.p2align 4 +_gcm_gmult_clmul: +L$_gmult_clmul: + movdqu (%rdi),%xmm0 + movdqa L$bswap_mask(%rip),%xmm5 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm4 + pshufb %xmm5,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,220,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufb %xmm5,%xmm0 + movdqu %xmm0,(%rdi) + ret + +.globl _gcm_ghash_clmul + +.p2align 5 +_gcm_ghash_clmul: +L$_ghash_clmul: + movdqa L$bswap_mask(%rip),%xmm10 + + movdqu (%rdi),%xmm0 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm7 + pshufb %xmm10,%xmm0 + + subq $0x10,%rcx + jz L$odd_tail + + movdqu 16(%rsi),%xmm6 + + + cmpq $0x30,%rcx + jb L$skip4x + + + + + + subq $0x30,%rcx + movq $0xA040608020C0E000,%rax + movdqu 48(%rsi),%xmm14 + movdqu 64(%rsi),%xmm15 + + + + + movdqu 48(%rdx),%xmm3 + movdqu 32(%rdx),%xmm11 + pshufb %xmm10,%xmm3 + pshufb %xmm10,%xmm11 + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm11,%xmm12 +.byte 102,68,15,58,68,222,0 +.byte 102,68,15,58,68,238,17 +.byte 102,68,15,58,68,231,16 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + xorps %xmm12,%xmm4 + + movdqu 16(%rdx),%xmm11 + movdqu 
0(%rdx),%xmm8 + pshufb %xmm10,%xmm11 + pshufb %xmm10,%xmm8 + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm8,%xmm0 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 +.byte 102,69,15,58,68,238,17 +.byte 102,68,15,58,68,231,0 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jc L$tail4x + + jmp L$mod4_loop +.p2align 5 +L$mod4_loop: +.byte 102,65,15,58,68,199,0 + xorps %xmm12,%xmm4 + movdqu 48(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,65,15,58,68,207,17 + xorps %xmm3,%xmm0 + movdqu 32(%rdx),%xmm3 + movdqa %xmm11,%xmm13 +.byte 102,68,15,58,68,199,16 + pshufd $78,%xmm11,%xmm12 + xorps %xmm5,%xmm1 + pxor %xmm11,%xmm12 + pshufb %xmm10,%xmm3 + movups 32(%rsi),%xmm7 + xorps %xmm4,%xmm8 +.byte 102,68,15,58,68,218,0 + pshufd $78,%xmm3,%xmm4 + + pxor %xmm0,%xmm8 + movdqa %xmm3,%xmm5 + pxor %xmm1,%xmm8 + pxor %xmm3,%xmm4 + movdqa %xmm8,%xmm9 +.byte 102,68,15,58,68,234,17 + pslldq $8,%xmm8 + psrldq $8,%xmm9 + pxor %xmm8,%xmm0 + movdqa L$7_mask(%rip),%xmm8 + pxor %xmm9,%xmm1 +.byte 102,76,15,110,200 + + pand %xmm0,%xmm8 + pshufb %xmm8,%xmm9 + pxor %xmm0,%xmm9 +.byte 102,68,15,58,68,231,0 + psllq $57,%xmm9 + movdqa %xmm9,%xmm8 + pslldq $8,%xmm9 +.byte 102,15,58,68,222,0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + movdqu 0(%rdx),%xmm8 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,238,17 + xorps %xmm11,%xmm3 + movdqu 16(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,15,58,68,231,16 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + pshufb %xmm10,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + + movdqa %xmm11,%xmm13 + pxor %xmm12,%xmm4 + pshufd $78,%xmm11,%xmm12 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm1 +.byte 102,69,15,58,68,238,17 + xorps %xmm11,%xmm3 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 + +.byte 102,68,15,58,68,231,0 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jnc L$mod4_loop + +L$tail4x: +.byte 102,65,15,58,68,199,0 +.byte 102,65,15,58,68,207,17 +.byte 102,68,15,58,68,199,16 + xorps %xmm12,%xmm4 + xorps %xmm3,%xmm0 + xorps %xmm5,%xmm1 + pxor %xmm0,%xmm1 + pxor %xmm4,%xmm8 + + pxor %xmm1,%xmm8 + pxor %xmm0,%xmm1 + + movdqa %xmm8,%xmm9 + psrldq $8,%xmm8 + pslldq $8,%xmm9 + pxor %xmm8,%xmm1 + pxor %xmm9,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + addq $0x40,%rcx + jz L$done + movdqu 32(%rsi),%xmm7 + subq $0x10,%rcx + jz L$odd_tail +L$skip4x: + + + + + + movdqu (%rdx),%xmm8 + movdqu 16(%rdx),%xmm3 + pshufb %xmm10,%xmm8 + pshufb %xmm10,%xmm3 + pxor %xmm8,%xmm0 + + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + leaq 32(%rdx),%rdx + nop + subq $0x20,%rcx + jbe L$even_tail + nop + jmp L$mod_loop + +.p2align 5 +L$mod_loop: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + movdqu (%rdx),%xmm9 + pxor %xmm0,%xmm8 + pshufb %xmm10,%xmm9 + 
movdqu 16(%rdx),%xmm3 + + pxor %xmm1,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm8,%xmm4 + pshufb %xmm10,%xmm3 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm3,%xmm5 + + movdqa %xmm0,%xmm9 + movdqa %xmm0,%xmm8 + psllq $5,%xmm0 + pxor %xmm0,%xmm8 +.byte 102,15,58,68,218,0 + psllq $1,%xmm0 + pxor %xmm8,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm8 + pslldq $8,%xmm0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pshufd $78,%xmm5,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm5,%xmm4 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,234,17 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + pxor %xmm9,%xmm0 + leaq 32(%rdx),%rdx + psrlq $1,%xmm0 +.byte 102,15,58,68,231,0 + pxor %xmm1,%xmm0 + + subq $0x20,%rcx + ja L$mod_loop + +L$even_tail: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + pxor %xmm0,%xmm8 + pxor %xmm1,%xmm8 + pxor %xmm8,%xmm4 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + testq %rcx,%rcx + jnz L$done + +L$odd_tail: + movdqu (%rdx),%xmm8 + pshufb %xmm10,%xmm8 + pxor %xmm8,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,223,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 +L$done: + pshufb %xmm10,%xmm0 + movdqu %xmm0,(%rdi) + ret + +.globl _gcm_init_avx + +.p2align 5 +_gcm_init_avx: + vzeroupper + + vmovdqu (%rsi),%xmm2 + vpshufd $78,%xmm2,%xmm2 + + + vpshufd $255,%xmm2,%xmm4 + vpsrlq $63,%xmm2,%xmm3 + vpsllq $1,%xmm2,%xmm2 + vpxor %xmm5,%xmm5,%xmm5 + vpcmpgtd %xmm4,%xmm5,%xmm5 + vpslldq $8,%xmm3,%xmm3 + vpor %xmm3,%xmm2,%xmm2 + + + vpand L$0x1c2_polynomial(%rip),%xmm5,%xmm5 + vpxor %xmm5,%xmm2,%xmm2 + + vpunpckhqdq %xmm2,%xmm2,%xmm6 + vmovdqa %xmm2,%xmm0 + vpxor %xmm2,%xmm6,%xmm6 + movq $4,%r10 + jmp L$init_start_avx +.p2align 5 +L$init_loop_avx: + vpalignr $8,%xmm3,%xmm4,%xmm5 + vmovdqu %xmm5,-16(%rdi) + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor 
%xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 +L$init_start_avx: + vmovdqa %xmm0,%xmm5 + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 + vpshufd $78,%xmm5,%xmm3 + vpshufd $78,%xmm0,%xmm4 + vpxor %xmm5,%xmm3,%xmm3 + vmovdqu %xmm5,0(%rdi) + vpxor %xmm0,%xmm4,%xmm4 + vmovdqu %xmm0,16(%rdi) + leaq 48(%rdi),%rdi + subq $1,%r10 + jnz L$init_loop_avx + + vpalignr $8,%xmm4,%xmm3,%xmm5 + vmovdqu %xmm5,-16(%rdi) + + vzeroupper + ret + +.globl _gcm_gmult_avx + +.p2align 5 +_gcm_gmult_avx: + jmp L$_gmult_clmul + +.globl _gcm_ghash_avx + +.p2align 5 +_gcm_ghash_avx: + vzeroupper + + vmovdqu (%rdi),%xmm10 + leaq L$0x1c2_polynomial(%rip),%r10 + leaq 64(%rsi),%rsi + vmovdqu L$bswap_mask(%rip),%xmm13 + vpshufb %xmm13,%xmm10,%xmm10 + cmpq $0x80,%rcx + jb L$short_avx + subq $0x80,%rcx + + vmovdqu 112(%rdx),%xmm14 + vmovdqu 0-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vmovdqu 32-64(%rsi),%xmm7 + + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm14,%xmm9,%xmm9 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 80(%rdx),%xmm14 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 48-64(%rsi),%xmm6 + vpxor %xmm14,%xmm9,%xmm9 + vmovdqu 64(%rdx),%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 48(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 32(%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 16(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + 
vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu (%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + + leaq 128(%rdx),%rdx + cmpq $0x80,%rcx + jb L$tail_avx + + vpxor %xmm10,%xmm15,%xmm15 + subq $0x80,%rcx + jmp L$oop8x_avx + +.p2align 5 +L$oop8x_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 112(%rdx),%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpxor %xmm15,%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm10 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm11 + vmovdqu 0-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm12 + vmovdqu 32-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm3,%xmm10,%xmm10 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vxorps %xmm4,%xmm11,%xmm11 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm5,%xmm12,%xmm12 + vxorps %xmm15,%xmm8,%xmm8 + + vmovdqu 80(%rdx),%xmm14 + vpxor %xmm10,%xmm12,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm11,%xmm12,%xmm12 + vpslldq $8,%xmm12,%xmm9 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vpsrldq $8,%xmm12,%xmm12 + vpxor %xmm9,%xmm10,%xmm10 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vxorps %xmm12,%xmm11,%xmm11 + vpxor %xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 64(%rdx),%xmm15 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vxorps %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + + vmovdqu 48(%rdx),%xmm14 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 32(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + vxorps %xmm12,%xmm10,%xmm10 + + vmovdqu 16(%rdx),%xmm14 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vxorps %xmm11,%xmm12,%xmm12 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu (%rdx),%xmm15 + 
vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm12,%xmm15,%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + vpxor %xmm10,%xmm15,%xmm15 + + leaq 128(%rdx),%rdx + subq $0x80,%rcx + jnc L$oop8x_avx + + addq $0x80,%rcx + jmp L$tail_no_xor_avx + +.p2align 5 +L$short_avx: + vmovdqu -16(%rdx,%rcx,1),%xmm14 + leaq (%rdx,%rcx,1),%rdx + vmovdqu 0-64(%rsi),%xmm6 + vmovdqu 32-64(%rsi),%xmm7 + vpshufb %xmm13,%xmm14,%xmm15 + + vmovdqa %xmm0,%xmm3 + vmovdqa %xmm1,%xmm4 + vmovdqa %xmm2,%xmm5 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -32(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -48(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 80-64(%rsi),%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -64(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -80(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 96-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 128-64(%rsi),%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -96(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -112(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 144-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovq 184-64(%rsi),%xmm7 + subq $0x10,%rcx + jmp L$tail_avx + +.p2align 5 +L$tail_avx: + vpxor %xmm10,%xmm15,%xmm15 +L$tail_no_xor_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + + vmovdqu (%r10),%xmm12 + + vpxor %xmm0,%xmm3,%xmm10 + vpxor %xmm1,%xmm4,%xmm11 + vpxor %xmm2,%xmm5,%xmm5 + + vpxor %xmm10,%xmm5,%xmm5 + vpxor %xmm11,%xmm5,%xmm5 + vpslldq $8,%xmm5,%xmm9 + vpsrldq 
$8,%xmm5,%xmm5
+ vpxor %xmm9,%xmm10,%xmm10
+ vpxor %xmm5,%xmm11,%xmm11
+
+ vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9
+ vpalignr $8,%xmm10,%xmm10,%xmm10
+ vpxor %xmm9,%xmm10,%xmm10
+
+ vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9
+ vpalignr $8,%xmm10,%xmm10,%xmm10
+ vpxor %xmm11,%xmm10,%xmm10
+ vpxor %xmm9,%xmm10,%xmm10
+
+ cmpq $0,%rcx
+ jne L$short_avx
+
+ vpshufb %xmm13,%xmm10,%xmm10
+ vmovdqu %xmm10,(%rdi)
+ vzeroupper
+ ret
+
+.p2align 6
+L$bswap_mask:
+.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
+L$0x1c2_polynomial:
+.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2
+L$7_mask:
+.long 7,0,7,0
+L$7_mask_poly:
+.long 7,0,450,0
+.p2align 6
+
+L$rem_4bit:
+.long 0,0,0,471859200,0,943718400,0,610271232
+.long 0,1887436800,0,1822425088,0,1220542464,0,1423966208
+.long 0,3774873600,0,4246732800,0,3644850176,0,3311403008
+.long 0,2441084928,0,2376073216,0,2847932416,0,3051356160
+
+L$rem_8bit:
+.value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E
+.value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E
+.value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E
+.value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E
+.value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E
+.value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E
+.value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E
+.value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E
+.value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE
+.value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE
+.value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE
+.value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE
+.value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E
+.value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E
+.value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE
+.value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE
+.value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E
+.value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E
+.value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E
+.value 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E
+.value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E
+.value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E
+.value 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E
+.value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E
+.value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE
+.value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE
+.value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE
+.value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE
+.value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E
+.value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E
+.value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE
+.value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE
+
+.byte 71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,108,46,111,114,103,62,0
+.p2align 6
diff --git a/crypto/aesgcm/ghash_x64_nasm.asm b/crypto/aesgcm/ghash_x64_nasm.asm
new file mode 100644
index 0000000..22a8020
--- /dev/null
+++ b/crypto/aesgcm/ghash_x64_nasm.asm
@@ -0,0 +1,2029 @@
+default rel
+%define XMMWORD
+%define YMMWORD
+%define ZMMWORD
+section .text code align=64
+
+;.extern OPENSSL_ia32cap_P
+
+global gcm_gmult_4bit
+
+ALIGN 16
+gcm_gmult_4bit:
+ mov QWORD[8+rsp],rdi ;WIN64 prologue
+ mov QWORD[16+rsp],rsi
+ mov rax,rsp
+$L$SEH_begin_gcm_gmult_4bit: + mov rdi,rcx + mov rsi,rdx + + + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + sub rsp,280 +$L$gmult_prologue: + + movzx r8,BYTE[15+rdi] + lea r11,[$L$rem_4bit] + xor rax,rax + xor rbx,rbx + mov al,r8b + mov bl,r8b + shl al,4 + mov rcx,14 + mov r8,QWORD[8+rax*1+rsi] + mov r9,QWORD[rax*1+rsi] + and bl,0xf0 + mov rdx,r8 + jmp NEAR $L$oop1 + +ALIGN 16 +$L$oop1: + shr r8,4 + and rdx,0xf + mov r10,r9 + mov al,BYTE[rcx*1+rdi] + shr r9,4 + xor r8,QWORD[8+rbx*1+rsi] + shl r10,60 + xor r9,QWORD[rbx*1+rsi] + mov bl,al + xor r9,QWORD[rdx*8+r11] + mov rdx,r8 + shl al,4 + xor r8,r10 + dec rcx + js NEAR $L$break1 + + shr r8,4 + and rdx,0xf + mov r10,r9 + shr r9,4 + xor r8,QWORD[8+rax*1+rsi] + shl r10,60 + xor r9,QWORD[rax*1+rsi] + and bl,0xf0 + xor r9,QWORD[rdx*8+r11] + mov rdx,r8 + xor r8,r10 + jmp NEAR $L$oop1 + +ALIGN 16 +$L$break1: + shr r8,4 + and rdx,0xf + mov r10,r9 + shr r9,4 + xor r8,QWORD[8+rax*1+rsi] + shl r10,60 + xor r9,QWORD[rax*1+rsi] + and bl,0xf0 + xor r9,QWORD[rdx*8+r11] + mov rdx,r8 + xor r8,r10 + + shr r8,4 + and rdx,0xf + mov r10,r9 + shr r9,4 + xor r8,QWORD[8+rbx*1+rsi] + shl r10,60 + xor r9,QWORD[rbx*1+rsi] + xor r8,r10 + xor r9,QWORD[rdx*8+r11] + + bswap r8 + bswap r9 + mov QWORD[8+rdi],r8 + mov QWORD[rdi],r9 + + lea rsi,[((280+48))+rsp] + mov rbx,QWORD[((-8))+rsi] + lea rsp,[rsi] +$L$gmult_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret +$L$SEH_end_gcm_gmult_4bit: +global gcm_ghash_4bit + +ALIGN 16 +gcm_ghash_4bit: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_gcm_ghash_4bit: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + sub rsp,280 +$L$ghash_prologue: + mov r14,rdx + mov r15,rcx + sub rsi,-128 + lea rbp,[((16+128))+rsp] + xor edx,edx + mov r8,QWORD[((0+0-128))+rsi] + mov rax,QWORD[((0+8-128))+rsi] + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov r9,QWORD[((16+0-128))+rsi] + shl dl,4 + mov rbx,QWORD[((16+8-128))+rsi] + shl r10,60 + mov BYTE[rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[rbp],r8 + mov r8,QWORD[((32+0-128))+rsi] + shl dl,4 + mov QWORD[((0-128))+rbp],rax + mov rax,QWORD[((32+8-128))+rsi] + shl r10,60 + mov BYTE[1+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[8+rbp],r9 + mov r9,QWORD[((48+0-128))+rsi] + shl dl,4 + mov QWORD[((8-128))+rbp],rbx + mov rbx,QWORD[((48+8-128))+rsi] + shl r10,60 + mov BYTE[2+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[16+rbp],r8 + mov r8,QWORD[((64+0-128))+rsi] + shl dl,4 + mov QWORD[((16-128))+rbp],rax + mov rax,QWORD[((64+8-128))+rsi] + shl r10,60 + mov BYTE[3+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[24+rbp],r9 + mov r9,QWORD[((80+0-128))+rsi] + shl dl,4 + mov QWORD[((24-128))+rbp],rbx + mov rbx,QWORD[((80+8-128))+rsi] + shl r10,60 + mov BYTE[4+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[32+rbp],r8 + mov r8,QWORD[((96+0-128))+rsi] + shl dl,4 + mov QWORD[((32-128))+rbp],rax + mov rax,QWORD[((96+8-128))+rsi] + shl r10,60 + mov BYTE[5+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[40+rbp],r9 + mov r9,QWORD[((112+0-128))+rsi] + shl dl,4 + mov QWORD[((40-128))+rbp],rbx + mov rbx,QWORD[((112+8-128))+rsi] + shl r10,60 + mov BYTE[6+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[48+rbp],r8 + 
mov r8,QWORD[((128+0-128))+rsi] + shl dl,4 + mov QWORD[((48-128))+rbp],rax + mov rax,QWORD[((128+8-128))+rsi] + shl r10,60 + mov BYTE[7+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[56+rbp],r9 + mov r9,QWORD[((144+0-128))+rsi] + shl dl,4 + mov QWORD[((56-128))+rbp],rbx + mov rbx,QWORD[((144+8-128))+rsi] + shl r10,60 + mov BYTE[8+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[64+rbp],r8 + mov r8,QWORD[((160+0-128))+rsi] + shl dl,4 + mov QWORD[((64-128))+rbp],rax + mov rax,QWORD[((160+8-128))+rsi] + shl r10,60 + mov BYTE[9+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[72+rbp],r9 + mov r9,QWORD[((176+0-128))+rsi] + shl dl,4 + mov QWORD[((72-128))+rbp],rbx + mov rbx,QWORD[((176+8-128))+rsi] + shl r10,60 + mov BYTE[10+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[80+rbp],r8 + mov r8,QWORD[((192+0-128))+rsi] + shl dl,4 + mov QWORD[((80-128))+rbp],rax + mov rax,QWORD[((192+8-128))+rsi] + shl r10,60 + mov BYTE[11+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[88+rbp],r9 + mov r9,QWORD[((208+0-128))+rsi] + shl dl,4 + mov QWORD[((88-128))+rbp],rbx + mov rbx,QWORD[((208+8-128))+rsi] + shl r10,60 + mov BYTE[12+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[96+rbp],r8 + mov r8,QWORD[((224+0-128))+rsi] + shl dl,4 + mov QWORD[((96-128))+rbp],rax + mov rax,QWORD[((224+8-128))+rsi] + shl r10,60 + mov BYTE[13+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[104+rbp],r9 + mov r9,QWORD[((240+0-128))+rsi] + shl dl,4 + mov QWORD[((104-128))+rbp],rbx + mov rbx,QWORD[((240+8-128))+rsi] + shl r10,60 + mov BYTE[14+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[112+rbp],r8 + shl dl,4 + mov QWORD[((112-128))+rbp],rax + shl r10,60 + mov BYTE[15+rsp],dl + or rbx,r10 + mov QWORD[120+rbp],r9 + mov QWORD[((120-128))+rbp],rbx + add rsi,-128 + mov r8,QWORD[8+rdi] + mov r9,QWORD[rdi] + add r15,r14 + lea r11,[$L$rem_8bit] + jmp NEAR $L$outer_loop +ALIGN 16 +$L$outer_loop: + xor r9,QWORD[r14] + mov rdx,QWORD[8+r14] + lea r14,[16+r14] + xor rdx,r8 + mov QWORD[rdi],r9 + mov QWORD[8+rdi],rdx + shr rdx,32 + xor rax,rax + rol edx,8 + mov al,dl + movzx ebx,dl + shl al,4 + shr ebx,4 + rol edx,8 + mov r8,QWORD[8+rax*1+rsi] + mov r9,QWORD[rax*1+rsi] + mov al,dl + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + xor r12,r8 + mov r10,r9 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[8+rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + 
xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[4+rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 
+ mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + and ecx,240 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[((-4))+rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + movzx r12,WORD[r12*2+r11] + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + shl r12,48 + xor r8,r10 + xor r9,r12 + movzx r13,r8b + shr r8,4 + mov r10,r9 + shl r13b,4 + shr r9,4 + xor r8,QWORD[8+rcx*1+rsi] + movzx r13,WORD[r13*2+r11] + shl r10,60 + xor r9,QWORD[rcx*1+rsi] + xor r8,r10 + shl r13,48 + bswap r8 + xor r9,r13 + bswap r9 + cmp r14,r15 + jb NEAR $L$outer_loop + mov QWORD[8+rdi],r8 + mov QWORD[rdi],r9 + + lea rsi,[((280+48))+rsp] + mov r15,QWORD[((-48))+rsi] + mov r14,QWORD[((-40))+rsi] + mov r13,QWORD[((-32))+rsi] + mov r12,QWORD[((-24))+rsi] + mov rbp,QWORD[((-16))+rsi] + mov rbx,QWORD[((-8))+rsi] + lea rsp,[rsi] +$L$ghash_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret +$L$SEH_end_gcm_ghash_4bit: +global gcm_init_clmul + +ALIGN 16 +gcm_init_clmul: +$L$_init_clmul: +$L$SEH_begin_gcm_init_clmul: + +DB 0x48,0x83,0xec,0x18 +DB 0x0f,0x29,0x34,0x24 + movdqu xmm2,XMMWORD[rdx] + pshufd xmm2,xmm2,78 + + + pshufd xmm4,xmm2,255 + movdqa xmm3,xmm2 + psllq xmm2,1 + pxor xmm5,xmm5 + psrlq xmm3,63 + pcmpgtd xmm5,xmm4 + pslldq xmm3,8 + por xmm2,xmm3 + + + pand xmm5,XMMWORD[$L$0x1c2_polynomial] + pxor xmm2,xmm5 + + + pshufd xmm6,xmm2,78 + movdqa xmm0,xmm2 + pxor xmm6,xmm2 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,222,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + pshufd xmm3,xmm2,78 + pshufd xmm4,xmm0,78 + pxor xmm3,xmm2 + movdqu XMMWORD[rcx],xmm2 + pxor xmm4,xmm0 + movdqu XMMWORD[16+rcx],xmm0 +DB 102,15,58,15,227,8 + movdqu XMMWORD[32+rcx],xmm4 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,222,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + movdqa xmm5,xmm0 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,222,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + 
pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + pshufd xmm3,xmm5,78 + pshufd xmm4,xmm0,78 + pxor xmm3,xmm5 + movdqu XMMWORD[48+rcx],xmm5 + pxor xmm4,xmm0 + movdqu XMMWORD[64+rcx],xmm0 +DB 102,15,58,15,227,8 + movdqu XMMWORD[80+rcx],xmm4 + movaps xmm6,XMMWORD[rsp] + lea rsp,[24+rsp] +$L$SEH_end_gcm_init_clmul: + ret + +global gcm_gmult_clmul + +ALIGN 16 +gcm_gmult_clmul: +$L$_gmult_clmul: + movdqu xmm0,XMMWORD[rcx] + movdqa xmm5,XMMWORD[$L$bswap_mask] + movdqu xmm2,XMMWORD[rdx] + movdqu xmm4,XMMWORD[32+rdx] + pshufb xmm0,xmm5 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,220,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + pshufb xmm0,xmm5 + movdqu XMMWORD[rcx],xmm0 + ret + +global gcm_ghash_clmul + +ALIGN 32 +gcm_ghash_clmul: +$L$_ghash_clmul: + lea rax,[((-136))+rsp] +$L$SEH_begin_gcm_ghash_clmul: + +DB 0x48,0x8d,0x60,0xe0 +DB 0x0f,0x29,0x70,0xe0 +DB 0x0f,0x29,0x78,0xf0 +DB 0x44,0x0f,0x29,0x00 +DB 0x44,0x0f,0x29,0x48,0x10 +DB 0x44,0x0f,0x29,0x50,0x20 +DB 0x44,0x0f,0x29,0x58,0x30 +DB 0x44,0x0f,0x29,0x60,0x40 +DB 0x44,0x0f,0x29,0x68,0x50 +DB 0x44,0x0f,0x29,0x70,0x60 +DB 0x44,0x0f,0x29,0x78,0x70 + movdqa xmm10,XMMWORD[$L$bswap_mask] + + movdqu xmm0,XMMWORD[rcx] + movdqu xmm2,XMMWORD[rdx] + movdqu xmm7,XMMWORD[32+rdx] + pshufb xmm0,xmm10 + + sub r9,0x10 + jz NEAR $L$odd_tail + + movdqu xmm6,XMMWORD[16+rdx] +; leaq OPENSSL_ia32cap_P(%rip),%rax +; mov 4(%rax),%eax + cmp r9,0x30 + jb NEAR $L$skip4x + +; and $71303168,%eax +; cmp $4194304,%eax +; je .Lskip4x + + sub r9,0x30 + mov rax,0xA040608020C0E000 + movdqu xmm14,XMMWORD[48+rdx] + movdqu xmm15,XMMWORD[64+rdx] + + + + + movdqu xmm3,XMMWORD[48+r8] + movdqu xmm11,XMMWORD[32+r8] + pshufb xmm3,xmm10 + pshufb xmm11,xmm10 + movdqa xmm5,xmm3 + pshufd xmm4,xmm3,78 + pxor xmm4,xmm3 +DB 102,15,58,68,218,0 +DB 102,15,58,68,234,17 +DB 102,15,58,68,231,0 + + movdqa xmm13,xmm11 + pshufd xmm12,xmm11,78 + pxor xmm12,xmm11 +DB 102,68,15,58,68,222,0 +DB 102,68,15,58,68,238,17 +DB 102,68,15,58,68,231,16 + xorps xmm3,xmm11 + xorps xmm5,xmm13 + movups xmm7,XMMWORD[80+rdx] + xorps xmm4,xmm12 + + movdqu xmm11,XMMWORD[16+r8] + movdqu xmm8,XMMWORD[r8] + pshufb xmm11,xmm10 + pshufb xmm8,xmm10 + movdqa xmm13,xmm11 + pshufd xmm12,xmm11,78 + pxor xmm0,xmm8 + pxor xmm12,xmm11 +DB 102,69,15,58,68,222,0 + movdqa xmm1,xmm0 + pshufd xmm8,xmm0,78 + pxor xmm8,xmm0 +DB 102,69,15,58,68,238,17 +DB 102,68,15,58,68,231,0 + xorps xmm3,xmm11 + xorps xmm5,xmm13 + + lea r8,[64+r8] + sub r9,0x40 + jc NEAR $L$tail4x + + jmp NEAR $L$mod4_loop +ALIGN 32 +$L$mod4_loop: +DB 102,65,15,58,68,199,0 + xorps xmm4,xmm12 + movdqu xmm11,XMMWORD[48+r8] + pshufb xmm11,xmm10 +DB 102,65,15,58,68,207,17 + xorps xmm0,xmm3 + movdqu xmm3,XMMWORD[32+r8] + movdqa xmm13,xmm11 +DB 102,68,15,58,68,199,16 + pshufd xmm12,xmm11,78 + xorps xmm1,xmm5 + pxor xmm12,xmm11 + pshufb xmm3,xmm10 + movups xmm7,XMMWORD[32+rdx] + xorps xmm8,xmm4 +DB 102,68,15,58,68,218,0 + pshufd xmm4,xmm3,78 + + pxor xmm8,xmm0 + movdqa 
xmm5,xmm3 + pxor xmm8,xmm1 + pxor xmm4,xmm3 + movdqa xmm9,xmm8 +DB 102,68,15,58,68,234,17 + pslldq xmm8,8 + psrldq xmm9,8 + pxor xmm0,xmm8 + movdqa xmm8,XMMWORD[$L$7_mask] + pxor xmm1,xmm9 +DB 102,76,15,110,200 + + pand xmm8,xmm0 + pshufb xmm9,xmm8 + pxor xmm9,xmm0 +DB 102,68,15,58,68,231,0 + psllq xmm9,57 + movdqa xmm8,xmm9 + pslldq xmm9,8 +DB 102,15,58,68,222,0 + psrldq xmm8,8 + pxor xmm0,xmm9 + pxor xmm1,xmm8 + movdqu xmm8,XMMWORD[r8] + + movdqa xmm9,xmm0 + psrlq xmm0,1 +DB 102,15,58,68,238,17 + xorps xmm3,xmm11 + movdqu xmm11,XMMWORD[16+r8] + pshufb xmm11,xmm10 +DB 102,15,58,68,231,16 + xorps xmm5,xmm13 + movups xmm7,XMMWORD[80+rdx] + pshufb xmm8,xmm10 + pxor xmm1,xmm9 + pxor xmm9,xmm0 + psrlq xmm0,5 + + movdqa xmm13,xmm11 + pxor xmm4,xmm12 + pshufd xmm12,xmm11,78 + pxor xmm0,xmm9 + pxor xmm1,xmm8 + pxor xmm12,xmm11 +DB 102,69,15,58,68,222,0 + psrlq xmm0,1 + pxor xmm0,xmm1 + movdqa xmm1,xmm0 +DB 102,69,15,58,68,238,17 + xorps xmm3,xmm11 + pshufd xmm8,xmm0,78 + pxor xmm8,xmm0 + +DB 102,68,15,58,68,231,0 + xorps xmm5,xmm13 + + lea r8,[64+r8] + sub r9,0x40 + jnc NEAR $L$mod4_loop + +$L$tail4x: +DB 102,65,15,58,68,199,0 +DB 102,65,15,58,68,207,17 +DB 102,68,15,58,68,199,16 + xorps xmm4,xmm12 + xorps xmm0,xmm3 + xorps xmm1,xmm5 + pxor xmm1,xmm0 + pxor xmm8,xmm4 + + pxor xmm8,xmm1 + pxor xmm1,xmm0 + + movdqa xmm9,xmm8 + psrldq xmm8,8 + pslldq xmm9,8 + pxor xmm1,xmm8 + pxor xmm0,xmm9 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + add r9,0x40 + jz NEAR $L$done + movdqu xmm7,XMMWORD[32+rdx] + sub r9,0x10 + jz NEAR $L$odd_tail +$L$skip4x: + + + + + + movdqu xmm8,XMMWORD[r8] + movdqu xmm3,XMMWORD[16+r8] + pshufb xmm8,xmm10 + pshufb xmm3,xmm10 + pxor xmm0,xmm8 + + movdqa xmm5,xmm3 + pshufd xmm4,xmm3,78 + pxor xmm4,xmm3 +DB 102,15,58,68,218,0 +DB 102,15,58,68,234,17 +DB 102,15,58,68,231,0 + + lea r8,[32+r8] + nop + sub r9,0x20 + jbe NEAR $L$even_tail + nop + jmp NEAR $L$mod_loop + +ALIGN 32 +$L$mod_loop: + movdqa xmm1,xmm0 + movdqa xmm8,xmm4 + pshufd xmm4,xmm0,78 + pxor xmm4,xmm0 + +DB 102,15,58,68,198,0 +DB 102,15,58,68,206,17 +DB 102,15,58,68,231,16 + + pxor xmm0,xmm3 + pxor xmm1,xmm5 + movdqu xmm9,XMMWORD[r8] + pxor xmm8,xmm0 + pshufb xmm9,xmm10 + movdqu xmm3,XMMWORD[16+r8] + + pxor xmm8,xmm1 + pxor xmm1,xmm9 + pxor xmm4,xmm8 + pshufb xmm3,xmm10 + movdqa xmm8,xmm4 + psrldq xmm8,8 + pslldq xmm4,8 + pxor xmm1,xmm8 + pxor xmm0,xmm4 + + movdqa xmm5,xmm3 + + movdqa xmm9,xmm0 + movdqa xmm8,xmm0 + psllq xmm0,5 + pxor xmm8,xmm0 +DB 102,15,58,68,218,0 + psllq xmm0,1 + pxor xmm0,xmm8 + psllq xmm0,57 + movdqa xmm8,xmm0 + pslldq xmm0,8 + psrldq xmm8,8 + pxor xmm0,xmm9 + pshufd xmm4,xmm5,78 + pxor xmm1,xmm8 + pxor xmm4,xmm5 + + movdqa xmm9,xmm0 + psrlq xmm0,1 +DB 102,15,58,68,234,17 + pxor xmm1,xmm9 + pxor xmm9,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm9 + lea r8,[32+r8] + psrlq xmm0,1 +DB 102,15,58,68,231,0 + pxor xmm0,xmm1 + + sub r9,0x20 + ja NEAR $L$mod_loop + +$L$even_tail: + movdqa xmm1,xmm0 + movdqa xmm8,xmm4 + pshufd xmm4,xmm0,78 + pxor xmm4,xmm0 + +DB 102,15,58,68,198,0 +DB 102,15,58,68,206,17 +DB 102,15,58,68,231,16 + + pxor xmm0,xmm3 + pxor xmm1,xmm5 + pxor xmm8,xmm0 + pxor xmm8,xmm1 + pxor xmm4,xmm8 + movdqa xmm8,xmm4 + psrldq xmm8,8 + pslldq xmm4,8 + pxor xmm1,xmm8 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa 
xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + test r9,r9 + jnz NEAR $L$done + +$L$odd_tail: + movdqu xmm8,XMMWORD[r8] + pshufb xmm8,xmm10 + pxor xmm0,xmm8 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,223,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 +$L$done: + pshufb xmm0,xmm10 + movdqu XMMWORD[rcx],xmm0 + movaps xmm6,XMMWORD[rsp] + movaps xmm7,XMMWORD[16+rsp] + movaps xmm8,XMMWORD[32+rsp] + movaps xmm9,XMMWORD[48+rsp] + movaps xmm10,XMMWORD[64+rsp] + movaps xmm11,XMMWORD[80+rsp] + movaps xmm12,XMMWORD[96+rsp] + movaps xmm13,XMMWORD[112+rsp] + movaps xmm14,XMMWORD[128+rsp] + movaps xmm15,XMMWORD[144+rsp] + lea rsp,[168+rsp] +$L$SEH_end_gcm_ghash_clmul: + ret + +global gcm_init_avx + +ALIGN 32 +gcm_init_avx: +$L$SEH_begin_gcm_init_avx: + +DB 0x48,0x83,0xec,0x18 +DB 0x0f,0x29,0x34,0x24 + vzeroupper + + vmovdqu xmm2,XMMWORD[rdx] + vpshufd xmm2,xmm2,78 + + + vpshufd xmm4,xmm2,255 + vpsrlq xmm3,xmm2,63 + vpsllq xmm2,xmm2,1 + vpxor xmm5,xmm5,xmm5 + vpcmpgtd xmm5,xmm5,xmm4 + vpslldq xmm3,xmm3,8 + vpor xmm2,xmm2,xmm3 + + + vpand xmm5,xmm5,XMMWORD[$L$0x1c2_polynomial] + vpxor xmm2,xmm2,xmm5 + + vpunpckhqdq xmm6,xmm2,xmm2 + vmovdqa xmm0,xmm2 + vpxor xmm6,xmm6,xmm2 + mov r10,4 + jmp NEAR $L$init_start_avx +ALIGN 32 +$L$init_loop_avx: + vpalignr xmm5,xmm4,xmm3,8 + vmovdqu XMMWORD[(-16)+rcx],xmm5 + vpunpckhqdq xmm3,xmm0,xmm0 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm1,xmm0,xmm2,0x11 + vpclmulqdq xmm0,xmm0,xmm2,0x00 + vpclmulqdq xmm3,xmm3,xmm6,0x00 + vpxor xmm4,xmm1,xmm0 + vpxor xmm3,xmm3,xmm4 + + vpslldq xmm4,xmm3,8 + vpsrldq xmm3,xmm3,8 + vpxor xmm0,xmm0,xmm4 + vpxor xmm1,xmm1,xmm3 + vpsllq xmm3,xmm0,57 + vpsllq xmm4,xmm0,62 + vpxor xmm4,xmm4,xmm3 + vpsllq xmm3,xmm0,63 + vpxor xmm4,xmm4,xmm3 + vpslldq xmm3,xmm4,8 + vpsrldq xmm4,xmm4,8 + vpxor xmm0,xmm0,xmm3 + vpxor xmm1,xmm1,xmm4 + + vpsrlq xmm4,xmm0,1 + vpxor xmm1,xmm1,xmm0 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm4,xmm4,5 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm0,xmm0,1 + vpxor xmm0,xmm0,xmm1 +$L$init_start_avx: + vmovdqa xmm5,xmm0 + vpunpckhqdq xmm3,xmm0,xmm0 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm1,xmm0,xmm2,0x11 + vpclmulqdq xmm0,xmm0,xmm2,0x00 + vpclmulqdq xmm3,xmm3,xmm6,0x00 + vpxor xmm4,xmm1,xmm0 + vpxor xmm3,xmm3,xmm4 + + vpslldq xmm4,xmm3,8 + vpsrldq xmm3,xmm3,8 + vpxor xmm0,xmm0,xmm4 + vpxor xmm1,xmm1,xmm3 + vpsllq xmm3,xmm0,57 + vpsllq xmm4,xmm0,62 + vpxor xmm4,xmm4,xmm3 + vpsllq xmm3,xmm0,63 + vpxor xmm4,xmm4,xmm3 + vpslldq xmm3,xmm4,8 + vpsrldq xmm4,xmm4,8 + vpxor xmm0,xmm0,xmm3 + vpxor xmm1,xmm1,xmm4 + + vpsrlq xmm4,xmm0,1 + vpxor xmm1,xmm1,xmm0 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm4,xmm4,5 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm0,xmm0,1 + vpxor xmm0,xmm0,xmm1 + vpshufd xmm3,xmm5,78 + vpshufd xmm4,xmm0,78 + vpxor xmm3,xmm3,xmm5 + vmovdqu XMMWORD[rcx],xmm5 + vpxor xmm4,xmm4,xmm0 + vmovdqu XMMWORD[16+rcx],xmm0 
+ lea rcx,[48+rcx] + sub r10,1 + jnz NEAR $L$init_loop_avx + + vpalignr xmm5,xmm3,xmm4,8 + vmovdqu XMMWORD[(-16)+rcx],xmm5 + + vzeroupper + movaps xmm6,XMMWORD[rsp] + lea rsp,[24+rsp] +$L$SEH_end_gcm_init_avx: + ret + +global gcm_gmult_avx + +ALIGN 32 +gcm_gmult_avx: + jmp NEAR $L$_gmult_clmul + +global gcm_ghash_avx + +ALIGN 32 +gcm_ghash_avx: + lea rax,[((-136))+rsp] +$L$SEH_begin_gcm_ghash_avx: + +DB 0x48,0x8d,0x60,0xe0 +DB 0x0f,0x29,0x70,0xe0 +DB 0x0f,0x29,0x78,0xf0 +DB 0x44,0x0f,0x29,0x00 +DB 0x44,0x0f,0x29,0x48,0x10 +DB 0x44,0x0f,0x29,0x50,0x20 +DB 0x44,0x0f,0x29,0x58,0x30 +DB 0x44,0x0f,0x29,0x60,0x40 +DB 0x44,0x0f,0x29,0x68,0x50 +DB 0x44,0x0f,0x29,0x70,0x60 +DB 0x44,0x0f,0x29,0x78,0x70 + vzeroupper + + vmovdqu xmm10,XMMWORD[rcx] + lea r10,[$L$0x1c2_polynomial] + lea rdx,[64+rdx] + vmovdqu xmm13,XMMWORD[$L$bswap_mask] + vpshufb xmm10,xmm10,xmm13 + cmp r9,0x80 + jb NEAR $L$short_avx + sub r9,0x80 + + vmovdqu xmm14,XMMWORD[112+r8] + vmovdqu xmm6,XMMWORD[((0-64))+rdx] + vpshufb xmm14,xmm14,xmm13 + vmovdqu xmm7,XMMWORD[((32-64))+rdx] + + vpunpckhqdq xmm9,xmm14,xmm14 + vmovdqu xmm15,XMMWORD[96+r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm9,xmm9,xmm14 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((16-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vmovdqu xmm14,XMMWORD[80+r8] + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + + vpshufb xmm14,xmm14,xmm13 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((48-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vmovdqu xmm15,XMMWORD[64+r8] + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((80-64))+rdx] + + vpshufb xmm15,xmm15,xmm13 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm4,xmm4,xmm1 + vpunpckhqdq xmm8,xmm15,xmm15 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((64-64))+rdx] + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + + vmovdqu xmm14,XMMWORD[48+r8] + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpxor xmm1,xmm1,xmm4 + vpshufb xmm14,xmm14,xmm13 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((96-64))+rdx] + vpxor xmm2,xmm2,xmm5 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((128-64))+rdx] + vpxor xmm9,xmm9,xmm14 + + vmovdqu xmm15,XMMWORD[32+r8] + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm4,xmm4,xmm1 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((112-64))+rdx] + vpxor xmm5,xmm5,xmm2 + vpunpckhqdq xmm8,xmm15,xmm15 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + + vmovdqu xmm14,XMMWORD[16+r8] + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpxor xmm1,xmm1,xmm4 + vpshufb xmm14,xmm14,xmm13 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((144-64))+rdx] + vpxor xmm2,xmm2,xmm5 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((176-64))+rdx] + vpxor xmm9,xmm9,xmm14 + + vmovdqu xmm15,XMMWORD[r8] + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm4,xmm4,xmm1 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((160-64))+rdx] + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm9,xmm7,0x10 + + lea r8,[128+r8] + cmp r9,0x80 + jb NEAR $L$tail_avx + + vpxor xmm15,xmm15,xmm10 + sub r9,0x80 + jmp NEAR $L$oop8x_avx + +ALIGN 32 +$L$oop8x_avx: + vpunpckhqdq xmm8,xmm15,xmm15 + vmovdqu xmm14,XMMWORD[112+r8] + vpxor xmm3,xmm3,xmm0 + vpxor 
xmm8,xmm8,xmm15 + vpclmulqdq xmm10,xmm15,xmm6,0x00 + vpshufb xmm14,xmm14,xmm13 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm11,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((0-64))+rdx] + vpunpckhqdq xmm9,xmm14,xmm14 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm12,xmm8,xmm7,0x00 + vmovdqu xmm7,XMMWORD[((32-64))+rdx] + vpxor xmm9,xmm9,xmm14 + + vmovdqu xmm15,XMMWORD[96+r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm10,xmm10,xmm3 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vxorps xmm11,xmm11,xmm4 + vmovdqu xmm6,XMMWORD[((16-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm12,xmm12,xmm5 + vxorps xmm8,xmm8,xmm15 + + vmovdqu xmm14,XMMWORD[80+r8] + vpxor xmm12,xmm12,xmm10 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpxor xmm12,xmm12,xmm11 + vpslldq xmm9,xmm12,8 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vpsrldq xmm12,xmm12,8 + vpxor xmm10,xmm10,xmm9 + vmovdqu xmm6,XMMWORD[((48-64))+rdx] + vpshufb xmm14,xmm14,xmm13 + vxorps xmm11,xmm11,xmm12 + vpxor xmm4,xmm4,xmm1 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((80-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm15,XMMWORD[64+r8] + vpalignr xmm12,xmm10,xmm10,8 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpshufb xmm15,xmm15,xmm13 + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((64-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm1,xmm1,xmm4 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vxorps xmm8,xmm8,xmm15 + vpxor xmm2,xmm2,xmm5 + + vmovdqu xmm14,XMMWORD[48+r8] + vpclmulqdq xmm10,xmm10,XMMWORD[r10],0x10 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpshufb xmm14,xmm14,xmm13 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((96-64))+rdx] + vpunpckhqdq xmm9,xmm14,xmm14 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((128-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm15,XMMWORD[32+r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpshufb xmm15,xmm15,xmm13 + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((112-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm1,xmm1,xmm4 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + vpxor xmm2,xmm2,xmm5 + vxorps xmm10,xmm10,xmm12 + + vmovdqu xmm14,XMMWORD[16+r8] + vpalignr xmm12,xmm10,xmm10,8 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpshufb xmm14,xmm14,xmm13 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((144-64))+rdx] + vpclmulqdq xmm10,xmm10,XMMWORD[r10],0x10 + vxorps xmm12,xmm12,xmm11 + vpunpckhqdq xmm9,xmm14,xmm14 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((176-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm15,XMMWORD[r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((160-64))+rdx] + vpxor xmm15,xmm15,xmm12 + vpclmulqdq xmm2,xmm9,xmm7,0x10 + vpxor xmm15,xmm15,xmm10 + + lea r8,[128+r8] + sub r9,0x80 + jnc NEAR $L$oop8x_avx + + add r9,0x80 + jmp NEAR $L$tail_no_xor_avx + +ALIGN 32 +$L$short_avx: + vmovdqu xmm14,XMMWORD[((-16))+r9*1+r8] + lea r8,[r9*1+r8] + vmovdqu xmm6,XMMWORD[((0-64))+rdx] + vmovdqu xmm7,XMMWORD[((32-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + + vmovdqa xmm3,xmm0 + vmovdqa xmm4,xmm1 + vmovdqa xmm5,xmm2 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-32))+r8] + vpxor 
xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((16-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vpsrldq xmm7,xmm7,8 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-48))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((48-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vmovdqu xmm7,XMMWORD[((80-64))+rdx] + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-64))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((64-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vpsrldq xmm7,xmm7,8 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-80))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((96-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vmovdqu xmm7,XMMWORD[((128-64))+rdx] + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-96))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((112-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vpsrldq xmm7,xmm7,8 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-112))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((144-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vmovq xmm7,QWORD[((184-64))+rdx] + sub r9,0x10 + jmp NEAR $L$tail_avx + +ALIGN 32 +$L$tail_avx: + vpxor xmm15,xmm15,xmm10 +$L$tail_no_xor_avx: + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + + vmovdqu xmm12,XMMWORD[r10] + + vpxor xmm10,xmm3,xmm0 + vpxor xmm11,xmm4,xmm1 + vpxor xmm5,xmm5,xmm2 + + vpxor xmm5,xmm5,xmm10 + vpxor xmm5,xmm5,xmm11 + vpslldq xmm9,xmm5,8 + vpsrldq xmm5,xmm5,8 + vpxor xmm10,xmm10,xmm9 + vpxor xmm11,xmm11,xmm5 + + vpclmulqdq xmm9,xmm10,xmm12,0x10 + vpalignr xmm10,xmm10,xmm10,8 + vpxor xmm10,xmm10,xmm9 + + vpclmulqdq xmm9,xmm10,xmm12,0x10 + vpalignr xmm10,xmm10,xmm10,8 + vpxor xmm10,xmm10,xmm11 + vpxor xmm10,xmm10,xmm9 + + cmp r9,0 + jne NEAR $L$short_avx + + vpshufb xmm10,xmm10,xmm13 + vmovdqu XMMWORD[rcx],xmm10 + vzeroupper + movaps xmm6,XMMWORD[rsp] + movaps xmm7,XMMWORD[16+rsp] + movaps xmm8,XMMWORD[32+rsp] + movaps xmm9,XMMWORD[48+rsp] + movaps xmm10,XMMWORD[64+rsp] + movaps xmm11,XMMWORD[80+rsp] + movaps xmm12,XMMWORD[96+rsp] + movaps xmm13,XMMWORD[112+rsp] + movaps xmm14,XMMWORD[128+rsp] + movaps xmm15,XMMWORD[144+rsp] + lea rsp,[168+rsp] +$L$SEH_end_gcm_ghash_avx: + ret + +ALIGN 64 +$L$bswap_mask: +DB 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +$L$0x1c2_polynomial: +DB 
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +$L$7_mask: + DD 7,0,7,0 +$L$7_mask_poly: + DD 7,0,450,0 +ALIGN 64 + +$L$rem_4bit: + DD 0,0,0,471859200,0,943718400,0,610271232 + DD 0,1887436800,0,1822425088,0,1220542464,0,1423966208 + DD 0,3774873600,0,4246732800,0,3644850176,0,3311403008 + DD 0,2441084928,0,2376073216,0,2847932416,0,3051356160 + +$L$rem_8bit: + DW 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E + DW 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E + DW 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E + DW 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E + DW 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E + DW 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E + DW 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E + DW 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E + DW 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE + DW 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE + DW 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE + DW 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE + DW 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E + DW 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E + DW 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE + DW 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE + DW 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E + DW 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E + DW 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E + DW 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E + DW 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E + DW 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E + DW 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E + DW 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E + DW 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE + DW 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE + DW 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE + DW 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE + DW 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E + DW 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E + DW 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE + DW 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE + +DB 71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52 +DB 44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32 +DB 60,97,112,112,114,111,64,108,46,111,114,103,62,0 +ALIGN 64 +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$in_prologue + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$in_prologue + + lea rax,[((48+280))+rax] + + mov rbx,QWORD[((-8))+rax] + mov rbp,QWORD[((-16))+rax] + mov r12,QWORD[((-24))+rax] + mov r13,QWORD[((-32))+rax] + mov r14,QWORD[((-40))+rax] + mov r15,QWORD[((-48))+rax] + mov QWORD[144+r8],rbx + mov QWORD[160+r8],rbp + mov QWORD[216+r8],r12 + mov QWORD[224+r8],r13 + mov QWORD[232+r8],r14 + mov QWORD[240+r8],r15 + +$L$in_prologue: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov 
r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + ret + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_gcm_gmult_4bit wrt ..imagebase + DD $L$SEH_end_gcm_gmult_4bit wrt ..imagebase + DD $L$SEH_info_gcm_gmult_4bit wrt ..imagebase + + DD $L$SEH_begin_gcm_ghash_4bit wrt ..imagebase + DD $L$SEH_end_gcm_ghash_4bit wrt ..imagebase + DD $L$SEH_info_gcm_ghash_4bit wrt ..imagebase + + DD $L$SEH_begin_gcm_init_clmul wrt ..imagebase + DD $L$SEH_end_gcm_init_clmul wrt ..imagebase + DD $L$SEH_info_gcm_init_clmul wrt ..imagebase + + DD $L$SEH_begin_gcm_ghash_clmul wrt ..imagebase + DD $L$SEH_end_gcm_ghash_clmul wrt ..imagebase + DD $L$SEH_info_gcm_ghash_clmul wrt ..imagebase + DD $L$SEH_begin_gcm_init_avx wrt ..imagebase + DD $L$SEH_end_gcm_init_avx wrt ..imagebase + DD $L$SEH_info_gcm_init_clmul wrt ..imagebase + + DD $L$SEH_begin_gcm_ghash_avx wrt ..imagebase + DD $L$SEH_end_gcm_ghash_avx wrt ..imagebase + DD $L$SEH_info_gcm_ghash_clmul wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_info_gcm_gmult_4bit: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$gmult_prologue wrt ..imagebase,$L$gmult_epilogue wrt ..imagebase +$L$SEH_info_gcm_ghash_4bit: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$ghash_prologue wrt ..imagebase,$L$ghash_epilogue wrt ..imagebase +$L$SEH_info_gcm_init_clmul: +DB 0x01,0x08,0x03,0x00 +DB 0x08,0x68,0x00,0x00 +DB 0x04,0x22,0x00,0x00 +$L$SEH_info_gcm_ghash_clmul: +DB 0x01,0x33,0x16,0x00 +DB 0x33,0xf8,0x09,0x00 +DB 0x2e,0xe8,0x08,0x00 +DB 0x29,0xd8,0x07,0x00 +DB 0x24,0xc8,0x06,0x00 +DB 0x1f,0xb8,0x05,0x00 +DB 0x1a,0xa8,0x04,0x00 +DB 0x15,0x98,0x03,0x00 +DB 0x10,0x88,0x02,0x00 +DB 0x0c,0x78,0x01,0x00 +DB 0x08,0x68,0x00,0x00 +DB 0x04,0x01,0x15,0x00 diff --git a/crypto/aesgcm/ghashp8-ppc.pl b/crypto/aesgcm/ghashp8-ppc.pl new file mode 100644 index 0000000..c46cdb5 --- /dev/null +++ b/crypto/aesgcm/ghashp8-ppc.pl @@ -0,0 +1,670 @@ +#! /usr/bin/env perl +# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# GHASH for PowerISA v2.07. +# +# July 2014 +# +# Accurate performance measurements are problematic, because it's +# always a virtualized setup with a possibly throttled processor. +# Relative comparison is therefore more informative. This initial +# version is ~2.1x slower than hardware-assisted AES-128-CTR, ~12x +# faster than "4-bit" integer-only compiler-generated 64-bit code. +# "Initial version" means that there is room for further improvement.
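For orientation, every gcm_init_*/gcm_gmult_*/gcm_ghash_* routine in this commit implements the same primitive: the GHASH state update Xi = (Xi xor Ii)*H in GF(2^128) with the GCM polynomial. A minimal bit-serial C sketch of that update, per the NIST SP 800-38D definition (the u128 type and the gf128_mul/ghash_ref names are local to this sketch, not part of the source):

#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;  /* big-endian halves of a 16-byte block */

/* Bit-serial multiply in GF(2^128), GCM bit order; R = 0xE1 || 0^120. */
static u128 gf128_mul(u128 X, u128 Y) {
  u128 Z = { 0, 0 }, V = Y;
  for (int i = 0; i < 128; i++) {
    uint64_t xbit = (i < 64) ? (X.hi >> (63 - i)) & 1 : (X.lo >> (127 - i)) & 1;
    if (xbit) { Z.hi ^= V.hi; Z.lo ^= V.lo; }  /* Z ^= V when bit i of X is set */
    uint64_t lsb = V.lo & 1;
    V.lo = (V.lo >> 1) | (V.hi << 63);         /* V >>= 1 */
    V.hi >>= 1;
    if (lsb) V.hi ^= 0xE100000000000000ULL;    /* reduce mod x^128+x^7+x^2+x+1 */
  }
  return Z;
}

/* Reference GHASH pass: Xi = (Xi ^ I[k]) * H for each 16-byte block. */
static void ghash_ref(u128 *Xi, const u128 *inp, size_t blocks, u128 H) {
  for (size_t k = 0; k < blocks; k++) {
    Xi->hi ^= inp[k].hi;
    Xi->lo ^= inp[k].lo;
    *Xi = gf128_mul(*Xi, H);
  }
}

The 4-bit table code, the PCLMULQDQ/AVX paths above, and the POWER8 and ARMv8 modules below are all vectorized optimizations of exactly this loop.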
+ +# May 2016 +# +# 2x aggregated reduction improves performance by 50% (resulting +# performance on POWER8 is 1 cycle per processed byte), and 4x +# aggregated reduction - by 170% or 2.7x (resulting in 0.55 cpb). + +$flavour=shift; +$output =shift; + +if ($flavour =~ /64/) { + $SIZE_T=8; + $LRSAVE=2*$SIZE_T; + $STU="stdu"; + $POP="ld"; + $PUSH="std"; + $UCMP="cmpld"; + $SHRI="srdi"; +} elsif ($flavour =~ /32/) { + $SIZE_T=4; + $LRSAVE=$SIZE_T; + $STU="stwu"; + $POP="lwz"; + $PUSH="stw"; + $UCMP="cmplw"; + $SHRI="srwi"; +} else { die "nonsense $flavour"; } + +$sp="r1"; +$FRAME=6*$SIZE_T+13*16; # 13*16 is for v20-v31 offload + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}ppc-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../../../perlasm/ppc-xlate.pl" and -f $xlate) or +die "can't locate ppc-xlate.pl"; + +open STDOUT,"| $^X $xlate $flavour $output" || die "can't call $xlate: $!"; + +my ($Xip,$Htbl,$inp,$len)=map("r$_",(3..6)); # argument block + +my ($Xl,$Xm,$Xh,$IN)=map("v$_",(0..3)); +my ($zero,$t0,$t1,$t2,$xC2,$H,$Hh,$Hl,$lemask)=map("v$_",(4..12)); +my ($Xl1,$Xm1,$Xh1,$IN1,$H2,$H2h,$H2l)=map("v$_",(13..19)); +my $vrsave="r12"; + +$code=<<___; +.machine "any" + +.text + +.globl .gcm_init_p8 +.align 5 +.gcm_init_p8: + li r0,-4096 + li r8,0x10 + mfspr $vrsave,256 + li r9,0x20 + mtspr 256,r0 + li r10,0x30 + lvx_u $H,0,r4 # load H + + vspltisb $xC2,-16 # 0xf0 + vspltisb $t0,1 # one + vaddubm $xC2,$xC2,$xC2 # 0xe0 + vxor $zero,$zero,$zero + vor $xC2,$xC2,$t0 # 0xe1 + vsldoi $xC2,$xC2,$zero,15 # 0xe1... + vsldoi $t1,$zero,$t0,1 # ...1 + vaddubm $xC2,$xC2,$xC2 # 0xc2... + vspltisb $t2,7 + vor $xC2,$xC2,$t1 # 0xc2....01 + vspltb $t1,$H,0 # most significant byte + vsl $H,$H,$t0 # H<<=1 + vsrab $t1,$t1,$t2 # broadcast carry bit + vand $t1,$t1,$xC2 + vxor $IN,$H,$t1 # twisted H + + vsldoi $H,$IN,$IN,8 # twist even more ... + vsldoi $xC2,$zero,$xC2,8 # 0xc2.0 + vsldoi $Hl,$zero,$H,8 # ... 
and split + vsldoi $Hh,$H,$zero,8 + + stvx_u $xC2,0,r3 # save pre-computed table + stvx_u $Hl,r8,r3 + li r8,0x40 + stvx_u $H, r9,r3 + li r9,0x50 + stvx_u $Hh,r10,r3 + li r10,0x60 + + vpmsumd $Xl,$IN,$Hl # H.lo·H.lo + vpmsumd $Xm,$IN,$H # H.hi·H.lo+H.lo·H.hi + vpmsumd $Xh,$IN,$Hh # H.hi·H.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + vxor $IN1,$Xl,$t1 + + vsldoi $H2,$IN1,$IN1,8 + vsldoi $H2l,$zero,$H2,8 + vsldoi $H2h,$H2,$zero,8 + + stvx_u $H2l,r8,r3 # save H^2 + li r8,0x70 + stvx_u $H2,r9,r3 + li r9,0x80 + stvx_u $H2h,r10,r3 + li r10,0x90 +___ +{ +my ($t4,$t5,$t6) = ($Hl,$H,$Hh); +$code.=<<___; + vpmsumd $Xl,$IN,$H2l # H.lo·H^2.lo + vpmsumd $Xl1,$IN1,$H2l # H^2.lo·H^2.lo + vpmsumd $Xm,$IN,$H2 # H.hi·H^2.lo+H.lo·H^2.hi + vpmsumd $Xm1,$IN1,$H2 # H^2.hi·H^2.lo+H^2.lo·H^2.hi + vpmsumd $Xh,$IN,$H2h # H.hi·H^2.hi + vpmsumd $Xh1,$IN1,$H2h # H^2.hi·H^2.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + vpmsumd $t6,$Xl1,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vsldoi $t4,$Xm1,$zero,8 + vsldoi $t5,$zero,$Xm1,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + vxor $Xl1,$Xl1,$t4 + vxor $Xh1,$Xh1,$t5 + + vsldoi $Xl,$Xl,$Xl,8 + vsldoi $Xl1,$Xl1,$Xl1,8 + vxor $Xl,$Xl,$t2 + vxor $Xl1,$Xl1,$t6 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vsldoi $t5,$Xl1,$Xl1,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vpmsumd $Xl1,$Xl1,$xC2 + vxor $t1,$t1,$Xh + vxor $t5,$t5,$Xh1 + vxor $Xl,$Xl,$t1 + vxor $Xl1,$Xl1,$t5 + + vsldoi $H,$Xl,$Xl,8 + vsldoi $H2,$Xl1,$Xl1,8 + vsldoi $Hl,$zero,$H,8 + vsldoi $Hh,$H,$zero,8 + vsldoi $H2l,$zero,$H2,8 + vsldoi $H2h,$H2,$zero,8 + + stvx_u $Hl,r8,r3 # save H^3 + li r8,0xa0 + stvx_u $H,r9,r3 + li r9,0xb0 + stvx_u $Hh,r10,r3 + li r10,0xc0 + stvx_u $H2l,r8,r3 # save H^4 + stvx_u $H2,r9,r3 + stvx_u $H2h,r10,r3 + + mtspr 256,$vrsave + blr + .long 0 + .byte 0,12,0x14,0,0,0,2,0 + .long 0 +.size .gcm_init_p8,.-.gcm_init_p8 +___ +} +$code.=<<___; +.globl .gcm_gmult_p8 +.align 5 +.gcm_gmult_p8: + lis r0,0xfff8 + li r8,0x10 + mfspr $vrsave,256 + li r9,0x20 + mtspr 256,r0 + li r10,0x30 + lvx_u $IN,0,$Xip # load Xi + + lvx_u $Hl,r8,$Htbl # load pre-computed table + le?lvsl $lemask,r0,r0 + lvx_u $H, r9,$Htbl + le?vspltisb $t0,0x07 + lvx_u $Hh,r10,$Htbl + le?vxor $lemask,$lemask,$t0 + lvx_u $xC2,0,$Htbl + le?vperm $IN,$IN,$IN,$lemask + vxor $zero,$zero,$zero + + vpmsumd $Xl,$IN,$Hl # H.lo·Xi.lo + vpmsumd $Xm,$IN,$H # H.hi·Xi.lo+H.lo·Xi.hi + vpmsumd $Xh,$IN,$Hh # H.hi·Xi.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + vxor $Xl,$Xl,$t1 + + le?vperm $Xl,$Xl,$Xl,$lemask + stvx_u $Xl,0,$Xip # write out Xi + + mtspr 256,$vrsave + blr + .long 0 + .byte 0,12,0x14,0,0,0,2,0 + .long 0 +.size .gcm_gmult_p8,.-.gcm_gmult_p8 + +.globl .gcm_ghash_p8 +.align 5 +.gcm_ghash_p8: + li r0,-4096 + li r8,0x10 + mfspr $vrsave,256 + li r9,0x20 + mtspr 256,r0 + li r10,0x30 + lvx_u $Xl,0,$Xip # load Xi + + lvx_u $Hl,r8,$Htbl # load pre-computed table + li r8,0x40 + le?lvsl $lemask,r0,r0 + lvx_u $H, r9,$Htbl + li r9,0x50 + le?vspltisb $t0,0x07 + lvx_u $Hh,r10,$Htbl + li r10,0x60 + le?vxor $lemask,$lemask,$t0 + lvx_u $xC2,0,$Htbl + le?vperm 
$Xl,$Xl,$Xl,$lemask + vxor $zero,$zero,$zero + + ${UCMP}i $len,64 + bge Lgcm_ghash_p8_4x + + lvx_u $IN,0,$inp + addi $inp,$inp,16 + subic. $len,$len,16 + le?vperm $IN,$IN,$IN,$lemask + vxor $IN,$IN,$Xl + beq Lshort + + lvx_u $H2l,r8,$Htbl # load H^2 + li r8,16 + lvx_u $H2, r9,$Htbl + add r9,$inp,$len # end of input + lvx_u $H2h,r10,$Htbl + be?b Loop_2x + +.align 5 +Loop_2x: + lvx_u $IN1,0,$inp + le?vperm $IN1,$IN1,$IN1,$lemask + + subic $len,$len,32 + vpmsumd $Xl,$IN,$H2l # H^2.lo·Xi.lo + vpmsumd $Xl1,$IN1,$Hl # H.lo·Xi+1.lo + subfe r0,r0,r0 # borrow?-1:0 + vpmsumd $Xm,$IN,$H2 # H^2.hi·Xi.lo+H^2.lo·Xi.hi + vpmsumd $Xm1,$IN1,$H # H.hi·Xi+1.lo+H.lo·Xi+1.hi + and r0,r0,$len + vpmsumd $Xh,$IN,$H2h # H^2.hi·Xi.hi + vpmsumd $Xh1,$IN1,$Hh # H.hi·Xi+1.hi + add $inp,$inp,r0 + + vxor $Xl,$Xl,$Xl1 + vxor $Xm,$Xm,$Xm1 + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xh,$Xh,$Xh1 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + lvx_u $IN,r8,$inp + addi $inp,$inp,32 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + le?vperm $IN,$IN,$IN,$lemask + vxor $t1,$t1,$Xh + vxor $IN,$IN,$t1 + vxor $IN,$IN,$Xl + $UCMP r9,$inp + bgt Loop_2x # done yet? + + cmplwi $len,0 + bne Leven + +Lshort: + vpmsumd $Xl,$IN,$Hl # H.lo·Xi.lo + vpmsumd $Xm,$IN,$H # H.hi·Xi.lo+H.lo·Xi.hi + vpmsumd $Xh,$IN,$Hh # H.hi·Xi.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + +Leven: + vxor $Xl,$Xl,$t1 + le?vperm $Xl,$Xl,$Xl,$lemask + stvx_u $Xl,0,$Xip # write out Xi + + mtspr 256,$vrsave + blr + .long 0 + .byte 0,12,0x14,0,0,0,4,0 + .long 0 +___ +{ +my ($Xl3,$Xm2,$IN2,$H3l,$H3,$H3h, + $Xh3,$Xm3,$IN3,$H4l,$H4,$H4h) = map("v$_",(20..31)); +my $IN0=$IN; +my ($H21l,$H21h,$loperm,$hiperm) = ($Hl,$Hh,$H2l,$H2h); + +$code.=<<___; +.align 5 +.gcm_ghash_p8_4x: +Lgcm_ghash_p8_4x: + $STU $sp,-$FRAME($sp) + li r10,`15+6*$SIZE_T` + li r11,`31+6*$SIZE_T` + stvx v20,r10,$sp + addi r10,r10,32 + stvx v21,r11,$sp + addi r11,r11,32 + stvx v22,r10,$sp + addi r10,r10,32 + stvx v23,r11,$sp + addi r11,r11,32 + stvx v24,r10,$sp + addi r10,r10,32 + stvx v25,r11,$sp + addi r11,r11,32 + stvx v26,r10,$sp + addi r10,r10,32 + stvx v27,r11,$sp + addi r11,r11,32 + stvx v28,r10,$sp + addi r10,r10,32 + stvx v29,r11,$sp + addi r11,r11,32 + stvx v30,r10,$sp + li r10,0x60 + stvx v31,r11,$sp + li r0,-1 + stw $vrsave,`$FRAME-4`($sp) # save vrsave + mtspr 256,r0 # preserve all AltiVec registers + + lvsl $t0,0,r8 # 0x0001..0e0f + #lvx_u $H2l,r8,$Htbl # load H^2 + li r8,0x70 + lvx_u $H2, r9,$Htbl + li r9,0x80 + vspltisb $t1,8 # 0x0808..0808 + #lvx_u $H2h,r10,$Htbl + li r10,0x90 + lvx_u $H3l,r8,$Htbl # load H^3 + li r8,0xa0 + lvx_u $H3, r9,$Htbl + li r9,0xb0 + lvx_u $H3h,r10,$Htbl + li r10,0xc0 + lvx_u $H4l,r8,$Htbl # load H^4 + li r8,0x10 + lvx_u $H4, r9,$Htbl + li r9,0x20 + lvx_u $H4h,r10,$Htbl + li r10,0x30 + + vsldoi $t2,$zero,$t1,8 # 0x0000..0808 + vaddubm $hiperm,$t0,$t2 # 0x0001..1617 + vaddubm $loperm,$t1,$hiperm # 0x0809..1e1f + + $SHRI $len,$len,4 # this allows to use sign bit + # as carry + lvx_u $IN0,0,$inp # load input + lvx_u $IN1,r8,$inp + subic. 
$len,$len,8 + lvx_u $IN2,r9,$inp + lvx_u $IN3,r10,$inp + addi $inp,$inp,0x40 + le?vperm $IN0,$IN0,$IN0,$lemask + le?vperm $IN1,$IN1,$IN1,$lemask + le?vperm $IN2,$IN2,$IN2,$lemask + le?vperm $IN3,$IN3,$IN3,$lemask + + vxor $Xh,$IN0,$Xl + + vpmsumd $Xl1,$IN1,$H3l + vpmsumd $Xm1,$IN1,$H3 + vpmsumd $Xh1,$IN1,$H3h + + vperm $H21l,$H2,$H,$hiperm + vperm $t0,$IN2,$IN3,$loperm + vperm $H21h,$H2,$H,$loperm + vperm $t1,$IN2,$IN3,$hiperm + vpmsumd $Xm2,$IN2,$H2 # H^2.lo·Xi+2.hi+H^2.hi·Xi+2.lo + vpmsumd $Xl3,$t0,$H21l # H^2.lo·Xi+2.lo+H.lo·Xi+3.lo + vpmsumd $Xm3,$IN3,$H # H.hi·Xi+3.lo +H.lo·Xi+3.hi + vpmsumd $Xh3,$t1,$H21h # H^2.hi·Xi+2.hi+H.hi·Xi+3.hi + + vxor $Xm2,$Xm2,$Xm1 + vxor $Xl3,$Xl3,$Xl1 + vxor $Xm3,$Xm3,$Xm2 + vxor $Xh3,$Xh3,$Xh1 + + blt Ltail_4x + +Loop_4x: + lvx_u $IN0,0,$inp + lvx_u $IN1,r8,$inp + subic. $len,$len,4 + lvx_u $IN2,r9,$inp + lvx_u $IN3,r10,$inp + addi $inp,$inp,0x40 + le?vperm $IN1,$IN1,$IN1,$lemask + le?vperm $IN2,$IN2,$IN2,$lemask + le?vperm $IN3,$IN3,$IN3,$lemask + le?vperm $IN0,$IN0,$IN0,$lemask + + vpmsumd $Xl,$Xh,$H4l # H^4.lo·Xi.lo + vpmsumd $Xm,$Xh,$H4 # H^4.hi·Xi.lo+H^4.lo·Xi.hi + vpmsumd $Xh,$Xh,$H4h # H^4.hi·Xi.hi + vpmsumd $Xl1,$IN1,$H3l + vpmsumd $Xm1,$IN1,$H3 + vpmsumd $Xh1,$IN1,$H3h + + vxor $Xl,$Xl,$Xl3 + vxor $Xm,$Xm,$Xm3 + vxor $Xh,$Xh,$Xh3 + vperm $t0,$IN2,$IN3,$loperm + vperm $t1,$IN2,$IN3,$hiperm + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + vpmsumd $Xl3,$t0,$H21l # H.lo·Xi+3.lo +H^2.lo·Xi+2.lo + vpmsumd $Xh3,$t1,$H21h # H.hi·Xi+3.hi +H^2.hi·Xi+2.hi + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xm2,$IN2,$H2 # H^2.hi·Xi+2.lo+H^2.lo·Xi+2.hi + vpmsumd $Xm3,$IN3,$H # H.hi·Xi+3.lo +H.lo·Xi+3.hi + vpmsumd $Xl,$Xl,$xC2 + + vxor $Xl3,$Xl3,$Xl1 + vxor $Xh3,$Xh3,$Xh1 + vxor $Xh,$Xh,$IN0 + vxor $Xm2,$Xm2,$Xm1 + vxor $Xh,$Xh,$t1 + vxor $Xm3,$Xm3,$Xm2 + vxor $Xh,$Xh,$Xl + bge Loop_4x + +Ltail_4x: + vpmsumd $Xl,$Xh,$H4l # H^4.lo·Xi.lo + vpmsumd $Xm,$Xh,$H4 # H^4.hi·Xi.lo+H^4.lo·Xi.hi + vpmsumd $Xh,$Xh,$H4h # H^4.hi·Xi.hi + + vxor $Xl,$Xl,$Xl3 + vxor $Xm,$Xm,$Xm3 + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xh,$Xh,$Xh3 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + vxor $Xl,$Xl,$t1 + + addic. 
$len,$len,4 + beq Ldone_4x + + lvx_u $IN0,0,$inp + ${UCMP}i $len,2 + li $len,-4 + blt Lone + lvx_u $IN1,r8,$inp + beq Ltwo + +Lthree: + lvx_u $IN2,r9,$inp + le?vperm $IN0,$IN0,$IN0,$lemask + le?vperm $IN1,$IN1,$IN1,$lemask + le?vperm $IN2,$IN2,$IN2,$lemask + + vxor $Xh,$IN0,$Xl + vmr $H4l,$H3l + vmr $H4, $H3 + vmr $H4h,$H3h + + vperm $t0,$IN1,$IN2,$loperm + vperm $t1,$IN1,$IN2,$hiperm + vpmsumd $Xm2,$IN1,$H2 # H^2.lo·Xi+1.hi+H^2.hi·Xi+1.lo + vpmsumd $Xm3,$IN2,$H # H.hi·Xi+2.lo +H.lo·Xi+2.hi + vpmsumd $Xl3,$t0,$H21l # H^2.lo·Xi+1.lo+H.lo·Xi+2.lo + vpmsumd $Xh3,$t1,$H21h # H^2.hi·Xi+1.hi+H.hi·Xi+2.hi + + vxor $Xm3,$Xm3,$Xm2 + b Ltail_4x + +.align 4 +Ltwo: + le?vperm $IN0,$IN0,$IN0,$lemask + le?vperm $IN1,$IN1,$IN1,$lemask + + vxor $Xh,$IN0,$Xl + vperm $t0,$zero,$IN1,$loperm + vperm $t1,$zero,$IN1,$hiperm + + vsldoi $H4l,$zero,$H2,8 + vmr $H4, $H2 + vsldoi $H4h,$H2,$zero,8 + + vpmsumd $Xl3,$t0, $H21l # H.lo·Xi+1.lo + vpmsumd $Xm3,$IN1,$H # H.hi·Xi+1.lo+H.lo·Xi+2.hi + vpmsumd $Xh3,$t1, $H21h # H.hi·Xi+1.hi + + b Ltail_4x + +.align 4 +Lone: + le?vperm $IN0,$IN0,$IN0,$lemask + + vsldoi $H4l,$zero,$H,8 + vmr $H4, $H + vsldoi $H4h,$H,$zero,8 + + vxor $Xh,$IN0,$Xl + vxor $Xl3,$Xl3,$Xl3 + vxor $Xm3,$Xm3,$Xm3 + vxor $Xh3,$Xh3,$Xh3 + + b Ltail_4x + +Ldone_4x: + le?vperm $Xl,$Xl,$Xl,$lemask + stvx_u $Xl,0,$Xip # write out Xi + + li r10,`15+6*$SIZE_T` + li r11,`31+6*$SIZE_T` + mtspr 256,$vrsave + lvx v20,r10,$sp + addi r10,r10,32 + lvx v21,r11,$sp + addi r11,r11,32 + lvx v22,r10,$sp + addi r10,r10,32 + lvx v23,r11,$sp + addi r11,r11,32 + lvx v24,r10,$sp + addi r10,r10,32 + lvx v25,r11,$sp + addi r11,r11,32 + lvx v26,r10,$sp + addi r10,r10,32 + lvx v27,r11,$sp + addi r11,r11,32 + lvx v28,r10,$sp + addi r10,r10,32 + lvx v29,r11,$sp + addi r11,r11,32 + lvx v30,r10,$sp + lvx v31,r11,$sp + addi $sp,$sp,$FRAME + blr + .long 0 + .byte 0,12,0x04,0,0x80,0,4,0 + .long 0 +___ +} +$code.=<<___; +.size .gcm_ghash_p8,.-.gcm_ghash_p8 + +.asciz "GHASH for PowerISA 2.07, CRYPTOGAMS by " +.align 2 +___ + +foreach (split("\n",$code)) { + s/\`([^\`]*)\`/eval $1/geo; + + if ($flavour =~ /le$/o) { # little-endian + s/le\?//o or + s/be\?/#be#/o; + } else { + s/le\?/#le#/o or + s/be\?//o; + } + print $_,"\n"; +} + +close STDOUT; # enforce flush diff --git a/crypto/aesgcm/ghashv8-armx.pl b/crypto/aesgcm/ghashv8-armx.pl new file mode 100644 index 0000000..9bbca10 --- /dev/null +++ b/crypto/aesgcm/ghashv8-armx.pl @@ -0,0 +1,430 @@ +#! /usr/bin/env perl +# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# GHASH for ARMv8 Crypto Extension, 64-bit polynomial multiplication. +# +# June 2014 +# +# Initial version was developed in tight cooperation with Ard +# Biesheuvel from bits-n-pieces from +# other assembly modules. Just like aesv8-armx.pl this module +# supports both AArch32 and AArch64 execution modes. 
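The "2x aggregated reduction" referred to below rests on the identity Xi+2 = H^2*(Ii xor Xi) xor H*Ii+1, so two carry-less products can be computed independently and summed before a single modular reduction. A rough C sketch of the identity, reusing the hypothetical u128/gf128_mul helpers from the sketch above (the real assembly defers the reduction until after the sum; here it happens inside gf128_mul for clarity):

static u128 xor128(u128 a, u128 b) { a.hi ^= b.hi; a.lo ^= b.lo; return a; }

/* Two blocks per iteration: Xi+2 = H^2*(I0 ^ Xi) ^ H*I1. */
static void ghash_2x_ref(u128 *Xi, const u128 I[2], u128 H, u128 H2) {
  u128 acc = xor128(*Xi, I[0]);        /* I0 ^ Xi            */
  *Xi = xor128(gf128_mul(acc, H2),     /* H^2 * (I0 ^ Xi)    */
               gf128_mul(I[1], H));    /* ^  H * I1          */
}

This is why gcm_init_v8 precomputes H^2 alongside the twisted H, and why the 4x PowerISA path above additionally stores H^3 and H^4.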
+# +# July 2014 +# +# Implement 2x aggregated reduction [see ghash-x86.pl for background +# information]. +# +# Current performance in cycles per processed byte: +# +# PMULL[2] 32-bit NEON(*) +# Apple A7 0.92 5.62 +# Cortex-A53 1.01 8.39 +# Cortex-A57 1.17 7.61 +# Denver 0.71 6.02 +# Mongoose 1.10 8.06 +# +# (*) presented for reference/comparison purposes; + +$flavour = shift; +$output = shift; + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../../../perlasm/arm-xlate.pl" and -f $xlate) or +die "can't locate arm-xlate.pl"; + +open OUT,"| \"$^X\" $xlate $flavour $output"; +*STDOUT=*OUT; + +$Xi="x0"; # argument block +$Htbl="x1"; +$inp="x2"; +$len="x3"; + +$inc="x12"; + +{ +my ($Xl,$Xm,$Xh,$IN)=map("q$_",(0..3)); +my ($t0,$t1,$t2,$xC2,$H,$Hhl,$H2)=map("q$_",(8..14)); + +$code=<<___; +#include + +.text +___ +$code.=".arch armv8-a+crypto\n" if ($flavour =~ /64/); +$code.=<<___ if ($flavour !~ /64/); +.fpu neon +.code 32 +#undef __thumb2__ +___ + +################################################################################ +# void gcm_init_v8(u128 Htable[16],const u64 H[2]); +# +# input: 128-bit H - secret parameter E(K,0^128) +# output: precomputed table filled with degrees of twisted H; +# H is twisted to handle reverse bitness of GHASH; +# only few of 16 slots of Htable[16] are used; +# data is opaque to outside world (which allows to +# optimize the code independently); +# +$code.=<<___; +.global gcm_init_v8 +.type gcm_init_v8,%function +.align 4 +gcm_init_v8: + vld1.64 {$t1},[x1] @ load input H + vmov.i8 $xC2,#0xe1 + vshl.i64 $xC2,$xC2,#57 @ 0xc2.0 + vext.8 $IN,$t1,$t1,#8 + vshr.u64 $t2,$xC2,#63 + vdup.32 $t1,${t1}[1] + vext.8 $t0,$t2,$xC2,#8 @ t0=0xc2....01 + vshr.u64 $t2,$IN,#63 + vshr.s32 $t1,$t1,#31 @ broadcast carry bit + vand $t2,$t2,$t0 + vshl.i64 $IN,$IN,#1 + vext.8 $t2,$t2,$t2,#8 + vand $t0,$t0,$t1 + vorr $IN,$IN,$t2 @ H<<<=1 + veor $H,$IN,$t0 @ twisted H + vst1.64 {$H},[x0],#16 @ store Htable[0] + + @ calculate H^2 + vext.8 $t0,$H,$H,#8 @ Karatsuba pre-processing + vpmull.p64 $Xl,$H,$H + veor $t0,$t0,$H + vpmull2.p64 $Xh,$H,$H + vpmull.p64 $Xm,$t0,$t0 + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase + + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + veor $Xl,$Xm,$t2 + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase + vpmull.p64 $Xl,$Xl,$xC2 + veor $t2,$t2,$Xh + veor $H2,$Xl,$t2 + + vext.8 $t1,$H2,$H2,#8 @ Karatsuba pre-processing + veor $t1,$t1,$H2 + vext.8 $Hhl,$t0,$t1,#8 @ pack Karatsuba pre-processed + vst1.64 {$Hhl-$H2},[x0] @ store Htable[1..2] + + ret +.size gcm_init_v8,.-gcm_init_v8 +___ +################################################################################ +# void gcm_gmult_v8(u64 Xi[2],const u128 Htable[16]); +# +# input: Xi - current hash value; +# Htable - table precomputed in gcm_init_v8; +# output: Xi - next hash value Xi; +# +$code.=<<___; +.global gcm_gmult_v8 +.type gcm_gmult_v8,%function +.align 4 +gcm_gmult_v8: + vld1.64 {$t1},[$Xi] @ load Xi + vmov.i8 $xC2,#0xe1 + vld1.64 {$H-$Hhl},[$Htbl] @ load twisted H, ... 
+ vshl.u64 $xC2,$xC2,#57 +#ifndef __ARMEB__ + vrev64.8 $t1,$t1 +#endif + vext.8 $IN,$t1,$t1,#8 + + vpmull.p64 $Xl,$H,$IN @ H.lo·Xi.lo + veor $t1,$t1,$IN @ Karatsuba pre-processing + vpmull2.p64 $Xh,$H,$IN @ H.hi·Xi.hi + vpmull.p64 $Xm,$Hhl,$t1 @ (H.lo+H.hi)·(Xi.lo+Xi.hi) + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase of reduction + + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + veor $Xl,$Xm,$t2 + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase of reduction + vpmull.p64 $Xl,$Xl,$xC2 + veor $t2,$t2,$Xh + veor $Xl,$Xl,$t2 + +#ifndef __ARMEB__ + vrev64.8 $Xl,$Xl +#endif + vext.8 $Xl,$Xl,$Xl,#8 + vst1.64 {$Xl},[$Xi] @ write out Xi + + ret +.size gcm_gmult_v8,.-gcm_gmult_v8 +___ +################################################################################ +# void gcm_ghash_v8(u64 Xi[2],const u128 Htable[16],const u8 *inp,size_t len); +# +# input: table precomputed in gcm_init_v8; +# current hash value Xi; +# pointer to input data; +# length of input data in bytes, but divisible by block size; +# output: next hash value Xi; +# +$code.=<<___; +.global gcm_ghash_v8 +.type gcm_ghash_v8,%function +.align 4 +gcm_ghash_v8: +___ +$code.=<<___ if ($flavour !~ /64/); + vstmdb sp!,{d8-d15} @ 32-bit ABI says so +___ +$code.=<<___; + vld1.64 {$Xl},[$Xi] @ load [rotated] Xi + @ "[rotated]" means that + @ loaded value would have + @ to be rotated in order to + @ make it appear as in + @ alorithm specification + subs $len,$len,#32 @ see if $len is 32 or larger + mov $inc,#16 @ $inc is used as post- + @ increment for input pointer; + @ as loop is modulo-scheduled + @ $inc is zeroed just in time + @ to preclude oversteping + @ inp[len], which means that + @ last block[s] are actually + @ loaded twice, but last + @ copy is not processed + vld1.64 {$H-$Hhl},[$Htbl],#32 @ load twisted H, ..., H^2 + vmov.i8 $xC2,#0xe1 + vld1.64 {$H2},[$Htbl] + cclr $inc,eq @ is it time to zero $inc? + vext.8 $Xl,$Xl,$Xl,#8 @ rotate Xi + vld1.64 {$t0},[$inp],#16 @ load [rotated] I[0] + vshl.u64 $xC2,$xC2,#57 @ compose 0xc2.0 constant +#ifndef __ARMEB__ + vrev64.8 $t0,$t0 + vrev64.8 $Xl,$Xl +#endif + vext.8 $IN,$t0,$t0,#8 @ rotate I[0] + b.lo .Lodd_tail_v8 @ $len was less than 32 +___ +{ my ($Xln,$Xmn,$Xhn,$In) = map("q$_",(4..7)); + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # +$code.=<<___; + vld1.64 {$t1},[$inp],$inc @ load [rotated] I[1] +#ifndef __ARMEB__ + vrev64.8 $t1,$t1 +#endif + vext.8 $In,$t1,$t1,#8 + veor $IN,$IN,$Xl @ I[i]^=Xi + vpmull.p64 $Xln,$H,$In @ H·Ii+1 + veor $t1,$t1,$In @ Karatsuba pre-processing + vpmull2.p64 $Xhn,$H,$In + b .Loop_mod2x_v8 + +.align 4 +.Loop_mod2x_v8: + vext.8 $t2,$IN,$IN,#8 + subs $len,$len,#32 @ is there more data? + vpmull.p64 $Xl,$H2,$IN @ H^2.lo·Xi.lo + cclr $inc,lo @ is it time to zero $inc? + + vpmull.p64 $Xmn,$Hhl,$t1 + veor $t2,$t2,$IN @ Karatsuba pre-processing + vpmull2.p64 $Xh,$H2,$IN @ H^2.hi·Xi.hi + veor $Xl,$Xl,$Xln @ accumulate + vpmull2.p64 $Xm,$Hhl,$t2 @ (H^2.lo+H^2.hi)·(Xi.lo+Xi.hi) + vld1.64 {$t0},[$inp],$inc @ load [rotated] I[i+2] + + veor $Xh,$Xh,$Xhn + cclr $inc,eq @ is it time to zero $inc? 
+ veor $Xm,$Xm,$Xmn + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + vld1.64 {$t1},[$inp],$inc @ load [rotated] I[i+3] +#ifndef __ARMEB__ + vrev64.8 $t0,$t0 +#endif + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase of reduction + +#ifndef __ARMEB__ + vrev64.8 $t1,$t1 +#endif + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + vext.8 $In,$t1,$t1,#8 + vext.8 $IN,$t0,$t0,#8 + veor $Xl,$Xm,$t2 + vpmull.p64 $Xln,$H,$In @ H·Ii+1 + veor $IN,$IN,$Xh @ accumulate $IN early + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase of reduction + vpmull.p64 $Xl,$Xl,$xC2 + veor $IN,$IN,$t2 + veor $t1,$t1,$In @ Karatsuba pre-processing + veor $IN,$IN,$Xl + vpmull2.p64 $Xhn,$H,$In + b.hs .Loop_mod2x_v8 @ there was at least 32 more bytes + + veor $Xh,$Xh,$t2 + vext.8 $IN,$t0,$t0,#8 @ re-construct $IN + adds $len,$len,#32 @ re-construct $len + veor $Xl,$Xl,$Xh @ re-construct $Xl + b.eq .Ldone_v8 @ is $len zero? +___ +} +$code.=<<___; +.Lodd_tail_v8: + vext.8 $t2,$Xl,$Xl,#8 + veor $IN,$IN,$Xl @ inp^=Xi + veor $t1,$t0,$t2 @ $t1 is rotated inp^Xi + + vpmull.p64 $Xl,$H,$IN @ H.lo·Xi.lo + veor $t1,$t1,$IN @ Karatsuba pre-processing + vpmull2.p64 $Xh,$H,$IN @ H.hi·Xi.hi + vpmull.p64 $Xm,$Hhl,$t1 @ (H.lo+H.hi)·(Xi.lo+Xi.hi) + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase of reduction + + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + veor $Xl,$Xm,$t2 + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase of reduction + vpmull.p64 $Xl,$Xl,$xC2 + veor $t2,$t2,$Xh + veor $Xl,$Xl,$t2 + +.Ldone_v8: +#ifndef __ARMEB__ + vrev64.8 $Xl,$Xl +#endif + vext.8 $Xl,$Xl,$Xl,#8 + vst1.64 {$Xl},[$Xi] @ write out Xi + +___ +$code.=<<___ if ($flavour !~ /64/); + vldmia sp!,{d8-d15} @ 32-bit ABI says so +___ +$code.=<<___; + ret +.size gcm_ghash_v8,.-gcm_ghash_v8 +___ +} +$code.=<<___; +.asciz "GHASH for ARMv8, CRYPTOGAMS by " +.align 2 +___ + +if ($flavour =~ /64/) { ######## 64-bit code + sub unvmov { + my $arg=shift; + + $arg =~ m/q([0-9]+)#(lo|hi),\s*q([0-9]+)#(lo|hi)/o && + sprintf "ins v%d.d[%d],v%d.d[%d]",$1,($2 eq "lo")?0:1,$3,($4 eq "lo")?0:1; + } + foreach(split("\n",$code)) { + s/cclr\s+([wx])([^,]+),\s*([a-z]+)/csel $1$2,$1zr,$1$2,$3/o or + s/vmov\.i8/movi/o or # fix up legacy mnemonics + s/vmov\s+(.*)/unvmov($1)/geo or + s/vext\.8/ext/o or + s/vshr\.s/sshr\.s/o or + s/vshr/ushr/o or + s/^(\s+)v/$1/o or # strip off v prefix + s/\bbx\s+lr\b/ret/o; + + s/\bq([0-9]+)\b/"v".($1<8?$1:$1+8).".16b"/geo; # old->new registers + s/@\s/\/\//o; # old->new style commentary + + # fix up remainig legacy suffixes + s/\.[ui]?8(\s)/$1/o; + s/\.[uis]?32//o and s/\.16b/\.4s/go; + m/\.p64/o and s/\.16b/\.1q/o; # 1st pmull argument + m/l\.p64/o and s/\.16b/\.1d/go; # 2nd and 3rd pmull arguments + s/\.[uisp]?64//o and s/\.16b/\.2d/go; + s/\.[42]([sd])\[([0-3])\]/\.$1\[$2\]/o; + + print $_,"\n"; + } +} else { ######## 32-bit code + sub unvdup32 { + my $arg=shift; + + $arg =~ m/q([0-9]+),\s*q([0-9]+)\[([0-3])\]/o && + sprintf "vdup.32 q%d,d%d[%d]",$1,2*$2+($3>>1),$3&1; + } + sub unvpmullp64 { + my ($mnemonic,$arg)=@_; + + if ($arg =~ m/q([0-9]+),\s*q([0-9]+),\s*q([0-9]+)/o) { + my $word = 0xf2a00e00|(($1&7)<<13)|(($1&8)<<19) + |(($2&7)<<17)|(($2&8)<<4) + |(($3&7)<<1) |(($3&8)<<2); + $word |= 0x00010001 if ($mnemonic =~ "2"); + # since ARMv7 instructions are always encoded little-endian. 
+ # correct solution is to use .inst directive, but older + # assemblers don't implement it:-( + sprintf ".byte\t0x%02x,0x%02x,0x%02x,0x%02x\t@ %s %s", + $word&0xff,($word>>8)&0xff, + ($word>>16)&0xff,($word>>24)&0xff, + $mnemonic,$arg; + } + } + + foreach(split("\n",$code)) { + s/\b[wx]([0-9]+)\b/r$1/go; # new->old registers + s/\bv([0-9])\.[12468]+[bsd]\b/q$1/go; # new->old registers + s/\/\/\s?/@ /o; # new->old style commentary + + # fix up remainig new-style suffixes + s/\],#[0-9]+/]!/o; + + s/cclr\s+([^,]+),\s*([a-z]+)/mov$2 $1,#0/o or + s/vdup\.32\s+(.*)/unvdup32($1)/geo or + s/v?(pmull2?)\.p64\s+(.*)/unvpmullp64($1,$2)/geo or + s/\bq([0-9]+)#(lo|hi)/sprintf "d%d",2*$1+($2 eq "hi")/geo or + s/^(\s+)b\./$1b/o or + s/^(\s+)ret/$1bx\tlr/o; + + print $_,"\n"; + } +} + +close STDOUT; # enforce flush diff --git a/crypto/blake2s-load-sse2.h b/crypto/blake2s-load-sse2.h new file mode 100644 index 0000000..d2e9a09 --- /dev/null +++ b/crypto/blake2s-load-sse2.h @@ -0,0 +1,60 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_LOAD_SSE2_H +#define BLAKE2S_LOAD_SSE2_H + +#define LOAD_MSG_0_1(buf) buf = _mm_set_epi32(m6,m4,m2,m0) +#define LOAD_MSG_0_2(buf) buf = _mm_set_epi32(m7,m5,m3,m1) +#define LOAD_MSG_0_3(buf) buf = _mm_set_epi32(m14,m12,m10,m8) +#define LOAD_MSG_0_4(buf) buf = _mm_set_epi32(m15,m13,m11,m9) +#define LOAD_MSG_1_1(buf) buf = _mm_set_epi32(m13,m9,m4,m14) +#define LOAD_MSG_1_2(buf) buf = _mm_set_epi32(m6,m15,m8,m10) +#define LOAD_MSG_1_3(buf) buf = _mm_set_epi32(m5,m11,m0,m1) +#define LOAD_MSG_1_4(buf) buf = _mm_set_epi32(m3,m7,m2,m12) +#define LOAD_MSG_2_1(buf) buf = _mm_set_epi32(m15,m5,m12,m11) +#define LOAD_MSG_2_2(buf) buf = _mm_set_epi32(m13,m2,m0,m8) +#define LOAD_MSG_2_3(buf) buf = _mm_set_epi32(m9,m7,m3,m10) +#define LOAD_MSG_2_4(buf) buf = _mm_set_epi32(m4,m1,m6,m14) +#define LOAD_MSG_3_1(buf) buf = _mm_set_epi32(m11,m13,m3,m7) +#define LOAD_MSG_3_2(buf) buf = _mm_set_epi32(m14,m12,m1,m9) +#define LOAD_MSG_3_3(buf) buf = _mm_set_epi32(m15,m4,m5,m2) +#define LOAD_MSG_3_4(buf) buf = _mm_set_epi32(m8,m0,m10,m6) +#define LOAD_MSG_4_1(buf) buf = _mm_set_epi32(m10,m2,m5,m9) +#define LOAD_MSG_4_2(buf) buf = _mm_set_epi32(m15,m4,m7,m0) +#define LOAD_MSG_4_3(buf) buf = _mm_set_epi32(m3,m6,m11,m14) +#define LOAD_MSG_4_4(buf) buf = _mm_set_epi32(m13,m8,m12,m1) +#define LOAD_MSG_5_1(buf) buf = _mm_set_epi32(m8,m0,m6,m2) +#define LOAD_MSG_5_2(buf) buf = _mm_set_epi32(m3,m11,m10,m12) +#define LOAD_MSG_5_3(buf) buf = _mm_set_epi32(m1,m15,m7,m4) +#define LOAD_MSG_5_4(buf) buf = _mm_set_epi32(m9,m14,m5,m13) +#define LOAD_MSG_6_1(buf) buf = _mm_set_epi32(m4,m14,m1,m12) +#define LOAD_MSG_6_2(buf) buf = _mm_set_epi32(m10,m13,m15,m5) +#define LOAD_MSG_6_3(buf) buf = _mm_set_epi32(m8,m9,m6,m0) +#define LOAD_MSG_6_4(buf) buf = _mm_set_epi32(m11,m2,m3,m7) +#define LOAD_MSG_7_1(buf) buf = _mm_set_epi32(m3,m12,m7,m13) +#define LOAD_MSG_7_2(buf) buf = _mm_set_epi32(m9,m1,m14,m11) +#define LOAD_MSG_7_3(buf) buf = _mm_set_epi32(m2,m8,m15,m5) +#define LOAD_MSG_7_4(buf) buf 
= _mm_set_epi32(m10,m6,m4,m0) +#define LOAD_MSG_8_1(buf) buf = _mm_set_epi32(m0,m11,m14,m6) +#define LOAD_MSG_8_2(buf) buf = _mm_set_epi32(m8,m3,m9,m15) +#define LOAD_MSG_8_3(buf) buf = _mm_set_epi32(m10,m1,m13,m12) +#define LOAD_MSG_8_4(buf) buf = _mm_set_epi32(m5,m4,m7,m2) +#define LOAD_MSG_9_1(buf) buf = _mm_set_epi32(m1,m7,m8,m10) +#define LOAD_MSG_9_2(buf) buf = _mm_set_epi32(m5,m6,m4,m2) +#define LOAD_MSG_9_3(buf) buf = _mm_set_epi32(m13,m3,m9,m15) +#define LOAD_MSG_9_4(buf) buf = _mm_set_epi32(m0,m12,m14,m11) + + +#endif diff --git a/crypto/blake2s-load-sse41.h b/crypto/blake2s-load-sse41.h new file mode 100644 index 0000000..c316fb5 --- /dev/null +++ b/crypto/blake2s-load-sse41.h @@ -0,0 +1,229 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_LOAD_SSE41_H +#define BLAKE2S_LOAD_SSE41_H + +#define LOAD_MSG_0_1(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m0), TOF(m1), _MM_SHUFFLE(2,0,2,0))); + +#define LOAD_MSG_0_2(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m0), TOF(m1), _MM_SHUFFLE(3,1,3,1))); + +#define LOAD_MSG_0_3(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m2), TOF(m3), _MM_SHUFFLE(2,0,2,0))); + +#define LOAD_MSG_0_4(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m2), TOF(m3), _MM_SHUFFLE(3,1,3,1))); + +#define LOAD_MSG_1_1(buf) \ +t0 = _mm_blend_epi16(m1, m2, 0x0C); \ +t1 = _mm_slli_si128(m3, 4); \ +t2 = _mm_blend_epi16(t0, t1, 0xF0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,1,0,3)); + +#define LOAD_MSG_1_2(buf) \ +t0 = _mm_shuffle_epi32(m2,_MM_SHUFFLE(0,0,2,0)); \ +t1 = _mm_blend_epi16(m1,m3,0xC0); \ +t2 = _mm_blend_epi16(t0, t1, 0xF0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1)); + +#define LOAD_MSG_1_3(buf) \ +t0 = _mm_slli_si128(m1, 4); \ +t1 = _mm_blend_epi16(m2, t0, 0x30); \ +t2 = _mm_blend_epi16(m0, t1, 0xF0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1)); + +#define LOAD_MSG_1_4(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_slli_si128(m3, 4); \ +t2 = _mm_blend_epi16(t0, t1, 0x0C); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1)); + +#define LOAD_MSG_2_1(buf) \ +t0 = _mm_unpackhi_epi32(m2,m3); \ +t1 = _mm_blend_epi16(m3,m1,0x0C); \ +t2 = _mm_blend_epi16(t0, t1, 0x0F); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(3,1,0,2)); + +#define LOAD_MSG_2_2(buf) \ +t0 = _mm_unpacklo_epi32(m2,m0); \ +t1 = _mm_blend_epi16(t0, m0, 0xF0); \ +t2 = _mm_slli_si128(m3, 8); \ +buf = _mm_blend_epi16(t1, t2, 0xC0); + +#define LOAD_MSG_2_3(buf) \ +t0 = _mm_blend_epi16(m0, m2, 0x3C); \ +t1 = _mm_srli_si128(m1, 12); \ +t2 = _mm_blend_epi16(t0,t1,0x03); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,0,3,2)); + +#define LOAD_MSG_2_4(buf) \ +t0 = _mm_slli_si128(m3, 4); \ +t1 = _mm_blend_epi16(m0, m1, 0x33); \ +t2 = _mm_blend_epi16(t1, t0, 0xC0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(0,1,2,3)); + +#define LOAD_MSG_3_1(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_unpackhi_epi32(t0, m2); \ +t2 = _mm_blend_epi16(t1, m3, 0x0C); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(3,1,0,2)); + +#define LOAD_MSG_3_2(buf) \ +t0 = _mm_slli_si128(m2, 8); \ +t1 = 
_mm_blend_epi16(m3,m0,0x0C); \ +t2 = _mm_blend_epi16(t1, t0, 0xC0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,0,1,3)); + +#define LOAD_MSG_3_3(buf) \ +t0 = _mm_blend_epi16(m0,m1,0x0F); \ +t1 = _mm_blend_epi16(t0, m3, 0xC0); \ +buf = _mm_shuffle_epi32(t1, _MM_SHUFFLE(3,0,1,2)); + +#define LOAD_MSG_3_4(buf) \ +t0 = _mm_unpacklo_epi32(m0,m2); \ +t1 = _mm_unpackhi_epi32(m1,m2); \ +buf = _mm_unpacklo_epi64(t1,t0); + +#define LOAD_MSG_4_1(buf) \ +t0 = _mm_unpacklo_epi64(m1,m2); \ +t1 = _mm_unpackhi_epi64(m0,m2); \ +t2 = _mm_blend_epi16(t0,t1,0x33); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,0,1,3)); + +#define LOAD_MSG_4_2(buf) \ +t0 = _mm_unpackhi_epi64(m1,m3); \ +t1 = _mm_unpacklo_epi64(m0,m1); \ +buf = _mm_blend_epi16(t0,t1,0x33); + +#define LOAD_MSG_4_3(buf) \ +t0 = _mm_unpackhi_epi64(m3,m1); \ +t1 = _mm_unpackhi_epi64(m2,m0); \ +buf = _mm_blend_epi16(t1,t0,0x33); + +#define LOAD_MSG_4_4(buf) \ +t0 = _mm_blend_epi16(m0,m2,0x03); \ +t1 = _mm_slli_si128(t0, 8); \ +t2 = _mm_blend_epi16(t1,m3,0x0F); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,2,0,3)); + +#define LOAD_MSG_5_1(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_unpacklo_epi32(m0,m2); \ +buf = _mm_unpacklo_epi64(t0,t1); + +#define LOAD_MSG_5_2(buf) \ +t0 = _mm_srli_si128(m2, 4); \ +t1 = _mm_blend_epi16(m0,m3,0x03); \ +buf = _mm_blend_epi16(t1,t0,0x3C); + +#define LOAD_MSG_5_3(buf) \ +t0 = _mm_blend_epi16(m1,m0,0x0C); \ +t1 = _mm_srli_si128(m3, 4); \ +t2 = _mm_blend_epi16(t0,t1,0x30); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,2,3,0)); + +#define LOAD_MSG_5_4(buf) \ +t0 = _mm_unpacklo_epi64(m1,m2); \ +t1= _mm_shuffle_epi32(m3, _MM_SHUFFLE(0,2,0,1)); \ +buf = _mm_blend_epi16(t0,t1,0x33); + +#define LOAD_MSG_6_1(buf) \ +t0 = _mm_slli_si128(m1, 12); \ +t1 = _mm_blend_epi16(m0,m3,0x33); \ +buf = _mm_blend_epi16(t1,t0,0xC0); + +#define LOAD_MSG_6_2(buf) \ +t0 = _mm_blend_epi16(m3,m2,0x30); \ +t1 = _mm_srli_si128(m1, 4); \ +t2 = _mm_blend_epi16(t0,t1,0x03); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,1,3,0)); + +#define LOAD_MSG_6_3(buf) \ +t0 = _mm_unpacklo_epi64(m0,m2); \ +t1 = _mm_srli_si128(m1, 4); \ +buf = _mm_shuffle_epi32(_mm_blend_epi16(t0,t1,0x0C), _MM_SHUFFLE(2,3,1,0)); + +#define LOAD_MSG_6_4(buf) \ +t0 = _mm_unpackhi_epi32(m1,m2); \ +t1 = _mm_unpackhi_epi64(m0,t0); \ +buf = _mm_shuffle_epi32(t1, _MM_SHUFFLE(3,0,1,2)); + +#define LOAD_MSG_7_1(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_blend_epi16(t0,m3,0x0F); \ +buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(2,0,3,1)); + +#define LOAD_MSG_7_2(buf) \ +t0 = _mm_blend_epi16(m2,m3,0x30); \ +t1 = _mm_srli_si128(m0,4); \ +t2 = _mm_blend_epi16(t0,t1,0x03); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,0,2,3)); + +#define LOAD_MSG_7_3(buf) \ +t0 = _mm_unpackhi_epi64(m0,m3); \ +t1 = _mm_unpacklo_epi64(m1,m2); \ +t2 = _mm_blend_epi16(t0,t1,0x3C); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(0,2,3,1)); + +#define LOAD_MSG_7_4(buf) \ +t0 = _mm_unpacklo_epi32(m0,m1); \ +t1 = _mm_unpackhi_epi32(m1,m2); \ +buf = _mm_unpacklo_epi64(t0,t1); + +#define LOAD_MSG_8_1(buf) \ +t0 = _mm_unpackhi_epi32(m1,m3); \ +t1 = _mm_unpacklo_epi64(t0,m0); \ +t2 = _mm_blend_epi16(t1,m2,0xC0); \ +buf = _mm_shufflehi_epi16(t2,_MM_SHUFFLE(1,0,3,2)); + +#define LOAD_MSG_8_2(buf) \ +t0 = _mm_unpackhi_epi32(m0,m3); \ +t1 = _mm_blend_epi16(m2,t0,0xF0); \ +buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(0,2,1,3)); + +#define LOAD_MSG_8_3(buf) \ +t0 = _mm_blend_epi16(m2,m0,0x0C); \ +t1 = _mm_slli_si128(t0,4); \ +buf = _mm_blend_epi16(t1,m3,0x0F); + +#define LOAD_MSG_8_4(buf) \ +t0 = _mm_blend_epi16(m1,m0,0x30); \ +buf = 
_mm_shuffle_epi32(t0,_MM_SHUFFLE(1,0,3,2)); + +#define LOAD_MSG_9_1(buf) \ +t0 = _mm_blend_epi16(m0,m2,0x03); \ +t1 = _mm_blend_epi16(m1,m2,0x30); \ +t2 = _mm_blend_epi16(t1,t0,0x0F); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(1,3,0,2)); + +#define LOAD_MSG_9_2(buf) \ +t0 = _mm_slli_si128(m0,4); \ +t1 = _mm_blend_epi16(m1,t0,0xC0); \ +buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(1,2,0,3)); + +#define LOAD_MSG_9_3(buf) \ +t0 = _mm_unpackhi_epi32(m0,m3); \ +t1 = _mm_unpacklo_epi32(m2,m3); \ +t2 = _mm_unpackhi_epi64(t0,t1); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(3,0,2,1)); + +#define LOAD_MSG_9_4(buf) \ +t0 = _mm_blend_epi16(m3,m2,0xC0); \ +t1 = _mm_unpacklo_epi32(m0,m3); \ +t2 = _mm_blend_epi16(t0,t1,0x0F); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(0,1,2,3)); + +#endif diff --git a/crypto/blake2s-load-xop.h b/crypto/blake2s-load-xop.h new file mode 100644 index 0000000..a97ddcc --- /dev/null +++ b/crypto/blake2s-load-xop.h @@ -0,0 +1,191 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_LOAD_XOP_H +#define BLAKE2S_LOAD_XOP_H + +#define TOB(x) ((x)*4*0x01010101 + 0x03020100) /* ..or not TOB */ + +#if 0 +/* Basic VPPERM emulation, for testing purposes */ +static __m128i _mm_perm_epi8(const __m128i src1, const __m128i src2, const __m128i sel) +{ + const __m128i sixteen = _mm_set1_epi8(16); + const __m128i t0 = _mm_shuffle_epi8(src1, sel); + const __m128i s1 = _mm_shuffle_epi8(src2, _mm_sub_epi8(sel, sixteen)); + const __m128i mask = _mm_or_si128(_mm_cmpeq_epi8(sel, sixteen), + _mm_cmpgt_epi8(sel, sixteen)); /* (>=16) = 0xff : 00 */ + return _mm_blendv_epi8(t0, s1, mask); +} +#endif + +#define LOAD_MSG_0_1(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(6),TOB(4),TOB(2),TOB(0)) ); + +#define LOAD_MSG_0_2(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(7),TOB(5),TOB(3),TOB(1)) ); + +#define LOAD_MSG_0_3(buf) \ +buf = _mm_perm_epi8(m2, m3, _mm_set_epi32(TOB(6),TOB(4),TOB(2),TOB(0)) ); + +#define LOAD_MSG_0_4(buf) \ +buf = _mm_perm_epi8(m2, m3, _mm_set_epi32(TOB(7),TOB(5),TOB(3),TOB(1)) ); + +#define LOAD_MSG_1_1(buf) \ +t0 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(0),TOB(5),TOB(0),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_1_2(buf) \ +t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(2),TOB(0),TOB(4),TOB(6)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); + +#define LOAD_MSG_1_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(0),TOB(0),TOB(1)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); + +#define LOAD_MSG_1_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(7),TOB(2),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(4)) ); + +#define LOAD_MSG_2_1(buf) \ +t0 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(0),TOB(1),TOB(0),TOB(7)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(4),TOB(0)) ); + +#define LOAD_MSG_2_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, 
_mm_set_epi32(TOB(0),TOB(2),TOB(0),TOB(4)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_2_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(7),TOB(3),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_2_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(4),TOB(1),TOB(6),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_3_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(3),TOB(7)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(5),TOB(1),TOB(0)) ); + +#define LOAD_MSG_3_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(0),TOB(1),TOB(5)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(6),TOB(4),TOB(1),TOB(0)) ); + +#define LOAD_MSG_3_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(4),TOB(5),TOB(2)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_3_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(6)) ); \ +buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(4),TOB(2),TOB(6),TOB(0)) ); + +#define LOAD_MSG_4_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(2),TOB(5),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(6),TOB(2),TOB(1),TOB(5)) ); + +#define LOAD_MSG_4_2(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(4),TOB(7),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_4_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(6),TOB(0),TOB(0)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(2),TOB(7),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_4_4(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(4),TOB(0),TOB(1)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(4),TOB(0)) ); + +#define LOAD_MSG_5_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(6),TOB(2)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(4),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_5_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(6),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(4)) ); + +#define LOAD_MSG_5_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(1),TOB(0),TOB(7),TOB(4)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); + +#define LOAD_MSG_5_4(buf) \ +t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(5),TOB(0),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(6),TOB(1),TOB(5)) ); + +#define LOAD_MSG_6_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(4),TOB(0),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(6),TOB(1),TOB(4)) ); + +#define LOAD_MSG_6_2(buf) \ +t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(6),TOB(0),TOB(0),TOB(1)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(5),TOB(7),TOB(0)) ); + +#define LOAD_MSG_6_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(6),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(4),TOB(5),TOB(1),TOB(0)) ); + +#define LOAD_MSG_6_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(2),TOB(3),TOB(7)) ); \ +buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); + 
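An aside on the TOB selector used throughout this header: TOB(x) = x*4*0x01010101 + 0x03020100 packs four consecutive byte indices that select 32-bit word x of the concatenated src1:src2 input of VPPERM (indices 16 and up reach src2). A tiny standalone check, not part of the header:

#include <stdio.h>
#define TOB(x) ((x)*4*0x01010101 + 0x03020100)

int main(void) {
  printf("TOB(2) = 0x%08X\n", TOB(2)); /* 0x0B0A0908: bytes 8..11, word 2 of src1 */
  printf("TOB(5) = 0x%08X\n", TOB(5)); /* 0x17161514: bytes 20..23, word 1 of src2 */
  return 0;
}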
+#define LOAD_MSG_7_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(0),TOB(7),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(4),TOB(1),TOB(5)) ); + +#define LOAD_MSG_7_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(5),TOB(1),TOB(0),TOB(7)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(6),TOB(0)) ); + +#define LOAD_MSG_7_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(2),TOB(0),TOB(0),TOB(5)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(4),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(7),TOB(0)) ); + +#define LOAD_MSG_7_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(6),TOB(4),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(6),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_8_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(6)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(6),TOB(0)) ); + +#define LOAD_MSG_8_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(4),TOB(3),TOB(5),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(7)) ); + +#define LOAD_MSG_8_3(buf) \ +t0 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(6),TOB(1),TOB(0),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(5),TOB(4)) ); \ + +#define LOAD_MSG_8_4(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(4),TOB(7),TOB(2)) ); + +#define LOAD_MSG_9_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(1),TOB(7),TOB(0),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(2),TOB(4),TOB(6)) ); + +#define LOAD_MSG_9_2(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(6),TOB(4),TOB(2)) ); + +#define LOAD_MSG_9_3(buf) \ +t0 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(3),TOB(5),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(7)) ); + +#define LOAD_MSG_9_4(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(7)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(4),TOB(6),TOB(0)) ); + +#endif diff --git a/crypto/blake2s-round.h b/crypto/blake2s-round.h new file mode 100644 index 0000000..44a5574 --- /dev/null +++ b/crypto/blake2s-round.h @@ -0,0 +1,88 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_ROUND_H +#define BLAKE2S_ROUND_H + +#define LOADU(p) _mm_loadu_si128( (const __m128i *)(p) ) +#define STOREU(p,r) _mm_storeu_si128((__m128i *)(p), r) + +#define TOF(reg) _mm_castsi128_ps((reg)) +#define TOI(reg) _mm_castps_si128((reg)) + +#define LIKELY(x) __builtin_expect((x),1) + + +/* Microarchitecture-specific macros */ +#ifndef HAVE_XOP +#ifdef HAVE_SSSE3 +#define _mm_roti_epi32(r, c) ( \ + (8==-(c)) ? _mm_shuffle_epi8(r,r8) \ + : (16==-(c)) ? 
_mm_shuffle_epi8(r,r16) \ + : _mm_xor_si128(_mm_srli_epi32( (r), -(c) ),_mm_slli_epi32( (r), 32-(-(c)) )) ) +#else +#define _mm_roti_epi32(r, c) _mm_xor_si128(_mm_srli_epi32( (r), -(c) ),_mm_slli_epi32( (r), 32-(-(c)) )) +#endif +#else +/* ... */ +#endif + + +#define G1(row1,row2,row3,row4,buf) \ + row1 = _mm_add_epi32( _mm_add_epi32( row1, buf), row2 ); \ + row4 = _mm_xor_si128( row4, row1 ); \ + row4 = _mm_roti_epi32(row4, -16); \ + row3 = _mm_add_epi32( row3, row4 ); \ + row2 = _mm_xor_si128( row2, row3 ); \ + row2 = _mm_roti_epi32(row2, -12); + +#define G2(row1,row2,row3,row4,buf) \ + row1 = _mm_add_epi32( _mm_add_epi32( row1, buf), row2 ); \ + row4 = _mm_xor_si128( row4, row1 ); \ + row4 = _mm_roti_epi32(row4, -8); \ + row3 = _mm_add_epi32( row3, row4 ); \ + row2 = _mm_xor_si128( row2, row3 ); \ + row2 = _mm_roti_epi32(row2, -7); + +#define DIAGONALIZE(row1,row2,row3,row4) \ + row4 = _mm_shuffle_epi32( row4, _MM_SHUFFLE(2,1,0,3) ); \ + row3 = _mm_shuffle_epi32( row3, _MM_SHUFFLE(1,0,3,2) ); \ + row2 = _mm_shuffle_epi32( row2, _MM_SHUFFLE(0,3,2,1) ); + +#define UNDIAGONALIZE(row1,row2,row3,row4) \ + row4 = _mm_shuffle_epi32( row4, _MM_SHUFFLE(0,3,2,1) ); \ + row3 = _mm_shuffle_epi32( row3, _MM_SHUFFLE(1,0,3,2) ); \ + row2 = _mm_shuffle_epi32( row2, _MM_SHUFFLE(2,1,0,3) ); + +#if defined(HAVE_XOP) +#include "blake2s-load-xop.h" +#elif defined(HAVE_SSE41) +#include "blake2s-load-sse41.h" +#else +#include "blake2s-load-sse2.h" +#endif + +#define ROUND(r) \ + LOAD_MSG_ ##r ##_1(buf1); \ + G1(row1,row2,row3,row4,buf1); \ + LOAD_MSG_ ##r ##_2(buf2); \ + G2(row1,row2,row3,row4,buf2); \ + DIAGONALIZE(row1,row2,row3,row4); \ + LOAD_MSG_ ##r ##_3(buf3); \ + G1(row1,row2,row3,row4,buf3); \ + LOAD_MSG_ ##r ##_4(buf4); \ + G2(row1,row2,row3,row4,buf4); \ + UNDIAGONALIZE(row1,row2,row3,row4); \ + +#endif diff --git a/crypto/blake2s.cpp b/crypto/blake2s.cpp new file mode 100644 index 0000000..8c2397c --- /dev/null +++ b/crypto/blake2s.cpp @@ -0,0 +1,446 @@ +/* +BLAKE2 reference source code package - reference C implementations + +Copyright 2012, Samuel Neves . You may use this under the +terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at +your option. The terms of these licenses can be found at: + +- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 +- OpenSSL license : https://www.openssl.org/source/license.html +- Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + +More information about the BLAKE2 hash function can be found at +https://blake2.net. 
+*/ + +#include "stdafx.h" +#include <assert.h> +#include <stdint.h> +#include <string.h> +#include <stdio.h> +#include "tunsafe_types.h" +#include "blake2s.h" +#include "crypto_ops.h" + +void blake2s_compress_sse(blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]); + +#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L) +#if defined(_MSC_VER) +#define BLAKE2_INLINE __inline +#elif defined(__GNUC__) +#define BLAKE2_INLINE __inline__ +#else +#define BLAKE2_INLINE +#endif +#else +#define BLAKE2_INLINE inline +#endif + +static BLAKE2_INLINE uint32_t load32(const void *src) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + uint32_t w; + memcpy(&w, src, sizeof w); + return w; +#else + const uint8_t *p = (const uint8_t *)src; + return ((uint32_t)(p[0]) << 0) | + ((uint32_t)(p[1]) << 8) | + ((uint32_t)(p[2]) << 16) | + ((uint32_t)(p[3]) << 24); +#endif +} + +static BLAKE2_INLINE uint16_t load16(const void *src) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + uint16_t w; + memcpy(&w, src, sizeof w); + return w; +#else + const uint8_t *p = (const uint8_t *)src; + return ((uint16_t)(p[0]) << 0) | + ((uint16_t)(p[1]) << 8); +#endif +} + +static BLAKE2_INLINE void store16(void *dst, uint16_t w) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + memcpy(dst, &w, sizeof w); +#else + uint8_t *p = (uint8_t *)dst; + *p++ = (uint8_t)w; w >>= 8; + *p++ = (uint8_t)w; +#endif +} + +static BLAKE2_INLINE void store32(void *dst, uint32_t w) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + memcpy(dst, &w, sizeof w); +#else + uint8_t *p = (uint8_t *)dst; + p[0] = (uint8_t)(w >> 0); + p[1] = (uint8_t)(w >> 8); + p[2] = (uint8_t)(w >> 16); + p[3] = (uint8_t)(w >> 24); +#endif +} + +static BLAKE2_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) { + return (w >> c) | (w << (32 - c)); +} + +static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) { + return (w >> c) | (w << (64 - c)); +} + +static const uint32_t blake2s_IV[8] = { + 0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL, + 0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL +}; + +static const uint8_t blake2s_sigma[10][16] = +{ + {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15} , + {14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3} , + {11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4} , + {7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8} , + {9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13} , + {2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9} , + {12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11} , + {13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10} , + {6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5} , + {10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0} , +}; + +static void blake2s_set_lastnode(blake2s_state *S) { + S->f[1] = (uint32_t)-1; +} + +/* Some helper functions, not necessarily useful */ +static int blake2s_is_lastblock(const blake2s_state *S) { + return S->f[0] != 0; +} + +static void blake2s_set_lastblock(blake2s_state *S) { + if (S->last_node) blake2s_set_lastnode(S); + + S->f[0] = (uint32_t)-1; +} + +static void blake2s_increment_counter(blake2s_state *S, const uint32_t inc) { + S->t[0] += inc; + S->t[1] += (S->t[0] < inc); +} + +void blake2s_init_with_len(blake2s_state *S, size_t outlen, size_t keylen) { + memset(S, 0, sizeof(blake2s_state)); + + blake2s_param *P = &S->param; + size_t i; + + /* Move interval verification here? 
+
+void blake2s_init_with_len(blake2s_state *S, size_t outlen, size_t keylen) {
+  memset(S, 0, sizeof(blake2s_state));
+
+  blake2s_param *P = &S->param;
+  size_t i;
+
+  /* Move interval verification here? */
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+
+  P->digest_length = (uint8_t)outlen;
+  S->outlen = (uint8_t)outlen;
+  P->key_length = (uint8_t)keylen;
+  P->fanout = 1;
+  P->depth = 1;
+  // store32(&P->leaf_length, 0);
+  // store32(&P->node_offset, 0);
+  // store16(&P->xof_length, 0);
+  // P->node_depth = 0;
+  // P->inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  // memset(P->salt, 0, sizeof(P->salt));
+  // memset(P->personal, 0, sizeof(P->personal));
+  for (i = 0; i < 8; ++i)
+    S->h[i] = load32(&S->h[i]) ^ blake2s_IV[i];
+}
+
+/* Sequential blake2s initialization */
+void blake2s_init(blake2s_state *S, size_t outlen) {
+  blake2s_init_with_len(S, outlen, 0);
+}
+
+void blake2s_init_key(blake2s_state *S, size_t outlen, const void *key, size_t keylen) {
+  uint8_t block[BLAKE2S_BLOCKBYTES];
+
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+  assert(key && keylen && keylen <= BLAKE2S_KEYBYTES);
+
+  blake2s_init_with_len(S, outlen, keylen);
+
+  memset(block, 0, BLAKE2S_BLOCKBYTES);
+  memcpy(block, key, keylen);
+  blake2s_update(S, block, BLAKE2S_BLOCKBYTES);
+  memzero_crypto(block, BLAKE2S_BLOCKBYTES); /* Burn the key from stack */
+}
+
+#define G(r,i,a,b,c,d) \
+  do { \
+    a = a + b + m[blake2s_sigma[r][2*i+0]]; \
+    d = rotr32(d ^ a, 16); \
+    c = c + d; \
+    b = rotr32(b ^ c, 12); \
+    a = a + b + m[blake2s_sigma[r][2*i+1]]; \
+    d = rotr32(d ^ a, 8); \
+    c = c + d; \
+    b = rotr32(b ^ c, 7); \
+  } while(0)
+
+#define ROUND(r) \
+  do { \
+    G(r,0,v[ 0],v[ 4],v[ 8],v[12]); \
+    G(r,1,v[ 1],v[ 5],v[ 9],v[13]); \
+    G(r,2,v[ 2],v[ 6],v[10],v[14]); \
+    G(r,3,v[ 3],v[ 7],v[11],v[15]); \
+    G(r,4,v[ 0],v[ 5],v[10],v[15]); \
+    G(r,5,v[ 1],v[ 6],v[11],v[12]); \
+    G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
+    G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
+  } while(0)
+
+static void blake2s_compress(blake2s_state *S, const uint8_t in[BLAKE2S_BLOCKBYTES]) {
+  uint32_t m[16];
+  uint32_t v[16];
+  size_t i;
+
+  for (i = 0; i < 16; ++i) {
+    m[i] = load32(in + i * sizeof(m[i]));
+  }
+
+  for (i = 0; i < 8; ++i) {
+    v[i] = S->h[i];
+  }
+
+  v[8] = blake2s_IV[0];
+  v[9] = blake2s_IV[1];
+  v[10] = blake2s_IV[2];
+  v[11] = blake2s_IV[3];
+  v[12] = S->t[0] ^ blake2s_IV[4];
+  v[13] = S->t[1] ^ blake2s_IV[5];
+  v[14] = S->f[0] ^ blake2s_IV[6];
+  v[15] = S->f[1] ^ blake2s_IV[7];
+
+  ROUND(0);
+  ROUND(1);
+  ROUND(2);
+  ROUND(3);
+  ROUND(4);
+  ROUND(5);
+  ROUND(6);
+  ROUND(7);
+  ROUND(8);
+  ROUND(9);
+
+  for (i = 0; i < 8; ++i) {
+    S->h[i] = S->h[i] ^ v[i] ^ v[i + 8];
+  }
+}
+
+#undef G
+#undef ROUND
+
+static inline void blake2s_compress_impl(blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]) {
+#if defined(ARCH_CPU_X86_64)
+  blake2s_compress_sse(S, block);
+#else
+  blake2s_compress(S, block);
+#endif
+}
+
+void blake2s_update(blake2s_state *S, const void *pin, size_t inlen) {
+  const unsigned char * in = (const unsigned char *)pin;
+  if (inlen > 0) {
+    size_t left = S->buflen;
+    size_t fill = BLAKE2S_BLOCKBYTES - left;
+    if (inlen > fill) {
+      S->buflen = 0;
+      memcpy(S->buf + left, in, fill); /* Fill buffer */
+      blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+      blake2s_compress_impl(S, S->buf); /* Compress */
+      in += fill; inlen -= fill;
+      while (inlen > BLAKE2S_BLOCKBYTES) {
+        blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+        blake2s_compress_impl(S, in);
+        in += BLAKE2S_BLOCKBYTES;
+        inlen -= BLAKE2S_BLOCKBYTES;
+      }
+    }
+    memcpy(S->buf + S->buflen, in, inlen);
+    S->buflen += inlen;
+  }
+}
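With init/update/final in place, the streaming API composes in the usual way: feeding the message in arbitrary chunks must give the same digest as hashing it in one shot. A minimal sketch of that property (hypothetical test harness, not part of the commit):

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>
  #include "blake2s.h"

  static void stream_equals_oneshot(const uint8_t *msg, size_t len) {
    uint8_t h1[BLAKE2S_OUTBYTES], h2[BLAKE2S_OUTBYTES];
    blake2s_state S;

    blake2s(h1, BLAKE2S_OUTBYTES, msg, len, NULL, 0);  /* one shot, unkeyed */

    blake2s_init(&S, BLAKE2S_OUTBYTES);                /* same message, 13-byte chunks */
    for (size_t off = 0; off < len; off += 13) {
      size_t n = (len - off < 13) ? (len - off) : 13;
      blake2s_update(&S, msg + off, n);
    }
    blake2s_final(&S, h2, BLAKE2S_OUTBYTES);

    assert(memcmp(h1, h2, BLAKE2S_OUTBYTES) == 0);
  }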
+
+void blake2s_final(blake2s_state *S, void *out, size_t outlen) {
+  size_t i;
+
+  assert(out != NULL && outlen >= S->outlen);
+  assert(!blake2s_is_lastblock(S));
+
+  blake2s_increment_counter(S, (uint32_t)S->buflen);
+  blake2s_set_lastblock(S);
+  memset(S->buf + S->buflen, 0, BLAKE2S_BLOCKBYTES - S->buflen); /* Padding */
+  blake2s_compress_impl(S, S->buf);
+
+  for (i = 0; i < 8; ++i) /* Convert the hash words to little-endian, in place */
+    store32(&S->h[i], S->h[i]);
+
+  memcpy(out, S->h, outlen);
+}
+
+SAFEBUFFERS void blake2s(void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen) {
+  blake2s_state S;
+
+  /* Verify parameters */
+  assert(!(NULL == in && inlen > 0));
+  assert(out);
+  assert(!(NULL == key && keylen > 0));
+  assert(!(!outlen || outlen > BLAKE2S_OUTBYTES));
+  assert(!(keylen > BLAKE2S_KEYBYTES));
+
+  if (keylen > 0) {
+    blake2s_init_key(&S, outlen, key, keylen);
+  } else {
+    blake2s_init(&S, outlen);
+  }
+  blake2s_update(&S, (const uint8_t *)in, inlen);
+  blake2s_final(&S, out, outlen);
+}
+
+SAFEBUFFERS void blake2s_hmac(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen, const uint8_t *key, size_t keylen) {
+  blake2s_state b2s;
+  uint64_t temp[BLAKE2S_OUTBYTES / 8];
+  uint64_t key_temp[BLAKE2S_BLOCKBYTES / 8] = { 0 };
+
+  if (keylen > BLAKE2S_BLOCKBYTES) {
+    blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+    blake2s_update(&b2s, key, keylen);
+    blake2s_final(&b2s, key_temp, BLAKE2S_OUTBYTES);
+  } else {
+    memcpy(key_temp, key, keylen);
+  }
+
+  for (size_t i = 0; i < BLAKE2S_BLOCKBYTES / 8; i++)
+    key_temp[i] ^= 0x3636363636363636ull;
+
+  blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+  blake2s_update(&b2s, key_temp, BLAKE2S_BLOCKBYTES);
+  blake2s_update(&b2s, in, inlen);
+  blake2s_final(&b2s, temp, BLAKE2S_OUTBYTES);
+
+  for (size_t i = 0; i < BLAKE2S_BLOCKBYTES / 8; i++)
+    key_temp[i] ^= 0x5c5c5c5c5c5c5c5cull ^ 0x3636363636363636ull;
+
+  blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+  blake2s_update(&b2s, key_temp, BLAKE2S_BLOCKBYTES);
+  blake2s_update(&b2s, temp, BLAKE2S_OUTBYTES);
+  blake2s_final(&b2s, temp, BLAKE2S_OUTBYTES);
+
+  memcpy(out, temp, outlen);
+  memzero_crypto(key_temp, sizeof(key_temp));
+  memzero_crypto(temp, sizeof(temp));
+}
+
+SAFEBUFFERS
+void blake2s_hkdf(uint8 *dst1, size_t dst1_size,
+                  uint8 *dst2, size_t dst2_size,
+                  uint8 *dst3, size_t dst3_size,
+                  const uint8 *data, size_t data_size,
+                  const uint8 *key, size_t key_size) {
+  struct {
+    uint8 prk[BLAKE2S_OUTBYTES];
+    uint8 temp[BLAKE2S_OUTBYTES + 1];
+  } t;
+  blake2s_hmac(t.prk, BLAKE2S_OUTBYTES, data, data_size, key, key_size);
+  // first-key = HMAC(secret, 0x1)
+  t.temp[0] = 0x1;
+  blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, 1, t.prk, BLAKE2S_OUTBYTES);
+  memcpy(dst1, t.temp, dst1_size);
+  if (dst2 != NULL) {
+    // second-key = HMAC(secret, first-key || 0x2)
+    t.temp[BLAKE2S_OUTBYTES] = 0x2;
+    blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, BLAKE2S_OUTBYTES + 1, t.prk, BLAKE2S_OUTBYTES);
+    memcpy(dst2, t.temp, dst2_size);
+    if (dst3 != NULL) {
+      // third-key = HMAC(secret, second-key || 0x3)
+      t.temp[BLAKE2S_OUTBYTES] = 0x3;
+      blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, BLAKE2S_OUTBYTES + 1, t.prk, BLAKE2S_OUTBYTES);
+      memcpy(dst3, t.temp, dst3_size);
+    }
+  }
+  memzero_crypto(&t, sizeof(t));
+}
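blake2s_hkdf is HKDF built on the BLAKE2s-HMAC above, specialized to the at-most-three output blocks the key schedule needs: prk = HMAC(key, data), then T1 = HMAC(prk, 0x1), T2 = HMAC(prk, T1 || 0x2), T3 = HMAC(prk, T2 || 0x3), per the comments in the body. A representative call deriving two 32-byte keys (variable names here are hypothetical, for illustration only):

  uint8 new_chaining_key[BLAKE2S_OUTBYTES];
  uint8 session_key[BLAKE2S_OUTBYTES];

  /* Passing dst3 == NULL stops the expansion after the second block. */
  blake2s_hkdf(new_chaining_key, BLAKE2S_OUTBYTES,
               session_key, BLAKE2S_OUTBYTES,
               NULL, 0,
               dh_result, 32,                    /* input keying material (hypothetical) */
               chaining_key, BLAKE2S_OUTBYTES);  /* HMAC key (hypothetical) */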
+
+#if defined(SUPERCOP)
+int crypto_hash(unsigned char *out, unsigned char *in, unsigned long long inlen) {
+  blake2s(out, BLAKE2S_OUTBYTES, in, inlen, NULL, 0);
+  return 0;
+}
+#endif
+
+#if defined(BLAKE2S_SELFTEST)
+#include <string.h>
+#include "blake2-kat.h"
+int main(void) {
+  uint8_t key[BLAKE2S_KEYBYTES];
+  uint8_t buf[BLAKE2_KAT_LENGTH];
+  size_t i, step;
+
+  for (i = 0; i < BLAKE2S_KEYBYTES; ++i)
+    key[i] = (uint8_t)i;
+
+  for (i = 0; i < BLAKE2_KAT_LENGTH; ++i)
+    buf[i] = (uint8_t)i;
+
+  /* Test simple API */
+  for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+    uint8_t hash[BLAKE2S_OUTBYTES];
+    blake2s(hash, BLAKE2S_OUTBYTES, buf, i, key, BLAKE2S_KEYBYTES);
+
+    if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+      goto fail;
+    }
+  }
+
+  /* Test streaming API. init/update/final return void in this port,
+     so the reference code's error checks are dropped. */
+  for (step = 1; step < BLAKE2S_BLOCKBYTES; ++step) {
+    for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+      uint8_t hash[BLAKE2S_OUTBYTES];
+      blake2s_state S;
+      uint8_t * p = buf;
+      size_t mlen = i;
+
+      blake2s_init_key(&S, BLAKE2S_OUTBYTES, key, BLAKE2S_KEYBYTES);
+
+      while (mlen >= step) {
+        blake2s_update(&S, p, step);
+        mlen -= step;
+        p += step;
+      }
+      blake2s_update(&S, p, mlen);
+      blake2s_final(&S, hash, BLAKE2S_OUTBYTES);
+
+      if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+        goto fail;
+      }
+    }
+  }
+
+  puts("ok");
+  return 0;
+fail:
+  puts("error");
+  return -1;
+}
+#endif
\ No newline at end of file
diff --git a/crypto/blake2s.h b/crypto/blake2s.h
new file mode 100644
index 0000000..aa53209
--- /dev/null
+++ b/crypto/blake2s.h
@@ -0,0 +1,100 @@
+/*
+BLAKE2 reference source code package - reference C implementations
+
+Copyright 2012, Samuel Neves <sneves@dei.uc.pt>. You may use this under the
+terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+your option. The terms of these licenses can be found at:
+
+- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+- OpenSSL license : https://www.openssl.org/source/license.html
+- Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
+
+More information about the BLAKE2 hash function can be found at
+https://blake2.net.
+*/
+#ifndef BLAKE2_H
+#define BLAKE2_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include "tunsafe_types.h"
+#if defined(_MSC_VER)
+#define BLAKE2_PACKED(x) __pragma(pack(push, 1)) x __pragma(pack(pop))
+#else
+#define BLAKE2_PACKED(x) x __attribute__((packed))
+#endif
+
+#if defined(__cplusplus)
+//extern "C" {
+#endif
+
+enum blake2s_constant {
+  BLAKE2S_BLOCKBYTES = 64,
+  BLAKE2S_OUTBYTES = 32,
+  BLAKE2S_KEYBYTES = 32,
+  BLAKE2S_SALTBYTES = 8,
+  BLAKE2S_PERSONALBYTES = 8
+};
+
+BLAKE2_PACKED(struct blake2s_param__ {
+  uint8_t digest_length;  /* 1 */
+  uint8_t key_length;     /* 2 */
+  uint8_t fanout;         /* 3 */
+  uint8_t depth;          /* 4 */
+  uint32_t leaf_length;   /* 8 */
+  uint32_t node_offset;   /* 12 */
+  uint16_t xof_length;    /* 14 */
+  uint8_t node_depth;     /* 15 */
+  uint8_t inner_length;   /* 16 */
+  /* uint8_t reserved[0]; */
+  uint32_t salt[BLAKE2S_SALTBYTES / 4];         /* 24 */
+  uint32_t personal[BLAKE2S_PERSONALBYTES / 4]; /* 32 */
+});
+
+typedef struct blake2s_param__ blake2s_param;
+
+/* Padded structs result in a compile-time error */
+enum {
+  BLAKE2_DUMMY_1 = 1 / (sizeof(blake2s_param) == BLAKE2S_OUTBYTES ? 1 : 0),
+};
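The BLAKE2_DUMMY_1 trick turns a padding mistake into a compile-time division by zero: the packed parameter block must be exactly 32 bytes, because blake2s_state overlays it on the h[8] chaining value through a union. Under C++11 the same invariant could be stated directly (equivalent sketch, not part of the commit):

  static_assert(sizeof(blake2s_param) == BLAKE2S_OUTBYTES,
                "blake2s_param must be exactly 32 bytes or the h[]/param union breaks");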
+
+typedef struct blake2s_state__ {
+  union {
+    uint32_t h[8];
+    blake2s_param param;
+  };
+  uint32_t t[2];
+  uint32_t f[2];
+  uint8_t buf[BLAKE2S_BLOCKBYTES];
+  size_t buflen;
+  size_t outlen;
+  uint8_t last_node;
+} blake2s_state;
+
+/* Streaming API */
+void blake2s_init(blake2s_state *S, size_t outlen);
+void blake2s_init_key(blake2s_state *S, size_t outlen, const void *key, size_t keylen);
+void blake2s_init_param(blake2s_state *S, const blake2s_param *P);
+void blake2s_update(blake2s_state *S, const void *in, size_t inlen);
+void blake2s_final(blake2s_state *S, void *out, size_t outlen);
+
+/* Simple API */
+void blake2s(void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen);
+
+void blake2s_hmac(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen, const uint8_t *key, size_t keylen);
+
+void blake2s_hkdf(uint8 *dst1, size_t dst1_size,
+                  uint8 *dst2, size_t dst2_size,
+                  uint8 *dst3, size_t dst3_size,
+                  const uint8 *data, size_t data_size,
+                  const uint8 *key, size_t key_size);
+
+#if defined(__cplusplus)
+//}
+#endif
+
+#endif
\ No newline at end of file
diff --git a/crypto/blake2s_sse.cpp b/crypto/blake2s_sse.cpp
new file mode 100644
index 0000000..2527f24
--- /dev/null
+++ b/crypto/blake2s_sse.cpp
@@ -0,0 +1,399 @@
+/*
+  BLAKE2 reference source code package - optimized C implementations
+
+  Copyright 2012, Samuel Neves <sneves@dei.uc.pt>. You may use this under the
+  terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+  your option. The terms of these licenses can be found at:
+
+  - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+  - OpenSSL license : https://www.openssl.org/source/license.html
+  - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
+
+  More information about the BLAKE2 hash function can be found at
+  https://blake2.net.
+*/
+#include "stdafx.h"
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+
+#include "blake2s.h"
+#include "crypto_ops.h"
+
+#include <emmintrin.h>
+#if defined(HAVE_SSSE3)
+#include <tmmintrin.h>
+#endif
+#if defined(HAVE_SSE41)
+#include <smmintrin.h>
+#endif
+#if defined(HAVE_AVX)
+#include <immintrin.h>
+#endif
+#if defined(HAVE_XOP)
+#include <x86intrin.h>
+#endif
+
+#include "blake2s-round.h"
+
+#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L)
+#if defined(_MSC_VER)
+#define BLAKE2_INLINE __inline
+#elif defined(__GNUC__)
+#define BLAKE2_INLINE __inline__
+#else
+#define BLAKE2_INLINE
+#endif
+#else
+#define BLAKE2_INLINE inline
+#endif
+
+static BLAKE2_INLINE uint32_t load32(const void *src) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  uint32_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = (const uint8_t *)src;
+  return ((uint32_t)(p[0]) << 0) |
+         ((uint32_t)(p[1]) << 8) |
+         ((uint32_t)(p[2]) << 16) |
+         ((uint32_t)(p[3]) << 24);
+#endif
+}
+
+static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = (uint8_t *)dst;
+  p[0] = (uint8_t)(w >> 0);
+  p[1] = (uint8_t)(w >> 8);
+  p[2] = (uint8_t)(w >> 16);
+  p[3] = (uint8_t)(w >> 24);
+#endif
+}
+
+static const uint32_t blake2s_IV[8] =
+{
+  0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL,
+  0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL
+};
+
+/* Some helper functions */
+static void blake2s_set_lastnode( blake2s_state *S )
+{
+  S->f[1] = (uint32_t)-1;
+}
+
+static int blake2s_is_lastblock( const blake2s_state *S )
+{
+  return S->f[0] != 0;
+}
+
+static void blake2s_set_lastblock( blake2s_state *S )
+{
+  if( S->last_node ) blake2s_set_lastnode( S );
+
+  S->f[0] = (uint32_t)-1;
+}
+
+static void blake2s_increment_counter( blake2s_state *S, const uint32_t inc )
+{
+  uint64_t t = ( ( uint64_t )S->t[1] << 32 ) | S->t[0];
+  t += inc;
+  S->t[0] = ( uint32_t )( t >> 0 );
+  S->t[1] = ( uint32_t )( t >> 32 );
+}
+
+/* init2 xors IV with input parameter block */
+#if 0
+void blake2s_init_param( blake2s_state *S, const blake2s_param *P )
+{
+  size_t i;
+  /*blake2s_init0( S ); */
+  const uint8_t * v = ( const uint8_t * )( blake2s_IV );
+  const uint8_t * p = ( const uint8_t * )( P );
+  uint8_t * h = ( uint8_t * )( S->h );
+  /* IV XOR ParamBlock */
+  memset( S, 0, sizeof( blake2s_state ) );
+
+  for( i = 0; i < BLAKE2S_OUTBYTES; ++i ) h[i] = v[i] ^ p[i];
+
+  S->outlen = P->digest_length;
+}
+
+/* Some sort of default parameter block initialization, for sequential blake2s */
+void blake2s_init( blake2s_state *S, size_t outlen )
+{
+  blake2s_param P[1];
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+
+  P->digest_length = (uint8_t)outlen;
+  P->key_length = 0;
+  P->fanout = 1;
+  P->depth = 1;
+  store32( &P->leaf_length, 0 );
+  store32( &P->node_offset, 0 );
+  store16( &P->xof_length, 0 );
+  P->node_depth = 0;
+  P->inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  memset( P->salt, 0, sizeof( P->salt ) );
+  memset( P->personal, 0, sizeof( P->personal ) );
+
+  blake2s_init_param( S, P );
+}
+
+int blake2s_init_key( blake2s_state *S, size_t outlen, const void *key, size_t keylen )
+{
+  blake2s_param P[1];
+
+  /* Move interval verification here? */
+  if ( ( !outlen ) || ( outlen > BLAKE2S_OUTBYTES ) ) return -1;
+
+  if ( ( !key ) || ( !keylen ) || keylen > BLAKE2S_KEYBYTES ) return -1;
+
+  P->digest_length = (uint8_t)outlen;
+  P->key_length = (uint8_t)keylen;
+  P->fanout = 1;
+  P->depth = 1;
+  store32( &P->leaf_length, 0 );
+  store32( &P->node_offset, 0 );
+  store16( &P->xof_length, 0 );
+  P->node_depth = 0;
+  P->inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  memset( P->salt, 0, sizeof( P->salt ) );
+  memset( P->personal, 0, sizeof( P->personal ) );
+
+  if( blake2s_init_param( S, P ) < 0 )
+    return -1;
+
+  {
+    uint8_t block[BLAKE2S_BLOCKBYTES];
+    memset( block, 0, BLAKE2S_BLOCKBYTES );
+    memcpy( block, key, keylen );
+    blake2s_update( S, block, BLAKE2S_BLOCKBYTES );
+    memzero_crypto( block, BLAKE2S_BLOCKBYTES ); /* Burn the key from stack */
+  }
+  return 0;
+}
+#endif
+
+void blake2s_compress_sse( blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES] )
+{
+  __m128i row1, row2, row3, row4;
+  __m128i buf1, buf2, buf3, buf4;
+#if defined(HAVE_SSE41)
+  __m128i t0, t1;
+#if !defined(HAVE_XOP)
+  __m128i t2;
+#endif
+#endif
+  __m128i ff0, ff1;
+#if defined(HAVE_SSSE3) && !defined(HAVE_XOP)
+  const __m128i r8 = _mm_set_epi8( 12, 15, 14, 13, 8, 11, 10, 9, 4, 7, 6, 5, 0, 3, 2, 1 );
+  const __m128i r16 = _mm_set_epi8( 13, 12, 15, 14, 9, 8, 11, 10, 5, 4, 7, 6, 1, 0, 3, 2 );
+#endif
+#if defined(HAVE_SSE41)
+  const __m128i m0 = LOADU( block + 00 );
+  const __m128i m1 = LOADU( block + 16 );
+  const __m128i m2 = LOADU( block + 32 );
+  const __m128i m3 = LOADU( block + 48 );
+#else
+  const uint32_t m0 = load32(block + 0 * sizeof(uint32_t));
+  const uint32_t m1 = load32(block + 1 * sizeof(uint32_t));
+  const uint32_t m2 = load32(block + 2 * sizeof(uint32_t));
+  const uint32_t m3 = load32(block + 3 * sizeof(uint32_t));
+  const uint32_t m4 = load32(block + 4 * sizeof(uint32_t));
+  const uint32_t m5 = load32(block + 5 * sizeof(uint32_t));
+  const uint32_t m6 = load32(block + 6 * sizeof(uint32_t));
+  const uint32_t m7 = load32(block + 7 * sizeof(uint32_t));
+  const uint32_t m8 = load32(block + 8 * sizeof(uint32_t));
+  const uint32_t m9 = load32(block + 9 * sizeof(uint32_t));
+  const uint32_t m10 = load32(block + 10 * sizeof(uint32_t));
+  const uint32_t m11 = load32(block + 11 * sizeof(uint32_t));
+  const uint32_t m12 = load32(block + 12 * sizeof(uint32_t));
+  const uint32_t m13 = load32(block + 13 * sizeof(uint32_t));
+  const uint32_t m14 = load32(block + 14 * sizeof(uint32_t));
+  const uint32_t m15 = load32(block + 15 * sizeof(uint32_t));
+#endif
+  row1 = ff0 = LOADU( &S->h[0] );
+  row2 = ff1 = LOADU( &S->h[4] );
+  row3 = _mm_loadu_si128( (__m128i const *)&blake2s_IV[0] );
+  row4 = _mm_xor_si128( _mm_loadu_si128( (__m128i const *)&blake2s_IV[4] ), LOADU( &S->t[0] ) );
+  ROUND( 0 );
+  ROUND( 1 );
+  ROUND( 2 );
+  ROUND( 3 );
+  ROUND( 4 );
+  ROUND( 5 );
+  ROUND( 6 );
+  ROUND( 7 );
+  ROUND( 8 );
+  ROUND( 9 );
+  STOREU( &S->h[0], _mm_xor_si128( ff0, _mm_xor_si128( row1, row3 ) ) );
+  STOREU( &S->h[4], _mm_xor_si128( ff1, _mm_xor_si128( row2, row4 ) ) );
+}
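One subtlety in blake2s_compress_sse: the single LOADU( &S->t[0] ) picks up t[0], t[1], f[0] and f[1] in one 16-byte load, which works only because those four words sit adjacent in blake2s_state; XORing them against IV[4..7] builds the whole bottom row of the state at once. The scalar equivalent, as it appears in blake2s.cpp (shown here only for comparison):

  uint32_t v12 = S->t[0] ^ blake2s_IV[4];
  uint32_t v13 = S->t[1] ^ blake2s_IV[5];
  uint32_t v14 = S->f[0] ^ blake2s_IV[6];
  uint32_t v15 = S->f[1] ^ blake2s_IV[7];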
+
+#if 0
+int blake2s_update( blake2s_state *S, const void *pin, size_t inlen )
+{
+  const unsigned char * in = (const unsigned char *)pin;
+  if( inlen > 0 )
+  {
+    size_t left = S->buflen;
+    size_t fill = BLAKE2S_BLOCKBYTES - left;
+    if( inlen > fill )
+    {
+      S->buflen = 0;
+      memcpy( S->buf + left, in, fill ); /* Fill buffer */
+      blake2s_increment_counter( S, BLAKE2S_BLOCKBYTES );
+      blake2s_compress( S, S->buf ); /* Compress */
+      in += fill; inlen -= fill;
+      while(inlen > BLAKE2S_BLOCKBYTES) {
+        blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+        blake2s_compress( S, in );
+        in += BLAKE2S_BLOCKBYTES;
+        inlen -= BLAKE2S_BLOCKBYTES;
+      }
+    }
+    memcpy( S->buf + S->buflen, in, inlen );
+    S->buflen += inlen;
+  }
+  return 0;
+}
+
+int blake2s_final( blake2s_state *S, void *out, size_t outlen )
+{
+  uint8_t buffer[BLAKE2S_OUTBYTES] = {0};
+  size_t i;
+
+  if( out == NULL || outlen < S->outlen )
+    return -1;
+
+  if( blake2s_is_lastblock( S ) )
+    return -1;
+
+  blake2s_increment_counter( S, (uint32_t)S->buflen );
+  blake2s_set_lastblock( S );
+  memset( S->buf + S->buflen, 0, BLAKE2S_BLOCKBYTES - S->buflen ); /* Padding */
+  blake2s_compress( S, S->buf );
+
+  for( i = 0; i < 8; ++i ) /* Output full hash to temp buffer */
+    store32( buffer + sizeof( S->h[i] ) * i, S->h[i] );
+
+  memcpy( out, buffer, S->outlen );
+  memzero_crypto( buffer, sizeof(buffer) );
+  return 0;
+}
+
+/* inlen, at least, should be uint64_t. Others can be size_t. */
+int blake2s( void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen )
+{
+  blake2s_state S[1];
+
+  /* Verify parameters */
+  if ( NULL == in && inlen > 0 ) return -1;
+
+  if ( NULL == out ) return -1;
+
+  if ( NULL == key && keylen > 0) return -1;
+
+  if( !outlen || outlen > BLAKE2S_OUTBYTES ) return -1;
+
+  if( keylen > BLAKE2S_KEYBYTES ) return -1;
+
+  if( keylen > 0 )
+  {
+    if( blake2s_init_key( S, outlen, key, keylen ) < 0 ) return -1;
+  }
+  else
+  {
+    if( blake2s_init( S, outlen ) < 0 ) return -1;
+  }
+
+  blake2s_update( S, ( const uint8_t * )in, inlen );
+  blake2s_final( S, out, outlen );
+  return 0;
+}
+#endif
+
+#if defined(SUPERCOP)
+int crypto_hash( unsigned char *out, unsigned char *in, unsigned long long inlen )
+{
+  blake2s( out, BLAKE2S_OUTBYTES, in, inlen, NULL, 0 );
+  return 0;
+}
+#endif
+
+#if defined(BLAKE2S_SELFTEST)
+#include <string.h>
+#include "blake2-kat.h"
+int main( void )
+{
+  uint8_t key[BLAKE2S_KEYBYTES];
+  uint8_t buf[BLAKE2_KAT_LENGTH];
+  size_t i, step;
+
+  for( i = 0; i < BLAKE2S_KEYBYTES; ++i )
+    key[i] = ( uint8_t )i;
+
+  for( i = 0; i < BLAKE2_KAT_LENGTH; ++i )
+    buf[i] = ( uint8_t )i;
+
+  /* Test simple API */
+  for( i = 0; i < BLAKE2_KAT_LENGTH; ++i )
+  {
+    uint8_t hash[BLAKE2S_OUTBYTES];
+    blake2s( hash, BLAKE2S_OUTBYTES, buf, i, key, BLAKE2S_KEYBYTES );
+
+    if( 0 != memcmp( hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES ) )
+    {
+      goto fail;
+    }
+  }
+
+  /* Test streaming API. The streaming functions return void in this port,
+     so the reference code's error checks are dropped. */
+  for(step = 1; step < BLAKE2S_BLOCKBYTES; ++step) {
+    for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+      uint8_t hash[BLAKE2S_OUTBYTES];
+      blake2s_state S;
+      uint8_t * p = buf;
+      size_t mlen = i;
+
+      blake2s_init_key(&S, BLAKE2S_OUTBYTES, key, BLAKE2S_KEYBYTES);
+
+      while (mlen >= step) {
+        blake2s_update(&S, p, step);
+        mlen -= step;
+        p += step;
+      }
+      blake2s_update(&S, p, mlen);
+      blake2s_final(&S, hash, BLAKE2S_OUTBYTES);
+
+      if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+        goto fail;
+      }
+    }
+  }
+
+  puts( "ok" );
+  return 0;
+fail:
+  puts("error");
+  return -1;
+}
+#endif
diff --git a/crypto/chacha20_x64.asm b/crypto/chacha20_x64.asm
new file mode 100644
index 0000000..3b4b4db
--- /dev/null
+++ b/crypto/chacha20_x64.asm
@@ -0,0 +1,3011 @@
+default rel
+%define XMMWORD
+%define YMMWORD
+%define ZMMWORD
+section .text code align=64
+
+
+ALIGN 64
+$L$zero:
+  DD 0,0,0,0
+$L$one:
+  DD 1,0,0,0
+$L$inc:
+  DD 0,1,2,3
+$L$four:
+  DD 4,4,4,4
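; Annotation (editorial sketch, not emitted by the perlasm generator): the
; tables in this block drive the vector ChaCha20 paths. $L$rot16 and $L$rot24
; are pshufb/vpshufb masks that realize the quarter-round's rotate-left-16
; and rotate-left-8 steps as byte shuffles: per 4-byte little-endian lane,
; left-rotate by 16 is the byte order 2,3,0,1 and left-rotate by 8 is
; 3,0,1,2, matching the DB patterns below. $L$inc, $L$incy and $L$incz hold
; per-lane block-counter offsets for the 4-, 8- and 16-way interleaved
; loops, and $L$sigma is the ChaCha20 constant "expand 32-byte k" as bytes.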
+$L$incy: + DD 0,2,4,6,1,3,5,7 +$L$eight: + DD 8,8,8,8,8,8,8,8 +$L$rot16: +DB 0x2,0x3,0x0,0x1,0x6,0x7,0x4,0x5,0xa,0xb,0x8,0x9,0xe,0xf,0xc,0xd +$L$rot24: +DB 0x3,0x0,0x1,0x2,0x7,0x4,0x5,0x6,0xb,0x8,0x9,0xa,0xf,0xc,0xd,0xe +$L$sigma: +DB 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107 +DB 0 +ALIGN 64 +$L$zeroz: + DD 0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0 +$L$fourz: + DD 4,0,0,0,4,0,0,0,4,0,0,0,4,0,0,0 +$L$incz: + DD 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +$L$sixteen: + DD 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +ALIGN 64 +$L$twoy: + DD 2,0,0,0,2,0,0,0 + +global hchacha20_ssse3 + +ALIGN 32 +hchacha20_ssse3: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_hchacha20_ssse3: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$hchacha20_ssse3: + movdqa xmm0,XMMWORD[$L$sigma] + movdqu xmm1,XMMWORD[rdx] + movdqu xmm2,XMMWORD[16+rdx] + movdqu xmm3,XMMWORD[rsi] + movdqa xmm6,XMMWORD[$L$rot16] + movdqa xmm7,XMMWORD[$L$rot24] + mov r8,10 +ALIGN 32 +$L$oop_hssse3: + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,57 + pshufd xmm3,xmm3,147 + nop + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,147 + pshufd xmm3,xmm3,57 + dec r8 + jnz NEAR $L$oop_hssse3 + movdqu XMMWORD[rdi],xmm0 + movdqu XMMWORD[16+rdi],xmm3 + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_hchacha20_ssse3: +global chacha20_ssse3 + +ALIGN 32 +chacha20_ssse3: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_ssse3: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_ssse3: + mov r9,rsp + + cmp rdx,128 + ja NEAR $L$chacha20_4x + +$L$do_sse3_after_all: + sub rsp,64+40 + movaps XMMWORD[(-40)+r9],xmm6 + movaps XMMWORD[(-24)+r9],xmm7 +$L$ssse3_body: + movdqa xmm0,XMMWORD[$L$sigma] + movdqu xmm1,XMMWORD[rcx] + movdqu xmm2,XMMWORD[16+rcx] + movdqu xmm3,XMMWORD[r8] + movdqa xmm6,XMMWORD[$L$rot16] + movdqa xmm7,XMMWORD[$L$rot24] + + movdqa XMMWORD[rsp],xmm0 + movdqa XMMWORD[16+rsp],xmm1 + movdqa XMMWORD[32+rsp],xmm2 + movdqa XMMWORD[48+rsp],xmm3 + mov r8,10 + jmp NEAR $L$oop_ssse3 + +ALIGN 32 +$L$oop_outer_ssse3: + movdqa xmm3,XMMWORD[$L$one] + movdqa xmm0,XMMWORD[rsp] + movdqa xmm1,XMMWORD[16+rsp] + movdqa xmm2,XMMWORD[32+rsp] + paddd xmm3,XMMWORD[48+rsp] + mov r8,10 + movdqa XMMWORD[48+rsp],xmm3 + jmp NEAR $L$oop_ssse3 + +ALIGN 32 +$L$oop_ssse3: + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,57 + pshufd xmm3,xmm3,147 + nop + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa 
xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,147 + pshufd xmm3,xmm3,57 + dec r8 + jnz NEAR $L$oop_ssse3 + paddd xmm0,XMMWORD[rsp] + paddd xmm1,XMMWORD[16+rsp] + paddd xmm2,XMMWORD[32+rsp] + paddd xmm3,XMMWORD[48+rsp] + + cmp rdx,64 + jb NEAR $L$tail_ssse3 + + movdqu xmm4,XMMWORD[rsi] + movdqu xmm5,XMMWORD[16+rsi] + pxor xmm0,xmm4 + movdqu xmm4,XMMWORD[32+rsi] + pxor xmm1,xmm5 + movdqu xmm5,XMMWORD[48+rsi] + lea rsi,[64+rsi] + pxor xmm2,xmm4 + pxor xmm3,xmm5 + + movdqu XMMWORD[rdi],xmm0 + movdqu XMMWORD[16+rdi],xmm1 + movdqu XMMWORD[32+rdi],xmm2 + movdqu XMMWORD[48+rdi],xmm3 + lea rdi,[64+rdi] + + sub rdx,64 + jnz NEAR $L$oop_outer_ssse3 + + jmp NEAR $L$done_ssse3 + +ALIGN 16 +$L$tail_ssse3: + movdqa XMMWORD[rsp],xmm0 + movdqa XMMWORD[16+rsp],xmm1 + movdqa XMMWORD[32+rsp],xmm2 + movdqa XMMWORD[48+rsp],xmm3 + xor r8,r8 + +$L$oop_tail_ssse3: + movzx eax,BYTE[r8*1+rsi] + movzx ecx,BYTE[r8*1+rsp] + lea r8,[1+r8] + xor eax,ecx + mov BYTE[((-1))+r8*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail_ssse3 + +$L$done_ssse3: + movaps xmm6,XMMWORD[((-40))+r9] + movaps xmm7,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$ssse3_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_ssse3: +global chacha20_4x + +ALIGN 32 +chacha20_4x: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_4x: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_4x: + mov r9,rsp + + + + + + + + + + + + +$L$proceed4x: + sub rsp,0x140+168 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$4x_body: + movdqa xmm11,XMMWORD[$L$sigma] + movdqu xmm15,XMMWORD[rcx] + movdqu xmm7,XMMWORD[16+rcx] + movdqu xmm3,XMMWORD[r8] + lea rcx,[256+rsp] + lea r10,[$L$rot16] + lea r11,[$L$rot24] + + pshufd xmm8,xmm11,0x00 + pshufd xmm9,xmm11,0x55 + movdqa XMMWORD[64+rsp],xmm8 + pshufd xmm10,xmm11,0xaa + movdqa XMMWORD[80+rsp],xmm9 + pshufd xmm11,xmm11,0xff + movdqa XMMWORD[96+rsp],xmm10 + movdqa XMMWORD[112+rsp],xmm11 + + pshufd xmm12,xmm15,0x00 + pshufd xmm13,xmm15,0x55 + movdqa XMMWORD[(128-256)+rcx],xmm12 + pshufd xmm14,xmm15,0xaa + movdqa XMMWORD[(144-256)+rcx],xmm13 + pshufd xmm15,xmm15,0xff + movdqa XMMWORD[(160-256)+rcx],xmm14 + movdqa XMMWORD[(176-256)+rcx],xmm15 + + pshufd xmm4,xmm7,0x00 + pshufd xmm5,xmm7,0x55 + movdqa XMMWORD[(192-256)+rcx],xmm4 + pshufd xmm6,xmm7,0xaa + movdqa XMMWORD[(208-256)+rcx],xmm5 + pshufd xmm7,xmm7,0xff + movdqa XMMWORD[(224-256)+rcx],xmm6 + movdqa XMMWORD[(240-256)+rcx],xmm7 + + pshufd xmm0,xmm3,0x00 + pshufd xmm1,xmm3,0x55 + paddd xmm0,XMMWORD[$L$inc] + pshufd xmm2,xmm3,0xaa + movdqa XMMWORD[(272-256)+rcx],xmm1 + pshufd xmm3,xmm3,0xff + movdqa XMMWORD[(288-256)+rcx],xmm2 + movdqa XMMWORD[(304-256)+rcx],xmm3 + + jmp NEAR $L$oop_enter4x + +ALIGN 32 +$L$oop_outer4x: + movdqa xmm8,XMMWORD[64+rsp] + movdqa xmm9,XMMWORD[80+rsp] + movdqa xmm10,XMMWORD[96+rsp] + movdqa xmm11,XMMWORD[112+rsp] + movdqa xmm12,XMMWORD[((128-256))+rcx] + movdqa xmm13,XMMWORD[((144-256))+rcx] + movdqa 
xmm14,XMMWORD[((160-256))+rcx] + movdqa xmm15,XMMWORD[((176-256))+rcx] + movdqa xmm4,XMMWORD[((192-256))+rcx] + movdqa xmm5,XMMWORD[((208-256))+rcx] + movdqa xmm6,XMMWORD[((224-256))+rcx] + movdqa xmm7,XMMWORD[((240-256))+rcx] + movdqa xmm0,XMMWORD[((256-256))+rcx] + movdqa xmm1,XMMWORD[((272-256))+rcx] + movdqa xmm2,XMMWORD[((288-256))+rcx] + movdqa xmm3,XMMWORD[((304-256))+rcx] + paddd xmm0,XMMWORD[$L$four] + +$L$oop_enter4x: + movdqa XMMWORD[32+rsp],xmm6 + movdqa XMMWORD[48+rsp],xmm7 + movdqa xmm7,XMMWORD[r10] + mov eax,10 + movdqa XMMWORD[(256-256)+rcx],xmm0 + jmp NEAR $L$oop4x + +ALIGN 32 +$L$oop4x: + paddd xmm8,xmm12 + paddd xmm9,xmm13 + pxor xmm0,xmm8 + pxor xmm1,xmm9 + pshufb xmm0,xmm7 + pshufb xmm1,xmm7 + paddd xmm4,xmm0 + paddd xmm5,xmm1 + pxor xmm12,xmm4 + pxor xmm13,xmm5 + movdqa xmm6,xmm12 + pslld xmm12,12 + psrld xmm6,20 + movdqa xmm7,xmm13 + pslld xmm13,12 + por xmm12,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm13,xmm7 + paddd xmm8,xmm12 + paddd xmm9,xmm13 + pxor xmm0,xmm8 + pxor xmm1,xmm9 + pshufb xmm0,xmm6 + pshufb xmm1,xmm6 + paddd xmm4,xmm0 + paddd xmm5,xmm1 + pxor xmm12,xmm4 + pxor xmm13,xmm5 + movdqa xmm7,xmm12 + pslld xmm12,7 + psrld xmm7,25 + movdqa xmm6,xmm13 + pslld xmm13,7 + por xmm12,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm13,xmm6 + movdqa XMMWORD[rsp],xmm4 + movdqa XMMWORD[16+rsp],xmm5 + movdqa xmm4,XMMWORD[32+rsp] + movdqa xmm5,XMMWORD[48+rsp] + paddd xmm10,xmm14 + paddd xmm11,xmm15 + pxor xmm2,xmm10 + pxor xmm3,xmm11 + pshufb xmm2,xmm7 + pshufb xmm3,xmm7 + paddd xmm4,xmm2 + paddd xmm5,xmm3 + pxor xmm14,xmm4 + pxor xmm15,xmm5 + movdqa xmm6,xmm14 + pslld xmm14,12 + psrld xmm6,20 + movdqa xmm7,xmm15 + pslld xmm15,12 + por xmm14,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm15,xmm7 + paddd xmm10,xmm14 + paddd xmm11,xmm15 + pxor xmm2,xmm10 + pxor xmm3,xmm11 + pshufb xmm2,xmm6 + pshufb xmm3,xmm6 + paddd xmm4,xmm2 + paddd xmm5,xmm3 + pxor xmm14,xmm4 + pxor xmm15,xmm5 + movdqa xmm7,xmm14 + pslld xmm14,7 + psrld xmm7,25 + movdqa xmm6,xmm15 + pslld xmm15,7 + por xmm14,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm15,xmm6 + paddd xmm8,xmm13 + paddd xmm9,xmm14 + pxor xmm3,xmm8 + pxor xmm0,xmm9 + pshufb xmm3,xmm7 + pshufb xmm0,xmm7 + paddd xmm4,xmm3 + paddd xmm5,xmm0 + pxor xmm13,xmm4 + pxor xmm14,xmm5 + movdqa xmm6,xmm13 + pslld xmm13,12 + psrld xmm6,20 + movdqa xmm7,xmm14 + pslld xmm14,12 + por xmm13,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm14,xmm7 + paddd xmm8,xmm13 + paddd xmm9,xmm14 + pxor xmm3,xmm8 + pxor xmm0,xmm9 + pshufb xmm3,xmm6 + pshufb xmm0,xmm6 + paddd xmm4,xmm3 + paddd xmm5,xmm0 + pxor xmm13,xmm4 + pxor xmm14,xmm5 + movdqa xmm7,xmm13 + pslld xmm13,7 + psrld xmm7,25 + movdqa xmm6,xmm14 + pslld xmm14,7 + por xmm13,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm14,xmm6 + movdqa XMMWORD[32+rsp],xmm4 + movdqa XMMWORD[48+rsp],xmm5 + movdqa xmm4,XMMWORD[rsp] + movdqa xmm5,XMMWORD[16+rsp] + paddd xmm10,xmm15 + paddd xmm11,xmm12 + pxor xmm1,xmm10 + pxor xmm2,xmm11 + pshufb xmm1,xmm7 + pshufb xmm2,xmm7 + paddd xmm4,xmm1 + paddd xmm5,xmm2 + pxor xmm15,xmm4 + pxor xmm12,xmm5 + movdqa xmm6,xmm15 + pslld xmm15,12 + psrld xmm6,20 + movdqa xmm7,xmm12 + pslld xmm12,12 + por xmm15,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm12,xmm7 + paddd xmm10,xmm15 + paddd xmm11,xmm12 + pxor xmm1,xmm10 + pxor xmm2,xmm11 + pshufb xmm1,xmm6 + pshufb xmm2,xmm6 + paddd xmm4,xmm1 + paddd xmm5,xmm2 + pxor xmm15,xmm4 + pxor xmm12,xmm5 + movdqa xmm7,xmm15 + pslld xmm15,7 + psrld xmm7,25 + movdqa xmm6,xmm12 + 
pslld xmm12,7 + por xmm15,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm12,xmm6 + dec eax + jnz NEAR $L$oop4x + + paddd xmm8,XMMWORD[64+rsp] + paddd xmm9,XMMWORD[80+rsp] + paddd xmm10,XMMWORD[96+rsp] + paddd xmm11,XMMWORD[112+rsp] + + movdqa xmm6,xmm8 + punpckldq xmm8,xmm9 + movdqa xmm7,xmm10 + punpckldq xmm10,xmm11 + punpckhdq xmm6,xmm9 + punpckhdq xmm7,xmm11 + movdqa xmm9,xmm8 + punpcklqdq xmm8,xmm10 + movdqa xmm11,xmm6 + punpcklqdq xmm6,xmm7 + punpckhqdq xmm9,xmm10 + punpckhqdq xmm11,xmm7 + paddd xmm12,XMMWORD[((128-256))+rcx] + paddd xmm13,XMMWORD[((144-256))+rcx] + paddd xmm14,XMMWORD[((160-256))+rcx] + paddd xmm15,XMMWORD[((176-256))+rcx] + + movdqa XMMWORD[rsp],xmm8 + movdqa XMMWORD[16+rsp],xmm9 + movdqa xmm8,XMMWORD[32+rsp] + movdqa xmm9,XMMWORD[48+rsp] + + movdqa xmm10,xmm12 + punpckldq xmm12,xmm13 + movdqa xmm7,xmm14 + punpckldq xmm14,xmm15 + punpckhdq xmm10,xmm13 + punpckhdq xmm7,xmm15 + movdqa xmm13,xmm12 + punpcklqdq xmm12,xmm14 + movdqa xmm15,xmm10 + punpcklqdq xmm10,xmm7 + punpckhqdq xmm13,xmm14 + punpckhqdq xmm15,xmm7 + paddd xmm4,XMMWORD[((192-256))+rcx] + paddd xmm5,XMMWORD[((208-256))+rcx] + paddd xmm8,XMMWORD[((224-256))+rcx] + paddd xmm9,XMMWORD[((240-256))+rcx] + + movdqa XMMWORD[32+rsp],xmm6 + movdqa XMMWORD[48+rsp],xmm11 + + movdqa xmm14,xmm4 + punpckldq xmm4,xmm5 + movdqa xmm7,xmm8 + punpckldq xmm8,xmm9 + punpckhdq xmm14,xmm5 + punpckhdq xmm7,xmm9 + movdqa xmm5,xmm4 + punpcklqdq xmm4,xmm8 + movdqa xmm9,xmm14 + punpcklqdq xmm14,xmm7 + punpckhqdq xmm5,xmm8 + punpckhqdq xmm9,xmm7 + paddd xmm0,XMMWORD[((256-256))+rcx] + paddd xmm1,XMMWORD[((272-256))+rcx] + paddd xmm2,XMMWORD[((288-256))+rcx] + paddd xmm3,XMMWORD[((304-256))+rcx] + + movdqa xmm8,xmm0 + punpckldq xmm0,xmm1 + movdqa xmm7,xmm2 + punpckldq xmm2,xmm3 + punpckhdq xmm8,xmm1 + punpckhdq xmm7,xmm3 + movdqa xmm1,xmm0 + punpcklqdq xmm0,xmm2 + movdqa xmm3,xmm8 + punpcklqdq xmm8,xmm7 + punpckhqdq xmm1,xmm2 + punpckhqdq xmm3,xmm7 + cmp rdx,64*4 + jb NEAR $L$tail4x + + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + lea rsi,[128+rsi] + pxor xmm6,XMMWORD[16+rsp] + pxor xmm11,xmm13 + pxor xmm2,xmm5 + pxor xmm7,xmm1 + + movdqu XMMWORD[64+rdi],xmm6 + movdqu xmm6,XMMWORD[rsi] + movdqu XMMWORD[80+rdi],xmm11 + movdqu xmm11,XMMWORD[16+rsi] + movdqu XMMWORD[96+rdi],xmm2 + movdqu xmm2,XMMWORD[32+rsi] + movdqu XMMWORD[112+rdi],xmm7 + lea rdi,[128+rdi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[32+rsp] + pxor xmm11,xmm10 + pxor xmm2,xmm14 + pxor xmm7,xmm8 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + lea rsi,[128+rsi] + pxor xmm6,XMMWORD[48+rsp] + pxor xmm11,xmm15 + pxor xmm2,xmm9 + pxor xmm7,xmm3 + movdqu XMMWORD[64+rdi],xmm6 + movdqu XMMWORD[80+rdi],xmm11 + movdqu XMMWORD[96+rdi],xmm2 + movdqu XMMWORD[112+rdi],xmm7 + lea rdi,[128+rdi] + + sub rdx,64*4 + jnz NEAR $L$oop_outer4x + + jmp NEAR $L$done4x + +$L$tail4x: + cmp rdx,192 + jae NEAR $L$192_or_more4x + cmp rdx,128 + jae NEAR $L$128_or_more4x + cmp rdx,64 + jae NEAR 
$L$64_or_more4x + + + xor r10,r10 + + movdqa XMMWORD[16+rsp],xmm12 + movdqa XMMWORD[32+rsp],xmm4 + movdqa XMMWORD[48+rsp],xmm0 + jmp NEAR $L$oop_tail4x + +ALIGN 32 +$L$64_or_more4x: + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + movdqu XMMWORD[rdi],xmm6 + movdqu XMMWORD[16+rdi],xmm11 + movdqu XMMWORD[32+rdi],xmm2 + movdqu XMMWORD[48+rdi],xmm7 + je NEAR $L$done4x + + movdqa xmm6,XMMWORD[16+rsp] + lea rsi,[64+rsi] + xor r10,r10 + movdqa XMMWORD[rsp],xmm6 + movdqa XMMWORD[16+rsp],xmm13 + lea rdi,[64+rdi] + movdqa XMMWORD[32+rsp],xmm5 + sub rdx,64 + movdqa XMMWORD[48+rsp],xmm1 + jmp NEAR $L$oop_tail4x + +ALIGN 32 +$L$128_or_more4x: + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + pxor xmm6,XMMWORD[16+rsp] + pxor xmm11,xmm13 + pxor xmm2,xmm5 + pxor xmm7,xmm1 + movdqu XMMWORD[64+rdi],xmm6 + movdqu XMMWORD[80+rdi],xmm11 + movdqu XMMWORD[96+rdi],xmm2 + movdqu XMMWORD[112+rdi],xmm7 + je NEAR $L$done4x + + movdqa xmm6,XMMWORD[32+rsp] + lea rsi,[128+rsi] + xor r10,r10 + movdqa XMMWORD[rsp],xmm6 + movdqa XMMWORD[16+rsp],xmm10 + lea rdi,[128+rdi] + movdqa XMMWORD[32+rsp],xmm14 + sub rdx,128 + movdqa XMMWORD[48+rsp],xmm8 + jmp NEAR $L$oop_tail4x + +ALIGN 32 +$L$192_or_more4x: + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + lea rsi,[128+rsi] + pxor xmm6,XMMWORD[16+rsp] + pxor xmm11,xmm13 + pxor xmm2,xmm5 + pxor xmm7,xmm1 + + movdqu XMMWORD[64+rdi],xmm6 + movdqu xmm6,XMMWORD[rsi] + movdqu XMMWORD[80+rdi],xmm11 + movdqu xmm11,XMMWORD[16+rsi] + movdqu XMMWORD[96+rdi],xmm2 + movdqu xmm2,XMMWORD[32+rsi] + movdqu XMMWORD[112+rdi],xmm7 + lea rdi,[128+rdi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[32+rsp] + pxor xmm11,xmm10 + pxor xmm2,xmm14 + pxor xmm7,xmm8 + movdqu XMMWORD[rdi],xmm6 + movdqu XMMWORD[16+rdi],xmm11 + movdqu XMMWORD[32+rdi],xmm2 + movdqu XMMWORD[48+rdi],xmm7 + je NEAR $L$done4x + + movdqa xmm6,XMMWORD[48+rsp] + lea rsi,[64+rsi] + xor r10,r10 + movdqa XMMWORD[rsp],xmm6 + movdqa XMMWORD[16+rsp],xmm15 + lea rdi,[64+rdi] + movdqa XMMWORD[32+rsp],xmm9 + sub rdx,192 + movdqa XMMWORD[48+rsp],xmm3 + +$L$oop_tail4x: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail4x + +$L$done4x: + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$4x_epilogue: + mov rdi,QWORD[8+rsp] 
;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_4x: +global chacha20_avx2 + +ALIGN 32 +chacha20_avx2: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_avx2: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_avx2: + mov r9,rsp + + sub rsp,0x280+168 + and rsp,-32 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$8x_body: + vzeroupper + + vbroadcasti128 ymm11,XMMWORD[$L$sigma] + vbroadcasti128 ymm3,XMMWORD[rcx] + vbroadcasti128 ymm15,XMMWORD[16+rcx] + vbroadcasti128 ymm7,XMMWORD[r8] + lea rcx,[256+rsp] + lea rax,[512+rsp] + lea r10,[$L$rot16] + lea r11,[$L$rot24] + + vpshufd ymm8,ymm11,0x00 + vpshufd ymm9,ymm11,0x55 + vmovdqa YMMWORD[(128-256)+rcx],ymm8 + vpshufd ymm10,ymm11,0xaa + vmovdqa YMMWORD[(160-256)+rcx],ymm9 + vpshufd ymm11,ymm11,0xff + vmovdqa YMMWORD[(192-256)+rcx],ymm10 + vmovdqa YMMWORD[(224-256)+rcx],ymm11 + + vpshufd ymm0,ymm3,0x00 + vpshufd ymm1,ymm3,0x55 + vmovdqa YMMWORD[(256-256)+rcx],ymm0 + vpshufd ymm2,ymm3,0xaa + vmovdqa YMMWORD[(288-256)+rcx],ymm1 + vpshufd ymm3,ymm3,0xff + vmovdqa YMMWORD[(320-256)+rcx],ymm2 + vmovdqa YMMWORD[(352-256)+rcx],ymm3 + + vpshufd ymm12,ymm15,0x00 + vpshufd ymm13,ymm15,0x55 + vmovdqa YMMWORD[(384-512)+rax],ymm12 + vpshufd ymm14,ymm15,0xaa + vmovdqa YMMWORD[(416-512)+rax],ymm13 + vpshufd ymm15,ymm15,0xff + vmovdqa YMMWORD[(448-512)+rax],ymm14 + vmovdqa YMMWORD[(480-512)+rax],ymm15 + + vpshufd ymm4,ymm7,0x00 + vpshufd ymm5,ymm7,0x55 + vpaddd ymm4,ymm4,YMMWORD[$L$incy] + vpshufd ymm6,ymm7,0xaa + vmovdqa YMMWORD[(544-512)+rax],ymm5 + vpshufd ymm7,ymm7,0xff + vmovdqa YMMWORD[(576-512)+rax],ymm6 + vmovdqa YMMWORD[(608-512)+rax],ymm7 + + jmp NEAR $L$oop_enter8x + +ALIGN 32 +$L$oop_outer8x: + vmovdqa ymm8,YMMWORD[((128-256))+rcx] + vmovdqa ymm9,YMMWORD[((160-256))+rcx] + vmovdqa ymm10,YMMWORD[((192-256))+rcx] + vmovdqa ymm11,YMMWORD[((224-256))+rcx] + vmovdqa ymm0,YMMWORD[((256-256))+rcx] + vmovdqa ymm1,YMMWORD[((288-256))+rcx] + vmovdqa ymm2,YMMWORD[((320-256))+rcx] + vmovdqa ymm3,YMMWORD[((352-256))+rcx] + vmovdqa ymm12,YMMWORD[((384-512))+rax] + vmovdqa ymm13,YMMWORD[((416-512))+rax] + vmovdqa ymm14,YMMWORD[((448-512))+rax] + vmovdqa ymm15,YMMWORD[((480-512))+rax] + vmovdqa ymm4,YMMWORD[((512-512))+rax] + vmovdqa ymm5,YMMWORD[((544-512))+rax] + vmovdqa ymm6,YMMWORD[((576-512))+rax] + vmovdqa ymm7,YMMWORD[((608-512))+rax] + vpaddd ymm4,ymm4,YMMWORD[$L$eight] + +$L$oop_enter8x: + vmovdqa YMMWORD[64+rsp],ymm14 + vmovdqa YMMWORD[96+rsp],ymm15 + vbroadcasti128 ymm15,XMMWORD[r10] + vmovdqa YMMWORD[(512-512)+rax],ymm4 + mov eax,10 + jmp NEAR $L$oop8x + +ALIGN 32 +$L$oop8x: + vpaddd ymm8,ymm8,ymm0 + vpxor ymm4,ymm8,ymm4 + vpshufb ymm4,ymm4,ymm15 + vpaddd ymm9,ymm9,ymm1 + vpxor ymm5,ymm9,ymm5 + vpshufb ymm5,ymm5,ymm15 + vpaddd ymm12,ymm12,ymm4 + vpxor ymm0,ymm12,ymm0 + vpslld ymm14,ymm0,12 + vpsrld ymm0,ymm0,20 + vpor ymm0,ymm14,ymm0 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm5 + vpxor ymm1,ymm13,ymm1 + vpslld ymm15,ymm1,12 + vpsrld ymm1,ymm1,20 + vpor ymm1,ymm15,ymm1 + vpaddd ymm8,ymm8,ymm0 + vpxor ymm4,ymm8,ymm4 + vpshufb ymm4,ymm4,ymm14 + vpaddd ymm9,ymm9,ymm1 + vpxor ymm5,ymm9,ymm5 + vpshufb ymm5,ymm5,ymm14 + vpaddd 
ymm12,ymm12,ymm4 + vpxor ymm0,ymm12,ymm0 + vpslld ymm15,ymm0,7 + vpsrld ymm0,ymm0,25 + vpor ymm0,ymm15,ymm0 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm5 + vpxor ymm1,ymm13,ymm1 + vpslld ymm14,ymm1,7 + vpsrld ymm1,ymm1,25 + vpor ymm1,ymm14,ymm1 + vmovdqa YMMWORD[rsp],ymm12 + vmovdqa YMMWORD[32+rsp],ymm13 + vmovdqa ymm12,YMMWORD[64+rsp] + vmovdqa ymm13,YMMWORD[96+rsp] + vpaddd ymm10,ymm10,ymm2 + vpxor ymm6,ymm10,ymm6 + vpshufb ymm6,ymm6,ymm15 + vpaddd ymm11,ymm11,ymm3 + vpxor ymm7,ymm11,ymm7 + vpshufb ymm7,ymm7,ymm15 + vpaddd ymm12,ymm12,ymm6 + vpxor ymm2,ymm12,ymm2 + vpslld ymm14,ymm2,12 + vpsrld ymm2,ymm2,20 + vpor ymm2,ymm14,ymm2 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm7 + vpxor ymm3,ymm13,ymm3 + vpslld ymm15,ymm3,12 + vpsrld ymm3,ymm3,20 + vpor ymm3,ymm15,ymm3 + vpaddd ymm10,ymm10,ymm2 + vpxor ymm6,ymm10,ymm6 + vpshufb ymm6,ymm6,ymm14 + vpaddd ymm11,ymm11,ymm3 + vpxor ymm7,ymm11,ymm7 + vpshufb ymm7,ymm7,ymm14 + vpaddd ymm12,ymm12,ymm6 + vpxor ymm2,ymm12,ymm2 + vpslld ymm15,ymm2,7 + vpsrld ymm2,ymm2,25 + vpor ymm2,ymm15,ymm2 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm7 + vpxor ymm3,ymm13,ymm3 + vpslld ymm14,ymm3,7 + vpsrld ymm3,ymm3,25 + vpor ymm3,ymm14,ymm3 + vpaddd ymm8,ymm8,ymm1 + vpxor ymm7,ymm8,ymm7 + vpshufb ymm7,ymm7,ymm15 + vpaddd ymm9,ymm9,ymm2 + vpxor ymm4,ymm9,ymm4 + vpshufb ymm4,ymm4,ymm15 + vpaddd ymm12,ymm12,ymm7 + vpxor ymm1,ymm12,ymm1 + vpslld ymm14,ymm1,12 + vpsrld ymm1,ymm1,20 + vpor ymm1,ymm14,ymm1 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm4 + vpxor ymm2,ymm13,ymm2 + vpslld ymm15,ymm2,12 + vpsrld ymm2,ymm2,20 + vpor ymm2,ymm15,ymm2 + vpaddd ymm8,ymm8,ymm1 + vpxor ymm7,ymm8,ymm7 + vpshufb ymm7,ymm7,ymm14 + vpaddd ymm9,ymm9,ymm2 + vpxor ymm4,ymm9,ymm4 + vpshufb ymm4,ymm4,ymm14 + vpaddd ymm12,ymm12,ymm7 + vpxor ymm1,ymm12,ymm1 + vpslld ymm15,ymm1,7 + vpsrld ymm1,ymm1,25 + vpor ymm1,ymm15,ymm1 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm4 + vpxor ymm2,ymm13,ymm2 + vpslld ymm14,ymm2,7 + vpsrld ymm2,ymm2,25 + vpor ymm2,ymm14,ymm2 + vmovdqa YMMWORD[64+rsp],ymm12 + vmovdqa YMMWORD[96+rsp],ymm13 + vmovdqa ymm12,YMMWORD[rsp] + vmovdqa ymm13,YMMWORD[32+rsp] + vpaddd ymm10,ymm10,ymm3 + vpxor ymm5,ymm10,ymm5 + vpshufb ymm5,ymm5,ymm15 + vpaddd ymm11,ymm11,ymm0 + vpxor ymm6,ymm11,ymm6 + vpshufb ymm6,ymm6,ymm15 + vpaddd ymm12,ymm12,ymm5 + vpxor ymm3,ymm12,ymm3 + vpslld ymm14,ymm3,12 + vpsrld ymm3,ymm3,20 + vpor ymm3,ymm14,ymm3 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm6 + vpxor ymm0,ymm13,ymm0 + vpslld ymm15,ymm0,12 + vpsrld ymm0,ymm0,20 + vpor ymm0,ymm15,ymm0 + vpaddd ymm10,ymm10,ymm3 + vpxor ymm5,ymm10,ymm5 + vpshufb ymm5,ymm5,ymm14 + vpaddd ymm11,ymm11,ymm0 + vpxor ymm6,ymm11,ymm6 + vpshufb ymm6,ymm6,ymm14 + vpaddd ymm12,ymm12,ymm5 + vpxor ymm3,ymm12,ymm3 + vpslld ymm15,ymm3,7 + vpsrld ymm3,ymm3,25 + vpor ymm3,ymm15,ymm3 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm6 + vpxor ymm0,ymm13,ymm0 + vpslld ymm14,ymm0,7 + vpsrld ymm0,ymm0,25 + vpor ymm0,ymm14,ymm0 + dec eax + jnz NEAR $L$oop8x + + lea rax,[512+rsp] + vpaddd ymm8,ymm8,YMMWORD[((128-256))+rcx] + vpaddd ymm9,ymm9,YMMWORD[((160-256))+rcx] + vpaddd ymm10,ymm10,YMMWORD[((192-256))+rcx] + vpaddd ymm11,ymm11,YMMWORD[((224-256))+rcx] + + vpunpckldq ymm14,ymm8,ymm9 + vpunpckldq ymm15,ymm10,ymm11 + vpunpckhdq ymm8,ymm8,ymm9 + vpunpckhdq ymm10,ymm10,ymm11 + vpunpcklqdq ymm9,ymm14,ymm15 + vpunpckhqdq ymm14,ymm14,ymm15 + vpunpcklqdq ymm11,ymm8,ymm10 + vpunpckhqdq ymm8,ymm8,ymm10 + vpaddd 
ymm0,ymm0,YMMWORD[((256-256))+rcx] + vpaddd ymm1,ymm1,YMMWORD[((288-256))+rcx] + vpaddd ymm2,ymm2,YMMWORD[((320-256))+rcx] + vpaddd ymm3,ymm3,YMMWORD[((352-256))+rcx] + + vpunpckldq ymm10,ymm0,ymm1 + vpunpckldq ymm15,ymm2,ymm3 + vpunpckhdq ymm0,ymm0,ymm1 + vpunpckhdq ymm2,ymm2,ymm3 + vpunpcklqdq ymm1,ymm10,ymm15 + vpunpckhqdq ymm10,ymm10,ymm15 + vpunpcklqdq ymm3,ymm0,ymm2 + vpunpckhqdq ymm0,ymm0,ymm2 + vperm2i128 ymm15,ymm9,ymm1,0x20 + vperm2i128 ymm1,ymm9,ymm1,0x31 + vperm2i128 ymm9,ymm14,ymm10,0x20 + vperm2i128 ymm10,ymm14,ymm10,0x31 + vperm2i128 ymm14,ymm11,ymm3,0x20 + vperm2i128 ymm3,ymm11,ymm3,0x31 + vperm2i128 ymm11,ymm8,ymm0,0x20 + vperm2i128 ymm0,ymm8,ymm0,0x31 + vmovdqa YMMWORD[rsp],ymm15 + vmovdqa YMMWORD[32+rsp],ymm9 + vmovdqa ymm15,YMMWORD[64+rsp] + vmovdqa ymm9,YMMWORD[96+rsp] + + vpaddd ymm12,ymm12,YMMWORD[((384-512))+rax] + vpaddd ymm13,ymm13,YMMWORD[((416-512))+rax] + vpaddd ymm15,ymm15,YMMWORD[((448-512))+rax] + vpaddd ymm9,ymm9,YMMWORD[((480-512))+rax] + + vpunpckldq ymm2,ymm12,ymm13 + vpunpckldq ymm8,ymm15,ymm9 + vpunpckhdq ymm12,ymm12,ymm13 + vpunpckhdq ymm15,ymm15,ymm9 + vpunpcklqdq ymm13,ymm2,ymm8 + vpunpckhqdq ymm2,ymm2,ymm8 + vpunpcklqdq ymm9,ymm12,ymm15 + vpunpckhqdq ymm12,ymm12,ymm15 + vpaddd ymm4,ymm4,YMMWORD[((512-512))+rax] + vpaddd ymm5,ymm5,YMMWORD[((544-512))+rax] + vpaddd ymm6,ymm6,YMMWORD[((576-512))+rax] + vpaddd ymm7,ymm7,YMMWORD[((608-512))+rax] + + vpunpckldq ymm15,ymm4,ymm5 + vpunpckldq ymm8,ymm6,ymm7 + vpunpckhdq ymm4,ymm4,ymm5 + vpunpckhdq ymm6,ymm6,ymm7 + vpunpcklqdq ymm5,ymm15,ymm8 + vpunpckhqdq ymm15,ymm15,ymm8 + vpunpcklqdq ymm7,ymm4,ymm6 + vpunpckhqdq ymm4,ymm4,ymm6 + vperm2i128 ymm8,ymm13,ymm5,0x20 + vperm2i128 ymm5,ymm13,ymm5,0x31 + vperm2i128 ymm13,ymm2,ymm15,0x20 + vperm2i128 ymm15,ymm2,ymm15,0x31 + vperm2i128 ymm2,ymm9,ymm7,0x20 + vperm2i128 ymm7,ymm9,ymm7,0x31 + vperm2i128 ymm9,ymm12,ymm4,0x20 + vperm2i128 ymm4,ymm12,ymm4,0x31 + vmovdqa ymm6,YMMWORD[rsp] + vmovdqa ymm12,YMMWORD[32+rsp] + + cmp rdx,64*8 + jb NEAR $L$tail8x + + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + lea rdi,[128+rdi] + + vpxor ymm12,ymm12,YMMWORD[rsi] + vpxor ymm13,ymm13,YMMWORD[32+rsi] + vpxor ymm10,ymm10,YMMWORD[64+rsi] + vpxor ymm15,ymm15,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm12 + vmovdqu YMMWORD[32+rdi],ymm13 + vmovdqu YMMWORD[64+rdi],ymm10 + vmovdqu YMMWORD[96+rdi],ymm15 + lea rdi,[128+rdi] + + vpxor ymm14,ymm14,YMMWORD[rsi] + vpxor ymm2,ymm2,YMMWORD[32+rsi] + vpxor ymm3,ymm3,YMMWORD[64+rsi] + vpxor ymm7,ymm7,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm14 + vmovdqu YMMWORD[32+rdi],ymm2 + vmovdqu YMMWORD[64+rdi],ymm3 + vmovdqu YMMWORD[96+rdi],ymm7 + lea rdi,[128+rdi] + + vpxor ymm11,ymm11,YMMWORD[rsi] + vpxor ymm9,ymm9,YMMWORD[32+rsi] + vpxor ymm0,ymm0,YMMWORD[64+rsi] + vpxor ymm4,ymm4,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm11 + vmovdqu YMMWORD[32+rdi],ymm9 + vmovdqu YMMWORD[64+rdi],ymm0 + vmovdqu YMMWORD[96+rdi],ymm4 + lea rdi,[128+rdi] + + sub rdx,64*8 + jnz NEAR $L$oop_outer8x + + jmp NEAR $L$done8x + +$L$tail8x: + cmp rdx,448 + jae NEAR $L$448_or_more8x + cmp rdx,384 + jae NEAR $L$384_or_more8x + cmp rdx,320 + jae NEAR $L$320_or_more8x + cmp rdx,256 + jae NEAR $L$256_or_more8x + cmp rdx,192 + jae NEAR $L$192_or_more8x + cmp rdx,128 + jae NEAR $L$128_or_more8x + cmp rdx,64 
+ jae NEAR $L$64_or_more8x + + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm6 + vmovdqa YMMWORD[32+rsp],ymm8 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$64_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + je NEAR $L$done8x + + lea rsi,[64+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm1 + lea rdi,[64+rdi] + sub rdx,64 + vmovdqa YMMWORD[32+rsp],ymm5 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$128_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + je NEAR $L$done8x + + lea rsi,[128+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm12 + lea rdi,[128+rdi] + sub rdx,128 + vmovdqa YMMWORD[32+rsp],ymm13 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$192_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + je NEAR $L$done8x + + lea rsi,[192+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm10 + lea rdi,[192+rdi] + sub rdx,192 + vmovdqa YMMWORD[32+rsp],ymm15 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$256_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + je NEAR $L$done8x + + lea rsi,[256+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm14 + lea rdi,[256+rdi] + sub rdx,256 + vmovdqa YMMWORD[32+rsp],ymm2 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$320_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vpxor ymm14,ymm14,YMMWORD[256+rsi] + vpxor ymm2,ymm2,YMMWORD[288+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + vmovdqu YMMWORD[256+rdi],ymm14 + vmovdqu YMMWORD[288+rdi],ymm2 + je NEAR $L$done8x + + lea rsi,[320+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm3 + lea rdi,[320+rdi] + sub rdx,320 + vmovdqa YMMWORD[32+rsp],ymm7 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$384_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vpxor ymm14,ymm14,YMMWORD[256+rsi] + vpxor ymm2,ymm2,YMMWORD[288+rsi] + vpxor ymm3,ymm3,YMMWORD[320+rsi] + 
vpxor ymm7,ymm7,YMMWORD[352+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + vmovdqu YMMWORD[256+rdi],ymm14 + vmovdqu YMMWORD[288+rdi],ymm2 + vmovdqu YMMWORD[320+rdi],ymm3 + vmovdqu YMMWORD[352+rdi],ymm7 + je NEAR $L$done8x + + lea rsi,[384+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm11 + lea rdi,[384+rdi] + sub rdx,384 + vmovdqa YMMWORD[32+rsp],ymm9 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$448_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vpxor ymm14,ymm14,YMMWORD[256+rsi] + vpxor ymm2,ymm2,YMMWORD[288+rsi] + vpxor ymm3,ymm3,YMMWORD[320+rsi] + vpxor ymm7,ymm7,YMMWORD[352+rsi] + vpxor ymm11,ymm11,YMMWORD[384+rsi] + vpxor ymm9,ymm9,YMMWORD[416+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + vmovdqu YMMWORD[256+rdi],ymm14 + vmovdqu YMMWORD[288+rdi],ymm2 + vmovdqu YMMWORD[320+rdi],ymm3 + vmovdqu YMMWORD[352+rdi],ymm7 + vmovdqu YMMWORD[384+rdi],ymm11 + vmovdqu YMMWORD[416+rdi],ymm9 + je NEAR $L$done8x + + lea rsi,[448+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm0 + lea rdi,[448+rdi] + sub rdx,448 + vmovdqa YMMWORD[32+rsp],ymm4 + +$L$oop_tail8x: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail8x + +$L$done8x: + vzeroall + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$8x_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_avx2: +global chacha20_avx512 + +ALIGN 32 +chacha20_avx512: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_avx512: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_avx512: + mov r9,rsp + + cmp rdx,512 + ja NEAR $L$chacha20_16x + + sub rsp,64+40 + movaps XMMWORD[(-40)+r9],xmm6 + movaps XMMWORD[(-24)+r9],xmm7 +$L$avx512_body: + vbroadcasti32x4 zmm0,ZMMWORD[$L$sigma] + vbroadcasti32x4 zmm1,ZMMWORD[rcx] + vbroadcasti32x4 zmm2,ZMMWORD[16+rcx] + vbroadcasti32x4 zmm3,ZMMWORD[r8] + + vmovdqa32 zmm16,zmm0 + vmovdqa32 zmm17,zmm1 + vmovdqa32 zmm18,zmm2 + vpaddd zmm3,zmm3,ZMMWORD[$L$zeroz] + vmovdqa32 zmm20,ZMMWORD[$L$fourz] + mov r8,10 + vmovdqa32 zmm19,zmm3 + jmp NEAR $L$oop_avx512 + +ALIGN 16 +$L$oop_outer_avx512: + vmovdqa32 zmm0,zmm16 + vmovdqa32 zmm1,zmm17 + vmovdqa32 zmm2,zmm18 + vpaddd zmm3,zmm19,zmm20 + mov r8,10 + vmovdqa32 zmm19,zmm3 + jmp NEAR $L$oop_avx512 + +ALIGN 32 +$L$oop_avx512: + vpaddd zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,16 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,12 + vpaddd 
zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,8 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,7 + vpshufd zmm2,zmm2,78 + vpshufd zmm1,zmm1,57 + vpshufd zmm3,zmm3,147 + vpaddd zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,16 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,12 + vpaddd zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,8 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,7 + vpshufd zmm2,zmm2,78 + vpshufd zmm1,zmm1,147 + vpshufd zmm3,zmm3,57 + dec r8 + jnz NEAR $L$oop_avx512 + vpaddd zmm0,zmm0,zmm16 + vpaddd zmm1,zmm1,zmm17 + vpaddd zmm2,zmm2,zmm18 + vpaddd zmm3,zmm3,zmm19 + + sub rdx,64 + jb NEAR $L$tail64_avx512 + + vpxor xmm4,xmm0,XMMWORD[rsi] + vpxor xmm5,xmm1,XMMWORD[16+rsi] + vpxor xmm6,xmm2,XMMWORD[32+rsi] + vpxor xmm7,xmm3,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512 + + vextracti32x4 xmm4,zmm0,1 + vextracti32x4 xmm5,zmm1,1 + vextracti32x4 xmm6,zmm2,1 + vextracti32x4 xmm7,zmm3,1 + + sub rdx,64 + jb NEAR $L$tail_avx512 + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512 + + vextracti32x4 xmm4,zmm0,2 + vextracti32x4 xmm5,zmm1,2 + vextracti32x4 xmm6,zmm2,2 + vextracti32x4 xmm7,zmm3,2 + + sub rdx,64 + jb NEAR $L$tail_avx512 + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512 + + vextracti32x4 xmm4,zmm0,3 + vextracti32x4 xmm5,zmm1,3 + vextracti32x4 xmm6,zmm2,3 + vextracti32x4 xmm7,zmm3,3 + + sub rdx,64 + jb NEAR $L$tail_avx512 + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jnz NEAR $L$oop_outer_avx512 + + jmp NEAR $L$done_avx512 + +ALIGN 16 +$L$tail64_avx512: + vmovdqa XMMWORD[rsp],xmm0 + vmovdqa XMMWORD[16+rsp],xmm1 + vmovdqa XMMWORD[32+rsp],xmm2 + vmovdqa XMMWORD[48+rsp],xmm3 + add rdx,64 + jmp NEAR $L$oop_tail_avx512 + +ALIGN 16 +$L$tail_avx512: + vmovdqa XMMWORD[rsp],xmm4 + vmovdqa XMMWORD[16+rsp],xmm5 + vmovdqa XMMWORD[32+rsp],xmm6 + vmovdqa XMMWORD[48+rsp],xmm7 + add rdx,64 + +$L$oop_tail_avx512: + movzx eax,BYTE[r8*1+rsi] + movzx ecx,BYTE[r8*1+rsp] + lea r8,[1+r8] + xor eax,ecx + mov BYTE[((-1))+r8*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail_avx512 + + vmovdqu32 ZMMWORD[rsp],zmm16 + +$L$done_avx512: + vzeroall + movaps xmm6,XMMWORD[((-40))+r9] + movaps xmm7,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$avx512_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_avx512: +global chacha20_avx512vl + +ALIGN 32 +chacha20_avx512vl: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_avx512vl: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 
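+ ; Win64 passes the first four arguments in rcx, rdx, r8 and r9 with the fifth on the stack; these moves re-stage them into rdi, rsi, rdx, rcx and r8, the System V order the generated function body expects.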
+ mov r8,QWORD[40+rsp] + + + +$L$chacha20_avx512vl: + mov r9,rsp + + cmp rdx,128 + ja NEAR $L$chacha20_8xvl + + sub rsp,64+40 + movaps XMMWORD[(-40)+r9],xmm6 + movaps XMMWORD[(-24)+r9],xmm7 +$L$avx512vl_body: + vbroadcasti128 ymm0,XMMWORD[$L$sigma] + vbroadcasti128 ymm1,XMMWORD[rcx] + vbroadcasti128 ymm2,XMMWORD[16+rcx] + vbroadcasti128 ymm3,XMMWORD[r8] + + vmovdqa32 ymm16,ymm0 + vmovdqa32 ymm17,ymm1 + vmovdqa32 ymm18,ymm2 + vpaddd ymm3,ymm3,YMMWORD[$L$zeroz] + vmovdqa32 ymm20,YMMWORD[$L$twoy] + mov r8,10 + vmovdqa32 ymm19,ymm3 + jmp NEAR $L$oop_avx512vl + +ALIGN 16 +$L$oop_outer_avx512vl: + vmovdqa32 ymm2,ymm18 + vpaddd ymm3,ymm19,ymm20 + mov r8,10 + vmovdqa32 ymm19,ymm3 + jmp NEAR $L$oop_avx512vl + +ALIGN 32 +$L$oop_avx512vl: + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,16 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,12 + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,8 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,7 + vpshufd ymm2,ymm2,78 + vpshufd ymm1,ymm1,57 + vpshufd ymm3,ymm3,147 + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,16 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,12 + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,8 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,7 + vpshufd ymm2,ymm2,78 + vpshufd ymm1,ymm1,147 + vpshufd ymm3,ymm3,57 + dec r8 + jnz NEAR $L$oop_avx512vl + vpaddd ymm0,ymm0,ymm16 + vpaddd ymm1,ymm1,ymm17 + vpaddd ymm2,ymm2,ymm18 + vpaddd ymm3,ymm3,ymm19 + + sub rdx,64 + jb NEAR $L$tail64_avx512vl + + vpxor xmm4,xmm0,XMMWORD[rsi] + vpxor xmm5,xmm1,XMMWORD[16+rsi] + vpxor xmm6,xmm2,XMMWORD[32+rsi] + vpxor xmm7,xmm3,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512vl + + vextracti128 xmm4,ymm0,1 + vextracti128 xmm5,ymm1,1 + vextracti128 xmm6,ymm2,1 + vextracti128 xmm7,ymm3,1 + + sub rdx,64 + jb NEAR $L$tail_avx512vl + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + vmovdqa32 ymm0,ymm16 + vmovdqa32 ymm1,ymm17 + jnz NEAR $L$oop_outer_avx512vl + + jmp NEAR $L$done_avx512vl + +ALIGN 16 +$L$tail64_avx512vl: + vmovdqa XMMWORD[rsp],xmm0 + vmovdqa XMMWORD[16+rsp],xmm1 + vmovdqa XMMWORD[32+rsp],xmm2 + vmovdqa XMMWORD[48+rsp],xmm3 + add rdx,64 + jmp NEAR $L$oop_tail_avx512vl + +ALIGN 16 +$L$tail_avx512vl: + vmovdqa XMMWORD[rsp],xmm4 + vmovdqa XMMWORD[16+rsp],xmm5 + vmovdqa XMMWORD[32+rsp],xmm6 + vmovdqa XMMWORD[48+rsp],xmm7 + add rdx,64 + +$L$oop_tail_avx512vl: + movzx eax,BYTE[r8*1+rsi] + movzx ecx,BYTE[r8*1+rsp] + lea r8,[1+r8] + xor eax,ecx + mov BYTE[((-1))+r8*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail_avx512vl + + vmovdqu32 YMMWORD[rsp],ymm16 + vmovdqu32 YMMWORD[32+rsp],ymm16 + +$L$done_avx512vl: + vzeroall + movaps xmm6,XMMWORD[((-40))+r9] + movaps xmm7,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$avx512vl_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_avx512vl: +global chacha20_16x + +ALIGN 32 +chacha20_16x: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_16x: + mov rdi,rcx + mov rsi,rdx + mov 
rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_16x: + mov r9,rsp + + sub rsp,64+168 + and rsp,-64 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$16x_body: + vzeroupper + + lea r10,[$L$sigma] + vbroadcasti32x4 zmm3,ZMMWORD[r10] + vbroadcasti32x4 zmm7,ZMMWORD[rcx] + vbroadcasti32x4 zmm11,ZMMWORD[16+rcx] + vbroadcasti32x4 zmm15,ZMMWORD[r8] + + vpshufd zmm0,zmm3,0x00 + vpshufd zmm1,zmm3,0x55 + vpshufd zmm2,zmm3,0xaa + vpshufd zmm3,zmm3,0xff + vmovdqa64 zmm16,zmm0 + vmovdqa64 zmm17,zmm1 + vmovdqa64 zmm18,zmm2 + vmovdqa64 zmm19,zmm3 + + vpshufd zmm4,zmm7,0x00 + vpshufd zmm5,zmm7,0x55 + vpshufd zmm6,zmm7,0xaa + vpshufd zmm7,zmm7,0xff + vmovdqa64 zmm20,zmm4 + vmovdqa64 zmm21,zmm5 + vmovdqa64 zmm22,zmm6 + vmovdqa64 zmm23,zmm7 + + vpshufd zmm8,zmm11,0x00 + vpshufd zmm9,zmm11,0x55 + vpshufd zmm10,zmm11,0xaa + vpshufd zmm11,zmm11,0xff + vmovdqa64 zmm24,zmm8 + vmovdqa64 zmm25,zmm9 + vmovdqa64 zmm26,zmm10 + vmovdqa64 zmm27,zmm11 + + vpshufd zmm12,zmm15,0x00 + vpshufd zmm13,zmm15,0x55 + vpshufd zmm14,zmm15,0xaa + vpshufd zmm15,zmm15,0xff + vpaddd zmm12,zmm12,ZMMWORD[$L$incz] + vmovdqa64 zmm28,zmm12 + vmovdqa64 zmm29,zmm13 + vmovdqa64 zmm30,zmm14 + vmovdqa64 zmm31,zmm15 + + mov eax,10 + jmp NEAR $L$oop16x + +ALIGN 32 +$L$oop_outer16x: + vpbroadcastd zmm0,DWORD[r10] + vpbroadcastd zmm1,DWORD[4+r10] + vpbroadcastd zmm2,DWORD[8+r10] + vpbroadcastd zmm3,DWORD[12+r10] + vpaddd zmm28,zmm28,ZMMWORD[$L$sixteen] + vmovdqa64 zmm4,zmm20 + vmovdqa64 zmm5,zmm21 + vmovdqa64 zmm6,zmm22 + vmovdqa64 zmm7,zmm23 + vmovdqa64 zmm8,zmm24 + vmovdqa64 zmm9,zmm25 + vmovdqa64 zmm10,zmm26 + vmovdqa64 zmm11,zmm27 + vmovdqa64 zmm12,zmm28 + vmovdqa64 zmm13,zmm29 + vmovdqa64 zmm14,zmm30 + vmovdqa64 zmm15,zmm31 + + vmovdqa64 zmm16,zmm0 + vmovdqa64 zmm17,zmm1 + vmovdqa64 zmm18,zmm2 + vmovdqa64 zmm19,zmm3 + + mov eax,10 + jmp NEAR $L$oop16x + +ALIGN 32 +$L$oop16x: + vpaddd zmm0,zmm0,zmm4 + vpaddd zmm1,zmm1,zmm5 + vpaddd zmm2,zmm2,zmm6 + vpaddd zmm3,zmm3,zmm7 + vpxord zmm12,zmm12,zmm0 + vpxord zmm13,zmm13,zmm1 + vpxord zmm14,zmm14,zmm2 + vpxord zmm15,zmm15,zmm3 + vprold zmm12,zmm12,16 + vprold zmm13,zmm13,16 + vprold zmm14,zmm14,16 + vprold zmm15,zmm15,16 + vpaddd zmm8,zmm8,zmm12 + vpaddd zmm9,zmm9,zmm13 + vpaddd zmm10,zmm10,zmm14 + vpaddd zmm11,zmm11,zmm15 + vpxord zmm4,zmm4,zmm8 + vpxord zmm5,zmm5,zmm9 + vpxord zmm6,zmm6,zmm10 + vpxord zmm7,zmm7,zmm11 + vprold zmm4,zmm4,12 + vprold zmm5,zmm5,12 + vprold zmm6,zmm6,12 + vprold zmm7,zmm7,12 + vpaddd zmm0,zmm0,zmm4 + vpaddd zmm1,zmm1,zmm5 + vpaddd zmm2,zmm2,zmm6 + vpaddd zmm3,zmm3,zmm7 + vpxord zmm12,zmm12,zmm0 + vpxord zmm13,zmm13,zmm1 + vpxord zmm14,zmm14,zmm2 + vpxord zmm15,zmm15,zmm3 + vprold zmm12,zmm12,8 + vprold zmm13,zmm13,8 + vprold zmm14,zmm14,8 + vprold zmm15,zmm15,8 + vpaddd zmm8,zmm8,zmm12 + vpaddd zmm9,zmm9,zmm13 + vpaddd zmm10,zmm10,zmm14 + vpaddd zmm11,zmm11,zmm15 + vpxord zmm4,zmm4,zmm8 + vpxord zmm5,zmm5,zmm9 + vpxord zmm6,zmm6,zmm10 + vpxord zmm7,zmm7,zmm11 + vprold zmm4,zmm4,7 + vprold zmm5,zmm5,7 + vprold zmm6,zmm6,7 + vprold zmm7,zmm7,7 + vpaddd zmm0,zmm0,zmm5 + vpaddd zmm1,zmm1,zmm6 + vpaddd zmm2,zmm2,zmm7 + vpaddd zmm3,zmm3,zmm4 + vpxord zmm15,zmm15,zmm0 + vpxord zmm12,zmm12,zmm1 + vpxord zmm13,zmm13,zmm2 + vpxord zmm14,zmm14,zmm3 + vprold zmm15,zmm15,16 + vprold 
zmm12,zmm12,16 + vprold zmm13,zmm13,16 + vprold zmm14,zmm14,16 + vpaddd zmm10,zmm10,zmm15 + vpaddd zmm11,zmm11,zmm12 + vpaddd zmm8,zmm8,zmm13 + vpaddd zmm9,zmm9,zmm14 + vpxord zmm5,zmm5,zmm10 + vpxord zmm6,zmm6,zmm11 + vpxord zmm7,zmm7,zmm8 + vpxord zmm4,zmm4,zmm9 + vprold zmm5,zmm5,12 + vprold zmm6,zmm6,12 + vprold zmm7,zmm7,12 + vprold zmm4,zmm4,12 + vpaddd zmm0,zmm0,zmm5 + vpaddd zmm1,zmm1,zmm6 + vpaddd zmm2,zmm2,zmm7 + vpaddd zmm3,zmm3,zmm4 + vpxord zmm15,zmm15,zmm0 + vpxord zmm12,zmm12,zmm1 + vpxord zmm13,zmm13,zmm2 + vpxord zmm14,zmm14,zmm3 + vprold zmm15,zmm15,8 + vprold zmm12,zmm12,8 + vprold zmm13,zmm13,8 + vprold zmm14,zmm14,8 + vpaddd zmm10,zmm10,zmm15 + vpaddd zmm11,zmm11,zmm12 + vpaddd zmm8,zmm8,zmm13 + vpaddd zmm9,zmm9,zmm14 + vpxord zmm5,zmm5,zmm10 + vpxord zmm6,zmm6,zmm11 + vpxord zmm7,zmm7,zmm8 + vpxord zmm4,zmm4,zmm9 + vprold zmm5,zmm5,7 + vprold zmm6,zmm6,7 + vprold zmm7,zmm7,7 + vprold zmm4,zmm4,7 + dec eax + jnz NEAR $L$oop16x + + vpaddd zmm0,zmm0,zmm16 + vpaddd zmm1,zmm1,zmm17 + vpaddd zmm2,zmm2,zmm18 + vpaddd zmm3,zmm3,zmm19 + + vpunpckldq zmm18,zmm0,zmm1 + vpunpckldq zmm19,zmm2,zmm3 + vpunpckhdq zmm0,zmm0,zmm1 + vpunpckhdq zmm2,zmm2,zmm3 + vpunpcklqdq zmm1,zmm18,zmm19 + vpunpckhqdq zmm18,zmm18,zmm19 + vpunpcklqdq zmm3,zmm0,zmm2 + vpunpckhqdq zmm0,zmm0,zmm2 + vpaddd zmm4,zmm4,zmm20 + vpaddd zmm5,zmm5,zmm21 + vpaddd zmm6,zmm6,zmm22 + vpaddd zmm7,zmm7,zmm23 + + vpunpckldq zmm2,zmm4,zmm5 + vpunpckldq zmm19,zmm6,zmm7 + vpunpckhdq zmm4,zmm4,zmm5 + vpunpckhdq zmm6,zmm6,zmm7 + vpunpcklqdq zmm5,zmm2,zmm19 + vpunpckhqdq zmm2,zmm2,zmm19 + vpunpcklqdq zmm7,zmm4,zmm6 + vpunpckhqdq zmm4,zmm4,zmm6 + vshufi32x4 zmm19,zmm1,zmm5,0x44 + vshufi32x4 zmm5,zmm1,zmm5,0xee + vshufi32x4 zmm1,zmm18,zmm2,0x44 + vshufi32x4 zmm2,zmm18,zmm2,0xee + vshufi32x4 zmm18,zmm3,zmm7,0x44 + vshufi32x4 zmm7,zmm3,zmm7,0xee + vshufi32x4 zmm3,zmm0,zmm4,0x44 + vshufi32x4 zmm4,zmm0,zmm4,0xee + vpaddd zmm8,zmm8,zmm24 + vpaddd zmm9,zmm9,zmm25 + vpaddd zmm10,zmm10,zmm26 + vpaddd zmm11,zmm11,zmm27 + + vpunpckldq zmm6,zmm8,zmm9 + vpunpckldq zmm0,zmm10,zmm11 + vpunpckhdq zmm8,zmm8,zmm9 + vpunpckhdq zmm10,zmm10,zmm11 + vpunpcklqdq zmm9,zmm6,zmm0 + vpunpckhqdq zmm6,zmm6,zmm0 + vpunpcklqdq zmm11,zmm8,zmm10 + vpunpckhqdq zmm8,zmm8,zmm10 + vpaddd zmm12,zmm12,zmm28 + vpaddd zmm13,zmm13,zmm29 + vpaddd zmm14,zmm14,zmm30 + vpaddd zmm15,zmm15,zmm31 + + vpunpckldq zmm10,zmm12,zmm13 + vpunpckldq zmm0,zmm14,zmm15 + vpunpckhdq zmm12,zmm12,zmm13 + vpunpckhdq zmm14,zmm14,zmm15 + vpunpcklqdq zmm13,zmm10,zmm0 + vpunpckhqdq zmm10,zmm10,zmm0 + vpunpcklqdq zmm15,zmm12,zmm14 + vpunpckhqdq zmm12,zmm12,zmm14 + vshufi32x4 zmm0,zmm9,zmm13,0x44 + vshufi32x4 zmm13,zmm9,zmm13,0xee + vshufi32x4 zmm9,zmm6,zmm10,0x44 + vshufi32x4 zmm10,zmm6,zmm10,0xee + vshufi32x4 zmm6,zmm11,zmm15,0x44 + vshufi32x4 zmm15,zmm11,zmm15,0xee + vshufi32x4 zmm11,zmm8,zmm12,0x44 + vshufi32x4 zmm12,zmm8,zmm12,0xee + vshufi32x4 zmm16,zmm19,zmm0,0x88 + vshufi32x4 zmm19,zmm19,zmm0,0xdd + vshufi32x4 zmm0,zmm5,zmm13,0x88 + vshufi32x4 zmm13,zmm5,zmm13,0xdd + vshufi32x4 zmm17,zmm1,zmm9,0x88 + vshufi32x4 zmm1,zmm1,zmm9,0xdd + vshufi32x4 zmm9,zmm2,zmm10,0x88 + vshufi32x4 zmm10,zmm2,zmm10,0xdd + vshufi32x4 zmm14,zmm18,zmm6,0x88 + vshufi32x4 zmm18,zmm18,zmm6,0xdd + vshufi32x4 zmm6,zmm7,zmm15,0x88 + vshufi32x4 zmm15,zmm7,zmm15,0xdd + vshufi32x4 zmm8,zmm3,zmm11,0x88 + vshufi32x4 zmm3,zmm3,zmm11,0xdd + vshufi32x4 zmm11,zmm4,zmm12,0x88 + vshufi32x4 zmm12,zmm4,zmm12,0xdd + cmp rdx,64*16 + jb NEAR $L$tail16x + + vpxord zmm16,zmm16,ZMMWORD[rsi] + vpxord zmm17,zmm17,ZMMWORD[64+rsi] + vpxord 
zmm14,zmm14,ZMMWORD[128+rsi] + vpxord zmm8,zmm8,ZMMWORD[192+rsi] + vmovdqu32 ZMMWORD[rdi],zmm16 + vmovdqu32 ZMMWORD[64+rdi],zmm17 + vmovdqu32 ZMMWORD[128+rdi],zmm14 + vmovdqu32 ZMMWORD[192+rdi],zmm8 + + vpxord zmm19,zmm19,ZMMWORD[256+rsi] + vpxord zmm1,zmm1,ZMMWORD[320+rsi] + vpxord zmm18,zmm18,ZMMWORD[384+rsi] + vpxord zmm3,zmm3,ZMMWORD[448+rsi] + vmovdqu32 ZMMWORD[256+rdi],zmm19 + vmovdqu32 ZMMWORD[320+rdi],zmm1 + vmovdqu32 ZMMWORD[384+rdi],zmm18 + vmovdqu32 ZMMWORD[448+rdi],zmm3 + + vpxord zmm0,zmm0,ZMMWORD[512+rsi] + vpxord zmm9,zmm9,ZMMWORD[576+rsi] + vpxord zmm6,zmm6,ZMMWORD[640+rsi] + vpxord zmm11,zmm11,ZMMWORD[704+rsi] + vmovdqu32 ZMMWORD[512+rdi],zmm0 + vmovdqu32 ZMMWORD[576+rdi],zmm9 + vmovdqu32 ZMMWORD[640+rdi],zmm6 + vmovdqu32 ZMMWORD[704+rdi],zmm11 + + vpxord zmm13,zmm13,ZMMWORD[768+rsi] + vpxord zmm10,zmm10,ZMMWORD[832+rsi] + vpxord zmm15,zmm15,ZMMWORD[896+rsi] + vpxord zmm12,zmm12,ZMMWORD[960+rsi] + lea rsi,[1024+rsi] + vmovdqu32 ZMMWORD[768+rdi],zmm13 + vmovdqu32 ZMMWORD[832+rdi],zmm10 + vmovdqu32 ZMMWORD[896+rdi],zmm15 + vmovdqu32 ZMMWORD[960+rdi],zmm12 + lea rdi,[1024+rdi] + + sub rdx,64*16 + jnz NEAR $L$oop_outer16x + + jmp NEAR $L$done16x + +ALIGN 32 +$L$tail16x: + xor r10,r10 + sub rdi,rsi + cmp rdx,64*1 + jb NEAR $L$ess_than_64_16x + vpxord zmm16,zmm16,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm16 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm17 + lea rsi,[64+rsi] + + cmp rdx,64*2 + jb NEAR $L$ess_than_64_16x + vpxord zmm17,zmm17,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm17 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm14 + lea rsi,[64+rsi] + + cmp rdx,64*3 + jb NEAR $L$ess_than_64_16x + vpxord zmm14,zmm14,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm14 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm8 + lea rsi,[64+rsi] + + cmp rdx,64*4 + jb NEAR $L$ess_than_64_16x + vpxord zmm8,zmm8,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm8 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm19 + lea rsi,[64+rsi] + + cmp rdx,64*5 + jb NEAR $L$ess_than_64_16x + vpxord zmm19,zmm19,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm19 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm1 + lea rsi,[64+rsi] + + cmp rdx,64*6 + jb NEAR $L$ess_than_64_16x + vpxord zmm1,zmm1,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm1 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm18 + lea rsi,[64+rsi] + + cmp rdx,64*7 + jb NEAR $L$ess_than_64_16x + vpxord zmm18,zmm18,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm18 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm3 + lea rsi,[64+rsi] + + cmp rdx,64*8 + jb NEAR $L$ess_than_64_16x + vpxord zmm3,zmm3,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm3 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm0 + lea rsi,[64+rsi] + + cmp rdx,64*9 + jb NEAR $L$ess_than_64_16x + vpxord zmm0,zmm0,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm0 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm9 + lea rsi,[64+rsi] + + cmp rdx,64*10 + jb NEAR $L$ess_than_64_16x + vpxord zmm9,zmm9,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm9 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm6 + lea rsi,[64+rsi] + + cmp rdx,64*11 + jb NEAR $L$ess_than_64_16x + vpxord zmm6,zmm6,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm6 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm11 + lea rsi,[64+rsi] + + cmp rdx,64*12 + jb NEAR $L$ess_than_64_16x + vpxord zmm11,zmm11,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm11 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm13 + lea rsi,[64+rsi] + + cmp rdx,64*13 + jb NEAR $L$ess_than_64_16x + vpxord zmm13,zmm13,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm13 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm10 + lea 
rsi,[64+rsi] + + cmp rdx,64*14 + jb NEAR $L$ess_than_64_16x + vpxord zmm10,zmm10,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm10 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm15 + lea rsi,[64+rsi] + + cmp rdx,64*15 + jb NEAR $L$ess_than_64_16x + vpxord zmm15,zmm15,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm15 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm12 + lea rsi,[64+rsi] + +$L$ess_than_64_16x: + vmovdqa32 ZMMWORD[rsp],zmm16 + lea rdi,[rsi*1+rdi] + and rdx,63 + +$L$oop_tail16x: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail16x + + vpxord zmm16,zmm16,zmm16 + vmovdqa32 ZMMWORD[rsp],zmm16 + +$L$done16x: + vzeroall + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$16x_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_16x: +global chacha20_8xvl + +ALIGN 32 +chacha20_8xvl: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_8xvl: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_8xvl: + mov r9,rsp + + sub rsp,64+168 + and rsp,-64 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$8xvl_body: + vzeroupper + + lea r10,[$L$sigma] + vbroadcasti128 ymm3,XMMWORD[r10] + vbroadcasti128 ymm7,XMMWORD[rcx] + vbroadcasti128 ymm11,XMMWORD[16+rcx] + vbroadcasti128 ymm15,XMMWORD[r8] + + vpshufd ymm0,ymm3,0x00 + vpshufd ymm1,ymm3,0x55 + vpshufd ymm2,ymm3,0xaa + vpshufd ymm3,ymm3,0xff + vmovdqa64 ymm16,ymm0 + vmovdqa64 ymm17,ymm1 + vmovdqa64 ymm18,ymm2 + vmovdqa64 ymm19,ymm3 + + vpshufd ymm4,ymm7,0x00 + vpshufd ymm5,ymm7,0x55 + vpshufd ymm6,ymm7,0xaa + vpshufd ymm7,ymm7,0xff + vmovdqa64 ymm20,ymm4 + vmovdqa64 ymm21,ymm5 + vmovdqa64 ymm22,ymm6 + vmovdqa64 ymm23,ymm7 + + vpshufd ymm8,ymm11,0x00 + vpshufd ymm9,ymm11,0x55 + vpshufd ymm10,ymm11,0xaa + vpshufd ymm11,ymm11,0xff + vmovdqa64 ymm24,ymm8 + vmovdqa64 ymm25,ymm9 + vmovdqa64 ymm26,ymm10 + vmovdqa64 ymm27,ymm11 + + vpshufd ymm12,ymm15,0x00 + vpshufd ymm13,ymm15,0x55 + vpshufd ymm14,ymm15,0xaa + vpshufd ymm15,ymm15,0xff + vpaddd ymm12,ymm12,YMMWORD[$L$incy] + vmovdqa64 ymm28,ymm12 + vmovdqa64 ymm29,ymm13 + vmovdqa64 ymm30,ymm14 + vmovdqa64 ymm31,ymm15 + + mov eax,10 + jmp NEAR $L$oop8xvl + +ALIGN 32 +$L$oop_outer8xvl: + + + vpbroadcastd ymm2,DWORD[8+r10] + vpbroadcastd ymm3,DWORD[12+r10] + vpaddd ymm28,ymm28,YMMWORD[$L$eight] + vmovdqa64 ymm4,ymm20 + vmovdqa64 ymm5,ymm21 + vmovdqa64 ymm6,ymm22 + vmovdqa64 ymm7,ymm23 + vmovdqa64 ymm8,ymm24 + vmovdqa64 ymm9,ymm25 + vmovdqa64 ymm10,ymm26 + vmovdqa64 ymm11,ymm27 + vmovdqa64 ymm12,ymm28 + vmovdqa64 ymm13,ymm29 + vmovdqa64 ymm14,ymm30 + vmovdqa64 ymm15,ymm31 + + vmovdqa64 ymm16,ymm0 + vmovdqa64 ymm17,ymm1 + vmovdqa64 ymm18,ymm2 + vmovdqa64 ymm19,ymm3 + + mov eax,10 + jmp NEAR $L$oop8xvl + +ALIGN 32 +$L$oop8xvl: + vpaddd ymm0,ymm0,ymm4 + vpaddd ymm1,ymm1,ymm5 
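+ ; One pass of $L$oop8xvl is a full ChaCha20 double round on eight blocks: four column quarter-rounds followed by four diagonal ones. In scalar form each quarter-round is a += b; d ^= a; d <<<= 16; c += d; b ^= c; b <<<= 12; a += b; d ^= a; d <<<= 8; c += d; b ^= c; b <<<= 7; AVX512VL's vprold performs each rotate in a single instruction.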
+ vpaddd ymm2,ymm2,ymm6 + vpaddd ymm3,ymm3,ymm7 + vpxor ymm12,ymm12,ymm0 + vpxor ymm13,ymm13,ymm1 + vpxor ymm14,ymm14,ymm2 + vpxor ymm15,ymm15,ymm3 + vprold ymm12,ymm12,16 + vprold ymm13,ymm13,16 + vprold ymm14,ymm14,16 + vprold ymm15,ymm15,16 + vpaddd ymm8,ymm8,ymm12 + vpaddd ymm9,ymm9,ymm13 + vpaddd ymm10,ymm10,ymm14 + vpaddd ymm11,ymm11,ymm15 + vpxor ymm4,ymm4,ymm8 + vpxor ymm5,ymm5,ymm9 + vpxor ymm6,ymm6,ymm10 + vpxor ymm7,ymm7,ymm11 + vprold ymm4,ymm4,12 + vprold ymm5,ymm5,12 + vprold ymm6,ymm6,12 + vprold ymm7,ymm7,12 + vpaddd ymm0,ymm0,ymm4 + vpaddd ymm1,ymm1,ymm5 + vpaddd ymm2,ymm2,ymm6 + vpaddd ymm3,ymm3,ymm7 + vpxor ymm12,ymm12,ymm0 + vpxor ymm13,ymm13,ymm1 + vpxor ymm14,ymm14,ymm2 + vpxor ymm15,ymm15,ymm3 + vprold ymm12,ymm12,8 + vprold ymm13,ymm13,8 + vprold ymm14,ymm14,8 + vprold ymm15,ymm15,8 + vpaddd ymm8,ymm8,ymm12 + vpaddd ymm9,ymm9,ymm13 + vpaddd ymm10,ymm10,ymm14 + vpaddd ymm11,ymm11,ymm15 + vpxor ymm4,ymm4,ymm8 + vpxor ymm5,ymm5,ymm9 + vpxor ymm6,ymm6,ymm10 + vpxor ymm7,ymm7,ymm11 + vprold ymm4,ymm4,7 + vprold ymm5,ymm5,7 + vprold ymm6,ymm6,7 + vprold ymm7,ymm7,7 + vpaddd ymm0,ymm0,ymm5 + vpaddd ymm1,ymm1,ymm6 + vpaddd ymm2,ymm2,ymm7 + vpaddd ymm3,ymm3,ymm4 + vpxor ymm15,ymm15,ymm0 + vpxor ymm12,ymm12,ymm1 + vpxor ymm13,ymm13,ymm2 + vpxor ymm14,ymm14,ymm3 + vprold ymm15,ymm15,16 + vprold ymm12,ymm12,16 + vprold ymm13,ymm13,16 + vprold ymm14,ymm14,16 + vpaddd ymm10,ymm10,ymm15 + vpaddd ymm11,ymm11,ymm12 + vpaddd ymm8,ymm8,ymm13 + vpaddd ymm9,ymm9,ymm14 + vpxor ymm5,ymm5,ymm10 + vpxor ymm6,ymm6,ymm11 + vpxor ymm7,ymm7,ymm8 + vpxor ymm4,ymm4,ymm9 + vprold ymm5,ymm5,12 + vprold ymm6,ymm6,12 + vprold ymm7,ymm7,12 + vprold ymm4,ymm4,12 + vpaddd ymm0,ymm0,ymm5 + vpaddd ymm1,ymm1,ymm6 + vpaddd ymm2,ymm2,ymm7 + vpaddd ymm3,ymm3,ymm4 + vpxor ymm15,ymm15,ymm0 + vpxor ymm12,ymm12,ymm1 + vpxor ymm13,ymm13,ymm2 + vpxor ymm14,ymm14,ymm3 + vprold ymm15,ymm15,8 + vprold ymm12,ymm12,8 + vprold ymm13,ymm13,8 + vprold ymm14,ymm14,8 + vpaddd ymm10,ymm10,ymm15 + vpaddd ymm11,ymm11,ymm12 + vpaddd ymm8,ymm8,ymm13 + vpaddd ymm9,ymm9,ymm14 + vpxor ymm5,ymm5,ymm10 + vpxor ymm6,ymm6,ymm11 + vpxor ymm7,ymm7,ymm8 + vpxor ymm4,ymm4,ymm9 + vprold ymm5,ymm5,7 + vprold ymm6,ymm6,7 + vprold ymm7,ymm7,7 + vprold ymm4,ymm4,7 + dec eax + jnz NEAR $L$oop8xvl + + vpaddd ymm0,ymm0,ymm16 + vpaddd ymm1,ymm1,ymm17 + vpaddd ymm2,ymm2,ymm18 + vpaddd ymm3,ymm3,ymm19 + + vpunpckldq ymm18,ymm0,ymm1 + vpunpckldq ymm19,ymm2,ymm3 + vpunpckhdq ymm0,ymm0,ymm1 + vpunpckhdq ymm2,ymm2,ymm3 + vpunpcklqdq ymm1,ymm18,ymm19 + vpunpckhqdq ymm18,ymm18,ymm19 + vpunpcklqdq ymm3,ymm0,ymm2 + vpunpckhqdq ymm0,ymm0,ymm2 + vpaddd ymm4,ymm4,ymm20 + vpaddd ymm5,ymm5,ymm21 + vpaddd ymm6,ymm6,ymm22 + vpaddd ymm7,ymm7,ymm23 + + vpunpckldq ymm2,ymm4,ymm5 + vpunpckldq ymm19,ymm6,ymm7 + vpunpckhdq ymm4,ymm4,ymm5 + vpunpckhdq ymm6,ymm6,ymm7 + vpunpcklqdq ymm5,ymm2,ymm19 + vpunpckhqdq ymm2,ymm2,ymm19 + vpunpcklqdq ymm7,ymm4,ymm6 + vpunpckhqdq ymm4,ymm4,ymm6 + vshufi32x4 ymm19,ymm1,ymm5,0 + vshufi32x4 ymm5,ymm1,ymm5,3 + vshufi32x4 ymm1,ymm18,ymm2,0 + vshufi32x4 ymm2,ymm18,ymm2,3 + vshufi32x4 ymm18,ymm3,ymm7,0 + vshufi32x4 ymm7,ymm3,ymm7,3 + vshufi32x4 ymm3,ymm0,ymm4,0 + vshufi32x4 ymm4,ymm0,ymm4,3 + vpaddd ymm8,ymm8,ymm24 + vpaddd ymm9,ymm9,ymm25 + vpaddd ymm10,ymm10,ymm26 + vpaddd ymm11,ymm11,ymm27 + + vpunpckldq ymm6,ymm8,ymm9 + vpunpckldq ymm0,ymm10,ymm11 + vpunpckhdq ymm8,ymm8,ymm9 + vpunpckhdq ymm10,ymm10,ymm11 + vpunpcklqdq ymm9,ymm6,ymm0 + vpunpckhqdq ymm6,ymm6,ymm0 + vpunpcklqdq ymm11,ymm8,ymm10 + vpunpckhqdq ymm8,ymm8,ymm10 + vpaddd 
ymm12,ymm12,ymm28 + vpaddd ymm13,ymm13,ymm29 + vpaddd ymm14,ymm14,ymm30 + vpaddd ymm15,ymm15,ymm31 + + vpunpckldq ymm10,ymm12,ymm13 + vpunpckldq ymm0,ymm14,ymm15 + vpunpckhdq ymm12,ymm12,ymm13 + vpunpckhdq ymm14,ymm14,ymm15 + vpunpcklqdq ymm13,ymm10,ymm0 + vpunpckhqdq ymm10,ymm10,ymm0 + vpunpcklqdq ymm15,ymm12,ymm14 + vpunpckhqdq ymm12,ymm12,ymm14 + vperm2i128 ymm0,ymm9,ymm13,0x20 + vperm2i128 ymm13,ymm9,ymm13,0x31 + vperm2i128 ymm9,ymm6,ymm10,0x20 + vperm2i128 ymm10,ymm6,ymm10,0x31 + vperm2i128 ymm6,ymm11,ymm15,0x20 + vperm2i128 ymm15,ymm11,ymm15,0x31 + vperm2i128 ymm11,ymm8,ymm12,0x20 + vperm2i128 ymm12,ymm8,ymm12,0x31 + cmp rdx,64*8 + jb NEAR $L$tail8xvl + + mov eax,0x80 + vpxord ymm19,ymm19,YMMWORD[rsi] + vpxor ymm0,ymm0,YMMWORD[32+rsi] + vpxor ymm5,ymm5,YMMWORD[64+rsi] + vpxor ymm13,ymm13,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu32 YMMWORD[rdi],ymm19 + vmovdqu YMMWORD[32+rdi],ymm0 + vmovdqu YMMWORD[64+rdi],ymm5 + vmovdqu YMMWORD[96+rdi],ymm13 + lea rdi,[rax*1+rdi] + + vpxor ymm1,ymm1,YMMWORD[rsi] + vpxor ymm9,ymm9,YMMWORD[32+rsi] + vpxor ymm2,ymm2,YMMWORD[64+rsi] + vpxor ymm10,ymm10,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu YMMWORD[rdi],ymm1 + vmovdqu YMMWORD[32+rdi],ymm9 + vmovdqu YMMWORD[64+rdi],ymm2 + vmovdqu YMMWORD[96+rdi],ymm10 + lea rdi,[rax*1+rdi] + + vpxord ymm18,ymm18,YMMWORD[rsi] + vpxor ymm6,ymm6,YMMWORD[32+rsi] + vpxor ymm7,ymm7,YMMWORD[64+rsi] + vpxor ymm15,ymm15,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu32 YMMWORD[rdi],ymm18 + vmovdqu YMMWORD[32+rdi],ymm6 + vmovdqu YMMWORD[64+rdi],ymm7 + vmovdqu YMMWORD[96+rdi],ymm15 + lea rdi,[rax*1+rdi] + + vpxor ymm3,ymm3,YMMWORD[rsi] + vpxor ymm11,ymm11,YMMWORD[32+rsi] + vpxor ymm4,ymm4,YMMWORD[64+rsi] + vpxor ymm12,ymm12,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu YMMWORD[rdi],ymm3 + vmovdqu YMMWORD[32+rdi],ymm11 + vmovdqu YMMWORD[64+rdi],ymm4 + vmovdqu YMMWORD[96+rdi],ymm12 + lea rdi,[rax*1+rdi] + + vpbroadcastd ymm0,DWORD[r10] + vpbroadcastd ymm1,DWORD[4+r10] + + sub rdx,64*8 + jnz NEAR $L$oop_outer8xvl + + jmp NEAR $L$done8xvl + +ALIGN 32 +$L$tail8xvl: + vmovdqa64 ymm8,ymm19 + xor r10,r10 + sub rdi,rsi + cmp rdx,64*1 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm8,ymm8,YMMWORD[rsi] + vpxor ymm0,ymm0,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm8 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm0 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm5 + vmovdqa ymm0,ymm13 + lea rsi,[64+rsi] + + cmp rdx,64*2 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm5,ymm5,YMMWORD[rsi] + vpxor ymm13,ymm13,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm5 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm13 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm1 + vmovdqa ymm0,ymm9 + lea rsi,[64+rsi] + + cmp rdx,64*3 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm1,ymm1,YMMWORD[rsi] + vpxor ymm9,ymm9,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm1 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm9 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm2 + vmovdqa ymm0,ymm10 + lea rsi,[64+rsi] + + cmp rdx,64*4 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm2,ymm2,YMMWORD[rsi] + vpxor ymm10,ymm10,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm2 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm10 + je NEAR $L$done8xvl + vmovdqa32 ymm8,ymm18 + vmovdqa ymm0,ymm6 + lea rsi,[64+rsi] + + cmp rdx,64*5 + jb NEAR $L$ess_than_64_8xvl + vpxord ymm18,ymm18,YMMWORD[rsi] + vpxor ymm6,ymm6,YMMWORD[32+rsi] + vmovdqu32 YMMWORD[rsi*1+rdi],ymm18 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm6 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm7 + vmovdqa ymm0,ymm15 + lea rsi,[64+rsi] + + cmp rdx,64*6 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm7,ymm7,YMMWORD[rsi] + vpxor 
ymm15,ymm15,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm7 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm15 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm3 + vmovdqa ymm0,ymm11 + lea rsi,[64+rsi] + + cmp rdx,64*7 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm3,ymm3,YMMWORD[rsi] + vpxor ymm11,ymm11,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm3 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm11 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm4 + vmovdqa ymm0,ymm12 + lea rsi,[64+rsi] + +$L$ess_than_64_8xvl: + vmovdqa YMMWORD[rsp],ymm8 + vmovdqa YMMWORD[32+rsp],ymm0 + lea rdi,[rsi*1+rdi] + and rdx,63 + +$L$oop_tail8xvl: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail8xvl + + vpxor ymm8,ymm8,ymm8 + vmovdqa YMMWORD[rsp],ymm8 + vmovdqa YMMWORD[32+rsp],ymm8 + +$L$done8xvl: + vzeroall + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$8xvl_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_8xvl: +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +ssse3_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[192+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + lea rsi,[((-40))+rax] + lea rdi,[512+r8] + mov ecx,4 + DD 0xa548f3fc + +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + DB 0F3h,0C3h ;repret + + + +ALIGN 16 +full_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[192+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + lea rsi,[((-168))+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + + jmp NEAR $L$common_seh_tail + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_chacha20_ssse3 wrt ..imagebase + DD $L$SEH_end_chacha20_ssse3 wrt ..imagebase + DD $L$SEH_info_chacha20_ssse3 wrt ..imagebase + + DD $L$SEH_begin_chacha20_4x wrt ..imagebase + DD $L$SEH_end_chacha20_4x wrt ..imagebase + DD $L$SEH_info_chacha20_4x wrt ..imagebase + DD $L$SEH_begin_chacha20_avx2 wrt ..imagebase + DD $L$SEH_end_chacha20_avx2 wrt ..imagebase + DD $L$SEH_info_chacha20_avx2 wrt 
..imagebase + DD $L$SEH_begin_chacha20_avx512 wrt ..imagebase + DD $L$SEH_end_chacha20_avx512 wrt ..imagebase + DD $L$SEH_info_chacha20_avx512 wrt ..imagebase + + DD $L$SEH_begin_chacha20_avx512vl wrt ..imagebase + DD $L$SEH_end_chacha20_avx512vl wrt ..imagebase + DD $L$SEH_info_chacha20_avx512vl wrt ..imagebase + + DD $L$SEH_begin_chacha20_16x wrt ..imagebase + DD $L$SEH_end_chacha20_16x wrt ..imagebase + DD $L$SEH_info_chacha20_16x wrt ..imagebase + + DD $L$SEH_begin_chacha20_8xvl wrt ..imagebase + DD $L$SEH_end_chacha20_8xvl wrt ..imagebase + DD $L$SEH_info_chacha20_8xvl wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_info_chacha20_ssse3: +DB 9,0,0,0 + DD ssse3_handler wrt ..imagebase + DD $L$ssse3_body wrt ..imagebase,$L$ssse3_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_4x: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$4x_body wrt ..imagebase,$L$4x_epilogue wrt ..imagebase +$L$SEH_info_chacha20_avx2: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$8x_body wrt ..imagebase,$L$8x_epilogue wrt ..imagebase +$L$SEH_info_chacha20_avx512: +DB 9,0,0,0 + DD ssse3_handler wrt ..imagebase + DD $L$avx512_body wrt ..imagebase,$L$avx512_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_avx512vl: +DB 9,0,0,0 + DD ssse3_handler wrt ..imagebase + DD $L$avx512vl_body wrt ..imagebase,$L$avx512vl_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_16x: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$16x_body wrt ..imagebase,$L$16x_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_8xvl: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$8xvl_body wrt ..imagebase,$L$8xvl_epilogue wrt ..imagebase diff --git a/crypto/chacha20_x64_gas.s b/crypto/chacha20_x64_gas.s new file mode 100644 index 0000000..0aaf4ba --- /dev/null +++ b/crypto/chacha20_x64_gas.s @@ -0,0 +1,2623 @@ +.text + +.align 64 +.Lzero: +.long 0,0,0,0 +.Lone: +.long 1,0,0,0 +.Linc: +.long 0,1,2,3 +.Lfour: +.long 4,4,4,4 +.Lincy: +.long 0,2,4,6,1,3,5,7 +.Leight: +.long 8,8,8,8,8,8,8,8 +.Lrot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +.Lrot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +.Lsigma: +.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 +.align 64 +.Lzeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +.Lfourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +.Lincz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.Lsixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.align 64 +.Ltwoy: +.long 2,0,0,0, 2,0,0,0 + +.global hchacha20_ssse3 +.type hchacha20_ssse3,@function +.align 32 +hchacha20_ssse3: +.cfi_startproc +.Lhchacha20_ssse3: + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + movq $10,%r8 +.align 32 +.Loop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa 
%xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret +.cfi_endproc +.size hchacha20_ssse3,.-hchacha20_ssse3 +.global chacha20_ssse3 +.type chacha20_ssse3,@function +.align 32 +chacha20_ssse3: +.cfi_startproc +.Lchacha20_ssse3: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $128,%rdx + ja .Lchacha20_4x + +.Ldo_sse3_after_all: + subq $64+8,%rsp + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rcx),%xmm1 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + movq $10,%r8 + jmp .Loop_ssse3 + +.align 32 +.Loop_outer_ssse3: + movdqa .Lone(%rip),%xmm3 + movdqa 0(%rsp),%xmm0 + movdqa 16(%rsp),%xmm1 + movdqa 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + movq $10,%r8 + movdqa %xmm3,48(%rsp) + jmp .Loop_ssse3 + +.align 32 +.Loop_ssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_ssse3 + paddd 0(%rsp),%xmm0 + paddd 16(%rsp),%xmm1 + paddd 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + + cmpq $64,%rdx + jb .Ltail_ssse3 + + movdqu 0(%rsi),%xmm4 + movdqu 16(%rsi),%xmm5 + pxor %xmm4,%xmm0 + movdqu 32(%rsi),%xmm4 + pxor %xmm5,%xmm1 + movdqu 48(%rsi),%xmm5 + leaq 64(%rsi),%rsi + pxor %xmm4,%xmm2 + pxor %xmm5,%xmm3 + + movdqu %xmm0,0(%rdi) + movdqu %xmm1,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm3,48(%rdi) + leaq 64(%rdi),%rdi + + subq $64,%rdx + jnz .Loop_outer_ssse3 + + jmp .Ldone_ssse3 + +.align 16 +.Ltail_ssse3: + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + xorq %r8,%r8 + +.Loop_tail_ssse3: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_ssse3 + +.Ldone_ssse3: + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lssse3_epilogue: + ret +.cfi_endproc +.size chacha20_ssse3,.-chacha20_ssse3 +.global chacha20_4x +.type chacha20_4x,@function +.align 32 +chacha20_4x: +.cfi_startproc +.Lchacha20_4x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + + + + + + + + + + + +.Lproceed4x: + subq $0x140+8,%rsp + movdqa .Lsigma(%rip),%xmm11 + movdqu (%rcx),%xmm15 + movdqu 16(%rcx),%xmm7 + movdqu (%r8),%xmm3 + leaq 256(%rsp),%rcx + leaq .Lrot16(%rip),%r10 + leaq .Lrot24(%rip),%r11 + + pshufd $0x00,%xmm11,%xmm8 + pshufd $0x55,%xmm11,%xmm9 + movdqa %xmm8,64(%rsp) + pshufd $0xaa,%xmm11,%xmm10 + movdqa %xmm9,80(%rsp) + pshufd $0xff,%xmm11,%xmm11 + movdqa %xmm10,96(%rsp) + movdqa %xmm11,112(%rsp) + + pshufd $0x00,%xmm15,%xmm12 + pshufd $0x55,%xmm15,%xmm13 + movdqa 
%xmm12,128-256(%rcx) + pshufd $0xaa,%xmm15,%xmm14 + movdqa %xmm13,144-256(%rcx) + pshufd $0xff,%xmm15,%xmm15 + movdqa %xmm14,160-256(%rcx) + movdqa %xmm15,176-256(%rcx) + + pshufd $0x00,%xmm7,%xmm4 + pshufd $0x55,%xmm7,%xmm5 + movdqa %xmm4,192-256(%rcx) + pshufd $0xaa,%xmm7,%xmm6 + movdqa %xmm5,208-256(%rcx) + pshufd $0xff,%xmm7,%xmm7 + movdqa %xmm6,224-256(%rcx) + movdqa %xmm7,240-256(%rcx) + + pshufd $0x00,%xmm3,%xmm0 + pshufd $0x55,%xmm3,%xmm1 + paddd .Linc(%rip),%xmm0 + pshufd $0xaa,%xmm3,%xmm2 + movdqa %xmm1,272-256(%rcx) + pshufd $0xff,%xmm3,%xmm3 + movdqa %xmm2,288-256(%rcx) + movdqa %xmm3,304-256(%rcx) + + jmp .Loop_enter4x + +.align 32 +.Loop_outer4x: + movdqa 64(%rsp),%xmm8 + movdqa 80(%rsp),%xmm9 + movdqa 96(%rsp),%xmm10 + movdqa 112(%rsp),%xmm11 + movdqa 128-256(%rcx),%xmm12 + movdqa 144-256(%rcx),%xmm13 + movdqa 160-256(%rcx),%xmm14 + movdqa 176-256(%rcx),%xmm15 + movdqa 192-256(%rcx),%xmm4 + movdqa 208-256(%rcx),%xmm5 + movdqa 224-256(%rcx),%xmm6 + movdqa 240-256(%rcx),%xmm7 + movdqa 256-256(%rcx),%xmm0 + movdqa 272-256(%rcx),%xmm1 + movdqa 288-256(%rcx),%xmm2 + movdqa 304-256(%rcx),%xmm3 + paddd .Lfour(%rip),%xmm0 + +.Loop_enter4x: + movdqa %xmm6,32(%rsp) + movdqa %xmm7,48(%rsp) + movdqa (%r10),%xmm7 + movl $10,%eax + movdqa %xmm0,256-256(%rcx) + jmp .Loop4x + +.align 32 +.Loop4x: + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm7,%xmm0 + pshufb %xmm7,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm6 + pslld $12,%xmm12 + psrld $20,%xmm6 + movdqa %xmm13,%xmm7 + pslld $12,%xmm13 + por %xmm6,%xmm12 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm13 + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm6,%xmm0 + pshufb %xmm6,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm7 + pslld $7,%xmm12 + psrld $25,%xmm7 + movdqa %xmm13,%xmm6 + pslld $7,%xmm13 + por %xmm7,%xmm12 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm13 + movdqa %xmm4,0(%rsp) + movdqa %xmm5,16(%rsp) + movdqa 32(%rsp),%xmm4 + movdqa 48(%rsp),%xmm5 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm7,%xmm2 + pshufb %xmm7,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm6 + pslld $12,%xmm14 + psrld $20,%xmm6 + movdqa %xmm15,%xmm7 + pslld $12,%xmm15 + por %xmm6,%xmm14 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm15 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm6,%xmm2 + pshufb %xmm6,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm7 + pslld $7,%xmm14 + psrld $25,%xmm7 + movdqa %xmm15,%xmm6 + pslld $7,%xmm15 + por %xmm7,%xmm14 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm15 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm7,%xmm3 + pshufb %xmm7,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm6 + pslld $12,%xmm13 + psrld $20,%xmm6 + movdqa %xmm14,%xmm7 + pslld $12,%xmm14 + por %xmm6,%xmm13 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm14 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm6,%xmm3 + pshufb %xmm6,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm7 + pslld 
$7,%xmm13 + psrld $25,%xmm7 + movdqa %xmm14,%xmm6 + pslld $7,%xmm14 + por %xmm7,%xmm13 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm14 + movdqa %xmm4,32(%rsp) + movdqa %xmm5,48(%rsp) + movdqa 0(%rsp),%xmm4 + movdqa 16(%rsp),%xmm5 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm7,%xmm1 + pshufb %xmm7,%xmm2 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm6 + pslld $12,%xmm15 + psrld $20,%xmm6 + movdqa %xmm12,%xmm7 + pslld $12,%xmm12 + por %xmm6,%xmm15 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm12 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm6,%xmm1 + pshufb %xmm6,%xmm2 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm7 + pslld $7,%xmm15 + psrld $25,%xmm7 + movdqa %xmm12,%xmm6 + pslld $7,%xmm12 + por %xmm7,%xmm15 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm12 + decl %eax + jnz .Loop4x + + paddd 64(%rsp),%xmm8 + paddd 80(%rsp),%xmm9 + paddd 96(%rsp),%xmm10 + paddd 112(%rsp),%xmm11 + + movdqa %xmm8,%xmm6 + punpckldq %xmm9,%xmm8 + movdqa %xmm10,%xmm7 + punpckldq %xmm11,%xmm10 + punpckhdq %xmm9,%xmm6 + punpckhdq %xmm11,%xmm7 + movdqa %xmm8,%xmm9 + punpcklqdq %xmm10,%xmm8 + movdqa %xmm6,%xmm11 + punpcklqdq %xmm7,%xmm6 + punpckhqdq %xmm10,%xmm9 + punpckhqdq %xmm7,%xmm11 + paddd 128-256(%rcx),%xmm12 + paddd 144-256(%rcx),%xmm13 + paddd 160-256(%rcx),%xmm14 + paddd 176-256(%rcx),%xmm15 + + movdqa %xmm8,0(%rsp) + movdqa %xmm9,16(%rsp) + movdqa 32(%rsp),%xmm8 + movdqa 48(%rsp),%xmm9 + + movdqa %xmm12,%xmm10 + punpckldq %xmm13,%xmm12 + movdqa %xmm14,%xmm7 + punpckldq %xmm15,%xmm14 + punpckhdq %xmm13,%xmm10 + punpckhdq %xmm15,%xmm7 + movdqa %xmm12,%xmm13 + punpcklqdq %xmm14,%xmm12 + movdqa %xmm10,%xmm15 + punpcklqdq %xmm7,%xmm10 + punpckhqdq %xmm14,%xmm13 + punpckhqdq %xmm7,%xmm15 + paddd 192-256(%rcx),%xmm4 + paddd 208-256(%rcx),%xmm5 + paddd 224-256(%rcx),%xmm8 + paddd 240-256(%rcx),%xmm9 + + movdqa %xmm6,32(%rsp) + movdqa %xmm11,48(%rsp) + + movdqa %xmm4,%xmm14 + punpckldq %xmm5,%xmm4 + movdqa %xmm8,%xmm7 + punpckldq %xmm9,%xmm8 + punpckhdq %xmm5,%xmm14 + punpckhdq %xmm9,%xmm7 + movdqa %xmm4,%xmm5 + punpcklqdq %xmm8,%xmm4 + movdqa %xmm14,%xmm9 + punpcklqdq %xmm7,%xmm14 + punpckhqdq %xmm8,%xmm5 + punpckhqdq %xmm7,%xmm9 + paddd 256-256(%rcx),%xmm0 + paddd 272-256(%rcx),%xmm1 + paddd 288-256(%rcx),%xmm2 + paddd 304-256(%rcx),%xmm3 + + movdqa %xmm0,%xmm8 + punpckldq %xmm1,%xmm0 + movdqa %xmm2,%xmm7 + punpckldq %xmm3,%xmm2 + punpckhdq %xmm1,%xmm8 + punpckhdq %xmm3,%xmm7 + movdqa %xmm0,%xmm1 + punpcklqdq %xmm2,%xmm0 + movdqa %xmm8,%xmm3 + punpcklqdq %xmm7,%xmm8 + punpckhqdq %xmm2,%xmm1 + punpckhqdq %xmm7,%xmm3 + cmpq $256,%rdx + jb .Ltail4x + + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor 
%xmm14,%xmm2 + pxor %xmm8,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 48(%rsp),%xmm6 + pxor %xmm15,%xmm11 + pxor %xmm9,%xmm2 + pxor %xmm3,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + + subq $256,%rdx + jnz .Loop_outer4x + + jmp .Ldone4x + +.Ltail4x: + cmpq $192,%rdx + jae .L192_or_more4x + cmpq $128,%rdx + jae .L128_or_more4x + cmpq $64,%rdx + jae .L64_or_more4x + + + xorq %r10,%r10 + + movdqa %xmm12,16(%rsp) + movdqa %xmm4,32(%rsp) + movdqa %xmm0,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L64_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je .Ldone4x + + movdqa 16(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm13,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm5,32(%rsp) + subq $64,%rdx + movdqa %xmm1,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L128_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + je .Ldone4x + + movdqa 32(%rsp),%xmm6 + leaq 128(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm10,16(%rsp) + leaq 128(%rdi),%rdi + movdqa %xmm14,32(%rsp) + subq $128,%rdx + movdqa %xmm8,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L192_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je .Ldone4x + + movdqa 48(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm15,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm9,32(%rsp) + subq $192,%rdx + movdqa %xmm3,48(%rsp) + +.Loop_tail4x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail4x + +.Ldone4x: + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4x_epilogue: + ret +.cfi_endproc +.size chacha20_4x,.-chacha20_4x +.global chacha20_avx2 +.type 
chacha20_avx2,@function +.align 32 +chacha20_avx2: +.cfi_startproc +.Lchacha20_avx2: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $0x280+8,%rsp + andq $-32,%rsp + vzeroupper + + vbroadcasti128 .Lsigma(%rip),%ymm11 + vbroadcasti128 (%rcx),%ymm3 + vbroadcasti128 16(%rcx),%ymm15 + vbroadcasti128 (%r8),%ymm7 + leaq 256(%rsp),%rcx + leaq 512(%rsp),%rax + leaq .Lrot16(%rip),%r10 + leaq .Lrot24(%rip),%r11 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vmovdqa %ymm8,128-256(%rcx) + vpshufd $0xaa,%ymm11,%ymm10 + vmovdqa %ymm9,160-256(%rcx) + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa %ymm10,192-256(%rcx) + vmovdqa %ymm11,224-256(%rcx) + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vmovdqa %ymm0,256-256(%rcx) + vpshufd $0xaa,%ymm3,%ymm2 + vmovdqa %ymm1,288-256(%rcx) + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa %ymm2,320-256(%rcx) + vmovdqa %ymm3,352-256(%rcx) + + vpshufd $0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vmovdqa %ymm12,384-512(%rax) + vpshufd $0xaa,%ymm15,%ymm14 + vmovdqa %ymm13,416-512(%rax) + vpshufd $0xff,%ymm15,%ymm15 + vmovdqa %ymm14,448-512(%rax) + vmovdqa %ymm15,480-512(%rax) + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpaddd .Lincy(%rip),%ymm4,%ymm4 + vpshufd $0xaa,%ymm7,%ymm6 + vmovdqa %ymm5,544-512(%rax) + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa %ymm6,576-512(%rax) + vmovdqa %ymm7,608-512(%rax) + + jmp .Loop_enter8x + +.align 32 +.Loop_outer8x: + vmovdqa 128-256(%rcx),%ymm8 + vmovdqa 160-256(%rcx),%ymm9 + vmovdqa 192-256(%rcx),%ymm10 + vmovdqa 224-256(%rcx),%ymm11 + vmovdqa 256-256(%rcx),%ymm0 + vmovdqa 288-256(%rcx),%ymm1 + vmovdqa 320-256(%rcx),%ymm2 + vmovdqa 352-256(%rcx),%ymm3 + vmovdqa 384-512(%rax),%ymm12 + vmovdqa 416-512(%rax),%ymm13 + vmovdqa 448-512(%rax),%ymm14 + vmovdqa 480-512(%rax),%ymm15 + vmovdqa 512-512(%rax),%ymm4 + vmovdqa 544-512(%rax),%ymm5 + vmovdqa 576-512(%rax),%ymm6 + vmovdqa 608-512(%rax),%ymm7 + vpaddd .Leight(%rip),%ymm4,%ymm4 + +.Loop_enter8x: + vmovdqa %ymm14,64(%rsp) + vmovdqa %ymm15,96(%rsp) + vbroadcasti128 (%r10),%ymm15 + vmovdqa %ymm4,512-512(%rax) + movl $10,%eax + jmp .Loop8x + +.align 32 +.Loop8x: + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $12,%ymm0,%ymm14 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $12,%ymm1,%ymm15 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $7,%ymm0,%ymm15 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $7,%ymm1,%ymm14 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vmovdqa %ymm12,0(%rsp) + vmovdqa %ymm13,32(%rsp) + vmovdqa 64(%rsp),%ymm12 + vmovdqa 96(%rsp),%ymm13 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $12,%ymm2,%ymm14 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld 
$12,%ymm3,%ymm15 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $7,%ymm2,%ymm15 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $7,%ymm3,%ymm14 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $12,%ymm1,%ymm14 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $12,%ymm2,%ymm15 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $7,%ymm1,%ymm15 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $7,%ymm2,%ymm14 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vmovdqa %ymm12,64(%rsp) + vmovdqa %ymm13,96(%rsp) + vmovdqa 0(%rsp),%ymm12 + vmovdqa 32(%rsp),%ymm13 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $12,%ymm3,%ymm14 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $12,%ymm0,%ymm15 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $7,%ymm3,%ymm15 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $7,%ymm0,%ymm14 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + decl %eax + jnz .Loop8x + + leaq 512(%rsp),%rax + vpaddd 128-256(%rcx),%ymm8,%ymm8 + vpaddd 160-256(%rcx),%ymm9,%ymm9 + vpaddd 192-256(%rcx),%ymm10,%ymm10 + vpaddd 224-256(%rcx),%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm14 + vpunpckldq %ymm11,%ymm10,%ymm15 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm15,%ymm14,%ymm9 + vpunpckhqdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd 256-256(%rcx),%ymm0,%ymm0 + vpaddd 288-256(%rcx),%ymm1,%ymm1 + vpaddd 320-256(%rcx),%ymm2,%ymm2 + vpaddd 352-256(%rcx),%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm10 + vpunpckldq %ymm3,%ymm2,%ymm15 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm10,%ymm1 + vpunpckhqdq %ymm15,%ymm10,%ymm10 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vperm2i128 $0x20,%ymm1,%ymm9,%ymm15 + vperm2i128 $0x31,%ymm1,%ymm9,%ymm1 + vperm2i128 $0x20,%ymm10,%ymm14,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm14,%ymm10 + vperm2i128 $0x20,%ymm3,%ymm11,%ymm14 + 
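# This vpunpck/vperm2i128 ladder transposes the state from word-sliced order (each ymm holding the same state word for eight blocks) back into byte order, yielding contiguous 64-byte keystream blocks for the XOR with the input. +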
vperm2i128 $0x31,%ymm3,%ymm11,%ymm3 + vperm2i128 $0x20,%ymm0,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm0,%ymm8,%ymm0 + vmovdqa %ymm15,0(%rsp) + vmovdqa %ymm9,32(%rsp) + vmovdqa 64(%rsp),%ymm15 + vmovdqa 96(%rsp),%ymm9 + + vpaddd 384-512(%rax),%ymm12,%ymm12 + vpaddd 416-512(%rax),%ymm13,%ymm13 + vpaddd 448-512(%rax),%ymm15,%ymm15 + vpaddd 480-512(%rax),%ymm9,%ymm9 + + vpunpckldq %ymm13,%ymm12,%ymm2 + vpunpckldq %ymm9,%ymm15,%ymm8 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm9,%ymm15,%ymm15 + vpunpcklqdq %ymm8,%ymm2,%ymm13 + vpunpckhqdq %ymm8,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm12,%ymm9 + vpunpckhqdq %ymm15,%ymm12,%ymm12 + vpaddd 512-512(%rax),%ymm4,%ymm4 + vpaddd 544-512(%rax),%ymm5,%ymm5 + vpaddd 576-512(%rax),%ymm6,%ymm6 + vpaddd 608-512(%rax),%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm15 + vpunpckldq %ymm7,%ymm6,%ymm8 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + vpunpcklqdq %ymm8,%ymm15,%ymm5 + vpunpckhqdq %ymm8,%ymm15,%ymm15 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vperm2i128 $0x20,%ymm5,%ymm13,%ymm8 + vperm2i128 $0x31,%ymm5,%ymm13,%ymm5 + vperm2i128 $0x20,%ymm15,%ymm2,%ymm13 + vperm2i128 $0x31,%ymm15,%ymm2,%ymm15 + vperm2i128 $0x20,%ymm7,%ymm9,%ymm2 + vperm2i128 $0x31,%ymm7,%ymm9,%ymm7 + vperm2i128 $0x20,%ymm4,%ymm12,%ymm9 + vperm2i128 $0x31,%ymm4,%ymm12,%ymm4 + vmovdqa 0(%rsp),%ymm6 + vmovdqa 32(%rsp),%ymm12 + + cmpq $512,%rdx + jb .Ltail8x + + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + leaq 128(%rsi),%rsi + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm12,%ymm12 + vpxor 32(%rsi),%ymm13,%ymm13 + vpxor 64(%rsi),%ymm10,%ymm10 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq 128(%rsi),%rsi + vmovdqu %ymm12,0(%rdi) + vmovdqu %ymm13,32(%rdi) + vmovdqu %ymm10,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm14,%ymm14 + vpxor 32(%rsi),%ymm2,%ymm2 + vpxor 64(%rsi),%ymm3,%ymm3 + vpxor 96(%rsi),%ymm7,%ymm7 + leaq 128(%rsi),%rsi + vmovdqu %ymm14,0(%rdi) + vmovdqu %ymm2,32(%rdi) + vmovdqu %ymm3,64(%rdi) + vmovdqu %ymm7,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm11,%ymm11 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm0,%ymm0 + vpxor 96(%rsi),%ymm4,%ymm4 + leaq 128(%rsi),%rsi + vmovdqu %ymm11,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm0,64(%rdi) + vmovdqu %ymm4,96(%rdi) + leaq 128(%rdi),%rdi + + subq $512,%rdx + jnz .Loop_outer8x + + jmp .Ldone8x + +.Ltail8x: + cmpq $448,%rdx + jae .L448_or_more8x + cmpq $384,%rdx + jae .L384_or_more8x + cmpq $320,%rdx + jae .L320_or_more8x + cmpq $256,%rdx + jae .L256_or_more8x + cmpq $192,%rdx + jae .L192_or_more8x + cmpq $128,%rdx + jae .L128_or_more8x + cmpq $64,%rdx + jae .L64_or_more8x + + xorq %r10,%r10 + vmovdqa %ymm6,0(%rsp) + vmovdqa %ymm8,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L64_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + je .Ldone8x + + leaq 64(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm1,0(%rsp) + leaq 64(%rdi),%rdi + subq $64,%rdx + vmovdqa %ymm5,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L128_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + je .Ldone8x + + leaq 128(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm12,0(%rsp) + leaq 128(%rdi),%rdi + subq 
$128,%rdx + vmovdqa %ymm13,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L192_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + je .Ldone8x + + leaq 192(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm10,0(%rsp) + leaq 192(%rdi),%rdi + subq $192,%rdx + vmovdqa %ymm15,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L256_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + je .Ldone8x + + leaq 256(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm14,0(%rsp) + leaq 256(%rdi),%rdi + subq $256,%rdx + vmovdqa %ymm2,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L320_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + je .Ldone8x + + leaq 320(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm3,0(%rsp) + leaq 320(%rdi),%rdi + subq $320,%rdx + vmovdqa %ymm7,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L384_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + je .Ldone8x + + leaq 384(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm11,0(%rsp) + leaq 384(%rdi),%rdi + subq $384,%rdx + vmovdqa %ymm9,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L448_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vpxor 384(%rsi),%ymm11,%ymm11 + vpxor 416(%rsi),%ymm9,%ymm9 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + 
vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + vmovdqu %ymm11,384(%rdi) + vmovdqu %ymm9,416(%rdi) + je .Ldone8x + + leaq 448(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm0,0(%rsp) + leaq 448(%rdi),%rdi + subq $448,%rdx + vmovdqa %ymm4,32(%rsp) + +.Loop_tail8x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail8x + +.Ldone8x: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L8x_epilogue: + ret +.cfi_endproc +.size chacha20_avx2,.-chacha20_avx2 +.global chacha20_avx512 +.type chacha20_avx512,@function +.align 32 +chacha20_avx512: +.cfi_startproc +.Lchacha20_avx512: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $512,%rdx + ja .Lchacha20_16x + + subq $64+8,%rsp + vbroadcasti32x4 .Lsigma(%rip),%zmm0 + vbroadcasti32x4 (%rcx),%zmm1 + vbroadcasti32x4 16(%rcx),%zmm2 + vbroadcasti32x4 (%r8),%zmm3 + + vmovdqa32 %zmm0,%zmm16 + vmovdqa32 %zmm1,%zmm17 + vmovdqa32 %zmm2,%zmm18 + vpaddd .Lzeroz(%rip),%zmm3,%zmm3 + vmovdqa32 .Lfourz(%rip),%zmm20 + movq $10,%r8 + vmovdqa32 %zmm3,%zmm19 + jmp .Loop_avx512 + +.align 16 +.Loop_outer_avx512: + vmovdqa32 %zmm16,%zmm0 + vmovdqa32 %zmm17,%zmm1 + vmovdqa32 %zmm18,%zmm2 + vpaddd %zmm20,%zmm19,%zmm3 + movq $10,%r8 + vmovdqa32 %zmm3,%zmm19 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $16,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $12,%zmm1,%zmm1 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $8,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $7,%zmm1,%zmm1 + vpshufd $78,%zmm2,%zmm2 + vpshufd $57,%zmm1,%zmm1 + vpshufd $147,%zmm3,%zmm3 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $16,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $12,%zmm1,%zmm1 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $8,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $7,%zmm1,%zmm1 + vpshufd $78,%zmm2,%zmm2 + vpshufd $147,%zmm1,%zmm1 + vpshufd $57,%zmm3,%zmm3 + decq %r8 + jnz .Loop_avx512 + vpaddd %zmm16,%zmm0,%zmm0 + vpaddd %zmm17,%zmm1,%zmm1 + vpaddd %zmm18,%zmm2,%zmm2 + vpaddd %zmm19,%zmm3,%zmm3 + + subq $64,%rdx + jb .Ltail64_avx512 + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $1,%zmm0,%xmm4 + vextracti32x4 $1,%zmm1,%xmm5 + vextracti32x4 $1,%zmm2,%xmm6 + vextracti32x4 $1,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $2,%zmm0,%xmm4 + vextracti32x4 $2,%zmm1,%xmm5 + vextracti32x4 $2,%zmm2,%xmm6 + vextracti32x4 $2,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $3,%zmm0,%xmm4 + vextracti32x4 
$3,%zmm1,%xmm5 + vextracti32x4 $3,%zmm2,%xmm6 + vextracti32x4 $3,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jnz .Loop_outer_avx512 + + jmp .Ldone_avx512 + +.align 16 +.Ltail64_avx512: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512 + +.align 16 +.Ltail_avx512: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512 + + vmovdqu32 %zmm16,0(%rsp) + +.Ldone_avx512: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512_epilogue: + ret +.cfi_endproc +.size chacha20_avx512,.-chacha20_avx512 +.global chacha20_avx512vl +.type chacha20_avx512vl,@function +.align 32 +chacha20_avx512vl: +.cfi_startproc +.Lchacha20_avx512vl: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $128,%rdx + ja .Lchacha20_8xvl + + subq $64+8,%rsp + vbroadcasti128 .Lsigma(%rip),%ymm0 + vbroadcasti128 (%rcx),%ymm1 + vbroadcasti128 16(%rcx),%ymm2 + vbroadcasti128 (%r8),%ymm3 + + vmovdqa32 %ymm0,%ymm16 + vmovdqa32 %ymm1,%ymm17 + vmovdqa32 %ymm2,%ymm18 + vpaddd .Lzeroz(%rip),%ymm3,%ymm3 + vmovdqa32 .Ltwoy(%rip),%ymm20 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 16 +.Loop_outer_avx512vl: + vmovdqa32 %ymm18,%ymm2 + vpaddd %ymm20,%ymm19,%ymm3 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 32 +.Loop_avx512vl: + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $57,%ymm1,%ymm1 + vpshufd $147,%ymm3,%ymm3 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $147,%ymm1,%ymm1 + vpshufd $57,%ymm3,%ymm3 + decq %r8 + jnz .Loop_avx512vl + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + subq $64,%rdx + jb .Ltail64_avx512vl + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512vl + + vextracti128 $1,%ymm0,%xmm4 + vextracti128 $1,%ymm1,%xmm5 + vextracti128 $1,%ymm2,%xmm6 + vextracti128 $1,%ymm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512vl + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + vmovdqa32 %ymm16,%ymm0 + vmovdqa32 
%ymm17,%ymm1 + jnz .Loop_outer_avx512vl + + jmp .Ldone_avx512vl + +.align 16 +.Ltail64_avx512vl: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512vl + +.align 16 +.Ltail_avx512vl: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512vl: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512vl + + vmovdqu32 %ymm16,0(%rsp) + vmovdqu32 %ymm16,32(%rsp) + +.Ldone_avx512vl: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512vl_epilogue: + ret +.cfi_endproc +.size chacha20_avx512vl,.-chacha20_avx512vl +.global chacha20_16x +.type chacha20_16x,@function +.align 32 +chacha20_16x: +.cfi_startproc +.Lchacha20_16x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + andq $-64,%rsp + vzeroupper + + leaq .Lsigma(%rip),%r10 + vbroadcasti32x4 (%r10),%zmm3 + vbroadcasti32x4 (%rcx),%zmm7 + vbroadcasti32x4 16(%rcx),%zmm11 + vbroadcasti32x4 (%r8),%zmm15 + + vpshufd $0x00,%zmm3,%zmm0 + vpshufd $0x55,%zmm3,%zmm1 + vpshufd $0xaa,%zmm3,%zmm2 + vpshufd $0xff,%zmm3,%zmm3 + vmovdqa64 %zmm0,%zmm16 + vmovdqa64 %zmm1,%zmm17 + vmovdqa64 %zmm2,%zmm18 + vmovdqa64 %zmm3,%zmm19 + + vpshufd $0x00,%zmm7,%zmm4 + vpshufd $0x55,%zmm7,%zmm5 + vpshufd $0xaa,%zmm7,%zmm6 + vpshufd $0xff,%zmm7,%zmm7 + vmovdqa64 %zmm4,%zmm20 + vmovdqa64 %zmm5,%zmm21 + vmovdqa64 %zmm6,%zmm22 + vmovdqa64 %zmm7,%zmm23 + + vpshufd $0x00,%zmm11,%zmm8 + vpshufd $0x55,%zmm11,%zmm9 + vpshufd $0xaa,%zmm11,%zmm10 + vpshufd $0xff,%zmm11,%zmm11 + vmovdqa64 %zmm8,%zmm24 + vmovdqa64 %zmm9,%zmm25 + vmovdqa64 %zmm10,%zmm26 + vmovdqa64 %zmm11,%zmm27 + + vpshufd $0x00,%zmm15,%zmm12 + vpshufd $0x55,%zmm15,%zmm13 + vpshufd $0xaa,%zmm15,%zmm14 + vpshufd $0xff,%zmm15,%zmm15 + vpaddd .Lincz(%rip),%zmm12,%zmm12 + vmovdqa64 %zmm12,%zmm28 + vmovdqa64 %zmm13,%zmm29 + vmovdqa64 %zmm14,%zmm30 + vmovdqa64 %zmm15,%zmm31 + + movl $10,%eax + jmp .Loop16x + +.align 32 +.Loop_outer16x: + vpbroadcastd 0(%r10),%zmm0 + vpbroadcastd 4(%r10),%zmm1 + vpbroadcastd 8(%r10),%zmm2 + vpbroadcastd 12(%r10),%zmm3 + vpaddd .Lsixteen(%rip),%zmm28,%zmm28 + vmovdqa64 %zmm20,%zmm4 + vmovdqa64 %zmm21,%zmm5 + vmovdqa64 %zmm22,%zmm6 + vmovdqa64 %zmm23,%zmm7 + vmovdqa64 %zmm24,%zmm8 + vmovdqa64 %zmm25,%zmm9 + vmovdqa64 %zmm26,%zmm10 + vmovdqa64 %zmm27,%zmm11 + vmovdqa64 %zmm28,%zmm12 + vmovdqa64 %zmm29,%zmm13 + vmovdqa64 %zmm30,%zmm14 + vmovdqa64 %zmm31,%zmm15 + + vmovdqa64 %zmm0,%zmm16 + vmovdqa64 %zmm1,%zmm17 + vmovdqa64 %zmm2,%zmm18 + vmovdqa64 %zmm3,%zmm19 + + movl $10,%eax + jmp .Loop16x + +.align 32 +.Loop16x: + vpaddd %zmm4,%zmm0,%zmm0 + vpaddd %zmm5,%zmm1,%zmm1 + vpaddd %zmm6,%zmm2,%zmm2 + vpaddd %zmm7,%zmm3,%zmm3 + vpxord %zmm0,%zmm12,%zmm12 + vpxord %zmm1,%zmm13,%zmm13 + vpxord %zmm2,%zmm14,%zmm14 + vpxord %zmm3,%zmm15,%zmm15 + vprold $16,%zmm12,%zmm12 + vprold $16,%zmm13,%zmm13 + vprold $16,%zmm14,%zmm14 + vprold $16,%zmm15,%zmm15 + vpaddd %zmm12,%zmm8,%zmm8 + vpaddd %zmm13,%zmm9,%zmm9 + vpaddd %zmm14,%zmm10,%zmm10 + vpaddd %zmm15,%zmm11,%zmm11 + vpxord %zmm8,%zmm4,%zmm4 + vpxord %zmm9,%zmm5,%zmm5 + vpxord %zmm10,%zmm6,%zmm6 + vpxord %zmm11,%zmm7,%zmm7 + vprold $12,%zmm4,%zmm4 + vprold $12,%zmm5,%zmm5 + vprold $12,%zmm6,%zmm6 + vprold $12,%zmm7,%zmm7 + vpaddd %zmm4,%zmm0,%zmm0 + vpaddd %zmm5,%zmm1,%zmm1 + vpaddd %zmm6,%zmm2,%zmm2 + vpaddd %zmm7,%zmm3,%zmm3 + vpxord %zmm0,%zmm12,%zmm12 + vpxord 
%zmm1,%zmm13,%zmm13 + vpxord %zmm2,%zmm14,%zmm14 + vpxord %zmm3,%zmm15,%zmm15 + vprold $8,%zmm12,%zmm12 + vprold $8,%zmm13,%zmm13 + vprold $8,%zmm14,%zmm14 + vprold $8,%zmm15,%zmm15 + vpaddd %zmm12,%zmm8,%zmm8 + vpaddd %zmm13,%zmm9,%zmm9 + vpaddd %zmm14,%zmm10,%zmm10 + vpaddd %zmm15,%zmm11,%zmm11 + vpxord %zmm8,%zmm4,%zmm4 + vpxord %zmm9,%zmm5,%zmm5 + vpxord %zmm10,%zmm6,%zmm6 + vpxord %zmm11,%zmm7,%zmm7 + vprold $7,%zmm4,%zmm4 + vprold $7,%zmm5,%zmm5 + vprold $7,%zmm6,%zmm6 + vprold $7,%zmm7,%zmm7 + vpaddd %zmm5,%zmm0,%zmm0 + vpaddd %zmm6,%zmm1,%zmm1 + vpaddd %zmm7,%zmm2,%zmm2 + vpaddd %zmm4,%zmm3,%zmm3 + vpxord %zmm0,%zmm15,%zmm15 + vpxord %zmm1,%zmm12,%zmm12 + vpxord %zmm2,%zmm13,%zmm13 + vpxord %zmm3,%zmm14,%zmm14 + vprold $16,%zmm15,%zmm15 + vprold $16,%zmm12,%zmm12 + vprold $16,%zmm13,%zmm13 + vprold $16,%zmm14,%zmm14 + vpaddd %zmm15,%zmm10,%zmm10 + vpaddd %zmm12,%zmm11,%zmm11 + vpaddd %zmm13,%zmm8,%zmm8 + vpaddd %zmm14,%zmm9,%zmm9 + vpxord %zmm10,%zmm5,%zmm5 + vpxord %zmm11,%zmm6,%zmm6 + vpxord %zmm8,%zmm7,%zmm7 + vpxord %zmm9,%zmm4,%zmm4 + vprold $12,%zmm5,%zmm5 + vprold $12,%zmm6,%zmm6 + vprold $12,%zmm7,%zmm7 + vprold $12,%zmm4,%zmm4 + vpaddd %zmm5,%zmm0,%zmm0 + vpaddd %zmm6,%zmm1,%zmm1 + vpaddd %zmm7,%zmm2,%zmm2 + vpaddd %zmm4,%zmm3,%zmm3 + vpxord %zmm0,%zmm15,%zmm15 + vpxord %zmm1,%zmm12,%zmm12 + vpxord %zmm2,%zmm13,%zmm13 + vpxord %zmm3,%zmm14,%zmm14 + vprold $8,%zmm15,%zmm15 + vprold $8,%zmm12,%zmm12 + vprold $8,%zmm13,%zmm13 + vprold $8,%zmm14,%zmm14 + vpaddd %zmm15,%zmm10,%zmm10 + vpaddd %zmm12,%zmm11,%zmm11 + vpaddd %zmm13,%zmm8,%zmm8 + vpaddd %zmm14,%zmm9,%zmm9 + vpxord %zmm10,%zmm5,%zmm5 + vpxord %zmm11,%zmm6,%zmm6 + vpxord %zmm8,%zmm7,%zmm7 + vpxord %zmm9,%zmm4,%zmm4 + vprold $7,%zmm5,%zmm5 + vprold $7,%zmm6,%zmm6 + vprold $7,%zmm7,%zmm7 + vprold $7,%zmm4,%zmm4 + decl %eax + jnz .Loop16x + + vpaddd %zmm16,%zmm0,%zmm0 + vpaddd %zmm17,%zmm1,%zmm1 + vpaddd %zmm18,%zmm2,%zmm2 + vpaddd %zmm19,%zmm3,%zmm3 + + vpunpckldq %zmm1,%zmm0,%zmm18 + vpunpckldq %zmm3,%zmm2,%zmm19 + vpunpckhdq %zmm1,%zmm0,%zmm0 + vpunpckhdq %zmm3,%zmm2,%zmm2 + vpunpcklqdq %zmm19,%zmm18,%zmm1 + vpunpckhqdq %zmm19,%zmm18,%zmm18 + vpunpcklqdq %zmm2,%zmm0,%zmm3 + vpunpckhqdq %zmm2,%zmm0,%zmm0 + vpaddd %zmm20,%zmm4,%zmm4 + vpaddd %zmm21,%zmm5,%zmm5 + vpaddd %zmm22,%zmm6,%zmm6 + vpaddd %zmm23,%zmm7,%zmm7 + + vpunpckldq %zmm5,%zmm4,%zmm2 + vpunpckldq %zmm7,%zmm6,%zmm19 + vpunpckhdq %zmm5,%zmm4,%zmm4 + vpunpckhdq %zmm7,%zmm6,%zmm6 + vpunpcklqdq %zmm19,%zmm2,%zmm5 + vpunpckhqdq %zmm19,%zmm2,%zmm2 + vpunpcklqdq %zmm6,%zmm4,%zmm7 + vpunpckhqdq %zmm6,%zmm4,%zmm4 + vshufi32x4 $0x44,%zmm5,%zmm1,%zmm19 + vshufi32x4 $0xee,%zmm5,%zmm1,%zmm5 + vshufi32x4 $0x44,%zmm2,%zmm18,%zmm1 + vshufi32x4 $0xee,%zmm2,%zmm18,%zmm2 + vshufi32x4 $0x44,%zmm7,%zmm3,%zmm18 + vshufi32x4 $0xee,%zmm7,%zmm3,%zmm7 + vshufi32x4 $0x44,%zmm4,%zmm0,%zmm3 + vshufi32x4 $0xee,%zmm4,%zmm0,%zmm4 + vpaddd %zmm24,%zmm8,%zmm8 + vpaddd %zmm25,%zmm9,%zmm9 + vpaddd %zmm26,%zmm10,%zmm10 + vpaddd %zmm27,%zmm11,%zmm11 + + vpunpckldq %zmm9,%zmm8,%zmm6 + vpunpckldq %zmm11,%zmm10,%zmm0 + vpunpckhdq %zmm9,%zmm8,%zmm8 + vpunpckhdq %zmm11,%zmm10,%zmm10 + vpunpcklqdq %zmm0,%zmm6,%zmm9 + vpunpckhqdq %zmm0,%zmm6,%zmm6 + vpunpcklqdq %zmm10,%zmm8,%zmm11 + vpunpckhqdq %zmm10,%zmm8,%zmm8 + vpaddd %zmm28,%zmm12,%zmm12 + vpaddd %zmm29,%zmm13,%zmm13 + vpaddd %zmm30,%zmm14,%zmm14 + vpaddd %zmm31,%zmm15,%zmm15 + + vpunpckldq %zmm13,%zmm12,%zmm10 + vpunpckldq %zmm15,%zmm14,%zmm0 + vpunpckhdq %zmm13,%zmm12,%zmm12 + vpunpckhdq %zmm15,%zmm14,%zmm14 + vpunpcklqdq %zmm0,%zmm10,%zmm13 + 
vpunpckhqdq %zmm0,%zmm10,%zmm10 + vpunpcklqdq %zmm14,%zmm12,%zmm15 + vpunpckhqdq %zmm14,%zmm12,%zmm12 + vshufi32x4 $0x44,%zmm13,%zmm9,%zmm0 + vshufi32x4 $0xee,%zmm13,%zmm9,%zmm13 + vshufi32x4 $0x44,%zmm10,%zmm6,%zmm9 + vshufi32x4 $0xee,%zmm10,%zmm6,%zmm10 + vshufi32x4 $0x44,%zmm15,%zmm11,%zmm6 + vshufi32x4 $0xee,%zmm15,%zmm11,%zmm15 + vshufi32x4 $0x44,%zmm12,%zmm8,%zmm11 + vshufi32x4 $0xee,%zmm12,%zmm8,%zmm12 + vshufi32x4 $0x88,%zmm0,%zmm19,%zmm16 + vshufi32x4 $0xdd,%zmm0,%zmm19,%zmm19 + vshufi32x4 $0x88,%zmm13,%zmm5,%zmm0 + vshufi32x4 $0xdd,%zmm13,%zmm5,%zmm13 + vshufi32x4 $0x88,%zmm9,%zmm1,%zmm17 + vshufi32x4 $0xdd,%zmm9,%zmm1,%zmm1 + vshufi32x4 $0x88,%zmm10,%zmm2,%zmm9 + vshufi32x4 $0xdd,%zmm10,%zmm2,%zmm10 + vshufi32x4 $0x88,%zmm6,%zmm18,%zmm14 + vshufi32x4 $0xdd,%zmm6,%zmm18,%zmm18 + vshufi32x4 $0x88,%zmm15,%zmm7,%zmm6 + vshufi32x4 $0xdd,%zmm15,%zmm7,%zmm15 + vshufi32x4 $0x88,%zmm11,%zmm3,%zmm8 + vshufi32x4 $0xdd,%zmm11,%zmm3,%zmm3 + vshufi32x4 $0x88,%zmm12,%zmm4,%zmm11 + vshufi32x4 $0xdd,%zmm12,%zmm4,%zmm12 + cmpq $1024,%rdx + jb .Ltail16x + + vpxord 0(%rsi),%zmm16,%zmm16 + vpxord 64(%rsi),%zmm17,%zmm17 + vpxord 128(%rsi),%zmm14,%zmm14 + vpxord 192(%rsi),%zmm8,%zmm8 + vmovdqu32 %zmm16,0(%rdi) + vmovdqu32 %zmm17,64(%rdi) + vmovdqu32 %zmm14,128(%rdi) + vmovdqu32 %zmm8,192(%rdi) + + vpxord 256(%rsi),%zmm19,%zmm19 + vpxord 320(%rsi),%zmm1,%zmm1 + vpxord 384(%rsi),%zmm18,%zmm18 + vpxord 448(%rsi),%zmm3,%zmm3 + vmovdqu32 %zmm19,256(%rdi) + vmovdqu32 %zmm1,320(%rdi) + vmovdqu32 %zmm18,384(%rdi) + vmovdqu32 %zmm3,448(%rdi) + + vpxord 512(%rsi),%zmm0,%zmm0 + vpxord 576(%rsi),%zmm9,%zmm9 + vpxord 640(%rsi),%zmm6,%zmm6 + vpxord 704(%rsi),%zmm11,%zmm11 + vmovdqu32 %zmm0,512(%rdi) + vmovdqu32 %zmm9,576(%rdi) + vmovdqu32 %zmm6,640(%rdi) + vmovdqu32 %zmm11,704(%rdi) + + vpxord 768(%rsi),%zmm13,%zmm13 + vpxord 832(%rsi),%zmm10,%zmm10 + vpxord 896(%rsi),%zmm15,%zmm15 + vpxord 960(%rsi),%zmm12,%zmm12 + leaq 1024(%rsi),%rsi + vmovdqu32 %zmm13,768(%rdi) + vmovdqu32 %zmm10,832(%rdi) + vmovdqu32 %zmm15,896(%rdi) + vmovdqu32 %zmm12,960(%rdi) + leaq 1024(%rdi),%rdi + + subq $1024,%rdx + jnz .Loop_outer16x + + jmp .Ldone16x + +.align 32 +.Ltail16x: + xorq %r10,%r10 + subq %rsi,%rdi + cmpq $64,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm16,%zmm16 + vmovdqu32 %zmm16,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm17,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $128,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm17,%zmm17 + vmovdqu32 %zmm17,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm14,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $192,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm14,%zmm14 + vmovdqu32 %zmm14,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm8,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $256,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm8,%zmm8 + vmovdqu32 %zmm8,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm19,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $320,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm19,%zmm19 + vmovdqu32 %zmm19,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm1,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $384,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm1,%zmm1 + vmovdqu32 %zmm1,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm18,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $448,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm18,%zmm18 + vmovdqu32 %zmm18,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm3,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $512,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm3,%zmm3 + vmovdqu32 %zmm3,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm0,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $576,%rdx + jb 
.Less_than_64_16x + vpxord (%rsi),%zmm0,%zmm0 + vmovdqu32 %zmm0,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm9,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $640,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm9,%zmm9 + vmovdqu32 %zmm9,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm6,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $704,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm6,%zmm6 + vmovdqu32 %zmm6,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm11,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $768,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm11,%zmm11 + vmovdqu32 %zmm11,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm13,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $832,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm13,%zmm13 + vmovdqu32 %zmm13,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm10,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $896,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm10,%zmm10 + vmovdqu32 %zmm10,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm15,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $960,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm15,%zmm15 + vmovdqu32 %zmm15,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm12,%zmm16 + leaq 64(%rsi),%rsi + +.Less_than_64_16x: + vmovdqa32 %zmm16,0(%rsp) + leaq (%rdi,%rsi,1),%rdi + andq $63,%rdx + +.Loop_tail16x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail16x + + vpxord %zmm16,%zmm16,%zmm16 + vmovdqa32 %zmm16,0(%rsp) + +.Ldone16x: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L16x_epilogue: + ret +.cfi_endproc +.size chacha20_16x,.-chacha20_16x +.global chacha20_8xvl +.type chacha20_8xvl,@function +.align 32 +chacha20_8xvl: +.cfi_startproc +.Lchacha20_8xvl: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + andq $-64,%rsp + vzeroupper + + leaq .Lsigma(%rip),%r10 + vbroadcasti128 (%r10),%ymm3 + vbroadcasti128 (%rcx),%ymm7 + vbroadcasti128 16(%rcx),%ymm11 + vbroadcasti128 (%r8),%ymm15 + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vpshufd $0xaa,%ymm3,%ymm2 + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa64 %ymm0,%ymm16 + vmovdqa64 %ymm1,%ymm17 + vmovdqa64 %ymm2,%ymm18 + vmovdqa64 %ymm3,%ymm19 + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpshufd $0xaa,%ymm7,%ymm6 + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa64 %ymm4,%ymm20 + vmovdqa64 %ymm5,%ymm21 + vmovdqa64 %ymm6,%ymm22 + vmovdqa64 %ymm7,%ymm23 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vpshufd $0xaa,%ymm11,%ymm10 + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa64 %ymm8,%ymm24 + vmovdqa64 %ymm9,%ymm25 + vmovdqa64 %ymm10,%ymm26 + vmovdqa64 %ymm11,%ymm27 + + vpshufd $0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vpshufd $0xaa,%ymm15,%ymm14 + vpshufd $0xff,%ymm15,%ymm15 + vpaddd .Lincy(%rip),%ymm12,%ymm12 + vmovdqa64 %ymm12,%ymm28 + vmovdqa64 %ymm13,%ymm29 + vmovdqa64 %ymm14,%ymm30 + vmovdqa64 %ymm15,%ymm31 + + movl $10,%eax + jmp .Loop8xvl + +.align 32 +.Loop_outer8xvl: + + + vpbroadcastd 8(%r10),%ymm2 + vpbroadcastd 12(%r10),%ymm3 + vpaddd .Leight(%rip),%ymm28,%ymm28 + vmovdqa64 %ymm20,%ymm4 + vmovdqa64 %ymm21,%ymm5 + vmovdqa64 %ymm22,%ymm6 + vmovdqa64 %ymm23,%ymm7 + vmovdqa64 %ymm24,%ymm8 + vmovdqa64 %ymm25,%ymm9 + vmovdqa64 %ymm26,%ymm10 + vmovdqa64 %ymm27,%ymm11 + vmovdqa64 %ymm28,%ymm12 + vmovdqa64 %ymm29,%ymm13 + vmovdqa64 %ymm30,%ymm14 + vmovdqa64 %ymm31,%ymm15 + + vmovdqa64 %ymm0,%ymm16 + vmovdqa64 %ymm1,%ymm17 + vmovdqa64 %ymm2,%ymm18 + vmovdqa64 %ymm3,%ymm19 + + movl $10,%eax + jmp .Loop8xvl + +.align 32 +.Loop8xvl: + vpaddd %ymm4,%ymm0,%ymm0 + vpaddd %ymm5,%ymm1,%ymm1 + 
vpaddd %ymm6,%ymm2,%ymm2 + vpaddd %ymm7,%ymm3,%ymm3 + vpxor %ymm0,%ymm12,%ymm12 + vpxor %ymm1,%ymm13,%ymm13 + vpxor %ymm2,%ymm14,%ymm14 + vpxor %ymm3,%ymm15,%ymm15 + vprold $16,%ymm12,%ymm12 + vprold $16,%ymm13,%ymm13 + vprold $16,%ymm14,%ymm14 + vprold $16,%ymm15,%ymm15 + vpaddd %ymm12,%ymm8,%ymm8 + vpaddd %ymm13,%ymm9,%ymm9 + vpaddd %ymm14,%ymm10,%ymm10 + vpaddd %ymm15,%ymm11,%ymm11 + vpxor %ymm8,%ymm4,%ymm4 + vpxor %ymm9,%ymm5,%ymm5 + vpxor %ymm10,%ymm6,%ymm6 + vpxor %ymm11,%ymm7,%ymm7 + vprold $12,%ymm4,%ymm4 + vprold $12,%ymm5,%ymm5 + vprold $12,%ymm6,%ymm6 + vprold $12,%ymm7,%ymm7 + vpaddd %ymm4,%ymm0,%ymm0 + vpaddd %ymm5,%ymm1,%ymm1 + vpaddd %ymm6,%ymm2,%ymm2 + vpaddd %ymm7,%ymm3,%ymm3 + vpxor %ymm0,%ymm12,%ymm12 + vpxor %ymm1,%ymm13,%ymm13 + vpxor %ymm2,%ymm14,%ymm14 + vpxor %ymm3,%ymm15,%ymm15 + vprold $8,%ymm12,%ymm12 + vprold $8,%ymm13,%ymm13 + vprold $8,%ymm14,%ymm14 + vprold $8,%ymm15,%ymm15 + vpaddd %ymm12,%ymm8,%ymm8 + vpaddd %ymm13,%ymm9,%ymm9 + vpaddd %ymm14,%ymm10,%ymm10 + vpaddd %ymm15,%ymm11,%ymm11 + vpxor %ymm8,%ymm4,%ymm4 + vpxor %ymm9,%ymm5,%ymm5 + vpxor %ymm10,%ymm6,%ymm6 + vpxor %ymm11,%ymm7,%ymm7 + vprold $7,%ymm4,%ymm4 + vprold $7,%ymm5,%ymm5 + vprold $7,%ymm6,%ymm6 + vprold $7,%ymm7,%ymm7 + vpaddd %ymm5,%ymm0,%ymm0 + vpaddd %ymm6,%ymm1,%ymm1 + vpaddd %ymm7,%ymm2,%ymm2 + vpaddd %ymm4,%ymm3,%ymm3 + vpxor %ymm0,%ymm15,%ymm15 + vpxor %ymm1,%ymm12,%ymm12 + vpxor %ymm2,%ymm13,%ymm13 + vpxor %ymm3,%ymm14,%ymm14 + vprold $16,%ymm15,%ymm15 + vprold $16,%ymm12,%ymm12 + vprold $16,%ymm13,%ymm13 + vprold $16,%ymm14,%ymm14 + vpaddd %ymm15,%ymm10,%ymm10 + vpaddd %ymm12,%ymm11,%ymm11 + vpaddd %ymm13,%ymm8,%ymm8 + vpaddd %ymm14,%ymm9,%ymm9 + vpxor %ymm10,%ymm5,%ymm5 + vpxor %ymm11,%ymm6,%ymm6 + vpxor %ymm8,%ymm7,%ymm7 + vpxor %ymm9,%ymm4,%ymm4 + vprold $12,%ymm5,%ymm5 + vprold $12,%ymm6,%ymm6 + vprold $12,%ymm7,%ymm7 + vprold $12,%ymm4,%ymm4 + vpaddd %ymm5,%ymm0,%ymm0 + vpaddd %ymm6,%ymm1,%ymm1 + vpaddd %ymm7,%ymm2,%ymm2 + vpaddd %ymm4,%ymm3,%ymm3 + vpxor %ymm0,%ymm15,%ymm15 + vpxor %ymm1,%ymm12,%ymm12 + vpxor %ymm2,%ymm13,%ymm13 + vpxor %ymm3,%ymm14,%ymm14 + vprold $8,%ymm15,%ymm15 + vprold $8,%ymm12,%ymm12 + vprold $8,%ymm13,%ymm13 + vprold $8,%ymm14,%ymm14 + vpaddd %ymm15,%ymm10,%ymm10 + vpaddd %ymm12,%ymm11,%ymm11 + vpaddd %ymm13,%ymm8,%ymm8 + vpaddd %ymm14,%ymm9,%ymm9 + vpxor %ymm10,%ymm5,%ymm5 + vpxor %ymm11,%ymm6,%ymm6 + vpxor %ymm8,%ymm7,%ymm7 + vpxor %ymm9,%ymm4,%ymm4 + vprold $7,%ymm5,%ymm5 + vprold $7,%ymm6,%ymm6 + vprold $7,%ymm7,%ymm7 + vprold $7,%ymm4,%ymm4 + decl %eax + jnz .Loop8xvl + + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm18 + vpunpckldq %ymm3,%ymm2,%ymm19 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm19,%ymm18,%ymm1 + vpunpckhqdq %ymm19,%ymm18,%ymm18 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vpaddd %ymm20,%ymm4,%ymm4 + vpaddd %ymm21,%ymm5,%ymm5 + vpaddd %ymm22,%ymm6,%ymm6 + vpaddd %ymm23,%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm2 + vpunpckldq %ymm7,%ymm6,%ymm19 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + vpunpcklqdq %ymm19,%ymm2,%ymm5 + vpunpckhqdq %ymm19,%ymm2,%ymm2 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vshufi32x4 $0,%ymm5,%ymm1,%ymm19 + vshufi32x4 $3,%ymm5,%ymm1,%ymm5 + vshufi32x4 $0,%ymm2,%ymm18,%ymm1 + vshufi32x4 $3,%ymm2,%ymm18,%ymm2 + vshufi32x4 $0,%ymm7,%ymm3,%ymm18 + vshufi32x4 $3,%ymm7,%ymm3,%ymm7 + vshufi32x4 $0,%ymm4,%ymm0,%ymm3 + vshufi32x4 
$3,%ymm4,%ymm0,%ymm4 + vpaddd %ymm24,%ymm8,%ymm8 + vpaddd %ymm25,%ymm9,%ymm9 + vpaddd %ymm26,%ymm10,%ymm10 + vpaddd %ymm27,%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm6 + vpunpckldq %ymm11,%ymm10,%ymm0 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm0,%ymm6,%ymm9 + vpunpckhqdq %ymm0,%ymm6,%ymm6 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd %ymm28,%ymm12,%ymm12 + vpaddd %ymm29,%ymm13,%ymm13 + vpaddd %ymm30,%ymm14,%ymm14 + vpaddd %ymm31,%ymm15,%ymm15 + + vpunpckldq %ymm13,%ymm12,%ymm10 + vpunpckldq %ymm15,%ymm14,%ymm0 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm0,%ymm10,%ymm13 + vpunpckhqdq %ymm0,%ymm10,%ymm10 + vpunpcklqdq %ymm14,%ymm12,%ymm15 + vpunpckhqdq %ymm14,%ymm12,%ymm12 + vperm2i128 $0x20,%ymm13,%ymm9,%ymm0 + vperm2i128 $0x31,%ymm13,%ymm9,%ymm13 + vperm2i128 $0x20,%ymm10,%ymm6,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm6,%ymm10 + vperm2i128 $0x20,%ymm15,%ymm11,%ymm6 + vperm2i128 $0x31,%ymm15,%ymm11,%ymm15 + vperm2i128 $0x20,%ymm12,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm12,%ymm8,%ymm12 + cmpq $512,%rdx + jb .Ltail8xvl + + movl $0x80,%eax + vpxord 0(%rsi),%ymm19,%ymm19 + vpxor 32(%rsi),%ymm0,%ymm0 + vpxor 64(%rsi),%ymm5,%ymm5 + vpxor 96(%rsi),%ymm13,%ymm13 + leaq (%rsi,%rax,1),%rsi + vmovdqu32 %ymm19,0(%rdi) + vmovdqu %ymm0,32(%rdi) + vmovdqu %ymm5,64(%rdi) + vmovdqu %ymm13,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxor 0(%rsi),%ymm1,%ymm1 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm2,%ymm2 + vpxor 96(%rsi),%ymm10,%ymm10 + leaq (%rsi,%rax,1),%rsi + vmovdqu %ymm1,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm2,64(%rdi) + vmovdqu %ymm10,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxord 0(%rsi),%ymm18,%ymm18 + vpxor 32(%rsi),%ymm6,%ymm6 + vpxor 64(%rsi),%ymm7,%ymm7 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq (%rsi,%rax,1),%rsi + vmovdqu32 %ymm18,0(%rdi) + vmovdqu %ymm6,32(%rdi) + vmovdqu %ymm7,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxor 0(%rsi),%ymm3,%ymm3 + vpxor 32(%rsi),%ymm11,%ymm11 + vpxor 64(%rsi),%ymm4,%ymm4 + vpxor 96(%rsi),%ymm12,%ymm12 + leaq (%rsi,%rax,1),%rsi + vmovdqu %ymm3,0(%rdi) + vmovdqu %ymm11,32(%rdi) + vmovdqu %ymm4,64(%rdi) + vmovdqu %ymm12,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpbroadcastd 0(%r10),%ymm0 + vpbroadcastd 4(%r10),%ymm1 + + subq $512,%rdx + jnz .Loop_outer8xvl + + jmp .Ldone8xvl + +.align 32 +.Ltail8xvl: + vmovdqa64 %ymm19,%ymm8 + xorq %r10,%r10 + subq %rsi,%rdi + cmpq $64,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm8,%ymm8 + vpxor 32(%rsi),%ymm0,%ymm0 + vmovdqu %ymm8,0(%rdi,%rsi,1) + vmovdqu %ymm0,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm5,%ymm8 + vmovdqa %ymm13,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $128,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm5,%ymm5 + vpxor 32(%rsi),%ymm13,%ymm13 + vmovdqu %ymm5,0(%rdi,%rsi,1) + vmovdqu %ymm13,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm1,%ymm8 + vmovdqa %ymm9,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $192,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm1,%ymm1 + vpxor 32(%rsi),%ymm9,%ymm9 + vmovdqu %ymm1,0(%rdi,%rsi,1) + vmovdqu %ymm9,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm2,%ymm8 + vmovdqa %ymm10,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $256,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm2,%ymm2 + vpxor 32(%rsi),%ymm10,%ymm10 + vmovdqu %ymm2,0(%rdi,%rsi,1) + vmovdqu %ymm10,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa32 %ymm18,%ymm8 + vmovdqa %ymm6,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $320,%rdx + jb .Less_than_64_8xvl + vpxord 0(%rsi),%ymm18,%ymm18 + vpxor 32(%rsi),%ymm6,%ymm6 + 
vmovdqu32 %ymm18,0(%rdi,%rsi,1) + vmovdqu %ymm6,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm7,%ymm8 + vmovdqa %ymm15,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $384,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm7,%ymm7 + vpxor 32(%rsi),%ymm15,%ymm15 + vmovdqu %ymm7,0(%rdi,%rsi,1) + vmovdqu %ymm15,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm3,%ymm8 + vmovdqa %ymm11,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $448,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm3,%ymm3 + vpxor 32(%rsi),%ymm11,%ymm11 + vmovdqu %ymm3,0(%rdi,%rsi,1) + vmovdqu %ymm11,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm4,%ymm8 + vmovdqa %ymm12,%ymm0 + leaq 64(%rsi),%rsi + +.Less_than_64_8xvl: + vmovdqa %ymm8,0(%rsp) + vmovdqa %ymm0,32(%rsp) + leaq (%rdi,%rsi,1),%rdi + andq $63,%rdx + +.Loop_tail8xvl: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail8xvl + + vpxor %ymm8,%ymm8,%ymm8 + vmovdqa %ymm8,0(%rsp) + vmovdqa %ymm8,32(%rsp) + +.Ldone8xvl: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L8xvl_epilogue: + ret +.cfi_endproc +.size chacha20_8xvl,.-chacha20_8xvl diff --git a/crypto/chacha20_x64_gas_macosx.s b/crypto/chacha20_x64_gas_macosx.s new file mode 100644 index 0000000..37b9175 --- /dev/null +++ b/crypto/chacha20_x64_gas_macosx.s @@ -0,0 +1,1388 @@ +.text + +.p2align 6 +L$zero: +.long 0,0,0,0 +L$one: +.long 1,0,0,0 +L$inc: +.long 0,1,2,3 +L$four: +.long 4,4,4,4 +L$incy: +.long 0,2,4,6,1,3,5,7 +L$eight: +.long 8,8,8,8,8,8,8,8 +L$rot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +L$rot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +L$sigma: +.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 +.p2align 6 +L$zeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +L$fourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +L$incz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +L$sixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.p2align 6 +L$twoy: +.long 2,0,0,0, 2,0,0,0 + +.global _hchacha20_ssse3 + +.p2align 5 +_hchacha20_ssse3: + +L$hchacha20_ssse3: + movdqa L$sigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa L$rot16(%rip),%xmm6 + movdqa L$rot24(%rip),%xmm7 + movq $10,%r8 +.p2align 5 +L$oop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz L$oop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret + + +.global _chacha20_ssse3 + +.p2align 5 +_chacha20_ssse3: + +L$chacha20_ssse3: + movq %rsp,%r9 + + cmpq $128,%rdx + ja L$chacha20_4x + +L$do_sse3_after_all: + subq $64+8,%rsp + movdqa L$sigma(%rip),%xmm0 + movdqu (%rcx),%xmm1 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa 
L$rot16(%rip),%xmm6 + movdqa L$rot24(%rip),%xmm7 + + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + movq $10,%r8 + jmp L$oop_ssse3 + +.p2align 5 +L$oop_outer_ssse3: + movdqa L$one(%rip),%xmm3 + movdqa 0(%rsp),%xmm0 + movdqa 16(%rsp),%xmm1 + movdqa 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + movq $10,%r8 + movdqa %xmm3,48(%rsp) + jmp L$oop_ssse3 + +.p2align 5 +L$oop_ssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz L$oop_ssse3 + paddd 0(%rsp),%xmm0 + paddd 16(%rsp),%xmm1 + paddd 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + + cmpq $64,%rdx + jb L$tail_ssse3 + + movdqu 0(%rsi),%xmm4 + movdqu 16(%rsi),%xmm5 + pxor %xmm4,%xmm0 + movdqu 32(%rsi),%xmm4 + pxor %xmm5,%xmm1 + movdqu 48(%rsi),%xmm5 + leaq 64(%rsi),%rsi + pxor %xmm4,%xmm2 + pxor %xmm5,%xmm3 + + movdqu %xmm0,0(%rdi) + movdqu %xmm1,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm3,48(%rdi) + leaq 64(%rdi),%rdi + + subq $64,%rdx + jnz L$oop_outer_ssse3 + + jmp L$done_ssse3 + +.p2align 4 +L$tail_ssse3: + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + xorq %r8,%r8 + +L$oop_tail_ssse3: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz L$oop_tail_ssse3 + +L$done_ssse3: + leaq (%r9),%rsp + +L$ssse3_epilogue: + ret + + +.global _chacha20_4x + +.p2align 5 +_chacha20_4x: + +L$chacha20_4x: + movq %rsp,%r9 + + + + + + + + + + + + +L$proceed4x: + subq $0x140+8,%rsp + movdqa L$sigma(%rip),%xmm11 + movdqu (%rcx),%xmm15 + movdqu 16(%rcx),%xmm7 + movdqu (%r8),%xmm3 + leaq 256(%rsp),%rcx + leaq L$rot16(%rip),%r10 + leaq L$rot24(%rip),%r11 + + pshufd $0x00,%xmm11,%xmm8 + pshufd $0x55,%xmm11,%xmm9 + movdqa %xmm8,64(%rsp) + pshufd $0xaa,%xmm11,%xmm10 + movdqa %xmm9,80(%rsp) + pshufd $0xff,%xmm11,%xmm11 + movdqa %xmm10,96(%rsp) + movdqa %xmm11,112(%rsp) + + pshufd $0x00,%xmm15,%xmm12 + pshufd $0x55,%xmm15,%xmm13 + movdqa %xmm12,128-256(%rcx) + pshufd $0xaa,%xmm15,%xmm14 + movdqa %xmm13,144-256(%rcx) + pshufd $0xff,%xmm15,%xmm15 + movdqa %xmm14,160-256(%rcx) + movdqa %xmm15,176-256(%rcx) + + pshufd $0x00,%xmm7,%xmm4 + pshufd $0x55,%xmm7,%xmm5 + movdqa %xmm4,192-256(%rcx) + pshufd $0xaa,%xmm7,%xmm6 + movdqa %xmm5,208-256(%rcx) + pshufd $0xff,%xmm7,%xmm7 + movdqa %xmm6,224-256(%rcx) + movdqa %xmm7,240-256(%rcx) + + pshufd $0x00,%xmm3,%xmm0 + pshufd $0x55,%xmm3,%xmm1 + paddd L$inc(%rip),%xmm0 + pshufd $0xaa,%xmm3,%xmm2 + movdqa %xmm1,272-256(%rcx) + pshufd $0xff,%xmm3,%xmm3 + movdqa %xmm2,288-256(%rcx) + movdqa %xmm3,304-256(%rcx) + + jmp L$oop_enter4x + +.p2align 5 +L$oop_outer4x: + movdqa 64(%rsp),%xmm8 + movdqa 80(%rsp),%xmm9 + movdqa 96(%rsp),%xmm10 + movdqa 112(%rsp),%xmm11 + 
movdqa 128-256(%rcx),%xmm12 + movdqa 144-256(%rcx),%xmm13 + movdqa 160-256(%rcx),%xmm14 + movdqa 176-256(%rcx),%xmm15 + movdqa 192-256(%rcx),%xmm4 + movdqa 208-256(%rcx),%xmm5 + movdqa 224-256(%rcx),%xmm6 + movdqa 240-256(%rcx),%xmm7 + movdqa 256-256(%rcx),%xmm0 + movdqa 272-256(%rcx),%xmm1 + movdqa 288-256(%rcx),%xmm2 + movdqa 304-256(%rcx),%xmm3 + paddd L$four(%rip),%xmm0 + +L$oop_enter4x: + movdqa %xmm6,32(%rsp) + movdqa %xmm7,48(%rsp) + movdqa (%r10),%xmm7 + movl $10,%eax + movdqa %xmm0,256-256(%rcx) + jmp L$oop4x + +.p2align 5 +L$oop4x: + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm7,%xmm0 + pshufb %xmm7,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm6 + pslld $12,%xmm12 + psrld $20,%xmm6 + movdqa %xmm13,%xmm7 + pslld $12,%xmm13 + por %xmm6,%xmm12 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm13 + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm6,%xmm0 + pshufb %xmm6,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm7 + pslld $7,%xmm12 + psrld $25,%xmm7 + movdqa %xmm13,%xmm6 + pslld $7,%xmm13 + por %xmm7,%xmm12 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm13 + movdqa %xmm4,0(%rsp) + movdqa %xmm5,16(%rsp) + movdqa 32(%rsp),%xmm4 + movdqa 48(%rsp),%xmm5 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm7,%xmm2 + pshufb %xmm7,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm6 + pslld $12,%xmm14 + psrld $20,%xmm6 + movdqa %xmm15,%xmm7 + pslld $12,%xmm15 + por %xmm6,%xmm14 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm15 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm6,%xmm2 + pshufb %xmm6,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm7 + pslld $7,%xmm14 + psrld $25,%xmm7 + movdqa %xmm15,%xmm6 + pslld $7,%xmm15 + por %xmm7,%xmm14 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm15 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm7,%xmm3 + pshufb %xmm7,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm6 + pslld $12,%xmm13 + psrld $20,%xmm6 + movdqa %xmm14,%xmm7 + pslld $12,%xmm14 + por %xmm6,%xmm13 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm14 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm6,%xmm3 + pshufb %xmm6,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm7 + pslld $7,%xmm13 + psrld $25,%xmm7 + movdqa %xmm14,%xmm6 + pslld $7,%xmm14 + por %xmm7,%xmm13 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm14 + movdqa %xmm4,32(%rsp) + movdqa %xmm5,48(%rsp) + movdqa 0(%rsp),%xmm4 + movdqa 16(%rsp),%xmm5 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm7,%xmm1 + pshufb %xmm7,%xmm2 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm6 + pslld $12,%xmm15 + psrld $20,%xmm6 + movdqa %xmm12,%xmm7 + pslld $12,%xmm12 + por %xmm6,%xmm15 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm12 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm6,%xmm1 + pshufb %xmm6,%xmm2 + paddd %xmm1,%xmm4 + 
paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm7 + pslld $7,%xmm15 + psrld $25,%xmm7 + movdqa %xmm12,%xmm6 + pslld $7,%xmm12 + por %xmm7,%xmm15 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm12 + decl %eax + jnz L$oop4x + + paddd 64(%rsp),%xmm8 + paddd 80(%rsp),%xmm9 + paddd 96(%rsp),%xmm10 + paddd 112(%rsp),%xmm11 + + movdqa %xmm8,%xmm6 + punpckldq %xmm9,%xmm8 + movdqa %xmm10,%xmm7 + punpckldq %xmm11,%xmm10 + punpckhdq %xmm9,%xmm6 + punpckhdq %xmm11,%xmm7 + movdqa %xmm8,%xmm9 + punpcklqdq %xmm10,%xmm8 + movdqa %xmm6,%xmm11 + punpcklqdq %xmm7,%xmm6 + punpckhqdq %xmm10,%xmm9 + punpckhqdq %xmm7,%xmm11 + paddd 128-256(%rcx),%xmm12 + paddd 144-256(%rcx),%xmm13 + paddd 160-256(%rcx),%xmm14 + paddd 176-256(%rcx),%xmm15 + + movdqa %xmm8,0(%rsp) + movdqa %xmm9,16(%rsp) + movdqa 32(%rsp),%xmm8 + movdqa 48(%rsp),%xmm9 + + movdqa %xmm12,%xmm10 + punpckldq %xmm13,%xmm12 + movdqa %xmm14,%xmm7 + punpckldq %xmm15,%xmm14 + punpckhdq %xmm13,%xmm10 + punpckhdq %xmm15,%xmm7 + movdqa %xmm12,%xmm13 + punpcklqdq %xmm14,%xmm12 + movdqa %xmm10,%xmm15 + punpcklqdq %xmm7,%xmm10 + punpckhqdq %xmm14,%xmm13 + punpckhqdq %xmm7,%xmm15 + paddd 192-256(%rcx),%xmm4 + paddd 208-256(%rcx),%xmm5 + paddd 224-256(%rcx),%xmm8 + paddd 240-256(%rcx),%xmm9 + + movdqa %xmm6,32(%rsp) + movdqa %xmm11,48(%rsp) + + movdqa %xmm4,%xmm14 + punpckldq %xmm5,%xmm4 + movdqa %xmm8,%xmm7 + punpckldq %xmm9,%xmm8 + punpckhdq %xmm5,%xmm14 + punpckhdq %xmm9,%xmm7 + movdqa %xmm4,%xmm5 + punpcklqdq %xmm8,%xmm4 + movdqa %xmm14,%xmm9 + punpcklqdq %xmm7,%xmm14 + punpckhqdq %xmm8,%xmm5 + punpckhqdq %xmm7,%xmm9 + paddd 256-256(%rcx),%xmm0 + paddd 272-256(%rcx),%xmm1 + paddd 288-256(%rcx),%xmm2 + paddd 304-256(%rcx),%xmm3 + + movdqa %xmm0,%xmm8 + punpckldq %xmm1,%xmm0 + movdqa %xmm2,%xmm7 + punpckldq %xmm3,%xmm2 + punpckhdq %xmm1,%xmm8 + punpckhdq %xmm3,%xmm7 + movdqa %xmm0,%xmm1 + punpcklqdq %xmm2,%xmm0 + movdqa %xmm8,%xmm3 + punpcklqdq %xmm7,%xmm8 + punpckhqdq %xmm2,%xmm1 + punpckhqdq %xmm7,%xmm3 + cmpq $256,%rdx + jb L$tail4x + + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 48(%rsp),%xmm6 + pxor %xmm15,%xmm11 + pxor %xmm9,%xmm2 + pxor %xmm3,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + + subq $256,%rdx + jnz L$oop_outer4x + + jmp L$done4x + +L$tail4x: + cmpq $192,%rdx + jae L$192_or_more4x + cmpq $128,%rdx + jae L$128_or_more4x + cmpq $64,%rdx + jae L$64_or_more4x + + + xorq %r10,%r10 + + movdqa %xmm12,16(%rsp) + movdqa %xmm4,32(%rsp) + movdqa %xmm0,48(%rsp) + jmp L$oop_tail4x + +.p2align 5 
+L$64_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je L$done4x + + movdqa 16(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm13,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm5,32(%rsp) + subq $64,%rdx + movdqa %xmm1,48(%rsp) + jmp L$oop_tail4x + +.p2align 5 +L$128_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + je L$done4x + + movdqa 32(%rsp),%xmm6 + leaq 128(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm10,16(%rsp) + leaq 128(%rdi),%rdi + movdqa %xmm14,32(%rsp) + subq $128,%rdx + movdqa %xmm8,48(%rsp) + jmp L$oop_tail4x + +.p2align 5 +L$192_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je L$done4x + + movdqa 48(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm15,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm9,32(%rsp) + subq $192,%rdx + movdqa %xmm3,48(%rsp) + +L$oop_tail4x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz L$oop_tail4x + +L$done4x: + leaq (%r9),%rsp + +L$4x_epilogue: + ret + + +.global _chacha20_avx2 + +.p2align 5 +_chacha20_avx2: + +L$chacha20_avx2: + movq %rsp,%r9 + + subq $0x280+8,%rsp + andq $-32,%rsp + vzeroupper + + vbroadcasti128 L$sigma(%rip),%ymm11 + vbroadcasti128 (%rcx),%ymm3 + vbroadcasti128 16(%rcx),%ymm15 + vbroadcasti128 (%r8),%ymm7 + leaq 256(%rsp),%rcx + leaq 512(%rsp),%rax + leaq L$rot16(%rip),%r10 + leaq L$rot24(%rip),%r11 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vmovdqa %ymm8,128-256(%rcx) + vpshufd $0xaa,%ymm11,%ymm10 + vmovdqa %ymm9,160-256(%rcx) + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa %ymm10,192-256(%rcx) + vmovdqa %ymm11,224-256(%rcx) + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vmovdqa %ymm0,256-256(%rcx) + vpshufd $0xaa,%ymm3,%ymm2 + vmovdqa %ymm1,288-256(%rcx) + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa %ymm2,320-256(%rcx) + vmovdqa %ymm3,352-256(%rcx) + + vpshufd 
$0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vmovdqa %ymm12,384-512(%rax) + vpshufd $0xaa,%ymm15,%ymm14 + vmovdqa %ymm13,416-512(%rax) + vpshufd $0xff,%ymm15,%ymm15 + vmovdqa %ymm14,448-512(%rax) + vmovdqa %ymm15,480-512(%rax) + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpaddd L$incy(%rip),%ymm4,%ymm4 + vpshufd $0xaa,%ymm7,%ymm6 + vmovdqa %ymm5,544-512(%rax) + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa %ymm6,576-512(%rax) + vmovdqa %ymm7,608-512(%rax) + + jmp L$oop_enter8x + +.p2align 5 +L$oop_outer8x: + vmovdqa 128-256(%rcx),%ymm8 + vmovdqa 160-256(%rcx),%ymm9 + vmovdqa 192-256(%rcx),%ymm10 + vmovdqa 224-256(%rcx),%ymm11 + vmovdqa 256-256(%rcx),%ymm0 + vmovdqa 288-256(%rcx),%ymm1 + vmovdqa 320-256(%rcx),%ymm2 + vmovdqa 352-256(%rcx),%ymm3 + vmovdqa 384-512(%rax),%ymm12 + vmovdqa 416-512(%rax),%ymm13 + vmovdqa 448-512(%rax),%ymm14 + vmovdqa 480-512(%rax),%ymm15 + vmovdqa 512-512(%rax),%ymm4 + vmovdqa 544-512(%rax),%ymm5 + vmovdqa 576-512(%rax),%ymm6 + vmovdqa 608-512(%rax),%ymm7 + vpaddd L$eight(%rip),%ymm4,%ymm4 + +L$oop_enter8x: + vmovdqa %ymm14,64(%rsp) + vmovdqa %ymm15,96(%rsp) + vbroadcasti128 (%r10),%ymm15 + vmovdqa %ymm4,512-512(%rax) + movl $10,%eax + jmp L$oop8x + +.p2align 5 +L$oop8x: + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $12,%ymm0,%ymm14 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $12,%ymm1,%ymm15 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $7,%ymm0,%ymm15 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $7,%ymm1,%ymm14 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vmovdqa %ymm12,0(%rsp) + vmovdqa %ymm13,32(%rsp) + vmovdqa 64(%rsp),%ymm12 + vmovdqa 96(%rsp),%ymm13 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $12,%ymm2,%ymm14 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $12,%ymm3,%ymm15 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $7,%ymm2,%ymm15 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $7,%ymm3,%ymm14 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $12,%ymm1,%ymm14 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor 
%ymm2,%ymm13,%ymm2 + vpslld $12,%ymm2,%ymm15 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $7,%ymm1,%ymm15 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $7,%ymm2,%ymm14 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vmovdqa %ymm12,64(%rsp) + vmovdqa %ymm13,96(%rsp) + vmovdqa 0(%rsp),%ymm12 + vmovdqa 32(%rsp),%ymm13 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $12,%ymm3,%ymm14 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $12,%ymm0,%ymm15 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $7,%ymm3,%ymm15 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $7,%ymm0,%ymm14 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + decl %eax + jnz L$oop8x + + leaq 512(%rsp),%rax + vpaddd 128-256(%rcx),%ymm8,%ymm8 + vpaddd 160-256(%rcx),%ymm9,%ymm9 + vpaddd 192-256(%rcx),%ymm10,%ymm10 + vpaddd 224-256(%rcx),%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm14 + vpunpckldq %ymm11,%ymm10,%ymm15 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm15,%ymm14,%ymm9 + vpunpckhqdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd 256-256(%rcx),%ymm0,%ymm0 + vpaddd 288-256(%rcx),%ymm1,%ymm1 + vpaddd 320-256(%rcx),%ymm2,%ymm2 + vpaddd 352-256(%rcx),%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm10 + vpunpckldq %ymm3,%ymm2,%ymm15 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm10,%ymm1 + vpunpckhqdq %ymm15,%ymm10,%ymm10 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vperm2i128 $0x20,%ymm1,%ymm9,%ymm15 + vperm2i128 $0x31,%ymm1,%ymm9,%ymm1 + vperm2i128 $0x20,%ymm10,%ymm14,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm14,%ymm10 + vperm2i128 $0x20,%ymm3,%ymm11,%ymm14 + vperm2i128 $0x31,%ymm3,%ymm11,%ymm3 + vperm2i128 $0x20,%ymm0,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm0,%ymm8,%ymm0 + vmovdqa %ymm15,0(%rsp) + vmovdqa %ymm9,32(%rsp) + vmovdqa 64(%rsp),%ymm15 + vmovdqa 96(%rsp),%ymm9 + + vpaddd 384-512(%rax),%ymm12,%ymm12 + vpaddd 416-512(%rax),%ymm13,%ymm13 + vpaddd 448-512(%rax),%ymm15,%ymm15 + vpaddd 480-512(%rax),%ymm9,%ymm9 + + vpunpckldq %ymm13,%ymm12,%ymm2 + vpunpckldq %ymm9,%ymm15,%ymm8 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm9,%ymm15,%ymm15 + vpunpcklqdq %ymm8,%ymm2,%ymm13 + vpunpckhqdq %ymm8,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm12,%ymm9 + vpunpckhqdq %ymm15,%ymm12,%ymm12 + vpaddd 512-512(%rax),%ymm4,%ymm4 + vpaddd 544-512(%rax),%ymm5,%ymm5 + vpaddd 576-512(%rax),%ymm6,%ymm6 + vpaddd 608-512(%rax),%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm15 + vpunpckldq %ymm7,%ymm6,%ymm8 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + 
vpunpcklqdq %ymm8,%ymm15,%ymm5 + vpunpckhqdq %ymm8,%ymm15,%ymm15 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vperm2i128 $0x20,%ymm5,%ymm13,%ymm8 + vperm2i128 $0x31,%ymm5,%ymm13,%ymm5 + vperm2i128 $0x20,%ymm15,%ymm2,%ymm13 + vperm2i128 $0x31,%ymm15,%ymm2,%ymm15 + vperm2i128 $0x20,%ymm7,%ymm9,%ymm2 + vperm2i128 $0x31,%ymm7,%ymm9,%ymm7 + vperm2i128 $0x20,%ymm4,%ymm12,%ymm9 + vperm2i128 $0x31,%ymm4,%ymm12,%ymm4 + vmovdqa 0(%rsp),%ymm6 + vmovdqa 32(%rsp),%ymm12 + + cmpq $512,%rdx + jb L$tail8x + + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + leaq 128(%rsi),%rsi + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm12,%ymm12 + vpxor 32(%rsi),%ymm13,%ymm13 + vpxor 64(%rsi),%ymm10,%ymm10 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq 128(%rsi),%rsi + vmovdqu %ymm12,0(%rdi) + vmovdqu %ymm13,32(%rdi) + vmovdqu %ymm10,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm14,%ymm14 + vpxor 32(%rsi),%ymm2,%ymm2 + vpxor 64(%rsi),%ymm3,%ymm3 + vpxor 96(%rsi),%ymm7,%ymm7 + leaq 128(%rsi),%rsi + vmovdqu %ymm14,0(%rdi) + vmovdqu %ymm2,32(%rdi) + vmovdqu %ymm3,64(%rdi) + vmovdqu %ymm7,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm11,%ymm11 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm0,%ymm0 + vpxor 96(%rsi),%ymm4,%ymm4 + leaq 128(%rsi),%rsi + vmovdqu %ymm11,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm0,64(%rdi) + vmovdqu %ymm4,96(%rdi) + leaq 128(%rdi),%rdi + + subq $512,%rdx + jnz L$oop_outer8x + + jmp L$done8x + +L$tail8x: + cmpq $448,%rdx + jae L$448_or_more8x + cmpq $384,%rdx + jae L$384_or_more8x + cmpq $320,%rdx + jae L$320_or_more8x + cmpq $256,%rdx + jae L$256_or_more8x + cmpq $192,%rdx + jae L$192_or_more8x + cmpq $128,%rdx + jae L$128_or_more8x + cmpq $64,%rdx + jae L$64_or_more8x + + xorq %r10,%r10 + vmovdqa %ymm6,0(%rsp) + vmovdqa %ymm8,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$64_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + je L$done8x + + leaq 64(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm1,0(%rsp) + leaq 64(%rdi),%rdi + subq $64,%rdx + vmovdqa %ymm5,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$128_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + je L$done8x + + leaq 128(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm12,0(%rsp) + leaq 128(%rdi),%rdi + subq $128,%rdx + vmovdqa %ymm13,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$192_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + je L$done8x + + leaq 192(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm10,0(%rsp) + leaq 192(%rdi),%rdi + subq $192,%rdx + vmovdqa %ymm15,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$256_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vmovdqu %ymm6,0(%rdi) + vmovdqu 
%ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + je L$done8x + + leaq 256(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm14,0(%rsp) + leaq 256(%rdi),%rdi + subq $256,%rdx + vmovdqa %ymm2,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$320_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + je L$done8x + + leaq 320(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm3,0(%rsp) + leaq 320(%rdi),%rdi + subq $320,%rdx + vmovdqa %ymm7,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$384_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + je L$done8x + + leaq 384(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm11,0(%rsp) + leaq 384(%rdi),%rdi + subq $384,%rdx + vmovdqa %ymm9,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$448_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vpxor 384(%rsi),%ymm11,%ymm11 + vpxor 416(%rsi),%ymm9,%ymm9 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + vmovdqu %ymm11,384(%rdi) + vmovdqu %ymm9,416(%rdi) + je L$done8x + + leaq 448(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm0,0(%rsp) + leaq 448(%rdi),%rdi + subq $448,%rdx + vmovdqa %ymm4,32(%rsp) + +L$oop_tail8x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz L$oop_tail8x + +L$done8x: + vzeroall + leaq (%r9),%rsp + +L$8x_epilogue: + ret + + diff --git a/crypto/chacha20poly1305.cpp b/crypto/chacha20poly1305.cpp new file mode 100644 index 0000000..a5c222d --- /dev/null +++ b/crypto/chacha20poly1305.cpp @@ -0,0 +1,596 @@ +/* SPDX-License-Identifier: OpenSSL OR (BSD-3-Clause OR GPL-2.0) + * + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + * Copyright 2016 The OpenSSL Project Authors. 
All Rights Reserved. + */ + +#include "stdafx.h" +#include "crypto/chacha20poly1305.h" +#include "tunsafe_types.h" +#include "tunsafe_endian.h" +#include "build_config.h" +#include "tunsafe_cpu.h" +#include "crypto_ops.h" +#include <string.h> +#include <assert.h> + +enum { + CHACHA20_IV_SIZE = 16, + CHACHA20_KEY_SIZE = 32, + CHACHA20_BLOCK_SIZE = 64, + POLY1305_BLOCK_SIZE = 16, + POLY1305_KEY_SIZE = 32, + POLY1305_MAC_SIZE = 16 +}; + + +#if defined(OS_MACOSX) || !WITH_AVX512_OPTIMIZATIONS +#define CHACHA20_WITH_AVX512 0 +#else +#define CHACHA20_WITH_AVX512 1 +#endif + +extern "C" { +void _cdecl hchacha20_ssse3(uint8 *derived_key, const uint8 *nonce, const uint8 *key); +void _cdecl chacha20_ssse3(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl chacha20_avx2(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl chacha20_avx512(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl chacha20_avx512vl(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl poly1305_init_x86_64(void *ctx, const uint8 key[16]); +void _cdecl poly1305_blocks_x86_64(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +void _cdecl poly1305_emit_x86_64(void *ctx, uint8 mac[16], const uint32 nonce[4]); +void _cdecl poly1305_emit_avx(void *ctx, uint8 mac[16], const uint32 nonce[4]); +void _cdecl poly1305_blocks_avx(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +void _cdecl poly1305_blocks_avx2(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +void _cdecl poly1305_blocks_avx512(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +} + +struct chacha20_ctx { + uint32 state[CHACHA20_BLOCK_SIZE / sizeof(uint32)]; +}; + +void crypto_xor(uint8 *dst, const uint8 *src, size_t n) { + for (; n >= 4; n -= 4, dst += 4, src += 4) + *(uint32*)dst ^= *(uint32*)src; + for (; n; n--) + *dst++ ^= *src++; +} + +int memcmp_crypto(const uint8 *a, const uint8 *b, size_t n) { + int rv = 0; + for (; n >= 4; n -= 4, a += 4, b += 4) + rv |= *(uint32*)a ^ *(uint32*)b; + for (; n; n--) + rv |= *a++ ^ *b++; + return rv; +} + +#define QUARTER_ROUND(x, a, b, c, d) ( \ + x[a] += x[b], \ + x[d] = rol32((x[d] ^ x[a]), 16), \ + x[c] += x[d], \ + x[b] = rol32((x[b] ^ x[c]), 12), \ + x[a] += x[b], \ + x[d] = rol32((x[d] ^ x[a]), 8), \ + x[c] += x[d], \ + x[b] = rol32((x[b] ^ x[c]), 7) \ +) + +#define C(i, j) (i * 4 + j) + +#define DOUBLE_ROUND(x) ( \ + /* Column Round */ \ + QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \ + QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \ + QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \ + QUARTER_ROUND(x, C(0, 3), C(1, 3), C(2, 3), C(3, 3)), \ + /* Diagonal Round */ \ + QUARTER_ROUND(x, C(0, 0), C(1, 1), C(2, 2), C(3, 3)), \ + QUARTER_ROUND(x, C(0, 1), C(1, 2), C(2, 3), C(3, 0)), \ + QUARTER_ROUND(x, C(0, 2), C(1, 3), C(2, 0), C(3, 1)), \ + QUARTER_ROUND(x, C(0, 3), C(1, 0), C(2, 1), C(3, 2)) \ +) + +#define TWENTY_ROUNDS(x) ( \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x) \ +) + +SAFEBUFFERS static void chacha20_block_generic(struct chacha20_ctx *ctx, uint32 *stream) +{ + uint32 x[CHACHA20_BLOCK_SIZE / sizeof(uint32)]; + int i; + + for (i = 0; i < ARRAY_SIZE(x); ++i) + x[i] = ctx->state[i]; + + TWENTY_ROUNDS(x); + + for (i = 0; i < ARRAY_SIZE(x);
++i) + stream[i] = ToLE32(x[i] + ctx->state[i]); + + ++ctx->state[12]; +} + +SAFEBUFFERS static void hchacha20_generic(uint8 derived_key[CHACHA20POLY1305_KEYLEN], const uint8 nonce[16], const uint8 key[CHACHA20POLY1305_KEYLEN]) +{ + uint32 *out = (uint32 *)derived_key; + uint32 x[] = { + 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, + ReadLE32(key + 0), ReadLE32(key + 4), ReadLE32(key + 8), ReadLE32(key + 12), + ReadLE32(key + 16), ReadLE32(key + 20), ReadLE32(key + 24), ReadLE32(key + 28), + ReadLE32(nonce + 0), ReadLE32(nonce + 4), ReadLE32(nonce + 8), ReadLE32(nonce + 12) + }; + + TWENTY_ROUNDS(x); + + out[0] = ToLE32(x[0]); + out[1] = ToLE32(x[1]); + out[2] = ToLE32(x[2]); + out[3] = ToLE32(x[3]); + out[4] = ToLE32(x[12]); + out[5] = ToLE32(x[13]); + out[6] = ToLE32(x[14]); + out[7] = ToLE32(x[15]); +} + +static inline void hchacha20(uint8 derived_key[CHACHA20POLY1305_KEYLEN], const uint8 nonce[16], const uint8 key[CHACHA20POLY1305_KEYLEN]) +{ +#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC) + if (X86_PCAP_SSSE3) { + hchacha20_ssse3(derived_key, nonce, key); + return; + } +#endif // defined(ARCH_CPU_X86_64) + hchacha20_generic(derived_key, nonce, key); +} + +#define chacha20_initial_state(key, nonce) {{ \ + 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, \ + ReadLE32((key) + 0), ReadLE32((key) + 4), ReadLE32((key) + 8), ReadLE32((key) + 12), \ + ReadLE32((key) + 16), ReadLE32((key) + 20), ReadLE32((key) + 24), ReadLE32((key) + 28), \ + 0, 0, ReadLE32((nonce) + 0), ReadLE32((nonce) + 4) \ +}} + +SAFEBUFFERS static void chacha20_crypt(struct chacha20_ctx *ctx, uint8 *dst, const uint8 *src, uint32 bytes) +{ + uint32 buf[CHACHA20_BLOCK_SIZE / sizeof(uint32)]; + + if (bytes == 0) + return; + +#if defined(ARCH_CPU_X86_64) +#if CHACHA20_WITH_AVX512 + if (X86_PCAP_AVX512F) { + chacha20_avx512(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } + if (X86_PCAP_AVX512VL) { + chacha20_avx512vl(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } +#endif // CHACHA20_WITH_AVX512 + if (X86_PCAP_AVX2) { + chacha20_avx2(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } + if (X86_PCAP_SSSE3) { + assert(bytes); + chacha20_ssse3(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } +#endif // defined(ARCH_CPU_X86_64) + + if (dst != src) + memcpy(dst, src, bytes); + + while (bytes >= CHACHA20_BLOCK_SIZE) { + chacha20_block_generic(ctx, buf); + crypto_xor(dst, (uint8 *)buf, CHACHA20_BLOCK_SIZE); + bytes -= CHACHA20_BLOCK_SIZE; + dst += CHACHA20_BLOCK_SIZE; + } + if (bytes) { + chacha20_block_generic(ctx, buf); + crypto_xor(dst, (uint8 *)buf, bytes); + } +} + +struct poly1305_ctx { + uint8 opaque[24 * sizeof(uint64)]; + uint32 nonce[4]; + uint8 data[POLY1305_BLOCK_SIZE]; + size_t num; +}; + +#if !(defined(CONFIG_X86_64) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || (defined(CONFIG_MIPS) && defined(CONFIG_64BIT))) +struct poly1305_internal { + uint32 h[5]; + uint32 r[4]; +}; + +static void poly1305_init_generic(void *ctx, const uint8 key[16]) { + struct poly1305_internal *st = (struct poly1305_internal *)ctx; + + /* h = 0 */ + st->h[0] = 0; + st->h[1] = 0; + st->h[2] = 0; + st->h[3] = 0; + st->h[4] = 0; + + /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ + st->r[0] = ReadLE32(&key[ 0]) & 0x0fffffff; + st->r[1] = ReadLE32(&key[ 4]) & 0x0ffffffc; + st->r[2] = ReadLE32(&key[ 8]) & 0x0ffffffc; + 
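
/* Note: the masks above implement the standard Poly1305 clamp (RFC 8439),
 * i.e. r &= 0x0ffffffc0ffffffc0ffffffc0fffffff, equivalent to the byte-wise form
 *   key[3] &= 15; key[7] &= 15; key[11] &= 15; key[15] &= 15;
 *   key[4] &= 252; key[8] &= 252; key[12] &= 252;
 * Each 32-bit word of r keeps its top four bits clear (words 1..3 also clear
 * their low two bits), so r < 2^124 and the 64-bit partial products
 * accumulated in poly1305_blocks_generic() below cannot overflow. */
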
st->r[3] = ReadLE32(&key[12]) & 0x0ffffffc; +} + +static void poly1305_blocks_generic(void *ctx, const uint8 *inp, size_t len, uint32 padbit) +{ +#define CONSTANT_TIME_CARRY(a,b) ((a ^ ((a ^ b) | ((a - b) ^ b))) >> (sizeof(a) * 8 - 1)) + struct poly1305_internal *st = (struct poly1305_internal *)ctx; + uint32 r0, r1, r2, r3; + uint32 s1, s2, s3; + uint32 h0, h1, h2, h3, h4, c; + uint64 d0, d1, d2, d3; + + r0 = st->r[0]; + r1 = st->r[1]; + r2 = st->r[2]; + r3 = st->r[3]; + + s1 = r1 + (r1 >> 2); + s2 = r2 + (r2 >> 2); + s3 = r3 + (r3 >> 2); + + h0 = st->h[0]; + h1 = st->h[1]; + h2 = st->h[2]; + h3 = st->h[3]; + h4 = st->h[4]; + + while (len >= POLY1305_BLOCK_SIZE) { + /* h += m[i] */ + h0 = (uint32)(d0 = (uint64)h0 + ReadLE32(inp + 0)); + h1 = (uint32)(d1 = (uint64)h1 + (d0 >> 32) + ReadLE32(inp + 4)); + h2 = (uint32)(d2 = (uint64)h2 + (d1 >> 32) + ReadLE32(inp + 8)); + h3 = (uint32)(d3 = (uint64)h3 + (d2 >> 32) + ReadLE32(inp + 12)); + h4 += (uint32)(d3 >> 32) + padbit; + + /* h *= r "%" p, where "%" stands for "partial remainder" */ + d0 = ((uint64)h0 * r0) + + ((uint64)h1 * s3) + + ((uint64)h2 * s2) + + ((uint64)h3 * s1); + d1 = ((uint64)h0 * r1) + + ((uint64)h1 * r0) + + ((uint64)h2 * s3) + + ((uint64)h3 * s2) + + (h4 * s1); + d2 = ((uint64)h0 * r2) + + ((uint64)h1 * r1) + + ((uint64)h2 * r0) + + ((uint64)h3 * s3) + + (h4 * s2); + d3 = ((uint64)h0 * r3) + + ((uint64)h1 * r2) + + ((uint64)h2 * r1) + + ((uint64)h3 * r0) + + (h4 * s3); + h4 = (h4 * r0); + + /* last reduction step: */ + /* a) h4:h0 = h4<<128 + d3<<96 + d2<<64 + d1<<32 + d0 */ + h0 = (uint32)d0; + h1 = (uint32)(d1 += d0 >> 32); + h2 = (uint32)(d2 += d1 >> 32); + h3 = (uint32)(d3 += d2 >> 32); + h4 += (uint32)(d3 >> 32); + /* b) (h4:h0 += (h4:h0>>130) * 5) %= 2^130 */ + c = (h4 >> 2) + (h4 & ~3U); + h4 &= 3; + h0 += c; + h1 += (c = CONSTANT_TIME_CARRY(h0,c)); + h2 += (c = CONSTANT_TIME_CARRY(h1,c)); + h3 += (c = CONSTANT_TIME_CARRY(h2,c)); + h4 += CONSTANT_TIME_CARRY(h3,c); + /* + * Occasional overflows to 3rd bit of h4 are taken care of + * "naturally". If after this point we end up at the top of + * this loop, then the overflow bit will be accounted for + * in next iteration. If we end up in poly1305_emit, then + * comparison to modulus below will still count as "carry + * into 131st bit", so that properly reduced value will be + * picked in conditional move. 
+ */ + + inp += POLY1305_BLOCK_SIZE; + len -= POLY1305_BLOCK_SIZE; + } + + st->h[0] = h0; + st->h[1] = h1; + st->h[2] = h2; + st->h[3] = h3; + st->h[4] = h4; +#undef CONSTANT_TIME_CARRY +} + +static void poly1305_emit_generic(void *ctx, uint8 mac[16], const uint32 nonce[4]) +{ + struct poly1305_internal *st = (struct poly1305_internal *)ctx; + uint32 *omac = (uint32 *)mac; + uint32 h0, h1, h2, h3, h4; + uint32 g0, g1, g2, g3, g4; + uint64 t; + uint32 mask; + + h0 = st->h[0]; + h1 = st->h[1]; + h2 = st->h[2]; + h3 = st->h[3]; + h4 = st->h[4]; + + /* compare to modulus by computing h + -p */ + g0 = (uint32)(t = (uint64)h0 + 5); + g1 = (uint32)(t = (uint64)h1 + (t >> 32)); + g2 = (uint32)(t = (uint64)h2 + (t >> 32)); + g3 = (uint32)(t = (uint64)h3 + (t >> 32)); + g4 = h4 + (uint32)(t >> 32); + + /* if there was carry into 131st bit, h3:h0 = g3:g0 */ + mask = 0 - (g4 >> 2); + g0 &= mask; + g1 &= mask; + g2 &= mask; + g3 &= mask; + mask = ~mask; + h0 = (h0 & mask) | g0; + h1 = (h1 & mask) | g1; + h2 = (h2 & mask) | g2; + h3 = (h3 & mask) | g3; + + /* mac = (h + nonce) % (2^128) */ + h0 = (uint32)(t = (uint64)h0 + nonce[0]); + h1 = (uint32)(t = (uint64)h1 + (t >> 32) + nonce[1]); + h2 = (uint32)(t = (uint64)h2 + (t >> 32) + nonce[2]); + h3 = (uint32)(t = (uint64)h3 + (t >> 32) + nonce[3]); + + omac[0] = ToLE32(h0); + omac[1] = ToLE32(h1); + omac[2] = ToLE32(h2); + omac[3] = ToLE32(h3); +} +#endif + +SAFEBUFFERS static void poly1305_init(struct poly1305_ctx *ctx, const uint8 key[POLY1305_KEY_SIZE]) +{ + ctx->nonce[0] = ReadLE32(&key[16]); + ctx->nonce[1] = ReadLE32(&key[20]); + ctx->nonce[2] = ReadLE32(&key[24]); + ctx->nonce[3] = ReadLE32(&key[28]); + +#if defined(ARCH_CPU_X86_64) + poly1305_init_x86_64(ctx->opaque, key); +#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64) + poly1305_init_arm(ctx->opaque, key); +#elif defined(CONFIG_MIPS) && defined(CONFIG_64BIT) + poly1305_init_mips(ctx->opaque, key); +#else + poly1305_init_generic(ctx->opaque, key); +#endif + ctx->num = 0; +} + +static inline void poly1305_blocks(void *ctx, const uint8 *inp, size_t len, uint32 padbit) +{ +#if defined(ARCH_CPU_X86_64) +#if CHACHA20_WITH_AVX512 + if(X86_PCAP_AVX512F) + poly1305_blocks_avx512(ctx, inp, len, padbit); + else +#endif // CHACHA20_WITH_AVX512 + if (X86_PCAP_AVX2) + poly1305_blocks_avx2(ctx, inp, len, padbit); + else if (X86_PCAP_AVX) + poly1305_blocks_avx(ctx, inp, len, padbit); + else + poly1305_blocks_x86_64(ctx, inp, len, padbit); +#else // defined(ARCH_CPU_X86_64) + poly1305_blocks_generic(ctx, inp, len, padbit); +#endif // defined(ARCH_CPU_X86_64) +} + +static inline void poly1305_emit(void *ctx, uint8 mac[16], const uint32 nonce[4]) +{ +#if defined(ARCH_CPU_X86_64) + if (X86_PCAP_AVX) + poly1305_emit_avx(ctx, mac, nonce); + else + poly1305_emit_x86_64(ctx, mac, nonce); +#else // defined(ARCH_CPU_X86_64) + poly1305_emit_generic(ctx, mac, nonce); +#endif // defined(ARCH_CPU_X86_64) +} + +SAFEBUFFERS static void poly1305_update(struct poly1305_ctx *ctx, const uint8 *inp, size_t len) +{ + const size_t num = ctx->num; + size_t rem; + + if (num) { + rem = POLY1305_BLOCK_SIZE - num; + if (len >= rem) { + memcpy(ctx->data + num, inp, rem); + poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 1); + inp += rem; + len -= rem; + } else { + /* Still not enough data to process a block. 
*/ + memcpy(ctx->data + num, inp, len); + ctx->num = num + len; + return; + } + } + + rem = len % POLY1305_BLOCK_SIZE; + len -= rem; + + if (len >= POLY1305_BLOCK_SIZE) { + poly1305_blocks(ctx->opaque, inp, len, 1); + inp += len; + } + + if (rem) + memcpy(ctx->data, inp, rem); + + ctx->num = rem; +} + +SAFEBUFFERS static void poly1305_finish(struct poly1305_ctx *ctx, uint8 mac[16]) +{ + size_t num = ctx->num; + + if (num) { + ctx->data[num++] = 1; /* pad bit */ + while (num < POLY1305_BLOCK_SIZE) + ctx->data[num++] = 0; + poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 0); + } + + poly1305_emit(ctx->opaque, mac, ctx->nonce); + + /* zero out the state */ + memzero_crypto(ctx, sizeof(*ctx)); +} + +static const uint8 pad0[16] = { 0 }; + +SAFEBUFFERS static FORCEINLINE void poly1305_getmac(const uint8 *ad, size_t ad_len, const uint8 *src, size_t src_len, const uint8 key[POLY1305_KEY_SIZE], uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) { + uint64 len[2]; + struct poly1305_ctx poly1305_state; + + poly1305_init(&poly1305_state, key); + poly1305_update(&poly1305_state, ad, ad_len); + poly1305_update(&poly1305_state, pad0, (0 - ad_len) & 0xf); + poly1305_update(&poly1305_state, src, src_len); + poly1305_update(&poly1305_state, pad0, (0 - src_len) & 0xf); + len[0] = ToLE64(ad_len); + len[1] = ToLE64(src_len); + poly1305_update(&poly1305_state, (uint8 *)&len, sizeof(len)); + poly1305_finish(&poly1305_state, mac); +} + +struct ChaChaState { + struct chacha20_ctx chacha20_state; + uint8 block0[CHACHA20_BLOCK_SIZE]; +}; + +static inline void InitializeChaChaState(ChaChaState *st, const uint8 key[CHACHA20POLY1305_KEYLEN], uint64 nonce) { + uint64 le_nonce = ToLE64(nonce); + WriteLE64((uint8*)st, 0x3320646e61707865); + WriteLE64((uint8*)st + 8, 0x6b20657479622d32); + Write64((uint8*)st + 16, Read64(key + 0)); + Write64((uint8*)st + 24, Read64(key + 8)); + Write64((uint8*)st + 32, Read64(key + 16)); + Write64((uint8*)st + 40, Read64(key + 24)); + Write64((uint8*)st + 48, 0); + Write64((uint8*)st + 56, Read64((uint8*)&le_nonce)); + + Write64((uint8*)st + 64 + 0 * 8, 0); + Write64((uint8*)st + 64 + 1 * 8, 0); + Write64((uint8*)st + 64 + 2 * 8, 0); + Write64((uint8*)st + 64 + 3 * 8, 0); + Write64((uint8*)st + 64 + 4 * 8, 0); + Write64((uint8*)st + 64 + 5 * 8, 0); + Write64((uint8*)st + 64 + 6 * 8, 0); + Write64((uint8*)st + 64 + 7 * 8, 0); +} + +SAFEBUFFERS void poly1305_get_mac(const uint8 *src, size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) { + ChaChaState st; + + InitializeChaChaState(&st, key, nonce); + chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0)); + poly1305_getmac(ad, ad_len, src, src_len, st.block0, mac); + memzero_crypto(&st, sizeof(st)); +} + +SAFEBUFFERS void chacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]) { + ChaChaState st; + + InitializeChaChaState(&st, key, nonce); + chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0)); + chacha20_crypt(&st.chacha20_state, dst, src, (uint32)src_len); + poly1305_getmac(ad, ad_len, dst, src_len, st.block0, dst + src_len); + memzero_crypto(&st, sizeof(st)); +} + +SAFEBUFFERS void chacha20poly1305_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 
key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) { + ChaChaState st; + + InitializeChaChaState(&st, key, nonce); + chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0)); + poly1305_getmac(ad, ad_len, src, src_len, st.block0, mac); + chacha20_crypt(&st.chacha20_state, dst, src, (uint32)src_len); + memzero_crypto(&st, sizeof(st)); +} + +SAFEBUFFERS bool chacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]) { + uint8 mac[POLY1305_MAC_SIZE]; + + if (src_len < CHACHA20POLY1305_AUTHTAGLEN) + return false; + chacha20poly1305_decrypt_get_mac(dst, src, src_len - CHACHA20POLY1305_AUTHTAGLEN, ad, ad_len, nonce, key, mac); + return memcmp_crypto(mac, src + src_len - CHACHA20POLY1305_AUTHTAGLEN, CHACHA20POLY1305_AUTHTAGLEN) == 0; +} + +void xchacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]) +{ + __aligned(16) uint8 derived_key[CHACHA20POLY1305_KEYLEN]; + + hchacha20(derived_key, nonce, key); + chacha20poly1305_encrypt(dst, src, src_len, ad, ad_len, ReadLE64(nonce + 16), derived_key); + memzero_crypto(derived_key, CHACHA20POLY1305_KEYLEN); +} + +bool xchacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]) { + bool ret; + __aligned(16) uint8 derived_key[CHACHA20POLY1305_KEYLEN]; + + hchacha20(derived_key, nonce, key); + ret = chacha20poly1305_decrypt(dst, src, src_len, ad, ad_len, ReadLE64(nonce + 16), derived_key); + memzero_crypto(derived_key, CHACHA20POLY1305_KEYLEN); + + return ret; +} + diff --git a/crypto/chacha20poly1305.h b/crypto/chacha20poly1305.h new file mode 100644 index 0000000..90b701d --- /dev/null +++ b/crypto/chacha20poly1305.h @@ -0,0 +1,39 @@ +#pragma once +#include "tunsafe_types.h" + + +enum { + XCHACHA20POLY1305_NONCELEN = 24, + CHACHA20POLY1305_KEYLEN = 32, + CHACHA20POLY1305_AUTHTAGLEN = 16 +}; + + +void chacha20poly1305_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]); + +bool chacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]); + +void chacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]); + + +void xchacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]); + +bool xchacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]); + +void poly1305_get_mac(const uint8 *src, size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]); \ No newline at end of file diff --git 
a/crypto/curve25519-donna.cpp b/crypto/curve25519-donna.cpp new file mode 100644 index 0000000..a8f5cbe --- /dev/null +++ b/crypto/curve25519-donna.cpp @@ -0,0 +1,737 @@ +/* Copyright 2008, Google Inc. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are + * met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following disclaimer + * in the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Google Inc. nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * curve25519-donna: Curve25519 elliptic curve, public key function + * + * http://code.google.com/p/curve25519-donna/ + * + * Adam Langley + * + * Derived from public domain C code by Daniel J. Bernstein + * + * More information about curve25519 can be found here + * http://cr.yp.to/ecdh.html + * + * djb's sample implementation of curve25519 is written in a special assembly + * language called qhasm and uses the floating point registers. + * + * This is, almost, a clean room reimplementation from the curve25519 paper. It + * uses many of the tricks described therein. Only the crecip function is taken + * from the sample implementation. + */ + +#include <string.h> +#include <stdint.h> + +#ifdef _MSC_VER +#define inline __inline +#endif + +typedef uint8_t u8; +typedef int32_t s32; +typedef int64_t limb; + +/* Field element representation: + * + * Field elements are written as an array of signed, 64-bit limbs, least + * significant first. The value of the field element is: + * x[0] + 2^26·x[1] + 2^51·x[2] + 2^77·x[3] + ... + * + * i.e. the limbs are 26, 25, 26, 25, ... bits wide. + */ + +/* Sum two numbers: output += in */ +static void fsum(limb *output, const limb *in) { + unsigned i; + for (i = 0; i < 10; i += 2) { + output[0+i] = (output[0+i] + in[0+i]); + output[1+i] = (output[1+i] + in[1+i]); + } +} + +/* Find the difference of two numbers: output = in - output + * (note the order of the arguments!)
+ */ +static void fdifference(limb *output, const limb *in) { + unsigned i; + for (i = 0; i < 10; ++i) { + output[i] = (in[i] - output[i]); + } +} + +/* Multiply a number by a scalar: output = in * scalar */ +static void fscalar_product(limb *output, const limb *in, const limb scalar) { + unsigned i; + for (i = 0; i < 10; ++i) { + output[i] = in[i] * scalar; + } +} + +/* Multiply two numbers: output = in2 * in + * + * output must be distinct to both inputs. The inputs are reduced coefficient + * form, the output is not. + */ +static void fproduct(limb *output, const limb *in2, const limb *in) { + output[0] = ((limb) ((s32) in2[0])) * ((s32) in[0]); + output[1] = ((limb) ((s32) in2[0])) * ((s32) in[1]) + + ((limb) ((s32) in2[1])) * ((s32) in[0]); + output[2] = 2 * ((limb) ((s32) in2[1])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[2]) + + ((limb) ((s32) in2[2])) * ((s32) in[0]); + output[3] = ((limb) ((s32) in2[1])) * ((s32) in[2]) + + ((limb) ((s32) in2[2])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[3]) + + ((limb) ((s32) in2[3])) * ((s32) in[0]); + output[4] = ((limb) ((s32) in2[2])) * ((s32) in[2]) + + 2 * (((limb) ((s32) in2[1])) * ((s32) in[3]) + + ((limb) ((s32) in2[3])) * ((s32) in[1])) + + ((limb) ((s32) in2[0])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[0]); + output[5] = ((limb) ((s32) in2[2])) * ((s32) in[3]) + + ((limb) ((s32) in2[3])) * ((s32) in[2]) + + ((limb) ((s32) in2[1])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[0]); + output[6] = 2 * (((limb) ((s32) in2[3])) * ((s32) in[3]) + + ((limb) ((s32) in2[1])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[1])) + + ((limb) ((s32) in2[2])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[2]) + + ((limb) ((s32) in2[0])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[0]); + output[7] = ((limb) ((s32) in2[3])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[3]) + + ((limb) ((s32) in2[2])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[2]) + + ((limb) ((s32) in2[1])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[0]); + output[8] = ((limb) ((s32) in2[4])) * ((s32) in[4]) + + 2 * (((limb) ((s32) in2[3])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[3]) + + ((limb) ((s32) in2[1])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[1])) + + ((limb) ((s32) in2[2])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[2]) + + ((limb) ((s32) in2[0])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[0]); + output[9] = ((limb) ((s32) in2[4])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[4]) + + ((limb) ((s32) in2[3])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[3]) + + ((limb) ((s32) in2[2])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[2]) + + ((limb) ((s32) in2[1])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[0]); + output[10] = 2 * (((limb) ((s32) in2[5])) * ((s32) in[5]) + + ((limb) ((s32) in2[3])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[3]) + + ((limb) ((s32) in2[1])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[1])) + + ((limb) ((s32) in2[4])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[4]) + + ((limb) ((s32) in2[2])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * 
((s32) in[2]); + output[11] = ((limb) ((s32) in2[5])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[5]) + + ((limb) ((s32) in2[4])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[4]) + + ((limb) ((s32) in2[3])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[3]) + + ((limb) ((s32) in2[2])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[2]); + output[12] = ((limb) ((s32) in2[6])) * ((s32) in[6]) + + 2 * (((limb) ((s32) in2[5])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[5]) + + ((limb) ((s32) in2[3])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[3])) + + ((limb) ((s32) in2[4])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[4]); + output[13] = ((limb) ((s32) in2[6])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[6]) + + ((limb) ((s32) in2[5])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[5]) + + ((limb) ((s32) in2[4])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[4]); + output[14] = 2 * (((limb) ((s32) in2[7])) * ((s32) in[7]) + + ((limb) ((s32) in2[5])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[5])) + + ((limb) ((s32) in2[6])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[6]); + output[15] = ((limb) ((s32) in2[7])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[7]) + + ((limb) ((s32) in2[6])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[6]); + output[16] = ((limb) ((s32) in2[8])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in2[7])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[7])); + output[17] = ((limb) ((s32) in2[8])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[8]); + output[18] = 2 * ((limb) ((s32) in2[9])) * ((s32) in[9]); +} + +/* Reduce a long form to a short form by taking the input mod 2^255 - 19. */ +static void freduce_degree(limb *output) { + /* Each of these shifts and adds ends up multiplying the value by 19. */ + output[8] += output[18] << 4; + output[8] += output[18] << 1; + output[8] += output[18]; + output[7] += output[17] << 4; + output[7] += output[17] << 1; + output[7] += output[17]; + output[6] += output[16] << 4; + output[6] += output[16] << 1; + output[6] += output[16]; + output[5] += output[15] << 4; + output[5] += output[15] << 1; + output[5] += output[15]; + output[4] += output[14] << 4; + output[4] += output[14] << 1; + output[4] += output[14]; + output[3] += output[13] << 4; + output[3] += output[13] << 1; + output[3] += output[13]; + output[2] += output[12] << 4; + output[2] += output[12] << 1; + output[2] += output[12]; + output[1] += output[11] << 4; + output[1] += output[11] << 1; + output[1] += output[11]; + output[0] += output[10] << 4; + output[0] += output[10] << 1; + output[0] += output[10]; +} + +#if (-1 & 3) != 3 +#error "This code only works on a two's complement system" +#endif + +/* return v / 2^26, using only shifts and adds. */ +static inline limb +div_by_2_26(const limb v) +{ + /* High word of v; no shift needed*/ + const uint32_t highword = (uint32_t) (((uint64_t) v) >> 32); + /* Set to all 1s if v was negative; else set to 0s. */ + const int32_t sign = ((int32_t) highword) >> 31; + /* Set to 0x3ffffff if v was negative; else set to 0. */ + const int32_t roundoff = ((uint32_t) sign) >> 6; + /* Should return v / (1<<26) */ + return (v + roundoff) >> 26; +} + +/* return v / (2^25), using only shifts and adds. 
*/ +static inline limb +div_by_2_25(const limb v) +{ + /* High word of v; no shift needed*/ + const uint32_t highword = (uint32_t) (((uint64_t) v) >> 32); + /* Set to all 1s if v was negative; else set to 0s. */ + const int32_t sign = ((int32_t) highword) >> 31; + /* Set to 0x1ffffff if v was negative; else set to 0. */ + const int32_t roundoff = ((uint32_t) sign) >> 7; + /* Should return v / (1<<25) */ + return (v + roundoff) >> 25; +} + +static inline s32 +div_s32_by_2_25(const s32 v) +{ + const s32 roundoff = ((uint32_t)(v >> 31)) >> 7; + return (v + roundoff) >> 25; +} + +/* Reduce all coefficients of the short form input so that |x| < 2^26. + * + * On entry: |output[i]| < 2^62 + */ +static void freduce_coefficients(limb *output) { + unsigned i; + + output[10] = 0; + + for (i = 0; i < 10; i += 2) { + limb over = div_by_2_26(output[i]); + output[i] -= over << 26; + output[i+1] += over; + + over = div_by_2_25(output[i+1]); + output[i+1] -= over << 25; + output[i+2] += over; + } + /* Now |output[10]| < 2 ^ 38 and all other coefficients are reduced. */ + output[0] += output[10] << 4; + output[0] += output[10] << 1; + output[0] += output[10]; + + output[10] = 0; + + /* Now output[1..9] are reduced, and |output[0]| < 2^26 + 19 * 2^38 + * So |over| will be no more than 77825 */ + { + limb over = div_by_2_26(output[0]); + output[0] -= over << 26; + output[1] += over; + } + + /* Now output[0,2..9] are reduced, and |output[1]| < 2^25 + 77825 + * So |over| will be no more than 1. */ + { + /* output[1] fits in 32 bits, so we can use div_s32_by_2_25 here. */ + s32 over32 = div_s32_by_2_25((s32) output[1]); + output[1] -= over32 << 25; + output[2] += over32; + } + + /* Finally, output[0,1,3..9] are reduced, and output[2] is "nearly reduced": + * we have |output[2]| <= 2^26. This is good enough for all of our math, + * but it will require an extra freduce_coefficients before fcontract. */ +} + +/* A helpful wrapper around fproduct: output = in * in2. + * + * output must be distinct to both inputs. The output is reduced degree and + * reduced coefficient. 
+ */ +static void +fmul(limb *output, const limb *in, const limb *in2) { + limb t[19]; + fproduct(t, in, in2); + freduce_degree(t); + freduce_coefficients(t); + memcpy(output, t, sizeof(limb) * 10); +} + +static void fsquare_inner(limb *output, const limb *in) { + output[0] = ((limb) ((s32) in[0])) * ((s32) in[0]); + output[1] = 2 * ((limb) ((s32) in[0])) * ((s32) in[1]); + output[2] = 2 * (((limb) ((s32) in[1])) * ((s32) in[1]) + + ((limb) ((s32) in[0])) * ((s32) in[2])); + output[3] = 2 * (((limb) ((s32) in[1])) * ((s32) in[2]) + + ((limb) ((s32) in[0])) * ((s32) in[3])); + output[4] = ((limb) ((s32) in[2])) * ((s32) in[2]) + + 4 * ((limb) ((s32) in[1])) * ((s32) in[3]) + + 2 * ((limb) ((s32) in[0])) * ((s32) in[4]); + output[5] = 2 * (((limb) ((s32) in[2])) * ((s32) in[3]) + + ((limb) ((s32) in[1])) * ((s32) in[4]) + + ((limb) ((s32) in[0])) * ((s32) in[5])); + output[6] = 2 * (((limb) ((s32) in[3])) * ((s32) in[3]) + + ((limb) ((s32) in[2])) * ((s32) in[4]) + + ((limb) ((s32) in[0])) * ((s32) in[6]) + + 2 * ((limb) ((s32) in[1])) * ((s32) in[5])); + output[7] = 2 * (((limb) ((s32) in[3])) * ((s32) in[4]) + + ((limb) ((s32) in[2])) * ((s32) in[5]) + + ((limb) ((s32) in[1])) * ((s32) in[6]) + + ((limb) ((s32) in[0])) * ((s32) in[7])); + output[8] = ((limb) ((s32) in[4])) * ((s32) in[4]) + + 2 * (((limb) ((s32) in[2])) * ((s32) in[6]) + + ((limb) ((s32) in[0])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in[1])) * ((s32) in[7]) + + ((limb) ((s32) in[3])) * ((s32) in[5]))); + output[9] = 2 * (((limb) ((s32) in[4])) * ((s32) in[5]) + + ((limb) ((s32) in[3])) * ((s32) in[6]) + + ((limb) ((s32) in[2])) * ((s32) in[7]) + + ((limb) ((s32) in[1])) * ((s32) in[8]) + + ((limb) ((s32) in[0])) * ((s32) in[9])); + output[10] = 2 * (((limb) ((s32) in[5])) * ((s32) in[5]) + + ((limb) ((s32) in[4])) * ((s32) in[6]) + + ((limb) ((s32) in[2])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in[3])) * ((s32) in[7]) + + ((limb) ((s32) in[1])) * ((s32) in[9]))); + output[11] = 2 * (((limb) ((s32) in[5])) * ((s32) in[6]) + + ((limb) ((s32) in[4])) * ((s32) in[7]) + + ((limb) ((s32) in[3])) * ((s32) in[8]) + + ((limb) ((s32) in[2])) * ((s32) in[9])); + output[12] = ((limb) ((s32) in[6])) * ((s32) in[6]) + + 2 * (((limb) ((s32) in[4])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in[5])) * ((s32) in[7]) + + ((limb) ((s32) in[3])) * ((s32) in[9]))); + output[13] = 2 * (((limb) ((s32) in[6])) * ((s32) in[7]) + + ((limb) ((s32) in[5])) * ((s32) in[8]) + + ((limb) ((s32) in[4])) * ((s32) in[9])); + output[14] = 2 * (((limb) ((s32) in[7])) * ((s32) in[7]) + + ((limb) ((s32) in[6])) * ((s32) in[8]) + + 2 * ((limb) ((s32) in[5])) * ((s32) in[9])); + output[15] = 2 * (((limb) ((s32) in[7])) * ((s32) in[8]) + + ((limb) ((s32) in[6])) * ((s32) in[9])); + output[16] = ((limb) ((s32) in[8])) * ((s32) in[8]) + + 4 * ((limb) ((s32) in[7])) * ((s32) in[9]); + output[17] = 2 * ((limb) ((s32) in[8])) * ((s32) in[9]); + output[18] = 2 * ((limb) ((s32) in[9])) * ((s32) in[9]); +} + +static void +fsquare(limb *output, const limb *in) { + limb t[19]; + fsquare_inner(t, in); + freduce_degree(t); + freduce_coefficients(t); + memcpy(output, t, sizeof(limb) * 10); +} + +/* Take a little-endian, 32-byte number and expand it into polynomial form */ +static void +fexpand(limb *output, const u8 *input) { +#define F(n,start,shift,mask) \ + output[n] = ((((limb) input[start + 0]) | \ + ((limb) input[start + 1]) << 8 | \ + ((limb) input[start + 2]) << 16 | \ + ((limb) input[start + 3]) << 24) >> shift) & mask; + F(0, 0, 0, 0x3ffffff); + F(1, 3, 2, 0x1ffffff); 
+ F(2, 6, 3, 0x3ffffff); + F(3, 9, 5, 0x1ffffff); + F(4, 12, 6, 0x3ffffff); + F(5, 16, 0, 0x1ffffff); + F(6, 19, 1, 0x3ffffff); + F(7, 22, 3, 0x1ffffff); + F(8, 25, 4, 0x3ffffff); + F(9, 28, 6, 0x3ffffff); +#undef F +} + +#if (-32 >> 1) != -16 +#error "This code only works when >> does sign-extension on negative numbers" +#endif + +/* Take a fully reduced polynomial form number and contract it into a + * little-endian, 32-byte array + */ +static void +fcontract(u8 *output, limb *input) { + int i; + int j; + + for (j = 0; j < 2; ++j) { + for (i = 0; i < 9; ++i) { + if ((i & 1) == 1) { + /* This calculation is a time-invariant way to make input[i] positive + by borrowing from the next-larger limb. + */ + const s32 mask = (s32)(input[i]) >> 31; + const s32 carry = -(((s32)(input[i]) & mask) >> 25); + input[i] = (s32)(input[i]) + (carry << 25); + input[i+1] = (s32)(input[i+1]) - carry; + } else { + const s32 mask = (s32)(input[i]) >> 31; + const s32 carry = -(((s32)(input[i]) & mask) >> 26); + input[i] = (s32)(input[i]) + (carry << 26); + input[i+1] = (s32)(input[i+1]) - carry; + } + } + { + const s32 mask = (s32)(input[9]) >> 31; + const s32 carry = -(((s32)(input[9]) & mask) >> 25); + input[9] = (s32)(input[9]) + (carry << 25); + input[0] = (s32)(input[0]) - (carry * 19); + } + } + + /* The first borrow-propagation pass above ended with every limb + except (possibly) input[0] non-negative. + + Since each input limb except input[0] is decreased by at most 1 + by a borrow-propagation pass, the second borrow-propagation pass + could only have wrapped around to decrease input[0] again if the + first pass left input[0] negative *and* input[1] through input[9] + were all zero. In that case, input[1] is now 2^25 - 1, and this + last borrow-propagation step will leave input[1] non-negative. + */ + { + const s32 mask = (s32)(input[0]) >> 31; + const s32 carry = -(((s32)(input[0]) & mask) >> 26); + input[0] = (s32)(input[0]) + (carry << 26); + input[1] = (s32)(input[1]) - carry; + } + + /* Both passes through the above loop, plus the last 0-to-1 step, are + necessary: if input[9] is -1 and input[0] through input[8] are 0, + negative values will remain in the array until the end. 
+ */ + + input[1] <<= 2; + input[2] <<= 3; + input[3] <<= 5; + input[4] <<= 6; + input[6] <<= 1; + input[7] <<= 3; + input[8] <<= 4; + input[9] <<= 6; +#define F(i, s) \ + output[s+0] |= input[i] & 0xff; \ + output[s+1] = (input[i] >> 8) & 0xff; \ + output[s+2] = (input[i] >> 16) & 0xff; \ + output[s+3] = (input[i] >> 24) & 0xff; + output[0] = 0; + output[16] = 0; + F(0,0); + F(1,3); + F(2,6); + F(3,9); + F(4,12); + F(5,16); + F(6,19); + F(7,22); + F(8,25); + F(9,28); +#undef F +} + +/* Input: Q, Q', Q-Q' + * Output: 2Q, Q+Q' + * + * x2 z2: long form + * x3 z3: long form + * x z: short form, destroyed + * xprime zprime: short form, destroyed + * qmqp: short form, preserved + */ +static void fmonty(limb *x2, limb *z2, /* output 2Q */ + limb *x3, limb *z3, /* output Q + Q' */ + limb *x, limb *z, /* input Q */ + limb *xprime, limb *zprime, /* input Q' */ + const limb *qmqp /* input Q - Q' */) { + limb origx[10], origxprime[10], zzz[19], xx[19], zz[19], xxprime[19], + zzprime[19], zzzprime[19], xxxprime[19]; + + memcpy(origx, x, 10 * sizeof(limb)); + fsum(x, z); + fdifference(z, origx); // does x - z + + memcpy(origxprime, xprime, sizeof(limb) * 10); + fsum(xprime, zprime); + fdifference(zprime, origxprime); + fproduct(xxprime, xprime, z); + fproduct(zzprime, x, zprime); + freduce_degree(xxprime); + freduce_coefficients(xxprime); + freduce_degree(zzprime); + freduce_coefficients(zzprime); + memcpy(origxprime, xxprime, sizeof(limb) * 10); + fsum(xxprime, zzprime); + fdifference(zzprime, origxprime); + fsquare(xxxprime, xxprime); + fsquare(zzzprime, zzprime); + fproduct(zzprime, zzzprime, qmqp); + freduce_degree(zzprime); + freduce_coefficients(zzprime); + memcpy(x3, xxxprime, sizeof(limb) * 10); + memcpy(z3, zzprime, sizeof(limb) * 10); + + fsquare(xx, x); + fsquare(zz, z); + fproduct(x2, xx, zz); + freduce_degree(x2); + freduce_coefficients(x2); + fdifference(zz, xx); // does zz = xx - zz + memset(zzz + 10, 0, sizeof(limb) * 9); + fscalar_product(zzz, zz, 121665); + /* No need to call freduce_degree here: + fscalar_product doesn't increase the degree of its input. */ + freduce_coefficients(zzz); + fsum(zzz, xx); + fproduct(z2, zz, zzz); + freduce_degree(z2); + freduce_coefficients(z2); +} + +/* Conditionally swap two reduced-form limb arrays if 'iswap' is 1, but leave + * them unchanged if 'iswap' is 0. Runs in data-invariant time to avoid + * side-channel attacks. + * + * NOTE that this function requires that 'iswap' be 1 or 0; other values give + * wrong results. Also, the two limb arrays must be in reduced-coefficient, + * reduced-degree form: the values in a[10..19] or b[10..19] aren't swapped, + * and all values in a[0..9],b[0..9] must have magnitude less than + * INT32_MAX.
+ */ +static void +swap_conditional(limb a[19], limb b[19], limb iswap) { + unsigned i; + const s32 swap = (s32) -iswap; + + for (i = 0; i < 10; ++i) { + const s32 x = swap & ( ((s32)a[i]) ^ ((s32)b[i]) ); + a[i] = ((s32)a[i]) ^ x; + b[i] = ((s32)b[i]) ^ x; + } +} + +/* Calculates nQ where Q is the x-coordinate of a point on the curve + * + * resultx/resultz: the x coordinate of the resulting curve point (short form) + * n: a little endian, 32-byte number + * q: a point of the curve (short form) + */ +static void +cmult(limb *resultx, limb *resultz, const u8 *n, const limb *q) { + limb a[19] = {0}, b[19] = {1}, c[19] = {1}, d[19] = {0}; + limb *nqpqx = a, *nqpqz = b, *nqx = c, *nqz = d, *t; + limb e[19] = {0}, f[19] = {1}, g[19] = {0}, h[19] = {1}; + limb *nqpqx2 = e, *nqpqz2 = f, *nqx2 = g, *nqz2 = h; + + unsigned i, j; + + memcpy(nqpqx, q, sizeof(limb) * 10); + + for (i = 0; i < 32; ++i) { + u8 byte = n[31 - i]; + for (j = 0; j < 8; ++j) { + const limb bit = byte >> 7; + + swap_conditional(nqx, nqpqx, bit); + swap_conditional(nqz, nqpqz, bit); + fmonty(nqx2, nqz2, + nqpqx2, nqpqz2, + nqx, nqz, + nqpqx, nqpqz, + q); + swap_conditional(nqx2, nqpqx2, bit); + swap_conditional(nqz2, nqpqz2, bit); + + t = nqx; + nqx = nqx2; + nqx2 = t; + t = nqz; + nqz = nqz2; + nqz2 = t; + t = nqpqx; + nqpqx = nqpqx2; + nqpqx2 = t; + t = nqpqz; + nqpqz = nqpqz2; + nqpqz2 = t; + + byte <<= 1; + } + } + + memcpy(resultx, nqx, sizeof(limb) * 10); + memcpy(resultz, nqz, sizeof(limb) * 10); +} + +// ----------------------------------------------------------------------------- +// Shamelessly copied from djb's code +// ----------------------------------------------------------------------------- +static void +crecip(limb *out, const limb *z) { + limb z2[10]; + limb z9[10]; + limb z11[10]; + limb z2_5_0[10]; + limb z2_10_0[10]; + limb z2_20_0[10]; + limb z2_50_0[10]; + limb z2_100_0[10]; + limb t0[10]; + limb t1[10]; + int i; + + /* 2 */ fsquare(z2,z); + /* 4 */ fsquare(t1,z2); + /* 8 */ fsquare(t0,t1); + /* 9 */ fmul(z9,t0,z); + /* 11 */ fmul(z11,z9,z2); + /* 22 */ fsquare(t0,z11); + /* 2^5 - 2^0 = 31 */ fmul(z2_5_0,t0,z9); + + /* 2^6 - 2^1 */ fsquare(t0,z2_5_0); + /* 2^7 - 2^2 */ fsquare(t1,t0); + /* 2^8 - 2^3 */ fsquare(t0,t1); + /* 2^9 - 2^4 */ fsquare(t1,t0); + /* 2^10 - 2^5 */ fsquare(t0,t1); + /* 2^10 - 2^0 */ fmul(z2_10_0,t0,z2_5_0); + + /* 2^11 - 2^1 */ fsquare(t0,z2_10_0); + /* 2^12 - 2^2 */ fsquare(t1,t0); + /* 2^20 - 2^10 */ for (i = 2;i < 10;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^20 - 2^0 */ fmul(z2_20_0,t1,z2_10_0); + + /* 2^21 - 2^1 */ fsquare(t0,z2_20_0); + /* 2^22 - 2^2 */ fsquare(t1,t0); + /* 2^40 - 2^20 */ for (i = 2;i < 20;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^40 - 2^0 */ fmul(t0,t1,z2_20_0); + + /* 2^41 - 2^1 */ fsquare(t1,t0); + /* 2^42 - 2^2 */ fsquare(t0,t1); + /* 2^50 - 2^10 */ for (i = 2;i < 10;i += 2) { fsquare(t1,t0); fsquare(t0,t1); } + /* 2^50 - 2^0 */ fmul(z2_50_0,t0,z2_10_0); + + /* 2^51 - 2^1 */ fsquare(t0,z2_50_0); + /* 2^52 - 2^2 */ fsquare(t1,t0); + /* 2^100 - 2^50 */ for (i = 2;i < 50;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^100 - 2^0 */ fmul(z2_100_0,t1,z2_50_0); + + /* 2^101 - 2^1 */ fsquare(t1,z2_100_0); + /* 2^102 - 2^2 */ fsquare(t0,t1); + /* 2^200 - 2^100 */ for (i = 2;i < 100;i += 2) { fsquare(t1,t0); fsquare(t0,t1); } + /* 2^200 - 2^0 */ fmul(t1,t0,z2_100_0); + + /* 2^201 - 2^1 */ fsquare(t0,t1); + /* 2^202 - 2^2 */ fsquare(t1,t0); + /* 2^250 - 2^50 */ for (i = 2;i < 50;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^250 - 2^0 */ 
fmul(t0,t1,z2_50_0); + + /* 2^251 - 2^1 */ fsquare(t1,t0); + /* 2^252 - 2^2 */ fsquare(t0,t1); + /* 2^253 - 2^3 */ fsquare(t1,t0); + /* 2^254 - 2^4 */ fsquare(t0,t1); + /* 2^255 - 2^5 */ fsquare(t1,t0); + /* 2^255 - 21 */ fmul(out,t1,z11); +} + +void curve25519_normalize(u8 *e) { + e[0] &= 248; + e[31] &= 127; + e[31] |= 64; +} + +void curve25519_donna_ref(uint8_t *mypublic, const uint8_t *secret, const uint8_t *basepoint) { + limb bp[10], x[10], z[11], zmone[10]; + uint8_t e[32]; + int i; + + for (i = 0; i < 32; ++i) e[i] = secret[i]; + e[0] &= 248; + e[31] &= 127; + e[31] |= 64; + + fexpand(bp, basepoint); + cmult(x, z, e, bp); + crecip(zmone, z); + fmul(z, x, zmone); + freduce_coefficients(z); + fcontract(mypublic, z); +} + diff --git a/crypto/curve25519-donna.h b/crypto/curve25519-donna.h new file mode 100644 index 0000000..6985273 --- /dev/null +++ b/crypto/curve25519-donna.h @@ -0,0 +1,17 @@ +#ifndef TUNSAFE_CRYPTO_CURVE25519_DONNA_H_ +#define TUNSAFE_CRYPTO_CURVE25519_DONNA_H_ + +#include "tunsafe_types.h" + +void curve25519_donna_ref(uint8 *mypublic, const uint8 *secret, const uint8 *basepoint); +extern "C" void curve25519_donna_x64(uint8 *mypublic, const uint8 *secret, const uint8 *basepoint); + +#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC) +#define curve25519_donna curve25519_donna_x64 +#else +#define curve25519_donna curve25519_donna_ref +#endif + +void curve25519_normalize(uint8 *e); + +#endif // TUNSAFE_CRYPTO_CURVE25519_DONNA_H_ \ No newline at end of file diff --git a/crypto/curve25519_x64_nasm.asm b/crypto/curve25519_x64_nasm.asm new file mode 100644 index 0000000..bdac7e6 --- /dev/null +++ b/crypto/curve25519_x64_nasm.asm @@ -0,0 +1,6825 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD + +section .text code align=64 + + +global curve25519_donna_x64 + +# donna function. 
+# linux arguments: RDI, RSI, RDX +# windows arguments: RCX, RDX, R8 +curve25519_donna_x64: +$L$FB13: + push r15 + push r14 + xor r15d,r15d + push r13 + push r12 + push rbp + push rbx + push rsi + push rdi + + mov rdi, rcx + mov rsi, rdx + mov rdx, r8 + + xor r8d,r8d + xor r11d,r11d + xor ebp,ebp + xor r9d,r9d + xor r13d,r13d + sub rsp,784 + + mov rcx,QWORD[6+rdx] + mov r10,QWORD[rdx] + movdqu xmm0,XMMWORD[rsi] + lea r14,[488+rsp] + mov QWORD[352+rsp],rdi + movaps XMMWORD[360+rsp],xmm0 + shr rcx,3 + and BYTE[360+rsp],-8 + mov rbx,rcx + mov rcx,QWORD[12+rdx] + movdqu xmm0,XMMWORD[16+rsi] + shr rcx,6 + movaps XMMWORD[376+rsp],xmm0 + movzx eax,BYTE[391+rsp] + and eax,127 + or eax,64 + mov BYTE[391+rsp],al + mov rax,2251799813685247 + and rcx,rax + and rbx,rax + and r10,rax + mov rdi,rcx + mov QWORD[184+rsp],rcx + mov rcx,QWORD[19+rdx] + mov rdx,QWORD[24+rdx] + mov QWORD[24+rsp],r10 + mov QWORD[120+rsp],rbx + shr rcx,1 + shr rdx,12 + and rcx,rax + mov rsi,rdx + mov r12,rcx + mov QWORD[264+rsp],rcx + and rsi,rax + lea rdx,[rsi*8+rsi] + mov QWORD[320+rsp],rsi + mov QWORD[((-120))+rsp],rsi + lea rdx,[rdx*2+rsi] + mov rsi,r14 + mov r14,r15 + mov QWORD[192+rsp],rdx + lea rdx,[rbx*8+rbx] + lea rdx,[rdx*2+rbx] + mov QWORD[328+rsp],rdx + lea rdx,[rdi*8+rdi] + lea rdx,[rdx*2+rdi] + mov QWORD[336+rsp],rdx + lea rdx,[rcx*8+rcx] + lea rdx,[rdx*2+rcx] + lea rcx,[728+rsp] + mov QWORD[200+rsp],rdx + lea rdx,[391+rsp] + mov QWORD[344+rsp],rdx + mov QWORD[((-24))+rsp],rdi + lea rdx,[536+rsp] + mov QWORD[88+rsp],rcx + lea rcx,[680+rsp] + mov QWORD[40+rsp],rbx + mov QWORD[((-88))+rsp],r10 + mov ebx,1 + xor r10d,r10d + mov QWORD[232+rsp],rcx + lea rcx,[632+rsp] + mov QWORD[((-104))+rsp],rbx + mov r15,QWORD[40+rsp] + xor edi,edi + mov QWORD[40+rsp],r12 + mov QWORD[80+rsp],rcx + lea rcx,[584+rsp] + mov QWORD[((-32))+rsp],0 + mov QWORD[((-56))+rsp],0 + mov QWORD[72+rsp],1 + mov rbx,r8 + mov QWORD[104+rsp],rcx + lea rcx,[440+rsp] + mov QWORD[8+rsp],rdx + mov QWORD[56+rsp],r10 + mov r12,r11 + mov QWORD[((-8))+rsp],rcx + lea rcx,[392+rsp] + mov QWORD[((-72))+rsp],rcx + mov rcx,rax + mov rax,QWORD[344+rsp] + + +$L$3: + movzx eax,BYTE[rax] + mov rdx,QWORD[((-8))+rsp] + mov QWORD[240+rsp],rsi + mov DWORD[316+rsp],8 + mov rsi,r15 + mov r15,QWORD[72+rsp] + mov BYTE[315+rsp],al + mov rax,QWORD[80+rsp] + mov QWORD[80+rsp],rdx + mov rdx,r9 + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[8+rsp] + mov QWORD[256+rsp],rax + mov rax,QWORD[((-72))+rsp] + mov QWORD[((-72))+rsp],rbp + mov QWORD[248+rsp],rax + mov rax,r8 + jmp NEAR $L$2 + + +$L$10: + mov r9,r8 + mov r8,QWORD[80+rsp] + mov QWORD[80+rsp],r9 + mov r9,QWORD[88+rsp] + mov QWORD[((-8))+rsp],r8 + mov r8,QWORD[256+rsp] + mov QWORD[256+rsp],r9 + mov r9,QWORD[104+rsp] + mov QWORD[88+rsp],r8 + mov r8,QWORD[248+rsp] + mov QWORD[248+rsp],r9 + mov r9,QWORD[232+rsp] + mov QWORD[104+rsp],r8 + mov r8,QWORD[240+rsp] + mov QWORD[240+rsp],r9 + mov QWORD[232+rsp],r8 +$L$2: + movzx r8d,BYTE[315+rsp] + mov QWORD[208+rsp],rcx + mov rcx,QWORD[((-88))+rsp] + shr r8b,7 + mov r9,rcx + movzx r8d,r8b + xor r9,r15 + neg r8 + and r9,r8 + mov rbp,r8 + xor r15,r9 + xor r9,rcx + mov rcx,rbp + mov QWORD[((-88))+rsp],r15 + mov r15,QWORD[((-56))+rsp] + mov QWORD[160+rsp],r9 + mov QWORD[128+rsp],rcx + mov r9,r15 + xor r9,rsi + mov r8,r9 + and r8,rbp + xor r15,r8 + mov r9,r15 + mov r15,r8 + xor r15,rsi + mov QWORD[72+rsp],r15 + mov r15,QWORD[((-24))+rsp] + mov rsi,r15 + xor rsi,r10 + and rsi,rbp + mov rbp,QWORD[((-72))+rsp] + xor r10,rsi + xor rsi,r15 + mov r15,QWORD[40+rsp] + mov QWORD[8+rsp],rsi + mov rsi,r15 
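
; Note: this x64 path represents field elements in radix 2^51 (five 64-bit limbs,
; masked with 2251799813685247 = 2^51-1 above), rather than the ten 25.5-bit limbs
; of the portable C path. The movzx/shr/neg/and/xor sequence around label $L$2 is
; the branch-free conditional swap keyed on the current scalar bit -- the 64-bit
; analogue of swap_conditional() in curve25519-donna.cpp.
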
+ xor rsi,rbp + and rsi,rcx + xor rbp,rsi + xor rsi,r15 + mov r15,QWORD[((-32))+rsp] + mov QWORD[40+rsp],rsi + mov rsi,QWORD[((-120))+rsp] + xor rsi,r15 + mov r8,rsi + and r8,rcx + xor r15,r8 + mov rsi,r15 + mov r15,QWORD[((-120))+rsp] + xor r15,r8 + mov r8,QWORD[((-104))+rsp] + mov QWORD[152+rsp],r15 + mov r15,QWORD[((-104))+rsp] + xor r8,r12 + and r8,rcx + xor r15,r8 + xor r12,r8 + mov r8,r11 + mov QWORD[((-32))+rsp],r15 + mov r15,QWORD[56+rsp] + xor r8,rdx + and r8,rcx + xor r11,r8 + xor r8,rdx + mov rdx,r15 + mov QWORD[((-72))+rsp],r8 + mov r8,r14 + xor rdx,rdi + and rdx,rcx + xor rdi,rdx + xor r8,r13 + xor rdx,r15 + and r8,rcx + mov r15,QWORD[((-88))+rsp] + xor r14,r8 + xor r13,r8 + mov r8,rbx + xor r8,rax + and r8,rcx + mov rcx,18014398509481832 + xor rbx,r8 + xor r8,rax + lea rax,[r15*1+r12] + mov r15,QWORD[240+rsp] + mov QWORD[((-56))+rsp],rax + mov QWORD[r15],rax + lea rax,[r9*1+r11] + mov QWORD[8+r15],rax + mov QWORD[((-120))+rsp],rax + lea rax,[r10*1+rdi] + mov QWORD[16+r15],rax + mov QWORD[136+rsp],rax + lea rax,[rbp*1+r14] + mov QWORD[24+r15],rax + mov QWORD[144+rsp],rax + lea rax,[rsi*1+rbx] + mov QWORD[((-104))+rsp],rax + mov QWORD[32+r15],rax + mov r15,QWORD[((-88))+rsp] + mov rax,QWORD[256+rsp] + add r15,rcx + sub r15,r12 + mov r12,18014398509481976 + mov QWORD[rax],r15 + mov QWORD[((-24))+rsp],r15 + mov r15,r12 + add r9,r12 + add r10,r15 + add rbp,r15 + sub r10,rdi + mov r12,r9 + mov rdi,r15 + mov r15,rbp + sub r12,r11 + mov r11,rax + sub r15,r14 + mov QWORD[8+rax],r12 + mov QWORD[16+r11],r10 + mov QWORD[24+r11],r15 + mov r11,rdi + mov r14,r15 + add r11,rsi + mov r15,QWORD[((-32))+rsp] + mov QWORD[((-88))+rsp],r10 + sub r11,rbx + mov r10,QWORD[((-72))+rsp] + mov rbx,QWORD[8+rsp] + add r10,QWORD[72+rsp] + mov QWORD[32+rax],r11 + mov rsi,QWORD[40+rsp] + mov rax,QWORD[160+rsp] + mov rdi,r15 + mov r9,QWORD[152+rsp] + mov rbp,QWORD[248+rsp] + add rbx,rdx + add rdi,rax + add rsi,r13 + add rcx,rax + add r9,r8 + mov QWORD[rbp],rdi + mov QWORD[8+rbp],r10 + mov QWORD[16+rbp],rbx + mov QWORD[24+rbp],rsi + mov QWORD[32+rbp],r9 + mov rbp,rcx + mov rcx,18014398509481976 + add rcx,QWORD[72+rsp] + sub rbp,r15 + mov rax,QWORD[80+rsp] + mov QWORD[((-32))+rsp],rbp + mov QWORD[rax],rbp + mov r15,rcx + mov rcx,18014398509481976 + add rcx,QWORD[8+rsp] + sub r15,QWORD[((-72))+rsp] + mov rbp,rcx + mov rcx,18014398509481976 + sub rbp,rdx + mov QWORD[8+rax],r15 + lea rax,[r11*8+r11] + mov rdx,rbp + mov rbp,QWORD[80+rsp] + mov QWORD[72+rsp],rdx + mov QWORD[16+rbp],rdx + add rcx,QWORD[40+rsp] + mov QWORD[((-72))+rsp],r14 + mov rdx,rcx + mov rcx,18014398509481976 + sub rdx,r13 + add rcx,QWORD[152+rsp] + mov QWORD[152+rsp],r11 + mov r13,rdx + mov rdx,rbp + mov QWORD[24+rbp],r13 + mov QWORD[56+rsp],r13 + mov rbp,rcx + sub rbp,r8 + lea r8,[rax*2+r11] + mov QWORD[32+rdx],rbp + mov QWORD[160+rsp],rbp + mov rbp,QWORD[((-88))+rsp] + mov r13,r8 + mov r8,r12 + mov QWORD[8+rsp],r13 + lea rax,[rbp*8+rbp] + lea rdx,[rax*2+rbp] + lea rax,[r14*8+r14] + lea r14,[rax*2+r14] + lea rax,[r12*8+r12] + mov rcx,rdx + mov QWORD[224+rsp],rcx + lea r11,[rax*2+r12] + mov QWORD[168+rsp],r14 + mov rax,r11 + mul r9 + mov r11,rax + mov rax,r13 + mov r12,rdx + mul r10 + mov r13,QWORD[((-24))+rsp] + add r11,rax + mov rax,r13 + adc r12,rdx + mul rdi + add r11,rax + mov rax,r14 + adc r12,rdx + mul rbx + add r11,rax + mov rax,rcx + mov rcx,QWORD[208+rsp] + adc r12,rdx + mul rsi + add r11,rax + mov rax,r13 + adc r12,rdx + mov rdx,r11 + and rdx,rcx + mov QWORD[208+rsp],rdx + mul r10 + mov r13,rax + mov rax,r8 + mov r14,rdx + mul rdi + 
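
; Note: the lea pairs above (lea t,[x*8+x]; lea t,[t*2+x]) compute 19*x without a
; multiply, since (8x+x)*2 + x = 19x. Because 2^255 = 19 (mod 2^255-19), product
; limbs that carry past the fifth limb are folded back into the low limbs
; pre-multiplied by 19; each 128-bit rdx:rax accumulator is then split with
; shrd ...,51 (carry out) and masked with 2^51-1 (limb value).
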
add r13,rax + mov rax,QWORD[8+rsp] + adc r14,rdx + mul rbx + add r13,rax + mov rax,QWORD[224+rsp] + adc r14,rdx + mul r9 + add r13,rax + mov rax,QWORD[168+rsp] + adc r14,rdx + mul rsi + add rax,r13 + mov r13,r12 + mov r12,r11 + adc rdx,r14 + shrd r12,r13,51 + shr r13,51 + mov r14,r13 + mov r13,r12 + add r13,rax + mov rax,QWORD[((-24))+rsp] + adc r14,rdx + mov r12,r13 + and r12,rcx + mul rbx + mov QWORD[216+rsp],r12 + mov r11,rax + mov rax,rbp + mov r12,rdx + mul rdi + mov rbp,r8 + mov QWORD[40+rsp],rbp + add r11,rax + mov rax,r8 + adc r12,rdx + mul r10 + add r11,rax + mov rax,QWORD[8+rsp] + adc r12,rdx + mul rsi + add r11,rax + mov rax,QWORD[168+rsp] + adc r12,rdx + mul r9 + add rax,r11 + adc rdx,r12 + mov r12,r13 + mov r13,r14 + shrd r12,r14,51 + shr r13,51 + mov r11,r12 + mov r12,r13 + add r11,rax + mov rax,QWORD[((-24))+rsp] + adc r12,rdx + mov r8,r11 + and r8,rcx + mul rsi + mov r13,rax + mov rax,QWORD[((-72))+rsp] + mov r14,rdx + mul rdi + add r13,rax + mov rax,QWORD[((-88))+rsp] + adc r14,rdx + mul r10 + add r13,rax + mov rax,rbp + adc r14,rdx + mul rbx + add r13,rax + mov rax,QWORD[8+rsp] + adc r14,rdx + mul r9 + add rax,r13 + mov r13,r12 + mov r12,r11 + adc rdx,r14 + shrd r12,r13,51 + shr r13,51 + mov r14,r13 + mov r13,r12 + add r13,rax + mov rax,r9 + adc r14,rdx + mov r12,r13 + mov r9,r13 + mul QWORD[((-24))+rsp] + and r12,rcx + mov QWORD[272+rsp],r12 + mov r11,rax + mov r12,rdx + mov rax,rdi + mul QWORD[152+rsp] + mov rdi,rax + mov rbp,rdx + mov rax,rsi + add rdi,r11 + adc rbp,r12 + mul QWORD[40+rsp] + add rdi,rax + mov rax,r10 + mov r10,r14 + adc rbp,rdx + mul QWORD[((-72))+rsp] + add rdi,rax + mov rax,rbx + adc rbp,rdx + mul QWORD[((-88))+rsp] + mov r13,QWORD[56+rsp] + mov rbx,QWORD[136+rsp] + add rdi,rax + adc rbp,rdx + shr r10,51 + shrd r9,r14,51 + mov rdx,r10 + mov r10,QWORD[((-104))+rsp] + mov r14,QWORD[((-32))+rsp] + mov rax,r9 + add rax,rdi + mov rdi,rax + adc rdx,rbp + mov rbp,QWORD[144+rsp] + and rdi,rcx + mov QWORD[280+rsp],rdi + mov rdi,rax + shrd rdi,rdx,51 + mov rdx,QWORD[72+rsp] + lea rax,[rdi*8+rdi] + lea rax,[rax*2+rdi] + add rax,QWORD[208+rsp] + mov r9,rax + shr rax,51 + add rax,QWORD[216+rsp] + and r9,rcx + mov QWORD[288+rsp],r9 + mov r9,QWORD[((-120))+rsp] + mov rdi,rax + shr rax,51 + lea rsi,[r8*1+rax] + mov r8,QWORD[160+rsp] + and rdi,rcx + mov QWORD[208+rsp],rdi + mov QWORD[216+rsp],rsi + lea rax,[r8*8+r8] + lea rsi,[rax*2+r8] + lea rax,[rdx*8+rdx] + lea r8,[rax*2+rdx] + lea rax,[r13*8+r13] + lea rdi,[rax*2+r13] + lea rax,[r15*8+r15] + lea r11,[rax*2+r15] + mov rax,r11 + mul r10 + mov r11,rax + mov rax,r9 + mov r12,rdx + mul rsi + add r11,rax + mov rax,r14 + adc r12,rdx + mul QWORD[((-56))+rsp] + add r11,rax + mov rax,rbx + adc r12,rdx + mul rdi + add r11,rax + mov rax,rbp + adc r12,rdx + mul r8 + add r11,rax + mov rax,r11 + adc r12,rdx + and rax,rcx + mov QWORD[296+rsp],rax + mov rax,r14 + mul r9 + mov r9,r11 + mov r13,rax + mov rax,QWORD[((-56))+rsp] + mov r14,rdx + mul r15 + add r13,rax + mov rax,rbx + adc r14,rdx + mul rsi + add r13,rax + mov rax,r8 + mov r8,rbp + adc r14,rdx + mul r10 + mov r10,r12 + add r13,rax + mov rax,rbp + adc r14,rdx + mul rdi + add rax,r13 + adc rdx,r14 + shr r10,51 + mov r14,rbx + shrd r9,r12,51 + add r9,rax + mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mov r13,r9 + and r13,rcx + mul rbx + mov r11,rax + mov r12,rdx + mov rax,QWORD[((-56))+rsp] + mul QWORD[72+rsp] + add r11,rax + mov rax,QWORD[((-120))+rsp] + adc r12,rdx + mul r15 + add r11,rax + mov rax,rbp + adc r12,rdx + mul rsi + add r11,rax + mov rax,rdi + adc r12,rdx + mul 
QWORD[((-104))+rsp] + mov rdi,rax + mov rbp,rdx + mov rax,r9 + add rdi,r11 + mov rdx,r10 + adc rbp,r12 + shr rdx,51 + shrd rax,r10,51 + mov r12,rdx + mov r11,rax + mov rax,QWORD[((-32))+rsp] + add r11,rdi + mov rdx,r11 + adc r12,rbp + mov rbp,r8 + and rdx,rcx + mov rdi,r12 + mov rbx,rdx + mul r8 + mov r9,rax + mov r10,rdx + mov rax,QWORD[((-56))+rsp] + mul QWORD[56+rsp] + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul QWORD[72+rsp] + add r9,rax + mov rax,r14 + adc r10,rdx + mul r15 + add r9,rax + mov rax,rsi + mov rsi,r11 + adc r10,rdx + mul QWORD[((-104))+rsp] + add rax,r9 + adc rdx,r10 + shr rdi,51 + shrd rsi,r12,51 + add rsi,rax + mov rax,QWORD[((-32))+rsp] + adc rdi,rdx + mov rdx,rsi + and rdx,rcx + mov r8,rdx + mul QWORD[((-104))+rsp] + mov r9,rax + mov r10,rdx + mov rax,QWORD[160+rsp] + mul QWORD[((-56))+rsp] + mov r12,QWORD[272+rsp] + add r9,rax + mov rax,r15 + mov r15,QWORD[288+rsp] + adc r10,rdx + mul rbp + add r9,rax + mov rax,QWORD[56+rsp] + adc r10,rdx + mul QWORD[((-120))+rsp] + add r9,rax + mov rax,QWORD[72+rsp] + adc r10,rdx + mul r14 + mov r14,QWORD[280+rsp] + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + mov r9,rsi + mov r10,rdi + lea rdi,[r8*1+r12] + add r9,rax + adc r10,rdx + mov rdx,r9 + shrd r9,r10,51 + and rdx,rcx + lea rax,[r9*8+r9] + lea rbp,[rdx*1+r14] + lea rax,[rax*2+r9] + add rax,QWORD[296+rsp] + mov r10,rax + shr rax,51 + add r13,rax + and r10,rcx + mov rax,r13 + shr r13,51 + lea r11,[r10*1+r15] + and rax,rcx + add r13,rbx + mov r9,rax + mov rax,QWORD[208+rsp] + lea rbx,[r9*1+rax] + mov rax,QWORD[216+rsp] + lea rsi,[r13*1+rax] + mov rax,18014398509481832 + add rax,r15 + mov r15,18014398509481976 + sub rax,r10 + add r15,r12 + mov QWORD[72+rsp],rax + mov rax,18014398509481976 + add rax,QWORD[208+rsp] + sub r15,r8 + sub rax,r9 + mov QWORD[56+rsp],rax + mov rax,18014398509481976 + add rax,QWORD[216+rsp] + sub rax,r13 + mov r13,rax + mov rax,18014398509481976 + add rax,r14 + lea r14,[r11*1+r11] + sub rax,rdx + lea rdx,[rbx*1+rbx] + mov QWORD[((-32))+rsp],rax + lea rax,[rbp*8+rbp] + mov QWORD[160+rsp],rdx + lea rdx,[rax*2+rbp] + mov rax,r11 + lea r8,[rdx*1+rdx] + mov QWORD[272+rsp],rdx + mul r11 + mov r11,rax + mov rax,r8 + mov r12,rdx + mul rbx + add r11,rax + lea rax,[rsi*8+rsi] + adc r12,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul rdi + add r11,rax + lea rax,[rdi*8+rdi] + mov r9,r11 + adc r12,rdx + and r9,rcx + mov QWORD[208+rsp],r9 + lea r9,[rax*2+rdi] + mov rax,r9 + mul rdi + mov r9,rax + mov rax,rbx + mov r10,rdx + mul r14 + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,rbx + adc r10,rdx + mov r12,r9 + mul rbx + and r12,rcx + mov rbx,QWORD[160+rsp] + mov QWORD[216+rsp],r12 + mov r11,rax + mov rax,r8 + mov r12,rdx + mul rdi + add r11,rax + mov rax,r14 + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,rdi + adc r10,rdx + mov r8,r9 + mul r14 + and r8,rcx + mov r11,rax + mov rax,QWORD[272+rsp] + mov r12,rdx + mul rbp + add r11,rax + mov rax,rbx + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,rbx + adc r12,rdx + mov QWORD[160+rsp],r11 + mul rdi + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r14 + add rsi,r9 + adc rdi,r10 + mov r10,QWORD[72+rsp] + mul rbp + mov rbp,QWORD[56+rsp] + add rsi,rax + adc rdi,rdx + mov rax,rsi + shrd r11,r12,51 + mov rdx,rdi + 
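+ ; Layout note: field elements mod p = 2^255-19 are held as five 51-bit
+ ; limbs. rcx holds the limb mask 2^51-1 = 2251799813685247; each 64x64
+ ; mul leaves a 128-bit product in rdx:rax, shrd/shr by 51 carries into
+ ; the next limb, and "and ...,rcx" keeps the low 51 bits.
+ ; The magic constants above are a limb-wise encoding of 8*p:
+ ; 18014398509481832 = 8*(2^51-19) and 18014398509481976 = 8*(2^51-1).
+ ; Differences are computed as x + 8p - y so no limb ever underflows
+ ; (branch-free, constant time).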
shr r12,51 + add rax,r11 + adc rdx,r12 + mov QWORD[272+rsp],rax + shrd rax,rdx,51 + mov rsi,rax + lea rax,[rax*8+rax] + lea rdx,[rax*2+rsi] + add rdx,QWORD[208+rsp] + mov rax,QWORD[216+rsp] + lea rsi,[r10*1+r10] + mov QWORD[208+rsp],rdx + shr rdx,51 + add rax,rdx + mov QWORD[216+rsp],rax + shr rax,51 + lea rbx,[r8*1+rax] + mov r8,QWORD[((-32))+rsp] + mov QWORD[280+rsp],rbx + lea rbx,[rbp*1+rbp] + lea rax,[r8*8+r8] + lea r8,[rax*2+r8] + mov rax,r10 + mul r10 + lea rdi,[r8*1+r8] + mov r11,rax + mov rax,rbp + mov r12,rdx + mul rdi + add r11,rax + lea rax,[r13*8+r13] + adc r12,rdx + lea rax,[rax*2+r13] + add rax,rax + mul r15 + add r11,rax + lea rax,[r15*8+r15] + adc r12,rdx + mov r10,r11 + lea r9,[rax*2+r15] + and r10,rcx + mov QWORD[72+rsp],r10 + mov rax,r9 + mul r15 + mov r9,rax + mov rax,rbp + mov r10,rdx + mul rsi + add r9,rax + mov rax,rdi + adc r10,rdx + mul r13 + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,rbp + adc r10,rdx + mov r12,r9 + mul rbp + and r12,rcx + mov r14,r12 + mov r11,rax + mov rax,rdi + mov r12,rdx + mul r15 + add r11,rax + mov rax,rsi + adc r12,rdx + mul r13 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,r15 + adc r12,rdx + mov r9,r11 + mul rsi + and r9,rcx + mov QWORD[56+rsp],r9 + mov r9,rax + mov rax,r8 + mov r8,QWORD[((-32))+rsp] + mov r10,rdx + mul r8 + add r9,rax + mov rax,r13 + adc r10,rdx + mul rbx + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,r15 + adc r10,rdx + mov r12,r9 + mul rbx + and r12,rcx + mov rbp,r12 + mov r11,rax + mov rax,r13 + mov r12,rdx + mul r13 + add r11,rax + mov rax,r8 + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + mov r12,QWORD[336+rsp] + shrd r9,r10,51 + shr r10,51 + add rax,r9 + mov r9,QWORD[24+rsp] + adc rdx,r10 + mov rbx,rax + mov r10,QWORD[192+rsp] + shrd rax,rdx,51 + and rbx,rcx + lea rdx,[rax*8+rax] + lea rsi,[rdx*2+rax] + add rsi,QWORD[72+rsp] + mov rax,QWORD[328+rsp] + mul rbx + mov r13,rsi + shr rsi,51 + add rsi,r14 + and r13,rcx + mov r8,r13 + mov r13,rsi + shr rsi,51 + and r13,rcx + mov r14,rdx + add rsi,QWORD[56+rsp] + mov rdi,r13 + mov r13,rax + mov rax,r12 + mul rbp + add r13,rax + mov rax,r9 + adc r14,rdx + mul r8 + add r13,rax + mov rax,r10 + adc r14,rdx + mul rdi + add r13,rax + mov rax,QWORD[200+rsp] + adc r14,rdx + mul rsi + add r13,rax + mov rax,r12 + adc r14,rdx + mov r15,r13 + mul rbx + and r15,rcx + mov r11,rax + mov rax,QWORD[200+rsp] + mov r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[120+rsp] + adc r12,rdx + mul r8 + add r11,rax + mov rax,r9 + adc r12,rdx + mul rdi + add r11,rax + mov rax,r10 + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r13,r14,51 + shr r14,51 + mov r11,r13 + mov r12,r14 + add r11,rax + mov rax,r10 + adc r12,rdx + mov r13,r11 + mul rbp + and r13,rcx + mov r14,r13 + mov r9,rax + mov rax,QWORD[200+rsp] + mov r10,rdx + mul rbx + add r9,rax + mov rax,QWORD[184+rsp] + adc r10,rdx + mul r8 + add r9,rax + mov rax,QWORD[120+rsp] + adc r10,rdx + mul rdi + add r9,rax + mov rax,QWORD[24+rsp] + adc r10,rdx + mul rsi + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,QWORD[24+rsp] + adc r10,rdx + mov r13,r9 + and r13,rcx + mul rbp + mov r11,rax + mov rax,QWORD[192+rsp] + mov r12,rdx + mul rbx + add r11,rax + mov rax,QWORD[264+rsp] + adc r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[184+rsp] + adc r12,rdx + mul rdi + add r11,rax + mov 
rax,QWORD[120+rsp] + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,rbx + adc r12,rdx + mov QWORD[72+rsp],r11 + mul QWORD[24+rsp] + mov r9,rax + mov r10,rdx + mov rax,rbp + mul QWORD[120+rsp] + add r9,rax + mov rax,r8 + mov r8,r11 + adc r10,rdx + mul QWORD[320+rsp] + add r9,rax + mov rax,rdi + adc r10,rdx + mul QWORD[264+rsp] + mov rdi,rax + mov rbp,rdx + mov rax,rsi + add rdi,r9 + mov r9,r12 + adc rbp,r10 + mov r10,QWORD[((-56))+rsp] + mul QWORD[184+rsp] + add rdi,rax + adc rbp,rdx + shr r9,51 + shrd r8,r12,51 + mov rdx,r9 + mov r12,QWORD[((-120))+rsp] + mov rax,r8 + add rax,rdi + mov rdi,QWORD[((-104))+rsp] + adc rdx,rbp + mov r8,rax + mov QWORD[288+rsp],rax + shrd r8,rdx,51 + lea rbp,[r12*1+r12] + lea rax,[r8*8+r8] + lea r8,[rax*2+r8] + lea rax,[rdi*8+rdi] + add r8,r15 + lea r11,[rax*2+rdi] + mov rbx,r8 + mov QWORD[56+rsp],r8 + shr rbx,51 + lea r8,[r14*1+rbx] + lea rbx,[r10*1+r10] + mov QWORD[296+rsp],r8 + shr r8,51 + lea r15,[r13*1+r8] + lea r8,[r11*1+r11] + mov QWORD[304+rsp],r15 + mov r15,QWORD[136+rsp] + mov r13,QWORD[144+rsp] + lea rax,[r15*8+r15] + lea rsi,[rax*2+r15] + add rsi,rsi + mov rax,rsi + mul r13 + mov rsi,rax + mov rax,r10 + mov rdi,rdx + mul r10 + add rsi,rax + mov rax,r12 + adc rdi,rdx + mul r8 + add rsi,rax + lea rax,[r13*8+r13] + adc rdi,rdx + mov r14,rsi + lea r9,[rax*2+r13] + and r14,rcx + mov rax,r9 + mul r13 + mov r9,rax + mov rax,r12 + mov r10,rdx + mul rbx + add r9,rax + mov rax,r15 + adc r10,rdx + mul r8 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + mov rax,r15 + adc rdi,rdx + mov r10,rsi + mov QWORD[((-56))+rsp],rsi + mul rbx + mov QWORD[((-48))+rsp],rdi + mov rdi,QWORD[((-120))+rsp] + and r10,rcx + mov rsi,QWORD[((-56))+rsp] + mov r12,r10 + mov r9,rax + mov rax,rdi + mov r10,rdx + mul rdi + mov rdi,QWORD[((-48))+rsp] + add r9,rax + mov rax,r8 + mov r8,r13 + adc r10,rdx + mul r13 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + mov rax,r8 + adc rdi,rdx + mov r10,rsi + mov QWORD[((-120))+rsp],rsi + mul rbx + and r10,rcx + mov rsi,QWORD[((-120))+rsp] + mov r13,r10 + mov QWORD[((-112))+rsp],rdi + mov rdi,QWORD[((-112))+rsp] + mov r9,rax + mov rax,r15 + mov r10,rdx + mul rbp + add r9,rax + mov rax,r11 + mov r11,QWORD[((-104))+rsp] + adc r10,rdx + mul r11 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + mov rax,rsi + adc rdi,rdx + and rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,r11 + mov r11,QWORD[((-88))+rsp] + mul rbx + mov rbx,QWORD[((-24))+rsp] + mov r9,rax + mov rax,r8 + mov r10,rdx + mul rbp + mov rbp,QWORD[40+rsp] + add r9,rax + mov rax,r15 + adc r10,rdx + mul r15 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + adc rdi,rdx + mov r15,rsi + shrd rsi,rdi,51 + and r15,rcx + lea rax,[rsi*8+rsi] + lea rax,[rax*2+rsi] + lea rsi,[rbp*1+rbp] + add r14,rax + mov rdi,r14 + shr r14,51 + add r12,r14 + mov r14,QWORD[224+rsp] + and rdi,rcx + mov rdx,r12 + shr r12,51 + mov QWORD[((-104))+rsp],rdi + lea r9,[r13*1+r12] + mov r13,QWORD[8+rsp] + mov r12,QWORD[((-72))+rsp] + and rdx,rcx + lea rdi,[rbx*1+rbx] + mov QWORD[((-56))+rsp],rdx + mov QWORD[((-32))+rsp],r9 + lea r8,[r13*1+r13] + lea r13,[r14*1+r14] + mov rax,r13 + mul r12 + mov r13,rax + mov rax,rbx + mov r14,rdx + mul rbx + mov rbx,rbp + add r13,rax + mov rax,rbp + adc r14,rdx + mul r8 + add r13,rax + mov rax,rbx + adc r14,rdx + mov rbp,r13 + mul rdi + and rbp,rcx + mov r9,rax + mov rax,QWORD[168+rsp] + mov r10,rdx + mul r12 + add 
r9,rax + mov rax,r11 + adc r10,rdx + mul r8 + add rax,r9 + mov r9,QWORD[40+rsp] + adc rdx,r10 + shrd r13,r14,51 + shr r14,51 + add r13,rax + mov rax,r11 + adc r14,rdx + mov r10,r13 + mul rdi + and r10,rcx + mov rbx,r10 + mov r11,rax + mov rax,r9 + mov r12,rdx + mul r9 + mov r9,QWORD[((-72))+rsp] + add r11,rax + mov rax,r8 + adc r12,rdx + mul r9 + add rax,r11 + adc rdx,r12 + mov r12,r13 + mov r13,r14 + shrd r12,r14,51 + shr r13,51 + mov r11,r12 + mov r12,r13 + mov r13,QWORD[((-88))+rsp] + add r11,rax + mov rax,r9 + adc r12,rdx + mov r14,r11 + mul rdi + and r14,rcx + mov r8,r14 + mov r9,rax + mov rax,r13 + mov r10,rdx + mul rsi + add r9,rax + mov rax,QWORD[8+rsp] + adc r10,rdx + mul QWORD[152+rsp] + add rax,r9 + adc rdx,r10 + shrd r11,r12,51 + shr r12,51 + mov r9,r11 + mov r10,r12 + add r9,rax + mov rax,QWORD[152+rsp] + adc r10,rdx + mov r14,r9 + and r14,rcx + mul rdi + mov r11,rax + mov rax,QWORD[((-72))+rsp] + mov r12,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r13 + add rsi,r11 + adc rdi,r12 + mul r13 + add rsi,rax + adc rdi,rdx + shrd r9,r10,51 + shr r10,51 + add rsi,r9 + adc rdi,r10 + mov r12,rsi + shrd rsi,rdi,51 + and r12,rcx + lea rax,[rsi*8+rsi] + mov QWORD[((-88))+rsp],r12 + lea r13,[rax*2+rsi] + lea rax,[r12*8+r12] + add r13,rbp + lea rdi,[rax*2+r12] + mov rbp,r13 + shr r13,51 + add r13,rbx + and rbp,rcx + mov rbx,r13 + shr r13,51 + add r13,r8 + and rbx,rcx + lea rax,[r13*8+r13] + lea rsi,[rax*2+r13] + lea rax,[r14*8+r14] + lea r8,[rax*2+r14] + mov rax,QWORD[((-104))+rsp] + mov r11,QWORD[((-56))+rsp] + mul rbp + mov r9,rax + mov rax,r11 + mov r10,rdx + mul rdi + add r9,rax + mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mul r8 + add r9,rax + lea rax,[rbx*8+rbx] + adc r10,rdx + lea rax,[rax*2+rbx] + mul r15 + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul rsi + add r9,rax + mov rax,r11 + adc r10,rdx + mov r12,r9 + mul rbp + and r12,rcx + mov QWORD[((-72))+rsp],r12 + mov r11,rax + mov rax,QWORD[((-120))+rsp] + mov r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[((-104))+rsp] + adc r12,rdx + mul rbx + add r11,rax + mov rax,QWORD[((-32))+rsp] + adc r12,rdx + mul rdi + add r11,rax + mov rax,rsi + mov rsi,QWORD[((-104))+rsp] + adc r12,rdx + mul r15 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,QWORD[((-120))+rsp] + adc r12,rdx + mov r10,r11 + and r10,rcx + mul rdi + mov QWORD[((-24))+rsp],r10 + mov r9,rax + mov rax,r8 + mov r10,rdx + mul r15 + add r9,rax + mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mul rbp + add r9,rax + mov rax,QWORD[((-56))+rsp] + adc r10,rdx + mul rbx + add r9,rax + mov rax,rsi + adc r10,rdx + mul r13 + add rax,r9 + adc rdx,r10 + shrd r11,r12,51 + shr r12,51 + add r11,rax + mov rax,rsi + adc r12,rdx + mov QWORD[8+rsp],r11 + and r11,rcx + mul r14 + mov r8,r11 + mov QWORD[16+rsp],r12 + mov r11,QWORD[8+rsp] + mov r12,QWORD[16+rsp] + mov r9,rax + mov rax,rdi + mov r10,rdx + mul r15 + mov rsi,rax + mov rax,QWORD[((-120))+rsp] + mov rdi,rdx + add rsi,r9 + adc rdi,r10 + mul rbp + add rsi,rax + mov rax,QWORD[((-32))+rsp] + adc rdi,rdx + mul rbx + add rsi,rax + mov rax,QWORD[((-56))+rsp] + adc rdi,rdx + mul r13 + add rsi,rax + mov rax,rbp + adc rdi,rdx + shrd r11,r12,51 + shr r12,51 + add rsi,r11 + mov r11,QWORD[((-104))+rsp] + adc rdi,r12 + mov r12,QWORD[((-120))+rsp] + mov QWORD[40+rsp],rsi + mul r15 + mov r9,rax + mov r10,rdx + mov rax,r11 + mul QWORD[((-88))+rsp] + add r9,rax + mov rax,QWORD[((-56))+rsp] + adc r10,rdx + mul r14 + add r9,rax + mov rax,r12 + adc r10,rdx + mul rbx + add r9,rax + 
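+ ; Reduction idiom: 2^255 = 19 (mod p), so a carry c out of limb 4 is
+ ; folded back into limb 0 as 19*c. The lea pair computes it mul-free:
+ ; lea rax,[c*8+c] gives 9c, then lea [rax*2+c] gives 18c + c = 19c.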
mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mul r13 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rax,rsi + adc rdx,rdi + mov rsi,rax + mov QWORD[136+rsp],rax + shrd rsi,rdx,51 + mov rdi,QWORD[((-72))+rsp] + lea rax,[rsi*8+rsi] + lea rax,[rax*2+rsi] + mov rsi,18014398509481832 + add rdi,rax + lea rax,[r11*1+rsi] + add rsi,144 + mov r9,rdi + add rsi,QWORD[((-56))+rsp] + mov QWORD[144+rsp],rdi + shr r9,51 + add r9,QWORD[((-24))+rsp] + sub rsi,rbx + mov r10,r9 + mov rbx,rsi + mov QWORD[152+rsp],r9 + shr r10,51 + mov rsi,18014398509481976 + add r10,r8 + mov r8,rax + mov QWORD[8+rsp],r10 + add rsi,QWORD[((-32))+rsp] + sub r8,rbp + mov rbp,r8 + mov r8,rsi + mov rsi,18014398509481976 + lea r11,[r12*1+rsi] + lea rax,[rsi*1+r15] + sub rax,QWORD[((-88))+rsp] + sub r8,r13 + mov QWORD[((-88))+rsp],rbp + mov r12,QWORD[((-120))+rsp] + mov r13,r8 + mov r8,r11 + sub r8,r14 + mov QWORD[((-72))+rsp],r13 + mov r14,r8 + mov r8,rax + mov eax,121665 + mul rbp + mov QWORD[((-24))+rsp],r14 + mov r11,rax + mov rsi,rax + mov eax,121665 + mov rdi,rdx + shrd rsi,rdx,51 + mul rbx + shr rdi,51 + add rsi,rax + mov eax,121665 + adc rdi,rdx + mov QWORD[168+rsp],rsi + mul r13 + mov QWORD[176+rsp],rdi + shrd rsi,rdi,51 + shr rdi,51 + mov r10,rdi + mov rdi,rax + mov rbp,rdx + mov eax,121665 + add rdi,rsi + adc rbp,r10 + mov r9,rdi + mul r14 + mov r10,rbp + shrd r9,rbp,51 + shr r10,51 + add r9,rax + mov eax,121665 + adc r10,rdx + mov r13,r9 + mul r8 + mov r14,r10 + shrd r13,r10,51 + shr r14,51 + add r13,rax + adc r14,rdx + mov rax,r13 + and r11,rcx + shrd rax,r14,51 + add r11,QWORD[((-104))+rsp] + and rdi,rcx + add rdi,QWORD[((-32))+rsp] + lea rdx,[rax*8+rax] + lea rax,[rdx*2+rax] + lea rsi,[rax*1+r11] + mov r11,r9 + mov r9,QWORD[((-88))+rsp] + and r11,rcx + lea rbp,[r12*1+r11] + mov r11,r13 + mov QWORD[((-104))+rsp],rsi + and r11,rcx + mov rsi,QWORD[168+rsp] + add r15,r11 + lea rax,[r15*8+r15] + and rsi,rcx + add rsi,QWORD[((-56))+rsp] + lea r14,[rax*2+r15] + lea rax,[rdi*8+rdi] + lea r11,[rax*2+rdi] + lea rax,[rbp*8+rbp] + lea r13,[rax*2+rbp] + lea rax,[rsi*8+rsi] + mov r10,r11 + mov QWORD[((-56))+rsp],r10 + lea r11,[rax*2+rsi] + mov rax,r11 + mul r8 + mov r11,rax + mov rax,QWORD[((-24))+rsp] + mov r12,rdx + mul r10 + add r11,rax + mov rax,QWORD[((-72))+rsp] + adc r12,rdx + mul r13 + add r11,rax + mov rax,r9 + adc r12,rdx + mul QWORD[((-104))+rsp] + add r11,rax + mov rax,rbx + adc r12,rdx + mul r14 + add r11,rax + mov rax,r11 + adc r12,rdx + and rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,r9 + mul rsi + mov r9,rax + mov rax,QWORD[((-56))+rsp] + mov r10,rdx + mul r8 + add r9,rax + mov rax,QWORD[((-24))+rsp] + adc r10,rdx + mul r13 + add r9,rax + mov rax,QWORD[((-104))+rsp] + adc r10,rdx + mul rbx + add r9,rax + mov rax,QWORD[((-72))+rsp] + adc r10,rdx + mul r14 + add rax,r9 + adc rdx,r10 + shrd r11,r12,51 + shr r12,51 + mov r9,r11 + mov r10,r12 + add r9,rax + mov rax,QWORD[((-88))+rsp] + adc r10,rdx + mov r12,r9 + and r12,rcx + mul rdi + mov QWORD[((-32))+rsp],r12 + mov r11,rax + mov rax,rbx + mov r12,rdx + mul rsi + add r11,rax + mov rax,r13 + adc r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[((-72))+rsp] + adc r12,rdx + mul QWORD[((-104))+rsp] + add r11,rax + mov rax,QWORD[((-24))+rsp] + adc r12,rdx + mul r14 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,rbx + adc r10,rdx + mov r11,r9 + mul rdi + and r11,rcx + mov QWORD[((-48))+rsp],r10 + mov r13,r11 + mov r11,rax + mov rax,QWORD[((-72))+rsp] + mov r12,rdx + mul rsi + add r11,rax + mov rax,QWORD[((-88))+rsp] + adc 
r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[((-24))+rsp] + adc r12,rdx + mul QWORD[((-104))+rsp] + add r11,rax + mov rax,r14 + adc r12,rdx + mul r8 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,QWORD[((-24))+rsp] + adc r10,rdx + mov r14,r9 + mul rsi + mov r11,rax + mov rax,QWORD[((-72))+rsp] + mov r12,rdx + mul rdi + mov rsi,rax + mov rdi,rdx + mov rax,rbx + add rsi,r11 + adc rdi,r12 + mul rbp + mov rbp,QWORD[104+rsp] + add rsi,rax + mov rax,QWORD[((-88))+rsp] + adc rdi,rdx + mul r15 + mov r15,QWORD[144+rsp] + add rsi,rax + mov rax,QWORD[((-104))+rsp] + adc rdi,rdx + mul r8 + mov r8,QWORD[152+rsp] + add rsi,rax + adc rdi,rdx + mov rdx,QWORD[216+rsp] + shrd r9,r10,51 + shr r10,51 + add rsi,r9 + adc rdi,r10 + mov rbx,rsi + mov r10,QWORD[128+rsp] + shrd rsi,rdi,51 + lea rax,[rsi*8+rsi] + lea r12,[rax*2+rsi] + add r12,QWORD[((-120))+rsp] + mov rsi,QWORD[208+rsp] + mov rax,r10 + and rax,rcx + mov r9,rsi + and rsi,rcx + mov r11,r12 + xor r9,r15 + and r15,rcx + shr r11,51 + add r11,QWORD[((-32))+rsp] + and r9,rax + xor rsi,r9 + xor r15,r9 + mov QWORD[rbp],rsi + mov QWORD[((-88))+rsp],rsi + mov rsi,rdx + xor rsi,r8 + and r8,rcx + and rdx,rcx + mov rdi,r11 + and rsi,rax + mov r9,r8 + shr rdi,51 + xor r9,rsi + xor rsi,rdx + add rdi,r13 + mov r13,QWORD[232+rsp] + mov rdx,QWORD[280+rsp] + mov QWORD[((-56))+rsp],r9 + mov QWORD[8+rbp],rsi + mov QWORD[8+r13],r9 + mov r9,QWORD[8+rsp] + mov QWORD[r13],r15 + xor r9,rdx + mov r8,r9 + and r8,r10 + mov r10,QWORD[8+rsp] + mov r9,r8 + xor r9,rdx + xor r10,r8 + mov rdx,r9 + mov r8,QWORD[160+rsp] + mov QWORD[16+rbp],rdx + mov QWORD[((-24))+rsp],r9 + mov r9,rbp + mov rbp,QWORD[40+rsp] + mov QWORD[16+r13],r10 + mov rdx,r8 + and r8,rcx + xor rdx,rbp + and rbp,rcx + and rdx,rax + xor rbp,rdx + xor r8,rdx + mov rdx,r9 + mov QWORD[24+r13],rbp + mov QWORD[24+rdx],r8 + mov QWORD[((-72))+rsp],rbp + mov QWORD[40+rsp],r8 + mov r9,QWORD[272+rsp] + mov rbp,QWORD[136+rsp] + mov rdx,r9 + xor rdx,rbp + mov r8,rdx + mov rdx,rbp + mov rbp,QWORD[((-8))+rsp] + and r8,rax + and rdx,rcx + xor rdx,r8 + mov QWORD[32+r13],rdx + mov QWORD[((-32))+rsp],rdx + mov r13,r9 + mov rdx,QWORD[104+rsp] + mov r9,QWORD[56+rsp] + and r13,rcx + xor r13,r8 + mov r8,2251799813685247 + mov QWORD[((-120))+rsp],r13 + mov QWORD[32+rdx],r13 + mov rdx,r9 + and r8,r9 + xor rdx,r12 + mov r9,r8 + mov r13,2251799813685247 + and rdx,rax + and r12,r13 + mov r8,2251799813685247 + xor r9,rdx + xor r12,rdx + mov r13,QWORD[88+rsp] + mov QWORD[rbp],r9 + mov QWORD[((-104))+rsp],r9 + mov r9,QWORD[296+rsp] + mov QWORD[r13],r12 + mov rdx,r9 + xor rdx,r11 + and r11,r8 + and rdx,rax + xor r11,rdx + mov QWORD[8+r13],r11 + mov r13,r8 + and r13,r9 + mov r9,QWORD[88+rsp] + xor rdx,r13 + mov QWORD[8+rbp],rdx + mov rbp,QWORD[304+rsp] + mov r13,rbp + xor r13,rdi + and r13,QWORD[128+rsp] + mov r8,r13 + xor rdi,r13 + xor r8,rbp + mov rbp,QWORD[72+rsp] + mov QWORD[16+r9],rdi + mov r13,r8 + mov QWORD[56+rsp],r8 + mov r8,QWORD[((-8))+rsp] + mov QWORD[16+r8],r13 + mov r13,QWORD[72+rsp] + mov r8,2251799813685247 + and rbp,r8 + xor r13,r14 + and r14,r8 + mov r8,QWORD[((-8))+rsp] + and r13,rax + xor r14,r13 + xor r13,rbp + mov rbp,QWORD[288+rsp] + mov QWORD[24+r8],r13 + mov QWORD[24+r9],r14 + mov r8,rbp + xor r8,rbx + and r8,rax + mov rax,2251799813685247 + and rbx,rax + and rax,rbp + xor rbx,r8 + xor rax,r8 + mov r8,QWORD[((-8))+rsp] + mov QWORD[32+r9],rbx + mov QWORD[32+r8],rax + sal BYTE[315+rsp],1 + sub DWORD[316+rsp],1 + jne NEAR $L$10 + mov r9,rdx + mov rdx,QWORD[88+rsp] + mov QWORD[72+rsp],r15 + 
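+ ; End of a Montgomery-ladder step. 121665 = (486662-2)/4 is a24, the
+ ; ladder constant of Curve25519; the 32x64 muls above scale a field
+ ; element by it limb by limb. The xor/and/xor runs are constant-time
+ ; conditional swaps, t = (a ^ b) & mask; a ^= t; b ^= t, with mask
+ ; apparently 0 or all ones (within the 51-bit limb width) derived from
+ ; the current scalar bit, so control flow never depends on the secret
+ ; scalar. "sal BYTE[315+rsp],1" shifts the working scalar byte to expose
+ ; the next bit; $L$10 appears to iterate once per bit, and the outer
+ ; loop below ($L$3) once per scalar byte.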
mov r15,rsi + mov rsi,QWORD[104+rsp] + mov rbp,QWORD[((-72))+rsp] + sub QWORD[344+rsp],1 + mov r8,rax + mov QWORD[8+rsp],rdx + mov rdx,QWORD[248+rsp] + mov QWORD[((-72))+rsp],rsi + mov rsi,QWORD[232+rsp] + mov rax,QWORD[344+rsp] + mov QWORD[104+rsp],rdx + mov rdx,QWORD[240+rsp] + mov QWORD[232+rsp],rdx + mov rdx,QWORD[256+rsp] + mov QWORD[88+rsp],rdx + lea rdx,[359+rsp] + cmp rdx,rax + jne NEAR $L$3 + lea rax,[rbx*8+rbx] + mov QWORD[184+rsp],rbp + mov r8,rbx + mov r15,r14 + mov r13,r11 + mov r11,r12 + lea rbp,[rax*2+rbx] + lea rax,[rdi*8+rdi] + mov QWORD[168+rsp],r10 + mov r9,rdi + lea r14,[r12*1+r12] + lea r12,[r13*1+r13] + lea rax,[rax*2+rdi] + lea r10,[rbp*1+rbp] + lea rbx,[rax*1+rax] + mov QWORD[56+rsp],rax + mov rax,rbx + mul r15 + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r13 + mov rsi,rax + mov rdi,rdx + lea rax,[r15*8+r15] + add rsi,rcx + adc rdi,rbx + mov QWORD[((-120))+rsp],rsi + mov rbx,2251799813685247 + mov rcx,rdi + lea rdi,[rax*2+r15] + mov rax,r13 + mul r14 + and rbx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rcx + mov rsi,QWORD[((-120))+rsp] + mov QWORD[8+rsp],rdi + mov rcx,rax + mov rax,rdi + mov QWORD[((-104))+rsp],rbx + mov rbx,rdx + mov rdi,QWORD[((-112))+rsp] + mul r15 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rcx,2251799813685247 + adc rdi,rbx + and rcx,rsi + mul r9 + mov QWORD[((-88))+rsp],rcx + mov rcx,rax + mov rax,r13 + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r10 + mov r10,2251799813685247 + adc rbx,rdx + mul r15 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + mov rsi,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,r15 + and rsi,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-112))+rsp],rdi + mov rdi,QWORD[((-112))+rsp] + mov QWORD[((-72))+rsp],rsi + mov rsi,QWORD[((-120))+rsp] + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r14 + mov r14,2251799813685247 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + and r10,rsi + mul r8 + mov rcx,rax + mov rax,r12 + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul r9 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov rbx,2251799813685247 + shrd rcx,rdi,51 + and r14,rsi + mov rsi,2251799813685247 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-104))+rsp] + and rbx,rax + shr rax,51 + add rax,QWORD[((-88))+rsp] + mov QWORD[((-24))+rsp],rbx + lea r12,[rbx*1+rbx] + and rsi,rax + shr rax,51 + add rax,QWORD[((-72))+rsp] + lea rcx,[rsi*1+rsi] + mov QWORD[((-88))+rsp],rcx + mov QWORD[((-120))+rsp],rax + lea rax,[r14*8+r14] + lea rcx,[rax*2+r14] + mov rax,rsi + lea rdi,[rcx*1+rcx] + mov QWORD[80+rsp],rcx + mul rdi + mov rcx,rax + mov rax,QWORD[((-24))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + adc rbx,rdx + mov rdx,QWORD[((-120))+rsp] + lea rax,[rdx*8+rdx] + lea rdx,[rax*2+rdx] + lea rax,[rdx*1+rdx] + mov QWORD[88+rsp],rdx + mul r10 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + mov rdx,rcx + lea rax,[rax*2+r10] + mov rcx,rbx + mov QWORD[((-104))+rsp],rdx + mov QWORD[((-96))+rsp],rcx + mov rbx,rax + mov rax,2251799813685247 + and rax,QWORD[((-104))+rsp] + mov QWORD[128+rsp],rbx + mov QWORD[24+rsp],rax + mov rax,rbx + mul r10 + mov rcx,rax + mov rax,r12 + mov rbx,rdx + mul rsi + add rcx,rax + mov 
rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rax,rdx,51 + shr rdx,51 + add rcx,rax + mov rax,rdi + mov rdi,QWORD[((-120))+rsp] + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rdx + mul r10 + mov QWORD[((-8))+rsp],rbx + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,rdi + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rax,rdx,51 + shr rdx,51 + add rcx,rax + mov rax,QWORD[80+rsp] + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rdx + mul r14 + mov QWORD[((-72))+rsp],rbx + mov rcx,rax + mov rax,r12 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rdi + mov rdi,2251799813685247 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rdi,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rbx + mov rbx,QWORD[((-120))+rsp] + mov rax,rbx + mul rbx + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r10 + mov rcx,rdx + mov rdx,rax + add rdx,QWORD[40+rsp] + adc rcx,QWORD[48+rsp] + mov rax,r12 + mov rbx,rcx + mov rcx,rdx + mul r14 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,rcx + shrd rcx,rbx,51 + mov QWORD[((-88))+rsp],rdx + mov rdx,2251799813685247 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[24+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + mov r12,QWORD[((-72))+rsp] + and rdx,rax + shr rax,51 + add r12,rax + lea rax,[rcx*1+rcx] + mov QWORD[((-8))+rsp],rdx + mov QWORD[((-104))+rsp],rax + lea rax,[rdx*1+rdx] + mov rdx,QWORD[((-88))+rsp] + mov QWORD[24+rsp],rax + lea rax,[rdx*8+rdx] + lea rax,[rax*2+rdx] + mov QWORD[136+rsp],rax + add rax,rax + mov QWORD[((-72))+rsp],rax + mov rax,rcx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,QWORD[((-72))+rsp] + mul QWORD[((-8))+rsp] + add rcx,rax + lea rax,[r12*8+r12] + adc rbx,rdx + lea rax,[rax*2+r12] + add rax,rax + mul rdi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[40+rsp],rax + lea rax,[rdi*8+rdi] + mov QWORD[48+rsp],rbx + and rcx,QWORD[40+rsp] + lea rbx,[rax*2+rdi] + mov rax,rbx + mul rdi + mov QWORD[104+rsp],rcx + mov rbx,rdx + mov rcx,rax + mov rax,QWORD[((-8))+rsp] + mul QWORD[((-104))+rsp] + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mul r12 + add rax,rcx + mov rcx,QWORD[40+rsp] + adc rdx,rbx + mov rbx,QWORD[48+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rax,rcx + mov rcx,2251799813685247 + mov QWORD[48+rsp],rbx + mov rbx,QWORD[((-8))+rsp] + mov QWORD[40+rsp],rax + and rcx,QWORD[40+rsp] + mov rax,rbx + mul rbx + mov QWORD[120+rsp],rcx + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[((-72))+rsp] + mov QWORD[rsp],rdx + mul rdi + add rax,QWORD[((-8))+rsp] + adc rdx,QWORD[rsp] + mov rcx,rax + mov rax,QWORD[((-104))+rsp] + mov rbx,rdx + mul r12 + add rax,rcx + mov rcx,QWORD[40+rsp] + adc rdx,rbx + mov rbx,QWORD[48+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov QWORD[((-72))+rsp],rcx 
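+ ; The ladder loop has closed above ($L$3); from here the code appears
+ ; to compute Z^(p-2) = Z^(2^255-21) mod p, a Fermat inversion of the
+ ; ladder's Z via a fixed square-and-multiply chain, so the result can
+ ; be normalized to X/Z. Self-products such as "mov rax,rbx / mul rbx"
+ ; mark the squarings.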
+ mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-72))+rsp] + mov QWORD[((-64))+rsp],rdx + mul rdi + mov QWORD[40+rsp],rbx + mov QWORD[((-8))+rsp],rax + mov QWORD[rsp],rdx + mov rax,QWORD[136+rsp] + mul QWORD[((-88))+rsp] + add rax,QWORD[((-8))+rsp] + adc rdx,QWORD[rsp] + mov rcx,rax + mov rax,QWORD[24+rsp] + mov rbx,rdx + mul r12 + add rax,rcx + mov rcx,QWORD[((-72))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-64))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + adc rbx,rdx + mov QWORD[((-72))+rsp],rcx + and rax,QWORD[((-72))+rsp] + mov QWORD[((-64))+rsp],rbx + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[24+rsp] + mul rdi + mov QWORD[24+rsp],rax + mov rax,r12 + mov QWORD[32+rsp],rdx + mul r12 + mov r12,2251799813685247 + mov rcx,rdx + mov rdx,rax + add rdx,QWORD[24+rsp] + adc rcx,QWORD[32+rsp] + mov rax,QWORD[((-104))+rsp] + mov rbx,rcx + mov rcx,rdx + mul QWORD[((-88))+rsp] + add rax,rcx + mov rcx,QWORD[((-72))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-64))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,rcx + shrd rcx,rbx,51 + mov rdi,rdx + mov rdx,2251799813685247 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[104+rsp] + and r12,rax + shr rax,51 + add rax,QWORD[120+rsp] + and rdx,rax + shr rax,51 + mov QWORD[((-104))+rsp],rdx + add rax,QWORD[40+rsp] + mov QWORD[((-72))+rsp],rdi + mov QWORD[((-88))+rsp],rax + lea rax,[r13*8+r13] + lea rbx,[rax*2+r13] + mov rax,rbx + mul rdi + mov rdi,QWORD[((-8))+rsp] + mov rcx,rax + mov rbx,rdx + mov rax,rdi + mul QWORD[56+rsp] + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + mul QWORD[((-88))+rsp] + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[56+rsp] + mul QWORD[((-72))+rsp] + and rcx,QWORD[((-8))+rsp] + mov QWORD[rsp],rbx + mov QWORD[24+rsp],rcx + mov rcx,rax + mov rax,QWORD[8+rsp] + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + adc rbx,rdx + mov QWORD[((-8))+rsp],rcx + and rax,QWORD[((-8))+rsp] + mov QWORD[rsp],rbx + mov QWORD[40+rsp],rax + mov rax,rdi + mul rbp + mov QWORD[56+rsp],rax + mov QWORD[64+rsp],rdx + mov rax,QWORD[8+rsp] + mul QWORD[((-72))+rsp] + add rax,QWORD[56+rsp] + adc rdx,QWORD[64+rsp] + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r12 + add rax,rcx + mov rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,rdx + mul r13 + add rax,rcx + mov rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rdx,rbx + mov rbx,rdx + mul r11 + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + mov QWORD[((-8))+rsp],rdi + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + mov QWORD[8+rsp],rcx + adc rbx,rdx + and rax,QWORD[8+rsp] + mov QWORD[16+rsp],rbx + mov QWORD[56+rsp],rax + mov rax,rdi + mul r11 + mov QWORD[104+rsp],rax + mov QWORD[112+rsp],rdx + mov rax,rbp + mul QWORD[((-72))+rsp] + mov rcx,QWORD[104+rsp] + mov rbx,QWORD[112+rsp] + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov rdi,rcx + mov rcx,QWORD[8+rsp] + mul r12 + mov rbp,rbx + mov rbx,QWORD[16+rsp] + add rdi,rax + mov rax,QWORD[((-104))+rsp] + adc rbp,rdx 
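+ ; Squaring skeleton: cross terms occur twice, so one factor is
+ ; pre-doubled (the lea reg,[x*1+x] copies) and high limbs are
+ ; pre-scaled by 19; a 5x5 square then takes 15 muls instead of the 25
+ ; a general product would need.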
+ mul r9 + add rdi,rax + mov rax,QWORD[((-88))+rsp] + adc rbp,rdx + mul r13 + add rax,rdi + mov rdi,2251799813685247 + adc rdx,rbp + mov rbp,2251799813685247 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mov QWORD[8+rsp],rcx + and rbp,QWORD[8+rsp] + mov QWORD[16+rsp],rbx + mul r11 + mov QWORD[((-72))+rsp],rax + mov rax,QWORD[((-8))+rsp] + mov QWORD[((-64))+rsp],rdx + mul r13 + add rax,QWORD[((-72))+rsp] + adc rdx,QWORD[((-64))+rsp] + mov rcx,rax + mov rax,r8 + mov rbx,rdx + mov r11,rcx + mov rcx,QWORD[8+rsp] + mul r12 + mov r12,rbx + mov rbx,QWORD[16+rsp] + add r11,rax + mov rax,QWORD[((-104))+rsp] + adc r12,rdx + mul r15 + add r11,rax + mov rax,QWORD[((-88))+rsp] + adc r12,rdx + mul r9 + add rax,r11 + mov r11,2251799813685247 + adc rdx,r12 + mov r12,2251799813685247 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r12,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[24+rsp] + and r11,rax + shr rax,51 + add rax,QWORD[40+rsp] + mov r8,QWORD[56+rsp] + mov r13,QWORD[80+rsp] + mov r15,QWORD[128+rsp] + and rdi,rax + shr rax,51 + add r8,rax + lea rax,[rsi*8+rsi] + lea r9,[rax*2+rsi] + mov rax,r9 + mov r9,QWORD[((-24))+rsp] + mul r12 + mov rcx,rax + mov rax,QWORD[88+rsp] + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,2251799813685247 + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + and rax,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rbx + mov QWORD[((-88))+rsp],rax + mov rax,QWORD[88+rsp] + mul r12 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul rsi + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r8 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-104))+rsp],rax + mov rax,r13 + and rcx,QWORD[((-104))+rsp] + mul rbp + mov QWORD[((-96))+rsp],rbx + mov QWORD[((-72))+rsp],rcx + mov QWORD[((-24))+rsp],rax + mov rax,r15 + mov QWORD[((-16))+rsp],rdx + mul r12 + mov r15,QWORD[((-120))+rsp] + mov rcx,rax + mov rax,r15 + add rcx,QWORD[((-24))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-16))+rsp] + mul r11 + add rax,rcx + mov QWORD[((-120))+rsp],rax + adc rdx,rbx + mov rax,rdi + mov QWORD[((-112))+rsp],rdx + mul rsi + mov rcx,rax + mov rax,r9 + add rcx,QWORD[((-120))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-112))+rsp] + mul r8 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mov QWORD[((-104))+rsp],rdx + mul rbp + mov QWORD[((-24))+rsp],rax + mov rax,r13 + mov QWORD[((-16))+rsp],rdx + mul r12 + mov r13,2251799813685247 + mov rcx,rax + mov rax,r11 + add rcx,QWORD[((-24))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-16))+rsp] + mul r10 + add rax,rcx + mov QWORD[((-24))+rsp],rax + adc rdx,rbx + mov rax,r15 + mov QWORD[((-16))+rsp],rdx + mul rdi + mov rcx,rax + mov rax,r8 + add rcx,QWORD[((-24))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-16))+rsp] + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r9 
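+ ; The spill slots reloaded here (QWORD[...+rsp]) likely hold earlier
+ ; powers of Z from the addition chain; each multiply merges a freshly
+ ; squared run with one of them, using the same mul/carry skeleton as
+ ; the ladder code above.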
+ mov r9,2251799813685247 + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and r13,QWORD[((-120))+rsp] + mul r12 + mov QWORD[((-112))+rsp],rbx + mov rcx,QWORD[((-120))+rsp] + mov QWORD[24+rsp],r13 + mov QWORD[((-24))+rsp],rax + mov rax,rsi + mov QWORD[((-16))+rsp],rdx + mul rbp + mov rbx,rax + mov rax,r14 + add rbx,QWORD[((-24))+rsp] + mov rsi,rdx + adc rsi,QWORD[((-16))+rsp] + mul r11 + add rbx,rax + mov rax,r10 + adc rsi,rdx + mov r13,rbx + mov rbx,QWORD[((-112))+rsp] + mul rdi + mov r14,rsi + add r13,rax + mov rax,r15 + mov r15,QWORD[24+rsp] + adc r14,rdx + mul r8 + add rax,r13 + adc rdx,r14 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,rcx + shrd rcx,rbx,51 + mov rbx,2251799813685247 + mov QWORD[224+rsp],rdx + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + and rbx,rax + shr rax,51 + add rax,QWORD[((-72))+rsp] + lea rsi,[rbx*1+rbx] + mov QWORD[128+rsp],rbx + and r9,rax + shr rax,51 + add rax,QWORD[((-104))+rsp] + mov QWORD[136+rsp],r9 + lea r14,[r9*1+r9] + mov r10,rax + lea rax,[rdx*8+rdx] + mov QWORD[80+rsp],r10 + lea rax,[rax*2+rdx] + lea r13,[rax*1+rax] + mov QWORD[192+rsp],rax + mov rax,rbx + mul rbx + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r13 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + lea rax,[rax*2+r10] + add rax,rax + mul r15 + add rcx,rax + lea rax,[r15*8+r15] + adc rbx,rdx + mov r9,rcx + mov rcx,2251799813685247 + mov r10,rbx + lea rbx,[rax*2+r15] + and rcx,r9 + mov r15,rcx + mov QWORD[216+rsp],rbx + mov rax,QWORD[136+rsp] + mul rsi + mov rcx,rax + mov rbx,rdx + mov rax,QWORD[24+rsp] + mul QWORD[216+rsp] + add rcx,rax + mov rax,QWORD[80+rsp] + adc rbx,rdx + mul r13 + add rax,rcx + mov rcx,r9 + mov r9,QWORD[136+rsp] + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rcx,QWORD[((-120))+rsp] + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rdx + mul r9 + mov QWORD[((-104))+rsp],rbx + mov rbx,QWORD[((-112))+rsp] + mov r9,rax + mov rax,r13 + mov r13,QWORD[24+rsp] + mov r10,rdx + mul r13 + add r9,rax + mov rax,QWORD[80+rsp] + adc r10,rdx + mul rsi + add r9,rax + mov rax,r13 + adc r10,rdx + shrd rcx,rbx,51 + shr rbx,51 + add r9,rcx + mov rcx,2251799813685247 + adc r10,rbx + and rcx,r9 + mul rsi + mov QWORD[((-120))+rsp],rcx + mov rbx,rdx + mov rcx,rax + mov rax,QWORD[224+rsp] + mul QWORD[192+rsp] + add rcx,rax + mov rax,QWORD[80+rsp] + adc rbx,rdx + mul r14 + add rax,rcx + adc rdx,rbx + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,2251799813685247 + adc r10,rdx + and rax,r9 + mov r13,rax + mov rax,r14 + mov r14,QWORD[80+rsp] + mul QWORD[24+rsp] + mov rcx,rax + mov rax,r14 + mov rbx,rdx + mul r14 + mov r14,2251799813685247 + add rcx,rax + mov rax,rsi + mov rsi,2251799813685247 + adc rbx,rdx + mul QWORD[224+rsp] + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and rsi,rax + mov rbx,QWORD[((-104))+rsp] + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add r15,rax + and r14,r15 + shr r15,51 + lea rax,[rbx*1+r15] + mov r15,2251799813685247 + and r15,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + mov r10,rax + lea rax,[r12*8+r12] + mov QWORD[((-120))+rsp],r10 + lea r9,[rax*2+r12] + lea rax,[r8*8+r8] + lea rax,[rax*2+r8] + mov QWORD[((-104))+rsp],r9 + mov QWORD[((-72))+rsp],rax + lea rax,[rbp*8+rbp] + lea rax,[rax*2+rbp] + mov QWORD[((-88))+rsp],rax + lea rax,[rdi*8+rdi] + 
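+ ; Before a general multiply, 19x copies of one operand's limbs are
+ ; precomputed with the same 9x/19x lea idiom, so the high cross terms
+ ; that wrap past 2^255 fold in during accumulation instead of in a
+ ; separate pass afterwards.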
lea rbx,[rax*2+rdi] + mov rax,rbx + mul rsi + mov rcx,rax + mov rax,QWORD[((-72))+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul QWORD[((-88))+rsp] + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mov r9,rcx + mov r10,rbx + mov rbx,2251799813685247 + mul rsi + and rbx,rcx + mov QWORD[((-24))+rsp],rbx + mov rcx,rax + mov rax,QWORD[((-88))+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,rdi + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul QWORD[((-104))+rsp] + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-72))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-72))+rsp] + mov QWORD[((-64))+rsp],rbx + mov rbx,QWORD[((-64))+rsp] + mul r13 + mov QWORD[((-8))+rsp],rcx + mov rcx,QWORD[((-72))+rsp] + mov r9,rax + mov rax,QWORD[((-88))+rsp] + mov r10,rdx + mul rsi + add r9,rax + mov rax,r8 + adc r10,rdx + mul r14 + add r9,rax + mov rax,rdi + adc r10,rdx + mul r15 + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul r11 + add rax,r9 + adc rdx,r10 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + adc rbx,rdx + and rax,rcx + mov QWORD[((-88))+rsp],rax + mov rax,r11 + mul r13 + mov r9,rax + mov rax,QWORD[((-104))+rsp] + mov r10,rdx + mul rsi + add r9,rax + mov rax,rbp + adc r10,rdx + mul r14 + add r9,rax + mov rax,r8 + adc r10,rdx + mul r15 + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul rdi + add rax,r9 + mov r9,2251799813685247 + adc rdx,r10 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r11 + adc rbx,rdx + and r9,rcx + mul rsi + mov r10,rax + mov rax,rdi + mov r11,rdx + mul r13 + mov r13,2251799813685247 + mov rsi,rax + mov rdi,rdx + mov rax,r12 + add rsi,r10 + adc rdi,r11 + mul r14 + add rsi,rax + mov rax,rbp + mov rbp,2251799813685247 + adc rdi,rdx + mul r15 + add rsi,rax + mov rax,QWORD[((-120))+rsp] + adc rdi,rdx + mul r8 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov rdx,rsi + mov rbx,QWORD[((-88))+rsp] + shrd rdx,rdi,51 + mov rdi,2251799813685247 + and r13,rsi + lea rax,[rdx*8+rdx] + lea rax,[rax*2+rdx] + add rax,QWORD[((-24))+rsp] + and rbp,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + lea rsi,[rbp*1+rbp] + and rdi,rax + shr rax,51 + lea r8,[rbx*1+rax] + lea rax,[r13*8+r13] + lea r15,[rdi*1+rdi] + lea r12,[rax*2+r13] + mov rax,rdi + mov QWORD[((-8))+rsp],r15 + lea r14,[r12*1+r12] + mul r14 + mov rcx,rax + mov rax,rbp + mov rbx,rdx + mul rbp + add rcx,rax + lea rax,[r8*8+r8] + adc rbx,rdx + lea rax,[rax*2+r8] + mov QWORD[((-24))+rsp],rax + add rax,rax + mul r9 + add rcx,rax + lea rax,[r9*8+r9] + adc rbx,rdx + mov r10,rcx + mov rcx,2251799813685247 + mov r11,rbx + lea rbx,[rax*2+r9] + and rcx,r10 + mov QWORD[((-88))+rsp],rcx + mov rax,rbx + mov QWORD[((-72))+rsp],rbx + mul r9 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r14 + mov r14,2251799813685247 + adc rbx,rdx + mov rdx,2251799813685247 + shrd r10,r11,51 + shr r11,51 + add r10,rcx + adc r11,rbx + and rdx,r10 + mov QWORD[((-104))+rsp],rdx + mul r9 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul rsi + add rcx,rax + mov rax,r12 + adc rbx,rdx + shrd 
r10,r11,51 + shr r11,51 + add r10,rcx + adc r11,rbx + and r14,r10 + mul r13 + mov QWORD[((-120))+rsp],r14 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r8 + add rax,rcx + mov rcx,r10 + mov r10,2251799813685247 + adc rdx,rbx + mov rbx,r11 + shrd rcx,r11,51 + shr rbx,51 + mov r11,2251799813685247 + add rcx,rax + mov rax,r8 + adc rbx,rdx + and r10,rcx + mul r8 + mov r14,rax + mov rax,QWORD[((-8))+rsp] + mov r15,rdx + mul r9 + add r14,rax + mov rax,rsi + mov rsi,QWORD[((-120))+rsp] + adc r15,rdx + mul r13 + add rax,r14 + mov r14,2251799813685247 + adc rdx,r15 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add rax,QWORD[((-104))+rsp] + lea r15,[rcx*1+rcx] + mov QWORD[((-120))+rsp],rcx + and r11,rax + shr rax,51 + add rsi,rax + lea rax,[r14*8+r14] + lea rdx,[r11*1+r11] + lea rbx,[rax*2+r14] + mov QWORD[((-88))+rsp],rdx + lea rdx,[rbx*1+rbx] + mov QWORD[40+rsp],rbx + mov rax,rdx + mov QWORD[((-104))+rsp],rdx + mul r11 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r10 + add rcx,rax + lea rax,[r10*8+r10] + mov QWORD[((-120))+rsp],rcx + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + lea rcx,[rax*2+r10] + mov QWORD[((-112))+rsp],rbx + mov rax,rcx + mov QWORD[8+rsp],rdx + mul r10 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mul r10 + mov QWORD[((-8))+rsp],rcx + mov QWORD[((-104))+rsp],rax + mov rax,r11 + mov QWORD[((-96))+rsp],rdx + mul r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,rsi + add rcx,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-96))+rsp] + mul r15 + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + adc rdx,rbx + mov QWORD[((-112))+rsp],rdx + and r11,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-104))+rsp],r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and r11,QWORD[((-120))+rsp] + mul rsi + mov QWORD[((-112))+rsp],rbx + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r10 + mov rbx,rax + mov rax,r14 + add rbx,QWORD[40+rsp] + mov rsi,rdx + adc rsi,QWORD[48+rsp] + mov rcx,QWORD[((-120))+rsp] + mul r15 + mov r14,2251799813685247 + add rax,rbx + mov rbx,QWORD[((-112))+rsp] + adc rdx,rsi + mov rsi,QWORD[((-104))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rdx,2251799813685247 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add 
rax,QWORD[((-8))+rsp] + mov QWORD[((-120))+rsp],rcx + lea r15,[rcx*1+rcx] + and rdx,rax + shr rax,51 + add rsi,rax + lea rax,[r14*8+r14] + mov r10,rdx + lea rdx,[rdx*1+rdx] + lea rbx,[rax*2+r14] + mov QWORD[((-88))+rsp],rdx + lea rdx,[rbx*1+rbx] + mov QWORD[40+rsp],rbx + mov rax,rdx + mov QWORD[((-104))+rsp],rdx + mul r10 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r11 + add rcx,rax + lea rax,[r11*8+r11] + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + lea rcx,[rax*2+r11] + mov QWORD[((-112))+rsp],rbx + mov rax,rcx + mov QWORD[8+rsp],rdx + mul r11 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mul r11 + mov QWORD[((-8))+rsp],rcx + mov QWORD[((-104))+rsp],rax + mov rax,r10 + mov QWORD[((-96))+rsp],rdx + mul r10 + mov r10,2251799813685247 + mov rcx,rax + mov rax,rsi + add rcx,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-96))+rsp] + mul r15 + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + adc rdx,rbx + mov QWORD[((-112))+rsp],rdx + and r10,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-104))+rsp],r10 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rcx,QWORD[((-120))+rsp] + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rdx + mul rsi + mov r10,rbx + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r11 + mov r11,2251799813685247 + mov rbx,rax + mov rax,r14 + add rbx,QWORD[40+rsp] + mov rsi,rdx + adc rsi,QWORD[48+rsp] + mov r14,2251799813685247 + mul r15 + add rax,rbx + mov rbx,QWORD[((-112))+rsp] + adc rdx,rsi + mov rsi,QWORD[((-104))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + mov QWORD[((-120))+rsp],rcx + lea r15,[rcx*1+rcx] + and r11,rax + shr rax,51 + add rsi,rax + lea rax,[r14*8+r14] + lea rdx,[r11*1+r11] + lea rbx,[rax*2+r14] + mov QWORD[((-88))+rsp],rdx + lea rdx,[rbx*1+rbx] + mov QWORD[40+rsp],rbx + mov rax,rdx + mov QWORD[((-104))+rsp],rdx + mul r11 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r10 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + lea rcx,[rax*2+r10] + mov QWORD[((-112))+rsp],rbx + mov rax,rcx + mov QWORD[8+rsp],rdx + mul r10 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc 
rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mul r10 + mov QWORD[((-8))+rsp],rcx + mov QWORD[((-104))+rsp],rax + mov rax,r11 + mov QWORD[((-96))+rsp],rdx + mul r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,rsi + add rcx,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-96))+rsp] + mul r15 + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + adc rdx,rbx + mov QWORD[((-112))+rsp],rdx + and r11,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-104))+rsp],r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and r11,QWORD[((-120))+rsp] + mul rsi + mov QWORD[((-112))+rsp],rbx + mov rcx,QWORD[((-120))+rsp] + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r10 + mov rbx,rax + mov rax,r14 + add rbx,QWORD[40+rsp] + mov rsi,rdx + adc rsi,QWORD[48+rsp] + mov r14,2251799813685247 + mul r15 + mov r15,2251799813685247 + add rax,rbx + mov rbx,QWORD[((-112))+rsp] + adc rdx,rsi + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov rcx,2251799813685247 + mov rsi,QWORD[((-104))+rsp] + and rcx,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + and r15,rax + shr rax,51 + lea r10,[rsi*1+rax] + lea rax,[rcx*1+rcx] + lea rsi,[r15*1+r15] + mov QWORD[((-120))+rsp],rax + lea rax,[r14*8+r14] + mov QWORD[((-104))+rsp],rsi + lea rdx,[rax*2+r14] + mov rax,rcx + lea rsi,[rdx*1+rdx] + mov QWORD[56+rsp],rdx + mul rcx + mov QWORD[((-88))+rsp],rsi + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r15 + mov rsi,2251799813685247 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + lea rax,[rax*2+r10] + add rax,rax + mul r11 + add rcx,rax + lea rax,[r11*8+r11] + adc rbx,rdx + mov QWORD[((-8))+rsp],rcx + mov QWORD[rsp],rbx + mov rbx,rcx + and rbx,rsi + mov QWORD[8+rsp],rbx + lea rbx,[rax*2+r11] + mov rax,rbx + mul r11 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul r10 + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov QWORD[((-8))+rsp],rcx + and rcx,rsi + mul r15 + mov r15,QWORD[((-120))+rsp] + mov QWORD[40+rsp],rcx + mov QWORD[rsp],rbx + mov QWORD[88+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[96+rsp],rdx + mul r11 + mov rcx,rax + mov rax,r15 + add rcx,QWORD[88+rsp] + mov rbx,rdx + adc rbx,QWORD[96+rsp] + mul r10 + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov QWORD[((-88))+rsp],rcx + and rcx,rsi + mul r11 + mov QWORD[((-80))+rsp],rbx + mov QWORD[((-8))+rsp],rcx + mov QWORD[88+rsp],rax + mov rax,QWORD[56+rsp] + mov QWORD[96+rsp],rdx + mul r14 + mov rcx,rax + add rcx,QWORD[88+rsp] + 
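+ ; These near-identical blocks are unrolled consecutive squarings; only
+ ; the register and spill-slot assignment differs from copy to copy.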
mov rax,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[96+rsp] + mul r10 + add rax,rcx + mov rcx,QWORD[((-88))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-80))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,rcx + and rdx,rsi + mov r15,rdx + mul r11 + mov QWORD[((-104))+rsp],rax + mov rax,r10 + mov QWORD[((-96))+rsp],rdx + mul r10 + mov r10,rax + mov rax,QWORD[((-120))+rsp] + add r10,QWORD[((-104))+rsp] + mov r11,rdx + adc r11,QWORD[((-96))+rsp] + mul r14 + add rax,r10 + adc rdx,r11 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r14,rcx + shrd rcx,rbx,51 + and r14,rsi + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov r10,rax + shr rax,51 + add rax,QWORD[40+rsp] + and r10,rsi + mov QWORD[((-104))+rsp],r10 + mov r11,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + and r11,rsi + mov QWORD[((-88))+rsp],r11 + mov QWORD[((-120))+rsp],rax + lea rax,[rdi*8+rdi] + lea rcx,[rax*2+rdi] + mov rax,rcx + mul r14 + mov rcx,rax + mov rax,QWORD[((-24))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul QWORD[((-72))+rsp] + add rcx,rax + mov rax,QWORD[((-24))+rsp] + adc rbx,rdx + mov r10,rcx + and rcx,rsi + mov QWORD[((-8))+rsp],rcx + mov r11,rbx + mul r14 + mov rcx,rax + mov rax,QWORD[((-72))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul r12 + add rax,rcx + mov rcx,r10 + adc rdx,rbx + mov rbx,r11 + shrd rcx,r11,51 + shr rbx,51 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov rdx,rcx + and rdx,rsi + mov QWORD[((-24))+rsp],rdx + mul r12 + mov r10,rax + mov rax,QWORD[((-72))+rsp] + mov r11,rdx + mul r14 + add r10,rax + mov rax,QWORD[((-104))+rsp] + adc r11,rdx + mul r8 + add r10,rax + mov rax,QWORD[((-88))+rsp] + adc r11,rdx + mul rdi + add r10,rax + mov rax,QWORD[((-120))+rsp] + adc r11,rdx + mul rbp + add rax,r10 + adc rdx,r11 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + and rax,rsi + mov QWORD[((-72))+rsp],rax + mov rax,r15 + mul rbp + mov r10,rax + mov rax,r12 + mov r11,rdx + mul r14 + add r10,rax + mov rax,QWORD[((-104))+rsp] + adc r11,rdx + mul r9 + add r10,rax + mov rax,QWORD[((-88))+rsp] + adc r11,rdx + mul r8 + add r10,rax + mov rax,QWORD[((-120))+rsp] + adc r11,rdx + mul rdi + add rax,r10 + adc rdx,r11 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mov QWORD[8+rsp],rcx + mul rbp + mov QWORD[16+rsp],rbx + mov rbx,rcx + and rbx,rsi + mov rcx,QWORD[8+rsp] + mov r10,rbx + mov QWORD[104+rsp],rbx + mov rbx,QWORD[16+rsp] + mov r11,rax + mov rax,r15 + mov r12,rdx + mul rdi + mov rdi,rax + mov rax,QWORD[((-104))+rsp] + mov rbp,rdx + add rdi,r11 + adc rbp,r12 + mov r12,r10 + mul r13 + mov r13,rsi + add rdi,rax + mov rax,QWORD[((-88))+rsp] + adc rbp,rdx + mul r9 + add rdi,rax + mov rax,QWORD[((-120))+rsp] + adc rbp,rdx + mul r8 + add rax,rdi + adc rdx,rbp + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,rcx + shrd rcx,rbx,51 + and rdx,rsi + lea rax,[rcx*8+rcx] + mov QWORD[208+rsp],rdx + mov rbp,rdx + lea rax,[rax*2+rcx] + add rax,QWORD[((-8))+rsp] + mov r9,rax + shr rax,51 + add rax,QWORD[((-24))+rsp] + and r9,rsi + mov QWORD[((-24))+rsp],10 + mov QWORD[8+rsp],r9 + mov rcx,r9 + mov r14,rax + shr rax,51 + add 
rax,QWORD[((-72))+rsp] + and r14,rsi + mov QWORD[40+rsp],r14 + mov r15,rax + mov QWORD[56+rsp],rax + + +$L$4: + lea rax,[rbp*8+rbp] + lea r8,[rcx*1+rcx] + lea r10,[r14*1+r14] + lea rax,[rax*2+rbp] + mov QWORD[((-120))+rsp],r10 + lea r11,[rax*1+rax] + mov QWORD[((-72))+rsp],rax + lea rax,[r15*8+r15] + lea rax,[rax*2+r15] + lea rbx,[rax*1+rax] + mov rax,rbx + mul r12 + mov rsi,rax + mov rax,rcx + mov rdi,rdx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,r11 + add rcx,rsi + adc rbx,rdi + mul r14 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mov rsi,rcx + mul rbp + mov rdi,rbx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r15 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,rsi + and rbx,r13 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,r14 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,r8 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + mul r15 + and rdi,r13 + mov r10,rdi + mov rsi,rax + mov rax,r14 + mov rdi,rdx + mul r14 + add rsi,rax + mov rax,r11 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r9,rsi + mul r8 + and r9,r13 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r13 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov rbp,rax + shrd rax,rdx,51 + and rbp,r13 + mov rcx,rax + lea rax,[rax*8+rax] + lea r15,[rax*2+rcx] + add r15,QWORD[((-88))+rsp] + mov rcx,r15 + shr r15,51 + add r15,r10 + and rcx,r13 + mov r14,r15 + shr r15,51 + and r14,r13 + add r15,r9 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$4 + mov rbx,QWORD[208+rsp] + mov r11,QWORD[40+rsp] + mov r9,rcx + mov rcx,QWORD[104+rsp] + mov r8,QWORD[8+rsp] + lea rax,[rbx*8+rbx] + lea rdi,[rax*2+rbx] + lea rax,[r11*8+r11] + mov rbx,QWORD[56+rsp] + lea rax,[rax*2+r11] + mov QWORD[88+rsp],rdi + mov QWORD[248+rsp],rax + lea rax,[rbx*8+rbx] + lea rsi,[rax*2+rbx] + lea rax,[rcx*8+rcx] + lea r10,[rax*2+rcx] + mov rax,r8 + mov QWORD[240+rsp],rsi + mul r9 + mov QWORD[200+rsp],r10 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[248+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul r12 + mov rsi,rax + mov rdi,rdx + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov r10,rax + mov rax,r8 + mov r8,r11 + mul r14 + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r11,rsi + mul r15 + and r11,r13 + mov rcx,rax + mov rax,QWORD[56+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + 
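+ ; $L$4 above runs 10 times (counter at [-24+rsp]); $L$5 and $L$6 below
+ ; run 20 and 10 times. One pass of each body looks like a single field
+ ; squaring (~15 muls), and each loop is followed by a multiply with a
+ ; saved power, consistent with the long squaring runs of the standard
+ ; 2^255-21 inversion chain.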
and rax,r13 + mov r8,rax + mov rax,QWORD[8+rsp] + mul r12 + mov rcx,rax + mov rax,QWORD[104+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[56+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov QWORD[144+rsp],rax + mov rax,rbp + mov rbp,r13 + mul QWORD[8+rsp] + mov rcx,rax + mov rbx,rdx + mov rax,r9 + mul QWORD[208+rsp] + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul QWORD[40+rsp] + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul QWORD[104+rsp] + add rcx,rax + mov rax,r15 + mov r15,2251799813685247 + adc rbx,rdx + mul QWORD[56+rsp] + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + shrd rcx,rbx,51 + and rdi,r13 + lea rax,[rcx*8+rcx] + mov QWORD[232+rsp],rdi + lea rax,[rax*2+rcx] + add r10,rax + mov r9,r10 + shr r10,51 + add r11,r10 + and r9,r13 + and rbp,r11 + shr r11,51 + mov QWORD[152+rsp],r9 + lea r14,[r8*1+r11] + mov QWORD[120+rsp],rbp + mov rcx,r9 + mov rsi,rbp + mov r11,rdi + mov QWORD[160+rsp],r14 + mov r13,QWORD[144+rsp] + mov QWORD[((-8))+rsp],20 + + +$L$5: + lea rax,[r11*8+r11] + lea r12,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea rax,[rax*2+r11] + mov QWORD[((-120))+rsp],rdi + lea r8,[rax*1+rax] + mov QWORD[((-24))+rsp],rax + lea rax,[r14*8+r14] + lea rax,[rax*2+r14] + lea rbx,[rax*1+rax] + mov rax,rbx + mul r13 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,r8 + add rcx,r9 + adc rbx,r10 + mul rsi + add rcx,rax + mov rax,r12 + adc rbx,rdx + mov r9,rcx + mul r11 + mov r10,rbx + mov rbp,r9 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r13*8+r13] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + and rbp,r15 + lea rcx,[rax*2+r13] + mov QWORD[((-96))+rsp],rbx + mov rax,rcx + mul r13 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r12 + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + adc rbx,r10 + mov QWORD[((-88))+rsp],rcx + mul r14 + mov QWORD[((-80))+rsp],rbx + mov rbx,rcx + and rbx,r15 + mov rcx,QWORD[((-88))+rsp] + mov QWORD[((-72))+rsp],rbx + mov rbx,QWORD[((-80))+rsp] + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r13 + add rsi,rax + mov rax,r13 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul r12 + and r8,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[((-24))+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov r13,rsi + shrd rcx,rdi,51 + mov rbx,rdi + and r13,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + shrd rax,rdx,51 + and r11,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea r14,[rax*2+rcx] + add r14,rbp + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r8 + sub QWORD[((-8))+rsp],1 + jne NEAR $L$5 + mov rbx,QWORD[232+rsp] + mov r9,rcx + mov rcx,QWORD[144+rsp] + mov r8,r11 + mov rdi,QWORD[120+rsp] + mov rbp,rsi + mov QWORD[((-24))+rsp],10 + lea rax,[rbx*8+rbx] + lea r12,[rax*2+rbx] + mov 
rbx,QWORD[160+rsp] + lea rax,[rbx*8+rbx] + lea r11,[rax*2+rbx] + lea rax,[rcx*8+rcx] + lea r10,[rax*2+rcx] + mov rax,QWORD[152+rsp] + mul r9 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r12 + add rcx,rax + lea rax,[rdi*8+rdi] + adc rbx,rdx + lea rax,[rax*2+rdi] + mul r8 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r11 + mov rsi,rax + mov rdi,rdx + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[152+rsp] + mul rbp + mov rcx,rax + mov rax,QWORD[120+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r11,rsi + mul r14 + and r11,r15 + mov rcx,rax + mov rax,QWORD[160+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[120+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r10,rsi + mul r13 + and r10,r15 + mov rcx,rax + mov rax,QWORD[144+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[160+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[120+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r12,rsi + mul r8 + and r12,r15 + mov rcx,rax + mov rax,QWORD[232+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[120+rsp] + adc rbx,rdx + mul r13 + mov r13,2251799813685247 + add rcx,rax + mov rax,QWORD[144+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[160+rsp] + adc rbx,rdx + mul r14 + mov r14,r15 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov rbp,rsi + shrd rcx,rdi,51 + and rbp,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-120))+rsp] + mov r9,rax + shr rax,51 + add r11,rax + and r9,r15 + and r14,r11 + shr r11,51 + mov rcx,r9 + lea r15,[r10*1+r11] + + +$L$6: + lea rax,[rbp*8+rbp] + lea r8,[rcx*1+rcx] + lea r10,[r14*1+r14] + lea rax,[rax*2+rbp] + mov QWORD[((-120))+rsp],r10 + lea r11,[rax*1+rax] + mov QWORD[((-72))+rsp],rax + lea rax,[r15*8+r15] + lea rax,[rax*2+r15] + lea rbx,[rax*1+rax] + mov rax,rbx + mul r12 + mov rsi,rax + mov rax,rcx + mov rdi,rdx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,r11 + add rcx,rsi + adc rbx,rdi + mul r14 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mov rsi,rcx + mul rbp + mov rdi,rbx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r15 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,rsi + and rbx,r13 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,r14 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,r8 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + mul r15 + and rdi,r13 + mov r10,rdi + mov rsi,rax + mov rax,r14 + mov rdi,rdx + mul r14 + add rsi,rax + mov rax,r11 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov 
r9,rsi + mul r8 + and r9,r13 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r13 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov rbp,rax + shrd rax,rdx,51 + and rbp,r13 + mov rcx,rax + lea rax,[rax*8+rax] + lea r15,[rax*2+rcx] + add r15,QWORD[((-88))+rsp] + mov rcx,r15 + shr r15,51 + add r15,r10 + and rcx,r13 + mov r14,r15 + shr r15,51 + and r14,r13 + add r15,r9 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$6 + mov r11,QWORD[8+rsp] + mov r10,QWORD[88+rsp] + mov r9,rcx + mov QWORD[((-24))+rsp],50 + mov rax,r11 + mul rcx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[248+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul r12 + mov rsi,rax + mov rdi,rdx + mov rax,r11 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul r14 + and r8,r13 + mov rcx,rax + mov rax,QWORD[40+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov r10,rax + mov rax,r11 + mul r15 + mov rcx,rax + mov rax,QWORD[56+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov r11,rax + mov rax,QWORD[8+rsp] + mul r12 + mov rcx,rax + mov rax,QWORD[104+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[56+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov QWORD[88+rsp],rax + mov rax,QWORD[8+rsp] + mul rbp + mov rbp,r13 + mov rcx,rax + mov rax,QWORD[208+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r12 + mov r12,QWORD[88+rsp] + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[56+rsp] + adc rbx,rdx + mul r15 + mov r15,2251799813685247 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + shrd rcx,rbx,51 + and rdi,r13 + lea rax,[rcx*8+rcx] + mov QWORD[160+rsp],rdi + lea rax,[rax*2+rcx] + add r8,rax + mov rcx,r8 + shr r8,51 + add r10,r8 + and rcx,r13 + and rbp,r10 + shr r10,51 + mov QWORD[40+rsp],rcx + lea r14,[r11*1+r10] + mov QWORD[8+rsp],rbp + mov r11,rdi + mov rsi,rbp + mov QWORD[((-8))+rsp],r14 + + +$L$7: + lea rax,[r11*8+r11] + lea rbp,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea r13,[rax*2+r11] + lea rax,[r14*8+r14] + mov QWORD[((-120))+rsp],rdi + lea r9,[rax*2+r14] + lea r8,[r13*1+r13] + add r9,r9 + mov rax,r9 + mul r12 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add r9,rax + mov rax,rbp + adc r10,rdx + mul r11 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r12 + add rcx,rax + mov 
rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,r9 + and rbx,r15 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,rbp + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + mov rdx,rcx + adc rbx,r10 + and rdx,r15 + mov QWORD[((-72))+rsp],rdx + mul r14 + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul rbp + and r8,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + shrd rax,rdx,51 + and r11,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea r14,[rax*2+rcx] + add r14,QWORD[((-88))+rsp] + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r8 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$7 + mov rbx,QWORD[160+rsp] + mov rdi,QWORD[((-8))+rsp] + mov rbp,rsi + mov rsi,QWORD[88+rsp] + mov r10,QWORD[40+rsp] + lea rax,[rbx*8+rbx] + lea r13,[rax*2+rbx] + mov rbx,QWORD[8+rsp] + mov QWORD[152+rsp],r13 + lea rax,[rbx*8+rbx] + lea rax,[rax*2+rbx] + mov QWORD[240+rsp],rax + lea rax,[rdi*8+rdi] + lea r8,[rax*2+rdi] + lea rax,[rsi*8+rsi] + lea r9,[rax*2+rsi] + mov rax,r10 + mov QWORD[232+rsp],r8 + mul rcx + mov QWORD[200+rsp],r9 + mov rsi,rax + mov rax,r13 + mov rdi,rdx + mul rbp + add rsi,rax + mov rax,QWORD[240+rsp] + adc rdi,rdx + mul r11 + add rsi,rax + mov rax,r9 + adc rdi,rdx + mul r14 + add rsi,rax + mov rax,r8 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,rsi + adc rdi,rdx + mov QWORD[((-120))+rsp],rsi + and rax,r15 + mov QWORD[((-112))+rsp],rdi + mov r8,rax + mov rax,r10 + mov r10,QWORD[((-112))+rsp] + mul rbp + mov rsi,rax + mov rax,rbx + mov rdi,rdx + mul rcx + add rsi,rax + mov rax,r13 + adc rdi,rdx + mul r14 + add rsi,rax + mov rax,QWORD[232+rsp] + adc rdi,rdx + mul r11 + add rsi,rax + mov rax,r9 + mov r9,QWORD[((-120))+rsp] + adc rdi,rdx + mul r12 + add rsi,rax + adc rdi,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rsi + mov rax,r9 + adc r10,rdi + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + mul r14 + mov rsi,rax + mov rax,QWORD[((-8))+rsp] + mov rdi,rdx + mul rcx + add rsi,rax + mov rax,rbx + adc rdi,rdx + mul rbp + add rsi,rax + mov rax,r13 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,QWORD[200+rsp] + adc rdi,rdx + mul r11 + add rsi,rax + mov rax,QWORD[40+rsp] + adc rdi,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rsi + mov rdx,r9 + adc r10,rdi + and rdx,r15 + mov r13,rdx + mul r12 + mov rsi,rax + mov rax,QWORD[88+rsp] + mov rdi,rdx + mul rcx + add rsi,rax + mov rax,QWORD[((-8))+rsp] + adc rdi,rdx + mul rbp + add rsi,rax + mov rax,rbx + adc rdi,rdx + mul r14 + add rsi,rax + mov rax,QWORD[152+rsp] + adc rdi,rdx + mul r11 + add rax,rsi + mov rsi,r9 + adc rdx,rdi + mov rdi,r10 + shrd rsi,r10,51 + shr rdi,51 + add rsi,rax + mov rax,rsi + adc rdi,rdx + and rax,r15 + mov QWORD[56+rsp],rax + mov rax,r11 + mul QWORD[40+rsp] + mov r9,rax + mov r10,rdx + 
mov rax,rcx + mul QWORD[160+rsp] + mov rcx,rax + mov rbx,rdx + mov rax,r12 + add rcx,r9 + adc rbx,r10 + mul QWORD[8+rsp] + add rcx,rax + mov rax,rbp + adc rbx,rdx + mul QWORD[88+rsp] + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul QWORD[((-8))+rsp] + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + shrd rcx,rbx,51 + and rdi,r15 + lea rax,[rcx*8+rcx] + mov QWORD[208+rsp],rdi + lea rax,[rax*2+rcx] + add r8,rax + mov rax,QWORD[((-120))+rsp] + mov rsi,r8 + shr r8,51 + and rsi,r15 + lea r11,[rax*1+r8] + mov QWORD[104+rsp],rsi + mov rcx,rsi + and r15,r11 + shr r11,51 + mov QWORD[120+rsp],r15 + lea r14,[r13*1+r11] + mov r12,QWORD[56+rsp] + mov rbp,r15 + mov QWORD[((-24))+rsp],100 + mov r15,2251799813685247 + mov rsi,rbp + mov QWORD[144+rsp],r14 + mov rbp,rdi + + +$L$8: + lea rax,[rbp*8+rbp] + lea r11,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea r13,[rax*2+rbp] + lea rax,[r14*8+r14] + mov QWORD[((-120))+rsp],rdi + lea r9,[rax*2+r14] + lea r8,[r13*1+r13] + add r9,r9 + mov rax,r9 + mul r12 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add r9,rax + mov rax,r11 + adc r10,rdx + mul rbp + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,r9 + and rbx,r15 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + mov rdx,rcx + adc rbx,r10 + and rdx,r15 + mov QWORD[((-72))+rsp],rdx + mul r14 + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r9,rsi + mul r11 + and r9,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov rbp,rax + shrd rax,rdx,51 + and rbp,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea r14,[rax*2+rcx] + add r14,QWORD[((-88))+rsp] + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r9 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$8 + mov rbx,QWORD[208+rsp] + mov r11,rbp + mov rbp,rsi + mov rsi,rcx + mov rcx,QWORD[56+rsp] + mov r9,QWORD[104+rsp] + mov r10,QWORD[120+rsp] + mov QWORD[((-24))+rsp],50 + lea rax,[rbx*8+rbx] + lea rdi,[rax*2+rbx] + mov rbx,QWORD[144+rsp] + lea rax,[rbx*8+rbx] + lea r13,[rax*2+rbx] + lea rax,[rcx*8+rcx] + lea r8,[rax*2+rcx] + mov rax,r9 + mul rsi + mov rcx,rax + mov rax,rbp + mov rbx,rdx + mul rdi + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + lea rax,[rax*2+r10] + mul r11 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and rax,r15 + mov QWORD[((-112))+rsp],rbx + mov QWORD[((-104))+rsp],rax + mov rax,r9 + mov r9,QWORD[((-120))+rsp] + mul rbp + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rsi + mov 
r10,QWORD[((-112))+rsp] + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r13 + mov r13,QWORD[120+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rcx + mov rdx,r9 + adc r10,rbx + and rdx,r15 + mov QWORD[((-120))+rsp],rdx + mul r14 + mov rcx,rax + mov rax,QWORD[144+rsp] + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rcx + adc r10,rbx + mov r8,r9 + mul r12 + and r8,r15 + mov rcx,rax + mov rax,QWORD[56+rsp] + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,QWORD[144+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,rdi + adc rbx,rdx + mul r11 + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + mov rdi,rcx + and rdi,r15 + mul r11 + mov r13,rdi + mov r9,rax + mov rax,QWORD[208+rsp] + mov r10,rdx + mul rsi + mov rsi,rax + mov rax,QWORD[120+rsp] + mov rdi,rdx + add rsi,r9 + adc rdi,r10 + mul r12 + add rsi,rax + mov rax,QWORD[56+rsp] + adc rdi,rdx + mul rbp + add rsi,rax + mov rax,QWORD[144+rsp] + adc rdi,rdx + mul r14 + add rsi,rax + adc rdi,rdx + mov rdx,QWORD[((-120))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + mov rbx,QWORD[((-104))+rsp] + and r12,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + lea rbp,[rbx*1+rax] + mov r9,rbp + shr rbp,51 + lea r11,[rbp*1+rdx] + and r9,r15 + mov rcx,r9 + and r15,r11 + shr r11,51 + mov rbp,r15 + lea r14,[r8*1+r11] + mov r15,2251799813685247 + mov rsi,rbp + + +$L$9: + lea rax,[r12*8+r12] + lea r11,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea rbp,[rax*2+r12] + lea rax,[r14*8+r14] + mov QWORD[((-120))+rsp],rdi + lea r9,[rax*2+r14] + lea r8,[rbp*1+rbp] + add r9,r9 + mov rax,r9 + mul r13 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add r9,rax + mov rax,r11 + adc r10,rdx + mul r12 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r13*8+r13] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r13] + mov QWORD[((-96))+rsp],rbx + mov rbx,r9 + and rbx,r15 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r13 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + mov rdx,rcx + adc rbx,r10 + and rdx,r15 + mov QWORD[((-72))+rsp],rdx + mul r14 + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r13 + add rsi,rax + mov rax,r13 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul r11 + and r8,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r13,rcx + shrd rcx,rbx,51 + and r13,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r12,rax + shrd rax,rdx,51 + and r12,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea 
r14,[rax*2+rcx] + add r14,QWORD[((-88))+rsp] + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r8 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$9 + mov r11,QWORD[40+rsp] + mov r10,QWORD[152+rsp] + mov r9,rcx + mov rbp,rsi + mov rax,r11 + mul rcx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[232+rsp] + adc rbx,rdx + mul r13 + mov rsi,rax + mov rdi,rdx + mov rax,r11 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul rbp + and r8,r15 + mov rcx,rax + mov rax,QWORD[8+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[232+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r13 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov r10,rax + mov rax,r11 + mul r14 + mov rcx,rax + mov rax,QWORD[((-8))+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,r11 + mul r13 + mov rcx,rax + mov rax,QWORD[88+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[((-8))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,r11 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rsi,rcx + mul r12 + and rsi,r15 + mov r11,rax + mov rax,QWORD[160+rsp] + mov r12,rdx + mul r9 + add r11,rax + mov rax,QWORD[8+rsp] + adc r12,rdx + mul r13 + add r11,rax + mov rax,QWORD[88+rsp] + adc r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[((-8))+rsp] + adc r12,rdx + mul r14 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r9,rcx + shrd rcx,rbx,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add r8,rax + mov rax,QWORD[((-120))+rsp] + mov rcx,r8 + shr r8,51 + add r10,r8 + and rcx,r15 + mov r8,r10 + shr r10,51 + lea rdi,[rax*1+r10] + lea rax,[r9*8+r9] + and r8,r15 + lea r10,[rcx*1+rcx] + lea r14,[r8*1+r8] + lea r13,[rax*2+r9] + mov rax,rcx + mul rcx + lea rbp,[r13*1+r13] + mov rcx,rax + mov rax,rbp + mov rbx,rdx + mul r8 + add rcx,rax + lea rax,[rdi*8+rdi] + adc rbx,rdx + lea rax,[rax*2+rdi] + add rax,rax + mul rsi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-120))+rsp],rax + lea rax,[rsi*8+rsi] + lea rcx,[rax*2+rsi] + mov rax,rcx + mul rsi + mov rcx,rax + mov rax,r8 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mul rdi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul rsi + mov QWORD[((-88))+rsp],rcx + mov QWORD[((-96))+rsp],rbx + mov rcx,QWORD[((-104))+rsp] + mov rbx,QWORD[((-96))+rsp] + mov r11,rax + mov rax,r8 + mov r12,rdx + mul r8 + add r11,rax + mov rax,r10 + adc r12,rdx + mul rdi + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov rbp,rcx + mul r10 + and rbp,r15 + mov 
r11,rax + mov rax,r13 + mov r12,rdx + mul r9 + add r11,rax + mov rax,r14 + adc r12,rdx + mul rdi + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov r8,rcx + mul r14 + and r8,r15 + mov r11,rax + mov rax,rdi + mov r12,rdx + mul rdi + mov rsi,rax + mov rdi,rdx + mov rax,r9 + add rsi,r11 + adc rdi,r12 + mul r10 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov r9,rsi + shrd rcx,rdi,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-120))+rsp] + mov r11,rax + shr rax,51 + add rax,QWORD[((-88))+rsp] + and r11,r15 + lea r10,[r11*1+r11] + mov rdi,rax + shr rax,51 + lea rsi,[rbp*1+rax] + lea rax,[r9*8+r9] + and rdi,r15 + lea r14,[rdi*1+rdi] + lea r13,[rax*2+r9] + mov rax,rdi + lea rbp,[r13*1+r13] + mul rbp + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r8 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-88))+rsp],rax + lea rax,[r8*8+r8] + lea rcx,[rax*2+r8] + mov rax,rcx + mul r8 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul r8 + mov QWORD[((-96))+rsp],rbx + mov QWORD[((-120))+rsp],rcx + mov rbx,QWORD[((-96))+rsp] + mov rcx,QWORD[((-104))+rsp] + mov r11,rax + mov rax,rdi + mov r12,rdx + mul rdi + add r11,rax + mov rax,rsi + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + mov rbp,rax + mov rax,r13 + mov r12,rdx + and rbp,r15 + mul r9 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov rdi,rcx + mul rsi + and rdi,r15 + mov r11,rax + mov rax,r14 + mov r12,rdx + mul r8 + add r11,rax + mov rax,r9 + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r9,rcx + shrd rcx,rbx,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov r11,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + and r11,r15 + lea r10,[r11*1+r11] + mov r8,rax + shr rax,51 + lea rsi,[rbp*1+rax] + lea rax,[r9*8+r9] + and r8,r15 + lea r14,[r8*1+r8] + lea r13,[rax*2+r9] + mov rax,r8 + lea rbp,[r13*1+r13] + mul rbp + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul rdi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-88))+rsp],rax + lea rax,[rdi*8+rdi] + lea rcx,[rax*2+rdi] + mov rax,rcx + mul rdi + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul rdi + mov QWORD[((-120))+rsp],rcx + mov QWORD[((-96))+rsp],rbx + mov rcx,QWORD[((-104))+rsp] + mov rbx,QWORD[((-96))+rsp] + mov r11,rax + mov rax,r8 + mov r12,rdx + mul r8 + add r11,rax + mov rax,rsi + adc r12,rdx + mul r10 + add r11,rax + mov rax,r13 + adc 
r12,rdx + shrd rcx,rbx,51 + shr rbx,51 + add r11,rcx + adc r12,rbx + mov rbp,r11 + mul r9 + and rbp,r15 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov r8,rcx + mul rsi + and r8,r15 + mov r11,rax + mov rax,r14 + mov r12,rdx + mul rdi + mov rsi,rax + mov rdi,rdx + mov rax,r9 + add rsi,r11 + adc rdi,r12 + mul r10 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov r9,rsi + shrd rcx,rdi,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov r11,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + and r11,r15 + lea r10,[r11*1+r11] + mov rdi,rax + shr rax,51 + lea rsi,[rbp*1+rax] + lea rax,[r9*8+r9] + and rdi,r15 + lea r14,[rdi*1+rdi] + lea r13,[rax*2+r9] + mov rax,rdi + lea rbp,[r13*1+r13] + mul rbp + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r8 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-88))+rsp],rax + lea rax,[r8*8+r8] + lea rcx,[rax*2+r8] + mov rax,rcx + mul r8 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul r8 + mov QWORD[((-120))+rsp],rcx + mov QWORD[((-96))+rsp],rbx + mov rcx,QWORD[((-104))+rsp] + mov rbx,QWORD[((-96))+rsp] + mov r11,rax + mov rax,rdi + mov r12,rdx + mul rdi + add r11,rax + mov rax,rsi + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + mov rbp,rax + mov rax,r13 + mov r12,rdx + and rbp,r15 + mul r9 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov rdi,rcx + mul rsi + and rdi,r15 + mov r11,rax + mov rax,r14 + mov r12,rdx + mul r8 + add r11,rax + mov rax,r9 + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r8,rcx + shrd rcx,rbx,51 + and r8,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov rcx,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + and rcx,r15 + mov r11,rax + shr rax,51 + and r11,r15 + lea rsi,[rbp*1+rax] + lea rax,[r8*8+r8] + lea r14,[r11*1+r11] + lea rbp,[rcx*1+rcx] + mov QWORD[((-120))+rsp],r14 + lea r14,[rax*2+r8] + mov rax,rcx + mul rcx + lea r13,[r14*1+r14] + mov rcx,rax + mov rax,r13 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul rdi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r9,rcx + and rax,r15 + mov r10,rbx + mov QWORD[((-104))+rsp],rax + lea rax,[rdi*8+rdi] + lea rcx,[rax*2+rdi] + mov rax,rcx + mul rdi + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mov r9,rcx + and rcx,r15 + mul r11 + mov QWORD[((-88))+rsp],rcx + mov rcx,r9 + mov r11,rax + mov rax,r13 + mov r12,rdx + mul rdi + add 
r11,rax + mov rax,rbp + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rdi + adc rbx,rdx + mov r13,rcx + mul rbp + and r13,r15 + mov r9,rax + mov rax,r14 + mov r10,rdx + mul r8 + mov r14,QWORD[((-120))+rsp] + add r9,rax + mov rax,r14 + adc r10,rdx + mul rsi + add rax,r9 + adc rdx,r10 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r14 + mov r14,QWORD[136+rsp] + adc rbx,rdx + mov r9,rcx + mul rdi + and r9,r15 + mov r11,rax + mov rax,rsi + mov r12,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,rbp + add rsi,r11 + adc rdi,r12 + mul r8 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rsi + adc rbx,rdi + mov r8,rcx + shrd rcx,rbx,51 + and r8,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-104))+rsp] + mov r10,rax + shr rax,51 + add rax,QWORD[((-88))+rsp] + and r10,r15 + mov r11,rax + shr rax,51 + lea rbp,[r13*1+rax] + mov r13,QWORD[80+rsp] + and r11,r15 + lea rax,[r13*8+r13] + lea r12,[rax*2+r13] + lea rax,[r14*8+r14] + lea rbx,[rax*2+r14] + mov rax,rbx + mul r8 + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[192+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[216+rsp] + adc rbx,rdx + mul rbp + mov rsi,rax + mov rdi,rdx + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,r12 + mul r8 + mov rcx,rax + mov rax,QWORD[216+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[192+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[192+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r12,rsi + mul r9 + and r12,r15 + mov rcx,rax + mov rax,QWORD[216+rsp] + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mul r9 + and rcx,r15 + mov QWORD[((-104))+rsp],rcx + mov rcx,rax + mov rax,QWORD[192+rsp] + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[24+rsp] + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,rsi + adc rdx,rbx + mov rbx,rdi + shrd rcx,rdi,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mov rsi,rcx + and rsi,r15 + mul r8 + mov r13,rax + mov rax,QWORD[136+rsp] + mov r14,rdx + mul r9 + add r13,rax + mov rax,QWORD[224+rsp] + adc r14,rdx + mul r10 + add r13,rax + mov rax,QWORD[24+rsp] + adc r14,rdx + mul r11 + add r13,rax + mov rax,QWORD[80+rsp] + adc r14,rdx + mul rbp + add rax,r13 + adc rdx,r14 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rdi,rax + shrd rax,rdx,51 + and rdi,r15 + lea rdx,[rax*8+rax] + lea rax,[rdx*2+rax] + add rax,QWORD[((-120))+rsp] + mov rbp,rax + shr rax,51 + add r12,rax + lea rax,[rdi*8+rdi] + and rbp,r15 + mov r9,r12 + shr r12,51 + lea r10,[rcx*1+r12] + lea r8,[rax*2+rdi] + and r9,r15 + lea rax,[r10*8+r10] + lea r14,[rax*2+r10] + lea rax,[rsi*8+rsi] + lea r13,[rax*2+rsi] + mov rax,QWORD[((-56))+rsp] + mul r8 + mov rcx,rax + mov rax,QWORD[168+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mul 
rbp + add rcx,rax + lea rax,[r9*8+r9] + adc rbx,rdx + lea rax,[rax*2+r9] + mul QWORD[((-32))+rsp] + add rcx,rax + mov rax,QWORD[184+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[168+rsp] + mul r8 + mov rcx,rax + mov rax,QWORD[184+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[((-56))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul QWORD[((-32))+rsp] + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[184+rsp] + adc rbx,rdx + mov r14,rcx + and r14,r15 + mul r8 + mov r11,rax + mov r12,rdx + mov rax,r13 + mul QWORD[((-32))+rsp] + add r11,rax + mov rax,QWORD[168+rsp] + adc r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[((-56))+rsp] + adc r12,rdx + mul r9 + add r11,rax + mov rax,QWORD[72+rsp] + mov QWORD[((-96))+rsp],0 + adc r12,rdx + mov QWORD[((-112))+rsp],0 + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mov r13,rcx + and r13,r15 + mul rsi + mov r11,rax + mov rax,r8 + mov r8,QWORD[((-32))+rsp] + mov r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[184+rsp] + adc r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[168+rsp] + adc r12,rdx + mul r9 + add r11,rax + mov rax,QWORD[((-56))+rsp] + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mul rdi + mov r11,rax + mov rax,QWORD[((-56))+rsp] + mov r12,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r11 + adc rdi,r12 + mul rbp + add rsi,rax + mov rax,QWORD[184+rsp] + adc rdi,rdx + mul r9 + add rsi,rax + mov rax,QWORD[168+rsp] + adc rdi,rdx + mul r10 + add rsi,rax + mov rax,rcx + adc rdi,rdx + mov rdx,rbx + shrd rax,rbx,51 + shr rdx,51 + add rax,rsi + adc rdx,rdi + mov rsi,rax + mov QWORD[((-104))+rsp],rax + shrd rsi,rdx,51 + mov rax,rcx + xor r10d,r10d + lea rdx,[rsi*8+rsi] + and rax,r15 + mov r9,rax + mov rax,QWORD[((-104))+rsp] + lea rbp,[rdx*2+rsi] + add rbp,QWORD[((-120))+rsp] + xor edx,edx + mov rbx,rdx + shr rbx,51 + mov rdi,rbp + shr rdi,51 + add rdi,r14 + mov r8,rdi + shr r8,51 + add r8,r13 + mov rcx,r8 + shrd rcx,rdx,51 + add r9,rcx + adc r10,rbx + and rax,r15 + xor r12d,r12d + mov r11,rax + mov rax,r9 + mov rdx,r10 + shrd rax,r10,51 + shr rdx,51 + add r11,rax + mov eax,19 + adc r12,rdx + mov rcx,r11 + and rbp,r15 + shrd rcx,r12,51 + mov rbx,r12 + mul rcx + shr rbx,51 + imul rsi,rbx,19 + mov rcx,rax + mov rbx,rdx + xor edx,edx + add rbx,rsi + add rcx,rbp + adc rbx,rdx + mov rsi,rcx + and rdi,r15 + shrd rsi,rbx,51 + mov rax,rdi + mov rdi,rbx + xor edx,edx + shr rdi,51 + add rsi,rax + mov rbx,rcx + adc rdi,rdx + and r8,r15 + xor edx,edx + mov rbp,rdi + mov rdi,rsi + mov QWORD[((-120))+rsp],rsi + shrd rdi,rbp,51 + shr rbp,51 + mov rsi,r9 + add rdi,r8 + adc rbp,rdx + mov r8,rdi + and rsi,r15 + shrd r8,rbp,51 + mov r9,rbp + xor edx,edx + shr r9,51 + add r8,rsi + mov rsi,r11 + adc r9,rdx + and rsi,r15 + xor edx,edx + mov r10,r9 + mov r9,r8 + mov QWORD[((-104))+rsp],r8 + shrd r9,r10,51 + shr r10,51 + add r9,rsi + mov esi,19 + mov r8,r9 + adc r10,rdx + and rbx,r15 + mov rax,r8 + mov rdx,r10 + mov r11,rbx + shrd rax,r10,51 + shr rdx,51 + xor r12d,r12d + mov rbx,rdi + imul r10,rdx,19 + mul rsi + add rdx,r10 + add r11,19 + adc r12,0 + add r11,rax + mov rax,QWORD[((-120))+rsp] + adc r12,rdx + xor r14d,r14d 
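+ ; final reduction mod 2^255-19: the chain below folds carries through the five 51-bit limbs (r15 holds the mask 2^51-1; overflow above bit 255 wraps around multiplied by 19, and 2^51-19 is added for the final freeze) before the limbs are packed into the 32-byte little-endian result.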
+ mov rdx,r12 + and rax,r15 + shr rdx,51 + mov r13,rax + mov rax,r11 + shrd rax,r12,51 + add r13,rax + mov rax,QWORD[((-104))+rsp] + adc r14,rdx + mov rsi,r13 + and rbx,r15 + shrd rsi,r14,51 + mov rdi,r14 + xor edx,edx + shr rdi,51 + add rsi,rbx + adc rdi,rdx + and rax,r15 + xor ebx,ebx + mov rcx,rax + mov rax,rsi + mov rdx,rdi + shrd rax,rdi,51 + shr rdx,51 + mov rdi,r8 + mov QWORD[((-120))+rsp],rsi + add rcx,rax + mov esi,19 + adc rbx,rdx + mov r8,rcx + and rdi,r15 + shrd r8,rbx,51 + mov r9,rbx + xor edx,edx + mov rbx,rcx + shr r9,51 + add r8,rdi + adc r9,rdx + mov rax,r8 + xor r12d,r12d + shrd rax,r9,51 + mov rdx,r9 + mov QWORD[((-80))+rsp],r9 + xor r10d,r10d + shr rdx,51 + mov QWORD[((-88))+rsp],r8 + imul rdi,rdx,19 + mul rsi + add rdx,rdi + mov rdi,r11 + mov r11,2251799813685229 + and rdi,r15 + mov r9,rdi + mov rdi,r13 + add r9,r11 + adc r10,r12 + add rax,r9 + mov r9,2251799813685247 + adc rdx,r10 + and rdi,r15 + xor r10d,r10d + mov r11,rdi + xor r12d,r12d + mov r13,rax + add r11,r9 + mov rdi,QWORD[((-120))+rsp] + mov r14,rdx + adc r12,r10 + shr r14,51 + shrd r13,rdx,51 + add r13,r11 + adc r14,r12 + and rdi,r15 + xor r12d,r12d + mov rsi,rdi + mov r11,r13 + mov rdi,r12 + add rsi,r9 + mov r12,r14 + adc rdi,r10 + shr r12,51 + shrd r11,r14,51 + add r11,rsi + adc r12,rdi + and rbx,r15 + xor r14d,r14d + mov rdi,r13 + mov rcx,rbx + mov r13,r11 + and rdi,r15 + mov rbx,r14 + add rcx,r9 + adc rbx,r10 + mov r14,r12 + mov QWORD[((-104))+rsp],rdi + shrd r13,r12,51 + mov rsi,QWORD[((-104))+rsp] + shr r14,51 + add r13,rcx + adc r14,rbx + mov rbx,r11 + mov rdi,r13 + and rbx,r15 + mov rdx,rsi + and rax,r15 + mov r13,rbx + mov rbx,rdi + sal rdx,51 + and rbx,r15 + or rax,rdx + mov r11,rdi + mov QWORD[((-120))+rsp],rbx + mov rbx,QWORD[352+rsp] + mov rdx,rax + shr rdx,8 + mov rdi,QWORD[((-96))+rsp] + mov rbp,r14 + mov rcx,r13 + xor r14d,r14d + mov BYTE[1+rbx],dl + mov rdx,rax + mov BYTE[rbx],al + shr rdx,16 + mov BYTE[2+rbx],dl + mov rdx,rax + shr rdx,24 + mov BYTE[3+rbx],dl + mov rdx,rax + shr rdx,32 + mov BYTE[4+rbx],dl + mov rdx,rax + mov r12,QWORD[((-112))+rsp] + shr rdx,40 + mov BYTE[5+rbx],dl + mov rdx,rax + shr rax,56 + mov BYTE[7+rbx],al + mov rax,r13 + shr rdx,48 + shrd rsi,rdi,13 + sal rax,38 + mov rdi,rbx + mov BYTE[6+rbx],dl + or rsi,rax + mov r13,r11 + mov r11,QWORD[((-120))+rsp] + mov rax,rsi + mov BYTE[8+rbx],sil + shr rax,8 + mov BYTE[9+rbx],al + mov rax,rsi + shr rax,16 + mov BYTE[10+rbx],al + mov rax,rsi + shr rax,24 + mov BYTE[11+rbx],al + mov rax,rsi + shr rax,32 + mov BYTE[12+rbx],al + mov rax,rsi + shr rax,40 + mov BYTE[13+rbx],al + mov rax,rsi + shr rsi,56 + shr rax,48 + mov BYTE[15+rbx],sil + mov BYTE[14+rbx],al + mov rax,QWORD[((-120))+rsp] + mov rbx,rdi + shrd rcx,r14,26 + sal rax,25 + or rcx,rax + mov rax,rcx + mov BYTE[16+rdi],cl + shr rax,8 + mov BYTE[17+rdi],al + mov rax,rcx + shr rax,16 + mov BYTE[18+rdi],al + mov rax,rcx + shr rax,24 + mov BYTE[19+rdi],al + mov rax,rcx + shr rax,32 + mov BYTE[20+rdi],al + mov rax,rcx + shr rax,40 + mov BYTE[21+rdi],al + mov rax,rcx + shr rax,48 + shr rcx,56 + mov BYTE[22+rdi],al + mov BYTE[23+rdi],cl + mov rdi,QWORD[((-88))+rsp] + and rdi,r15 + mov rax,rdi + add rax,r9 + shrd r13,rbp,51 + add rax,r13 + and rax,r15 + shrd r11,r12,39 + sal rax,12 + or rax,r11 + mov rdx,rax + mov BYTE[24+rbx],al + shr rdx,8 + mov BYTE[25+rbx],dl + mov rdx,rax + shr rdx,16 + mov BYTE[26+rbx],dl + mov rdx,rax + shr rdx,24 + mov BYTE[27+rbx],dl + mov rdx,rax + shr rdx,32 + mov BYTE[28+rbx],dl + mov rdx,rax + shr rdx,40 + mov BYTE[29+rbx],dl + mov rdx,rax + shr 
rax,56 + shr rdx,48 + mov BYTE[31+rbx],al + xor eax,eax + mov BYTE[30+rbx],dl + add rsp,784 + + + pop rdi + pop rsi + pop rbx + pop rbp + pop r12 + pop r13 + pop r14 + pop r15 + ret + +$L$FE13: + + diff --git a/crypto/make_all_asm_files.sh b/crypto/make_all_asm_files.sh new file mode 100644 index 0000000..fcbed3d --- /dev/null +++ b/crypto/make_all_asm_files.sh @@ -0,0 +1,28 @@ +#!/bin/sh + +set -e + +# macos +perl make_chacha20_x64.pl macosx > chacha20_x64_gas_macosx.s +perl make_poly1305_x64.pl macosx > poly1305_x64_gas_macosx.s + +cd aesgcm + +perl aesni-gcm-x86_64.pl macosx > aesni_gcm_x64_gas_macosx.s +perl aesni-x86_64.pl macosx > aesni_x64_gas_macosx.s +perl ghash-x86_64.pl macosx > ghash_x64_gas_macosx.s + +cd .. + + +# linux,freebsd +perl make_chacha20_x64.pl gas > chacha20_x64_gas.s +perl make_poly1305_x64.pl gas > poly1305_x64_gas.s + +cd aesgcm + +perl aesni-gcm-x86_64.pl gas > aesni_gcm_x64_gas.s +perl aesni-x86_64.pl gas > aesni_x64_gas.s +perl ghash-x86_64.pl gas > ghash_x64_gas.s + +cd .. diff --git a/crypto/make_chacha20_x64.pl b/crypto/make_chacha20_x64.pl new file mode 100644 index 0000000..f9379ca --- /dev/null +++ b/crypto/make_chacha20_x64.pl @@ -0,0 +1,3665 @@ +#! /usr/bin/env perl +# Copyright 2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# November 2014 +# +# chacha20 for x86_64. +# +# December 2016 +# +# Add AVX512F code path. +# +# December 2017 +# +# Add AVX512VL code path. +# +# Performance in cycles per byte out of large buffer. 
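+# (cycles per byte, lower is better; "IALU" denotes the scalar integer-only code path)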
+# +# IALU/gcc 4.8(i) 1xSSSE3/SSE2 4xSSSE3 NxAVX(v) +# +# P4 9.48/+99% -/22.7(ii) - +# Core2 7.83/+55% 7.90/8.08 4.35 +# Westmere 7.19/+50% 5.60/6.70 3.00 +# Sandy Bridge 8.31/+42% 5.45/6.76 2.72 +# Ivy Bridge 6.71/+46% 5.40/6.49 2.41 +# Haswell 5.92/+43% 5.20/6.45 2.42 1.23 +# Skylake[-X] 5.87/+39% 4.70/- 2.31 1.19[0.80(vi)] +# Silvermont 12.0/+33% 7.75/7.40 7.03(iii) +# Knights L 11.7/- - 9.60(iii) 0.80 +# Goldmont 10.6/+17% 5.10/- 3.28 +# Sledgehammer 7.28/+52% -/14.2(ii) - +# Bulldozer 9.66/+28% 9.85/11.1 3.06(iv) +# Ryzen 5.96/+50% 5.19/- 2.40 2.09 +# VIA Nano 10.5/+46% 6.72/8.60 6.05 +# +# (i) compared to older gcc 3.x one can observe >2x improvement on +# most platforms; +# (ii) as it can be seen, SSE2 performance is too low on legacy +# processors; NxSSE2 results are naturally better, but not +# impressively better than IALU ones, which is why you won't +# find SSE2 code below; +# (iii) this is not optimal result for Atom because of MSROM +# limitations, SSE2 can do better, but gain is considered too +# low to justify the [maintenance] effort; +# (iv) Bulldozer actually executes 4xXOP code path that delivers 2.20; +# (v) 8xAVX2, 8xAVX512VL or 16xAVX512F, whichever best applicable; +# (vi) even though Skylake-X can execute AVX512F code and deliver 0.57 +# cpb in single thread, the corresponding capability is suppressed; + +$flavour = shift; +$output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/); + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or +die "can't locate x86_64-xlate.pl"; + +$avx = 3; +$avx = 2 if ($flavour =~ /macosx/); + +open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""; +*STDOUT=*OUT; + +# input parameter block +($out,$inp,$len,$key,$counter)=("%rdi","%rsi","%rdx","%rcx","%r8"); + +$code.=<<___; +.text + +.align 64 +.Lzero: +.long 0,0,0,0 +.Lone: +.long 1,0,0,0 +.Linc: +.long 0,1,2,3 +.Lfour: +.long 4,4,4,4 +.Lincy: +.long 0,2,4,6,1,3,5,7 +.Leight: +.long 8,8,8,8,8,8,8,8 +.Lrot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +.Lrot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +.Lsigma: +.asciz "expand 32-byte k" +.align 64 +.Lzeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +.Lfourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +.Lincz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.Lsixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.align 64 +.Ltwoy: +.long 2,0,0,0, 2,0,0,0 +___ + +sub AUTOLOAD() # thunk [simplified] 32-bit style perlasm +{ my $opcode = $AUTOLOAD; $opcode =~ s/.*:://; + my $arg = pop; + $arg = "\$$arg" if ($arg*1 eq $arg); + $code .= "\t$opcode\t".join(',',$arg,reverse @_)."\n"; +} + +@x=("%eax","%ebx","%ecx","%edx",map("%r${_}d",(8..11)), + "%nox","%nox","%nox","%nox",map("%r${_}d",(12..15))); +@t=("%esi","%edi"); + +sub ROUND { # critical path is 24 cycles per round +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my ($xc,$xc_)=map("\"$_\"",@t); +my @x=map("\"$_\"",@x); + + # Consider order in which variables are addressed by their + # index: + # + # a b c d + # + # 0 4 8 12 < even round + # 1 5 9 13 + # 2 6 10 14 + # 3 7 11 15 + # 0 5 10 15 < odd round + # 1 6 11 12 + # 2 7 8 13 + # 3 4 9 14 
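+ # (even rounds thus operate on the columns of the 4x4 state matrix, odd rounds on its diagonals)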
+ # + 'a', 'b' and 'd's are permanently allocated in registers, + # @x[0..7,12..15], while 'c's are maintained in memory. If + # you observe 'c' column, you'll notice that pair of 'c's is + # invariant between rounds. This means that we have to reload + # them once per round, in the middle. This is why you'll see + # bunch of 'c' stores and loads in the middle, but none in + # the beginning or end. + + # Normally instructions would be interleaved to favour in-order + # execution. Generally out-of-order cores manage it gracefully, + # but not this time for some reason. As in-order execution + # cores are dying breed, old Atom is the only one around, + # instructions are left uninterleaved. Besides, Atom is better + # off executing 1xSSSE3 code anyway... + + ( + "&add (@x[$a0],@x[$b0])", # Q1 + "&xor (@x[$d0],@x[$a0])", + "&rol (@x[$d0],16)", + "&add (@x[$a1],@x[$b1])", # Q2 + "&xor (@x[$d1],@x[$a1])", + "&rol (@x[$d1],16)", + + "&add ($xc,@x[$d0])", + "&xor (@x[$b0],$xc)", + "&rol (@x[$b0],12)", + "&add ($xc_,@x[$d1])", + "&xor (@x[$b1],$xc_)", + "&rol (@x[$b1],12)", + + "&add (@x[$a0],@x[$b0])", + "&xor (@x[$d0],@x[$a0])", + "&rol (@x[$d0],8)", + "&add (@x[$a1],@x[$b1])", + "&xor (@x[$d1],@x[$a1])", + "&rol (@x[$d1],8)", + + "&add ($xc,@x[$d0])", + "&xor (@x[$b0],$xc)", + "&rol (@x[$b0],7)", + "&add ($xc_,@x[$d1])", + "&xor (@x[$b1],$xc_)", + "&rol (@x[$b1],7)", + + "&mov (\"4*$c0(%rsp)\",$xc)", # reload pair of 'c's + "&mov (\"4*$c1(%rsp)\",$xc_)", + "&mov ($xc,\"4*$c2(%rsp)\")", + "&mov ($xc_,\"4*$c3(%rsp)\")", + + "&add (@x[$a2],@x[$b2])", # Q3 + "&xor (@x[$d2],@x[$a2])", + "&rol (@x[$d2],16)", + "&add (@x[$a3],@x[$b3])", # Q4 + "&xor (@x[$d3],@x[$a3])", + "&rol (@x[$d3],16)", + + "&add ($xc,@x[$d2])", + "&xor (@x[$b2],$xc)", + "&rol (@x[$b2],12)", + "&add ($xc_,@x[$d3])", + "&xor (@x[$b3],$xc_)", + "&rol (@x[$b3],12)", + + "&add (@x[$a2],@x[$b2])", + "&xor (@x[$d2],@x[$a2])", + "&rol (@x[$d2],8)", + "&add (@x[$a3],@x[$b3])", + "&xor (@x[$d3],@x[$a3])", + "&rol (@x[$d3],8)", + + "&add ($xc,@x[$d2])", + "&xor (@x[$b2],$xc)", + "&rol (@x[$b2],7)", + "&add ($xc_,@x[$d3])", + "&xor (@x[$b3],$xc_)", + "&rol (@x[$b3],7)" + ); +} +######################################################################## +# HCHACHA20_SSSE3 +$code.=<<___; + +.global hchacha20_ssse3 +.type hchacha20_ssse3,\@function,5 +.align 32 +hchacha20_ssse3: +.cfi_startproc +.Lhchacha20_ssse3: + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + movq \$10,%r8 + .align 32 +.Loop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$20,%xmm1 + pslld \$12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$25,%xmm1 + pslld \$7,%xmm4 + por %xmm4,%xmm1 + pshufd \$78,%xmm2,%xmm2 + pshufd \$57,%xmm1,%xmm1 + pshufd \$147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$20,%xmm1 + pslld \$12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$25,%xmm1 + pslld \$7,%xmm4 + por %xmm4,%xmm1 + pshufd \$78,%xmm2,%xmm2 + pshufd \$147,%xmm1,%xmm1 + pshufd \$57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret +.cfi_endproc +.size 
hchacha20_ssse3,.-hchacha20_ssse3 +___ + + +######################################################################## +# SSSE3 code path that handles shorter lengths +{ +my ($a,$b,$c,$d,$t,$t1,$rot16,$rot24)=map("%xmm$_",(0..7)); + +sub SSSE3ROUND { # critical path is 20 "SIMD ticks" per round + &paddd ($a,$b); + &pxor ($d,$a); + &pshufb ($d,$rot16); + + &paddd ($c,$d); + &pxor ($b,$c); + &movdqa ($t,$b); + &psrld ($b,20); + &pslld ($t,12); + &por ($b,$t); + + &paddd ($a,$b); + &pxor ($d,$a); + &pshufb ($d,$rot24); + + &paddd ($c,$d); + &pxor ($b,$c); + &movdqa ($t,$b); + &psrld ($b,25); + &pslld ($t,7); + &por ($b,$t); +} + +my $xframe = $win64 ? 32+8 : 8; + +$code.=<<___; +.global chacha20_ssse3 +.type chacha20_ssse3,\@function,5 +.align 32 +chacha20_ssse3: +.cfi_startproc +.Lchacha20_ssse3: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 +___ +#$code.=<<___ if ($avx); +# test \$`1<<(43-32)`,%r10d +# jnz .Lchacha20_4xop # XOP is fastest even if we use 1/4 +#___ +$code.=<<___; + cmp \$128,$len # we might throw away some data, + ja .Lchacha20_4x # but overall it won't be slower + +.Ldo_sse3_after_all: + sub \$64+$xframe,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0x28(%r9) + movaps %xmm7,-0x18(%r9) +.Lssse3_body: +___ +$code.=<<___; + movdqa .Lsigma(%rip),$a + movdqu ($key),$b + movdqu 16($key),$c + movdqu ($counter),$d + movdqa .Lrot16(%rip),$rot16 + movdqa .Lrot24(%rip),$rot24 + + movdqa $a,0x00(%rsp) + movdqa $b,0x10(%rsp) + movdqa $c,0x20(%rsp) + movdqa $d,0x30(%rsp) + mov \$10,$counter # reuse $counter + jmp .Loop_ssse3 + +.align 32 +.Loop_outer_ssse3: + movdqa .Lone(%rip),$d + movdqa 0x00(%rsp),$a + movdqa 0x10(%rsp),$b + movdqa 0x20(%rsp),$c + paddd 0x30(%rsp),$d + mov \$10,$counter + movdqa $d,0x30(%rsp) + jmp .Loop_ssse3 + +.align 32 +.Loop_ssse3: +___ + &SSSE3ROUND(); + &pshufd ($c,$c,0b01001110); + &pshufd ($b,$b,0b00111001); + &pshufd ($d,$d,0b10010011); + &nop (); + + &SSSE3ROUND(); + &pshufd ($c,$c,0b01001110); + &pshufd ($b,$b,0b10010011); + &pshufd ($d,$d,0b00111001); + + &dec ($counter); + &jnz (".Loop_ssse3"); + +$code.=<<___; + paddd 0x00(%rsp),$a + paddd 0x10(%rsp),$b + paddd 0x20(%rsp),$c + paddd 0x30(%rsp),$d + + cmp \$64,$len + jb .Ltail_ssse3 + + movdqu 0x00($inp),$t + movdqu 0x10($inp),$t1 + pxor $t,$a # xor with input + movdqu 0x20($inp),$t + pxor $t1,$b + movdqu 0x30($inp),$t1 + lea 0x40($inp),$inp # inp+=64 + pxor $t,$c + pxor $t1,$d + + movdqu $a,0x00($out) # write output + movdqu $b,0x10($out) + movdqu $c,0x20($out) + movdqu $d,0x30($out) + lea 0x40($out),$out # out+=64 + + sub \$64,$len + jnz .Loop_outer_ssse3 + + jmp .Ldone_ssse3 + +.align 16 +.Ltail_ssse3: + movdqa $a,0x00(%rsp) + movdqa $b,0x10(%rsp) + movdqa $c,0x20(%rsp) + movdqa $d,0x30(%rsp) + xor $counter,$counter + +.Loop_tail_ssse3: + movzb ($inp,$counter),%eax + movzb (%rsp,$counter),%ecx + lea 1($counter),$counter + xor %ecx,%eax + mov %al,-1($out,$counter) + dec $len + jnz .Loop_tail_ssse3 + +.Ldone_ssse3: +___ +$code.=<<___ if ($win64); + movaps -0x28(%r9),%xmm6 + movaps -0x18(%r9),%xmm7 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lssse3_epilogue: + ret +.cfi_endproc +.size chacha20_ssse3,.-chacha20_ssse3 +___ +} + +######################################################################## +# SSSE3 code path that handles longer messages. 
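+# Four 64-byte blocks are processed in parallel: state word x[i] of every block occupies one dword lane of an xmm register (key and counter material is smashed by lanes), and .Linc/.Lfour step all four block counters together.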
+{ +# assign variables to favor Atom front-end +my ($xd0,$xd1,$xd2,$xd3, $xt0,$xt1,$xt2,$xt3, + $xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3)=map("%xmm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + "%nox","%nox","%nox","%nox", $xd0,$xd1,$xd2,$xd3); + +sub SSSE3_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my ($xc,$xc_,$t0,$t1)=map("\"$_\"",$xt0,$xt1,$xt2,$xt3); +my @x=map("\"$_\"",@xx); + + # Consider order in which variables are addressed by their + # index: + # + # a b c d + # + # 0 4 8 12 < even round + # 1 5 9 13 + # 2 6 10 14 + # 3 7 11 15 + # 0 5 10 15 < odd round + # 1 6 11 12 + # 2 7 8 13 + # 3 4 9 14 + # + # 'a', 'b' and 'd's are permanently allocated in registers, + # @x[0..7,12..15], while 'c's are maintained in memory. If + # you observe 'c' column, you'll notice that pair of 'c's is + # invariant between rounds. This means that we have to reload + # them once per round, in the middle. This is why you'll see + # bunch of 'c' stores and loads in the middle, but none in + # the beginning or end. + + ( + "&paddd (@x[$a0],@x[$b0])", # Q1 + "&paddd (@x[$a1],@x[$b1])", # Q2 + "&pxor (@x[$d0],@x[$a0])", + "&pxor (@x[$d1],@x[$a1])", + "&pshufb (@x[$d0],$t1)", + "&pshufb (@x[$d1],$t1)", + + "&paddd ($xc,@x[$d0])", + "&paddd ($xc_,@x[$d1])", + "&pxor (@x[$b0],$xc)", + "&pxor (@x[$b1],$xc_)", + "&movdqa ($t0,@x[$b0])", + "&pslld (@x[$b0],12)", + "&psrld ($t0,20)", + "&movdqa ($t1,@x[$b1])", + "&pslld (@x[$b1],12)", + "&por (@x[$b0],$t0)", + "&psrld ($t1,20)", + "&movdqa ($t0,'(%r11)')", # .Lrot24(%rip) + "&por (@x[$b1],$t1)", + + "&paddd (@x[$a0],@x[$b0])", + "&paddd (@x[$a1],@x[$b1])", + "&pxor (@x[$d0],@x[$a0])", + "&pxor (@x[$d1],@x[$a1])", + "&pshufb (@x[$d0],$t0)", + "&pshufb (@x[$d1],$t0)", + + "&paddd ($xc,@x[$d0])", + "&paddd ($xc_,@x[$d1])", + "&pxor (@x[$b0],$xc)", + "&pxor (@x[$b1],$xc_)", + "&movdqa ($t1,@x[$b0])", + "&pslld (@x[$b0],7)", + "&psrld ($t1,25)", + "&movdqa ($t0,@x[$b1])", + "&pslld (@x[$b1],7)", + "&por (@x[$b0],$t1)", + "&psrld ($t0,25)", + "&movdqa ($t1,'(%r10)')", # .Lrot16(%rip) + "&por (@x[$b1],$t0)", + + "&movdqa (\"`16*($c0-8)`(%rsp)\",$xc)", # reload pair of 'c's + "&movdqa (\"`16*($c1-8)`(%rsp)\",$xc_)", + "&movdqa ($xc,\"`16*($c2-8)`(%rsp)\")", + "&movdqa ($xc_,\"`16*($c3-8)`(%rsp)\")", + + "&paddd (@x[$a2],@x[$b2])", # Q3 + "&paddd (@x[$a3],@x[$b3])", # Q4 + "&pxor (@x[$d2],@x[$a2])", + "&pxor (@x[$d3],@x[$a3])", + "&pshufb (@x[$d2],$t1)", + "&pshufb (@x[$d3],$t1)", + + "&paddd ($xc,@x[$d2])", + "&paddd ($xc_,@x[$d3])", + "&pxor (@x[$b2],$xc)", + "&pxor (@x[$b3],$xc_)", + "&movdqa ($t0,@x[$b2])", + "&pslld (@x[$b2],12)", + "&psrld ($t0,20)", + "&movdqa ($t1,@x[$b3])", + "&pslld (@x[$b3],12)", + "&por (@x[$b2],$t0)", + "&psrld ($t1,20)", + "&movdqa ($t0,'(%r11)')", # .Lrot24(%rip) + "&por (@x[$b3],$t1)", + + "&paddd (@x[$a2],@x[$b2])", + "&paddd (@x[$a3],@x[$b3])", + "&pxor (@x[$d2],@x[$a2])", + "&pxor (@x[$d3],@x[$a3])", + "&pshufb (@x[$d2],$t0)", + "&pshufb (@x[$d3],$t0)", + + "&paddd ($xc,@x[$d2])", + "&paddd ($xc_,@x[$d3])", + "&pxor (@x[$b2],$xc)", + "&pxor (@x[$b3],$xc_)", + "&movdqa ($t1,@x[$b2])", + "&pslld (@x[$b2],7)", + "&psrld ($t1,25)", + "&movdqa ($t0,@x[$b3])", + "&pslld (@x[$b3],7)", + "&por (@x[$b2],$t1)", + "&psrld ($t0,25)", + "&movdqa ($t1,'(%r10)')", # .Lrot16(%rip) + "&por (@x[$b3],$t0)" + ); +} + +my $xframe = $win64 ? 
0xa8 : 8; + +$code.=<<___; +.global chacha20_4x +.type chacha20_4x,\@function,5 +.align 32 +chacha20_4x: +.cfi_startproc +.Lchacha20_4x: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 +# mov %r10,%r11 +___ +$code.=<<___ if ($avx>1); +# shr \$32,%r10 # OPENSSL_ia32cap_P+8 +# test \$`1<<5`,%r10 # test AVX2 +# jnz .Lchacha20_avx2 +___ +$code.=<<___; +# cmp \$192,$len +# ja .Lproceed4x + +# and \$`1<<26|1<<22`,%r11 # isolate XSAVE+MOVBE +# cmp \$`1<<22`,%r11 # check for MOVBE without XSAVE +# je .Ldo_sse3_after_all # to detect Atom + +.Lproceed4x: + sub \$0x140+$xframe,%rsp +___ + ################ stack layout + # +0x00 SIMD equivalent of @x[8-12] + # ... + # +0x40 constant copy of key[0-2] smashed by lanes + # ... + # +0x100 SIMD counters (with nonce smashed by lanes) + # ... + # +0x140 +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L4x_body: +___ +$code.=<<___; + movdqa .Lsigma(%rip),$xa3 # key[0] + movdqu ($key),$xb3 # key[1] + movdqu 16($key),$xt3 # key[2] + movdqu ($counter),$xd3 # key[3] + lea 0x100(%rsp),%rcx # size optimization + lea .Lrot16(%rip),%r10 + lea .Lrot24(%rip),%r11 + + pshufd \$0x00,$xa3,$xa0 # smash key by lanes... + pshufd \$0x55,$xa3,$xa1 + movdqa $xa0,0x40(%rsp) # ... and offload + pshufd \$0xaa,$xa3,$xa2 + movdqa $xa1,0x50(%rsp) + pshufd \$0xff,$xa3,$xa3 + movdqa $xa2,0x60(%rsp) + movdqa $xa3,0x70(%rsp) + + pshufd \$0x00,$xb3,$xb0 + pshufd \$0x55,$xb3,$xb1 + movdqa $xb0,0x80-0x100(%rcx) + pshufd \$0xaa,$xb3,$xb2 + movdqa $xb1,0x90-0x100(%rcx) + pshufd \$0xff,$xb3,$xb3 + movdqa $xb2,0xa0-0x100(%rcx) + movdqa $xb3,0xb0-0x100(%rcx) + + pshufd \$0x00,$xt3,$xt0 # "$xc0" + pshufd \$0x55,$xt3,$xt1 # "$xc1" + movdqa $xt0,0xc0-0x100(%rcx) + pshufd \$0xaa,$xt3,$xt2 # "$xc2" + movdqa $xt1,0xd0-0x100(%rcx) + pshufd \$0xff,$xt3,$xt3 # "$xc3" + movdqa $xt2,0xe0-0x100(%rcx) + movdqa $xt3,0xf0-0x100(%rcx) + + pshufd \$0x00,$xd3,$xd0 + pshufd \$0x55,$xd3,$xd1 + paddd .Linc(%rip),$xd0 # don't save counters yet + pshufd \$0xaa,$xd3,$xd2 + movdqa $xd1,0x110-0x100(%rcx) + pshufd \$0xff,$xd3,$xd3 + movdqa $xd2,0x120-0x100(%rcx) + movdqa $xd3,0x130-0x100(%rcx) + + jmp .Loop_enter4x + +.align 32 +.Loop_outer4x: + movdqa 0x40(%rsp),$xa0 # re-load smashed key + movdqa 0x50(%rsp),$xa1 + movdqa 0x60(%rsp),$xa2 + movdqa 0x70(%rsp),$xa3 + movdqa 0x80-0x100(%rcx),$xb0 + movdqa 0x90-0x100(%rcx),$xb1 + movdqa 0xa0-0x100(%rcx),$xb2 + movdqa 0xb0-0x100(%rcx),$xb3 + movdqa 0xc0-0x100(%rcx),$xt0 # "$xc0" + movdqa 0xd0-0x100(%rcx),$xt1 # "$xc1" + movdqa 0xe0-0x100(%rcx),$xt2 # "$xc2" + movdqa 0xf0-0x100(%rcx),$xt3 # "$xc3" + movdqa 0x100-0x100(%rcx),$xd0 + movdqa 0x110-0x100(%rcx),$xd1 + movdqa 0x120-0x100(%rcx),$xd2 + movdqa 0x130-0x100(%rcx),$xd3 + paddd .Lfour(%rip),$xd0 # next SIMD counters + +.Loop_enter4x: + movdqa $xt2,0x20(%rsp) # SIMD equivalent of "@x[10]" + movdqa $xt3,0x30(%rsp) # SIMD equivalent of "@x[11]" + movdqa (%r10),$xt3 # .Lrot16(%rip) + mov \$10,%eax + movdqa $xd0,0x100-0x100(%rcx) # save SIMD counters + jmp .Loop4x + +.align 32 +.Loop4x: +___ + foreach (&SSSE3_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&SSSE3_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop4x + + paddd 0x40(%rsp),$xa0 # accumulate key material + paddd 0x50(%rsp),$xa1 + paddd 0x60(%rsp),$xa2 + paddd 0x70(%rsp),$xa3 + + movdqa $xa0,$xt2 # 
"de-interlace" data + punpckldq $xa1,$xa0 + movdqa $xa2,$xt3 + punpckldq $xa3,$xa2 + punpckhdq $xa1,$xt2 + punpckhdq $xa3,$xt3 + movdqa $xa0,$xa1 + punpcklqdq $xa2,$xa0 # "a0" + movdqa $xt2,$xa3 + punpcklqdq $xt3,$xt2 # "a2" + punpckhqdq $xa2,$xa1 # "a1" + punpckhqdq $xt3,$xa3 # "a3" +___ + ($xa2,$xt2)=($xt2,$xa2); +$code.=<<___; + paddd 0x80-0x100(%rcx),$xb0 + paddd 0x90-0x100(%rcx),$xb1 + paddd 0xa0-0x100(%rcx),$xb2 + paddd 0xb0-0x100(%rcx),$xb3 + + movdqa $xa0,0x00(%rsp) # offload $xaN + movdqa $xa1,0x10(%rsp) + movdqa 0x20(%rsp),$xa0 # "xc2" + movdqa 0x30(%rsp),$xa1 # "xc3" + + movdqa $xb0,$xt2 + punpckldq $xb1,$xb0 + movdqa $xb2,$xt3 + punpckldq $xb3,$xb2 + punpckhdq $xb1,$xt2 + punpckhdq $xb3,$xt3 + movdqa $xb0,$xb1 + punpcklqdq $xb2,$xb0 # "b0" + movdqa $xt2,$xb3 + punpcklqdq $xt3,$xt2 # "b2" + punpckhqdq $xb2,$xb1 # "b1" + punpckhqdq $xt3,$xb3 # "b3" +___ + ($xb2,$xt2)=($xt2,$xb2); + my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1); +$code.=<<___; + paddd 0xc0-0x100(%rcx),$xc0 + paddd 0xd0-0x100(%rcx),$xc1 + paddd 0xe0-0x100(%rcx),$xc2 + paddd 0xf0-0x100(%rcx),$xc3 + + movdqa $xa2,0x20(%rsp) # keep offloading $xaN + movdqa $xa3,0x30(%rsp) + + movdqa $xc0,$xt2 + punpckldq $xc1,$xc0 + movdqa $xc2,$xt3 + punpckldq $xc3,$xc2 + punpckhdq $xc1,$xt2 + punpckhdq $xc3,$xt3 + movdqa $xc0,$xc1 + punpcklqdq $xc2,$xc0 # "c0" + movdqa $xt2,$xc3 + punpcklqdq $xt3,$xt2 # "c2" + punpckhqdq $xc2,$xc1 # "c1" + punpckhqdq $xt3,$xc3 # "c3" +___ + ($xc2,$xt2)=($xt2,$xc2); + ($xt0,$xt1)=($xa2,$xa3); # use $xaN as temporary +$code.=<<___; + paddd 0x100-0x100(%rcx),$xd0 + paddd 0x110-0x100(%rcx),$xd1 + paddd 0x120-0x100(%rcx),$xd2 + paddd 0x130-0x100(%rcx),$xd3 + + movdqa $xd0,$xt2 + punpckldq $xd1,$xd0 + movdqa $xd2,$xt3 + punpckldq $xd3,$xd2 + punpckhdq $xd1,$xt2 + punpckhdq $xd3,$xt3 + movdqa $xd0,$xd1 + punpcklqdq $xd2,$xd0 # "d0" + movdqa $xt2,$xd3 + punpcklqdq $xt3,$xt2 # "d2" + punpckhqdq $xd2,$xd1 # "d1" + punpckhqdq $xt3,$xd3 # "d3" +___ + ($xd2,$xt2)=($xt2,$xd2); +$code.=<<___; + cmp \$64*4,$len + jb .Ltail4x + + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaN is offloaded, remember? 
+ pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + lea 0x80($inp),$inp # size optimization + pxor 0x10(%rsp),$xt0 + pxor $xb1,$xt1 + pxor $xc1,$xt2 + pxor $xd1,$xt3 + + movdqu $xt0,0x40($out) + movdqu 0x00($inp),$xt0 + movdqu $xt1,0x50($out) + movdqu 0x10($inp),$xt1 + movdqu $xt2,0x60($out) + movdqu 0x20($inp),$xt2 + movdqu $xt3,0x70($out) + lea 0x80($out),$out # size optimization + movdqu 0x30($inp),$xt3 + pxor 0x20(%rsp),$xt0 + pxor $xb2,$xt1 + pxor $xc2,$xt2 + pxor $xd2,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + lea 0x80($inp),$inp # inp+=64*4 + pxor 0x30(%rsp),$xt0 + pxor $xb3,$xt1 + pxor $xc3,$xt2 + pxor $xd3,$xt3 + movdqu $xt0,0x40($out) + movdqu $xt1,0x50($out) + movdqu $xt2,0x60($out) + movdqu $xt3,0x70($out) + lea 0x80($out),$out # out+=64*4 + + sub \$64*4,$len + jnz .Loop_outer4x + + jmp .Ldone4x + +.Ltail4x: + cmp \$192,$len + jae .L192_or_more4x + cmp \$128,$len + jae .L128_or_more4x + cmp \$64,$len + jae .L64_or_more4x + + #movdqa 0x00(%rsp),$xt0 # $xaN is offloaded, remember? + xor %r10,%r10 + #movdqa $xt0,0x00(%rsp) + movdqa $xb0,0x10(%rsp) + movdqa $xc0,0x20(%rsp) + movdqa $xd0,0x30(%rsp) + jmp .Loop_tail4x + +.align 32 +.L64_or_more4x: + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaxN is offloaded, remember? + pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + movdqu $xt0,0x00($out) + movdqu $xt1,0x10($out) + movdqu $xt2,0x20($out) + movdqu $xt3,0x30($out) + je .Ldone4x + + movdqa 0x10(%rsp),$xt0 # $xaN is offloaded, remember? + lea 0x40($inp),$inp # inp+=64*1 + xor %r10,%r10 + movdqa $xt0,0x00(%rsp) + movdqa $xb1,0x10(%rsp) + lea 0x40($out),$out # out+=64*1 + movdqa $xc1,0x20(%rsp) + sub \$64,$len # len-=64*1 + movdqa $xd1,0x30(%rsp) + jmp .Loop_tail4x + +.align 32 +.L128_or_more4x: + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaN is offloaded, remember? + pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + pxor 0x10(%rsp),$xt0 + pxor $xb1,$xt1 + pxor $xc1,$xt2 + pxor $xd1,$xt3 + movdqu $xt0,0x40($out) + movdqu $xt1,0x50($out) + movdqu $xt2,0x60($out) + movdqu $xt3,0x70($out) + je .Ldone4x + + movdqa 0x20(%rsp),$xt0 # $xaN is offloaded, remember? + lea 0x80($inp),$inp # inp+=64*2 + xor %r10,%r10 + movdqa $xt0,0x00(%rsp) + movdqa $xb2,0x10(%rsp) + lea 0x80($out),$out # out+=64*2 + movdqa $xc2,0x20(%rsp) + sub \$128,$len # len-=64*2 + movdqa $xd2,0x30(%rsp) + jmp .Loop_tail4x + +.align 32 +.L192_or_more4x: + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaN is offloaded, remember? 
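+	# Tail strategy: .L64/.L128/.L192_or_more4x each xor as many whole
+	# 64-byte blocks as remain, take je .Ldone4x on an exact multiple
+	# (flags are still valid from the corresponding cmp in the .Ltail4x
+	# dispatch), and otherwise stage the next block's keystream on the
+	# stack so .Loop_tail4x can finish the final 1..63 bytes one at a
+	# time.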
+ pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + lea 0x80($inp),$inp # size optimization + pxor 0x10(%rsp),$xt0 + pxor $xb1,$xt1 + pxor $xc1,$xt2 + pxor $xd1,$xt3 + + movdqu $xt0,0x40($out) + movdqu 0x00($inp),$xt0 + movdqu $xt1,0x50($out) + movdqu 0x10($inp),$xt1 + movdqu $xt2,0x60($out) + movdqu 0x20($inp),$xt2 + movdqu $xt3,0x70($out) + lea 0x80($out),$out # size optimization + movdqu 0x30($inp),$xt3 + pxor 0x20(%rsp),$xt0 + pxor $xb2,$xt1 + pxor $xc2,$xt2 + pxor $xd2,$xt3 + movdqu $xt0,0x00($out) + movdqu $xt1,0x10($out) + movdqu $xt2,0x20($out) + movdqu $xt3,0x30($out) + je .Ldone4x + + movdqa 0x30(%rsp),$xt0 # $xaN is offloaded, remember? + lea 0x40($inp),$inp # inp+=64*3 + xor %r10,%r10 + movdqa $xt0,0x00(%rsp) + movdqa $xb3,0x10(%rsp) + lea 0x40($out),$out # out+=64*3 + movdqa $xc3,0x20(%rsp) + sub \$192,$len # len-=64*3 + movdqa $xd3,0x30(%rsp) + +.Loop_tail4x: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail4x + +.Ldone4x: +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4x_epilogue: + ret +.cfi_endproc +.size chacha20_4x,.-chacha20_4x +___ +} + +######################################################################## +# XOP code path that handles all lengths. +if ($avx && 0) { +# There is some "anomaly" observed depending on instructions' size or +# alignment. If you look closely at below code you'll notice that +# sometimes argument order varies. The order affects instruction +# encoding by making it larger, and such fiddling gives 5% performance +# improvement. This is on FX-4100... 
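+# Note: every *_lane_ROUND in this file derives Q2..Q4 from Q1 with the
+# same index arithmetic, which simply advances an index to the next
+# member of its quartet.  A minimal self-check (the sub name is an
+# assumption of this note, unused by the generator):
+sub chacha_next_quartet { map { ($_ & ~3) + (($_ + 1) & 3) } @_; }
+# chacha_next_quartet(0,4,8,12)  returns (1,5,9,13)  - column rounds, Q2
+# chacha_next_quartet(0,5,10,15) returns (1,6,11,12) - diagonal rounds, Q2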
+ +my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3, + $xa0,$xa1,$xa2,$xa3, $xt0,$xt1,$xt2,$xt3)=map("%xmm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xt0,$xt1,$xt2,$xt3, $xd0,$xd1,$xd2,$xd3); + +sub XOP_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my @x=map("\"$_\"",@xx); + + ( + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1 + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2 + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3 + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4 + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vprotd (@x[$d0],@x[$d0],16)", + "&vprotd (@x[$d1],@x[$d1],16)", + "&vprotd (@x[$d2],@x[$d2],16)", + "&vprotd (@x[$d3],@x[$d3],16)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxor (@x[$b0],@x[$c0],@x[$b0])", + "&vpxor (@x[$b1],@x[$c1],@x[$b1])", + "&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip + "&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip + "&vprotd (@x[$b0],@x[$b0],12)", + "&vprotd (@x[$b1],@x[$b1],12)", + "&vprotd (@x[$b2],@x[$b2],12)", + "&vprotd (@x[$b3],@x[$b3],12)", + + "&vpaddd (@x[$a0],@x[$b0],@x[$a0])", # flip + "&vpaddd (@x[$a1],@x[$b1],@x[$a1])", # flip + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vprotd (@x[$d0],@x[$d0],8)", + "&vprotd (@x[$d1],@x[$d1],8)", + "&vprotd (@x[$d2],@x[$d2],8)", + "&vprotd (@x[$d3],@x[$d3],8)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxor (@x[$b0],@x[$c0],@x[$b0])", + "&vpxor (@x[$b1],@x[$c1],@x[$b1])", + "&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip + "&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip + "&vprotd (@x[$b0],@x[$b0],7)", + "&vprotd (@x[$b1],@x[$b1],7)", + "&vprotd (@x[$b2],@x[$b2],7)", + "&vprotd (@x[$b3],@x[$b3],7)" + ); +} + +my $xframe = $win64 ? 0xa8 : 8; + +$code.=<<___; +.global chacha20_4xop +.type chacha20_4xop,\@function,5 +.align 32 +chacha20_4xop: +.cfi_startproc +.Lchacha20_4xop: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 + sub \$0x140+$xframe,%rsp +___ + ################ stack layout + # +0x00 SIMD equivalent of @x[8-12] + # ... + # +0x40 constant copy of key[0-2] smashed by lanes + # ... + # +0x100 SIMD counters (with nonce smashed by lanes) + # ... + # +0x140 +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L4xop_body: +___ +$code.=<<___; + vzeroupper + + vmovdqa .Lsigma(%rip),$xa3 # key[0] + vmovdqu ($key),$xb3 # key[1] + vmovdqu 16($key),$xt3 # key[2] + vmovdqu ($counter),$xd3 # key[3] + lea 0x100(%rsp),%rcx # size optimization + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vmovdqa $xa0,0x40(%rsp) # ... 
and offload + vpshufd \$0xaa,$xa3,$xa2 + vmovdqa $xa1,0x50(%rsp) + vpshufd \$0xff,$xa3,$xa3 + vmovdqa $xa2,0x60(%rsp) + vmovdqa $xa3,0x70(%rsp) + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vmovdqa $xb0,0x80-0x100(%rcx) + vpshufd \$0xaa,$xb3,$xb2 + vmovdqa $xb1,0x90-0x100(%rcx) + vpshufd \$0xff,$xb3,$xb3 + vmovdqa $xb2,0xa0-0x100(%rcx) + vmovdqa $xb3,0xb0-0x100(%rcx) + + vpshufd \$0x00,$xt3,$xt0 # "$xc0" + vpshufd \$0x55,$xt3,$xt1 # "$xc1" + vmovdqa $xt0,0xc0-0x100(%rcx) + vpshufd \$0xaa,$xt3,$xt2 # "$xc2" + vmovdqa $xt1,0xd0-0x100(%rcx) + vpshufd \$0xff,$xt3,$xt3 # "$xc3" + vmovdqa $xt2,0xe0-0x100(%rcx) + vmovdqa $xt3,0xf0-0x100(%rcx) + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpaddd .Linc(%rip),$xd0,$xd0 # don't save counters yet + vpshufd \$0xaa,$xd3,$xd2 + vmovdqa $xd1,0x110-0x100(%rcx) + vpshufd \$0xff,$xd3,$xd3 + vmovdqa $xd2,0x120-0x100(%rcx) + vmovdqa $xd3,0x130-0x100(%rcx) + + jmp .Loop_enter4xop + +.align 32 +.Loop_outer4xop: + vmovdqa 0x40(%rsp),$xa0 # re-load smashed key + vmovdqa 0x50(%rsp),$xa1 + vmovdqa 0x60(%rsp),$xa2 + vmovdqa 0x70(%rsp),$xa3 + vmovdqa 0x80-0x100(%rcx),$xb0 + vmovdqa 0x90-0x100(%rcx),$xb1 + vmovdqa 0xa0-0x100(%rcx),$xb2 + vmovdqa 0xb0-0x100(%rcx),$xb3 + vmovdqa 0xc0-0x100(%rcx),$xt0 # "$xc0" + vmovdqa 0xd0-0x100(%rcx),$xt1 # "$xc1" + vmovdqa 0xe0-0x100(%rcx),$xt2 # "$xc2" + vmovdqa 0xf0-0x100(%rcx),$xt3 # "$xc3" + vmovdqa 0x100-0x100(%rcx),$xd0 + vmovdqa 0x110-0x100(%rcx),$xd1 + vmovdqa 0x120-0x100(%rcx),$xd2 + vmovdqa 0x130-0x100(%rcx),$xd3 + vpaddd .Lfour(%rip),$xd0,$xd0 # next SIMD counters + +.Loop_enter4xop: + mov \$10,%eax + vmovdqa $xd0,0x100-0x100(%rcx) # save SIMD counters + jmp .Loop4xop + +.align 32 +.Loop4xop: +___ + foreach (&XOP_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&XOP_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop4xop + + vpaddd 0x40(%rsp),$xa0,$xa0 # accumulate key material + vpaddd 0x50(%rsp),$xa1,$xa1 + vpaddd 0x60(%rsp),$xa2,$xa2 + vpaddd 0x70(%rsp),$xa3,$xa3 + + vmovdqa $xt2,0x20(%rsp) # offload $xc2,3 + vmovdqa $xt3,0x30(%rsp) + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd 0x80-0x100(%rcx),$xb0,$xb0 + vpaddd 0x90-0x100(%rcx),$xb1,$xb1 + vpaddd 0xa0-0x100(%rcx),$xb2,$xb2 + vpaddd 0xb0-0x100(%rcx),$xb3,$xb3 + + vmovdqa $xa0,0x00(%rsp) # offload $xa0,1 + vmovdqa $xa1,0x10(%rsp) + vmovdqa 0x20(%rsp),$xa0 # "xc2" + vmovdqa 0x30(%rsp),$xa1 # "xc3" + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); + my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1); +$code.=<<___; + vpaddd 0xc0-0x100(%rcx),$xc0,$xc0 + vpaddd 0xd0-0x100(%rcx),$xc1,$xc1 + vpaddd 0xe0-0x100(%rcx),$xc2,$xc2 + vpaddd 0xf0-0x100(%rcx),$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + 
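+# Note: the list assignment below (like its siblings after every transpose
+# in this file) emits no instruction; it only permutes which Perl variable
+# names which register, so subsequent code keeps addressing the transposed
+# results by their logical "c0".."c3" names.  The idiom in isolation, with
+# throwaway names assumed by this note:
+#	my ($r0,$r1) = ("%xmm0","%xmm1");
+#	($r0,$r1) = ($r1,$r0);	# swaps the names only; no code is generated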
($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd 0x100-0x100(%rcx),$xd0,$xd0 + vpaddd 0x110-0x100(%rcx),$xd1,$xd1 + vpaddd 0x120-0x100(%rcx),$xd2,$xd2 + vpaddd 0x130-0x100(%rcx),$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); + ($xa0,$xa1)=($xt2,$xt3); +$code.=<<___; + vmovdqa 0x00(%rsp),$xa0 # restore $xa0,1 + vmovdqa 0x10(%rsp),$xa1 + + cmp \$64*4,$len + jb .Ltail4xop + + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vpxor 0x40($inp),$xa1,$xa1 + vpxor 0x50($inp),$xb1,$xb1 + vpxor 0x60($inp),$xc1,$xc1 + vpxor 0x70($inp),$xd1,$xd1 + lea 0x80($inp),$inp # size optimization + vpxor 0x00($inp),$xa2,$xa2 + vpxor 0x10($inp),$xb2,$xb2 + vpxor 0x20($inp),$xc2,$xc2 + vpxor 0x30($inp),$xd2,$xd2 + vpxor 0x40($inp),$xa3,$xa3 + vpxor 0x50($inp),$xb3,$xb3 + vpxor 0x60($inp),$xc3,$xc3 + vpxor 0x70($inp),$xd3,$xd3 + lea 0x80($inp),$inp # inp+=64*4 + + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + vmovdqu $xa1,0x40($out) + vmovdqu $xb1,0x50($out) + vmovdqu $xc1,0x60($out) + vmovdqu $xd1,0x70($out) + lea 0x80($out),$out # size optimization + vmovdqu $xa2,0x00($out) + vmovdqu $xb2,0x10($out) + vmovdqu $xc2,0x20($out) + vmovdqu $xd2,0x30($out) + vmovdqu $xa3,0x40($out) + vmovdqu $xb3,0x50($out) + vmovdqu $xc3,0x60($out) + vmovdqu $xd3,0x70($out) + lea 0x80($out),$out # out+=64*4 + + sub \$64*4,$len + jnz .Loop_outer4xop + + jmp .Ldone4xop + +.align 32 +.Ltail4xop: + cmp \$192,$len + jae .L192_or_more4xop + cmp \$128,$len + jae .L128_or_more4xop + cmp \$64,$len + jae .L64_or_more4xop + + xor %r10,%r10 + vmovdqa $xa0,0x00(%rsp) + vmovdqa $xb0,0x10(%rsp) + vmovdqa $xc0,0x20(%rsp) + vmovdqa $xd0,0x30(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L64_or_more4xop: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + je .Ldone4xop + + lea 0x40($inp),$inp # inp+=64*1 + vmovdqa $xa1,0x00(%rsp) + xor %r10,%r10 + vmovdqa $xb1,0x10(%rsp) + lea 0x40($out),$out # out+=64*1 + vmovdqa $xc1,0x20(%rsp) + sub \$64,$len # len-=64*1 + vmovdqa $xd1,0x30(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L128_or_more4xop: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vpxor 0x40($inp),$xa1,$xa1 + vpxor 0x50($inp),$xb1,$xb1 + vpxor 0x60($inp),$xc1,$xc1 + vpxor 0x70($inp),$xd1,$xd1 + + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + vmovdqu $xa1,0x40($out) + vmovdqu $xb1,0x50($out) + vmovdqu $xc1,0x60($out) + vmovdqu $xd1,0x70($out) + je .Ldone4xop + + lea 0x80($inp),$inp # inp+=64*2 + vmovdqa $xa2,0x00(%rsp) + xor %r10,%r10 + vmovdqa $xb2,0x10(%rsp) + lea 0x80($out),$out # out+=64*2 + vmovdqa $xc2,0x20(%rsp) + sub \$128,$len # len-=64*2 + vmovdqa $xd2,0x30(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L192_or_more4xop: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vpxor 
0x40($inp),$xa1,$xa1 + vpxor 0x50($inp),$xb1,$xb1 + vpxor 0x60($inp),$xc1,$xc1 + vpxor 0x70($inp),$xd1,$xd1 + lea 0x80($inp),$inp # size optimization + vpxor 0x00($inp),$xa2,$xa2 + vpxor 0x10($inp),$xb2,$xb2 + vpxor 0x20($inp),$xc2,$xc2 + vpxor 0x30($inp),$xd2,$xd2 + + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + vmovdqu $xa1,0x40($out) + vmovdqu $xb1,0x50($out) + vmovdqu $xc1,0x60($out) + vmovdqu $xd1,0x70($out) + lea 0x80($out),$out # size optimization + vmovdqu $xa2,0x00($out) + vmovdqu $xb2,0x10($out) + vmovdqu $xc2,0x20($out) + vmovdqu $xd2,0x30($out) + je .Ldone4xop + + lea 0x40($inp),$inp # inp+=64*3 + vmovdqa $xa3,0x00(%rsp) + xor %r10,%r10 + vmovdqa $xb3,0x10(%rsp) + lea 0x40($out),$out # out+=64*3 + vmovdqa $xc3,0x20(%rsp) + sub \$192,$len # len-=64*3 + vmovdqa $xd3,0x30(%rsp) + +.Loop_tail4xop: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail4xop + +.Ldone4xop: + vzeroupper +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4xop_epilogue: + ret +.cfi_endproc +.size chacha20_4xop,.-chacha20_4xop +___ +} + +######################################################################## +# AVX2 code path +if ($avx>1) { +my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3, + $xa0,$xa1,$xa2,$xa3, $xt0,$xt1,$xt2,$xt3)=map("%ymm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + "%nox","%nox","%nox","%nox", $xd0,$xd1,$xd2,$xd3); + +sub AVX2_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my ($xc,$xc_,$t0,$t1)=map("\"$_\"",$xt0,$xt1,$xt2,$xt3); +my @x=map("\"$_\"",@xx); + + # Consider order in which variables are addressed by their + # index: + # + # a b c d + # + # 0 4 8 12 < even round + # 1 5 9 13 + # 2 6 10 14 + # 3 7 11 15 + # 0 5 10 15 < odd round + # 1 6 11 12 + # 2 7 8 13 + # 3 4 9 14 + # + # 'a', 'b' and 'd's are permanently allocated in registers, + # @x[0..7,12..15], while 'c's are maintained in memory. If + # you observe 'c' column, you'll notice that pair of 'c's is + # invariant between rounds. This means that we have to reload + # them once per round, in the middle. This is why you'll see + # bunch of 'c' stores and loads in the middle, but none in + # the beginning or end. 
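+	# Register-pressure note: the byte-rotate shuffle masks (.Lrot16 and
+	# .Lrot24, addressed via %r10/%r11) share registers with the shift
+	# temporaries, so each mask is re-broadcast from memory right after
+	# the temporary that held it is clobbered.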
+ + ( + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1 + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpshufb (@x[$d0],@x[$d0],$t1)", + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2 + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpshufb (@x[$d1],@x[$d1],$t1)", + + "&vpaddd ($xc,$xc,@x[$d0])", + "&vpxor (@x[$b0],$xc,@x[$b0])", + "&vpslld ($t0,@x[$b0],12)", + "&vpsrld (@x[$b0],@x[$b0],20)", + "&vpor (@x[$b0],$t0,@x[$b0])", + "&vbroadcasti128($t0,'(%r11)')", # .Lrot24(%rip) + "&vpaddd ($xc_,$xc_,@x[$d1])", + "&vpxor (@x[$b1],$xc_,@x[$b1])", + "&vpslld ($t1,@x[$b1],12)", + "&vpsrld (@x[$b1],@x[$b1],20)", + "&vpor (@x[$b1],$t1,@x[$b1])", + + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpshufb (@x[$d0],@x[$d0],$t0)", + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpshufb (@x[$d1],@x[$d1],$t0)", + + "&vpaddd ($xc,$xc,@x[$d0])", + "&vpxor (@x[$b0],$xc,@x[$b0])", + "&vpslld ($t1,@x[$b0],7)", + "&vpsrld (@x[$b0],@x[$b0],25)", + "&vpor (@x[$b0],$t1,@x[$b0])", + "&vbroadcasti128($t1,'(%r10)')", # .Lrot16(%rip) + "&vpaddd ($xc_,$xc_,@x[$d1])", + "&vpxor (@x[$b1],$xc_,@x[$b1])", + "&vpslld ($t0,@x[$b1],7)", + "&vpsrld (@x[$b1],@x[$b1],25)", + "&vpor (@x[$b1],$t0,@x[$b1])", + + "&vmovdqa (\"`32*($c0-8)`(%rsp)\",$xc)", # reload pair of 'c's + "&vmovdqa (\"`32*($c1-8)`(%rsp)\",$xc_)", + "&vmovdqa ($xc,\"`32*($c2-8)`(%rsp)\")", + "&vmovdqa ($xc_,\"`32*($c3-8)`(%rsp)\")", + + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3 + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpshufb (@x[$d2],@x[$d2],$t1)", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4 + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vpshufb (@x[$d3],@x[$d3],$t1)", + + "&vpaddd ($xc,$xc,@x[$d2])", + "&vpxor (@x[$b2],$xc,@x[$b2])", + "&vpslld ($t0,@x[$b2],12)", + "&vpsrld (@x[$b2],@x[$b2],20)", + "&vpor (@x[$b2],$t0,@x[$b2])", + "&vbroadcasti128($t0,'(%r11)')", # .Lrot24(%rip) + "&vpaddd ($xc_,$xc_,@x[$d3])", + "&vpxor (@x[$b3],$xc_,@x[$b3])", + "&vpslld ($t1,@x[$b3],12)", + "&vpsrld (@x[$b3],@x[$b3],20)", + "&vpor (@x[$b3],$t1,@x[$b3])", + + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpshufb (@x[$d2],@x[$d2],$t0)", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vpshufb (@x[$d3],@x[$d3],$t0)", + + "&vpaddd ($xc,$xc,@x[$d2])", + "&vpxor (@x[$b2],$xc,@x[$b2])", + "&vpslld ($t1,@x[$b2],7)", + "&vpsrld (@x[$b2],@x[$b2],25)", + "&vpor (@x[$b2],$t1,@x[$b2])", + "&vbroadcasti128($t1,'(%r10)')", # .Lrot16(%rip) + "&vpaddd ($xc_,$xc_,@x[$d3])", + "&vpxor (@x[$b3],$xc_,@x[$b3])", + "&vpslld ($t0,@x[$b3],7)", + "&vpsrld (@x[$b3],@x[$b3],25)", + "&vpor (@x[$b3],$t0,@x[$b3])" + ); +} + +my $xframe = $win64 ? 0xa8 : 8; + +$code.=<<___; +.global chacha20_avx2 +.type chacha20_avx2,\@function,5 +.align 32 +chacha20_avx2: +.cfi_startproc +.Lchacha20_avx2: + mov %rsp,%r9 # frame register +.cfi_def_cfa_register %r9 + sub \$0x280+$xframe,%rsp + and \$-32,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L8x_body: +___ + ################ stack layout + # +0x00 SIMD equivalent of @x[8-12] + # ... + # +0x80 constant copy of key[0-2] smashed by lanes + # ... + # +0x200 SIMD counters (with nonce smashed by lanes) + # ... 
+ # +0x280 + +$code.=<<___; + vzeroupper + + vbroadcasti128 .Lsigma(%rip),$xa3 # key[0] + vbroadcasti128 ($key),$xb3 # key[1] + vbroadcasti128 16($key),$xt3 # key[2] + vbroadcasti128 ($counter),$xd3 # key[3] + lea 0x100(%rsp),%rcx # size optimization + lea 0x200(%rsp),%rax # size optimization + lea .Lrot16(%rip),%r10 + lea .Lrot24(%rip),%r11 + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vmovdqa $xa0,0x80-0x100(%rcx) # ... and offload + vpshufd \$0xaa,$xa3,$xa2 + vmovdqa $xa1,0xa0-0x100(%rcx) + vpshufd \$0xff,$xa3,$xa3 + vmovdqa $xa2,0xc0-0x100(%rcx) + vmovdqa $xa3,0xe0-0x100(%rcx) + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vmovdqa $xb0,0x100-0x100(%rcx) + vpshufd \$0xaa,$xb3,$xb2 + vmovdqa $xb1,0x120-0x100(%rcx) + vpshufd \$0xff,$xb3,$xb3 + vmovdqa $xb2,0x140-0x100(%rcx) + vmovdqa $xb3,0x160-0x100(%rcx) + + vpshufd \$0x00,$xt3,$xt0 # "xc0" + vpshufd \$0x55,$xt3,$xt1 # "xc1" + vmovdqa $xt0,0x180-0x200(%rax) + vpshufd \$0xaa,$xt3,$xt2 # "xc2" + vmovdqa $xt1,0x1a0-0x200(%rax) + vpshufd \$0xff,$xt3,$xt3 # "xc3" + vmovdqa $xt2,0x1c0-0x200(%rax) + vmovdqa $xt3,0x1e0-0x200(%rax) + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpaddd .Lincy(%rip),$xd0,$xd0 # don't save counters yet + vpshufd \$0xaa,$xd3,$xd2 + vmovdqa $xd1,0x220-0x200(%rax) + vpshufd \$0xff,$xd3,$xd3 + vmovdqa $xd2,0x240-0x200(%rax) + vmovdqa $xd3,0x260-0x200(%rax) + + jmp .Loop_enter8x + +.align 32 +.Loop_outer8x: + vmovdqa 0x80-0x100(%rcx),$xa0 # re-load smashed key + vmovdqa 0xa0-0x100(%rcx),$xa1 + vmovdqa 0xc0-0x100(%rcx),$xa2 + vmovdqa 0xe0-0x100(%rcx),$xa3 + vmovdqa 0x100-0x100(%rcx),$xb0 + vmovdqa 0x120-0x100(%rcx),$xb1 + vmovdqa 0x140-0x100(%rcx),$xb2 + vmovdqa 0x160-0x100(%rcx),$xb3 + vmovdqa 0x180-0x200(%rax),$xt0 # "xc0" + vmovdqa 0x1a0-0x200(%rax),$xt1 # "xc1" + vmovdqa 0x1c0-0x200(%rax),$xt2 # "xc2" + vmovdqa 0x1e0-0x200(%rax),$xt3 # "xc3" + vmovdqa 0x200-0x200(%rax),$xd0 + vmovdqa 0x220-0x200(%rax),$xd1 + vmovdqa 0x240-0x200(%rax),$xd2 + vmovdqa 0x260-0x200(%rax),$xd3 + vpaddd .Leight(%rip),$xd0,$xd0 # next SIMD counters + +.Loop_enter8x: + vmovdqa $xt2,0x40(%rsp) # SIMD equivalent of "@x[10]" + vmovdqa $xt3,0x60(%rsp) # SIMD equivalent of "@x[11]" + vbroadcasti128 (%r10),$xt3 + vmovdqa $xd0,0x200-0x200(%rax) # save SIMD counters + mov \$10,%eax + jmp .Loop8x + +.align 32 +.Loop8x: +___ + foreach (&AVX2_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&AVX2_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop8x + + lea 0x200(%rsp),%rax # size optimization + vpaddd 0x80-0x100(%rcx),$xa0,$xa0 # accumulate key + vpaddd 0xa0-0x100(%rcx),$xa1,$xa1 + vpaddd 0xc0-0x100(%rcx),$xa2,$xa2 + vpaddd 0xe0-0x100(%rcx),$xa3,$xa3 + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd 0x100-0x100(%rcx),$xb0,$xb0 + vpaddd 0x120-0x100(%rcx),$xb1,$xb1 + vpaddd 0x140-0x100(%rcx),$xb2,$xb2 + vpaddd 0x160-0x100(%rcx),$xb3,$xb3 + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); +$code.=<<___; + 
vperm2i128 \$0x20,$xb0,$xa0,$xt3 # "de-interlace" further + vperm2i128 \$0x31,$xb0,$xa0,$xb0 + vperm2i128 \$0x20,$xb1,$xa1,$xa0 + vperm2i128 \$0x31,$xb1,$xa1,$xb1 + vperm2i128 \$0x20,$xb2,$xa2,$xa1 + vperm2i128 \$0x31,$xb2,$xa2,$xb2 + vperm2i128 \$0x20,$xb3,$xa3,$xa2 + vperm2i128 \$0x31,$xb3,$xa3,$xb3 +___ + ($xa0,$xa1,$xa2,$xa3,$xt3)=($xt3,$xa0,$xa1,$xa2,$xa3); + my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1); +$code.=<<___; + vmovdqa $xa0,0x00(%rsp) # offload $xaN + vmovdqa $xa1,0x20(%rsp) + vmovdqa 0x40(%rsp),$xc2 # $xa0 + vmovdqa 0x60(%rsp),$xc3 # $xa1 + + vpaddd 0x180-0x200(%rax),$xc0,$xc0 + vpaddd 0x1a0-0x200(%rax),$xc1,$xc1 + vpaddd 0x1c0-0x200(%rax),$xc2,$xc2 + vpaddd 0x1e0-0x200(%rax),$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd 0x200-0x200(%rax),$xd0,$xd0 + vpaddd 0x220-0x200(%rax),$xd1,$xd1 + vpaddd 0x240-0x200(%rax),$xd2,$xd2 + vpaddd 0x260-0x200(%rax),$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); +$code.=<<___; + vperm2i128 \$0x20,$xd0,$xc0,$xt3 # "de-interlace" further + vperm2i128 \$0x31,$xd0,$xc0,$xd0 + vperm2i128 \$0x20,$xd1,$xc1,$xc0 + vperm2i128 \$0x31,$xd1,$xc1,$xd1 + vperm2i128 \$0x20,$xd2,$xc2,$xc1 + vperm2i128 \$0x31,$xd2,$xc2,$xd2 + vperm2i128 \$0x20,$xd3,$xc3,$xc2 + vperm2i128 \$0x31,$xd3,$xc3,$xd3 +___ + ($xc0,$xc1,$xc2,$xc3,$xt3)=($xt3,$xc0,$xc1,$xc2,$xc3); + ($xb0,$xb1,$xb2,$xb3,$xc0,$xc1,$xc2,$xc3)= + ($xc0,$xc1,$xc2,$xc3,$xb0,$xb1,$xb2,$xb3); + ($xa0,$xa1)=($xt2,$xt3); +$code.=<<___; + vmovdqa 0x00(%rsp),$xa0 # $xaN was offloaded, remember? 
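+	# Eight 64-byte blocks (512 bytes) are produced per outer iteration.
+	# After the vperm2i128 de-interlacing above, each ymm register holds
+	# 32 consecutive keystream bytes, so every (a,b,c,d) quartet covers
+	# two adjacent blocks; rows a0/a1 are restored from their stack
+	# slots here.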
+ vmovdqa 0x20(%rsp),$xa1 + + cmp \$64*8,$len + jb .Ltail8x + + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + lea 0x80($out),$out # size optimization + + vpxor 0x00($inp),$xa1,$xa1 + vpxor 0x20($inp),$xb1,$xb1 + vpxor 0x40($inp),$xc1,$xc1 + vpxor 0x60($inp),$xd1,$xd1 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa1,0x00($out) + vmovdqu $xb1,0x20($out) + vmovdqu $xc1,0x40($out) + vmovdqu $xd1,0x60($out) + lea 0x80($out),$out # size optimization + + vpxor 0x00($inp),$xa2,$xa2 + vpxor 0x20($inp),$xb2,$xb2 + vpxor 0x40($inp),$xc2,$xc2 + vpxor 0x60($inp),$xd2,$xd2 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa2,0x00($out) + vmovdqu $xb2,0x20($out) + vmovdqu $xc2,0x40($out) + vmovdqu $xd2,0x60($out) + lea 0x80($out),$out # size optimization + + vpxor 0x00($inp),$xa3,$xa3 + vpxor 0x20($inp),$xb3,$xb3 + vpxor 0x40($inp),$xc3,$xc3 + vpxor 0x60($inp),$xd3,$xd3 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa3,0x00($out) + vmovdqu $xb3,0x20($out) + vmovdqu $xc3,0x40($out) + vmovdqu $xd3,0x60($out) + lea 0x80($out),$out # size optimization + + sub \$64*8,$len + jnz .Loop_outer8x + + jmp .Ldone8x + +.Ltail8x: + cmp \$448,$len + jae .L448_or_more8x + cmp \$384,$len + jae .L384_or_more8x + cmp \$320,$len + jae .L320_or_more8x + cmp \$256,$len + jae .L256_or_more8x + cmp \$192,$len + jae .L192_or_more8x + cmp \$128,$len + jae .L128_or_more8x + cmp \$64,$len + jae .L64_or_more8x + + xor %r10,%r10 + vmovdqa $xa0,0x00(%rsp) + vmovdqa $xb0,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L64_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + je .Ldone8x + + lea 0x40($inp),$inp # inp+=64*1 + xor %r10,%r10 + vmovdqa $xc0,0x00(%rsp) + lea 0x40($out),$out # out+=64*1 + sub \$64,$len # len-=64*1 + vmovdqa $xd0,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L128_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + je .Ldone8x + + lea 0x80($inp),$inp # inp+=64*2 + xor %r10,%r10 + vmovdqa $xa1,0x00(%rsp) + lea 0x80($out),$out # out+=64*2 + sub \$128,$len # len-=64*2 + vmovdqa $xb1,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L192_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + je .Ldone8x + + lea 0xc0($inp),$inp # inp+=64*3 + xor %r10,%r10 + vmovdqa $xc1,0x00(%rsp) + lea 0xc0($out),$out # out+=64*3 + sub \$192,$len # len-=64*3 + vmovdqa $xd1,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L256_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu 
$xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + je .Ldone8x + + lea 0x100($inp),$inp # inp+=64*4 + xor %r10,%r10 + vmovdqa $xa2,0x00(%rsp) + lea 0x100($out),$out # out+=64*4 + sub \$256,$len # len-=64*4 + vmovdqa $xb2,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L320_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vpxor 0x100($inp),$xa2,$xa2 + vpxor 0x120($inp),$xb2,$xb2 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + vmovdqu $xa2,0x100($out) + vmovdqu $xb2,0x120($out) + je .Ldone8x + + lea 0x140($inp),$inp # inp+=64*5 + xor %r10,%r10 + vmovdqa $xc2,0x00(%rsp) + lea 0x140($out),$out # out+=64*5 + sub \$320,$len # len-=64*5 + vmovdqa $xd2,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L384_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vpxor 0x100($inp),$xa2,$xa2 + vpxor 0x120($inp),$xb2,$xb2 + vpxor 0x140($inp),$xc2,$xc2 + vpxor 0x160($inp),$xd2,$xd2 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + vmovdqu $xa2,0x100($out) + vmovdqu $xb2,0x120($out) + vmovdqu $xc2,0x140($out) + vmovdqu $xd2,0x160($out) + je .Ldone8x + + lea 0x180($inp),$inp # inp+=64*6 + xor %r10,%r10 + vmovdqa $xa3,0x00(%rsp) + lea 0x180($out),$out # out+=64*6 + sub \$384,$len # len-=64*6 + vmovdqa $xb3,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L448_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vpxor 0x100($inp),$xa2,$xa2 + vpxor 0x120($inp),$xb2,$xb2 + vpxor 0x140($inp),$xc2,$xc2 + vpxor 0x160($inp),$xd2,$xd2 + vpxor 0x180($inp),$xa3,$xa3 + vpxor 0x1a0($inp),$xb3,$xb3 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + vmovdqu $xa2,0x100($out) + vmovdqu $xb2,0x120($out) + vmovdqu $xc2,0x140($out) + vmovdqu $xd2,0x160($out) + vmovdqu $xa3,0x180($out) + vmovdqu $xb3,0x1a0($out) + je .Ldone8x + + lea 0x1c0($inp),$inp # inp+=64*7 + xor %r10,%r10 + vmovdqa $xc3,0x00(%rsp) + lea 0x1c0($out),$out # out+=64*7 + sub \$448,$len # len-=64*7 + vmovdqa $xd3,0x20(%rsp) + +.Loop_tail8x: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail8x + +.Ldone8x: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp 
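+	# Epilogue: the vzeroall above both avoids AVX-to-SSE transition
+	# penalties and scrubs keystream material from the vector registers;
+	# the stack pointer is restored from the frame register with one lea.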
+.L8x_epilogue: + ret +.cfi_endproc +.size chacha20_avx2,.-chacha20_avx2 +___ +} + +######################################################################## +# AVX512 code paths +if ($avx>2) { +# This one handles shorter inputs... + +my ($a,$b,$c,$d, $a_,$b_,$c_,$d_,$fourz) = map("%zmm$_",(0..3,16..20)); +my ($t0,$t1,$t2,$t3) = map("%xmm$_",(4..7)); + +sub vpxord() # size optimization +{ my $opcode = "vpxor"; # adhere to vpxor when possible + + foreach (@_) { + if (/%([zy])mm([0-9]+)/ && ($1 eq "z" || $2>=16)) { + $opcode = "vpxord"; + last; + } + } + + $code .= "\t$opcode\t".join(',',reverse @_)."\n"; +} + +sub AVX512ROUND { # critical path is 14 "SIMD ticks" per round + &vpaddd ($a,$a,$b); + &vpxord ($d,$d,$a); + &vprold ($d,$d,16); + + &vpaddd ($c,$c,$d); + &vpxord ($b,$b,$c); + &vprold ($b,$b,12); + + &vpaddd ($a,$a,$b); + &vpxord ($d,$d,$a); + &vprold ($d,$d,8); + + &vpaddd ($c,$c,$d); + &vpxord ($b,$b,$c); + &vprold ($b,$b,7); +} + +my $xframe = $win64 ? 32+8 : 8; + +$code.=<<___; +.global chacha20_avx512 +.type chacha20_avx512,\@function,5 +.align 32 +chacha20_avx512: +.cfi_startproc +.Lchacha20_avx512: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 + cmp \$512,$len + ja .Lchacha20_16x + + sub \$64+$xframe,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0x28(%r9) + movaps %xmm7,-0x18(%r9) +.Lavx512_body: +___ +$code.=<<___; + vbroadcasti32x4 .Lsigma(%rip),$a + vbroadcasti32x4 ($key),$b + vbroadcasti32x4 16($key),$c + vbroadcasti32x4 ($counter),$d + + vmovdqa32 $a,$a_ + vmovdqa32 $b,$b_ + vmovdqa32 $c,$c_ + vpaddd .Lzeroz(%rip),$d,$d + vmovdqa32 .Lfourz(%rip),$fourz + mov \$10,$counter # reuse $counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512 + +.align 16 +.Loop_outer_avx512: + vmovdqa32 $a_,$a + vmovdqa32 $b_,$b + vmovdqa32 $c_,$c + vpaddd $fourz,$d_,$d + mov \$10,$counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: +___ + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b00111001); + &vpshufd ($d,$d,0b10010011); + + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b10010011); + &vpshufd ($d,$d,0b00111001); + + &dec ($counter); + &jnz (".Loop_avx512"); + +$code.=<<___; + vpaddd $a_,$a,$a + vpaddd $b_,$b,$b + vpaddd $c_,$c,$c + vpaddd $d_,$d,$d + + sub \$64,$len + jb .Ltail64_avx512 + + vpxor 0x00($inp),%x#$a,$t0 # xor with input + vpxor 0x10($inp),%x#$b,$t1 + vpxor 0x20($inp),%x#$c,$t2 + vpxor 0x30($inp),%x#$d,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jz .Ldone_avx512 + + vextracti32x4 \$1,$a,$t0 + vextracti32x4 \$1,$b,$t1 + vextracti32x4 \$1,$c,$t2 + vextracti32x4 \$1,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512 + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jz .Ldone_avx512 + + vextracti32x4 \$2,$a,$t0 + vextracti32x4 \$2,$b,$t1 + vextracti32x4 \$2,$c,$t2 + vextracti32x4 \$2,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512 + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out 
# out+=64 + + jz .Ldone_avx512 + + vextracti32x4 \$3,$a,$t0 + vextracti32x4 \$3,$b,$t1 + vextracti32x4 \$3,$c,$t2 + vextracti32x4 \$3,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512 + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jnz .Loop_outer_avx512 + + jmp .Ldone_avx512 + +.align 16 +.Ltail64_avx512: + vmovdqa %x#$a,0x00(%rsp) + vmovdqa %x#$b,0x10(%rsp) + vmovdqa %x#$c,0x20(%rsp) + vmovdqa %x#$d,0x30(%rsp) + add \$64,$len + jmp .Loop_tail_avx512 + +.align 16 +.Ltail_avx512: + vmovdqa $t0,0x00(%rsp) + vmovdqa $t1,0x10(%rsp) + vmovdqa $t2,0x20(%rsp) + vmovdqa $t3,0x30(%rsp) + add \$64,$len + +.Loop_tail_avx512: + movzb ($inp,$counter),%eax + movzb (%rsp,$counter),%ecx + lea 1($counter),$counter + xor %ecx,%eax + mov %al,-1($out,$counter) + dec $len + jnz .Loop_tail_avx512 + + vmovdqu32 $a_,0x00(%rsp) + +.Ldone_avx512: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0x28(%r9),%xmm6 + movaps -0x18(%r9),%xmm7 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512_epilogue: + ret +.cfi_endproc +.size chacha20_avx512,.-chacha20_avx512 +___ + +map(s/%z/%y/, $a,$b,$c,$d, $a_,$b_,$c_,$d_,$fourz); + +$code.=<<___; +.global chacha20_avx512vl +.type chacha20_avx512vl,\@function,5 +.align 32 +chacha20_avx512vl: +.cfi_startproc +.Lchacha20_avx512vl: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 + cmp \$128,$len + ja .Lchacha20_8xvl + + sub \$64+$xframe,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0x28(%r9) + movaps %xmm7,-0x18(%r9) +.Lavx512vl_body: +___ +$code.=<<___; + vbroadcasti128 .Lsigma(%rip),$a + vbroadcasti128 ($key),$b + vbroadcasti128 16($key),$c + vbroadcasti128 ($counter),$d + + vmovdqa32 $a,$a_ + vmovdqa32 $b,$b_ + vmovdqa32 $c,$c_ + vpaddd .Lzeroz(%rip),$d,$d + vmovdqa32 .Ltwoy(%rip),$fourz + mov \$10,$counter # reuse $counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512vl + +.align 16 +.Loop_outer_avx512vl: + vmovdqa32 $c_,$c + vpaddd $fourz,$d_,$d + mov \$10,$counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512vl + +.align 32 +.Loop_avx512vl: +___ + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b00111001); + &vpshufd ($d,$d,0b10010011); + + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b10010011); + &vpshufd ($d,$d,0b00111001); + + &dec ($counter); + &jnz (".Loop_avx512vl"); + +$code.=<<___; + vpaddd $a_,$a,$a + vpaddd $b_,$b,$b + vpaddd $c_,$c,$c + vpaddd $d_,$d,$d + + sub \$64,$len + jb .Ltail64_avx512vl + + vpxor 0x00($inp),%x#$a,$t0 # xor with input + vpxor 0x10($inp),%x#$b,$t1 + vpxor 0x20($inp),%x#$c,$t2 + vpxor 0x30($inp),%x#$d,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jz .Ldone_avx512vl + + vextracti128 \$1,$a,$t0 + vextracti128 \$1,$b,$t1 + vextracti128 \$1,$c,$t2 + vextracti128 \$1,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512vl + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + vmovdqa32 $a_,$a + vmovdqa32 $b_,$b + jnz .Loop_outer_avx512vl + + jmp 
.Ldone_avx512vl + +.align 16 +.Ltail64_avx512vl: + vmovdqa %x#$a,0x00(%rsp) + vmovdqa %x#$b,0x10(%rsp) + vmovdqa %x#$c,0x20(%rsp) + vmovdqa %x#$d,0x30(%rsp) + add \$64,$len + jmp .Loop_tail_avx512vl + +.align 16 +.Ltail_avx512vl: + vmovdqa $t0,0x00(%rsp) + vmovdqa $t1,0x10(%rsp) + vmovdqa $t2,0x20(%rsp) + vmovdqa $t3,0x30(%rsp) + add \$64,$len + +.Loop_tail_avx512vl: + movzb ($inp,$counter),%eax + movzb (%rsp,$counter),%ecx + lea 1($counter),$counter + xor %ecx,%eax + mov %al,-1($out,$counter) + dec $len + jnz .Loop_tail_avx512vl + + vmovdqu32 $a_,0x00(%rsp) + vmovdqu32 $a_,0x20(%rsp) + +.Ldone_avx512vl: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0x28(%r9),%xmm6 + movaps -0x18(%r9),%xmm7 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512vl_epilogue: + ret +.cfi_endproc +.size chacha20_avx512vl,.-chacha20_avx512vl +___ +} +if ($avx>2) { +# This one handles longer inputs... + +my ($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3)=map("%zmm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3); +my @key=map("%zmm$_",(16..31)); +my ($xt0,$xt1,$xt2,$xt3)=@key[0..3]; + +sub AVX512_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my @x=map("\"$_\"",@xx); + + ( + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1 + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2 + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3 + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4 + "&vpxord (@x[$d0],@x[$d0],@x[$a0])", + "&vpxord (@x[$d1],@x[$d1],@x[$a1])", + "&vpxord (@x[$d2],@x[$d2],@x[$a2])", + "&vpxord (@x[$d3],@x[$d3],@x[$a3])", + "&vprold (@x[$d0],@x[$d0],16)", + "&vprold (@x[$d1],@x[$d1],16)", + "&vprold (@x[$d2],@x[$d2],16)", + "&vprold (@x[$d3],@x[$d3],16)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxord (@x[$b0],@x[$b0],@x[$c0])", + "&vpxord (@x[$b1],@x[$b1],@x[$c1])", + "&vpxord (@x[$b2],@x[$b2],@x[$c2])", + "&vpxord (@x[$b3],@x[$b3],@x[$c3])", + "&vprold (@x[$b0],@x[$b0],12)", + "&vprold (@x[$b1],@x[$b1],12)", + "&vprold (@x[$b2],@x[$b2],12)", + "&vprold (@x[$b3],@x[$b3],12)", + + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", + "&vpxord (@x[$d0],@x[$d0],@x[$a0])", + "&vpxord (@x[$d1],@x[$d1],@x[$a1])", + "&vpxord (@x[$d2],@x[$d2],@x[$a2])", + "&vpxord (@x[$d3],@x[$d3],@x[$a3])", + "&vprold (@x[$d0],@x[$d0],8)", + "&vprold (@x[$d1],@x[$d1],8)", + "&vprold (@x[$d2],@x[$d2],8)", + "&vprold (@x[$d3],@x[$d3],8)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxord (@x[$b0],@x[$b0],@x[$c0])", + "&vpxord (@x[$b1],@x[$b1],@x[$c1])", + "&vpxord (@x[$b2],@x[$b2],@x[$c2])", + "&vpxord (@x[$b3],@x[$b3],@x[$c3])", + "&vprold (@x[$b0],@x[$b0],7)", + "&vprold (@x[$b1],@x[$b1],7)", + "&vprold (@x[$b2],@x[$b2],7)", + "&vprold (@x[$b3],@x[$b3],7)" + ); +} + +my $xframe = $win64 ? 
0xa8 : 8; + +$code.=<<___; +.global chacha20_16x +.type chacha20_16x,\@function,5 +.align 32 +chacha20_16x: +.cfi_startproc +.Lchacha20_16x: + mov %rsp,%r9 # frame register +.cfi_def_cfa_register %r9 + sub \$64+$xframe,%rsp + and \$-64,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L16x_body: +___ +$code.=<<___; + vzeroupper + + lea .Lsigma(%rip),%r10 + vbroadcasti32x4 (%r10),$xa3 # key[0] + vbroadcasti32x4 ($key),$xb3 # key[1] + vbroadcasti32x4 16($key),$xc3 # key[2] + vbroadcasti32x4 ($counter),$xd3 # key[3] + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vpshufd \$0xaa,$xa3,$xa2 + vpshufd \$0xff,$xa3,$xa3 + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vpshufd \$0xaa,$xb3,$xb2 + vpshufd \$0xff,$xb3,$xb3 + vmovdqa64 $xb0,@key[4] + vmovdqa64 $xb1,@key[5] + vmovdqa64 $xb2,@key[6] + vmovdqa64 $xb3,@key[7] + + vpshufd \$0x00,$xc3,$xc0 + vpshufd \$0x55,$xc3,$xc1 + vpshufd \$0xaa,$xc3,$xc2 + vpshufd \$0xff,$xc3,$xc3 + vmovdqa64 $xc0,@key[8] + vmovdqa64 $xc1,@key[9] + vmovdqa64 $xc2,@key[10] + vmovdqa64 $xc3,@key[11] + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpshufd \$0xaa,$xd3,$xd2 + vpshufd \$0xff,$xd3,$xd3 + vpaddd .Lincz(%rip),$xd0,$xd0 # don't save counters yet + vmovdqa64 $xd0,@key[12] + vmovdqa64 $xd1,@key[13] + vmovdqa64 $xd2,@key[14] + vmovdqa64 $xd3,@key[15] + + mov \$10,%eax + jmp .Loop16x + +.align 32 +.Loop_outer16x: + vpbroadcastd 0(%r10),$xa0 # reload key + vpbroadcastd 4(%r10),$xa1 + vpbroadcastd 8(%r10),$xa2 + vpbroadcastd 12(%r10),$xa3 + vpaddd .Lsixteen(%rip),@key[12],@key[12] # next SIMD counters + vmovdqa64 @key[4],$xb0 + vmovdqa64 @key[5],$xb1 + vmovdqa64 @key[6],$xb2 + vmovdqa64 @key[7],$xb3 + vmovdqa64 @key[8],$xc0 + vmovdqa64 @key[9],$xc1 + vmovdqa64 @key[10],$xc2 + vmovdqa64 @key[11],$xc3 + vmovdqa64 @key[12],$xd0 + vmovdqa64 @key[13],$xd1 + vmovdqa64 @key[14],$xd2 + vmovdqa64 @key[15],$xd3 + + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + mov \$10,%eax + jmp .Loop16x + +.align 32 +.Loop16x: +___ + foreach (&AVX512_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&AVX512_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop16x + + vpaddd @key[0],$xa0,$xa0 # accumulate key + vpaddd @key[1],$xa1,$xa1 + vpaddd @key[2],$xa2,$xa2 + vpaddd @key[3],$xa3,$xa3 + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd @key[4],$xb0,$xb0 + vpaddd @key[5],$xb1,$xb1 + vpaddd @key[6],$xb2,$xb2 + vpaddd @key[7],$xb3,$xb3 + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); +$code.=<<___; + vshufi32x4 \$0x44,$xb0,$xa0,$xt3 # "de-interlace" 
further + vshufi32x4 \$0xee,$xb0,$xa0,$xb0 + vshufi32x4 \$0x44,$xb1,$xa1,$xa0 + vshufi32x4 \$0xee,$xb1,$xa1,$xb1 + vshufi32x4 \$0x44,$xb2,$xa2,$xa1 + vshufi32x4 \$0xee,$xb2,$xa2,$xb2 + vshufi32x4 \$0x44,$xb3,$xa3,$xa2 + vshufi32x4 \$0xee,$xb3,$xa3,$xb3 +___ + ($xa0,$xa1,$xa2,$xa3,$xt3)=($xt3,$xa0,$xa1,$xa2,$xa3); +$code.=<<___; + vpaddd @key[8],$xc0,$xc0 + vpaddd @key[9],$xc1,$xc1 + vpaddd @key[10],$xc2,$xc2 + vpaddd @key[11],$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd @key[12],$xd0,$xd0 + vpaddd @key[13],$xd1,$xd1 + vpaddd @key[14],$xd2,$xd2 + vpaddd @key[15],$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); +$code.=<<___; + vshufi32x4 \$0x44,$xd0,$xc0,$xt3 # "de-interlace" further + vshufi32x4 \$0xee,$xd0,$xc0,$xd0 + vshufi32x4 \$0x44,$xd1,$xc1,$xc0 + vshufi32x4 \$0xee,$xd1,$xc1,$xd1 + vshufi32x4 \$0x44,$xd2,$xc2,$xc1 + vshufi32x4 \$0xee,$xd2,$xc2,$xd2 + vshufi32x4 \$0x44,$xd3,$xc3,$xc2 + vshufi32x4 \$0xee,$xd3,$xc3,$xd3 +___ + ($xc0,$xc1,$xc2,$xc3,$xt3)=($xt3,$xc0,$xc1,$xc2,$xc3); +$code.=<<___; + vshufi32x4 \$0x88,$xc0,$xa0,$xt0 # "de-interlace" further + vshufi32x4 \$0xdd,$xc0,$xa0,$xa0 + vshufi32x4 \$0x88,$xd0,$xb0,$xc0 + vshufi32x4 \$0xdd,$xd0,$xb0,$xd0 + vshufi32x4 \$0x88,$xc1,$xa1,$xt1 + vshufi32x4 \$0xdd,$xc1,$xa1,$xa1 + vshufi32x4 \$0x88,$xd1,$xb1,$xc1 + vshufi32x4 \$0xdd,$xd1,$xb1,$xd1 + vshufi32x4 \$0x88,$xc2,$xa2,$xt2 + vshufi32x4 \$0xdd,$xc2,$xa2,$xa2 + vshufi32x4 \$0x88,$xd2,$xb2,$xc2 + vshufi32x4 \$0xdd,$xd2,$xb2,$xd2 + vshufi32x4 \$0x88,$xc3,$xa3,$xt3 + vshufi32x4 \$0xdd,$xc3,$xa3,$xa3 + vshufi32x4 \$0x88,$xd3,$xb3,$xc3 + vshufi32x4 \$0xdd,$xd3,$xb3,$xd3 +___ + ($xa0,$xa1,$xa2,$xa3,$xb0,$xb1,$xb2,$xb3)= + ($xt0,$xt1,$xt2,$xt3,$xa0,$xa1,$xa2,$xa3); + + ($xa0,$xb0,$xc0,$xd0, $xa1,$xb1,$xc1,$xd1, + $xa2,$xb2,$xc2,$xd2, $xa3,$xb3,$xc3,$xd3) = + ($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3); +$code.=<<___; + cmp \$64*16,$len + jb .Ltail16x + + vpxord 0x00($inp),$xa0,$xa0 # xor with input + vpxord 0x40($inp),$xb0,$xb0 + vpxord 0x80($inp),$xc0,$xc0 + vpxord 0xc0($inp),$xd0,$xd0 + vmovdqu32 $xa0,0x00($out) + vmovdqu32 $xb0,0x40($out) + vmovdqu32 $xc0,0x80($out) + vmovdqu32 $xd0,0xc0($out) + + vpxord 0x100($inp),$xa1,$xa1 + vpxord 0x140($inp),$xb1,$xb1 + vpxord 0x180($inp),$xc1,$xc1 + vpxord 0x1c0($inp),$xd1,$xd1 + vmovdqu32 $xa1,0x100($out) + vmovdqu32 $xb1,0x140($out) + vmovdqu32 $xc1,0x180($out) + vmovdqu32 $xd1,0x1c0($out) + + vpxord 0x200($inp),$xa2,$xa2 + vpxord 0x240($inp),$xb2,$xb2 + vpxord 0x280($inp),$xc2,$xc2 + vpxord 0x2c0($inp),$xd2,$xd2 + vmovdqu32 $xa2,0x200($out) + vmovdqu32 $xb2,0x240($out) + vmovdqu32 $xc2,0x280($out) + vmovdqu32 $xd2,0x2c0($out) + + vpxord 0x300($inp),$xa3,$xa3 + vpxord 0x340($inp),$xb3,$xb3 + vpxord 0x380($inp),$xc3,$xc3 + vpxord 0x3c0($inp),$xd3,$xd3 + lea 0x400($inp),$inp + vmovdqu32 $xa3,0x300($out) + vmovdqu32 $xb3,0x340($out) + vmovdqu32 $xc3,0x380($out) + vmovdqu32 $xd3,0x3c0($out) + lea 0x400($out),$out + + sub 
\$64*16,$len + jnz .Loop_outer16x + + jmp .Ldone16x + +.align 32 +.Ltail16x: + xor %r10,%r10 + sub $inp,$out + cmp \$64*1,$len + jb .Less_than_64_16x + vpxord ($inp),$xa0,$xa0 # xor with input + vmovdqu32 $xa0,($out,$inp) + je .Ldone16x + vmovdqa32 $xb0,$xa0 + lea 64($inp),$inp + + cmp \$64*2,$len + jb .Less_than_64_16x + vpxord ($inp),$xb0,$xb0 + vmovdqu32 $xb0,($out,$inp) + je .Ldone16x + vmovdqa32 $xc0,$xa0 + lea 64($inp),$inp + + cmp \$64*3,$len + jb .Less_than_64_16x + vpxord ($inp),$xc0,$xc0 + vmovdqu32 $xc0,($out,$inp) + je .Ldone16x + vmovdqa32 $xd0,$xa0 + lea 64($inp),$inp + + cmp \$64*4,$len + jb .Less_than_64_16x + vpxord ($inp),$xd0,$xd0 + vmovdqu32 $xd0,($out,$inp) + je .Ldone16x + vmovdqa32 $xa1,$xa0 + lea 64($inp),$inp + + cmp \$64*5,$len + jb .Less_than_64_16x + vpxord ($inp),$xa1,$xa1 + vmovdqu32 $xa1,($out,$inp) + je .Ldone16x + vmovdqa32 $xb1,$xa0 + lea 64($inp),$inp + + cmp \$64*6,$len + jb .Less_than_64_16x + vpxord ($inp),$xb1,$xb1 + vmovdqu32 $xb1,($out,$inp) + je .Ldone16x + vmovdqa32 $xc1,$xa0 + lea 64($inp),$inp + + cmp \$64*7,$len + jb .Less_than_64_16x + vpxord ($inp),$xc1,$xc1 + vmovdqu32 $xc1,($out,$inp) + je .Ldone16x + vmovdqa32 $xd1,$xa0 + lea 64($inp),$inp + + cmp \$64*8,$len + jb .Less_than_64_16x + vpxord ($inp),$xd1,$xd1 + vmovdqu32 $xd1,($out,$inp) + je .Ldone16x + vmovdqa32 $xa2,$xa0 + lea 64($inp),$inp + + cmp \$64*9,$len + jb .Less_than_64_16x + vpxord ($inp),$xa2,$xa2 + vmovdqu32 $xa2,($out,$inp) + je .Ldone16x + vmovdqa32 $xb2,$xa0 + lea 64($inp),$inp + + cmp \$64*10,$len + jb .Less_than_64_16x + vpxord ($inp),$xb2,$xb2 + vmovdqu32 $xb2,($out,$inp) + je .Ldone16x + vmovdqa32 $xc2,$xa0 + lea 64($inp),$inp + + cmp \$64*11,$len + jb .Less_than_64_16x + vpxord ($inp),$xc2,$xc2 + vmovdqu32 $xc2,($out,$inp) + je .Ldone16x + vmovdqa32 $xd2,$xa0 + lea 64($inp),$inp + + cmp \$64*12,$len + jb .Less_than_64_16x + vpxord ($inp),$xd2,$xd2 + vmovdqu32 $xd2,($out,$inp) + je .Ldone16x + vmovdqa32 $xa3,$xa0 + lea 64($inp),$inp + + cmp \$64*13,$len + jb .Less_than_64_16x + vpxord ($inp),$xa3,$xa3 + vmovdqu32 $xa3,($out,$inp) + je .Ldone16x + vmovdqa32 $xb3,$xa0 + lea 64($inp),$inp + + cmp \$64*14,$len + jb .Less_than_64_16x + vpxord ($inp),$xb3,$xb3 + vmovdqu32 $xb3,($out,$inp) + je .Ldone16x + vmovdqa32 $xc3,$xa0 + lea 64($inp),$inp + + cmp \$64*15,$len + jb .Less_than_64_16x + vpxord ($inp),$xc3,$xc3 + vmovdqu32 $xc3,($out,$inp) + je .Ldone16x + vmovdqa32 $xd3,$xa0 + lea 64($inp),$inp + +.Less_than_64_16x: + vmovdqa32 $xa0,0x00(%rsp) + lea ($out,$inp),$out + and \$63,$len + +.Loop_tail16x: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail16x + + vpxord $xa0,$xa0,$xa0 + vmovdqa32 $xa0,0(%rsp) + +.Ldone16x: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.L16x_epilogue: + ret +.cfi_endproc +.size chacha20_16x,.-chacha20_16x +___ + +# switch to %ymm domain +($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3)=map("%ymm$_",(0..15)); +@xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3); +@key=map("%ymm$_",(16..31)); +($xt0,$xt1,$xt2,$xt3)=@key[0..3]; + +$code.=<<___; +.global chacha20_8xvl 
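+# Both chacha20_16x above and chacha20_8xvl below drive AVX512_lane_ROUND,
+# which applies the standard ChaCha20 quarter-round to 16 (%zmm) or 8 (%ymm)
+# independent blocks at once, one state word per register; .Lincz/.Lincy give
+# each lane its own block counter and .Lsixteen/.Leight step all counters per
+# outer iteration. For reference, a commented-out scalar model of that
+# quarter-round (hypothetical helper, not part of the generated code):
+#
+#	sub quarter_round_ref {
+#	    my ($a, $b, $c, $d) = @_;
+#	    my $M = 0xffffffff;		# stay within 32 bits
+#	    my $rotl = sub { my ($x, $n) = @_; (($x << $n) | ($x >> (32 - $n))) & $M };
+#	    $a = ($a + $b) & $M; $d = $rotl->($d ^ $a, 16);
+#	    $c = ($c + $d) & $M; $b = $rotl->($b ^ $c, 12);
+#	    $a = ($a + $b) & $M; $d = $rotl->($d ^ $a, 8);
+#	    $c = ($c + $d) & $M; $b = $rotl->($b ^ $c, 7);
+#	    return ($a, $b, $c, $d);
+#	}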
+.type chacha20_8xvl,\@function,5 +.align 32 +chacha20_8xvl: +.cfi_startproc +.Lchacha20_8xvl: + mov %rsp,%r9 # frame register +.cfi_def_cfa_register %r9 + sub \$64+$xframe,%rsp + and \$-64,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L8xvl_body: +___ +$code.=<<___; + vzeroupper + + lea .Lsigma(%rip),%r10 + vbroadcasti128 (%r10),$xa3 # key[0] + vbroadcasti128 ($key),$xb3 # key[1] + vbroadcasti128 16($key),$xc3 # key[2] + vbroadcasti128 ($counter),$xd3 # key[3] + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vpshufd \$0xaa,$xa3,$xa2 + vpshufd \$0xff,$xa3,$xa3 + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vpshufd \$0xaa,$xb3,$xb2 + vpshufd \$0xff,$xb3,$xb3 + vmovdqa64 $xb0,@key[4] + vmovdqa64 $xb1,@key[5] + vmovdqa64 $xb2,@key[6] + vmovdqa64 $xb3,@key[7] + + vpshufd \$0x00,$xc3,$xc0 + vpshufd \$0x55,$xc3,$xc1 + vpshufd \$0xaa,$xc3,$xc2 + vpshufd \$0xff,$xc3,$xc3 + vmovdqa64 $xc0,@key[8] + vmovdqa64 $xc1,@key[9] + vmovdqa64 $xc2,@key[10] + vmovdqa64 $xc3,@key[11] + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpshufd \$0xaa,$xd3,$xd2 + vpshufd \$0xff,$xd3,$xd3 + vpaddd .Lincy(%rip),$xd0,$xd0 # don't save counters yet + vmovdqa64 $xd0,@key[12] + vmovdqa64 $xd1,@key[13] + vmovdqa64 $xd2,@key[14] + vmovdqa64 $xd3,@key[15] + + mov \$10,%eax + jmp .Loop8xvl + +.align 32 +.Loop_outer8xvl: + #vpbroadcastd 0(%r10),$xa0 # reload key + #vpbroadcastd 4(%r10),$xa1 + vpbroadcastd 8(%r10),$xa2 + vpbroadcastd 12(%r10),$xa3 + vpaddd .Leight(%rip),@key[12],@key[12] # next SIMD counters + vmovdqa64 @key[4],$xb0 + vmovdqa64 @key[5],$xb1 + vmovdqa64 @key[6],$xb2 + vmovdqa64 @key[7],$xb3 + vmovdqa64 @key[8],$xc0 + vmovdqa64 @key[9],$xc1 + vmovdqa64 @key[10],$xc2 + vmovdqa64 @key[11],$xc3 + vmovdqa64 @key[12],$xd0 + vmovdqa64 @key[13],$xd1 + vmovdqa64 @key[14],$xd2 + vmovdqa64 @key[15],$xd3 + + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + mov \$10,%eax + jmp .Loop8xvl + +.align 32 +.Loop8xvl: +___ + foreach (&AVX512_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&AVX512_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop8xvl + + vpaddd @key[0],$xa0,$xa0 # accumulate key + vpaddd @key[1],$xa1,$xa1 + vpaddd @key[2],$xa2,$xa2 + vpaddd @key[3],$xa3,$xa3 + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd @key[4],$xb0,$xb0 + vpaddd @key[5],$xb1,$xb1 + vpaddd @key[6],$xb2,$xb2 + vpaddd @key[7],$xb3,$xb3 + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); +$code.=<<___; + vshufi32x4 \$0,$xb0,$xa0,$xt3 # "de-interlace" further + vshufi32x4 \$3,$xb0,$xa0,$xb0 + 
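+# Note: with 256-bit operands vshufi32x4 selects one 128-bit lane per source,
+# so immediate \$0 gathers the low lanes of both sources and immediate \$3 the
+# high ones -- the AVX512VL counterpart of the vperm2i128 \$0x20/\$0x31 pair
+# used for the c/d rows below, and of the \$0x44/\$0xee immediates that pick
+# lane pairs out of the four-lane %zmm registers in chacha20_16x.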
vshufi32x4 \$0,$xb1,$xa1,$xa0 + vshufi32x4 \$3,$xb1,$xa1,$xb1 + vshufi32x4 \$0,$xb2,$xa2,$xa1 + vshufi32x4 \$3,$xb2,$xa2,$xb2 + vshufi32x4 \$0,$xb3,$xa3,$xa2 + vshufi32x4 \$3,$xb3,$xa3,$xb3 +___ + ($xa0,$xa1,$xa2,$xa3,$xt3)=($xt3,$xa0,$xa1,$xa2,$xa3); +$code.=<<___; + vpaddd @key[8],$xc0,$xc0 + vpaddd @key[9],$xc1,$xc1 + vpaddd @key[10],$xc2,$xc2 + vpaddd @key[11],$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd @key[12],$xd0,$xd0 + vpaddd @key[13],$xd1,$xd1 + vpaddd @key[14],$xd2,$xd2 + vpaddd @key[15],$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); +$code.=<<___; + vperm2i128 \$0x20,$xd0,$xc0,$xt3 # "de-interlace" further + vperm2i128 \$0x31,$xd0,$xc0,$xd0 + vperm2i128 \$0x20,$xd1,$xc1,$xc0 + vperm2i128 \$0x31,$xd1,$xc1,$xd1 + vperm2i128 \$0x20,$xd2,$xc2,$xc1 + vperm2i128 \$0x31,$xd2,$xc2,$xd2 + vperm2i128 \$0x20,$xd3,$xc3,$xc2 + vperm2i128 \$0x31,$xd3,$xc3,$xd3 +___ + ($xc0,$xc1,$xc2,$xc3,$xt3)=($xt3,$xc0,$xc1,$xc2,$xc3); + ($xb0,$xb1,$xb2,$xb3,$xc0,$xc1,$xc2,$xc3)= + ($xc0,$xc1,$xc2,$xc3,$xb0,$xb1,$xb2,$xb3); +$code.=<<___; + cmp \$64*8,$len + jb .Ltail8xvl + + mov \$0x80,%eax # size optimization + vpxord 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + lea ($inp,%rax),$inp # size optimization + vmovdqu32 $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + lea ($out,%rax),$out # size optimization + + vpxor 0x00($inp),$xa1,$xa1 + vpxor 0x20($inp),$xb1,$xb1 + vpxor 0x40($inp),$xc1,$xc1 + vpxor 0x60($inp),$xd1,$xd1 + lea ($inp,%rax),$inp # size optimization + vmovdqu $xa1,0x00($out) + vmovdqu $xb1,0x20($out) + vmovdqu $xc1,0x40($out) + vmovdqu $xd1,0x60($out) + lea ($out,%rax),$out # size optimization + + vpxord 0x00($inp),$xa2,$xa2 + vpxor 0x20($inp),$xb2,$xb2 + vpxor 0x40($inp),$xc2,$xc2 + vpxor 0x60($inp),$xd2,$xd2 + lea ($inp,%rax),$inp # size optimization + vmovdqu32 $xa2,0x00($out) + vmovdqu $xb2,0x20($out) + vmovdqu $xc2,0x40($out) + vmovdqu $xd2,0x60($out) + lea ($out,%rax),$out # size optimization + + vpxor 0x00($inp),$xa3,$xa3 + vpxor 0x20($inp),$xb3,$xb3 + vpxor 0x40($inp),$xc3,$xc3 + vpxor 0x60($inp),$xd3,$xd3 + lea ($inp,%rax),$inp # size optimization + vmovdqu $xa3,0x00($out) + vmovdqu $xb3,0x20($out) + vmovdqu $xc3,0x40($out) + vmovdqu $xd3,0x60($out) + lea ($out,%rax),$out # size optimization + + vpbroadcastd 0(%r10),%ymm0 # reload key + vpbroadcastd 4(%r10),%ymm1 + + sub \$64*8,$len + jnz .Loop_outer8xvl + + jmp .Ldone8xvl + +.align 32 +.Ltail8xvl: + vmovdqa64 $xa0,%ymm8 # size optimization +___ +$xa0 = "%ymm8"; +$code.=<<___; + xor %r10,%r10 + sub $inp,$out + cmp \$64*1,$len + jb .Less_than_64_8xvl + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vmovdqu $xa0,0x00($out,$inp) + vmovdqu $xb0,0x20($out,$inp) + je .Ldone8xvl + vmovdqa $xc0,$xa0 + vmovdqa $xd0,$xb0 + lea 64($inp),$inp + + cmp \$64*2,$len + jb .Less_than_64_8xvl + vpxor 
0x00($inp),$xc0,$xc0
+	vpxor	0x20($inp),$xd0,$xd0
+	vmovdqu	$xc0,0x00($out,$inp)
+	vmovdqu	$xd0,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xa1,$xa0
+	vmovdqa	$xb1,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*3,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xa1,$xa1
+	vpxor	0x20($inp),$xb1,$xb1
+	vmovdqu	$xa1,0x00($out,$inp)
+	vmovdqu	$xb1,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xc1,$xa0
+	vmovdqa	$xd1,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*4,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xc1,$xc1
+	vpxor	0x20($inp),$xd1,$xd1
+	vmovdqu	$xc1,0x00($out,$inp)
+	vmovdqu	$xd1,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa32	$xa2,$xa0
+	vmovdqa	$xb2,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*5,$len
+	jb	.Less_than_64_8xvl
+	vpxord	0x00($inp),$xa2,$xa2
+	vpxor	0x20($inp),$xb2,$xb2
+	vmovdqu32	$xa2,0x00($out,$inp)
+	vmovdqu	$xb2,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xc2,$xa0
+	vmovdqa	$xd2,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*6,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xc2,$xc2
+	vpxor	0x20($inp),$xd2,$xd2
+	vmovdqu	$xc2,0x00($out,$inp)
+	vmovdqu	$xd2,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xa3,$xa0
+	vmovdqa	$xb3,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*7,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xa3,$xa3
+	vpxor	0x20($inp),$xb3,$xb3
+	vmovdqu	$xa3,0x00($out,$inp)
+	vmovdqu	$xb3,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xc3,$xa0
+	vmovdqa	$xd3,$xb0
+	lea	64($inp),$inp
+
+.Less_than_64_8xvl:
+	vmovdqa	$xa0,0x00(%rsp)
+	vmovdqa	$xb0,0x20(%rsp)
+	lea	($out,$inp),$out
+	and	\$63,$len
+
+.Loop_tail8xvl:
+	movzb	($inp,%r10),%eax
+	movzb	(%rsp,%r10),%ecx
+	lea	1(%r10),%r10
+	xor	%ecx,%eax
+	mov	%al,-1($out,%r10)
+	dec	$len
+	jnz	.Loop_tail8xvl
+
+	vpxor	$xa0,$xa0,$xa0
+	vmovdqa	$xa0,0x00(%rsp)
+	vmovdqa	$xa0,0x20(%rsp)
+
+.Ldone8xvl:
+	vzeroall
+___
+$code.=<<___	if ($win64);
+	movaps	-0xa8(%r9),%xmm6
+	movaps	-0x98(%r9),%xmm7
+	movaps	-0x88(%r9),%xmm8
+	movaps	-0x78(%r9),%xmm9
+	movaps	-0x68(%r9),%xmm10
+	movaps	-0x58(%r9),%xmm11
+	movaps	-0x48(%r9),%xmm12
+	movaps	-0x38(%r9),%xmm13
+	movaps	-0x28(%r9),%xmm14
+	movaps	-0x18(%r9),%xmm15
+___
+$code.=<<___;
+	lea	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L8xvl_epilogue:
+	ret
+.cfi_endproc
+.size	chacha20_8xvl,.-chacha20_8xvl
+___
+}
+
+# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
+#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
+if ($win64) {
+$rec="%rcx";
+$frame="%rdx";
+$context="%r8";
+$disp="%r9";
+
+$code.=<<___;
+.extern	__imp_RtlVirtualUnwind
+.type	ssse3_handler,\@abi-omnipotent
+.align	16
+ssse3_handler:
+	push	%rsi
+	push	%rdi
+	push	%rbx
+	push	%rbp
+	push	%r12
+	push	%r13
+	push	%r14
+	push	%r15
+	pushfq
+	sub	\$64,%rsp
+
+	mov	120($context),%rax	# pull context->Rax
+	mov	248($context),%rbx	# pull context->Rip
+
+	mov	8($disp),%rsi		# disp->ImageBase
+	mov	56($disp),%r11		# disp->HandlerData
+
+	mov	0(%r11),%r10d		# HandlerData[0]
+	lea	(%rsi,%r10),%r10	# prologue label
+	cmp	%r10,%rbx		# context->Rip<prologue label
+	jb	.Lcommon_seh_tail
+
+	mov	192($context),%rax	# pull context->R9
+
+	mov	4(%r11),%r10d		# HandlerData[1]
+	lea	(%rsi,%r10),%r10	# epilogue label
+	cmp	%r10,%rbx		# context->Rip>=epilogue label
+	jae	.Lcommon_seh_tail
+
+	lea	-0x28(%rax),%rsi
+	lea	512($context),%rdi	# &context.Xmm6
+	mov	\$4,%ecx
+	.long	0xa548f3fc		# cld; rep movsq
+
+.Lcommon_seh_tail:
+	mov	8(%rax),%rdi
+	mov	16(%rax),%rsi
+	mov	%rax,152($context)	# restore context->Rsp
+	mov	%rsi,168($context)	# restore context->Rsi
+	mov	%rdi,176($context)	# restore context->Rdi
+
+	mov	40($disp),%rdi		# disp->ContextRecord
+	mov	$context,%rsi		# context
+	mov	\$154,%ecx		# sizeof(CONTEXT)
+	.long	0xa548f3fc		# cld; rep movsq
+
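+	# The copied CONTEXT is handed to RtlVirtualUnwind so the OS can keep
+	# unwinding in our caller. Per the Win64 calling convention the first
+	# four arguments travel in %rcx/%rdx/%r8/%r9; arguments five to eight
+	# go in the 64-byte stack area reserved by the "sub \$64,%rsp" above.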
+	mov	$disp,%rsi
+	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
+	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
+	mov	0(%rsi),%r8		# arg3, disp->ControlPc
+	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
+	mov	40(%rsi),%r10		# disp->ContextRecord
+	lea	56(%rsi),%r11		# &disp->HandlerData
+	lea	24(%rsi),%r12		# &disp->EstablisherFrame
+	mov	%r10,32(%rsp)		# arg5
+	mov	%r11,40(%rsp)		# arg6
+	mov	%r12,48(%rsp)		# arg7
+	mov	%rcx,56(%rsp)		# arg8, (NULL)
+	call	*__imp_RtlVirtualUnwind(%rip)
+
+	mov	\$1,%eax		# ExceptionContinueSearch
+	add	\$64,%rsp
+	popfq
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+	pop	%rbp
+	pop	%rbx
+	pop	%rdi
+	pop	%rsi
+	ret
+.size	ssse3_handler,.-ssse3_handler
+
+.type	full_handler,\@abi-omnipotent
+.align	16
+full_handler:
+	push	%rsi
+	push	%rdi
+	push	%rbx
+	push	%rbp
+	push	%r12
+	push	%r13
+	push	%r14
+	push	%r15
+	pushfq
+	sub	\$64,%rsp
+
+	mov	120($context),%rax	# pull context->Rax
+	mov	248($context),%rbx	# pull context->Rip
+
+	mov	8($disp),%rsi		# disp->ImageBase
+	mov	56($disp),%r11		# disp->HandlerData
+
+	mov	0(%r11),%r10d		# HandlerData[0]
+	lea	(%rsi,%r10),%r10	# prologue label
+	cmp	%r10,%rbx		# context->Rip<prologue label
+	jb	.Lcommon_seh_tail
+
+	mov	192($context),%rax	# pull context->R9
+
+	mov	4(%r11),%r10d		# HandlerData[1]
+	lea	(%rsi,%r10),%r10	# epilogue label
+	cmp	%r10,%rbx		# context->Rip>=epilogue label
+	jae	.Lcommon_seh_tail
+
+	lea	-0xa8(%rax),%rsi
+	lea	512($context),%rdi	# &context.Xmm6
+	mov	\$20,%ecx
+	.long	0xa548f3fc		# cld; rep movsq
+
+	jmp	.Lcommon_seh_tail
+.size	full_handler,.-full_handler
+
+.section	.pdata
+.align	4
+	.rva	.LSEH_begin_chacha20_ssse3
+	.rva	.LSEH_end_chacha20_ssse3
+	.rva	.LSEH_info_chacha20_ssse3
+
+	.rva	.LSEH_begin_chacha20_4x
+	.rva	.LSEH_end_chacha20_4x
+	.rva	.LSEH_info_chacha20_4x
+___
+$code.=<<___ if ($avx && 0);
+	.rva	.LSEH_begin_chacha20_4xop
+	.rva	.LSEH_end_chacha20_4xop
+	.rva	.LSEH_info_chacha20_4xop
+___
+$code.=<<___ if ($avx>1);
+	.rva	.LSEH_begin_chacha20_avx2
+	.rva	.LSEH_end_chacha20_avx2
+	.rva	.LSEH_info_chacha20_avx2
+___
+$code.=<<___ if ($avx>2);
+	.rva	.LSEH_begin_chacha20_avx512
+	.rva	.LSEH_end_chacha20_avx512
+	.rva	.LSEH_info_chacha20_avx512
+
+	.rva	.LSEH_begin_chacha20_avx512vl
+	.rva	.LSEH_end_chacha20_avx512vl
+	.rva	.LSEH_info_chacha20_avx512vl
+
+	.rva	.LSEH_begin_chacha20_16x
+	.rva	.LSEH_end_chacha20_16x
+	.rva	.LSEH_info_chacha20_16x
+
+	.rva	.LSEH_begin_chacha20_8xvl
+	.rva	.LSEH_end_chacha20_8xvl
+	.rva	.LSEH_info_chacha20_8xvl
+___
+$code.=<<___;
+.section	.xdata
+.align	8
+.LSEH_info_chacha20_ssse3:
+	.byte	9,0,0,0
+	.rva	ssse3_handler
+	.rva	.Lssse3_body,.Lssse3_epilogue
+
+.LSEH_info_chacha20_4x:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L4x_body,.L4x_epilogue
+___
+$code.=<<___ if ($avx&&0);
+.LSEH_info_chacha20_4xop:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L4xop_body,.L4xop_epilogue	# HandlerData[]
+___
+$code.=<<___ if ($avx>1);
+.LSEH_info_chacha20_avx2:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L8x_body,.L8x_epilogue		# HandlerData[]
+___
+$code.=<<___ if ($avx>2);
+.LSEH_info_chacha20_avx512:
+	.byte	9,0,0,0
+	.rva	ssse3_handler
+	.rva	.Lavx512_body,.Lavx512_epilogue	# HandlerData[]
+
+.LSEH_info_chacha20_avx512vl:
+	.byte	9,0,0,0
+	.rva	ssse3_handler
+	.rva	.Lavx512vl_body,.Lavx512vl_epilogue	# HandlerData[]
+
+.LSEH_info_chacha20_16x:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L16x_body,.L16x_epilogue	# HandlerData[]
+
+.LSEH_info_chacha20_8xvl:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L8xvl_body,.L8xvl_epilogue	# HandlerData[]
+___
+}
+
+foreach (split("\n",$code)) {
+	s/\`([^\`]*)\`/eval $1/ge;
+
+	s/%x#%[yz]/%x/g;	# "down-shift"
+
+	print $_,"\n";
+}
+
+close STDOUT;
diff --git a/crypto/make_poly1305_x64.pl b/crypto/make_poly1305_x64.pl
new file mode 100644
index 0000000..f7a2ab7
--- /dev/null
+++ b/crypto/make_poly1305_x64.pl
@@ -0,0 +1,4719 @@
+#! /usr/bin/env perl
+# Copyright 2016 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the OpenSSL license (the "License"). You may not use
+# this file except in compliance with the License. You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+
+#
+# ====================================================================
+# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+#
+# This module implements Poly1305 hash for x86_64.
+#
+# March 2015
+#
+# Initial release.
+#
+# December 2016
+#
+# Add AVX512F+VL+BW code path.
+#
+# November 2017
+#
+# Convert AVX512F+VL+BW code path to pure AVX512F, so that it can be
+# executed even on Knights Landing. Trigger for modification was
+# observation that AVX512 code paths can negatively affect overall
+# Skylake-X system performance. Since we are likely to suppress
+# AVX512F capability flag [at least on Skylake-X], conversion serves
+# as kind of "investment protection". Note that next *lake processor,
+# Cannonlake, has AVX512IFMA code path to execute...
+#
+# Numbers are cycles per processed byte with poly1305_blocks_x86_64 alone,
+# measured with rdtsc at fixed clock frequency.
+#
+#		IALU/gcc-4.8(*)	AVX(**)		AVX2	AVX-512
+# P4		4.46/+120%	-
+# Core 2	2.41/+90%	-
+# Westmere	1.88/+120%	-
+# Sandy Bridge	1.39/+140%	1.10
+# Haswell	1.14/+175%	1.11		0.65
+# Skylake[-X]	1.13/+120%	0.96		0.51	[0.35]
+# Silvermont	2.83/+95%	-
+# Knights L	3.60/?		1.65		1.10	0.41(***)
+# Goldmont	1.70/+180%	-
+# VIA Nano	1.82/+150%	-
+# Sledgehammer	1.38/+160%	-
+# Bulldozer	2.30/+130%	0.97
+# Ryzen		1.15/+200%	1.08		1.18
+#
+# (*)	improvement coefficients relative to clang are more modest and
+#	are ~50% on most processors; in both cases we are comparing to
+#	__int128 code;
+# (**)	an SSE2 implementation was attempted, but among non-AVX processors
+#	it was faster than integer-only code only on older Intel P4 and
+#	Core processors, by 30-50%, with the gain shrinking the newer the
+#	processor; on contemporary ones it is slower, for example almost
+#	2x slower on Atom, and as the former are naturally disappearing,
+#	SSE2 is deemed unnecessary;
+# (***)	strangely enough performance seems to vary from core to core;
+#	listed result is best case;
+
+$flavour = shift;
+$output = shift;
+if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }
+
+$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
+( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
+die "can't locate x86_64-xlate.pl";
+
+$avx = 3;
+$avx = 2 if ($flavour =~ /macosx/);
+
+open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
+*STDOUT=*OUT;
+
+my ($ctx,$inp,$len,$padbit)=("%rdi","%rsi","%rdx","%rcx");
+my ($mac,$nonce)=($inp,$len);	# *_emit arguments
+my ($d1,$d2,$d3, $r0,$r1,$s1)=map("%r$_",(8..13));
+my ($h0,$h1,$h2)=("%r14","%rbx","%rbp");
+
+sub poly1305_iteration {
+# input:	copy of $r1 in %rax, $h0-$h2, $r0-$r1
+# output:	$h0-$h2 *= $r0-$r1
+$code.=<<___;
+	mulq	$h0			# h0*r1
+	mov	%rax,$d2
+	mov	$r0,%rax
+	mov	%rdx,$d3
+
+	mulq	$h0			# h0*r0
+	mov	%rax,$h0		# future $h0
+	mov	$r0,%rax
+	mov	%rdx,$d1
+
+	mulq	$h1			# h1*r0
+	add	%rax,$d2
+	mov	$s1,%rax
+	adc	%rdx,$d3
+
+	mulq	$h1			# h1*s1
+	mov	$h2,$h1			# borrow $h1
+	add	%rax,$h0
+	adc	%rdx,$d1
+
+	imulq	$s1,$h1			# h2*s1
+	add	$h1,$d2
+	mov	$d1,$h1
+	adc	\$0,$d3
+
+	imulq	$r0,$h2			# h2*r0
+	add	$d2,$h1
+	mov	\$-4,%rax		# mask value
+	adc	$h2,$d3
+
+	and	$d3,%rax		# last reduction step
+	mov	$d3,$h2
+	shr	\$2,$d3
+	and	\$3,$h2
+	add	$d3,%rax
+	add	%rax,$h0
+	adc	\$0,$h1
+	adc	\$0,$h2
+___
+}
+
+########################################################################
+# Layout of opaque area is following.
+# +# unsigned __int64 h[3]; # current hash value base 2^64 +# unsigned __int64 r[2]; # key value base 2^64 + + +$code.=<<___; +.align 64 +.Lconst: +.Lmask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +.L129: +.long `1<<24`,0,`1<<24`,0,`1<<24`,0,`1<<24`,0 +.Lmask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +.Lpermd_avx2: +.long 2,2,2,3,2,0,2,1 +.Lpermd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +.L2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +.L2_44_inp_shift: +.quad 0,12,24,64 +.L2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +.L2_44_shift_rgt: +.quad 44,44,42,64 +.L2_44_shift_lft: +.quad 8,8,10,64 + +.align 64 +.Lx_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.Lx_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +.text + + +.global poly1305_init_x86_64 +.global poly1305_blocks_x86_64 +.global poly1305_emit_x86_64 +.global poly1305_emit_avx +.global poly1305_blocks_avx +.global poly1305_blocks_avx2 +.global poly1305_blocks_avx512 + + +.type poly1305_init_x86_64,\@function,3 +.align 32 +poly1305_init_x86_64: + xor %rax,%rax + mov %rax,0($ctx) # initialize hash value + mov %rax,8($ctx) + mov %rax,16($ctx) + + cmp \$0,$inp + je .Lno_key + +# lea poly1305_blocks_x86_64(%rip),%r10 +# lea poly1305_emit_x86_64(%rip),%r11 +___ +#$code.=<<___ if ($avx); +# mov OPENSSL_ia32cap_P+4(%rip),%r9 +# lea poly1305_blocks_avx(%rip),%rax +# lea poly1305_emit_avx(%rip),%rcx +# bt \$`60-32`,%r9 # AVX? +# cmovc %rax,%r10 +# cmovc %rcx,%r11 +#___ +#$code.=<<___ if ($avx>1); +# lea poly1305_blocks_avx2(%rip),%rax +# bt \$`5+32`,%r9 # AVX2? 
+# cmovc %rax,%r10 +#___ +#$code.=<<___ if ($avx>3); +# mov \$`(1<<31|1<<21|1<<16)`,%rax +# shr \$32,%r9 +# and %rax,%r9 +# cmp %rax,%r9 +# je .Linit_base2_44 +#___ +$code.=<<___; + mov \$0x0ffffffc0fffffff,%rax + mov \$0x0ffffffc0ffffffc,%rcx + and 0($inp),%rax + and 8($inp),%rcx + mov %rax,24($ctx) + mov %rcx,32($ctx) +___ +#$code.=<<___ if ($flavour !~ /elf32/); +# mov %r10,0(%rdx) +# mov %r11,8(%rdx) +#___ +#$code.=<<___ if ($flavour =~ /elf32/); +# mov %r10d,0(%rdx) +# mov %r11d,4(%rdx) +#___ +$code.=<<___; + mov \$1,%eax +.Lno_key: + ret +.size poly1305_init_x86_64,.-poly1305_init_x86_64 + +.type poly1305_blocks_x86_64,\@function,4 +.align 32 +poly1305_blocks_x86_64: +.cfi_startproc +.Lblocks: + shr \$4,$len + jz .Lno_data # too short + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_body: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2 + + mov $s1,$r1 + shr \$2,$s1 + mov $r1,%rax + add $r1,$s1 # s1 = r1 + (r1 >> 2) + jmp .Loop + +.align 32 +.Loop: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 +___ + &poly1305_iteration(); +$code.=<<___; + mov $r1,%rax + dec %r15 # len-=16 + jnz .Loop + + mov $h0,0($ctx) # store hash value + mov $h1,8($ctx) + mov $h2,16($ctx) + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data: +.Lblocks_epilogue: + ret +.cfi_endproc +.size poly1305_blocks_x86_64,.-poly1305_blocks_x86_64 + +.type poly1305_emit_x86_64,\@function,3 +.align 32 +poly1305_emit_x86_64: +.Lemit: + mov 0($ctx),%r8 # load hash value + mov 8($ctx),%r9 + mov 16($ctx),%r10 + + mov %r8,%rax + add \$5,%r8 # compare to modulus + mov %r9,%rcx + adc \$0,%r9 + adc \$0,%r10 + shr \$2,%r10 # did 130-bit value overflow? + cmovnz %r8,%rax + cmovnz %r9,%rcx + + add 0($nonce),%rax # accumulate nonce + adc 8($nonce),%rcx + mov %rax,0($mac) # write result + mov %rcx,8($mac) + + ret +.size poly1305_emit_x86_64,.-poly1305_emit_x86_64 +___ +if ($avx) { + +######################################################################## +# Layout of opaque area is following. +# +# unsigned __int32 h[5]; # current hash value base 2^26 +# unsigned __int32 is_base2_26; +# unsigned __int64 r[2]; # key value base 2^64 +# unsigned __int64 pad; +# struct { unsigned __int32 r^2, r^1, r^4, r^3; } r[9]; +# +# where r^n are base 2^26 digits of degrees of multiplier key. There are +# 5 digits, but last four are interleaved with multiples of 5, totalling +# in 9 elements: r0, r1, 5*r1, r2, 5*r2, r3, 5*r3, r4, 5*r4. 
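+
+# The interleaved 5*r digits above exist because 2^130 is congruent to 5
+# modulo 2^130-5: any partial product that spills past bit 130 can be folded
+# back in by multiplying it with 5 instead of doing a full reduction. As a
+# cross-check, a hypothetical Math::BigInt reference model of the scalar
+# pipeline (key clamping, one block update, final tag) -- illustration only,
+# not used by the generated code:
+
+use Math::BigInt;
+my $P1305 = Math::BigInt->new(2)->bpow(130)->bsub(5);	# p = 2^130-5
+
+sub poly1305_clamp_ref {	# the masks applied by poly1305_init_x86_64
+    my ($k0, $k1) = @_;		# two little-endian 64-bit key words
+    return ($k0 & 0x0ffffffc0fffffff, $k1 & 0x0ffffffc0ffffffc);
+}
+
+sub poly1305_block_ref {	# h = (h + m + padbit*2^128) * r mod p
+    my ($h, $r, $m, $padbit) = @_;	# Math::BigInt except $padbit
+    my $t = $m->copy->badd(Math::BigInt->new($padbit)->blsft(128));
+    return $h->copy->badd($t)->bmul($r)->bmod($P1305);
+}
+
+sub poly1305_tag_ref {		# what the add-5/cmovnz sequence in *_emit computes
+    my ($h, $s) = @_;		# running hash plus encrypted nonce
+    return $h->copy->bmod($P1305)->badd($s)->bmod(Math::BigInt->new(2)->bpow(128));
+}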
+ +my ($H0,$H1,$H2,$H3,$H4, $T0,$T1,$T2,$T3,$T4, $D0,$D1,$D2,$D3,$D4, $MASK) = + map("%xmm$_",(0..15)); + +$code.=<<___; +.type __poly1305_block,\@abi-omnipotent +.align 32 +__poly1305_block: +___ + &poly1305_iteration(); +$code.=<<___; + ret +.size __poly1305_block,.-__poly1305_block + +.type __poly1305_init_avx,\@abi-omnipotent +.align 32 +__poly1305_init_avx: + mov $r0,$h0 + mov $r1,$h1 + xor $h2,$h2 + + lea 48+64($ctx),$ctx # size optimization + + mov $r1,%rax + call __poly1305_block # r^2 + + mov \$0x3ffffff,%eax # save interleaved r^2 and r base 2^26 + mov \$0x3ffffff,%edx + mov $h0,$d1 + and $h0#d,%eax + mov $r0,$d2 + and $r0#d,%edx + mov %eax,`16*0+0-64`($ctx) + shr \$26,$d1 + mov %edx,`16*0+4-64`($ctx) + shr \$26,$d2 + + mov \$0x3ffffff,%eax + mov \$0x3ffffff,%edx + and $d1#d,%eax + and $d2#d,%edx + mov %eax,`16*1+0-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov %edx,`16*1+4-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + mov %eax,`16*2+0-64`($ctx) + shr \$26,$d1 + mov %edx,`16*2+4-64`($ctx) + shr \$26,$d2 + + mov $h1,%rax + mov $r1,%rdx + shl \$12,%rax + shl \$12,%rdx + or $d1,%rax + or $d2,%rdx + and \$0x3ffffff,%eax + and \$0x3ffffff,%edx + mov %eax,`16*3+0-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov %edx,`16*3+4-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + mov %eax,`16*4+0-64`($ctx) + mov $h1,$d1 + mov %edx,`16*4+4-64`($ctx) + mov $r1,$d2 + + mov \$0x3ffffff,%eax + mov \$0x3ffffff,%edx + shr \$14,$d1 + shr \$14,$d2 + and $d1#d,%eax + and $d2#d,%edx + mov %eax,`16*5+0-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov %edx,`16*5+4-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + mov %eax,`16*6+0-64`($ctx) + shr \$26,$d1 + mov %edx,`16*6+4-64`($ctx) + shr \$26,$d2 + + mov $h2,%rax + shl \$24,%rax + or %rax,$d1 + mov $d1#d,`16*7+0-64`($ctx) + lea ($d1,$d1,4),$d1 # *5 + mov $d2#d,`16*7+4-64`($ctx) + lea ($d2,$d2,4),$d2 # *5 + mov $d1#d,`16*8+0-64`($ctx) + mov $d2#d,`16*8+4-64`($ctx) + + mov $r1,%rax + call __poly1305_block # r^3 + + mov \$0x3ffffff,%eax # save r^3 base 2^26 + mov $h0,$d1 + and $h0#d,%eax + shr \$26,$d1 + mov %eax,`16*0+12-64`($ctx) + + mov \$0x3ffffff,%edx + and $d1#d,%edx + mov %edx,`16*1+12-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*2+12-64`($ctx) + + mov $h1,%rax + shl \$12,%rax + or $d1,%rax + and \$0x3ffffff,%eax + mov %eax,`16*3+12-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov $h1,$d1 + mov %eax,`16*4+12-64`($ctx) + + mov \$0x3ffffff,%edx + shr \$14,$d1 + and $d1#d,%edx + mov %edx,`16*5+12-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*6+12-64`($ctx) + + mov $h2,%rax + shl \$24,%rax + or %rax,$d1 + mov $d1#d,`16*7+12-64`($ctx) + lea ($d1,$d1,4),$d1 # *5 + mov $d1#d,`16*8+12-64`($ctx) + + mov $r1,%rax + call __poly1305_block # r^4 + + mov \$0x3ffffff,%eax # save r^4 base 2^26 + mov $h0,$d1 + and $h0#d,%eax + shr \$26,$d1 + mov %eax,`16*0+8-64`($ctx) + + mov \$0x3ffffff,%edx + and $d1#d,%edx + mov %edx,`16*1+8-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*2+8-64`($ctx) + + mov $h1,%rax + shl \$12,%rax + or $d1,%rax + and \$0x3ffffff,%eax + mov %eax,`16*3+8-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov $h1,$d1 + mov %eax,`16*4+8-64`($ctx) + + mov \$0x3ffffff,%edx + shr \$14,$d1 + and $d1#d,%edx + mov %edx,`16*5+8-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*6+8-64`($ctx) + + mov $h2,%rax + shl \$24,%rax + or %rax,$d1 + mov $d1#d,`16*7+8-64`($ctx) + lea ($d1,$d1,4),$d1 # *5 + mov $d1#d,`16*8+8-64`($ctx) + + lea -48-64($ctx),$ctx # size [de-]optimization + ret +.size 
__poly1305_init_avx,.-__poly1305_init_avx + +.type poly1305_blocks_avx,\@function,4 +.align 32 +poly1305_blocks_avx: +.cfi_startproc + mov 20($ctx),%r8d # is_base2_26 + cmp \$128,$len + jae .Lblocks_avx + test %r8d,%r8d + jz .Lblocks + +.Lblocks_avx: + and \$-16,$len + jz .Lno_data_avx + + vzeroupper + + test %r8d,%r8d + jz .Lbase2_64_avx + + test \$31,$len + jz .Leven_avx + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_avx_body: + + mov $len,%r15 # reassign $len + + mov 0($ctx),$d1 # load hash value + mov 8($ctx),$d2 + mov 16($ctx),$h2#d + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + ################################# base 2^26 -> base 2^64 + mov $d1#d,$h0#d + and \$`-1*(1<<31)`,$d1 + mov $d2,$r1 # borrow $r1 + mov $d2#d,$h1#d + and \$`-1*(1<<31)`,$d2 + + shr \$6,$d1 + shl \$52,$r1 + add $d1,$h0 + shr \$12,$h1 + shr \$18,$d2 + add $r1,$h0 + adc $d2,$h1 + + mov $h2,$d1 + shl \$40,$d1 + shr \$24,$h2 + add $d1,$h1 + adc \$0,$h2 # can be partially reduced... + + mov \$-4,$d2 # ... so reduce + mov $h2,$d1 + and $h2,$d2 + shr \$2,$d1 + and \$3,$h2 + add $d2,$d1 # =*5 + add $d1,$h0 + adc \$0,$h1 + adc \$0,$h2 + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + + call __poly1305_block + + test $padbit,$padbit # if $padbit is zero, + jz .Lstore_base2_64_avx # store hash in base 2^64 format + + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$r0 + mov $h1,$r1 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$r0 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $r0,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$r1 + and \$0x3ffffff,$h1 # h[3] + or $r1,$h2 # h[4] + + sub \$16,%r15 + jz .Lstore_base2_26_avx + + vmovd %rax#d,$H0 + vmovd %rdx#d,$H1 + vmovd $h0#d,$H2 + vmovd $h1#d,$H3 + vmovd $h2#d,$H4 + jmp .Lproceed_avx + +.align 32 +.Lstore_base2_64_avx: + mov $h0,0($ctx) + mov $h1,8($ctx) + mov $h2,16($ctx) # note that is_base2_26 is zeroed + jmp .Ldone_avx + +.align 16 +.Lstore_base2_26_avx: + mov %rax#d,0($ctx) # store hash value base 2^26 + mov %rdx#d,4($ctx) + mov $h0#d,8($ctx) + mov $h1#d,12($ctx) + mov $h2#d,16($ctx) +.align 16 +.Ldone_avx: + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx: +.Lblocks_avx_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx: +.cfi_startproc + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lbase2_64_avx_body: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2#d + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + test \$31,$len + jz .Linit_avx + + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + +.Linit_avx: + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$d1 + mov 
$h1,$d2 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$d1 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $d1,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$d2 + and \$0x3ffffff,$h1 # h[3] + or $d2,$h2 # h[4] + + vmovd %rax#d,$H0 + vmovd %rdx#d,$H1 + vmovd $h0#d,$H2 + vmovd $h1#d,$H3 + vmovd $h2#d,$H4 + movl \$1,20($ctx) # set is_base2_26 + + call __poly1305_init_avx + +.Lproceed_avx: + mov %r15,$len + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rax + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx_epilogue: + jmp .Ldo_avx +.cfi_endproc + +.align 32 +.Leven_avx: +.cfi_startproc + vmovd 4*0($ctx),$H0 # load hash value + vmovd 4*1($ctx),$H1 + vmovd 4*2($ctx),$H2 + vmovd 4*3($ctx),$H3 + vmovd 4*4($ctx),$H4 + +.Ldo_avx: +___ +$code.=<<___ if (!$win64); + lea -0x58(%rsp),%r11 +.cfi_def_cfa %r11,0x60 + sub \$0x178,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x218,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx_body: +___ +$code.=<<___; + sub \$64,$len + lea -32($inp),%rax + cmovc %rax,$inp + + vmovdqu `16*3`($ctx),$D4 # preload r0^2 + lea `16*3+64`($ctx),$ctx # size optimization + lea .Lconst(%rip),%rcx + + ################################################################ + # load input + vmovdqu 16*2($inp),$T0 + vmovdqu 16*3($inp),$T1 + vmovdqa 64(%rcx),$MASK # .Lmask26 + + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + + vpsrlq \$40,$T4,$T4 # 4 + vpsrlq \$26,$T0,$T1 + vpand $MASK,$T0,$T0 # 0 + vpsrlq \$4,$T3,$T2 + vpand $MASK,$T1,$T1 # 1 + vpsrlq \$30,$T3,$T3 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + jbe .Lskip_loop_avx + + # expand and copy pre-calculated table to stack + vmovdqu `16*1-64`($ctx),$D1 + vmovdqu `16*2-64`($ctx),$D2 + vpshufd \$0xEE,$D4,$D3 # 34xx -> 3434 + vpshufd \$0x44,$D4,$D0 # xx12 -> 1212 + vmovdqa $D3,-0x90(%r11) + vmovdqa $D0,0x00(%rsp) + vpshufd \$0xEE,$D1,$D4 + vmovdqu `16*3-64`($ctx),$D0 + vpshufd \$0x44,$D1,$D1 + vmovdqa $D4,-0x80(%r11) + vmovdqa $D1,0x10(%rsp) + vpshufd \$0xEE,$D2,$D3 + vmovdqu `16*4-64`($ctx),$D1 + vpshufd \$0x44,$D2,$D2 + vmovdqa $D3,-0x70(%r11) + vmovdqa $D2,0x20(%rsp) + vpshufd \$0xEE,$D0,$D4 + vmovdqu `16*5-64`($ctx),$D2 + vpshufd \$0x44,$D0,$D0 + vmovdqa $D4,-0x60(%r11) + vmovdqa $D0,0x30(%rsp) + vpshufd \$0xEE,$D1,$D3 + vmovdqu `16*6-64`($ctx),$D0 + vpshufd \$0x44,$D1,$D1 + vmovdqa $D3,-0x50(%r11) + vmovdqa $D1,0x40(%rsp) + vpshufd \$0xEE,$D2,$D4 + vmovdqu `16*7-64`($ctx),$D1 + vpshufd \$0x44,$D2,$D2 + vmovdqa $D4,-0x40(%r11) + vmovdqa $D2,0x50(%rsp) + vpshufd \$0xEE,$D0,$D3 + vmovdqu `16*8-64`($ctx),$D2 + vpshufd \$0x44,$D0,$D0 + vmovdqa $D3,-0x30(%r11) + vmovdqa $D0,0x60(%rsp) + vpshufd \$0xEE,$D1,$D4 + vpshufd \$0x44,$D1,$D1 + vmovdqa $D4,-0x20(%r11) + vmovdqa $D1,0x70(%rsp) + vpshufd \$0xEE,$D2,$D3 + vmovdqa 0x00(%rsp),$D4 # preload r0^2 + vpshufd \$0x44,$D2,$D2 + vmovdqa $D3,-0x10(%r11) + vmovdqa $D2,0x80(%rsp) + + jmp .Loop_avx + +.align 32 +.Loop_avx: + 
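+	# The vpmuludq/vpaddq chains in this loop evaluate, limb by limb, the
+	# d0..d4 sums quoted in the comment block that follows. A commented-out
+	# scalar model (hypothetical, for cross-checking only) of one such
+	# multiply plus a lazy carry sweep -- the vector code interleaves an
+	# equivalent carry chain with the input loads:
+	#
+	#	sub poly1305_mul26_ref {
+	#	    my ($h, $r) = @_;		# refs to five 26-bit limbs each
+	#	    my @d = (0) x 5;
+	#	    for my $i (0 .. 4) {
+	#	        for my $j (0 .. 4) {	# wrap-around terms pick up a *5
+	#	            my $k = $i - $j;
+	#	            $d[$i] += $k >= 0 ? $h->[$j] * $r->[$k]
+	#	                              : 5 * $h->[$j] * $r->[$k + 5];
+	#	        }
+	#	    }
+	#	    for my $i (0 .. 4) {	# partial carry; limbs may stay >26 bits
+	#	        my $c = $d[$i] >> 26;  $d[$i] &= 0x3ffffff;
+	#	        $i < 4 ? ($d[$i+1] += $c) : ($d[0] += 5 * $c);
+	#	    }
+	#	    return @d;
+	#	}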
################################################################ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r + # \___________________/ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r + # \___________________/ \____________________/ + # + # Note that we start with inp[2:3]*r^2. This is because it + # doesn't depend on reduction in previous iteration. + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + # + # though note that $Tx and $Hx are "reversed" in this section, + # and $D4 is preloaded with r0^2... + + vpmuludq $T0,$D4,$D0 # d0 = h0*r0 + vpmuludq $T1,$D4,$D1 # d1 = h1*r0 + vmovdqa $H2,0x20(%r11) # offload hash + vpmuludq $T2,$D4,$D2 # d3 = h2*r0 + vmovdqa 0x10(%rsp),$H2 # r1^2 + vpmuludq $T3,$D4,$D3 # d3 = h3*r0 + vpmuludq $T4,$D4,$D4 # d4 = h4*r0 + + vmovdqa $H0,0x00(%r11) # + vpmuludq 0x20(%rsp),$T4,$H0 # h4*s1 + vmovdqa $H1,0x10(%r11) # + vpmuludq $T3,$H2,$H1 # h3*r1 + vpaddq $H0,$D0,$D0 # d0 += h4*s1 + vpaddq $H1,$D4,$D4 # d4 += h3*r1 + vmovdqa $H3,0x30(%r11) # + vpmuludq $T2,$H2,$H0 # h2*r1 + vpmuludq $T1,$H2,$H1 # h1*r1 + vpaddq $H0,$D3,$D3 # d3 += h2*r1 + vmovdqa 0x30(%rsp),$H3 # r2^2 + vpaddq $H1,$D2,$D2 # d2 += h1*r1 + vmovdqa $H4,0x40(%r11) # + vpmuludq $T0,$H2,$H2 # h0*r1 + vpmuludq $T2,$H3,$H0 # h2*r2 + vpaddq $H2,$D1,$D1 # d1 += h0*r1 + + vmovdqa 0x40(%rsp),$H4 # s2^2 + vpaddq $H0,$D4,$D4 # d4 += h2*r2 + vpmuludq $T1,$H3,$H1 # h1*r2 + vpmuludq $T0,$H3,$H3 # h0*r2 + vpaddq $H1,$D3,$D3 # d3 += h1*r2 + vmovdqa 0x50(%rsp),$H2 # r3^2 + vpaddq $H3,$D2,$D2 # d2 += h0*r2 + vpmuludq $T4,$H4,$H0 # h4*s2 + vpmuludq $T3,$H4,$H4 # h3*s2 + vpaddq $H0,$D1,$D1 # d1 += h4*s2 + vmovdqa 0x60(%rsp),$H3 # s3^2 + vpaddq $H4,$D0,$D0 # d0 += h3*s2 + + vmovdqa 0x80(%rsp),$H4 # s4^2 + vpmuludq $T1,$H2,$H1 # h1*r3 + vpmuludq $T0,$H2,$H2 # h0*r3 + vpaddq $H1,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $T4,$H3,$H0 # h4*s3 + vpmuludq $T3,$H3,$H1 # h3*s3 + vpaddq $H0,$D2,$D2 # d2 += h4*s3 + vmovdqu 16*0($inp),$H0 # load input + vpaddq $H1,$D1,$D1 # d1 += h3*s3 + vpmuludq $T2,$H3,$H3 # h2*s3 + vpmuludq $T2,$H4,$T2 # h2*s4 + vpaddq $H3,$D0,$D0 # d0 += h2*s3 + + vmovdqu 16*1($inp),$H1 # + vpaddq $T2,$D1,$D1 # d1 += h2*s4 + vpmuludq $T3,$H4,$T3 # h3*s4 + vpmuludq $T4,$H4,$T4 # h4*s4 + vpsrldq \$6,$H0,$H2 # splat input + vpaddq $T3,$D2,$D2 # d2 += h3*s4 + vpaddq $T4,$D3,$D3 # d3 += h4*s4 + vpsrldq \$6,$H1,$H3 # + vpmuludq 0x70(%rsp),$T0,$T4 # h0*r4 + vpmuludq $T1,$H4,$T0 # h1*s4 + vpunpckhqdq $H1,$H0,$H4 # 4 + vpaddq $T4,$D4,$D4 # d4 += h0*r4 + vmovdqa -0x90(%r11),$T4 # r0^4 + vpaddq $T0,$D0,$D0 # d0 += h1*s4 + + vpunpcklqdq $H1,$H0,$H0 # 0:1 + vpunpcklqdq $H3,$H2,$H3 # 2:3 + + #vpsrlq \$40,$H4,$H4 # 4 + vpsrldq \$`40/8`,$H4,$H4 # 4 + vpsrlq \$26,$H0,$H1 + vpand $MASK,$H0,$H0 # 0 + vpsrlq \$4,$H3,$H2 + vpand $MASK,$H1,$H1 # 1 + vpand 0(%rcx),$H4,$H4 # .Lmask24 + vpsrlq \$30,$H3,$H3 + vpand $MASK,$H2,$H2 # 2 + vpand $MASK,$H3,$H3 # 3 + vpor 32(%rcx),$H4,$H4 # padbit, yes, always + + vpaddq 0x00(%r11),$H0,$H0 # add hash value + vpaddq 0x10(%r11),$H1,$H1 + vpaddq 0x20(%r11),$H2,$H2 + vpaddq 0x30(%r11),$H3,$H3 + vpaddq 0x40(%r11),$H4,$H4 + + lea 16*2($inp),%rax + lea 16*4($inp),$inp + sub 
\$64,$len + cmovc %rax,$inp + + ################################################################ + # Now we accumulate (inp[0:1]+hash)*r^4 + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + vpmuludq $H0,$T4,$T0 # h0*r0 + vpmuludq $H1,$T4,$T1 # h1*r0 + vpaddq $T0,$D0,$D0 + vpaddq $T1,$D1,$D1 + vmovdqa -0x80(%r11),$T2 # r1^4 + vpmuludq $H2,$T4,$T0 # h2*r0 + vpmuludq $H3,$T4,$T1 # h3*r0 + vpaddq $T0,$D2,$D2 + vpaddq $T1,$D3,$D3 + vpmuludq $H4,$T4,$T4 # h4*r0 + vpmuludq -0x70(%r11),$H4,$T0 # h4*s1 + vpaddq $T4,$D4,$D4 + + vpaddq $T0,$D0,$D0 # d0 += h4*s1 + vpmuludq $H2,$T2,$T1 # h2*r1 + vpmuludq $H3,$T2,$T0 # h3*r1 + vpaddq $T1,$D3,$D3 # d3 += h2*r1 + vmovdqa -0x60(%r11),$T3 # r2^4 + vpaddq $T0,$D4,$D4 # d4 += h3*r1 + vpmuludq $H1,$T2,$T1 # h1*r1 + vpmuludq $H0,$T2,$T2 # h0*r1 + vpaddq $T1,$D2,$D2 # d2 += h1*r1 + vpaddq $T2,$D1,$D1 # d1 += h0*r1 + + vmovdqa -0x50(%r11),$T4 # s2^4 + vpmuludq $H2,$T3,$T0 # h2*r2 + vpmuludq $H1,$T3,$T1 # h1*r2 + vpaddq $T0,$D4,$D4 # d4 += h2*r2 + vpaddq $T1,$D3,$D3 # d3 += h1*r2 + vmovdqa -0x40(%r11),$T2 # r3^4 + vpmuludq $H0,$T3,$T3 # h0*r2 + vpmuludq $H4,$T4,$T0 # h4*s2 + vpaddq $T3,$D2,$D2 # d2 += h0*r2 + vpaddq $T0,$D1,$D1 # d1 += h4*s2 + vmovdqa -0x30(%r11),$T3 # s3^4 + vpmuludq $H3,$T4,$T4 # h3*s2 + vpmuludq $H1,$T2,$T1 # h1*r3 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + + vmovdqa -0x10(%r11),$T4 # s4^4 + vpaddq $T1,$D4,$D4 # d4 += h1*r3 + vpmuludq $H0,$T2,$T2 # h0*r3 + vpmuludq $H4,$T3,$T0 # h4*s3 + vpaddq $T2,$D3,$D3 # d3 += h0*r3 + vpaddq $T0,$D2,$D2 # d2 += h4*s3 + vmovdqu 16*2($inp),$T0 # load input + vpmuludq $H3,$T3,$T2 # h3*s3 + vpmuludq $H2,$T3,$T3 # h2*s3 + vpaddq $T2,$D1,$D1 # d1 += h3*s3 + vmovdqu 16*3($inp),$T1 # + vpaddq $T3,$D0,$D0 # d0 += h2*s3 + + vpmuludq $H2,$T4,$H2 # h2*s4 + vpmuludq $H3,$T4,$H3 # h3*s4 + vpsrldq \$6,$T0,$T2 # splat input + vpaddq $H2,$D1,$D1 # d1 += h2*s4 + vpmuludq $H4,$T4,$H4 # h4*s4 + vpsrldq \$6,$T1,$T3 # + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*s4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*s4 + vpmuludq -0x20(%r11),$H0,$H4 # h0*r4 + vpmuludq $H1,$T4,$H0 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + + #vpsrlq \$40,$T4,$T4 # 4 + vpsrldq \$`40/8`,$T4,$T4 # 4 + vpsrlq \$26,$T0,$T1 + vmovdqa 0x00(%rsp),$D4 # preload r0^2 + vpand $MASK,$T0,$T0 # 0 + vpsrlq \$4,$T3,$T2 + vpand $MASK,$T1,$T1 # 1 + vpand 0(%rcx),$T4,$T4 # .Lmask24 + vpsrlq \$30,$T3,$T3 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + ################################################################ + # lazy reduction as discussed in "NEON crypto" by D.J. Bernstein + # and P. 
Schwabe + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D0 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D0,$H0,$H0 + vpsllq \$2,$D0,$D0 + vpaddq $D0,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + ja .Loop_avx + +.Lskip_loop_avx: + ################################################################ + # multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1 + + vpshufd \$0x10,$D4,$D4 # r0^n, xx12 -> x1x2 + add \$32,$len + jnz .Long_tail_avx + + vpaddq $H2,$T2,$T2 + vpaddq $H0,$T0,$T0 + vpaddq $H1,$T1,$T1 + vpaddq $H3,$T3,$T3 + vpaddq $H4,$T4,$T4 + +.Long_tail_avx: + vmovdqa $H2,0x20(%r11) + vmovdqa $H0,0x00(%r11) + vmovdqa $H1,0x10(%r11) + vmovdqa $H3,0x30(%r11) + vmovdqa $H4,0x40(%r11) + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + vpmuludq $T2,$D4,$D2 # d2 = h2*r0 + vpmuludq $T0,$D4,$D0 # d0 = h0*r0 + vpshufd \$0x10,`16*1-64`($ctx),$H2 # r1^n + vpmuludq $T1,$D4,$D1 # d1 = h1*r0 + vpmuludq $T3,$D4,$D3 # d3 = h3*r0 + vpmuludq $T4,$D4,$D4 # d4 = h4*r0 + + vpmuludq $T3,$H2,$H0 # h3*r1 + vpaddq $H0,$D4,$D4 # d4 += h3*r1 + vpshufd \$0x10,`16*2-64`($ctx),$H3 # s1^n + vpmuludq $T2,$H2,$H1 # h2*r1 + vpaddq $H1,$D3,$D3 # d3 += h2*r1 + vpshufd \$0x10,`16*3-64`($ctx),$H4 # r2^n + vpmuludq $T1,$H2,$H0 # h1*r1 + vpaddq $H0,$D2,$D2 # d2 += h1*r1 + vpmuludq $T0,$H2,$H2 # h0*r1 + vpaddq $H2,$D1,$D1 # d1 += h0*r1 + vpmuludq $T4,$H3,$H3 # h4*s1 + vpaddq $H3,$D0,$D0 # d0 += h4*s1 + + vpshufd \$0x10,`16*4-64`($ctx),$H2 # s2^n + vpmuludq $T2,$H4,$H1 # h2*r2 + vpaddq $H1,$D4,$D4 # d4 += h2*r2 + vpmuludq $T1,$H4,$H0 # h1*r2 + vpaddq $H0,$D3,$D3 # d3 += h1*r2 + vpshufd \$0x10,`16*5-64`($ctx),$H3 # r3^n + vpmuludq $T0,$H4,$H4 # h0*r2 + vpaddq $H4,$D2,$D2 # d2 += h0*r2 + vpmuludq $T4,$H2,$H1 # h4*s2 + vpaddq $H1,$D1,$D1 # d1 += h4*s2 + vpshufd \$0x10,`16*6-64`($ctx),$H4 # s3^n + vpmuludq $T3,$H2,$H2 # h3*s2 + vpaddq $H2,$D0,$D0 # d0 += h3*s2 + + vpmuludq $T1,$H3,$H0 # h1*r3 + vpaddq $H0,$D4,$D4 # d4 += h1*r3 + vpmuludq $T0,$H3,$H3 # h0*r3 + vpaddq $H3,$D3,$D3 # d3 += h0*r3 + vpshufd \$0x10,`16*7-64`($ctx),$H2 # r4^n + vpmuludq $T4,$H4,$H1 # h4*s3 + vpaddq $H1,$D2,$D2 # d2 += h4*s3 + vpshufd \$0x10,`16*8-64`($ctx),$H3 # s4^n + vpmuludq $T3,$H4,$H0 # h3*s3 + vpaddq $H0,$D1,$D1 # d1 += h3*s3 + vpmuludq $T2,$H4,$H4 # h2*s3 + vpaddq $H4,$D0,$D0 # d0 += h2*s3 + + vpmuludq $T0,$H2,$H2 # h0*r4 + vpaddq $H2,$D4,$D4 # h4 = d4 + h0*r4 + vpmuludq $T4,$H3,$H1 # h4*s4 + vpaddq $H1,$D3,$D3 # h3 = d3 + h4*s4 + vpmuludq $T3,$H3,$H0 # h3*s4 + vpaddq $H0,$D2,$D2 # h2 = d2 + h3*s4 + vpmuludq $T2,$H3,$H1 # h2*s4 + vpaddq $H1,$D1,$D1 # h1 = d1 + h2*s4 + vpmuludq $T1,$H3,$H3 # h1*s4 + vpaddq $H3,$D0,$D0 # h0 = d0 + h1*s4 + + jz .Lshort_tail_avx + + vmovdqu 16*0($inp),$H0 # load input + vmovdqu 16*1($inp),$H1 + + vpsrldq \$6,$H0,$H2 # splat input + vpsrldq \$6,$H1,$H3 + vpunpckhqdq $H1,$H0,$H4 # 4 + vpunpcklqdq $H1,$H0,$H0 # 0:1 + vpunpcklqdq $H3,$H2,$H3 # 2:3 + + vpsrlq \$40,$H4,$H4 # 4 + vpsrlq \$26,$H0,$H1 + vpand $MASK,$H0,$H0 # 0 + 
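+	# (Input splat: the shift/mask steps around this point slice each
+	# 16-byte block into five 26-bit limbs -- bits 0-25, 26-51, 52-77,
+	# 78-103 and 104-129 -- and the vpor with .L129 plants the padbit as
+	# bit 24 of the top limb, i.e. bit 128 of the padded message word.)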
vpsrlq \$4,$H3,$H2 + vpand $MASK,$H1,$H1 # 1 + vpsrlq \$30,$H3,$H3 + vpand $MASK,$H2,$H2 # 2 + vpand $MASK,$H3,$H3 # 3 + vpor 32(%rcx),$H4,$H4 # padbit, yes, always + + vpshufd \$0x32,`16*0-64`($ctx),$T4 # r0^n, 34xx -> x3x4 + vpaddq 0x00(%r11),$H0,$H0 + vpaddq 0x10(%r11),$H1,$H1 + vpaddq 0x20(%r11),$H2,$H2 + vpaddq 0x30(%r11),$H3,$H3 + vpaddq 0x40(%r11),$H4,$H4 + + ################################################################ + # multiply (inp[0:1]+hash) by r^4:r^3 and accumulate + + vpmuludq $H0,$T4,$T0 # h0*r0 + vpaddq $T0,$D0,$D0 # d0 += h0*r0 + vpmuludq $H1,$T4,$T1 # h1*r0 + vpaddq $T1,$D1,$D1 # d1 += h1*r0 + vpmuludq $H2,$T4,$T0 # h2*r0 + vpaddq $T0,$D2,$D2 # d2 += h2*r0 + vpshufd \$0x32,`16*1-64`($ctx),$T2 # r1^n + vpmuludq $H3,$T4,$T1 # h3*r0 + vpaddq $T1,$D3,$D3 # d3 += h3*r0 + vpmuludq $H4,$T4,$T4 # h4*r0 + vpaddq $T4,$D4,$D4 # d4 += h4*r0 + + vpmuludq $H3,$T2,$T0 # h3*r1 + vpaddq $T0,$D4,$D4 # d4 += h3*r1 + vpshufd \$0x32,`16*2-64`($ctx),$T3 # s1 + vpmuludq $H2,$T2,$T1 # h2*r1 + vpaddq $T1,$D3,$D3 # d3 += h2*r1 + vpshufd \$0x32,`16*3-64`($ctx),$T4 # r2 + vpmuludq $H1,$T2,$T0 # h1*r1 + vpaddq $T0,$D2,$D2 # d2 += h1*r1 + vpmuludq $H0,$T2,$T2 # h0*r1 + vpaddq $T2,$D1,$D1 # d1 += h0*r1 + vpmuludq $H4,$T3,$T3 # h4*s1 + vpaddq $T3,$D0,$D0 # d0 += h4*s1 + + vpshufd \$0x32,`16*4-64`($ctx),$T2 # s2 + vpmuludq $H2,$T4,$T1 # h2*r2 + vpaddq $T1,$D4,$D4 # d4 += h2*r2 + vpmuludq $H1,$T4,$T0 # h1*r2 + vpaddq $T0,$D3,$D3 # d3 += h1*r2 + vpshufd \$0x32,`16*5-64`($ctx),$T3 # r3 + vpmuludq $H0,$T4,$T4 # h0*r2 + vpaddq $T4,$D2,$D2 # d2 += h0*r2 + vpmuludq $H4,$T2,$T1 # h4*s2 + vpaddq $T1,$D1,$D1 # d1 += h4*s2 + vpshufd \$0x32,`16*6-64`($ctx),$T4 # s3 + vpmuludq $H3,$T2,$T2 # h3*s2 + vpaddq $T2,$D0,$D0 # d0 += h3*s2 + + vpmuludq $H1,$T3,$T0 # h1*r3 + vpaddq $T0,$D4,$D4 # d4 += h1*r3 + vpmuludq $H0,$T3,$T3 # h0*r3 + vpaddq $T3,$D3,$D3 # d3 += h0*r3 + vpshufd \$0x32,`16*7-64`($ctx),$T2 # r4 + vpmuludq $H4,$T4,$T1 # h4*s3 + vpaddq $T1,$D2,$D2 # d2 += h4*s3 + vpshufd \$0x32,`16*8-64`($ctx),$T3 # s4 + vpmuludq $H3,$T4,$T0 # h3*s3 + vpaddq $T0,$D1,$D1 # d1 += h3*s3 + vpmuludq $H2,$T4,$T4 # h2*s3 + vpaddq $T4,$D0,$D0 # d0 += h2*s3 + + vpmuludq $H0,$T2,$T2 # h0*r4 + vpaddq $T2,$D4,$D4 # d4 += h0*r4 + vpmuludq $H4,$T3,$T1 # h4*s4 + vpaddq $T1,$D3,$D3 # d3 += h4*s4 + vpmuludq $H3,$T3,$T0 # h3*s4 + vpaddq $T0,$D2,$D2 # d2 += h3*s4 + vpmuludq $H2,$T3,$T1 # h2*s4 + vpaddq $T1,$D1,$D1 # d1 += h2*s4 + vpmuludq $H1,$T3,$T3 # h1*s4 + vpaddq $T3,$D0,$D0 # d0 += h1*s4 + +.Lshort_tail_avx: + ################################################################ + # horizontal addition + + vpsrldq \$8,$D4,$T4 + vpsrldq \$8,$D3,$T3 + vpsrldq \$8,$D1,$T1 + vpsrldq \$8,$D0,$T0 + vpsrldq \$8,$D2,$T2 + vpaddq $T3,$D3,$D3 + vpaddq $T4,$D4,$D4 + vpaddq $T0,$D0,$D0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$D2,$D2 + + ################################################################ + # lazy reduction + + vpsrlq \$26,$D3,$H3 + vpand $MASK,$D3,$D3 + vpaddq $H3,$D4,$D4 # h3 -> h4 + + vpsrlq \$26,$D0,$H0 + vpand $MASK,$D0,$D0 + vpaddq $H0,$D1,$D1 # h0 -> h1 + + vpsrlq \$26,$D4,$H4 + vpand $MASK,$D4,$D4 + + vpsrlq \$26,$D1,$H1 + vpand $MASK,$D1,$D1 + vpaddq $H1,$D2,$D2 # h1 -> h2 + + vpaddq $H4,$D0,$D0 + vpsllq \$2,$H4,$H4 + vpaddq $H4,$D0,$D0 # h4 -> h0 + + vpsrlq \$26,$D2,$H2 + vpand $MASK,$D2,$D2 + vpaddq $H2,$D3,$D3 # h2 -> h3 + + vpsrlq \$26,$D0,$H0 + vpand $MASK,$D0,$D0 + vpaddq $H0,$D1,$D1 # h0 -> h1 + + vpsrlq \$26,$D3,$H3 + vpand $MASK,$D3,$D3 + vpaddq $H3,$D4,$D4 # h3 -> h4 + + vmovd $D0,`4*0-48-64`($ctx) # save partially reduced + 
vmovd $D1,`4*1-48-64`($ctx) + vmovd $D2,`4*2-48-64`($ctx) + vmovd $D3,`4*3-48-64`($ctx) + vmovd $D4,`4*4-48-64`($ctx) +___ +$code.=<<___ if ($win64); + vmovdqa 0x50(%r11),%xmm6 + vmovdqa 0x60(%r11),%xmm7 + vmovdqa 0x70(%r11),%xmm8 + vmovdqa 0x80(%r11),%xmm9 + vmovdqa 0x90(%r11),%xmm10 + vmovdqa 0xa0(%r11),%xmm11 + vmovdqa 0xb0(%r11),%xmm12 + vmovdqa 0xc0(%r11),%xmm13 + vmovdqa 0xd0(%r11),%xmm14 + vmovdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx_epilogue: +___ +$code.=<<___ if (!$win64); + lea 0x58(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx,.-poly1305_blocks_avx + +.type poly1305_emit_avx,\@function,3 +.align 32 +poly1305_emit_avx: + cmpl \$0,20($ctx) # is_base2_26? + je .Lemit + + mov 0($ctx),%eax # load hash value base 2^26 + mov 4($ctx),%ecx + mov 8($ctx),%r8d + mov 12($ctx),%r11d + mov 16($ctx),%r10d + + shl \$26,%rcx # base 2^26 -> base 2^64 + mov %r8,%r9 + shl \$52,%r8 + add %rcx,%rax + shr \$12,%r9 + add %rax,%r8 # h0 + adc \$0,%r9 + + shl \$14,%r11 + mov %r10,%rax + shr \$24,%r10 + add %r11,%r9 + shl \$40,%rax + add %rax,%r9 # h1 + adc \$0,%r10 # h2 + + mov %r10,%rax # could be partially reduced, so reduce + mov %r10,%rcx + and \$3,%r10 + shr \$2,%rax + and \$-4,%rcx + add %rcx,%rax + add %rax,%r8 + adc \$0,%r9 + adc \$0,%r10 + + mov %r8,%rax + add \$5,%r8 # compare to modulus + mov %r9,%rcx + adc \$0,%r9 + adc \$0,%r10 + shr \$2,%r10 # did 130-bit value overflow? + cmovnz %r8,%rax + cmovnz %r9,%rcx + + add 0($nonce),%rax # accumulate nonce + adc 8($nonce),%rcx + mov %rax,0($mac) # write result + mov %rcx,8($mac) + + ret +.size poly1305_emit_avx,.-poly1305_emit_avx +___ + +if ($avx>1) { +my ($H0,$H1,$H2,$H3,$H4, $MASK, $T4,$T0,$T1,$T2,$T3, $D0,$D1,$D2,$D3,$D4) = + map("%ymm$_",(0..15)); +my $S4=$MASK; + +$code.=<<___; +.type poly1305_blocks_avx2,\@function,4 +.align 32 +poly1305_blocks_avx2: +.cfi_startproc + mov 20($ctx),%r8d # is_base2_26 + cmp \$128,$len + jae .Lblocks_avx2 + test %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2: + and \$-16,$len + jz .Lno_data_avx2 + + vzeroupper + + test %r8d,%r8d + jz .Lbase2_64_avx2 + + test \$63,$len + jz .Leven_avx2 + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_avx2_body: + + mov $len,%r15 # reassign $len + + mov 0($ctx),$d1 # load hash value + mov 8($ctx),$d2 + mov 16($ctx),$h2#d + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + ################################# base 2^26 -> base 2^64 + mov $d1#d,$h0#d + and \$`-1*(1<<31)`,$d1 + mov $d2,$r1 # borrow $r1 + mov $d2#d,$h1#d + and \$`-1*(1<<31)`,$d2 + + shr \$6,$d1 + shl \$52,$r1 + add $d1,$h0 + shr \$12,$h1 + shr \$18,$d2 + add $r1,$h0 + adc $d2,$h1 + + mov $h2,$d1 + shl \$40,$d1 + shr \$24,$h2 + add $d1,$h1 + adc \$0,$h2 # can be partially reduced... + + mov \$-4,$d2 # ... 
so reduce + mov $h2,$d1 + and $h2,$d2 + shr \$2,$d1 + and \$3,$h2 + add $d2,$d1 # =*5 + add $d1,$h0 + adc \$0,$h1 + adc \$0,$h2 + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + +.Lbase2_26_pre_avx2: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_26_pre_avx2 + + test $padbit,$padbit # if $padbit is zero, + jz .Lstore_base2_64_avx2 # store hash in base 2^64 format + + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$r0 + mov $h1,$r1 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$r0 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $r0,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$r1 + and \$0x3ffffff,$h1 # h[3] + or $r1,$h2 # h[4] + + test %r15,%r15 + jz .Lstore_base2_26_avx2 + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + jmp .Lproceed_avx2 + +.align 32 +.Lstore_base2_64_avx2: + mov $h0,0($ctx) + mov $h1,8($ctx) + mov $h2,16($ctx) # note that is_base2_26 is zeroed + jmp .Ldone_avx2 + +.align 16 +.Lstore_base2_26_avx2: + mov %rax#d,0($ctx) # store hash value base 2^26 + mov %rdx#d,4($ctx) + mov $h0#d,8($ctx) + mov $h1#d,12($ctx) + mov $h2#d,16($ctx) +.align 16 +.Ldone_avx2: + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2: +.Lblocks_avx2_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2: +.cfi_startproc + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lbase2_64_avx2_body: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2#d + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + test \$63,$len + jz .Linit_avx2 + +.Lbase2_64_pre_avx2: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_64_pre_avx2 + +.Linit_avx2: + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$d1 + mov $h1,$d2 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$d1 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $d1,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$d2 + and \$0x3ffffff,$h1 # h[3] + or $d2,$h2 # h[4] + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + movl \$1,20($ctx) # set is_base2_26 + + call __poly1305_init_avx + +.Lproceed_avx2: + mov %r15,$len # restore $len +# mov OPENSSL_ia32cap_P+8(%rip),%r10d +# mov \$`(1<<31|1<<30|1<<16)`,%r11d + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rax + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue: + jmp .Ldo_avx2 +.cfi_endproc + +.align 32 +.Leven_avx2: 
+.cfi_startproc +# mov OPENSSL_ia32cap_P+8(%rip),%r10d + vmovd 4*0($ctx),%x#$H0 # load hash value base 2^26 + vmovd 4*1($ctx),%x#$H1 + vmovd 4*2($ctx),%x#$H2 + vmovd 4*3($ctx),%x#$H3 + vmovd 4*4($ctx),%x#$H4 + +.Ldo_avx2: +___ +#$code.=<<___ if ($avx>2); +# cmp \$512,$len +# jb .Lskip_avx512 +# and %r11d,%r10d +# test \$`1<<16`,%r10d # check for AVX512F +# jnz .Lblocks_avx512 +#.Lskip_avx512: +#___ +$code.=<<___ if (!$win64); + lea -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + sub \$0x128,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x1c8,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx2_body: +___ +$code.=<<___; + lea .Lconst(%rip),%rcx + lea 48+64($ctx),$ctx # size optimization + vmovdqa 96(%rcx),$T0 # .Lpermd_avx2 + + # expand and copy pre-calculated table to stack + vmovdqu `16*0-64`($ctx),%x#$T2 + and \$-512,%rsp + vmovdqu `16*1-64`($ctx),%x#$T3 + vmovdqu `16*2-64`($ctx),%x#$T4 + vmovdqu `16*3-64`($ctx),%x#$D0 + vmovdqu `16*4-64`($ctx),%x#$D1 + vmovdqu `16*5-64`($ctx),%x#$D2 + lea 0x90(%rsp),%rax # size optimization + vmovdqu `16*6-64`($ctx),%x#$D3 + vpermd $T2,$T0,$T2 # 00003412 -> 14243444 + vmovdqu `16*7-64`($ctx),%x#$D4 + vpermd $T3,$T0,$T3 + vmovdqu `16*8-64`($ctx),%x#$MASK + vpermd $T4,$T0,$T4 + vmovdqa $T2,0x00(%rsp) + vpermd $D0,$T0,$D0 + vmovdqa $T3,0x20-0x90(%rax) + vpermd $D1,$T0,$D1 + vmovdqa $T4,0x40-0x90(%rax) + vpermd $D2,$T0,$D2 + vmovdqa $D0,0x60-0x90(%rax) + vpermd $D3,$T0,$D3 + vmovdqa $D1,0x80-0x90(%rax) + vpermd $D4,$T0,$D4 + vmovdqa $D2,0xa0-0x90(%rax) + vpermd $MASK,$T0,$MASK + vmovdqa $D3,0xc0-0x90(%rax) + vmovdqa $D4,0xe0-0x90(%rax) + vmovdqa $MASK,0x100-0x90(%rax) + vmovdqa 64(%rcx),$MASK # .Lmask26 + + ################################################################ + # load input + vmovdqu 16*0($inp),%x#$T0 + vmovdqu 16*1($inp),%x#$T1 + vinserti128 \$1,16*2($inp),$T0,$T0 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpunpcklqdq $T3,$T2,$T2 # 2:3 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + + vpsrlq \$30,$T2,$T3 + vpsrlq \$4,$T2,$T2 + vpsrlq \$26,$T0,$T1 + vpsrlq \$40,$T4,$T4 # 4 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + vpaddq $H2,$T2,$H2 # accumulate input + sub \$64,$len + jz .Ltail_avx2 + jmp .Loop_avx2 + +.align 32 +.Loop_avx2: + ################################################################ + # ((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4 + # ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3 + # ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2 + # ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1 + # \________/\__________/ + ################################################################ + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqa `32*0`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqa `32*1`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqa `32*3`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqa `32*6-0x90`(%rax),$T3 # s3^4 + vmovdqa `32*8-0x90`(%rax),$S4 # s4^4 + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + 
h1*5*r4 + # + # however, as h2 is "chronologically" first one available pull + # corresponding operations up, so it's + # + # d4 = h2*r2 + h4*r0 + h3*r1 + h1*r3 + h0*r4 + # d3 = h2*r1 + h3*r0 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h2*5*r4 + h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + # d0 = h2*5*r3 + h0*r0 + h4*5*r1 + h3*5*r2 + h1*5*r4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1, borrow $H2 as temp + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + vmovdqa `32*4-0x90`(%rax),$T1 # s2 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vmovdqu 16*0($inp),%x#$T0 # load input + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + vinserti128 \$1,16*2($inp),$T0,$T0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vmovdqu 16*1($inp),%x#$T1 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqa `32*5-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpsrldq \$6,$T0,$T2 # splat input + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpsrldq \$6,$T1,$T3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + vpunpckhqdq $T1,$T0,$T4 # 4 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + vpmuludq `32*7-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # lazy reduction (interleaved with tail of input splat) + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$4,$T3,$T2 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpand $MASK,$T2,$T2 # 2 + vpsrlq \$26,$T0,$T1 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpaddq $T2,$H2,$H2 # modulo-scheduled + vpsrlq \$30,$T3,$T3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$40,$T4,$T4 # 4 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + sub \$64,$len + jnz .Loop_avx2 + + .byte 0x66,0x90 +.Ltail_avx2: + ################################################################ + # while above multiplications 
were by r^4 in all lanes, in last + # iteration we multiply least significant lane by r^4 and most + # significant one by r, so copy of above except that references + # to the precomputed table are displaced by 4... + + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqu `32*0+4`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqu `32*1+4`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqu `32*3+4`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqu `32*6+4-0x90`(%rax),$T3 # s3^4 + vmovdqu `32*8+4-0x90`(%rax),$S4 # s4^4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1 + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2+4`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vmovdqu `32*4+4-0x90`(%rax),$T1 # s2 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqu `32*5+4-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpmuludq `32*7+4-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # horizontal addition + + vpsrldq \$8,$D1,$T1 + vpsrldq \$8,$H2,$T2 + vpsrldq \$8,$H3,$T3 + vpsrldq \$8,$H4,$T4 + vpsrldq \$8,$H0,$T0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + + vpermq \$0x2,$H3,$T3 + vpermq \$0x2,$H4,$T4 + vpermq \$0x2,$H0,$T0 + vpermq \$0x2,$D1,$T1 + vpermq \$0x2,$H2,$T2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + + ################################################################ + # lazy reduction + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vmovd %x#$H0,`4*0-48-64`($ctx)# save partially reduced + vmovd %x#$H1,`4*1-48-64`($ctx) + vmovd %x#$H2,`4*2-48-64`($ctx) + vmovd 
%x#$H3,`4*3-48-64`($ctx) + vmovd %x#$H4,`4*4-48-64`($ctx) +___ +$code.=<<___ if ($win64); + vmovdqa 0x50(%r11),%xmm6 + vmovdqa 0x60(%r11),%xmm7 + vmovdqa 0x70(%r11),%xmm8 + vmovdqa 0x80(%r11),%xmm9 + vmovdqa 0x90(%r11),%xmm10 + vmovdqa 0xa0(%r11),%xmm11 + vmovdqa 0xb0(%r11),%xmm12 + vmovdqa 0xc0(%r11),%xmm13 + vmovdqa 0xd0(%r11),%xmm14 + vmovdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx2_epilogue: +___ +$code.=<<___ if (!$win64); + lea 8(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +___ +####################################################################### +if ($avx>2) { +# On entry we have input length divisible by 64. But since inner loop +# processes 128 bytes per iteration, cases when length is not divisible +# by 128 are handled by passing tail 64 bytes to .Ltail_avx2. For this +# reason stack layout is kept identical to poly1305_blocks_avx2. If not +# for this tail, we wouldn't have to even allocate stack frame... + + +$code.=<<___; +.type poly1305_blocks_avx512,\@function,4 +.align 32 +poly1305_blocks_avx512: +.cfi_startproc + mov 20($ctx),%r8d # is_base2_26 + cmp \$128,$len + jae .Lblocks_avx2_512 + test %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2_512: + and \$-16,$len + jz .Lno_data_avx2_512 + + vzeroupper + + test %r8d,%r8d + jz .Lbase2_64_avx2_512 + + test \$63,$len + jz .Leven_avx2_512 + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_avx2_body_512: + + mov $len,%r15 # reassign $len + + mov 0($ctx),$d1 # load hash value + mov 8($ctx),$d2 + mov 16($ctx),$h2#d + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + ################################# base 2^26 -> base 2^64 + mov $d1#d,$h0#d + and \$`-1*(1<<31)`,$d1 + mov $d2,$r1 # borrow $r1 + mov $d2#d,$h1#d + and \$`-1*(1<<31)`,$d2 + + shr \$6,$d1 + shl \$52,$r1 + add $d1,$h0 + shr \$12,$h1 + shr \$18,$d2 + add $r1,$h0 + adc $d2,$h1 + + mov $h2,$d1 + shl \$40,$d1 + shr \$24,$h2 + add $d1,$h1 + adc \$0,$h2 # can be partially reduced... + + mov \$-4,$d2 # ... 
so reduce + mov $h2,$d1 + and $h2,$d2 + shr \$2,$d1 + and \$3,$h2 + add $d2,$d1 # =*5 + add $d1,$h0 + adc \$0,$h1 + adc \$0,$h2 + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + +.Lbase2_26_pre_avx2_512: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_26_pre_avx2_512 + + test $padbit,$padbit # if $padbit is zero, + jz .Lstore_base2_64_avx2_512 # store hash in base 2^64 format + + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$r0 + mov $h1,$r1 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$r0 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $r0,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$r1 + and \$0x3ffffff,$h1 # h[3] + or $r1,$h2 # h[4] + + test %r15,%r15 + jz .Lstore_base2_26_avx2_512 + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + jmp .Lproceed_avx2_512 + +.align 32 +.Lstore_base2_64_avx2_512: + mov $h0,0($ctx) + mov $h1,8($ctx) + mov $h2,16($ctx) # note that is_base2_26 is zeroed + jmp .Ldone_avx2_512 + +.align 16 +.Lstore_base2_26_avx2_512: + mov %rax#d,0($ctx) # store hash value base 2^26 + mov %rdx#d,4($ctx) + mov $h0#d,8($ctx) + mov $h1#d,12($ctx) + mov $h2#d,16($ctx) +.align 16 +.Ldone_avx2_512: + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2_512: +.Lblocks_avx2_epilogue_512: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2_512: +.cfi_startproc + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lbase2_64_avx2_body_512: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2#d + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + test \$63,$len + jz .Linit_avx2_512 + +.Lbase2_64_pre_avx2_512: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_64_pre_avx2_512 + +.Linit_avx2_512: + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$d1 + mov $h1,$d2 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$d1 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $d1,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$d2 + and \$0x3ffffff,$h1 # h[3] + or $d2,$h2 # h[4] + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + movl \$1,20($ctx) # set is_base2_26 + + call __poly1305_init_avx + +.Lproceed_avx2_512: + mov %r15,$len # restore $len +# mov OPENSSL_ia32cap_P+8(%rip),%r10d +# mov \$`(1<<31|1<<30|1<<16)`,%r11d + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rax + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 
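The .Linit_avx2_512 path above performs the same base 2^64 -> base 2^26 limb split that every vector entry point needs: the 130-bit accumulator, held as 64+64+2 bits in $h0/$h1/$h2, is re-expressed as five 26-bit limbs so that limb products leave headroom in 64-bit vector lanes. A minimal pure-Python model of that round trip, for illustration only (the helper names are invented, not part of this source):

    M26 = (1 << 26) - 1

    def to_base26(h0, h1, h2):
        # five 26-bit limbs, least significant first
        h = h0 | (h1 << 64) | (h2 << 128)
        return [(h >> (26 * i)) & M26 for i in range(5)]

    def to_base64(limbs):
        h = sum(v << (26 * i) for i, v in enumerate(limbs))
        return h & ((1 << 64) - 1), (h >> 64) & ((1 << 64) - 1), h >> 128

    h0, h1, h2 = 0x0123456789abcdef, 0xfedcba9876543210, 0x3
    assert to_base64(to_base26(h0, h1, h2)) == (h0, h1, h2)

The assembly does the split with the shr/shl/or/and sequence seen above rather than one wide shift, but the resulting limbs are the same.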
+.Lbase2_64_avx2_epilogue_512: + jmp .Ldo_avx2_512 +.cfi_endproc + +.align 32 +.Leven_avx2_512: +.cfi_startproc +# mov OPENSSL_ia32cap_P+8(%rip),%r10d + vmovd 4*0($ctx),%x#$H0 # load hash value base 2^26 + vmovd 4*1($ctx),%x#$H1 + vmovd 4*2($ctx),%x#$H2 + vmovd 4*3($ctx),%x#$H3 + vmovd 4*4($ctx),%x#$H4 + +.Ldo_avx2_512: + cmp \$512,$len + jae .Lblocks_avx512 +.Lskip_avx512: +___ +#$code.=<<___ if ($avx>2); +# cmp \$512,$len +# jb .Lskip_avx512 +# and %r11d,%r10d +# test \$`1<<16`,%r10d # check for AVX512F +# jnz .Lblocks_avx512 +#.Lskip_avx512: +#___ +$code.=<<___ if (!$win64); + lea -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + sub \$0x128,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x1c8,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx2_body_512: +___ +$code.=<<___; + lea .Lconst(%rip),%rcx + lea 48+64($ctx),$ctx # size optimization + vmovdqa 96(%rcx),$T0 # .Lpermd_avx2 + + # expand and copy pre-calculated table to stack + vmovdqu `16*0-64`($ctx),%x#$T2 + and \$-512,%rsp + vmovdqu `16*1-64`($ctx),%x#$T3 + vmovdqu `16*2-64`($ctx),%x#$T4 + vmovdqu `16*3-64`($ctx),%x#$D0 + vmovdqu `16*4-64`($ctx),%x#$D1 + vmovdqu `16*5-64`($ctx),%x#$D2 + lea 0x90(%rsp),%rax # size optimization + vmovdqu `16*6-64`($ctx),%x#$D3 + vpermd $T2,$T0,$T2 # 00003412 -> 14243444 + vmovdqu `16*7-64`($ctx),%x#$D4 + vpermd $T3,$T0,$T3 + vmovdqu `16*8-64`($ctx),%x#$MASK + vpermd $T4,$T0,$T4 + vmovdqa $T2,0x00(%rsp) + vpermd $D0,$T0,$D0 + vmovdqa $T3,0x20-0x90(%rax) + vpermd $D1,$T0,$D1 + vmovdqa $T4,0x40-0x90(%rax) + vpermd $D2,$T0,$D2 + vmovdqa $D0,0x60-0x90(%rax) + vpermd $D3,$T0,$D3 + vmovdqa $D1,0x80-0x90(%rax) + vpermd $D4,$T0,$D4 + vmovdqa $D2,0xa0-0x90(%rax) + vpermd $MASK,$T0,$MASK + vmovdqa $D3,0xc0-0x90(%rax) + vmovdqa $D4,0xe0-0x90(%rax) + vmovdqa $MASK,0x100-0x90(%rax) + vmovdqa 64(%rcx),$MASK # .Lmask26 + + ################################################################ + # load input + vmovdqu 16*0($inp),%x#$T0 + vmovdqu 16*1($inp),%x#$T1 + vinserti128 \$1,16*2($inp),$T0,$T0 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpunpcklqdq $T3,$T2,$T2 # 2:3 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + + vpsrlq \$30,$T2,$T3 + vpsrlq \$4,$T2,$T2 + vpsrlq \$26,$T0,$T1 + vpsrlq \$40,$T4,$T4 # 4 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + vpaddq $H2,$T2,$H2 # accumulate input + sub \$64,$len + jz .Ltail_avx2_512 + jmp .Loop_avx2_512 + +.align 32 +.Loop_avx2_512: + ################################################################ + # ((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4 + # ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3 + # ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2 + # ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1 + # \________/\__________/ + ################################################################ + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqa `32*0`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqa `32*1`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqa `32*3`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqa `32*6-0x90`(%rax),$T3 # s3^4 + vmovdqa `32*8-0x90`(%rax),$S4 # s4^4 + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + 
h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + # + # however, as h2 is "chronologically" first one available pull + # corresponding operations up, so it's + # + # d4 = h2*r2 + h4*r0 + h3*r1 + h1*r3 + h0*r4 + # d3 = h2*r1 + h3*r0 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h2*5*r4 + h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + # d0 = h2*5*r3 + h0*r0 + h4*5*r1 + h3*5*r2 + h1*5*r4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1, borrow $H2 as temp + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + vmovdqa `32*4-0x90`(%rax),$T1 # s2 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vmovdqu 16*0($inp),%x#$T0 # load input + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + vinserti128 \$1,16*2($inp),$T0,$T0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vmovdqu 16*1($inp),%x#$T1 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqa `32*5-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpsrldq \$6,$T0,$T2 # splat input + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpsrldq \$6,$T1,$T3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + vpunpckhqdq $T1,$T0,$T4 # 4 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + vpmuludq `32*7-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # lazy reduction (interleaved with tail of input splat) + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$4,$T3,$T2 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpand $MASK,$T2,$T2 # 2 + vpsrlq \$26,$T0,$T1 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpaddq $T2,$H2,$H2 # modulo-scheduled + vpsrlq \$30,$T3,$T3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$40,$T4,$T4 # 4 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + 
sub \$64,$len + jnz .Loop_avx2_512 + + .byte 0x66,0x90 +.Ltail_avx2_512: + ################################################################ + # while above multiplications were by r^4 in all lanes, in last + # iteration we multiply least significant lane by r^4 and most + # significant one by r, so copy of above except that references + # to the precomputed table are displaced by 4... + + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqu `32*0+4`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqu `32*1+4`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqu `32*3+4`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqu `32*6+4-0x90`(%rax),$T3 # s3^4 + vmovdqu `32*8+4-0x90`(%rax),$S4 # s4^4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1 + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2+4`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vmovdqu `32*4+4-0x90`(%rax),$T1 # s2 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqu `32*5+4-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpmuludq `32*7+4-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # horizontal addition + + vpsrldq \$8,$D1,$T1 + vpsrldq \$8,$H2,$T2 + vpsrldq \$8,$H3,$T3 + vpsrldq \$8,$H4,$T4 + vpsrldq \$8,$H0,$T0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + + vpermq \$0x2,$H3,$T3 + vpermq \$0x2,$H4,$T4 + vpermq \$0x2,$H0,$T0 + vpermq \$0x2,$D1,$T1 + vpermq \$0x2,$H2,$T2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + + ################################################################ + # lazy reduction + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + 
vpaddq $D3,$H4,$H4 # h3 -> h4 + + vmovd %x#$H0,`4*0-48-64`($ctx)# save partially reduced + vmovd %x#$H1,`4*1-48-64`($ctx) + vmovd %x#$H2,`4*2-48-64`($ctx) + vmovd %x#$H3,`4*3-48-64`($ctx) + vmovd %x#$H4,`4*4-48-64`($ctx) +___ +$code.=<<___ if ($win64); + vmovdqa 0x50(%r11),%xmm6 + vmovdqa 0x60(%r11),%xmm7 + vmovdqa 0x70(%r11),%xmm8 + vmovdqa 0x80(%r11),%xmm9 + vmovdqa 0x90(%r11),%xmm10 + vmovdqa 0xa0(%r11),%xmm11 + vmovdqa 0xb0(%r11),%xmm12 + vmovdqa 0xc0(%r11),%xmm13 + vmovdqa 0xd0(%r11),%xmm14 + vmovdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx2_epilogue_512: +___ +$code.=<<___ if (!$win64); + lea 8(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +___ + + +my ($R0,$R1,$R2,$R3,$R4, $S1,$S2,$S3,$S4) = map("%zmm$_",(16..24)); +my ($M0,$M1,$M2,$M3,$M4) = map("%zmm$_",(25..29)); +my $PADBIT="%zmm30"; + +map(s/%y/%z/,($T4,$T0,$T1,$T2,$T3)); # switch to %zmm domain +map(s/%y/%z/,($D0,$D1,$D2,$D3,$D4)); +map(s/%y/%z/,($H0,$H1,$H2,$H3,$H4)); +map(s/%y/%z/,($MASK)); + + +$code.=<<___; +.cfi_startproc +.Lblocks_avx512: + mov \$15,%eax + kmovw %eax,%k2 +___ +$code.=<<___ if (!$win64); + lea -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + sub \$0x128,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x1c8,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx512_body: +___ +$code.=<<___; + lea .Lconst(%rip),%rcx + lea 48+64($ctx),$ctx # size optimization + vmovdqa 96(%rcx),%y#$T2 # .Lpermd_avx2 + + # expand pre-calculated table + vmovdqu `16*0-64`($ctx),%x#$D0 # will become expanded ${R0} + and \$-512,%rsp + vmovdqu `16*1-64`($ctx),%x#$D1 # will become ... ${R1} + mov \$0x20,%rax + vmovdqu `16*2-64`($ctx),%x#$T0 # ... ${S1} + vmovdqu `16*3-64`($ctx),%x#$D2 # ... ${R2} + vmovdqu `16*4-64`($ctx),%x#$T1 # ... ${S2} + vmovdqu `16*5-64`($ctx),%x#$D3 # ... ${R3} + vmovdqu `16*6-64`($ctx),%x#$T3 # ... ${S3} + vmovdqu `16*7-64`($ctx),%x#$D4 # ... ${R4} + vmovdqu `16*8-64`($ctx),%x#$T4 # ... 
${S4} + vpermd $D0,$T2,$R0 # 00003412 -> 14243444 + vpbroadcastq 64(%rcx),$MASK # .Lmask26 + vpermd $D1,$T2,$R1 + vpermd $T0,$T2,$S1 + vpermd $D2,$T2,$R2 + vmovdqa64 $R0,0x00(%rsp){%k2} # save in case $len%128 != 0 + vpsrlq \$32,$R0,$T0 # 14243444 -> 01020304 + vpermd $T1,$T2,$S2 + vmovdqu64 $R1,0x00(%rsp,%rax){%k2} + vpsrlq \$32,$R1,$T1 + vpermd $D3,$T2,$R3 + vmovdqa64 $S1,0x40(%rsp){%k2} + vpermd $T3,$T2,$S3 + vpermd $D4,$T2,$R4 + vmovdqu64 $R2,0x40(%rsp,%rax){%k2} + vpermd $T4,$T2,$S4 + vmovdqa64 $S2,0x80(%rsp){%k2} + vmovdqu64 $R3,0x80(%rsp,%rax){%k2} + vmovdqa64 $S3,0xc0(%rsp){%k2} + vmovdqu64 $R4,0xc0(%rsp,%rax){%k2} + vmovdqa64 $S4,0x100(%rsp){%k2} + + ################################################################ + # calculate 5th through 8th powers of the key + # + # d0 = r0'*r0 + r1'*5*r4 + r2'*5*r3 + r3'*5*r2 + r4'*5*r1 + # d1 = r0'*r1 + r1'*r0 + r2'*5*r4 + r3'*5*r3 + r4'*5*r2 + # d2 = r0'*r2 + r1'*r1 + r2'*r0 + r3'*5*r4 + r4'*5*r3 + # d3 = r0'*r3 + r1'*r2 + r2'*r1 + r3'*r0 + r4'*5*r4 + # d4 = r0'*r4 + r1'*r3 + r2'*r2 + r3'*r1 + r4'*r0 + + vpmuludq $T0,$R0,$D0 # d0 = r0'*r0 + vpmuludq $T0,$R1,$D1 # d1 = r0'*r1 + vpmuludq $T0,$R2,$D2 # d2 = r0'*r2 + vpmuludq $T0,$R3,$D3 # d3 = r0'*r3 + vpmuludq $T0,$R4,$D4 # d4 = r0'*r4 + vpsrlq \$32,$R2,$T2 + + vpmuludq $T1,$S4,$M0 + vpmuludq $T1,$R0,$M1 + vpmuludq $T1,$R1,$M2 + vpmuludq $T1,$R2,$M3 + vpmuludq $T1,$R3,$M4 + vpsrlq \$32,$R3,$T3 + vpaddq $M0,$D0,$D0 # d0 += r1'*5*r4 + vpaddq $M1,$D1,$D1 # d1 += r1'*r0 + vpaddq $M2,$D2,$D2 # d2 += r1'*r1 + vpaddq $M3,$D3,$D3 # d3 += r1'*r2 + vpaddq $M4,$D4,$D4 # d4 += r1'*r3 + + vpmuludq $T2,$S3,$M0 + vpmuludq $T2,$S4,$M1 + vpmuludq $T2,$R1,$M3 + vpmuludq $T2,$R2,$M4 + vpmuludq $T2,$R0,$M2 + vpsrlq \$32,$R4,$T4 + vpaddq $M0,$D0,$D0 # d0 += r2'*5*r3 + vpaddq $M1,$D1,$D1 # d1 += r2'*5*r4 + vpaddq $M3,$D3,$D3 # d3 += r2'*r1 + vpaddq $M4,$D4,$D4 # d4 += r2'*r2 + vpaddq $M2,$D2,$D2 # d2 += r2'*r0 + + vpmuludq $T3,$S2,$M0 + vpmuludq $T3,$R0,$M3 + vpmuludq $T3,$R1,$M4 + vpmuludq $T3,$S3,$M1 + vpmuludq $T3,$S4,$M2 + vpaddq $M0,$D0,$D0 # d0 += r3'*5*r2 + vpaddq $M3,$D3,$D3 # d3 += r3'*r0 + vpaddq $M4,$D4,$D4 # d4 += r3'*r1 + vpaddq $M1,$D1,$D1 # d1 += r3'*5*r3 + vpaddq $M2,$D2,$D2 # d2 += r3'*5*r4 + + vpmuludq $T4,$S4,$M3 + vpmuludq $T4,$R0,$M4 + vpmuludq $T4,$S1,$M0 + vpmuludq $T4,$S2,$M1 + vpmuludq $T4,$S3,$M2 + vpaddq $M3,$D3,$D3 # d3 += r2'*5*r4 + vpaddq $M4,$D4,$D4 # d4 += r2'*r0 + vpaddq $M0,$D0,$D0 # d0 += r2'*5*r1 + vpaddq $M1,$D1,$D1 # d1 += r2'*5*r2 + vpaddq $M2,$D2,$D2 # d2 += r2'*5*r3 + + ################################################################ + # load input + vmovdqu64 16*0($inp),%z#$T3 + vmovdqu64 16*4($inp),%z#$T4 + lea 16*8($inp),$inp + + ################################################################ + # lazy reduction + + vpsrlq \$26,$D3,$M3 + vpandq $MASK,$D3,$D3 + vpaddq $M3,$D4,$D4 # d3 -> d4 + + vpsrlq \$26,$D0,$M0 + vpandq $MASK,$D0,$D0 + vpaddq $M0,$D1,$D1 # d0 -> d1 + + vpsrlq \$26,$D4,$M4 + vpandq $MASK,$D4,$D4 + + vpsrlq \$26,$D1,$M1 + vpandq $MASK,$D1,$D1 + vpaddq $M1,$D2,$D2 # d1 -> d2 + + vpaddq $M4,$D0,$D0 + vpsllq \$2,$M4,$M4 + vpaddq $M4,$D0,$D0 # d4 -> d0 + + vpsrlq \$26,$D2,$M2 + vpandq $MASK,$D2,$D2 + vpaddq $M2,$D3,$D3 # d2 -> d3 + + vpsrlq \$26,$D0,$M0 + vpandq $MASK,$D0,$D0 + vpaddq $M0,$D1,$D1 # d0 -> d1 + + vpsrlq \$26,$D3,$M3 + vpandq $MASK,$D3,$D3 + vpaddq $M3,$D4,$D4 # d3 -> d4 + + ################################################################ + # at this point we have 14243444 in $R0-$S4 and 05060708 in + # $D0-$D4, ... 
+ + vpunpcklqdq $T4,$T3,$T0 # transpose input + vpunpckhqdq $T4,$T3,$T4 + + # ... since input 64-bit lanes are ordered as 73625140, we could + # "vperm" it to 76543210 (here and in each loop iteration), *or* + # we could just flow along, hence the goal for $R0-$S4 is + # 1858286838784888 ... + + vmovdqa32 128(%rcx),$M0 # .Lpermd_avx512: + mov \$0x7777,%eax + kmovw %eax,%k1 + + vpermd $R0,$M0,$R0 # 14243444 -> 1---2---3---4--- + vpermd $R1,$M0,$R1 + vpermd $R2,$M0,$R2 + vpermd $R3,$M0,$R3 + vpermd $R4,$M0,$R4 + + vpermd $D0,$M0,${R0}{%k1} # 05060708 -> 1858286838784888 + vpermd $D1,$M0,${R1}{%k1} + vpermd $D2,$M0,${R2}{%k1} + vpermd $D3,$M0,${R3}{%k1} + vpermd $D4,$M0,${R4}{%k1} + + vpslld \$2,$R1,$S1 # *5 + vpslld \$2,$R2,$S2 + vpslld \$2,$R3,$S3 + vpslld \$2,$R4,$S4 + vpaddd $R1,$S1,$S1 + vpaddd $R2,$S2,$S2 + vpaddd $R3,$S3,$S3 + vpaddd $R4,$S4,$S4 + + vpbroadcastq 32(%rcx),$PADBIT # .L129 + + vpsrlq \$52,$T0,$T2 # splat input + vpsllq \$12,$T4,$T3 + vporq $T3,$T2,$T2 + vpsrlq \$26,$T0,$T1 + vpsrlq \$14,$T4,$T3 + vpsrlq \$40,$T4,$T4 # 4 + vpandq $MASK,$T2,$T2 # 2 + vpandq $MASK,$T0,$T0 # 0 + #vpandq $MASK,$T1,$T1 # 1 + #vpandq $MASK,$T3,$T3 # 3 + #vporq $PADBIT,$T4,$T4 # padbit, yes, always + + vpaddq $H2,$T2,$H2 # accumulate input + sub \$192,$len + jbe .Ltail_avx512 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + ################################################################ + # ((inp[0]*r^8+inp[ 8])*r^8+inp[16])*r^8 + # ((inp[1]*r^8+inp[ 9])*r^8+inp[17])*r^7 + # ((inp[2]*r^8+inp[10])*r^8+inp[18])*r^6 + # ((inp[3]*r^8+inp[11])*r^8+inp[19])*r^5 + # ((inp[4]*r^8+inp[12])*r^8+inp[20])*r^4 + # ((inp[5]*r^8+inp[13])*r^8+inp[21])*r^3 + # ((inp[6]*r^8+inp[14])*r^8+inp[22])*r^2 + # ((inp[7]*r^8+inp[15])*r^8+inp[23])*r^1 + # \________/\___________/ + ################################################################ + #vpaddq $H2,$T2,$H2 # accumulate input + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + # + # however, as h2 is "chronologically" first one available pull + # corresponding operations up, so it's + # + # d3 = h2*r1 + h0*r3 + h1*r2 + h3*r0 + h4*5*r4 + # d4 = h2*r2 + h0*r4 + h1*r3 + h3*r1 + h4*r0 + # d0 = h2*5*r3 + h0*r0 + h1*5*r4 + h3*5*r2 + h4*5*r1 + # d1 = h2*5*r4 + h0*r1 + h1*r0 + h3*5*r3 + h4*5*r2 + # d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + + vpmuludq $H2,$R1,$D3 # d3 = h2*r1 + vpaddq $H0,$T0,$H0 + vpmuludq $H2,$R2,$D4 # d4 = h2*r2 + vpandq $MASK,$T1,$T1 # 1 + vpmuludq $H2,$S3,$D0 # d0 = h2*s3 + vpandq $MASK,$T3,$T3 # 3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + vporq $PADBIT,$T4,$T4 # padbit, yes, always + vpmuludq $H2,$R0,$D2 # d2 = h2*r0 + vpaddq $H1,$T1,$H1 # accumulate input + vpaddq $H3,$T3,$H3 + vpaddq $H4,$T4,$H4 + + vmovdqu64 16*0($inp),$T3 # load input + vmovdqu64 16*4($inp),$T4 + lea 16*8($inp),$inp + vpmuludq $H0,$R3,$M3 + vpmuludq $H0,$R4,$M4 + vpmuludq $H0,$R0,$M0 + vpmuludq $H0,$R1,$M1 + vpaddq $M3,$D3,$D3 # d3 += h0*r3 + vpaddq $M4,$D4,$D4 # d4 += h0*r4 + vpaddq $M0,$D0,$D0 # d0 += h0*r0 + vpaddq $M1,$D1,$D1 # d1 += h0*r1 + + vpmuludq $H1,$R2,$M3 + vpmuludq $H1,$R3,$M4 + vpmuludq $H1,$S4,$M0 + vpmuludq $H0,$R2,$M2 + vpaddq $M3,$D3,$D3 # d3 += h1*r2 + vpaddq $M4,$D4,$D4 # d4 += h1*r3 + vpaddq $M0,$D0,$D0 # d0 += h1*s4 + vpaddq $M2,$D2,$D2 # d2 += h0*r2 + + vpunpcklqdq $T4,$T3,$T0 # transpose input + vpunpckhqdq $T4,$T3,$T4 + + vpmuludq $H3,$R0,$M3 + 
vpmuludq $H3,$R1,$M4 + vpmuludq $H1,$R0,$M1 + vpmuludq $H1,$R1,$M2 + vpaddq $M3,$D3,$D3 # d3 += h3*r0 + vpaddq $M4,$D4,$D4 # d4 += h3*r1 + vpaddq $M1,$D1,$D1 # d1 += h1*r0 + vpaddq $M2,$D2,$D2 # d2 += h1*r1 + + vpmuludq $H4,$S4,$M3 + vpmuludq $H4,$R0,$M4 + vpmuludq $H3,$S2,$M0 + vpmuludq $H3,$S3,$M1 + vpaddq $M3,$D3,$D3 # d3 += h4*s4 + vpmuludq $H3,$S4,$M2 + vpaddq $M4,$D4,$D4 # d4 += h4*r0 + vpaddq $M0,$D0,$D0 # d0 += h3*s2 + vpaddq $M1,$D1,$D1 # d1 += h3*s3 + vpaddq $M2,$D2,$D2 # d2 += h3*s4 + + vpmuludq $H4,$S1,$M0 + vpmuludq $H4,$S2,$M1 + vpmuludq $H4,$S3,$M2 + vpaddq $M0,$D0,$H0 # h0 = d0 + h4*s1 + vpaddq $M1,$D1,$H1 # h1 = d2 + h4*s2 + vpaddq $M2,$D2,$H2 # h2 = d3 + h4*s3 + + ################################################################ + # lazy reduction (interleaved with input splat) + + vpsrlq \$52,$T0,$T2 # splat input + vpsllq \$12,$T4,$T3 + + vpsrlq \$26,$D3,$H3 + vpandq $MASK,$D3,$D3 + vpaddq $H3,$D4,$H4 # h3 -> h4 + + vporq $T3,$T2,$T2 + + vpsrlq \$26,$H0,$D0 + vpandq $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpandq $MASK,$T2,$T2 # 2 + + vpsrlq \$26,$H4,$D4 + vpandq $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpandq $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpaddq $T2,$H2,$H2 # modulo-scheduled + vpsrlq \$26,$T0,$T1 + + vpsrlq \$26,$H2,$D2 + vpandq $MASK,$H2,$H2 + vpaddq $D2,$D3,$H3 # h2 -> h3 + + vpsrlq \$14,$T4,$T3 + + vpsrlq \$26,$H0,$D0 + vpandq $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$40,$T4,$T4 # 4 + + vpsrlq \$26,$H3,$D3 + vpandq $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpandq $MASK,$T0,$T0 # 0 + #vpandq $MASK,$T1,$T1 # 1 + #vpandq $MASK,$T3,$T3 # 3 + #vporq $PADBIT,$T4,$T4 # padbit, yes, always + + sub \$128,$len + ja .Loop_avx512 + +.Ltail_avx512: + ################################################################ + # while above multiplications were by r^8 in all lanes, in last + # iteration we multiply least significant lane by r^8 and most + # significant one by r, that's why table gets shifted... 
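The tail comment above is the core of the vector schedule, and a small model may help. Bulk iterations fold eight blocks at a time into eight independent lanes, multiplying every lane by r^8; the last iteration instead multiplies lane j by r^(8-j), which is why the precomputed table is shifted below. A hedged pure-Python sketch of that schedule, checked against serial Horner evaluation (function names are invented; this is a model of the math, not of the register layout):

    import os

    P = (1 << 130) - 5

    def poly1305_serial(blocks, r):
        h = 0
        for m in blocks:
            h = (h + m) * r % P
        return h

    def poly1305_8way(blocks, r):
        # len(blocks) must be a multiple of 8 for this sketch
        lanes = [0] * 8
        r8 = pow(r, 8, P)
        for i in range(0, len(blocks) - 8, 8):
            for j in range(8):                  # bulk: r^8 in all lanes
                lanes[j] = (lanes[j] + blocks[i + j]) * r8 % P
        for j, m in enumerate(blocks[-8:]):     # tail: lane j gets r^(8-j)
            lanes[j] = (lanes[j] + m) * pow(r, 8 - j, P) % P
        return sum(lanes) % P

    r = int.from_bytes(os.urandom(16), "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    blocks = [int.from_bytes(os.urandom(16), "little") | (1 << 128) for _ in range(24)]
    assert poly1305_serial(blocks, r) == poly1305_8way(blocks, r)

Summing the lanes at the end corresponds to the "horizontal addition" blocks in the assembly.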
+ + vpsrlq \$32,$R0,$R0 # 0105020603070408 + vpsrlq \$32,$R1,$R1 + vpsrlq \$32,$R2,$R2 + vpsrlq \$32,$S3,$S3 + vpsrlq \$32,$S4,$S4 + vpsrlq \$32,$R3,$R3 + vpsrlq \$32,$R4,$R4 + vpsrlq \$32,$S1,$S1 + vpsrlq \$32,$S2,$S2 + + ################################################################ + # load either next or last 64 byte of input + lea ($inp,$len),$inp + + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + + vpmuludq $H2,$R1,$D3 # d3 = h2*r1 + vpmuludq $H2,$R2,$D4 # d4 = h2*r2 + vpmuludq $H2,$S3,$D0 # d0 = h2*s3 + vpandq $MASK,$T1,$T1 # 1 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + vpandq $MASK,$T3,$T3 # 3 + vpmuludq $H2,$R0,$D2 # d2 = h2*r0 + vporq $PADBIT,$T4,$T4 # padbit, yes, always + vpaddq $H1,$T1,$H1 # accumulate input + vpaddq $H3,$T3,$H3 + vpaddq $H4,$T4,$H4 + + vmovdqu 16*0($inp),%x#$T0 + vpmuludq $H0,$R3,$M3 + vpmuludq $H0,$R4,$M4 + vpmuludq $H0,$R0,$M0 + vpmuludq $H0,$R1,$M1 + vpaddq $M3,$D3,$D3 # d3 += h0*r3 + vpaddq $M4,$D4,$D4 # d4 += h0*r4 + vpaddq $M0,$D0,$D0 # d0 += h0*r0 + vpaddq $M1,$D1,$D1 # d1 += h0*r1 + + vmovdqu 16*1($inp),%x#$T1 + vpmuludq $H1,$R2,$M3 + vpmuludq $H1,$R3,$M4 + vpmuludq $H1,$S4,$M0 + vpmuludq $H0,$R2,$M2 + vpaddq $M3,$D3,$D3 # d3 += h1*r2 + vpaddq $M4,$D4,$D4 # d4 += h1*r3 + vpaddq $M0,$D0,$D0 # d0 += h1*s4 + vpaddq $M2,$D2,$D2 # d2 += h0*r2 + + vinserti128 \$1,16*2($inp),%y#$T0,%y#$T0 + vpmuludq $H3,$R0,$M3 + vpmuludq $H3,$R1,$M4 + vpmuludq $H1,$R0,$M1 + vpmuludq $H1,$R1,$M2 + vpaddq $M3,$D3,$D3 # d3 += h3*r0 + vpaddq $M4,$D4,$D4 # d4 += h3*r1 + vpaddq $M1,$D1,$D1 # d1 += h1*r0 + vpaddq $M2,$D2,$D2 # d2 += h1*r1 + + vinserti128 \$1,16*3($inp),%y#$T1,%y#$T1 + vpmuludq $H4,$S4,$M3 + vpmuludq $H4,$R0,$M4 + vpmuludq $H3,$S2,$M0 + vpmuludq $H3,$S3,$M1 + vpmuludq $H3,$S4,$M2 + vpaddq $M3,$D3,$H3 # h3 = d3 + h4*s4 + vpaddq $M4,$D4,$D4 # d4 += h4*r0 + vpaddq $M0,$D0,$D0 # d0 += h3*s2 + vpaddq $M1,$D1,$D1 # d1 += h3*s3 + vpaddq $M2,$D2,$D2 # d2 += h3*s4 + + vpmuludq $H4,$S1,$M0 + vpmuludq $H4,$S2,$M1 + vpmuludq $H4,$S3,$M2 + vpaddq $M0,$D0,$H0 # h0 = d0 + h4*s1 + vpaddq $M1,$D1,$H1 # h1 = d2 + h4*s2 + vpaddq $M2,$D2,$H2 # h2 = d3 + h4*s3 + + ################################################################ + # horizontal addition + + mov \$1,%eax + vpermq \$0xb1,$H3,$D3 + vpermq \$0xb1,$D4,$H4 + vpermq \$0xb1,$H0,$D0 + vpermq \$0xb1,$H1,$D1 + vpermq \$0xb1,$H2,$D2 + vpaddq $D3,$H3,$H3 + vpaddq $D4,$H4,$H4 + vpaddq $D0,$H0,$H0 + vpaddq $D1,$H1,$H1 + vpaddq $D2,$H2,$H2 + + kmovw %eax,%k3 + vpermq \$0x2,$H3,$D3 + vpermq \$0x2,$H4,$D4 + vpermq \$0x2,$H0,$D0 + vpermq \$0x2,$H1,$D1 + vpermq \$0x2,$H2,$D2 + vpaddq $D3,$H3,$H3 + vpaddq $D4,$H4,$H4 + vpaddq $D0,$H0,$H0 + vpaddq $D1,$H1,$H1 + vpaddq $D2,$H2,$H2 + + vextracti64x4 \$0x1,$H3,%y#$D3 + vextracti64x4 \$0x1,$H4,%y#$D4 + vextracti64x4 \$0x1,$H0,%y#$D0 + vextracti64x4 \$0x1,$H1,%y#$D1 + vextracti64x4 \$0x1,$H2,%y#$D2 + vpaddq $D3,$H3,${H3}{%k3}{z} # keep single qword in case + vpaddq $D4,$H4,${H4}{%k3}{z} # it's passed to .Ltail_avx2 + vpaddq $D0,$H0,${H0}{%k3}{z} + vpaddq $D1,$H1,${H1}{%k3}{z} + vpaddq $D2,$H2,${H2}{%k3}{z} +___ +map(s/%z/%y/,($T0,$T1,$T2,$T3,$T4, $PADBIT)); +map(s/%z/%y/,($H0,$H1,$H2,$H3,$H4, $D0,$D1,$D2,$D3,$D4, $MASK)); +$code.=<<___; + ################################################################ + # lazy reduction (interleaved with input splat) + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpunpcklqdq $T3,$T2,$T2 
# 2:3 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpsrlq \$30,$T2,$T3 + vpsrlq \$4,$T2,$T2 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpsrlq \$26,$T0,$T1 + vpsrlq \$40,$T4,$T4 # 4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T0,$T0 # 0 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $H2,$T2,$H2 # accumulate input for .Ltail_avx2 + vpand $MASK,$T1,$T1 # 1 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + vpaddq $D3,$H4,$H4 # h3 -> h4 + + lea 0x90(%rsp),%rax # size optimization for .Ltail_avx2 + add \$64,$len + jnz .Ltail_avx2_512 + + vpsubq $T2,$H2,$H2 # undo input accumulation + vmovd %x#$H0,`4*0-48-64`($ctx)# save partially reduced + vmovd %x#$H1,`4*1-48-64`($ctx) + vmovd %x#$H2,`4*2-48-64`($ctx) + vmovd %x#$H3,`4*3-48-64`($ctx) + vmovd %x#$H4,`4*4-48-64`($ctx) + vzeroall +___ +$code.=<<___ if ($win64); + movdqa 0x50(%r11),%xmm6 + movdqa 0x60(%r11),%xmm7 + movdqa 0x70(%r11),%xmm8 + movdqa 0x80(%r11),%xmm9 + movdqa 0x90(%r11),%xmm10 + movdqa 0xa0(%r11),%xmm11 + movdqa 0xb0(%r11),%xmm12 + movdqa 0xc0(%r11),%xmm13 + movdqa 0xd0(%r11),%xmm14 + movdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx512_epilogue: +___ +$code.=<<___ if (!$win64); + lea 8(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + ret +.cfi_endproc +.size poly1305_blocks_avx512,.-poly1305_blocks_avx512 +___ +if ($avx>3 && 0) { +######################################################################## +# VPMADD52 version using 2^44 radix. +# +# One can argue that base 2^52 would be more natural. Well, even though +# some operations would be more natural, one has to recognize couple of +# things. Base 2^52 doesn't provide advantage over base 2^44 if you look +# at amount of multiply-n-accumulate operations. Secondly, it makes it +# impossible to pre-compute multiples of 5 [referred to as s[]/sN in +# reference implementations], which means that more such operations +# would have to be performed in inner loop, which in turn makes critical +# path longer. In other words, even though base 2^44 reduction might +# look less elegant, overall critical path is actually shorter... + +######################################################################## +# Layout of opaque area is following. 
+# +# unsigned __int64 h[3]; # current hash value base 2^44 +# unsigned __int64 s[2]; # key value*20 base 2^44 +# unsigned __int64 r[3]; # key value base 2^44 +# struct { unsigned __int64 r^1, r^3, r^2, r^4; } R[4]; +# # r^n positions reflect +# # placement in register, not +# # memory, R[3] is R[1]*20 + +$code.=<<___; +.type poly1305_init_base2_44,\@function,3 +.align 32 +poly1305_init_base2_44: + xor %rax,%rax + mov %rax,0($ctx) # initialize hash value + mov %rax,8($ctx) + mov %rax,16($ctx) + +.Linit_base2_44: + lea poly1305_blocks_vpmadd52(%rip),%r10 + lea poly1305_emit_base2_44(%rip),%r11 + + mov \$0x0ffffffc0fffffff,%rax + mov \$0x0ffffffc0ffffffc,%rcx + and 0($inp),%rax + mov \$0x00000fffffffffff,%r8 + and 8($inp),%rcx + mov \$0x00000fffffffffff,%r9 + and %rax,%r8 + shrd \$44,%rcx,%rax + mov %r8,40($ctx) # r0 + and %r9,%rax + shr \$24,%rcx + mov %rax,48($ctx) # r1 + lea (%rax,%rax,4),%rax # *5 + mov %rcx,56($ctx) # r2 + shl \$2,%rax # magic <<2 + lea (%rcx,%rcx,4),%rcx # *5 + shl \$2,%rcx # magic <<2 + mov %rax,24($ctx) # s1 + mov %rcx,32($ctx) # s2 + movq \$-1,64($ctx) # write impossible value +___ +$code.=<<___ if ($flavour !~ /elf32/); + mov %r10,0(%rdx) + mov %r11,8(%rdx) +___ +$code.=<<___ if ($flavour =~ /elf32/); + mov %r10d,0(%rdx) + mov %r11d,4(%rdx) +___ +$code.=<<___; + mov \$1,%eax + ret +.size poly1305_init_base2_44,.-poly1305_init_base2_44 +___ +{ +my ($H0,$H1,$H2,$r2r1r0,$r1r0s2,$r0s2s1,$Dlo,$Dhi) = map("%ymm$_",(0..5,16,17)); +my ($T0,$inp_permd,$inp_shift,$PAD) = map("%ymm$_",(18..21)); +my ($reduc_mask,$reduc_rght,$reduc_left) = map("%ymm$_",(22..25)); + +$code.=<<___; +.type poly1305_blocks_vpmadd52,\@function,4 +.align 32 +poly1305_blocks_vpmadd52: + shr \$4,$len + jz .Lno_data_vpmadd52 # too short + + shl \$40,$padbit + mov 64($ctx),%r8 # peek on power of the key + + # if powers of the key are not calculated yet, process up to 3 + # blocks with this single-block subroutine, otherwise ensure that + # length is divisible by 2 blocks and pass the rest down to next + # subroutine... + + mov \$3,%rax + mov \$1,%r10 + cmp \$4,$len # is input long + cmovae %r10,%rax + test %r8,%r8 # is power value impossible? + cmovns %r10,%rax + + and $len,%rax # is input of favourable length? 
+ jz .Lblocks_vpmadd52_4x + + sub %rax,$len + mov \$7,%r10d + mov \$1,%r11d + kmovw %r10d,%k7 + lea .L2_44_inp_permd(%rip),%r10 + kmovw %r11d,%k1 + + vmovq $padbit,%x#$PAD + vmovdqa64 0(%r10),$inp_permd # .L2_44_inp_permd + vmovdqa64 32(%r10),$inp_shift # .L2_44_inp_shift + vpermq \$0xcf,$PAD,$PAD + vmovdqa64 64(%r10),$reduc_mask # .L2_44_mask + + vmovdqu64 0($ctx),${Dlo}{%k7}{z} # load hash value + vmovdqu64 40($ctx),${r2r1r0}{%k7}{z} # load keys + vmovdqu64 32($ctx),${r1r0s2}{%k7}{z} + vmovdqu64 24($ctx),${r0s2s1}{%k7}{z} + + vmovdqa64 96(%r10),$reduc_rght # .L2_44_shift_rgt + vmovdqa64 128(%r10),$reduc_left # .L2_44_shift_lft + + jmp .Loop_vpmadd52 + +.align 32 +.Loop_vpmadd52: + vmovdqu32 0($inp),%x#$T0 # load input as ----3210 + lea 16($inp),$inp + + vpermd $T0,$inp_permd,$T0 # ----3210 -> --322110 + vpsrlvq $inp_shift,$T0,$T0 + vpandq $reduc_mask,$T0,$T0 + vporq $PAD,$T0,$T0 + + vpaddq $T0,$Dlo,$Dlo # accumulate input + + vpermq \$0,$Dlo,${H0}{%k7}{z} # smash hash value + vpermq \$0b01010101,$Dlo,${H1}{%k7}{z} + vpermq \$0b10101010,$Dlo,${H2}{%k7}{z} + + vpxord $Dlo,$Dlo,$Dlo + vpxord $Dhi,$Dhi,$Dhi + + vpmadd52luq $r2r1r0,$H0,$Dlo + vpmadd52huq $r2r1r0,$H0,$Dhi + + vpmadd52luq $r1r0s2,$H1,$Dlo + vpmadd52huq $r1r0s2,$H1,$Dhi + + vpmadd52luq $r0s2s1,$H2,$Dlo + vpmadd52huq $r0s2s1,$H2,$Dhi + + vpsrlvq $reduc_rght,$Dlo,$T0 # 0 in topmost qword + vpsllvq $reduc_left,$Dhi,$Dhi # 0 in topmost qword + vpandq $reduc_mask,$Dlo,$Dlo + + vpaddq $T0,$Dhi,$Dhi + + vpermq \$0b10010011,$Dhi,$Dhi # 0 in lowest qword + + vpaddq $Dhi,$Dlo,$Dlo # note topmost qword :-) + + vpsrlvq $reduc_rght,$Dlo,$T0 # 0 in topmost word + vpandq $reduc_mask,$Dlo,$Dlo + + vpermq \$0b10010011,$T0,$T0 + + vpaddq $T0,$Dlo,$Dlo + + vpermq \$0b10010011,$Dlo,${T0}{%k1}{z} + + vpaddq $T0,$Dlo,$Dlo + vpsllq \$2,$T0,$T0 + + vpaddq $T0,$Dlo,$Dlo + + dec %rax # len-=16 + jnz .Loop_vpmadd52 + + vmovdqu64 $Dlo,0($ctx){%k7} # store hash value + + test $len,$len + jnz .Lblocks_vpmadd52_4x + +.Lno_data_vpmadd52: + ret +.size poly1305_blocks_vpmadd52,.-poly1305_blocks_vpmadd52 +___ +} +{ +######################################################################## +# As implied by its name 4x subroutine processes 4 blocks in parallel +# (but handles even 4*n+2 blocks lengths). It takes up to 4th key power +# and is handled in 256-bit %ymm registers. + +my ($H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2) = map("%ymm$_",(0..5,16,17)); +my ($D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi) = map("%ymm$_",(18..23)); +my ($T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD) = map("%ymm$_",(24..31)); + +$code.=<<___; +.type poly1305_blocks_vpmadd52_4x,\@function,4 +.align 32 +poly1305_blocks_vpmadd52_4x: + shr \$4,$len + jz .Lno_data_vpmadd52_4x # too short + + shl \$40,$padbit + mov 64($ctx),%r8 # peek on power of the key + +.Lblocks_vpmadd52_4x: + vpbroadcastq $padbit,$PAD + + vmovdqa64 .Lx_mask44(%rip),$mask44 + mov \$5,%eax + vmovdqa64 .Lx_mask42(%rip),$mask42 + kmovw %eax,%k1 # used in 2x path + + test %r8,%r8 # is power value impossible? + js .Linit_vpmadd52 # if it is, then init R[4] + + vmovq 0($ctx),%x#$H0 # load current hash value + vmovq 8($ctx),%x#$H1 + vmovq 16($ctx),%x#$H2 + + test \$3,$len # is length 4*n+2? + jnz .Lblocks_vpmadd52_2x_do + +.Lblocks_vpmadd52_4x_do: + vpbroadcastq 64($ctx),$R0 # load 4th power of the key + vpbroadcastq 96($ctx),$R1 + vpbroadcastq 128($ctx),$R2 + vpbroadcastq 160($ctx),$S1 + +.Lblocks_vpmadd52_4x_key_loaded: + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S2,$S2 + + test \$7,$len # is len 8*n? 
+ jz .Lblocks_vpmadd52_8x + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*2($inp),$T3 + lea 16*4($inp),$inp + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + + # at this point 64-bit lanes are ordered as 3-1-2-0 + + vpsrlq \$24,$T3,$T2 # splat the data + vporq $PAD,$T2,$T2 + vpaddq $T2,$H2,$H2 # accumulate input + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + sub \$4,$len + jz .Ltail_vpmadd52_4x + jmp .Loop_vpmadd52_4x + ud2 + +.align 32 +.Linit_vpmadd52: + vmovq 24($ctx),%x#$S1 # load key + vmovq 56($ctx),%x#$H2 + vmovq 32($ctx),%x#$S2 + vmovq 40($ctx),%x#$R0 + vmovq 48($ctx),%x#$R1 + + vmovdqa $R0,$H0 + vmovdqa $R1,$H1 + vmovdqa $H2,$R2 + + mov \$2,%eax + +.Lmul_init_vpmadd52: + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + dec %eax + jz .Ldone_init_vpmadd52 + + vpunpcklqdq $R1,$H1,$R1 # 1,2 + vpbroadcastq %x#$H1,%x#$H1 # 2,2 + vpunpcklqdq $R2,$H2,$R2 + vpbroadcastq %x#$H2,%x#$H2 + vpunpcklqdq $R0,$H0,$R0 + vpbroadcastq %x#$H0,%x#$H0 + + vpsllq \$2,$R1,$S1 # S1 = R1*5*4 + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R1,$S1,$S1 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S1,$S1 + vpsllq \$2,$S2,$S2 + + jmp .Lmul_init_vpmadd52 + ud2 + +.align 32 +.Ldone_init_vpmadd52: + vinserti128 \$1,%x#$R1,$H1,$R1 # 1,2,3,4 + vinserti128 \$1,%x#$R2,$H2,$R2 + vinserti128 \$1,%x#$R0,$H0,$R0 + + vpermq \$0b11011000,$R1,$R1 # 1,3,2,4 + vpermq \$0b11011000,$R2,$R2 + vpermq \$0b11011000,$R0,$R0 + + vpsllq \$2,$R1,$S1 # S1 = R1*5*4 + vpaddq $R1,$S1,$S1 + vpsllq \$2,$S1,$S1 + + vmovq 0($ctx),%x#$H0 # load current hash value + vmovq 8($ctx),%x#$H1 + vmovq 16($ctx),%x#$H2 + + test \$3,$len # is length 4*n+2? 
+ jnz .Ldone_init_vpmadd52_2x + + vmovdqu64 $R0,64($ctx) # save key powers + vpbroadcastq %x#$R0,$R0 # broadcast 4th power + vmovdqu64 $R1,96($ctx) + vpbroadcastq %x#$R1,$R1 + vmovdqu64 $R2,128($ctx) + vpbroadcastq %x#$R2,$R2 + vmovdqu64 $S1,160($ctx) + vpbroadcastq %x#$S1,$S1 + + jmp .Lblocks_vpmadd52_4x_key_loaded + ud2 + +.align 32 +.Ldone_init_vpmadd52_2x: + vmovdqu64 $R0,64($ctx) # save key powers + vpsrldq \$8,$R0,$R0 # 0-1-0-2 + vmovdqu64 $R1,96($ctx) + vpsrldq \$8,$R1,$R1 + vmovdqu64 $R2,128($ctx) + vpsrldq \$8,$R2,$R2 + vmovdqu64 $S1,160($ctx) + vpsrldq \$8,$S1,$S1 + jmp .Lblocks_vpmadd52_2x_key_loaded + ud2 + +.align 32 +.Lblocks_vpmadd52_2x_do: + vmovdqu64 128+8($ctx),${R2}{%k1}{z}# load 2nd and 1st key powers + vmovdqu64 160+8($ctx),${S1}{%k1}{z} + vmovdqu64 64+8($ctx),${R0}{%k1}{z} + vmovdqu64 96+8($ctx),${R1}{%k1}{z} + +.Lblocks_vpmadd52_2x_key_loaded: + vmovdqu64 16*0($inp),$T2 # load data + vpxorq $T3,$T3,$T3 + lea 16*2($inp),$inp + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + + # at this point 64-bit lanes are ordered as x-1-x-0 + + vpsrlq \$24,$T3,$T2 # splat the data + vporq $PAD,$T2,$T2 + vpaddq $T2,$H2,$H2 # accumulate input + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + jmp .Ltail_vpmadd52_2x + ud2 + +.align 32 +.Loop_vpmadd52_4x: + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*2($inp),$T3 + lea 16*4($inp),$inp + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # partial reduction (interleaved with data splat) + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpsrlq \$24,$T3,$T2 + vporq $PAD,$T2,$T2 + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + sub \$4,$len # len-=64 + jnz .Loop_vpmadd52_4x + +.Ltail_vpmadd52_4x: + vmovdqu64 128($ctx),$R2 # load all key powers + vmovdqu64 160($ctx),$S1 + vmovdqu64 64($ctx),$R0 + vmovdqu64 96($ctx),$R1 + +.Ltail_vpmadd52_2x: + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S2,$S2 + + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + 
vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # horizontal addition + + mov \$1,%eax + kmovw %eax,%k1 + vpsrldq \$8,$D0lo,$T0 + vpsrldq \$8,$D0hi,$H0 + vpsrldq \$8,$D1lo,$T1 + vpsrldq \$8,$D1hi,$H1 + vpaddq $T0,$D0lo,$D0lo + vpaddq $H0,$D0hi,$D0hi + vpsrldq \$8,$D2lo,$T2 + vpsrldq \$8,$D2hi,$H2 + vpaddq $T1,$D1lo,$D1lo + vpaddq $H1,$D1hi,$D1hi + vpermq \$0x2,$D0lo,$T0 + vpermq \$0x2,$D0hi,$H0 + vpaddq $T2,$D2lo,$D2lo + vpaddq $H2,$D2hi,$D2hi + + vpermq \$0x2,$D1lo,$T1 + vpermq \$0x2,$D1hi,$H1 + vpaddq $T0,$D0lo,${D0lo}{%k1}{z} + vpaddq $H0,$D0hi,${D0hi}{%k1}{z} + vpermq \$0x2,$D2lo,$T2 + vpermq \$0x2,$D2hi,$H2 + vpaddq $T1,$D1lo,${D1lo}{%k1}{z} + vpaddq $H1,$D1hi,${D1hi}{%k1}{z} + vpaddq $T2,$D2lo,${D2lo}{%k1}{z} + vpaddq $H2,$D2hi,${D2hi}{%k1}{z} + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + # at this point $len is + # either 4*n+2 or 0... + sub \$2,$len # len-=32 + ja .Lblocks_vpmadd52_4x_do + + vmovq %x#$H0,0($ctx) + vmovq %x#$H1,8($ctx) + vmovq %x#$H2,16($ctx) + vzeroall + +.Lno_data_vpmadd52_4x: + ret +.size poly1305_blocks_vpmadd52_4x,.-poly1305_blocks_vpmadd52_4x +___ +} +{ +######################################################################## +# As implied by its name 8x subroutine processes 8 blocks in parallel... +# This is intermediate version, as it's used only in cases when input +# length is either 8*n, 8*n+1 or 8*n+2... + +my ($H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2) = map("%ymm$_",(0..5,16,17)); +my ($D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi) = map("%ymm$_",(18..23)); +my ($T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD) = map("%ymm$_",(24..31)); +my ($RR0,$RR1,$RR2,$SS1,$SS2) = map("%ymm$_",(6..10)); + +$code.=<<___; +.type poly1305_blocks_vpmadd52_8x,\@function,4 +.align 32 +poly1305_blocks_vpmadd52_8x: + shr \$4,$len + jz .Lno_data_vpmadd52_8x # too short + + shl \$40,$padbit + mov 64($ctx),%r8 # peek on power of the key + + vmovdqa64 .Lx_mask44(%rip),$mask44 + vmovdqa64 .Lx_mask42(%rip),$mask42 + + test %r8,%r8 # is power value impossible? 
+ js .Linit_vpmadd52 # if it is, then init R[4] + + vmovq 0($ctx),%x#$H0 # load current hash value + vmovq 8($ctx),%x#$H1 + vmovq 16($ctx),%x#$H2 + +.Lblocks_vpmadd52_8x: + ################################################################ + # first we calculate more key powers + + vmovdqu64 128($ctx),$R2 # load 1-3-2-4 powers + vmovdqu64 160($ctx),$S1 + vmovdqu64 64($ctx),$R0 + vmovdqu64 96($ctx),$R1 + + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S2,$S2 + + vpbroadcastq %x#$R2,$RR2 # broadcast 4th power + vpbroadcastq %x#$R0,$RR0 + vpbroadcastq %x#$R1,$RR1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $RR2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $RR2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $RR2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $RR2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $RR2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $RR2,$R0,$D2hi + + vpmadd52luq $RR0,$R0,$D0lo + vpmadd52huq $RR0,$R0,$D0hi + vpmadd52luq $RR0,$R1,$D1lo + vpmadd52huq $RR0,$R1,$D1hi + vpmadd52luq $RR0,$R2,$D2lo + vpmadd52huq $RR0,$R2,$D2hi + + vpmadd52luq $RR1,$S2,$D0lo + vpmadd52huq $RR1,$S2,$D0hi + vpmadd52luq $RR1,$R0,$D1lo + vpmadd52huq $RR1,$R0,$D1hi + vpmadd52luq $RR1,$R1,$D2lo + vpmadd52huq $RR1,$R1,$D2hi + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$RR0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$RR1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$RR2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$RR0,$RR0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$RR0,$RR0 + + vpsrlq \$44,$RR0,$tmp # additional step + vpandq $mask44,$RR0,$RR0 + + vpaddq $tmp,$RR1,$RR1 + + ################################################################ + # At this point Rx holds 1324 powers, RRx - 5768, and the goal + # is 15263748, which reflects how data is loaded...
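The multiply pattern above, together with the S2 = R2*5*4 precomputation, is a base-2^44 multiplication modulo 2^130-5: product terms that land at or beyond bit 130 wrap around times 5, and because the limbs sit at bits 0/44/88 those wrapped terms pick up an extra factor of 4, hence the *20 multipliers. A scalar C model under the same limb layout (a sketch; fe1305 and fe_mul are illustrative names, not this module's API):

#include <stdint.h>
typedef unsigned __int128 u128;
typedef struct { uint64_t l[3]; } fe1305;   /* base-2^44/44/42 element */

/* h = a*b mod 2^130-5, following the d0/d1/d2 term pattern of the
 * vpmadd52 code above: s1 = 20*b1, s2 = 20*b2 absorb the wrap-around. */
static fe1305 fe_mul(fe1305 a, fe1305 b)
{
    const uint64_t m44 = ((uint64_t)1 << 44) - 1;
    const uint64_t m42 = ((uint64_t)1 << 42) - 1;
    uint64_t s1 = b.l[1] * 20, s2 = b.l[2] * 20;
    u128 d0 = (u128)a.l[0]*b.l[0] + (u128)a.l[1]*s2     + (u128)a.l[2]*s1;
    u128 d1 = (u128)a.l[0]*b.l[1] + (u128)a.l[1]*b.l[0] + (u128)a.l[2]*s2;
    u128 d2 = (u128)a.l[0]*b.l[2] + (u128)a.l[1]*b.l[1] + (u128)a.l[2]*b.l[0];
    fe1305 h;
    uint64_t c;
    d1 += d0 >> 44;           h.l[0] = (uint64_t)d0 & m44;
    d2 += d1 >> 44;           h.l[1] = (uint64_t)d1 & m44;
    c = (uint64_t)(d2 >> 42); h.l[2] = (uint64_t)d2 & m42;
    h.l[0] += c * 5;          /* 2^130 = 5 mod p */
    h.l[1] += h.l[0] >> 44;   h.l[0] &= m44;
    return h;
}
/* e.g. the 8x path derives r^5..r^8 from the stored r^1..r^4 by one
 * more multiplication with r^4: pow[i+4] = fe_mul(pow[i], pow[3]). */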
+ + vpunpcklqdq $R2,$RR2,$T2 # 3748 + vpunpckhqdq $R2,$RR2,$R2 # 1526 + vpunpcklqdq $R0,$RR0,$T0 + vpunpckhqdq $R0,$RR0,$R0 + vpunpcklqdq $R1,$RR1,$T1 + vpunpckhqdq $R1,$RR1,$R1 +___ +######## switch to %zmm +map(s/%y/%z/, $H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2); +map(s/%y/%z/, $D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi); +map(s/%y/%z/, $T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD); +map(s/%y/%z/, $RR0,$RR1,$RR2,$SS1,$SS2); + +$code.=<<___; + vshufi64x2 \$0x44,$R2,$T2,$RR2 # 15263748 + vshufi64x2 \$0x44,$R0,$T0,$RR0 + vshufi64x2 \$0x44,$R1,$T1,$RR1 + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*4($inp),$T3 + lea 16*8($inp),$inp + + vpsllq \$2,$RR2,$SS2 # S2 = R2*5*4 + vpsllq \$2,$RR1,$SS1 # S1 = R1*5*4 + vpaddq $RR2,$SS2,$SS2 + vpaddq $RR1,$SS1,$SS1 + vpsllq \$2,$SS2,$SS2 + vpsllq \$2,$SS1,$SS1 + + vpbroadcastq $padbit,$PAD + vpbroadcastq %x#$mask44,$mask44 + vpbroadcastq %x#$mask42,$mask42 + + vpbroadcastq %x#$SS1,$S1 # broadcast 8th power + vpbroadcastq %x#$SS2,$S2 + vpbroadcastq %x#$RR0,$R0 + vpbroadcastq %x#$RR1,$R1 + vpbroadcastq %x#$RR2,$R2 + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + + # at this point 64-bit lanes are ordered as 73625140 + + vpsrlq \$24,$T3,$T2 # splat the data + vporq $PAD,$T2,$T2 + vpaddq $T2,$H2,$H2 # accumulate input + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + sub \$8,$len + jz .Ltail_vpmadd52_8x + jmp .Loop_vpmadd52_8x + +.align 32 +.Loop_vpmadd52_8x: + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*4($inp),$T3 + lea 16*8($inp),$inp + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # partial reduction (interleaved with data splat) + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpsrlq \$24,$T3,$T2 + vporq $PAD,$T2,$T2 + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + sub \$8,$len # len-=128 + jnz .Loop_vpmadd52_8x + +.Ltail_vpmadd52_8x: + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$SS1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$SS1,$D0hi + vpxorq 
$D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$SS2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$SS2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$RR0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$RR0,$D2hi + + vpmadd52luq $H0,$RR0,$D0lo + vpmadd52huq $H0,$RR0,$D0hi + vpmadd52luq $H0,$RR1,$D1lo + vpmadd52huq $H0,$RR1,$D1hi + vpmadd52luq $H0,$RR2,$D2lo + vpmadd52huq $H0,$RR2,$D2hi + + vpmadd52luq $H1,$SS2,$D0lo + vpmadd52huq $H1,$SS2,$D0hi + vpmadd52luq $H1,$RR0,$D1lo + vpmadd52huq $H1,$RR0,$D1hi + vpmadd52luq $H1,$RR1,$D2lo + vpmadd52huq $H1,$RR1,$D2hi + + ################################################################ + # horizontal addition + + mov \$1,%eax + kmovw %eax,%k1 + vpsrldq \$8,$D0lo,$T0 + vpsrldq \$8,$D0hi,$H0 + vpsrldq \$8,$D1lo,$T1 + vpsrldq \$8,$D1hi,$H1 + vpaddq $T0,$D0lo,$D0lo + vpaddq $H0,$D0hi,$D0hi + vpsrldq \$8,$D2lo,$T2 + vpsrldq \$8,$D2hi,$H2 + vpaddq $T1,$D1lo,$D1lo + vpaddq $H1,$D1hi,$D1hi + vpermq \$0x2,$D0lo,$T0 + vpermq \$0x2,$D0hi,$H0 + vpaddq $T2,$D2lo,$D2lo + vpaddq $H2,$D2hi,$D2hi + + vpermq \$0x2,$D1lo,$T1 + vpermq \$0x2,$D1hi,$H1 + vpaddq $T0,$D0lo,$D0lo + vpaddq $H0,$D0hi,$D0hi + vpermq \$0x2,$D2lo,$T2 + vpermq \$0x2,$D2hi,$H2 + vpaddq $T1,$D1lo,$D1lo + vpaddq $H1,$D1hi,$D1hi + vextracti64x4 \$1,$D0lo,%y#$T0 + vextracti64x4 \$1,$D0hi,%y#$H0 + vpaddq $T2,$D2lo,$D2lo + vpaddq $H2,$D2hi,$D2hi + + vextracti64x4 \$1,$D1lo,%y#$T1 + vextracti64x4 \$1,$D1hi,%y#$H1 + vextracti64x4 \$1,$D2lo,%y#$T2 + vextracti64x4 \$1,$D2hi,%y#$H2 +___ +######## switch back to %ymm +map(s/%z/%y/, $H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2); +map(s/%z/%y/, $D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi); +map(s/%z/%y/, $T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD); + +$code.=<<___; + vpaddq $T0,$D0lo,${D0lo}{%k1}{z} + vpaddq $H0,$D0hi,${D0hi}{%k1}{z} + vpaddq $T1,$D1lo,${D1lo}{%k1}{z} + vpaddq $H1,$D1hi,${D1hi}{%k1}{z} + vpaddq $T2,$D2lo,${D2lo}{%k1}{z} + vpaddq $H2,$D2hi,${D2hi}{%k1}{z} + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + ################################################################ + + vmovq %x#$H0,0($ctx) + vmovq %x#$H1,8($ctx) + vmovq %x#$H2,16($ctx) + vzeroall + +.Lno_data_vpmadd52_8x: + ret +.size poly1305_blocks_vpmadd52_8x,.-poly1305_blocks_vpmadd52_8x +___ +} +$code.=<<___; +.type poly1305_emit_base2_44,\@function,3 +.align 32 +poly1305_emit_base2_44: + mov 0($ctx),%r8 # load hash value + mov 8($ctx),%r9 + mov 16($ctx),%r10 + + mov %r9,%rax + shr \$20,%r9 + shl \$44,%rax + mov %r10,%rcx + shr \$40,%r10 + shl \$24,%rcx + + add %rax,%r8 + adc %rcx,%r9 + adc \$0,%r10 + + mov %r8,%rax + add \$5,%r8 # compare to modulus + mov %r9,%rcx + adc \$0,%r9 + adc \$0,%r10 + shr \$2,%r10 # did 130-bit value overflow? 
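poly1305_emit_base2_44 packs the three base-2^44 limbs into two 64-bit words plus two top bits, adds 5 to probe bit 130 (set exactly when h >= 2^130-5), selects h or h+5 with the cmovnz pair that follows, and finally adds the nonce modulo 2^128. The same steps in scalar C (a sketch, with a branch where the assembly stays branch-free; assumes a little-endian host; emit44 is a made-up name):

#include <stdint.h>
#include <string.h>
typedef unsigned __int128 u128;

static void emit44(const uint64_t h[3], const uint64_t nonce[2],
                   uint8_t mac[16])
{
    /* pack 44+44+42-bit limbs into 64+64 bits plus two top bits */
    u128 w0 = (u128)h[0] + ((u128)h[1] << 44);
    u128 w1 = (u128)(h[1] >> 20) + ((u128)h[2] << 24) + (uint64_t)(w0 >> 64);
    uint64_t lo = (uint64_t)w0, hi = (uint64_t)w1;
    uint64_t top = (uint64_t)(w1 >> 64);          /* bits 128..129 */

    u128 t = (u128)lo + 5;                        /* compare to 2^130-5 */
    uint64_t lo5 = (uint64_t)t;
    t = (u128)hi + (uint64_t)(t >> 64);
    uint64_t hi5 = (uint64_t)t;
    if ((top + (uint64_t)(t >> 64)) >> 2) {       /* did bit 130 appear? */
        lo = lo5;                                 /* h-p == h+5 mod 2^128 */
        hi = hi5;
    }
    t = (u128)lo + nonce[0];                      /* accumulate nonce */
    lo = (uint64_t)t;
    hi += nonce[1] + (uint64_t)(t >> 64);
    memcpy(mac + 0, &lo, 8);                      /* little-endian tag */
    memcpy(mac + 8, &hi, 8);
}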
+ cmovnz %r8,%rax + cmovnz %r9,%rcx + + add 0($nonce),%rax # accumulate nonce + adc 8($nonce),%rcx + mov %rax,0($mac) # write result + mov %rcx,8($mac) + + ret +.size poly1305_emit_base2_44,.-poly1305_emit_base2_44 +___ +} } } +} + + +# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame, +# CONTEXT *context,DISPATCHER_CONTEXT *disp) +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___; +.extern __imp_RtlVirtualUnwind +.type se_handler,\@abi-omnipotent +.align 16 +se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<.Lprologue + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=.Lepilogue + jae .Lcommon_seh_tail + + lea 48(%rax),%rax + + mov -8(%rax),%rbx + mov -16(%rax),%rbp + mov -24(%rax),%r12 + mov -32(%rax),%r13 + mov -40(%rax),%r14 + mov -48(%rax),%r15 + mov %rbx,144($context) # restore context->Rbx + mov %rbp,160($context) # restore context->Rbp + mov %r12,216($context) # restore context->R12 + mov %r13,224($context) # restore context->R13 + mov %r14,232($context) # restore context->R14 + mov %r15,240($context) # restore context->R15 + + jmp .Lcommon_seh_tail +.size se_handler,.-se_handler + +.type avx_handler,\@abi-omnipotent +.align 16 +avx_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 208($context),%rax # pull context->R11 + + lea 0x50(%rax),%rsi + lea 0xf8(%rax),%rax + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx + .long 0xa548f3fc # cld; rep movsq + +.Lcommon_seh_tail: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$154,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size avx_handler,.-avx_handler + +.section .pdata +.align 4 + .rva .LSEH_begin_poly1305_init_x86_64
+ .rva .LSEH_end_poly1305_init_x86_64 + .rva .LSEH_info_poly1305_init + + .rva .LSEH_begin_poly1305_blocks_x86_64 + .rva .LSEH_end_poly1305_blocks_x86_64 + .rva .LSEH_info_poly1305_blocks + + .rva .LSEH_begin_poly1305_emit_x86_64 + .rva .LSEH_end_poly1305_emit_x86_64 + .rva .LSEH_info_poly1305_emit +___ +$code.=<<___ if ($avx); + .rva .LSEH_begin_poly1305_blocks_avx + .rva .Lbase2_64_avx + .rva .LSEH_info_poly1305_blocks_avx_1 + + .rva .Lbase2_64_avx + .rva .Leven_avx + .rva .LSEH_info_poly1305_blocks_avx_2 + + .rva .Leven_avx + .rva .LSEH_end_poly1305_blocks_avx + .rva .LSEH_info_poly1305_blocks_avx_3 + + .rva .LSEH_begin_poly1305_emit_avx + .rva .LSEH_end_poly1305_emit_avx + .rva .LSEH_info_poly1305_emit_avx +___ +$code.=<<___ if ($avx>1); + .rva .LSEH_begin_poly1305_blocks_avx2 + .rva .Lbase2_64_avx2 + .rva .LSEH_info_poly1305_blocks_avx2_1 + + .rva .Lbase2_64_avx2 + .rva .Leven_avx2 + .rva .LSEH_info_poly1305_blocks_avx2_2 + + .rva .Leven_avx2 + .rva .LSEH_end_poly1305_blocks_avx2 + .rva .LSEH_info_poly1305_blocks_avx2_3 +___ +$code.=<<___ if ($avx>2); + .rva .LSEH_begin_poly1305_blocks_avx512 + .rva .LSEH_end_poly1305_blocks_avx512 + .rva .LSEH_info_poly1305_blocks_avx512 +___ +$code.=<<___; +.section .xdata +.align 8 +.LSEH_info_poly1305_init: + .byte 9,0,0,0 + .rva se_handler + .rva .LSEH_begin_poly1305_init_x86_64,.LSEH_begin_poly1305_init_x86_64 + +.LSEH_info_poly1305_blocks: + .byte 9,0,0,0 + .rva se_handler + .rva .Lblocks_body,.Lblocks_epilogue + +.LSEH_info_poly1305_emit: + .byte 9,0,0,0 + .rva se_handler + .rva .LSEH_begin_poly1305_emit_x86_64,.LSEH_begin_poly1305_emit_x86_64 +___ +$code.=<<___ if ($avx); +.LSEH_info_poly1305_blocks_avx_1: + .byte 9,0,0,0 + .rva se_handler + .rva .Lblocks_avx_body,.Lblocks_avx_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx_2: + .byte 9,0,0,0 + .rva se_handler + .rva .Lbase2_64_avx_body,.Lbase2_64_avx_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx_3: + .byte 9,0,0,0 + .rva avx_handler + .rva .Ldo_avx_body,.Ldo_avx_epilogue # HandlerData[] + +.LSEH_info_poly1305_emit_avx: + .byte 9,0,0,0 + .rva se_handler + .rva .LSEH_begin_poly1305_emit_avx,.LSEH_begin_poly1305_emit_avx +___ +$code.=<<___ if ($avx>1); +.LSEH_info_poly1305_blocks_avx2_1: + .byte 9,0,0,0 + .rva se_handler + .rva .Lblocks_avx2_body,.Lblocks_avx2_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx2_2: + .byte 9,0,0,0 + .rva se_handler + .rva .Lbase2_64_avx2_body,.Lbase2_64_avx2_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx2_3: + .byte 9,0,0,0 + .rva avx_handler + .rva .Ldo_avx2_body,.Ldo_avx2_epilogue # HandlerData[] +___ +$code.=<<___ if ($avx>2); +.LSEH_info_poly1305_blocks_avx512: + .byte 9,0,0,0 + .rva avx_handler + .rva .Ldo_avx512_body,.Ldo_avx512_epilogue # HandlerData[] +___ +} + +foreach (split('\n',$code)) { + s/\`([^\`]*)\`/eval($1)/ge; + s/%r([a-z]+)#d/%e$1/g; + s/%r([0-9]+)#d/%r$1d/g; + s/%x#%[yz]/%x/g or s/%y#%z/%y/g or s/%z#%[yz]/%z/g; + + print $_,"\n"; +} +close STDOUT; diff --git a/crypto/make_poly1305_x86.pl b/crypto/make_poly1305_x86.pl new file mode 100644 index 0000000..ec1efd9 --- /dev/null +++ b/crypto/make_poly1305_x86.pl @@ -0,0 +1,1815 @@ +#! /usr/bin/env perl +# Copyright 2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. 
You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# This module implements Poly1305 hash for x86. +# +# April 2015 +# +# Numbers are cycles per processed byte with poly1305_blocks alone, +# measured with rdtsc at fixed clock frequency. +# +# IALU/gcc-3.4(*) SSE2(**) AVX2 +# Pentium 15.7/+80% - +# PIII 6.21/+90% - +# P4 19.8/+40% 3.24 +# Core 2 4.85/+90% 1.80 +# Westmere 4.58/+100% 1.43 +# Sandy Bridge 3.90/+100% 1.36 +# Haswell 3.88/+70% 1.18 0.72 +# Skylake 3.10/+60% 1.14 0.62 +# Silvermont 11.0/+40% 4.80 +# Goldmont 4.10/+200% 2.10 +# VIA Nano 6.71/+90% 2.47 +# Sledgehammer 3.51/+180% 4.27 +# Bulldozer 4.53/+140% 1.31 +# +# (*) gcc 4.8 for some reason generated worse code; +# (**) besides SSE2 there are floating-point and AVX options; FP +# is deemed unnecessary, because pre-SSE2 processor are too +# old to care about, while it's not the fastest option on +# SSE2-capable ones; AVX is omitted, because it doesn't give +# a lot of improvement, 5-10% depending on processor; + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../perlasm"); +require "x86asm.pl"; + +$output=pop; +open STDOUT,">$output"; + +&asm_init($ARGV[0],$ARGV[$#ARGV] eq "386"); + +$sse2=$avx=0; +for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); } + +if ($sse2) { + &static_label("const_sse2"); + &static_label("enter_blocks"); + &static_label("enter_emit"); + &external_label("OPENSSL_ia32cap_P"); + + if (`$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1` + =~ /GNU assembler version ([2-9]\.[0-9]+)/) { + $avx = ($1>=2.19) + ($1>=2.22); + } + + if (!$avx && $ARGV[0] eq "win32n" && + `nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)/) { + $avx = ($1>=2.09) + ($1>=2.10); + } + + if (!$avx && `$ENV{CC} -v 2>&1` =~ /(^clang version|based on LLVM) ([3-9]\.[0-9]+)/) { + $avx = ($2>=3.0) + ($2>3.0); + } +} + +######################################################################## +# Layout of opaque area is following. +# +# unsigned __int32 h[5]; # current hash value base 2^32 +# unsigned __int32 pad; # is_base2_26 in vector context +# unsigned __int32 r[4]; # key value base 2^32 + +&align(64); +&function_begin("poly1305_init"); + &mov ("edi",&wparam(0)); # context + &mov ("esi",&wparam(1)); # key + &mov ("ebp",&wparam(2)); # function table + + &xor ("eax","eax"); + &mov (&DWP(4*0,"edi"),"eax"); # zero hash value + &mov (&DWP(4*1,"edi"),"eax"); + &mov (&DWP(4*2,"edi"),"eax"); + &mov (&DWP(4*3,"edi"),"eax"); + &mov (&DWP(4*4,"edi"),"eax"); + &mov (&DWP(4*5,"edi"),"eax"); # is_base2_26 + + &cmp ("esi",0); + &je (&label("nokey")); + + if ($sse2) { + &call (&label("pic_point")); + &set_label("pic_point"); + &blindpop("ebx"); + + &lea ("eax",&DWP("poly1305_blocks-".&label("pic_point"),"ebx")); + &lea ("edx",&DWP("poly1305_emit-".&label("pic_point"),"ebx")); + + &picmeup("edi","OPENSSL_ia32cap_P","ebx",&label("pic_point")); + &mov ("ecx",&DWP(0,"edi")); + &and ("ecx",1<<26|1<<24); + &cmp ("ecx",1<<26|1<<24); # SSE2 and XMM? 
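For reference, the opaque area described above maps onto the following C view, and the and-masks a few lines below are the standard Poly1305 clamping of r. A sketch only: the assembly addresses these fields by byte offset, and poly1305_ctx32/load_key are illustrative names:

#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t h[5];    /* current hash value, base 2^32 */
    uint32_t pad;     /* doubles as is_base2_26 in vector context */
    uint32_t r[4];    /* clamped key, base 2^32 */
} poly1305_ctx32;

/* poly1305_init zeroes h[0..4] and the flag, then stores the key with
 * the top 4 bits of every word and the low 2 bits of r[1..3] cleared. */
static void load_key(poly1305_ctx32 *ctx, const uint8_t key[16])
{
    memset(ctx->h, 0, sizeof(ctx->h));
    ctx->pad = 0;
    memcpy(ctx->r, key, 16);          /* little-endian words */
    ctx->r[0] &= 0x0fffffff;
    ctx->r[1] &= 0x0ffffffc;
    ctx->r[2] &= 0x0ffffffc;
    ctx->r[3] &= 0x0ffffffc;
}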
+ &jne (&label("no_sse2")); + + &lea ("eax",&DWP("_poly1305_blocks_sse2-".&label("pic_point"),"ebx")); + &lea ("edx",&DWP("_poly1305_emit_sse2-".&label("pic_point"),"ebx")); + + if ($avx>1) { + &mov ("ecx",&DWP(8,"edi")); + &test ("ecx",1<<5); # AVX2? + &jz (&label("no_sse2")); + + &lea ("eax",&DWP("_poly1305_blocks_avx2-".&label("pic_point"),"ebx")); + } + &set_label("no_sse2"); + &mov ("edi",&wparam(0)); # reload context + &mov (&DWP(0,"ebp"),"eax"); # fill function table + &mov (&DWP(4,"ebp"),"edx"); + } + + &mov ("eax",&DWP(4*0,"esi")); # load input key + &mov ("ebx",&DWP(4*1,"esi")); + &mov ("ecx",&DWP(4*2,"esi")); + &mov ("edx",&DWP(4*3,"esi")); + &and ("eax",0x0fffffff); + &and ("ebx",0x0ffffffc); + &and ("ecx",0x0ffffffc); + &and ("edx",0x0ffffffc); + &mov (&DWP(4*6,"edi"),"eax"); + &mov (&DWP(4*7,"edi"),"ebx"); + &mov (&DWP(4*8,"edi"),"ecx"); + &mov (&DWP(4*9,"edi"),"edx"); + + &mov ("eax",$sse2); +&set_label("nokey"); +&function_end("poly1305_init"); + +($h0,$h1,$h2,$h3,$h4, + $d0,$d1,$d2,$d3, + $r0,$r1,$r2,$r3, + $s1,$s2,$s3)=map(4*$_,(0..15)); + +&function_begin("poly1305_blocks"); + &mov ("edi",&wparam(0)); # ctx + &mov ("esi",&wparam(1)); # inp + &mov ("ecx",&wparam(2)); # len +&set_label("enter_blocks"); + &and ("ecx",-15); + &jz (&label("nodata")); + + &stack_push(16); + &mov ("eax",&DWP(4*6,"edi")); # r0 + &mov ("ebx",&DWP(4*7,"edi")); # r1 + &lea ("ebp",&DWP(0,"esi","ecx")); # end of input + &mov ("ecx",&DWP(4*8,"edi")); # r2 + &mov ("edx",&DWP(4*9,"edi")); # r3 + + &mov (&wparam(2),"ebp"); + &mov ("ebp","esi"); + + &mov (&DWP($r0,"esp"),"eax"); # r0 + &mov ("eax","ebx"); + &shr ("eax",2); + &mov (&DWP($r1,"esp"),"ebx"); # r1 + &add ("eax","ebx"); # s1 + &mov ("ebx","ecx"); + &shr ("ebx",2); + &mov (&DWP($r2,"esp"),"ecx"); # r2 + &add ("ebx","ecx"); # s2 + &mov ("ecx","edx"); + &shr ("ecx",2); + &mov (&DWP($r3,"esp"),"edx"); # r3 + &add ("ecx","edx"); # s3 + &mov (&DWP($s1,"esp"),"eax"); # s1 + &mov (&DWP($s2,"esp"),"ebx"); # s2 + &mov (&DWP($s3,"esp"),"ecx"); # s3 + + &mov ("eax",&DWP(4*0,"edi")); # load hash value + &mov ("ebx",&DWP(4*1,"edi")); + &mov ("ecx",&DWP(4*2,"edi")); + &mov ("esi",&DWP(4*3,"edi")); + &mov ("edi",&DWP(4*4,"edi")); + &jmp (&label("loop")); + +&set_label("loop",32); + &add ("eax",&DWP(4*0,"ebp")); # accumulate input + &adc ("ebx",&DWP(4*1,"ebp")); + &adc ("ecx",&DWP(4*2,"ebp")); + &adc ("esi",&DWP(4*3,"ebp")); + &lea ("ebp",&DWP(4*4,"ebp")); + &adc ("edi",&wparam(3)); # padbit + + &mov (&DWP($h0,"esp"),"eax"); # put aside hash[+inp] + &mov (&DWP($h3,"esp"),"esi"); + + &mul (&DWP($r0,"esp")); # h0*r0 + &mov (&DWP($h4,"esp"),"edi"); + &mov ("edi","eax"); + &mov ("eax","ebx"); # h1 + &mov ("esi","edx"); + &mul (&DWP($s3,"esp")); # h1*s3 + &add ("edi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("esi","edx"); + &mul (&DWP($s2,"esp")); # h2*s2 + &add ("edi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("esi","edx"); + &mul (&DWP($s1,"esp")); # h3*s1 + &add ("edi","eax"); + &mov ("eax",&DWP($h0,"esp")); + &adc ("esi","edx"); + + &mul (&DWP($r1,"esp")); # h0*r1 + &mov (&DWP($d0,"esp"),"edi"); + &xor ("edi","edi"); + &add ("esi","eax"); + &mov ("eax","ebx"); # h1 + &adc ("edi","edx"); + &mul (&DWP($r0,"esp")); # h1*r0 + &add ("esi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("edi","edx"); + &mul (&DWP($s3,"esp")); # h2*s3 + &add ("esi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("edi","edx"); + &mul (&DWP($s2,"esp")); # h3*s2 + &add ("esi","eax"); + &mov ("eax",&DWP($h4,"esp")); + &adc ("edi","edx"); + &imul ("eax",&DWP($s1,"esp")); # h4*s1 + &add 
("esi","eax"); + &mov ("eax",&DWP($h0,"esp")); + &adc ("edi",0); + + &mul (&DWP($r2,"esp")); # h0*r2 + &mov (&DWP($d1,"esp"),"esi"); + &xor ("esi","esi"); + &add ("edi","eax"); + &mov ("eax","ebx"); # h1 + &adc ("esi","edx"); + &mul (&DWP($r1,"esp")); # h1*r1 + &add ("edi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("esi","edx"); + &mul (&DWP($r0,"esp")); # h2*r0 + &add ("edi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("esi","edx"); + &mul (&DWP($s3,"esp")); # h3*s3 + &add ("edi","eax"); + &mov ("eax",&DWP($h4,"esp")); + &adc ("esi","edx"); + &imul ("eax",&DWP($s2,"esp")); # h4*s2 + &add ("edi","eax"); + &mov ("eax",&DWP($h0,"esp")); + &adc ("esi",0); + + &mul (&DWP($r3,"esp")); # h0*r3 + &mov (&DWP($d2,"esp"),"edi"); + &xor ("edi","edi"); + &add ("esi","eax"); + &mov ("eax","ebx"); # h1 + &adc ("edi","edx"); + &mul (&DWP($r2,"esp")); # h1*r2 + &add ("esi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("edi","edx"); + &mul (&DWP($r1,"esp")); # h2*r1 + &add ("esi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("edi","edx"); + &mul (&DWP($r0,"esp")); # h3*r0 + &add ("esi","eax"); + &mov ("ecx",&DWP($h4,"esp")); + &adc ("edi","edx"); + + &mov ("edx","ecx"); + &imul ("ecx",&DWP($s3,"esp")); # h4*s3 + &add ("esi","ecx"); + &mov ("eax",&DWP($d0,"esp")); + &adc ("edi",0); + + &imul ("edx",&DWP($r0,"esp")); # h4*r0 + &add ("edx","edi"); + + &mov ("ebx",&DWP($d1,"esp")); + &mov ("ecx",&DWP($d2,"esp")); + + &mov ("edi","edx"); # last reduction step + &shr ("edx",2); + &and ("edi",3); + &lea ("edx",&DWP(0,"edx","edx",4)); # *5 + &add ("eax","edx"); + &adc ("ebx",0); + &adc ("ecx",0); + &adc ("esi",0); + &adc ("edi",0); + + &cmp ("ebp",&wparam(2)); # done yet? + &jne (&label("loop")); + + &mov ("edx",&wparam(0)); # ctx + &stack_pop(16); + &mov (&DWP(4*0,"edx"),"eax"); # store hash value + &mov (&DWP(4*1,"edx"),"ebx"); + &mov (&DWP(4*2,"edx"),"ecx"); + &mov (&DWP(4*3,"edx"),"esi"); + &mov (&DWP(4*4,"edx"),"edi"); +&set_label("nodata"); +&function_end("poly1305_blocks"); + +&function_begin("poly1305_emit"); + &mov ("ebp",&wparam(0)); # context +&set_label("enter_emit"); + &mov ("edi",&wparam(1)); # output + &mov ("eax",&DWP(4*0,"ebp")); # load hash value + &mov ("ebx",&DWP(4*1,"ebp")); + &mov ("ecx",&DWP(4*2,"ebp")); + &mov ("edx",&DWP(4*3,"ebp")); + &mov ("esi",&DWP(4*4,"ebp")); + + &add ("eax",5); # compare to modulus + &adc ("ebx",0); + &adc ("ecx",0); + &adc ("edx",0); + &adc ("esi",0); + &shr ("esi",2); # did it carry/borrow? + &neg ("esi"); # do we choose hash-modulus? + + &and ("eax","esi"); + &and ("ebx","esi"); + &and ("ecx","esi"); + &and ("edx","esi"); + &mov (&DWP(4*0,"edi"),"eax"); + &mov (&DWP(4*1,"edi"),"ebx"); + &mov (&DWP(4*2,"edi"),"ecx"); + &mov (&DWP(4*3,"edi"),"edx"); + + ¬ ("esi"); # or original hash value? 
+ &mov ("eax",&DWP(4*0,"ebp")); + &mov ("ebx",&DWP(4*1,"ebp")); + &mov ("ecx",&DWP(4*2,"ebp")); + &mov ("edx",&DWP(4*3,"ebp")); + &mov ("ebp",&wparam(2)); + &and ("eax","esi"); + &and ("ebx","esi"); + &and ("ecx","esi"); + &and ("edx","esi"); + &or ("eax",&DWP(4*0,"edi")); + &or ("ebx",&DWP(4*1,"edi")); + &or ("ecx",&DWP(4*2,"edi")); + &or ("edx",&DWP(4*3,"edi")); + + &add ("eax",&DWP(4*0,"ebp")); # accumulate key + &adc ("ebx",&DWP(4*1,"ebp")); + &adc ("ecx",&DWP(4*2,"ebp")); + &adc ("edx",&DWP(4*3,"ebp")); + + &mov (&DWP(4*0,"edi"),"eax"); + &mov (&DWP(4*1,"edi"),"ebx"); + &mov (&DWP(4*2,"edi"),"ecx"); + &mov (&DWP(4*3,"edi"),"edx"); +&function_end("poly1305_emit"); + +if ($sse2) { +######################################################################## +# Layout of opaque area is following. +# +# unsigned __int32 h[5]; # current hash value base 2^26 +# unsigned __int32 is_base2_26; +# unsigned __int32 r[4]; # key value base 2^32 +# unsigned __int32 pad[2]; +# struct { unsigned __int32 r^4, r^3, r^2, r^1; } r[9]; +# +# where r^n are base 2^26 digits of degrees of multiplier key. There are +# 5 digits, but last four are interleaved with multiples of 5, totalling +# in 9 elements: r0, r1, 5*r1, r2, 5*r2, r3, 5*r3, r4, 5*r4. + +my ($D0,$D1,$D2,$D3,$D4,$T0,$T1,$T2)=map("xmm$_",(0..7)); +my $MASK=$T2; # borrow and keep in mind + +&align (32); +&function_begin_B("_poly1305_init_sse2"); + &movdqu ($D4,&QWP(4*6,"edi")); # key base 2^32 + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &mov ("ebp","esp"); + &sub ("esp",16*(9+5)); + &and ("esp",-16); + + #&pand ($D4,&QWP(96,"ebx")); # magic mask + &movq ($MASK,&QWP(64,"ebx")); + + &movdqa ($D0,$D4); + &movdqa ($D1,$D4); + &movdqa ($D2,$D4); + + &pand ($D0,$MASK); # -> base 2^26 + &psrlq ($D1,26); + &psrldq ($D2,6); + &pand ($D1,$MASK); + &movdqa ($D3,$D2); + &psrlq ($D2,4) + &psrlq ($D3,30); + &pand ($D2,$MASK); + &pand ($D3,$MASK); + &psrldq ($D4,13); + + &lea ("edx",&DWP(16*9,"esp")); # size optimization + &mov ("ecx",2); +&set_label("square"); + &movdqa (&QWP(16*0,"esp"),$D0); + &movdqa (&QWP(16*1,"esp"),$D1); + &movdqa (&QWP(16*2,"esp"),$D2); + &movdqa (&QWP(16*3,"esp"),$D3); + &movdqa (&QWP(16*4,"esp"),$D4); + + &movdqa ($T1,$D1); + &movdqa ($T0,$D2); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D1); # *5 + &paddd ($T0,$D2); # *5 + &movdqa (&QWP(16*5,"esp"),$T1); + &movdqa (&QWP(16*6,"esp"),$T0); + &movdqa ($T1,$D3); + &movdqa ($T0,$D4); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D3); # *5 + &paddd ($T0,$D4); # *5 + &movdqa (&QWP(16*7,"esp"),$T1); + &movdqa (&QWP(16*8,"esp"),$T0); + + &pshufd ($T1,$D0,0b01000100); + &movdqa ($T0,$D1); + &pshufd ($D1,$D1,0b01000100); + &pshufd ($D2,$D2,0b01000100); + &pshufd ($D3,$D3,0b01000100); + &pshufd ($D4,$D4,0b01000100); + &movdqa (&QWP(16*0,"edx"),$T1); + &movdqa (&QWP(16*1,"edx"),$D1); + &movdqa (&QWP(16*2,"edx"),$D2); + &movdqa (&QWP(16*3,"edx"),$D3); + &movdqa (&QWP(16*4,"edx"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &pmuludq ($D4,$D0); # h4*r0 + &pmuludq ($D3,$D0); # h3*r0 + &pmuludq ($D2,$D0); # h2*r0 + &pmuludq ($D1,$D0); # h1*r0 + &pmuludq ($D0,$T1); # h0*r0 + +sub pmuladd { +my $load = shift; +my $base = shift; $base = "esp" if (!defined($base)); + + 
################################################################ + # As for choice to "rotate" $T0-$T2 in order to move paddq + # past next multiplication. While it makes code harder to read + # and doesn't have significant effect on most processors, it + # makes a lot of difference on Atom, up to 30% improvement. + + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*3,$base)); # r1*h3 + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*2,$base)); # r1*h2 + &paddq ($D4,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*1,$base)); # r1*h1 + &paddq ($D3,$T1); + &$load ($T1,5); # s1 + &pmuludq ($T0,&QWP(16*0,$base)); # r1*h0 + &paddq ($D2,$T2); + &pmuludq ($T1,&QWP(16*4,$base)); # s1*h4 + &$load ($T2,2); # r2^n + &paddq ($D1,$T0); + + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*2,$base)); # r2*h2 + &paddq ($D0,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*1,$base)); # r2*h1 + &paddq ($D4,$T2); + &$load ($T2,6); # s2^n + &pmuludq ($T1,&QWP(16*0,$base)); # r2*h0 + &paddq ($D3,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*4,$base)); # s2*h4 + &paddq ($D2,$T1); + &pmuludq ($T0,&QWP(16*3,$base)); # s2*h3 + &$load ($T1,3); # r3^n + &paddq ($D1,$T2); + + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*1,$base)); # r3*h1 + &paddq ($D0,$T0); + &$load ($T0,7); # s3^n + &pmuludq ($T2,&QWP(16*0,$base)); # r3*h0 + &paddq ($D4,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*4,$base)); # s3*h4 + &paddq ($D3,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*3,$base)); # s3*h3 + &paddq ($D2,$T0); + &pmuludq ($T2,&QWP(16*2,$base)); # s3*h2 + &$load ($T0,4); # r4^n + &paddq ($D1,$T1); + + &$load ($T1,8); # s4^n + &pmuludq ($T0,&QWP(16*0,$base)); # r4*h0 + &paddq ($D0,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*4,$base)); # s4*h4 + &paddq ($D4,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*1,$base)); # s4*h1 + &paddq ($D3,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*2,$base)); # s4*h2 + &paddq ($D0,$T2); + &pmuludq ($T1,&QWP(16*3,$base)); # s4*h3 + &movdqa ($MASK,&QWP(64,"ebx")); + &paddq ($D1,$T0); + &paddq ($D2,$T1); +} + &pmuladd (sub { my ($reg,$i)=@_; + &movdqa ($reg,&QWP(16*$i,"esp")); + },"edx"); + +sub lazy_reduction { +my $extra = shift; + + ################################################################ + # lazy reduction as discussed in "NEON crypto" by D.J. Bernstein + # and P. 
Schwabe + # + # [(*) see discussion in poly1305-armv4 module] + + &movdqa ($T0,$D3); + &pand ($D3,$MASK); + &psrlq ($T0,26); + &$extra () if (defined($extra)); + &paddq ($T0,$D4); # h3 -> h4 + &movdqa ($T1,$D0); + &pand ($D0,$MASK); + &psrlq ($T1,26); + &movdqa ($D4,$T0); + &paddq ($T1,$D1); # h0 -> h1 + &psrlq ($T0,26); + &pand ($D4,$MASK); + &movdqa ($D1,$T1); + &psrlq ($T1,26); + &paddd ($D0,$T0); # favour paddd when + # possible, because + # paddq is "broken" + # on Atom + &psllq ($T0,2); + &paddq ($T1,$D2); # h1 -> h2 + &paddq ($T0,$D0); # h4 -> h0 (*) + &pand ($D1,$MASK); + &movdqa ($D2,$T1); + &psrlq ($T1,26); + &pand ($D2,$MASK); + &paddd ($T1,$D3); # h2 -> h3 + &movdqa ($D0,$T0); + &psrlq ($T0,26); + &movdqa ($D3,$T1); + &psrlq ($T1,26); + &pand ($D0,$MASK); + &paddd ($D1,$T0); # h0 -> h1 + &pand ($D3,$MASK); + &paddd ($D4,$T1); # h3 -> h4 +} + &lazy_reduction (); + + &dec ("ecx"); + &jz (&label("square_break")); + + &punpcklqdq ($D0,&QWP(16*0,"esp")); # 0:r^1:0:r^2 + &punpcklqdq ($D1,&QWP(16*1,"esp")); + &punpcklqdq ($D2,&QWP(16*2,"esp")); + &punpcklqdq ($D3,&QWP(16*3,"esp")); + &punpcklqdq ($D4,&QWP(16*4,"esp")); + &jmp (&label("square")); + +&set_label("square_break"); + &psllq ($D0,32); # -> r^3:0:r^4:0 + &psllq ($D1,32); + &psllq ($D2,32); + &psllq ($D3,32); + &psllq ($D4,32); + &por ($D0,&QWP(16*0,"esp")); # r^3:r^1:r^4:r^2 + &por ($D1,&QWP(16*1,"esp")); + &por ($D2,&QWP(16*2,"esp")); + &por ($D3,&QWP(16*3,"esp")); + &por ($D4,&QWP(16*4,"esp")); + + &pshufd ($D0,$D0,0b10001101); # -> r^1:r^2:r^3:r^4 + &pshufd ($D1,$D1,0b10001101); + &pshufd ($D2,$D2,0b10001101); + &pshufd ($D3,$D3,0b10001101); + &pshufd ($D4,$D4,0b10001101); + + &movdqu (&QWP(16*0,"edi"),$D0); # save the table + &movdqu (&QWP(16*1,"edi"),$D1); + &movdqu (&QWP(16*2,"edi"),$D2); + &movdqu (&QWP(16*3,"edi"),$D3); + &movdqu (&QWP(16*4,"edi"),$D4); + + &movdqa ($T1,$D1); + &movdqa ($T0,$D2); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D1); # *5 + &paddd ($T0,$D2); # *5 + &movdqu (&QWP(16*5,"edi"),$T1); + &movdqu (&QWP(16*6,"edi"),$T0); + &movdqa ($T1,$D3); + &movdqa ($T0,$D4); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D3); # *5 + &paddd ($T0,$D4); # *5 + &movdqu (&QWP(16*7,"edi"),$T1); + &movdqu (&QWP(16*8,"edi"),$T0); + + &mov ("esp","ebp"); + &lea ("edi",&DWP(-16*3,"edi")); # size de-optimization + &ret (); +&function_end_B("_poly1305_init_sse2"); + +&align (32); +&function_begin("_poly1305_blocks_sse2"); + &mov ("edi",&wparam(0)); # ctx + &mov ("esi",&wparam(1)); # inp + &mov ("ecx",&wparam(2)); # len + + &mov ("eax",&DWP(4*5,"edi")); # is_base2_26 + &and ("ecx",-16); + &jz (&label("nodata")); + &cmp ("ecx",64); + &jae (&label("enter_sse2")); + &test ("eax","eax"); # is_base2_26? + &jz (&label("enter_blocks")); + +&set_label("enter_sse2",16); + &call (&label("pic_point")); +&set_label("pic_point"); + &blindpop("ebx"); + &lea ("ebx",&DWP(&label("const_sse2")."-".&label("pic_point"),"ebx")); + + &test ("eax","eax"); # is_base2_26? 
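The lazy_reduction sub implements the carry schedule from the cited paper: limbs are allowed to grow a few bits past 2^26 between multiplications, and one staggered pass of carries, with the h4 overflow folded back into h0 times 5, brings them close to canonical. A scalar C rendition of the exact chain order used above (a sketch; the vector code runs several such chains in parallel and favours paddd over paddq where it can):

#include <stdint.h>

static void lazy_reduce26(uint64_t h[5])
{
    const uint64_t M = ((uint64_t)1 << 26) - 1;
    uint64_t c;
    c = h[3] >> 26; h[3] &= M; h[4] += c;             /* h3 -> h4 */
    c = h[0] >> 26; h[0] &= M; h[1] += c;             /* h0 -> h1 */
    c = h[4] >> 26; h[4] &= M; h[0] += c + (c << 2);  /* h4 -> h0 (*5) */
    c = h[1] >> 26; h[1] &= M; h[2] += c;             /* h1 -> h2 */
    c = h[2] >> 26; h[2] &= M; h[3] += c;             /* h2 -> h3 */
    c = h[0] >> 26; h[0] &= M; h[1] += c;             /* h0 -> h1 */
    c = h[3] >> 26; h[3] &= M; h[4] += c;             /* h3 -> h4 */
}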
+ &jnz (&label("base2_26")); + + &call ("_poly1305_init_sse2"); + + ################################################# base 2^32 -> base 2^26 + &mov ("eax",&DWP(0,"edi")); + &mov ("ecx",&DWP(3,"edi")); + &mov ("edx",&DWP(6,"edi")); + &mov ("esi",&DWP(9,"edi")); + &mov ("ebp",&DWP(13,"edi")); + &mov (&DWP(4*5,"edi"),1); # is_base2_26 + + &shr ("ecx",2); + &and ("eax",0x3ffffff); + &shr ("edx",4); + &and ("ecx",0x3ffffff); + &shr ("esi",6); + &and ("edx",0x3ffffff); + + &movd ($D0,"eax"); + &movd ($D1,"ecx"); + &movd ($D2,"edx"); + &movd ($D3,"esi"); + &movd ($D4,"ebp"); + + &mov ("esi",&wparam(1)); # [reload] inp + &mov ("ecx",&wparam(2)); # [reload] len + &jmp (&label("base2_32")); + +&set_label("base2_26",16); + &movd ($D0,&DWP(4*0,"edi")); # load hash value + &movd ($D1,&DWP(4*1,"edi")); + &movd ($D2,&DWP(4*2,"edi")); + &movd ($D3,&DWP(4*3,"edi")); + &movd ($D4,&DWP(4*4,"edi")); + &movdqa ($MASK,&QWP(64,"ebx")); + +&set_label("base2_32"); + &mov ("eax",&wparam(3)); # padbit + &mov ("ebp","esp"); + + &sub ("esp",16*(5+5+5+9+9)); + &and ("esp",-16); + + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &shl ("eax",24); # padbit + + &test ("ecx",31); + &jz (&label("even")); + + ################################################################ + # process single block, with SSE2, because it's still faster + # even though half of result is discarded + + &movdqu ($T1,&QWP(0,"esi")); # input + &lea ("esi",&DWP(16,"esi")); + + &movdqa ($T0,$T1); # -> base 2^26 ... + &pand ($T1,$MASK); + &paddd ($D0,$T1); # ... and accumulate + + &movdqa ($T1,$T0); + &psrlq ($T0,26); + &psrldq ($T1,6); + &pand ($T0,$MASK); + &paddd ($D1,$T0); + + &movdqa ($T0,$T1); + &psrlq ($T1,4); + &pand ($T1,$MASK); + &paddd ($D2,$T1); + + &movdqa ($T1,$T0); + &psrlq ($T0,30); + &pand ($T0,$MASK); + &psrldq ($T1,7); + &paddd ($D3,$T0); + + &movd ($T0,"eax"); # padbit + &paddd ($D4,$T1); + &movd ($T1,&DWP(16*0+12,"edi")); # r0 + &paddd ($D4,$T0); + + &movdqa (&QWP(16*0,"esp"),$D0); + &movdqa (&QWP(16*1,"esp"),$D1); + &movdqa (&QWP(16*2,"esp"),$D2); + &movdqa (&QWP(16*3,"esp"),$D3); + &movdqa (&QWP(16*4,"esp"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &pmuludq ($D0,$T1); # h4*r0 + &pmuludq ($D1,$T1); # h3*r0 + &pmuludq ($D2,$T1); # h2*r0 + &movd ($T0,&DWP(16*1+12,"edi")); # r1 + &pmuludq ($D3,$T1); # h1*r0 + &pmuludq ($D4,$T1); # h0*r0 + + &pmuladd (sub { my ($reg,$i)=@_; + &movd ($reg,&DWP(16*$i+12,"edi")); + }); + + &lazy_reduction (); + + &sub ("ecx",16); + &jz (&label("done")); + +&set_label("even"); + &lea ("edx",&DWP(16*(5+5+5+9),"esp"));# size optimization + &lea ("eax",&DWP(-16*2,"esi")); + &sub ("ecx",64); + + ################################################################ + # expand and copy pre-calculated table to stack + + &movdqu ($T0,&QWP(16*0,"edi")); # r^1:r^2:r^3:r^4 + &pshufd ($T1,$T0,0b01000100); # duplicate r^3:r^4 + &cmovb ("esi","eax"); + &pshufd ($T0,$T0,0b11101110); # duplicate r^1:r^2 + &movdqa (&QWP(16*0,"edx"),$T1); + &lea ("eax",&DWP(16*10,"esp")); + &movdqu ($T1,&QWP(16*1,"edi")); + &movdqa (&QWP(16*(0-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*1,"edx"),$T0); + &movdqu ($T0,&QWP(16*2,"edi")); + &movdqa (&QWP(16*(1-9),"edx"),$T1); + &pshufd 
($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*2,"edx"),$T1); + &movdqu ($T1,&QWP(16*3,"edi")); + &movdqa (&QWP(16*(2-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*3,"edx"),$T0); + &movdqu ($T0,&QWP(16*4,"edi")); + &movdqa (&QWP(16*(3-9),"edx"),$T1); + &pshufd ($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*4,"edx"),$T1); + &movdqu ($T1,&QWP(16*5,"edi")); + &movdqa (&QWP(16*(4-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*5,"edx"),$T0); + &movdqu ($T0,&QWP(16*6,"edi")); + &movdqa (&QWP(16*(5-9),"edx"),$T1); + &pshufd ($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*6,"edx"),$T1); + &movdqu ($T1,&QWP(16*7,"edi")); + &movdqa (&QWP(16*(6-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*7,"edx"),$T0); + &movdqu ($T0,&QWP(16*8,"edi")); + &movdqa (&QWP(16*(7-9),"edx"),$T1); + &pshufd ($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*8,"edx"),$T1); + &movdqa (&QWP(16*(8-9),"edx"),$T0); + +sub load_input { +my ($inpbase,$offbase)=@_; + + &movdqu ($T0,&QWP($inpbase+0,"esi")); # load input + &movdqu ($T1,&QWP($inpbase+16,"esi")); + &lea ("esi",&DWP(16*2,"esi")); + + &movdqa (&QWP($offbase+16*2,"esp"),$D2); + &movdqa (&QWP($offbase+16*3,"esp"),$D3); + &movdqa (&QWP($offbase+16*4,"esp"),$D4); + + &movdqa ($D2,$T0); # splat input + &movdqa ($D3,$T1); + &psrldq ($D2,6); + &psrldq ($D3,6); + &movdqa ($D4,$T0); + &punpcklqdq ($D2,$D3); # 2:3 + &punpckhqdq ($D4,$T1); # 4 + &punpcklqdq ($T0,$T1); # 0:1 + + &movdqa ($D3,$D2); + &psrlq ($D2,4); + &psrlq ($D3,30); + &movdqa ($T1,$T0); + &psrlq ($D4,40); # 4 + &psrlq ($T1,26); + &pand ($T0,$MASK); # 0 + &pand ($T1,$MASK); # 1 + &pand ($D2,$MASK); # 2 + &pand ($D3,$MASK); # 3 + &por ($D4,&QWP(0,"ebx")); # padbit, yes, always + + &movdqa (&QWP($offbase+16*0,"esp"),$D0) if ($offbase); + &movdqa (&QWP($offbase+16*1,"esp"),$D1) if ($offbase); +} + &load_input (16*2,16*5); + + &jbe (&label("skip_loop")); + &jmp (&label("loop")); + +&set_label("loop",32); + ################################################################ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r + # \___________________/ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r + # \___________________/ \____________________/ + ################################################################ + + &movdqa ($T2,&QWP(16*(0-9),"edx")); # r0^2 + &movdqa (&QWP(16*1,"eax"),$T1); + &movdqa (&QWP(16*2,"eax"),$D2); + &movdqa (&QWP(16*3,"eax"),$D3); + &movdqa (&QWP(16*4,"eax"),$D4); + + ################################################################ + # d4 = h4*r0 + h0*r4 + h1*r3 + h2*r2 + h3*r1 + # d3 = h3*r0 + h0*r3 + h1*r2 + h2*r1 + h4*5*r4 + # d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + # d1 = h1*r0 + h0*r1 + h2*5*r4 + h3*5*r3 + h4*5*r2 + # d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1 + + &movdqa ($D1,$T0); + &pmuludq ($T0,$T2); # h0*r0 + &movdqa ($D0,$T1); + &pmuludq ($T1,$T2); # h1*r0 + &pmuludq ($D2,$T2); # h2*r0 + &pmuludq ($D3,$T2); # h3*r0 + &pmuludq ($D4,$T2); # h4*r0 + +sub pmuladd_alt { +my $addr = shift; + + &pmuludq ($D0,&$addr(8)); # h1*s4 + &movdqa ($T2,$D1); + &pmuludq ($D1,&$addr(1)); # h0*r1 + &paddq ($D0,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(2)); # h0*r2 + &paddq ($D1,$T1); + &movdqa ($T1,$T0); + 
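The load_input sub above splits two 16-byte blocks into five radix-2^26 limb vectors; per block the limb boundaries are bits 0/26/52/78/104, and the pad bit lands 24 bits above the top limb, which is why the scalar paths shift padbit left by 24. A one-block C sketch of the same splat (illustrative; assumes little-endian loads; splat_block is a made-up name):

#include <stdint.h>
#include <string.h>

/* One 16-byte block -> five base-2^26 limbs plus the 2^128 pad bit. */
static void splat_block(const uint8_t in[16], uint32_t padbit, uint32_t t[5])
{
    uint64_t lo, hi;
    memcpy(&lo, in + 0, 8);    /* little-endian host assumed */
    memcpy(&hi, in + 8, 8);
    t[0] = (uint32_t)lo                        & 0x3ffffff;
    t[1] = (uint32_t)(lo >> 26)                & 0x3ffffff;
    t[2] = (uint32_t)((lo >> 52) | (hi << 12)) & 0x3ffffff;
    t[3] = (uint32_t)(hi >> 14)                & 0x3ffffff;
    t[4] = (uint32_t)(hi >> 40) | (padbit << 24);
}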
&pmuludq ($T0,&$addr(3)); # h0*r3 + &paddq ($D2,$T2); + &movdqa ($T2,&QWP(16*1,"eax")); # pull h1 + &pmuludq ($T1,&$addr(4)); # h0*r4 + &paddq ($D3,$T0); + + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(1)); # h1*r1 + &paddq ($D4,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(2)); # h1*r2 + &paddq ($D2,$T2); + &movdqa ($T2,&QWP(16*2,"eax")); # pull h2 + &pmuludq ($T1,&$addr(3)); # h1*r3 + &paddq ($D3,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(7)); # h2*s3 + &paddq ($D4,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(8)); # h2*s4 + &paddq ($D0,$T2); + + &movdqa ($T2,$T1); + &pmuludq ($T1,&$addr(1)); # h2*r1 + &paddq ($D1,$T0); + &movdqa ($T0,&QWP(16*3,"eax")); # pull h3 + &pmuludq ($T2,&$addr(2)); # h2*r2 + &paddq ($D3,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(6)); # h3*s2 + &paddq ($D4,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&$addr(7)); # h3*s3 + &paddq ($D0,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(8)); # h3*s4 + &paddq ($D1,$T1); + + &movdqa ($T1,&QWP(16*4,"eax")); # pull h4 + &pmuludq ($T0,&$addr(1)); # h3*r1 + &paddq ($D2,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&$addr(8)); # h4*s4 + &paddq ($D4,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(5)); # h4*s1 + &paddq ($D3,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(6)); # h4*s2 + &paddq ($D0,$T2); + &movdqa ($MASK,&QWP(64,"ebx")); + &pmuludq ($T1,&$addr(7)); # h4*s3 + &paddq ($D1,$T0); + &paddq ($D2,$T1); +} + &pmuladd_alt (sub { my $i=shift; &QWP(16*($i-9),"edx"); }); + + &load_input (-16*2,0); + &lea ("eax",&DWP(-16*2,"esi")); + &sub ("ecx",64); + + &paddd ($T0,&QWP(16*(5+0),"esp")); # add hash value + &paddd ($T1,&QWP(16*(5+1),"esp")); + &paddd ($D2,&QWP(16*(5+2),"esp")); + &paddd ($D3,&QWP(16*(5+3),"esp")); + &paddd ($D4,&QWP(16*(5+4),"esp")); + + &cmovb ("esi","eax"); + &lea ("eax",&DWP(16*10,"esp")); + + &movdqa ($T2,&QWP(16*0,"edx")); # r0^4 + &movdqa (&QWP(16*1,"esp"),$D1); + &movdqa (&QWP(16*1,"eax"),$T1); + &movdqa (&QWP(16*2,"eax"),$D2); + &movdqa (&QWP(16*3,"eax"),$D3); + &movdqa (&QWP(16*4,"eax"),$D4); + + ################################################################ + # d4 += h4*r0 + h0*r4 + h1*r3 + h2*r2 + h3*r1 + # d3 += h3*r0 + h0*r3 + h1*r2 + h2*r1 + h4*5*r4 + # d2 += h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + # d1 += h1*r0 + h0*r1 + h2*5*r4 + h3*5*r3 + h4*5*r2 + # d0 += h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1 + + &movdqa ($D1,$T0); + &pmuludq ($T0,$T2); # h0*r0 + &paddq ($T0,$D0); + &movdqa ($D0,$T1); + &pmuludq ($T1,$T2); # h1*r0 + &pmuludq ($D2,$T2); # h2*r0 + &pmuludq ($D3,$T2); # h3*r0 + &pmuludq ($D4,$T2); # h4*r0 + + &paddq ($T1,&QWP(16*1,"esp")); + &paddq ($D2,&QWP(16*2,"esp")); + &paddq ($D3,&QWP(16*3,"esp")); + &paddq ($D4,&QWP(16*4,"esp")); + + &pmuladd_alt (sub { my $i=shift; &QWP(16*$i,"edx"); }); + + &lazy_reduction (); + + &load_input (16*2,16*5); + + &ja (&label("loop")); + +&set_label("skip_loop"); + ################################################################ + # multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1 + + &pshufd ($T2,&QWP(16*(0-9),"edx"),0x10);# r0^n + &add ("ecx",32); + &jnz (&label("long_tail")); + + &paddd ($T0,$D0); # add hash value + &paddd ($T1,$D1); + &paddd ($D2,&QWP(16*7,"esp")); + &paddd ($D3,&QWP(16*8,"esp")); + &paddd ($D4,&QWP(16*9,"esp")); + +&set_label("long_tail"); + + &movdqa (&QWP(16*0,"eax"),$T0); + &movdqa (&QWP(16*1,"eax"),$T1); + &movdqa (&QWP(16*2,"eax"),$D2); + &movdqa (&QWP(16*3,"eax"),$D3); + &movdqa (&QWP(16*4,"eax"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + 
h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &pmuludq ($T0,$T2); # h0*r0 + &pmuludq ($T1,$T2); # h1*r0 + &pmuludq ($D2,$T2); # h2*r0 + &movdqa ($D0,$T0); + &pshufd ($T0,&QWP(16*(1-9),"edx"),0x10);# r1^n + &pmuludq ($D3,$T2); # h3*r0 + &movdqa ($D1,$T1); + &pmuludq ($D4,$T2); # h4*r0 + + &pmuladd (sub { my ($reg,$i)=@_; + &pshufd ($reg,&QWP(16*($i-9),"edx"),0x10); + },"eax"); + + &jz (&label("short_tail")); + + &load_input (-16*2,0); + + &pshufd ($T2,&QWP(16*0,"edx"),0x10); # r0^n + &paddd ($T0,&QWP(16*5,"esp")); # add hash value + &paddd ($T1,&QWP(16*6,"esp")); + &paddd ($D2,&QWP(16*7,"esp")); + &paddd ($D3,&QWP(16*8,"esp")); + &paddd ($D4,&QWP(16*9,"esp")); + + ################################################################ + # multiply inp[0:1] by r^4:r^3 and accumulate + + &movdqa (&QWP(16*0,"esp"),$T0); + &pmuludq ($T0,$T2); # h0*r0 + &movdqa (&QWP(16*1,"esp"),$T1); + &pmuludq ($T1,$T2); # h1*r0 + &paddq ($D0,$T0); + &movdqa ($T0,$D2); + &pmuludq ($D2,$T2); # h2*r0 + &paddq ($D1,$T1); + &movdqa ($T1,$D3); + &pmuludq ($D3,$T2); # h3*r0 + &paddq ($D2,&QWP(16*2,"esp")); + &movdqa (&QWP(16*2,"esp"),$T0); + &pshufd ($T0,&QWP(16*1,"edx"),0x10); # r1^n + &paddq ($D3,&QWP(16*3,"esp")); + &movdqa (&QWP(16*3,"esp"),$T1); + &movdqa ($T1,$D4); + &pmuludq ($D4,$T2); # h4*r0 + &paddq ($D4,&QWP(16*4,"esp")); + &movdqa (&QWP(16*4,"esp"),$T1); + + &pmuladd (sub { my ($reg,$i)=@_; + &pshufd ($reg,&QWP(16*$i,"edx"),0x10); + }); + +&set_label("short_tail"); + + ################################################################ + # horizontal addition + + &pshufd ($T1,$D4,0b01001110); + &pshufd ($T0,$D3,0b01001110); + &paddq ($D4,$T1); + &paddq ($D3,$T0); + &pshufd ($T1,$D0,0b01001110); + &pshufd ($T0,$D1,0b01001110); + &paddq ($D0,$T1); + &paddq ($D1,$T0); + &pshufd ($T1,$D2,0b01001110); + #&paddq ($D2,$T1); + + &lazy_reduction (sub { &paddq ($D2,$T1) }); + +&set_label("done"); + &movd (&DWP(-16*3+4*0,"edi"),$D0); # store hash value + &movd (&DWP(-16*3+4*1,"edi"),$D1); + &movd (&DWP(-16*3+4*2,"edi"),$D2); + &movd (&DWP(-16*3+4*3,"edi"),$D3); + &movd (&DWP(-16*3+4*4,"edi"),$D4); + &mov ("esp","ebp"); +&set_label("nodata"); +&function_end("_poly1305_blocks_sse2"); + +&align (32); +&function_begin("_poly1305_emit_sse2"); + &mov ("ebp",&wparam(0)); # context + + &cmp (&DWP(4*5,"ebp"),0); # is_base2_26? 
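The tail above multiplies (inp[0:1]+hash) by r^2:r^1, which is the identity behind the whole interleaved scheme: each lane advances by r^2 (or r^4) per iteration and is weighted by a decreasing power of r at the end, after which the horizontal addition collapses the lanes. A self-contained toy check of that identity, using a small prime in place of 2^130-5 (illustrative only):

#include <assert.h>
#include <stdint.h>

#define P 2147483647u   /* 2^31-1, toy stand-in for 2^130-5 */
static uint64_t mulmod(uint64_t a, uint64_t b) { return a * b % P; }

int main(void)
{
    uint64_t m[6] = {11, 22, 33, 44, 55, 66}, r = 123456789;

    uint64_t serial = 0;                   /* h = (h + m_i) * r */
    for (int i = 0; i < 6; i++)
        serial = mulmod(serial + m[i], r);

    uint64_t r2 = mulmod(r, r);            /* two lanes, stride 2 */
    uint64_t h0 = 0, h1 = 0;
    for (int i = 0; i < 6; i += 2) {
        h0 = (mulmod(h0, r2) + m[i])     % P;
        h1 = (mulmod(h1, r2) + m[i + 1]) % P;
    }
    /* final per-lane weights r^2 and r^1, as in the code above */
    assert((mulmod(h0, r2) + mulmod(h1, r)) % P == serial);
    return 0;
}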
+ &je (&label("enter_emit")); + + &mov ("eax",&DWP(4*0,"ebp")); # load hash value + &mov ("edi",&DWP(4*1,"ebp")); + &mov ("ecx",&DWP(4*2,"ebp")); + &mov ("edx",&DWP(4*3,"ebp")); + &mov ("esi",&DWP(4*4,"ebp")); + + &mov ("ebx","edi"); # base 2^26 -> base 2^32 + &shl ("edi",26); + &shr ("ebx",6); + &add ("eax","edi"); + &mov ("edi","ecx"); + &adc ("ebx",0); + + &shl ("edi",20); + &shr ("ecx",12); + &add ("ebx","edi"); + &mov ("edi","edx"); + &adc ("ecx",0); + + &shl ("edi",14); + &shr ("edx",18); + &add ("ecx","edi"); + &mov ("edi","esi"); + &adc ("edx",0); + + &shl ("edi",8); + &shr ("esi",24); + &add ("edx","edi"); + &adc ("esi",0); # can be partially reduced + + &mov ("edi","esi"); # final reduction + &and ("esi",3); + &shr ("edi",2); + &lea ("ebp",&DWP(0,"edi","edi",4)); # *5 + &mov ("edi",&wparam(1)); # output + &add ("eax","ebp"); + &mov ("ebp",&wparam(2)); # key + &adc ("ebx",0); + &adc ("ecx",0); + &adc ("edx",0); + &adc ("esi",0); + + &movd ($D0,"eax"); # offload original hash value + &add ("eax",5); # compare to modulus + &movd ($D1,"ebx"); + &adc ("ebx",0); + &movd ($D2,"ecx"); + &adc ("ecx",0); + &movd ($D3,"edx"); + &adc ("edx",0); + &adc ("esi",0); + &shr ("esi",2); # did it carry/borrow? + + &neg ("esi"); # do we choose (hash-modulus) ... + &and ("eax","esi"); + &and ("ebx","esi"); + &and ("ecx","esi"); + &and ("edx","esi"); + &mov (&DWP(4*0,"edi"),"eax"); + &movd ("eax",$D0); + &mov (&DWP(4*1,"edi"),"ebx"); + &movd ("ebx",$D1); + &mov (&DWP(4*2,"edi"),"ecx"); + &movd ("ecx",$D2); + &mov (&DWP(4*3,"edi"),"edx"); + &movd ("edx",$D3); + + &not ("esi"); # ... or original hash value? + &and ("eax","esi"); + &and ("ebx","esi"); + &or ("eax",&DWP(4*0,"edi")); + &and ("ecx","esi"); + &or ("ebx",&DWP(4*1,"edi")); + &and ("edx","esi"); + &or ("ecx",&DWP(4*2,"edi")); + &or ("edx",&DWP(4*3,"edi")); + + &add ("eax",&DWP(4*0,"ebp")); # accumulate key + &adc ("ebx",&DWP(4*1,"ebp")); + &mov (&DWP(4*0,"edi"),"eax"); + &adc ("ecx",&DWP(4*2,"ebp")); + &mov (&DWP(4*1,"edi"),"ebx"); + &adc ("edx",&DWP(4*3,"ebp")); + &mov (&DWP(4*2,"edi"),"ecx"); + &mov (&DWP(4*3,"edi"),"edx"); +&function_end("_poly1305_emit_sse2"); + +if ($avx>1) { +######################################################################## +# Note that poly1305_init_avx2 operates on %xmm, I could have used +# poly1305_init_sse2...
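The base 2^26 -> base 2^32 conversion in _poly1305_emit_sse2 above uses shl/shr pairs (26/6, 20/12, 14/18, 8/24), accumulating carries as it goes; the result may still need the *5 folding step noted in the code. The same repacking in C (a sketch; repack_26_to_32 is a made-up name):

#include <stdint.h>

/* Five (possibly not fully carried) base-2^26 limbs -> four 32-bit
 * words plus a small top word, mirroring the shl/shr pairs above. */
static void repack_26_to_32(const uint32_t t[5], uint32_t h[5])
{
    uint64_t w;
    w = (uint64_t)t[0] + ((uint64_t)t[1] << 26);
    h[0] = (uint32_t)w;
    w = (w >> 32) + (t[1] >> 6) + ((uint64_t)t[2] << 20);
    h[1] = (uint32_t)w;
    w = (w >> 32) + (t[2] >> 12) + ((uint64_t)t[3] << 14);
    h[2] = (uint32_t)w;
    w = (w >> 32) + (t[3] >> 18) + ((uint64_t)t[4] << 8);
    h[3] = (uint32_t)w;
    h[4] = (uint32_t)(w >> 32) + (t[4] >> 24);
}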
+ +&align (32); +&function_begin_B("_poly1305_init_avx2"); + &vmovdqu ($D4,&QWP(4*6,"edi")); # key base 2^32 + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &mov ("ebp","esp"); + &sub ("esp",16*(9+5)); + &and ("esp",-16); + + #&vpand ($D4,$D4,&QWP(96,"ebx")); # magic mask + &vmovdqa ($MASK,&QWP(64,"ebx")); + + &vpand ($D0,$D4,$MASK); # -> base 2^26 + &vpsrlq ($D1,$D4,26); + &vpsrldq ($D3,$D4,6); + &vpand ($D1,$D1,$MASK); + &vpsrlq ($D2,$D3,4) + &vpsrlq ($D3,$D3,30); + &vpand ($D2,$D2,$MASK); + &vpand ($D3,$D3,$MASK); + &vpsrldq ($D4,$D4,13); + + &lea ("edx",&DWP(16*9,"esp")); # size optimization + &mov ("ecx",2); +&set_label("square"); + &vmovdqa (&QWP(16*0,"esp"),$D0); + &vmovdqa (&QWP(16*1,"esp"),$D1); + &vmovdqa (&QWP(16*2,"esp"),$D2); + &vmovdqa (&QWP(16*3,"esp"),$D3); + &vmovdqa (&QWP(16*4,"esp"),$D4); + + &vpslld ($T1,$D1,2); + &vpslld ($T0,$D2,2); + &vpaddd ($T1,$T1,$D1); # *5 + &vpaddd ($T0,$T0,$D2); # *5 + &vmovdqa (&QWP(16*5,"esp"),$T1); + &vmovdqa (&QWP(16*6,"esp"),$T0); + &vpslld ($T1,$D3,2); + &vpslld ($T0,$D4,2); + &vpaddd ($T1,$T1,$D3); # *5 + &vpaddd ($T0,$T0,$D4); # *5 + &vmovdqa (&QWP(16*7,"esp"),$T1); + &vmovdqa (&QWP(16*8,"esp"),$T0); + + &vpshufd ($T0,$D0,0b01000100); + &vmovdqa ($T1,$D1); + &vpshufd ($D1,$D1,0b01000100); + &vpshufd ($D2,$D2,0b01000100); + &vpshufd ($D3,$D3,0b01000100); + &vpshufd ($D4,$D4,0b01000100); + &vmovdqa (&QWP(16*0,"edx"),$T0); + &vmovdqa (&QWP(16*1,"edx"),$D1); + &vmovdqa (&QWP(16*2,"edx"),$D2); + &vmovdqa (&QWP(16*3,"edx"),$D3); + &vmovdqa (&QWP(16*4,"edx"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &vpmuludq ($D4,$D4,$D0); # h4*r0 + &vpmuludq ($D3,$D3,$D0); # h3*r0 + &vpmuludq ($D2,$D2,$D0); # h2*r0 + &vpmuludq ($D1,$D1,$D0); # h1*r0 + &vpmuludq ($D0,$T0,$D0); # h0*r0 + + &vpmuludq ($T0,$T1,&QWP(16*3,"edx")); # r1*h3 + &vpaddq ($D4,$D4,$T0); + &vpmuludq ($T2,$T1,&QWP(16*2,"edx")); # r1*h2 + &vpaddq ($D3,$D3,$T2); + &vpmuludq ($T0,$T1,&QWP(16*1,"edx")); # r1*h1 + &vpaddq ($D2,$D2,$T0); + &vmovdqa ($T2,&QWP(16*5,"esp")); # s1 + &vpmuludq ($T1,$T1,&QWP(16*0,"edx")); # r1*h0 + &vpaddq ($D1,$D1,$T1); + &vmovdqa ($T0,&QWP(16*2,"esp")); # r2 + &vpmuludq ($T2,$T2,&QWP(16*4,"edx")); # s1*h4 + &vpaddq ($D0,$D0,$T2); + + &vpmuludq ($T1,$T0,&QWP(16*2,"edx")); # r2*h2 + &vpaddq ($D4,$D4,$T1); + &vpmuludq ($T2,$T0,&QWP(16*1,"edx")); # r2*h1 + &vpaddq ($D3,$D3,$T2); + &vmovdqa ($T1,&QWP(16*6,"esp")); # s2 + &vpmuludq ($T0,$T0,&QWP(16*0,"edx")); # r2*h0 + &vpaddq ($D2,$D2,$T0); + &vpmuludq ($T2,$T1,&QWP(16*4,"edx")); # s2*h4 + &vpaddq ($D1,$D1,$T2); + &vmovdqa ($T0,&QWP(16*3,"esp")); # r3 + &vpmuludq ($T1,$T1,&QWP(16*3,"edx")); # s2*h3 + &vpaddq ($D0,$D0,$T1); + + &vpmuludq ($T2,$T0,&QWP(16*1,"edx")); # r3*h1 + &vpaddq ($D4,$D4,$T2); + &vmovdqa ($T1,&QWP(16*7,"esp")); # s3 + &vpmuludq ($T0,$T0,&QWP(16*0,"edx")); # r3*h0 + &vpaddq ($D3,$D3,$T0); + &vpmuludq ($T2,$T1,&QWP(16*4,"edx")); # s3*h4 + &vpaddq ($D2,$D2,$T2); + &vpmuludq ($T0,$T1,&QWP(16*3,"edx")); # s3*h3 + &vpaddq ($D1,$D1,$T0); + &vmovdqa ($T2,&QWP(16*4,"esp")); # r4 + &vpmuludq ($T1,$T1,&QWP(16*2,"edx")); # s3*h2 + &vpaddq ($D0,$D0,$T1); + + &vmovdqa ($T0,&QWP(16*8,"esp")); # s4 + &vpmuludq ($T2,$T2,&QWP(16*0,"edx")); # r4*h0 + &vpaddq ($D4,$D4,$T2); + &vpmuludq ($T1,$T0,&QWP(16*4,"edx")); # s4*h4 + 
&vpaddq ($D3,$D3,$T1); + &vpmuludq ($T2,$T0,&QWP(16*1,"edx")); # s4*h1 + &vpaddq ($D0,$D0,$T2); + &vpmuludq ($T1,$T0,&QWP(16*2,"edx")); # s4*h2 + &vpaddq ($D1,$D1,$T1); + &vmovdqa ($MASK,&QWP(64,"ebx")); + &vpmuludq ($T0,$T0,&QWP(16*3,"edx")); # s4*h3 + &vpaddq ($D2,$D2,$T0); + + ################################################################ + # lazy reduction + &vpsrlq ($T0,$D3,26); + &vpand ($D3,$D3,$MASK); + &vpsrlq ($T1,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpaddq ($D4,$D4,$T0); # h3 -> h4 + &vpaddq ($D1,$D1,$T1); # h0 -> h1 + &vpsrlq ($T0,$D4,26); + &vpand ($D4,$D4,$MASK); + &vpsrlq ($T1,$D1,26); + &vpand ($D1,$D1,$MASK); + &vpaddq ($D2,$D2,$T1); # h1 -> h2 + &vpaddd ($D0,$D0,$T0); + &vpsllq ($T0,$T0,2); + &vpsrlq ($T1,$D2,26); + &vpand ($D2,$D2,$MASK); + &vpaddd ($D0,$D0,$T0); # h4 -> h0 + &vpaddd ($D3,$D3,$T1); # h2 -> h3 + &vpsrlq ($T1,$D3,26); + &vpsrlq ($T0,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpand ($D3,$D3,$MASK); + &vpaddd ($D1,$D1,$T0); # h0 -> h1 + &vpaddd ($D4,$D4,$T1); # h3 -> h4 + + &dec ("ecx"); + &jz (&label("square_break")); + + &vpunpcklqdq ($D0,$D0,&QWP(16*0,"esp")); # 0:r^1:0:r^2 + &vpunpcklqdq ($D1,$D1,&QWP(16*1,"esp")); + &vpunpcklqdq ($D2,$D2,&QWP(16*2,"esp")); + &vpunpcklqdq ($D3,$D3,&QWP(16*3,"esp")); + &vpunpcklqdq ($D4,$D4,&QWP(16*4,"esp")); + &jmp (&label("square")); + +&set_label("square_break"); + &vpsllq ($D0,$D0,32); # -> r^3:0:r^4:0 + &vpsllq ($D1,$D1,32); + &vpsllq ($D2,$D2,32); + &vpsllq ($D3,$D3,32); + &vpsllq ($D4,$D4,32); + &vpor ($D0,$D0,&QWP(16*0,"esp")); # r^3:r^1:r^4:r^2 + &vpor ($D1,$D1,&QWP(16*1,"esp")); + &vpor ($D2,$D2,&QWP(16*2,"esp")); + &vpor ($D3,$D3,&QWP(16*3,"esp")); + &vpor ($D4,$D4,&QWP(16*4,"esp")); + + &vpshufd ($D0,$D0,0b10001101); # -> r^1:r^2:r^3:r^4 + &vpshufd ($D1,$D1,0b10001101); + &vpshufd ($D2,$D2,0b10001101); + &vpshufd ($D3,$D3,0b10001101); + &vpshufd ($D4,$D4,0b10001101); + + &vmovdqu (&QWP(16*0,"edi"),$D0); # save the table + &vmovdqu (&QWP(16*1,"edi"),$D1); + &vmovdqu (&QWP(16*2,"edi"),$D2); + &vmovdqu (&QWP(16*3,"edi"),$D3); + &vmovdqu (&QWP(16*4,"edi"),$D4); + + &vpslld ($T1,$D1,2); + &vpslld ($T0,$D2,2); + &vpaddd ($T1,$T1,$D1); # *5 + &vpaddd ($T0,$T0,$D2); # *5 + &vmovdqu (&QWP(16*5,"edi"),$T1); + &vmovdqu (&QWP(16*6,"edi"),$T0); + &vpslld ($T1,$D3,2); + &vpslld ($T0,$D4,2); + &vpaddd ($T1,$T1,$D3); # *5 + &vpaddd ($T0,$T0,$D4); # *5 + &vmovdqu (&QWP(16*7,"edi"),$T1); + &vmovdqu (&QWP(16*8,"edi"),$T0); + + &mov ("esp","ebp"); + &lea ("edi",&DWP(-16*3,"edi")); # size de-optimization + &ret (); +&function_end_B("_poly1305_init_avx2"); + +######################################################################## +# now it's time to switch to %ymm + +my ($D0,$D1,$D2,$D3,$D4,$T0,$T1,$T2)=map("ymm$_",(0..7)); +my $MASK=$T2; + +sub X { my $reg=shift; $reg=~s/^ymm/xmm/; $reg; } + +&align (32); +&function_begin("_poly1305_blocks_avx2"); + &mov ("edi",&wparam(0)); # ctx + &mov ("esi",&wparam(1)); # inp + &mov ("ecx",&wparam(2)); # len + + &mov ("eax",&DWP(4*5,"edi")); # is_base2_26 + &and ("ecx",-16); + &jz (&label("nodata")); + &cmp ("ecx",64); + &jae (&label("enter_avx2")); + &test ("eax","eax"); # is_base2_26? + &jz (&label("enter_blocks")); + +&set_label("enter_avx2"); + &vzeroupper (); + + &call (&label("pic_point")); +&set_label("pic_point"); + &blindpop("ebx"); + &lea ("ebx",&DWP(&label("const_sse2")."-".&label("pic_point"),"ebx")); + + &test ("eax","eax"); # is_base2_26? 
+ &jnz (&label("base2_26")); + + &call ("_poly1305_init_avx2"); + + ################################################# base 2^32 -> base 2^26 + &mov ("eax",&DWP(0,"edi")); + &mov ("ecx",&DWP(3,"edi")); + &mov ("edx",&DWP(6,"edi")); + &mov ("esi",&DWP(9,"edi")); + &mov ("ebp",&DWP(13,"edi")); + + &shr ("ecx",2); + &and ("eax",0x3ffffff); + &shr ("edx",4); + &and ("ecx",0x3ffffff); + &shr ("esi",6); + &and ("edx",0x3ffffff); + + &mov (&DWP(4*0,"edi"),"eax"); + &mov (&DWP(4*1,"edi"),"ecx"); + &mov (&DWP(4*2,"edi"),"edx"); + &mov (&DWP(4*3,"edi"),"esi"); + &mov (&DWP(4*4,"edi"),"ebp"); + &mov (&DWP(4*5,"edi"),1); # is_base2_26 + + &mov ("esi",&wparam(1)); # [reload] inp + &mov ("ecx",&wparam(2)); # [reload] len + +&set_label("base2_26"); + &mov ("eax",&wparam(3)); # padbit + &mov ("ebp","esp"); + + &sub ("esp",32*(5+9)); + &and ("esp",-512); # ensure that frame + # doesn't cross page + # boundary, which is + # essential for + # misaligned 32-byte + # loads + + ################################################################ + # expand and copy pre-calculated table to stack + + &vmovdqu (&X($D0),&QWP(16*(3+0),"edi")); + &lea ("edx",&DWP(32*5+128,"esp")); # +128 size optimization + &vmovdqu (&X($D1),&QWP(16*(3+1),"edi")); + &vmovdqu (&X($D2),&QWP(16*(3+2),"edi")); + &vmovdqu (&X($D3),&QWP(16*(3+3),"edi")); + &vmovdqu (&X($D4),&QWP(16*(3+4),"edi")); + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &vpermq ($D0,$D0,0b01000000); # 00001234 -> 12343434 + &vpermq ($D1,$D1,0b01000000); + &vpermq ($D2,$D2,0b01000000); + &vpermq ($D3,$D3,0b01000000); + &vpermq ($D4,$D4,0b01000000); + &vpshufd ($D0,$D0,0b11001000); # 12343434 -> 14243444 + &vpshufd ($D1,$D1,0b11001000); + &vpshufd ($D2,$D2,0b11001000); + &vpshufd ($D3,$D3,0b11001000); + &vpshufd ($D4,$D4,0b11001000); + &vmovdqa (&QWP(32*0-128,"edx"),$D0); + &vmovdqu (&X($D0),&QWP(16*5,"edi")); + &vmovdqa (&QWP(32*1-128,"edx"),$D1); + &vmovdqu (&X($D1),&QWP(16*6,"edi")); + &vmovdqa (&QWP(32*2-128,"edx"),$D2); + &vmovdqu (&X($D2),&QWP(16*7,"edi")); + &vmovdqa (&QWP(32*3-128,"edx"),$D3); + &vmovdqu (&X($D3),&QWP(16*8,"edi")); + &vmovdqa (&QWP(32*4-128,"edx"),$D4); + &vpermq ($D0,$D0,0b01000000); + &vpermq ($D1,$D1,0b01000000); + &vpermq ($D2,$D2,0b01000000); + &vpermq ($D3,$D3,0b01000000); + &vpshufd ($D0,$D0,0b11001000); + &vpshufd ($D1,$D1,0b11001000); + &vpshufd ($D2,$D2,0b11001000); + &vpshufd ($D3,$D3,0b11001000); + &vmovdqa (&QWP(32*5-128,"edx"),$D0); + &vmovd (&X($D0),&DWP(-16*3+4*0,"edi"));# load hash value + &vmovdqa (&QWP(32*6-128,"edx"),$D1); + &vmovd (&X($D1),&DWP(-16*3+4*1,"edi")); + &vmovdqa (&QWP(32*7-128,"edx"),$D2); + &vmovd (&X($D2),&DWP(-16*3+4*2,"edi")); + &vmovdqa (&QWP(32*8-128,"edx"),$D3); + &vmovd (&X($D3),&DWP(-16*3+4*3,"edi")); + &vmovd (&X($D4),&DWP(-16*3+4*4,"edi")); + &vmovdqa ($MASK,&QWP(64,"ebx")); + &neg ("eax"); # padbit + + &test ("ecx",63); + &jz (&label("even")); + + &mov ("edx","ecx"); + &and ("ecx",-64); + &and ("edx",63); + + &vmovdqu (&X($T0),&QWP(16*0,"esi")); + &cmp ("edx",32); + &jb (&label("one")); + + &vmovdqu (&X($T1),&QWP(16*1,"esi")); + &je (&label("two")); + + &vinserti128 ($T0,$T0,&QWP(16*2,"esi"),1); + &lea ("esi",&DWP(16*3,"esi")); + &lea ("ebx",&DWP(8,"ebx")); # three padbits + &lea ("edx",&DWP(32*5+128+8,"esp")); # --:r^1:r^2:r^3 (*) + &jmp (&label("tail")); + +&set_label("two"); + &lea ("esi",&DWP(16*2,"esi")); + &lea ("ebx",&DWP(16,"ebx")); # two padbits + &lea ("edx",&DWP(32*5+128+16,"esp"));# --:--:r^1:r^2 (*) + &jmp (&label("tail")); + +&set_label("one"); + &lea ("esi",&DWP(16*1,"esi")); + 
&vpxor ($T1,$T1,$T1); + &lea ("ebx",&DWP(32,"ebx","eax",8)); # one or no padbits + &lea ("edx",&DWP(32*5+128+24,"esp"));# --:--:--:r^1 (*) + &jmp (&label("tail")); + +# (*) spots marked with '--' are data from next table entry, but they +# are multiplied by 0 and therefore rendered insignificant + +&set_label("even",32); + &vmovdqu (&X($T0),&QWP(16*0,"esi")); # load input + &vmovdqu (&X($T1),&QWP(16*1,"esi")); + &vinserti128 ($T0,$T0,&QWP(16*2,"esi"),1); + &vinserti128 ($T1,$T1,&QWP(16*3,"esi"),1); + &lea ("esi",&DWP(16*4,"esi")); + &sub ("ecx",64); + &jz (&label("tail")); + +&set_label("loop"); + ################################################################ + # ((inp[0]*r^4+r[4])*r^4+r[8])*r^4 + # ((inp[1]*r^4+r[5])*r^4+r[9])*r^3 + # ((inp[2]*r^4+r[6])*r^4+r[10])*r^2 + # ((inp[3]*r^4+r[7])*r^4+r[11])*r^1 + # \________/ \_______/ + ################################################################ + +sub vsplat_input { + &vmovdqa (&QWP(32*2,"esp"),$D2); + &vpsrldq ($D2,$T0,6); # splat input + &vmovdqa (&QWP(32*0,"esp"),$D0); + &vpsrldq ($D0,$T1,6); + &vmovdqa (&QWP(32*1,"esp"),$D1); + &vpunpckhqdq ($D1,$T0,$T1); # 4 + &vpunpcklqdq ($T0,$T0,$T1); # 0:1 + &vpunpcklqdq ($D2,$D2,$D0); # 2:3 + + &vpsrlq ($D0,$D2,30); + &vpsrlq ($D2,$D2,4); + &vpsrlq ($T1,$T0,26); + &vpsrlq ($D1,$D1,40); # 4 + &vpand ($D2,$D2,$MASK); # 2 + &vpand ($T0,$T0,$MASK); # 0 + &vpand ($T1,$T1,$MASK); # 1 + &vpand ($D0,$D0,$MASK); # 3 (*) + &vpor ($D1,$D1,&QWP(0,"ebx")); # padbit, yes, always + + # (*) note that output is counterintuitive, inp[3:4] is + # returned in $D1-2, while $D3-4 are preserved; +} + &vsplat_input (); + +sub vpmuladd { +my $addr = shift; + + &vpaddq ($D2,$D2,&QWP(32*2,"esp")); # add hash value + &vpaddq ($T0,$T0,&QWP(32*0,"esp")); + &vpaddq ($T1,$T1,&QWP(32*1,"esp")); + &vpaddq ($D0,$D0,$D3); + &vpaddq ($D1,$D1,$D4); + + ################################################################ + # d3 = h2*r1 + h0*r3 + h1*r2 + h3*r0 + h4*5*r4 + # d4 = h2*r2 + h0*r4 + h1*r3 + h3*r1 + h4*r0 + # d0 = h2*5*r3 + h0*r0 + h1*5*r4 + h3*5*r2 + h4*5*r1 + # d1 = h2*5*r4 + h0*r1 + h1*r0 + h3*5*r3 + h4*5*r2 + # d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + + &vpmuludq ($D3,$D2,&$addr(1)); # d3 = h2*r1 + &vmovdqa (QWP(32*1,"esp"),$T1); + &vpmuludq ($D4,$D2,&$addr(2)); # d4 = h2*r2 + &vmovdqa (QWP(32*3,"esp"),$D0); + &vpmuludq ($D0,$D2,&$addr(7)); # d0 = h2*s3 + &vmovdqa (QWP(32*4,"esp"),$D1); + &vpmuludq ($D1,$D2,&$addr(8)); # d1 = h2*s4 + &vpmuludq ($D2,$D2,&$addr(0)); # d2 = h2*r0 + + &vpmuludq ($T2,$T0,&$addr(3)); # h0*r3 + &vpaddq ($D3,$D3,$T2); # d3 += h0*r3 + &vpmuludq ($T1,$T0,&$addr(4)); # h0*r4 + &vpaddq ($D4,$D4,$T1); # d4 + h0*r4 + &vpmuludq ($T2,$T0,&$addr(0)); # h0*r0 + &vpaddq ($D0,$D0,$T2); # d0 + h0*r0 + &vmovdqa ($T2,&QWP(32*1,"esp")); # h1 + &vpmuludq ($T1,$T0,&$addr(1)); # h0*r1 + &vpaddq ($D1,$D1,$T1); # d1 += h0*r1 + &vpmuludq ($T0,$T0,&$addr(2)); # h0*r2 + &vpaddq ($D2,$D2,$T0); # d2 += h0*r2 + + &vpmuludq ($T1,$T2,&$addr(2)); # h1*r2 + &vpaddq ($D3,$D3,$T1); # d3 += h1*r2 + &vpmuludq ($T0,$T2,&$addr(3)); # h1*r3 + &vpaddq ($D4,$D4,$T0); # d4 += h1*r3 + &vpmuludq ($T1,$T2,&$addr(8)); # h1*s4 + &vpaddq ($D0,$D0,$T1); # d0 += h1*s4 + &vmovdqa ($T1,&QWP(32*3,"esp")); # h3 + &vpmuludq ($T0,$T2,&$addr(0)); # h1*r0 + &vpaddq ($D1,$D1,$T0); # d1 += h1*r0 + &vpmuludq ($T2,$T2,&$addr(1)); # h1*r1 + &vpaddq ($D2,$D2,$T2); # d2 += h1*r1 + + &vpmuludq ($T0,$T1,&$addr(0)); # h3*r0 + &vpaddq ($D3,$D3,$T0); # d3 += h3*r0 + &vpmuludq ($T2,$T1,&$addr(1)); # h3*r1 + &vpaddq ($D4,$D4,$T2); # d4 += h3*r1 + &vpmuludq 
($T0,$T1,&$addr(6)); # h3*s2 + &vpaddq ($D0,$D0,$T0); # d0 += h3*s2 + &vmovdqa ($T0,&QWP(32*4,"esp")); # h4 + &vpmuludq ($T2,$T1,&$addr(7)); # h3*s3 + &vpaddq ($D1,$D1,$T2); # d1+= h3*s3 + &vpmuludq ($T1,$T1,&$addr(8)); # h3*s4 + &vpaddq ($D2,$D2,$T1); # d2 += h3*s4 + + &vpmuludq ($T2,$T0,&$addr(8)); # h4*s4 + &vpaddq ($D3,$D3,$T2); # d3 += h4*s4 + &vpmuludq ($T1,$T0,&$addr(5)); # h4*s1 + &vpaddq ($D0,$D0,$T1); # d0 += h4*s1 + &vpmuludq ($T2,$T0,&$addr(0)); # h4*r0 + &vpaddq ($D4,$D4,$T2); # d4 += h4*r0 + &vmovdqa ($MASK,&QWP(64,"ebx")); + &vpmuludq ($T1,$T0,&$addr(6)); # h4*s2 + &vpaddq ($D1,$D1,$T1); # d1 += h4*s2 + &vpmuludq ($T0,$T0,&$addr(7)); # h4*s3 + &vpaddq ($D2,$D2,$T0); # d2 += h4*s3 +} + &vpmuladd (sub { my $i=shift; &QWP(32*$i-128,"edx"); }); + +sub vlazy_reduction { + ################################################################ + # lazy reduction + + &vpsrlq ($T0,$D3,26); + &vpand ($D3,$D3,$MASK); + &vpsrlq ($T1,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpaddq ($D4,$D4,$T0); # h3 -> h4 + &vpaddq ($D1,$D1,$T1); # h0 -> h1 + &vpsrlq ($T0,$D4,26); + &vpand ($D4,$D4,$MASK); + &vpsrlq ($T1,$D1,26); + &vpand ($D1,$D1,$MASK); + &vpaddq ($D2,$D2,$T1); # h1 -> h2 + &vpaddq ($D0,$D0,$T0); + &vpsllq ($T0,$T0,2); + &vpsrlq ($T1,$D2,26); + &vpand ($D2,$D2,$MASK); + &vpaddq ($D0,$D0,$T0); # h4 -> h0 + &vpaddq ($D3,$D3,$T1); # h2 -> h3 + &vpsrlq ($T1,$D3,26); + &vpsrlq ($T0,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpand ($D3,$D3,$MASK); + &vpaddq ($D1,$D1,$T0); # h0 -> h1 + &vpaddq ($D4,$D4,$T1); # h3 -> h4 +} + &vlazy_reduction(); + + &vmovdqu (&X($T0),&QWP(16*0,"esi")); # load input + &vmovdqu (&X($T1),&QWP(16*1,"esi")); + &vinserti128 ($T0,$T0,&QWP(16*2,"esi"),1); + &vinserti128 ($T1,$T1,&QWP(16*3,"esi"),1); + &lea ("esi",&DWP(16*4,"esi")); + &sub ("ecx",64); + &jnz (&label("loop")); + +&set_label("tail"); + &vsplat_input (); + &and ("ebx",-64); # restore pointer + + &vpmuladd (sub { my $i=shift; &QWP(4+32*$i-128,"edx"); }); + + ################################################################ + # horizontal addition + + &vpsrldq ($T0,$D4,8); + &vpsrldq ($T1,$D3,8); + &vpaddq ($D4,$D4,$T0); + &vpsrldq ($T0,$D0,8); + &vpaddq ($D3,$D3,$T1); + &vpsrldq ($T1,$D1,8); + &vpaddq ($D0,$D0,$T0); + &vpsrldq ($T0,$D2,8); + &vpaddq ($D1,$D1,$T1); + &vpermq ($T1,$D4,2); # keep folding + &vpaddq ($D2,$D2,$T0); + &vpermq ($T0,$D3,2); + &vpaddq ($D4,$D4,$T1); + &vpermq ($T1,$D0,2); + &vpaddq ($D3,$D3,$T0); + &vpermq ($T0,$D1,2); + &vpaddq ($D0,$D0,$T1); + &vpermq ($T1,$D2,2); + &vpaddq ($D1,$D1,$T0); + &vpaddq ($D2,$D2,$T1); + + &vlazy_reduction(); + + &cmp ("ecx",0); + &je (&label("done")); + + ################################################################ + # clear all but single word + + &vpshufd (&X($D0),&X($D0),0b11111100); + &lea ("edx",&DWP(32*5+128,"esp")); # restore pointer + &vpshufd (&X($D1),&X($D1),0b11111100); + &vpshufd (&X($D2),&X($D2),0b11111100); + &vpshufd (&X($D3),&X($D3),0b11111100); + &vpshufd (&X($D4),&X($D4),0b11111100); + &jmp (&label("even")); + +&set_label("done",16); + &vmovd (&DWP(-16*3+4*0,"edi"),&X($D0));# store hash value + &vmovd (&DWP(-16*3+4*1,"edi"),&X($D1)); + &vmovd (&DWP(-16*3+4*2,"edi"),&X($D2)); + &vmovd (&DWP(-16*3+4*3,"edi"),&X($D3)); + &vmovd (&DWP(-16*3+4*4,"edi"),&X($D4)); + &vzeroupper (); + &mov ("esp","ebp"); +&set_label("nodata"); +&function_end("_poly1305_blocks_avx2"); +} +&set_label("const_sse2",64); + &data_word(1<<24,0, 1<<24,0, 1<<24,0, 1<<24,0); + &data_word(0,0, 0,0, 0,0, 0,0); + &data_word(0x03ffffff,0,0x03ffffff,0, 0x03ffffff,0, 0x03ffffff,0); + 
&data_word(0x0fffffff,0x0ffffffc,0x0ffffffc,0x0ffffffc); +} +&asciz ("Poly1305 for x86, CRYPTOGAMS by "); +&align (4); + +&asm_finish(); + +close STDOUT; diff --git a/crypto/nasm.props b/crypto/nasm.props new file mode 100644 index 0000000..c3c5610 --- /dev/null +++ b/crypto/nasm.props @@ -0,0 +1,18 @@ + + + + Midl + CustomBuild + + + + $(IntDir)%(FileName).obj + 0 + c:\dev\nasm\nasm.exe -f win32 [AllOptions] [AdditionalOptions] %(FullPath) + c:\dev\nasm\nasm.exe -f win64 [AllOptions] [AdditionalOptions] %(FullPath) + echo NASM not supported on this platform + Assembling [Inputs]... + + + \ No newline at end of file diff --git a/crypto/nasm.targets b/crypto/nasm.targets new file mode 100644 index 0000000..8dd1989 --- /dev/null +++ b/crypto/nasm.targets @@ -0,0 +1,82 @@ + + + + + + _NASM + + + + + $(ComputeLinkInputsTargets); + ComputeNASMOutput; + + + $(ComputeLibInputsTargets); + ComputeNASMOutput; + + + + $(MSBuildThisFileDirectory)$(MSBuildThisFileName).xml + + + + + + + + @(NASM, '|') + + + + + + + + + + + + + diff --git a/crypto/nasm.xml b/crypto/nasm.xml new file mode 100644 index 0000000..c2bc89b --- /dev/null +++ b/crypto/nasm.xml @@ -0,0 +1,308 @@ + + + + + + + + + + General + + + + + Preprocessing Options + + + + + Assembler Options + + + + + Advanced + + + + + Command Line + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Execute Before + + + Specifies the targets for the build customization to run before. + + + + + + + + + + + Execute After + + + Specifies the targets for the build customization to run after. + + + + + + + + + + + + + + + Additional Options + + + Additional Options + + + + + + + + diff --git a/crypto/poly1305_x64_gas.s b/crypto/poly1305_x64_gas.s new file mode 100644 index 0000000..709ca13 --- /dev/null +++ b/crypto/poly1305_x64_gas.s @@ -0,0 +1,3132 @@ +.align 64 +.Lconst: +.Lmask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +.L129: +.long 16777216,0,16777216,0,16777216,0,16777216,0 +.Lmask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +.Lpermd_avx2: +.long 2,2,2,3,2,0,2,1 +.Lpermd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +.L2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +.L2_44_inp_shift: +.quad 0,12,24,64 +.L2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +.L2_44_shift_rgt: +.quad 44,44,42,64 +.L2_44_shift_lft: +.quad 8,8,10,64 + +.align 64 +.Lx_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.Lx_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +.text + + +.global poly1305_init_x86_64 +.global poly1305_blocks_x86_64 +.global poly1305_emit_x86_64 +.global poly1305_emit_avx +.global poly1305_blocks_avx +.global poly1305_blocks_avx2 +.global poly1305_blocks_avx512 + + +.type poly1305_init_x86_64,@function +.align 32 +poly1305_init_x86_64: + xorq %rax,%rax + movq %rax,0(%rdi) + movq %rax,8(%rdi) + movq %rax,16(%rdi) + + cmpq $0,%rsi + je .Lno_key + + + + movq $0x0ffffffc0fffffff,%rax + movq $0x0ffffffc0ffffffc,%rcx + andq 0(%rsi),%rax + andq 8(%rsi),%rcx + movq %rax,24(%rdi) + movq %rcx,32(%rdi) + movl $1,%eax +.Lno_key: + ret +.size poly1305_init_x86_64,.-poly1305_init_x86_64 + +.type poly1305_blocks_x86_64,@function +.align 32 +poly1305_blocks_x86_64: +.cfi_startproc +.Lblocks: + shrq $4,%rdx + jz .Lno_data + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + 
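Two details of the flat assembly that just started are worth spelling out. The constant pool at .Lconst defines the 26-bit limb masks used everywhere below, and poly1305_init_x86_64 applies the standard Poly1305 clamp to r: its two andq masks clear the top four bits of key bytes 3, 7, 11 and 15 and the low two bits of bytes 4, 8 and 12 (RFC 7539, section 2.5). A small C sketch of that clamp, with poly1305_clamp as an illustrative name:

    #include <stdint.h>
    #include <string.h>

    /* key points at the 16-byte r-half of the one-time key (little-endian). */
    static void poly1305_clamp(uint64_t r[2], const uint8_t key[16]) {
        memcpy(&r[0], key + 0, 8);
        memcpy(&r[1], key + 8, 8);
        r[0] &= 0x0ffffffc0fffffffULL;  /* same masks as the movq/andq pairs above */
        r[1] &= 0x0ffffffc0ffffffcULL;
    }

The pushes that follow finish saving the callee-saved registers in the prologue of poly1305_blocks_x86_64.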
pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movq 16(%rdi),%rbp + + movq %r13,%r12 + shrq $2,%r13 + movq %r12,%rax + addq %r12,%r13 + jmp .Loop + +.align 32 +.Loop: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + movq %r12,%rax + decq %r15 + jnz .Loop + + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data: +.Lblocks_epilogue: + ret +.cfi_endproc +.size poly1305_blocks_x86_64,.-poly1305_blocks_x86_64 + +.type poly1305_emit_x86_64,@function +.align 32 +poly1305_emit_x86_64: +.Lemit: + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movq 16(%rdi),%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret +.size poly1305_emit_x86_64,.-poly1305_emit_x86_64 +.type __poly1305_block,@function +.align 32 +__poly1305_block: + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + ret +.size __poly1305_block,.-__poly1305_block + +.type __poly1305_init_avx,@function +.align 32 +__poly1305_init_avx: + movq %r11,%r14 + movq %r12,%rbx + xorq %rbp,%rbp + + leaq 48+64(%rdi),%rdi + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + movq %r14,%r8 + andl %r14d,%eax + movq %r11,%r9 + andl %r11d,%edx + movl %eax,-64(%rdi) + shrq $26,%r8 + movl %edx,-60(%rdi) + shrq $26,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + andl %r8d,%eax + andl %r9d,%edx + movl %eax,-48(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-44(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,-32(%rdi) + shrq $26,%r8 + movl %edx,-28(%rdi) + shrq $26,%r9 + + movq %rbx,%rax + movq %r12,%rdx + shlq $12,%rax + shlq $12,%rdx + orq %r8,%rax + orq %r9,%rdx + andl $0x3ffffff,%eax + andl $0x3ffffff,%edx + movl %eax,-16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-12(%rdi) + leal 
(%rdx,%rdx,4),%edx + movl %eax,0(%rdi) + movq %rbx,%r8 + movl %edx,4(%rdi) + movq %r12,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + shrq $14,%r8 + shrq $14,%r9 + andl %r8d,%eax + andl %r9d,%edx + movl %eax,16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,20(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,32(%rdi) + shrq $26,%r8 + movl %edx,36(%rdi) + shrq $26,%r9 + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,48(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r9d,52(%rdi) + leaq (%r9,%r9,4),%r9 + movl %r8d,64(%rdi) + movl %r9d,68(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-52(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-36(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-20(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-4(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,12(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,28(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,44(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,60(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,76(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-56(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-40(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-24(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-8(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,8(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,24(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,40(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,56(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,72(%rdi) + + leaq -48-64(%rdi),%rdi + ret +.size __poly1305_init_avx,.-__poly1305_init_avx + +.type poly1305_blocks_avx,@function +.align 32 +poly1305_blocks_avx: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx: + andq $-16,%rdx + jz .Lno_data_avx + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx + + testq $31,%rdx + jz .Leven_avx + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + + call __poly1305_block + + testq %rcx,%rcx + jz .Lstore_base2_64_avx + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq 
%rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + subq $16,%r15 + jz .Lstore_base2_26_avx + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx + +.align 32 +.Lstore_base2_64_avx: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx + +.align 16 +.Lstore_base2_26_avx: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx: +.Lblocks_avx_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $31,%rdx + jz .Linit_avx + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + +.Linit_avx: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx: + movq %r15,%rdx + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx_epilogue: + jmp .Ldo_avx +.cfi_endproc + +.align 32 +.Leven_avx: +.cfi_startproc + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx: + leaq -88(%rsp),%r11 +.cfi_def_cfa %r11,0x60 + subq $0x178,%rsp + subq $64,%rdx + leaq -32(%rsi),%rax + cmovcq %rax,%rsi + + vmovdqu 48(%rdi),%xmm14 + leaq 112(%rdi),%rdi + leaq .Lconst(%rip),%rcx + + + + vmovdqu 32(%rsi),%xmm5 + vmovdqu 48(%rsi),%xmm6 + vmovdqa 64(%rcx),%xmm15 + + vpsrldq $6,%xmm5,%xmm7 + vpsrldq $6,%xmm6,%xmm8 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + vpsrlq $40,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + jbe .Lskip_loop_avx + + + vmovdqu -48(%rdi),%xmm11 + vmovdqu -32(%rdi),%xmm12 + vpshufd 
$0xEE,%xmm14,%xmm13 + vpshufd $0x44,%xmm14,%xmm10 + vmovdqa %xmm13,-144(%r11) + vmovdqa %xmm10,0(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vmovdqu -16(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-128(%r11) + vmovdqa %xmm11,16(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqu 0(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-112(%r11) + vmovdqa %xmm12,32(%rsp) + vpshufd $0xEE,%xmm10,%xmm14 + vmovdqu 16(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm14,-96(%r11) + vmovdqa %xmm10,48(%rsp) + vpshufd $0xEE,%xmm11,%xmm13 + vmovdqu 32(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm13,-80(%r11) + vmovdqa %xmm11,64(%rsp) + vpshufd $0xEE,%xmm12,%xmm14 + vmovdqu 48(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm14,-64(%r11) + vmovdqa %xmm12,80(%rsp) + vpshufd $0xEE,%xmm10,%xmm13 + vmovdqu 64(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm13,-48(%r11) + vmovdqa %xmm10,96(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-32(%r11) + vmovdqa %xmm11,112(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqa 0(%rsp),%xmm14 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-16(%r11) + vmovdqa %xmm12,128(%rsp) + + jmp .Loop_avx + +.align 32 +.Loop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq %xmm5,%xmm14,%xmm10 + vpmuludq %xmm6,%xmm14,%xmm11 + vmovdqa %xmm2,32(%r11) + vpmuludq %xmm7,%xmm14,%xmm12 + vmovdqa 16(%rsp),%xmm2 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vmovdqa %xmm0,0(%r11) + vpmuludq 32(%rsp),%xmm9,%xmm0 + vmovdqa %xmm1,16(%r11) + vpmuludq %xmm8,%xmm2,%xmm1 + vpaddq %xmm0,%xmm10,%xmm10 + vpaddq %xmm1,%xmm14,%xmm14 + vmovdqa %xmm3,48(%r11) + vpmuludq %xmm7,%xmm2,%xmm0 + vpmuludq %xmm6,%xmm2,%xmm1 + vpaddq %xmm0,%xmm13,%xmm13 + vmovdqa 48(%rsp),%xmm3 + vpaddq %xmm1,%xmm12,%xmm12 + vmovdqa %xmm4,64(%r11) + vpmuludq %xmm5,%xmm2,%xmm2 + vpmuludq %xmm7,%xmm3,%xmm0 + vpaddq %xmm2,%xmm11,%xmm11 + + vmovdqa 64(%rsp),%xmm4 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm3,%xmm1 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm1,%xmm13,%xmm13 + vmovdqa 80(%rsp),%xmm2 + vpaddq %xmm3,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm4,%xmm0 + vpmuludq %xmm8,%xmm4,%xmm4 + vpaddq %xmm0,%xmm11,%xmm11 + vmovdqa 96(%rsp),%xmm3 + vpaddq %xmm4,%xmm10,%xmm10 + + vmovdqa 128(%rsp),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm1 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm1,%xmm14,%xmm14 + vpaddq %xmm2,%xmm13,%xmm13 + vpmuludq %xmm9,%xmm3,%xmm0 + vpmuludq %xmm8,%xmm3,%xmm1 + vpaddq %xmm0,%xmm12,%xmm12 + vmovdqu 0(%rsi),%xmm0 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm3,%xmm3 + vpmuludq %xmm7,%xmm4,%xmm7 + vpaddq %xmm3,%xmm10,%xmm10 + + vmovdqu 16(%rsi),%xmm1 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm8,%xmm4,%xmm8 + vpmuludq %xmm9,%xmm4,%xmm9 + vpsrldq $6,%xmm0,%xmm2 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm9,%xmm13,%xmm13 + vpsrldq $6,%xmm1,%xmm3 + vpmuludq 112(%rsp),%xmm5,%xmm9 + vpmuludq %xmm6,%xmm4,%xmm5 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpaddq %xmm9,%xmm14,%xmm14 + vmovdqa -144(%r11),%xmm9 + vpaddq %xmm5,%xmm10,%xmm10 + + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + + vpsrldq $5,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpand 0(%rcx),%xmm4,%xmm4 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + leaq 32(%rsi),%rax + leaq 
64(%rsi),%rsi + subq $64,%rdx + cmovcq %rax,%rsi + + + + + + + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vmovdqa -128(%r11),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm5 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm5,%xmm12,%xmm12 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpmuludq -112(%r11),%xmm4,%xmm5 + vpaddq %xmm9,%xmm14,%xmm14 + + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm2,%xmm7,%xmm6 + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -96(%r11),%xmm8 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm7,%xmm6 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm6,%xmm12,%xmm12 + vpaddq %xmm7,%xmm11,%xmm11 + + vmovdqa -80(%r11),%xmm9 + vpmuludq %xmm2,%xmm8,%xmm5 + vpmuludq %xmm1,%xmm8,%xmm6 + vpaddq %xmm5,%xmm14,%xmm14 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -64(%r11),%xmm7 + vpmuludq %xmm0,%xmm8,%xmm8 + vpmuludq %xmm4,%xmm9,%xmm5 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm5,%xmm11,%xmm11 + vmovdqa -48(%r11),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm9 + vpmuludq %xmm1,%xmm7,%xmm6 + vpaddq %xmm9,%xmm10,%xmm10 + + vmovdqa -16(%r11),%xmm9 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm7,%xmm7 + vpmuludq %xmm4,%xmm8,%xmm5 + vpaddq %xmm7,%xmm13,%xmm13 + vpaddq %xmm5,%xmm12,%xmm12 + vmovdqu 32(%rsi),%xmm5 + vpmuludq %xmm3,%xmm8,%xmm7 + vpmuludq %xmm2,%xmm8,%xmm8 + vpaddq %xmm7,%xmm11,%xmm11 + vmovdqu 48(%rsi),%xmm6 + vpaddq %xmm8,%xmm10,%xmm10 + + vpmuludq %xmm2,%xmm9,%xmm2 + vpmuludq %xmm3,%xmm9,%xmm3 + vpsrldq $6,%xmm5,%xmm7 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm9,%xmm4 + vpsrldq $6,%xmm6,%xmm8 + vpaddq %xmm3,%xmm12,%xmm2 + vpaddq %xmm4,%xmm13,%xmm3 + vpmuludq -32(%r11),%xmm0,%xmm4 + vpmuludq %xmm1,%xmm9,%xmm0 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpaddq %xmm4,%xmm14,%xmm4 + vpaddq %xmm0,%xmm10,%xmm0 + + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + + vpsrldq $5,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vmovdqa 0(%rsp),%xmm14 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpand 0(%rcx),%xmm9,%xmm9 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + + + + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm11,%xmm1 + + vpsrlq $26,%xmm4,%xmm10 + vpand %xmm15,%xmm4,%xmm4 + + vpsrlq $26,%xmm1,%xmm11 + vpand %xmm15,%xmm1,%xmm1 + vpaddq %xmm11,%xmm2,%xmm2 + + vpaddq %xmm10,%xmm0,%xmm0 + vpsllq $2,%xmm10,%xmm10 + vpaddq %xmm10,%xmm0,%xmm0 + + vpsrlq $26,%xmm2,%xmm12 + vpand %xmm15,%xmm2,%xmm2 + vpaddq %xmm12,%xmm3,%xmm3 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm1,%xmm1 + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + ja .Loop_avx + +.Lskip_loop_avx: + + + + vpshufd $0x10,%xmm14,%xmm14 + addq $32,%rdx + jnz .Long_tail_avx + + vpaddq %xmm2,%xmm7,%xmm7 + vpaddq %xmm0,%xmm5,%xmm5 + vpaddq %xmm1,%xmm6,%xmm6 + vpaddq %xmm3,%xmm8,%xmm8 + vpaddq %xmm4,%xmm9,%xmm9 + +.Long_tail_avx: + vmovdqa %xmm2,32(%r11) + vmovdqa %xmm0,0(%r11) + vmovdqa %xmm1,16(%r11) + vmovdqa %xmm3,48(%r11) + vmovdqa %xmm4,64(%r11) + + + + + + + + vpmuludq %xmm7,%xmm14,%xmm12 + vpmuludq %xmm5,%xmm14,%xmm10 + vpshufd $0x10,-48(%rdi),%xmm2 + vpmuludq %xmm6,%xmm14,%xmm11 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm8,%xmm2,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpshufd $0x10,-32(%rdi),%xmm3 + vpmuludq %xmm7,%xmm2,%xmm1 + vpaddq 
%xmm1,%xmm13,%xmm13 + vpshufd $0x10,-16(%rdi),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm9,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + vpshufd $0x10,0(%rdi),%xmm2 + vpmuludq %xmm7,%xmm4,%xmm1 + vpaddq %xmm1,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm4,%xmm0 + vpaddq %xmm0,%xmm13,%xmm13 + vpshufd $0x10,16(%rdi),%xmm3 + vpmuludq %xmm5,%xmm4,%xmm4 + vpaddq %xmm4,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm2,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpshufd $0x10,32(%rdi),%xmm4 + vpmuludq %xmm8,%xmm2,%xmm2 + vpaddq %xmm2,%xmm10,%xmm10 + + vpmuludq %xmm6,%xmm3,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm3,%xmm13,%xmm13 + vpshufd $0x10,48(%rdi),%xmm2 + vpmuludq %xmm9,%xmm4,%xmm1 + vpaddq %xmm1,%xmm12,%xmm12 + vpshufd $0x10,64(%rdi),%xmm3 + vpmuludq %xmm8,%xmm4,%xmm0 + vpaddq %xmm0,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm14,%xmm14 + vpmuludq %xmm9,%xmm3,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpmuludq %xmm8,%xmm3,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm7,%xmm3,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm6,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + jz .Lshort_tail_avx + + vmovdqu 0(%rsi),%xmm0 + vmovdqu 16(%rsi),%xmm1 + + vpsrldq $6,%xmm0,%xmm2 + vpsrldq $6,%xmm1,%xmm3 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + vpsrlq $40,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpshufd $0x32,-64(%rdi),%xmm9 + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpshufd $0x32,-48(%rdi),%xmm7 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpaddq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpshufd $0x32,-32(%rdi),%xmm8 + vpmuludq %xmm2,%xmm7,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpshufd $0x32,-16(%rdi),%xmm9 + vpmuludq %xmm1,%xmm7,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + + vpshufd $0x32,0(%rdi),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm6 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm9,%xmm5 + vpaddq %xmm5,%xmm13,%xmm13 + vpshufd $0x32,16(%rdi),%xmm8 + vpmuludq %xmm0,%xmm9,%xmm9 + vpaddq %xmm9,%xmm12,%xmm12 + vpmuludq %xmm4,%xmm7,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpshufd $0x32,32(%rdi),%xmm9 + vpmuludq %xmm3,%xmm7,%xmm7 + vpaddq %xmm7,%xmm10,%xmm10 + + vpmuludq %xmm1,%xmm8,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm8,%xmm8 + vpaddq %xmm8,%xmm13,%xmm13 + vpshufd $0x32,48(%rdi),%xmm7 + vpmuludq %xmm4,%xmm9,%xmm6 + vpaddq %xmm6,%xmm12,%xmm12 + vpshufd $0x32,64(%rdi),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm5 + vpaddq %xmm5,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm9 + vpaddq %xmm9,%xmm10,%xmm10 + + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm14,%xmm14 + vpmuludq %xmm4,%xmm8,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm3,%xmm8,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm2,%xmm8,%xmm6 + vpaddq 
%xmm6,%xmm11,%xmm11 + vpmuludq %xmm1,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + +.Lshort_tail_avx: + + + + vpsrldq $8,%xmm14,%xmm9 + vpsrldq $8,%xmm13,%xmm8 + vpsrldq $8,%xmm11,%xmm6 + vpsrldq $8,%xmm10,%xmm5 + vpsrldq $8,%xmm12,%xmm7 + vpaddq %xmm8,%xmm13,%xmm13 + vpaddq %xmm9,%xmm14,%xmm14 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vpaddq %xmm7,%xmm12,%xmm12 + + + + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm14,%xmm4 + vpand %xmm15,%xmm14,%xmm14 + + vpsrlq $26,%xmm11,%xmm1 + vpand %xmm15,%xmm11,%xmm11 + vpaddq %xmm1,%xmm12,%xmm12 + + vpaddq %xmm4,%xmm10,%xmm10 + vpsllq $2,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpsrlq $26,%xmm12,%xmm2 + vpand %xmm15,%xmm12,%xmm12 + vpaddq %xmm2,%xmm13,%xmm13 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vmovd %xmm10,-112(%rdi) + vmovd %xmm11,-108(%rdi) + vmovd %xmm12,-104(%rdi) + vmovd %xmm13,-100(%rdi) + vmovd %xmm14,-96(%rdi) + leaq 88(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx,.-poly1305_blocks_avx + +.type poly1305_emit_avx,@function +.align 32 +poly1305_emit_avx: + cmpl $0,20(%rdi) + je .Lemit + + movl 0(%rdi),%eax + movl 4(%rdi),%ecx + movl 8(%rdi),%r8d + movl 12(%rdi),%r11d + movl 16(%rdi),%r10d + + shlq $26,%rcx + movq %r8,%r9 + shlq $52,%r8 + addq %rcx,%rax + shrq $12,%r9 + addq %rax,%r8 + adcq $0,%r9 + + shlq $14,%r11 + movq %r10,%rax + shrq $24,%r10 + addq %r11,%r9 + shlq $40,%rax + addq %rax,%r9 + adcq $0,%r10 + + movq %r10,%rax + movq %r10,%rcx + andq $3,%r10 + shrq $2,%rax + andq $-4,%rcx + addq %rcx,%rax + addq %rax,%r8 + adcq $0,%r9 + adcq $0,%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret +.size poly1305_emit_avx,.-poly1305_emit_avx +.type poly1305_blocks_avx2,@function +.align 32 +poly1305_blocks_avx2: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx2 + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2: + andq $-16,%rdx + jz .Lno_data_avx2 + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx2 + + testq $63,%rdx + jz .Leven_avx2 + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx2_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +.Lbase2_26_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 
16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_26_pre_avx2 + + testq %rcx,%rcx + jz .Lstore_base2_64_avx2 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz .Lstore_base2_26_avx2 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx2 + +.align 32 +.Lstore_base2_64_avx2: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx2 + +.align 16 +.Lstore_base2_26_avx2: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx2: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2: +.Lblocks_avx2_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx2_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $63,%rdx + jz .Linit_avx2 + +.Lbase2_64_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_64_pre_avx2 + +.Linit_avx2: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx2: + movq %r15,%rdx + + + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue: + jmp .Ldo_avx2 +.cfi_endproc + +.align 32 +.Leven_avx2: +.cfi_startproc + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx2: + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd 
%ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz .Ltail_avx2 + jmp .Loop_avx2 + +.align 32 +.Loop_avx2: + + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + 
vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz .Loop_avx2 + +.byte 0x66,0x90 +.Ltail_avx2: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + 
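Written out in scalar form, the tail multiply above (like the .Loop_avx2 body before it) is one step of the Poly1305 recurrence, h = (h + m) * r mod 2^130 - 5, over five 26-bit limbs, with s[i] = 5*r[i] standing in for the precomputed "*5" table entries; the d0..d4 comments in the Perl generator earlier in this commit spell out the same products. The AVX2 loop evaluates four such steps at once against r^1..r^4 and then folds the four lanes together, which is what the vpsrldq/vpermq addition pairs above do. A sketch under those assumptions, reusing the lazy_reduce helper sketched earlier (illustrative names throughout):

    #include <stdint.h>

    /* m[0..4] are the 26-bit limbs of one 16-byte block, with the pad bit
       already set in m[4]; r and s = 5*r are the clamped key limbs. */
    static void poly1305_block_sketch(uint32_t h[5], const uint32_t m[5],
                                      const uint32_t r[5], const uint32_t s[5]) {
        uint64_t h0 = h[0] + (uint64_t)m[0], h1 = h[1] + (uint64_t)m[1],
                 h2 = h[2] + (uint64_t)m[2], h3 = h[3] + (uint64_t)m[3],
                 h4 = h[4] + (uint64_t)m[4];
        uint64_t d[5];

        d[0] = h0*r[0] + h1*s[4] + h2*s[3] + h3*s[2] + h4*s[1];
        d[1] = h0*r[1] + h1*r[0] + h2*s[4] + h3*s[3] + h4*s[2];
        d[2] = h0*r[2] + h1*r[1] + h2*r[0] + h3*s[4] + h4*s[3];
        d[3] = h0*r[3] + h1*r[2] + h2*r[1] + h3*r[0] + h4*s[4];
        d[4] = h0*r[4] + h1*r[3] + h2*r[2] + h3*r[1] + h4*r[0];

        lazy_reduce(d);                      /* carry pass sketched earlier */
        for (int i = 0; i < 5; i++) h[i] = (uint32_t)d[i];
    }

The remaining vmovd stores below write limbs 3 and 4 of the folded result back into the context.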
vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +.type poly1305_blocks_avx512,@function +.align 32 +poly1305_blocks_avx512: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx2_512 + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2_512: + andq $-16,%rdx + jz .Lno_data_avx2_512 + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx2_512 + + testq $63,%rdx + jz .Leven_avx2_512 + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx2_body_512: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +.Lbase2_26_pre_avx2_512: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_26_pre_avx2_512 + + testq %rcx,%rcx + jz .Lstore_base2_64_avx2_512 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz .Lstore_base2_26_avx2_512 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx2_512 + +.align 32 +.Lstore_base2_64_avx2_512: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx2_512 + +.align 16 +.Lstore_base2_26_avx2_512: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx2_512: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2_512: +.Lblocks_avx2_epilogue_512: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2_512: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx2_body_512: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq 
%r12,%r13 + + testq $63,%rdx + jz .Linit_avx2_512 + +.Lbase2_64_pre_avx2_512: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_64_pre_avx2_512 + +.Linit_avx2_512: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx2_512: + movq %r15,%rdx + + + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue_512: + jmp .Ldo_avx2_512 +.cfi_endproc + +.align 32 +.Leven_avx2_512: +.cfi_startproc + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx2_512: + cmpq $512,%rdx + jae .Lblocks_avx512 +.Lskip_avx512: + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd %ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz .Ltail_avx2_512 + jmp .Loop_avx2_512 + +.align 32 +.Loop_avx2_512: + + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq 
%ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz .Loop_avx2_512 + +.byte 0x66,0x90 +.Ltail_avx2_512: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq 
%ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +.cfi_startproc +.Lblocks_avx512: + movl $15,%eax + kmovw %eax,%k2 + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm9 + + + vmovdqu -64(%rdi),%xmm11 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm12 + movq $0x20,%rax + vmovdqu -32(%rdi),%xmm7 + vmovdqu -16(%rdi),%xmm13 + vmovdqu 0(%rdi),%xmm8 + vmovdqu 16(%rdi),%xmm14 + vmovdqu 32(%rdi),%xmm10 + vmovdqu 48(%rdi),%xmm15 + vmovdqu 64(%rdi),%xmm6 + vpermd %zmm11,%zmm9,%zmm16 + vpbroadcastq 64(%rcx),%zmm5 + vpermd %zmm12,%zmm9,%zmm17 + vpermd %zmm7,%zmm9,%zmm21 + vpermd %zmm13,%zmm9,%zmm18 + vmovdqa64 %zmm16,0(%rsp){%k2} + vpsrlq $32,%zmm16,%zmm7 + vpermd %zmm8,%zmm9,%zmm22 + vmovdqu64 %zmm17,0(%rsp,%rax,1){%k2} + vpsrlq $32,%zmm17,%zmm8 + vpermd %zmm14,%zmm9,%zmm19 + vmovdqa64 %zmm21,64(%rsp){%k2} + vpermd %zmm10,%zmm9,%zmm23 + vpermd %zmm15,%zmm9,%zmm20 + vmovdqu64 %zmm18,64(%rsp,%rax,1){%k2} + vpermd %zmm6,%zmm9,%zmm24 + vmovdqa64 %zmm22,128(%rsp){%k2} + vmovdqu64 %zmm19,128(%rsp,%rax,1){%k2} + vmovdqa64 %zmm23,192(%rsp){%k2} + vmovdqu64 %zmm20,192(%rsp,%rax,1){%k2} + vmovdqa64 %zmm24,256(%rsp){%k2} + + + + + + + + + + + vpmuludq %zmm7,%zmm16,%zmm11 + vpmuludq %zmm7,%zmm17,%zmm12 + vpmuludq %zmm7,%zmm18,%zmm13 + vpmuludq %zmm7,%zmm19,%zmm14 + vpmuludq %zmm7,%zmm20,%zmm15 + vpsrlq $32,%zmm18,%zmm9 + + vpmuludq %zmm8,%zmm24,%zmm25 + vpmuludq %zmm8,%zmm16,%zmm26 + vpmuludq %zmm8,%zmm17,%zmm27 + vpmuludq %zmm8,%zmm18,%zmm28 + vpmuludq %zmm8,%zmm19,%zmm29 + vpsrlq $32,%zmm19,%zmm10 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq 
%zmm27,%zmm13,%zmm13 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + + vpmuludq %zmm9,%zmm23,%zmm25 + vpmuludq %zmm9,%zmm24,%zmm26 + vpmuludq %zmm9,%zmm17,%zmm28 + vpmuludq %zmm9,%zmm18,%zmm29 + vpmuludq %zmm9,%zmm16,%zmm27 + vpsrlq $32,%zmm20,%zmm6 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm10,%zmm22,%zmm25 + vpmuludq %zmm10,%zmm16,%zmm28 + vpmuludq %zmm10,%zmm17,%zmm29 + vpmuludq %zmm10,%zmm23,%zmm26 + vpmuludq %zmm10,%zmm24,%zmm27 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm6,%zmm24,%zmm28 + vpmuludq %zmm6,%zmm16,%zmm29 + vpmuludq %zmm6,%zmm21,%zmm25 + vpmuludq %zmm6,%zmm22,%zmm26 + vpmuludq %zmm6,%zmm23,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + + + vmovdqu64 0(%rsi),%zmm10 + vmovdqu64 64(%rsi),%zmm6 + leaq 128(%rsi),%rsi + + + + + vpsrlq $26,%zmm14,%zmm28 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm28,%zmm15,%zmm15 + + vpsrlq $26,%zmm11,%zmm25 + vpandq %zmm5,%zmm11,%zmm11 + vpaddq %zmm25,%zmm12,%zmm12 + + vpsrlq $26,%zmm15,%zmm29 + vpandq %zmm5,%zmm15,%zmm15 + + vpsrlq $26,%zmm12,%zmm26 + vpandq %zmm5,%zmm12,%zmm12 + vpaddq %zmm26,%zmm13,%zmm13 + + vpaddq %zmm29,%zmm11,%zmm11 + vpsllq $2,%zmm29,%zmm29 + vpaddq %zmm29,%zmm11,%zmm11 + + vpsrlq $26,%zmm13,%zmm27 + vpandq %zmm5,%zmm13,%zmm13 + vpaddq %zmm27,%zmm14,%zmm14 + + vpsrlq $26,%zmm11,%zmm25 + vpandq %zmm5,%zmm11,%zmm11 + vpaddq %zmm25,%zmm12,%zmm12 + + vpsrlq $26,%zmm14,%zmm28 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm28,%zmm15,%zmm15 + + + + + + vpunpcklqdq %zmm6,%zmm10,%zmm7 + vpunpckhqdq %zmm6,%zmm10,%zmm6 + + + + + + + vmovdqa32 128(%rcx),%zmm25 + movl $0x7777,%eax + kmovw %eax,%k1 + + vpermd %zmm16,%zmm25,%zmm16 + vpermd %zmm17,%zmm25,%zmm17 + vpermd %zmm18,%zmm25,%zmm18 + vpermd %zmm19,%zmm25,%zmm19 + vpermd %zmm20,%zmm25,%zmm20 + + vpermd %zmm11,%zmm25,%zmm16{%k1} + vpermd %zmm12,%zmm25,%zmm17{%k1} + vpermd %zmm13,%zmm25,%zmm18{%k1} + vpermd %zmm14,%zmm25,%zmm19{%k1} + vpermd %zmm15,%zmm25,%zmm20{%k1} + + vpslld $2,%zmm17,%zmm21 + vpslld $2,%zmm18,%zmm22 + vpslld $2,%zmm19,%zmm23 + vpslld $2,%zmm20,%zmm24 + vpaddd %zmm17,%zmm21,%zmm21 + vpaddd %zmm18,%zmm22,%zmm22 + vpaddd %zmm19,%zmm23,%zmm23 + vpaddd %zmm20,%zmm24,%zmm24 + + vpbroadcastq 32(%rcx),%zmm30 + + vpsrlq $52,%zmm7,%zmm9 + vpsllq $12,%zmm6,%zmm10 + vporq %zmm10,%zmm9,%zmm9 + vpsrlq $26,%zmm7,%zmm8 + vpsrlq $14,%zmm6,%zmm10 + vpsrlq $40,%zmm6,%zmm6 + vpandq %zmm5,%zmm9,%zmm9 + vpandq %zmm5,%zmm7,%zmm7 + + + + + vpaddq %zmm2,%zmm9,%zmm2 + subq $192,%rdx + jbe .Ltail_avx512 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + vpmuludq %zmm2,%zmm17,%zmm14 + vpaddq %zmm0,%zmm7,%zmm0 + vpmuludq %zmm2,%zmm18,%zmm15 + vpandq %zmm5,%zmm8,%zmm8 + vpmuludq %zmm2,%zmm23,%zmm11 + vpandq %zmm5,%zmm10,%zmm10 + vpmuludq %zmm2,%zmm24,%zmm12 + vporq %zmm30,%zmm6,%zmm6 + vpmuludq %zmm2,%zmm16,%zmm13 + vpaddq %zmm1,%zmm8,%zmm1 + vpaddq %zmm3,%zmm10,%zmm3 + vpaddq %zmm4,%zmm6,%zmm4 + + vmovdqu64 0(%rsi),%zmm10 + vmovdqu64 64(%rsi),%zmm6 + leaq 128(%rsi),%rsi + vpmuludq %zmm0,%zmm19,%zmm28 + vpmuludq %zmm0,%zmm20,%zmm29 + vpmuludq %zmm0,%zmm16,%zmm25 + vpmuludq %zmm0,%zmm17,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq 
%zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + + vpmuludq %zmm1,%zmm18,%zmm28 + vpmuludq %zmm1,%zmm19,%zmm29 + vpmuludq %zmm1,%zmm24,%zmm25 + vpmuludq %zmm0,%zmm18,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm27,%zmm13,%zmm13 + + vpunpcklqdq %zmm6,%zmm10,%zmm7 + vpunpckhqdq %zmm6,%zmm10,%zmm6 + + vpmuludq %zmm3,%zmm16,%zmm28 + vpmuludq %zmm3,%zmm17,%zmm29 + vpmuludq %zmm1,%zmm16,%zmm26 + vpmuludq %zmm1,%zmm17,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm24,%zmm28 + vpmuludq %zmm4,%zmm16,%zmm29 + vpmuludq %zmm3,%zmm22,%zmm25 + vpmuludq %zmm3,%zmm23,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpmuludq %zmm3,%zmm24,%zmm27 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm21,%zmm25 + vpmuludq %zmm4,%zmm22,%zmm26 + vpmuludq %zmm4,%zmm23,%zmm27 + vpaddq %zmm25,%zmm11,%zmm0 + vpaddq %zmm26,%zmm12,%zmm1 + vpaddq %zmm27,%zmm13,%zmm2 + + + + + vpsrlq $52,%zmm7,%zmm9 + vpsllq $12,%zmm6,%zmm10 + + vpsrlq $26,%zmm14,%zmm3 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm3,%zmm15,%zmm4 + + vporq %zmm10,%zmm9,%zmm9 + + vpsrlq $26,%zmm0,%zmm11 + vpandq %zmm5,%zmm0,%zmm0 + vpaddq %zmm11,%zmm1,%zmm1 + + vpandq %zmm5,%zmm9,%zmm9 + + vpsrlq $26,%zmm4,%zmm15 + vpandq %zmm5,%zmm4,%zmm4 + + vpsrlq $26,%zmm1,%zmm12 + vpandq %zmm5,%zmm1,%zmm1 + vpaddq %zmm12,%zmm2,%zmm2 + + vpaddq %zmm15,%zmm0,%zmm0 + vpsllq $2,%zmm15,%zmm15 + vpaddq %zmm15,%zmm0,%zmm0 + + vpaddq %zmm9,%zmm2,%zmm2 + vpsrlq $26,%zmm7,%zmm8 + + vpsrlq $26,%zmm2,%zmm13 + vpandq %zmm5,%zmm2,%zmm2 + vpaddq %zmm13,%zmm14,%zmm3 + + vpsrlq $14,%zmm6,%zmm10 + + vpsrlq $26,%zmm0,%zmm11 + vpandq %zmm5,%zmm0,%zmm0 + vpaddq %zmm11,%zmm1,%zmm1 + + vpsrlq $40,%zmm6,%zmm6 + + vpsrlq $26,%zmm3,%zmm14 + vpandq %zmm5,%zmm3,%zmm3 + vpaddq %zmm14,%zmm4,%zmm4 + + vpandq %zmm5,%zmm7,%zmm7 + + + + + subq $128,%rdx + ja .Loop_avx512 + +.Ltail_avx512: + + + + + + vpsrlq $32,%zmm16,%zmm16 + vpsrlq $32,%zmm17,%zmm17 + vpsrlq $32,%zmm18,%zmm18 + vpsrlq $32,%zmm23,%zmm23 + vpsrlq $32,%zmm24,%zmm24 + vpsrlq $32,%zmm19,%zmm19 + vpsrlq $32,%zmm20,%zmm20 + vpsrlq $32,%zmm21,%zmm21 + vpsrlq $32,%zmm22,%zmm22 + + + + leaq (%rsi,%rdx,1),%rsi + + + vpaddq %zmm0,%zmm7,%zmm0 + + vpmuludq %zmm2,%zmm17,%zmm14 + vpmuludq %zmm2,%zmm18,%zmm15 + vpmuludq %zmm2,%zmm23,%zmm11 + vpandq %zmm5,%zmm8,%zmm8 + vpmuludq %zmm2,%zmm24,%zmm12 + vpandq %zmm5,%zmm10,%zmm10 + vpmuludq %zmm2,%zmm16,%zmm13 + vporq %zmm30,%zmm6,%zmm6 + vpaddq %zmm1,%zmm8,%zmm1 + vpaddq %zmm3,%zmm10,%zmm3 + vpaddq %zmm4,%zmm6,%zmm4 + + vmovdqu 0(%rsi),%xmm7 + vpmuludq %zmm0,%zmm19,%zmm28 + vpmuludq %zmm0,%zmm20,%zmm29 + vpmuludq %zmm0,%zmm16,%zmm25 + vpmuludq %zmm0,%zmm17,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + + vmovdqu 16(%rsi),%xmm8 + vpmuludq %zmm1,%zmm18,%zmm28 + vpmuludq %zmm1,%zmm19,%zmm29 + vpmuludq %zmm1,%zmm24,%zmm25 + vpmuludq %zmm0,%zmm18,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm27,%zmm13,%zmm13 + + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vpmuludq %zmm3,%zmm16,%zmm28 + vpmuludq %zmm3,%zmm17,%zmm29 + vpmuludq %zmm1,%zmm16,%zmm26 + vpmuludq %zmm1,%zmm17,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vinserti128 
$1,48(%rsi),%ymm8,%ymm8 + vpmuludq %zmm4,%zmm24,%zmm28 + vpmuludq %zmm4,%zmm16,%zmm29 + vpmuludq %zmm3,%zmm22,%zmm25 + vpmuludq %zmm3,%zmm23,%zmm26 + vpmuludq %zmm3,%zmm24,%zmm27 + vpaddq %zmm28,%zmm14,%zmm3 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm21,%zmm25 + vpmuludq %zmm4,%zmm22,%zmm26 + vpmuludq %zmm4,%zmm23,%zmm27 + vpaddq %zmm25,%zmm11,%zmm0 + vpaddq %zmm26,%zmm12,%zmm1 + vpaddq %zmm27,%zmm13,%zmm2 + + + + + movl $1,%eax + vpermq $0xb1,%zmm3,%zmm14 + vpermq $0xb1,%zmm15,%zmm4 + vpermq $0xb1,%zmm0,%zmm11 + vpermq $0xb1,%zmm1,%zmm12 + vpermq $0xb1,%zmm2,%zmm13 + vpaddq %zmm14,%zmm3,%zmm3 + vpaddq %zmm15,%zmm4,%zmm4 + vpaddq %zmm11,%zmm0,%zmm0 + vpaddq %zmm12,%zmm1,%zmm1 + vpaddq %zmm13,%zmm2,%zmm2 + + kmovw %eax,%k3 + vpermq $0x2,%zmm3,%zmm14 + vpermq $0x2,%zmm4,%zmm15 + vpermq $0x2,%zmm0,%zmm11 + vpermq $0x2,%zmm1,%zmm12 + vpermq $0x2,%zmm2,%zmm13 + vpaddq %zmm14,%zmm3,%zmm3 + vpaddq %zmm15,%zmm4,%zmm4 + vpaddq %zmm11,%zmm0,%zmm0 + vpaddq %zmm12,%zmm1,%zmm1 + vpaddq %zmm13,%zmm2,%zmm2 + + vextracti64x4 $0x1,%zmm3,%ymm14 + vextracti64x4 $0x1,%zmm4,%ymm15 + vextracti64x4 $0x1,%zmm0,%ymm11 + vextracti64x4 $0x1,%zmm1,%ymm12 + vextracti64x4 $0x1,%zmm2,%ymm13 + vpaddq %zmm14,%zmm3,%zmm3{%k3}{z} + vpaddq %zmm15,%zmm4,%zmm4{%k3}{z} + vpaddq %zmm11,%zmm0,%zmm0{%k3}{z} + vpaddq %zmm12,%zmm1,%zmm1{%k3}{z} + vpaddq %zmm13,%zmm2,%zmm2{%k3}{z} + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm2,%ymm9,%ymm2 + vpand %ymm5,%ymm8,%ymm8 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + vpaddq %ymm14,%ymm4,%ymm4 + + leaq 144(%rsp),%rax + addq $64,%rdx + jnz .Ltail_avx2_512 + + vpsubq %ymm9,%ymm2,%ymm2 + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + vzeroall + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + ret +.cfi_endproc +.size poly1305_blocks_avx512,.-poly1305_blocks_avx512 diff --git a/crypto/poly1305_x64_gas_macosx.s b/crypto/poly1305_x64_gas_macosx.s new file mode 100644 index 0000000..473b9f0 --- /dev/null +++ b/crypto/poly1305_x64_gas_macosx.s @@ -0,0 +1,1916 @@ +.p2align 6 +L$const: +L$mask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +L$129: +.long 16777216,0,16777216,0,16777216,0,16777216,0 +L$mask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +L$permd_avx2: +.long 2,2,2,3,2,0,2,1 +L$permd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +L$2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +L$2_44_inp_shift: +.quad 0,12,24,64 +L$2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +L$2_44_shift_rgt: +.quad 44,44,42,64 
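+# Constant pool layout (a descriptive note; the values themselves are from
+# the code above/below): L$mask26 = 2^26-1 strips each limb in the 26-bit
+# vector paths, and L$129 = 2^24 is the per-block padding bit that gets
+# OR-ed into the top limb (bit 128 of the 130-bit block value lands at bit
+# 24 of limb 4, hence the "vpor 32(%rcx)" uses). The L$2_44_* entries hold
+# the masks and shift counts for the alternative base 2^44 representation
+# (44/44/42-bit limbs, masks 2^44-1 and 2^42-1).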
+L$2_44_shift_lft: +.quad 8,8,10,64 + +.p2align 6 +L$x_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +L$x_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +.text + + +.global _poly1305_init_x86_64 +.global _poly1305_blocks_x86_64 +.global _poly1305_emit_x86_64 +.global _poly1305_emit_avx +.global _poly1305_blocks_avx +.global _poly1305_blocks_avx2 +.global _poly1305_blocks_avx512 + + + +.p2align 5 +_poly1305_init_x86_64: + xorq %rax,%rax + movq %rax,0(%rdi) + movq %rax,8(%rdi) + movq %rax,16(%rdi) + + cmpq $0,%rsi + je L$no_key + + + + movq $0x0ffffffc0fffffff,%rax + movq $0x0ffffffc0ffffffc,%rcx + andq 0(%rsi),%rax + andq 8(%rsi),%rcx + movq %rax,24(%rdi) + movq %rcx,32(%rdi) + movl $1,%eax +L$no_key: + ret + + + +.p2align 5 +_poly1305_blocks_x86_64: + +L$blocks: + shrq $4,%rdx + jz L$no_data + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$blocks_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movq 16(%rdi),%rbp + + movq %r13,%r12 + shrq $2,%r13 + movq %r12,%rax + addq %r12,%r13 + jmp L$oop + +.p2align 5 +L$oop: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + movq %r12,%rax + decq %r15 + jnz L$oop + + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rsp + +L$no_data: +L$blocks_epilogue: + ret + + + + +.p2align 5 +_poly1305_emit_x86_64: +L$emit: + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movq 16(%rdi),%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret + + +.p2align 5 +__poly1305_block: + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + ret + + + +.p2align 5 +__poly1305_init_avx: + movq %r11,%r14 + movq %r12,%rbx + xorq %rbp,%rbp + + leaq 48+64(%rdi),%rdi + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + movq %r14,%r8 + andl %r14d,%eax + movq %r11,%r9 + andl %r11d,%edx + movl %eax,-64(%rdi) + shrq $26,%r8 + movl %edx,-60(%rdi) + shrq $26,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + andl %r8d,%eax + andl %r9d,%edx + movl 
%eax,-48(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-44(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,-32(%rdi) + shrq $26,%r8 + movl %edx,-28(%rdi) + shrq $26,%r9 + + movq %rbx,%rax + movq %r12,%rdx + shlq $12,%rax + shlq $12,%rdx + orq %r8,%rax + orq %r9,%rdx + andl $0x3ffffff,%eax + andl $0x3ffffff,%edx + movl %eax,-16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-12(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,0(%rdi) + movq %rbx,%r8 + movl %edx,4(%rdi) + movq %r12,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + shrq $14,%r8 + shrq $14,%r9 + andl %r8d,%eax + andl %r9d,%edx + movl %eax,16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,20(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,32(%rdi) + shrq $26,%r8 + movl %edx,36(%rdi) + shrq $26,%r9 + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,48(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r9d,52(%rdi) + leaq (%r9,%r9,4),%r9 + movl %r8d,64(%rdi) + movl %r9d,68(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-52(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-36(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-20(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-4(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,12(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,28(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,44(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,60(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,76(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-56(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-40(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-24(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-8(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,8(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,24(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,40(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,56(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,72(%rdi) + + leaq -48-64(%rdi),%rdi + ret + + + +.p2align 5 +_poly1305_blocks_avx: + + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae L$blocks_avx + testl %r8d,%r8d + jz L$blocks + +L$blocks_avx: + andq $-16,%rdx + jz L$no_data_avx + + vzeroupper + + testl %r8d,%r8d + jz L$base2_64_avx + + testq $31,%rdx + jz L$even_avx + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$blocks_avx_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + + call __poly1305_block + + testq %rcx,%rcx + jz L$store_base2_64_avx + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq 
%rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + subq $16,%r15 + jz L$store_base2_26_avx + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp L$proceed_avx + +.p2align 5 +L$store_base2_64_avx: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp L$done_avx + +.p2align 4 +L$store_base2_26_avx: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.p2align 4 +L$done_avx: + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rsp + +L$no_data_avx: +L$blocks_avx_epilogue: + ret + + +.p2align 5 +L$base2_64_avx: + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$base2_64_avx_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $31,%rdx + jz L$init_avx + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + +L$init_avx: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +L$proceed_avx: + movq %r15,%rdx + + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp + +L$base2_64_avx_epilogue: + jmp L$do_avx + + +.p2align 5 +L$even_avx: + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +L$do_avx: + leaq -88(%rsp),%r11 + + subq $0x178,%rsp + subq $64,%rdx + leaq -32(%rsi),%rax + cmovcq %rax,%rsi + + vmovdqu 48(%rdi),%xmm14 + leaq 112(%rdi),%rdi + leaq L$const(%rip),%rcx + + + + vmovdqu 32(%rsi),%xmm5 + vmovdqu 48(%rsi),%xmm6 + vmovdqa 64(%rcx),%xmm15 + + vpsrldq $6,%xmm5,%xmm7 + vpsrldq $6,%xmm6,%xmm8 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + vpsrlq $40,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + jbe L$skip_loop_avx + + + vmovdqu -48(%rdi),%xmm11 + vmovdqu -32(%rdi),%xmm12 + vpshufd $0xEE,%xmm14,%xmm13 + vpshufd $0x44,%xmm14,%xmm10 + vmovdqa %xmm13,-144(%r11) + vmovdqa %xmm10,0(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vmovdqu -16(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-128(%r11) + vmovdqa %xmm11,16(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqu 0(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-112(%r11) + vmovdqa %xmm12,32(%rsp) + vpshufd $0xEE,%xmm10,%xmm14 + vmovdqu 16(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm14,-96(%r11) + vmovdqa %xmm10,48(%rsp) + vpshufd $0xEE,%xmm11,%xmm13 + vmovdqu 32(%rdi),%xmm10 + 
vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm13,-80(%r11) + vmovdqa %xmm11,64(%rsp) + vpshufd $0xEE,%xmm12,%xmm14 + vmovdqu 48(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm14,-64(%r11) + vmovdqa %xmm12,80(%rsp) + vpshufd $0xEE,%xmm10,%xmm13 + vmovdqu 64(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm13,-48(%r11) + vmovdqa %xmm10,96(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-32(%r11) + vmovdqa %xmm11,112(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqa 0(%rsp),%xmm14 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-16(%r11) + vmovdqa %xmm12,128(%rsp) + + jmp L$oop_avx + +.p2align 5 +L$oop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq %xmm5,%xmm14,%xmm10 + vpmuludq %xmm6,%xmm14,%xmm11 + vmovdqa %xmm2,32(%r11) + vpmuludq %xmm7,%xmm14,%xmm12 + vmovdqa 16(%rsp),%xmm2 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vmovdqa %xmm0,0(%r11) + vpmuludq 32(%rsp),%xmm9,%xmm0 + vmovdqa %xmm1,16(%r11) + vpmuludq %xmm8,%xmm2,%xmm1 + vpaddq %xmm0,%xmm10,%xmm10 + vpaddq %xmm1,%xmm14,%xmm14 + vmovdqa %xmm3,48(%r11) + vpmuludq %xmm7,%xmm2,%xmm0 + vpmuludq %xmm6,%xmm2,%xmm1 + vpaddq %xmm0,%xmm13,%xmm13 + vmovdqa 48(%rsp),%xmm3 + vpaddq %xmm1,%xmm12,%xmm12 + vmovdqa %xmm4,64(%r11) + vpmuludq %xmm5,%xmm2,%xmm2 + vpmuludq %xmm7,%xmm3,%xmm0 + vpaddq %xmm2,%xmm11,%xmm11 + + vmovdqa 64(%rsp),%xmm4 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm3,%xmm1 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm1,%xmm13,%xmm13 + vmovdqa 80(%rsp),%xmm2 + vpaddq %xmm3,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm4,%xmm0 + vpmuludq %xmm8,%xmm4,%xmm4 + vpaddq %xmm0,%xmm11,%xmm11 + vmovdqa 96(%rsp),%xmm3 + vpaddq %xmm4,%xmm10,%xmm10 + + vmovdqa 128(%rsp),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm1 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm1,%xmm14,%xmm14 + vpaddq %xmm2,%xmm13,%xmm13 + vpmuludq %xmm9,%xmm3,%xmm0 + vpmuludq %xmm8,%xmm3,%xmm1 + vpaddq %xmm0,%xmm12,%xmm12 + vmovdqu 0(%rsi),%xmm0 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm3,%xmm3 + vpmuludq %xmm7,%xmm4,%xmm7 + vpaddq %xmm3,%xmm10,%xmm10 + + vmovdqu 16(%rsi),%xmm1 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm8,%xmm4,%xmm8 + vpmuludq %xmm9,%xmm4,%xmm9 + vpsrldq $6,%xmm0,%xmm2 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm9,%xmm13,%xmm13 + vpsrldq $6,%xmm1,%xmm3 + vpmuludq 112(%rsp),%xmm5,%xmm9 + vpmuludq %xmm6,%xmm4,%xmm5 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpaddq %xmm9,%xmm14,%xmm14 + vmovdqa -144(%r11),%xmm9 + vpaddq %xmm5,%xmm10,%xmm10 + + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + + vpsrldq $5,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpand 0(%rcx),%xmm4,%xmm4 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + leaq 32(%rsi),%rax + leaq 64(%rsi),%rsi + subq $64,%rdx + cmovcq %rax,%rsi + + + + + + + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vmovdqa -128(%r11),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm5 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm5,%xmm12,%xmm12 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpmuludq -112(%r11),%xmm4,%xmm5 + vpaddq %xmm9,%xmm14,%xmm14 + + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm2,%xmm7,%xmm6 + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -96(%r11),%xmm8 + vpaddq 
%xmm5,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm7,%xmm6 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm6,%xmm12,%xmm12 + vpaddq %xmm7,%xmm11,%xmm11 + + vmovdqa -80(%r11),%xmm9 + vpmuludq %xmm2,%xmm8,%xmm5 + vpmuludq %xmm1,%xmm8,%xmm6 + vpaddq %xmm5,%xmm14,%xmm14 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -64(%r11),%xmm7 + vpmuludq %xmm0,%xmm8,%xmm8 + vpmuludq %xmm4,%xmm9,%xmm5 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm5,%xmm11,%xmm11 + vmovdqa -48(%r11),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm9 + vpmuludq %xmm1,%xmm7,%xmm6 + vpaddq %xmm9,%xmm10,%xmm10 + + vmovdqa -16(%r11),%xmm9 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm7,%xmm7 + vpmuludq %xmm4,%xmm8,%xmm5 + vpaddq %xmm7,%xmm13,%xmm13 + vpaddq %xmm5,%xmm12,%xmm12 + vmovdqu 32(%rsi),%xmm5 + vpmuludq %xmm3,%xmm8,%xmm7 + vpmuludq %xmm2,%xmm8,%xmm8 + vpaddq %xmm7,%xmm11,%xmm11 + vmovdqu 48(%rsi),%xmm6 + vpaddq %xmm8,%xmm10,%xmm10 + + vpmuludq %xmm2,%xmm9,%xmm2 + vpmuludq %xmm3,%xmm9,%xmm3 + vpsrldq $6,%xmm5,%xmm7 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm9,%xmm4 + vpsrldq $6,%xmm6,%xmm8 + vpaddq %xmm3,%xmm12,%xmm2 + vpaddq %xmm4,%xmm13,%xmm3 + vpmuludq -32(%r11),%xmm0,%xmm4 + vpmuludq %xmm1,%xmm9,%xmm0 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpaddq %xmm4,%xmm14,%xmm4 + vpaddq %xmm0,%xmm10,%xmm0 + + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + + vpsrldq $5,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vmovdqa 0(%rsp),%xmm14 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpand 0(%rcx),%xmm9,%xmm9 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + + + + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm11,%xmm1 + + vpsrlq $26,%xmm4,%xmm10 + vpand %xmm15,%xmm4,%xmm4 + + vpsrlq $26,%xmm1,%xmm11 + vpand %xmm15,%xmm1,%xmm1 + vpaddq %xmm11,%xmm2,%xmm2 + + vpaddq %xmm10,%xmm0,%xmm0 + vpsllq $2,%xmm10,%xmm10 + vpaddq %xmm10,%xmm0,%xmm0 + + vpsrlq $26,%xmm2,%xmm12 + vpand %xmm15,%xmm2,%xmm2 + vpaddq %xmm12,%xmm3,%xmm3 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm1,%xmm1 + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + ja L$oop_avx + +L$skip_loop_avx: + + + + vpshufd $0x10,%xmm14,%xmm14 + addq $32,%rdx + jnz L$ong_tail_avx + + vpaddq %xmm2,%xmm7,%xmm7 + vpaddq %xmm0,%xmm5,%xmm5 + vpaddq %xmm1,%xmm6,%xmm6 + vpaddq %xmm3,%xmm8,%xmm8 + vpaddq %xmm4,%xmm9,%xmm9 + +L$ong_tail_avx: + vmovdqa %xmm2,32(%r11) + vmovdqa %xmm0,0(%r11) + vmovdqa %xmm1,16(%r11) + vmovdqa %xmm3,48(%r11) + vmovdqa %xmm4,64(%r11) + + + + + + + + vpmuludq %xmm7,%xmm14,%xmm12 + vpmuludq %xmm5,%xmm14,%xmm10 + vpshufd $0x10,-48(%rdi),%xmm2 + vpmuludq %xmm6,%xmm14,%xmm11 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm8,%xmm2,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpshufd $0x10,-32(%rdi),%xmm3 + vpmuludq %xmm7,%xmm2,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpshufd $0x10,-16(%rdi),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm9,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + vpshufd $0x10,0(%rdi),%xmm2 + vpmuludq %xmm7,%xmm4,%xmm1 + vpaddq %xmm1,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm4,%xmm0 + vpaddq %xmm0,%xmm13,%xmm13 + vpshufd $0x10,16(%rdi),%xmm3 + vpmuludq %xmm5,%xmm4,%xmm4 + vpaddq %xmm4,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm2,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpshufd $0x10,32(%rdi),%xmm4 + vpmuludq %xmm8,%xmm2,%xmm2 
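+# Tail accumulation (descriptive note): each vpshufd $0x10 load here picks
+# one 26-bit limb of a stored power of r (the 5x multiples precomputed for
+# the 2^130-5 wraparound sit alongside), and the vpmuludq/vpaddq pairs fold
+# the 64-bit partial products into the five column sums xmm10..xmm14 before
+# the carry chain below reduces them back to 26-bit limbs.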
+ vpaddq %xmm2,%xmm10,%xmm10 + + vpmuludq %xmm6,%xmm3,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm3,%xmm13,%xmm13 + vpshufd $0x10,48(%rdi),%xmm2 + vpmuludq %xmm9,%xmm4,%xmm1 + vpaddq %xmm1,%xmm12,%xmm12 + vpshufd $0x10,64(%rdi),%xmm3 + vpmuludq %xmm8,%xmm4,%xmm0 + vpaddq %xmm0,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm14,%xmm14 + vpmuludq %xmm9,%xmm3,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpmuludq %xmm8,%xmm3,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm7,%xmm3,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm6,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + jz L$short_tail_avx + + vmovdqu 0(%rsi),%xmm0 + vmovdqu 16(%rsi),%xmm1 + + vpsrldq $6,%xmm0,%xmm2 + vpsrldq $6,%xmm1,%xmm3 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + vpsrlq $40,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpshufd $0x32,-64(%rdi),%xmm9 + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpshufd $0x32,-48(%rdi),%xmm7 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpaddq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpshufd $0x32,-32(%rdi),%xmm8 + vpmuludq %xmm2,%xmm7,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpshufd $0x32,-16(%rdi),%xmm9 + vpmuludq %xmm1,%xmm7,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + + vpshufd $0x32,0(%rdi),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm6 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm9,%xmm5 + vpaddq %xmm5,%xmm13,%xmm13 + vpshufd $0x32,16(%rdi),%xmm8 + vpmuludq %xmm0,%xmm9,%xmm9 + vpaddq %xmm9,%xmm12,%xmm12 + vpmuludq %xmm4,%xmm7,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpshufd $0x32,32(%rdi),%xmm9 + vpmuludq %xmm3,%xmm7,%xmm7 + vpaddq %xmm7,%xmm10,%xmm10 + + vpmuludq %xmm1,%xmm8,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm8,%xmm8 + vpaddq %xmm8,%xmm13,%xmm13 + vpshufd $0x32,48(%rdi),%xmm7 + vpmuludq %xmm4,%xmm9,%xmm6 + vpaddq %xmm6,%xmm12,%xmm12 + vpshufd $0x32,64(%rdi),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm5 + vpaddq %xmm5,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm9 + vpaddq %xmm9,%xmm10,%xmm10 + + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm14,%xmm14 + vpmuludq %xmm4,%xmm8,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm3,%xmm8,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm2,%xmm8,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm1,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + +L$short_tail_avx: + + + + vpsrldq $8,%xmm14,%xmm9 + vpsrldq $8,%xmm13,%xmm8 + vpsrldq $8,%xmm11,%xmm6 + vpsrldq $8,%xmm10,%xmm5 + vpsrldq $8,%xmm12,%xmm7 + vpaddq %xmm8,%xmm13,%xmm13 + vpaddq %xmm9,%xmm14,%xmm14 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vpaddq %xmm7,%xmm12,%xmm12 + + + + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm14,%xmm4 + vpand 
%xmm15,%xmm14,%xmm14 + + vpsrlq $26,%xmm11,%xmm1 + vpand %xmm15,%xmm11,%xmm11 + vpaddq %xmm1,%xmm12,%xmm12 + + vpaddq %xmm4,%xmm10,%xmm10 + vpsllq $2,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpsrlq $26,%xmm12,%xmm2 + vpand %xmm15,%xmm12,%xmm12 + vpaddq %xmm2,%xmm13,%xmm13 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vmovd %xmm10,-112(%rdi) + vmovd %xmm11,-108(%rdi) + vmovd %xmm12,-104(%rdi) + vmovd %xmm13,-100(%rdi) + vmovd %xmm14,-96(%rdi) + leaq 88(%r11),%rsp + + vzeroupper + ret + + + + +.p2align 5 +_poly1305_emit_avx: + cmpl $0,20(%rdi) + je L$emit + + movl 0(%rdi),%eax + movl 4(%rdi),%ecx + movl 8(%rdi),%r8d + movl 12(%rdi),%r11d + movl 16(%rdi),%r10d + + shlq $26,%rcx + movq %r8,%r9 + shlq $52,%r8 + addq %rcx,%rax + shrq $12,%r9 + addq %rax,%r8 + adcq $0,%r9 + + shlq $14,%r11 + movq %r10,%rax + shrq $24,%r10 + addq %r11,%r9 + shlq $40,%rax + addq %rax,%r9 + adcq $0,%r10 + + movq %r10,%rax + movq %r10,%rcx + andq $3,%r10 + shrq $2,%rax + andq $-4,%rcx + addq %rcx,%rax + addq %rax,%r8 + adcq $0,%r9 + adcq $0,%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret + + +.p2align 5 +_poly1305_blocks_avx2: + + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae L$blocks_avx2 + testl %r8d,%r8d + jz L$blocks + +L$blocks_avx2: + andq $-16,%rdx + jz L$no_data_avx2 + + vzeroupper + + testl %r8d,%r8d + jz L$base2_64_avx2 + + testq $63,%rdx + jz L$even_avx2 + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$blocks_avx2_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +L$base2_26_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz L$base2_26_pre_avx2 + + testq %rcx,%rcx + jz L$store_base2_64_avx2 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz L$store_base2_26_avx2 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp L$proceed_avx2 + +.p2align 5 +L$store_base2_64_avx2: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp L$done_avx2 + +.p2align 4 +L$store_base2_26_avx2: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.p2align 4 +L$done_avx2: + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 
48(%rsp),%rsp + +L$no_data_avx2: +L$blocks_avx2_epilogue: + ret + + +.p2align 5 +L$base2_64_avx2: + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$base2_64_avx2_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $63,%rdx + jz L$init_avx2 + +L$base2_64_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz L$base2_64_pre_avx2 + +L$init_avx2: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +L$proceed_avx2: + movq %r15,%rdx + + + + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp + +L$base2_64_avx2_epilogue: + jmp L$do_avx2 + + +.p2align 5 +L$even_avx2: + + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +L$do_avx2: + leaq -8(%rsp),%r11 + + subq $0x128,%rsp + leaq L$const(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd %ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz L$tail_avx2 + jmp L$oop_avx2 + +.p2align 5 +L$oop_avx2: + + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq 
%ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz L$oop_avx2 + +.byte 0x66,0x90 +L$tail_avx2: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + 
vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp + + vzeroupper + ret + + diff --git a/crypto/poly1305_x64_nasm.asm b/crypto/poly1305_x64_nasm.asm new file mode 100644 index 0000000..4f9d9f5 --- /dev/null +++ b/crypto/poly1305_x64_nasm.asm @@ -0,0 +1,3487 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +ALIGN 64 +$L$const: +$L$mask24: + DD 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +$L$129: + DD 16777216,0,16777216,0,16777216,0,16777216,0 +$L$mask26: + DD 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +$L$permd_avx2: + DD 2,2,2,3,2,0,2,1 +$L$permd_avx512: + DD 0,0,0,1,0,2,0,3,0,4,0,5,0,6,0,7 + +$L$2_44_inp_permd: + DD 0,1,1,2,2,3,7,7 +$L$2_44_inp_shift: + DQ 0,12,24,64 +$L$2_44_mask: + DQ 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +$L$2_44_shift_rgt: + DQ 44,44,42,64 +$L$2_44_shift_lft: + DQ 8,8,10,64 + +ALIGN 64 +$L$x_mask44: + DQ 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff + DQ 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +$L$x_mask42: + DQ 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + DQ 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +section .text code align=64 + + + +global poly1305_init_x86_64 +global poly1305_blocks_x86_64 +global poly1305_emit_x86_64 +global poly1305_emit_avx +global poly1305_blocks_avx +global poly1305_blocks_avx2 +global poly1305_blocks_avx512 + + + +ALIGN 32 +poly1305_init_x86_64: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_init_x86_64: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + + + xor rax,rax + mov QWORD[rdi],rax + mov QWORD[8+rdi],rax + mov QWORD[16+rdi],rax + + cmp 
rsi,0 + je NEAR $L$no_key + + + + mov rax,0x0ffffffc0fffffff + mov rcx,0x0ffffffc0ffffffc + and rax,QWORD[rsi] + and rcx,QWORD[8+rsi] + mov QWORD[24+rdi],rax + mov QWORD[32+rdi],rcx + mov eax,1 +$L$no_key: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret +$L$SEH_end_poly1305_init_x86_64: + + +ALIGN 32 +poly1305_blocks_x86_64: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_x86_64: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + +$L$blocks: + shr rdx,4 + jz NEAR $L$no_data + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_body: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov rbp,QWORD[16+rdi] + + mov r12,r13 + shr r13,2 + mov rax,r12 + add r13,r12 + jmp NEAR $L$oop + +ALIGN 32 +$L$oop: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + mul r14 + mov r9,rax + mov rax,r11 + mov r10,rdx + + mul r14 + mov r14,rax + mov rax,r11 + mov r8,rdx + + mul rbx + add r9,rax + mov rax,r13 + adc r10,rdx + + mul rbx + mov rbx,rbp + add r14,rax + adc r8,rdx + + imul rbx,r13 + add r9,rbx + mov rbx,r8 + adc r10,0 + + imul rbp,r11 + add rbx,r9 + mov rax,-4 + adc r10,rbp + + and rax,r10 + mov rbp,r10 + shr r10,2 + and rbp,3 + add rax,r10 + add r14,rax + adc rbx,0 + adc rbp,0 + mov rax,r12 + dec r15 + jnz NEAR $L$oop + + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data: +$L$blocks_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_x86_64: + + +ALIGN 32 +poly1305_emit_x86_64: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_emit_x86_64: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + + +$L$emit: + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov r10,QWORD[16+rdi] + + mov rax,r8 + add r8,5 + mov rcx,r9 + adc r9,0 + adc r10,0 + shr r10,2 + cmovnz rax,r8 + cmovnz rcx,r9 + + add rax,QWORD[rdx] + adc rcx,QWORD[8+rdx] + mov QWORD[rsi],rax + mov QWORD[8+rsi],rcx + + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret +$L$SEH_end_poly1305_emit_x86_64: + +ALIGN 32 +__poly1305_block: + mul r14 + mov r9,rax + mov rax,r11 + mov r10,rdx + + mul r14 + mov r14,rax + mov rax,r11 + mov r8,rdx + + mul rbx + add r9,rax + mov rax,r13 + adc r10,rdx + + mul rbx + mov rbx,rbp + add r14,rax + adc r8,rdx + + imul rbx,r13 + add r9,rbx + mov rbx,r8 + adc r10,0 + + imul rbp,r11 + add rbx,r9 + mov rax,-4 + adc r10,rbp + + and rax,r10 + mov rbp,r10 + shr r10,2 + and rbp,3 + add rax,r10 + add r14,rax + adc rbx,0 + adc rbp,0 + DB 0F3h,0C3h ;repret + + + +ALIGN 32 +__poly1305_init_avx: + mov r14,r11 + mov rbx,r12 + xor rbp,rbp + + lea rdi,[((48+64))+rdi] + + mov rax,r12 + call __poly1305_block + + mov eax,0x3ffffff + mov edx,0x3ffffff + mov r8,r14 + and eax,r14d + mov r9,r11 + and edx,r11d + mov DWORD[((-64))+rdi],eax + shr r8,26 + mov DWORD[((-60))+rdi],edx + shr r9,26 + + mov eax,0x3ffffff + mov edx,0x3ffffff + and eax,r8d + and edx,r9d + mov DWORD[((-48))+rdi],eax + lea eax,[rax*4+rax] + mov DWORD[((-44))+rdi],edx + lea edx,[rdx*4+rdx] + mov DWORD[((-32))+rdi],eax + shr r8,26 + mov DWORD[((-28))+rdi],edx + shr r9,26 + + mov rax,rbx + mov rdx,r12 + shl 
rax,12 + shl rdx,12 + or rax,r8 + or rdx,r9 + and eax,0x3ffffff + and edx,0x3ffffff + mov DWORD[((-16))+rdi],eax + lea eax,[rax*4+rax] + mov DWORD[((-12))+rdi],edx + lea edx,[rdx*4+rdx] + mov DWORD[rdi],eax + mov r8,rbx + mov DWORD[4+rdi],edx + mov r9,r12 + + mov eax,0x3ffffff + mov edx,0x3ffffff + shr r8,14 + shr r9,14 + and eax,r8d + and edx,r9d + mov DWORD[16+rdi],eax + lea eax,[rax*4+rax] + mov DWORD[20+rdi],edx + lea edx,[rdx*4+rdx] + mov DWORD[32+rdi],eax + shr r8,26 + mov DWORD[36+rdi],edx + shr r9,26 + + mov rax,rbp + shl rax,24 + or r8,rax + mov DWORD[48+rdi],r8d + lea r8,[r8*4+r8] + mov DWORD[52+rdi],r9d + lea r9,[r9*4+r9] + mov DWORD[64+rdi],r8d + mov DWORD[68+rdi],r9d + + mov rax,r12 + call __poly1305_block + + mov eax,0x3ffffff + mov r8,r14 + and eax,r14d + shr r8,26 + mov DWORD[((-52))+rdi],eax + + mov edx,0x3ffffff + and edx,r8d + mov DWORD[((-36))+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[((-20))+rdi],edx + + mov rax,rbx + shl rax,12 + or rax,r8 + and eax,0x3ffffff + mov DWORD[((-4))+rdi],eax + lea eax,[rax*4+rax] + mov r8,rbx + mov DWORD[12+rdi],eax + + mov edx,0x3ffffff + shr r8,14 + and edx,r8d + mov DWORD[28+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[44+rdi],edx + + mov rax,rbp + shl rax,24 + or r8,rax + mov DWORD[60+rdi],r8d + lea r8,[r8*4+r8] + mov DWORD[76+rdi],r8d + + mov rax,r12 + call __poly1305_block + + mov eax,0x3ffffff + mov r8,r14 + and eax,r14d + shr r8,26 + mov DWORD[((-56))+rdi],eax + + mov edx,0x3ffffff + and edx,r8d + mov DWORD[((-40))+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[((-24))+rdi],edx + + mov rax,rbx + shl rax,12 + or rax,r8 + and eax,0x3ffffff + mov DWORD[((-8))+rdi],eax + lea eax,[rax*4+rax] + mov r8,rbx + mov DWORD[8+rdi],eax + + mov edx,0x3ffffff + shr r8,14 + and edx,r8d + mov DWORD[24+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[40+rdi],edx + + mov rax,rbp + shl rax,24 + or r8,rax + mov DWORD[56+rdi],r8d + lea r8,[r8*4+r8] + mov DWORD[72+rdi],r8d + + lea rdi,[((-48-64))+rdi] + DB 0F3h,0C3h ;repret + + + +ALIGN 32 +poly1305_blocks_avx: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_avx: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + + mov r8d,DWORD[20+rdi] + cmp rdx,128 + jae NEAR $L$blocks_avx + test r8d,r8d + jz NEAR $L$blocks + +$L$blocks_avx: + and rdx,-16 + jz NEAR $L$no_data_avx + + vzeroupper + + test r8d,r8d + jz NEAR $L$base2_64_avx + + test rdx,31 + jz NEAR $L$even_avx + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_avx_body: + + mov r15,rdx + + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + + mov r14d,r8d + and r8,-2147483648 + mov r12,r9 + mov ebx,r9d + and r9,-2147483648 + + shr r8,6 + shl r12,52 + add r14,r8 + shr rbx,12 + shr r9,18 + add r14,r12 + adc rbx,r9 + + mov r8,rbp + shl r8,40 + shr rbp,24 + add rbx,r8 + adc rbp,0 + + mov r9,-4 + mov r8,rbp + and r9,rbp + shr r8,2 + and rbp,3 + add r8,r9 + add r14,r8 + adc rbx,0 + adc rbp,0 + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + + call __poly1305_block + + test rcx,rcx + jz NEAR $L$store_base2_64_avx + + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r11,rbx + mov r12,rbx + shr rdx,26 + and rax,0x3ffffff + shl r11,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r11 + shl rbp,24 + and r14,0x3ffffff + shr r12,40 + and rbx,0x3ffffff + or rbp,r12 + + sub r15,16 + 
jz NEAR $L$store_base2_26_avx + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + jmp NEAR $L$proceed_avx + +ALIGN 32 +$L$store_base2_64_avx: + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + jmp NEAR $L$done_avx + +ALIGN 16 +$L$store_base2_26_avx: + mov DWORD[rdi],eax + mov DWORD[4+rdi],edx + mov DWORD[8+rdi],r14d + mov DWORD[12+rdi],ebx + mov DWORD[16+rdi],ebp +ALIGN 16 +$L$done_avx: + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data_avx: +$L$blocks_avx_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + + +ALIGN 32 +$L$base2_64_avx: + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$base2_64_avx_body: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + test rdx,31 + jz NEAR $L$init_avx + + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + +$L$init_avx: + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r8,rbx + mov r9,rbx + shr rdx,26 + and rax,0x3ffffff + shl r8,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r8 + shl rbp,24 + and r14,0x3ffffff + shr r9,40 + and rbx,0x3ffffff + or rbp,r9 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + mov DWORD[20+rdi],1 + + call __poly1305_init_avx + +$L$proceed_avx: + mov rdx,r15 + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rax,[48+rsp] + lea rsp,[48+rsp] + +$L$base2_64_avx_epilogue: + jmp NEAR $L$do_avx + + +ALIGN 32 +$L$even_avx: + + vmovd xmm0,DWORD[rdi] + vmovd xmm1,DWORD[4+rdi] + vmovd xmm2,DWORD[8+rdi] + vmovd xmm3,DWORD[12+rdi] + vmovd xmm4,DWORD[16+rdi] + +$L$do_avx: + lea r11,[((-248))+rsp] + sub rsp,0x218 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx_body: + sub rdx,64 + lea rax,[((-32))+rsi] + cmovc rsi,rax + + vmovdqu xmm14,XMMWORD[48+rdi] + lea rdi,[112+rdi] + lea rcx,[$L$const] + + + + vmovdqu xmm5,XMMWORD[32+rsi] + vmovdqu xmm6,XMMWORD[48+rsi] + vmovdqa xmm15,XMMWORD[64+rcx] + + vpsrldq xmm7,xmm5,6 + vpsrldq xmm8,xmm6,6 + vpunpckhqdq xmm9,xmm5,xmm6 + vpunpcklqdq xmm5,xmm5,xmm6 + vpunpcklqdq xmm8,xmm7,xmm8 + + vpsrlq xmm9,xmm9,40 + vpsrlq xmm6,xmm5,26 + vpand xmm5,xmm5,xmm15 + vpsrlq xmm7,xmm8,4 + vpand xmm6,xmm6,xmm15 + vpsrlq xmm8,xmm8,30 + vpand xmm7,xmm7,xmm15 + vpand xmm8,xmm8,xmm15 + vpor xmm9,xmm9,XMMWORD[32+rcx] + + jbe NEAR $L$skip_loop_avx + + + vmovdqu xmm11,XMMWORD[((-48))+rdi] + vmovdqu xmm12,XMMWORD[((-32))+rdi] + vpshufd xmm13,xmm14,0xEE + vpshufd xmm10,xmm14,0x44 + vmovdqa XMMWORD[(-144)+r11],xmm13 + vmovdqa XMMWORD[rsp],xmm10 + vpshufd xmm14,xmm11,0xEE + vmovdqu xmm10,XMMWORD[((-16))+rdi] + vpshufd xmm11,xmm11,0x44 + vmovdqa XMMWORD[(-128)+r11],xmm14 + vmovdqa XMMWORD[16+rsp],xmm11 + vpshufd xmm13,xmm12,0xEE + vmovdqu xmm11,XMMWORD[rdi] + vpshufd xmm12,xmm12,0x44 + vmovdqa XMMWORD[(-112)+r11],xmm13 + vmovdqa 
XMMWORD[32+rsp],xmm12 + vpshufd xmm14,xmm10,0xEE + vmovdqu xmm12,XMMWORD[16+rdi] + vpshufd xmm10,xmm10,0x44 + vmovdqa XMMWORD[(-96)+r11],xmm14 + vmovdqa XMMWORD[48+rsp],xmm10 + vpshufd xmm13,xmm11,0xEE + vmovdqu xmm10,XMMWORD[32+rdi] + vpshufd xmm11,xmm11,0x44 + vmovdqa XMMWORD[(-80)+r11],xmm13 + vmovdqa XMMWORD[64+rsp],xmm11 + vpshufd xmm14,xmm12,0xEE + vmovdqu xmm11,XMMWORD[48+rdi] + vpshufd xmm12,xmm12,0x44 + vmovdqa XMMWORD[(-64)+r11],xmm14 + vmovdqa XMMWORD[80+rsp],xmm12 + vpshufd xmm13,xmm10,0xEE + vmovdqu xmm12,XMMWORD[64+rdi] + vpshufd xmm10,xmm10,0x44 + vmovdqa XMMWORD[(-48)+r11],xmm13 + vmovdqa XMMWORD[96+rsp],xmm10 + vpshufd xmm14,xmm11,0xEE + vpshufd xmm11,xmm11,0x44 + vmovdqa XMMWORD[(-32)+r11],xmm14 + vmovdqa XMMWORD[112+rsp],xmm11 + vpshufd xmm13,xmm12,0xEE + vmovdqa xmm14,XMMWORD[rsp] + vpshufd xmm12,xmm12,0x44 + vmovdqa XMMWORD[(-16)+r11],xmm13 + vmovdqa XMMWORD[128+rsp],xmm12 + + jmp NEAR $L$oop_avx + +ALIGN 32 +$L$oop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq xmm10,xmm14,xmm5 + vpmuludq xmm11,xmm14,xmm6 + vmovdqa XMMWORD[32+r11],xmm2 + vpmuludq xmm12,xmm14,xmm7 + vmovdqa xmm2,XMMWORD[16+rsp] + vpmuludq xmm13,xmm14,xmm8 + vpmuludq xmm14,xmm14,xmm9 + + vmovdqa XMMWORD[r11],xmm0 + vpmuludq xmm0,xmm9,XMMWORD[32+rsp] + vmovdqa XMMWORD[16+r11],xmm1 + vpmuludq xmm1,xmm2,xmm8 + vpaddq xmm10,xmm10,xmm0 + vpaddq xmm14,xmm14,xmm1 + vmovdqa XMMWORD[48+r11],xmm3 + vpmuludq xmm0,xmm2,xmm7 + vpmuludq xmm1,xmm2,xmm6 + vpaddq xmm13,xmm13,xmm0 + vmovdqa xmm3,XMMWORD[48+rsp] + vpaddq xmm12,xmm12,xmm1 + vmovdqa XMMWORD[64+r11],xmm4 + vpmuludq xmm2,xmm2,xmm5 + vpmuludq xmm0,xmm3,xmm7 + vpaddq xmm11,xmm11,xmm2 + + vmovdqa xmm4,XMMWORD[64+rsp] + vpaddq xmm14,xmm14,xmm0 + vpmuludq xmm1,xmm3,xmm6 + vpmuludq xmm3,xmm3,xmm5 + vpaddq xmm13,xmm13,xmm1 + vmovdqa xmm2,XMMWORD[80+rsp] + vpaddq xmm12,xmm12,xmm3 + vpmuludq xmm0,xmm4,xmm9 + vpmuludq xmm4,xmm4,xmm8 + vpaddq xmm11,xmm11,xmm0 + vmovdqa xmm3,XMMWORD[96+rsp] + vpaddq xmm10,xmm10,xmm4 + + vmovdqa xmm4,XMMWORD[128+rsp] + vpmuludq xmm1,xmm2,xmm6 + vpmuludq xmm2,xmm2,xmm5 + vpaddq xmm14,xmm14,xmm1 + vpaddq xmm13,xmm13,xmm2 + vpmuludq xmm0,xmm3,xmm9 + vpmuludq xmm1,xmm3,xmm8 + vpaddq xmm12,xmm12,xmm0 + vmovdqu xmm0,XMMWORD[rsi] + vpaddq xmm11,xmm11,xmm1 + vpmuludq xmm3,xmm3,xmm7 + vpmuludq xmm7,xmm4,xmm7 + vpaddq xmm10,xmm10,xmm3 + + vmovdqu xmm1,XMMWORD[16+rsi] + vpaddq xmm11,xmm11,xmm7 + vpmuludq xmm8,xmm4,xmm8 + vpmuludq xmm9,xmm4,xmm9 + vpsrldq xmm2,xmm0,6 + vpaddq xmm12,xmm12,xmm8 + vpaddq xmm13,xmm13,xmm9 + vpsrldq xmm3,xmm1,6 + vpmuludq xmm9,xmm5,XMMWORD[112+rsp] + vpmuludq xmm5,xmm4,xmm6 + vpunpckhqdq xmm4,xmm0,xmm1 + vpaddq xmm14,xmm14,xmm9 + vmovdqa xmm9,XMMWORD[((-144))+r11] + vpaddq xmm10,xmm10,xmm5 + + vpunpcklqdq xmm0,xmm0,xmm1 + vpunpcklqdq xmm3,xmm2,xmm3 + + + vpsrldq xmm4,xmm4,5 + vpsrlq xmm1,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpsrlq xmm2,xmm3,4 + vpand xmm1,xmm1,xmm15 + vpand xmm4,xmm4,XMMWORD[rcx] + vpsrlq xmm3,xmm3,30 + vpand xmm2,xmm2,xmm15 + vpand xmm3,xmm3,xmm15 + vpor xmm4,xmm4,XMMWORD[32+rcx] + + vpaddq xmm0,xmm0,XMMWORD[r11] + vpaddq xmm1,xmm1,XMMWORD[16+r11] + vpaddq xmm2,xmm2,XMMWORD[32+r11] + vpaddq xmm3,xmm3,XMMWORD[48+r11] + vpaddq xmm4,xmm4,XMMWORD[64+r11] + + lea rax,[32+rsi] + lea rsi,[64+rsi] + sub rdx,64 + cmovc rsi,rax + + + + + + + + + + + vpmuludq xmm5,xmm9,xmm0 + vpmuludq xmm6,xmm9,xmm1 + vpaddq xmm10,xmm10,xmm5 + vpaddq xmm11,xmm11,xmm6 + vmovdqa xmm7,XMMWORD[((-128))+r11] + vpmuludq xmm5,xmm9,xmm2 + vpmuludq xmm6,xmm9,xmm3 + vpaddq xmm12,xmm12,xmm5 + vpaddq xmm13,xmm13,xmm6 + vpmuludq 
xmm9,xmm9,xmm4 + vpmuludq xmm5,xmm4,XMMWORD[((-112))+r11] + vpaddq xmm14,xmm14,xmm9 + + vpaddq xmm10,xmm10,xmm5 + vpmuludq xmm6,xmm7,xmm2 + vpmuludq xmm5,xmm7,xmm3 + vpaddq xmm13,xmm13,xmm6 + vmovdqa xmm8,XMMWORD[((-96))+r11] + vpaddq xmm14,xmm14,xmm5 + vpmuludq xmm6,xmm7,xmm1 + vpmuludq xmm7,xmm7,xmm0 + vpaddq xmm12,xmm12,xmm6 + vpaddq xmm11,xmm11,xmm7 + + vmovdqa xmm9,XMMWORD[((-80))+r11] + vpmuludq xmm5,xmm8,xmm2 + vpmuludq xmm6,xmm8,xmm1 + vpaddq xmm14,xmm14,xmm5 + vpaddq xmm13,xmm13,xmm6 + vmovdqa xmm7,XMMWORD[((-64))+r11] + vpmuludq xmm8,xmm8,xmm0 + vpmuludq xmm5,xmm9,xmm4 + vpaddq xmm12,xmm12,xmm8 + vpaddq xmm11,xmm11,xmm5 + vmovdqa xmm8,XMMWORD[((-48))+r11] + vpmuludq xmm9,xmm9,xmm3 + vpmuludq xmm6,xmm7,xmm1 + vpaddq xmm10,xmm10,xmm9 + + vmovdqa xmm9,XMMWORD[((-16))+r11] + vpaddq xmm14,xmm14,xmm6 + vpmuludq xmm7,xmm7,xmm0 + vpmuludq xmm5,xmm8,xmm4 + vpaddq xmm13,xmm13,xmm7 + vpaddq xmm12,xmm12,xmm5 + vmovdqu xmm5,XMMWORD[32+rsi] + vpmuludq xmm7,xmm8,xmm3 + vpmuludq xmm8,xmm8,xmm2 + vpaddq xmm11,xmm11,xmm7 + vmovdqu xmm6,XMMWORD[48+rsi] + vpaddq xmm10,xmm10,xmm8 + + vpmuludq xmm2,xmm9,xmm2 + vpmuludq xmm3,xmm9,xmm3 + vpsrldq xmm7,xmm5,6 + vpaddq xmm11,xmm11,xmm2 + vpmuludq xmm4,xmm9,xmm4 + vpsrldq xmm8,xmm6,6 + vpaddq xmm2,xmm12,xmm3 + vpaddq xmm3,xmm13,xmm4 + vpmuludq xmm4,xmm0,XMMWORD[((-32))+r11] + vpmuludq xmm0,xmm9,xmm1 + vpunpckhqdq xmm9,xmm5,xmm6 + vpaddq xmm4,xmm14,xmm4 + vpaddq xmm0,xmm10,xmm0 + + vpunpcklqdq xmm5,xmm5,xmm6 + vpunpcklqdq xmm8,xmm7,xmm8 + + + vpsrldq xmm9,xmm9,5 + vpsrlq xmm6,xmm5,26 + vmovdqa xmm14,XMMWORD[rsp] + vpand xmm5,xmm5,xmm15 + vpsrlq xmm7,xmm8,4 + vpand xmm6,xmm6,xmm15 + vpand xmm9,xmm9,XMMWORD[rcx] + vpsrlq xmm8,xmm8,30 + vpand xmm7,xmm7,xmm15 + vpand xmm8,xmm8,xmm15 + vpor xmm9,xmm9,XMMWORD[32+rcx] + + + + + + vpsrlq xmm13,xmm3,26 + vpand xmm3,xmm3,xmm15 + vpaddq xmm4,xmm4,xmm13 + + vpsrlq xmm10,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpaddq xmm1,xmm11,xmm10 + + vpsrlq xmm10,xmm4,26 + vpand xmm4,xmm4,xmm15 + + vpsrlq xmm11,xmm1,26 + vpand xmm1,xmm1,xmm15 + vpaddq xmm2,xmm2,xmm11 + + vpaddq xmm0,xmm0,xmm10 + vpsllq xmm10,xmm10,2 + vpaddq xmm0,xmm0,xmm10 + + vpsrlq xmm12,xmm2,26 + vpand xmm2,xmm2,xmm15 + vpaddq xmm3,xmm3,xmm12 + + vpsrlq xmm10,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpaddq xmm1,xmm1,xmm10 + + vpsrlq xmm13,xmm3,26 + vpand xmm3,xmm3,xmm15 + vpaddq xmm4,xmm4,xmm13 + + ja NEAR $L$oop_avx + +$L$skip_loop_avx: + + + + vpshufd xmm14,xmm14,0x10 + add rdx,32 + jnz NEAR $L$ong_tail_avx + + vpaddq xmm7,xmm7,xmm2 + vpaddq xmm5,xmm5,xmm0 + vpaddq xmm6,xmm6,xmm1 + vpaddq xmm8,xmm8,xmm3 + vpaddq xmm9,xmm9,xmm4 + +$L$ong_tail_avx: + vmovdqa XMMWORD[32+r11],xmm2 + vmovdqa XMMWORD[r11],xmm0 + vmovdqa XMMWORD[16+r11],xmm1 + vmovdqa XMMWORD[48+r11],xmm3 + vmovdqa XMMWORD[64+r11],xmm4 + + + + + + + + vpmuludq xmm12,xmm14,xmm7 + vpmuludq xmm10,xmm14,xmm5 + vpshufd xmm2,XMMWORD[((-48))+rdi],0x10 + vpmuludq xmm11,xmm14,xmm6 + vpmuludq xmm13,xmm14,xmm8 + vpmuludq xmm14,xmm14,xmm9 + + vpmuludq xmm0,xmm2,xmm8 + vpaddq xmm14,xmm14,xmm0 + vpshufd xmm3,XMMWORD[((-32))+rdi],0x10 + vpmuludq xmm1,xmm2,xmm7 + vpaddq xmm13,xmm13,xmm1 + vpshufd xmm4,XMMWORD[((-16))+rdi],0x10 + vpmuludq xmm0,xmm2,xmm6 + vpaddq xmm12,xmm12,xmm0 + vpmuludq xmm2,xmm2,xmm5 + vpaddq xmm11,xmm11,xmm2 + vpmuludq xmm3,xmm3,xmm9 + vpaddq xmm10,xmm10,xmm3 + + vpshufd xmm2,XMMWORD[rdi],0x10 + vpmuludq xmm1,xmm4,xmm7 + vpaddq xmm14,xmm14,xmm1 + vpmuludq xmm0,xmm4,xmm6 + vpaddq xmm13,xmm13,xmm0 + vpshufd xmm3,XMMWORD[16+rdi],0x10 + vpmuludq xmm4,xmm4,xmm5 + vpaddq xmm12,xmm12,xmm4 + vpmuludq xmm1,xmm2,xmm9 + 
vpaddq xmm11,xmm11,xmm1 + vpshufd xmm4,XMMWORD[32+rdi],0x10 + vpmuludq xmm2,xmm2,xmm8 + vpaddq xmm10,xmm10,xmm2 + + vpmuludq xmm0,xmm3,xmm6 + vpaddq xmm14,xmm14,xmm0 + vpmuludq xmm3,xmm3,xmm5 + vpaddq xmm13,xmm13,xmm3 + vpshufd xmm2,XMMWORD[48+rdi],0x10 + vpmuludq xmm1,xmm4,xmm9 + vpaddq xmm12,xmm12,xmm1 + vpshufd xmm3,XMMWORD[64+rdi],0x10 + vpmuludq xmm0,xmm4,xmm8 + vpaddq xmm11,xmm11,xmm0 + vpmuludq xmm4,xmm4,xmm7 + vpaddq xmm10,xmm10,xmm4 + + vpmuludq xmm2,xmm2,xmm5 + vpaddq xmm14,xmm14,xmm2 + vpmuludq xmm1,xmm3,xmm9 + vpaddq xmm13,xmm13,xmm1 + vpmuludq xmm0,xmm3,xmm8 + vpaddq xmm12,xmm12,xmm0 + vpmuludq xmm1,xmm3,xmm7 + vpaddq xmm11,xmm11,xmm1 + vpmuludq xmm3,xmm3,xmm6 + vpaddq xmm10,xmm10,xmm3 + + jz NEAR $L$short_tail_avx + + vmovdqu xmm0,XMMWORD[rsi] + vmovdqu xmm1,XMMWORD[16+rsi] + + vpsrldq xmm2,xmm0,6 + vpsrldq xmm3,xmm1,6 + vpunpckhqdq xmm4,xmm0,xmm1 + vpunpcklqdq xmm0,xmm0,xmm1 + vpunpcklqdq xmm3,xmm2,xmm3 + + vpsrlq xmm4,xmm4,40 + vpsrlq xmm1,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpsrlq xmm2,xmm3,4 + vpand xmm1,xmm1,xmm15 + vpsrlq xmm3,xmm3,30 + vpand xmm2,xmm2,xmm15 + vpand xmm3,xmm3,xmm15 + vpor xmm4,xmm4,XMMWORD[32+rcx] + + vpshufd xmm9,XMMWORD[((-64))+rdi],0x32 + vpaddq xmm0,xmm0,XMMWORD[r11] + vpaddq xmm1,xmm1,XMMWORD[16+r11] + vpaddq xmm2,xmm2,XMMWORD[32+r11] + vpaddq xmm3,xmm3,XMMWORD[48+r11] + vpaddq xmm4,xmm4,XMMWORD[64+r11] + + + + + vpmuludq xmm5,xmm9,xmm0 + vpaddq xmm10,xmm10,xmm5 + vpmuludq xmm6,xmm9,xmm1 + vpaddq xmm11,xmm11,xmm6 + vpmuludq xmm5,xmm9,xmm2 + vpaddq xmm12,xmm12,xmm5 + vpshufd xmm7,XMMWORD[((-48))+rdi],0x32 + vpmuludq xmm6,xmm9,xmm3 + vpaddq xmm13,xmm13,xmm6 + vpmuludq xmm9,xmm9,xmm4 + vpaddq xmm14,xmm14,xmm9 + + vpmuludq xmm5,xmm7,xmm3 + vpaddq xmm14,xmm14,xmm5 + vpshufd xmm8,XMMWORD[((-32))+rdi],0x32 + vpmuludq xmm6,xmm7,xmm2 + vpaddq xmm13,xmm13,xmm6 + vpshufd xmm9,XMMWORD[((-16))+rdi],0x32 + vpmuludq xmm5,xmm7,xmm1 + vpaddq xmm12,xmm12,xmm5 + vpmuludq xmm7,xmm7,xmm0 + vpaddq xmm11,xmm11,xmm7 + vpmuludq xmm8,xmm8,xmm4 + vpaddq xmm10,xmm10,xmm8 + + vpshufd xmm7,XMMWORD[rdi],0x32 + vpmuludq xmm6,xmm9,xmm2 + vpaddq xmm14,xmm14,xmm6 + vpmuludq xmm5,xmm9,xmm1 + vpaddq xmm13,xmm13,xmm5 + vpshufd xmm8,XMMWORD[16+rdi],0x32 + vpmuludq xmm9,xmm9,xmm0 + vpaddq xmm12,xmm12,xmm9 + vpmuludq xmm6,xmm7,xmm4 + vpaddq xmm11,xmm11,xmm6 + vpshufd xmm9,XMMWORD[32+rdi],0x32 + vpmuludq xmm7,xmm7,xmm3 + vpaddq xmm10,xmm10,xmm7 + + vpmuludq xmm5,xmm8,xmm1 + vpaddq xmm14,xmm14,xmm5 + vpmuludq xmm8,xmm8,xmm0 + vpaddq xmm13,xmm13,xmm8 + vpshufd xmm7,XMMWORD[48+rdi],0x32 + vpmuludq xmm6,xmm9,xmm4 + vpaddq xmm12,xmm12,xmm6 + vpshufd xmm8,XMMWORD[64+rdi],0x32 + vpmuludq xmm5,xmm9,xmm3 + vpaddq xmm11,xmm11,xmm5 + vpmuludq xmm9,xmm9,xmm2 + vpaddq xmm10,xmm10,xmm9 + + vpmuludq xmm7,xmm7,xmm0 + vpaddq xmm14,xmm14,xmm7 + vpmuludq xmm6,xmm8,xmm4 + vpaddq xmm13,xmm13,xmm6 + vpmuludq xmm5,xmm8,xmm3 + vpaddq xmm12,xmm12,xmm5 + vpmuludq xmm6,xmm8,xmm2 + vpaddq xmm11,xmm11,xmm6 + vpmuludq xmm8,xmm8,xmm1 + vpaddq xmm10,xmm10,xmm8 + +$L$short_tail_avx: + + + + vpsrldq xmm9,xmm14,8 + vpsrldq xmm8,xmm13,8 + vpsrldq xmm6,xmm11,8 + vpsrldq xmm5,xmm10,8 + vpsrldq xmm7,xmm12,8 + vpaddq xmm13,xmm13,xmm8 + vpaddq xmm14,xmm14,xmm9 + vpaddq xmm10,xmm10,xmm5 + vpaddq xmm11,xmm11,xmm6 + vpaddq xmm12,xmm12,xmm7 + + + + + vpsrlq xmm3,xmm13,26 + vpand xmm13,xmm13,xmm15 + vpaddq xmm14,xmm14,xmm3 + + vpsrlq xmm0,xmm10,26 + vpand xmm10,xmm10,xmm15 + vpaddq xmm11,xmm11,xmm0 + + vpsrlq xmm4,xmm14,26 + vpand xmm14,xmm14,xmm15 + + vpsrlq xmm1,xmm11,26 + vpand xmm11,xmm11,xmm15 + vpaddq xmm12,xmm12,xmm1 + + vpaddq 
xmm10,xmm10,xmm4 + vpsllq xmm4,xmm4,2 + vpaddq xmm10,xmm10,xmm4 + + vpsrlq xmm2,xmm12,26 + vpand xmm12,xmm12,xmm15 + vpaddq xmm13,xmm13,xmm2 + + vpsrlq xmm0,xmm10,26 + vpand xmm10,xmm10,xmm15 + vpaddq xmm11,xmm11,xmm0 + + vpsrlq xmm3,xmm13,26 + vpand xmm13,xmm13,xmm15 + vpaddq xmm14,xmm14,xmm3 + + vmovd DWORD[(-112)+rdi],xmm10 + vmovd DWORD[(-108)+rdi],xmm11 + vmovd DWORD[(-104)+rdi],xmm12 + vmovd DWORD[(-100)+rdi],xmm13 + vmovd DWORD[(-96)+rdi],xmm14 + vmovdqa xmm6,XMMWORD[80+r11] + vmovdqa xmm7,XMMWORD[96+r11] + vmovdqa xmm8,XMMWORD[112+r11] + vmovdqa xmm9,XMMWORD[128+r11] + vmovdqa xmm10,XMMWORD[144+r11] + vmovdqa xmm11,XMMWORD[160+r11] + vmovdqa xmm12,XMMWORD[176+r11] + vmovdqa xmm13,XMMWORD[192+r11] + vmovdqa xmm14,XMMWORD[208+r11] + vmovdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx_epilogue: + vzeroupper + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_avx: + + +ALIGN 32 +poly1305_emit_avx: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_emit_avx: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + + + cmp DWORD[20+rdi],0 + je NEAR $L$emit + + mov eax,DWORD[rdi] + mov ecx,DWORD[4+rdi] + mov r8d,DWORD[8+rdi] + mov r11d,DWORD[12+rdi] + mov r10d,DWORD[16+rdi] + + shl rcx,26 + mov r9,r8 + shl r8,52 + add rax,rcx + shr r9,12 + add r8,rax + adc r9,0 + + shl r11,14 + mov rax,r10 + shr r10,24 + add r9,r11 + shl rax,40 + add r9,rax + adc r10,0 + + mov rax,r10 + mov rcx,r10 + and r10,3 + shr rax,2 + and rcx,-4 + add rax,rcx + add r8,rax + adc r9,0 + adc r10,0 + + mov rax,r8 + add r8,5 + mov rcx,r9 + adc r9,0 + adc r10,0 + shr r10,2 + cmovnz rax,r8 + cmovnz rcx,r9 + + add rax,QWORD[rdx] + adc rcx,QWORD[8+rdx] + mov QWORD[rsi],rax + mov QWORD[8+rsi],rcx + + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret +$L$SEH_end_poly1305_emit_avx: + +ALIGN 32 +poly1305_blocks_avx2: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_avx2: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + + mov r8d,DWORD[20+rdi] + cmp rdx,128 + jae NEAR $L$blocks_avx2 + test r8d,r8d + jz NEAR $L$blocks + +$L$blocks_avx2: + and rdx,-16 + jz NEAR $L$no_data_avx2 + + vzeroupper + + test r8d,r8d + jz NEAR $L$base2_64_avx2 + + test rdx,63 + jz NEAR $L$even_avx2 + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_avx2_body: + + mov r15,rdx + + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + + mov r14d,r8d + and r8,-2147483648 + mov r12,r9 + mov ebx,r9d + and r9,-2147483648 + + shr r8,6 + shl r12,52 + add r14,r8 + shr rbx,12 + shr r9,18 + add r14,r12 + adc rbx,r9 + + mov r8,rbp + shl r8,40 + shr rbp,24 + add rbx,r8 + adc rbp,0 + + mov r9,-4 + mov r8,rbp + and r9,rbp + shr r8,2 + and rbp,3 + add r8,r9 + add r14,r8 + adc rbx,0 + adc rbp,0 + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + +$L$base2_26_pre_avx2: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_26_pre_avx2 + + test rcx,rcx + jz NEAR $L$store_base2_64_avx2 + + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r11,rbx + mov r12,rbx + shr rdx,26 + and rax,0x3ffffff + shl r11,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r11 + shl rbp,24 + and r14,0x3ffffff + shr r12,40 + and rbx,0x3ffffff + or rbp,r12 + + test r15,r15 + jz NEAR 
$L$store_base2_26_avx2 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + jmp NEAR $L$proceed_avx2 + +ALIGN 32 +$L$store_base2_64_avx2: + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + jmp NEAR $L$done_avx2 + +ALIGN 16 +$L$store_base2_26_avx2: + mov DWORD[rdi],eax + mov DWORD[4+rdi],edx + mov DWORD[8+rdi],r14d + mov DWORD[12+rdi],ebx + mov DWORD[16+rdi],ebp +ALIGN 16 +$L$done_avx2: + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data_avx2: +$L$blocks_avx2_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + + +ALIGN 32 +$L$base2_64_avx2: + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$base2_64_avx2_body: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + test rdx,63 + jz NEAR $L$init_avx2 + +$L$base2_64_pre_avx2: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_64_pre_avx2 + +$L$init_avx2: + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r8,rbx + mov r9,rbx + shr rdx,26 + and rax,0x3ffffff + shl r8,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r8 + shl rbp,24 + and r14,0x3ffffff + shr r9,40 + and rbx,0x3ffffff + or rbp,r9 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + mov DWORD[20+rdi],1 + + call __poly1305_init_avx + +$L$proceed_avx2: + mov rdx,r15 + + + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rax,[48+rsp] + lea rsp,[48+rsp] + +$L$base2_64_avx2_epilogue: + jmp NEAR $L$do_avx2 + + +ALIGN 32 +$L$even_avx2: + + + vmovd xmm0,DWORD[rdi] + vmovd xmm1,DWORD[4+rdi] + vmovd xmm2,DWORD[8+rdi] + vmovd xmm3,DWORD[12+rdi] + vmovd xmm4,DWORD[16+rdi] + +$L$do_avx2: + lea r11,[((-248))+rsp] + sub rsp,0x1c8 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx2_body: + lea rcx,[$L$const] + lea rdi,[((48+64))+rdi] + vmovdqa ymm7,YMMWORD[96+rcx] + + + vmovdqu xmm9,XMMWORD[((-64))+rdi] + and rsp,-512 + vmovdqu xmm10,XMMWORD[((-48))+rdi] + vmovdqu xmm6,XMMWORD[((-32))+rdi] + vmovdqu xmm11,XMMWORD[((-16))+rdi] + vmovdqu xmm12,XMMWORD[rdi] + vmovdqu xmm13,XMMWORD[16+rdi] + lea rax,[144+rsp] + vmovdqu xmm14,XMMWORD[32+rdi] + vpermd ymm9,ymm7,ymm9 + vmovdqu xmm15,XMMWORD[48+rdi] + vpermd ymm10,ymm7,ymm10 + vmovdqu xmm5,XMMWORD[64+rdi] + vpermd ymm6,ymm7,ymm6 + vmovdqa YMMWORD[rsp],ymm9 + vpermd ymm11,ymm7,ymm11 + vmovdqa YMMWORD[(32-144)+rax],ymm10 + vpermd ymm12,ymm7,ymm12 + vmovdqa YMMWORD[(64-144)+rax],ymm6 + vpermd ymm13,ymm7,ymm13 + vmovdqa YMMWORD[(96-144)+rax],ymm11 + vpermd ymm14,ymm7,ymm14 + vmovdqa YMMWORD[(128-144)+rax],ymm12 + vpermd ymm15,ymm7,ymm15 + vmovdqa YMMWORD[(160-144)+rax],ymm13 + vpermd ymm5,ymm7,ymm5 + vmovdqa YMMWORD[(192-144)+rax],ymm14 + vmovdqa YMMWORD[(224-144)+rax],ymm15 + vmovdqa YMMWORD[(256-144)+rax],ymm5 + vmovdqa 
ymm5,YMMWORD[64+rcx] + + + + vmovdqu xmm7,XMMWORD[rsi] + vmovdqu xmm8,XMMWORD[16+rsi] + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpsrldq ymm9,ymm7,6 + vpsrldq ymm10,ymm8,6 + vpunpckhqdq ymm6,ymm7,ymm8 + vpunpcklqdq ymm9,ymm9,ymm10 + vpunpcklqdq ymm7,ymm7,ymm8 + + vpsrlq ymm10,ymm9,30 + vpsrlq ymm9,ymm9,4 + vpsrlq ymm8,ymm7,26 + vpsrlq ymm6,ymm6,40 + vpand ymm9,ymm9,ymm5 + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + vpaddq ymm2,ymm9,ymm2 + sub rdx,64 + jz NEAR $L$tail_avx2 + jmp NEAR $L$oop_avx2 + +ALIGN 32 +$L$oop_avx2: + + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqa ymm7,YMMWORD[rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqa ymm8,YMMWORD[32+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqa ymm9,YMMWORD[96+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqa ymm10,YMMWORD[48+rax] + vmovdqa ymm5,YMMWORD[112+rax] + + + + + + + + + + + + + + + + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[64+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + vmovdqa ymm8,YMMWORD[((-16))+rax] + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vmovdqu xmm7,XMMWORD[rsi] + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vmovdqu xmm8,XMMWORD[16+rsi] + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqa ymm2,YMMWORD[16+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpsrldq ymm9,ymm7,6 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpsrldq ymm10,ymm8,6 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpunpckhqdq ymm6,ymm7,ymm8 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpunpcklqdq ymm7,ymm7,ymm8 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpunpcklqdq ymm10,ymm9,ymm10 + vpmuludq ymm4,ymm0,YMMWORD[80+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm9,ymm10,4 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpand ymm9,ymm9,ymm5 + vpsrlq ymm8,ymm7,26 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpaddq ymm2,ymm2,ymm9 + vpsrlq ymm10,ymm10,30 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm6,ymm6,40 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + sub rdx,64 + jnz NEAR $L$oop_avx2 + +DB 0x66,0x90 +$L$tail_avx2: + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqu ymm7,YMMWORD[4+rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqu 
ymm8,YMMWORD[36+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqu ymm9,YMMWORD[100+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqu ymm10,YMMWORD[52+rax] + vmovdqu ymm5,YMMWORD[116+rax] + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[68+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vmovdqu ymm8,YMMWORD[((-12))+rax] + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqu ymm2,YMMWORD[20+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpmuludq ymm4,ymm0,YMMWORD[84+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrldq ymm8,ymm12,8 + vpsrldq ymm9,ymm2,8 + vpsrldq ymm10,ymm3,8 + vpsrldq ymm6,ymm4,8 + vpsrldq ymm7,ymm0,8 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + + vpermq ymm10,ymm3,0x2 + vpermq ymm6,ymm4,0x2 + vpermq ymm7,ymm0,0x2 + vpermq ymm8,ymm12,0x2 + vpermq ymm9,ymm2,0x2 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vmovd DWORD[(-112)+rdi],xmm0 + vmovd DWORD[(-108)+rdi],xmm1 + vmovd DWORD[(-104)+rdi],xmm2 + vmovd DWORD[(-100)+rdi],xmm3 + vmovd DWORD[(-96)+rdi],xmm4 + vmovdqa xmm6,XMMWORD[80+r11] + vmovdqa xmm7,XMMWORD[96+r11] + vmovdqa xmm8,XMMWORD[112+r11] + vmovdqa xmm9,XMMWORD[128+r11] + vmovdqa xmm10,XMMWORD[144+r11] + vmovdqa xmm11,XMMWORD[160+r11] + vmovdqa xmm12,XMMWORD[176+r11] + vmovdqa xmm13,XMMWORD[192+r11] + vmovdqa xmm14,XMMWORD[208+r11] + vmovdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx2_epilogue: + vzeroupper + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_avx2: + +ALIGN 32 +poly1305_blocks_avx512: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_avx512: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + + mov r8d,DWORD[20+rdi] + cmp rdx,128 + jae NEAR $L$blocks_avx2_512 + test r8d,r8d + jz NEAR $L$blocks + +$L$blocks_avx2_512: + and rdx,-16 + jz NEAR 
$L$no_data_avx2_512 + + vzeroupper + + test r8d,r8d + jz NEAR $L$base2_64_avx2_512 + + test rdx,63 + jz NEAR $L$even_avx2_512 + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_avx2_body_512: + + mov r15,rdx + + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + + mov r14d,r8d + and r8,-2147483648 + mov r12,r9 + mov ebx,r9d + and r9,-2147483648 + + shr r8,6 + shl r12,52 + add r14,r8 + shr rbx,12 + shr r9,18 + add r14,r12 + adc rbx,r9 + + mov r8,rbp + shl r8,40 + shr rbp,24 + add rbx,r8 + adc rbp,0 + + mov r9,-4 + mov r8,rbp + and r9,rbp + shr r8,2 + and rbp,3 + add r8,r9 + add r14,r8 + adc rbx,0 + adc rbp,0 + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + +$L$base2_26_pre_avx2_512: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_26_pre_avx2_512 + + test rcx,rcx + jz NEAR $L$store_base2_64_avx2_512 + + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r11,rbx + mov r12,rbx + shr rdx,26 + and rax,0x3ffffff + shl r11,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r11 + shl rbp,24 + and r14,0x3ffffff + shr r12,40 + and rbx,0x3ffffff + or rbp,r12 + + test r15,r15 + jz NEAR $L$store_base2_26_avx2_512 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + jmp NEAR $L$proceed_avx2_512 + +ALIGN 32 +$L$store_base2_64_avx2_512: + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + jmp NEAR $L$done_avx2_512 + +ALIGN 16 +$L$store_base2_26_avx2_512: + mov DWORD[rdi],eax + mov DWORD[4+rdi],edx + mov DWORD[8+rdi],r14d + mov DWORD[12+rdi],ebx + mov DWORD[16+rdi],ebp +ALIGN 16 +$L$done_avx2_512: + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data_avx2_512: +$L$blocks_avx2_epilogue_512: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + + +ALIGN 32 +$L$base2_64_avx2_512: + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$base2_64_avx2_body_512: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + test rdx,63 + jz NEAR $L$init_avx2_512 + +$L$base2_64_pre_avx2_512: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_64_pre_avx2_512 + +$L$init_avx2_512: + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r8,rbx + mov r9,rbx + shr rdx,26 + and rax,0x3ffffff + shl r8,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r8 + shl rbp,24 + and r14,0x3ffffff + shr r9,40 + and rbx,0x3ffffff + or rbp,r9 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + mov DWORD[20+rdi],1 + + call __poly1305_init_avx + +$L$proceed_avx2_512: + mov rdx,r15 + + + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rax,[48+rsp] + lea rsp,[48+rsp] + +$L$base2_64_avx2_epilogue_512: + jmp NEAR $L$do_avx2_512 + + +ALIGN 32 +$L$even_avx2_512: + + + vmovd xmm0,DWORD[rdi] + vmovd xmm1,DWORD[4+rdi] + vmovd xmm2,DWORD[8+rdi] + vmovd xmm3,DWORD[12+rdi] + vmovd xmm4,DWORD[16+rdi] + 
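+; $L$do_avx2_512: entry point with the accumulator held as five 26-bit
+; limbs in xmm0-xmm4 (the 0x3ffffff masks above perform the radix-2^26
+; conversion). Inputs of 512 bytes or more take the AVX-512 path at
+; $L$blocks_avx512; anything shorter falls through to the AVX2 code at
+; $L$skip_avx512.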
+$L$do_avx2_512: + cmp rdx,512 + jae NEAR $L$blocks_avx512 +$L$skip_avx512: + lea r11,[((-248))+rsp] + sub rsp,0x1c8 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx2_body_512: + lea rcx,[$L$const] + lea rdi,[((48+64))+rdi] + vmovdqa ymm7,YMMWORD[96+rcx] + + + vmovdqu xmm9,XMMWORD[((-64))+rdi] + and rsp,-512 + vmovdqu xmm10,XMMWORD[((-48))+rdi] + vmovdqu xmm6,XMMWORD[((-32))+rdi] + vmovdqu xmm11,XMMWORD[((-16))+rdi] + vmovdqu xmm12,XMMWORD[rdi] + vmovdqu xmm13,XMMWORD[16+rdi] + lea rax,[144+rsp] + vmovdqu xmm14,XMMWORD[32+rdi] + vpermd ymm9,ymm7,ymm9 + vmovdqu xmm15,XMMWORD[48+rdi] + vpermd ymm10,ymm7,ymm10 + vmovdqu xmm5,XMMWORD[64+rdi] + vpermd ymm6,ymm7,ymm6 + vmovdqa YMMWORD[rsp],ymm9 + vpermd ymm11,ymm7,ymm11 + vmovdqa YMMWORD[(32-144)+rax],ymm10 + vpermd ymm12,ymm7,ymm12 + vmovdqa YMMWORD[(64-144)+rax],ymm6 + vpermd ymm13,ymm7,ymm13 + vmovdqa YMMWORD[(96-144)+rax],ymm11 + vpermd ymm14,ymm7,ymm14 + vmovdqa YMMWORD[(128-144)+rax],ymm12 + vpermd ymm15,ymm7,ymm15 + vmovdqa YMMWORD[(160-144)+rax],ymm13 + vpermd ymm5,ymm7,ymm5 + vmovdqa YMMWORD[(192-144)+rax],ymm14 + vmovdqa YMMWORD[(224-144)+rax],ymm15 + vmovdqa YMMWORD[(256-144)+rax],ymm5 + vmovdqa ymm5,YMMWORD[64+rcx] + + + + vmovdqu xmm7,XMMWORD[rsi] + vmovdqu xmm8,XMMWORD[16+rsi] + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpsrldq ymm9,ymm7,6 + vpsrldq ymm10,ymm8,6 + vpunpckhqdq ymm6,ymm7,ymm8 + vpunpcklqdq ymm9,ymm9,ymm10 + vpunpcklqdq ymm7,ymm7,ymm8 + + vpsrlq ymm10,ymm9,30 + vpsrlq ymm9,ymm9,4 + vpsrlq ymm8,ymm7,26 + vpsrlq ymm6,ymm6,40 + vpand ymm9,ymm9,ymm5 + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + vpaddq ymm2,ymm9,ymm2 + sub rdx,64 + jz NEAR $L$tail_avx2_512 + jmp NEAR $L$oop_avx2_512 + +ALIGN 32 +$L$oop_avx2_512: + + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqa ymm7,YMMWORD[rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqa ymm8,YMMWORD[32+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqa ymm9,YMMWORD[96+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqa ymm10,YMMWORD[48+rax] + vmovdqa ymm5,YMMWORD[112+rax] + + + + + + + + + + + + + + + + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[64+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + vmovdqa ymm8,YMMWORD[((-16))+rax] + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vmovdqu xmm7,XMMWORD[rsi] + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vmovdqu xmm8,XMMWORD[16+rsi] + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqa ymm2,YMMWORD[16+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpsrldq ymm9,ymm7,6 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + 
vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpsrldq ymm10,ymm8,6 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpunpckhqdq ymm6,ymm7,ymm8 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpunpcklqdq ymm7,ymm7,ymm8 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpunpcklqdq ymm10,ymm9,ymm10 + vpmuludq ymm4,ymm0,YMMWORD[80+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm9,ymm10,4 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpand ymm9,ymm9,ymm5 + vpsrlq ymm8,ymm7,26 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpaddq ymm2,ymm2,ymm9 + vpsrlq ymm10,ymm10,30 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm6,ymm6,40 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + sub rdx,64 + jnz NEAR $L$oop_avx2_512 + +DB 0x66,0x90 +$L$tail_avx2_512: + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqu ymm7,YMMWORD[4+rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqu ymm8,YMMWORD[36+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqu ymm9,YMMWORD[100+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqu ymm10,YMMWORD[52+rax] + vmovdqu ymm5,YMMWORD[116+rax] + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[68+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vmovdqu ymm8,YMMWORD[((-12))+rax] + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqu ymm2,YMMWORD[20+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpmuludq ymm4,ymm0,YMMWORD[84+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrldq ymm8,ymm12,8 + vpsrldq ymm9,ymm2,8 + vpsrldq ymm10,ymm3,8 + vpsrldq ymm6,ymm4,8 + vpsrldq ymm7,ymm0,8 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + + vpermq ymm10,ymm3,0x2 + vpermq ymm6,ymm4,0x2 + vpermq ymm7,ymm0,0x2 + vpermq ymm8,ymm12,0x2 + vpermq ymm9,ymm2,0x2 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq 
ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vmovd DWORD[(-112)+rdi],xmm0 + vmovd DWORD[(-108)+rdi],xmm1 + vmovd DWORD[(-104)+rdi],xmm2 + vmovd DWORD[(-100)+rdi],xmm3 + vmovd DWORD[(-96)+rdi],xmm4 + vmovdqa xmm6,XMMWORD[80+r11] + vmovdqa xmm7,XMMWORD[96+r11] + vmovdqa xmm8,XMMWORD[112+r11] + vmovdqa xmm9,XMMWORD[128+r11] + vmovdqa xmm10,XMMWORD[144+r11] + vmovdqa xmm11,XMMWORD[160+r11] + vmovdqa xmm12,XMMWORD[176+r11] + vmovdqa xmm13,XMMWORD[192+r11] + vmovdqa xmm14,XMMWORD[208+r11] + vmovdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx2_epilogue_512: + vzeroupper + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_avx512: +$L$blocks_avx512: + mov eax,15 + kmovw k2,eax + lea r11,[((-248))+rsp] + sub rsp,0x1c8 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx512_body: + lea rcx,[$L$const] + lea rdi,[((48+64))+rdi] + vmovdqa ymm9,YMMWORD[96+rcx] + + + vmovdqu xmm11,XMMWORD[((-64))+rdi] + and rsp,-512 + vmovdqu xmm12,XMMWORD[((-48))+rdi] + mov rax,0x20 + vmovdqu xmm7,XMMWORD[((-32))+rdi] + vmovdqu xmm13,XMMWORD[((-16))+rdi] + vmovdqu xmm8,XMMWORD[rdi] + vmovdqu xmm14,XMMWORD[16+rdi] + vmovdqu xmm10,XMMWORD[32+rdi] + vmovdqu xmm15,XMMWORD[48+rdi] + vmovdqu xmm6,XMMWORD[64+rdi] + vpermd zmm16,zmm9,zmm11 + vpbroadcastq zmm5,QWORD[64+rcx] + vpermd zmm17,zmm9,zmm12 + vpermd zmm21,zmm9,zmm7 + vpermd zmm18,zmm9,zmm13 + vmovdqa64 ZMMWORD[rsp]{k2},zmm16 + vpsrlq zmm7,zmm16,32 + vpermd zmm22,zmm9,zmm8 + vmovdqu64 ZMMWORD[rax*1+rsp]{k2},zmm17 + vpsrlq zmm8,zmm17,32 + vpermd zmm19,zmm9,zmm14 + vmovdqa64 ZMMWORD[64+rsp]{k2},zmm21 + vpermd zmm23,zmm9,zmm10 + vpermd zmm20,zmm9,zmm15 + vmovdqu64 ZMMWORD[64+rax*1+rsp]{k2},zmm18 + vpermd zmm24,zmm9,zmm6 + vmovdqa64 ZMMWORD[128+rsp]{k2},zmm22 + vmovdqu64 ZMMWORD[128+rax*1+rsp]{k2},zmm19 + vmovdqa64 ZMMWORD[192+rsp]{k2},zmm23 + vmovdqu64 ZMMWORD[192+rax*1+rsp]{k2},zmm20 + vmovdqa64 ZMMWORD[256+rsp]{k2},zmm24 + + + + + + + + + + + vpmuludq zmm11,zmm16,zmm7 + vpmuludq zmm12,zmm17,zmm7 + vpmuludq zmm13,zmm18,zmm7 + vpmuludq zmm14,zmm19,zmm7 + vpmuludq zmm15,zmm20,zmm7 + vpsrlq zmm9,zmm18,32 + + vpmuludq zmm25,zmm24,zmm8 + vpmuludq zmm26,zmm16,zmm8 + vpmuludq zmm27,zmm17,zmm8 + vpmuludq zmm28,zmm18,zmm8 + vpmuludq zmm29,zmm19,zmm8 + vpsrlq zmm10,zmm19,32 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + + vpmuludq zmm25,zmm23,zmm9 + vpmuludq zmm26,zmm24,zmm9 + vpmuludq zmm28,zmm17,zmm9 + vpmuludq zmm29,zmm18,zmm9 + vpmuludq zmm27,zmm16,zmm9 + vpsrlq zmm6,zmm20,32 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm25,zmm22,zmm10 + vpmuludq zmm28,zmm16,zmm10 + vpmuludq zmm29,zmm17,zmm10 + vpmuludq zmm26,zmm23,zmm10 + vpmuludq zmm27,zmm24,zmm10 + 
vpaddq zmm11,zmm11,zmm25 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm28,zmm24,zmm6 + vpmuludq zmm29,zmm16,zmm6 + vpmuludq zmm25,zmm21,zmm6 + vpmuludq zmm26,zmm22,zmm6 + vpmuludq zmm27,zmm23,zmm6 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + + + vmovdqu64 zmm10,ZMMWORD[rsi] + vmovdqu64 zmm6,ZMMWORD[64+rsi] + lea rsi,[128+rsi] + + + + + vpsrlq zmm28,zmm14,26 + vpandq zmm14,zmm14,zmm5 + vpaddq zmm15,zmm15,zmm28 + + vpsrlq zmm25,zmm11,26 + vpandq zmm11,zmm11,zmm5 + vpaddq zmm12,zmm12,zmm25 + + vpsrlq zmm29,zmm15,26 + vpandq zmm15,zmm15,zmm5 + + vpsrlq zmm26,zmm12,26 + vpandq zmm12,zmm12,zmm5 + vpaddq zmm13,zmm13,zmm26 + + vpaddq zmm11,zmm11,zmm29 + vpsllq zmm29,zmm29,2 + vpaddq zmm11,zmm11,zmm29 + + vpsrlq zmm27,zmm13,26 + vpandq zmm13,zmm13,zmm5 + vpaddq zmm14,zmm14,zmm27 + + vpsrlq zmm25,zmm11,26 + vpandq zmm11,zmm11,zmm5 + vpaddq zmm12,zmm12,zmm25 + + vpsrlq zmm28,zmm14,26 + vpandq zmm14,zmm14,zmm5 + vpaddq zmm15,zmm15,zmm28 + + + + + + vpunpcklqdq zmm7,zmm10,zmm6 + vpunpckhqdq zmm6,zmm10,zmm6 + + + + + + + vmovdqa32 zmm25,ZMMWORD[128+rcx] + mov eax,0x7777 + kmovw k1,eax + + vpermd zmm16,zmm25,zmm16 + vpermd zmm17,zmm25,zmm17 + vpermd zmm18,zmm25,zmm18 + vpermd zmm19,zmm25,zmm19 + vpermd zmm20,zmm25,zmm20 + + vpermd zmm16{k1},zmm25,zmm11 + vpermd zmm17{k1},zmm25,zmm12 + vpermd zmm18{k1},zmm25,zmm13 + vpermd zmm19{k1},zmm25,zmm14 + vpermd zmm20{k1},zmm25,zmm15 + + vpslld zmm21,zmm17,2 + vpslld zmm22,zmm18,2 + vpslld zmm23,zmm19,2 + vpslld zmm24,zmm20,2 + vpaddd zmm21,zmm21,zmm17 + vpaddd zmm22,zmm22,zmm18 + vpaddd zmm23,zmm23,zmm19 + vpaddd zmm24,zmm24,zmm20 + + vpbroadcastq zmm30,QWORD[32+rcx] + + vpsrlq zmm9,zmm7,52 + vpsllq zmm10,zmm6,12 + vporq zmm9,zmm9,zmm10 + vpsrlq zmm8,zmm7,26 + vpsrlq zmm10,zmm6,14 + vpsrlq zmm6,zmm6,40 + vpandq zmm9,zmm9,zmm5 + vpandq zmm7,zmm7,zmm5 + + + + + vpaddq zmm2,zmm9,zmm2 + sub rdx,192 + jbe NEAR $L$tail_avx512 + jmp NEAR $L$oop_avx512 + +ALIGN 32 +$L$oop_avx512: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + vpmuludq zmm14,zmm17,zmm2 + vpaddq zmm0,zmm7,zmm0 + vpmuludq zmm15,zmm18,zmm2 + vpandq zmm8,zmm8,zmm5 + vpmuludq zmm11,zmm23,zmm2 + vpandq zmm10,zmm10,zmm5 + vpmuludq zmm12,zmm24,zmm2 + vporq zmm6,zmm6,zmm30 + vpmuludq zmm13,zmm16,zmm2 + vpaddq zmm1,zmm8,zmm1 + vpaddq zmm3,zmm10,zmm3 + vpaddq zmm4,zmm6,zmm4 + + vmovdqu64 zmm10,ZMMWORD[rsi] + vmovdqu64 zmm6,ZMMWORD[64+rsi] + lea rsi,[128+rsi] + vpmuludq zmm28,zmm19,zmm0 + vpmuludq zmm29,zmm20,zmm0 + vpmuludq zmm25,zmm16,zmm0 + vpmuludq zmm26,zmm17,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + + vpmuludq zmm28,zmm18,zmm1 + vpmuludq zmm29,zmm19,zmm1 + vpmuludq zmm25,zmm24,zmm1 + vpmuludq zmm27,zmm18,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm13,zmm13,zmm27 + + vpunpcklqdq zmm7,zmm10,zmm6 + vpunpckhqdq zmm6,zmm10,zmm6 + + vpmuludq zmm28,zmm16,zmm3 + vpmuludq zmm29,zmm17,zmm3 + vpmuludq zmm26,zmm16,zmm1 + vpmuludq zmm27,zmm17,zmm1 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm28,zmm24,zmm4 + vpmuludq zmm29,zmm16,zmm4 + vpmuludq zmm25,zmm22,zmm3 + vpmuludq zmm26,zmm23,zmm3 + vpaddq zmm14,zmm14,zmm28 + vpmuludq zmm27,zmm24,zmm3 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq 
zmm13,zmm13,zmm27 + + vpmuludq zmm25,zmm21,zmm4 + vpmuludq zmm26,zmm22,zmm4 + vpmuludq zmm27,zmm23,zmm4 + vpaddq zmm0,zmm11,zmm25 + vpaddq zmm1,zmm12,zmm26 + vpaddq zmm2,zmm13,zmm27 + + + + + vpsrlq zmm9,zmm7,52 + vpsllq zmm10,zmm6,12 + + vpsrlq zmm3,zmm14,26 + vpandq zmm14,zmm14,zmm5 + vpaddq zmm4,zmm15,zmm3 + + vporq zmm9,zmm9,zmm10 + + vpsrlq zmm11,zmm0,26 + vpandq zmm0,zmm0,zmm5 + vpaddq zmm1,zmm1,zmm11 + + vpandq zmm9,zmm9,zmm5 + + vpsrlq zmm15,zmm4,26 + vpandq zmm4,zmm4,zmm5 + + vpsrlq zmm12,zmm1,26 + vpandq zmm1,zmm1,zmm5 + vpaddq zmm2,zmm2,zmm12 + + vpaddq zmm0,zmm0,zmm15 + vpsllq zmm15,zmm15,2 + vpaddq zmm0,zmm0,zmm15 + + vpaddq zmm2,zmm2,zmm9 + vpsrlq zmm8,zmm7,26 + + vpsrlq zmm13,zmm2,26 + vpandq zmm2,zmm2,zmm5 + vpaddq zmm3,zmm14,zmm13 + + vpsrlq zmm10,zmm6,14 + + vpsrlq zmm11,zmm0,26 + vpandq zmm0,zmm0,zmm5 + vpaddq zmm1,zmm1,zmm11 + + vpsrlq zmm6,zmm6,40 + + vpsrlq zmm14,zmm3,26 + vpandq zmm3,zmm3,zmm5 + vpaddq zmm4,zmm4,zmm14 + + vpandq zmm7,zmm7,zmm5 + + + + + sub rdx,128 + ja NEAR $L$oop_avx512 + +$L$tail_avx512: + + + + + + vpsrlq zmm16,zmm16,32 + vpsrlq zmm17,zmm17,32 + vpsrlq zmm18,zmm18,32 + vpsrlq zmm23,zmm23,32 + vpsrlq zmm24,zmm24,32 + vpsrlq zmm19,zmm19,32 + vpsrlq zmm20,zmm20,32 + vpsrlq zmm21,zmm21,32 + vpsrlq zmm22,zmm22,32 + + + + lea rsi,[rdx*1+rsi] + + + vpaddq zmm0,zmm7,zmm0 + + vpmuludq zmm14,zmm17,zmm2 + vpmuludq zmm15,zmm18,zmm2 + vpmuludq zmm11,zmm23,zmm2 + vpandq zmm8,zmm8,zmm5 + vpmuludq zmm12,zmm24,zmm2 + vpandq zmm10,zmm10,zmm5 + vpmuludq zmm13,zmm16,zmm2 + vporq zmm6,zmm6,zmm30 + vpaddq zmm1,zmm8,zmm1 + vpaddq zmm3,zmm10,zmm3 + vpaddq zmm4,zmm6,zmm4 + + vmovdqu xmm7,XMMWORD[rsi] + vpmuludq zmm28,zmm19,zmm0 + vpmuludq zmm29,zmm20,zmm0 + vpmuludq zmm25,zmm16,zmm0 + vpmuludq zmm26,zmm17,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + + vmovdqu xmm8,XMMWORD[16+rsi] + vpmuludq zmm28,zmm18,zmm1 + vpmuludq zmm29,zmm19,zmm1 + vpmuludq zmm25,zmm24,zmm1 + vpmuludq zmm27,zmm18,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm13,zmm13,zmm27 + + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + vpmuludq zmm28,zmm16,zmm3 + vpmuludq zmm29,zmm17,zmm3 + vpmuludq zmm26,zmm16,zmm1 + vpmuludq zmm27,zmm17,zmm1 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + vpmuludq zmm28,zmm24,zmm4 + vpmuludq zmm29,zmm16,zmm4 + vpmuludq zmm25,zmm22,zmm3 + vpmuludq zmm26,zmm23,zmm3 + vpmuludq zmm27,zmm24,zmm3 + vpaddq zmm3,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm25,zmm21,zmm4 + vpmuludq zmm26,zmm22,zmm4 + vpmuludq zmm27,zmm23,zmm4 + vpaddq zmm0,zmm11,zmm25 + vpaddq zmm1,zmm12,zmm26 + vpaddq zmm2,zmm13,zmm27 + + + + + mov eax,1 + vpermq zmm14,zmm3,0xb1 + vpermq zmm4,zmm15,0xb1 + vpermq zmm11,zmm0,0xb1 + vpermq zmm12,zmm1,0xb1 + vpermq zmm13,zmm2,0xb1 + vpaddq zmm3,zmm3,zmm14 + vpaddq zmm4,zmm4,zmm15 + vpaddq zmm0,zmm0,zmm11 + vpaddq zmm1,zmm1,zmm12 + vpaddq zmm2,zmm2,zmm13 + + kmovw k3,eax + vpermq zmm14,zmm3,0x2 + vpermq zmm15,zmm4,0x2 + vpermq zmm11,zmm0,0x2 + vpermq zmm12,zmm1,0x2 + vpermq zmm13,zmm2,0x2 + vpaddq zmm3,zmm3,zmm14 + vpaddq zmm4,zmm4,zmm15 + vpaddq zmm0,zmm0,zmm11 + vpaddq zmm1,zmm1,zmm12 + vpaddq zmm2,zmm2,zmm13 + + vextracti64x4 ymm14,zmm3,0x1 + vextracti64x4 ymm15,zmm4,0x1 + vextracti64x4 ymm11,zmm0,0x1 + vextracti64x4 ymm12,zmm1,0x1 + vextracti64x4 ymm13,zmm2,0x1 
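+; Lane reduction: fold the eight 64-bit lanes of each zmm accumulator
+; into lane 0. The vpermq 0xb1 shuffles pair up neighbouring qwords for
+; the vpaddq folds, vpermq 0x2 folds the two 128-bit halves of each
+; 256-bit lane, and the vextracti64x4 extracts plus the masked vpaddq
+; below (k3 = 0x0001, loaded above) add the upper 256 bits into the low
+; lane, zeroing the rest ahead of the final base-2^26 carry propagation.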
+ vpaddq zmm3{k3}{z},zmm3,zmm14 + vpaddq zmm4{k3}{z},zmm4,zmm15 + vpaddq zmm0{k3}{z},zmm0,zmm11 + vpaddq zmm1{k3}{z},zmm1,zmm12 + vpaddq zmm2{k3}{z},zmm2,zmm13 + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpsrldq ymm9,ymm7,6 + vpsrldq ymm10,ymm8,6 + vpunpckhqdq ymm6,ymm7,ymm8 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpunpcklqdq ymm9,ymm9,ymm10 + vpunpcklqdq ymm7,ymm7,ymm8 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpsrlq ymm10,ymm9,30 + vpsrlq ymm9,ymm9,4 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpsrlq ymm8,ymm7,26 + vpsrlq ymm6,ymm6,40 + vpaddq ymm0,ymm0,ymm15 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpand ymm9,ymm9,ymm5 + vpand ymm7,ymm7,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm2,ymm9,ymm2 + vpand ymm8,ymm8,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + vpaddq ymm4,ymm4,ymm14 + + lea rax,[144+rsp] + add rdx,64 + jnz NEAR $L$tail_avx2_512 + + vpsubq ymm2,ymm2,ymm9 + vmovd DWORD[(-112)+rdi],xmm0 + vmovd DWORD[(-108)+rdi],xmm1 + vmovd DWORD[(-104)+rdi],xmm2 + vmovd DWORD[(-100)+rdi],xmm3 + vmovd DWORD[(-96)+rdi],xmm4 + vzeroall + movdqa xmm6,XMMWORD[80+r11] + movdqa xmm7,XMMWORD[96+r11] + movdqa xmm8,XMMWORD[112+r11] + movdqa xmm9,XMMWORD[128+r11] + movdqa xmm10,XMMWORD[144+r11] + movdqa xmm11,XMMWORD[160+r11] + movdqa xmm12,XMMWORD[176+r11] + movdqa xmm13,XMMWORD[192+r11] + movdqa xmm14,XMMWORD[208+r11] + movdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx512_epilogue: + DB 0F3h,0C3h ;repret + + +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + lea rax,[48+rax] + + mov rbx,QWORD[((-8))+rax] + mov rbp,QWORD[((-16))+rax] + mov r12,QWORD[((-24))+rax] + mov r13,QWORD[((-32))+rax] + mov r14,QWORD[((-40))+rax] + mov r15,QWORD[((-48))+rax] + mov QWORD[144+r8],rbx + mov QWORD[160+r8],rbp + mov QWORD[216+r8],r12 + mov QWORD[224+r8],r13 + mov QWORD[232+r8],r14 + mov QWORD[240+r8],r15 + + jmp NEAR $L$common_seh_tail + + + +ALIGN 16 +avx_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + mov rax,QWORD[208+r8] + + lea rsi,[80+rax] + lea rax,[248+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + 
mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + DB 0F3h,0C3h ;repret + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_poly1305_init_x86_64 wrt ..imagebase + DD $L$SEH_end_poly1305_init_x86_64 wrt ..imagebase + DD $L$SEH_info_poly1305_init wrt ..imagebase + + DD $L$SEH_begin_poly1305_blocks_x86_64 wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_x86_64 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks wrt ..imagebase + + DD $L$SEH_begin_poly1305_emit_x86_64 wrt ..imagebase + DD $L$SEH_end_poly1305_emit_x86_64 wrt ..imagebase + DD $L$SEH_info_poly1305_emit wrt ..imagebase + DD $L$SEH_begin_poly1305_blocks_avx wrt ..imagebase + DD $L$base2_64_avx wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx_1 wrt ..imagebase + + DD $L$base2_64_avx wrt ..imagebase + DD $L$even_avx wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx_2 wrt ..imagebase + + DD $L$even_avx wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_avx wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx_3 wrt ..imagebase + + DD $L$SEH_begin_poly1305_emit_avx wrt ..imagebase + DD $L$SEH_end_poly1305_emit_avx wrt ..imagebase + DD $L$SEH_info_poly1305_emit_avx wrt ..imagebase + DD $L$SEH_begin_poly1305_blocks_avx2 wrt ..imagebase + DD $L$base2_64_avx2 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx2_1 wrt ..imagebase + + DD $L$base2_64_avx2 wrt ..imagebase + DD $L$even_avx2 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx2_2 wrt ..imagebase + + DD $L$even_avx2 wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_avx2 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx2_3 wrt ..imagebase + DD $L$SEH_begin_poly1305_blocks_avx512 wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_avx512 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx512 wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_info_poly1305_init: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$SEH_begin_poly1305_init_x86_64 wrt ..imagebase,$L$SEH_begin_poly1305_init_x86_64 wrt ..imagebase + +$L$SEH_info_poly1305_blocks: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$blocks_body wrt ..imagebase,$L$blocks_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_emit: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$SEH_begin_poly1305_emit_x86_64 wrt ..imagebase,$L$SEH_begin_poly1305_emit_x86_64 wrt ..imagebase +$L$SEH_info_poly1305_blocks_avx_1: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$blocks_avx_body wrt ..imagebase,$L$blocks_avx_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx_2: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$base2_64_avx_body wrt ..imagebase,$L$base2_64_avx_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx_3: +DB 9,0,0,0 + DD avx_handler wrt ..imagebase + DD $L$do_avx_body wrt ..imagebase,$L$do_avx_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_emit_avx: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$SEH_begin_poly1305_emit_avx wrt ..imagebase,$L$SEH_begin_poly1305_emit_avx wrt ..imagebase +$L$SEH_info_poly1305_blocks_avx2_1: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$blocks_avx2_body wrt ..imagebase,$L$blocks_avx2_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx2_2: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$base2_64_avx2_body wrt ..imagebase,$L$base2_64_avx2_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx2_3: +DB 9,0,0,0 + DD avx_handler wrt ..imagebase + DD $L$do_avx2_body wrt 
..imagebase,$L$do_avx2_epilogue wrt ..imagebase +$L$SEH_info_poly1305_blocks_avx512: +DB 9,0,0,0 + DD avx_handler wrt ..imagebase + DD $L$do_avx512_body wrt ..imagebase,$L$do_avx512_epilogue wrt ..imagebase diff --git a/crypto/siphash.cpp b/crypto/siphash.cpp new file mode 100644 index 0000000..98033a9 --- /dev/null +++ b/crypto/siphash.cpp @@ -0,0 +1,193 @@ +/* Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. + * + * This file is provided under a dual BSD/GPLv2 license. + * + * SipHash: a fast short-input PRF + * https://131002.net/siphash/ + * + * This implementation is specifically for SipHash2-4 for a secure PRF + * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for + * hashtables. + */ +#include "stdafx.h" + +#include "crypto/siphash.h" +#include "tunsafe_endian.h" + +#define SIPROUND \ + do { \ + v0 += v1; v1 = rol64(v1, 13); v1 ^= v0; v0 = rol64(v0, 32); \ + v2 += v3; v3 = rol64(v3, 16); v3 ^= v2; \ + v0 += v3; v3 = rol64(v3, 21); v3 ^= v0; \ + v2 += v1; v1 = rol64(v1, 17); v1 ^= v2; v2 = rol64(v2, 32); \ + } while (0) + +#define PREAMBLE(len) \ + uint64 v0 = 0x736f6d6570736575ULL; \ + uint64 v1 = 0x646f72616e646f6dULL; \ + uint64 v2 = 0x6c7967656e657261ULL; \ + uint64 v3 = 0x7465646279746573ULL; \ + uint64 b = ((uint64)(len)) << 56; \ + v3 ^= key->key[1]; \ + v2 ^= key->key[0]; \ + v1 ^= key->key[1]; \ + v0 ^= key->key[0]; + +#define POSTAMBLE \ + v3 ^= b; \ + SIPROUND; \ + SIPROUND; \ + v0 ^= b; \ + v2 ^= 0xff; \ + SIPROUND; \ + SIPROUND; \ + SIPROUND; \ + SIPROUND; \ + return (v0 ^ v1) ^ (v2 ^ v3); + +uint64 siphash(const void *data, size_t len, const siphash_key_t *key) { + const uint8 *end = (uint8*)data + len - (len % sizeof(uint64)); + const uint8 left = len & (sizeof(uint64) - 1); + uint64 m; + PREAMBLE(len) + for (; data != end; data = (uint8*)data + sizeof(uint64)) { + m = ReadLE64(data); + v3 ^= m; + SIPROUND; + SIPROUND; + v0 ^= m; + } + switch (left) { + case 7: b |= ((uint64)end[6]) << 48; + case 6: b |= ((uint64)end[5]) << 40; + case 5: b |= ((uint64)end[4]) << 32; + case 4: b |= ReadLE32(data); break; + case 3: b |= ((uint64)end[2]) << 16; + case 2: b |= ReadLE16(data); break; + case 1: b |= end[0]; + } + POSTAMBLE +} + +/** + * siphash_1u64 - compute 64-bit siphash PRF value of a uint64 + * @first: first uint64 + * @key: the siphash key + */ +uint64 siphash_1u64(const uint64 first, const siphash_key_t *key) +{ + PREAMBLE(8) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + POSTAMBLE +} + +/** + * siphash_2u64 - compute 64-bit siphash PRF value of 2 uint64 + * @first: first uint64 + * @second: second uint64 + * @key: the siphash key + */ +uint64 siphash_2u64(const uint64 first, const uint64 second, const siphash_key_t *key) +{ + PREAMBLE(16) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + v3 ^= second; + SIPROUND; + SIPROUND; + v0 ^= second; + POSTAMBLE +} + +/** + * siphash_3u64 - compute 64-bit siphash PRF value of 3 uint64 + * @first: first uint64 + * @second: second uint64 + * @third: third uint64 + * @key: the siphash key + */ +uint64 siphash_3u64(const uint64 first, const uint64 second, const uint64 third, + const siphash_key_t *key) +{ + PREAMBLE(24) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + v3 ^= second; + SIPROUND; + SIPROUND; + v0 ^= second; + v3 ^= third; + SIPROUND; + SIPROUND; + v0 ^= third; + POSTAMBLE +} + +/** + * siphash_4u64 - compute 64-bit siphash PRF value of 4 uint64 + * @first: first uint64 + * @second: second uint64 + * @third: third uint64 + * @fourth: fourth uint64 + * @key: the
siphash key + */ +uint64 siphash_4u64(const uint64 first, const uint64 second, const uint64 third, + const uint64 fourth, const siphash_key_t *key) +{ + PREAMBLE(32) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + v3 ^= second; + SIPROUND; + SIPROUND; + v0 ^= second; + v3 ^= third; + SIPROUND; + SIPROUND; + v0 ^= third; + v3 ^= fourth; + SIPROUND; + SIPROUND; + v0 ^= fourth; + POSTAMBLE +} + +uint64 siphash_1u32(const uint32 first, const siphash_key_t *key) +{ + PREAMBLE(4) + b |= first; + POSTAMBLE +} + +uint64 siphash_3u32(const uint32 first, const uint32 second, const uint32 third, + const siphash_key_t *key) +{ + uint64 combined = (uint64)second << 32 | first; + PREAMBLE(12) + v3 ^= combined; + SIPROUND; + SIPROUND; + v0 ^= combined; + b |= third; + POSTAMBLE +} + +uint64 siphash_u64_u32(const uint64 combined, const uint32 third, const siphash_key_t *key) { + PREAMBLE(12) + v3 ^= combined; + SIPROUND; + SIPROUND; + v0 ^= combined; + b |= third; + POSTAMBLE +} + diff --git a/crypto/siphash.h b/crypto/siphash.h new file mode 100644 index 0000000..3b5dc74 --- /dev/null +++ b/crypto/siphash.h @@ -0,0 +1,53 @@ +/* Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. + * + * This file is provided under a dual BSD/GPLv2 license. + * + * SipHash: a fast short-input PRF + * https://131002.net/siphash/ + * + * This implementation is specifically for SipHash2-4 for a secure PRF + * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for + * hashtables. + */ + +#ifndef TUNSAFE_CRYPTO_SIPHASH_H_ +#define TUNSAFE_CRYPTO_SIPHASH_H_ + +#include "tunsafe_types.h" + +typedef struct { + uint64 key[2]; +} siphash_key_t; + +uint64 siphash_1u64(const uint64 a, const siphash_key_t *key); +uint64 siphash_2u64(const uint64 a, const uint64 b, const siphash_key_t *key); +uint64 siphash_3u64(const uint64 a, const uint64 b, const uint64 c, + const siphash_key_t *key); +uint64 siphash_4u64(const uint64 a, const uint64 b, const uint64 c, const uint64 d, + const siphash_key_t *key); +uint64 siphash_1u32(const uint32 a, const siphash_key_t *key); +uint64 siphash_3u32(const uint32 a, const uint32 b, const uint32 c, + const siphash_key_t *key); + +static inline uint64 siphash_2u32(const uint32 a, const uint32 b, + const siphash_key_t *key) +{ + return siphash_1u64((uint64)b << 32 | a, key); +} +static inline uint64 siphash_4u32(const uint32 a, const uint32 b, const uint32 c, + const uint32 d, const siphash_key_t *key) +{ + return siphash_2u64((uint64)b << 32 | a, (uint64)d << 32 | c, key); +} + +uint64 siphash_u64_u32(const uint64 combined, const uint32 third, const siphash_key_t *key); + +/** + * siphash - compute 64-bit siphash PRF value + * @data: buffer to hash + * @len: size of @data + * @key: the siphash key + */ +uint64 siphash(const void *data, size_t len, const siphash_key_t *key); + +#endif // TUNSAFE_CRYPTO_SIPHASH_H_ diff --git a/crypto/x86_64-xlate.pl b/crypto/x86_64-xlate.pl new file mode 100644 index 0000000..c1ae6ad --- /dev/null +++ b/crypto/x86_64-xlate.pl @@ -0,0 +1,1433 @@ +#! /usr/bin/env perl +# Copyright 2005-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + + +# Ascetic x86_64 AT&T to MASM/NASM assembler translator by <appro@openssl.org>. +# +# Why AT&T to MASM and not vice versa? Several reasons.
Because AT&T +# format is way easier to parse. Because it's simpler to "gear" from +# Unix ABI to Windows one [see cross-reference "card" at the end of +# file]. Because Linux targets were available first... +# +# In addition the script also "distills" code suitable for GNU +# assembler, so that it can be compiled with more rigid assemblers, +# such as Solaris /usr/ccs/bin/as. +# +# This translator is not designed to convert *arbitrary* assembler +# code from AT&T format to MASM one. It's designed to convert just +# enough to provide for dual-ABI OpenSSL modules development... +# There *are* limitations and you might have to modify your assembler +# code or this script to achieve the desired result... +# +# Currently recognized limitations: +# +# - can't use multiple ops per line; +# +# Dual-ABI styling rules. +# +# 1. Adhere to Unix register and stack layout [see cross-reference +# ABI "card" at the end for explanation]. +# 2. Forget about "red zone," stick to more traditional blended +# stack frame allocation. If volatile storage is actually required +# that is. If not, just leave the stack as is. +# 3. Functions tagged with ".type name,@function" get crafted with +# unified Win64 prologue and epilogue automatically. If you want +# to take care of ABI differences yourself, tag functions as +# ".type name,@abi-omnipotent" instead. +# 4. To optimize the Win64 prologue you can specify number of input +# arguments as ".type name,@function,N." Keep in mind that if N is +# larger than 6, then you *have to* write "abi-omnipotent" code, +# because >6 cases can't be addressed with unified prologue. +# 5. Name local labels as .L*, do *not* use dynamic labels such as 1: +# (sorry about latter). +# 6. Don't use [or hand-code with .byte] "rep ret." "ret" mnemonic is +# required to identify the spots, where to inject Win64 epilogue! +# But on the pros, it's then prefixed with rep automatically:-) +# 7. Stick to explicit ip-relative addressing. If you have to use +# GOTPCREL addressing, stick to mov symbol@GOTPCREL(%rip),%r??. +# Both are recognized and translated to proper Win64 addressing +# modes. +# +# 8. In order to provide for structured exception handling unified +# Win64 prologue copies %rsp value to %rax. For further details +# see SEH paragraph at the end. +# 9. .init segment is allowed to contain calls to functions only. +# a. If function accepts more than 4 arguments *and* >4th argument +# is declared as non 64-bit value, do clear its upper part. + + +use strict; + +my $flavour = shift; +my $output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +open STDOUT,">$output" || die "can't open $output: $!" 
+ if (defined($output)); + +my $gas=1; $gas=0 if ($output =~ /\.asm$/); +my $elf=1; $elf=0 if (!$gas); +my $win64=0; +my $prefix=""; +my $decor=".L"; + +my $masmref=8 + 50727*2**-32; # 8.00.50727 shipped with VS2005 +my $masm=0; +my $PTR=" PTR"; + +my $nasmref=2.03; +my $nasm=0; + +if ($flavour eq "mingw64") { $gas=1; $elf=0; $win64=1; + $prefix=`echo __USER_LABEL_PREFIX__ | $ENV{CC} -E -P -`; + $prefix =~ s|\R$||; # Better chomp + } +elsif ($flavour eq "macosx") { $gas=1; $elf=0; $prefix="_"; $decor="L\$"; } +elsif ($flavour eq "masm") { $gas=0; $elf=0; $masm=$masmref; $win64=1; $decor="\$L\$"; } +elsif ($flavour eq "nasm") { $gas=0; $elf=0; $nasm=$nasmref; $win64=1; $decor="\$L\$"; $PTR=""; } +elsif (!$gas) +{ if ($ENV{ASM} =~ m/nasm/ && `nasm -v` =~ m/version ([0-9]+)\.([0-9]+)/i) + { $nasm = $1 + $2*0.01; $PTR=""; } + elsif (`ml64 2>&1` =~ m/Version ([0-9]+)\.([0-9]+)(\.([0-9]+))?/) + { $masm = $1 + $2*2**-16 + $4*2**-32; } + die "no assembler found on %PATH%" if (!($nasm || $masm)); + $win64=1; + $elf=0; + $decor="\$L\$"; +} + +my $current_segment; +my $current_function; +my %globals; + +{ package opcode; # pick up opcodes + sub re { + my ($class, $line) = @_; + my $self = {}; + my $ret; + + if ($$line =~ /^([a-z][a-z0-9]*)/i) { + bless $self,$class; + $self->{op} = $1; + $ret = $self; + $$line = substr($$line,@+[0]); $$line =~ s/^\s+//; + + undef $self->{sz}; + if ($self->{op} =~ /^(movz)x?([bw]).*/) { # movz is pain... + $self->{op} = $1; + $self->{sz} = $2; + } elsif ($self->{op} =~ /call|jmp/) { + $self->{sz} = ""; + } elsif ($self->{op} =~ /^p/ && $' !~ /^(ush|op|insrw)/) { # SSEn + $self->{sz} = ""; + } elsif ($self->{op} =~ /^[vk]/) { # VEX or k* such as kmov + $self->{sz} = ""; + } elsif ($self->{op} =~ /mov[dq]/ && $$line =~ /%xmm/) { + $self->{sz} = ""; + } elsif ($self->{op} =~ /([a-z]{3,})([qlwb])$/) { + $self->{op} = $1; + $self->{sz} = $2; + } + } + $ret; + } + sub size { + my ($self, $sz) = @_; + $self->{sz} = $sz if (defined($sz) && !defined($self->{sz})); + $self->{sz}; + } + sub out { + my $self = shift; + if ($gas) { + if ($self->{op} eq "movz") { # movz is pain... + sprintf "%s%s%s",$self->{op},$self->{sz},shift; + } elsif ($self->{op} =~ /^set/) { + "$self->{op}"; + } elsif ($self->{op} eq "ret") { + my $epilogue = ""; + if ($win64 && $current_function->{abi} eq "svr4") { + $epilogue = "movq 8(%rsp),%rdi\n\t" . + "movq 16(%rsp),%rsi\n\t"; + } + #$epilogue . ".byte 0xf3,0xc3"; + $epilogue . "ret"; + } elsif ($self->{op} eq "call" && !$elf && $current_segment eq ".init") { + ".p2align\t3\n\t.quad"; + } else { + "$self->{op}$self->{sz}"; + } + } else { + $self->{op} =~ s/^movz/movzx/; + if ($self->{op} eq "ret") { + $self->{op} = ""; + if ($win64 && $current_function->{abi} eq "svr4") { + $self->{op} = "mov rdi,QWORD$PTR\[8+rsp\]\t;WIN64 epilogue\n\t". 
+                        "mov rsi,QWORD$PTR\[16+rsp\]\n\t";
+            }
+            $self->{op} .= "ret";
+        } elsif ($self->{op} =~ /^(pop|push)f/) {
+            $self->{op} .= $self->{sz};
+        } elsif ($self->{op} eq "call" && $current_segment eq ".CRT\$XCU") {
+            $self->{op} = "\tDQ";
+        }
+        $self->{op};
+        }
+    }
+    sub mnemonic {
+        my ($self, $op) = @_;
+        $self->{op}=$op if (defined($op));
+        $self->{op};
+    }
+}
+{ package const;    # pick up constants, which start with $
+    sub re {
+        my ($class, $line) = @_;
+        my $self = {};
+        my $ret;
+
+        if ($$line =~ /^\$([^,]+)/) {
+            bless $self, $class;
+            $self->{value} = $1;
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+        }
+        $ret;
+    }
+    sub out {
+        my $self = shift;
+
+        $self->{value} =~ s/\b(0b[0-1]+)/oct($1)/eig;
+        if ($gas) {
+            # Solaris /usr/ccs/bin/as can't handle multiplications
+            # in $self->{value}
+            my $value = $self->{value};
+            no warnings;    # oct might complain about overflow, ignore here...
+            $value =~ s/(?<![\w\$\.])(0x?[0-9a-f]+)/oct($1)/egi;
+            if ($value =~ s/([0-9]+\s*[\*\/\%]\s*[0-9]+)/eval($1)/eg) {
+                $self->{value} = $value;
+            }
+            sprintf "\$%s",$self->{value};
+        } else {
+            my $value = $self->{value};
+            $value =~ s/0x([0-9a-f]+)/0$1h/ig if ($masm);
+            sprintf "%s",$value;
+        }
+    }
+}
+{ package ea;       # pick up effective addresses: expr(%reg,%reg,scale)
+
+    my %szmap = (   b=>"BYTE$PTR",    w=>"WORD$PTR",
+                    l=>"DWORD$PTR",   d=>"DWORD$PTR",
+                    q=>"QWORD$PTR",   o=>"OWORD$PTR",
+                    x=>"XMMWORD$PTR", y=>"YMMWORD$PTR",
+                    z=>"ZMMWORD$PTR" ) if (!$gas);
+
+    sub re {
+        my ($class, $line, $opcode) = @_;
+        my $self = {};
+        my $ret;
+
+        # optional * ----vvv--- appears in indirect jmp/call
+        if ($$line =~ /^(\*?)([^\(,]*)\(([%\w,]+)\)((?:{[^}]+})*)/) {
+            bless $self, $class;
+            $self->{asterisk} = $1;
+            $self->{label} = $2;
+            ($self->{base},$self->{index},$self->{scale})=split(/,/,$3);
+            $self->{scale} = 1 if (!defined($self->{scale}));
+            $self->{opmask} = $4;
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+
+            if ($win64 && $self->{label} =~ s/\@GOTPCREL//) {
+                die if ($opcode->mnemonic() ne "mov");
+                $opcode->mnemonic("lea");
+            }
+            $self->{base}  =~ s/^%//;
+            $self->{index} =~ s/^%// if (defined($self->{index}));
+            $self->{opcode} = $opcode;
+        }
+        $ret;
+    }
+    sub size {}
+    sub out {
+        my ($self, $sz) = @_;
+
+        $self->{label} =~ s/([_a-z][_a-z0-9]*)/$globals{$1} or $1/gei;
+        $self->{label} =~ s/\.L/$decor/g;
+
+        # Silently convert all EAs to 64-bit. This is required for
+        # elder GNU assembler and results in more compact code,
+        # *but* most importantly AES module depends on this feature!
+        $self->{index} =~ s/^[er](.?[0-9xpi])[d]?$/r\1/;
+        $self->{base}  =~ s/^[er](.?[0-9xpi])[d]?$/r\1/;
+
+        # Solaris /usr/ccs/bin/as can't handle multiplications
+        # in $self->{label}...
+        use integer;
+        $self->{label} =~ s/(?<![\w\$\.])(0x?[0-9a-f]+)/oct($1)/egi;
+        $self->{label} =~ s/\b([0-9]+\s*[\*\/\%]\s*[0-9]+)\b/eval($1)/eg;
+
+        # Some assemblers insist on signed presentation of 32-bit
+        # offsets, but sign extension is a tricky business in perl...
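+        # To illustrate (assuming 64-bit integer perl): under "use
+        # integer" the expression 4294967292<<32>>32 evaluates to -4,
+        # i.e. the arithmetic right shift replicates bit 31 into the
+        # upper half. On a 32-bit perl (1<<31)<<1 overflows to 0 and
+        # the no-op $1>>0 branch below is taken instead.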
+        if ((1<<31)<<1) {
+            $self->{label} =~ s/\b([0-9]+)\b/$1<<32>>32/eg;
+        } else {
+            $self->{label} =~ s/\b([0-9]+)\b/$1>>0/eg;
+        }
+
+        # if base register is %rbp or %r13, see if it's possible to
+        # flip base and index registers [for better performance]
+        if (!$self->{label} && $self->{index} && $self->{scale}==1 &&
+            $self->{base} =~ /(rbp|r13)/) {
+                $self->{base} = $self->{index}; $self->{index} = $1;
+        }
+
+        if ($gas) {
+            $self->{label} =~ s/^___imp_/__imp__/ if ($flavour eq "mingw64");
+
+            if (defined($self->{index})) {
+                sprintf "%s%s(%s,%%%s,%d)%s",
+                        $self->{asterisk},$self->{label},
+                        $self->{base}?"%$self->{base}":"",
+                        $self->{index},$self->{scale},
+                        $self->{opmask};
+            } else {
+                sprintf "%s%s(%%%s)%s", $self->{asterisk},$self->{label},
+                        $self->{base},$self->{opmask};
+            }
+        } else {
+            $self->{label} =~ s/\./\$/g;
+            $self->{label} =~ s/(?<![\w\$\.])0x([0-9a-f]+)/0$1h/ig;
+            $self->{label} = "($self->{label})" if ($self->{label} =~ /[\*\+\-\/]/);
+
+            my $mnemonic = $self->{opcode}->mnemonic();
+            ($self->{asterisk})                         && ($sz="q") ||
+            ($mnemonic =~ /^v?mov([qd])$/)              && ($sz=$1)  ||
+            ($mnemonic =~ /^v?pinsr([qdwb])$/)          && ($sz=$1)  ||
+            ($mnemonic =~ /^vpbroadcast([qdwb])$/)      && ($sz=$1)  ||
+            ($mnemonic =~ /^v(?!perm)[a-z]+[fi]128$/)   && ($sz="x");
+
+            $self->{opmask} =~ s/%(k[0-7])/$1/;
+
+            if (defined($self->{index})) {
+                sprintf "%s[%s%s*%d%s]%s",$szmap{$sz},
+                        $self->{label}?"$self->{label}+":"",
+                        $self->{index},$self->{scale},
+                        $self->{base}?"+$self->{base}":"",
+                        $self->{opmask};
+            } elsif ($self->{base} eq "rip") {
+                sprintf "%s[%s]",$szmap{$sz},$self->{label};
+            } else {
+                sprintf "%s[%s%s]%s",   $szmap{$sz},
+                        $self->{label}?"$self->{label}+":"",
+                        $self->{base},$self->{opmask};
+            }
+        }
+    }
+}
+{ package register; # pick up registers, which start with %.
+    sub re {
+        my ($class, $line, $opcode) = @_;
+        my $self = {};
+        my $ret;
+
+        # optional * ----vvv--- appears in indirect jmp/call
+        if ($$line =~ /^(\*?)%(\w+)((?:{[^}]+})*)/) {
+            bless $self,$class;
+            $self->{asterisk} = $1;
+            $self->{value} = $2;
+            $self->{opmask} = $3;
+            $opcode->size($self->size());
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+        }
+        $ret;
+    }
+    sub size {
+        my $self = shift;
+        my $ret;
+
+        if    ($self->{value} =~ /^r[\d]+b$/i)  { $ret="b"; }
+        elsif ($self->{value} =~ /^r[\d]+w$/i)  { $ret="w"; }
+        elsif ($self->{value} =~ /^r[\d]+d$/i)  { $ret="l"; }
+        elsif ($self->{value} =~ /^r[\w]+$/i)   { $ret="q"; }
+        elsif ($self->{value} =~ /^[a-d][hl]$/i){ $ret="b"; }
+        elsif ($self->{value} =~ /^[\w]{2}l$/i) { $ret="b"; }
+        elsif ($self->{value} =~ /^[\w]{2}$/i)  { $ret="w"; }
+        elsif ($self->{value} =~ /^e[a-z]{2}$/i){ $ret="l"; }
+
+        $ret;
+    }
+    sub out {
+        my $self = shift;
+        if ($gas)   { sprintf "%s%%%s%s",   $self->{asterisk},
+                                            $self->{value},
+                                            $self->{opmask}; }
+        else        { $self->{opmask} =~ s/%(k[0-7])/$1/;
+                      $self->{value}.$self->{opmask}; }
+    }
+}
+{ package label;    # pick up labels, which end with :
+    sub re {
+        my ($class, $line) = @_;
+        my $self = {};
+        my $ret;
+
+        if ($$line =~ /(^[\.\w]+)\:/) {
+            bless $self,$class;
+            $self->{value} = $1;
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+
+            $self->{value} =~ s/^\.L/$decor/;
+        }
+        $ret;
+    }
+    sub out {
+        my $self = shift;
+
+        if ($gas) {
+            my $func = ($globals{$self->{value}} or $self->{value}) .
":"; + if ($win64 && $current_function->{name} eq $self->{value} + && $current_function->{abi} eq "svr4") { + $func .= "\n"; + $func .= " movq %rdi,8(%rsp)\n"; + $func .= " movq %rsi,16(%rsp)\n"; + $func .= " movq %rsp,%rax\n"; + $func .= "${decor}SEH_begin_$current_function->{name}:\n"; + my $narg = $current_function->{narg}; + $narg=6 if (!defined($narg)); + $func .= " movq %rcx,%rdi\n" if ($narg>0); + $func .= " movq %rdx,%rsi\n" if ($narg>1); + $func .= " movq %r8,%rdx\n" if ($narg>2); + $func .= " movq %r9,%rcx\n" if ($narg>3); + $func .= " movq 40(%rsp),%r8\n" if ($narg>4); + $func .= " movq 48(%rsp),%r9\n" if ($narg>5); + } + $func; + } elsif ($self->{value} ne "$current_function->{name}") { + # Make all labels in masm global. + $self->{value} .= ":" if ($masm); + $self->{value} . ":"; + } elsif ($win64 && $current_function->{abi} eq "svr4") { + my $func = "$current_function->{name}" . + ($nasm ? ":" : "\tPROC $current_function->{scope}") . + "\n"; + $func .= " mov QWORD$PTR\[8+rsp\],rdi\t;WIN64 prologue\n"; + $func .= " mov QWORD$PTR\[16+rsp\],rsi\n"; + $func .= " mov rax,rsp\n"; + $func .= "${decor}SEH_begin_$current_function->{name}:"; + $func .= ":" if ($masm); + $func .= "\n"; + my $narg = $current_function->{narg}; + $narg=6 if (!defined($narg)); + $func .= " mov rdi,rcx\n" if ($narg>0); + $func .= " mov rsi,rdx\n" if ($narg>1); + $func .= " mov rdx,r8\n" if ($narg>2); + $func .= " mov rcx,r9\n" if ($narg>3); + $func .= " mov r8,QWORD$PTR\[40+rsp\]\n" if ($narg>4); + $func .= " mov r9,QWORD$PTR\[48+rsp\]\n" if ($narg>5); + $func .= "\n"; + } else { + "$current_function->{name}". + ($nasm ? ":" : "\tPROC $current_function->{scope}"); + } + } +} +{ package expr; # pick up expressions + sub re { + my ($class, $line, $opcode) = @_; + my $self = {}; + my $ret; + + if ($$line =~ /(^[^,]+)/) { + bless $self,$class; + $self->{value} = $1; + $ret = $self; + $$line = substr($$line,@+[0]); $$line =~ s/^\s+//; + + $self->{value} =~ s/\@PLT// if (!$elf); + $self->{value} =~ s/([_a-z][_a-z0-9]*)/$globals{$1} or $1/gei; + $self->{value} =~ s/\.L/$decor/g; + $self->{opcode} = $opcode; + } + $ret; + } + sub out { + my $self = shift; + if ($nasm && $self->{opcode}->mnemonic()=~m/^j(?![re]cxz)/) { + "NEAR ".$self->{value}; + } else { + $self->{value}; + } + } +} +{ package cfi_directive; + # CFI directives annotate instructions that are significant for + # stack unwinding procedure compliant with DWARF specification, + # see http://dwarfstd.org/. Besides naturally expected for this + # script platform-specific filtering function, this module adds + # three auxiliary synthetic directives not recognized by [GNU] + # assembler: + # + # - .cfi_push to annotate push instructions in prologue, which + # translates to .cfi_adjust_cfa_offset (if needed) and + # .cfi_offset; + # - .cfi_pop to annotate pop instructions in epilogue, which + # translates to .cfi_adjust_cfa_offset (if needed) and + # .cfi_restore; + # - [and most notably] .cfi_cfa_expression which encodes + # DW_CFA_def_cfa_expression and passes it to .cfi_escape as + # byte vector; + # + # CFA expressions were introduced in DWARF specification version + # 3 and describe how to deduce CFA, Canonical Frame Address. This + # becomes handy if your stack frame is variable and you can't + # spare register for [previous] frame pointer. Suggested directive + # syntax is made-up mix of DWARF operator suffixes [subset of] + # and references to registers with optional bias. 
Following example
+    # describes offloaded *original* stack pointer at specific offset
+    # from *current* stack pointer:
+    #
+    #   .cfi_cfa_expression     %rsp+40,deref,+8
+    #
+    # Final +8 has everything to do with the fact that CFA is defined
+    # as reference to top of caller's stack, and on x86_64 call to
+    # subroutine pushes 8-byte return address. In other words original
+    # stack pointer upon entry to a subroutine is 8 bytes off from CFA.
+
+    # Below constants are taken from "DWARF Expressions" section of the
+    # DWARF specification, section is numbered 7.7 in versions 3 and 4.
+    my %DW_OP_simple = (    # no-arg operators, mapped directly
+        deref   => 0x06,    dup     => 0x12,
+        drop    => 0x13,    over    => 0x14,
+        pick    => 0x15,    swap    => 0x16,
+        rot     => 0x17,    xderef  => 0x18,
+
+        abs     => 0x19,    and     => 0x1a,
+        div     => 0x1b,    minus   => 0x1c,
+        mod     => 0x1d,    mul     => 0x1e,
+        neg     => 0x1f,    not     => 0x20,
+        or      => 0x21,    plus    => 0x22,
+        shl     => 0x24,    shr     => 0x25,
+        shra    => 0x26,    xor     => 0x27,
+        );
+
+    my %DW_OP_complex = (   # used in specific subroutines
+        constu      => 0x10,    # uleb128
+        consts      => 0x11,    # sleb128
+        plus_uconst => 0x23,    # uleb128
+        lit0        => 0x30,    # add 0-31 to opcode
+        reg0        => 0x50,    # add 0-31 to opcode
+        breg0       => 0x70,    # add 0-31 to opcode, sleb128
+        regx        => 0x90,    # uleb128
+        fbreg       => 0x91,    # sleb128
+        bregx       => 0x92,    # uleb128, sleb128
+        piece       => 0x93,    # uleb128
+        );
+
+    # Following constants are defined in x86_64 ABI supplement, for
+    # example available at https://www.uclibc.org/docs/psABI-x86_64.pdf,
+    # see section 3.7 "Stack Unwind Algorithm".
+    my %DW_reg_idx = (
+        "%rax"=>0,  "%rdx"=>1,  "%rcx"=>2,  "%rbx"=>3,
+        "%rsi"=>4,  "%rdi"=>5,  "%rbp"=>6,  "%rsp"=>7,
+        "%r8" =>8,  "%r9" =>9,  "%r10"=>10, "%r11"=>11,
+        "%r12"=>12, "%r13"=>13, "%r14"=>14, "%r15"=>15
+        );
+
+    my ($cfa_reg, $cfa_rsp);
+
+    # [us]leb128 format is variable-length integer representation base
+    # 128, with most significant bit of each byte being 0 denoting
+    # *last* most significant digit. See "Variable Length Data" in the
+    # DWARF specification, numbered 7.6 at least in versions 3 and 4.
+    sub sleb128 {
+        use integer;    # get right shift extend sign
+
+        my $val = shift;
+        my $sign = ($val < 0) ? -1 : 0;
+        my @ret = ();
+
+        while(1) {
+            push @ret, $val&0x7f;
+
+            # see if remaining bits are same and equal to most
+            # significant bit of the current digit, if so, it's
+            # last digit...
+            last if (($val>>6) == $sign);
+
+            @ret[-1] |= 0x80;
+            $val >>= 7;
+        }
+
+        return @ret;
+    }
+    sub uleb128 {
+        my $val = shift;
+        my @ret = ();
+
+        while(1) {
+            push @ret, $val&0x7f;
+
+            # see if it's last significant digit...
+            last if (($val >>= 7) == 0);
+
+            @ret[-1] |= 0x80;
+        }
+
+        return @ret;
+    }
+    sub const {
+        my $val = shift;
+
+        if ($val >= 0 && $val < 32) {
+            return ($DW_OP_complex{lit0}+$val);
+        }
+        return ($DW_OP_complex{consts}, sleb128($val));
+    }
+    sub reg {
+        my $val = shift;
+
+        return if ($val !~ m/^(%r\w+)(?:([\+\-])((?:0x)?[0-9a-f]+))?/);
+
+        my $reg = $DW_reg_idx{$1};
+        my $off = eval ("0 $2 $3");
+
+        return (($DW_OP_complex{breg0} + $reg), sleb128($off));
+        # Yes, we use DW_OP_bregX+0 to push register value and not
+        # DW_OP_regX, because the latter would require even DW_OP_piece,
+        # which would be a waste under the circumstances. If you have
+        # to use DW_OP_regX, use "regx:N"...
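+        # A worked example (hypothetical input, matching the header
+        # comment above): reg("%rsp+40") yields (0x77, 0x28), i.e.
+        # DW_OP_breg7 (%rsp is DWARF register 7) followed by
+        # sleb128(40); so ".cfi_cfa_expression %rsp+40,deref,+8"
+        # comes out as ".cfi_escape 0x0f,0x05,0x77,0x28,0x06,0x23,0x08".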
+ } + sub cfa_expression { + my $line = shift; + my @ret; + + foreach my $token (split(/,\s*/,$line)) { + if ($token =~ /^%r/) { + push @ret,reg($token); + } elsif ($token =~ /((?:0x)?[0-9a-f]+)\((%r\w+)\)/) { + push @ret,reg("$2+$1"); + } elsif ($token =~ /(\w+):(\-?(?:0x)?[0-9a-f]+)(U?)/i) { + my $i = 1*eval($2); + push @ret,$DW_OP_complex{$1}, ($3 ? uleb128($i) : sleb128($i)); + } elsif (my $i = 1*eval($token) or $token eq "0") { + if ($token =~ /^\+/) { + push @ret,$DW_OP_complex{plus_uconst},uleb128($i); + } else { + push @ret,const($i); + } + } else { + push @ret,$DW_OP_simple{$token}; + } + } + + # Finally we return DW_CFA_def_cfa_expression, 15, followed by + # length of the expression and of course the expression itself. + return (15,scalar(@ret),@ret); + } + sub re { + my ($class, $line) = @_; + my $self = {}; + my $ret; + + if ($$line =~ s/^\s*\.cfi_(\w+)\s*//) { + bless $self,$class; + $ret = $self; + undef $self->{value}; + my $dir = $1; + + SWITCH: for ($dir) { + # What is $cfa_rsp? Effectively it's difference between %rsp + # value and current CFA, Canonical Frame Address, which is + # why it starts with -8. Recall that CFA is top of caller's + # stack... + /startproc/ && do { ($cfa_reg, $cfa_rsp) = ("%rsp", -8); last; }; + /endproc/ && do { ($cfa_reg, $cfa_rsp) = ("%rsp", 0); last; }; + /def_cfa_register/ + && do { $cfa_reg = $$line; last; }; + /def_cfa_offset/ + && do { $cfa_rsp = -1*eval($$line) if ($cfa_reg eq "%rsp"); + last; + }; + /adjust_cfa_offset/ + && do { $cfa_rsp -= 1*eval($$line) if ($cfa_reg eq "%rsp"); + last; + }; + /def_cfa/ && do { if ($$line =~ /(%r\w+)\s*,\s*(.+)/) { + $cfa_reg = $1; + $cfa_rsp = -1*eval($2) if ($cfa_reg eq "%rsp"); + } + last; + }; + /push/ && do { $dir = undef; + $cfa_rsp -= 8; + if ($cfa_reg eq "%rsp") { + $self->{value} = ".cfi_adjust_cfa_offset\t8\n"; + } + $self->{value} .= ".cfi_offset\t$$line,$cfa_rsp"; + last; + }; + /pop/ && do { $dir = undef; + $cfa_rsp += 8; + if ($cfa_reg eq "%rsp") { + $self->{value} = ".cfi_adjust_cfa_offset\t-8\n"; + } + $self->{value} .= ".cfi_restore\t$$line"; + last; + }; + /cfa_expression/ + && do { $dir = undef; + $self->{value} = ".cfi_escape\t" . + join(",", map(sprintf("0x%02x", $_), + cfa_expression($$line))); + last; + }; + } + + $self->{value} = ".cfi_$dir\t$$line" if ($dir); + + $$line = ""; + } + + return $ret; + } + sub out { + my $self = shift; + return ($elf ? $self->{value} : undef); + } +} +{ package directive; # pick up directives, which start with . + sub re { + my ($class, $line) = @_; + my $self = {}; + my $ret; + my $dir; + + # chain-call to cfi_directive + $ret = cfi_directive->re($line) and return $ret; + + if ($$line =~ /^\s*(\.\w+)/) { + bless $self,$class; + $dir = $1; + $ret = $self; + undef $self->{value}; + $$line = substr($$line,@+[0]); $$line =~ s/^\s+//; + + SWITCH: for ($dir) { + /\.global|\.globl|\.extern/ + && do { $globals{$$line} = $prefix . 
$$line; + $$line = $globals{$$line} if ($prefix); + last; + }; + /\.type/ && do { my ($sym,$type,$narg) = split(',',$$line); + if ($type eq "\@function") { + undef $current_function; + $current_function->{name} = $sym; + $current_function->{abi} = "svr4"; + $current_function->{narg} = $narg; + $current_function->{scope} = defined($globals{$sym})?"PUBLIC":"PRIVATE"; + } elsif ($type eq "\@abi-omnipotent") { + undef $current_function; + $current_function->{name} = $sym; + $current_function->{scope} = defined($globals{$sym})?"PUBLIC":"PRIVATE"; + } + $$line =~ s/\@abi\-omnipotent/\@function/; + $$line =~ s/\@function.*/\@function/; + last; + }; + /\.asciz/ && do { if ($$line =~ /^"(.*)"$/) { + $dir = ".byte"; + $$line = join(",",unpack("C*",$1),0); + } + last; + }; + /\.rva|\.long|\.quad/ + && do { $$line =~ s/([_a-z][_a-z0-9]*)/$globals{$1} or $1/gei; + $$line =~ s/\.L/$decor/g; + last; + }; + } + + if ($gas) { + $self->{value} = $dir . "\t" . $$line; + + if ($dir =~ /\.extern/) { + $self->{value} = ""; # swallow extern + } elsif (!$elf && $dir =~ /\.type/) { + $self->{value} = ""; + $self->{value} = ".def\t" . ($globals{$1} or $1) . ";\t" . + (defined($globals{$1})?".scl 2;":".scl 3;") . + "\t.type 32;\t.endef" + if ($win64 && $$line =~ /([^,]+),\@function/); + } elsif (!$elf && $dir =~ /\.size/) { + $self->{value} = ""; + if (defined($current_function)) { + $self->{value} .= "${decor}SEH_end_$current_function->{name}:" + if ($win64 && $current_function->{abi} eq "svr4"); + undef $current_function; + } + } elsif (!$elf && $dir =~ /\.align/) { + $self->{value} = ".p2align\t" . (log($$line)/log(2)); + } elsif ($dir eq ".section") { + $current_segment=$$line; + if (!$elf && $current_segment eq ".init") { + if ($flavour eq "macosx") { $self->{value} = ".mod_init_func"; } + elsif ($flavour eq "mingw64") { $self->{value} = ".section\t.ctors"; } + } + } elsif ($dir =~ /\.(text|data)/) { + $current_segment=".$1"; + } elsif ($dir =~ /\.hidden/) { + if ($flavour eq "macosx") { $self->{value} = ".private_extern\t$prefix$$line"; } + elsif ($flavour eq "mingw64") { $self->{value} = ""; } + } elsif ($dir =~ /\.comm/) { + $self->{value} = "$dir\t$prefix$$line"; + $self->{value} =~ s|,([0-9]+),([0-9]+)$|",$1,".log($2)/log(2)|e if ($flavour eq "macosx"); + } + $$line = ""; + return $self; + } + + # non-gas case or nasm/masm + SWITCH: for ($dir) { + /\.text/ && do { my $v=undef; + if ($nasm) { + $v="section .text code align=64\n"; + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $current_segment = ".text\$"; + $v.="$current_segment\tSEGMENT "; + $v.=$masm>=$masmref ? "ALIGN(256)" : "PAGE"; + $v.=" 'CODE'"; + } + $self->{value} = $v; + last; + }; + /\.data/ && do { my $v=undef; + if ($nasm) { + $v="section .data data align=8\n"; + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $current_segment = "_DATA"; + $v.="$current_segment\tSEGMENT"; + } + $self->{value} = $v; + last; + }; + /\.section/ && do { my $v=undef; + $$line =~ s/([^,]*).*/$1/; + $$line = ".CRT\$XCU" if ($$line eq ".init"); + if ($nasm) { + $v="section $$line"; + if ($$line=~/\.([px])data/) { + $v.=" rdata align="; + $v.=$1 eq "p"? 4 : 8; + } elsif ($$line=~/\.CRT\$/i) { + $v.=" rdata align=8"; + } + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $v.="$$line\tSEGMENT"; + if ($$line=~/\.([px])data/) { + $v.=" READONLY"; + $v.=" ALIGN(".($1 eq "p" ? 4 : 8).")" if ($masm>=$masmref); + } elsif ($$line=~/\.CRT\$/i) { + $v.=" READONLY "; + $v.=$masm>=$masmref ? 
"ALIGN(8)" : "DWORD"; + } + } + $current_segment = $$line; + $self->{value} = $v; + last; + }; + /\.extern/ && do { $self->{value} = "EXTERN\t".$$line; + $self->{value} .= ":NEAR" if ($masm); + last; + }; + /\.globl|.global/ + && do { $self->{value} = $masm?"PUBLIC":"global"; + $self->{value} .= "\t".$$line; + last; + }; + /\.size/ && do { if (defined($current_function)) { + undef $self->{value}; + if ($current_function->{abi} eq "svr4") { + $self->{value}="${decor}SEH_end_$current_function->{name}:"; + $self->{value}.=":\n" if($masm); + } + $self->{value}.="$current_function->{name}\tENDP" if($masm && $current_function->{name}); + undef $current_function; + } + last; + }; + /\.align/ && do { my $max = ($masm && $masm>=$masmref) ? 256 : 4096; + $self->{value} = "ALIGN\t".($$line>$max?$max:$$line); + last; + }; + /\.(value|long|rva|quad)/ + && do { my $sz = substr($1,0,1); + my @arr = split(/,\s*/,$$line); + my $last = pop(@arr); + my $conv = sub { my $var=shift; + $var=~s/^(0b[0-1]+)/oct($1)/eig; + $var=~s/^0x([0-9a-f]+)/0$1h/ig if ($masm); + if ($sz eq "D" && ($current_segment=~/.[px]data/ || $dir eq ".rva")) + { $var=~s/([_a-z\$\@][_a-z0-9\$\@]*)/$nasm?"$1 wrt ..imagebase":"imagerel $1"/egi; } + $var; + }; + + $sz =~ tr/bvlrq/BWDDQ/; + $self->{value} = "\tD$sz\t"; + for (@arr) { $self->{value} .= &$conv($_).","; } + $self->{value} .= &$conv($last); + last; + }; + /\.byte/ && do { my @str=split(/,\s*/,$$line); + map(s/(0b[0-1]+)/oct($1)/eig,@str); + map(s/0x([0-9a-f]+)/0$1h/ig,@str) if ($masm); + while ($#str>15) { + $self->{value}.="DB\t" + .join(",",@str[0..15])."\n"; + foreach (0..15) { shift @str; } + } + $self->{value}.="DB\t" + .join(",",@str) if (@str); + last; + }; + /\.comm/ && do { my @str=split(/,\s*/,$$line); + my $v=undef; + if ($nasm) { + $v.="common $prefix@str[0] @str[1]"; + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $current_segment = "_DATA"; + $v.="$current_segment\tSEGMENT\n"; + $v.="COMM @str[0]:DWORD:".@str[1]/4; + } + $self->{value} = $v; + last; + }; + } + $$line = ""; + } + + $ret; + } + sub out { + my $self = shift; + $self->{value}; + } +} + +# Upon initial x86_64 introduction SSE>2 extensions were not introduced +# yet. In order not to be bothered by tracing exact assembler versions, +# but at the same time to provide a bare security minimum of AES-NI, we +# hard-code some instructions. Extensions past AES-NI on the other hand +# are traced by examining assembler version in individual perlasm +# modules... 
+ +my %regrm = ( "%eax"=>0, "%ecx"=>1, "%edx"=>2, "%ebx"=>3, + "%esp"=>4, "%ebp"=>5, "%esi"=>6, "%edi"=>7 ); + +sub rex { + my $opcode=shift; + my ($dst,$src,$rex)=@_; + + $rex|=0x04 if($dst>=8); + $rex|=0x01 if($src>=8); + push @$opcode,($rex|0x40) if ($rex); +} + +my $movq = sub { # elderly gas can't handle inter-register movq + my $arg = shift; + my @opcode=(0x66); + if ($arg =~ /%xmm([0-9]+),\s*%r(\w+)/) { + my ($src,$dst)=($1,$2); + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,$src,$dst,0x8); + push @opcode,0x0f,0x7e; + push @opcode,0xc0|(($src&7)<<3)|($dst&7); # ModR/M + @opcode; + } elsif ($arg =~ /%r(\w+),\s*%xmm([0-9]+)/) { + my ($src,$dst)=($2,$1); + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,$src,$dst,0x8); + push @opcode,0x0f,0x6e; + push @opcode,0xc0|(($src&7)<<3)|($dst&7); # ModR/M + @opcode; + } else { + (); + } +}; + +my $pextrd = sub { + if (shift =~ /\$([0-9]+),\s*%xmm([0-9]+),\s*(%\w+)/) { + my @opcode=(0x66); + my $imm=$1; + my $src=$2; + my $dst=$3; + if ($dst =~ /%r([0-9]+)d/) { $dst = $1; } + elsif ($dst =~ /%e/) { $dst = $regrm{$dst}; } + rex(\@opcode,$src,$dst); + push @opcode,0x0f,0x3a,0x16; + push @opcode,0xc0|(($src&7)<<3)|($dst&7); # ModR/M + push @opcode,$imm; + @opcode; + } else { + (); + } +}; + +my $pinsrd = sub { + if (shift =~ /\$([0-9]+),\s*(%\w+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + my $imm=$1; + my $src=$2; + my $dst=$3; + if ($src =~ /%r([0-9]+)/) { $src = $1; } + elsif ($src =~ /%e/) { $src = $regrm{$src}; } + rex(\@opcode,$dst,$src); + push @opcode,0x0f,0x3a,0x22; + push @opcode,0xc0|(($dst&7)<<3)|($src&7); # ModR/M + push @opcode,$imm; + @opcode; + } else { + (); + } +}; + +my $pshufb = sub { + if (0 && shift =~ /%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + rex(\@opcode,$2,$1); + push @opcode,0x0f,0x38,0x00; + push @opcode,0xc0|($1&7)|(($2&7)<<3); # ModR/M + @opcode; + } else { + (); + } +}; + +my $palignr = sub { + if (shift =~ /\$([0-9]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + rex(\@opcode,$3,$2); + push @opcode,0x0f,0x3a,0x0f; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + push @opcode,$1; + @opcode; + } else { + (); + } +}; + +my $pclmulqdq = sub { + if (shift =~ /\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + rex(\@opcode,$3,$2); + push @opcode,0x0f,0x3a,0x44; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + my $c=$1; + push @opcode,$c=~/^0/?oct($c):$c; + @opcode; + } else { + (); + } +}; + +my $rdrand = sub { + if (shift =~ /%[er](\w+)/) { + my @opcode=(); + my $dst=$1; + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,0,$dst,8); + push @opcode,0x0f,0xc7,0xf0|($dst&7); + @opcode; + } else { + (); + } +}; + +my $rdseed = sub { + if (shift =~ /%[er](\w+)/) { + my @opcode=(); + my $dst=$1; + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,0,$dst,8); + push @opcode,0x0f,0xc7,0xf8|($dst&7); + @opcode; + } else { + (); + } +}; + +# Not all AVX-capable assemblers recognize AMD XOP extension. Since we +# are using only two instructions hand-code them in order to be excused +# from chasing assembler versions... 
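+#
+# For instance (hypothetical operands): "vprotd $5,%xmm1,%xmm2" is
+# assembled by the $vprotd handler below into the XOP byte sequence
+# 0x8f,0xe8,0x78,0xc2,0xd1,0x05 -- the 0x8f escape, the RXB/map byte
+# 0xe8 built by rxb(), the W/vvvv/L/pp byte 0x78, opcode 0xc2, ModR/M
+# 0xd1 (selecting xmm2,xmm1) and the immediate 5.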
+ +sub rxb { + my $opcode=shift; + my ($dst,$src1,$src2,$rxb)=@_; + + $rxb|=0x7<<5; + $rxb&=~(0x04<<5) if($dst>=8); + $rxb&=~(0x01<<5) if($src1>=8); + $rxb&=~(0x02<<5) if($src2>=8); + push @$opcode,$rxb; +} + +my $vprotd = sub { + if (shift =~ /\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x8f); + rxb(\@opcode,$3,$2,-1,0x08); + push @opcode,0x78,0xc2; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + my $c=$1; + push @opcode,$c=~/^0/?oct($c):$c; + @opcode; + } else { + (); + } +}; + +my $vprotq = sub { + if (shift =~ /\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x8f); + rxb(\@opcode,$3,$2,-1,0x08); + push @opcode,0x78,0xc3; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + my $c=$1; + push @opcode,$c=~/^0/?oct($c):$c; + @opcode; + } else { + (); + } +}; + +# Intel Control-flow Enforcement Technology extension. All functions and +# indirect branch targets will have to start with this instruction... + +my $endbranch = sub { + (0xf3,0x0f,0x1e,0xfa); +}; + +######################################################################## + +if ($nasm) { + print <<___; +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +___ +} elsif ($masm) { + print <<___; +OPTION DOTNAME +___ +} +while(defined(my $line=<>)) { + + $line =~ s|\R$||; # Better chomp + + $line =~ s|[#!].*$||; # get rid of asm-style comments... + $line =~ s|/\*.*\*/||; # ... and C-style comments... + $line =~ s|^\s+||; # ... and skip white spaces in beginning + $line =~ s|\s+$||; # ... and at the end + + if (my $label=label->re(\$line)) { print $label->out(); } + + if (my $directive=directive->re(\$line)) { + printf "%s",$directive->out(); + } elsif (my $opcode=opcode->re(\$line)) { + my $asm = eval("\$".$opcode->mnemonic()); + + if ((ref($asm) eq 'CODE') && scalar(my @bytes=&$asm($line))) { + print $gas?".byte\t":"DB\t",join(',',@bytes),"\n"; + next; + } + + my @args; + ARGUMENT: while (1) { + my $arg; + + ($arg=register->re(\$line, $opcode))|| + ($arg=const->re(\$line)) || + ($arg=ea->re(\$line, $opcode)) || + ($arg=expr->re(\$line, $opcode)) || + last ARGUMENT; + + push @args,$arg; + + last ARGUMENT if ($line !~ /^,/); + + $line =~ s/^,\s*//; + } # ARGUMENT: + + if ($#args>=0) { + my $insn; + my $sz=$opcode->size(); + + if ($gas) { + $insn = $opcode->out($#args>=1?$args[$#args]->size():$sz); + @args = map($_->out($sz),@args); + printf "\t%s\t%s",$insn,join(",",@args); + } else { + $insn = $opcode->out(); + foreach (@args) { + my $arg = $_->out(); + # $insn.=$sz compensates for movq, pinsrw, ... 
+                    if ($arg =~ /^xmm[0-9]+$/) { $insn.=$sz; $sz="x" if(!$sz); last; }
+                    if ($arg =~ /^ymm[0-9]+$/) { $insn.=$sz; $sz="y" if(!$sz); last; }
+                    if ($arg =~ /^zmm[0-9]+$/) { $insn.=$sz; $sz="z" if(!$sz); last; }
+                    if ($arg =~ /^mm[0-9]+$/)  { $insn.=$sz; $sz="q" if(!$sz); last; }
+                }
+                @args = reverse(@args);
+                undef $sz if ($nasm && $opcode->mnemonic() eq "lea");
+                printf "\t%s\t%s",$insn,join(",",map($_->out($sz),@args));
+            }
+        } else {
+            printf "\t%s",$opcode->out();
+        }
+    }
+
+    print $line,"\n";
+}
+
+print "\n$current_segment\tENDS\n"  if ($current_segment && $masm);
+print "END\n"                       if ($masm);
+
+close STDOUT;
+
+ #################################################
+# Cross-reference x86_64 ABI "card"
+#
+#               Unix            Win64
+# %rax          *               *
+# %rbx          -               -
+# %rcx          #4              #1
+# %rdx          #3              #2
+# %rsi          #2              -
+# %rdi          #1              -
+# %rbp          -               -
+# %rsp          -               -
+# %r8           #5              #3
+# %r9           #6              #4
+# %r10          *               *
+# %r11          *               *
+# %r12          -               -
+# %r13          -               -
+# %r14          -               -
+# %r15          -               -
+#
+# (*)   volatile register
+# (-)   preserved by callee
+# (#)   Nth argument, volatile
+#
+# In Unix terms top of stack is argument transfer area for arguments
+# which could not be accommodated in registers. Or in other words 7th
+# [integer] argument resides at 8(%rsp) upon function entry point.
+# 128 bytes above %rsp constitute a "red zone" which is not touched
+# by signal handlers and can be used as temporary storage without
+# allocating a frame.
+#
+# In Win64 terms N*8 bytes on top of stack is argument transfer area,
+# which belongs to/can be overwritten by callee. N is the number of
+# arguments passed to callee, *but* not less than 4! This means that
+# upon function entry point 5th argument resides at 40(%rsp), as well
+# as that 32 bytes from 8(%rsp) can always be used as temporary
+# storage [without allocating a frame]. One can actually argue that
+# one can assume a "red zone" above stack pointer under Win64 as well.
+# Point is that apparently the Windows kernel never alters the area
+# above the user stack pointer in a truly asynchronous manner...
+#
+# All the above means that if assembler programmer adheres to Unix
+# register and stack layout, but disregards the "red zone" existence,
+# it's possible to use following prologue and epilogue to "gear" from
+# Unix to Win64 ABI in leaf functions with not more than 6 arguments.
+#
+# omnipotent_function:
+# ifdef WIN64
+#       movq    %rdi,8(%rsp)
+#       movq    %rsi,16(%rsp)
+#       movq    %rcx,%rdi       ; if 1st argument is actually present
+#       movq    %rdx,%rsi       ; if 2nd argument is actually ...
+#       movq    %r8,%rdx        ; if 3rd argument is ...
+#       movq    %r9,%rcx        ; if 4th argument ...
+#       movq    40(%rsp),%r8    ; if 5th ...
+#       movq    48(%rsp),%r9    ; if 6th ...
+# endif
+# ...
+# ifdef WIN64
+#       movq    8(%rsp),%rdi
+#       movq    16(%rsp),%rsi
+# endif
+#       ret
+#
+ #################################################
+# Win64 SEH, Structured Exception Handling.
+#
+# Unlike on Unix systems(*) lack of Win64 stack unwinding information
+# has an undesired side-effect at run-time: if an exception is raised in
+# assembler subroutine such as those in question (basically we're
+# referring to segmentation violations caused by malformed input
+# parameters), the application is briskly terminated without invoking
+# any exception handlers, most notably without generating memory dump
+# or any user notification whatsoever. This poses a problem. It's
+# possible to address it by registering custom language-specific
+# handler that would restore processor context to the state at
+# subroutine entry point and return "exception is not handled, keep
+# unwinding" code.
Writing such handler can be a challenge... But it's
+# doable, though requires certain coding convention. Consider following
+# snippet:
+#
+# .type function,@function
+# function:
+#       movq    %rsp,%rax       # copy rsp to volatile register
+#       pushq   %r15            # save non-volatile registers
+#       pushq   %rbx
+#       pushq   %rbp
+#       movq    %rsp,%r11
+#       subq    %rdi,%r11       # prepare [variable] stack frame
+#       andq    $-64,%r11
+#       movq    %rax,0(%r11)    # check for exceptions
+#       movq    %r11,%rsp       # allocate [variable] stack frame
+#       movq    %rax,0(%rsp)    # save original rsp value
+# magic_point:
+#       ...
+#       movq    0(%rsp),%rcx    # pull original rsp value
+#       movq    -24(%rcx),%rbp  # restore non-volatile registers
+#       movq    -16(%rcx),%rbx
+#       movq    -8(%rcx),%r15
+#       movq    %rcx,%rsp       # restore original rsp
+# magic_epilogue:
+#       ret
+# .size function,.-function
+#
+# The key is that up to magic_point copy of original rsp value remains
+# in chosen volatile register and no non-volatile register, except for
+# rsp, is modified. While past magic_point rsp remains constant till
+# the very end of the function. In this case custom language-specific
+# exception handler would look like this:
+#
+# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
+#               CONTEXT *context,DISPATCHER_CONTEXT *disp)
+# {     ULONG64 *rsp = (ULONG64 *)context->Rax;
+#       ULONG64  rip = context->Rip;
+#
+#       if (rip >= magic_point)
+#       {   rsp = (ULONG64 *)context->Rsp;
+#           if (rip < magic_epilogue)
+#           {   rsp = (ULONG64 *)rsp[0];
+#               context->Rbp = rsp[-3];
+#               context->Rbx = rsp[-2];
+#               context->R15 = rsp[-1];
+#           }
+#       }
+#       context->Rsp = (ULONG64)rsp;
+#       context->Rdi = rsp[1];
+#       context->Rsi = rsp[2];
+#
+#       memcpy (disp->ContextRecord,context,sizeof(CONTEXT));
+#       RtlVirtualUnwind(UNW_FLAG_NHANDLER,disp->ImageBase,
+#               disp->ControlPc,disp->FunctionEntry,disp->ContextRecord,
+#               &disp->HandlerData,&disp->EstablisherFrame,NULL);
+#       return ExceptionContinueSearch;
+# }
+#
+# It's appropriate to implement this handler in assembler, directly in
+# function's module. In order to do that one has to know members'
+# offsets in CONTEXT and DISPATCHER_CONTEXT structures and some constant
+# values. Here they are:
+#
+#       CONTEXT.Rax                             120
+#       CONTEXT.Rcx                             128
+#       CONTEXT.Rdx                             136
+#       CONTEXT.Rbx                             144
+#       CONTEXT.Rsp                             152
+#       CONTEXT.Rbp                             160
+#       CONTEXT.Rsi                             168
+#       CONTEXT.Rdi                             176
+#       CONTEXT.R8                              184
+#       CONTEXT.R9                              192
+#       CONTEXT.R10                             200
+#       CONTEXT.R11                             208
+#       CONTEXT.R12                             216
+#       CONTEXT.R13                             224
+#       CONTEXT.R14                             232
+#       CONTEXT.R15                             240
+#       CONTEXT.Rip                             248
+#       CONTEXT.Xmm6                            512
+#       sizeof(CONTEXT)                         1232
+#       DISPATCHER_CONTEXT.ControlPc            0
+#       DISPATCHER_CONTEXT.ImageBase            8
+#       DISPATCHER_CONTEXT.FunctionEntry        16
+#       DISPATCHER_CONTEXT.EstablisherFrame     24
+#       DISPATCHER_CONTEXT.TargetIp             32
+#       DISPATCHER_CONTEXT.ContextRecord        40
+#       DISPATCHER_CONTEXT.LanguageHandler      48
+#       DISPATCHER_CONTEXT.HandlerData          56
+#       UNW_FLAG_NHANDLER                       0
+#       ExceptionContinueSearch                 1
+#
+# In order to tie the handler to the function one has to compose
+# couple of structures: one for .xdata segment and one for .pdata.
+#
+# UNWIND_INFO structure for .xdata segment would be
+#
+# function_unwind_info:
+#       .byte   9,0,0,0
+#       .rva    handler
+#
+# This structure designates exception handler for a function with
+# zero-length prologue, no stack frame or frame register.
+#
+# To facilitate composing of .pdata structures, auto-generated "gear"
+# prologue copies rsp value to rax and denotes next instruction with
+# .LSEH_begin_{function_name} label. This essentially defines the SEH
+# styling rule mentioned in the beginning.
Position of this label is
+# chosen in such manner that possible exceptions raised in the "gear"
+# prologue would be accounted to caller and unwound from latter's frame.
+# End of function is marked with respective .LSEH_end_{function_name}
+# label. To summarize, .pdata segment would contain
+#
+#       .rva    .LSEH_begin_function
+#       .rva    .LSEH_end_function
+#       .rva    function_unwind_info
+#
+# Reference to function_unwind_info from .xdata segment is the anchor.
+# In case you wonder why references are 32-bit .rvas and not 64-bit
+# .quads: references put into these two segments are required to be
+# *relative* to the base address of the current binary module, a.k.a.
+# image base. No Win64 module, be it .exe or .dll, can be larger than
+# 2GB and thus such relative references can be and are accommodated in
+# 32 bits.
+#
+# Having reviewed the example function code, one can argue that "movq
+# %rsp,%rax" above is redundant. It is not! Keep in mind that on Unix
+# rax would contain an undefined value. If this "offends" you, use
+# another register and refrain from modifying rax till magic_point is
+# reached, i.e. as if it was a non-volatile register. If more registers
+# are required prior [variable] frame setup is completed, note that
+# nobody says that you can have only one "magic point." You can
+# "liberate" non-volatile registers by denoting last stack off-load
+# instruction and reflecting it in finer grade unwind logic in handler.
+# After all, isn't it why it's called *language-specific* handler...
+#
+# SE handlers are also involved in unwinding stack when executable is
+# profiled or debugged. Profiling implies additional limitations that
+# are too subtle to discuss here. For now it's sufficient to say that
+# in order to simplify handlers one should either a) offload original
+# %rsp to stack (like discussed above); or b) if you have a register to
+# spare for frame pointer, choose volatile one.
+#
+# (*)   Note that we're talking about run-time, not debug-time. Lack of
+#       unwind information makes debugging hard on both Windows and
+#       Unix. "Unlike" refers to the fact that on Unix signal handler
+#       will always be invoked, core dumped and appropriate exit code
+#       returned to parent (for user notification).
diff --git a/crypto_ops.h b/crypto_ops.h
new file mode 100644
index 0000000..4c72280
--- /dev/null
+++ b/crypto_ops.h
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
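+//
+// Usage sketch (illustrative only): call memzero_crypto() instead of a
+// plain memset() when wiping key material, since the trailing compiler
+// barrier (or the MSVC __stos* intrinsics) keeps the optimizer from
+// eliding the "dead" stores on an object that is about to go away:
+//
+//   uint8 tmp_key[32];
+//   // ...derive and use tmp_key...
+//   memzero_crypto(tmp_key, sizeof(tmp_key));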
+#ifndef TUNSAFE_CRYPTO_OPS_H_
+#define TUNSAFE_CRYPTO_OPS_H_
+
+#include "build_config.h"
+#include "tunsafe_types.h"
+
+#include <string.h>
+#if defined(COMPILER_MSVC)
+#include <intrin.h>
+#endif  // defined(COMPILER_MSVC)
+
+#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  if (n & 7) {
+    __stosb((unsigned char*)dst, 0, n);
+  } else {
+    __stosq((uint64*)dst, 0, n >> 3);
+  }
+}
+
+#elif defined(ARCH_CPU_X86) && defined(COMPILER_MSVC)
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  if (n & 3) {
+    __stosb((unsigned char*)dst, 0, n);
+  } else {
+    __stosd((unsigned long*)dst, 0, n >> 2);
+  }
+}
+#else
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  memset(dst, 0, n);
+  __asm__ __volatile__("": :"r"(dst) :"memory");
+}
+#endif
+
+int memcmp_crypto(const uint8 *a, const uint8 *b, size_t n);
+
+
+#endif  // TUNSAFE_CRYPTO_OPS_H_
\ No newline at end of file
diff --git a/icons/green-bg-icon.ico b/icons/green-bg-icon.ico
new file mode 100644
index 0000000..376ff26
Binary files /dev/null and b/icons/green-bg-icon.ico differ
diff --git a/icons/green-bg-icon.png b/icons/green-bg-icon.png
new file mode 100644
index 0000000..69af2ae
Binary files /dev/null and b/icons/green-bg-icon.png differ
diff --git a/icons/green-icon.ico b/icons/green-icon.ico
new file mode 100644
index 0000000..7434c05
Binary files /dev/null and b/icons/green-icon.ico differ
diff --git a/icons/green-icon.png b/icons/green-icon.png
new file mode 100644
index 0000000..bfc2eeb
Binary files /dev/null and b/icons/green-icon.png differ
diff --git a/icons/neutral-icon.ico b/icons/neutral-icon.ico
new file mode 100644
index 0000000..254f2f3
Binary files /dev/null and b/icons/neutral-icon.ico differ
diff --git a/icons/neutral-icon.png b/icons/neutral-icon.png
new file mode 100644
index 0000000..bf6412e
Binary files /dev/null and b/icons/neutral-icon.png differ
diff --git a/icons/red-icon.ico b/icons/red-icon.ico
new file mode 100644
index 0000000..ec4c1a9
Binary files /dev/null and b/icons/red-icon.ico differ
diff --git a/icons/red-icon.png b/icons/red-icon.png
new file mode 100644
index 0000000..ddb33a3
Binary files /dev/null and b/icons/red-icon.png differ
diff --git a/installer/.gitignore b/installer/.gitignore
new file mode 100644
index 0000000..66727c9
--- /dev/null
+++ b/installer/.gitignore
@@ -0,0 +1,4 @@
+/tunsafe*.exe
+/x64/
+/x86/
+*.pyc
diff --git a/installer/ChangeLog.txt b/installer/ChangeLog.txt
new file mode 100644
index 0000000..3cb9202
--- /dev/null
+++ b/installer/ChangeLog.txt
@@ -0,0 +1,48 @@
+2018-06-20 - TunSafe v1.3-rc3
+
+Changes:
+1.Add option to block Internet traffic outside of TunSafe. Either
+  based on firewall rules, or by adding a null route, or both.
+  The firewall rule blocks all traffic except traffic from TunSafe,
+  loopback traffic, and DHCP traffic on the default NIC.
+  The route rule adds two /1 routes to 0.0.0.0.
+2.Convert LF to CRLF when importing config files
+3.Update some logging messages
+4.Delete the old routing rule pointing at the VPN server IP when
+  disconnecting
+5.Delete any conflicting old routing rule pointing at the VPN server
+  when connecting.
+6.Tray popup menu did not disappear when clicking outside of it.
+7.Show config file names also in tray popup menu.
+8.Make the menu item bold if connection is selected in popup menu.
+9.Don't show the .conf filename extension in the UI.
+10.Show also config file name when hovering on tray icon.
+11.Click on the connected server to toggle connection +12.Fix bug where internet blocking checkbox was not removed. +13.Change so bold is used for selected server, and checkbox + is used when connected. +14.Use WS_EX_COMPOSITED to reduce flicker +15.Now possible to enter a filename on command line to connect to. +16.Support /minimize and /minimize_on_connect command line opts. +17.Support PreUp,PostUp,PreDown,PostDown options on [Interface] + Note: For security reasons you need to first enable them, + so either Shift-Click on Options and select Allow Pre/Post Commands + or specify the /allow_pre_post command line option. + +2018-04-29 - TunSafe v1.2 + +Changes: +1.Use /24 instead of failing when a /32 Address is used +2.Use /120 instead of failing when a /128 Address is used +3.Add routes for all entries in AllowedIPs + +2018-04-29 - TunSafe v1.1 + +Changes: +1.Retry on failed DNS lookup. Helps when resuming from sleep. +2.Display a better message if the TAP adapter can't be found. +3.Retry connect when getting ERROR_FILE_NOT_FOUND. + +2018-03-06 - TunSafe v1.0 + +First public release. \ No newline at end of file diff --git a/installer/LICENSE.TXT b/installer/LICENSE.TXT new file mode 100644 index 0000000..06968fb --- /dev/null +++ b/installer/LICENSE.TXT @@ -0,0 +1,240 @@ +TunSafe © 2018 Ludvig Strigeus +============================== + +BY USING THE SOFTWARE, YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT +THEM, DO NOT USE THE SOFTWARE. + +This software is provided "as is", without warranty of any kind, +express or implied, including but not limited to the warranties of +merchantability, fitness for a particular purpose and noninfringement. +In no event shall the authors or copyright holders be liable for any +claim, damages or other liability, whether in an action of contract, +tort or otherwise, arising from, out of or in connection with the +Software or the use or other dealings in the Software. + +We may not provide support services for this software in the future. + +You may install and use any number of copies of the software on your +devices. + +Please be aware that, similar to other networking tools that capture +network packets, the information processed by TunSafe or your VPN +provider may include personally identifiable or other sensitive +information (such as usernames, passwords, addresses of web sites +accessed). By using this software, you acknowledge that you are aware of +this and take sole responsibility for any personally identifiable or +other sensitive information provided to TunSafe or your VPN provider +through your use of the software. + +The software is licensed, not sold. This agreement only gives you some +rights to use the software. Unless applicable law gives you more rights +despite this limitation, you may use the software only as expressly +permitted in this agreement. In doing so, you must comply with any +technical limitations in the software that only allow you to use it in +certain ways. You may not + + * work around any technical limitations in the software; + + * reverse engineer, decompile or disassemble the software, except + and only to the extent that applicable law expressly permits, + despite this limitation; + + * publish the software for others to copy; + + * sell, rent, lease or lend the software; + + * transfer the software or this agreement to any third party; or + + * use the software for commercial software hosting services. + +All exceptions require prior written consent from info@tunsafe.com. 
+ +You can recover from us and our suppliers only direct damages up to +U.S. $0.10. You cannot recover any other damages, including consequential, +lost profits, special, indirect or incidental damages. + +This limitation applies to + * anything related to the software, services, content (including code) + on third party Internet sites, or third party programs; and + * claims for breach of contract, breach of warranty, guarantee or + condition, strict liability, negligence, or other tort to the extent + permitted by applicable law. + +It also applies even if we knew or should have known about the possibility +of the damages. + +This agreement describes certain legal rights. You may have other rights +under the laws of your country. You may also have rights with respect to the +party from whom you acquired the software. This agreement does not change +your rights under the laws of your country if the laws of your country do +not permit it to do so. + +This agreement is the entire agreement and is governed by the laws of Sweden. + +Several pieces of Open Source software were used in this product. +Here are their licenses. + +BLAKE2 License +-------------- + +Copyright 2012, Samuel Neves . You may use this under the +terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at +your option. The terms of these licenses can be found at: + +- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 +- OpenSSL license : https://www.openssl.org/source/license.html +- Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + +More information about the BLAKE2 hash function can be found at +https://blake2.net. + + +Curve25519-Donna License +------------------------ + +Copyright 2008, Google Inc. +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + + * Redistributions of source code must retain the above copyright +notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above +copyright notice, this list of conditions and the following disclaimer +in the documentation and/or other materials provided with the +distribution. + * Neither the name of Google Inc. nor the names of its +contributors may be used to endorse or promote products derived from +this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +OpenSSL License +--------------- + +==================================================================== +Copyright (c) 1998-2018 The OpenSSL Project. All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + +1. 
Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in + the documentation and/or other materials provided with the + distribution. + +3. All advertising materials mentioning features or use of this + software must display the following acknowledgment: + "This product includes software developed by the OpenSSL Project + for use in the OpenSSL Toolkit. (http://www.openssl.org/)" + +4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to + endorse or promote products derived from this software without + prior written permission. For written permission, please contact + openssl-core@openssl.org. + +5. Products derived from this software may not be called "OpenSSL" + nor may "OpenSSL" appear in their names without prior written + permission of the OpenSSL Project. + +6. Redistributions of any form whatsoever must retain the following + acknowledgment: + "This product includes software developed by the OpenSSL Project + for use in the OpenSSL Toolkit (http://www.openssl.org/)" + +THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY +EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE OpenSSL PROJECT OR +ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, +STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED +OF THE POSSIBILITY OF SUCH DAMAGE. +==================================================================== + +This product includes cryptographic software written by Eric Young +(eay@cryptsoft.com). This product includes software written by Tim +Hudson (tjh@cryptsoft.com). + + + +Original SSLeay License +----------------------- + +Copyright (C) 1995-1998 Eric Young (eay@cryptsoft.com) +All rights reserved. + +This package is an SSL implementation written +by Eric Young (eay@cryptsoft.com). +The implementation was written so as to conform with Netscapes SSL. + +This library is free for commercial and non-commercial use as long as +the following conditions are aheared to. The following conditions +apply to all code found in this distribution, be it the RC4, RSA, +lhash, DES, etc., code; not just the SSL code. The SSL documentation +included with this distribution is covered by the same copyright terms +except that the holder is Tim Hudson (tjh@cryptsoft.com). + +Copyright remains Eric Young's, and as such any Copyright notices in +the code are not to be removed. +If this package is used in a product, Eric Young should be given attribution +as the author of the parts of the library used. +This can be in the form of a textual message at program startup or +in documentation (online or textual) provided with the package. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: +1. Redistributions of source code must retain the copyright + notice, this list of conditions and the following disclaimer. +2. 
Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. +3. All advertising materials mentioning features or use of this software + must display the following acknowledgement: + "This product includes cryptographic software written by + Eric Young (eay@cryptsoft.com)" + The word 'cryptographic' can be left out if the rouines from the library + being used are not cryptographic related :-). +4. If you include any Windows specific code (or a derivative thereof) from + the apps directory (application code) you must include an acknowledgement: + "This product includes software written by Tim Hudson (tjh@cryptsoft.com)" + +THIS SOFTWARE IS PROVIDED BY ERIC YOUNG ``AS IS'' AND +ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +SUCH DAMAGE. + +The licence and distribution terms for any publically available version or +derivative of this code cannot be changed. i.e. this code cannot simply be +copied and put under another distribution licence +[including the GNU Public Licence.] + + diff --git a/installer/TunSafe.conf b/installer/TunSafe.conf new file mode 100644 index 0000000..fce908a --- /dev/null +++ b/installer/TunSafe.conf @@ -0,0 +1,46 @@ +# This is a sample config file for TunSafe. It uses the same syntax as +# WireGuard's wg-quick tool + +[Interface] + +# The private key of this computer. This is a secret key, don't give it out. +# To convert it to a public key you can go to 'Generate Key Pair' in TunSafe. +PrivateKey = gIIBl0OHb3wZjYGqZtgzRml3wec0e5vqXtSvCTfa42w= + +# Whether we want to bind a port to allow others to initiate connections to us. +# Please ensure this port is mapped in your router. +# ListenPort = 51820 + +# Switch DNS server while connected +# DNS = 8.8.8.8 + +# The addresses to bind to. Either IPv4 or IPv6. /31 and /32 are not supported. +Address = 192.168.2.2/24 + +# Whether to block all access to Internet that doesn't go through tunsafe. +# Note that Internet will keep being blocked even after TunSafe is restarted. +# Possible values (comma separated): +# route - Blocks all traffic using null route entries +# firewall - Blocks all traffic except TunSafe through the Windows firewall +# on - Uses the default block mechanism +# off - Turns off blocking +# BlockInternet = route, firewall + +[Peer] +# The public key of the peer. Do not use the private key here. Use the 'Generate Key Pair' +# function in TunSafe to convert a private key to a public key. +PublicKey = hIA3ikjlSOAo0qqrI+rXaS3ZH04Yx7Q2YQ4m2Syz+XE= + +# It's also possible to use a preshared key for extra security +# PresharedKey = SNz4BYc61amtDhzxNCxgYgdV9rPU+WiC8woX47Xf/2Y= + +# The IP range that we may send packets to for this peer. 
+AllowedIPs = 192.168.2.0/24 + +# Address of the server +Endpoint = 192.168.1.4:8040 + +# Send periodic keepalives to ensure connection stays up behind NAT. +PersistentKeepalive = 25 + + diff --git a/installer/icon.ico b/installer/icon.ico new file mode 100644 index 0000000..06b583b Binary files /dev/null and b/installer/icon.ico differ diff --git a/installer/signplugin.dll b/installer/signplugin.dll new file mode 100644 index 0000000..ef19ae8 Binary files /dev/null and b/installer/signplugin.dll differ diff --git a/installer/signplugin/.gitignore b/installer/signplugin/.gitignore new file mode 100644 index 0000000..99e1bf3 --- /dev/null +++ b/installer/signplugin/.gitignore @@ -0,0 +1,3 @@ +/Debug/ +/Release/ +/.vs/ \ No newline at end of file diff --git a/installer/signplugin/chkstk.obj b/installer/signplugin/chkstk.obj new file mode 100644 index 0000000..e9956a6 Binary files /dev/null and b/installer/signplugin/chkstk.obj differ diff --git a/installer/signplugin/ed25519.py b/installer/signplugin/ed25519.py new file mode 100644 index 0000000..7f8613b --- /dev/null +++ b/installer/signplugin/ed25519.py @@ -0,0 +1,104 @@ +import hashlib + +b = 256 +q = 2**255 - 19 +l = 2**252 + 27742317777372353535851937790883648493 + +def H(m): + return hashlib.sha512(m).digest() + +def expmod(b,e,m): + if e == 0: return 1 + t = expmod(b,e/2,m)**2 % m + if e & 1: t = (t*b) % m + return t + +def inv(x): + return expmod(x,q-2,q) + +d = -121665 * inv(121666) +I = expmod(2,(q-1)/4,q) + +def xrecover(y): + xx = (y*y-1) * inv(d*y*y+1) + x = expmod(xx,(q+3)/8,q) + if (x*x - xx) % q != 0: x = (x*I) % q + if x % 2 != 0: x = q-x + return x + +By = 4 * inv(5) +Bx = xrecover(By) +B = [Bx % q,By % q] + +def edwards(P,Q): + x1 = P[0] + y1 = P[1] + x2 = Q[0] + y2 = Q[1] + x3 = (x1*y2+x2*y1) * inv(1+d*x1*x2*y1*y2) + y3 = (y1*y2+x1*x2) * inv(1-d*x1*x2*y1*y2) + return [x3 % q,y3 % q] + +def scalarmult(P,e): + if e == 0: return [0,1] + Q = scalarmult(P,e/2) + Q = edwards(Q,Q) + if e & 1: Q = edwards(Q,P) + return Q + +def encodeint(y): + bits = [(y >> i) & 1 for i in range(b)] + return ''.join([chr(sum([bits[i * 8 + j] << j for j in range(8)])) for i in range(b/8)]) + +def encodepoint(P): + x = P[0] + y = P[1] + bits = [(y >> i) & 1 for i in range(b - 1)] + [x & 1] + return ''.join([chr(sum([bits[i * 8 + j] << j for j in range(8)])) for i in range(b/8)]) + +def bit(h,i): + return (ord(h[i/8]) >> (i%8)) & 1 + +def publickey(sk): + h = H(sk) + a = 2**(b-2) + sum(2**i * bit(h,i) for i in range(3,b-2)) + A = scalarmult(B,a) + return encodepoint(A) + +def Hint(m): + h = H(m) + return sum(2**i * bit(h,i) for i in range(2*b)) + +def signature(m,sk,pk): + h = H(sk) + a = 2**(b-2) + sum(2**i * bit(h,i) for i in range(3,b-2)) + r = Hint(''.join([h[i] for i in range(b/8,b/4)]) + m) + R = scalarmult(B,r) + S = (r + Hint(encodepoint(R) + pk + m) * a) % l + return encodepoint(R) + encodeint(S) + +def isoncurve(P): + x = P[0] + y = P[1] + return (-x*x + y*y - 1 - d*x*x*y*y) % q == 0 + +def decodeint(s): + return sum(2**i * bit(s,i) for i in range(0,b)) + +def decodepoint(s): + y = sum(2**i * bit(s,i) for i in range(0,b-1)) + x = xrecover(y) + if x & 1 != bit(s,b-1): x = q-x + P = [x,y] + if not isoncurve(P): raise Exception("decoding point that is not on curve") + return P + +def checkvalid(s,m,pk): + if len(s) != b/4: raise Exception("signature length is wrong") + if len(pk) != b/8: raise Exception("public-key length is wrong") + R = decodepoint(s[0:b/8]) + A = decodepoint(pk) + S = decodeint(s[b/8:b/4]) + h = Hint(encodepoint(R) + pk + 
m) + if scalarmult(B,S) != edwards(R,scalarmult(A,h)): + raise Exception("signature does not pass verification") diff --git a/installer/signplugin/ed_signtool.py b/installer/signplugin/ed_signtool.py new file mode 100644 index 0000000..3f8d0dd --- /dev/null +++ b/installer/signplugin/ed_signtool.py @@ -0,0 +1,22 @@ +import hashlib + +def H(m): + return hashlib.sha512(m).digest() + +import ed25519 +import os + +sk = "".join(chr(c) for c in [4, 213, 116, 80, 117, 4, 70, 166, 244, 214, 234, 159, 197, 101, 182, 177, 106, 180, 68, 125, 51, 32, 159, 77, 27, 151, 233, 91, 109, 184, 147, 235]) +pk = "".join(chr(c) for c in [79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99]) + +hash = H(file('../tap/TunSafe-TAP-9.21.2.exe', 'rb').read()) +print hash.encode('hex'), repr(hash) + +#sk = os.urandom(32) +#pk = ed25519.publickey(sk) +#print 'sk', [ord(c) for c in sk] +#print 'pk', [ord(c) for c in pk] + +#m = 'test' +s = ed25519.signature(hash,sk,pk) +file('../tap/TunSafe-TAP-9.21.2.exe.sig', 'wb').write(s.encode('hex')) diff --git a/installer/signplugin/main.cpp b/installer/signplugin/main.cpp new file mode 100644 index 0000000..db8c2f4 --- /dev/null +++ b/installer/signplugin/main.cpp @@ -0,0 +1,121 @@ +#include +extern "C" { +#include "tiny/edsign.h" +#include "nsis/pluginapi.h" +#include "tiny/sha512.h" +} + +// To work with Unicode version of NSIS, please use TCHAR-type +// functions for accessing the variables and the stack. + +unsigned char buffer[4096]; + +// sk[4, 213, 116, 80, 117, 4, 70, 166, 244, 214, 234, 159, 197, 101, 182, 177, 106, 180, 68, 125, 51, 32, 159, 77, 27, 151, 233, 91, 109, 184, 147, 235] +// pk[79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99] +static const unsigned char pk[32] = {79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99}; + +int CheckFile(char *file) { + sha512_state ctx; + int ret; + HANDLE h; + unsigned char out[64]; + unsigned char signature[64]; + + h = CreateFileA(file, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); + if (h == INVALID_HANDLE_VALUE) + return 1; + DWORD n; + sha512_init(&ctx); + + size_t total_size = 0; + size_t p = 0; + while (ReadFile(h, buffer, sizeof(buffer), &n, NULL) && n) { + total_size += n; + p = 0; + while (p + 128 <= n) { + sha512_block(&ctx, buffer + p); + p += 128; + } + if (p != n) + break; + } + sha512_final(&ctx, buffer + p, total_size); + sha512_get(&ctx, out, 0, 64); + CloseHandle(h); + /* + for (size_t i = 0; i < 64; i++) { + buffer[i * 2 + 0] = "0123456789abcdef"[out[i] >> 4]; + buffer[i * 2 + 1] = "0123456789abcdef"[out[i] & 0xF]; + } + buffer[128] = 0; + MessageBoxA(0, (char*)buffer, "sha", 0); + */ + char *x = file; + while (*x)x++; + memcpy(x, ".sig", 5); + + h = CreateFileA(file, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); + if (h == INVALID_HANDLE_VALUE) + return 2; + n = 0; + ReadFile(h, buffer, sizeof(buffer), &n, NULL); + CloseHandle(h); + if (n < 128) + return 3; + + memset(signature, 0, sizeof(signature)); + + for (int i = 0; i < 128; i++) { + unsigned char c = buffer[i]; + if (c >= '0' && c <= '9') + c -= '0'; + else if ((c |= 32), c >= 'a' && c <= 'f') + c -= 'a' - 10; + else + return 4; + signature[i >> 1] = (signature[i >> 1] << 4) + c; + 
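+    // Each pass folds one hex digit (4 bits) into the output, so the
+    // 128 hex characters decode into the 64-byte signature buffer.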
} + + /* create a random seed, and a keypair out of that seed */ + //ed25519_create_seed(seed); + //ed25519_create_keypair(public_key, private_key, seed); + + /* create signature on the message with the keypair */ + //ed25519_sign(signature, message, message_len, public_key, private_key); + + /* verify the signature */ + return edsign_verify(signature, pk, out, sizeof(out)) ? 0 : 5; +} + +extern "C" void __declspec(dllexport) myFunction(HWND hwndParent, int string_size, + LPTSTR variables, stack_t **stacktop, + extra_parameters *extra, ...) { + EXDLL_INIT(); + + int rv = 10; + + // note if you want parameters from the stack, pop them off in order. + // i.e. if you are called via exdll::myFunction file.dat read.txt + // calling popstring() the first time would give you file.dat, + // and the second time would give you read.txt. + // you should empty the stack of your parameters, and ONLY your + // parameters. + + // do your stuff here + { + LPTSTR msgbuf = (LPTSTR)GlobalAlloc(GPTR, (string_size + 1 + 10) * sizeof(*msgbuf)); + if (msgbuf) { + if (!popstring(msgbuf)) { + rv = CheckFile(msgbuf); + } + GlobalFree(msgbuf); + } + } + + pushint(rv); +} + + +BOOL WINAPI DllMain(HINSTANCE hInst, ULONG ul_reason_for_call, LPVOID lpReserved) { + return TRUE; +} diff --git a/installer/signplugin/nsis/api.h b/installer/signplugin/nsis/api.h new file mode 100644 index 0000000..eebbbf0 --- /dev/null +++ b/installer/signplugin/nsis/api.h @@ -0,0 +1,85 @@ +/* + * apih + * + * This file is a part of NSIS. + * + * Copyright (C) 1999-2018 Nullsoft and Contributors + * + * Licensed under the zlib/libpng license (the "License"); + * you may not use this file except in compliance with the License. + * + * Licence details can be found in the file COPYING. + * + * This software is provided 'as-is', without any express or implied + * warranty. + */ + +#ifndef _NSIS_EXEHEAD_API_H_ +#define _NSIS_EXEHEAD_API_H_ + +// Starting with NSIS 2.42, you can check the version of the plugin API in exec_flags->plugin_api_version +// The format is 0xXXXXYYYY where X is the major version and Y is the minor version (MAKELONG(y,x)) +// When doing version checks, always remember to use >=, ex: if (pX->exec_flags->plugin_api_version >= NSISPIAPIVER_1_0) {} + +#define NSISPIAPIVER_1_0 0x00010000 +#define NSISPIAPIVER_CURR NSISPIAPIVER_1_0 + +// NSIS Plug-In Callback Messages +enum NSPIM +{ + NSPIM_UNLOAD, // This is the last message a plugin gets, do final cleanup + NSPIM_GUIUNLOAD, // Called after .onGUIEnd +}; + +// Prototype for callbacks registered with extra_parameters->RegisterPluginCallback() +// Return NULL for unknown messages +// Should always be __cdecl for future expansion possibilities +typedef UINT_PTR (*NSISPLUGINCALLBACK)(enum NSPIM); + +// extra_parameters data structure containing other interesting stuff +// besides the stack, variables and HWND passed on to plug-ins. +typedef struct +{ + int autoclose; // SetAutoClose + int all_user_var; // SetShellVarContext: User context = 0, Machine context = 1 + int exec_error; // IfErrors + int abort; // IfAbort + int exec_reboot; // IfRebootFlag (NSIS_SUPPORT_REBOOT) + int reboot_called; // NSIS_SUPPORT_REBOOT + int XXX_cur_insttype; // Deprecated + int plugin_api_version; // Plug-in ABI. 
See NSISPIAPIVER_CURR (Note: used to be XXX_insttype_changed) + int silent; // IfSilent (NSIS_CONFIG_SILENT_SUPPORT) + int instdir_error; // GetInstDirError + int rtl; // 1 if $LANGUAGE is a RTL language + int errlvl; // SetErrorLevel + int alter_reg_view; // SetRegView: Default View = 0, Alternative View = (sizeof(void*) > 4 ? KEY_WOW64_32KEY : KEY_WOW64_64KEY) + int status_update; // SetDetailsPrint +} exec_flags_t; + +#ifndef NSISCALL +# define NSISCALL __stdcall +#endif +#if !defined(_WIN32) && !defined(LPTSTR) +# define LPTSTR TCHAR* +#endif + +typedef struct { + exec_flags_t *exec_flags; + int (NSISCALL *ExecuteCodeSegment)(int, HWND); + void (NSISCALL *validate_filename)(LPTSTR); + int (NSISCALL *RegisterPluginCallback)(HMODULE, NSISPLUGINCALLBACK); // returns 0 on success, 1 if already registered and < 0 on errors +} extra_parameters; + +// Definitions for page showing plug-ins +// See Ui.c to understand better how they're used + +// sent to the outer window to tell it to go to the next inner window +#define WM_NOTIFY_OUTER_NEXT (WM_USER+0x8) + +// custom pages should send this message to let NSIS know they're ready +#define WM_NOTIFY_CUSTOM_READY (WM_USER+0xd) + +// sent as wParam with WM_NOTIFY_OUTER_NEXT when user cancels - heed its warning +#define NOTIFY_BYE_BYE 'x' + +#endif /* _NSIS_EXEHEAD_API_H_ */ diff --git a/installer/signplugin/nsis/nsis_tchar.h b/installer/signplugin/nsis/nsis_tchar.h new file mode 100644 index 0000000..3f105ba --- /dev/null +++ b/installer/signplugin/nsis/nsis_tchar.h @@ -0,0 +1,229 @@ +/* + * nsis_tchar.h + * + * This file is a part of NSIS. + * + * Copyright (C) 1999-2018 Nullsoft and Contributors + * + * This software is provided 'as-is', without any express or implied + * warranty. + * + * For Unicode support by Jim Park -- 08/30/2007 + */ + +// Jim Park: Only those we use are listed here. 
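+// For example, with _UNICODE defined, _tcslen below maps to wcslen and
+// TCHAR is a wide character; in an ANSI build the same source compiles
+// against strlen and plain char.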
+ +#pragma once + +#ifdef _UNICODE + +#ifndef _T +#define __T(x) L ## x +#define _T(x) __T(x) +#define _TEXT(x) __T(x) +#endif + +#ifndef _TCHAR_DEFINED +#define _TCHAR_DEFINED +#if !defined(_NATIVE_WCHAR_T_DEFINED) && !defined(_WCHAR_T_DEFINED) +typedef unsigned short TCHAR; +#else +typedef wchar_t TCHAR; +#endif +#endif + + +// program +#define _tenviron _wenviron +#define __targv __wargv + +// printfs +#define _ftprintf fwprintf +#define _sntprintf _snwprintf +#if (defined(_MSC_VER) && (_MSC_VER<=1310||_MSC_FULL_VER<=140040310)) || defined(__MINGW32__) +# define _stprintf swprintf +#else +# define _stprintf _swprintf +#endif +#define _tprintf wprintf +#define _vftprintf vfwprintf +#define _vsntprintf _vsnwprintf +#if defined(_MSC_VER) && (_MSC_VER<=1310) +# define _vstprintf vswprintf +#else +# define _vstprintf _vswprintf +#endif + +// scanfs +#define _tscanf wscanf +#define _stscanf swscanf + +// string manipulations +#define _tcscat wcscat +#define _tcschr wcschr +#define _tcsclen wcslen +#define _tcscpy wcscpy +#define _tcsdup _wcsdup +#define _tcslen wcslen +#define _tcsnccpy wcsncpy +#define _tcsncpy wcsncpy +#define _tcsrchr wcsrchr +#define _tcsstr wcsstr +#define _tcstok wcstok + +// string comparisons +#define _tcscmp wcscmp +#define _tcsicmp _wcsicmp +#define _tcsncicmp _wcsnicmp +#define _tcsncmp wcsncmp +#define _tcsnicmp _wcsnicmp + +// upper / lower +#define _tcslwr _wcslwr +#define _tcsupr _wcsupr +#define _totlower towlower +#define _totupper towupper + +// conversions to numbers +#define _tcstoi64 _wcstoi64 +#define _tcstol wcstol +#define _tcstoul wcstoul +#define _tstof _wtof +#define _tstoi _wtoi +#define _tstoi64 _wtoi64 +#define _ttoi _wtoi +#define _ttoi64 _wtoi64 +#define _ttol _wtol + +// conversion from numbers to strings +#define _itot _itow +#define _ltot _ltow +#define _i64tot _i64tow +#define _ui64tot _ui64tow + +// file manipulations +#define _tfopen _wfopen +#define _topen _wopen +#define _tremove _wremove +#define _tunlink _wunlink + +// reading and writing to i/o +#define _fgettc fgetwc +#define _fgetts fgetws +#define _fputts fputws +#define _gettchar getwchar + +// directory +#define _tchdir _wchdir + +// environment +#define _tgetenv _wgetenv +#define _tsystem _wsystem + +// time +#define _tcsftime wcsftime + +#else // ANSI + +#ifndef _T +#define _T(x) x +#define _TEXT(x) x +#endif + +#ifndef _TCHAR_DEFINED +#define _TCHAR_DEFINED +typedef char TCHAR; +#endif + +// program +#define _tenviron environ +#define __targv __argv + +// printfs +#define _ftprintf fprintf +#define _sntprintf _snprintf +#define _stprintf sprintf +#define _tprintf printf +#define _vftprintf vfprintf +#define _vsntprintf _vsnprintf +#define _vstprintf vsprintf + +// scanfs +#define _tscanf scanf +#define _stscanf sscanf + +// string manipulations +#define _tcscat strcat +#define _tcschr strchr +#define _tcsclen strlen +#define _tcscnlen strnlen +#define _tcscpy strcpy +#define _tcsdup _strdup +#define _tcslen strlen +#define _tcsnccpy strncpy +#define _tcsrchr strrchr +#define _tcsstr strstr +#define _tcstok strtok + +// string comparisons +#define _tcscmp strcmp +#define _tcsicmp _stricmp +#define _tcsncmp strncmp +#define _tcsncicmp _strnicmp +#define _tcsnicmp _strnicmp + +// upper / lower +#define _tcslwr _strlwr +#define _tcsupr _strupr + +#define _totupper toupper +#define _totlower tolower + +// conversions to numbers +#define _tcstol strtol +#define _tcstoul strtoul +#define _tstof atof +#define _tstoi atoi +#define _tstoi64 _atoi64 +#define _tstoi64 _atoi64 +#define 
_ttoi atoi +#define _ttoi64 _atoi64 +#define _ttol atol + +// conversion from numbers to strings +#define _i64tot _i64toa +#define _itot _itoa +#define _ltot _ltoa +#define _ui64tot _ui64toa + +// file manipulations +#define _tfopen fopen +#define _topen _open +#define _tremove remove +#define _tunlink _unlink + +// reading and writing to i/o +#define _fgettc fgetc +#define _fgetts fgets +#define _fputts fputs +#define _gettchar getchar + +// directory +#define _tchdir _chdir + +// environment +#define _tgetenv getenv +#define _tsystem system + +// time +#define _tcsftime strftime + +#endif + +// is functions (the same in Unicode / ANSI) +#define _istgraph isgraph +#define _istascii __isascii + +#define __TFILE__ _T(__FILE__) +#define __TDATE__ _T(__DATE__) +#define __TTIME__ _T(__TIME__) diff --git a/installer/signplugin/nsis/pluginapi-x86-ansi.lib b/installer/signplugin/nsis/pluginapi-x86-ansi.lib new file mode 100644 index 0000000..4921639 Binary files /dev/null and b/installer/signplugin/nsis/pluginapi-x86-ansi.lib differ diff --git a/installer/signplugin/nsis/pluginapi-x86-unicode.lib b/installer/signplugin/nsis/pluginapi-x86-unicode.lib new file mode 100644 index 0000000..400c488 Binary files /dev/null and b/installer/signplugin/nsis/pluginapi-x86-unicode.lib differ diff --git a/installer/signplugin/nsis/pluginapi.h b/installer/signplugin/nsis/pluginapi.h new file mode 100644 index 0000000..63fe790 --- /dev/null +++ b/installer/signplugin/nsis/pluginapi.h @@ -0,0 +1,108 @@ +#ifndef ___NSIS_PLUGIN__H___ +#define ___NSIS_PLUGIN__H___ + +#ifdef __cplusplus +extern "C" { +#endif + +#include "api.h" +#include "nsis_tchar.h" // BUGBUG: Why cannot our plugins use the compilers tchar.h? + +#ifndef NSISCALL +# define NSISCALL WINAPI +#endif + +#define EXDLL_INIT() { \ + g_stringsize=string_size; \ + g_stacktop=stacktop; \ + g_variables=variables; } + +typedef struct _stack_t { + struct _stack_t *next; +#ifdef UNICODE + WCHAR text[1]; // this should be the length of g_stringsize when allocating +#else + char text[1]; +#endif +} stack_t; + +enum +{ +INST_0, // $0 +INST_1, // $1 +INST_2, // $2 +INST_3, // $3 +INST_4, // $4 +INST_5, // $5 +INST_6, // $6 +INST_7, // $7 +INST_8, // $8 +INST_9, // $9 +INST_R0, // $R0 +INST_R1, // $R1 +INST_R2, // $R2 +INST_R3, // $R3 +INST_R4, // $R4 +INST_R5, // $R5 +INST_R6, // $R6 +INST_R7, // $R7 +INST_R8, // $R8 +INST_R9, // $R9 +INST_CMDLINE, // $CMDLINE +INST_INSTDIR, // $INSTDIR +INST_OUTDIR, // $OUTDIR +INST_EXEDIR, // $EXEDIR +INST_LANG, // $LANGUAGE +__INST_LAST +}; + +extern unsigned int g_stringsize; +extern stack_t **g_stacktop; +extern LPTSTR g_variables; + +void NSISCALL pushstring(LPCTSTR str); +void NSISCALL pushintptr(INT_PTR value); +#define pushint(v) pushintptr((INT_PTR)(v)) +int NSISCALL popstring(LPTSTR str); // 0 on success, 1 on empty stack +int NSISCALL popstringn(LPTSTR str, int maxlen); // with length limit, pass 0 for g_stringsize +INT_PTR NSISCALL popintptr(); +#define popint() ( (int) popintptr() ) +int NSISCALL popint_or(); // with support for or'ing (2|4|8) +INT_PTR NSISCALL nsishelper_str_to_ptr(LPCTSTR s); +#define myatoi(s) ( (int) nsishelper_str_to_ptr(s) ) // converts a string to an integer +unsigned int NSISCALL myatou(LPCTSTR s); // converts a string to an unsigned integer, decimal only +int NSISCALL myatoi_or(LPCTSTR s); // with support for or'ing (2|4|8) +LPTSTR NSISCALL getuservariable(const int varnum); +void NSISCALL setuservariable(const int varnum, LPCTSTR var); + +#ifdef UNICODE +#define PopStringW(x) popstring(x) 
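+// In Unicode builds the NSIS stack already holds wide strings, so these
+// W variants are simple aliases of the generic helpers; only the A
+// variants below need real conversion functions.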
+#define PushStringW(x) pushstring(x) +#define SetUserVariableW(x,y) setuservariable(x,y) + +int NSISCALL PopStringA(LPSTR ansiStr); +void NSISCALL PushStringA(LPCSTR ansiStr); +void NSISCALL GetUserVariableW(const int varnum, LPWSTR wideStr); +void NSISCALL GetUserVariableA(const int varnum, LPSTR ansiStr); +void NSISCALL SetUserVariableA(const int varnum, LPCSTR ansiStr); + +#else +// ANSI defs + +#define PopStringA(x) popstring(x) +#define PushStringA(x) pushstring(x) +#define SetUserVariableA(x,y) setuservariable(x,y) + +int NSISCALL PopStringW(LPWSTR wideStr); +void NSISCALL PushStringW(LPWSTR wideStr); +void NSISCALL GetUserVariableW(const int varnum, LPWSTR wideStr); +void NSISCALL GetUserVariableA(const int varnum, LPSTR ansiStr); +void NSISCALL SetUserVariableW(const int varnum, LPCWSTR wideStr); + +#endif + +#ifdef __cplusplus +} +#endif + +#endif//!___NSIS_PLUGIN__H___ diff --git a/installer/signplugin/signplugin.sln b/installer/signplugin/signplugin.sln new file mode 100644 index 0000000..fd263d8 --- /dev/null +++ b/installer/signplugin/signplugin.sln @@ -0,0 +1,28 @@ + +Microsoft Visual Studio Solution File, Format Version 12.00 +# Visual Studio 15 +VisualStudioVersion = 15.0.26403.7 +MinimumVisualStudioVersion = 10.0.40219.1 +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "signplugin", "signplugin.vcxproj", "{C6E4A1D7-ECBC-466E-9183-30727EF81533}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|x64 = Debug|x64 + Debug|x86 = Debug|x86 + Release|x64 = Release|x64 + Release|x86 = Release|x86 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x64.ActiveCfg = Debug|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x64.Build.0 = Debug|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x86.ActiveCfg = Debug|Win32 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x86.Build.0 = Debug|Win32 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x64.ActiveCfg = Release|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x64.Build.0 = Release|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x86.ActiveCfg = Release|Win32 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x86.Build.0 = Release|Win32 + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection +EndGlobal diff --git a/installer/signplugin/signplugin.vcxproj b/installer/signplugin/signplugin.vcxproj new file mode 100644 index 0000000..1104e33 --- /dev/null +++ b/installer/signplugin/signplugin.vcxproj @@ -0,0 +1,166 @@ + + + + + Debug + Win32 + + + Release + Win32 + + + Debug + x64 + + + Release + x64 + + + + 15.0 + {C6E4A1D7-ECBC-466E-9183-30727EF81533} + Win32Proj + 10.0.15063.0 + + + + DynamicLibrary + true + v141 + + + DynamicLibrary + false + v141 + false + + + Application + true + v141 + + + Application + false + v141 + + + + + + + + + + + + + + + + + + + + + true + + + true + false + + + + WIN32;_DEBUG;_WINDOWS;_USRDLL;SIGNPLUGIN_EXPORTS;%(PreprocessorDefinitions) + MultiThreadedDebugDLL + Level3 + ProgramDatabase + Disabled + + + MachineX86 + true + Windows + + + false + false + + + + + WIN32;NDEBUG;_WINDOWS;_USRDLL;SIGNPLUGIN_EXPORTS;%(PreprocessorDefinitions) + MultiThreaded + Level3 + ProgramDatabase + false + false + MinSpace + true + true + + + MachineX86 + false + Windows + true + true + true + DllMain + false + UseLinkTimeCodeGeneration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + true + + + + + + + 
+ + \ No newline at end of file diff --git a/installer/signplugin/signplugin.vcxproj.filters b/installer/signplugin/signplugin.vcxproj.filters new file mode 100644 index 0000000..57b82ec --- /dev/null +++ b/installer/signplugin/signplugin.vcxproj.filters @@ -0,0 +1,132 @@ + + + + + {4FC737F1-C7A5-4376-A066-2A32D752A2FF} + cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx + + + {93995380-89BD-4b04-88EB-625FBE52EBFB} + h;hh;hpp;hxx;hm;inl;inc;xsd + + + {67DA6AB6-F800-4c08-8B7A-83BB121AAD01} + rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav + + + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + + + + + + + \ No newline at end of file diff --git a/installer/signplugin/signplugin.vcxproj.user b/installer/signplugin/signplugin.vcxproj.user new file mode 100644 index 0000000..be25078 --- /dev/null +++ b/installer/signplugin/signplugin.vcxproj.user @@ -0,0 +1,4 @@ + + + + \ No newline at end of file diff --git a/installer/signplugin/tiny/c25519.c b/installer/signplugin/tiny/c25519.c new file mode 100644 index 0000000..a9c9f08 --- /dev/null +++ b/installer/signplugin/tiny/c25519.c @@ -0,0 +1,124 @@ +/* Curve25519 (Montgomery form) + * Daniel Beer , 18 Apr 2014 + * + * This file is in the public domain. 
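+ *
+ * A minimal shared-secret sketch on top of this API ('secret' and
+ * 'peer_x' are hypothetical 32-byte buffers; 'secret' holds random
+ * bytes):
+ *
+ *   uint8_t shared[F25519_SIZE];
+ *   c25519_prepare(secret);
+ *   c25519_smult(shared, peer_x, secret);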
+ */ + +#include "c25519.h" + +const uint8_t c25519_base_x[F25519_SIZE] = {9}; + +/* Double an X-coordinate */ +static void xc_double(uint8_t *x3, uint8_t *z3, + const uint8_t *x1, const uint8_t *z1) +{ + /* Explicit formulas database: dbl-1987-m + * + * source 1987 Montgomery "Speeding the Pollard and elliptic + * curve methods of factorization", page 261, fourth display + * compute X3 = (X1^2-Z1^2)^2 + * compute Z3 = 4 X1 Z1 (X1^2 + a X1 Z1 + Z1^2) + */ + uint8_t x1sq[F25519_SIZE]; + uint8_t z1sq[F25519_SIZE]; + uint8_t x1z1[F25519_SIZE]; + uint8_t a[F25519_SIZE]; + + f25519_mul__distinct(x1sq, x1, x1); + f25519_mul__distinct(z1sq, z1, z1); + f25519_mul__distinct(x1z1, x1, z1); + + f25519_sub(a, x1sq, z1sq); + f25519_mul__distinct(x3, a, a); + + f25519_mul_c(a, x1z1, 486662); + f25519_add(a, x1sq, a); + f25519_add(a, z1sq, a); + f25519_mul__distinct(x1sq, x1z1, a); + f25519_mul_c(z3, x1sq, 4); +} + +/* Differential addition */ +static void xc_diffadd(uint8_t *x5, uint8_t *z5, + const uint8_t *x1, const uint8_t *z1, + const uint8_t *x2, const uint8_t *z2, + const uint8_t *x3, const uint8_t *z3) +{ + /* Explicit formulas database: dbl-1987-m3 + * + * source 1987 Montgomery "Speeding the Pollard and elliptic curve + * methods of factorization", page 261, fifth display, plus + * common-subexpression elimination + * compute A = X2+Z2 + * compute B = X2-Z2 + * compute C = X3+Z3 + * compute D = X3-Z3 + * compute DA = D A + * compute CB = C B + * compute X5 = Z1(DA+CB)^2 + * compute Z5 = X1(DA-CB)^2 + */ + uint8_t da[F25519_SIZE]; + uint8_t cb[F25519_SIZE]; + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + + f25519_add(a, x2, z2); + f25519_sub(b, x3, z3); /* D */ + f25519_mul__distinct(da, a, b); + + f25519_sub(b, x2, z2); + f25519_add(a, x3, z3); /* C */ + f25519_mul__distinct(cb, a, b); + + f25519_add(a, da, cb); + f25519_mul__distinct(b, a, a); + f25519_mul__distinct(x5, z1, b); + + f25519_sub(a, da, cb); + f25519_mul__distinct(b, a, a); + f25519_mul__distinct(z5, x1, b); +} + +void c25519_smult(uint8_t *result, const uint8_t *q, const uint8_t *e) +{ + /* Current point: P_m */ + uint8_t xm[F25519_SIZE]; + uint8_t zm[F25519_SIZE] = {1}; + + /* Predecessor: P_(m-1) */ + uint8_t xm1[F25519_SIZE] = {1}; + uint8_t zm1[F25519_SIZE] = {0}; + + int i; + + /* Note: bit 254 is assumed to be 1 */ + f25519_copy(xm, q); + + for (i = 253; i >= 0; i--) { + const int bit = (e[i >> 3] >> (i & 7)) & 1; + uint8_t xms[F25519_SIZE]; + uint8_t zms[F25519_SIZE]; + + /* From P_m and P_(m-1), compute P_(2m) and P_(2m-1) */ + xc_diffadd(xm1, zm1, q, f25519_one, xm, zm, xm1, zm1); + xc_double(xm, zm, xm, zm); + + /* Compute P_(2m+1) */ + xc_diffadd(xms, zms, xm1, zm1, xm, zm, q, f25519_one); + + /* Select: + * bit = 1 --> (P_(2m+1), P_(2m)) + * bit = 0 --> (P_(2m), P_(2m-1)) + */ + f25519_select(xm1, xm1, xm, bit); + f25519_select(zm1, zm1, zm, bit); + f25519_select(xm, xm, xms, bit); + f25519_select(zm, zm, zms, bit); + } + + /* Freeze out of projective coordinates */ + f25519_inv__distinct(zm1, zm); + f25519_mul__distinct(result, zm1, xm); + f25519_normalize(result); +} diff --git a/installer/signplugin/tiny/c25519.h b/installer/signplugin/tiny/c25519.h new file mode 100644 index 0000000..4596438 --- /dev/null +++ b/installer/signplugin/tiny/c25519.h @@ -0,0 +1,48 @@ +/* Curve25519 (Montgomery form) + * Daniel Beer , 18 Apr 2014 + * + * This file is in the public domain. 
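+ *
+ * Deriving the public value for a fresh secret is a scalar multiply by
+ * the base point (a sketch; 'secret' is a hypothetical buffer holding
+ * 32 random bytes):
+ *
+ *   uint8_t pub[F25519_SIZE];
+ *   c25519_prepare(secret);
+ *   c25519_smult(pub, c25519_base_x, secret);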
+ */ + +#ifndef C25519_H_ +#define C25519_H_ + +#include +#include "f25519.h" + +/* Curve25519 has the equation over F(p = 2^255-19): + * + * y^2 = x^3 + 486662x^2 + x + * + * 486662 = 4A+2, where A = 121665. This is a Montgomery curve. + * + * For more information, see: + * + * Bernstein, D.J. (2006) "Curve25519: New Diffie-Hellman speed + * records". Document ID: 4230efdfa673480fc079449d90f322c0. + */ + +/* This is the site of a Curve25519 exponent (private key) */ +#define C25519_EXPONENT_SIZE 32 + +/* Having generated 32 random bytes, you should call this function to + * finalize the generated key. + */ +static inline void c25519_prepare(uint8_t *key) +{ + key[0] &= 0xf8; + key[31] &= 0x7f; + key[31] |= 0x40; +} + +/* X-coordinate of the base point */ +extern const uint8_t c25519_base_x[F25519_SIZE]; + +/* X-coordinate scalar multiply: given the X-coordinate of q, return the + * X-coordinate of e*q. + * + * result and q are field elements. e is an exponent. + */ +void c25519_smult(uint8_t *result, const uint8_t *q, const uint8_t *e); + +#endif diff --git a/installer/signplugin/tiny/ed25519.c b/installer/signplugin/tiny/ed25519.c new file mode 100644 index 0000000..51ac462 --- /dev/null +++ b/installer/signplugin/tiny/ed25519.c @@ -0,0 +1,320 @@ +/* Edwards curve operations + * Daniel Beer , 9 Jan 2014 + * + * This file is in the public domain. + */ + +#include "ed25519.h" + +/* Base point is (numbers wrapped): + * + * x = 151122213495354007725011514095885315114 + * 54012693041857206046113283949847762202 + * y = 463168356949264781694283940034751631413 + * 07993866256225615783033603165251855960 + * + * y is derived by transforming the original Montgomery base (u=9). x + * is the corresponding positive coordinate for the new curve equation. + * t is x*y. + */ +const struct ed25519_pt ed25519_base = { + .x = { + 0x1a, 0xd5, 0x25, 0x8f, 0x60, 0x2d, 0x56, 0xc9, + 0xb2, 0xa7, 0x25, 0x95, 0x60, 0xc7, 0x2c, 0x69, + 0x5c, 0xdc, 0xd6, 0xfd, 0x31, 0xe2, 0xa4, 0xc0, + 0xfe, 0x53, 0x6e, 0xcd, 0xd3, 0x36, 0x69, 0x21 + }, + .y = { + 0x58, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, + 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, + 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, + 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66 + }, + .t = { + 0xa3, 0xdd, 0xb7, 0xa5, 0xb3, 0x8a, 0xde, 0x6d, + 0xf5, 0x52, 0x51, 0x77, 0x80, 0x9f, 0xf0, 0x20, + 0x7d, 0xe3, 0xab, 0x64, 0x8e, 0x4e, 0xea, 0x66, + 0x65, 0x76, 0x8b, 0xd7, 0x0f, 0x5f, 0x87, 0x67 + }, + .z = {1, 0} +}; + +const struct ed25519_pt ed25519_neutral = { + .x = {0}, + .y = {1, 0}, + .t = {0}, + .z = {1, 0} +}; + +/* Conversion to and from projective coordinates */ +void ed25519_project(struct ed25519_pt *p, + const uint8_t *x, const uint8_t *y) +{ + f25519_copy(p->x, x); + f25519_copy(p->y, y); + f25519_load(p->z, 1); + f25519_mul__distinct(p->t, x, y); +} + +void ed25519_unproject(uint8_t *x, uint8_t *y, + const struct ed25519_pt *p) +{ + uint8_t z1[F25519_SIZE]; + + f25519_inv__distinct(z1, p->z); + f25519_mul__distinct(x, p->x, z1); + f25519_mul__distinct(y, p->y, z1); + + f25519_normalize(x); + f25519_normalize(y); +} + +/* Compress/uncompress points. We compress points by storing the x + * coordinate and the parity of the y coordinate. 
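+ * (As ed25519_pack() below shows, it is in fact the y coordinate that
+ * is stored, with the parity of x folded into the top bit of the last
+ * byte.)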
+ * + * Rearranging the curve equation, we obtain explicit formulae for the + * coordinates: + * + * x = sqrt((y^2-1) / (1+dy^2)) + * y = sqrt((x^2+1) / (1-dx^2)) + * + * Where d = (-121665/121666), or: + * + * d = 370957059346694393431380835087545651895 + * 42113879843219016388785533085940283555 + */ + +static const uint8_t ed25519_d[F25519_SIZE] = { + 0xa3, 0x78, 0x59, 0x13, 0xca, 0x4d, 0xeb, 0x75, + 0xab, 0xd8, 0x41, 0x41, 0x4d, 0x0a, 0x70, 0x00, + 0x98, 0xe8, 0x79, 0x77, 0x79, 0x40, 0xc7, 0x8c, + 0x73, 0xfe, 0x6f, 0x2b, 0xee, 0x6c, 0x03, 0x52 +}; + +void ed25519_pack(uint8_t *c, const uint8_t *x, const uint8_t *y) +{ + uint8_t tmp[F25519_SIZE]; + uint8_t parity; + + f25519_copy(tmp, x); + f25519_normalize(tmp); + parity = (tmp[0] & 1) << 7; + + f25519_copy(c, y); + f25519_normalize(c); + c[31] |= parity; +} + +uint8_t ed25519_try_unpack(uint8_t *x, uint8_t *y, const uint8_t *comp) +{ + const int parity = comp[31] >> 7; + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + uint8_t c[F25519_SIZE]; + + /* Unpack y */ + f25519_copy(y, comp); + y[31] &= 127; + + /* Compute c = y^2 */ + f25519_mul__distinct(c, y, y); + + /* Compute b = (1+dy^2)^-1 */ + f25519_mul__distinct(b, c, ed25519_d); + f25519_add(a, b, f25519_one); + f25519_inv__distinct(b, a); + + /* Compute a = y^2-1 */ + f25519_sub(a, c, f25519_one); + + /* Compute c = a*b = (y^2-1)/(1-dy^2) */ + f25519_mul__distinct(c, a, b); + + /* Compute a, b = +/-sqrt(c), if c is square */ + f25519_sqrt(a, c); + f25519_neg(b, a); + + /* Select one of them, based on the compressed parity bit */ + f25519_select(x, a, b, (a[0] ^ parity) & 1); + + /* Verify that x^2 = c */ + f25519_mul__distinct(a, x, x); + f25519_normalize(a); + f25519_normalize(c); + + return f25519_eq(a, c); +} + +/* k = 2d */ +static const uint8_t ed25519_k[F25519_SIZE] = { + 0x59, 0xf1, 0xb2, 0x26, 0x94, 0x9b, 0xd6, 0xeb, + 0x56, 0xb1, 0x83, 0x82, 0x9a, 0x14, 0xe0, 0x00, + 0x30, 0xd1, 0xf3, 0xee, 0xf2, 0x80, 0x8e, 0x19, + 0xe7, 0xfc, 0xdf, 0x56, 0xdc, 0xd9, 0x06, 0x24 +}; + +void ed25519_add(struct ed25519_pt *r, + const struct ed25519_pt *p1, const struct ed25519_pt *p2) +{ + /* Explicit formulas database: add-2008-hwcd-3 + * + * source 2008 Hisil--Wong--Carter--Dawson, + * http://eprint.iacr.org/2008/522, Section 3.1 + * appliesto extended-1 + * parameter k + * assume k = 2 d + * compute A = (Y1-X1)(Y2-X2) + * compute B = (Y1+X1)(Y2+X2) + * compute C = T1 k T2 + * compute D = Z1 2 Z2 + * compute E = B - A + * compute F = D - C + * compute G = D + C + * compute H = B + A + * compute X3 = E F + * compute Y3 = G H + * compute T3 = E H + * compute Z3 = F G + */ + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + uint8_t c[F25519_SIZE]; + uint8_t d[F25519_SIZE]; + uint8_t e[F25519_SIZE]; + uint8_t f[F25519_SIZE]; + uint8_t g[F25519_SIZE]; + uint8_t h[F25519_SIZE]; + + /* A = (Y1-X1)(Y2-X2) */ + f25519_sub(c, p1->y, p1->x); + f25519_sub(d, p2->y, p2->x); + f25519_mul__distinct(a, c, d); + + /* B = (Y1+X1)(Y2+X2) */ + f25519_add(c, p1->y, p1->x); + f25519_add(d, p2->y, p2->x); + f25519_mul__distinct(b, c, d); + + /* C = T1 k T2 */ + f25519_mul__distinct(d, p1->t, p2->t); + f25519_mul__distinct(c, d, ed25519_k); + + /* D = Z1 2 Z2 */ + f25519_mul__distinct(d, p1->z, p2->z); + f25519_add(d, d, d); + + /* E = B - A */ + f25519_sub(e, b, a); + + /* F = D - C */ + f25519_sub(f, d, c); + + /* G = D + C */ + f25519_add(g, d, c); + + /* H = B + A */ + f25519_add(h, b, a); + + /* X3 = E F */ + f25519_mul__distinct(r->x, e, f); + + /* Y3 = G H */ + f25519_mul__distinct(r->y, g, h); + + /* 
T3 = E H */ + f25519_mul__distinct(r->t, e, h); + + /* Z3 = F G */ + f25519_mul__distinct(r->z, f, g); +} + +void ed25519_double(struct ed25519_pt *r, const struct ed25519_pt *p) +{ + /* Explicit formulas database: dbl-2008-hwcd + * + * source 2008 Hisil--Wong--Carter--Dawson, + * http://eprint.iacr.org/2008/522, Section 3.3 + * compute A = X1^2 + * compute B = Y1^2 + * compute C = 2 Z1^2 + * compute D = a A + * compute E = (X1+Y1)^2-A-B + * compute G = D + B + * compute F = G - C + * compute H = D - B + * compute X3 = E F + * compute Y3 = G H + * compute T3 = E H + * compute Z3 = F G + */ + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + uint8_t c[F25519_SIZE]; + uint8_t e[F25519_SIZE]; + uint8_t f[F25519_SIZE]; + uint8_t g[F25519_SIZE]; + uint8_t h[F25519_SIZE]; + + /* A = X1^2 */ + f25519_mul__distinct(a, p->x, p->x); + + /* B = Y1^2 */ + f25519_mul__distinct(b, p->y, p->y); + + /* C = 2 Z1^2 */ + f25519_mul__distinct(c, p->z, p->z); + f25519_add(c, c, c); + + /* D = a A (alter sign) */ + /* E = (X1+Y1)^2-A-B */ + f25519_add(f, p->x, p->y); + f25519_mul__distinct(e, f, f); + f25519_sub(e, e, a); + f25519_sub(e, e, b); + + /* G = D + B */ + f25519_sub(g, b, a); + + /* F = G - C */ + f25519_sub(f, g, c); + + /* H = D - B */ + f25519_neg(h, b); + f25519_sub(h, h, a); + + /* X3 = E F */ + f25519_mul__distinct(r->x, e, f); + + /* Y3 = G H */ + f25519_mul__distinct(r->y, g, h); + + /* T3 = E H */ + f25519_mul__distinct(r->t, e, h); + + /* Z3 = F G */ + f25519_mul__distinct(r->z, f, g); +} + +void ed25519_smult(struct ed25519_pt *r_out, const struct ed25519_pt *p, + const uint8_t *e) +{ + struct ed25519_pt r; + int i; + + ed25519_copy(&r, &ed25519_neutral); + + for (i = 255; i >= 0; i--) { + const uint8_t bit = (e[i >> 3] >> (i & 7)) & 1; + struct ed25519_pt s; + + ed25519_double(&r, &r); + ed25519_add(&s, &r, p); + + f25519_select(r.x, r.x, s.x, bit); + f25519_select(r.y, r.y, s.y, bit); + f25519_select(r.z, r.z, s.z, bit); + f25519_select(r.t, r.t, s.t, bit); + } + + ed25519_copy(r_out, &r); +} diff --git a/installer/signplugin/tiny/ed25519.h b/installer/signplugin/tiny/ed25519.h new file mode 100644 index 0000000..62f0120 --- /dev/null +++ b/installer/signplugin/tiny/ed25519.h @@ -0,0 +1,82 @@ +/* Edwards curve operations + * Daniel Beer , 9 Jan 2014 + * + * This file is in the public domain. + */ + +#ifndef ED25519_H_ +#define ED25519_H_ + +#include "f25519.h" + +/* This is not the Ed25519 signature system. Rather, we're implementing + * basic operations on the twisted Edwards curve over (Z mod 2^255-19): + * + * -x^2 + y^2 = 1 - (121665/121666)x^2y^2 + * + * With the positive-x base point y = 4/5. + * + * These functions will not leak secret data through timing. + * + * For more information, see: + * + * Bernstein, D.J. & Lange, T. (2007) "Faster addition and doubling on + * elliptic curves". Document ID: 95616567a6ba20f575c5f25e7cebaf83. + * + * Hisil, H. & Wong, K K. & Carter, G. & Dawson, E. (2008) "Twisted + * Edwards curves revisited". Advances in Cryptology, ASIACRYPT 2008, + * Vol. 5350, pp. 326-343. 
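+ *
+ * A typical fixed-base scalar multiplication (a sketch; 'e' is a
+ * hypothetical 32-byte exponent already clamped with ed25519_prepare()):
+ *
+ *   struct ed25519_pt p;
+ *   uint8_t x[F25519_SIZE], y[F25519_SIZE], c[ED25519_PACK_SIZE];
+ *   ed25519_smult(&p, &ed25519_base, e);
+ *   ed25519_unproject(x, y, &p);
+ *   ed25519_pack(c, x, y);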
+ */ + +/* Projective coordinates */ +struct ed25519_pt { + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + uint8_t t[F25519_SIZE]; + uint8_t z[F25519_SIZE]; +}; + +extern const struct ed25519_pt ed25519_base; +extern const struct ed25519_pt ed25519_neutral; + +/* Convert between projective and affine coordinates (x/y in F25519) */ +void ed25519_project(struct ed25519_pt *p, + const uint8_t *x, const uint8_t *y); + +void ed25519_unproject(uint8_t *x, uint8_t *y, + const struct ed25519_pt *p); + +/* Compress/uncompress points. try_unpack() will check that the + * compressed point is on the curve, returning 1 if the unpacked point + * is valid, and 0 otherwise. + */ +#define ED25519_PACK_SIZE F25519_SIZE + +void ed25519_pack(uint8_t *c, const uint8_t *x, const uint8_t *y); +uint8_t ed25519_try_unpack(uint8_t *x, uint8_t *y, const uint8_t *c); + +/* Add, double and scalar multiply */ +#define ED25519_EXPONENT_SIZE 32 + +/* Prepare an exponent by clamping appropriate bits */ +static inline void ed25519_prepare(uint8_t *e) +{ + e[0] &= 0xf8; + e[31] &= 0x7f; + e[31] |= 0x40; +} + +/* Order of the group generated by the base point */ +static inline void ed25519_copy(struct ed25519_pt *dst, + const struct ed25519_pt *src) +{ + memcpy(dst, src, sizeof(*dst)); +} + +void ed25519_add(struct ed25519_pt *r, + const struct ed25519_pt *a, const struct ed25519_pt *b); +void ed25519_double(struct ed25519_pt *r, const struct ed25519_pt *a); +void ed25519_smult(struct ed25519_pt *r, const struct ed25519_pt *a, + const uint8_t *e); + +#endif diff --git a/installer/signplugin/tiny/edsign.c b/installer/signplugin/tiny/edsign.c new file mode 100644 index 0000000..bf131a5 --- /dev/null +++ b/installer/signplugin/tiny/edsign.c @@ -0,0 +1,168 @@ +/* Edwards curve signature system + * Daniel Beer , 22 Apr 2014 + * + * This file is in the public domain. 
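+ *
+ * Typical round trip through the functions below (a sketch; 'secret',
+ * 'msg' and 'len' are hypothetical):
+ *
+ *   uint8_t pub[EDSIGN_PUBLIC_KEY_SIZE], sig[EDSIGN_SIGNATURE_SIZE];
+ *   edsign_sec_to_pub(pub, secret);
+ *   edsign_sign(sig, pub, secret, msg, len);
+ *   uint8_t ok = edsign_verify(sig, pub, msg, len);  (non-zero if valid)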
+ */ + +#include "ed25519.h" +#include "sha512.h" +#include "fprime.h" +#include "edsign.h" + +#define EXPANDED_SIZE 64 + +static const uint8_t ed25519_order[FPRIME_SIZE] = { + 0xed, 0xd3, 0xf5, 0x5c, 0x1a, 0x63, 0x12, 0x58, + 0xd6, 0x9c, 0xf7, 0xa2, 0xde, 0xf9, 0xde, 0x14, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10 +}; + +static void expand_key(uint8_t *expanded, const uint8_t *secret) +{ + struct sha512_state s; + + sha512_init(&s); + sha512_final(&s, secret, EDSIGN_SECRET_KEY_SIZE); + sha512_get(&s, expanded, 0, EXPANDED_SIZE); + ed25519_prepare(expanded); +} + +static uint8_t upp(struct ed25519_pt *p, const uint8_t *packed) +{ + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + uint8_t ok = ed25519_try_unpack(x, y, packed); + + ed25519_project(p, x, y); + return ok; +} + +static void pp(uint8_t *packed, const struct ed25519_pt *p) +{ + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + + ed25519_unproject(x, y, p); + ed25519_pack(packed, x, y); +} + +static void sm_pack(uint8_t *r, const uint8_t *k) +{ + struct ed25519_pt p; + + ed25519_smult(&p, &ed25519_base, k); + pp(r, &p); +} + +void edsign_sec_to_pub(uint8_t *pub, const uint8_t *secret) +{ + uint8_t expanded[EXPANDED_SIZE]; + + expand_key(expanded, secret); + sm_pack(pub, expanded); +} + +static void hash_with_prefix(uint8_t *out_fp, + uint8_t *init_block, unsigned int prefix_size, + const uint8_t *message, size_t len) +{ + struct sha512_state s; + + sha512_init(&s); + + if (len < SHA512_BLOCK_SIZE && len + prefix_size < SHA512_BLOCK_SIZE) { + memcpy(init_block + prefix_size, message, len); + sha512_final(&s, init_block, len + prefix_size); + } else { + size_t i; + + memcpy(init_block + prefix_size, message, + SHA512_BLOCK_SIZE - prefix_size); + sha512_block(&s, init_block); + + for (i = SHA512_BLOCK_SIZE - prefix_size; + i + SHA512_BLOCK_SIZE <= len; + i += SHA512_BLOCK_SIZE) + sha512_block(&s, message + i); + + sha512_final(&s, message + i, len + prefix_size); + } + + sha512_get(&s, init_block, 0, SHA512_HASH_SIZE); + fprime_from_bytes(out_fp, init_block, SHA512_HASH_SIZE, ed25519_order); +} + +static void generate_k(uint8_t *k, const uint8_t *kgen_key, + const uint8_t *message, size_t len) +{ + uint8_t block[SHA512_BLOCK_SIZE]; + + memcpy(block, kgen_key, 32); + hash_with_prefix(k, block, 32, message, len); +} + +static void hash_message(uint8_t *z, const uint8_t *r, const uint8_t *a, + const uint8_t *m, size_t len) +{ + uint8_t block[SHA512_BLOCK_SIZE]; + + memcpy(block, r, 32); + memcpy(block + 32, a, 32); + hash_with_prefix(z, block, 64, m, len); +} + +void edsign_sign(uint8_t *signature, const uint8_t *pub, + const uint8_t *secret, + const uint8_t *message, size_t len) +{ + uint8_t expanded[EXPANDED_SIZE]; + uint8_t e[FPRIME_SIZE]; + uint8_t s[FPRIME_SIZE]; + uint8_t k[FPRIME_SIZE]; + uint8_t z[FPRIME_SIZE]; + + expand_key(expanded, secret); + + /* Generate k and R = kB */ + generate_k(k, expanded + 32, message, len); + sm_pack(signature, k); + + /* Compute z = H(R, A, M) */ + hash_message(z, signature, pub, message, len); + + /* Obtain e */ + fprime_from_bytes(e, expanded, 32, ed25519_order); + + /* Compute s = ze + k */ + fprime_mul(s, z, e, ed25519_order); + fprime_add(s, k, ed25519_order); + memcpy(signature + 32, s, 32); +} + +uint8_t edsign_verify(const uint8_t *signature, const uint8_t *pub, + const uint8_t *message, size_t len) +{ + struct ed25519_pt p; + struct ed25519_pt q; + uint8_t lhs[F25519_SIZE]; + uint8_t rhs[F25519_SIZE]; + uint8_t z[FPRIME_SIZE]; 
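+	/* ok is cleared if either compressed point fails to decode; the
+	   final result additionally requires lhs == rhs below. */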
+ uint8_t ok = 1; + + /* Compute z = H(R, A, M) */ + hash_message(z, signature, pub, message, len); + + /* sB = (ze + k)B = ... */ + sm_pack(lhs, signature + 32); + + /* ... = zA + R */ + ok &= upp(&p, pub); + ed25519_smult(&p, &p, z); + ok &= upp(&q, signature); + ed25519_add(&p, &p, &q); + pp(rhs, &p); + + /* Equal? */ + return ok & f25519_eq(lhs, rhs); +} diff --git a/installer/signplugin/tiny/edsign.h b/installer/signplugin/tiny/edsign.h new file mode 100644 index 0000000..85e2208 --- /dev/null +++ b/installer/signplugin/tiny/edsign.h @@ -0,0 +1,51 @@ +/* Edwards curve signature system + * Daniel Beer , 22 Apr 2014 + * + * This file is in the public domain. + */ + +#ifndef EDSIGN_H_ +#define EDSIGN_H_ + +#include +#include + +/* This is the Ed25519 signature system, as described in: + * + * Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin + * Yang. High-speed high-security signatures. Journal of Cryptographic + * Engineering 2 (2012), 77-89. Document ID: + * a1a62a2f76d23f65d622484ddd09caf8. URL: + * http://cr.yp.to/papers.html#ed25519. Date: 2011.09.26. + * + * The format and calculation of signatures is compatible with the + * Ed25519 implementation in SUPERCOP. Note, however, that our secret + * keys are half the size: we don't store a copy of the public key in + * the secret key (we generate it on demand). + */ + +/* Any string of 32 random bytes is a valid secret key. There is no + * clamping of bits, because we don't use the key directly as an + * exponent (the exponent is derived from part of a key expansion). + */ +#define EDSIGN_SECRET_KEY_SIZE 32 + +/* Given a secret key, produce the public key (a packed Edwards-curve + * point). + */ +#define EDSIGN_PUBLIC_KEY_SIZE 32 + +void edsign_sec_to_pub(uint8_t *pub, const uint8_t *secret); + +/* Produce a signature for a message. */ +#define EDSIGN_SIGNATURE_SIZE 64 + +void edsign_sign(uint8_t *signature, const uint8_t *pub, + const uint8_t *secret, + const uint8_t *message, size_t len); + +/* Verify a message signature. Returns non-zero if ok. */ +uint8_t edsign_verify(const uint8_t *signature, const uint8_t *pub, + const uint8_t *message, size_t len); + +#endif diff --git a/installer/signplugin/tiny/f25519.c b/installer/signplugin/tiny/f25519.c new file mode 100644 index 0000000..3b06fa6 --- /dev/null +++ b/installer/signplugin/tiny/f25519.c @@ -0,0 +1,324 @@ +/* Arithmetic mod p = 2^255-19 + * Daniel Beer , 5 Jan 2014 + * + * This file is in the public domain. + */ + +#include "f25519.h" + +const uint8_t f25519_zero[F25519_SIZE] = {0}; +const uint8_t f25519_one[F25519_SIZE] = {1}; + +void f25519_load(uint8_t *x, uint32_t c) +{ + unsigned int i; + + for (i = 0; i < sizeof(c); i++) { + x[i] = c; + c >>= 8; + } + + for (; i < F25519_SIZE; i++) + x[i] = 0; +} + +void f25519_normalize(uint8_t *x) +{ + uint8_t minusp[F25519_SIZE]; + uint16_t c; + int i; + + /* Reduce using 2^255 = 19 mod p */ + c = (x[31] >> 7) * 19; + x[31] &= 127; + + for (i = 0; i < F25519_SIZE; i++) { + c += x[i]; + x[i] = c; + c >>= 8; + } + + /* The number is now less than 2^255 + 18, and therefore less than + * 2p. Try subtracting p, and conditionally load the subtracted + * value if underflow did not occur. 
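+ * Both candidates are always computed and f25519_select() picks one,
+ * so timing does not depend on the value being reduced.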
+ */ + c = 19; + + for (i = 0; i + 1 < F25519_SIZE; i++) { + c += x[i]; + minusp[i] = c; + c >>= 8; + } + + c += ((uint16_t)x[i]) - 128; + minusp[31] = c; + + /* Load x-p if no underflow */ + f25519_select(x, minusp, x, (c >> 15) & 1); +} + +uint8_t f25519_eq(const uint8_t *x, const uint8_t *y) +{ + uint8_t sum = 0; + int i; + + for (i = 0; i < F25519_SIZE; i++) + sum |= x[i] ^ y[i]; + + sum |= (sum >> 4); + sum |= (sum >> 2); + sum |= (sum >> 1); + + return (sum ^ 1) & 1; +} + +void f25519_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition) +{ + const uint8_t mask = -condition; + int i; + + for (i = 0; i < F25519_SIZE; i++) + dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i])); +} + +void f25519_add(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint16_t c = 0; + int i; + + /* Add */ + for (i = 0; i < F25519_SIZE; i++) { + c >>= 8; + c += ((uint16_t)a[i]) + ((uint16_t)b[i]); + r[i] = c; + } + + /* Reduce with 2^255 = 19 mod p */ + r[31] &= 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_sub(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint32_t c = 0; + int i; + + /* Calculate a + 2p - b, to avoid underflow */ + c = 218; + for (i = 0; i + 1 < F25519_SIZE; i++) { + c += 65280 + ((uint32_t)a[i]) - ((uint32_t)b[i]); + r[i] = c; + c >>= 8; + } + + c += ((uint32_t)a[31]) - ((uint32_t)b[31]); + r[31] = c & 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_neg(uint8_t *r, const uint8_t *a) +{ + uint32_t c = 0; + int i; + + /* Calculate 2p - a, to avoid underflow */ + c = 218; + for (i = 0; i + 1 < F25519_SIZE; i++) { + c += 65280 - ((uint32_t)a[i]); + r[i] = c; + c >>= 8; + } + + c -= ((uint32_t)a[31]); + r[31] = c & 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_mul__distinct(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint32_t c = 0; + int i; + + for (i = 0; i < F25519_SIZE; i++) { + int j; + + c >>= 8; + for (j = 0; j <= i; j++) + c += ((uint32_t)a[j]) * ((uint32_t)b[i - j]); + + for (; j < F25519_SIZE; j++) + c += ((uint32_t)a[j]) * + ((uint32_t)b[i + F25519_SIZE - j]) * 38; + + r[i] = c; + } + + r[31] &= 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_mul(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint8_t tmp[F25519_SIZE]; + + f25519_mul__distinct(tmp, a, b); + f25519_copy(r, tmp); +} + +void f25519_mul_c(uint8_t *r, const uint8_t *a, uint32_t b) +{ + uint32_t c = 0; + int i; + + for (i = 0; i < F25519_SIZE; i++) { + c >>= 8; + c += b * ((uint32_t)a[i]); + r[i] = c; + } + + r[31] &= 127; + c >>= 7; + c *= 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_inv__distinct(uint8_t *r, const uint8_t *x) +{ + uint8_t s[F25519_SIZE]; + int i; + + /* This is a prime field, so by Fermat's little theorem: + * + * x^(p-1) = 1 mod p + * + * Therefore, raise to (p-2) = 2^255-21 to get a multiplicative + * inverse. + * + * This is a 255-bit binary number with the digits: + * + * 11111111... 01011 + * + * We compute the result by the usual binary chain, but + * alternate between keeping the accumulator in r and s, so as + * to avoid copying temporaries. 
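+ *
+ * (That is 250 leading one-bits followed by 01011; the digit comments
+ * below track this expansion.)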
+ */ + + /* 1 1 */ + f25519_mul__distinct(s, x, x); + f25519_mul__distinct(r, s, x); + + /* 1 x 248 */ + for (i = 0; i < 248; i++) { + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); + } + + /* 0 */ + f25519_mul__distinct(s, r, r); + + /* 1 */ + f25519_mul__distinct(r, s, s); + f25519_mul__distinct(s, r, x); + + /* 0 */ + f25519_mul__distinct(r, s, s); + + /* 1 */ + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); + + /* 1 */ + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); +} + +void f25519_inv(uint8_t *r, const uint8_t *x) +{ + uint8_t tmp[F25519_SIZE]; + + f25519_inv__distinct(tmp, x); + f25519_copy(r, tmp); +} + +/* Raise x to the power of (p-5)/8 = 2^252-3, using s for temporary + * storage. + */ +static void exp2523(uint8_t *r, const uint8_t *x, uint8_t *s) +{ + int i; + + /* This number is a 252-bit number with the binary expansion: + * + * 111111... 01 + */ + + /* 1 1 */ + f25519_mul__distinct(r, x, x); + f25519_mul__distinct(s, r, x); + + /* 1 x 248 */ + for (i = 0; i < 248; i++) { + f25519_mul__distinct(r, s, s); + f25519_mul__distinct(s, r, x); + } + + /* 0 */ + f25519_mul__distinct(r, s, s); + + /* 1 */ + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); +} + +void f25519_sqrt(uint8_t *r, const uint8_t *a) +{ + uint8_t v[F25519_SIZE]; + uint8_t i[F25519_SIZE]; + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + + /* v = (2a)^((p-5)/8) [x = 2a] */ + f25519_mul_c(x, a, 2); + exp2523(v, x, y); + + /* i = 2av^2 - 1 */ + f25519_mul__distinct(y, v, v); + f25519_mul__distinct(i, x, y); + f25519_load(y, 1); + f25519_sub(i, i, y); + + /* r = avi */ + f25519_mul__distinct(x, v, a); + f25519_mul__distinct(r, x, i); +} diff --git a/installer/signplugin/tiny/f25519.h b/installer/signplugin/tiny/f25519.h new file mode 100644 index 0000000..4cfa5ec --- /dev/null +++ b/installer/signplugin/tiny/f25519.h @@ -0,0 +1,92 @@ +/* Arithmetic mod p = 2^255-19 + * Daniel Beer , 8 Jan 2014 + * + * This file is in the public domain. + */ + +#ifndef F25519_H_ +#define F25519_H_ + +#include +#include + +/* Field elements are represented as little-endian byte strings. All + * operations have timings which are independent of input data, so they + * can be safely used for cryptography. + * + * Computation is performed on un-normalized elements. These are byte + * strings which fall into the range 0 <= x < 2p. Use f25519_normalize() + * to convert to a value 0 <= x < p. + * + * Elements received from the outside may greater even than 2p. + * f25519_normalize() will correctly deal with these numbers too. + */ +#define F25519_SIZE 32 + +/* Identity constants */ +extern const uint8_t f25519_zero[F25519_SIZE]; +extern const uint8_t f25519_one[F25519_SIZE]; + +/* Load a small constant */ +void f25519_load(uint8_t *x, uint32_t c); + +/* Copy two points */ +static inline void f25519_copy(uint8_t *x, const uint8_t *a) +{ + memcpy(x, a, F25519_SIZE); +} + +/* Normalize a field point x < 2*p by subtracting p if necessary */ +void f25519_normalize(uint8_t *x); + +/* Compare two field points in constant time. Return one if equal, zero + * otherwise. This should be performed only on normalized values. + */ +uint8_t f25519_eq(const uint8_t *x, const uint8_t *y); + +/* Conditional copy. If condition == 0, then zero is copied to dst. If + * condition == 1, then one is copied to dst. Any other value results in + * undefined behaviour. 
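+ * The selection is branch-free (an XOR mask derived from condition),
+ * so it may safely be driven by secret data.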
+ */ +void f25519_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition); + +/* Add/subtract two field points. The three pointers are not required to + * be distinct. + */ +void f25519_add(uint8_t *r, const uint8_t *a, const uint8_t *b); +void f25519_sub(uint8_t *r, const uint8_t *a, const uint8_t *b); + +/* Unary negation */ +void f25519_neg(uint8_t *r, const uint8_t *a); + +/* Multiply two field points. The __distinct variant is used when r is + * known to be in a different location to a and b. + */ +void f25519_mul(uint8_t *r, const uint8_t *a, const uint8_t *b); +void f25519_mul__distinct(uint8_t *r, const uint8_t *a, const uint8_t *b); + +/* Multiply a point by a small constant. The two pointers are not + * required to be distinct. + * + * The constant must be less than 2^24. + */ +void f25519_mul_c(uint8_t *r, const uint8_t *a, uint32_t b); + +/* Take the reciprocal of a field point. The __distinct variant is used + * when r is known to be in a different location to x. + */ +void f25519_inv(uint8_t *r, const uint8_t *x); +void f25519_inv__distinct(uint8_t *r, const uint8_t *x); + +/* Compute one of the square roots of the field element, if the element + * is square. The other square is -r. + * + * If the input is not square, the returned value is a valid field + * element, but not the correct answer. If you don't already know that + * your element is square, you should square the return value and test. + */ +void f25519_sqrt(uint8_t *r, const uint8_t *x); + +#endif diff --git a/installer/signplugin/tiny/fprime.c b/installer/signplugin/tiny/fprime.c new file mode 100644 index 0000000..25f2197 --- /dev/null +++ b/installer/signplugin/tiny/fprime.c @@ -0,0 +1,215 @@ +/* Arithmetic in prime fields + * Daniel Beer , 10 Jan 2014 + * + * This file is in the public domain. + */ + +#include "fprime.h" + +const uint8_t fprime_zero[FPRIME_SIZE] = {0}; +const uint8_t fprime_one[FPRIME_SIZE] = {1}; + +static void raw_add(uint8_t *x, const uint8_t *p) +{ + uint16_t c = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) { + c += ((uint16_t)x[i]) + ((uint16_t)p[i]); + x[i] = c; + c >>= 8; + } +} + +static void raw_try_sub(uint8_t *x, const uint8_t *p) +{ + uint8_t minusp[FPRIME_SIZE]; + uint16_t c = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) { + c = ((uint16_t)x[i]) - ((uint16_t)p[i]) - c; + minusp[i] = c; + c = (c >> 8) & 1; + } + + fprime_select(x, minusp, x, c); +} + +/* Warning: this function is variable-time */ +static int prime_msb(const uint8_t *p) +{ + int i; + uint8_t x; + + for (i = FPRIME_SIZE - 1; i >= 0; i--) + if (p[i]) + break; + + x = p[i]; + i <<= 3; + + while (x) { + x >>= 1; + i++; + } + + return i - 1; +} + +/* Warning: this function may be variable-time in the argument n */ +static void shift_n_bits(uint8_t *x, int n) +{ + uint16_t c = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) { + c |= ((uint16_t)x[i]) << n; + x[i] = c; + c >>= 8; + } +} + +void fprime_load(uint8_t *x, uint32_t c) +{ + unsigned int i; + + for (i = 0; i < sizeof(c); i++) { + x[i] = c; + c >>= 8; + } + + for (; i < FPRIME_SIZE; i++) + x[i] = 0; +} + +static inline int min_int(int a, int b) +{ + return a < b ? 
a : b; +} + +void fprime_from_bytes(uint8_t *n, + const uint8_t *x, size_t len, + const uint8_t *modulus) +{ + const int preload_total = min_int(prime_msb(modulus) - 1, len << 3); + const int preload_bytes = preload_total >> 3; + const int preload_bits = preload_total & 7; + const int rbits = (len << 3) - preload_total; + int i; + + memset(n, 0, FPRIME_SIZE); + + for (i = 0; i < preload_bytes; i++) + n[i] = x[len - preload_bytes + i]; + + if (preload_bits) { + shift_n_bits(n, preload_bits); + n[0] |= x[len - preload_bytes - 1] >> (8 - preload_bits); + } + + for (i = rbits - 1; i >= 0; i--) { + const uint8_t bit = (x[i >> 3] >> (i & 7)) & 1; + + shift_n_bits(n, 1); + n[0] |= bit; + raw_try_sub(n, modulus); + } +} + +void fprime_normalize(uint8_t *x, const uint8_t *modulus) +{ + uint8_t n[FPRIME_SIZE]; + + fprime_from_bytes(n, x, FPRIME_SIZE, modulus); + fprime_copy(x, n); +} + +uint8_t fprime_eq(const uint8_t *x, const uint8_t *y) +{ + uint8_t sum = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) + sum |= x[i] ^ y[i]; + + sum |= (sum >> 4); + sum |= (sum >> 2); + sum |= (sum >> 1); + + return (sum ^ 1) & 1; +} + +void fprime_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition) +{ + const uint8_t mask = -condition; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) + dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i])); +} + +void fprime_add(uint8_t *r, const uint8_t *a, const uint8_t *modulus) +{ + raw_add(r, a); + raw_try_sub(r, modulus); +} + +void fprime_sub(uint8_t *r, const uint8_t *a, const uint8_t *modulus) +{ + raw_add(r, modulus); + raw_try_sub(r, a); + raw_try_sub(r, modulus); +} + +void fprime_mul(uint8_t *r, const uint8_t *a, const uint8_t *b, + const uint8_t *modulus) +{ + int i; + + memset(r, 0, FPRIME_SIZE); + + for (i = prime_msb(modulus); i >= 0; i--) { + const uint8_t bit = (b[i >> 3] >> (i & 7)) & 1; + uint8_t plusa[FPRIME_SIZE]; + + shift_n_bits(r, 1); + raw_try_sub(r, modulus); + + fprime_copy(plusa, r); + fprime_add(plusa, a, modulus); + + fprime_select(r, r, plusa, bit); + } +} + +void fprime_inv(uint8_t *r, const uint8_t *a, const uint8_t *modulus) +{ + uint8_t pm2[FPRIME_SIZE]; + uint16_t c = 2; + int i; + + /* Compute (p-2) */ + fprime_copy(pm2, modulus); + for (i = 0; i < FPRIME_SIZE; i++) { + c = modulus[i] - c; + pm2[i] = c; + c >>= 8; + } + + /* Binary exponentiation */ + fprime_load(r, 1); + + for (i = prime_msb(modulus); i >= 0; i--) { + uint8_t r2[FPRIME_SIZE]; + + fprime_mul(r2, r, r, modulus); + + if ((pm2[i >> 3] >> (i & 7)) & 1) + fprime_mul(r, r2, a, modulus); + else + fprime_copy(r, r2); + } +} diff --git a/installer/signplugin/tiny/fprime.h b/installer/signplugin/tiny/fprime.h new file mode 100644 index 0000000..4a5486c --- /dev/null +++ b/installer/signplugin/tiny/fprime.h @@ -0,0 +1,70 @@ +/* Arithmetic in prime fields + * Daniel Beer , 10 Jan 2014 + * + * This file is in the public domain. + */ + +#ifndef FPRIME_H_ +#define FPRIME_H_ + +#include +#include + +/* Maximum size of a field element (or a prime). Field elements are + * always manipulated and stored in normalized form, with 0 <= x < p. + * You can use normalize() to convert a denormalized bitstring to normal + * form. + * + * Operations are constant with respect to the value of field elements, + * but not with respect to the modulus. + * + * The modulus is a number p, such that 2p-1 fits in FPRIME_SIZE bytes. 
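+ *
+ * In this package the modulus of interest is the Ed25519 group order
+ * l = 2^252 + 27742317777372353535851937790883648493 (ed25519_order in
+ * edsign.c).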
+ */ +#define FPRIME_SIZE 32 + +/* Useful constants */ +extern const uint8_t fprime_zero[FPRIME_SIZE]; +extern const uint8_t fprime_one[FPRIME_SIZE]; + +/* Load a small constant */ +void fprime_load(uint8_t *x, uint32_t c); + +/* Load a large constant */ +void fprime_from_bytes(uint8_t *x, + const uint8_t *in, size_t len, + const uint8_t *modulus); + +/* Copy an element */ +static inline void fprime_copy(uint8_t *x, const uint8_t *a) +{ + memcpy(x, a, FPRIME_SIZE); +} + +/* Normalize a field element */ +void fprime_normalize(uint8_t *x, const uint8_t *modulus); + +/* Compare two field points in constant time. Return one if equal, zero + * otherwise. This should be performed only on normalized values. + */ +uint8_t fprime_eq(const uint8_t *x, const uint8_t *y); + +/* Conditional copy. If condition == 0, then zero is copied to dst. If + * condition == 1, then one is copied to dst. Any other value results in + * undefined behaviour. + */ +void fprime_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition); + +/* Add one value to another. The two pointers must be distinct. */ +void fprime_add(uint8_t *r, const uint8_t *a, const uint8_t *modulus); +void fprime_sub(uint8_t *r, const uint8_t *a, const uint8_t *modulus); + +/* Multiply two values to get a third. r must be distinct from a and b */ +void fprime_mul(uint8_t *r, const uint8_t *a, const uint8_t *b, + const uint8_t *modulus); + +/* Compute multiplicative inverse. r must be distinct from a */ +void fprime_inv(uint8_t *r, const uint8_t *a, const uint8_t *modulus); + +#endif diff --git a/installer/signplugin/tiny/morph25519.c b/installer/signplugin/tiny/morph25519.c new file mode 100644 index 0000000..3d64022 --- /dev/null +++ b/installer/signplugin/tiny/morph25519.c @@ -0,0 +1,87 @@ +/* Montgomery <-> Edwards isomorphism + * Daniel Beer , 18 Jan 2014 + * + * This file is in the public domain. 
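+ *
+ * Background: the conversions below are the standard birational maps
+ * between the Montgomery curve used by X25519 and the twisted Edwards
+ * curve used by Ed25519:
+ *
+ *   mx = (1 + ey) / (1 - ey)     (Edwards to Montgomery)
+ *   ey = (mx - 1) / (mx + 1)     (Montgomery to Edwards)
+ *
+ * The Edwards x coordinate is determined by ey only up to sign, which
+ * is why a parity bit is carried alongside mx to pick the right root.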
+ */
+
+#include "morph25519.h"
+#include "f25519.h"
+
+void morph25519_e2m(uint8_t *montgomery, const uint8_t *y)
+{
+        uint8_t yplus[F25519_SIZE];
+        uint8_t yminus[F25519_SIZE];
+
+        f25519_sub(yplus, f25519_one, y);
+        f25519_inv__distinct(yminus, yplus);
+        f25519_add(yplus, f25519_one, y);
+        f25519_mul__distinct(montgomery, yplus, yminus);
+        f25519_normalize(montgomery);
+}
+
+static void mx2ey(uint8_t *ey, const uint8_t *mx)
+{
+        uint8_t n[F25519_SIZE];
+        uint8_t d[F25519_SIZE];
+
+        f25519_add(n, mx, f25519_one);
+        f25519_inv__distinct(d, n);
+        f25519_sub(n, mx, f25519_one);
+        f25519_mul__distinct(ey, n, d);
+}
+
+static uint8_t ey2ex(uint8_t *x, const uint8_t *y, int parity)
+{
+        static const uint8_t d[F25519_SIZE] = {
+                0xa3, 0x78, 0x59, 0x13, 0xca, 0x4d, 0xeb, 0x75,
+                0xab, 0xd8, 0x41, 0x41, 0x4d, 0x0a, 0x70, 0x00,
+                0x98, 0xe8, 0x79, 0x77, 0x79, 0x40, 0xc7, 0x8c,
+                0x73, 0xfe, 0x6f, 0x2b, 0xee, 0x6c, 0x03, 0x52
+        };
+
+        uint8_t a[F25519_SIZE];
+        uint8_t b[F25519_SIZE];
+        uint8_t c[F25519_SIZE];
+
+        /* Compute c = y^2 */
+        f25519_mul__distinct(c, y, y);
+
+        /* Compute b = (1+dy^2)^-1 */
+        f25519_mul__distinct(b, c, d);
+        f25519_add(a, b, f25519_one);
+        f25519_inv__distinct(b, a);
+
+        /* Compute a = y^2-1 */
+        f25519_sub(a, c, f25519_one);
+
+        /* Compute c = a*b = (y^2-1)/(1+dy^2) */
+        f25519_mul__distinct(c, a, b);
+
+        /* Compute a, b = +/-sqrt(c), if c is square */
+        f25519_sqrt(a, c);
+        f25519_neg(b, a);
+
+        /* Select one of them, based on the parity bit */
+        f25519_select(x, a, b, (a[0] ^ parity) & 1);
+
+        /* Verify that x^2 = c */
+        f25519_mul__distinct(a, x, x);
+        f25519_normalize(a);
+        f25519_normalize(c);
+
+        return f25519_eq(a, c);
+}
+
+uint8_t morph25519_m2e(uint8_t *ex, uint8_t *ey,
+                       const uint8_t *mx, int parity)
+{
+        uint8_t ok;
+
+        mx2ey(ey, mx);
+        ok = ey2ex(ex, ey, parity);
+
+        f25519_normalize(ex);
+        f25519_normalize(ey);
+
+        return ok;
+}
diff --git a/installer/signplugin/tiny/morph25519.h b/installer/signplugin/tiny/morph25519.h
new file mode 100644
index 0000000..ead91f4
--- /dev/null
+++ b/installer/signplugin/tiny/morph25519.h
@@ -0,0 +1,29 @@
+/* Montgomery <-> Edwards isomorphism
+ * Daniel Beer , 18 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef MORPH25519_H_
+#define MORPH25519_H_
+
+#include <stdint.h>
+
+/* Convert an Edwards Y to a Montgomery X (Edwards X is not used).
+ * Resulting coordinate is normalized.
+ */
+void morph25519_e2m(uint8_t *montgomery_x, const uint8_t *edwards_y);
+
+/* Return a parity bit for the Edwards X coordinate */
+static inline int morph25519_eparity(const uint8_t *edwards_x)
+{
+        return edwards_x[0] & 1;
+}
+
+/* Convert a Montgomery X and a parity bit to an Edwards X/Y. Returns
+ * non-zero if successful.
+ */
+uint8_t morph25519_m2e(uint8_t *ex, uint8_t *ey,
+                       const uint8_t *mx, int parity);
+
+#endif
diff --git a/installer/signplugin/tiny/sha512.c b/installer/signplugin/tiny/sha512.c
new file mode 100644
index 0000000..d90d22d
--- /dev/null
+++ b/installer/signplugin/tiny/sha512.c
@@ -0,0 +1,228 @@
+/* SHA512
+ * Daniel Beer , 22 Apr 2014
+ *
+ * This file is in the public domain.
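+ *
+ * A hedged usage sketch (msg/len are the caller's buffer, not part of
+ * this API): feed whole 128-byte blocks, then the tail together with
+ * the total length:
+ *
+ *   struct sha512_state s;
+ *   uint8_t hash[SHA512_HASH_SIZE];
+ *   size_t i = 0;
+ *
+ *   sha512_init(&s);
+ *   while (len - i >= SHA512_BLOCK_SIZE) {
+ *           sha512_block(&s, msg + i);
+ *           i += SHA512_BLOCK_SIZE;
+ *   }
+ *   sha512_final(&s, msg + i, len);
+ *   sha512_get(&s, hash, 0, SHA512_HASH_SIZE);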
+ */ + +#include "sha512.h" + +const struct sha512_state sha512_initial_state = { { + 0x6a09e667f3bcc908LL, 0xbb67ae8584caa73bLL, + 0x3c6ef372fe94f82bLL, 0xa54ff53a5f1d36f1LL, + 0x510e527fade682d1LL, 0x9b05688c2b3e6c1fLL, + 0x1f83d9abfb41bd6bLL, 0x5be0cd19137e2179LL, +} }; + +static const uint64_t round_k[80] = { + 0x428a2f98d728ae22LL, 0x7137449123ef65cdLL, + 0xb5c0fbcfec4d3b2fLL, 0xe9b5dba58189dbbcLL, + 0x3956c25bf348b538LL, 0x59f111f1b605d019LL, + 0x923f82a4af194f9bLL, 0xab1c5ed5da6d8118LL, + 0xd807aa98a3030242LL, 0x12835b0145706fbeLL, + 0x243185be4ee4b28cLL, 0x550c7dc3d5ffb4e2LL, + 0x72be5d74f27b896fLL, 0x80deb1fe3b1696b1LL, + 0x9bdc06a725c71235LL, 0xc19bf174cf692694LL, + 0xe49b69c19ef14ad2LL, 0xefbe4786384f25e3LL, + 0x0fc19dc68b8cd5b5LL, 0x240ca1cc77ac9c65LL, + 0x2de92c6f592b0275LL, 0x4a7484aa6ea6e483LL, + 0x5cb0a9dcbd41fbd4LL, 0x76f988da831153b5LL, + 0x983e5152ee66dfabLL, 0xa831c66d2db43210LL, + 0xb00327c898fb213fLL, 0xbf597fc7beef0ee4LL, + 0xc6e00bf33da88fc2LL, 0xd5a79147930aa725LL, + 0x06ca6351e003826fLL, 0x142929670a0e6e70LL, + 0x27b70a8546d22ffcLL, 0x2e1b21385c26c926LL, + 0x4d2c6dfc5ac42aedLL, 0x53380d139d95b3dfLL, + 0x650a73548baf63deLL, 0x766a0abb3c77b2a8LL, + 0x81c2c92e47edaee6LL, 0x92722c851482353bLL, + 0xa2bfe8a14cf10364LL, 0xa81a664bbc423001LL, + 0xc24b8b70d0f89791LL, 0xc76c51a30654be30LL, + 0xd192e819d6ef5218LL, 0xd69906245565a910LL, + 0xf40e35855771202aLL, 0x106aa07032bbd1b8LL, + 0x19a4c116b8d2d0c8LL, 0x1e376c085141ab53LL, + 0x2748774cdf8eeb99LL, 0x34b0bcb5e19b48a8LL, + 0x391c0cb3c5c95a63LL, 0x4ed8aa4ae3418acbLL, + 0x5b9cca4f7763e373LL, 0x682e6ff3d6b2b8a3LL, + 0x748f82ee5defb2fcLL, 0x78a5636f43172f60LL, + 0x84c87814a1f0ab72LL, 0x8cc702081a6439ecLL, + 0x90befffa23631e28LL, 0xa4506cebde82bde9LL, + 0xbef9a3f7b2c67915LL, 0xc67178f2e372532bLL, + 0xca273eceea26619cLL, 0xd186b8c721c0c207LL, + 0xeada7dd6cde0eb1eLL, 0xf57d4f7fee6ed178LL, + 0x06f067aa72176fbaLL, 0x0a637dc5a2c898a6LL, + 0x113f9804bef90daeLL, 0x1b710b35131c471bLL, + 0x28db77f523047d84LL, 0x32caab7b40c72493LL, + 0x3c9ebe0a15c9bebcLL, 0x431d67c49c100d4cLL, + 0x4cc5d4becb3e42b6LL, 0x597f299cfc657e2aLL, + 0x5fcb6fab3ad6faecLL, 0x6c44198c4a475817LL, +}; + +static inline uint64_t load64(const uint8_t *x) +{ + uint64_t r; + + r = *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + + return r; +} + +static inline void store64(uint8_t *x, uint64_t v) +{ + x += 7; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; +} + +static inline uint64_t rot64(uint64_t x, int bits) +{ + return (x >> bits) | (x << (64 - bits)); +} + +void sha512_block(struct sha512_state *s, const uint8_t *blk) +{ + uint64_t w[16]; + uint64_t a, b, c, d, e, f, g, h; + int i; + + for (i = 0; i < 16; i++) { + w[i] = load64(blk); + blk += 8; + } + + /* Load state */ + a = s->h[0]; + b = s->h[1]; + c = s->h[2]; + d = s->h[3]; + e = s->h[4]; + f = s->h[5]; + g = s->h[6]; + h = s->h[7]; + + for (i = 0; i < 80; i++) { + /* Compute value of w[i + 16]. 
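(The 16-entry w[] array is a rolling window over the 80-entry message schedule; "wrap(i)" here means i & 15.)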
w[wrap(i)] is currently w[i] */
+                const uint64_t wi = w[i & 15];
+                const uint64_t wi15 = w[(i + 1) & 15];
+                const uint64_t wi2 = w[(i + 14) & 15];
+                const uint64_t wi7 = w[(i + 9) & 15];
+                const uint64_t s0 =
+                        rot64(wi15, 1) ^ rot64(wi15, 8) ^ (wi15 >> 7);
+                const uint64_t s1 =
+                        rot64(wi2, 19) ^ rot64(wi2, 61) ^ (wi2 >> 6);
+
+                /* Round calculations */
+                const uint64_t S0 = rot64(a, 28) ^ rot64(a, 34) ^ rot64(a, 39);
+                const uint64_t S1 = rot64(e, 14) ^ rot64(e, 18) ^ rot64(e, 41);
+                const uint64_t ch = (e & f) ^ ((~e) & g);
+                const uint64_t temp1 = h + S1 + ch + round_k[i] + wi;
+                const uint64_t maj = (a & b) ^ (a & c) ^ (b & c);
+                const uint64_t temp2 = S0 + maj;
+
+                /* Update round state */
+                h = g;
+                g = f;
+                f = e;
+                e = d + temp1;
+                d = c;
+                c = b;
+                b = a;
+                a = temp1 + temp2;
+
+                /* w[wrap(i)] becomes w[i + 16] */
+                w[i & 15] = wi + s0 + wi7 + s1;
+        }
+
+        /* Store state */
+        s->h[0] += a;
+        s->h[1] += b;
+        s->h[2] += c;
+        s->h[3] += d;
+        s->h[4] += e;
+        s->h[5] += f;
+        s->h[6] += g;
+        s->h[7] += h;
+}
+
+void sha512_final(struct sha512_state *s, const uint8_t *blk,
+                  size_t total_size)
+{
+        uint8_t temp[SHA512_BLOCK_SIZE] = {0};
+        const size_t last_size = total_size & (SHA512_BLOCK_SIZE - 1);
+
+        if (last_size)
+                memcpy(temp, blk, last_size);
+        temp[last_size] = 0x80;
+
+        if (last_size > 111) {
+                sha512_block(s, temp);
+                memset(temp, 0, sizeof(temp));
+        }
+
+        /* Note: we assume total_size fits in 61 bits */
+        store64(temp + SHA512_BLOCK_SIZE - 8, total_size << 3);
+        sha512_block(s, temp);
+}
+
+void sha512_get(const struct sha512_state *s, uint8_t *hash,
+                unsigned int offset, unsigned int len)
+{
+        int i;
+
+        if (offset > SHA512_HASH_SIZE)
+                return;
+
+        if (len > SHA512_HASH_SIZE - offset)
+                len = SHA512_HASH_SIZE - offset;
+
+        /* Skip whole words */
+        i = offset >> 3;
+        offset &= 7;
+
+        /* Skip/read out bytes */
+        if (offset) {
+                uint8_t tmp[8];
+                unsigned int c = 8 - offset;
+
+                if (c > len)
+                        c = len;
+
+                store64(tmp, s->h[i++]);
+                memcpy(hash, tmp + offset, c);
+                len -= c;
+                hash += c;
+        }
+
+        /* Read out whole words */
+        while (len >= 8) {
+                store64(hash, s->h[i++]);
+                hash += 8;
+                len -= 8;
+        }
+
+        /* Read out bytes */
+        if (len) {
+                uint8_t tmp[8];
+
+                store64(tmp, s->h[i]);
+                memcpy(hash, tmp, len);
+        }
+}
diff --git a/installer/signplugin/tiny/sha512.h b/installer/signplugin/tiny/sha512.h
new file mode 100644
index 0000000..1391745
--- /dev/null
+++ b/installer/signplugin/tiny/sha512.h
@@ -0,0 +1,52 @@
+/* SHA512
+ * Daniel Beer , 22 Apr 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef SHA512_H_
+#define SHA512_H_
+
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+
+/* SHA512 state. State is updated as data is fed in, and then the final
+ * hash can be read out in slices.
+ *
+ * Data is fed in as a sequence of full blocks terminated by a single
+ * partial block.
+ */
+struct sha512_state {
+        uint64_t h[8];
+};
+
+/* Initial state */
+extern const struct sha512_state sha512_initial_state;
+
+/* Set up a new context */
+static inline void sha512_init(struct sha512_state *s)
+{
+        memcpy(s, &sha512_initial_state, sizeof(*s));
+}
+
+/* Feed a full block in */
+#define SHA512_BLOCK_SIZE 128
+
+void sha512_block(struct sha512_state *s, const uint8_t *blk);
+
+/* Feed the last partial block in. The total stream size must be
+ * specified. The size of the block given is assumed to be (total_size %
+ * SHA512_BLOCK_SIZE). This might be zero, but you still need to call
+ * this function to terminate the stream.
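+ *
+ * For example, hashing a 200-byte message is one sha512_block() call
+ * for bytes 0..127, followed by sha512_final() with the remaining
+ * 72 bytes (200 % SHA512_BLOCK_SIZE) and total_size = 200.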
+ */ +void sha512_final(struct sha512_state *s, const uint8_t *blk, + size_t total_size); + +/* Fetch a slice of the hash result. */ +#define SHA512_HASH_SIZE 64 + +void sha512_get(const struct sha512_state *s, uint8_t *hash, + unsigned int offset, unsigned int len); + +#endif diff --git a/installer/signplugin/win32_crt_float.cpp b/installer/signplugin/win32_crt_float.cpp new file mode 100644 index 0000000..172fe7e --- /dev/null +++ b/installer/signplugin/win32_crt_float.cpp @@ -0,0 +1,95 @@ +extern "C" +{ + int _fltused; + +#ifdef _M_IX86 // following functions are needed only for 32-bit architecture + + __declspec(naked) void _ftol2() + { + __asm + { + fistp qword ptr [esp-8] + mov edx,[esp-4] + mov eax,[esp-8] + ret + } + } + + __declspec(naked) void _ftol2_sse() + { + __asm + { + fistp dword ptr [esp-4] + mov eax,[esp-4] + ret + } + } + +#if 0 // these functions are needed for SSE code for 32-bit arch, TODO: implement them + __declspec(naked) void _dtol3() + { + __asm + { + } + } + + + __declspec(naked) void _dtoui3() + { + __asm + { + } + } + + + __declspec(naked) void _dtoul3() + { + __asm + { + } + } + + + __declspec(naked) void _ftol3() + { + __asm + { + } + } + + + __declspec(naked) void _ftoui3() + { + __asm + { + } + } + + + __declspec(naked) void _ftoul3() + { + __asm + { + } + } + + + __declspec(naked) void _ltod3() + { + __asm + { + } + } + + + __declspec(naked) void _ultod3() + { + __asm + { + } + } +#endif + +#endif + +} \ No newline at end of file diff --git a/installer/signplugin/win32_crt_math.cpp b/installer/signplugin/win32_crt_math.cpp new file mode 100644 index 0000000..de61c7f --- /dev/null +++ b/installer/signplugin/win32_crt_math.cpp @@ -0,0 +1,947 @@ +#ifdef _M_IX86 // use this file only for 32-bit architecture + +#define CRT_LOWORD(x) dword ptr [x+0] +#define CRT_HIWORD(x) dword ptr [x+4] + +extern "C" +{ + __declspec(naked) void _alldiv() + { + #define DVND esp + 16 // stack address of dividend (a) + #define DVSR esp + 24 // stack address of divisor (b) + + __asm + { + push edi + push esi + push ebx + +; Determine sign of the result (edi = 0 if result is positive, non-zero +; otherwise) and make operands positive. + + xor edi,edi ; result sign assumed positive + + mov eax,CRT_HIWORD(DVND) ; hi word of a + or eax,eax ; test to see if signed + jge short L1 ; skip rest if a is already positive + inc edi ; complement result sign flag + mov edx,CRT_LOWORD(DVND) ; lo word of a + neg eax ; make a positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVND),eax ; save positive value + mov CRT_LOWORD(DVND),edx +L1: + mov eax,CRT_HIWORD(DVSR) ; hi word of b + or eax,eax ; test to see if signed + jge short L2 ; skip rest if b is already positive + inc edi ; complement the result sign flag + mov edx,CRT_LOWORD(DVSR) ; lo word of a + neg eax ; make b positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVSR),eax ; save positive value + mov CRT_LOWORD(DVSR),edx +L2: + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. 
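+; (4194304K = 4194304 * 1024 = 2^32, so this test really asks "does the
+; divisor fit in 32 bits"; if it does, two chained 32-bit divides give
+; the 64-bit quotient directly.)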
+; +; NOTE - eax currently contains the high order word of DVSR +; + + or eax,eax ; check to see if divisor < 4194304K + jnz short L3 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; eax <- high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; eax <- low order bits of quotient + mov edx,ebx ; edx:eax <- quotient + jmp short L4 ; set sign, restore stack and return + +; +; Here we do it the hard way. Remember, eax contains the high word of DVSR +; + +L3: + mov ebx,eax ; ebx:ecx <- divisor + mov ecx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L5: + shr ebx,1 ; shift divisor right one bit + rcr ecx,1 + shr edx,1 ; shift dividend right one bit + rcr eax,1 + or ebx,ebx + jnz short L5 ; loop until divisor < 4194304K + div ecx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L6 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. +; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L6 ; if result > original, do subtract + jb short L7 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L7 ; if less or equal we are ok, else subtract +L6: + dec esi ; subtract 1 from quotient +L7: + xor edx,edx ; edx:eax <- quotient + mov eax,esi + +; +; Just the cleanup left to do. edx:eax contains the quotient. Set the sign +; according to the save value, cleanup the stack, and return. +; + +L4: + dec edi ; check to see if result is negative + jnz short L8 ; if EDI == 0, result should be negative + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Restore the saved registers and return. +; + +L8: + pop ebx + pop esi + pop edi + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _alldvrm() + { + #define DVND esp + 16 // stack address of dividend (a) + #define DVSR esp + 24 // stack address of divisor (b) + + __asm + { + push edi + push esi + push ebp + +; Determine sign of the quotient (edi = 0 if result is positive, non-zero +; otherwise) and make operands positive. +; Sign of the remainder is kept in ebp. 
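+;
+; (_alldvrm returns two results at once: the quotient in edx:eax and the
+; remainder in ebx:ecx; see the register shuffle at label L9 below.)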
+ + xor edi,edi ; result sign assumed positive + xor ebp,ebp ; result sign assumed positive + + mov eax,CRT_HIWORD(DVND) ; hi word of a + or eax,eax ; test to see if signed + jge short L1 ; skip rest if a is already positive + inc edi ; complement result sign flag + inc ebp ; complement result sign flag + mov edx,CRT_LOWORD(DVND) ; lo word of a + neg eax ; make a positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVND),eax ; save positive value + mov CRT_LOWORD(DVND),edx +L1: + mov eax,CRT_HIWORD(DVSR) ; hi word of b + or eax,eax ; test to see if signed + jge short L2 ; skip rest if b is already positive + inc edi ; complement the result sign flag + mov edx,CRT_LOWORD(DVSR) ; lo word of a + neg eax ; make b positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVSR),eax ; save positive value + mov CRT_LOWORD(DVSR),edx +L2: + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; +; NOTE - eax currently contains the high order word of DVSR +; + + or eax,eax ; check to see if divisor < 4194304K + jnz short L3 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; eax <- high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; eax <- low order bits of quotient + mov esi,eax ; ebx:esi <- quotient +; +; Now we need to do a multiply so that we can compute the remainder. +; + mov eax,ebx ; set up high word of quotient + mul CRT_LOWORD(DVSR) ; CRT_HIWORD(QUOT) * DVSR + mov ecx,eax ; save the result in ecx + mov eax,esi ; set up low word of quotient + mul CRT_LOWORD(DVSR) ; CRT_LOWORD(QUOT) * DVSR + add edx,ecx ; EDX:EAX = QUOT * DVSR + jmp short L4 ; complete remainder calculation + +; +; Here we do it the hard way. Remember, eax contains the high word of DVSR +; + +L3: + mov ebx,eax ; ebx:ecx <- divisor + mov ecx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L5: + shr ebx,1 ; shift divisor right one bit + rcr ecx,1 + shr edx,1 ; shift dividend right one bit + rcr eax,1 + or ebx,ebx + jnz short L5 ; loop until divisor < 4194304K + div ecx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L6 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. 
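+; (The pre-shifted divide computes floor((a >> n) / (b >> n)), which can
+; be too large by at most one and is never too small, so the single
+; decrement below is the only correction ever needed.)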
+; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L6 ; if result > original, do subtract + jb short L7 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L7 ; if less or equal we are ok, else subtract +L6: + dec esi ; subtract 1 from quotient + sub eax,CRT_LOWORD(DVSR) ; subtract divisor from result + sbb edx,CRT_HIWORD(DVSR) +L7: + xor ebx,ebx ; ebx:esi <- quotient + +L4: +; +; Calculate remainder by subtracting the result from the original dividend. +; Since the result is already in a register, we will do the subtract in the +; opposite direction and negate the result if necessary. +; + + sub eax,CRT_LOWORD(DVND) ; subtract dividend from result + sbb edx,CRT_HIWORD(DVND) + +; +; Now check the result sign flag to see if the result is supposed to be positive +; or negative. It is currently negated (because we subtracted in the 'wrong' +; direction), so if the sign flag is set we are done, otherwise we must negate +; the result to make it positive again. +; + + dec ebp ; check result sign flag + jns short L9 ; result is ok, set up the quotient + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx. +; +L9: + mov ecx,edx + mov edx,ebx + mov ebx,ecx + mov ecx,eax + mov eax,esi + +; +; Just the cleanup left to do. edx:eax contains the quotient. Set the sign +; according to the save value, cleanup the stack, and return. +; + + dec edi ; check to see if result is negative + jnz short L8 ; if EDI == 0, result should be negative + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Restore the saved registers and return. +; + +L8: + pop ebp + pop esi + pop edi + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _allmul() + { + #define A esp + 8 // stack address of a + #define B esp + 16 // stack address of b + + __asm + { + push ebx + + mov eax,CRT_HIWORD(A) + mov ecx,CRT_LOWORD(B) + mul ecx ;eax has AHI, ecx has BLO, so AHI * BLO + mov ebx,eax ;save result + + mov eax,CRT_LOWORD(A) + mul CRT_HIWORD(B) ;ALO * BHI + add ebx,eax ;ebx = ((ALO * BHI) + (AHI * BLO)) + + mov eax,CRT_LOWORD(A) ;ecx = BLO + mul ecx ;so edx:eax = ALO*BLO + add edx,ebx ;now edx has all the LO*HI stuff + + pop ebx + + ret 16 ; callee restores the stack + } + + #undef A + #undef B + } + + __declspec(naked) void _allrem() + { + #define DVND esp + 12 // stack address of dividend (a) + #define DVSR esp + 20 // stack address of divisor (b) + + __asm + { + push ebx + push edi + + +; Determine sign of the result (edi = 0 if result is positive, non-zero +; otherwise) and make operands positive. + + xor edi,edi ; result sign assumed positive + + mov eax,CRT_HIWORD(DVND) ; hi word of a + or eax,eax ; test to see if signed + jge short L1 ; skip rest if a is already positive + inc edi ; complement result sign flag bit + mov edx,CRT_LOWORD(DVND) ; lo word of a + neg eax ; make a positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVND),eax ; save positive value + mov CRT_LOWORD(DVND),edx +L1: + mov eax,CRT_HIWORD(DVSR) ; hi word of b + or eax,eax ; test to see if signed + jge short L2 ; skip rest if b is already positive + mov edx,CRT_LOWORD(DVSR) ; lo word of b + neg eax ; make b positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVSR),eax ; save positive value + mov CRT_LOWORD(DVSR),edx +L2: + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. 
; If so, then we can use a simple algorithm with word divides, otherwise
+; things get a little more complex.
+;
+; NOTE - eax currently contains the high order word of DVSR
+;
+
+        or      eax,eax         ; check to see if divisor < 4194304K
+        jnz     short L3        ; nope, gotta do this the hard way
+        mov     ecx,CRT_LOWORD(DVSR) ; load divisor
+        mov     eax,CRT_HIWORD(DVND) ; load high word of dividend
+        xor     edx,edx
+        div     ecx             ; edx <- remainder
+        mov     eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
+        div     ecx             ; edx <- final remainder
+        mov     eax,edx         ; edx:eax <- remainder
+        xor     edx,edx
+        dec     edi             ; check result sign flag
+        jns     short L4        ; negate result, restore stack and return
+        jmp     short L8        ; result sign ok, restore stack and return
+
+;
+; Here we do it the hard way. Remember, eax contains the high word of DVSR
+;
+
+L3:
+        mov     ebx,eax         ; ebx:ecx <- divisor
+        mov     ecx,CRT_LOWORD(DVSR)
+        mov     edx,CRT_HIWORD(DVND) ; edx:eax <- dividend
+        mov     eax,CRT_LOWORD(DVND)
+L5:
+        shr     ebx,1           ; shift divisor right one bit
+        rcr     ecx,1
+        shr     edx,1           ; shift dividend right one bit
+        rcr     eax,1
+        or      ebx,ebx
+        jnz     short L5        ; loop until divisor < 4194304K
+        div     ecx             ; now divide, ignore remainder
+
+;
+; We may be off by one, so to check, we will multiply the quotient
+; by the divisor and check the result against the original dividend
+; Note that we must also check for overflow, which can occur if the
+; dividend is close to 2**64 and the quotient is off by 1.
+;
+
+        mov     ecx,eax         ; save a copy of quotient in ECX
+        mul     CRT_HIWORD(DVSR)
+        xchg    ecx,eax         ; save product, get quotient in EAX
+        mul     CRT_LOWORD(DVSR)
+        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
+        jc      short L6        ; carry means Quotient is off by 1
+
+;
+; do long compare here between original dividend and the result of the
+; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
+; subtract the original divisor from the result.
+;
+
+        cmp     edx,CRT_HIWORD(DVND) ; compare hi words of result and original
+        ja      short L6        ; if result > original, do subtract
+        jb      short L7        ; if result < original, we are ok
+        cmp     eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words
+        jbe     short L7        ; if less or equal we are ok, else subtract
+L6:
+        sub     eax,CRT_LOWORD(DVSR) ; subtract divisor from result
+        sbb     edx,CRT_HIWORD(DVSR)
+L7:
+
+;
+; Calculate remainder by subtracting the result from the original dividend.
+; Since the result is already in a register, we will do the subtract in the
+; opposite direction and negate the result if necessary.
+;
+
+        sub     eax,CRT_LOWORD(DVND) ; subtract dividend from result
+        sbb     edx,CRT_HIWORD(DVND)
+
+;
+; Now check the result sign flag to see if the result is supposed to be positive
+; or negative. It is currently negated (because we subtracted in the 'wrong'
+; direction), so if the sign flag is set we are done, otherwise we must negate
+; the result to make it positive again.
+;
+
+        dec     edi             ; check result sign flag
+        jns     short L8        ; result is ok, restore stack and return
+L4:
+        neg     edx             ; otherwise, negate the result
+        neg     eax
+        sbb     edx,0
+
+;
+; Just the cleanup left to do. edx:eax contains the remainder.
+; Restore the saved registers and return.
+; + +L8: + pop edi + pop ebx + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _allshl() + { + __asm + { +; +; Handle shifts of 64 or more bits (all get 0) +; + cmp cl, 64 + jae short RETZERO + +; +; Handle shifts of between 0 and 31 bits +; + cmp cl, 32 + jae short MORE32 + shld edx,eax,cl + shl eax,cl + ret + +; +; Handle shifts of between 32 and 63 bits +; +MORE32: + mov edx,eax + xor eax,eax + and cl,31 + shl edx,cl + ret + +; +; return 0 in edx:eax +; +RETZERO: + xor eax,eax + xor edx,edx + ret + } + } + + __declspec(naked) void _allshr() + { + __asm + { +; +; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result +; depends only on the high order bit of edx). +; + cmp cl,64 + jae short RETSIGN + +; +; Handle shifts of between 0 and 31 bits +; + cmp cl, 32 + jae short MORE32 + shrd eax,edx,cl + sar edx,cl + ret + +; +; Handle shifts of between 32 and 63 bits +; +MORE32: + mov eax,edx + sar edx,31 + and cl,31 + sar eax,cl + ret + +; +; Return double precision 0 or -1, depending on the sign of edx +; +RETSIGN: + sar edx,31 + mov eax,edx + ret + } + } + + __declspec(naked) void _aulldiv() + { + #define DVND esp + 12 // stack address of dividend (a) + #define DVSR esp + 20 // stack address of divisor (b) + + __asm + { + push ebx + push esi + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; + + mov eax,CRT_HIWORD(DVSR) ; check to see if divisor < 4194304K + or eax,eax + jnz short L1 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; get high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; get low order bits of quotient + mov edx,ebx ; edx:eax <- quotient hi:quotient lo + jmp short L2 ; restore stack and return + +; +; Here we do it the hard way. Remember, eax contains DVSRHI +; + +L1: + mov ecx,eax ; ecx:ebx <- divisor + mov ebx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L3: + shr ecx,1 ; shift divisor right one bit; hi bit <- 0 + rcr ebx,1 + shr edx,1 ; shift dividend right one bit; hi bit <- 0 + rcr eax,1 + or ecx,ecx + jnz short L3 ; loop until divisor < 4194304K + div ebx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L4 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. 
+; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L4 ; if result > original, do subtract + jb short L5 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L5 ; if less or equal we are ok, else subtract +L4: + dec esi ; subtract 1 from quotient +L5: + xor edx,edx ; edx:eax <- quotient + mov eax,esi + +; +; Just the cleanup left to do. edx:eax contains the quotient. +; Restore the saved registers and return. +; + +L2: + + pop esi + pop ebx + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _aulldvrm() + { + #define DVND esp + 8 // stack address of dividend (a) + #define DVSR esp + 16 // stack address of divisor (b) + + __asm + { + push esi + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; + + mov eax,CRT_HIWORD(DVSR) ; check to see if divisor < 4194304K + or eax,eax + jnz short L1 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; get high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; get low order bits of quotient + mov esi,eax ; ebx:esi <- quotient + +; +; Now we need to do a multiply so that we can compute the remainder. +; + mov eax,ebx ; set up high word of quotient + mul CRT_LOWORD(DVSR) ; CRT_HIWORD(QUOT) * DVSR + mov ecx,eax ; save the result in ecx + mov eax,esi ; set up low word of quotient + mul CRT_LOWORD(DVSR) ; CRT_LOWORD(QUOT) * DVSR + add edx,ecx ; EDX:EAX = QUOT * DVSR + jmp short L2 ; complete remainder calculation + +; +; Here we do it the hard way. Remember, eax contains DVSRHI +; + +L1: + mov ecx,eax ; ecx:ebx <- divisor + mov ebx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L3: + shr ecx,1 ; shift divisor right one bit; hi bit <- 0 + rcr ebx,1 + shr edx,1 ; shift dividend right one bit; hi bit <- 0 + rcr eax,1 + or ecx,ecx + jnz short L3 ; loop until divisor < 4194304K + div ebx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L4 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. +; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L4 ; if result > original, do subtract + jb short L5 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L5 ; if less or equal we are ok, else subtract +L4: + dec esi ; subtract 1 from quotient + sub eax,CRT_LOWORD(DVSR) ; subtract divisor from result + sbb edx,CRT_HIWORD(DVSR) +L5: + xor ebx,ebx ; ebx:esi <- quotient + +L2: +; +; Calculate remainder by subtracting the result from the original dividend. 
+; Since the result is already in a register, we will do the subtract in the +; opposite direction and negate the result. +; + + sub eax,CRT_LOWORD(DVND) ; subtract dividend from result + sbb edx,CRT_HIWORD(DVND) + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx. +; + mov ecx,edx + mov edx,ebx + mov ebx,ecx + mov ecx,eax + mov eax,esi +; +; Just the cleanup left to do. edx:eax contains the quotient. +; Restore the saved registers and return. +; + + pop esi + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _aullrem() + { + #define DVND esp + 8 // stack address of dividend (a) + #define DVSR esp + 16 // stack address of divisor (b) + + __asm + { + push ebx + +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; + + mov eax,CRT_HIWORD(DVSR) ; check to see if divisor < 4194304K + or eax,eax + jnz short L1 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; edx <- remainder, eax <- quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; edx <- final remainder + mov eax,edx ; edx:eax <- remainder + xor edx,edx + jmp short L2 ; restore stack and return + +; +; Here we do it the hard way. Remember, eax contains DVSRHI +; + +L1: + mov ecx,eax ; ecx:ebx <- divisor + mov ebx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L3: + shr ecx,1 ; shift divisor right one bit; hi bit <- 0 + rcr ebx,1 + shr edx,1 ; shift dividend right one bit; hi bit <- 0 + rcr eax,1 + or ecx,ecx + jnz short L3 ; loop until divisor < 4194304K + div ebx ; now divide, ignore remainder + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mov ecx,eax ; save a copy of quotient in ECX + mul CRT_HIWORD(DVSR) + xchg ecx,eax ; put partial product in ECX, get quotient in EAX + mul CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L4 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we're ok, otherwise +; subtract the original divisor from the result. +; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L4 ; if result > original, do subtract + jb short L5 ; if result < original, we're ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L5 ; if less or equal we're ok, else subtract +L4: + sub eax,CRT_LOWORD(DVSR) ; subtract divisor from result + sbb edx,CRT_HIWORD(DVSR) +L5: + +; +; Calculate remainder by subtracting the result from the original dividend. +; Since the result is already in a register, we will perform the subtract in +; the opposite direction and negate the result to make it positive. +; + + sub eax,CRT_LOWORD(DVND) ; subtract original dividend from result + sbb edx,CRT_HIWORD(DVND) + neg edx ; and negate it + neg eax + sbb edx,0 + +; +; Just the cleanup left to do. dx:ax contains the remainder. +; Restore the saved registers and return. 
+; + +L2: + + pop ebx + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _aullshr() + { + __asm + { + cmp cl,64 + jae short RETZERO + +; +; Handle shifts of between 0 and 31 bits +; + cmp cl, 32 + jae short MORE32 + shrd eax,edx,cl + shr edx,cl + ret + +; +; Handle shifts of between 32 and 63 bits +; +MORE32: + mov eax,edx + xor edx,edx + and cl,31 + shr eax,cl + ret + +; +; return 0 in edx:eax +; +RETZERO: + xor eax,eax + xor edx,edx + ret + } + } +} + +#undef CRT_LOWORD +#undef CRT_HIWORD + +#endif diff --git a/installer/signplugin/win32_crt_memory.cpp b/installer/signplugin/win32_crt_memory.cpp new file mode 100644 index 0000000..b6bd6b6 --- /dev/null +++ b/installer/signplugin/win32_crt_memory.cpp @@ -0,0 +1,26 @@ +#include +extern "C" +{ + #pragma function(memset) + void *memset(void *dest, int c, size_t count) + { + char *bytes = (char *)dest; + while (count--) + { + *bytes++ = (char)c; + } + return dest; + } + + #pragma function(memcpy) + void *memcpy(void *dest, const void *src, size_t count) + { + char *dest8 = (char *)dest; + const char *src8 = (const char *)src; + while (count--) + { + *dest8++ = *src8++; + } + return dest; + } +} diff --git a/installer/signplugin/win32_crt_seh.cpp b/installer/signplugin/win32_crt_seh.cpp new file mode 100644 index 0000000..51feb8e --- /dev/null +++ b/installer/signplugin/win32_crt_seh.cpp @@ -0,0 +1,99 @@ +extern "C" +{ +#if _M_IX86 + +EXCEPTION_DISPOSITION +_except_handler3( + struct _EXCEPTION_RECORD* ExceptionRecord, + void* EstablisherFrame, + struct _CONTEXT* ContextRecord, + void* DispatcherContext) +{ + typedef EXCEPTION_DISPOSITION Function(struct _EXCEPTION_RECORD*, void*, struct _CONTEXT*, void*); + static Function* FunctionPtr; + + if (!FunctionPtr) + { + HMODULE Library = LoadLibraryA("msvcrt.dll"); + FunctionPtr = (Function*)GetProcAddress(Library, "_except_handler3"); + } + + return FunctionPtr(ExceptionRecord, EstablisherFrame, ContextRecord, DispatcherContext); +} + +UINT_PTR __security_cookie = 0xBB40E64E; + +extern PVOID __safe_se_handler_table[]; +extern BYTE __safe_se_handler_count; + +typedef struct { + DWORD Size; + DWORD TimeDateStamp; + WORD MajorVersion; + WORD MinorVersion; + DWORD GlobalFlagsClear; + DWORD GlobalFlagsSet; + DWORD CriticalSectionDefaultTimeout; + DWORD DeCommitFreeBlockThreshold; + DWORD DeCommitTotalFreeThreshold; + DWORD LockPrefixTable; + DWORD MaximumAllocationSize; + DWORD VirtualMemoryThreshold; + DWORD ProcessHeapFlags; + DWORD ProcessAffinityMask; + WORD CSDVersion; + WORD Reserved1; + DWORD EditList; + PUINT_PTR SecurityCookie; + PVOID *SEHandlerTable; + DWORD SEHandlerCount; +} IMAGE_LOAD_CONFIG_DIRECTORY32_2; + +const +IMAGE_LOAD_CONFIG_DIRECTORY32_2 _load_config_used = { + sizeof(IMAGE_LOAD_CONFIG_DIRECTORY32_2), + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + &__security_cookie, + __safe_se_handler_table, + (DWORD)(DWORD_PTR) &__safe_se_handler_count +}; + +#elif _M_AMD64 + +EXCEPTION_DISPOSITION +__C_specific_handler( + struct _EXCEPTION_RECORD* ExceptionRecord, + void* EstablisherFrame, + struct _CONTEXT* ContextRecord, + struct _DISPATCHER_CONTEXT* DispatcherContext) +{ + typedef EXCEPTION_DISPOSITION Function(struct _EXCEPTION_RECORD*, void*, struct _CONTEXT*, _DISPATCHER_CONTEXT*); + static Function* FunctionPtr; + + if (!FunctionPtr) + { + HMODULE Library = LoadLibraryA("msvcrt.dll"); + FunctionPtr = (Function*)GetProcAddress(Library, "__C_specific_handler"); + } + + return FunctionPtr(ExceptionRecord, EstablisherFrame, 
ContextRecord, DispatcherContext); +} + +#endif + +} diff --git a/installer/tap/.gitignore b/installer/tap/.gitignore new file mode 100644 index 0000000..fee6563 --- /dev/null +++ b/installer/tap/.gitignore @@ -0,0 +1,2 @@ +/TunSafe-TAP-auto.exe.sig +/TunSafe-TAP-auto.exe \ No newline at end of file diff --git a/installer/tap/COPYING b/installer/tap/COPYING new file mode 100644 index 0000000..d8f3e59 --- /dev/null +++ b/installer/tap/COPYING @@ -0,0 +1,365 @@ +You can find and download the source code for this +TunSafe-TAP Network Adapter at: https://tunsafe.com/open-source + +The source and object code of the tap-windows6 project +is Copyright (C) 2002-2014 OpenVPN Technologies, Inc. The +NSIS installer is Copyright (C) 2018 TunSafe, Copyright (C) +2014 OpenVPN Technologies, Inc. and (C) 2012 Alon Bar-Lev. +Both are released under the GPL version 2. See COPYING +for the full GPL license. The licensors also make the following +statement borrowed from the SPICE project: + +With respect to binaries built using the Microsoft(R) +Windows Driver Kit (WDK), GPLv2 does not extend to any code +contained in or derived from the WDK ("WDK Code"). As to +WDK Code, by using or distributing such binaries you agree +to be bound by the Microsoft Software License Terms for the +WDK. All WDK Code is considered by the GPLv2 licensors to +qualify for the special exception stated in section 3 of +GPLv2 (commonly known as the system library exception). + +The tap-windows.h file has been released under the MIT +license (see COPYRIGHT.MIT) as well as under GPLv2 (see +COPYRIGHT.GPL). This has been done to allow the use of the +header file in non-GPLv2 compatible projects. + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. 
+ + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. 
+ + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
+
diff --git a/installer/tap/ShellLink.dll b/installer/tap/ShellLink.dll
new file mode 100644
index 0000000..f57ded3
Binary files /dev/null and b/installer/tap/ShellLink.dll differ
diff --git a/installer/tap/build.bat b/installer/tap/build.bat
new file mode 100644
index 0000000..45251d3
--- /dev/null
+++ b/installer/tap/build.bat
@@ -0,0 +1 @@
+"C:\Dev\NSIS\makensis.exe" tap-windows6.nsi
\ No newline at end of file
diff --git a/installer/tap/icon.ico b/installer/tap/icon.ico
new file mode 100644
index 0000000..06b583b
Binary files /dev/null and b/installer/tap/icon.ico differ
diff --git a/installer/tap/install-whirl.bmp b/installer/tap/install-whirl.bmp
new file mode 100644
index 0000000..e1186bd
Binary files /dev/null and b/installer/tap/install-whirl.bmp differ
diff --git a/installer/tap/prebuilt/x64/OemVista.inf b/installer/tap/prebuilt/x64/OemVista.inf
new file mode 100644
index 0000000..d92e255
--- /dev/null
+++ b/installer/tap/prebuilt/x64/OemVista.inf
@@ -0,0 +1,191 @@
+; ****************************************************************************
+; * Copyright (C) 2002-2014 OpenVPN Technologies, Inc.                       *
+; * This program is free software; you can redistribute it and/or modify     *
+; * it under the terms of the GNU General Public License version 2           *
+; * as published by the Free Software Foundation.
* +; **************************************************************************** + +; SYNTAX CHECKER +; cd \WINDDK\3790\tools\chkinf +; chkinf c:\src\openvpn\tap-win32\i386\oemvista.inf +; OUTPUT -> file:///c:/WINDDK/3790/tools/chkinf/htm/c%23+src+openvpn+tap-win32+i386+__OemWin2k.htm + +; INSTALL/REMOVE DRIVER +; tapinstall install OemVista.inf tapoas +; tapinstall update OemVista.inf tapoas +; tapinstall remove tapoas + +;********************************************************* +; Note to Developers: +; +; If you are bundling the TAP-Windows driver with your app, +; you should try to rename it in such a way that it will +; not collide with other instances of TAP-Windows defined +; by other apps. Multiple versions of the TAP-Windows +; driver, each installed by different apps, can coexist +; on the same machine if you follow these guidelines. +; NOTE: these instructions assume you are editing the +; generated OemWin2k.inf file, not the source +; OemWin2k.inf.in file which is preprocessed by winconfig +; and uses macro definitions from settings.in. +; +; (1) Rename all tapXXXX instances in this file to +; something different (use at least 5 characters +; for this name!) +; (2) Change the "!define TAP" definition in openvpn.nsi +; to match what you changed tapXXXX to. +; (3) Change TARGETNAME in SOURCES to match what you +; changed tapXXXX to. +; (4) Change TAP_COMPONENT_ID in common.h to match what +; you changed tapXXXX to. +; (5) Change SZDEPENDENCIES in service.h to match what +; you changed tapXXXX to. +; (6) Change DeviceDescription and Provider strings. +; (7) Change PRODUCT_TAP_WIN_DEVICE_DESCRIPTION in constants.h to what you +; set DeviceDescription to. +; +;********************************************************* + +[Version] + Signature = "$Windows NT$" + CatalogFile = tap0901.cat + ClassGUID = {4d36e972-e325-11ce-bfc1-08002be10318} + Provider = %Provider% + Class = Net + +; This version number should match the version +; number given in SOURCES. 
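+; (For reference: the standard INF syntax for the directive below is
+;  DriverVer = mm/dd/yyyy[,w.x.y.z], so this line declares driver
+;  version 9.00.00.21 with a release date of April 21, 2016.)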
+ DriverVer=04/21/2016,9.00.00.21 + +[Strings] + DeviceDescription = "TAP-Windows Adapter V9" + Provider = "TAP-Windows Provider V9" + +;---------------------------------------------------------------- +; Manufacturer + Product Section (Done) +;---------------------------------------------------------------- +[Manufacturer] + %Provider% = tap0901, NTamd64 + +[tap0901.NTamd64] + %DeviceDescription% = tap0901.ndi, root\tap0901 ; Root enumerated + %DeviceDescription% = tap0901.ndi, tap0901 ; Legacy + +;--------------------------------------------------------------- +; Driver Section (Done) +;--------------------------------------------------------------- + +;----------------- Characteristics ------------ +; NCF_PHYSICAL = 0x04 +; NCF_VIRTUAL = 0x01 +; NCF_SOFTWARE_ENUMERATED = 0x02 +; NCF_HIDDEN = 0x08 +; NCF_NO_SERVICE = 0x10 +; NCF_HAS_UI = 0x80 +;----------------- Characteristics ------------ + +[tap0901.ndi] + CopyFiles = tap0901.driver, tap0901.files + AddReg = tap0901.reg + AddReg = tap0901.params.reg + Characteristics = + *IfType = 0x6 ; IF_TYPE_ETHERNET_CSMACD + *MediaType = 0x0 ; NdisMedium802_3 + *PhysicalMediaType = 14 ; NdisPhysicalMedium802_3 + +[tap0901.ndi.Services] + AddService = tap0901, 2, tap0901.service + +[tap0901.reg] + HKR, Ndi, Service, 0, "tap0901" + HKR, Ndi\Interfaces, UpperRange, 0, "ndis5" + HKR, Ndi\Interfaces, LowerRange, 0, "ethernet" + HKR, , Manufacturer, 0, "%Provider%" + HKR, , ProductName, 0, "%DeviceDescription%" + +[tap0901.params.reg] + HKR, Ndi\params\MTU, ParamDesc, 0, "MTU" + HKR, Ndi\params\MTU, Type, 0, "int" + HKR, Ndi\params\MTU, Default, 0, "1500" + HKR, Ndi\params\MTU, Optional, 0, "0" + HKR, Ndi\params\MTU, Min, 0, "100" + HKR, Ndi\params\MTU, Max, 0, "1500" + HKR, Ndi\params\MTU, Step, 0, "1" + HKR, Ndi\params\MediaStatus, ParamDesc, 0, "Media Status" + HKR, Ndi\params\MediaStatus, Type, 0, "enum" + HKR, Ndi\params\MediaStatus, Default, 0, "0" + HKR, Ndi\params\MediaStatus, Optional, 0, "0" + HKR, Ndi\params\MediaStatus\enum, "0", 0, "Application Controlled" + HKR, Ndi\params\MediaStatus\enum, "1", 0, "Always Connected" + HKR, Ndi\params\MAC, ParamDesc, 0, "MAC Address" + HKR, Ndi\params\MAC, Type, 0, "edit" + HKR, Ndi\params\MAC, Optional, 0, "1" + HKR, Ndi\params\AllowNonAdmin, ParamDesc, 0, "Non-Admin Access" + HKR, Ndi\params\AllowNonAdmin, Type, 0, "enum" + HKR, Ndi\params\AllowNonAdmin, Default, 0, "1" + HKR, Ndi\params\AllowNonAdmin, Optional, 0, "0" + HKR, Ndi\params\AllowNonAdmin\enum, "0", 0, "Not Allowed" + HKR, Ndi\params\AllowNonAdmin\enum, "1", 0, "Allowed" + +;---------------------------------------------------------------- +; Service Section +;---------------------------------------------------------------- + +;---------- Service Type ------------- +; SERVICE_KERNEL_DRIVER = 0x01 +; SERVICE_WIN32_OWN_PROCESS = 0x10 +;---------- Service Type ------------- + +;---------- Start Mode --------------- +; SERVICE_BOOT_START = 0x0 +; SERVICE_SYSTEM_START = 0x1 +; SERVICE_AUTO_START = 0x2 +; SERVICE_DEMAND_START = 0x3 +; SERVICE_DISABLED = 0x4 +;---------- Start Mode --------------- + +[tap0901.service] + DisplayName = %DeviceDescription% + ServiceType = 1 + StartType = 3 + ErrorControl = 1 + LoadOrderGroup = NDIS + ServiceBinary = %12%\tap0901.sys + +;----------------------------------------------------------------- +; File Installation +;----------------------------------------------------------------- + +;----------------- Copy Flags ------------ +; COPYFLG_NOSKIP = 0x02 +; COPYFLG_NOVERSIONCHECK = 0x04 +;----------------- Copy 
Flags ------------ + +; SourceDisksNames +; diskid = description[, [tagfile] [, , subdir]] +; 1 = "Intel Driver Disk 1",e100bex.sys,, + +[SourceDisksNames] + 1 = %DeviceDescription%, tap0901.sys + +; SourceDisksFiles +; filename_on_source = diskID[, [subdir][, size]] +; e100bex.sys = 1,, ; on distribution disk 1 + +[SourceDisksFiles] +tap0901.sys = 1 + +[DestinationDirs] + tap0901.files = 11 + tap0901.driver = 12 + +[tap0901.files] +; TapPanel.cpl,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK +; cipsrvr.exe,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +[tap0901.driver] + tap0901.sys,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +;--------------------------------------------------------------- +; End +;--------------------------------------------------------------- diff --git a/installer/tap/prebuilt/x64/tap0901.cat b/installer/tap/prebuilt/x64/tap0901.cat new file mode 100644 index 0000000..70ddd2c Binary files /dev/null and b/installer/tap/prebuilt/x64/tap0901.cat differ diff --git a/installer/tap/prebuilt/x64/tap0901.sys b/installer/tap/prebuilt/x64/tap0901.sys new file mode 100644 index 0000000..c662820 Binary files /dev/null and b/installer/tap/prebuilt/x64/tap0901.sys differ diff --git a/installer/tap/prebuilt/x64/tapinstall.exe b/installer/tap/prebuilt/x64/tapinstall.exe new file mode 100644 index 0000000..a1ebb9f Binary files /dev/null and b/installer/tap/prebuilt/x64/tapinstall.exe differ diff --git a/installer/tap/prebuilt/x86/OemVista.inf b/installer/tap/prebuilt/x86/OemVista.inf new file mode 100644 index 0000000..6cd6791 --- /dev/null +++ b/installer/tap/prebuilt/x86/OemVista.inf @@ -0,0 +1,191 @@ +; **************************************************************************** +; * Copyright (C) 2002-2014 OpenVPN Technologies, Inc. * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; SYNTAX CHECKER +; cd \WINDDK\3790\tools\chkinf +; chkinf c:\src\openvpn\tap-win32\i386\oemvista.inf +; OUTPUT -> file:///c:/WINDDK/3790/tools/chkinf/htm/c%23+src+openvpn+tap-win32+i386+__OemWin2k.htm + +; INSTALL/REMOVE DRIVER +; tapinstall install OemVista.inf tapoas +; tapinstall update OemVista.inf tapoas +; tapinstall remove tapoas + +;********************************************************* +; Note to Developers: +; +; If you are bundling the TAP-Windows driver with your app, +; you should try to rename it in such a way that it will +; not collide with other instances of TAP-Windows defined +; by other apps. Multiple versions of the TAP-Windows +; driver, each installed by different apps, can coexist +; on the same machine if you follow these guidelines. +; NOTE: these instructions assume you are editing the +; generated OemWin2k.inf file, not the source +; OemWin2k.inf.in file which is preprocessed by winconfig +; and uses macro definitions from settings.in. +; +; (1) Rename all tapXXXX instances in this file to +; something different (use at least 5 characters +; for this name!) +; (2) Change the "!define TAP" definition in openvpn.nsi +; to match what you changed tapXXXX to. +; (3) Change TARGETNAME in SOURCES to match what you +; changed tapXXXX to. +; (4) Change TAP_COMPONENT_ID in common.h to match what +; you changed tapXXXX to. +; (5) Change SZDEPENDENCIES in service.h to match what +; you changed tapXXXX to. 
+; (6) Change DeviceDescription and Provider strings. +; (7) Change PRODUCT_TAP_WIN_DEVICE_DESCRIPTION in constants.h to what you +; set DeviceDescription to. +; +;********************************************************* + +[Version] + Signature = "$Windows NT$" + CatalogFile = tap0901.cat + ClassGUID = {4d36e972-e325-11ce-bfc1-08002be10318} + Provider = %Provider% + Class = Net + +; This version number should match the version +; number given in SOURCES. + DriverVer=04/21/2016,9.00.00.21 + +[Strings] + DeviceDescription = "TAP-Windows Adapter V9" + Provider = "TAP-Windows Provider V9" + +;---------------------------------------------------------------- +; Manufacturer + Product Section (Done) +;---------------------------------------------------------------- +[Manufacturer] + %Provider% = tap0901 + +[tap0901] + %DeviceDescription% = tap0901.ndi, root\tap0901 ; Root enumerated + %DeviceDescription% = tap0901.ndi, tap0901 ; Legacy + +;--------------------------------------------------------------- +; Driver Section (Done) +;--------------------------------------------------------------- + +;----------------- Characteristics ------------ +; NCF_PHYSICAL = 0x04 +; NCF_VIRTUAL = 0x01 +; NCF_SOFTWARE_ENUMERATED = 0x02 +; NCF_HIDDEN = 0x08 +; NCF_NO_SERVICE = 0x10 +; NCF_HAS_UI = 0x80 +;----------------- Characteristics ------------ + +[tap0901.ndi] + CopyFiles = tap0901.driver, tap0901.files + AddReg = tap0901.reg + AddReg = tap0901.params.reg + Characteristics = + *IfType = 0x6 ; IF_TYPE_ETHERNET_CSMACD + *MediaType = 0x0 ; NdisMedium802_3 + *PhysicalMediaType = 14 ; NdisPhysicalMedium802_3 + +[tap0901.ndi.Services] + AddService = tap0901, 2, tap0901.service + +[tap0901.reg] + HKR, Ndi, Service, 0, "tap0901" + HKR, Ndi\Interfaces, UpperRange, 0, "ndis5" + HKR, Ndi\Interfaces, LowerRange, 0, "ethernet" + HKR, , Manufacturer, 0, "%Provider%" + HKR, , ProductName, 0, "%DeviceDescription%" + +[tap0901.params.reg] + HKR, Ndi\params\MTU, ParamDesc, 0, "MTU" + HKR, Ndi\params\MTU, Type, 0, "int" + HKR, Ndi\params\MTU, Default, 0, "1500" + HKR, Ndi\params\MTU, Optional, 0, "0" + HKR, Ndi\params\MTU, Min, 0, "100" + HKR, Ndi\params\MTU, Max, 0, "1500" + HKR, Ndi\params\MTU, Step, 0, "1" + HKR, Ndi\params\MediaStatus, ParamDesc, 0, "Media Status" + HKR, Ndi\params\MediaStatus, Type, 0, "enum" + HKR, Ndi\params\MediaStatus, Default, 0, "0" + HKR, Ndi\params\MediaStatus, Optional, 0, "0" + HKR, Ndi\params\MediaStatus\enum, "0", 0, "Application Controlled" + HKR, Ndi\params\MediaStatus\enum, "1", 0, "Always Connected" + HKR, Ndi\params\MAC, ParamDesc, 0, "MAC Address" + HKR, Ndi\params\MAC, Type, 0, "edit" + HKR, Ndi\params\MAC, Optional, 0, "1" + HKR, Ndi\params\AllowNonAdmin, ParamDesc, 0, "Non-Admin Access" + HKR, Ndi\params\AllowNonAdmin, Type, 0, "enum" + HKR, Ndi\params\AllowNonAdmin, Default, 0, "1" + HKR, Ndi\params\AllowNonAdmin, Optional, 0, "0" + HKR, Ndi\params\AllowNonAdmin\enum, "0", 0, "Not Allowed" + HKR, Ndi\params\AllowNonAdmin\enum, "1", 0, "Allowed" + +;---------------------------------------------------------------- +; Service Section +;---------------------------------------------------------------- + +;---------- Service Type ------------- +; SERVICE_KERNEL_DRIVER = 0x01 +; SERVICE_WIN32_OWN_PROCESS = 0x10 +;---------- Service Type ------------- + +;---------- Start Mode --------------- +; SERVICE_BOOT_START = 0x0 +; SERVICE_SYSTEM_START = 0x1 +; SERVICE_AUTO_START = 0x2 +; SERVICE_DEMAND_START = 0x3 +; SERVICE_DISABLED = 0x4 +;---------- Start Mode --------------- + 
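+; For reference, the service section below combines one value from each
+; table above: ServiceType = 1 (SERVICE_KERNEL_DRIVER) and StartType = 3
+; (SERVICE_DEMAND_START), i.e. the driver is loaded on demand rather
+; than at boot.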
+[tap0901.service]
+    DisplayName = %DeviceDescription%
+    ServiceType = 1
+    StartType = 3
+    ErrorControl = 1
+    LoadOrderGroup = NDIS
+    ServiceBinary = %12%\tap0901.sys
+
+;-----------------------------------------------------------------
+; File Installation
+;-----------------------------------------------------------------
+
+;----------------- Copy Flags ------------
+; COPYFLG_NOSKIP = 0x02
+; COPYFLG_NOVERSIONCHECK = 0x04
+;----------------- Copy Flags ------------
+
+; SourceDisksNames
+; diskid = description[, [tagfile] [, , subdir]]
+; 1 = "Intel Driver Disk 1",e100bex.sys,,
+
+[SourceDisksNames]
+    1 = %DeviceDescription%, tap0901.sys
+
+; SourceDisksFiles
+; filename_on_source = diskID[, [subdir][, size]]
+; e100bex.sys = 1,, ; on distribution disk 1
+
+[SourceDisksFiles]
+tap0901.sys = 1
+
+[DestinationDirs]
+    tap0901.files = 11
+    tap0901.driver = 12
+
+[tap0901.files]
+; TapPanel.cpl,,,6   ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK
+; cipsrvr.exe,,,6    ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK
+
+[tap0901.driver]
+    tap0901.sys,,,6  ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK
+
+;---------------------------------------------------------------
+; End
+;---------------------------------------------------------------
diff --git a/installer/tap/prebuilt/x86/tap0901.cat b/installer/tap/prebuilt/x86/tap0901.cat
new file mode 100644
index 0000000..d845310
Binary files /dev/null and b/installer/tap/prebuilt/x86/tap0901.cat differ
diff --git a/installer/tap/prebuilt/x86/tap0901.sys b/installer/tap/prebuilt/x86/tap0901.sys
new file mode 100644
index 0000000..fcba857
Binary files /dev/null and b/installer/tap/prebuilt/x86/tap0901.sys differ
diff --git a/installer/tap/prebuilt/x86/tapinstall.exe b/installer/tap/prebuilt/x86/tapinstall.exe
new file mode 100644
index 0000000..bc351c3
Binary files /dev/null and b/installer/tap/prebuilt/x86/tapinstall.exe differ
diff --git a/installer/tap/src/.appveyor.yml b/installer/tap/src/.appveyor.yml
new file mode 100644
index 0000000..09f2094
--- /dev/null
+++ b/installer/tap/src/.appveyor.yml
@@ -0,0 +1,5 @@
+version: 1.0.{build}
+build_script:
+- cmd: python buildtap.py -b
+artifacts:
+- path: '*'
diff --git a/installer/tap/src/.gitattributes b/installer/tap/src/.gitattributes
new file mode 100644
index 0000000..14d9eb5
--- /dev/null
+++ b/installer/tap/src/.gitattributes
@@ -0,0 +1 @@
+*.yml text=auto
diff --git a/installer/tap/src/.gitignore b/installer/tap/src/.gitignore
new file mode 100644
index 0000000..c84d745
--- /dev/null
+++ b/installer/tap/src/.gitignore
@@ -0,0 +1,10 @@
+dist/**
+*.pyc
+*.tar.gz
+src/config.h
+src/SOURCES
+src/build*.log
+src/obj*
+src/i386
+src/amd64
+tap-windows-*.exe
diff --git a/installer/tap/src/CONTRIBUTING.rst b/installer/tap/src/CONTRIBUTING.rst
new file mode 100644
index 0000000..6ee5908
--- /dev/null
+++ b/installer/tap/src/CONTRIBUTING.rst
@@ -0,0 +1,26 @@
+Contributing to tap-windows6
+============================
+
+To contribute to tap-windows6, please send your patches to the openvpn-devel
+mailing list:
+
+- https://lists.sourceforge.net/lists/listinfo/openvpn-devel
+
+The subject line should look like this:
+
+    [PATCH: tap-windows6] summary of the patch
+
+To avoid merging issues, patches should be created with git-format-patch or
+sent using git-send-email. The easiest way to add the subject line prefix is
+to use this option:
+
+    --subject-prefix='PATCH: tap-windows6'
+
+Patches that do not modify the actual driver code can be sent as GitHub pull
+requests.
Try to split large patches into small, atomic pieces to make reviews +and merging easier. + +If you want quick feedback on a patch, you can visit the #openvpn-devel channel +on Freenode. Note that you need to be logged in to join the channel: + +- http://freenode.net/faq.shtml#nicksetup diff --git a/installer/tap/src/COPYING b/installer/tap/src/COPYING new file mode 100644 index 0000000..a2dbdb8 --- /dev/null +++ b/installer/tap/src/COPYING @@ -0,0 +1,24 @@ +tap-windows6 license +-------------------- + +The source and object code of the tap-windows6 project +is Copyright (C) 2002-2014 OpenVPN Technologies, Inc. The +NSIS installer is Copyright (C) 2014 OpenVPN Technologies, +Inc. and (C) 2012 Alon Bar-Lev. Both are released under the +GPL version 2. See COPYRIGHT.GPL for the full GPL license. +The licensors also make the following statement borrowed +from the SPICE project: + +With respect to binaries built using the Microsoft(R) +Windows Driver Kit (WDK), GPLv2 does not extend to any code +contained in or derived from the WDK ("WDK Code"). As to +WDK Code, by using or distributing such binaries you agree +to be bound by the Microsoft Software License Terms for the +WDK. All WDK Code is considered by the GPLv2 licensors to +qualify for the special exception stated in section 3 of +GPLv2 (commonly known as the system library exception). + +The tap-windows.h file has been released under the MIT +license (see COPYRIGHT.MIT) as well as under GPLv2 (see +COPYRIGHT.GPL). This has been done to allow the use of the +header file in non-GPLv2 compatible projects. diff --git a/installer/tap/src/COPYRIGHT.GPL b/installer/tap/src/COPYRIGHT.GPL new file mode 100644 index 0000000..d159169 --- /dev/null +++ b/installer/tap/src/COPYRIGHT.GPL @@ -0,0 +1,339 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. 
You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. 
+ + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
diff --git a/installer/tap/src/COPYRIGHT.MIT b/installer/tap/src/COPYRIGHT.MIT
new file mode 100644
index 0000000..bfbb900
--- /dev/null
+++ b/installer/tap/src/COPYRIGHT.MIT
@@ -0,0 +1,20 @@
+The MIT License (MIT)
+Copyright © 2014 OpenVPN Technologies, Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the “Software”), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/installer/tap/src/MSCV-VSClass3.cer b/installer/tap/src/MSCV-VSClass3.cer
new file mode 100644
index 0000000..831757d
Binary files /dev/null and b/installer/tap/src/MSCV-VSClass3.cer differ
diff --git a/installer/tap/src/README.rst b/installer/tap/src/README.rst
new file mode 100644
index 0000000..c7039a9
--- /dev/null
+++ b/installer/tap/src/README.rst
@@ -0,0 +1,142 @@
+TAP-Windows driver (NDIS 6)
+===========================
+
+This is an NDIS 6 implementation of the TAP-Windows driver, used by OpenVPN and
+other apps. NDIS 6 drivers can run on Windows Vista or higher.
+
+Build
+-----
+
+To build, the following prerequisites are required:
+
+- Python 2.7
+- Microsoft Windows 7 WDK (Windows Driver Kit)
+- Windows code signing certificate
+- Git (not strictly required, but useful for running commands using the
+  bundled bash shell)
+- MakeNSIS (optional)
+- Patched source code directory of **devcon** sample from WDK (optional)
+- Prebuilt tapinstall.exe binaries (optional)
+
+Make sure you add Python's install directory (usually c:\\python27) to the PATH
+environment variable.
+
+These instructions have been tested on Windows 7 using Git Bash, as well as on
+Windows 2012 Server using Git Bash and Windows Powershell.
+
+View build script options::
+
+  $ python buildtap.py
+  Usage: buildtap.py [options]
+
+  Options:
+    -h, --help          show this help message and exit
+    -s SRC, --src=SRC   TAP-Windows top-level directory, default=
+    --ti=TAPINSTALL     tapinstall (i.e. devcon) directory (optional)
+    -d, --debug         enable debug build
+    -c, --clean         do an nmake clean before build
+    -b, --build         build TAP-Windows and possibly tapinstall (add -c to
+                        clean before build)
+    --sign              sign the driver files (disabled by default)
+    -p, --package       generate an NSIS installer from the compiled files
+    --cert=CERT         Common name of code signing certificate, default=openvpn
+    --crosscert=CERT    The cross-certificate file to use, default=MSCV-
+                        VSClass3.cer
+    --timestamp=URL     Timestamp URL to use, default=http://timestamp.verisign.c
+                        om/scripts/timstamp.dll
+    -a, --oas           Build for OpenVPN Access Server clients
+
+Edit **version.m4** and **paths.py** as necessary, then build::
+
+  $ python buildtap.py -b
+
+On successful completion, all build products will be placed in the "dist"
+directory as well as tap6.tar.gz. The NSIS installer package will be placed
+in the build root directory.
+
+Note that due to the strict driver signing requirements in Windows 10, you need
+an EV certificate to sign the driver files. These EV certificates may be
+stored inside a hardware device, which makes a fully automated signing process
+difficult, dangerous or impossible. Eventually the signing process will become
+even more involved, with drivers having to be submitted to the Windows
+Hardware Developer Center Dashboard portal. Therefore, by default, this
+buildsystem no longer signs any files. You can revert to the old behavior
+by using the --sign parameter.
+
+Building tapinstall (optional)
+------------------------------
+
+The build system supports building tapinstall.exe (a.k.a. devcon.exe). However,
+the devcon source code in WinDDK does not build without modifications, which
+cannot be made public due to licensing restrictions. For these reasons the
+default behavior is to reuse pre-built executables.
+To make sure the buildsystem finds the executables, create the following
+directory structure under the tap-windows6 directory:
+
+::
+
+    tapinstall
+    └── 7600
+        ├── objfre_wlh_amd64
+        │   └── amd64
+        │       └── tapinstall.exe
+        └── objfre_wlh_x86
+            └── i386
+                └── tapinstall.exe
+
+This structure is identical to what building tapinstall would create. Replace
+7600 with the major number of your WinDDK version. Finally, call buildtap.py
+with "--ti=tapinstall".
+
+Please note that the NSIS packaging (-p) step will fail if you don't have
+tapinstall.exe available. Also, don't use the "-c" flag, or the above
+directories will be wiped before MakeNSIS is able to find them.
+
+Install/Update/Remove
+---------------------
+
+The driver can be installed using a command-line tool, tapinstall.exe, which is
+bundled with OpenVPN and tap-windows installers. Note that in some versions of
+OpenVPN, tapinstall.exe is called devcon.exe. To install, update or remove the
+tap-windows NDIS 6 driver, follow these steps:
+
+- place tapinstall.exe/devcon.exe in your PATH
+- open an Administrator shell
+- cd to **dist**
+- cd to **amd64** or **i386** depending on your system's processor architecture.
+
+Install::
+
+  $ tapinstall install OemVista.inf TAP0901
+
+Update::
+
+  $ tapinstall update OemVista.inf TAP0901
+
+Remove::
+
+  $ tapinstall remove TAP0901
+
+Notes on proxies
+----------------
+
+It is possible to build tap-windows6 without connectivity to the Internet, but
+any attempt to timestamp the driver will fail. For this reason, configure your
+outbound proxy server before starting the build. Note that the command prompt
+also needs to be restarted to make use of new proxy settings.
+
+Notes on Authenticode signatures
+--------------------------------
+
+Recent Windows versions such as Windows 10 are fairly picky about the
+Authenticode signatures of kernel-mode drivers. In addition, making older
+Windows versions such as Vista play along with signatures that Windows 10
+accepts can be rather challenging. A good starting point on this topic is the
+`building tap-windows6 `_
+page on the OpenVPN community wiki. As that page points out, having two
+completely separate Authenticode signatures may be the only reasonable option.
+Fortunately there is a tool, `Sign-Tap6 `_,
+which can be used to append secondary signatures to the tap-windows6 driver or
+to handle the entire signing process if necessary.
+
+License
+-------
+
+See the file `COPYING `_.
diff --git a/installer/tap/src/buildtap.py b/installer/tap/src/buildtap.py new file mode 100644 index 0000000..6f506cd --- /dev/null +++ b/installer/tap/src/buildtap.py @@ -0,0 +1,513 @@ +# build TAP-Windows NDIS 6.0 driver + +import sys, os, re, shutil, tarfile + +import paths + +class BuildTAPWindows(object): + # regex for doing search replace on @MACRO@ style macros + macro_amper = re.compile(r"@(\w+)@") + + def __init__(self, opt): + self.opt = opt # command line options + if not opt.src: + raise ValueError("source directory undefined") + self.top = os.path.realpath(opt.src) # top-level dir + self.src = os.path.join(self.top, 'src') # src/openvpn dir + if opt.tapinstall: + self.top_tapinstall = os.path.realpath(opt.tapinstall) # tapinstall dir + else: + self.top_tapinstall = None + if opt.package: + raise ValueError("parameter -p must be used with --ti") + + # path to DDK + self.ddk_path = paths.DDK + + # path to makensis + self.makensis = os.path.join(paths.NSIS, 'makensis.exe') + + # driver signing options + self.codesign = opt.codesign + self.sign_cn = opt.cert + self.sign_cert = opt.certfile + self.cert_pw = opt.certpw + self.crosscert = os.path.join(self.top, opt.crosscert) + + self.inf2cat_cmd = os.path.join(self.ddk_path, 'bin', 'selfsign', 'Inf2Cat') + self.signtool_cmd = os.path.join(self.ddk_path, 'bin', 'x86', 'SignTool') + + self.timestamp_server = opt.timestamp + + # split a path into a list of components + @staticmethod + def path_split(path): + folders = [] + while True: + path, folder = os.path.split(path) + if folder: + folders.append(folder) + else: + if path: + folders.append(path) + break + folders.reverse() + return folders + + # run a command + def system(self, cmd): + print "RUN:", cmd + os.system(cmd) + + # make a directory + def mkdir(self, dir): + try: + os.mkdir(dir) + except: + pass + else: + print "MKDIR", dir + + # make a directory including parents + def makedirs(self, dir): + try: + os.makedirs(dir) + except: + pass + else: + print "MAKEDIRS", dir + + # copy a file + def cp(self, src, dest): + print "COPY %s %s" % (src, dest) + shutil.copy2(src, dest) + + # make a tarball + @staticmethod + def make_tarball(output_filename, source_dir, arcname=None): + if arcname is None: + arcname = os.path.basename(source_dir) + tar = tarfile.open(output_filename, "w:gz") + tar.add(source_dir, arcname=arcname) + tar.close() + print "***** Generated tarball:", output_filename + + # remove a file + def rm(self, file): + print "RM", file + os.remove(file) + + # remove whole directory tree, like rm -rf + def rmtree(self, dir): + print "RMTREE", dir + shutil.rmtree(dir, ignore_errors=True) + + # return path of dist directory + def dist_path(self): + return os.path.join(self.top, 'dist') + + # return path of dist include directory + def dist_include_path(self): + return os.path.join(self.dist_path(), 'include') + + # make a distribution directory (if absent) and return its path + def mkdir_dist(self, x64): + dir = self.drvdir(self.dist_path(), x64) + self.makedirs(dir) + return dir + + # run an MSVC command + def build_vc(self, cmd): + self.system('cmd /c "vcvarsall.bat x86 && %s"' % (cmd,)) + + # parse version.m4 file + def parse_version_m4(self): + kv = {} + r = re.compile(r'^define\(\[?(\w+)\]?,\s*\[(.*)\]\)') + with open(os.path.join(self.top, 'version.m4')) as f: + for line in f: + line = line.rstrip() + m = re.match(r, line) + if m: + g = m.groups() + kv[g[0]] = g[1] + return kv + + # our tap-windows version.m4 settings + def gen_version_m4(self, x64): + kv = 
self.parse_version_m4() + if self.opt.oas: # for OpenVPN Connect (i.e. OpenVPN Access Server) + kv['PRODUCT_NAME'] = "OpenVPNAS" + kv['PRODUCT_TAP_WIN_DEVICE_DESCRIPTION'] = "TAP Adapter OAS NDIS 6.0" + kv['PRODUCT_TAP_WIN_PROVIDER'] = "TAP-Win32 Provider OAS" + kv['PRODUCT_TAP_WIN_COMPONENT_ID'] = "tapoas" + + if (x64): + kv['INF_PROVIDER_SUFFIX'] = ", NTamd64" + kv['INF_SECTION_SUFFIX'] = ".NTamd64" + else: + kv['INF_PROVIDER_SUFFIX'] = "" + kv['INF_SECTION_SUFFIX'] = "" + return kv + + # DDK major version number (as a string) + def ddk_major(self): + ddk_ver = os.path.basename(self.ddk_path) + ddk_ver_major = re.match(r'^(\d+)\.', ddk_ver).groups()[0] + return ddk_ver_major + + # return tapinstall source directory + def tapinstall_src(self): + if self.top_tapinstall: + d = os.path.join(self.top_tapinstall, self.ddk_major()) + if os.path.exists(d): + return d + else: + return self.top_tapinstall + + # preprocess a file, doing macro substitution on @MACRO@ + def preprocess(self, kv, in_path, out_path=None): + def repfn(m): + var, = m.groups() + return kv.get(var, '') + if out_path is None: + out_path = in_path + with open(in_path+'.in') as f: + modtxt = re.sub(self.macro_amper, repfn, f.read()) + with open(out_path, "w") as f: + f.write(modtxt) + + # set up configuration files for building tap driver + def config_tap(self, x64): + kv = self.gen_version_m4(x64) + drvdir = self.drvdir(self.src, x64) + self.mkdir(drvdir) + self.preprocess(kv, os.path.join(self.src, "OemVista.inf"), os.path.join(drvdir, "OemVista.inf")) + self.preprocess(kv, os.path.join(self.src, "SOURCES")) + self.preprocess(kv, os.path.join(self.src, "config.h")) + + # set up configuration files for building tapinstall + def config_tapinstall(self, x64): + kv = {} + tisrc = self.tapinstall_src() + self.preprocess(kv, os.path.join(tisrc, "sources")) + + # build a "build" file using DDK + def build_ddk(self, dir, x64, debug): + setenv_bat = os.path.join(self.ddk_path, 'bin', 'setenv.bat') + target = 'chk' if debug else 'fre' + if x64: + target += ' x64' + else: + target += ' x86' + + target += ' wlh' # vista + + self.system('cmd /c "%s %s %s no_oacr && cd %s && build -cef"' % ( + setenv_bat, + self.ddk_path, + target, + dir + )) + + # copy tap driver files to dist + def copy_tap_to_dist(self, x64): + dist = self.mkdir_dist(x64) + drvdir = self.drvdir(self.src, x64) + for dirpath, dirnames, filenames in os.walk(drvdir): + for f in filenames: + path = os.path.join(dirpath, f) + if f.endswith('.inf') or f.endswith('.cat') or f.endswith('.sys'): + destfn = os.path.join(dist, f) + self.cp(path, destfn) + + # copy tap-windows.h to dist/include + def copy_include(self): + incdir = os.path.join(self.dist_path(), 'include') + self.makedirs(incdir) + self.cp(os.path.join(self.src, 'tap-windows.h'), incdir) + + # copy tapinstall to dist + def copy_tapinstall_to_dist(self, x64): + dist = self.mkdir_dist(x64) + t = os.path.basename(dist) + tisrc = self.tapinstall_src() + for dirpath, dirnames, filenames in os.walk(tisrc): + if os.path.basename(dirpath) == t: + for f in filenames: + path = os.path.join(dirpath, f) + if f == 'tapinstall.exe': + destfn = os.path.join(dist, f) + self.cp(path, destfn) + + # copy dist-src to dist; dist-src contains prebuilt files + # for some old platforms (such as win2k) + def copy_dist_src_to_dist(self): + dist_path = self.path_split(self.dist_path()) + dist_src = os.path.join(self.top, "dist-src") + baselen = len(self.path_split(dist_src)) + for dirpath, dirnames, filenames in os.walk(dist_src): + 
dirpath_split = self.path_split(dirpath) + depth = len(dirpath_split) - baselen + dircomp = () + if depth > 0: + dircomp = dirpath_split[-depth:] + for exclude_dir in ('.svn', '.git'): + if exclude_dir in dirnames: + dirnames.remove(exclude_dir) + for f in filenames: + path = os.path.join(dirpath, f) + destdir = os.path.join(*(dist_path + dircomp)) + destfn = os.path.join(destdir, f) + self.makedirs(destdir) + self.cp(path, destfn) + + # build, sign, and verify tap driver + def build_tap(self): + for x64 in (False, True): + print "***** BUILD TAP x64=%s" % (x64,) + self.config_tap(x64=x64) + self.build_ddk(dir=self.src, x64=x64, debug=opt.debug) + if self.codesign: + self.sign_verify(x64=x64) + self.copy_tap_to_dist(x64=x64) + + # build tapinstall + def build_tapinstall(self): + for x64 in (False, True): + print "***** BUILD TAPINSTALL x64=%s" % (x64,) + tisrc = self.tapinstall_src() + # Only build if we have a chance of succeeding + sources_in = os.path.join(tisrc, "sources.in") + if os.path.isfile(sources_in): + self.config_tapinstall(x64=x64) + self.build_ddk(tisrc, x64=x64, debug=opt.debug) + if self.codesign: + self.sign_verify_ti(x64=x64) + self.copy_tapinstall_to_dist(x64) + + # build tap driver and tapinstall + def build(self): + self.build_tap() + self.copy_include() + if self.top_tapinstall: + self.build_tapinstall() + self.copy_dist_src_to_dist() + + print "***** Generated files" + self.dump_dist() + + tapbase = "tapoas6" if self.opt.oas else "tap6" + self.make_tarball(os.path.join(self.top, tapbase+".tar.gz"), + self.dist_path(), + tapbase) + + # package the produced files into an NSIS installer + def package(self): + + # Generate license.txt and converting LF -> CRLF as we go. Apparently + # this type of conversion will stop working in Python 3.x. + dst = open(os.path.join(self.dist_path(), 'license.txt'), mode='wb') + + for f in (os.path.join(self.top, 'COPYING'), os.path.join(self.top, 'COPYRIGHT.GPL')): + src=open(f, mode='rb') + dst.write(src.read()+'\r\n') + src.close() + + dst.close() + + # Copy tap-windows.h to dist include directory + self.mkdir(self.dist_include_path()) + self.cp(os.path.join(self.src, 'tap-windows.h'), self.dist_include_path()) + + # Get variables from version.m4 + kv = self.gen_version_m4(True) + + installer_type = "" + if self.opt.oas: + installer_type = "-oas" + installer_file=os.path.join(self.top, 'tap-windows'+installer_type+'-'+kv['PRODUCT_VERSION']+'-I'+kv['PRODUCT_TAP_WIN_BUILD']+'.exe') + + installer_cmd = "\"%s\" -DDEVCON32=%s -DDEVCON64=%s -DDEVCON_BASENAME=%s -DPRODUCT_TAP_WIN_COMPONENT_ID=%s -DPRODUCT_NAME=%s -DPRODUCT_VERSION=%s -DPRODUCT_TAP_WIN_BUILD=%s -DOUTPUT=%s -DIMAGE=%s %s" % \ + (self.makensis, + self.tifile(x64=False), + self.tifile(x64=True), + 'tapinstall.exe', + kv['PRODUCT_TAP_WIN_COMPONENT_ID'], + kv['PRODUCT_NAME'], + kv['PRODUCT_VERSION'], + kv['PRODUCT_TAP_WIN_BUILD'], + installer_file, + self.dist_path(), + os.path.join(self.top, 'installer', 'tap-windows6.nsi') + ) + + self.system(installer_cmd) + self.sign(installer_file) + + # like find . 
| sort + def enum_tree(self, dir): + data = [] + for dirpath, dirnames, filenames in os.walk(dir): + data.append(dirpath) + for f in filenames: + data.append(os.path.join(dirpath, f)) + data.sort() + return data + + # show files in dist + def dump_dist(self): + for f in self.enum_tree(self.dist_path()): + print f + + # remove generated files from given directory tree + def clean_tree(self, top): + for dirpath, dirnames, filenames in os.walk(top): + for d in list(dirnames): + if d in ('.svn', '.git'): + dirnames.remove(d) + else: + path = os.path.join(dirpath, d) + deldir = False + if d in ('amd64', 'i386', 'dist'): + deldir = True + if d.endswith('_amd64') or d.endswith('_x86'): + deldir = True + if deldir: + self.rmtree(path) + dirnames.remove(d) + for f in filenames: + path = os.path.join(dirpath, f) + if f in ('SOURCES', 'sources', 'config.h'): + self.rm(path) + if f.endswith('.log') or f.endswith('.wrn') or f.endswith('.cod'): + self.rm(path) + + # remove generated files for both tap-windows and tapinstall + def clean(self): + self.clean_tree(self.top) + if self.top_tapinstall: + self.clean_tree(self.top_tapinstall) + + # BEGIN Driver signing + + def drvdir(self, dir, x64): + if x64: + return os.path.join(dir, "amd64") + else: + return os.path.join(dir, "i386") + + def drvfile(self, x64, ext): + dd = self.drvdir(self.src, x64) + for dirpath, dirnames, filenames in os.walk(dd): + catlist = [ f for f in filenames if f.endswith(ext) ] + assert(len(catlist)==1) + return os.path.join(dd, catlist[0]) + + def tifile(self, x64): + if x64: + return os.path.join(self.tapinstall_src(), 'objfre_wlh_amd64', 'amd64', 'tapinstall.exe') + else: + return os.path.join(self.tapinstall_src(), 'objfre_wlh_x86', 'i386', 'tapinstall.exe') + + def inf2cat(self, x64): + if x64: + oslist = "Vista_X64,Server2008_X64,Server2008R2_X64,7_X64" + else: + oslist = "Vista_X86,Server2008_X86,7_X86" + self.system("%s /driver:%s /os:%s" % (self.inf2cat_cmd, self.drvdir(self.src, x64), oslist)) + + def sign(self, file): + certspec = "" + if self.sign_cert: + certspec += "/f '%s' " % self.sign_cert + if self.cert_pw: + certspec += "/p '%s' " % self.cert_pw + else: + certspec += "/s my /n '%s' " % self.sign_cn + + self.system("%s sign /v /ac %s %s /t %s %s" % ( + self.signtool_cmd, + self.crosscert, + certspec, + self.timestamp_server, + file, + )) + + def sign_driver(self, x64): + self.sign(self.drvfile(x64, '.cat')) + + def verify(self, x64): + self.system("%s verify /kp /v /c %s %s" % ( + self.signtool_cmd, + self.drvfile(x64, '.cat'), + self.drvfile(x64, '.sys'), + )) + + def sign_verify(self, x64): + self.inf2cat(x64) + self.sign_driver(x64) + self.verify(x64) + + def sign_verify_ti(self, x64): + self.sign(self.tifile(x64)) + self.system("%s verify /pa %s" % (self.signtool_cmd, self.tifile(x64))) + + # END Driver signing + +if __name__ == '__main__': + # parse options + import optparse, codecs + codecs.register(lambda name: codecs.lookup('utf-8') if name == 'cp65001' else None) # windows UTF-8 hack + op = optparse.OptionParser() + + # defaults + src = os.path.dirname(os.path.realpath(__file__)) + cert = "openvpn" + crosscert = "MSCV-VSClass3.cer" # cross certs available here: http://msdn.microsoft.com/en-us/library/windows/hardware/dn170454(v=vs.85).aspx + timestamp = "http://timestamp.verisign.com/scripts/timstamp.dll" + + op.add_option("-s", "--src", dest="src", metavar="SRC", + + default=src, + help="TAP-Windows top-level directory, default=%s" % (src,)) + op.add_option("--ti", dest="tapinstall", 
metavar="TAPINSTALL", + help="tapinstall (i.e. devcon) directory (optional)") + op.add_option("-d", "--debug", action="store_true", dest="debug", + help="enable debug build") + op.add_option("-c", "--clean", action="store_true", dest="clean", + help="do an nmake clean before build") + op.add_option("-b", "--build", action="store_true", dest="build", + help="build TAP-Windows and possibly tapinstall (add -c to clean before build)") + op.add_option("--sign", action="store_true", dest="codesign", + default=False, help="sign the driver files") + op.add_option("-p", "--package", action="store_true", dest="package", + help="generate an NSIS installer from the compiled files") + op.add_option("--cert", dest="cert", metavar="CERT", + default=cert, + help="Common name of code signing certificate, default=%s" % (cert,)) + op.add_option("--certfile", dest="certfile", metavar="CERTFILE", + help="Path to the code signing certificate") + op.add_option("--certpw", dest="certpw", metavar="CERTPW", + help="Password for the code signing certificate/key (optional)") + op.add_option("--crosscert", dest="crosscert", metavar="CERT", + default=crosscert, + help="The cross-certificate file to use, default=%s" % (crosscert,)) + op.add_option("--timestamp", dest="timestamp", metavar="URL", + default=timestamp, + help="Timestamp URL to use, default=%s" % (timestamp,)) + op.add_option("-a", "--oas", action="store_true", dest="oas", + help="Build for OpenVPN Access Server clients") + (opt, args) = op.parse_args() + + if len(sys.argv) <= 1: + op.print_help() + sys.exit(1) + + btw = BuildTAPWindows(opt) + if opt.clean: + btw.clean() + if opt.build: + btw.build() + if opt.package: + btw.package() diff --git a/installer/tap/src/installer/ShellLink.dll b/installer/tap/src/installer/ShellLink.dll new file mode 100644 index 0000000..f57ded3 Binary files /dev/null and b/installer/tap/src/installer/ShellLink.dll differ diff --git a/installer/tap/src/installer/icon.ico b/installer/tap/src/installer/icon.ico new file mode 100644 index 0000000..03ea0b1 Binary files /dev/null and b/installer/tap/src/installer/icon.ico differ diff --git a/installer/tap/src/installer/install-whirl.bmp b/installer/tap/src/installer/install-whirl.bmp new file mode 100644 index 0000000..03f33fc Binary files /dev/null and b/installer/tap/src/installer/install-whirl.bmp differ diff --git a/installer/tap/src/installer/tap-windows6.nsi b/installer/tap/src/installer/tap-windows6.nsi new file mode 100644 index 0000000..2d03566 --- /dev/null +++ b/installer/tap/src/installer/tap-windows6.nsi @@ -0,0 +1,340 @@ +; **************************************************************************** +; * Copyright (C) 2002-2010 OpenVPN Technologies, Inc. * +; * Copyright (C) 2012 Alon Bar-Lev * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; TAP-Windows install script for Windows, using NSIS + +SetCompressor /SOLID lzma + +!addplugindir . 
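+
+; This script is driven by buildtap.py, which passes OUTPUT, IMAGE,
+; DEVCON32/DEVCON64, DEVCON_BASENAME and the PRODUCT_* values from
+; version.m4 as -D defines. Illustrative invocation only -- the paths and
+; values below are placeholders, wrapped here for readability:
+;
+;   makensis -DDEVCON32=<x86 tapinstall.exe> -DDEVCON64=<x64 tapinstall.exe>
+;     -DDEVCON_BASENAME=tapinstall.exe -DPRODUCT_TAP_WIN_COMPONENT_ID=tapoas
+;     -DPRODUCT_NAME=... -DPRODUCT_VERSION=... -DPRODUCT_TAP_WIN_BUILD=...
+;     -DOUTPUT=<installer .exe> -DIMAGE=<dist directory> tap-windows6.nsi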
+!include "MUI.nsh" +!include "StrFunc.nsh" +!include "x64.nsh" +!define MULTIUSER_EXECUTIONLEVEL Admin +!include "MultiUser.nsh" +!include FileFunc.nsh +!insertmacro GetParameters +!insertmacro GetOptions + +${StrLoc} + +;-------------------------------- +;Configuration + +;General + +OutFile "${OUTPUT}" + +ShowInstDetails show +ShowUninstDetails show + +;Remember install folder +InstallDirRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" "" + +;-------------------------------- +;Modern UI Configuration + +Name "${PRODUCT_NAME} ${PRODUCT_VERSION}-I${PRODUCT_TAP_WIN_BUILD}" + +!define MUI_WELCOMEPAGE_TEXT "This wizard will guide you through the installation of ${PRODUCT_NAME}, a kernel driver to provide virtual tap device functionality on Windows originally written by James Yonan.\r\n\r\nNote that ${PRODUCT_NAME} will only run on Windows Vista or later.\r\n\r\n\r\n" + +!define MUI_COMPONENTSPAGE_TEXT_TOP "Select the components to install/upgrade. Stop any ${PRODUCT_NAME} processes or the ${PRODUCT_NAME} service if it is running. All DLLs are installed locally." + +!define MUI_COMPONENTSPAGE_SMALLDESC +!define MUI_FINISHPAGE_NOAUTOCLOSE +!define MUI_ABORTWARNING +!define MUI_ICON "icon.ico" +!define MUI_UNICON "icon.ico" +!define MUI_HEADERIMAGE +!define MUI_HEADERIMAGE_BITMAP "install-whirl.bmp" +!define MUI_UNFINISHPAGE_NOAUTOCLOSE + +!insertmacro MUI_PAGE_WELCOME +!insertmacro MUI_PAGE_LICENSE "${IMAGE}\license.txt" +!insertmacro MUI_PAGE_COMPONENTS +!insertmacro MUI_PAGE_DIRECTORY +!insertmacro MUI_PAGE_INSTFILES +!insertmacro MUI_PAGE_FINISH + +!insertmacro MUI_UNPAGE_CONFIRM +!insertmacro MUI_UNPAGE_INSTFILES +!insertmacro MUI_UNPAGE_FINISH + +;-------------------------------- +;Languages + +!insertmacro MUI_LANGUAGE "English" + +;-------------------------------- +;Language Strings + +LangString DESC_SecTAP ${LANG_ENGLISH} "Install/Upgrade the TAP virtual device driver. Will not interfere with CIPE." +LangString DESC_SecTAPUtilities ${LANG_ENGLISH} "Install the TAP Utilities." +LangString DESC_SecTAPSDK ${LANG_ENGLISH} "Install the TAP SDK." + +;-------------------------------- +;Reserve Files + +;Things that need to be extracted on first (keep these lines before any File command!) +;Only useful for BZIP2 compression + +ReserveFile "install-whirl.bmp" + +;-------------------------------- +;Macros + +!macro SelectByParameter SECT PARAMETER DEFAULT + ${GetOptions} $R0 "/${PARAMETER}=" $0 + ${If} ${DEFAULT} == 0 + ${If} $0 == 1 + !insertmacro SelectSection ${SECT} + ${EndIf} + ${Else} + ${If} $0 != 0 + !insertmacro SelectSection ${SECT} + ${EndIf} + ${EndIf} +!macroend + +;-------------------------------- +;Installer Sections + +Section /o "TAP Virtual Ethernet Adapter" SecTAP + + SetOverwrite on + + ${If} ${RunningX64} + DetailPrint "We are running on a 64-bit system." + + SetOutPath "$INSTDIR\bin" + File "${DEVCON64}" + + SetOutPath "$INSTDIR\driver" + File "${IMAGE}\amd64\OemVista.inf" + File "${IMAGE}\amd64\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "${IMAGE}\amd64\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${Else} + DetailPrint "We are running on a 32-bit system." 
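+    ; As in the 64-bit branch above: $INSTDIR\bin receives the matching
+    ; tapinstall/devcon binary and $INSTDIR\driver the architecture-specific
+    ; inf/cat/sys files, here from the i386 build output.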
+ + SetOutPath "$INSTDIR\bin" + File "${DEVCON32}" + + SetOutPath "$INSTDIR\driver" + File "${IMAGE}\i386\OemVista.inf" + File "${IMAGE}\i386\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "${IMAGE}\i386\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${EndIf} +SectionEnd + +Section /o "TAP Utilities" SecTAPUtilities + SetOverwrite on + + # Delete previous start menu + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + FileOpen $R0 "$INSTDIR\bin\addtap.bat" w + FileWrite $R0 "rem Add a new TAP virtual ethernet adapter$\r$\n" + FileWrite $R0 '"$INSTDIR\bin\${DEVCON_BASENAME}" install "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + FileOpen $R0 "$INSTDIR\bin\deltapall.bat" w + FileWrite $R0 "echo WARNING: this script will delete ALL TAP virtual adapters (use the device manager to delete adapters one at a time)$\r$\n" + FileWrite $R0 "pause$\r$\n" + FileWrite $R0 '"$INSTDIR\bin\${DEVCON_BASENAME}" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + ; Create shortcuts + CreateDirectory "$SMPROGRAMS\${PRODUCT_NAME}\Utilities" + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" "$INSTDIR\bin\addtap.bat" "" + ; set runas admin flag on the addtap link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on addtap failed: status = $0" + ${Endif} + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" "$INSTDIR\bin\deltapall.bat" "" + ; set runas admin flag on the deltapall link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on deltapall failed: status = $0" + ${Endif} +SectionEnd + +Section /o "TAP SDK" SecTAPSDK + SetOverwrite on + SetOutPath "$INSTDIR\include" + File "${IMAGE}\include\tap-windows.h" +SectionEnd + +Function .onInit + ${GetParameters} $R0 + ClearErrors + +${IfNot} ${AtLeastWinVista} + MessageBox MB_OK "This package requires at least Windows Vista" + SetErrorLevel 1 + Quit +${EndIf} + + !insertmacro SelectByParameter ${SecTAP} SELECT_TAP 1 + !insertmacro SelectByParameter ${SecTAPUtilities} SELECT_UTILITIES 0 + !insertmacro SelectByParameter ${SecTAPSDK} SELECT_SDK 0 + + !insertmacro MULTIUSER_INIT + SetShellVarContext all + + ${If} ${RunningX64} + SetRegView 64 + StrCpy $INSTDIR "$PROGRAMFILES64\${PRODUCT_NAME}" + ${Else} + StrCpy $INSTDIR "$PROGRAMFILES\${PRODUCT_NAME}" + ${EndIf} +FunctionEnd + +;-------------------------------- +;Dependencies + +Function .onSelChange + ${If} ${SectionIsSelected} ${SecTAPUtilities} + !insertmacro SelectSection ${SecTAP} + ${EndIf} +FunctionEnd + +;-------------------- +;Post-install section + +Section -post + + ; Store README, license, icon + SetOverwrite on + SetOutPath $INSTDIR + File "${IMAGE}\license.txt" + File "icon.ico" + + ${If} ${SectionIsSelected} ${SecTAP} + ; + ; install/upgrade TAP driver if selected, using devcon + ; + ; TAP install/update was selected. + ; Should we install or update? + ; If tapinstall error occurred, $R5 will + ; be nonzero. 
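+    ;
+    ; Sketch of the detection flow below: "IntOp $R5 0 & 0" zeroes the
+    ; cumulative status, the "devcon hwids" output is captured with
+    ; nsExec::ExecToStack, and StrLoc searches it for the component id.
+    ; An empty StrLoc result selects "install"; a hit selects "update".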
+ IntOp $R5 0 & 0 + nsExec::ExecToStack '"$INSTDIR\bin\${DEVCON_BASENAME}" hwids ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + IntOp $R5 $R5 | $R0 + DetailPrint "${DEVCON_BASENAME} hwids returned: $R0" + + ; If tapinstall output string contains "${PRODUCT_TAP_WIN_COMPONENT_ID}" we assume + ; that TAP device has been previously installed, + ; therefore we will update, not install. + Push "${PRODUCT_TAP_WIN_COMPONENT_ID}" + Push ">" + Call StrLoc + Pop $R0 + + ${If} $R5 == 0 + ${If} $R0 == "" + StrCpy $R1 "install" + ${Else} + StrCpy $R1 "update" + ${EndIf} + DetailPrint "TAP $R1 (${PRODUCT_TAP_WIN_COMPONENT_ID}) (May require confirmation)" + nsExec::ExecToLog '"$INSTDIR\bin\${DEVCON_BASENAME}" $R1 "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + ${If} $R0 == "" + IntOp $R0 0 & 0 + SetRebootFlag true + DetailPrint "REBOOT flag set" + ${EndIf} + IntOp $R5 $R5 | $R0 + DetailPrint "${DEVCON_BASENAME} returned: $R0" + ${EndIf} + + DetailPrint "${DEVCON_BASENAME} cumulative status: $R5" + ${If} $R5 != 0 + MessageBox MB_OK "An error occurred installing the TAP device driver." + ${EndIf} + + ; Store install folder in registry + WriteRegStr HKLM SOFTWARE\${PRODUCT_NAME} "" $INSTDIR + ${EndIf} + + ; Create uninstaller + WriteUninstaller "$INSTDIR\Uninstall.exe" + + ; Show up in Add/Remove programs + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayName" "${PRODUCT_NAME} ${PRODUCT_VERSION}" + WriteRegExpandStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "UninstallString" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayIcon" "$INSTDIR\icon.ico" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayVersion" "${PRODUCT_VERSION}" + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoModify" 1 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoRepair" 1 + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "Publisher" "${PRODUCT_PUBLISHER}" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "HelpLink" "https://openvpn.net/index.php/open-source.html" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "URLInfoAbout" "https://openvpn.net" + + ${GetSize} "$INSTDIR" "/S=0K" $0 $1 $2 + IntFmt $0 "0x%08X" $0 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "EstimatedSize" "$0" + +SectionEnd + +;-------------------------------- +;Descriptions + +!insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAP} $(DESC_SecTAP) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAPUtilities} $(DESC_SecTAPUtilities) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAPSDK} $(DESC_SecTAPSDK) +!insertmacro MUI_FUNCTION_DESCRIPTION_END + +;-------------------------------- +;Uninstaller Section + +Function un.onInit + ClearErrors + !insertmacro MULTIUSER_UNINIT + SetShellVarContext all + ${If} ${RunningX64} + SetRegView 64 + ${EndIf} +FunctionEnd + +Section "Uninstall" + DetailPrint "TAP REMOVE" + nsExec::ExecToLog '"$INSTDIR\bin\${DEVCON_BASENAME}" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + DetailPrint "${DEVCON_BASENAME} remove returned: $R0" + + Delete 
"$INSTDIR\bin\${DEVCON_BASENAME}" + Delete "$INSTDIR\bin\addtap.bat" + Delete "$INSTDIR\bin\deltapall.bat" + + Delete "$INSTDIR\driver\OemVista.inf" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + + Delete "$INSTDIR\include\tap-windows.h" + + Delete "$INSTDIR\icon.ico" + Delete "$INSTDIR\license.txt" + Delete "$INSTDIR\Uninstall.exe" + + RMDir "$INSTDIR\bin" + RMDir "$INSTDIR\driver" + RMDir "$INSTDIR\include" + RMDir "$INSTDIR" + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + DeleteRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" + DeleteRegKey HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" + +SectionEnd diff --git a/installer/tap/src/paths.py b/installer/tap/src/paths.py new file mode 100644 index 0000000..8598446 --- /dev/null +++ b/installer/tap/src/paths.py @@ -0,0 +1,3 @@ +# Windows 7 DDK +DDK = "C:\\winddk\\7600.16385.1" +NSIS = "C:\\Program Files (x86)\\NSIS" diff --git a/installer/tap/src/src/MAKEFILE b/installer/tap/src/src/MAKEFILE new file mode 100644 index 0000000..d5bedee --- /dev/null +++ b/installer/tap/src/src/MAKEFILE @@ -0,0 +1,8 @@ +# +# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source +# file to this component. This file merely indirects to the real make file +# that is shared by all the driver components of the Windows NT DDK +# + +!INCLUDE $(NTMAKEENV)\makefile.def + diff --git a/installer/tap/src/src/OemVista.inf.in b/installer/tap/src/src/OemVista.inf.in new file mode 100644 index 0000000..004ed62 --- /dev/null +++ b/installer/tap/src/src/OemVista.inf.in @@ -0,0 +1,191 @@ +; **************************************************************************** +; * Copyright (C) 2002-2014 OpenVPN Technologies, Inc. * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; SYNTAX CHECKER +; cd \WINDDK\3790\tools\chkinf +; chkinf c:\src\openvpn\tap-win32\i386\oemvista.inf +; OUTPUT -> file:///c:/WINDDK/3790/tools/chkinf/htm/c%23+src+openvpn+tap-win32+i386+__OemWin2k.htm + +; INSTALL/REMOVE DRIVER +; tapinstall install OemVista.inf tapoas +; tapinstall update OemVista.inf tapoas +; tapinstall remove tapoas + +;********************************************************* +; Note to Developers: +; +; If you are bundling the TAP-Windows driver with your app, +; you should try to rename it in such a way that it will +; not collide with other instances of TAP-Windows defined +; by other apps. Multiple versions of the TAP-Windows +; driver, each installed by different apps, can coexist +; on the same machine if you follow these guidelines. +; NOTE: these instructions assume you are editing the +; generated OemWin2k.inf file, not the source +; OemWin2k.inf.in file which is preprocessed by winconfig +; and uses macro definitions from settings.in. +; +; (1) Rename all tapXXXX instances in this file to +; something different (use at least 5 characters +; for this name!) +; (2) Change the "!define TAP" definition in openvpn.nsi +; to match what you changed tapXXXX to. +; (3) Change TARGETNAME in SOURCES to match what you +; changed tapXXXX to. +; (4) Change TAP_COMPONENT_ID in common.h to match what +; you changed tapXXXX to. +; (5) Change SZDEPENDENCIES in service.h to match what +; you changed tapXXXX to. 
+; (6) Change DeviceDescription and Provider strings. +; (7) Change PRODUCT_TAP_WIN_DEVICE_DESCRIPTION in constants.h to what you +; set DeviceDescription to. +; +;********************************************************* + +[Version] + Signature = "$Windows NT$" + CatalogFile = @PRODUCT_TAP_WIN_COMPONENT_ID@.cat + ClassGUID = {4d36e972-e325-11ce-bfc1-08002be10318} + Provider = %Provider% + Class = Net + +; This version number should match the version +; number given in SOURCES. + DriverVer=@PRODUCT_TAP_WIN_RELDATE@,@PRODUCT_TAP_WIN_MAJOR@.00.00.@PRODUCT_TAP_WIN_MINOR@ + +[Strings] + DeviceDescription = "@PRODUCT_TAP_WIN_DEVICE_DESCRIPTION@" + Provider = "@PRODUCT_TAP_WIN_PROVIDER@" + +;---------------------------------------------------------------- +; Manufacturer + Product Section (Done) +;---------------------------------------------------------------- +[Manufacturer] + %Provider% = @PRODUCT_TAP_WIN_COMPONENT_ID@@INF_PROVIDER_SUFFIX@ + +[@PRODUCT_TAP_WIN_COMPONENT_ID@@INF_SECTION_SUFFIX@] + %DeviceDescription% = @PRODUCT_TAP_WIN_COMPONENT_ID@.ndi, root\@PRODUCT_TAP_WIN_COMPONENT_ID@ ; Root enumerated + %DeviceDescription% = @PRODUCT_TAP_WIN_COMPONENT_ID@.ndi, @PRODUCT_TAP_WIN_COMPONENT_ID@ ; Legacy + +;--------------------------------------------------------------- +; Driver Section (Done) +;--------------------------------------------------------------- + +;----------------- Characteristics ------------ +; NCF_PHYSICAL = 0x04 +; NCF_VIRTUAL = 0x01 +; NCF_SOFTWARE_ENUMERATED = 0x02 +; NCF_HIDDEN = 0x08 +; NCF_NO_SERVICE = 0x10 +; NCF_HAS_UI = 0x80 +;----------------- Characteristics ------------ + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.ndi] + CopyFiles = @PRODUCT_TAP_WIN_COMPONENT_ID@.driver, @PRODUCT_TAP_WIN_COMPONENT_ID@.files + AddReg = @PRODUCT_TAP_WIN_COMPONENT_ID@.reg + AddReg = @PRODUCT_TAP_WIN_COMPONENT_ID@.params.reg + Characteristics = @PRODUCT_TAP_WIN_CHARACTERISTICS@ + *IfType = 0x6 ; IF_TYPE_ETHERNET_CSMACD + *MediaType = 0x0 ; NdisMedium802_3 + *PhysicalMediaType = 14 ; NdisPhysicalMedium802_3 + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.ndi.Services] + AddService = @PRODUCT_TAP_WIN_COMPONENT_ID@, 2, @PRODUCT_TAP_WIN_COMPONENT_ID@.service + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.reg] + HKR, Ndi, Service, 0, "@PRODUCT_TAP_WIN_COMPONENT_ID@" + HKR, Ndi\Interfaces, UpperRange, 0, "ndis5" + HKR, Ndi\Interfaces, LowerRange, 0, "ethernet" + HKR, , Manufacturer, 0, "%Provider%" + HKR, , ProductName, 0, "%DeviceDescription%" + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.params.reg] + HKR, Ndi\params\MTU, ParamDesc, 0, "MTU" + HKR, Ndi\params\MTU, Type, 0, "int" + HKR, Ndi\params\MTU, Default, 0, "1500" + HKR, Ndi\params\MTU, Optional, 0, "0" + HKR, Ndi\params\MTU, Min, 0, "100" + HKR, Ndi\params\MTU, Max, 0, "1500" + HKR, Ndi\params\MTU, Step, 0, "1" + HKR, Ndi\params\MediaStatus, ParamDesc, 0, "Media Status" + HKR, Ndi\params\MediaStatus, Type, 0, "enum" + HKR, Ndi\params\MediaStatus, Default, 0, "0" + HKR, Ndi\params\MediaStatus, Optional, 0, "0" + HKR, Ndi\params\MediaStatus\enum, "0", 0, "Application Controlled" + HKR, Ndi\params\MediaStatus\enum, "1", 0, "Always Connected" + HKR, Ndi\params\MAC, ParamDesc, 0, "MAC Address" + HKR, Ndi\params\MAC, Type, 0, "edit" + HKR, Ndi\params\MAC, Optional, 0, "1" + HKR, Ndi\params\AllowNonAdmin, ParamDesc, 0, "Non-Admin Access" + HKR, Ndi\params\AllowNonAdmin, Type, 0, "enum" + HKR, Ndi\params\AllowNonAdmin, Default, 0, "1" + HKR, Ndi\params\AllowNonAdmin, Optional, 0, "0" + HKR, Ndi\params\AllowNonAdmin\enum, "0", 0, "Not Allowed" + HKR, Ndi\params\AllowNonAdmin\enum, 
"1", 0, "Allowed" + +;---------------------------------------------------------------- +; Service Section +;---------------------------------------------------------------- + +;---------- Service Type ------------- +; SERVICE_KERNEL_DRIVER = 0x01 +; SERVICE_WIN32_OWN_PROCESS = 0x10 +;---------- Service Type ------------- + +;---------- Start Mode --------------- +; SERVICE_BOOT_START = 0x0 +; SERVICE_SYSTEM_START = 0x1 +; SERVICE_AUTO_START = 0x2 +; SERVICE_DEMAND_START = 0x3 +; SERVICE_DISABLED = 0x4 +;---------- Start Mode --------------- + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.service] + DisplayName = %DeviceDescription% + ServiceType = 1 + StartType = 3 + ErrorControl = 1 + LoadOrderGroup = NDIS + ServiceBinary = %12%\@PRODUCT_TAP_WIN_COMPONENT_ID@.sys + +;----------------------------------------------------------------- +; File Installation +;----------------------------------------------------------------- + +;----------------- Copy Flags ------------ +; COPYFLG_NOSKIP = 0x02 +; COPYFLG_NOVERSIONCHECK = 0x04 +;----------------- Copy Flags ------------ + +; SourceDisksNames +; diskid = description[, [tagfile] [, , subdir]] +; 1 = "Intel Driver Disk 1",e100bex.sys,, + +[SourceDisksNames] + 1 = %DeviceDescription%, @PRODUCT_TAP_WIN_COMPONENT_ID@.sys + +; SourceDisksFiles +; filename_on_source = diskID[, [subdir][, size]] +; e100bex.sys = 1,, ; on distribution disk 1 + +[SourceDisksFiles] +@PRODUCT_TAP_WIN_COMPONENT_ID@.sys = 1 + +[DestinationDirs] + @PRODUCT_TAP_WIN_COMPONENT_ID@.files = 11 + @PRODUCT_TAP_WIN_COMPONENT_ID@.driver = 12 + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.files] +; TapPanel.cpl,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK +; cipsrvr.exe,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.driver] + @PRODUCT_TAP_WIN_COMPONENT_ID@.sys,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +;--------------------------------------------------------------- +; End +;--------------------------------------------------------------- diff --git a/installer/tap/src/src/SOURCES.in b/installer/tap/src/src/SOURCES.in new file mode 100644 index 0000000..cf98d5f --- /dev/null +++ b/installer/tap/src/src/SOURCES.in @@ -0,0 +1,62 @@ +# Build TAP-Windows NDIS 6.0 driver. +# Build Command: build -cef + +MAJORCOMP=ntos +MINORCOMP=ndis + +TARGETNAME=@PRODUCT_TAP_WIN_COMPONENT_ID@ +TARGETTYPE=DRIVER +TARGETPATH=. + +TARGETLIBS=\ + $(DDK_LIB_PATH)\ndis.lib \ + $(DDK_LIB_PATH)\ntstrsafe.lib \ + $(DDK_LIB_PATH)\wdmsec.lib + +INCLUDES=$(DDK_INCLUDE_PATH) .. + +# System and NDIS wrapper definitions. +C_DEFINES=$(C_DEFINES) -DNDIS_MINIPORT_DRIVER=1 +C_DEFINES=$(C_DEFINES) -DNDIS61_MINIPORT=1 +C_DEFINES=$(C_DEFINES) -DNDIS_SUPPORT_NDIS61=1 +C_DEFINES=$(C_DEFINES) -DNDIS_WDM=1 + +# The TAP version numbers here must be >= +# PRODUCT_TAP_WIN32_MIN_x values defined in version.m4 +C_DEFINES=$(C_DEFINES) -DTAP_DRIVER_MAJOR_VERSION=@PRODUCT_TAP_WIN_MAJOR@ +C_DEFINES=$(C_DEFINES) -DTAP_DRIVER_MINOR_VERSION=@PRODUCT_TAP_WIN_MINOR@ + +# Produce the same symbolic information for both free & checked builds. +# This will allow us to perform full source-level debugging on both +# builds without affecting the free build's performance. 
+!IF "$(DDKBUILDENV)" != "chk" +NTDEBUGTYPE=both +USE_PDB=1 +!ELSE +NTDEBUGTYPE=both +USE_PDB=1 +!ENDIF + +# Generate a linker map file just in case we need one for debugging +LINKER_FLAGS=$(LINKER_FLAGS) /INCREMENTAL:NO /MAP /MAPINFO:EXPORTS + + +# MSC_WARNING_LEVEL=/W4 /WX + +# disabled warning 4201 -- nonstandard extension used : nameless struct/union +# disabled warning 4214 -- nonstandard extension used : bit field types other than int +# disabled warning 4127 -- conditional expression is constant +MSC_WARNING_LEVEL=$(MSC_WARNING_LEVEL) /wd4201 /wd4214 /wd4127 + +SOURCES=\ + tapdrvr.c \ + adapter.c \ + device.c \ + rxpath.c \ + txpath.c \ + oidrequest.c \ + mem.c \ + macinfo.c \ + error.c \ + dhcp.c \ + resource.rc diff --git a/installer/tap/src/src/adapter.c b/installer/tap/src/src/adapter.c new file mode 100644 index 0000000..2883b79 --- /dev/null +++ b/installer/tap/src/src/adapter.c @@ -0,0 +1,1717 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. 
+// + +#include "tap.h" + +NDIS_OID TAPSupportedOids[] = +{ + OID_GEN_HARDWARE_STATUS, + OID_GEN_TRANSMIT_BUFFER_SPACE, + OID_GEN_RECEIVE_BUFFER_SPACE, + OID_GEN_TRANSMIT_BLOCK_SIZE, + OID_GEN_RECEIVE_BLOCK_SIZE, + OID_GEN_VENDOR_ID, + OID_GEN_VENDOR_DESCRIPTION, + OID_GEN_VENDOR_DRIVER_VERSION, + OID_GEN_CURRENT_PACKET_FILTER, + OID_GEN_CURRENT_LOOKAHEAD, + OID_GEN_DRIVER_VERSION, + OID_GEN_MAXIMUM_TOTAL_SIZE, + OID_GEN_XMIT_OK, + OID_GEN_RCV_OK, + OID_GEN_STATISTICS, +#ifdef IMPLEMENT_OPTIONAL_OIDS + OID_GEN_TRANSMIT_QUEUE_LENGTH, // Optional +#endif // IMPLEMENT_OPTIONAL_OIDS + OID_GEN_LINK_PARAMETERS, + OID_GEN_INTERRUPT_MODERATION, + OID_GEN_MEDIA_SUPPORTED, + OID_GEN_MEDIA_IN_USE, + OID_GEN_MAXIMUM_SEND_PACKETS, + OID_GEN_XMIT_ERROR, + OID_GEN_RCV_ERROR, + OID_GEN_RCV_NO_BUFFER, + OID_802_3_PERMANENT_ADDRESS, + OID_802_3_CURRENT_ADDRESS, + OID_802_3_MULTICAST_LIST, + OID_802_3_MAXIMUM_LIST_SIZE, + OID_802_3_RCV_ERROR_ALIGNMENT, + OID_802_3_XMIT_ONE_COLLISION, + OID_802_3_XMIT_MORE_COLLISIONS, +#ifdef IMPLEMENT_OPTIONAL_OIDS + OID_802_3_XMIT_DEFERRED, // Optional + OID_802_3_XMIT_MAX_COLLISIONS, // Optional + OID_802_3_RCV_OVERRUN, // Optional + OID_802_3_XMIT_UNDERRUN, // Optional + OID_802_3_XMIT_HEARTBEAT_FAILURE, // Optional + OID_802_3_XMIT_TIMES_CRS_LOST, // Optional + OID_802_3_XMIT_LATE_COLLISIONS, // Optional + OID_PNP_CAPABILITIES, // Optional +#endif // IMPLEMENT_OPTIONAL_OIDS +}; + +//====================================================================== +// TAP NDIS 6 Miniport Callbacks +//====================================================================== + +// Returns with reference count initialized to one. +PTAP_ADAPTER_CONTEXT +tapAdapterContextAllocate( + __in NDIS_HANDLE MiniportAdapterHandle +) +{ + PTAP_ADAPTER_CONTEXT adapter = NULL; + + adapter = (PTAP_ADAPTER_CONTEXT )NdisAllocateMemoryWithTagPriority( + GlobalData.NdisDriverHandle, + sizeof(TAP_ADAPTER_CONTEXT), + TAP_ADAPTER_TAG, + NormalPoolPriority + ); + + if(adapter) + { + NET_BUFFER_LIST_POOL_PARAMETERS nblPoolParameters = {0}; + + NdisZeroMemory(adapter,sizeof(TAP_ADAPTER_CONTEXT)); + + adapter->MiniportAdapterHandle = MiniportAdapterHandle; + + // Initialize cancel-safe IRP queue + tapIrpCsqInitialize(&adapter->PendingReadIrpQueue); + + // Initialize TAP send packet queue. + tapPacketQueueInitialize(&adapter->SendPacketQueue); + + // Allocate the adapter lock. + NdisAllocateSpinLock(&adapter->AdapterLock); + + // NBL pool for making TAP receive indications. + NdisZeroMemory(&nblPoolParameters, sizeof(NET_BUFFER_LIST_POOL_PARAMETERS)); + + // Initialize event used to determine when all receive NBLs have been returned. + NdisInitializeEvent(&adapter->ReceiveNblInFlightCountZeroEvent); + + nblPoolParameters.Header.Type = NDIS_OBJECT_TYPE_DEFAULT; + nblPoolParameters.Header.Revision = NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1; + nblPoolParameters.Header.Size = NDIS_SIZEOF_NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1; + nblPoolParameters.ProtocolId = NDIS_PROTOCOL_ID_DEFAULT; + nblPoolParameters.ContextSize = 0; + //nblPoolParameters.ContextSize = sizeof(RX_NETBUFLIST_RSVD); + nblPoolParameters.fAllocateNetBuffer = TRUE; + nblPoolParameters.PoolTag = TAP_RX_NBL_TAG; + +#pragma warning( suppress : 28197 ) + adapter->ReceiveNblPool = NdisAllocateNetBufferListPool( + adapter->MiniportAdapterHandle, + &nblPoolParameters); + + if (adapter->ReceiveNblPool == NULL) + { + DEBUGP (("[TAP] Couldn't allocate adapter receive NBL pool\n")); + NdisFreeMemory(adapter,0,0); + } + + // Add initial reference. 
Normally removed in AdapterHalt. + adapter->RefCount = 1; + + // Safe for multiple removes. + NdisInitializeListHead(&adapter->AdapterListLink); + + // + // The miniport adapter is initially powered up + // + adapter->CurrentPowerState = NdisDeviceStateD0; + } + + return adapter; +} + +VOID +tapReadPermanentAddress( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in NDIS_HANDLE ConfigurationHandle, + __out MACADDR PermanentAddress + ) +{ + NDIS_STATUS status; + NDIS_CONFIGURATION_PARAMETER *configParameter; + NDIS_STRING macKey = NDIS_STRING_CONST("MAC"); + ANSI_STRING macString; + BOOLEAN macFromRegistry = FALSE; + + // Read MAC parameter from registry. + NdisReadConfiguration( + &status, + &configParameter, + ConfigurationHandle, + &macKey, + NdisParameterString + ); + + if (status == NDIS_STATUS_SUCCESS) + { + if( (configParameter->ParameterType == NdisParameterString) + && (configParameter->ParameterData.StringData.Length >= 12) + ) + { + if (RtlUnicodeStringToAnsiString( + &macString, + &configParameter->ParameterData.StringData, + TRUE) == STATUS_SUCCESS + ) + { + macFromRegistry = ParseMAC (PermanentAddress, macString.Buffer); + RtlFreeAnsiString (&macString); + } + } + } + + if(!macFromRegistry) + { + // + // There is no (valid) address stashed in the registry parameter. + // + // Make up a dummy mac address based on the ANSI representation of the + // NetCfgInstanceId GUID. + // + GenerateRandomMac(PermanentAddress, MINIPORT_INSTANCE_ID(Adapter)); + } +} + +NDIS_STATUS +tapReadConfiguration( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; + NDIS_CONFIGURATION_OBJECT configObject; + NDIS_HANDLE configHandle; + + DEBUGP (("[TAP] --> tapReadConfiguration\n")); + + // + // Setup defaults in case configuration cannot be opened. + // + Adapter->MtuSize = ETHERNET_MTU; + Adapter->MediaStateAlwaysConnected = FALSE; + Adapter->LogicalMediaState = FALSE; + Adapter->AllowNonAdmin = FALSE; + // + // Open the registry for this adapter to read advanced + // configuration parameters stored by the INF file. + // + NdisZeroMemory(&configObject, sizeof(configObject)); + + {C_ASSERT(sizeof(configObject) >= NDIS_SIZEOF_CONFIGURATION_OBJECT_REVISION_1);} + configObject.Header.Type = NDIS_OBJECT_TYPE_CONFIGURATION_OBJECT; + configObject.Header.Size = NDIS_SIZEOF_CONFIGURATION_OBJECT_REVISION_1; + configObject.Header.Revision = NDIS_CONFIGURATION_OBJECT_REVISION_1; + + configObject.NdisHandle = Adapter->MiniportAdapterHandle; + configObject.Flags = 0; + + status = NdisOpenConfigurationEx( + &configObject, + &configHandle + ); + + // Read on the opened configuration handle. + if(status == NDIS_STATUS_SUCCESS) + { + NDIS_CONFIGURATION_PARAMETER *configParameter; + NDIS_STRING mkey = NDIS_STRING_CONST("NetCfgInstanceId"); + + // + // Read NetCfgInstanceId from the registry. + // ------------------------------------ + // NetCfgInstanceId is required to create device and associated + // symbolic link for the adapter device. + // + // NetCfgInstanceId is a GUID string provided by NDIS that identifies + // the adapter instance. An example is: + // + // NetCfgInstanceId={410EB49D-2381-4FE7-9B36-498E22619DF0} + // + // Other names are derived from NetCfgInstanceId. 
For example, MiniportName: + // + // MiniportName=\DEVICE\{410EB49D-2381-4FE7-9B36-498E22619DF0} + // + NdisReadConfiguration ( + &status, + &configParameter, + configHandle, + &mkey, + NdisParameterString + ); + + if (status == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterString + && configParameter->ParameterData.StringData.Length <= sizeof(Adapter->NetCfgInstanceIdBuffer) - sizeof(WCHAR)) + { + DEBUGP (("[TAP] NdisReadConfiguration (NetCfgInstanceId=%wZ)\n", + &configParameter->ParameterData.StringData )); + + // Save NetCfgInstanceId as UNICODE_STRING. + Adapter->NetCfgInstanceId.Length = Adapter->NetCfgInstanceId.MaximumLength + = configParameter->ParameterData.StringData.Length; + + Adapter->NetCfgInstanceId.Buffer = Adapter->NetCfgInstanceIdBuffer; + + NdisMoveMemory( + Adapter->NetCfgInstanceId.Buffer, + configParameter->ParameterData.StringData.Buffer, + Adapter->NetCfgInstanceId.Length + ); + + // Save NetCfgInstanceId as ANSI_STRING as well. + if (RtlUnicodeStringToAnsiString ( + &Adapter->NetCfgInstanceIdAnsi, + &configParameter->ParameterData.StringData, + TRUE) != STATUS_SUCCESS + ) + { + DEBUGP (("[TAP] NetCfgInstanceId ANSI name conversion failed\n")); + status = NDIS_STATUS_RESOURCES; + } + } + else + { + DEBUGP (("[TAP] NetCfgInstanceId has invalid type\n")); + status = NDIS_STATUS_INVALID_DATA; + } + } + else + { + DEBUGP (("[TAP] NetCfgInstanceId failed\n")); + status = NDIS_STATUS_INVALID_DATA; + } + + if (status == NDIS_STATUS_SUCCESS) + { + NDIS_STATUS localStatus; // Use default if these fail. + NDIS_CONFIGURATION_PARAMETER *configParameter; + NDIS_STRING mtuKey = NDIS_STRING_CONST("MTU"); + NDIS_STRING mediaStatusKey = NDIS_STRING_CONST("MediaStatus"); +#if ENABLE_NONADMIN + NDIS_STRING allowNonAdminKey = NDIS_STRING_CONST("AllowNonAdmin"); +#endif + + // Read MTU from the registry. + NdisReadConfiguration ( + &localStatus, + &configParameter, + configHandle, + &mtuKey, + NdisParameterInteger + ); + + if (localStatus == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterInteger) + { + int mtu = configParameter->ParameterData.IntegerData; + + if(mtu == 0) + { + mtu = ETHERNET_MTU; + } + + // Sanity check + if (mtu < MINIMUM_MTU) + { + mtu = MINIMUM_MTU; + } + else if (mtu > MAXIMUM_MTU) + { + mtu = MAXIMUM_MTU; + } + + Adapter->MtuSize = mtu; + } + } + + DEBUGP (("[%s] Using MTU %d\n", + MINIPORT_INSTANCE_ID (Adapter), + Adapter->MtuSize + )); + + // Read MediaStatus setting from registry. + NdisReadConfiguration ( + &localStatus, + &configParameter, + configHandle, + &mediaStatusKey, + NdisParameterInteger + ); + + if (localStatus == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterInteger) + { + if(configParameter->ParameterData.IntegerData == 0) + { + // Connect state is appplication controlled. + DEBUGP(("[%s] Initial MediaConnectState: Application Controlled\n", + MINIPORT_INSTANCE_ID (Adapter))); + + Adapter->MediaStateAlwaysConnected = FALSE; + Adapter->LogicalMediaState = FALSE; + } + else + { + // Connect state is always connected. + DEBUGP(("[%s] Initial MediaConnectState: Always Connected\n", + MINIPORT_INSTANCE_ID (Adapter))); + + Adapter->MediaStateAlwaysConnected = TRUE; + Adapter->LogicalMediaState = TRUE; + } + } + } + + // Read MAC PermanentAddress setting from registry. 
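+            // tapReadPermanentAddress (defined above) parses a user-supplied
+            // "MAC" registry string via ParseMAC and, when none is present or
+            // valid, derives a stable address from the NetCfgInstanceId GUID
+            // with GenerateRandomMac.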
+ tapReadPermanentAddress( + Adapter, + configHandle, + Adapter->PermanentAddress + ); + + DEBUGP (("[%s] Using MAC PermanentAddress %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x\n", + MINIPORT_INSTANCE_ID (Adapter), + Adapter->PermanentAddress[0], + Adapter->PermanentAddress[1], + Adapter->PermanentAddress[2], + Adapter->PermanentAddress[3], + Adapter->PermanentAddress[4], + Adapter->PermanentAddress[5]) + ); + + // Now seed the current MAC address with the permanent address. + ETH_COPY_NETWORK_ADDRESS(Adapter->CurrentAddress, Adapter->PermanentAddress); + + DEBUGP (("[%s] Using MAC CurrentAddress %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x\n", + MINIPORT_INSTANCE_ID (Adapter), + Adapter->CurrentAddress[0], + Adapter->CurrentAddress[1], + Adapter->CurrentAddress[2], + Adapter->CurrentAddress[3], + Adapter->CurrentAddress[4], + Adapter->CurrentAddress[5]) + ); + + // Read optional AllowNonAdmin setting from registry. +#if ENABLE_NONADMIN + NdisReadConfiguration ( + &localStatus, + &configParameter, + configHandle, + &allowNonAdminKey, + NdisParameterInteger + ); + + if (localStatus == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterInteger) + { + Adapter->AllowNonAdmin = TRUE; + } + } +#endif + } + + // Close the configuration handle. + NdisCloseConfiguration(configHandle); + } + else + { + DEBUGP (("[TAP] Couldn't open adapter registry\n")); + } + + DEBUGP (("[TAP] <-- tapReadConfiguration; status = %8.8X\n",status)); + + return status; +} + +VOID +tapAdapterContextAddToGlobalList( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LOCK_STATE lockState; + PLIST_ENTRY listEntry = &Adapter->AdapterListLink; + + // Acquire global adapter list lock. + NdisAcquireReadWriteLock( + &GlobalData.Lock, + TRUE, // Acquire for write + &lockState + ); + + // Adapter context should NOT be in any list. + ASSERT( (listEntry->Flink == listEntry) && (listEntry->Blink == listEntry ) ); + + // Add reference to persist until after removal. + tapAdapterContextReference(Adapter); + + // Add the adapter context to the global list. + InsertTailList(&GlobalData.AdapterList,&Adapter->AdapterListLink); + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); +} + +VOID +tapAdapterContextRemoveFromGlobalList( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LOCK_STATE lockState; + + // Acquire global adapter list lock. + NdisAcquireReadWriteLock( + &GlobalData.Lock, + TRUE, // Acquire for write + &lockState + ); + + // Remove the adapter context from the global list. + RemoveEntryList(&Adapter->AdapterListLink); + + // Safe for multiple removes. + NdisInitializeListHead(&Adapter->AdapterListLink); + + // Remove reference added in tapAdapterContextAddToGlobalList. + tapAdapterContextDereference(Adapter); + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); +} + +// Returns with added reference on adapter context. +PTAP_ADAPTER_CONTEXT +tapAdapterContextFromDeviceObject( + __in PDEVICE_OBJECT DeviceObject + ) +{ + LOCK_STATE lockState; + + // Acquire global adapter list lock. 
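+    // Unlike the add/remove routines above, which take this lock for write,
+    // the lookup only walks the list and so acquires shared (read) access.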
+ NdisAcquireReadWriteLock( + &GlobalData.Lock, + FALSE, // Acquire for read + &lockState + ); + + if (!IsListEmpty(&GlobalData.AdapterList)) + { + PLIST_ENTRY entry = GlobalData.AdapterList.Flink; + PTAP_ADAPTER_CONTEXT adapter; + + while (entry != &GlobalData.AdapterList) + { + adapter = CONTAINING_RECORD(entry, TAP_ADAPTER_CONTEXT, AdapterListLink); + + // Match on DeviceObject + if(adapter->DeviceObject == DeviceObject ) + { + // Add reference to adapter context. + tapAdapterContextReference(adapter); + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); + + return adapter; + } + + // Move to next entry + entry = entry->Flink; + } + } + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); + + return (PTAP_ADAPTER_CONTEXT )NULL; +} + +NDIS_STATUS +AdapterSetOptions( + __in NDIS_HANDLE NdisDriverHandle, + __in NDIS_HANDLE DriverContext + ) +/*++ +Routine Description: + + The MiniportSetOptions function registers optional handlers. For each + optional handler that should be registered, this function makes a call + to NdisSetOptionalHandlers. + + MiniportSetOptions runs at IRQL = PASSIVE_LEVEL. + +Arguments: + + DriverContext The context handle + +Return Value: + + NDIS_STATUS_xxx code + +--*/ +{ + NDIS_STATUS status; + + DEBUGP (("[TAP] --> AdapterSetOptions\n")); + + // + // Set any optional handlers by filling out the appropriate struct and + // calling NdisSetOptionalHandlers here. + // + + status = NDIS_STATUS_SUCCESS; + + DEBUGP (("[TAP] <-- AdapterSetOptions; status = %8.8X\n",status)); + + return status; +} + +NDIS_STATUS +AdapterCreate( + __in NDIS_HANDLE MiniportAdapterHandle, + __in NDIS_HANDLE MiniportDriverContext, + __in PNDIS_MINIPORT_INIT_PARAMETERS MiniportInitParameters + ) +{ + PTAP_ADAPTER_CONTEXT adapter = NULL; + NDIS_STATUS status; + + UNREFERENCED_PARAMETER(MiniportDriverContext); + UNREFERENCED_PARAMETER(MiniportInitParameters); + + DEBUGP (("[TAP] --> AdapterCreate\n")); + + do + { + NDIS_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES regAttributes = {0}; + NDIS_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES genAttributes = {0}; + NDIS_PNP_CAPABILITIES pnpCapabilities = {0}; + + // + // Allocate adapter context structure and initialize all the + // memory resources for sending and receiving packets. + // + // Returns with reference count initialized to one. + // + adapter = tapAdapterContextAllocate(MiniportAdapterHandle); + + if(adapter == NULL) + { + DEBUGP (("[TAP] Couldn't allocate adapter memory\n")); + status = NDIS_STATUS_RESOURCES; + break; + } + + // Enter the Initializing state. + DEBUGP (("[TAP] Miniport State: Initializing\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportInitializingState; + tapAdapterReleaseLock(adapter,FALSE); + + // + // First read adapter configuration from registry. + // ----------------------------------------------- + // Subsequent device registration will fail if NetCfgInstanceId + // has not been successfully read. + // + status = tapReadConfiguration(adapter); + + // + // Set the registration attributes. 
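+        // Every NDIS object passed to NdisMSetMiniportAttributes starts with
+        // an NDIS_OBJECT_HEADER whose Type/Revision/Size triple must match
+        // the structure revision being claimed; the C_ASSERT below checks
+        // that the local variable is large enough for that revision.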
+        //
+        {C_ASSERT(sizeof(regAttributes) >= NDIS_SIZEOF_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES_REVISION_1);}
+        regAttributes.Header.Type = NDIS_OBJECT_TYPE_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES;
+        regAttributes.Header.Size = NDIS_SIZEOF_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES_REVISION_1;
+        regAttributes.Header.Revision = NDIS_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES_REVISION_1;
+
+        regAttributes.MiniportAdapterContext = adapter;
+        regAttributes.AttributeFlags = TAP_ADAPTER_ATTRIBUTES_FLAGS;
+
+        regAttributes.CheckForHangTimeInSeconds = TAP_ADAPTER_CHECK_FOR_HANG_TIME_IN_SECONDS;
+        regAttributes.InterfaceType = TAP_INTERFACE_TYPE;
+
+        //NDIS_DECLARE_MINIPORT_ADAPTER_CONTEXT(TAP_ADAPTER_CONTEXT);
+        status = NdisMSetMiniportAttributes(
+                    MiniportAdapterHandle,
+                    (PNDIS_MINIPORT_ADAPTER_ATTRIBUTES)&regAttributes
+                    );
+
+        if (status != NDIS_STATUS_SUCCESS)
+        {
+            DEBUGP (("[TAP] NdisMSetMiniportAttributes failed; Status 0x%08x\n",status));
+            break;
+        }
+
+        //
+        // Next, set the general attributes.
+        //
+        {C_ASSERT(sizeof(genAttributes) >= NDIS_SIZEOF_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES_REVISION_1);}
+        genAttributes.Header.Type = NDIS_OBJECT_TYPE_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES;
+        genAttributes.Header.Size = NDIS_SIZEOF_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES_REVISION_1;
+        genAttributes.Header.Revision = NDIS_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES_REVISION_1;
+
+        //
+        // Specify the medium type that the NIC can support but not
+        // necessarily the medium type that the NIC currently uses.
+        //
+        genAttributes.MediaType = TAP_MEDIUM_TYPE;
+
+        //
+        // Specify the medium type that the NIC currently uses.
+        //
+        genAttributes.PhysicalMediumType = TAP_PHYSICAL_MEDIUM;
+
+        //
+        // Specify the maximum network frame size, in bytes, that the NIC
+        // supports excluding the header.
+        //
+        genAttributes.MtuSize = TAP_FRAME_MAX_DATA_SIZE;
+        genAttributes.MaxXmitLinkSpeed = TAP_XMIT_SPEED;
+        genAttributes.XmitLinkSpeed = TAP_XMIT_SPEED;
+        genAttributes.MaxRcvLinkSpeed = TAP_RECV_SPEED;
+        genAttributes.RcvLinkSpeed = TAP_RECV_SPEED;
+
+        if(adapter->MediaStateAlwaysConnected)
+        {
+            DEBUGP(("[%s] Initial MediaConnectState: Connected\n",
+                MINIPORT_INSTANCE_ID (adapter)));
+
+            genAttributes.MediaConnectState = MediaConnectStateConnected;
+        }
+        else
+        {
+            DEBUGP(("[%s] Initial MediaConnectState: Disconnected\n",
+                MINIPORT_INSTANCE_ID (adapter)));
+
+            genAttributes.MediaConnectState = MediaConnectStateDisconnected;
+        }
+
+        genAttributes.MediaDuplexState = MediaDuplexStateFull;
+
+        //
+        // The maximum number of bytes the NIC can provide as lookahead data.
+        // If that value is different from the size of the lookahead buffer
+        // supported by bound protocols, NDIS will call MiniportOidRequest to
+        // set the size of the lookahead buffer provided by the miniport driver
+        // to the minimum of the miniport driver and protocol(s) values. If the
+        // driver always indicates up full packets with
+        // NdisMIndicateReceiveNetBufferLists, it should set this value to the
+        // maximum total frame size, which excludes the header.
+        //
+        // Upper-layer drivers examine lookahead data to determine whether a
+        // packet that is associated with the lookahead data is intended for
+        // one or more of their clients. If the underlying driver supports
+        // multipacket receive indications, bound protocols are given full net
+        // packets on every indication. Consequently, this value is identical
+        // to that returned for OID_GEN_RECEIVE_BLOCK_SIZE.
+ // + genAttributes.LookaheadSize = TAP_MAX_LOOKAHEAD; + genAttributes.MacOptions = TAP_MAC_OPTIONS; + genAttributes.SupportedPacketFilters = TAP_SUPPORTED_FILTERS; + + // + // The maximum number of multicast addresses the NIC driver can manage. + // This list is global for all protocols bound to (or above) the NIC. + // Consequently, a protocol can receive NDIS_STATUS_MULTICAST_FULL from + // the NIC driver when attempting to set the multicast address list, + // even if the number of elements in the given list is less than the + // number originally returned for this query. + // + genAttributes.MaxMulticastListSize = TAP_MAX_MCAST_LIST; + genAttributes.MacAddressLength = MACADDR_SIZE; + + // + // Return the MAC address of the NIC burnt in the hardware. + // + ETH_COPY_NETWORK_ADDRESS(genAttributes.PermanentMacAddress, adapter->PermanentAddress); + + // + // Return the MAC address the NIC is currently programmed to use. Note + // that this address could be different from the permananent address as + // the user can override using registry. Read NdisReadNetworkAddress + // doc for more info. + // + ETH_COPY_NETWORK_ADDRESS(genAttributes.CurrentMacAddress, adapter->CurrentAddress); + + genAttributes.RecvScaleCapabilities = NULL; + genAttributes.AccessType = TAP_ACCESS_TYPE; + genAttributes.DirectionType = TAP_DIRECTION_TYPE; + genAttributes.ConnectionType = TAP_CONNECTION_TYPE; + genAttributes.IfType = TAP_IFTYPE; + genAttributes.IfConnectorPresent = TAP_HAS_PHYSICAL_CONNECTOR; + genAttributes.SupportedStatistics = TAP_SUPPORTED_STATISTICS; + genAttributes.SupportedPauseFunctions = NdisPauseFunctionsUnsupported; // IEEE 802.3 pause frames + genAttributes.DataBackFillSize = 0; + genAttributes.ContextBackFillSize = 0; + + // + // The SupportedOidList is an array of OIDs for objects that the + // underlying driver or its NIC supports. Objects include general, + // media-specific, and implementation-specific objects. NDIS forwards a + // subset of the returned list to protocols that make this query. That + // is, NDIS filters any supported statistics OIDs out of the list + // because protocols never make statistics queries. + // + genAttributes.SupportedOidList = TAPSupportedOids; + genAttributes.SupportedOidListLength = sizeof(TAPSupportedOids); + genAttributes.AutoNegotiationFlags = NDIS_LINK_STATE_DUPLEX_AUTO_NEGOTIATED; + + // + // Set power management capabilities + // + NdisZeroMemory(&pnpCapabilities, sizeof(pnpCapabilities)); + pnpCapabilities.WakeUpCapabilities.MinMagicPacketWakeUp = NdisDeviceStateUnspecified; + pnpCapabilities.WakeUpCapabilities.MinPatternWakeUp = NdisDeviceStateUnspecified; + genAttributes.PowerManagementCapabilities = &pnpCapabilities; + + status = NdisMSetMiniportAttributes( + MiniportAdapterHandle, + (PNDIS_MINIPORT_ADAPTER_ATTRIBUTES)&genAttributes + ); + + if (status != NDIS_STATUS_SUCCESS) + { + DEBUGP (("[TAP] NdisMSetMiniportAttributes failed; Status 0x%08x\n",status)); + break; + } + + // + // Create the Win32 device I/O interface. + // + status = CreateTapDevice(adapter); + + if (status == NDIS_STATUS_SUCCESS) + { + // Add this adapter to the global adapter list. + tapAdapterContextAddToGlobalList(adapter); + } + else + { + DEBUGP (("[TAP] CreateTapDevice failed; Status 0x%08x\n",status)); + break; + } + } while(FALSE); + + if(status == NDIS_STATUS_SUCCESS) + { + // Enter the Paused state if initialization is complete. 
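+        // (On the failure path below, the initial reference taken in
+        // tapAdapterContextAllocate is dropped instead, which frees the
+        // adapter context and the resources allocated with it.)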
+ DEBUGP (("[TAP] Miniport State: Paused\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportPausedState; + tapAdapterReleaseLock(adapter,FALSE); + } + else + { + if(adapter != NULL) + { + DEBUGP (("[TAP] Miniport State: Halted\n")); + + // + // Remove reference when adapter context was allocated + // --------------------------------------------------- + // This should result in freeing adapter context memory + // and assiciated resources. + // + tapAdapterContextDereference(adapter); + adapter = NULL; + } + } + + DEBUGP (("[TAP] <-- AdapterCreate; status = %8.8X\n",status)); + + return status; +} + +VOID +AdapterHalt( + __in NDIS_HANDLE MiniportAdapterContext, + __in NDIS_HALT_ACTION HaltAction + ) +/*++ + +Routine Description: + + Halt handler is called when NDIS receives IRP_MN_STOP_DEVICE, + IRP_MN_SUPRISE_REMOVE or IRP_MN_REMOVE_DEVICE requests from the PNP + manager. Here, the driver should free all the resources acquired in + MiniportInitialize and stop access to the hardware. NDIS will not submit + any further request once this handler is invoked. + + 1) Free and unmap all I/O resources. + 2) Disable interrupt and deregister interrupt handler. + 3) Deregister shutdown handler regsitered by + NdisMRegisterAdapterShutdownHandler . + 4) Cancel all queued up timer callbacks. + 5) Finally wait indefinitely for all the outstanding receive + packets indicated to the protocol to return. + + MiniportHalt runs at IRQL = PASSIVE_LEVEL. + + +Arguments: + + MiniportAdapterContext Pointer to the Adapter + HaltAction The reason for halting the adapter + +Return Value: + + None. + +--*/ +{ + PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext; + + UNREFERENCED_PARAMETER(HaltAction); + + DEBUGP (("[TAP] --> AdapterHalt\n")); + + // Enter the Halted state. + DEBUGP (("[TAP] Miniport State: Halted\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportHaltedState; + tapAdapterReleaseLock(adapter,FALSE); + + // Remove this adapter from the global adapter list. + tapAdapterContextRemoveFromGlobalList(adapter); + + // BUGBUG!!! Call AdapterShutdownEx to do some of the work of stopping. + + // TODO!!! More... + + // + // Destroy the TAP Win32 device. + // + DestroyTapDevice(adapter); + + // + // Remove initial reference added in AdapterCreate. + // ------------------------------------------------ + // This should result in freeing adapter context memory + // and resources allocated in AdapterCreate. + // + tapAdapterContextDereference(adapter); + adapter = NULL; + + DEBUGP (("[TAP] <-- AdapterHalt\n")); +} + +VOID +tapWaitForReceiveNblInFlightCountZeroEvent( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LONG nblCount; + + // + // Wait until higher-level protocol has returned all NBLs + // to the driver. + // + + // Add one NBL "bias" to insure allow event to be reset safely. + nblCount = NdisInterlockedIncrement(&Adapter->ReceiveNblInFlightCount); + ASSERT(nblCount > 0 ); + NdisResetEvent(&Adapter->ReceiveNblInFlightCountZeroEvent); + + // + // Now remove the bias and wait for the ReceiveNblInFlightCountZeroEvent + // if the count returned is not zero. 
+    //
+    nblCount = NdisInterlockedDecrement(&Adapter->ReceiveNblInFlightCount);
+    ASSERT(nblCount >= 0);
+
+    if(nblCount)
+    {
+        LARGE_INTEGER startTime, currentTime;
+
+        NdisGetSystemUpTimeEx(&startTime);
+
+        for (;;)
+        {
+            BOOLEAN waitResult = NdisWaitEvent(
+                &Adapter->ReceiveNblInFlightCountZeroEvent,
+                TAP_WAIT_POLL_LOOP_TIMEOUT
+                );
+
+            NdisGetSystemUpTimeEx(&currentTime);
+
+            if (waitResult)
+            {
+                break;
+            }
+
+            DEBUGP (("[%s] Waiting for %d in-flight receive NBLs to be returned.\n",
+                MINIPORT_INSTANCE_ID (Adapter),
+                Adapter->ReceiveNblInFlightCount
+                ));
+        }
+
+        DEBUGP (("[%s] Waited %d ms for all in-flight NBLs to be returned.\n",
+            MINIPORT_INSTANCE_ID (Adapter),
+            (currentTime.LowPart - startTime.LowPart)
+            ));
+    }
+}
+
+NDIS_STATUS
+AdapterPause(
+    __in  NDIS_HANDLE                       MiniportAdapterContext,
+    __in  PNDIS_MINIPORT_PAUSE_PARAMETERS   PauseParameters
+    )
+/*++
+
+Routine Description:
+
+    When a miniport receives a pause request, it enters into a Pausing state.
+    The miniport should not indicate up any more network data. Any pending
+    send requests must be completed, and new requests must be rejected with
+    NDIS_STATUS_PAUSED.
+
+    Once all sends have been completed and all receive NBLs have returned to
+    the miniport, the miniport enters the Paused state.
+
+    While paused, the miniport can still service interrupts from the hardware
+    (to, for example, continue to indicate NDIS_STATUS_MEDIA_CONNECT
+    notifications).
+
+    The miniport must continue to be able to handle status indications and OID
+    requests. MiniportPause is different from MiniportHalt because, in
+    general, the MiniportPause operation won't release any resources.
+    MiniportPause must not attempt to acquire any resources where allocation
+    can fail, since MiniportPause itself must not fail.
+
+    MiniportPause runs at IRQL = PASSIVE_LEVEL.
+
+Arguments:
+
+    MiniportAdapterContext  Pointer to the Adapter
+    PauseParameters         Additional information about the pause operation
+
+Return Value:
+
+    If the miniport is able to immediately enter the Paused state, it should
+    return NDIS_STATUS_SUCCESS.
+
+    If the miniport must wait for send completions or pending receive NBLs, it
+    should return NDIS_STATUS_PENDING now, and call NdisMPauseComplete when the
+    miniport has entered the Paused state.
+
+    No other return value is permitted. The pause operation must not fail.
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    NDIS_STATUS    status;
+
+    UNREFERENCED_PARAMETER(PauseParameters);
+
+    DEBUGP (("[TAP] --> AdapterPause\n"));
+
+    // Enter the Pausing state.
+    DEBUGP (("[TAP] Miniport State: Pausing\n"));
+
+    tapAdapterAcquireLock(adapter,FALSE);
+    adapter->Locked.AdapterState = MiniportPausingState;
+    tapAdapterReleaseLock(adapter,FALSE);
+
+    //
+    // Stop the flow of network data through the receive path
+    // ------------------------------------------------------
+    // In the Pausing and Paused state tapAdapterSendAndReceiveReady
+    // will prevent new calls to NdisMIndicateReceiveNetBufferLists
+    // to indicate additional receive NBLs to the host.
+    //
+    // However, there may be some in-flight NBLs owned by the driver
+    // that have been indicated to the host but have not yet been
+    // returned.
+    //
+    // Wait here for all in-flight receive indications to be returned.
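+    // The helper takes a temporary +1 "bias" reference so the event can be
+    // reset safely, drops it, and then polls NdisWaitEvent in
+    // TAP_WAIT_POLL_LOOP_TIMEOUT slices (logging progress) until the
+    // in-flight count reaches zero.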
+ // + tapWaitForReceiveNblInFlightCountZeroEvent(adapter); + + // + // Stop the flow of network data through the send path + // --------------------------------------------------- + // The initial implementation of the NDIS 6 send path follows the + // NDIS 5 pattern. Under this approach every send packet is copied + // into a driver-owned TAP_PACKET structure and the NBL owned by + // higher-level protocol is immediatly completed. + // + // With this deep-copy approach the driver never claims ownership + // of any send NBL. + // + // A future implementation may queue send NBLs and thereby eliminate + // the need for the unnecessary allocation and deep copy of each packet. + // + // So, nothing to do here for the send path for now... + + status = NDIS_STATUS_SUCCESS; + + // Enter the Paused state. + DEBUGP (("[TAP] Miniport State: Paused\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportPausedState; + tapAdapterReleaseLock(adapter,FALSE); + + DEBUGP (("[TAP] <-- AdapterPause; status = %8.8X\n",status)); + + return status; +} + +NDIS_STATUS +AdapterRestart( + __in NDIS_HANDLE MiniportAdapterContext, + __in PNDIS_MINIPORT_RESTART_PARAMETERS RestartParameters + ) +/*++ + +Routine Description: + + When a miniport receives a restart request, it enters into a Restarting + state. The miniport may begin indicating received data (e.g., using + NdisMIndicateReceiveNetBufferLists), handling status indications, and + processing OID requests in the Restarting state. However, no sends will be + requested while the miniport is in the Restarting state. + + Once the miniport is ready to send data, it has entered the Running state. + The miniport informs NDIS that it is in the Running state by returning + NDIS_STATUS_SUCCESS from this MiniportRestart function; or if this function + has already returned NDIS_STATUS_PENDING, by calling NdisMRestartComplete. + + + MiniportRestart runs at IRQL = PASSIVE_LEVEL. + +Arguments: + + MiniportAdapterContext Pointer to the Adapter + RestartParameters Additional information about the restart operation + +Return Value: + + If the miniport is able to immediately enter the Running state, it should + return NDIS_STATUS_SUCCESS. + + If the miniport is still in the Restarting state, it should return + NDIS_STATUS_PENDING now, and call NdisMRestartComplete when the miniport + has entered the Running state. + + Other NDIS_STATUS codes indicate errors. If an error is encountered, the + miniport must return to the Paused state (i.e., stop indicating receives). + +--*/ +{ + PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext; + NDIS_STATUS status; + + UNREFERENCED_PARAMETER(RestartParameters); + + DEBUGP (("[TAP] --> AdapterRestart\n")); + + // Enter the Restarting state. + DEBUGP (("[TAP] Miniport State: Restarting\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportRestartingState; + tapAdapterReleaseLock(adapter,FALSE); + + status = NDIS_STATUS_SUCCESS; + + if(status == NDIS_STATUS_SUCCESS) + { + // Enter the Running state. + DEBUGP (("[TAP] Miniport State: Running\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportRunning; + tapAdapterReleaseLock(adapter,FALSE); + } + else + { + // Enter the Paused state if restart failed. 
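+        // (Unreachable in the current implementation: status is set to
+        // NDIS_STATUS_SUCCESS unconditionally above; the branch is kept to
+        // mirror the canonical restart pattern.)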
+        DEBUGP (("[TAP] Miniport State: Paused\n"));
+
+        tapAdapterAcquireLock(adapter,FALSE);
+        adapter->Locked.AdapterState = MiniportPausedState;
+        tapAdapterReleaseLock(adapter,FALSE);
+    }
+
+    DEBUGP (("[TAP] <-- AdapterRestart; status = %8.8X\n",status));
+
+    return status;
+}
+
+BOOLEAN
+tapAdapterReadAndWriteReady(
+    __in PTAP_ADAPTER_CONTEXT     Adapter
+    )
+/*++
+
+Routine Description:
+
+    This routine determines whether the adapter device interface can
+    accept read and write operations.
+
+Arguments:
+
+    Adapter              Pointer to our adapter context
+
+Return Value:
+
+    Returns TRUE if the adapter state allows it to queue IRPs passed to
+    the device read and write callbacks.
+--*/
+{
+    if(!Adapter->TapDeviceCreated)
+    {
+        // TAP device not created or is being destroyed.
+        return FALSE;
+    }
+
+    if(Adapter->TapFileObject == NULL)
+    {
+        // TAP application file object not open.
+        return FALSE;
+    }
+
+    if(!Adapter->TapFileIsOpen)
+    {
+        // TAP application file object may be closing.
+        return FALSE;
+    }
+
+    if(!Adapter->LogicalMediaState)
+    {
+        // Don't handle read/write if media not connected.
+        return FALSE;
+    }
+
+    if(Adapter->CurrentPowerState != NdisDeviceStateD0)
+    {
+        // Don't handle read/write if device is not fully powered.
+        return FALSE;
+    }
+
+    return TRUE;
+}
+
+NDIS_STATUS
+tapAdapterSendAndReceiveReady(
+    __in PTAP_ADAPTER_CONTEXT     Adapter
+    )
+/*++
+
+Routine Description:
+
+    This routine determines whether the adapter NDIS send and receive
+    paths are ready.
+
+    This routine examines various adapter state variables and returns
+    a value that indicates whether the adapter NDIS interfaces can
+    accept send packets or indicate receive packets.
+
+    In normal operation the adapter may temporarily enter and then exit
+    a not-ready condition. In particular, the adapter becomes not-ready
+    when in the Pausing/Paused states, but may become ready again when
+    Restarted.
+
+    Runs at IRQL <= DISPATCH_LEVEL
+
+Arguments:
+
+    Adapter              Pointer to our adapter context
+
+Return Value:
+
+    Returns NDIS_STATUS_SUCCESS if the adapter state allows it to
+    accept send packets and indicate receive packets.
+
+    Otherwise it returns an NDIS_STATUS value other than NDIS_STATUS_SUCCESS.
+    These status values can be used directly as the completion status for
+    packets that must be completed immediately in the send path.
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    //
+    // Check various state variables to insure adapter is ready.
+    //
+    tapAdapterAcquireLock(Adapter,FALSE);
+
+    if(!Adapter->LogicalMediaState)
+    {
+        status = NDIS_STATUS_MEDIA_DISCONNECTED;
+    }
+    else if(Adapter->CurrentPowerState != NdisDeviceStateD0)
+    {
+        status = NDIS_STATUS_LOW_POWER_STATE;
+    }
+    else if(Adapter->ResetInProgress)
+    {
+        status = NDIS_STATUS_RESET_IN_PROGRESS;
+    }
+    else
+    {
+        switch(Adapter->Locked.AdapterState)
+        {
+        case MiniportPausingState:
+        case MiniportPausedState:
+            status = NDIS_STATUS_PAUSED;
+            break;
+
+        case MiniportHaltedState:
+            status = NDIS_STATUS_INVALID_STATE;
+            break;
+
+        default:
+            status = NDIS_STATUS_SUCCESS;
+            break;
+        }
+    }
+
+    tapAdapterReleaseLock(Adapter,FALSE);
+
+    return status;
+}
+
+BOOLEAN
+AdapterCheckForHangEx(
+    __in  NDIS_HANDLE MiniportAdapterContext
+    )
+/*++
+
+Routine Description:
+
+    The MiniportCheckForHangEx handler is called to report the state of the
+    NIC, or to monitor the responsiveness of an underlying device driver.
+    This is an optional function.
+    If this handler is not specified, NDIS judges the driver unresponsive
+    when the driver holds MiniportQueryInformation or MiniportSetInformation
+    requests for a time-out interval (default 4 sec), and then calls the
+    driver's MiniportReset function.  A NIC driver's MiniportInitialize
+    function can extend NDIS's time-out interval by calling
+    NdisMSetAttributesEx to avoid unnecessary resets.
+
+    MiniportCheckForHangEx runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+    MiniportAdapterContext  Pointer to our adapter
+
+Return Value:
+
+    TRUE    NDIS calls the driver's MiniportReset function.
+    FALSE   Everything is fine
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    //DEBUGP (("[TAP] --> AdapterCheckForHangEx\n"));
+
+    //DEBUGP (("[TAP] <-- AdapterCheckForHangEx; status = FALSE\n"));
+
+    return FALSE;   // Everything is fine
+}
+
+NDIS_STATUS
+AdapterReset(
+    __in   NDIS_HANDLE      MiniportAdapterContext,
+    __out PBOOLEAN          AddressingReset
+    )
+/*++
+
+Routine Description:
+
+    MiniportResetEx is required to issue a hardware reset to the NIC
+    and/or to reset the driver's software state.
+
+    1) The miniport driver can optionally complete any pending
+       OID requests. NDIS will submit no further OID requests
+       to the miniport driver for the NIC being reset until
+       the reset operation has finished. After the reset,
+       NDIS will resubmit to the miniport driver any OID requests
+       that were pending but not completed by the miniport driver
+       before the reset.
+
+    2) A deserialized miniport driver must complete any pending send
+       operations. NDIS will not requeue pending send packets for
+       a deserialized driver since NDIS does not maintain the send
+       queue for such a driver.
+
+    3) If MiniportReset returns NDIS_STATUS_PENDING, the driver must
+       complete the original request subsequently with a call to
+       NdisMResetComplete.
+
+    MiniportReset runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+AddressingReset - If multicast or functional addressing information
+                  or the lookahead size, is changed by a reset,
+                  MiniportReset must set the variable at AddressingReset
+                  to TRUE before it returns control. This causes NDIS to
+                  call the MiniportSetInformation function to restore
+                  the information.
+
+MiniportAdapterContext - Pointer to our adapter
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    NDIS_STATUS status;
+
+    DEBUGP (("[TAP] --> AdapterReset\n"));
+
+    // Indicate that adapter reset is in progress.
+    adapter->ResetInProgress = TRUE;
+
+    // See note above...
+    *AddressingReset = FALSE;
+
+    // BUGBUG!!! TODO!!! Lots of work here...
+
+    // Indicate that adapter reset has completed.
+    adapter->ResetInProgress = FALSE;
+
+    status = NDIS_STATUS_SUCCESS;
+
+    DEBUGP (("[TAP] <-- AdapterReset; status = %8.8X\n",status));
+
+    return status;
+}
+
+VOID
+AdapterDevicePnpEventNotify(
+    __in NDIS_HANDLE                MiniportAdapterContext,
+    __in PNET_DEVICE_PNP_EVENT      NetDevicePnPEvent
+    )
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    DEBUGP (("[TAP] --> AdapterDevicePnpEventNotify\n"));
+
+/*
+    switch (NetDevicePnPEvent->DevicePnPEvent)
+    {
+        case NdisDevicePnPEventSurpriseRemoved:
+            //
+            // Called when NDIS receives IRP_MN_SURPRISE_REMOVAL.
+            // NDIS calls MiniportHalt function after this call returns.
+            //
+            MP_SET_FLAG(Adapter, fMP_ADAPTER_SURPRISE_REMOVED);
+            DEBUGP(MP_INFO, "[%p] MPDevicePnpEventNotify: NdisDevicePnPEventSurpriseRemoved\n", Adapter);
+            break;
+
+        case NdisDevicePnPEventPowerProfileChanged:
+            //
+            // After initializing a miniport driver and after miniport driver
+            // receives an OID_PNP_SET_POWER notification that specifies
+            // a device power state of NdisDeviceStateD0 (the powered-on state),
+            // NDIS calls the miniport's MiniportPnPEventNotify function with
+            // PnPEvent set to NdisDevicePnPEventPowerProfileChanged.
+            //
+            DEBUGP(MP_INFO, "[%p] MPDevicePnpEventNotify: NdisDevicePnPEventPowerProfileChanged\n", Adapter);
+
+            if (NetDevicePnPEvent->InformationBufferLength == sizeof(ULONG))
+            {
+                ULONG NdisPowerProfile = *((PULONG)NetDevicePnPEvent->InformationBuffer);
+
+                if (NdisPowerProfile == NdisPowerProfileBattery)
+                {
+                    DEBUGP(MP_INFO, "[%p] The host system is running on battery power\n", Adapter);
+                }
+                if (NdisPowerProfile == NdisPowerProfileAcOnLine)
+                {
+                    DEBUGP(MP_INFO, "[%p] The host system is running on AC power\n", Adapter);
+                }
+            }
+            break;
+
+        default:
+            DEBUGP(MP_ERROR, "[%p] MPDevicePnpEventNotify: unknown PnP event 0x%x\n", Adapter, NetDevicePnPEvent->DevicePnPEvent);
+    }
+*/
+    DEBUGP (("[TAP] <-- AdapterDevicePnpEventNotify\n"));
+}
+
+VOID
+AdapterShutdownEx(
+    __in NDIS_HANDLE                MiniportAdapterContext,
+    __in NDIS_SHUTDOWN_ACTION       ShutdownAction
+    )
+/*++
+
+Routine Description:
+
+    The MiniportShutdownEx handler restores hardware to its initial state when
+    the system is shut down, whether by the user or because an unrecoverable
+    system error occurred. This is to ensure that the NIC is in a known
+    state and ready to be reinitialized when the machine is rebooted after
+    a system shutdown occurs for any reason, including a crash dump.
+
+    Here just disable the interrupt and stop the DMA engine.  Do not free
+    memory resources or wait for any packet transfers to complete.  Do not call
+    into NDIS at this time.
+
+    This can be called at arbitrary IRQL, including in the context of a
+    bugcheck.
+
+Arguments:
+
+    MiniportAdapterContext  Pointer to our adapter
+    ShutdownAction  The reason why NDIS called the shutdown function
+
+Return Value:
+
+    None.
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    UNREFERENCED_PARAMETER(ShutdownAction);
+
+    DEBUGP (("[TAP] --> AdapterShutdownEx\n"));
+
+    // Enter the Shutdown state.
+    DEBUGP (("[TAP] Miniport State: Shutdown\n"));
+
+    tapAdapterAcquireLock(adapter,FALSE);
+    adapter->Locked.AdapterState = MiniportShutdownState;
+    tapAdapterReleaseLock(adapter,FALSE);
+
+    //
+    // BUGBUG!!! FlushIrpQueues???
+    //
+
+    DEBUGP (("[TAP] <-- AdapterShutdownEx\n"));
+}
+
+
+// Free adapter context memory and associated resources.
+VOID
+tapAdapterContextFree(
+    __in PTAP_ADAPTER_CONTEXT     Adapter
+    )
+{
+    PLIST_ENTRY listEntry = &Adapter->AdapterListLink;
+
+    DEBUGP (("[TAP] --> tapAdapterContextFree\n"));
+
+    // Adapter context should already be removed.
+    ASSERT( (listEntry->Flink == listEntry) && (listEntry->Blink == listEntry ) );
+
+    // Insure that adapter context has been removed from global adapter list.
+    RemoveEntryList(&Adapter->AdapterListLink);
+
+    // Free the adapter lock.
+    NdisFreeSpinLock(&Adapter->AdapterLock);
+
+    // Free the ANSI NetCfgInstanceId buffer.
+    if(Adapter->NetCfgInstanceIdAnsi.Buffer != NULL)
+    {
+        RtlFreeAnsiString(&Adapter->NetCfgInstanceIdAnsi);
+    }
+
+    Adapter->NetCfgInstanceIdAnsi.Buffer = NULL;
+
+    // Free the receive NBL pool.
+    if(Adapter->ReceiveNblPool != NULL )
+    {
+        NdisFreeNetBufferListPool(Adapter->ReceiveNblPool);
+    }
+
+    Adapter->ReceiveNblPool = NULL;
+
+    NdisFreeMemory(Adapter,0,0);
+
+    DEBUGP (("[TAP] <-- tapAdapterContextFree\n"));
+}
+
+ULONG
+tapGetNetBufferFrameType(
+    __in PNET_BUFFER       NetBuffer
+    )
+/*++
+
+Routine Description:
+
+    Reads the network frame's destination address to determine the type
+    (broadcast, multicast, etc)
+
+    Runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+    NetBuffer             The NB to examine
+
+Return Value:
+
+    NDIS_PACKET_TYPE_BROADCAST
+    NDIS_PACKET_TYPE_MULTICAST
+    NDIS_PACKET_TYPE_DIRECTED
+
+--*/
+{
+    PETH_HEADER   ethernetHeader;
+
+    ethernetHeader = (PETH_HEADER )NdisGetDataBuffer(
+                        NetBuffer,
+                        sizeof(ETH_HEADER),
+                        NULL,
+                        1,
+                        0
+                        );
+
+    ASSERT(ethernetHeader);
+
+    if (ETH_IS_BROADCAST(ethernetHeader->dest))
+    {
+        return NDIS_PACKET_TYPE_BROADCAST;
+    }
+    else if(ETH_IS_MULTICAST(ethernetHeader->dest))
+    {
+        return NDIS_PACKET_TYPE_MULTICAST;
+    }
+    else
+    {
+        return NDIS_PACKET_TYPE_DIRECTED;
+    }
+}
+
+ULONG
+tapGetNetBufferCountsFromNetBufferList(
+    __in PNET_BUFFER_LIST   NetBufferList,
+    __inout_opt PULONG      TotalByteCount  // Of all linked NBs
+    )
+/*++
+
+Routine Description:
+
+    Returns the number of net buffers linked to the net buffer list.
+
+    Optionally returns the total byte count of all net buffers linked
+    to the net buffer list.
+
+    Runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+    NetBufferList         The NBL to examine
+
+Return Value:
+
+    The number of net buffers linked to the net buffer list.
+
+--*/
+{
+    ULONG       netBufferCount = 0;
+    PNET_BUFFER currentNb;
+
+    if(TotalByteCount)
+    {
+        *TotalByteCount = 0;
+    }
+
+    currentNb = NET_BUFFER_LIST_FIRST_NB(NetBufferList);
+
+    while(currentNb)
+    {
+        ++netBufferCount;
+
+        if(TotalByteCount)
+        {
+            *TotalByteCount += NET_BUFFER_DATA_LENGTH(currentNb);
+        }
+
+        // Move to next NB
+        currentNb = NET_BUFFER_NEXT_NB(currentNb);
+    }
+
+    return netBufferCount;
+}
+
+VOID
+tapAdapterAcquireLock(
+    __in    PTAP_ADAPTER_CONTEXT    Adapter,
+    __in    BOOLEAN                 DispatchLevel
+    )
+{
+    ASSERT(!DispatchLevel || (DISPATCH_LEVEL == KeGetCurrentIrql()));
+
+    if (DispatchLevel)
+    {
+        NdisDprAcquireSpinLock(&Adapter->AdapterLock);
+    }
+    else
+    {
+        NdisAcquireSpinLock(&Adapter->AdapterLock);
+    }
+}
+
+VOID
+tapAdapterReleaseLock(
+    __in    PTAP_ADAPTER_CONTEXT    Adapter,
+    __in    BOOLEAN                 DispatchLevel
+    )
+{
+    ASSERT(!DispatchLevel || (DISPATCH_LEVEL == KeGetCurrentIrql()));
+
+    if (DispatchLevel)
+    {
+        NdisDprReleaseSpinLock(&Adapter->AdapterLock);
+    }
+    else
+    {
+        NdisReleaseSpinLock(&Adapter->AdapterLock);
+    }
+}
+
+
diff --git a/installer/tap/src/src/adapter.h b/installer/tap/src/src/adapter.h
new file mode 100644
index 0000000..2f09d12
--- /dev/null
+++ b/installer/tap/src/src/adapter.h
@@ -0,0 +1,346 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef __TAP_ADAPTER_CONTEXT_H_
+#define __TAP_ADAPTER_CONTEXT_H_
+
+// Memory allocation tags.
+#define TAP_ADAPTER_TAG             ((ULONG)'ApaT')     // "TapA"
+#define TAP_RX_NBL_TAG              ((ULONG)'RpaT')     // "TapR"
+#define TAP_RX_INJECT_BUFFER_TAG    ((ULONG)'IpaT')     // "TapI"
+
+#define TAP_MAX_NDIS_NAME_LENGTH    64  // 38 character GUID string plus extra..
+
+// TAP receive indication NBL flag definitions.
+#define TAP_RX_NBL_FLAGS                    NBL_FLAGS_MINIPORT_RESERVED
+#define TAP_RX_NBL_FLAGS_CLEAR_ALL(_NBL)    ((_NBL)->Flags &= ~TAP_RX_NBL_FLAGS)
+#define TAP_RX_NBL_FLAG_SET(_NBL, _F)       ((_NBL)->Flags |= ((_F) & TAP_RX_NBL_FLAGS))
+#define TAP_RX_NBL_FLAG_CLEAR(_NBL, _F)     ((_NBL)->Flags &= ~((_F) & TAP_RX_NBL_FLAGS))
+#define TAP_RX_NBL_FLAG_TEST(_NBL, _F)      (((_NBL)->Flags & ((_F) & TAP_RX_NBL_FLAGS)) != 0)
+
+#define TAP_RX_NBL_FLAGS_IS_P2P             0x00001000
+#define TAP_RX_NBL_FLAGS_IS_INJECTED        0x00002000
+
+// MSDN Ref: http://msdn.microsoft.com/en-us/library/windows/hardware/ff560490(v=vs.85).aspx
+typedef
+enum _TAP_MINIPORT_ADAPTER_STATE
+{
+    // The Halted state is the initial state of all adapters. When an
+    // adapter is in the Halted state, NDIS can call the driver's
+    // MiniportInitializeEx function to initialize the adapter.
+    MiniportHaltedState,
+
+    // In the Shutdown state, a system shutdown and restart must occur
+    // before the system can use the adapter again.
+    MiniportShutdownState,
+
+    // In the Initializing state, a miniport driver completes any
+    // operations that are required to initialize an adapter.
+    MiniportInitializingState,
+
+    // Entering the Paused state...
+    MiniportPausingState,
+
+    // In the Paused state, the adapter does not indicate received
+    // network data or accept send requests.
+    MiniportPausedState,
+
+    // In the Running state, a miniport driver performs send and
+    // receive processing for an adapter.
+    MiniportRunning,
+
+    // In the Restarting state, a miniport driver completes any
+    // operations that are required to restart send and receive
+    // operations for an adapter.
+    MiniportRestartingState
+} TAP_MINIPORT_ADAPTER_STATE, *PTAP_MINIPORT_ADAPTER_STATE;
+
+//
+// Each adapter managed by this driver has a TapAdapter struct.
+// ------------------------------------------------------------
+// Since there is a one-to-one relationship between adapter instances
+// and device instances this structure is the device extension as well.
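+//
+// The context is reference counted: tapAdapterContextReference and
+// tapAdapterContextDereference (declared below) adjust RefCount, and
+// tapAdapterContextFree releases the context when the count reaches zero.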
+//
+typedef struct _TAP_ADAPTER_CONTEXT
+{
+    LIST_ENTRY              AdapterListLink;
+
+    volatile LONG           RefCount;
+
+    NDIS_HANDLE             MiniportAdapterHandle;
+
+    NDIS_SPIN_LOCK          AdapterLock;    // Lock for protection of state and outstanding sends and recvs
+
+    //
+    // All fields that are protected by the AdapterLock are included
+    // in the Locked structure to remind us to take the Lock
+    // before accessing them :)
+    //
+    struct
+    {
+        TAP_MINIPORT_ADAPTER_STATE  AdapterState;
+    } Locked;
+
+    BOOLEAN                 ResetInProgress;
+
+    //
+    // NetCfgInstanceId as UNICODE_STRING
+    // ----------------------------------
+    // This is a GUID string provided by NDIS that identifies the adapter instance.
+    // An example is:
+    //
+    //    NetCfgInstanceId={410EB49D-2381-4FE7-9B36-498E22619DF0}
+    //
+    // Other names are derived from NetCfgInstanceId. For example, MiniportName:
+    //
+    //    MiniportName=\DEVICE\{410EB49D-2381-4FE7-9B36-498E22619DF0}
+    //
+    NDIS_STRING             NetCfgInstanceId;
+    WCHAR                   NetCfgInstanceIdBuffer[TAP_MAX_NDIS_NAME_LENGTH];
+
+# define MINIPORT_INSTANCE_ID(a) ((a)->NetCfgInstanceIdAnsi.Buffer)
+    ANSI_STRING             NetCfgInstanceIdAnsi;   // Used occasionally
+
+    ULONG                   MtuSize;        // 1500 bytes (typical)
+
+    // TRUE if adapter should always be "connected" even when device node
+    // is not open by a userspace process.
+    //
+    // FALSE if connection state is application controlled.
+    BOOLEAN                 MediaStateAlwaysConnected;
+
+    // TRUE if device is "connected".
+    BOOLEAN                 LogicalMediaState;
+
+    NDIS_DEVICE_POWER_STATE CurrentPowerState;
+
+    BOOLEAN                 AllowNonAdmin;
+
+    MACADDR                 PermanentAddress;   // From registry, if available
+    MACADDR                 CurrentAddress;
+
+    // Device registration parameters from NdisRegisterDeviceEx.
+    NDIS_STRING             DeviceName;
+    WCHAR                   DeviceNameBuffer[TAP_MAX_NDIS_NAME_LENGTH];
+
+    NDIS_STRING             LinkName;
+    WCHAR                   LinkNameBuffer[TAP_MAX_NDIS_NAME_LENGTH];
+
+    NDIS_HANDLE             DeviceHandle;
+    PDEVICE_OBJECT          DeviceObject;
+    BOOLEAN                 TapDeviceCreated;   // WAS: m_TapIsRunning
+
+    PFILE_OBJECT            TapFileObject;      // Exclusive access
+    BOOLEAN                 TapFileIsOpen;      // WAS: m_TapOpens
+    LONG                    TapFileOpenCount;   // WAS: m_NumTapOpens
+
+    // Cancel-Safe read IRP queue.
+    TAP_IRP_CSQ             PendingReadIrpQueue;
+
+    // Queue containing TAP packets representing host send NBs. These are
+    // waiting to be read by user-mode application.
+    TAP_PACKET_QUEUE        SendPacketQueue;
+
+    // NBL pool for making TAP receive indications.
+    NDIS_HANDLE             ReceiveNblPool;
+
+    volatile LONG           ReceiveNblInFlightCount;
+#define TAP_WAIT_POLL_LOOP_TIMEOUT      3000    // 3 seconds
+    NDIS_EVENT              ReceiveNblInFlightCountZeroEvent;
+
+    // Info for point-to-point mode
+    BOOLEAN                 m_tun;
+    IPADDR                  m_localIP;
+    IPADDR                  m_remoteNetwork;
+    IPADDR                  m_remoteNetmask;
+    ETH_HEADER              m_TapToUser;
+    ETH_HEADER              m_UserToTap;
+    ETH_HEADER              m_UserToTap_IPv6;   // same as UserToTap but proto=ipv6
+
+    // Info for DHCP server masquerade
+    BOOLEAN                 m_dhcp_enabled;
+    IPADDR                  m_dhcp_addr;
+    ULONG                   m_dhcp_netmask;
+    IPADDR                  m_dhcp_server_ip;
+    BOOLEAN                 m_dhcp_server_arp;
+    MACADDR                 m_dhcp_server_mac;
+    ULONG                   m_dhcp_lease_time;
+    UCHAR                   m_dhcp_user_supplied_options_buffer[DHCP_USER_SUPPLIED_OPTIONS_BUFFER_SIZE];
+    ULONG                   m_dhcp_user_supplied_options_buffer_len;
+    BOOLEAN                 m_dhcp_received_discover;
+    ULONG                   m_dhcp_bad_requests;
+
+    // Multicast list. Fixed size.
+ ULONG ulMCListSize; + UCHAR MCList[TAP_MAX_MCAST_LIST][MACADDR_SIZE]; + + ULONG PacketFilter; + ULONG ulLookahead; + + // + // Statistics + // ------------------------------------------------------------------------- + // + + // Packet counts + ULONG64 FramesRxDirected; + ULONG64 FramesRxMulticast; + ULONG64 FramesRxBroadcast; + ULONG64 FramesTxDirected; + ULONG64 FramesTxMulticast; + ULONG64 FramesTxBroadcast; + + // Byte counts + ULONG64 BytesRxDirected; + ULONG64 BytesRxMulticast; + ULONG64 BytesRxBroadcast; + ULONG64 BytesTxDirected; + ULONG64 BytesTxMulticast; + ULONG64 BytesTxBroadcast; + + // Count of transmit errors + ULONG TxAbortExcessCollisions; + ULONG TxLateCollisions; + ULONG TxDmaUnderrun; + ULONG TxLostCRS; + ULONG TxOKButDeferred; + ULONG OneRetry; + ULONG MoreThanOneRetry; + ULONG TotalRetries; + ULONG TransmitFailuresOther; + + // Count of receive errors + ULONG RxCrcErrors; + ULONG RxAlignmentErrors; + ULONG RxResourceErrors; + ULONG RxDmaOverrunErrors; + ULONG RxCdtFrames; + ULONG RxRuntErrors; + +#if PACKET_TRUNCATION_CHECK + LONG m_RxTrunc, m_TxTrunc; +#endif + + BOOLEAN m_InterfaceIsRunning; + LONG m_Rx, m_RxErr; + NDIS_MEDIUM m_Medium; + + // Help to tear down the adapter by keeping + // some state information on allocated + // resources. + BOOLEAN m_CalledAdapterFreeResources; + BOOLEAN m_RegisteredAdapterShutdownHandler; + +} TAP_ADAPTER_CONTEXT, *PTAP_ADAPTER_CONTEXT; + +FORCEINLINE +LONG +tapAdapterContextReference( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LONG refCount = NdisInterlockedIncrement(&Adapter->RefCount); + + ASSERT(refCount>1); // Cannot dereference a zombie. + + return refCount; +} + +VOID +tapAdapterContextFree( + __in PTAP_ADAPTER_CONTEXT Adapter + ); + +FORCEINLINE +LONG +tapAdapterContextDereference( + IN PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LONG refCount = NdisInterlockedDecrement(&Adapter->RefCount); + ASSERT(refCount >= 0); + if (!refCount) + { + tapAdapterContextFree(Adapter); + } + + return refCount; +} + +VOID +tapAdapterAcquireLock( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in BOOLEAN DispatchLevel + ); + +VOID +tapAdapterReleaseLock( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in BOOLEAN DispatchLevel + ); + +// Returns with added reference on adapter context. 
+PTAP_ADAPTER_CONTEXT +tapAdapterContextFromDeviceObject( + __in PDEVICE_OBJECT DeviceObject + ); + +BOOLEAN +tapAdapterReadAndWriteReady( + __in PTAP_ADAPTER_CONTEXT Adapter + ); + +NDIS_STATUS +tapAdapterSendAndReceiveReady( + __in PTAP_ADAPTER_CONTEXT Adapter + ); + +ULONG +tapGetNetBufferFrameType( + __in PNET_BUFFER NetBuffer + ); + +ULONG +tapGetNetBufferCountsFromNetBufferList( + __in PNET_BUFFER_LIST NetBufferList, + __inout_opt PULONG TotalByteCount // Of all linked NBs + ); + +// Prototypes for standard NDIS miniport entry points +MINIPORT_SET_OPTIONS AdapterSetOptions; +MINIPORT_INITIALIZE AdapterCreate; +MINIPORT_HALT AdapterHalt; +MINIPORT_UNLOAD TapDriverUnload; +MINIPORT_PAUSE AdapterPause; +MINIPORT_RESTART AdapterRestart; +MINIPORT_OID_REQUEST AdapterOidRequest; +MINIPORT_SEND_NET_BUFFER_LISTS AdapterSendNetBufferLists; +MINIPORT_RETURN_NET_BUFFER_LISTS AdapterReturnNetBufferLists; +MINIPORT_CANCEL_SEND AdapterCancelSend; +MINIPORT_CHECK_FOR_HANG AdapterCheckForHangEx; +MINIPORT_RESET AdapterReset; +MINIPORT_DEVICE_PNP_EVENT_NOTIFY AdapterDevicePnpEventNotify; +MINIPORT_SHUTDOWN AdapterShutdownEx; +MINIPORT_CANCEL_OID_REQUEST AdapterCancelOidRequest; + +#endif // __TAP_ADAPTER_CONTEXT_H_ \ No newline at end of file diff --git a/installer/tap/src/src/config.h.in b/installer/tap/src/src/config.h.in new file mode 100644 index 0000000..322afa8 --- /dev/null +++ b/installer/tap/src/src/config.h.in @@ -0,0 +1,9 @@ +#define PRODUCT_NAME "@PRODUCT_NAME@" +#define PRODUCT_VERSION "@PRODUCT_VERSION@" +#define PRODUCT_VERSION_RESOURCE @PRODUCT_VERSION_RESOURCE@ +#define PRODUCT_TAP_WIN_COMPONENT_ID "@PRODUCT_TAP_WIN_COMPONENT_ID@" +#define PRODUCT_TAP_WIN_MAJOR @PRODUCT_TAP_WIN_MAJOR@ +#define PRODUCT_TAP_WIN_MINOR @PRODUCT_TAP_WIN_MINOR@ +#define PRODUCT_TAP_WIN_PROVIDER "@PRODUCT_TAP_WIN_PROVIDER@" +#define PRODUCT_TAP_WIN_DEVICE_DESCRIPTION "@PRODUCT_TAP_WIN_DEVICE_DESCRIPTION@" +#define PRODUCT_TAP_WIN_RELDATE "@PRODUCT_TAP_WIN_RELDATE@" diff --git a/installer/tap/src/src/constants.h b/installer/tap/src/src/constants.h new file mode 100644 index 0000000..31b2d54 --- /dev/null +++ b/installer/tap/src/src/constants.h @@ -0,0 +1,195 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//==================================================================== +// Product and Version public settings +//==================================================================== + +#define PRODUCT_STRING PRODUCT_TAP_DEVICE_DESCRIPTION + + +// +// Update the driver version number every time you release a new driver +// The high word is the major version. The low word is the minor version. +// Also make sure that VER_FILEVERSION specified in the .RC file also +// matches with the driver version because NDISTESTER checks for that. +// +#ifndef TAP_DRIVER_MAJOR_VERSION + +#define TAP_DRIVER_MAJOR_VERSION 0x04 +#define TAP_DRIVER_MINOR_VERSION 0x02 + +#endif + +#define TAP_DRIVER_VENDOR_VERSION ((TAP_DRIVER_MAJOR_VERSION << 16) | TAP_DRIVER_MINOR_VERSION) + +// +// Define the NDIS miniport interface version that this driver targets. +// +#if defined(NDIS60_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 0 +#elif defined(NDIS61_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 1 +#elif defined(NDIS620_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 20 +#elif defined(NDIS630_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 30 +#else +#define TAP_NDIS_MAJOR_VERSION 5 +#define TAP_NDIS_MINOR_VERSION 0 +#endif + +//=========================================================== +// Driver constants +//=========================================================== + +#define ETHERNET_HEADER_SIZE (sizeof (ETH_HEADER)) +#define ETHERNET_MTU 1500 +#define ETHERNET_PACKET_SIZE (ETHERNET_MTU + ETHERNET_HEADER_SIZE) +#define DEFAULT_PACKET_LOOKAHEAD (ETHERNET_PACKET_SIZE) +#define VLAN_TAG_SIZE 4 + +//=========================================================== +// Medium properties +//=========================================================== + +#define TAP_FRAME_HEADER_SIZE ETHERNET_HEADER_SIZE +#define TAP_FRAME_MAX_DATA_SIZE ETHERNET_MTU +#define TAP_MAX_FRAME_SIZE (TAP_FRAME_HEADER_SIZE + TAP_FRAME_MAX_DATA_SIZE) +#define TAP_MIN_FRAME_SIZE 60 + +#define TAP_MEDIUM_TYPE NdisMedium802_3 + +//=========================================================== +// Physical adapter properties +//=========================================================== + +// The bus that connects the adapter to the PC. +// (Example: PCI adapters should use NdisInterfacePci). +#define TAP_INTERFACE_TYPE NdisInterfaceInternal + +#define TAP_VENDOR_DESC PRODUCT_TAP_WIN_DEVICE_DESCRIPTION + +// Highest byte is the NIC byte plus three vendor bytes. This is normally +// obtained from the NIC. +#define TAP_VENDOR_ID 0x00FFFFFF + +// If you have physical hardware on 802.3, use NdisPhysicalMedium802_3. +#define TAP_PHYSICAL_MEDIUM NdisPhysicalMediumUnspecified + +// Claim to be 100mbps duplex +#define MEGABITS_PER_SECOND 1000000ULL +#define TAP_XMIT_SPEED (100ULL*MEGABITS_PER_SECOND) +#define TAP_RECV_SPEED (100ULL*MEGABITS_PER_SECOND) + +// Max number of multicast addresses supported in hardware +#define TAP_MAX_MCAST_LIST 32 + +#define TAP_MAX_LOOKAHEAD TAP_FRAME_MAX_DATA_SIZE +#define TAP_BUFFER_SIZE TAP_MAX_FRAME_SIZE + +// Set this value to TRUE if there is a physical adapter. 
+#define TAP_HAS_PHYSICAL_CONNECTOR FALSE +#define TAP_ACCESS_TYPE NET_IF_ACCESS_BROADCAST +#define TAP_DIRECTION_TYPE NET_IF_DIRECTION_SENDRECEIVE +#define TAP_CONNECTION_TYPE NET_IF_CONNECTION_DEDICATED + +// This value must match the *IfType in the driver .inf file +#define TAP_IFTYPE IF_TYPE_ETHERNET_CSMACD + +// +// This is a virtual device, so it can tolerate surprise removal and +// suspend. Ensure the correct flags are set for your hardware. +// +#define TAP_ADAPTER_ATTRIBUTES_FLAGS (\ + NDIS_MINIPORT_ATTRIBUTES_SURPRISE_REMOVE_OK | NDIS_MINIPORT_ATTRIBUTES_NDIS_WDM) + +#define TAP_SUPPORTED_FILTERS ( \ + NDIS_PACKET_TYPE_DIRECTED | \ + NDIS_PACKET_TYPE_MULTICAST | \ + NDIS_PACKET_TYPE_BROADCAST | \ + NDIS_PACKET_TYPE_ALL_LOCAL | \ + NDIS_PACKET_TYPE_PROMISCUOUS | \ + NDIS_PACKET_TYPE_ALL_MULTICAST) + +#define TAP_MAX_MCAST_LIST 32 // Max length of multicast address list + +// +// Specify a bitmask that defines optional properties of the NIC. +// This miniport indicates receive with NdisMIndicateReceiveNetBufferLists +// function. Such a driver should set this NDIS_MAC_OPTION_TRANSFERS_NOT_PEND +// flag. +// +// NDIS_MAC_OPTION_NO_LOOPBACK tells NDIS that NIC has no internal +// loopback support so NDIS will manage loopbacks on behalf of +// this driver. +// +// NDIS_MAC_OPTION_COPY_LOOKAHEAD_DATA tells the protocol that +// our receive buffer is not on a device-specific card. If +// NDIS_MAC_OPTION_COPY_LOOKAHEAD_DATA is not set, multi-buffer +// indications are copied to a single flat buffer. +// + +#define TAP_MAC_OPTIONS (\ + NDIS_MAC_OPTION_COPY_LOOKAHEAD_DATA | \ + NDIS_MAC_OPTION_TRANSFERS_NOT_PEND | \ + NDIS_MAC_OPTION_NO_LOOPBACK) + +#define TAP_ADAPTER_CHECK_FOR_HANG_TIME_IN_SECONDS 4 + + +// NDIS 6.x miniports must support all counters in OID_GEN_STATISTICS. +#define TAP_SUPPORTED_STATISTICS (\ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_FRAMES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_FRAMES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_FRAMES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_RCV_DISCARDS | \ + NDIS_STATISTICS_FLAGS_VALID_RCV_ERROR | \ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_FRAMES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_FRAMES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_FRAMES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_BYTES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_XMIT_ERROR | \ + NDIS_STATISTICS_FLAGS_VALID_XMIT_DISCARDS | \ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_BYTES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_BYTES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_BYTES_XMIT) + + +#define MINIMUM_MTU 576 // USE TCP Minimum MTU +#define MAXIMUM_MTU 65536 // IP maximum MTU + +#define PACKET_QUEUE_SIZE 64 // tap -> userspace queue size +#define IRP_QUEUE_SIZE 16 // max number of simultaneous i/o operations from userspace +#define INJECT_QUEUE_SIZE 16 // DHCP/ARP -> tap injection queue + +#define TAP_LITTLE_ENDIAN // affects ntohs, htonl, etc. functions diff --git a/installer/tap/src/src/device.c b/installer/tap/src/src/device.c new file mode 100644 index 0000000..2b7ba9b --- /dev/null +++ b/installer/tap/src/src/device.c @@ -0,0 +1,1169 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. 
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+//
+// Include files.
+//
+
+#include "tap.h"
+#include <wdmsec.h> // for SDDLs
+
+//======================================================================
+// TAP Win32 Device I/O Callbacks
+//======================================================================
+
+#ifdef ALLOC_PRAGMA
+#pragma alloc_text( PAGE, TapDeviceCreate)
+#pragma alloc_text( PAGE, TapDeviceControl)
+#pragma alloc_text( PAGE, TapDeviceCleanup)
+#pragma alloc_text( PAGE, TapDeviceClose)
+#endif // ALLOC_PRAGMA
+
+//===================================================================
+// Go back to default TAP mode from Point-To-Point mode.
+// Also reset (i.e. disable) DHCP Masq mode.
+//===================================================================
+VOID tapResetAdapterState(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    )
+{
+    // Point-To-Point
+    Adapter->m_tun = FALSE;
+    Adapter->m_localIP = 0;
+    Adapter->m_remoteNetwork = 0;
+    Adapter->m_remoteNetmask = 0;
+    NdisZeroMemory (&Adapter->m_TapToUser, sizeof (Adapter->m_TapToUser));
+    NdisZeroMemory (&Adapter->m_UserToTap, sizeof (Adapter->m_UserToTap));
+    NdisZeroMemory (&Adapter->m_UserToTap_IPv6, sizeof (Adapter->m_UserToTap_IPv6));
+
+    // DHCP Masq
+    Adapter->m_dhcp_enabled = FALSE;
+    Adapter->m_dhcp_server_arp = FALSE;
+    Adapter->m_dhcp_user_supplied_options_buffer_len = 0;
+    Adapter->m_dhcp_addr = 0;
+    Adapter->m_dhcp_netmask = 0;
+    Adapter->m_dhcp_server_ip = 0;
+    Adapter->m_dhcp_lease_time = 0;
+    Adapter->m_dhcp_received_discover = FALSE;
+    Adapter->m_dhcp_bad_requests = 0;
+    NdisZeroMemory (Adapter->m_dhcp_server_mac, MACADDR_SIZE);
+}
+
+// IRP_MJ_CREATE
+NTSTATUS
+TapDeviceCreate(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+/*++
+
+Routine Description:
+
+    This routine is called by the I/O system when the TAP device is opened.
+    It enforces exclusive access to the device, resets the per-open adapter
+    state, and completes the request.
+
+Arguments:
+
+    DeviceObject - a pointer to the object that represents the device
+    that I/O is to be done on.
+
+    Irp - a pointer to the I/O Request Packet for this request.
+
+Return Value:
+
+    NT status code
+
+--*/
+{
+    NDIS_STATUS             status;
+    PIO_STACK_LOCATION      irpSp;  // Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT    adapter = NULL;
+    PFILE_OBJECT            originalFileObject;
+
+    PAGED_CODE();
+
+    DEBUGP (("[TAP] --> TapDeviceCreate\n"));
+
+    irpSp = IoGetCurrentIrpStackLocation(Irp);
+
+    //
+    // Invalidate file context
+    //
+    irpSp->FileObject->FsContext = NULL;
+    irpSp->FileObject->FsContext2 = NULL;
+
+    //
+    // Find adapter context for this device.
+    // -------------------------------------
+    // Returns with added reference on adapter context.
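+    // The reference taken here is paired with a dereference on the
+    // failure path below; on a successful open it is held until the
+    // handle is finally closed and released in TapDeviceClose.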
+ // + adapter = tapAdapterContextFromDeviceObject(DeviceObject); + + // Insure that adapter exists. + ASSERT(adapter); + + if(adapter == NULL ) + { + DEBUGP (("[TAP] release [%d.%d] open request; adapter not found\n", + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION + )); + + Irp->IoStatus.Status = STATUS_DEVICE_DOES_NOT_EXIST; + Irp->IoStatus.Information = 0; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + return STATUS_DEVICE_DOES_NOT_EXIST; + } + + DEBUGP(("[%s] [TAP] release [%d.%d] open request (TapFileIsOpen=%d)\n", + MINIPORT_INSTANCE_ID(adapter), + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION, + adapter->TapFileIsOpen + )); + + // Enforce exclusive access + originalFileObject = InterlockedCompareExchangePointer( + &adapter->TapFileObject, + irpSp->FileObject, + NULL + ); + + if(originalFileObject == NULL) + { + irpSp->FileObject->FsContext = adapter; // Quick reference + + status = STATUS_SUCCESS; + } + else + { + status = STATUS_UNSUCCESSFUL; + } + + // Release the lock. + //tapAdapterReleaseLock(adapter,FALSE); + + if(status == STATUS_SUCCESS) + { + // Reset adapter state on successful open. + tapResetAdapterState(adapter); + + adapter->TapFileIsOpen = 1; // Legacy... + + // NOTE!!! Reference added by tapAdapterContextFromDeviceObject + // will be removed when file is closed. + } + else + { + DEBUGP (("[%s] TAP is presently unavailable (TapFileIsOpen=%d)\n", + MINIPORT_INSTANCE_ID(adapter), adapter->TapFileIsOpen + )); + + NOTE_ERROR(); + + // Remove reference added by tapAdapterContextFromDeviceObject. + tapAdapterContextDereference(adapter); + } + + // Complete the IRP. + Irp->IoStatus.Status = status; + Irp->IoStatus.Information = 0; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + DEBUGP (("[TAP] <-- TapDeviceCreate; status = %8.8X\n",status)); + + return status; +} + +//=================================================== +// Tell Windows whether the TAP device should be +// considered "connected" or "disconnected". +// +// Allows application control of media connect state. +//=================================================== +VOID +tapSetMediaConnectStatus( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in BOOLEAN LogicalMediaState + ) +{ + NDIS_STATUS_INDICATION statusIndication; + NDIS_LINK_STATE linkState; + + NdisZeroMemory(&statusIndication, sizeof(NDIS_STATUS_INDICATION)); + NdisZeroMemory(&linkState, sizeof(NDIS_LINK_STATE)); + + // + // Fill in object headers + // + statusIndication.Header.Type = NDIS_OBJECT_TYPE_STATUS_INDICATION; + statusIndication.Header.Revision = NDIS_STATUS_INDICATION_REVISION_1; + statusIndication.Header.Size = sizeof(NDIS_STATUS_INDICATION); + + linkState.Header.Revision = NDIS_LINK_STATE_REVISION_1; + linkState.Header.Type = NDIS_OBJECT_TYPE_DEFAULT; + linkState.Header.Size = sizeof(NDIS_LINK_STATE); + + // + // Link state buffer + // + if(Adapter->LogicalMediaState == TRUE) + { + linkState.MediaConnectState = MediaConnectStateConnected; + } + + linkState.MediaDuplexState = MediaDuplexStateFull; + linkState.RcvLinkSpeed = TAP_RECV_SPEED; + linkState.XmitLinkSpeed = TAP_XMIT_SPEED; + + // + // Fill in the status buffer + // + statusIndication.StatusCode = NDIS_STATUS_LINK_STATE; + statusIndication.SourceHandle = Adapter->MiniportAdapterHandle; + statusIndication.DestinationHandle = NULL; + statusIndication.RequestId = 0; + + statusIndication.StatusBuffer = &linkState; + statusIndication.StatusBufferSize = sizeof(NDIS_LINK_STATE); + + // Fill in new media connect state. 
+ if ( (Adapter->LogicalMediaState != LogicalMediaState) && !Adapter->MediaStateAlwaysConnected) + { + Adapter->LogicalMediaState = LogicalMediaState; + + if (LogicalMediaState == TRUE) + { + linkState.MediaConnectState = MediaConnectStateConnected; + + DEBUGP (("[TAP] Set MediaConnectState: Connected.\n")); + } + else + { + linkState.MediaConnectState = MediaConnectStateDisconnected; + + DEBUGP (("[TAP] Set MediaConnectState: Disconnected.\n")); + } + } + + // Make the status indication. + if(Adapter->Locked.AdapterState != MiniportHaltedState) + { + NdisMIndicateStatusEx(Adapter->MiniportAdapterHandle, &statusIndication); + } +} + +//====================================================== +// If DHCP mode is used together with tun +// mode, consider the fact that the P2P remote subnet +// might enclose the DHCP masq server address. +//====================================================== +VOID +CheckIfDhcpAndTunMode ( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + if (Adapter->m_tun && Adapter->m_dhcp_enabled) + { + if ((Adapter->m_dhcp_server_ip & Adapter->m_remoteNetmask) == Adapter->m_remoteNetwork) + { + ETH_COPY_NETWORK_ADDRESS (Adapter->m_dhcp_server_mac, Adapter->m_TapToUser.dest); + Adapter->m_dhcp_server_arp = FALSE; + } + } +} + +// IRP_MJ_DEVICE_CONTROL callback. +NTSTATUS +TapDeviceControl( + PDEVICE_OBJECT DeviceObject, + PIRP Irp + ) + +/*++ + +Routine Description: + + This routine is called by the I/O system to perform a device I/O + control function. + +Arguments: + + DeviceObject - a pointer to the object that represents the device + that I/O is to be done on. + + Irp - a pointer to the I/O Request Packet for this request. + +Return Value: + + NT status code + +--*/ + +{ + NTSTATUS ntStatus = STATUS_SUCCESS; // Assume success + PIO_STACK_LOCATION irpSp; // Pointer to current stack location + PTAP_ADAPTER_CONTEXT adapter = NULL; + ULONG inBufLength; // Input buffer length + ULONG outBufLength; // Output buffer length + PCHAR inBuf, outBuf; // pointer to Input and output buffer + PMDL mdl = NULL; + PCHAR buffer = NULL; + + PAGED_CODE(); + + irpSp = IoGetCurrentIrpStackLocation( Irp ); + + // + // Fetch adapter context for this device. + // -------------------------------------- + // Adapter pointer was stashed in FsContext when handle was opened. + // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + ASSERT(adapter); + + inBufLength = irpSp->Parameters.DeviceIoControl.InputBufferLength; + outBufLength = irpSp->Parameters.DeviceIoControl.OutputBufferLength; + + if (!inBufLength || !outBufLength) + { + ntStatus = STATUS_INVALID_PARAMETER; + goto End; + } + + // + // Determine which I/O control code was specified. 
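+    //
+    // (Note: these IOCTLs use buffered I/O, so input and output share the
+    // single Irp->AssociatedIrp.SystemBuffer, and Irp->IoStatus.Information
+    // tells the I/O manager how many bytes to copy back to user mode.
+    // Illustrative sketch of how a user-mode client reaches this switch;
+    // hTapDevice is a hypothetical handle opened on the TAP device:
+    //
+    //      ULONG mtu = 0;
+    //      DWORD bytesReturned = 0;
+    //      DeviceIoControl(hTapDevice, TAP_WIN_IOCTL_GET_MTU,
+    //                      &mtu, sizeof(mtu),      // input buffer
+    //                      &mtu, sizeof(mtu),      // output buffer
+    //                      &bytesReturned, NULL);
+    //
+    // Both buffer lengths must be nonzero, or the length check above fails
+    // the request with STATUS_INVALID_PARAMETER.)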
+ // + switch ( irpSp->Parameters.DeviceIoControl.IoControlCode ) + { + case TAP_WIN_IOCTL_GET_MAC: + { + if (outBufLength >= MACADDR_SIZE ) + { + ETH_COPY_NETWORK_ADDRESS( + Irp->AssociatedIrp.SystemBuffer, + adapter->CurrentAddress + ); + + Irp->IoStatus.Information = MACADDR_SIZE; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + break; + + case TAP_WIN_IOCTL_GET_VERSION: + { + const ULONG size = sizeof (ULONG) * 3; + + if (outBufLength >= size) + { + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[0] + = TAP_DRIVER_MAJOR_VERSION; + + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[1] + = TAP_DRIVER_MINOR_VERSION; + + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[2] +#if DBG + = 1; +#else + = 0; +#endif + Irp->IoStatus.Information = size; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + break; + + case TAP_WIN_IOCTL_GET_MTU: + { + const ULONG size = sizeof (ULONG) * 1; + + if (outBufLength >= size) + { + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[0] + = adapter->MtuSize; + + Irp->IoStatus.Information = size; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_TUN: + { + if(inBufLength >= sizeof(IPADDR)*3) + { + MACADDR dest; + + adapter->m_tun = FALSE; + + GenerateRelatedMAC (dest, adapter->CurrentAddress, 1); + + adapter->m_localIP = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[0]; + adapter->m_remoteNetwork = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[1]; + adapter->m_remoteNetmask = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[2]; + + // Sanity check on network/netmask + if ((adapter->m_remoteNetwork & adapter->m_remoteNetmask) != adapter->m_remoteNetwork) + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + break; + } + + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.src, adapter->CurrentAddress); + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.dest, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.src, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.dest, adapter->CurrentAddress); + + adapter->m_TapToUser.proto = adapter->m_UserToTap.proto = htons (NDIS_ETH_TYPE_IPV4); + adapter->m_UserToTap_IPv6 = adapter->m_UserToTap; + adapter->m_UserToTap_IPv6.proto = htons(NDIS_ETH_TYPE_IPV6); + + adapter->m_tun = TRUE; + + CheckIfDhcpAndTunMode (adapter); + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Set TUN mode.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT: + { + if(inBufLength >= sizeof(IPADDR)*2) + { + MACADDR dest; + + adapter->m_tun = FALSE; + + GenerateRelatedMAC (dest, adapter->CurrentAddress, 1); + + adapter->m_localIP = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[0]; + adapter->m_remoteNetwork = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[1]; + adapter->m_remoteNetmask = ~0; + + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.src, adapter->CurrentAddress); + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.dest, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.src, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.dest, adapter->CurrentAddress); + + adapter->m_TapToUser.proto = adapter->m_UserToTap.proto = htons (NDIS_ETH_TYPE_IPV4); + adapter->m_UserToTap_IPv6 = adapter->m_UserToTap; + adapter->m_UserToTap_IPv6.proto = htons(NDIS_ETH_TYPE_IPV6); + + adapter->m_tun = TRUE; + + 
CheckIfDhcpAndTunMode (adapter); + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Set P2P mode.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_DHCP_MASQ: + { + if(inBufLength >= sizeof(IPADDR)*4) + { + adapter->m_dhcp_enabled = FALSE; + adapter->m_dhcp_server_arp = FALSE; + adapter->m_dhcp_user_supplied_options_buffer_len = 0; + + // Adapter IP addr / netmask + adapter->m_dhcp_addr = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[0]; + adapter->m_dhcp_netmask = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[1]; + + // IP addr of DHCP masq server + adapter->m_dhcp_server_ip = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[2]; + + // Lease time in seconds + adapter->m_dhcp_lease_time = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[3]; + + GenerateRelatedMAC( + adapter->m_dhcp_server_mac, + adapter->CurrentAddress, + 2 + ); + + adapter->m_dhcp_enabled = TRUE; + adapter->m_dhcp_server_arp = TRUE; + + CheckIfDhcpAndTunMode (adapter); + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Configured DHCP MASQ.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_DHCP_SET_OPT: + { + if (inBufLength <= DHCP_USER_SUPPLIED_OPTIONS_BUFFER_SIZE + && adapter->m_dhcp_enabled) + { + adapter->m_dhcp_user_supplied_options_buffer_len = 0; + + NdisMoveMemory( + adapter->m_dhcp_user_supplied_options_buffer, + Irp->AssociatedIrp.SystemBuffer, + inBufLength + ); + + adapter->m_dhcp_user_supplied_options_buffer_len = + inBufLength; + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Set DHCP OPT.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_GET_INFO: + { + char state[16]; + + // Fetch adapter (miniport) state. + if (tapAdapterSendAndReceiveReady(adapter) == NDIS_STATUS_SUCCESS) + state[0] = 'A'; + else + state[0] = 'a'; + + if (tapAdapterReadAndWriteReady(adapter)) + state[1] = 'T'; + else + state[1] = 't'; + + state[2] = '0' + adapter->CurrentPowerState; + + if (adapter->MediaStateAlwaysConnected) + state[3] = 'C'; + else + state[3] = 'c'; + + state[4] = '\0'; + + // BUGBUG!!! What follows, and is not yet implemented, is a real mess. + // BUGBUG!!! Tied closely to the NDIS 5 implementation. Need to map + // as much as possible to the NDIS 6 implementation. + Irp->IoStatus.Status = ntStatus = RtlStringCchPrintfExA ( + ((LPTSTR) (Irp->AssociatedIrp.SystemBuffer)), + outBufLength, + NULL, + NULL, + STRSAFE_FILL_BEHIND_NULL | STRSAFE_IGNORE_NULLS, +#if PACKET_TRUNCATION_CHECK + "State=%s Err=[%s/%d] #O=%d Tx=[%d,%d,%d] Rx=[%d,%d,%d] IrpQ=[%d,%d,%d] PktQ=[%d,%d,%d] InjQ=[%d,%d,%d]", +#else + "State=%s Err=[%s/%d] #O=%d Tx=[%d,%d] Rx=[%d,%d] IrpQ=[%d,%d,%d] PktQ=[%d,%d,%d] InjQ=[%d,%d,%d]", +#endif + state, + g_LastErrorFilename, + g_LastErrorLineNumber, + (int)adapter->TapFileOpenCount, + (int)(adapter->FramesTxDirected + adapter->FramesTxMulticast + adapter->FramesTxBroadcast), + (int)adapter->TransmitFailuresOther, +#if PACKET_TRUNCATION_CHECK + (int)adapter->m_TxTrunc, +#endif + (int)adapter->m_Rx, + (int)adapter->m_RxErr, +#if PACKET_TRUNCATION_CHECK + (int)adapter->m_RxTrunc, +#endif + (int)adapter->PendingReadIrpQueue.Count, + (int)adapter->PendingReadIrpQueue.MaxCount, + (int)IRP_QUEUE_SIZE, // Ignored in NDIS 6 driver... 
+ + (int)adapter->SendPacketQueue.Count, + (int)adapter->SendPacketQueue.MaxCount, + (int)PACKET_QUEUE_SIZE, + + (int)0, // adapter->InjectPacketQueue.Count - Unused + (int)0, // adapter->InjectPacketQueue.MaxCount - Unused + (int)INJECT_QUEUE_SIZE + ); + + Irp->IoStatus.Information = outBufLength; + + // BUGBUG!!! Fail because this is not completely implemented. + ntStatus = STATUS_INVALID_DEVICE_REQUEST; + } + break; + +#if DBG + case TAP_WIN_IOCTL_GET_LOG_LINE: + { + if (GetDebugLine( (LPTSTR)Irp->AssociatedIrp.SystemBuffer,outBufLength)) + { + Irp->IoStatus.Status = ntStatus = STATUS_SUCCESS; + } + else + { + Irp->IoStatus.Status = ntStatus = STATUS_UNSUCCESSFUL; + } + + Irp->IoStatus.Information = outBufLength; + + break; + } +#endif + + case TAP_WIN_IOCTL_SET_MEDIA_STATUS: + { + if(inBufLength >= sizeof(ULONG)) + { + ULONG parm = ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[0]; + tapSetMediaConnectStatus (adapter, (BOOLEAN) parm); + Irp->IoStatus.Information = 1; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + default: + + // + // The specified I/O control code is unrecognized by this driver. + // + ntStatus = STATUS_INVALID_DEVICE_REQUEST; + break; + } + +End: + + // + // Finish the I/O operation by simply completing the packet and returning + // the same status as in the packet itself. + // + Irp->IoStatus.Status = ntStatus; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + return ntStatus; +} + +// Flush the pending read IRP queue. +VOID +tapFlushIrpQueues( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + + DEBUGP (("[TAP] tapFlushIrpQueues: Flushing %d pending read IRPs\n", + Adapter->PendingReadIrpQueue.Count)); + + tapIrpCsqFlush(&Adapter->PendingReadIrpQueue); +} + +// IRP_MJ_CLEANUP +NTSTATUS +TapDeviceCleanup( + PDEVICE_OBJECT DeviceObject, + PIRP Irp + ) +/*++ + +Routine Description: + + Receipt of this request indicates that the last handle for a file + object that is associated with the target device object has been closed + (but, due to outstanding I/O requests, might not have been released). + + A driver that holds pending IRPs internally must implement a routine for + IRP_MJ_CLEANUP. When the routine is called, the driver should cancel all + the pending IRPs that belong to the file object identified by the IRP_MJ_CLEANUP + call. + + In other words, it should cancel all the IRPs that have the same file-object + pointer as the one supplied in the current I/O stack location of the IRP for the + IRP_MJ_CLEANUP call. Of course, IRPs belonging to other file objects should + not be canceled. Also, if an outstanding IRP is completed immediately, the + driver does not have to cancel it. + +Arguments: + + DeviceObject - a pointer to the object that represents the device + to be cleaned up. + + Irp - a pointer to the I/O Request Packet for this request. + +Return Value: + + NT status code + +--*/ + +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; // Always succeed. + PIO_STACK_LOCATION irpSp; // Pointer to current stack location + PTAP_ADAPTER_CONTEXT adapter = NULL; + + PAGED_CODE(); + + DEBUGP (("[TAP] --> TapDeviceCleanup\n")); + + irpSp = IoGetCurrentIrpStackLocation(Irp); + + // + // Fetch adapter context for this device. + // -------------------------------------- + // Adapter pointer was stashed in FsContext when handle was opened. + // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + // Insure that adapter exists. 
+    ASSERT(adapter);
+
+    if(adapter == NULL )
+    {
+        DEBUGP (("[TAP] release [%d.%d] cleanup request; adapter not found\n",
+            TAP_DRIVER_MAJOR_VERSION,
+            TAP_DRIVER_MINOR_VERSION
+            ));
+    }
+
+    if(adapter != NULL )
+    {
+        adapter->TapFileIsOpen = 0;    // Legacy...
+
+        // Disconnect from media.
+        tapSetMediaConnectStatus(adapter,FALSE);
+
+        // Reset adapter state when cleaning up.
+        tapResetAdapterState(adapter);
+
+        // BUGBUG!!! Use RemoveLock???
+
+        //
+        // Flush pending send TAP packet queue.
+        //
+        tapFlushSendPacketQueue(adapter);
+
+        ASSERT(adapter->SendPacketQueue.Count == 0);
+
+        //
+        // Flush the pending IRP queues
+        //
+        tapFlushIrpQueues(adapter);
+
+        ASSERT(adapter->PendingReadIrpQueue.Count == 0);
+    }
+
+    // Complete the IRP.
+    Irp->IoStatus.Status = status;
+    Irp->IoStatus.Information = 0;
+
+    IoCompleteRequest( Irp, IO_NO_INCREMENT );
+
+    DEBUGP (("[TAP] <-- TapDeviceCleanup; status = %8.8X\n",status));
+
+    return status;
+}
+
+// IRP_MJ_CLOSE
+NTSTATUS
+TapDeviceClose(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+/*++
+
+Routine Description:
+
+    Receipt of this request indicates that the last handle of the file
+    object that is associated with the target device object has been closed
+    and released.
+
+    All outstanding I/O requests have been completed or canceled.
+
+Arguments:
+
+    DeviceObject - a pointer to the object that represents the device
+    to be closed.
+
+    Irp - a pointer to the I/O Request Packet for this request.
+
+Return Value:
+
+    NT status code
+
+--*/
+
+{
+    NDIS_STATUS             status = NDIS_STATUS_SUCCESS; // Always succeed.
+    PIO_STACK_LOCATION      irpSp;  // Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT    adapter = NULL;
+
+    PAGED_CODE();
+
+    DEBUGP (("[TAP] --> TapDeviceClose\n"));
+
+    irpSp = IoGetCurrentIrpStackLocation(Irp);
+
+    //
+    // Fetch adapter context for this device.
+    // --------------------------------------
+    // Adapter pointer was stashed in FsContext when handle was opened.
+    //
+    adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext;
+
+    // Insure that adapter exists.
+    ASSERT(adapter);
+
+    if(adapter == NULL )
+    {
+        DEBUGP (("[TAP] release [%d.%d] close request; adapter not found\n",
+            TAP_DRIVER_MAJOR_VERSION,
+            TAP_DRIVER_MINOR_VERSION
+            ));
+    }
+
+    if(adapter != NULL )
+    {
+        if(adapter->TapFileObject == NULL)
+        {
+            // Should never happen!!!
+            ASSERT(FALSE);
+        }
+        else
+        {
+            ASSERT(irpSp->FileObject->FsContext == adapter);
+
+            ASSERT(adapter->TapFileObject == irpSp->FileObject);
+        }
+
+        adapter->TapFileObject = NULL;
+        irpSp->FileObject = NULL;
+
+        // Remove the reference added when the handle was opened.
+        tapAdapterContextDereference(adapter);
+    }
+
+    // Complete the IRP.
+ Irp->IoStatus.Status = status; + Irp->IoStatus.Information = 0; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + DEBUGP (("[TAP] <-- TapDeviceClose; status = %8.8X\n",status)); + + return status; +} + +NTSTATUS +tapConcatenateNdisStrings( + __inout PNDIS_STRING DestinationString, + __in_opt PNDIS_STRING SourceString1, + __in_opt PNDIS_STRING SourceString2, + __in_opt PNDIS_STRING SourceString3 + ) +{ + NTSTATUS status; + + ASSERT(SourceString1 && SourceString2 && SourceString3); + + status = RtlAppendUnicodeStringToString( + DestinationString, + SourceString1 + ); + + if(status == STATUS_SUCCESS) + { + status = RtlAppendUnicodeStringToString( + DestinationString, + SourceString2 + ); + + if(status == STATUS_SUCCESS) + { + status = RtlAppendUnicodeStringToString( + DestinationString, + SourceString3 + ); + } + } + + return status; +} + +NTSTATUS +tapMakeDeviceNames( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + NDIS_STATUS status; + NDIS_STRING deviceNamePrefix = NDIS_STRING_CONST("\\Device\\"); + NDIS_STRING tapNameSuffix = NDIS_STRING_CONST(".tap"); + + // Generate DeviceName from NetCfgInstanceId. + Adapter->DeviceName.Buffer = Adapter->DeviceNameBuffer; + Adapter->DeviceName.MaximumLength = sizeof(Adapter->DeviceNameBuffer); + + status = tapConcatenateNdisStrings( + &Adapter->DeviceName, + &deviceNamePrefix, + &Adapter->NetCfgInstanceId, + &tapNameSuffix + ); + + if(status == STATUS_SUCCESS) + { + NDIS_STRING linkNamePrefix = NDIS_STRING_CONST("\\DosDevices\\Global\\"); + + Adapter->LinkName.Buffer = Adapter->LinkNameBuffer; + Adapter->LinkName.MaximumLength = sizeof(Adapter->LinkNameBuffer); + + status = tapConcatenateNdisStrings( + &Adapter->LinkName, + &linkNamePrefix, + &Adapter->NetCfgInstanceId, + &tapNameSuffix + ); + } + + return status; +} + +NDIS_STATUS +CreateTapDevice( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + NDIS_STATUS status; + NDIS_DEVICE_OBJECT_ATTRIBUTES deviceAttribute; + PDRIVER_DISPATCH dispatchTable[IRP_MJ_MAXIMUM_FUNCTION+1]; + + DEBUGP (("[TAP] version [%d.%d] creating tap device: %wZ\n", + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION, + &Adapter->NetCfgInstanceId)); + + // Generate DeviceName and LinkName from NetCfgInstanceId. + status = tapMakeDeviceNames(Adapter); + + if (NT_SUCCESS(status)) + { + DEBUGP (("[TAP] DeviceName: %wZ\n",&Adapter->DeviceName)); + DEBUGP (("[TAP] LinkName: %wZ\n",&Adapter->LinkName)); + + // Initialize dispatch table. 
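+        // Only the six IRP major functions that make up the TAP user-mode
+        // interface (create/cleanup/close, read/write, and device control)
+        // receive handlers; every other entry is left NULL by the
+        // NdisZeroMemory call below.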
+        NdisZeroMemory(dispatchTable, (IRP_MJ_MAXIMUM_FUNCTION+1) * sizeof(PDRIVER_DISPATCH));
+
+        dispatchTable[IRP_MJ_CREATE] = TapDeviceCreate;
+        dispatchTable[IRP_MJ_CLEANUP] = TapDeviceCleanup;
+        dispatchTable[IRP_MJ_CLOSE] = TapDeviceClose;
+        dispatchTable[IRP_MJ_READ] = TapDeviceRead;
+        dispatchTable[IRP_MJ_WRITE] = TapDeviceWrite;
+        dispatchTable[IRP_MJ_DEVICE_CONTROL] = TapDeviceControl;
+
+        //
+        // Create a device object and register dispatch handlers
+        //
+        NdisZeroMemory(&deviceAttribute, sizeof(NDIS_DEVICE_OBJECT_ATTRIBUTES));
+
+        deviceAttribute.Header.Type = NDIS_OBJECT_TYPE_DEVICE_OBJECT_ATTRIBUTES;
+        deviceAttribute.Header.Revision = NDIS_DEVICE_OBJECT_ATTRIBUTES_REVISION_1;
+        deviceAttribute.Header.Size = sizeof(NDIS_DEVICE_OBJECT_ATTRIBUTES);
+
+        deviceAttribute.DeviceName = &Adapter->DeviceName;
+        deviceAttribute.SymbolicName = &Adapter->LinkName;
+        deviceAttribute.MajorFunctions = &dispatchTable[0];
+        //deviceAttribute.ExtensionSize = sizeof(FILTER_DEVICE_EXTENSION);
+
+#if ENABLE_NONADMIN
+        if(Adapter->AllowNonAdmin)
+        {
+            //
+            // SDDL_DEVOBJ_SYS_ALL_ADM_RWX_WORLD_RWX_RES_RWX allows the kernel and
+            // system complete control over the device. By default the admin can
+            // access the entire device, but cannot change the ACL (the admin must
+            // take control of the device first).
+            //
+            // Everyone else, including "restricted" or "untrusted" code, can read
+            // or write to the device. Traversal beneath the device is also granted
+            // (removing it would only affect storage devices, except if the
+            // "bypass-traversal" privilege was revoked).
+            //
+            deviceAttribute.DefaultSDDLString = &SDDL_DEVOBJ_SYS_ALL_ADM_RWX_WORLD_RWX_RES_RWX;
+        }
+#endif
+
+        status = NdisRegisterDeviceEx(
+            Adapter->MiniportAdapterHandle,
+            &deviceAttribute,
+            &Adapter->DeviceObject,
+            &Adapter->DeviceHandle
+            );
+    }
+
+    ASSERT(NT_SUCCESS(status));
+
+    if (NT_SUCCESS(status))
+    {
+        // Set TAP device flags.
+        (Adapter->DeviceObject)->Flags &= ~DO_BUFFERED_IO;
+        (Adapter->DeviceObject)->Flags |= DO_DIRECT_IO;
+
+        //========================
+        // Finalize initialization
+        //========================
+
+        Adapter->TapDeviceCreated = TRUE;
+
+        DEBUGP (("[%wZ] successfully created TAP device [%wZ]\n",
+            &Adapter->NetCfgInstanceId,
+            &Adapter->DeviceName
+            ));
+    }
+
+    DEBUGP (("[TAP] <-- CreateTapDevice; status = %8.8X\n",status));
+
+    return status;
+}
+
+//
+// DestroyTapDevice is called from AdapterHalt while the NDIS miniport
+// is in the Halted state. Before entering the Halted state the
+// miniport passes through the Pausing and Paused states; those states
+// are responsible for waiting until NDIS network operations have
+// completed.
+//
+VOID
+DestroyTapDevice(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    )
+{
+    DEBUGP (("[TAP] --> DestroyTapDevice; Adapter: %wZ\n",
+        &Adapter->NetCfgInstanceId));
+
+    //
+    // Let clients know we are shutting down
+    //
+    Adapter->TapDeviceCreated = FALSE;
+
+    //
+    // Flush pending send TAP packet queue.
+    //
+    tapFlushSendPacketQueue(Adapter);
+
+    ASSERT(Adapter->SendPacketQueue.Count == 0);
+
+    //
+    // Flush IRP queues. Wait for pending I/O. Etc.
+    // --------------------------------------------
+    // Exhaust IRP and packet queues. Any pending IRPs will
+    // be cancelled, causing user-space to get this error
+    // on overlapped reads:
+    //
+    //     ERROR_OPERATION_ABORTED, code=995
+    //
+    //     "The I/O operation has been aborted because of either a
+    //      thread exit or an application request."
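+    //
+    // Illustrative sketch only (not part of this driver): a user-mode client
+    // doing overlapped reads could detect the abort like this. The handle
+    // name hTap and the buffer size are assumptions.
+    //
+    //     BYTE buf[1600];
+    //     OVERLAPPED ov = {0};
+    //     ov.hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
+    //     if (!ReadFile(hTap, buf, sizeof(buf), NULL, &ov)
+    //         && GetLastError() != ERROR_IO_PENDING)
+    //     { /* hard failure */ }
+    //     DWORD bytes;
+    //     if (!GetOverlappedResult(hTap, &ov, &bytes, TRUE)
+    //         && GetLastError() == ERROR_OPERATION_ABORTED)
+    //     {
+    //         CloseHandle(hTap);   // let the driver finish tearing down
+    //     }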
+    //
+    // It's important that user-space close the device handle
+    // when this error code is returned, so that when we finally
+    // call NdisDeregisterDeviceEx, the device reference count
+    // is 0. Otherwise the driver will not unload even if the
+    // last adapter has been halted.
+    //
+    // The act of flushing the queues at this point should result in the
+    // user-mode application closing the adapter's device handle. Closing
+    // the handle results in the TapDeviceCleanup call being made, followed
+    // by a call to the TapDeviceClose callback.
+    //
+    tapFlushIrpQueues(Adapter);
+
+    ASSERT(Adapter->PendingReadIrpQueue.Count == 0);
+
+    //
+    // Deregister the Win32 device.
+    // ----------------------------
+    // When a driver calls NdisDeregisterDeviceEx, the I/O manager deletes the
+    // target device object if there are no outstanding references to it. However,
+    // if any outstanding references remain, the I/O manager marks the device
+    // object as "delete pending" and deletes the device object when the references
+    // are finally released.
+    //
+    if(Adapter->DeviceHandle)
+    {
+        DEBUGP (("[TAP] Calling NdisDeregisterDeviceEx\n"));
+        NdisDeregisterDeviceEx(Adapter->DeviceHandle);
+    }
+
+    Adapter->DeviceHandle = NULL;
+
+    DEBUGP (("[TAP] <-- DestroyTapDevice\n"));
+}
+
diff --git a/installer/tap/src/src/device.h b/installer/tap/src/src/device.h
new file mode 100644
index 0000000..93dae0d
--- /dev/null
+++ b/installer/tap/src/src/device.h
@@ -0,0 +1,50 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __TAP_DEVICE_H_ +#define __TAP_DEVICE_H_ + +//====================================================================== +// TAP Prototypes for standard Win32 device I/O entry points +//====================================================================== + +__drv_dispatchType(IRP_MJ_CREATE) +DRIVER_DISPATCH TapDeviceCreate; + +__drv_dispatchType(IRP_MJ_READ) +DRIVER_DISPATCH TapDeviceRead; + +__drv_dispatchType(IRP_MJ_WRITE) +DRIVER_DISPATCH TapDeviceWrite; + +__drv_dispatchType(IRP_MJ_DEVICE_CONTROL) +DRIVER_DISPATCH TapDeviceControl; + +__drv_dispatchType(IRP_MJ_CLEANUP) +DRIVER_DISPATCH TapDeviceCleanup; + +__drv_dispatchType(IRP_MJ_CLOSE) +DRIVER_DISPATCH TapDeviceClose; + +#endif // __TAP_DEVICE_H_ \ No newline at end of file diff --git a/installer/tap/src/src/dhcp.c b/installer/tap/src/src/dhcp.c new file mode 100644 index 0000000..30b22f4 --- /dev/null +++ b/installer/tap/src/src/dhcp.c @@ -0,0 +1,710 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include "tap.h" + +//========================= +// Code to set DHCP options +//========================= + +VOID +SetDHCPOpt( + __in DHCPMsg *m, + __in void *data, + __in unsigned int len + ) +{ + if (!m->overflow) + { + if (m->optlen + len <= DHCP_OPTIONS_BUFFER_SIZE) + { + if (len) + { + NdisMoveMemory (m->msg.options + m->optlen, data, len); + m->optlen += len; + } + } + else + { + m->overflow = TRUE; + } + } +} + +VOID +SetDHCPOpt0( + __in DHCPMsg *msg, + __in int type + ) +{ + DHCPOPT0 opt; + opt.type = (UCHAR) type; + SetDHCPOpt (msg, &opt, sizeof (opt)); +} + +VOID +SetDHCPOpt8( + __in DHCPMsg *msg, + __in int type, + __in ULONG data + ) +{ + DHCPOPT8 opt; + opt.type = (UCHAR) type; + opt.len = sizeof (opt.data); + opt.data = (UCHAR) data; + SetDHCPOpt (msg, &opt, sizeof (opt)); +} + +VOID +SetDHCPOpt32( + __in DHCPMsg *msg, + __in int type, + __in ULONG data + ) +{ + DHCPOPT32 opt; + opt.type = (UCHAR) type; + opt.len = sizeof (opt.data); + opt.data = data; + SetDHCPOpt (msg, &opt, sizeof (opt)); +} + +//============== +// Checksum code +//============== + +USHORT +ip_checksum( + __in const UCHAR *buf, + __in const int len_ip_header + ) +{ + USHORT word16; + ULONG sum = 0; + int i; + + // make 16 bit words out of every two adjacent 8 bit words in the packet + // and add them up + for (i = 0; i < len_ip_header - 1; i += 2) + { + word16 = ((buf[i] << 8) & 0xFF00) + (buf[i+1] & 0xFF); + sum += (ULONG) word16; + } + + // take only 16 bits out of the 32 bit sum and add up the carries + while (sum >> 16) + { + sum = (sum & 0xFFFF) + (sum >> 16); + } + + // one's complement the result + return ((USHORT) ~sum); +} + +USHORT +udp_checksum ( + __in const UCHAR *buf, + __in const int len_udp, + __in const UCHAR *src_addr, + __in const UCHAR *dest_addr + ) +{ + USHORT word16; + ULONG sum = 0; + int i; + + // make 16 bit words out of every two adjacent 8 bit words and + // calculate the sum of all 16 bit words + for (i = 0; i < len_udp; i += 2) + { + word16 = ((buf[i] << 8) & 0xFF00) + ((i + 1 < len_udp) ? 
(buf[i+1] & 0xFF) : 0); + sum += word16; + } + + // add the UDP pseudo header which contains the IP source and destination addresses + for (i = 0; i < 4; i += 2) + { + word16 =((src_addr[i] << 8) & 0xFF00) + (src_addr[i+1] & 0xFF); + sum += word16; + } + + for (i = 0; i < 4; i += 2) + { + word16 =((dest_addr[i] << 8) & 0xFF00) + (dest_addr[i+1] & 0xFF); + sum += word16; + } + + // the protocol number and the length of the UDP packet + sum += (USHORT) IPPROTO_UDP + (USHORT) len_udp; + + // keep only the last 16 bits of the 32 bit calculated sum and add the carries + while (sum >> 16) + { + sum = (sum & 0xFFFF) + (sum >> 16); + } + + // Take the one's complement of sum + return ((USHORT) ~sum); +} + +//================================ +// Set IP and UDP packet checksums +//================================ + +VOID +SetChecksumDHCPMsg( + __in DHCPMsg *m + ) +{ + // Set IP checksum + m->msg.pre.ip.check = htons (ip_checksum ((UCHAR *) &m->msg.pre.ip, sizeof (IPHDR))); + + // Set UDP Checksum + m->msg.pre.udp.check = htons (udp_checksum ((UCHAR *) &m->msg.pre.udp, + sizeof (UDPHDR) + sizeof (DHCP) + m->optlen, + (UCHAR *)&m->msg.pre.ip.saddr, + (UCHAR *)&m->msg.pre.ip.daddr)); +} + +//=================== +// DHCP message tests +//=================== + +int +GetDHCPMessageType( + __in const DHCP *dhcp, + __in const int optlen + ) +{ + const UCHAR *p = (UCHAR *) (dhcp + 1); + int i; + + for (i = 0; i < optlen; ++i) + { + const UCHAR type = p[i]; + const int room = optlen - i - 1; + + if (type == DHCP_END) // didn't find what we were looking for + return -1; + else if (type == DHCP_PAD) // no-operation + ; + else if (type == DHCP_MSG_TYPE) // what we are looking for + { + if (room >= 2) + { + if (p[i+1] == 1) // message length should be 1 + return p[i+2]; // return message type + } + return -1; + } + else // some other message + { + if (room >= 1) + { + const int len = p[i+1]; // get message length + i += (len + 1); // advance to next message + } + } + } + return -1; +} + +BOOLEAN +DHCPMessageOurs ( + __in const PTAP_ADAPTER_CONTEXT Adapter, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp + ) +{ + // Must be UDPv4 protocol + if (!(eth->proto == htons (NDIS_ETH_TYPE_IPV4) && ip->protocol == IPPROTO_UDP)) + { + return FALSE; + } + + // Source MAC must be our adapter + if (!MAC_EQUAL (eth->src, Adapter->CurrentAddress)) + { + return FALSE; + } + + // Dest MAC must be either broadcast or our virtual DHCP server + if (!(ETH_IS_BROADCAST(eth->dest) + || MAC_EQUAL (eth->dest, Adapter->m_dhcp_server_mac))) + { + return FALSE; + } + + // Port numbers must be correct + if (!(udp->dest == htons (BOOTPS_PORT) + && udp->source == htons (BOOTPC_PORT))) + { + return FALSE; + } + + // Hardware address must be MAC addr sized + if (!(dhcp->hlen == sizeof (MACADDR))) + { + return FALSE; + } + + // Hardware address must match our adapter + if (!MAC_EQUAL (eth->src, dhcp->chaddr)) + { + return FALSE; + } + + return TRUE; +} + + +//===================================================== +// Build all of DHCP packet except for DHCP options. +// Assume that *p has been zeroed before we are called. +//===================================================== + +VOID +BuildDHCPPre ( + __in const PTAP_ADAPTER_CONTEXT Adapter, + __inout DHCPPre *p, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp, + __in const int optlen, + __in const int type) +{ + // Should we broadcast or direct to a specific MAC / IP address? 
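+    // DHCPNAK replies are always broadcast: the client being refused may not
+    // yet hold a usable IP configuration, so a unicast reply could not reach
+    // it. Other replies are unicast to the requesting MAC unless the request
+    // itself arrived as an Ethernet broadcast.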
+ const BOOLEAN broadcast = (type == DHCPNAK + || ETH_IS_BROADCAST(eth->dest)); + + // + // Build ethernet header + // + ETH_COPY_NETWORK_ADDRESS (p->eth.src, Adapter->m_dhcp_server_mac); + + if (broadcast) + { + memset(p->eth.dest,0xFF,ETH_LENGTH_OF_ADDRESS); + } + else + { + ETH_COPY_NETWORK_ADDRESS (p->eth.dest, eth->src); + } + + p->eth.proto = htons (NDIS_ETH_TYPE_IPV4); + + // + // Build IP header + // + p->ip.version_len = (4 << 4) | (sizeof (IPHDR) >> 2); + p->ip.tos = 0; + p->ip.tot_len = htons (sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + optlen); + p->ip.id = 0; + p->ip.frag_off = 0; + p->ip.ttl = 16; + p->ip.protocol = IPPROTO_UDP; + p->ip.check = 0; + p->ip.saddr = Adapter->m_dhcp_server_ip; + + if (broadcast) + { + p->ip.daddr = ~0; + } + else + { + p->ip.daddr = Adapter->m_dhcp_addr; + } + + // + // Build UDP header + // + p->udp.source = htons (BOOTPS_PORT); + p->udp.dest = htons (BOOTPC_PORT); + p->udp.len = htons (sizeof (UDPHDR) + sizeof (DHCP) + optlen); + p->udp.check = 0; + + // Build DHCP response + + p->dhcp.op = BOOTREPLY; + p->dhcp.htype = 1; + p->dhcp.hlen = sizeof (MACADDR); + p->dhcp.hops = 0; + p->dhcp.xid = dhcp->xid; + p->dhcp.secs = 0; + p->dhcp.flags = 0; + p->dhcp.ciaddr = 0; + + if (type == DHCPNAK) + { + p->dhcp.yiaddr = 0; + } + else + { + p->dhcp.yiaddr = Adapter->m_dhcp_addr; + } + + p->dhcp.siaddr = Adapter->m_dhcp_server_ip; + p->dhcp.giaddr = 0; + ETH_COPY_NETWORK_ADDRESS (p->dhcp.chaddr, eth->src); + p->dhcp.magic = htonl (0x63825363); +} + +//============================= +// Build specific DHCP messages +//============================= + +VOID +SendDHCPMsg( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in const int type, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp + ) +{ + DHCPMsg *pkt; + + if (!(type == DHCPOFFER || type == DHCPACK || type == DHCPNAK)) + { + DEBUGP (("[TAP] SendDHCPMsg: Bad DHCP type: %d\n", type)); + return; + } + + pkt = (DHCPMsg *) MemAlloc (sizeof (DHCPMsg), TRUE); + + if(pkt) + { + //----------------------- + // Build DHCP options + //----------------------- + + // Message Type + SetDHCPOpt8 (pkt, DHCP_MSG_TYPE, type); + + // Server ID + SetDHCPOpt32 (pkt, DHCP_SERVER_ID, Adapter->m_dhcp_server_ip); + + if (type == DHCPOFFER || type == DHCPACK) + { + // Lease Time + SetDHCPOpt32 (pkt, DHCP_LEASE_TIME, htonl (Adapter->m_dhcp_lease_time)); + + // Netmask + SetDHCPOpt32 (pkt, DHCP_NETMASK, Adapter->m_dhcp_netmask); + + // Other user-defined options + SetDHCPOpt ( + pkt, + Adapter->m_dhcp_user_supplied_options_buffer, + Adapter->m_dhcp_user_supplied_options_buffer_len); + } + + // End + SetDHCPOpt0 (pkt, DHCP_END); + + if (!DHCPMSG_OVERFLOW (pkt)) + { + // The initial part of the DHCP message (not including options) gets built here + BuildDHCPPre ( + Adapter, + &pkt->msg.pre, + eth, + ip, + udp, + dhcp, + DHCPMSG_LEN_OPT (pkt), + type); + + SetChecksumDHCPMsg (pkt); + + DUMP_PACKET ("DHCPMsg", + DHCPMSG_BUF (pkt), + DHCPMSG_LEN_FULL (pkt)); + + // Return DHCP response to kernel + IndicateReceivePacket( + Adapter, + DHCPMSG_BUF (pkt), + DHCPMSG_LEN_FULL (pkt) + ); + } + else + { + DEBUGP (("[TAP] SendDHCPMsg: DHCP buffer overflow\n")); + } + + MemFree (pkt, sizeof (DHCPMsg)); + } +} + +//=================================================================== +// Handle a BOOTPS packet produced by the local system to +// resolve the address/netmask of this adapter. +// If we are in TAP_WIN_IOCTL_CONFIG_DHCP_MASQ mode, reply +// to the message. 
Return TRUE if we processed the passed +// message, so that downstream stages can ignore it. +//=================================================================== + +BOOLEAN +ProcessDHCP( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp, + __in int optlen + ) +{ + int msg_type; + + // Sanity check IP header + if (!(ntohs (ip->tot_len) == sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + optlen + && (ntohs (ip->frag_off) & IP_OFFMASK) == 0)) + { + return TRUE; + } + + // Does this message belong to us? + if (!DHCPMessageOurs (Adapter, eth, ip, udp, dhcp)) + { + return FALSE; + } + + msg_type = GetDHCPMessageType (dhcp, optlen); + + // Drop non-BOOTREQUEST messages + if (dhcp->op != BOOTREQUEST) + { + return TRUE; + } + + // Drop any messages except DHCPDISCOVER or DHCPREQUEST + if (!(msg_type == DHCPDISCOVER || msg_type == DHCPREQUEST)) + { + return TRUE; + } + + // Should we reply with DHCPOFFER, DHCPACK, or DHCPNAK? + if (msg_type == DHCPREQUEST + && ((dhcp->ciaddr && dhcp->ciaddr != Adapter->m_dhcp_addr) + || !Adapter->m_dhcp_received_discover + || Adapter->m_dhcp_bad_requests >= BAD_DHCPREQUEST_NAK_THRESHOLD)) + { + SendDHCPMsg( + Adapter, + DHCPNAK, + eth, ip, udp, dhcp + ); + } + else + { + SendDHCPMsg( + Adapter, + (msg_type == DHCPDISCOVER ? DHCPOFFER : DHCPACK), + eth, ip, udp, dhcp + ); + } + + // Remember if we received a DHCPDISCOVER + if (msg_type == DHCPDISCOVER) + { + Adapter->m_dhcp_received_discover = TRUE; + } + + // Is this a bad DHCPREQUEST? + if (msg_type == DHCPREQUEST && dhcp->ciaddr && dhcp->ciaddr != Adapter->m_dhcp_addr) + { + ++Adapter->m_dhcp_bad_requests; + } + + return TRUE; +} + +#if DBG + +const char * + message_op_text (int op) +{ + switch (op) + { + case BOOTREQUEST: + return "BOOTREQUEST"; + + case BOOTREPLY: + return "BOOTREPLY"; + + default: + return "???"; + } +} + +const char * + message_type_text (int type) +{ + switch (type) + { + case DHCPDISCOVER: + return "DHCPDISCOVER"; + + case DHCPOFFER: + return "DHCPOFFER"; + + case DHCPREQUEST: + return "DHCPREQUEST"; + + case DHCPDECLINE: + return "DHCPDECLINE"; + + case DHCPACK: + return "DHCPACK"; + + case DHCPNAK: + return "DHCPNAK"; + + case DHCPRELEASE: + return "DHCPRELEASE"; + + case DHCPINFORM: + return "DHCPINFORM"; + + default: + return "???"; + } +} + +const char * +port_name (int port) +{ + switch (port) + { + case BOOTPS_PORT: + return "BOOTPS"; + + case BOOTPC_PORT: + return "BOOTPC"; + + default: + return "unknown"; + } +} + +VOID +DumpDHCP ( + const ETH_HEADER *eth, + const IPHDR *ip, + const UDPHDR *udp, + const DHCP *dhcp, + const int optlen + ) +{ + DEBUGP ((" %s", message_op_text (dhcp->op))); + DEBUGP ((" %s ", message_type_text (GetDHCPMessageType (dhcp, optlen)))); + PrIP (ip->saddr); + DEBUGP ((":%s[", port_name (ntohs (udp->source)))); + PrMac (eth->src); + DEBUGP (("] -> ")); + PrIP (ip->daddr); + DEBUGP ((":%s[", port_name (ntohs (udp->dest)))); + PrMac (eth->dest); + DEBUGP (("]")); + if (dhcp->ciaddr) + { + DEBUGP ((" ci=")); + PrIP (dhcp->ciaddr); + } + if (dhcp->yiaddr) + { + DEBUGP ((" yi=")); + PrIP (dhcp->yiaddr); + } + if (dhcp->siaddr) + { + DEBUGP ((" si=")); + PrIP (dhcp->siaddr); + } + if (dhcp->hlen == sizeof (MACADDR)) + { + DEBUGP ((" ch=")); + PrMac (dhcp->chaddr); + } + + DEBUGP ((" xid=0x%08x", ntohl (dhcp->xid))); + + if (ntohl (dhcp->magic) != 0x63825363) + DEBUGP ((" ma=0x%08x", ntohl (dhcp->magic))); + if (dhcp->htype != 1) + DEBUGP ((" htype=%d", dhcp->htype)); + if 
(dhcp->hops) + DEBUGP ((" hops=%d", dhcp->hops)); + if (ntohs (dhcp->secs)) + DEBUGP ((" secs=%d", ntohs (dhcp->secs))); + if (ntohs (dhcp->flags)) + DEBUGP ((" flags=0x%04x", ntohs (dhcp->flags))); + + // extra stuff + + if (ip->version_len != 0x45) + DEBUGP ((" vl=0x%02x", ip->version_len)); + if (ntohs (ip->tot_len) != sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + optlen) + DEBUGP ((" tl=%d", ntohs (ip->tot_len))); + if (ntohs (udp->len) != sizeof (UDPHDR) + sizeof (DHCP) + optlen) + DEBUGP ((" ul=%d", ntohs (udp->len))); + + if (ip->tos) + DEBUGP ((" tos=0x%02x", ip->tos)); + if (ntohs (ip->id)) + DEBUGP ((" id=0x%04x", ntohs (ip->id))); + if (ntohs (ip->frag_off)) + DEBUGP ((" frag_off=0x%04x", ntohs (ip->frag_off))); + + DEBUGP ((" ttl=%d", ip->ttl)); + DEBUGP ((" ic=0x%04x [0x%04x]", ntohs (ip->check), + ip_checksum ((UCHAR*)ip, sizeof (IPHDR)))); + DEBUGP ((" uc=0x%04x [0x%04x/%d]", ntohs (udp->check), + udp_checksum ((UCHAR *) udp, + sizeof (UDPHDR) + sizeof (DHCP) + optlen, + (UCHAR *) &ip->saddr, + (UCHAR *) &ip->daddr), + optlen)); + + // Options + { + const UCHAR *opt = (UCHAR *) (dhcp + 1); + int i; + + DEBUGP ((" OPT")); + for (i = 0; i < optlen; ++i) + { + const UCHAR data = opt[i]; + DEBUGP ((".%d", data)); + } + } +} + +#endif /* DBG */ diff --git a/installer/tap/src/src/dhcp.h b/installer/tap/src/src/dhcp.h new file mode 100644 index 0000000..b594a5e --- /dev/null +++ b/installer/tap/src/src/dhcp.h @@ -0,0 +1,165 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#pragma once + +#pragma pack(1) + +//=================================================== +// How many bad DHCPREQUESTs do we receive before we +// return a NAK? +// +// A bad DHCPREQUEST is defined to be one where the +// requestor doesn't know its IP address. +//=================================================== + +#define BAD_DHCPREQUEST_NAK_THRESHOLD 3 + +//============================================== +// Maximum number of DHCP options bytes supplied +//============================================== + +#define DHCP_USER_SUPPLIED_OPTIONS_BUFFER_SIZE 256 +#define DHCP_OPTIONS_BUFFER_SIZE 256 + +//=================================== +// UDP port numbers of DHCP messages. +//=================================== + +#define BOOTPS_PORT 67 +#define BOOTPC_PORT 68 + +//=========================== +// The DHCP message structure +//=========================== + +typedef struct { +# define BOOTREQUEST 1 +# define BOOTREPLY 2 + UCHAR op; /* message op */ + + UCHAR htype; /* hardware address type (e.g. 
'1' = 10Mb Ethernet) */
+  UCHAR hlen;          /* hardware address length (e.g. '6' for 10Mb Ethernet) */
+  UCHAR hops;          /* client sets to 0, may be used by relay agents */
+  ULONG xid;           /* transaction ID, chosen by client */
+  USHORT secs;         /* seconds since request process began, set by client */
+  USHORT flags;
+  ULONG ciaddr;        /* client IP address, client sets if known */
+  ULONG yiaddr;        /* 'your' IP address -- server's response to client */
+  ULONG siaddr;        /* server IP address */
+  ULONG giaddr;        /* relay agent IP address */
+  UCHAR chaddr[16];    /* client hardware address */
+  UCHAR sname[64];     /* optional server host name */
+  UCHAR file[128];     /* boot file name */
+  ULONG magic;         /* must be 0x63825363 (network order) */
+} DHCP;
+
+typedef struct {
+  ETH_HEADER eth;
+  IPHDR ip;
+  UDPHDR udp;
+  DHCP dhcp;
+} DHCPPre;
+
+typedef struct {
+  DHCPPre pre;
+  UCHAR options[DHCP_OPTIONS_BUFFER_SIZE];
+} DHCPFull;
+
+typedef struct {
+  unsigned int optlen;
+  BOOLEAN overflow;
+  DHCPFull msg;
+} DHCPMsg;
+
+//===================
+// Macros for DHCPMSG
+//===================
+
+#define DHCPMSG_LEN_BASE(p) (sizeof (DHCPPre))
+#define DHCPMSG_LEN_OPT(p)  ((p)->optlen)
+#define DHCPMSG_LEN_FULL(p) (DHCPMSG_LEN_BASE(p) + DHCPMSG_LEN_OPT(p))
+#define DHCPMSG_BUF(p)      ((UCHAR*) &(p)->msg)
+#define DHCPMSG_OVERFLOW(p) ((p)->overflow)
+
+//========================================
+// structs to hold individual DHCP options
+//========================================
+
+typedef struct {
+  UCHAR type;
+} DHCPOPT0;
+
+typedef struct {
+  UCHAR type;
+  UCHAR len;
+  UCHAR data;
+} DHCPOPT8;
+
+typedef struct {
+  UCHAR type;
+  UCHAR len;
+  ULONG data;
+} DHCPOPT32;
+
+#pragma pack()
+
+//==================
+// DHCP Option types
+//==================
+
+#define DHCP_MSG_TYPE    53  /* message type (u8) */
+#define DHCP_PARM_REQ    55  /* parameter request list: c1 (u8), ... */
+#define DHCP_CLIENT_ID   61  /* client ID: type (u8), i1 (u8), ... */
+#define DHCP_IP          50  /* requested IP addr (u32) */
+#define DHCP_NETMASK      1  /* subnet mask (u32) */
+#define DHCP_LEASE_TIME  51  /* lease time sec (u32) */
+#define DHCP_RENEW_TIME  58  /* renewal time sec (u32) */
+#define DHCP_REBIND_TIME 59  /* rebind time sec (u32) */
+#define DHCP_SERVER_ID   54  /* server ID: IP addr (u32) */
+#define DHCP_PAD          0
+#define DHCP_END        255
+
+//===================
+// DHCP Message types
+//===================
+
+#define DHCPDISCOVER 1
+#define DHCPOFFER    2
+#define DHCPREQUEST  3
+#define DHCPDECLINE  4
+#define DHCPACK      5
+#define DHCPNAK      6
+#define DHCPRELEASE  7
+#define DHCPINFORM   8
+
+#if DBG
+
+VOID
+DumpDHCP (const ETH_HEADER *eth,
+          const IPHDR *ip,
+          const UDPHDR *udp,
+          const DHCP *dhcp,
+          const int optlen);
+
+#endif
diff --git a/installer/tap/src/src/endian.h b/installer/tap/src/src/endian.h
new file mode 100644
index 0000000..b7d3449
--- /dev/null
+++ b/installer/tap/src/src/endian.h
@@ -0,0 +1,35 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifdef TAP_LITTLE_ENDIAN +#define ntohs(x) RtlUshortByteSwap(x) +#define htons(x) RtlUshortByteSwap(x) +#define ntohl(x) RtlUlongByteSwap(x) +#define htonl(x) RtlUlongByteSwap(x) +#else +#define ntohs(x) ((USHORT)(x)) +#define htons(x) ((USHORT)(x)) +#define ntohl(x) ((ULONG)(x)) +#define htonl(x) ((ULONG)(x)) +#endif diff --git a/installer/tap/src/src/error.c b/installer/tap/src/src/error.c new file mode 100644 index 0000000..1fad1d3 --- /dev/null +++ b/installer/tap/src/src/error.c @@ -0,0 +1,398 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include "tap.h" + +//----------------- +// DEBUGGING OUTPUT +//----------------- + +const char *g_LastErrorFilename; +int g_LastErrorLineNumber; + +#if DBG + +DebugOutput g_Debug; + +BOOLEAN +NewlineExists (const char *str, int len) +{ + while (len-- > 0) + { + const char c = *str++; + if (c == '\n') + return TRUE; + else if (c == '\0') + break; + } + return FALSE; +} + +VOID +MyDebugInit (unsigned int bufsiz) +{ + NdisZeroMemory (&g_Debug, sizeof (g_Debug)); + g_Debug.text = (char *) MemAlloc (bufsiz, FALSE); + + if (g_Debug.text) + { + g_Debug.capacity = bufsiz; + } +} + +VOID +MyDebugFree () +{ + if (g_Debug.text) + { + MemFree (g_Debug.text, g_Debug.capacity); + } + + NdisZeroMemory (&g_Debug, sizeof (g_Debug)); +} + +VOID +MyDebugPrint (const unsigned char* format, ...) 
+{ + if (g_Debug.text && g_Debug.capacity > 0 && CAN_WE_PRINT) + { + BOOLEAN owned; + ACQUIRE_MUTEX_ADAPTIVE (&g_Debug.lock, owned); + if (owned) + { + const int remaining = (int)g_Debug.capacity - (int)g_Debug.out; + + if (remaining > 0) + { + va_list args; + NTSTATUS status; + char *end; + +#ifdef DBG_PRINT + va_start (args, format); + vDbgPrintEx (DPFLTR_IHVNETWORK_ID, DPFLTR_INFO_LEVEL, format, args); + va_end (args); +#endif + va_start (args, format); + status = RtlStringCchVPrintfExA (g_Debug.text + g_Debug.out, + remaining, + &end, + NULL, + STRSAFE_NO_TRUNCATION | STRSAFE_IGNORE_NULLS, + format, + args); + va_end (args); + va_start (args, format); + vDbgPrintEx(DPFLTR_IHVDRIVER_ID , 1, format, args); + va_end (args); + if (status == STATUS_SUCCESS) + g_Debug.out = (unsigned int) (end - g_Debug.text); + else + g_Debug.error = TRUE; + } + else + g_Debug.error = TRUE; + + RELEASE_MUTEX (&g_Debug.lock); + } + else + g_Debug.error = TRUE; + } +} + +BOOLEAN +GetDebugLine ( + __in char *buf, + __in const int len + ) +{ + static const char *truncated = "[OUTPUT TRUNCATED]\n"; + BOOLEAN ret = FALSE; + + NdisZeroMemory (buf, len); + + if (g_Debug.text && g_Debug.capacity > 0) + { + BOOLEAN owned; + ACQUIRE_MUTEX_ADAPTIVE (&g_Debug.lock, owned); + if (owned) + { + int i = 0; + + if (g_Debug.error || NewlineExists (g_Debug.text + g_Debug.in, (int)g_Debug.out - (int)g_Debug.in)) + { + while (i < (len - 1) && g_Debug.in < g_Debug.out) + { + const char c = g_Debug.text[g_Debug.in++]; + if (c == '\n') + break; + buf[i++] = c; + } + if (i < len) + buf[i] = '\0'; + } + + if (!i) + { + if (g_Debug.in == g_Debug.out) + { + g_Debug.in = g_Debug.out = 0; + if (g_Debug.error) + { + const unsigned int tlen = strlen (truncated); + if (tlen < g_Debug.capacity) + { + NdisMoveMemory (g_Debug.text, truncated, tlen+1); + g_Debug.out = tlen; + } + g_Debug.error = FALSE; + } + } + } + else + ret = TRUE; + + RELEASE_MUTEX (&g_Debug.lock); + } + } + return ret; +} + +VOID +PrMac (const MACADDR mac) +{ + DEBUGP (("%x:%x:%x:%x:%x:%x", + mac[0], mac[1], mac[2], + mac[3], mac[4], mac[5])); +} + +VOID +PrIP (IPADDR ip_addr) +{ + const unsigned char *ip = (const unsigned char *) &ip_addr; + + DEBUGP (("%d.%d.%d.%d", + ip[0], ip[1], ip[2], ip[3])); +} + +const char * +PrIPProto (int proto) +{ + switch (proto) + { + case IPPROTO_UDP: + return "UDP"; + + case IPPROTO_TCP: + return "TCP"; + + case IPPROTO_ICMP: + return "ICMP"; + + case IPPROTO_IGMP: + return "IGMP"; + + default: + return "???"; + } +} + +VOID +DumpARP (const char *prefix, const ARP_PACKET *arp) +{ + DEBUGP (("%s ARP src=", prefix)); + PrMac (arp->m_MAC_Source); + DEBUGP ((" dest=")); + PrMac (arp->m_MAC_Destination); + DEBUGP ((" OP=0x%04x", + (int)ntohs(arp->m_ARP_Operation))); + DEBUGP ((" M=0x%04x(%d)", + (int)ntohs(arp->m_MAC_AddressType), + (int)arp->m_MAC_AddressSize)); + DEBUGP ((" P=0x%04x(%d)", + (int)ntohs(arp->m_PROTO_AddressType), + (int)arp->m_PROTO_AddressSize)); + + DEBUGP ((" MacSrc=")); + PrMac (arp->m_ARP_MAC_Source); + DEBUGP ((" MacDest=")); + PrMac (arp->m_ARP_MAC_Destination); + + DEBUGP ((" IPSrc=")); + PrIP (arp->m_ARP_IP_Source); + DEBUGP ((" IPDest=")); + PrIP (arp->m_ARP_IP_Destination); + + DEBUGP (("\n")); +} + +struct ethpayload +{ + ETH_HEADER eth; + UCHAR payload[DEFAULT_PACKET_LOOKAHEAD]; +}; + +#ifdef ALLOW_PACKET_DUMP + +VOID +DumpPacket2( + __in const char *prefix, + __in const ETH_HEADER *eth, + __in const unsigned char *data, + __in unsigned int len + ) +{ + struct ethpayload *ep = (struct ethpayload *) MemAlloc (sizeof 
(struct ethpayload), TRUE); + if (ep) + { + if (len > DEFAULT_PACKET_LOOKAHEAD) + len = DEFAULT_PACKET_LOOKAHEAD; + ep->eth = *eth; + NdisMoveMemory (ep->payload, data, len); + DumpPacket (prefix, (unsigned char *) ep, sizeof (ETH_HEADER) + len); + MemFree (ep, sizeof (struct ethpayload)); + } +} + +VOID +DumpPacket( + __in const char *prefix, + __in const unsigned char *data, + __in unsigned int len + ) +{ + const ETH_HEADER *eth = (const ETH_HEADER *) data; + const IPHDR *ip = (const IPHDR *) (data + sizeof (ETH_HEADER)); + + if (len < sizeof (ETH_HEADER)) + { + DEBUGP (("%s TRUNCATED PACKET LEN=%d\n", prefix, len)); + return; + } + + // ARP Packet? + if (len >= sizeof (ARP_PACKET) && eth->proto == htons (ETH_P_ARP)) + { + DumpARP (prefix, (const ARP_PACKET *) data); + return; + } + + // IPv4 packet? + if (len >= (sizeof (IPHDR) + sizeof (ETH_HEADER)) + && eth->proto == htons (ETH_P_IP) + && IPH_GET_VER (ip->version_len) == 4) + { + const int hlen = IPH_GET_LEN (ip->version_len); + const int blen = len - sizeof (ETH_HEADER); + BOOLEAN did = FALSE; + + DEBUGP (("%s IPv4 %s[%d]", prefix, PrIPProto (ip->protocol), len)); + + if (!(ntohs (ip->tot_len) == blen && hlen <= blen)) + { + DEBUGP ((" XXX")); + return; + } + + // TCP packet? + if (ip->protocol == IPPROTO_TCP + && blen - hlen >= (sizeof (TCPHDR))) + { + const TCPHDR *tcp = (TCPHDR *) (data + sizeof (ETH_HEADER) + hlen); + DEBUGP ((" ")); + PrIP (ip->saddr); + DEBUGP ((":%d", ntohs (tcp->source))); + DEBUGP ((" -> ")); + PrIP (ip->daddr); + DEBUGP ((":%d", ntohs (tcp->dest))); + did = TRUE; + } + + // UDP packet? + else if ((ntohs (ip->frag_off) & IP_OFFMASK) == 0 + && ip->protocol == IPPROTO_UDP + && blen - hlen >= (sizeof (UDPHDR))) + { + const UDPHDR *udp = (UDPHDR *) (data + sizeof (ETH_HEADER) + hlen); + + // DHCP packet? + if ((udp->dest == htons (BOOTPC_PORT) || udp->dest == htons (BOOTPS_PORT)) + && blen - hlen >= (sizeof (UDPHDR) + sizeof (DHCP))) + { + const DHCP *dhcp = (DHCP *) (data + + hlen + + sizeof (ETH_HEADER) + + sizeof (UDPHDR)); + + int optlen = len + - sizeof (ETH_HEADER) + - hlen + - sizeof (UDPHDR) + - sizeof (DHCP); + + if (optlen < 0) + optlen = 0; + + DumpDHCP (eth, ip, udp, dhcp, optlen); + did = TRUE; + } + + if (!did) + { + DEBUGP ((" ")); + PrIP (ip->saddr); + DEBUGP ((":%d", ntohs (udp->source))); + DEBUGP ((" -> ")); + PrIP (ip->daddr); + DEBUGP ((":%d", ntohs (udp->dest))); + did = TRUE; + } + } + + if (!did) + { + DEBUGP ((" ipproto=%d ", ip->protocol)); + PrIP (ip->saddr); + DEBUGP ((" -> ")); + PrIP (ip->daddr); + } + + DEBUGP (("\n")); + return; + } + + { + DEBUGP (("%s ??? src=", prefix)); + PrMac (eth->src); + DEBUGP ((" dest=")); + PrMac (eth->dest); + DEBUGP ((" proto=0x%04x len=%d\n", + (int) ntohs(eth->proto), + len)); + } +} + +#endif // ALLOW_PACKET_DUMP + +#endif diff --git a/installer/tap/src/src/error.h b/installer/tap/src/src/error.h new file mode 100644 index 0000000..2ba39cc --- /dev/null +++ b/installer/tap/src/src/error.h @@ -0,0 +1,114 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. 
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+//-----------------
+// DEBUGGING OUTPUT
+//-----------------
+
+extern const char *g_LastErrorFilename;
+extern int g_LastErrorLineNumber;
+
+// Debug info output
+#define ALSO_DBGPRINT        1
+#define DEBUGP_AT_DISPATCH   1
+
+// Uncomment line below to allow packet dumps
+//#define ALLOW_PACKET_DUMP  1
+
+#define NOTE_ERROR() \
+{ \
+  g_LastErrorFilename = __FILE__; \
+  g_LastErrorLineNumber = __LINE__; \
+}
+
+#if DBG
+
+typedef struct
+{
+  unsigned int in;
+  unsigned int out;
+  unsigned int capacity;
+  char *text;
+  BOOLEAN error;
+  MUTEX lock;
+} DebugOutput;
+
+VOID MyDebugPrint (const unsigned char* format, ...);
+
+VOID PrMac (const MACADDR mac);
+
+VOID PrIP (IPADDR ip_addr);
+
+#ifdef ALLOW_PACKET_DUMP
+
+VOID
+DumpPacket(
+    __in const char *prefix,
+    __in const unsigned char *data,
+    __in unsigned int len
+    );
+
+VOID
+DumpPacket2(
+    __in const char *prefix,
+    __in const ETH_HEADER *eth,
+    __in const unsigned char *data,
+    __in unsigned int len
+    );
+
+#else
+#define DUMP_PACKET(prefix, data, len)
+#define DUMP_PACKET2(prefix, eth, data, len)
+#endif
+
+#define CAN_WE_PRINT (DEBUGP_AT_DISPATCH || KeGetCurrentIrql () < DISPATCH_LEVEL)
+
+#if ALSO_DBGPRINT
+#define DEBUGP(fmt) { MyDebugPrint fmt; if (CAN_WE_PRINT) DbgPrint fmt; }
+#else
+#define DEBUGP(fmt) { MyDebugPrint fmt; }
+#endif
+
+#ifdef ALLOW_PACKET_DUMP
+
+#define DUMP_PACKET(prefix, data, len) \
+    DumpPacket (prefix, data, len)
+
+#define DUMP_PACKET2(prefix, eth, data, len) \
+    DumpPacket2 (prefix, eth, data, len)
+
+#endif
+
+BOOLEAN
+GetDebugLine (
+    __in char *buf,
+    __in const int len
+    );
+
+#else
+
+#define DEBUGP(fmt)
+#define DUMP_PACKET(prefix, data, len)
+#define DUMP_PACKET2(prefix, eth, data, len)
+
+#endif
diff --git a/installer/tap/src/src/hexdump.h b/installer/tap/src/src/hexdump.h
new file mode 100644
index 0000000..d6275c1
--- /dev/null
+++ b/installer/tap/src/src/hexdump.h
@@ -0,0 +1,63 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef HEXDUMP_DEFINED +#define HEXDUMP_DEFINED + +#ifdef __cplusplus +extern "C" { +#endif + +//===================================================================================== +// Debug Routines +//===================================================================================== + +#ifndef NDIS_MINIPORT_DRIVER +# include +# include +# include +# include +# include + +# ifndef DEBUGP +# define DEBUGP(fmt) { DbgMessage fmt; } +# endif + + extern VOID (*DbgMessage)(char *p_Format, ...); + + VOID DisplayDebugString (char *p_Format, ...); +#endif + +//=================================================================================== +// Reporting / Debugging +//=================================================================================== +#define IfPrint(c) (c >= 32 && c < 127 ? c : '.') + +VOID HexDump (unsigned char *p_Buffer, unsigned long p_Size); + +#ifdef __cplusplus +} +#endif + +#endif diff --git a/installer/tap/src/src/lock.h b/installer/tap/src/src/lock.h new file mode 100644 index 0000000..c80b164 --- /dev/null +++ b/installer/tap/src/src/lock.h @@ -0,0 +1,75 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +typedef struct +{ + volatile long count; +} MUTEX; + +#define MUTEX_SLEEP_TIME 10000 // microseconds + +#define INIT_MUTEX(m) { (m)->count = 0; } + +#define ACQUIRE_MUTEX_BLOCKING(m) \ +{ \ + while (NdisInterlockedIncrement (&((m)->count)) != 1) \ + { \ + NdisInterlockedDecrement(&((m)->count)); \ + NdisMSleep(MUTEX_SLEEP_TIME); \ + } \ +} + +#define RELEASE_MUTEX(m) \ +{ \ + NdisInterlockedDecrement(&((m)->count)); \ +} + +#define ACQUIRE_MUTEX_NONBLOCKING(m, result) \ +{ \ + if (NdisInterlockedIncrement (&((m)->count)) != 1) \ + { \ + NdisInterlockedDecrement(&((m)->count)); \ + result = FALSE; \ + } \ + else \ + { \ + result = TRUE; \ + } \ +} + +#define ACQUIRE_MUTEX_ADAPTIVE(m, result) \ +{ \ + result = TRUE; \ + while (NdisInterlockedIncrement (&((m)->count)) != 1) \ + { \ + NdisInterlockedDecrement(&((m)->count)); \ + if (KeGetCurrentIrql () < DISPATCH_LEVEL) \ + NdisMSleep(MUTEX_SLEEP_TIME); \ + else \ + { \ + result = FALSE; \ + break; \ + } \ + } \ +} diff --git a/installer/tap/src/src/macinfo.c b/installer/tap/src/src/macinfo.c new file mode 100644 index 0000000..dfd0a07 --- /dev/null +++ b/installer/tap/src/src/macinfo.c @@ -0,0 +1,164 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+
+#include "tap.h"
+
+int
+HexStringToDecimalInt (const int p_Character)
+{
+    int l_Value = 0;
+
+    if (p_Character >= 'A' && p_Character <= 'F')
+        l_Value = (p_Character - 'A') + 10;
+    else if (p_Character >= 'a' && p_Character <= 'f')
+        l_Value = (p_Character - 'a') + 10;
+    else if (p_Character >= '0' && p_Character <= '9')
+        l_Value = p_Character - '0';
+
+    return l_Value;
+}
+
+BOOLEAN
+ParseMAC (MACADDR dest, const char *src)
+{
+    int c;
+    int mac_index = 0;
+    BOOLEAN high_digit = FALSE;
+    int delim_action = 1;
+
+    ASSERT (src);
+    ASSERT (dest);
+
+    CLEAR_MAC (dest);
+
+    while (c = *src++)
+    {
+        if (IsMacDelimiter (c))
+        {
+            mac_index += delim_action;
+            high_digit = FALSE;
+            delim_action = 1;
+        }
+        else if (IsHexDigit (c))
+        {
+            const int digit = HexStringToDecimalInt (c);
+            if (mac_index < sizeof (MACADDR))
+            {
+                if (!high_digit)
+                {
+                    dest[mac_index] = (char)(digit);
+                    high_digit = TRUE;
+                    delim_action = 1;
+                }
+                else
+                {
+                    dest[mac_index] = (char)(dest[mac_index] * 16 + digit);
+                    ++mac_index;
+                    high_digit = FALSE;
+                    delim_action = 0;
+                }
+            }
+            else
+                return FALSE;
+        }
+        else
+            return FALSE;
+    }
+
+    return (mac_index + delim_action) >= sizeof (MACADDR);
+}
+
+/*
+ * Generate a MAC using the GUID in the adapter name.
+ *
+ * The MAC is constructed as 00:FF:xx:xx:xx:xx where
+ * the Xs are taken from the first 32 bits of the GUID in the
+ * adapter name. This is similar to the Linux 2.4 tap MAC
+ * generator, except Linux uses 32 random bits for the Xs.
+ *
+ * In general, this solution is reasonable for most
+ * applications except for very large bridged TAP networks,
+ * where the probability of address collisions becomes more
+ * than infinitesimal.
+ *
+ * Using the well-known "birthday paradox", on a 1000 node
+ * network the probability of collision would be
+ * 0.000116292153. On a 10,000 node network, the probability
+ * of collision would be 0.01157288998621678766.
+ */
+
+VOID
+GenerateRandomMac(
+    __in MACADDR mac,
+    __in const unsigned char *adapter_name
+    )
+{
+    unsigned const char *cp = adapter_name;
+    unsigned char c;
+    unsigned int i = 2;
+    unsigned int byte = 0;
+    int brace = 0;
+    int state = 0;
+
+    CLEAR_MAC (mac);
+
+    mac[0] = 0x00;
+    mac[1] = 0xFF;
+
+    while (c = *cp++)
+    {
+        if (i >= sizeof (MACADDR))
+            break;
+        if (c == '{')
+            brace = 1;
+        if (IsHexDigit (c) && brace)
+        {
+            const unsigned int digit = HexStringToDecimalInt (c);
+            if (state)
+            {
+                byte <<= 4;
+                byte |= digit;
+                mac[i++] = (unsigned char) byte;
+                state = 0;
+            }
+            else
+            {
+                byte = digit;
+                state = 1;
+            }
+        }
+    }
+}
+
+VOID
+GenerateRelatedMAC(
+    __in MACADDR dest,
+    __in const MACADDR src,
+    __in const int delta
+    )
+{
+    ETH_COPY_NETWORK_ADDRESS (dest, src);
+    dest[2] += (UCHAR) delta;
+}
diff --git a/installer/tap/src/src/macinfo.h b/installer/tap/src/src/macinfo.h
new file mode 100644
index 0000000..dd88b6f
--- /dev/null
+++ b/installer/tap/src/src/macinfo.h
@@ -0,0 +1,53 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef MacInfoDefined +#define MacInfoDefined + +//=================================================================================== +// Macros +//=================================================================================== +#define IsMacDelimiter(a) (a == ':' || a == '-' || a == '.') +#define IsHexDigit(c) ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f')) + +#define CLEAR_MAC(dest) NdisZeroMemory ((dest), sizeof (MACADDR)) +#define MAC_EQUAL(a,b) (memcmp ((a), (b), sizeof (MACADDR)) == 0) + +BOOLEAN +ParseMAC (MACADDR dest, const char *src); + +VOID +GenerateRandomMac( + __in MACADDR mac, + __in const unsigned char *adapter_name + ); + +VOID +GenerateRelatedMAC( + __in MACADDR dest, + __in const MACADDR src, + __in const int delta + ); + +#endif diff --git a/installer/tap/src/src/mem.c b/installer/tap/src/src/mem.c new file mode 100644 index 0000000..78bfa22 --- /dev/null +++ b/installer/tap/src/src/mem.c @@ -0,0 +1,384 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//------------------ +// Memory Management +//------------------ + +#include "tap.h" + +PVOID +MemAlloc( + __in ULONG p_Size, + __in BOOLEAN zero + ) +{ + PVOID l_Return = NULL; + + if (p_Size) + { + __try + { + if (NdisAllocateMemoryWithTag (&l_Return, p_Size, 'APAT') + == NDIS_STATUS_SUCCESS) + { + if (zero) + { + NdisZeroMemory (l_Return, p_Size); + } + } + else + { + l_Return = NULL; + } + } + __except (EXCEPTION_EXECUTE_HANDLER) + { + l_Return = NULL; + } + } + + return l_Return; +} + +VOID +MemFree( + __in PVOID p_Addr, + __in ULONG p_Size + ) +{ + if (p_Addr && p_Size) + { + __try + { +#if DBG + NdisZeroMemory (p_Addr, p_Size); +#endif + NdisFreeMemory (p_Addr, p_Size, 0); + } + __except (EXCEPTION_EXECUTE_HANDLER) + { + } + } +} + +//====================================================================== +// TAP Packet Queue Support +//====================================================================== + +VOID +tapPacketQueueInsertTail( + __in PTAP_PACKET_QUEUE TapPacketQueue, + __in PTAP_PACKET TapPacket + ) +{ + KIRQL irql; + + KeAcquireSpinLock(&TapPacketQueue->QueueLock,&irql); + + InsertTailList(&TapPacketQueue->Queue,&TapPacket->QueueLink); + + // BUGBUG!!! Enforce PACKET_QUEUE_SIZE queue count limit??? + // For NDIS 6 there is no per-packet status, so this will need to + // be handled on per-NBL basis in AdapterSendNetBufferLists... + + // Update counts + ++TapPacketQueue->Count; + + if(TapPacketQueue->Count > TapPacketQueue->MaxCount) + { + TapPacketQueue->MaxCount = TapPacketQueue->Count; + + DEBUGP (("[TAP] tapPacketQueueInsertTail: New MAX queued packet count = %d\n", + TapPacketQueue->MaxCount)); + } + + KeReleaseSpinLock(&TapPacketQueue->QueueLock,irql); +} + +// Call with QueueLock held +PTAP_PACKET +tapPacketRemoveHeadLocked( + __in PTAP_PACKET_QUEUE TapPacketQueue + ) +{ + PTAP_PACKET tapPacket = NULL; + PLIST_ENTRY listEntry; + + listEntry = RemoveHeadList(&TapPacketQueue->Queue); + + if(listEntry != &TapPacketQueue->Queue) + { + tapPacket = CONTAINING_RECORD(listEntry, TAP_PACKET, QueueLink); + + // Update counts + --TapPacketQueue->Count; + } + + return tapPacket; +} + +VOID +tapPacketQueueInitialize( + __in PTAP_PACKET_QUEUE TapPacketQueue + ) +{ + KeInitializeSpinLock(&TapPacketQueue->QueueLock); + + NdisInitializeListHead(&TapPacketQueue->Queue); +} + +//====================================================================== +// TAP Cancel-Safe Queue Support +//====================================================================== + +VOID +tapIrpCsqInsert ( + __in struct _IO_CSQ *Csq, + __in PIRP Irp + ) +{ + PTAP_IRP_CSQ tapIrpCsq; + + tapIrpCsq = (PTAP_IRP_CSQ )Csq; + + InsertTailList( + &tapIrpCsq->Queue, + &Irp->Tail.Overlay.ListEntry + ); + + // Update counts + ++tapIrpCsq->Count; + + if(tapIrpCsq->Count > tapIrpCsq->MaxCount) + { + tapIrpCsq->MaxCount = tapIrpCsq->Count; + + DEBUGP (("[TAP] tapIrpCsqInsert: New MAX queued IRP count = %d\n", + tapIrpCsq->MaxCount)); + } +} + +VOID +tapIrpCsqRemoveIrp( + __in PIO_CSQ Csq, + __in PIRP Irp + ) +{ + PTAP_IRP_CSQ tapIrpCsq; + + tapIrpCsq = (PTAP_IRP_CSQ )Csq; + + // Update counts + --tapIrpCsq->Count; + + RemoveEntryList(&Irp->Tail.Overlay.ListEntry); +} + + +PIRP +tapIrpCsqPeekNextIrp( + __in PIO_CSQ Csq, + __in PIRP Irp, + __in 
PVOID PeekContext
+    )
+{
+    PTAP_IRP_CSQ tapIrpCsq;
+    PIRP nextIrp = NULL;
+    PLIST_ENTRY nextEntry;
+    PLIST_ENTRY listHead;
+    PIO_STACK_LOCATION irpStack;
+
+    tapIrpCsq = (PTAP_IRP_CSQ )Csq;
+
+    listHead = &tapIrpCsq->Queue;
+
+    //
+    // If the IRP is NULL, we will start peeking from the listhead, else
+    // we will start from that IRP onwards. This is done under the
+    // assumption that new IRPs are always inserted at the tail.
+    //
+
+    if (Irp == NULL)
+    {
+        nextEntry = listHead->Flink;
+    }
+    else
+    {
+        nextEntry = Irp->Tail.Overlay.ListEntry.Flink;
+    }
+
+    while(nextEntry != listHead)
+    {
+        nextIrp = CONTAINING_RECORD(nextEntry, IRP, Tail.Overlay.ListEntry);
+
+        irpStack = IoGetCurrentIrpStackLocation(nextIrp);
+
+        //
+        // If context is present, continue until you find a matching one.
+        // Else you break out as you got next one.
+        //
+        if (PeekContext)
+        {
+            if (irpStack->FileObject == (PFILE_OBJECT) PeekContext)
+            {
+                break;
+            }
+        }
+        else
+        {
+            break;
+        }
+
+        nextIrp = NULL;
+        nextEntry = nextEntry->Flink;
+    }
+
+    return nextIrp;
+}
+
+//
+// tapIrpCsqAcquireQueueLock modifies the execution level of the current processor.
+//
+// KeAcquireSpinLock raises the execution level to Dispatch Level and stores
+// the current execution level in the Irql parameter to be restored at a later
+// time. KeAcquireSpinLock also requires us to be running at no higher than
+// Dispatch level when it is called.
+//
+// The annotations reflect these changes and requirements.
+//
+
+__drv_raisesIRQL(DISPATCH_LEVEL)
+__drv_maxIRQL(DISPATCH_LEVEL)
+VOID
+tapIrpCsqAcquireQueueLock(
+    __in PIO_CSQ Csq,
+    __out PKIRQL Irql
+    )
+{
+    PTAP_IRP_CSQ tapIrpCsq;
+
+    tapIrpCsq = (PTAP_IRP_CSQ )Csq;
+
+    //
+    // Suppressing because the address below csq is valid since it's
+    // part of TAP_ADAPTER_CONTEXT structure.
+    //
+#pragma prefast(suppress: __WARNING_BUFFER_UNDERFLOW, "Underflow using expression 'adapter->PendingReadCsqQueueLock'")
+    KeAcquireSpinLock(&tapIrpCsq->QueueLock, Irql);
+}
+
+//
+// tapIrpCsqReleaseQueueLock modifies the execution level of the current processor.
+//
+// KeReleaseSpinLock assumes we already hold the spin lock and are therefore
+// running at Dispatch level. It will use the Irql parameter saved in a
+// previous call to KeAcquireSpinLock to return the thread back to its original
+// execution level.
+//
+// The annotations reflect these changes and requirements.
+//
+
+__drv_requiresIRQL(DISPATCH_LEVEL)
+VOID
+tapIrpCsqReleaseQueueLock(
+    __in PIO_CSQ Csq,
+    __in KIRQL Irql
+    )
+{
+    PTAP_IRP_CSQ tapIrpCsq;
+
+    tapIrpCsq = (PTAP_IRP_CSQ )Csq;
+
+    //
+    // Suppressing because the address below csq is valid since it's
+    // part of TAP_ADAPTER_CONTEXT structure.
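+    // (The IO_CSQ is embedded in the TAP_IRP_CSQ, which itself lives inside
+    // the adapter context; pointer arithmetic that reaches past the IO_CSQ
+    // therefore stays within the containing allocation, which is what the
+    // suppression below asserts to PREfast.)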
+ // +#pragma prefast(suppress: __WARNING_BUFFER_UNDERFLOW, "Underflow using expression 'adapter->PendingReadCsqQueueLock'") + KeReleaseSpinLock(&tapIrpCsq->QueueLock, Irql); +} + +VOID +tapIrpCsqCompleteCanceledIrp( + __in PIO_CSQ pCsq, + __in PIRP Irp + ) +{ + UNREFERENCED_PARAMETER(pCsq); + + Irp->IoStatus.Status = STATUS_CANCELLED; + Irp->IoStatus.Information = 0; + IoCompleteRequest(Irp, IO_NO_INCREMENT); +} + +VOID +tapIrpCsqInitialize( + __in PTAP_IRP_CSQ TapIrpCsq + ) +{ + KeInitializeSpinLock(&TapIrpCsq->QueueLock); + + NdisInitializeListHead(&TapIrpCsq->Queue); + + IoCsqInitialize( + &TapIrpCsq->CsqQueue, + tapIrpCsqInsert, + tapIrpCsqRemoveIrp, + tapIrpCsqPeekNextIrp, + tapIrpCsqAcquireQueueLock, + tapIrpCsqReleaseQueueLock, + tapIrpCsqCompleteCanceledIrp + ); +} + +VOID +tapIrpCsqFlush( + __in PTAP_IRP_CSQ TapIrpCsq + ) +{ + PIRP pendingIrp; + + // + // Flush the pending read IRP queue. + // + pendingIrp = IoCsqRemoveNextIrp( + &TapIrpCsq->CsqQueue, + NULL + ); + + while(pendingIrp) + { + // Cancel the IRP + pendingIrp->IoStatus.Information = 0; + pendingIrp->IoStatus.Status = STATUS_CANCELLED; + IoCompleteRequest(pendingIrp, IO_NO_INCREMENT); + + pendingIrp = IoCsqRemoveNextIrp( + &TapIrpCsq->CsqQueue, + NULL + ); + } + + ASSERT(IsListEmpty(&TapIrpCsq->Queue)); +} diff --git a/installer/tap/src/src/mem.h b/installer/tap/src/src/mem.h new file mode 100644 index 0000000..d10d536 --- /dev/null +++ b/installer/tap/src/src/mem.h @@ -0,0 +1,108 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//------------------ +// Memory Management +//------------------ + +PVOID +MemAlloc( + __in ULONG p_Size, + __in BOOLEAN zero + ); + +VOID +MemFree( + __in PVOID p_Addr, + __in ULONG p_Size + ); + +//====================================================================== +// TAP Packet Queue +//====================================================================== + +typedef +struct _TAP_PACKET +{ + LIST_ENTRY QueueLink; + +# define TAP_PACKET_SIZE(data_size) (sizeof (TAP_PACKET) + (data_size)) +# define TP_TUN 0x80000000 +# define TP_SIZE_MASK (~TP_TUN) + ULONG m_SizeFlags; + + // m_Data must be the last struct member + UCHAR m_Data []; +} TAP_PACKET, *PTAP_PACKET; + +#define TAP_PACKET_TAG '6PAT' // "TAP6" + +typedef struct _TAP_PACKET_QUEUE +{ + KSPIN_LOCK QueueLock; + LIST_ENTRY Queue; + ULONG Count; // Count of currently queued items + ULONG MaxCount; +} TAP_PACKET_QUEUE, *PTAP_PACKET_QUEUE; + +VOID +tapPacketQueueInsertTail( + __in PTAP_PACKET_QUEUE TapPacketQueue, + __in PTAP_PACKET TapPacket + ); + + +// Call with QueueLock held +PTAP_PACKET +tapPacketRemoveHeadLocked( + __in PTAP_PACKET_QUEUE TapPacketQueue + ); + +VOID +tapPacketQueueInitialize( + __in PTAP_PACKET_QUEUE TapPacketQueue + ); + +//---------------------- +// Cancel-Safe IRP Queue +//---------------------- + +typedef struct _TAP_IRP_CSQ +{ + IO_CSQ CsqQueue; + KSPIN_LOCK QueueLock; + LIST_ENTRY Queue; + ULONG Count; // Count of currently queued items + ULONG MaxCount; +} TAP_IRP_CSQ, *PTAP_IRP_CSQ; + +VOID +tapIrpCsqInitialize( + __in PTAP_IRP_CSQ TapIrpCsq + ); + +VOID +tapIrpCsqFlush( + __in PTAP_IRP_CSQ TapIrpCsq + ); diff --git a/installer/tap/src/src/oidrequest.c b/installer/tap/src/src/oidrequest.c new file mode 100644 index 0000000..a6882f8 --- /dev/null +++ b/installer/tap/src/src/oidrequest.c @@ -0,0 +1,1028 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. 
+// + +#include "tap.h" + +#ifndef DBG + +#define DBG_PRINT_OID_NAME + +#else + +VOID +DBG_PRINT_OID_NAME( + __in NDIS_OID Oid + ) +{ + PCHAR oidName = NULL; + + switch (Oid){ + + #undef MAKECASE + #define MAKECASE(oidx) case oidx: oidName = #oidx "\n"; break; + + /* Operational OIDs */ + MAKECASE(OID_GEN_SUPPORTED_LIST) + MAKECASE(OID_GEN_HARDWARE_STATUS) + MAKECASE(OID_GEN_MEDIA_SUPPORTED) + MAKECASE(OID_GEN_MEDIA_IN_USE) + MAKECASE(OID_GEN_MAXIMUM_LOOKAHEAD) + MAKECASE(OID_GEN_MAXIMUM_FRAME_SIZE) + MAKECASE(OID_GEN_LINK_SPEED) + MAKECASE(OID_GEN_TRANSMIT_BUFFER_SPACE) + MAKECASE(OID_GEN_RECEIVE_BUFFER_SPACE) + MAKECASE(OID_GEN_TRANSMIT_BLOCK_SIZE) + MAKECASE(OID_GEN_RECEIVE_BLOCK_SIZE) + MAKECASE(OID_GEN_VENDOR_ID) + MAKECASE(OID_GEN_VENDOR_DESCRIPTION) + MAKECASE(OID_GEN_VENDOR_DRIVER_VERSION) + MAKECASE(OID_GEN_CURRENT_PACKET_FILTER) + MAKECASE(OID_GEN_CURRENT_LOOKAHEAD) + MAKECASE(OID_GEN_DRIVER_VERSION) + MAKECASE(OID_GEN_MAXIMUM_TOTAL_SIZE) + MAKECASE(OID_GEN_PROTOCOL_OPTIONS) + MAKECASE(OID_GEN_MAC_OPTIONS) + MAKECASE(OID_GEN_MEDIA_CONNECT_STATUS) + MAKECASE(OID_GEN_MAXIMUM_SEND_PACKETS) + MAKECASE(OID_GEN_SUPPORTED_GUIDS) + MAKECASE(OID_GEN_NETWORK_LAYER_ADDRESSES) + MAKECASE(OID_GEN_TRANSPORT_HEADER_OFFSET) + MAKECASE(OID_GEN_MEDIA_CAPABILITIES) + MAKECASE(OID_GEN_PHYSICAL_MEDIUM) + MAKECASE(OID_GEN_MACHINE_NAME) + MAKECASE(OID_GEN_VLAN_ID) + MAKECASE(OID_GEN_RNDIS_CONFIG_PARAMETER) + + /* Operational OIDs for NDIS 6.0 */ + MAKECASE(OID_GEN_MAX_LINK_SPEED) + MAKECASE(OID_GEN_LINK_STATE) + MAKECASE(OID_GEN_LINK_PARAMETERS) + MAKECASE(OID_GEN_MINIPORT_RESTART_ATTRIBUTES) + MAKECASE(OID_GEN_ENUMERATE_PORTS) + MAKECASE(OID_GEN_PORT_STATE) + MAKECASE(OID_GEN_PORT_AUTHENTICATION_PARAMETERS) + MAKECASE(OID_GEN_INTERRUPT_MODERATION) + MAKECASE(OID_GEN_PHYSICAL_MEDIUM_EX) + + /* Statistical OIDs */ + MAKECASE(OID_GEN_XMIT_OK) + MAKECASE(OID_GEN_RCV_OK) + MAKECASE(OID_GEN_XMIT_ERROR) + MAKECASE(OID_GEN_RCV_ERROR) + MAKECASE(OID_GEN_RCV_NO_BUFFER) + MAKECASE(OID_GEN_DIRECTED_BYTES_XMIT) + MAKECASE(OID_GEN_DIRECTED_FRAMES_XMIT) + MAKECASE(OID_GEN_MULTICAST_BYTES_XMIT) + MAKECASE(OID_GEN_MULTICAST_FRAMES_XMIT) + MAKECASE(OID_GEN_BROADCAST_BYTES_XMIT) + MAKECASE(OID_GEN_BROADCAST_FRAMES_XMIT) + MAKECASE(OID_GEN_DIRECTED_BYTES_RCV) + MAKECASE(OID_GEN_DIRECTED_FRAMES_RCV) + MAKECASE(OID_GEN_MULTICAST_BYTES_RCV) + MAKECASE(OID_GEN_MULTICAST_FRAMES_RCV) + MAKECASE(OID_GEN_BROADCAST_BYTES_RCV) + MAKECASE(OID_GEN_BROADCAST_FRAMES_RCV) + MAKECASE(OID_GEN_RCV_CRC_ERROR) + MAKECASE(OID_GEN_TRANSMIT_QUEUE_LENGTH) + + /* Statistical OIDs for NDIS 6.0 */ + MAKECASE(OID_GEN_STATISTICS) + MAKECASE(OID_GEN_BYTES_RCV) + MAKECASE(OID_GEN_BYTES_XMIT) + MAKECASE(OID_GEN_RCV_DISCARDS) + MAKECASE(OID_GEN_XMIT_DISCARDS) + + /* Misc OIDs */ + MAKECASE(OID_GEN_GET_TIME_CAPS) + MAKECASE(OID_GEN_GET_NETCARD_TIME) + MAKECASE(OID_GEN_NETCARD_LOAD) + MAKECASE(OID_GEN_DEVICE_PROFILE) + MAKECASE(OID_GEN_INIT_TIME_MS) + MAKECASE(OID_GEN_RESET_COUNTS) + MAKECASE(OID_GEN_MEDIA_SENSE_COUNTS) + + /* PnP power management operational OIDs */ + MAKECASE(OID_PNP_CAPABILITIES) + MAKECASE(OID_PNP_SET_POWER) + MAKECASE(OID_PNP_QUERY_POWER) + MAKECASE(OID_PNP_ADD_WAKE_UP_PATTERN) + MAKECASE(OID_PNP_REMOVE_WAKE_UP_PATTERN) + MAKECASE(OID_PNP_ENABLE_WAKE_UP) + MAKECASE(OID_PNP_WAKE_UP_PATTERN_LIST) + + /* PnP power management statistical OIDs */ + MAKECASE(OID_PNP_WAKE_UP_ERROR) + MAKECASE(OID_PNP_WAKE_UP_OK) + + /* Ethernet operational OIDs */ + MAKECASE(OID_802_3_PERMANENT_ADDRESS) + MAKECASE(OID_802_3_CURRENT_ADDRESS) + 
MAKECASE(OID_802_3_MULTICAST_LIST) + MAKECASE(OID_802_3_MAXIMUM_LIST_SIZE) + MAKECASE(OID_802_3_MAC_OPTIONS) + + /* Ethernet operational OIDs for NDIS 6.0 */ + MAKECASE(OID_802_3_ADD_MULTICAST_ADDRESS) + MAKECASE(OID_802_3_DELETE_MULTICAST_ADDRESS) + + /* Ethernet statistical OIDs */ + MAKECASE(OID_802_3_RCV_ERROR_ALIGNMENT) + MAKECASE(OID_802_3_XMIT_ONE_COLLISION) + MAKECASE(OID_802_3_XMIT_MORE_COLLISIONS) + MAKECASE(OID_802_3_XMIT_DEFERRED) + MAKECASE(OID_802_3_XMIT_MAX_COLLISIONS) + MAKECASE(OID_802_3_RCV_OVERRUN) + MAKECASE(OID_802_3_XMIT_UNDERRUN) + MAKECASE(OID_802_3_XMIT_HEARTBEAT_FAILURE) + MAKECASE(OID_802_3_XMIT_TIMES_CRS_LOST) + MAKECASE(OID_802_3_XMIT_LATE_COLLISIONS) + + /* TCP/IP OIDs */ + MAKECASE(OID_TCP_TASK_OFFLOAD) + MAKECASE(OID_TCP_TASK_IPSEC_ADD_SA) + MAKECASE(OID_TCP_TASK_IPSEC_DELETE_SA) + MAKECASE(OID_TCP_SAN_SUPPORT) + MAKECASE(OID_TCP_TASK_IPSEC_ADD_UDPESP_SA) + MAKECASE(OID_TCP_TASK_IPSEC_DELETE_UDPESP_SA) + MAKECASE(OID_TCP4_OFFLOAD_STATS) + MAKECASE(OID_TCP6_OFFLOAD_STATS) + MAKECASE(OID_IP4_OFFLOAD_STATS) + MAKECASE(OID_IP6_OFFLOAD_STATS) + + /* TCP offload OIDs for NDIS 6 */ + MAKECASE(OID_TCP_OFFLOAD_CURRENT_CONFIG) + MAKECASE(OID_TCP_OFFLOAD_PARAMETERS) + MAKECASE(OID_TCP_OFFLOAD_HARDWARE_CAPABILITIES) + MAKECASE(OID_TCP_CONNECTION_OFFLOAD_CURRENT_CONFIG) + MAKECASE(OID_TCP_CONNECTION_OFFLOAD_HARDWARE_CAPABILITIES) + MAKECASE(OID_OFFLOAD_ENCAPSULATION) + +#if (NDIS_SUPPORT_NDIS620) + /* VMQ OIDs for NDIS 6.20 */ + MAKECASE(OID_RECEIVE_FILTER_FREE_QUEUE) + MAKECASE(OID_RECEIVE_FILTER_CLEAR_FILTER) + MAKECASE(OID_RECEIVE_FILTER_ALLOCATE_QUEUE) + MAKECASE(OID_RECEIVE_FILTER_QUEUE_ALLOCATION_COMPLETE) + MAKECASE(OID_RECEIVE_FILTER_SET_FILTER) +#endif + +#if (NDIS_SUPPORT_NDIS630) + /* NDIS QoS OIDs for NDIS 6.30 */ + MAKECASE(OID_QOS_PARAMETERS) +#endif + } + + if (oidName) + { + DEBUGP(("OID: %s", oidName)); + } + else + { + DEBUGP(("<** Unknown OID 0x%08x **>\n", Oid)); + } +} + +#endif // DBG + +//====================================================================== +// TAP NDIS 6 OID Request Callbacks +//====================================================================== + +NDIS_STATUS +tapSetMulticastList( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PNDIS_OID_REQUEST OidRequest + ) +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; + + // + // Initialize. + // + OidRequest->DATA.SET_INFORMATION.BytesNeeded = MACADDR_SIZE; + OidRequest->DATA.SET_INFORMATION.BytesRead + = OidRequest->DATA.SET_INFORMATION.InformationBufferLength; + + + do + { + if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength % MACADDR_SIZE) + { + status = NDIS_STATUS_INVALID_LENGTH; + break; + } + + if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength > (TAP_MAX_MCAST_LIST * MACADDR_SIZE)) + { + status = NDIS_STATUS_MULTICAST_FULL; + OidRequest->DATA.SET_INFORMATION.BytesNeeded = TAP_MAX_MCAST_LIST * MACADDR_SIZE; + break; + } + + // BUGBUG!!! Is lock needed??? If so, use NDIS_RW_LOCK. Also apply to packet filter. + + NdisZeroMemory(Adapter->MCList, + TAP_MAX_MCAST_LIST * MACADDR_SIZE); + + NdisMoveMemory(Adapter->MCList, + OidRequest->DATA.SET_INFORMATION.InformationBuffer, + OidRequest->DATA.SET_INFORMATION.InformationBufferLength); + + Adapter->ulMCListSize = OidRequest->DATA.SET_INFORMATION.InformationBufferLength / MACADDR_SIZE; + + } while(FALSE); + return status; +} + +NDIS_STATUS +tapSetPacketFilter( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in ULONG PacketFilter + ) +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; + + // any bits not supported? 
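+    // TAP_SUPPORTED_FILTERS is the mask of NDIS_PACKET_TYPE_* bits this
+    // miniport honors; a set request containing any bit outside that mask
+    // is rejected with NDIS_STATUS_NOT_SUPPORTED below.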
+    if (PacketFilter & ~(TAP_SUPPORTED_FILTERS))
+    {
+        DEBUGP (("[TAP] Unsupported packet filter: 0x%08x\n", PacketFilter));
+        status = NDIS_STATUS_NOT_SUPPORTED;
+    }
+    else
+    {
+        // Any actual filtering changes?
+        if (PacketFilter != Adapter->PacketFilter)
+        {
+            //
+            // Change the filtering modes on hardware
+            //
+
+            // Save the new packet filter value
+            Adapter->PacketFilter = PacketFilter;
+        }
+    }
+
+    return status;
+}
+
+NDIS_STATUS
+AdapterSetPowerD0(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    )
+/*++
+Routine Description:
+
+    NIC power has been restored to the working power state (D0).
+    Prepare the NIC for normal operation:
+        - Restore hardware context (packet filters, multicast addresses, MAC address, etc.)
+        - Enable interrupts and the NIC's DMA engine.
+
+Arguments:
+
+    Adapter     - Pointer to adapter block
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    DEBUGP (("[TAP] PowerState: Fully powered\n"));
+
+    // Start data path...
+
+    return status;
+}
+
+NDIS_STATUS
+AdapterSetPowerLow(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in NDIS_DEVICE_POWER_STATE PowerState
+    )
+/*++
+Routine Description:
+
+    The NIC is about to be transitioned to a low power state.
+    Prepare the NIC for the sleeping state:
+        - Disable interrupts and the NIC's DMA engine, cancel timers.
+        - Save any hardware context that the NIC cannot preserve in
+          a sleeping state (packet filters, multicast addresses,
+          the current MAC address, etc.)
+    A miniport driver cannot access the NIC hardware after
+    the NIC has been set to the D3 state by the bus driver.
+
+    Miniport drivers NDIS v6.30 and above:
+        - Do NOT wait for NDIS to return the ownership of all
+          NBLs from outstanding receive indications.
+        - Retain ownership of all the receive descriptors and
+          packet buffers previously owned by the hardware.
+
+Arguments:
+
+    Adapter     - Pointer to adapter block
+    PowerState  - New power state
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    DEBUGP (("[TAP] PowerState: Low-power\n"));
+
+    //
+    // Miniport drivers NDIS v6.20 and below are
+    // paused prior to the low power transition
+    //
+
+    // Check for paused state...
+    // Verify data path stopped...
+
+    return status;
+}
+
+NDIS_STATUS
+tapSetInformation(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNDIS_OID_REQUEST OidRequest
+    )
+/*++
+
+Routine Description:
+
+    Helper function to perform a set OID request
+
+Arguments:
+
+    Adapter     -
+    OidRequest  - The OID request to set
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    DBG_PRINT_OID_NAME(OidRequest->DATA.SET_INFORMATION.Oid);
+
+    switch(OidRequest->DATA.SET_INFORMATION.Oid)
+    {
+    case OID_802_3_MULTICAST_LIST:
+        //
+        // Set the multicast address list on the NIC for packet reception.
+        // The NIC driver can set a limit on the number of multicast
+        // addresses bound protocol drivers can enable simultaneously.
+        // NDIS returns NDIS_STATUS_MULTICAST_FULL if a protocol driver
+        // exceeds this limit or if it specifies an invalid multicast
+        // address.
+        //
+        status = tapSetMulticastList(Adapter,OidRequest);
+        break;
+
+    case OID_GEN_CURRENT_LOOKAHEAD:
+        //
+        // A protocol driver can set a suggested value for the number
+        // of bytes to be used in its binding; however, the underlying
+        // NIC driver is never required to limit its indications to
+        // the value set.
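+        // (The handler below simply records the requested value in
+        // Adapter->ulLookahead; the TAP driver does not otherwise act on it.)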
+        //
+        if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength != sizeof(ULONG))
+        {
+            OidRequest->DATA.SET_INFORMATION.BytesNeeded = sizeof(ULONG);
+            status = NDIS_STATUS_INVALID_LENGTH;
+            break;
+        }
+
+        Adapter->ulLookahead = *(PULONG)OidRequest->DATA.SET_INFORMATION.InformationBuffer;
+
+        OidRequest->DATA.SET_INFORMATION.BytesRead = sizeof(ULONG);
+        status = NDIS_STATUS_SUCCESS;
+        break;
+
+    case OID_GEN_CURRENT_PACKET_FILTER:
+        //
+        // Program the hardware to indicate the packets
+        // of certain filter types.
+        //
+        if(OidRequest->DATA.SET_INFORMATION.InformationBufferLength != sizeof(ULONG))
+        {
+            OidRequest->DATA.SET_INFORMATION.BytesNeeded = sizeof(ULONG);
+            status = NDIS_STATUS_INVALID_LENGTH;
+            break;
+        }
+
+        OidRequest->DATA.SET_INFORMATION.BytesRead
+            = OidRequest->DATA.SET_INFORMATION.InformationBufferLength;
+
+        status = tapSetPacketFilter(
+                    Adapter,
+                    *((PULONG)OidRequest->DATA.SET_INFORMATION.InformationBuffer)
+                    );
+
+        break;
+
+    case OID_PNP_SET_POWER:
+        {
+            // Sanity check.
+            if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength
+                    < sizeof(NDIS_DEVICE_POWER_STATE)
+                )
+            {
+                status = NDIS_STATUS_INVALID_LENGTH;
+            }
+            else
+            {
+                NDIS_DEVICE_POWER_STATE PowerState;
+
+                PowerState = *(PNDIS_DEVICE_POWER_STATE UNALIGNED)OidRequest->DATA.SET_INFORMATION.InformationBuffer;
+                OidRequest->DATA.SET_INFORMATION.BytesRead = sizeof(NDIS_DEVICE_POWER_STATE);
+
+                if(PowerState < NdisDeviceStateD0 ||
+                    PowerState > NdisDeviceStateD3)
+                {
+                    status = NDIS_STATUS_INVALID_DATA;
+                }
+                else
+                {
+                    Adapter->CurrentPowerState = PowerState;
+
+                    if (PowerState == NdisDeviceStateD0)
+                    {
+                        status = AdapterSetPowerD0(Adapter);
+                    }
+                    else
+                    {
+                        status = AdapterSetPowerLow(Adapter, PowerState);
+                    }
+                }
+            }
+        }
+        break;
+
+#if (NDIS_SUPPORT_NDIS61)
+    case OID_PNP_ADD_WAKE_UP_PATTERN:
+    case OID_PNP_REMOVE_WAKE_UP_PATTERN:
+    case OID_PNP_ENABLE_WAKE_UP:
+#endif
+        ASSERT(!"NIC does not support wake on LAN OIDs");
+    default:
+        //
+        // The entry point may be used by other requests
+        //
+        status = NDIS_STATUS_NOT_SUPPORTED;
+        break;
+    }
+
+    return status;
+}
+
+NDIS_STATUS
+tapQueryInformation(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNDIS_OID_REQUEST OidRequest
+    )
+/*++
+
+Routine Description:
+
+    Helper function to perform a query OID request
+
+Arguments:
+
+    Adapter     -
+    OidRequest  - The OID request that is being queried
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+    NDIS_MEDIUM Medium = TAP_MEDIUM_TYPE;
+    NDIS_HARDWARE_STATUS HardwareStatus = NdisHardwareStatusReady;
+    UCHAR VendorDesc[] = TAP_VENDOR_DESC;
+    ULONG ulInfo;
+    USHORT usInfo;
+    ULONG64 ulInfo64;
+
+    // Default to returning the ULONG value
+    PVOID pInfo=NULL;
+    ULONG ulInfoLen = sizeof(ulInfo);
+
+    // ATTENTION!!! Ignore OIDs too noisy to print...
+    if((OidRequest->DATA.QUERY_INFORMATION.Oid != OID_GEN_STATISTICS)
+        && (OidRequest->DATA.QUERY_INFORMATION.Oid != OID_IP4_OFFLOAD_STATS)
+        && (OidRequest->DATA.QUERY_INFORMATION.Oid != OID_IP6_OFFLOAD_STATS)
+        )
+    {
+        DBG_PRINT_OID_NAME(OidRequest->DATA.QUERY_INFORMATION.Oid);
+    }
+
+    // Dispatch based on object identifier (OID).
+    switch(OidRequest->DATA.QUERY_INFORMATION.Oid)
+    {
+    case OID_GEN_HARDWARE_STATUS:
+        //
+        // Specify the current hardware status of the underlying NIC as
+        // one of the following NDIS_HARDWARE_STATUS-type values.
+        //
+        pInfo = (PVOID) &HardwareStatus;
+        ulInfoLen = sizeof(NDIS_HARDWARE_STATUS);
+        break;
+
+    case OID_802_3_PERMANENT_ADDRESS:
+        //
+        // Return the MAC address of the NIC burnt in the hardware.
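+        // (For a virtual adapter there is no address burnt into hardware;
+        // Adapter->PermanentAddress is synthesized when the adapter is
+        // initialized.)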
+        //
+        pInfo = Adapter->PermanentAddress;
+        ulInfoLen = MACADDR_SIZE;
+        break;
+
+    case OID_802_3_CURRENT_ADDRESS:
+        //
+        // Return the MAC address the NIC is currently programmed to
+        // use. Note that this address could be different from the
+        // permanent address, as the user can override it via the
+        // registry. See the NdisReadNetworkAddress documentation for
+        // more info.
+        //
+        pInfo = Adapter->CurrentAddress;
+        ulInfoLen = MACADDR_SIZE;
+        break;
+
+    case OID_GEN_MEDIA_SUPPORTED:
+        //
+        // Return an array of media that are supported by the miniport.
+        // This miniport only supports one medium (Ethernet), so the OID
+        // returns identical results to OID_GEN_MEDIA_IN_USE.
+        //
+
+        __fallthrough;
+
+    case OID_GEN_MEDIA_IN_USE:
+        //
+        // Return an array of media that are currently in use by the
+        // miniport. This array should be a subset of the array returned
+        // by OID_GEN_MEDIA_SUPPORTED.
+        //
+        pInfo = &Medium;
+        ulInfoLen = sizeof(Medium);
+        break;
+
+    case OID_GEN_MAXIMUM_TOTAL_SIZE:
+        //
+        // Specify the maximum total packet length, in bytes, the NIC
+        // supports including the header. A protocol driver might use
+        // this returned length as a gauge to determine the maximum
+        // size packet that a NIC driver could forward to the
+        // protocol driver. The miniport driver must never indicate
+        // up to the bound protocol driver packets received over the
+        // network that are longer than the packet size specified by
+        // OID_GEN_MAXIMUM_TOTAL_SIZE.
+        //
+
+        __fallthrough;
+
+    case OID_GEN_TRANSMIT_BLOCK_SIZE:
+        //
+        // The OID_GEN_TRANSMIT_BLOCK_SIZE OID specifies the minimum
+        // number of bytes that a single net packet occupies in the
+        // transmit buffer space of the NIC. In our case, the transmit
+        // block size is identical to its maximum packet size.
+        __fallthrough;
+
+    case OID_GEN_RECEIVE_BLOCK_SIZE:
+        //
+        // The OID_GEN_RECEIVE_BLOCK_SIZE OID specifies the amount of
+        // storage, in bytes, that a single packet occupies in the receive
+        // buffer space of the NIC.
+        //
+        ulInfo = (ULONG) TAP_MAX_FRAME_SIZE;
+        pInfo = &ulInfo;
+        break;
+
+    case OID_GEN_INTERRUPT_MODERATION:
+        {
+            PNDIS_INTERRUPT_MODERATION_PARAMETERS moderationParams
+                = (PNDIS_INTERRUPT_MODERATION_PARAMETERS)OidRequest->DATA.QUERY_INFORMATION.InformationBuffer;
+
+            moderationParams->Header.Type = NDIS_OBJECT_TYPE_DEFAULT;
+            moderationParams->Header.Revision = NDIS_INTERRUPT_MODERATION_PARAMETERS_REVISION_1;
+            moderationParams->Header.Size = NDIS_SIZEOF_INTERRUPT_MODERATION_PARAMETERS_REVISION_1;
+            moderationParams->Flags = 0;
+            moderationParams->InterruptModeration = NdisInterruptModerationNotSupported;
+            ulInfoLen = NDIS_SIZEOF_INTERRUPT_MODERATION_PARAMETERS_REVISION_1;
+        }
+        break;
+
+    case OID_PNP_QUERY_POWER:
+        // Simply succeed this.
+        break;
+
+    case OID_GEN_VENDOR_ID:
+        //
+        // Specify a three-byte IEEE-registered vendor code, followed
+        // by a single byte that the vendor assigns to identify a
+        // particular NIC. The IEEE code uniquely identifies the vendor
+        // and is the same as the three bytes appearing at the beginning
+        // of the NIC hardware address. Vendors without an IEEE-registered
+        // code should use the value 0xFFFFFF.
+        //
+
+        ulInfo = TAP_VENDOR_ID;
+        pInfo = &ulInfo;
+        break;
+
+    case OID_GEN_VENDOR_DESCRIPTION:
+        //
+        // Specify a zero-terminated string describing the NIC vendor.
+        //
+        pInfo = VendorDesc;
+        ulInfoLen = sizeof(VendorDesc);
+        break;
+
+    case OID_GEN_VENDOR_DRIVER_VERSION:
+        //
+        // Specify the vendor-assigned version number of the NIC driver.
+ // The low-order half of the return value specifies the minor + // version; the high-order half specifies the major version. + // + + ulInfo = TAP_DRIVER_VENDOR_VERSION; + pInfo = &ulInfo; + break; + + case OID_GEN_DRIVER_VERSION: + // + // Specify the NDIS version in use by the NIC driver. The high + // byte is the major version number; the low byte is the minor + // version number. + // + usInfo = (USHORT) (TAP_NDIS_MAJOR_VERSION<<8) + TAP_NDIS_MINOR_VERSION; + pInfo = (PVOID) &usInfo; + ulInfoLen = sizeof(USHORT); + break; + + case OID_802_3_MAXIMUM_LIST_SIZE: + // + // The maximum number of multicast addresses the NIC driver + // can manage. This list is global for all protocols bound + // to (or above) the NIC. Consequently, a protocol can receive + // NDIS_STATUS_MULTICAST_FULL from the NIC driver when + // attempting to set the multicast address list, even if + // the number of elements in the given list is less than + // the number originally returned for this query. + // + + ulInfo = TAP_MAX_MCAST_LIST; + pInfo = &ulInfo; + break; + + case OID_GEN_XMIT_ERROR: + ulInfo = (ULONG) + (Adapter->TxAbortExcessCollisions + + Adapter->TxDmaUnderrun + + Adapter->TxLostCRS + + Adapter->TxLateCollisions+ + Adapter->TransmitFailuresOther); + pInfo = &ulInfo; + break; + + case OID_GEN_RCV_ERROR: + ulInfo = (ULONG) + (Adapter->RxCrcErrors + + Adapter->RxAlignmentErrors + + Adapter->RxDmaOverrunErrors + + Adapter->RxRuntErrors); + pInfo = &ulInfo; + break; + + case OID_GEN_RCV_DISCARDS: + ulInfo = (ULONG)Adapter->RxResourceErrors; + pInfo = &ulInfo; + break; + + case OID_GEN_RCV_NO_BUFFER: + ulInfo = (ULONG)Adapter->RxResourceErrors; + pInfo = &ulInfo; + break; + + case OID_GEN_XMIT_OK: + ulInfo64 = Adapter->FramesTxBroadcast + + Adapter->FramesTxMulticast + + Adapter->FramesTxDirected; + pInfo = &ulInfo64; + if (OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength >= sizeof(ULONG64) || + OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength == 0) + { + ulInfoLen = sizeof(ULONG64); + } + else + { + ulInfoLen = sizeof(ULONG); + } + + // We should always report that only 8 bytes are required to keep ndistest happy + OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = sizeof(ULONG64); + break; + + case OID_GEN_RCV_OK: + ulInfo64 = Adapter->FramesRxBroadcast + + Adapter->FramesRxMulticast + + Adapter->FramesRxDirected; + + pInfo = &ulInfo64; + + if (OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength >= sizeof(ULONG64) || + OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength == 0) + { + ulInfoLen = sizeof(ULONG64); + } + else + { + ulInfoLen = sizeof(ULONG); + } + + // We should always report that only 8 bytes are required to keep ndistest happy + OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = sizeof(ULONG64); + break; + + case OID_802_3_RCV_ERROR_ALIGNMENT: + + ulInfo = Adapter->RxAlignmentErrors; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_ONE_COLLISION: + + ulInfo = Adapter->OneRetry; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_MORE_COLLISIONS: + + ulInfo = Adapter->MoreThanOneRetry; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_DEFERRED: + + ulInfo = Adapter->TxOKButDeferred; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_MAX_COLLISIONS: + + ulInfo = Adapter->TxAbortExcessCollisions; + pInfo = &ulInfo; + break; + + case OID_802_3_RCV_OVERRUN: + + ulInfo = Adapter->RxDmaOverrunErrors; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_UNDERRUN: + + ulInfo = Adapter->TxDmaUnderrun; + pInfo = &ulInfo; + break; + + case OID_GEN_STATISTICS: + + if 
(OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength < sizeof(NDIS_STATISTICS_INFO))
+        {
+            status = NDIS_STATUS_INVALID_LENGTH;
+            OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = sizeof(NDIS_STATISTICS_INFO);
+            break;
+        }
+        else
+        {
+            PNDIS_STATISTICS_INFO Statistics
+                = (PNDIS_STATISTICS_INFO)OidRequest->DATA.QUERY_INFORMATION.InformationBuffer;
+
+            {C_ASSERT(sizeof(NDIS_STATISTICS_INFO) >= NDIS_SIZEOF_STATISTICS_INFO_REVISION_1);}
+            Statistics->Header.Type = NDIS_OBJECT_TYPE_DEFAULT;
+            Statistics->Header.Size = NDIS_SIZEOF_STATISTICS_INFO_REVISION_1;
+            Statistics->Header.Revision = NDIS_STATISTICS_INFO_REVISION_1;
+
+            Statistics->SupportedStatistics = TAP_SUPPORTED_STATISTICS;
+
+            /* Bytes in */
+            Statistics->ifHCInOctets =
+                Adapter->BytesRxDirected +
+                Adapter->BytesRxMulticast +
+                Adapter->BytesRxBroadcast;
+
+            Statistics->ifHCInUcastOctets =
+                Adapter->BytesRxDirected;
+
+            Statistics->ifHCInMulticastOctets =
+                Adapter->BytesRxMulticast;
+
+            Statistics->ifHCInBroadcastOctets =
+                Adapter->BytesRxBroadcast;
+
+            /* Packets in */
+            Statistics->ifHCInUcastPkts =
+                Adapter->FramesRxDirected;
+
+            Statistics->ifHCInMulticastPkts =
+                Adapter->FramesRxMulticast;
+
+            Statistics->ifHCInBroadcastPkts =
+                Adapter->FramesRxBroadcast;
+
+            /* Errors in */
+            Statistics->ifInErrors =
+                Adapter->RxCrcErrors +
+                Adapter->RxAlignmentErrors +
+                Adapter->RxDmaOverrunErrors +
+                Adapter->RxRuntErrors;
+
+            Statistics->ifInDiscards =
+                Adapter->RxResourceErrors;
+
+
+            /* Bytes out */
+            Statistics->ifHCOutOctets =
+                Adapter->BytesTxDirected +
+                Adapter->BytesTxMulticast +
+                Adapter->BytesTxBroadcast;
+
+            Statistics->ifHCOutUcastOctets =
+                Adapter->BytesTxDirected;
+
+            Statistics->ifHCOutMulticastOctets =
+                Adapter->BytesTxMulticast;
+
+            Statistics->ifHCOutBroadcastOctets =
+                Adapter->BytesTxBroadcast;
+
+            /* Packets out */
+            Statistics->ifHCOutUcastPkts =
+                Adapter->FramesTxDirected;
+
+            Statistics->ifHCOutMulticastPkts =
+                Adapter->FramesTxMulticast;
+
+            Statistics->ifHCOutBroadcastPkts =
+                Adapter->FramesTxBroadcast;
+
+            /* Errors out */
+            Statistics->ifOutErrors =
+                Adapter->TxAbortExcessCollisions +
+                Adapter->TxDmaUnderrun +
+                Adapter->TxLostCRS +
+                Adapter->TxLateCollisions +
+                Adapter->TransmitFailuresOther;
+
+            Statistics->ifOutDiscards = 0ULL;
+
+            ulInfoLen = NDIS_SIZEOF_STATISTICS_INFO_REVISION_1;
+        }
+
+        break;
+
+        // TODO: Implement these query information requests.
+    case OID_GEN_RECEIVE_BUFFER_SPACE:
+    case OID_GEN_MAXIMUM_SEND_PACKETS:
+    case OID_GEN_TRANSMIT_QUEUE_LENGTH:
+    case OID_802_3_XMIT_HEARTBEAT_FAILURE:
+    case OID_802_3_XMIT_TIMES_CRS_LOST:
+    case OID_802_3_XMIT_LATE_COLLISIONS:
+
+    default:
+        //
+        // The entry point may be used by other requests
+        //
+        status = NDIS_STATUS_NOT_SUPPORTED;
+        break;
+    }
+
+    if (status == NDIS_STATUS_SUCCESS)
+    {
+        ASSERT(ulInfoLen > 0);
+
+        if (ulInfoLen <= OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength)
+        {
+            if(pInfo)
+            {
+                // Copy result into InformationBuffer
+                NdisMoveMemory(
+                    OidRequest->DATA.QUERY_INFORMATION.InformationBuffer,
+                    pInfo,
+                    ulInfoLen
+                    );
+            }
+
+            OidRequest->DATA.QUERY_INFORMATION.BytesWritten = ulInfoLen;
+        }
+        else
+        {
+            // too short
+            OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = ulInfoLen;
+            status = NDIS_STATUS_BUFFER_TOO_SHORT;
+        }
+    }
+
+    return status;
+}
+
+NDIS_STATUS
+AdapterOidRequest(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PNDIS_OID_REQUEST OidRequest
+    )
+/*++
+
+Routine Description:
+
+    Entry point called by NDIS to get or set the value of a specified OID.
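+    Set requests are dispatched to tapSetInformation; query and
+    statistics requests are dispatched to tapQueryInformation.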
+
+Arguments:
+
+    MiniportAdapterContext  - Our adapter handle
+    OidRequest              - The OID request to handle
+
+Return Value:
+
+    Return code from the request handler below.
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    NDIS_STATUS status;
+
+    // Dispatch based on request type.
+    switch (OidRequest->RequestType)
+    {
+    case NdisRequestSetInformation:
+        status = tapSetInformation(adapter,OidRequest);
+        break;
+
+    case NdisRequestQueryInformation:
+    case NdisRequestQueryStatistics:
+        status = tapQueryInformation(adapter,OidRequest);
+        break;
+
+    case NdisRequestMethod: // TAP doesn't need to respond to this request type.
+    default:
+        //
+        // The entry point may be used by other requests
+        //
+        status = NDIS_STATUS_NOT_SUPPORTED;
+        break;
+    }
+
+    return status;
+}
+
+VOID
+AdapterCancelOidRequest(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PVOID RequestId
+    )
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    UNREFERENCED_PARAMETER(RequestId);
+
+    //
+    // This miniport sample does not pend any OID requests, so we don't have
+    // to worry about cancelling them.
+    //
+}
+
diff --git a/installer/tap/src/src/proto.h b/installer/tap/src/src/proto.h new file mode 100644 index 0000000..cc23de6 --- /dev/null +++ b/installer/tap/src/src/proto.h @@ -0,0 +1,224 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+//============================================================
+// MAC address, Ethernet header, and ARP
+//============================================================
+
+#pragma pack(1)
+
+#define IP_HEADER_SIZE 20
+#define IPV6_HEADER_SIZE 40
+
+#define MACADDR_SIZE 6
+typedef unsigned char MACADDR[MACADDR_SIZE];
+
+typedef unsigned long IPADDR;
+typedef unsigned char IPV6ADDR[16];
+
+//-----------------
+// Ethernet address
+//-----------------
+
+typedef struct {
+  MACADDR addr;
+} ETH_ADDR;
+
+typedef struct {
+  ETH_ADDR list[TAP_MAX_MCAST_LIST];
+} MC_LIST;
+
+
+// BUGBUG!!! Consider using system defines in netiodef.h!!!
+
+//----------------
+// Ethernet header
+//----------------
+typedef struct
+{
+  MACADDR dest;   /* destination eth addr */
+  MACADDR src;    /* source ether addr    */
+  USHORT proto;   /* packet type ID field */
+} ETH_HEADER, *PETH_HEADER;
+
+//----------------
+// ARP packet
+//----------------
+
+typedef struct
+  {
+    MACADDR m_MAC_Destination;    // Reverse these two
+    MACADDR m_MAC_Source;         // to answer ARP requests
+    USHORT m_Proto;               // 0x0806
+
+#   define MAC_ADDR_TYPE 0x0001
+    USHORT m_MAC_AddressType;     // 0x0001
+
+    USHORT m_PROTO_AddressType;   // 0x0800
+    UCHAR m_MAC_AddressSize;      // 0x06
+    UCHAR m_PROTO_AddressSize;    // 0x04
+
+#   define ARP_REQUEST 0x0001
+#   define ARP_REPLY   0x0002
+    USHORT m_ARP_Operation;       // 0x0001 for ARP request, 0x0002 for ARP reply
+
+    MACADDR m_ARP_MAC_Source;
+    IPADDR m_ARP_IP_Source;
+    MACADDR m_ARP_MAC_Destination;
+    IPADDR m_ARP_IP_Destination;
+  }
+ARP_PACKET, *PARP_PACKET;
+
+//----------
+// IP Header
+//----------
+
+typedef struct {
+# define IPH_GET_VER(v) (((v) >> 4) & 0x0F)
+# define IPH_GET_LEN(v) (((v) & 0x0F) << 2)
+  UCHAR version_len;
+
+  UCHAR tos;
+  USHORT tot_len;
+  USHORT id;
+
+# define IP_OFFMASK 0x1fff
+  USHORT frag_off;
+
+  UCHAR ttl;
+
+# define IPPROTO_UDP  17   /* UDP protocol */
+# define IPPROTO_TCP   6   /* TCP protocol */
+# define IPPROTO_ICMP  1   /* ICMP protocol */
+# define IPPROTO_IGMP  2   /* IGMP protocol */
+  UCHAR protocol;
+
+  USHORT check;
+  ULONG saddr;
+  ULONG daddr;
+  /* The options start here. */
+} IPHDR;
+
+//-----------
+// UDP header
+//-----------
+
+typedef struct {
+  USHORT source;
+  USHORT dest;
+  USHORT len;
+  USHORT check;
+} UDPHDR;
+
+//--------------------------
+// TCP header, per RFC 793.
+//--------------------------
+
+typedef struct {
+  USHORT source;    /* source port */
+  USHORT dest;      /* destination port */
+  ULONG seq;        /* sequence number */
+  ULONG ack_seq;    /* acknowledgement number */
+
+# define TCPH_GET_DOFF(d) (((d) & 0xF0) >> 2)
+  UCHAR doff_res;
+
+# define TCPH_FIN_MASK (1<<0)
+# define TCPH_SYN_MASK (1<<1)
+# define TCPH_RST_MASK (1<<2)
+# define TCPH_PSH_MASK (1<<3)
+# define TCPH_ACK_MASK (1<<4)
+# define TCPH_URG_MASK (1<<5)
+# define TCPH_ECE_MASK (1<<6)
+# define TCPH_CWR_MASK (1<<7)
+  UCHAR flags;
+
+  USHORT window;
+  USHORT check;
+  USHORT urg_ptr;
+} TCPHDR;
+
+#define TCPOPT_EOL     0
+#define TCPOPT_NOP     1
+#define TCPOPT_MAXSEG  2
+#define TCPOLEN_MAXSEG 4
+
+//------------
+// IPv6 Header
+//------------
+
+typedef struct {
+  UCHAR version_prio;
+  UCHAR flow_lbl[3];
+  USHORT payload_len;
+# define IPPROTO_ICMPV6 0x3a   /* ICMP protocol v6 */
+  UCHAR nexthdr;
+  UCHAR hop_limit;
+  IPV6ADDR saddr;
+  IPV6ADDR daddr;
+} IPV6HDR;
+
+//--------------------------------------------
+// ICMPv6 NS/NA Packets (RFC4443 and RFC4861)
+//--------------------------------------------
+
+// Neighbor Solicitation - RFC 4861, 4.3
+// (this is just the ICMPv6 part of the packet)
+typedef struct {
+  UCHAR type;
+# define ICMPV6_TYPE_NS 135   // neighbour solicitation
+  UCHAR code;
+# define ICMPV6_CODE_0  0     // no specific sub-code for NS/NA
+  USHORT checksum;
+  ULONG reserved;
+  IPV6ADDR target_addr;
+} ICMPV6_NS;
+
+// Neighbor Advertisement - RFC 4861, 4.4 + 4.6/4.6.1
+// (this is just the ICMPv6 payload)
+typedef struct {
+  UCHAR type;
+# define ICMPV6_TYPE_NA 136   // neighbour advertisement
+  UCHAR code;
+# define ICMPV6_CODE_0  0     // no specific sub-code for NS/NA
+  USHORT checksum;
+  UCHAR rso_bits;             // Router(0), Solicited(2), Ovrrd(4)
+  UCHAR reserved[3];
+  IPV6ADDR target_addr;
+// always include "Target Link-layer Address" option (RFC 4861 4.6.1)
+  UCHAR opt_type;
+#define ICMPV6_OPTION_TLLA 2
+  UCHAR opt_length;
+#define ICMPV6_LENGTH_TLLA 1   // multiplied by 8 -> 1 = 8 bytes
+  MACADDR target_macaddr;
+} ICMPV6_NA;
+
+// this is the complete packet with Ethernet and IPv6 headers
+typedef struct {
+  ETH_HEADER eth;
+  IPV6HDR ipv6;
+  ICMPV6_NA icmpv6;
+} ICMPV6_NA_PKT;
+
+#pragma pack()
diff --git a/installer/tap/src/src/prototypes.h b/installer/tap/src/src/prototypes.h new file mode 100644 index 0000000..ad70261 --- /dev/null +++ b/installer/tap/src/src/prototypes.h @@ -0,0 +1,87 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef TAP_PROTOTYPES_DEFINED
+#define TAP_PROTOTYPES_DEFINED
+
+DRIVER_INITIALIZE DriverEntry;
+
+//VOID AdapterFreeResources
+//   (
+//    TapAdapterPointer p_Adapter
+//   );
+//
+
+//
+//NTSTATUS TapDeviceHook
+//   (
+//    IN PDEVICE_OBJECT p_DeviceObject,
+//    IN PIRP p_IRP
+//   );
+//
+
+NDIS_STATUS
+CreateTapDevice(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    );
+
+VOID
+DestroyTapDevice(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    );
+
+// Flush the pending send TAP packet queue.
+VOID
+tapFlushSendPacketQueue(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    );
+
+VOID
+IndicateReceivePacket(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PUCHAR packetData,
+    __in const unsigned int packetLength
+    );
+
+BOOLEAN
+ProcessDHCP(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in const ETH_HEADER *eth,
+    __in const IPHDR *ip,
+    __in const UDPHDR *udp,
+    __in const DHCP *dhcp,
+    __in int optlen
+    );
+
+BOOLEAN
+ProcessARP(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in const PARP_PACKET src,
+    __in const IPADDR adapter_ip,
+    __in const IPADDR ip_network,
+    __in const IPADDR ip_netmask,
+    __in const MACADDR mac
+    );
+
+#endif
diff --git a/installer/tap/src/src/resource.rc b/installer/tap/src/src/resource.rc new file mode 100644 index 0000000..fbe2775 --- /dev/null +++ b/installer/tap/src/src/resource.rc @@ -0,0 +1,62 @@
+#include <windows.h>
+#include <ntverp.h>
+
+#include "config.h"
+
+#undef VER_PRODUCTVERSION
+#undef VER_PRODUCTVERSION_STR
+#undef VER_COMPANYNAME_STR
+#undef VER_PRODUCTNAME_STR
+
+/* VER_FILETYPE, VER_FILESUBTYPE, VER_FILEDESCRIPTION_STR
+ * and VER_INTERNALNAME_STR must be defined before including COMMON.VER
+ * The strings don't need a '\0', since common.ver has them.
+ */ + +#define VER_FILETYPE VFT_DRV +/* possible values: VFT_UNKNOWN + VFT_APP + VFT_DLL + VFT_DRV + VFT_FONT + VFT_VXD + VFT_STATIC_LIB +*/ +#define VER_FILESUBTYPE VFT2_DRV_NETWORK +/* possible values VFT2_UNKNOWN + VFT2_DRV_PRINTER + VFT2_DRV_KEYBOARD + VFT2_DRV_LANGUAGE + VFT2_DRV_DISPLAY + VFT2_DRV_MOUSE + VFT2_DRV_NETWORK + VFT2_DRV_SYSTEM + VFT2_DRV_INSTALLABLE + VFT2_DRV_SOUND + VFT2_DRV_COMM +*/ + +#define VER_COMPANYNAME_STR "The OpenVPN Project" +#define VER_FILEDESCRIPTION_STR "TAP-Windows Virtual Network Driver (NDIS 6.0)" +#define VER_ORIGINALFILENAME_STR PRODUCT_TAP_WIN_COMPONENT_ID ".sys" +#define VER_LEGALCOPYRIGHT_YEARS "2003-2014" +#define VER_LEGALCOPYRIGHT_STR "OpenVPN Technologies, Inc." + + +#define VER_PRODUCTNAME_STR VER_FILEDESCRIPTION_STR +#define VER_PRODUCTVERSION PRODUCT_TAP_WIN_MAJOR,00,00,PRODUCT_TAP_WIN_MINOR + +#define XSTR(s) STR(s) +#define STR(s) #s + +#define VSTRING PRODUCT_VERSION " " XSTR(PRODUCT_TAP_WIN_MAJOR) "/" XSTR(PRODUCT_TAP_WIN_MINOR) + +#ifdef DBG +#define VER_PRODUCTVERSION_STR VSTRING " (DEBUG)" +#else +#define VER_PRODUCTVERSION_STR VSTRING +#endif + +#define VER_INTERNALNAME_STR VER_ORIGINALFILENAME_STR + +#include "common.ver" diff --git a/installer/tap/src/src/rxpath.c b/installer/tap/src/src/rxpath.c new file mode 100644 index 0000000..7415b5e --- /dev/null +++ b/installer/tap/src/src/rxpath.c @@ -0,0 +1,667 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. +// + +#include "tap.h" + +//====================================================================== +// TAP Receive Path Support +//====================================================================== + +#ifdef ALLOC_PRAGMA +#pragma alloc_text( PAGE, TapDeviceWrite) +#endif // ALLOC_PRAGMA + +//=============================================================== +// Used in cases where internally generated packets such as +// ARP or DHCP replies must be returned to the kernel, to be +// seen as an incoming packet "arriving" on the interface. +//=============================================================== + +VOID +IndicateReceivePacket( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PUCHAR packetData, + __in const unsigned int packetLength + ) +{ + PUCHAR injectBuffer; + + // + // Handle miniport Pause + // --------------------- + // NDIS 6 miniports implement a temporary "Pause" state normally followed + // by the Restart. While in the Pause state it is forbidden for the miniport + // to indicate receive NBLs. 
+ // + // That is: The device interface may be "up", but the NDIS miniport send/receive + // interface may be temporarily "down". + // + // BUGBUG!!! In the initial implementation of the NDIS 6 TapOas inject path + // the code below will simply ignore inject packets passed to the driver while + // the miniport is in the Paused state. + // + // The correct implementation is to go ahead and build the NBLs corresponding + // to the inject packet - but queue them. When Restart is entered the + // queued NBLs would be dequeued and indicated to the host. + // + if(tapAdapterSendAndReceiveReady(Adapter) != NDIS_STATUS_SUCCESS) + { + DEBUGP (("[%s] Lying send in IndicateReceivePacket while adapter paused\n", + MINIPORT_INSTANCE_ID (Adapter))); + + return; + } + + // Allocate flat buffer for packet data. + injectBuffer = (PUCHAR )NdisAllocateMemoryWithTagPriority( + Adapter->MiniportAdapterHandle, + packetLength, + TAP_RX_INJECT_BUFFER_TAG, + NormalPoolPriority + ); + + if( injectBuffer) + { + PMDL mdl; + + // Copy packet data to flat buffer. + NdisMoveMemory (injectBuffer, packetData, packetLength); + + // Allocate MDL for flat buffer. + mdl = NdisAllocateMdl( + Adapter->MiniportAdapterHandle, + injectBuffer, + packetLength + ); + + if( mdl ) + { + PNET_BUFFER_LIST netBufferList; + + mdl->Next = NULL; // No next MDL + + // Allocate the NBL and NB. Link MDL chain to NB. + netBufferList = NdisAllocateNetBufferAndNetBufferList( + Adapter->ReceiveNblPool, + 0, // ContextSize + 0, // ContextBackFill + mdl, // MDL chain + 0, + packetLength + ); + + if(netBufferList != NULL) + { + ULONG receiveFlags = 0; + LONG nblCount; + + NET_BUFFER_LIST_NEXT_NBL(netBufferList) = NULL; // Only one NBL + + if(KeGetCurrentIrql() == DISPATCH_LEVEL) + { + receiveFlags |= NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL; + } + + // Set flag indicating that this is an injected packet + TAP_RX_NBL_FLAGS_CLEAR_ALL(netBufferList); + TAP_RX_NBL_FLAG_SET(netBufferList,TAP_RX_NBL_FLAGS_IS_INJECTED); + + netBufferList->MiniportReserved[0] = NULL; + netBufferList->MiniportReserved[1] = NULL; + + // Increment in-flight receive NBL count. + nblCount = NdisInterlockedIncrement(&Adapter->ReceiveNblInFlightCount); + ASSERT(nblCount > 0 ); + + netBufferList->SourceHandle = Adapter->MiniportAdapterHandle; + + // + // Indicate the packet + // ------------------- + // Irp->AssociatedIrp.SystemBuffer with length irpSp->Parameters.Write.Length + // contains the complete packet including Ethernet header and payload. + // + NdisMIndicateReceiveNetBufferLists( + Adapter->MiniportAdapterHandle, + netBufferList, + NDIS_DEFAULT_PORT_NUMBER, + 1, // NumberOfNetBufferLists + receiveFlags + ); + + return; + } + else + { + DEBUGP (("[%s] NdisAllocateNetBufferAndNetBufferList failed in IndicateReceivePacket\n", + MINIPORT_INSTANCE_ID (Adapter))); + NOTE_ERROR (); + + NdisFreeMdl(mdl); + NdisFreeMemory(injectBuffer,0,0); + } + } + else + { + DEBUGP (("[%s] NdisAllocateMdl failed in IndicateReceivePacket\n", + MINIPORT_INSTANCE_ID (Adapter))); + NOTE_ERROR (); + + NdisFreeMemory(injectBuffer,0,0); + } + } + else + { + DEBUGP (("[%s] NdisAllocateMemoryWithTagPriority failed in IndicateReceivePacket\n", + MINIPORT_INSTANCE_ID (Adapter))); + NOTE_ERROR (); + } +} + +VOID +tapCompleteIrpAndFreeReceiveNetBufferList( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PNET_BUFFER_LIST NetBufferList, // Only one NB here... + __in NTSTATUS IoCompletionStatus + ) +{ + PIRP irp; + ULONG frameType, netBufferCount, byteCount; + LONG nblCount; + + // Fetch NB frame type. 
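+    // (Directed, multicast, or broadcast -- classified from the
+    // destination MAC address of the frame.)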
frameType = tapGetNetBufferFrameType(NET_BUFFER_LIST_FIRST_NB(NetBufferList));
+
+    // Fetch statistics for all NBs linked to the NBL.
+    netBufferCount = tapGetNetBufferCountsFromNetBufferList(
+                        NetBufferList,
+                        &byteCount
+                        );
+
+    // Update statistics by frame type
+    if(IoCompletionStatus == STATUS_SUCCESS)
+    {
+        switch(frameType)
+        {
+        case NDIS_PACKET_TYPE_DIRECTED:
+            Adapter->FramesRxDirected += netBufferCount;
+            Adapter->BytesRxDirected += byteCount;
+            break;
+
+        case NDIS_PACKET_TYPE_BROADCAST:
+            Adapter->FramesRxBroadcast += netBufferCount;
+            Adapter->BytesRxBroadcast += byteCount;
+            break;
+
+        case NDIS_PACKET_TYPE_MULTICAST:
+            Adapter->FramesRxMulticast += netBufferCount;
+            Adapter->BytesRxMulticast += byteCount;
+            break;
+
+        default:
+            ASSERT(FALSE);
+            break;
+        }
+    }
+
+    //
+    // Handle P2P Packet
+    // -----------------
+    // Free MDL allocated for P2P Ethernet header.
+    //
+    if(TAP_RX_NBL_FLAG_TEST(NetBufferList,TAP_RX_NBL_FLAGS_IS_P2P))
+    {
+        PNET_BUFFER netBuffer;
+        PMDL mdl;
+
+        netBuffer = NET_BUFFER_LIST_FIRST_NB(NetBufferList);
+        mdl = NET_BUFFER_FIRST_MDL(netBuffer);
+        mdl->Next = NULL;
+
+        NdisFreeMdl(mdl);
+    }
+
+    //
+    // Handle Injected Packet
+    // -----------------------
+    // Free MDL and data buffer allocated for injected packet.
+    //
+    if(TAP_RX_NBL_FLAG_TEST(NetBufferList,TAP_RX_NBL_FLAGS_IS_INJECTED))
+    {
+        PNET_BUFFER netBuffer;
+        PMDL mdl;
+        PUCHAR injectBuffer;
+
+        netBuffer = NET_BUFFER_LIST_FIRST_NB(NetBufferList);
+        mdl = NET_BUFFER_FIRST_MDL(netBuffer);
+
+        injectBuffer = (PUCHAR )MmGetSystemAddressForMdlSafe(mdl,NormalPagePriority);
+
+        if(injectBuffer)
+        {
+            NdisFreeMemory(injectBuffer,0,0);
+        }
+
+        NdisFreeMdl(mdl);
+    }
+
+    //
+    // Complete the IRP
+    //
+    irp = (PIRP )NetBufferList->MiniportReserved[0];
+
+    if(irp)
+    {
+        irp->IoStatus.Status = IoCompletionStatus;
+        IoCompleteRequest(irp, IO_NO_INCREMENT);
+    }
+
+    // Decrement in-flight receive NBL count.
+    nblCount = NdisInterlockedDecrement(&Adapter->ReceiveNblInFlightCount);
+    ASSERT(nblCount >= 0 );
+    if (0 == nblCount)
+    {
+        NdisSetEvent(&Adapter->ReceiveNblInFlightCountZeroEvent);
+    }
+
+    // Free the NBL
+    NdisFreeNetBufferList(NetBufferList);
+}
+
+VOID
+AdapterReturnNetBufferLists(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PNET_BUFFER_LIST NetBufferLists,
+    __in ULONG ReturnFlags
+    )
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    PNET_BUFFER_LIST currentNbl, nextNbl;
+
+    UNREFERENCED_PARAMETER(ReturnFlags);
+
+    //
+    // Process each NBL individually
+    //
+    currentNbl = NetBufferLists;
+    while (currentNbl)
+    {
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+        NET_BUFFER_LIST_NEXT_NBL(currentNbl) = NULL;
+
+        // Complete write IRP and free NBL and associated resources.
+        tapCompleteIrpAndFreeReceiveNetBufferList(
+            adapter,
+            currentNbl,
+            STATUS_SUCCESS
+            );
+
+        // Move to next NBL
+        currentNbl = nextNbl;
+    }
+}
+
+// IRP_MJ_WRITE callback.
+NTSTATUS
+TapDeviceWrite(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+{
+    NTSTATUS ntStatus = STATUS_SUCCESS;     // Assume success
+    PIO_STACK_LOCATION irpSp;               // Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT adapter = NULL;
+    ULONG dataLength;
+
+    PAGED_CODE();
+
+    irpSp = IoGetCurrentIrpStackLocation( Irp );
+
+    //
+    // Fetch adapter context for this device.
+    // --------------------------------------
+    // Adapter pointer was stashed in FsContext when handle was opened.
+ // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + ASSERT(adapter); + + // + // Sanity checks on state variables + // + if (!tapAdapterReadAndWriteReady(adapter)) + { + //DEBUGP (("[%s] Interface is down in IRP_MJ_WRITE\n", + // MINIPORT_INSTANCE_ID (adapter))); + //NOTE_ERROR(); + + Irp->IoStatus.Status = ntStatus = STATUS_CANCELLED; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // Save IRP-accessible copy of buffer length + Irp->IoStatus.Information = irpSp->Parameters.Write.Length; + + if (Irp->MdlAddress == NULL) + { + DEBUGP (("[%s] MdlAddress is NULL for IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // + // Try to get a virtual address for the MDL. + // + NdisQueryMdl( + Irp->MdlAddress, + &Irp->AssociatedIrp.SystemBuffer, + &dataLength, + NormalPagePriority + ); + + if (Irp->AssociatedIrp.SystemBuffer == NULL) + { + DEBUGP (("[%s] Could not map address in IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INSUFFICIENT_RESOURCES; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + ASSERT(dataLength == irpSp->Parameters.Write.Length); + + Irp->IoStatus.Information = irpSp->Parameters.Write.Length; + + // + // Handle miniport Pause + // --------------------- + // NDIS 6 miniports implement a temporary "Pause" state normally followed + // by the Restart. While in the Pause state it is forbidden for the miniport + // to indicate receive NBLs. + // + // That is: The device interface may be "up", but the NDIS miniport send/receive + // interface may be temporarily "down". + // + // BUGBUG!!! In the initial implementation of the NDIS 6 TapOas receive path + // the code below will perform a "lying send" for write IRPs passed to the + // driver while the miniport is in the Paused state. + // + // The correct implementation is to go ahead and build the NBLs corresponding + // to the user-mode write - but queue them. When Restart is entered the + // queued NBLs would be dequeued and indicated to the host. + // + if(tapAdapterSendAndReceiveReady(adapter) == NDIS_STATUS_SUCCESS) + { + if (!adapter->m_tun && ((irpSp->Parameters.Write.Length) >= ETHERNET_HEADER_SIZE)) + { + PNET_BUFFER_LIST netBufferList; + + DUMP_PACKET ("IRP_MJ_WRITE ETH", + (unsigned char *) Irp->AssociatedIrp.SystemBuffer, + irpSp->Parameters.Write.Length); + + //===================================================== + // If IPv4 packet, check whether or not packet + // was truncated. + //===================================================== +#if PACKET_TRUNCATION_CHECK + IPv4PacketSizeVerify ( + (unsigned char *) Irp->AssociatedIrp.SystemBuffer, + irpSp->Parameters.Write.Length, + FALSE, + "RX", + &adapter->m_RxTrunc + ); +#endif + (Irp->MdlAddress)->Next = NULL; // No next MDL + + // Allocate the NBL and NB. Link MDL chain to NB. + netBufferList = NdisAllocateNetBufferAndNetBufferList( + adapter->ReceiveNblPool, + 0, // ContextSize + 0, // ContextBackFill + Irp->MdlAddress, // MDL chain + 0, + dataLength + ); + + if(netBufferList != NULL) + { + LONG nblCount; + + NET_BUFFER_LIST_NEXT_NBL(netBufferList) = NULL; // Only one NBL + + // Stash IRP pointer in NBL MiniportReserved[0] field. 
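+                // The IRP is completed later, in
+                // tapCompleteIrpAndFreeReceiveNetBufferList, once NDIS
+                // returns the NBL through AdapterReturnNetBufferLists.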
netBufferList->MiniportReserved[0] = Irp;
+                netBufferList->MiniportReserved[1] = NULL;
+
+                // This IRP is pended.
+                IoMarkIrpPending(Irp);
+
+                // This IRP cannot be cancelled while in-flight.
+                IoSetCancelRoutine(Irp,NULL);
+
+                TAP_RX_NBL_FLAGS_CLEAR_ALL(netBufferList);
+
+                // Increment in-flight receive NBL count.
+                nblCount = NdisInterlockedIncrement(&adapter->ReceiveNblInFlightCount);
+                ASSERT(nblCount > 0 );
+
+                //
+                // Indicate the packet
+                // -------------------
+                // Irp->AssociatedIrp.SystemBuffer with length irpSp->Parameters.Write.Length
+                // contains the complete packet including Ethernet header and payload.
+                //
+                NdisMIndicateReceiveNetBufferLists(
+                    adapter->MiniportAdapterHandle,
+                    netBufferList,
+                    NDIS_DEFAULT_PORT_NUMBER,
+                    1,      // NumberOfNetBufferLists
+                    0       // ReceiveFlags
+                    );
+
+                ntStatus = STATUS_PENDING;
+            }
+            else
+            {
+                DEBUGP (("[%s] NdisAllocateNetBufferAndNetBufferList failed in IRP_MJ_WRITE\n",
+                    MINIPORT_INSTANCE_ID (adapter)));
+                NOTE_ERROR ();
+
+                // Fail the IRP
+                Irp->IoStatus.Information = 0;
+                ntStatus = STATUS_INSUFFICIENT_RESOURCES;
+            }
+        }
+        else if (adapter->m_tun && ((irpSp->Parameters.Write.Length) >= IP_HEADER_SIZE))
+        {
+            PETH_HEADER p_UserToTap = &adapter->m_UserToTap;
+            PMDL mdl;   // Head of MDL chain.
+
+            // For IPv6, need to use Ethernet header with IPv6 proto
+            if ( IPH_GET_VER( ((IPHDR*) Irp->AssociatedIrp.SystemBuffer)->version_len) == 6 )
+            {
+                p_UserToTap = &adapter->m_UserToTap_IPv6;
+            }
+
+            DUMP_PACKET2 ("IRP_MJ_WRITE P2P",
+                p_UserToTap,
+                (unsigned char *) Irp->AssociatedIrp.SystemBuffer,
+                irpSp->Parameters.Write.Length);
+
+            //=====================================================
+            // If IPv4 packet, check whether or not packet
+            // was truncated.
+            //=====================================================
+#if PACKET_TRUNCATION_CHECK
+            IPv4PacketSizeVerify (
+                (unsigned char *) Irp->AssociatedIrp.SystemBuffer,
+                irpSp->Parameters.Write.Length,
+                TRUE,
+                "RX",
+                &adapter->m_RxTrunc
+                );
+#endif
+
+            //
+            // Allocate MDL for Ethernet header
+            // --------------------------------
+            // Irp->AssociatedIrp.SystemBuffer with length irpSp->Parameters.Write.Length
+            // contains only the Ethernet payload. Prepend the user-mode provided
+            // payload with the Ethernet header pointed to by p_UserToTap.
+            //
+            mdl = NdisAllocateMdl(
+                    adapter->MiniportAdapterHandle,
+                    p_UserToTap,
+                    sizeof(ETH_HEADER)
+                    );
+
+            if(mdl != NULL)
+            {
+                PNET_BUFFER_LIST netBufferList;
+
+                // Chain user's Ethernet payload behind Ethernet header.
+                mdl->Next = Irp->MdlAddress;
+                (Irp->MdlAddress)->Next = NULL;     // No next MDL
+
+                // Allocate the NBL and NB. Link MDL chain to NB.
+                netBufferList = NdisAllocateNetBufferAndNetBufferList(
+                    adapter->ReceiveNblPool,
+                    0,                  // ContextSize
+                    0,                  // ContextBackFill
+                    mdl,                // MDL chain
+                    0,
+                    sizeof(ETH_HEADER) + dataLength
+                    );
+
+                if(netBufferList != NULL)
+                {
+                    LONG nblCount;
+
+                    NET_BUFFER_LIST_NEXT_NBL(netBufferList) = NULL; // Only one NBL
+
+                    // This IRP is pended.
+                    IoMarkIrpPending(Irp);
+
+                    // This IRP cannot be cancelled while in-flight.
+                    IoSetCancelRoutine(Irp,NULL);
+
+                    // Stash IRP pointer in NBL MiniportReserved[0] field.
+                    netBufferList->MiniportReserved[0] = Irp;
+                    netBufferList->MiniportReserved[1] = NULL;
+
+                    // Set flag indicating that this is P2P packet
+                    TAP_RX_NBL_FLAGS_CLEAR_ALL(netBufferList);
+                    TAP_RX_NBL_FLAG_SET(netBufferList,TAP_RX_NBL_FLAGS_IS_P2P);
+
+                    // Increment in-flight receive NBL count.
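+                    // Balanced by NdisInterlockedDecrement in
+                    // tapCompleteIrpAndFreeReceiveNetBufferList; when the
+                    // count drops to zero, ReceiveNblInFlightCountZeroEvent
+                    // is signaled.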
+ nblCount = NdisInterlockedIncrement(&adapter->ReceiveNblInFlightCount); + ASSERT(nblCount > 0 ); + + // + // Indicate the packet + // + NdisMIndicateReceiveNetBufferLists( + adapter->MiniportAdapterHandle, + netBufferList, + NDIS_DEFAULT_PORT_NUMBER, + 1, // NumberOfNetBufferLists + 0 // ReceiveFlags + ); + + ntStatus = STATUS_PENDING; + } + else + { + mdl->Next = NULL; + NdisFreeMdl(mdl); + + DEBUGP (("[%s] NdisMIndicateReceiveNetBufferLists failed in IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + NOTE_ERROR (); + + // Fail the IRP + Irp->IoStatus.Information = 0; + ntStatus = STATUS_INSUFFICIENT_RESOURCES; + } + } + else + { + DEBUGP (("[%s] NdisAllocateMdl failed in IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + NOTE_ERROR (); + + // Fail the IRP + Irp->IoStatus.Information = 0; + ntStatus = STATUS_INSUFFICIENT_RESOURCES; + } + } + else + { + DEBUGP (("[%s] Bad buffer size in IRP_MJ_WRITE, len=%d\n", + MINIPORT_INSTANCE_ID (adapter), + irpSp->Parameters.Write.Length)); + NOTE_ERROR (); + + Irp->IoStatus.Information = 0; // ETHERNET_HEADER_SIZE; + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + else + { + DEBUGP (("[%s] Lying send in IRP_MJ_WRITE while adapter paused\n", + MINIPORT_INSTANCE_ID (adapter))); + + ntStatus = STATUS_SUCCESS; + } + + if (ntStatus != STATUS_PENDING) + { + Irp->IoStatus.Status = ntStatus; + IoCompleteRequest(Irp, IO_NO_INCREMENT); + } + + return ntStatus; +} + diff --git a/installer/tap/src/src/tap-windows.h b/installer/tap/src/src/tap-windows.h new file mode 100644 index 0000000..9971534 --- /dev/null +++ b/installer/tap/src/src/tap-windows.h @@ -0,0 +1,75 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). This particular file + * (tap-windows.h) is also licensed using the MIT license (see COPYRIGHT.MIT). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#ifndef __TAP_WIN_H +#define __TAP_WIN_H + +/* + * ============= + * TAP IOCTLs + * ============= + */ + +#define TAP_WIN_CONTROL_CODE(request,method) \ + CTL_CODE (FILE_DEVICE_UNKNOWN, request, method, FILE_ANY_ACCESS) + +/* Present in 8.1 */ + +#define TAP_WIN_IOCTL_GET_MAC TAP_WIN_CONTROL_CODE (1, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_VERSION TAP_WIN_CONTROL_CODE (2, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_MTU TAP_WIN_CONTROL_CODE (3, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_INFO TAP_WIN_CONTROL_CODE (4, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT TAP_WIN_CONTROL_CODE (5, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_SET_MEDIA_STATUS TAP_WIN_CONTROL_CODE (6, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_CONFIG_DHCP_MASQ TAP_WIN_CONTROL_CODE (7, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_LOG_LINE TAP_WIN_CONTROL_CODE (8, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_CONFIG_DHCP_SET_OPT TAP_WIN_CONTROL_CODE (9, METHOD_BUFFERED) + +/* Added in 8.2 */ + +/* obsoletes TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT */ +#define TAP_WIN_IOCTL_CONFIG_TUN TAP_WIN_CONTROL_CODE (10, METHOD_BUFFERED) + +/* + * ================= + * Registry keys + * ================= + */ + +#define ADAPTER_KEY "SYSTEM\\CurrentControlSet\\Control\\Class\\{4D36E972-E325-11CE-BFC1-08002BE10318}" + +#define NETWORK_CONNECTIONS_KEY "SYSTEM\\CurrentControlSet\\Control\\Network\\{4D36E972-E325-11CE-BFC1-08002BE10318}" + +/* + * ====================== + * Filesystem prefixes + * ====================== + */ + +#define USERMODEDEVICEDIR "\\\\.\\Global\\" +#define SYSDEVICEDIR "\\Device\\" +#define USERDEVICEDIR "\\DosDevices\\Global\\" +#define TAP_WIN_SUFFIX ".tap" + +#endif // __TAP_WIN_H diff --git a/installer/tap/src/src/tap.h b/installer/tap/src/src/tap.h new file mode 100644 index 0000000..ded959b --- /dev/null +++ b/installer/tap/src/src/tap.h @@ -0,0 +1,83 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#ifndef __TAP_H +#define __TAP_H + +#include +#include +#include +#include + +#include "config.h" +#include "lock.h" +#include "constants.h" +#include "proto.h" +#include "mem.h" +#include "macinfo.h" +#include "dhcp.h" +#include "error.h" +#include "endian.h" +#include "dhcp.h" +#include "types.h" +#include "adapter.h" +#include "device.h" +#include "prototypes.h" +#include "tap-windows.h" + +//======================================================== +// Check for truncated IPv4 packets, log errors if found. +//======================================================== +#define PACKET_TRUNCATION_CHECK 0 + +//======================================================== +// EXPERIMENTAL -- Configure TAP device object to be +// accessible from non-administrative accounts, based +// on an advanced properties setting. +// +// Duplicates the functionality of OpenVPN's +// --allow-nonadmin directive. +//======================================================== +#define ENABLE_NONADMIN 1 + +// +// The driver has exactly one instance of the TAP_GLOBAL structure. NDIS keeps +// an opaque handle to this data, (it doesn't attempt to read or interpret this +// data), and it passes the handle back to the miniport in MiniportSetOptions +// and MiniportInitializeEx. +// +typedef struct _TAP_GLOBAL +{ + LIST_ENTRY AdapterList; + + NDIS_RW_LOCK Lock; + + NDIS_HANDLE NdisDriverHandle; // From NdisMRegisterMiniportDriver + +} TAP_GLOBAL, *PTAP_GLOBAL; + + +// Global data +extern TAP_GLOBAL GlobalData; + +#endif // __TAP_H diff --git a/installer/tap/src/src/tapdrvr.c b/installer/tap/src/src/tapdrvr.c new file mode 100644 index 0000000..6c537f1 --- /dev/null +++ b/installer/tap/src/src/tapdrvr.c @@ -0,0 +1,232 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//====================================================== +// This driver is designed to work on Windows Vista or higher +// versions of Windows. +// +// It is SMP-safe and handles power management. +// +// By default we operate as a "tap" virtual ethernet +// 802.3 interface, but we can emulate a "tun" +// interface (point-to-point IPv4) through the +// TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT or +// TAP_WIN_IOCTL_CONFIG_TUN ioctl. +//====================================================== + +// +// Include files. 
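+//
+// The ioctls mentioned above are issued from user mode with DeviceIoControl
+// against a handle opened on the device (see the filesystem prefixes in
+// tap-windows.h). A hedged sketch, with buffer layouts assumed from OpenVPN's
+// user-mode usage rather than taken from this file:
+//
+//     ULONG on = TRUE, len;
+//     DeviceIoControl(h, TAP_WIN_IOCTL_SET_MEDIA_STATUS,
+//                     &on, sizeof(on), &on, sizeof(on), &len, NULL);
+//
+//     ULONG tun[3] = { localIp, localIp & netmask, netmask };  // net order
+//     DeviceIoControl(h, TAP_WIN_IOCTL_CONFIG_TUN,
+//                     tun, sizeof(tun), tun, sizeof(tun), &len, NULL);
+//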
+//
+
+#include 
+
+#include "tap.h"
+
+
+// Global data
+TAP_GLOBAL GlobalData;
+
+
+#ifdef ALLOC_PRAGMA
+#pragma alloc_text( INIT, DriverEntry )
+#pragma alloc_text( PAGE, TapDriverUnload)
+#endif // ALLOC_PRAGMA
+
+NTSTATUS
+DriverEntry(
+    __in PDRIVER_OBJECT DriverObject,
+    __in PUNICODE_STRING RegistryPath
+    )
+/*++
+Routine Description:
+
+    In the context of its DriverEntry function, a miniport driver associates
+    itself with NDIS, specifies the NDIS version that it is using, and
+    registers its entry points.
+
+
+Arguments:
+    PVOID DriverObject - pointer to the driver object.
+    PVOID RegistryPath - pointer to the driver registry path.
+
+Return Value:
+
+    NTSTATUS code
+
+--*/
+{
+    NTSTATUS status;
+
+    UNREFERENCED_PARAMETER(RegistryPath);
+
+    DEBUGP (("[TAP] --> DriverEntry; version [%d.%d] %s %s\n",
+        TAP_DRIVER_MAJOR_VERSION,
+        TAP_DRIVER_MINOR_VERSION,
+        __DATE__,
+        __TIME__));
+
+    DEBUGP (("[TAP] Registry Path: '%wZ'\n", RegistryPath));
+
+    //
+    // Initialize any driver-global variables here.
+    //
+    NdisZeroMemory(&GlobalData, sizeof(GlobalData));
+
+    //
+    // The AdapterList in the GlobalData structure is used to track multiple
+    // adapters controlled by this miniport.
+    //
+    NdisInitializeListHead(&GlobalData.AdapterList);
+
+    //
+    // This lock protects the AdapterList.
+    //
+    NdisInitializeReadWriteLock(&GlobalData.Lock);
+
+    do
+    {
+        NDIS_MINIPORT_DRIVER_CHARACTERISTICS miniportCharacteristics;
+
+        NdisZeroMemory(&miniportCharacteristics, sizeof(miniportCharacteristics));
+
+        {C_ASSERT(sizeof(miniportCharacteristics) >= NDIS_SIZEOF_MINIPORT_DRIVER_CHARACTERISTICS_REVISION_2);}
+        miniportCharacteristics.Header.Type = NDIS_OBJECT_TYPE_MINIPORT_DRIVER_CHARACTERISTICS;
+        miniportCharacteristics.Header.Size = NDIS_SIZEOF_MINIPORT_DRIVER_CHARACTERISTICS_REVISION_2;
+        miniportCharacteristics.Header.Revision = NDIS_MINIPORT_DRIVER_CHARACTERISTICS_REVISION_2;
+
+        miniportCharacteristics.MajorNdisVersion = TAP_NDIS_MAJOR_VERSION;
+        miniportCharacteristics.MinorNdisVersion = TAP_NDIS_MINOR_VERSION;
+
+        miniportCharacteristics.MajorDriverVersion = TAP_DRIVER_MAJOR_VERSION;
+        miniportCharacteristics.MinorDriverVersion = TAP_DRIVER_MINOR_VERSION;
+
+        miniportCharacteristics.Flags = 0;
+
+        //miniportCharacteristics.SetOptionsHandler = MPSetOptions; // Optional
+        miniportCharacteristics.InitializeHandlerEx = AdapterCreate;
+        miniportCharacteristics.HaltHandlerEx = AdapterHalt;
+        miniportCharacteristics.UnloadHandler = TapDriverUnload;
+        miniportCharacteristics.PauseHandler = AdapterPause;
+        miniportCharacteristics.RestartHandler = AdapterRestart;
+        miniportCharacteristics.OidRequestHandler = AdapterOidRequest;
+        miniportCharacteristics.SendNetBufferListsHandler = AdapterSendNetBufferLists;
+        miniportCharacteristics.ReturnNetBufferListsHandler = AdapterReturnNetBufferLists;
+        miniportCharacteristics.CancelSendHandler = AdapterCancelSend;
+        miniportCharacteristics.CheckForHangHandlerEx = AdapterCheckForHangEx;
+        miniportCharacteristics.ResetHandlerEx = AdapterReset;
+        miniportCharacteristics.DevicePnPEventNotifyHandler = AdapterDevicePnpEventNotify;
+        miniportCharacteristics.ShutdownHandlerEx = AdapterShutdownEx;
+        miniportCharacteristics.CancelOidRequestHandler = AdapterCancelOidRequest;
+
+        //
+        // Associate the miniport driver with NDIS by calling
+        // NdisMRegisterMiniportDriver. This function returns an NdisDriverHandle.
+        // The miniport driver must retain this handle but it should never attempt
+        // to access or interpret this handle.
+ // + // By calling NdisMRegisterMiniportDriver, the driver indicates that it + // is ready for NDIS to call the driver's MiniportSetOptions and + // MiniportInitializeEx handlers. + // + DEBUGP (("[TAP] Calling NdisMRegisterMiniportDriver...\n")); + //NDIS_DECLARE_MINIPORT_DRIVER_CONTEXT(TAP_GLOBAL); + status = NdisMRegisterMiniportDriver( + DriverObject, + RegistryPath, + &GlobalData, + &miniportCharacteristics, + &GlobalData.NdisDriverHandle + ); + + if (NDIS_STATUS_SUCCESS == status) + { + DEBUGP (("[TAP] Registered miniport successfully\n")); + } + else + { + DEBUGP(("[TAP] NdisMRegisterMiniportDriver failed: %8.8X\n", status)); + TapDriverUnload(DriverObject); + status = NDIS_STATUS_FAILURE; + break; + } + } while(FALSE); + + DEBUGP (("[TAP] <-- DriverEntry; status = %8.8X\n",status)); + + return status; +} + +VOID +TapDriverUnload( + __in PDRIVER_OBJECT DriverObject + ) +/*++ + +Routine Description: + + The unload handler is called during driver unload to free up resources + acquired in DriverEntry. This handler is registered in DriverEntry through + NdisMRegisterMiniportDriver. Note that an unload handler differs from + a MiniportHalt function in that this unload handler releases resources that + are global to the driver, while the halt handler releases resource for a + particular adapter. + + Runs at IRQL = PASSIVE_LEVEL. + +Arguments: + + DriverObject Not used + +Return Value: + + None. + +--*/ +{ + PDEVICE_OBJECT deviceObject = DriverObject->DeviceObject; + UNICODE_STRING uniWin32NameString; + + DEBUGP (("[TAP] --> TapDriverUnload; version [%d.%d] %s %s unloaded\n", + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION, + __DATE__, + __TIME__ + )); + + PAGED_CODE(); + + // + // Clean up all globals that were allocated in DriverEntry + // + + ASSERT(IsListEmpty(&GlobalData.AdapterList)); + + if(GlobalData.NdisDriverHandle != NULL ) + { + NdisMDeregisterMiniportDriver(GlobalData.NdisDriverHandle); + } + + DEBUGP (("[TAP] <-- TapDriverUnload\n")); +} + diff --git a/installer/tap/src/src/txpath.c b/installer/tap/src/src/txpath.c new file mode 100644 index 0000000..f627934 --- /dev/null +++ b/installer/tap/src/src/txpath.c @@ -0,0 +1,1166 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. 
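+//
+// One property of the RFC 1071 one's-complement sum implemented by
+// icmpv6_checksum() below: recomputing it over a packet whose checksum field
+// has already been filled in yields 0, which allows a cheap self-check
+// (a sketch, usable right after the checksum is stored):
+//
+//     ASSERT(icmpv6_checksum((UCHAR *) &na->icmpv6, icmpv6_len,
+//                            na->ipv6.saddr, na->ipv6.daddr) == 0);
+//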
+//
+
+#include "tap.h"
+
+//======================================================================
+// TAP Send Path Support
+//======================================================================
+
+#ifdef ALLOC_PRAGMA
+#pragma alloc_text( PAGE, TapDeviceRead)
+#endif // ALLOC_PRAGMA
+
+// checksum code for ICMPv6 packet, taken from dhcp.c / udp_checksum
+// see RFC 4443, 2.3, and RFC 2460, 8.1
+USHORT
+icmpv6_checksum(
+    __in const UCHAR *buf,
+    __in const int len_icmpv6,
+    __in const UCHAR *saddr6,
+    __in const UCHAR *daddr6
+    )
+{
+    USHORT word16;
+    ULONG sum = 0;
+    int i;
+
+    // make 16 bit words out of every two adjacent 8 bit bytes and
+    // calculate the sum of all 16 bit words
+    for (i = 0; i < len_icmpv6; i += 2)
+    {
+        word16 = ((buf[i] << 8) & 0xFF00) + ((i + 1 < len_icmpv6) ? (buf[i+1] & 0xFF) : 0);
+        sum += word16;
+    }
+
+    // add the IPv6 pseudo header which contains the IP source and destination addresses
+    for (i = 0; i < 16; i += 2)
+    {
+        word16 = ((saddr6[i] << 8) & 0xFF00) + (saddr6[i+1] & 0xFF);
+        sum += word16;
+    }
+
+    for (i = 0; i < 16; i += 2)
+    {
+        word16 = ((daddr6[i] << 8) & 0xFF00) + (daddr6[i+1] & 0xFF);
+        sum += word16;
+    }
+
+    // the next-header number and the length of the ICMPv6 packet
+    sum += (USHORT) IPPROTO_ICMPV6 + (USHORT) len_icmpv6;
+
+    // keep only the last 16 bits of the 32 bit calculated sum and add the carries
+    while (sum >> 16)
+        sum = (sum & 0xFFFF) + (sum >> 16);
+
+    // Take the one's complement of sum
+    return ((USHORT) ~sum);
+}
+
+// check IPv6 packet for "is this an IPv6 Neighbor Solicitation that
+// the tap driver needs to answer?"
+// see RFC 4861 4.3 for the different cases
+static IPV6ADDR IPV6_NS_TARGET_MCAST =
+    { 0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+      0x00, 0x00, 0x00, 0x01, 0xff, 0x00, 0x00, 0x08 };
+static IPV6ADDR IPV6_NS_TARGET_UNICAST =
+    { 0xfe, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08 };
+
+BOOLEAN
+HandleIPv6NeighborDiscovery(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in UCHAR * m_Data
+    )
+{
+    const IPV6HDR *ipv6 = (IPV6HDR *) (m_Data + sizeof (ETH_HEADER));
+    const ICMPV6_NS * icmpv6_ns = (ICMPV6_NS *) (m_Data + sizeof (ETH_HEADER) + sizeof (IPV6HDR));
+    ICMPV6_NA_PKT *na;
+    USHORT icmpv6_len, icmpv6_csum;
+
+    // we don't really care about the destination MAC address here
+    // - it's either a multicast MAC, or the userland destination MAC
+    // but since the TAP driver is point-to-point, all packets are "for us"
+
+    // IPv6 target address must be ff02::1:ff00:8 (solicited-node multicast
+    // for the initial NS) or fe80::8 (unicast for recurrent NUD)
+    if ( memcmp( ipv6->daddr, IPV6_NS_TARGET_MCAST,
+                 sizeof(IPV6ADDR) ) != 0 &&
+         memcmp( ipv6->daddr, IPV6_NS_TARGET_UNICAST,
+                 sizeof(IPV6ADDR) ) != 0 )
+    {
+        return FALSE;                // wrong target address
+    }
+
+    // IPv6 Next-Header must be ICMPv6
+    if ( ipv6->nexthdr != IPPROTO_ICMPV6 )
+    {
+        return FALSE;                // wrong next-header
+    }
+
+    // ICMPv6 type+code must be 135/0 for NS
+    if ( icmpv6_ns->type != ICMPV6_TYPE_NS ||
+         icmpv6_ns->code != ICMPV6_CODE_0 )
+    {
+        return FALSE;                // wrong ICMPv6 type
+    }
+
+    // ICMPv6 target address must be fe80::8 (magic)
+    if ( memcmp( icmpv6_ns->target_addr, IPV6_NS_TARGET_UNICAST,
+                 sizeof(IPV6ADDR) ) != 0 )
+    {
+        return FALSE;                // not for us
+    }
+
+    // packet identified, build magic response packet
+
+    na = (ICMPV6_NA_PKT *) MemAlloc (sizeof (ICMPV6_NA_PKT), TRUE);
+    if ( !na ) return FALSE;
+
+    //------------------------------------------------
+    // Initialize Neighbour Advertisement reply packet
+    //------------------------------------------------
+
+    // ethernet header
+    na->eth.proto = htons(NDIS_ETH_TYPE_IPV6);
+    ETH_COPY_NETWORK_ADDRESS(na->eth.dest, Adapter->PermanentAddress);
+    ETH_COPY_NETWORK_ADDRESS(na->eth.src, Adapter->m_TapToUser.dest);
+
+    // IPv6 header
+    na->ipv6.version_prio = ipv6->version_prio;
+    NdisMoveMemory( na->ipv6.flow_lbl, ipv6->flow_lbl,
+                    sizeof(na->ipv6.flow_lbl) );
+    icmpv6_len = sizeof(ICMPV6_NA_PKT) - sizeof(ETH_HEADER) - sizeof(IPV6HDR);
+    na->ipv6.payload_len = htons(icmpv6_len);
+    na->ipv6.nexthdr = IPPROTO_ICMPV6;
+    na->ipv6.hop_limit = 255;
+    NdisMoveMemory( na->ipv6.saddr, IPV6_NS_TARGET_UNICAST,
+                    sizeof(IPV6ADDR) );
+    NdisMoveMemory( na->ipv6.daddr, ipv6->saddr,
+                    sizeof(IPV6ADDR) );
+
+    // ICMPv6
+    na->icmpv6.type = ICMPV6_TYPE_NA;
+    na->icmpv6.code = ICMPV6_CODE_0;
+    na->icmpv6.checksum = 0;
+    na->icmpv6.rso_bits = 0x60;        // Solicited + Override
+    NdisZeroMemory( na->icmpv6.reserved, sizeof(na->icmpv6.reserved) );
+    NdisMoveMemory( na->icmpv6.target_addr, IPV6_NS_TARGET_UNICAST,
+                    sizeof(IPV6ADDR) );
+
+    // ICMPv6 option "Target Link Layer Address"
+    na->icmpv6.opt_type = ICMPV6_OPTION_TLLA;
+    na->icmpv6.opt_length = ICMPV6_LENGTH_TLLA;
+    ETH_COPY_NETWORK_ADDRESS( na->icmpv6.target_macaddr, Adapter->m_TapToUser.dest );
+
+    // calculate and set checksum
+    icmpv6_csum = icmpv6_checksum (
+                      (UCHAR*) &(na->icmpv6),
+                      icmpv6_len,
+                      na->ipv6.saddr,
+                      na->ipv6.daddr
+                      );
+
+    na->icmpv6.checksum = htons( icmpv6_csum );
+
+    DUMP_PACKET ("HandleIPv6NeighborDiscovery",
+                 (unsigned char *) na,
+                 sizeof (ICMPV6_NA_PKT));
+
+    IndicateReceivePacket (Adapter, (UCHAR *) na, sizeof (ICMPV6_NA_PKT));
+
+    MemFree (na, sizeof (ICMPV6_NA_PKT));
+
+    return TRUE;                // all fine
+}
+
+//===================================================
+// Generate an ARP reply message for specific kinds
+// of ARP queries.
+//===================================================
+BOOLEAN
+ProcessARP(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in const PARP_PACKET src,
+    __in const IPADDR adapter_ip,
+    __in const IPADDR ip_network,
+    __in const IPADDR ip_netmask,
+    __in const MACADDR mac
+    )
+{
+    //-----------------------------------------------
+    // Is this the kind of packet we are looking for?
+ //----------------------------------------------- + if (src->m_Proto == htons (NDIS_ETH_TYPE_ARP) + && MAC_EQUAL (src->m_MAC_Source, Adapter->PermanentAddress) + && MAC_EQUAL (src->m_ARP_MAC_Source, Adapter->PermanentAddress) + && ETH_IS_BROADCAST(src->m_MAC_Destination) + && src->m_ARP_Operation == htons (ARP_REQUEST) + && src->m_MAC_AddressType == htons (MAC_ADDR_TYPE) + && src->m_MAC_AddressSize == sizeof (MACADDR) + && src->m_PROTO_AddressType == htons (NDIS_ETH_TYPE_IPV4) + && src->m_PROTO_AddressSize == sizeof (IPADDR) + && src->m_ARP_IP_Source == adapter_ip + && (src->m_ARP_IP_Destination & ip_netmask) == ip_network + && src->m_ARP_IP_Destination != adapter_ip) + { + ARP_PACKET *arp = (ARP_PACKET *) MemAlloc (sizeof (ARP_PACKET), TRUE); + if (arp) + { + //---------------------------------------------- + // Initialize ARP reply fields + //---------------------------------------------- + arp->m_Proto = htons (NDIS_ETH_TYPE_ARP); + arp->m_MAC_AddressType = htons (MAC_ADDR_TYPE); + arp->m_PROTO_AddressType = htons (NDIS_ETH_TYPE_IPV4); + arp->m_MAC_AddressSize = sizeof (MACADDR); + arp->m_PROTO_AddressSize = sizeof (IPADDR); + arp->m_ARP_Operation = htons (ARP_REPLY); + + //---------------------------------------------- + // ARP addresses + //---------------------------------------------- + ETH_COPY_NETWORK_ADDRESS (arp->m_MAC_Source, mac); + ETH_COPY_NETWORK_ADDRESS (arp->m_MAC_Destination, Adapter->PermanentAddress); + ETH_COPY_NETWORK_ADDRESS (arp->m_ARP_MAC_Source, mac); + ETH_COPY_NETWORK_ADDRESS (arp->m_ARP_MAC_Destination, Adapter->PermanentAddress); + arp->m_ARP_IP_Source = src->m_ARP_IP_Destination; + arp->m_ARP_IP_Destination = adapter_ip; + + DUMP_PACKET ("ProcessARP", + (unsigned char *) arp, + sizeof (ARP_PACKET)); + + IndicateReceivePacket (Adapter, (UCHAR *) arp, sizeof (ARP_PACKET)); + + MemFree (arp, sizeof (ARP_PACKET)); + } + + return TRUE; + } + else + return FALSE; +} + +//============================================================= +// CompleteIRP is normally called with an adapter -> userspace +// network packet and an IRP (Pending I/O request) from userspace. +// +// The IRP will normally represent a queued overlapped read +// operation from userspace that is in a wait state. +// +// Use the ethernet packet to satisfy the IRP. +//============================================================= + +VOID +tapCompletePendingReadIrp( + __in PIRP Irp, + __in PTAP_PACKET TapPacket + ) +{ + int offset; + int len; + NTSTATUS status = STATUS_UNSUCCESSFUL; + + ASSERT(Irp); + ASSERT(TapPacket); + + //------------------------------------------- + // While TapPacket always contains a + // full ethernet packet, including the + // ethernet header, in point-to-point mode, + // we only want to return the IPv4 + // component. 
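+  // Seen from user mode, this is why a tun-mode read returns a bare IP
+  // datagram while a tap-mode read returns a whole Ethernet frame. A
+  // hedged sketch of the reader side (synchronous for brevity; the
+  // driver expects queued overlapped reads in practice):
+  //
+  //     UCHAR buf[2048]; DWORD n;
+  //     if (ReadFile(h, buf, sizeof(buf), &n, NULL))
+  //         handle_packet(buf, n);   // assumed user-mode helper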
+ //------------------------------------------- + + if (TapPacket->m_SizeFlags & TP_TUN) + { + offset = ETHERNET_HEADER_SIZE; + len = (int) (TapPacket->m_SizeFlags & TP_SIZE_MASK) - ETHERNET_HEADER_SIZE; + } + else + { + offset = 0; + len = (TapPacket->m_SizeFlags & TP_SIZE_MASK); + } + + if (len < 0 || (int) Irp->IoStatus.Information < len) + { + Irp->IoStatus.Information = 0; + Irp->IoStatus.Status = status = STATUS_BUFFER_OVERFLOW; + NOTE_ERROR (); + } + else + { + Irp->IoStatus.Information = len; + Irp->IoStatus.Status = status = STATUS_SUCCESS; + + // Copy packet data + NdisMoveMemory( + Irp->AssociatedIrp.SystemBuffer, + TapPacket->m_Data + offset, + len + ); + } + + // Free the TAP packet + NdisFreeMemory(TapPacket,0,0); + + // Complete the IRP + IoCompleteRequest (Irp, IO_NETWORK_INCREMENT); +} + +VOID +tapProcessSendPacketQueue( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + KIRQL irql; + + // Process the send packet queue + KeAcquireSpinLock(&Adapter->SendPacketQueue.QueueLock,&irql); + + while(Adapter->SendPacketQueue.Count > 0 ) + { + PIRP irp; + PTAP_PACKET tapPacket; + + // Fetch a read IRP + irp = IoCsqRemoveNextIrp( + &Adapter->PendingReadIrpQueue.CsqQueue, + NULL + ); + + if( irp == NULL ) + { + // No IRP to satisfy + break; + } + + // Fetch a queued TAP send packet + tapPacket = tapPacketRemoveHeadLocked( + &Adapter->SendPacketQueue + ); + + ASSERT(tapPacket); + + // BUGBUG!!! Investigate whether release/reacquire can cause + // out-of-order IRP completion. Also, whether user-mode can + // tolerate out-of-order packets. + + // Release packet queue lock while completing the IRP + //KeReleaseSpinLock(&Adapter->SendPacketQueue.QueueLock,irql); + + // Complete the read IRP from queued TAP send packet. + tapCompletePendingReadIrp(irp,tapPacket); + + // Reqcquire packet queue lock after completing the IRP + //KeAcquireSpinLock(&Adapter->SendPacketQueue.QueueLock,&irql); + } + + KeReleaseSpinLock(&Adapter->SendPacketQueue.QueueLock,irql); +} + +// Flush the pending send TAP packet queue. +VOID +tapFlushSendPacketQueue( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + KIRQL irql; + + // Process the send packet queue + KeAcquireSpinLock(&Adapter->SendPacketQueue.QueueLock,&irql); + + DEBUGP (("[TAP] tapFlushSendPacketQueue: Flushing %d TAP packets\n", + Adapter->SendPacketQueue.Count)); + + while(Adapter->SendPacketQueue.Count > 0 ) + { + PTAP_PACKET tapPacket; + + // Fetch a queued TAP send packet + tapPacket = tapPacketRemoveHeadLocked( + &Adapter->SendPacketQueue + ); + + ASSERT(tapPacket); + + // Free the TAP packet + NdisFreeMemory(tapPacket,0,0); + } + + KeReleaseSpinLock(&Adapter->SendPacketQueue.QueueLock,irql); +} + +VOID +tapAdapterTransmit( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PNET_BUFFER NetBuffer, + __in BOOLEAN DispatchLevel + ) +/*++ + +Routine Description: + + This routine is called to transmit an individual net buffer using a + style similar to the previous NDIS 5 AdapterTransmit function. + + In this implementation adapter state and NB length checks have already + been done before this function has been called. + + The net buffer will be completed by the calling routine after this + routine exits. So, under this design it is necessary to make a deep + copy of frame data in the net buffer. + + This routine creates a flat buffer copy of NB frame data. This is an + unnecessary performance bottleneck. However, the bottleneck is probably + not significant or measurable except for adapters running at 1Gbps or + greater speeds. 
Since this adapter is currently running at 100Mbps this + defect can be ignored. + + Runs at IRQL <= DISPATCH_LEVEL + +Arguments: + + Adapter Pointer to our adapter context + NetBuffer Pointer to the net buffer to transmit + DispatchLevel TRUE if called at IRQL == DISPATCH_LEVEL + +Return Value: + + None. + + In the Microsoft NDIS 6 architecture there is no per-packet status. + +--*/ +{ + NDIS_STATUS status; + ULONG packetLength; + PTAP_PACKET tapPacket; + PVOID packetData; + + packetLength = NET_BUFFER_DATA_LENGTH(NetBuffer); + + // Allocate TAP packet memory + tapPacket = (PTAP_PACKET )NdisAllocateMemoryWithTagPriority( + Adapter->MiniportAdapterHandle, + TAP_PACKET_SIZE (packetLength), + TAP_PACKET_TAG, + NormalPoolPriority + ); + + if(tapPacket == NULL) + { + DEBUGP (("[TAP] tapAdapterTransmit: TAP packet allocation failed\n")); + return; + } + + tapPacket->m_SizeFlags = (packetLength & TP_SIZE_MASK); + + // + // Reassemble packet contents + // -------------------------- + // NdisGetDataBuffer does most of the work. There are two cases: + // + // 1.) If the NB data was not contiguous it will copy the entire + // NB's data to m_data and return pointer to m_data. + // 2.) If the NB data was contiguous it returns a pointer to the + // first byte of the contiguous data instead of a pointer to m_Data. + // In this case the data will not have been copied to m_Data. Copy + // to m_Data will need to be done in an extra step. + // + // Case 1.) is the most likely in normal operation. + // + packetData = NdisGetDataBuffer(NetBuffer,packetLength,tapPacket->m_Data,1,0); + + if(packetData == NULL) + { + DEBUGP (("[TAP] tapAdapterTransmit: Could not get packet data\n")); + + NdisFreeMemory(tapPacket,0,0); + + return; + } + + if(packetData != tapPacket->m_Data) + { + // Packet data was contiguous and not yet copied to m_Data. + NdisMoveMemory(tapPacket->m_Data,packetData,packetLength); + } + + DUMP_PACKET ("AdapterTransmit", tapPacket->m_Data, packetLength); + + //===================================================== + // If IPv4 packet, check whether or not packet + // was truncated. + //===================================================== +#if PACKET_TRUNCATION_CHECK + IPv4PacketSizeVerify( + tapPacket->m_Data, + packetLength, + FALSE, + "TX", + &Adapter->m_TxTrunc + ); +#endif + + //===================================================== + // Are we running in DHCP server masquerade mode? + // + // If so, catch both DHCP requests and ARP queries + // to resolve the address of our virtual DHCP server. + //===================================================== + if (Adapter->m_dhcp_enabled) + { + const ETH_HEADER *eth = (ETH_HEADER *) tapPacket->m_Data; + const IPHDR *ip = (IPHDR *) (tapPacket->m_Data + sizeof (ETH_HEADER)); + const UDPHDR *udp = (UDPHDR *) (tapPacket->m_Data + sizeof (ETH_HEADER) + sizeof (IPHDR)); + + // ARP packet? + if (packetLength == sizeof (ARP_PACKET) + && eth->proto == htons (NDIS_ETH_TYPE_ARP) + && Adapter->m_dhcp_server_arp + ) + { + if (ProcessARP( + Adapter, + (PARP_PACKET) tapPacket->m_Data, + Adapter->m_dhcp_addr, + Adapter->m_dhcp_server_ip, + ~0, + Adapter->m_dhcp_server_mac) + ) + { + goto no_queue; + } + } + + // DHCP packet? 
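+      // The match below is deliberately narrow: an option-less IPv4 header
+      // (version_len 0x45), UDP destined to the bootps port, and at least
+      // one byte of DHCP options past the fixed-size DHCP header.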
+ else if (packetLength >= sizeof (ETH_HEADER) + sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + && eth->proto == htons (NDIS_ETH_TYPE_IPV4) + && ip->version_len == 0x45 // IPv4, 20 byte header + && ip->protocol == IPPROTO_UDP + && udp->dest == htons (BOOTPS_PORT) + ) + { + const DHCP *dhcp = (DHCP *) (tapPacket->m_Data + + sizeof (ETH_HEADER) + + sizeof (IPHDR) + + sizeof (UDPHDR)); + + const int optlen = packetLength + - sizeof (ETH_HEADER) + - sizeof (IPHDR) + - sizeof (UDPHDR) + - sizeof (DHCP); + + if (optlen > 0) // we must have at least one DHCP option + { + if (ProcessDHCP (Adapter, eth, ip, udp, dhcp, optlen)) + { + goto no_queue; + } + } + else + { + goto no_queue; + } + } + } + + //=============================================== + // In Point-To-Point mode, check to see whether + // packet is ARP (handled) or IPv4 (sent to app). + // IPv6 packets are inspected for neighbour discovery + // (to be handled locally), and the rest is forwarded + // all other protocols are dropped + //=============================================== + if (Adapter->m_tun) + { + ETH_HEADER *e; + + e = (ETH_HEADER *) tapPacket->m_Data; + + switch (ntohs (e->proto)) + { + case NDIS_ETH_TYPE_ARP: + + // Make sure that packet is the right size for ARP. + if (packetLength != sizeof (ARP_PACKET)) + { + goto no_queue; + } + + ProcessARP ( + Adapter, + (PARP_PACKET) tapPacket->m_Data, + Adapter->m_localIP, + Adapter->m_remoteNetwork, + Adapter->m_remoteNetmask, + Adapter->m_TapToUser.dest + ); + + default: + goto no_queue; + + case NDIS_ETH_TYPE_IPV4: + + // Make sure that packet is large enough to be IPv4. + if (packetLength < (ETHERNET_HEADER_SIZE + IP_HEADER_SIZE)) + { + goto no_queue; + } + + // Only accept directed packets, not broadcasts. + if (memcmp (e, &Adapter->m_TapToUser, ETHERNET_HEADER_SIZE)) + { + goto no_queue; + } + + // Packet looks like IPv4, queue it. :-) + tapPacket->m_SizeFlags |= TP_TUN; + break; + + case NDIS_ETH_TYPE_IPV6: + // Make sure that packet is large enough to be IPv6. + if (packetLength < (ETHERNET_HEADER_SIZE + IPV6_HEADER_SIZE)) + { + goto no_queue; + } + + // Broadcasts and multicasts are handled specially + // (to be implemented) + + // Neighbor discovery packets to fe80::8 are special + // OpenVPN sets this next-hop to signal "handled by tapdrv" + if ( HandleIPv6NeighborDiscovery(Adapter,tapPacket->m_Data) ) + { + goto no_queue; + } + + // Packet looks like IPv6, queue it. :-) + tapPacket->m_SizeFlags |= TP_TUN; + } + } + + //=============================================== + // Push packet onto queue to wait for read from + // userspace. + //=============================================== + if(tapAdapterReadAndWriteReady(Adapter)) + { + tapPacketQueueInsertTail(&Adapter->SendPacketQueue,tapPacket); + } + else + { + // + // Tragedy. All this work and the packet is of no use... + // + NdisFreeMemory(tapPacket,0,0); + } + + // Return after queuing or freeing TAP packet. + return; + + // Free TAP packet without queuing. 
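+  // Everything that lands here was either answered locally (ARP, IPv6
+  // neighbor discovery, DHCP masquerade) or is not forwardable in the
+  // current mode; it is freed without updating any statistics.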
+no_queue:
+    if(tapPacket != NULL )
+    {
+        NdisFreeMemory(tapPacket,0,0);
+    }
+
+    return;
+}
+
+VOID
+tapSendNetBufferListsComplete(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNET_BUFFER_LIST NetBufferLists,
+    __in NDIS_STATUS SendCompletionStatus,
+    __in BOOLEAN DispatchLevel
+    )
+{
+    PNET_BUFFER_LIST currentNbl;
+    PNET_BUFFER_LIST nextNbl = NULL;
+    ULONG sendCompleteFlags = 0;
+
+    for (
+        currentNbl = NetBufferLists;
+        currentNbl != NULL;
+        currentNbl = nextNbl
+        )
+    {
+        ULONG frameType;
+        ULONG netBufferCount;
+        ULONG byteCount;
+
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+
+        // Set NBL completion status.
+        NET_BUFFER_LIST_STATUS(currentNbl) = SendCompletionStatus;
+
+        // Fetch the first NB's frame type. All linked NBs will have same type.
+        frameType = tapGetNetBufferFrameType(NET_BUFFER_LIST_FIRST_NB(currentNbl));
+
+        // Fetch statistics for all NBs linked to the NBL.
+        netBufferCount = tapGetNetBufferCountsFromNetBufferList(
+                             currentNbl,
+                             &byteCount
+                             );
+
+        // Update statistics by frame type
+        if(SendCompletionStatus == NDIS_STATUS_SUCCESS)
+        {
+            switch(frameType)
+            {
+            case NDIS_PACKET_TYPE_DIRECTED:
+                Adapter->FramesTxDirected += netBufferCount;
+                Adapter->BytesTxDirected += byteCount;
+                break;
+
+            case NDIS_PACKET_TYPE_BROADCAST:
+                Adapter->FramesTxBroadcast += netBufferCount;
+                Adapter->BytesTxBroadcast += byteCount;
+                break;
+
+            case NDIS_PACKET_TYPE_MULTICAST:
+                Adapter->FramesTxMulticast += netBufferCount;
+                Adapter->BytesTxMulticast += byteCount;
+                break;
+
+            default:
+                ASSERT(FALSE);
+                break;
+            }
+        }
+        else
+        {
+            // Transmit error.
+            Adapter->TransmitFailuresOther += netBufferCount;
+        }
+
+        currentNbl = nextNbl;
+    }
+
+    if(DispatchLevel)
+    {
+        sendCompleteFlags |= NDIS_SEND_COMPLETE_FLAGS_DISPATCH_LEVEL;
+    }
+
+    // Complete the NBLs
+    NdisMSendNetBufferListsComplete(
+        Adapter->MiniportAdapterHandle,
+        NetBufferLists,
+        sendCompleteFlags
+        );
+}
+
+BOOLEAN
+tapNetBufferListNetBufferLengthsValid(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNET_BUFFER_LIST NetBufferLists
+    )
+/*++
+
+Routine Description:
+
+    Scan all NBLs and their linked NBs for valid lengths.
+
+    Fairly absurd to find any packets with bogus lengths, but wise
+    to check anyway. If ANY packet has a bogus length, then abort the
+    entire send.
+
+    The only time that one might see this check fail might be during
+    HCK driver testing. The HCK test might send oversize packets to
+    determine if the miniport can gracefully deal with them.
+
+    This check is fairly fast. Unlike NDIS 5 packets, fetching NDIS 6
+    packet lengths does not require any computation.
+
+Arguments:
+
+    Adapter           Pointer to our adapter context
+    NetBufferLists    Head of a list of NBLs to examine
+
+Return Value:
+
+    Returns TRUE if all NBs have reasonable lengths.
+    Otherwise, returns FALSE.
+
+--*/
+{
+    PNET_BUFFER_LIST currentNbl;
+
+    currentNbl = NetBufferLists;
+
+    while (currentNbl)
+    {
+        PNET_BUFFER_LIST nextNbl;
+        PNET_BUFFER currentNb;
+
+        // Locate next NBL
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+
+        // Locate first NB (aka "packet")
+        currentNb = NET_BUFFER_LIST_FIRST_NB(currentNbl);
+
+        //
+        // Process all NBs linked to this NBL
+        //
+        while(currentNb)
+        {
+            PNET_BUFFER nextNb;
+            ULONG packetLength;
+
+            // Locate next NB
+            nextNb = NET_BUFFER_NEXT_NB(currentNb);
+
+            packetLength = NET_BUFFER_DATA_LENGTH(currentNb);
+
+            // Minimum packet size is size of Ethernet plus IPv4 headers.
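+            // The ASSERTs document the expectation on checked builds; the
+            // explicit range checks after them keep free builds safe when,
+            // for example, an HCK test deliberately sends bogus lengths.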
+            ASSERT(packetLength >= (ETHERNET_HEADER_SIZE + IP_HEADER_SIZE));
+
+            if(packetLength < (ETHERNET_HEADER_SIZE + IP_HEADER_SIZE))
+            {
+                return FALSE;
+            }
+
+            // Maximum size should be Ethernet header size plus MTU plus modest pad for
+            // VLAN tag.
+            ASSERT( packetLength <= (ETHERNET_HEADER_SIZE + VLAN_TAG_SIZE + Adapter->MtuSize));
+
+            if(packetLength > (ETHERNET_HEADER_SIZE + VLAN_TAG_SIZE + Adapter->MtuSize))
+            {
+                return FALSE;
+            }
+
+            // Move to next NB
+            currentNb = nextNb;
+        }
+
+        // Move to next NBL
+        currentNbl = nextNbl;
+    }
+
+    return TRUE;
+}
+
+VOID
+AdapterSendNetBufferLists(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PNET_BUFFER_LIST NetBufferLists,
+    __in NDIS_PORT_NUMBER PortNumber,
+    __in ULONG SendFlags
+    )
+/*++
+
+Routine Description:
+
+    Send Packet Array handler. Called by NDIS whenever a protocol
+    bound to our miniport sends one or more packets.
+
+    The input packet descriptor pointers have been ordered according
+    to the order in which the packets should be sent over the network
+    by the protocol driver that set up the packet array. The NDIS
+    library preserves the protocol-determined ordering when it submits
+    each packet array to MiniportSendPackets.
+
+    As a deserialized driver, we are responsible for holding incoming send
+    packets in our internal queue until they can be transmitted over the
+    network and for preserving the protocol-determined ordering of packet
+    descriptors incoming to its MiniportSendPackets function.
+    A deserialized miniport driver must complete each incoming send packet
+    with NdisMSendComplete, and it cannot call NdisMSendResourcesAvailable.
+
+    Runs at IRQL <= DISPATCH_LEVEL
+
+Arguments:
+
+    MiniportAdapterContext    Pointer to our adapter
+    NetBufferLists            Head of a list of NBLs to send
+    PortNumber                A miniport adapter port. Default is 0.
+    SendFlags                 Additional flags for the send operation
+
+Return Value:
+
+    None. Write status directly into each NBL with the NET_BUFFER_LIST_STATUS
+    macro.
+
+--*/
+{
+    NDIS_STATUS status;
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    BOOLEAN DispatchLevel = (SendFlags & NDIS_SEND_FLAGS_DISPATCH_LEVEL);
+    PNET_BUFFER_LIST currentNbl;
+    BOOLEAN validNbLengths;
+
+    UNREFERENCED_PARAMETER(NetBufferLists);
+    UNREFERENCED_PARAMETER(PortNumber);
+    UNREFERENCED_PARAMETER(SendFlags);
+
+    ASSERT(PortNumber == 0); // Only the default port is supported
+
+    //
+    // Can't process sends if TAP device is not open.
+    // ----------------------------------------------
+    // Just perform a "lying send" and return packets as if they
+    // were successfully sent.
+    //
+    if(adapter->TapFileObject == NULL)
+    {
+        //
+        // Complete all NBLs and return if adapter not ready.
+        //
+        tapSendNetBufferListsComplete(
+            adapter,
+            NetBufferLists,
+            NDIS_STATUS_SUCCESS,
+            DispatchLevel
+            );
+
+        return;
+    }
+
+    //
+    // Check Adapter send/receive ready state.
+    //
+    status = tapAdapterSendAndReceiveReady(adapter);
+
+    if(status != NDIS_STATUS_SUCCESS)
+    {
+        //
+        // Complete all NBLs and return if adapter not ready.
+        //
+        tapSendNetBufferListsComplete(
+            adapter,
+            NetBufferLists,
+            status,
+            DispatchLevel
+            );
+
+        return;
+    }
+
+    //
+    // Scan all NBLs and linked packets for valid lengths.
+    // ---------------------------------------------------
+    // If _ANY_ NB length is invalid, then fail the entire send operation.
+    //
+    // BUGBUG!!! Perhaps this should be less aggressive. Fail only individual
+    // NBLs...
+    //
+    // If length check is valid, then TAP_PACKETS can be safely allocated
+    // and processed for all NBs being sent.
+    //
+    validNbLengths = tapNetBufferListNetBufferLengthsValid(
+                         adapter,
+                         NetBufferLists
+                         );
+
+    if(!validNbLengths)
+    {
+        //
+        // Complete all NBLs and return if any NB length is invalid.
+        //
+        tapSendNetBufferListsComplete(
+            adapter,
+            NetBufferLists,
+            NDIS_STATUS_INVALID_LENGTH,
+            DispatchLevel
+            );
+
+        return;
+    }
+
+    //
+    // Process each NBL individually
+    //
+    currentNbl = NetBufferLists;
+
+    while (currentNbl)
+    {
+        PNET_BUFFER_LIST nextNbl;
+        PNET_BUFFER currentNb;
+
+        // Locate next NBL
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+
+        // Locate first NB (aka "packet")
+        currentNb = NET_BUFFER_LIST_FIRST_NB(currentNbl);
+
+        // Transmit all NBs linked to this NBL
+        while(currentNb)
+        {
+            PNET_BUFFER nextNb;
+
+            // Locate next NB
+            nextNb = NET_BUFFER_NEXT_NB(currentNb);
+
+            // Transmit the NB
+            tapAdapterTransmit(adapter,currentNb,DispatchLevel);
+
+            // Move to next NB
+            currentNb = nextNb;
+        }
+
+        // Move to next NBL
+        currentNbl = nextNbl;
+    }
+
+    // Complete all NBLs
+    tapSendNetBufferListsComplete(
+        adapter,
+        NetBufferLists,
+        NDIS_STATUS_SUCCESS,
+        DispatchLevel
+        );
+
+    // Attempt to complete pending read IRPs from pending TAP
+    // send packet queue.
+    tapProcessSendPacketQueue(adapter);
+}
+
+VOID
+AdapterCancelSend(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PVOID CancelId
+    )
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    //
+    // This miniport completes its sends quickly, so it isn't strictly
+    // necessary to implement MiniportCancelSend.
+    //
+    // If we did implement it, we'd have to walk the Adapter->SendWaitList
+    // and look for any NB that points to a NBL where the CancelId matches
+    // NDIS_GET_NET_BUFFER_LIST_CANCEL_ID(Nbl). For any NB that so matches,
+    // we'd remove the NB from the SendWaitList and set the NBL's status to
+    // NDIS_STATUS_SEND_ABORTED, then complete the NBL.
+    //
+}
+
+// IRP_MJ_READ callback.
+NTSTATUS
+TapDeviceRead(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+{
+    NTSTATUS ntStatus = STATUS_SUCCESS;// Assume success
+    PIO_STACK_LOCATION irpSp;// Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT adapter = NULL;
+
+    PAGED_CODE();
+
+    irpSp = IoGetCurrentIrpStackLocation( Irp );
+
+    //
+    // Fetch adapter context for this device.
+    // --------------------------------------
+    // Adapter pointer was stashed in FsContext when handle was opened.
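+  // (The stash happens in the device CREATE dispatch, outside this hunk;
+  // a minimal sketch of that side, with assumed local names:
+  //
+  //     irpSp->FileObject->FsContext = (PVOID) adapter;
+  //
+  // so every later read or write on the handle can recover its adapter.)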
+ // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + ASSERT(adapter); + + // + // Sanity checks on state variables + // + if (!tapAdapterReadAndWriteReady(adapter)) + { + //DEBUGP (("[%s] Interface is down in IRP_MJ_READ\n", + // MINIPORT_INSTANCE_ID (adapter))); + //NOTE_ERROR(); + + Irp->IoStatus.Status = ntStatus = STATUS_CANCELLED; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // Save IRP-accessible copy of buffer length + Irp->IoStatus.Information = irpSp->Parameters.Read.Length; + + if (Irp->MdlAddress == NULL) + { + DEBUGP (("[%s] MdlAddress is NULL for IRP_MJ_READ\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + if ((Irp->AssociatedIrp.SystemBuffer + = MmGetSystemAddressForMdlSafe( + Irp->MdlAddress, + NormalPagePriority + ) ) == NULL + ) + { + DEBUGP (("[%s] Could not map address in IRP_MJ_READ\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INSUFFICIENT_RESOURCES; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // BUGBUG!!! Use RemoveLock??? + + // + // Queue the IRP and return STATUS_PENDING. + // ---------------------------------------- + // Note: IoCsqInsertIrp marks the IRP pending. + // + + // BUGBUG!!! NDIS 5 implementation has IRP_QUEUE_SIZE of 16 and + // does not queue IRP if this capacity is exceeded. + // + // Is this needed??? + // + IoCsqInsertIrp(&adapter->PendingReadIrpQueue.CsqQueue, Irp, NULL); + + // Attempt to complete pending read IRPs from pending TAP + // send packet queue. + tapProcessSendPacketQueue(adapter); + + ntStatus = STATUS_PENDING; + + return ntStatus; +} + diff --git a/installer/tap/src/src/types.h b/installer/tap/src/src/types.h new file mode 100644 index 0000000..acea175 --- /dev/null +++ b/installer/tap/src/src/types.h @@ -0,0 +1,90 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef TAP_TYPES_DEFINED +#define TAP_TYPES_DEFINED + +//typedef +//struct _Queue +//{ +// ULONG base; +// ULONG size; +// ULONG capacity; +// ULONG max_size; +// PVOID data[]; +//} Queue; + +//typedef struct _TAP_PACKET; + +//typedef struct _TapExtension +//{ +// // TAP device object and packet queues +// Queue *m_PacketQueue, *m_IrpQueue; +// PDEVICE_OBJECT m_TapDevice; +// NDIS_HANDLE m_TapDeviceHandle; +// ULONG TapFileIsOpen; +// +// // Used to lock packet queues +// NDIS_SPIN_LOCK m_QueueLock; +// BOOLEAN m_AllocatedSpinlocks; +// +// // Used to bracket open/close +// // state changes. +// MUTEX m_OpenCloseMutex; +// +// // True if device has been permanently halted +// BOOLEAN m_Halt; +// +// // TAP device name +// unsigned char *m_TapName; +// UNICODE_STRING m_UnicodeLinkName; +// BOOLEAN m_CreatedUnicodeLinkName; +// +// // Used for device status ioctl only +// const char *m_LastErrorFilename; +// int m_LastErrorLineNumber; +// LONG TapFileOpenCount; +// +// // Flags +// BOOLEAN TapDeviceCreated; +// BOOLEAN m_CalledTapDeviceFreeResources; +// +// // DPC queue for deferred packet injection +// BOOLEAN m_InjectDpcInitialized; +// KDPC m_InjectDpc; +// NDIS_SPIN_LOCK m_InjectLock; +// Queue *m_InjectQueue; +//} +//TapExtension, *TapExtensionPointer; + +typedef struct _InjectPacket + { +# define INJECT_PACKET_SIZE(data_size) (sizeof (InjectPacket) + (data_size)) +# define INJECT_PACKET_FREE(ib) NdisFreeMemory ((ib), INJECT_PACKET_SIZE ((ib)->m_Size), 0) + ULONG m_Size; + UCHAR m_Data []; // m_Data must be the last struct member + } +InjectPacket, *InjectPacketPointer; + +#endif diff --git a/installer/tap/src/version.m4 b/installer/tap/src/version.m4 new file mode 100644 index 0000000..fdd605c --- /dev/null +++ b/installer/tap/src/version.m4 @@ -0,0 +1,14 @@ +dnl define the TAP version +define([PRODUCT_NAME], [TAP-Windows]) +define([PRODUCT_PUBLISHER], [OpenVPN Technologies, Inc.]) +define([PRODUCT_VERSION], [9.21.2]) +define([PRODUCT_VERSION_RESOURCE], [9,0,0,21]) +define([PRODUCT_TAP_WIN_COMPONENT_ID], [tap0901]) +define([PRODUCT_TAP_WIN_MAJOR], [9]) +define([PRODUCT_TAP_WIN_MINOR], [21]) +define([PRODUCT_TAP_WIN_REVISION], [2]) +define([PRODUCT_TAP_WIN_BUILD], [601]) +define([PRODUCT_TAP_WIN_PROVIDER], [TAP-Windows Provider V9]) +define([PRODUCT_TAP_WIN_CHARACTERISTICS], [0x81]) +define([PRODUCT_TAP_WIN_DEVICE_DESCRIPTION], [TAP-Windows Adapter V9]) +define([PRODUCT_TAP_WIN_RELDATE], [04/08/2014]) diff --git a/installer/tap/tap-windows6.nsi b/installer/tap/tap-windows6.nsi new file mode 100644 index 0000000..1580ea6 --- /dev/null +++ b/installer/tap/tap-windows6.nsi @@ -0,0 +1,321 @@ +; **************************************************************************** +; * Copyright (C) 2002-2010 OpenVPN Technologies, Inc. * +; * Copyright (C) 2012 Alon Bar-Lev * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; TAP-Windows install script for Windows, using NSIS + +SetCompressor /SOLID lzma + +!addplugindir . 
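+; The plugins invoked later in this script (e.g. ShellLink for
+; SetRunAsAdministrator) are expected to be found via this plugin directory.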
+!include "MUI.nsh" +!include "StrFunc.nsh" +!include "x64.nsh" +!define MULTIUSER_EXECUTIONLEVEL Admin +!include "MultiUser.nsh" +!include FileFunc.nsh +!insertmacro GetParameters +!insertmacro GetOptions + +!define PRODUCT_TAP_WIN_COMPONENT_ID "tap0901" +!define PRODUCT_NAME "TunSafe-TAP" +!define PRODUCT_VERSION "9.21.2" +!define PRODUCT_PUBLISHER "TunSafe" + +${StrLoc} + +;-------------------------------- +;Configuration + +;General + + +OutFile "TunSafe-TAP-${PRODUCT_VERSION}.exe" + +BrandingText " " +ShowInstDetails show +ShowUninstDetails show + +;-------------------------------- +;Modern UI Configuration + +Name "${PRODUCT_NAME}" + +#!define MUI_WELCOMEPAGE_TEXT "This wizard will guide you through the installation of ${PRODUCT_NAME}, a kernel driver to provide virtual tap device #functionality on Windows originally written by James Yonan.\r\n\r\nNote that ${PRODUCT_NAME} will only run on Windows Vista or later.\r\n\r\n\r\n" + +!define MUI_COMPONENTSPAGE_TEXT_TOP "Select the components to install/upgrade. Stop any ${PRODUCT_NAME} processes or the ${PRODUCT_NAME} service if it is running." + +#!define MUI_COMPONENTSPAGE_SMALLDESC +!define MUI_FINISHPAGE_NOAUTOCLOSE +!define MUI_ABORTWARNING +!define MUI_ICON "icon.ico" +!define MUI_UNICON "icon.ico" +!define MUI_HEADERIMAGE +!define MUI_HEADERIMAGE_BITMAP "install-whirl.bmp" +!define MUI_UNFINISHPAGE_NOAUTOCLOSE +!define MUI_TEXT_LICENSE_TITLE "Welcome to the TunSafe-TAP installer" + +#!insertmacro MUI_PAGE_WELCOME +!insertmacro MUI_PAGE_LICENSE "COPYING" +#!insertmacro MUI_PAGE_COMPONENTS +!define MUI_PAGE_CUSTOMFUNCTION_PRE dirPre +!insertmacro MUI_PAGE_DIRECTORY +!insertmacro MUI_PAGE_INSTFILES +#!insertmacro MUI_PAGE_FINISH + +!insertmacro MUI_UNPAGE_CONFIRM +!insertmacro MUI_UNPAGE_INSTFILES +#!insertmacro MUI_UNPAGE_FINISH + +;-------------------------------- +;Languages + +!insertmacro MUI_LANGUAGE "English" + +;-------------------------------- +;Language Strings + +LangString DESC_SecTAP ${LANG_ENGLISH} "Install/Upgrade the TAP Virtual Ethernet Adapter from OpenVPN." +LangString DESC_SecTAPUtilities ${LANG_ENGLISH} "Install the TAP Utilities." + +Function dirPre + ${GetParameters} $R0 + ${GetOptions} "$R0" "/X" $R1 + IfErrors +2 0 + Abort +FunctionEnd + +;-------------------------------- +;Installer Sections + +Section "TAP Virtual Ethernet Adapter" SecTAP + SetOverwrite on + + ${If} ${RunningX64} + DetailPrint "We are running on a 64-bit system." + + SetOutPath "$INSTDIR" + File "prebuilt\x64\tapinstall.exe" + + SetOutPath "$INSTDIR\driver" + File "prebuilt\x64\OemVista.inf" + File "prebuilt\x64\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "prebuilt\x64\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${Else} + DetailPrint "We are running on a 32-bit system." 
+ + SetOutPath "$INSTDIR" + File "prebuilt\x86\tapinstall.exe" + + SetOutPath "$INSTDIR\driver" + File "prebuilt\x86\OemVista.inf" + File "prebuilt\x86\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "prebuilt\x86\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${EndIf} +SectionEnd + +Section "TAP Utilities" SecTAPUtilities + SetOverwrite on + + # Delete previous start menu + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + FileOpen $R0 "$INSTDIR\addtap.bat" w + FileWrite $R0 "rem Add a new TAP virtual ethernet adapter$\r$\n" + FileWrite $R0 '"$INSTDIR\tapinstall.exe" install "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + FileOpen $R0 "$INSTDIR\deltapall.bat" w + FileWrite $R0 "echo WARNING: this script will delete ALL TAP virtual adapters (use the device manager to delete adapters one at a time)$\r$\n" + FileWrite $R0 "pause$\r$\n" + FileWrite $R0 '"$INSTDIR\tapinstall.exe" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + ; Create shortcuts + CreateDirectory "$SMPROGRAMS\${PRODUCT_NAME}\Utilities" + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" "$INSTDIR\addtap.bat" "" + ; set runas admin flag on the addtap link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on addtap failed: status = $0" + ${Endif} + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" "$INSTDIR\deltapall.bat" "" + ; set runas admin flag on the deltapall link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on deltapall failed: status = $0" + ${Endif} +SectionEnd + +Function .onInit + ${GetParameters} $R0 + ClearErrors + +${IfNot} ${AtLeastWin7} + MessageBox MB_OK "TunSafe-TAP requires at least Windows 7" + SetErrorLevel 1 + Quit +${EndIf} + + !insertmacro MULTIUSER_INIT + SetShellVarContext all + + ${If} $INSTDIR == "" + StrCpy $1 "$PROGRAMFILES\TunSafe\TAP" + ${If} ${RunningX64} + SetRegView 64 + StrCpy $1 "$PROGRAMFILES64\TunSafe\TAP" + ${EndIf} + ReadRegStr $INSTDIR HKLM "SOFTWARE\${PRODUCT_NAME}" "" + StrCmp $INSTDIR "" 0 +2 + StrCpy $INSTDIR $1 + ${EndIf} +FunctionEnd + +;-------------------------------- +;Dependencies + +Function .onSelChange +# ${If} ${SectionIsSelected} ${SecTAPUtilities} +# !insertmacro SelectSection ${SecTAP} +# ${EndIf} +FunctionEnd + +;-------------------- +;Post-install section + +Section -post + + ; Store README, license, icon + SetOverwrite on + SetOutPath $INSTDIR + File "COPYING" + + ${If} ${SectionIsSelected} ${SecTAP} + ; + ; install/upgrade TAP driver if selected, using devcon + ; + ; TAP install/update was selected. + ; Should we install or update? + ; If tapinstall error occurred, $R5 will + ; be nonzero. + IntOp $R5 0 & 0 + nsExec::ExecToStack '"$INSTDIR\tapinstall.exe" hwids ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + IntOp $R5 $R5 | $R0 + DetailPrint "tapinstall.exe hwids returned: $R0" + + ; If tapinstall output string contains "${PRODUCT_TAP_WIN_COMPONENT_ID}" we assume + ; that TAP device has been previously installed, + ; therefore we will update, not install. 
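+  ; Note: the haystack StrLoc searches is the tapinstall output that
+  ; nsExec::ExecToStack left on the stack above; the two pushes below
+  ; supply the needle and the search direction.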
+ Push "${PRODUCT_TAP_WIN_COMPONENT_ID}" + Push ">" + Call StrLoc + Pop $R0 + + ${If} $R5 == 0 + ${If} $R0 == "" + StrCpy $R1 "install" + ${Else} + StrCpy $R1 "update" + ${EndIf} + DetailPrint "TAP $R1 (${PRODUCT_TAP_WIN_COMPONENT_ID}) (May require confirmation)" + nsExec::ExecToLog '"$INSTDIR\tapinstall.exe" $R1 "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + ${If} $R0 == "" + IntOp $R0 0 & 0 + SetRebootFlag true + DetailPrint "REBOOT flag set" + ${EndIf} + IntOp $R5 $R5 | $R0 + DetailPrint "tapinstall.exe returned: $R0" + ${EndIf} + + DetailPrint "tapinstall.exe cumulative status: $R5" + ${If} $R5 != 0 + MessageBox MB_OK "An error occurred installing the TAP device driver." + ${EndIf} + + ; Store install folder in registry + WriteRegStr HKLM SOFTWARE\${PRODUCT_NAME} "" $INSTDIR + ${EndIf} + + ; Create uninstaller + WriteUninstaller "$INSTDIR\Uninstall.exe" + + ; Show up in Add/Remove programs + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayName" "${PRODUCT_NAME} ${PRODUCT_VERSION}" + WriteRegExpandStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "UninstallString" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayIcon" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayVersion" "${PRODUCT_VERSION}" + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoModify" 1 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoRepair" 1 + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "Publisher" "${PRODUCT_PUBLISHER}" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "HelpLink" "https://tunsafe.com/open-source" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "URLInfoAbout" "https://tunsafe.com" + + ${GetSize} "$INSTDIR" "/S=0K" $0 $1 $2 + IntFmt $0 "0x%08X" $0 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "EstimatedSize" "$0" + + ${GetParameters} $R0 + ${GetOptions} "$R0" "/X" $R1 + IfErrors +3 0 + SetErrorLevel 0 + Quit + +SectionEnd + +;-------------------------------- +;Descriptions + +!insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAP} $(DESC_SecTAP) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAPUtilities} $(DESC_SecTAPUtilities) +!insertmacro MUI_FUNCTION_DESCRIPTION_END + +;-------------------------------- +;Uninstaller Section + +Function un.onInit + ClearErrors + !insertmacro MULTIUSER_UNINIT + SetShellVarContext all + ${If} ${RunningX64} + SetRegView 64 + ${EndIf} +FunctionEnd + +Section "Uninstall" + DetailPrint "TAP REMOVE" + nsExec::ExecToLog '"$INSTDIR\tapinstall.exe" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + DetailPrint "tapinstall.exe remove returned: $R0" + + Delete "$INSTDIR\tapinstall.exe" + Delete "$INSTDIR\addtap.bat" + Delete "$INSTDIR\deltapall.bat" + + Delete "$INSTDIR\driver\OemVista.inf" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + + Delete "$INSTDIR\COPYING" + Delete "$INSTDIR\Uninstall.exe" + + RMDir "$INSTDIR" + RMDir "$INSTDIR\driver" + RMDir "$INSTDIR\include" + RMDir "$INSTDIR" + RMDir /r 
"$SMPROGRAMS\${PRODUCT_NAME}" + + DeleteRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" + DeleteRegKey HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" +SectionEnd diff --git a/installer/tunsafe.nsi b/installer/tunsafe.nsi new file mode 100644 index 0000000..7b77322 --- /dev/null +++ b/installer/tunsafe.nsi @@ -0,0 +1,214 @@ +; **************************************************************************** +; * Copyright (C) 2018 Ludde * +; **************************************************************************** + +SetCompressor /SOLID lzma + +!addplugindir . +!include "MUI2.nsh" +!include "x64.nsh" +!define MULTIUSER_EXECUTIONLEVEL Admin +!include "MultiUser.nsh" +!insertmacro GetParameters +!insertmacro GetOptions + +!define PRODUCT_NAME "TunSafe" +!define PRODUCT_PUBLISHER "TunSafe" + +OutFile "TunSafe-${PRODUCT_VERSION}.exe" + +BrandingText " " +ShowInstDetails show +ShowUninstDetails show + +Name "${PRODUCT_NAME}" + +!define MUI_COMPONENTSPAGE_SMALLDESC +!define MUI_FINISHPAGE_NOAUTOCLOSE +!define MUI_ABORTWARNING +!define MUI_ICON "icon.ico" +!define MUI_UNICON "icon.ico" +!define MUI_HEADERIMAGE +!define MUI_HEADERIMAGE_BITMAP "tap\install-whirl.bmp" +!define MUI_UNFINISHPAGE_NOAUTOCLOSE + +!define MUI_TEXT_LICENSE_TITLE "Welcome to the TunSafe installer" + +#!insertmacro MUI_PAGE_WELCOME +!insertmacro MUI_PAGE_LICENSE "LICENSE.TXT" +!insertmacro MUI_PAGE_COMPONENTS +!insertmacro MUI_PAGE_DIRECTORY +!insertmacro MUI_PAGE_INSTFILES +#!insertmacro MUI_PAGE_FINISH + +!insertmacro MUI_UNPAGE_CONFIRM +!insertmacro MUI_UNPAGE_INSTFILES +#!insertmacro MUI_UNPAGE_FINISH + +!insertmacro MUI_LANGUAGE "English" + +LangString DESC_SecTAP ${LANG_ENGLISH} "Install the TunSafe client." +LangString DESC_SecTapAdapter ${LANG_ENGLISH} "Download and Install the TunSafe-TAP Virtual Ethernet Adapter (GPL)." + +Section "TunSafe Client" SecTunSafe + SetOverwrite on + ${If} ${RunningX64} + DetailPrint "Installing 64-bit version of TunSafe." + SetOutPath "$INSTDIR" + File "x64\TunSafe.exe" + ${Else} + DetailPrint "Installing 32-bit version of TunSafe." + SetOutPath "$INSTDIR" + File "x86\TunSafe.exe" + ${EndIf} + File "License.txt" + File "ChangeLog.txt" + CreateDirectory "$INSTDIR\Config" + SetOutPath "$INSTDIR\Config" + File "TunSafe.conf" + CreateDirectory "$SMPROGRAMS\${PRODUCT_NAME}" + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\TunSafe.lnk" "$INSTDIR\TunSafe.exe" "" +SectionEnd + +Section "TunSafe-TAP Ethernet Adapter (GPL)" SecTapAdapter + SetOverwrite on + + Delete "$INSTDIR\tunsafe-tap-installer.exe" + NSISdl::download http://tunsafe.com/downloads/TunSafe-TAP-auto.exe "$INSTDIR\TunSafe-TAP Installer.exe" + Pop $R0 ;Get the return value + ${Unless} $R0 == "success" + MessageBox MB_ICONEXCLAMATION "An error occurred while downloading the TunSafe-TAP Virtual Ethernet Adapter. The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + NSISdl::download http://tunsafe.com/downloads/TunSafe-TAP-auto.exe.sig "$INSTDIR\TunSafe-TAP Installer.exe.sig" + ${Unless} $R0 == "success" + Delete "$INSTDIR\TunSafe-TAP Installer.exe.sig" + MessageBox MB_ICONEXCLAMATION "An error occurred while downloading the TunSafe-TAP Virtual Ethernet Adapter. The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + SignPlugin::myFunction "$INSTDIR\TunSafe-TAP Installer.exe" + Pop $R1 ;Get the return value + + Delete "$INSTDIR\TunSafe-TAP Installer.exe.sig" + + ${Unless} $R1 = 0 + MessageBox MB_ICONEXCLAMATION "The TunSafe-TAP installer that was downloaded is broken (error $R1). 
The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + + HideWindow + # Launch TunSafe-TAP installer + ExecWait '"$INSTDIR\TunSafe-TAP Installer.exe" /X /D=$INSTDIR\TAP' $1 + ShowWindow $HWNDPARENT ${SW_SHOW} + ${Unless} $1 = 0 + MessageBox MB_ICONEXCLAMATION "An error occurred while installing the TunSafe-TAP Virtual Ethernet Adapter. The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + BringToFront +SectionEnd + +Function CloseTunsafe +again: + FindWindow $0 "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90" + IntCmp $0 0 done + MessageBox MB_ICONEXCLAMATION|MB_OKCANCEL "TunSafe is currently started. The installer will close TunSafe and proceed with the installation." IDOK proceed + Quit + proceed: + SendMessage $0 1034 1 0 $1 + IntCmp $1 31337 proceed2 + MessageBox MB_ICONEXCLAMATION|MB_OKCANCEL "Unable to close TunSafe. Please close it and press OK to continue." IDOK again + Quit + proceed2: + Sleep 500 + Goto again + done: +FunctionEnd + +Function .onInit + ${GetParameters} $R0 + ClearErrors +${IfNot} ${AtLeastWin7} + MessageBox MB_OK "TunSafe requires at least Windows 7" + SetErrorLevel 1 + Quit +${EndIf} + Call CloseTunsafe + + !insertmacro MULTIUSER_INIT + SetShellVarContext all + + ${If} $INSTDIR == "" + StrCpy $1 "$PROGRAMFILES\TunSafe" + ${If} ${RunningX64} + SetRegView 64 + StrCpy $1 "$PROGRAMFILES64\TunSafe" + ${EndIf} + ReadRegStr $INSTDIR HKLM "SOFTWARE\${PRODUCT_NAME}" "" + StrCmp $INSTDIR "" 0 +2 + StrCpy $INSTDIR $1 + ${EndIf} +FunctionEnd + +Section -post + SetOverwrite on + SetOutPath $INSTDIR + + WriteRegStr HKLM SOFTWARE\${PRODUCT_NAME} "" $INSTDIR + + ; Create uninstaller + WriteUninstaller "$INSTDIR\Uninstall.exe" + + ; Show up in Add/Remove programs + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayName" "${PRODUCT_NAME} ${PRODUCT_VERSION}" + WriteRegExpandStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "UninstallString" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayIcon" "$INSTDIR\TunSafe.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayVersion" "${PRODUCT_VERSION}" + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoModify" 1 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoRepair" 1 + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "Publisher" "${PRODUCT_PUBLISHER}" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "HelpLink" "https://tunsafe.com" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "URLInfoAbout" "https://tunsafe.com" + +SectionEnd + +Function .onInstSuccess + ExecShell "" "$INSTDIR\TunSafe.exe" +FunctionEnd + +!insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN +!insertmacro MUI_DESCRIPTION_TEXT ${SecTunSafe} $(DESC_SecTAP) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTapAdapter} $(DESC_SecTapAdapter) +!insertmacro MUI_FUNCTION_DESCRIPTION_END + +Function un.onInit + ClearErrors + !insertmacro MULTIUSER_UNINIT + SetShellVarContext all + ${If} ${RunningX64} + SetRegView 64 + ${EndIf} +FunctionEnd + +Section "Uninstall" + Delete "$INSTDIR\TunSafe.exe" + Delete "$INSTDIR\License.txt" + Delete "$INSTDIR\ChangeLog.txt" + Delete "$INSTDIR\Config\TunSafe.conf" + Delete "$INSTDIR\Uninstall.exe" + Delete 
"$INSTDIR\TunSafe-TAP Installer.exe" + + RMDir "$INSTDIR" + RMDir "$INSTDIR\Config" + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + DeleteRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" + DeleteRegKey HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" +SectionEnd diff --git a/ipzip2/ipzip2.cpp b/ipzip2/ipzip2.cpp new file mode 100644 index 0000000..1b23962 --- /dev/null +++ b/ipzip2/ipzip2.cpp @@ -0,0 +1 @@ +// this is a placeholder for a packet compression algorithm not yet released. \ No newline at end of file diff --git a/netapi.h b/netapi.h new file mode 100644 index 0000000..56af4f6 --- /dev/null +++ b/netapi.h @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#ifndef TINYVPN_NETAPI_H_ +#define TINYVPN_NETAPI_H_ + +#include "stdafx.h" +#include "tunsafe_types.h" + +#include +#include + +#if !defined(OS_WIN) +#include +#include +#include +#include +#endif + +#pragma warning (disable: 4200) + +void OsGetRandomBytes(uint8 *dst, size_t dst_size); +uint64 OsGetMilliseconds(); +void OsGetTimestampTAI64N(uint8 dst[12]); +void OsInterruptibleSleep(int millis); + +union IpAddr { + sockaddr_in sin; + sockaddr_in6 sin6; +}; + +struct WgCidrAddr { + uint8 addr[16]; + uint8 size; + uint8 cidr; +}; + +struct Packet { + union { + Packet *next; +#if defined(OS_WIN) + SLIST_ENTRY list_entry; +#endif + }; + unsigned int post_target, size; + byte *data; + +#if defined(OS_WIN) + OVERLAPPED overlapped; // For Windows overlapped IO +#endif + + IpAddr addr; // Optionally set to target/source of the packet + int sin_size; + + byte data_pre[4]; + byte data_buf[0]; + + enum { + // there's always this much data before data_ptr + HEADROOM_BEFORE = 64, + }; +}; + +enum { + kPacketAllocSize = 2048 - 16, + kPacketCapacity = kPacketAllocSize - sizeof(Packet) - Packet::HEADROOM_BEFORE, +}; + +void FreePacket(Packet *packet); +void FreePackets(Packet *packet, Packet **end, int count); +Packet *AllocPacket(); +void FreeAllPackets(); + +class TunInterface { +public: + struct PrePostCommands { + std::vector pre_up; + std::vector post_up; + std::vector pre_down; + std::vector post_down; + }; + + + struct TunConfig { + // IP address and netmask of the tun device + in_addr_t ip; + uint8 cidr; + + bool block_dns_on_adapters; + + // no, yes(firewall), yes(route), yes(both), 255(default) + uint8 internet_blocking; + + // Set this to configure a default route for ipv4 + bool use_ipv4_default_route; + + // Set this to configure a default route for ipv6 + bool use_ipv6_default_route; + + // DHCP settings + const byte *dhcp_options; + size_t dhcp_options_size; + + // This holds the address of the vpn endpoint, so those get routed to the old iface. + uint32 default_route_endpoint_v4; + + // Set mtu + int mtu; + + // Set ipv6 address? + uint8 ipv6_address[16]; + uint8 ipv6_cidr; + + bool set_ipv6_dns; + + // Set this to configure DNS server. + uint8 dns_server_v6[16]; + + // This holds the address of the vpn endpoint, so those get routed to the old iface. 
+ uint8 default_route_endpoint_v6[16]; + + // This holds all cidr addresses to add as additional routing entries + std::vector extra_routes; + + // This holds the pre/post commands + PrePostCommands pre_post_commands; + }; + + struct TunConfigOut { + bool enable_neighbor_discovery_spoofing; + uint8 neighbor_discovery_spoofing_mac[6]; + }; + + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) = 0; + virtual void WriteTunPacket(Packet *packet) = 0; +}; + +class UdpInterface { +public: + virtual bool Initialize(int listen_port) = 0; + virtual void WriteUdpPacket(Packet *packet) = 0; +}; + +extern bool g_allow_pre_post; + +#endif // TINYVPN_NETAPI_H_ diff --git a/network_bsd.cpp b/network_bsd.cpp new file mode 100644 index 0000000..b617835 --- /dev/null +++ b/network_bsd.cpp @@ -0,0 +1,898 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "netapi.h" +#include "wireguard.h" +#include "wireguard_config.h" +#include "tunsafe_endian.h" +#include "util.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#if defined(OS_MACOSX) +#include +#include +#include +#include +#include +#include +#elif defined(OS_FREEBSD) +#include +#include +#elif defined(OS_LINUX) +#include +#include +#endif + +static Packet *freelist; + +void FreePacket(Packet *packet) { + packet->next = freelist; + freelist = packet; +} + +Packet *AllocPacket() { + Packet *p = freelist; + if (p) { + freelist = p->next; + } else { + p = (Packet*)malloc(kPacketAllocSize); + if (p == NULL) { + RERROR("Allocation failure"); + abort(); + } + } + p->data = p->data_buf + Packet::HEADROOM_BEFORE; + p->size = 0; + return p; +} + +void FreePackets() { + Packet *p; + while ( (p = freelist ) != NULL) { + freelist = p->next; + free(p); + } +} + + +#if defined(OS_MACOSX) +static mach_timebase_info_data_t timebase = { 0, 0 }; +static uint64_t initclock; + +void InitOsxGetMilliseconds() { + if (mach_timebase_info(&timebase) != 0) + abort(); + initclock = mach_absolute_time(); + + timebase.denom *= 1000000; +} + +uint64 OsGetMilliseconds() +{ + uint64_t clock = mach_absolute_time() - initclock; + return clock * (uint64_t)timebase.numer / (uint64_t)timebase.denom; +} + +#else // defined(OS_MACOSX) +uint64 OsGetMilliseconds() { + struct timespec ts; + if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) { + //error + fprintf(stderr, "clock_gettime failed\n"); + exit(1); + } + return (uint64)ts.tv_sec * 1000 + (ts.tv_nsec / 1000000); +} +#endif + +void OsGetTimestampTAI64N(uint8 dst[12]) { + struct timeval tv; + gettimeofday(&tv, NULL); + uint64 secs_since_epoch = tv.tv_sec + 0x400000000000000a; + uint32 nanos = tv.tv_usec * 1000; + WriteBE64(dst, secs_since_epoch); + WriteBE32(dst + 8, nanos); +} + +void OsGetRandomBytes(uint8 *data, size_t data_size) { + int fd = open("/dev/urandom", O_RDONLY); + int r = read(fd, data, data_size); + if (r < 0) r = 0; + close(fd); + for (; r < data_size; r++) + data[r] = rand() >> 6; +} + +void OsInterruptibleSleep(int millis) { + usleep((useconds_t)millis * 1000); +} + +#if defined(OS_MACOSX) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + struct sockaddr_ctl sc; + struct ctl_info ctlinfo = {0}; + int fd; + + memcpy(ctlinfo.ctl_name, UTUN_CONTROL_NAME, sizeof(UTUN_CONTROL_NAME)); + + for(int i = 0; i < 256; i++) { + fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL); + if (fd < 0) { + 
RERROR("socket(SYSPROTO_CONTROL) failed"); + return -1; + } + + if (ioctl(fd, CTLIOCGINFO, &ctlinfo) == -1) { + RERROR("ioctl(CTLIOCGINFO) failed: %d", errno); + close(fd); + return -1; + } + sc.sc_id = ctlinfo.ctl_id; + sc.sc_len = sizeof(sc); + sc.sc_family = AF_SYSTEM; + sc.ss_sysaddr = AF_SYS_CONTROL; + sc.sc_unit = i + 1; + if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) == 0) { + socklen_t devname_size2 = devname_size; + if (getsockopt(fd, SYSPROTO_CONTROL, UTUN_OPT_IFNAME, devname, &devname_size2)) { + RERROR("getsockopt(UTUN_OPT_IFNAME) failed"); + close(fd); + return -1; + } + + + return fd; + } + close(fd); + } + return -1; +} + +#elif defined(OS_FREEBSD) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + char buf[32]; + int tun_fd; + // First open an existing tun device + for(int i = 0; i < 256; i++) { + sprintf(buf, "/dev/tun%d", i); + tun_fd = open(buf, O_RDWR); + if (tun_fd >= 0) goto did_open; + } + tun_fd = open("/dev/tun", O_RDWR); + if (tun_fd < 0) + return tun_fd; +did_open: + if (!fdevname_r(tun_fd, devname, devname_size)) { + RERROR("Unable to get name of tun device"); + close(tun_fd); + return -1; + } + int flags = IFF_POINTOPOINT | IFF_MULTICAST; + if (ioctl(tun_fd, TUNSIFMODE, &flags) < 0) { + RERROR("ioctl(TUNSIFMODE) failed"); + close(tun_fd); + return -1; + + } + flags = 1; + if (ioctl(tun_fd, TUNSIFHEAD, &flags) < 0) { + RERROR("ioctl(TUNSIFHEAD) failed"); + close(tun_fd); + return -1; + } + return tun_fd; +} + +#elif defined(OS_LINUX) +#define TUN_PREFIX_BYTES 0 +int open_tun(char *devname, size_t devname_size) { + int fd, err; + struct ifreq ifr; + + fd = open("/dev/net/tun", O_RDWR); + if (fd < 0) + return fd; + + memset(&ifr, 0, sizeof(ifr)); + ifr.ifr_flags = IFF_TUN | IFF_NO_PI; + + if ((err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0) { + close(fd); + return err; + } + strcpy(devname, ifr.ifr_name); + return fd; +} +#endif + +int open_udp(int listen_on_port) { + int udp_fd = socket(AF_INET, SOCK_DGRAM, 0); + if (udp_fd < 0) return udp_fd; + sockaddr_in sin = {0}; + sin.sin_family = AF_INET; + sin.sin_port = htons(listen_on_port); + if (bind(udp_fd, (struct sockaddr*)&sin, sizeof(sin)) != 0) { + close(udp_fd); + return -1; + } + return udp_fd; +} + +struct RouteInfo { + uint8 family; + uint8 cidr; + uint8 ip[16]; + uint8 gw[16]; +}; + +class TunsafeBackendBsd : public TunInterface, public UdpInterface { +public: + TunsafeBackendBsd(); + void RunLoop(); + void Cleanup(); + + void SetProcessor(WireguardProcessor *wg) { processor_ = wg; } + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + + // -- from UdpInterface + virtual bool Initialize(int listen_port) override; + virtual void WriteUdpPacket(Packet *packet) override; + + + void HandleSigAlrm() { got_sig_alarm_ = true; } + void HandleExit() { exit_ = true; } + +private: + bool ReadFromUdp(); + bool ReadFromTun(); + bool WriteToUdp(); + bool WriteToTun(); + + + void SetUdpFd(int fd); + void SetTunFd(int fd); + + void AddRoute(uint32 ip, uint32 cidr, uint32 gw); + void DelRoute(const RouteInfo &cd); + bool AddRoute(int family, const void *dest, int dest_prefix, const void *gateway); + + + inline void RecomputeMaxFd() { max_fd_ = ((tun_fd_>udp_fd_) ? 
tun_fd_ : udp_fd_) + 1; } + + WireguardProcessor *processor_; + + int tun_fd_, udp_fd_, max_fd_; + bool got_sig_alarm_; + bool exit_; + + bool tun_readable_, tun_writable_; + bool udp_readable_, udp_writable_; + + Packet *tun_queue_, **tun_queue_end_; + Packet *udp_queue_, **udp_queue_end_; + + Packet *read_packet_; + + std::vector cleanup_commands_; + + fd_set readfds_, writefds_; + + +}; + +TunsafeBackendBsd::TunsafeBackendBsd() + : processor_(NULL), + tun_fd_(-1), + udp_fd_(-1), + tun_readable_(false), + tun_writable_(false), + udp_readable_(false), + udp_writable_(false), + got_sig_alarm_(false), + exit_(false), + tun_queue_(NULL), + tun_queue_end_(&tun_queue_), + udp_queue_(NULL), + udp_queue_end_(&udp_queue_), + read_packet_(NULL) { + RecomputeMaxFd(); + + FD_ZERO(&readfds_); + FD_ZERO(&writefds_); + read_packet_ = AllocPacket(); +} + +void TunsafeBackendBsd::SetUdpFd(int fd) { + udp_fd_ = fd; + RecomputeMaxFd(); + udp_writable_ = true; +} + +void TunsafeBackendBsd::SetTunFd(int fd) { + tun_fd_ = fd; + RecomputeMaxFd(); + tun_writable_ = true; +} + + +bool TunsafeBackendBsd::ReadFromUdp() { + socklen_t sin_len; + sin_len = sizeof(read_packet_->addr.sin); + int r = recvfrom(udp_fd_, read_packet_->data, kPacketCapacity, 0, + (sockaddr*)&read_packet_->addr.sin, &sin_len); + if (r >= 0) { +// printf("Read %d bytes from UDP\n", r); + read_packet_->sin_size = sin_len; + read_packet_->size = r; + if (processor_) { + processor_->HandleUdpPacket(read_packet_, false); + read_packet_ = AllocPacket(); + } + return true; + } else { + if (errno != EAGAIN) { + fprintf(stderr, "Read from UDP failed\n"); + } + udp_readable_ = false; + return false; + } +} + +bool TunsafeBackendBsd::WriteToUdp() { + assert(udp_writable_); +// RINFO("Send %d bytes to %s", (int)udp_queue_->size, inet_ntoa(udp_queue_->sin.sin_addr)); + int r = sendto(udp_fd_, udp_queue_->data, udp_queue_->size, 0, + (sockaddr*)&udp_queue_->addr.sin, sizeof(udp_queue_->addr.sin)); + if (r < 0) { + if (errno == EAGAIN) { + udp_writable_ = false; + return false; + } + perror("Write to UDP failed"); + } else { + if (r != udp_queue_->size) + perror("Write to udp incomplete!"); +// else +// RINFO("Wrote %d bytes to UDP", r); + } + Packet *next = udp_queue_->next; + FreePacket(udp_queue_); + if ((udp_queue_ = next) != NULL) return true; + udp_queue_end_ = &udp_queue_; + return false; +} + +static inline bool IsCompatibleProto(uint32 v) { + return v == AF_INET || v == AF_INET6; +} + +bool TunsafeBackendBsd::ReadFromTun() { + assert(tun_readable_); + Packet *packet = read_packet_; + int r = read(tun_fd_, packet->data - TUN_PREFIX_BYTES, kPacketCapacity + TUN_PREFIX_BYTES); + if (r >= 0) { +// printf("Read %d bytes from TUN\n", r); + packet->size = r - TUN_PREFIX_BYTES; + if (r >= TUN_PREFIX_BYTES && (!TUN_PREFIX_BYTES || IsCompatibleProto(ReadBE32(packet->data - TUN_PREFIX_BYTES))) && processor_) { +// printf("%X %X %X %X %X %X %X %X\n", +// read_packet_->data[0], read_packet_->data[1], read_packet_->data[2], read_packet_->data[3], +// read_packet_->data[4], read_packet_->data[5], read_packet_->data[6], read_packet_->data[7]); + read_packet_ = AllocPacket(); + processor_->HandleTunPacket(packet); + } + return true; + } else { + if (errno != EAGAIN) { + fprintf(stderr, "Read from tun failed\n"); + } + tun_readable_ = false; + return false; + } +} + +static uint32 GetProtoFromPacket(const uint8 *data, size_t size) { + return size < 1 || (data[0] >> 4) != 6 ? 
AF_INET : AF_INET6; +} + +bool TunsafeBackendBsd::WriteToTun() { + assert(tun_writable_); + if (TUN_PREFIX_BYTES) { + WriteBE32(tun_queue_->data - TUN_PREFIX_BYTES, GetProtoFromPacket(tun_queue_->data, tun_queue_->size)); + } + int r = write(tun_fd_, tun_queue_->data - TUN_PREFIX_BYTES, tun_queue_->size + TUN_PREFIX_BYTES); + if (r < 0) { + if (errno == EAGAIN) { + tun_writable_ = false; + return false; + } + RERROR("Write to tun failed"); + } else { + r -= TUN_PREFIX_BYTES; + if (r != tun_queue_->size) + RERROR("Write to tun incomplete!"); +// else +// RINFO("Wrote %d bytes to TUN", r); + } + Packet *next = tun_queue_->next; + FreePacket(tun_queue_); + if ((tun_queue_ = next) != NULL) return true; + tun_queue_end_ = &tun_queue_; + return false; +} + +static uint32 CidrToNetmaskV4(int cidr) { + return cidr == 32 ? 0xffffffff : 0xffffffff << (32 - cidr); +} + +#if defined(OS_MACOSX) || defined(OS_FREEBSD) +struct MyRouteMsg { + struct rt_msghdr hdr; + uint32 pad; + struct sockaddr_in target; + struct sockaddr_in netmask; +}; + +struct MyRouteReply { + struct rt_msghdr hdr; + uint8 buf[512]; +}; + +// Zero gets rounded up +#if defined(OS_MACOSX) +#define RTMSG_ROUNDUP(a) ((a) ? ((((a) - 1) | (sizeof(uint32_t) - 1)) + 1) : sizeof(uint32_t)) +#else +#define RTMSG_ROUNDUP(a) ((a) ? ((((a) - 1) | (sizeof(long) - 1)) + 1) : sizeof(long)) +#endif + + +static bool GetDefaultRoute(char *iface, size_t iface_size, uint32 *gw_addr) { + int fd, pid, len; + + union { + MyRouteMsg rt; + MyRouteReply rep; + }; + + fd = socket(PF_ROUTE, SOCK_RAW, AF_INET); + if (fd < 0) + return false; + + memset(&rt, 0, sizeof(rt)); + + rt.hdr.rtm_type = RTM_GET; + rt.hdr.rtm_flags = RTF_UP | RTF_GATEWAY; + rt.hdr.rtm_version = RTM_VERSION; + rt.hdr.rtm_seq = 0; + rt.hdr.rtm_addrs = RTA_DST | RTA_NETMASK | RTA_IFP; + + rt.target.sin_family = AF_INET; + rt.netmask.sin_family = AF_INET; + + rt.target.sin_len = sizeof(struct sockaddr_in); + rt.netmask.sin_len = sizeof(struct sockaddr_in); + + rt.hdr.rtm_msglen = sizeof(rt); + + if (write(fd, (char*)&rt, sizeof(rt)) != sizeof(rt)) { + RERROR("PF_ROUTE write failed."); + close(fd); + return false; + } + + pid = getpid(); + do { + len = read(fd, (char *)&rep, sizeof(rep)); + if (len <= 0) { + RERROR("PF_ROUTE read failed."); + close(fd); + return false; + } + } while (rep.hdr.rtm_seq != 0 || rep.hdr.rtm_pid != pid); + close(fd); + + const struct sockaddr_dl *ifp = NULL; + const struct sockaddr_in *gw = NULL; + + uint8 *pos = rep.buf; + for(int i = 1; i && i < rep.hdr.rtm_addrs; i <<= 1) { + if (rep.hdr.rtm_addrs & i) { + if (1 > rep.buf + 512 - pos) + break; // invalid + size_t len = RTMSG_ROUNDUP(((struct sockaddr*)pos)->sa_len); + if (len > rep.buf + 512 - pos) + break; // invalid +// RINFO("rtm %d %d", i, ((struct sockaddr*)pos)->sa_len); + if (i == RTA_IFP && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_dl)) { + ifp = (struct sockaddr_dl *)pos; + } else if (i == RTA_GATEWAY && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_in)) { + gw = (struct sockaddr_in *)pos; + + } + pos += len; + } + } + + if (ifp && ifp->sdl_nlen && ifp->sdl_nlen < iface_size) { + iface[ifp->sdl_nlen] = 0; + memcpy(iface, ifp->sdl_data, ifp->sdl_nlen); + if (gw && gw->sin_family == AF_INET) { + *gw_addr = ReadBE32(&gw->sin_addr); + return true; + } + + } +// RINFO("Read %d %d %d", len, rep.hdr.rtm_addrs, (int)sizeof(struct rt_msghdr )); + return false; +} +#endif // defined(OS_MACOSX) || defined(OS_FREEBSD) + +#if defined(OS_LINUX) +static bool GetDefaultRoute(char *iface, 
size_t iface_size, uint32 *gw_addr) { + return false; +} +#endif // defined(OS_LINUX) + +static uint32 ComputeIpv4DefaultRoute(uint32 ip, uint32 netmask) { + uint32 default_route_v4 = (ip & netmask) | 1; + if (default_route_v4 == ip) + default_route_v4++; + return default_route_v4; +} + +static void ComputeIpv6DefaultRoute(const uint8 *ipv6_address, uint8 ipv6_cidr, uint8 *default_route_v6) { + memcpy(default_route_v6, ipv6_address, 16); + // clear the last bits of the ipv6 address to match the cidr. + size_t n = (ipv6_cidr + 7) >> 3; + memset(&default_route_v6[n], 0, 16 - n); + if (n == 0) + return; + // adjust the final byte + default_route_v6[n - 1] &= ~(0xff >> (ipv6_cidr & 7)); + // set the very last byte to something + default_route_v6[15] |= 1; + // ensure it doesn't collide + if (memcmp(default_route_v6, ipv6_address, 16) == 0) + default_route_v6[15] ^= 3; +} + +void TunsafeBackendBsd::AddRoute(uint32 ip, uint32 cidr, uint32 gw) { + uint32 ip_be, gw_be; + WriteBE32(&ip_be, ip); + WriteBE32(&gw_be, gw); + AddRoute(AF_INET, &ip_be, cidr, &gw_be); +} + +static void AddOrRemoveRoute(const RouteInfo &cd, bool remove) { + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + + print_ip_prefix(buf1, cd.family, cd.ip, cd.cidr); + print_ip_prefix(buf2, cd.family, cd.gw, -1); + +#if defined(OS_LINUX) + const char *cmd = remove ? "delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route %s -net %s gw %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route %s -net inet6 %s gw %s", cmd, buf1, buf2); + } +#elif defined(OS_MACOSX) + const char *cmd = remove ? "delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route -q %s %s %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route -q %s -inet6 %s %s", cmd, buf1, buf2); + } +#endif +} + +bool TunsafeBackendBsd::AddRoute(int family, const void *dest, int dest_prefix, const void *gateway) { + RouteInfo c; + + c.family = family; + size_t len = (family == AF_INET) ? 
4 : 16; + memcpy(c.ip, dest, len); + memcpy(c.gw, gateway, len); + c.cidr = dest_prefix; + cleanup_commands_.push_back(c); + AddOrRemoveRoute(c, false); + return true; +} + +void TunsafeBackendBsd::DelRoute(const RouteInfo &cd) { + AddOrRemoveRoute(cd, true); +} + +static bool IsIpv6AddressSet(const void *p) { + return (ReadLE64(p) | ReadLE64((char*)p + 8)) != 0; +} + +// Called to initialize tun +bool TunsafeBackendBsd::Initialize(const TunConfig &&config, TunConfigOut *out) override { + char devname[12]; + char def_iface[12]; + char buf[kSizeOfAddress]; + + Cleanup(); + + out->enable_neighbor_discovery_spoofing = false; + + int tun_fd = open_tun(devname, sizeof(devname)); + if (tun_fd < 0) { RERROR("Error opening tun device"); return false; } + + fcntl(tun_fd, F_SETFD, FD_CLOEXEC); + fcntl(tun_fd, F_SETFL, O_NONBLOCK); + + SetTunFd(tun_fd); + + uint32 netmask = CidrToNetmaskV4(config.cidr); + uint32 default_route_v4 = ComputeIpv4DefaultRoute(config.ip, netmask); + + RunCommand("/sbin/ifconfig %s %A mtu %d %A netmask %A up", devname, config.ip, config.mtu, config.ip, netmask); + AddRoute(config.ip & netmask, config.cidr, config.ip); + + if (config.use_ipv4_default_route) { + if (config.default_route_endpoint_v4) { + uint32 gw; + if (!GetDefaultRoute(def_iface, sizeof(def_iface), &gw)) { + RERROR("Unable to determine default interface."); + return false; + } + AddRoute(config.default_route_endpoint_v4, 32, gw); + + } + AddRoute(0x00000000, 1, default_route_v4); + AddRoute(0x80000000, 1, default_route_v4); + } + + uint8 default_route_v6[16]; + + if (config.ipv6_cidr) { + static const uint8 matchall_1_route[17] = {0x80, 0, 0, 0}; + + ComputeIpv6DefaultRoute(config.ipv6_address, config.ipv6_cidr, default_route_v6); + + RunCommand("/sbin/ifconfig %s inet6 %s", devname, print_ip_prefix(buf, AF_INET6, config.ipv6_address, config.ipv6_cidr)); + + if (config.use_ipv6_default_route) { + if (IsIpv6AddressSet(config.default_route_endpoint_v6)) { + RERROR("default_route_endpoint_v6 not supported"); + } + AddRoute(AF_INET6, matchall_1_route + 1, 1, default_route_v6); + AddRoute(AF_INET6, matchall_1_route + 0, 1, default_route_v6); + } + } + + // Add all the extra routes + for (auto it = config.extra_routes.begin(); it != config.extra_routes.end(); ++it) { + if (it->size == 32) { + AddRoute(ReadBE32(it->addr), it->cidr, default_route_v4); + } else if (it->size == 128 && config.ipv6_cidr) { + AddRoute(AF_INET6, it->addr, it->cidr, default_route_v6); + } + } + + return true; +} + +void TunsafeBackendBsd::Cleanup() { + for(auto it = cleanup_commands_.begin(); it != cleanup_commands_.end(); ++it) + DelRoute(*it); + cleanup_commands_.clear(); +} + +void TunsafeBackendBsd::WriteTunPacket(Packet *packet) override { + assert(tun_fd_ >= 0); + Packet *queue_is_used = tun_queue_; + *tun_queue_end_ = packet; + tun_queue_end_ = &packet->next; + packet->next = NULL; + if (!queue_is_used) + WriteToTun(); +} + +// Called to initialize udp +bool TunsafeBackendBsd::Initialize(int listen_port) override { + int udp_fd = open_udp(listen_port); + if (udp_fd < 0) { RERROR("Error opening udp"); return false; } + fcntl(udp_fd, F_SETFD, FD_CLOEXEC); + fcntl(udp_fd, F_SETFL, O_NONBLOCK); + SetUdpFd(udp_fd); + return true; +} + +void TunsafeBackendBsd::WriteUdpPacket(Packet *packet) override { + assert(udp_fd_ >= 0); + Packet *queue_is_used = udp_queue_; + *udp_queue_end_ = packet; + udp_queue_end_ = &packet->next; + packet->next = NULL; + if (!queue_is_used) + WriteToUdp(); +} + +static TunsafeBackendBsd *g_socket_loop; + +static 
void SigAlrm(int sig) { + if (g_socket_loop) + g_socket_loop->HandleSigAlrm(); +} + +static bool did_ctrlc; + +void SigInt(int sig) { + if (did_ctrlc) + exit(1); + did_ctrlc = true; + write(1, "Ctrl-C detected. Exiting. Press again to force quit.\n", sizeof("Ctrl-C detected. Exiting. Press again to force quit.\n")-1); + + if (g_socket_loop) + g_socket_loop->HandleExit(); +} + +void TunsafeBackendBsd::RunLoop() { + int free_packet_interval = 10; + + assert(!g_socket_loop); + assert(processor_); + + g_socket_loop = this; + // We want an alarm signal every second. + { + struct sigaction act = {0}; + act.sa_handler = SigAlrm; + if (sigaction(SIGALRM, &act, NULL) < 0) { + RERROR("Unable to install SIGALRM handler."); + return; + } + } + + { + struct sigaction act = {0}; + act.sa_handler = SigInt; + if (sigaction(SIGINT, &act, NULL) < 0) { + RERROR("Unable to install SIGINT handler."); + return; + } + } + +#if defined(OS_LINUX) || defined(OS_FREEBSD) + { + struct itimerspec tv = {0}; + struct sigevent sev; + timer_t timer_id; + + tv.it_interval.tv_sec = 1; + tv.it_value.tv_sec = 1; + + sev.sigev_notify = SIGEV_SIGNAL; + sev.sigev_signo = SIGALRM; + sev.sigev_value.sival_ptr = NULL; + + if (timer_create(CLOCK_MONOTONIC, &sev, &timer_id) < 0) { + RERROR("timer_create failed"); + return; + } + + if (timer_settime(timer_id, 0, &tv, NULL) < 0) { + RERROR("timer_settime failed"); + return; + } + } +#elif defined(OS_MACOSX) + ualarm(1000000, 1000000); +#endif + + while (!exit_) { + int n = -1; + +// printf("entering sleep %d,%d,%d %d\n", udp_fd_, tun_fd_, max_fd_, FD_ISSET(tun_fd_, &readfds_)); + // Wait for sockets to become usable + if (!got_sig_alarm_) { + + if (tun_fd_ >= 0) { + FD_SET(tun_fd_, &readfds_); + if (tun_writable_) FD_CLR(tun_fd_, &writefds_); else FD_SET(tun_fd_, &writefds_); + } + + if (udp_fd_ >= 0) { + FD_SET(udp_fd_, &readfds_); + if (udp_writable_) FD_CLR(udp_fd_, &writefds_); else FD_SET(udp_fd_, &writefds_); + } + + n = select(max_fd_, &readfds_, &writefds_, NULL, NULL); + if (n == -1) { + if (errno != EINTR) { + fprintf(stderr, "select failed\n"); + break; + } + } + } + // This is not fully signal safe. 
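+    // (If SIGALRM is delivered after the got_sig_alarm_ check above but
+    //  before select() enters, select() blocks without EINTR and the tick is
+    //  only handled on the next wakeup. A fully robust loop would use the
+    //  classic self-pipe trick -- hypothetical sketch, assuming an
+    //  int pipe_fds_[2] member set up with pipe():
+    //    in SigAlrm():    write(pipe_fds_[1], "x", 1);
+    //    before select(): FD_SET(pipe_fds_[0], &readfds_);
+    //  so a pending signal always wakes the wait.)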
+ if (got_sig_alarm_) { + got_sig_alarm_ = false; + processor_->SecondLoop(); + if (free_packet_interval == 0) { + FreePackets(); + free_packet_interval = 10; + } + free_packet_interval--; + } + if (n < 0) continue; + + if (tun_fd_ >= 0) { + tun_readable_ = (FD_ISSET(tun_fd_, &readfds_) != 0); + tun_writable_ |= (FD_ISSET(tun_fd_, &writefds_) != 0); + } + if (udp_fd_ >= 0) { + udp_readable_ = (FD_ISSET(udp_fd_, &readfds_) != 0); + udp_writable_ |= (FD_ISSET(udp_fd_, &writefds_) != 0); + } + + for(int loop = 0; loop < 256; loop++) { + bool more_work = false; + if (tun_queue_ != NULL && tun_writable_) more_work |= WriteToTun(); + if (udp_queue_ != NULL && udp_writable_) more_work |= WriteToUdp(); + if (tun_readable_) more_work |= ReadFromTun(); + if (udp_readable_) more_work |= ReadFromUdp(); + if (!more_work) + break; + } + } + + g_socket_loop = NULL; +} + +void InitCpuFeatures(); +void Benchmark(); + +int main(int argc, char **argv) { + bool exit_flag = false; + + InitCpuFeatures(); + + if (argc == 2 && strcmp(argv[1], "--benchmark") == 0) { + Benchmark(); + return 0; + } + + if (argc < 2) { + fprintf(stderr, "Syntax: tunsafe file.conf\n"); + return 1; + } + +#if defined(OS_MACOSX) + InitOsxGetMilliseconds(); +#endif + + TunsafeBackendBsd socket_loop; + WireguardProcessor wg(&socket_loop, &socket_loop, NULL); + socket_loop.SetProcessor(&wg); + + if (!ParseWireGuardConfigFile(&wg, argv[1], &exit_flag)) return 1; + if (!wg.Start()) return 1; + + socket_loop.RunLoop(); + socket_loop.Cleanup(); + return 0; +} diff --git a/network_bsd_mt.cpp b/network_bsd_mt.cpp new file mode 100644 index 0000000..3f1a043 --- /dev/null +++ b/network_bsd_mt.cpp @@ -0,0 +1,1251 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "netapi.h" +#include "wireguard.h" +#include "wireguard_config.h" +#include "tunsafe_endian.h" +#include "tunsafe_config.h" +#include "util.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include + +#if defined(OS_MACOSX) +#include +#include +#include +#include +#include +#include +#elif defined(OS_FREEBSD) +#include +#include +#elif defined(OS_LINUX) +#include +#include +#include +#endif + + + +static Packet *freelist; + +void SetThreadName(const char *name) { +#if defined(OS_LINUX) + prctl(PR_SET_NAME, name, 0, 0, 0); +#endif // defined(OS_LINUX) +} + +void FreePacket(Packet *packet) { + free(packet); +// packet->next = freelist; +// freelist = packet; +} + +Packet *AllocPacket() { + Packet *p = NULL;// freelist; + if (p) { + freelist = p->next; + } else { + p = (Packet*)malloc(kPacketAllocSize); + if (p == NULL) { + RERROR("Allocation failure"); + abort(); + } + } + p->data = p->data_buf + Packet::HEADROOM_BEFORE; + p->size = 0; + return p; +} + +void FreePackets() { + Packet *p; + while ( (p = freelist ) != NULL) { + freelist = p->next; + free(p); + } +} + +#if defined(OS_MACOSX) || defined(OS_FREEBSD) +struct MyRouteMsg { + struct rt_msghdr hdr; + uint32 pad; + struct sockaddr_in target; + struct sockaddr_in netmask; +}; + +struct MyRouteReply { + struct rt_msghdr hdr; + uint8 buf[512]; +}; + +// Zero gets rounded up +#if defined(OS_MACOSX) +#define RTMSG_ROUNDUP(a) ((a) ? ((((a) - 1) | (sizeof(uint32_t) - 1)) + 1) : sizeof(uint32_t)) +#else +#define RTMSG_ROUNDUP(a) ((a) ? 
((((a) - 1) | (sizeof(long) - 1)) + 1) : sizeof(long)) +#endif + + +static bool GetDefaultRoute(char *iface, size_t iface_size, uint32 *gw_addr) { + int fd, pid, len; + + union { + MyRouteMsg rt; + MyRouteReply rep; + }; + + fd = socket(PF_ROUTE, SOCK_RAW, AF_INET); + if (fd < 0) + return false; + + memset(&rt, 0, sizeof(rt)); + + rt.hdr.rtm_type = RTM_GET; + rt.hdr.rtm_flags = RTF_UP | RTF_GATEWAY; + rt.hdr.rtm_version = RTM_VERSION; + rt.hdr.rtm_seq = 0; + rt.hdr.rtm_addrs = RTA_DST | RTA_NETMASK | RTA_IFP; + + rt.target.sin_family = AF_INET; + rt.netmask.sin_family = AF_INET; + + rt.target.sin_len = sizeof(struct sockaddr_in); + rt.netmask.sin_len = sizeof(struct sockaddr_in); + + rt.hdr.rtm_msglen = sizeof(rt); + + if (write(fd, (char*)&rt, sizeof(rt)) != sizeof(rt)) { + RERROR("PF_ROUTE write failed."); + close(fd); + return false; + } + + pid = getpid(); + do { + len = read(fd, (char *)&rep, sizeof(rep)); + if (len <= 0) { + RERROR("PF_ROUTE read failed."); + close(fd); + return false; + } + } while (rep.hdr.rtm_seq != 0 || rep.hdr.rtm_pid != pid); + close(fd); + + const struct sockaddr_dl *ifp = NULL; + const struct sockaddr_in *gw = NULL; + + uint8 *pos = rep.buf; + for (int i = 1; i && i < rep.hdr.rtm_addrs; i <<= 1) { + if (rep.hdr.rtm_addrs & i) { + if (1 > rep.buf + 512 - pos) + break; // invalid + size_t len = RTMSG_ROUNDUP(((struct sockaddr*)pos)->sa_len); + if (len > rep.buf + 512 - pos) + break; // invalid + // RINFO("rtm %d %d", i, ((struct sockaddr*)pos)->sa_len); + if (i == RTA_IFP && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_dl)) { + ifp = (struct sockaddr_dl *)pos; + } else if (i == RTA_GATEWAY && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_in)) { + gw = (struct sockaddr_in *)pos; + + } + pos += len; + } + } + + if (ifp && ifp->sdl_nlen && ifp->sdl_nlen < iface_size) { + iface[ifp->sdl_nlen] = 0; + memcpy(iface, ifp->sdl_data, ifp->sdl_nlen); + if (gw && gw->sin_family == AF_INET) { + *gw_addr = ReadBE32(&gw->sin_addr); + return true; + } + + } + // RINFO("Read %d %d %d", len, rep.hdr.rtm_addrs, (int)sizeof(struct rt_msghdr )); + return false; +} +#endif // defined(OS_MACOSX) || defined(OS_FREEBSD) + +#if defined(OS_LINUX) +static bool GetDefaultRoute(char *iface, size_t iface_size, uint32 *gw_addr) { + return false; +} +#endif // defined(OS_LINUX) + + +#if defined(OS_MACOSX) +static mach_timebase_info_data_t timebase = { 0, 0 }; +static uint64_t initclock; + +void InitOsxGetMilliseconds() { + if (mach_timebase_info(&timebase) != 0) + abort(); + initclock = mach_absolute_time(); + + timebase.denom *= 1000000; +} + +uint64 OsGetMilliseconds() +{ + uint64_t clock = mach_absolute_time() - initclock; + return clock * (uint64_t)timebase.numer / (uint64_t)timebase.denom; +} + +#else // defined(OS_MACOSX) +uint64 OsGetMilliseconds() { + struct timespec ts; + if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) { + //error + fprintf(stderr, "clock_gettime failed\n"); + exit(1); + } + return (uint64)ts.tv_sec * 1000 + (ts.tv_nsec / 1000000); +} +#endif + +void OsGetTimestampTAI64N(uint8 dst[12]) { + struct timeval tv; + gettimeofday(&tv, NULL); + uint64 secs_since_epoch = tv.tv_sec + 0x400000000000000a; + uint32 nanos = tv.tv_usec * 1000; + WriteBE64(dst, secs_since_epoch); + WriteBE32(dst + 8, nanos); +} + +void OsGetRandomBytes(uint8 *data, size_t data_size) { + int fd = open("/dev/urandom", O_RDONLY); + int r = read(fd, data, data_size); + if (r < 0) r = 0; + close(fd); + for (; r < data_size; r++) + data[r] = rand() >> 6; +} + +void 
OsInterruptibleSleep(int millis) { + usleep((useconds_t)millis * 1000); +} + +#if defined(OS_MACOSX) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + struct sockaddr_ctl sc; + struct ctl_info ctlinfo = {0}; + int fd; + + memcpy(ctlinfo.ctl_name, UTUN_CONTROL_NAME, sizeof(UTUN_CONTROL_NAME)); + + for(int i = 0; i < 256; i++) { + fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL); + if (fd < 0) { + RERROR("socket(SYSPROTO_CONTROL) failed"); + return -1; + } + + if (ioctl(fd, CTLIOCGINFO, &ctlinfo) == -1) { + RERROR("ioctl(CTLIOCGINFO) failed: %d", errno); + close(fd); + return -1; + } + sc.sc_id = ctlinfo.ctl_id; + sc.sc_len = sizeof(sc); + sc.sc_family = AF_SYSTEM; + sc.ss_sysaddr = AF_SYS_CONTROL; + sc.sc_unit = i + 1; + if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) == 0) { + socklen_t devname_size2 = devname_size; + if (getsockopt(fd, SYSPROTO_CONTROL, UTUN_OPT_IFNAME, devname, &devname_size2)) { + RERROR("getsockopt(UTUN_OPT_IFNAME) failed"); + close(fd); + return -1; + } + + + return fd; + } + close(fd); + } + return -1; +} + +#elif defined(OS_FREEBSD) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + char buf[32]; + int tun_fd; + // First open an existing tun device + for(int i = 0; i < 256; i++) { + sprintf(buf, "/dev/tun%d", i); + tun_fd = open(buf, O_RDWR); + if (tun_fd >= 0) goto did_open; + } + tun_fd = open("/dev/tun", O_RDWR); + if (tun_fd < 0) + return tun_fd; +did_open: + if (!fdevname_r(tun_fd, devname, devname_size)) { + RERROR("Unable to get name of tun device"); + close(tun_fd); + return -1; + } + int flags = IFF_POINTOPOINT | IFF_MULTICAST; + if (ioctl(tun_fd, TUNSIFMODE, &flags) < 0) { + RERROR("ioctl(TUNSIFMODE) failed"); + close(tun_fd); + return -1; + + } + flags = 1; + if (ioctl(tun_fd, TUNSIFHEAD, &flags) < 0) { + RERROR("ioctl(TUNSIFHEAD) failed"); + close(tun_fd); + return -1; + } + return tun_fd; +} + +#elif defined(OS_LINUX) +#define TUN_PREFIX_BYTES 0 +int open_tun(char *devname, size_t devname_size) { + int fd, err; + struct ifreq ifr; + + fd = open("/dev/net/tun", O_RDWR); + if (fd < 0) + return fd; + + memset(&ifr, 0, sizeof(ifr)); + ifr.ifr_flags = IFF_TUN | IFF_NO_PI; + + if ((err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0) { + close(fd); + return err; + } + strcpy(devname, ifr.ifr_name); + return fd; +} +#endif + +int open_udp(int listen_on_port) { + int udp_fd = socket(AF_INET, SOCK_DGRAM, 0); + if (udp_fd < 0) return udp_fd; + sockaddr_in sin = {0}; + sin.sin_family = AF_INET; + sin.sin_port = htons(listen_on_port); + if (bind(udp_fd, (struct sockaddr*)&sin, sizeof(sin)) != 0) { + close(udp_fd); + return -1; + } + return udp_fd; +} + +class WorkerLoop { +public: + WorkerLoop(); + ~WorkerLoop(); + + bool Initialize(WireguardProcessor *processor); + + void *ThreadMain(); + void StartThread(); + + void StopThread(); + + void NotifyStop(); + + enum { + TARGET_UDP, TARGET_TUN + }; + + void HandleUdpPacket(Packet *packet) { + HandlePacket(packet, TARGET_UDP); + } + void HandleTunPacket(Packet *packet) { + HandlePacket(packet, TARGET_TUN); + } + + void HandleSigAlrm() { + got_sig_alarm_ = true; + } + +private: + static void *ThreadMainStatic(void *x); + void HandlePacket(Packet *packet, int target); + + WireguardProcessor *processor_; + pthread_t tid_; + Packet *queue_, **queue_end_; + bool shutting_down_; + bool got_sig_alarm_; + + pthread_mutex_t lock_; + pthread_cond_t cond_; +}; + +// Handles the threads that read/write to the udp socket. 
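+// Both UdpLoop and TunLoop below follow the same pattern: a reader thread
+// blocks in the kernel (recvfrom/read) and hands each packet to the
+// WorkerLoop, while a writer thread drains a mutex-protected linked list.
+// Producers append under the lock and signal the condition variable only on
+// the empty -> non-empty transition, so a writer that is already draining
+// the list is not signalled once per packet.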
+class UdpLoop { +public: + UdpLoop(); + ~UdpLoop(); + + bool Initialize(int listen_port, WorkerLoop *worker); + void Start(); + void Stop(); + + void WriteUdpPacket(Packet *packet); +private: + static void *ReaderMainStatic(void *x); + static void *WriterMainStatic(void *x); + void *ReaderMain(); + void *WriterMain(); + + int fd_; + WorkerLoop *worker_; + pthread_t read_tid_, write_tid_; + + Packet *queue_, **queue_end_; + + bool shutting_down_; + + pthread_mutex_t lock_; + pthread_cond_t cond_; +}; + +// Handles the threads that read/write to the tun socket. +class TunLoop { +public: + TunLoop(); + ~TunLoop(); + + bool Initialize(WorkerLoop *worker); + void Start(); + void Stop(); + + void WriteTunPacket(Packet *packet); + + char *devname() { return devname_; } +private: + static void *ReaderMainStatic(void *x); + static void *WriterMainStatic(void *x); + void *ReaderMain(); + void *WriterMain(); + + int fd_; + bool shutting_down_; + char devname_[16]; + + WorkerLoop *worker_; + pthread_t read_tid_, write_tid_; + Packet *queue_, **queue_end_; + pthread_mutex_t lock_; + pthread_cond_t cond_; +}; + +WorkerLoop::WorkerLoop() { + queue_end_ = &queue_; + queue_ = NULL; + tid_ = 0; + shutting_down_ = false; + got_sig_alarm_ = false; + processor_ = NULL; + pthread_mutex_init(&lock_, NULL); + pthread_cond_init(&cond_, NULL); +} + +WorkerLoop::~WorkerLoop() { + pthread_mutex_destroy(&lock_); + pthread_cond_destroy(&cond_); +} + +bool WorkerLoop::Initialize(WireguardProcessor *processor) { + processor_ = processor; + return true; +} + +void WorkerLoop::StartThread() { + assert(tid_ == 0); + pthread_create(&tid_, NULL, &ThreadMainStatic, this); +} + +void WorkerLoop::StopThread() { + pthread_mutex_lock(&lock_); + shutting_down_ = true; + pthread_mutex_unlock(&lock_); + + if (tid_) { + void *x; + pthread_join(tid_, &x); + tid_ = 0; + } +} + + +// This is called from signal handler so cannot block etc. 
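+// Locking lock_ here could deadlock, since the signal may have interrupted
+// a thread that already holds it; so only the flag is written, and the
+// worker notices it on its next pass through the loop. (Strictly, a
+// volatile sig_atomic_t or std::atomic<bool> flag would make the
+// cross-thread store well-defined; a plain bool is used here.)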
+void WorkerLoop::NotifyStop() { + shutting_down_ = true; +} + +void WorkerLoop::HandlePacket(Packet *packet, int target) { +// RINFO("WorkerLoop::HandlePacket"); + packet->post_target = target; + pthread_mutex_lock(&lock_); + Packet *old_queue = queue_; + *queue_end_ = packet; + queue_end_ = &packet->next; + packet->next = NULL; + if (old_queue == NULL) { + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + } else { + pthread_mutex_unlock(&lock_); + } +} + +void *WorkerLoop::ThreadMainStatic(void *x) { + return ((WorkerLoop*)x)->ThreadMain(); +} + +void *WorkerLoop::ThreadMain() { + Packet *packet_queue; + + pthread_mutex_lock(&lock_); + for (;;) { + // Grab the whole list + for (;;) { + while (got_sig_alarm_) { + got_sig_alarm_ = false; + pthread_mutex_unlock(&lock_); + processor_->SecondLoop(); + pthread_mutex_lock(&lock_); + } + if (shutting_down_ || queue_ != NULL) + break; + pthread_cond_wait(&cond_, &lock_); + } + if (shutting_down_) + break; + packet_queue = queue_; + queue_ = NULL; + queue_end_ = &queue_; + + pthread_mutex_unlock(&lock_); + // And send all items in the list + while (packet_queue != NULL) { + Packet *next = packet_queue->next; + if (packet_queue->post_target == TARGET_TUN) { + processor_->HandleTunPacket(packet_queue); + } else { + processor_->HandleUdpPacket(packet_queue, false); + } + packet_queue = next; + } + pthread_mutex_lock(&lock_); + } + pthread_mutex_unlock(&lock_); + return NULL; +} + + + +UdpLoop::UdpLoop() { + fd_ = -1; + read_tid_ = 0; + write_tid_ = 0; + shutting_down_ = false; + worker_ = NULL; + queue_ = NULL; + queue_end_ = &queue_; + pthread_mutex_init(&lock_, NULL); + pthread_cond_init(&cond_, NULL); +} + +UdpLoop::~UdpLoop() { + if (fd_ != -1) + close(fd_); + pthread_mutex_destroy(&lock_); + pthread_cond_destroy(&cond_); +} + +bool UdpLoop::Initialize(int listen_port, WorkerLoop *worker) { + int fd = open_udp(listen_port); + if (fd < 0) { RERROR("Error opening udp"); return false; } + fcntl(fd, F_SETFD, FD_CLOEXEC); + fd_ = fd; + worker_ = worker; + return true; +} + +void UdpLoop::Start() { + pthread_create(&read_tid_, NULL, &ReaderMainStatic, this); + pthread_create(&write_tid_, NULL, &WriterMainStatic, this); +} + +void UdpLoop::Stop() { + void *x; + + pthread_mutex_lock(&lock_); + shutting_down_ = true; + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + + pthread_kill(read_tid_, SIGUSR1); + pthread_kill(write_tid_, SIGUSR1); + + pthread_join(read_tid_, &x); + pthread_join(write_tid_, &x); + + read_tid_ = 0; + write_tid_ = 0; +} + +void *UdpLoop::ReaderMainStatic(void *x) { + SetThreadName("tunsafe-ur"); + return ((UdpLoop*)x)->ReaderMain(); +} + +void *UdpLoop::WriterMainStatic(void *x) { + SetThreadName("tunsafe-uw"); + return ((UdpLoop*)x)->WriterMain(); +} + +void *UdpLoop::ReaderMain() { + Packet *packet; + socklen_t sin_len; + int r; + + while (!shutting_down_) { + packet = AllocPacket(); + sin_len = sizeof(packet->addr.sin); + r = recvfrom(fd_, packet->data, kPacketCapacity, 0, (sockaddr*)&packet->addr.sin, &sin_len); + if (r < 0) { + FreePacket(packet); + if (shutting_down_) + break; + + RERROR("ReadMain failed %d", errno); + + } else { + packet->size = r; + worker_->HandleUdpPacket(packet); + } + } + return NULL; +} + +void *UdpLoop::WriterMain() { + Packet *queue; + + pthread_mutex_lock(&lock_); + for (;;) { + // Grab the whole list + while (!shutting_down_ && queue_ == NULL) + pthread_cond_wait(&cond_, &lock_); + if (shutting_down_) + break; + queue = queue_; + queue_ = NULL; + queue_end_ = &queue_; + 
pthread_mutex_unlock(&lock_); + // And send all items in the list + while (queue != NULL) { + int r = sendto(fd_, queue->data, queue->size, 0, + (sockaddr*)&queue->addr.sin, sizeof(queue->addr.sin)); + if (r != queue->size) { + if (errno != ENOBUFS) + RERROR("WriterMain failed: %d", errno); + } else { +// RINFO("WRote udp packet!"); + } + Packet *to_free = queue; + queue = queue->next; + FreePacket(to_free); + } + pthread_mutex_lock(&lock_); + } + pthread_mutex_unlock(&lock_); + return NULL; +} + +void UdpLoop::WriteUdpPacket(Packet *packet) { +// RINFO("write udp packet to queue!"); + packet->next = NULL; + + pthread_mutex_lock(&lock_); + Packet *old_queue = queue_; + *queue_end_ = packet; + queue_end_ = &packet->next; + if (old_queue == NULL) { + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + } else { + pthread_mutex_unlock(&lock_); + } +} + +TunLoop::TunLoop() { + fd_ = -1; + shutting_down_ = false; + worker_ = NULL; + read_tid_ = 0; + write_tid_ = 0; + queue_ = NULL; + queue_end_ = &queue_; + pthread_mutex_init(&lock_, NULL); + pthread_cond_init(&cond_, NULL); +} + +TunLoop::~TunLoop() { + if (fd_ != -1) + close(fd_); + pthread_mutex_destroy(&lock_); + pthread_cond_destroy(&cond_); +} + +bool TunLoop::Initialize(WorkerLoop *worker) { + int fd = open_tun(devname_, sizeof(devname_)); + if (fd < 0) { RERROR("Error opening tun"); return false; } + fcntl(fd, F_SETFD, FD_CLOEXEC); + fd_ = fd; + worker_ = worker; + return true; +} + +void TunLoop::Start() { + pthread_create(&read_tid_, NULL, &ReaderMainStatic, this); + pthread_create(&write_tid_, NULL, &WriterMainStatic, this); +} + +void TunLoop::Stop() { + void *x; + + pthread_mutex_lock(&lock_); + shutting_down_ = true; + pthread_mutex_unlock(&lock_); + + pthread_kill(read_tid_, SIGUSR1); + pthread_kill(write_tid_, SIGUSR1); + pthread_join(read_tid_, &x); + pthread_join(write_tid_, &x); + + read_tid_ = 0; + write_tid_ = 0; +} + +void *TunLoop::ReaderMainStatic(void *x) { + SetThreadName("tunsafe-tr"); + return ((TunLoop*)x)->ReaderMain(); +} + +void *TunLoop::WriterMainStatic(void *x) { + SetThreadName("tunsafe-tw"); + return ((TunLoop*)x)->WriterMain(); +} + +void *TunLoop::ReaderMain() { + Packet *packet = AllocPacket(); + while (!shutting_down_) { + int r = read(fd_, packet->data - TUN_PREFIX_BYTES, kPacketCapacity + TUN_PREFIX_BYTES); + if (r >= 0) { + packet->size = r - TUN_PREFIX_BYTES; + if (r >= TUN_PREFIX_BYTES && (!TUN_PREFIX_BYTES || ReadBE32(packet->data - TUN_PREFIX_BYTES) == AF_INET)) { + worker_->HandleTunPacket(packet); + packet = AllocPacket(); + } + } + } + return NULL; +} + +void *TunLoop::WriterMain() { + Packet *queue; + + pthread_mutex_lock(&lock_); + for (;;) { + // Grab the whole list + while (!shutting_down_ && queue_ == NULL) { + pthread_cond_wait(&cond_, &lock_); + } + if (shutting_down_) + break; + queue = queue_; + queue_ = NULL; + queue_end_ = &queue_; + pthread_mutex_unlock(&lock_); + // And send all items in the list + while (queue != NULL) { + if (TUN_PREFIX_BYTES) + WriteBE32(queue->data - TUN_PREFIX_BYTES, AF_INET); + int r = write(fd_, queue->data - TUN_PREFIX_BYTES, queue->size + TUN_PREFIX_BYTES); + if (r != queue->size + TUN_PREFIX_BYTES) { + RERROR("WriterMain failed: %d", errno); + break; + } + Packet *to_free = queue; + queue = queue->next; + FreePacket(to_free); + } + pthread_mutex_lock(&lock_); + } + pthread_mutex_unlock(&lock_); + return NULL; +} + +void TunLoop::WriteTunPacket(Packet *packet) { + packet->next = NULL; + + pthread_mutex_lock(&lock_); + Packet *old_queue = queue_; 
+ *queue_end_ = packet; + queue_end_ = &packet->next; + if (old_queue == NULL) { + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + } else { + pthread_mutex_unlock(&lock_); + } +} + +struct RouteInfo { + uint8 family; + uint8 cidr; + uint8 ip[16]; + uint8 gw[16]; +}; + +class TunsafeBackendBsd : public TunInterface, public UdpInterface { +public: + TunsafeBackendBsd(); + ~TunsafeBackendBsd(); + + void RunLoop(); + void CleanupRoutes(); + + void SetProcessor(WireguardProcessor *wg) { processor_ = wg; } + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + + // -- from UdpInterface + virtual bool Initialize(int listen_port) override; + virtual void WriteUdpPacket(Packet *packet) override; + + void HandleSigAlrm() { worker_.HandleSigAlrm(); } + void HandleExit() { worker_.NotifyStop(); } + +private: + void AddRoute(uint32 ip, uint32 cidr, uint32 gw); + void DelRoute(const RouteInfo &cd); + bool AddRoute(int family, const void *dest, int dest_prefix, const void *gateway); + bool RunPrePostCommand(const std::vector &vec); + + + WireguardProcessor *processor_; + + bool got_sig_alarm_; + bool exit_; + + uint32 added_route_addr_, added_route_gw_; + + WorkerLoop worker_; + UdpLoop udp_; + TunLoop tun_; + + std::vector cleanup_commands_; + + std::vector pre_down_, post_down_; +}; + +TunsafeBackendBsd::TunsafeBackendBsd() + : processor_(NULL), + got_sig_alarm_(false), + exit_(false) { +} + +TunsafeBackendBsd::~TunsafeBackendBsd() { +} + +static uint32 CidrToNetmaskV4(int cidr) { + return cidr == 32 ? 0xffffffff : 0xffffffff << (32 - cidr); +} + +static uint32 ComputeIpv4DefaultRoute(uint32 ip, uint32 netmask) { + uint32 default_route_v4 = (ip & netmask) | 1; + if (default_route_v4 == ip) + default_route_v4++; + return default_route_v4; +} + +static void ComputeIpv6DefaultRoute(const uint8 *ipv6_address, uint8 ipv6_cidr, uint8 *default_route_v6) { + memcpy(default_route_v6, ipv6_address, 16); + // clear the last bits of the ipv6 address to match the cidr. + size_t n = (ipv6_cidr + 7) >> 3; + memset(&default_route_v6[n], 0, 16 - n); + if (n == 0) + return; + // adjust the final byte + default_route_v6[n - 1] &= ~(0xff >> (ipv6_cidr & 7)); + // set the very last byte to something + default_route_v6[15] |= 1; + // ensure it doesn't collide + if (memcmp(default_route_v6, ipv6_address, 16) == 0) + default_route_v6[15] ^= 3; +} + +void TunsafeBackendBsd::AddRoute(uint32 ip, uint32 cidr, uint32 gw) { + uint32 ip_be, gw_be; + WriteBE32(&ip_be, ip); + WriteBE32(&gw_be, gw); + AddRoute(AF_INET, &ip_be, cidr, &gw_be); +} + +static void AddOrRemoveRoute(const RouteInfo &cd, bool remove) { + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + + print_ip_prefix(buf1, cd.family, cd.ip, cd.cidr); + print_ip_prefix(buf2, cd.family, cd.gw, -1); + +#if defined(OS_LINUX) + const char *cmd = remove ? "delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route %s -net %s gw %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route %s -net inet6 %s gw %s", cmd, buf1, buf2); + } +#elif defined(OS_MACOSX) + const char *cmd = remove ? 
"delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route -q %s %s %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route -q %s -inet6 %s %s", cmd, buf1, buf2); + } +#endif +} + +bool TunsafeBackendBsd::AddRoute(int family, const void *dest, int dest_prefix, const void *gateway) { + RouteInfo c; + + c.family = family; + size_t len = (family == AF_INET) ? 4 : 16; + memcpy(c.ip, dest, len); + memcpy(c.gw, gateway, len); + c.cidr = dest_prefix; + cleanup_commands_.push_back(c); + AddOrRemoveRoute(c, false); + return true; +} + +void TunsafeBackendBsd::DelRoute(const RouteInfo &cd) { + AddOrRemoveRoute(cd, true); +} + +static bool IsIpv6AddressSet(const void *p) { + return (ReadLE64(p) | ReadLE64((char*)p + 8)) != 0; +} + +// Called to initialize tun +bool TunsafeBackendBsd::Initialize(const TunConfig &&config, TunConfigOut *out) override { + char def_iface[12]; + + if (!RunPrePostCommand(config.pre_post_commands.pre_up)) { + RERROR("Pre command failed!"); + return false; + } + + out->enable_neighbor_discovery_spoofing = false; + + if (!tun_.Initialize(&worker_)) + return false; + + if (config.ipv6_cidr) + RERROR("IPv6 not supported"); + + uint32 netmask = CidrToNetmaskV4(config.cidr); + uint32 default_route_v4 = ComputeIpv4DefaultRoute(config.ip, netmask); + + RunCommand("/sbin/ifconfig %s %A mtu %d %A netmask %A up", tun_.devname(), config.ip, config.mtu, config.ip, netmask); + AddRoute(config.ip & netmask, config.cidr, config.ip); + + if (config.use_ipv4_default_route) { + if (config.default_route_endpoint_v4) { + uint32 gw; + if (!GetDefaultRoute(def_iface, sizeof(def_iface), &gw)) { + RERROR("Unable to determine default interface."); + return false; + } + AddRoute(config.default_route_endpoint_v4, 32, gw); + + } + AddRoute(0x00000000, 1, default_route_v4); + AddRoute(0x80000000, 1, default_route_v4); + } + + uint8 default_route_v6[16]; + + if (config.ipv6_cidr) { + static const uint8 matchall_1_route[17] = {0x80, 0, 0, 0}; + char buf[kSizeOfAddress]; + + ComputeIpv6DefaultRoute(config.ipv6_address, config.ipv6_cidr, default_route_v6); + + RunCommand("/sbin/ifconfig %s inet6 %s", tun_.devname(), print_ip_prefix(buf, AF_INET6, config.ipv6_address, config.ipv6_cidr)); + + if (config.use_ipv6_default_route) { + if (IsIpv6AddressSet(config.default_route_endpoint_v6)) { + RERROR("default_route_endpoint_v6 not supported"); + } + AddRoute(AF_INET6, matchall_1_route + 1, 1, default_route_v6); + AddRoute(AF_INET6, matchall_1_route + 0, 1, default_route_v6); + } + } + + // Add all the extra routes + for (auto it = config.extra_routes.begin(); it != config.extra_routes.end(); ++it) { + if (it->size == 32) { + AddRoute(ReadBE32(it->addr), it->cidr, default_route_v4); + } else if (it->size == 128 && config.ipv6_cidr) { + AddRoute(AF_INET6, it->addr, it->cidr, default_route_v6); + } + } + + RunPrePostCommand(config.pre_post_commands.post_up); + + pre_down_ = std::move(config.pre_post_commands.pre_down); + post_down_ = std::move(config.pre_post_commands.post_down); + + return true; +} + +void TunsafeBackendBsd::CleanupRoutes() { + RunPrePostCommand(pre_down_); + + for(auto it = cleanup_commands_.begin(); it != cleanup_commands_.end(); ++it) + DelRoute(*it); + cleanup_commands_.clear(); + + RunPrePostCommand(post_down_); + + pre_down_.clear(); + post_down_.clear(); +} + +static bool RunOneCommand(const std::string &cmd) { + RINFO("Run: %s", cmd.c_str()); + int exit_code = system(cmd.c_str()); + if (exit_code) { + RERROR("Run Failed (%d) : %s", exit_code, cmd.c_str()); + return false; 
+  }
+  return true;
+}
+
+bool TunsafeBackendBsd::RunPrePostCommand(const std::vector<std::string> &vec) {
+  bool success = true;
+  for (auto it = vec.begin(); it != vec.end(); ++it) {
+    success &= RunOneCommand(*it);
+  }
+  return success;
+}
+
+void TunsafeBackendBsd::WriteTunPacket(Packet *packet) {
+  tun_.WriteTunPacket(packet);
+}
+
+// Called to initialize udp
+bool TunsafeBackendBsd::Initialize(int listen_port) {
+  return udp_.Initialize(listen_port, &worker_);
+}
+
+void TunsafeBackendBsd::WriteUdpPacket(Packet *packet) {
+  udp_.WriteUdpPacket(packet);
+}
+
+static TunsafeBackendBsd *g_tunsafe_backend_bsd;
+
+static void SigAlrm(int sig) {
+  if (g_tunsafe_backend_bsd)
+    g_tunsafe_backend_bsd->HandleSigAlrm();
+}
+
+static void SigUsr1(int sig) {
+}
+
+static bool did_ctrlc;
+
+void SigInt(int sig) {
+  if (did_ctrlc)
+    exit(1);
+  did_ctrlc = true;
+  write(1, "Ctrl-C detected. Exiting. Press again to force quit.\n", sizeof("Ctrl-C detected. Exiting. Press again to force quit.\n") - 1);
+
+  if (g_tunsafe_backend_bsd)
+    g_tunsafe_backend_bsd->HandleExit();
+}
+
+void TunsafeBackendBsd::RunLoop() {
+  int free_packet_interval = 10;
+
+  assert(!g_tunsafe_backend_bsd);
+  assert(processor_);
+
+  g_tunsafe_backend_bsd = this;
+  // We want an alarm signal every second.
+  {
+    struct sigaction act = {0};
+    act.sa_handler = SigAlrm;
+    if (sigaction(SIGALRM, &act, NULL) < 0) {
+      RERROR("Unable to install SIGALRM handler.");
+      return;
+    }
+  }
+
+  {
+    struct sigaction act = {0};
+    act.sa_handler = SigInt;
+    if (sigaction(SIGINT, &act, NULL) < 0) {
+      RERROR("Unable to install SIGINT handler.");
+      return;
+    }
+  }
+
+  {
+    struct sigaction act = {0};
+    act.sa_handler = SigUsr1;
+    if (sigaction(SIGUSR1, &act, NULL) < 0) {
+      RERROR("Unable to install SIGUSR1 handler.");
+      return;
+    }
+  }
+
+#if defined(OS_LINUX) || defined(OS_FREEBSD)
+  {
+    struct itimerspec tv = {0};
+    struct sigevent sev;
+    timer_t timer_id;
+
+    tv.it_interval.tv_sec = 1;
+    tv.it_value.tv_sec = 1;
+
+    sev.sigev_notify = SIGEV_SIGNAL;
+    sev.sigev_signo = SIGALRM;
+    sev.sigev_value.sival_ptr = NULL;
+
+    if (timer_create(CLOCK_MONOTONIC, &sev, &timer_id) < 0) {
+      RERROR("timer_create failed");
+      return;
+    }
+
+    if (timer_settime(timer_id, 0, &tv, NULL) < 0) {
+      RERROR("timer_settime failed");
+      return;
+    }
+  }
+#elif defined(OS_MACOSX)
+  ualarm(1000000, 1000000);
+#endif
+
+  worker_.Initialize(processor_);
+
+  // Start the processing threads
+  udp_.Start();
+  tun_.Start();
+
+  worker_.ThreadMain();
+
+  tun_.Stop();
+  udp_.Stop();
+
+  g_tunsafe_backend_bsd = NULL;
+}
+
+void InitCpuFeatures();
+void Benchmark();
+
+uint32 g_ui_ip;
+
+const char *print_ip(char buf[kSizeOfAddress], in_addr_t ip) {
+  snprintf(buf, kSizeOfAddress, "%d.%d.%d.%d", (ip >> 24) & 0xff, (ip >> 16) & 0xff, (ip >> 8) & 0xff, (ip >> 0) & 0xff);
+  return buf;
+}
+
+class MyProcessorDelegate : public ProcessorDelegate {
+public:
+  virtual void OnConnected(in_addr_t my_ip) {
+    if (my_ip != g_ui_ip) {
+      if (my_ip) {
+        char buf[kSizeOfAddress];
+        print_ip(buf, my_ip);
+        RINFO("Connection established.
IP %s", buf); + } + g_ui_ip = my_ip; + } + } + virtual void OnDisconnected() { + MyProcessorDelegate::OnConnected(0); + } +}; + + + + +int main(int argc, char **argv) { + bool exit_flag = false; + + InitCpuFeatures(); + + if (argc == 2 && strcmp(argv[1], "--benchmark") == 0) { + Benchmark(); + return 0; + } + + fprintf(stderr, "%s\n", TUNSAFE_VERSION_STRING); + + if (argc < 2) { + fprintf(stderr, "Syntax: tunsafe file.conf\n"); + return 1; + } + +#if defined(OS_MACOSX) + InitOsxGetMilliseconds(); +#endif + + SetThreadName("tunsafe-m"); + + + MyProcessorDelegate my_procdel; + TunsafeBackendBsd socket_loop; + WireguardProcessor wg(&socket_loop, &socket_loop, &my_procdel); + socket_loop.SetProcessor(&wg); + + if (!ParseWireGuardConfigFile(&wg, argv[1], &exit_flag)) return 1; + if (!wg.Start()) return 1; + + socket_loop.RunLoop(); + socket_loop.CleanupRoutes(); + + return 0; +} diff --git a/network_win32.cpp b/network_win32.cpp new file mode 100644 index 0000000..beb1b39 --- /dev/null +++ b/network_win32.cpp @@ -0,0 +1,1956 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "stdafx.h" +#include "network_win32.h" +#include "wireguard_config.h" +#include "netapi.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "tunsafe_endian.h" +#include "wireguard.h" +#include "util.h" +#include +#include "network_win32_dnsblock.h" + +enum { + HARD_MAXIMUM_QUEUE_SIZE = 102400, + MAX_BYTES_IN_UDP_OUT_QUEUE = 256 * 1024, + MAX_BYTES_IN_UDP_OUT_QUEUE_SMALL = (256 + 64) * 1024, +}; + +enum { + ROUTE_BLOCK_UNKNOWN = 0, + ROUTE_BLOCK_OFF = 1, + ROUTE_BLOCK_ON = 2, + ROUTE_BLOCK_PENDING = 3, +}; +static uint8 internet_route_blocking_state; +static SLIST_HEADER freelist_head; + +bool g_allow_pre_post; + +Packet *AllocPacket() { + Packet *packet = (Packet*)InterlockedPopEntrySList(&freelist_head); + if (packet == NULL) + packet = (Packet *)_aligned_malloc(kPacketAllocSize, 16); + packet->data = packet->data_buf + Packet::HEADROOM_BEFORE; + packet->size = 0; + return packet; +} + +void FreePacket(Packet *packet) { + InterlockedPushEntrySList(&freelist_head, &packet->list_entry); +} + +extern "C" +PSLIST_ENTRY __fastcall InterlockedPushListSList( + IN PSLIST_HEADER ListHead, + IN PSLIST_ENTRY List, + IN PSLIST_ENTRY ListEnd, + IN ULONG Count +); + +void FreePackets(Packet *packet, Packet **end, int count) { + InterlockedPushListSList(&freelist_head, &packet->list_entry, (PSLIST_ENTRY)end, count); +} + +void FreeAllPackets() { + Packet *p; + p = (Packet*)InterlockedFlushSList(&freelist_head); + while (Packet *r = p) { + p = p->next; + _aligned_free(r); + } +} + +void InitPacketMutexes() { + static bool mutex_inited; + if (!mutex_inited) { + mutex_inited = true; + InitializeSListHead(&freelist_head); + } +} + + +void CallbackUpdateUI(); +void CallbackTriggerReconnect(); +void CallbackSetPublicKey(const uint8 public_key[32]); + +int tpq_last_qsize; +int g_tun_reads, g_tun_writes; + +struct { + uint32 pad1[3]; + uint32 udp_qsize1; + uint32 pad2[3]; + uint32 udp_qsize2; +} qs; + + +#define kConcurrentReadUdp 16 +#define kConcurrentWriteUdp 16 +#define kConcurrentReadTap 16 +#define kConcurrentWriteTap 16 + +#define kAdapterKeyName "SYSTEM\\CurrentControlSet\\Control\\Class\\{4D36E972-E325-11CE-BFC1-08002BE10318}" +#define kTapComponentId "tap0901" + +#define TAP_CONTROL_CODE(request,method) \ + CTL_CODE (FILE_DEVICE_UNKNOWN, request, method, FILE_ANY_ACCESS) + +#define TAP_IOCTL_GET_MAC 
TAP_CONTROL_CODE(1, METHOD_BUFFERED) +#define TAP_IOCTL_GET_VERSION TAP_CONTROL_CODE(2, METHOD_BUFFERED) +#define TAP_IOCTL_GET_MTU TAP_CONTROL_CODE(3, METHOD_BUFFERED) +#define TAP_IOCTL_GET_INFO TAP_CONTROL_CODE(4, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_POINT_TO_POINT TAP_CONTROL_CODE(5, METHOD_BUFFERED) +#define TAP_IOCTL_SET_MEDIA_STATUS TAP_CONTROL_CODE(6, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_DHCP_MASQ TAP_CONTROL_CODE(7, METHOD_BUFFERED) +#define TAP_IOCTL_GET_LOG_LINE TAP_CONTROL_CODE(8, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_DHCP_SET_OPT TAP_CONTROL_CODE(9, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_TUN TAP_CONTROL_CODE(10, METHOD_BUFFERED) + +static bool RunNetsh(const char *cmdline) { + wchar_t path[MAX_PATH + 20]; + size_t size = GetSystemDirectoryW(path, MAX_PATH); + bool result = false; + if (!size) { + RERROR("GetSystemDirectory failed"); + return false; + } + memcpy(path + size, L"\\netsh.exe", 11 * sizeof(path[0])); + + size_t cmdline_size = strlen(cmdline); + wchar_t *cmdlinew = new wchar_t[cmdline_size + 1]; + for (size_t i = 0; i <= cmdline_size; i++) + cmdlinew[i] = cmdline[i]; + + STARTUPINFOW si = {0}; + PROCESS_INFORMATION pi = {0}; + + GetStartupInfoW(&si); + si.dwFlags = STARTF_USESHOWWINDOW; + si.wShowWindow = SW_HIDE; + if (CreateProcessW(path, cmdlinew, NULL, NULL, FALSE, CREATE_NO_WINDOW, NULL, NULL, &si, &pi)) { + DWORD exit_code = -1; + WaitForSingleObject(pi.hProcess, INFINITE); + GetExitCodeProcess(pi.hProcess, &exit_code); + if (exit_code != 0) + RERROR("Netsh failed (%d) : %s", exit_code, cmdline); + else { + RINFO("Run: %s", cmdline); + result = true; + } + CloseHandle(pi.hThread); + CloseHandle(pi.hProcess); + } else { + RERROR("CreateProcess failed: %s", cmdline); + } + delete[]cmdlinew; + return result; +} + +// Retrieve the device path to the TAP adapter. 
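+// The adapter is located via the registry rather than a fixed device path:
+// the code below enumerates the subkeys of the network-adapters class key
+// (kAdapterKeyName), looks for a device whose ComponentId is "tap0901", and
+// reads its NetCfgInstanceId GUID (e.g. "{A1B2C3D4-...}", illustrative value
+// only). OpenTunAdapter then opens the device node \\.\Global\<GUID>.tap.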
+static bool GetTapAdapterGuid(char guid[64]) {
+  LONG err;
+  HKEY adapter_key, device_key;
+  bool retval = false;
+  err = RegOpenKeyEx(HKEY_LOCAL_MACHINE, kAdapterKeyName, 0, KEY_READ, &adapter_key);
+  if (err != ERROR_SUCCESS) {
+    RERROR("GetTapAdapterGuid: RegOpenKeyEx failed: 0x%X", err);
+    return false;
+  }
+  for (int i = 0; !retval; i++) {
+    char keyname[64 + sizeof(kAdapterKeyName) + 1];
+    char value[64];
+    DWORD len = sizeof(value), type;
+    err = RegEnumKeyEx(adapter_key, i, value, &len, NULL, NULL, NULL, NULL);
+    if (err == ERROR_NO_MORE_ITEMS)
+      break;
+    if (err != ERROR_SUCCESS) {
+      RERROR("GetTapAdapterGuid: RegEnumKeyEx failed: 0x%X", err);
+      RegCloseKey(adapter_key);
+      return false;
+    }
+    snprintf(keyname, sizeof(keyname), "%s\\%s", kAdapterKeyName, value);
+    err = RegOpenKeyEx(HKEY_LOCAL_MACHINE, keyname, 0, KEY_READ, &device_key);
+    if (err == ERROR_SUCCESS) {
+      len = sizeof(value);
+      err = RegQueryValueEx(device_key, "ComponentId", NULL, &type, (LPBYTE)value, &len);
+      if (err == ERROR_SUCCESS && type == REG_SZ && !memcmp(value, kTapComponentId, sizeof(kTapComponentId))) {
+        len = 64;
+        err = RegQueryValueEx(device_key, "NetCfgInstanceId", NULL, &type, (LPBYTE)guid, &len);
+        if (err == ERROR_SUCCESS && type == REG_SZ) {
+          guid[63] = 0;
+          retval = true;
+        }
+      }
+      RegCloseKey(device_key);
+    }
+  }
+  RegCloseKey(adapter_key);
+  return retval;
+}
+
+// Open the TAP adapter
+static HANDLE OpenTunAdapter(char guid[64], int retry_count, bool *exit_thread, DWORD open_flags) {
+  char path[128];
+  HANDLE h;
+  int retries = 0;
+  if (!GetTapAdapterGuid(guid)) {
+    RERROR("Unable to find ID of TAP adapter");
+    RERROR(" Please ensure that TunSafe-TAP is properly installed.");
+    return NULL;
+  }
+  snprintf(path, sizeof(path), "\\\\.\\Global\\%s.tap", guid);
+RETRY:
+  h = CreateFile(path, GENERIC_READ | GENERIC_WRITE, 0, 0, OPEN_EXISTING,
+                 FILE_ATTRIBUTE_SYSTEM | open_flags, 0);
+  if (h == INVALID_HANDLE_VALUE) {
+    int error_code = GetLastError();
+
+    // Sometimes if you close the device right before, it will fail to open with errorcode 31.
+    // When resuming from sleep in my VM, the error code is ERROR_FILE_NOT_FOUND
+    if ((error_code == ERROR_FILE_NOT_FOUND || error_code == ERROR_GEN_FAILURE) && retry_count != 0 && !*exit_thread) {
+      RERROR("OpenTapAdapter: CreateFile failed: 0x%X... retrying", error_code);
retrying", error_code); + retry_count--; + Sleep(250 * ++retries); + goto RETRY; + } + + RERROR("OpenTapAdapter: CreateFile failed: 0x%X", error_code); + if (error_code == ERROR_FILE_NOT_FOUND) { + RERROR(" Please ensure that TunSafe-TAP is properly installed."); + } else if (error_code == 0x1f) { + RERROR(" Please ensure that the TAP device is not in use."); + } + return NULL; + } + return h; +} + +static bool AddRoute(int family, + const void *dest, int dest_prefix, + const void *gateway, const NET_LUID *interface_luid, + std::vector *undo_array = NULL) { + MIB_IPFORWARD_ROW2 row = {0}; + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + + row.InterfaceLuid = *interface_luid; + row.DestinationPrefix.PrefixLength = dest_prefix; + row.DestinationPrefix.Prefix.si_family = family; + row.NextHop.si_family = family; + if (family == AF_INET) { + memcpy(&row.DestinationPrefix.Prefix.Ipv4.sin_addr, dest, 4); + memcpy(&row.NextHop.Ipv4.sin_addr, gateway, 4); + } else if (family == AF_INET6) { + memcpy(&row.DestinationPrefix.Prefix.Ipv6.sin6_addr, dest, 16); + memcpy(&row.NextHop.Ipv6.sin6_addr, gateway, 16); + } else { + return false; + } + row.ValidLifetime = 0xffffffff; + row.PreferredLifetime = 0xffffffff; + row.Metric = 100; + row.Protocol = MIB_IPPROTO_NETMGMT; + + if (undo_array) + undo_array->push_back(row); + + DWORD error = CreateIpForwardEntry2(&row); + if (error == NO_ERROR || error == ERROR_OBJECT_ALREADY_EXISTS) { + RINFO("Added Route %s => %s", print_ip_prefix(buf1, family, dest, dest_prefix), + print_ip_prefix(buf2, family, gateway, -1)); + return true; + } + RINFO("AddRoute failed (%d) %s => %s", error, print_ip_prefix(buf1, family, dest, dest_prefix), + print_ip_prefix(buf2, family, gateway, -1)); + return false; +} + +static bool DeleteRoute(MIB_IPFORWARD_ROW2 *row) { + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + DWORD error = DeleteIpForwardEntry2(row); + + print_ip_prefix(buf1, row->DestinationPrefix.Prefix.si_family, + (row->DestinationPrefix.Prefix.si_family == AF_INET) ? (uint8*) &row->DestinationPrefix.Prefix.Ipv4.sin_addr : (uint8*) &row->DestinationPrefix.Prefix.Ipv6.sin6_addr, row->DestinationPrefix.PrefixLength); + + print_ip_prefix(buf2, row->NextHop.si_family, + (row->NextHop.si_family == AF_INET) ? (uint8*)&row->NextHop.Ipv4.sin_addr : (uint8*)&row->NextHop.Ipv6.sin6_addr, -1); + + if (error == NO_ERROR) { + RINFO("Deleted Route %s => %s", buf1, buf2); + return true; + } + RINFO("DeleteRoute failed (%d) %s => %s", error, buf1, buf2); + return false; +} + + +static uint32 CidrToNetmaskV4(int cidr) { + return cidr == 32 ? 
0xffffffff : 0xffffffff << (32 - cidr); +} + +struct RouteInfo { + uint8 default_gw[16]; + NET_LUID default_adapter; + bool found_default_adapter; + uint8 found_null_routes; +}; + +static inline bool IsRouteOriginatingFromNullRoute(MIB_IPFORWARD_ROW2 *row) { + if (!(row->InterfaceLuid.Info.IfType == 24 && row->Protocol == MIB_IPPROTO_NETMGMT && row->DestinationPrefix.PrefixLength == 1)) + return false; + if (row->NextHop.si_family == AF_INET) { + return (row->NextHop.Ipv4.sin_addr.S_un.S_addr == 0); + } else if (row->NextHop.si_family == AF_INET6) { + static const uint32 nulladdr[4]; + return memcmp(&row->NextHop.Ipv6.sin6_addr, nulladdr, 16) == 0; + } + return false; +} + +static inline bool IsRouteTheAddressOfTheServer(int family, MIB_IPFORWARD_ROW2 *row, uint8 *old_endpoint_to_delete) { + if (!(row->Protocol == MIB_IPPROTO_NETMGMT && row->DestinationPrefix.Prefix.si_family == family)) + return false; + if (family == AF_INET) { + return (row->DestinationPrefix.PrefixLength == 32 && memcmp(&row->DestinationPrefix.Prefix.Ipv4.sin_addr, old_endpoint_to_delete, 4) == 0); + } else if (family == AF_INET6) { + return (row->DestinationPrefix.PrefixLength == 128 && memcmp(&row->DestinationPrefix.Prefix.Ipv6.sin6_addr, old_endpoint_to_delete, 16) == 0); + } + return false; +} + +static void DeleteRouteOrPrintErr(MIB_IPFORWARD_ROW2 *row) { + char buf1[kSizeOfAddress]; + UINT32 r = DeleteIpForwardEntry2(row); + if (r) + RERROR("Unable to delete old route (%d): %s", r, + print_ip_prefix(buf1, row->DestinationPrefix.Prefix.si_family, row->DestinationPrefix.Prefix.si_family == AF_INET ? + (void*)&row->DestinationPrefix.Prefix.Ipv4.sin_addr : + (void*)&row->DestinationPrefix.Prefix.Ipv6.sin6_addr, row->DestinationPrefix.PrefixLength)); +} + +static bool GetDefaultRouteAndDeleteOldRoutes(int family, const NET_LUID *InterfaceLuid, bool keep_null_routes, uint8 *old_endpoint_to_delete, RouteInfo *ri) { + MIB_IPFORWARD_TABLE2 *table = NULL; + + assert(family == AF_INET || family == AF_INET6); + + if (GetIpForwardTable2(family, &table)) + return false; + DWORD rv = 0; + DWORD gw_metric = 0xffffffff; + ri->found_default_adapter = false; + ri->found_null_routes = 0; + for (unsigned i = 0; i < table->NumEntries; i++) { + MIB_IPFORWARD_ROW2 *row = &table->Table[i]; + if (InterfaceLuid && memcmp(&row->InterfaceLuid, InterfaceLuid, sizeof(NET_LUID)) == 0) { + if (row->Protocol == MIB_IPPROTO_NETMGMT) + DeleteRouteOrPrintErr(row); + } else if (IsRouteOriginatingFromNullRoute(row)) { + ri->found_null_routes++; + if (!keep_null_routes) + DeleteRouteOrPrintErr(row); + } else if (row->DestinationPrefix.PrefixLength == 0 && row->Metric < gw_metric) { + gw_metric = row->Metric; + if (family == AF_INET) { + memcpy(&ri->default_gw, &row->NextHop.Ipv4.sin_addr, 4); + } else { + memcpy(&ri->default_gw, &row->NextHop.Ipv6.sin6_addr, 16); + } + ri->default_adapter = row->InterfaceLuid; + ri->found_default_adapter = true; + } + } + + if (old_endpoint_to_delete && ri->found_default_adapter) { + for (unsigned i = 0; i < table->NumEntries; i++) { + MIB_IPFORWARD_ROW2 *row = &table->Table[i]; + if (memcmp(&row->InterfaceLuid, &ri->default_adapter, sizeof(NET_LUID)) == 0) { + if (IsRouteTheAddressOfTheServer(family, row, old_endpoint_to_delete)) + DeleteRouteOrPrintErr(row); + } + } + } + + FreeMibTable(table); + return (rv == 0); +} + +static inline bool NoMoreAllocationRetry(volatile bool *exit_flag) { + if (*exit_flag) + return true; + Sleep(1000); + return *exit_flag; +} + +static inline bool AllocPacketFrom(Packet **list, int 
*counter, bool *exit_flag, Packet **res) { + Packet *p; + if (p = *list) { + *list = p->next; + (*counter)--; + p->data = p->data_buf + Packet::HEADROOM_BEFORE; + } else { + while ((p = AllocPacket()) == NULL) { + if (NoMoreAllocationRetry(exit_flag)) + return false; + } + } + *res = p; + return true; +} + +static void FreePacketList(Packet *pp) { + while (Packet *p = pp) { + pp = p->next; + FreePacket(p); + } +} + +UdpSocketWin32::UdpSocketWin32() { + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + exit_thread_ = false; + socket_ = INVALID_SOCKET; + thread_ = NULL; + socket_ipv6_ = INVALID_SOCKET; + completion_port_handle_ = NULL; + + InitializeCriticalSectionAndSpinCount(&mutex_, 1024); +} + +UdpSocketWin32::~UdpSocketWin32() { + assert(thread_ == NULL); + closesocket(socket_); + closesocket(socket_ipv6_); + CloseHandle(completion_port_handle_); + FreePacketList(wqueue_); + DeleteCriticalSection(&mutex_); +} + +bool UdpSocketWin32::Initialize(int listen_on_port) { + SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED); + if (s == INVALID_SOCKET) { + RERROR("UdpSocketWin32::Initialize WSASocket failed"); + return false; + } + completion_port_handle_ = CreateIoCompletionPort((HANDLE)s, NULL, NULL, 0); + if (!completion_port_handle_) { + closesocket(s); + return false; + } + socket_ = s; + + sockaddr_in sin = {0}; + sin.sin_family = AF_INET; + sin.sin_port = htons(listen_on_port); + if (bind(s, (struct sockaddr*)&sin, sizeof(sin)) != 0) { + RERROR("UdpSocketWin32::Initialize bind failed"); + return false; + } + + // Also open up a socket for ipv6 + s = WSASocket(AF_INET6, SOCK_DGRAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED); + if (s != INVALID_SOCKET) { + if (!CreateIoCompletionPort((HANDLE)s, completion_port_handle_, 1, 0)) { + RERROR("IPv6 Socket completion port failed."); + closesocket(s); + } else { + socket_ipv6_ = s; + sockaddr_in6 sin6 = {0}; + sin6.sin6_family = AF_INET6; + sin6.sin6_port = htons(listen_on_port); + if (bind(s, (struct sockaddr*)&sin6, sizeof(sin6)) != 0) { + RERROR("UdpSocketWin32::Initialize bind failed IPv6"); + } + } + } else { + RERROR("IPv6 Socket creation failed."); + } + return true; +} + +enum { + kUdpGetQueuedCompletionStatusSize = kConcurrentWriteTap + kConcurrentReadTap + 1 +}; + +static inline void ClearOverlapped(OVERLAPPED *o) { + memset(o, 0, sizeof(*o)); +} + +#ifndef STATUS_PORT_UNREACHABLE +#define STATUS_PORT_UNREACHABLE 0xC000023F +#endif + +static inline bool IsIgnoredUdpError(DWORD err) { + return err == WSAEMSGSIZE || err == WSAECONNRESET || err == WSAENETRESET || err == STATUS_PORT_UNREACHABLE; +} + +void UdpSocketWin32::ThreadMain() { + OVERLAPPED_ENTRY entries[kUdpGetQueuedCompletionStatusSize]; + Packet *pending_writes = NULL; + int num_reads[2] = {0,0}, num_writes = 0; + enum { IPV4, IPV6 }; + Packet *finished_reads = NULL, **finished_reads_end = &finished_reads; + Packet *freed_packets = NULL, **freed_packets_end = &freed_packets; + int freed_packets_count = 0; + int max_read_ipv6 = socket_ipv6_ != INVALID_SOCKET ? 1 : 0; + + while (!exit_thread_) { + // Listen with multiple ipv6 packets only if we ever sent an ipv6 packet. 
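+    // Both sockets share one completion port; the IPv4 socket was registered
+    // with completion key 0 and the IPv6 socket with key 1, which is how
+    // completions are matched back to the right num_reads[] slot below.
+    // |max_read_ipv6| starts at a single outstanding read and is raised to
+    // kConcurrentReadTap the first time an IPv6 packet is actually sent.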
+ for (int i = num_reads[IPV6]; i < max_read_ipv6; i++) { + Packet *p; + if (!AllocPacketFrom(&freed_packets, &freed_packets_count, &exit_thread_, &p)) + break; +restart_read_udp6: + ClearOverlapped(&p->overlapped); + p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_UDP; + WSABUF wsabuf = {(ULONG)kPacketCapacity, (char*)p->data}; + DWORD flags = 0; + p->sin_size = sizeof(p->addr.sin6); + if (WSARecvFrom(socket_ipv6_, &wsabuf, 1, NULL, &flags, (struct sockaddr*)&p->addr, &p->sin_size, &p->overlapped, NULL) != 0) { + DWORD err = WSAGetLastError(); + if (err != WSA_IO_PENDING) { + if (err == WSAEMSGSIZE || err == WSAECONNRESET || err == WSAENETRESET) + goto restart_read_udp6; + RERROR("UdpSocketWin32:WSARecvFrom failed 0x%X", err); + FreePacket(p); + break; + } + } + num_reads[IPV6]++; + } + + // Initiate more reads, reusing the Packet structures in |finished_writes|. + for (int i = num_reads[IPV4]; i < kConcurrentReadTap; i++) { + Packet *p; + if (!AllocPacketFrom(&freed_packets, &freed_packets_count, &exit_thread_, &p)) + break; +restart_read_udp: + ClearOverlapped(&p->overlapped); + p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_UDP; + WSABUF wsabuf = {(ULONG)kPacketCapacity, (char*)p->data}; + DWORD flags = 0; + p->sin_size = sizeof(p->addr.sin); + if (WSARecvFrom(socket_, &wsabuf, 1, NULL, &flags, (struct sockaddr*)&p->addr, &p->sin_size, &p->overlapped, NULL) != 0) { + DWORD err = WSAGetLastError(); + if (err != WSA_IO_PENDING) { + if (err == WSAEMSGSIZE || err == WSAECONNRESET || err == WSAENETRESET) + goto restart_read_udp; + RERROR("UdpSocketWin32:WSARecvFrom failed 0x%X", err); + FreePacket(p); + break; + } + } + num_reads[IPV4]++; + } + + assert(freed_packets_count >= 0); + if (freed_packets_count >= 32) { + FreePackets(freed_packets, freed_packets_end, freed_packets_count); + freed_packets_count = 0; + freed_packets_end = &freed_packets; + } else if (freed_packets == NULL) { + assert(freed_packets_count == 0); + freed_packets_end = &freed_packets; + } + + ULONG num_entries = 0; + if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, kUdpGetQueuedCompletionStatusSize, &num_entries, INFINITE, FALSE)) { + RINFO("GetQueuedCompletionStatusEx failed."); + break; + } + finished_reads_end = &finished_reads; + + int finished_reads_count = 0; + // Go through the finished entries and determine which ones are reads, and which ones are writes. 
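+    // Each OVERLAPPED is embedded in its Packet, so the owning Packet is
+    // recovered from lpOverlapped with offsetof(). |overlapped.Internal| holds
+    // the NTSTATUS of the finished operation and |InternalHigh| the byte
+    // count; |post_target|, set when the I/O was issued, tells reads and
+    // writes apart.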
+ for (ULONG i = 0; i < num_entries; i++) { + if (!entries[i].lpOverlapped) + continue; // This is the dummy entry from |PostQueuedCompletionStatus| + Packet *p = (Packet*)((byte*)entries[i].lpOverlapped - offsetof(Packet, overlapped)); + if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_UDP) { + num_reads[entries[i].lpCompletionKey]--; + if ((DWORD)p->overlapped.Internal != 0) { + if (!IsIgnoredUdpError((DWORD)p->overlapped.Internal)) + RERROR("UdpSocketWin32::Read error 0x%X", (DWORD)p->overlapped.Internal); + FreePacket(p); + continue; + } + p->size = (int)p->overlapped.InternalHigh; + *finished_reads_end = p; + finished_reads_end = &p->next; + finished_reads_count++; + } else { + num_writes--; + if ((DWORD)p->overlapped.Internal != 0) { + RERROR("UdpSocketWin32::Write error 0x%X", (DWORD)p->overlapped.Internal); + FreePacket(p); + continue; + } + *freed_packets_end = p; + freed_packets_end = &p->next; + freed_packets_count++; + } + } + *finished_reads_end = NULL; + *freed_packets_end = NULL; + assert(num_writes >= 0); + + // Push all the finished reads to the packet handler + if (finished_reads != NULL) { + packet_handler_->Post(finished_reads, finished_reads_end, finished_reads_count); + } + // Initiate more writes from |wqueue_| + while (num_writes < kConcurrentWriteTap) { + // Refill from queue if empty, avoid taking the mutex if it looks empty + if (!pending_writes) { + if (!wqueue_) + break; + EnterCriticalSection(&mutex_); + pending_writes = wqueue_; + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + LeaveCriticalSection(&mutex_); + if (!pending_writes) + break; + } + + qs.udp_qsize1+= pending_writes->size; + + // Then issue writes + Packet *p = pending_writes; + pending_writes = p->next; + ClearOverlapped(&p->overlapped); + p->post_target = ThreadedPacketQueue::TARGET_UDP_DEVICE; + WSABUF wsabuf = {(ULONG)p->size, (char*)p->data}; + + int rv; + if (p->addr.sin.sin_family == AF_INET) { + rv = WSASendTo(socket_, &wsabuf, 1, NULL, 0, (struct sockaddr*)&p->addr.sin, sizeof(p->addr.sin), &p->overlapped, NULL); + } else { + if (socket_ipv6_ == INVALID_SOCKET) { + RERROR("UdpSocketWin32: unavailable ipv6 socket"); + FreePacket(p); + continue; + } + max_read_ipv6 = kConcurrentReadTap; + rv = WSASendTo(socket_ipv6_, &wsabuf, 1, NULL, 0, (struct sockaddr*)&p->addr.sin6, sizeof(p->addr.sin6), &p->overlapped, NULL); + } + if (rv != 0) { + DWORD err = WSAGetLastError(); + if (err != ERROR_IO_PENDING) { + RERROR("UdpSocketWin32: WSASendTo failed 0x%X", err); + FreePacket(p); + continue; + } + } + num_writes++; + } + } + FreePacketList(freed_packets); + FreePacketList(pending_writes); + + // Cancel all IO and wait for all completions + CancelIo((HANDLE)socket_); + CancelIo((HANDLE)socket_ipv6_); + + while (num_reads[IPV4] + num_reads[IPV6] + num_writes) { + ULONG num_entries = 0; + if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, 1, &num_entries, INFINITE, FALSE)) { + RINFO("GetQueuedCompletionStatusEx failed."); + break; + } + if (!entries[0].lpOverlapped) + continue; // This is the dummy entry from |PostQueuedCompletionStatus| + Packet *p = (Packet*)((byte*)entries[0].lpOverlapped - offsetof(Packet, overlapped)); + if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_UDP) { + num_reads[entries[0].lpCompletionKey]--; + } else { + num_writes--; + } + FreePacket(p); + } +} + + + +// Called on another thread to queue up a udp packet +void UdpSocketWin32::WriteUdpPacket(Packet *packet) { + if (qs.udp_qsize2 - qs.udp_qsize1 >= (unsigned)(packet->size < 576 ? 
MAX_BYTES_IN_UDP_OUT_QUEUE_SMALL : MAX_BYTES_IN_UDP_OUT_QUEUE)) {
+    FreePacket(packet);
+    return;
+  }
+  packet->next = NULL;
+  qs.udp_qsize2 += packet->size;
+
+  EnterCriticalSection(&mutex_);
+  Packet *was_empty = wqueue_;
+  *wqueue_end_ = packet;
+  wqueue_end_ = &packet->next;
+  LeaveCriticalSection(&mutex_);
+
+  if (was_empty == NULL) {
+    // Notify the worker thread that it should attempt more writes
+    PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  }
+}
+
+DWORD WINAPI UdpSocketWin32::UdpThread(void *x) {
+  UdpSocketWin32 *udp = (UdpSocketWin32 *)x;
+  udp->ThreadMain();
+  return 0;
+}
+
+void UdpSocketWin32::StartThread() {
+  DWORD thread_id;
+  thread_ = CreateThread(NULL, 0, &UdpThread, this, 0, &thread_id);
+  SetThreadPriority(thread_, THREAD_PRIORITY_ABOVE_NORMAL);
+}
+
+void UdpSocketWin32::StopThread() {
+  exit_thread_ = true;
+  PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  WaitForSingleObject(thread_, INFINITE);
+  CloseHandle(thread_);
+  thread_ = NULL;
+}
+
+ThreadedPacketQueue::ThreadedPacketQueue(WireguardProcessor *wg, NetworkStats *stats) {
+  wg_ = wg;
+  stats_ = stats;
+  InitializeCriticalSectionAndSpinCount(&mutex_, 1024);
+  event_ = CreateEvent(NULL, FALSE, FALSE, NULL);
+
+  last_ptr_ = &first_;
+  first_ = NULL;
+  handle_ = NULL;
+  timer_handle_ = NULL;
+  exit_flag_ = false;
+  timer_interrupt_ = false;
+  packets_in_queue_ = 0;
+  need_notify_ = 0;
+}
+
+ThreadedPacketQueue::~ThreadedPacketQueue() {
+  assert(handle_ == NULL);
+  assert(timer_handle_ == NULL);
+  first_ = NULL;
+  last_ptr_ = &first_;
+  DeleteCriticalSection(&mutex_);
+  CloseHandle(event_);
+}
+
+DWORD WINAPI ThreadedPacketQueue::ThreadedPacketQueueLauncher(VOID *x) {
+  ThreadedPacketQueue *pq = (ThreadedPacketQueue *)x;
+  return pq->ThreadMain();
+}
+
+DWORD ThreadedPacketQueue::ThreadMain() {
+  int free_packets_ctr = 0;
+  int overload = 0;
+
+  EnterCriticalSection(&mutex_);
+  while (!exit_flag_) {
+    if (timer_interrupt_) {
+      timer_interrupt_ = false;
+      need_notify_ = 0;
+      LeaveCriticalSection(&mutex_);
+      wg_->SecondLoop();
+      EnterCriticalSection(&stats_->mutex);
+      if (stats_->reset_stats) {
+        stats_->reset_stats = false;
+        wg_->ResetStats();
+      }
+      stats_->packet_stats = wg_->GetStats();
+      LeaveCriticalSection(&stats_->mutex);
+
+      CallbackUpdateUI();
+
+      // Conserve memory every 10s
+      if (free_packets_ctr++ == 10) {
+        free_packets_ctr = 0;
+        FreeAllPackets();
+      }
+      if (overload)
+        overload -= 1;
+      EnterCriticalSection(&mutex_);
+      continue;
+    }
+
+    // Grab the elements of the queue
+    Packet *packet = first_;
+    if (packet == NULL) {
+      need_notify_ = 1;
+      LeaveCriticalSection(&mutex_);
+      WaitForSingleObject(event_, INFINITE);
+      EnterCriticalSection(&mutex_);
+
+      //SleepConditionVariableCS(&cv_, &mutex, INFINITE);
+      continue;
+    }
+    // Steal the whole work queue
+    first_ = NULL;
+    last_ptr_ = &first_;
+    int packets_in_queue = packets_in_queue_;
+    packets_in_queue_ = 0;
+    need_notify_ = 0;
+    LeaveCriticalSection(&mutex_);
+
+    tpq_last_qsize = packets_in_queue;
+    if (packets_in_queue >= 1024)
+      overload = 2;
+    bool is_overload = (overload != 0);
+
+    WireguardProcessor *procint = wg_;
+    do {
+      Packet *next = packet->next;
+      if (packet->post_target == TARGET_PROCESSOR_UDP)
+        procint->HandleUdpPacket(packet, is_overload);
+      else
+        procint->HandleTunPacket(packet);
+      packet = next;
+    } while (packet);
+    EnterCriticalSection(&mutex_);
+  }
+  LeaveCriticalSection(&mutex_);
+  return 0;
+}
+
+void ThreadedPacketQueue::Start() {
+  if (handle_ == NULL) {
+    exit_flag_ = false;
DWORD thread_id; + handle_ = CreateThread(NULL, 0, &ThreadedPacketQueueLauncher, this, 0, &thread_id); + } + + assert(timer_handle_ == NULL); + timer_handle_ = CreateWaitableTimer(NULL, FALSE, NULL); + long long due_time = 10000000; + SetWaitableTimer(timer_handle_, (LARGE_INTEGER*)&due_time, 1000, &TimerRoutine, this, FALSE); +} + +void ThreadedPacketQueue::Stop() { + EnterCriticalSection(&mutex_); + exit_flag_ = true; + LeaveCriticalSection(&mutex_); + + SetEvent(event_); + + if (timer_handle_ != NULL) { + // Not sure if just CloseHandle will close any outstanding APCs + CancelWaitableTimer(timer_handle_); + CloseHandle(timer_handle_); + timer_handle_ = NULL; + } + + if (handle_ != NULL) { + WaitForSingleObject(handle_, INFINITE); + CloseHandle(handle_); + handle_ = NULL; + } + +} + +void ThreadedPacketQueue::AbortingDriver() { + EnterCriticalSection(&mutex_); + exit_flag_ = true; + LeaveCriticalSection(&mutex_); +} + +void ThreadedPacketQueue::Post(Packet *packet, Packet **end, int count) { + EnterCriticalSection(&mutex_); + if (packets_in_queue_ >= HARD_MAXIMUM_QUEUE_SIZE) { + LeaveCriticalSection(&mutex_); + FreePackets(packet, end, count); + return; + } + assert(packet != NULL); + if (!first_) { + assert(last_ptr_ == &first_); + } + packets_in_queue_ += count; + *last_ptr_ = packet; + last_ptr_ = end; + if (!first_) { + assert(last_ptr_ == &first_); + } + if (need_notify_) { + need_notify_ = 0; + LeaveCriticalSection(&mutex_); + SetEvent(event_); + return; + } + LeaveCriticalSection(&mutex_); +} + +void CALLBACK ThreadedPacketQueue::TimerRoutine(LPVOID lpArgToCompletionRoutine, DWORD dwTimerLowValue, DWORD dwTimerHighValue) { + ((ThreadedPacketQueue*)lpArgToCompletionRoutine)->PostTimerInterrupt(); +} + +void ThreadedPacketQueue::PostTimerInterrupt() { + EnterCriticalSection(&mutex_); + timer_interrupt_ = true; + if (need_notify_) { + need_notify_ = 0; + LeaveCriticalSection(&mutex_); + SetEvent(event_); + return; + } + LeaveCriticalSection(&mutex_); +} + +bool GetNetLuidFromGuid(const char *adapter_guid, NET_LUID *luid) { + char buffer[64]; + UUID uuid; + size_t len = strlen(adapter_guid); + if (adapter_guid[0] != '{' || adapter_guid[len - 1] != '}' || len >= 64) return false; + buffer[len - 2] = 0; + memcpy(buffer, adapter_guid + 1, len - 2); + RPC_STATUS status = UuidFromStringA((RPC_CSTR)buffer, &uuid); + if (status != 0) + return false; + return ConvertInterfaceGuidToLuid((GUID*)&uuid, luid) == 0; +} + +DWORD SetMtuOnNetworkAdapter(NET_LUID *InterfaceLuid, ADDRESS_FAMILY family, int new_mtu) { + MIB_IPINTERFACE_ROW row; + DWORD err; + InitializeIpInterfaceEntry(&row); + row.Family = family; + row.InterfaceLuid = *InterfaceLuid; + if ((err = GetIpInterfaceEntry(&row)) == 0) { + row.NlMtu = new_mtu; + if (row.Family == AF_INET) + row.SitePrefixLength = 0; + err = SetIpInterfaceEntry(&row); + } + return err; +} + +DWORD SetMetricOnNetworkAdapter(NET_LUID *InterfaceLuid, ADDRESS_FAMILY family, int new_metric) { + MIB_IPINTERFACE_ROW row; + DWORD err; + InitializeIpInterfaceEntry(&row); + row.Family = family; + row.InterfaceLuid = *InterfaceLuid; + if ((err = GetIpInterfaceEntry(&row)) == 0) { + row.Metric = new_metric; + row.UseAutomaticMetric = (new_metric == 0); + if (row.Family == AF_INET) + row.SitePrefixLength = 0; + err = SetIpInterfaceEntry(&row); + } + return err; +} + +static const char *PrintIPV6(const uint8 new_address[16]) { + sockaddr_in6 sin6 = {0}; + static char buf[100]; + if (!inet_ntop(PF_INET6, new_address, buf, 100)) + memcpy(buf, "unknown", 8); + return buf; +} 
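+// Note: PrintIPV6 formats into a static buffer, so the returned pointer is
+// only valid until the next call and the helper is not thread safe; it is
+// only used for one-off log messages here.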
+ +static bool SetIPV6AddressOnInterface(NET_LUID *InterfaceLuid, const uint8 new_address[16], int new_cidr) { + NETIO_STATUS Status; + PMIB_UNICASTIPADDRESS_TABLE table = NULL; + Status = GetUnicastIpAddressTable(AF_INET6, &table); + if (Status != 0) { + RERROR("GetUnicastAddressTable Failed. Error %d\n", Status); + return false; + } + + bool found_row = false; + for (int i = 0; i < (int)table->NumEntries; i++) { + MIB_UNICASTIPADDRESS_ROW *row = &table->Table[i]; + if (!memcmp(&row->InterfaceLuid, InterfaceLuid, sizeof(NET_LUID))) { + if (row->PrefixOrigin == 1 && row->SuffixOrigin == 1) { + if (row->OnLinkPrefixLength == new_cidr && !memcmp(&row->Address.Ipv6.sin6_addr, new_address, 16)) { + found_row = true; + continue; + } + Status = DeleteUnicastIpAddressEntry(row); + if (Status) + RERROR("Error %d deleting IPv6 address: %s/%d", Status, PrintIPV6((uint8*)&row->Address.Ipv6.sin6_addr), row->OnLinkPrefixLength); + else + RINFO("Deleted IPv6 address: %s/%d", PrintIPV6((uint8*)&row->Address.Ipv6.sin6_addr), row->OnLinkPrefixLength); + } + } + } + FreeMibTable(table); + + if (found_row) { + RINFO("Using IPv6 address: %s/%d", PrintIPV6(new_address), new_cidr); + return true; + } + + MIB_UNICASTIPADDRESS_ROW Row; + InitializeUnicastIpAddressEntry(&Row); + Row.OnLinkPrefixLength = new_cidr; + Row.Address.si_family = AF_INET6; + memcpy(&Row.Address.Ipv6.sin6_addr, new_address, 16); + Row.InterfaceLuid = *InterfaceLuid; + Status = CreateUnicastIpAddressEntry(&Row); + if (Status != 0) { + RERROR("Error %d setting IPv6 address: %s/%d", Status, PrintIPV6(new_address), new_cidr); + return false; + } + RINFO("Set IPV6 Address to: %s/%d", PrintIPV6(new_address), new_cidr); + return true; +} + +static bool IsIpv6AddressSet(const void *p) { + return (ReadLE64(p) | ReadLE64((char*)p + 8)) != 0; +} + + +static bool SetIPV6DnsOnInterface(NET_LUID *InterfaceLuid, const uint8 new_address[16]) { + char buf[128]; + char ipv6[128]; + NET_IFINDEX InterfaceIndex; + if (ConvertInterfaceLuidToIndex(InterfaceLuid, &InterfaceIndex)) + return false; + if (IsIpv6AddressSet(new_address)) { + if (!inet_ntop(AF_INET6, new_address, ipv6, sizeof(ipv6))) + return false; + + snprintf(buf, sizeof(buf), "netsh interface ipv6 set dns name=%d static %s validate=no", InterfaceIndex, ipv6); + } else { + snprintf(buf, sizeof(buf), "netsh interface ipv6 delete dns name=%d all", InterfaceIndex); + } + return RunNetsh(buf); +} + +static uint32 ComputeIpv4DefaultRoute(uint32 ip, uint32 netmask) { + uint32 default_route_v4 = (ip & netmask) | 1; + if (default_route_v4 == ip) + default_route_v4++; + return default_route_v4; +} + +static void ComputeIpv6DefaultRoute(const uint8 *ipv6_address, uint8 ipv6_cidr, uint8 *default_route_v6) { + memcpy(default_route_v6, ipv6_address, 16); + // clear the last bits of the ipv6 address to match the cidr. 
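+  // The bytes fully covered by the prefix are kept, the rest are zeroed, and a
+  // partial byte on the boundary keeps only its prefix bits; forcing the last
+  // byte nonzero below yields an address inside the tunnel subnet that can be
+  // used as the next hop for routes pointing at the adapter.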
+  size_t n = (ipv6_cidr + 7) >> 3;
+  memset(&default_route_v6[n], 0, 16 - n);
+  if (n == 0)
+    return;
+  // adjust the final byte (only when the cidr does not fall on a byte
+  // boundary; otherwise that byte is already fully part of the prefix)
+  if (ipv6_cidr & 7)
+    default_route_v6[n - 1] &= ~(0xff >> (ipv6_cidr & 7));
+  // set the very last byte to something
+  default_route_v6[15] |= 1;
+  // ensure it doesn't collide
+  if (memcmp(default_route_v6, ipv6_address, 16) == 0)
+    default_route_v6[15] ^= 3;
+}
+
+static bool AddMultipleCatchallRoutes(int inet, int bits, const uint8 *target, const NET_LUID &luid) {
+  uint8 tmp[16] = {0};
+  bool success = true;
+  for (int i = 0; i < (1 << bits); i++) {
+    tmp[0] = i << (8 - bits);
+    success &= AddRoute(inet, tmp, bits, target, &luid);
+  }
+  return success;
+}
+
+static uint8 GetInternetRouteBlockingState() {
+  if (internet_route_blocking_state == ROUTE_BLOCK_UNKNOWN) {
+    RouteInfo ri;
+    internet_route_blocking_state =
+        (GetDefaultRouteAndDeleteOldRoutes(AF_INET, NULL, TRUE, NULL, &ri) && ri.found_null_routes == 2) + ROUTE_BLOCK_OFF;
+  }
+  return internet_route_blocking_state;
+}
+
+static void SetInternetRouteBlockingState(bool want) {
+  if (want) {
+    internet_route_blocking_state = ROUTE_BLOCK_PENDING;
+  } else if (internet_route_blocking_state != ROUTE_BLOCK_OFF) {
+    RouteInfo ri;
+    GetDefaultRouteAndDeleteOldRoutes(AF_INET, NULL, FALSE, NULL, &ri);
+    GetDefaultRouteAndDeleteOldRoutes(AF_INET6, NULL, FALSE, NULL, &ri);
+    internet_route_blocking_state = ROUTE_BLOCK_OFF;
+  }
+}
+
+InternetBlockState GetInternetBlockState(bool *is_activated) {
+  int a = GetInternetRouteBlockingState();
+  int b = GetInternetFwBlockingState();
+
+  if (is_activated)
+    *is_activated = (a == ROUTE_BLOCK_ON || b == IBS_ACTIVE);
+
+  return (InternetBlockState)(
+      (a >= ROUTE_BLOCK_ON) * kBlockInternet_Route +
+      (b >= IBS_ACTIVE) * kBlockInternet_Firewall);
+}
+
+void SetInternetBlockState(InternetBlockState s) {
+  SetInternetRouteBlockingState((s & kBlockInternet_Route) != 0);
+  SetInternetFwBlockingState((s & kBlockInternet_Firewall) != 0);
+}
+
+TunWin32Adapter::TunWin32Adapter() {
+  handle_ = NULL;
+  current_dns_block_ = NULL;
+}
+
+TunWin32Adapter::~TunWin32Adapter() {
+}
+
+bool TunWin32Adapter::OpenAdapter(bool *exit_thread, DWORD open_flags) {
+  int retry_count = 10;
+  handle_ = OpenTunAdapter(guid_, retry_count, exit_thread, open_flags);
+  return (handle_ != NULL);
+}
+
+bool TunWin32Adapter::InitAdapter(const TunInterface::TunConfig &&config, TunInterface::TunConfigOut *out) {
+  ULONG info[3];
+  DWORD len;
+  out->enable_neighbor_discovery_spoofing = false;
+
+  if (!RunPrePostCommand(config.pre_post_commands.pre_up)) {
+    RERROR("Pre command failed!");
+    return false;
+  }
+
+  memset(info, 0, sizeof(info));
+  if (DeviceIoControl(handle_, TAP_IOCTL_GET_VERSION, &info, sizeof(info),
+                      &info, sizeof(info), &len, NULL)) {
+    RINFO("TAP Driver Version %d.%d %s", (int)info[0], (int)info[1], (info[2] ? "(DEBUG)" : ""));
+  }
+
+  if (info[0] < 9 || (info[0] == 9 && info[1] <= 8)) {
+    RERROR("TAP is too old.
Go to https://tunsafe.com/download to upgrade the driver"); + return false; + } + + // ULONG mtu = 0; + // if (DeviceIoControl(handle_, TAP_IOCTL_GET_MTU, &mtu, sizeof(mtu), &mtu, sizeof(mtu), &len, NULL)) + // RINFO("TAP-Win32 MTU=%d", (int)mtu); + // mtu_ = mtu; + + uint32 netmask = CidrToNetmaskV4(config.cidr); + + // Set TAP-Windows TUN subnet mode + if (1) { + uint32 v[3]; + + v[0] = htonl(config.ip); + v[1] = htonl(config.ip & netmask); + v[2] = htonl(netmask); + if (!DeviceIoControl(handle_, TAP_IOCTL_CONFIG_TUN, v, sizeof(v), v, sizeof(v), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_CONFIG_TUN) failed"); + return false; + } + } + + // Set DHCP IP/netmask + { + uint32 v[4]; + v[0] = htonl(config.ip); + v[1] = htonl(netmask); + v[2] = htonl((config.ip | ~netmask) - 1); // x.x.x.254 + v[3] = 31536000; // One year + if (!DeviceIoControl(handle_, TAP_IOCTL_CONFIG_DHCP_MASQ, v, sizeof(v), v, sizeof(v), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_CONFIG_DHCP_MASQ) failed"); + return false; + } + } + + bool has_dns_setting = false; + + // Set DHCP config string + if (config.dhcp_options_size != 0) { + byte output[10]; + if (!DeviceIoControl(handle_, TAP_IOCTL_CONFIG_DHCP_SET_OPT, + (void*)config.dhcp_options, (DWORD)config.dhcp_options_size, output, sizeof(output), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_CONFIG_DHCP_SET_OPT) failed"); + return false; + } + has_dns_setting = true; + } + + // Get device MAC address + if (!DeviceIoControl(handle_, TAP_IOCTL_GET_MAC, mac_adress_, 6, mac_adress_, sizeof(mac_adress_), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_GET_MAC) failed"); + } else { + out->enable_neighbor_discovery_spoofing = true; + memcpy(out->neighbor_discovery_spoofing_mac, mac_adress_, sizeof(out->neighbor_discovery_spoofing_mac)); + } + + // Set driver media status to 'connected' + ULONG status = TRUE; + if (!DeviceIoControl(handle_, TAP_IOCTL_SET_MEDIA_STATUS, &status, sizeof(status), + &status, sizeof(status), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_SET_MEDIA_STATUS) failed"); + return false; + } + + NET_LUID InterfaceLuid = {0}; + bool has_interface_luid = GetNetLuidFromGuid(guid_, &InterfaceLuid); + + if (!has_interface_luid) { + RERROR("Unable to determine interface luid for %s.", guid_); + return false; + } + + DWORD err; + + if (config.mtu) { + err = SetMtuOnNetworkAdapter(&InterfaceLuid, AF_INET, config.mtu); + if (err) + RERROR("SetMtuOnNetworkAdapter IPv4 failed: %d", err); + if (config.ipv6_cidr) { + err = SetMtuOnNetworkAdapter(&InterfaceLuid, AF_INET6, config.mtu); + if (err) + RERROR("SetMtuOnNetworkAdapter IPv6 failed: %d", err); + } + } + + if (config.ipv6_cidr) { + SetIPV6AddressOnInterface(&InterfaceLuid, config.ipv6_address, config.ipv6_cidr); + if (config.set_ipv6_dns) { + has_dns_setting |= IsIpv6AddressSet(config.dns_server_v6); + if (!SetIPV6DnsOnInterface(&InterfaceLuid, config.dns_server_v6)) { + RERROR("SetIPV6DnsOnInterface: failed"); + } + } + } + + if (has_dns_setting && config.block_dns_on_adapters) { + RINFO("Blocking standard DNS on all adapters"); + current_dns_block_ = BlockDnsExceptOnAdapter(InterfaceLuid, config.ipv6_cidr != 0); + + err = SetMetricOnNetworkAdapter(&InterfaceLuid, AF_INET, 2); + if (err) + RERROR("SetMetricOnNetworkAdapter IPv4 failed: %d", err); + + if (config.ipv6_cidr) { + err = SetMetricOnNetworkAdapter(&InterfaceLuid, AF_INET6, 2); + if (err) + RERROR("SetMetricOnNetworkAdapter IPv6 failed: %d", err); + } + } + + uint8 ibs = config.internet_blocking; + if (ibs == 
kBlockInternet_Default || ibs == kBlockInternet_DefaultOn) { + uint8 new_ibs = GetInternetBlockState(NULL); + ibs = (new_ibs == kBlockInternet_Off && ibs == kBlockInternet_DefaultOn) ? kBlockInternet_Firewall : new_ibs; + } + + bool block_all_traffic_route = (ibs & kBlockInternet_Route) != 0; + + RouteInfo ri, ri6; + + uint32 default_route_endpoint_v4 = ToBE32(config.default_route_endpoint_v4); + + // Delete any current /1 default routes and read some stuff from the routing table. + if (!GetDefaultRouteAndDeleteOldRoutes(AF_INET, &InterfaceLuid, block_all_traffic_route, config.use_ipv4_default_route ? (uint8*)&default_route_endpoint_v4 : NULL, &ri)) { + RERROR("Unable to read old default gateway and delete old default routes."); + return false; + } + + if (config.ipv6_cidr) { + // Delete any current /1 default routes and read some stuff from the routing table. + if (!GetDefaultRouteAndDeleteOldRoutes(AF_INET6, &InterfaceLuid, block_all_traffic_route, config.use_ipv6_default_route ? (uint8*)config.default_route_endpoint_v6 : NULL, &ri6)) { + RERROR("Unable to read old default gateway and delete old default routes for IPv6."); + return false; + } + } + + uint32 default_route_v4 = ComputeIpv4DefaultRoute(config.ip, netmask); + uint8 default_route_v6[16]; + + if (block_all_traffic_route) { + RINFO("Blocking all regular Internet traffic using routing rules"); + NET_LUID localhost_luid; + if (ConvertInterfaceIndexToLuid(1, &localhost_luid) || localhost_luid.Info.IfType != 24) { + RERROR("Unable to get localhost luid - while adding route based blocking."); + } else { + uint32 dst[4] = {0}; + if (!AddMultipleCatchallRoutes(AF_INET, 1, (uint8*)&dst, localhost_luid)) + RERROR("Unable to add routes for route based blocking."); + if (config.ipv6_cidr) { + if (!AddMultipleCatchallRoutes(AF_INET6, 1, (uint8*)&dst, localhost_luid)) + RERROR("Unable to add IPv6 routes for route based blocking."); + } + } + } + + internet_route_blocking_state = block_all_traffic_route + ROUTE_BLOCK_OFF; + + if (ibs & kBlockInternet_Firewall) { + RINFO("Blocking all regular Internet traffic%s", ri.found_default_adapter ? " (except DHCP)" : ""); + AddPersistentInternetBlocking(ri.found_default_adapter ? &ri.default_adapter : NULL, InterfaceLuid, config.ipv6_cidr != 0); + } else { + SetInternetFwBlockingState(false); + } + + // Configure default route? + if (config.use_ipv4_default_route) { + // Add a bypass route to the original gateway? + if (config.default_route_endpoint_v4 != 0) { + if (!ri.found_default_adapter) { + RERROR("Unable to read old ipv4 default gateway"); + return false; + } + if (!AddRoute(AF_INET, &default_route_endpoint_v4, 32, ri.default_gw, &ri.default_adapter, &routes_to_undo_)) { + RERROR("Unable to add ipv4 gateway bypass route."); + return false; + } + } + // Either add 4 routes or 2 routes, depending on if we use route blocking. + uint32 be = ToBE32(default_route_v4); + if (!AddMultipleCatchallRoutes(AF_INET, block_all_traffic_route ? 2 : 1, (uint8*)&be, InterfaceLuid)) + RERROR("Unable to add new default ipv4 route."); + } + + if (config.ipv6_cidr) { + ComputeIpv6DefaultRoute(config.ipv6_address, config.ipv6_cidr, default_route_v6); + + // Configure default route? 
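+    // As in the IPv4 case above, the existing default route is overridden
+    // rather than replaced: AddMultipleCatchallRoutes() installs 2^bits routes
+    // covering the whole address space (::/1 and 8000::/1 for bits == 1),
+    // which are more specific than ::/0 and therefore win; bits == 2 is used
+    // when route-based blocking already occupies the /1 slots.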
+    if (config.use_ipv6_default_route) {
+      if (IsIpv6AddressSet(config.default_route_endpoint_v6)) {
+        if (!ri6.found_default_adapter) {
+          RERROR("Unable to read old ipv6 default gateway");
+          return false;
+        }
+        if (!AddRoute(AF_INET6, config.default_route_endpoint_v6, 128, ri6.default_gw, &ri6.default_adapter, &routes_to_undo_)) {
+          RERROR("Unable to add ipv6 gateway bypass route.");
+          return false;
+        }
+      }
+      if (!AddMultipleCatchallRoutes(AF_INET6, block_all_traffic_route ? 2 : 1, default_route_v6, InterfaceLuid))
+        RERROR("Unable to add new default ipv6 route.");
+    }
+  }
+
+  // Add all the extra routes
+  for (auto it = config.extra_routes.begin(); it != config.extra_routes.end(); ++it) {
+    if (it->size == 32) {
+      uint32 be = ToBE32(default_route_v4);
+      AddRoute(AF_INET, it->addr, it->cidr, &be, &InterfaceLuid);
+    } else if (it->size == 128 && config.ipv6_cidr) {
+      AddRoute(AF_INET6, it->addr, it->cidr, default_route_v6, &InterfaceLuid);
+    }
+  }
+
+  NET_IFINDEX InterfaceIndex;
+  if (ConvertInterfaceLuidToIndex(&InterfaceLuid, &InterfaceIndex)) {
+    RERROR("Unable to get index of adapter");
+    return false;
+  }
+  if ((err = FlushIpNetTable2(AF_INET, InterfaceIndex)) != NO_ERROR) {
+    RERROR("FlushIpNetTable failed: 0x%X", err);
+    return false;
+  }
+  if (config.ipv6_cidr) {
+    if ((err = FlushIpNetTable2(AF_INET6, InterfaceIndex)) != NO_ERROR) {
+      RERROR("FlushIpNetTable failed: 0x%X", err);
+      return false;
+    }
+  }
+
+  RunPrePostCommand(config.pre_post_commands.post_up);
+
+  pre_down_ = std::move(config.pre_post_commands.pre_down);
+  post_down_ = std::move(config.pre_post_commands.post_down);
+
+  return true;
+}
+
+void TunWin32Adapter::CloseAdapter() {
+  RunPrePostCommand(pre_down_);
+
+  if (handle_ != NULL) {
+    ULONG status = FALSE;
+    DWORD len;
+    DeviceIoControl(handle_, TAP_IOCTL_SET_MEDIA_STATUS, &status, sizeof(status),
+                    &status, sizeof(status), &len, NULL);
+    CloseHandle(handle_);
+    handle_ = NULL;
+  }
+
+  for (auto it = routes_to_undo_.begin(); it != routes_to_undo_.end(); ++it)
+    DeleteRoute(&*it);
+  routes_to_undo_.clear();
+
+  RestoreDnsExceptOnAdapter(current_dns_block_);
+  current_dns_block_ = NULL;
+
+  RunPrePostCommand(post_down_);
+}
+
+static bool RunOneCommand(const std::string &cmd) {
+  std::string command = "cmd.exe /C " + cmd;
+
+  STARTUPINFOA si = {0};
+  PROCESS_INFORMATION pi = {0};
+
+  HANDLE hstdout_wr = NULL, hstdout_rd = NULL;
+  HANDLE hstdin_wr = NULL, hstdin_rd = NULL;
+
+  bool result = false;
+
+  SECURITY_ATTRIBUTES saAttr;
+  saAttr.nLength = sizeof(SECURITY_ATTRIBUTES);
+  saAttr.bInheritHandle = TRUE;
+  saAttr.lpSecurityDescriptor = NULL;
+
+  if (!CreatePipe(&hstdout_rd, &hstdout_wr, &saAttr, 0) ||
+      !CreatePipe(&hstdin_rd, &hstdin_wr, &saAttr, 0) ||
+      !SetHandleInformation(hstdout_rd, HANDLE_FLAG_INHERIT, 0) ||
+      !SetHandleInformation(hstdin_wr, HANDLE_FLAG_INHERIT, 0)) {
+    goto out;
+  }
+
+  CloseHandle(hstdin_wr);
+  hstdin_wr = NULL;
+
+  si.cb = sizeof(si);
+  si.dwFlags = STARTF_USESTDHANDLES;
+  si.hStdError = hstdout_wr;
+  si.hStdOutput = hstdout_wr;
+  si.hStdInput = hstdin_rd;
+
+  RINFO("Run: %s", cmd.c_str());
+  if (CreateProcessA(NULL, &command[0], NULL, NULL, TRUE, CREATE_NO_WINDOW, NULL, NULL, &si, &pi)) {
+    DWORD exit_code = -1;
+    char buf[1024];
+    DWORD bufend = 0, bufstart = 0;
+
+    CloseHandle(hstdout_wr);
+    hstdout_wr = NULL;
+
+    for (;;) {
+      DWORD bytes_read = 0;
+      bool foundeof = (!ReadFile(hstdout_rd, buf + bufend, sizeof(buf) - bufend, &bytes_read, NULL) || bytes_read == 0);
+      bufend += bytes_read;
+      for (;;) {
+        char *nl = (char*)memchr(buf + bufstart, '\n', bufend - bufstart);
+        if (!nl)
+          break;
+        char *st = buf + bufstart;
+        char *nl2 = nl;
+        if (nl != buf + bufstart && nl[-1] == '\r')
+          nl--;
+        bufstart = nl2 - buf + 1;
+        RINFO("%.*s", (int)(nl - st), st);
+      }
+      if (bufend - bufstart == sizeof(buf) || foundeof) {
+        if (bufend - bufstart)
+          RINFO("%.*s", (int)(bufend - bufstart), buf + bufstart);
+        bufstart = bufend = 0;
+      }
+      if (foundeof)
+        break;
+      if (bufstart) {
+        bufend -= bufstart;
+        memmove(buf, buf + bufstart, bufend);
+        bufstart = 0;
+      }
+    }
+    WaitForSingleObject(pi.hProcess, INFINITE);
+    GetExitCodeProcess(pi.hProcess, &exit_code);
+    CloseHandle(pi.hThread);
+    CloseHandle(pi.hProcess);
+    if (exit_code != 0) {
+      RERROR("Command line failed (%d) : %s", exit_code, cmd.c_str());
+    } else {
+      result = true;
+    }
+  } else {
+    RERROR("CreateProcess failed: %s", cmd.c_str());
+  }
+out:
+  CloseHandle(hstdout_rd);
+  CloseHandle(hstdout_wr);
+  CloseHandle(hstdin_rd);
+  CloseHandle(hstdin_wr);
+  return result;
+}
+
+bool TunWin32Adapter::RunPrePostCommand(const std::vector<std::string> &vec) {
+  bool success = true;
+  for (auto it = vec.begin(); it != vec.end(); ++it) {
+    if (!g_allow_pre_post) {
+      RERROR("Pre/Post commands are disabled. Ignoring: %s", it->c_str());
+    } else {
+      success &= RunOneCommand(*it);
+    }
+  }
+  return success;
+}
+
+//////////////////////////////////////////////////////////////////////////////
+
+TunWin32Iocp::TunWin32Iocp() {
+  wqueue_end_ = &wqueue_;
+  wqueue_ = NULL;
+
+  thread_ = NULL;
+  completion_port_handle_ = NULL;
+  packet_handler_ = NULL;
+  InitializeCriticalSectionAndSpinCount(&mutex_, 1024);
+  exit_thread_ = false;
+}
+
+TunWin32Iocp::~TunWin32Iocp() {
+  //assert(num_reads_ == 0 && num_writes_ == 0);
+  assert(thread_ == NULL);
+  CloseTun();
+  DeleteCriticalSection(&mutex_);
+}
+
+bool TunWin32Iocp::Initialize(const TunConfig &&config, TunConfigOut *out) {
+  CloseTun();
+
+  if (!adapter_.OpenAdapter(&exit_thread_, FILE_FLAG_OVERLAPPED))
+    return false;
+
+  completion_port_handle_ = CreateIoCompletionPort(adapter_.handle(), NULL, NULL, 0);
+  if (completion_port_handle_ == NULL)
+    return false;
+
+  return adapter_.InitAdapter(std::move(config), out);
+}
+
+void TunWin32Iocp::CloseTun() {
+  assert(thread_ == NULL);
+
+  adapter_.CloseAdapter();
+
+  if (completion_port_handle_) {
+    CloseHandle(completion_port_handle_);
+    completion_port_handle_ = NULL;
+  }
+
+  FreePacketList(wqueue_);
+  wqueue_ = NULL;
+  wqueue_end_ = &wqueue_;
+}
+
+enum {
+  kTunGetQueuedCompletionStatusSize = kConcurrentWriteTap + kConcurrentReadTap + 1
+};
+
+void TunWin32Iocp::ThreadMain() {
+  OVERLAPPED_ENTRY entries[kTunGetQueuedCompletionStatusSize];
+  Packet *pending_writes = NULL;
+  int num_reads = 0, num_writes = 0;
+  Packet *finished_reads = NULL, **finished_reads_end = &finished_reads;
+  Packet *freed_packets = NULL, **freed_packets_end = &freed_packets;
+  int freed_packets_count = 0;
+  DWORD err;
+
+  while (!exit_thread_) {
+    // Initiate more reads, reusing the Packet structures in |finished_writes|.
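+    // Keep up to kConcurrentReadTap overlapped ReadFile calls outstanding on
+    // the TAP handle; Packet buffers from completed writes are recycled
+    // through |freed_packets| so the steady state needs no allocation.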
+ for (int i = num_reads; i < kConcurrentReadTap; i++) { + Packet *p; + if (!AllocPacketFrom(&freed_packets, &freed_packets_count, &exit_thread_, &p)) + break; + memset(&p->overlapped, 0, sizeof(p->overlapped)); + p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_TUN; + if (!ReadFile(adapter_.handle(), p->data, kPacketCapacity, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) { + FreePacket(p); + + RERROR("TunWin32: ReadFile failed 0x%X", err); + + if (err == ERROR_OPERATION_ABORTED) { + packet_handler_->AbortingDriver(); + RERROR("TAP driver stopped communicating. Attempting to restart.", err); + // This can happen if we reinstall the TAP driver while there's an active connection. Wait a bit, then attempt to + // restart. + Sleep(1000); + CallbackTriggerReconnect(); + goto EXIT; + } + } else { + num_reads++; + } + } + g_tun_reads = num_reads; + + assert(freed_packets_count >= 0); + if (freed_packets_count >= 32) { + FreePackets(freed_packets, freed_packets_end, freed_packets_count); + freed_packets_count = 0; + freed_packets_end = &freed_packets; + } else if (freed_packets == NULL) { + assert(freed_packets_count == 0); + freed_packets_end = &freed_packets; + } + + ULONG num_entries = 0; + if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, kTunGetQueuedCompletionStatusSize, &num_entries, INFINITE, FALSE)) { + RINFO("GetQueuedCompletionStatusEx failed."); + break; + } + finished_reads_end = &finished_reads; + int finished_reads_count = 0; + + // Go through the finished entries and determine which ones are reads, and which ones are writes. + for (ULONG i = 0; i < num_entries; i++) { + if (!entries[i].lpOverlapped) + continue; // This is the dummy entry from |PostQueuedCompletionStatus| + Packet *p = (Packet*)((byte*)entries[i].lpOverlapped - offsetof(Packet, overlapped)); + if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_TUN) { + num_reads--; + if ((int)p->overlapped.Internal != 0) { + RERROR("TunWin32::ReadComplete error 0x%X", (int)p->overlapped.Internal); + FreePacket(p); + continue; + } + p->size = (int)p->overlapped.InternalHigh; + + *finished_reads_end = p; + finished_reads_end = &p->next; + finished_reads_count++; + } else { + num_writes--; + if ((int)p->overlapped.Internal != 0) { + RERROR("TunWin32::WriteComplete error 0x%X", (int)p->overlapped.Internal); + FreePacket(p); + continue; + } + freed_packets_count++; + *freed_packets_end = p; + freed_packets_end = &p->next; + } + } + *finished_reads_end = NULL; + *freed_packets_end = NULL; + + if (finished_reads != NULL) + packet_handler_->Post(finished_reads, finished_reads_end, finished_reads_count); + + // Initiate more writes from |wqueue_| + while (num_writes < kConcurrentWriteTap) { + // Refill from queue if empty, avoid taking the mutex if it looks empty + if (!pending_writes) { + if (!wqueue_) + break; + EnterCriticalSection(&mutex_); + pending_writes = wqueue_; + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + LeaveCriticalSection(&mutex_); + if (!pending_writes) + break; + } + // Then issue writes + Packet *p = pending_writes; + pending_writes = p->next; + memset(&p->overlapped, 0, sizeof(p->overlapped)); + p->post_target = ThreadedPacketQueue::TARGET_TUN_DEVICE; + if (!WriteFile(adapter_.handle(), p->data, p->size, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) { + RERROR("TunWin32: WriteFile failed 0x%X", err); + FreePacket(p); + } else { + num_writes++; + } + } + g_tun_writes = num_writes; + } + +EXIT: + // Cancel all IO and wait for all completions + 
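+  // CancelIo() forces the remaining overlapped operations to complete with an
+  // error, but the kernel owns each OVERLAPPED until its completion has been
+  // dequeued, so every in-flight Packet is collected from the port before
+  // being freed and letting the thread exit.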
CancelIo(adapter_.handle());
+  while (num_reads + num_writes) {
+    ULONG num_entries = 0;
+    if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, 1, &num_entries, INFINITE, FALSE)) {
+      RINFO("GetQueuedCompletionStatusEx failed.");
+      break;
+    }
+    if (!entries[0].lpOverlapped)
+      continue;  // This is the dummy entry from |PostQueuedCompletionStatus|
+    Packet *p = (Packet*)((byte*)entries[0].lpOverlapped - offsetof(Packet, overlapped));
+    if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_TUN) {
+      num_reads--;
+    } else {
+      num_writes--;
+    }
+    FreePacket(p);
+  }
+
+  FreePacketList(freed_packets);
+  FreePacketList(pending_writes);
+}
+
+DWORD WINAPI TunWin32Iocp::TunThread(void *x) {
+  TunWin32Iocp *xx = (TunWin32Iocp *)x;
+  xx->ThreadMain();
+  return 0;
+}
+
+void TunWin32Iocp::StartThread() {
+  DWORD thread_id;
+  thread_ = CreateThread(NULL, 0, &TunThread, this, 0, &thread_id);
+  SetThreadPriority(thread_, THREAD_PRIORITY_ABOVE_NORMAL);
+}
+
+void TunWin32Iocp::StopThread() {
+  exit_thread_ = true;
+  PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  WaitForSingleObject(thread_, INFINITE);
+  CloseHandle(thread_);
+  thread_ = NULL;
+}
+
+void TunWin32Iocp::WriteTunPacket(Packet *packet) {
+  packet->next = NULL;
+  EnterCriticalSection(&mutex_);
+  Packet *was_empty = wqueue_;
+  *wqueue_end_ = packet;
+  wqueue_end_ = &packet->next;
+  LeaveCriticalSection(&mutex_);
+  if (was_empty == NULL) {
+    // Notify the worker thread that it should attempt more writes
+    PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  }
+}
+
+//////////////////////////////////////////////////////////////////////////////
+
+TunWin32Overlapped::TunWin32Overlapped() {
+  wqueue_end_ = &wqueue_;
+  wqueue_ = NULL;
+
+  thread_ = NULL;
+
+  read_event_ = CreateEvent(NULL, TRUE, FALSE, NULL);
+  write_event_ = CreateEvent(NULL, TRUE, FALSE, NULL);
+  wake_event_ = CreateEvent(NULL, FALSE, FALSE, NULL);
+
+  packet_handler_ = NULL;
+  InitializeCriticalSectionAndSpinCount(&mutex_, 1024);
+  exit_thread_ = false;
+}
+
+TunWin32Overlapped::~TunWin32Overlapped() {
+  CloseTun();
+  DeleteCriticalSection(&mutex_);
+  CloseHandle(read_event_);
+  CloseHandle(write_event_);
+  CloseHandle(wake_event_);
+}
+
+bool TunWin32Overlapped::Initialize(const TunConfig &&config, TunConfigOut *out) {
+  CloseTun();
+  return adapter_.OpenAdapter(&exit_thread_, FILE_FLAG_OVERLAPPED) &&
+         adapter_.InitAdapter(std::move(config), out);
+}
+
+void TunWin32Overlapped::CloseTun() {
+  assert(thread_ == NULL);
+  adapter_.CloseAdapter();
+  FreePacketList(wqueue_);
+  wqueue_ = NULL;
+  wqueue_end_ = &wqueue_;
+}
+
+void TunWin32Overlapped::ThreadMain() {
+  Packet *pending_writes = NULL;
+  DWORD err;
+  Packet *read_packet = NULL, *write_packet = NULL;
+
+  HANDLE h[3];
+  while (!exit_thread_) {
+    if (read_packet == NULL) {
+      Packet *p = AllocPacket();
+      memset(&p->overlapped, 0, sizeof(p->overlapped));
+      p->overlapped.hEvent = read_event_;
+      p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_TUN;
+      if (!ReadFile(adapter_.handle(), p->data, kPacketCapacity, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) {
+        FreePacket(p);
+        RERROR("TunWin32: ReadFile failed 0x%X", err);
+      } else {
+        read_packet = p;
+      }
+    }
+
+    int n = 0;
+    if (write_packet)
+      h[n++] = write_event_;
+    if (read_packet != NULL)
+      h[n++] = read_event_;
+    h[n++] = wake_event_;
+
+    DWORD res = WaitForMultipleObjects(n, h, FALSE, INFINITE);
+
+    if (res >= WAIT_OBJECT_0 && res <= WAIT_OBJECT_0 + 2) {
+      HANDLE hx = h[res - WAIT_OBJECT_0];
WAIT_OBJECT_0]; + if (hx == read_event_) { + read_packet->size = (int)read_packet->overlapped.InternalHigh; + read_packet->next = NULL; + packet_handler_->Post(read_packet, &read_packet->next, 1); + read_packet = NULL; + } else if (hx == write_event_) { + FreePacket(write_packet); + write_packet = NULL; + } + } else { + RERROR("Wait said %d", res); + } + + if (write_packet == NULL) { + if (!pending_writes) { + EnterCriticalSection(&mutex_); + pending_writes = wqueue_; + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + LeaveCriticalSection(&mutex_); + } + if (pending_writes) { + // Then issue writes + Packet *p = pending_writes; + pending_writes = p->next; + memset(&p->overlapped, 0, sizeof(p->overlapped)); + p->overlapped.hEvent = write_event_; + p->post_target = ThreadedPacketQueue::TARGET_TUN_DEVICE; + if (!WriteFile(adapter_.handle(), p->data, p->size, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) { + RERROR("TunWin32: WriteFile failed 0x%X", err); + FreePacket(p); + } else { + write_packet = p; + } + } + } + } + + // TODO: Free memory + CancelIo(adapter_.handle()); + FreePacketList(pending_writes); +} + +DWORD WINAPI TunWin32Overlapped::TunThread(void *x) { + TunWin32Overlapped *xx = (TunWin32Overlapped *)x; + xx->ThreadMain(); + return 0; +} + +void TunWin32Overlapped::StartThread() { + DWORD thread_id; + thread_ = CreateThread(NULL, 0, &TunThread, this, 0, &thread_id); + SetThreadPriority(thread_, ABOVE_NORMAL_PRIORITY_CLASS); +} + +void TunWin32Overlapped::StopThread() { + exit_thread_ = true; + SetEvent(wake_event_); + WaitForSingleObject(thread_, INFINITE); + CloseHandle(thread_); + thread_ = NULL; +} + +void TunWin32Overlapped::WriteTunPacket(Packet *packet) { + packet->next = NULL; + EnterCriticalSection(&mutex_); + Packet *was_empty = wqueue_; + *wqueue_end_ = packet; + wqueue_end_ = &packet->next; + LeaveCriticalSection(&mutex_); + if (was_empty == NULL) + SetEvent(wake_event_); +} + + + + + +DWORD WINAPI TunsafeBackendWin32::WorkerThread(void *bk) { + TunsafeBackendWin32 *backend = (TunsafeBackendWin32*)bk; + + TunWin32Iocp tun; + UdpSocketWin32 udp; + WireguardProcessor wg_proc(&udp, &tun, backend->procdel_); + + ThreadedPacketQueue queues_for_processor(&wg_proc, &backend->stats_); + + qs.udp_qsize1 = qs.udp_qsize2 = 0; + + udp.SetPacketHandler(&queues_for_processor); + tun.SetPacketHandler(&queues_for_processor); + + if (!ParseWireGuardConfigFile(&wg_proc, backend->config_file_, &backend->exit_flag_)) + goto getout; + + if (!wg_proc.Start()) + goto getout; + + queues_for_processor.Start(); + udp.StartThread(); + tun.StartThread(); + + CallbackSetPublicKey(wg_proc.dev().public_key()); + + while (!backend->exit_flag_) { + SleepEx(INFINITE, TRUE); + } + + udp.StopThread(); + tun.StopThread(); + queues_for_processor.Stop(); + + FreeAllPackets(); +getout: + return 0; +} + +static void WINAPI ExitServiceAPC(ULONG_PTR a) { + *(bool*)a = true; +} + +TunsafeBackendWin32::TunsafeBackendWin32() { + memset(&stats_, 0, sizeof(stats_)); + InitPacketMutexes(); + InitializeCriticalSectionAndSpinCount(&stats_.mutex, 1024); + worker_thread_ = NULL; +} + +TunsafeBackendWin32::~TunsafeBackendWin32() { + DeleteCriticalSection(&stats_.mutex); +} + +ProcessorStats TunsafeBackendWin32::GetStats() { + EnterCriticalSection(&stats_.mutex); + ProcessorStats stats = stats_.packet_stats; + LeaveCriticalSection(&stats_.mutex); + return stats; +} + +void TunsafeBackendWin32::Start(ProcessorDelegate *procdel, const char *config_file) { + Stop(); + procdel_ = procdel; + exit_flag_ = false; 
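+  // Editor's note (sketch, not in the original source): Stop(), defined below,
+  // tears the worker down with a user-mode APC rather than an event. The worker
+  // parks in an alertable SleepEx, so the APC runs ExitServiceAPC on the
+  // worker's own stack and flips |exit_flag_| before the wait returns:
+  //
+  //   QueueUserAPC(&ExitServiceAPC, worker_thread_, (ULONG_PTR)&exit_flag_);
+  //   // worker side: while (!exit_flag_) SleepEx(INFINITE, TRUE);  // alertable
+  //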
+ DWORD thread_id; + config_file_ = _strdup(config_file); + worker_thread_ = CreateThread(NULL, 0, &WorkerThread, this, 0, &thread_id); + SetThreadPriority(worker_thread_, THREAD_PRIORITY_ABOVE_NORMAL); +} + +void TunsafeBackendWin32::Stop() { + if (worker_thread_) { + QueueUserAPC(&ExitServiceAPC, worker_thread_, (ULONG_PTR)&exit_flag_); + WaitForSingleObject(worker_thread_, INFINITE); + CloseHandle(worker_thread_); + worker_thread_ = NULL; + free(config_file_); + config_file_ = NULL; + } +} + diff --git a/network_win32.h b/network_win32.h new file mode 100644 index 0000000..a67f226 --- /dev/null +++ b/network_win32.h @@ -0,0 +1,179 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#pragma once + +#include "stdafx.h" +#include "tunsafe_types.h" +#include "netapi.h" +#include "network_win32_api.h" + +struct Packet; +class WireguardProcessor; + + +class ThreadedPacketQueue { +public: + explicit ThreadedPacketQueue(WireguardProcessor *wg, NetworkStats *stats); + ~ThreadedPacketQueue(); + + enum { + TARGET_PROCESSOR_UDP = 0, + TARGET_PROCESSOR_TUN = 1, + TARGET_UDP_DEVICE = 2, + TARGET_TUN_DEVICE = 3, + }; + + void Start(); + void Stop(); + + void Post(Packet *packet, Packet **end, int count); + void AbortingDriver(); + +private: + void PostTimerInterrupt(); + static void CALLBACK TimerRoutine(LPVOID lpArgToCompletionRoutine, DWORD dwTimerLowValue, DWORD dwTimerHighValue); + + DWORD ThreadMain(); + static DWORD WINAPI ThreadedPacketQueueLauncher(VOID *x); + Packet *first_; + Packet **last_ptr_; + uint32 packets_in_queue_; + uint32 need_notify_; + CRITICAL_SECTION mutex_; + HANDLE event_; + + HANDLE timer_handle_; + HANDLE handle_; + WireguardProcessor *wg_; + bool exit_flag_; + bool timer_interrupt_; + NetworkStats *stats_; +}; + +// Encapsulates a UDP socket, optionally listening for incoming packets +// on a specific port. +class UdpSocketWin32 : public UdpInterface { +public: + explicit UdpSocketWin32(); + ~UdpSocketWin32(); + + void SetPacketHandler(ThreadedPacketQueue *packet_handler) { packet_handler_ = packet_handler; } + + void StartThread(); + void StopThread(); + + // -- from UdpInterface + virtual bool Initialize(int listen_on_port) override; + virtual void WriteUdpPacket(Packet *packet) override; + +private: + + void ThreadMain(); + static DWORD WINAPI UdpThread(void *x); + + // All packets queued for writing. 
Locked by |mutex_| + Packet *wqueue_, **wqueue_end_; + + CRITICAL_SECTION mutex_; + + ThreadedPacketQueue *packet_handler_; + SOCKET socket_; + SOCKET socket_ipv6_; + HANDLE completion_port_handle_; + HANDLE thread_; + + bool exit_thread_; +}; + +class TunWin32Adapter { +public: + TunWin32Adapter(); + ~TunWin32Adapter(); + + bool OpenAdapter(bool *exit_thread, DWORD open_flags); + bool InitAdapter(const TunInterface::TunConfig &&config, TunInterface::TunConfigOut *out); + void CloseAdapter(); + + HANDLE handle() { return handle_; } + +private: + bool RunPrePostCommand(const std::vector &vec); + + HANDLE handle_; + HANDLE current_dns_block_; + + std::vector routes_to_undo_; + uint8 mac_adress_[6]; + int mtu_; + char guid_[64]; + + std::vector pre_down_, post_down_; +}; + +// Implementation of TUN interface handling using IO Completion Ports +class TunWin32Iocp : public TunInterface { +public: + explicit TunWin32Iocp(); + ~TunWin32Iocp(); + + void SetPacketHandler(ThreadedPacketQueue *packet_handler) { packet_handler_ = packet_handler; } + + void StartThread(); + void StopThread(); + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + +private: + void CloseTun(); + void ThreadMain(); + static DWORD WINAPI TunThread(void *x); + + ThreadedPacketQueue *packet_handler_; + HANDLE completion_port_handle_; + HANDLE thread_; + + CRITICAL_SECTION mutex_; + + bool exit_thread_; + + // All packets queued for writing + Packet *wqueue_, **wqueue_end_; + + TunWin32Adapter adapter_; +}; + +// Implementation of TUN interface handling using Overlapped IO +class TunWin32Overlapped : public TunInterface { +public: + explicit TunWin32Overlapped(); + ~TunWin32Overlapped(); + + void SetPacketHandler(ThreadedPacketQueue *packet_handler) { packet_handler_ = packet_handler; } + + void StartThread(); + void StopThread(); + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + +private: + void CloseTun(); + void ThreadMain(); + static DWORD WINAPI TunThread(void *x); + + ThreadedPacketQueue *packet_handler_; + HANDLE thread_; + + CRITICAL_SECTION mutex_; + + HANDLE read_event_, write_event_, wake_event_; + + bool exit_thread_; + + Packet *wqueue_, **wqueue_end_; + + TunWin32Adapter adapter_; +}; diff --git a/network_win32_api.h b/network_win32_api.h new file mode 100644 index 0000000..dac9856 --- /dev/null +++ b/network_win32_api.h @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
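+//
+// Editor's usage sketch (assumption: this mirrors how tunsafe_win32.cpp drives
+// the backend; it is not part of the original header):
+//
+//   TunsafeBackendWin32 *backend = new TunsafeBackendWin32();
+//   backend->Start(&my_procdel, "TunSafe.conf");  // spawns the worker thread
+//   ProcessorStats stats = backend->GetStats();   // mutex-protected snapshot
+//   backend->Stop();                              // APC-wakes and joins the worker
+//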
+#pragma once
+
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "wireguard.h"
+
+struct NetworkStats {
+  bool reset_stats;
+  CRITICAL_SECTION mutex;
+  ProcessorStats packet_stats;
+};
+
+class TunsafeBackendWin32 {
+public:
+  TunsafeBackendWin32();
+  ~TunsafeBackendWin32();
+
+  void Start(ProcessorDelegate *procdel, const char *config_file);
+  void Stop();
+
+  ProcessorStats GetStats();
+  void ResetStats() { stats_.reset_stats = true; }
+
+  bool is_started() const { return worker_thread_ != NULL; }
+
+private:
+  static DWORD WINAPI WorkerThread(void *x);
+
+  NetworkStats stats_;
+  HANDLE worker_thread_;
+  bool exit_flag_;
+
+  ProcessorDelegate *procdel_;
+  char *config_file_;
+};
+
+InternetBlockState GetInternetBlockState(bool *is_activated);
+
+// Changes the internet block state. The caller decides whether a reconnect is
+// needed (see the WM_COMMAND handler in tunsafe_win32.cpp, which restarts the
+// service when blocking is tightened while connected).
+void SetInternetBlockState(InternetBlockState s);
+
+extern int tpq_last_qsize;
+extern int g_tun_reads, g_tun_writes;
diff --git a/network_win32_dnsblock.cpp b/network_win32_dnsblock.cpp
new file mode 100644
index 0000000..e17f09a
--- /dev/null
+++ b/network_win32_dnsblock.cpp
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "network_win32_dnsblock.h"
+#include
+#include
+
+#pragma comment (lib, "Fwpuclnt.lib")
+
+static const GUID TUNSAFE_DNS_SUBLAYER = {0x1ce6cce2, 0xcc8f, 0x4175, {0xac, 0x7b, 0x95, 0xfd, 0xe8, 0x95, 0x80, 0x92}};
+static const GUID TUNSAFE_GLOBAL_BLOCK_SUBLAYER = {0x1ce6cce2, 0xcc8f, 0x4175, {0xac, 0x7b, 0x95, 0xfd, 0xe8, 0x95, 0x80, 0x93}};
+
+static bool GetFwpmAppIdFromCurrentProcess(FWP_BYTE_BLOB **appid) {
+  wchar_t module_filename[MAX_PATH];
+  DWORD err = GetModuleFileNameW(NULL, module_filename, ARRAYSIZE(module_filename));
+  if (err == 0 || err == ARRAYSIZE(module_filename))
+    return false;
+  err = FwpmGetAppIdFromFileName0(module_filename, appid);
+  if (err != 0)
+    return false;
+  return true;
+}
+
+static uint8 internet_fw_blocking_state;
+
+static inline bool FwpmFilterAddCheckedAleConnect(HANDLE handle, FWPM_FILTER0 *filter, bool also_ipv6, int idx) {
+  DWORD err;
+  UINT64 dummy;
+
+  filter->layerKey = FWPM_LAYER_ALE_AUTH_CONNECT_V4;
+  err = FwpmFilterAdd0(handle, filter, NULL, &dummy);
+  if (err != 0) {
+    RERROR("FwpmFilterAdd0 #%d failed (%s): %d", idx, "ipv4", err);
+    return false;
+  }
+
+  if (also_ipv6) {
+    filter->layerKey = FWPM_LAYER_ALE_AUTH_CONNECT_V6;
+    err = FwpmFilterAdd0(handle, filter, NULL, &dummy);
+    if (err != 0) {
+      RERROR("FwpmFilterAdd0 #%d failed (%s): %d", idx, "ipv6", err);
+      return false;
+    }
+  }
+
+  return true;
+}
+
+HANDLE BlockDnsExceptOnAdapter(const NET_LUID &luid, bool also_ipv6) {
+  FWPM_SUBLAYER0 *sublayer = NULL;
+  FWP_BYTE_BLOB *fwp_appid = NULL;
+
+  FWPM_FILTER0 filter;
+  FWPM_FILTER_CONDITION0 filter_condition[2];
+  DWORD err;
+  HANDLE handle = NULL;
+
+  {
+    FWPM_SESSION0 session = {0};
+    session.flags = FWPM_SESSION_FLAG_DYNAMIC;
+    err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle);
+    if (err != 0) {
+      RERROR("FwpmEngineOpen0 failed: %d", err);
+      goto getout;
+    }
+  }
+
+  {
+    FWPM_SUBLAYER0 sublayer = {0};
+    sublayer.subLayerKey = TUNSAFE_DNS_SUBLAYER;
+    sublayer.displayData.name = L"TunSafe";
+    sublayer.weight = 0x100;
+    err = FwpmSubLayerAdd0(handle, &sublayer, NULL);
+    if (err != 0) {
+      RERROR("FwpmSubLayerAdd0 failed: %d", err);
+      goto getout;
+    }
+  }
+
+  if (!GetFwpmAppIdFromCurrentProcess(&fwp_appid)) {
+    RERROR("GetFwpmAppIdFromCurrentProcess failed");
+    goto getout;
+ } + + // Allow all queries to port 53 from our process + memset(&filter, 0, sizeof(filter)); + filter_condition[0].fieldKey = FWPM_CONDITION_IP_REMOTE_PORT; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter_condition[0].conditionValue.type = FWP_UINT16; + filter_condition[0].conditionValue.uint16 = 53; + filter_condition[1].fieldKey = FWPM_CONDITION_ALE_APP_ID; + filter_condition[1].matchType = FWP_MATCH_EQUAL; + filter_condition[1].conditionValue.type = FWP_BYTE_BLOB_TYPE; + filter_condition[1].conditionValue.byteBlob = fwp_appid; + filter.filterCondition = filter_condition; + filter.numFilterConditions = 2; + filter.subLayerKey = TUNSAFE_DNS_SUBLAYER; + filter.displayData.name = L"TunSafe"; + filter.weight.type = FWP_UINT8; + filter.weight.uint8 = 15; + filter.action.type = FWP_ACTION_PERMIT; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 1)) + goto getout; + + // Allow DNS queries from TAP + filter_condition[1].fieldKey = FWPM_CONDITION_IP_LOCAL_INTERFACE; + filter_condition[1].conditionValue.type = FWP_UINT64; + filter_condition[1].conditionValue.uint64 = (uint64*)&luid.Value; + filter.weight.uint8 = 14; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + + // Block all IPv4 and IPv6 + filter.numFilterConditions = 1; + filter.weight.type = FWP_EMPTY; + filter.action.type = FWP_ACTION_BLOCK; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 3)) + goto getout; + + goto success; +getout: + if (handle != NULL) { + FwpmEngineClose0(handle); + handle = NULL; + } +success: + if (fwp_appid) + FwpmFreeMemory0((void **)&fwp_appid); + return handle; +} + +void RestoreDnsExceptOnAdapter(HANDLE h) { + if (h) + FwpmEngineClose0(h); +} + + +static bool RemovePersistentInternetBlockingInner(HANDLE handle) { + FWPM_FILTER_ENUM_TEMPLATE0 enum_template = {0}; + HANDLE enum_handle = NULL; + DWORD err; + UINT32 num_returned; + FWPM_FILTER0 **filter = NULL; + + for (int iptype = 0; iptype < 2; iptype++) { + enum_template.layerKey = iptype == 0 ? 
FWPM_LAYER_ALE_AUTH_CONNECT_V4 : FWPM_LAYER_ALE_AUTH_CONNECT_V6; + enum_template.actionMask = 0xffffffff; + + err = FwpmFilterCreateEnumHandle0(handle, &enum_template, &enum_handle); + if (err != 0) { + RERROR("FwpmFilterCreateEnumHandle0 failed: %d", err); + goto getout; + } + + do { + err = FwpmFilterEnum0(handle, enum_handle, 256, &filter, &num_returned); + if (err != 0) { + RERROR("FwpmFilterEnum0 failed: %d", err); + goto getout; + } + for (UINT32 i = 0; i < num_returned; i++) { + FWPM_FILTER0 *cur_filter = filter[i]; + if (memcmp(&cur_filter->subLayerKey, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, sizeof(GUID)) == 0) { + err = FwpmFilterDeleteById0(handle, cur_filter->filterId); + if (err != 0) + RERROR("FwpmFilterDeleteById0 failed: %d", err); + } + } + FwpmFreeMemory0((void**)&filter); + } while (num_returned == 256); + + FwpmFilterDestroyEnumHandle0(handle, enum_handle); + enum_handle = NULL; + } + + err = FwpmSubLayerDeleteByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER); + if (err != 0 && err != FWP_E_SUBLAYER_NOT_FOUND) { + RERROR("FwpmSubLayerDeleteByKey0 failed: %d", err); + goto getout; + } + + internet_fw_blocking_state = IBS_INACTIVE; + +getout: + if (enum_handle != NULL) { + FwpmFilterDestroyEnumHandle0(handle, enum_handle); + } + return false; +} + +bool AddPersistentInternetBlocking(const NET_LUID *default_interface, const NET_LUID &luid_to_allow, bool also_ipv6) { + FWPM_SUBLAYER0 *sublayer_p = NULL; + FWP_BYTE_BLOB *fwp_appid = NULL; + FWPM_FILTER0 filter; + FWPM_FILTER_CONDITION0 filter_condition[3]; + DWORD err; + HANDLE handle = NULL; + bool success = false; + + { + FWPM_SESSION0 session = {0}; + err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle); + if (err != 0) { + RERROR("FwpmEngineOpen0 failed: %d", err); + goto getout; + } + } + + if (FwpmSubLayerGetByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, &sublayer_p) == 0) { + // The sublayer already exists + FwpmFreeMemory0((void **)&sublayer_p); + } else { + // Add new sublayer + FWPM_SUBLAYER0 sublayer = {0}; + sublayer.subLayerKey = TUNSAFE_GLOBAL_BLOCK_SUBLAYER; + sublayer.displayData.name = L"TunSafe Global Block"; + sublayer.weight = 0x101; + err = FwpmSubLayerAdd0(handle, &sublayer, NULL); + if (err != 0) { + RERROR("FwpmSubLayerAdd0 failed: %d", err); + goto getout; + } + } + + if (!GetFwpmAppIdFromCurrentProcess(&fwp_appid)) { + RERROR("GetFwpmAppIdFromCurrentProcess failed"); + goto getout; + } + + // Allow all outgoing queries from our process + memset(&filter, 0, sizeof(filter)); + filter_condition[0].fieldKey = FWPM_CONDITION_ALE_APP_ID; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter_condition[0].conditionValue.type = FWP_BYTE_BLOB_TYPE; + filter_condition[0].conditionValue.byteBlob = fwp_appid; + filter.numFilterConditions = 1; + filter.filterCondition = filter_condition; + filter.subLayerKey = TUNSAFE_GLOBAL_BLOCK_SUBLAYER; + filter.displayData.name = L"TunSafe Global Block"; + filter.weight.type = FWP_UINT8; + filter.weight.uint8 = 15; + filter.action.type = FWP_ACTION_PERMIT; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 1)) + goto getout; + + // Permit all queries going out on TUN + filter_condition[0].fieldKey = FWPM_CONDITION_IP_LOCAL_INTERFACE; + filter_condition[0].conditionValue.type = FWP_UINT64; + filter_condition[0].conditionValue.uint64 = (uint64*)&luid_to_allow.Value; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter.weight.uint8 = 14; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + // Permit 
everything that's loopback + filter_condition[0].fieldKey = FWPM_CONDITION_INTERFACE_TYPE; + filter_condition[0].conditionValue.type = FWP_UINT32; + filter_condition[0].conditionValue.uint32 = 24; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter.weight.uint8 = 13; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + + // Permit all queries on the DHCP port (It uses 68 on the local side and 67 on the remote side) + if (default_interface) { + filter_condition[2].fieldKey = FWPM_CONDITION_IP_LOCAL_PORT; + filter_condition[2].matchType = FWP_MATCH_EQUAL; + filter_condition[2].conditionValue.type = FWP_UINT16; + filter_condition[2].conditionValue.uint16 = 68; + filter_condition[1].fieldKey = FWPM_CONDITION_IP_REMOTE_PORT; + filter_condition[1].matchType = FWP_MATCH_EQUAL; + filter_condition[1].conditionValue.type = FWP_UINT16; + filter_condition[1].conditionValue.uint16 = 67; + filter.numFilterConditions = 3; + filter_condition[0].fieldKey = FWPM_CONDITION_IP_LOCAL_INTERFACE; + filter_condition[0].conditionValue.type = FWP_UINT64; + filter_condition[0].conditionValue.uint64 = (uint64*)&default_interface->Value; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter.weight.uint8 = 12; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + } + + // Block the rest + filter.numFilterConditions = 0; + filter.weight.type = FWP_EMPTY; + filter.action.type = FWP_ACTION_BLOCK; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 3)) + goto getout; + + success = true; + internet_fw_blocking_state = IBS_ACTIVE; + +getout: + if (handle != NULL) { + // delete the layer on failure + if (!success) + RemovePersistentInternetBlockingInner(handle); + FwpmEngineClose0(handle); + handle = NULL; + } + if (fwp_appid) + FwpmFreeMemory0((void **)&fwp_appid); + return success; +} + +static bool RemovePersistentInternetBlocking() { + DWORD err; + HANDLE handle = NULL; + FWPM_SUBLAYER0 *sublayer_p = NULL; + + { + FWPM_SESSION0 session = {0}; + err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle); + if (err != 0) { + RERROR("FwpmEngineOpen0 failed: %d", err); + goto getout; + } + } + + if (FwpmSubLayerGetByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, &sublayer_p) == 0) { + // The sublayer exists + FwpmFreeMemory0((void **)&sublayer_p); + } else { + // Sublayer does not exist + internet_fw_blocking_state = IBS_INACTIVE; + goto getout; + } + + RemovePersistentInternetBlockingInner(handle); + +getout: + if (handle != NULL) { + FwpmEngineClose0(handle); + handle = NULL; + } + return false; +} + +uint8 GetInternetFwBlockingState() { + if (internet_fw_blocking_state != 0) + return internet_fw_blocking_state; + + DWORD err; + HANDLE handle = NULL; + FWPM_SUBLAYER0 *sublayer_p = NULL; + bool result; + + { + FWPM_SESSION0 session = {0}; + err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle); + if (err != 0) { + RERROR("FwpmEngineOpen0 failed: %d", err); + goto getout; + } + } + + if (FwpmSubLayerGetByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, &sublayer_p) == 0) { + // The sublayer already exists + FwpmFreeMemory0((void **)&sublayer_p); + result = true; + } else { + result = false; + } + +getout: + if (handle != NULL) { + FwpmEngineClose0(handle); + handle = NULL; + } + + return internet_fw_blocking_state = result + IBS_INACTIVE; +} + +void SetInternetFwBlockingState(bool want) { + uint8 old_state = GetInternetFwBlockingState(); + if ((old_state >= IBS_ACTIVE) != want) { + if (!want) { + 
RemovePersistentInternetBlocking(); + } else { + internet_fw_blocking_state = IBS_PENDING; + } + } +} + diff --git a/network_win32_dnsblock.h b/network_win32_dnsblock.h new file mode 100644 index 0000000..1da7e64 --- /dev/null +++ b/network_win32_dnsblock.h @@ -0,0 +1,20 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#pragma once + +HANDLE BlockDnsExceptOnAdapter(const NET_LUID &luid, bool also_ipv6 ); +void RestoreDnsExceptOnAdapter(HANDLE h); + +bool AddPersistentInternetBlocking(const NET_LUID *default_interface, const NET_LUID &luid_to_allow, bool also_ipv6); + + + +enum { + IBS_UNKOWN, + IBS_INACTIVE, + IBS_ACTIVE, + IBS_PENDING, +}; +void SetInternetFwBlockingState(bool want); +uint8 GetInternetFwBlockingState(); + diff --git a/readme_osx.txt b/readme_osx.txt new file mode 100644 index 0000000..e10d754 --- /dev/null +++ b/readme_osx.txt @@ -0,0 +1,19 @@ +WARNING: ALPHA SOFTWARE - USE AT YOUR OWN RISK + +License: https://tunsafe.com/downloads/LICENSE.TXT + +This is the experimental OSX version of TunSafe. + +It is single threaded, has no UI, does not support IPv6, +and does not support switching DNS. + +Still - it's roughly 2x as fast as OpenVPN. 260mbit vs 140mbit. + +It uses the built-in utun network adapter so you need a +reasonably new OSX version. + +Usage (from a Terminal): +sudo ./tunsafe Config.conf + +Press Ctrl-C to exit. + diff --git a/resource.h b/resource.h new file mode 100644 index 0000000..3c10a98 Binary files /dev/null and b/resource.h differ diff --git a/stdafx.cpp b/stdafx.cpp new file mode 100644 index 0000000..fd4f341 --- /dev/null +++ b/stdafx.cpp @@ -0,0 +1 @@ +#include "stdafx.h" diff --git a/stdafx.h b/stdafx.h new file mode 100644 index 0000000..bd6427f --- /dev/null +++ b/stdafx.h @@ -0,0 +1,33 @@ +// stdafx.h : include file for standard system include files, +// or project specific include files that are used frequently, but +// are changed infrequently +// + +#pragma once + +#define WINVER 0x0A00 +#define _WIN32_WINNT _WIN32_WINNT_VISTA +#define NTDDI_VERSION NTDDI_VISTA + +#include "build_config.h" + +#if defined(OS_WIN) +#define _WINSOCK_DEPRECATED_NO_WARNINGS 1 +//#include +#include + +#include +//#include +#include +#include +#include + + +#include +#else +#define override +#endif + +#include +#include + diff --git a/tunsafe_config.h b/tunsafe_config.h new file mode 100644 index 0000000..2f29472 --- /dev/null +++ b/tunsafe_config.h @@ -0,0 +1,9 @@ +#pragma once + +#define TUNSAFE_VERSION_STRING "TunSafe 1.3-rc3" + +#define WITH_HANDSHAKE_EXT 0 +#define WITH_SHORT_HEADERS 0 +#define WITH_HEADER_OBFUSCATION 0 +#define WITH_AVX512_OPTIMIZATIONS 0 +#define WITH_BENCHMARK 0 diff --git a/tunsafe_cpu.cpp b/tunsafe_cpu.cpp new file mode 100644 index 0000000..b1ee8cc --- /dev/null +++ b/tunsafe_cpu.cpp @@ -0,0 +1,68 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
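+//
+// Editor's note: runtime x86 feature detection via cpuid. A minimal sketch of
+// the intended call pattern (the AVX2 branch is hypothetical; WinMain in
+// tunsafe_win32.cpp only calls InitCpuFeatures() and, optionally,
+// PrintCpuFeatures()):
+//
+//   InitCpuFeatures();               // fills x86_pcap[] from cpuid leaves 1 and 7
+//   if (X86_PCAP_AVX2) {
+//     // select an AVX2 code path
+//   }
+//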
+#include "stdafx.h"
+#include "tunsafe_cpu.h"
+#include "tunsafe_types.h"
+
+#if defined(COMPILER_MSVC)
+#include <intrin.h>
+#endif
+
+#include <string.h>
+
+uint32 x86_pcap[3];
+
+#if !defined(COMPILER_MSVC)
+static inline void __cpuid(int info[4], int func) {
+  __asm__ __volatile__(
+    "cpuid"
+    : "=a"(info[0]), "=b"(info[1]), "=c"(info[2]), "=d"(info[3])
+    : "a"(func), "c"(0)
+  );
+}
+#endif
+
+void InitCpuFeatures() {
+  unsigned nIds, nExIds;
+
+  {
+    int info[4];
+    __cpuid(info, 0);
+    nIds = info[0];
+    __cpuid(info, 0x80000000);
+    nExIds = info[0];
+  }
+  if (nIds >= 0x00000001) {
+    int info[4];
+    __cpuid(info, 0x00000001);
+    x86_pcap[0] = info[3];
+    x86_pcap[1] = info[2];
+  }
+  if (nIds >= 0x00000007) {
+    int info[4];
+    __cpuid(info, 0x00000007);
+    x86_pcap[2] = info[1];
+  }
+}
+
+static char *strcpy_e(char *dst, char *end, const char *copy) {
+  size_t len = strlen(copy);
+  if (len >= (size_t)(end - dst)) return end;
+  memcpy(dst, copy, len + 1);
+  return dst + len;
+}
+
+void PrintCpuFeatures() {
+  char capbuf[2048], *end = capbuf + 2048, *s = capbuf;
+
+  if (X86_PCAP_AVX) s = strcpy_e(s, end, " avx");
+  if (X86_PCAP_SSSE3) s = strcpy_e(s, end, " ssse3");
+  if (X86_PCAP_AVX2) s = strcpy_e(s, end, " avx2");
+  if (X86_PCAP_MOVBE) s = strcpy_e(s, end, " movbe");
+  if (X86_PCAP_AES) s = strcpy_e(s, end, " aes");
+  if (X86_PCAP_PCLMULQDQ) s = strcpy_e(s, end, " pclmulqdq");
+  if (X86_PCAP_AVX512F) s = strcpy_e(s, end, " avx512f");
+  if (X86_PCAP_AVX512VL) s = strcpy_e(s, end, " avx512vl");
+
+  RINFO("Using:%s", capbuf);
+}
diff --git a/tunsafe_cpu.h b/tunsafe_cpu.h
new file mode 100644
index 0000000..de97b6c
--- /dev/null
+++ b/tunsafe_cpu.h
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
+#ifndef TUNSAFE_CPU_H_
+#define TUNSAFE_CPU_H_
+
+#include "tunsafe_types.h"
+
+extern uint32 x86_pcap[3];
+
+// cpuid 1, edx
+#define X86_PCAP_SSE (x86_pcap[0] & (1 << 25))
+#define X86_PCAP_SSE2 (x86_pcap[0] & (1 << 26))
+// cpuid 1, ecx
+#define X86_PCAP_SSE3 (x86_pcap[1] & (1 << 0))
+#define X86_PCAP_PCLMULQDQ (x86_pcap[1] & (1 << 1))
+#define X86_PCAP_SSSE3 (x86_pcap[1] & (1 << 9))
+#define X86_PCAP_MOVBE (x86_pcap[1] & (1 << 22))
+#define X86_PCAP_AES (x86_pcap[1] & (1 << 25))
+#define X86_PCAP_AVX (x86_pcap[1] & (1 << 28))
+// cpuid 7, ebx
+#define X86_PCAP_AVX2 (x86_pcap[2] & (1 << 5))
+#define X86_PCAP_AVX512F (x86_pcap[2] & (1 << 16))
+#define X86_PCAP_AVX512VL (x86_pcap[2] & (1 << 31))
+
+void InitCpuFeatures();
+void PrintCpuFeatures();
+
+#endif  // TUNSAFE_CPU_H_
\ No newline at end of file
diff --git a/tunsafe_endian.h b/tunsafe_endian.h
new file mode 100644
index 0000000..32bce5e
--- /dev/null
+++ b/tunsafe_endian.h
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
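+//
+// Editor's note: ToBE*/ToLE* convert scalar values; ReadBE*/WriteBE* load and
+// store through pointers. A small sketch, assuming a hypothetical 4-byte
+// big-endian length field at the start of |buf| (note the macros cast to
+// uint32*, so the pointer should be suitably aligned on strict platforms):
+//
+//   uint8 buf[4];
+//   WriteBE32(buf, 0x11223344);   // stores bytes 0x11 0x22 0x33 0x44
+//   uint32 v = ReadBE32(buf);     // v == 0x11223344 on any host endianness
+//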
+#ifndef TINYVPN_ENDIAN_H_ +#define TINYVPN_ENDIAN_H_ + +#include "build_config.h" +#include "tunsafe_types.h" +#if defined(OS_WIN) && defined(COMPILER_MSVC) +#include +#endif +#include + +#define ByteSwap32Fallback(x) ( \ + (((uint32)(x) & (uint32)0x000000fful) << 24) | \ + (((uint32)(x) & (uint32)0x0000ff00ul) << 8) | \ + (((uint32)(x) & (uint32)0x00ff0000ul) >> 8) | \ + (((uint32)(x) & (uint32)0xff000000ul) >> 24)) + +#define ByteSwap16Fallback(x) ((uint16)( \ + (((uint16)(x) & (uint16)0x00ffu) << 8) | \ + (((uint16)(x) & (uint16)0xff00u) >> 8))) + +#define ByteSwap64Fallback(x) ((uint64)ByteSwap32Fallback(x)<<32 | ByteSwap32Fallback(x>>32)) + +#define ReadBE32AlignedFallback(pt) (((uint32)((pt)[0] & 0xFF) << 24) ^ \ + ((uint32)((pt)[1] & 0xFF) << 16) ^ \ + ((uint32)((pt)[2] & 0xFF) << 8) ^ \ + ((uint32)((pt)[3] & 0xFF))) +#define WriteBE32AlignedFallback(ct, st) { \ + (ct)[0] = (char)((st) >> 24); \ + (ct)[1] = (char)((st) >> 16); \ + (ct)[2] = (char)((st) >> 8); \ + (ct)[3] = (char)(st); } + + + + +#if defined(OS_WIN) && defined(COMPILER_MSVC) +#define ByteSwap16(x) _byteswap_ushort((uint16)x) +#define ByteSwap32(x) _byteswap_ulong((uint32)x) +#define ByteSwap64(x) _byteswap_uint64((uint64)x) +#elif defined(COMPILER_GCC) +#define ByteSwap16(x) __builtin_bswap16((uint16)x) +#define ByteSwap32(x) __builtin_bswap32((uint32)x) +#define ByteSwap64(x) __builtin_bswap64((uint64)x) +#else +#define ByteSwap16 ByteSwap16Fallback +#define ByteSwap32 ByteSwap32Fallback +#define ByteSwap64 ByteSwap64Fallback +#endif + +#if defined(ARCH_CPU_LITTLE_ENDIAN) +#define ToBE64(x) ByteSwap64(x) +#define ToBE32(x) ByteSwap32(x) +#define ToBE16(x) ByteSwap16(x) +#define ToLE64(x) (x) +#define ToLE32(x) (x) +#define ToLE16(x) (x) +#else +#define ToBE64(x) (x) +#define ToBE32(x) (x) +#define ToBE16(x) (x) +#define ToLE64(x) ByteSwap64(x) +#define ToLE32(x) ByteSwap32(x) +#define ToLE16(x) ByteSwap16(x) +#endif + +#define ReadBE16Aligned(pt) ToBE16(*(uint16*)(pt)) +#define WriteBE16Aligned(ct, st) (*(uint16*)(ct) = ToBE16(st)) +#define ReadBE32Aligned(pt) ToBE32(*(uint32*)(pt)) +#define WriteBE32Aligned(ct, st) (*(uint32*)(ct) = ToBE32(st)) + +#define ReadBE16(pt) ToBE16(*(uint16*)(pt)) +#define WriteBE16(ct, st) (*(uint16*)(ct) = ToBE16(st)) +#define ReadBE32(pt) ToBE32(*(uint32*)(pt)) +#define WriteBE32(ct, st) (*(uint32*)(ct) = ToBE32(st)) +#define ReadBE64(pt) ToBE64(*(uint64*)(pt)) +#define WriteBE64(ct, st) (*(uint64*)(ct) = ToBE64(st)) + +#define ReadLE16(pt) ToLE16(*(uint16*)(pt)) +#define WriteLE16(ct, st) (*(uint16*)(ct) = ToLE16(st)) +#define ReadLE32(pt) ToLE32(*(uint32*)(pt)) +#define WriteLE32(ct, st) (*(uint32*)(ct) = ToLE32(st)) +#define ReadLE64(pt) ToLE64(*(uint64*)(pt)) +#define WriteLE64(ct, st) (*(uint64*)(ct) = ToLE64(st)) + +#define Read16(pt) (*(uint16*)(pt)) +#define Write16(ct, st) (*(uint16*)(ct) = (st)) +#define Read32(pt) (*(uint32*)(pt)) +#define Write32(ct, st) (*(uint32*)(ct) = (st)) +#define Read64(pt) (*(uint64*)(pt)) +#define Write64(ct, st) (*(uint64*)(ct) = (st)) + + +#endif // TINYVPN_ENDIAN_H_ diff --git a/tunsafe_types.h b/tunsafe_types.h new file mode 100644 index 0000000..9ddabab --- /dev/null +++ b/tunsafe_types.h @@ -0,0 +1,73 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
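+//
+// Editor's note: STATIC_ASSERT below is the classic pre-C++11 trick: it
+// declares a struct containing a bitfield of width !!(cond), which becomes an
+// ill-formed zero-width named bitfield exactly when |cond| is false. Usage
+// sketch (hypothetical assertion):
+//
+//   STATIC_ASSERT(sizeof(uint64) == 8, uint64_must_be_8_bytes);
+//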
+#ifndef TINYVPN_TYPES_H_ +#define TINYVPN_TYPES_H_ +#include + +#include "build_config.h" +#include "tunsafe_config.h" + + +typedef uint8_t byte; +typedef uint8_t uint8; +typedef uint16_t uint16; +typedef uint32_t uint32; +typedef uint64_t uint64; +typedef int64_t int64; +typedef int8_t int8; +typedef int16_t int16; +typedef int32_t int32; + +typedef unsigned int in_addr_t; + +#define CTASTR2(pre,post) pre ## post +#define CTASTR(pre,post) CTASTR2(pre,post) +#define STATIC_ASSERT(cond,msg) \ + typedef struct { int CTASTR(static_assertion_failed_,msg) : !!(cond); } \ + CTASTR(static_assertion_failed_x_,msg) + +#ifndef ARRAY_SIZE +#define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0])) +#endif + +void printhex(const char *name, const void *a, size_t l); + +#if defined(COMPILER_MSVC) +#define FORCEINLINE __forceinline +#define NOINLINE __declspec(noinline) +#define SAFEBUFFERS __declspec(safebuffers) +#define __aligned(x) __declspec(align(x)) +#define rol32 _rotl +#define rol64 _rotl64 +#elif defined(COMPILER_GCC) +#define FORCEINLINE inline __attribute__((always_inline)) +#define NOINLINE +#define SAFEBUFFERS +#define _stricmp strcasecmp +#define _strdup strdup +#define _cdecl +#define __aligned(x) __attribute__((__aligned__(x))) +#else +#define FORCEINLINE inline +#define NOINLINE +#define SAFEBUFFERS +#define __aligned(x) +#endif + +#define likely(x) (x) +#define unlikely(x) (x) + +#if !defined(COMPILER_MSVC) +static inline uint64 rol64(uint64 x, int8_t r) { + return (x << r) | (x >> (64 - r)); +} +static inline uint32 rol32(uint32 x, int8_t r) { + return (x << r) | (x >> (32 - r)); +} +#endif // !defined(COMPILER_MSVC) + +void RERROR(const char *msg, ...); +void RINFO(const char *msg, ...); + + +#endif // TINYVPN_TYPES_H_ diff --git a/tunsafe_win32.cpp b/tunsafe_win32.cpp new file mode 100644 index 0000000..846ce28 --- /dev/null +++ b/tunsafe_win32.cpp @@ -0,0 +1,1143 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
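+//
+// Editor's note: UI state is persisted under HKEY_CURRENT_USER\Software\TunSafe
+// through the small RegReadInt/RegWriteInt/RegReadStr/RegWriteStr helpers
+// defined below. Sketch of the pattern used throughout this file:
+//
+//   g_allow_pre_post = RegReadInt("AllowPrePost", 0) != 0;  // read with default
+//   RegWriteInt("IsConnected", 1);                          // persist a flag
+//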
+#include "stdafx.h" +#include "wireguard_config.h" +#include "network_win32_api.h" +#include "network_win32_dnsblock.h" +#include +#include +#include +#include +#include +#include "resource.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include "tunsafe_endian.h" +#include "util.h" +#include +#include +#include "crypto/curve25519-donna.h" + +#undef min +#pragma comment(lib, "iphlpapi.lib") +#pragma comment(lib, "rpcrt4.lib") +#pragma comment(lib,"comctl32.lib") +#pragma comment(linker,"/manifestdependency:\"type='win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'\"") + +void InitCpuFeatures(); +void PrintCpuFeatures(); +void Benchmark(); +static const char *GetCurrentConfigTitle(char *buf, size_t max_size); + +#pragma warning(disable: 4200) + +static void MyPostMessage(int msg, WPARAM wparam, LPARAM lparam); + +static HWND g_ui_window; +static in_addr_t g_ui_ip; +static HICON g_icons[2]; +static bool g_minimize_on_connect; + +static bool g_ui_visible; +static char *g_current_filename; +static HKEY g_reg_key; +static HINSTANCE g_hinstance; +static TunsafeBackendWin32 *g_backend; +static bool g_last_popup_is_tray; + +int RegReadInt(const char *key, int def) { + DWORD value = def, n = sizeof(value); + RegQueryValueEx(g_reg_key, key, NULL, NULL, (BYTE*)&value, &n); + return value; +} + +void RegWriteInt(const char *key, int value) { + RegSetValueEx(g_reg_key, key, NULL, REG_DWORD, (BYTE*)&value, sizeof(value)); +} + +char *RegReadStr(const char *key, const char *def) { + char buf[1024]; + DWORD n = sizeof(buf) - 1; + DWORD type = 0; + if (RegQueryValueEx(g_reg_key, key, NULL, &type, (BYTE*)buf, &n) != ERROR_SUCCESS || type != REG_SZ) + return def ? 
_strdup(def) : NULL; + if (n && buf[n - 1] == 0) + n--; + buf[n] = 0; + return _strdup(buf); +} + +void RegWriteStr(const char *key, const char *v) { + RegSetValueEx(g_reg_key, key, NULL, REG_SZ, (BYTE*)v, (DWORD)strlen(v) + 1); +} + +void str_set(char **x, const char *s) { + free(*x); + *x = _strdup(s); +} + +char *str_cat_alloc(const char *a, const char *b) { + size_t al = strlen(a); + size_t bl = strlen(b); + char *r = (char *)malloc(al + bl + 1); + memcpy(r, a, al); + r[al + bl] = 0; + memcpy(r + al, b, bl); + return r; +} + +static const char *FindLastFolderSep(const char *s) { + size_t len = strlen(s); + for (;;) { + if (len == 0) + return NULL; + len--; + if (s[len] == '\\' || s[len] == '/') + break; + } + return s + len; +} + + +static bool GetConfigFullName(const char *basename, char *fullname, size_t fullname_size) { + size_t len = strlen(basename); + + if (FindLastFolderSep(basename)) { + if (len >= fullname_size) + return false; + memcpy(fullname, basename, len + 1); + return true; + } + if (!GetModuleFileName(NULL, fullname, (DWORD)fullname_size)) + return false; + char *last = (char *)FindLastFolderSep(fullname); + if (!last || last + len + 8 >= fullname + fullname_size) + return false; + memcpy(last + 1, "Config\\", 7 * sizeof(last[0])); + memcpy(last + 8, basename, (len + 1) * sizeof(last[0])); + return true; +} + + +enum UpdateIconWhy { + UIW_NONE = 0, + UIW_STOPPED_WORKING_FAIL = 1, + UIW_STOPPED_WORKING_RETRY = 2, + UIW_EXITING = 3, +}; +static void UpdateIcon(UpdateIconWhy error); +static void UpdateButtons(); + + +void StopService(UpdateIconWhy error) { + if (g_backend->is_started()) { + g_backend->Stop(); + + g_ui_ip = 0; + + if (error != UIW_EXITING) { + UpdateIcon(error); + RINFO("Disconnecting"); + UpdateButtons(); + RegWriteInt("IsConnected", 0); + } + } +} + +const char *print_ip(char buf[kSizeOfAddress], in_addr_t ip) { + snprintf(buf, kSizeOfAddress, "%d.%d.%d.%d", (ip >> 24) & 0xff, (ip >> 16) & 0xff, (ip >> 8) & 0xff, (ip >> 0) & 0xff); + return buf; +} + +class MyProcessorDelegate : public ProcessorDelegate { +public: + virtual void OnConnected(in_addr_t my_ip) { + if (my_ip != g_ui_ip) { + + if (my_ip) { + char buf[kSizeOfAddress]; + print_ip(buf, my_ip); + RINFO("Connection established. 
IP %s", buf); + } + g_ui_ip = my_ip; + MyPostMessage(WM_USER + 2, 0, 0); + } + } + virtual void OnDisconnected() { + MyProcessorDelegate::OnConnected(0); + } +}; + +static MyProcessorDelegate my_procdel; + +void StartService(bool skip_clear = false) { + char buf[1024]; + if (!GetConfigFullName(g_current_filename, buf, ARRAYSIZE(buf))) + return; + + if (!g_backend->is_started()) { + if (!skip_clear) + PostMessage(g_ui_window, WM_USER + 6, NULL, NULL); + + g_backend->Start(&my_procdel, buf); + + UpdateButtons(); + RegWriteInt("IsConnected", 1); + } +} + +static bool g_has_icon; + +static char *PrintMB(char *buf, int64 bytes) { + char *bo = buf; + if (bytes < 0) { + *buf++ = '-'; + bytes = -bytes; + } + int64 big = bytes / (1024*1024); + int little = bytes % (1024*1024); + if (bytes < 10*1024*1024) { + // X.XXX + snprintf(buf, 64, "%lld.%.3d MB", big, 1000 * little / (1024*1024)); + } else if (bytes < 100*1024*1024) { + // XX.XX + snprintf(buf, 64, "%lld.%.2d MB", big, 100 * little / (1024*1024)); + } else { + // XX.X + snprintf(buf, 64, "%lld.%.1d MB", big, 10 * little / (1024*1024)); + } + return bo; +} + +static void UpdateStats() { + ProcessorStats stats = g_backend->GetStats(); + + char tmp[64], tmp2[64]; + char buf[512]; + snprintf(buf, 512, "%s received (%lld packets), %s sent (%lld packets)", + PrintMB(tmp, stats.udp_bytes_in), stats.udp_packets_in, + PrintMB(tmp2, stats.udp_bytes_out), stats.udp_packets_out/*, udp_qsize2 - udp_qsize1, g_tun_reads*/); + SetDlgItemText(g_ui_window, IDTXT_UDP, buf); + + snprintf(buf, 512, "%s received (%lld packets), %s sent (%lld packets)", + PrintMB(tmp, stats.tun_bytes_in), stats.tun_packets_in, + PrintMB(tmp2, stats.tun_bytes_out), stats.tun_packets_out/*, + tpq_last_qsize, g_tun_writes*/); + SetDlgItemText(g_ui_window, IDTXT_TUN, buf); + + char *d = buf; + if (stats.last_complete_handskake_timestamp) { + uint32 ago = (uint32)((OsGetMilliseconds() - stats.last_complete_handskake_timestamp) / 1000); + uint32 hours = ago / 3600; + uint32 minutes = (ago - hours * 3600) / 60; + uint32 seconds = (ago - hours * 3600 - minutes * 60); + + if (hours) + d += snprintf(d, 32, hours == 1 ? "%d hour, " : "%d hours, ", hours); + if (minutes) + d += snprintf(d, 32, minutes == 1 ? "%d minute, " : "%d minutes, ", minutes); + if (d == buf || seconds) + d += snprintf(d, 32, seconds == 1 ? "%d second, " : "%d seconds, ", seconds); + memcpy(d - 2, " ago", 5); + } else { + memcpy(buf, "(never)", 8); + } + SetDlgItemText(g_ui_window, IDTXT_HANDSHAKE, buf); +} + +void UpdatePublicKey(char *s) { + SetDlgItemText(g_ui_window, IDC_PUBLIC_KEY, s); + free(s); +} + +static void UpdateButtons() { + bool running = g_backend->is_started(); + SetDlgItemText(g_ui_window, ID_START, running ? "Re&connect" : "&Connect"); + EnableWindow(GetDlgItem(g_ui_window, ID_STOP), running); +} + +static void UpdateIcon(UpdateIconWhy why) { + in_addr_t ip = g_ui_ip; + NOTIFYICONDATA nid; + memset(&nid, 0, sizeof(nid)); + nid.cbSize = sizeof(nid); + nid.hWnd = g_ui_window; + nid.uID = 1; + nid.uVersion = NOTIFYICON_VERSION; + nid.uCallbackMessage = WM_USER + 1; + nid.uFlags = NIF_MESSAGE | NIF_TIP | NIF_ICON; + nid.hIcon = g_icons[ip ? 
0 : 1]; + + char buf[kSizeOfAddress]; + char namebuf[64]; + if (ip != 0) { + snprintf(nid.szTip, sizeof(nid.szTip), "TunSafe [%s - %s]", GetCurrentConfigTitle(namebuf, sizeof(namebuf)), print_ip(buf, ip)); + nid.uFlags |= NIF_INFO; + snprintf(nid.szInfoTitle, sizeof(nid.szInfoTitle), "Connected to: %s", namebuf); + snprintf(nid.szInfo, sizeof(nid.szInfo), "IP: %s", buf); + nid.uTimeout = 5000; + nid.dwInfoFlags = NIIF_INFO; + } else { + snprintf(nid.szTip, sizeof(nid.szTip), "TunSafe [%s]", "Disconnected"); + + if (why == UIW_STOPPED_WORKING_FAIL) { + nid.uFlags |= NIF_INFO; + strcpy(nid.szInfoTitle, "Disconnected!"); + strcpy(nid.szInfo, "There was a problem with the connection. You are now disconnected."); + nid.uTimeout = 5000; + nid.dwInfoFlags = NIIF_ERROR; + } + } + Shell_NotifyIcon(g_has_icon ? NIM_MODIFY : NIM_ADD, &nid); + + SendMessage(g_ui_window, WM_SETICON, ICON_SMALL, (LPARAM)g_icons[ip ? 0 : 1]); + + g_has_icon = true; +} + +static void RemoveIcon() { + if (g_has_icon) { + NOTIFYICONDATA nid; + memset(&nid, 0, sizeof(nid)); + nid.cbSize = sizeof(nid); + nid.hWnd = g_ui_window; + nid.uID = 1; + Shell_NotifyIcon(NIM_DELETE, &nid); + } +} + +#define MAX_CONFIG_FILES 100 +#define ID_POPUP_CONFIG_FILE 10000 +char *config_filenames[MAX_CONFIG_FILES]; + +static void RestartService(UpdateIconWhy why, bool only_if_active) { + if (!only_if_active || g_backend->is_started()) { + StopService(why); + StartService(why != UIW_NONE); + } +} + +static char *StripConfExtension(const char *src, char *target, size_t size) { + size_t len = strlen(src); + if (len >= 5 && memcmp(src + len - 5, ".conf", 5) == 0) + len -= 5; + + len = std::min(len, size - 1); + target[len] = 0; + memcpy(target, src, len); + return target; +} + +static const char *GetCurrentConfigTitle(char *target, size_t size) { + const char *ll = FindLastFolderSep(g_current_filename); + return StripConfExtension(ll ? ll + 1 : g_current_filename, target, size); +} + +static void LoadConfigFile(const char *filename, bool save, bool force_start) { + str_set(&g_current_filename, filename); + char namebuf[64]; + char *f = str_cat_alloc("TunSafe VPN Client - ", GetCurrentConfigTitle(namebuf, sizeof(namebuf))); + SetWindowText(g_ui_window, f); + free(f); + RestartService(UIW_NONE, !force_start); + if (save) + RegWriteStr("ConfigFile", filename); +} + +static void AddToAvailableFilesPopup(HMENU menu, int max_num_items, bool is_settings) { + char buf[1024]; + int nfiles = 0; + if (!GetConfigFullName("*.*", buf, ARRAYSIZE(buf))) + return; + + int selected_item = -1; + WIN32_FIND_DATA wfd; + HANDLE handle = FindFirstFile(buf, &wfd); + if (handle != INVALID_HANDLE_VALUE) { + do { + if (wfd.cFileName[0] == '.') + continue; + + if (strcmp(g_current_filename, wfd.cFileName) == 0) + selected_item = nfiles; + + str_set(&config_filenames[nfiles], wfd.cFileName); + + nfiles++; + if (nfiles == MAX_CONFIG_FILES) + break; + } while (FindNextFile(handle, &wfd)); + FindClose(handle); + } + + HMENU where; + + bool is_connected = g_backend->is_started(); + + where = menu; + for (int i = 0; i < nfiles; i++) { + if (i == max_num_items) { + where = CreatePopupMenu(); + AppendMenu(menu, MF_POPUP, (UINT_PTR)where, "&More"); + } + + AppendMenu(where, (i == selected_item && is_connected) ? 
MF_CHECKED : 0, ID_POPUP_CONFIG_FILE + i, StripConfExtension(config_filenames[i], buf, sizeof(buf))); + + if (i == selected_item) + SetMenuDefaultItem(where, ID_POPUP_CONFIG_FILE + i, MF_BYCOMMAND); + } + if (nfiles) + AppendMenu(menu, MF_SEPARATOR, 0, 0); +} + +static void ShowSettingsMenu(HWND wnd) { + HMENU menu = CreatePopupMenu(); + + AddToAvailableFilesPopup(menu, 10, true); + + AppendMenu(menu, 0, IDSETT_OPEN_FILE, "&Import File..."); + AppendMenu(menu, 0, IDSETT_BROWSE_FILES, "&Browse in Explorer"); + + AppendMenu(menu, MF_SEPARATOR, 0, 0); + AppendMenu(menu, 0, IDSETT_KEYPAIR, "Generate &Key Pair..."); + AppendMenu(menu, MF_SEPARATOR, 0, 0); + + HMENU blockinternet = CreatePopupMenu(); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_OFF, "Off"); + AppendMenu(blockinternet, MF_SEPARATOR, 0, 0); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_ROUTE, "Yes, with Routing Rules"); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_FIREWALL, "Yes, with Firewall Rules"); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_BOTH, "Yes, Both Methods"); + bool is_activated = false; + int value = GetInternetBlockState(&is_activated); + CheckMenuRadioItem(blockinternet, IDSETT_BLOCKINTERNET_OFF, IDSETT_BLOCKINTERNET_BOTH, IDSETT_BLOCKINTERNET_OFF + value, MF_BYCOMMAND); + AppendMenu(menu, MF_POPUP + is_activated * MF_CHECKED, (UINT_PTR)blockinternet, "Block &All Internet Traffic"); + + if (g_allow_pre_post || GetAsyncKeyState(VK_SHIFT) < 0) { + AppendMenu(menu, g_allow_pre_post ? MF_CHECKED : 0, IDSETT_PREPOST, "&Allow Pre/Post commands"); + } + + AppendMenu(menu, MF_SEPARATOR, 0, 0); + AppendMenu(menu, 0, IDSETT_WEB_PAGE, "Go to &Web Page"); + AppendMenu(menu, 0, IDSETT_OPENSOURCE, "See Open Source Licenses"); + AppendMenu(menu, 0, IDSETT_ABOUT, "&About TunSafe..."); + + POINT pt; + GetCursorPos(&pt); + + g_last_popup_is_tray = false; + int rv = TrackPopupMenu(menu, 0, pt.x, pt.y, 0, wnd, NULL); + DestroyMenu(menu); +} + +void FindDesktopFolderView(REFIID riid, void **ppv) { + CComPtr spShellWindows; + spShellWindows.CoCreateInstance(CLSID_ShellWindows); + + CComVariant vtLoc(CSIDL_DESKTOP); + CComVariant vtEmpty; + long lhwnd; + CComPtr spdisp; + spShellWindows->FindWindowSW( + &vtLoc, &vtEmpty, + SWC_DESKTOP, &lhwnd, SWFO_NEEDDISPATCH, &spdisp); + + CComPtr spBrowser; + CComQIPtr(spdisp)-> + QueryService(SID_STopLevelBrowser, + IID_PPV_ARGS(&spBrowser)); + + CComPtr spView; + spBrowser->QueryActiveShellView(&spView); + + spView->QueryInterface(riid, ppv); +} + +void GetDesktopAutomationObject(REFIID riid, void **ppv) { + CComPtr spsv; + FindDesktopFolderView(IID_PPV_ARGS(&spsv)); + CComPtr spdispView; + spsv->GetItemObject(SVGIO_BACKGROUND, IID_PPV_ARGS(&spdispView)); + spdispView->QueryInterface(riid, ppv); +} + +void ShellExecuteFromExplorer( + PCSTR pszFile, + PCSTR pszParameters = nullptr, + PCSTR pszDirectory = nullptr, + PCSTR pszOperation = nullptr, + int nShowCmd = SW_SHOWNORMAL) { + CComPtr spFolderView; + GetDesktopAutomationObject(IID_PPV_ARGS(&spFolderView)); + CComPtr spdispShell; + spFolderView->get_Application(&spdispShell); + + CComQIPtr(spdispShell) + ->ShellExecute(CComBSTR(pszFile), + CComVariant(pszParameters ? pszParameters : ""), + CComVariant(pszDirectory ? pszDirectory : ""), + CComVariant(pszOperation ? 
pszOperation : ""), + CComVariant(nShowCmd)); +} + +static void OpenEditor() { + char buf[MAX_PATH]; + if (GetConfigFullName(g_current_filename, buf, ARRAYSIZE(buf))) { + SHELLEXECUTEINFO shinfo = {0}; + shinfo.cbSize = sizeof(shinfo); + shinfo.fMask = SEE_MASK_CLASSNAME; + shinfo.lpFile = buf; + shinfo.lpParameters = ""; + shinfo.lpClass = ".txt"; + shinfo.nShow = SW_SHOWNORMAL; + ShellExecuteEx(&shinfo); + } +} + +static void BrowseFiles() { + char buf[MAX_PATH]; + if (GetConfigFullName("", buf, ARRAYSIZE(buf))) { + size_t l = strlen(buf); + buf[l - 1] = 0; + ShellExecuteFromExplorer(buf, NULL, NULL, "explore"); + } +} + +bool FileExists(const CHAR *fileName) { + DWORD fileAttr = GetFileAttributes(fileName); + return (0xFFFFFFFF != fileAttr); +} + +__int64 FileSize(const char* name) { + WIN32_FILE_ATTRIBUTE_DATA fad; + if (!GetFileAttributesEx(name, GetFileExInfoStandard, &fad)) + return -1; // error condition, could call GetLastError to find out more + LARGE_INTEGER size; + size.HighPart = fad.nFileSizeHigh; + size.LowPart = fad.nFileSizeLow; + return size.QuadPart; +} + +static bool is_space(uint8_t c) { + return c == ' ' || c == '\r' || c == '\n' || c == '\t'; +} + +static bool is_valid(uint8_t c) { + return c >= ' ' || c == '\r' || c == '\n' || c == '\t'; +} + +bool SanityCheckBuf(uint8 *buf, size_t n) { + for (size_t i = 0; i < n; i++) { + if (!is_space(buf[i])) { + if (buf[i] != '[' && buf[i] != '#') + return false; + for (; i < n; i++) + if (!is_valid(buf[i])) + return false; + return true; + } + } + return false; +} + +uint8* LoadFileSane(const char *name, size_t *size) { + FILE *f = fopen(name, "rb"); + uint8 *new_file = NULL, *file = NULL; + size_t j, i, n; + if (!f) return false; + fseek(f, 0, SEEK_END); + long x = ftell(f); + fseek(f, 0, SEEK_SET); + if (x < 0 || x >= 65536) goto error; + file = (uint8*)malloc(x + 1); + if (!file) goto error; + n = fread(file, 1, x + 1, f); + if (n != x || !SanityCheckBuf(file, n)) + goto error; + // Convert the file to DOS new lines + for (i = j = 0; i < n; i++) + j += (file[i] == '\n'); + new_file = (uint8*)malloc(n + 1 + j); + if (!new_file) goto error; + for (i = j = 0; i < n; i++) { + uint8 c = file[i]; + if (c == '\r') + continue; + if (c == '\n') + new_file[j++] = '\r'; + new_file[j++] = c; + } + new_file[j] = 0; + *size = j; + +error: + fclose(f); + free(file); + return new_file; +} + +bool WriteOutFile(const char *filename, uint8 *filedata, size_t filesize) { + FILE *f = fopen(filename, "wb"); + if (!f) return false; + if (fwrite(filedata, 1, filesize, f) != filesize) { + fclose(f); + return false; + } + fclose(f); + return true; +} + +void ImportFile(const char *s) { + char buf[1024]; + char mesg[1024]; + size_t filesize; + const char *last = FindLastFolderSep(s); + if (!last || !GetConfigFullName(last + 1, buf, ARRAYSIZE(buf)) || _stricmp(buf, s) == 0) + return; + + uint8 *filedata = LoadFileSane(s, &filesize); + if (!filedata) goto fail; + + if (FileExists(buf)) { + snprintf(mesg, ARRAYSIZE(mesg), "A file already exists with the name '%s' in the configuration folder. 
Do you want to overwrite it?", last + 1); + if (MessageBoxA(g_ui_window, mesg, "TunSafe", MB_OKCANCEL | MB_ICONEXCLAMATION) != IDOK) + goto out; + } else { + snprintf(mesg, ARRAYSIZE(mesg), "Do you want to import '%s' into TunSafe?", last + 1); + if (MessageBoxA(g_ui_window, mesg, "TunSafe", MB_OKCANCEL | MB_ICONQUESTION) != IDOK) + goto out; + } + + if (!WriteOutFile(buf, filedata, filesize)) { + DeleteFileA(buf); +fail: + MessageBoxA(g_ui_window, "There was a problem importing the file.", "TunSafe", MB_ICONEXCLAMATION); + } else { + LoadConfigFile(last + 1, true, false); + } + +out: + free(filedata); +} + +void ShowUI(HWND hWnd) { + g_ui_visible = true; + UpdateStats(); + ShowWindow(hWnd, SW_SHOW); + BringWindowToTop(hWnd); + SetForegroundWindow(hWnd); +} + +void HandleDroppedFiles(HWND wnd, HDROP hdrop) { + char buf[MAX_PATH]; + if (DragQueryFile(hdrop, -1, NULL, 0) == 1) { + if (DragQueryFile(hdrop, 0, buf, ARRAYSIZE(buf))) { + SetForegroundWindow(wnd); + ImportFile(buf); + } + } + DragFinish(hdrop); +} + +void BrowseFile(HWND wnd) { + char szFile[1024]; + + // open a file name + OPENFILENAME ofn = {0}; + ofn.lStructSize = sizeof(ofn); + ofn.hwndOwner = g_ui_window; + ofn.lpstrFile = szFile; + ofn.lpstrFile[0] = '\0'; + ofn.nMaxFile = sizeof(szFile); + ofn.lpstrFilter = "Config Files (*.conf)\0*.conf\0"; + ofn.nFilterIndex = 1; + ofn.lpstrFileTitle = NULL; + ofn.nMaxFileTitle = 0; + ofn.lpstrInitialDir = NULL; + ofn.Flags = OFN_PATHMUSTEXIST | OFN_FILEMUSTEXIST; + if (GetOpenFileName(&ofn)) + ImportFile(szFile); +} + +static const uint8 kCurve25519Basepoint[32] = {9}; + +static void SetKeyBox(HWND wnd, int ctr, uint8 buf[32]) { + uint8 *privs = base64_encode(buf, 32, NULL); + SetDlgItemText(wnd, ctr, (char*)privs); + free(privs); +} + +static INT_PTR WINAPI KeyPairDlgProc(HWND hWnd, UINT message, WPARAM wParam, + LPARAM lParam) { + switch (message) { + case WM_INITDIALOG: + return TRUE; + case WM_CLOSE: + EndDialog(hWnd, 0); + return TRUE; + case WM_COMMAND: + switch (wParam) { + case IDCANCEL: + EndDialog(hWnd, 0); + return TRUE; + case IDC_PRIVATE_KEY | (EN_CHANGE << 16) : { + char buf[128]; + uint8 pub[32]; + uint8 priv[32]; + buf[0] = 0; + size_t len = GetDlgItemText(hWnd, IDC_PRIVATE_KEY, buf, sizeof(buf)); + size_t olen = 32; + if (base64_decode((uint8*)buf, len, priv, &olen) && olen == 32) { + curve25519_donna(pub, priv, kCurve25519Basepoint); + SetKeyBox(hWnd, IDC_PUBLIC_KEY, pub); + } else { + SetDlgItemText(hWnd, IDC_PUBLIC_KEY, "(Invalid Private Key)"); + } + + return TRUE; + } + case IDRAND: { + uint8 priv[32]; + uint8 pub[32]; + OsGetRandomBytes(priv, 32); + curve25519_normalize(priv); + curve25519_donna(pub, priv, kCurve25519Basepoint); + SetKeyBox(hWnd, IDC_PRIVATE_KEY, priv); + SetKeyBox(hWnd, IDC_PUBLIC_KEY, pub); + return TRUE; + } + } + } + return FALSE; +} + +bool wm_dropfiles_recursive; +uint64 last_auto_service_restart; +static INT_PTR WINAPI DlgProc(HWND hWnd, UINT message, WPARAM wParam, + LPARAM lParam) { + switch(message) { + case WM_INITDIALOG: + return TRUE; + case WM_CLOSE: + g_ui_visible = false; + ShowWindow(hWnd, SW_HIDE); + return TRUE; + case WM_COMMAND: + if (wParam >= ID_POPUP_CONFIG_FILE && wParam < ID_POPUP_CONFIG_FILE + MAX_CONFIG_FILES) { + const char *new_conf = config_filenames[wParam - ID_POPUP_CONFIG_FILE]; + if (!new_conf) + return TRUE; + + if (g_last_popup_is_tray && strcmp(new_conf, g_current_filename) == 0 && g_backend->is_started()) { + StopService(UIW_NONE); + } else { + LoadConfigFile(new_conf, true, g_last_popup_is_tray); + } + + + 
return TRUE; + } + switch(wParam) { + case ID_START: + StopService(UIW_NONE); + StartService(); + break; + case ID_STOP: StopService(UIW_NONE); break; + case ID_EXIT: PostQuitMessage(0); break; + case ID_RESET: g_backend->ResetStats(); break; + case ID_MORE_BUTTON: ShowSettingsMenu(hWnd); break; + case IDSETT_WEB_PAGE: ShellExecute(NULL, NULL, "https://tunsafe.com/", NULL, NULL, 0); break; + case IDSETT_OPENSOURCE: ShellExecute(NULL, NULL, "https://tunsafe.com/open-source", NULL, NULL, 0); break; + case ID_EDITCONF: OpenEditor(); break; + case IDSETT_BROWSE_FILES:BrowseFiles(); break; + case IDSETT_OPEN_FILE: BrowseFile(hWnd); break; + case IDSETT_ABOUT: + MessageBoxA(g_ui_window, TUNSAFE_VERSION_STRING "\r\n\r\nCopyright © 2018, Ludvig Strigeus\r\n\r\nThanks for choosing TunSafe!\r\n\r\nThis version was built on " __DATE__ " " __TIME__, "About TunSafe", MB_ICONINFORMATION); + break; + case IDSETT_KEYPAIR: + DialogBox(g_hinstance, MAKEINTRESOURCE(IDD_DIALOG2), hWnd, &KeyPairDlgProc); + break; + case IDSETT_BLOCKINTERNET_OFF: + case IDSETT_BLOCKINTERNET_ROUTE: + case IDSETT_BLOCKINTERNET_FIREWALL: + case IDSETT_BLOCKINTERNET_BOTH: { + InternetBlockState old_state = GetInternetBlockState(NULL); + InternetBlockState new_state = (InternetBlockState)(wParam - IDSETT_BLOCKINTERNET_OFF); + + if (old_state == kBlockInternet_Off && new_state != kBlockInternet_Off) { + if (MessageBoxA(g_ui_window, "Warning! All Internet traffic will be blocked until you restart your computer. Only traffic through TunSafe will be allowed.\r\n\r\nThe blocking is activated the next time you connect to a VPN server.\r\n\r\nDo you want to continue?", "TunSafe", MB_ICONWARNING | MB_OKCANCEL) == IDCANCEL) + return TRUE; + } + + SetInternetBlockState(new_state); + + if ((~old_state & new_state) && g_backend->is_started()) { + StopService(UIW_NONE); + StartService(); + } + return TRUE; + } + case IDSETT_PREPOST: { + g_allow_pre_post = !g_allow_pre_post; + RegWriteInt("AllowPrePost", g_allow_pre_post); + return TRUE; + } + } + break; + case WM_DROPFILES: + if (!wm_dropfiles_recursive) { + wm_dropfiles_recursive = true; + HandleDroppedFiles(hWnd, (HDROP)wParam); + wm_dropfiles_recursive = false; + } + break; + case WM_USER + 1: + if (lParam == WM_RBUTTONUP) { + HMENU menu = CreatePopupMenu(); + AddToAvailableFilesPopup(menu, 10, false); + + bool active = g_backend->is_started(); + AppendMenu(menu, 0, ID_START, active ? "Re&connect" : "&Connect"); + AppendMenu(menu, active ? 
0 : MF_GRAYED, ID_STOP, "&Disconnect"); + AppendMenu(menu, MF_SEPARATOR, 0, NULL); + AppendMenu(menu, 0, ID_EXIT, "&Exit"); + POINT pt; + GetCursorPos(&pt); + + SetForegroundWindow(hWnd); + + g_last_popup_is_tray = true; + + int rv = TrackPopupMenu(menu, 0, pt.x, pt.y, 0, hWnd, NULL); + DestroyMenu(menu); + } else if (lParam == WM_LBUTTONDBLCLK) { + if (IsWindowVisible(hWnd)) { + g_ui_visible = false; + ShowWindow(hWnd, SW_HIDE); + } else { + ShowUI(hWnd); + } + } + return TRUE; + case WM_USER + 2: + if (g_ui_ip != 0 && g_minimize_on_connect) { + g_minimize_on_connect = false; + g_ui_visible = false; + ShowWindow(hWnd, SW_HIDE); + } + UpdateIcon(UIW_NONE); + return TRUE; + case WM_USER + 3: { + CHARRANGE cr; + cr.cpMin = -1; + cr.cpMax = -1; + // hwnd = rich edit hwnd + SendDlgItemMessage(hWnd, IDC_RICHEDIT21, EM_EXSETSEL, 0, (LPARAM)&cr); + SendDlgItemMessage(hWnd, IDC_RICHEDIT21, EM_REPLACESEL, 0, (LPARAM)lParam); + free( (void*) lParam); + return true; + } + case WM_USER + 6: + SetDlgItemText(hWnd, IDC_RICHEDIT21, ""); + return true; + case WM_USER + 5: + UpdatePublicKey((char*)lParam); + return true; + case WM_USER + 4: { + UpdateStats(); + return true; + } + case WM_USER + 10: + break; + + case WM_USER + 11: { + uint64 now = GetTickCount64(); + if (now < last_auto_service_restart + 5000) { + RERROR("Too many automatic restarts..."); + StopService(UIW_STOPPED_WORKING_FAIL); + } else { + last_auto_service_restart = now; + RestartService(UIW_STOPPED_WORKING_RETRY, true); + } + break; + } + } + return FALSE; +} + +struct PostMsg { + int msg; + WPARAM wparam; + LPARAM lparam; + PostMsg(int a, WPARAM b, LPARAM c) : msg(a), wparam(b), lparam(c) {} +}; + +static HANDLE msg_event; +static CRITICAL_SECTION msg_section; +static std::vector msgvect; + +static DWORD WINAPI MessageThread(void *x) { + std::vector proc; + for(;;) { + WaitForSingleObject(msg_event, INFINITE); + proc.clear(); + EnterCriticalSection(&msg_section); + std::swap(proc, msgvect); + LeaveCriticalSection(&msg_section); + for(size_t i = 0; i != proc.size(); i++) + PostMessage(g_ui_window, proc[i].msg, proc[i].wparam, proc[i].lparam); + } +} + +static void MyPostMessage(int msg, WPARAM wparam, LPARAM lparam) { + size_t count; + EnterCriticalSection(&msg_section); + count = msgvect.size(); + msgvect.emplace_back(msg, wparam, lparam); + LeaveCriticalSection(&msg_section); + if (count == 0) SetEvent(msg_event); +} + +static void InitMyPostMessage() { + msg_event = CreateEvent(NULL, FALSE, FALSE, NULL); + InitializeCriticalSection(&msg_section); + DWORD thread_id; + CloseHandle(CreateThread(NULL, 0, &MessageThread, NULL, 0, &thread_id)); +} + + +void OsGetRandomBytes(uint8 *data, size_t data_size) { +#if defined(OS_WIN) + static BOOLEAN(APIENTRY *pfn)(void*, ULONG); + static bool resolved; + if (!resolved) { + pfn = (BOOLEAN(APIENTRY *)(void*, ULONG))GetProcAddress(LoadLibrary("ADVAPI32.DLL"), "SystemFunction036"); + resolved = true; + } + if (pfn && pfn(data, (ULONG)data_size)) + return; + int r = 0; +#else + int fd = open("/dev/urandom", O_RDONLY); + int r = read(fd, data, data_size); + if (r < 0) r = 0; + close(fd); +#endif + for (; r < data_size; r++) + data[r] = rand() >> 6; +} + +void OsInterruptibleSleep(int millis) { + SleepEx(millis, TRUE); +} + + +uint64 OsGetMilliseconds() { + return GetTickCount64(); +} + +void OsGetTimestampTAI64N(uint8 dst[12]) { + SYSTEMTIME systime; + uint64 file_time_uint64 = 0; + GetSystemTime(&systime); + SystemTimeToFileTime(&systime, (FILETIME*)&file_time_uint64); + uint64 time_since_epoch_100ns 
= (file_time_uint64 - 116444736000000000); + uint64 secs_since_epoch = time_since_epoch_100ns / 10000000 + 0x400000000000000a; + uint32 nanos = (uint32)(time_since_epoch_100ns % 10000000) * 100; + WriteBE64(dst, secs_since_epoch); + WriteBE32(dst + 8, nanos); +} + + + +void PushLine(const char *s) { + size_t l = strlen(s); + char buf[64]; + SYSTEMTIME t; + + GetLocalTime(&t); + + snprintf(buf, sizeof(buf), "[%.2d:%.2d:%.2d] ", t.wHour, t.wMinute, t.wSecond); + size_t tl = strlen(buf); + + char *x = (char*)malloc(tl + l + 3); + if (!x) return; + memcpy(x, buf, tl); + memcpy(x + tl, s, l); + x[l + tl] = '\r'; + x[l + tl + 1] = '\n'; + x[l + tl + 2] = '\0'; + MyPostMessage(WM_USER + 3, 0, (LPARAM)x); +} + +void EnsureConfigDirCreated() { + char fullname[1024]; + if (GetConfigFullName("", fullname, sizeof(fullname))) + CreateDirectory(fullname, NULL); +} + +void EnableControl(int wnd, bool b) { + EnableWindow(GetDlgItem(g_ui_window, wnd), b); +} + + +LRESULT CALLBACK NotifyWndProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam) { + switch (uMsg) { + case WM_USER + 10: + if (wParam == 1) { + PostQuitMessage(0); + return 31337; + } else if (wParam == 0) { + ShowUI(g_ui_window); + return 31337; + } + break; + } + return DefWindowProc(hwnd, uMsg, wParam, lParam); +} + +void CreateNotificationWindow() { + WNDCLASSEX wce = {0}; + wce.cbSize = sizeof(wce); + wce.lpfnWndProc = &NotifyWndProc; + wce.hInstance = g_hinstance; + wce.lpszClassName = "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90"; + RegisterClassEx(&wce); + CreateWindow("TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90", "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90", 0, 0, 0, 0, 0, 0, 0, g_hinstance, NULL); +} + + +void CallbackUpdateUI() { + if (g_ui_visible) + MyPostMessage(WM_USER + 4, NULL, NULL); +} + +void CallbackTriggerReconnect() { + PostMessage(g_ui_window, WM_USER + 11, 0, 0); +} + +void CallbackSetPublicKey(const uint8 public_key[32]) { + char *str = (char*)base64_encode(public_key, 32, NULL); + PostMessage(g_ui_window, WM_USER + 5, NULL, (LPARAM)str); +} + +int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd) { + g_hinstance = hInstance; + InitCpuFeatures(); + + // Check if the app is already running. + CreateMutexA(0, FALSE, "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90"); + if (GetLastError() == ERROR_ALREADY_EXISTS) { + HWND window = FindWindow("TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90", NULL); + DWORD_PTR result; + if (!window || !SendMessageTimeout(window, WM_USER + 10, 0, 0, SMTO_BLOCK, 3000, &result) || result != 31337) { + MessageBoxA(NULL, "It looks like TunSafe is already running, but not responding. 
Please kill the old process first.", "TunSafe", MB_ICONWARNING); + } + return 1; + } + CreateNotificationWindow(); + + WSADATA wsaData = {0}; + if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0) { + RERROR("WSAStartup failed"); + return 1; + } + + LoadLibrary(TEXT("Riched20.dll")); + + g_backend = new TunsafeBackendWin32(); + + InitMyPostMessage(); + InitCommonControls(); + + g_icons[0] = LoadIcon(GetModuleHandle(NULL), MAKEINTRESOURCE(IDI_ICON1)); + g_icons[1] = LoadIcon(GetModuleHandle(NULL), MAKEINTRESOURCE(IDI_ICON0)); + g_ui_window = CreateDialog(GetModuleHandle(NULL), MAKEINTRESOURCE(IDD_DIALOG1), NULL, &DlgProc); + + if (!g_ui_window) + return 1; + + RegCreateKeyEx(HKEY_CURRENT_USER, "Software\\TunSafe", NULL, NULL, 0, KEY_ALL_ACCESS, NULL, &g_reg_key, NULL); + DragAcceptFiles(g_ui_window, TRUE); + + ChangeWindowMessageFilter(WM_DROPFILES, MSGFLT_ADD); + ChangeWindowMessageFilter(WM_COPYDATA, MSGFLT_ADD); + ChangeWindowMessageFilter(0x0049, MSGFLT_ADD); + + static const int ctrls[] = {IDTXT_UDP, IDTXT_TUN, IDTXT_HANDSHAKE}; + for (int i = 0; i < 3; i++) { + HWND w = GetDlgItem(g_ui_window, ctrls[i]); + SetWindowLong(w, GWL_EXSTYLE, GetWindowLong(w, GWL_EXSTYLE) | WS_EX_COMPOSITED); + } + + g_allow_pre_post = RegReadInt("AllowPrePost", 0) != 0; + + bool minimize = false; + const char *filename = NULL; + + for (size_t i = 1; i < __argc; i++) { + const char *arg = __argv[i]; + + if (_stricmp(arg, "/minimize") == 0) { + minimize = true; + } else if (_stricmp(arg, "/minimize_on_connect") == 0) { + g_minimize_on_connect = true; + } else if (_stricmp(arg, "/allow_pre_post") == 0) { + g_allow_pre_post = true; + } else { + filename = arg; + break; + } + } + + if (!minimize) { + g_ui_visible = true; + ShowWindow(g_ui_window, SW_SHOW); + } + + UpdateIcon(UIW_NONE); + + g_logger = &PushLine; + + EnsureConfigDirCreated(); + + if (filename) { + LoadConfigFile(filename, false, false); + } else { + char *conf = RegReadStr("ConfigFile", "TunSafe.conf"); + LoadConfigFile(conf, false, false); + free(conf); + } + + // PrintCpuFeatures(); + +// Benchmark(); + + if (filename != NULL || RegReadInt("IsConnected", 0)) { + StartService(); + } else { + RINFO("Press Connect to initiate a connection to the WireGuard server."); + } + + MSG msg; + + while (GetMessage(&msg, NULL, 0, 0)) { + if (!IsDialogMessage(g_ui_window, &msg)) { + TranslateMessage(&msg); + DispatchMessage(&msg); + } + } + StopService(UIW_EXITING); + RemoveIcon(); + + return 0; +} + + + diff --git a/util.cpp b/util.cpp new file mode 100644 index 0000000..a601a0b --- /dev/null +++ b/util.cpp @@ -0,0 +1,267 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
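For reference, OsGetTimestampTAI64N above converts a Windows FILETIME into the 12-byte TAI64N format used for the WireGuard handshake timestamp: 116444736000000000 is the count of 100ns ticks between the Windows epoch (1601-01-01) and the Unix epoch (1970-01-01), and 0x400000000000000a is the TAI64 label 2^62 plus the 10-second TAI-UTC offset at the Unix epoch. A minimal standalone sketch of the same layout built from POSIX time, useful for cross-checking the constants (not part of this commit; PosixTimestampTAI64N is a hypothetical name):

  #include <stdint.h>
  #include <time.h>

  static void PosixTimestampTAI64N(uint8_t dst[12]) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    uint64_t secs = (uint64_t)ts.tv_sec + 0x400000000000000aull;  // TAI64 label + 10s offset
    uint32_t nanos = (uint32_t)ts.tv_nsec;
    // TAI64N is 8 bytes of big-endian seconds followed by 4 bytes of big-endian nanoseconds.
    for (int i = 0; i < 8; i++) dst[i] = (uint8_t)(secs >> (56 - 8 * i));
    for (int i = 0; i < 4; i++) dst[8 + i] = (uint8_t)(nanos >> (24 - 8 * i));
  }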
+#include "stdafx.h" + +#include +#include +#include +#include +#include + +#if defined(OS_POSIX) +#include +#include +#include +#include +#include +#include +#endif + +#include "tunsafe_types.h" + +static char base64_alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; + +uint8 *base64_encode(const uint8 *input, size_t length, size_t *out_length) { + uint32 a; + size_t size; + uint8 *result, *r; + const uint8 *end; + + size = length * 4 / 3 + 4 + 1; + r = result = (byte*)malloc(size); + + end = input + length - 3; + + // Encode full blocks + while (input <= end) { + a = (input[0] << 16) + (input[1] << 8) + input[2]; + input += 3; + + r[0] = base64_alphabet[(a >> 18)/* & 0x3F*/]; + r[1] = base64_alphabet[(a >> 12) & 0x3F]; + r[2] = base64_alphabet[(a >> 6) & 0x3F]; + r[3] = base64_alphabet[(a) & 0x3F]; + r += 4; + } + + if (input == end + 2) { + a = input[0] << 4; + r[0] = base64_alphabet[(a >> 6) /*& 0x3F*/]; + r[1] = base64_alphabet[(a) & 0x3F]; + r[2] = '='; + r[3] = '='; + r += 4; + } else if (input == end + 1) { + a = (input[0] << 10) + (input[1] << 2); + r[0] = base64_alphabet[(a >> 12) /*& 0x3F*/]; + r[1] = base64_alphabet[(a >> 6) & 0x3F]; + r[2] = base64_alphabet[(a) & 0x3F]; + r[3] = '='; + r += 4; + } + if (out_length) + *out_length = r - result; + *r = 0; + return result; +} + +#define WHITESPACE 64 +#define EQUALS 65 +#define INVALID 66 + +static const unsigned char d[] = { + 66,66,66,66,66,66,66,66,66,66,64,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,62,66,66,66,63,52,53, + 54,55,56,57,58,59,60,61,66,66,66,65,66,66,66, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, + 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,66,66,66,66,66,66,26,27,28, + 29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66 +}; + +bool base64_decode(uint8 *in, size_t inLen, uint8 *out, size_t *outLen) { + uint8 *end = in + inLen; + uint8 iter = 0; + uint32_t buf = 0; + size_t len = 0; + + while (in < end) { + unsigned char c = d[*in++]; + + switch (c) { + case WHITESPACE: continue; /* skip whitespace */ + case INVALID: return false; /* invalid input, return error */ + case EQUALS: /* pad character, end of data */ + in = end; + continue; + default: + buf = buf << 6 | c; + iter++; + if (iter == 4) { + if ((len += 3) > *outLen) return 0; /* buffer overflow */ + *(out++) = (buf >> 16) & 255; + *(out++) = (buf >> 8) & 255; + *(out++) = buf & 255; + buf = 0; iter = 0; + + } + } + } + if (iter == 3) { + if ((len += 2) > *outLen) return 0; /* buffer overflow */ + *(out++) = (buf >> 10) & 255; + *(out++) = (buf >> 2) & 255; + } else if (iter == 2) { + if (++len > *outLen) return 0; /* buffer overflow */ + *(out++) = (buf >> 4) & 255; + } + *outLen = len; + return true; +} + + + +int RunCommand(const char *fmt, ...) 
{ + const char *fmt_org = fmt; + va_list va; + std::string tmp; + char buf[32], c; + char *args[33]; + char *envp[1] = {NULL}; + int nargs = 0; + va_start(va, fmt); + for (;;) { + c = *fmt++; + if (c == '%') { + c = *fmt++; + if (c == 0) goto ZERO; + if (c == 's') { + tmp += va_arg(va, char*); + } else if (c == 'd') { + snprintf(buf, 32, "%d", va_arg(va, int)); + tmp += buf; + } else if (c == 'u') { + snprintf(buf, 32, "%u", va_arg(va, int)); + tmp += buf; + } else if (c == '%') { + tmp += '%'; + } else if (c == 'A') { + struct in_addr in; + in.s_addr = htonl(va_arg(va, in_addr_t)); + tmp += inet_ntoa(in); + } + } else if (c == ' ' || c == 0) { +ZERO: + args[nargs++] = _strdup(tmp.c_str()); + tmp.clear(); + if (nargs == 32 || c == 0) break; + } else { + tmp += c; + } + } + args[nargs] = 0; + + fprintf(stderr, "Run:"); + for (int i = 0; args[i]; i++) + fprintf(stderr, " %s", args[i]); + fprintf(stderr, "\n"); + + int ret = -1; + + +#if defined(OS_POSIX) + pid_t pid = fork(); + if (pid == 0) { + execve(args[0], args, envp); + exit(127); + } + if (pid < 0) { + RERROR("Fork failed"); + } else if (waitpid(pid, &ret, 0) != pid) { + ret = -1; + } +#endif + + if (ret != 0) + RERROR("Command %s failed %d!", fmt_org, ret); + + return ret; +} + +bool IsOnlyZeros(const uint8 *data, size_t data_size) { + for (size_t i = 0; i != data_size; i++) + if (data[i]) + return false; + return true; +} + + +#ifdef _MSC_VER +void printhex(const char *name, const void *a, size_t l) { + char buf[256]; + snprintf(buf, 256, "%s (%d):", name, (int)l); OutputDebugString(buf); + for (size_t i = 0; i < l; i++) { + if (i % 4 == 0) printf(" "); + snprintf(buf, 256, "%.2X", *((uint8*)a + i)); OutputDebugString(buf); + } + OutputDebugString("\n"); +} + +#else +void printhex(const char *name, const void *a, size_t l) { + printf("%s (%d):", name, (int)l); + for (size_t i = 0; i < l; i++) { + if (i % 4 == 0) printf(" "); + printf("%.2X", *((uint8*)a + i)); + } + printf("\n"); +} +#endif + +typedef void Logger(const char *msg); +Logger *g_logger; + +#undef RERROR +#undef void + +void RERROR(const char *msg, ...); + +void RERROR(const char *msg, ...) { + va_list va; + char buf[512]; + va_start(va, msg); + vsnprintf(buf, sizeof(buf), msg, va); + va_end(va); + if (g_logger) { + g_logger(buf); + } else { + fputs(buf, stderr); + fputs("\n", stderr); + } +} + +void rinfo(const char *msg, ...) { + printf("muu"); +} + +void rinfo2(const char *msg) { + printf("muu2"); +} + +void RINFO(const char *msg, ...) { + va_list va; + char buf[512]; + va_start(va, msg); + vsnprintf(buf, sizeof(buf), msg, va); + va_end(va); + if (g_logger) { + g_logger(buf); + } else { + fputs(buf, stderr); + fputs("\n", stderr); + } +} diff --git a/util.h b/util.h new file mode 100644 index 0000000..48b8324 --- /dev/null +++ b/util.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#pragma once +#include "tunsafe_types.h" + +uint8 *base64_encode(const uint8 *input, size_t length, size_t *out_length); +bool base64_decode(uint8 *in, size_t inLen, uint8 *out, size_t *outLen); +bool IsOnlyZeros(const uint8 *data, size_t data_size); + +int RunCommand(const char *fmt, ...); +typedef void Logger(const char *msg); +extern Logger *g_logger; + + diff --git a/wireguard.cpp b/wireguard.cpp new file mode 100644 index 0000000..ab9b393 --- /dev/null +++ b/wireguard.cpp @@ -0,0 +1,998 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
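The base64 helpers in util.cpp above carry the 32-byte WireGuard keys through the config parser. A hypothetical round-trip check (not in this commit) showing how the two calls compose; uint8 is assumed to be the typedef from tunsafe_types.h:

  #include <stdlib.h>
  #include <string.h>
  #include "util.h"

  static bool Base64KeyRoundTrip(void) {
    uint8 key[32] = {1, 2, 3};               // arbitrary key material
    size_t encoded_len;
    uint8 *encoded = base64_encode(key, sizeof(key), &encoded_len);
    if (!encoded) return false;
    uint8 decoded[32];
    size_t decoded_len = sizeof(decoded);    // in: buffer capacity, out: bytes written
    bool ok = encoded_len == 44 &&           // 32 bytes encode to 44 chars, the familiar key length
              base64_decode(encoded, encoded_len, decoded, &decoded_len) &&
              decoded_len == sizeof(key) &&
              memcmp(decoded, key, sizeof(key)) == 0;
    free(encoded);                           // base64_encode returns a malloc'd, NUL-terminated buffer
    return ok;
  }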
+#include "stdafx.h" +#include "wireguard.h" +#include "netapi.h" +#include "wireguard_proto.h" +#include "crypto/chacha20poly1305.h" +#include "crypto/blake2s.h" +#include "crypto/siphash.h" +#include "tunsafe_endian.h" +#include +#include +#include +#include +#include "wireguard.h" + +uint64 OsGetMilliseconds(); + +enum { + IPV4_HEADER_SIZE = 20, + IPV6_HEADER_SIZE = 40, +}; + +WireguardProcessor::WireguardProcessor(UdpInterface *udp, TunInterface *tun, ProcessorDelegate *procdel) { + tun_addr_.size = 0; + tun6_addr_.size = 0; + udp_ = udp; + tun_ = tun; + procdel_ = procdel; + mtu_ = 1420; + memset(&stats_, 0, sizeof(stats_)); + listen_port_ = 0; + network_discovery_spoofing_ = false; + add_routes_mode_ = true; + dns_blocking_ = true; + internet_blocking_ = kBlockInternet_Default; + dns6_addr_.sin.sin_family = dns_addr_.sin.sin_family = 0; +} + +WireguardProcessor::~WireguardProcessor() { +} + +bool WireguardProcessor::AddDnsServer(const IpAddr &sin) { + IpAddr *target = (sin.sin.sin_family == AF_INET6) ? &dns6_addr_ : &dns_addr_; + if (target->sin.sin_family != 0) + return false; + *target = sin; + return true; +} + + +bool WireguardProcessor::SetTunAddress(const WgCidrAddr &addr) { + WgCidrAddr *target = (addr.size == 128) ? &tun6_addr_ : &tun_addr_; + if (target->size != 0) + return false; + *target = addr; + return true; +} + + +ProcessorStats WireguardProcessor::GetStats() { + stats_.last_complete_handskake_timestamp = dev_.last_complete_handskake_timestamp(); + return stats_; +} + +void WireguardProcessor::ResetStats() { + memset(&stats_, 0, sizeof(stats_)); +} + +void WireguardProcessor::SetupCompressionHeader(WgPacketCompressionVer01 *c) { + memset(c, 0, sizeof(WgPacketCompressionVer01)); + // Windows uses a ttl of 128 while other platforms use 64 +#if defined(OS_WIN) + c->ttl = 128; +#else // defined(OS_WIN) + c->ttl = 64; +#endif // defined(OS_WIN) + WriteLE16(&c->version, EXT_PACKET_COMPRESSION_VER); + memcpy(c->ipv4_addr, &tun_addr_.addr, 4); + if (tun6_addr_.size == 128) + memcpy(c->ipv6_addr, &tun6_addr_.addr, 16); + c->flags = ((tun_addr_.cidr >> 3) & 3); +} + +static inline bool CheckFirstNbitsEquals(const byte *a, const byte *b, size_t n) { + return memcmp(a, b, n >> 3) == 0 && ((n & 7) == 0 || !((a[n >> 3] ^ b[n >> 3]) & (0xff << (8 - (n & 7))))); +} + +static bool IsWgCidrAddrSubsetOf(const WgCidrAddr &inner, const WgCidrAddr &outer) { + return inner.size == outer.size && inner.cidr >= outer.cidr && + CheckFirstNbitsEquals(inner.addr, outer.addr, outer.cidr); +} + +bool WireguardProcessor::Start() { + if (!udp_->Initialize(listen_port_)) + return false; + + if (tun_addr_.size != 32) { + RERROR("No IPv4 address configured"); + return false; + } + + if (tun_addr_.cidr >= 31) { + RERROR("The TAP driver is not compatible with Address using CIDR /31 or /32. Changing to /24"); + tun_addr_.cidr = 24; + } + + TunInterface::TunConfig config = {0}; + config.ip = ReadBE32(tun_addr_.addr); + config.cidr = tun_addr_.cidr; + config.mtu = mtu_; + config.pre_post_commands = pre_post_; + + uint32 netmask = tun_addr_.cidr == 32 ? 0xffffffff : 0xffffffff << (32 - tun_addr_.cidr); + + uint32 ipv4_broadcast_addr = (netmask == 0xffffffff) ? 0xffffffff : config.ip | ~netmask; + + if (tun6_addr_.size == 128) { + if (tun6_addr_.cidr > 126) { + RERROR("IPv6 /127 or /128 not supported. 
Changing to 120"); + tun6_addr_.cidr = 120; + } + config.ipv6_cidr = tun6_addr_.cidr; + memcpy(&config.ipv6_address, tun6_addr_.addr, 16); + } + + if (add_routes_mode_) { + WgPeer *peer = (WgPeer *)dev_.ip_to_peer_map().LookupV4DefaultPeer(); + if (peer != NULL && peer->endpoint_.sin.sin_family != 0) { + config.default_route_endpoint_v4 = (peer->endpoint_.sin.sin_family == AF_INET) ? ReadBE32(&peer->endpoint_.sin.sin_addr) : 0; + // Set the default route to something + config.use_ipv4_default_route = true; + } + + // Also configure ipv6 gw? + if (config.ipv6_cidr != 0) { + peer = (WgPeer*)dev_.ip_to_peer_map().LookupV6DefaultPeer(); + if (peer != NULL && peer->endpoint_.sin.sin_family != 0) { + if (peer->endpoint_.sin.sin_family == AF_INET6) + memcpy(&config.default_route_endpoint_v6, &peer->endpoint_.sin6.sin6_addr, 16); + config.use_ipv6_default_route = true; + } + } + + // For each peer, add the extra routes to the extra routes table + for (WgPeer *peer = dev_.first_peer(); peer; peer = peer->next_peer_) { + for (auto it = peer->allowed_ips_.begin(); it != peer->allowed_ips_.end(); ++it) { + // Don't add an entry if it's identical to my address or it's a default route + if (IsWgCidrAddrSubsetOf(*it, tun_addr_) || IsWgCidrAddrSubsetOf(*it, tun6_addr_) || it->cidr == 0) + continue; + // Don't add an entry if we have no ipv6 address configured + if (config.ipv6_cidr == 0 && it->size != 32) + continue; + config.extra_routes.push_back(*it); + } + } + } + + uint8 dhcp_options[6]; + + config.block_dns_on_adapters = dns_blocking_; + config.internet_blocking = internet_blocking_; + + if (dns_addr_.sin.sin_family == AF_INET) { + dhcp_options[0] = 6; + dhcp_options[1] = 4; + memcpy(&dhcp_options[2], &dns_addr_.sin.sin_addr, 4); + config.dhcp_options = dhcp_options; + config.dhcp_options_size = sizeof(dhcp_options); + } + + if (dns6_addr_.sin6.sin6_family == AF_INET6) { + config.set_ipv6_dns = true; + memcpy(&config.dns_server_v6, &dns6_addr_.sin6.sin6_addr, 16); + } + + TunInterface::TunConfigOut config_out; + if (!tun_->Initialize(std::move(config), &config_out)) + return false; + + SetupCompressionHeader(dev_.compression_header()); + + network_discovery_spoofing_ = config_out.enable_neighbor_discovery_spoofing; + memcpy(network_discovery_mac_, config_out.neighbor_discovery_spoofing_mac, 6); + + for (WgPeer *peer = dev_.first_peer(); peer; peer = peer->next_peer_) { + peer->ipv4_broadcast_addr_ = ipv4_broadcast_addr; + if (peer->endpoint_.sin.sin_family != 0) { + RINFO("Sending handshake..."); + SendHandshakeInitiationAndResetRetries(peer); + } + } + + return true; +} + +static uint8 kIcmpv6NeighborMulticastPrefix[] = {0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,0x00, 0x00, 0x00, 0x01, 0xff}; + +enum { + kIpProto_ICMPv6 = 0x3A, + kICMPv6_NeighborSolicitation = 135, +}; + +#pragma pack(push, 1) +struct ICMPv6NaPacket { + uint8 type; + uint8 code; + uint16 checksum; + uint8 rso; + uint8 reserved[3]; + uint8 target[16]; + uint8 opt_type; + uint8 opt_length; + uint8 target_mac[6]; +}; + +struct ICMPv6NaPacketWithoutTarget { + uint8 type; + uint8 code; + uint16 checksum; + uint8 rso; + uint8 reserved[3]; + uint8 target[16]; +}; + +#pragma pack (pop) + + +static uint16 ComputeIcmpv6Checksum(const uint8 *buf, int buf_size, const uint8 src_addr[16], const uint8 dst_addr[16]) { + uint32 sum = 0; + for (int i = 0; i < buf_size - 1; i += 2) + sum += ReadBE16(&buf[i]); + if (buf_size & 1) + sum += buf[buf_size - 1]; + for (int i = 0; i < 16; i += 2) + sum += ReadBE16(&src_addr[i]); + for (int i = 0; i < 
16; i += 2) + sum += ReadBE16(&dst_addr[i]); + sum += (uint16)IPPROTO_ICMPV6 + (uint16)buf_size; + while (sum >> 16) + sum = (sum & 0xFFFF) + (sum >> 16); + return ((uint16)~sum); +} + + +bool WireguardProcessor::HandleIcmpv6NeighborSolicitation(const byte *data, size_t data_size) { + if (data_size < 48 + 16) + return false; + + // Filter out neighbor solicitation + if (data[40] != kICMPv6_NeighborSolicitation || data[41] != 0) + return false; + + if (!network_discovery_spoofing_) + return false; + + bool is_broadcast = true; + + if (memcmp(data + 24, kIcmpv6NeighborMulticastPrefix, sizeof(kIcmpv6NeighborMulticastPrefix)) != 0) { + if (memcmp(data + 24, data + 48, 16) != 0) + return false; + is_broadcast = false; + } + + // Target address must match a peer's range. + WgPeer *peer = (WgPeer*)dev_.ip_to_peer_map().LookupV6(data + 48); + if (peer == NULL) + return false; + + // Build response packet + Packet *out = AllocPacket(); + if (out == NULL) + return false; + + byte *odata = out->data; + + int packet_size = is_broadcast ? sizeof(ICMPv6NaPacket) : sizeof(ICMPv6NaPacketWithoutTarget); + + memcpy(odata, data, 4); + WriteBE16(odata + 4, packet_size); + odata[6] = 58; // next = icmp + odata[7] = 255; // HopLimit + memcpy(odata + 8, data + 48, 16); // Source Address + memcpy(odata + 24, data + 8, 16); // Dest addr + + ((ICMPv6NaPacket*)(odata + 40))->type = 136; // NA + ((ICMPv6NaPacket*)(odata + 40))->code = 0; + ((ICMPv6NaPacket*)(odata + 40))->checksum = 0; + ((ICMPv6NaPacket*)(odata + 40))->rso = 0x60; // solicited + memset(((ICMPv6NaPacket*)(odata + 40))->reserved, 0, 3); + memcpy(((ICMPv6NaPacket*)(odata + 40))->target, odata + 8, 16); + if (is_broadcast) { + ((ICMPv6NaPacket*)(odata + 40))->opt_type = 2; + ((ICMPv6NaPacket*)(odata + 40))->opt_length = 1; + + memcpy(((ICMPv6NaPacket*)(odata + 40))->target_mac, network_discovery_mac_, 6); + + // For some reason this is openvpn's 'related mac' + ((ICMPv6NaPacket*)(odata + 40))->target_mac[2] += 1; + } + uint16 checksum = ComputeIcmpv6Checksum(odata + 40, packet_size, odata + 8, odata + 24); + WriteBE16(&((ICMPv6NaPacket*)(odata + 40))->checksum, checksum); + + out->size = 40 + packet_size; + tun_->WriteTunPacket(out); + return true; +} + +static inline bool IsIpv6Multicast(const uint8 dst[16]) { + return dst[0] == 0xff; +} + +// On incoming packet to the tun interface. +void WireguardProcessor::HandleTunPacket(Packet *packet) { + uint8 *data = packet->data; + size_t data_size = packet->size; + unsigned ip_version, size_from_header; + WgPeer *peer; + + stats_.tun_bytes_in += data_size; + stats_.tun_packets_in++; + + // Sanity check that it looks like a valid ipv4 or ipv6 packet, + // and determine the destination peer from the ip header + if (data_size < IPV4_HEADER_SIZE) + goto getout; + + ip_version = *data >> 4; + if (ip_version == 4) { + uint32 ip = ReadBE32(data + 16); + peer = (WgPeer*)dev_.ip_to_peer_map().LookupV4(ip); + if (peer == NULL) + goto getout; + if ((ip >= (224 << 24) || ip == peer->ipv4_broadcast_addr_) && !peer->allow_multicast_through_peer_) + goto getout; + + size_from_header = ReadBE16(data + 2); + if (size_from_header < IPV4_HEADER_SIZE) + goto getout; + } else if (ip_version == 6) { + if (data_size < IPV6_HEADER_SIZE) + goto getout; + + // Check if the packet is a Neighbor solicitation ICMP6 packet, in that case fake + // a reply. 
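+ // (No real IPv6 neighbor exists behind the tunnel device, so the advertisement is synthesized locally by HandleIcmpv6NeighborSolicitation above and the solicitation is never forwarded to the peer.)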
+ if (data[6] == kIpProto_ICMPv6 && HandleIcmpv6NeighborSolicitation(data, data_size)) + goto getout; + + peer = (WgPeer*)dev_.ip_to_peer_map().LookupV6(data + 24); + if (peer == NULL) + goto getout; + + if (IsIpv6Multicast(data + 24) && !peer->allow_multicast_through_peer_) + goto getout; + + size_from_header = IPV6_HEADER_SIZE + ReadBE16(data + 4); + } else { + goto getout; + } + if (size_from_header > data_size) + goto getout; + if (peer->endpoint_.sin.sin_family == 0) + goto getout; + + WritePacketToUdp(peer, packet); + return; + +getout: + // send ICMP? + FreePacket(packet); +} + +void WireguardProcessor::WritePacketToUdp(WgPeer *peer, Packet *packet) { + byte *data = packet->data; + size_t size = packet->size; + bool want_handshake; + uint64 send_ctr; + WgKeypair *keypair = peer->curr_keypair_; + + if (keypair == NULL || + keypair->send_key_state == WgKeypair::KEY_INVALID || + keypair->send_ctr >= REJECT_AFTER_MESSAGES) + goto getout_handshake; + + want_handshake = (keypair->send_ctr >= REKEY_AFTER_MESSAGES || + keypair->send_key_state == WgKeypair::KEY_WANT_REFRESH); + + // Ensure packet will fit including the biggest padding + if (size > kPacketCapacity - 15 - CHACHA20POLY1305_AUTHTAGLEN) + goto getout_discard; + + if (size == 0) { + peer->OnKeepaliveSent(); + } else { + peer->OnDataSent(); + +#if WITH_HANDSHAKE_EXT + // Attempt to compress the packet headers using ipzip. + if (keypair->enabled_features[WG_FEATURE_ID_IPZIP]) { + uint32 rv = IpzipCompress(data, (uint32)size, &keypair->ipzip_state_, 0); + if (rv == (uint32)-1) + goto getout_discard; + if (rv == 0) + goto add_padding; + stats_.compression_hdr_saved_out += (int32)(size - rv); + data += (int32)(size - rv); + size = rv; + } else { +add_padding: +#else + { +#endif // WITH_HANDSHAKE_EXT + // Pad packet to a multiple of 16 bytes, but no more than the mtu bytes. + unsigned padding = std::min((0 - size) & 15, (unsigned)mtu_ - (unsigned)size); + memset(data + size, 0, padding); + size += padding; + } + } + send_ctr = keypair->send_ctr++; + +#if WITH_SHORT_HEADERS + if (keypair->enabled_features[WG_FEATURE_ID_SHORT_HEADER]) { + size_t header_size; + byte *write = data; + uint8 tag = WG_SHORT_HEADER_BIT, inner_tag; + // For every 16 incoming packets, send out an ack. + if (keypair->incoming_packet_count >= 16) { + keypair->incoming_packet_count = 0; + uint64 next_expected_packet = keypair->replay_detector.expected_seq_nr(); + if (next_expected_packet < 0x10000) { + WriteLE16(write -= 2, (uint16)next_expected_packet); + inner_tag = WG_ACK_HEADER_COUNTER_2; + } else if (next_expected_packet < 0x100000000ull) { + WriteLE32(write -= 4, (uint32)next_expected_packet); + inner_tag = WG_ACK_HEADER_COUNTER_4; + } else { + WriteLE64(write -= 8, next_expected_packet); + inner_tag = WG_ACK_HEADER_COUNTER_8; + } + if (keypair->broadcast_short_key != 0) { + inner_tag += keypair->addr_entry_slot; + keypair->broadcast_short_key = 2; + } + *--write = inner_tag; + tag += WG_SHORT_HEADER_ACK; + } else if (keypair->broadcast_short_key == 1) { + keypair->broadcast_short_key = 2; + *--write = keypair->addr_entry_slot; + tag += WG_SHORT_HEADER_ACK; + } + + // Determine the distance from the most recently acked packet, + // be conservative when picking a suitable packet length to send. 
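+ // (A truncated counter is only unambiguous while the gap to the last acked counter stays well inside the signed range of the truncated width, hence the conservative thresholds below and the fallback to the full 16-byte header via need_big_packet.)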
+ uint64 distance = send_ctr - keypair->send_ctr_acked; + if (distance < (1 << 6)) { + *(write -= 1) = (uint8)send_ctr; + tag += WG_SHORT_HEADER_CTR1; + } else if (distance < (1 << 14)) { + WriteLE16(write -= 2, (uint16)send_ctr); + tag += WG_SHORT_HEADER_CTR2; + } else if (distance < (1 << 30)) { + WriteLE32(write -= 4, (uint32)send_ctr); + tag += WG_SHORT_HEADER_CTR4; + } else { + // Too far ahead. Can't use short packets. + goto need_big_packet; + } + + tag += keypair->can_use_short_key_for_outgoing; + if (!keypair->can_use_short_key_for_outgoing) + WriteLE32(write -= 4, keypair->remote_key_id); + *--write = tag; + + + header_size = data - write; + + stats_.compression_wg_saved_out += (int64)16 - header_size; + + packet->data = data - header_size; + packet->size = (int)(size + header_size + keypair->auth_tag_length); + WgKeypairEncryptPayload(data, size, write, data - write, send_ctr, keypair); + } else { +need_big_packet: +#else + { +#endif // #if WITH_SHORT_HEADERS + ((MessageData*)data)[-1].type = ToLE32(MESSAGE_DATA); + ((MessageData*)data)[-1].receiver_id = keypair->remote_key_id; + ((MessageData*)data)[-1].counter = ToLE64(send_ctr); + packet->data = data - sizeof(MessageData); + packet->size = (int)(size + sizeof(MessageData) + keypair->auth_tag_length); + WgKeypairEncryptPayload(data, size, NULL, 0, send_ctr, keypair); + } + + packet->addr = peer->endpoint_; + DoWriteUdpPacket(packet); + if (want_handshake) + SendHandshakeInitiationAndResetRetries(peer); + return; + +getout_discard: + FreePacket(packet); + return; + +getout_handshake: + // Keep only the first MAX_QUEUED_PACKETS packets. + while (peer->num_queued_packets_ >= MAX_QUEUED_PACKETS_PER_PEER) { + Packet *packet = peer->first_queued_packet_; + peer->first_queued_packet_ = packet->next; + peer->num_queued_packets_--; + FreePacket(packet); + } + // Add the packet to the out queue that will get sent once handshake completes + *peer->last_queued_packet_ptr_ = packet; + peer->last_queued_packet_ptr_ = &packet->next; + packet->next = NULL; + peer->num_queued_packets_++; + + SendHandshakeInitiationAndResetRetries(peer); +} + +// This scrambles the initial 16 bytes of the packet with the +// trailing 8 bytes of the packet. +static void ScrambleUnscramblePacket(Packet *packet, ScramblerSiphashKeys *keys) { + uint8 *data = packet->data; + size_t data_size = packet->size; + + if (data_size < 8) + return; + + uint64 last_uint64 = ReadLE64(data_size >= 24 ? 
data + 16 : data + data_size - 8); + uint64 a = siphash_u64_u32(last_uint64, (uint32)data_size, (siphash_key_t*)&keys->keys[0]); + uint64 b = siphash_u64_u32(last_uint64, (uint32)data_size, (siphash_key_t*)&keys->keys[2]); + a = ToLE64(a); + b = ToLE64(b); + if (data_size >= 24) { + ((uint64*)data)[0] ^= a; + ((uint64*)data)[1] ^= b; + } else { + struct { uint64 a, b; } scramblers = {a, b}; + uint8 *s = (uint8*)&scramblers; + for (size_t i = 0; i < data_size - 8; i++) + data[i] ^= s[i]; + } +} + +static NOINLINE void ScrambleUnscrambleAndWrite(Packet *packet, ScramblerSiphashKeys *keys, UdpInterface *udp) { +#if WITH_HEADER_OBFUSCATION + ScrambleUnscramblePacket(packet, keys); + udp->WriteUdpPacket(packet); +#endif // WITH_HEADER_OBFUSCATION +} + +void WireguardProcessor::DoWriteUdpPacket(Packet *packet) { + stats_.udp_packets_out++; + stats_.udp_bytes_out += packet->size; + if (!dev_.header_obfuscation_) + udp_->WriteUdpPacket(packet); + else + ScrambleUnscrambleAndWrite(packet, &dev_.header_obfuscation_key_, udp_); +} + +void WireguardProcessor::SendHandshakeInitiationAndResetRetries(WgPeer *peer) { + peer->handshake_attempts_ = 0; + SendHandshakeInitiation(peer); +} + +void WireguardProcessor::SendHandshakeInitiation(WgPeer *peer) { + // Send out a handshake init packet to trigger the handshake procedure + if (!peer->CheckHandshakeRateLimit()) + return; + Packet *packet = AllocPacket(); + if (!packet) + return; + peer->CreateMessageHandshakeInitiation(packet); + + packet->addr = peer->endpoint_; + DoWriteUdpPacket(packet); + peer->OnHandshakeInitSent(); +} + +// Handles an incoming WireGuard packet from the UDP side, decrypt etc. +void WireguardProcessor::HandleUdpPacket(Packet *packet, bool overload) { + uint32 type; + + stats_.udp_bytes_in += packet->size; + stats_.udp_packets_in++; + + // Unscramble incoming packets +#if WITH_HEADER_OBFUSCATION + if (dev_.header_obfuscation_) + ScrambleUnscramblePacket(packet, &dev_.header_obfuscation_key_); +#endif // WITH_HEADER_OBFUSCATION + + if (packet->size < sizeof(uint32)) + goto invalid_size; + type = ReadLE32((uint32*)packet->data); + if (type == MESSAGE_DATA) { + if (packet->size < sizeof(MessageData)) + goto invalid_size; + HandleDataPacket(packet); +#if WITH_SHORT_HEADERS + } else if (type & WG_SHORT_HEADER_BIT) { + HandleShortHeaderFormatPacket(type, packet); +#endif // WITH_SHORT_HEADERS + } else if (type == MESSAGE_HANDSHAKE_COOKIE) { + if (packet->size != sizeof(MessageHandshakeCookie)) + goto invalid_size; + HandleHandshakeCookiePacket(packet); + } else if (type == MESSAGE_HANDSHAKE_INITIATION) { + if (WITH_HANDSHAKE_EXT ? (packet->size < sizeof(MessageHandshakeInitiation)) : (packet->size != sizeof(MessageHandshakeInitiation))) + goto invalid_size; + + if (!CheckIncomingHandshakeRateLimit(packet, overload)) + return; + HandleHandshakeInitiationPacket(packet); + } else if (type == MESSAGE_HANDSHAKE_RESPONSE) { + if (WITH_HANDSHAKE_EXT ? (packet->size < sizeof(MessageHandshakeResponse)) : (packet->size != sizeof(MessageHandshakeResponse))) + goto invalid_size; + if (!CheckIncomingHandshakeRateLimit(packet, overload)) + return; + HandleHandshakeResponsePacket(packet); + } else { + // unknown packet +invalid_size: + FreePacket(packet); + } +} + +// Returns nonzero if two endpoints are different. 
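+// (Differences are accumulated with XOR/OR across family, address and port, so a single nonzero bit anywhere means the endpoints differ, with no early-exit branches.)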
+static uint32 CompareEndpoint(const IpAddr *a, const IpAddr *b) { + uint32 rv = b->sin.sin_family ^ a->sin.sin_family; + if (b->sin.sin_family != AF_INET6) { + rv |= b->sin.sin_addr.s_addr ^ a->sin.sin_addr.s_addr; + rv |= b->sin.sin_port ^ a->sin.sin_port; + } else { + uint64 rx = ((uint64*)&b->sin6.sin6_addr)[0] ^ ((uint64*)&a->sin6.sin6_addr)[0]; + rx |= ((uint64*)&b->sin6.sin6_addr)[1] ^ ((uint64*)&a->sin6.sin6_addr)[1]; + rv |= rx | (rx >> 32); + rv |= b->sin6.sin6_port ^ a->sin6.sin6_port; + } + return rv; +} + +void WgPeer::CopyEndpointToPeer(WgKeypair *keypair, const IpAddr *addr) { + // Remember how to send packets to this peer + if (CompareEndpoint(&keypair->peer->endpoint_, addr)) { +#if WITH_SHORT_HEADERS + // When the endpoint changes, forget about using the short key. + keypair->broadcast_short_key = 0; + keypair->can_use_short_key_for_outgoing = false; +#endif // WITH_SHORT_HEADERS + keypair->peer->endpoint_ = *addr; + } +} + +#if WITH_SHORT_HEADERS +void WireguardProcessor::HandleShortHeaderFormatPacket(uint32 tag, Packet *packet) { + uint8 *data = packet->data + 1; + size_t bytes_left = packet->size - 1; + WgKeypair *keypair; + uint64 counter, acked_counter; + uint8 ack_tag; + + if ((tag & WG_SHORT_HEADER_KEY_ID_MASK) == 0x00) { + // The key_id is explicitly included in the packet. + if (bytes_left < 4) goto getout; + uint32 key_id = ReadLE32(data); + data += 4, bytes_left -= 4; + auto it = dev_.key_id_lookup().find(key_id); + if (it == dev_.key_id_lookup().end()) goto getout; + keypair = it->second.second; + } else { + // Lookup the packet source ip and port in the address mapping + uint64 addr_id = packet->addr.sin.sin_addr.s_addr | ((uint64)packet->addr.sin.sin_port << 32); + auto it = dev_.addr_entry_map().find(addr_id); + if (it == dev_.addr_entry_map().end()) + goto getout; + WgAddrEntry *addr_entry = it->second; + keypair = addr_entry->keys[((tag / WG_SHORT_HEADER_KEY_ID) & 3) - 1]; + } + + if (!keypair || keypair->recv_key_state == WgKeypair::KEY_INVALID || + !keypair->enabled_features[WG_FEATURE_ID_SHORT_HEADER]) + goto getout; + + // Pick the closest possible counter value with the same low bits. + counter = keypair->replay_detector.expected_seq_nr(); + switch (tag & WG_SHORT_HEADER_TYPE_MASK) { + case WG_SHORT_HEADER_CTR1: + if (bytes_left < 1) goto getout; + counter += (int8)(*data - counter); + data += 1, bytes_left -= 1; + break; + case WG_SHORT_HEADER_CTR2: + if (bytes_left < 2) goto getout; + counter += (int16)(ReadLE16(data) - counter); + data += 2, bytes_left -= 2; + break; + case WG_SHORT_HEADER_CTR4: + if (bytes_left < 4) goto getout; + counter += (int32)(ReadLE32(data) - counter); + data += 4, bytes_left -= 4; + break; + default: + goto getout; // invalid packet + } + + acked_counter = 0; + ack_tag = 0; + + // If the acknowledge header is present, then parse it so we may + // get an ack for the highest seen packet. 
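+ // (The ack tells the sender how far ahead it may run while still using truncated counters, and its key bit lets the sender omit the explicit key id in subsequent packets.)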
+ if (tag & WG_SHORT_HEADER_ACK) { + if (bytes_left == 0) goto getout; + ack_tag = *data; + data += 1, bytes_left -= 1; + + switch (ack_tag & WG_ACK_HEADER_COUNTER_MASK) { + case WG_ACK_HEADER_COUNTER_2: + if (bytes_left < 2) goto getout; + acked_counter = ReadLE16(data); + data += 2, bytes_left -= 2; + break; + case WG_ACK_HEADER_COUNTER_4: + if (bytes_left < 4) goto getout; + acked_counter = ReadLE32(data); + data += 4, bytes_left -= 4; + break; + case WG_ACK_HEADER_COUNTER_8: + if (bytes_left < 8) goto getout; + acked_counter = ReadLE64(data); + data += 8, bytes_left -= 8; + break; + default: + break; + } + } + if (counter >= REJECT_AFTER_MESSAGES) + goto getout; + // Authenticate the packet before we can apply the state changes. + if (!WgKeypairDecryptPayload(data, bytes_left, packet->data, data - packet->data, counter, keypair)) + goto getout; + + if (!keypair->replay_detector.CheckReplay(counter)) + goto getout; + + stats_.compression_wg_saved_in += 16 - (data - packet->data); + + keypair->send_ctr_acked = std::max(keypair->send_ctr_acked, acked_counter); + keypair->incoming_packet_count++; + + WgPeer::CopyEndpointToPeer(keypair, &packet->addr); + + // Periodically broadcast out the short key + if ((tag & WG_SHORT_HEADER_KEY_ID_MASK) == 0x00 && !keypair->did_attempt_remember_ip_port) { + keypair->did_attempt_remember_ip_port = true; + if (keypair->enabled_features[WG_FEATURE_ID_SKIP_KEYID_IN]) { + uint64 addr_id = packet->addr.sin.sin_addr.s_addr | ((uint64)packet->addr.sin.sin_port << 32); + dev_.UpdateKeypairAddrEntry(addr_id, keypair); + } + } + + // Ack header may also signal that we can omit the key id in packets from now on. + if (tag & WG_SHORT_HEADER_ACK) + keypair->can_use_short_key_for_outgoing = (ack_tag & WG_ACK_HEADER_KEY_MASK) * WG_SHORT_HEADER_KEY_ID; + + HandleAuthenticatedDataPacket(keypair, packet, data, bytes_left - keypair->auth_tag_length); + return; +getout: + FreePacket(packet); + return; +} +#endif // WITH_SHORT_HEADERS + +void WireguardProcessor::HandleAuthenticatedDataPacket(WgKeypair *keypair, Packet *packet, uint8 *data, size_t data_size) { + WgPeer *peer = keypair->peer; + + // Promote the next key to the current key when we receive a data packet, + // the handshake is now complete. + if (peer->CheckSwitchToNextKey(keypair)) { + if (procdel_) { + procdel_->OnConnected(ReadBE32(tun_addr_.addr)); + } + peer->OnHandshakeFullyComplete(); + SendQueuedPackets(peer); + } + + // Refresh when current key gets too old + if (peer->curr_keypair_ && peer->curr_keypair_->recv_key_state == WgKeypair::KEY_WANT_REFRESH) { + peer->curr_keypair_->recv_key_state = WgKeypair::KEY_DID_REFRESH; + SendHandshakeInitiationAndResetRetries(peer); + } + + if (data_size == 0) { + peer->OnKeepaliveReceived(); + goto getout; + } + peer->OnDataReceived(); + +#if WITH_HANDSHAKE_EXT + // Unpack the packet headers using ipzip + if (keypair->enabled_features[WG_FEATURE_ID_IPZIP]) { + uint32 rv = IpzipDecompress(data, (uint32)data_size, &keypair->ipzip_state_, IPZIP_RECV_BY_CLIENT); + if (rv == (uint32)-1) + goto getout; // ipzip failed decompress + stats_.compression_hdr_saved_in += (int64)rv - data_size; + data -= (int64)rv - data_size, data_size = rv; + } +#endif // WITH_HANDSHAKE_EXT + + // Verify that the packet is a valid ipv4 or ipv6 packet of proper length, + // with a source address that belongs to the peer. 
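+ // (This is cryptokey routing on the receive side: a decrypted packet whose source address falls outside the sending peer's allowed IPs is dropped.)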
+ WgPeer *peer_from_header; + unsigned int ip_version, size_from_header; + + ip_version = *data >> 4; + if (ip_version == 4) { + if (data_size < IPV4_HEADER_SIZE) { + // too small ipv4 header + goto getout; + } + peer_from_header = (WgPeer*)dev_.ip_to_peer_map().LookupV4(ReadBE32(data + 12)); + size_from_header = ReadBE16(data + 2); + if (size_from_header < IPV4_HEADER_SIZE) { + // too small packet? + goto getout; + } + } else if (ip_version == 6) { + if (data_size < IPV6_HEADER_SIZE) { + // too small ipv6 header + goto getout; + } + peer_from_header = (WgPeer*)dev_.ip_to_peer_map().LookupV6(data + 8); + size_from_header = IPV6_HEADER_SIZE + ReadBE16(data + 4); + } else { + // invalid ip version + goto getout; + } + if (size_from_header > data_size) { + // oversized packet? + goto getout; + } + if (peer_from_header != peer) { + // source address mismatch? + goto getout; + } + //RINFO("Outgoing TUN packet of size %d", (int)size_from_header); + packet->data = data; + packet->size = size_from_header; + + stats_.tun_bytes_out += packet->size; + stats_.tun_packets_out++; + + tun_->WriteTunPacket(packet); + return; + +getout: + FreePacket(packet); + return; +} + +void WireguardProcessor::HandleDataPacket(Packet *packet) { + uint8 *data = packet->data; + size_t data_size = packet->size; + uint32 key_id = ((MessageData*)data)->receiver_id; + uint64 counter = ToLE64((((MessageData*)data)->counter)); + WgKeypair *keypair; + + auto it = dev_.key_id_lookup().find(key_id); + if (it == dev_.key_id_lookup().end() || + (keypair = it->second.second) == NULL || + keypair->recv_key_state == WgKeypair::KEY_INVALID) { +getout: + FreePacket(packet); + return; + } + + if (counter >= REJECT_AFTER_MESSAGES) + goto getout; + + if (!WgKeypairDecryptPayload(data + sizeof(MessageData), data_size - sizeof(MessageData), + NULL, 0, counter, keypair)) { + goto getout; + } + if (!keypair->replay_detector.CheckReplay(counter)) + goto getout; + + WgPeer::CopyEndpointToPeer(keypair, &packet->addr); + HandleAuthenticatedDataPacket(keypair, packet, data + sizeof(MessageData), data_size - sizeof(MessageData) - keypair->auth_tag_length); +} + +static uint64 GetIpForRateLimit(Packet *packet) { + if (packet->addr.sin.sin_family == AF_INET) { + return ReadLE32(&packet->addr.sin.sin_addr); + } else { + return ReadLE64(&packet->addr.sin6.sin6_addr); + } +} + +bool WireguardProcessor::CheckIncomingHandshakeRateLimit(Packet *packet, bool overload) { + WgRateLimit::RateLimitResult rr = dev_.rate_limiter()->CheckRateLimit(GetIpForRateLimit(packet)); + if ((overload && rr.is_rate_limited()) || !dev_.CheckCookieMac1(packet)) { + FreePacket(packet); + return false; + } + if (overload && !rr.is_first_ip() && !dev_.CheckCookieMac2(packet)) { + dev_.rate_limiter()->CommitResult(rr); + dev_.CreateCookieMessage((MessageHandshakeCookie*)packet->data, packet, ((MessageHandshakeInitiation*)packet->data)->sender_key_id); + packet->size = sizeof(MessageHandshakeCookie); + DoWriteUdpPacket(packet); + return false; + } + dev_.rate_limiter()->CommitResult(rr); + return true; +} + +// server receives this when client wants to setup a session +void WireguardProcessor::HandleHandshakeInitiationPacket(Packet *packet) { + WgPeer *peer = WgPeer::ParseMessageHandshakeInitiation(&dev_, packet); + if (!peer) { + FreePacket(packet); + return; + } + peer->OnHandshakeAuthComplete(); + DoWriteUdpPacket(packet); +} + +// client receives this after session is established +void WireguardProcessor::HandleHandshakeResponsePacket(Packet *packet) { + WgPeer *peer = 
WgPeer::ParseMessageHandshakeResponse(&dev_, packet); + if (!peer) { + FreePacket(packet); + return; + } + peer->endpoint_ = packet->addr; + FreePacket(packet); + peer->OnHandshakeAuthComplete(); + peer->OnHandshakeFullyComplete(); + if (procdel_) + procdel_->OnConnected(ReadBE32(tun_addr_.addr)); + SendKeepalive(peer); +} + +void WireguardProcessor::SendKeepalive(WgPeer *peer) { + // can't send keepalive if no endpoint is configured + if (peer->endpoint_.sin.sin_family == 0) + return; + + // If nothing is queued, insert a keepalive packet + if (peer->first_queued_packet_ == NULL) { + Packet *packet = AllocPacket(); + if (!packet) + return; + packet->size = 0; + packet->next = NULL; + peer->first_queued_packet_ = packet; + } + SendQueuedPackets(peer); +} + +void WireguardProcessor::SendQueuedPackets(WgPeer *peer) { + // Steal the packets + Packet *packet = peer->first_queued_packet_; + peer->first_queued_packet_ = NULL; + peer->last_queued_packet_ptr_ = &peer->first_queued_packet_; + peer->num_queued_packets_ = 0; + while (packet) { + Packet *next = packet->next; + WritePacketToUdp(peer, packet); + packet = next; + } +} + +void WireguardProcessor::HandleHandshakeCookiePacket(Packet *packet) { + WgPeer::ParseMessageHandshakeCookie(&dev_, (MessageHandshakeCookie *)packet->data); +} + +void WireguardProcessor::SecondLoop() { + uint64 now = OsGetMilliseconds(); + for (WgPeer *peer = dev_.first_peer(); peer; peer = peer->next_peer_) { + + // Allow ip/port to be remembered again for this keypair + if (peer->curr_keypair_) + peer->curr_keypair_->did_attempt_remember_ip_port = false; + + uint32 mask = peer->CheckTimeouts(now); + if (mask == 0) + continue; + if (mask & WgPeer::ACTION_SEND_KEEPALIVE) + SendKeepalive(peer); + if (mask & WgPeer::ACTION_SEND_HANDSHAKE) + SendHandshakeInitiation(peer); + } + + dev_.SecondLoop(now); +} + diff --git a/wireguard.h b/wireguard.h new file mode 100644 index 0000000..ef050c5 --- /dev/null +++ b/wireguard.h @@ -0,0 +1,133 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
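The short-header receive path in wireguard.cpp above reconstructs the full 64-bit counter from a 1-, 2- or 4-byte truncation by snapping to the value nearest the receiver's expected sequence number. A standalone restatement of the 16-bit case (hypothetical helper name, not in this commit):

  #include <stdint.h>

  // Returns the 64-bit counter closest to `expected` whose low 16 bits equal
  // `wire`; the signed cast wraps the delta into [-32768, 32767].
  static uint64_t ExpandCounter16(uint64_t expected, uint16_t wire) {
    return expected + (int16_t)(wire - (uint16_t)expected);
  }

  // e.g. ExpandCounter16(0x10000, 0xFFFF) == 0xFFFF   (one behind)
  //      ExpandCounter16(0x1FFFF, 0x0002) == 0x20002  (three ahead)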
+#pragma once + +#include "tunsafe_types.h" +#include "wireguard_proto.h" + +struct ProcessorStats { + // Number of bytes sent/received over the physical UDP connections + int64 udp_bytes_in, udp_bytes_out; + int64 udp_packets_in, udp_packets_out; + // Number of bytes sent/received over the TUN interface + int64 tun_bytes_in, tun_bytes_out; + int64 tun_packets_in, tun_packets_out; + uint64 last_complete_handskake_timestamp; + + int64 compression_hdr_saved_in, compression_hdr_saved_out; + + int64 compression_wg_saved_in, compression_wg_saved_out; +}; + +class ProcessorDelegate { +public: + virtual void OnConnected(in_addr_t my_ip) = 0; + virtual void OnDisconnected() = 0; +}; + +enum InternetBlockState { + kBlockInternet_Off, + kBlockInternet_Route, + kBlockInternet_Firewall, + kBlockInternet_Both, + + // An unspecified value that uses either route or firewall + kBlockInternet_DefaultOn = 254, + + kBlockInternet_Default = 255, +}; + +class WireguardProcessor { +public: + WireguardProcessor(UdpInterface *udp, TunInterface *tun, ProcessorDelegate *procdel); + ~WireguardProcessor(); + + void SetListenPort(int listen_port) { + listen_port_ = listen_port; + } + + bool SetTunAddress(const WgCidrAddr &addr); + + bool AddDnsServer(const IpAddr &sin); + + void SetMtu(int mtu) { + if (mtu >= 576 && mtu <= 10000) + mtu_ = mtu; + } + + void SetAddRoutesMode(bool mode) { + add_routes_mode_ = mode; + } + + void SetDnsBlocking(bool dns_blocking) { + dns_blocking_ = dns_blocking; + } + + void SetInternetBlocking(InternetBlockState internet_blocking) { + internet_blocking_ = internet_blocking; + } + + void SetHeaderObfuscation(const char *key) { + dev_.SetHeaderObfuscation(key); + } + + void HandleTunPacket(Packet *packet); + void HandleUdpPacket(Packet *packet, bool overload); + void SecondLoop(); + + ProcessorStats GetStats(); + void ResetStats(); + + bool Start(); + + WgDevice &dev() { return dev_; } + + TunInterface::PrePostCommands &prepost() { return pre_post_; } + +private: + void DoWriteUdpPacket(Packet *packet); + void WritePacketToUdp(WgPeer *peer, Packet *packet); + void SendHandshakeInitiation(WgPeer *peer); + void SendHandshakeInitiationAndResetRetries(WgPeer *peer); + void SendKeepalive(WgPeer *peer); + void SendQueuedPackets(WgPeer *peer); + + void HandleHandshakeInitiationPacket(Packet *packet); + void HandleHandshakeResponsePacket(Packet *packet); + void HandleHandshakeCookiePacket(Packet *packet); + void HandleDataPacket(Packet *packet); + + void HandleAuthenticatedDataPacket(WgKeypair *keypair, Packet *packet, uint8 *data, size_t data_size); + + void HandleShortHeaderFormatPacket(uint32 tag, Packet *packet); + + bool CheckIncomingHandshakeRateLimit(Packet *packet, bool overload); + + bool HandleIcmpv6NeighborSolicitation(const byte *data, size_t data_size); + + void SetupCompressionHeader(WgPacketCompressionVer01 *c); + + int listen_port_; + + ProcessorDelegate *procdel_; + TunInterface *tun_; + UdpInterface *udp_; + int mtu_; + ProcessorStats stats_; + + bool dns_blocking_; + uint8 internet_blocking_; + bool add_routes_mode_; + bool network_discovery_spoofing_; + uint8 network_discovery_mac_[6]; + + WgDevice dev_; + + WgCidrAddr tun_addr_; + WgCidrAddr tun6_addr_; + + IpAddr dns_addr_, dns6_addr_; + + TunInterface::PrePostCommands pre_post_; +}; + diff --git a/wireguard_config.cpp b/wireguard_config.cpp new file mode 100644 index 0000000..3d51f62 --- /dev/null +++ b/wireguard_config.cpp @@ -0,0 +1,444 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . 
All Rights Reserved. +#include "stdafx.h" +#include "wireguard_config.h" +#include "netapi.h" +#include "tunsafe_endian.h" +#include "wireguard.h" +#include "util.h" +#include +#include +#include +#include +#include + +#if defined(OS_POSIX) +#include +#include +#include +#include +#include +#endif + +const char *print_ip_prefix(char buf[kSizeOfAddress], int family, const void *ip, int prefixlen) { + if (!inet_ntop(family, ip, buf, kSizeOfAddress - 8)) { + memcpy(buf, "unknown", 8); + } + if (prefixlen >= 0) + snprintf(buf + strlen(buf), 8, "/%d", prefixlen); + return buf; +} + +struct Addr { + byte addr[4]; + uint8 cidr; +}; + +static bool ParseCidrAddr(char *s, WgCidrAddr *out) { + char *slash = strchr(s, '/'); + if (!slash) + return false; + + *slash = 0; + int e = atoi(slash + 1); + if (e < 0) return false; + + if (inet_pton(AF_INET, s, out->addr) == 1) { + if (e > 32) return false; + out->cidr = e; + out->size = 32; + return true; + } + if (inet_pton(AF_INET6, s, out->addr) == 1) { + if (e > 128) return false; + out->cidr = e; + out->size = 128; + return true; + } + return false; +} + +struct hostent *gethostbyname_retry_on_failure(const char * name, bool *exit_flag) { + int attempt = 0; + static const uint8 retry_delays[] = {1, 2, 3, 5, 10, 20, 40, 60}; + + for (;;) { + hostent *he = gethostbyname(name); + if (he || exit_flag == NULL || *exit_flag) + return he; + + RINFO("Unable to resolve %s. Trying again in %d second(s)", name, retry_delays[attempt]); + OsInterruptibleSleep(retry_delays[attempt] * 1000); + if (*exit_flag) + return NULL; + + if (attempt != ARRAY_SIZE(retry_delays) - 1) + attempt++; + } +} + + +static bool ParseSockaddrInWithPort(char *s, IpAddr *sin, bool *exit_flag) { + memset(sin, 0, sizeof(IpAddr)); + if (*s == '[') { + char *end = strchr(s, ']'); + if (end == NULL) + return false; + *end = 0; + if (inet_pton(AF_INET6, s + 1, &sin->sin6.sin6_addr) != 1) + return false; + char *x = strchr(end + 1, ':'); + if (!x) + return false; + sin->sin.sin_family = AF_INET6; + sin->sin.sin_port = htons(atoi(x + 1)); + return true; + } + char *x = strchr(s, ':'); + if (!x) return false; + *x = 0; + hostent *he = gethostbyname_retry_on_failure(s, exit_flag); + if (!he) { + RERROR("Unable to resolve %s", s); + return false; + } + sin->sin.sin_family = AF_INET; + sin->sin.sin_port = htons(atoi(x + 1)); + memcpy(&sin->sin.sin_addr, he->h_addr_list[0], 4); + return true; +} + +static bool ParseSockaddrInWithoutPort(char *s, IpAddr *sin, bool *exit_flag) { + memset(sin, 0, sizeof(IpAddr)); + if (inet_pton(AF_INET6, s, &sin->sin6.sin6_addr) == 1) { + sin->sin.sin_family = AF_INET6; + return true; + } + hostent *he = gethostbyname_retry_on_failure(s, exit_flag); + if (!he) { + RERROR("Unable to resolve %s", s); + return false; + } + sin->sin.sin_family = AF_INET; + memcpy(&sin->sin.sin_addr, he->h_addr_list[0], 4); + return true; +} + +static bool ParseBase64Key(const char *s, uint8 key[32]) { + size_t size = 32; + return base64_decode((uint8*)s, strlen(s), key, &size) && size == 32; +} + +class WgFileParser { +public: + WgFileParser(WireguardProcessor *wg, bool *exit_flag) : wg_(wg), exit_flag_(exit_flag) {} + bool ParseFlag(const char *group, const char *key, char *value); + WireguardProcessor *wg_; + + void FinishGroup(); + struct Peer { + uint8 pub[32]; + uint8 psk[32]; + }; + Peer pi_; + WgPeer *peer_ = NULL; + bool *exit_flag_; + bool had_interface_ = false; +}; + +bool is_space(uint8_t c) { + return c == ' ' || c == '\r' || c == '\n' || c == '\t'; +} + + +void SplitString(char 
*s, int separator, std::vector<char*> *components) { + for (;;) { + while (is_space(*s)) s++; + char *d = strchr(s, separator); + if (d == NULL) { + if (*s) + components->push_back(s); + return; + } + *d = 0; + char *e = d; + while (e > s && is_space(e[-1])) + *--e = 0; + components->push_back(s); + s = d + 1; + } +} + +static bool ParseBoolean(const char *str, bool *value) { + if (_stricmp(str, "true") == 0 || + _stricmp(str, "yes") == 0 || + _stricmp(str, "1") == 0 || + _stricmp(str, "on") == 0) { + *value = true; + return true; + } + if (_stricmp(str, "false") == 0 || + _stricmp(str, "no") == 0 || + _stricmp(str, "0") == 0 || + _stricmp(str, "off") == 0) { + *value = false; + return true; + } + return false; +} + +static int ParseFeature(const char *str) { + size_t len = strlen(str); + int what = WG_BOOLEAN_FEATURE_WANTS; + if (len > 0) { + if (str[len - 1] == '?') + what = WG_BOOLEAN_FEATURE_SUPPORTS, len--; + else if (str[len - 1] == '!') + what = WG_BOOLEAN_FEATURE_ENFORCES, len--; + } + if (len == 5 && memcmp(str, "mac64", 5) == 0) + return what + WG_FEATURE_ID_SHORT_MAC * 16; + if (len == 12 && memcmp(str, "short_header", 12) == 0) + return what + WG_FEATURE_ID_SHORT_HEADER * 16; + if (len == 5 && memcmp(str, "ipzip", 5) == 0) + return what + WG_FEATURE_ID_IPZIP * 16; + if (len == 10 && memcmp(str, "skip_keyid", 10) == 0) + return what + WG_FEATURE_ID_SKIP_KEYID_IN * 16 + 1 * 4; + if (len == 13 && memcmp(str, "skip_keyid_in", 13) == 0) + return what + WG_FEATURE_ID_SKIP_KEYID_IN * 16; + if (len == 14 && memcmp(str, "skip_keyid_out", 14) == 0) + return what + WG_FEATURE_ID_SKIP_KEYID_OUT * 16; + return -1; +} + +static int ParseCipherSuite(const char *cipher) { + if (!strcmp(cipher, "chacha20-poly1305")) + return EXT_CIPHER_SUITE_CHACHA20POLY1305; + if (!strcmp(cipher, "aes128-gcm")) + return EXT_CIPHER_SUITE_AES128_GCM; + if (!strcmp(cipher, "aes256-gcm")) + return EXT_CIPHER_SUITE_AES256_GCM; + if (!strcmp(cipher, "none")) + return EXT_CIPHER_SUITE_NONE_POLY1305; + return -1; +} + +void WgFileParser::FinishGroup() { + if (peer_) { + peer_->Initialize(pi_.pub, pi_.psk); + peer_ = NULL; + } +} + +bool WgFileParser::ParseFlag(const char *group, const char *key, char *value) { + uint8 binkey[32]; + WgCidrAddr addr; + IpAddr sin; + std::vector<char*> ss; + bool ciphermode = false; + + if (strcmp(group, "[Interface]") == 0) { + if (key == NULL) return true; + if (strcmp(key, "PrivateKey") == 0) { + if (!ParseBase64Key(value, binkey)) + return false; + had_interface_ = true; + wg_->dev().Initialize(binkey); + } else if (strcmp(key, "ListenPort") == 0) { + wg_->SetListenPort(atoi(value)); + } else if (strcmp(key, "Address") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (!ParseCidrAddr(ss[i], &addr)) + return false; + if (!wg_->SetTunAddress(addr)) { + RERROR("Multiple Address not allowed"); + return false; + } + } + } else if (strcmp(key, "MTU") == 0) { + wg_->SetMtu(atoi(value)); + } else if (strcmp(key, "Table") == 0) { + bool mode; + if (!strcmp(value, "off")) { + mode = false; + } else if (!strcmp(value, "auto")) { + mode = true; + } else { + goto err; + } + wg_->SetAddRoutesMode(mode); + } else if (strcmp(key, "DNS") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (!ParseSockaddrInWithoutPort(ss[i], &sin, exit_flag_)) + return false; + if (!wg_->AddDnsServer(sin)) { + RERROR("Multiple DNS not allowed."); + return false; + } + } + } else if (strcmp(key, "BlockDNS") == 0) { + bool v; + if (!ParseBoolean(value, &v)) +
goto err; + wg_->SetDnsBlocking(v); + } else if (strcmp(key, "BlockInternet") == 0) { + uint8 v = kBlockInternet_Default; + + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (strcmp(ss[i], "route") == 0) { + if (v & 128) v = 0; + v |= kBlockInternet_Route; + } else if (strcmp(ss[i], "firewall") == 0) { + if (v & 128) v = 0; + v |= kBlockInternet_Firewall; + } else if (strcmp(ss[i], "off") == 0) + v = 0; + else if (strcmp(ss[i], "on") == 0) + v = kBlockInternet_DefaultOn; + else if (strcmp(ss[i], "default") == 0) + v = kBlockInternet_Default; + else + RERROR("Unknown mode in BlockInternet: %s", ss[i]); + } + + wg_->SetInternetBlocking((InternetBlockState)v); + } else if (strcmp(key, "HeaderObfuscation") == 0) { + wg_->SetHeaderObfuscation(value); + } else if (strcmp(key, "PostUp") == 0) { + wg_->prepost().post_up.emplace_back(value); + } else if (strcmp(key, "PostDown") == 0) { + wg_->prepost().post_down.emplace_back(value); + } else if (strcmp(key, "PreUp") == 0) { + wg_->prepost().pre_up.emplace_back(value); + } else if (strcmp(key, "PreDown") == 0) { + wg_->prepost().pre_down.emplace_back(value); + } else { + goto err; + } + } else if (strcmp(group, "[Peer]") == 0) { + if (key == NULL) { + if (!had_interface_) { + RERROR("Missing [Interface].PrivateKey."); + return false; + } + FinishGroup(); + peer_ = wg_->dev().AddPeer(); + memset(&pi_, 0, sizeof(pi_)); + return true; + } + if (strcmp(key, "PublicKey") == 0) { + if (!ParseBase64Key(value, pi_.pub)) + return false; + } else if (strcmp(key, "PresharedKey") == 0) { + if (!ParseBase64Key(value, pi_.psk)) + return false; + } else if (strcmp(key, "AllowedIPs") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (!ParseCidrAddr(ss[i], &addr)) + return false; + if (!peer_->AddIp(addr)) + return false; + } + } else if (strcmp(key, "Endpoint") == 0) { + if (!ParseSockaddrInWithPort(value, &sin, exit_flag_)) + return false; + peer_->SetEndpoint(sin); + } else if (strcmp(key, "PersistentKeepalive") == 0) { + peer_->SetPersistentKeepalive(atoi(value)); + } else if (strcmp(key, "AllowMulticast") == 0) { + bool b; + if (!ParseBoolean(value, &b)) + return false; + peer_->SetAllowMulticast(b); + } else if (strcmp(key, "Features") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + int v = ParseFeature(ss[i]); + if (v < 0) + return false; + for (;; v += 12) { + peer_->SetFeature(v >> 4, v & 3); + if (!(v & 12)) + break; + } + } + } else if (strcmp(key, "Ciphers") == 0 || (ciphermode = true, strcmp(key, "Ciphers!") == 0)) { + SplitString(value, ',', &ss); + peer_->SetCipherPrio(ciphermode); + for (size_t i = 0; i < ss.size(); i++) { + int v = ParseCipherSuite(ss[i]); + if (v < 0 || !peer_->AddCipher(v)) + return false; + } + } else { + goto err; + } + } else { +err: + return false; + } + return true; +} + +bool ParseWireGuardConfigFile(WireguardProcessor *wg, const char *filename, bool *exit_flag) { + char buf[1024]; + char group[32] = {0}; + + WgFileParser file_parser(wg, exit_flag); + + RINFO("Loading file: %s", filename); + + FILE *f = fopen(filename, "r"); + if (!f) { + RERROR("Unable to open: %s", filename); + return false; + } + + while (fgets(buf, sizeof(buf), f)) { + size_t l = strlen(buf); + while (l && is_space(buf[l - 1])) + buf[--l] = 0; + if (buf[0] == '#' || buf[0] == '\0') + continue; + + if (buf[0] == '[') { + size_t len = strlen(buf); + if (len < sizeof(group)) { + memcpy(group, buf, len + 1); + if (!file_parser.ParseFlag(group, NULL, 
NULL)) { + RERROR("Error parsing %s", group); + fclose(f); + return false; + } + } + continue; + } + char *sep = strchr(buf, '='); + if (!sep) { + RERROR("Missing = on line: %s", buf); + continue; + } + char *sepe = sep; + while (sepe > buf && is_space(sepe[-1])) + sepe--; + *sepe = 0; + + // trim space after = + sep++; + while (is_space(*sep)) + sep++; + + if (!file_parser.ParseFlag(group, buf, sep)) { + RERROR("Error parsing %s.%s = %s", group, buf, sep); + fclose(f); + return false; + } + } + file_parser.FinishGroup(); + fclose(f); + return true; +} diff --git a/wireguard_config.h b/wireguard_config.h new file mode 100644 index 0000000..03d7899 --- /dev/null +++ b/wireguard_config.h @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#ifndef TINYVPN_TINYVPN_H_ +#define TINYVPN_TINYVPN_H_ + +class WireguardProcessor; + +bool ParseWireGuardConfigFile(WireguardProcessor *wg, const char *filename, bool *exit_flag); + +#define kSizeOfAddress 64 +const char *print_ip_prefix(char buf[kSizeOfAddress], int family, const void *ip, int prefixlen); + + + +#endif // TINYVPN_TINYVPN_H_ diff --git a/wireguard_proto.cpp b/wireguard_proto.cpp new file mode 100644 index 0000000..ad20a53 --- /dev/null +++ b/wireguard_proto.cpp @@ -0,0 +1,1307 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "stdafx.h" +#include "wireguard_proto.h" +#include "crypto/chacha20poly1305.h" +#include "crypto/blake2s.h" +#include "crypto/curve25519-donna.h" +#include "crypto/aesgcm/aes.h" +#include "crypto/siphash.h" +#include "tunsafe_endian.h" +#include "util.h" +#include "crypto_ops.h" +#include "bit_ops.h" +#include "tunsafe_cpu.h" +#include <algorithm> +#include <assert.h> +#include <stdlib.h> +#include <string.h> + +static const uint8 kLabelCookie[] = {'c', 'o', 'o', 'k', 'i', 'e', '-', '-'}; +static const uint8 kLabelMac1[] = {'m', 'a', 'c', '1', '-', '-', '-', '-'}; +static const uint8 kWgInitHash[WG_HASH_LEN] = {0x22,0x11,0xb3,0x61,0x08,0x1a,0xc5,0x66,0x69,0x12,0x43,0xdb,0x45,0x8a,0xd5,0x32,0x2d,0x9c,0x6c,0x66,0x22,0x93,0xe8,0xb7,0x0e,0xe1,0x9c,0x65,0xba,0x07,0x9e,0xf3}; +static const uint8 kWgInitChainingKey[WG_HASH_LEN] = {0x60,0xe2,0x6d,0xae,0xf3,0x27,0xef,0xc0,0x2e,0xc3,0x35,0xe2,0xa0,0x25,0xd2,0xd0,0x16,0xeb,0x42,0x06,0xf8,0x72,0x77,0xf5,0x2d,0x38,0xd1,0x98,0x8b,0x78,0xcd,0x36}; +static const uint8 kCurve25519Basepoint[32] = {9}; + +IpToPeerMap::IpToPeerMap() { + +} + +IpToPeerMap::~IpToPeerMap() { +} + +bool IpToPeerMap::InsertV4(const void *addr, int cidr, void *peer) { + uint32 mask = cidr == 32 ?
0xffffffff : ~(0xffffffff >> cidr); + Entry4 e = {ReadBE32(addr) & mask, mask, peer}; + ipv4_.push_back(e); + return true; +} + +bool IpToPeerMap::InsertV6(const void *addr, int cidr, void *peer) { + Entry6 e; + e.cidr_len = cidr; + e.peer = peer; + memcpy(e.ip, addr, 16); + ipv6_.push_back(e); + return true; +} + +void *IpToPeerMap::LookupV4(uint32 ip) { + uint32 best_mask = 0; + void *best_peer = NULL; + for (auto it = ipv4_.begin(); it != ipv4_.end(); ++it) { + if (it->ip == (ip & it->mask) && it->mask >= best_mask) { + best_mask = it->mask; + best_peer = it->peer; + } + } + return best_peer; +} + +void *IpToPeerMap::LookupV4DefaultPeer() { + for (auto it = ipv4_.begin(); it != ipv4_.end(); ++it) { + if (it->mask == 0) + return it->peer; + } + return NULL; +} + +void *IpToPeerMap::LookupV6DefaultPeer() { + for (auto it = ipv6_.begin(); it != ipv6_.end(); ++it) { + if (it->cidr_len == 0) + return it->peer; + } + return NULL; +} + +static int CalculateIPv6CommonPrefix(const uint8 *a, const uint8 *b) { + uint64 x = ToBE64(*(uint64*)&a[0] ^ *(uint64*)&b[0]); + uint64 y = ToBE64(*(uint64*)&a[8] ^ *(uint64*)&b[8]); + return x ? 64 - FindHighestSetBit64(x) : 128 - FindHighestSetBit64(y); +} + +void *IpToPeerMap::LookupV6(const void *addr) { + int best_len = 0; + void *best_peer = NULL; + for (auto it = ipv6_.begin(); it != ipv6_.end(); ++it) { + int len = CalculateIPv6CommonPrefix((const uint8*)addr, it->ip); + // An entry matches when the common prefix covers its whole CIDR; prefer + // the entry with the longest CIDR, i.e. the most specific match. + if (len >= it->cidr_len && it->cidr_len >= best_len) { + best_len = it->cidr_len; + best_peer = it->peer; + } + } + return best_peer; +} + +void IpToPeerMap::RemovePeer(void *peer) { + { + size_t n = ipv4_.size(); + Entry4 *r = &ipv4_[0], *w = r; + for (size_t i = 0; i != n; i++, r++) { + if (r->peer != peer) + *w++ = *r; + } + ipv4_.resize(w - &ipv4_[0]); + } + { + size_t n = ipv6_.size(); + Entry6 *r = &ipv6_[0], *w = r; + for (size_t i = 0; i != n; i++, r++) { + if (r->peer != peer) + *w++ = *r; + } + ipv6_.resize(w - &ipv6_[0]); + } +} + +ReplayDetector::ReplayDetector() { + expected_seq_nr_ = 0; + memset(bitmap_, 0, sizeof(bitmap_)); +} + +ReplayDetector::~ReplayDetector() { +} + +bool ReplayDetector::CheckReplay(uint64 seq_nr) { + uint64 slot = seq_nr / BITS_PER_ENTRY; + if (seq_nr >= expected_seq_nr_) { + uint64 prev_slot = (expected_seq_nr_ + BITS_PER_ENTRY - 1) / BITS_PER_ENTRY - 1, n; + if ((n = slot - prev_slot) != 0) { + size_t nn = (size_t)std::min<uint64>(n, BITMAP_SIZE); + do { + bitmap_[(prev_slot + nn) & BITMAP_MASK] = 0; + } while (--nn); + } + expected_seq_nr_ = seq_nr + 1; + } else if (seq_nr + WINDOW_SIZE <= expected_seq_nr_) { + return false; + } + uint32 mask = 1 << (seq_nr & (BITS_PER_ENTRY - 1)), prev; + prev = bitmap_[slot & BITMAP_MASK]; + bitmap_[slot & BITMAP_MASK] = prev | mask; + return (prev & mask) == 0; +} + +WgDevice::WgDevice() { + peers_ = NULL; + header_obfuscation_ = false; + next_rng_slot_ = 0; + last_complete_handskake_timestamp_ = 0; + memset(&compression_header_, 0, sizeof(compression_header_)); + + low_resolution_timestamp_ = cookie_secret_timestamp_ = OsGetMilliseconds(); + OsGetRandomBytes(cookie_secret_, sizeof(cookie_secret_)); + OsGetRandomBytes((uint8*)random_number_input_, sizeof(random_number_input_)); + +} + +WgDevice::~WgDevice() { +} + +void WgDevice::SecondLoop(uint64 now) { + low_resolution_timestamp_ = now; + + if (rate_limiter_.is_used()) { + uint32 k[5]; + for (size_t i = 0; i < ARRAY_SIZE(k); i++) + k[i] = GetRandomNumber(); + rate_limiter_.Periodic(k); + } +} + +uint32 WgDevice::InsertInKeyIdLookup(WgPeer *peer, WgKeypair *kp) { + assert(peer); + for
(;;) { + uint32 v = GetRandomNumber(); + if (v == 0) + continue; + std::pair<WgPeer*, WgKeypair*> &peer_and_keypair = key_id_lookup_[v]; + if (peer_and_keypair.first == NULL) { + peer_and_keypair = std::make_pair(peer, kp); + uint32 &x = (kp ? kp->local_key_id : peer->local_key_id_during_hs_); + uint32 old = x; + x = v; + if (old) + key_id_lookup_.erase(old); + return v; + } + } +} + +uint32 WgDevice::GetRandomNumber() { + size_t slot; + if ((slot = next_rng_slot_) == 0) { + blake2s(random_number_output_, sizeof(random_number_output_), random_number_input_, sizeof(random_number_input_), NULL, 0); + random_number_input_[0]++; + slot = BLAKE2S_OUTBYTES / 4; + } + next_rng_slot_ = (uint8) --slot; + return random_number_output_[slot]; +} + +static void BlakeX2(uint8 *dst, size_t dst_size, const uint8 *a, size_t a_size, const uint8 *b, size_t b_size) { + blake2s_state b2s; + blake2s_init(&b2s, dst_size); + blake2s_update(&b2s, a, a_size); + blake2s_update(&b2s, b, b_size); + blake2s_final(&b2s, dst, dst_size); +} + +static inline void BlakeMix(uint8 dst[WG_HASH_LEN], const uint8 *a, size_t a_size) { + BlakeX2(dst, WG_HASH_LEN, dst, WG_HASH_LEN, a, a_size); +} + +static inline void ComputeHKDF2DH(uint8 ci[WG_HASH_LEN], uint8 k[WG_SYMMETRIC_KEY_LEN], const uint8 priv[WG_PUBLIC_KEY_LEN], const uint8 pub[WG_PUBLIC_KEY_LEN]) { + uint8 dh[WG_PUBLIC_KEY_LEN]; + curve25519_donna(dh, priv, pub); + blake2s_hkdf(ci, WG_HASH_LEN, k, WG_SYMMETRIC_KEY_LEN, NULL, 32, dh, sizeof(dh), ci, WG_HASH_LEN); + memzero_crypto(dh, sizeof(dh)); +} + +void WgDevice::Initialize(const uint8 private_key[WG_PUBLIC_KEY_LEN]) { + // Derive the public key from the private key. + memcpy(s_priv_, private_key, sizeof(s_priv_)); + curve25519_donna(s_pub_, s_priv_, kCurve25519Basepoint); + + // Precompute: precomputed_cookie_key_ := HASH(LABEL-COOKIE || Spub_m) + // precomputed_mac1_key_ := HASH(LABEL-MAC1 || Spub_m) + BlakeX2(precomputed_cookie_key_, sizeof(precomputed_cookie_key_), + kLabelCookie, sizeof(kLabelCookie), s_pub_, sizeof(s_pub_)); + BlakeX2(precomputed_mac1_key_, sizeof(precomputed_mac1_key_), + kLabelMac1, sizeof(kLabelMac1), s_pub_, sizeof(s_pub_)); +} + +WgPeer *WgDevice::AddPeer() { + WgPeer *peer = new WgPeer(this); + WgPeer **pp = &peers_; + while (*pp) + pp = &(*pp)->next_peer_; + *pp = peer; + return peer; +} + +WgPeer *WgDevice::GetPeerFromPublicKey(uint8 public_key[WG_PUBLIC_KEY_LEN]) { + for (WgPeer *peer = peers_; peer; peer = peer->next_peer_) { + if (memcmp(peer->s_remote_, public_key, WG_PUBLIC_KEY_LEN) == 0) + return peer; + } + return NULL; +} + +bool WgDevice::CheckCookieMac1(Packet *packet) { + uint8 mac[WG_COOKIE_LEN]; + const uint8 *data = packet->data; + size_t data_size = packet->size; + + blake2s(mac, sizeof(mac), data, data_size - WG_COOKIE_LEN * 2, precomputed_mac1_key_, sizeof(precomputed_mac1_key_)); + return !memcmp_crypto(mac, data + data_size - WG_COOKIE_LEN * 2, WG_COOKIE_LEN); +} + +void WgDevice::MakeCookie(uint8 cookie[WG_COOKIE_LEN], Packet *packet) { + blake2s_state b2s; + uint64 now = OsGetMilliseconds(); + if (now - cookie_secret_timestamp_ >= COOKIE_SECRET_MAX_AGE_MS) { + cookie_secret_timestamp_ = now; + OsGetRandomBytes(cookie_secret_, sizeof(cookie_secret_)); + } + blake2s_init_key(&b2s, WG_COOKIE_LEN, cookie_secret_, sizeof(cookie_secret_)); + if (packet->addr.sin.sin_family == AF_INET) + blake2s_update(&b2s, &packet->addr.sin.sin_addr, 4); + else if (packet->addr.sin.sin_family == AF_INET6) + blake2s_update(&b2s, &packet->addr.sin6.sin6_addr, sizeof(packet->addr.sin6.sin6_addr));
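+ // sockaddr_in and sockaddr_in6 store the port at the same offset, so + // hashing sin6_port below also covers the AF_INET case.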
+ blake2s_update(&b2s, &packet->addr.sin6.sin6_port, 2); + blake2s_final(&b2s, cookie, WG_COOKIE_LEN); +} + +bool WgDevice::CheckCookieMac2(Packet *packet) { + uint8 cookie[WG_COOKIE_LEN]; + uint8 mac[WG_COOKIE_LEN]; + MakeCookie(cookie, packet); + blake2s(mac, sizeof(mac), packet->data, packet->size - WG_COOKIE_LEN, cookie, sizeof(cookie)); + return !memcmp_crypto(mac, packet->data + packet->size - WG_COOKIE_LEN, WG_COOKIE_LEN); +} + +void WgDevice::CreateCookieMessage(MessageHandshakeCookie *dst, Packet *packet, uint32 remote_key_id) { + dst->type = MESSAGE_HANDSHAKE_COOKIE; + dst->receiver_key_id = remote_key_id; + MakeCookie(dst->cookie_enc, packet); + OsGetRandomBytes(dst->nonce, sizeof(dst->nonce)); + MessageMacs *mac = (MessageMacs *)(packet->data + packet->size - sizeof(MessageMacs)); + xchacha20poly1305_encrypt(dst->cookie_enc, dst->cookie_enc, WG_COOKIE_LEN, mac->mac1, WG_COOKIE_LEN, dst->nonce, precomputed_cookie_key_); +} + +void WgDevice::EraseKeypairAddrEntry(WgKeypair *kp) { + WgAddrEntry *ae = kp->addr_entry; + + assert(ae->ref_count >= 1); + assert(ae->ref_count == !!ae->keys[0] + !!ae->keys[1] + !!ae->keys[2]); + assert(ae->keys[kp->addr_entry_slot - 1] == kp); + + kp->addr_entry = NULL; + + ae->keys[kp->addr_entry_slot - 1] = NULL; + kp->addr_entry_slot = 0; + + if (ae->ref_count-- == 1) { + addr_entry_lookup_.erase(ae->addr_entry_id); + delete ae; + } +} + +void WgDevice::UpdateKeypairAddrEntry(uint64 addr_id, WgKeypair *keypair) { + if (keypair->addr_entry != NULL && keypair->addr_entry->addr_entry_id == addr_id) { + keypair->broadcast_short_key = 1; + return; + } + + if (keypair->addr_entry != NULL) + EraseKeypairAddrEntry(keypair); + + WgAddrEntry **aep = &addr_entry_lookup_[addr_id], *ae; + + if ((ae = *aep) == NULL) { + *aep = ae = new WgAddrEntry(addr_id); + } else { + // Ensure we don't insert new things in this addr entry too often. + if (ae->time_of_last_insertion + 1000 * 60 > low_resolution_timestamp_) + return; + } + + ae->time_of_last_insertion = low_resolution_timestamp_; + + // Update slot # + uint32 next_slot = ae->next_slot; + ae->next_slot = (next_slot == 2) ? 
0 : next_slot + 1; + + WgKeypair *old_keypair = ae->keys[next_slot]; + ae->keys[next_slot] = keypair; + keypair->addr_entry = ae; + keypair->addr_entry_slot = next_slot + 1; + if (old_keypair != NULL) { + old_keypair->addr_entry = NULL; + old_keypair->addr_entry_slot = 0; + } else { + ae->ref_count++; + } + assert(ae->ref_count == !!ae->keys[0] + !!ae->keys[1] + !!ae->keys[2]); + + keypair->broadcast_short_key = 1; +} + +//>>> hashlib.sha256('TunSafe Header Obfuscation Key').hexdigest() +//'2444423e33eb5bb875961224c6441f54c5dea95a3a4e1139509ffa6992bdb278' +static const uint8 kHeaderObfuscationKey[32] = {36, 68, 66, 62, 51, 235, 91, 184, 117, 150, 18, 36, 198, 68, 31, 84, 197, 222, 169, 90, 58, 78, 17, 57, 80, 159, 250, 105, 146, 189, 178, 120}; + +void WgDevice::SetHeaderObfuscation(const char *key) { +#if WITH_HEADER_OBFUSCATION + header_obfuscation_ = (key != NULL); + if (key) + blake2s_hmac((uint8*)&header_obfuscation_key_, sizeof(header_obfuscation_key_), (uint8*)key, strlen(key), kHeaderObfuscationKey, sizeof(kHeaderObfuscationKey)); +#endif // WITH_HEADER_OBFUSCATION +} + + +WgPeer::WgPeer(WgDevice *dev) { + dev_ = dev; + endpoint_.sin.sin_family = 0; + next_peer_ = NULL; + curr_keypair_ = next_keypair_ = prev_keypair_ = NULL; + expect_cookie_reply_ = false; + has_mac2_cookie_ = false; + allow_multicast_through_peer_ = false; + supports_handshake_extensions_ = true; + local_key_id_during_hs_ = 0; + last_handshake_init_timestamp_ = -1000000ll; + last_handshake_init_recv_timestamp_ = 0; + last_complete_handskake_timestamp_ = 0; + persistent_keepalive_ms_ = 0; + timers_ = 0; + first_queued_packet_ = NULL; + last_queued_packet_ptr_ = &first_queued_packet_; + num_queued_packets_ = 0; + handshake_attempts_ = 0; + num_ciphers_ = 0; + cipher_prio_ = 0; + memset(last_timestamp_, 0, sizeof(last_timestamp_)); + ipv4_broadcast_addr_ = 0xffffffff; + memset(features_, 0, sizeof(features_)); +} + +WgPeer::~WgPeer() { + ClearKeys(); + ClearHandshake(); + ClearPacketQueue(); +} + +void WgPeer::ClearPacketQueue() { + Packet *packet; + while ((packet = first_queued_packet_) != NULL) { + first_queued_packet_ = packet->next; + FreePacket(packet); + } + last_queued_packet_ptr_ = &first_queued_packet_; + num_queued_packets_ = 0; +} + +void WgPeer::Initialize(const uint8 spub[WG_PUBLIC_KEY_LEN], const uint8 preshared_key[WG_SYMMETRIC_KEY_LEN]) { + // Optionally use a preshared key; it defaults to all zeros.
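+ // (A NULL |preshared_key| yields the all-zero key that plain WireGuard + // assumes; a configured key is mixed in at the KDF3 step of the handshake.)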
+ if (preshared_key) + memcpy(preshared_key_, preshared_key, sizeof(preshared_key_)); + else + memset(preshared_key_, 0, sizeof(preshared_key_)); + // Precompute: s_priv_pub_ := DH(spriv_local, spub_remote) + memcpy(s_remote_, spub, sizeof(s_remote_)); + curve25519_donna(s_priv_pub_, dev_->s_priv_, s_remote_); + // Precompute: precomputed_cookie_key_ := HASH(LABEL-COOKIE || Spub_m) + // precomputed_mac1_key_ := HASH(LABEL-MAC1 || Spub_m) + BlakeX2(precomputed_cookie_key_, sizeof(precomputed_cookie_key_), + kLabelCookie, sizeof(kLabelCookie), spub, WG_PUBLIC_KEY_LEN); + BlakeX2(precomputed_mac1_key_, sizeof(precomputed_mac1_key_), + kLabelMac1, sizeof(kLabelMac1), spub, WG_PUBLIC_KEY_LEN); +} + +// Run on the client (the initiator). +void WgPeer::CreateMessageHandshakeInitiation(Packet *packet) { + uint8 k[WG_SYMMETRIC_KEY_LEN]; + MessageHandshakeInitiation *dst = (MessageHandshakeInitiation *)packet->data; + + // Ci := HASH(CONSTRUCTION) + memcpy(hs_.ci, kWgInitChainingKey, sizeof(hs_.ci)); + // Hi := HASH(Ci || IDENTIFIER) + memcpy(hs_.hi, kWgInitHash, sizeof(hs_.hi)); + // Hi := HASH(Hi || Spub_r) + BlakeMix(hs_.hi, s_remote_, sizeof(s_remote_)); + // (Epriv_i, Epub_i) := DH-GENERATE() + // msg.ephemeral = Epub_i + OsGetRandomBytes(hs_.e_priv, sizeof(hs_.e_priv)); + curve25519_normalize(hs_.e_priv); + curve25519_donna(dst->ephemeral, hs_.e_priv, kCurve25519Basepoint); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(hs_.ci, sizeof(hs_.ci), NULL, 32, NULL, 32, dst->ephemeral, sizeof(dst->ephemeral), hs_.ci, WG_HASH_LEN); + // Hi := HASH(Hi || msg.ephemeral) + BlakeMix(hs_.hi, dst->ephemeral, sizeof(dst->ephemeral)); + // (Ci, K) := KDF2(Ci, DH(epriv, spub_r)) + ComputeHKDF2DH(hs_.ci, k, hs_.e_priv, s_remote_); + // msg.static = AEAD(K, 0, Spub_i, Hi) + chacha20poly1305_encrypt(dst->static_enc, dev_->s_pub_, sizeof(dev_->s_pub_), hs_.hi, sizeof(hs_.hi), 0, k); + // Hi := HASH(Hi || msg.static) + BlakeMix(hs_.hi, dst->static_enc, sizeof(dst->static_enc)); + // (Ci, K) := KDF2(Ci, DH(spriv_i, spub_r)) + blake2s_hkdf(hs_.ci, sizeof(hs_.ci), k, sizeof(k), NULL, 32, s_priv_pub_, sizeof(s_priv_pub_), hs_.ci, WG_HASH_LEN); + // TAI64N + OsGetTimestampTAI64N(dst->timestamp_enc); + + size_t extfield_size = 0; +#if WITH_HANDSHAKE_EXT + if (supports_handshake_extensions_) + extfield_size = WriteHandshakeExtension(dst->timestamp_enc + WG_TIMESTAMP_LEN, NULL); +#endif // WITH_HANDSHAKE_EXT + // msg.timestamp := AEAD(K, 0, timestamp, Hi) + chacha20poly1305_encrypt(dst->timestamp_enc, dst->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN, hs_.hi, sizeof(hs_.hi), 0, k); + // Hi := HASH(Hi || msg.timestamp) + BlakeMix(hs_.hi, dst->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN + WG_MAC_LEN); + + packet->size = (unsigned)(sizeof(MessageHandshakeInitiation) + extfield_size); + + // Insert a pointer to this object in the key id table. + dst->sender_key_id = dev_->InsertInKeyIdLookup(this, NULL); + dst->type = MESSAGE_HANDSHAKE_INITIATION; + memzero_crypto(k, sizeof(k)); + WriteMacToPacket((uint8*)dst, (MessageMacs*)((uint8*)&dst->mac + extfield_size)); +} + +// Parsed by the server (the responder). +WgPeer *WgPeer::ParseMessageHandshakeInitiation(WgDevice *dev, Packet *packet) { + // Copy values into handshake once we've validated it all.
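+ // Rough flow: rebuild (Ci, Hi) from the initiator's message, decrypt the + // static key to find the peer, reject replayed timestamps and handshake + // floods, then construct the response in place over |src| and derive the + // transport keypair.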
+ uint8 ci[WG_HASH_LEN]; + uint8 hi[WG_HASH_LEN]; + union { + uint8 k[WG_SYMMETRIC_KEY_LEN]; + uint8 e_priv[WG_PUBLIC_KEY_LEN]; + }; + union { + uint8 spubi[WG_PUBLIC_KEY_LEN]; + uint8 e_remote[WG_PUBLIC_KEY_LEN]; + uint8 hi2[WG_HASH_LEN]; + }; + uint8 t[WG_HASH_LEN]; + WgPeer *peer; + WgKeypair *keypair; + uint32 remote_key_id; + uint64 now; + uint8 extbuf[MAX_SIZE_OF_HANDSHAKE_EXTENSION + WG_TIMESTAMP_LEN]; + MessageHandshakeInitiation *src = (MessageHandshakeInitiation *)packet->data; + MessageHandshakeResponse *dst; + size_t extfield_size; + + // Ci := HASH(CONSTRUCTION) + memcpy(ci, kWgInitChainingKey, sizeof(ci)); + // Hi := HASH(Ci || IDENTIFIER) + memcpy(hi, kWgInitHash, sizeof(hi)); + // Hi := HASH(Hi || Spub_r) + BlakeMix(hi, dev->s_pub_, sizeof(dev->s_pub_)); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(ci, sizeof(ci), NULL, 32, NULL, 32, src->ephemeral, sizeof(src->ephemeral), ci, WG_HASH_LEN); + // Hi := HASH(Hi || msg.ephemeral) + BlakeMix(hi, src->ephemeral, sizeof(src->ephemeral)); + // (Ci, K) := KDF2(Ci, DH(spriv, msg.ephemeral)) + ComputeHKDF2DH(ci, k, dev->s_priv_, src->ephemeral); + // Spub_i = AEAD_DEC(K, 0, msg.static, Hi) + if (!chacha20poly1305_decrypt(spubi, src->static_enc, sizeof(src->static_enc), hi, sizeof(hi), 0, k)) + goto getout; + // Hi := HASH(Hi || msg.static) + BlakeMix(hi, src->static_enc, sizeof(src->static_enc)); + // Lookup the peer with this ID + if (!(peer = dev->GetPeerFromPublicKey(spubi))) + goto getout; + // (Ci, K) := KDF2(Ci, DH(sprivr, spubi)) + blake2s_hkdf(ci, sizeof(ci), k, sizeof(k), NULL, 32, peer->s_priv_pub_, sizeof(peer->s_priv_pub_), ci, WG_HASH_LEN); + // Hi2 := Hi + memcpy(hi2, hi, sizeof(hi2)); + extfield_size = packet->size - sizeof(MessageHandshakeInitiation); + if (extfield_size > MAX_SIZE_OF_HANDSHAKE_EXTENSION || (extfield_size && !peer->supports_handshake_extensions_)) + goto getout; + // Hi := HASH(Hi || msg.timestamp) + BlakeMix(hi, src->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN + WG_MAC_LEN); + // TIMESTAMP := AEAD_DEC(K, 0, msg.timestamp, hi2) + if (!chacha20poly1305_decrypt(extbuf, src->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN + WG_MAC_LEN, hi2, sizeof(hi2), 0, k)) + goto getout; + // Replay attack? + if (memcmp(extbuf, peer->last_timestamp_, WG_TIMESTAMP_LEN) <= 0) + goto getout; + // Flood attack? 
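+ // (Accept at most one handshake initiation per MIN_HANDSHAKE_INTERVAL_MS + // from a given peer.)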
+ now = OsGetMilliseconds(); + if (now < peer->last_handshake_init_recv_timestamp_ + MIN_HANDSHAKE_INTERVAL_MS) + goto getout; + + // Remember all the information we need to produce a response because we cannot touch |src| again + peer->last_handshake_init_recv_timestamp_ = now; + memcpy(peer->last_timestamp_, extbuf, sizeof(peer->last_timestamp_)); + + memcpy(e_remote, src->ephemeral, sizeof(e_remote)); + remote_key_id = src->sender_key_id; + + dst = (MessageHandshakeResponse *)src; + + // (Epriv_r, Epub_r) := DH-GENERATE() + // msg.ephemeral = Epub_r + OsGetRandomBytes(e_priv, sizeof(e_priv)); + curve25519_normalize(e_priv); + curve25519_donna(dst->ephemeral, e_priv, kCurve25519Basepoint); + // Hr := HASH(Hr || msg.ephemeral) + BlakeMix(hi, dst->ephemeral, sizeof(dst->ephemeral)); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(ci, sizeof(ci), NULL, 32, NULL, 32, dst->ephemeral, sizeof(dst->ephemeral), ci, WG_HASH_LEN); + // Ci := KDF2(Ci, DH(epriv, epub)) + ComputeHKDF2DH(ci, NULL, e_priv, e_remote); + // Ci := KDF2(Ci, DH(epriv, spub)) + ComputeHKDF2DH(ci, NULL, e_priv, peer->s_remote_); + // (Ci, T, K) := KDF3(Ci, Q) + blake2s_hkdf(ci, sizeof(ci), t, sizeof(t), k, sizeof(k), peer->preshared_key_, sizeof(preshared_key_), ci, WG_HASH_LEN); + // Hr := HASH(Hr || T) + BlakeMix(hi, t, sizeof(t)); + + dst->receiver_key_id = remote_key_id; + keypair = peer->CreateNewKeypair(false, ci, remote_key_id, extbuf + WG_TIMESTAMP_LEN, extfield_size); + if (keypair) { + peer->InsertKeypairInPeer(keypair); + dst->sender_key_id = dev->InsertInKeyIdLookup(peer, keypair); + + size_t extfield_out_size = 0; +#if WITH_HANDSHAKE_EXT + if (extfield_size) + extfield_out_size = peer->WriteHandshakeExtension(dst->empty_enc, keypair); +#endif // WITH_HANDSHAKE_EXT + packet->size = (unsigned)(sizeof(MessageHandshakeResponse) + extfield_out_size); + + // msg.empty := AEAD(K, 0, "", Hr) + chacha20poly1305_encrypt(dst->empty_enc, dst->empty_enc, extfield_out_size, hi, sizeof(hi), 0, k); + // Hr := HASH(Hr || "") + //BlakeMix(hi, dst->empty_enc, extfield_out_size); + + dst->type = MESSAGE_HANDSHAKE_RESPONSE; + peer->WriteMacToPacket((uint8*)dst, (MessageMacs*)((uint8*)&dst->mac + extfield_out_size)); + } else { +getout: + peer = NULL; + } + memzero_crypto(hi, sizeof(hi)); + memzero_crypto(ci, sizeof(ci)); + memzero_crypto(k, sizeof(k)); + memzero_crypto(t, sizeof(t)); + return peer; +} + +WgPeer *WgPeer::ParseMessageHandshakeResponse(WgDevice *dev, const Packet *packet) { + MessageHandshakeResponse *src = (MessageHandshakeResponse *)packet->data; + uint8 t[WG_HASH_LEN]; + uint8 k[WG_SYMMETRIC_KEY_LEN]; + WgKeypair *keypair; + auto it = dev->key_id_lookup().find(src->receiver_key_id); + if (it == dev->key_id_lookup().end() || it->second.second != NULL) + return NULL; + WgPeer *peer = it->second.first; + + assert(src->receiver_key_id == peer->local_key_id_during_hs_); + + HandshakeState hs = peer->hs_; + // Hr := HASH(Hr || msg.ephemeral) + BlakeMix(hs.hi, src->ephemeral, sizeof(src->ephemeral)); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(hs.ci, sizeof(hs.ci), NULL, 32, NULL, 32, src->ephemeral, sizeof(src->ephemeral), hs.ci, sizeof(hs.ci)); + // Ci := KDF2(Ci, DH(epriv, epub)) + ComputeHKDF2DH(hs.ci, NULL, hs.e_priv, src->ephemeral); + // Ci := KDF2(Ci, DH(spriv, epub)) + ComputeHKDF2DH(hs.ci, NULL, peer->dev_->s_priv_, src->ephemeral); + // (Ci, T, K) := KDF3(Ci, Q) + blake2s_hkdf(hs.ci, sizeof(hs.ci), t, sizeof(t), k, sizeof(k), peer->preshared_key_, sizeof(peer->preshared_key_), hs.ci, sizeof(hs.ci)); + //
Hr := HASH(Hr || T) + BlakeMix(hs.hi, t, sizeof(t)); + + size_t extfield_size = packet->size - sizeof(MessageHandshakeResponse); + if (extfield_size > MAX_SIZE_OF_HANDSHAKE_EXTENSION) + goto getout; + + // "" := AEAD_DEC(K, 0, msg.empty, Hr) + if (!chacha20poly1305_decrypt(src->empty_enc, src->empty_enc, extfield_size + sizeof(src->empty_enc), hs.hi, sizeof(hs.hi), 0, k)) + goto getout; + + keypair = peer->CreateNewKeypair(true, hs.ci, src->sender_key_id, src->empty_enc, extfield_size); + if (!keypair) + goto getout; + + peer->InsertKeypairInPeer(keypair); + + // Re-map the entry in the id table so it points at this keypair instead. + keypair->local_key_id = peer->local_key_id_during_hs_; + peer->local_key_id_during_hs_ = 0; + it->second.second = keypair; + + if (0) { +getout: + peer = NULL; + } + memzero_crypto(t, sizeof(t)); + memzero_crypto(k, sizeof(k)); + memzero_crypto(&hs, sizeof(hs)); + + return peer; +} + +// This is parsed by the initiator, when it needs to re-send the handshake message with a better mac. +void WgPeer::ParseMessageHandshakeCookie(WgDevice *dev, const MessageHandshakeCookie *src) { + uint8 cookie[WG_COOKIE_LEN]; + auto it = dev->key_id_lookup().find(src->receiver_key_id); + if (it == dev->key_id_lookup().end() || it->second.second != NULL) + return; + WgPeer *peer = it->second.first; + if (!peer->expect_cookie_reply_) + return; + if (!xchacha20poly1305_decrypt(cookie, src->cookie_enc, sizeof(src->cookie_enc), + peer->sent_mac1_, sizeof(peer->sent_mac1_), src->nonce, peer->precomputed_cookie_key_)) + return; + peer->expect_cookie_reply_ = false; + peer->has_mac2_cookie_ = true; + peer->mac2_cookie_timestamp_ = OsGetMilliseconds(); + memcpy(peer->mac2_cookie_, cookie, sizeof(peer->mac2_cookie_)); +} + +#if WITH_HANDSHAKE_EXT + +size_t WgPeer::WriteHandshakeExtension(uint8 *dst, WgKeypair *keypair) { + uint8 *dst_org = dst, value = 0; + // Include the supported features extension + if (!IsOnlyZeros(features_, sizeof(features_))) { + *dst++ = EXT_BOOLEAN_FEATURES; + *dst++ = (WG_FEATURES_COUNT + 3) >> 2; + for (size_t i = 0; i != WG_FEATURES_COUNT; i++) { + if ((i & 3) == 0) + value = 0; + dst[i >> 2] = (value += (features_[i] << ((i * 2) & 7))); + } + // swap WG_FEATURE_ID_SKIP_KEYID_IN and WG_FEATURE_ID_SKIP_KEYID_OUT + dst[1] = (dst[1] & 0xF0) + ((dst[1] >> 2) & 0x03) + ((dst[1] << 2) & 0x0C); + dst += (WG_FEATURES_COUNT + 3) >> 2; + } + // Ordered list of cipher suites + size_t ciphers = num_ciphers_; + if (ciphers) { + *dst++ = EXT_CIPHER_SUITES + cipher_prio_; + if (keypair) { + *dst++ = 1; + *dst++ = keypair->cipher_suite; + } else { + *dst++ = (uint8)ciphers; + memcpy(dst, ciphers_, ciphers); + dst += ciphers; + } + } + if (features_[WG_FEATURE_ID_IPZIP]) { + // Include the packet compression extension + *dst++ = EXT_PACKET_COMPRESSION; + *dst++ = sizeof(WgPacketCompressionVer01); + memcpy(dst, &dev_->compression_header_, sizeof(WgPacketCompressionVer01)); + dst += sizeof(WgPacketCompressionVer01); + } + return dst - dst_org; +} + +static bool ResolveBooleanFeatureValue(uint8 other, uint8 self, bool *result) { + // Truth tables indexed by (other * 4 + self), where each side is one of + // OFF/SUPPORTS/WANTS/ENFORCES: bit n of 0xfec0 gives the negotiated on/off + // result, bit n of 0xeff7 whether the combination is compatible + // (ENFORCES against OFF is rejected). + uint8 both = other * 4 + self; + *result = (0xfec0 >> both) & 1; + return (0xeff7 >> both) & 1; +} + +static const uint8 cipher_strengths[EXT_CIPHER_SUITE_COUNT] = {4,2,3,1}; + +static uint32 ResolveCipherSuite(int tie, const uint8 *a, size_t a_size, const uint8 *b, size_t b_size) { + uint32 abits[8] = {0}, bbits[8] = {0}, found_a = 0, found_b = 0; + for (size_t i = 0; i < a_size; i++) + abits[a[i] >> 5] |= 1 << (a[i] & 31); + for (size_t i = 0; i <
b_size; i++) + bbits[b[i] >> 5] |= 1 << (b[i] & 31); + for (size_t i = 0; i < a_size; i++) + if (bbits[a[i] >> 5] & (1 << (a[i] & 31))) { + found_a = a[i]; + break; + } + for (size_t i = 0; i < b_size; i++) + if (abits[b[i] >> 5] & (1 << (b[i] & 31))) { + found_b = b[i]; + break; + } + return (tie > 0 || + (tie == 0 && cipher_strengths[found_a] > cipher_strengths[found_b])) ? found_a : found_b; +} + +void WgKeypairSetupCompressionExtension(WgKeypair *keypair, const WgPacketCompressionVer01 *remotec) { + const WgPacketCompressionVer01 *localc = keypair->peer->dev_->compression_header(); + IpzipState *state = &keypair->ipzip_state_; + + // Use is_initiator as tie-breaker on who's going to be the client side. + int flags_xor = 0; + if ((localc->flags & ~3) + 2 * keypair->is_initiator - 1 <= (remotec->flags & ~3)) + std::swap(localc, remotec), flags_xor = 1; + state->flags_xor = flags_xor; + + memcpy(state->client_addr_v4, localc->ipv4_addr, 4); + memcpy(state->client_addr_v6, localc->ipv6_addr, 16); + state->guess_ttl[0] = localc->ttl; + state->client_addr_v4_subnet_bytes = (localc->flags & 3); + WriteLE32(&state->client_addr_v4_netmask, 0xffffffff >> ((localc->flags & 3) * 8)); + + memcpy(state->server_addr_v4, remotec->ipv4_addr, 4); + memcpy(state->server_addr_v6, remotec->ipv6_addr, 16); + state->guess_ttl[1] = remotec->ttl; + state->server_addr_v4_subnet_bytes = (remotec->flags & 3); + WriteLE32(&state->server_addr_v4_netmask, 0xffffffff >> ((remotec->flags & 3) * 8)); +} +bool WgKeypairParseExtendedHandshake(WgKeypair *keypair, const uint8 *data, size_t data_size) { + bool did_setup_compression = false; + + while (data_size >= 2) { + uint8 type = data[0], size = data[1]; + data += 2, data_size -= 2; + if (size > data_size) + return false; + switch (type) { + case EXT_CIPHER_SUITES_PRIO: + case EXT_CIPHER_SUITES: + // Only this extension's |size| bytes form the cipher list, not the + // rest of the buffer. + keypair->cipher_suite = ResolveCipherSuite(keypair->peer->cipher_prio_ - (type - EXT_CIPHER_SUITES), + keypair->peer->ciphers_, keypair->peer->num_ciphers_, + data, size); + break; + case EXT_BOOLEAN_FEATURES: + for (size_t i = 0, j = std::max<size_t>(WG_FEATURES_COUNT, size * 4); i != j; i++) { + uint8 value = (i < size * 4) ? (data[i >> 2] >> ((i * 2) & 7)) & 3 : 0; + if (i >= WG_FEATURES_COUNT ? (value == WG_BOOLEAN_FEATURE_ENFORCES) : + !ResolveBooleanFeatureValue(value, keypair->peer->features_[i], &keypair->enabled_features[i])) + return false; + } + break; + case EXT_PACKET_COMPRESSION: + if (size == sizeof(WgPacketCompressionVer01)) { + WgPacketCompressionVer01 *c = (WgPacketCompressionVer01*)data; + if (ReadLE16(&c->version) == EXT_PACKET_COMPRESSION_VER) { + WgKeypairSetupCompressionExtension(keypair, c); + did_setup_compression = true; + } + } + break; + } + data += size, data_size -= size; + } + if (data_size != 0) + return false; + + keypair->enabled_features[WG_FEATURE_ID_IPZIP] &= did_setup_compression; + keypair->auth_tag_length = (keypair->enabled_features[WG_FEATURE_ID_SHORT_MAC] ?
8 : CHACHA20POLY1305_AUTHTAGLEN); + +// RINFO("Cipher Suite = %d", keypair->cipher_suite); + + return true; +} + +#endif // WITH_HANDSHAKE_EXT + +void WgPeer::ClearKeys() { + DeleteKeypair(&curr_keypair_); + DeleteKeypair(&next_keypair_); + DeleteKeypair(&prev_keypair_); +} + +void WgPeer::ClearHandshake() { + uint32 v = local_key_id_during_hs_; + if (v != 0) { + local_key_id_during_hs_ = 0; + dev_->key_id_lookup_.erase(v); + } +} + +void WgPeer::DeleteKeypair(WgKeypair **kp) { + WgKeypair *t = *kp; + *kp = NULL; + if (t) { + if (t->addr_entry) + dev_->EraseKeypairAddrEntry(t); + + if (t->local_key_id) + dev_->key_id_lookup_.erase(t->local_key_id); + + if (t->aes_gcm128_context_) + free(t->aes_gcm128_context_); + delete t; + } +} + +WgKeypair *WgPeer::CreateNewKeypair(bool is_initiator, const uint8 chaining_key[WG_HASH_LEN], uint32 remote_key_id, const uint8 *extfield, size_t extfield_size) { + WgKeypair *kp = new WgKeypair; + uint8 *first_key, *second_key; + if (!kp) + return NULL; + memset(kp, 0, offsetof(WgKeypair, replay_detector)); + kp->peer = this; + kp->is_initiator = is_initiator; + kp->remote_key_id = remote_key_id; + kp->auth_tag_length = CHACHA20POLY1305_AUTHTAGLEN; + +#if WITH_HANDSHAKE_EXT + if (!WgKeypairParseExtendedHandshake(kp, extfield, extfield_size)) + goto fail; +#endif // WITH_HANDSHAKE_EXT + + first_key = kp->send_key, second_key = kp->recv_key; + if (!is_initiator) + std::swap(first_key, second_key); + blake2s_hkdf(first_key, sizeof(kp->send_key), second_key, sizeof(kp->recv_key), + kp->auth_tag_length != CHACHA20POLY1305_AUTHTAGLEN ? (uint8*)kp->compress_mac_keys : NULL, 32, NULL, 0, chaining_key, WG_HASH_LEN); + + if (!is_initiator) { + std::swap(kp->compress_mac_keys[0][0], kp->compress_mac_keys[1][0]); + std::swap(kp->compress_mac_keys[0][1], kp->compress_mac_keys[1][1]); + } + +#if WITH_HANDSHAKE_EXT + if (kp->cipher_suite >= EXT_CIPHER_SUITE_AES128_GCM && kp->cipher_suite <= EXT_CIPHER_SUITE_AES256_GCM) { +#if WITH_AESGCM + kp->aes_gcm128_context_ = (AesGcm128StaticContext *)malloc(sizeof(*kp->aes_gcm128_context_) * 2); + if (!kp->aes_gcm128_context_) + goto fail; + int key_size = (kp->cipher_suite == EXT_CIPHER_SUITE_AES128_GCM) ? 128 : 256; + CRYPTO_gcm128_init(&kp->aes_gcm128_context_[0], kp->send_key, key_size); + CRYPTO_gcm128_init(&kp->aes_gcm128_context_[1], kp->recv_key, key_size); +#else + goto fail; +#endif + } +#endif // WITH_HANDSHAKE_EXT + + kp->send_key_state = kp->recv_key_state = WgKeypair::KEY_VALID; + time_of_next_key_event_ = 0; + kp->key_timestamp = OsGetMilliseconds(); + + return kp; + +fail: + delete kp; + return NULL; +} + +void WgPeer::InsertKeypairInPeer(WgKeypair *kp) { + assert(kp->peer == this); + DeleteKeypair(&prev_keypair_); + if (kp->is_initiator) { + // When we're the initiator we got the handshake response, so we can + // use the keypair right away. + if (next_keypair_) { + prev_keypair_ = next_keypair_; + next_keypair_ = NULL; + DeleteKeypair(&curr_keypair_); + } else { + prev_keypair_ = curr_keypair_; + } + curr_keypair_ = kp; + } else { + // The keypair will be moved to curr when we get the first data packet.
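+ // Example rotation as responder: the fresh keypair waits in |next_keypair_| + // until the initiator's first data packet arrives; CheckSwitchToNextKey() + // then promotes it to |curr_keypair_| and demotes the old current key to + // |prev_keypair_| so in-flight packets still decrypt.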
+ DeleteKeypair(&next_keypair_); + next_keypair_ = kp; + } +} + +bool WgPeer::CheckSwitchToNextKey(WgKeypair *keypair) { + if (keypair != next_keypair_) + return false; + DeleteKeypair(&prev_keypair_); + prev_keypair_ = curr_keypair_; + curr_keypair_ = next_keypair_; + next_keypair_ = NULL; + time_of_next_key_event_ = 0; + return true; +} + +bool WgPeer::CheckHandshakeRateLimit() { + uint64 now = OsGetMilliseconds(); + if (now - last_handshake_init_timestamp_ < REKEY_TIMEOUT_MS) + return false; + last_handshake_init_timestamp_ = now; + return true; +} + +void WgPeer::WriteMacToPacket(const uint8 *data, MessageMacs *dst) { + expect_cookie_reply_ = true; + blake2s(dst->mac1, sizeof(dst->mac1), data, (uint8*)dst->mac1 - data, precomputed_mac1_key_, sizeof(precomputed_mac1_key_)); + memcpy(sent_mac1_, dst->mac1, sizeof(sent_mac1_)); + if (has_mac2_cookie_ && OsGetMilliseconds() - mac2_cookie_timestamp_ < COOKIE_SECRET_MAX_AGE_MS - COOKIE_SECRET_LATENCY_MS) { + blake2s(dst->mac2, sizeof(dst->mac2), data, (uint8*)dst->mac2 - data, mac2_cookie_, sizeof(mac2_cookie_)); + } else { + has_mac2_cookie_ = false; + + if (dev_->header_obfuscation_) { + // When obfuscation is enabled, fill mac2 with random bytes instead of zeros. + for (size_t i = 0; i < 4; i++) + ((uint32*)dst->mac2)[i] = dev_->GetRandomNumber(); + } else { + memset(dst->mac2, 0, sizeof(dst->mac2)); + } + } +} + +enum { + // Timer for retransmitting the handshake if we don't hear back after REKEY_TIMEOUT_MS + TIMER_RETRANSMIT_HANDSHAKE = 0, + // Timer for sending a keepalive if we received a packet but don't send anything else for KEEPALIVE_TIMEOUT_MS + TIMER_SEND_KEEPALIVE = 1, + // Timer for initiating a new handshake if we have sent a packet but have not received one for KEEPALIVE_TIMEOUT_MS + REKEY_TIMEOUT_MS + TIMER_NEW_HANDSHAKE = 2, + // Timer for zeroing out all keys and handshake state after (REJECT_AFTER_TIME_MS * 3) if no new keys have been received + TIMER_ZERO_KEYS = 3, + // Timer for sending a keepalive packet every PERSISTENT_KEEPALIVE_MS + TIMER_PERSISTENT_KEEPALIVE = 4, +}; + +// Each timer has an 'armed' bit (bit x) and a 'newly set' bit (bit x + 5); +// 33 == (1 | 32) touches both at once. +#define WgClearTimer(x) (timers_ &= ~(33 << (x))) +#define WgIsTimerActive(x) (timers_ & (33 << (x))) +#define WgSetTimer(x) (timers_ |= (32 << (x))) + +void WgPeer::OnDataSent() { + WgClearTimer(TIMER_SEND_KEEPALIVE); + if (!WgIsTimerActive(TIMER_NEW_HANDSHAKE)) + WgSetTimer(TIMER_NEW_HANDSHAKE); + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnKeepaliveSent() { + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnDataReceived() { + WgClearTimer(TIMER_NEW_HANDSHAKE); + if (!WgIsTimerActive(TIMER_SEND_KEEPALIVE)) + WgSetTimer(TIMER_SEND_KEEPALIVE); + else + pending_keepalive_ = true; + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnKeepaliveReceived() { + WgClearTimer(TIMER_NEW_HANDSHAKE); + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnHandshakeInitSent() { + WgClearTimer(TIMER_SEND_KEEPALIVE); + WgSetTimer(TIMER_RETRANSMIT_HANDSHAKE); +} + +void WgPeer::OnHandshakeAuthComplete() { + WgClearTimer(TIMER_NEW_HANDSHAKE); + WgSetTimer(TIMER_ZERO_KEYS); + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +static const char * const kCipherSuites[] = { + "chacha20-poly1305", + "aes128-gcm", + "aes256-gcm", + "none" +}; + +void WgPeer::OnHandshakeFullyComplete() { + WgClearTimer(TIMER_RETRANSMIT_HANDSHAKE); + handshake_attempts_ = 0; + + if (last_complete_handskake_timestamp_ == 0) { + bool any_feature = false; + for (size_t i = 0; i < WG_FEATURES_COUNT; i++) + any_feature |= curr_keypair_->enabled_features[i]; + if
(curr_keypair_->cipher_suite != 0 || any_feature) { + RINFO("Using %s, %s %s %s %s %s", kCipherSuites[curr_keypair_->cipher_suite], + curr_keypair_->enabled_features[0] ? "short_header" : "", + curr_keypair_->enabled_features[1] ? "mac64" : "", + curr_keypair_->enabled_features[2] ? "ipzip" : "", + curr_keypair_->enabled_features[4] ? "skip_keyid_in" : "", + curr_keypair_->enabled_features[5] ? "skip_keyid_out" : ""); + } + + + } + + last_complete_handskake_timestamp_ = OsGetMilliseconds(); + dev_->last_complete_handskake_timestamp_ = last_complete_handskake_timestamp_; +// RINFO("Connection established."); +} + +// Check if any of the timeouts have expired +uint32 WgPeer::CheckTimeouts(uint64 now) { + uint32 t, rv = 0; + + if (now >= time_of_next_key_event_) + CheckAndUpdateTimeOfNextKeyEvent(now); + + if ((t = timers_) == 0) + return 0; + uint32 now32 = (uint32)now; + // Got any new timers? + if (t & (0x1f << 5)) { + if (t & (1 << (5+0))) timer_value_[0] = now32; + if (t & (1 << (5+1))) timer_value_[1] = now32; + if (t & (1 << (5+2))) timer_value_[2] = now32; + if (t & (1 << (5+3))) timer_value_[3] = now32; + if (t & (1 << (5+4))) timer_value_[4] = now32; + t |= (t >> 5); + t &= 0x1F; + } + // Got any expired timers? + if (t & 0x1F) { + if ((t & (1 << TIMER_RETRANSMIT_HANDSHAKE)) && (now32 - timer_value_[TIMER_RETRANSMIT_HANDSHAKE]) >= REKEY_TIMEOUT_MS) { + t ^= (1 << TIMER_RETRANSMIT_HANDSHAKE); + if (handshake_attempts_ > MAX_HANDSHAKE_ATTEMPTS) { + RINFO("Too many handshake attempts. Stopping."); + t &= ~(1 << TIMER_SEND_KEEPALIVE); + ClearPacketQueue(); + } else { + RINFO("Retrying handshake, attempt %d...", handshake_attempts_ + 2); + handshake_attempts_++; + rv |= ACTION_SEND_HANDSHAKE; + } + } + if ((t & (1 << TIMER_SEND_KEEPALIVE)) && (now32 - timer_value_[TIMER_SEND_KEEPALIVE]) >= KEEPALIVE_TIMEOUT_MS) { + t &= ~(1 << TIMER_SEND_KEEPALIVE); + rv |= ACTION_SEND_KEEPALIVE; + if (pending_keepalive_) { + pending_keepalive_ = false; + timer_value_[TIMER_SEND_KEEPALIVE] = now32; + t |= (1 << TIMER_SEND_KEEPALIVE); + } + } + if ((t & (1 << TIMER_PERSISTENT_KEEPALIVE)) && (now32 - timer_value_[TIMER_PERSISTENT_KEEPALIVE]) >= (uint32)persistent_keepalive_ms_) { + t &= ~(1 << TIMER_PERSISTENT_KEEPALIVE); + if (persistent_keepalive_ms_) { + t &= ~(1 << TIMER_SEND_KEEPALIVE); + rv |= ACTION_SEND_KEEPALIVE; + } + } + if ((t & (1 << TIMER_NEW_HANDSHAKE)) && (now32 - timer_value_[TIMER_NEW_HANDSHAKE]) >= KEEPALIVE_TIMEOUT_MS + REKEY_TIMEOUT_MS) { + t &= ~(1 << TIMER_NEW_HANDSHAKE); + handshake_attempts_ = 0; + rv |= ACTION_SEND_HANDSHAKE; + RINFO("Retrying handshake with peer"); + } + if ((t & (1 << TIMER_ZERO_KEYS)) && (now32 - timer_value_[TIMER_ZERO_KEYS]) >= REJECT_AFTER_TIME_MS * 3) { + RINFO("Expiring all keys for peer"); + t &= ~(1 << TIMER_ZERO_KEYS); + ClearKeys(); + ClearHandshake(); + } + } + timers_ = t; + return rv; +} + +// Check all key stuff here to avoid calling possibly expensive timestamp routines in the packet handler +void WgPeer::CheckAndUpdateTimeOfNextKeyEvent(uint64 now) { + uint64 next_time = UINT64_MAX; + uint32 rv = 0; + + if (curr_keypair_ != NULL) { + if (now >= curr_keypair_->key_timestamp + REJECT_AFTER_TIME_MS) { + DeleteKeypair(&curr_keypair_); + } else if (curr_keypair_->is_initiator) { + // if a peer is the initiator of a current secure session, WireGuard will send a handshake initiation + // message to begin a new secure session if, after transmitting a transport data message, the current secure session + // is REKEY_AFTER_TIME_MS old, or if after receiving 
a transport data message, the current secure session is + (REJECT_AFTER_TIME_MS - KEEPALIVE_TIMEOUT_MS - REKEY_TIMEOUT_MS) old and it has not yet acted upon + this event. + if (now >= curr_keypair_->key_timestamp + (REJECT_AFTER_TIME_MS - KEEPALIVE_TIMEOUT_MS - REKEY_TIMEOUT_MS)) { + next_time = curr_keypair_->key_timestamp + REJECT_AFTER_TIME_MS; + if (curr_keypair_->recv_key_state == WgKeypair::KEY_VALID) + curr_keypair_->recv_key_state = WgKeypair::KEY_WANT_REFRESH; + } else if (now >= curr_keypair_->key_timestamp + REKEY_AFTER_TIME_MS) { + next_time = curr_keypair_->key_timestamp + (REJECT_AFTER_TIME_MS - KEEPALIVE_TIMEOUT_MS - REKEY_TIMEOUT_MS); + if (curr_keypair_->send_key_state == WgKeypair::KEY_VALID) + curr_keypair_->send_key_state = WgKeypair::KEY_WANT_REFRESH; + } else { + next_time = curr_keypair_->key_timestamp + REKEY_AFTER_TIME_MS; + } + } else { + next_time = curr_keypair_->key_timestamp + REJECT_AFTER_TIME_MS; + } + } + if (prev_keypair_ != NULL) { + if (now >= prev_keypair_->key_timestamp + REJECT_AFTER_TIME_MS) + DeleteKeypair(&prev_keypair_); + else + next_time = std::min(next_time, prev_keypair_->key_timestamp + REJECT_AFTER_TIME_MS); + } + if (next_keypair_ != NULL) { + if (now >= next_keypair_->key_timestamp + REJECT_AFTER_TIME_MS) + DeleteKeypair(&next_keypair_); + else + next_time = std::min(next_time, next_keypair_->key_timestamp + REJECT_AFTER_TIME_MS); + } + time_of_next_key_event_ = next_time; +} + +void WgPeer::SetEndpoint(const IpAddr &sin) { + endpoint_ = sin; +} + +void WgPeer::SetPersistentKeepalive(int persistent_keepalive_secs) { + if (persistent_keepalive_secs < 10 || persistent_keepalive_secs > 10000) + return; + persistent_keepalive_ms_ = persistent_keepalive_secs * 1000; +} + +bool WgPeer::AddIp(const WgCidrAddr &cidr_addr) { + if (cidr_addr.size == 32) { + if (cidr_addr.cidr > 32) + return false; + dev_->ip_to_peer_map_.InsertV4(cidr_addr.addr, cidr_addr.cidr, this); + allowed_ips_.push_back(cidr_addr); + return true; + } else if (cidr_addr.size == 128) { + if (cidr_addr.cidr > 128) + return false; + dev_->ip_to_peer_map_.InsertV6(cidr_addr.addr, cidr_addr.cidr, this); + allowed_ips_.push_back(cidr_addr); + return true; + } else { + return false; + } +} + +void WgPeer::SetAllowMulticast(bool allow) { + allow_multicast_through_peer_ = allow; +} + +void WgPeer::SetFeature(int feature, uint8 value) { + features_[feature] = value; +} + +bool WgPeer::AddCipher(int cipher) { + if (num_ciphers_ == MAX_CIPHERS) + return false; + + // AES-GCM is silently dropped when the build or CPU lacks support; + // negotiation then falls back to another suite. + if (cipher == EXT_CIPHER_SUITE_AES128_GCM || cipher == EXT_CIPHER_SUITE_AES256_GCM) { +#if !WITH_AESGCM + return true; +#endif // !WITH_AESGCM + if (!X86_PCAP_AES) + return true; + } + + ciphers_[num_ciphers_++] = cipher; + return true; +} + +WgRateLimit::WgRateLimit() { + key1_[0] = key1_[1] = 1; + key2_[0] = key2_[1] = 1; + bin1_ = bins_[0]; + bin2_ = bins_[1]; + rand_ = 0; + rand_xor_ = 0; + packets_per_sec_ = PACKETS_PER_SEC; + used_rate_limit_ = 0; + memset(bins_, 0, sizeof(bins_)); +} + +void WgRateLimit::Periodic(uint32 s[5]) { + unsigned int per_sec = PACKETS_PER_SEC; + if (used_rate_limit_ >= TOTAL_PACKETS_PER_SEC) { + per_sec = PACKETS_PER_SEC * TOTAL_PACKETS_PER_SEC / used_rate_limit_; + if (per_sec < 1) + per_sec = 1; + } + + if ((unsigned)per_sec > packets_per_sec_) + per_sec = (per_sec + packets_per_sec_ + 1) >> 1; + +// if (per_sec != packets_per_sec_) { +// RINFO("Setting pps: %d", per_sec); + packets_per_sec_ = per_sec; +// } + + used_rate_limit_ = 0; + rand_xor_ = s[4]; + key2_[0] = key1_[0]; +
key2_[1] = key1_[1]; + memcpy(key1_, s, sizeof(key1_)); + std::swap(bin1_, bin2_); + memset(bin1_, 0, BINSIZE); +} + +static inline size_t hashit(uint64 ip, const uint64 *key) { + uint64 x = ip * key[0] + rol64(ip, 32) * key[1]; + uint32 a = (uint32)(x + (x >> 32) * 0x85ebca6b); + a -= a >> 16; + a ^= a >> 4; + return a; +} + +WgRateLimit::RateLimitResult WgRateLimit::CheckRateLimit(uint64 ip) { + uint8 *a = &bin1_[hashit(ip, key1_) & (BINSIZE - 1)]; + uint8 *b = &bin2_[hashit(ip, key2_) & (BINSIZE - 1)]; + unsigned int old = std::max<int32>(*a, (int32)(*b - packets_per_sec_)), v = 0; + if (old < PACKET_ACCUM / 2) { + v = 1; + } else if (old < PACKET_ACCUM) { + v = old < ((uint64)rand_ * ((PACKET_ACCUM / 2) + 1) >> 32) + (PACKET_ACCUM / 2); + rand_ = (rand_ * 0x1b873593 + 5) + rand_xor_; + } + RateLimitResult rr = {a, (uint8)(old + v), (uint8)v}; + return rr; +} + +void WgKeypairEncryptPayload(uint8 *dst, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, WgKeypair *keypair) { + if (keypair->cipher_suite == EXT_CIPHER_SUITE_CHACHA20POLY1305) { + chacha20poly1305_encrypt(dst, dst, src_len, ad, ad_len, nonce, keypair->send_key); + } else if (keypair->cipher_suite >= EXT_CIPHER_SUITE_AES128_GCM && keypair->cipher_suite <= EXT_CIPHER_SUITE_AES256_GCM) { +#if WITH_AESGCM + aesgcm_encrypt(dst, dst, src_len, ad, ad_len, nonce, &keypair->aes_gcm128_context_[0]); +#endif // WITH_AESGCM + } else { + poly1305_get_mac(dst, src_len, ad, ad_len, nonce, keypair->send_key, dst + src_len); + } + + // Convert MAC to 8 bytes if that's all we need. + if (keypair->auth_tag_length != WG_MAC_LEN) { + uint8 *mac = dst + src_len; + uint64 rv = siphash_2u64(ReadLE64(mac), ReadLE64(mac + 8), (siphash_key_t*)keypair->compress_mac_keys[0]); + WriteLE64(mac, rv); + } +} + +bool WgKeypairDecryptPayload(uint8 *dst, size_t src_len, + const uint8 *ad, size_t ad_len, + const uint64 nonce, WgKeypair *keypair) { + uint8 mac[16]; + + if (src_len < keypair->auth_tag_length) + return false; + + src_len -= keypair->auth_tag_length; + + if (keypair->cipher_suite == EXT_CIPHER_SUITE_CHACHA20POLY1305) { + chacha20poly1305_decrypt_get_mac(dst, dst, src_len, ad, ad_len, nonce, keypair->recv_key, mac); + } else if (keypair->cipher_suite >= EXT_CIPHER_SUITE_AES128_GCM && keypair->cipher_suite <= EXT_CIPHER_SUITE_AES256_GCM) { +#if WITH_AESGCM + aesgcm_decrypt_get_mac(dst, dst, src_len, ad, ad_len, nonce, &keypair->aes_gcm128_context_[1], mac); +#else // WITH_AESGCM + return false; +#endif // WITH_AESGCM + } else { + poly1305_get_mac(dst, src_len, ad, ad_len, nonce, keypair->recv_key, mac); + } + + if (keypair->auth_tag_length == WG_MAC_LEN) { + return memcmp_crypto(mac, dst + src_len, WG_MAC_LEN) == 0; + } else { + uint64 rv = siphash_2u64(ReadLE64(mac), ReadLE64(mac + 8), (siphash_key_t*)keypair->compress_mac_keys[1]); + WriteLE64(mac, rv); + return memcmp_crypto(mac, dst + src_len, keypair->auth_tag_length) == 0; + } +} diff --git a/wireguard_proto.h b/wireguard_proto.h new file mode 100644 index 0000000..cd66901 --- /dev/null +++ b/wireguard_proto.h @@ -0,0 +1,617 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
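+// Protocol machinery for WireGuard: wire message layouts, protocol timers, +// and the device / peer / keypair state, plus TunSafe's handshake extensions.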
+#pragma once + +#include "tunsafe_types.h" +#include "netapi.h" +#include "tunsafe_config.h" +#include <unordered_map> +#include <vector> + +enum ProtocolTimeouts { + COOKIE_SECRET_MAX_AGE_MS = 120000, + COOKIE_SECRET_LATENCY_MS = 5000, + REKEY_TIMEOUT_MS = 5000, + KEEPALIVE_TIMEOUT_MS = 10000, + REKEY_AFTER_TIME_MS = 120000, + REJECT_AFTER_TIME_MS = 180000, + PERSISTENT_KEEPALIVE_MS = 25000, + MIN_HANDSHAKE_INTERVAL_MS = 20, +}; + +enum ProtocolLimits { + REJECT_AFTER_MESSAGES = UINT64_MAX - 2048, + REKEY_AFTER_MESSAGES = UINT64_MAX - 0xffff, + + MAX_HANDSHAKE_ATTEMPTS = 20, + MAX_QUEUED_PACKETS_PER_PEER = 128, + MESSAGE_MINIMUM_SIZE = 16, + MAX_SIZE_OF_HANDSHAKE_EXTENSION = 1024, +}; + +enum MessageType { + MESSAGE_HANDSHAKE_INITIATION = 1, + MESSAGE_HANDSHAKE_RESPONSE = 2, + MESSAGE_HANDSHAKE_COOKIE = 3, + MESSAGE_DATA = 4, +}; + +enum MessageFieldSizes { + WG_COOKIE_LEN = 16, + WG_COOKIE_NONCE_LEN = 24, + WG_PUBLIC_KEY_LEN = 32, + WG_HASH_LEN = 32, + WG_SYMMETRIC_KEY_LEN = 32, + WG_MAC_LEN = 16, + WG_TIMESTAMP_LEN = 12, + WG_SIPHASH_KEY_LEN = 16, +}; + +enum { + WG_SHORT_HEADER_BIT = 0x80, + WG_SHORT_HEADER_KEY_ID_MASK = 0x60, + WG_SHORT_HEADER_KEY_ID = 0x20, + WG_SHORT_HEADER_ACK = 0x10, + WG_SHORT_HEADER_TYPE_MASK = 0x0F, + WG_SHORT_HEADER_CTR1 = 0x00, + WG_SHORT_HEADER_CTR2 = 0x01, + WG_SHORT_HEADER_CTR4 = 0x02, + + WG_ACK_HEADER_COUNTER_MASK = 0x0C, + WG_ACK_HEADER_COUNTER_NONE = 0x00, + WG_ACK_HEADER_COUNTER_2 = 0x04, + WG_ACK_HEADER_COUNTER_4 = 0x08, + WG_ACK_HEADER_COUNTER_8 = 0x0C, + + WG_ACK_HEADER_KEY_MASK = 3, +}; + + +struct MessageMacs { + uint8 mac1[WG_COOKIE_LEN]; + uint8 mac2[WG_COOKIE_LEN]; +}; +STATIC_ASSERT(sizeof(MessageMacs) == 32, MessageMacs_wrong_size); + +struct MessageHandshakeInitiation { + uint32 type; + uint32 sender_key_id; + uint8 ephemeral[WG_PUBLIC_KEY_LEN]; + uint8 static_enc[WG_PUBLIC_KEY_LEN + WG_MAC_LEN]; + uint8 timestamp_enc[WG_TIMESTAMP_LEN + WG_MAC_LEN]; + MessageMacs mac; +}; +STATIC_ASSERT(sizeof(MessageHandshakeInitiation) == 148, MessageHandshakeInitiation_wrong_size); + +// Format of variable length payload.
+// 1 byte type +// 1 byte length +// <length> bytes of payload + + +struct MessageHandshakeResponse { + uint32 type; + uint32 sender_key_id; + uint32 receiver_key_id; + uint8 ephemeral[WG_PUBLIC_KEY_LEN]; + uint8 empty_enc[WG_MAC_LEN]; + MessageMacs mac; +}; +STATIC_ASSERT(sizeof(MessageHandshakeResponse) == 92, MessageHandshakeResponse_wrong_size); + +struct MessageHandshakeCookie { + uint32 type; + uint32 receiver_key_id; + uint8 nonce[WG_COOKIE_NONCE_LEN]; + uint8 cookie_enc[WG_COOKIE_LEN + WG_MAC_LEN]; +}; +STATIC_ASSERT(sizeof(MessageHandshakeCookie) == 64, MessageHandshakeCookie_wrong_size); + +struct MessageData { + uint32 type; + uint32 receiver_id; + uint64 counter; +}; +STATIC_ASSERT(sizeof(MessageData) == 16, MessageData_wrong_size); + +enum { + EXT_PACKET_COMPRESSION = 0x15, + EXT_PACKET_COMPRESSION_VER = 0x01, + + EXT_BOOLEAN_FEATURES = 0x16, + + EXT_CIPHER_SUITES = 0x18, + EXT_CIPHER_SUITES_PRIO = 0x19, + + // The standard WireGuard ChaCha20-Poly1305 + EXT_CIPHER_SUITE_CHACHA20POLY1305 = 0x00, + // AES GCM 128 bit + EXT_CIPHER_SUITE_AES128_GCM = 0x01, + // AES GCM 256 bit + EXT_CIPHER_SUITE_AES256_GCM = 0x02, + // Same as CHACHA20POLY1305 but without the encryption step + EXT_CIPHER_SUITE_NONE_POLY1305 = 0x03, + + EXT_CIPHER_SUITE_COUNT = 4, + +}; + +enum { + WG_FEATURES_COUNT = 6, + WG_FEATURE_ID_SHORT_HEADER = 0, // Supports short headers + WG_FEATURE_ID_SHORT_MAC = 1, // Supports 8-byte MAC + WG_FEATURE_ID_IPZIP = 2, // Using ipzip + WG_FEATURE_ID_SKIP_KEYID_IN = 4, // Skip keyid for incoming packets + WG_FEATURE_ID_SKIP_KEYID_OUT = 5, // Skip keyid for outgoing packets + // (feature id 3 is currently unassigned) +}; + +enum { + WG_BOOLEAN_FEATURE_OFF = 0x0, + WG_BOOLEAN_FEATURE_SUPPORTS = 0x1, + WG_BOOLEAN_FEATURE_WANTS = 0x2, + WG_BOOLEAN_FEATURE_ENFORCES = 0x3, +}; + +struct WgPacketCompressionVer01 { + uint16 version; // Packet compressor version + uint8 ttl; // Guessed TTL + uint8 flags; // Subnet length and packet direction + uint8 ipv4_addr[4]; // IPV4 address of endpoint + uint8 ipv6_addr[16]; // IPV6 address of endpoint +}; +STATIC_ASSERT(sizeof(WgPacketCompressionVer01) == 24, WgPacketCompressionVer01_wrong_size); + + +struct WgKeypair; +class WgPeer; + +// Maps CIDR addresses to a peer, always returning the longest match +class IpToPeerMap { +public: + IpToPeerMap(); + ~IpToPeerMap(); + + // Inserts an IP address of a given CIDR length into the lookup table, pointing to peer.
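+ // E.g. after InsertV4 of 10.0.0.0/8 -> A and 10.1.0.0/16 -> B, a LookupV4 + // of 10.1.2.3 returns B, the most specific match.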
+ bool InsertV4(const void *addr, int cidr, void *peer); + bool InsertV6(const void *addr, int cidr, void *peer); + + // Lookup the peer matching the IP Address + void *LookupV4(uint32 ip); + void *LookupV6(const void *addr); + + void *LookupV4DefaultPeer(); + void *LookupV6DefaultPeer(); + + // Remove a peer from the table + void RemovePeer(void *peer); +private: + struct Entry4 { + uint32 ip; + uint32 mask; + void *peer; + }; + struct Entry6 { + uint8 ip[16]; + uint8 cidr_len; + void *peer; + }; + std::vector<Entry4> ipv4_; + std::vector<Entry6> ipv6_; +}; + +class WgRateLimit { +public: + WgRateLimit(); + + struct RateLimitResult { + uint8 *value_ptr; + uint8 new_value; + uint8 is_ok; + + bool is_rate_limited() { return !is_ok; } + bool is_first_ip() { return new_value == 1; } + }; + + RateLimitResult CheckRateLimit(uint64 ip); + + void CommitResult(const RateLimitResult &rr) { *rr.value_ptr = rr.new_value; if (used_rate_limit_++ == TOTAL_PACKETS_PER_SEC) packets_per_sec_ = (packets_per_sec_ + 1) >> 1; } + + void Periodic(uint32 s[5]); + + bool is_used() { return used_rate_limit_ != 0 || packets_per_sec_ != PACKETS_PER_SEC; } +private: + uint8 *bin1_, *bin2_; + uint32 rand_, rand_xor_; + uint32 packets_per_sec_, used_rate_limit_; + uint64 key1_[2], key2_[2]; + enum { + BINSIZE = 4096, + PACKETS_PER_SEC = 25, + PACKET_ACCUM = 100, + TOTAL_PACKETS_PER_SEC = 25000, + }; + uint8 bins_[2][BINSIZE]; +}; + +struct WgAddrEntry { + // The id of the addr entry, so we can delete ourselves + uint64 addr_entry_id; + + // Ensure at least 1 minute passes between registrations of new keys + // in this table. This means that each key will have a lifetime of at + // least 3 minutes. + uint64 time_of_last_insertion; + + // This entry gets erased when there's no longer any key pointing at it. + uint8 ref_count; + + // Index of the next slot 0-2 where we'll insert the next key. + uint8 next_slot; + + // The three keys. + WgKeypair *keys[3]; + + WgAddrEntry(uint64 addr_entry_id) : addr_entry_id(addr_entry_id), ref_count(0), next_slot(0) { + keys[0] = keys[1] = keys[2] = NULL; + time_of_last_insertion = 0x123456789123456; + } +}; + +struct ScramblerSiphashKeys { + uint64 keys[4]; +}; + +// Implementation of most business logic of WireGuard +class WgDevice { + friend class WgPeer; + friend class WireguardProcessor; +public: + WgDevice(); + ~WgDevice(); + + // Initialize with the private key, precompute all internal keys etc.
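+ // (The public key is derived from |private_key| with Curve25519, and the + // mac1/cookie keys are BLAKE2s hashes of a fixed label plus that public key.)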
+ void Initialize(const uint8 private_key[WG_PUBLIC_KEY_LEN]); + + WgPeer *AddPeer(); + + // Setup header obfuscation + void SetHeaderObfuscation(const char *key); + + // Check whether Mac1 appears to be valid + bool CheckCookieMac1(Packet *packet); + + // Check whether Mac2 appears to be valid; this also uses + // the remote IP address + bool CheckCookieMac2(Packet *packet); + + void CreateCookieMessage(MessageHandshakeCookie *dst, Packet *packet, uint32 remote_key_id); + + void UpdateKeypairAddrEntry(uint64 addr_id, WgKeypair *keypair); + + IpToPeerMap &ip_to_peer_map() { return ip_to_peer_map_; } + + std::unordered_map<uint32, std::pair<WgPeer*, WgKeypair*> > &key_id_lookup() { return key_id_lookup_; } + + WgPeer *first_peer() { return peers_; } + + uint64 last_complete_handskake_timestamp() const { + return last_complete_handskake_timestamp_; + } + + const uint8 *public_key() const { return s_pub_; } + + void SecondLoop(uint64 now); + + WgRateLimit *rate_limiter() { return &rate_limiter_; } + + std::unordered_map<uint64, WgAddrEntry*> &addr_entry_map() { return addr_entry_lookup_; } + + + WgPacketCompressionVer01 *compression_header() { return &compression_header_; } +private: + // Return the peer matching the |public_key| or NULL + WgPeer *GetPeerFromPublicKey(uint8 public_key[WG_PUBLIC_KEY_LEN]); + // Create a cookie by inspecting the source address of the |packet| + void MakeCookie(uint8 cookie[WG_COOKIE_LEN], Packet *packet); + // Insert a new entry in |key_id_lookup_| + uint32 InsertInKeyIdLookup(WgPeer *peer, WgKeypair *kp); + // Get a random number + uint32 GetRandomNumber(); + + void EraseKeypairAddrEntry(WgKeypair *kp); + + // Maps IP addresses to peers + IpToPeerMap ip_to_peer_map_; + // For enumerating all peers + WgPeer *peers_; + // Mapping from key-id to either an active keypair (if keypair is non-NULL), + // or to a handshake. + std::unordered_map<uint32, std::pair<WgPeer*, WgKeypair*> > key_id_lookup_; + + // Mapping from IPV4 IP/PORT to WgPeer*, so we can find the peer when a key id is + // not explicitly included.
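+ // Each WgAddrEntry holds up to the three most recent keypairs negotiated + // from that address; UpdateKeypairAddrEntry() rotates them through |keys|.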
+ std::unordered_map<uint64, WgAddrEntry*> addr_entry_lookup_; + + // Next slot to consume from |random_number_output_|; 0 triggers a refill. + uint8 next_rng_slot_; + + // Whether packet obfuscation is enabled + bool header_obfuscation_; + + uint64 last_complete_handskake_timestamp_; + + uint64 low_resolution_timestamp_; + + uint64 cookie_secret_timestamp_; + uint8 cookie_secret_[WG_HASH_LEN]; + uint8 s_priv_[WG_PUBLIC_KEY_LEN]; + uint8 s_pub_[WG_PUBLIC_KEY_LEN]; + + // Siphash keys for packet scrambling + ScramblerSiphashKeys header_obfuscation_key_; + + uint8 precomputed_cookie_key_[WG_SYMMETRIC_KEY_LEN]; + uint8 precomputed_mac1_key_[WG_SYMMETRIC_KEY_LEN]; + + uint64 random_number_input_[WG_HASH_LEN / 8 + 1]; + uint32 random_number_output_[WG_HASH_LEN / 4]; + + WgRateLimit rate_limiter_; + + WgPacketCompressionVer01 compression_header_; +}; + +// Per-peer state: Noise handshake, negotiated keypairs, timers and allowed IPs. +class WgPeer { + friend class WgDevice; + friend class WireguardProcessor; + friend bool WgKeypairParseExtendedHandshake(WgKeypair *keypair, const uint8 *data, size_t data_size); + friend void WgKeypairSetupCompressionExtension(WgKeypair *keypair, const WgPacketCompressionVer01 *remotec); +public: + explicit WgPeer(WgDevice *dev); + ~WgPeer(); + + void Initialize(const uint8 spub[WG_PUBLIC_KEY_LEN], const uint8 preshared_key[WG_SYMMETRIC_KEY_LEN]); + + void SetPersistentKeepalive(int persistent_keepalive_secs); + void SetEndpoint(const IpAddr &sin); + void SetAllowMulticast(bool allow); + + void SetFeature(int feature, uint8 value); + bool AddCipher(int cipher); + void SetCipherPrio(bool prio) { cipher_prio_ = prio; } + bool AddIp(const WgCidrAddr &cidr_addr); + + static WgPeer *ParseMessageHandshakeInitiation(WgDevice *dev, Packet *packet); + static WgPeer *ParseMessageHandshakeResponse(WgDevice *dev, const Packet *packet); + static void ParseMessageHandshakeCookie(WgDevice *dev, const MessageHandshakeCookie *src); + void CreateMessageHandshakeInitiation(Packet *packet); + bool CheckSwitchToNextKey(WgKeypair *keypair); + void ClearKeys(); + void ClearHandshake(); + void ClearPacketQueue(); + bool CheckHandshakeRateLimit(); + + // Timer notifications + void OnDataSent(); + void OnKeepaliveSent(); + void OnDataReceived(); + void OnKeepaliveReceived(); + void OnHandshakeInitSent(); + void OnHandshakeAuthComplete(); + void OnHandshakeFullyComplete(); + + enum { + ACTION_SEND_KEEPALIVE = 1, + ACTION_SEND_HANDSHAKE = 2, + }; + uint32 CheckTimeouts(uint64 now); + +private: + WgKeypair *CreateNewKeypair(bool is_initiator, const uint8 key[WG_HASH_LEN], uint32 send_key_id, const uint8 *extfield, size_t extfield_size); + void WriteMacToPacket(const uint8 *data, MessageMacs *mac); + void DeleteKeypair(WgKeypair **kp); + void CheckAndUpdateTimeOfNextKeyEvent(uint64 now); + static void CopyEndpointToPeer(WgKeypair *keypair, const IpAddr *addr); + size_t WriteHandshakeExtension(uint8 *dst, WgKeypair *keypair); + void InsertKeypairInPeer(WgKeypair *keypair); + + WgDevice *dev_; + WgPeer *next_peer_; + + // Keypairs: |curr_keypair_| is the active one; |prev_keypair_| and + // |next_keypair_| hold the previous and the pending one. + WgKeypair *curr_keypair_; + WgKeypair *prev_keypair_; + WgKeypair *next_keypair_; + + // Timestamp when the next key related event is going to occur.
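+ // (0 forces a re-evaluation on the next CheckTimeouts() call; UINT64_MAX + // means no key event is scheduled.)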
+
+  // Timestamp when the next key related event is going to occur.
+  uint64 time_of_next_key_event_;
+
+  // For timer management
+  uint32 timers_;
+  uint32 timer_value_[5];
+
+  // Holds our entry in the key id table during the handshake
+  uint32 local_key_id_during_hs_;
+  IpAddr endpoint_;
+
+  // The broadcast address of the IPv4 network, used to block broadcast traffic
+  // from being sent out over the VPN link.
+  uint32 ipv4_broadcast_addr_;
+
+  bool supports_handshake_extensions_;
+
+  bool pending_keepalive_;
+  bool expect_cookie_reply_;
+
+  // Whether we want to route incoming multicast/broadcast traffic to this peer.
+  bool allow_multicast_through_peer_;
+
+  // Whether |mac2_cookie_| currently holds a valid cookie.
+  bool has_mac2_cookie_;
+
+  // Number of handshake attempts made so far; when this gets too high we stop connecting.
+  uint8 handshake_attempts_;
+
+  // Which features are enabled for this peer?
+  uint8 features_[WG_FEATURES_COUNT];
+
+  // Queue of packets that will get sent once the handshake finishes
+  uint8 num_queued_packets_;
+  Packet *first_queued_packet_, **last_queued_packet_ptr_;
+
+  uint64 last_handshake_init_timestamp_;
+  uint64 last_complete_handshake_timestamp_;
+  uint64 last_handshake_init_recv_timestamp_;
+
+  enum { MAX_CIPHERS = 16 };
+  uint8 cipher_prio_;
+  uint8 num_ciphers_;
+  uint8 ciphers_[MAX_CIPHERS];
+
+  // Handshake state that gets set up in |CreateMessageHandshakeInitiation| and
+  // is used when processing the response.
+  struct HandshakeState {
+    // Hash
+    uint8 hi[WG_HASH_LEN];
+    // Chaining key
+    uint8 ci[WG_HASH_LEN];
+    // Private ephemeral
+    uint8 e_priv[WG_PUBLIC_KEY_LEN];
+  };
+  HandshakeState hs_;
+  // Remote's static public key - written only by Initialize
+  uint8 s_remote_[WG_PUBLIC_KEY_LEN];
+  // Remote's preshared key - written only by Initialize
+  uint8 preshared_key_[WG_SYMMETRIC_KEY_LEN];
+  // Precomputed DH(spriv_local, spub_remote).
+  uint8 s_priv_pub_[WG_PUBLIC_KEY_LEN];
+  // The most recently seen timestamp; only higher timestamps are accepted.
+  uint8 last_timestamp_[WG_TIMESTAMP_LEN];
+  // Precomputed key for decrypting cookies from the peer.
+  uint8 precomputed_cookie_key_[WG_SYMMETRIC_KEY_LEN];
+  // Precomputed key for sending MACs to the peer.
+  uint8 precomputed_mac1_key_[WG_SYMMETRIC_KEY_LEN];
+  // The last mac value sent; required to make cookies.
+  uint8 sent_mac1_[WG_COOKIE_LEN];
+  // The mac2 cookie that gets appended to outgoing packets
+  uint8 mac2_cookie_[WG_COOKIE_LEN];
+  // The timestamp of the mac2 cookie
+  uint64 mac2_cookie_timestamp_;
+  int persistent_keepalive_ms_;
+
+  // Allowed IPs
+  std::vector<WgCidrAddr> allowed_ips_;
+};
+
+// RFC6479 - IPsec Anti-Replay Algorithm without Bit Shifting
+class ReplayDetector {
+public:
+  ReplayDetector();
+  ~ReplayDetector();
+
+  bool CheckReplay(uint64 other);
+  enum {
+    BITS_PER_ENTRY = 32,
+    WINDOW_SIZE = 2048 - BITS_PER_ENTRY,
+    BITMAP_SIZE = WINDOW_SIZE / BITS_PER_ENTRY + 1,
+    BITMAP_MASK = BITMAP_SIZE - 1,
+  };
+
+  uint64 expected_seq_nr() const { return expected_seq_nr_; }
+
+private:
+  uint64 expected_seq_nr_;
+  uint32 bitmap_[BITMAP_SIZE];
+};
+
+struct AesGcm128StaticContext;
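+
+// Illustrative sketch (not part of the original header): RFC6479 tracks a
+// sliding window of recently seen counters in a circular bitmap, so duplicates
+// can be rejected without ever shifting bits. A receive path might consult the
+// detector roughly like this, assuming CheckReplay() returns true when
+// |counter| is fresh and marks it as seen:
+//
+//   ReplayDetector replay;
+//   if (!replay.CheckReplay(counter))
+//     return false;  // duplicate or outside the window: drop the packet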
+
+struct WgKeypair {
+  WgPeer *peer;
+
+  // If the key has an addr entry mapping,
+  // then this points at it.
+  WgAddrEntry *addr_entry;
+  // The slot in the addr entry where the key is registered.
+  uint8 addr_entry_slot;
+
+  enum {
+    KEY_INVALID = 0,
+    KEY_VALID = 1,
+    KEY_WANT_REFRESH = 2,
+    KEY_DID_REFRESH = 3,
+  };
+  // True if we are the initiator of the key exchange
+  bool is_initiator;
+
+  // True if we recently saved the peer's address in our table;
+  // avoids doing it too often.
+  bool did_attempt_remember_ip_port;
+
+  // Which features are enabled
+  bool enabled_features[WG_FEATURES_COUNT];
+
+  // True if we want to notify the peer that it can use a short key.
+  uint8 broadcast_short_key;
+
+  // Index of the short key that we can use for outgoing packets.
+  uint8 can_use_short_key_for_outgoing;
+
+  // Whether the key is valid or needs refresh for receives
+  uint8 recv_key_state;
+  // Whether the key is valid or needs refresh for sends
+  uint8 send_key_state;
+
+  // Length of the authentication tag
+  uint8 auth_tag_length;
+
+  // Cipher suite
+  uint8 cipher_suite;
+
+  // Used so we know when to send out ack packets.
+  uint32 incoming_packet_count;
+
+  // Id of the key in our map
+  uint32 local_key_id;
+  // Id of the key in the peer's map
+  uint32 remote_key_id;
+  // The timestamp of when the key was created, so it can be expired
+  uint64 key_timestamp;
+  // The highest acked send_ctr value
+  uint64 send_ctr_acked;
+  // Counter value for chacha20 for outgoing packets
+  uint64 send_ctr;
+  // The key used for chacha20 encryption
+  uint8 send_key[WG_SYMMETRIC_KEY_LEN];
+  // The key used for chacha20 decryption
+  uint8 recv_key[WG_SYMMETRIC_KEY_LEN];
+
+  // Used when a MAC shorter than 16 bytes is enabled, to compress the HMAC into 64 bits.
+  uint64 compress_mac_keys[2][2];
+
+  AesGcm128StaticContext *aes_gcm128_context_;
+
+  // -- everything up to this point is initialized to zero
+  // For replay detection of incoming packets
+  ReplayDetector replay_detector;
+
+#if WITH_HANDSHAKE_EXT
+  // State for the packet compressor
+  IpzipState ipzip_state_;
+#endif  // WITH_HANDSHAKE_EXT
+};
+
+void WgKeypairEncryptPayload(uint8 *dst, const size_t src_len,
+                             const uint8 *ad, const size_t ad_len,
+                             const uint64 nonce, WgKeypair *keypair);
+
+bool WgKeypairDecryptPayload(uint8 *dst, const size_t src_len,
+                             const uint8 *ad, const size_t ad_len,
+                             const uint64 nonce, WgKeypair *keypair);
+
+bool WgKeypairParseExtendedHandshake(WgKeypair *keypair, const uint8 *data, size_t data_size);
+
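+// Illustrative sketch (not part of the original header): judging by the
+// signatures above, encryption appears to operate in place over |dst|,
+// authenticated with |ad|, with the per-packet nonce drawn from the keypair's
+// send counter. A hypothetical send path might look like:
+//
+//   uint64 nonce = keypair->send_ctr++;
+//   WgKeypairEncryptPayload(payload, payload_len, NULL, 0, nonce, keypair);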